Faster & Fewer Page Faults

We have improved the Linux page fault mechanism to reduce the number of faults and handle them more quickly when they do happen.

By managing memory in large folios, we reduce the number of page faults. The 4KiB page used on many architectures is simply too small for the amount of memory we need to manage today. When you take a page fault, the kernel can allocate multiple pages and map them all at the same time.

By managing VMAs in a Maple Tree, we handle page faults more quickly. The Maple Tree is shallower than the red-black tree and uses the CPU cache more effectively. When you take a page fault, the kernel can find the information it needs to handle the page fault more quickly.

These two projects together result in a significant reduction of time spent handling page faults and allow your computer to spend more of its time running user code. No cars were crashed in the execution of this project.

Matthew Wilcox

Kernel Recipes

September 30, 2023

Transcript

  1. Faster & Fewer Page Faults
    Matthew Wilcox
    Technical Advisor
    Oracle Linux Development
    Kernel Recipes
    2023-09-27

  2. Safe harbor statement
    The following is intended to outline our general product direction. It is intended for information
    purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any
    material, code, or functionality, and should not be relied upon in making purchasing decisions. The
    development, release, timing, and pricing of any features or functionality described for Oracle’s
    products may change and remains at the sole discretion of Oracle Corporation.
    Copyright © 2023, Oracle and/or its affiliates

  3. Four projects
    • Maple Tree
    • Per-VMA Locking
    • Large Folios
    • New PTE manipulation interfaces
    https://www.cs.virginia.edu/~robins/YouAndYourResearch.html

  4. Linked Lists are Immoral
    • Your CPU is attempting to extract parallelism from your sequential code
    • My 2.8GHz laptop CPU is able to issue 6 insn/clock
    - 30 insn/5-cycle L1 cache hit
    - 84 insn/14-cycle L2 cache hit
    - 240 insn/40-cycle (14ns) L3 cache hit
    - 1680 insn/100ns L3 cache miss
    • Linked lists bottleneck on fetching the next entry in the list
    • Arrays can be prefetched
    • Walking a million-entry array is 12x faster than a million-entry list on my laptop
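
The contrast between the two walks can be sketched in a few lines of C. This is only an illustration, not the benchmark from the talk: the `node` layout and the helper names are made up for the example, and no timing is claimed. In the array loop the address of the next element is known without loading the current one, while the list loop cannot start the next load until `p->next` has arrived.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list node: 'next' must be loaded before the walk can go on. */
struct node {
    uint64_t value;
    struct node *next;
};

/* Array walk: the address of a[i + 1] does not depend on the load of a[i],
 * so the prefetcher and the out-of-order core can run ahead. */
uint64_t sum_array(const uint64_t *a, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* List walk: each iteration depends on the previous load of p->next. */
uint64_t sum_list(const struct node *head)
{
    uint64_t sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Build an n-entry list holding 0..n-1, in order. */
struct node *build_list(size_t n)
{
    struct node *head = NULL;
    for (size_t i = n; i > 0; i--) {
        struct node *p = malloc(sizeof(*p));
        assert(p != NULL);
        p->value = i - 1;
        p->next = head;
        head = p;
    }
    return head;
}

void free_list(struct node *head)
{
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}
```

Both functions compute the same sum; only the memory access pattern differs. A fair measurement would also need the list nodes scattered in memory, since nodes allocated back to back by `malloc` tend to sit next to each other and hide part of the cost.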

  5. Anatomy of a page fault
    • Look up VMA (Virtual Memory Area) for virtual address
    • Walk down the page tables
    • If VMA is anonymous, allocate a page
    • Otherwise, call VMA fault handler
    - Fault handler may return a page or populate page table directly
    • If page provided, insert entry into page table
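
The steps on this slide can be walked through with a small user-space model. Everything here is hypothetical (the `toy_` types, the single-level "page table", the linear VMA scan, the PFN numbering); it mirrors only the order of the steps, not the kernel's actual data structures or function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PTES 64              /* toy single-level "page table" */

struct toy_vma;
/* Returns a page frame number, or 0 if the handler filled *pte itself. */
typedef uint64_t (*toy_fault_fn)(struct toy_vma *vma, uint64_t addr, uint64_t *pte);

struct toy_vma {
    uint64_t start, end;            /* covers [start, end) */
    bool anonymous;
    toy_fault_fn fault;             /* used when !anonymous */
};

struct toy_mm {
    struct toy_vma *vmas;
    size_t nr_vmas;
    uint64_t ptes[TOY_NR_PTES];     /* 0 means "not present" */
    uint64_t next_free_pfn;         /* must start at 1 or above */
};

/* Example handler for a file-backed VMA: "returns a page". */
uint64_t toy_file_fault(struct toy_vma *vma, uint64_t addr, uint64_t *pte)
{
    (void)pte;
    return 1000 + ((addr - vma->start) >> TOY_PAGE_SHIFT);
}

/* Step 1: look up the VMA for the address (linear scan, for brevity only). */
struct toy_vma *toy_find_vma(struct toy_mm *mm, uint64_t addr)
{
    for (size_t i = 0; i < mm->nr_vmas; i++)
        if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
            return &mm->vmas[i];
    return NULL;
}

/* Returns false when no VMA covers the address. */
bool toy_handle_fault(struct toy_mm *mm, uint64_t addr)
{
    struct toy_vma *vma = toy_find_vma(mm, addr);
    if (vma == NULL)
        return false;

    /* Step 2: "walk" the page table down to the entry for addr. */
    size_t idx = addr >> TOY_PAGE_SHIFT;
    assert(idx < TOY_NR_PTES);
    uint64_t *pte = &mm->ptes[idx];

    uint64_t pfn;
    if (vma->anonymous)
        pfn = mm->next_free_pfn++;          /* Step 3: allocate a page */
    else
        pfn = vma->fault(vma, addr, pte);   /* Step 4: call the VMA handler */

    if (pfn != 0)
        *pte = pfn;                         /* Step 5: insert the entry */
    return true;
}
```

The first step is the one the Maple Tree and per-VMA locking target; the last step is where the new PTE interfaces come in.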

  6. Four projects
    • Maple Tree
    • Per-VMA Locking
    • Large Folios
    • New PTE manipulation interfaces

  7. Looking up a VMA
    • VMAs were originally stored on a singly-linked list in 0.98 (1992)
    • An AVL tree was added in 1.1.83 (1995)
    • A Red-Black tree replaced the AVL tree in 2.4.9.11 (2001)
    • A Maple Tree replaced the linked list & Red-Black tree in 6.1 (2022)

  8. Maple Tree
    • In-memory, RCU-safe B-tree for non-overlapping ranges
    • Average branching factor of eight creates shallower trees (faster lookups)
    • Modifications allocate memory (slower modifications)
    • Applications typically have between 20 VMAs (cat) and 1000 (Mozilla)
    - Can be millions in pathological cases (ElectricFence)
    • RCU safety guarantees that a VMA which was present before the RCU lock was taken, and is still
    present after the RCU lock is released, will be found.
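
Two of these points lend themselves to a short sketch: how a node of a range B-tree finds the entry for an index, and why a branching factor of eight gives a shallow tree. The `toy_leaf` layout below is an assumption made for illustration (inclusive upper-bound pivots, eight slots); it is not the kernel's maple node format, and it ignores RCU entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8

/* Hypothetical leaf for non-overlapping ranges: slot[i] covers the indices
 * above pivot[i - 1] up to and including pivot[i]; slot[0] starts at 0. */
struct toy_leaf {
    unsigned int nr;                /* slots in use */
    uint64_t pivot[TOY_SLOTS];      /* inclusive upper bounds, ascending */
    const void *slot[TOY_SLOTS];    /* entry for the range, NULL for a gap */
};

/* All pivots sit in one small array, so the scan touches few cache lines. */
const void *toy_leaf_lookup(const struct toy_leaf *leaf, uint64_t index)
{
    assert(leaf->nr <= TOY_SLOTS);
    for (unsigned int i = 0; i < leaf->nr; i++)
        if (index <= leaf->pivot[i])
            return leaf->slot[i];
    return NULL;                    /* beyond the last range */
}

/* Levels needed to reach n entries when every node has 'fanout' children. */
unsigned int toy_tree_height(uint64_t n, unsigned int fanout)
{
    assert(fanout >= 2);
    unsigned int height = 1;
    uint64_t capacity = fanout;
    while (capacity < n) {
        capacity *= fanout;
        height++;
    }
    return height;
}
```

For the 1000-VMA case on the slide, `toy_tree_height(1000, 8)` gives 4 levels, while a perfectly balanced binary tree (`toy_tree_height(1000, 2)`) needs 10; a red-black tree is not perfectly balanced and may be deeper still.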

  9. Four projects
    • Maple Tree
    • Per-VMA Locking
    • Large Folios
    • New PTE manipulation interfaces

  10. VMA tree locking
    • Protected by a semaphore from 2.0.19 (1996)
    • Changed to a read-write semaphore from 2.4.2.5 (2001)
    • Added per-VMA read-write semaphores in 6.4 (2023)

  11. Per-VMA locking lookup
    • Take RCU read lock to prevent Maple tree nodes and VMAs from being freed
    • Load VMA from Maple tree
    • Read-trylock the per-VMA lock
    - If write-locked, a writer is modifying this VMA.
    • If MM seqcount is equal to VMA seqcount, VMA is locked
    - This allows a writer to unlock all locked VMAs just by updating mm seqcount
    • Drop RCU read lock; we will not look at the Maple Tree, and the VMA cannot be freed
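
The sequence-count trick is easier to see in a model than in a bullet list. The sketch below is single-threaded and deliberately naive: plain integers stand in for the read-write semaphore, RCU and the Maple Tree load are left out, and all `toy_` names and the initial values are assumptions for the example rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mm {
    int lock_seq;           /* bumped to release every write-locked VMA */
};

struct toy_vma {
    struct toy_mm *mm;
    int lock_seq;           /* equals mm->lock_seq while write-locked */
    int readers;            /* stand-in for the semaphore's read side */
    bool writer;            /* stand-in for the semaphore's write side */
};

/* Fault path: try to take the per-VMA lock for reading.
 * On failure the caller would fall back to the mm-wide lock. */
bool toy_vma_start_read(struct toy_vma *vma)
{
    if (vma->writer)                            /* read-trylock failed */
        return false;
    vma->readers++;
    if (vma->lock_seq == vma->mm->lock_seq) {   /* VMA is write-locked */
        vma->readers--;
        return false;
    }
    return true;
}

void toy_vma_end_read(struct toy_vma *vma)
{
    assert(vma->readers > 0);
    vma->readers--;
}

/* Writer: mark one VMA as locked by copying the mm sequence count. */
void toy_vma_start_write(struct toy_vma *vma)
{
    assert(vma->readers == 0);  /* a real writer would wait for readers */
    vma->lock_seq = vma->mm->lock_seq;
}

/* Writer: one increment unlocks every VMA marked with the old value. */
void toy_vma_end_write_all(struct toy_mm *mm)
{
    mm->lock_seq++;
}
```

The point of the design is visible in `toy_vma_end_write_all()`: a writer that locked many VMAs does not have to visit them again to unlock them.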

  12. Support for per-VMA locking
    • Anonymous VMAs handled from 6.4 on arm64, powerpc, s390, x86; 6.5 on riscv
    • Swap and Userfaultfd support in 6.6
    • In-core page cache VMAs support in 6.6
    • DAX support in 6.6
    • Page cache faults that need reads in 6.7?
    • COW faults of page cache VMAs in 6.7?
    • More support is possible, both architectures and types of memory
    - Device drivers may rely on mmap_sem synchronisation
    - HugeTLB faults have not yet been converted

  13. Four projects
    • Maple Tree
    • Per-VMA Locking
    • Large Folios
    • New PTE manipulation interfaces

  14. Large Folios
    • XFS files can be buffered in larger chunks than PAGE_SIZE since 5.17 (2022)
    - AFS since 6.0, EROFS since 6.2
    • Large folios can be created on write() since 6.6
    • Support for other filesystems & anonymous memory is in progress
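
Why larger folios mean fewer faults comes down to one division. The helper below is an idealised model with assumptions stated up front: each fault maps exactly one whole folio, the access is sequential, and nothing else the kernel does to batch mappings is counted. The 64KiB folio size in the usage note is an arbitrary example, not a figure from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* Faults needed to touch 'bytes' of a mapping if every fault maps one
 * whole folio of 'folio_bytes' (idealised; rounds up for a partial folio). */
uint64_t toy_fault_count(uint64_t bytes, uint64_t folio_bytes)
{
    assert(folio_bytes > 0);
    return (bytes + folio_bytes - 1) / folio_bytes;
}
```

Under these assumptions, touching 1MiB takes `toy_fault_count(1 << 20, 4096)` = 256 faults with 4KiB pages and `toy_fault_count(1 << 20, 65536)` = 16 faults with 64KiB folios.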

  15. Four projects
    • Maple Tree
    • Per-VMA Locking
    • Large Folios
    • New PTE manipulation interfaces

  16. New PTE manipulation interfaces
    • set_pte_at() could only insert a single Page Table Entry
    • set_ptes() can insert n consecutive Page Table Entries pointing to contiguous pages
    • flush_dcache_folio() flushes the entire folio from the data cache
    • flush_icache_pages() flushes n consecutive pages from the instruction cache
    • update_mmu_cache_range() acts on n consecutive pages
    - Also tells the architecture which page was actually requested
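
The difference between the single-entry and the batched interface can be modelled in user space. The `toy_` functions below are stand-ins written for this example: the PTE encoding is invented, and the mm and virtual address arguments that the real `set_pte_at()` and `set_ptes()` take are omitted. Only the shape of the operation is shown: one call fills n consecutive entries whose page frame numbers increase by one.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

#define TOY_PFN_SHIFT 12
#define TOY_PTE_FLAGS_MASK ((UINT64_C(1) << TOY_PFN_SHIFT) - 1)

/* Invented encoding: PFN in the high bits, flags in the low 12 bits. */
toy_pte_t toy_mk_pte(uint64_t pfn, uint64_t flags)
{
    return (pfn << TOY_PFN_SHIFT) | (flags & TOY_PTE_FLAGS_MASK);
}

uint64_t toy_pte_pfn(toy_pte_t pte)
{
    return pte >> TOY_PFN_SHIFT;
}

/* Old shape: one page table entry per call. */
void toy_set_pte_at(toy_pte_t *ptep, toy_pte_t pte)
{
    *ptep = pte;
}

/* New shape: nr consecutive entries with the same flags, pointing to
 * physically contiguous pages (PFN, PFN + 1, ...). */
void toy_set_ptes(toy_pte_t *ptep, toy_pte_t pte, unsigned int nr)
{
    assert(nr > 0);
    for (unsigned int i = 0; i < nr; i++)
        ptep[i] = pte + ((uint64_t)i << TOY_PFN_SHIFT);
}
```

Giving the architecture the whole range in one call is what lets it do per-range work once instead of once per page.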

  17. Projects I Don’t Have Time To Talk About
    • Large Anonymous Folios
    • Removing writepage()
    • Removing launder_folio()
    • Shrinking struct page
    • Batched folio freeing
    • bdev_getblk()
    • ext2 directory handling
    • folio_end_read()
    • mrlock removal
    • Converting buffer_heads to use folios
    • Lockless page faults
    • Removing GFP_NOFS
    • struct ptdesc
    • A better approach to the LRU list
    • Block size > PAGE_SIZE
    • Removing arch_make_page_accessible()
    • Why kernel-doc is not my favourite
    • Rewriting the swap subsystem
    • Removing __GFP_COMP
    • What does folio mapcount mean anyway?
    • Replacing the XArray radix tree with the maple tree
    • Converting HugeTLBfs to folios
    • Making HugeTLBfs less special
    • mshare
    • Improving readahead for modern storage
    • Support folios larger than PMD size

  18. Thanks
    • Andrew Morton
    • Darrick Wong
    • Dave Chinner
    • David Howells
    • David Hildenbrand
    • David Rientjes
    • Davidlohr Bueso
    • Greg Marsden
    • Jan Kara
    • Johannes Weiner
    • Jon Corbet
    • Kiryl Shutsemau
    • Laurent Dufour
    • Liam Howlett
    • Michal Hocko
    • Michel Lespinasse
    • Mike Kravetz
    • Mike Rapoport
    • Paul McKenney
    • Ryan Roberts
    • Song Liu
    • Suren Baghdasaryan
    • Vlastimil Babka
    • Yin Fengwei
