Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Hardware and its Concurrency Habits

Hardware and its Concurrency Habits

This talk will present a high-level overview of the hardware structure of modern computer systems, and then will use this structure to show how the laws of physics put limitations on the design and implementation of concurrent software. Venue permitting, it will include a stark visual exposition of these limitations. Finally, this talk will compare how well various synchronization use cases are aligned with modern hardware structure and the laws of physics. Audience members will gain a firmer intuition on the costs and benefits of different synchronization mechanisms.

Paul E McKenney

Kernel Recipes
PRO

September 30, 2023
Tweet

More Decks by Kernel Recipes

Other Decks in Programming

Transcript

  1. Hardware and its Concurrency
    Habits
    © 2023 Meta Platforms
    Paul E. McKenney, Meta Platforms Kernel Team
    Kernel Recipes, September 27, 2023
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapter 3

    View Slide

  2. 2
    Recette Pour le Codage Simultané

    Une pincée de connaissance des lois de la
    physique

    Compréhension modeste du matériel informatique

    Compréhension approfondie des exigences

    Conception soignée, y compris la synchronisation

    Validation brutale
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapitre 3

    View Slide

  3. 3
    Recette Pour le Codage Simultané

    Une pincée de connaissance des lois de la
    physique

    Compréhension modeste du matériel informatique

    Compréhension approfondie des exigences

    Conception soignée, y compris la synchronisation

    Validation brutale
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html Chapitre 3

    View Slide

  4. 5
    “Let Them Run Free!!!”

    View Slide

  5. 6
    “Let Them Run Free!!!”
    CPU Benchmark Trackmeet
    CPU Benchmark Trackmeet

    View Slide

  6. 7
    “Let Them Run Free!!!”
    CPU Benchmark Trackmeet
    CPU Benchmark Trackmeet
    Sadly, it is now more of an obstacle course than a track...

    View Slide

  7. 8
    Don’t Make ‘em Like They Used To!

    View Slide

  8. 9
    Don’t Make ‘em Like They Used To
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.

    View Slide

  9. 10
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    No cache

    Shallow pipeline
    Shallow pipeline

    In-order execution
    In-order execution

    One instruction at a time
    One instruction at a time

    Predictable (slow)
    Predictable (slow)
    execution
    execution

    Large cache
    Large cache

    Deep pipeline
    Deep pipeline

    Out of order
    Out of order

    Super scalar
    Super scalar

    Unpredictable (fast)
    Unpredictable (fast)
    execution
    execution

    View Slide

  10. 11
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    No cache

    Shallow pipeline
    Shallow pipeline

    In-order execution
    In-order execution

    One instruction at a time
    One instruction at a time

    Predictable (slow)
    Predictable (slow)
    execution
    execution

    Large cache
    Large cache

    Deep pipeline
    Deep pipeline

    Out of order
    Out of order

    Super scalar
    Super scalar

    Unpredictable (fast)
    Unpredictable (fast)
    execution
    execution
    “Tiny Bulldozer” “Semi Tractor-Trailer”

    View Slide

  11. 12
    4.0 GHz clock, 20 MB L3
    cache, 20 stage pipeline...
    The only pipeline I need
    is to cool off that hot-
    headed brat.
    Don’t Make ‘em Like They Used To!

    No cache
    No cache

    Shallow pipeline
    Shallow pipeline

    In-order execution
    In-order execution

    One instruction at a time
    One instruction at a time

    Predictable (slow)
    Predictable (slow)
    execution
    execution

    Large cache
    Large cache

    Deep pipeline
    Deep pipeline

    Out of order
    Out of order

    Super scalar
    Super scalar

    Unpredictable (fast)
    Unpredictable (fast)
    execution
    execution
    What would be the computing-systems equivalents of a freight train?

    View Slide

  12. 13
    “Good Olde Days” CPU Architecture
    80386 Architecture (Wikipedia user “Appaloosa” GDFL, simplified and reformatted)
    32 bit
    Protection
    Test Unit
    ALU

    Barrel Shifter

    Multiply/Divide

    Register File
    Segmentation
    Bus
    Control
    32 Bit
    Paging
    Prefetcher /
    Limit Checker
    Code Queue
    (16 Bytes)
    Instruction
    Decoder
    3-Decoded
    Instruction
    Queue
    Decode and
    Sequencing
    Control
    ROM
    Dedicated ALU Bus
    34 bit
    32 bit Effective Address
    Status Flags
    ALU Control

    View Slide

  13. 14
    The 80386 Taught Me Concurrency
    That and a logic analyzer...

    View Slide

  14. 15
    But Instructions Took Several Cycles!

    View Slide

  15. 16
    Pipelined Execution For The Win!!!
    (Wikipedia user “Amit6” CC BY-SA 3.0, reformatted)

    View Slide

  16. 17
    Superscalar Execution For The Win!!!
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)
    128 entry
    ITLB
    32 KB Instruction Cache
    (8 way)
    32 Byte Pre-Decode,
    Fetch Buffer
    18 Entry
    Instruction Queue
    128 Bit
    6 Instructions
    Instruction
    Fetch Unit
    Micro-
    code
    Simple
    Decoder
    Simple
    Decoder
    Simple
    Decoder
    Complex
    Decoder
    1μop
    1μop
    1μop
    4μops
    7 Entry μop Buffer
    4μops
    Register Alias Table
    and Allocator
    96 Entry Reorder Buffer (ROB) Retirement Register File
    4μops
    4μops
    Store
    Data
    Store
    Address
    SSE
    ALU
    ALU
    Branch
    SSE
    Shuffle
    MUL
    ALU
    SSE
    Shuffle
    ALU
    ALU
    Load
    Address
    4μops
    32 Entry Reservation Station
    128 Bit
    FMUL
    FDIV
    128 Bit
    FADD
    Memory Ordering Buffer
    (MOB)
    128 Bit
    32 KB Dual Ported Data Cache
    (8 way)
    16 entry
    DTLB
    Store
    128 Bit
    Load
    256 entry
    L2 DTLB
    Shared
    L2 Cache
    (16 way)
    Shared
    Bus
    Interface
    Unit
    256 Bit

    View Slide

  17. 18
    Why All This Hardware Complexity?

    View Slide

  18. 19
    Laws of Physics: Atoms Are Too Big!!!
    Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.

    View Slide

  19. 20
    Laws of Physics: Atoms Are Too Big!!!
    Each spot is an atom. Qingxiao Wang/UT Dallas ca. 2016.
    Speed controlled by base thickness:
    At least one atom thick!!!

    View Slide

  20. 21
    Laws of Physics: Light Is Too Slow!!!
    “One nanosecond per foot” courtesy of Grace Hopper (https://www.youtube.com/watch?v=9eyFDBPk4Yw)
    https://en.wikipedia.org/wiki/List_of_refractive_indices A 50% sugar solution is “light syrup”.

    Following the footsteps of Admiral Hopper:
    – Light goes 11.803 inches/ns in a vacuum

    Or, if you prefer, 1.0097 lengths of A4 paper per nanosecond

    Light goes 1 width of A4 paper per nanosecond in 50% sugar solution
    – But over and back: 5.9015 in/ns
    – But not 1GHz! Instead, ~2GHz: ~3in/ns
    – But Cu: ~1 in/ns, or Si transistors: ~0.1 in/ns
    – Plus other slowdowns: prototols, electronics, ...

    View Slide

  21. 22
    Laws of Physics: Data Is Slower!!!
    CPUs Caches Interconnects
    Memories
    DRAM & NVM

    View Slide

  22. 23
    Laws of Physics: Data Is Slower!!!
    CPUs Caches Interconnects
    Memories
    DRAM & NVM
    Light is way too slow in Cu and Si and atoms are way too big!!!

    View Slide

  23. 24
    Laws of Physics: Data Is Slower!!!
    CPUs Caches Interconnects
    Memories
    DRAM & NVM
    Protocol
    overheads
    (Mathematics!)
    Multiplexing &
    Demultiplexing
    (Electronics)
    Clock-domain
    transitions
    (Electronics)
    Phase
    changes
    (Chemistry)
    Light is way too slow in Cu and Si and atoms are way too big!!!

    View Slide

  24. 25
    Laws of Physics: Summary

    The speed of light is finite (especially in Cu and
    Si) and atoms are of non-zero size

    Mathematics, electronics, and chemistry also
    take their toll

    Systems are fast, so this matters

    View Slide

  25. 26
    Laws of Physics: Summary

    The speed of light is finite (especially in Cu and
    Si) and atoms are of non-zero size

    Mathematics, electronics, and chemistry also
    take their toll

    Systems are fast, so this matters
    “Gentlemen, you have two fundamental problems:
    (1) the finite speed of light and (2) the atomic nature of matter.” *
    * Gordon Moore quoting Stephen Hawking

    View Slide

  26. 27
    Why All This Hardware Complexity?
    CPUs Caches Interconnects
    Memories
    DRAM & NVM
    Protocol
    overheads
    (Mathematics!)
    Multiplexing &
    Demultiplexing
    (Electronics)
    Clock-domain
    transitions
    (Electronics)
    Phase
    changes
    (Chemistry)
    Slow light and big atoms create modern computing obstacle course!!!
    Light is way too slow in Cu and Si and atoms are way too big!!!

    View Slide

  27. 28
    Account For All CPU Complexity???

    Sometimes, yes! (Assembly language!)

    But we also need portability: CPUs change
    – From family to family
    – With each revision of silicon
    – To work around hardware bugs
    – As a given physical CPU ages

    View Slide

  28. 29
    One of the ALUs Might Be Disabled
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted)
    128 entry
    ITLB
    32 KB Instruction Cache
    (8 way)
    32 Byte Pre-Decode,
    Fetch Buffer
    18 Entry
    Instruction Queue
    128 Bit
    6 Instructions
    Instruction
    Fetch Unit
    Micro-
    code
    Simple
    Decoder
    Simple
    Decoder
    Simple
    Decoder
    Complex
    Decoder
    1μop
    1μop
    1μop
    4μops
    7 Entry μop Buffer
    4μops
    Register Alias Table
    and Allocator
    96 Entry Reorder Buffer (ROB) Retirement Register File
    4μops
    4μops
    Store
    Data
    Store
    Address
    SSE
    ALU
    ALU
    Branch
    SSE
    Shuffle
    MUL
    ALU
    SSE
    Shuffle
    ALU
    ALU
    Load
    Address
    4μops
    32 Entry Reservation Station
    128 Bit
    FMUL
    FDIV
    128 Bit
    FADD
    Memory Ordering Buffer
    (MOB)
    128 Bit
    32 KB Dual Ported Data Cache
    (8 way)
    16 entry
    DTLB
    Store
    128 Bit
    Load
    256 entry
    L2 DTLB
    Shared
    L2 Cache
    (16 way)
    Shared
    Bus
    Interface
    Unit
    256 Bit
    X
    X

    View Slide

  29. 30
    Thus, Simple Portable CPU Model
    128 entry
    ITLB
    32 KB Instruction Cache
    (8 way)
    32 Byte Pre-Decode,
    Fetch Buffer
    18 Entry
    Instruction Queue
    128 Bit
    6 Instructions
    Instruction
    Fetch Unit
    Micro-
    code
    Simple
    Decoder
    Simple
    Decoder
    Simple
    Decoder
    Complex
    Decoder
    1μop
    1μop
    1μop
    4μops
    7 Entry μop Buffer
    4μops
    Register Alias Table
    and Allocator
    96 Entry Reorder Buffer (ROB) Retirement Register File
    4μops
    4μops
    Store
    Data
    Store
    Address
    SSE
    ALU
    ALU
    Branch
    SSE
    Shuffle
    MUL
    ALU
    SSE
    Shuffle
    ALU
    ALU
    Load
    Address
    4μops
    32 Entry Reservation Station
    128 Bit
    FMUL
    FDIV
    128 Bit
    FADD
    Memory Ordering Buffer
    (MOB)
    128 Bit
    32 KB Dual Ported Data Cache
    (8 way)
    16 entry
    DTLB
    Store
    128 Bit
    Load
    256 entry
    L2 DTLB
    Shared
    L2 Cache
    (16 way)
    Shared
    Bus
    Interface
    Unit
    256 Bit
    CPU
    CPU
    Store
    Store
    Buffer
    Buffer
    Cache
    Cache
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)

    View Slide

  30. 31
    And Lots Of CPUs Per System!!!
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Interconnect
    Interconnect
    CPUs 0-27 &
    CPUs 224-251
    CPUs 28-55 &
    CPUs 252-279
    CPUs 56-83 &
    CPUs 280-307
    CPUs 84-111 &
    CPUs 308-335
    CPUs 112-139 &
    CPUs 336-363
    CPUs 140-167 &
    CPUs 364-391
    CPUs 168-195 &
    CPUs 392-419
    CPUs 196-223 &
    CPUs 420-447

    View Slide

  31. 32
    Obstacles for Modern Computers

    View Slide

  32. 33
    Obstacle: Pipeline Flush
    PIPELINE
    ERROR
    PIPELINE
    ERROR
    BRANCH
    MISPREDICTION
    Running at full speed requires perfect branch prediction

    View Slide

  33. 34
    Obstacle: Memory Reference
    A single fetch all the way from memory can cost hundreds of clock cycles

    View Slide

  34. 35
    Obstacle: Atomic Operation
    Atomic operations require locking cachelines and/or busses, incurring significant delays

    View Slide

  35. 36
    Obstacle: Memory Barrier
    Memory barriers result in stalls and/or ordering constraints, again incurring delays
    Memory
    Barrier
    Memory
    Barrier

    View Slide

  36. 37
    Obstacle: Thermal Throttling
    Efficient use of CPU hardware generates heat, throttling the CPU clock frequency

    View Slide

  37. 38
    Obstacle: Cache Miss
    Cache misses result in waiting for data to arrive (from memory or other CPUs)
    CACHE-
    MISS
    TOLL
    BOOTH
    CACHE-
    MISS
    TOLL
    BOOTH

    View Slide

  38. 39
    Obstacle: Input/Output Operation
    And here you thought that cache misses were slow...

    View Slide

  39. 40
    Which Obstables To Focus On?
    1) I/O operations (but often higher-level issue)
    2) Communications cache misses
    3) Memory barriers and atomic operations
    4) Capacity/geometry cache misses (memory)
    5) Branch prediction

    View Slide

  40. 41
    Which Obstables To Focus On?
    1) I/O operations (but often higher-level issue)
    2) Communications cache misses
    3) Memory barriers and atomic operations
    4) Capacity/geometry cache misses (memory)
    5) Branch prediction
    These obstacles can (usually) be overcome in a portable manner.

    View Slide

  41. 42
    Xeon Platinum 8176 2.1GHz: CPU 0

    View Slide

  42. 43
    Location Really Matters!!!
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Last-Level Cache
    Interconnect
    Interconnect
    CPUs 0-27 &
    CPUs 224-251
    CPUs 28-55 &
    CPUs 252-279
    CPUs 56-83 &
    CPUs 280-307
    CPUs 84-111 &
    CPUs 308-335
    CPUs 112-139 &
    CPUs 336-363
    CPUs 140-167 &
    CPUs 364-391
    CPUs 168-195 &
    CPUs 392-419
    CPUs 196-223 &
    CPUs 420-447

    View Slide

  43. 44
    Latency Demonstration

    View Slide

  44. 49
    Can Hardware Help???

    View Slide

  45. 50
    Can Hardware Help???
    Source
    Drain
    Sub-Atomic Base
    Vacuum-gap transistor: At these scales, the atmosphere is a vacuum!!!

    View Slide

  46. 51
    We Really Can Hand-Place Atoms...
    Actually a carbon monoxide molecule that I moved across a few planes of copper

    View Slide

  47. 52
    We Really Can Hand-Place Atoms...
    But not trillions of them in a cost-effective manner!!!
    Does
    Not Scale

    View Slide

  48. 53
    Incremental Help From Hardware

    View Slide

  49. 54
    Hardware 3D Integration
    3 cm
    1.5 cm
    Half the distance,
    twice the speed!!!
    Both stacked chiplets and lithographically stacked transistors.

    View Slide

  50. 55
    Stacked Chiplets/Dies
    Diagram by Shmuel Csaba Otto Traian, CCSA4.0

    View Slide

  51. 56
    Lithographically Stacked Transistors
    https://ieeexplore.ieee.org/document/9976473
    https://spectrum.ieee.org/intels-stacked-nanosheet-transistors-could-be-the-next-step-in-moores-law

    View Slide

  52. 57
    Hardware 3D Integration
    * Give or take issues with power, cooling, alignment, interconnect drivers, and so on.
    3 cm
    1.5 cm
    Half the distance,
    twice the speed *

    View Slide

  53. 58
    Hardware Integration is Helping
    Q3 2017: 56 CPUs with ~100ns latencies

    View Slide

  54. 59
    Hardware Integration is Helping
    November 2008: 16 CPUs with ~100ns latencies: More than 3x in nine years!!!

    View Slide

  55. 60
    Hardware Accelerators

    View Slide

  56. 61
    Hardware Accelerators, Theory
    Data Accelerator
    Unidirectional data flow, no out and back, twice the speed!!!

    View Slide

  57. 62
    Hardware Accelerators, Practice
    Data Accelerator
    Sadly, back to request-response, but better latency with local memory?
    Accelerator-local
    memory
    System main memory

    View Slide

  58. 63
    So Why Hardware Accelerators???

    Optimized data transfers (e.g., larger blocks)

    Optimized hard-wired computation

    Better performance per watt

    Better performance per unit capital cost

    View Slide

  59. 64
    Hardware Has Been Helping All Along

    View Slide

  60. 65
    What Hardware Is Up Against
    CPUs Caches Interconnects
    Memories
    DRAM & NVM
    Protocol
    overheads
    (Mathematics!)
    Multiplexing &
    Demultiplexing
    (Electronics)
    Clock-domain
    transitions
    (Electronics)
    Phase
    changes
    (Chemistry)
    Light is way too slow in Cu and Si and atoms are way too big!!!

    View Slide

  61. 66
    Therefore, Memory Hierarchies!!!

    View Slide

  62. 67
    Simple Portable CPU Model Redux
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)
    128 entry
    ITLB
    32 KB Instruction Cache
    (8 way)
    32 Byte Pre-Decode,
    Fetch Buffer
    18 Entry
    Instruction Queue
    128 Bit
    6 Instructions
    Instruction
    Fetch Unit
    Micro-
    code
    Simple
    Decoder
    Simple
    Decoder
    Simple
    Decoder
    Complex
    Decoder
    1μop
    1μop
    1μop
    4μops
    7 Entry μop Buffer
    4μops
    Register Alias Table
    and Allocator
    96 Entry Reorder Buffer (ROB) Retirement Register File
    4μops
    4μops
    Store
    Data
    Store
    Address
    SSE
    ALU
    ALU
    Branch
    SSE
    Shuffle
    MUL
    ALU
    SSE
    Shuffle
    ALU
    ALU
    Load
    Address
    4μops
    32 Entry Reservation Station
    128 Bit
    FMUL
    FDIV
    128 Bit
    FADD
    Memory Ordering Buffer
    (MOB)
    128 Bit
    32 KB Dual Ported Data Cache
    (8 way)
    16 entry
    DTLB
    Store
    128 Bit
    Load
    256 entry
    L2 DTLB
    Shared
    L2 Cache
    (16 way)
    Shared
    Bus
    Interface
    Unit
    256 Bit
    CPU
    CPU
    Store
    Store
    Buffer
    Buffer
    Cache
    Cache
    (Not yet)

    View Slide

  63. 68
    Read-Side Hardware Help (1/7)
    CPU 0
    Cache
    CPU 3
    Cache
    x=42,y=63
    CPU 1 CPU 2
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)
    Request cacheline x

    View Slide

  64. 69
    Read-Side Hardware Help (2/8)
    CPU 0
    Cache
    CPU 3
    Cache
    x=42,y=63
    CPU 1 CPU 2
    Request cacheline x
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  65. 70
    Read-Side Hardware Help (3/8)
    CPU 0
    Cache
    CPU 3
    Cache
    x=42,y=63
    CPU 1 CPU 2
    Request cacheline x
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  66. 71
    Read-Side Hardware Help (4/8)
    CPU 0
    Cache
    CPU 3
    Cache
    CPU 1 CPU 2
    Cacheline x = 42, y = 63
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  67. 72
    Read-Side Hardware Help (5/8)
    CPU 0
    Cache
    x=42,y=63
    CPU 3
    Cache
    CPU 1 CPU 2
    Cacheline x = 42, y = 63
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  68. 73
    Read-Side Hardware Help (6/8)
    CPU 0
    Cache
    x=42,y=63
    CPU 3
    Cache
    CPU 1 CPU 2
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  69. 74
    Read-Side Hardware Help (7/8)
    CPU 0
    Cache
    x=42,y=63
    CPU 3
    Cache
    CPU 1 CPU 2
    Caches help beat laws of physics given spatial locality!!!
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  70. 75
    Read-Side Hardware Help (8/8)
    CPU 0
    Cache
    x=42,y=63
    CPU 3
    Cache
    CPU 1 CPU 2
    Caches help beat laws of physics given temporal locality!!!
    r1 = READ_ONCE(x)
    r2 = READ_ONCE(y)
    r3 = READ_ONCE(x)

    View Slide

  71. 76
    Levels of Cache on My Laptop
    Index Line Size Associativity Size
    0 64 8 32K
    1 64 8 32K
    2 64 4 256K
    3 64 16 16,384K
    When taking on the laws of physics, don’t be afraid to use a few transistors

    View Slide

  72. 77
    Levels of Cache on Large Old Server
    Index Line Size Associativity Size
    0 64 8 32K
    1 64 6 32K
    2 64 16 1,024K
    3 64 11 39,424K
    When taking on the laws of physics, don’t be afraid to use lots of transistors

    View Slide

  73. 79
    But What About Writes?

    View Slide

  74. 80
    Write-Side Hardware Help (Summary)

    Store buffers for the win!!! Sort of…
    – Cache line for variable x is initially at CPU 3
    – CPU 0 writes 1 to x, but doesn't have cacheline

    So holds the write in CPU 0's store buffer

    And requests exclusive access to the cacheline (which takes time)
    – CPU 3 reads x, obtaining “0” immediately from cacheline
    – CPU 0 receive's x's cacheline

    And CPU 0's write finally gets to the cacheline

    Overwriting the value that CPU 3 read, despite the write starting earlier

    Writes only appear to be instantaneous!!!

    View Slide

  75. 81
    Simple Portable CPU Model Redux
    Intel Core 2 Architecture (Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted and remixed)
    128 entry
    ITLB
    32 KB Instruction Cache
    (8 way)
    32 Byte Pre-Decode,
    Fetch Buffer
    18 Entry
    Instruction Queue
    128 Bit
    6 Instructions
    Instruction
    Fetch Unit
    Micro-
    code
    Simple
    Decoder
    Simple
    Decoder
    Simple
    Decoder
    Complex
    Decoder
    1μop
    1μop
    1μop
    4μops
    7 Entry μop Buffer
    4μops
    Register Alias Table
    and Allocator
    96 Entry Reorder Buffer (ROB) Retirement Register File
    4μops
    4μops
    Store
    Data
    Store
    Address
    SSE
    ALU
    ALU
    Branch
    SSE
    Shuffle
    MUL
    ALU
    SSE
    Shuffle
    ALU
    ALU
    Load
    Address
    4μops
    32 Entry Reservation Station
    128 Bit
    FMUL
    FDIV
    128 Bit
    FADD
    Memory Ordering Buffer
    (MOB)
    128 Bit
    32 KB Dual Ported Data Cache
    (8 way)
    16 entry
    DTLB
    Store
    128 Bit
    Load
    256 entry
    L2 DTLB
    Shared
    L2 Cache
    (16 way)
    Shared
    Bus
    Interface
    Unit
    256 Bit
    CPU
    CPU
    Store
    Store
    Buffer
    Buffer
    Cache
    Cache
    Now!!!

    View Slide

  76. 82
    Write-Side Hardware Help (1/7)
    CPU 0
    Store Buffer
    Cache
    CPU 3
    Store Buffer
    Cache
    x=0
    CPU 1 CPU 2
    WRITE_ONCE(x, 1) READ_ONCE(x)

    View Slide

  77. 83
    Write-Side Hardware Help (2/7)
    CPU 0
    Store Buffer
    x=1
    Cache
    CPU 3
    Store Buffer
    Cache
    x=0
    CPU 1 CPU 2
    Request cacheline x
    WRITE_ONCE(x, 1) READ_ONCE(x)
    The store buffer allows writes to completes quickly!!! Take that, laws of physics!!!

    View Slide

  78. 84
    Write-Side Hardware Help (3/7)
    CPU 0
    Store Buffer
    x=1
    Cache
    CPU 3
    Store Buffer
    Cache
    x=0
    CPU 1 CPU 2
    Request cacheline x
    WRITE_ONCE(x, 1) READ_ONCE(x)
    Except that later read gets older value...

    View Slide

  79. 85
    Write-Side Hardware Help (4/7)
    CPU 0
    Store Buffer
    x=1
    Cache
    CPU 3
    Store Buffer
    Cache
    x=0
    CPU 1 CPU 2
    Request cacheline x
    WRITE_ONCE(x, 1) READ_ONCE(x)

    View Slide

  80. 86
    Write-Side Hardware Help (5/7)
    CPU 0
    Store Buffer
    x=1
    Cache
    CPU 3
    Store Buffer
    Cache
    CPU 1 CPU 2
    WRITE_ONCE(x, 1) READ_ONCE(x)
    Respond with cacheline x = 0

    View Slide

  81. 87
    Write-Side Hardware Help (6/7)
    CPU 0
    Store Buffer
    x=1
    Cache
    x=0
    CPU 3
    Store Buffer
    Cache
    CPU 1 CPU 2
    WRITE_ONCE(x, 1) READ_ONCE(x)
    Respond with cacheline x = 0

    View Slide

  82. 88
    Write-Side Hardware Help (7/7)
    CPU 0
    Store Buffer
    Cache
    x=1
    CPU 3
    Store Buffer
    Cache
    CPU 1 CPU 2
    WRITE_ONCE(x, 1) READ_ONCE(x)
    Quick write completion, sort of. Laws of physics: Slow or misordered!!!

    View Slide

  83. 89
    Misordering? Or Propagation Delay?
    CPU 0
    CPU 1
    CPU 2
    CPU 3
    WRITE_ONCE(x, 1);
    r1 = READ_ONCE(x) == 0;
    X
    ==
    0 X
    ==
    1
    fr
    Time

    View Slide

  84. 90
    And Careful What You Wish For!!!
    CPU 0
    CPU 1
    CPU 2
    CPU 3
    WRITE_ONCE(x, 1);
    r1 = READ_ONCE(x) == 0;
    X
    ==
    0 X
    ==
    1
    fr
    Time
    Hardware tricks help reduce the red triangle. But too bad about Meltdown and Spectre...

    View Slide

  85. 91
    Can Software Help?

    View Slide

  86. 92
    Can Software Help?

    Increasingly, yes!!!

    Use concurrent libraries, applications,
    subsystems, and so on
    – Let a few do the careful coding and tuning
    – Let a great many benefit from the work of a few

    Use proper APIs to deal with memory ordering
    – Chapter 14: “Advanced Synchronization: Memory Ordering”
    https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

    View Slide

  87. 93
    Summary

    View Slide

  88. 94
    Summary

    Modern hardware is highly optimized
    – Most of the time!
    – Incremental improvements due to integration
    – But the speed of light is too slow and atoms too big

    Use concurrent software where available

    Structure your code to avoid the big obstacles
    – Micro-optimizations only sometimes needed

    View Slide

  89. 95
    For More Information

    “Computer Architecture: A Quantitative Approach”, Hennessey & Patterson (Recent HW)

    “Parallel Computer Architecture: A Hardware/Software Approach”, Culler et al.
    – Includes SGI Origin and Sequent NUMA-Q

    “Programing Massively Parallel Processors: A Hands-on Approach”, Kirk & Hwu
    – Primarily NVIDIA GPGPUs

    https://developer.nvidia.com/educators/existing-courses
    – List of NVIDIA university courseware

    https://gpuopen.com/professional-compute
    – List of AMD GPGPU-related content

    “Is Parallel Programming Hard, And, If So, What Can You Do About It?”
    – https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

    View Slide

  90. 96
    L'antidote de Codage Simultané de la Femme de Paul

    Cordial de mûres sauvages de l'Himalaya
    – Mettre deux litres de mûres dans un pot de quatre litres
    – Ajouter cinq huitièmes litres de sucre
    – Remplissez le pot de vodka
    – Secouez tous les jours pendant cinq jours
    – Secouez chaque semaine pendant cinq semaines
    – Passer au tamis: Ajoutez des baies à la glace, consommez le
    liquide filtré comme vous voulez

    View Slide

  91. 97
    Paul’s Wife’s Concurrency Antidote

    Wild Himalayan Blackberry Cordial
    – Put 8 cups wild himalayan blackberries in 1 gallon jar
    – Add 2½ cups sugar
    – Fill jar with vodka
    – Shake every day for five days
    – Shake every week for five weeks
    – Pour through sieve: Add berries to ice cream, consume filtered
    liquid as you wish

    View Slide

  92. 98
    Paul’s Wife’s Concurrency Antidote

    Wild Himalayan Blackberry Cordial
    – 8 cups wild himalayan blackberries in 1 gallon jar
    – Add 2½ cups sugar
    – Fill jar with vodka
    – Shake every day for five days
    – Shake every week for five weeks
    – Pour through sieve: Add berries to ice cream, consume filtered
    liquid as you wish
    If there is no right tool, invent it!!!
    Questions?

    View Slide