concurrent data structures, with a signi•cantly simpler design KEYWORDS concurrent data structures; parallel programs; processing-in-memory; near-memory computing 1 NEAR-MEMORY COMPUTING „e performance gap between memory and CPU has grown ex-ponentially Memory vendors have focused mainly on improv-
Data Structures + Hardware 6 • Tight integration between algorithmic design and hardware characteristics • Memory becomes an active component in managing data • Managing data structures in PIM • Old work: pointer chasing for sequential data structures • Our work: concurrent data structures
near-memory compute units, and between the host threads and near-memory compute units is required to meet high concurrency and correctness guarantees In summary, this paper makes the following contributions: •We propose hybrid data structures, which are cache-conscious concurrent data structures designed for new NMP architectures
the-art concurrent data structures This paper makes the following contributions: •We define a generic near-data-processing (NDP) architecture that is well-suited for concurrent data structures •We implement actual software kernels of the NDP-based concurrent data structures on a cycle-accurate full-system
Abstract of \Concurrent Data Structures with Near-Memory Processing: Software-Hardware Co-Design" by Jiwon Choe, Ph D , Brown University, May 2022 In recent years, there has been a renewed interest in near-memory processing (NMP) architectures as a workaround for the performance and energy issues of frequent and irregular memory access,
and software simulators con?rm our theoretical ?ndings, and they show that the LPM approach can be applied in diverse computing platforms and can effectivelyguide performance optimization of memory systems Index Terms—Memory wall, memory stall time, ef?ciency, performance optimization, layered performance matching (LPM), memory
How to Design Data Structures for PIM? Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu, "Concurrent Data Structures for Near-Memory Computing" Proceedings of the29th ACM Symposium on Parallelism in Algorithms and Architectures(SPAA), Washington, DC, USA, July 2017 [Slides (pptx) ( pdf )] 12
© 2017 VMware Inc. All rights reserved.Concurrent Data Structures for Near-Memory ComputingZhiyuLiu (Brown)Irina Calciu(VMware Research)Maurice Herlihy(Brown)OnurMutlu(ETH)
Concurrent Data Structures2Are used everywhere: kernel, libraries, applicationsIssues:•Difficult to design and implement •Data layout and memory/cache hierarchy play crucial role in performanceCacheMemory
The Memory Wall3CPUL1 cacheL2 cacheL3 cacheMemory< 10 nsTens of nsHundredsof ns2.2 GHz = 220K cycles during this timeData movement
Near Memory Computing4•Also called Processing In Memory (PIM)•Avoid data movement by doing computation in memory•Old idea •New advances in 3D integration and die-stacked memory•Viable in the near future
Near Memory Computing: Architecture5•Vaults: memory partitions•PIM cores: lightweight•Fast access to its own vault•Communication•Between a CPU and a PIM•Between PIMs•Via messages sent to buffersCPUCrossbarNetworkCPUPIMcoreVaultPIM memoryPIMcorePIMcoreCPUCPUVaultVaultDRAM
Data Structures + Hardware 6•Tight integration between algorithmic designand hardware characteristics•Memory becomes an active component in managing data •Managing data structures in PIM •Old work: pointer chasing for sequential data structures•Our work: concurrent data structures
CONTENTION7LOWHIGHCacheable Pointer-chasingGoals: PIM Concurrent Data Structures1. How do PIM data structures compare to state-of-the-art concurrent data structures?2. How to design efficient PIM data structures?
CONTENTION8LOWHIGHCacheable Pointer-chasingGoals: PIM Concurrent Data Structures1. How do PIM data structures compare to state-of-the-art concurrent data structures?2. How to design efficient PIM data structures?e.g., uniform distribution skiplist& linkedlist
CONTENTION10LOWHIGHCacheable Pointer-chasingGoals: PIM Concurrent Data Structures1. How do PIM data structures compare to state-of-the-art concurrent data structures?2. How to design efficient PIM data structures?e.g., uniform distribution skiplist& linkedlist
Naïve PIM Skiplist11CPUCPUCPUadd(30)add(5)delete(80)PIM Memory VaultPIMcore111134678101114447101010812141214716161616Low Latency
Concurrent Data Structures12CPUCPUCPUadd(30)add(5)delete(80)111134678101114447101010812141214716161616DRAMHigh Latency
Flat Combining13CPUCPUCPUadd(30)add(5)delete(80)111134678101114447101010812141214716161616Sequential accessCombiner lockDRAMHigh Latency
PIM Performance15NSize ofthe skiplistpNumber of processesLCPULatencyof a memory access from the CPULLLCLatency ofa LLC access LATOMICLatencyof an atomic instruction (by the CPU)LPIMLatency of a memory access from the PIM core LMSGLatencyof a message from the CPU to the PIM core
PIM Performance16LCPU= r1LPIM= r2LLLCLMSG= LCPUr1 = r2 = 3PIM Performance17AlgorithmThroughputLock-freep / ( B * LCPU)Flat Combining (FC)1 / ( B * LCPU)PIM1 / ( B * LPIM+ LMSG )B= average number of nodes accessed during a skiplistoperation
CONTENTION19LOWHIGHCacheable Pointer-chasingGoals: PIM Concurrent Data Structures1. How do PIM data structures compare to state-of-the-art concurrent data structures?2. How to design efficient PIM data structures?e.g., uniform distribution skiplist& linkedlist
New PIM algorithm: Exploit Partitioning20CPUCrossbarNetworkCPUPIMcoreVaultPIM memoryPIMcorePIMcoreCPUCPUVaultVaultDRAM
PIM Skiplistw/ Partitioning2111113467810112026447101010812202012207CPUCPUCPU11020CPUcacheVault1Vault2Vault3PIMcorePIMcorePIMcorePIM Memory
PIM Performance22AlgorithmThroughputLock-freep / ( B * LCPU)Flat Combining (FC)1 / ( B * LCPU)PIM1 / ( B * LPIM+ LMSG )FC + kpartitionsk / ( B * LCPU )PIM + kpartitionsk/(B *LPIM+LMSG )B= average number of nodes accessed during a skiplistoperation
CONTENTION25LOWHIGHCacheable Pointer-chasingGoals: PIM Concurrent Data Structures1. How do PIM data structures compare to state-of-the-art concurrent data structures?2. How to design efficient PIM data structures?e.g., FIFO queue
PIM FIFO Queue27TailCPUCPUPIMcoreDeq()Deq()1dequeuesanode2retrievesrequest3sendsbackthenode1retrievesrequestPIM Memory Vault
Pipelining2812Timeline3Deq()Deq()Deq()123123Canoverlap the execution of the next request ParallelizeEnqsandDeqs29HeadTailCPUCPUCPUVault2PIMcoreVault1PIMcoreenqsdeqsConclusion30PIM is becoming feasible in the near futureWe investigate Concurrent Data Structures (CDS) for PIMResults:•Naïve PIM data structures are less efficient than CDS•New PIM algorithms can leverage PIM features •They outperform efficient CDS•They are simpler to design and implement
Thankyou!https://research.vmware.com/icalciu@vmware.com© 2017 VMware Inc. All rights reserved.Concurrent Data Structures for Near-Memory ComputingZhiyuLiu (Brown)Irina Calciu(VMware Research)Maurice Herlihy(Brown)OnurMutlu(ETH)
33