
Research On Scalable Synchronization Technology For NUMA Multi-Cores Systems

Posted on: 2022-03-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Yi
Full Text: PDF
GTID: 1528307169477484
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, multi-core systems have been evolving into large NUMA systems with complex memory hierarchies and hundreds of cores. The synchronization primitives (locks and barriers) and concurrent data structures used to build concurrent applications strongly affect the performance of NUMA multi-core systems. However, current synchronization primitives and concurrent data structures struggle to meet the scalability requirements of large-scale NUMA multi-core systems, largely because of frequent cross-node accesses. Research on synchronization technology for NUMA multi-core systems is therefore important for reducing synchronization overhead, improving the efficiency of parallel software on such systems, and meeting evolving application requirements.

Driven by the need for scalable synchronization on NUMA multi-core systems, this dissertation studies lock synchronization, barrier synchronization, and the scalability of concurrent data structures. The main work and contributions are as follows:

(1) Existing state-of-the-art static delegation locks provide competitive performance at the cost of permanently occupying computing cores. When client-thread requests for different critical sections are handled by the same service thread, the requests are serialized; moreover, these locks scale poorly under high contention because they ignore the expensive cross-node communication of NUMA machines. To address these issues, DYLOCK, a scalable delegation lock based on hierarchical and batch-processing techniques, is proposed. DYLOCK builds local and global locks hierarchically: a thread becomes the local service thread by acquiring the local lock, and serves local thread requests after it also acquires the global lock, thereby avoiding long-term core occupation and cross-node communication overhead. Cross-node lock contention is further reduced through NUMA-aware memory allocation. The service thread batches local requests within the local node, group by group, to reduce demarshalling overhead and exploit local memory bandwidth. Tested on the real database application Berkeley DB, DYLOCK achieves 1.4 times the throughput of ffwd, the best existing delegation lock, showing that DYLOCK provides better scalability than static delegation under high contention without dedicating computing cores.

(2) Existing delegation locks exhibit sub-optimal performance in the absence of contention because of the communication between the service thread and client threads and the cost of traversing the request array. To address this, a contention-conscious delegation framework is proposed. The framework uses a lock-stealing mechanism to detect lock contention. In the absence of contention, a thread executes the critical section directly after acquiring a TTS (test-and-test-and-set) lock, avoiding both the service-thread/client-thread communication and the request-array traversal.
Under contention, threads fall back to the delegation lock, which provides scalable performance on NUMA multi-core platforms. Tested on the real database application Berkeley DB, the contention-conscious delegation lock achieves 2.17 times the throughput of the best existing delegation lock, indicating that delegation locks built on this framework significantly improve performance in the uncontended case.

(3) Existing barriers scale poorly because they ignore issues such as high cross-node communication overhead, cache-line placement, and communication topology. To address this, a three-stage barrier synchronization framework is proposed. The framework divides barrier synchronization into three stages: barrier arrival within a NUMA node, barrier arrival across NUMA nodes, and wake-up. Each stage takes cache-line placement and communication topology into account and uses coordinator threads to reduce cross-node communication. Two barrier algorithms are designed on top of the framework, and the dissertation shows how a framework-based barrier algorithm can be converted into a performance model for performance prediction. Tested on the Clusterstream application, the three-stage barrier algorithm is 2.3 times faster than the application's default barrier, indicating that the framework enables scalable barrier synchronization on NUMA multi-core systems.

(4) Existing concurrent data structures scale poorly across NUMA nodes, and many techniques suit only read-intensive or only write-intensive workloads. To address this, CR, a universal construction for building concurrent data structures, is proposed. CR is based on delegation and a shared log. It provides scalable read performance by maintaining an up-to-date replica, protected by a shared lock, for read-only accesses.
Within a node, CR uses delegation locks to synchronize the local threads that write to the local replica; write-operation records from other nodes are obtained from the shared log and replayed locally, keeping the replicas consistent across nodes, reducing cross-node contention, and providing scalable write performance. Tested on the real in-memory database system Kyoto Cabinet Cache DB, the CR-based concurrent data structure performs 18.1 times better than the original version, indicating that CR provides scalable performance on NUMA multi-core systems.
Keywords/Search Tags:NUMA Multicores, Concurrency, Synchronization, Lock, Barrier, Concurrent Data Structure