
Research On Data Access Method Based On Learning Index

Posted on: 2024-09-28    Degree: Master    Type: Thesis
Country: China    Candidate: J Gao    Full Text: PDF
GTID: 2568307136995009    Subject: Computer technology
Abstract/Summary:
The performance of a data storage system is measured by a variety of metrics, among which data access speed is central. Index structures are one of the most important tools that DBAs use to improve the access speed of database systems. However, as the big data era approaches, the amount of data generated in different domains has exploded; a recent study shows that indexes consume about 55% of total memory in a state-of-the-art in-memory DBMS, and building indexes in the traditional way has become a bottleneck. In recent years, the learned index (LI) has been proposed, which uses machine learning models to replace B+-trees and other traditional index structures. By learning the underlying data distribution, it turns the indirect search of a data query into the direct computation of a function, so a learned index can speed up queries and reduce index size. However, the fitting quality of the original LI is mediocre, and it assumes that the data are static and read-only, so it does not support modification operations such as insertion.

To address these issues, this paper proposes a learned index model that fits the data with a gradient descent algorithm, referred to as GDLIN. GDLIN first uses cosine similarity to group keys with similar distributions. This differs from previous learned indexes: some adopt equal-size partitioning, which ignores the relationships among keys, while others use Euclidean distance, which focuses on the distance between vectors. For data partitioning, what matters most is grouping points that lie near the same line segment, i.e., emphasizing differences in direction rather than distance. GDLIN then uses gradient descent to fit the data within each partition. Compared with the midpoint-of-slope fitting adopted by other learned indexes, gradient descent reduces the error between the predicted position and the actual position, which lowers the cost of the subsequent local search. In addition, GDLIN recursively invokes the partitioning and fitting algorithms until only one top-level model remains, which makes full use of the key distribution and prevents the index size from growing with the data volume. Experimental results show that, without insertions, GDLIN improves lookup throughput by 2.2× over traditional B+-trees and by 1.1× over the FITing-Tree.

Although recent studies have proposed additional index structures to support updates, they sacrifice lookup performance because of the overhead caused by imprecise predictions in the leaf nodes. GDLIN therefore designs a new leaf node structure that not only stores the model parameters but also allocates an array for newly inserted data, supporting update operations while preserving query performance. Experimental results show that, when the insertion factor is 0.5, GDLIN improves lookup performance by 1.15× over the LI and by 1.07× over the FITing-Tree.
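The thesis abstract does not include code. The following is a minimal, hypothetical sketch (not the author's implementation) of the two building blocks described above: greedy partitioning of sorted keys by the cosine similarity between successive (key, position) step directions, and fitting each partition with a linear model trained by gradient descent. The function names, the similarity threshold, and the learning-rate and epoch settings are illustrative assumptions.

```python
import numpy as np

def partition_by_cosine(keys, threshold=0.999):
    """Greedy partitioning of sorted keys: open a new segment whenever the
    direction of the (key, position) curve drifts away from the direction
    observed at the start of the current segment."""
    pos = np.arange(len(keys), dtype=float)
    segments, start, ref = [], 0, None
    for i in range(1, len(keys)):
        step = np.array([keys[i] - keys[i - 1], pos[i] - pos[i - 1]])
        if ref is None:                      # first step of a fresh segment
            ref = step
            continue
        cos = step @ ref / (np.linalg.norm(step) * np.linalg.norm(ref) + 1e-12)
        if cos < threshold:                  # direction changed: close segment
            segments.append((start, i))
            start, ref = i, None
    segments.append((start, len(keys)))
    return segments

def fit_linear_gd(keys, lr=0.2, epochs=2000):
    """Fit local position ~= w * key + b with plain gradient descent on the
    mean squared error; keys are min-max scaled so one learning rate behaves
    reasonably across segments with very different key ranges."""
    x = np.asarray(keys, dtype=float)
    y = np.arange(len(x), dtype=float)       # local positions inside the segment
    span = x.max() - x.min() or 1.0
    xn = (x - x.min()) / span
    w = b = 0.0
    n = len(x)
    for _ in range(epochs):
        err = w * xn + b - y
        w -= lr * 2.0 / n * (err @ xn)
        b -= lr * 2.0 / n * err.sum()
    # fold the min-max scaling back into the stored parameters
    return w / span, b - w * x.min() / span

# Toy usage: partition synthetic sorted keys, then fit one line per segment.
keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, 1_000))
for lo, hi in partition_by_cosine(keys):
    w, b = fit_linear_gd(keys[lo:hi])
    # global position of a key in this segment is approximately lo + w * key + b
```

The leaf layout below is likewise a hypothetical sketch of the idea of keeping the fitted parameters together with an overflow array for keys inserted after training; the class name, the error bound max_err, and the buffer policy are assumptions, not details taken from the thesis.

```python
import bisect

class LeafNode:
    """Hypothetical leaf layout: the fitted line (w, b) over the keys present
    at build time, plus a sorted overflow array for keys inserted later."""
    def __init__(self, keys, w, b, max_err=16):
        self.keys = list(keys)        # keys covered by the fitted model
        self.w, self.b = w, b         # model parameters over local positions
        self.max_err = max_err        # assumed bound on the prediction error
        self.inserted = []            # overflow array for newly inserted keys

    def insert(self, key):
        bisect.insort(self.inserted, key)   # keep the overflow array sorted

    def contains(self, key):
        # Predict the local position, then search only a small window around
        # the prediction instead of scanning the whole node.
        guess = int(round(self.w * key + self.b))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return True
        # Fall back to the overflow array for recently inserted keys.
        j = bisect.bisect_left(self.inserted, key)
        return j < len(self.inserted) and self.inserted[j] == key
```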
Keywords/Search Tags: Learned index, Gradient descent, Curve fitting model, Linked list, Data insertion