
Research On Data Access Method Based On Learning Index

Posted on: 2024-09-28    Degree: Master    Type: Thesis
Country: China    Candidate: J Gao    Full Text: PDF
GTID: 2568307136995009    Subject: Computer technology
Abstract/Summary:
The performance of a data storage system is measured by a variety of metrics, among which data access speed is central. Index structures are one of the most important tools that DBAs use to improve the access speed of database systems. However, as the big data era approaches, the amount of data generated in different domains has exploded; a recent study shows that indexes consume about 55% of total memory in a state-of-the-art in-memory DBMS, and building indexes in the traditional way has become a bottleneck. In recent years, the learned index (LI) has been proposed, which uses machine learning models to replace B+-trees and other traditional index structures. By learning the underlying data distribution, it turns the indirect search of a data query into the direct computation of a function, so a learned index can speed up queries and reduce index size. However, the fitting quality of the original LI is mediocre, and it assumes that the data are static and read-only, so it does not support modification operations such as insertion.

To address these issues, this paper proposes a learned index model that fits the data with a gradient descent algorithm, referred to as GDLIN. GDLIN first uses cosine similarity to group keys with similar distributions. This differs from previous learned indexes: some adopt equal-size partitioning, which ignores the relationships among keys, while others use Euclidean distance, which focuses on the distance between vectors. For data partitioning, what matters most is grouping points that lie near the same line segment, i.e., emphasizing differences in direction rather than distance. GDLIN then uses gradient descent to fit the data within each partition. Compared with the midpoint-of-slope fitting adopted by other learned indexes, gradient descent reduces the error between the predicted position and the actual position, which lowers the cost of the subsequent local search. In addition, GDLIN recursively invokes the partitioning and fitting algorithms until only one top-level model remains, which makes full use of the key distribution and prevents the index size from growing with the data volume. Experimental results show that, without insertions, GDLIN improves lookup throughput by 2.2× over traditional B+-trees and by 1.1× over the FITing-Tree.

Although recent studies have proposed additional index structures to support updates, they sacrifice lookup performance because of the overhead caused by imprecise predictions in the leaf nodes. GDLIN therefore designs a new leaf node structure that not only stores the model parameters but also allocates an array for newly inserted data, supporting update operations while preserving query performance. Experimental results show that, when the insertion factor is 0.5, GDLIN improves lookup performance by 1.15× over the LI and by 1.07× over the FITing-Tree.
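The thesis abstract does not include code. The following is a minimal, hypothetical sketch (not the author's implementation) of the two building blocks described above: greedy partitioning of sorted keys by the cosine similarity between successive (key, position) step directions, and fitting each partition with a linear model trained by gradient descent. The function names, the similarity threshold, and the learning-rate and epoch settings are illustrative assumptions.

```python
import numpy as np

def partition_by_cosine(keys, threshold=0.999):
    """Greedy partitioning of sorted keys: open a new segment whenever the
    direction of the (key, position) curve drifts away from the direction
    observed at the start of the current segment."""
    pos = np.arange(len(keys), dtype=float)
    segments, start, ref = [], 0, None
    for i in range(1, len(keys)):
        step = np.array([keys[i] - keys[i - 1], pos[i] - pos[i - 1]])
        if ref is None:                      # first step of a fresh segment
            ref = step
            continue
        cos = step @ ref / (np.linalg.norm(step) * np.linalg.norm(ref) + 1e-12)
        if cos < threshold:                  # direction changed: close segment
            segments.append((start, i))
            start, ref = i, None
    segments.append((start, len(keys)))
    return segments

def fit_linear_gd(keys, lr=0.2, epochs=2000):
    """Fit local position ~= w * key + b with plain gradient descent on the
    mean squared error; keys are min-max scaled so one learning rate behaves
    reasonably across segments with very different key ranges."""
    x = np.asarray(keys, dtype=float)
    y = np.arange(len(x), dtype=float)       # local positions inside the segment
    span = x.max() - x.min() or 1.0
    xn = (x - x.min()) / span
    w = b = 0.0
    n = len(x)
    for _ in range(epochs):
        err = w * xn + b - y
        w -= lr * 2.0 / n * (err @ xn)
        b -= lr * 2.0 / n * err.sum()
    # fold the min-max scaling back into the stored parameters
    return w / span, b - w * x.min() / span

# Toy usage: partition synthetic sorted keys, then fit one line per segment.
keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, 1_000))
for lo, hi in partition_by_cosine(keys):
    w, b = fit_linear_gd(keys[lo:hi])
    # global position of a key in this segment is approximately lo + w * key + b
```

The leaf layout below is likewise a hypothetical sketch of the idea of keeping the fitted parameters together with an overflow array for keys inserted after training; the class name, the error bound max_err, and the buffer policy are assumptions, not details taken from the thesis.

```python
import bisect

class LeafNode:
    """Hypothetical leaf layout: the fitted line (w, b) over the keys present
    at build time, plus a sorted overflow array for keys inserted later."""
    def __init__(self, keys, w, b, max_err=16):
        self.keys = list(keys)        # keys covered by the fitted model
        self.w, self.b = w, b         # model parameters over local positions
        self.max_err = max_err        # assumed bound on the prediction error
        self.inserted = []            # overflow array for newly inserted keys

    def insert(self, key):
        bisect.insort(self.inserted, key)   # keep the overflow array sorted

    def contains(self, key):
        # Predict the local position, then search only a small window around
        # the prediction instead of scanning the whole node.
        guess = int(round(self.w * key + self.b))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < hi and self.keys[i] == key:
            return True
        # Fall back to the overflow array for recently inserted keys.
        j = bisect.bisect_left(self.inserted, key)
        return j < len(self.inserted) and self.inserted[j] == key
```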
Keywords/Search Tags: Learned index, Gradient descent, Curve fitting model, Linked list, Data insertion