
Research On Hadoop Cluster Optimization In Large Scale Network Data Environment

Posted on: 2019-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: C H Xin
Full Text: PDF
GTID: 2348330545481092
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the arrival of the big data era, large volumes of data are generated every day. As the number of domestic Internet users and the network coverage rate continue to rise, this data will keep growing exponentially. In this age of information explosion, storing and managing massive data effectively is a hard technical challenge. HDFS is a highly fault-tolerant distributed file system that provides high-throughput data access and is well suited to storing large data sets. In practice, however, HDFS has shortcomings in file retrieval, file management, small-file storage, data distribution, and security.

This thesis first introduces the basic architecture of the distributed file system, then designs an HDFS cluster information acquisition system and an HDFS retrieval and management system to help cluster users and administrators control and manage cluster files. The information acquisition system reveals that the cluster contains too many small files. Theoretical analysis and experiments both show that small files seriously degrade cluster performance. To address this, an HDFS defragmentation system is designed that helps cluster administrators quickly find and eliminate concentrations of small files. The information acquisition system also reveals that data is distributed unevenly across the cluster. Theoretical analysis indicates that uneven data distribution prevents MapReduce programs from taking full advantage of local computation and causes network congestion under concurrent file access; experiments verify both conclusions. To reduce the imbalance, a data equalization strategy based on the file dimension is proposed, and its effectiveness is tested against the default strategy.

In a real production environment, using the cluster efficiently requires managing user permissions sensibly so that the cluster stays safe and stable. We therefore design a general-purpose permission management system that protects the security of cluster files through both the command line and the Web interface.
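
The small-file claim above can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the thesis: it assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one object per file, one per block) and a 128 MiB block size, and the function names are hypothetical. It compares 1 TiB stored as 1 MiB files with the same data merged into 1 GiB files.

```python
import math

# Assumption (not from the thesis): about 150 bytes of NameNode heap per
# namespace object, a widely quoted rule of thumb rather than an exact figure.
BYTES_PER_OBJECT = 150
# Assumption: 128 MiB HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024

MIB = 1024 * 1024
GIB = 1024 * MIB


def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)
    return len(file_sizes) + blocks


def namenode_bytes(file_sizes):
    """Estimate NameNode heap used by the metadata for these files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT


# 1 TiB of data stored as 1 MiB files versus merged into 1 GiB files.
small = [MIB] * (1024 * 1024)
merged = [GIB] * 1024

print("small files :", namenode_objects(small), "objects,",
      namenode_bytes(small), "bytes")
print("merged files:", namenode_objects(merged), "objects,",
      namenode_bytes(merged), "bytes")
```

Under these assumptions the same terabyte of data needs about two million namespace objects as small files but fewer than ten thousand after merging, which gives a sense of why many small files put pressure on the NameNode. The estimate covers only metadata; it says nothing about the MapReduce task overhead that small files also introduce.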
Keywords/Search Tags: distributed file system, HDFS small file, distributed data equalization, cluster security