| With the rapid development of the Internet and the arrival of the big data era, massive amounts of data are being generated online. Top-k join queries are now widely used in fields such as e-commerce and Internet services because they support tasks such as forecasting business trends, understanding customer needs, and evaluating products. MapReduce, a distributed processing framework, is widely used for data processing because of its reliability, scalability, efficiency, and fault tolerance. This paper studies the processing of top-k join queries in the MapReduce environment. Managing large data sets effectively makes it possible to obtain valuable information quickly. First, for top-k join queries over massive data, the author proposes a top-k join query method based on MapReduce. In the Map phase, a randomized algorithm balances the partitions so that every Reduce task handles a similar or identical amount of data; this avoids data skew and keeps the running times of the Reduce tasks roughly equal. Then, in the Reduce phase, the author builds a new table that combines the join key with the two indexes, sorts it by join key, and scans it in order to perform a preliminary join while updating the index segmentation information table in real time. By scanning the index segmentation information table, the author determines the threshold value, finds the joined index entries that contain the k highest scores, reads the corresponding tuples from the two tables, and joins them. The method does not join all tuples when computing the top-k join: it filters out many tuples using the threshold value and joins only the tuples that may appear in the final result, which saves a great deal of time. Second, because different users have different preferences in a query, the author presents a preference-based top-k join query processing method. 
Based on how users define their preferences, the author finds that the skyline technique handles user preferences well. First, the author joins the two tables in a preprocessing step, applies the skyline to the users' preferences, and filters out the tuples that do not satisfy them. The required top-k join results are then obtained with a scoring function. Finally, to handle user preferences well, this paper proposes a preference processing algorithm based on the skyline. The author first extracts the users' preference dimensions from the join results and partitions the data space into blocks. By determining the dominance relationships between these blocks, the tuples in dominated blocks are filtered out. Then, a skyline algorithm filters the dominated tuples within each remaining block and computes the virtual minimum point of each block. The author compares the data in one block with the virtual minimum points of the other remaining blocks to decide whether that data needs to be compared with the data in those blocks. This replaces tuple-to-tuple comparisons with block-to-block and tuple-to-block comparisons, and dominated blocks and tuples are filtered out during the comparison. In this way, the author reduces the data scale, shortens the execution time, and improves the operating efficiency of the system. Furthermore, the author conducts extensive experiments to verify the feasibility and scalability of the methods proposed in this paper. The experimental results show that the MapReduce-based top-k join query method handles top-k join queries over large-scale data well, and that the preference-based top-k join query method satisfies user preferences and solves some practical problems. |
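
The threshold-based filtering described in the abstract can be illustrated with a small single-machine sketch. This is not the author's MapReduce algorithm or the index segmentation information table; it swaps in a classic rank-join style bound to show why a threshold lets the join stop early. The assumptions here are that each tuple is a `(join_key, score)` pair, that the combined score of a joined pair is the sum of the two scores, and that higher is better. The names `topk_join` and `_offer` are hypothetical.

```python
import heapq
from collections import defaultdict


def _offer(heap, item, k):
    """Keep only the k largest items seen so far in a min-heap."""
    if len(heap) < k:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        heapq.heapreplace(heap, item)


def topk_join(r, s, k):
    """Return the k best joined results as (total, key, r_score, s_score).

    Assumption: combined score = r_score + s_score (illustrative only).
    """
    if k <= 0 or not r or not s:
        return []
    # Read both inputs in descending score order.
    r = sorted(r, key=lambda t: t[1], reverse=True)
    s = sorted(s, key=lambda t: t[1], reverse=True)
    top_r, top_s = r[0][1], s[0][1]
    seen_r, seen_s = defaultdict(list), defaultdict(list)
    heap = []  # min-heap holding the best k joined results so far
    i = j = 0
    while i < len(r) or j < len(s):
        # Alternate between the two inputs while both have tuples left.
        if j >= len(s) or (i < len(r) and i <= j):
            key, score = r[i]
            i += 1
            seen_r[key].append(score)
            for other in seen_s.get(key, ()):
                _offer(heap, (score + other, key, score, other), k)
        else:
            key, score = s[j]
            j += 1
            seen_s[key].append(score)
            for other in seen_r.get(key, ()):
                _offer(heap, (other + score, key, other, score), k)
        # Upper bound on any joined result that uses a not-yet-read tuple.
        last_r = r[i - 1][1] if i else top_r
        last_s = s[j - 1][1] if j else top_s
        threshold = max(top_r + last_s, last_r + top_s)
        # Stop once the k-th best result cannot be beaten by unread tuples.
        if len(heap) == k and heap[0][0] >= threshold:
            break
    return sorted(heap, reverse=True)


r = [("a", 9), ("b", 8), ("c", 2), ("a", 1)]
s = [("b", 7), ("a", 6), ("c", 5), ("d", 10)]
print(topk_join(r, s, 2))  # [(15, 'b', 8, 7), (15, 'a', 9, 6)]
```

In this example the loop stops before reading the two lowest-scoring tuples, which mirrors the abstract's point that only tuples that may appear in the final result need to be joined.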
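
The block-based skyline filtering can be sketched in the same spirit. The code below is a minimal illustration under stated assumptions, not the author's implementation: smaller values are preferred in every preference dimension, the data space is split into a uniform grid of width `cell_width`, and the names `dominates`, `local_skyline`, `block_skyline`, and `top_k_by_score` are hypothetical. It shows the three steps named in the abstract: dropping dominated blocks, computing a local skyline and a virtual minimum point per block, and using virtual minimum points to skip tuple-to-block comparisons.

```python
import heapq
from collections import defaultdict


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def local_skyline(points):
    """Naive skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def block_skyline(points, cell_width):
    # 1. Partition the data space into grid blocks (half-open cells).
    blocks = defaultdict(list)
    for p in points:
        blocks[tuple(int(v // cell_width) for v in p)].append(p)
    cells = list(blocks)
    # 2. Block-to-block filtering: a non-empty block with a strictly smaller
    #    index in every dimension dominates every tuple of the other block.
    kept = [c for c in cells
            if not any(all(o < x for o, x in zip(other, c)) for other in cells)]
    # 3. Local skyline and virtual minimum point of each remaining block.
    local = {c: local_skyline(blocks[c]) for c in kept}
    vmin = {c: tuple(min(col) for col in zip(*local[c])) for c in kept}
    # 4. Tuple-to-block filtering: if a block's virtual minimum point is
    #    larger than p in some dimension, nothing in that block dominates p.
    result = []
    for c in kept:
        for p in local[c]:
            dominated = False
            for other in kept:
                if other == c or any(m > v for m, v in zip(vmin[other], p)):
                    continue
                if any(dominates(q, p) for q in local[other]):
                    dominated = True
                    break
            if not dominated:
                result.append(p)
    return result


def top_k_by_score(points, k, score):
    """Rank the surviving tuples with a scoring function (lower is better)."""
    return heapq.nsmallest(k, points, key=score)


points = [(1, 9), (2, 2), (9, 1), (5, 5), (8, 8), (12, 12)]
skyline = block_skyline(points, cell_width=4)
print(sorted(skyline))                  # [(1, 9), (2, 2), (9, 1)]
print(top_k_by_score(skyline, 1, sum))  # [(2, 2)]
```

The virtual minimum point need not be a real tuple; it only serves as a cheap lower bound that rules out whole blocks before any tuple-level comparison is made.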