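Abstract In scientific research (astronomy, biology, high-energy physics, and so on), computer simulation, Internet applications, e-commerce, and other fields, data volumes are growing rapidly. Storing and processing large or very large data sets has become a new challenge for many industries, and how to extract valuable, understandable knowledge from massive data faster, more efficiently, and at lower cost, so as to support business decision-making, has become a new problem for data mining technology. Building on the Hadoop distributed computing platform, this thesis presents a parallel mining algorithm suited to large data sets; unlike other algorithms, it handles data at large scale with high efficiency. The algorithm analyzes raw, unstructured large data sets and looks for regularities in large volumes of data, so that people can find useful information in a huge information repository.
Keywords Data mining; Large dataset; Parallel algorithm; Hadoop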
Graduation Project Report (Graduation Thesis): Foreign-Language Abstract
Title Research on Parallel Mining of Large Datasets Based on Hadoop
Abstract
In scientific research (astronomy, biology, high-energy physics, etc.), computer simulation, Internet applications, e-commerce, and other fields, data volumes are growing rapidly. The storage and processing of large and very large data sets has become a new challenge for many industries. How to extract valuable, understandable knowledge from massive data in a faster, more efficient, and more cost-effective way, and thereby help businesses make decisions, has become a new problem for data mining technology. Based on the Hadoop distributed computing platform, this thesis presents a parallel algorithm for mining large data sets; unlike other algorithms, it processes data at large scale with high efficiency. The algorithm analyzes raw, unstructured large data sets to find regularities in large amounts of data, so that people can find useful information in a huge repository.
Keywords Data mining; Large dataset; Parallel algorithm; Hadoop
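Contents
1 Introduction
1.1 Background and Significance of the Topic
1.2 Research Status
1.3 Main Work of the Thesis
2 Overview of Hadoop and Data Mining
2.1 Introduction to Hadoop
2.2 Overview of Data Mining
3 Setting Up a Single-Node Hadoop Environment
3.1 Hardware Description
3.2 Software Description
3.3 Setup Procedure for the Single-Node Hadoop Environment
4 Design and Implementation of a Hadoop-Based Second-Degree Connection Mining Algorithm
4.1 Application Background and Significance of Hadoop-Based Second-Degree Connection Mining
4.2 Design of the Hadoop-Based Second-Degree Connection Mining Algorithm
5 Design and Implementation of a Hadoop-Based Clustering Algorithm
5.1 Design of the Hadoop-Based Clustering Algorithm
5.2 Overview of the K-Means Algorithm
5.3 Design and Implementation of a Hadoop-Based Parallel K-Means Clustering Algorithm
Conclusion
Acknowledgments
References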