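Abstract: Massive web logs confront enterprises with challenges in big-data storage, processing, and mining. Traditional log mining tools can no longer meet the demands of analyzing web logs at this scale. This paper therefore proposes a web log mining system built on a cloud platform. By using the distributed framework Hadoop, the system can both store massive web logs and process them quickly enough for enterprise needs. Above all, it can extract more valuable information from massive web logs, enabling enterprises to adjust faster and become more competitive.
This paper proposes an automated web log mining system based on a cloud platform. The system is built on the Hadoop distributed framework: a data cleaning algorithm developed on the MapReduce parallel computing framework cleans the logs, Hadoop's HDFS provides distributed storage, and Hive performs the log mining. The mining results are migrated with Sqoop, HBase stores the log details, and MySQL serves as the presentation platform for displaying and querying the mining results.
Finally, the system is tested. Compared with traditional log mining, the Hadoop-based web log mining system offers parallel computing, full automation, high reliability, high scalability, high robustness, and scheduled mining, and it extracts the information an enterprise needs more quickly and accurately. Tests show that, compared with traditional single-machine mining, the proposed system improves mining speed by a factor of 3.6, and the mining rate can be raised further by enlarging the cluster. The system as a whole meets enterprise requirements for data security, stable transmission, batch processing, parallel computing, and automatic analysis.
Keywords: Hadoop; log mining; Hive; Sqoop; MapReduce; HDFS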
Web Log Mining And Research Based On The Cloud Platform
Abstract: Massive volumes of web logs pose challenges for data storage, data processing, and data mining. Traditional log mining tools can hardly meet the analysis needs of web logs at this scale. This paper therefore presents a web log mining system based on a cloud platform. Using the distributed framework Hadoop satisfies both the storage requirements of massive web logs and the enterprise demand for fast log processing. The system can extract more valuable information from massive web logs, allowing an enterprise to adjust more quickly and improve its competitiveness.
This paper puts forward an automated web log mining system based on a cloud platform. The system uses the Hadoop distributed framework as its foundation and develops a data cleaning algorithm on the MapReduce parallel computing framework to clean the logs. It uses Hadoop's HDFS for distributed storage and Hive for log mining. The mining results are migrated with Sqoop, HBase stores the log details, and MySQL serves as the presentation platform for displaying and querying the log mining results.
Finally, this paper tests the system. Compared with a traditional log mining system, the Hadoop-based web log mining system has the advantages of parallel computing, full automation, high reliability, high scalability, high robustness, and scheduled mining. It can extract the information an enterprise requires more quickly and accurately. Tests show that the mining speed of the proposed web log mining system is 3.6 times that of the traditional single-machine system. The mining speed can be improved further by increasing the size of the cluster. The whole log mining system meets enterprise requirements for data security, stable transmission, batch processing, parallel computing, and automatic analysis.
Keywords: Hadoop; log mining; Hive; Sqoop; MapReduce; HDFS
Contents
Abstract (Chinese)
Abstract (English)
Contents
1 Introduction
1.1 Research background
1.2 Research status in China and abroad
1.3 Research content and significance