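Abstract: In the era of big data, user-behavior research no longer relies on random sampling to represent the whole population; it has to work with the data of all users. This brings challenges in data storage, data processing, and data computation.
This thesis studies the storage and mining of user-behavior data on the Hadoop cloud platform. A distributed, highly reliable, and highly available data storage module is designed and implemented to address the difficulty of storing large volumes of data. A distributed parallel word segmentation algorithm based on MapReduce is proposed. It uses all compute nodes of the cluster to segment massive Chinese text, improves segmentation efficiency by more than three times compared with traditional Chinese word segmentation, and addresses the current difficulty of segmenting text at this scale. The Hadoop cloud platform is then applied to microblog (Weibo) user-behavior data: microblog posts from the Chongqing area are first segmented, and daily statistics for the words "感冒" (cold), "肺炎" (pneumonia), "发热" (fever), and "咳嗽" (cough) are mined for each district and county of Chongqing. This effectively addresses the sparse content, deeply hidden value, and mining difficulty of microblog data, and lets the relevant Chongqing departments monitor local health conditions and issue early warnings. A result display module built on MySQL + JDBC + HTTP + Ajax presents the microblog user-behavior analysis results along multiple dimensions.
Keywords: HDFS; Hadoop; MapReduce; user behavior analysis; microblog users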
Research on the Behavior of Microblog Users Based on Hadoop
Abstract: In the era of big data, the study of user behavior can no longer rely on random sampling to represent the whole population; it needs to examine the data of all users. This creates challenges for data storage, data processing, and data computation.
This thesis uses the Hadoop cloud platform to study the storage and mining of user-behavior data. It designs and implements a distributed, highly reliable, and highly available data storage module to solve the problem of storing large volumes of data. It proposes a distributed parallel word segmentation algorithm based on MapReduce that calls on all compute nodes of the cluster to segment massive Chinese text. Compared with traditional Chinese word segmentation, the algorithm improves segmentation efficiency by more than three times and makes segmentation of massive text practical. The thesis then applies the Hadoop cloud platform to microblog user-behavior data. Microblog posts from the Chongqing area are segmented first, and daily counts of the words "cold", "pneumonia", "fever", and "cough" are then mined for each district and county of Chongqing. This effectively addresses the sparse content, deeply hidden value, and mining difficulty of microblog data, and supports monitoring and early warning of local health conditions by the relevant Chongqing departments. Finally, a module for displaying the data mining results is designed on MySQL + JDBC + HTTP + Ajax to present the microblog user-behavior analysis results along multiple dimensions.
Keywords: user behavior analysis; HDFS; Hadoop; MapReduce; microblog users
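The per-day, per-district keyword statistics described in the abstract can be pictured as a map, shuffle, and reduce pipeline. The following is only a minimal single-process sketch in Python, not the thesis's Hadoop implementation: the function names (`map_post`, `shuffle`, `reduce_counts`, `run_job`) and the `(date, district, text)` record layout are assumptions made for illustration, and plain substring matching stands in for the real Chinese word segmenter.

```python
from collections import defaultdict

# Symptom keywords named in the abstract: cold, pneumonia, fever, cough.
KEYWORDS = ("感冒", "肺炎", "发热", "咳嗽")


def map_post(record):
    """Map step: emit ((date, district, keyword), count) pairs for one post.

    `record` is a hypothetical (date, district, text) tuple.
    """
    date, district, text = record
    for keyword in KEYWORDS:
        # Assumption: substring counting stands in for word segmentation.
        occurrences = text.count(keyword)
        if occurrences:
            yield (date, district, keyword), occurrences


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum all counts that share one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in one process over an iterable of records."""
    pairs = (pair for record in records for pair in map_post(record))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample posts, invented for this sketch only.
SAMPLE_RECORDS = [
    ("2015-01-01", "渝中区", "今天感冒了,一直咳嗽咳嗽"),
    ("2015-01-01", "渝中区", "感冒好难受"),
    ("2015-01-01", "江北区", "有点发热"),
]

if __name__ == "__main__":
    for key, total in sorted(run_job(SAMPLE_RECORDS).items()):
        print(key, total)
```

On a real cluster the map and reduce functions would run on separate nodes and the grouping would be done by the framework; the sketch keeps everything in one process so the data flow is easy to follow.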
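Contents
Abstract (Chinese)
Abstract (English)
Contents
1 Introduction
1.1 Research background
1.2.1 Research status of big data in China and abroad
1.2.2 Research status of user behavior analysis
1.3 Main work
1.4 Thesis organization
2 Research on the Hadoop big data technology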