


Time: 2021-02-09 17:45  Source: Graduation Thesis



Thesis keywords: HDFS; Hadoop; MapReduce; text classification; Chinese word segmentation

Research on a Massive Text Classification Algorithm Based on Hadoop

Abstract: In the era of big data, both online and offline data are growing exponentially, and most of this data consists of structured or semi-structured text documents. Effectively retrieving the data that users need from this mass of data, and improving the accuracy of user searches, has therefore become a major challenge. Retrieving text data depends on classifying it accurately and efficiently, so text classification is the main difficulty in text data processing. The purpose of this paper is therefore to study an efficient classification algorithm for massive text that runs on existing hardware.

This paper studies the storage and classification of massive text based on Hadoop. First, we design and implement a distributed, highly reliable, and highly available data storage module that solves the problem of storing massive text. Next, we propose a distributed, parallel Chinese word segmentation algorithm on MapReduce, built on an improved MapReduce InputFormat data-reading model that addresses Hadoop's low efficiency with small files. Compared with Chinese word segmentation under the default MapReduce configuration, it improves segmentation efficiency 52-fold and makes the segmentation of massive text practical. Finally, we implement a classification algorithm for massive web text on the MapReduce distributed computing framework and build a classification model. Experiments show that the proposed algorithm classifies previously unseen text with an accuracy of up to 97%.

Keywords: HDFS; Hadoop; MapReduce; Text categorization; Chinese word segmentation
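
The small-file problem mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the improved InputFormat works, so the code below is not the thesis's method. It only shows the general idea of grouping many small files into fewer, larger input splits so that fewer map tasks are started. The function name `pack_into_splits`, the file names, and the sizes are all hypothetical.

```python
from typing import List, Tuple


def pack_into_splits(files: List[Tuple[str, int]], max_split_bytes: int) -> List[List[str]]:
    """Greedily group (name, size) pairs into splits of at most max_split_bytes.

    A file larger than max_split_bytes is placed in a split of its own.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current split when adding this file would exceed the limit.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: ten small text files of 4,000 bytes each.
small_files = [(f"doc_{i}.txt", 4_000) for i in range(10)]
splits = pack_into_splits(small_files, max_split_bytes=16_000)
print(len(small_files), "files ->", len(splits), "splits")
```

With one map task per file, the ten files above would need ten tasks; grouped this way they need one task per split. Hadoop ships a comparable mechanism in `CombineFileInputFormat`, although the abstract does not say whether the thesis builds on it.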


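1 Introduction

1.1 Research background

1.2 Research status at home and abroad

1.2.1 Big data research status at home and abroad

1.2.2 Text classification research status

1.3 Main work

1.4 Thesis organization

2 Research on the Hadoop big data technology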
