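Abstract To detect similar files quickly and accurately in a big-data environment and thereby obtain electronic evidence, this thesis improves the traditional TF-IDF algorithm with semantic and positional information. First, word segmentation is applied to the text and stop words are removed. Next, word frequencies are counted, semantic analysis is performed, and synonyms are handled. When keyword weights are computed, a named-entity factor, a proper-noun factor, and a position factor are added as weights and combined with the TF-IDF algorithm to obtain the term vector of the text. Finally, text similarity is compared using the vector space model, and cosine distance is introduced to improve the cosine function. The method is evaluated on the Fudan University (Shanghai) corpus and compared with the traditional fingerprint algorithm and the traditional TF-IDF algorithm, using accuracy, recall, and macro-average as evaluation metrics. The experiments show that the improved TF-IDF algorithm increases both the speed and the accuracy of similarity detection.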
Keywords file forensics, word segmentation, semantic analysis, TF-IDF algorithm, vector space model
Graduation Project Report: English Abstract
Title Research and Implementation of File Forensics in a Big Data Environment
Abstract To detect similar files quickly and accurately in huge amounts of data and thereby obtain electronic evidence, this paper improves the traditional TF-IDF algorithm by combining it with semantic and location information. First, word segmentation is used to segment the texts; at the same time, stop words are removed and synonyms are handled. Word frequencies are then counted and word semantics obtained. When the keyword weights are calculated, a named-entity factor, a proper-noun factor, and a location factor are added. The IDF values of the texts are then computed and combined with these weights to give the overall improved TF-IDF weight, from which the text term vectors are obtained. Finally, the vector space model is used to compare the similarity of the texts, and cosine distance is added to improve the cosine function. The proposed approach is based on word-frequency statistics. It is compared with the traditional fingerprint algorithm and the traditional TF-IDF algorithm on the Fudan University corpus, using accuracy, recall, and macro-average as evaluation indices. Experiments show that the improved TF-IDF algorithm achieves higher speed and accuracy.
Keywords file forensics, word segmentation technology, semantic analysis, TF-IDF algorithm, vector space model
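The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. It assumes the texts are already segmented and stripped of stop words. The `boost` dictionary is a hypothetical stand-in for the named-entity, proper-noun, and location factors, whose actual values the abstract does not give. The smoothed IDF formula is likewise an assumption. The cosine-distance adjustment is not specified in the abstract, so plain cosine similarity is used.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, boost=None):
    """Build one sparse weight vector (a dict) per document.

    docs  : list of token lists (already segmented, stop words removed).
    boost : hypothetical term -> multiplier map standing in for the
            named-entity / proper-noun / location factors.
    """
    boost = boost or {}
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vec = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (an assumption) so shared terms keep a nonzero weight.
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            vec[term] = tf * idf * boost.get(term, 1.0)
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["police", "seized", "laptop", "evidence"],
    ["police", "seized", "phone", "evidence"],
    ["weather", "sunny", "today"],
]
vecs = tf_idf_vectors(docs, boost={"laptop": 1.5})
print(cosine_similarity(vecs[0], vecs[1]))  # overlapping vocabulary
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms
```

Storing each vector as a dictionary keeps only the nonzero terms, which is convenient when the vocabulary is large and each document uses a small part of it.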
Contents
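1 Introduction
1.1 Significance of this research
1.3 Main work of this thesis
1.4 Organization of this thesis
2 Theoretical background
2.1 Overview of text representation models
2.2 Overview of text word segmentation
2.3 Text feature weight calculation
2.4 Text similarity calculation techniques
3 Improved TF-IDF algorithm
3.1 The traditional TF-IDF algorithm
3.2 Shortcomings of the traditional TF-IDF algorithm in file similarity comparison
3.3 Combining the TF-IDF algorithm with term semantics and position
4 Design and implementation of the file forensics system
4.1 System design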