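Abstract
A clustering algorithm groups the points of a data set that differ little from one another into a number of clusters, each of which reflects the characteristics of its members; clustering is a key step in data mining. Graph-based methods represent the data as nodes and the proximity between data points as the weights of the edges between the corresponding nodes, so that important properties of graphs (such as the sparsified proximity graph) can be applied to cluster analysis and improve the efficiency of clustering algorithms. This thesis uses the Chameleon algorithm to analyze concretely how graph methods are applied in clustering, and uses an experimental comparison with the K-means algorithm to show that graph-based clustering can discover clusters of arbitrary shape and size. It also summarizes the shortcomings of the Chameleon algorithm observed during the experiments.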
Keywords: clustering; graph; Chameleon; K-means; sparsification
Graduation Design Report (Thesis): Foreign-Language Abstract
Title: Research on Graph-Based Clustering Algorithms
Abstract
A clustering algorithm groups the points of a data set that differ little from one another into a number of clusters, each of which reflects the characteristics of its members. Clustering is a key step in data mining. Graph-based methods represent the data as nodes and the proximity between data points as the weights of the edges between the corresponding nodes. Important properties of graphs (such as the sparsified proximity graph) can then be applied to cluster analysis, which improves the efficiency of clustering algorithms. In this thesis, I use the Chameleon algorithm as a concrete example to analyze how graph methods are applied in clustering. I also compare it experimentally with the K-means algorithm to show that graph-based clustering can find clusters of arbitrary shape and size. Finally, I summarize the shortcomings of the Chameleon algorithm that appeared during the experiments.
Keywords: clustering; graph; Chameleon; K-means; sparsification
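Contents
1 Introduction
1.1 Research Background
1.1.1 The Content of Data Mining
1.1.2 The Significance of Data Mining
1.2 Current State of Clustering Research
1.2.1 Traditional Clustering Algorithms
1.2.2 Measures of Inter-Cluster Distance
1.2.3 Graph-Based Clustering Algorithms
2 Algorithm Overview
2.1 Chameleon Algorithm
2.1.1 Relative Interconnectivity (RI)
2.1.2 Relative Closeness (RC)
2.1.3 Key Steps of the Chameleon Algorithm
2.2 K-means Algorithm
2.2.1 Overview of the Algorithm
2.2.2 Advantages and Disadvantages
3 Experimental Analysis
3.1 Chameleon Algorithm Experiment
3.1.1 Constructing the Sparse Matrix
3.1.2 Graph Partitioning
3.1.3 Deciding Which Clusters to Merge
3.1.4 Analysis of Experimental Results
3.2 K-means Algorithm Experiment
3.2.1 Analysis of Experimental Results
3.3 Validation Experiment
3.3.1 Description of the Experiment
3.3.2 Analysis of Experimental Results
3.4 Comparison Experiment
3.4.1 Description of the Experiment
3.4.2 Experimental Results and Analysis
Conclusion and Outlook
Acknowledgements
References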
1 Introduction
1.1 Research Background
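With the explosive growth in the volume of information [ ], manual statistics and evaluation become very slow when we face massive data, and they rely largely on experience. In particular, when simple statistics can no longer satisfy diverse analytical needs, we need new methods to handle these problems. The following view was put forward long ago: "Humanity is drowning in information, yet thirsting for knowledge." [ ][ ] The introduction of the concept of data mining [ ][ ] gave us new methods for analyzing data (classification, clustering, association rules, and so on), each of which has several branches. These algorithms are now widely used across industries.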