Abstract: Cluster analysis is an important step in data mining, and a diverse set of clustering methods can both speed up data mining and improve the quality of its results. This thesis introduces the main categories of cluster analysis and their algorithms: hierarchical clustering, K-means clustering, fuzzy clustering, ordered-sample clustering, and two K-medoids methods, PAM and CLARA. Each algorithm is implemented in the open-source software R 3.3.3. The focus is on clustering mixed-type data: how to compute a combined distance for mixed data, how to determine the number of clusters, and how to choose a clustering method and implement it in R. First, the Gower method is used to compute the distances between mixed-type observations. Second, the best number of clusters is chosen according to the value of the silhouette coefficient. Third, the PAM and CLARA algorithms are applied to the mixed data and their results are compared. Finally, the Byar prostate cancer data set is used for an empirical analysis. The empirical analysis shows that both methods cluster the mixed data reasonably well, but their results on the Byar data set differ to some extent. The causes of these differences are analyzed, which provides some basis for further research.
Keywords: Cluster Analysis, Mixed Data, PAM Algorithm, CLARA Algorithm, R
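The pipeline summarized in the abstract (Gower distance, then silhouette-based choice of the number of clusters, then K-medoids clustering) can be illustrated with a minimal sketch. The thesis itself works in R 3.3.3; the version below is an assumption-laden stand-in written in plain Python so that it is self-contained. The six toy records, the column layout, and all function names (`gower_matrix`, `pam`, `avg_silhouette`) are hypothetical and are not taken from the Byar data set or from the thesis. The `pam` function is a simplified greedy-build-plus-swap variant rather than the exact routine of any library, and CLARA (which applies PAM to subsamples) is not shown.

```python
from itertools import combinations

def gower_matrix(rows, is_numeric):
    """Pairwise Gower dissimilarities for records with numeric and categorical fields."""
    n, p = len(rows), len(is_numeric)
    # Range of each numeric column, used to scale absolute differences into [0, 1].
    ranges = [max(r[j] for r in rows) - min(r[j] for r in rows) if is_numeric[j] else None
              for j in range(p)]
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        total = 0.0
        for j in range(p):
            if is_numeric[j]:
                total += abs(rows[a][j] - rows[b][j]) / ranges[j] if ranges[j] else 0.0
            else:
                total += 0.0 if rows[a][j] == rows[b][j] else 1.0  # simple matching
        d[a][b] = d[b][a] = total / p  # equal weight for every variable
    return d

def total_cost(d, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(d[i][m] for m in medoids) for i in range(len(d)))

def assign(d, medoids):
    """Label each point with the position (in medoids) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: d[i][medoids[c]]) for i in range(len(d))]

def pam(d, k):
    """Simplified PAM: greedy build phase, then swaps until no swap lowers the cost."""
    n, medoids = len(d), []
    while len(medoids) < k:
        medoids.append(min((i for i in range(n) if i not in medoids),
                           key=lambda i: total_cost(d, medoids + [i])))
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(d, trial) < total_cost(d, medoids) - 1e-12:
                    medoids, improved = trial, True
    return medoids, assign(d, medoids)

def avg_silhouette(d, labels):
    """Average silhouette width; points in singleton clusters get width 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)
            continue
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(d[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(labels)

# Hypothetical toy records: (age, stage, dose level); not the Byar data.
rows = [(61, "stage3", "low"), (63, "stage3", "low"), (65, "stage3", "low"),
        (78, "stage4", "high"), (80, "stage4", "high"), (82, "stage4", "high")]
is_numeric = [True, False, False]

d = gower_matrix(rows, is_numeric)
scores = {k: avg_silhouette(d, pam(d, k)[1]) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)   # number of clusters with the largest average width
medoids, labels = pam(d, best_k)
print(best_k, medoids, labels)
```

In R, the `cluster` package provides `daisy` (with a Gower metric), `pam`, and `clara`, which would play the roles of these hand-written helpers.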
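Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Research Background
1.1.2 Research Significance
1.2 Clustering Methods for Mixed Data and the State of Research
1.2.1 Clustering Methods for Mixed Data
1.2.2 State of Research on the K-medoids Algorithm
1.2.3 Open Problems of the K-medoids Algorithm
1.3 Main Research Content and Structure of This Thesis
Chapter 2 Categories and Algorithms of Cluster Analysis
2.1 Concepts, Data Types, and Clustering Statistics in Cluster Analysis
2.1.1 The Concept of Cluster Analysis
2.1.2 Dissimilarity Measures in Cluster Analysis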