A document clustering algorithm based on K-means
Abstract: Since the 1950s, many kinds of clustering algorithms have been proposed. They can be roughly divided into two categories: partition-based and hierarchy-based. Among the partition-based clustering algorithms, the most famous is the K-Means type. Since MacQueen first published it in 1967, K-Means has become one of the prevalent clustering algorithms in mathematical statistics, pattern recognition, machine learning, and data mining, and it has given rise to many derivative algorithms that together form the K-Means algorithm family.
However, the K-Means algorithm is very sensitive to its initial conditions, and the traditional K-Means algorithm and its variants often produce unstable results. This paper improves how the initial center points of the K-Means algorithm are chosen. It sorts the points by density, self-adaptively selecting an optimized density radius to determine the maximum point density. It then selects points that have high density and are otherwise reasonable choices as the initial center points, which optimizes the choice of centers and gives the K-Means algorithm a good start. The experimental results show that the optimized algorithm produces high-quality and stable clustering results.
Key Words: Text Clustering, K-Means, density, radius
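The density-based choice of initial centers described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: the function name `density_initial_centers` is hypothetical, the density radius is passed in as a fixed `radius` rather than selected self-adaptively (the abstract does not say how that selection works), a point's density is taken to be the number of other points within `radius`, and a "reasonable" candidate is interpreted as one lying farther than `radius` from every center already chosen.

```python
import math


def density_initial_centers(points, k, radius):
    """Pick up to k initial centers from the densest, mutually separated points.

    Assumptions (not taken from the paper): `radius` is fixed by the caller,
    density is a neighbor count, and candidates closer than `radius` to an
    already chosen center are skipped.
    """
    # Density of a point = number of other points within `radius` of it.
    density = [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]

    # Visit points from densest to sparsest.
    order = sorted(range(len(points)), key=lambda i: density[i], reverse=True)

    centers = []
    for i in order:
        candidate = points[i]
        # Skip candidates that fall inside the radius of a chosen center.
        if all(math.dist(candidate, c) > radius for c in centers):
            centers.append(candidate)
        if len(centers) == k:
            break
    return centers


if __name__ == "__main__":
    # Two well-separated groups of points in the plane.
    sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
              (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
    print(density_initial_centers(sample, k=2, radius=1.0))
```

The returned points would then be passed to an ordinary K-Means run as its starting centers in place of random initialization. If fewer than `k` sufficiently separated points exist, the sketch returns fewer than `k` centers, a case the caller would need to handle.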
1 Introduction
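1.1 Research Background
Data mining (DM), also known as knowledge discovery in databases (KDD), is a new database technology that has developed in recent years alongside databases and artificial intelligence. It is an emerging field with broad prospects, formed by the intersection and fusion of many disciplines, such as artificial intelligence, machine learning, pattern recognition, statistics, databases and knowledge bases, and data visualization. It operates on large volumes of everyday business data, and its goal is to extract implicit, previously unknown, yet potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random raw data. The objects of data mining are not limited to data records in databases; it can also be applied to many other kinds of data, such as spatial data, audio, video, data streams, and text.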