Belief Density Peak Clustering Algorithm for Uncertain Data

Abstract: The density peak clustering algorithm is simple and efficient, requires no iterative computation, and does not require the number of clusters to be set in advance. However, it is prone to a "domino" effect when assigning non-center samples, and it cannot accurately partition samples in overlapping regions or noise. To address these problems, a belief density peak clustering algorithm for uncertain data is proposed. First, building on the cluster centers obtained by the density peak clustering algorithm, the proposed method uses the K nearest neighbors of each non-center sample to compute the sample's degree of belief in each cluster and assigns the sample to the cluster with the largest belief, yielding a preliminary K-nearest-neighbor clustering result. Then, a density threshold is obtained as an upper quantile of the densities, and a credal partition is performed within the framework of evidential reasoning: isolated samples whose density falls below the threshold are assigned to the noise cluster; samples in overlapping regions are assigned to composite clusters composed of the related single clusters; and samples whose belief strongly supports a single cluster are assigned to the corresponding single cluster. By introducing composite and noise clusters, the algorithm more accurately represents the uncertainty of samples under the available attribute information. Experimental results show that the algorithm achieves better clustering performance than the compared algorithms on both synthetic and UCI datasets.
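The pipeline described in the abstract (density-peak center selection, K-NN belief assignment, then a credal partition driven by an upper-quantile density threshold) can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact method: the density formula, belief measure (fraction of K nearest neighbors per cluster), quantile levels, and the `strong` belief cutoff are all assumptions made for the sketch.

```python
import numpy as np

def belief_dpc(X, k=5, n_clusters=2, noise_q=0.1, strong=0.8):
    """Illustrative belief-style density peak clustering (hypothetical sketch).

    Returns one label per sample: 0..n_clusters-1 for single clusters,
    -1 for the noise cluster, -2 for a composite (overlap) cluster.
    """
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dc = np.quantile(d[d > 0], 0.1)                  # cutoff distance (heuristic)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0   # Gaussian local density
    order = np.argsort(-rho)                         # indices by descending density
    delta = np.full(n, d.max())
    nn_higher = np.full(n, -1)
    for i, p in enumerate(order[1:], 1):
        j = order[:i][np.argmin(d[p, order[:i]])]    # nearest higher-density point
        delta[p], nn_higher[p] = d[p, j], j
    # DPC centers: largest rho * delta products
    centers = np.argsort(-(rho * delta))[:n_clusters]

    # Preliminary assignment: follow the nearest higher-density neighbor chain
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for p in order:
        if labels[p] < 0 and nn_higher[p] >= 0:
            labels[p] = labels[nn_higher[p]]

    # K-NN belief: fraction of the k nearest neighbors in each cluster
    knn = np.argsort(d, axis=1)[:, 1:k + 1]
    belief = np.stack([(labels[knn] == c).mean(axis=1)
                       for c in range(n_clusters)], axis=1)

    # Credal partition: noise below the density threshold, composite when
    # no single cluster is strongly supported, single cluster otherwise
    final = belief.argmax(axis=1)
    final[belief.max(axis=1) < strong] = -2          # composite (overlap) cluster
    final[rho < np.quantile(rho, noise_q)] = -1      # noise cluster
    return final
```

On two well-separated Gaussian blobs, the sketch assigns each blob to its own single cluster and flags the lowest-density samples as noise; composite labels only appear when neighborhoods mix clusters.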

     
