Incomplete Data Clustering Algorithm Based on Mutual Information Attribute Ranking

Abstract: Missing data poses a challenge for clustering algorithms. Traditional methods typically fill in incomplete data by mean or regression imputation and then cluster the filled data. To address the loss of filling accuracy and clustering quality that mean and regression imputation suffer as the missing-data ratio increases, we propose a new method for calculating similarity on incomplete data. Attributes in the dataset are ranked according to expected mutual information, which takes full account of the position-dependent attribute-value features of the dataset. Elements of the dataset itself serve as the source for filling missing values, and similarity-based filling is computed on the ranked incomplete dataset. Finally, a clustering algorithm based on local density performs the clustering. The proposed filling and clustering algorithm is validated on datasets from the UCI Machine Learning Repository. Experimental results show that, as the number of missing values in a dataset increases, the algorithm tolerates missing values well, recovers missing elements effectively, and performs well in both filling accuracy and final clustering results. Because the similarity-based filling method considers every attribute value in the dataset and fills missing values one at a time, it is relatively time-consuming.
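
The attribute-ranking step can be illustrated with a minimal sketch. It is not the paper's implementation: it substitutes plain empirical mutual information for the expected mutual information named in the abstract, assumes attribute values are already discrete, marks missing entries with `None`, and scores each attribute by its mean mutual information with the other attributes. The function names `mutual_information` and `rank_attributes` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences.

    Only positions where both values are observed (not None) are used,
    which is an assumption about how missing entries are handled.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def rank_attributes(rows):
    """Return (order, scores): column indices sorted by decreasing score.

    The score of a column is its mean mutual information with every other
    column (an assumed ranking criterion, chosen for illustration).
    """
    cols = list(zip(*rows))
    d = len(cols)
    scores = []
    for i in range(d):
        others = [mutual_information(cols[i], cols[j]) for j in range(d) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    order = sorted(range(d), key=lambda i: -scores[i])
    return order, scores


# Toy data: columns 0 and 1 agree wherever both are observed; column 2 is unrelated.
rows = [
    (0, 0, 0),
    (0, 0, 1),
    (1, 1, 0),
    (1, 1, 1),
    (0, None, 1),
    (1, 1, None),
]
order, scores = rank_attributes(rows)
print(order, [round(s, 3) for s in scores])
```

In this toy dataset the unrelated third column ranks last, so under this assumed criterion the strongly related attributes would be considered first when filling missing values.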

     
