Abstract:
Incomplete data pose a challenge for clustering algorithms. Traditional methods usually fill in the missing values using the mean or a regression model and then cluster the completed data. When the data loss ratio is high, however, mean filling and regression filling become inaccurate and the clustering results degrade. To address this problem, we propose a new method for calculating the similarity of incomplete data. We first sort the attributes of the dataset according to the expected mutual information. Taking the location-related attribute values in the dataset into account, we then use each data element itself as the source of its missing values and calculate the similarity on the sorted incomplete dataset. Finally, we cluster the data using an algorithm based on local density. The method is verified on data from the UCI machine learning database. The experimental results show that the algorithm is more tolerant of missing values and better at recovering missing elements, and that it achieves higher filling precision and better final clustering results as the amount of missing data increases. The proposed filling and similarity calculation is more time-consuming, however, because it considers every attribute value of the dataset in order to fill in missing values discretely.
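
The pipeline summarized above can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the procedure evaluated in the experiments. Several points are assumptions made only for illustration: attributes are ranked by their mean empirical mutual information with the other attributes, a missing value is filled from the nearest observed value in the same reordered record, the values are already discretized and on a comparable scale, and the similarity is a plain Euclidean distance. The function names (mutual_information, order_attributes, fill_from_row, distance) are hypothetical, and the density-based clustering step is omitted.

```python
from collections import Counter
from math import log, sqrt


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete columns.

    Pairs in which either value is missing (None) are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # sum over observed (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def order_attributes(rows):
    """Rank attribute indices by mean mutual information with the other
    attributes, highest first. The ranking criterion is an assumption."""
    cols = list(zip(*rows))
    m = len(cols)

    def score(j):
        others = [k for k in range(m) if k != j]
        return sum(mutual_information(cols[j], cols[k]) for k in others) / max(len(others), 1)

    return sorted(range(m), key=score, reverse=True)


def fill_from_row(row, order):
    """Reorder a record and fill each gap with the nearest observed value
    of the same record (ties go to the earlier position)."""
    ordered = [row[j] for j in order]
    observed = [i for i, v in enumerate(ordered) if v is not None]
    if not observed:
        raise ValueError("record has no observed values to fill from")
    return [
        v if v is not None else ordered[min(observed, key=lambda o: abs(o - i))]
        for i, v in enumerate(ordered)
    ]


def distance(a, b, order):
    """Euclidean distance between two records after self-filling."""
    fa, fb = fill_from_row(a, order), fill_from_row(b, order)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


# Toy data: None marks a missing value.
rows = [[1, 1, 2], [1, None, 2], [3, 3, 1], [3, 3, None]]
order = order_attributes(rows)
pairwise = [[distance(a, b, order) for b in rows] for a in rows]
```

The resulting pairwise matrix is the kind of input a local-density clustering step would consume. The nested loops over attributes and records also show why self-filling costs more than a single mean or regression fill.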