Abstract:
The original suffix tree clustering (STC) algorithm cannot effectively handle nodes whose document sets differ greatly in size but stand in an inclusion relation, nor nodes that are similar in text but different in topic, and it lacks an effective algorithm for class label extraction. To solve these problems, an improved similarity formula for base cluster merging is presented, based on both topic similarity and the similarity of the included texts, and a class label extraction algorithm based on information gain is proposed. To improve clustering efficiency, a simple but reasonable measure for base cluster selection is presented, which excludes generalized suffix tree nodes that contribute little to the clustering. Experimental results show that the presented clustering algorithm effectively increases the precision of text clustering and labels the clustering results effectively.
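
The inclusion problem mentioned above can be illustrated with a minimal sketch of the merging rule of the original STC algorithm as it is commonly described: two base clusters are merged only when their shared documents make up more than half of each cluster. This is not the improved formula proposed in the paper. The function names, the example document sets, and the use of Python are assumptions made purely for illustration.

```python
def overlap_ratios(docs_a, docs_b):
    """Return the share of each base cluster covered by the common documents."""
    shared = len(docs_a & docs_b)
    return shared / len(docs_a), shared / len(docs_b)


def original_stc_similar(docs_a, docs_b, threshold=0.5):
    """Original STC rule (as commonly described): merge only if the overlap
    exceeds the threshold relative to BOTH base clusters."""
    ratio_a, ratio_b = overlap_ratios(docs_a, docs_b)
    return ratio_a > threshold and ratio_b > threshold


# Hypothetical base clusters, given as sets of document ids.
small = {1, 2}                # fully contained in `large`
large = set(range(1, 11))     # ten documents, ids 1..10
near_a = {1, 2, 3}
near_b = {2, 3, 4}

print(original_stc_similar(small, large))    # inclusion, very different sizes
print(original_stc_similar(near_a, near_b))  # similar sizes, large overlap
```

In this sketch the small cluster is entirely included in the large one, yet the rule rejects the merge because the overlap covers only a fifth of the large cluster. The rule also looks only at document overlap, so it cannot tell apart clusters that share texts but differ in topic.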