Abstract:
For a feature and a document category arranged in a T-C (term-category) two-way fourfold contingency table, mutual independence is equivalent to mutual non-correlation. On this basis, this paper uses two novel hypothesis tests of independence to measure the degree of correlation between features and categories, and accordingly selects from the feature space of the text collection a feature subset that is highly representative of the document content, for use in text categorization. Experimental results show that applying hypothesis-test-based feature selection to text categorization improves categorization performance.
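As an illustration of the general idea only, the following minimal sketch scores a term against a category from its fourfold table and keeps the highest-scoring terms. It uses the classical Pearson chi-square statistic as a stand-in; it is not one of the two tests proposed in this paper, which the abstract does not specify. The cell names a, b, c, d, the function names, and the example counts are all assumptions made for the illustration.

```python
from typing import Dict, List, Tuple

# Assumed cell layout of the T-C fourfold table (illustrative naming):
#   a: documents in the category that contain the term
#   b: documents outside the category that contain the term
#   c: documents in the category that lack the term
#   d: documents outside the category that lack the term
Table = Tuple[int, int, int, int]


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (stand-in test)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero margin carries no evidence of association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(tables: Dict[str, Table], k: int) -> List[str]:
    """Return the k terms whose tables deviate most from independence."""
    ranked = sorted(tables, key=lambda t: chi_square_2x2(*tables[t]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical counts, not data from the paper.
    tables = {"goal": (30, 10, 10, 30), "the": (10, 10, 10, 10)}
    print(select_features(tables, 1))
```

A statistic near zero is consistent with independence of term and category, so such a term would be dropped; with one degree of freedom, values above 3.841 reject independence at the 0.05 level.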