A Continuous Feature Selection Method for Decision Information Systems

Abstract: In big data applications, reducing the feature set to lower the data dimensionality helps improve the generalization ability of data models. We combine random forest model selection with a similarity measure to perform an initial selection over the feature set, and then apply a forward search strategy, with distance as the evaluation criterion, to filter the initially selected set a second time and obtain the final feature subset. The algorithm uses local traversal to improve execution efficiency, and the forward selection step addresses the problem that traditional methods cannot determine the optimal number of features. Experimental results show that the proposed method selects feature subsets more effectively and improves classification accuracy.
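To make the second-stage filtering concrete, the following is a minimal sketch of a forward search driven by a distance-based score. The abstract does not specify the distance criterion, so the ratio used here (mean distance between class centroids divided by mean distance of samples to their own centroid) is an assumption, as are the function names class_separation and forward_select and the toy data. The random forest and similarity-based initial selection is not shown; the candidates argument stands in for its output.

```python
from math import sqrt


def class_separation(X, y, features):
    """Hypothetical distance score: between-class centroid distance divided by
    within-class scatter, computed on the given feature indices only."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append([row[j] for j in features])
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}

    def dist(a, b):
        return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    labels = list(centroids)
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    if not pairs:  # a single class gives no separation to measure
        return 0.0
    between = sum(dist(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)
    within = sum(dist(r, centroids[c])
                 for c, rows in groups.items() for r in rows) / len(X)
    return between / (within + 1e-12)  # small constant avoids division by zero


def forward_select(X, y, candidates):
    """Greedy forward search: add the feature that most improves the score and
    stop when no candidate improves it, so the subset size is not fixed in advance."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        score, feat = max((class_separation(X, y, selected + [f]), f)
                          for f in remaining)
        if score <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected


# Toy data (illustrative only): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.2, -5.0], [0.1, 4.0], [-0.1, -4.0],
     [10.0, 5.0], [10.2, -5.0], [9.9, 4.0], [10.1, -4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(forward_select(X, y, [0, 1]))
```

On this toy data the search keeps feature 0 and stops, because adding the noisy feature lowers the score. The stopping rule is what lets the number of selected features emerge from the search rather than being set beforehand.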