Abstract:
This paper proposes a data modeling method for combustion system equipment in thermal power plants, based on a distributed framework and an improved random forest algorithm. First, stepwise regression improved with a multicollinearity test is used to screen the optimal variables in the industrial process. The screened variable data are then loaded onto the Hadoop platform, where the traditional random forest algorithm is parallelized using the MapReduce and Spark distributed frameworks. The results show that the Hadoop-based distributed random forest algorithm effectively improves training efficiency and data processing speed. The resulting model has high accuracy and strong generalization ability, and is of practical value for industrial applications.
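
The parallelization idea can be illustrated with a minimal sketch. It is not the paper's Hadoop implementation: it runs in plain Python, assumes a regression target, and uses one-split regression trees in place of full decision trees. All function names (`fit_stump`, `map_train`, `reduce_forest`, `predict_forest`) and the toy data are hypothetical. Each data partition trains its own trees on bootstrap samples (the map step), and the trees are then merged into one forest whose predictions are averaged (the reduce step).

```python
import random
from statistics import mean

def fit_stump(rows, targets, rng):
    """Fit a one-split regression tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None  # (sse, threshold, left_mean, right_mean)
    for threshold in sorted({r[feature] for r in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lm, rm = mean(left), mean(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(targets)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def map_train(partition, n_trees, seed):
    """Map step: train n_trees on bootstrap samples of one data partition."""
    rng = random.Random(seed)
    rows, targets = partition
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], rng))
    return trees

def reduce_forest(tree_lists):
    """Reduce step: merge the trees from all partitions into one forest."""
    return [tree for trees in tree_lists for tree in trees]

def predict_forest(forest, row):
    """Average the predictions of all trees (regression assumed)."""
    return mean(predict_stump(tree, row) for tree in forest)

# Toy data (hypothetical): two features, target increases with both.
rows = [(float(i), 0.5 * i) for i in range(20)]
targets = [3.0 * i for i in range(20)]

# Split the data into two partitions, as a distributed store would.
partitions = [
    ([rows[i] for i in range(20) if i % 2 == k],
     [targets[i] for i in range(20) if i % 2 == k])
    for k in range(2)
]

# Each map call is independent, so the calls could run on separate workers.
forest = reduce_forest([map_train(p, n_trees=5, seed=k)
                        for k, p in enumerate(partitions)])
print(len(forest), predict_forest(forest, (0.0, 0.0)),
      predict_forest(forest, (19.0, 9.5)))
```

Because no map call depends on another, training time in this scheme scales with the number of workers. The same structure maps onto MapReduce jobs or Spark partitions.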