A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets

Yang Yang; Ding Jiaman; Li Haibin; Jia Lianyin; You Jinguo; Jiang Ying

doi:10.13976/j.cnki.xk.2019.8371


Abstract: Improving the performance of frequent pattern mining on massive uncertain datasets is an active research topic. Most traditional algorithms use a single measure, such as expectation, probability, or weight, as the support of an itemset, and algorithms that consider both probability and weight struggle to remain efficient at big-data scale. We therefore propose UWEFP, a Spark-based algorithm that mines frequent patterns from uncertain datasets according to expected weight. First, to account for both the probability and the weight of each item, UWEFP computes the maximum probability weight value of each 1-itemset and uses it for pruning. Second, to reduce the number of scans over the dataset, it builds and mines pattern trees with a novel UWEFP-tree structure, which has the characteristics of an FP-tree and takes advantage of the Spark framework. Finally, the algorithm is evaluated on UCI datasets in a Spark environment. The experimental results show that the proposed method improves efficiency while preserving the mining results.
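To make the idea of combining probability and weight concrete, the following is a minimal sketch in plain Python (not Spark) of pruning single items by an expected weighted support. The abstract does not define the maximum probability weight value used by UWEFP, so the measure here, the sum over transactions of the item's existence probability times its weight, is an assumed stand-in rather than the authors' definition. The function names, the sample data, and the threshold are all hypothetical.

```python
from collections import defaultdict


def expected_weighted_support(transactions, weights):
    """Assumed measure: sum of p(item, t) * w(item) over all transactions."""
    support = defaultdict(float)
    for transaction in transactions:
        # Each uncertain transaction maps an item to its existence probability.
        for item, prob in transaction.items():
            support[item] += prob * weights[item]
    return dict(support)


def prune_items(transactions, weights, min_support):
    """Keep only the single items whose assumed support reaches the threshold."""
    support = expected_weighted_support(transactions, weights)
    return {item: s for item, s in support.items() if s >= min_support}


# Hypothetical uncertain dataset: three transactions over items a, b, c.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.8, "c": 0.5},
    {"b": 0.5, "c": 0.25},
]
# Hypothetical item weights.
weights = {"a": 0.5, "b": 1.0, "c": 0.5}

frequent = prune_items(transactions, weights, min_support=0.5)
print(sorted(frequent))
```

With this sample data, items `a` and `b` survive and `c` is pruned. In a distributed setting the per-item sums could be computed with a map and reduce-by-key step, but how UWEFP partitions this work on Spark is not described in the abstract.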
