Oversampling Algorithms for Gaussian-type Data Based on Generative Adversarial Networks

    Abstract: Classifiers trained on imbalanced data tend to favor the majority class, which reduces classification effectiveness. To address this problem, we propose a Monte Carlo oversampling algorithm based on generative adversarial networks (GANs). First, we use a GAN to estimate the probability density function of the minority-class data and determine the oversampling weight of each minority-class sample from its probability density value. Second, to ensure the diversity of the generated data, we oversample the minority class with a Monte Carlo algorithm. At the same time, to avoid crossover and overlap with the majority class, we apply the 3σ rule of the Gaussian distribution: minority-class data that fall within the 3σ interval of the majority class are flipped, so that the dataset becomes balanced. Finally, we select seven datasets from the UCI and KEEL repositories for numerical experiments, using a decision tree as the base classifier. The experimental results show that the proposed algorithm is more effective than the comparison algorithms.
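    The pipeline described in the abstract (density-based weights, Monte Carlo sampling, 3σ flipping) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: a diagonal Gaussian fit stands in for the GAN density estimate, the sampling weights are taken proportional to the density values, new points are drawn by adding Gaussian noise to a sampled minority point, and "flipping" is interpreted as reflecting a point across the nearest face of the majority class's 3σ box. The names `gaussian_density`, `flip_outside_3sigma`, `oversample`, and `noise_scale` are hypothetical.

```python
import numpy as np

def gaussian_density(points, mean, std):
    """Diagonal-Gaussian density (assumed stand-in for the GAN estimate)."""
    z = (points - mean) / std
    log_pdf = -0.5 * np.sum(z ** 2, axis=1) - np.sum(np.log(std * np.sqrt(2 * np.pi)))
    return np.exp(log_pdf)

def flip_outside_3sigma(points, maj_mean, maj_std):
    """Reflect points inside the majority 3-sigma box across its nearest face.

    This reading of "flipping" is an assumption made for illustration.
    """
    points = points.copy()
    z = (points - maj_mean) / maj_std
    rows = np.flatnonzero(np.all(np.abs(z) < 3.0, axis=1))
    if rows.size == 0:
        return points
    # For each inside point, pick the feature that is closest to the boundary.
    cols = np.argmax(np.abs(z[rows]), axis=1)
    z_sel = z[rows, cols]
    sign = np.where(z_sel >= 0, 1.0, -1.0)
    # Reflecting across the face z = 3 * sign gives z' = 6 * sign - z.
    z_new = 6.0 * sign - z_sel
    points[rows, cols] = maj_mean[cols] + z_new * maj_std[cols]
    return points

def oversample(X_min, X_maj, n_new, noise_scale=0.1, seed=0):
    """Draw n_new synthetic minority points by weighted Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    min_mean, min_std = X_min.mean(axis=0), X_min.std(axis=0) + 1e-12
    maj_mean, maj_std = X_maj.mean(axis=0), X_maj.std(axis=0) + 1e-12
    # Assumption: the weight of each minority point is proportional to its density.
    density = gaussian_density(X_min, min_mean, min_std)
    weights = density / density.sum()
    seeds = rng.choice(len(X_min), size=n_new, p=weights)
    # Assumption: jitter the chosen seed points to keep the samples diverse.
    noise = rng.normal(0.0, noise_scale * min_std, size=(n_new, X_min.shape[1]))
    return flip_outside_3sigma(X_min[seeds] + noise, maj_mean, maj_std)
```

    To balance a dataset under these assumptions, one would call `oversample(X_min, X_maj, n_new=len(X_maj) - len(X_min))` and append the result to the minority class before training the decision tree.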

     
