Class Imbalance Study

This project alleviates the class imbalance problem on a text classification dataset by employing sampling methods: Resampling, Near Miss, and EDA. We systematically conduct extensive experiments using nested cross-validation to compare classification performance under the three sampling settings. [Code]

(Figure: project overview)

Details

  • The target dataset is the Reddit dataset [1].
  • Three sampling methods handle the class imbalance between the two classes; minimal sketches of each follow this list.
  • Resampling: It combines oversampling and undersampling, drawing samples with replacement.
  • Near Miss [2]: It uses the distances between samples to decide which majority-class samples should be removed.
  • Easy Data Augmentation (EDA) [3]: It generates synthetic texts by randomly applying simple operations (synonym replacement, random insertion, random swap, random deletion) to a given sentence.
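The first two settings can be sketched with the imbalanced-learn package; this is an illustrative assumption (the repo may implement the samplers directly with NumPy), and the 90/10 toy matrix stands in for the Reddit features:

```python
# Minimal sketch of Resampling and Near Miss, assuming the
# imbalanced-learn package; the 90/10 toy data is illustrative only.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import NearMiss, RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # toy feature matrix
y = np.array([0] * 90 + [1] * 10)    # 90 majority vs 10 minority samples

# Resampling: oversample the minority class with replacement,
# or undersample the majority class.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Near Miss: keep the majority samples whose average distance to the
# nearest minority samples is smallest.
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)

print(np.bincount(y_over), np.bincount(y_under), np.bincount(y_nm))
```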
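For EDA, here is a minimal sketch of two of its four operations (random swap and random deletion) on a toy sentence; the full method in [3] also performs WordNet-based synonym replacement and random insertion, and these helpers are hypothetical rather than the repo's code:

```python
# Sketch of two EDA operations on a whitespace-tokenized sentence.
import random

def random_swap(words, n_swaps=1):
    """Swap two randomly chosen word positions, n_swaps times."""
    words = words.copy()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n_swaps=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```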

  • Cross-validation is used to check whether the trained model over-fits.
  • Nested cross-validation serves two purposes: checking over-fitting and tuning hyper-parameters (see the sketch after this list).
  • A 5-fold outer cross-validation checks over-fitting on each test set.
  • Another 5-fold inner cross-validation tunes the selected hyper-parameters.
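A minimal sketch of this nested 5-fold scheme with scikit-learn, using the PCA + KNN pipeline mentioned below as the tuned estimator; the synthetic data and grid values are illustrative, not the project's actual settings:

```python
# Nested CV: the outer loop estimates generalization, the inner loop
# (inside GridSearchCV) tunes hyper-parameters on each outer training fold.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([("pca", PCA()), ("knn", KNeighborsClassifier())])
grid = {"pca__n_components": [5, 10], "knn__n_neighbors": [3, 5, 7]}

inner = KFold(n_splits=5, shuffle=True, random_state=0)  # tuning
outer = KFold(n_splits=5, shuffle=True, random_state=1)  # over-fitting check

search = GridSearchCV(pipe, grid, cv=inner, scoring="f1")
scores = cross_val_score(search, X, y, cv=outer, scoring="f1")
print(f"outer-fold F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```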


  • It enables the user to define and tune any hyper-parameters involved in each sampling method.
  • It uses the Hugging Face Trainer API to build the BERT classifier [4] (see the sketch after this list).
  • It uses scikit-learn and NumPy to implement two ML algorithms, namely PCA and KNN (used as the pipeline in the nested cross-validation sketch above).
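A minimal sketch of wiring BERT into the Trainer API [4]; the "bert-base-uncased" checkpoint and the toy texts/labels are placeholders for whichever checkpoint and Reddit data the project actually uses:

```python
# Fine-tune a BERT classifier with the Hugging Face Trainer; toy data only.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["a minority-class post", "a majority-class post"] * 8
labels = [1, 0] * 8

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(texts, truncation=True, padding=True)

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized inputs and labels in the format Trainer expects."""
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=TextDataset(enc, labels)).train()
```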


Results

  • Near miss (Before)
  • Near miss (After)
  • Resampling Results
  • EDA Results 1
  • EDA Results 2
  • EDA Results 3

References

  • Reddit dataset paper (EMNLP-IJCNLP 2019) [1]
  • Introduction to Near Miss [2]
  • EDA paper (EMNLP-IJCNLP 2019) [3]
  • Trainer API of Hugging Face [4]