Impact of Feature Selection and Data Augmentation for Pregnancy Risk Detection in Indonesia

Irfan, Muhammad and Basuki, Setio and Azhar, Yufis (2022) Impact of Feature Selection and Data Augmentation for Pregnancy Risk Detection in Indonesia. International Journal on Advanced Science, Engineering and Information Technology (IJASEIT), 12 (6). pp. 2266-2273. ISSN 2460-6952

Abstract

This paper aims to develop an automatic system for pregnancy risk detection in Indonesia. Because this is a sensitive field, the system requires a sophisticated approach to achieve the required performance. Existing works were developed using small datasets and limited classification features. Moreover, because all features are treated equally, it is hard to interpret which features contribute more to the detection results. To address these issues, we propose combining more complex features, data augmentation methods, and feature selection techniques. We use all 118 pregnancy indicators and 400 instances from Puskesmas (community health centers) as the original dataset. Next, this dataset is used to build two data augmentation methods, i.e., GMM and CTGAN. Each data augmentation method generates 2,000 new synthetic instances. Following this, five machine learning methods combined with three feature selection approaches, i.e., RFE, Random Forest, and Chi-Square, are applied to all datasets. Through experiments, we observed that feature selection techniques play an essential role in improving classification accuracy. While the GMM-based augmentation improved performance, the CTGAN-based synthetic dataset showed low performance. The best accuracy across all experiment settings reached 95%. Random Forest combined with RFE on the GMM-based dataset achieved this highest accuracy using only five features. Another notable result is that both XGBoost and Decision Tree reached the same 95% accuracy on the GMM-based dataset using only nine features. The overall results show that appropriate data augmentation and feature selection matter for achieving better performance in this research.
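
To make the Chi-Square feature selection step mentioned in the abstract more concrete, the following is a minimal sketch of ranking categorical indicators by a Pearson chi-square statistic against the risk label. It is not the authors' implementation: the function names (`chi_square_score`, `rank_features`), the indicator names, and the toy records are all hypothetical, and the paper's actual scoring procedure and data are not reproduced here.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic for one categorical feature vs. class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # contingency-table cell counts
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Expected count if feature and label were independent.
            expected = f_count * l_count / n
            diff = observed.get((f_value, l_value), 0) - expected
            score += diff * diff / expected
    return score


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted from most to least label-dependent."""
    columns = list(zip(*rows))
    scored = [(name, chi_square_score(col, labels))
              for name, col in zip(names, columns)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up records (assumption): three binary indicators per pregnancy.
names = ["indicator_a", "indicator_b", "indicator_c"]
rows = [
    (1, 0, 1),
    (1, 1, 0),
    (1, 0, 1),
    (1, 1, 0),
    (0, 0, 1),
    (0, 1, 0),
    (0, 0, 1),
    (0, 1, 0),
]
labels = ["high", "high", "high", "high", "low", "low", "low", "low"]

ranking = rank_features(rows, labels, names)
# Keep only the k highest-scoring indicators (k = 1 in this toy example).
top_k = [name for name, _ in ranking[:1]]
```

In this toy data only `indicator_a` varies with the label, so it ranks first; the other two indicators are independent of the label and score zero. A classifier would then be trained on the `top_k` columns only, which is the general idea behind reaching high accuracy with a handful of features.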

Item Type: Article
Keywords: CTGAN; data augmentation; feature selection; pregnancy risk detection
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
T Technology > T Technology (General)
Divisions: Faculty of Engineering > Department of Informatics (55201)
Depositing User: maulana Maulana Chairudin
Date Deposited: 09 Mar 2024 01:31
Last Modified: 09 Mar 2024 01:31
URI: https://eprints.umm.ac.id/id/eprint/4602
