Prognosis Model for Gestational Diabetes Using Machine Learning Techniques

Gestational diabetes mellitus (GDM) is a syndrome that occurs among women during pregnancy and is characterized by lack of insulin hormone secretion. GDM occurs in about 4% of all pregnancies and is diagnosed at later stages of pregnancy. It can occur in women with no known history of diabetes. Since no signs or symptoms occur at the onset of GDM, it is possible to diagnose it only through screening tests. GDM poses some major health risks such as hormonal imbalance, delivery risks, and the development of Type 2 diabetes (T2D) after delivery. The condition can be diagnosed from the blood sugar level. Those diagnosed with GDM are likely to be obese, have a weak constitution, and be undergoing a stressful life or living in a stressful environment, eating unhealthy food, and living an unhealthy lifestyle. Other risk factors to be considered are family history, heredity, and the occurrence of diabetes in the past. Apart from diagnosis, the most crucial stage in managing GDM is its prognosis. If the disease is diagnosed at earlier stages, one can avoid its complications. Advanced technologies such as IoT and wearable sensors can help healthcare professionals in identifying the early signs and symptoms of GDM. In this scenario, data mining techniques are recommended for the prognosis of GDM using existing medical reports and risk factors related to women. A patient’s medical history and their family history should be correlated with each other to find the likelihood of GDM occurrence. Classification is a technique in which a training dataset is used to predict the importance of related factors using an inference function. Our aim is to develop a prognosis model for GDM using a classification technique. A GDM prognosis model is developed using a training set of disease parameters along with an individual’s risk factors. From the results of our experiments, it is inferred that the proposed model can be used for predicting the likelihood of GDM in its earlier stages.


Introduction
Gestational diabetes mellitus (GDM) (1)(2)(3) is a syndrome that occurs among women during their pregnancy. The World Health Organization (WHO) has stated that the prevalence of GDM is increasing every year owing to lifestyle changes and the high number of Type 2 diabetes (T2D) patients. GDM has pre- and postnatal implications for both the mother and the infant. After birth, the mother may develop T2D or Type 1 diabetes (T1D). The infant may experience poor nutrition and be prone to diabetes in the future. WHO (4) revised the treatment regimen for diabetes based on race, country, and the individual. Research is ongoing to prognosticate GDM and diagnose the condition. Innovative biomarkers that can be used to identify the disease with normal tests have recently been introduced.
The emergence of sensor devices in recent years has led to rapid advances in a wide range of applications. In the healthcare domain, smart patient assistance is a notable field that provides intensive care to patients at remote locations. Hospitals have undergone drastic changes in providing 24 × 7 lifeline support to patients across the globe. In this scenario, it is important to promote research on GDM using advanced technology. Various studies that focus on applying data mining and machine learning concepts to the maintenance and analysis of patient records and disease biomarkers have been conducted. Researchers have recently identified new and highly helpful biomarkers that can be used to periodically check patients for symptoms of GDM. Data mining classifiers (5) are widely used in the prediction of GDM. Our current research aims to improve the accuracy of diagnosis by enhancing the quality of data and finding suitable classifiers such as the support vector machine (SVM) and k-nearest neighbors (KNN) for GDM prediction. The classifiers, which are mostly used in diabetic research, are compared in terms of their accuracy rate.

Related Studies
Schoenaker et al. (6) proposed an important prediction model for GDM based on electronic health records. The model tracks the history of previous pregnancies and compares it with current pregnancy data. It also selects the features of a diabetic dataset based on correlation. This model achieved its maximum accuracy with the use of data mining classifiers. Earlier, Iyer et al. (7) developed a framework to predict GDM using multiclassifier techniques. They focused on developing an autonomous decision-making model to diagnose GDM using an ensemble classifier approach with higher accuracy than previous models. Milewski et al. (8) proposed a prediction model using principal component analysis (PCA), K-means clustering, and the logistic regression (LR) classifier. (9) Kumar and Umatejaswi (10) proposed a new model to solve the basic diagnosis problem, with which they analyzed and identified the severity of diabetes. They developed guidelines for doctors and hospital management to predict and diagnose diabetes as well as its risk levels. Kavakiotis et al. (11) conducted a systematic review of studies on diabetes research that have been conducted with biological tools, machine learning, and data mining. Nagarajan et al. (12) and Omiotek et al. (13) proposed a new algorithm to improve the diagnosis of GDM using data mining techniques.
After reviewing the literature about data mining techniques, we propose a novel approach to handling decision-making in the prediction of GDM that uses data mining and technological improvements.

Methodology
The objective of our study is to predict GDM through data mining and machine learning algorithms. The Pima Indian diabetes dataset, sourced from the UCI repository, is used in this study. (14) First, the data is preprocessed, during which the missing data is handled effectively to improve the accuracy of the classifier. Normalization is used to scale the data of an attribute so that it falls within a small range (0-1). A predictive model is developed with the Random Forest (RF) classifier (15) and cross-validated (16) using part of the dataset. The model is tested for its effectiveness in predicting GDM using patient health data.
Data mining has different stages, among which preprocessing is the first step. The input data should be preprocessed prior to the application of a data mining technique to remove noise and increase the accuracy of the process. During preprocessing, data cleaning and transformation are applied as preliminary steps. To predict GDM for the given dataset with higher accuracy, a set of widely used classifiers was selected and their performance was compared in this study.
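The cross-validation step mentioned above can be illustrated with a small sketch. The text does not state which scheme was used, so k-fold splitting is an assumption here, and the function name `k_fold_indices` and its parameters are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        # Every fold except the i-th one is used for training.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Usage: each sample appears in exactly one test fold.
for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))
```

Each fold serves once as the held-out part of the dataset, so every observation is used for both training and validation across the k rounds.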

Data cleaning and transformation
Data cleaning and transformation are important steps since the dataset should be refined before it is used in data mining and machine learning approaches. Real-world datasets mostly have a few missing values, encoded as blanks, NaNs, or other placeholders. These missing values should be handled prior to the actual processing. However, it is challenging to manage these values and use them in the development of strong models. Different approaches can be used to handle such missing values. In the current study, various methods were tested to overcome missing data values: dropping the entire tuple, a mean imputation method that replaces the missing value with the mean (17) of each column, and a grouping-based mean imputation method. Among these methods, the grouping-based mean imputation method is used here since it performs well in replacing the missing values based on grouping, thus increasing the classification accuracy. Age is the grouping attribute considered in the current study.

Group-based mean (GBM)
Mean imputation is a suitable method for handling missing and inconsistent values in a dataset. In this method, a missing value of an attribute is replaced by the mean of its group. The mean can be used to approximate some attributes. For instance, the blood pressure (BP) values of patients differ according to their age. Therefore, when calculating the mean to replace the missing data in the BP attribute, age must also be considered. A GBM method is thus proposed in the current study, in which the missing values are filled with categorized group-based values. To achieve this, the dataset is grouped according to the age of the patients, then the GBM is applied. The results attained from the GBM increase the quality of the dataset.
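A minimal sketch of the group-based mean idea described above, assuming missing values are encoded as `None` and that age groups are defined by upper bounds. The function name, the bin boundaries, and the toy patient records are illustrative assumptions, not values from the study.

```python
from collections import defaultdict

def group_mean_impute(rows, value_key, group_key, bins):
    """Fill missing (None) values of value_key with the mean of the row's group."""
    def group_of(row):
        # Index of the first bin whose upper bound covers the grouping value.
        for i, upper in enumerate(bins):
            if row[group_key] <= upper:
                return i
        return len(bins)

    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        if row[value_key] is not None:
            g = group_of(row)
            sums[g] += row[value_key]
            counts[g] += 1

    observed = [r[value_key] for r in rows if r[value_key] is not None]
    overall_mean = sum(observed) / len(observed) if observed else None

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input records stay unchanged
        if new_row[value_key] is None:
            g = group_of(row)
            # Fall back to the overall mean if the group has no observed values.
            new_row[value_key] = sums[g] / counts[g] if counts[g] else overall_mean
        filled.append(new_row)
    return filled

# Toy records (hypothetical): BP is missing for one younger and one older patient.
patients = [
    {"age": 22, "bp": 66.0}, {"age": 25, "bp": 70.0}, {"age": 28, "bp": None},
    {"age": 45, "bp": 80.0}, {"age": 50, "bp": 84.0}, {"age": 47, "bp": None},
]
imputed = group_mean_impute(patients, "bp", "age", bins=[30, 40])
```

In this toy example the younger patient receives the mean BP of the under-30 group and the older patient receives the mean of the over-40 group, rather than both receiving the column-wide mean.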

Normalization
Normalization (18) is a stage in the data preprocessing technique (19) in machine learning. In this data transformation technique, attribute values are rescaled so that they lie within a relevant range on a common scale. The original scale of each attribute may differ. Normalization reduces the distortion and increases the quality of data. Thus, the data is normalized so that the values of all the attributes are between 0 and 1. We analyzed the organization of the data with and without normalization in this study.
Min-max normalization is one of the most common ways to normalize data, and it guarantees the same scale for all the features. For every feature, the minimum value of that feature is transformed into 0, the maximum value is transformed into 1, and every other value is transformed into a decimal between 0 and 1 using Eq. (1).
$$V' = \frac{V - \mathrm{MIN}(A)}{\mathrm{MAX}(A) - \mathrm{MIN}(A)} \bigl(\mathrm{newMAX}(A) - \mathrm{newMIN}(A)\bigr) + \mathrm{newMIN}(A) \qquad (1)$$
Here, A is the attribute, and MIN(A) and MAX(A) are the observed minimum and maximum values of A, respectively, which define the range of A. V′ represents the new data entry and V represents the old data entry. Finally, newMAX(A) and newMIN(A) are the maximum and minimum values of the target range for A, respectively (1 and 0 in this study).
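The min-max transformation described above can be sketched in a few lines. This is an illustrative implementation under the stated definitions; the function name, the handling of constant columns, and the sample glucose values are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # A constant column has no range; map every entry to new_min (assumption).
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

# Hypothetical glucose readings rescaled to the 0-1 range.
glucose = [85.0, 100.0, 145.0, 185.0]
scaled = min_max_normalize(glucose)
```

The smallest reading maps to 0, the largest to 1, and the others fall proportionally in between.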

Classification
Classification is a supervised (20) machine learning technique in which a model learned from labeled sample data is used to assign classes to unknown data. In contrast to other algorithms in data mining, classifiers are used to handle both continuous and discrete attributes. Classification techniques are evolving and are now able to cope with medical datasets consisting of both continuous and discrete attributes, and are highly useful for classifying data into ranges. Classifiers that have been used in earlier studies on GDM were considered in this study. Figure 1 shows different classifiers. In diabetic research, (21) a few selected classifiers such as LR, SVM, Gaussian naïve Bayes (NB), KNN, and RF (22,23) are widely used. These classifiers were used for performance evaluation in this study. We analyzed different classification algorithms for their performance, accuracy, and output with the given dataset. Among the classifiers, RF achieved the highest accuracy rate. Another advantage of RF is its scalability; (24) it performs well when the dataset is large and dynamic in nature.

RF algorithm
The RF algorithm proposed by Ho (24) is based on the stochastic discrimination approach followed by Kleinberg. (25) RF is a modern ensemble classifier that has started gaining attention in recent years owing to its good classification capability. In this classifier, every single learner is a decision tree built on bagged (bootstrap-sampled) data, while each node split is developed on the basis of a randomly selected feature subset. This is a supervised multilearning (ensemble) method that is based on the concept of decision trees. Compared with other classifiers (26) such as ID3 and C4.5, RF is highly efficient since it is less prone to overfitting. It builds multiple decision trees and combines their outputs, using the mean for regression and the mode for classification. The later versions of RF include bagging, boosting, and the control variance. (27,28) GDM data contains different sets of multidimensional attributes with continuous values. RF with tuned parameters can handle a GDM dataset with improved accuracy and a lower error rate.
The pseudocode for RF is given as follows.
1. Randomly select a subset of features from the full feature set.
2. Calculate node "d" using the best split point.
3. Split the node into sibling nodes using the best split.
4. Repeat steps 1 to 3 until [the last node].
Figure 2 shows the test sample input of RF and the creation of subset trees. Figure 3 shows the workflow of the prediction model proposed in this study.
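The RF procedure outlined above can be illustrated with a compact sketch in plain Python. It is not the implementation used in the study: the Gini impurity criterion, the depth limit, all function names, and the toy data (two hypothetical features, glucose and BMI) are assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, k, depth, rng):
    # Draw a random subset of k features for this node.
    feature_ids = rng.sample(range(len(X[0])), k)
    # Find the best split point among those features.
    split = best_split(X, y, feature_ids) if depth > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    # Split the node into two child nodes and recurse.
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], k, depth - 1, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], k, depth - 1, rng))

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=25, k=1, max_depth=3, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (bagging)
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 k, max_depth, rng))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]  # majority vote (mode)

# Toy data (hypothetical): [glucose, BMI] with label 1 = GDM, 0 = normal.
X_toy = [[90, 22], [95, 24], [100, 23], [105, 25],
         [150, 31], [160, 33], [170, 30], [180, 35]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X_toy, y_toy)
```

Each tree sees a different bootstrap sample and a different random feature at each node, and the forest prediction is the mode of the individual tree votes.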

Experimentation and Results
The prediction model used in our investigation was developed using Python language (29,30) and R software. (31)(32)(33) The GDM dataset was input to the application. The data was preprocessed in the first step by using the GBM method. The result was used to develop a prediction model with promising attributes and ranges. After developing the training model, the preprocessed GDM data was tested using different classifiers. The results for each classifier were compared in terms of accuracy.

Dataset
The Pima Indian diabetes dataset was sourced from the UCI online repository. It has the following attributes: pregnancy occurrences, oral glucose tolerance test (OGTT), diastolic BP, skinfold thickness, body mass index (BMI), plasma glucose level, diabetes pedigree function, age, and a class variable. There are nearly 750 observations taken with nine attributes. Figure 4 shows the dataset attributes with their values. Figure 5 shows the correlation matrix that can be used to observe how the features are related to each other or to the target variable. It can be seen that the matrix is symmetric about its leading diagonal and that each variable in the dataset is positively correlated with the others. Figure 6 shows the GDM dataset used to analyze the software prediction model.
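As an illustration of how such a correlation matrix can be computed, the sketch below uses the Pearson coefficient on a few made-up columns. The choice of Pearson correlation, the function names, and the toy values are assumptions; the text does not state how Fig. 5 was produced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy columns (hypothetical values, not taken from the dataset).
toy = {
    "glucose": [90, 105, 150, 170, 120],
    "bmi": [22.0, 25.5, 31.0, 33.5, 27.0],
    "age": [24, 31, 45, 38, 29],
}
corr = correlation_matrix(toy)
```

The diagonal entries equal 1 and `corr[a][b]` equals `corr[b][a]`, which is the symmetry about the leading diagonal noted above.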

Data preprocessing
The GDM dataset contained missing values and noisy data, which were preprocessed using the GBM method. In the table of values shown in Fig. 7, the empty values have been replaced with imputed values during preprocessing. The missing values in the group concerning the age and BMI were replaced with the mean values as shown in the table. In the second step of preprocessing the GDM dataset, normalization was performed. Box plot analysis (34) was used to estimate the amount of data that will be normalized. The results obtained before and after the normalization are respectively shown in Figs. 8 and 9. Data visualization is an important step in data analysis. If the data is graphically visualized as box plots or histograms, it provides a better understanding of different feature values and their distribution. Figures 10 and 11 show the histograms obtained before and after the normalization of each attribute, respectively.
In the third step, the given GDM dataset and resampled dataset were analyzed using different classifiers. Both normalized and non-normalized data were fed as inputs to evaluate the performance of the classifiers. The results showed that RF achieved high accuracy for both normalized and non-normalized data. After min-max normalization was applied to the dataset, the RF classifier achieved much higher accuracy than the other classifiers. Figure 12 shows an individual report of each classifier for relevant performance measures, where the accuracy rate of the classifier is presented using a confusion matrix. The dataset was used without normalization to determine the classifier's basic functionality. Figure 13 shows the outputs attained using different classifiers along with their accuracy rates when using the max-abs normalization technique. Figure 14 shows the outputs attained using different classifiers along with their accuracy rates when using the min-max normalization technique.
The above experimental results show that the normalization techniques perform well with the different classifiers. Min-max normalization performed well with most classifiers, and RF with tuned parameters and min-max normalization outperformed all the other classifiers for the given dataset.

Performance Evaluation
Resampling is a widely used method for handling highly imbalanced datasets. It consists of adding samples to or removing samples from the training dataset. The simplest approach involves duplicating existing samples in the dataset. This type of resampling can improve the performance of the classification model. Table 1 illustrates the detailed results of the comparative analysis of the dataset.
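The duplication-based resampling described above can be sketched as random oversampling of the minority class. The function name and the toy training data are hypothetical, and the sketch assumes resampling is applied to the training portion only, so that duplicated samples do not also appear in the test data.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())      # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))   # replicate an existing sample
            y_out.append(label)
    return X_out, y_out

# Toy training set (hypothetical): six normal samples, two GDM samples.
X_train = [[90, 22], [95, 24], [100, 23], [105, 25], [98, 26], [92, 21],
           [150, 31], [160, 33]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X_train, y_train)
```

After resampling, both classes contain the same number of samples, and every added row is a copy of an existing minority-class row.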
The following metrics are used to evaluate and compare the models discussed earlier:
• Recall: the fraction of actual positive instances that the classifier correctly identifies (sensitivity),
• Precision: the fraction of relevant instances among the retrieved instances,
• F1 score: the harmonic mean of recall and precision,
• Confusion matrix: a table of the actual versus predicted labels from a classification problem, and
• Receiver operating characteristic (ROC) curve: a plot of the true positive rate against the false positive rate.
The ROC curve is an evaluation measure used to analyze the performance of classifiers. It is a probability curve that is deployed to estimate the capability of an algorithm or model. It is plotted with the true positive rate on the y-axis and the false positive rate on the x-axis. In this curve, each point represents a sensitivity and specificity pair corresponding to a particular decision threshold. Figures 15 and 16 show the ROC curves used to differentiate whether the data from a patient falls under a disease or normal category.
Classification algorithms that optimize the overall accuracy or class distribution purity often perform poorly on imbalanced data. In most scenarios, the testing set is classified under the majority class. However, in imbalanced data classification, the accuracy in identifying the minority class (e.g., diseased samples) matters most. Thus, low sensitivity is highly undesirable. When numerous data features are collected and engineered along with appropriate estimator selection, it is possible to increase the performance. The ROC curve is a two-dimensional graph in which sensitivity is plotted against 1 - specificity, where specificity is the accuracy in identifying the majority class (e.g., normal samples). The ROC curve is deemed to be an accurate means of evaluating the performance of a classification.
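The metrics listed above can be computed directly from the confusion-matrix counts. The sketch below is illustrative: the function names, the toy labels, and the toy scores are assumptions, and the ROC points are obtained by sweeping the decision threshold over the predicted scores.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1

def roc_points(y_true, scores, positive=1):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy labels, predictions, and scores (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
points = roc_points(y_true, scores)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more patients are flagged, so sensitivity rises at the cost of a higher false positive rate.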
In general, RF not only improves the classification accuracy but also gives a highly balanced classification result compared with other classification algorithms. Figures 17 and 18 show the prediction reports of gestational diabetes.
The proposed model has three steps: preprocessing, classification, and prediction. Preprocessing involves the handling of missing values and data normalization. First, the input dataset is converted into a preprocessed dataset. In the GBM method, the missing values in the dataset are replaced with the group mean of the corresponding columns. The min-max method scales the data of the given column. After the data is preprocessed, it is classified by a supervised machine learning algorithm. A classifier is first built using a set of rules, based on which future data is classified. Classification is an important task in machine learning and data mining. In the current study, the RF algorithm is used with estimator selection for classification. Finally, this model is applied to predict GDM using patient health data. Compared with the other classifiers, the proposed model yields a higher accuracy rate.

Conclusion
The aim of this work is to develop a novel approach to predicting GDM using machine learning classifiers and data mining methods. Five widely used classifiers are taken into account, namely, LR, SVM, NB, KNN, and RF. In preprocessing, the GBM method and min-max normalization techniques are used to improve the data quality in the dataset. To evaluate the classifiers in terms of their accuracy rate, the confusion matrix and ROC curve are used. The results showed that the RF classifier with tuned parameters achieved higher accuracy than did the other classifiers. The GDM dataset contains correlated attribute values, which require an internally combined approach to obtain better results. The RF algorithm uses attribute values with regression effectively while using the GDM dataset. The performance evaluation results also indicate that RF is a suitable approach to predicting GDM in earlier stages. For the prediction of GDM with similar real-time datasets, the proposed model can also be enhanced by using combined techniques such as ensemble methods.