Summary. Pre-term birth is an increasingly prevalent, complex condition with multiple risk factors, including environmental pollutants. It is also a leading cause of neonatal death in developing countries. Machine learning techniques can play a key role in identifying the factors associated with pre-term birth.
The objective of this study is to compare the performance of logistic regression and decision tree classification methods and to identify the significant determinants of pre-term birth.
This study used a case–control design with 50 full-term births and 40 pre-term births. We applied logistic regression and decision tree classifiers to this dataset and evaluated their accuracy through several indices.
Logistic regression was found to be a more suitable classifier for this dataset than the decision tree. Alpha HCH, total HCH and MDA were the variables most strongly associated with preterm birth. Overall, the logistic regression classifier predicted pre-term birth more accurately than the decision tree.
Keywords. Machine learning, Pre-term birth, Classifier, Decision tree, Logistic regression
Preterm birth is the birth of a baby at less than 37 weeks' gestational age, as opposed to the usual term of about 40 weeks. Preterm birth (PTB) has been linked to various factors, including low socio-economic status, smoking, race and alcohol consumption. Previous research has reported associations between organochlorines and an increased risk of abortion, small-for-gestational-age babies, minor malformations, cryptorchidism and hypospadias in infants [2-3]. Various machine learning prediction and classification models, such as regression, logistic regression, principal component analysis, decision trees and the maximum likelihood method, have been used to improve maternal and child health. Machine learning researchers have also worked on interpretable prediction techniques in several settings [4-6]. The previous research article recommended that if early risk factors are identified through classification methods in acute kidney injury then can be
The overall aim of this study is to construct logistic regression (LR) and decision tree (DT) classification models that predict high-risk groups for PTB from several factors and covariates, in order to reduce the risk of PTB, and to compare the predictive performance of the two methods.
2. Material and Methods
This study used a case–control design with 50 full-term births and 40 pre-term births at Dr. Bhim Rao Ambedkar University (Agra, Uttar Pradesh, India).
The dependent variable in this study is the birth outcome, classified as either full-term or pre-term. The risk factors considered include age, BMI, number of children, lactation duration, addiction, residence, pesticide exposure, drinking water source, dietary habits, baby gender and organochlorine pesticides in the placenta of the women.
We applied the decision tree model to both the whole dataset and the training dataset, because a decision tree builds a binary tree and is well suited to classification problems. The next step was to rank the variables using the information gain technique. Finally, we fitted both the decision tree and logistic regression classifiers on the training dataset (70% of cases) and evaluated them on the test sample (30% of cases).
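As an illustration of the 70%/30% partition described above, the following minimal Python sketch splits 90 case labels into training and test indices. It is not the authors' code: the function name `stratified_split`, the fixed random seed, and the choice to stratify by class are assumptions made for this example, since the study does not state how the split was drawn.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper: stratifying by class is an assumption,
    not something the study reports.
    """
    rng = random.Random(seed)

    # Group the row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# 50 full-term (0) and 40 pre-term (1) births, matching the study design.
labels = [0] * 50 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 50 full-term and 40 pre-term births, a stratified split of this kind yields 63 training cases (35 + 28) and 27 test cases (15 + 12).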
Figure 1 shows the variables ranked by information gain. Higher values of information gain indicate variables that are more important for preterm birth. Alpha HCH, total HCH, MDA, p_p_DDT, beta HCH and p_p_DDE are the highest-ranked variables; in other words, these variables play an important role in preterm birth. Of these, alpha HCH, total HCH and MDA are the most influential factors with respect to association with preterm birth. We developed the logistic regression and decision tree classification models on the training dataset (70% of cases) using all variables. Both models were then tested on the remaining 30% of cases, and the evaluation results are given in Table 1. The results show that the logistic regression classifier predicts more accurately than the decision tree. The accuracy of logistic regression for classifying preterm birth is 0.96, significantly different from that of the decision tree method. Comparing the two classifiers with all variables included, logistic regression performs better (precision = 0.92, F1-score = 0.96 and AUROC = 0.97), while the decision tree performs worse (precision = 0.75, F1-score = 0.86 and AUROC = 0.87). Figure 2 shows the receiver operating characteristic (ROC) curves for the models with all variables; the logistic regression model is better than the decision tree model.
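The information gain ranking behind Figure 1 can be illustrated with a short Python sketch. Information gain is the reduction in class entropy obtained by splitting on a variable. Here a numeric variable is split at a single threshold, which is an assumption, as the study does not describe how continuous measurements were discretized. The variable names `var_a` and `var_b` and their values are hypothetical toy data, not study measurements.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numeric variable at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_gain(values, labels):
    """Largest gain over midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, c) for c in cuts),
               default=0.0)


# Hypothetical toy data: 0 = full-term, 1 = pre-term.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_variables = {
    "var_a": [1, 2, 3, 4, 5, 6, 7, 8],  # separates the classes perfectly
    "var_b": [1, 5, 2, 6, 3, 7, 4, 8],  # separates them only partly
}

# Rank variables from highest to lowest information gain.
ranking = sorted(
    ((name, best_gain(vals, toy_labels)) for name, vals in toy_variables.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
```

In this toy example `var_a` ranks first because one threshold separates the two classes completely, so its gain equals the full class entropy of 1 bit. Selecting the top-ranked variables then amounts to taking the first entries of `ranking`.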
We then repeated the experiment with both models using only the top six ranked variables, in order to show how effective the information gain ranking is on this data. The results obtained with the top six variables are given in Table 2. The results indicate that restricting the models to the top six variables does not affect the accuracy of the decision tree method, and the accuracy of logistic regression for classifying preterm birth is 0.78, again significantly different from the decision tree method. Logistic regression again performs better (precision = 0.65, F1-score = 0.79 and AUROC = 0.81) than the decision tree (precision = 0.56, F1-score = 0.69 and AUROC = 0.71). Figure 3 presents the ROC curves for both classifiers using the top six ranked variables; the logistic regression model is again better than the decision tree model.
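The precision, F1-score and AUROC values in Tables 1 and 2 follow standard definitions, which the sketch below computes in plain Python. The confusion-matrix counts in the example are hypothetical: they are not reported by the study and were chosen only because, for a 27-case test set, they reproduce the rounded logistic regression values quoted for all variables (accuracy 0.96, precision 0.92, F1-score 0.96). The function names and the toy scores are likewise assumptions for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Pre-term birth is treated as the positive class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for a 27-case test set (NOT taken from the study).
metrics = classification_metrics(tp=12, fp=1, fn=0, tn=14)
print({name: round(value, 2) for name, value in metrics.items()})

# Hypothetical predicted probabilities for five cases.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_outcomes = [1, 1, 0, 1, 0]
print(round(auroc(toy_scores, toy_outcomes), 3))
```

Because the F1-score is the harmonic mean of precision and recall, a precision of 0.92 together with an F1-score of 0.96 implies a recall close to 1, which is why the illustrative counts contain no false negatives.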
While the final objective of developing specialist classification and prediction machine learning techniques is to predict preterm birth risk, the definition of preterm birth and the data needed to analyze preterm birth risk are less amenable to study at present. Therefore, the purpose of this study is to determine the feasibility of using classification and prediction machine learning to generate expert system (knowledge-base) rules for the prediction of preterm birth. We applied logistic regression and the decision tree method to preterm birth data and found that LR is the more accurate of the two in predicting preterm birth. With this method, scientists, researchers and practitioners can identify the cases in their data sets that are at higher risk of preterm birth.
This study is limited by its small sample size, since machine learning techniques generally give better results on larger datasets. The results indicate that preterm birth is significantly associated only with γ-HCH and MDA. Finally, the results show that logistic regression classifies preterm birth more accurately than the decision tree method.
- Metzger, M. J., Halperin, A. C., Manhart, L. E., & Hawes, S. E. Association of maternal smoking during pregnancy with infant hospitalization and mortality due to infectious diseases. The Pediatric Infectious Disease Journal, 32(1), e1, 2013.
- Birnbaum, S. C., Kien, N., Martucci, R. W., Gelzleichter, T. R., Witschi, H., Hendrickx, A. G., & Last, J. A. Nicotine- or epinephrine-induced uteroplacental vasoconstriction and fetal growth in the rat. Toxicology, 94(1-3), 69-80, 1994.
- Hosie, S., Loff, S., Witt, K., Niessen, K., & Waag, K. L. Is there a correlation between organochlorine compounds and undescended testes? European Journal of Pediatric Surgery, 10(05), 304-309, 2000.
- Bien, J., Tibshirani, R., et al. Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4), 2403-2424, 2011.
- Carrizosa, E., Nogales-Gómez, A., & Romero Morales, D. Strongly agree or strongly disagree?: Rating features in Support Vector Machines. Information Sciences, 329, 256-273, 2016.
- Emad, A., Varshney, K. R., & Malioutov, D. M. A semiquantitative group testing approach for learning interpretable clinical prediction rules. In Proc. Signal Process. Adapt. Sparse Struct. Repr. Workshop, Cambridge, UK, 2015.