Hambisa Mitiku*
Received: May 05, 2025; Published: May 16, 2025
*Corresponding author: Hambisa Mitiku, Assistant Professor, Jimma University, Jimma Institute of Technology, Faculty of Computing and Informatics, Ethiopia
DOI: 10.26717/BJSTR.2025.62.009671
Background: People around the world visit health professionals to have illnesses diagnosed and treated; however,
rural communities often have difficulty reaching health professionals. Diagnosis and treatment could therefore be
better and more selective for the patient if they were supported by automated software
that saves time and money and makes the process go more smoothly. Such an automated system helps physicians
treat patients using patient treatment history and health data so that the patient may get immediate treatment.
Disease prediction takes the symptoms that the user provides as input and
returns the likelihood of a disease as output. It can be done using different machine learning
techniques, which have spread rapidly from computer science to several disciplines. Given its predictive capacity,
machine learning offers new opportunities for health professionals and communities.
Methods: This scoping review studied the application of machine learning in disease prediction
by searching two search engines for studies published from January 2020 to November 2022. The relevance of the
studies was analyzed, and the research questions, results, data, and algorithms used were identified for each study.
The findings of the studies were synthesized in narrative form.
Results: Among the 64 documents initially retrieved, 15 were identified as relevant studies according to the eligibility
and inclusion criteria. The number of publications on the application of machine learning in disease prediction
has risen recently. Of the 15 relevant studies, most (n = 14) used Indian data. About 93% (n = 14)
utilized surveys and 80% (n = 12) employed machine learning for common prediction tasks. Even though the
number of studies on machine learning and disease prediction is growing rapidly, most of the studies used machine
learning for prediction, and few of them used it for algorithm performance measurement (i.e., algorithmic
fairness).
Conclusions: While Machine Learning supports researchers with innovative ways to measure health outcomes
and their determinants from non-conventional sources such as text, audio, and image data, most studies still rely
on traditional surveys.
Keywords: Machine Learning; Disease Prediction
Background
People around the globe must see health professionals when affected by an illness, which is time-consuming and costly. It is very difficult and complex for patients to see a doctor immediately, specifically for those far from health stations. Thus, if patients do not get immediate treatment because the illness cannot be identified, they may suffer and even die (Davis, et al. [1]). So, it could be better and more selective for the patient if the above procedure can be done using automated software that saves time and money and makes the process go more smoothly. Such an automated system helps physicians to treat patients using patient treatment history and health data so that the patient may get immediate treatment. Disease prediction can be done by using different machine-learning techniques (Pattekari, et al. [2]).
For the past decades, applying machine learning to disease prediction using patient history and health data has been an ongoing struggle. Disease Predictor is a web-based system that predicts a user’s disease based on the symptoms they have. Data sets from various health-related websites have been obtained for the disease prediction system. Data mining techniques have been applied to pathological data for the prediction of specific diseases in many works. However, some of these techniques tried to predict the recurrence of the disease, and some approaches try to predict the disease’s control and progression. Recent deep learning advances in certain areas of machine learning have changed the use of machine learning models, which can learn rich, hierarchical representations of raw data with little preprocessing and produce more accurate results. With the main focus of using machine learning in healthcare being to improve patient care for better health, due attention was given to big data technology to predict disease from the perspective of big data analysis (Soni, et al. [3]).
Machine learning approaches such as predictive analysis play a great role in making it easier to identify certain diseases and diagnose them correctly to help treat patients. The daily growth of healthcare data in the healthcare industry supports the extraction of information for predicting diseases that may affect a patient in the future, using the treatment history and health data. This hidden information in healthcare data can later be used for effective decision-making about patients’ health, as well as in areas that need improvement through informative healthcare data. Applying machine learning in healthcare advances medical facilities so that better decisions about patient diagnosis and treatment options can be made. Machine learning in healthcare helps humans to process huge and complex medical datasets and turn them into clinical insights, which physicians can then use in providing medical care. Machine learning has been a stumbling block for decades. Machine learning technology provides a strong forum in the medical sector for efficiently resolving healthcare issues. Hence, machine learning, when implemented in healthcare, can lead to increased patient satisfaction.
This scoping review aimed to review the related literature and identify research gaps concerning the application of machine learning in disease prediction methods.
The questions this scoping review attempts to answer are:
1. What machine learning prediction models are available for disease prediction?
2. What methods were used to create these models?
3. What predictor variables are used in these models?
4. How well have existing models been reported?
A scoping review was chosen because this type of review is best suited to map research activity in a broad and heterogeneous field such as the application of machine learning to disease prediction (unlike typical systematic literature reviews, which focus on more specific research questions) (Arksey H, et al. [4-7]).
Protocol and Registration
The reporting of this scoping review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) (Tricco, et al., 2018).
Eligibility Criteria
The scoping review includes articles that explain the application of machine learning in disease prediction. To be included in this review, the studies identified in the search had to fulfill the following eligibility criteria: a disease prediction topic; use of at least one machine learning method (with a description of the machine learning algorithm in sufficient detail to extract the basic information analyzed in this review); a publication date between January 2020 and November 2022; and publication in the English language. Conference proceedings and review articles were excluded.
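The eligibility criteria above can be read as a single yes/no check applied to each candidate record. The following is a minimal sketch under stated assumptions, not the screening procedure actually used in this review: the `Candidate` fields, the `is_eligible` function, and the exact day boundaries of the publication window are hypothetical names and choices introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: the window runs from the first day of January 2020
# to the last day of November 2022.
WINDOW_START = date(2020, 1, 1)
WINDOW_END = date(2022, 11, 30)

# Publication types excluded by the review (labels are illustrative).
EXCLUDED_TYPES = {"conference proceedings", "review article"}


@dataclass
class Candidate:
    """One retrieved record, reduced to the fields the criteria need."""
    about_disease_prediction: bool
    describes_ml_method: bool
    published: date
    language: str
    publication_type: str


def is_eligible(c: Candidate) -> bool:
    """Return True only if every eligibility criterion is met."""
    return (
        c.about_disease_prediction
        and c.describes_ml_method
        and WINDOW_START <= c.published <= WINDOW_END
        and c.language.lower() == "english"
        and c.publication_type.lower() not in EXCLUDED_TYPES
    )


# Hypothetical records used only to exercise the check.
in_window = Candidate(True, True, date(2021, 5, 1), "English", "journal article")
too_early = Candidate(True, True, date(2019, 12, 31), "English", "journal article")
print(is_eligible(in_window), is_eligible(too_early))
```

Writing the criteria as one conjunction makes it explicit that a study failing any single criterion is excluded.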
Information Sources and Search
To identify relevant literature, scientific articles that apply machine learning in the field of disease prediction, published from January 2020 to November 2022, were systematically searched on PubMed and Google Scholar. The search strategies were designed and executed by a reference. The search string was constructed by concatenating general terms related to machine learning (“machine learning”, “artificial intelligence”) and disease prediction.
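As an illustration of how such a search string might be assembled, the sketch below joins the machine learning terms and the disease prediction terms into one Boolean query. The use of OR within a term list and AND between the two lists is an assumption, as is the single disease term shown; the exact string submitted to PubMed and Google Scholar is not reported here. The names `or_group` and `build_query` are hypothetical.

```python
# Terms named in the text; the disease-prediction list is assumed to
# contain only the phrase itself.
ML_TERMS = ["machine learning", "artificial intelligence"]
DISEASE_TERMS = ["disease prediction"]


def or_group(terms):
    """Quote each term and join the list with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(ml_terms, disease_terms):
    """Require at least one term from each list (assumed AND between groups)."""
    return or_group(ml_terms) + " AND " + or_group(disease_terms)


query = build_query(ML_TERMS, DISEASE_TERMS)
print(query)
# ("machine learning" OR "artificial intelligence") AND ("disease prediction")
```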
Selection of Sources of Evidence
For this scoping review, 64 publications were retrieved and screened. Duplicate studies were removed, and studies cited within the retrieved papers were reviewed to find any missing studies. To identify the proper full-text articles, titles and abstracts were screened independently against the inclusion and exclusion criteria. Finally, the eligible publications were identified at this stage, again independently, on the basis of the inclusion and exclusion criteria.
Data Charting Process
The investigator was responsible for extracting the data and for determining which variables to extract. The investigator discussed the results and continuously updated the data-charting form in an iterative process.
Data Items
For the selected studies, data on article characteristics were abstracted, and data such as regions (countries, states), data sources, data structure, machine learning model, and model performance were extracted.
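The data items listed above map naturally onto one record per study. The sketch below shows one possible layout of such a data-charting record; the class name `ChartedStudy` and its field names are hypothetical and do not reproduce the form actually used. The example values are taken from the narrative description of Praveen, et al. [9] in this review, and the data structure is left empty because it is not stated for that study.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ChartedStudy:
    """One row of a data-charting form (illustrative layout only)."""
    citation: str
    region: str
    data_source: str
    data_structure: Optional[str]  # None when not reported
    ml_models: List[str]
    performance: Optional[str]     # free text, as reported by the study


example = ChartedStudy(
    citation="Praveen, et al. [9]",
    region="India",
    data_source="survey (symptom dataset)",
    data_structure=None,
    ml_models=["KNN", "CNN"],
    performance="CNN accuracy 84.5%",
)

# A plain dictionary is convenient for writing rows to a spreadsheet or CSV.
row = asdict(example)
print(row)
```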
Selection of Sources of Evidence
The PubMed and Google Scholar searches yielded 64 studies. After deduplication and title and abstract screening, 45 full texts were reviewed and 15 were included for data extraction (Figure 1).
Critical Appraisal within Sources of Evidence
The included studies are listed in Table 1. All the sources of evidence were published during the past three years, and the critical appraisal of the sources of evidence is presented in Table 1 below.
Concerning the origin of the data sources, of the 15 studies included in this scoping review, most (n = 14, 93.33%) used Indian data (Palle, et al. [8-22]), and one study (n = 1, 6.67%) used Mexican data (Edgar, 2020). Of the 15 studies, 14 (93.33%) used survey data and one (6.67%) used image data. Palle et al. used text data with a Random Forest-only algorithm to predict chronic diseases (Palle et al., 2021), and Praveen et al. used a survey-based symptom dataset with K-Nearest Neighbor (KNN) and Convolutional Neural Network (CNN) machine learning algorithms for disease prediction; the CNN reached an accuracy of 84.5%, which was higher than that of KNN (Praveen, et al. [9]).
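The shares of studies by data origin follow directly from the counts (14 of 15 and 1 of 15). A short sketch of the arithmetic, using only the counts reported in this section (the variable names are illustrative):

```python
from collections import Counter

# Country of the data used by each of the 15 included studies,
# as counted in this review.
origins = ["India"] * 14 + ["Mexico"] * 1

counts = Counter(origins)
total = sum(counts.values())

# Percentage share of each country, rounded to two decimals.
shares = {country: round(100 * n / total, 2) for country, n in counts.items()}
print(shares)  # {'India': 93.33, 'Mexico': 6.67}
```

The two shares sum to 100%, which is a quick consistency check on the reported percentages.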
Priyanka et al. developed a disease prediction model using the Random Forest machine learning algorithm to predict disease and to suggest the drug most commonly prescribed by doctors (Priyanka, et al. [10]). Sahil developed a disease prediction model using MEDSCAN, described as an Android-based application that integrates machine learning and deep learning models, and concluded that MEDSCAN is an advanced application that predicts disease based on X-ray and MRI scan images (Sahil [11]). Revati, et al. [12] developed a general disease prediction system using the SVM machine learning algorithm and concluded that using this system could help to predict disease in a short period at a low cost (Revati, et al. [12]). The study conducted by Harshit et al. on heart disease prediction using machine learning employed Logistic Regression, a Random Forest Classifier, and KNN over text data. The authors concluded that a heart disease detection system assists a patient based on clinical information about a previous heart disease diagnosis. According to this study, the accuracy of the model is 87.5%, and the use of more training data gives the model a higher chance of accurately predicting whether a given person has heart disease or not (Harshit, et al. [13]).
Edgar et al. studied breast cancer and diabetes prediction by applying machine learning approaches. The predictive model was developed using linear regression and the J48 algorithm, whose prediction accuracies were 93% and 95% in predicting breast cancer and diabetes, respectively (Edgar, et al. [14]). Khongdet, et al. [15] conducted a study to measure the overall performance of different ML approaches in order to enhance the predictability of the model. The main finding was that, among the machine learning algorithms used to predict diseases, such as Decision Tree, SVM, and Naïve Bayes, SVM outperforms the other algorithms in terms of accuracy and error rate (Khongdet, et al. [15]). These studies concluded that using one or more machine learning algorithms, including supervised algorithms such as Random Forest (n = 10; 66.67%) (Palle P, et al. [8-22]) and Support Vector Machines (n = 5; 33.3%) (Revati, et al. [12,18,19,21]; Sangivalasa, 2021) and unsupervised algorithms such as K-means (n = 3; 20%) (Praveen, et al. [9,13,19]), could better predict diseases, support doctors in making the right decision at the right time, and help patients get treatment at low cost within a short period, benefiting the life of the communities.
Synthesis of Results
The majority of the studies employed machine learning for pure prediction (n = 9), and the remaining studies used machine learning for algorithmic fairness (n = 6). Several studies using machine learning for pure prediction discussed how machine learning applications could enhance the prediction of disease. First, some studies showed how the machine learning approach could be applied to various data sources that traditional statistical methods cannot handle well. These studies used text data (either structured or unstructured) and image data (X-ray and CT scan data) (Palle P, et al. [8-21]). Second, other prediction studies discussed different supervised and unsupervised machine learning algorithms, asking which approach can flexibly model non-linear relationships and possible interactions among variables and most accurately predict the disease (Harshit, et al. [13-22]). Although a few studies evaluated the strengths and weaknesses of using machine learning algorithms in the context of disease prediction research, others did not. Those studies used machine learning for prediction by adding disease symptoms as variables to the list of other predictors without substantive reasons for how the inclusion of disease symptoms would improve predictive performance. However, some studies explicitly discussed reasons why adding disease symptoms as variables improved the prediction. For example, Harshit et al. estimated whether a patient is likely to be diagnosed with cardiovascular heart disease based on medical attributes such as gender, age, chest pain, and fasting sugar level. The authors concluded that the prediction model developed with logistic regression, random forest, and KNN machine learning algorithms helps to predict which patients have heart disease after cleaning the data set, reaching an average accuracy of 87.5% (Harshit, et al. [13]).
Few studies measured the overall performance of different machine learning approaches to enhance the predictability of the model in disease prediction. For instance, Khongdet et al. used Decision Tree, SVM, and Naïve Bayes machine learning algorithms to measure their performance in predicting heart diseases; S Pranitha et al. performed a comparative analysis of classifiers such as Decision Tree, Naïve Bayes, Logistic Regression, SVM, and Random Forest and proposed an ensemble classifier that performs hybrid classification by combining strong and weak classifiers; and Pooja analyzed the performance of various classification algorithms to find the most accurate algorithm for predicting whether a patient would develop heart disease or not (Khongdet, et al. [15,17,19]).
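To make the comparison criteria concrete, the sketch below computes accuracy and error rate for three classifiers and for a simple majority-vote combination of them. This is an illustration only: the labels and predictions are invented toy values, not results from any reviewed study, and plain majority voting is used as a stand-in for an ensemble; it is not claimed to be the hybrid method proposed by S Pranitha et al.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Complement of accuracy."""
    return 1 - accuracy(y_true, y_pred)


def majority_vote(predictions):
    """Combine per-classifier prediction lists by taking the most common label.

    With an odd number of classifiers and binary labels there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical toy data: 1 = disease present, 0 = disease absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "decision_tree": [1, 0, 0, 1, 0, 1, 1, 0],
    "svm":           [1, 0, 1, 1, 0, 0, 0, 0],
    "naive_bayes":   [0, 0, 1, 1, 1, 0, 1, 0],
}

for name, y_pred in preds.items():
    print(name, accuracy(y_true, y_pred), error_rate(y_true, y_pred))

ensemble = majority_vote(list(preds.values()))
print("majority_vote", accuracy(y_true, ensemble))
```

In this toy case the individual classifiers err on different records, so the vote corrects each single mistake. Real classifiers often make correlated errors, in which case the gain from voting is smaller.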
Summary of Evidence
In this scoping review, sixty-four (64) primary studies addressing the application of machine learning in disease prediction across various settings, published between January 2020 and November 2022, were identified from PubMed and Google Scholar. Of these, 19 studies were excluded based on the eligibility and inclusion criteria, and 45 full-text articles were assessed for eligibility, of which 30 were excluded (20 did not address disease prediction, and 10 did not discuss machine learning algorithms). In summary, 15 studies were included in this scoping review; they address the application of machine learning in the prediction of certain diseases so that it helps doctors and patients with the prognosis and prediction of diseases in a short period.
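The study-selection counts reported above can be checked with simple arithmetic. The sketch below uses only the numbers stated in this section; the variable names are illustrative.

```python
# Counts as reported in the summary of evidence.
identified = 64
excluded_at_screening = 19
full_text_exclusions = {
    "not disease prediction": 20,
    "machine learning algorithms not discussed": 10,
}

# Records remaining after title and abstract screening.
full_text_assessed = identified - excluded_at_screening

# Studies remaining after full-text assessment.
included = full_text_assessed - sum(full_text_exclusions.values())

print(full_text_assessed, included)  # 45 15
```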
This scoping review has the following limitations. First, “disease prediction” includes a wide range of definitions, so some relevant studies may still have been missed. Second, a similar definitional limitation applies to “machine learning.” As new algorithms are continuously being developed under new names, our machine learning search list might not have captured all relevant articles. Third, because our search required that an article use at least one term from both the machine learning and disease prediction lists in its title, abstract, or keywords, it did not include articles that use these terms only in the full text. For this reason, we complemented our bibliometric search with a manual (less strict) search to identify potential articles omitted by the bibliometric search. Fourth, only original research articles were included; including conference papers and other peer-reviewed publications would have expanded the number of reviewed studies [23].
This scoping review summarizes how and to what extent certain machine learning algorithms have been used in studies of disease prediction. It produced the following conclusions, each with a corresponding research opportunity. First, among the yearly disease prediction publications from January 2020 to November 2022, only 64 studies produced research using machine learning. While the number of studies on machine learning applications for disease prediction is increasing over time, there is a big opportunity to carry this research further. Second, as seen from the sources of evidence above, most articles used Indian data; in the future, there is an opportunity to take public-health issues from other world regions, such as Ethiopia, into consideration. Comparing how the same problem (e.g., predicting breast cancer) produces different predictions in different populations is one of the problems analyzed in machine learning under the name transfer learning. Third, most studies used tabular (structured and unstructured) data. Future studies could include other data formats such as text, audio, and image data, since machine learning equips scholars to measure health outcomes and their determinants from these non-conventional sources. Fourth, most studies used survey data. Using longitudinal data would also extend disease prediction. Considering the temporal scope of disease prediction issues is useful not only for using machine learning to predict but also for using machine learning to evaluate the effect of public health interventions on these outcomes. Fifth, the majority of the studies used machine learning for prediction; thus, future studies using these approaches have an opportunity to innovate further on specific diseases.