Applications of Machine Learning in Drug Discovery

Recent trends in drug discovery have been marked by escalating costs and declining approval rates...

solution to improve the efficiency of the drug discovery process for the pharmaceutical industry. In recent years, machine learning (ML) techniques have developed rapidly. In particular, the advent of deep learning (DL) has enabled artificial intelligence (AI) to outperform humans in certain specific applications such as board games and image recognition, as marked by the victory of AlphaGo over one of the world's strongest human Go players in 2016. Today, ML is widely applied across many areas of social and industrial activity, such as spam email identification, handwriting recognition, news recommendation, autonomous driving, and medical image analysis. In the pharmaceutical industry, ML has become one of the most important and rapidly evolving tools in computer-aided drug discovery, and it is involved in almost every stage of drug development [3]. Several detailed reviews on the applications of ML techniques in drug discovery already exist [3,4]. Here, we present a mini review with a special focus on drug target identification and validation, drug design and optimization, and drug toxicity prediction.

Drug Target Identification and Validation
Identification of a drug target is an important task in initiating a drug discovery pipeline. Modern biology has accumulated large amounts of human genetic information as well as transcriptomic, proteomic and metabolomic data, which makes it feasible to apply ML to drug target identification. For example, by analyzing the gene expression profiles of young and old human skeletal muscle with an ML approach, Mamoshina et al. [5] identified a panel of tissue-specific biomarkers of aging, which showed good correlation with the actual age values of the muscle tissue samples [5]. Similarly, Jeon et al. [...] The predicted drug targets were validated by the strong antiproliferative effects of their inhibitors [6]. Target identification with ML is also useful for the diagnosis and treatment of rare diseases, which usually lack effective treatment strategies. IJzendoorn et al. [7] performed machine learning analysis on transcriptome sequencing data, thereby uncovering diagnostic biomarkers and prognostic genes and identifying potential novel therapeutic targets for soft tissue sarcomas, a group of rare cancers [7]. In addition to predicting potential targets for a specific disease, ML approaches can also be used to unravel the common features of drug targets.
Using amino acid composition and property group composition as features, Kumari et al. built a model with an ensemble classification method, rotation forest, to distinguish drug targets from non-drug targets, which proved useful for novel drug target identification [8]. In conclusion, machine learning may serve as a powerful tool to speed up target identification and validation.
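To make the notion of amino acid composition features concrete, the following is a minimal sketch of how such a feature vector might be computed from a protein sequence. It is an illustration only and not the pipeline used in [8]; the function name `amino_acid_composition` and the toy sequence are assumptions introduced here, and the property group composition and the rotation forest classifier are not shown.

```python
from collections import Counter

# The 20 standard amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence.

    The result is a list of 20 floats ordered as in AMINO_ACIDS.
    Non-standard characters are ignored (an assumption of this sketch).
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, for illustration only.
features = amino_acid_composition("MKTAYIAKQR")
```

A vector of this kind, computed for each protein, could then be passed to any classifier that accepts fixed-length numeric features.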

Drug Design and Optimization
The ultimate goal of drug discovery is to bring new drugs to the clinic to treat diseases. Once a target has been identified, the next issue is how to efficiently design and optimize chemical structures that will alter the disease state by modulating the activity of the identified target. In the past decades, computer-aided drug design (CADD) has offered valuable tools for identifying active drug candidates, including molecular docking and quantitative structure-activity relationship (QSAR) modeling. With the rapid growth of chemical and biological databases as well as advances in ML algorithms, ML has become an alternative CADD tool for drug design and optimization [3]. For example, a novel scoring function based on the random forest (RF) algorithm was proposed to predict protein-ligand binding affinity; it outperformed 16 other classical scoring functions, with accuracy increasing with the size of the training dataset [9]. ML can also be applied to design inhibitors against non-molecular targets. Cruz et al. developed ML models with k-nearest neighbor, RF and support vector machine (SVM) algorithms, using nuclear magnetic resonance data as features, to identify molecules capable of inhibiting the growth of cancer cells [10]. The advent of deep learning (DL) methods has significantly boosted the predictive power of ML approaches. For example, in the Merck Kaggle challenge, DL outperformed the RF approach using 2D molecular descriptors for 13 of 15 assays [11]. Another advantage of DL is that it can be employed to optimize novel chemical structures towards desired properties. Olivecrona et al. designed a model based on recurrent neural networks (RNN) that is capable of generating novel compounds with optimized parameters, including bioactivity, solubility and pharmacokinetic properties [12].
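As an illustration of one of the classical algorithms mentioned above, the following is a minimal sketch of k-nearest neighbor classification on numeric feature vectors. It is not the model of [10]: the two-dimensional toy features, the labels "active" and "inactive", and the function name `knn_predict` are all hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among the k
    training points closest to it in Euclidean distance."""
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of compounds.
train_features = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.15),  # inactive compounds
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.85),  # active compounds
]
train_labels = ["inactive"] * 3 + ["active"] * 3

prediction = knn_predict(train_features, train_labels, (0.82, 0.88))
```

In practice the feature vectors would have many more dimensions, and the choice of k and of the distance measure would need to be validated on held-out data.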

Prediction of Drug Toxicity
Toxicity is currently the major reason for drug candidate failure during development and clinical trials, and it is responsible for two-thirds of the drugs withdrawn from the market [13]. It is therefore essential to screen out potentially toxic compounds as early as possible, to save the capital and labor devoted to preclinical and clinical investigation [14]. One way to achieve this goal is to develop accurate methods for toxicity prediction. Initially, drug toxicity was predicted with QSAR methods, which build quantitative relationships between chemical structure or properties and drug toxicity [15]. The assumption of linearity and the sensitivity to data dimensionality inherent in the early QSAR models limited their predictive power. Now that massive amounts of new data are available, it is a rational choice to turn to ML for the prediction of drug toxicity. Researchers have used a combination of algorithms, including k-nearest neighbor (k-NN), SVM, RF and DL, to predict toxicity [16].
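The limitation imposed by the linearity assumption can be illustrated with a minimal sketch of a one-descriptor linear model fitted by ordinary least squares. This is a simplified stand-in and not a specific QSAR method from [15]; the function name `fit_linear_qsar` and the toy descriptor and response values are hypothetical.

```python
def fit_linear_qsar(descriptor, response):
    """Fit response = slope * descriptor + intercept by ordinary
    least squares and return (slope, intercept)."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    if sxx == 0:
        raise ValueError("descriptor values must not all be equal")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(descriptor, response)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data in which the response depends on the
# square of the descriptor, i.e. a purely nonlinear relationship.
descriptor = [-2.0, -1.0, 0.0, 1.0, 2.0]
response = [x ** 2 for x in descriptor]

slope, intercept = fit_linear_qsar(descriptor, response)
```

On this toy data the fitted slope is zero, so the linear model reports no dependence on the descriptor even though the response is fully determined by it. Nonlinear learners are intended to capture this kind of relationship.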
It has been shown that commonly used ML algorithms such as SVM, RF, linear discriminant analysis (LDA) and neural networks are unsuitable for processing imbalanced ToxCast data [17]. Fortunately, DL has proved to be capable of handling such imbalanced data. For example, Xu et al. [18] built a drug-induced liver injury (DILI) prediction model with DL based on chemical structure data, which performed better than previously reported DILI models [18]. In another example, convolutional neural networks (CNNs), a subclass of DL networks, have been successfully used to predict toxicity from images of cells pretreated with different drugs [19].
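One reason imbalanced data is difficult can be shown with a minimal sketch that compares plain accuracy with balanced accuracy (the mean of per-class recall). It is an illustration only and is not taken from [17]; the 5:95 class ratio and the trivial predictor that always outputs "non-toxic" are assumptions made for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally
    regardless of how many samples it has."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(y_pred[i] == cls for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced dataset: 5 toxic (1), 95 non-toxic (0).
y_true = [1] * 5 + [0] * 95
# A trivial model that labels every compound as non-toxic.
y_pred = [0] * 100

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the trivial model reaches a plain accuracy of 0.95 while detecting no toxic compound at all, which the balanced accuracy of 0.5 makes visible.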

Concluding Remarks
Machine learning has received much attention as a powerful tool for uncovering patterns hidden in data. With the exponential growth of chemical and biological datasets over the past decades, machine learning algorithms such as RF, SVM and LDA have been successfully applied to the drug discovery process, as described above.
Deep learning algorithms have shown better performance on property prediction than the classic ML algorithms. However, there are still issues that deserve further study. One is the quality of the training data, which is a crucial factor for the performance of the resulting prediction model. Currently, publicly accessible datasets such as ChEMBL [20] and PubChem [21] are generally built by collecting data from different published sources. Consequently, inconsistency in data collected this way is inevitable and may degrade the resulting ML model. Further work is needed to provide systematic, diverse and accurate databases as training data for building ML models. The other issue is the interpretability of ML models. Recent advances in deep learning networks make them a promising tool with remarkable predictive power. Unfortunately, DL models are so complicated that their predictions cannot be interpreted or explained in physical or chemical terms, the so-called "black box" problem, which prevents drug designers from gaining insight into the predictions. A novel DL algorithm that balances predictive power and interpretability is therefore expected in the future.
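The data consistency issue raised above can be made concrete with a minimal sketch that flags compound-target pairs whose reported values disagree across sources. It does not use the actual schema of ChEMBL or PubChem: the record layout, the identifiers, the values (imagined as activities on a logarithmic scale) and the threshold `max_spread` are all hypothetical.

```python
from collections import defaultdict


def find_inconsistent_measurements(records, max_spread=1.0):
    """Group (compound_id, target_id, value) records by compound and
    target, and return the groups whose values differ by more than
    max_spread."""
    grouped = defaultdict(list)
    for compound_id, target_id, value in records:
        grouped[(compound_id, target_id)].append(value)
    return {
        key: values
        for key, values in grouped.items()
        if max(values) - min(values) > max_spread
    }


# Hypothetical records pooled from several literature sources.
records = [
    ("CPD-1", "TGT-A", 6.1),
    ("CPD-1", "TGT-A", 6.3),
    ("CPD-2", "TGT-A", 5.0),
    ("CPD-2", "TGT-A", 7.4),
    ("CPD-3", "TGT-B", 8.0),
]

conflicts = find_inconsistent_measurements(records)
```

Flagged pairs could then be reviewed or excluded before training. What counts as an acceptable spread depends on the assay and is left open here.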