Virtual Screening of DPP-4 Inhibitors Using QSAR-Based Artificial Intelligence and Molecular Docking of Hit Compounds to DPP-8 and DPP-9 Enzymes

Background: Dipeptidyl Peptidase-4 (DPP-4) inhibitors are becoming an essential drug in the treatment of type 2 diabetes mellitus, but some classes of these drugs have side effects such as joint pain that can become severe to pancreatitis. It is thought that these side effects appear related to their inhibition against enzymes DPP-8 and DPP-9. Objective: This study aims to find DPP-4 inhibitor hit compounds that are selective against the DPP-8 and DPP-9 enzymes. By building a virtual screening workflow using the Quantitative Structure-Activity Relationship (QSAR) method based on artificial intelligence (AI), millions of molecules from the database can be screened for the DPP-4 enzyme target with a faster time compared to other screening methods. Result: Five regression machine learning algorithms and four classification machine learning algorithms were used to build virtual screening workflows. The algorithm that qualifies for the regression QSAR model was Support Vector regression with R 2pred 0.78, while the classification QSAR model was Random Forest with 92.21% accuracy. The virtual screening results of more than 10 million molecules from the database, obtained 2,716 hit compounds with pIC50 above 7.5. Molecular docking results of several potential hit compounds to the DPP-4, DPP-8 and DPP-9 enzymes, obtained CH0002 hit compound that has a high inhibitory potential against the DPP-4 enzyme and low inhibition of the DPP-8 and DPP-9 enzymes. Conclusion: This research was able to produce DPP-4 inhibitor hit compounds that are potential to DPP-4 and selective to DPP-8 and DPP-9 enzymes so that they can be further developed in the DPP-4 inhibitors discovery. The resulting virtual screening workflow can be applied to the discovery of hit compounds on other targets.


Background
Nowadays, Dipeptidyl Peptidase-4 (DPP-4) inhibitors become an important oral antidiabetic drug for the treatment of type 2 diabetes. Sitagliptin was introduced in 2006 as the first DPP-4 inhibitor agent. Furthermore, this class of drugs is increasingly shifting the role of sulfonylurea in type-2 diabetes treatment in some national and international guidelines.
This class of drugs works differently from most other antidiabetic drugs. The DPP-4 enzyme inhibition, besides stimulating the pancreas to produce and release insulin into the blood, also reduces or normalizes body weight and does not cause hypoglycemia. A different mechanism of action and effects with other antidiabetic groups makes this class of drugs exciting to develop [1,2].
Some of the DPP-4 inhibitors have been circulating in the market. These drugs are welltolerated, but some of them have side effects such as joint pain and even pancreatitis. These side effects appear to be related to inhibition of the enzyme homology, such as DPP-8 and DPP-9 [3,4]. Therefore, it necessary to discover and develop new DPP-4 inhibitors that are selective against DPP-8 and DPP-9 enzymes.
The discovery of new DPP-4 inhibitors can be made in several ways, such as high throughput screening (HTS), which is generally carried out by the large pharmaceutical industry. Another alternative that can be done is computer-aided drug design (CADD) by conducting virtual screening [5] on a vast database containing millions of compounds such as ChEMBL [6], PubChem [7]. One of the virtual screening methods that can be developed using a quantitative structure-activity relationship (QSAR) that can predict the activity of a molecule against the DPP-4 enzyme by establishing a relationship between the molecular structure represented by descriptors or fingerprints with the activity of several research results on the DPP-4 enzyme.
The QSAR methods that can be applied were regression QSAR or classification QSAR [8].
In this study, virtual screening of the database will be done by building a virtual screening workflow [9,10], which uses the QSAR classification models to predict active compounds from the database and then uses the QSAR regression method to see the value of its activity.
The developed QSAR method is impossible when using conventional methods, so this study tries to develop a QSAR method based on artificial intelligence (AI) with a supervised learning algorithm. With this method, the prediction of thousands or even millions of molecules activity from a database can be made faster than other virtual screening methods [11].
Those studies generally only used certain compound derivatives, and with a small dataset, so the prediction ability was limited only to that derivatives. The principle of QSAR is that similar structures have similar activities [15], so when using a small dataset and only limited to a derivative compound, the predictive ability is limited and will be biased, especially for virtual screening in large databases. Therefore, this study tried to develop a method for predicting the activity of DPP-4 inhibitor compounds, which is more extensive, not only limited to one derivative but also to various types of derivatives and with a dataset that is more extensive than experimental data that has ever existed before. DPP-4 inhibitor screening with a machine learning approach using thousands of DPP-4 inhibitor datasets has been conducted [16], but this method needs to be further developed with higher accuracy above 90%.
The QSAR model that will be developed to build virtual screening workflows was a model that meets QSAR statistical parameters standard [8,17] for regression models and high accuracy for classification models. The methods used are predictive and reliable for predicting the activity of the new compound. For this purpose, five machine learning regression algorithms were built (i.e., XGboost Tree Ensemble, Random Forest, Support Vector Regression, Deep learning, and Multiple Linear Regression) and four machine learning classification algorithms (i.e., XGboost Tree Ensemble, Random Forest, Support Vector Machine, and Deep learning) [8,18].
To produce a potential hit compound that has a high inhibitory activity against DPP-4 and low inhibitory activity against DPP-8 and DPP-9, the hit compound from the results of virtual screening is carried out molecular docking to DPP-4, DPP-8, and DPP-9. Related ligands used were compounds that have been available in the market, such as Trelagliptin, Alogliptin, Omargliptin, and Carmegliptin [19,20]. Potential hit compounds have inhibitory activity (Ki), which is not much different or even higher than the comparative ligand in DPP-4 enzymes and low inhibitory activity in other DPP enzymes [3]. the salt and small fragments were removed, the molecular structure was normalized, and the duplication was removed [9,21], leaving 3,740 compounds. Furthermore, the data are classified into active and inactive compounds that used for modeling, with pIC50 activity above 7.5 was active compounds, compounds with pIC50 below 6 were inactive compounds, and compounds with pIC50 between 7.5 and 6 were grey compounds that removed [16]. The remaining 2307 compounds were used for model development.

Calculation of descriptors and fingerprint
The descriptor and fingerprint calculations were performed with four nodes in KNIME [22,23], i.e., RDKit Descriptor Calculation and Fingerprint from RDKit, and Fingerprint and Molecular Properties of CDK [24]. The total descriptors and fingerprint of each molecular structure were 17,784 features for the classification model and 19,875 features for the regression model.
The dataset is randomly partitioned, 80%: 20%. Then repartitioned into 72% training sets for the model training set, an 8% validation set was used to see the model performance, and a 20% test set (external validation) was used to see the model generalization ability.

Feature selection
The best features were selected from several feature selection methods. A dimension reduction with PCA, height correlation, and random forest [26], and overall features of the RDKit descriptor, ECFP2 fingerprint, MACCS, Pubchem, or a combination of RDKit descriptor features with the MACCS fingerprint were applied. These features were tested on several machine learning models, features with the highest accuracy used as features for modeling.

QSAR modeling
Five machine learning algorithms were used to build the QSAR regression models, i.e., Deep Learning, XGBoost Tree Ensamble, Multiple Linear Regression, Random Forest, and Support Vector Regression. For the QSAR classification models, four machine learning algorithms were used, i.e., Deep Learning, XGBoost Tree Ensamble, Random Forest, and Support Vector Machine.
The partitioned dataset was used to build machine learning models from training, validation to testing ( Figure 1). The hyperparameter value of each model was determined through optimization with a random search (a combination of 100 experiments). The hyperparameter that produces the best performance model will be used for internal and external validation.

Evaluation of the QSAR regression model
To  (1) is the actual value, ŷ is the predicted value, and ̅ is the actual average value. The minimum value recommended for the QSAR regression models to produce a reliable prediction and predictive value of the correlation coefficient (R) for in vivo data ≥ 0,8 and the coefficient of determination (R 2 ) ≥ 0,6 [27].
Internal validation was done by ten times cross-validation. This method is widely used to evaluate the performance of machine learning models because it is more efficient for big data.
In this validation process, data is sorted and divided into ten random subsets. One subset is used as a test set, and the other is used as a training set. After ten repetitions, the result is the average performance of the ten subsections [28]. Over-fitting of the model is usually suspected when the R 2 value of the training model is significantly higher than 25% than the R 2 cv value [27]. Golbraikh and Tropsha, (2002) proposed a set of parameters to determine the external predictability of the QSAR Regression models, i.e. the model is considered satisfactory if all the following conditions are met: , and R 2 (pred) are the coefficient of determination from internal validation results, and the test data, respectively. 0 2 and ′ 0 2 are the coefficient of determination between actual and predicted activities at zero intercepts, and between predicted and actual, respectively.
Furthermore, k and k' is the slope of the regression line through the origin point [17]. (2015) proposed rm 2 (test) for external validation. rm 2 (test) value was calculated using the square of the correlation coefficient between actual and predicted activities from the test dataset. For acceptable predictions, 2 ̅̅̅ (test), the value should be lower than 0.2, provided that the value of ∆rm 2 (test) is more than 0.5.

5-fold cross validation and external validation sets are used to test the classification model
performance. All models are evaluated with: Precision (FP Rate) = TP / (TP +FP) Receiver Operating Characteristic (ROC) plot was used to display the graphical behavior of the model. Visualization between the X-axis (1-Specificity) and the Y-axis (Sensitivity) will show the perfect model with the area under the curve (AUC) 1. The AUC of 0.5, will then the classification model has no discriminatory power at all [8,29].

Testing the QSAR model workflow on other targets
To see the ability of QSAR method workflow automation implementation, several targets from the ChEMBL database (https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL2525/) were used as a modeling dataset to see workflow prediction capabilities, i.e., the opioid sigma receptor (CHEMBL287) and the adrenergic Beta-1 receptor (CHEMBL213). Activity data for each target was downloaded from the ChEMBL website based on its ID and analyzed with KNIME. Five machine learning regression algorithm was used to see its ability on other models, i.e., deep learning, XGBoost tree ensemble, multiple linear regression, random forest, and support vector machine.

Molecular docking of hit compound into DPP-4, DPP-8, and DPP-9 enzymes
The most active of the hit compounds were collected from the virtual screening results. The molecular docking was carried out in the DPP-4 enzyme, while the selectivity is compared to the DPP-8 and DPP-9 enzymes [36]. The crystal structure of the enzyme used was downloaded from the protein data bank (PDB) [37] with criteria derived from the Homo sapiens organism.
The potential hit compound has an inhibitory activity against the DPP-4 enzyme that is higher or equal to the original ligand. In contrast, for selectivity to DPP-8 and DPP-9, the potential hit compound has a low inhibitory activity > 10,000 nM [3].

Results and Discussion
The diversity of datasets is a critical factor in the QSAR method. Several QSAR methods have been developed to study DPP-4 inhibitors. Some previous research mainly focuses on local (conventional) QSAR modeling studies, both with QSAR-2D, QSAR-3D, and pharmacophore.
However, this method only provides reliable predictions is limited chemical space. So, a more extensive and more diverse set of data from the ChEMBL database is used to build the global QSAR method [43]. the modeling data has a significant heterogeneity of chemical spaces so that the prediction ability becomes wider (global) to predict the new compounds pIC50 value from various compounds derivative against the DPP-4 enzyme [44].

Feature selection
The feature selection method developed for seven types of features (Figure 3 In the regression model, goodness-of-fit of MLR and deep learning were very low when compared to the SVR and ensemble methods. That is because MLR itself is intended for linear data, so when handling nonlinear data the performance becomes low. At the same time, in deep learning, this is likely due to the lack of training data so that the model performance was not optimal. Generally, a small dataset on deep learning shows performance under conventional machine learning methods, but in a large dataset, the ability of deep learning model remains consistent. These results showed contrary to the machine learning method, which optimal on datasets [45].

Internal validation
From the internal validation results of the regression QSAR model, the comparison between the model training results error and the internal validation is no more than 25% for all models.
It means that the resulting QSAR model did not experience overfitting [27]. The internal validation model with the lowest error (MSE) (Figure 4) was the SVR model, then XGBoost Tree Ensemble and Random Forest.
From the results of internal validation on the QSAR classification models (Table 1), three algorithms, i.e., Deep Learning, XGBoost, and Random Forest, showed excellent performance with accuracy above 90%. Of all the QSAR classification models, the model with the best performance was Random Forest.

External validation
The ideal QSAR regression model with high predictability has intercept values 0 and slope 1 in the equation of the line between actual and predictions, and the regression coefficient is equal to 1. Seen from the k value (slope) between actual and prediction and k' between prediction and actual without an interception, the regression line approaches 1 [17]. All the resulting QSAR regression models had met the requirements for the parameters k and k'. Other than that, the predictive model must have proximity between R 2 , 0 2 , and ′ 0 2 values with the condition < 0.1. The method developed in the test with external validation data, i.e., DL, MLR, and RF method, has a difference with 0 2 value more than 0.1 so that these models did not meet the requirements to become predictive QSAR models.
To verify the closeness between the observed and predicted data, the minimum value of the parameter 2 ̅̅̅̅ and ∆ 2 must be met [8]. From the calculation results ( that there is a significant difference between the curves formed by the predictive value and the actual point of origin (zero intercepts). The perfect method will produce a regression line that lies in the regression coefficient one and intercept 0. The further the slope between the plot formed from predictions results of the actual at zero intercepts, the predictive ability of the method will decrease because it will further increase the residual value between the actual data and the predicted results.
External validation results of the classification method in table 3 and the ROC curve ( Figure   5) showed three algorithms, Deep Learning, XGBoost Tree Ensemble, and Random Forest were the classification models with the excellent predictive ability (higher than previous studies with an accuracy above 80% [16]). Deep learning and XGBoost algorithms showed better performance on external validation and ROC curves. However, to build a virtual screening workflow, the Random Forest algorithm was used. This algorithm returned ROC curves with the highest accuracy compared to other algorithms, using internal and external validation. The value is not much different from deep learning and XGBoost algorithms.

Testing QSAR regression workflow on other targets
The results of the QSAR model's prediction performance test against several targets (Table 4), produced predictions with a high enough coefficient of determination reaching 0.7 in the SVR, XGBoost, and Random Forest models. This prediction results showed the ability to predict our workflows against other targets automatically from raw datasets that are directly downloaded from the ChEMBL database.

Virtual screening results
Virtual screening results of the ChEMBL, PubChem, and Malport databases obtained several hit compounds ( Figure 6). This hit compound was tested for similarity with the DPP-4 ChEMBL database, with similarity values below 0.85. Furthermore, it was also tested on DPP-4 DUD-E decoys. Finally, the compound must meet Lipinski's rule of five to be absorbed into the body. As a result, about 2,716 compounds can pass through this phase.

Molecular docking results
Molecules that show potential with high inhibitory activity (Ki <100 nM) in the crystal structure of the DPP-4 enzyme include hit compounds with id CH0001, CH0002, and CH0003.
These hit compounds come from the ChEMBL database; some of these compounds even have higher inhibitory activities than the original ligand. For example, CH0001 and CH0002 hit compounds showed higher inhibitory activity compared to trelagliptin ligands in DPP-4 (5KBY) crystal structure. The CH0003 hit compound has a higher inhibitory activity than the alogliptin ligand in DPP-4 (2ONC) crystal structure. The CH0002 hit compound based on visualization results did not fully meet the eight potential and selective ligand criteria proposed by Ojeda-montes et al. [35] because there was an interaction with the amino acid ligand Trp629 on DPP-4 (5KBY), which could reduce its activity. However, the bonding of the CH0002 hit compound to the DPP-4 enzyme (4PNZ) has a bond with the active site S3 with the amino acids Phe357 and Arg358 which are essential for selectivity ( Figure 8) so that this hit compound deserves to be the best hit compound from this virtual screening results.

Conclusion
It can be concluded that the QSAR method based on artificial intelligence developed both the regression method and the classification method, produces a virtual screening workflow that meets the qsar statistical parameter standards. In the regression model, this is evidenced by the fulfillment of all statistical parameters proposed by Golbraikh & Tropsha, (2002) and Roy & Kur, (2016). The algorithm that satisfies the QSAR statistical parameters was Support Vector Regression (SVR). In the classification model, the accuracy obtained was > 90%, and the highest ROC curve obtained was 0.96, far above the standard > 0.5. The best algorithm in the classification method is Random Forest.
The best machine learning algorithm of classification and regression models was used to build virtual screening workflow. Random Forest was used to predicting active compounds and SVR was used to predict its activity. From the results of MolPort, ChEMBL and Pubchem database screening with a total of more than 10 million compounds, hit compounds with pIC50 activity

Consent for publication
Not Applicable.

Competing interests
We declare that we have no significant competing financial, professional, or personal interests that might have influenced the performance or presentation of the work described in this manuscript.

Authors' contributions
All authors read and approved the final manuscript.   There are seven features developed to get the best method, using four learning models.