Predicting the bioactivity of 2-alkoxycarbonylallyl esters as potential antiproliferative agents against pancreatic cancer (MiaPaCa-2) cell lines: GFA-based QSAR and ELM-based models with molecular docking

Background The number of cancer-related deaths is on the increase, combating this deadly disease has proved difficult owing to resistance and some serious side effects associated with drugs used to combat it. Therefore, scientists continue to probe into the mechanism of action of cancer cells and designing novel drugs that could combat this disease more safely and effectively. Here, we developed a genetic function approximation model to predict the bioactivity of some 2-alkoxyecarbonyl esters and probed into the mode of interaction of these molecules with an epidermal growth factor receptor (3POZ) using the three-dimensional quantitative structure activity relationship (QSAR), extreme learning machine (ELM), and molecular docking techniques. Results The developed QSAR model with predicted (R2pred) of 0.756 showed that the model was fit to be validated parameter for a built model and also proved that the developed model could be used in practical situation, R2 for training set (0.9929) and test set (0.8397) confirmed that the model could successfully predict the activity of new compounds due to its correlation with the experimental activity, the models generated with ELM models showed improved prediction of the activity of the molecules. The lead compounds (22 and 23) had binding energies of −6.327 and −7.232 kcalmol−1 for 22 and 23 respectively and displayed better inhibition at the binding sites of 3POZ when compared with that of the standard drug, chlorambucil (−6.0 kcalmol−1). This could be attributed to the presence of double bonds and the α-ester groups. Conclusion The QSAR and ELM models had good prognostic ability and could be used to predict the bioactivity of novel anti-proliferative drugs.


Background
In the world, pancreatic cancer (PC) is ranked the fourth highest cause of cancer-related deaths and the fourteenth most common cancer [1]. By the year 2030, it is predicted that it would become the second-most common cause of cancer-related deaths [2]. Pancreatic cancer is mainly divided into two types: (i) pancreatic adenocarcinoma (PA), the most common (85% of cases), arises in exocrine glands of the pancreas, and (ii) pancreatic neuroendocrine tumor (Pan-NET) which occurs in the endocrine tissue of the pancreas and less common [3]. MIAPaCa-2 and PANC-1 are two commonly used cancer cell lines for studies of PA [4]. PA is usually advanced at the time of diagnosis as it is relatively symptom-free [5]. With its poor prognosis, only 24% of people survive 1 year and 9% live for 5 years [3].
Late disease diagnosis, attainment of resistant characteristics, the paucity of effective therapies, and metastatic nature among others are some of the proposed reasons for the poor survival rate of PC patients [6]. Cytotoxic treatments presently used have failed, example of such is gemcitabine [7], with patients hardly survive more than 6 months after therapy, according to study [6]. 5-Fluorouracil, however, has been shown to be effective and boosts survival rate of patients [6]. In the last decades, the repurposing of approved drugs to treat cancer birthed drugs like celecoxib, metformin, sulindac, or Tri-FluoroPerazine (TFP). Although, there are limitations associated with them, an instance with TFP is that patients develop neurological side effects when it is used to treat cancer [8].
Some compounds such as methyl protodioscin could inhibit proliferation and promote apoptosis of MIAPaCa-2 cells [9]. A small-molecule, trisubstituted naphthalene diimide compound, is potent against PC cell line MIAPaCa-2 [10]. Another compound is ursolic acid, a popular antiinflammatory and immunosuppressive agent which inhibited growth and induced apoptosis in a dose-dependent manner [11]. Due to this, the necessity of developing better and more effective therapeutics with improved activity at low concentrations that will ensure a longer survival rate and inhibit resistance against the MIAPa-Ca cancer cell lines cannot be overemphasized.
Quantitative structure activity relationship (QSAR) is one of the commonest approaches in ligand-based drug design process among Scaffold hopping, pseudo-receptor and pharmacophore modeling. The key goal of QSAR studies is to determine a mathematical model between the property under investigation, and one or more molecular descriptors [12][13][14]. Using the model, similar bioactivities of compounds not involved in the training set can be predicted from their structural descriptors [15]. Here, we used genetic function approximation (GFA) and ELM models to predict the bioactivity of the molecules under investigation. ELM belongs to a class of feed forward neural networks characterized with a single hidden layer. The algorithm attains its uniqueness through random determination of the biases as well as the weights connecting hidden and input layers [16]. These features result into fast convergence, high training speed while the simple structure of ELM algorithm is maintained and thereby leads to enhanced computational efficiency coupled with impressive robustness [17].
Molecular docking analyzes the pose of molecules into the binding site of a macromolecular target and also probe into the mechanisms of binding of bioactive compounds and biological targets/receptors [3,18,19]. Therefore, this work is aimed at using QSAR to create a mathematical model from the selected training set of 2alkoxycarbonylallyl esters derivatives (available in the literature) [20] to predict the activity of these compounds as anticancer agents against MIAPaCa-2 cancer cell lines and elucidate the interaction of these compounds as MIAPaCa-2 cancer cell line inhibitor via molecular docking studies.

QSAR Studies Data collection
A series of twenty-four compounds of 2-alkoxycarbonylallyl esters [20] as a potential anticancer were collected from the literature. The curative activities of the compounds against given in IC 50 (μM) were converted into their corresponding pIC 50 values (i.e., -log IC 50 = pIC 50 ) in order to make the activities fit to a range of values and also to suit normal distribution curve. The experimental activities, pIC 50 values, and compounds are presented ( Table 1).
The 2D structures of the compounds (Table 1) were drawn using ChemDraw Ultra 12.0 [21] and then saved as cdx file. The cdx file format of the compounds drawn was moved to the Spartan 14 software [22] and optimized using molecular mechanics with the molecular mechanics force field (MMFF) to generate their most stable conformers followed by a restricted hybrid Hartree Fock-density functional theory self-consistent field (HF-DFT SCF) calculation using Pulay's DIIS and geometric direct minimization with the 6-31G* basis set [23]. The molecular structures were saved as sdf file format after optimizing the structures.

Molecular descriptor calculation
The optimized compounds were subjected into PaDEL-Descriptor software version 2.20 [24] in order to calculate the 1D, 2D, and 3D descriptors of the compounds. After removing salt, detecting tautomer and retaining the file name as molecule name, a total of 1875 descriptors were generated saved as Microsoft Excel Comma Separated value (csv) file.

Mathematical description of the proposed ELM algorithm
ELM formalisms allow three layers with connected neurons. Pre-supposing that the number of neurons in the output, hidden, and input layers are indicated by ξ,η and μ, respectively with the threshold number of neurons for the input layer set as I = [I 1 , I 2 , .…I μ ] T . Equation 1 defines the weights linking the hidden with input layer as represented by ψ while Eq. 2 presents the weights linking the hidden with output layer is χ.
: : : : ψ μ1 ψ μ2 … ψ μη The experimental activities of the investigated compounds as well as the implemented descriptors are presented in matrix form as depicted in Eq. 3 and Eq. 4, respectively. where j is the number of training samples. With inclusion of a differentiable activation function γ(.)and network output α, the mathematical expression representing the working principles of ELM algorithm is presented in Eq. 5 [25,26].
Equations 6 and 7 respectively presents the matrix form of hidden layer output (P) and network output.
while P + stands for the Moore-penrose generalized inverse of P.

QSAR methods
Normalization and data pre-treatment The descriptors were normalized (Eq. 9) [27] and pretreated using the Data Pre-treatment software [28].
where X 1 is the descriptor's value for each molecule, X min and X max are minimum and maximum value for each descriptor. This is done in order to filter descriptors with redundant data, highly correlated data, reduce colinearity thereby improving the performance of the model.

Data division
The pre-treated dataset was split into training and test sets by employing Kennard and Stone's algorithm [29]. The training set, 70% of the data sets (16 compounds) was used to build the model and validated internally while 30% of the data sets (8 compounds) were used to externally validate the built model.

Model development
Material studio 2017 software was used to build the model employing the GFA method while fixing the biological activities (pIC 50 ) as the dependent variable and the physiochemical properties (descriptors) as the independent variables.

Internal validation of model
The training compounds were validated internally using material studio. The validation parameters include the following: Friedman's lack of fit (LOF) The models generated were appraised using LOF to obtain their fitness score (Eq. 10) [30] SEE, the standard error of estimation; p, the number of descriptors used; d is a user-defined smoothing parameter, c is the number of model terms, and M is the number of compound in the training set [31]. For a good model, SEE value must be low, it is given as (Eq. 11): where n is the number compounds that made up the training set, Y exp is experimental activity and Y pred is the predicted activity in the training set The correlation coefficient (R 2 ) This is the mostly used for internal assessment of QSAR models. The R 2 (Eq. 12) is expected to be very close to 1 to generate a good model.
where Y training is the mean of the experimental activity.
Adjusted (R 2 ) (Eq. 13) value varies directly with the increase in number of descriptors, it is important because it measures the reliability and stability of a model unlike R 2 .
The cross validation coefficient ðQ 2 cv Þ The strength of the QSAR model was cross-validated (Eq. 14).
Where Y training , Y exp , and Y pred were defined earlier Statistical analysis of the descriptor Variance inflation factor VIF (Eq. 16) measures the multicolinearity of the descriptors among each other and also measures the degree at which one descriptor correlates with the others in a model.
R 2 is the multiple correlation coefficient between all variables used in the model. If the VIF = 1, it signifies that there is no intercorrelation in each variable, and if it is between 1 and 5, it is acceptable. VIF value > 10 makes the model unstable.
Mean effect This correlates the effect of a particular molecular descriptor on the activities of the compounds. A change in the descriptors' values improves the activity of the compounds. It is defined (Eq. 17): where B j and D j are the j-descriptor coefficient in the model and the values of each descriptor in training set, while m and n stand for the number of molecular descriptors and number of molecules in a training set respectively. Therefore, the mean effect of each descriptor used in building the model was calculated to assess the significance of the model [32].
Evaluation of the applicability domain of the model This is a vital step in establishing that the model is good to make predictions [33]. The leverage approach [34] was applied here. Leverage of a given chemical compound hi, is defined as (Eq. 18): X i is the training compounds matrix of i. X is the m × k descriptor matrix of the training set compound and X T is the transpose matrix of X used to build the model. The warning leverage, h* (Eq. 19) is the limit of normal values for X outliers: where n and k are the descriptors and the training set compounds respectively.
Quality assurance of the model The validation parameters test the strength, dependability, and predictiveness of a built-in QSAR model. Consequently, Table 2 establishes general minimum specifications for both internal and external validation parameters to validate a QSAR model [34]. The list of descriptors, their descriptions, and dimension are presented (Table 3).

ELM-based models
The modeling and simulation that leads to the development of the proposed ELM-based models was implemented on MATLAB. The descriptors to the model as well as the experimental activities of the investigated compounds were initially randomized prior to separation into training and testing sets of data in the ratio of 4:1, respectively. Randomization enhances even distribution of data points and improves computational efficiency of the algorithm. The training dataset was employed in generating weights which are further validated using testing set of data. The procedures for the computational implementation of the proposed ELM-based models are detained as follow: Step I: Initial random number generation: Pseudo random number that controls the randomly generated biases as well as the weights linking the hidden with  input layers were initiated with the aid of seeding in MATLAB.
Step II: Optimum activation function and corresponding hidden nodes: One activation function was selected after the other within a pool of functions which include sine function, hardlim function, and sigmoid function triangular basis function among others. Hidden nodes were optimized within search space of 1 to 100.
Step III: Computation of hidden layer output matrix: Equation (6) was implemented on training dataset for the calculation of the hidden layer output matrix.
Step IV: Output weight computation: with the aid of the least square solution to the set of the generated linear system of equations, the output weights were computed.
Step V: Evaluation of the developed models: The developed models during the training phase of the simulation were evaluated using testing set of data. The performance of the model was assessed using four different performance measuring parameters which include MAE, RMSE, CC, and MAPD. The models characterized with lowest error (MAE, RMSE, and MAPD) and highest CC was saved as the best models. The hyper-parameters of the best model as well as the weights were also saved for ensuring reproducibility of the results.

Molecular docking and ADME/Tox screening Protein and ligands preparation
The 3D crystal structure of an epidermal growth factor receptor (EGFR), a kinase domain (PDB ID: 3POZ) was downloaded from the protein data bank (Fig. 1) [35]. The receptor has two ligands (O3P) and sulfate ion (SO 4 ) in its active sites [36]. The receptor was chosen because it is an EGFR which best works with cancer lines caused by cell proliferation and its high resolution of 1.5 Å [36,37]. The downloaded protein was refined using the Protein Preparation Wizard [37,38] by assigning charges and bond orders with the removal of water molecules and addition of hydrogens to the heavy atoms. Energy minimization was done using OPLS3 [38,39].

Ligand preparation
The lead molecules (22 and 23) were selected for docking and prepared using ligand preparation (ligprep) in Schrödinger Suite 2017-1 with an OPLS3 force field [39] in order to create three dimensional geometries and to assign proper bond orders. Epik 2.2 in Schrödinger Suite at pH 7.0 ± 2.0 was used to generate the ionization state [40]. They, alongside with a standard drug, chlorambucil were docked at the active site of the receptor (x = 20.3785826772, y = 32.8299212598, and z = 14.8903149606). Table 6 Calculation of the predicted R 2 of developed model

ADME/Tox screening
The compounds were further screened for their druglikeness by calculating the absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties using the QikProp program [41]. These properties check drug fitness and save cost involved in bioassay studies [42]. Descriptors like molecular weight (Mw), number of hydrogen bond donors (donorHB) and number of hydrogen bond acceptors (acceptHB), solvent accessible surface area (SASA), octanol/water partition coefficient (QPlog Po/w), and value for serum protein binding (QPlogKHsa) were considered (Table 11).

QSAR results of 2-alkoxycarbonylallyl esters
A QSAR model was developed to predict the activity of MIAPaCa-2 cancer cell lines in 2-alcoxycarbonylallyl esters (pIC50). The multiple linear regression (MLR) for the built model is shown in Eq. 20.
The list of descriptors, their constructors, description, and dimension used in building the QSAR model are reported in Table 3. Table 4 represents the comparison between experimental activity (pIC50), predicted activity (pIC50), and residual of the developed model which is used in validating the model externally as reported in Tables 5 and 6.  Table 7 presents the Pearson's correlation, mean effect, and variance inflation factor of the descriptors used in building the model. The experimental activity is plotted against the predicted activity for both training set, and test set shown in Fig. 2. The scatter plot between standardized residual activity and experimental activity which explains the randomness of the activities on both  negative and positive sides of y-axis is in Fig. 3. While Fig. 4 presents the Williams plot of standardized residual against leverages for the developed model.

Comparison of the estimates of developed QSAR and ELM-based models
The performance of the developed ELM-based models was compared with that of QSAR model using four different performance measuring parameters including mean absolute error (MAE), root mean square error (RMSE), correlation coefficient (CC), and mean absolute percentage deviation (MAPD). The outcomes of the comparison are presented in Table 9 and Fig. 5a-d. The comparison of the outcomes of each of the developed models with inclusion of percentage error for each of the investigated compounds is presented in Table 10.

Molecular docking
Molecular docking studies were carried out to understand the mechanism of action of some 2-alkoxycarbonylallyl esters against pancreatic cancer (MiaPaCa-2) cell line targeting the epidermal growth factor receptor (EGFR). The structure of the prepared receptor is shown in Fig. 1. The reference drug (chlorambucil) and lead compounds (22 and 23) were docked with the epidermal receptor growth factor (3POZ) to elucidate the interaction and the binding mode. The binding affinities and ADME/Tox properties of chlorambucil and the lead compounds (22 and 23) are presented in Table 11. While the interactions between the epidermal receptor growth factor (3POZ) with the chlorambucil and lead compounds (22 and 23) are presented respectively in Figs. 5, 6, and 7.

QSAR results of 2-alkoxycarbonylallyl esters
Four different models were generated using the genetic function approximation technique. The first model was selected to be the best model due to its statistical significance and the fact that it satisfied the recommended standard for a stable and reliable model as outlined (Table 2). 2D descriptors played a vital role in predicting the activity of new molecules that can inhibit against MIAPaCa-2 cancer cell line. The positive coefficient of ATSC3c, MATS5p, minHBint5, and ETA_Shape_P descriptors in the model inferred that increase in the coefficient of the descriptor will improve the activity (pIC 50 ) of 2-alkoxycarbonylallyl esters against MIAPaCa-2 cancer cell line.   Hence, to design potent compounds with high pIC 50 value, the positive coefficient of the descriptors will have to be increased.
The low values of residual recorded in Table 4 affirmed that there is high correlation between the experimental activities and predicted activities. The value of R 2 pred (0.7560) signifies that the model has passed the minimum recommended value for validating parameter for a built model presented in Table 2. The idea that once the R 2 predicted value is considered satisfied, the remaining parameters will also be satisfied is not always true as additional statistical analyses can help validate the built model such as variance inflation factor (VIF) and mean effect (ME). The closer the value of R 2 test to 1.0, the better the stability the model generated. Therefore, in prediction of the behavior of a new compound, the stability will take into account model reliability.
The Pearson's correlation, ME, and VIF of the descriptors are presented in Table 7. The low correlation values ≤ 0.5 in most descriptors inferred that the descriptors do not correlate with one another. This ascertains that there is no bias in the prediction made by the model. The mean effect of the descriptors shown in Table 7 signifies the effect of molecular descriptors on the activity pIC 50 of the compounds. Thus, the order of decreasing effect is ETA_Shape_P > ATSC3c > MATS5p> minHBint5.
The experimental activity was plotted against predicted activity for both training set and test set is reported in Fig. 2. The high value of correlation coefficient R 2 for training set (0.9929) and test set (0.8397) confirmed that the model can successfully predict the activity of a new compound due to its correlation with the experimental activity. The randomness of the activities on both negative and positive sides of y-axis shown on the scatter plot between standardized residual activity and the experimental activity reported in Fig. 3 confirmed the built model was free from systematic error.
To discover outliers and influential compounds in the built model, the standardized residual activity for the entire data set was plotted against the leverages. Williams plot (Fig. 4) confirmed that there was only one influential compound (20) with leverage value of 1.00 which is greater than the warning value (h= 0.9375).

Comparison of the estimates of developed QSAR and ELM-based models
Using MAE metric, the ELM-Sig built model was more efficient than ELM-Sine and QSAR models with a performance improvement of 4.09% and 14.92% respectively. The model performed higher with a performance improvement of 9.53% and 30.68% of the RMSE performance assessment parameter respectively. For CC and MAPD metrics, the developed ELM-Sig model performed better than ELM-Sine and QSAR models with performance enhancement of   Table 10, the results of the developed ELM-sig model show persistence closeness with the measured values. The superiority of the developed ELM-based model over the QSAR model can be attributed to the intrinsic feature of ELM algorithm in approximating nonlinear and complex relationship linking the descriptors to the target. Ability of sigmoid activation function to approximate well is reveled from the performance of ELM-Sig model over that of ELM-Sine model.

Molecular docking
The lead compounds (22 and 23) showed good docking score in the receptor's active sites and are better than chlorambucil, a standard drug (Table 11). Compound 22 (−6.327 kcalmol −1 ) showed conventional hydrogen bond with CYT 737 and ASP 855 from its carbonyl oxygen from the α-ester group (Fig. 5). Compound 23 (−7.232 kcalmol −1 ) also showed conventional hydrogen bond with ASP 855 from its carbonyl oxygen from the α-ester group (Fig. 6). Chlorambucil (−5.826 kcalmol −1 ) showed no conventional hydrogen bond with any of the amino acid groups (Fig. 7). The lead molecules have better activity than chlorambucil as reported by Conor et al. [20]; the bonds responsible for their bioactivity according are the double bond and the α-ester groups. The ADME/Tox properties (Table 11) showed that these compounds are fit as drugs, according to Lipinski's rule of five [19,43,44].

Conclusion
In this study, a GFA model, together with two EML models were used to predict the potential activity of 2alkoxycarbonylallyl esters as anticancer against MIAPaCa-2 pancreatic cancer cell lines while also probing into the mechanisms of interaction of the lead compounds against an epidermal growth factor receptor kinase domain, 3POZ via molecular docking approach. The GFA model generated was thoroughly validated; this model was compared to with ELM-Sig model and ELM-Sine model, with the ELM-Sig model proving the best in bioactivity prediction of these molecules. Binding of the lead compounds with the receptor showed they had better inhibitory potentials than chlorambucil. In the 2D interaction diagrams, it was seen that the compounds bind to the receptor predominantly through the hydrogen bond interaction and also mainly the carbonyl oxygen atoms of the α-esther group were responsible for their interaction, which is in line with what was observed in the experiment.