In silico design and molecular docking study of CDK2 inhibitors with potent cytotoxic activity against HCT116 colorectal cancer cell line

Background Colorectal cancer is common to both sexes; third in terms of morbidity and second in terms of mortality, accounting for 10% and 9.2% of cancer cases in men and women globally. Although drugs such as bevacizumab, Camptosar, and cetuximab are being used to manage colorectal cancer, the efficacy of the drugs has been reported to vary from patient to patient. These drugs have also been reported to have varying degrees of side effects; thus, the need for novel drug therapies with better efficacy and lesser side effects. In silico drugs design methods provide a faster and cost-effect method for lead identification and optimization. The aim of this study, therefore, was to design novel imidazol-5-ones via in silico design methods. Results A QSAR model was built using the genetic function algorithm method to model the cytotoxicity of the compounds against the HCT116 colorectal cancer cell line. The built model had statistical parameters; R2 = 0.7397, R2adj = 0.6712, Q2cv = 0.5547, and R2ext. = 0.7202 and revealed the cytotoxic activity of the compounds to be dependent on the molecular descriptors nS, GATS5s, VR1_Dze, ETA_dBetaP, and L3i. These molecular descriptors were poorly correlated (VIF < 4.0) and made unique contributions to the built model. The model was used to design a novel set of derivatives via the ligand-based drug design approach. Compounds e, h, j, and l showed significantly better cytotoxicity (IC50 < 5.0 μM) compared to the template. The interaction of the compounds with the CDK2 enzyme (PDB ID: 6GUE) was investigated via molecular docking study. The compounds were potent inhibitors of the enzyme having binding affinity of range −10.8 to −11.0 kcal/mol and primarily formed hydrogen bond interaction with lysine, aspartic acid, leucine, and histidine amino acid residues of the enzyme. Conclusion The QSAR model built was stable, robust, and had a good predicting ability. Thus, predictions made by the model were reliably employed in further in silico studies. The compounds designed were more active than the template and showed better inhibition of the CDK2 enzyme compared to the standard drugs sorafenib and kenpaullone.


Background
Cancer is a major public health problem worldwide and is the second leading cause of death in the USA. Cancer is a term that refers to over 200 independent health conditions in which cells in different body parts divide abnormally and uncontrollably. Four of the most common cancers are lung, breast, prostate and colorectal cancer [1]. Colorectal cancer (CRC) is the third most prevalent cancer in both sexes and the second in terms of the rate of mortality. The cancer accounts for 10% and 9.2% of all cancer cases in men and women globally and causes over 500,000 deaths annually. In the USA, about 147,950 new cases and an estimated 53,200 mortalities from CRC are expected in the year 2020. While in Germany, 1 in 14 men and 1 in 18 women would diagnosed with CRC within their lifetime, and 1 in 32 men and 1 in 39 women will die from CRC [1,2].
CRC results from the abnormal division and growth of colon cells. This abnormal division of cells forms polyps which may be benign or cancerous. The cause of these abnormal divisions is yet to be fully understood [3]. However, risk factors have been associated with age, race, family history, and a sedentary lifestyle. CRC is more frequently diagnosed in persons over the age of 50, persons of African descent, or persons who consume high amounts of tobacco, alcohol, and high-fat diets. Persons with prior health conditions such as obesity and diabetes, are also more susceptible to the cancer [3][4][5]. However, over a fourth of CRC cases are attributed to hereditary factors. The most common forms of hereditary colorectal cancer are familial adenomatous polyposis coli (FAP) and hereditary non-polyposis colon cancer (HNPCC, Lynch syndrome) [2]. Just like most cancers, CRC at its infancy shows no significant symptoms, thus making early detection difficult. However, where symptoms exist, they include changes in bowel frequency, rectal bleeding, abdominal pains, weakness, and weight loss [6]. Diagnosis of CRC can be made by sigmoidoscopy or by colonoscopy while treatment methods include surgery, radiation therapy and drug treatments, such as chemotherapy, targeted therapy, and immunotherapy [2,7].
Drug treatment of CRC employs various strategies, one such strategy is the use of cytotoxic agents, such as 5-fluorouracil, oxaliplatin, and irinotecan. Another drug treatment strategy involves the use of compounds that block certain CRC targets. CRC chiefly targets the epidermal growth factor receptor (EGFR) and the vascular endothelial growth factor (VEGF). Thus, compounds such as cetuximab, bevacizumab, and ramucirumab, which block these targets are commonly employed in the fight against CRC [2,8]. The human cyclin-dependent kinase 2 (CDK2) has also been observed to be overexpressed in CRC patients, thus, inhibition and/or downregulation of this kinase has also emerged as a strategy for tackling CRC [9,10].
CDK2 primarily binds to cyclins A, B, and E and plays an important role in cell cycle regulation. It is responsible for G1 to S phase transition in the cell cycle. In normal healthy cells, CDK2 is dispensable as CDK1 plays mimicking roles. In cancerous cells, however, CDK2 plays a pivotal role in cell growth and progression [9,11]. Overexpression of CDK2 and cyclins A and E has been observed in ovarian, colorectal, breast, prostate, and lung cancer patients [9,10]. Therefore, drugs such as flavopiridol, roscovitine, olomoucine, adapalene, and kenpaullone, which are reported to be CDK2 inhibitors and sorafinib, aspirin (salicylic acid), etc., which have been reported to cause downregulation of the enzyme via various mechanisms have been employed as therapies for these cancers [9,[12][13][14]. However, the nonspecificity and toxicity of most of these drugs ensure that the search for more specific and less toxic candidates goes on [15]. Thus, this study built a robust quantitative structure activity relationship (QSAR) model which predicted the cytotoxic activity of imidazol-5-one compounds against HCT-116 CRC cell line, then, the built model was used to design a novel set of imidazole-5-ones compounds. Furthermore, the interaction between the designed compounds and the CDK2 enzyme was carried out via the molecular docking approach to determine the potential of the compounds to be used as inhibitors of the enzyme.

QSAR Dataset
A series of thirty-six (36) imidazole-5-ones compounds reported to have cytotoxic activity against HCT-116 colorectal cancer cell line were obtained from the literature [16]. The structures of the compounds were obtained, drawn, and are presented in Fig. 1. The activity of the compounds (IC 50 ) ranged from 4.87-97.18 μM. The skew in the activity of the compounds was minimized by converting the activity to their logarithmic scale equivalents using the formula (pIC 50 = −log 10 IC 50 ) [17]. In the logarithmic scale, the cytotoxic activity (pIC 50 ) of the compounds ranged from 4.02-5.31 M.
Geometry optimization, calculation of molecular descriptors, and data division The 2D structure of the obtained compounds was drawn using the ChemDraw Professional v. 16.0 software. The 2D structures were converted to their 3D forms on the Spartan 14 V 1.1.4 software. The ground state equilibrium structures of the compounds were obtained via geometry optimization using the density functional theory's B3LYP/6-31G* basis set [18,19]. The molecular descriptors of the designed compounds were calculated using the PaDEL-Descriptor v 2.21 software. One thousand eight hundred and seventy-five 1D, 2D, and 3D molecular descriptors were calculated for each compound. These molecular descriptors were further pretreated at a correlation cut-off of 0.7 using the DTC Lab Data Pretreatment v 1.2 software. Data pretreatment was carried out to remove highly correlated and redundant molecular descriptors [20]. The pretreated molecular descriptors were subsequently divided into a training and test set using the Kennard-Stone algorithm integrated into the DTC Lab Data Division v 1.2 software. The training set consists of 25 (70%) randomly selected compounds while the test set is made up of the remaining 11 (30%) of the compounds in the dataset [18,21]. The training set was used to build the model while the test

Model building and validation
The dataset of molecular descriptors of the training set compounds was transferred to the Biovia Materials Studio v 8.0 software. The QSAR model was subsequently built using the Genetic Function Algorithm function in the abovementioned Materials Studio software [18]. The stability, reproducibility, and predicting ability of the model were ascertained by subjecting the model to appropriate validation tests. The stability of the model was determined by carrying out a multilinear regression and one-way ANOVA on the molecular descriptors present in the model using the Microsoft Excel 2016 software. Furthermore, the inter-correlation between the molecular descriptors in the model was determined by calculating the variance inflation factor (VIF). VIF was obtained as the leading diagonal of the inverse of the correlation matrix [20]. The external validation of the model was carried out using Equation 1: Where R 2 ext is the external coefficient of determination, Y expi and Y predi are the experimental and predicted cytotoxic activity of the ith test set compound while Y exp train is the mean of the experimental cytotoxic activity of the training set compounds [17]. The robustness of the built model was evaluated by subjecting the training set to a Y randomization test. In the test, a series of randomly generated multilinear regression (MLR) models are generated by shuffling the molecular descriptors which keeping the cytotoxic activity constant [18]. For a robust model, the adopted validation parameters of R 2 and Q 2 are larger than those randomly generated by the Y randomization test. Furthermore, the Y randomization coefficient of determination (cP 2 p ) for a robust model is greater than 0.5 [18]. The Y randomization test was carried out using the DTC Lab's YRandomization software version 1.2. The zone of applicability of the model was graphed out using the leverages approach as described by Ikwu et al. [22].

Ligand-based design
Ligand-based design was employed in designing novel imidazol-5-one derivatives. A template (compound 20) was selected upon which further modifications were made. Compound 20 was selected as a template because it had high cytotoxic activity, a low residual and was well within the applicability domain of the built model [27]. The designed compounds were drawn, optimized, and their molecular descriptors were calculated as described in the "Geometry optimization, calculation of molecular descriptors and data division" section. The relevant molecular descriptors were then copied and inserted into the regression equation generated for the model to predict the cytotoxic activity of the designed compounds [23].

Molecular docking Preparation of enzyme and ligands
The crystal structure of the human cyclin-dependent kinase 2; CDK2 (PDB ID: 6GUE) was obtained from the protein data bank. The downloaded enzyme was prepared using the Discovery Studio 2016 software; the water molecules and ligands present in the downloaded enzyme were deleted. All chains excluding the A chain of the downloaded enzyme were also removed while preparing the enzyme for molecular docking studies [30]. The crystal structure of the prepared enzyme is presented in Plate 1.
The ligands were prepared for molecular docking by converting the equilibrium structures to their protein data bank (.pdb) file formats using the Spartan 14 v 1.1.4 software [30]. Sorafenib has been reported to downregulate CDK2 [14] while kenpaullone has been reported to inhibit CDK2 [12]. Thus, the structures of sorafenib (PubChem CID: 216239) and kenpaullone (PubChem CID: 3820) were obtained, drawn, optimized, and saved in the .pdb file format to prepare them for molecular docking study.

Enzyme-ligand interaction
The prepared enzyme and ligands were transferred to the PyRx software and their binding interaction was monitored. The binding affinity and the interaction files (.pdbqt file format) were obtained [24]. The interaction files obtained were subsequently opened in the Discovery Studio software where the visualization of the interaction was carried out [18].  Table 1 while the model validation parameters are presented in Table 2. The external validation of the built model was calculated in Supplementary Table S1. The definition of the molecular descriptors in the build regression model is  Table 3 while Table 4 presents the statistical parameters of these molecular descriptors. The result of the Y randomization test is presented in Table 5.

QSAR
The predicting ability of the built model was ascertained by comparing the experimental and predicted cytotoxic activity of the compounds in the dataset. The experimental and predicted cytotoxicity of the compounds in the training and test are presented in Supplementary Table S2 and S3, respectively. Figure 2, however, presents a scatter plot of these experimental and predicted cytotoxic activities. While Fig. 3 presents a scatter plot of the experimental activity against the standardized residual for each molecule in the dataset, the zone of applicability of the model was graphed out using the leverages method [18] and plotted in Fig. 4.
Ligand-based design Compound 20 was selected as the template used in ligand-based design. Slight modifications were made to it and twelve derivatives were designed. The molecular descriptors of the designed compounds were obtained and inserted into Equation 2 to obtain their cytotoxicity. These molecular descriptors are presented in Supplementary Table S4. The designed compounds and their corresponding activities are presented in Table 6.

Molecular docking study
The crystal structure of the human cyclin-dependent kinase 2 (PDB ID: 6GUE) was downloaded from the protein databank. The downloaded kinase was prepared using the Discovery Studio 2016 software. The A chain of the downloaded receptor was employed in molecular docking studies using the PyRx software. Multiple enzyme-ligand interactions were carried out by the software and the most stable conformation was presented by the software. Compounds e, h, j, and l were observed to have significantly better cytotoxic activity compared to the template compound and standard drug doxorubicin. Thus, the molecular docking study of these compounds and the template were carried out with the CDK2 enzyme. The standard drugs sorafenib and kenpaullone were also docked with the CDK2 enzyme and their binding affinity was compared with those of the designed compounds. Findings from the molecular docking study are presented in Table 7. Compounds e and h had the highest binding affinity with a value of −11.0 kcal/mol each. The 2D interaction of compounds e and h and the enzyme are presented in Plates 2 and 3, respectively.

QSAR
QSAR is a variant of the quantitative structureproperty relationship approach. This variant posits that the pharmacological activity of a dataset of compounds can be expressed as a linear combination of their molecular descriptors [25]. The built model revealed that the cytotoxic activity of imidazole-5-ones against HCT116 colorectal cancer cell line was strongly dependent on the molecular descriptors nS, GATS5s, VR1_Dze, ETA_dBetaP, and L3i. The regression equation of the model was shown in Equation 2 while the definition of the molecular descriptors was presented in Table 3. The model built was observed to be significant, possessing p < 0.05 (Table 1). The model also met all internal and external validation benchmarks ( Table 2). The predictions made by the model are therefore reliable, and can be employed in further theoretical and experimental studies. A robust model is characterized by the presence of poorly correlated molecular descriptors. This ensures that each molecular descriptor makes a unique contribution to the prediction made by the model [17]. Poorly correlated molecular descriptors have VIF values < 5.0. As shown in Table 4, the molecular descriptors were all poorly correlated (VIF < 3.5). This implies that each molecular descriptor made a unique contribution to the model. However, the contribution of the descriptors nS and L3i were observed to the insignificant (p value > 0.05; Table 4). This was also mirrored by their low mean effects (< 0.02; Table 4). The molecular descriptors GATS5s and ETA_dBetaP (p < 0.05) were the major contributors to the model, both had |ME| > 1.0.
The predicting ability of the model was determined by using the regression equation to predict the cytotoxic activity of the compounds in the training and test set. The predicted activity and residual of each  compound in the train and test were presented in Supplementary Tables S1 and S2. Figure 2 however presented the scatter plot of the experimental and predicted cytotoxic activity of the compounds in the dataset. The coefficient of determination for both the test and train set were both > 6.0. Thus implying that the model had a good predicting ability. The presence of systematic errors in the model was investigated using the standardized residual method. Figure 3 presented a graph of the standardized residual against experimental activity. The data points were observed to be randomly distributed, thus, signifying the absence of systematic errors [18]. The robustness of the model was determined using the Y randomization test. Findings were presented in Table 5. The Y randomization co-efficient of determination (cP 2 p ) was observed to be greater than 0.5, thus implying that the model built was a robust one [17]. The zone of applicability of the model was graphed out using the leverages method and was presented in Fig. 4. The zone of applicability presents those molecules whose cytotoxic activity was satisfactorily predicted by the model. Compounds within the zone of applicability can be satisfactorily employed for further in silico studies [22,26].

Ligand-based design
Ligand-based design is one of two methods used in computer-based drug design. In this design method, a compound of known pharmacological activity is obtained, and slight modifications are made to this compound to obtain a novel set of compounds with potentially better pharmacological activity [27]. In this study, compound 20 was selected as the template compound, because it was very active (IC 50 = 4.87 μM), the model made a good prediction of its activity (Fig. 2) and  it was also well within the zone of applicability of the model (Fig. 4). It was observed that the cytotoxicity of the compounds was dependent on the electronegativity of the substituent groups [16]. Thus, modification of compound 20 primarily involved the introduction of electronegative substituents. The activity of the compounds was observed to increase significantly when fluorine (-F) and Nitro (-NO 2 ) groups are used as substituent. The activity of the compounds also increased with increase in the number of these substituents. Compounds e and h (two -F substituents) for instance were more active than compound c (one -F substituent). Similarly, compound l (two -NO 2 substituents) was more active compared to compound b (one -NO 2 substituent). Six compounds (b, c, e, j, l, and h) were more active than the template compound and the standard drug doxorubicin (Table 6).

Molecular docking study
Molecular docking investigates the interaction between a compound (ligand) and its target (enzyme) at an atomic level. It investigates the binding affinity, nature, and type of bonding and nonbonding interactions between the ligand and amino acid residues of the enzyme [28,29]. Table 7 presented the results of docking studies between compounds e, h, j, and l, and the CDK2 enzyme. The template compound (compound 20) and the standard drugs sorafenib and kenpaullone were also docked with the CDK2 enzyme. The binding affinity of the designed compounds (−10.8 to −11.0 kcal/mol) was slightly higher than that of the template (−10.6 kcal/mol) and was significantly higher than that of sorafenib (−9.7 kcal/ mol) and kenpaullone (−9.4 kcal/mol). The designed compounds formed hydrogen bonds with lysine (LYS20, LYS89), aspartic acid (ASP145), leucine (LEU83), and histidine (HIS84) amino acid fraction of the CDK2 enzyme. The hydrogen interaction formed was primarily with the pyrrolidine and 2, 3-dihydro-1H-pyrrole groups of the designed compounds. Compounds e and h had higher binding affinities (−11.0 kcal/mol each) and this could be attributed to the halogen (fluorine) interaction formed between these compounds and the glutamic acid (GLU8) residue of

Conclusion
The cytotoxic activity of imidazol-5-ones against HCT116 colorectal cancer cell line was modeled via the QSAR approach. The cytotoxic activity of the compounds was dependent on the molecular descriptors nS, GATS5s, VR1_Dze, ETA_dBetaP, and L3i. Novel set of compounds were designed via the ligand-based design approach, and their activity was strongly dependent on the electronegativity of the substituent group(s). Compounds e, j, h, and l showed remarkable cytotoxicity (< 4.0 μM) against the colorectal cancer cell line. The compounds were also observed to be potent inhibitors of the CDK2 enzyme, having binding affinities ranging from −10.8 to −11.0 kcal/mol and forming hydrogen bond