Elucidating the physicochemical properties of a drug's chemical structure is required for understanding the origin of its biological activity and pharmacological effects. Quantitative structure-activity relationship (QSAR) modeling addresses this: the biological activities of a series of compounds are studied using empirical mathematical approaches that seek to discern the relationship between chemical structure and measured activity. Each compound is represented by a set of chemical descriptors, which are mathematical representations of its chemical structure. Many approaches can be used to compute them: counting certain chemical groups, flagging the presence or absence of predefined structural fragments in the molecule, or more advanced methods that reflect the physicochemical properties of the compound. These descriptors are then related to the measured biological activities using machine learning approaches to build a model. Once built, the model can be used as a tool to predict the activities of new compounds before they are synthesized, without the need for any physical measurements on them. It can also reveal how a change in the chemical structure of a compound influences its biological activity, on the assumption that a compound's biological activity depends on its chemical structure.
Before we dive into the analysis, please look at the general workflow of QSAR. From the figure, we can see that the molecular structures of the chemical compounds are represented as fingerprint descriptors, which are then used to build the QSAR model that predicts a property of the compounds. For this QSAR analysis, I have selected histone deacetylase inhibitors because cancer is one of the leading causes of death all over the world. A wide range of proteins are related to tumor formation and metastasis. However, only proteins with widespread biological significance for the regulation of tumor cell growth are likely to be good targets. Histone deacetylases have been shown to be a new epigenetic target for the treatment of cancer, and their inhibitors have shown antitumor effects both in vitro and in vivo. Because of this, these inhibitors have become one of the most important research areas in antitumor drugs, especially in the era of epigenetics.
This analysis uses the Python programming language because it is a general-purpose language, which means you can do pretty much anything with it. It is also increasingly popular in the field of data science, and its simple syntax allows beginners to pick up the language with ease.
For the QSAR analysis, we will be using a data set from ChEMBL, specifically the one for histone deacetylase 2. You can get the data set here: Histone Deacetylase Data. From that data, we are interested in two columns: the SMILES representation of each chemical compound and its inhibitory activity, typically the IC50 value, the half-maximal inhibitory concentration, which measures how effectively a compound inhibits the activity of histone deacetylase.
The SMILES strings from the database often contain unwanted components (e.g., salts, metal ions), which complicate the QSAR workflow. They should therefore be removed before extracting descriptors, the mathematical representations of the chemical compounds. Once the SMILES are standardized, that is, once the unwanted components are removed, we can proceed to extract descriptors (please see my previous post about chemical descriptors). The IC50 values should also be transformed into pIC50, because the range of IC50 values is very broad. To make them easier to work with, we take the negative base-10 logarithm of the IC50 expressed in molar units, just as pH is the negative base-10 logarithm of the hydrogen ion concentration [H+].
The descriptors and the pIC50 values can then be related to each other using various machine learners, from white box learners (e.g., Decision Tree, Random Forest, Partial Least Squares) to black box learners (e.g., Support Vector Machine, Artificial Neural Network). For this analysis, we will use the Random Forest algorithm, or simply RF. RF is an ensemble method in which multiple decision trees are combined to build a predictive model. Before making predictions, we have to split the data into a training set and a testing set. The training set is used to build the model, while the testing set is held back. How much we can trust the model depends on how well it performs in internal validation and external validation.
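The pIC50 transformation described above can be sketched in a few lines. This is only a minimal illustration, not the code from my notebook: the function name `ic50_nm_to_pic50` is made up for this example, and it assumes the IC50 values are reported in nanomolar (nM), so check the units column of your own export before using it.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar (assumed unit) to pIC50.

    pIC50 = -log10(IC50 in molar). Since 1 nM = 1e-9 M, this is
    the same as 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


# Hypothetical IC50 values in nM, spanning several orders of magnitude.
example_ic50_nm = [1.0, 50.0, 1000.0, 25000.0]
example_pic50 = [ic50_nm_to_pic50(v) for v in example_ic50_nm]
```

Note that the scale is inverted: a more potent compound has a lower IC50 but a higher pIC50, and values that span several orders of magnitude end up within a few units of each other.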
To validate internally, 10-fold cross-validation is used: the training data are split into 10 folds, one fold is held out as a test set while the other 9 are used to build the model, and this is repeated until every fold has been held out once. To validate externally, the external test set is treated as an unknown set to evaluate the performance of the model. To make sure that the performance of the model does not come from chance, Y-scrambling is also used. In the Y-scrambling approach, the inhibitory activities are shuffled while the descriptors are left untouched, so a model trained on the scrambled data should perform poorly if the original relationship is real. I did the QSAR analysis with Python in a Jupyter notebook. Here is the notebook
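To make the two validation ideas above concrete, here is a small sketch using only the Python standard library. It is not taken from the notebook: the helper names `k_fold_indices` and `y_scramble` are invented for this illustration. In practice a library such as scikit-learn, with its `KFold` class, would normally handle the splitting.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Every sample lands in the held-out fold exactly once.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train_idx, test_idx


def y_scramble(y, seed=0):
    """Return a shuffled copy of the activities; descriptors are not touched."""
    scrambled = list(y)
    random.Random(seed).shuffle(scrambled)
    return scrambled


# Hypothetical pIC50 values, only to show the calls.
pic50 = [5.1, 6.3, 7.8, 4.9, 6.0, 8.2, 5.5, 7.1, 6.6, 5.9, 7.4, 6.8]
splits = list(k_fold_indices(len(pic50), k=4))
scrambled_pic50 = y_scramble(pic50)
```

For each split, the model would be trained on the rows in `train_idx` and scored on the rows in `test_idx`. For Y-scrambling, the same procedure is repeated with `scrambled_pic50` in place of the real activities, and the scores are expected to collapse.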