Principal component analysis (PCA) is a commonly used multivariate data analysis to reveal hidden and unknown data structures. It seeks to find the linear combination of the predictors that capture the most possible variance (i.e. the PCs). Thus, the first PC or PC1, lies along the direction of maximal data variance that capture the most variability of all possible linear combinations. It creates components that are uncorrelated and summarize the characteristic of the data that are irrelevant to the underlying structure of the data. The advantages of PCA is that, unlike the data filtering the where only a pair of descriptors of correlated were removed, the pairs of PC are not correlated. Two of the most useful features of PCA include using it to reveal the correlation between all descriptors simultaneously via the loading scores while visualization of similarities and differences within samples are done via the PC scores. In here, we will try to perform PCA using R language.
Installing R for Mac
In the first paragraph, I will briefly tell you how you can install R on Mac OS. It is a very simple process because all it takes is just a few step. So, the first thing you need to do is open your web browser, and go to CRAN. It is a comprehensive R network. You will see a number of option for you to download for different platform. We are going to download the Mac platform. Once you click R for Mac OS X you will see the latest of the files end with .pkg (latest version) and you will see the download meter will start going. It might takes a few minutes depending on your internet connection. So just be patient. Once its finish downloading, just open it up on your computer. You will see the installer which will guide you the step necessary for you to download all R. It will be on your application folder once you finish the installation process.
Installing RStudio for Mac
Before installing RStudio, you must have R installed in your compouter. So, once you installed R, you can go RStudio. There are two version for RStudio for downloads: one is for desktop and one is for server. In here, we are not going to use server version at all. So you will just want to download the desktop the version of RStudio. Download the Mac OS version and you will see the download meter go. Once thats finished downloading, just click on it to install it. Just like any other mac application, all you have to do to install is just drag the RStudio to the application folder. You can go to application folder and double click on it. Then there you go, you are running R studio.
Installing R packages
Packages in R are a bunch of commands or libraries you can use to carry out certain analysis. Installing R packages in RStudio is easy. You can go to Console and type install.packages with the parasenthese and quotation mark, type in the name of the package you wish to install. In here we use multiple packages from the CRAN as will as bioconductors. The following list of packages can be installed from CRAN. e.g. install.packages(“protr”) to reproduce the results of the research.
install.packages("protr")
install.packages("seqinr")
install.packages("RCurl")
One can install packagees from Bioconductor by entering the source before the package can be installed. When installing R packages from Biconductor, please avoid using proxy internet connection
source("http://bioconductor.org/biocLite.R")
biocLite("Rcpi")
Downloaded packages will be attached to the environment in order to perform PCA.
set.seed(1)
s = c("protr", "Rcpi", "seqinr", "RCurl")
sapply(s, require,character.only=TRUE)
First, let us obtain data set of sequence from the Uniprot, a database which contain more than 62 millions of proteins. In order to do that, we will write a function that retrieve sequence proteins from the uniprot.
getFASTA <- function(id, parallel = 2) {
fastaURL = paste0("http://www.uniprot.org/uniprot/", id, ".fasta")
fastaTxt = RCurl::getURLAsynchronous(url = fastaURL, perform = parallel)
return(fastaTxt)
}
For simplicity, we will only retrieve 11 protein sequences form the Uniprot, which are associated with Histone Deacetylase Protein Faimly. Below are the list of Uniprot ID. First, let us look at the first ID of Uniprot Q13547. The function of the protein is for the deacetylation of lysine residues on the N-terminal part of the core histones, which play an essential role in transcriptional regulation, cell cycle progression and epigenetic repression.
x <- c("Q13547", "Q92769", "O15379", "P56524", "Q9UQL6", "Q9UBN7",
"Q8WUI4", "Q9BY41", "Q9UKV0", "Q969S8", "Q96DB2")
sequence <- getFASTA(x)
Just to save the intermediate of our analysis, we will write the fasta sequence after retriving the protein sequences from the Uniprot. Thus, we will use the seqinr R package, which is a tool for exploratory data analysis and data visualization for biological sequence (DNA and protein) data.
write.fasta(sequence = sequence, names = names(sequence),
nbchar = 80, ,file.out = "HDAC.fasta")
Once the fasta file is written, we will read the fasta protein sequence.
sequence <- readFASTA("HDAC.fasta")
To make sure that the protein sequences contain only the twenty standard amino acids types, protcheck function from protr is used to remove the non-standard sequences:
sequence <- sequence[(sapply(sequence, protcheck))]
We will calculate the amino acid composition descritpors to perform pca. First, we will need to extract the it from sequence using extractAAC() function.
descriptors <- t(sapply(sequence, extractAAC))
When the descriptors were calculated, we will perform pca analysis using the prcomp() function.
pca <- prcomp(descriptors, retx = TRUE, center = TRUE, scale. = TRUE)
Let us first, look at the explained variance or proportion of variance using the summary function.
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 2.9774 1.9728 1.5615 1.4179 1.05593 0.81386 0.71218 0.52106
Proportion of Variance 0.4432 0.1946 0.1219 0.1005 0.05575 0.03312 0.02536 0.01358
Cumulative Proportion 0.4432 0.6378 0.7598 0.8603 0.91603 0.94915 0.97451 0.98809
PC9 PC10 PC11
Standard deviation 0.43338 0.22457 5.3e-16
Proportion of Variance 0.00939 0.00252 0.0e+00
Cumulative Proportion 0.99748 1.00000 1.0e+00
It seems that the PCA model is very good model because with the furst 2 principal component, it has 63.78 percent explained variance.