Principal Component Analysis
 
Metric Solution Example
 
    
 
Goals
  1.  Explain as much variability in as few variables as possible
  2.  Reduce number of parameters
  3.  Form manageable dataset
Principal Component Analysis (PCA) is a data reduction tool.  Generally speaking, PCA provides a framework for reducing data dimensionality by transforming the original data into principal components, so that observations are rotated onto axes of maximal variation.  The first principal axis is the line of best fit: it minimizes the sum of squared perpendicular distances from the observations to the line.  Each subsequent principal axis maximally accounts for the remaining (residual) variation and acts as the line of best fit orthogonal to all previous axes.  Conceptually, PCA is a greedy algorithm, fitting each axis to the data while conditioning on all previously defined axes.  Principal components are the projections of the original data onto these axes, ordered such that Principal Component 1 (PC1) accounts for the most variation, followed by PC2, PC3, ..., PCp for p variables (dimensions).  Since the PCs are mutually orthogonal, each component accounts for a separate share of data variability, so the Percent of Total Variation Explained (PTV) adds up cumulatively across components.  While Singular Value Decomposition (SVD) can be used for PCA, we will focus on the eigenvector decomposition method employed by R's princomp function.
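The eigenvector decomposition route can be sketched in a few lines of base R. This is a minimal illustration, not the analysis used here: the matrix X is made-up toy data (an assumption standing in for a real dataset), and the variable names are hypothetical.

```r
# Toy data (hypothetical): 50 observations on 3 variables, two of them correlated
set.seed(1)
x1 <- rnorm(50)
X  <- cbind(x1 = x1, x2 = x1 + rnorm(50, sd = 0.5), x3 = rnorm(50))

Xc  <- scale(X, center = TRUE, scale = FALSE)  # center each column
S   <- cov(Xc)                                 # sample covariance matrix
eig <- eigen(S, symmetric = TRUE)              # eigenvalues come back in decreasing order

axes   <- eig$vectors                  # columns are the principal axes
scores <- Xc %*% axes                  # principal components: data projected onto the axes

ptv     <- eig$values / sum(eig$values)  # proportion of total variation per PC
cum_ptv <- cumsum(ptv)                   # cumulative proportion across PC1..PCp
print(round(cum_ptv, 3))
```

The variance of each column of scores equals the matching eigenvalue, which is why the eigenvalues can be read directly as variance explained.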
Data
AAIndex: www.genome.jp/aaindex
    ~500 Quantifiable Amino Acid Indices
    Hydrophobicity
    Polarity
    Size
    Charge
    N = 20 Amino Acids
    p = 54 representative Indices



Pursuit
Create orthogonal, independent vectors (principal components) from correlated data

Correlation Matrix:

    Clearly, variables are correlated, e.g. “Size” is related to “Average volume of buried residue” and “frequency of beta turn” is related to “frequency of turn”.  PCA aims to create a set of principal components that are uncorrelated yet maximally account for variation in the data.
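A small check of this idea, assuming made-up data rather than the AAIndex values: the toy columns size, volume and charge are hypothetical, with volume deliberately built to track size.

```r
# Toy data (hypothetical): 'volume' is constructed to be strongly related to 'size'
set.seed(2)
size   <- rnorm(40)
volume <- size + rnorm(40, sd = 0.3)
charge <- rnorm(40)
toy    <- cbind(size, volume, charge)

print(round(cor(toy), 2))          # original variables: non-zero off-diagonal entries

pca <- princomp(toy)               # works here because there are more rows than columns
print(round(cor(pca$scores), 2))   # PC scores: off-diagonal entries are numerically zero
```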

       

A principal axis is determined by the correlation among variables, so that each principal component independently explains a proportion of variance.  Using eigenvector decomposition, each eigenvalue λ is the variance along its corresponding eigenvector, and the eigenvectors define the principal components (PCs).  In Cattell’s scree plot, the number of informative components can be read from the elbow, the point beyond which the eigenvalues level off.
R function calls:
>plot(AA54_PCA$scores[,1:2], pch = AminoAcids, main="Principal Component Scores")
>library(scatterplot3d)
>scatterplot3d(AA54_PCA$scores[,1:3], pch = AminoAcids, main="Principal Component Scores", box = FALSE, grid = FALSE)
 
R function calls
        >AA54_PCA = princomp(AA54, covmat = cov.wt(AA54))
        >screeplot(AA54_PCA, type="lines", npcs=length(AA54_PCA$sdev), pch=20)
 
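Loadings are the eigenvector coefficients relating variables to PCs; for standardized data, each loading is proportional to the correlation between its variable and its PC.  When printed, loadings with absolute values < 0.1 are not shown.  Scores are calculated using loadings as coefficients to project the data onto the principal axes.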
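The relationship between loadings and scores can be verified by hand. This is a sketch on made-up data (the matrix toy and its column names are hypothetical): the centered data multiplied by the loadings matrix should reproduce the scores returned by princomp.

```r
# Toy data (hypothetical): 25 observations, 3 variables
set.seed(4)
toy <- matrix(rnorm(25 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
pca <- princomp(toy)

L        <- unclass(pca$loadings)        # loadings as a plain numeric matrix
centered <- sweep(toy, 2, pca$center)    # subtract each column mean
manual   <- centered %*% L               # project the centered data onto the axes

print(all.equal(unname(manual), unname(pca$scores)))
```

Working with unclass(pca$loadings) also sidesteps the print cutoff, so small coefficients remain visible.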
Loadings
>AA54_PCA$loadings
Scores
>AA54_PCA$scores
Data Approximation
  1. Accuracy determined by number of Principal Components Retained
 
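If the number of PCs retained equals the number of variables (k=p), the dimension is not reduced and all variability is retained.  With k<p, observation i can be estimated from its first k scores and the corresponding loadings:
    \hat{x}_i = \bar{x} + \sum_{j=1}^{k} z_{ij} a_j
where z_{ij} is the score of observation i on PC j, a_j is the loading vector of PC j, and \bar{x} is the vector of variable means.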
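A minimal sketch of this approximation in R, assuming the usual PCA reconstruction (scores on the first k components multiplied by the transposed loadings, plus the column means). The data and the helper reconstruct() are hypothetical and only illustrate how the error shrinks as k grows.

```r
# Toy data (hypothetical): 30 observations, 4 variables, two nearly collinear
set.seed(5)
toy <- matrix(rnorm(30 * 4), ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))
toy[, 2] <- toy[, 1] + 0.1 * toy[, 2]

pca <- princomp(toy)
L   <- unclass(pca$loadings)

# Hypothetical helper: rebuild the data from the first k principal components
reconstruct <- function(pca, L, k) {
  approx <- pca$scores[, 1:k, drop = FALSE] %*% t(L[, 1:k, drop = FALSE])
  sweep(approx, 2, pca$center, "+")      # add the column means back
}

# Total squared reconstruction error for k = 1, ..., p
err <- sapply(1:4, function(k) sum((toy - reconstruct(pca, L, k))^2))
print(signif(err, 3))
```

With k = p the error is zero up to rounding, matching the statement that all variability is retained.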
Plotting scores according to the principal components displays the relationship among observations.  For PC1 and PC2, the cumulative total variance (CTV) explained is 57.3%.  Including PC3 raises the CTV to 72%.
Standardized Data
AA54 (.csv)
Function Calls
princomp (.R)