Factor Analysis
R Function Calls
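    > # Principal axis factor analysis of the 54 indices: 5 factors, Promax rotation, regression scores
    > Factor54 = factor.pa.ginv(AA54, nfactors = 5, prerotate = TRUE, rotate = "Promax", m = 3, scores = "regression")
    > # Loadings sorted by their value on Factor1
    > Factor54$loadings[order(Factor54$loadings[,1]),]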
 
Loading values represent the relationship between variables and factors.  Loadings with values > 0.85 are boxed in red to indicate their influence on the factor.  Based on the similarity in the descriptions of variables with loadings close to ±1, the following latent variables are inferred:
 
Factor1    PAH    polarity, accessibility and hydrophobicity
Factor2    PSS    propensity for secondary structure
Factor3    MS     molecular size
Factor4    CC     codon composition
Factor5    EC     electrostatic charge
Factor Analysis (FA) is a dimension reduction tool that estimates the latent variable structure of data by partitioning the variability into a part common to all variables and a residual part unique, or specific, to each variable.  It differs from PCA in that it estimates the communality of each variable, so as to separate the shared variation from both error and variation unrelated to the other variables.  By separating these sources of variation, FA decomposes high-dimensional molecular data (HDMD) into an interpretable structure composed of explanatory factors acting on multiple variables.  Each factor represents a latent variable, with loadings (coefficients) relating each observed variable to the factor.  As with PCA, Cattell’s scree test can be used to determine the number of informative factors.

Factor Analysis also rotates the loadings to accentuate correlated variables and emphasize the latent structure.  An orthogonal rotation such as Varimax assumes the factors are independent, while an oblique rotation such as Promax allows the latent variables to be correlated.  Scores, which represent the relationship between observations and latent variables, can then be estimated from the loadings and the data.
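
As an illustration of the scree test mentioned above, the following minimal sketch plots the eigenvalues of a correlation matrix in decreasing order.  It uses simulated data rather than the AA54 data set: the two latent factors, the loading matrix L, and the sample size are all assumptions made only for this example.  The "elbow" after the second eigenvalue suggests two informative factors.

```r
# Scree plot sketch on simulated data (not the AA54 data set).
set.seed(42)
n <- 500                                   # hypothetical number of observations
f <- matrix(rnorm(n * 2), n, 2)            # two independent latent factors
L <- rbind(c(.8, 0), c(.8, 0), c(.8, 0),   # hypothetical loadings: variables 1-3 on factor 1,
           c(0, .8), c(0, .8), c(0, .8))   # variables 4-6 on factor 2
X <- f %*% t(L) + 0.6 * matrix(rnorm(n * 6), n, 6)   # add unique variance (sd 0.6)

# Eigenvalues of the correlation matrix, returned in decreasing order
ev <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Look for the point where the curve levels off
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
```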
 
Metric Solution Example
 
    
 
Goals
  1.  Determine if there is an underlying latent structure that can account for data variability
  2.  Interpret latent variables (Loadings)
  3.  Concisely quantify amino acid attributes (Scores)
 
Paper
  1. Atchley, W. R., Zhao, J., Fernandes, A. and Drueke, T. 2005. Solving the sequence “metric” problem. Proc. Natl. Acad. Sci. USA 102: 6395-6400. (PDF)
Data
AAIndex: www.genome.jp/aaindex
    ~500 Quantifiable Amino Acid Indices
    Hydrophobicity
    Polarity
    Size
    Charge
    N = 20 Amino Acids
    p = 54 representative Indices



Pursuit
Determine latent variables (factors) that account for common variability in data
Common Variability

First, the amount of variability that the latent variable structure can account for must be determined for each variable; this is the variable's communality.  A variable with communality close to 1 is tightly related to the other variables and is well explained by the latent variable structure.  A variable with communality close to 0 is unrelated to the factor structure, and most of its variance is unique.  The communality replaces the unit diagonal of the correlation matrix, so that each diagonal element represents the amount of variability the structure can explain for that variable.

Estimating Communality
    To partition common and unique variance, the diagonal elements of the correlation matrix R are replaced with the estimated communality h² of each variable.  The default method initializes h² with the squared multiple correlation (SMC), which is the proportion of the variance of variable j that is explained by all other variables.  If the number of factors K is half of the number of variables, or imaginary eigenvalues are encountered in the first iteration, the communality is initialized to 1 instead.  Total communality is defined as the sum of the communalities of all variables.  The common variance is then estimated iteratively: the correlation matrix is decomposed into its eigenvector structure, and the diagonal elements are updated with the sum of squared loadings of each variable.

Factor Rotation
    The initial loadings of FA determine the correlation between variables and factors so that each factor successively accounts for the maximum common variance.  However, these loading values may not reflect the underlying relationship between the variables and the latent structure.  Transforming the loadings with a rotation matrix aligns the axes with the variables.  This emphasizes the variables that contribute to a factor and reduces its association with unrelated variables.  As seen in the figures below, the relationship of Factor1 and Factor2 with the variables is fairly uniformly distributed when there is no rotation.  With the orthogonal varimax rotation, the association of certain variables is accentuated for Factor1 and minimized for Factor2, while other variables are accentuated for Factor2 and minimized for Factor1.  Applying the oblique promax rotation further allows the factors to be correlated.
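
The communality iteration and the two rotations described above can be sketched in base R as follows.  This is an illustrative outline under stated assumptions, not the factor.pa.ginv implementation: the data are simulated, the loading matrix L, the convergence tolerance, and the iteration cap are hypothetical choices, and the rotations use the varimax() and promax() functions from the stats package.

```r
# Sketch: iterated principal axis factoring, then orthogonal and oblique rotation.
set.seed(1)
n <- 500                                        # hypothetical sample size
K <- 2                                          # number of factors to extract
f <- matrix(rnorm(n * K), n, K)                 # simulated latent factors
L <- rbind(c(.8, 0), c(.7, 0), c(.6, 0),        # hypothetical true loadings
           c(0, .8), c(0, .7), c(0, .6))
uniq <- 1 - rowSums(L^2)                        # unique variance of each variable
X <- f %*% t(L) + matrix(rnorm(n * 6), n, 6) %*% diag(sqrt(uniq))
R <- cor(X)

# Initialize communalities with the squared multiple correlations (SMC)
h2 <- 1 - 1 / diag(solve(R))

for (iter in 1:100) {
  Rr <- R
  diag(Rr) <- h2                                # reduced correlation matrix
  e <- eigen(Rr, symmetric = TRUE)
  # Unrotated loadings: eigenvectors scaled by the square root of the eigenvalues
  lambda <- e$vectors[, 1:K, drop = FALSE] %*% diag(sqrt(pmax(e$values[1:K], 0)), K)
  h2.new <- rowSums(lambda^2)                   # updated communalities
  converged <- abs(sum(h2.new) - sum(h2)) < 1e-6   # change in total communality
  h2 <- h2.new
  if (converged) break
}

vmax <- varimax(lambda)                         # orthogonal rotation
obl  <- promax(lambda, m = 3)                   # oblique rotation
Phi  <- solve(crossprod(obl$rotmat))            # factor correlations implied by promax

print(round(unclass(vmax$loadings), 2))
print(round(unclass(obl$loadings), 2))
print(round(Phi, 2))
```

An orthogonal rotation leaves the communalities unchanged, so rowSums(vmax$loadings^2) reproduces h2; the oblique solution additionally reports how strongly the factors correlate through Phi.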
 
Loadings
Estimate Scores
    Since unique variability is included in a Factor Analysis model, scores must be estimated rather than calculated directly.  Regression and Bartlett are two common estimation methods.  The regression method accounts for correlations among variables (R⁻¹) and among factors (Φ), while the Bartlett method accounts for variable uniqueness.  Regression was used for this analysis.
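
For reference, the two estimators are commonly written in the following textbook forms.  These are the standard expressions, given here as an assumption about what the analysis computes rather than as a transcription of the original equations.  The symbols Z (the n × p matrix of standardized data), Λ (the p × K matrix of rotated loadings), and U² (the diagonal matrix of uniquenesses, 1 − h²) are notation introduced for this sketch; R is the correlation matrix and Φ the factor correlation matrix, which is the identity for an orthogonal rotation.

```latex
\hat{F}_{\mathrm{regression}} = Z \, R^{-1} \Lambda \, \Phi
\qquad
\hat{F}_{\mathrm{Bartlett}} = Z \, U^{-2} \Lambda \left( \Lambda^{\top} U^{-2} \Lambda \right)^{-1}
```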
 
[Equations: Regression and Bartlett score estimators]
R function calls
 
    > Factor54$scores

    > library(scatterplot3d)   # provides scatterplot3d()
    > # 2-D plot of the first two factor scores, labelled by amino acid
    > plot(Factor54$scores[,1:2], pch = AminoAcids, main="Factor Scores", xlab="pah", ylab="pss")
    > # 3-D plot of the first three factor scores, with a reference plane and axes through the origin
    > Factor3d = scatterplot3d(Factor54$scores[,1:3], pch = AminoAcids, main="Factor Scores", box = FALSE, grid = FALSE, xlab="pah", ylab="pss", zlab="ms")
    > Factor3d$plane3d(c(0,0,0), col="grey")
    > Factor3d$points3d(c(0,0), c(0,0), c(-3,2), lty="solid", type="l")
    > Factor3d$points3d(c(0,0), c(-1.5,2), c(0,0), lty="solid", type="l")
    > Factor3d$points3d(c(-1.5,2), c(0,0), c(0,0), lty="solid", type="l")
    > # Circle the hydrophobic, polar and small amino acids
    > Factor3d$points3d(Factor54$scores[hydrophobic,1:3], col="blue", cex = 2.7, lwd=1.5)
    > Factor3d$points3d(Factor54$scores[polar,1:3], col="green", cex = 3.3, lwd=1.5)
    > Factor3d$points3d(Factor54$scores[small,1:3], col="orange", cex = 3.9, lwd=1.5)
    > legend(x=5, y=4.5, legend=c("hydrophobic", "polar", "small"), col=c("blue", "green", "orange"), pch=21, box.lty=0)
 
 
    The relationship among amino acids according to their PAH, PSS, MS, CC, and EC values can be inferred from their scores.  As expected, similar amino acids such as Isoleucine (I) and Leucine (L) have comparable scores, while dissimilar amino acids such as Glycine (G) and Arginine (R) have contrasting scores.
 
Standardized Data
AA54 (.csv)
Function Calls
factor.pa.ginv (.R)