# Principal Component Analysis

We’ve been reading some articles and book chapters about factor analysis for journal club.

W. K. Adams, K. K. Perkins, N. S. Podolefsky, M. Dubson, N. D. Finkelstein, and C. E. Wieman (2006), *New instrument for measuring student beliefs about physics and learning physics: The Colorado Learning Attitudes about Science Survey*, PRST-PER 2, 010101.

D. Huffman and P. Heller (1995), *What does the Force Concept Inventory actually measure?*, The Physics Teacher 33, 138. (and responses)

T. F. Scott, D. Schumayer, and A. R. Gray (2012), *Exploratory factor analysis of a Force Concept Inventory data set*, PRST-PER 8, 020105.

J. T. Pohlmann (2004), *Use and interpretation of factor analysis in The Journal of Educational Research: 1992-2002*, J. Ed. Res. 98(1), 14.

R. Gorsuch (1983), *Factor Analysis*, Lawrence Erlbaum, Hillsdale, NJ.

Factor analysis is a technique for identifying factors that underlie a given set of data and using those factors to construct a linear model with possible explanatory and/or predictive power. Within education, factor analysis is used to analyze the results of an assessment given to a large collection of students to see what (if any) conceptual coherence exists between sets of questions. In the CLASS article, factor analysis was used to group the 40-odd questions on the survey into six categories. The authors decompose a student’s score on the survey into six sub-scores using categories like “personal interest” and “sense making/effort”. In the FCI articles, factor analysis is used to see whether, for students, the FCI represents an assessment of six different dimensions of Newtonian mechanics. The FCI authors conceived of the test as an assessment of six different dimensions (things like inertia and types of forces), but this is not necessarily how students view the test. (The two FCI articles differ in their conclusions. Huffman and Heller find no conclusive factors, suggesting that students view each question as independent. Scott et al. do find five or six convincing factors; however, these factors differ somewhat from those proposed by the authors of the FCI.)

The “input” for factor analysis is student scores on each individual question from the assessment. We take these scores to be in standard form, meaning that their mean is zero and their standard deviation is one. Factor analysis assumes that each student’s score on each question can be written as a linear combination of scores on the underlying factors:

Z_{1,n} = w_{1,A}F_{A,n} + w_{1,B}F_{B,n} + …

Z_{2,n} = w_{2,A}F_{A,n} + w_{2,B}F_{B,n} + …

etc.

Here Z_{i,n} represents student n’s score on question i, w_{i,q} represents factor q’s weighting in question i (also sometimes referred to as the power factors for question i or the factor loadings for question i), and F_{q,n} represents student n’s score on factor q. The goal of factor analysis is to construct a set of weights and factor scores that are capable of reproducing (exactly or approximately) the scores on the original assessment questions. There is no unique solution to this problem. Different solutions are based on different assumptions about relationships between weights, factors, etc. Principal Component Analysis produces one possible solution based on the assumption that the factors are all uncorrelated.
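As a concrete sketch of this linear model, here it is in NumPy with made-up weights and factor scores (the sizes, the random values, and the two-factor choice are all hypothetical, not data from any real assessment):

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_questions, n_factors = 500, 6, 2

# Hypothetical factor scores F_{q,n} (stored here as students x factors)
# and factor weights w_{i,q} (questions x factors).
F = rng.standard_normal((n_students, n_factors))
w = rng.standard_normal((n_questions, n_factors))

# Z[n, i] = sum_q w[i, q] * F[n, q]: each question score is a linear
# combination of the student's factor scores.
Z = F @ w.T
```

Note that the matrix `Z` stores students along rows and questions along columns, matching the matrix form **Z** = **FP**^{T} used later in the post.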

**Principal Component Analysis (PCA)**

In the following discussion i and j will always index questions, n will always index students, and p and q will always index factors. We start by writing the correlation between questions i and j on the assessment:

r_{i,j} = Σ_{n} Z_{i,n}Z_{j,n}/N
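With the scores in standard form, this correlation is just a matrix product, which is easy to check numerically. In the sketch below the scores are simulated (the shared term is only there to make the questions correlated), standardized, and the resulting matrix is compared against NumPy’s built-in correlation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated raw scores (students x questions); the shared column
# induces correlation between questions.
raw = rng.standard_normal((1000, 4)) + rng.standard_normal((1000, 1))

# Put the scores in standard form: mean 0, standard deviation 1 per question.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
N = Z.shape[0]

# r_{i,j} = sum_n Z_{i,n} Z_{j,n} / N
R = Z.T @ Z / N
```

For standardized scores this agrees with `np.corrcoef(Z, rowvar=False)` up to floating point, and the diagonal entries are all 1 (each question is perfectly correlated with itself).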

Using the definition Z_{i,n} above, we can write the correlation between questions as

r_{i,j} = Σ_{n}(Σ_{p}w_{i,p}F_{p,n})(Σ_{q}w_{j,q}F_{q,n})/N

= Σ_{n}(Σ_{q}w_{i,q}w_{j,q}F_{q,n}^{2} + Σ_{q}Σ_{p≠q}w_{i,q}F_{q,n}w_{j,p}F_{p,n})/N.

PCA seeks to construct a set of weights and factor scores for which the factor correlations vanish, r_{q,p} = Σ_{n}F_{q,n}F_{p,n}/N = 0. We also define the factor scores to be in standard form so that the factor variance, r_{q,q} = 1. Under these conditions, the correlation between questions i and j reduces to

r_{i,j} = Σ_{q}w_{i,q}w_{j,q}.

In matrix notation, the relationship between the question correlations and factor loadings can be written

**R** = **PP**^{T}

where **P** is the matrix of power factors (factor loadings), with entries P_{i,q} = w_{i,q}; each column of **P** collects one factor’s loadings across all questions, **P** = (**w**_{A}, **w**_{B}, …).

To determine how to find the matrix **P** we consider the eigenvalue equation for the question correlation matrix,

**RA** = **AS**

where **A** is a matrix whose columns are the eigenvectors of **R** and **S** is a diagonal matrix with the eigenvalues of **R** on the diagonal. **R** is a real, symmetric matrix (and therefore Hermitian), so its eigenvectors can be chosen orthonormal. In that case **A**^{T} = **A**^{-1} and we can write

**R** = **ASA**^{T} = (**AS**^{1/2})(**AS**^{1/2})^{T} = **PP**^{T}.

The matrix **P** thus consists of the eigenvectors of **R**, each weighted by the square root of its corresponding eigenvalue.
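In code, this construction of **P** is only a few lines; the sketch below again uses simulated standardized scores, and `numpy.linalg.eigh` handles the symmetric eigenproblem:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated standardized scores (students x questions) and their
# correlation matrix R.
raw = rng.standard_normal((800, 5)) + 0.8 * rng.standard_normal((800, 1))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
N = Z.shape[0]
R = Z.T @ Z / N

# eigh solves the symmetric eigenproblem R A = A S,
# returning eigenvalues in ascending order.
eigvals, A = np.linalg.eigh(R)
S = np.diag(eigvals)

# Loadings: eigenvectors scaled by the square roots of their
# eigenvalues, P = A S^{1/2}. (R is positive semidefinite, so the
# eigenvalues are nonnegative and the square root is real.)
P = A @ np.sqrt(S)
```

The reconstruction **R** = **PP**^{T} then holds up to floating point.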

The columns of **P** determine the factor weights. To determine the factor scores we recognize that the original equations relating the question scores, factor weights, and factor scores can be written as **Z** = **FP**^{T}, where **F** is a matrix of factor scores. Using the definition of **P** (which gives **P**^{T}**P** = **S**), we can solve for **F** as

**F** = **ZPS**^{-1}.
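Continuing the same simulated sketch, the factor scores follow directly from this formula, and one can verify numerically that they come out uncorrelated with unit variance and that keeping every factor reproduces the question scores exactly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated standardized scores and correlation matrix, as before.
raw = rng.standard_normal((800, 5)) + 0.8 * rng.standard_normal((800, 1))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
N = Z.shape[0]
R = Z.T @ Z / N

eigvals, A = np.linalg.eigh(R)
P = A @ np.diag(np.sqrt(eigvals))
S_inv = np.diag(1.0 / eigvals)

# Factor scores: F = Z P S^{-1} (students x factors).
F = Z @ P @ S_inv

# The factors are uncorrelated and in standard form: F^T F / N = I.
factor_corr = F.T @ F / N
```

With all factors retained, **FP**^{T} returns the original **Z** exactly (up to floating point).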

At this point we have as many factors as we have questions on the original assessment. If one keeps all of the factors, then **Z** = **FP**^{T} will reproduce the original question scores exactly. More often, one will keep only a few significant factors, in which case **Z** ≈ **FP**^{T} reproduces the original question scores only approximately. A future post will discuss the question of how many factors to keep.
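Truncation is easy to illustrate on simulated data. In the sketch below the scores are built from two underlying factors plus noise (all of it made up), so keeping the k = 2 largest factors recovers most, but not all, of the variance:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated scores generated from 2 underlying factors plus noise,
# then standardized.
true_F = rng.standard_normal((1000, 2))
true_w = rng.standard_normal((6, 2))
raw = true_F @ true_w.T + 0.3 * rng.standard_normal((1000, 6))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
N = Z.shape[0]

R = Z.T @ Z / N
eigvals, A = np.linalg.eigh(R)

# Keep the k largest factors (eigh sorts ascending, so take the
# last k columns).
k = 2
P_k = A[:, -k:] @ np.diag(np.sqrt(eigvals[-k:]))
F_k = Z @ P_k @ np.diag(1.0 / eigvals[-k:])

# With only k factors the reconstruction is approximate, not exact.
Z_approx = F_k @ P_k.T
err = np.sqrt(np.mean((Z - Z_approx) ** 2))
```

Because the noise term prevents two factors from explaining everything, `err` is small but nonzero; keeping all six factors would drive it to zero.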