The Theory of the Singular Value Decomposition and its Relation to Principal Component Analysis
Three points require mentioning before beginning this whitepaper. First, I used Python and R code throughout this whitepaper and often imported objects created in Python into R for creating plots. To use Python and R conjointly, I used the reticulate
package made for R and created a conda environment to run Python code (see lines 2–12 below).
library(reticulate)
#create and use conda environment
conda_create(envname = 'blog_posts', python_version = '3.10.11')
use_condaenv(condaenv = 'blog_posts')
#install packages in conda environment
py_packages <- c('numpy', 'pandas', 'scikit-learn', "plotnine", "statsmodels", "manim", "factor_analyzer")
conda_install(envname = 'blog_posts', packages = py_packages, pip=T)
#useful for checking what packages are loaded
py_list_packages(envname = 'blog_posts', type = 'conda')
Second, the Python packages and modules in the Python code block are needed to run all the Python code in this paper (see lines 15–22 below).
import math
import numpy as np
import pandas as pd
import plotnine as pt
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from factor_analyzer.rotator import Rotator
from factor_analyzer import FactorAnalyzer
Third, although I often include code in papers so that readers can explore concepts, I decided to not include the Python code I used to create mathematical animations given the considerable length of the script. For readers interested in how I created my animations, I used the manim library and my source code can be viewed in this script.
Introduction
In this whitepaper, I provide a deep dive into the math underlying the singular value decomposition and use this as a foundation to explain its connection with principal component analysis. To this end, I provide a detailed proof of the singular value decomposition and explain the linear algebra necessary for understanding its geometry. I then explain the meaning of the singular value decomposition by synthesizing the parts needed to run a principal component analysis. To address all these points, the paper proceeds through the following sections:
- A review of linear algebra
- The singular value decomposition and its visualization
- Proving the singular value decomposition
- Understanding the singular value decomposition by synthesizing principal component analysis
A Review of Linear Algebra
Some Important Fundamentals
In this section, I provide some fundamentals of linear algebra that are necessary for understanding the geometry of the singular value decomposition. To this end, I explain the following four points:
- Matrices define basis vectors
- Matrix-vector multiplication transforms bases by computing weighted vector sums
- Dot products are 1-dimensional cases of matrix-vector multiplication
- Dot products provide orthogonal projection lengths with normalized vectors
Matrices Define Basis Vectors
Basis vectors can be understood as the coordinate vectors used to define some n-dimensional space,
If
Importantly, aside from the standard basis vectors being familiar and simple, there is not always a strong justification for using them in applied settings. Thus, non-standard basis vectors are often used in practice. As an example of non-standard basis vectors, consider defining
To define
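To make the idea concrete, the short Python sketch below uses arbitrary illustrative values (not the matrices from the example above) to show that the columns of a matrix serve as basis vectors, with the identity matrix holding the standard basis.

```python
import numpy as np

# Standard basis: the columns of the identity matrix
E = np.eye(2)

# A non-standard basis: each column is a basis vector (values are purely illustrative)
B = np.array([[2.0, 1.0],
              [0.0, 1.0]])

# The same coordinates describe different points depending on the basis
coords = np.array([1.0, 2.0])
print(E @ coords)  # [1. 2.] -> the point in the standard basis
print(B @ coords)  # [4. 2.] -> the point the non-standard basis maps these coordinates to
```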
Matrix-Vector Multiplication Transforms Bases by Computing Weighted Vector Sums
Because matrices define basis vectors, they can also be used to apply transformations. To understand how matrices apply transformations, consider again the non-standard basis of
In post-multiplying
Animation 3 below shows how matrix-vector multiplication of Equation
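As a minimal numerical sketch of this idea (the matrix and vector below are arbitrary, not those used in Animation 3), matrix-vector multiplication returns the sum of the matrix's columns weighted by the entries of the vector:

```python
import numpy as np

# Arbitrary matrix whose columns act as basis vectors
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
x = np.array([3.0, -1.0])

# A @ x is the weighted sum of A's columns, with the entries of x as the weights
weighted_sum = x[0] * A[:, 0] + x[1] * A[:, 1]
print(A @ x)                             # [ 5. -1.]
print(np.allclose(A @ x, weighted_sum))  # True
```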
Dot Products are 1-Dimensional Cases of Matrix-Vector Multiplication
Although it may not be obvious, dot products are simply 1-dimensional cases of matrix-vector multiplication. That is, dot products also transform bases by computing weighted vector sums. The only nuance is that the transformations applied by dot products occur in a 1-dimensional space (i.e., a line). To understand the geometry of dot products, consider two vectors of
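To sketch this point numerically (with two arbitrary vectors), treating one vector as a 1 × 2 matrix shows that the dot product is matrix-vector multiplication whose output lives on a line:

```python
import numpy as np

v = np.array([1.0, 3.0])
x = np.array([3.0, 3.0])

# As a 1 x 2 matrix, v maps the 2-dimensional vector x onto a 1-dimensional space
print(v.reshape(1, -1) @ x)  # [12.]

# The dot product computes the same weighted sum: 1*3 + 3*3 = 12
print(v @ x)                 # 12.0
```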
Dot Products Provide Orthogonal Projection Lengths With Normalized Vectors
One focal process of principal component analysis is the computation of orthogonal projections: the value of one variable expressed on another basis. In the context of principal component analysis, analysts want to know how scores on one variable translate into scores on another variable that exists in a lower dimension. As it turns out, dot products accomplish this task when the vectors being projected onto are normalized. That is, when computing orthogonal projections of a vector,
To understand how Equation

# Define vectors (these correspond to v and x, respectively, in the plot annotations)
x = np.array([3, 3])
v = np.array([1, 3])
# scalar multiplier for the orthogonal projection of v onto the line spanned by x
c = v.T.dot(x)/(x.dot(x))
v_proj = c*x
# Create DataFrame for vectors
df_vectors = pd.DataFrame({
'x_start': [0, 0, v_proj[0], 0],
'y_start': [0, 0, v_proj[1], 0],
'x_end': [v[0], x[0], v[0], 2],
'y_end': [v[1], x[1], v[1], 2],
'label': ['x', 'v', 'x_vproj','v_proj'],
'color': ["#d170c7","#d4bd04", "#33c304","#9DFFFF"]})
# Define line span of x (t * x for t in [-5, 5])
t_values = np.linspace(-5, 5, 100)
x_span = t_values * x[0]
y_span = t_values * x[1]
df_line = pd.DataFrame({'x': x_span, 'y': y_span})
# Create a dataframe for tick labels (for easier annotation)
ticks = range(-5, 6) # From -5 to 5
df_xticks = pd.DataFrame({
"x": [i for i in ticks if i != 0], # Exclude 0
"y": [0] * (len(ticks) - 1), # Same length after filtering
"label": [str(i) for i in ticks if i != 0] # Exclude 0 from labels
})
df_yticks = df_xticks.copy()
df_yticks['x'] = df_xticks['y']
df_yticks['y'] = df_xticks['x']
# dataframe for annotations
df_annotations = pd.DataFrame({
"x": [-4, -4, -4, 0.75, 3.25, 1.5, 2, -4, 3.75, 3, 3, 3, 3],
"y": [5, 4, 3, 3.25, 2.75, 0.5, 3.5, 2, 4.25, -2, -3, -4, -5],
"label": [r"$\mathbf{x}^\top = [ 1 \ 3 ]$",
r"$\mathbf{v}^\top = [ 3 \ 3 ]$",
r"$Proj_L(\mathbf{x}) = c\mathbf{v}$",
r"$\mathbf{x}$",
r"$\mathbf{v}$",
r"$Proj_L(\mathbf{x})$",
r"$\mathbf{x} - Proj_L(\mathbf{x})$",
r"$L = \{c\mathbf{v} | c \in \mathbb{R}\}$",
r"$L$",
r"$(\mathbf{x} - Proj_L(\mathbf{x}))\mathbf{v} = 0 $",
r"$(\mathbf{x} - c\mathbf{v})\mathbf{v} = 0$",
r"$\mathbf{x}^\top \mathbf{v} - c\mathbf{v}^\top \mathbf{v} = 0$",
r"$c = \frac{\mathbf{x}^\top \mathbf{v}}{\mathbf{v}^\top \mathbf{v}}$"],
"color": ["#d170c7", "#d4bd04", "#9DFFFF",
"#d170c7", "#d4bd04", "#9DFFFF", "#33c304", "white", "white",
"white", "white" , "white", "white"],
"size": [14]*13})
# Create plot
plot_proj = (pt.ggplot() +
# Specify vectors
pt.geom_line(df_line, pt.aes(x='x', y='y'), alpha=0.5, color="white") +
pt.geom_segment(data=df_vectors,
mapping=pt.aes(x='x_start', y='y_start', xend='x_end', yend='y_end'),
arrow=pt.arrow(type="closed", length=0.12, angle=30),
color=df_vectors['color']) +
# Add vector labels
pt.geom_text(data=df_annotations,
mapping=pt.aes(x="x", y="y", label="label"),
color=df_annotations['color'], size=df_annotations['size']) +
pt.geom_hline(yintercept=0, color="white") +
pt.geom_vline(xintercept=0, color="white") +
# Remove default tick labels and minor grid lines
pt.scale_x_continuous(limits=(-5, 5), breaks=range(-5, 6), minor_breaks=[]) +
pt.scale_y_continuous(limits=(-5, 5), breaks=range(-5, 6), minor_breaks=[]) +
# Manually place x-axis tick labels ON the hline
pt.geom_text(df_xticks, pt.aes(x='x', y='y', label='label'),
color="white", size=10, va="top", nudge_y=-0.1) +
# Manually place y-axis tick labels ON the vline
pt.geom_text(df_yticks, pt.aes(x='x', y='y', label='label'),
color="white", size=10, ha="right", nudge_x=-0.1, nudge_y=0) +
pt.theme_minimal(base_family='Helvetica', base_size=14) +
pt.theme(
text=pt.element_text(color="#002241"),
plot_background=pt.element_rect(fill="#002241"),
panel_background=pt.element_rect(fill="#002241"),
panel_grid_major=pt.element_line(color="#9DFFFF", size=0.3, alpha=0.5),
axis_title=pt.element_blank(), # Remove axis titles
axis_ticks=pt.element_blank(), # Remove tick marks
axis_text=pt.element_blank(), # Remove default tick labels
legend_position="none"))
plot_proj.save("images/proj_plot.png", dpi=500, width=8, height=6)
Visualizations of Elementary Matrices
As explained in the previous section, matrix multiplication transforms bases by computing weighted vector sums. As it turns out, there are three fundamental transformations that constitute every possible matrix transformation: 1) Rotation, 2) stretching, and 3) changing dimension. Importantly, each fundamental transformation corresponds to a particular type of matrix, and I will provide visualizations for each of these matrices. Below is a brief summary of the matrices and their transformations (note, there is technically a fourth type of transformation that un-transforms):
- Diagonal matrices stretch basis vectors.
- Orthonormal matrices rotate basis vectors.
- Inverse matrices un-transform basis vectors.
- Rectangular matrices change the dimension space.
I also provide a geometric overview of eigenvectors and eigenvalues given their central importance to singular value decomposition and principal component analysis.
Diagonal Matrices Stretch Basis Vectors
Animation 5 below shows that diagonal matrices simply stretch the basis vectors that define a space. In the current example, the matrix of
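A minimal sketch of this stretching behavior (using an arbitrary diagonal matrix rather than the one in Animation 5) is shown below:

```python
import numpy as np

# A diagonal matrix only rescales each basis vector (values are illustrative)
D = np.diag([3.0, 0.5])

e_x = np.array([1.0, 0.0])
e_y = np.array([0.0, 1.0])

print(D @ e_x)  # [3. 0. ]  -> the x basis vector is stretched by a factor of 3
print(D @ e_y)  # [0. 0.5]  -> the y basis vector is compressed by a factor of 0.5
```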
Orthonormal Matrices Only Rotate Basis Vectors (Length and Angle Preserving)
Animation 6 below shows that orthonormal matrices simply rotate the basis vectors that define a space. In the current example, the matrix of
Importantly, orthonormal matrices only rotate vector spaces. That is, there is no stretching/compressing of the vector space; vector lengths remain unchanged and so do angles between vectors. In mathematical terms, orthonormal matrices preserve lengths and angles and, given the ease with which each of these statements can be proven, I show each proof in turn.
Beginning with the length-preserving property of orthonormal matrices, consider the formula for computing vector lengths in Equation
As an example of the length-preserving property of orthonormal matrices, the Python code block below (lines 126–134) shows that the length of the original vector,
# original vector
g_e = np.array([2, 1])
# transformed vector
g_q = np.array([3*(np.sqrt(2))/2, -np.sqrt(2)/2])
# squared lengths of each vector (equal squared lengths imply equal lengths)
print(g_e.dot(g_e), "\n",
np.round(g_q.dot(g_q), 6))
5 5.0
Ending with the angle-preserving feature of orthonormal matrices, consider the formula for computing the angle between two vectors,
By applying Equation
As an example of the angle-preserving property of orthonormal matrices, the Python code block below (lines 137–150) shows that the angle between the original standard basis vectors,
# original standard basis
b_ex = np.array([1, 0])
b_ey = np.array([0, 1])
# orthonormal basis
Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
[-np.sqrt(2)/2, np.sqrt(2)/2]])
# angle between original basis vectors
original_angle = np.arccos(b_ex.dot(b_ey)/(np.linalg.norm(b_ex) * np.linalg.norm(b_ey)))
transformed_angle = np.arccos(b_ex.T.dot(Q.T).dot(Q).dot(b_ey)/(np.linalg.norm(Q.dot(b_ex)) * np.linalg.norm(Q.dot(b_ey))))
print(math.degrees(original_angle), "\n",
math.degrees(transformed_angle))
90.0 90.0
Inverse Matrices Un-Transform Basis Vectors
Animation 7 shows that matrix inverses transform vector spaces in the opposite direction and magnitude of their non-inverted counterparts. In this way, matrix inverses can be conceptualized as un-transforming vector spaces. Within Animation 7, the space is first transformed using the orthonormal matrix of
The vector space is then un-transformed and brought back to the standard basis by applying the inverse of
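The Python sketch below illustrates this un-transforming behavior using the same orthonormal matrix as in the angle-preservation example above; for orthonormal matrices, the inverse is simply the transpose.

```python
import numpy as np

# Orthonormal (rotation) matrix from the angle-preservation example
Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
              [-np.sqrt(2)/2, np.sqrt(2)/2]])

x = np.array([2.0, 1.0])
x_rotated = Q @ x

# Applying the inverse un-transforms the vector back to where it started
print(np.allclose(np.linalg.inv(Q) @ x_rotated, x))  # True

# For orthonormal matrices, the inverse equals the transpose
print(np.allclose(np.linalg.inv(Q), Q.T))            # True
```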
Rectangular Matrices Change Dimension Space
So far, I have only shown linear transformations that result from applying square matrices. When multiplying spaces by square matrices, the input and output dimension spaces are the same. When multiplying by rectangular matrices, however, the dimensions of the input and output spaces are different; that is, rectangular matrices change the dimension space. As an example, Animation 8 below shows the transformation that results from applying the rectangular matrix of
Importantly, because the input and output dimension spaces are different, the input and output vectors live in different spaces, and so no input vector can lie in the span of any output vector (and vice versa).
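As a small sketch of a dimension-changing transformation (the matrix below is arbitrary, not the one in Animation 8), a 1 × 2 matrix collapses 2-dimensional inputs onto a 1-dimensional output space:

```python
import numpy as np

# A 1 x 2 rectangular matrix: inputs are 2-dimensional, outputs are 1-dimensional
R = np.array([[1.0, 2.0]])

x = np.array([3.0, 1.0])
y = R @ x

print(y, y.shape)  # [5.] (1,) -> the 2D input now lives on a line
print(x.shape)     # (2,)
```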
Eigenvectors (and Eigenvalues)
Due to the inherent connection between eigenvectors and singular value decomposition, it is necessary to provide an overview of the geometry of eigenvectors. That being said, although it is useful to understand the geometry of eigenvectors, their importance in singular value decomposition comes more from their meaning (which will be explained in later sections) and less so from their geometry. Nonetheless, I briefly present the geometry of eigenvectors and eigenvalues.
Animation 9 below shows that eigenvectors remain on their span following some linear transformation and eigenvalues represent the extent to which the eigenvectors are stretched. The geometry of eigenvectors and eigenvalues is summarized in Equation
In words, if some eigenvector,
One last important point to note is that only square matrices can have eigenvectors because non-square (or rectangular) matrices change the dimension space (see section on rectangular matrices). Because rectangular matrices change the dimension space, it is impossible for any vector to remain on its span.
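The defining property can be checked numerically; the sketch below uses an arbitrary symmetric matrix and verifies that each eigenvector stays on its span, being stretched only by its eigenvalue.

```python
import numpy as np

# Arbitrary symmetric matrix (chosen only for illustration)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eig_vals, eig_vecs = np.linalg.eig(A)

# A v = lambda v: each eigenvector remains on its span and is only stretched
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(np.allclose(A @ v, eig_vals[i] * v))  # True for every eigenvector
```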
The Singular Value Decomposition and its Visualization
Having covered the necessary fundamentals, I will introduce the singular value decomposition and its geometry. In short, any matrix
- $\mathbf{U}$: the left singular vectors of $\mathbf{A}$, which are the eigenvectors of $\mathbf{A}\mathbf{A}^\top$. Because symmetric matrices have full sets of orthonormal eigenvectors (see Appendix A3), $\mathbf{U}$ is orthonormal.
- $\mathbf{\Sigma}$: a rectangular matrix with the singular values along its diagonal. Singular values are equivalent to the square roots of the eigenvalues of $\mathbf{A}^\top\mathbf{A}$ (or equivalently of $\mathbf{A}\mathbf{A}^\top$; see Equation ).
- $\mathbf{V}$: the right singular vectors of $\mathbf{A}$, which are the eigenvectors of $\mathbf{A}^\top\mathbf{A}$. Because symmetric matrices have full sets of orthonormal eigenvectors (see Appendix A3), $\mathbf{V}$ is orthonormal.
Applying what I presented previously in the fundamentals of linear algebra, the singular value decomposition implies that the linear transformation applied by any matrix can be broken down into three constituent transformations in the following order:
- Rotation: $\mathbf{V}^\top$ is an orthonormal matrix and so rotates basis vectors (see section on orthonormal matrices). The astute reader will notice that the transpose of an orthogonal matrix is equivalent to its inverse, so $\mathbf{V}^\top$ is, more technically, an un-rotation of basis vectors.
- Stretching with possible dimension change: because $\mathbf{\Sigma}$ only has nonzero values along its diagonal, these values will stretch basis vectors (see section on diagonal matrices). The dimension space can also change following a transformation by $\mathbf{\Sigma}$. For example, if the number of rows of $\mathbf{\Sigma}$ is less than the number of columns, then the remaining columns will contain zeroes that remove the remaining dimensions.
- Rotation: $\mathbf{U}$ is an orthonormal matrix and thus rotates basis vectors (see section on orthonormal matrices).
Animation 10 below provides a geometric visualization of the transformation applied by each matrix of the singular value decomposition. First, it is shown that
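The decomposition can also be verified numerically with an arbitrary matrix: applying the rotation $\mathbf{V}^\top$, then the stretching/dimension change $\mathbf{\Sigma}$, and then the rotation $\mathbf{U}$ reproduces the transformation applied by the original matrix.

```python
import numpy as np

# Arbitrary matrix for illustration
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])

U, S, V_t = np.linalg.svd(A, full_matrices=False)

# The product of the three constituent transformations reproduces A
print(np.allclose(A, U @ np.diag(S) @ V_t))  # True

# Applied to a vector, the three transformations act in sequence: rotate, stretch, rotate
x = np.array([1.0, 2.0])
print(np.allclose(A @ x, U @ (np.diag(S) @ (V_t @ x))))  # True
```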
Proving the Singular Value Decomposition
With a geometric understanding of the singular value decomposition, I now provide a proof. Consider the symmetric matrix produced by
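Before walking through the proof, the relations it establishes can be checked numerically. The sketch below (with an arbitrary matrix) confirms that the right singular vectors are eigenvectors of the symmetric matrix $\mathbf{A}^\top\mathbf{A}$ and that the squared singular values are its eigenvalues.

```python
import numpy as np

# Arbitrary matrix for illustration
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])

U, S, V_t = np.linalg.svd(A, full_matrices=False)

# Eigendecomposition of the symmetric matrix A^T A (eigh returns ascending eigenvalues)
eig_vals, eig_vecs = np.linalg.eigh(A.T @ A)

# The squared singular values equal the eigenvalues of A^T A
print(np.allclose(np.sort(S**2), eig_vals))  # True

# The right singular vectors are the eigenvectors of A^T A (up to sign)
print(np.allclose(np.abs(V_t.T), np.abs(eig_vecs[:, ::-1])))  # True
```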
Understanding the Singular Value Decomposition by Synthesizing Principal Component Analysis
In this section, I will provide a deeper explanation of the singular value decomposition by synthesizing the prerequisites for running a principal component analysis. Briefly, principal component analysis provides a method for mapping some data set of n variables onto some lower-dimensional subspace consisting of only m dimensions. To this end, three objects must be computed to run a principal component analysis:
- The number of principal axes: the number of (lower) m dimensions that the original n variables can be mapped to.
- Loadings: correlations between original variables and principal axes. By computing loadings, the meaning of the principal axes can be determined.
- Principal component scores: scores of respondents on the principal axes.
Using the three matrices from the singular value decomposition, I will explain how to compute each of the above objects.
Two points deserve mention. First, a varimax rotation will be applied to the loadings to render the results more interpretable, thus facilitating an understanding of the singular value decomposition. Note that, although using varimax rotation adds some complexity to the computation of the loadings and principal component scores, this complexity will be explained in the sections that follow. Second, to facilitate learning in this section, I will apply formulas using a generated data set of wine drinkers.
A Guiding Example: Wine Drinkers
Briefly, a data set of 10 wine drinkers’ ratings of four wines will be used to explain the meaning of the singular value decomposition. Wine ratings (on a 0–100-point scale) are provided for four wines of 1) cabernet sauvignon (CS), 2) merlot (M), 3) rosé (R), and 4) champagne (Chp).
Importantly, for understanding the relation between singular value decomposition and principal component analysis, I generated the wine ratings such that there are two dimensions of wine drinking:
- Dinner Wine Drinking: prefers drinking red wines such as cabernet sauvignon (CS) and merlot (M).
- Celebratory Wine Drinking: prefers drinking champagne (Chp) and rosé (R).
To foreshadow, because of how I generated the data, there will be two principal axes that account for the majority of the variance.
The R code block below (lines 153–204) contains the code I used to generate the wine drinker data.
library(tidyverse)
library(Matrix)
# set means of each variable for dinner and celebratory wine drinkers (dwd, cwd)
means_dwd = c(81, 82, 68, 69)
means_cwd = c(63, 61, 83, 83)
sd = rep(x = 3, times=4)
# create covariance matrix
cor_var_12 <- .75
cor_var_34 <- .75
cor_var_13 <- 0
cor_var_14 <- cor_var_13
cor_var_23 <- cor_var_13
cor_var_24 <- cor_var_13
# covariance for dinner wine drinkers
cov <- diag(sd^2)
cov[2, 1] <- cor_var_12 * sd[1]^2
cov[3, 1] <- cor_var_13 * sd[1]^2
cov[3, 2] <- cor_var_23 * sd[1]^2
cov[4, 1] <- cor_var_14 * sd[1]^2
cov[4, 2] <- cor_var_24 * sd[1]^2
cov[4, 3] <- cor_var_34 * sd[1]^2
# mirror the lower triangle so the covariance matrix is symmetric (as done for cov_cwd below)
cov[upper.tri(cov)] <- t(cov)[upper.tri(cov)]
# covariance for celebratory wine drinkers
cov_cwd <- diag(sd^2)
cov_cwd[4, 3] <- cor_var_12 * sd[1]^2
cov_cwd[3, 1] <- cor_var_13 * sd[1]^2
cov_cwd[3, 2] <- cor_var_23 * sd[1]^2
cov_cwd[4, 1] <- cor_var_14 * sd[1]^2
cov_cwd[4, 2] <- cor_var_24 * sd[1]^2
cov_cwd[2, 1] <- cor_var_34 * sd[1]^2
cov_cwd[upper.tri(cov_cwd)] <- t(cov_cwd)[upper.tri(cov_cwd)]
# Number of samples
n <- 5
set.seed(27)
df_dwd = round(MASS::mvrnorm(n = n, mu = means_dwd, Sigma = cov, empirical = T) + rnorm(n = n, mean = 0, sd = 10))
df_cwd = round(MASS::mvrnorm(n = n, mu = means_cwd, Sigma = cov_cwd, empirical = T) + rnorm(n = n, mean = 0, sd = 10))
df_wine_drinkers <- rbind(df_dwd, df_cwd)
Determining the Number of Principal Axes
The number of principal components to extract can be determined by using the matrix of singular values,
Beginning with the first point, singular values can be converted to eigenvalues such that
Ending with the second point, eigenvalues represent the amount of variance (in terms of the number of original variables) accounted for by eigenvectors. Consider a mean-centered matrix
and eigenvectors
with the variance of the projected values then being
Because
# Matrix of wine ratings
A = pd.DataFrame({"Cab. Sauv.": [73, 73, 84, 80, 47, 54, 46, 63, 70, 59],
"Merlot": [76, 71, 88, 80, 49, 49, 46, 61, 67, 58],
"Champagne": [58, 55, 71, 69, 40, 69, 63, 89, 90, 79],
"Rosé": [61, 55, 71, 73, 39, 69, 66, 89, 90, 76]})
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
# Step 2) Compute eigenvalues of correlation matrix
df_corr = A.corr() # or use df_std.T.dot(df_std)/9
eig_vals = np.sort(np.linalg.eigvals(df_corr))[::-1]  # sort in descending order to match the ordering of the singular values
# Step 3) Convert singular values to eigenvalues and confirm all values are identical
eig_converted = S**2/(df_std.shape[0] - 1)
np.testing.assert_array_equal(eig_vals.round(4), eig_converted.round(4))
Second, eigenvalues can be used to determine the number of principal axes to retain. Although several rules exist for choosing the number of principal axes to retain (for a review, see Auerswald & Moshagen, 2019), the decision in this example is simple as the data were generated such that only two meaningful principal axes exist (i.e., dinner and celebratory wine drinkers). An examination of the eigenvalues confirms this as the first two principal axes account for 99.5% of the total variance (or 3.98 out of 4 possible variables; see line 225).
(eig_converted.cumsum()/eig_converted.sum()).round(3)
array([0.611, 0.995, 0.998, 1. ])
Computing Loadings and Applying Varimax Rotation
Now that the number of principal axes has been decided, I show how to determine their meaning. Because loadings are not always guaranteed to be meaningful (due to the problem of rotational indeterminacy), I show how rotations can be used to render principal axes more interpretable. As a result, this section comprises three parts:
- Computing loadings
- Rotational indeterminacy
- Applying varimax rotation
Computing Loadings
Understanding how to compute loadings simply requires understanding that loadings are correlations between the original variables and the principal axes. Thus, loadings can be computed by obtaining correlations between scores on the original variables and scores on the principal axes. Mathematically, I’ll show that this is equivalent to computing
To prove Equation
First, Equation
Note that, because
Second, I now show how to compute the matrix of standardized principal component scores,
To standardize the principal component scores, they must be divided by their standard deviations,
Proving Equation
Having proven Equations
Using the data set of wine drinkers, I now apply the above equations to compute the loadings. To preface (and as mentioned in the description of the data), the principal axes were generated such that one measures dinner wine drinking (high loadings of cabernet sauvignon and merlot) and the other one measures celebratory wine drinking (high loadings of champagne and rosé). Using the wine rating data, I now provide Python code (lines 227–260 below) to compute the loadings using Equation
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Compute eigenvalues of correlation matrix
df_corr = A.corr() # or use df_std.T.dot(df_std)/9
eig_vals = np.linalg.eigvals(df_corr)
# Step 3) Convert singular values to eigenvalues
num_obs = df_std.shape[0]
eig_converted = S**2/(num_obs - 1)
# Step 4) Run PCA and compute loadings using VE^0.5
pca = decomposition.PCA(n_components=2)
pca_scores = pca.fit_transform(df_std) # needed to obtain components
pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Step 5) Use SVD matrices to compute loadings and verify both computations are equivalent
svd_loadings_1 = df_std.T.dot(U[ : ,:2]) * np.sqrt(num_obs - 1)/(num_obs - 1)
svd_loadings_2 = V[ : ,:2].dot(np.diag(np.sqrt(eig_converted[0:2])))
# Step 6) Show that all computations of the loadings are equivalent
# round each array to 4 decimal places
pca_loadings = np.round(pca_loadings, 4) *-1. # NOTE: multiply by -1 to ensure equivalence
svd_loadings_1 = np.round(svd_loadings_1, 4)
svd_loadings_2 = np.round(svd_loadings_2, 4)
assert (np.array_equal(pca_loadings, svd_loadings_1) and
np.array_equal(pca_loadings, svd_loadings_2)), """
Loading computations are different."""
Rotational Indeterminacy
Although loadings can be computed using the matrices from the singular value decomposition (which I will call the unrotated loadings), they aren’t guaranteed to be meaningful. For instance, consider the set of loadings computed in the previous section and shown below in Table 1 and their alignment with how the data were generated. To recall, I generated the wine rating data such that two dimensions of wine drinking existed:
- Dinner Wine Drinking: prefers drinking red wines such as cabernet sauvignon (CS) and merlot (M).
- Celebratory Wine Drinking: prefers drinking champagne (Chp) and rosé (R).
Given how the wine ratings were generated, there should be two general patterns of loadings. In one pattern, cabernet sauvignon and merlot should have high loadings on one principal axis and low loadings on the other axis. The second pattern is simply the opposite: champagne and rosé should have high loadings on the other principal axis and low loadings on the axis on which cabernet sauvignon and merlot load highly. In looking at Table 1, the loadings do not remotely reflect the expected patterns. All the wines have high and negative loadings on the first principal axis and moderate loadings on the second axis. Therefore, in the current example, the unrotated loadings bear no resemblance to the expected pattern and are difficult to interpret (the R and Python code below in lines 261–272 produce Table 1).
# matrix of unrotated loadings
loadings = pd.DataFrame(data=pca_loadings,
index=[df_std.columns],
columns=['Axis 1', 'Axis 2']).reset_index().round(2)
loadings.rename(columns={'level_0': 'Wine'}, inplace=True)
kbl(py$loadings, booktabs = TRUE, format = 'html', align = c('l', 'c', 'c'),
caption = '(Unrotated) Loadings of Each Wine on Each Principal Axis',
escape = F,
table.attr = "style='width:300px;'") %>%
kable_styling(position = 'center') %>%
column_spec(column = c(2,3), width = "75px")
| Wine | Axis 1 | Axis 2 |
|---|---|---|
| Cab. Sauv. | -0.81 | -0.58 |
| Merlot | -0.74 | -0.67 |
| Champagne | -0.77 | 0.63 |
| Rosé | -0.80 | 0.60 |
To understand why loadings are not always meaningful, it is important to understand rotational indeterminacy. Mathematically speaking, an infinite number of loading sets exist that all account for the same amount of cumulative variance. Thus, although the loadings in Table 1 were obtained with closed-form solutions, there is nothing unique about them. The non-uniqueness of the above loadings can be better understood by plotting them. Figure 2 shows the unrotated loadings plotted onto each principal axis. Rotational indeterminacy becomes clear in understanding that the x- and y-axes in Figure 2 can be rotated in an infinite number of ways that each account for the same amount of cumulative variance (i.e., the eigenvalue sums will be identical).

# add colors
loadings['color'] = ['#730534', '#A40448', '#FB8EBD', '#FFFC27']
color_dict = dict(zip(loadings['Wine'], loadings['color']))
plot_loadings = (pt.ggplot(data=loadings, mapping=pt.aes(x='Axis 1', y='Axis 2', group='Wine', color='Wine')) +
pt.geom_point(size=2) +
pt.scale_color_manual(values=color_dict) +
pt.scale_x_continuous(limits=(-1, 1), expand=(0, 0)) + # Set x-axis limits
pt.scale_y_continuous(limits=(-1, 1), expand=(0, 0)) + # Set y-axis limits
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))
#save as .png
plot_loadings.save("images/loading_plot.png", width=8, height=6, dpi=1000)
To show rotational indeterminacy, I apply four rotations to the unrotated loadings obtained with the wine rating data (see Figure 2/Table 1) and show that the amount of variance accounted for by each rotation is equivalent (see Python code block below in lines 292–309).
# compute four different rotated loadings
varimax = Rotator(method="varimax")
oblimax = Rotator(method="oblimax")
quartimax = Rotator(method="quartimax")
equamax = Rotator(method="equamax")
varimax_loadings = varimax.fit_transform(pca_loadings)
oblimax_loadings = oblimax.fit_transform(pca_loadings)
quartimax_loadings = quartimax.fit_transform(pca_loadings)
equamax_loadings = equamax.fit_transform(pca_loadings)
# assert each set of loadings accounts for same amount of variance
assert len(set([
np.sum(pca_loadings**2).round(4),
np.sum(varimax_loadings**2).round(4),
np.sum(oblimax_loadings**2).round(4),
np.sum(quartimax_loadings**2).round(4),
np.sum(equamax_loadings**2).round(4)])) == 1, """The rotated loadings are not all equivalent"""
Applying Varimax Rotation
Having computed unrotated loadings and explained rotational indeterminacy, I now apply a rotation to the loadings. As an aside, given the inevitability of rotational indeterminacy, one immediate question centers around how to handle this issue. Unfortunately, there is no simple solution, and readers interested in a historical discussion of this issue can consult chapters 10–11 of Mulaik (2009). In the current case, I apply varimax rotation because it maximizes the variance of the squared loadings on each principal axis, driving each variable to load highly on one axis and weakly on the other (this rotation is also one of the more common ones), and so allows a meaningful interpretation of the principal axes. In looking at Table 2/Figure 3, the varimax-rotated loadings reflect the structure with which the data were generated: the wines of cabernet sauvignon and merlot load highly onto what I now call the dinner wine drinking principal axis, and the wines of rosé and champagne load onto what I now call the celebratory wine drinking principal axis (the R and Python code below in lines 310–333 produce Table 2).
# compute varimax-rotated loadings
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(pca_loadings)
# matrix of varimax-rotated loadings
df_varimax = pd.DataFrame(data=varimax_loadings,
index=[df_std.columns],
columns=['Celebratory', 'Dinner']).reset_index().round(2)
# NOTE: loadings are multiplied by -1 since there is nothing mathematically incorrect with this (see
# discussion on rotational indeterminacy) and because this multiplication provides loadings that
# make greater conceptual sense
cols = ['Celebratory', 'Dinner']
df_varimax[cols] = df_varimax[cols] * -1
df_varimax.rename(columns={'level_0': 'Wine'}, inplace=True)
kbl(py$df_varimax, booktabs = TRUE, format = 'html', align = c('l', 'c', 'c'),
caption = 'Varimax-Rotated Loadings of Each Wine on Each Principal Axis',
escape = F,
table.attr = "style='width:300px;'") %>%
kable_styling(position = 'center') %>%
column_spec(column = c(2,3), width = "75px") %>%
add_header_above(c(" " = 1, "Wine Drinking Preference" = 2))
| Wine | Celebratory | Dinner |
|---|---|---|
| Cab. Sauv. | 0.18 | 0.98 |
| Merlot | 0.08 | 0.99 |
| Champagne | 0.99 | 0.08 |
| Rosé | 0.99 | 0.12 |

# add colors
df_varimax['color'] = ['#730534', '#A40448', '#FB8EBD', '#FFFC27']
colors_dict = dict(zip(df_varimax['Wine'], df_varimax['color']))
plot_loadings = (pt.ggplot(data=df_varimax, mapping=pt.aes(x='Celebratory', y='Dinner', group='Wine', color='Wine')) +
pt.geom_point(size=2) +
pt.scale_color_manual(values=colors_dict) +
pt.scale_x_continuous(limits=(-1.02, 1.02), expand=(0, 0)) + # Set x-axis limits
pt.scale_y_continuous(limits=(-1.02, 1.02), expand=(0, 0)) + # Set y-axis limits
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))
#save as .png
plot_loadings.save("images/varimax_loading_plot.png", width=8, height=6, dpi=1000)
Computing Principal Component Scores with Rotated Loadings
Having obtained the rotated loadings, the associated rotated principal component scores can be computed. Importantly, before computing the rotated principal component scores, two points require explanation. First, to more explicitly explain a previous statement, principal component scores can simply be obtained by computing the dot products between the standardized data and the eigenvectors (which are normalized; see the section on dot products).
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Run PCA and compute PCA scores
pca = decomposition.PCA(n_components=2)
pca_scores = pca.fit_transform(df_std)
# Step 3) Use SVD matrices to compute PCA scores (multiply by -1 to ensure signs are consistent)
pca_scores_svd = df_std.dot(V[ : , :2])*-1
# assert that both computations are identical
differences = np.abs(pca_scores - pca_scores_svd)
assert (differences.round(5) == 0).all().all() == True, '''PCA scores are not equivalent.'''
Second, scaled (but unstandardized) principal component scores can be computed using the loadings such that
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Use SVD matrices to compute PCA scores (multiply by -1 to ensure signs are consistent)
# and scaled PCA scores
pca_scores_svd = df_std.dot(V[ : , :2])*-1
# compute scaled PCA scores using the loadings (V times the square roots of the eigenvalues; 9 = n - 1)
loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
pca_scores_scaled = df_std.dot(loadings)
# note: the variance of the unscaled scores and the standard deviation of the scaled scores both equal the eigenvalues
assert (pca_scores_svd.var().round(4) == pca_scores_scaled.std().round(4)).all() == True, \
"""Scaling of PCA scores is incorrect."""
Having covered some necessary prerequisites, the computation of rotated principal component scores can be explained. The crux of the issue with computing rotated principal component scores is that rotations such as varimax cause the loadings to lose their orthogonality. To prove this, I show that
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Compute unrotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
# Step 3) Compute rotated loadings
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(unrotated_loadings)
# Step 4) Show that unrotated loadings are orthogonal but not rotated loadings
unrot_orth = unrotated_loadings.T.dot(unrotated_loadings)
rot_orth = varimax_loadings.T.dot(varimax_loadings)
# Create a diagonal matrix from the diagonal elements
unrotated_diagonal = np.diag(np.diag(unrot_orth))
rotated_diagonal = np.diag(np.diag(rot_orth))
# Check if the original matrix is equal to the diagonal matrix
is_diagonal = np.allclose(unrot_orth, unrotated_diagonal, atol=1e-5)
not_diagonal = np.allclose(rot_orth, rotated_diagonal, atol=1e-5)
# Assert that the matrix is diagonal
assert (is_diagonal == True and not_diagonal==False), """Check computation of
rotated and unrotated loadings."""
Because the rotated loadings are not orthogonal, using them to compute principal component scores (Equation
The derivation of
Using the equation derived above, I now compute the rotated principal component scores for the wine rating data and show that they match the scores obtained from the built-in FactorAnalyzer functions and only differ by a scaling factor (see Python code below in lines 419–439).
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Compute rotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(unrotated_loadings)
# Step 3) Compute correction matrix Q
Q = (varimax_loadings.T.dot(varimax_loadings))
rotated_scores = df_std.dot(varimax_loadings).dot(np.linalg.inv(Q))
# Step 4) Show computation of PC scores is identical to using built-in function (barring a
# scaling factor)
varimax_fac = FactorAnalyzer(rotation='varimax', n_factors=2, method='principal')
varimax_fac.fit(df_std)
factor_scores = varimax_fac.transform(df_std)
Like the loadings, the rotated principal component scores can also be plotted. For comprehensiveness, I show a biplot below in Figure 4 where loadings are shown as vectors and rotated principal component scores are shown as dots.

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)
# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T
# Step 2) Compute rotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
varimax = Rotator(method="varimax")
df_loadings = pd.DataFrame(varimax.fit_transform(unrotated_loadings)) * -1
df_loadings.rename(columns= {0: 'Dinner Wine Drinking',
1: 'Celebratory Wine Drinking'}, inplace=True)
df_loadings[['x_start', 'y_start']] = 0
df_loadings['Wine'] = ['Champagne', 'Rosé', 'Cab. Sauv.', 'Merlot']
df_loadings['color'] = ['#FB8EBD', '#FFFC27', '#730534', '#A40448']
color_dict = dict(zip(df_loadings['Wine'], df_loadings['color']))
# Step 3) Compute correction matrix Q
Q = (varimax_loadings.T.dot(varimax_loadings))
df_rot_scores = pd.DataFrame(df_std.dot(varimax_loadings).dot(np.linalg.inv(Q)))
df_rot_scores.rename(columns={0: 'Dinner Wine Drinking',
1: 'Celebratory Wine Drinking'}, inplace=True)
# create biplot
biplot = (pt.ggplot(data=df_rot_scores, mapping=pt.aes(x='Dinner Wine Drinking',
y='Celebratory Wine Drinking')) +
# add PCA scores
pt.geom_point() +
# add loadings
pt.geom_segment(data=df_loadings,
mapping=pt.aes(x='x_start', y='y_start',
xend='Dinner Wine Drinking', yend='Celebratory Wine Drinking',
color='Wine'),
arrow=pt.arrow(type="closed", length=0.12, angle=30)) +
pt.scale_color_manual(values=color_dict) +
# scale specifications
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.scale_x_continuous(limits=(-2, 2), expand=(0, 0), minor_breaks=[]) + # Set x-axis limits
pt.scale_y_continuous(limits=(-2, 2), expand=(0, 0), minor_breaks=[]) + # Set y-axis limits
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))
biplot.save("images/pca_biplot.png", dpi=500, width=8, height=6)
Conclusion
In summary, this whitepaper explicated the geometry of the singular value decomposition and its underlying meaning in relation to principal component analysis. Geometrically, the singular value decomposition shows that every matrix can be decomposed into a rotation, followed by stretching and/or a dimension change, followed by another rotation. At a deeper level, the singular value decomposition provides eigenvectors that define principal axes onto which the original variables and scores can be projected to provide, respectively, loadings and principal component scores. Using rotations, the loadings and principal component scores can be transformed to have more interpretable values.
Acknowledgments
I’d like to thank David Stanley for providing assistance in generating the wine rating data and providing insights into principal component rotations.
References
Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24(4), 468. https://doi.org/10.1037/met0000200
Mulaik, S. (2009). Foundations of factor analysis. CRC Press. Retrieved from bit.ly/40oGrFH
Appendix A: Proof of Spectral Theorem
The spectral theorem has three propositions that I prove in turn:
- Symmetric matrices must have at least one eigenvalue (and, therefore, at least one eigenvector).
- For any symmetric matrix, all of its eigenvalues are real.
- For any symmetric matrix, an orthonormal set of eigenvectors exists that spans its space. In other words, a full set of orthogonal eigenvectors exists.
Appendix A1: Intuition For Why Symmetric Matrices Must Have at Least One Eigenvalue and Eigenvector
To understand why all symmetric matrices have at least one eigenvalue, it is important to first understand how eigenvalues are obtained. To derive an eigenvalue, the equation of
can be re-expressed below in Equation
Fortunately, the fundamental theorem of algebra guarantees that any square matrix will have at least one eigenvalue (real or complex) if some constant
To provide a basic intuition for why square matrices must have at least one eigenvalue (whether complex or real), it is important to realize that the determinant equation is simply a polynomial one and that polynomial equations always have at least one root (whether complex or real). For example, consider the 2x2 matrix
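The sketch below uses an arbitrary symmetric 2 × 2 matrix to show that the determinant equation is simply a polynomial whose roots are the eigenvalues.

```python
import numpy as np

# Arbitrary 2 x 2 symmetric matrix for illustration
A = np.array([[2.0, 3.0],
              [3.0, 2.0]])

# det(A - lambda*I) = lambda^2 - trace(A)*lambda + det(A); the fundamental theorem of
# algebra guarantees this polynomial has at least one (possibly complex) root
char_poly = [1.0, -np.trace(A), np.linalg.det(A)]

print(np.roots(char_poly))   # roots 5 and -1
print(np.linalg.eigvals(A))  # the same values returned as eigenvalues
```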
Appendix A2: Proof That All Eigenvalues of Symmetric Matrices are Real
In the previous section, an argument was provided for why all square matrices must have at least one eigenvalue (and consequently at least one eigenvector). In this section, I will provide a proof that symmetric matrices only have real eigenvalues. As some necessary background, it is first important to have a basic understanding of complex numbers and complex conjugates. A complex number,
Importantly, complex numbers have corresponding entities called complex conjugates,
In circumstances where a complex number is equivalent to its conjugate, the imaginary component does not exist, and so the complex number and its conjugate are equivalent to the real number
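A quick numerical sketch (with arbitrary matrices) illustrates the claim being proven: the eigenvalues of a symmetric matrix are always real, whereas a non-symmetric matrix can have complex eigenvalues.

```python
import numpy as np

# Symmetric matrix: its eigenvalues are guaranteed to be real
sym = np.array([[2.0, 5.0],
                [5.0, 3.0]])
print(np.linalg.eigvals(sym))      # two real eigenvalues

# Non-symmetric matrix (a 90-degree rotation): its eigenvalues are complex
non_sym = np.array([[0.0, -1.0],
                    [1.0,  0.0]])
print(np.linalg.eigvals(non_sym))  # [0.+1.j 0.-1.j]
```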
Appendix A3: Proof That Symmetric Matrices Have Full Sets of Orthonormal Eigenvectors
At this point, I have proven that symmetric matrices have real eigenvalues and that there must be at least one eigenvalue with a corresponding eigenvector. With this information, it can now be proven that symmetric matrices have full sets of orthonormal eigenvectors. That is, when applying the transformation of a symmetric matrix, there exists a set of orthonormal eigenvectors that span the entire space of the symmetric matrix.
To prove that symmetric matrices have full sets of orthogonal eigenvectors, I show that any symmetric matrix can be diagonalized by an orthonormal matrix. This proof applies the Gram-Schmidt procedure to a symmetric matrix to generate a set of orthonormal vectors that is then shown to diagonalize the symmetric matrix. Mathematically then, and central to the spectral theorem, any symmetric matrix
where
To prove Equation
Beginning with the first step of the proof, a symmetric matrix,
Having shown that the first column of a symmetric matrix can be diagonalized, I now show in this second part of the proof that the remaining columns of a symmetric matrix can be diagonalized. In other words, there exists some orthonormal matrix
Now, the central component of the spectral theorem can be proven by showing that
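The conclusion of the spectral theorem can also be checked numerically. The sketch below (with an arbitrary symmetric matrix) shows that the matrix of eigenvectors is orthonormal and diagonalizes the symmetric matrix.

```python
import numpy as np

# Arbitrary symmetric matrix for illustration
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh returns an orthonormal set of eigenvectors for a symmetric matrix
eig_vals, Q = np.linalg.eigh(A)

# Q^T A Q is the diagonal matrix of eigenvalues ...
print(np.allclose(Q.T @ A @ Q, np.diag(eig_vals)))  # True

# ... and Q is orthonormal: Q^T Q = I
print(np.allclose(Q.T @ Q, np.eye(3)))              # True
```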
Appendix B: $\mathbf{A}^\top\mathbf{A}$ is a Positive Semi-Definite Matrix
A positive semi-definite matrix,
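The defining property can be verified numerically; assuming a matrix of the form $\mathbf{A}^\top\mathbf{A}$, the sketch below (with a random matrix, used purely for illustration) shows that every quadratic form is non-negative and that every eigenvalue is non-negative.

```python
import numpy as np

rng = np.random.default_rng(27)

# Any matrix A (random values, purely for illustration)
A = rng.normal(size=(5, 3))
M = A.T @ A

# x^T (A^T A) x = ||A x||^2, which can never be negative ...
x = rng.normal(size=3)
print(x @ M @ x >= 0)                                   # True
print(np.isclose(x @ M @ x, np.linalg.norm(A @ x)**2))  # True

# ... equivalently, all eigenvalues of A^T A are non-negative
print((np.linalg.eigvalsh(M) >= -1e-12).all())          # True
```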
Appendix C: Unitary Freedom of Positive Semi-Definite Matrices
In this section, I will first state the unitary freedom of positive semi-definite matrices and will then provide an intuitive explanation. In formal terms, this property of positive semi-definite matrices states that, given
then there must exist some orthogonal matrix
Although the math above is indeed true, it does not provide an intuitive understanding for why there must exist an orthogonal matrix that allows us to translate between the two matrices. Therefore, I now provide an intuitive explanation for this proof.
As discussed in the section on orthonormal matrices, these matrices simply rotate basis vectors, and so vector lengths and the angles between vectors remain unchanged. Therefore, in looking at Equation
Beginning with the lengths of
Ending with the relative angles between the basis vectors, consider any two basis vectors in the matrices of
In looking at the formula for the angle between two vectors below
and applying the equality of vector lengths (Equation
To summarize, if two positive semi-definite matrices satisfy the condition of unitary freedom (Equation
Appendix D: Proof That the Length of a Standardized Variable Equals $\sqrt{n-1}$
If
Because the standard deviation equation is
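To make the claim concrete, the short sketch below verifies it numerically with simulated data (the sample size and seed are arbitrary): standardizing with the sample standard deviation gives a vector whose squared length is $n - 1$, and so whose length is $\sqrt{n-1}$.

```python
import numpy as np

rng = np.random.default_rng(27)
x = rng.normal(loc=50, scale=10, size=10)

# Standardize with the sample standard deviation (ddof=1), as done throughout this paper
z = (x - x.mean()) / x.std(ddof=1)

# The squared length of a standardized variable is n - 1, so its length is sqrt(n - 1)
n = len(z)
print(np.isclose(z @ z, n - 1))                       # True
print(np.isclose(np.linalg.norm(z), np.sqrt(n - 1)))  # True
```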