The Theory of the Singular Value Decomposition and its Relation to Principal Component Analysis

This whitepaper explicates the geometry of the singular value decomposition and its deep connection to principal component analysis. Geometrically, the singular value decomposition shows that every matrix can be decomposed into a rotation, followed by a stretching and/or change of dimension, followed by another rotation. At a deeper level, the singular value decomposition provides eigenvectors that define principal axes onto which the original variables and scores can be projected to yield, respectively, loadings and principal component scores. Using rotations, the loadings and principal component scores can be transformed to have more interpretable values.

Three points require mentioning before beginning this whitepaper. First, I used Python and R code throughout this whitepaper and often imported objects created in Python into R for creating plots. To use Python and R in tandem, I used the reticulate package for R and created a conda environment to run the Python code (see lines 2–12 below).

library(reticulate)

#create and use conda environment
conda_create(envname = 'blog_posts', python_version = '3.10.11')
use_condaenv(condaenv = 'blog_posts')

#install packages in conda environment
py_packages <- c('numpy', 'pandas', 'scikit-learn', "plotnine", "statsmodels", "manim", "factor_analyzer")
conda_install(envname = 'blog_posts', packages = py_packages, pip=T)

#useful for checking what packages are loaded
py_list_packages(envname = 'blog_posts', type = 'conda')

Second, the Python packages and modules in the Python code block are needed to run all the Python code in this paper (see lines 15–22 below).

import math
import numpy as np
import pandas as pd
import plotnine as pt

from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from factor_analyzer.rotator import Rotator
from factor_analyzer import FactorAnalyzer

Third, although I often include code in papers so that readers can explore concepts, I decided to not include the Python code I used to create mathematical animations given the considerable length of the script. For readers interested in how I created my animations, I used the manim library and my source code can be viewed in this script.

Introduction

In this whitepaper, I provide a deep dive into the math underlying the singular value decomposition and use this as a foundation to explain its connection with principal component analysis. To this end, I provide a detailed proof of the singular value decomposition and explain the linear algebra necessary for understanding its geometry. I then explain the meaning of the singular value decomposition by synthesizing the parts needed to run a principal component analysis. To address all these points, the paper proceeds through the following sections:

  1. A review of linear algebra
  2. The singular value decomposition and its visualization
  3. Proving the singular value decomposition
  4. Understanding the singular value decomposition by synthesizing principal component analysis

A Review of Linear Algebra

Some Important Fundamentals

In this section, I provide some fundamentals of linear algebra that are necessary for understanding the geometry of the singular value decomposition. To this end, I explain the following four points:

  1. Matrices define basis vectors
  2. Matrix-vector multiplication transforms bases by computing weighted vector sums
  3. Dot products are 1-dimensional cases of matrix-vector multiplication
  4. Dot products provide orthogonal projection lengths with normalized vectors

Matrices Define Basis Vectors

Basis vectors can be understood as the coordinates used to define some n-dimensional space, $\mathbb{R}^n$. The familiar standard basis vectors—that is, vectors with a single non-zero element equal to one—are often used to define n-dimensional spaces. As an example, consider a two-dimensional space, $\mathbb{R}^2$. To span $\mathbb{R}^2$ (i.e., define any vector in $\mathbb{R}^2$), the standard basis vectors of $\mathbf{e}_x = [1 \ 0]^\top$ and $\mathbf{e}_y = [0 \ 1]^\top$ can be used, and these vectors can be combined to form the basis matrix shown below in Equation :

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Note that each column of this basis matrix represents a basis vector.

If $\mathbf{e}_x$ is multiplied by one and $\mathbf{e}_y$ is multiplied by two, the vector $[1 \ 2]^\top$ is obtained, and this is shown below in Animation 1.

Animation  1
Coordinates of Vector in Standard Basis (Equation )
Note. Using the standard basis vectors of $\mathbf{e}_x = [1 \ 0]^\top$ and $\mathbf{e}_y = [0 \ 1]^\top$, the pink vector ($[1 \ 2]^\top$) has coordinates of 1 and 2 on $\mathbf{e}_x$ and $\mathbf{e}_y$, respectively.

Importantly, aside from the standard basis vectors being familiar and simple, there isn't always a strong justification for using them in applied settings. Thus, non-standard basis vectors are often used in applied settings. As an example of non-standard basis vectors, consider defining with basis vectors and , which are defined as a matrix below in Equation :

To define in the non-standard basis of Equation , the vector coordinates of are used, and this is shown below in Animation 2. (In later sections, I will show how to obtain coordinates in non-standard bases.)

Animation  2
Coordinates of Vector in Non-Standard Basis (Equation )
Note. Using the non-standard basis vectors of and , the pink vector () has coordinates defined by .

Matrix-Vector Multiplication Transforms Bases by Computing Weighted Vector Sums

Because matrices define basis vectors, they can also be used to apply transformations. To understand how matrices apply transformations, consider again the non-standard basis of

In post-multiplying by , the first basis vector, , is multiplied by 1.5 and the second basis vector, , is multiplied by 0.5. Each weighted composite is then summed to produce .

Animation 3 below shows how matrix-vector multiplication of Equation transforms bases by computing weighted vector sums. The summing of weighted basis vectors is then shown to be equivalent to applying a shearing transformation (as specified by ) to the entire matrix space.

Animation  3
Geometry of Matrix-Vector Multiplication from Equation
Note. The example in this animation shows that pre-multiplying a vector by a matrix of basis vectors simply involves taking the weighted sum of the basis vectors. Specifically, the first basis vector of , , is multiplied by 1.5 and the second basis vector, , is multiplied by 0.5. Both weighted basis vectors are then summed to give .
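To make the weighted-sum view concrete, the short Python sketch below checks that multiplying a basis matrix by a vector of weights returns the same result as explicitly summing the weighted basis columns. The matrix B and weights w are hypothetical values chosen only for illustration (the animation's basis is not reproduced here).

import numpy as np

# hypothetical 2x2 basis matrix; each column is a basis vector
B = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# hypothetical weights (the vector being transformed)
w = np.array([1.5, 0.5])

# matrix-vector product
Bw = B.dot(w)

# explicit weighted sum of the basis columns
weighted_sum = w[0] * B[:, 0] + w[1] * B[:, 1]

# both computations agree
assert np.allclose(Bw, weighted_sum)
print(Bw)  # [2.  0.5]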

Dot Products are 1-Dimensional Cases of Matrix-Vector Multiplication

Although it may not be obvious, dot products are simply 1-dimensional cases of matrix-vector multiplication. That is, dot products also transform bases by computing weighted vector sums. The only nuance is that the transformations applied by dot products occur in a 1-dimensional space (i.e., a line). To understand the geometry of dot products, consider two vectors of and . In computing the dot product between these vectors, , basis vectors still exist, but they are now 1-dimensional. More specifically, the first basis vector of , , is multiplied by 1 and the second basis vector, , is multiplied by 2. I have provided Animation 4 to clearly explain the geometry of dot products.

Animation  4
Geometry of Dot Product Multiplication from Equation
Note. Dot products are shown to transform bases into a 1-dimensional space. In this animation, the dot product is shown between the two vectors of and . As with matrix-vector multiplication, a transformation still occurs, but it occurs in a 1-dimension space. More specifically, the first basis vector of , , is multiplied by 1 and the second basis vector, , is multiplied by 2.
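As a quick numerical check of this equivalence, the sketch below (with hypothetical vectors I chose for illustration) treats one vector as a 1 × 2 matrix of one-dimensional "basis vectors" and confirms that the result matches numpy's dot product.

import numpy as np

# hypothetical vectors for illustration
a = np.array([3.0, 4.0])
w = np.array([1.0, 2.0])

# dot product as a 1x2 matrix (row vector) times a column vector
as_matrix_product = a.reshape(1, 2).dot(w.reshape(2, 1))[0, 0]

# weighted sum of 1-dimensional "basis vectors" (the elements of a)
as_weighted_sum = w[0] * a[0] + w[1] * a[1]

assert np.isclose(a.dot(w), as_matrix_product)
assert np.isclose(a.dot(w), as_weighted_sum)
print(a.dot(w))  # 11.0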

Dot Products Provide Orthogonal Projection Lengths With Normalized Vectors

One focal process of principal component analysis is the computation of orthogonal projections: the value of one variable expressed on another basis. In the context of principal component analysis, analysts want to know how scores on one variable translate into scores on another variable that exists in a lower dimension. As it turns out, dot products accomplish this task when the vectors being projected onto are normalized. That is, when computing the orthogonal projection of a vector, , onto a normalized vector, , the equation is simply that of the dot product:

To understand how Equation is derived, an understanding of orthogonal projections is necessary. To this end, I provide a visualization below in Figure 1. To begin, the task shown in Figure 1 is to find the value of on . For the purposes of comprehension, can be conceptualized as another basis that defines the span of , and we want to know the orthogonal projection of on this basis. The solution to this problem becomes obvious once two facts become evident. First, the orthogonal projection will be some scalar multiple of , . Second, because the projection is orthogonal, then a vector must exist that has a null dot product with the projection. This is indeed the case: is orthogonal with . Given these two points, then Equation below for the orthogonal projection can be obtained.

If the vector being projected onto is normalized, then it has length of 1 (i.e., ), and simply computing the dot product provides the orthogonal projection (i.e., Equation is obtained).
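Written out, the derivation sketched above proceeds as follows (using $\mathbf{x}$ for the vector being projected and $\mathbf{v}$ for the vector spanning the line $L$, consistent with the labels in Figure 1):

$$
(\mathbf{x} - c\mathbf{v})^\top \mathbf{v} = 0
\;\Rightarrow\;
\mathbf{x}^\top\mathbf{v} - c\,\mathbf{v}^\top\mathbf{v} = 0
\;\Rightarrow\;
c = \frac{\mathbf{x}^\top\mathbf{v}}{\mathbf{v}^\top\mathbf{v}},
\qquad
\text{Proj}_L(\mathbf{x}) = \frac{\mathbf{x}^\top\mathbf{v}}{\mathbf{v}^\top\mathbf{v}}\,\mathbf{v}.
$$

If $\mathbf{v}$ is normalized so that $\mathbf{v}^\top\mathbf{v} = 1$, this reduces to $\text{Proj}_L(\mathbf{x}) = (\mathbf{x}^\top\mathbf{v})\,\mathbf{v}$, which is the dot-product form discussed above.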

Figure 1
Visualization of Vector Projection
Note. The process for deriving projections (Equation ) is visualized. The depicted objective is to find the value of on . For the purposes of comprehension, can be conceptualized as another basis that defines the span of , and we want to know the orthogonal projection of on this basis. The solution to this problem becomes obvious once two facts become evident. First, the orthogonal projection will be some scalar multiple of , . Second, because the projection is orthogonal, then a vector must exist that has a null dot product with the projection. This is indeed the case: is orthogonal with . For readers interested in producing this visualization, the Python code is provided below in lines 23--125.
# Define vectors
x = np.array([3, 3])
v = np.array([1, 3])

# scalar multiplier for projection
c = v.T.dot(x)/(x.dot(x))
v_proj = c*x

# Create DataFrame for vectors
df_vectors = pd.DataFrame({
'x_start': [0, 0, v_proj[0], 0],
'y_start': [0, 0, v_proj[1], 0],
'x_end': [v[0], x[0], v[0], 2],
'y_end': [v[1], x[1], v[1], 2],
'label': ['x', 'v', 'x_vproj','v_proj'],
'color': ["#d170c7","#d4bd04", "#33c304","#9DFFFF"]})


# Define line span of x (t * x for t in [-5, 5])
t_values = np.linspace(-5, 5, 100)
x_span = t_values * x[0]
y_span = t_values * x[1]

df_line = pd.DataFrame({'x': x_span, 'y': y_span})

# Create a dataframe for tick labels (for easier annotation)
ticks = range(-5, 6) # From -5 to 5

df_xticks = pd.DataFrame({
"x": [i for i in ticks if i != 0], # Exclude 0
"y": [0] * (len(ticks) - 1), # Same length after filtering
"label": [str(i) for i in ticks if i != 0] # Exclude 0 from labels
})

df_yticks = df_xticks.copy()
df_yticks['x'] = df_xticks['y']
df_yticks['y'] = df_xticks['x']

# dataframe for annotations
df_annotations = pd.DataFrame({
"x": [-4, -4, -4, 0.75, 3.25, 1.5, 2, -4, 3.75, 3, 3, 3, 3],
"y": [5, 4, 3, 3.25, 2.75, 0.5, 3.5, 2, 4.25, -2, -3, -4, -5],
"label": [r"$\mathbf{x}^\top = [ 1 \ 3 ]$",
r"$\mathbf{v}^\top = [ 3 \ 3 ]$",
r"$Proj_L(\mathbf{x}) = c\mathbf{v}$",
r"$\mathbf{x}$",
r"$\mathbf{v}$",
r"$Proj_L(\mathbf{x})$",
r"$\mathbf{x} - Proj_L(\mathbf{x})$",
r"$L = \{c\mathbf{v} | c \in \mathbb{R}\}$",
r"$L$",
r"$(\mathbf{x} - Proj_L(\mathbf{x}))\mathbf{v} = 0 $",
r"$(\mathbf{x} - c\mathbf{v})\mathbf{v} = 0$",
r"$\mathbf{x}^\top \mathbf{v} - c\mathbf{v}^\top \mathbf{v} = 0$",
r"$c = \frac{\mathbf{x}^\top \mathbf{v}}{\mathbf{v}^\top \mathbf{v}}$"],
"color": ["#d170c7", "#d4bd04", "#9DFFFF",
"#d170c7", "#d4bd04", "#9DFFFF", "#33c304", "white", "white",
"white", "white" , "white", "white"],
"size": [14]*13})


# Create plot
plot_proj = (pt.ggplot() +

# Specify vectors
pt.geom_line(df_line, pt.aes(x='x', y='y'), alpha=0.5, color="white") +
pt.geom_segment(data=df_vectors,
mapping=pt.aes(x='x_start', y='y_start', xend='x_end', yend='y_end'),
arrow=pt.arrow(type="closed", length=0.12, angle=30),
color=df_vectors['color']) +

# Add vector labels
pt.geom_text(data=df_annotations,
mapping=pt.aes(x="x", y="y", label="label"),
color=df_annotations['color'], size=df_annotations['size']) +

pt.geom_hline(yintercept=0, color="white") +
pt.geom_vline(xintercept=0, color="white") +

# Remove default tick labels and minor grid lines
pt.scale_x_continuous(limits=(-5, 5), breaks=range(-5, 6), minor_breaks=[]) +
pt.scale_y_continuous(limits=(-5, 5), breaks=range(-5, 6), minor_breaks=[]) +

# Manually place x-axis tick labels ON the hline
pt.geom_text(df_xticks, pt.aes(x='x', y='y', label='label'),
color="white", size=10, va="top", nudge_y=-0.1) +

# Manually place y-axis tick labels ON the vline
pt.geom_text(df_yticks, pt.aes(x='x', y='y', label='label'),
color="white", size=10, ha="right", nudge_x=-0.1, nudge_y=0) +

pt.theme_minimal(base_family='Helvetica', base_size=14) +
pt.theme(
text=pt.element_text(color="#002241"),
plot_background=pt.element_rect(fill="#002241"),
panel_background=pt.element_rect(fill="#002241"),
panel_grid_major=pt.element_line(color="#9DFFFF", size=0.3, alpha=0.5),
axis_title=pt.element_blank(), # Remove axis titles
axis_ticks=pt.element_blank(), # Remove tick marks
axis_text=pt.element_blank(), # Remove default tick labels
legend_position="none"))

plot_proj.save("images/proj_plot.png", dpi=500, width=8, height=6)

Visualizations of Elementary Matrices

As explained in the previous section, matrix multiplication transforms bases by computing weighted vector sums. As it turns out, there are three fundamental transformations that constitute every possible matrix transformation: 1) Rotation, 2) stretching, and 3) changing dimension. Importantly, each fundamental transformation corresponds to a particular type of matrix, and I will provide visualizations for each of these matrices. Below is a brief summary of the matrices and their transformations (note, there is technically a fourth type of transformation that un-transforms):

  1. Diagonal matrices stretch basis vectors.
  2. Orthonormal matrices rotate basis vectors.
  3. Inverse matrices un-transform basis vectors.
  4. Rectangular matrices change the dimension space.

I also provide a geometric overview of eigenvectors and eigenvalues given their central importance to singular value decomposition and principal component analysis.

Diagonal Matrices Stretch Basis Vectors

Animation 5 below shows that diagonal matrices simply stretch the basis vectors that define a space. In the current example, the matrix of

stretches the first basis vector, , by a factor of two such that it becomes and the second basis vector, , by a factor of three such that it becomes . As a result, the vector of becomes .

Animation  5
Geometry of Matrix Multiplication From Diagonal Matrix
Note. The example in this animation shows the geometry of pre-multiplying a vector, , by the diagonal matrix . In this case, the space spanned by the standard basis vectors is stretched. Specifically, the first basis vector, , is stretched by a factor of two such that it becomes and the second basis vector, , is stretched by a factor of three such that it becomes . As a result, the vector of becomes .
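The stretching described above can be verified numerically. The sketch below uses the diagonal matrix with 2 and 3 on its diagonal described in the text; the example vector v is a hypothetical value of my own choosing.

import numpy as np

# diagonal matrix that stretches the x-axis by 2 and the y-axis by 3
D = np.diag([2.0, 3.0])

# standard basis vectors
e_x = np.array([1.0, 0.0])
e_y = np.array([0.0, 1.0])

print(D.dot(e_x))  # [2. 0.]  -> first basis vector stretched by a factor of two
print(D.dot(e_y))  # [0. 3.]  -> second basis vector stretched by a factor of three

# a hypothetical vector: each coordinate is scaled by the corresponding diagonal entry
v = np.array([1.0, 2.0])
print(D.dot(v))    # [2. 6.]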

Orthonormal Matrices Only Rotate Basis Vectors (Length and Angle Preserving)

Animation 6 below shows that orthonormal matrices simply rotate the basis vectors that define a space. In the current example, the matrix of

rotates the first basis vector, , clockwise to , and the second basis vector, , to . As a result, the vector of becomes .

Animation  6
Geometry of Orthonormal Matrix Multiplication
Note. The example in this animation shows the geometry of pre-multiplying a vector, , by the orthonormal matrix . In this case, the first basis vector, , is rotated clockwise to , and the second basis vector, , is rotated clockwise to . As a result, the vector of becomes .

Importantly, orthonormal matrices only rotate vector spaces. That is, there is no stretching/compressing of the vector space; vector lengths remain unchanged and so do the angles between vectors. In mathematical terms, orthonormal matrices preserve lengths and angles and, given the ease with which each of these statements can be proven, I show each proof in turn.

Beginning with the length-preserving property of orthonormal matrices, consider the formula for computing vector lengths in Equation below:

By applying Equation above to the length of a vector that is transformed by some orthonormal matrix, , it becomes clear that the vector’s length remains unchanged.
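In symbols, the argument runs as follows, writing $\mathbf{Q}$ for an orthonormal matrix (so that $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$) and $\mathbf{x}$ for any vector:

$$
\|\mathbf{Q}\mathbf{x}\|^2 = (\mathbf{Q}\mathbf{x})^\top(\mathbf{Q}\mathbf{x}) = \mathbf{x}^\top\mathbf{Q}^\top\mathbf{Q}\,\mathbf{x} = \mathbf{x}^\top\mathbf{x} = \|\mathbf{x}\|^2 .
$$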

As an example of the length-preserving property of orthonormal matrices, the Python code block below (lines 126–134) shows that the length of the original vector, $[2 \ 1]^\top$, and that of its orthonormal-transformed version, $[\tfrac{3\sqrt{2}}{2} \ \ -\tfrac{\sqrt{2}}{2}]^\top$, are the same.

# original vector
g_e = np.array([2, 1])

# transformed vector
g_q = np.array([3*(np.sqrt(2))/2, -np.sqrt(2)/2])

# squared lengths of each vector (equal squared lengths imply equal lengths)
print(g_e.dot(g_e), "\n",
np.round(g_q.dot(g_q), 6))
5
5.0

Ending with the angle-preserving feature of orthonormal matrices, consider the formula for computing the angle between two vectors, and , shown below in Equation :

By applying Equation above to the angles between vectors following an orthonormal transformation, and , it becomes clear that the angle between them remains unchanged (note how the proof shown in Equation above is used in the below proof).
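In symbols, for two vectors $\mathbf{u}$ and $\mathbf{v}$ and an orthonormal matrix $\mathbf{Q}$, the cosine of the angle is unchanged because both the numerator and the denominator reduce to their untransformed counterparts:

$$
\cos\theta = \frac{(\mathbf{Q}\mathbf{u})^\top(\mathbf{Q}\mathbf{v})}{\|\mathbf{Q}\mathbf{u}\|\,\|\mathbf{Q}\mathbf{v}\|} = \frac{\mathbf{u}^\top\mathbf{Q}^\top\mathbf{Q}\,\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|} = \frac{\mathbf{u}^\top\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}.
$$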

As an example of the angle-preserving property of orthonormal matrices, the Python code block below (lines 137–150) shows that the angle between the original standard basis vectors, $[1 \ 0]^\top$ and $[0 \ 1]^\top$, and the angle between their orthonormal-transformed versions, $[\tfrac{\sqrt{2}}{2} \ \ -\tfrac{\sqrt{2}}{2}]^\top$ and $[\tfrac{\sqrt{2}}{2} \ \ \tfrac{\sqrt{2}}{2}]^\top$, are identical.

# original standard basis
b_ex = np.array([1, 0])
b_ey = np.array([0, 1])

# orthonormal basis
Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
[-np.sqrt(2)/2, np.sqrt(2)/2]])

# angle between original basis vectors
original_angle = np.arccos(b_ex.dot(b_ey)/(np.linalg.norm(b_ex) * np.linalg.norm(b_ey)))
transformed_angle = np.arccos(b_ex.T.dot(Q.T).dot(Q).dot(b_ey)/(np.linalg.norm(Q.dot(b_ex)) * np.linalg.norm(Q.dot(b_ey))))

print(math.degrees(original_angle), "\n",
math.degrees(transformed_angle))
90.0
90.0

Inverse Matrices Un-Transform Basis Vectors

Animation 7 shows that matrix inverses transform vector spaces in the opposite direction and magnitude of their non-inverted counterparts. In this way, matrix inverses can be conceptualized as un-transforming vector spaces. Within Animation 7, the space is first transformed using the orthonormal matrix of

The vector space is then un-transformed and brought back to the standard basis by applying the inverse of . By applying , the basis vectors are rotated back to become the standard basis vectors, and , and the transformed vector of returns back to its original standard basis coordinates of .

Animation  7
Geometry of Inverse Matrix Multiplication
Note. The example in this animation shows the geometry of pre-multiplying a vector, , by the orthonormal matrix , and then pre-multiplying the transformed vector, , by . By applying , the basis vectors are rotated back to become the standard basis vectors, and , and the transformed vector of returns back to its original standard basis coordinates of .
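A small numerical check of the un-transforming idea: for the orthonormal matrix used in the length and angle examples above, the inverse equals the transpose, and applying it after the transformation returns the original vector. The example vector is a hypothetical value.

import numpy as np

# orthonormal (rotation) matrix
Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
              [-np.sqrt(2)/2, np.sqrt(2)/2]])

# for orthonormal matrices, the inverse is simply the transpose
assert np.allclose(np.linalg.inv(Q), Q.T)

# transforming and then un-transforming recovers the original vector
v = np.array([2.0, 1.0])                             # hypothetical vector
v_transformed = Q.dot(v)                             # rotate
v_recovered = np.linalg.inv(Q).dot(v_transformed)    # un-rotate

assert np.allclose(v, v_recovered)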

Rectangular Matrices Change Dimension Space

So far, I have only shown linear transformations that result from applying square matrices. When multiplying spaces by square matrices, the input and output dimension spaces are the same. When multiplying by rectangular matrices, however, the dimensions of the input and output spaces are different; that is, rectangular matrices change the dimension space. As an example, Animation 8 below shows the transformation that results from applying the rectangular matrix of

The matrix takes an input space of two dimensions and returns values in an output space of one dimension. Thus, two-dimensional vectors become scalars (i.e., one-dimensional). In this case, the first basis vector, , is transformed to become a scalar value of , and, likewise, the second basis vector, , becomes . As a result, the vector of becomes .

Animation  8
Geometry of Rectangular Matrix Multiplication
Note. The example in this animation shows the geometry of pre-multiplying a vector, , by the rectangular matrix . In this case, the first basis vector, , is transformed to become a scalar (i.e., one-dimensional) value of , and, likewise, the second basis vector, , becomes . As a result, the vector of becomes .

Importantly, because the input and output spaces have different dimensions, the input and output vectors live in different spaces, and so no input vector can lie in the span of any output vector (and vice versa).
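The dimension change can be seen directly in code. The 1 × 2 matrix below is a hypothetical example that maps two-dimensional vectors to scalars, so the outputs cannot live in the same space as the inputs.

import numpy as np

# hypothetical 1x2 rectangular matrix: maps R^2 to R^1
R = np.array([[2.0, 1.0]])

e_x = np.array([1.0, 0.0])
e_y = np.array([0.0, 1.0])
v = np.array([1.0, 2.0])

print(R.dot(e_x))  # [2.]  -> first basis vector becomes a scalar
print(R.dot(e_y))  # [1.]  -> second basis vector becomes a scalar
print(R.dot(v))    # [4.]  -> 2*1 + 1*2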

Eigenvectors (and Eigenvalues)

Due to the inherent connection between eigenvectors and singular value decomposition, it is necessary to provide an overview of the geometry of eigenvectors. That being said, although it is useful to understand the geometry of eigenvectors, their importance in singular value decomposition comes more from their meaning (which will be explained in later sections) and less so from their geometry. Nonetheless, I briefly present the geometry of eigenvectors and eigenvalues.

Animation 9 below shows that eigenvectors remain on their span following some linear transformation and eigenvalues represent the extent to which the eigenvectors are stretched. The geometry of eigenvectors and eigenvalues is summarized in Equation below

In words, if some eigenvector, , is pre-multiplied by a (square) matrix, , then the resulting vector is simply a scalar multiple of , where represents the scalar value. As an example from Animation 9 shown below, the eigenvector remains on its span following the pre-multiplication

because the resulting vector, , is simply twice the value of . In Animation 9, the light blue lines represent the eigenbases, and each eigenvector remains on its basis following a linear transformation by .

Animation  9
Geometry of Eigenvectors and Eigenvalues
Note. Eigenvectors of the matrix are shown to remain on their span (light blue lines) following a linear transformation applied by . The first eigenvector, , is multiplied by two, , and the second eigenvector, , is multiplied by one, , following a linear transformation by .

One last important point to note is that only square matrices can have eigenvectors because non-square (i.e., rectangular) matrices change the dimension space (see section on rectangular matrices). Because rectangular matrices change the dimension space, it is impossible for any vector to remain on its span.
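As a numerical illustration of the eigenvector equation, the sketch below confirms that pre-multiplying an eigenvector by its (square) matrix only rescales it by the corresponding eigenvalue. The matrix is a hypothetical example, not the matrix from Animation 9.

import numpy as np

# hypothetical symmetric matrix (not the matrix from Animation 9)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigenvalues and eigenvectors (eigenvectors are the columns of eig_vecs)
eig_vals, eig_vecs = np.linalg.eig(A)

# A v = lambda v for every eigenpair
for lam, v in zip(eig_vals, eig_vecs.T):
    assert np.allclose(A.dot(v), lam * v)

print(eig_vals)  # eigenvalues of this matrix are 3 and 1 (order not guaranteed by numpy)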

The Singular Value Decomposition and its Visualization

Having covered the necessary fundamentals, I will introduce the singular value decomposition and its geometry. In short, any matrix ( can also be complex) can be decomposed such that

where each of the three matrices in the decomposition has the following characteristics:

  1. : left singular vectors of , which are the eigenvectors of . Because symmetric matrices have full sets of orthonormal eigenvectors (see Appendix A3), is orthonormal.
  2. : rectangular matrix with singular values along its diagonal. Singular values are equivalent to the square roots of the eigenvalues of (or equivalently of ) divided by (see Equation ).
  3. : right singular vectors of , which are the eigenvectors of . Because symmetric matrices have full sets of orthonormal eigenvectors (see Appendix A3), is orthonormal.

Applying what I presented previously in the fundamentals of linear algebra, the singular value decomposition implies that the linear transformation applied by any matrix can be broken down into three constituent transformations in the following order:

  1. Rotation: is an orthonormal matrix and so rotates basis vectors (see section on orthonormal matrices). The astute reader will notice that the transpose of an orthogonal matrix is equivalent to its inverse, so is more technically an un-rotation of basis vectors.
  2. Stretching with possible dimension change: because only has nonzero values along its diagonal, these values will stretch basis vectors (see section on diagonal matrices). The dimension space can also change following a transformation by . For example, if the number of rows, , is less than the number of columns , then the remaining columns of will contain zeroes that will remove the remaining dimensions of .
  3. Rotation: is an orthonormal matrix and thus rotates basis vectors (see section on orthonormal matrices).

Animation 10 below provides a geometric visualization of the transformation applied by each matrix of the singular value decomposition. First, it is shown that transforms the standard basis vectors such that becomes , becomes , and becomes . The second animation then shows that each standard basis vector lands on its coordinates after three transformations in the order of a rotation, stretching (with a dimension reduction), and a rotation. First, the (transposed) matrix of right singular vectors is orthonormal and so applies a rotation. Second, the matrix of singular values only has nonzero values along its diagonal and so applies a stretching of basis vectors along with a reduction in dimension (from to ). Third, and last, the matrix of left singular vectors applies a rotation of the basis vectors.

Animation  10
Visualization of the Three Transformations That Constitute the Singular Value Decomposition
Note. An animation of each matrix of the singular value decomposition of . First, it is shown that transforms the standard basis vectors such that becomes , becomes , and becomes . The second animation then shows that each standard basis vector lands on its coordinates after three transformations in the order of a rotation, stretching (with possible dimension change), and a rotation. First, the (transposed) matrix of right singular vectors is orthonormal and so applies a rotation. Second, the matrix of singular values only has nonzero values along its diagonal and so applies a stretching of basis vectors along with a reduction in dimension (from to ). Third, and last, the matrix of left singular vectors applies a rotation of the basis vectors.
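To connect the three matrices back to the characteristics listed earlier, the sketch below uses a hypothetical matrix and confirms that numpy's SVD reconstructs the original matrix, that the singular vector matrices are orthonormal, that the squared singular values are the eigenvalues of $\mathbf{A}\mathbf{A}^\top$, and that the right singular vectors are eigenvectors of $\mathbf{A}^\top\mathbf{A}$.

import numpy as np

# hypothetical 2x3 matrix
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# 1) the decomposition reconstructs A
assert np.allclose(U.dot(np.diag(s)).dot(Vt), A)

# 2) U and V have orthonormal columns
assert np.allclose(U.T.dot(U), np.eye(2))
assert np.allclose(Vt.dot(Vt.T), np.eye(2))

# 3) squared singular values are the eigenvalues of A A^T
eig_vals = np.linalg.eigvalsh(A.dot(A.T))[::-1]  # sort descending to match s
assert np.allclose(s**2, eig_vals)

# 4) right singular vectors are eigenvectors of A^T A
for sv, v in zip(s, Vt):
    assert np.allclose(A.T.dot(A).dot(v), (sv**2) * v)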

Proving the Singular Value Decomposition

With a geometric understanding of the singular value decomposition, I now provide a proof. Consider the symmetric matrix produced by . Because it is symmetric, it has a full set of orthonormal eigenvectors (see Appendix A3) and so can be re-expressed as

Because is a diagonal matrix (of eigenvalues), it can be square-rooted to obtain the singular values (and divided by ; see Equation ).

Equation can then replace in Equation to obtain

Although it may not look it, the singular value decomposition is proven once we consider two truths. First, (or conversely ) and (or conversely ) are both positive semi-definite matrices (see Appendix B). Second, because the matrix product of each positive semi-definite matrix is equivalent, the condition of unitary freedom is satisfied, and so there must exist some orthonormal matrix, , that can be used to translate between the basis vectors of each matrix (see Appendix C). Mathematically,
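Putting the pieces of the argument together in one place (a compact restatement of the chain above, writing $\boldsymbol{\Lambda}$ for the diagonal matrix of eigenvalues of $\mathbf{A}^\top\mathbf{A}$ and $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^{1/2}$ for the matrix of singular values):

$$
\mathbf{A}^\top\mathbf{A} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top = \mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top = (\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\boldsymbol{\Sigma}\mathbf{V}^\top),
$$

so $\mathbf{A}$ and $\boldsymbol{\Sigma}\mathbf{V}^\top$ have the same Gram matrix. By the unitary-freedom argument (Appendix C), there must then exist an orthonormal matrix $\mathbf{U}$ with $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, which is the singular value decomposition.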

Understanding the Singular Value Decomposition by Synthesizing Principal Component Analysis

In this section, I will provide a deeper explanation of the singular value decomposition by synthesizing the prerequisites for running a principal component analysis. Briefly, principal component analysis provides a method for mapping some data set of n variables onto some lower-dimensional subspace consisting of only m dimensions. To this end, three objects must be computed to run a principal component analysis:

  1. The number of principal axes: the number of (lower) m dimensions that the original n variables can be mapped to.
  2. Loadings: correlations between original variables and principal axes. By computing loadings, the meaning of the principal axes can be determined.
  3. Principal component scores: scores of respondents on the principal axes.

Using the three matrices from the singular value decomposition, I will explain how to compute each of the above objects.

Two points deserve mention. First, a varimax rotation will be applied to the loadings to render the results more interpretable, thus facilitating an understanding of the singular value decomposition. Although using varimax rotation adds some complexity to the computation of the loadings and principal component scores, this complexity will be explained in the sections that follow. Second, to facilitate learning in this section, I will apply the formulas to a generated data set of wine drinkers.

A Guiding Example: Wine Drinkers

Briefly, a data set of 10 wine drinkers’ ratings of four wines will be used to explain the meaning of the singular value decomposition. Wine ratings (on a 0–100-point scale) are provided for four wines: 1) cabernet sauvignon (CS), 2) merlot (M), 3) rosé (R), and 4) champagne (Chp).

Importantly, for understanding the relation between singular value decomposition and principal component analysis, I generated the wine ratings such that there are two dimensions of wine drinking:

  1. Dinner Wine Drinking: prefers drinking red wines such as cabernet sauvignon (CS) and merlot (M).
  2. Celebratory Wine Drinking: prefers drinking champagne (Chp) and rosé (R).

To foreshadow, because of how I generated the data, there will be two principal axes that account for the majority of the variance.

The R code block below (lines 153–204) contains the code I used to generate the wine drinker data.

library(tidyverse)
library(Matrix)

# set means of each variable for dinner and celebratory wine drinkers (dwd, cwd)
means_dwd = c(81, 82, 68, 69)
means_cwd = c(63, 61, 83, 83)

sd = rep(x = 3, times=4)

# create covariance matrix
cor_var_12 <- .75
cor_var_34 <- .75

cor_var_13 <- 0
cor_var_14 <- cor_var_13

cor_var_23 <- cor_var_13
cor_var_24 <- cor_var_13

# covariance for dinner wine drinkers
cov <- diag(sd^2)
cov[2, 1] <- cor_var_12 * sd[1]^2

cov[3, 1] <- cor_var_13 * sd[1]^2
cov[3, 2] <- cor_var_23 * sd[1]^2

cov[4, 1] <- cor_var_14 * sd[1]^2
cov[4, 2] <- cor_var_24 * sd[1]^2
cov[4, 3] <- cor_var_34 * sd[1]^2


# covariance for celebratory wine drinkers
cov_cwd <- diag(sd^2)
cov_cwd[4, 3] <- cor_var_12 * sd[1]^2

cov_cwd[3, 1] <- cor_var_13 * sd[1]^2
cov_cwd[3, 2] <- cor_var_23 * sd[1]^2

cov_cwd[4, 1] <- cor_var_14 * sd[1]^2
cov_cwd[4, 2] <- cor_var_24 * sd[1]^2
cov_cwd[2, 1] <- cor_var_34 * sd[1]^2

cov_cwd[upper.tri(cov_cwd)] <- t(cov_cwd)[upper.tri(cov_cwd)]

# Number of samples
n <- 5

set.seed(27)
df_dwd = round(MASS::mvrnorm(n = n, mu = means_dwd, Sigma = cov, empirical = T) + rnorm(n = n, mean = 0, sd = 10))
df_cwd = round(MASS::mvrnorm(n = n, mu = means_cwd, Sigma = cov_cwd, empirical = T) + rnorm(n = n, mean = 0, sd = 10))

df_wine_drinkers <- rbind(df_dwd, df_cwd)

Determining the Number of Principal Axes

The number of principal components to extract can be determined by using the matrix of singular values, . Two points must be explained to understand this. First, singular values can be converted to eigenvalues. Second, eigenvalues represent the amount of variance (in terms of original variables) accounted for by an eigenvector (also called a principal axis).

Beginning with the first point, singular values can be converted to eigenvalues such that

Ending with the second point, eigenvalues represent the amount of variance (in terms of the number of original variables) accounted for by eigenvectors. Consider a mean-centered matrix with a covariance matrix

and eigenvectors . Importantly, to compute the variance accounted for by a given eigenvector, , the projections of the data onto the eigenvector are first needed. The projected values can be obtained using the dot product (see dot product)

with the variance of the projected values then being

Because is an eigenvector of the covariance matrix, , the eigenvector equation (Equation ) can be leveraged to simplify Equation above and prove that an eigenvalue represents the amount of total variance accounted for by an eigenvector. Note that, because is a unit vector, ,
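One way to write this chain explicitly (using $\mathbf{X}$ for the mean-centered data, $\mathbf{C}$ for its covariance matrix, and $\mathbf{v}$ for a unit-length eigenvector with eigenvalue $\lambda$):

$$
\operatorname{Var}(\mathbf{X}\mathbf{v}) = \frac{(\mathbf{X}\mathbf{v})^\top(\mathbf{X}\mathbf{v})}{n-1} = \mathbf{v}^\top\frac{\mathbf{X}^\top\mathbf{X}}{n-1}\mathbf{v} = \mathbf{v}^\top\mathbf{C}\mathbf{v} = \mathbf{v}^\top(\lambda\mathbf{v}) = \lambda\,\mathbf{v}^\top\mathbf{v} = \lambda .
$$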

Using the singular value matrix, then, the number of principal components to retain can be determined. First, eigenvalues are indeed obtained using Equation (see Python code below in lines 205–224).

# Matrix of wine ratings
A = pd.DataFrame({"Cab. Sauv.": [73, 73, 84, 80, 47, 54, 46, 63, 70, 59],
"Merlot": [76, 71, 88, 80, 49, 49, 46, 61, 67, 58],
"Champagne": [58, 55, 71, 69, 40, 69, 63, 89, 90, 79],
"Rosé": [61, 55, 71, 73, 39, 69, 66, 89, 90, 76]})

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)

# Step 2) Compute eigenvalues of correlation matrix
df_corr = A.corr() # or use df_std.T.dot(df_std)/9
eig_vals = np.linalg.eigvals(df_corr)

# Step 3) Convert singular values to eigenvalues and confirm all values are identical
eig_converted = S**2/(df_std.shape[0] - 1)

np.testing.assert_array_equal(eig_vals.round(4), eig_converted.round(4))

Second, eigenvalues can be used to determine the number of principal axes to retain. Although several rules exist for choosing the number of principal axes to retain (for a review, see Auerswald & Moshagen, 2019), the decision in this example is simple because the data were generated such that only two meaningful principal axes exist (i.e., dinner and celebratory wine drinkers). An examination of the eigenvalues confirms this: the first two principal axes account for 99.5% of the total variance (or 3.98 out of 4 possible variables; see line 225).

(eig_converted.cumsum()/eig_converted.sum()).round(3)
array([0.611, 0.995, 0.998, 1. ])

Computing Loadings and Applying Varimax Rotation

Now that the number of principal axes has been decided, I will show how to determine their meaning. Because loadings are not always guaranteed to be meaningful (due to the problem of rotational indeterminacy), I show how rotations can be used to render principal axes more interpretable. As a result, this section consists of three parts:

  1. Computing loadings
  2. Rotational indeterminacy
  3. Applying varimax rotation

Computing Loadings

Understanding how to compute loadings simply requires understanding that loadings are correlations between the original variables and the principal axes. Thus, loadings can be computed by obtaining correlations between scores on the original variables and scores on the principal axes. Mathematically, I’ll show that this is equivalent to computing

where is the matrix of eigenvectors, is the diagonal matrix of corresponding eigenvalues, contains the standardized wine ratings, and contains the standardized scores on the principal axes (i.e., standardized principal component scores).

To prove Equation , I will first prove the following two equations:

Once the above equations have been proven, Equation will be a simple corollary.

First, Equation is proven below by using the equation for the correlation (or loadings; ) between one wine’s ratings, , and their corresponding scores on one principal axis, ,

Note that, because and are standardized, their lengths are equivalent to (see Appendix D). To finish this first proof, Equation can be transformed into its matrix form where the matrix of loadings, , contains loadings of each variable of wine scores onto each principal axis

Second, I now show how to compute the matrix of standardized principal component scores, . Conceptually, principal component scores represent projections of the original variables onto the principal axes. Given that projections are simply obtained by computing dot products (see dot products), the unstandardized principal component scores can be obtained with

To standardize the principal component scores, they must be divided by their standard deviations, . Fortunately, computing is rather simple because the unstandardized principal component scores are already mean centered (this follows because the columns of the standardized data matrix are themselves mean centered). Therefore, no centering is required to compute the standard deviations, and so

Proving Equation now simply requires using Equation to standardize the principal component scores (Equation ).

Having proven Equations , proving Equation becomes a simple exercise of substituting for and using the singular value decomposition of .
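Gathering the pieces of the proof in one place (a compact restatement, using $\mathbf{X}$ for the standardized data, $\mathbf{V}$ and $\boldsymbol{\Sigma}$ from its singular value decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, and $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}/(n-1)$ for the eigenvalues of the correlation matrix $\mathbf{C} = \mathbf{X}^\top\mathbf{X}/(n-1)$):

$$
\mathbf{L} \;=\; \frac{\mathbf{X}^\top \mathbf{Z}_{std}}{n-1}
\;=\; \frac{\mathbf{X}^\top \mathbf{X}\,\mathbf{V}\boldsymbol{\Lambda}^{-1/2}}{n-1}
\;=\; \mathbf{C}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}
\;=\; \mathbf{V}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2}
\;=\; \mathbf{V}\boldsymbol{\Lambda}^{1/2},
$$

where $\mathbf{Z}_{std} = \mathbf{X}\mathbf{V}\boldsymbol{\Lambda}^{-1/2}$ contains the standardized principal component scores and the second-to-last step uses the eigenvector equation $\mathbf{C}\mathbf{V} = \mathbf{V}\boldsymbol{\Lambda}$.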

Using the data set of wine drinkers, I now apply the above equations to compute the loadings. To preface (and as mentioned in the description of the data), the principal axes were generated such that one measures dinner wine drinking (high loadings of cabernet sauvignon and merlot) and the other one measures celebratory wine drinking (high loadings of champagne and rosé). Using the wine rating data, I now provide Python code (lines 227–260 below) to compute the loadings using Equation .

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Compute eigenvalues of correlation matrix
df_corr = A.corr() # or use df_std.T.dot(df_std)/9
eig_vals = np.linalg.eigvals(df_corr)

# Step 3) Convert singular values to eigenvalues
num_obs = df_std.shape[0]
eig_converted = S**2/(num_obs - 1)

# Step 4) Run PCA and compute loadings using VE^0.5
pca = decomposition.PCA(n_components=2)
pca_scores = pca.fit_transform(df_std) # needed to obtain components
pca_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Step 5) Use SVD matrices to compute loadings and verify both computations are equivalent
svd_loadings_1 = df_std.T.dot(U[ : ,:2]) * np.sqrt(num_obs - 1)/(num_obs - 1)
svd_loadings_2 = V[ : ,:2].dot(np.diag(np.sqrt(eig_converted[0:2])))


# Step 6) Show that all computations of the loadings are equivalent
# round each array to 4 decimal places
pca_loadings = np.round(pca_loadings, 4) *-1. # NOTE: multiply by -1 to ensure equivalence
svd_loadings_1 = np.round(svd_loadings_1, 4)
svd_loadings_2 = np.round(svd_loadings_2, 4)

assert (np.array_equal(pca_loadings, svd_loadings_1) and
np.array_equal(pca_loadings, svd_loadings_2)), """
Loading computations are different."""

Rotational Indeterminacy

Although loadings can be computed using the matrices from the singular value decomposition (which I will call the unrotated loadings), they aren’t guaranteed to be meaningful. For instance, consider the set of loadings computed in the previous section and shown below in Table 1 and their alignment with how the data were generated. To recall, I generated the wine rating data such that two dimensions of wine drinking existed:

  1. Dinner Wine Drinking: prefers drinking red wines such as cabernet sauvignon (CS) and merlot (M).
  2. Celebratory Wine Drinking: prefers drinking champagne (Chp) and rosé (R).

Given how the wine ratings were generated, there should be two general patterns of loadings. In one pattern, cabernet sauvignon and merlot should have high loadings on one principal axis and low loadings on the other axis. The second pattern is simply the opposite of the first: champagne and rosé should have high loadings on the other principal axis and low loadings on the axis on which cabernet sauvignon and merlot load highly. In looking at Table 1, the loadings do not remotely reflect the expected patterns. All the wines have high, negative loadings on the first principal axis and moderate loadings on the second axis. Therefore, in the current example, the unrotated loadings bear no resemblance to the expected pattern and are difficult to interpret (the R and Python code below in lines 261–272 produce Table 1).

# matrix of unrotated loadings
loadings = pd.DataFrame(data=pca_loadings,
index=[df_std.columns],
columns=['Axis 1', 'Axis 2']).reset_index().round(2)
loadings.rename(columns={'level_0': 'Wine'}, inplace=True)
kbl(py$loadings, booktabs = TRUE, format = 'html',
align = c('l', 'c', 'c'),
caption = '(Unrotated) Loadings of Each Wine on Each Principal Axis',
escape = F,
table.attr = "style='width:300px;'") %>%
kable_styling(position = 'center') %>%
column_spec(column = c(2,3), width = "75px")
Table 1
(Unrotated) Loadings of Each Wine on Each Principal Axis
Wine          Axis 1    Axis 2
Cab. Sauv.    -0.81     -0.58
Merlot        -0.74     -0.67
Champagne     -0.77      0.63
Rosé          -0.80      0.60

To understand why loadings are not always meaningful, it is important to understand rotational indeterminacy. Mathematically speaking, an infinite number of loading sets exist that all account for the same amount of cumulative variance. Thus, although the loadings in Table 1 were obtained with closed-form solutions, there is nothing unique about them. The non-uniqueness of the above loadings can be better understood by plotting them. Figure 2 shows the unrotated loadings plotted onto each principal axis. Rotational indeterminacy becomes clear in understanding that the x- and y-axes in Figure 2 can be rotated in an infinite number of ways that each account for the same amount of cumulative variance (i.e., the eigenvalue sums will be identical).

Figure 2
Unrotated Loadings of Wines Onto Principal Axes
Note. Unrotated loadings for cabernet sauvignon, champagne, merlot, and rosé are shown on two principal axes. In this case, it is difficult to assign any meaning to the principal axes. The Python code in lines 273-291 produced this plot.
# add colors
loadings['color'] = ['#730534', '#A40448', '#FB8EBD', '#FFFC27']
color_dict = dict(zip(loadings['Wine'], loadings['color']))

plot_loadings = (pt.ggplot(data=loadings, mapping=pt.aes(x='Axis 1', y='Axis 2', group='Wine', color='Wine')) +
pt.geom_point(size=2) +
pt.scale_color_manual(values=color_dict) +
pt.scale_x_continuous(limits=(-1, 1), expand=(0, 0)) + # Set x-axis limits
pt.scale_y_continuous(limits=(-1, 1), expand=(0, 0)) + # Set y-axis limits
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))

#save as .png
plot_loadings.save("images/loading_plot.png", width=8, height=6, dpi=1000)

To show rotational indeterminacy, I apply four rotations to the unrotated loadings obtained with the wine rating data (see Figure 2/Table 1) and show that the amount of variance accounted for by each rotation is equivalent (see Python code block below in lines 292–309).

# compute four different rotated loadings
varimax = Rotator(method="varimax")
oblimax = Rotator(method="oblimax")
quartimax = Rotator(method="quartimax")
equamax = Rotator(method="equamax")

varimax_loadings = varimax.fit_transform(pca_loadings)
oblimax_loadings = oblimax.fit_transform(pca_loadings)
quartimax_loadings = quartimax.fit_transform(pca_loadings)
equamax_loadings = equamax.fit_transform(pca_loadings)

# assert each set of loadings accounts for same amount of variance
assert len(set([
np.sum(pca_loadings**2).round(4),
np.sum(varimax_loadings**2).round(4),
np.sum(oblimax_loadings**2).round(4),
np.sum(quartimax_loadings**2).round(4),
np.sum(equamax_loadings**2).round(4)])) == 1, """The rotated loadings are not all equivalent"""

Applying Varimax Rotation

Having computed unrotated loadings and explained rotational indeterminacy, I now apply a rotation to the loadings. As an aside, given the inevitability of rotational indeterminacy, one immediate question centers around how to handle this issue. Unfortunately, there is no simple solution, and readers interested in a historical discussion of this issue can consult chapters 10–11 of Mulaik (2010). In the current case, I apply varimax rotation because it maximizes the variance of the squared loadings on each principal axis, driving each variable toward a high loading on one axis and low loadings on the others (varimax is also one of the more common rotations), and so allows a meaningful interpretation of the principal axes. In looking at Table 2/Figure 3, the varimax-rotated loadings reflect the structure with which the data were generated: The wines of cabernet sauvignon and merlot load highly onto what I now call the dinner wine drinking principal axis and the wines of rosé and champagne load onto what I now call the celebratory wine drinking principal axis (the R and Python code below in lines 310–333 produce Table 2).
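For reference, the raw varimax criterion can be written as maximizing, over orthogonal rotations, the sum across the $m$ retained axes of the variance of the squared loadings (with $p$ variables and loadings $l_{ij}$); this is my own statement of the standard criterion rather than a formula defined earlier in this paper:

$$
V = \sum_{j=1}^{m}\left[\frac{1}{p}\sum_{i=1}^{p} l_{ij}^{4} \;-\; \left(\frac{1}{p}\sum_{i=1}^{p} l_{ij}^{2}\right)^{2}\right].
$$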

# compute four different rotated loadings
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(pca_loadings)

# matrix of unrotated loadings
df_varimax = pd.DataFrame(data=varimax_loadings,
index=[df_std.columns],
columns=['Celebratory', 'Dinner']).reset_index().round(2)

# NOTE: loadings are multiplied by -1 since there is nothing mathematically incorrect with this (see
# discussion on rotational indeterminacy) and because this multiplication provides loadings that
# make greater conceptual sense
cols = ['Celebratory', 'Dinner']
df_varimax[cols] = df_varimax[cols] * -1

df_varimax.rename(columns={'level_0': 'Wine'}, inplace=True)
kbl(py$df_varimax, booktabs = TRUE, format = 'html',
align = c('l', 'c', 'c'),
caption = 'Varimax-Rotated Loadings of Each Wine on Each Principal Axis',
escape = F,
table.attr = "style='width:300px;'") %>%
kable_styling(position = 'center') %>%
column_spec(column = c(2,3), width = "75px") %>%
add_header_above(c(" " = 1, "Wine Drinking Preference" = 2))
Table 2
Varimax-Rotated Loadings of Each Wine on Each Principal Axis
              Wine Drinking Preference
Wine          Celebratory    Dinner
Cab. Sauv.       0.18         0.98
Merlot           0.08         0.99
Champagne        0.99         0.08
Rosé             0.99         0.12

Figure 3
Varimax-Rotated Loadings of Wines Onto Principal Axes
Note. Rotated loadings for cabernet sauvignon, champagne, merlot, and rosé are shown on two principal axes. In this case, meaning can easily be assigned to the principal axes. The first principal axis (x-axis) can be conceptualized as representing a *celebratory wine drinking* tendency as it has high loadings from the wines of champagne and rosé. The second principal axis (y-axis) can be conceptualized as a *dinner wine drinking* tendency as it has high loadings from the wines of cabernet sauvignon and merlot. The Python code block in lines 334-352 produces this plot.
# add colors
df_varimax['color'] = ['#730534', '#A40448', '#FB8EBD', '#FFFC27']
colors_dict = dict(zip(df_varimax['Wine'], df_varimax['color']))

plot_loadings = (pt.ggplot(data=df_varimax, mapping=pt.aes(x='Celebratory', y='Dinner', group='Wine', color='Wine')) +
pt.geom_point(size=2) +
pt.scale_color_manual(values=colors_dict) +
pt.scale_x_continuous(limits=(-1.02, 1.02), expand=(0, 0)) + # Set x-axis limits
pt.scale_y_continuous(limits=(-1.02, 1.02), expand=(0, 0)) + # Set y-axis limits
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))

#save as .png
plot_loadings.save("images/varimax_loading_plot.png", width=8, height=6, dpi=1000)

Computing Principal Component Scores with Rotated Loadings

Having obtained the rotated loadings, the associated rotated principal component scores can be computed. Importantly, before computing the rotated principal component scores, two points require explanation. First, to more explicitly explain a previous statement, principal component scores can simply be obtained by computing the dot products between the standardized data and the eigenvectors (which are normalized; see dot products).

To show that Equation does indeed provide the principal component scores, I provide the below code (see Python code in lines 353–370 below).

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Run PCA and compute PCA scores
pca = decomposition.PCA(n_components=2)
pca_scores = pca.fit_transform(df_std)

# Step 3) Use SVD matrices to compute PCA scores (multiply by -1 to ensure signs are consistent)
pca_scores_svd = df_std.dot(V[ : , :2])*-1

# assert that both computations are identical
differences = np.abs(pca_scores - pca_scores_svd)

assert (differences.round(5) == 0).all().all() == True, '''PCA scores are not equivalent.'''

Second, scaled (but unstandardized) principal components scores can be computed using the loadings such that

Using Equation , the standard deviations of the principal component scores for each principal axis now match the eigenvalues (with the unscaled scores, the variances match the eigenvalues; see Python code block below in lines 371–388).

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Use SVD matrices to compute PCA scores (multiply by -1 to ensure signs are consistent)
# and scaled PCA scores
pca_scores_svd = df_std.dot(V[ : , :2])*-1

# compute scaled PCA scores
loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
pca_scores_scaled = df_std.dot(loadings)


assert (pca_scores_svd.var().round(4) == pca_scores_scaled.std().round(4)).all() == True, \
"""Scaling of PCA scores is incorrect."""

Having covered some necessary prerequisites, the computation of rotated principal component scores can be explained. The crux of the issue with computing rotated principal component scores is that rotations such as varimax cause the loadings to lose their orthogonality. To prove this, I show that , where is a diagonal matrix, for the unrotated loadings but not for the rotated loadings (see Python code in lines 389–418 below).

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Compute unrotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))

# Step 3) Compute rotated loadings
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(unrotated_loadings)


# Step 4) Show that unrotated loadings are orthogonal but not rotated loadings
unrot_orth = unrotated_loadings.T.dot(unrotated_loadings)
rot_orth = varimax_loadings.T.dot(varimax_loadings)

# Create a diagonal matrix from the diagonal elements
unrotated_diagonal = np.diag(np.diag(unrot_orth))
rotated_diagonal = np.diag(np.diag(rot_orth))

# Check if the original matrix is equal to the diagonal matrix
is_diagonal = np.allclose(unrot_orth, unrotated_diagonal, atol=1e-5)
not_diagonal = np.allclose(rot_orth, rotated_diagonal, atol=1e-5)

# Assert that the matrix is diagonal
assert (is_diagonal == True and not_diagonal==False), """Check computation of
rotated and unrotated loadings."""

Because the rotated loadings are not orthogonal, using them to compute principal component scores (Equation ) would result in principal axes that do not account for variance in the amounts expected by the eigenvalues. Therefore, some form of correction matrix is required. As it turns out, a correction matrix, , can be computed.

The derivation of is actually quite simple. Because the goal is to compute principal component scores whose principal axis variabilities reflect those of the eigenvalues, we first need to compute the variance-covariance matrix of the rotated loadings and then apply an inversion to this matrix. Recall that matrix inversions un-transform basis vectors (see the section on matrix inverses), and so applying ensures that the principal component scores maintain the variance structure implied by the eigenvalues. Thus, Equation below is obtained.

Using Equation , the computation for the principal component scores can be modified to obtain the rotated principal component scores, , such that
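In matrix form (my own restatement, writing $\mathbf{L}_{rot}$ for the varimax-rotated loadings and matching the computation in the code block below):

$$
\mathbf{Q} = \mathbf{L}_{rot}^\top\mathbf{L}_{rot},
\qquad
\mathbf{F}_{rot} = \mathbf{X}_{std}\,\mathbf{L}_{rot}\,\mathbf{Q}^{-1},
$$

where $\mathbf{X}_{std}$ is the standardized data and $\mathbf{F}_{rot}$ contains the rotated principal component scores.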

To show that Equation is indeed correct, I provide the Python code below. Note that the scores obtained with Equation match the scores returned by the FactorAnalyzer functions up to a scaling factor (see Python code below in lines 419–439).

# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Compute rotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
varimax = Rotator(method="varimax")
varimax_loadings = varimax.fit_transform(unrotated_loadings)

# Step 3) Compute correction matrix Q
Q = (varimax_loadings.T.dot(varimax_loadings))
rotated_scores = df_std.dot(varimax_loadings).dot(np.linalg.inv(Q))

# Step 4) Show computation of PC scores is identical to using built-in function (barring a
# scaling factor)
varimax_fac = FactorAnalyzer(rotation='varimax', n_factors=2, method='principal')
varimax_fac.fit(df_std)
factor_scores = varimax_fac.transform(df_std)
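
# A column-by-column comparison of rotated_scores and factor_scores (e.g., via
# np.corrcoef) is expected to show agreement up to a per-axis scaling factor,
# as described in the text above.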

Like the loadings, the rotated principal component scores can also be plotted. For comprehensiveness, I show a biplot below in Figure 4 where loadings are shown as vectors and rotated principal component scores are shown as dots.

Figure 4
Principal Component Biplot
Note. Biplot shows rotated loadings and rotated principal component scores. The vectors represent the loadings of cabernet sauvignon, champagne, merlot, and rosé onto each of the principal axes and the dots represent each wine drinker's principal component score. The first principal axis (x-axis) represents *celebratory wine drinking* as it has high loadings with champagne and rosé. The second principal axis (y-axis) represents *dinner wine drinking* as it has high loadings with cabernet sauvignon and merlot. The Python code below in lines 440-490 produces this plot.
# standardize data
df_std = (A - A.mean())/A.std(ddof=1)

# Step 1) Compute SVD
U, S, V_t = np.linalg.svd(df_std)
V = V_t.T

# Step 2) Compute rotated loadings
unrotated_loadings = V[: , :2].dot(np.sqrt(np.diag((S[0:2]**2)/9)))
varimax = Rotator(method="varimax")

df_loadings = pd.DataFrame(varimax.fit_transform(unrotated_loadings)) * -1
df_loadings.rename(columns= {0: 'Dinner Wine Drinking',
1: 'Celebratory Wine Drinking'}, inplace=True)
df_loadings[['x_start', 'y_start']] = 0
df_loadings['Wine'] = ['Champagne', 'Rosé', 'Cab. Sauv.', 'Merlot']
df_loadings['color'] = ['#FB8EBD', '#FFFC27', '#730534', '#A40448']
color_dict = dict(zip(df_loadings['Wine'], df_loadings['color']))

# Step 3) Compute correction matrix Q
Q = (varimax_loadings.T.dot(varimax_loadings))
df_rot_scores = pd.DataFrame(df_std.dot(varimax_loadings).dot(np.linalg.inv(Q)))
df_rot_scores.rename(columns={0: 'Dinner Wine Drinking',
1: 'Celebratory Wine Drinking'}, inplace=True)

# create biplot
biplot = (pt.ggplot(data=df_rot_scores, mapping=pt.aes(x='Dinner Wine Drinking',
y='Celebratory Wine Drinking')) +
# add PCA scores
pt.geom_point() +
# add loadings
pt.geom_segment(data=df_loadings,
mapping=pt.aes(x='x_start', y='y_start',
xend='Dinner Wine Drinking', yend='Celebratory Wine Drinking',
color='Wine'),
arrow=pt.arrow(type="closed", length=0.12, angle=30)) +
pt.scale_color_manual(values=color_dict) +

# scale specifications
pt.geom_hline(yintercept=0, color="#002241") +
pt.geom_vline(xintercept=0, color="#002241") +
pt.scale_x_continuous(limits=(-2, 2), expand=(0, 0), minor_breaks=[]) + # Set x-axis limits
pt.scale_y_continuous(limits=(-2, 2), expand=(0, 0), minor_breaks=[]) + # Set y-axis limits

pt.theme_classic(base_family = 'Helvetica', base_size = 14) +
pt.theme(text=pt.element_text(color="#002241"),
axis_line=pt.element_blank(),
axis_ticks=pt.element_line(color="#002241"),
axis_text=pt.element_text(color="#002241")))

biplot.save("images/pca_biplot.png", dpi=500, width=8, height=6)

Conclusion

In summary, this whitepaper explicated the geometry of the singular value decomposition and its underlying meaning in relation to principal component analysis. Geometrically, the singular value decomposition shows that every matrix can be decomposed into a rotation, followed by a stretching and/or change of dimension, followed by another rotation. At a deeper level, the singular value decomposition provides eigenvectors that define principal axes onto which the original variables and scores can be projected to yield, respectively, loadings and principal component scores. Using rotations, the loadings and principal component scores can be transformed to have more interpretable values.

Acknowledgments

I’d like to thank David Stanley for providing assistance in generating the wine rating data and providing insights into principal component rotations.

References

Auerswald, M., & Moshagen, M. (2019). How to determine the number of factors to retain in exploratory factor analysis: A comparison of extraction methods under realistic conditions. Psychological Methods, 24(4), 468. https://doi.org/10.1037/met0000200
Mulaik, S. A. (2010). Foundations of factor analysis. CRC Press. Retrieved from bit.ly/40oGrFH

Appendix A: Proof of Spectral Theorem

The spectral theorem has three propositions that I prove in turn:

  1. Symmetric matrices must have at least one eigenvalue (and, therefore, at least one eigenvector).
  2. For any symmetric matrix $\mathbf{A}$, all of its eigenvalues are real.
  3. For any symmetric matrix $\mathbf{A}$, an orthonormal set of eigenvectors exists that spans the space of $\mathbf{A}$. In other words, a full set of orthogonal eigenvectors exists.

Appendix A1: Intuition For Why Symmetric Matrices Must Have at Least One Eigenvalue and Eigenvector

To understand why all symmetric matrices have at least one eigenvalue, it is important to first understand how eigenvalues are obtained. To derive an eigenvalue, the eigenvector equation of

$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$

can be re-expressed as shown below:

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$$

At this point, a brief discussion of determinants is apropos. Geometrically, determinants represent the extent to which a matrix changes the space spanned by a set of basis vectors (for an excellent video, see 3blue1brown). Importantly, if a matrix has a zero-value determinant, it applies a transformation that collapses the space spanned by a set of basis vectors to zero. In the current case, where only one vector, $\mathbf{v}$, is under consideration, if the matrix $\mathbf{A} - \lambda\mathbf{I}$ has a determinant of zero, then $\mathbf{v}$ will go to zero after pre-multiplication. Therefore, to solve the equation above, we need to find the eigenvalue, $\lambda$, that, when subtracted from the diagonal of $\mathbf{A}$, causes $\mathbf{A} - \lambda\mathbf{I}$ to have a zero-value determinant such that

$$\det(\mathbf{A} - \lambda\mathbf{I}) = 0$$

Fortunately, the fundamental theorem of algebra guarantees that any square matrix will have at least one eigenvalue (real or complex) if some constant is subtracted from the diagonal. Although outside the scope of this paper, a great explanation of this proof can be found in this video on the fundamental theorem of algebra (note that the video explains why at least one complex eigenvalue must exist).

To provide a basic intuition for why square matrices must have at least one eigenvalue (whether complex or real), it is important to realize that the determinant equation is simply a polynomial one and that polynomial equations always have at least one root (whether complex or real). For example, consider the 2x2 matrix

$$\mathbf{A} - \lambda\mathbf{I} = \begin{bmatrix} a - \lambda & b \\ c & d - \lambda \end{bmatrix}$$

The determinant of $\mathbf{A} - \lambda\mathbf{I}$ is then

$$\det(\mathbf{A} - \lambda\mathbf{I}) = (a - \lambda)(d - \lambda) - bc = \lambda^2 - (a + d)\lambda + (ad - bc)$$

and the eigenvalues are the roots of this quadratic,

$$\lambda = \frac{(a + d) \pm \sqrt{(a + d)^2 - 4(ad - bc)}}{2}$$

Although the determinant of a 2x2 matrix is a rather simple case, the logic generalizes to larger matrices and more complex polynomial equations. Because the existence of at least one eigenvalue is guaranteed for square matrices, there will be at least one eigenvector that accompanies that eigenvalue.
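As an optional numeric check, the snippet below uses an arbitrary 2x2 symmetric matrix (the values are purely illustrative) to show that the roots of the characteristic polynomial match the eigenvalues returned by NumPy and that subtracting an eigenvalue from the diagonal produces a zero-value determinant.

import numpy as np

# illustrative 2x2 symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# characteristic polynomial of a 2x2 matrix: lambda^2 - (a + d)*lambda + (ad - bc)
char_poly = [1.0, -np.trace(A), np.linalg.det(A)]

# the roots of the characteristic polynomial are the eigenvalues
print(np.sort(np.roots(char_poly)))    # ~[1.382 3.618]
print(np.sort(np.linalg.eigvals(A)))   # ~[1.382 3.618]

# subtracting an eigenvalue from the diagonal yields a zero-value determinant
lam = np.sort(np.linalg.eigvals(A))[0]
print(np.isclose(np.linalg.det(A - lam * np.eye(2)), 0))  # True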

Appendix A2: Proof That All Eigenvalues of Symmetric Matrices are Real

In the previous section, an argument was provided for why all square matrices must have at least one eigenvalue (and consequently at least one eigenvector). In this section, I will provide a proof that symmetric matrices only have real eigenvalues. As some necessary background, it is first important to have a basic understanding of complex numbers and complex conjugates. A complex number, $z$, can be conceptualized as the sum of a real component, $a$, and an imaginary component, $bi$, as shown below:

$$z = a + bi$$

Importantly, complex numbers have corresponding entities called complex conjugates, $\bar{z}$, that have real components of equivalent magnitude and imaginary components of equivalent magnitude in the opposite direction. Given the complex number defined above, its complex conjugate would be defined as

$$\bar{z} = a - bi$$

In circumstances where a complex number is equivalent to its conjugate, the imaginary component does not exist, and so the complex number and its conjugate are equivalent to the real number (see derivation below).

$$z = \bar{z} \;\Longrightarrow\; a + bi = a - bi \;\Longrightarrow\; 2bi = 0 \;\Longrightarrow\; b = 0 \;\Longrightarrow\; z = \bar{z} = a$$

With a basic understanding of complex numbers, I will now prove that the eigenvalues of any given symmetric matrix, $\mathbf{A}$, are real. That is, for any eigenvector $\mathbf{v}$, the complex and complex conjugate eigenvalues (i.e., $\lambda$ and $\bar{\lambda}$, respectively) are equivalent. To prove that the eigenvalues are real, I will derive complex and complex conjugate forms of an eigenvalue-based equation. Before doing so, it is important to first obtain the complex conjugate of the eigenvector equation, shown below:

$$\mathbf{A}\bar{\mathbf{v}} = \bar{\lambda}\bar{\mathbf{v}}$$

Note that the matrix $\mathbf{A}$ has no complex conjugate because it consists of only real numbers. Using the conjugated eigenvector equation, we can now derive complex and complex conjugate forms of an eigenvalue-based equation that are equivalent. As a reminder, $\mathbf{A}^\top = \mathbf{A}$ because $\mathbf{A}$ is symmetric.

$$\bar{\mathbf{v}}^\top\mathbf{A}\mathbf{v} = \bar{\mathbf{v}}^\top(\lambda\mathbf{v}) = \lambda\,\bar{\mathbf{v}}^\top\mathbf{v}$$

$$\bar{\mathbf{v}}^\top\mathbf{A}\mathbf{v} = (\mathbf{A}\bar{\mathbf{v}})^\top\mathbf{v} = (\bar{\lambda}\bar{\mathbf{v}})^\top\mathbf{v} = \bar{\lambda}\,\bar{\mathbf{v}}^\top\mathbf{v}$$

Given that the complex and complex conjugate equations above are equivalent (note that they are both equal to $\bar{\mathbf{v}}^\top\mathbf{A}\mathbf{v}$), it follows that $\lambda\,\bar{\mathbf{v}}^\top\mathbf{v} = \bar{\lambda}\,\bar{\mathbf{v}}^\top\mathbf{v}$. Because $\bar{\mathbf{v}}^\top\mathbf{v} > 0$ for any nonzero eigenvector, $\lambda = \bar{\lambda}$, and so any eigenvalue of $\mathbf{A}$ must be real.
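As an optional numeric check, the snippet below (using arbitrary matrices) shows that the eigenvalues of a symmetric matrix come out real, whereas a non-symmetric matrix, such as a 90-degree rotation, can have complex eigenvalues.

import numpy as np

rng = np.random.default_rng(1)

# symmetric matrix: eigenvalues are guaranteed to be real
B = rng.normal(size=(4, 4))
A_sym = B + B.T
print(np.all(np.isreal(np.linalg.eigvals(A_sym))))  # True

# non-symmetric matrices carry no such guarantee; a 90-degree rotation,
# for example, has a complex conjugate pair of eigenvalues
rotation = np.array([[0.0, -1.0],
                     [1.0,  0.0]])
print(np.linalg.eigvals(rotation))  # complex pair (i and -i)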

Appendix A3: Proof That Symmetric Matrices Have Full Sets of Orthonormal Eigenvectors

At this point, I have proven that symmetric matrices have real eigenvalues and that there must be at least one eigenvalue with a corresponding eigenvector. With this information, it can now be proven that symmetric matrices have full sets of orthonormal eigenvectors. That is, when applying the transformation of a symmetric matrix, there exists a set of orthonormal eigenvectors that span the entire space of the symmetric matrix.

To prove that symmetric matrices have full sets of orthogonal eigenvectors, I show that any symmetric matrix can be diagonalized by an orthonormal matrix. This proof applies the Gram-Schmidt procedure to a symmetric matrix to generate a set of orthonormal vectors that is then shown to diagonalize the symmetric matrix. Mathematically then, and central to the spectral theorem, any symmetric matrix $\mathbf{A}$ can be transformed into a diagonal matrix, $\mathbf{D}$, using a matrix of eigenvectors, $\mathbf{V}$, such that

$$\mathbf{V}^\top\mathbf{A}\mathbf{V} = \mathbf{D},$$

where $\mathbf{D}$ is a diagonal matrix of eigenvalues. Importantly, because $\mathbf{V}$ is obtained with the Gram-Schmidt process, the eigenvectors are orthonormal.

To prove the equation above for any symmetric matrix, I provide a two-step proof below. First, I apply the Gram-Schmidt procedure to some symmetric matrix $\mathbf{A}$ to obtain a full set of orthonormal vectors, $\mathbf{Q}$, and show that the first column of the transformed matrix $\mathbf{Q}^\top\mathbf{A}\mathbf{Q}$ can be diagonalized. Second, I show that the remaining columns can also be diagonalized, which yields a full set of orthonormal eigenvectors.

Beginning with the first step of the proof, a symmetric matrix, $\mathbf{A}$, can be orthonormalized by applying the Gram-Schmidt procedure to obtain a matrix $\mathbf{Q}$ (for an explanation of the Gram-Schmidt procedure, see this video) such that each column of $\mathbf{Q}$ is an orthonormal vector. Importantly, $\mathbf{Q}$ can be applied to $\mathbf{A}$ to obtain

$$\mathbf{B} = \mathbf{Q}^\top\mathbf{A}\mathbf{Q},$$

which is symmetric because

$$\mathbf{B}^\top = (\mathbf{Q}^\top\mathbf{A}\mathbf{Q})^\top = \mathbf{Q}^\top\mathbf{A}^\top\mathbf{Q} = \mathbf{Q}^\top\mathbf{A}\mathbf{Q} = \mathbf{B}$$

Additionally, and important for the first step of this proof, the first column of $\mathbf{B}$ can be diagonalized because the first column of $\mathbf{Q}$, $\mathbf{q}_1$, is assumed to be an eigenvector, of which there must be at least one for a symmetric matrix (for an explanation, see Appendix A1). Mathematically then, the first column of $\mathbf{B}$ follows a diagonal form (i.e., it only has one nonzero value) such that

$$\mathbf{B}\mathbf{e}_1 = \mathbf{Q}^\top\mathbf{A}\mathbf{q}_1 = \lambda_1\mathbf{Q}^\top\mathbf{q}_1 = \begin{bmatrix} \lambda_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

where $\mathbf{e}_1$ is the first standard basis vector and $\lambda_1$ is the eigenvalue that accompanies $\mathbf{q}_1$.

Because $\mathbf{B}$ is symmetric, the first row then has the same structure as the first column, and so $\mathbf{B}$ follows a diagonal structure in the first row and column such that

$$\mathbf{B} = \mathbf{Q}^\top\mathbf{A}\mathbf{Q} = \begin{bmatrix} \lambda_1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{C} \end{bmatrix},$$

where $\mathbf{0}$ is a vector of zeros and $\mathbf{C}$ represents the remainder of $\mathbf{B}$ (itself a smaller symmetric matrix).

Having shown that the first column of a symmetric matrix can be diagonalized, I now show in this second part of the proof that the remaining columns of a symmetric matrix can be diagonalized. In other words, there exists some orthonormal matrix $\mathbf{V}$ such that

$$\mathbf{V}^\top\mathbf{A}\mathbf{V} = \mathbf{D},$$

where $\mathbf{D}$ is a diagonal matrix. To prove the equation above, I first define

$$\mathbf{V} = \mathbf{Q}\begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W} \end{bmatrix},$$

where $\mathbf{W}$ is an orthonormal matrix that diagonalizes the smaller symmetric matrix $\mathbf{C}$ (the same argument applies to $\mathbf{C}$, so such a matrix exists by repeating the steps above).

As with $\mathbf{Q}$, the transpose of $\mathbf{V}$ is also equal to its inverse because $\mathbf{V}$ is orthogonal (a property inherited from $\mathbf{Q}$ and $\mathbf{W}$).

Now, the central component of the spectral theorem can be proven by showing that $\mathbf{V}^\top\mathbf{A}\mathbf{V}$ yields a diagonal matrix:

$$\mathbf{V}^\top\mathbf{A}\mathbf{V} = \begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W}^\top \end{bmatrix}\mathbf{Q}^\top\mathbf{A}\mathbf{Q}\begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W} \end{bmatrix} = \begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W}^\top \end{bmatrix}\begin{bmatrix} \lambda_1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{C} \end{bmatrix}\begin{bmatrix} 1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W} \end{bmatrix} = \begin{bmatrix} \lambda_1 & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{W}^\top\mathbf{C}\mathbf{W} \end{bmatrix},$$

where $\mathbf{W}^\top\mathbf{C}\mathbf{W}$ is diagonal because $\mathbf{W}$ diagonalizes $\mathbf{C}$.

Therefore, symmetric matrices can be fully diagonalized because they must have full sets of eigenvectors that are orthonormal.
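As an optional numeric illustration of this result, the snippet below builds an arbitrary symmetric matrix, obtains an orthonormal matrix of eigenvectors with NumPy's eigh routine, and confirms that this matrix diagonalizes the symmetric matrix.

import numpy as np

rng = np.random.default_rng(7)

# build an arbitrary symmetric matrix
B = rng.normal(size=(4, 4))
A_sym = (B + B.T) / 2

# eigh returns eigenvalues and an orthonormal matrix of eigenvectors
# for symmetric matrices
eigenvalues, V = np.linalg.eigh(A_sym)

# V is orthonormal: V.T @ V is (numerically) the identity
print(np.allclose(V.T @ V, np.eye(4)))       # True

# V.T @ A @ V diagonalizes the symmetric matrix
D = V.T @ A_sym @ V
print(np.allclose(D, np.diag(eigenvalues)))  # True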

Appendix B: $\mathbf{A}^\top\mathbf{A}$ is a Positive Semi-Definite Matrix

A positive semi-definite matrix, $\mathbf{M}$, is one whose product with any nonzero vector, $\mathbf{x}$, is greater than or equal to zero such that

$$\mathbf{x}^\top\mathbf{M}\mathbf{x} \geq 0$$

If we replace $\mathbf{M}$ with the symmetric matrix $\mathbf{A}^\top\mathbf{A}$, the condition above is met because the computation becomes a dot product of identical vectors (i.e., a squared 2-norm). In simpler terms,

$$\mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x})^\top(\mathbf{A}\mathbf{x}) = \|\mathbf{A}\mathbf{x}\|^2$$

because $(\mathbf{A}\mathbf{x})^\top$ and $\mathbf{A}\mathbf{x}$ result from identical vectors. Therefore,

$$\mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x} = \|\mathbf{A}\mathbf{x}\|^2 \geq 0$$
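As an optional numeric check (using an arbitrary data-style matrix), the snippet below verifies that the quadratic form of $\mathbf{A}^\top\mathbf{A}$ equals the squared 2-norm of $\mathbf{A}\mathbf{x}$ and therefore cannot be negative.

import numpy as np

rng = np.random.default_rng(3)

# arbitrary (non-square) data-style matrix and an arbitrary vector
A = rng.normal(size=(10, 4))
x = rng.normal(size=4)
AtA = A.T @ A

# the quadratic form equals the squared 2-norm of Ax, so it cannot be negative
quad_form = x @ AtA @ x
print(np.isclose(quad_form, np.linalg.norm(A @ x)**2))  # True
print(quad_form >= 0)                                   # True

# equivalently, the eigenvalues of A.T @ A are all nonnegative
print(np.all(np.linalg.eigvalsh(AtA) >= -1e-10))        # True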

Appendix C: Unitary Freedom of Positive Semi-Definite Matrices

In this section, I will first state the unitary freedom of positive semi-definite matrices and will then provide an intuitive explanation. In formal terms, this property states that, given two matrices $\mathbf{A}$ and $\mathbf{B}$ that produce the same positive semi-definite matrix,

$$\mathbf{A}^\top\mathbf{A} = \mathbf{B}^\top\mathbf{B},$$

then there must exist some orthogonal matrix $\mathbf{Q}$ that allows us to go from one matrix to the other. Mathematically then, we can define

$$\mathbf{B} = \mathbf{Q}\mathbf{A}$$

Although the math above is indeed true, it does not provide an intuitive understanding for why there must exist an orthogonal matrix that allows us to translate between the two matrices. Therefore, I now provide an intuitive explanation for this proof.

As discussed in the section on orthonormal matrices, these matrices simply rotate basis vectors, and so the vector lengths and the angles between them remain unchanged. Therefore, in looking at the condition above, the basis vectors of $\mathbf{A}$ and $\mathbf{B}$ must have identical lengths and angles.

Beginning with the lengths of the basis vectors of $\mathbf{A}$ and $\mathbf{B}$, the lengths are identical because the squared 2-norm of each basis vector appears on the diagonal of the shared matrix $\mathbf{A}^\top\mathbf{A} = \mathbf{B}^\top\mathbf{B}$ (e.g., $\mathbf{a}_i^\top\mathbf{a}_i = \mathbf{b}_i^\top\mathbf{b}_i$). Mathematically,

$$\|\mathbf{a}_i\| = \sqrt{\mathbf{a}_i^\top\mathbf{a}_i} = \sqrt{\mathbf{b}_i^\top\mathbf{b}_i} = \|\mathbf{b}_i\|$$

Ending with the relative angles between the basis vectors, consider any two basis vectors, $\mathbf{a}_i$ and $\mathbf{a}_j$, in $\mathbf{A}$ and the corresponding basis vectors, $\mathbf{b}_i$ and $\mathbf{b}_j$, in $\mathbf{B}$. If the two matrices satisfy the unitary freedom condition, then the dot product between any corresponding pair of basis vectors will be equivalent. Mathematically,

$$\mathbf{a}_i^\top\mathbf{a}_j = \mathbf{b}_i^\top\mathbf{b}_j$$

In looking at the formula for the angle between two vectors below

$$\cos(\theta) = \frac{\mathbf{a}_i^\top\mathbf{a}_j}{\|\mathbf{a}_i\|\,\|\mathbf{a}_j\|}$$

and applying the equality of vector lengths and dot products shown above, it becomes evident that the angle between any pair of corresponding basis vectors must be identical. Mathematically,

$$\cos(\theta_{\mathbf{a}_i,\mathbf{a}_j}) = \frac{\mathbf{a}_i^\top\mathbf{a}_j}{\|\mathbf{a}_i\|\,\|\mathbf{a}_j\|} = \frac{\mathbf{b}_i^\top\mathbf{b}_j}{\|\mathbf{b}_i\|\,\|\mathbf{b}_j\|} = \cos(\theta_{\mathbf{b}_i,\mathbf{b}_j})$$

To summarize, if two matrices satisfy the condition of unitary freedom (i.e., $\mathbf{A}^\top\mathbf{A} = \mathbf{B}^\top\mathbf{B}$), then their basis vectors will have identical lengths and the angles between any two corresponding basis vectors will be identical. This equivalence of lengths and angles means that there must exist some orthonormal matrix that can be used to translate between the two matrices.
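As an optional numeric illustration (constructing the second matrix directly from the first with an arbitrary orthogonal matrix), the snippet below shows that rotating a matrix leaves its Gram matrix, and therefore the lengths of and angles between its columns, unchanged.

import numpy as np

rng = np.random.default_rng(5)

# arbitrary matrix A and an orthogonal matrix Q from a QR decomposition
A = rng.normal(size=(4, 3))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
B = Q @ A

# rotating the columns of A leaves the Gram matrix untouched
print(np.allclose(A.T @ A, B.T @ B))                      # True

# column lengths match ...
print(np.allclose(np.linalg.norm(A, axis=0),
                  np.linalg.norm(B, axis=0)))             # True

# ... and so do the pairwise dot products (and hence the angles)
print(np.allclose(A[:, 0] @ A[:, 1], B[:, 0] @ B[:, 1]))  # True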

Appendix D: Proof That the Length of a Standardized Variable Equals $\sqrt{n-1}$

If $\mathbf{z}$ is a standardized variable with $n$ observations, then

$$\|\mathbf{z}\| = \sqrt{n - 1}$$

To prove the equation above, consider first the computation of standardizing a variable,

$$z_i = \frac{x_i - \bar{x}}{s_x},$$

where $s_x$ is the standard deviation of $\mathbf{x}$. Substituting the standardization formula into the computation of the length gives

$$\|\mathbf{z}\| = \sqrt{\sum_{i=1}^{n} z_i^2} = \sqrt{\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{s_x^2}}$$

Because the standard deviation is computed as

$$s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}},$$

the length of the standardized variable becomes

$$\|\mathbf{z}\| = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}} = \sqrt{n - 1}$$
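As an optional numeric check (using an arbitrary variable and the same sample standard deviation, ddof=1, used to standardize the wine-rating data), the snippet below confirms that a standardized variable has length $\sqrt{n - 1}$.

import numpy as np

rng = np.random.default_rng(11)

# arbitrary variable with n = 10 observations
x = rng.normal(loc=50, scale=8, size=10)

# standardize with the sample standard deviation (ddof=1)
z = (x - x.mean()) / x.std(ddof=1)

# the length of the standardized variable equals sqrt(n - 1)
print(np.isclose(np.linalg.norm(z), np.sqrt(len(x) - 1)))  # True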