Principal Component Analysis (PCA) implementation in C++

Note: This code is not meant for production use. It is neither as fast nor as numerically reliable as
numpy.linalg.eigh.

Usage

For a minimal example, see the following:

import pca

# The matrix to reduce.
# It has shape N × p, where:
# - N is the number of samples in the dataset
# - p is the number of features per sample
M = [
    [2, 1, -1],
    [-4, 2, 0],
    [9, 2, 1],
    [0, -1, 2]
]

# Target dimension
k = 2

res = pca.fit(M, k)

# The dataset M reduced to k dimensions
M_reduced = res["projected"]

# The basis onto which the data is projected and the mean of the original dataset.
# To project more points: (test - means) @ W
W = res["W"]
means = res["means"]
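
To project additional samples onto the same basis, center them with means and multiply by W, as the comment above suggests. A minimal sketch using NumPy (the new points below are made up for illustration):

import numpy as np

# Hypothetical new samples with the same number of features (p = 3) as M.
new_points = np.array([
    [1, 0, 2],
    [-3, 1, 1]
])

# Center with the training means, then project onto the PCA basis W.
new_points_reduced = (new_points - np.array(means)) @ np.array(W)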

The left image represents the generated data in 3D. You may notice that one axis does not carry much information. In this case, reducing the dimension to 2 is especially interesting because very little information is lost. The right image shows the same dataset after applying PCA.

How it works

If you want more detail on how PCA works and the implementation choices I made, check out How PCA works (PDF).
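
In short, the textbook procedure is: center the data, compute the covariance matrix of the features, take the eigenvectors associated with the k largest eigenvalues as the new basis, and project the centered data onto it. A rough NumPy illustration of that procedure (not the repository's actual C++ code):

import numpy as np

def pca_fit_sketch(X, k):
    X = np.asarray(X, dtype=float)

    # 1. Center the data.
    means = X.mean(axis=0)
    Xc = X - means

    # 2. Covariance matrix of the features (p x p).
    cov = np.cov(Xc, rowvar=False)

    # 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Keep the k eigenvectors with the largest eigenvalues as the basis W.
    W = eigvecs[:, ::-1][:, :k]

    # 5. Project the centered data onto the new basis.
    return {"projected": Xc @ W, "W": W, "means": means}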

Why is PCA used?

Using a Breast Cancer Diagnosis dataset, you can visualise the data in 2D, then project a new entry and calculate the average distance. You get:

You can clearly do better than a 7% error rate. Maybe the 3rd most important direction still contains a lot of information. Let's check that out:

The error rate remains constant. Although this may seem unexpected, it can be explained by the fact that most of the discriminative information is captured by the first two principal components. The additional components mainly introduce noise and do not contribute significantly to class separation.
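
One quick way to back up this explanation is to look at how much of the total variance each principal direction carries (a separate sanity check, not the experiment plotted above). A small NumPy sketch, assuming the dataset in question is scikit-learn's breast cancer dataset:

import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the Breast Cancer Diagnosis dataset above is scikit-learn's version.
X = load_breast_cancer().data

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Fraction of the total variance carried by each principal direction.
explained = eigvals / eigvals.sum()
print(explained[:5])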

In contrast, when using a more complex dataset such as load_digits, where the informative variance is spread across several directions, we can clearly observe how the average error evolves as a function of the number of retained dimensions.

This highlights another practical use of PCA. As observed, beyond approximately 40 components, the error rate remains almost unchanged. Consequently, reducing the dimensionality to 30–40 components before training a neural network could significantly reduce computational cost without sacrificing performance.
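
A sketch of that kind of sweep, using the pca module from the usage example, scikit-learn's load_digits, and a simple nearest-class-centroid classifier in the reduced space (an assumption; the original plot may have been produced with a different evaluation):

import numpy as np
import pca
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Hold out the last 300 samples for evaluation.
X_train, y_train = X[:-300], y[:-300]
X_test, y_test = X[-300:], y[-300:]

for k in (2, 10, 20, 40, 60):
    # Assumption: pca.fit accepts a list of lists, as in the usage example above.
    res = pca.fit(X_train.tolist(), k)
    W, means = np.array(res["W"]), np.array(res["means"])

    train_proj = np.array(res["projected"])
    test_proj = (X_test - means) @ W

    # Classify each test point by the nearest class centroid in the reduced space.
    centroids = np.array([train_proj[y_train == c].mean(axis=0) for c in range(10)])
    dists = np.linalg.norm(test_proj[:, None, :] - centroids[None, :, :], axis=2)
    pred = np.argmin(dists, axis=1)

    print(f"k={k:3d}  error rate={np.mean(pred != y_test):.3f}")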
