Note: This code is not meant for production use. It is neither as fast nor as numerically reliable as
numpy.linalg.eigh.
For a minimal example, see the following:
import pca
# The matrix to reduce.
# It has shape N × p, where:
# - N is the number of samples in the dataset
# - p is the number of features per sample
M = [
[2, 1, -1],
[-4, 2, 0],
[9, 2, 1],
[0, -1, 2]
]
# Target dimension
k = 2
res = pca.fit(M, k)
# dataset M reduced to k dimensions
M_reduced = res["projected"]
# the basis in which the projection has been made and the mean of the initial dataset
# if you want to project more points: (test - means) @ W
W = res["W"]
means = res["means"]The left image represents the generated data in 3D. You may notice that one axis does not carry much information. In this case, reducing the dimension to 2 is especially interesting because very little information is lost. The right image shows the same dataset after applying PCA.
If you want more detail on how PCA works and the implementation choices I made, check out How PCA works (PDF).
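For reference, the core of a PCA fit can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the exact code of the module or of the PDF:

import numpy as np

def pca_fit(M, k):
    # Centre the data so that each feature has zero mean.
    M = np.asarray(M, dtype=float)
    means = M.mean(axis=0)
    X = M - means
    # Diagonalise the covariance matrix; eigh returns eigenvalues in ascending order.
    cov = X.T @ X / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k eigenvectors with the largest eigenvalues as the projection basis W (p x k).
    W = eigvecs[:, ::-1][:, :k]
    return {"projected": X @ W, "W": W, "means": means}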
Using a Breast Cancer Diagnosis data set, you can visualise the data in 2D, then project new entries and classify them by distance in the projected space. With two principal components, the average error rate comes out to roughly 7%.
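A sketch of how such an experiment could be run, assuming scikit-learn's load_breast_cancer for the data; the nearest-neighbour rule in the projected space is an assumption made for illustration, not necessarily the exact rule used here:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pca

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

# Fit PCA on the training set and keep the first two principal components.
res = pca.fit(X_train, 2)
train_2d, W, means = res["projected"], res["W"], res["means"]

# Project the test samples into the same 2D basis.
test_2d = (X_test - means) @ W

# Classify each projected test point by its nearest projected training point.
errors = sum(
    y_test[i] != y_train[np.argmin(np.linalg.norm(train_2d - p, axis=1))]
    for i, p in enumerate(test_2d))
print("error rate:", errors / len(y_test))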
You can clearly do better than a 7% error rate. Maybe the third most important direction still contains a lot of information. Let's check that out:
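Re-running the sketch above in three dimensions only means changing the target dimension:

# Keep the third principal axis as well, then repeat the same error measurement.
res = pca.fit(X_train, 3)
train_3d = res["projected"]
test_3d = (X_test - res["means"]) @ res["W"]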
The error rate remains constant. Although this may seem unexpected, it can be explained by the fact that most of the discriminative information is captured by the first two principal components. The additional components mainly introduce noise and do not contribute significantly to class separation.
In contrast, when using a more complex dataset such as load_digits, where the informative variance is spread across several directions, we can clearly observe how the average error evolves as a function of the number of retained dimensions.
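One way to produce such a curve, reusing the same nearest-neighbour error measure (again an assumption) on scikit-learn's load_digits:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import pca

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

def error_rate(k):
    # Project both sets onto the first k principal components of the training set.
    res = pca.fit(X_train, k)
    train_k, W, means = res["projected"], res["W"], res["means"]
    test_k = (X_test - means) @ W
    # Average nearest-neighbour error in the reduced space.
    errors = sum(
        y_test[i] != y_train[np.argmin(np.linalg.norm(train_k - p, axis=1))]
        for i, p in enumerate(test_k))
    return errors / len(y_test)

for k in (2, 5, 10, 20, 40, 64):
    print(k, error_rate(k))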
This highlights another practical use of PCA. As observed, beyond approximately 40 components, the error rate remains almost unchanged. Consequently, reducing the dimensionality to 30–40 components before training a neural network could significantly reduce computational cost without sacrificing performance.
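For example, the digits experiment above could feed a small scikit-learn classifier with 40 components instead of the 64 raw pixels; the classifier and its settings here are arbitrary choices for the sake of the sketch:

from sklearn.neural_network import MLPClassifier

# Reduce the training set to 40 components before training.
res = pca.fit(X_train, 40)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(res["projected"], y_train)

# Project the test set with the same W and means before scoring.
X_test_40 = (X_test - res["means"]) @ res["W"]
print("accuracy:", clf.score(X_test_40, y_test))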