PARC

Phenotyping by Accelerated Refined Community-partitioning

✳️ PARC is a fast, automated, combinatorial, graph-based clustering approach that integrates hierarchical graph construction (HNSW) and data-driven graph-pruning with the new Leiden community-detection algorithm.

✳️ Original Paper: PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells.

✳️ Check out Read the Docs for the full documentation and tutorials.

✳️ PARC forms the clustering basis for our new Trajectory Inference (TI) method VIA available on Github or readthedocs. VIA is a single-cell TI method that offers topology construction and visualization, pseudotimes, automated prediction of terminal cell fates and temporal gene dynamics along detected lineages. VIA can also be used to topologically visualize the graph-based connectivity of clusters found by PARC in a non-TI context.

Installation

MacOS / Linux

git clone https://github.com/ahill187/PARC.git
cd PARC
python3 -m venv env
source env/bin/activate
pip install .

Note
If pip install doesn't work, it usually suffices to first install all the requirements (using pip) and then install PARC (also using pip):

git clone https://github.com/ahill187/PARC.git
cd PARC
python3 -m venv env
source env/bin/activate
pip install igraph leidenalg hnswlib umap-learn
pip install .

Windows

Once you have Visual Studio installed, it should be smooth sailing (sometimes a restart is required after installing VS). It might be easier to install the dependencies using either pip install or conda install -c conda-forge. If this doesn't work, you might need to consider using binaries to install igraph and leidenalg.

  • python-igraph: download the Python 3.6 Windows binaries by Gohlke
  • leidenalg: depends on python-igraph; download the Windows binary (available for Python 3.6 only)
conda create --name parcEnv python=3.6 pip
pip install igraph #(or install python_igraph-0.7.1.post6-cp36-cp36m-win_amd64.whl )
pip install leidenalg #(or install leidenalg-0.7.0-cp36-cp36m-win_amd64.whl)
pip install hnswlib
pip install parc
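
On any platform, you can confirm that the installation worked with a quick smoke test on random data (a minimal sketch; the data shape and cluster structure here are arbitrary):

import numpy as np
import parc

# Small random dataset: 1000 samples, 20 features
x_data = np.random.rand(1000, 20)

# Run PARC with default parameters
parc_model = parc.PARC(x_data=x_data)
parc_model.run_parc()

# One cluster label per sample confirms the pipeline ran end to end
print(len(parc_model.y_data_pred))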

Examples

Example 1: COVID-19 scRNA-seq Data

Check out the Jupyter Notebook for how to pre-process and PARC cluster the new COVID-19 BALF dataset by Liao et al. 2020. We also show how to integrate UMAP with HNSW, such that the UMAP embedding is constructed using the HNSW graph built in PARC, enabling very fast and memory-efficient visualization (particularly noticeable when n_cells > 1 million).

PARC Cluster-level average gene expression

PARC visualizes cells by integrating UMAP embedding on the HNSW graph

Example 2: IRIS Dataset from sklearn

import parc
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
x_data = pd.DataFrame(iris.data, columns=iris.feature_names) # (n_samples x n_features = 150 x 4)
y_data = iris.target

# Plot the data (coloured by ground truth)
column_a = iris.feature_names[0]
column_b = iris.feature_names[1]
plt.scatter(x_data[column_a], x_data[column_b], c=y_data)
plt.title("Iris Dataset: Ground Truth")
plt.xlabel(column_a)
plt.ylabel(column_b)
plt.show()

# Instantiate the PARC model
parc_model = parc.PARC(x_data=x_data, y_data_true=y_data)

# Run the PARC clustering
parc_model.run_parc()
y_data_pred = parc_model.y_data_pred

# View scatterplot colored by PARC labels
plt.scatter(x_data[column_a], x_data[column_b], c=y_data_pred, cmap="rainbow")
plt.title("Iris Dataset: PARC Predictions")
plt.xlabel(column_a)
plt.ylabel(column_b)
plt.show()

# Run UMAP on the HNSW KNN graph already built in PARC
# (more time and memory efficient for large datasets)
csr_array = parc_model.create_knn_graph()
x_umap = parc_model.run_umap_hnsw(x_data=x_data, graph=csr_array)

# Visualize UMAP results
plt.scatter(x_umap[:, 0], x_umap[:, 1], c=parc_model.y_data_pred)
plt.title("Iris Dataset UMAP: PARC Predictions")
plt.xlabel("umap_x")
plt.ylabel("umap_y")
plt.show()
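
Since the Iris ground-truth labels are available, a quick way to sanity-check the clustering is an external metric such as the adjusted Rand index (a minimal sketch using scikit-learn, which this example already depends on):

from sklearn.metrics import adjusted_rand_score

# Compare PARC predictions against the ground-truth species labels
# (1.0 = perfect agreement, values near 0 = random labelling)
ari = adjusted_rand_score(y_data, y_data_pred)
print(f"Adjusted Rand index: {ari:.3f}")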

Example 3: Digits Dataset from sklearn

import parc
import matplotlib.pyplot as plt
from sklearn import datasets


# Load the Digits dataset
digits = datasets.load_digits()
x_data = digits.data # (n_samples x n_features = 1797 x 64)
y_data = digits.target

# Instantiate the PARC model
parc_model = parc.PARC(
    x_data=x_data,
    y_data_true=y_data,
    jac_threshold_type="median"  # "median" is default pruning level
)

# Run the PARC clustering
parc_model.run_parc()
y_data_pred = parc_model.y_data_pred
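
The digits are 64-dimensional, so to eyeball the clusters you can project onto the first two principal components and colour the points by their PARC labels (a minimal sketch; the PCA projection is only for plotting and is not part of PARC):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the 64-dimensional digits onto 2 principal components for plotting
x_pca = PCA(n_components=2).fit_transform(x_data)

plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y_data_pred, cmap="rainbow", s=5)
plt.title("Digits Dataset: PARC Predictions (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()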

Example 4 (mid-scale scRNA-seq): 10X PBMC (Zheng et al., 2017)

  1. Download the input data file and save it to the PARC/data/ directory: pca50_pbmc68k.txt.

  2. You can view the target annotations here: PARC/data/zheng17_annotations.txt.

import parc
import numpy as np
import pandas as pd
import pathlib


# Set the directory by replacing {PATH/TO/PARC}
PARC_DIR = "{PATH/TO/PARC}/PARC/"
x_data_path = pathlib.Path(PARC_DIR, "data/pca50_pbmc68k.txt")
y_data_path = pathlib.Path(PARC_DIR, "data/zheng17_annotations.txt")

# Load data
# (50 PCs of the filtered gene matrix, pre-processed as per Zheng et al. 2017)
# (n_samples x n_features) = (68579 x 50)
x_data = pd.read_csv(x_data_path, header=None).values.astype("float")
y_data = list(pd.read_csv(y_data_path, header=None)[0])   


# Instantiate the PARC model
parc_model = parc.PARC(
    x_data=x_data,
    y_data_true=y_data,
    jac_std_factor=0.15,
    jac_threshold_type="mean",
    random_seed=1,
    small_community_size=50  # setting this to 50 cleans up some of the smaller
    # clusters, but it can also be left at the default of 10
)

# Run the PARC clustering
parc_model.run_parc()
y_data_pred = parc_model.y_data_pred

# View the model performance
parc_model.stats_df
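
To see how the PARC clusters line up with the Zheng et al. annotations, a contingency table is often more informative than a single summary score (a minimal sketch using pandas, which is already imported above):

# Cross-tabulate the annotated cell types (rows) against the PARC cluster labels (columns)
confusion = pd.crosstab(
    pd.Series(y_data, name="annotation"),
    pd.Series(y_data_pred, name="PARC cluster")
)
print(confusion)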

t-SNE plot of annotations and PARC clustering

Example 5: 10X PBMC (Zheng et al., 2017) integrating Scanpy pipeline

The description of the data for this example can be found in the GitHub repository: single-cell-3prime-paper.

  1. Download and unzip the 10X compressed folder "filtered genes". Save it to PARC/data/.

  2. You can view the target annotations here: PARC/data/zheng17_annotations.txt.

  3. Install scanpy:

pip install scanpy

import parc
import scanpy as sc
import pandas as pd
import pathlib

# Set the directory by replacing {PATH/TO/PARC}
PARC_DIR = "{PATH/TO/PARC}/PARC/"
x_data_path = pathlib.Path(PARC_DIR, "data/filtered_matrices_mex/hg19")
y_data_path = pathlib.Path(PARC_DIR, "data/zheng17_annotations.txt")

# Load data
ann_data = sc.read(f"{x_data_path}/matrix.mtx", cache=True).T  # transpose the data
ann_data.var_names = pd.read_csv(f"{x_data_path}/genes.tsv", header=None, sep='\t')[1]
ann_data.obs_names = pd.read_csv(f"{x_data_path}/barcodes.tsv", header=None)[0]

# Load the annotations as per correlation with pure samples
y_data = list(pd.read_csv(y_data_path, header=None)[0])
ann_data.obs["annotations"] = pd.Categorical(y_data)

# Pre-process as per Zheng et al., and take first 50 PCs for analysis
sc.pp.recipe_zheng17(ann_data)
sc.tl.pca(ann_data, n_comps=50)

# Instantiate the PARC model
parc_model = parc.PARC(
    x_data=ann_data.obsm["X_pca"],
    y_data_true=y_data,
    jac_std_factor=0.15,
    jac_threshold_type="mean",
    random_seed=1,
    small_community_size=50 # setting small_community_size to 50 cleans up some of the
    # smaller clusters, but can also be left at the default 10
)

# Run the PARC clustering
parc_model.run_parc()

# Get the predicted cell types
y_data_pred = parc_model.y_data_pred
ann_data.obs["PARC"] = pd.Categorical(y_data_pred)

# Visualize UMAP
sc.settings.n_jobs = 4
sc.pp.neighbors(ann_data, n_neighbors=10, n_pcs=40)
sc.tl.umap(ann_data)
sc.pl.umap(ann_data, color="annotations")
sc.pl.umap(ann_data, color="PARC")
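
If you want to reuse the clustered object later, the AnnData object (including the "annotations" and "PARC" columns) can be written to disk; the output filename below is arbitrary:

# Persist the AnnData object for later analysis
ann_data.write(pathlib.Path(PARC_DIR, "data/pbmc68k_parc.h5ad"))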

Example 6: Large-scale (70K subset and 1.1M cells) Lung Cancer cells (multi-ATOM imaging cytometry based features)

Here we have a dataset containing single-cell feature data for lung cancer cells. The data was extracted from bright-field and QPI images taken by multi-ATOM imaging flow cytometry. The data is a digital mix of 7 cell lines from 7 sets of pure samples.

The full dataset is 1.1M cells; however, depending on your computer's memory capacity, you may want to run a smaller subset, so we have provided a subset of 70K cells as well.

To run the full dataset of 1.1M cells:

  1. Download the Lung Cancer 1.1M cell features and annotations from Elsevier.

    Save the files to:

    PARC/data/datamatrix_LungCancer_multiATOM_N1113369.txt
    PARC/data/true_label_LungCancer_multiATOM_N1113369.txt

  2. Download the H1975 digital spike test cluster data (n = 100).

    Save it to PARC/data/datamatrix_RareH1975_LC_RS209_N281604Dim24.txt.

  3. You can view the H1975 annotations (which are all 0 since it's one cluster) under PARC/data/true_label_RareH1975_LC_PARC_RS209_N281604.txt.

Otherwise, you can download the 70K subset:

  1. Download the input data: normalized image-based feature matrix 70K cells.

    Save the file to PARC/data/datamatrix_LC_PARC__N70000.txt

  2. Download the target data (annotations): Lung Cancer cells annotation 70K cells.

    Save the file to PARC/data/true_label_LC_PARC_N70000.txt

  3. Download the H1975 digital spike test cluster data (n = 100).

    Save it to PARC/data/datamatrix_RareH1975_LC_RS209_N281604Dim24.txt.

  4. You can view the H1975 annotations (which are all 0 since it's one cluster) under PARC/data/true_label_RareH1975_LC_PARC_RS209_N281604.txt.

import parc
import pandas as pd
import pathlib

# Set the directory by replacing {PATH/TO/PARC}
PARC_DIR = "{PATH/TO/PARC}/PARC/"

# Load the full dataset of 1.1M cells
x_data_path = pathlib.Path(PARC_DIR, "data/datamatrix_LungCancer_multiATOM_N1113369.txt")
y_data_path = pathlib.Path(PARC_DIR, "data/true_label_LungCancer_multiATOM_N1113369.txt")

# # Alternatively, load the subset of 70K cells
# x_data_path = pathlib.Path(PARC_DIR, "data/datamatrix_LC_PARC__N70000.txt")
# y_data_path = pathlib.Path(PARC_DIR, "data/true_label_LC_PARC_N70000.txt")

# Load data
x_data = pd.read_csv(x_data_path, header=None).values.astype("float")
y_data = list(pd.read_csv(y_data_path, header=None)[0]) # list of cell-type annotations

# Instantiate PARC with the lung cancer data
parc_model = parc.PARC(
    x_data=x_data,
    y_data_true=y_data,
    jac_weighted_edges=False # provides unweighted graph to leidenalg (faster)
)

# Run the PARC clustering
parc_model.run_parc()
y_data_pred = parc_model.y_data_pred


# Load the H1975 cell cluster (n = 100)
x_data_path = pathlib.Path(PARC_DIR, "data/datamatrix_RareH1975_LC_RS209_N281604Dim24.txt")
y_data_path = pathlib.Path(PARC_DIR, "data/true_label_RareH1975_LC_PARC_RS209_N281604.txt")
x_data = pd.read_csv(x_data_path, header=None).values.astype("float")
y_data = list(pd.read_csv(y_data_path, header=None)[0])

# Instantiate PARC with the H1975 spiked cells
parc_model = parc.PARC(
    x_data=x_data,
    y_data_true=y_data,
    jac_std_factor=0.15, # 0.15 prunes ~60% of edges and can be effective for rarer populations
    jac_threshold_type="mean",
    jac_weighted_edges=False
) 

# Run the PARC clustering
parc_model.run_parc()
parc_labels_rare = parc_model.y_data_pred
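
Since the spiked H1975 cells form a single population (all annotations are 0), a quick check of rare-population recovery is whether most of them land in one PARC cluster (a minimal sketch using pandas, imported above):

# Count how the spiked H1975 cells are distributed across PARC clusters;
# ideally the bulk of them fall into a single cluster
cluster_counts = pd.Series(parc_labels_rare).value_counts()
print(cluster_counts)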

t-SNE plot of annotations and PARC clustering, heatmap of features

Parameters

For a more detailed explanation of the impact of tuning key parameters, please see the Supplementary Analysis in our paper.

  • x_data (np.ndarray): A Numpy array of the input x data, with dimensions (n_samples, n_features).
  • y_data_true (np.ndarray, optional): A Numpy array of the true output y labels.
  • knn (int): The number of nearest neighbors k for the k-nearest-neighbours algorithm. A larger k means more neighbours per cluster and therefore fewer clusters.
  • n_iter_leiden (int): The number of iterations for the Leiden algorithm.
  • random_seed (int): The random seed to enable reproducible Leiden clustering.
  • distance_metric (str): The distance metric to be used in the KNN algorithm.
  • n_threads (int): The number of threads used in the KNN algorithm.
  • hnsw_param_ef_construction (int): A higher value increases the accuracy of index construction. Even for O(100 000) cells, 150-200 is adequate.
  • neighbor_graph (Compressed Sparse Row Matrix): A sparse matrix with dimensions (n_samples, n_samples), containing the distances between nodes.
  • knn_struct (hnswlib.Index): The HNSW index of the KNN graph on which we perform queries.
  • l2_std_factor (float): The multiplier used in calculating the Euclidean distance threshold for the distance between two nodes during local pruning: max_distance = np.mean(distances) + l2_std_factor * np.std(distances). Avoid setting both jac_std_factor (global) and l2_std_factor (local) below 0.5, as this prunes very aggressively. A higher l2_std_factor means more edges are kept.
  • do_prune_local (bool, optional, default=None): Whether or not to do local pruning. If None (default), local pruning is disabled when the number of samples is > 300 000 and enabled otherwise.
  • jac_threshold_type (str, optional, default="median"): One of "median" or "mean". Determines how the Jaccard similarity threshold is calculated during global pruning.
  • jac_std_factor (float): The multiplier used in calculating the Jaccard similarity threshold between two nodes during global pruning when jac_threshold_type="mean": threshold = np.mean(similarities) - jac_std_factor * np.std(similarities). Setting jac_std_factor=0.15 with jac_threshold_type="mean" performs empirically similarly to jac_threshold_type="median", which does not use jac_std_factor. Values between 0 and 1.5 are generally reasonable. A higher jac_std_factor means more edges are kept.
  • jac_weighted_edges (bool, optional, default=True): Whether to partition using the weighted graph.
  • resolution_parameter (float, optional, default=1): The resolution parameter to be used in the Leiden algorithm. To change resolution_parameter, the partition type is switched to RBVP.
  • partition_type (str): The partition type to be used in the Leiden algorithm.
  • large_community_factor (float): A factor used to determine whether a community is too large. If a community's size is greater than large_community_factor * n_samples, the community is considered too large and the PARC algorithm is run on that single community to split it up. The default value is 0.4, i.e. any community containing more than 40% of all samples is split further.
  • small_community_size (int): The smallest population size to be considered a community.
  • small_community_timeout (int): The maximum number of seconds spent trying to check an outlying small community.
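
As an illustration of the pruning rules described above, here is a minimal sketch of how the local (Euclidean) and global (Jaccard) thresholds are computed, assuming plain arrays of edge distances and Jaccard similarities (illustrative only; the function names are hypothetical and the actual implementation lives inside PARC):

import numpy as np

def local_prune_threshold(distances, l2_std_factor):
    # Local pruning: edges with Euclidean distance above this threshold are cut
    return np.mean(distances) + l2_std_factor * np.std(distances)

def global_prune_threshold(similarities, jac_threshold_type="median", jac_std_factor=0.15):
    # Global pruning: edges with Jaccard similarity below this threshold are cut
    if jac_threshold_type == "median":
        return np.median(similarities)  # jac_std_factor is not used in this case
    return np.mean(similarities) - jac_std_factor * np.std(similarities)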

Attributes

In addition to the parameters described above, the PARC model has the following attributes:

  • y_data_pred (np.ndarray): A Numpy array of the predicted output y labels.
  • f1_mean (list): The f1 score (not weighted by population). For details, see the supplementary section of the paper.
  • stats_df (pd.DataFrame): Stores parameter values and performance metrics.

Developers

Installation

To install a development version:

git clone https://github.com/ahill187/PARC.git
cd PARC
pip install -e ".[extra]"

Testing

To run the test suite for the package:

cd PARC
pytest tests/

References

  • leidenalg: V.A. Traag, 2019. https://doi.org/10.1038/s41598-019-41695-z
  • hnswlib: Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI. Preprint: https://arxiv.org/abs/1603.09320
  • igraph: https://igraph.org/python/

Citing

If you find this code useful in your work, please consider citing this paper:

PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells
