Poisson Subspace Clustering: Focusing on the Essentials in Count Data
To install the requirements, run:
pip install numpy
pip install git+https://github.com/collinleiber/ClustPy.git
The datasets used are available here:
- Wholesales: https://archive.ics.uci.edu/dataset/292/wholesale+customers
- SportA: https://archive.ics.uci.edu/dataset/450/sports+articles+for+objectivity+analysis
- Optdigits: https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits
- BBCSports: http://mlg.ucd.ie/datasets/bbc.html
- BBCNews: http://mlg.ucd.ie/datasets/bbc.html
- WebKB: https://www.cs.cmu.edu/~webkb/
- Reuters: https://archive.ics.uci.edu/dataset/137/reuters+21578+text+categorization+collection
- 20NewsG: https://archive.ics.uci.edu/dataset/113/twenty+newsgroups (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
- MouseAtlas: https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data/dataset2 (renamed "filtered_total_batch1_seqwell_batch2_10x.txt" -> "mouse_cell_atlas.txt")
- GeneExp: https://archive.ics.uci.edu/dataset/401/gene+expression+cancer+rna+seq
- HDendritic: https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data/dataset1 (renamed "dataset1_sm_uc3.txt" -> "human_dendritic_cells.txt")
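The two renames mentioned in the list above can be scripted. The following is a minimal sketch, not part of the repository: the helper name `rename_downloads` and the idea of a single download directory are assumptions made for illustration, and only the file names are taken from the list.

```python
from pathlib import Path

# Original download name -> file name stated in the dataset list above.
RENAMES = {
    "filtered_total_batch1_seqwell_batch2_10x.txt": "mouse_cell_atlas.txt",
    "dataset1_sm_uc3.txt": "human_dendritic_cells.txt",
}


def rename_downloads(data_dir):
    """Rename downloaded files in data_dir (hypothetical helper); return the new names created."""
    data_dir = Path(data_dir)
    renamed = []
    for old_name, new_name in RENAMES.items():
        source = data_dir / old_name
        target = data_dir / new_name
        # Skip files that are missing or were already renamed.
        if source.is_file() and not target.exists():
            source.rename(target)
            renamed.append(new_name)
    return renamed
```

Calling `rename_downloads("data")` would rename whichever of the two files is present in a (hypothetical) `data` directory and leave everything else untouched. Where the loaders in this repository expect the files to be located is not specified here.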
The comparison algorithms Spherical-k-Means (SKM), PoissonL and PoissonC were implemented by us and can be found in the competitors.py file.
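For readers unfamiliar with Spherical k-Means, the sketch below illustrates the general technique: rows are normalized to unit length, assigned by cosine similarity, and centroids are re-normalized after each update. It is a textbook-style illustration under our own assumptions (function name, random initialization, stopping rule) and is not the implementation in competitors.py.

```python
import numpy as np


def spherical_kmeans(X, n_clusters, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every row onto the unit sphere (all-zero rows stay zero).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.maximum(norms, 1e-12)
    # Start from randomly chosen data points (an assumption of this sketch).
    centroids = X_unit[rng.choice(len(X_unit), size=n_clusters, replace=False)]
    labels = np.full(len(X_unit), -1)
    for _ in range(n_iter):
        # Assign each row to the centroid with the largest cosine similarity.
        new_labels = np.argmax(X_unit @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for k in range(n_clusters):
            members = X_unit[labels == k]
            if len(members) > 0:
                # New centroid: the re-normalized sum of the member rows.
                total = members.sum(axis=0)
                centroids[k] = total / max(np.linalg.norm(total), 1e-12)
    return labels, centroids
```

A call such as `labels, centroids = spherical_kmeans(X, n_clusters=3)` groups rows by direction rather than magnitude, which is why the method is a common baseline for count data such as term frequencies.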
The co-clustering algorithms CROINFO, CoclustMod, CoclustSpecMod, ELBM, SELBM and TauCC are contained in the coclustering directory and were originally obtained here:
- CROINFO, CoclustMod, CoclustSpecMod: https://github.com/franrole/cclust_package / https://www.jstatsoft.org/article/view/v088i07
- ELBM, SELBM: https://github.com/Saeidhoseinipour/ELBMcoclust
- TauCC: https://github.com/rupensa/tauCC
You can test 3CPO manually by running it on a dataset:
import numpy as np
from threecpo import ThreeCPO
from datasets import load_synth_data
X, L = load_synth_data() # Replace by any other dataset
n_clusters = len(np.unique(L))
threeCPO = ThreeCPO(n_clusters=n_clusters)
threeCPO.fit(X)
Our results can be reproduced by running the functions in the experiments.py file. Each function covers the whole pipeline (loading datasets, running algorithms, evaluation). Examples:
from threecpo import experiment_table, experiment_text_data, experiment_ablations, experiment_initializations, experiment_robustness_amount_noise_columns, experiment_robustness_maximum_noise_value, experiment_runtime_rows, experiment_runtime_columns, experiment_estimate_k, load_synth_data
experiment_table()
experiment_text_data()
experiment_ablations()
experiment_initializations()
experiment_robustness_amount_noise_columns()
experiment_robustness_maximum_noise_value()
experiment_runtime_rows()
experiment_runtime_columns()
X, L = load_synth_data(return_X_y=True)
experiment_estimate_k(X, L)
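The evaluation step of such a pipeline compares predicted cluster labels with the ground truth labels `L`. Which metrics experiments.py reports is not stated here; as an illustration only, the sketch below computes normalized mutual information (NMI, with arithmetic-mean normalization), a common choice for clustering evaluation. The function name is our own and not part of the repository.

```python
import numpy as np


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization; 1.0 means identical partitions."""
    _, true_ids = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pred_ids = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    # Contingency table: joint counts of (true class, predicted cluster).
    table = np.zeros((true_ids.max() + 1, pred_ids.max() + 1))
    np.add.at(table, (true_ids, pred_ids), 1)
    joint = table / table.sum()
    p_true = joint.sum(axis=1)
    p_pred = joint.sum(axis=0)
    # Only non-empty cells contribute to the mutual information.
    nonzero = joint > 0
    expected = np.outer(p_true, p_pred)
    mutual_info = np.sum(joint[nonzero] * np.log(joint[nonzero] / expected[nonzero]))
    entropy_true = -np.sum(p_true * np.log(p_true))
    entropy_pred = -np.sum(p_pred * np.log(p_pred))
    denominator = 0.5 * (entropy_true + entropy_pred)
    # Both partitions consist of a single group: treat them as identical.
    return 1.0 if denominator == 0 else float(mutual_info / denominator)
```

The score is invariant to how clusters are numbered, so predicted labels can be passed in directly without matching them to class ids first. How the predicted labels are read from a fitted model depends on the implementation and is not assumed here.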