A novel causal discovery algorithm for learning causal relationships between clusters of variables, significantly reducing computational complexity while maintaining accuracy.
CLOC (Causal Learning Over Clusters) is an efficient algorithm for causal structure learning when variables are grouped into clusters. Instead of learning causal relationships between individual variables, CLOC operates at the cluster level, ideal for high-dimensional settings where variables exhibit natural groupings.
Benefits of CLOC:
- Computational Efficiency: Reduces complexity by operating on clusters rather than individual variables
- Scalability: Handles datasets with hundreds of variables grouped into a smaller number of clusters
- Accuracy: Maintains causal discovery performance while reducing the number of required conditional independence (CI) tests
The algorithm returns an alphaCluster-CPDAG ($\alpha$C-CPDAG): a cluster-level completed partially directed acyclic graph representing the Markov equivalence class of cluster-level causal relationships.
```
pip install numpy scipy pandas networkx tqdm
pip install causallearn dowhy sempler pyarrow
pip install hyppo
```

```
git clone https://github.com/TaraAnand/CLOC.git
cd CLOC
```

```python
from learning_utils import CLOC
from graph_gen_utils import generate_cdag_structure, generate_dag_compat_cdag, simulate_gaussian_sem
import numpy as np

# 1. Generate a cluster DAG structure
cdag = generate_cdag_structure(n_clusters=4, density=0.3, seed=42)

# 2. Expand to a variable-level DAG
dag_adj, var_names = generate_dag_compat_cdag(
    cdag,
    nodes_per_cluster=[3, 5, 4, 4],
    density_inner=0.4,
    seed=42
)

# 3. Define the partition (cluster membership)
partition = {
    "names": cdag.clusters,
    "clusters": {
        cluster: [v for v in var_names if v[0] == cluster]
        for cluster in cdag.clusters
    }
}

# 4. Generate data
data = simulate_gaussian_sem(dag_adj, n=1000, seed=123)

# 5. Run CLOC
result = CLOC(data, partition, oracle_mode=False)

# Access results
ccpdag = result["ccpdag"]
n_ci_tests = result["counter"]
print(f"Cluster CPDAG adjacency matrix:\n{ccpdag.adj_mat}")
print(f"Number of CI tests performed: {n_ci_tests}")
```

Learn causal structure from observational data:
```python
import pandas as pd
from learning_utils import CLOC

# Load your data (n samples × p variables)
data = pd.read_csv("your_data.csv")

# Define variable clusters
partition = {
    "names": ["ClusterA", "ClusterB", "ClusterC"],
    "clusters": {
        "ClusterA": ["var1", "var2", "var3"],
        "ClusterB": ["var4", "var5", "var6", "var7"],
        "ClusterC": ["var8", "var9"]
    }
}

# Run CLOC
result = CLOC(data, partition, oracle_mode=False)
ccpdag = result["ccpdag"]
```

When the true DAG structure is known (for benchmarking/evaluation):
```python
# dag_adj: adjacency matrix encoding the true DAG
result = CLOC(dag_adj, partition, oracle_mode=True)
```

```
CLOC/
├── learning_utils.py    # Core CLOC algorithm implementation
├── graph_gen_utils.py   # Utilities for generating cluster DAGs and data
├── graph_utils.py       # Graph manipulation and metrics
├── simulations.py       # Experimental evaluation framework
└── README.md            # This file
```
learning_utils.py
- `CLOC()`: Main algorithm implementation
- `get_full_ccpdag_from_dag()`: PC-then-cluster baseline approach
- Conditional independence testing at the cluster level
- Orientation rules for CCPDAGs
graph_gen_utils.py
- `generate_cdag_structure()`: Generate random cluster DAGs
- `generate_dag_compat_cdag()`: Generate a random DAG compatible with the cluster DAG
- `simulate_gaussian_sem()`: Generate data from linear Gaussian SEMs
graph_utils.py
- Graph utility functions (adjacency, parents, children)
- `cpdag_shd()`: Structural Hamming Distance calculation
- DAG validation and CPDAG conversion
simulations.py
- Comprehensive experimental framework
- Compares CLOC vs. PC-then-cluster approach
- Measures performance metrics (SHD, runtime, CI test count)
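For intuition on the SHD metric, here is a minimal sketch that counts, for each unordered pair of nodes, whether the edge-mark pattern differs between two graphs under this repository's adjacency encoding. `shd_sketch` is a hypothetical helper for illustration only; the package's `cpdag_shd()` may weight disagreements differently.

```python
import numpy as np

def shd_sketch(adj1, adj2):
    """Count unordered node pairs whose edge marks differ between two graphs."""
    n = adj1.shape[0]
    return sum(
        (adj1[i, j], adj1[j, i]) != (adj2[i, j], adj2[j, i])
        for i in range(n) for j in range(i + 1, n)
    )

g1 = np.array([[0, -1], [1, 0]])    # A -> B
g2 = np.array([[0, -1], [-1, 0]])   # A -- B (undirected)
print(shd_sketch(g1, g2))  # 1: the pair (A, B) disagrees on orientation
```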
Run the full simulation study:
from simulations import main
# Runs experiments across multiple graph configurations and sample sizes
main()The simulation framework evaluates:
- Structural Hamming Distance (SHD): Accuracy of learned structure
- Runtime: Computational efficiency
- CI Test Count: Number of independence tests required
- Performance across varying cluster counts and sample sizes
CLOC is compared against:
- PC-then-Cluster: Learn full variable-level CPDAG with PC algorithm, then project to clusters
- Naive Complete Graph: Assume all clusters are connected
- Oracle Methods: Upper bound on performance with known structure
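The projection step of the PC-then-cluster baseline can be sketched as follows: two clusters are adjacent in the projected graph iff any cross-cluster pair of their member variables is adjacent. `project_to_clusters` is a hypothetical illustration, not the package's implementation.

```python
import numpy as np

def project_to_clusters(var_adj, var_names, partition):
    """Project a variable-level graph onto clusters: clusters A and B are
    adjacent iff some variable in A is adjacent to some variable in B."""
    names, members = partition["names"], partition["clusters"]
    idx = {v: k for k, v in enumerate(var_names)}
    k = len(names)
    cluster_adj = np.zeros((k, k), dtype=int)
    for a in range(k):
        for b in range(a + 1, k):
            for u in members[names[a]]:
                for v in members[names[b]]:
                    if var_adj[idx[u], idx[v]] != 0 or var_adj[idx[v], idx[u]] != 0:
                        cluster_adj[a, b] = cluster_adj[b, a] = 1
    return cluster_adj

# Tiny example: a single variable-level edge A2 -> B1 induces cluster edge A - B
var_names = ["A1", "A2", "B1"]
partition = {"names": ["A", "B"],
             "clusters": {"A": ["A1", "A2"], "B": ["B1"]}}
var_adj = np.zeros((3, 3), dtype=int)
var_adj[1, 2], var_adj[2, 1] = -1, 1   # A2 -> B1
print(project_to_clusters(var_adj, var_names, partition))
# [[0 1]
#  [1 0]]
```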
- Directed Edge i → j: `adj[i,j] = -1`, `adj[j,i] = 1`
- Undirected Edge i — j: `adj[i,j] = -1`, `adj[j,i] = -1`
- No Edge: `adj[i,j] = 0`, `adj[j,i] = 0`
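Under this convention, a small helper can decode an adjacency matrix into readable edges. `decode_edges` is a hypothetical helper written for this README, not part of the package:

```python
import numpy as np

def decode_edges(adj, names):
    """Decode the -1/1 adjacency convention into a list of readable edges."""
    edges = []
    n = adj.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j] == -1 and adj[j, i] == 1:
                edges.append(f"{names[i]} -> {names[j]}")  # directed i -> j
            elif adj[i, j] == 1 and adj[j, i] == -1:
                edges.append(f"{names[j]} -> {names[i]}")  # directed j -> i
            elif adj[i, j] == -1 and adj[j, i] == -1:
                edges.append(f"{names[i]} -- {names[j]}")  # undirected
    return edges

# A -> B, B -- C, no edge between A and C
adj = np.array([[0, -1,  0],
                [1,  0, -1],
                [0, -1,  0]])
print(decode_edges(adj, ["A", "B", "C"]))  # ['A -> B', 'B -- C']
```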
CLOC uses cluster-level conditional independence tests:
- Fisher's Z-test for Gaussian data
- Kernel-based tests (FCIT, KCI) via the `hyppo` package
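For intuition, a Fisher's Z test of conditional independence can be sketched as below: compute the partial correlation from the precision matrix of the relevant columns, apply the Fisher z-transform, and compare against a standard normal. This is a standalone illustration, not the package's internal test.

```python
import numpy as np
from scipy import stats

def fisher_z_test(data, i, j, cond=()):
    """Two-sided p-value for X_i independent of X_j given X_cond (Gaussian data).

    data: (n, p) array of samples; cond: indices of conditioning variables.
    """
    n = data.shape[0]
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)                            # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])    # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r))                   # Fisher z-transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return 2 * (1 - stats.norm.cdf(stat))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)        # y depends on x
z_var = rng.normal(size=2000)        # generated independently of both
data = np.column_stack([x, y, z_var])
print(fisher_z_test(data, 0, 1))     # ≈ 0: strong dependence between x and y
print(fisher_z_test(data, 0, 2))     # p-value for the independent pair
```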
- Python 3.8+
- NumPy >= 1.20
- pandas >= 1.3
- causallearn >= 0.1.3
- networkx >= 2.6
- scipy >= 1.7
- hyppo >= 0.3
This project is licensed under the MIT License - see the LICENSE file for details.
This work builds upon:
- causal-learn: Python library for causal discovery
- DoWhy: Python library for causal inference
- sempler: Tools for structural equation model generation
- The PC algorithm and its extensions for constraint-based causal discovery
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
- Zhang, J. (2008). On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17), 1873-1896.
- Claassen, T., & Heskes, T. (2012). A Bayesian approach to constraint based causal inference. UAI 2012.