CLOC: Causal Learning Over Clusters

A novel causal discovery algorithm for learning causal relationships between clusters of variables, significantly reducing computational complexity while maintaining accuracy.

Overview

CLOC (Causal Learning Over Clusters) is an efficient algorithm for causal structure learning when variables are grouped into clusters. Instead of learning causal relationships between individual variables, CLOC operates at the cluster level, making it well suited to high-dimensional settings where variables exhibit natural groupings.

Benefits of CLOC:

  • Computational Efficiency: Reduces complexity by operating on clusters rather than individual variables
  • Scalability: Handles datasets with hundreds of variables grouped into a smaller number of clusters
  • Accuracy: Maintains causal discovery performance while reducing the number of required conditional independence (CI) tests

The algorithm returns an α-cluster CPDAG ($\alpha$C-CPDAG), a cluster-level completed partially directed acyclic graph representing the Markov equivalence class of cluster-level causal relationships.

Installation

Prerequisites

pip install numpy scipy pandas networkx tqdm
pip install causallearn dowhy sempler pyarrow
pip install hyppo

Clone Repository

git clone https://github.com/TaraAnand/CLOC.git
cd CLOC

Quick Start

from learning_utils import CLOC
from graph_gen_utils import generate_cdag_structure, generate_dag_compat_cdag, simulate_gaussian_sem
import numpy as np

# 1. Generate a cluster DAG structure
cdag = generate_cdag_structure(n_clusters=4, density=0.3, seed=42)

# 2. Expand to variable-level DAG
dag_adj, var_names = generate_dag_compat_cdag(
    cdag, 
    nodes_per_cluster=[3, 5, 4, 4],
    density_inner=0.4,
    seed=42
)

# 3. Define partition (cluster membership)
partition = {
    "names": cdag.clusters,
    "clusters": {
        cluster: [v for v in var_names if v[0] == cluster]
        for cluster in cdag.clusters
    }
}

# 4. Generate data
data = simulate_gaussian_sem(dag_adj, n=1000, seed=123)

# 5. Run CLOC
result = CLOC(data, partition, oracle_mode=False)

# Access results
ccpdag = result["ccpdag"]
n_ci_tests = result["counter"]
print(f"Cluster CPDAG adjacency matrix:\n{ccpdag.adj_mat}")
print(f"Number of CI tests performed: {n_ci_tests}")

Usage

Learning from observational data

Learn causal structure from observational data:

import pandas as pd
from learning_utils import CLOC

# Load your data (n samples × p variables)
data = pd.read_csv("your_data.csv")

# Define variable clusters
partition = {
    "names": ["ClusterA", "ClusterB", "ClusterC"],
    "clusters": {
        "ClusterA": ["var1", "var2", "var3"],
        "ClusterB": ["var4", "var5", "var6", "var7"],
        "ClusterC": ["var8", "var9"]
    }
}

# Run CLOC
result = CLOC(data, partition, oracle_mode=False)
ccpdag = result["ccpdag"]

Using an oracle

When the true DAG structure is known (for benchmarking/evaluation):

# dag_adj: adjacency matrix encoding the true DAG
result = CLOC(dag_adj, partition, oracle_mode=True)
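
In oracle mode, conditional independence queries are presumably answered exactly by d-separation on the supplied true DAG rather than by statistical tests on data, which is the standard setup for benchmarking constraint-based methods. The snippet below is a minimal sketch of such an oracle using networkx; it illustrates the kind of query oracle mode answers and is not the repository's implementation:

import networkx as nx

# Minimal d-separation oracle sketch (illustrative only). With the true
# DAG A -> B -> C, the query "is A independent of C given B?" is answered
# exactly, with no statistical test. Note: nx.d_separated was renamed
# nx.is_d_separator in NetworkX 3.3+.
true_dag = nx.DiGraph([("A", "B"), ("B", "C")])
print(nx.d_separated(true_dag, {"A"}, {"C"}, {"B"}))  # True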

Repository Structure

CLOC/
├── learning_utils.py       # Core CLOC algorithm implementation
├── graph_gen_utils.py      # Utilities for generating cluster DAGs and data
├── graph_utils.py          # Graph manipulation and metrics
├── simulations.py          # Experimental evaluation framework
└── README.md              # This file

File Descriptions

learning_utils.py

  • CLOC(): Main algorithm implementation
  • get_full_ccpdag_from_dag(): PC-then-cluster baseline approach
  • Conditional independence testing at cluster level
  • Orientation rules for CCPDAGs

graph_gen_utils.py

  • generate_cdag_structure(): Generate random cluster DAGs
  • generate_dag_compat_cdag(): Generate a random DAG compatible with the cluster DAG
  • simulate_gaussian_sem(): Generate data from linear Gaussian SEMs

graph_utils.py

  • Graph utility functions (adjacency, parents, children)
  • cpdag_shd(): Structural Hamming Distance (SHD) calculation (see the sketch after this list)
  • DAG validation and CPDAG conversion
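
One common SHD convention counts each pair of nodes on which two graphs disagree (a missing, extra, or differently oriented edge) as a single error. The sketch below illustrates that convention using the edge encoding described under Technical Details; it is not the repository's cpdag_shd:

import numpy as np

# Illustrative SHD sketch (not the repository's cpdag_shd): count node
# pairs where the two adjacency matrices disagree on edge presence or
# orientation, using the encoding described under Technical Details.
def shd(adj_a, adj_b):
    diff = 0
    p = adj_a.shape[0]
    for i in range(p):
        for j in range(i + 1, p):
            if (adj_a[i, j], adj_a[j, i]) != (adj_b[i, j], adj_b[j, i]):
                diff += 1
    return diff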

simulations.py

  • Comprehensive experimental framework
  • Compares CLOC vs. PC-then-cluster approach
  • Measures performance metrics (SHD, runtime, CI test count)

Experiments

Run the full simulation study:

from simulations import main

# Runs experiments across multiple graph configurations and sample sizes
main()

The simulation framework evaluates:

  • Structural Hamming Distance (SHD): Accuracy of learned structure
  • Runtime: Computational efficiency
  • CI Test Count: Number of independence tests required
  • Performance across varying cluster counts and sample sizes

Comparison with Baselines

CLOC is compared against:

  • PC-then-Cluster: Learn the full variable-level CPDAG with the PC algorithm, then project it to clusters (see the sketch after this list)
  • Naive Complete Graph: Assume all clusters are connected
  • Oracle Methods: Upper bound on performance with known structure
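
Conceptually, the cluster projection used by PC-then-Cluster marks two clusters as adjacent whenever any pair of their member variables is adjacent in the variable-level graph. The sketch below illustrates the adjacency part of that projection; it is not the repository's get_full_ccpdag_from_dag (which also handles edge orientations), and var_index, a variable-name-to-column mapping, is a hypothetical argument:

import numpy as np

# Illustrative cluster projection sketch (adjacency only): two clusters
# are adjacent iff any pair of their member variables is adjacent.
# var_index is a hypothetical dict mapping variable names to columns.
def project_to_clusters(var_adj, partition, var_index):
    names = partition["names"]
    k = len(names)
    cluster_adj = np.zeros((k, k), dtype=int)
    for a in range(k):
        for b in range(a + 1, k):
            rows = [var_index[v] for v in partition["clusters"][names[a]]]
            cols = [var_index[v] for v in partition["clusters"][names[b]]]
            if any(var_adj[i, j] != 0 or var_adj[j, i] != 0
                   for i in rows for j in cols):
                cluster_adj[a, b] = cluster_adj[b, a] = 1
    return cluster_adj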

Technical Details

Graph Encoding

  • Directed Edge i → j: adj[i,j] = -1, adj[j,i] = 1
  • Undirected Edge i — j: adj[i,j] = -1, adj[j,i] = -1
  • No Edge: adj[i,j] = 0, adj[j,i] = 0
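
For example, a three-node graph with a directed edge A → B and an undirected edge between B and C is encoded as:

import numpy as np

# Nodes ordered A, B, C. The directed edge A -> B gives adj[0,1] = -1,
# adj[1,0] = 1; the undirected edge B -- C gives adj[1,2] = adj[2,1] = -1.
adj = np.array([
    [ 0, -1,  0],
    [ 1,  0, -1],
    [ 0, -1,  0],
])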

Conditional Independence Testing

CLOC uses cluster-level conditional independence tests based on:

  • Fisher's Z-test for Gaussian data (sketched below)
  • Kernel-based tests (FCIT, KCI), supported via the hyppo package
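
For Gaussian data, the Fisher Z test converts a sample partial correlation into an approximately normal statistic. A minimal sketch of such a test over a NumPy data matrix (illustrative; not the repository's implementation):

import numpy as np
from scipy import stats

# Illustrative Fisher Z test sketch: tests X ⊥ Y | Z for jointly Gaussian
# data via the partial correlation from the inverse correlation matrix.
# data is an (n x p) NumPy array; x and y are column indices; z is a
# list of column indices.
def fisher_z_test(data, x, y, z):
    corr = np.corrcoef(data[:, [x, y] + list(z)], rowvar=False)
    prec = np.linalg.inv(corr)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    n = data.shape[0]
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(z) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z_stat)))         # two-sided p-value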

Requirements

  • Python 3.8+
  • NumPy >= 1.20
  • pandas >= 1.3
  • causallearn >= 0.1.3
  • networkx >= 2.6
  • scipy >= 1.7
  • hyppo >= 0.3

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work builds upon:

  • causal-learn: Python library for causal discovery
  • DoWhy: Python library for causal inference
  • sempler: Tools for structural equation model generation
  • The PC algorithm and its extensions for constraint-based causal discovery

Related Work

  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
  • Zhang, J. (2008). On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17), 1873-1896.
  • Claassen, T., & Heskes, T. (2012). A Bayesian approach to constraint based causal inference. UAI 2012.
