Skip to content

Enan456/dsl-python

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

43 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

DSL: Design-based Supervised Learning (Python & R)

Repository Overview

This repository hosts parallel implementations of the Design-based Supervised Learning (DSL) framework in both R and Python.

The primary goal of the Python implementation was to create a version that closely mirrors the statistical methodology and produces comparable results to the established R package, originally developed by Naoki Egami.

DSL combines supervised machine learning techniques with methods from survey statistics and econometrics to estimate regression models when outcome labels are only available for a non-random subset of the data (partially labeled data).

Original R Package Documentation

For the theoretical background, detailed methodology, and original R package usage, please refer to the original package resources:

Installation

R Version

You can install the most recent development version using the devtools package. First you have to install devtools using the following code. Note that you only have to do this once:

if(!require(devtools)) install.packages("devtools")

Then, load devtools and use the function install_github() to install dsl:

library(devtools)
# Point to the R subdirectory if installing from this combined repo
install_github("your-github-username/dsl/R", dependencies = TRUE) 
# Or if the R package repo is separate:
# install_github("naoki-egami/dsl", dependencies = TRUE) 

Python Version

Prerequisites:

  • Python 3.9+
  • pip (Python package installer)

Setup:

  1. Clone the repository:

    git clone https://github.com/your-github-username/dsl.git
    cd dsl/python 
  2. Create a virtual environment (recommended):

    python -m venv .venv
    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt 
    # The requirements file should include: numpy, pandas, scipy, statsmodels, patsy

API Reference

Quick Reference

Core Functions:

  • dsl(X, y, labeled_ind, sample_prob, model="logit") - Simple DSL estimation wrapper
  • dsl_general(Y_orig, X_orig, Y_pred, X_pred, labeled_ind, sample_prob_use, model="lm") - Full DSL with prediction support

Important: For doubly robust estimation with predictions, use dsl_general() directly. The dsl() wrapper currently does not support separate prediction data.

Result Classes:

  • DSLResult - Contains coefficients, standard errors, convergence info
  • PowerDSLResult - Contains power analysis results

Power Analysis:

  • power_dsl(formula, data, ...) - Statistical power analysis

๐Ÿ“– Complete API Documentation: See docs/API.md for detailed parameter descriptions, return values, and usage examples.

Usage

R Version

Please refer to the original package documentation and vignettes for usage examples.

Python Version

Basic Example

The simple wrapper function dsl.dsl() provides basic DSL estimation:

Example (using PanChen data):

import pandas as pd
from patsy import dmatrices
from dsl import dsl
from compare_panchen import load_panchen_data, prepare_data_for_dsl, format_dsl_results

# Load and prepare data
data = load_panchen_data() 
df = prepare_data_for_dsl(data)

# Define formula
formula = (
    "SendOrNot ~ countyWrong + prefecWrong + connect2b + "
    "prevalence + regionj + groupIssue"
)

# Prepare design matrix (X) and response (y)
y, X = dmatrices(formula, df, return_type="dataframe")

# Run DSL estimation (logit model)
result = dsl(
    X=X.values,
    y=y.values.flatten(), # Ensure y is 1D
    labeled_ind=df["labeled"].values,
    sample_prob=df["sample_prob"].values,
    model="logit", # Specify the desired model (e.g., 'logit', 'lm')
    method="logistic" # Specify the estimation method ('logistic', 'linear')
)

# Print results
print(f"Convergence: {result.success}")
print(f"Iterations: {result.niter}")
print(f"Objective Value: {result.objective}")

# Format and print R-style summary
summary_table = format_dsl_results(result, formula) # Assumes format_dsl_results is available
print("\nPython DSL Results Summary:")
print(summary_table)

Advanced Example: With Predictions (Doubly Robust)

For full doubly robust estimation with separate prediction data, use dsl_general():

from dsl.helpers.dsl_general import dsl_general
from dsl import DSLResult

# Scenario: Some variables have missing values but we have predictions
formula = "SendOrNot ~ countyWrong + prefecWrong + connect2b"

# Original data (with missing values)
y, X = dmatrices(formula, df, return_type="dataframe")

# Prediction data (fill missing with predicted values)
df_pred = df.copy()
df_pred["countyWrong"] = df_pred["countyWrong"].fillna(df_pred["pred_countyWrong"])
_, X_pred = dmatrices(formula, df_pred, return_type="dataframe")

# Run DSL with doubly robust estimation
par, info = dsl_general(
    Y_orig=y.values.flatten(),
    X_orig=X.values,
    Y_pred=y.values.flatten(),
    X_pred=X_pred.values,  # Uses predictions for missing values
    labeled_ind=df["labeled"].values,
    sample_prob_use=df["sample_prob"].values,
    model="logit"
)

# Check convergence
print(f"Converged: {info['convergence']}")
print(f"Objective: {info['objective']}")  # Should be โ‰ˆ 0
print(f"Coefficients: {par}")
print(f"Standard Errors: {info['standard_errors']}")

See docs/API.md for complete documentation and more examples.

R vs. Python Comparison (PanChen Dataset - Logit Model)

The Python implementation has been carefully aligned with the R version's statistical methodology. Both implementations correctly implement the DSL (Design-based Supervised Learning) framework using GMM (Generalized Method of Moments) estimation with doubly robust moment conditions.

Python Results (Actual Output with seed=123):

             Estimate  Std. Error  CI Lower  CI Upper  p value
(Intercept)    4.6985      2.6601   -0.5151    9.9121   0.0773   .
countyWrong   -0.5437      0.3220   -1.1749    0.0874   0.0913   .
prefecWrong   -2.4626      1.1079   -4.6339   -0.2912   0.0262   *
connect2b     -0.1802      0.1793   -0.5316    0.1711   0.3148
prevalence    -0.7184      0.2244   -1.1582   -0.2786   0.0014  **
regionj        0.2860      0.5057   -0.7052    1.2771   0.5717
groupIssue    -5.2126      2.6586  -10.4234   -0.0019   0.0499   *

R Results (Reference Output with set.seed=123):

            Estimate Std. Error CI Lower CI Upper p value
(Intercept)   2.0978     0.3621   1.3881   2.8075  0.0000 ***
countyWrong  -0.2617     0.2230  -0.6988   0.1754  0.1203
prefecWrong  -1.1162     0.2970  -1.6982  -0.5342  0.0001 ***
connect2b    -0.0788     0.1197  -0.3134   0.1558  0.2552
prevalence   -0.3271     0.1520  -0.6250  -0.0292  0.0157   *
regionj       0.1253     0.4566  -0.7695   1.0202  0.3919
groupIssue   -2.3222     0.3597  -3.0271  -1.6172  0.0000 ***

Important Note on Differences:

The coefficient differences between Python and R are not due to implementation errors but rather due to incompatible random number generators:

  • Root Cause: Python's np.random.seed(123) generates a different sequence of random numbers than R's set.seed(123). This causes the two implementations to select different sets of 500 labeled observations from the 1412 total observations.

  • Statistical Validity: Both implementations are mathematically correct. The DSL methodology involves random sampling of labeled observations, and different random samples naturally produce different coefficient estimates. This is expected behavior in statistical sampling.

  • Verification: The Python implementation has been verified to:

    • Correctly implement GMM optimization (converges with objective โ‰ˆ 0)
    • Properly calculate doubly robust moment conditions
    • Accurately compute sample probabilities (n_labeled/n_total)
    • Correctly use predictions for doubly robust estimation
  • Reproducibility: Each implementation is fully reproducible within its own environment. Python results are consistent across runs with np.random.seed(123), and R results are consistent with set.seed(123). The implementations simply cannot be compared numerically without using identical random samples.

  • Practical Implications: For real-world applications, the choice of random seed and labeled sample selection should be based on the specific research design, not on matching between languages. Both implementations provide valid statistical inference for their respective random samples.

To achieve identical results across languages, one would need to either:

  1. Use the same pre-generated labeled indicator in both implementations, or
  2. Implement R's random number generator in Python (complex and not recommended)

For methodological validation, it is sufficient to demonstrate that both implementations correctly converge and produce reasonable estimates, which has been verified.

Reference:

  • Egami, Hinck, Stewart, and Wei. (2024). "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses."

  • Egami, Hinck, Stewart, and Wei. (2023). "Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models," Advances in Neural Information Processing Systems (NeurIPS).

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

[Specify License Information Here, e.g., MIT, GPL-3]

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •