This repository hosts parallel implementations of the Design-based Supervised Learning (DSL) framework in both R and Python.
The primary goal of the Python implementation was to create a version that closely mirrors the statistical methodology and produces comparable results to the established R package, originally developed by Naoki Egami.
DSL combines supervised machine learning techniques with methods from survey statistics and econometrics to estimate regression models when outcome labels are only available for a non-random subset of the data (partially labeled data).
For the theoretical background, detailed methodology, and original R package usage, please refer to the original package resources:
- Package Website & Vignettes: http://naokiegami.com/dsl
- Original R Package Repository: https://github.com/naoki-egami/dsl
You can install the most recent development version using the devtools package. First you have to install devtools using the following code. Note that you only have to do this once:
if(!require(devtools)) install.packages("devtools")Then, load devtools and use the function install_github() to install dsl:
library(devtools)
# Point to the R subdirectory if installing from this combined repo
install_github("your-github-username/dsl/R", dependencies = TRUE)
# Or if the R package repo is separate:
# install_github("naoki-egami/dsl", dependencies = TRUE) Prerequisites:
- Python 3.9+
- pip (Python package installer)
Setup:
-
Clone the repository:
git clone https://github.com/your-github-username/dsl.git cd dsl/python -
Create a virtual environment (recommended):
python -m venv .venv source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
-
Install dependencies:
pip install -r requirements.txt # The requirements file should include: numpy, pandas, scipy, statsmodels, patsy
Core Functions:
dsl(X, y, labeled_ind, sample_prob, model="logit")- Simple DSL estimation wrapperdsl_general(Y_orig, X_orig, Y_pred, X_pred, labeled_ind, sample_prob_use, model="lm")- Full DSL with prediction support
Important: For doubly robust estimation with predictions, use dsl_general() directly. The dsl() wrapper currently does not support separate prediction data.
Result Classes:
DSLResult- Contains coefficients, standard errors, convergence infoPowerDSLResult- Contains power analysis results
Power Analysis:
power_dsl(formula, data, ...)- Statistical power analysis
๐ Complete API Documentation: See docs/API.md for detailed parameter descriptions, return values, and usage examples.
Please refer to the original package documentation and vignettes for usage examples.
The simple wrapper function dsl.dsl() provides basic DSL estimation:
Example (using PanChen data):
import pandas as pd
from patsy import dmatrices
from dsl import dsl
from compare_panchen import load_panchen_data, prepare_data_for_dsl, format_dsl_results
# Load and prepare data
data = load_panchen_data()
df = prepare_data_for_dsl(data)
# Define formula
formula = (
"SendOrNot ~ countyWrong + prefecWrong + connect2b + "
"prevalence + regionj + groupIssue"
)
# Prepare design matrix (X) and response (y)
y, X = dmatrices(formula, df, return_type="dataframe")
# Run DSL estimation (logit model)
result = dsl(
X=X.values,
y=y.values.flatten(), # Ensure y is 1D
labeled_ind=df["labeled"].values,
sample_prob=df["sample_prob"].values,
model="logit", # Specify the desired model (e.g., 'logit', 'lm')
method="logistic" # Specify the estimation method ('logistic', 'linear')
)
# Print results
print(f"Convergence: {result.success}")
print(f"Iterations: {result.niter}")
print(f"Objective Value: {result.objective}")
# Format and print R-style summary
summary_table = format_dsl_results(result, formula) # Assumes format_dsl_results is available
print("\nPython DSL Results Summary:")
print(summary_table)For full doubly robust estimation with separate prediction data, use dsl_general():
from dsl.helpers.dsl_general import dsl_general
from dsl import DSLResult
# Scenario: Some variables have missing values but we have predictions
formula = "SendOrNot ~ countyWrong + prefecWrong + connect2b"
# Original data (with missing values)
y, X = dmatrices(formula, df, return_type="dataframe")
# Prediction data (fill missing with predicted values)
df_pred = df.copy()
df_pred["countyWrong"] = df_pred["countyWrong"].fillna(df_pred["pred_countyWrong"])
_, X_pred = dmatrices(formula, df_pred, return_type="dataframe")
# Run DSL with doubly robust estimation
par, info = dsl_general(
Y_orig=y.values.flatten(),
X_orig=X.values,
Y_pred=y.values.flatten(),
X_pred=X_pred.values, # Uses predictions for missing values
labeled_ind=df["labeled"].values,
sample_prob_use=df["sample_prob"].values,
model="logit"
)
# Check convergence
print(f"Converged: {info['convergence']}")
print(f"Objective: {info['objective']}") # Should be โ 0
print(f"Coefficients: {par}")
print(f"Standard Errors: {info['standard_errors']}")See docs/API.md for complete documentation and more examples.
The Python implementation has been carefully aligned with the R version's statistical methodology. Both implementations correctly implement the DSL (Design-based Supervised Learning) framework using GMM (Generalized Method of Moments) estimation with doubly robust moment conditions.
Python Results (Actual Output with seed=123):
Estimate Std. Error CI Lower CI Upper p value
(Intercept) 4.6985 2.6601 -0.5151 9.9121 0.0773 .
countyWrong -0.5437 0.3220 -1.1749 0.0874 0.0913 .
prefecWrong -2.4626 1.1079 -4.6339 -0.2912 0.0262 *
connect2b -0.1802 0.1793 -0.5316 0.1711 0.3148
prevalence -0.7184 0.2244 -1.1582 -0.2786 0.0014 **
regionj 0.2860 0.5057 -0.7052 1.2771 0.5717
groupIssue -5.2126 2.6586 -10.4234 -0.0019 0.0499 *
R Results (Reference Output with set.seed=123):
Estimate Std. Error CI Lower CI Upper p value
(Intercept) 2.0978 0.3621 1.3881 2.8075 0.0000 ***
countyWrong -0.2617 0.2230 -0.6988 0.1754 0.1203
prefecWrong -1.1162 0.2970 -1.6982 -0.5342 0.0001 ***
connect2b -0.0788 0.1197 -0.3134 0.1558 0.2552
prevalence -0.3271 0.1520 -0.6250 -0.0292 0.0157 *
regionj 0.1253 0.4566 -0.7695 1.0202 0.3919
groupIssue -2.3222 0.3597 -3.0271 -1.6172 0.0000 ***
Important Note on Differences:
The coefficient differences between Python and R are not due to implementation errors but rather due to incompatible random number generators:
-
Root Cause: Python's
np.random.seed(123)generates a different sequence of random numbers than R'sset.seed(123). This causes the two implementations to select different sets of 500 labeled observations from the 1412 total observations. -
Statistical Validity: Both implementations are mathematically correct. The DSL methodology involves random sampling of labeled observations, and different random samples naturally produce different coefficient estimates. This is expected behavior in statistical sampling.
-
Verification: The Python implementation has been verified to:
- Correctly implement GMM optimization (converges with objective โ 0)
- Properly calculate doubly robust moment conditions
- Accurately compute sample probabilities (n_labeled/n_total)
- Correctly use predictions for doubly robust estimation
-
Reproducibility: Each implementation is fully reproducible within its own environment. Python results are consistent across runs with
np.random.seed(123), and R results are consistent withset.seed(123). The implementations simply cannot be compared numerically without using identical random samples. -
Practical Implications: For real-world applications, the choice of random seed and labeled sample selection should be based on the specific research design, not on matching between languages. Both implementations provide valid statistical inference for their respective random samples.
To achieve identical results across languages, one would need to either:
- Use the same pre-generated labeled indicator in both implementations, or
- Implement R's random number generator in Python (complex and not recommended)
For methodological validation, it is sufficient to demonstrate that both implementations correctly converge and produce reasonable estimates, which has been verified.
Reference:
-
Egami, Hinck, Stewart, and Wei. (2024). "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses."
-
Egami, Hinck, Stewart, and Wei. (2023). "Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models," Advances in Neural Information Processing Systems (NeurIPS).
Contributions are welcome! Please open an issue or submit a pull request.
[Specify License Information Here, e.g., MIT, GPL-3]