This repository contains a high-performance, object-oriented MATLAB replication of the Factorial Difference-in-Differences (FDID) model, based strictly on the theoretical framework presented in:
Xu, Zhao, and Ding (2024). Factorial Difference-in-Differences. arXiv:2407.11937v2
The FDID framework provides a robust identification strategy for panel data where:
- Universal Exposure: An event (the "Treatment") affects all units simultaneously (no clean control group).
- Baseline Modulator: A baseline factor ($G$) modulates the impact of the event across different units.
This implementation accurately distinguishes between Effect Modification (what canonical DID recovers) and Causal Moderation (the true causal interaction of the baseline factor, identifiable under the Factorial Parallel Trends assumption).
The replication is divided into four main files reflecting a professional econometric package:
- `MonteCarloSim.m`: A rigorous Data Generating Process (DGP) simulator for FDID panel structures.
  - Generates simulated datasets controlling effect modification ($\tau_{em}$) vs. causal moderation ($\tau_{inter}$).
  - Injects canonical and factorial parallel trends violations, heteroskedasticity, and AR(1) serial correlation.
- `FdidEstimator.m`: A computationally efficient TWFE-based estimator.
  - Uses the within transformation (absorbing unit and time fixed effects) for performance on large-$N$ panels, avoiding large sparse `dummyvar` matrices.
  - Uses `pinv` solvers to handle TWFE collinearity.
  - By default, calculates unit-level cluster-robust standard errors (CRSE) with finite-sample corrections.
- `run_monte_carlo.m`: Evaluation script demonstrating the estimator's unbiasedness, RMSE, and properties under valid vs. invalid factorial assumptions across 200 replications.
- `run_empirical_fdid.m`: Empirical application mock-up. Demonstrates the usage syntax on a proxy dataset modeling the "Clans and Calamity" Great Famine study.
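The within transformation, `pinv` solve, and unit-level clustering mentioned for `FdidEstimator.m` can be illustrated with a minimal sketch on a balanced toy panel. This is not the repository's code: every variable name, the toy DGP, and the `N / (N - 1)` small-sample factor are assumptions made for illustration.

```matlab
% Illustrative sketch only; none of these names come from FdidEstimator.m.
rng(0);                                  % reproducible toy data
N = 500; T = 4; eventTime = 3;           % units, periods, first exposed period
id   = repelem((1:N)', T);               % unit index for each row (NT-by-1)
time = repmat((1:T)', N, 1);             % period index for each row (NT-by-1)
g    = double(rand(N, 1) > 0.5);         % baseline factor G_i, fixed within unit
z    = double(time >= eventTime);        % exposure indicator Z_t
gz   = g(id) .* z;                       % interaction regressor G_i * Z_t
mu   = randn(N, 1);                      % unit fixed effects
lam  = (1:T)';                           % time fixed effects
y    = mu(id) + lam(time) + 2.0 * gz + 0.1 * randn(N * T, 1);

% Two-way within transformation (exact for a balanced panel):
% subtract unit means and period means, then add back the grand mean.
yUnit  = accumarray(id,   y,  [N 1], @mean);
yTime  = accumarray(time, y,  [T 1], @mean);
gzUnit = accumarray(id,   gz, [N 1], @mean);
gzTime = accumarray(time, gz, [T 1], @mean);
yW  = y  - yUnit(id)  - yTime(time)  + mean(y);
gzW = gz - gzUnit(id) - gzTime(time) + mean(gz);

% Minimum-norm least squares via the pseudoinverse.
betaGZ = pinv(gzW) * yW;

% Cluster-robust variance with units as clusters (sandwich form).
resid = yW - gzW * betaGZ;
score = accumarray(id, gzW .* resid, [N 1]);   % per-unit sum of x * e
bread = 1 / (gzW' * gzW);
adj   = N / (N - 1);                           % assumed small-sample factor
seGZ  = sqrt(adj * bread * sum(score .^ 2) * bread);

fprintf('betaGZ = %.3f (cluster-robust SE %.4f)\n', betaGZ, seGZ);
```

Because the fixed effects are removed by subtraction, no $N$-column dummy matrix is ever built. On an unbalanced panel a single demeaning pass is no longer exact, so an iterative scheme would be needed instead.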
The FDID setup relies on a panel data structure where an event occurs at time
The potential outcomes are modeled as:
The Two-Way Fixed Effects (TWFE) regression estimated by FdidEstimator is:
Where:
- $\mu_i$ and $\lambda_t$ are unit and time fixed effects.
- $\beta_{GZ}$ captures the Causal Moderation ($\tau_{inter}$) if the Factorial Parallel Trends assumption holds and covariates are centered.
- $\beta_{GZ}$ captures only the Effect Modification ($\tau_{em}$) if only canonical Parallel Trends holds.
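A specification consistent with the terms defined above can be sketched as follows. The covariate terms, the coefficient names $\gamma$ and $\delta$, and the error term $\varepsilon_{it}$ are assumptions for illustration; the exact regressor set used by `FdidEstimator` may differ.

$$
Y_{it} = \mu_i + \lambda_t + \beta_{GZ}\,(G_i \times Z_t) + \gamma^\top (X_i - \bar{X})\,Z_t + \delta^\top (X_i - \bar{X})\,(G_i \times Z_t) + \varepsilon_{it}
$$

In this form the main effects of $G_i$, $Z_t$, and the time-invariant $X_i$ do not appear separately because they are absorbed by $\mu_i$ and $\lambda_t$.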
The MonteCarloSim class provides an interface to generate synthetic FDID data.
```matlab
% Initialize simulator
sim = MonteCarloSim();
sim.NumUnits = 1000;
sim.NumPeriods = 4;
sim.EventTime = 3;

% Configure parameters
sim.TauInter = 2.0;  % True causal moderation
sim.TauEm = 2.0;     % Equal to TauInter here; set a different value to violate Factorial PT
sim.HasHeteroskedasticity = true;

% Generate data
[data, trueParams] = sim.generate();
```

The `FdidEstimator` fits the TWFE model on the panel data, using the within transformation to absorb fixed effects efficiently.
```matlab
% Initialize the estimator
% Arguments: (data, idVar, timeVar, outcomeVar, baseFactorVar, exposureVar, covariatesList)
estimator = FdidEstimator(data, "id", "time", "y", "g", "z", ["x"]);

% Fit the model (calculates coefficients, robust standard errors, t-stats, p-values)
estimator = estimator.fit();

% Display the formatted results table
estimator.displayResults();

% Access specific coefficients or p-values programmatically
betaGZ = estimator.Coef.GZ;
pValGZ = estimator.PValue.GZ;
```

```matlab
% Set up paths and run the empirical mock script
run_empirical_fdid
```

Expected Output: Formatted regression tables showing the Causal Interaction Estimate (the GZ term).

```matlab
% Run the Monte Carlo test (Note: takes a few seconds to process 200 iterations and draw KDE charts)
run_monte_carlo
```

When applying the FDID framework, researchers must exercise extreme caution regarding two core theoretical caveats identified in the literature:
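The TWFE regression (the GZ coefficient) converges to the Effect Modification ($\tau_{em}$). It can be read as the Causal Moderation ($\tau_{inter}$) only when the Factorial Parallel Trends assumption holds.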
- What this means: Under any hypothetical state of the world (e.g., if the event never occurred), the trajectory of outcomes for the $G=1$ group must precisely match that of the $G=0$ group.
- Failure Consequence: If high-$G$ units and low-$G$ units have differing natural trajectories (due to historical or geographic advantages), the estimate contains severe selection bias, rendering the causal interpretation invalid.
To rescue the Factorial Parallel Trends assumption, researchers often condition on baseline covariates
- Efficiency Loss & Multicollinearity: Every added covariate $X$ requires controlling for $X \times Z$ and ideally the three-way interaction $G \times X \times Z$, so each covariate adds up to two regressors and quickly consumes degrees of freedom. Highly correlated covariates push the Gram matrix $X'X$ towards singularity, inflating standard errors and causing numerical instability.
- Bad Controls (Colliders & Mediators): Never control for "post-treatment" variables (factors that could themselves be affected by the event $Z$). Doing so opens endogenous pathways (collider bias) or absorbs the mechanism of action (over-control bias), biasing both $\tau_{inter}$ and $\tau_{em}$.
- Best Practice: Only include strictly pre-determined, essential confounders that theoretically dictate both the $G$ group assignment and the underlying time trend. Furthermore, `FdidEstimator` automatically centers ($X_i - \bar{X}$) all provided covariates so that the main $\beta_{GZ}$ term anchors to the sample average causal interaction.
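The role of centering can be seen in a short derivation. It assumes a linear specification in which the three-way term enters as $\delta^\top (X_i - \bar{X})(G_i \times Z_t)$; the symbol $\delta$ is introduced here for illustration only. Under that assumption, the $G \times Z$ slope for a unit with covariate value $X_i$ is

$$
\beta_{GZ} + \delta^\top (X_i - \bar{X}),
$$

and averaging over the sample gives

$$
\frac{1}{N}\sum_{i=1}^{N}\left[\beta_{GZ} + \delta^\top (X_i - \bar{X})\right] = \beta_{GZ},
$$

because $\sum_{i}(X_i - \bar{X}) = 0$. Without centering, the same coefficient would instead describe the interaction at $X_i = 0$, which may lie outside the data.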
- Environment: MATLAB R2017a or higher (uses `string` arrays).
- Dependencies: Uses basic Statistics and Machine Learning Toolbox functions (`normrnd`, `ksdensity`, etc.).