This repository contains the replication code for the simulations and empirical results in Chan, Mátyás, Reguly (2025): Modelling with Sensitive Variables and its online supplement.
The code was run with MATLAB version 2023b on a MacBook with OS version 14.2.1 and an Apple M1 Max chip. The simulations are parallelized; we used 10 workers. Because of randomization, results may change slightly if a different number of workers is used. Parallelization can be switched off by changing the parfor loop to for in codes/@splitsampling/estimate_DOC.m, line 60. For the cases where the outcome variable is discretized, we used StataMP 13 and ran the referenced code on the same laptop. The code for the empirical application is shared; however, the data from the Australian Tax Office (ATO) cannot be made public, so access requires applying to the ATO for the data.
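As a rough illustration of the parfor-to-for switch mentioned above, the sketch below shows a serial Monte Carlo loop. The loop body, the variable names (nRep, est) and the data-generating process are hypothetical stand-ins, not the contents of estimate_DOC.m.

```matlab
% Hypothetical stand-in for a Monte Carlo loop; NOT the code in estimate_DOC.m.
rng(1);                     % one seed fixes the results of the serial loop
nRep = 200;                 % number of replications (assumed value)
n    = 500;                 % sample size per replication (assumed value)
est  = zeros(nRep, 1);      % slope estimate from each replication

for r = 1:nRep              % parallel version would read: parfor r = 1:nRep
    x = randn(n, 1);
    y = 1 + 0.5 * x + randn(n, 1);   % assumed true slope of 0.5
    b = [ones(n, 1), x] \ y;         % OLS via least squares
    est(r) = b(2);
end

fprintf('Mean slope estimate: %.3f\n', mean(est));
```

In a serial loop the random numbers are drawn in a fixed order from a single stream, so the worker count no longer affects the results.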
We deal with econometric models in which the dependent variable, some explanatory variables, or both are observed as censored interval data. This discretization often happens due to confidentiality of 'sensitive' variables like income. Models using these variables cannot point identify regression parameters as the conditional moments are unknown, which led the literature to use interval estimates (see, e.g., Manski and Tamer, 2002). Here, we propose a discretization method through which the regression parameters can be point identified while preserving data confidentiality. We demonstrate the asymptotic properties of the OLS estimator for the parameters in multivariate linear regressions for cross-sectional data. The theoretical findings are supported by Monte Carlo experiments and illustrated with an application to the Australian gender wage gap.
Example: In many cases, income data cannot be shared in its original form. Typically, the shared (or surveyed) data contains income categories (e.g., '10,000-30,000$', '30,000-50,000$' or '50,000$ or more'). The modeler would like to understand customer behavior (elasticities); however, due to discretization, the parameters cannot be point identified in general. Using the proposed split sampling, we create multiple discretization schemes for the sensitive or surveyed income variable. We then use a synthetic variable to calculate the appropriate conditional expectations, which allow the parameter of interest to be point identified.
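The first step of the example, creating several discretization schemes for one income variable, can be sketched as follows. This is only an illustration under assumed values: the interval boundaries, the sample size, the income distribution and all variable names are hypothetical, and the sketch does not use the splitsampling class from this repository or construct the synthetic variable.

```matlab
% Illustrative sketch only; all numbers and names below are hypothetical.
rng(1);
n = 1000;
income = 10000 + 60000 * rand(n, 1);    % assumed incomes between 10,000 and 70,000

% Two discretization schemes with shifted interval boundaries.
edges = {[10000 30000 50000 70000], ...
         [10000 20000 40000 60000 70000]};

% Split the sample: each observation is assigned to exactly one scheme.
scheme = randi(2, n, 1);

lo = zeros(n, 1);                       % lower boundary of the reported interval
hi = zeros(n, 1);                       % upper boundary of the reported interval
for s = 1:2
    idx = (scheme == s);
    e   = edges{s}(:);
    bin = discretize(income(idx), e);   % interval index of each observation
    lo(idx) = e(bin);
    hi(idx) = e(bin + 1);
end

% Only (scheme, lo, hi) would be shared, never income itself.
% Pooling the boundaries of both schemes gives a finer grid than either alone.
pooledEdges = unique([edges{:}]);
disp(pooledEdges);
```

Each observation is still reported only as an interval, but the pooled boundaries are spaced more finely than those of any single scheme.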
To replicate our results, add the folders codes/estimations and codes/@splitsampling to the MATLAB path. The latter is handled automatically by MATLAB.
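A minimal path setup might look like the following. It assumes that MATLAB's current folder is the repository root; the variable name repoRoot is hypothetical. MATLAB finds class folders (names starting with @) through their parent folder, so the sketch adds codes rather than the @splitsampling folder itself.

```matlab
% Assumption: the current folder is the root of this repository.
repoRoot = pwd;

% Estimation routines.
addpath(fullfile(repoRoot, 'codes', 'estimations'));

% Parent folder of @splitsampling; the class folder is then found automatically.
addpath(fullfile(repoRoot, 'codes'));
```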
- codes/simulations/sim_results_RHS.m replicates the Table 1 RHS rows and further evidence for the right-hand-side case in the online supplement.
- codes/simulations/convergence_RHS.m replicates the convergence results from the online supplement for the right-hand side.
- codes/simulations/sim_results_LHS.m replicates the mid-point regressions and shifting for Table 1 and further tables from the online appendix.
- codes/simulations/convergence_LHS.m replicates the convergence results from the online supplement for the left-hand side.
- codes/simulations/LHS_stata/save_results.do replicates the set identification, ordered probit, ordered logit and interval regressions for Table 2 and further results from the online supplement, and saves them to an Excel file. One needs to change the path in the code. To be able to run the Stata script, one needs to
  - add codes/simulations/LHS_stata/simul_overallLHS.do to the path.
  - import the STATA Software for Best Linear Prediction with Interval Outcome Data provided and documented by Arie Beresteanu, Francesca Molinari and Darcy Steeg Morris (2010): "Asymptotics for Partially Identified Models in STATA".
  - add
- codes/simulations/sim_results_both.m replicates Table 1 when discretization happens on both sides, and further evidence from the online supplement.
- codes/simulations/convergence_both.m replicates the convergence results from the online supplement when both sides contain sensitive variables.
- codes/empirical_application/model_results.m replicates Table 2 for the true parameters, the mid-point regression and the shifting method.
- codes/empirical_application/differential_privacy_ATO.m replicates Table 2 for differential privacy, using the 'diffprivlib' package from Python.
- codes/empirical_application/descriptives_stat.m replicates the descriptive table for the ATO analysis from the online supplement.