|
3 | 3 | This repo contains a prototype implementation **DoubleML-Serverless** of distributed double machine learning with a serverless infrastructure |
4 | 4 | using [AWS Lambda](https://aws.amazon.com/lambda). |
5 | 5 | A detailed discussion of this prototype can be found in the paper "Distributed Double Machine Learning with a Serverless Architecture" (Kurz, 2021). |
6 | | -**DoubleML-Serverless** is an extension for serverless cloud computing of the Python package **DoubleML**. |
7 | | -**DoubleML** is available via PyPI [https://pypi.org/project/DoubleML](https://pypi.org/project/DoubleML) and on GitHub [https://github.com/DoubleML/doubleml-for-py](https://github.com/DoubleML/doubleml-for-py). |
8 | | -Also see [https://docs.doubleml.org](https://docs.doubleml.org) for a detailed documentation and user guide for the **DoubleML** package. |
| 6 | +DoubleML-Serverless is an extension for serverless cloud computing of the Python package **DoubleML**. |
| 7 | +DoubleML is available via PyPI [https://pypi.org/project/DoubleML](https://pypi.org/project/DoubleML) and on GitHub [https://github.com/DoubleML/doubleml-for-py](https://github.com/DoubleML/doubleml-for-py). |
| 8 | +Also see [https://docs.doubleml.org](https://docs.doubleml.org) for detailed documentation and a user guide for the DoubleML package. |
9 | 9 |
|
10 | 10 | ## Getting started |
11 | 11 |
|
@@ -47,19 +47,74 @@ There are two options for deployment: |
47 | 47 |
|
48 | 48 | 2. The second option for deployment is based on AWS Serverless Application Model (AWS SAM). |
49 | 49 |
|
50 | | -2.1 Setup the AWS SAM CLI as described here: [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html) |
| 50 | + 2.1 Set up the AWS SAM CLI as described here: [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html) |
51 | 51 |
|
52 | | -2.2 To deploy the application use the following commands (for more information see [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)) |
| 52 | + 2.2 To deploy the application, use the following commands (for more information, see [https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)): |
| 53 | + ``` |
| 54 | + cd aws_lambda_app |
| 55 | + sam build |
| 56 | + sam deploy --guided |
| 57 | + ``` |
53 | 58 |
|
| 59 | +### Estimating a partially linear regression model with double machine learning and serverless scaling using AWS Lambda |
| 60 | +
|
| 61 | +To demonstrate the functionality of DoubleML-Serverless, we revisit the Pennsylvania Reemployment Bonus experiment |
| 62 | +and estimate the effect of providing a cash bonus on unemployment duration, as studied in Chernozhukov et al. (2018). |
| 63 | +This example is also discussed in the paper accompanying the DoubleML-Serverless package (Kurz, 2021). |
| 64 | +
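For reference, the partially linear regression (PLR) model can be sketched as follows. The notation is taken from Chernozhukov et al. (2018), not from this README: $Y$ is the outcome (here `inuidur1`), $D$ the treatment (here `tg`), $X$ the control variables, $\theta_0$ the parameter of interest, and $g_0$ and $m_0$ are the nuisance functions that the machine learning methods approximate (either directly or via conditional expectations derived from them).

```latex
\begin{aligned}
Y &= D\,\theta_0 + g_0(X) + U, & \mathbb{E}[U \mid X, D] &= 0, \\
D &= m_0(X) + V,               & \mathbb{E}[V \mid X] &= 0.
\end{aligned}
```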
|
| 65 | +We first load the data using functionality from the DoubleML package. |
| 66 | +```python |
| 67 | +from doubleml.datasets import fetch_bonus |
| 68 | +df_bonus = fetch_bonus('DataFrame') |
54 | 69 | ``` |
55 | | -cd aws_lambda_app |
56 | | -sam build |
57 | | -sam deploy --guided |
| 70 | + |
| 71 | +The class `DoubleMLDataS3` serves as the data backend for the DoubleML-Serverless model classes. |
| 72 | +It inherits from the DoubleML class `DoubleMLData`. |
| 73 | +We initialize a `DoubleMLDataS3` object for the bonus data and upload it to the S3 bucket `doubleml-serverless-data`, which is used for the data transfer to AWS Lambda. |
| 74 | +```python |
| 75 | +from doubleml_serverless import DoubleMLDataS3 |
| 76 | + |
| 77 | +dml_data_bonus = DoubleMLDataS3( |
| 78 | + 'doubleml-serverless-data', 'bonus_data.csv', |
| 79 | + df_bonus, |
| 80 | + y_col='inuidur1', |
| 81 | + d_cols='tg', |
| 82 | + x_cols=['female', 'black', 'othrace', |
| 83 | + 'dep1', 'dep2', 'q2', 'q3', |
| 84 | + 'q4', 'q5', 'q6', 'agelt35', |
| 85 | + 'agegt54', 'durable', 'lusd', 'husd']) |
| 86 | +dml_data_bonus.store_and_upload_to_s3() |
58 | 87 | ``` |
59 | 88 |
|
60 | | -### Estimating a partially linear regression model with double machine learning and serverless scaling using AWS Lambda |
| 89 | +To estimate the nuisance functions, we use a random forest regressor that averages over 500 trees. |
| 90 | +We further apply repeated cross-fitting with 5 folds and 100 repetitions (sample splits). |
| 91 | +```python |
| 92 | +from doubleml_serverless import DoubleMLPLRServerless |
| 93 | +from sklearn.base import clone |
| 94 | +from sklearn.ensemble import RandomForestRegressor |
| 95 | + |
| 96 | +ml = RandomForestRegressor(n_estimators=500) |
| 97 | +ml_g = clone(ml) |
| 98 | +ml_m = clone(ml) |
| 99 | +dml_lambda_plr_bonus = DoubleMLPLRServerless( |
| 100 | + 'LambdaCVPredict', 'eu-central-1', |
| 101 | + dml_data_bonus, ml_g, ml_m, |
| 102 | + n_folds=5, n_rep=100) |
| 103 | +``` |
61 | 104 |
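To give a feeling for the workload implied by repeated cross-fitting, the following minimal sketch generates the sample splits in plain Python and counts the resulting model fits. It is an illustration only and not the splitting code used by DoubleML or DoubleML-Serverless: the helper `repeated_kfold_splits` and the sample size `n_obs=1000` are hypothetical, and the count assumes that each of the two learners (`ml_g` and `ml_m`) is fitted once per fold and repetition.

```python
import random


def repeated_kfold_splits(n_obs, n_folds, n_rep, seed=0):
    """Return n_rep random partitions of range(n_obs) into n_folds test folds.

    Hypothetical helper for illustration; not part of DoubleML-Serverless.
    """
    rng = random.Random(seed)
    all_splits = []
    for _ in range(n_rep):
        indices = list(range(n_obs))
        rng.shuffle(indices)
        # Fold k takes every n_folds-th shuffled index,
        # so fold sizes differ by at most one observation.
        folds = [sorted(indices[k::n_folds]) for k in range(n_folds)]
        all_splits.append(folds)
    return all_splits


n_folds, n_rep, n_learners = 5, 100, 2  # two learners: ml_g and ml_m
splits = repeated_kfold_splits(n_obs=1000, n_folds=n_folds, n_rep=n_rep)

# Assumption: every learner is fitted once per fold and per repetition.
n_fits = n_learners * n_folds * n_rep
print(n_fits)  # 1000
```

Under these assumptions the example above amounts to 1000 random forest fits, all of which are independent of each other, which is what makes the task a natural candidate for parallel execution.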
|
| 105 | +To estimate the model locally, we can call `dml_lambda_plr_bonus.fit()`. |
| 106 | +Estimation on AWS Lambda is achieved via `dml_lambda_plr_bonus.fit_aws_lambda()`. |
| 107 | +Note that you will be charged for all resources used in the AWS account to which you deployed the serverless application. |
| 108 | +```python |
| 109 | +dml_lambda_plr_bonus.fit_aws_lambda() |
| 110 | +``` |
62 | 111 |
|
| 112 | +A summary of the estimation result is available via the property `dml_lambda_plr_bonus.summary`. |
| 113 | +Some metrics about the estimation on AWS Lambda can be obtained via the property `dml_lambda_plr_bonus.aws_lambda_metrics`. |
63 | 114 |
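With repeated cross-fitting, every repetition yields its own point estimate and standard error, which then have to be combined into a single result. One option is the median method proposed in Chernozhukov et al. (2018), sketched below. This is an assumption about how such an aggregation can be done, not a description of what `summary` computes internally; the function name `aggregate_median` and the input numbers are hypothetical and are not results from the bonus data.

```python
import math
from statistics import median


def aggregate_median(thetas, ses):
    """Combine per-repetition estimates and standard errors via the median method.

    Hypothetical helper for illustration; not part of DoubleML-Serverless.
    """
    theta = median(thetas)
    # Each variance is inflated by the squared deviation of that repetition's
    # estimate from the median estimate, which accounts for the variability
    # introduced by the random sample splits.
    var = median(se ** 2 + (t - theta) ** 2 for t, se in zip(thetas, ses))
    return theta, math.sqrt(var)


# Made-up values for three repetitions, for illustration only.
theta_hat, se_hat = aggregate_median(
    thetas=[-0.08, -0.07, -0.09],
    ses=[0.03, 0.04, 0.03],
)
print(theta_hat, se_hat)
```

The median is robust to single repetitions with unusual sample splits, which is one reason to prefer it over a simple mean across repetitions.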
|
64 | 115 | ## References |
| 116 | + |
| 117 | +Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), |
| 118 | +Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68. doi:[10.1111/ectj.12097](https://doi.org/10.1111/ectj.12097). |
| 119 | + |
65 | 120 | Kurz, M.S. 2020. "Distributed Double Machine Learning with a Serverless Architecture". Unpublished Working Paper. |