Commit e20716d

Updated BasicEmpirMethods.md
1 parent bcf4ee1 commit e20716d

7 files changed

+240
-63
lines changed

data/basic_empirics/maketable1.dta

14.1 KB
Binary file not shown.

docs/book/CompMethods_references.bib

Lines changed: 46 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -5,6 +5,30 @@ @BOOK{AddaCooper:2003
55
YEAR = {2003},
66
}
77

8+
@ARTICLE{AcemogluEtAl:2001,
9+
AUTHOR = {Daron Acemoglu and Simon Johnson and James A. Robinson},
10+
TITLE = {The Colonial Origins of Comparative Development: An Empirical Investigation},
11+
JOURNAL = {American Economic Review},
12+
YEAR = {2001},
13+
volume = {91},
14+
number = {5},
15+
pages = {1369-1401},
16+
month = {December},
17+
url = {https://www.aeaweb.org/articles?id=10.1257/aer.91.5.1369},
18+
}
19+
20+
@ARTICLE{AllenEtAl:2023,
21+
AUTHOR = {Robert C. Allen and Mattia C. Bertazzini and Leander Heldring},
22+
TITLE = {The Economic Origins of Government},
23+
JOURNAL = {American Economic Review},
24+
YEAR = {2023},
25+
volume = {113},
26+
number = {10},
27+
pages = {2507-2545},
28+
month = {October},
29+
url = {https://www.aeaweb.org/articles?id=10.1257/aer.20201919},
30+
}
31+
832
@ARTICLE{AttanasioEtAl:2020,
933
AUTHOR = {Orazio Attanasio and Sarah Cattan and Emla Fitzsimons and Costas Meghir and Marta {Rubio-Codina}},
1034
TITLE = {Estimating the Production Function for Human Capital: Results from a Randomized Control Trial in Colombia},
@@ -443,6 +467,28 @@ @ARTICLE{Rust:2010
443467
month = {May}
444468
}
445469

470+
@INCOLLECTION{SargentStachurski:2023a,
471+
AUTHOR = {Thomas J. Sargent and John Stachurski},
472+
TITLE = {Simple Linear Regression},
473+
BOOKTITLE = {QuantEcon Python Lectures: A First Course in Quantitative Economics With Python},
474+
INSTITUTION = {QuantEcon.org},
475+
YEAR = {2023},
476+
chapter = {36},
477+
type = {Open access lectures},
478+
url = {https://intro.quantecon.org/simple_linear_regression.html},
479+
}
480+
481+
@INCOLLECTION{SargentStachurski:2023b,
482+
AUTHOR = {Thomas J. Sargent and John Stachurski},
483+
TITLE = {Linear Regression in Python},
484+
BOOKTITLE = {QuantEcon Python Lectures: Intermediate Quantitative Economics with Python},
485+
INSTITUTION = {QuantEcon.org},
486+
YEAR = {2023},
487+
chapter = {79},
488+
type = {Open access lectures},
489+
url = {https://python.quantecon.org/ols.html},
490+
}
491+
446492
@INCOLLECTION{Smith:2020,
447493
AUTHOR = {Smith, Anthony A. Jr.},
448494
TITLE = {Indirect Inference},

docs/book/_toc.yml

Lines changed: 3 additions & 3 deletions
@@ -19,12 +19,12 @@ parts:
1919
- caption: Git and GitHub
2020
chapters:
2121
- file: git/intro
22-
- caption: Basic Causal Inference
22+
- caption: Basic Empirical Methods
2323
chapters:
24-
- file: caus_inf/BasicEmpirMethods
24+
- file: basic_empirics/BasicEmpirMethods
2525
- caption: Basic Machine Learning
2626
chapters:
27-
- file: basic_ml/intro
27+
- file: basic_ml/ml_intro
2828
- caption: Neural Nets and Deep Learning
2929
chapters:
3030
- file: deep_learn/intro
Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
1+
---
2+
jupytext:
3+
formats: md:myst
4+
text_representation:
5+
extension: .md
6+
format_name: myst
7+
kernelspec:
8+
display_name: Python 3
9+
language: python
10+
name: python3
11+
---
12+
13+
(Chap_BasicEmpirMethods)=
14+
# Basic Empirical Methods
15+
16+
The focus of this chapter is to give the reader a basic introduction to the standard empirical methods in data science, policy analysis, and economics. I want each reader to come away from this chapter with the following basic skills:
17+
18+
* Difference between **correlation** and **causation**
19+
* Standard **data description**
20+
* Basic understanding of **linear regression**
21+
* What do regression **coefficients** mean?
22+
* What do **standard errors** mean?
23+
* How can I estimate my own linear regression with standard errors?
24+
* Basic extensions: cross terms, quadratic terms, difference-in-difference
25+
* Ideas behind bigger extensions of linear regression
26+
* Instrumental variables (omitted variable bias)
27+
* Logistic regression
28+
* Multiple equation models
29+
* Panel data
30+
* Time series data
31+
* Vector autoregression
32+
33+
34+
In the next chapter {ref}`Chap_BasicMLintro`, I give a more detailed treatment of logistic regression as a bridge to learning the basics of machine learning.
35+
36+
Some other good resources on the topic of learning the basics of linear regression in Python include the [QuantEcon.org](https://quantecon.org/) lectures "[Simple Linear Regression Model](https://intro.quantecon.org/simple_linear_regression.html)" {cite}`SargentStachurski:2023a`, and "[Linear Regression in Python](https://python.quantecon.org/ols.html)" {cite}`SargentStachurski:2023b`.
37+
38+
39+
(SecBasicEmpLit)=
40+
## Basic Empirical Methods in the Literature
41+
42+
What are the standard empirical methods in the current issue of the *American Economic Review* ([Vol. 113, No. 10, October 2023](https://www.aeaweb.org/issues/736))?
43+
44+
Allen, Bertazzini, and Heldring, "The Economic Origins of Government" {cite}`AllenEtAl:2023`
45+
* Table 1, descriptive/summary statistics of the data
46+
* Eq. 1: Difference-in-difference
47+
\begin{equation*}
48+
Y_{c,t} = \sum_{k=0}^{-4}\left(\beta_k^{trmt}\times\mathbf{1}_k\times treated_c\right) + \rho_c + \gamma_t + \nu_{c,t} + \varepsilon_{c,t}
49+
\end{equation*}
50+
* Table 2, estimated coefficients, cross terms, standard errors
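
As a rough illustration of the difference-in-difference idea, the sketch below estimates the simplest two-group, two-period version by OLS on simulated data. It is not Eq. 1 from the paper, which interacts period indicators with $treated_c$ and includes fixed effects. The variable names `treated`, `post`, and `true_effect` and all of the numbers are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 4_000
true_effect = 2.0  # assumed treatment effect used only to simulate the data

treated = rng.integers(0, 2, size=n)  # 1 if the unit is in the treated group
post = rng.integers(0, 2, size=n)     # 1 if the observation is after treatment
y = (1.0 + 0.5 * treated + 0.3 * post
     + true_effect * treated * post + rng.normal(size=n))

# OLS of y on a constant, the group dummy, the period dummy, and their cross term
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical (homoskedastic) standard errors: sqrt(diag(sigma^2 (X'X)^{-1}))
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# The same estimate computed directly from the four group means
def cell_mean(g, p):
    return y[(treated == g) & (post == p)].mean()

did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print(f"cross-term coefficient: {beta[3]:.3f} (s.e. {se[3]:.3f})")
print(f"difference in differences of means: {did_means:.3f}")
```

In this saturated specification, the coefficient on the cross term `treated * post` equals the difference in differences of the four group means, which is why the two printed estimates agree.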
51+
52+
The iframe below contains a PDF of {cite}`AllenEtAl:2023` "The Economic Origins of Government".
53+
54+
<div>
55+
<iframe id="inlineFrameExample"
56+
title="Inline Frame Example"
57+
width="100%"
58+
height="700"
59+
src="https://drive.google.com/file/d/1ZR8a6DmbMbrW3K4Hdkgg5x4X1oFy_yNn/preview?usp=sharing">
60+
</iframe>
61+
</div>
62+
63+
64+
(SecBasicEmpCorrCaus)=
65+
## Correlation versus Causation
66+
67+
What is the difference between correlation and causation?
68+
* What are some examples of things that are correlated but do not "cause" each other?
69+
70+
What are some reasons that correlation does not imply causation?
71+
* Third variable problem/omitted variable/spurious correlation
72+
* Directionality/endogeneity
73+
74+
How do we determine causation?
75+
* Randomized controlled trials (RCT)
76+
* Laboratory experiments
77+
* Natural experiments
78+
* Quasi natural experiments
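
The third variable problem can be illustrated with a small simulation. In the sketch below, a hypothetical variable `z` drives both `x` and `y`, while `x` has no effect on `y` by construction. All names and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5_000

# Hypothetical third variable z drives both x and y; x does not enter y at all
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# The raw correlation between x and y is sizable despite no causal link
corr_xy = np.corrcoef(x, y)[0, 1]

def residualize(v, control):
    """Return the residuals from an OLS regression of v on a constant and control."""
    X = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef

# After removing the part of x and y explained by z, the correlation mostly vanishes
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]

print(f"corr(x, y) = {corr_xy:.3f}")
print(f"corr(x, y) after controlling for z = {partial_corr:.3f}")
```

With real data the third variable is often unobserved, which is one reason the experimental designs listed above are needed to establish causation.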
79+
80+
81+
(SecBasicEmpDescr)=
82+
## Data Description
83+
84+
Any paper that uses data needs to spend some ink summarizing and describing the data. This is usually done in tables of summary statistics. It can also be done with cross tabulation, which reports descriptive statistics by category. The most common types of descriptive statistics are the following:
85+
86+
* mean
87+
* median
88+
* variance
89+
* count
90+
* max
91+
* min
92+
93+
Let's download some data, and read it in using the Pandas library for Python.[^PandasRef] The following example is adapted from QuantEcon's "[Linear Regression in Python](https://python.quantecon.org/ols.html)" lecture {cite}`SargentStachurski:2023b`.
94+
95+
The research question of the paper "The Colonial Origins of Comparative Development: An Empirical Investigation" {cite}`AcemogluEtAl:2001` is to determine whether or not differences in institutions can help to explain observed economic outcomes. How do we measure institutional differences and economic outcomes? In this paper:
96+
* economic outcomes are proxied by log GDP per capita in 1995, adjusted for exchange rates,
97+
* institutional differences are proxied by an index of protection against expropriation averaged over 1985-95, constructed by the [Political Risk Services Group](https://www.prsgroup.com/).
98+
99+
These variables and other data used in the paper are available for download on [Daron Acemoglu’s webpage](https://economics.mit.edu/faculty/acemoglu/data/ajr2001).
100+
101+
102+
(SecBasicEmpDescrBasic)=
103+
### Basic Data Description
104+
105+
The following cells read the data from {cite}`AcemogluEtAl:2001` in the file `maketable1.dta` and display the first five observations.
106+
107+
```{code-cell} ipython3
108+
:tags: []
109+
110+
import pandas as pd
111+
112+
df1 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/' +
113+
'raw/master/ols/maketable1.dta')
114+
```
115+
116+
The [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method returns the first $n$ rows of a DataFrame with column headings and index numbers. The default is `n=5`.
117+
118+
```{code-cell} ipython3
119+
:tags: []
120+
121+
df1.head()
122+
```
123+
124+
How many observations are in this dataset? What are the different countries in this dataset?
125+
126+
```{code-cell} ipython3
127+
:tags: []
128+
129+
print("The number of observations (rows) in the dataset is:", df1.shape[0])
130+
print("")
131+
print("A list of all the", len(df1["shortnam"].unique()),
132+
'unique countries in the "shortnam" variable is:')
133+
print(df1["shortnam"].unique())
134+
```
135+
136+
Pandas DataFrames have a built-in method `.describe()` that will give the basic descriptive statistics for the numerical variables of a dataset.
137+
138+
```{code-cell} ipython3
139+
:tags: []
140+
141+
df1.describe()
142+
```
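
As a minimal sketch of computing the statistics listed at the start of this section one at a time, and by category, the example below uses a small made-up DataFrame so that it runs without downloading anything. The DataFrame `df_toy` and its columns `region` and `income` are hypothetical and are not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration
df_toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "income": [10.0, 14.0, 20.0, 22.0, 30.0],
})

# The six statistics listed in this section, computed in one call.
# Note that pandas reports the sample variance (divides by n - 1).
stats = df_toy["income"].agg(["mean", "median", "var", "count", "max", "min"])
print(stats)

# Descriptive statistics by category
by_region = df_toy.groupby("region")["income"].agg(["mean", "count"])
print(by_region)
```

The same `groupby(...).agg(...)` pattern should carry over to any categorical column in a real dataset.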
143+
144+
145+
<!-- {numref}`ExerBasicEmpir_MultLinRegress` -->
146+
147+
148+
(SecBasicEmpirExercises)=
149+
## Exercises
150+
151+
```{exercise-start} Multiple linear regression
152+
:label: ExerBasicEmpir_MultLinRegress
153+
:class: green
154+
```
155+
For this problem, you will use the 397 observations from the [`Auto.csv`](https://github.com/OpenSourceEcon/CompMethods/tree/main/data/basic_empirics/Auto.csv) dataset in the [`/data/basic_empirics/`](https://github.com/OpenSourceEcon/CompMethods/tree/main/data/basic_empirics) folder of the repository for this book.[^Auto] This dataset includes 397 observations on miles per gallon (`mpg`), number of cylinders (`cylinders`), engine displacement (`displacement`), horsepower (`horsepower`), vehicle weight (`weight`), acceleration (`acceleration`), vehicle year (`year`), vehicle origin (`origin`), and vehicle name (`name`).
156+
1. Import the data using the [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. Look for characters that seem out of place that might indicate missing values. Replace them with missing values using the `na_values=...` option.
157+
2. Produce a scatterplot matrix which includes all of the quantitative variables `mpg`, `cylinders`, `displacement`, `horsepower`, `weight`, `acceleration`, `year`, `origin`. Call your DataFrame of quantitative variables `df_quant`. [Use the pandas scatterplot function in the code block below.]
158+
```python
159+
from pandas.plotting import scatter_matrix
160+
161+
scatter_matrix(df_quant, alpha=0.3, figsize=(6, 6), diagonal='kde')
162+
```
163+
3. Compute the correlation matrix for the quantitative variables ($8\times 8$) using the [`pandas.DataFrame.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method.
164+
4. Estimate the following multiple linear regression model of $mpg_i$ on all other quantitative variables, where $u_i$ is an error term for each observation, using Python's `statsmodels.api.OLS()` function.
165+
\begin{equation*}
166+
\begin{split}
167+
mpg_i &= \beta_0 + \beta_1 cylinders_i + \beta_2 displacement_i + \beta_3 horsepower_i + ... \\
168+
&\qquad \beta_4 weight_i + \beta_5 acceleration_i + \beta_6 year_i + \beta_7 origin_i + u_i
169+
\end{split}
170+
\end{equation*}
171+
* Which of the coefficients is statistically significant at the 1\% level?
172+
* Which of the coefficients is NOT statistically significant at the 10\% level?
173+
* Give an interpretation in words of the estimated coefficient $\hat{\beta}_6$ on $year_i$ using the estimated value of $\hat{\beta}_6$.
174+
5. Looking at your scatterplot matrix from part (2), what are the three variables that look most likely to have a nonlinear relationship with $mpg_i$?
175+
* Estimate a new multiple regression model by OLS in which you include squared terms on the three variables you identified as having a nonlinear relationship to $mpg_i$ as well as a squared term on $acceleration_i$.
176+
* Report your adjusted R-squared statistic. Is it better or worse than the adjusted R-squared from part (4)?
177+
* What happened to the statistical significance of the $displacement_i$ variable coefficient and the coefficient on its squared term?
178+
* What happened to the statistical significance of the cylinders variable?
179+
6. Using the regression model from part (5) and the `.predict()` function, what would be the predicted miles per gallon $mpg$ of a car with 6 cylinders, displacement of 200, horsepower of 100, a weight of 3,100, acceleration of 15.1, model year of 1999, and origin of 1?
180+
```{exercise-end}
181+
```
182+
183+
184+
(SecBasicEmpirFootnotes)=
185+
## Footnotes
186+
187+
The footnotes from this chapter.
188+
189+
[^PandasRef]: For a tutorial on using Python's Pandas package, see the {ref}`Chap_Pandas` chapter of this online book.
190+
191+
[^Auto]: The [`Auto.csv`](https://github.com/OpenSourceEcon/CompMethods/tree/main/data/basic_empirics/Auto.csv) dataset comes from {cite}`JamesEtAl:2017` (ch. 3) and is also available at http://www-bcf.usc.edu/~gareth/ISL/data.html.
File renamed without changes.

docs/book/caus_inf/BasicEmpirMethods.md

Lines changed: 0 additions & 60 deletions
This file was deleted.
