
Commit a593b20

committed
bcs70
1 parent 7cb5701 commit a593b20

9 files changed

+623
-29
lines changed

.Rprofile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
options(knitr.stata.engine.path = "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP")

.ipynb_checkpoints/Untitled-checkpoint.ipynb

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/bcs70-data_discovery.md

Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,184 @@
1+
---
2+
layout: default
3+
title: Data Discovery
4+
nav_order: 2
5+
parent: BCS70
6+
format: docusaurus-md
7+
---
8+
9+
10+
11+
12+
# Introduction
13+
14+
In this section, we show a few `R` functions for exploring BCS70 data;
15+
as noted, historical sweeps of the BCS70 did not use modern metadata
16+
standards, so finding a specific variable can be challenging. Variables
17+
do not always have names that are descriptive or follow a consistent
18+
naming convention across sweeps. (The variable for cohort member sex in
19+
the `0y/bcs7072a.dta` file is `a0255`, for example.) In what follows, we
20+
will use the `R` functions to find variables on cohort members’ smoking,
21+
which has been collected in many of the sweeps.
22+
23+
The packages we will use are:
24+
```r
# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files
library(labelled) # For searching imported datasets
library(codebookr) # For creating .docx codebooks
```
32+
33+
# `labelled::lookfor()`
34+
The `labelled` package contains functionality for attaching and
examining metadata in dataframes (for instance, adding labels to
variables \[columns\]). Beyond this, it also contains the `lookfor()`
function, which replicates similar functionality in `Stata`. `lookfor()`
allows one to search for variables in a dataframe by keyword (regular
expression); the function searches variable names as well as associated
metadata. It returns an object containing matching variables, their
labels, their types, etc. Below, we read in the BCS70 38-year sweep
derived variable dataset (`38y/bcs8derived.dta`) and use `lookfor()` to
search for variables which mention `"smok"` or `"cigar"` in their name
or metadata.
45+
```r
bcs70_38y <- read_dta("38y/bcs8derived.dta")

lookfor(bcs70_38y, "smok|cigar")
```
51+
52+
``` text
pos variable label                col_type missing values
11  BD8SMOKE 2008: Smoking habits dbl+lbl  0       [-8] Dont Know
                                                   [-1] Not applicable
                                                   [0] Never smoked
                                                   [1] Ex smoker
                                                   [2] Occasional smoker
                                                   [3] Up to 10 a day
                                                   [4] 11 to 20 a day
                                                   [5] More than 20 a day
                                                   [6] Daily but frequency n~
```
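
As a self-contained illustration that does not require the BCS70 files,
the sketch below runs `lookfor()` on a small made-up data frame. The data
frame `toy`, its variable names, and its labels are invented for this
example; only the search pattern comes from the text above. The sketch
also relies on `lookfor()` ignoring case by default (its `ignore.case`
argument), which is why the pattern matches an upper-case label.

```r
library(tibble)
library(labelled)

# Toy data (hypothetical): cryptic variable names, as in older sweeps
toy <- tibble(
  id     = c("X1", "X2", "X3"),
  a0043b = c(1, 2, 1),
  q7     = c(0, 5, 20),
  q8     = c(1.60, 1.75, 1.82)
)

# Attach variable labels; lookfor() searches these as well as the names
var_label(toy) <- list(
  a0043b = "SMOKING DURING PREGNANCY",
  q7     = "No. of cigarettes per day",
  q8     = "Height in metres"
)

# Search names and labels with a regular expression
toy_hits <- lookfor(toy, "smok|cigar")
toy_hits$variable
```

Neither matching variable has a name that hints at smoking, so a search
of the names alone would not have found them.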
64+
65+
Users may consider it easier to create a tibble of the `lookfor()`
66+
output, which can be searched and filtered using `dplyr` functions.
67+
Below, we create a `tibble` (a type of `data.frame` with good printing
68+
defaults) of the `lookfor()` output and use `filter()` to find variables
69+
with `"smok"` or `"cigar"` in their labels. Note, we convert both the
70+
variable names and labels to lower case to make the search case
71+
insensitive.
72+
```r
bcs70_38y_lookfor <- lookfor(bcs70_38y) %>%
  as_tibble() %>%
  mutate(variable_low = str_to_lower(variable),
         label_low = str_to_lower(label))

bcs70_38y_lookfor %>%
  filter(str_detect(label_low, "smok|cigar"))
```
82+
83+
``` text
84+
# A tibble: 1 × 9
85+
pos variable label col_type missing levels value_labels variable_low
86+
<int> <chr> <chr> <chr> <int> <name> <named list> <chr>
87+
1 11 BD8SMOKE 2008: Smokin… dbl+lbl 0 <NULL> <dbl [9]> bd8smoke
88+
# ℹ 1 more variable: label_low <chr>
89+
```
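
An alternative to creating lower-case copies of the names and labels is
to make the pattern itself case insensitive with `stringr::regex()`. The
sketch below uses a made-up vector of labels (`toy_labels` is not part of
the BCS70 data) to show the idea; it is one possible approach rather than
the one used in this guide.

```r
library(stringr)

# Made-up labels in mixed case (hypothetical)
toy_labels <- c("2008: Smoking habits",
                "NO. OF CIGARETTES MOTHER SMOKES DAILY",
                "Height in metres")

# regex(..., ignore_case = TRUE) matches regardless of case,
# so no lower-case helper column is needed
is_smoking <- str_detect(toy_labels, regex("smok|cigar", ignore_case = TRUE))
is_smoking
```

The same `regex()` call could be used inside `filter()` on the original
`label` column.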
90+
91+
# `codebookr::codebook()`
92+
The BCS70 datasets that are downloadable from the UK Data Service come
bundled with data dictionaries within the `mrdoc` subfolder. However,
these are limited in some ways. The `codebookr` package enables the
creation of data dictionaries that are more customisable and, in our
opinion, easier to read. Below we create a codebook for the BCS70
38-year sweep dataset loaded above. These codebooks are intended to be
saved and viewed in Microsoft Word.
100+
```r
cdb <- codebook(bcs70_38y)
print(cdb, "bcs70_38y_codebook.docx") # Saves as .docx (Word) file
```
105+
106+
A screenshot of the codebook is shown below.
107+
108+
<figure>
109+
<img src="../images/bcs70-data_discovery.png"
110+
alt="Codebook created by codebookr::codebook()" />
111+
<figcaption aria-hidden="true">Codebook created by
112+
codebookr::codebook()</figcaption>
113+
</figure>
114+
115+
# Create a Lookup Table Across All Datasets
116+
Running `lookfor()` and `codebook()` one dataset at a time does not give
a quick overview of the variables available in the BCS70, including
which sweeps repeatedly measured characteristics are available in. Below
we create a `tibble`, `df_lookfor`, that contains `lookfor()` results
for all the `.dta` files in the BCS70 folder.
122+
123+
To do this, we create a function, `create_lookfor()`, that takes a file
124+
path to a `.dta` file, reads in the first row of the dataset (faster
125+
than reading the full dataset), and applies `lookfor()` to it. We call
126+
this function with a `mutate()` function call to create a set of lookups
127+
for every `.dta` file we can find in the BCS70 folder. `map()` loops
128+
over every value in the `file_path` column, creating a corresponding
129+
lookup table for that file, stored as a
130+
[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns).
131+
`unnest()` expands the results out, so rather than have one row per
132+
`file_path`, we have one row per variable.
133+
```r
create_lookfor <- function(file_path){
  read_dta(file_path, n_max = 1) %>%
    lookfor() %>%
    as_tibble()
}

df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>%
  filter(!str_detect(file_path, "^UKDS")) %>%
  mutate(lookfor = map(file_path, create_lookfor)) %>%
  unnest(lookfor) %>%
  mutate(variable_low = str_to_lower(variable),
         label_low = str_to_lower(label)) %>%
  separate(file_path,
           into = c("sweep", "file"),
           sep = "/",
           remove = FALSE) %>%
  relocate(file_path, pos, .after = last_col())
```
153+
154+
``` text
155+
Warning: Expected 2 pieces. Additional pieces discarded in 3174 rows [28943, 28944,
156+
28945, 28946, 28947, 28948, 28949, 28950, 28951, 28952, 28953, 28954, 28955,
157+
28956, 28957, 28958, 28959, 28960, 28961, 28962, ...].
158+
```
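
The warning arises because some file paths contain more than one `/`, so
`separate()` finds more than two pieces and drops the extra ones. One
possible way to keep the full remainder of the path, sketched below on
made-up paths, is to set `extra = "merge"`. The nested path is
hypothetical, and whether the merged value is the desired `file` entry
depends on how the folders are organised.

```r
library(tibble)
library(tidyr)

# Made-up paths: the second has an extra folder level (hypothetical)
toy_paths <- tibble(file_path = c("0y/bcs7072a.dta",
                                  "16y/stata/example.dta"))

# extra = "merge" splits at most once here, so nothing is discarded
toy_split <- separate(toy_paths, file_path,
                      into = c("sweep", "file"),
                      sep = "/",
                      extra = "merge",
                      remove = FALSE)
toy_split
```

With this setting, `sweep` still holds the first folder, while `file`
holds everything after the first `/`.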
159+
160+
We can use the resulting object to search for variables with `"smok"` or
161+
`"cigar"` in their labels.
162+
```r
df_lookfor %>%
  filter(str_detect(label_low, "smok|cigar")) %>%
  select(file, variable, label)
```
168+
169+
``` text
170+
# A tibble: 294 × 3
171+
file variable label
172+
<chr> <chr> <chr>
173+
1 bcs7072a.dta a0043b SMOKING DURING PREGNANCY
174+
2 bcs7072b.dta b0024 DOES THE CHILD'S MOTHER SMOKE TOBACCO ?
175+
3 bcs7072b.dta b0025 IF NO WHEN DID SHE LAST SMOKE (MONTH) ?
176+
4 bcs7072b.dta b0026 IF NO WHEN DID SHE LAST SMOKE (YEAR) ?
177+
5 bcs7072b.dta b0027 ANSWER TO LAST SMOKED OTHER THAN A DATE
178+
6 bcs7072b.dta b0028 HOW MANY SMOKED ( CIGARETTES ) ?
179+
7 sn3723.dta e9_1 MOTHER'S PRESENT SMOKING HABITS
180+
8 sn3723.dta e9_2 NO. OF CIGARETTES MOTHER SMOKES DAILY
181+
9 sn3723.dta e9_3 LENGTH OF TIME MOTHER HAS SMOKED
182+
10 sn3723.dta e10_1 MOTHER NON SMOKER NOW HAS SMOKED IN PAST
183+
# ℹ 284 more rows
184+
```

docs/bcs70-reshape_long_wide.md

Lines changed: 145 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,145 @@
1+
---
2+
layout: default
3+
title: Reshaping Data from Long to Wide (or Wide to Long)
4+
nav_order: 4
5+
parent: BCS70
6+
format: docusaurus-md
7+
---
8+
9+
10+
11+
12+
# Introduction
13+
In this section, we show how to reshape data from long to wide (and vice
versa). To demonstrate, we use data on cohort members’ height and weight
collected in Sweeps 9 (42y) and 11 (51y).
17+
18+
The packages we use are:
19+
```r
# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files
```
25+
26+
# Reshaping Raw Data from Wide to Long
27+
We begin by loading the data from each sweep and merging these together
into a single wide format data frame; see [Combining Data Across
Sweeps](https://cls-data.github.io/docs/bcs70-merging_across_sweeps.html)
for further explanation on how this is achieved. Note, the names of the
height and weight variables in Sweep 9 and Sweep 11 follow a similar
convention, which is the exception rather than the rule in BCS70 data.
Below, we convert the variable names in the Sweep 9 data frame to lower
case so that they closely match those in the Sweep 11 data frame. This
will make reshaping easier.
37+
```r
df_42y <- read_dta("42y/bcs70_2012_derived.dta",
                   col_select = c("BCSID", "BD9HGHTM", "BD9WGHTK")) %>%
  rename_with(str_to_lower)

df_51y <- read_dta("51y/bcs11_age51_main.dta",
                   col_select = c("bcsid", "bd11hghtm", "bd11wghtk"))

df_wide <- df_42y %>%
  full_join(df_51y, by = "bcsid")
```
49+
`df_wide` has 5 columns. Besides the identifier, `bcsid`, there are 4
columns for height and weight measurements at each sweep. Each of these
4 columns is prefixed by `bd` followed by one or two digits indicating
the sweep at assessment (`bd9` or `bd11`). We can reshape the dataset
into long format (one row per person x sweep combination) using the
`pivot_longer()` function so that the resulting data frame has four
columns: one person identifier, a variable for sweep of assessment
(`sweep`), and variables for height and weight. We specify the columns
to be reshaped using the `cols` argument, provide the new variable names
in the `names_to` argument, and the pattern the existing column names
take using the `names_pattern` argument. For `names_pattern` we specify
`"^bd(\\d{1,2})([A-Za-z].+)$"`, which breaks the column name into two
pieces: one or two digits indicating sweep (following `bd`;
`(\\d{1,2})`) and the subsequent characters at the end of the name
(`([A-Za-z].+)$`). `names_pattern` uses regular expressions. `.` matches
any single character, and `.+` modifies this to match one or more
characters. `\\d` is a special character denoting a digit. `[A-Za-z]`
indicates any alphabetic character, upper or lower case. As noted, the
digits hold information on sweep of assessment; in the reshaped data
frame these digits are stored as values in a new column `sweep`.
`.value` is a placeholder for the new columns in the reshaped data frame
that store the values from the columns selected by `cols`; these new
columns are named using the second piece from `names_pattern` - in this
case `hghtm` (height) and `wghtk` (weight).
74+
```r
df_long <- df_wide %>%
  pivot_longer(cols = matches("^bd"),
               names_to = c("sweep", ".value"),
               names_pattern = "^bd(\\d{1,2})([A-Za-z].+)$")

df_long
```
83+
84+
``` text
85+
# A tibble: 21,366 × 4
86+
bcsid sweep hghtm wghtk
87+
<chr> <chr> <dbl+lbl> <dbl+lbl>
88+
1 B10001N 9 1.55 55.8
89+
2 B10001N 11 1.55 50.8
90+
3 B10003Q 9 1.85 82.6
91+
4 B10003Q 11 1.85 83.5
92+
5 B10004R 9 1.60 57.2
93+
6 B10004R 11 1.6 57.2
94+
7 B10007U 9 1.52 82.6
95+
8 B10007U 11 NA NA
96+
9 B10009W 9 1.63 54.9
97+
10 B10009W 11 1.63 60.3
98+
# ℹ 21,356 more rows
99+
```
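
To see how the two capture groups in `names_pattern` map onto `sweep`
and `.value`, the sketch below repeats the reshape on a tiny made-up data
frame. The identifiers and measurements in `toy_wide` are invented. The
`names_transform` argument is an optional addition not used above; it
converts `sweep` from character to integer, which may be convenient for
sorting or plotting.

```r
library(tibble)
library(tidyr)

# Toy wide data (hypothetical values) using the same naming convention
toy_wide <- tibble(
  bcsid     = c("X1", "X2"),
  bd9hghtm  = c(1.60, 1.75),
  bd9wghtk  = c(60, 80),
  bd11hghtm = c(1.59, NA),
  bd11wghtk = c(62, NA)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = matches("^bd"),
  names_to = c("sweep", ".value"),              # group 1 -> sweep, group 2 -> new column names
  names_pattern = "^bd(\\d{1,2})([A-Za-z].+)$",
  names_transform = list(sweep = as.integer)    # optional: store sweep as integer
)

toy_long
```

Each person contributes one row per sweep, so the two-row `toy_wide`
becomes a four-row `toy_long` with columns `bcsid`, `sweep`, `hghtm`, and
`wghtk`.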
100+
101+
# Reshaping Raw Data from Long to Wide
102+
We can also reshape the data from long to wide format using the
`pivot_wider()` function. In this case, we want to create two new
columns for each sweep: one for height and one for weight. We specify
the columns to be reshaped using the `values_from` argument, name the
column whose values distinguish the new columns (`sweep`) in the
`names_from` argument, and use the `names_glue` argument to specify the
convention to follow for the new column names. The `names_glue` argument
uses curly braces (`{}`) to reference the column given in `names_from`
and the special `.value` placeholder. As we are specifying multiple
columns in `values_from`, `.value` is a placeholder for the names of the
variables selected in `values_from`.
113+
```r
df_long %>%
  pivot_wider(names_from = sweep,
              values_from = c(hghtm, wghtk),
              names_glue = "{.value}_{sweep}")
```
120+
121+
``` text
122+
# A tibble: 10,683 × 5
123+
bcsid hghtm_9 hghtm_11 wghtk_9 wghtk_11
124+
<chr> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
125+
1 B10001N 1.55 1.55 55.8 50.8
126+
2 B10003Q 1.85 1.85 82.6 83.5
127+
3 B10004R 1.60 1.6 57.2 57.2
128+
4 B10007U 1.52 NA 82.6 NA
129+
5 B10009W 1.63 1.63 54.9 60.3
130+
6 B10010P 1.65 NA -8 [No information] NA
131+
7 B10011Q 1.63 1.65 76.2 82.6
132+
8 B10013S 1.63 1.63 63.5 66.7
133+
9 B10015U 1.83 1.8 77.6 82.6
134+
10 B10016V 1.88 1.88 114. 118
135+
# ℹ 10,673 more rows
136+
```
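
The `names_glue` template can follow any convention. As a sketch, the
example below uses a made-up long data frame (`toy_long`, with invented
values) and the template `"bd{sweep}{.value}"` to rebuild column names in
the style of the original files, rather than the `{.value}_{sweep}` style
used above.

```r
library(tibble)
library(tidyr)

# Toy long data (hypothetical values): one row per person x sweep
toy_long <- tibble(
  bcsid = c("X1", "X1", "X2", "X2"),
  sweep = c("9", "11", "9", "11"),
  hghtm = c(1.60, 1.59, 1.75, NA),
  wghtk = c(60, 62, 80, NA)
)

# Glue the sweep number and variable stem back into names such as bd9hghtm
toy_wide <- pivot_wider(
  toy_long,
  names_from = sweep,
  values_from = c(hghtm, wghtk),
  names_glue = "bd{sweep}{.value}"
)

names(toy_wide)
```

The result has one row per person and a height and weight column for
each sweep.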
137+
Note, in the original `df_wide` tibble, the height and weight variables
were labelled numeric vectors - this class allows users to add metadata
to variables (value labels, etc.). When reshaping to long format,
multiple variables are effectively appended together, but the final
reshaped variables can only have one set of properties. `pivot_longer()`
tries to preserve variable attributes, but in some cases will throw an
error (where variables are of inconsistent types) or print a warning
(where value labels are inconsistent).
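
One way to sidestep inconsistent value labels, if the labels are not
needed after reshaping, is to strip them with `haven::zap_labels()`
before calling `pivot_longer()`. The sketch below does this on made-up
data and, as a further assumption, recodes negative values to `NA`
because such values usually denote missingness in the BCS70. The helper
`negatives_to_na()` and the data in `toy_lab` are hypothetical, and the
recode should be checked against each variable's documentation before
use.

```r
library(dplyr)
library(tibble)
library(haven)

# Toy labelled data (hypothetical): the two sweeps carry different value labels
toy_lab <- tibble(
  bcsid     = c("X1", "X2"),
  bd9wghtk  = labelled(c(60, -8), c("No information" = -8)),
  bd11wghtk = labelled(c(62, -1), c("Not applicable" = -1))
)

# Hypothetical helper: drop value labels, then set negative codes to NA
negatives_to_na <- function(x) {
  x <- zap_labels(x)
  replace(x, x < 0, NA)
}

toy_plain <- toy_lab %>%
  mutate(across(matches("^bd"), negatives_to_na))

toy_plain
```

The columns in `toy_plain` are plain numeric vectors, so they can be
stacked without any label conflicts.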

docs/bcs70.md

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,4 +5,16 @@ has_children: true
55
nav_order: 4
66
---
77

8-
This section will be populated as soon as possible.
8+
This section presents code to clean and handle data from the 1970 British Cohort Study (BCS70). The BCS70 has relatively straightforward data structures. Difficulties mainly arise from its historic nature - data from older sweeps does not conform to modern metadata standards (e.g., file and variable names are not explanatory). This can make it challenging to find relevant variables. In "Data Discovery", we provide code to assist with data discovery (i.e., creating searchable data dictionaries).
9+
10+
## Miscellanea
11+
12+
There are a few characteristics of the BCS70 that data users should be aware of.
13+
14+
- The cohort member identifier variable is `BCSID`. In some datasets, this variable is stored in lower case, `bcsid`.
15+
- Almost all datasets are at the cohort member level, with one row per cohort member. The exceptions are the derived partnership histories and activity histories datasets.
16+
- There is a small number of twins in the BCS70. A cohort member's twin can be identified with the `twincode` variable within the `bcs70_response_1970-2021.dta` dataset. This variable does not work as a family ID variable.
17+
- Later sweeps of the BCS70 follow consistent naming conventions (e.g., self-rated health is named `B*HLTHGN` [where `*` is a sweep number]), but earlier sweeps do not (e.g., self-rated health is named `b96043` at age 26y).
18+
- Later sweeps of the BCS70 have included survey items also collected in the NCDS. Both studies use the same naming conventions for variable names in these sweeps, which can help with harmonisation.
19+
- The BCS70 did not track all cohort members over time, including the 628 individuals born in Northern Ireland. The `bcs70_response_1970-2021.dta` dataset tracks only those individuals eligible for longitudinal follow-up, so sample sizes are smaller than if all wave-specific datasets are merged together. Users should consider using `bcs70_response_1970-2021.dta` to define their sample.
20+
- Negative values are typically reserved for different forms of missingness ("Don't know", "Refuse", "Not applicable", etc.), but in some cases - especially with older datasets - missing values have positive values instead.
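
Two of the points above, the inconsistent case of the identifier and the
use of the response dataset to define the sample, can be sketched with
made-up data. The data frames `toy_sweep` and `toy_response` and their
values are invented stand-ins (the latter for
`bcs70_response_1970-2021.dta`); this is an illustration under those
assumptions rather than a prescribed workflow.

```r
library(dplyr)
library(tibble)

# Toy sweep data (hypothetical) with upper-case names, including BCSID
toy_sweep <- tibble(BCSID = c("X1", "X2", "X3"),
                    BD9WGHTK = c(60, 80, 70))

# Toy stand-in for the response dataset, which defines the tracked sample
toy_response <- tibble(bcsid = c("X1", "X2"))

# Lower-case all names so the identifier matches, then keep only
# cohort members who appear in the response dataset
toy_sample <- toy_response %>%
  left_join(rename_with(toy_sweep, tolower), by = "bcsid")

toy_sample
```

Starting the join from the response dataset means that `X3`, who is not
in it, is excluded from `toy_sample`.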

images/bcs70-data_discovery.png

1.21 MB
