Commit b8da6be

Edit MCS reshape long wide
1 parent 9932cec commit b8da6be

8 files changed, +237 -193 lines changed

docs/mcs-data_structures.md

Lines changed: 2 additions & 2 deletions
@@ -72,11 +72,11 @@ The parent files have a similar structure to the cohort member-level (`mcs[1-7]_
 | M10005C | 1 | ... |
 | ... | ... | ... |

-Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but take the same value across sweeps for a given individual (i.e., it is persistent).
+Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but takes the same value across sweeps for a given individual (i.e., it is persistent).

 The value of `[A-G]PNUM00` is partly arbitrary. It does not specify a particular relationship to a cohort member. Such relationships are determined in the household grid files, which we discuss further below. The `[A-G]PNUM00` does follow a convention, however. For non-cohort members, `[A-G]PNUM00` is a positive integer between 1 and 99. For cohort members, `[A-G]PNUM00` is equal to `[A-G]CNUM00` multiplied by 100; i.e. for the first cohort member in a family it is 100, and for the second it is 200.[^3] While cohort members have a `[A-G]PNUM00`, non-cohort members (parents or other household members) do not get a `[A-G]CNUM00`.

-[^3]: An exception to this is in `mcs6_hhgrid.dta` where for all cohort members `FPNUM00 == -1 [Not applicable]`.
+[^3]: Exceptions to this are `mcs[6-7]_hhgrid.dta` where for all cohort members `[F-G]PNUM00 == -1 [Not applicable]`.

 Again, as two variables are required to uniquely identify a parent, you may prefer to create a single, unique identifier variable by concatenating `MCSID` and `[A-G]PNUM00`.
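
As an illustration of the concatenation suggested in the last paragraph of this file's diff, here is a minimal sketch in R on made-up data. It assumes `dplyr` is installed and that the person number is a plain numeric column; the name `person_id` and the toy values are hypothetical and do not come from the MCS files.

```r
library(dplyr)

# Toy data (made up): two parents in one family, one parent in another.
parents <- tibble(
  MCSID   = c("M10001N", "M10001N", "M10002P"),
  APNUM00 = c(1, 2, 1)
)

# Neither MCSID nor APNUM00 is unique on its own, so paste them together.
# `person_id` is a hypothetical name chosen for this sketch.
parents <- parents %>%
  mutate(person_id = paste(MCSID, APNUM00, sep = "_"))

# Check that the combined key identifies each row exactly once.
stopifnot(!anyDuplicated(parents$person_id))
```

Using a separator such as `_` keeps the two parts recoverable later, which plain concatenation would not guarantee if person numbers had varying widths.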

docs/mcs-merging_across_sweeps.md

Lines changed: 11 additions & 11 deletions
@@ -380,7 +380,7 @@ variables as these have slightly different names each sweep. Typically
 variable names only differ on the sweep prefix used (`ACHTM00`,
 `BCHTM00`), but in Sweep 5 (age 11y), the name of the height variable
 (`ECHTCMA00`) diverged slightly from this pattern. Below, we also
-include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to
+include a step to `rename()` the `[B-G]CNUM00` variable to `CNUM00` to
 ensure consistency across sweeps as this will make merging more
 straightforward later.
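
The reason for renaming the sweep-prefixed cohort-member number to a common `CNUM00` can be sketched on toy data. This is an illustration only: it assumes `dplyr` is installed, the two small tibbles stand in for real sweep files, and the helper name `harmonise_cnum` and all values are hypothetical.

```r
library(dplyr)

# Made-up stand-ins for two sweeps; only the sweep prefix differs.
sweep2 <- tibble(
  MCSID = c("M10001N", "M10002P"),
  BCNUM00 = c(1, 1),
  BCHTCM00 = c(97, 96)
)
sweep3 <- tibble(
  MCSID = c("M10001N", "M10003Q"),
  CCNUM00 = c(1, 1),
  CCHTCM00 = c(114, 110)
)

# Hypothetical helper: give the cohort-member number the same name in
# every sweep so it can serve as a join key.
harmonise_cnum <- function(df) {
  rename(df, CNUM00 = matches("CNUM00"))
}

# With a shared key name, one `by` specification works for any pair of sweeps.
heights <- full_join(
  harmonise_cnum(sweep2),
  harmonise_cnum(sweep3),
  by = c("MCSID", "CNUM00")
)
```

Because `full_join()` keeps unmatched rows from both sides, a cohort member present in only one toy sweep still appears, with `NA` for the other sweep's height.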

@@ -394,7 +394,7 @@ load_height_wide <- function(sweep){

   glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
     read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>%
-    rename(cnum = matches("CNUM00"))
+    rename(CNUM00 = matches("CNUM00"))
 }
 ```

@@ -407,7 +407,7 @@ load_height_wide(2)

 ``` text
 # A tibble: 15,778 × 3
-MCSID cnum BCHTCM00
+MCSID CNUM00 BCHTCM00
 <chr> <dbl+lbl> <dbl+lbl>
 1 M10001N 1 [1st Cohort Member of the family] 97
 2 M10002P 1 [1st Cohort Member of the family] 96
@@ -428,7 +428,7 @@ load_height_wide(3)

 ``` text
 # A tibble: 15,431 × 3
-MCSID cnum CCHTCM00
+MCSID CNUM00 CCHTCM00
 <chr> <dbl+lbl> <dbl+lbl>
 1 M10001N 1 [1st Cohort Member of the family] 114.
 2 M10002P 1 [1st Cohort Member of the family] 110.
@@ -449,15 +449,15 @@ rather verbose:

 ```r
 load_height_wide(2) %>%
-  full_join(load_height_wide(3), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(4), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(6), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(7), by = c("MCSID", "cnum"))
+  full_join(load_height_wide(3), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(4), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(6), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(7), by = c("MCSID", "CNUM00"))
 ```

 ``` text
 # A tibble: 17,568 × 7
-MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
+MCSID CNUM00 BCHTCM00 CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
 <chr> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
 1 M10001N 1 [1st Cohort Member o… 97 114. 128. NA NA
 2 M10002P 1 [1st Cohort Member o… 96 110. 123 163. 174.
@@ -504,12 +504,12 @@ merged in.

 ```r
 map(2:7, load_height_wide) %>%
-  reduce(~ full_join(.x, .y, by = c("MCSID", "cnum")))
+  reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00")))
 ```

 ``` text
 # A tibble: 17,614 × 8
-MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
+MCSID CNUM00 BCHTCM00 CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
 <chr> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
 1 M10001N 1 [1st Cohort… 97 114. 128. NA NA NA
 2 M10002P 1 [1st Cohort… 96 110. 123 144. 163. 174.

docs/mcs-merging_within_sweep.md

Lines changed: 14 additions & 6 deletions
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: Combining Data Within a Sweep
+title: Combining Data Within A Sweep
 nav_order: 5
 parent: MCS
 format: docusaurus-md
@@ -82,7 +82,7 @@ Family country of residence is stored in a family-level dataset
 (`mcs2_family_derived`). This also does not need any further processing
 at this stage. Later when we merging this data with `df_ethnic_group`,
 we perform a 1-to-many merge, so the data will be automatically repeated
-for cases where there are multiple cohort members in a family.
+for cases where there are multiple cohort members in a family.[^1]

 ```r
 df_country <- family %>%
@@ -101,7 +101,7 @@ on a [grouped data
 frame](https://r4ds.hadley.nz/data-transform.html#groups)
 (`group_by(MCSID, BCNUM00)`) to ensure this is calculated per cohort
 member. The result is a dataset with one row per cohort member with data
-on whether any parent reads to them.[^1]
+on whether any parent reads to them.[^2]

 ```r
 df_reads <- parent_cm %>%
@@ -185,7 +185,7 @@ highest education level variable (`BDDNVQ00`) from the
 dataset, regardless of whether they have education data or not
 (`right_join()` fills variables with `NA` where [the retained row does
 not have a
-match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^2]
+match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^3]

 ```r
 df_mother <- hhgrid %>%
@@ -249,13 +249,21 @@ df_ethnic_group %>%

 # Footnotes

-[^1]: Below, for simplicity, we drop any rows with missing values
+[^1]: It is also possible to expand a family level dataset so that it
+has as many rows as there are cohort-members in the family.
+`mcs2_family_derived.dta` contains a variable, `BDNOCM00`, with this
+information that can be used with the `tidyverse` function
+`uncount(BDNOCM00)` to achieve this. (The dataset
+`mcs_longitudinal_family_file` contains a variable `NOCMHH` which
+holds similar information.)
+
+[^2]: Below, for simplicity, we drop any rows with missing values
 (`drop_na()` step). Proper analyses may opt to use a different rule,
 which may require merging in other information (e.g., setting the
 value to missing unless all resident parents have been interviewed
 and provided a valid response).

-[^2]: More detail on merging with `right_join()` (and other `*_join()`
+[^3]: More detail on merging with `right_join()` (and other `*_join()`
 variants) is provided in [*Combining Data Across
 Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html),
 as well as [Chapter 19 of the R for Data Science
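
The new footnote `[^1]` in this file mentions expanding a family-level dataset with `uncount(BDNOCM00)`. The following is a minimal sketch of that idea on made-up data, not the guide's own code. It assumes `dplyr` and `tidyr` are installed and that `BDNOCM00` is a plain positive count; real files may hold labelled values or negative missing codes that would need handling first. The columns `country` and `row_in_family` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy family-level data (made up): one row per family.
family <- tibble(
  MCSID    = c("M10001N", "M10002P"),
  BDNOCM00 = c(1, 2),               # number of cohort members in the family
  country  = c("England", "Wales")  # hypothetical family-level variable
)

# Repeat each family row once per cohort member.
# .remove = FALSE keeps the count column; .id adds a running index
# within each original row (a hypothetical name, not an MCS variable).
expanded <- family %>%
  uncount(BDNOCM00, .remove = FALSE, .id = "row_in_family")
```

Note that `row_in_family` is only a running index created by `uncount()`; treating it as the cohort-member number would be an additional assumption that should be checked against the cohort-member files.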
