Commit 3330572

NCDS pages
1 parent 876544a commit 3330572

9 files changed: +783 -1 lines changed

docs/documentation_download.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
layout: default
title: "Downloading Documentation"
nav_order: 2
---

The UKDS files come bundled with study documentation in `.pdf` format within the `mrdoc/` subfolder, for instance technical reports, questionnaires and user guides. The files do not always have the most user-friendly titles, so we have written code to download documentation from the CLS study pages instead.

The code is available on [GitHub](https://github.com/CLS-Data/cohort-documentation.git). To use the code, first download or clone the GitHub repository.

- To download the repository, on the GitHub website, click `Code -> Download ZIP` (see screenshot below), then unzip the downloaded file and place it in a suitable location on your computer.
- To clone the repository, open your computer's command line or terminal, navigate to an appropriate location (`cd ...`) and type `git clone https://github.com/CLS-Data/cohort-documentation.git`. You may want to rename the folder from `cohort-documentation` to `CLS Documentation` or something similar.

![Downloading the GitHub directory](../images/documentation_download_1.png)

When the folder is downloaded, open the `README.md` file and follow the instructions therein.

Once completed, the folder should look like the screenshot below.

![Directory after code completed](../images/documentation_download_2.png)

docs/ncds-data_discovery.md

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
---
layout: default
title: Data Discovery
nav_order: 1
parent: NCDS
format: docusaurus-md
---

# Introduction

In this section, we show a few `R` functions for exploring NCDS data. As
noted, historical sweeps of the NCDS did not use modern metadata
standards, so finding a specific variable can be challenging. Variables
do not always have names that are descriptive or follow a consistent
naming convention across sweeps. (The variable for cohort member sex is
`N622`, for example.) In what follows, we will use the `R` functions to
find variables for cohort members’ height, which has been collected in
many of the sweeps.

The packages we will use are:

```r
# Load Packages
library(tidyverse) # For data manipulation
library(haven) # For importing .dta files
library(labelled) # For searching imported datasets
library(codebookr) # For creating .docx codebooks
```

# `labelled::lookfor()`

The `labelled` package contains functionality for attaching and
examining metadata in dataframes (for instance, adding labels to
variables \[columns\]). Beyond this, it also contains the `lookfor()`
function, which replicates similar functionality in `Stata`. `lookfor()`
allows one to search for variables in a dataframe by keyword (regular
expression); the function searches variable names as well as associated
metadata. It returns an object containing matching variables, their
labels, their types, and so on. Below, we read in the NCDS 55-year sweep
dataset that contains derived variables (`55y/ncds_2013_derived.dta`)
and use `lookfor()` to search for variables related to `"height"`.

```r
ncds_55y <- read_dta("55y/ncds_2013_derived.dta")

lookfor(ncds_55y, "height")
```

``` text
 pos variable label                      col_type missing values
 46  ND9HGHTM (Derived) Height in metres dbl+lbl  0       [-8] No information
```
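
To see how `lookfor()` matches on both variable names and labels without needing the NCDS files, here is a minimal sketch on a made-up data frame. The variable names and labels (`sex`, `hghtm`, `wt_kg`) are invented for illustration and are not NCDS variables.

```r
library(tibble)
library(labelled)

# Made-up data: names and labels are illustrative, not real NCDS variables
toy <- tibble(
  sex   = c(1, 2),
  hghtm = c(1.62, 1.80),
  wt_kg = c(60, 82)
)
var_label(toy) <- list(
  sex   = "Sex of cohort member",
  hghtm = "Height in metres",
  wt_kg = "Weight in kilograms"
)

# Matches via the label (the name "hghtm" does not contain "height")
by_label <- lookfor(toy, "height")

# Matches via the variable name (the label does not contain "wt_")
by_name <- lookfor(toy, "wt_")

# The keyword is treated as a regular expression, so alternation works
by_regex <- lookfor(toy, "height|weight")

by_label$variable
by_name$variable
by_regex$variable
```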

Users may find it easier to create a tibble of the `lookfor()`
output, which can be searched and filtered using `dplyr` functions.
Below, we create a `tibble` (a type of `data.frame` with good printing
defaults) of the `lookfor()` output and use `filter()` to find variables
with `"height"` in their labels. Note that we convert both the variable
names and labels to lower case to make the search case insensitive.

```r
ncds_55y_lookfor <- lookfor(ncds_55y) %>%
  as_tibble() %>%
  mutate(variable_low = str_to_lower(variable),
         label_low = str_to_lower(label))

ncds_55y_lookfor %>%
  filter(str_detect(label_low, "height"))
```

``` text
# A tibble: 1 × 9
    pos variable label         col_type missing levels value_labels variable_low
  <int> <chr>    <chr>         <chr>      <int> <name> <named list> <chr>
1    46 ND9HGHTM (Derived) He… dbl+lbl        0 <NULL> <dbl [1]>    nd9hghtm
# ℹ 1 more variable: label_low <chr>
```
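
An alternative to creating lower-case copies of the columns is to make the pattern itself case insensitive with `stringr::regex(ignore_case = TRUE)`. The sketch below uses a small hand-made lookup table rather than real `lookfor()` output; the second row (`ND9EXMPL`) is hypothetical and is included only to show a non-match.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hand-made stand-in for the lookfor() tibble; ND9EXMPL is a hypothetical variable
toy_lookfor <- tibble(
  variable = c("ND9HGHTM", "ND9EXMPL"),
  label    = c("(Derived) Height in metres", "(Derived) Example variable")
)

# Case-insensitive match on the original labels, no lower-case copy needed
height_vars <- toy_lookfor %>%
  filter(str_detect(label, regex("height", ignore_case = TRUE)))

height_vars
```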

# `codebookr::codebook()`

The NCDS datasets that are downloadable from the UK Data Service come
bundled with data dictionaries within the `mrdoc` subfolder. However,
these are limited in some ways. The `codebookr` package enables the
creation of data dictionaries that are more customisable and, in our
opinion, easier to read. Below we create a codebook for the NCDS 55-year
sweep derived variable dataset. These codebooks are intended to be saved
and viewed in Microsoft Word.

```r
cdb <- codebook(ncds_55y)
print(cdb, "ncds_55y_codebook.docx") # Saves as .docx (Word) file
```

A screenshot of the codebook is shown below.

<figure>
<img src="../images/ncds-data_discovery.png"
alt="Codebook created by codebookr::codebook()" />
<figcaption aria-hidden="true">Codebook created by
codebookr::codebook()</figcaption>
</figure>

# Create a Lookup Table Across All Datasets

Running `lookfor()` and `codebook()` one dataset at a time does not
give a quick overview of the variables available in the NCDS, including
the sweeps in which repeatedly measured characteristics are available.
Below we create a `tibble`, `df_lookfor`, that contains `lookfor()`
results for all the `.dta` files in the NCDS folder.

To do this, we create a function, `create_lookfor()`, that takes a file
path to a `.dta` file, reads in the first row of the dataset (faster
than reading the full dataset), and applies `lookfor()` to it. We call
this function inside `mutate()` to create a set of lookups for every
`.dta` file we can find in the NCDS folder. `map()` loops over every
value in the `file_path` column, creating a corresponding lookup table
for that file, stored as a
[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns).
`unnest()` expands the results out, so rather than have one row per
`file_path`, we have one row per variable. Finally, `separate()` splits
`file_path` into `sweep` and `file` columns.

```r
create_lookfor <- function(file_path){
  read_dta(file_path, n_max = 1) %>%
    lookfor() %>%
    as_tibble()
}

df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>%
  filter(!str_detect(file_path, "^UKDS")) %>%
  mutate(lookfor = map(file_path, create_lookfor)) %>%
  unnest(lookfor) %>%
  mutate(variable_low = str_to_lower(variable),
         label_low = str_to_lower(label)) %>%
  separate(file_path,
           into = c("sweep", "file"),
           sep = "/",
           remove = FALSE) %>%
  relocate(file_path, pos, .after = last_col())
```
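
For readers without the NCDS files to hand, the same `map()` and `unnest()` pattern can be tried on throwaway data. The sketch below writes two made-up `.dta` files into a temporary folder and builds a lookup table from them. The folder names, file names, variable names and labels are all assumptions made for illustration; `create_lookfor()` is repeated from above so the block runs on its own.

```r
library(tidyverse)
library(haven)
library(labelled)

# Same helper as above: read one row, then list variables and labels
create_lookfor <- function(file_path){
  read_dta(file_path, n_max = 1) %>%
    lookfor() %>%
    as_tibble()
}

# Write two made-up Stata files into sweep-like subfolders of a temporary folder
toy_dir <- file.path(tempdir(), "toy_ncds")
dir.create(file.path(toy_dir, "7y"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(toy_dir, "11y"), recursive = TRUE, showWarnings = FALSE)

tibble(id = 1:3, ht07 = c(1.20, 1.25, 1.22)) %>%
  set_variable_labels(id = "Identifier", ht07 = "Height in metres at 7 years") %>%
  write_dta(file.path(toy_dir, "7y", "toy_7y.dta"))

tibble(id = 1:3, ht11 = c(1.40, 1.46, 1.43)) %>%
  set_variable_labels(id = "Identifier", ht11 = "Height in metres at 11 years") %>%
  write_dta(file.path(toy_dir, "11y", "toy_11y.dta"))

# One row per file, then a list-column of lookups, then one row per variable
toy_lookfor <- tibble(file_path = list.files(toy_dir, pattern = "\\.dta$", recursive = TRUE)) %>%
  mutate(lookfor = map(file.path(toy_dir, file_path), create_lookfor)) %>%
  unnest(lookfor) %>%
  separate(file_path, into = c("sweep", "file"), sep = "/", remove = FALSE)

toy_lookfor %>%
  select(sweep, file, variable, label)
```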

We can use the resulting object to search for variables with `"height"`
in their labels.

```r
df_lookfor %>%
  filter(str_detect(label_low, "height")) %>%
  select(file, variable, label)
```

``` text
# A tibble: 77 × 3
   file         variable label
   <chr>        <chr>    <chr>
 1 ncds0123.dta n510     0 Height of mum in inches at chlds brth
 2 ncds0123.dta n332     1M Childs height, no shoes-nearest inch
 3 ncds0123.dta n334     1M Childs height,no shoes-to centimeter
 4 ncds0123.dta n1199    2P Father's height in inches
 5 ncds0123.dta n1205    2P Mothers height in inches
 6 ncds0123.dta n1510    2M Childs height no shoes,socks- inches
 7 ncds0123.dta n1511    2M Fractions of an inch in childs height
 8 ncds0123.dta n1949    3M Child's height,in bare feet,in cms
 9 ncds0123.dta dvht07   1D Height in metres at 7 years
10 ncds0123.dta dvht11   2D Height in metres at 11 years
# ℹ 67 more rows
```
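
Once such a lookup table exists, a quick tally shows which sweeps contain matching variables. The sketch below uses a hand-made stand-in for `df_lookfor`, so its rows are illustrative rather than real results; in particular, the `sweep_x` folder name and the `example1` variable are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hand-made stand-in for df_lookfor; "sweep_x" and "example1" are hypothetical
toy_df_lookfor <- tibble(
  sweep     = c("sweep_x", "sweep_x", "sweep_x", "55y"),
  variable  = c("dvht07", "dvht11", "example1", "ND9HGHTM"),
  label_low = c("1d height in metres at 7 years",
                "2d height in metres at 11 years",
                "an unrelated example label",
                "(derived) height in metres")
)

# Count the variables with "height" in their label, per sweep
height_by_sweep <- toy_df_lookfor %>%
  filter(str_detect(label_low, "height")) %>%
  count(sweep, name = "n_height_vars")

height_by_sweep
```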
