| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Data Discovery |
| 4 | +nav_order: 1 |
| 5 | +parent: NCDS |
| 6 | +format: docusaurus-md |
| 7 | +--- |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | +# Introduction |
| 13 | + |
| 14 | +In this section, we show a few `R` functions for exploring NCDS data; as |
| 15 | +noted, historical sweeps of the NCDS did not use modern metadata |
| 16 | +standards, so finding a specific variable can be challenging. Variables |
| 17 | +do not always have names that are descriptive or follow a consistent |
| 18 | +naming convention across sweeps. (The variable for cohort member sex is |
| 19 | +`N622`, for example.) In what follows, we will use the `R` functions to |
| 20 | +find variables for cohort members’ height, which has been collected in |
| 21 | +many of the sweeps. |
| 22 | + |
| 23 | +The packages we will use are: |
| 24 | + |
| 25 | +```r |
| 26 | +# Load Packages |
| 27 | +library(tidyverse) # For data manipulation |
| 28 | +library(haven) # For importing .dta files |
| 29 | +library(labelled) # For searching imported datasets |
| 30 | +library(codebookr) # For creating .docx codebooks |
| 31 | +``` |
| 32 | + |
| 33 | +# `labelled::lookfor()` |
| 34 | + |
| 35 | +The `labelled` package contains functionality for attaching and |
| 36 | +examining metadata in dataframes (for instance, adding labels to |
| 37 | +variables \[columns\]). Beyond this, it also contains the `lookfor()` |
| 38 | +function, which replicates similar functionality in `Stata`. `lookfor()` |
| 39 | +allows one to search for variables in a dataframe by keyword (regular |
| 40 | +expression); the function searches variable names as well as associated |
| 41 | +metadata. It returns an object containing the matching variables along |
| 42 | +with their labels, types, and related details. Below, we read in the NCDS 55-year sweep |
| 43 | +dataset which contains derived variables (`55y/ncds_2013_derived.dta`) |
| 44 | +and use `lookfor()` to search for variables related to `"height"`. |
| 45 | + |
| 46 | +```r |
| 47 | +ncds_55y <- read_dta("55y/ncds_2013_derived.dta") |
| 48 | + |
| 49 | +lookfor(ncds_55y, "height") |
| 50 | +``` |
| 51 | + |
| 52 | +``` text |
| 53 | + pos variable label col_type missing values |
| 54 | + 46 ND9HGHTM (Derived) Height in metres dbl+lbl 0 [-8] No information |
| 55 | +``` |
| 56 | + |
| 57 | +It is often easier to convert the `lookfor()` output to a tibble, |
| 58 | +which can then be searched and filtered using `dplyr` functions. |
| 59 | +Below, we create a `tibble` (a type of `data.frame` with good printing |
| 60 | +defaults) of the `lookfor()` output and use `filter()` to find variables |
| 61 | +with `"height"` in their labels. Note, we convert both the variable |
| 62 | +names and labels to lower case to make the search case insensitive. |
| 63 | + |
| 64 | +```r |
| 65 | +ncds_55y_lookfor <- lookfor(ncds_55y) %>% |
| 66 | + as_tibble() %>% |
| 67 | + mutate(variable_low = str_to_lower(variable), |
| 68 | + label_low = str_to_lower(label)) |
| 69 | + |
| 70 | +ncds_55y_lookfor %>% |
| 71 | + filter(str_detect(label_low, "height")) |
| 72 | +``` |
| 73 | + |
| 74 | +``` text |
| 75 | +# A tibble: 1 × 9 |
| 76 | + pos variable label col_type missing levels value_labels variable_low |
| 77 | + <int> <chr> <chr> <chr> <int> <name> <named list> <chr> |
| 78 | +1 46 ND9HGHTM (Derived) He… dbl+lbl 0 <NULL> <dbl [1]> nd9hghtm |
| 79 | +# ℹ 1 more variable: label_low <chr> |
| 80 | +``` |
| 81 | + |
| 82 | +# `codebookr::codebook()` |
| 83 | + |
| 84 | +The NCDS datasets that are downloadable from the UK Data Service come |
| 85 | +bundled with data dictionaries within the `mrdoc` subfolder. However, |
| 86 | +these are limited in some ways. The `codebookr` package enables the |
| 87 | +creation of data dictionaries that are more customisable and, in our |
| 88 | +opinion, easier to read. Below we create a codebook for the NCDS 55-year |
| 89 | +sweep derived variable dataset. These codebooks are intended to be saved |
| 90 | +and viewed in Microsoft Word. |
| 91 | + |
| 92 | +```r |
| 93 | +cdb <- codebook(ncds_55y) |
| 94 | +print(cdb, "ncds_55y_codebook.docx") # Saves as .docx (Word) file |
| 95 | +``` |
| 96 | + |
| 97 | +A screenshot of the codebook is shown below. |
| 98 | + |
| 99 | +<figure> |
| 100 | +<img src="../images/ncds-data_discovery.png" |
| 101 | +alt="Codebook created by codebookr::codebook()" /> |
| 102 | +<figcaption aria-hidden="true">Codebook created by |
| 103 | +codebookr::codebook()</figcaption> |
| 104 | +</figure> |
| 105 | + |
| 106 | +# Create a Lookup Table Across All Datasets |
| 107 | + |
| 108 | +Running `lookfor()` and `codebook()` on one dataset at a time does not |
| 109 | +give a quick overview of the variables available across the NCDS, |
| 110 | +including which sweeps contain repeatedly measured characteristics. |
| 111 | +Below we create a `tibble`, `df_lookfor`, that contains |
| 112 | +`lookfor()` results for all the `.dta` files in the NCDS folder. |
| 113 | + |
| 114 | +To do this, we create a function, `create_lookfor()`, that takes a file |
| 115 | +path to a `.dta` file, reads in the first row of the dataset (faster |
| 116 | +than reading the full dataset), and applies `lookfor()` to it. We call |
| 117 | +this function inside `mutate()` to create a lookup table |
| 118 | +for every `.dta` file we can find in the NCDS folder. `map()` loops over |
| 119 | +every value in the `file_path` column, creating a corresponding lookup |
| 120 | +table for that file, stored as a |
| 121 | +[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns). |
| 122 | +`unnest()` expands the results out, so rather than have one row per |
| 123 | +`file_path`, we have one row per variable. |
| 124 | + |
| 125 | +```r |
| 126 | +create_lookfor <- function(file_path){ |
| 127 | + read_dta(file_path, n_max = 1) %>% |
| 128 | + lookfor() %>% |
| 129 | + as_tibble() |
| 130 | +} |
| 131 | + |
| 132 | +df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>% |
| 133 | + filter(!str_detect(file_path, "^UKDS")) %>% |
| 134 | + mutate(lookfor = map(file_path, create_lookfor)) %>% |
| 135 | + unnest(lookfor) %>% |
| 136 | + mutate(variable_low = str_to_lower(variable), |
| 137 | + label_low = str_to_lower(label)) %>% |
| 138 | + separate(file_path, |
| 139 | + into = c("sweep", "file"), |
| 140 | + sep = "/", |
| 141 | + remove = FALSE) %>% |
| 142 | + relocate(file_path, pos, .after = last_col()) |
| 143 | +``` |
| 144 | + |
| 145 | +We can use the resulting object to search for variables with `"height"` |
| 146 | +in their labels. |
| 147 | + |
| 148 | +```r |
| 149 | +df_lookfor %>% |
| 150 | + filter(str_detect(label_low, "height")) %>% |
| 151 | + select(file, variable, label) |
| 152 | +``` |
| 153 | + |
| 154 | +``` text |
| 155 | +# A tibble: 77 × 3 |
| 156 | + file variable label |
| 157 | + <chr> <chr> <chr> |
| 158 | + 1 ncds0123.dta n510 0 Height of mum in inches at chlds brth |
| 159 | + 2 ncds0123.dta n332 1M Childs height, no shoes-nearest inch |
| 160 | + 3 ncds0123.dta n334 1M Childs height,no shoes-to centimeter |
| 161 | + 4 ncds0123.dta n1199 2P Father's height in inches |
| 162 | + 5 ncds0123.dta n1205 2P Mothers height in inches |
| 163 | + 6 ncds0123.dta n1510 2M Childs height no shoes,socks- inches |
| 164 | + 7 ncds0123.dta n1511 2M Fractions of an inch in childs height |
| 165 | + 8 ncds0123.dta n1949 3M Child's height,in bare feet,in cms |
| 166 | + 9 ncds0123.dta dvht07 1D Height in metres at 7 years |
| 167 | +10 ncds0123.dta dvht11 2D Height in metres at 11 years |
| 168 | +# ℹ 67 more rows |
| 169 | +``` |