| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Data Discovery |
| 4 | +nav_order: 2 |
| 5 | +parent: BCS70 |
| 6 | +format: docusaurus-md |
| 7 | +--- |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | +# Introduction |
| 13 | + |
| 14 | +In this section, we show a few `R` functions for exploring BCS70 data; |
| 15 | +as noted, historical sweeps of the BCS70 did not use modern metadata |
| 16 | +standards, so finding a specific variable can be challenging. Variables |
| 17 | +do not always have names that are descriptive or follow a consistent |
| 18 | +naming convention across sweeps. (The variable for cohort member sex in |
| 19 | +the `0y/bcs7072a.dta` file is `a0255`, for example.) In what follows, we |
| 20 | +will use these `R` functions to find variables on cohort members’ smoking, |
| 21 | +information on which has been collected in many of the sweeps. |
| 22 | + |
| 23 | +The packages we will use are: |
| 24 | + |
| 25 | +```r |
| 26 | +# Load Packages |
| 27 | +library(tidyverse) # For data manipulation |
| 28 | +library(haven) # For importing .dta files |
| 29 | +library(labelled) # For searching imported datasets |
| 30 | +library(codebookr) # For creating .docx codebooks |
| 31 | +``` |
| 32 | + |
| 33 | +# `labelled::lookfor()` |
| 34 | + |
| 35 | +The `labelled` package contains functionality for attaching and |
| 36 | +examining metadata in dataframes (for instance, adding labels to |
| 37 | +variables \[columns\]). Beyond this, it also contains the `lookfor()` |
| 38 | +function, which replicates similar functionality in `Stata`. `lookfor()` |
| 39 | +allows one to search for variables in a dataframe by keyword (regular |
| 40 | +expression); the function searches variable names as well as associated |
| 41 | +metadata. It returns an object containing the matching variables, their |
| 42 | +labels, their types, and so on. Below, we read in the BCS70 38-year sweep |
| 43 | +derived variable dataset (`38y/bcs8derived.dta`) and use `lookfor()` to search |
| 44 | +for variables which mention `"smok"` or `"cigar"` in their name or metadata. |
| 45 | + |
| 46 | +```r |
| 47 | +bcs70_38y <- read_dta("38y/bcs8derived.dta") |
| 48 | + |
| 49 | +lookfor(bcs70_38y, "smok|cigar") |
| 50 | +``` |
| 51 | + |
| 52 | +``` text |
| 53 | + pos variable label col_type missing values |
| 54 | + 11 BD8SMOKE 2008: Smoking habits dbl+lbl 0 [-8] Dont Know |
| 55 | + [-1] Not applicable |
| 56 | + [0] Never smoked |
| 57 | + [1] Ex smoker |
| 58 | + [2] Occasional smoker |
| 59 | + [3] Up to 10 a day |
| 60 | + [4] 11 to 20 a day |
| 61 | + [5] More than 20 a day |
| 62 | + [6] Daily but frequency n~ |
| 63 | +``` |
| 64 | + |
| 65 | +Users may find it easier to create a tibble of the `lookfor()` |
| 66 | +output, which can be searched and filtered using `dplyr` functions. |
| 67 | +Below, we create a `tibble` (a type of `data.frame` with good printing |
| 68 | +defaults) of the `lookfor()` output and use `filter()` to find variables |
| 69 | +with `"smok"` or `"cigar"` in their labels. Note, we convert both the |
| 70 | +variable names and labels to lower case to make the search case |
| 71 | +insensitive. |
| 72 | + |
| 73 | +```r |
| 74 | +bcs70_38y_lookfor <- lookfor(bcs70_38y) %>% |
| 75 | + as_tibble() %>% |
| 76 | + mutate(variable_low = str_to_lower(variable), |
| 77 | + label_low = str_to_lower(label)) |
| 78 | + |
| 79 | +bcs70_38y_lookfor %>% |
| 80 | + filter(str_detect(label_low, "smok|cigar")) |
| 81 | +``` |
| 82 | + |
| 83 | +``` text |
| 84 | +# A tibble: 1 × 9 |
| 85 | + pos variable label col_type missing levels value_labels variable_low |
| 86 | + <int> <chr> <chr> <chr> <int> <name> <named list> <chr> |
| 87 | +1 11 BD8SMOKE 2008: Smokin… dbl+lbl 0 <NULL> <dbl [9]> bd8smoke |
| 88 | +# ℹ 1 more variable: label_low <chr> |
| 89 | +``` |
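The two searches above differ slightly: `lookfor()` matches the keyword against variable names as well as labels, whereas the `filter()` call only checks `label_low`. The sketch below illustrates this on a small toy data frame, so it runs without access to the BCS70 files. The data frame `toy`, its column names, and its labels are all hypothetical, and the example assumes the `lookfor()` default of `ignore.case = TRUE`.

```r
library(labelled)

# Toy data standing in for a sweep file (hypothetical names and labels)
toy <- data.frame(
  sex1   = c(1, 2, 1),
  smoke1 = c(0, 1, 2),
  q17    = c(5, 0, 20)
)
var_label(toy) <- list(
  sex1   = "SEX",
  smoke1 = "CURRENT STATUS",
  q17    = "NO. OF CIGARETTES PER DAY"
)

# `smoke1` matches on its name, `q17` on its (upper-case) label;
# `sex1` matches on neither and is dropped
hits <- lookfor(toy, "smok|cigar")
hits$variable
```

A variable such as `smoke1`, whose label does not mention smoking, would be missed by a search restricted to labels, so it can be worth also filtering on `variable_low`.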
| 90 | + |
| 91 | +# `codebookr::codebook()` |
| 92 | + |
| 93 | +The BCS70 datasets that are downloadable from the UK Data Service come |
| 94 | +bundled with data dictionaries within the `mrdoc` subfolder. However, |
| 95 | +these are limited in some ways. The `codebookr` package enables the |
| 96 | +creation of data dictionaries that are more customisable and, in our |
| 97 | +opinion, easier to read. Below we create a codebook for the BCS70 |
| 98 | +38-year sweep derived variable dataset loaded above. These codebooks are |
| 99 | +intended to be saved and viewed in Microsoft Word. |
| 100 | + |
| 101 | +```r |
| 102 | +cdb <- codebook(bcs70_38y) |
| 103 | +print(cdb, "bcs70_38y_codebook.docx") # Saves as .docx (Word) file |
| 104 | +``` |
| 105 | + |
| 106 | +A screenshot of the codebook is shown below. |
| 107 | + |
| 108 | +<figure> |
| 109 | +<img src="../images/bcs70-data_discovery.png" |
| 110 | +alt="Codebook created by codebookr::codebook()" /> |
| 111 | +<figcaption aria-hidden="true">Codebook created by |
| 112 | +codebookr::codebook()</figcaption> |
| 113 | +</figure> |
| 114 | + |
| 115 | +# Create a Lookup Table Across All Datasets |
| 116 | + |
| 117 | +Running `lookfor()` and `codebook()` on one dataset at a time does not |
| 118 | +give a quick overview of the variables available across the BCS70, or |
| 119 | +of the sweeps in which repeatedly measured characteristics are |
| 120 | +available. Below we create a `tibble`, `df_lookfor`, that contains |
| 121 | +`lookfor()` results for all the `.dta` files in the BCS70 folder. |
| 122 | + |
| 123 | +To do this, we create a function, `create_lookfor()`, that takes a file |
| 124 | +path to a `.dta` file, reads in the first row of the dataset (faster |
| 125 | +than reading the full dataset), and applies `lookfor()` to it. We call |
| 126 | +this function inside a `mutate()` call to create a set of lookups |
| 127 | +for every `.dta` file we can find in the BCS70 folder. `map()` loops |
| 128 | +over every value in the `file_path` column, creating a corresponding |
| 129 | +lookup table for that file, stored as a |
| 130 | +[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns). |
| 131 | +`unnest()` expands the results out, so rather than have one row per |
| 132 | +`file_path`, we have one row per variable. |
| 133 | + |
| 134 | +```r |
| 135 | +create_lookfor <- function(file_path){ |
| 136 | + read_dta(file_path, n_max = 1) %>% |
| 137 | + lookfor() %>% |
| 138 | + as_tibble() |
| 139 | +} |
| 140 | + |
| 141 | +df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>% |
| 142 | + filter(!str_detect(file_path, "^UKDS")) %>% |
| 143 | + mutate(lookfor = map(file_path, create_lookfor)) %>% |
| 144 | + unnest(lookfor) %>% |
| 145 | + mutate(variable_low = str_to_lower(variable), |
| 146 | + label_low = str_to_lower(label)) %>% |
| 147 | + separate(file_path, |
| 148 | + into = c("sweep", "file"), |
| 149 | + sep = "/", |
| 150 | + remove = FALSE) %>% |
| 151 | + relocate(file_path, pos, .after = last_col()) |
| 152 | +``` |
| 153 | + |
| 154 | +``` text |
| 155 | +Warning: Expected 2 pieces. Additional pieces discarded in 3174 rows [28943, 28944, |
| 156 | +28945, 28946, 28947, 28948, 28949, 28950, 28951, 28952, 28953, 28954, 28955, |
| 157 | +28956, 28957, 28958, 28959, 28960, 28961, 28962, ...]. |
| 158 | +``` |
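The warning indicates that some file paths contain more than one `/`, presumably because those files sit in a nested folder. With the default behaviour of `separate()`, the extra pieces are dropped, so for those rows `file` holds a folder name and not the name of the `.dta` file. One possible way around this, sketched below on two hypothetical paths, is to pass `extra = "merge"`, which splits only at the first `/` and keeps the remainder of the path together.

```r
library(tidyr)
library(tibble)

# Hypothetical paths: the second has an extra folder level
paths <- tibble(
  file_path = c("0y/example_a.dta", "16y/subfolder/example_b.dta")
)

split_paths <- separate(
  paths, file_path,
  into = c("sweep", "file"),
  sep = "/",
  extra = "merge",   # keep everything after the first "/" in `file`
  remove = FALSE
)

split_paths$file
```

Here `file` still contains the subfolder for nested files. If only the file name is wanted, `basename(file_path)` is an alternative.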
| 159 | + |
| 160 | +We can use the resulting object to search for variables with `"smok"` or |
| 161 | +`"cigar"` in their labels. |
| 162 | + |
| 163 | +```r |
| 164 | +df_lookfor %>% |
| 165 | + filter(str_detect(label_low, "smok|cigar")) %>% |
| 166 | + select(file, variable, label) |
| 167 | +``` |
| 168 | + |
| 169 | +``` text |
| 170 | +# A tibble: 294 × 3 |
| 171 | + file variable label |
| 172 | + <chr> <chr> <chr> |
| 173 | + 1 bcs7072a.dta a0043b SMOKING DURING PREGNANCY |
| 174 | + 2 bcs7072b.dta b0024 DOES THE CHILD'S MOTHER SMOKE TOBACCO ? |
| 175 | + 3 bcs7072b.dta b0025 IF NO WHEN DID SHE LAST SMOKE (MONTH) ? |
| 176 | + 4 bcs7072b.dta b0026 IF NO WHEN DID SHE LAST SMOKE (YEAR) ? |
| 177 | + 5 bcs7072b.dta b0027 ANSWER TO LAST SMOKED OTHER THAN A DATE |
| 178 | + 6 bcs7072b.dta b0028 HOW MANY SMOKED ( CIGARETTES ) ? |
| 179 | + 7 sn3723.dta e9_1 MOTHER'S PRESENT SMOKING HABITS |
| 180 | + 8 sn3723.dta e9_2 NO. OF CIGARETTES MOTHER SMOKES DAILY |
| 181 | + 9 sn3723.dta e9_3 LENGTH OF TIME MOTHER HAS SMOKED |
| 182 | +10 sn3723.dta e10_1 MOTHER NON SMOKER NOW HAS SMOKED IN PAST |
| 183 | +# ℹ 284 more rows |
| 184 | +``` |
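Because the lookup table has one row per variable and records the sweep each variable comes from, it can also be summarised to see which sweeps contain matching variables. The sketch below shows the idea on a small hypothetical stand-in for `df_lookfor` (the rows, sweep names, and labels in `toy_lookfor` are invented for illustration). The same `count()` call could be applied to `df_lookfor` itself.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in for df_lookfor: one row per variable
toy_lookfor <- tribble(
  ~sweep, ~variable, ~label_low,
  "0y",   "x1",      "smoking during pregnancy",
  "0y",   "x2",      "sex",
  "5y",   "y1",      "mother's present smoking habits",
  "5y",   "y2",      "no. of cigarettes smoked daily",
  "38y",  "z1",      "smoking habits"
)

# Number of smoking-related variables found in each sweep
smoking_by_sweep <- toy_lookfor %>%
  filter(str_detect(label_low, "smok|cigar")) %>%
  count(sweep, name = "n_vars")

smoking_by_sweep
```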