CLS-Data
diff --git a/‎.Rprofile‎
Lines changed: 1 addition & 0 deletions b/‎.Rprofile‎
diff --git a/‎.ipynb_checkpoints/Untitled-checkpoint.ipynb‎
Lines changed: 0 additions & 28 deletions b/‎.ipynb_checkpoints/Untitled-checkpoint.ipynb‎
diff --git a/‎docs/bcs70-data_discovery.md‎
Lines changed: 184 additions & 0 deletions b/‎docs/bcs70-data_discovery.md‎
diff --git a/‎docs/bcs70-reshape_long_wide.md‎
Lines changed: 145 additions & 0 deletions b/‎docs/bcs70-reshape_long_wide.md‎
diff --git a/‎docs/bcs70.md‎
Lines changed: 13 additions & 1 deletion b/‎docs/bcs70.md‎
diff --git a/‎images/bcs70-data_discovery.png‎
1.21 MB b/‎images/bcs70-data_discovery.png‎
@@ -0,0 +1 @@
+options(knitr.stata.engine.path = "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP")
@@ -0,0 +1,184 @@
+---
+layout: default
+title: Data Discovery
+nav_order: 2
+parent: BCS70
+format: docusaurus-md
+---
+
+
+
+
+# Introduction
+
+In this section, we show a few `R` functions for exploring BCS70 data;
+as noted, historical sweeps of the BCS70 did not use modern metadata
+standards, so finding a specific variable can be challenging. Variables
+do not always have names that are descriptive or follow a consistent
+naming convention across sweeps. (The variable for cohort member sex in
+the `0y/bcs7072a.dta` file is `a0255`, for example.) In what follows, we
+will use the `R` functions to find variables on cohort members’ smoking,
+which has been collected in many of the sweeps.
+
+The packages we will use are:
+
+```r
+# Load Packages
+library(tidyverse) # For data manipulation
+library(haven) # For importing .dta files
+library(labelled) # For searching imported datasets
+library(codebookr) # For creating .docx codebooks
+```
+
+# `labelled::lookfor()`
+
+The `labelled` package contains functionality for attaching and
+examining metadata in dataframes (for instance, adding labels to
+variables \[columns\]). Beyond this, it also contains the `lookfor()`
+function, which replicates similar functionality in `Stata`. `lookfor()`
+allows one to search for variables in a dataframe by keyword (regular
+expression); the function searches variable names as well as associated
+metadata. It returns an object containing matching variables, their
+labels, their types, etc. Below, we read in the BCS70 38-year sweep
+derived variable dataset (`38y/bcs8derived.dta`) and use `lookfor()` to
+search for variables which mention `"smok"` or `"cigar"` in their name
+or metadata.
+
+```r
+bcs70_38y <- read_dta("38y/bcs8derived.dta")
+
+lookfor(bcs70_38y, "smok|cigar")
+```
+
+``` text
+ pos variable label                col_type missing values                    
+ 11  BD8SMOKE 2008: Smoking habits dbl+lbl  0       [-8] Dont Know            
+                                                    [-1] Not applicable       
+                                                    [0] Never smoked          
+                                                    [1] Ex smoker             
+                                                    [2] Occasional smoker     
+                                                    [3] Up to 10 a day        
+                                                    [4] 11 to 20 a day        
+                                                    [5] More than 20 a day    
+                                                    [6] Daily but frequency n~
+```
+
+Users may consider it easier to create a tibble of the `lookfor()`
+output, which can be searched and filtered using `dplyr` functions.
+Below, we create a `tibble` (a type of `data.frame` with good printing
+defaults) of the `lookfor()` output and use `filter()` to find variables
+with `"smok"` or `"cigar"` in their labels. Note, we convert both the
+variable names and labels to lower case to make the search case
+insensitive.
+
+```r
+bcs70_38y_lookfor <- lookfor(bcs70_38y) %>%
+  as_tibble() %>%
+  mutate(variable_low = str_to_lower(variable),
+         label_low = str_to_lower(label))
+
+bcs70_38y_lookfor %>%
+  filter(str_detect(label_low, "smok|cigar"))
+```
+
+``` text
+# A tibble: 1 × 9
+    pos variable label         col_type missing levels value_labels variable_low
+  <int> <chr>    <chr>         <chr>      <int> <name> <named list> <chr>       
+1    11 BD8SMOKE 2008: Smokin… dbl+lbl        0 <NULL> <dbl [9]>    bd8smoke    
+# ℹ 1 more variable: label_low <chr>
+```
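+
+The tibble above holds lower-case copies of both the variable names and
+the labels, so the same `filter()` can check either column. The sketch
+below illustrates this on a small made-up data frame (the variable names,
+labels and values are hypothetical, not taken from a BCS70 file);
+`set_variable_labels()` from `labelled` is used only to attach the toy
+labels.
+
+```r
+library(tidyverse)
+library(labelled)
+
+# Hypothetical toy data standing in for a BCS70 file (names and labels invented)
+toy <- tibble(
+  BD8SMOKE = c(0, 1, 2),
+  B8CIGARS = c(0, 5, 20),
+  BD8BMI   = c(22.1, 27.4, 30.2)
+) %>%
+  set_variable_labels(
+    BD8SMOKE = "2008: Smoking habits",
+    B8CIGARS = "2008: Number per day",
+    BD8BMI   = "2008: Body mass index"
+  )
+
+toy_lookfor <- lookfor(toy) %>%
+  as_tibble() %>%
+  mutate(variable_low = str_to_lower(variable),
+         label_low = str_to_lower(label))
+
+# Keep a variable if the keyword appears in its name OR its label
+hits <- toy_lookfor %>%
+  filter(str_detect(variable_low, "smok|cigar") |
+           str_detect(label_low, "smok|cigar"))
+
+hits %>% select(variable, label)
+```
+
+Here `B8CIGARS` is picked up through its name alone, which a label-only
+search would miss.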
+
+# `codebookr::codebook()`
+
+The BCS70 datasets that are downloadable from the UK Data Service come
+bundled with data dictionaries within the `mrdoc` subfolder. However,
+these are limited in some ways. The `codebookr` package enables the
+creation of data dictionaries that are more customisable and, in our
+opinion, easier to read. Below we create a codebook for the BCS70
+38-year sweep derived variable dataset loaded above. These codebooks
+are intended to be saved and viewed in Microsoft Word.
+
+```r
+cdb <- codebook(bcs70_38y)
+print(cdb, "bcs70_38y_codebook.docx") # Saves as .docx (Word) file
+```
+
+A screenshot of the codebook is shown below.
+
+<figure>
+<img src="../images/bcs70-data_discovery.png"
+alt="Codebook created by codebookr::codebook()" />
+<figcaption aria-hidden="true">Codebook created by
+codebookr::codebook()</figcaption>
+</figure>
+
+# Create a Lookup Table Across All Datasets
+
+Running `lookfor()` and `codebook()` one dataset at a time does not
+give a quick overview of the variables available across the BCS70,
+including which sweeps contain repeatedly measured characteristics.
+Below we create a `tibble`, `df_lookfor`, that contains
+`lookfor()` results for all the `.dta` files in the BCS70 folder.
+
+To do this, we create a function, `create_lookfor()`, that takes a file
+path to a `.dta` file, reads in the first row of the dataset (faster
+than reading the full dataset), and applies `lookfor()` to it. We call
+this function within a `mutate()` call to create a set of lookups
+for every `.dta` file we can find in the BCS70 folder. `map()` loops
+over every value in the `file_path` column, creating a corresponding
+lookup table for that file, stored as a
+[`list-column`](https://r4ds.hadley.nz/rectangling.html#list-columns).
+`unnest()` expands the results out, so rather than have one row per
+`file_path`, we have one row per variable.
+
+```r
+create_lookfor <- function(file_path){
+  read_dta(file_path, n_max = 1) %>%
+    lookfor() %>%
+    as_tibble()
+}
+
+df_lookfor <- tibble(file_path = list.files(pattern = "\\.dta$", recursive = TRUE)) %>%
+  filter(!str_detect(file_path, "^UKDS")) %>%
+  mutate(lookfor = map(file_path, create_lookfor)) %>%
+  unnest(lookfor) %>%
+  mutate(variable_low = str_to_lower(variable),
+         label_low = str_to_lower(label)) %>%
+  separate(file_path, 
+           into = c("sweep", "file"), 
+           sep = "/", 
+           remove = FALSE) %>% 
+  relocate(file_path, pos, .after = last_col())
+```
+
+``` text
+Warning: Expected 2 pieces. Additional pieces discarded in 3174 rows [28943, 28944,
+28945, 28946, 28947, 28948, 28949, 28950, 28951, 28952, 28953, 28954, 28955,
+28956, 28957, 28958, 28959, 28960, 28961, 28962, ...].
+```
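+
+The warning above is raised by `separate()` when a value splits into more
+pieces than there are names in `into`, which here means some file paths
+contain more than one `/`. By default the extra pieces are dropped. A
+minimal sketch of one way to keep them, using the `extra = "merge"`
+argument on made-up paths (the nested path below is hypothetical):
+
+```r
+library(tidyverse)
+
+# Hypothetical file paths; the last one mimics a file stored in a subfolder
+paths <- tibble(file_path = c("0y/bcs7072a.dta",
+                              "38y/bcs8derived.dta",
+                              "51y/subfolder/example.dta"))
+
+paths_split <- paths %>%
+  separate(file_path,
+           into = c("sweep", "file"),
+           sep = "/",
+           extra = "merge",  # split only once; the remainder stays in `file`
+           remove = FALSE)
+
+paths_split
+```
+
+If only the file name is wanted, `basename(file_path)` inside `mutate()`
+is another option.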
+
+We can use the resulting object to search for variables with `"smok"` or
+`"cigar"` in their labels.
+
+```r
+df_lookfor %>%
+  filter(str_detect(label_low, "smok|cigar")) %>%
+  select(file, variable, label)
+```
+
+``` text
+# A tibble: 294 × 3
+   file         variable label                                   
+   <chr>        <chr>    <chr>                                   
+ 1 bcs7072a.dta a0043b   SMOKING DURING PREGNANCY                
+ 2 bcs7072b.dta b0024    DOES THE CHILD'S MOTHER SMOKE TOBACCO ? 
+ 3 bcs7072b.dta b0025    IF NO WHEN DID SHE LAST SMOKE (MONTH) ? 
+ 4 bcs7072b.dta b0026    IF NO WHEN DID SHE LAST SMOKE (YEAR) ?  
+ 5 bcs7072b.dta b0027    ANSWER TO LAST SMOKED OTHER THAN A DATE 
+ 6 bcs7072b.dta b0028    HOW MANY SMOKED ( CIGARETTES ) ?        
+ 7 sn3723.dta   e9_1     MOTHER'S PRESENT SMOKING HABITS         
+ 8 sn3723.dta   e9_2     NO. OF CIGARETTES MOTHER SMOKES DAILY   
+ 9 sn3723.dta   e9_3     LENGTH OF TIME MOTHER HAS SMOKED        
+10 sn3723.dta   e10_1    MOTHER NON SMOKER NOW HAS SMOKED IN PAST
+# ℹ 284 more rows
+```
@@ -0,0 +1,145 @@
+---
+layout: default
+title: Reshaping Data from Long to Wide (or Wide to Long)
+nav_order: 4
+parent: BCS70
+format: docusaurus-md
+---
+
+
+
+
+# Introduction
+
+In this section, we show how to reshape data from long to wide (and vice
+versa). To demonstrate, we use data on cohort members’ height and weight
+collected in Sweeps 9 (42y) and 11 (51y).
+
+The packages we use are:
+
+```r
+# Load Packages
+library(tidyverse) # For data manipulation
+library(haven) # For importing .dta files
+```
+
+# Reshaping Raw Data from Wide to Long
+
+We begin by loading the data from each sweep and merging these together
+into a single wide format data frame; see [Combining Data Across
+Sweeps](https://cls-data.github.io/docs/bcs70-merging_across_sweeps.html)
+for further explanation on how this is achieved. Note, the names of the
+height and weight variables in Sweep 9 and Sweep 11 follow a similar
+convention, which is the exception rather than the rule in BCS70 data.
+Below, we convert the variable names in the Sweep 9 data frame to lower
+case so that they closely match those in the Sweep 11 data frame. This
+will make reshaping easier.
+
+```r
+df_42y <- read_dta("42y/bcs70_2012_derived.dta",
+                   col_select = c("BCSID", "BD9HGHTM", "BD9WGHTK")) %>%
+  rename_with(str_to_lower)
+
+df_51y <- read_dta("51y/bcs11_age51_main.dta",
+                   col_select = c("bcsid", "bd11hghtm", "bd11wghtk"))
+
+df_wide <- df_42y %>%
+  full_join(df_51y, by = "bcsid")
+```
+
+`df_wide` has 5 columns. Besides the identifier, `bcsid`, there are 4
+columns for height and weight measurements at each sweep. Each of these
+4 columns is prefixed by `bd` and the sweep number (`bd9` or `bd11`),
+indicating the sweep at assessment. We can reshape the dataset into
+long format (one row per person x sweep combination) using the
+`pivot_longer()` function so that the resulting data frame has four
+columns: one person identifier, a variable for sweep of assessment
+(`sweep`), and variables for height and weight. We specify the columns
+to be reshaped using the `cols` argument, provide the new variable names
+in the `names_to` argument, and the pattern the existing column names
+take using the `names_pattern` argument. For `names_pattern` we specify
+`"^bd(\\d{1,2})([A-Za-z].+)$"`, which breaks the column name into two
+pieces: the one or two digits that follow `bd` and indicate the sweep
+(`(\\d{1,2})`), and the remaining characters at the end of the name
+(`([A-Za-z].+)$`). `names_pattern` uses regular expressions. `.` matches
+any single character, and `.+` matches one or more characters. `\\d`
+denotes a digit. `[A-Za-z]` matches any alphabetic character, upper or
+lower case. As noted, the digits hold information on sweep of
+assessment; in the reshaped data frame they are stored as values in a
+new column, `sweep`. `.value` is a placeholder for the new columns in
+the reshaped data frame that store the values from the columns selected
+by `cols`; these new columns are named using the second piece from
+`names_pattern` - in this case `hghtm` (height) and `wghtk` (weight).
+
+```r
+df_long <- df_wide %>%
+  pivot_longer(cols = matches("^bd"),
+               names_to = c("sweep", ".value"),
+               names_pattern = "^bd(\\d{1,2})([A-Za-z].+)$")
+
+df_long
+```
+
+``` text
+# A tibble: 21,366 × 4
+   bcsid   sweep hghtm     wghtk    
+   <chr>   <chr> <dbl+lbl> <dbl+lbl>
+ 1 B10001N 9      1.55     55.8     
+ 2 B10001N 11     1.55     50.8     
+ 3 B10003Q 9      1.85     82.6     
+ 4 B10003Q 11     1.85     83.5     
+ 5 B10004R 9      1.60     57.2     
+ 6 B10004R 11     1.6      57.2     
+ 7 B10007U 9      1.52     82.6     
+ 8 B10007U 11    NA        NA       
+ 9 B10009W 9      1.63     54.9     
+10 B10009W 11     1.63     60.3     
+# ℹ 21,356 more rows
+```
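+
+To check what a `names_pattern` will capture before reshaping, the
+pattern can be applied directly to the column names. The sketch below
+uses base R's `regexec()` and `regmatches()` for this; it is only an
+illustration of the regular expression, not of how `pivot_longer()` works
+internally.
+
+```r
+pattern <- "^bd(\\d{1,2})([A-Za-z].+)$"
+cols <- c("bd9hghtm", "bd9wghtk", "bd11hghtm", "bd11wghtk")
+
+# regexec() returns the full match followed by each capture group
+parts <- do.call(rbind, regmatches(cols, regexec(pattern, cols)))
+colnames(parts) <- c("column", "sweep", ".value")
+
+parts
+```
+
+The first capture group supplies the values of `sweep` and the second
+supplies the names of the new columns.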
+
+# Reshaping Raw Data from Long to Wide
+
+We can also reshape the data from long to wide format using the
+`pivot_wider()` function. In this case, we want to create two new
+columns for each sweep: one for height and one for weight. We specify
+the columns holding the values to be spread using the `values_from`
+argument, name the column that identifies the sweep in the `names_from`
+argument, and use the `names_glue` argument to specify the convention to
+follow for the new column names. The `names_glue` argument uses curly
+braces (`{}`) to reference the columns given in `names_from` and the
+special `.value` placeholder. As we are specifying multiple columns in
+`values_from`, `.value` stands in for the names of the variables
+selected in `values_from`.
+
+```r
+df_long %>%
+  pivot_wider(names_from = sweep,
+              values_from = c(hghtm, wghtk),
+              names_glue = "{.value}_{sweep}")
+```
+
+``` text
+# A tibble: 10,683 × 5
+   bcsid   hghtm_9   hghtm_11  wghtk_9              wghtk_11 
+   <chr>   <dbl+lbl> <dbl+lbl> <dbl+lbl>            <dbl+lbl>
+ 1 B10001N 1.55       1.55      55.8                 50.8    
+ 2 B10003Q 1.85       1.85      82.6                 83.5    
+ 3 B10004R 1.60       1.6       57.2                 57.2    
+ 4 B10007U 1.52      NA         82.6                 NA      
+ 5 B10009W 1.63       1.63      54.9                 60.3    
+ 6 B10010P 1.65      NA         -8 [No information]  NA      
+ 7 B10011Q 1.63       1.65      76.2                 82.6    
+ 8 B10013S 1.63       1.63      63.5                 66.7    
+ 9 B10015U 1.83       1.8       77.6                 82.6    
+10 B10016V 1.88       1.88     114.                 118      
+# ℹ 10,673 more rows
+```
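+
+`names_glue` can produce any naming convention, including the one used in
+the original files. As a minimal sketch on made-up long-format data (the
+identifiers and values below are hypothetical), the glue string
+`"bd{sweep}{.value}"` rebuilds names of the form `bd9hghtm`:
+
+```r
+library(tidyverse)
+
+# Hypothetical toy data in long format (values invented)
+toy_long <- tibble(
+  bcsid = c("A", "A", "B", "B"),
+  sweep = c("9", "11", "9", "11"),
+  hghtm = c(1.55, 1.55, 1.85, 1.85),
+  wghtk = c(55.8, 50.8, 82.6, 83.5)
+)
+
+# Put the sweep number between the `bd` prefix and the variable stem
+toy_wide <- toy_long %>%
+  pivot_wider(names_from = sweep,
+              values_from = c(hghtm, wghtk),
+              names_glue = "bd{sweep}{.value}")
+
+names(toy_wide)
+```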
+
+Note, in the original `df_wide` tibble, the height and weight variables
+were labelled numeric vectors - this class allows users to add metadata
+to variables (value labels, etc.). When reshaping to long format,
+multiple variables are effectively appended together, but the final
+reshaped variables can only have one set of properties. `pivot_longer()`
+tries to preserve variable attributes, but in some cases will throw an
+error (where variables are of inconsistent types) or print a warning
+(where value labels are inconsistent).
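+
+One possible way to sidestep inconsistent value labels is to remove them
+before reshaping. The sketch below does this with `haven::zap_labels()`
+on made-up data (the values and value labels are hypothetical). It is one
+option among several and assumes the labels are not needed afterwards.
+
+```r
+library(tidyverse)
+library(haven)
+
+# Hypothetical toy data: the two weight variables carry different value labels
+toy_wide <- tibble(
+  bcsid = c("A", "B"),
+  bd9wghtk = labelled(c(55.8, -8), labels = c("No information" = -8)),
+  bd11wghtk = labelled(c(50.8, -1), labels = c("Not applicable" = -1))
+)
+
+toy_long <- toy_wide %>%
+  # Drop value labels, leaving plain numeric columns
+  mutate(across(matches("^bd"), zap_labels)) %>%
+  pivot_longer(cols = matches("^bd"),
+               names_to = c("sweep", ".value"),
+               names_pattern = "^bd(\\d{1,2})([A-Za-z].+)$")
+
+toy_long
+```
+
+Note that `zap_labels()` keeps the underlying codes, so negative
+missing-value codes such as `-8` remain in the data as numbers and still
+need to be recoded.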
@@ -5,4 +5,16 @@ has_children: true
 nav_order: 4
 ---
 
-This section will be populated as soon as possible.
+This section presents code to clean and handle data from the 1970 British Cohort Study (BCS70). The BCS70 has relatively straightforward data structures. Difficulties mainly arise from its historic nature - data from older sweeps does not conform to modern metadata standards (e.g., file and variable names are not explanatory). This can make it challenging to find relevant variables. In "Data Discovery", we provide code to assist with data discovery (i.e., creating searchable data dictionaries).
+
+## Miscellanea
+
+There are a few characteristics of the BCS70 that data users should be aware of.
+
+-   The cohort member identifier variable is `BCSID`. In some datasets, this variable is stored in lower case, `bcsid`.
+-   Almost all datasets are at the cohort member level, with one row per cohort member. Exceptions are the derived partnership histories and activity histories datasets.
+-   There is a small number of twins in the BCS70. A cohort member's twin can be identified with the `twincode` variable within the `bcs70_response_1970-2021.dta` dataset. This variable does not work as a family ID variable.
+-   Later sweeps of the BCS70 follow consistent naming conventions (e.g., self-rated health is named `B*HLTHGN` [where `*` is a sweep number]), but earlier sweeps do not (e.g., self-rated health is named `b96043` at age 26y).
+-   Later sweeps of the BCS70 have included survey items also collected in the NCDS. Both studies use the same naming conventions for variable names in these sweeps, which can help with harmonisation.
+-   The BCS70 did not track all cohort members over time, including the 628 individuals born in Northern Ireland. The `bcs70_response_1970-2021.dta` dataset tracks only those individuals eligible for longitudinal follow-up, so sample sizes are smaller than if all wave-specific datasets are merged together. Users should consider using `bcs70_response_1970-2021.dta` to define their sample.
+-   Negative values are typically reserved for different forms of missingness ("Don't know", "Refuse", "Not applicable", etc.), but in some cases - especially with older datasets - missing values have positive values instead.
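+
+As a minimal sketch of handling the missing-value codes described in the
+last point, negative values can be set to `NA` before analysis. The data
+frame and variable name below are hypothetical, and the code assumes that
+every negative value is a missing-value code; value labels should be
+checked first, particularly in older datasets where positive codes may
+also denote missingness.
+
+```r
+library(tidyverse)
+
+# Hypothetical toy data; the variable name and values are illustrative only
+toy <- tibble(bcsid = c("A", "B", "C", "D"),
+              hlthgn = c(1, -8, 3, -1))
+
+# Assumption: all negative values are missing-value codes
+toy_clean <- toy %>%
+  mutate(hlthgn = if_else(hlthgn < 0, NA_real_, hlthgn))
+
+toy_clean
+```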