
Commit 642c289

Fixing references and pipe forms
Susan Vanderplas committed
1 parent 6c2a6c5, commit 642c289

3 files changed: +93 -39 lines changed

part-wrangling/01-data-input.qmd

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ See @sec-lists for some tools to work with these files.
 
 - **Spatial files**: Shapefiles are the most common version of spatial files, though there are a seemingly infinite number of different formats, and new formats pop up at the most inconvenient times.
 Spatial files often include structured encodings of geographic information plus corresponding tabular format data that goes with the geographic information.
-@sec-spatial covers some of the tools available for working with spatial data.
+<!-- @sec-spatial covers some of the tools available for working with spatial data. -->
 
 To be minimally functional in R and Python, it's important to know how to read in text files (CSV, tab-delimited, etc.).
 It can be helpful to also know how to read in XLSX files.

part-wrangling/05b-normal-forms.qmd

Lines changed: 50 additions & 38 deletions
@@ -55,8 +55,9 @@ This data is part of the Australian data and story library [OzDASL](https://gksm
 See the [help page](http://www.statsci.org/data/oz/wallaby.html) for information about this data set and each of its variables.
 
 ```{r}
+library(knitr)
 wallaby <- read.csv("../data/wallaby.csv")
-head(wallaby) %>% knitr::kable(caption="First few lines of the wallaby data.")
+head(wallaby) |> kable(caption="First few lines of the wallaby data.")
 ```
 
 ::: panel-tabset
@@ -70,21 +71,23 @@ We can check whether that condition is fulfilled by tallying up the combination
 #### Is Anim the key? {.unnumbered}
 
 ```{r}
-wallaby %>%
-  count(Anim, sort = TRUE) %>%
-  head() %>%
-  knitr::kable(caption="Anim by itself is not uniquely identifying rows in the data.")
+library(dplyr)
+wallaby |>
+  count(Anim, sort = TRUE) |>
+  head() |>
+  kable(caption="Anim by itself is not uniquely identifying rows in the data.")
 ```
 
 Anim by itself is not a key variable, because for some animal ids we have multiple sets of measurements.
 
 #### Is Anim + Age the key? {.unnumbered}
 
 ```{r}
-wallaby %>%
-  count(Anim, Age, sort = TRUE) %>%
-  head() %>%
-  knitr::kable(caption="All combinations of animal ID and an animal's age only refer to one set of measurements.")
+library(dplyr)
+wallaby |>
+  count(Anim, Age, sort = TRUE) |>
+  head() |>
+  kable(caption="All combinations of animal ID and an animal's age only refer to one set of measurements.")
 ```
 
 The combination of `Anim` and `Age` uniquely describes each observation, and is therefore a key for the data set.
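
The key check in this hunk can be condensed into a small helper. The following is a minimal sketch, not part of the commit: the data frame `toy` and the helper `is_key()` are hypothetical names introduced for illustration, and the values are made up rather than read from `wallaby.csv`.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the wallaby file)
toy <- data.frame(
  Anim   = c(1, 1, 2, 2, 2),
  Age    = c(10, 20, 10, 20, 30),
  Weight = c(100, 150, 90, 140, 200)
)

# Hypothetical helper: TRUE if the given columns identify every row uniquely,
# i.e. every combination of their values occurs exactly once
is_key <- function(data, ...) {
  counts <- count(data, ...)
  all(counts$n == 1)
}

is_key(toy, Anim)       # FALSE: each animal has several rows
is_key(toy, Anim, Age)  # TRUE: animal and age together are unique
```

Returning a single logical value makes the check easy to use in an assertion, whereas the `count() |> head()` form in the diff is better suited to inspecting which values are duplicated.
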
@@ -134,24 +137,26 @@ Below is a snapshot of a reshaped version of the previous example, where all mea
 While `Anim` now should be a key variable (presumably it uniquely identifies each animal), the data set is still not in first normal form, because the entries in the variable `measurements` are data sets by themselves, not just single values.
 
 ```{r}
-wallaby2 <- wallaby %>%
+library(tidyr)
+
+wallaby2 <- wallaby |>
   nest(.by = c("Anim", "Sex", "Loca"), .key="measurements")
-wallaby2 %>% head()
+wallaby2 |> head()
 ```
 Is `Anim` the key variable of `wallaby2`?
 For that we check whether the `Anim` variable is unique - and find out that it is not unique!
 
 ```{r}
-wallaby2 %>% count(Anim, sort=TRUE) %>% head() %>%
+wallaby2 |> count(Anim, sort=TRUE) |> head() |>
   knitr::kable(caption="We see in the frequency breakdown of `Anim`, that the animal ID for 125 is used twice, i.e. 125 seems to describe two different animals. This would indicate that animal numbers do not refer to an individual animal as the data description suggests.")
 ```
 This finding is a sign of an inconsistency in the data set - and just a first example of why we care about normal forms.
 Here, we identify the first entry in the results below as a completely different animal - it is male and lives in a different location.
 Most likely, this is a wrongly identified animal.
 
 ```{r}
-wallaby %>%
-  filter(Anim == 125) %>% head() %>%
+wallaby |>
+  filter(Anim == 125) |> head() |>
   knitr::kable(caption="Based on the listing for the values of animal 125, the very first entry does not fit in well with the other values.")
 ```
 
@@ -172,16 +177,18 @@ Its timing (`Age`) and value (`Weight`) make it the best fit for animal 126.
 ```{r}
 #| echo: false
 #| warning: false
-#| label: fig-wallaby125
+#| label: wallaby125
 #| fig-cap: "Scatterplots of Weight by Age, facetted by animal number. Only male specimen in location Ha are considered. The orange point shows the measurements of the entry wrongly assigned to animal 125 (which is female and lives in a different location)."
-#|
-wallaby %>%
-  filter(Loca == "Ha", Anim!=125, Sex == 1, !is.na(Weight)) %>%
-  mutate(Anim=factor(Anim)) %>%
+
+library(ggplot2)
+
+wallaby |>
+  filter(Loca == "Ha", Anim!=125, Sex == 1, !is.na(Weight)) |>
+  mutate(Anim=factor(Anim)) |>
   ggplot(aes(x = Age, y = Weight)) +
   geom_point(size = 4, alpha = 0.5) +
   facet_wrap(~Anim) +
-  geom_point(data = filter(wallaby, Anim==125, Sex==1) %>% select(-Anim),
+  geom_point(data = filter(wallaby, Anim==125, Sex==1) |> select(-Anim),
     colour = "darkorange", size = 4, alpha = 0.8) +
   xlim(c(50,200)) +
   ylim(c(0,5000)) +
@@ -193,19 +200,24 @@ In @fig-wallaby126 all measurements for wallaby 126 are shown by age (in days) w
 
 ```{r}
 #| echo: false
-#| label: fig-wallaby126
+#| label: wallaby126
 #| fig-cap: "Growth curves of animal 126 for all the measurements taken between days 50 and 200. In orange, the additional entry for animal 125 is shown. The values are consistent with animal 126's growth."
-wallaby_long <- wallaby %>%
-  pivot_longer(c("Weight", "Head":"Tail"), values_to = "Measurement", names_to="Trait") %>%
+
+library(tidyr)
+library(dplyr)
+library(ggplot2)
+
+wallaby_long <- wallaby |>
+  pivot_longer(c("Weight", "Head":"Tail"), values_to = "Measurement", names_to="Trait") |>
   filter(Trait!="Leng")
 
-wallaby_long %>%
-  filter(Anim==126, between(Age, 50, 200)) %>%
+wallaby_long |>
+  filter(Anim==126, between(Age, 50, 200)) |>
   ggplot(aes(x = Age, y = Measurement)) +
   geom_point(size = 4, alpha = 0.5) +
   facet_wrap(~Trait, scales="free_y") +
-  geom_point(data = wallaby_long %>%
-      filter(Anim==125, Sex==1) %>%
+  geom_point(data = wallaby_long |>
+      filter(Anim==125, Sex==1) |>
       select(-Anim),
     colour = "darkorange", size = 4, alpha = 0.8) +
   theme_bw() +
@@ -219,7 +231,7 @@ wallaby_long %>%
 As a direct result of the normalization step, we make a change (!!!) to the orignial data.
 ```{r}
 # Cleaning Step 1
-wallaby_cleaner <- wallaby %>%
+wallaby_cleaner <- wallaby |>
   mutate(Anim = ifelse(Anim==125 & Sex==1, 126, Anim))
 ```
 Making any changes to a dataset should never be done light-heartedly, always be well argued and well-documented.
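
One way to document such a correction is to record how many rows it touched. The following is a minimal sketch, not part of the commit: the data frame `toy`, the result `toy_cleaner`, and the counter `n_changed` are hypothetical names, and the values are made up rather than read from `wallaby.csv`.

```r
library(dplyr)

# Hypothetical toy rows (made-up values, not the wallaby file)
toy <- data.frame(
  Anim   = c(125, 125, 125, 126, 126),
  Sex    = c(1, 2, 2, 1, 1),
  Weight = c(900, 400, 650, 700, 1100)
)

# Same recoding rule as Cleaning Step 1, applied to the toy data
toy_cleaner <- toy |>
  mutate(Anim = ifelse(Anim == 125 & Sex == 1, 126, Anim))

# Record how many rows the correction changed
n_changed <- sum(toy$Anim != toy_cleaner$Anim)
n_changed
```

Comparing the cleaned column with the original gives a simple audit figure that can be reported alongside the reasoning for the change.
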
@@ -239,7 +251,7 @@ Note, that tables in 1st normal form with a single key variable are automaticall
 Regarding the example of the `wallaby` dataset, we see the dataset in its basic form is not in 2nd normal form, because the two non-key variables `Sex` (biological sex of the animal) and the animal's location (`Loca`) only depend on the animal number `Anim`, and not on the `Age` variable.
 
 ```{r}
-wallaby2 %>% group_by(Anim)
+wallaby2 |> group_by(Anim)
 ```
 
 ### Normalization: 1st NF to 2nd NF
@@ -257,27 +269,27 @@ In the example of the `wallaby` data, we have identified the non-key variables `
 We separate those variables into the data set `wallaby_demographics` and reduce the number of rows by finding a tally of the number of rows we summarize.
 
 ```{r}
-wallaby_demographics <- wallaby_cleaner %>%
-  select(Anim, Sex, Loca) %>%
+wallaby_demographics <- wallaby_cleaner |>
+  select(Anim, Sex, Loca) |>
   count(Anim, Sex, Loca)
 # Don't need the total number
 ```
 
 Once we have verified that `Anim` is a key for `wallaby_demographics`, we know that this table is in 2nd normal form.
 
 ```{r}
-wallaby_demographics %>%
-  count(Anim, sort=TRUE) %>%
+wallaby_demographics |>
+  count(Anim, sort=TRUE) |>
   head()
 ```
 
 With the key-splitting variables `Sex` and `Loca` being taken care of in the second dataset, we can safely remove those variables from the original data. To preserve the original, we actually create a separate copy called `wallaby_measurements`:
 
 ```{r}
-wallaby_measurements <- wallaby_cleaner %>%
+wallaby_measurements <- wallaby_cleaner |>
   select(-Sex, -Loca)
 
-wallaby_measurements %>%
+wallaby_measurements |>
   head()
 ```
 
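
Splitting a table this way should lose no information: joining the two parts on the key should give back the original rows. The following is a minimal sketch of that check, not part of the commit. The toy data and all `toy_*` names are hypothetical, and `distinct()` is swapped in for the `count()` call used in the diff so that no extra `n` column is created.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the wallaby file)
toy <- data.frame(
  Anim   = c(1, 1, 2, 2),
  Age    = c(10, 20, 10, 20),
  Sex    = c(1, 1, 2, 2),
  Loca   = c("Ha", "Ha", "W", "W"),
  Weight = c(100, 150, 90, 140)
)

# One row per animal: variables that depend on Anim alone
toy_demographics <- toy |> distinct(Anim, Sex, Loca)

# One row per animal and age: everything else
toy_measurements <- toy |> select(-Sex, -Loca)

# Join the parts back together and restore the original column order
rejoined <- toy_measurements |>
  left_join(toy_demographics, by = "Anim") |>
  select(all_of(names(toy)))

isTRUE(all.equal(rejoined, toy))
```

If the join produced more rows than the original, `Anim` would not be a key of the demographics table, which is the same inconsistency the `count(Anim, sort=TRUE)` check is meant to catch.
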

@@ -324,16 +336,16 @@ For the `wallaby_demographics`, we will split the data into two parts: `wallaby_
 
 
 ```{r}
-wallaby_sex <- wallaby_demographics %>% select(Anim, Sex)
-wallaby_Location <- wallaby_demographics %>% select(Anim, Loca)
+wallaby_sex <- wallaby_demographics |> select(Anim, Sex)
+wallaby_Location <- wallaby_demographics |> select(Anim, Loca)
 ```
 
 ##### Pivoting (R)
 
 Another approach to bringing a dataset into key-value pairs is to summarize the values of a set of variables by introducing a new key for the variable names.
 
 ```{r}
-wallaby_measurements_long <- wallaby_measurements %>%
+wallaby_measurements_long <- wallaby_measurements |>
   pivot_longer(cols = 'Leng':'Weight', names_to="Traits", values_to="Measurements",
     values_drop_na = TRUE)
 
references.bib

Lines changed: 42 additions & 0 deletions
@@ -2234,3 +2234,45 @@ @online{xieDontUseSpaces2018
   date = {2018-03-15},
   langid = {american},
 }
+
+@online{datacarpentryAccessingSQLiteDatabases2019,
+  title = {Accessing {SQLite} Databases Using Python and Pandas – Data Analysis and Visualization in Python for Ecologists},
+  url = {https://datacarpentry.org/python-ecology-lesson/09-working-with-sql/index.html},
+  titleaddon = {Data Analysis and Visualization in Python for Ecologists},
+  author = {{Data Carpentry}},
+  urldate = {2022-06-13},
+  date = {2019-06-04},
+  note = {tex.ids= {thecarpentriesAccessingSQLiteDatabases}2022},
+  file = {Accessing SQLite Databases Using Python and Pandas – Data Analysis and Visualization in Python for Ecologists:/home/susan/Nextcloud/Zotero/storage/J7YECHEL/index.html:text/html},
+}
+
+@online{datatofishHowConnectPython2021,
+  title = {How to Connect Python to {MS} Access Database using Pyodbc},
+  url = {https://datatofish.com/how-to-connect-python-to-ms-access-database-using-pyodbc/},
+  abstract = {Need to connect Python to {MS} Access database? If so, you'll see an easy way to connect Python to an Access database using pyodbc.},
+  titleaddon = {Data to Fish},
+  author = {{Data to Fish}},
+  urldate = {2022-06-13},
+  date = {2021-08-21},
+  langid = {american},
+  file = {Snapshot:/home/susan/Nextcloud/Zotero/storage/M2Z2HFRW/how-to-connect-python-to-ms-access-database-using-pyodbc.html:text/html},
+}
+
+@online{juliangoodareSurveyScottishWitchcraft2003,
+  title = {The Survey of Scottish Witchcraft},
+  url = {www.shc.ed.ac.uk/witches/},
+  author = {{Julian Goodare} and {Lauren Martin} and {Joyce Miller} and {Louise Yeoman}},
+  urldate = {2022-06-13},
+  date = {2003-01},
+}
+@online{teamexploratoryHowImportData2022,
+  title = {How to import Data from Microsoft Access Database with {ODBC}},
+  url = {https://exploratory.io/note/exploratory/How-to-import-Data-from-Microsoft-Access-Database-with-ODBC-zIJ2bjs2},
+  abstract = {This explains How to import Data from Microsoft Access Database with {ODBC}.},
+  titleaddon = {Exploratory.io},
+  author = {{Team Exploratory}},
+  urldate = {2022-06-13},
+  date = {2022-02-26},
+  file = {Snapshot:/home/susan/Nextcloud/Zotero/storage/G85GBQWY/How-to-import-Data-from-Microsoft-Access-Database-with-ODBC-zIJ2bjs2.html:text/html},
+}
+