part-wrangling/01-data-input.qmd
1 addition, 1 deletion
@@ -46,7 +46,7 @@ See @sec-lists for some tools to work with these files.

-   **Spatial files**: Shapefiles are the most common version of spatial files, though there are a seemingly infinite number of different formats, and new formats pop up at the most inconvenient times.
    Spatial files often include structured encodings of geographic information plus corresponding tabular format data that goes with the geographic information.
-    @sec-spatial covers some of the tools available for working with spatial data.
+    <!--@sec-spatial covers some of the tools available for working with spatial data.-->

To be minimally functional in R and Python, it's important to know how to read in text files (CSV, tab-delimited, etc.).
It can be helpful to also know how to read in XLSX files.
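As a quick illustration of the minimal skill set described above (not part of the original diff), here is a sketch of reading delimited text and XLSX files in R. The file names are hypothetical, and the `readxl` package is an assumption; the chapter's own chunks (e.g. `wallaby.csv` below) use base `read.csv()`.

```r
# Reading delimited text files with base R (hypothetical file names)
survey  <- read.csv("../data/survey.csv")      # comma-separated values
weather <- read.delim("../data/weather.txt")   # tab-delimited text

# Reading an XLSX file; readxl is one common choice (an assumption here)
library(readxl)
budget <- read_excel("../data/budget.xlsx", sheet = 1)

head(survey)
```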
part-wrangling/05b-normal-forms.qmd
50 additions, 38 deletions
@@ -55,8 +55,9 @@ This data is part of the Australian data and story library [OzDASL](https://gksm
See the [help page](http://www.statsci.org/data/oz/wallaby.html) for information about this data set and each of its variables.

```{r}
+library(knitr)
wallaby <- read.csv("../data/wallaby.csv")
-head(wallaby) %>% knitr::kable(caption="First few lines of the wallaby data.")
+head(wallaby) |> kable(caption="First few lines of the wallaby data.")
```

::: panel-tabset
@@ -70,21 +71,23 @@ We can check whether that condition is fulfilled by tallying up the combination
#### Is Anim the key? {.unnumbered}

```{r}
-wallaby %>%
-  count(Anim, sort = TRUE) %>%
-  head() %>%
-  knitr::kable(caption="Anim by itself is not uniquely identifying rows in the data.")
+library(dplyr)
+wallaby |>
+  count(Anim, sort = TRUE) |>
+  head() |>
+  kable(caption="Anim by itself is not uniquely identifying rows in the data.")
```

Anim by itself is not a key variable, because for some animal ids we have multiple sets of measurements.

#### Is Anim + Age the key? {.unnumbered}

```{r}
-wallaby %>%
-  count(Anim, Age, sort = TRUE) %>%
-  head() %>%
-  knitr::kable(caption="All combinations of animal ID and an animal's age only refer to one set of measurements.")
+library(dplyr)
+wallaby |>
+  count(Anim, Age, sort = TRUE) |>
+  head() |>
+  kable(caption="All combinations of animal ID and an animal's age only refer to one set of measurements.")
```

The combination of `Anim` and `Age` uniquely describes each observation, and is therefore a key for the data set.
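Not part of the original diff: an equivalent, more compact key check compares the number of distinct combinations against the number of rows. This sketch assumes `dplyr` is loaded, as in the chunks above.

```r
# A candidate set of columns is a key when each combination occurs exactly once,
# i.e. when the number of distinct combinations equals the number of rows.
nrow(distinct(wallaby, Anim)) == nrow(wallaby)        # FALSE: Anim alone is not a key
nrow(distinct(wallaby, Anim, Age)) == nrow(wallaby)   # TRUE when (Anim, Age) is a key
```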
@@ -134,24 +137,26 @@ Below is a snapshot of a reshaped version of the previous example, where all mea
While `Anim` now should be a key variable (presumably it uniquely identifies each animal), the data set is still not in first normal form, because the entries in the variable `measurements` are data sets by themselves, not just single values.
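Not part of the original diff: a sketch of how such a nested structure can be produced with `tidyr::nest()`, assuming (as the later 2nd-normal-form discussion suggests) that `Anim`, `Sex`, and `Loca` are the animal-level columns. It illustrates the 1st-normal-form violation; it is not necessarily the reshaping code hidden in the collapsed part of the diff.

```r
library(tidyr)

# Collapse all measurement columns into a single list-column per animal.
wallaby_nested <- wallaby |>
  nest(measurements = -c(Anim, Sex, Loca))

head(wallaby_nested)
# Each entry of `measurements` is itself a small data frame, so the cells are
# not atomic values and the table is not in first normal form.
```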
knitr::kable(caption="We see in the frequency breakdown of `Anim`, that the animal ID for 125 is used twice, i.e. 125 seems to describe two different animals. This would indicate that animal numbers do not refer to an individual animal as the data description suggests.")
147
152
```
This finding is a sign of an inconsistency in the data set - and just a first example of why we care about normal forms.
Here, we identify the first entry in the results below as a completely different animal - it is male and lives in a different location.
Most likely, this is a wrongly identified animal.

```{r}
-wallaby %>%
-  filter(Anim == 125) %>% head() %>%
+wallaby |>
+  filter(Anim == 125) |> head() |>
  knitr::kable(caption="Based on the listing for the values of animal 125, the very first entry does not fit in well with the other values.")
```
@@ -172,16 +177,18 @@ Its timing (`Age`) and value (`Weight`) make it the best fit for animal 126.
```{r}
#| echo: false
#| warning: false
-#| label: fig-wallaby125
+#| label: wallaby125
#| fig-cap: "Scatterplots of Weight by Age, facetted by animal number. Only male specimens in location Ha are considered. The orange point shows the measurements of the entry wrongly assigned to animal 125 (which is female and lives in a different location)."
-#|
-wallaby %>%
-  filter(Loca == "Ha", Anim!=125, Sex == 1, !is.na(Weight)) %>%
-  mutate(Anim=factor(Anim)) %>%
+
+library(ggplot2)
+
+wallaby |>
+  filter(Loca == "Ha", Anim!=125, Sex == 1, !is.na(Weight)) |>
@@ -193,19 +200,24 @@ In @fig-wallaby126 all measurements for wallaby 126 are shown by age (in days) w

```{r}
#| echo: false
-#| label: fig-wallaby126
+#| label: wallaby126
#| fig-cap: "Growth curves of animal 126 for all the measurements taken between days 50 and 200. In orange, the additional entry for animal 125 is shown. The values are consistent with animal 126's growth."
Making any changes to a dataset should never be done lightly; changes should always be well argued and well documented.
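Not part of the original diff: a sketch of what a documented correction along these lines might look like. The object name `wallaby_fixed` and the filter condition are assumptions for illustration; the book's own cleaned object (`wallaby_cleaner`, used further below) is presumably created in a collapsed portion of the diff.

```r
library(dplyr)

# Reassign the single male (Sex == 1) entry recorded under animal 125:
# its Age and Weight match animal 126's growth curve, while animal 125
# itself is female and lives in a different location.
wallaby_fixed <- wallaby |>
  mutate(Anim = ifelse(Anim == 125 & Sex == 1, 126, Anim))

# The original `wallaby` object is left untouched, so the change is easy
# to audit and to revert.
```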
@@ -239,7 +251,7 @@ Note, that tables in 1st normal form with a single key variable are automaticall
Regarding the example of the `wallaby` dataset, we see the dataset in its basic form is not in 2nd normal form, because the two non-key variables `Sex` (biological sex of the animal) and the animal's location (`Loca`) only depend on the animal number `Anim`, and not on the `Age` variable.

```{r}
-wallaby2 %>% group_by(Anim)
+wallaby2 |> group_by(Anim)
```

### Normalization: 1st NF to 2nd NF
@@ -257,27 +269,27 @@ In the example of the `wallaby` data, we have identified the non-key variables `
We separate those variables into the data set `wallaby_demographics` and reduce the number of rows, keeping a tally of how many rows each combination summarizes.

```{r}
-wallaby_demographics <- wallaby_cleaner %>%
-  select(Anim, Sex, Loca) %>%
+wallaby_demographics <- wallaby_cleaner |>
+  select(Anim, Sex, Loca) |>
  count(Anim, Sex, Loca)
# Don't need the total number
```

Once we have verified that `Anim` is a key for `wallaby_demographics`, we know that this table is in 2nd normal form.

```{r}
-wallaby_demographics %>%
-  count(Anim, sort=TRUE) %>%
+wallaby_demographics |>
+  count(Anim, sort=TRUE) |>
  head()
```

With the key-splitting variables `Sex` and `Loca` being taken care of in the second dataset, we can safely remove those variables from the original data. To preserve the original, we actually create a separate copy called `wallaby_measurements`:

```{r}
-wallaby_measurements <- wallaby_cleaner %>%
+wallaby_measurements <- wallaby_cleaner |>
  select(-Sex, -Loca)

-wallaby_measurements %>%
+wallaby_measurements |>
  head()
```
@@ -324,16 +336,16 @@ For the `wallaby_demographics`, we will split the data into two parts: `wallaby_
Another approach to bringing a dataset into key-value pairs is to summarize the values of a set of variables by introducing a new key for the variable names.
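Not part of the original diff: a sketch of that key-value reshaping using `tidyr::pivot_longer()`. The choice of `wallaby_measurements` and of `Anim` and `Age` as the identifying columns is an assumption based on the earlier chunks.

```r
library(tidyr)

wallaby_long <- wallaby_measurements |>
  pivot_longer(
    cols = -c(Anim, Age),       # every measurement column
    names_to = "measurement",   # new key: the former variable names
    values_to = "value"         # a single value column
  )

head(wallaby_long)
```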
file = {Accessing SQLite Databases Using Python and Pandas – Data Analysis and Visualization in Python for Ecologists:/home/susan/Nextcloud/Zotero/storage/J7YECHEL/index.html:text/html},
+}
+
+@online{datatofishHowConnectPython2021,
+  title = {How to Connect Python to {MS} Access Database using Pyodbc},