part-wrangling/02-basic-data-vis.qmd
Lines changed: 197 additions & 7 deletions
@@ -1,11 +1,12 @@
1
1
# Data Visualization Basics {#sec-basic-data-vis}
2
2
3
-
This section is intended as a very light overview of how you might create charts in R and python. @sec-data-vis will be much more in depth.
3
+
This section is intended as a very light overview of how you might create charts in R and Python.
4
+
@sec-data-vis will be much more in depth.
4
5
5
6
## Objectives {-}
6
7
7
8
- Use ggplot2/seaborn to create a chart
8
-
- Begin to identify issues with data formatting
9
+
- Begin to identify issues with data formatting that need to be resolved before creating a chart
9
10
10
11
## Package Installation
11
12
@@ -52,7 +53,8 @@ Once you have run this command, please comment it out so that you don't reinstal
52
53
## First Steps
53
54
54
55
Now that you can read data into R and Python and define new variables, you can create plots!
55
-
Data visualization is a skill that takes a lifetime to learn, but for now, let's start out easy: let's talk about how to make (basic) plots in R (with `ggplot2`) and in python (with `seaborn`, which has similar).
56
+
Data visualization is a skill that takes a lifetime to learn, but for now, let's start out easy: let's talk about how to make (basic) plots in R (with `ggplot2`) and in Python (with `seaborn`, which has a similar approach to charts).
57
+
You can read more about this approach, called the **grammar of graphics**, in @sec-data-vis.
If you pass a data frame in as the data argument, you can refer to columns in that data with "bare" column names (you don't have to reference the full data object using `df$name` or `df.name`; you can instead use `name` in R or `"name"` in Python).
97
99
@@ -135,10 +137,17 @@ plt.show()
135
137
136
138
### Data Formatting
137
139
138
-
If your data is in the right format, ggplot2 is very easy to use; if your data aren't formatted neatly, it can be a real pain.
140
+
If your data is in the right format, `ggplot2` is very easy to use; if your data isn't formatted neatly, it can be a real pain.
139
141
If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) get your data into "long form".
140
142
We'll learn more about how to do this type of data transformation when we talk about [reshaping data](05-data-reshape.qmd).
141
143
144
+
::: {.column-margin .callout}
145
+
#### Tip {-}
146
+
It's helpful to start thinking about what format your data is in, and what format it would need to be in for you to plot it.
147
+
Sketching a data frame for the "as is" condition and the "to plot" condition is a useful skill to cultivate.
148
+
149
+
:::
150
+
142
151
::: callout-demo
143
152
144
153
You don't need to know exactly how this works, but it is helpful to see the difference in the two datasets:
In the long form of the data, we have a row for each data point (year x measurement type), not for each year.
186
+
I've shown the same amount of data (6 years, 9 measurements) in this table as in the original data, but this takes up much more vertical space!
177
187
178
188
:::
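
As a rough illustration of the wide-to-long idea, here is a minimal sketch using `pandas.DataFrame.melt`. The data frame, its column names, and all of the numbers are made up for illustration; they are not the HBCU data.

```python
import pandas as pd

# Hypothetical wide-form data: one row per year, one column per measurement.
# All values are invented for this sketch.
wide_df = pd.DataFrame({
    "Year": [1976, 1980, 1982],
    "Males": [100, 110, 105],
    "Females": [120, 135, 140],
})

# Long form: one row per (year, measurement type) pair.
long_df = wide_df.melt(id_vars="Year", var_name="variable", value_name="value")

print(long_df)
```

Three years and two measurements become six rows, which is why the long form takes up more vertical space while holding the same information.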
179
189
@@ -182,7 +192,7 @@ In the long form of the data, we have a row for each data point (year x measurem
182
192
### Making a (Better) Line Chart
183
193
184
194
If we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain).
185
-
Converting the data to long form means we can use ggplot2/seaborn to do all of this for us with only a single plot statement (`geom_line` or `sns.lineplot`).
195
+
Converting the data to long form means we can use `ggplot2`/`seaborn` to do all of this for us with only a single plot statement (`geom_line` or `sns.lineplot`).
186
196
Having the data in the right form to plot is very important if you want to get the plot you're imagining with relatively little effort.
187
197
188
198
@@ -193,6 +203,9 @@ Having the data in the right form to plot is very important if you want to get t
193
203
#### R
194
204
```{r}
195
205
#| label: long-form-demo-r
206
+
#| fig-width: 8
207
+
#| fig-height: 6
208
+
#| fig-alt: "A chart titled HBCU College Enrollment, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Many lines are shown, with combinations of 2-year, 4-year, and total, private and public within those designations, and male and female enrollment as well. For most values, the overall enrollment has grown over time (declining somewhat since 2010), but the value for private 2-year HBCUs has declined over the entire time period and remained relatively flat. It is clear that private HBCUs make up an extremely small proportion of overall HBCU enrollment."
196
209
197
210
ggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +
198
211
ggtitle("HBCU College Enrollment")
@@ -211,6 +224,11 @@ ggplot(hbcu_long, aes(x = "Year", y = "value", color = "variable")) + geom_line(
211
224
212
225
```{python}
213
226
#| label: long-form-demo-python-seaborn
227
+
#| fig-width: 8
228
+
#| fig-height: 6
229
+
#| fig-alt: "A chart titled HBCU College Enrollment, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Many lines are shown, with combinations of 2-year, 4-year, and total, private and public within those designations, and male and female enrollment as well. For most values, the overall enrollment has grown over time (declining somewhat since 2010), but the value for private 2-year HBCUs has declined over the entire time period and remained relatively flat. It is clear that private HBCUs make up an extremely small proportion of overall HBCU enrollment."
230
+
231
+
214
232
plot = sns.lineplot(hbcu_long, x = "Year", y = "value", hue = "variable")
215
233
plot.set_title("HBCU College Enrollment")
216
234
plt.show()
@@ -220,4 +238,176 @@ plt.show()
220
238
221
239
:::
222
240
241
+
242
+
### Highlighting Key Insights
243
+
244
+
Examining the charts in the previous section, it seems that there are at least three different contrasts which are easily made: 2 year vs. 4 year (vs. Total), Public vs. Private, and Male vs. Female enrollment.
245
+
The 2 year vs. 4 year vs. Total and Public vs. Private variables seem to be "crossed" - we have values for every combination of these variables.
246
+
The Male and Female numbers are not broken out further.
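
One way to check whether two variables are "crossed" is to compare the combinations that appear in the data against every possible combination. The sketch below assumes the measurement labels have already been split into (degree, control) pairs; the pairs listed are hypothetical stand-ins rather than values read from the HBCU data.

```python
from itertools import product

# Hypothetical (degree, control) pairs observed in the data.
observed = {
    ("2-year", "Public"), ("2-year", "Private"),
    ("4-year", "Public"), ("4-year", "Private"),
    ("Total", "Public"), ("Total", "Private"),
}

degrees = {degree for degree, _ in observed}
controls = {control for _, control in observed}

# The variables are fully crossed when no possible combination is missing.
missing = set(product(degrees, controls)) - observed
is_crossed = not missing

print(is_crossed)
```

If `missing` were non-empty, it would list exactly the combinations for which there are no values.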
247
+
248
+
When creating charts, it's useful to think about key comparisons and insights that the user may wish to explore, and then to explicitly highlight those comparisons with separate charts.
249
+
It's never enough to just make one chart -- any data that is complex enough to plot deserves to be explored from multiple angles.
250
+
251
+
::: demo
252
+
#### Sketching Possible Comparisons
253
+
254
+
It is helpful to sketch out the data structure and then use that sketch to identify key comparisons.
255
+
You do not have to know what the data looks like at this stage -- it is enough to think through what *might* be interesting and then to test whether or not the comparison is interesting by generating the chart.
256
+
257
+
::: panel-tabset
258
+
259
+
##### Data Structure
260
+
261
+
{fig-alt="A sketch of the HBCU data in wide form, with annotations underneath the table showing comparisons between relevant columns - by gender, by public and private 2y vs 4y, by total 2y and 4y enrollment, by public vs. private"}
262
+
263
+
##### Chart 1
264
+
265
+
{fig-alt="A sketch of what the HBCU enrollment data broken out by gender might show."}
266
+
267
+
##### Chart 2
268
+
269
+
{fig-alt=" There are three subplots shown - Total, Public, and Private. In the Total pane, lines are broken out by public and private, while in the public and private panes they are broken out by 2y vs. 4y."}
270
+
##### Chart 3
271
+
272
+
{fig-alt=" There are four subplots shown, arranged in a 2x2 table, with columns Public and Private, and rows 2yr and 4yr. In each cell, there is a subplot with a single line drawn."}
273
+
274
+
##### Chart 4
275
+
276
+
{fig-alt=" There are two subplots shown, arranged in a 1x2 table, with columns Public and Private. In each cell there is a chart with two lines: 4yr and 2yr."}
277
+
:::
278
+
279
+
When considering which version of a chart to generate, it is helpful to think about which comparisons are most natural for each chart. When lines are close together and share the same scale, comparisons are easier to make. So, if the goal is to highlight the difference between 2yr enrollment and 4yr enrollment, then Chart 4 is particularly effective, as those comparisons are easiest to make in that chart.
280
+
@sec-data-vis discusses this in more detail.
281
+
282
+
:::
283
+
284
+
#### HBCU Enrollment by Gender
285
+
286
+
First, let's peel off the gender-specific enrollment data.
287
+
This requires us to keep only the types "Females" and "Males" (along with total enrollment for comparison), which we can do by filtering or subsetting the data.
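
A minimal sketch of this kind of row filtering in pandas is shown here, using `isin` on a small invented long-form data frame. The column names, the label `"Total enrollment"`, and the numbers are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-form data; values are invented for this sketch.
toy_long = pd.DataFrame({
    "Year": [1976] * 4 + [1980] * 4,
    "variable": ["Total enrollment", "Males", "Females", "4-year"] * 2,
    "value": [220, 100, 120, 200, 245, 110, 135, 225],
})

# Keep only the gender rows, plus the total as a reference line.
keep = ["Total enrollment", "Males", "Females"]
toy_gender = toy_long[toy_long["variable"].isin(keep)]

print(toy_gender)
```

The same subset could be written with `DataFrame.query`; `isin` is used here because it reads naturally when the list of categories to keep is short.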
288
+
289
+
290
+
::: callout-demo
291
+
292
+
::: panel-tabset
293
+
294
+
#### R
295
+
```{r}
296
+
#| label: long-form-subset-r
297
+
#| fig-width: 8
298
+
#| fig-height: 6
299
+
#| fig-alt: "A chart titled HBCU College Enrollment by gender, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Three lines are shown: Total enrollment, Males, and Females."
ggplot(hbcu_gender, aes(x = Year, y = value, color = type)) + geom_line() +
305
+
ggtitle("HBCU College Enrollment by Gender")
306
+
```
307
+
308
+
#### Python
309
+
310
+
```{python}
311
+
#| label: long-form-subset-python-seaborn
312
+
#| fig-width: 8
313
+
#| fig-height: 6
314
+
#| fig-alt: "A chart titled HBCU College Enrollment by gender, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Three lines are shown: Total enrollment, Males, and Females."
plot = sns.lineplot(hbcu_gender, x = "Year", y = "value", hue = "variable")
319
+
plot.set_title("HBCU College Enrollment by Gender")
320
+
plt.show()
321
+
```
322
+
323
+
:::
324
+
325
+
:::
326
+
327
+
Seeing only the gender-related data (with total enrollment as a comparison) helps to highlight that the same temporal trends seem to apply to both males and females, though in some cases growth is slower for males.
328
+
329
+
#### Institution Types
330
+
331
+
It may also be useful to consider the type of institution when we consider enrollment numbers -- public or private? 2 year or 4 year?
332
+
In this case, it may be most useful to aim for creating some "small multiples" -- multiple charts that are similarly constructed and placed together systematically.
333
+
334
+
::: demo
335
+
336
+
::: panel-tabset
337
+
338
+
#### R
339
+
340
+
```{r}
341
+
#| label: hbcu-small-multiples-r
342
+
#| fig-width: 8
343
+
#| fig-height: 6
344
+
#| fig-alt: "Two sets of small-multiple line charts titled HBCU College Enrollment by Institution Type and Degree Length, showing enrollment by year. The first set has one panel per institution type, with lines colored by degree length. The second set has one panel per degree length."
tidyr::separate("variable", into = c("Type", "Degree"))
360
+
361
+
ggplot(hbcu_inst, aes(x = Year, y = enrollment, color = Degree)) + geom_line() +
362
+
ggtitle("HBCU College Enrollment by Institution Type and Degree Length") +
363
+
facet_wrap(~Type)
364
+
365
+
366
+
ggplot(hbcu_inst, aes(x = Year, y = enrollment, color = Type)) + geom_line() +
367
+
ggtitle("HBCU College Enrollment by Institution Type and Degree Length") +
368
+
facet_wrap(~Degree)
369
+
```
370
+
371
+
#### Python
372
+
373
+
```{python}
374
+
#| label: hbcu-small-multiples-python-seaborn
375
+
#| fig-width: 8
376
+
#| fig-height: 6
377
+
#| fig-alt: "A set of small-multiple line charts showing HBCU enrollment by year, with lines colored by institution type."
plot.map_dataframe(sns.lineplot,x = "Year", y = "value", hue = "Type")
395
+
plot.add_legend()
396
+
plt.show()
397
+
```
398
+
399
+
:::
400
+
401
+
Any of these plots could be customized, e.g. by removing Totals (if desired) or changing axis parameters so that facets don't share the same axis values, but this is enough to get the basic idea of how to work with data.
402
+
403
+
:::
404
+
405
+
### Key Takeaways
406
+
407
+
From this example of HBCU enrollment, a few things are clear:
408
+
409
+
- The form of the data determines how easily the data can be plotted
410
+
- It can be helpful to break measurements down into disjoint combinations (where the data allows) to create plots that thoughtfully compare variables
411
+
- Plotting subsets of variables and rows can help us understand effects in the data better
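
To make the second takeaway concrete, here is a small sketch of splitting a combined measurement label into two separate columns with `Series.str.split`. The labels, the `" - "` separator, and the numbers are hypothetical; the real data may be labeled differently.

```python
import pandas as pd

# Hypothetical long-form rows with a combined label; values are invented.
inst = pd.DataFrame({
    "Year": [1976, 1976, 1976, 1976],
    "variable": [
        "4-year - Public", "2-year - Public",
        "4-year - Private", "2-year - Private",
    ],
    "value": [150, 40, 60, 5],
})

# Split "4-year - Public" into a degree-length column and a type column.
parts = inst["variable"].str.split(" - ", expand=True)
inst["Degree"] = parts[0]
inst["Type"] = parts[1]

print(inst[["Year", "Degree", "Type", "value"]])
```

With the two pieces in their own columns, either one can be mapped to color or used to define facets.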