Commit a1a38ce

author
Susan Vanderplas
committed
HBCU example updates - sketches, and more extensive examples
1 parent 28eb88d commit a1a38ce

6 files changed

+197
-7
lines changed

part-wrangling/02-basic-data-vis.qmd

Lines changed: 197 additions & 7 deletions
@@ -1,11 +1,12 @@
11
# Data Visualization Basics {#sec-basic-data-vis}
22

3-
This section is intended as a very light overview of how you might create charts in R and python. @sec-data-vis will be much more in depth.
3+
This section is intended as a very light overview of how you might create charts in R and python.
4+
@sec-data-vis will be much more in depth.
45

56
## Objectives {-}
67

78
- Use ggplot2/seaborn to create a chart
8-
- Begin to identify issues with data formatting
9+
- Begin to identify issues with data formatting that need to be resolved before creating a chart.
910

1011
## Package Installation
1112

@@ -52,7 +53,8 @@ Once you have run this command, please comment it out so that you don't reinstal
5253
## First Steps
5354

5455
Now that you can read data in to R and python and define new variables, you can create plots!
55-
Data visualization is a skill that takes a lifetime to learn, but for now, let's start out easy: let's talk about how to make (basic) plots in R (with `ggplot2`) and in python (with `seaborn`, which has similar).
56+
Data visualization is a skill that takes a lifetime to learn, but for now, let's start out easy: let's talk about how to make (basic) plots in R (with `ggplot2`) and in python (with `seaborn`, which has a similar approach to charts).
57+
You can read more about this approach, called the **grammar of graphics**, in @sec-data-vis.
5658

5759
### Graphing HBCU Enrollment
5860

@@ -91,7 +93,7 @@ hbcu_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytu
9193

9294
### Making a Line Chart
9395

94-
ggplot2 and seaborn work with data frames.
96+
`ggplot2` and `seaborn` work with data frames.
9597

9698
If you pass a data frame in as the data argument, you can refer to columns in that data with "bare" column names (you don't have to reference the full data object using `df$name` or `df.name`; you can instead use `name` or `"name"`).
9799

@@ -135,10 +137,17 @@ plt.show()
135137

136138
### Data Formatting
137139

138-
If your data is in the right format, ggplot2 is very easy to use; if your data aren't formatted neatly, it can be a real pain.
140+
If your data is in the right format, `ggplot2` is very easy to use; if your data aren't formatted neatly, it can be a real pain.
139141
If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) you want to get your data into "long form".
140142
We'll learn more about how to do this type of data transition when we talk about [reshaping data](05-data-reshape.qmd).
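
As a rough illustration of what "long form" means, here is a minimal sketch using `pandas.melt` on a tiny toy data frame. The column names mirror the HBCU data, but the numbers are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy wide-form data: one row per year, one column per measurement.
# The values are made up for illustration only.
wide = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Males": [10, 11, 12],
    "Females": [15, 16, 18],
})

# Long form: one row per (year, measurement) pair.
long = pd.melt(wide, id_vars=["Year"], var_name="variable", value_name="value")

print(long)
```

With the data in this shape, a single column (`variable`) identifies which line each row belongs to, which is what lets a plotting library draw and label every line in one call.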
141143

144+
::: {.column-margin .callout}
145+
#### Tip {-}
146+
It's helpful to start thinking about what format your data is in, and what format it needs to be in before you can plot it.
147+
Sketching a data frame for the "as is" condition and the "to plot" condition is a useful skill to cultivate.
148+
149+
:::
150+
142151
::: callout-demo
143152

144153
You don't need to know exactly how this works, but it is helpful to see the difference in the two datasets:
@@ -170,10 +179,11 @@ knitr::kable(head(hbcu_all))
170179
#| label: pivot-operation-data-2
171180
#| echo: false
172181
173-
knitr::kable(head(hbcu_long))
182+
knitr::kable(head(hbcu_long, 9*6))
174183
```
175184

176185
In the long form of the data, we have a row for each data point (year x measurement type), not for each year.
186+
I've shown the same amount of data (6 years, 9 measurements) in this table as in the original data, but this takes up much more vertical space!
177187

178188
:::
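
The row arithmetic above (6 years x 9 measurements = 54 rows) can be checked with a small sketch. This assumes toy data with hypothetical column names `m1` to `m9`. One pandas-specific detail: `pd.melt` stacks the data column by column, so the long data has to be sorted by year before the first 54 rows correspond to the first 6 years.

```python
import pandas as pd

# Toy wide data: 8 years and 9 measurement columns named m1..m9 (hypothetical).
years = list(range(1976, 1984))
wide = pd.DataFrame({"Year": years})
for j in range(1, 10):
    wide[f"m{j}"] = [100 * j + i for i in range(len(years))]

long = pd.melt(wide, id_vars=["Year"])

# pd.melt stacks one measurement column after another, so sort by Year
# to group each year's 9 measurements together before taking 6 * 9 rows.
first_six_years = long.sort_values("Year", kind="stable").head(6 * 9)

print(len(wide.head(6)), "wide rows vs.", len(first_six_years), "long rows")
```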
179189

@@ -182,7 +192,7 @@ In the long form of the data, we have a row for each data point (year x measurem
182192
### Making a (Better) Line Chart
183193

184194
If we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain).
185-
Converting the data to long form means we can use ggplot2/seaborn to do all of this for us with only a single plot statement (`geom_line` or `sns.lineplot`).
195+
Converting the data to long form means we can use `ggplot2`/`seaborn` to do all of this for us with only a single plot statement (`geom_line` or `sns.lineplot`).
186196
Having the data in the right form to plot is very important if you want to get the plot you're imagining with relatively little effort.
187197

188198

@@ -193,6 +203,9 @@ Having the data in the right form to plot is very important if you want to get t
193203
#### R
194204
```{r}
195205
#| label: long-form-demo-r
206+
#| fig-width: 8
207+
#| fig-height: 6
208+
#| fig-alt: "A chart titled HBCU College Enrollment, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Many lines are shown, with combinations of 2-year, 4-year, and total, private and public within those designations, and male and female enrollment as well. For most values, the overall enrollment has grown over time (declining somewhat since 2010), but the value for private 2-year HBCUs has declined over the entire time period and remained relatively flat. It is clear that private HBCUs make up an extremely small proportion of overall HBCU enrollment."
196209
197210
ggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +
198211
ggtitle("HBCU College Enrollment")
@@ -211,6 +224,11 @@ ggplot(hbcu_long, aes(x = "Year", y = "value", color = "variable")) + geom_line(
211224

212225
```{python}
213226
#| label: long-form-demo-python-seaborn
227+
#| fig-width: 8
228+
#| fig-height: 6
229+
#| fig-alt: "A chart titled HBCU College Enrollment, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Many lines are shown, with combinations of 2-year, 4-year, and total, private and public within those designations, and male and female enrollment as well. For most values, the overall enrollment has grown over time (declining somewhat since 2010), but the value for private 2-year HBCUs has declined over the entire time period and remained relatively flat. It is clear that private HBCUs make up an extremely small proportion of overall HBCU enrollment."
230+
231+
214232
plot = sns.lineplot(hbcu_long, x = "Year", y = "value", hue = "variable")
215233
plot.set_title("HBCU College Enrollment")
216234
plt.show()
@@ -220,4 +238,176 @@ plt.show()
220238

221239
:::
222240

241+
242+
### Highlighting Key Insights
243+
244+
Examining the charts in the previous section, it seems that there are at least three different contrasts which are easily made: 2 year vs. 4 year (vs. Total), Public vs. Private, and Male vs. Female enrollment.
245+
The 2 year vs. 4 year vs. Total and Public vs. Private variables seem to be "crossed" - we have values for every combination of these variables.
246+
The Male and Female numbers are not broken out further.
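
One way to check that two variables are "crossed" is to list every combination and confirm that each one has a column. The sketch below assumes column labels of the form `Type-Degree`, matching the renamed columns used in the institution-type example in this section; the helper names (`expected`, `missing`, `is_crossed`) are made up for illustration.

```python
from itertools import product

# Column labels following a "Type-Degree" naming pattern.
labels = [
    "Total-Total", "Total-4year", "Total-2year",
    "Public-Total", "Public-4year", "Public-2year",
    "Private-Total", "Private-4year", "Private-2year",
]

types = ["Total", "Public", "Private"]
degrees = ["Total", "4year", "2year"]

# Every combination of the two variables that a fully crossed design needs.
expected = {f"{t}-{d}" for t, d in product(types, degrees)}

# Combinations with no matching column; empty if the variables are crossed.
missing = expected - set(labels)
is_crossed = not missing

print("crossed:", is_crossed)
```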
247+
248+
When creating charts, it's useful to think about key comparisons and insights that the user may wish to explore, and then to explicitly highlight those comparisons with separate charts.
249+
It's never enough to just make one chart -- any data that is complex enough to plot deserves to be explored from multiple angles.
250+
251+
::: demo
252+
#### Sketching Possible Comparisons
253+
254+
It is helpful to sketch out the data structure and then use that sketch to identify key comparisons.
255+
You do not have to know what the data looks like at this stage -- it is enough to think through what *might* be interesting and then to test whether or not the comparison is interesting by generating the chart.
256+
257+
::: panel-tabset
258+
259+
##### Data Structure
260+
261+
![Sketching the data structure and identifying interesting comparisons can help to guide your data analysis and exploration of the data visually.](../images/wrangling/hbcu-sketch-data-comparisons.png){fig-alt="A sketch of the HBCU data in wide form, with annotations underneath the table showing comparisons between relevant columns - by gender, by public and private 2y vs 4y, by total 2y and 4y enrollment, by public vs. private"}
262+
263+
##### Chart 1
264+
265+
![This sketch shows what we might see by breaking the data out by gender.](../images/wrangling/hbcu-sketch-subplots-gender.png){fig-alt="A sketch of what the HBCU enrollment data broken out by gender might show."}
266+
267+
##### Chart 2
268+
269+
![One way of representing the public/private and 2y/4y data using small multiples.](../images/wrangling/hbcu-sketch-subplots-0.png){fig-alt=" There are three subplots shown - Total, Public, and Private. In the Total pane, lines are broken out by public and private, while in the public and private panes they are broken out by 2y vs. 4y."}
270+
##### Chart 3
271+
272+
![Another way of representing the public/private and 2y/4y data using small multiples.](../images/wrangling/hbcu-sketch-subplots-1.png){fig-alt=" There are four subplots shown, arranged in a 2x2 table, with columns Public and Private, and rows 2yr and 4yr. In each cell, there is a subplot with a single line drawn."}
273+
274+
##### Chart 4
275+
276+
![An additional way of representing the public/private and 2y/4y data using small multiples.](../images/wrangling/hbcu-sketch-subplots-2.png){fig-alt=" There are two subplots shown, arranged in a 1x2 table, with columns Public and Private. In each cell there is a chart with two lines: 4yr and 2yr."}
277+
:::
278+
279+
When considering which version of a chart to generate, it is helpful to think about which comparisons are most natural for each chart. When lines are close together and share the same scale, comparisons are easier to make. So, if the goal is to highlight the difference between 2yr enrollment and 4yr enrollment, then Chart 4 is particularly effective, as those comparisons are easiest to make in that chart.
280+
@sec-data-vis discusses this in more detail.
281+
282+
:::
283+
284+
#### HBCU Enrollment by Gender
285+
286+
First, let's peel off the gender-specific enrollment data.
287+
This requires us to find only types "Females" and "Males", which we can obtain by filtering or subsetting the data.
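
As a minimal sketch of the subsetting idea on toy long-form data (values invented for illustration), the same rows can be selected in pandas either with a boolean mask or with `query`:

```python
import pandas as pd

# Toy long-form data; the values are made up for illustration only.
long = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "variable": ["Males", "Females", "4-year", "Males", "Females", "4-year"],
    "value": [10, 15, 20, 11, 16, 22],
})

keep = ["Females", "Males"]

# Option 1: boolean mask built with isin().
by_mask = long[long["variable"].isin(keep)]

# Option 2: query string; @keep refers to the local Python variable.
by_query = long.query("variable in @keep")

print(by_mask)
```

Both approaches keep the original row order and index, so they return identical data frames.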
288+
289+
290+
::: callout-demo
291+
292+
::: panel-tabset
293+
294+
#### R
295+
```{r}
296+
#| label: long-form-subset-r
297+
#| fig-width: 8
298+
#| fig-height: 6
299+
#| fig-alt: "A chart titled HBCU College Enrollment by gender, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Three lines are shown: Total enrollment, Males, and Females."
300+
301+
hbcu_gender <- hbcu_long |>
302+
dplyr::filter(type %in% c("Females", "Males", "Total enrollment"))
303+
304+
ggplot(hbcu_gender, aes(x = Year, y = value, color = type)) + geom_line() +
305+
ggtitle("HBCU College Enrollment by Gender")
306+
```
307+
308+
#### Python
309+
310+
```{python}
311+
#| label: long-form-subset-python-seaborn
312+
#| fig-width: 8
313+
#| fig-height: 6
314+
#| fig-alt: "A chart titled HBCU College Enrollment by gender, with y-axis labeled 'value' ranging from 0 to 350000 and x-axis year spanning from 1976 to 2015. Three lines are shown: Total enrollment, Males, and Females."
315+
plt.clf() # Clear previous plot
316+
rel_types=["Total enrollment", "Females", "Males"]
317+
hbcu_gender = hbcu_long.query('variable.isin(@rel_types)')
318+
plot = sns.lineplot(hbcu_gender, x = "Year", y = "value", hue = "variable")
319+
plot.set_title("HBCU College Enrollment by Gender")
320+
plt.show()
321+
```
322+
323+
:::
324+
325+
:::
326+
327+
Seeing only the gender-related data (with total enrollment as a comparison) helps to highlight that the same temporal trends seem to apply to both males and females, though in some periods growth is shallower for males.
328+
329+
#### Institution Types
330+
331+
It may also be useful to consider the type of institution when we consider enrollment numbers -- public or private? 2 year or 4 year?
332+
In this case, it may be most useful to aim for creating some "small multiples" -- multiple charts that are similarly constructed and placed together systematically.
333+
334+
::: demo
335+
336+
::: panel-tabset
337+
338+
#### R
339+
340+
```{r}
341+
#| label: hbcu-small-multiples-r
342+
#| fig-width: 8
343+
#| fig-height: 6
344+
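#| fig-alt: "Two sets of small multiples, each titled HBCU College Enrollment by Institution Type and Degree Length, with y-axis labeled 'enrollment' and x-axis Year. The first set has one panel per institution type (Private, Public, Total), with a line for each degree length (2year, 4year, Total). The second set has one panel per degree length."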
345+
346+
hbcu_inst <- hbcu_all |>
347+
dplyr::select(1:2, 5:12) |>
348+
dplyr::rename(
349+
"Total-Total" = "Total enrollment",
350+
"Total-4year" = "4-year",
351+
"Total-2year"="2-year",
352+
"Public-Total" = "Total - Public",
353+
"Public-4year" = "4-year - Public",
354+
"Public-2year" = "2-year - Public",
355+
"Private-Total" = "Total - Private",
356+
"Private-4year" = "4-year - Private",
357+
"Private-2year" = "2-year - Private") |>
358+
tidyr::pivot_longer(2:10, names_to="variable", values_to="enrollment") |>
359+
tidyr::separate("variable", into = c("Type", "Degree"))
360+
361+
ggplot(hbcu_inst, aes(x = Year, y = enrollment, color = Degree)) + geom_line() +
362+
ggtitle("HBCU College Enrollment by Institution Type and Degree Length") +
363+
facet_wrap(~Type)
364+
365+
366+
ggplot(hbcu_inst, aes(x = Year, y = enrollment, color = Type)) + geom_line() +
367+
ggtitle("HBCU College Enrollment by Institution Type and Degree Length") +
368+
facet_wrap(~Degree)
369+
```
370+
371+
#### Python
372+
373+
```{python}
374+
#| label: hbcu-small-multiples-python-seaborn
375+
#| fig-width: 8
376+
#| fig-height: 6
377+
#| fig-alt: "Two sets of small multiples with y-axis labeled 'value' and x-axis Year. The first set has one panel per institution type (Total, Public, Private), with a line for each degree length. The second set has one panel per degree length (Total, 4year, 2year), with a line for each institution type."
378+
379+
hbcu_inst = hbcu_all.iloc[:,[0,1,4,5,6,7,8,9,10,11]]
380+
hbcu_inst_names=["Year", "Total-Total", "Total-4year", "Total-2year", "Public-Total", "Public-4year", "Public-2year","Private-Total", "Private-4year", "Private-2year"]
381+
hbcu_inst.columns = hbcu_inst_names
382+
hbcu_inst_long = pd.melt(hbcu_inst, id_vars = ['Year'], value_vars = hbcu_inst.columns[1:10])
383+
hbcu_inst_long[['Type','Degree']]=list(hbcu_inst_long["variable"].str.split("-"))
384+
385+
plt.clf() # Clear previous plot
386+
plot = sns.FacetGrid(hbcu_inst_long, col="Type")
387+
plot.map_dataframe(sns.lineplot,x = "Year", y = "value", hue = "Degree")
388+
plot.add_legend()
389+
plt.show()
390+
391+
392+
plt.clf() # Clear previous plot
393+
plot = sns.FacetGrid(hbcu_inst_long, col="Degree")
394+
plot.map_dataframe(sns.lineplot,x = "Year", y = "value", hue = "Type")
395+
plot.add_legend()
396+
plt.show()
397+
```
398+
399+
:::
400+
401+
Any of these plots could be customized by e.g. removing Totals (if desired), changing axis parameters so that facets don't have the same axis values, etc., but this is enough to get the basic idea of how to work with data.
402+
403+
:::
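
The key reshaping step in the demo above is splitting a combined label such as `Public-4year` into two separate variables that can be mapped to facets and colors. A minimal sketch of that step on toy data (values invented for illustration), using `str.split` with `expand=True` as one possible alternative to building a list of splits:

```python
import pandas as pd

# Toy long-form data with combined "Type-Degree" labels; values are made up.
long = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "variable": ["Public-4year", "Public-2year", "Private-4year", "Private-2year"],
    "value": [100, 20, 40, 5],
})

# expand=True returns one column per piece, which can be assigned
# directly to two new columns.
long[["Type", "Degree"]] = long["variable"].str.split("-", expand=True)

print(long)
```

Once `Type` and `Degree` are separate columns, either one can be used for the panels while the other distinguishes lines within a panel.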
404+
405+
### Key Takeaways
406+
407+
From this example of HBCU enrollment, a few things are clear:
408+
409+
- The form of the data determines how easily it can be plotted
410+
- It can be helpful to break measurements down into disjoint combinations (where the data allows) to create plots that thoughtfully compare variables
411+
- Plotting subsets of variables and rows can help us understand effects in the data better
412+
223413
## References {#sec-graphics-intro-refs}
