# coding-bootcamp/ProblemSet_plotting.py at master (macadology/coding-bootcamp on GitHub)

# %% [markdown]
# # Problem Set - Plotting

# %% Setup
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# %% [markdown]
# We first import the Excel file into Python as a pandas DataFrame. The DataFrame contains 5 columns: the first 4 hold the relative abundance of each species, and the last holds the site each sample was collected from.
# If you encounter an error when reading Excel files, install the package named in the error message: openpyxl for .xlsx files in recent pandas versions, or xlrd in older ones.

# %% Importing data from excel into pandas dataframe
relative_abundance_data = pd.read_excel('./rel_abd.xlsx') # Import relative abundance data from excel
ecoli_rel_abd = relative_abundance_data['e_coli'] # Get the relative abundance of ecoli

# Extract the relative abundances of c_acnes, k_pneumoniae and h_sapiens
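
# %% Sketch - selecting columns (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch of column selection, assuming a DataFrame with the same column names as the Excel file.
# `demo_data` and its values are made up here so that the cell runs without rel_abd.xlsx; all `demo_` names are hypothetical.
#==============================================================================
```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
rng = np.random.default_rng(0)
raw = rng.uniform(0.05, 1.0, size=(6, 4))
demo_data = pd.DataFrame(
    raw / raw.sum(axis=1, keepdims=True),  # normalise so each row sums to 1
    columns=['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens'],
)
demo_data['sites'] = ['environment', 'skin'] * 3

# Indexing with one column name returns a pandas Series.
demo_c_acnes = demo_data['c_acnes']

# Indexing with a list of column names returns a smaller DataFrame.
demo_microbes = demo_data[['e_coli', 'c_acnes', 'k_pneumoniae']]
```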


# %% [markdown]
# # Figure 1. Line/Scatter plot
# We are interested in visualizing the relationship between h.sapiens and the 3 different microbes. Plot a scatter plot of h.sapiens (x-axis) vs. microbes (y-axis), using a different color for each of the three microbes. I have provided links to several examples for each method. Feel free to use any or all of the three methods listed below.

# %% Method 1 - Run plt.scatter or plt.plot using the extracted relative abundances as inputs for x and y.
#==============================================================================
# Refer to this page for an example on how to use plt.scatter. [](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.scatter.html)
# Refer to this page for an example on how to use plt.plot. [](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot)
# Here is another example using plt.scatter. [](https://matplotlib.org/3.3.1/gallery/lines_bars_and_markers/scatter_symbol.html#sphx-glr-gallery-lines-bars-and-markers-scatter-symbol-py)
# If using plt.plot, remember to hide the curve and show the marker. [](https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html#Scatter-Plots-with-plt.plot)
#==============================================================================
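
# %% Sketch - plt.scatter and plt.plot with explicit x and y (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch of the two calls described above, using made-up arrays rather than the problem-set data.
# The `demo_` names, the series labels and the colors are assumptions chosen for illustration.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for the extracted relative abundances.
rng = np.random.default_rng(1)
demo_x = rng.uniform(0.0, 0.5, 30)
demo_y1 = 0.5 * demo_x + rng.uniform(0.0, 0.1, 30)
demo_y2 = rng.uniform(0.0, 0.3, 30)

plt.figure()
# plt.scatter draws markers only.
demo_points = plt.scatter(demo_x, demo_y1, color='tab:blue', label='series 1')
# plt.plot connects the points with a line by default; the 'o' format
# string asks for circle markers and no connecting line.
demo_lines = plt.plot(demo_x, demo_y2, 'o', color='tab:orange', label='series 2')
plt.xlabel('x (made-up values)')
plt.ylabel('y (made-up values)')
plt.legend()
```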


# %% Method 2 - Run plt.scatter three times using keyword strings
#==============================================================================
# Refer to this guide on how to plot using keyword strings. [](https://matplotlib.org/3.3.1/tutorials/introductory/pyplot.html#plotting-with-keyword-strings)
# The difference between methods 1 and 2 is that in method 2 you don't pass variables to x and y explicitly. Instead, you pass strings that name columns of the pandas DataFrame, and pass the DataFrame itself through the `data` argument.
#==============================================================================
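
# %% Sketch - plt.scatter with keyword strings (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch of plotting with keyword strings, assuming a DataFrame with the same column names as the Excel file.
# `demo_data` holds made-up values; the `demo_` names are hypothetical.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
rng = np.random.default_rng(2)
raw = rng.uniform(0.05, 1.0, size=(6, 4))
demo_data = pd.DataFrame(
    raw / raw.sum(axis=1, keepdims=True),
    columns=['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens'],
)

plt.figure()
demo_collections = {}
for microbe in ['e_coli', 'c_acnes', 'k_pneumoniae']:
    # x and y are column names; `data` tells matplotlib where to look them up.
    demo_collections[microbe] = plt.scatter(
        'h_sapiens', microbe, data=demo_data, label=microbe
    )
plt.xlabel('h_sapiens')
plt.ylabel('relative abundance')
plt.legend()
```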


# %% Method 3 - Run sns.scatterplot three times using keyword strings
#==============================================================================
# Create a plot using seaborn's version of scatterplot.
# Refer to the following link for several examples on how to plot a scatter plot [](https://seaborn.pydata.org/generated/seaborn.scatterplot.html)
# I have rearranged the data using pd.melt to convert the data from wide form to long form. Read this post to learn about the differences between long form and wide form data. [](https://towardsdatascience.com/long-wide-data-and-how-to-efficiently-plot-them-7a96887e309d)
# Tip: Use the rearranged data below to plot a scatter plot. Call the variable `rearranged_data` to explore what the data looks like after rearrangement.
#==============================================================================
rearranged_data = pd.melt(relative_abundance_data, id_vars=['sites','h_sapiens'], value_vars=['e_coli','c_acnes','k_pneumoniae'], var_name='microbes', value_name='relabd (y)')
rearranged_data = rearranged_data.rename(columns={'h_sapiens':'h_sapiens_relabd (x)'})
rearranged_data
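
# %% Sketch - sns.scatterplot on long-form data (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch, assuming long-form data shaped like `rearranged_data` above but built from made-up values.
# With long-form data, a single call with `hue` can color the points by microbe; this is one possible approach, not necessarily the intended answer.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
rng = np.random.default_rng(3)
raw = rng.uniform(0.05, 1.0, size=(10, 4))
demo_data = pd.DataFrame(
    raw / raw.sum(axis=1, keepdims=True),
    columns=['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens'],
)
demo_data['sites'] = ['environment', 'skin'] * 5

# Same wide-to-long rearrangement as in the cell above.
demo_long = pd.melt(
    demo_data,
    id_vars=['sites', 'h_sapiens'],
    value_vars=['e_coli', 'c_acnes', 'k_pneumoniae'],
    var_name='microbes',
    value_name='relabd (y)',
)
demo_long = demo_long.rename(columns={'h_sapiens': 'h_sapiens_relabd (x)'})

plt.figure()
# x, y and hue are column names looked up in `data`.
demo_ax = sns.scatterplot(
    data=demo_long, x='h_sapiens_relabd (x)', y='relabd (y)', hue='microbes'
)
```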


# %% [markdown]
# # Figure 2. Box plot
# Plot a boxplot to show the distribution of relative abundances of all 4 species. Each species should have 2 boxes, one for each site.

# %% Run sns.boxplot
#==============================================================================
# Use the examples here as a guide on how to make box plots. [](https://seaborn.pydata.org/generated/seaborn.boxplot.html)
# Tip: use the long form data created below
#==============================================================================
rearranged_data = pd.melt(relative_abundance_data, id_vars=['sites'], value_vars=['e_coli','c_acnes','k_pneumoniae','h_sapiens'], var_name='microbes', value_name='relabd')
rearranged_data
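
# %% Sketch - sns.boxplot grouped by site (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch, assuming long-form data shaped like `rearranged_data` above but built from made-up values.
# Putting species on the x-axis and passing the site column as `hue` gives one box per site for each species.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
rng = np.random.default_rng(4)
raw = rng.uniform(0.05, 1.0, size=(20, 4))
demo_data = pd.DataFrame(
    raw / raw.sum(axis=1, keepdims=True),
    columns=['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens'],
)
demo_data['sites'] = ['environment', 'skin'] * 10

# Same wide-to-long rearrangement as in the cell above.
demo_long = pd.melt(
    demo_data,
    id_vars=['sites'],
    value_vars=['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens'],
    var_name='microbes',
    value_name='relabd',
)

plt.figure()
# One group per species on the x-axis, split into one box per site by hue.
demo_ax = sns.boxplot(data=demo_long, x='microbes', y='relabd', hue='sites')
```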


# %% [markdown]
# # Figure 3. Stacked bar chart
# Plot a stacked bar graph for each sample showing the relative abundance of each species.
# Method 1 is the fastest, but methods 2 and 3 demonstrate how to create a stacked bar plot from scratch.
# Methods 2 and 3 are optional. You may attempt them if keen.

# %% Method 1 - Use df.plot.bar
#==============================================================================
# Refer to this link for an example [](https://pandas.pydata.org/docs/user_guide/visualization.html#bar-plots)
# Note 1: pandas uses matplotlib as its plotting backend.
# Note 2: the df in `df.plot.bar` stands for the variable holding the target data. In our case, replace df with the name of the appropriate variable.
#==============================================================================
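
# %% Sketch - df.plot.bar with stacked=True (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch of a stacked bar chart drawn by pandas, assuming a DataFrame with the same column names as the Excel file.
# `demo_data` holds made-up values; the `demo_` names and axis labels are hypothetical.
#==============================================================================
```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
species_cols = ['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens']
rng = np.random.default_rng(5)
raw = rng.uniform(0.05, 1.0, size=(6, 4))
demo_data = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True), columns=species_cols)
demo_data['sites'] = ['environment', 'skin'] * 3

# Select the numeric columns, then draw one bar per row (sample) with the
# columns stacked on top of each other.
demo_ax = demo_data[species_cols].plot.bar(stacked=True)
demo_ax.set_xlabel('sample')
demo_ax.set_ylabel('relative abundance')
```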


# %% Method 2 - Run sns.barplot
#==============================================================================
# Refer to this example [](https://randyzwitch.com/creating-stacked-bar-chart-seaborn/)
#==============================================================================
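
# %% Sketch - stacked bars with sns.barplot (illustrative; not part of the original problem set)
#==============================================================================
# sns.barplot has no stacking option, so this sketch assumes the overlay trick: plot cumulative totals, tallest series first, so that each shorter bar covers the lower part of the one behind it.
# `demo_data` holds made-up values; the `demo_` names, sample labels and hand-built legend are assumptions for illustration.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.patches import Patch

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
species_cols = ['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens']
rng = np.random.default_rng(6)
raw = rng.uniform(0.05, 1.0, size=(6, 4))
demo_data = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True), columns=species_cols)
demo_samples = ['s{}'.format(i) for i in range(len(demo_data))]

# Running total across species: the last column is the full bar height.
demo_cumulative = demo_data[species_cols].cumsum(axis=1)

fig, demo_ax = plt.subplots()
palette = sns.color_palette(n_colors=len(species_cols))
# Draw the tallest cumulative bars first so shorter ones are painted on top.
for col, color in reversed(list(zip(species_cols, palette))):
    sns.barplot(x=demo_samples, y=demo_cumulative[col].to_numpy(),
                color=color, ax=demo_ax)

# Build the legend by hand, one colored patch per species.
demo_ax.legend(handles=[Patch(color=c, label=s) for s, c in zip(species_cols, palette)])
demo_ax.set_xlabel('sample')
demo_ax.set_ylabel('relative abundance')
```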


# %% Method 3 - Run plt.bar
#==============================================================================
# Refer to this example [](https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/bar_stacked.html)
#==============================================================================
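
# %% Sketch - stacked bars with plt.bar and `bottom` (illustrative; not part of the original problem set)
#==============================================================================
# A minimal sketch of building the stack from scratch: each species is drawn with plt.bar, and `bottom` is set to the running total of the species already drawn.
# `demo_data` holds made-up values; the `demo_` names are hypothetical.
#==============================================================================
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for rel_abd.xlsx: same column names, made-up values.
species_cols = ['e_coli', 'c_acnes', 'k_pneumoniae', 'h_sapiens']
rng = np.random.default_rng(7)
raw = rng.uniform(0.05, 1.0, size=(6, 4))
demo_data = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True), columns=species_cols)

plt.figure()
demo_positions = np.arange(len(demo_data))  # one bar position per sample
demo_bottom = np.zeros(len(demo_data))      # running height of each stack
for col in species_cols:
    heights = demo_data[col].to_numpy()
    # Start this layer where the previous layers ended.
    plt.bar(demo_positions, heights, bottom=demo_bottom, label=col)
    demo_bottom = demo_bottom + heights
plt.xlabel('sample')
plt.ylabel('relative abundance')
plt.legend()
```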


# %% Generating Data
# The following code was used to create the Excel file used above. I have included it here for reference, but it is not needed to complete the problem set.
#
# species = ['E.coli', 'C.acnes', 'K.pneumoniae', 'H.sapiens']
# n = 100
# site = np.random.randint(2,size=n) #0->environment, 1->skin
# site_name = ['environment' if i == 0 else 'skin' for i in site]
# e_coli = np.random.uniform(0,1,n) * 0.2
# c_acnes = e_coli * 3 + np.random.uniform(0,1,n) * 0.1 + np.random.uniform(0,1,n) * site * 0.5
# k_pneumoniae = np.random.uniform(0,1,n) * 0.2
# h_sapiens = k_pneumoniae * 1.5 + np.random.uniform(0,1,n) * 0.1 + np.random.uniform(0,1,n) * site * 0.5
# absolute_abundance = np.array([e_coli,c_acnes,k_pneumoniae,h_sapiens])
# relative_abundance = (absolute_abundance / absolute_abundance.sum(axis=0)).T
# relative_abundance_data = pd.DataFrame(relative_abundance, columns=['e_coli','c_acnes','k_pneumoniae','h_sapiens'])
# relative_abundance_data['sites'] = site_name
# relative_abundance_data.to_excel('./rel_abd.xlsx',index=False)