python_wb_basics.pandas/wb_with_code at main · Generation-Data/python_wb_basics.pandas · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
## this workbook has suggested code for the exercises so you can check your work and assess the level of learning we recommend before the python for data engineering course.
## If you can do 5 out of 7 exercises then you are set on having learned pandas and python basics.
## If you can do 4 out of 7 we recommend you review materials and the videos listed in the readme file before the next python class on July 31.
## If you can do 3 or less we recommend you review materials and reach out to your mentor for a 1:1 session to cover these concepts.

## here we've imported the first few packages for you.
import os
import pandas as pd

## exercise 0: Import necessary libraries and load data from a csv file into a pandas dataframe
## hint: pd.read_csv is your friend
## again, pandas documentation, which you can refer to frequently, is here
## https://pandas.pydata.org/docs/user_guide/index.html

df = pd.read_csv("Fictitious_Records.csv")

## exercise 1: Inspect the DataFrame to understand its structure and content
## hint: Use the head(), dtypes, and describe() methods to quickly inspect the DataFrame, as mentioned in the slides.

# Display the first 10 rows
print(df.head(10))

# Display columns and data types
print(df.dtypes)

# Summary statistics
print(df.describe())

# Display the shape of the DataFrame
print(df.shape)

## exercise 2: Filter data based on specific conditions.
## hint: Remember how we used the > operator to filter data. Apply similar techniques here to filter and count rows.

# Filter rows where 'Age' > 30
filtered_df = df[df['Age'] > 30]
OR
filtered_df = df.query(“Age > 30”)

# Count the rows
count = filtered_df.shape[0]
print(f"Number of rows where Age > 30: {count}")

## exercise 3: Select specific columns and rename them.
## hint: Use double brackets to select multiple columns and the rename() method to change column names.

# Select specific columns
selected_df = df[['Name', 'Age', 'City']]

# Rename columns
selected_df = selected_df.rename(columns={'Name': 'Full Name', 'City': 'Location'})
print(selected_df.head())

## exercise 4: Identify and handle missing data in the DataFrame.
## hint: Use isnull(), dropna(), and fillna() to handle missing data.

# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df_cleaned = df.dropna()

# Fill missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Change the data type of Age from int to float
df['Age'] = df['Age'].astype(float)

print(df.dtypes) #Verify the change in the data type

## exercise 5: Translate a SQL query into Pandas code.
## code is Given SQL: SELECT Name, Age FROM your_table WHERE Age > 25 ORDER BY Age DESC;
## hint: Break down the SQL query into its components (filtering, selecting columns, sorting) and translate each part into Pandas code.

# Translate SQL to Pandas
result_df = df[df['Age'] > 25][['Name', 'Age']].sort_values(by='Age', ascending=False)
print(result_df.head())

## exercise 6: Perform grouping and aggregation operations.

# Group by 'State' and calculate the average age
grouped_avg_df = df.groupby('State')['Age'].mean().reset_index()
print(grouped_avg_df)

# Group by 'State' and calculate the total amount of campaign contributions
grouped_sum_df = df.groupby('State')['Campaign_Contributions'].sum().reset_index()
print(grouped_sum_df)

# Group by 'State' and count the number of members
grouped_count_df = df.groupby('State')['Member_ID'].count().reset_index()
print(grouped_count_df)

## exercise 7: Export data from a DataFrame to a CSV or Excel file.

# Export the DataFrame to a CSV file
df.to_csv('exported_data.csv', index=False)
print("Data exported to 'exported_data.csv'.")

# Export the DataFrame to an Excel file
df.to_excel('exported_data.xlsx', index=False)
print("Data exported to 'exported_data.xlsx'.")