-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathwb_with_code
More file actions
101 lines (72 loc) · 3.75 KB
/
wb_with_code
File metadata and controls
101 lines (72 loc) · 3.75 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
## this workbook has suggested code for the exercises so you can check your work and assess the level of learning we recommend before the python for data engineering course.
## If you can do 5 out of 7 exercises then you are set on having learned pandas and python basics.
## If you can do 4 out of 7 we recommend you review materials and the videos listed in the readme file before the next python class on July 31.
## If you can do 3 or less we recommend you review materials and reach out to your mentor for a 1:1 session to cover these concepts.
## here we've imported the first few packages for you.
import os
import pandas as pd
## exercise 0: Import necessary libraries and load data from a csv file into a pandas dataframe
## hint: pd.read_csv is your friend
## again, pandas documentation, which you can refer to frequently, is here
## https://pandas.pydata.org/docs/user_guide/index.html
df = pd.read_csv("Fictitious_Records.csv")
## exercise 1: Inspect the DataFrame to understand its structure and content
## hint: Use the head(), dtypes, and describe() methods to quickly inspect the DataFrame, as mentioned in the slides.
# Display the first 10 rows
print(df.head(10))
# Display columns and data types
print(df.dtypes)
# Summary statistics
print(df.describe())
# Display the shape of the DataFrame
print(df.shape)
## exercise 2: Filter data based on specific conditions.
## hint: Remember how we used the > operator to filter data. Apply similar techniques here to filter and count rows.
# Filter rows where 'Age' > 30
filtered_df = df[df['Age'] > 30]
OR
filtered_df = df.query(“Age > 30”)
# Count the rows
count = filtered_df.shape[0]
print(f"Number of rows where Age > 30: {count}")
## exercise 3: Select specific columns and rename them.
## hint: Use double brackets to select multiple columns and the rename() method to change column names.
# Select specific columns
selected_df = df[['Name', 'Age', 'City']]
# Rename columns
selected_df = selected_df.rename(columns={'Name': 'Full Name', 'City': 'Location'})
print(selected_df.head())
## exercise 4: Identify and handle missing data in the DataFrame.
## hint: Use isnull(), dropna(), and fillna() to handle missing data.
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df_cleaned = df.dropna()
# Fill missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Change the data type of Age from int to float
df['Age'] = df['Age'].astype(float)
print(df.dtypes) #Verify the change in the data type
## exercise 5: Translate a SQL query into Pandas code.
## code is Given SQL: SELECT Name, Age FROM your_table WHERE Age > 25 ORDER BY Age DESC;
## hint: Break down the SQL query into its components (filtering, selecting columns, sorting) and translate each part into Pandas code.
# Translate SQL to Pandas
result_df = df[df['Age'] > 25][['Name', 'Age']].sort_values(by='Age', ascending=False)
print(result_df.head())
## exercise 6: Perform grouping and aggregation operations.
# Group by 'State' and calculate the average age
grouped_avg_df = df.groupby('State')['Age'].mean().reset_index()
print(grouped_avg_df)
# Group by 'State' and calculate the total amount of campaign contributions
grouped_sum_df = df.groupby('State')['Campaign_Contributions'].sum().reset_index()
print(grouped_sum_df)
# Group by 'State' and count the number of members
grouped_count_df = df.groupby('State')['Member_ID'].count().reset_index()
print(grouped_count_df)
## exercise 7: Export data from a DataFrame to a CSV or Excel file.
# Export the DataFrame to a CSV file
df.to_csv('exported_data.csv', index=False)
print("Data exported to 'exported_data.csv'.")
# Export the DataFrame to an Excel file
df.to_excel('exported_data.xlsx', index=False)
print("Data exported to 'exported_data.xlsx'.")