
Commit b3ffcec

Merge pull request #3 from oracle-samples/feature/fde_code
Reading data from excel file and Data generator utility
2 parents d9f013a + 76801b9 commit b3ffcec

File tree

14 files changed: +2544 -0 lines changed

Lines changed: 74 additions & 0 deletions
# Read Excel Files in Spark and Pandas

This module demonstrates two approaches to reading Excel files in Spark environments such as **OCI Data Flow**, **Databricks**, or **local Spark clusters**.

---

## 1. Using `com.crealytics.spark.excel`

This approach uses the **Spark Excel connector** developed by [Crealytics](https://github.com/crealytics/spark-excel).
It supports `.xls` and `.xlsx` files directly within Spark DataFrames.

### Requirements

You must add the following JARs to your cluster classpath:

- poi-4.1.2.jar
- poi-ooxml-4.1.2.jar
- poi-ooxml-schemas-4.1.2.jar
- xmlbeans-3.1.0.jar
- curvesapi-1.06.jar
- commons-collections4-4.4.jar
- commons-compress-1.20.jar
- spark-excel_2.12-0.13.5.jar

Download them from the [Maven Central Repository](https://mvnrepository.com/).
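Building the classpath string can be scripted. A minimal sketch, assuming the downloaded JARs sit in a local `./jars` directory (a hypothetical path): both the `spark.jars` config key and the `--jars` flag of `spark-submit` expect a single comma-separated string.

```python
import glob

# Hypothetical directory holding the JARs downloaded from Maven Central.
jar_dir = "./jars"

# spark.jars (and spark-submit --jars) expect one comma-separated string.
jars = ",".join(sorted(glob.glob(f"{jar_dir}/*.jar")))
print(jars)

# Pass it when building the session, e.g.:
#   SparkSession.builder.config("spark.jars", jars).getOrCreate()
```

Generating the list from the directory avoids typos when the JAR versions change.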

### Example

```python
excel_path = "/Volumes/test_data.xlsx"

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(excel_path)

df.show()
```

## 2. Excel to Spark Using Pandas

This approach **reads Excel files using Pandas**, optionally converts them to **CSV**, and then **loads them into Spark** for further processing.
It is ideal for lightweight or pre-processing workflows before ingesting data into Spark.

---

### Requirements

Install the required dependencies via `requirements.txt`:

- `pandas`
- `openpyxl` (reads `.xlsx`)
- `xlrd` (reads legacy `.xls`; versions 2.0+ no longer read `.xlsx`)
### Example

```python
import pandas as pd

# Path to the Excel file
excel_path = "/Volumes/test_data.xlsx"

# Read the Excel file using Pandas
df = pd.read_excel(excel_path)

# Convert to CSV if needed
csv_path = "/Volumes/test_data.csv"
df.to_csv(csv_path, index=False)

print(df.head())

# Load the CSV back into Spark
spark_df = spark.read.csv(csv_path, header=True, inferSchema=True)
spark_df.show()
```
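The CSV hop in the example above discards type information, which is why the Spark read relies on `inferSchema`. A minimal, self-contained sketch of that conversion step, using a temporary path and an in-memory DataFrame instead of the `/Volumes` files above:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the DataFrame that pd.read_excel would return.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert to CSV, as in the example above.
csv_path = os.path.join(tempfile.mkdtemp(), "test_data.csv")
df.to_csv(csv_path, index=False)

# Reading it back reproduces the data; CSV itself stores no dtypes,
# so Spark (or Pandas) must re-infer them on read.
df_back = pd.read_csv(csv_path)
print(df_back.equals(df))  # → True
```

The same round-trip holds for the Spark side: `inferSchema=True` re-derives column types from the CSV text on load.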
Lines changed: 50 additions & 0 deletions
{
 "cells": [
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "## Reading by using the Spark Excel connector (com.crealytics.spark.excel)\n",
    "excel_path = \"/Volumes/test_data.xlsx\"\n",
    "\n",
    "## You must add the JARs to your cluster classpath as per README.md\n",
    "df = spark.read.format(\"com.crealytics.spark.excel\") \\\n",
    "    .option(\"header\", \"true\") \\\n",
    "    .option(\"inferSchema\", \"true\") \\\n",
    "    .load(excel_path)\n",
    "\n",
    "df.show()"
   ],
   "id": "4d1c762a078b6ac2"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "## Using pandas to convert excel into csv and then read in spark\n",
    "import pandas as pd\n",
    "\n",
    "excel_path = \"/Volumes/test_data.xlsx\"\n",
    "df = pd.read_excel(excel_path)\n",
    "\n",
    "# Convert to CSV if needed\n",
    "csv_path = \"/Volumes/test_data.csv\"\n",
    "df.to_csv(csv_path, index=False)\n",
    "\n",
    "print(df.head())\n",
    "\n",
    "# Load CSV back into Spark\n",
    "spark_df = spark.read.csv(csv_path, header=True, inferSchema=True)\n",
    "spark_df.show()\n"
   ],
   "id": "3d929687c9b1c44a"
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
Lines changed: 3 additions & 0 deletions
pandas
openpyxl
xlrd
