
Commit d3f8600
Add samples demonstrating Hyper's native S3 capabilities
With this commit we add samples from the Tableau Conference Hands-on Training "Use Hyper as your Cloud Lake Engine" that show you how Hyper can natively read CSV and parquet files that are stored on Amazon S3.
2 parents 9ad9c0d + 971b476

6 files changed: +190 -0
Community-Supported/native-s3/README.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# parquet-to-hyper
## __parquet_to_hyper__

![Community Supported](https://img.shields.io/badge/Support%20Level-Community%20Supported-53bd92.svg)

__Current Version__: 1.0
These samples show you how Hyper can natively interact with Amazon S3, without the need to install any external dependencies such as boto3 or the AWS CLI.

They originate from the Tableau Conference 2022 Hands-on Training *Use Hyper as your Cloud Lake Engine*; you can [check out the slides here](https://mkt.tableau.com/tc22/sessions/live/428-HOT-D1_Hands-onUseTheHyperAPI.pdf).
# Get started

## __Prerequisites__

To run the scripts, you will need:

- a computer running Windows, macOS, or Linux

- Python 3.9+

- the dependencies from the `requirements.txt` file, installed with `pip install -r requirements.txt`
## Run the samples

The following instructions assume that you have set up a virtual environment for Python. For more information on creating virtual environments, see [venv - Creation of virtual environments](https://docs.python.org/3/library/venv.html) in the Python Standard Library.

1. Open a terminal and activate the Python virtual environment (`venv`).

1. Navigate to the folder where you installed the samples.

1. Follow the steps below to run one of the samples.
**Create a `.hyper` file from a parquet file on S3**

Run the Python script

```bash
$ python parquet-on-s3-to-hyper.py
```

This script reads the parquet file from `s3://nyc-tlc/trip%20data/yellow_tripdata_2021-06.parquet` (see [AWS OpenData](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) for details and the license of the dataset) and inserts the records into a table named `taxi_rides` which is stored in a `.hyper` database file.
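Under the hood, the whole ingestion is a single `CREATE TABLE ... AS` statement that pulls the parquet file through Hyper's `EXTERNAL` function. A condensed sketch of that step (the full script, `parquet-on-s3-to-hyper.py`, is part of this commit):

```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode, TableName, escape_string_literal

TAXI_DATASET = escape_string_literal("s3://nyc-tlc/trip%20data/yellow_tripdata_2021-06.parquet")

# S3 connectivity is experimental and must be enabled explicitly
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU,
                  parameters={"experimental_external_s3": "true"}) as hyper:
    with Connection(endpoint=hyper.endpoint, database="taxi-rides-2021-06.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        # Copy all rows from the parquet file on S3 into a new Hyper table
        taxi_rides = TableName("public", "taxi_rides")
        connection.execute_command(
            f"CREATE TABLE {taxi_rides} AS"
            f" (SELECT * FROM EXTERNAL(S3_LOCATION({TAXI_DATASET}), FORMAT => 'parquet'))")
```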
This database file can then be opened directly with Tableau Desktop or Tableau Prep, or it can be published to Tableau Online and Tableau Server as shown in [this example](https://github.com/tableau/hyper-api-samples/tree/main/Community-Supported/publish-hyper).
**Live query against a `.csv` file which is stored on AWS S3**

Run the Python script

```bash
$ python query-csv-on-s3.py
```

This script performs a live query on the CSV file which is stored in this public S3 bucket: `s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_small.csv`.
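The script first declares the CSV file as a temporary external table and then queries it like any regular table; nothing is copied into a local `.hyper` file. A condensed sketch (the full script, `query-csv-on-s3.py`, is part of this commit):

```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, escape_string_literal

ORDERS_DATASET_S3 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_small.csv")

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU,
                  parameters={"experimental_external_s3": "true"}) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        # Expose the CSV file as a table; the bucket is public, so no credentials are needed
        connection.execute_command(f"""
            CREATE TEMP EXTERNAL TABLE orders(
                order_date DATE, product_id TEXT, category TEXT, sales DOUBLE PRECISION)
            FOR S3_LOCATION({ORDERS_DATASET_S3}, REGION => 'us-west-2')
            WITH (FORMAT => 'csv', HEADER => true)""")
        # The query is evaluated live against the file on S3
        for category, sales in connection.execute_list_query(
                "SELECT category, SUM(sales) FROM orders GROUP BY category"):
            print(f"{category}: {sales} USD")
```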
**Live query with multiple `.parquet` and `.csv` files which are stored on AWS S3**

Run the Python script

```bash
$ python join-parquet-and-csv-on-s3.py
```

This script performs a live query on multiple `.parquet` files which are stored on AWS S3. It shows how to use the [`ARRAY` syntax](https://help.tableau.com/current/api/hyper_api/en-us/reference/sql/functions-srf.html#FUNCTIONS-SRF-EXTERNAL) to union multiple `.parquet` files, and how `.parquet` files can be joined with `.csv` files, just as you would expect from normal database tables stored inside a `.hyper` file.
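The union itself happens in the `FOR` clause of `CREATE TEMP EXTERNAL TABLE`, which accepts an `ARRAY` of locations. A condensed sketch that builds the location list in a loop instead of spelling out each `S3_LOCATION` (otherwise mirroring `join-parquet-and-csv-on-s3.py`, which is part of this commit):

```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, escape_string_literal

ORDER_FILES = [escape_string_literal(f"s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_{year}.parquet")
               for year in (2018, 2019, 2020, 2021)]
locations = ", ".join(f"S3_LOCATION({f}, REGION => 'us-west-2')" for f in ORDER_FILES)

with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU,
                  parameters={"experimental_external_s3": "true"}) as hyper:
    with Connection(endpoint=hyper.endpoint) as connection:
        # All four yearly parquet files appear as one logical `orders` table
        connection.execute_command(f"""
            CREATE TEMP EXTERNAL TABLE orders
            FOR ARRAY[{locations}]
            WITH (FORMAT => 'parquet')""")
        row_count = connection.execute_scalar_query("SELECT COUNT(*) FROM orders")
        print(f"{row_count} orders across all four files")
```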
## __Resources__

Check out these resources to learn more:

- [Hyper API docs](https://help.tableau.com/current/api/hyper_api/en-us/index.html)

- [Tableau Hyper API Reference (Python)](https://help.tableau.com/current/api/hyper_api/en-us/reference/py/index.html)

- [The EXTERNAL function in the Hyper API SQL Reference](https://help.tableau.com/current/api/hyper_api/en-us/reference/sql/functions-srf.html#FUNCTIONS-SRF-EXTERNAL)

- [AWS command line tools documentation](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html), for example if you want to download some of the sample files to your local machine and explore them
Community-Supported/native-s3/join-parquet-and-csv-on-s3.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, escape_string_literal

ORDERS_DATASET_2018 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_2018.parquet")
ORDERS_DATASET_2019 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_2019.parquet")
ORDERS_DATASET_2020 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_2020.parquet")
ORDERS_DATASET_2021 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_2021.parquet")

# CSV file which contains the orders that were returned by the customers
RETURNS_DATASET = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/returns.csv")

# We need to manually enable S3 connectivity as this is still an experimental feature
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU, parameters={"experimental_external_s3": "true"}) as hyper:
    # Create a connection to the Hyper process - we do not connect to a database
    with Connection(endpoint=hyper.endpoint) as connection:

        # We use the `ARRAY` syntax in the CREATE TEMP EXTERNAL TABLE statement to specify multiple files to be unioned
        create_ext_orders_table = f"""
            CREATE TEMP EXTERNAL TABLE orders
            FOR ARRAY[ S3_LOCATION({ORDERS_DATASET_2018}, REGION => 'us-west-2'),
                       S3_LOCATION({ORDERS_DATASET_2019}, REGION => 'us-west-2'),
                       S3_LOCATION({ORDERS_DATASET_2020}, REGION => 'us-west-2'),
                       S3_LOCATION({ORDERS_DATASET_2021}, REGION => 'us-west-2')]
            WITH (FORMAT => 'parquet')
            """
        connection.execute_command(create_ext_orders_table)

        # Create the `returns` table as an EXTERNAL TABLE as well, backed by the CSV file
        create_ext_returns_table = f"""
            CREATE TEMP EXTERNAL TABLE returns(
                returned TEXT,
                order_id TEXT
            )
            FOR S3_LOCATION({RETURNS_DATASET}, REGION => 'us-west-2')
            WITH (FORMAT => 'csv', HEADER => 'true', DELIMITER => ';')
            """
        connection.execute_command(create_ext_returns_table)

        # Select the total sales amount per category from the unioned parquet files
        # and drill down by whether the orders were returned or not
        query = """SELECT category,
                          (CASE WHEN returned IS NULL THEN 'Not Returned' ELSE 'Returned' END) AS return_info,
                          SUM(sales)
                   FROM orders
                   LEFT OUTER JOIN returns ON orders.order_id = returns.order_id
                   GROUP BY 1, 2
                   ORDER BY 1, 2"""

        # Execute the query with `execute_list_query`
        result = connection.execute_list_query(query)

        # Iterate over all rows in the result and print them
        print(f"{'Category':<20} {'Status':<20} Sales")
        print(f"{'--------':<20} {'------':<20} -----")
        for row in result:
            print(f"{row[0]:<20} {row[1]:<20} {row[2]:,.2f} USD")
```
Community-Supported/native-s3/parquet-on-s3-to-hyper.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, CreateMode, TableName, escape_string_literal

# Details and license of the dataset: https://registry.opendata.aws/nyc-tlc-trip-records-pds/
TAXI_DATASET = escape_string_literal("s3://nyc-tlc/trip%20data/yellow_tripdata_2021-06.parquet")  # The May release fixes a bug so that %20 doesn't need to be escaped manually

# We need to manually enable S3 connectivity as this is still an experimental feature
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU, parameters={"experimental_external_s3": "true"}) as hyper:
    # Create a connection to the Hyper process and let it create a database file - if it exists, it's overwritten
    with Connection(endpoint=hyper.endpoint, database="taxi-rides-2021-06.hyper", create_mode=CreateMode.CREATE_AND_REPLACE) as connection:

        # Use `TableName` so we do not have to worry about escaping in the SQL query we generate below
        # Note: This line does not create a table in Hyper, it just defines a name
        taxi_rides = TableName("public", "taxi_rides")

        # Ingest the data from the parquet file into a Hyper table
        # Since the schema is stored inside the parquet file, we don't need to specify it explicitly here
        cmd = f"CREATE TABLE {taxi_rides}" \
              f" AS ( SELECT * FROM EXTERNAL(S3_LOCATION({TAXI_DATASET}), FORMAT => 'parquet'))"

        # We use `execute_command` to send the CREATE TABLE statement to Hyper
        # This may take some time depending on your network connectivity to AWS S3
        connection.execute_command(cmd)

        # Let's check how many rows we loaded
        ride_count = connection.execute_scalar_query(f"SELECT COUNT(*) FROM {taxi_rides}")
        print(f"Loaded {ride_count} taxi rides")
```
Community-Supported/native-s3/query-csv-on-s3.py

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
```python
from tableauhyperapi import HyperProcess, Connection, Telemetry, escape_string_literal

ORDERS_DATASET_S3 = escape_string_literal("s3://hyper-dev-us-west-2-bucket/tc22-demo/orders_small.csv")

# We need to manually enable S3 connectivity as this is still an experimental feature
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU, parameters={"experimental_external_s3": "true"}) as hyper:
    # Create a connection to the Hyper process - we do not connect to a database
    with Connection(endpoint=hyper.endpoint) as connection:

        # Use the CREATE TEMP EXTERNAL TABLE syntax - this allows us to use the CSV file like a normal table name in SQL queries
        # We do not need to specify credentials as the S3 bucket is publicly accessible; this may be different when used with your own data
        create_external_table = f"""
            CREATE TEMP EXTERNAL TABLE orders(
                order_date DATE,
                product_id TEXT,
                category TEXT,
                sales DOUBLE PRECISION
            )
            FOR S3_LOCATION({ORDERS_DATASET_S3}, REGION => 'us-west-2')
            WITH (FORMAT => 'csv', HEADER => true)
            """
        # Create the external table using `execute_command` which sends an instruction to the database - we don't expect a result value
        connection.execute_command(create_external_table)

        # Select the total sales amount per category from the external table
        query = """SELECT category, SUM(sales)
                   FROM orders
                   GROUP BY category"""

        # Execute the query with `execute_list_query` as we expect multiple rows (one row per category) and two columns (category name and sum of sales)
        result = connection.execute_list_query(query)

        # Iterate over all rows in the result and print the category name and the sum of sales for that category
        for row in result:
            print(f"{row[0]}: {row[1]} USD")
```
Community-Supported/native-s3/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
tableauhyperapi>=0.0.14946

Community-Supported/s3-to-hyper/README.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ This sample demonstrates how to, with little modification, leverage the Hyper API
It should serve as a starting point for anyone looking to automate the publishing process of datasources based on the contents of S3 buckets. The advantage of leveraging this sample is that an end user should not need to open the Python script; instead, they simply edit the configuration file and the code handles the rest automatically.

+ **Note:** As an alternative to using Boto3, you can also check whether [Hyper's native S3 capabilities](https://github.com/tableau/hyper-api-samples/tree/main/Community-Supported/native-s3/README.md) are applicable to your use case for ingesting data from AWS S3 into Hyper.

# Get started