
Lab 4: Analysis of data in Amazon S3 using Amazon Redshift Spectrum

Architectural Diagram

architecture-overview-lab4.png

Deploying Amazon Redshift Cluster

In this section you will use a CloudFormation template to create the Amazon Redshift cluster resources. The template also installs pgweb, a web-based SQL client for PostgreSQL, on an Amazon EC2 instance so you can connect to the launched Amazon Redshift cluster and run your queries. Alternatively, you can connect to the Amazon Redshift cluster using a standard SQL client such as SQL Workbench/J. For more information, see http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html.

  1. Log in to your AWS console and open the AWS CloudFormation Dashboard
  2. Make a note of the AWS region name; for this lab, choose the US West (Oregon) region.
  3. Click Create Stack
  4. Copy the contents of the redshiftspectrumglue-lab4.template file and save it to your local machine as redshiftspectrumglue-lab4.json. This file is the AWS CloudFormation template. Select Upload a template file and upload the file by clicking Choose file.
  5. Click Next

IMAGE

  6. Type a name (e.g. RedshiftSpectrumLab) for the Stack Name

IMAGE

  7. Enter the following Parameters for Redshift Cluster Configuration

    1. Choose multi-node for ClusterType
    2. Type 2 for the NumberOfNodes
    3. For NodeType select dc1.large

IMAGE

  8. Enter the following Parameters for Redshift Database Configuration

    1. Type a name (e.g. dbadmin) for MasterUserName.
    2. Type a password for MasterUserPassword. Make sure this password has at least 1 uppercase letter, 1 lowercase letter, and 1 number.
    3. Type a name (e.g. taxidb) for DatabaseName.
    4. Type the IP address of your local machine for ClientIP.

IMAGE
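The master user password rules above can be checked before you submit the stack. A minimal Python sketch — the helper name is ours, and the 8–64 character bounds come from Amazon Redshift's documented master-password policy, not from this lab:

```python
import re

def is_valid_master_password(password: str) -> bool:
    """Check a candidate Redshift master user password: at least one
    uppercase letter, one lowercase letter, and one digit (as required
    in this lab), plus the 8-64 character length Redshift enforces."""
    if not 8 <= len(password) <= 64:
        return False
    return (bool(re.search(r"[A-Z]", password))
            and bool(re.search(r"[a-z]", password))
            and bool(re.search(r"[0-9]", password)))
```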

  9. Enter the following Parameters for Glue Crawler Configuration

    1. Type the name (e.g. taxi-spectrum-db) for GlueCatalogDBName.
    2. Type the name (e.g. csvCrawler) for CSVCrawler.
    3. Type the name (e.g. parquetCrawler) for ParquetCrawler.
  10. Click Next

IMAGE

  11. [Optional] In the Tags sub-section under Options, type a Key name (e.g. Name) and a Value for the key.
  12. Click Next

IMAGE

  13. Check I acknowledge that AWS CloudFormation might create IAM resources.
  14. Click Create

Note: Stack creation may take approximately 15 minutes.

  15. Ensure that the status of the AWS CloudFormation stack that you just created is CREATE_COMPLETE
  16. Select your AWS CloudFormation stack (RedshiftSpectrumLab)
  17. Click on the Outputs tab
  18. Review the list of Key and Value pairs, which will look like the following.

IMAGE

Running AWS Glue Crawlers - CSV & Parquet Crawler

  1. Open the AWS Management Console for AWS Glue
  2. Go to the AWS Glue Crawlers page by clicking on Crawlers in the navigation pane

IMAGE

  3. Select the AWS Glue Crawler for CSV (e.g. csvCrawler)
  4. Click Run crawler
  5. Select the AWS Glue Crawler for Parquet (e.g. parquetCrawler)
  6. Click Run crawler

Note: It may take approximately 5 minutes for both crawlers to parse the data in CSV and Parquet format.

IMAGE

  7. Wait for the Status of both crawlers to return to the Ready state

Now that you have run the crawlers, let's ensure that the new tables taxi and ny_pub have been created.

  8. Go to the list of databases in the AWS Glue Data Catalog by clicking on Databases in the navigation pane.
  9. Click on taxi-spectrum-db

IMAGE

  10. Click on Tables in taxi-spectrum-db

IMAGE

  11. Click on taxi to review the table definition and schema
  12. Navigate back and click on ny_pub to review the table definition and schema

Note: The good news is that you don't have to create a new table or definition to read the CSV data you just looked at. The AWS Glue crawlers have already inferred the schema and created the tables taxi and ny_pub for you.

  13. Click on View partitions to review the partition metadata

Note: A major advantage of AWS Glue crawlers is that they recognize partitions from the S3 object prefixes and automatically create the table with those partitions as part of the crawl.
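The prefix-based partition inference can be illustrated with a short Python sketch. It assumes the data set uses Hive-style key=value prefixes (e.g. year=2016/month=1/type=yellow — an assumption about the lab's S3 layout), which is the convention crawlers recognize:

```python
def partitions_from_key(s3_key: str) -> dict:
    """Extract Hive-style partition key/value pairs from an S3 object
    key, mirroring how AWS Glue crawlers infer table partitions from
    object prefixes."""
    parts = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

# Hypothetical object key following a ny_pub-style layout:
key = "ny_pub/year=2016/month=1/type=yellow/part-0000.parquet"
found = partitions_from_key(key)
```

Every object sharing the `year=2016/month=1/type=yellow` prefix lands in the same partition, so the crawler only has to register the distinct prefixes, not each file.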

Create Redshift Spectrum Schema and reference external tables from the AWS Glue Data Catalog Database

  1. Open the AWS CloudFormation Dashboard
  2. Make sure you are in the same AWS region as before; for this lab, the US West (Oregon) region.
  3. Select your AWS CloudFormation stack (RedshiftSpectrumLab)
  4. Click on the Outputs tab
  5. Navigate to the pgWeb URL
  6. In the pgWeb console, ensure that the SQL Query tab is selected
  7. Copy the following statement to create an external schema (e.g. taxispectrum) in Redshift Spectrum
  create external schema taxispectrum from data catalog
  database 'taxi-spectrum-db' 
  iam_role '<specify the redshift IAM Role arn from the CloudFormation outputs section>'
  8. Replace <specify the redshift IAM Role arn from the CloudFormation outputs section> in the statement with the value of redshiftIAMRole from the Outputs tab of the AWS CloudFormation stack (RedshiftSpectrumLab) you created as part of the lab.

Note: The IAM role must be in single quotes
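To avoid quoting mistakes when substituting the ARN, the statement can be rendered programmatically. A minimal Python sketch — the function name and the placeholder ARN are ours; the SQL text mirrors the statement above:

```python
def external_schema_sql(schema: str, glue_db: str, role_arn: str) -> str:
    """Render the CREATE EXTERNAL SCHEMA statement, wrapping the Glue
    database name and IAM role ARN in the single quotes Redshift
    Spectrum requires."""
    return (
        f"create external schema {schema} from data catalog\n"
        f"database '{glue_db}'\n"
        f"iam_role '{role_arn}'"
    )

sql = external_schema_sql(
    "taxispectrum",
    "taxi-spectrum-db",
    "arn:aws:iam::123456789012:role/redshiftIAMRole",  # placeholder ARN
)
```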

  9. Click Run Query

Note: You can create an external table in Amazon Redshift, AWS Glue, Amazon Athena, or an Apache Hive metastore. For more information, see Getting Started Using AWS Glue in the AWS Glue Developer Guide, Getting Started in the Amazon Athena User Guide, or Apache Hive in the Amazon EMR Developer Guide. If your external table is defined in AWS Glue, Athena, or a Hive metastore, you first create an external schema that references the external database. Then you can reference the external table in your SELECT statement by prefixing the table name with the schema name, without needing to create the table in Amazon Redshift. For more information, see Creating External Schemas for Amazon Redshift Spectrum.

Querying data from Amazon S3 using Amazon Redshift Spectrum

Now that you have created the schema, you can run queries on the data set and see the results in PGWeb Console.

  1. Copy the following statement into the query pane, and then choose Run Query.
    SELECT * FROM taxispectrum.taxi limit 10

Results for the above query look like the following:

Screen Shot 2017-11-14 at 9.16.45 PM.png

  2. Copy the following statement into the query pane, and then choose Run Query to get the total number of taxi rides for yellow cabs.
    SELECT COUNT(1) as TotalCount FROM taxispectrum.taxi

Results for the above query look like the following:

Screen Shot 2017-11-14 at 9.25.23 PM.png

  3. Copy the following statement into the query pane, and then choose Run Query to query for the number of rides per vendor, along with the average fare amount, for yellow taxi rides.
    SELECT 
    CASE vendorid 
         WHEN '1' THEN 'Creative Mobile Technologies'
         WHEN '2' THEN 'VeriFone Inc'
         ELSE CAST(vendorid as VARCHAR) END AS Vendor,
    COUNT(1) as RideCount, 
    avg(total_amount) as AverageAmount
    FROM taxispectrum.taxi
    WHERE total_amount > 0
    GROUP BY (1)

Results for the above query look like the following:

Screen Shot 2017-11-14 at 9.46.55 PM.png

Querying partitioned data using Amazon Redshift Spectrum

By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. Amazon Redshift Spectrum leverages Hive for partitioning data. You can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour. Another customer, who has data coming from many different sources but loaded one time per day, may partition by a data source identifier and date.
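The pruning effect described above can be sketched in a few lines of Python. Assuming year/month partitions (matching the ny_pub layout used in this lab; the year range is illustrative), a query with a partition filter only has to scan a fraction of the S3 prefixes:

```python
def prefixes_to_scan(years, months, wanted_year=None):
    """Return the S3 prefixes a query must scan. Without a partition
    filter every (year, month) prefix is read; with one, the engine
    skips the rest (partition pruning)."""
    return [
        f"ny_pub/year={y}/month={m}/"
        for y in years for m in months
        if wanted_year is None or y == wanted_year
    ]

# Full scan vs. a query with WHERE year = 2016:
all_prefixes = prefixes_to_scan(range(2009, 2017), range(1, 13))
pruned = prefixes_to_scan(range(2009, 2017), range(1, 13), wanted_year=2016)
```

With eight years of data, the filtered query touches 12 prefixes instead of 96 — the scanned data (and Spectrum cost) shrinks proportionally.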

Now that the crawler has added the partition metadata to the AWS Glue Data Catalog, you can run your query.

  1. Copy the following statement into the query pane, and then choose Run Query to get the total number of taxi rides
    SELECT count(1) as TotalCount from taxispectrum.ny_pub

Results for the above query look like the following:

Screen Shot 2017-11-14 at 10.08.50 PM.png

Note: This query executes much faster because the data set is partitioned and stored in an optimal format, Apache Parquet (an open-source columnar format).
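Why a columnar format helps can be shown with a toy Python sketch: an aggregate over one column only needs to read that column's values, not every field of every row. The records and numbers here are made up purely for illustration:

```python
# Row-oriented storage: a query touches whole records.
rows = [
    {"vendorid": "1", "trip_distance": 2.5, "total_amount": 12.0},
    {"vendorid": "2", "trip_distance": 1.0, "total_amount": 7.5},
]

# Column-oriented storage (how Parquet lays data out): one array per
# column, so summing total_amount reads only that column.
columns = {
    "vendorid": ["1", "2"],
    "trip_distance": [2.5, 1.0],
    "total_amount": [12.0, 7.5],
}

row_scan_values = sum(len(r) for r in rows)        # fields touched row-wise
column_scan_values = len(columns["total_amount"])  # fields touched column-wise
total = sum(columns["total_amount"])
```

Here the columnar layout reads 2 values instead of 6; on billions of taxi records the saving dominates query time.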

  2. Copy the following statement into the query pane, and then choose Run Query to get the total number of taxi rides by year
    SELECT YEAR, count(1) as TotalCount from taxispectrum.ny_pub GROUP BY YEAR

Results for the above query look like the following: Screen Shot 2017-11-14 at 10.11.47 PM.png

  3. Copy the following statement into the query pane, and then choose Run Query to get the top 12 months by total number of rides across all years
    SELECT YEAR, MONTH, COUNT(1) as TotalCount 
    FROM taxispectrum.ny_pub
    GROUP BY (1), (2) 
    ORDER BY (3) DESC LIMIT 12

Results for the above query look like the following: Screen Shot 2017-11-14 at 10.13.54 PM.png

  4. Copy the following statement into the query pane, and then choose Run Query to get the monthly ride counts per taxi type for the year 2016.
    SELECT MONTH, TYPE, COUNT(1) as TotalCount 
    FROM taxispectrum.ny_pub
    WHERE YEAR = 2016 
    GROUP BY (1), (2)
    ORDER BY (1), (2)

Results for the above query look like the following: Screen Shot 2017-11-14 at 10.18.08 PM.png

  5. Copy the following statement into the query pane, and then choose Run Query to get, for each month of 2016, the average trip distance, cost per mile, and total cost, plus the 99th percentile of the total amount, for yellow and green taxi rides.
    SELECT MONTH, TYPE,
      avg(trip_distance) avgDistance,
      avg(total_amount/trip_distance) avgCostPerMile,
      avg(total_amount) avgCost,
      percentile_cont(0.99)
      within group (order by total_amount)
    FROM taxispectrum.ny_pub
    WHERE YEAR = 2016 AND (TYPE = 'yellow' OR TYPE = 'green')
    AND trip_distance > 0 AND total_amount > 0
    GROUP BY MONTH, TYPE
    ORDER BY MONTH

Results for the above query look like the following:

Screen Shot 2017-11-14 at 10.23.51 PM.png
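The percentile_cont(0.99) call in the last query computes a continuous (linearly interpolated) percentile rather than picking an existing row. A small Python sketch of the same calculation, independent of Redshift (the function name is ours):

```python
def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation, matching the
    semantics of SQL percentile_cont(fraction) WITHIN GROUP (ORDER BY ...):
    the result sits at fractional rank fraction * (n - 1) in the
    sorted values, interpolated between the two nearest entries."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    pos = fraction * (len(xs) - 1)   # 0-based fractional rank
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
```

For example, the 0.5 percentile of [10, 20, 30, 40] falls halfway between 20 and 30, giving 25.0 — a value not present in the data, which is exactly what distinguishes percentile_cont from percentile_disc.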

Deleting the Amazon CloudFormation Stack

Now that you have successfully queried the dataset using Amazon Redshift Spectrum, you need to tear down the stack that you deployed using the AWS CloudFormation template.

  1. Open the AWS CloudFormation Dashboard
  2. Select the check box next to the name of the stack (e.g. RedshiftSpectrumLab) that you deployed at the beginning of the lab.
  3. Click on the Actions drop-down button.
  4. Select Delete Stack.
  5. Click Yes, Delete in the Delete Stack pop-up dialog.
  6. Ensure that the AWS CloudFormation stack (e.g. RedshiftSpectrumLab) no longer appears in the list of stacks.

License

This library is licensed under the Apache 2.0 License.