This markdown file contains an outline for a data training workshop.
For this training you need:
- owner permissions to a GCP project
- a modern web browser
- (optional) Google Cloud SDK installed on your laptop
In this lab we explore the Google Analytics data from a demo account.
Products: Google Analytics, Cloud Console, BigQuery, Cloud Shell, Cloud Datalab, Data Studio
- The demo account tracks data from the Google Merchandise Store
- Access the Google Analytics dashboard through this page
- Open your Google Cloud Console and navigate to BigQuery
- Explore the UI, find your `google_analytics_sample` dataset and the `ga_sessions_*` tables within
- See the documentation & tutorials:
- BigQuery Export for Analytics (optional)
- BigQuery Export schema
- Standard SQL Reference
- Analytics Academy
- Try out a simple query like

  ```sql
  SELECT fullVisitorId, date, device.deviceCategory, geoNetwork.country
  FROM `google_analytics_sample.ga_sessions_201707*`
  GROUP BY 1, 2, 3, 4
  ```

- Approximately how many distinct visitors were there on the Google Merchandise Store site in July 2017?
- How many distinct visitors were there by country? By device category (desktop, mobile, tablet)?
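To see what counting distinct visitors means before writing the SQL, here is a small Python sketch over toy session rows (the visitor IDs, countries, and device categories are invented for illustration):

```python
from collections import defaultdict

# Toy stand-in for ga_sessions rows: one dict per session.
# Visitor IDs and countries are invented for illustration.
sessions = [
    {"fullVisitorId": "A", "country": "Finland", "deviceCategory": "desktop"},
    {"fullVisitorId": "A", "country": "Finland", "deviceCategory": "mobile"},
    {"fullVisitorId": "B", "country": "Finland", "deviceCategory": "desktop"},
    {"fullVisitorId": "C", "country": "Sweden", "deviceCategory": "tablet"},
]

# COUNT(DISTINCT fullVisitorId): a visitor with several sessions counts once.
distinct_visitors = len({s["fullVisitorId"] for s in sessions})
print(distinct_visitors)  # 3

# The by-country variant collects the distinct visitors per group.
visitors_by_country = defaultdict(set)
for s in sessions:
    visitors_by_country[s["country"]].add(s["fullVisitorId"])
print({c: len(v) for c, v in visitors_by_country.items()})  # {'Finland': 2, 'Sweden': 1}
```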
Choose your analytics tool: BigQuery UI, Data Studio, Datalab
- Just use the query editor!
- Sign in to Data Studio from Google Marketing Platform
- Open a blank report (and answer the questions)
- Create a new data source from BigQuery, follow the steps, select "Session level fields" and click "Add to report"
- Create a date range for July 2017
- Create a new field `Distinct visitors` with `COUNT_DISTINCT(fullVisitorId)` and place it in a scorecard
- Create a filter control with dimension `Country` and metric `Distinct visitors` (view it in View mode)
- Create similar filter controls for `Device category`
- Open Cloud Console in a new tab and activate Cloud Shell
- Get help for Cloud Datalab

  ```shell
  datalab --help
  ```

- Enable the Compute Engine and Source Repositories APIs

  ```shell
  gcloud services enable compute.googleapis.com
  gcloud services enable sourcerepo.googleapis.com
  ```

- Create a new Datalab instance

  ```shell
  datalab create --zone europe-north1-a --disk-size-gb 20 my-datalab
  ```
- Open the Datalab UI in your browser from web preview (change the port to 8081)
- Create a new notebook and make a query

  ```
  %%bq query
  SELECT ...
  FROM `google_analytics_sample.ga_sessions_*` ...
  ```

- Explore the documentation notebooks in the `docs` folder
- View the general Datalab documentation and the Python reference
- Explore the web UI of the "ungit" version control tool
- Number of distinct visitors

  ```sql
  SELECT COUNT(DISTINCT fullVisitorId) AS number_of_visitors
  FROM `google_analytics_sample.ga_sessions_201707*`
  ```

- Visitors by country

  ```sql
  SELECT geoNetwork.country, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
  FROM `google_analytics_sample.ga_sessions_201707*`
  GROUP BY country ORDER BY number_of_visitors DESC
  ```

- Visitors by device category

  ```sql
  SELECT device.deviceCategory, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
  FROM `google_analytics_sample.ga_sessions_201707*`
  GROUP BY deviceCategory ORDER BY number_of_visitors DESC
  ```

- Learn about BigQuery as a data warehouse
- Run queries from the BigQuery cookbook
- Shut down the notebooks and close the Datalab tabs
- Go back to Cloud Shell and close the SSH tunnel with `CTRL-C`
- View the state of your Datalab instance with

  ```shell
  datalab list
  ```

- Stop the running instance with

  ```shell
  datalab stop my-datalab
  ```
In this lab we
- run a streaming pipeline from Pub/Sub to BigQuery,
- run a batch pipeline from BigQuery to Datastore,
- schedule the batch pipeline and other tasks with Composer.
Products: Cloud Shell, Cloud Source Repositories, Pub/Sub, Cloud Dataflow, BigQuery, Cloud Datastore, Cloud Storage, Cloud Composer
Frameworks: Apache Beam, Apache Airflow
- In Cloud Console, navigate to APIs & Services
- Enable the APIs for Pub/Sub, Cloud Dataflow, and Cloud Composer

or, alternatively,

- Enable the APIs from the Cloud Shell command line:

  ```shell
  gcloud services enable pubsub.googleapis.com
  gcloud services enable dataflow.googleapis.com
  gcloud services enable composer.googleapis.com
  ```

- In Cloud Shell Terminal settings, go to Terminal preferences / Keyboard and click "Alt is Meta". This ensures you can enter characters like `[]` in the terminal without complications.
- Open Cloud Shell (preferably in a new tab) and clone this repository

  ```shell
  git clone https://github.com/qvik/gcp-data-training.git
  ```

If you want to use your local code editor instead of the Cloud Shell code editor, follow these steps (you will need Google Cloud SDK installed locally):

- Create a repository in Cloud Source Repositories

  ```shell
  gcloud source repos create gcp-data-training
  ```

- In the repository folder, add the remote

  ```shell
  git remote add google https://source.developers.google.com/p/$GOOGLE_CLOUD_PROJECT/r/gcp-data-training
  ```

- Push

  ```shell
  git push google master
  ```

- Clone the `gcp-data-training` repository to your laptop by following the instructions in Source Repositories
Back to Cloud Shell, everyone!
- Create a virtual environment for the publisher

  ```shell
  virtualenv --python=/usr/bin/python pubvenv
  ```

- Activate it and install the Python client for Pub/Sub

  ```shell
  source pubvenv/bin/activate
  pip install --upgrade google-cloud-pubsub numpy
  ```

- Open `publisher.py` in your code editor and inspect the code
- In Cloud Console, navigate to Pub/Sub, create a topic `stream_data_ingestion` and for it a subscription `process_stream_data`

or, alternatively,

- Create the topic and its subscription from the command line

  ```shell
  gcloud pubsub topics create stream_data_ingestion
  gcloud pubsub subscriptions create --topic stream_data_ingestion process_stream_data
  ```
- Run `publisher.py`
- Open a new Cloud Shell tab and pull messages from the subscription to make sure data is flowing

  ```shell
  gcloud pubsub subscriptions pull --auto-ack \
      projects/$GOOGLE_CLOUD_PROJECT/subscriptions/process_stream_data
  ```

- Interrupt the Python process `publisher.py` with `CTRL-C`
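For reference, a payload like the ones the publisher sends could be assembled as below. This is a hypothetical sketch, not the actual contents of `publisher.py`: the real field names and value ranges are in that file, but the `stream_data` BigQuery table used later expects `timestamp`, `location`, and `spend`:

```python
import json
import random
import time

def make_payload():
    """Build one message payload (hypothetical sketch; see publisher.py
    for the real fields and value ranges)."""
    record = {
        "timestamp": time.time(),                                  # TIMESTAMP
        "location": random.choice(["Helsinki", "Espoo", "Vantaa"]),  # STRING
        "spend": random.randint(1, 100),                           # INTEGER
    }
    return json.dumps(record).encode("utf-8")  # Pub/Sub message data is bytes

print(make_payload())
```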
- Open a new Cloud Shell tab and create a virtual environment for the pipeline

  ```shell
  virtualenv --python=/usr/bin/python beamvenv
  ```

- Activate it and install the Apache Beam Python SDK

  ```shell
  source beamvenv/bin/activate
  pip install --upgrade apache-beam[gcp]
  ```

- Open `stream_pipeline.py` in your code editor and inspect the different suggestions for pipelines
- Take a look at the Apache Beam Programming Guide, the Python reference, and the examples in GitHub
- Go to the BigQuery console and create a dataset `my_dataset` and an empty table `stream_data` with fields `timestamp: TIMESTAMP`, `location: STRING`, `spend: INTEGER`
Launch
publisher.pyin another tab (in its virtual environment) and try out different pipelines with DirectRunner
python stream_pipeline.py --runner DirectRunner
-
Interrupt the Python processes with
CTRL-C -
Create a Cloud Storage bucket
<project_id>-dataflowfor Dataflow temp and staging either from the console or from the command line (seegsutil help)
gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-dataflow
- Take a look at Dataflow documentation and run the pipeline in Dataflow
python stream_pipeline.py --runner DataflowRunner
- View the pipeline by navigating to Dataflow in Cloud Console
- Clean up by stopping the pipeline and the publisher script
- In Cloud Console, navigate to Datastore and create a database
- Inspect `batch_pipeline.py` in your code editor and fill in the missing code
- Run the pipeline to make sure it works (activate `beamvenv` first)

  ```shell
  python batch_pipeline.py --runner DirectRunner
  ```
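At its core, a BigQuery-to-Datastore batch pipeline maps each query result row to a Datastore entity. Here is a dependency-free sketch of such a per-element mapping; the entity kind, key scheme, and row fields are assumptions for illustration, not necessarily what `batch_pipeline.py` does:

```python
def row_to_entity(row, kind="SpendByLocation"):
    """Map one BigQuery row dict to a Datastore-entity-like dict.
    The kind name and key scheme here are illustrative assumptions."""
    return {
        "kind": kind,
        "key_name": row["location"],  # one entity per location
        "properties": {"total_spend": row["total_spend"]},
    }

# Invented sample rows, shaped like an aggregation query result.
rows = [
    {"location": "Helsinki", "total_spend": 420},
    {"location": "Espoo", "total_spend": 137},
]
entities = [row_to_entity(r) for r in rows]
print(entities[0])
```

In the real pipeline this function would sit inside a Beam `Map` transform between the BigQuery source and the Datastore sink.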
- In Cloud Console, navigate to Cloud Composer
- Create an environment named `data-transfer-environment` in `europe-west1` (this takes a while to finish)
- Take a look at the Cloud Composer documentation
- Create a Cloud Storage bucket for data export

  ```shell
  gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-data-export
  ```

- Copy the pipeline into the storage bucket

  ```shell
  gsutil cp pipelines/batch_pipeline.py gs://$GOOGLE_CLOUD_PROJECT-dataflow/pipelines/
  ```
- Take a look at the Apache Airflow API Reference
- Open `scheduler.py` in your code editor and fill in the missing code
- Once the environment is ready, navigate to the Airflow web UI and explore it
- Set the Airflow variable

  ```shell
  gcloud composer environments run data-transfer-environment \
      --location europe-west1 variables -- --set gcp_project $GOOGLE_CLOUD_PROJECT
  ```

- Submit your scheduling to Composer by copying `scheduler.py` into the `dags` folder of your Composer environment bucket
- Run the pipeline manually, if necessary, and inspect the runs in the web UI
- View the pipeline by navigating to Dataflow in Cloud Console
- Draw an architecture diagram of the pipelines in Lab 2
- In Cloud Console, delete the Composer environment and its storage bucket
- Check that no Dataflow pipelines are running
In this lab we train a deep neural network TensorFlow model on Cloud ML Engine. The task is to classify successful marketing phone calls of a Portuguese banking institution.
Products: Cloud ML Engine
Frameworks: TensorFlow
- Enable the Cloud ML Engine API

  ```shell
  gcloud services enable ml.googleapis.com
  ```
- Visit the origin of the dataset at the UCI Machine Learning Repository
- For your convenience, the data has been prepared into training and evaluation sets in `mlengine/data/`. All the numerical variables except for `age` have been normalized.
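"Normalized" here presumably means standard z-score scaling; a minimal sketch of what that preparation step does to one numerical column (the sample values are invented):

```python
import math

def z_score(values):
    """Scale a column to zero mean and unit variance (z-score).
    A sketch of the assumed preprocessing, not the repo's actual script."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

balances = [100.0, 200.0, 300.0]  # invented sample values
print(z_score(balances))
```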
- Open the model file `trainer/model.py` in your code editor and examine the objects `CSV_COLUMNS`, `INPUT_COLUMNS`, etc., which encode your data format
- Take a look at the TensorFlow documentation and fill in the missing feature columns in the `build_estimator` function
- Navigate to the repository folder in your Cloud Shell and set the environment variables

  ```shell
  TRAIN_DATA=$(pwd)/mlengine/data/bank_data_train.csv
  EVAL_DATA=$(pwd)/mlengine/data/bank_data_eval.csv
  MODEL_DIR=$(pwd)/mlengine/output
  ```

- Change directory to `mlengine` and try the training locally

  ```shell
  gcloud ml-engine local train \
      --module-name trainer.task \
      --package-path trainer/ \
      --job-dir $MODEL_DIR \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 1000 \
      --eval-steps 100
  ```
- Set the environment variables

  ```shell
  BUCKET_NAME=$GOOGLE_CLOUD_PROJECT-mlengine
  REGION=europe-west1
  ```

- Create a bucket for ML Engine jobs

  ```shell
  gsutil mb -l $REGION gs://$BUCKET_NAME
  ```

- Copy the data into the bucket

  ```shell
  gsutil cp $TRAIN_DATA $EVAL_DATA gs://$BUCKET_NAME/data/
  ```

- Reset the environment variables for the data

  ```shell
  TRAIN_DATA=gs://$BUCKET_NAME/data/bank_data_train.csv
  EVAL_DATA=gs://$BUCKET_NAME/data/bank_data_eval.csv
  ```

- Set the environment variables for the training job

  ```shell
  JOB_NAME=bank_marketing_1
  OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  ```

- Run the training job in ML Engine

  ```shell
  gcloud ml-engine jobs submit training $JOB_NAME \
      --job-dir $OUTPUT_PATH \
      --runtime-version 1.10 \
      --module-name trainer.task \
      --package-path trainer/ \
      --region $REGION \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 10000 \
      --eval-steps 1000 \
      --verbosity DEBUG
  ```

- View the job logs in Cloud Shell

  ```shell
  gcloud ml-engine jobs stream-logs $JOB_NAME
  ```

or, alternatively,

- Inspect the training process on TensorBoard (open web preview on port 6006)

  ```shell
  tensorboard --logdir=$OUTPUT_PATH
  ```
- Learn more about hyperparameter tuning (see also here)
- Open `hptuning_config.yaml` in your code editor and fill in the missing code
- In the `mlengine` folder, set the environment variables

  ```shell
  HPTUNING_CONFIG=$(pwd)/hptuning_config.yaml
  JOB_NAME=bank_marketing_hptune_1
  OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  ```

- Run training with hyperparameter tuning

  ```shell
  gcloud ml-engine jobs submit training $JOB_NAME \
      --job-dir $OUTPUT_PATH \
      --runtime-version 1.10 \
      --config $HPTUNING_CONFIG \
      --module-name trainer.task \
      --package-path trainer/ \
      --region $REGION \
      --scale-tier STANDARD_1 \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 10000 \
      --eval-steps 1000 \
      --verbosity DEBUG
  ```

- View the job logs in Cloud Shell

  ```shell
  gcloud ml-engine jobs stream-logs $JOB_NAME
  ```

or, alternatively,

- Inspect the training process on TensorBoard (open web preview on port 6006)

  ```shell
  tensorboard --logdir=$OUTPUT_PATH/<trial_number>/
  ```
- Set the environment variable

  ```shell
  MODEL_NAME=bank_marketing
  ```

- Create a model in ML Engine

  ```shell
  gcloud ml-engine models create $MODEL_NAME --regions=$REGION
  ```

- Select the job output to use and look up the path to the model binaries

  ```shell
  gsutil ls -r $OUTPUT_PATH
  ```

- Set the environment variable with the correct values for `<trial_number>` and `<timestamp>`

  ```shell
  MODEL_BINARIES=$OUTPUT_PATH/<trial_number>/export/bank_marketing/<timestamp>/
  ```

- Create a version of the model

  ```shell
  gcloud ml-engine versions create v1 \
      --model $MODEL_NAME \
      --origin $MODEL_BINARIES \
      --runtime-version 1.10
  ```
- From the `mlengine` folder, inspect the test instances

  ```shell
  cat data/bank_data_test_no.json
  cat data/bank_data_test_yes.json
  ```
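A `--json-instances` file is newline-delimited JSON: one object per line, keyed by the model's inputs. A sketch of producing one such line; the feature names and values here are invented for illustration, the real keys follow `CSV_COLUMNS` in `trainer/model.py`:

```python
import json

# Invented feature names and values; the real instance keys follow
# CSV_COLUMNS in trainer/model.py.
instance = {"age": 41, "job": "technician", "balance": 0.39}
line = json.dumps(instance)  # one line of a --json-instances file
print(line)
```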
- Get the prediction for the two test instances

  ```shell
  gcloud ml-engine predict \
      --model $MODEL_NAME \
      --version v1 \
      --json-instances data/bank_data_test_no.json

  gcloud ml-engine predict \
      --model $MODEL_NAME \
      --version v1 \
      --json-instances data/bank_data_test_yes.json
  ```