Infrastructure-as-code for deploying Monster team's environments.
We use Terraform and Helm to manage our deployments.
See the README.
We can automate nearly all of our infrastructure setup, but not the creation of resources in Terra (at least not yet). These actions require manual work:
- Registering a Google service account with Terra
- Creating a TDR resource profile
- Creating a TDR dataset
NOTE: The instructions below use many APIs that are planned to change drastically as part of Terra's new architecture. Your mileage may vary.
Terraform can create Google SAs, but it can't register them in the Terra system. We need to register the SAs that run our ingest pipelines in order to grant them read/write permissions to the dataset(s) they target.
To register an account:
- Apply the Terraform module that creates the account; it should also write the account's secret key to Vault
- Read the secret key from Vault into a JSON file on your local machine
- Run the registration script, passing the path to the key-file and the name of the targeted Terra environment
After registering the account, you'll still need to grant it permissions. The easiest way to do that right now is to make it a TDR steward:
- Go to the Terra UI for the targeted environment
- Click the top-left hamburger menu, then the dropdown with your name, then "Groups"
- Find the Stewards group in the list of your groups
  - Dev: "JadeStewards-dev"
  - Prod: "Stewards"
- Add the SA to the group using its email address
- Grant access to relevant datasets by calling the Jade addDatasetPolicyMember with policyName=steward for the SA in either Dev or Prod (see the Data repo FAQ)
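The group step above needs the SA's email address. Here is a minimal sketch for reading it out of the key file pulled from Vault, assuming the key uses Google's standard JSON key format, which stores the address under "client_email". The function name is hypothetical and not part of our tooling.

```python
import json


def email_from_key_json(key_json: str) -> str:
    """Return the service account's email from the text of a JSON key file.

    Assumes Google's standard key format, which stores the address
    under the "client_email" field.
    """
    key = json.loads(key_json)
    return key["client_email"]
```

To use it, read the local key file into a string (for example with pathlib.Path(...).read_text()) and pass the text to the function.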
Resource profiles connect Google Billing Accounts to the repository's machinery. You should only need to create a new profile when a project begins with a funding source that hasn't been used before.
Step 1 of setting up a profile is ensuring the TDR can access the targeted account. Grant the TDR's service account "Billing Account User" permissions on the account.
- Dev: jade-k8-sa@broad-jade-dev.iam.gserviceaccount.com
- Prod: terra-data-repository@broad-datarepo-terra-prod.iam.gserviceaccount.com
You need to be a Billing Account Administrator on the target account to make this change.
Step 2 is to get the ID of the Billing Account. If you're viewing the details page of the BA, the ID is in the URL:
https://console.cloud.google.com/billing/{id}
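If you'd rather not copy the ID by hand, here is a small sketch that pulls it out of a details-page URL. It assumes the URL follows the pattern shown above; the function name is hypothetical.

```python
from urllib.parse import urlparse


def billing_account_id(url: str) -> str:
    """Extract {id} from a URL shaped like https://console.cloud.google.com/billing/{id}."""
    # Split the path into its non-empty segments, ignoring any query string.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "billing":
        raise ValueError(f"not a billing-account URL: {url}")
    return parts[1]
```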
Step 3 is to link the Billing Account into the TDR. Visit the Swagger UI of the TDR instance. Under the "resources" section, expand the POST route. Click "Try it out" and make the following edits to the pre-populated JSON:
- Replace the value of "biller" with the constant string "direct"
- Replace the value of "billingAccountId" with the ID from step 2
- Replace the value of "profileName" with some unique name for the profile object; it will be used in the name of the generated GCP project
NOTE: When the TDR creates a project, it applies a prefix to the profile name. Google imposes a character maximum on project names. This means that profile names are effectively length-limited, but the limit depends on other configuration in the TDR. In the current production deployment, the limit is 4 characters.
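The three edits above can be sketched as a helper that builds the JSON body and rejects names that are too long before you submit anything. This is only an illustration: the function and its max_name_length parameter are hypothetical, the limit has to come from whatever applies to the TDR deployment you are targeting, and the pre-populated JSON in Swagger may contain fields not shown here.

```python
import json


def profile_request(billing_account_id: str, profile_name: str, max_name_length: int) -> dict:
    """Build the body for the resource-profile POST described above."""
    if not profile_name:
        raise ValueError("profile name must not be empty")
    # The effective limit depends on the TDR deployment, so the caller supplies it.
    if len(profile_name) > max_name_length:
        raise ValueError(
            f"profile name {profile_name!r} is longer than {max_name_length} characters"
        )
    return {
        "biller": "direct",  # constant string, per the instructions above
        "billingAccountId": billing_account_id,
        "profileName": profile_name,
    }
```

Printing json.dumps(profile_request(...), indent=2) gives text you can paste into the "Try it out" box.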
Once you've filled out the JSON, you can submit the POST. If everything works out, you should get back the same payload with extra fields:
- An "accessible" field with a value of true
- An "id" field with a UUID
The UUID is needed for dataset creation.
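As a sanity check on the response, the sketch below confirms both extra fields and returns the profile's UUID. It assumes the response has already been parsed into a dict with the two fields named above; the function name is hypothetical.

```python
import uuid


def profile_id_from_response(response: dict) -> str:
    """Return the profile's UUID after checking the fields described above."""
    if response.get("accessible") is not True:
        raise ValueError("the TDR did not report the billing account as accessible")
    # uuid.UUID raises ValueError if the "id" field is not a well-formed UUID.
    return str(uuid.UUID(response["id"]))
```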
TDR Datasets are the main targets of our ingest pipelines. Most of the hard work that goes into dataset creation involves schema design & declaration. Our ingest-utils repository includes tooling & build plugins to assist with that piece of the puzzle.
Pre-work:
- Create a resource profile for the dataset
- Declare the schema for the dataset in the ingest project, using our plugins
From there, step 1 is to generate the Jade-compatible definition of the schema. From the root
of the ingest project, run sbt generateJadeSchema. The output should include a line:
[info] Wrote Jade schema to <some-path>/schema.json
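If you run this step from a script, the schema path can be recovered from the captured sbt output. The sketch below assumes the log line appears exactly as shown above, without color codes; the function name is hypothetical.

```python
import re
from typing import Optional

# Matches the log line shown above and captures everything after "to ".
_SCHEMA_LINE = re.compile(r"^\[info\] Wrote Jade schema to (?P<path>.+)$", re.MULTILINE)


def schema_path(sbt_output: str) -> Optional[str]:
    """Return the path sbt reported writing the Jade schema to, or None if absent."""
    match = _SCHEMA_LINE.search(sbt_output)
    return match.group("path").strip() if match else None
```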
Step 2 is to declare the dataset. Visit the Swagger UI of the TDR instance.
Under the "repository" section, look for the POST /api/repository/v1/datasets route.
Expand it, click "Try it out", and make the following edits to the pre-populated JSON:
- Delete the "additionalProfileIds" field
- Replace the value of "defaultProfileId" with the UUID of the resource profile you want to use
- Replace the value of "description" with whatever you'd like, or delete it
- Replace the value of "name" with a BigQuery-compatible identifier (only lowercase alphanumeric characters and '_' allowed)
- Replace the entire value of "schema" with the contents of the Jade schema generated by sbt in step 1
Once you've filled out the JSON, you can submit the POST. You'll get back a job ID.
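The edits to the dataset JSON can be sketched the same way. This hypothetical helper validates the name against the rule above (lowercase alphanumerics and '_' only), leaves out "additionalProfileIds", and includes "description" only when one is given. It assumes the schema has already been loaded from schema.json into a dict.

```python
import re
from typing import Optional

# Only lowercase alphanumeric characters and '_' are allowed in dataset names.
_VALID_NAME = re.compile(r"[a-z0-9_]+")


def dataset_request(name: str, profile_id: str, schema: dict,
                    description: Optional[str] = None) -> dict:
    """Build the body for POST /api/repository/v1/datasets as described above."""
    if not _VALID_NAME.fullmatch(name):
        raise ValueError(f"dataset name {name!r} is not a BigQuery-compatible identifier")
    request = {"name": name, "defaultProfileId": profile_id, "schema": schema}
    # "additionalProfileIds" is intentionally never added to the request.
    if description is not None:
        request["description"] = description
    return request
```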
Step 3 is to poll the job ID until it finishes. You can do so using the GET /api/repository/v1/jobs/{id}
route in the Swagger UI. When the job exits the "running" state, you can get its final results using
the GET /api/repository/v1/jobs/{id}/result endpoint. For succeeded jobs, this call will output
the ID of the new dataset. For failed jobs, this call will show information about what went wrong.
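The polling in step 3 can be sketched as a loop that waits until the job leaves the "running" state. To avoid guessing at the response format or authentication details, this version takes a caller-supplied get_state function (hypothetical) that performs the GET /api/repository/v1/jobs/{id} call and returns the job's state as a string.

```python
import time
from typing import Callable


def wait_for_job(get_state: Callable[[], str],
                 interval_seconds: float = 5.0,
                 max_polls: int = 120) -> str:
    """Poll until the job is no longer "running" and return its final state."""
    for _ in range(max_polls):
        state = get_state()
        if state != "running":
            return state
        # Pause between polls so we don't hammer the API.
        time.sleep(interval_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Once this returns, fetch the outcome from the GET /api/repository/v1/jobs/{id}/result endpoint as described above.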