A place to communicate about the team flattened data needs.
Use the issues tools to make and manage requests.
Note: This flow diagram shows the general flow of information, but the exact function names and file names may be simplified for the purposes of the diagram.
flowchart LR
subgraph BQ [BigQuery]
SRC[[Source Table]]:::table
QUERY(Scheduled Query):::table
FLAT[[Flattened Table]]:::table
end
class BQ subgraphTitle;
SCH[/schema.json/]:::file
DEV[/dev.sql/]:::file
STG[/stg.sql/]:::file
PROD[/prod.sql/]:::file
subgraph LOC ["flatten.R"]
FETCH("fetch_schema()"):::function
SCH_CSV[/"schema.csv"/]:::file
FILT("sort_variables()"):::function
ARR[/arrays.json/]:::file
VAR[/variables.csv/]:::file
GEN("generate_queries()"):::function
end
class LOC subgraphTitle;
SRC --- |API| SCH:::link
SCH --> FETCH:::link
FETCH --> SCH_CSV:::link
SCH_CSV --> FILT:::link
FILT --> ARR:::link & VAR:::link
ARR & VAR --> GEN:::link
GEN --> DEV:::link & STG:::link & PROD:::link
DEV & STG & PROD --> |API| QUERY:::link
QUERY --> FLAT:::link
classDef table link stroke:#000,stroke-width:1.5px;
classDef schema fill:#ff9,stroke:#333,stroke-width:2px;
classDef function fill:#ffc,stroke:#333,stroke-width:2px;
classDef file fill:#cfc,stroke:#333,stroke-width:2px;
classDef subgraphTitle font-size:24px;
flatten.R is an R function that simplifies the process of flattening the Connect4Cancer source tables:
- Retrieve the schema of the source table using the BigQuery API.
- Convert the schema from JSON to CSV.
- Sort the variables in the source table by their data structure: arrays and non-array variables and stores them in appropriate files.
- Generate SQL queries for each environment.
- Update the Scheduled Query which generates the flattened table daily.
flatten_batch.R is the main entry point for the user to call flatten() for a given set of tables.
Note: The user can ignore the
bash/,python/andjs/directories which store legacy code that could be referenced later. The core functions of this workflow are in tehr/directory
The query_configs/ directory stores a metadata for each table to be flattened and is referenced by the flatten.Rmd notebook.
To set up a new table for flattening, a new config folder needs to be setup for the new table with meta data about the table.
Here is an example for the PROMIS survey table:
query_configs/promis_v1/
├── dev
│ ├── FlatConnect.promis_v1_JP-dev.sql
│ ├── promis_v1-lists.json
│ ├── promis_v1-variables.csv
│ └── query-config.json
├── prod
│ ├── FlatConnect.promis_v1_JP-prod.sql
│ ├── promis_v1-lists.json
│ ├── promis_v1-variables.csv
│ └── query-config.json
└── stg
├── FlatConnect.promis_v1_JP-stg.sql
├── promis_v1-lists.json
├── promis_v1-variables.csv
└── query-config.json
Each tier dev, stg and prod has a *.sql, *-lists.json, *-variables.csv, and a query-config.json file.
-
query-config.json: Specifies key information about the table, including the name of the source table, the name of the resulting SQL query, the name of target table and the -
resource_name(containing thetransfer_config_id) of the scheduled query to be updated. These bit of information must be updated manually. -
*-lists.json: A JSON file key/value pairs are the names of "array variables" and an array of possible responses for these variables, respectively. -
*-variables.csv: A CSV file listing the names of the variables for which a single value can be given. -
*.sql: The resulting SQL query that is generated by this workflow, which is delivered to BigQuery via the BQ Transfer Config API.