
cbs_parse_data

Conor Wild edited this page Aug 23, 2022 · 4 revisions

Description

This script takes as input your raw data export (e.g., the .csv file that you saved from the CBS admin portal), extracts and calculates hidden score features, and reformats the data into a tidy wideform format with one row per user / timepoint. That is, the resulting data file will contain users' assessments as individual rows and the test scores & features as columns. This is the most commonly requested pre-processing step when working with CBS data, because the raw data export is longform (one row per test score) with lots of extraneous data fields and buried data features.
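
To make the longform-to-wideform reshaping concrete, here is a minimal Python sketch using only the standard library. It is not the script's actual implementation, and the field names (user, test, score) and values are made up for illustration; a real export has many more fields.

```python
from collections import defaultdict

# Hypothetical longform records: one row per test score.
long_rows = [
    {"user": "user_a", "test": "spatial_span", "score": 4.0},
    {"user": "user_a", "test": "token_search", "score": 7.0},
    {"user": "user_b", "test": "spatial_span", "score": 5.0},
    {"user": "user_b", "test": "token_search", "score": 6.0},
]


def to_wide(rows):
    """Pivot longform rows into one dict per user, keyed by test name."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["user"]][row["test"]] = row["score"]
    return dict(wide)


wide_rows = to_wide(long_rows)
for user, scores in wide_rows.items():
    print(user, scores)
```

Each user ends up as a single entry with one value per test, which is the same idea as the one-row-per-user / timepoint layout described above.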

Help

$ cbs_parse_data --help

usage: cbs_parse_data [-h] [-o {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...]] [--destination DESTINATION]
                      [-i [INCLUDE_COLUMNS ...]] [-x [EXCLUDE_FEATURES ...]] [-n {-1,0,1,2,3,4,5,6,7,8}] [-u USER_COLUMN]
                      [-e | --strip-emails | --no-strip-emails] [-k [KEEP_USERS ...]] [-d [DROP_USERS ...]] [-l LIMIT]
                      [--drop-session-data | --no-drop-session-data]
                      input_file output_name

CBS Data Parser

positional arguments:
  input_file            The source data file.
  output_name           The name of the output file(s).

optional arguments:
  -h, --help            show this help message and exit
  -o {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...], --outputs {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...]
                        A space separated list of files types to output.
  --destination DESTINATION
                        Path to save output files (default: location of the input file)
  -i [INCLUDE_COLUMNS ...], --include-columns [INCLUDE_COLUMNS ...]
                        List of other columns (space separated) to include as features
  -x [EXCLUDE_FEATURES ...], --exclude-features [EXCLUDE_FEATURES ...]
                        List of features (space separated) to drop
  -n {-1,0,1,2,3,4,5,6,7,8}, --num-jobs {-1,0,1,2,3,4,5,6,7,8}
                        How many cores to use for exracting features? (default: -1)
  -u USER_COLUMN, --user-column USER_COLUMN
                        The column for User identifier (default: 'Email Address')
  -e, --strip-emails, --no-strip-emails
                        If user IDs are stored as email addresses, setting this to True will strip the "@domain.com" - leaving only
                        the username. (default: False)
  -k [KEEP_USERS ...], --keep-users [KEEP_USERS ...]
                        A comma-separated list of user IDs to keep (drops all others)
  -d [DROP_USERS ...], --drop-users [DROP_USERS ...]
                        A comma-separated list of user IDs to drop (keeps all others)
  -l LIMIT, --limit LIMIT
                        Process only the first N scores
  --drop-session-data, --no-drop-session-data
                        Minimize memory usage by dropping the session data field? (default: False)

Required (Positional) Arguments

From the above --help, you can see that this script requires at least two arguments to run:

  1. input_file is the name of the input file you want to process, most likely the .csv file of a raw CBS data export. If the file is not in your present working directory, include a relative path to it in the filename.
  2. output_name is how any saved files will be named. For example, if you specify my_parsed_CBS_data then you will end up with a file called my_parsed_CBS_data.csv.
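
If you want to launch the script from Python rather than typing it in a shell, the argument list can be assembled as sketched below. build_command is a hypothetical helper, not part of the package, and the file names are placeholders. The -o and --destination options are taken from the --help text above. The sketch puts the two positional arguments first, mirroring the examples on this page.

```python
import shlex


def build_command(input_file, output_name, outputs=("csv",), destination=None):
    """Assemble a cbs_parse_data argument list (hypothetical helper)."""
    # Positional arguments first, mirroring the examples on this page.
    cmd = ["cbs_parse_data", input_file, output_name]
    # -o takes one or more of: csv, pickle, report, stdout (per --help).
    cmd += ["-o", *outputs]
    if destination is not None:
        cmd += ["--destination", destination]
    return cmd


cmd = build_command("raw_export.csv", "my_parsed_CBS_data")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the options after the positional arguments is a deliberate choice here: options such as -o accept several values, so placing them last avoids any ambiguity about where their value list ends.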

Optional (Named) Arguments

Read the descriptions in the --help output above. More details may be added here later.

Examples

  • The following examples use a small test dataset (called cbs_example_data_A.csv) to illustrate how some of the arguments work.
  • Data are printed to the console (using -o stdout) rather than saved in files so that we can see the results here. Therefore, the second positional argument (output_name) is unused.
  • The printed results are truncated to fit in this text view (note the columns of ...).
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout

cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete

                                                                                            spatial_span                                    ...  token_search
                                                                                            max_score avg_score avg_ms_per_item num_errors  ...  num_correct num_attempts duration_ms avg_ms_correct
user                                               report     test_date  batch_name                                                         ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch          4.0      3.50       2078.5000          3  ...            6            9      104458    9749.166667
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch          5.0      4.75       1877.0875          3  ...            3            6       81422   14428.333333
  • Do the same as above, but run it using the Docker image
$ docker run --rm -it -v $PWD:/tmp -w /tmp ghcr.io/theowenlab/cbspython:latest cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout

cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete

                                                                                            spatial_span                                    ...  token_search
                                                                                            max_score avg_score avg_ms_per_item num_errors  ...  num_correct num_attempts duration_ms avg_ms_correct
user                                               report     test_date  batch_name                                                         ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch          4.0      3.50       2078.5000          3  ...            6            9      104458    9749.166667
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch          5.0      4.75       1877.0875          3  ...            3            6       81422   14428.333333

[2 rows x 95 columns]
  • This time, strip the user IDs out of the "autoregister" email addresses
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout --strip-emails
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete

                                                          spatial_span                                                ...   token_search
                                                          max_score avg_score avg_ms_per_item num_errors num_correct  ...   num_errors num_correct num_attempts duration_ms avg_ms_correct
user             report     test_date  batch_name                                                                     ...
spc0bs93c4cc3218 76446959.0 2021-08-14 test data batch          4.0      3.50       2078.5000          3           2  ...            3           6            9      104458    9749.166667
spc0bs9e2bba6684 76763307.0 2021-08-23 test data batch          5.0      4.75       1877.0875          3           4  ...            3           3            6       81422   14428.333333

[2 rows x 95 columns]
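
The effect of --strip-emails on a single user ID can be sketched in a couple of lines of Python. This illustrates the documented behaviour (removing the "@domain.com" part and keeping the username); it is not the script's own code, and the addresses below are made-up placeholders.

```python
def strip_email(user_id):
    """Return the part of a user ID before the first '@' (unchanged if there is none)."""
    return user_id.split("@", 1)[0]


print(strip_email("participant01@example.com"))  # participant01
print(strip_email("participant02"))              # participant02
```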
  • How about we drop some columns (unnecessary features?) from the output using -x
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout -x avg_ms_per_item num_errors avg_ms_correct duration_ms

cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete

                                                                                            spatial_span                                  ...    token_search
                                                                                            max_score avg_score num_correct num_attempts  ...    max_score avg_score num_correct num_attempts
user                                               report     test_date  batch_name                                                       ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch          4.0      3.50           2            5  ...          7.0       5.5           6            9
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch          5.0      4.75           4            7  ...          6.0       5.0           3            6

[2 rows x 54 columns]
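
In the output above, the excluded feature names no longer appear under either of the tests shown. A minimal sketch of that behaviour follows, assuming each name given to -x is matched against the feature name (the second level of the column header) for every test. The column labels are a small hypothetical subset, and this is not the script's actual code.

```python
# Hypothetical (test, feature) column labels, following the two-level header above.
columns = [
    ("spatial_span", "max_score"),
    ("spatial_span", "num_errors"),
    ("token_search", "max_score"),
    ("token_search", "num_errors"),
]


def exclude_features(columns, excluded):
    """Keep only the columns whose feature name is not in `excluded`."""
    excluded = set(excluded)
    return [(test, feature) for test, feature in columns if feature not in excluded]


kept = exclude_features(columns, ["num_errors"])
print(kept)
```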
  • Drop one of the users
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout -e -d spc0bs93c4cc3218

cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete

                                                          spatial_span                                                ...   token_search
                                                          max_score avg_score avg_ms_per_item num_errors num_correct  ...   num_errors num_correct num_attempts duration_ms avg_ms_correct
user             report     test_date  batch_name                                                                     ...
spc0bs9e2bba6684 76763307.0 2021-08-23 test data batch          5.0      4.75       1877.0875          3           4  ...            3           3            6       81422   14428.333333

[1 rows x 95 columns]
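
The keep / drop logic described in the --help text (-k keeps the listed users and drops all others, -d drops the listed users and keeps all others) can be sketched as follows. filter_users is a hypothetical helper written for illustration, not the script's own code; the two user IDs are the stripped IDs from the example output above.

```python
def filter_users(user_ids, keep=None, drop=None):
    """Apply keep-list and drop-list filtering to a list of user IDs."""
    keep = set(keep) if keep else None  # None means "keep everyone"
    drop = set(drop or [])
    return [u for u in user_ids if (keep is None or u in keep) and u not in drop]


users = ["spc0bs93c4cc3218", "spc0bs9e2bba6684"]
print(filter_users(users, drop=["spc0bs93c4cc3218"]))
print(filter_users(users, keep=["spc0bs93c4cc3218"]))
```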
