cbs_parse_data
This script takes as input your raw data export (e.g., the .csv file that you saved from the CBS admin portal), extracts and calculates hidden score features, and reformats the data into a tidy wideform format with one row per user / timepoint. That is, the resulting data file will contain users' assessments as individual rows and the test scores & features as columns. This is the most commonly requested pre-processing step when working with CBS data, because the raw data export is longform (one row per test score) with lots of extraneous data fields and buried data features.
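To make the longform-to-wideform idea concrete, here is a minimal pandas sketch of that reshaping. This is illustrative only, not the script's actual implementation, and the column names (user, test, score) are hypothetical stand-ins for the real export's fields.

```python
# Illustrative sketch: reshape a longform export (one row per test score)
# into a wideform table (one row per user) with pandas.
# Column names here are hypothetical, not the CBS export's actual fields.
import pandas as pd

long_df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2"],
    "test": ["spatial_span", "token_search", "spatial_span", "token_search"],
    "score": [4.0, 7.0, 5.0, 6.0],
})

# Pivot so each user row carries one column per test score
wide_df = long_df.pivot(index="user", columns="test", values="score")
print(wide_df)
```

The real script does considerably more (hidden feature extraction, multi-level columns per test), but the core reshape is this kind of pivot.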
$ cbs_parse_data --help
usage: cbs_parse_data [-h] [-o {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...]] [--destination DESTINATION]
[-i [INCLUDE_COLUMNS ...]] [-x [EXCLUDE_FEATURES ...]] [-n {-1,0,1,2,3,4,5,6,7,8}] [-u USER_COLUMN]
[-e | --strip-emails | --no-strip-emails] [-k [KEEP_USERS ...]] [-d [DROP_USERS ...]] [-l LIMIT]
[--drop-session-data | --no-drop-session-data]
input_file output_name
CBS Data Parser
positional arguments:
input_file The source data file.
output_name The name of the output file(s).
optional arguments:
-h, --help show this help message and exit
-o {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...], --outputs {csv,pickle,report,stdout} [{csv,pickle,report,stdout} ...]
A space separated list of files types to output.
--destination DESTINATION
Path to save output files (default: location of the input file)
-i [INCLUDE_COLUMNS ...], --include-columns [INCLUDE_COLUMNS ...]
List of other columns (space separated) to include as features
-x [EXCLUDE_FEATURES ...], --exclude-features [EXCLUDE_FEATURES ...]
List of features (space separated) to drop
-n {-1,0,1,2,3,4,5,6,7,8}, --num-jobs {-1,0,1,2,3,4,5,6,7,8}
How many cores to use for exracting features? (default: -1)
-u USER_COLUMN, --user-column USER_COLUMN
The column for User identifier (default: 'Email Address')
-e, --strip-emails, --no-strip-emails
If user IDs are stored as email addresses, setting this to True will strip the "@domain.com" - leaving only
the username. (default: False)
-k [KEEP_USERS ...], --keep-users [KEEP_USERS ...]
A comma-separated list of user IDs to keep (drops all others)
-d [DROP_USERS ...], --drop-users [DROP_USERS ...]
A comma-separated list of user IDs to drop (keeps all others)
-l LIMIT, --limit LIMIT
Process only the first N scores
--drop-session-data, --no-drop-session-data
Minimize memory usage by dropping the session data field? (default: False)
From the above --help, you can see that this script requires at least two arguments to run:
- input_file is the name of the input file you want to process, most likely a .csv file of a raw CBS data export. The file must be in your present working directory, or you can include a relative path in the filename.
- output_name is how any saved files will be named. For example, if you specify my_parsed_CBS_data then you will end up with a file called my_parsed_CBS_data.csv.
Read the descriptions in the --help output above for the remaining arguments. More details may be added here later.
- The following examples use a small test dataset (called cbs_example_data_A.csv) to illustrate how some of the arguments work.
- Data are printed to the console (using -o stdout) rather than saved to files, so that we can see the results here. The second positional argument (output_name) is therefore unused.
- Results are compressed / truncated to fit in this text view (note the columns of ...)
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete
spatial_span ... token_search
max_score avg_score avg_ms_per_item num_errors ... num_correct num_attempts duration_ms avg_ms_correct
user report test_date batch_name ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch 4.0 3.50 2078.5000 3 ... 6 9 104458 9749.166667
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch 5.0 4.75 1877.0875 3 ... 3 6 81422 14428.333333
- Do the same as above, but run it using the Docker image
$ docker run --rm -it -v $PWD:/tmp -w /tmp ghcr.io/theowenlab/cbspython:latest cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete
spatial_span ... token_search
max_score avg_score avg_ms_per_item num_errors ... num_correct num_attempts duration_ms avg_ms_correct
user report test_date batch_name ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch 4.0 3.50 2078.5000 3 ... 6 9 104458 9749.166667
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch 5.0 4.75 1877.0875 3 ... 3 6 81422 14428.333333
[2 rows x 95 columns]
- This time, strip the user IDs out of the "autoregister" email addresses
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout --strip-emails
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete
spatial_span ... token_search
max_score avg_score avg_ms_per_item num_errors num_correct ... num_errors num_correct num_attempts duration_ms avg_ms_correct
user report test_date batch_name ...
spc0bs93c4cc3218 76446959.0 2021-08-14 test data batch 4.0 3.50 2078.5000 3 2 ... 3 6 9 104458 9749.166667
spc0bs9e2bba6684 76763307.0 2021-08-23 test data batch 5.0 4.75 1877.0875 3 4 ... 3 3 6 81422 14428.333333
[2 rows x 95 columns]
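The --strip-emails transformation shown above is conceptually just keeping the part of each user ID before the "@". A hedged sketch (the email address below is a made-up example, not from the dataset):

```python
# Sketch of what --strip-emails conceptually does to an email-style user ID:
# keep only the username portion before the "@".
def strip_email(user_id: str) -> str:
    """Return the part of user_id before the first "@" (unchanged if no "@")."""
    return user_id.split("@", 1)[0]

print(strip_email("someuser@example.com"))  # someuser
```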
- How about we drop some columns (unnecessary features?) from the output using -x
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout -x avg_ms_per_item num_errors avg_ms_correct duration_ms
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete
spatial_span ... token_search
max_score avg_score num_correct num_attempts ... max_score avg_score num_correct num_attempts
user report test_date batch_name ...
spc0bs93c4cc3218@researcher-102448.autoregister... 76446959.0 2021-08-14 test data batch 4.0 3.50 2 5 ... 7.0 5.5 6 9
spc0bs9e2bba6684@researcher-102448.autoregister... 76763307.0 2021-08-23 test data batch 5.0 4.75 4 7 ... 6.0 5.0 3 6
[2 rows x 54 columns]
- Drop one of the users (note that -e is also used here, so the remaining user ID is shown with the email stripped)
$ cbs_parse_data cbs_example_data_A.csv test_data -n 1 -o stdout -e -d spc0bs93c4cc3218
cbs_example_data_A.csv initially has 26 rows.
Restructing data frame ...
Clustering scores ...
Parsing Score Features (n_cores=1)
Parsing Score Features: |██████████████████████████████████████████████████| 100.0% Complete
spatial_span ... token_search
max_score avg_score avg_ms_per_item num_errors num_correct ... num_errors num_correct num_attempts duration_ms avg_ms_correct
user report test_date batch_name ...
spc0bs9e2bba6684 76763307.0 2021-08-23 test data batch 5.0 4.75 1877.0875 3 4 ... 3 3 6 81422 14428.333333
[1 rows x 95 columns]
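The -k/--keep-users and -d/--drop-users filters amount to simple membership tests on the user column. A minimal pandas sketch, assuming hypothetical column names, of what each option does:

```python
# Sketch of the user filtering behind -k/--keep-users and -d/--drop-users.
# Column names are hypothetical; the real script keys on the --user-column.
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", "c"], "score": [1, 2, 3]})

drop_users = ["b"]
kept = df[~df["user"].isin(drop_users)]       # -d: drop listed users, keep the rest

keep_users = ["a", "c"]
kept_only = df[df["user"].isin(keep_users)]   # -k: keep listed users, drop the rest

print(kept)
```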