Mayanka Jha edited this page May 4, 2018 · 15 revisions

Code Walkthrough

Pipeline

  1. execute.py: Runs the entire data processing pipeline and sets up the client.
  2. tokenize.py: Tokenizes the corpus.
  3. train_stmt/mallet.py: Trains the topic model.
  4. compute_saliency.py: Computes term saliency.
  5. compute_similarity.py: Computes term similarity.
  6. compute_seriation.py: Seriates terms.
  7. prepare_data_for_client.py: Generates data files for the client.
  8. prepare_vis_for_client.py: Copies the scripts needed by the client.
./execute.py --corpus-path <corpus_file> example_lda.cfg --model-path <any_path_for_model> --data-path <any_path_for_output>
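As a rough sketch of how the flags in the command line above might be parsed (argument names are taken from that command line; the actual parser in execute.py may define additional flags and defaults, and the corpus file name below is just a placeholder):

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the command line shown above;
    # not the real execute.py implementation.
    parser = argparse.ArgumentParser(description="Run the data processing pipeline")
    parser.add_argument("config", help="configuration file, e.g. example_lda.cfg")
    parser.add_argument("--corpus-path", required=True, help="input corpus file")
    parser.add_argument("--model-path", required=True, help="where to write the trained model")
    parser.add_argument("--data-path", required=True, help="where to write pipeline output")
    return parser

# Example: parse the flags from the command line shown above
# ("corpus.txt" is a placeholder corpus file name).
args = build_parser().parse_args([
    "example_lda.cfg",
    "--corpus-path", "corpus.txt",
    "--model-path", "model",
    "--data-path", "output",
])
```

Note that argparse converts the dashed flag names to attributes, so the parsed values are available as `args.corpus_path`, `args.model_path`, and `args.data_path`.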

Objective 1: Inspect the main() function to find out which flags are activated when running the file with the above command line.

The main function performs these steps:

  1. Argument Parsing
  2. Logging
  3. Tokens
  4. STMT/Mallet
  5. Compute Saliency
  6. Compute Similarity
  7. Compute Seriation
  8. Prepare Data
  9. Prepare Visualization
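The stages listed above can be sketched as a simple sequence of calls. This is an illustrative outline only: the step names and lambdas below are stand-ins for the real modules (tokenize.py, mallet.py, compute_saliency.py, and so on), which in practice communicate by reading and writing files on disk rather than passing objects:

```python
import logging

def run_pipeline(corpus_path, model_path, data_path):
    # Illustrative stubs for the pipeline stages; each real stage
    # consumes the previous stage's output files from disk.
    steps = [
        ("tokenize", lambda: f"tokenized {corpus_path}"),
        ("train", lambda: f"trained model in {model_path}"),
        ("compute_saliency", lambda: "computed term saliency"),
        ("compute_similarity", lambda: "computed term similarity"),
        ("compute_seriation", lambda: "seriated terms"),
        ("prepare_data", lambda: f"wrote data files to {data_path}"),
        ("prepare_vis", lambda: "copied client scripts"),
    ]
    completed = []
    for name, step in steps:
        logging.info("%s: %s", name, step())
        completed.append(name)
    return completed
```

The ordering matters: saliency, similarity, and seriation all depend on the term-topic matrix produced by the training stage, and the two prepare steps run last because they package everything for the client.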

Output Folder Structure

The output folder structure, specified via the --data-path parameter when executing the code, has the following format:

├── model
│   ├── term-index.txt
│   ├── term-topic-matrix.txt
│   └── topic-index.txt
├── saliency
│   ├── term-info.json
│   ├── term-info.txt
│   ├── topic-info.json
│   └── topic-info.txt
├── similarity
│   └── combined-g2.txt
├── tokens
│   └── tokens.txt
└── topic-model
    ├── output-topic-keys.txt
    ├── output.model
    ├── text.vectors
    ├── topic-word-weights.txt
    └── word-topic-counts.txt
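A small sketch for checking that a run produced the layout above (the directory and file names are copied verbatim from the tree; the helper function itself is not part of the pipeline):

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
LAYOUT = {
    "model": ["term-index.txt", "term-topic-matrix.txt", "topic-index.txt"],
    "saliency": ["term-info.json", "term-info.txt",
                 "topic-info.json", "topic-info.txt"],
    "similarity": ["combined-g2.txt"],
    "tokens": ["tokens.txt"],
    "topic-model": ["output-topic-keys.txt", "output.model", "text.vectors",
                    "topic-word-weights.txt", "word-topic-counts.txt"],
}

def missing_outputs(root):
    """Return the expected output files that are absent under root."""
    root = Path(root)
    return [f"{d}/{f}" for d, files in LAYOUT.items()
            for f in files if not (root / d / f).exists()]
```

Running `missing_outputs` against the --data-path directory after a pipeline run should return an empty list; any entries it returns point at the stage that failed to write its output.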
