Code Walkthrough
- execute.py: Runs the entire data processing pipeline and sets up the client.
- tokenize.py: Tokenizes the corpus.
- train_stmt/mallet.py: Trains the topic model.
- compute_saliency.py: Computes term saliency.
- compute_similarity.py: Computes term similarity.
- compute_seriation.py: Seriates terms.
- prepare_data_for_client.py: Generates data files for the client.
- prepare_vis_for_client.py: Copies the necessary scripts for the client.
- ./execute.py --corpus-path <corpus_file> example_lda.cfg --model-path <any_path_for_model> --data-path <any_path_for_output>
Objective 1: Inspect the main() function to find out which flags are activated when running the file with the above command line.
The main function performs these steps:
- Argument Parsing
- Logging
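The flags implied by the command line above might be wired up as follows. This is a hypothetical argparse sketch, not the actual execute.py source: the flag names come from the command line shown earlier, while the help strings and positional config argument are assumptions.

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of execute.py's argument parsing.
    # Flag names are taken from the example command line; everything
    # else (help text, positional config file) is assumed.
    parser = argparse.ArgumentParser(
        description="Run the full data processing pipeline")
    parser.add_argument("config_file", help="e.g. example_lda.cfg")
    parser.add_argument("--corpus-path", dest="corpus_path",
                        help="path to the input corpus")
    parser.add_argument("--model-path", dest="model_path",
                        help="where the trained model is written")
    parser.add_argument("--data-path", dest="data_path",
                        help="where pipeline output is written")
    return parser

args = build_parser().parse_args(
    ["example_lda.cfg", "--corpus-path", "corpus.txt",
     "--model-path", "model", "--data-path", "data"]
)
```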
Tokens
- 2.1 If this tokenization value is None, the execute function sets the tokenization value accordingly.
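The fallback in 2.1 can be sketched as a simple precedence rule. This is an illustrative reconstruction, not the actual code; the default value and function name are assumptions.

```python
# Hypothetical sketch of the fallback described in 2.1: if no
# tokenization value is given on the command line, fall back to the
# config file's value, then to a default.
DEFAULT_TOKENIZATION = "alpha"  # assumed default, not taken from the source

def resolve_tokenization(cli_value, config_value):
    # Prefer the command-line flag; otherwise use the config entry;
    # otherwise fall back to the default.
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return DEFAULT_TOKENIZATION

choice = resolve_tokenization(None, "unicode")
```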
STMT/Mallet
- 1.7 The execute.py script then checks model_library and verifies whether it is stmt or mallet. (I am not sure about the difference between the two; I will look further into their differences and significance.)
- 3.1 The train_mallet.sh script expects three arguments: input-file, output-path, and num-topics.
- 3.3 It then calls the mallet executable, which is inside mallet-2.0.7/bin.
- 3.4 The mallet model vectorizes the tokens.txt file and stores the result in text.vectors.
- 3.6 The following output files are created by the mallet Hierarchical LDA model:
- output.model
- output-topic-keys.txt
- topic-word-weights.txt
- word-topic-counts.txt
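Steps 3.3–3.6 correspond roughly to two mallet invocations: importing the token file into a vector file, then training topics. The sketch below only assembles the command lines without running them; the flag names are standard mallet options, but the exact flags and paths used by train_mallet.sh are assumptions based on the file names listed above.

```python
# Sketch of the two mallet invocations implied by steps 3.3-3.6.
# Commands are assembled but not executed; paths are assumptions
# based on the output file names listed above.
MALLET = "mallet-2.0.7/bin/mallet"

def mallet_commands(input_file, output_path, num_topics):
    # Step 3.4: vectorize the token file into text.vectors.
    import_cmd = [
        MALLET, "import-file",
        "--input", input_file,
        "--output", f"{output_path}/text.vectors",
        "--keep-sequence",
    ]
    # Step 3.6: train topics and write the four output files.
    train_cmd = [
        MALLET, "train-topics",
        "--input", f"{output_path}/text.vectors",
        "--num-topics", str(num_topics),
        "--output-model", f"{output_path}/output.model",
        "--output-topic-keys", f"{output_path}/output-topic-keys.txt",
        "--topic-word-weights-file", f"{output_path}/topic-word-weights.txt",
        "--word-topic-counts-file", f"{output_path}/word-topic-counts.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("tokens.txt", "topic-model", 20)
```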
- 1.9 The execute.py script then calls the import_mallet.execute function.
- 3.8 The extractTopicWordWeights function reads the contents of these files and gives us the outputs below:
- term_topic_matrix
- term_index
- topic_index
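The parsing in 3.8 can be sketched as follows. This is a hypothetical re-implementation, not the actual extractTopicWordWeights code; it assumes topic-word-weights.txt has one tab-separated "topic, term, weight" triple per line, which is mallet's usual format.

```python
from collections import defaultdict

def extract_topic_word_weights(lines):
    """Hypothetical sketch of step 3.8: turn topic-word-weights.txt
    lines ("topic<TAB>term<TAB>weight") into a term-by-topic matrix
    plus its row (term) and column (topic) indexes."""
    weights = defaultdict(dict)   # term -> {topic: weight}
    topics = set()
    for line in lines:
        topic, term, weight = line.rstrip("\n").split("\t")
        weights[term][int(topic)] = float(weight)
        topics.add(int(topic))
    term_index = sorted(weights)
    topic_index = sorted(topics)
    term_topic_matrix = [
        [weights[term].get(topic, 0.0) for topic in topic_index]
        for term in term_index
    ]
    return term_topic_matrix, term_index, topic_index

sample = ["0\tapple\t2.0", "0\tbanana\t1.0", "1\tapple\t0.5"]
matrix, term_index, topic_index = extract_topic_word_weights(sample)
```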
- 3.9 Similarly, stmt also expects three arguments: input-file, output-path, and num-topics.
- 3.11 It then creates and extracts the following:
- unpack topic-term-distribution.csv
- Generate topic-index list
- Copy term-index list
- Extract doc-index list
- Extract list of term frequencies
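The first two items in 3.11 can be sketched together: unpack the CSV and generate a topic-index list. This is an illustrative reconstruction under assumptions (one CSV row per topic, one column per term in the same order as the term-index list, generated topic labels of the form "Topic N"), not the actual stmt import code.

```python
import csv
import io

def unpack_topic_term_distribution(csv_text, term_index):
    """Sketch of step 3.11 (assumed layout: one CSV row per topic,
    one column per term, columns ordered like term_index)."""
    rows = [[float(x) for x in row]
            for row in csv.reader(io.StringIO(csv_text))]
    assert all(len(row) == len(term_index) for row in rows)
    # Generate the topic-index list (label format is an assumption).
    topic_index = [f"Topic {i}" for i in range(len(rows))]
    # Transpose topic-by-term rows into a term-by-topic matrix.
    term_topic_matrix = [list(col) for col in zip(*rows)]
    return term_topic_matrix, topic_index

matrix, topic_index = unpack_topic_term_distribution(
    "0.6,0.4\n0.1,0.9\n", ["apple", "banana"])
```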
Compute Saliency
- 1.10 The execute.py script runs the ComputeSaliency.execute function and passes the data path as an argument.
- 4.1 The data path should contain the term-topic probability distribution stored in 3 separate files:
- term-topic-matrix.txt: contains the entries of the matrix.
- term-index.txt: contains the terms corresponding to the rows of the matrix.
- topic-index.txt: contains the topic labels corresponding to the columns of the matrix.
- 4.2 It then computes topic info and term info, and ranks the results.
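The term-info computation in 4.2 can be sketched using the saliency definition from the Termite paper (Chuang et al., 2012): saliency(w) = P(w) · KL(P(T|w) ‖ P(T)). The actual ComputeSaliency.execute implementation may differ in details; this block is a minimal sketch assuming each matrix row holds that term's (unnormalized) topic weights.

```python
import math

def compute_saliency(term_topic_matrix, term_freqs):
    """Sketch of saliency(w) = P(w) * KL(P(T|w) || P(T)).
    Assumes one matrix row per term, one column per topic."""
    total = sum(term_freqs)
    p_w = [f / total for f in term_freqs]          # term probability P(w)
    num_topics = len(term_topic_matrix[0])
    p_t = [0.0] * num_topics                       # marginal P(T)
    cond = []                                      # conditional P(T|w)
    for probs, pw in zip(term_topic_matrix, p_w):
        row_sum = sum(probs)
        p_t_given_w = [p / row_sum for p in probs]
        cond.append(p_t_given_w)
        for j, p in enumerate(p_t_given_w):
            p_t[j] += pw * p
    saliency = []
    for pw, p_t_given_w in zip(p_w, cond):
        # KL divergence of the term's topic distribution from the marginal.
        kl = sum(p * math.log(p / q)
                 for p, q in zip(p_t_given_w, p_t) if p > 0)
        saliency.append(pw * kl)
    return saliency

# Two equally frequent terms with mirror-image topic profiles
# should get identical, positive saliency scores.
scores = compute_saliency([[9.0, 1.0], [1.0, 9.0]], [10, 10])
```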
Compute Similarity
Compute Seriation
Prepare Data
Prepare Visualization
- 8.1 The prepare_vis_for_client.sh script performs the actions below:
- Navigates to the public_html directory
- Copies the js and css files to this folder
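The copy step in 8.1 can be sketched in Python (the shell script itself likely uses cd and cp). The file names and directory layout below are illustrative assumptions, not taken from the actual prepare_vis_for_client.sh.

```python
import shutil
import tempfile
from pathlib import Path

def prepare_vis_for_client(src_dir, client_dir):
    """Sketch of step 8.1: copy js/css assets into the client's
    public_html directory. File names are illustrative only."""
    dest = Path(client_dir) / "public_html"
    dest.mkdir(parents=True, exist_ok=True)
    for asset in Path(src_dir).glob("*"):
        if asset.suffix in {".js", ".css"}:
            shutil.copy(asset, dest / asset.name)
    return sorted(p.name for p in dest.iterdir())

# Usage with throwaway directories and hypothetical asset names:
src = Path(tempfile.mkdtemp())
(src / "vis.js").write_text("// vis")
(src / "style.css").write_text("/* css */")
(src / "notes.txt").write_text("not an asset, should be skipped")
copied = prepare_vis_for_client(src, tempfile.mkdtemp())
```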
The output folder, specified as a parameter when executing the code, has the following structure:
├── model
│ ├── term-index.txt
│ ├── term-topic-matrix.txt
│ └── topic-index.txt
├── saliency
│ ├── term-info.json
│ ├── term-info.txt
│ ├── topic-info.json
│ └── topic-info.txt
├── similarity
│ └── combined-g2.txt
├── tokens
│ └── tokens.txt
└── topic-model
├── output-topic-keys.txt
├── output.model
├── text.vectors
├── topic-word-weights.txt
└── word-topic-counts.txt