- Expand Contractions
  - Converts words like `"haven't"` → `"have not"`.
  - Library: `contractions`
- Lowercase Conversion
  - Converts all words to lowercase so that `"love"` and `"Love"` count as the same word.
  - Library: Python built-in string methods.
- Remove Punctuation
  - Removes punctuation and special characters, since we only care about the words.
  - Library: Python's `string` module.
- Tokenization
  - Splits the text of a letter into individual words so we can iterate through them.
  - Library: `spacy`
- Remove Stop Words
  - Filters out common stop words like "and," "the," and "is."
  - Library: `spacy`
- Lemmatization
  - Converts words to their base/root form, like `"running"` → `"run"`.
  - Library: `spacy`
- Save Cleaned Data
  - Outputs the cleaned text as an array of JSON objects, where each object is: `{post_id (string): [array, of, words, ...]}`
  - Outputs a bag of words with duplicates
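
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`expand_contractions`, `normalize`, `spacy_lemmas`, `to_record`) and the sample `post_1` ID are hypothetical. The third-party imports are done lazily so the standard-library helpers run on their own. Note that contractions should be expanded before punctuation is stripped; otherwise `"haven't"` would collapse to `"havent"`.

```python
import json
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: non-ASCII punctuation (e.g. curly quotes) is not handled here.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def expand_contractions(text: str) -> str:
    """Expand contractions, e.g. "haven't" -> "have not"."""
    import contractions  # third-party; imported lazily

    return contractions.fix(text)


def normalize(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


def spacy_lemmas(text: str) -> list[str]:
    """Tokenize, drop stop words, and lemmatize with spaCy."""
    import spacy  # third-party; requires the en_core_web_sm model

    nlp = spacy.load("en_core_web_sm")
    return [
        tok.lemma_
        for tok in nlp(text)
        if not tok.is_stop and not tok.is_space
    ]


def to_record(post_id: str, words: list[str]) -> dict[str, list[str]]:
    """Build one output object: {post_id: [array, of, words]}."""
    return {post_id: words}


# Standard-library-only usage: whitespace splitting stands in for spaCy here,
# so stop words stay in and duplicates are kept (a bag of words).
records = [to_record("post_1", normalize("I Love it, love it!").split())]
print(json.dumps(records))
# prints [{"post_1": ["i", "love", "it", "love", "it"]}]
```

In a full run, the text would pass through `expand_contractions`, then `normalize`, then `spacy_lemmas` before being wrapped with `to_record`.
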
Use pip to install the necessary Python packages:
pip install -r requirements.txt
python -m spacy download en_core_web_sm

Run `npm run start` to see the project on localhost:3000!