Homework 02: Text Mining

Project Description

This project involves developing a pipeline for dictionary generation from a text corpus. The goal is to track specific vocabulary within the corpus using a dictionary method. The project requires selecting a relevant corpus, defining the vocabulary categories, and implementing a methodology to extract meaningful insights.

Project Structure

textmining_hw02/
|-- documents/                  # Documents for the project
|   |-- Gentzkow (2010).pdf
|   |-- Hassan (2019).pdf
|   |-- hw02.pdf
|-- packages/                   # Package initialization file
|   |-- __pycache__.py          
|   |-- categories.py           # Create new categories
|   |-- preprocessing.py        # Data processing
|-- HW2_TEXT_MINING.pdf         # PDF with all our results and analysis
|-- hw02.ipynb                  # Main notebook
|-- README.md                   # Description of the project structure
|-- requirements.txt            # Dependencies required to run
|-- setup.py                    # Installation and setup script

Installation and Setup

Requirements

To install the required dependencies, run:

pip install -r requirements.txt

Running the Pipeline

Prepare the Corpus:
- Choose a dataset from class materials or other sources like Kaggle, Google Books, or scraped web content.
- Ensure the dataset covers diverse topics and contains metadata.
Preprocess the Text:
- Run the preprocessing script to clean and normalize the text:
```
python packages/preprocessing.py
```
- This step covers tokenization, stopword removal, lemmatization, and metadata extraction.
Generate Dictionaries:
- Define dictionary categories in categories.py.
- Run the script to generate dictionaries:
```
python packages/categories.py
```
- The script uses TF-IDF and other statistical methods to extract meaningful vocabulary.
Analyze the Data:
- Open the Jupyter Notebook and execute the analysis:
```
jupyter notebook hw02.ipynb
```
- This notebook visualizes dictionary distributions and metadata-based insights.
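
As a rough illustration of the preprocessing step described above, the sketch below lowercases text, tokenizes it, and removes stopwords using only the standard library. It is not the code in packages/preprocessing.py: the `preprocess` function and the short `STOPWORDS` set are hypothetical, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(preprocess("The risk of a trade war is rising."))
# ['risk', 'trade', 'war', 'rising']
```

Returning a plain list of tokens keeps the output easy to pass to later steps such as term weighting or dictionary matching.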

Methodology

Data Collection
- The corpus consists of diverse text sources, ensuring topic coverage.
- Metadata is extracted to categorize documents over time.
Dictionary Generation
- Uses TF-IDF and predefined heuristics to build dictionaries.
- Inspired by approaches from Gentzkow & Shapiro (2010), Hassan et al. (2019), and García-Uribe (2024).
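
To make the TF-IDF idea concrete, here is a minimal sketch of how candidate dictionary terms could be ranked. It assumes documents are already tokenized and uses the plain weighting tf = count / document length and idf = log(N / df); the project's own script may use a different variant or a library implementation. The names `tfidf` and `top_terms` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: score} dict per tokenized document.

    tf is the term count divided by document length; idf is log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(doc_scores, k=2):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    ["tariff", "trade", "policy"],
    ["election", "vote", "policy"],
    ["trade", "tariff", "tariff"],
]
scores = tfidf(docs)
print(top_terms(scores[1]))
# ['election', 'vote']
```

Terms that appear in only one document ("election", "vote") outrank terms shared across documents ("policy"), which is the property that makes TF-IDF useful for proposing category-specific vocabulary.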
Analysis
- Computes dictionary term distributions across metadata groups.
- Explores patterns and relationships within the corpus.
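
The distribution of dictionary terms across metadata groups can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: each document is assumed to be a dict with a `tokens` list and a metadata field (here a hypothetical `year`), and the function `category_shares` and the two toy dictionaries are invented for the example.

```python
from collections import Counter, defaultdict


def category_shares(docs, dictionaries, group_key):
    """Return {group: {category: share of the group's tokens in that category}}."""
    hits = defaultdict(Counter)  # per-group counts of dictionary matches
    totals = Counter()           # per-group total token counts
    for doc in docs:
        group = doc[group_key]
        totals[group] += len(doc["tokens"])
        for token in doc["tokens"]:
            for category, terms in dictionaries.items():
                if token in terms:
                    hits[group][category] += 1
    return {
        group: {cat: hits[group][cat] / totals[group] for cat in dictionaries}
        for group in totals
        if totals[group]  # skip groups with no tokens to avoid dividing by zero
    }


dictionaries = {"trade": {"tariff", "trade"}, "politics": {"election", "vote"}}
docs = [
    {"year": 2019, "tokens": ["tariff", "trade", "policy", "vote"]},
    {"year": 2020, "tokens": ["election", "vote", "policy", "policy"]},
]
shares = category_shares(docs, dictionaries, "year")
print(shares[2019])
# {'trade': 0.5, 'politics': 0.25}
```

Normalizing by the group's total token count makes groups of different sizes comparable, which matters when some metadata groups contain far more text than others.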

Contributions

This project is an academic exercise in text mining and dictionary-based analysis. Ethical considerations apply when scraping and analyzing text data.

References

Gentzkow, M., & Shapiro, J. (2010). What Drives Media Slant? Evidence from U.S. Daily Newspapers.
Hassan, T. A., et al. (2019). Firm-Level Political Risk: Measurement and Effects.
García-Uribe, S., et al. (2024). Economic Uncertainty and Divisive Politics.
