coursework/week4 at master · ys3006/coursework

Day 1

Notes on naive Bayes, logistic regression, and classifier evaluation
A video explaining ROC curves with an accompanying interactive demo
We had a guest lecture from Hal Daume on natural language processing
- Slides on word sense disambiguation, expectation maximization, and word alignment
- The Yarowsky algorithm for word sense disambiguation
- A statistical approach to machine translation
- See these interactive demos on k-means and mixture models

See the example we worked on in class for the NYTimes API, which uses the requests module for easy HTTP functionality
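As a rough illustration of that pattern (not the in-class example itself), here is a minimal sketch of querying the Article Search API one page at a time. The endpoint URL, the `section_name` filter field, and the response layout (`response` → `docs`) are assumptions based on the public API documentation, and the helper names `build_params`, `extract_articles`, and `fetch_page` are hypothetical.

```python
"""Sketch: query the NYTimes Article Search API one page at a time."""

# Assumed endpoint for the Article Search API (version 2).
API_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_params(section, page, api_key):
    """Query parameters for one page of results, newest articles first."""
    return {
        "api-key": api_key,
        # Assumed filter syntax: restrict results to a single section.
        "fq": 'section_name:("{}")'.format(section),
        "sort": "newest",
        "page": page,
    }


def extract_articles(payload):
    """Pull the fields we care about out of one decoded JSON response."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "section_name": doc.get("section_name", ""),
            "web_url": doc.get("web_url", ""),
            "snippet": doc.get("snippet", ""),
        }
        for doc in docs
    ]


def fetch_page(section, page, api_key):
    """Fetch and decode one page; needs the third-party requests package."""
    import requests  # imported lazily so the helpers above work without it

    params = build_params(section, page, api_key)
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_articles(response.json())
```

Calling `fetch_page("World", 0, your_key)` would return the first page of results as a list of dictionaries. Assuming the API returns 10 articles per page, collecting 1000 articles per section means looping over pages 0 through 99, ideally with a short `time.sleep` between requests to stay within the API's rate limits.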

Read this overview of JSON and review the first two sections of this overview of Python's json module
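To make the JSON round trip concrete, here is a small sketch using Python's json module: it encodes a nested dictionary to a JSON string, decodes it back, and flattens the result into tab-separated rows. The payload is made up for illustration and only imitates the assumed shape of an API response; `to_tsv` is a hypothetical helper, not part of the course code.

```python
import csv
import io
import json

# Made-up data, shaped like the assumed API response (response -> docs).
payload = {
    "response": {
        "docs": [
            {"section_name": "Business", "snippet": "Stocks rose\ton Monday."},
            {"section_name": "World", "snippet": "Talks resumed\nin Geneva."},
        ]
    }
}

text = json.dumps(payload)   # Python dict -> JSON string
decoded = json.loads(text)   # JSON string -> Python dict


def to_tsv(docs):
    """Write docs as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["section_name", "snippet"])
    for doc in docs:
        # Collapse tabs and newlines inside the snippet so that
        # each article stays on a single, well-formed row.
        snippet = " ".join(doc["snippet"].split())
        writer.writerow([doc["section_name"], snippet])
    return buf.getvalue()


tsv_text = to_tsv(decoded["response"]["docs"])
print(tsv_text)
```

Collapsing whitespace inside each snippet is a design choice for this sketch: a stray tab or newline in the text would otherwise shift columns or split a row when the file is read back.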

Use your code to download the 1000 most recent articles from the Business and World sections of the New York Times.
Then use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
- The provided code reads in each file and uses tools from the tm package (specifically VectorSource, Corpus, and DocumentTermMatrix) to parse the article collection into a sparseMatrix, where each row corresponds to one article and each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named snippet in your TSV files!)
- Create an 80% train / 20% test split of the data and use cv.glmnet to find a best-fit logistic regression model to predict section_name from snippet
- Plot the cross-validation curve from cv.glmnet
- Report the accuracy and AUC on the test data and use the ROCR package to plot the ROC curve for the test data
- Look at the most informative words for each section by examining the 10 words with the largest weights and the 10 words with the smallest weights in the fitted model
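
The assignment itself does these steps in R with tm, glmnet, and ROCR. Purely to show what some of the pieces compute, here is a plain-Python sketch of a binary document-term representation, an 80/20 split, and the accuracy and AUC metrics. All function names are hypothetical, the tokenizer is a naive whitespace split (tm does more), and no model is fit here; the scores in the usage note are made up.

```python
import random


def binary_doc_term(snippets):
    """Return (vocab, rows): rows[i] is the set of word columns present in snippet i."""
    vocab = {}
    rows = []
    for text in snippets:
        cols = set()
        for word in text.lower().split():
            # Assign the next free column index the first time a word is seen.
            cols.add(vocab.setdefault(word, len(vocab)))
        rows.append(cols)
    return vocab, rows


def train_test_split(n, train_frac=0.8, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

For example, with labels `[1, 1, 0, 0]` and predicted probabilities `[0.9, 0.4, 0.6, 0.1]`, `accuracy` gives 0.5 at the default threshold while `auc` gives 0.75. Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives. The pairwise AUC loop is quadratic in the number of examples, which is fine for a few hundred test articles.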