Use the API console to figure out how to query the API by section (hint: set the fq parameter to section_name:business to get articles from the Business section, for instance), sorted from newest to oldest articles
Once you've figured out the query you want to run, translate it into working Python code
Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with section_name, web_url, pub_date, and snippet as columns (hint: use the codecs package to deal with unicode issues if you run into them)
You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
Finally, run your code to get articles from the Business and World sections of the NYT.
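The steps above can be sketched roughly as follows. This is a minimal sketch, not a reference solution: the endpoint URL, the `api-key`, `sort` and `page` parameter names, and the `response.docs` layout are assumptions about the Article Search API that you should confirm in the API console, and all function names (`build_params`, `fetch_page`, `collect_articles`, `write_tsv`, `main`) are made up for illustration. In Python 3, opening the file with `encoding="utf-8"` covers the unicode issues the hint mentions, so `codecs` is not used here.

```python
import argparse
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and response layout; verify both in the API console.
API_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
COLUMNS = ["section_name", "web_url", "pub_date", "snippet"]


def build_params(api_key, section, page):
    """Query parameters for one page of a section, newest articles first."""
    return {
        "api-key": api_key,
        "fq": "section_name:%s" % section,  # same form as the hint above
        "sort": "newest",
        "page": page,
    }


def fetch_page(api_key, section, page):
    """Fetch one page of results and return its list of article dicts."""
    query = urllib.parse.urlencode(build_params(api_key, section, page))
    with urllib.request.urlopen(API_URL + "?" + query) as resp:
        payload = json.load(resp)
    return payload["response"]["docs"]


def clean(value):
    """Collapse newlines, tabs and repeated spaces so a field stays on one line."""
    return " ".join(str(value or "").split())


def collect_articles(api_key, section, num_articles, fetch=fetch_page, pause=0.0):
    """Loop over result pages until num_articles rows have been collected."""
    rows, page = [], 0
    while len(rows) < num_articles:
        docs = fetch(api_key, section, page)
        if not docs:  # no more results available
            break
        for doc in docs:
            rows.append([clean(doc.get(col)) for col in COLUMNS])
        page += 1
        time.sleep(pause)  # optional delay between requests
    return rows[:num_articles]


def write_tsv(rows, path):
    """Write a header plus one tab-delimited line per article, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\t".join(COLUMNS) + "\n")
        for row in rows:
            f.write("\t".join(row) + "\n")


def main(argv=None):
    """Command line entry point: API key, section name, number of articles."""
    parser = argparse.ArgumentParser(description="Download NYT articles by section")
    parser.add_argument("api_key")
    parser.add_argument("section")
    parser.add_argument("num_articles", type=int)
    args = parser.parse_args(argv)
    # The 1-second pause is an arbitrary choice, not a documented rate limit.
    rows = collect_articles(args.api_key, args.section, args.num_articles, pause=1.0)
    write_tsv(rows, args.section.lower() + ".tsv")
```

To use this as a script, call `main()` under the usual `if __name__ == "__main__":` guard and run it once per section. Passing the page fetcher in as the `fetch` argument keeps the paging loop testable without network access.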
Day 5
Use your code to download the 1000 most recent articles from each of the Business and World sections of the New York Times.
Then use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
The provided code reads in each file and uses tools from the tm package---specifically VectorSource, Corpus, and DocumentTermMatrix---to parse the article collection into a sparseMatrix, where each row corresponds to one article and each column to one word, and a non-zero entry indicates that an article contains that word (note: this assumes that there's a column named snippet in your tsv files!)
Create an 80% train / 20% test split of the data and use cv.glmnet to find a best-fit logistic regression model to predict section_name from snippet
Plot the cross-validation curve from cv.glmnet
Quote the accuracy and AUC on the test data and use the ROCR package to provide a plot of the ROC curve for the test data
Look at the most informative words for each section by examining the words with the top 10 largest and smallest weights from the fitted model
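To make the quantities in these steps concrete, here is a small illustration in plain Python. It is not the tm, glmnet, or ROCR code the assignment asks for, and it does not fit a model; it only shows, under simplifying assumptions (whitespace tokenization, a 0.5 decision threshold, a fixed random seed), what a binary document-term representation, an 80/20 split, accuracy, and AUC amount to. All function names are hypothetical.

```python
import random


def document_term_sets(snippets):
    """Binary bag of words: a word-to-column vocabulary plus, for each
    snippet, the set of column indices whose words it contains."""
    vocab, rows = {}, []
    for text in snippets:
        present = set()
        for word in text.lower().split():
            present.add(vocab.setdefault(word, len(vocab)))
        rows.append(present)
    return vocab, rows


def train_test_split(n_items, train_fraction=0.8, seed=42):
    """Shuffle the row indices and cut them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_items))
    return indices[:cut], indices[cut:]


def accuracy(labels, scores, threshold=0.5):
    """Fraction of items whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative,
    counting ties as one half (pairwise definition of AUC)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with labels `[1, 1, 0, 0]` and predicted scores `[0.9, 0.4, 0.6, 0.1]`, three of the four positive-negative pairs are ranked correctly, so `auc` returns 0.75, while `accuracy` at the 0.5 threshold returns 0.5. The pairwise loop is quadratic in the number of test articles, which is fine for a few hundred rows but is only meant to show the definition.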