Day 1

Day 2

Day 3

Day 4

  • Write Python code to download the 1000 most recent articles from the New York Times (NYT) API by section of the newspaper:
    • Register for an API key for the Article Search API
    • Use the API console to figure out how to query the API by section (hint: set the fq parameter to section_name:business to get articles from the Business section, for instance), sorted from newest to oldest articles
    • Once you've figured out the query you want to run, translate it into working Python code
    • Your code should take an API key, section name, and number of articles as command line arguments, and write out a tab-delimited file where each article is in a separate row, with section_name, web_url, pub_date, and snippet as columns (hint: use the codecs package to deal with unicode issues if you run into them)
    • You'll have to loop over pages of API results until you have enough articles, and you'll want to remove any newlines from article snippets to keep each article on one line
    • Finally, run your code to get articles from the Business and World sections of the NYT.
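
The steps above could be sketched roughly as follows. This is one possible outline, not a reference solution, and it rests on several assumptions you should check in the API console: the endpoint URL, the `sort`, `page`, and `api-key` parameter names, and the `response` → `docs` layout of the JSON are all assumed here. The `--out` and `--pause` options, the header row, and the helper names are made up for illustration. It uses Python 3's `open(..., encoding="utf-8")` in place of the `codecs` package.

```python
import argparse
import json
import sys
import time
import urllib.parse
import urllib.request

# Assumed endpoint; confirm it in the API console before relying on it.
API_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
FIELDS = ["section_name", "web_url", "pub_date", "snippet"]


def build_url(api_key, section, page):
    """Build one query URL: filter by section, newest first, given page."""
    params = {
        "fq": f"section_name:{section}",  # filter form taken from the hint above
        "sort": "newest",                 # assumed parameter name and value
        "page": page,                     # assumed zero-based page index
        "api-key": api_key,               # assumed parameter name
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def clean(value):
    """Collapse newlines, tabs, and repeated spaces so a field stays on one line."""
    return " ".join(str(value or "").split())


def fetch_page(api_key, section, page):
    """Download one page of results (assumes a response -> docs JSON layout)."""
    with urllib.request.urlopen(build_url(api_key, section, page)) as resp:
        return json.load(resp)["response"]["docs"]


def collect(api_key, section, n_articles, fetch=fetch_page, pause=0.0):
    """Loop over result pages until n_articles rows are gathered or results run out."""
    rows, page = [], 0
    while len(rows) < n_articles:
        docs = fetch(api_key, section, page)
        if not docs:
            break
        rows.extend([clean(doc.get(field)) for field in FIELDS] for doc in docs)
        page += 1
        time.sleep(pause)  # be polite: the API may rate-limit rapid requests
    return rows[:n_articles]


def write_tsv(rows, out):
    """Write a header line, then one tab-delimited line per article."""
    out.write("\t".join(FIELDS) + "\n")
    for row in rows:
        out.write("\t".join(row) + "\n")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Download NYT articles by section")
    parser.add_argument("api_key")
    parser.add_argument("section")
    parser.add_argument("n_articles", type=int)
    parser.add_argument("--out", help="output path (default: <section>.tsv)")
    parser.add_argument("--pause", type=float, default=1.0,
                        help="seconds to wait between requests (arbitrary default)")
    args = parser.parse_args(argv)
    rows = collect(args.api_key, args.section, args.n_articles, pause=args.pause)
    with open(args.out or f"{args.section}.tsv", "w", encoding="utf-8") as out:
        write_tsv(rows, out)


# Only start the command-line interface when arguments were actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

If this were saved as `get_nyt_articles.py` (a hypothetical file name), it would be run as `python get_nyt_articles.py YOUR_API_KEY business 1000` and again with `world`. Passing `fetch` as a parameter keeps the paging loop testable without network access.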

Day 5

  • Use your code to download the 1000 most recent articles from the Business and World sections of the New York Times.
  • Then use the code in classify_nyt_articles.R to read the data into R and fit a logistic regression to predict which section an article belongs to based on the words in its snippet
    • The provided code reads in each file and uses tools from the tm package (specifically VectorSource, Corpus, and DocumentTermMatrix) to parse the article collection into a sparseMatrix, where each row corresponds to one article, each column to one word, and a non-zero entry indicates that the article contains that word (note: this assumes that there's a column named snippet in your tsv files!)
    • Create an 80% train / 20% test split of the data and use cv.glmnet to find a best-fit logistic regression model to predict section_name from snippet
    • Plot the cross-validation curve from cv.glmnet
    • Quote the accuracy and AUC on the test data and use the ROCR package to provide a plot of the ROC curve for the test data
    • Look at the most informative words for each section by examining the 10 words with the largest weights and the 10 words with the smallest weights in the fitted model
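
To make the ideas in this list concrete, here is a small Python sketch of three of them: a binary document-term matrix, an 80/20 split, and accuracy and AUC computed from predicted probabilities. It is an illustration only and not a substitute for the R workflow with tm, glmnet, and ROCR. The function names, the simple letters-only tokenizer, the fixed seed, and the 0.5 threshold are all assumptions made for the example; tm's tokenization differs in detail.

```python
import random
import re


def tokenize(snippet):
    """Lowercase the text and keep runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", snippet.lower())


def document_term_matrix(snippets):
    """Return (vocab, rows) for a sparse binary matrix.

    vocab maps each word to a column index; rows[i] is the set of column
    indices whose entry is non-zero for article i.
    """
    vocab, rows = {}, []
    for snippet in snippets:
        cols = set()
        for word in tokenize(snippet):
            cols.add(vocab.setdefault(word, len(vocab)))
        rows.append(cols)
    return vocab, rows


def train_test_split(n_rows, train_frac=0.8, seed=42):
    """Shuffle row indices and cut them into train and test index lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * n_rows)
    return idx[:cut], idx[cut:]


def accuracy(labels, scores, threshold=0.5):
    """Fraction of articles whose thresholded score matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; this pairwise form is quadratic in the
    number of articles, which is fine for a few hundred test rows.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `document_term_matrix(["Stocks rose on Wall Street", "Talks stall; stocks fall"])` gives both articles a non-zero entry in the column for "stocks", and `auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` counts three of the four positive/negative pairs as correctly ordered.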