Welcome to the Mizzou/IRE course on building a web scraper, being held in Columbia from Oct. 10-13, 2013.
Although the stated goal of this course is to introduce the concepts of web scraping, we will also spend time covering programming fundamentals that can be applied to other problems, from data analysis to web development.
The course will begin Thursday, Oct. 10, and end on Sunday, Oct. 13. It will be held in the Lambert Room, which is room 200 in the Reynolds Journalism Institute building on 9th Street.
The format of the course is subject to change, but the rough schedule looks like this:
Thursday, Oct. 10, 5 - 8 p.m.: Introductions, computer setup and the basics of command line navigation.
Friday, Oct. 11, 9 a.m. - 5 p.m.: More command line basics; Python programming basics; a review of HTML/CSS and JavaScript in the context of web scraping; and building our first web scraper.
Saturday, Oct. 12, 9 a.m. - 5 p.m.: Build a web scraper on your own (!) and learn about more sophisticated scraping techniques, such as manipulating forms.
Sunday, Oct. 13, 9 a.m. - Noon: Questions, review and wrap-up.
This course will be taught primarily using the Python programming language. In addition, we'll be using two open source Python modules that greatly simplify the web scraping process: BeautifulSoup, which makes it easy to parse and sort through HTML files, and mechanize, which allows you to emulate a web browser from within your Python programs.
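To give a rough sense of what "parsing and sorting through HTML" means, here is a minimal sketch that pulls every link out of a small page. It deliberately uses only Python's built-in `html.parser` module rather than BeautifulSoup, so it runs without installing anything; BeautifulSoup wraps this kind of work in a friendlier interface. The sample HTML and the names `LinkCollector` and `extract_links` are made up for illustration and are not part of the course materials.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Sample listing</h1>
  <a href="/reports/1">First report</a>
  <a href="/reports/2">Second report</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

A real scraper would first download the page and then hand the HTML to a parser in much the same way.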
You will need a place to write and edit code. If you don't already have a code editor, we recommend trying Sublime Text.
In addition to Python, we'll also be making use of the Chrome web browser. Although it isn't required, we'd also recommend you check out git, a version control tool, so you can download the course materials after you leave.
No worries if you don't have this software already installed. We'll help you set up everything on Thursday evening.
This course will be taught primarily by Chase Davis, of The New York Times; Presidential Innovation Fellow Jackie Kazil, formerly of The Washington Post; and Matt Wynn, of the Omaha World-Herald. If you have any questions, you can reach us here:
- Chase: chase.davis@gmail.com
- Jackie: jackiekazil@gmail.com
- Matt: matt.wynn@gmail.com