This code crawls the Catalog of Government Publications (CGP) using a protocol called Z39.50 and saves the results as XML files.
The Catalog of Government Publications is a collection of all federal publications, administered by the Government Printing Office. The CGP includes descriptive records and, for publications that are available online, links to them.
There is a web search interface for the CGP. However, to our knowledge, there is no public way to access the CGP in bulk. As a result, the public cannot run queries against the data unless they are built into the CGP web search interface.
That's why we (the Sunlight Foundation) use this code to crawl the CGP and share the resulting CGP bulk data publicly.
To run this code, you will need credentials for the CGP. Create a file called config/access.yml based on access.yml.example.
Practically speaking, most people probably will not have direct access to the CGP. This is why we share the results of our crawl at http://sunlightlabs.com/cgp-data.
This crawler depends on Ruby ZOOM, a Ruby binding to YAZ, a programming toolkit that supports writing Z39.50 clients and servers. (ZOOM stands for Z39.50 Object-Orientation Model.)
For Mac OS X, I recommend installing YAZ with Homebrew:
brew install yaz
For Linux distributions that use APT (such as Debian or Ubuntu), I recommend installing YAZ with apt-get:
apt-get install libyaz-dev
Run bundler to make sure your gem dependencies are in order:
bundle
Create a file called config/config.yml based on config.yml.example. The defaults should work just fine.
You might want to adjust the delay value, which controls the delay (in seconds) between requests to the GPO's Z39.50 CGP server.
If you have to stop the crawl midway and want to restart where you left off, you will find the start_at setting useful.
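To illustrate how these two settings might interact, here is a minimal sketch of a crawl loop. It is not the crawler's actual code. Only the names delay and start_at come from the configuration described above; the flat YAML layout, the idea that start_at is an integer position, the crawl_range helper, and the block that stands in for a record request are all assumptions made for this example.

```ruby
require 'yaml'

# Hypothetical config contents; the real config.yml may be laid out differently.
example_config = <<~YAML
  delay: 0.1
  start_at: 3
YAML

# Visits positions start_at..last, pausing `delay` seconds between requests.
# The block stands in for whatever actually requests a record (hypothetical).
def crawl_range(config, last)
  start = config.fetch('start_at', 1)
  delay = config.fetch('delay', 1)
  (start..last).each_with_index do |position, index|
    sleep(delay) unless index.zero? # no wait needed before the first request
    yield position
  end
end

config = YAML.safe_load(example_config)
crawl_range(config, 5) { |position| puts "would request record #{position}" }
```

Because the loop begins at start_at rather than at the first record, an interrupted run can be resumed by setting start_at to the last position that was reached.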
To start crawling:
rake crawl
Please note that this process will run continuously. Assuming 700,000 documents and a 1-second delay between record requests, it takes approximately 8 days to pull down the entire set of records from the CGP.
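The estimate follows from simple arithmetic, sketched here. The crawl_days helper is a name made up for this example, and the calculation counts only the delay between requests, not the time the server takes to respond.

```ruby
# Estimated crawl duration in days, counting only the delay between requests.
def crawl_days(record_count, delay_seconds)
  (record_count * delay_seconds) / 86_400.0 # 86,400 seconds per day
end

puts crawl_days(700_000, 1).round(1) # prints 8.1
```

Doubling the delay doubles the estimate, so a 2-second delay would take roughly 16 days for the same number of records.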
The crawl does not overwrite files; instead, it keeps a history of all records that it has seen.
This crawler stores the resulting documents as XML files on the filesystem. The filenames are a combination of the CGP system number and the revision number. For example:
  system number  revision
             |      |
             v      v
/000/111/000111222-000.xml
/000/111/000111222-001.xml
/000/111/000111222-002.xml
These files are grouped into folders in order to reduce the number of files per directory.
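The naming scheme can be sketched roughly as follows. This is an illustration inferred from the example filenames above, not the crawler's own code: the record_path and save_revision helpers are hypothetical, the 9-digit and 3-digit zero padding is read off the examples, and the sketch always writes a new revision rather than checking whether the record has changed.

```ruby
require 'fileutils'
require 'tmpdir'

# Builds a relative path such as "000/111/000111222-000.xml".
# The padding widths are inferred from the example filenames.
def record_path(system_number, revision)
  id = format('%09d', system_number)
  File.join(id[0, 3], id[3, 3], format('%s-%03d.xml', id, revision))
end

# Writes xml under root as the next unused revision, never overwriting.
def save_revision(root, system_number, xml)
  revision = 0
  revision += 1 while File.exist?(File.join(root, record_path(system_number, revision)))
  path = File.join(root, record_path(system_number, revision))
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, xml)
  path
end

puts record_path(111_222, 0) # prints 000/111/000111222-000.xml
```

Splitting on the first six digits of the system number means that no single directory holds more than 1,000 distinct records under this scheme.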
Ruby ZOOM converts the original CGP records to XML:
- The original records are in the MARC 21 format and encoded as MARC-8.
- The resulting XML files are formatted as MARCXML and encoded as UTF-8.
Please bear with us; with all of these conversions, don't be surprised if some things go strangely wrong.
Please refer to the MARC documentation to interpret the fields in the XML output files. In particular, I recommend the MARC 21 Format for Holdings Data Documentation.
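As a starting point for reading the output, here is a small sketch that pulls subfields out of a MARCXML record using REXML. The sample record is made up for illustration and is not real CGP data, and the subfields helper is a hypothetical name. The example reads field 245, subfield a, which the MARC 21 bibliographic format defines as the title.

```ruby
require 'rexml/document'

# Illustrative record only; not real CGP data.
sample = <<~XML
  <record xmlns="http://www.loc.gov/MARC21/slim">
    <controlfield tag="001">000111222</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Example title /</subfield>
      <subfield code="c">Example agency.</subfield>
    </datafield>
  </record>
XML

# Returns the text of every matching subfield in a MARCXML record string.
def subfields(xml, tag, code)
  values = []
  REXML::Document.new(xml).root.each_element do |field|
    next unless field.name == 'datafield' && field.attributes['tag'] == tag
    field.each_element do |sub|
      values << sub.text if sub.name == 'subfield' && sub.attributes['code'] == code
    end
  end
  values
end

puts subfields(sample, '245', 'a') # prints Example title /
```

The helper compares element names directly instead of using namespace-aware XPath, which keeps the sketch short; a more careful reader of these files might prefer a dedicated MARC library.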
Now what do you do with these hundreds of thousands of government documents? We hope you explore them and let us know.
This is a project of the Sunlight Labs, the technical arm of the Sunlight Foundation. If you would like to discuss the project or anything related to government transparency, please contact us on our mailing list.
This code was written by David James of the Sunlight Labs. Ed Summers offered help (and consolation) regarding Z39.50 and MARC. The idea for putting the CGP bulk data online originated from John Wonderlich and Daniel Schuman, both on Sunlight's policy team.