Catalog of Government Publications Crawler

This code crawls the Catalog of Government Publications (CGP) using a protocol called Z39.50 and saves the results as XML files.

Background

The Catalog of Government Publications is a collection of all federal publications, administered by the Government Printing Office. The CGP includes descriptive records and links to the publications that are available online.

There is a web search interface for the CGP. However, to our knowledge, there is no public way to access the CGP in bulk. As a result, the public cannot run queries against the data unless they are built into the CGP web search interface.

That's why we (the Sunlight Foundation) use this code to crawl the CGP and share the resulting CGP bulk data publicly.

Access Credentials

To run this code, you will need credentials for the CGP. Create a file called config/access.yml based on access.yml.example.

Practically speaking, most people probably will not have direct access to the CGP. This is why we share the results of our crawl at http://sunlightlabs.com/cgp-data.

Software Dependencies

This crawler depends on Ruby ZOOM, a Ruby binding to YAZ, a programming toolkit that supports writing Z39.50 clients and servers. (ZOOM stands for Z39.50 Object-Orientation Model.)

For Mac OS X, I recommend installing YAZ with Homebrew:

brew install yaz

For Linux distributions that use APT (such as Debian or Ubuntu), I recommend installing YAZ with apt-get:

apt-get install libyaz-dev

Run Bundler to install the gem dependencies:

bundle

Configuration

Create a file called config/config.yml based on config.yml.example. The defaults should work just fine.

You might want to adjust the delay value, which controls the delay (in seconds) between requests to the GPO's Z39.50 CGP server.

If you have to stop the crawl midway and want to restart where you left off, you will find the start_at setting useful.
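As a rough illustration of how these two settings might be read, here is a minimal sketch. Only the key names delay and start_at come from this README; the default values, the flat file layout, and the load_crawl_config helper are assumptions, so check config.yml.example for the real structure.

```ruby
require "yaml"

# Hypothetical defaults; the real config.yml.example may use other values.
DEFAULTS = { "delay" => 1.0, "start_at" => 1 }

# Parse YAML text and fill in any missing settings with the defaults.
def load_crawl_config(yaml_text)
  settings = YAML.safe_load(yaml_text) || {}
  DEFAULTS.merge(settings)
end

# Example file contents (invented values, assumed flat layout).
example = <<~YAML
  delay: 2.5
  start_at: 150000
YAML

config = load_crawl_config(example)
puts "Waiting #{config['delay']}s between requests, starting at #{config['start_at']}"
```

A larger delay is gentler on the GPO server but lengthens the crawl proportionally.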

Running

To start crawling:

rake crawl

Please note that this process will run continuously. Assuming 700,000 documents and a 1 second delay time between record requests, it takes approximately 8 days to pull down the entire set of records from the CGP.
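The estimate above is simple arithmetic: 700,000 records at 1 second each is 700,000 seconds, and a day has 86,400 seconds. The sketch below reproduces it so you can try other delay values; crawl_days is a hypothetical helper, and it ignores the time each request itself takes.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Days spent waiting between requests for a crawl of record_count records.
# This ignores the time each request itself takes, so it is a lower bound.
def crawl_days(record_count, delay_seconds)
  (record_count * delay_seconds) / SECONDS_PER_DAY.to_f
end

puts crawl_days(700_000, 1).round(1)   # 8.1
puts crawl_days(700_000, 0.5).round(1) # 4.1
```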

The crawl does not overwrite files; instead, it keeps a history of all records that it has seen.

Resulting Files

This crawler stores the resulting documents as XML files on the filesystem. The filenames are a combination of the CGP system number and the revision number. For example:

 system number      revision
             |      |
             v      v
/000/111/000111222-000.xml
/000/111/000111222-001.xml
/000/111/000111222-002.xml

These files are grouped into folders in order to reduce the number of files per directory.
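To make the naming scheme concrete, the following sketch builds a path from a system number and a revision number. It is inferred from the example filenames above, not taken from the crawler's source: the record_path helper, the 9-digit zero-padded system number, and the 3-digit revision are all assumptions.

```ruby
# Build the relative path for a record, assuming a 9-digit zero-padded
# system number and a 3-digit zero-padded revision (inferred from the
# README's example filenames, not from the crawler's code).
def record_path(system_number, revision)
  id = system_number.to_s.rjust(9, "0")
  filename = "#{id}-#{revision.to_s.rjust(3, '0')}.xml"
  # The first two groups of three digits become the folder names.
  File.join("/", id[0, 3], id[3, 3], filename)
end

puts record_path(111222, 0) # /000/111/000111222-000.xml
puts record_path(111222, 2) # /000/111/000111222-002.xml
```

With this layout, a single folder holds at most 1,000 distinct system numbers (plus their revisions).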

Formats

Ruby ZOOM converts the original CGP records to XML.

  • The original records are in the MARC 21 format and encoded as MARC-8.

  • The resulting XML files are formatted as MARCXML and encoded as UTF-8.

Please bear with us; with all of these conversions, don't be surprised if some things go strangely wrong.

Please refer to the MARC documentation to interpret the fields in the XML output files. In particular, I recommend the MARC 21 Format for Holdings Data Documentation.
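As a starting point for reading the output, here is a minimal sketch that pulls subfield values out of a MARCXML record with REXML. The sample record is invented for illustration and is not real CGP data; the subfield_values helper is hypothetical; and the sketch assumes the files declare the standard MARCXML namespace (http://www.loc.gov/MARC21/slim), which you should confirm against an actual output file.

```ruby
require "rexml/document"

# Standard MARCXML namespace; assumed to be present in the crawler's output.
MARC_NS = { "marc" => "http://www.loc.gov/MARC21/slim" }

# Return the text of every subfield with the given code inside every
# datafield with the given tag.
def subfield_values(xml_text, tag, code)
  doc = REXML::Document.new(xml_text)
  path = "//marc:datafield[@tag='#{tag}']/marc:subfield[@code='#{code}']"
  REXML::XPath.match(doc, path, MARC_NS).map { |node| node.text.to_s.strip }
end

# Invented sample record, for illustration only.
SAMPLE = <<~XML
  <record xmlns="http://www.loc.gov/MARC21/slim">
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">An invented title for illustration</subfield>
    </datafield>
  </record>
XML

# In MARC 21 bibliographic records, field 245 subfield a holds the title.
puts subfield_values(SAMPLE, "245", "a").inspect
```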

Uses and Interpretation

Now what do you do with these hundreds of thousands of government documents? We hope you explore them and let us know.

Community

This is a project of the Sunlight Labs, the technical arm of the Sunlight Foundation. If you would like to discuss the project or anything related to government transparency, please contact us on our mailing list.

This code was written by David James of the Sunlight Labs. Ed Summers offered help (and consolation) regarding Z39.50 and MARC. The idea for putting the CGP bulk data online originated from John Wonderlich and Daniel Schuman, both on Sunlight's policy team.

License

BSD-3-Clause. See the LICENSE and LICENSE.markdown files.
