Catalog of Government Publications Crawler

This code crawls the Catalog of Government Publications (CGP) using a protocol called Z39.50 and saves the results as XML files.

Background

The Catalog of Government Publications is a collection of all federal publications, administered by the Government Printing Office. The CGP includes descriptive records for these publications and links to the ones that are available online.

There is a web search interface for the CGP. However, to our knowledge, there is no public way to access the CGP in bulk. As a result, the public cannot run queries against the data unless they are built into the CGP web search interface.

That's why we (the Sunlight Foundation) use this code to crawl the CGP and share the resulting CGP bulk data publicly.

Access Credentials

To run this code, you will need credentials to access the CGP. Create a file called config/access.yml based on access.yml.example.

Practically speaking, most people probably will not have direct access to the CGP. This is why we share the results of our crawl at http://sunlightlabs.com/cgp-data.

Software Dependencies

This crawler depends on Ruby ZOOM, a Ruby binding to YAZ, a programming toolkit that supports writing Z39.50 clients and servers. (ZOOM stands for Z39.50 Object-Orientation Model.)

For Mac OS X, I recommend installing YAZ with Homebrew:

brew install yaz

For Linux (Debian or Ubuntu), I recommend installing YAZ using APT:

apt-get install libyaz-dev

Run bundler to make sure your gem dependencies are in order:

bundle

Configuration

Create a file called config/config.yml based on config.yml.example. The defaults should work just fine.

You might want to adjust the delay value, which controls the delay (in seconds) between requests to the GPO's Z39.50 CGP server.

If you have to stop the crawl midway and want to restart where you left off, you will find the start_at setting useful.
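As a rough illustration of how these two settings might interact, here is a minimal Ruby sketch. It is not the crawler's actual code: the `crawl` method, the `sleeper` parameter, the block standing in for a Z39.50 request, and the inline YAML values are all hypothetical. Only the setting names `delay` and `start_at` come from the configuration described above, and treating `start_at` as a record number is an assumption.

```ruby
require "yaml"

# Hypothetical settings; the real values live in config/config.yml.
SAMPLE_CONFIG = <<~YAML
  delay: 1
  start_at: 3
YAML

# Walks record numbers from start_at up to `last`, pausing `delay` seconds
# after each request. The block stands in for the real Z39.50 request.
def crawl(config, last, sleeper: ->(seconds) { sleep(seconds) })
  fetched = []
  (config["start_at"]..last).each do |number|
    fetched << yield(number)
    sleeper.call(config["delay"])
  end
  fetched
end

config = YAML.safe_load(SAMPLE_CONFIG)
# A no-op sleeper is passed here so the example finishes instantly.
result = crawl(config, 5, sleeper: ->(_seconds) {}) { |number| "record #{number}" }
puts result.inspect
```

Raising `start_at` skips everything before that point, which is what makes resuming an interrupted crawl cheap.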

Running

To start crawling:

rake crawl

Please note that this process will run continuously. Assuming 700,000 documents and a 1 second delay time between record requests, it takes approximately 8 days to pull down the entire set of records from the CGP.
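The estimate above is simple arithmetic, sketched here in Ruby so you can plug in your own delay. The `crawl_days` helper is a hypothetical name, and the calculation ignores the time each request itself takes, so treat the result as a lower bound.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Estimated crawl duration in days for a given record count and
# per-request delay, counting only the delay time.
def crawl_days(records, delay_seconds)
  (records * delay_seconds).to_f / SECONDS_PER_DAY
end

puts format("%.1f days", crawl_days(700_000, 1))
```

With 700,000 records and a 1 second delay this comes to roughly 8.1 days; halving the delay halves the estimate.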

The crawl does not overwrite files; instead, it keeps a history of all records that it has seen.

Resulting Files

This crawler stores the resulting documents as XML files on the filesystem. The filenames are a combination of the CGP system number and the revision number. For example:

 system number      revision
             |      |
             v      v
/000/111/000111222-000.xml
/000/111/000111222-001.xml
/000/111/000111222-002.xml

These files are grouped into folders in order to reduce the number of files per directory.
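If you need to locate a record in the output, the layout above can be reproduced with a short Ruby sketch. The `record_path` helper is hypothetical and not part of the crawler; the padding widths (9 digits for the system number, 3 for the revision) and the two 3-digit folder levels are inferred from the example filenames.

```ruby
# Builds the path for a record, mirroring the layout shown above.
# Padding widths are inferred from the example filenames.
def record_path(system_number, revision)
  id = format("%09d", system_number)
  rev = format("%03d", revision)
  # The first two groups of three digits become the folder names.
  File.join("/", id[0, 3], id[3, 3], "#{id}-#{rev}.xml")
end

puts record_path(111_222, 2)
```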

Formats

Ruby ZOOM converts the original CGP records to XML.

The original records are in the MARC 21 format and encoded as MARC-8.
The resulting XML files are formatted as MARCXML and encoded as UTF-8.

Please bear with us; with all of these conversions, don't be surprised if some things go strangely wrong.

Please refer to the MARC documentation to interpret the fields in the XML output files. In particular, I recommend the MARC 21 Format for Holdings Data Documentation.
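To give a feel for working with the output, here is a minimal Ruby sketch that pulls subfield values out of a MARCXML record using REXML. The sample record is made up for illustration, the `subfield_values` helper is hypothetical, and the sketch assumes each file has a single `record` element as its root, which may not match the crawler's actual output.

```ruby
require "rexml/document"

# A made-up, minimal MARCXML record for illustration only.
SAMPLE_MARCXML = <<~XML
  <record xmlns="http://www.loc.gov/MARC21/slim">
    <controlfield tag="001">000111222</controlfield>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Sample title</subfield>
    </datafield>
  </record>
XML

# Returns the text of every subfield matching the given tag and code.
def subfield_values(xml, tag, code)
  record = REXML::Document.new(xml).root
  values = []
  record.each_element do |field|
    next unless field.name == "datafield" && field.attributes["tag"] == tag
    field.each_element do |subfield|
      values << subfield.text if subfield.attributes["code"] == code
    end
  end
  values
end

puts subfield_values(SAMPLE_MARCXML, "245", "a").inspect
```

In MARC 21 bibliographic records, field 245 subfield a carries the title; other tags and codes can be looked up in the MARC documentation.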

Uses and Interpretation

Now what do you do with these hundreds of thousands of government documents? We hope you explore them and let us know.

Community

This is a project of Sunlight Labs, the technical arm of the Sunlight Foundation. If you would like to discuss the project or anything related to government transparency, please contact us on our mailing list.

This code was written by David James of the Sunlight Labs. Ed Summers offered help (and consolation) regarding Z39.50 and MARC. The idea for putting the CGP bulk data online originated from John Wonderlich and Daniel Schuman, both on Sunlight's policy team.
