Adjust Govscape To Handle Multiple Crawls Per PDF #38

@kylebd99

Description

Currently, govscape contains only the 2020 crawl, but we want to expand it to include the 2008, 2012, 2016, and 2024 crawls. Additionally, a single PDF can be hosted by multiple websites, and we want to expose this information. To do this, we will need to change a number of things in the code base.

On the front end:

  • The PDFPreview interface needs to be updated to display every crawl in which a PDF was captured. This would also let users see all of the different locations from which the same PDF was scraped.

On the back end:

  • The SearchResult schema in the search API needs to be updated to include an iterable of all crawl instances. Each instance would have at least the crawl_date, crawl_url, and sub-domain.
  • The search function in server.py needs to get all crawls for each result and populate the iterable with their information.
  • The metadata index needs to expose a function that returns all cdx lines for a particular PDF.
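A minimal sketch of how the backend pieces above could fit together, using plain dataclasses. The names `CrawlInstance`, `cdx_line_to_crawl`, `populate_crawls`, and the `get_all_cdx_lines` method on the metadata index are hypothetical, as is the shape of a parsed cdx line; the real schema and index API may differ.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlInstance:
    """One capture of a PDF: when, where, and under which sub-domain."""
    crawl_date: str   # e.g. "2020"
    crawl_url: str
    sub_domain: str

@dataclass
class SearchResult:
    """Search result extended with an iterable of all crawl instances."""
    pdf_id: str
    crawls: list[CrawlInstance] = field(default_factory=list)

def cdx_line_to_crawl(line: dict) -> CrawlInstance:
    # Assumes a parsed cdx line exposes timestamp, url, and host fields.
    return CrawlInstance(
        crawl_date=line["timestamp"],
        crawl_url=line["url"],
        sub_domain=line["host"],
    )

def populate_crawls(result: SearchResult, metadata_index) -> SearchResult:
    # get_all_cdx_lines is the hypothetical new metadata-index function
    # that returns all cdx lines for a particular PDF.
    result.crawls = [
        cdx_line_to_crawl(line)
        for line in metadata_index.get_all_cdx_lines(result.pdf_id)
    ]
    return result
```

The search function in server.py would then call `populate_crawls` on each result before serializing the response, so the front end receives every crawl for every PDF in one pass.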
