Adjust Govscape To Handle Multiple Crawls Per PDF #38

@kylebd99

Description

Currently, govscape contains only the 2020 crawl, but we want to expand it to include the 2008, 2012, 2016, and 2024 crawls. Additionally, a single PDF can be hosted by multiple websites, and we want to expose this information. To do this, we will need to change a number of things in the code base.

On the front end:

  • The PDFPreview interface needs to be updated to display every crawl in which a PDF was captured. This would also let users see all of the different locations from which the same PDF was scraped.

On the back end:

  • The SearchResult schema in the search API needs to be updated to include an iterable of all crawl instances. Each instance would have at least the crawl_date, crawl_url, and sub-domain.
  • The search function in server.py needs to get all crawls for each result and populate the iterable with their information.
  • The metadata index needs to expose a function that returns all cdx lines for a particular PDF.
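A minimal sketch of how the backend pieces above could fit together, using plain dataclasses. The names `CrawlInstance`, `cdx_line_to_crawl`, `populate_crawls`, and the `get_all_cdx_lines` method on the metadata index are hypothetical, as is the shape of a parsed cdx line; the real schema and index API may differ.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlInstance:
    """One capture of a PDF: when, where, and under which sub-domain."""
    crawl_date: str   # e.g. "2020"
    crawl_url: str
    sub_domain: str

@dataclass
class SearchResult:
    """Search result extended with an iterable of all crawl instances."""
    pdf_id: str
    crawls: list[CrawlInstance] = field(default_factory=list)

def cdx_line_to_crawl(line: dict) -> CrawlInstance:
    # Assumes a parsed cdx line exposes timestamp, url, and host fields.
    return CrawlInstance(
        crawl_date=line["timestamp"],
        crawl_url=line["url"],
        sub_domain=line["host"],
    )

def populate_crawls(result: SearchResult, metadata_index) -> SearchResult:
    # get_all_cdx_lines is the hypothetical new metadata-index function
    # that returns all cdx lines for a particular PDF.
    result.crawls = [
        cdx_line_to_crawl(line)
        for line in metadata_index.get_all_cdx_lines(result.pdf_id)
    ]
    return result
```

The search function in server.py would then call `populate_crawls` on each result before serializing the response, so the front end receives every crawl for every PDF in one pass.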
