Open
Description
Currently, govscape only has the 2020 scrape, but we want to expand it to include scrapes from 2008, 2012, 2016, and 2024. Additionally, the same PDF can be hosted by multiple websites, and we want to expose this information. Doing so requires several changes to the code base.
On the front end:
- The PDFPreview interface needs to be updated to display every time a PDF was scraped. This would also let people see every location the same PDF was scraped from.
On the back end:
- The SearchResult schema in the search API needs to be updated to include an iterable of all crawl instances. Each instance would have at least the crawl_date, crawl_url, and subdomain.
- The search function in server.py needs to get all crawls for each result and populate the iterable with their information.
- The metadata index needs to expose a function that returns all CDX lines for a particular PDF.
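
As a starting point for discussion, the back-end changes could look roughly like the sketch below. It is not based on the existing govscape code: `CrawlInstance`, `pdf_id`, `crawls`, `parse_cdx_line`, and `crawls_for_pdf` are hypothetical names, and it assumes the metadata index stores space-separated lines in the common 11-field CDX layout (timestamp in field 2, original URL in field 3, content digest in field 6) and that two captures with the same digest are the same PDF. It also stores the full hostname in `subdomain`, which may not be what we want in the end.

```python
from dataclasses import dataclass, field
from datetime import datetime
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlInstance:
    """One capture of a PDF (hypothetical shape for a crawl instance)."""
    crawl_date: datetime
    crawl_url: str
    subdomain: str  # assumption: the full hostname of crawl_url


@dataclass
class SearchResult:
    """Hypothetical search result carrying every crawl of the same PDF."""
    pdf_id: str  # placeholder for whatever identifier the API already uses
    crawls: list[CrawlInstance] = field(default_factory=list)


def parse_cdx_line(line: str) -> tuple[str, CrawlInstance]:
    """Split one CDX line into (digest, CrawlInstance).

    Assumes the 11-field layout:
    urlkey timestamp original mimetype status digest redirect meta length offset filename
    """
    fields = line.split()
    timestamp, url, digest = fields[1], fields[2], fields[5]
    crawl = CrawlInstance(
        crawl_date=datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        crawl_url=url,
        subdomain=urlsplit(url).hostname or "",
    )
    return digest, crawl


def crawls_for_pdf(cdx_lines: list[str], digest: str) -> list[CrawlInstance]:
    """Return every crawl whose digest matches, oldest first."""
    parsed = (parse_cdx_line(line) for line in cdx_lines)
    matches = [crawl for d, crawl in parsed if d == digest]
    return sorted(matches, key=lambda c: c.crawl_date)
```

Under these assumptions, the search function in server.py would call something like `crawls_for_pdf` once per result and assign the returned list to `SearchResult.crawls`. A real implementation would look lines up through the metadata index instead of scanning a list.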