
Pages are not being rescraped fast enough #60

@markbeep

Currently I fetch the 500 oldest links in the HTML cache, and if any of them is accessed during a run, the cache is ignored and the page is rescraped. A large chunk of those entries seems to be outdated/invalid though, so only around 120 were actually rescraped in the last run. With 2500 courses/sem, updating the last 2 semesters (about 5000 courses) takes way too long at that rate.

This issue consists of two things:

  • Adding a script to clean up all URLs in the cache that will never be accessed.
  • Adjusting the rescraping system, potentially by making use of the flagged attribute. It might make sense to simply flag pages for rescraping and then check that at most 500 are actually rescraped per run.

Labels: bug