Build an S3 processing runner #37

@kylebd99

Much of the code in the script files is devoted to handling S3 and parallelism. That code does three things:

  1. Downloading large batches of files using either multiprocessing or s5cmd
  2. Applying a function to each batch of files (PDF, metadata json, embeddings, etc.), potentially using multiprocessing to do so
  3. Saving results in S3 & checkpointing

We should create a class that handles this logic, so that we don't have to duplicate it in every new script.
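
As a starting point for discussion, here is a rough sketch of what such a class could look like. Everything in it is hypothetical: the names `Storage`, `InMemoryStorage`, and `BatchRunner`, the checkpoint format (a newline-separated list of finished keys), and the choice of a thread pool are assumptions, not existing code from the scripts. Storage sits behind a small interface so that an S3-backed version (boto3 or s5cmd) could be dropped in, and the in-memory stand-in lets the sketch run without credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Protocol, Set


class Storage(Protocol):
    """Hypothetical storage interface; an S3 version would wrap boto3 or s5cmd."""

    def read(self, key: str) -> bytes: ...
    def write(self, key: str, data: bytes) -> None: ...
    def exists(self, key: str) -> bool: ...


class InMemoryStorage:
    """Stand-in for S3 so the sketch runs without credentials."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self.objects: Dict[str, bytes] = dict(objects or {})

    def read(self, key: str) -> bytes:
        return self.objects[key]

    def write(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def exists(self, key: str) -> bool:
        return key in self.objects


class BatchRunner:
    """Download batches, apply a function to each file, save results, checkpoint."""

    def __init__(
        self,
        storage: Storage,
        process_fn: Callable[[bytes], bytes],
        output_prefix: str,
        checkpoint_key: str = "checkpoint.txt",  # assumed checkpoint location
        batch_size: int = 2,
        max_workers: int = 4,
    ) -> None:
        self.storage = storage
        self.process_fn = process_fn
        self.output_prefix = output_prefix
        self.checkpoint_key = checkpoint_key
        self.batch_size = batch_size
        self.max_workers = max_workers

    def _load_done(self) -> Set[str]:
        # The checkpoint is a newline-separated list of keys already processed.
        if not self.storage.exists(self.checkpoint_key):
            return set()
        return set(self.storage.read(self.checkpoint_key).decode().splitlines())

    def _save_done(self, done: Set[str]) -> None:
        self.storage.write(self.checkpoint_key, "\n".join(sorted(done)).encode())

    def run(self, keys: List[str]) -> int:
        """Process every key not yet checkpointed; return how many were processed."""
        done = self._load_done()
        pending = [k for k in keys if k not in done]
        processed = 0
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            for start in range(0, len(pending), self.batch_size):
                batch = pending[start:start + self.batch_size]
                # 1. Download the batch in parallel.
                blobs = list(pool.map(self.storage.read, batch))
                # 2. Apply the function to each file in parallel.
                results = list(pool.map(self.process_fn, blobs))
                # 3. Save the results, then checkpoint the finished batch.
                for key, result in zip(batch, results):
                    self.storage.write(f"{self.output_prefix}/{key}", result)
                done.update(batch)
                self._save_done(done)
                processed += len(batch)
        return processed
```

Usage would be something like `BatchRunner(storage, process_fn, "out").run(keys)`. Calling `run` again with the same keys skips everything already listed in the checkpoint. The sketch uses threads, which suit the download step; for CPU-heavy functions a `ProcessPoolExecutor` could replace the thread pool in step 2, provided `process_fn` is picklable.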
