Open
Description
Much of the code in the script files is devoted to handling S3 and parallelism. It does three things:
- Downloading large batches of files using either multiprocessing or s5cmd
- Applying a function to each batch of files (PDF, metadata json, embeddings, etc.), potentially using multiprocessing to do so
- Saving results in S3 & checkpointing
We should create a class that handles this logic, so that new scripts can reuse it instead of duplicating it.
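As a starting point for discussion, here is a minimal sketch of what such a class could look like. Everything in it is hypothetical: the name `BatchRunner`, its parameters, and the JSON checkpoint format are assumptions, not existing code from the scripts. The S3 specifics are left to injected callables (`fetch` could wrap s5cmd or a multiprocessing download, `save` could upload to S3), and a thread pool stands in for multiprocessing to keep the example short.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Callable, List, Sequence, Set


class BatchRunner:
    """Hypothetical helper: fetch a batch, apply a function, save, checkpoint."""

    def __init__(
        self,
        fetch: Callable[[List[str]], List[Any]],      # keys -> payloads (e.g. wraps s5cmd)
        process: Callable[[Any], Any],                # one payload -> one result
        save: Callable[[List[str], List[Any]], None], # persist results (e.g. upload to S3)
        checkpoint_path: str,
        batch_size: int = 100,
        workers: int = 4,
    ) -> None:
        self.fetch = fetch
        self.process = process
        self.save = save
        self.checkpoint_path = Path(checkpoint_path)
        self.batch_size = batch_size
        self.workers = workers

    def _load_done(self) -> Set[str]:
        # Assumed checkpoint format: a JSON list of completed keys.
        if self.checkpoint_path.exists():
            return set(json.loads(self.checkpoint_path.read_text()))
        return set()

    def _write_done(self, done: Set[str]) -> None:
        # Write to a temp file, then rename, so a crash cannot leave a partial file.
        tmp = self.checkpoint_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(done)))
        tmp.replace(self.checkpoint_path)

    def run(self, keys: Sequence[str]) -> int:
        """Process all keys not yet checkpointed; return how many were processed."""
        done = self._load_done()
        pending = [k for k in keys if k not in done]
        processed = 0
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for start in range(0, len(pending), self.batch_size):
                batch = pending[start:start + self.batch_size]
                payloads = self.fetch(batch)
                results = list(pool.map(self.process, payloads))
                self.save(batch, results)
                # Checkpoint only after the batch has been saved.
                done.update(batch)
                self._write_done(done)
                processed += len(batch)
        return processed
```

With this shape, a script would only supply the three callables and a list of keys; rerunning after a failure skips every key already recorded in the checkpoint file.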