
Managing repository size #30

@jeromedockes


As we keep adding more documents for new projects, the repository is likely to
get too big.

To mitigate this, we can periodically scrub un-annotated documents from projects
that are no longer active. For some projects, 200 documents are added to the
repo but only a handful are annotated, so the others could be removed. The
repo's history would also have to be rewritten for this to actually reduce
the repository size, because deleted files remain in earlier commits. Also, we
probably want to aim for a few active projects with clear goals rather than a
constellation of little projects with very few annotations each.

If this is not sufficient to keep the repository size reasonable, as discussed IRL
with @Remi-Gau we could use git submodules and store each project in a separate
repository. Each repository would be small and annotators could clone only the
repository containing the project they are working on.

The downside is that users or contributors who want the parts of the
repository that are independent of any project, such as the labelrepo package
or the code and data to build the (jupyterbook) documentation, would need
to use the git submodule commands, which adds some friction.

Having the full repository (as it is now) is necessary for running analyses on
the annotations, and even for annotating in the case of the
participant_demographics project, because in that project annotating is made
much easier by the watch_participants.py script in the /scripts/ directory,
which relies on labelrepo. labelrepo could be distributed from PyPI rather
than with the annotations, but it is useless without the annotations repo.
Keeping it here means it is always in sync with the rest of the repo, and
installing it in editable mode provides a convenient way to find the location of
the repo in the filesystem without the user having to pass it on the command
line, export an environment variable, or run the scripts from a specific working
directory.
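
To illustrate the editable-install point, here is a minimal sketch of how a
package installed in editable mode could locate the repository it lives in.
This is not labelrepo's actual implementation; the function names
find_repo_root and repo_root_from_package, and the choice of ".git" as the
marker, are assumptions made for the example. It relies on the fact that in a
typical editable install the package's __file__ points into the source
checkout, so walking up the parent directories reaches the repository root.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Walk up from `start` and return the first directory containing `marker`.

    Hypothetical helper: ".git" is an assumed marker. It may be a directory
    (normal clone) or a file (submodule checkout); exists() covers both.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker!r} found at or above {start}")


def repo_root_from_package(package_file: str) -> Path:
    """Locate the repository root from a module's __file__ (hypothetical)."""
    return find_repo_root(Path(package_file).parent)
```

Inside the package, a call such as repo_root_from_package(__file__) would then
give scripts the repository location without any command-line argument or
environment variable. With a regular (non-editable) install the module would
live in site-packages, so the search would fail or find an unrelated
repository, which is why this only works with the editable setup.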
