diff --git a/brainstorm/add.md b/brainstorm/add.md
new file mode 100644
index 0000000..23d9702
--- /dev/null
+++ b/brainstorm/add.md
@@ -0,0 +1,45 @@
+# `dvs add`
+
+Goal: Add files to an initialized dvs repository.
+
+- [ ] Currently the `message` is attached to all files checked in simultaneously.
+  dvs has a log and audit log to
+  illuminate "why" a change occurred in the data.
+
+
+
+## CLI
+
+- Assume that the current directory is a dvs repository, both in the CLI and the R package.
+- The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_add <- function(
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  overwrite = FALSE,
+  fail = FALSE
+)
+```
+
+## Compression
+
+If the added file exceeds a certain threshold, the
+R package should suggest compressing the recently added file.
+
+- [ ] `getOption("dvs.large_file_size")`, an `integer()`
+  - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size)
+  - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size)
+
+Advise compression when
+
+- a single file exceeds size thresholds
+- a directory of files exceeds size thresholds
+
+There are cases where individual files are not large, but the collection of files
+adds up to a large total, presumably too large to track.
diff --git a/brainstorm/alias_git.md b/brainstorm/alias_git.md
new file mode 100644
index 0000000..c248ab9
--- /dev/null
+++ b/brainstorm/alias_git.md
@@ -0,0 +1,5 @@
+# Alias dvs with git terminology
+
+- [ ] (future?) Should dvs-cli and dvs-rpkg have a `--git-mode`, where we
+expose a git compatible interface to dvs, in order to
+plug in dvs as a git replacement?
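The soft and hard size limits proposed for `dvs add` (50 MB warning, 100 MB hard limit) could be checked along the following lines. This is only a sketch: `SOFT_LIMIT`, `HARD_LIMIT`, `classify_size`, and `file_size` are hypothetical names rather than dvs options, and 1 MB is assumed to mean 1024 * 1024 bytes.

```shell
# Hypothetical thresholds (assumption: 1 MB = 1024 * 1024 bytes).
SOFT_LIMIT=$((50 * 1024 * 1024))    # above this: warn and advise compression
HARD_LIMIT=$((100 * 1024 * 1024))   # above this: refuse

# Classify a size given in bytes as "ok", "warn", or "fail".
classify_size() {
  size_bytes=$1
  if [ "$size_bytes" -gt "$HARD_LIMIT" ]; then
    echo "fail"
  elif [ "$size_bytes" -gt "$SOFT_LIMIT" ]; then
    echo "warn"
  else
    echo "ok"
  fi
}

# Size of a single file in bytes (the tr call strips padding some wc builds add).
file_size() {
  wc -c < "$1" | tr -d ' '
}
```

For a single file this would be called as `classify_size "$(file_size data/derived/pk_data.csv)"`; the directory case would need the sizes of all contained files summed before classification.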
diff --git a/brainstorm/audit.md b/brainstorm/audit.md
new file mode 100644
index 0000000..c005291
--- /dev/null
+++ b/brainstorm/audit.md
@@ -0,0 +1,38 @@
+# `dvs audit`
+
+Goal: Provide a repository-wide log of dvs tracked files.
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+$ dvs audit
+[Date] [User] [+{files} -{files}] [Message]
+```
+
+```sh
+$ dvs audit --since
+```
+
+## R
+
+Signature:
+
+```r
+dvs_audit <- function(
+  since = NULL, # date | duration (unit)
+  by_user = character())
+
+```
+
+
+
+```r
+dvs_audit()
+
+```
+
+```r
+dvs_audit(since = NULL)
+```
diff --git a/brainstorm/configuration.md b/brainstorm/configuration.md
new file mode 100644
index 0000000..7649cc2
--- /dev/null
+++ b/brainstorm/configuration.md
@@ -0,0 +1,5 @@
+# `dvs.toml`
+
+
+Configuration should record which patterns are tracked; see [DVS Tracking](./tracking.md).
+
diff --git a/brainstorm/delete.md b/brainstorm/delete.md
new file mode 100644
index 0000000..d9c4687
--- /dev/null
+++ b/brainstorm/delete.md
@@ -0,0 +1,29 @@
+# `dvs delete`
+
+Goal: Remove tracked files.
+
+## CLI
+
+```shell
+$ dvs delete
+
+```
+
+## R package
+
+```r
+dvs_delete <- function(
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  fail = FALSE
+)
+```
+
+Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`.
+
+- `files`: list of files that are to be deleted.
+
+### Non-existing files
+
+Emit a warning, but still remove the files that do exist and are tracked.
diff --git a/brainstorm/dvs_last.md b/brainstorm/dvs_last.md
new file mode 100644
index 0000000..cd0a5e9
--- /dev/null
+++ b/brainstorm/dvs_last.md
@@ -0,0 +1,10 @@
+# `dvs_last`
+
+Goal: provide users with the ability to retrieve the result of
+the last executed dvs command within the R package.
+
+Example: Suppose an error occurred after `dvs_add(by_folder = "data/derived/*")` was
+executed, and an overview is displayed as a data-frame. The user
+got an R-native result, a data-frame, but if the user wants to act on the
+provided information, we might want to provide a `dvs_last` that contains
+miscellaneous details.
diff --git a/brainstorm/enum_status.md b/brainstorm/enum_status.md
new file mode 100644
index 0000000..611ae46
--- /dev/null
+++ b/brainstorm/enum_status.md
@@ -0,0 +1,39 @@
+# Configuration: Status
+
+- current | absent | unsynced
+- tracked file that is un-added
+
+
+
+# TODO (editing needed)
+
+- `relative_path`: relative path to the file with respect to where the operation was called
+
+- `status`: (doesn’t include error status)
+
+  - `current`: the file is present in the project directory and matches the version in the storage directory
+
+  - `absent`: the file isn't present in the project directory
+
+  - `unsynced`: the file is present in the project directory, but doesn't match the version in the storage directory
+
+- `file_size_bytes`: current size of the file in bytes
+
+- `time_stamp`: the ISO 8601 Zulu time of the most recent file version in the storage directory
+
+- `saved_by`: the user who uploaded the most recent file version in the storage directory
+
+- `message`: the message inputted to the `dvs_add` command that added the most recent file version in the storage directory
+
+- `blake3_checksum`: hash of the file via the blake3 algorithm
+
+- `absolute_path`: canonicalized path of the file
+- `input`:
+
+  - if inputted explicitly via file glob or path: the file name
+
+  - if inputted implicitly via `dvs_status()` (without input): NA
+
+- `error`: if the outcome was error, the error type, else NA
+
+- `error message`: if the outcome was error, the error message (if there was one), else NA
diff --git a/brainstorm/follow.md b/brainstorm/follow.md
new file mode 100644
index 0000000..f87c953
--- /dev/null
+++ b/brainstorm/follow.md
@@ -0,0 +1,19 @@
+# `dvs track` / `dvs_track`
+
+Goal: Specify which files dvs ought to follow.
+
+User journey:
+
+- [ ] All the .csv files underneath a specific directory.
+- [ ] All the .csv files that are less than 25 MB
+
+- File type
+- Size filters
+
+- [ ] MOSSA: We may not want to track files that are too large, even if they are .csv
+- [ ] pre-commit hook 100 MB limit; see template-PMx-project-starter
+
+Cloned repositories do not have hooks!
+
+- [ ] MOSSA: Filtering `**/*` but only the files tracked by dvs!
+- [ ]
\ No newline at end of file
diff --git a/brainstorm/get.md b/brainstorm/get.md
new file mode 100644
index 0000000..54231fa
--- /dev/null
+++ b/brainstorm/get.md
@@ -0,0 +1,20 @@
+# `dvs get`
+
+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_get <- function(
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  fail = FALSE # follows fs::dir_ls
+)
+```
+
+
diff --git a/brainstorm/initialization.md b/brainstorm/initialization.md
new file mode 100644
index 0000000..c552e69
--- /dev/null
+++ b/brainstorm/initialization.md
@@ -0,0 +1,252 @@
+# dvs initialization / `dvs init` / `dvs_init`
+
+Goal: Prepare shared storage and initialize DVS in a directory
+
+dvs initialization will create a `dvs.toml` and a directory as specified by the
+shared area in the init command. The init command may also need to `chown` the shared
+directory to set certain permissions. For example, for sensitive projects, setting
+ownership to a particular group, allowing write access for the group, and denying
+read access to those not in the group.
+
+## User site assumptions
+
+- Always operating within a repository/project/workspace.
+- A dvs repository need not fall under a git or any other vcs repository
+- Storage is detached from repository root
+
+- [ ] If `git` is not a requirement, what alternative heuristics do we use
+  for instantiating a dvs repository? Suggestion: If a `.git` directory is not
+  available, then take the current directory as the directory of choice for initialization?
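The suggested heuristic (use the nearest ancestor that contains `.git`, otherwise fall back to the current directory) could be sketched as follows. `find_init_dir` is a hypothetical helper written for illustration, not part of dvs.

```shell
# Hypothetical helper: print the directory in which dvs would be initialized.
# Walks upwards from the start directory (default: the current directory)
# looking for `.git`; falls back to the start directory when none is found.
find_init_dir() {
  start=$(cd "${1:-.}" && pwd -P) || return 1
  dir=$start
  while :; do
    # -e rather than -d: in git worktrees and submodules `.git` is a file.
    if [ -e "$dir/.git" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    [ "$dir" = "/" ] && break
    dir=$(dirname "$dir")
  done
  printf '%s\n' "$start"
}
```

Called without arguments from `projectA/scripts/`, this would print the path of `projectA` if that is where `.git` lives, and the path of `scripts/` itself if no ancestor contains `.git`.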
+
+## CLI
+
+```shell
+dvs --- Data version control and storage management system
+
+Usage:
+  dvs [OPTIONS]
+
+Commands:
+  init
+  add
+  get
+  status
+  audit
+  log
+
+Options:
+  -h, --help  Show help for command (e.g. `dvs init --help`)
+  --version   Show version information
+```
+
+The initialization command will have further subcommands.
+
+```shell
+dvs init --- Initialize a new DVS repository
+
+Usage:
+  dvs init [OPTIONS]
+
+Backends:
+  local  Local, on-disk storage
+  fs     File system storage (e.g. network file system (nfs))
+  s3     S3 compatible storage
+  aws    S3 hosted via AWS
+
+Options:
+  -h, --help  Show help for command (e.g. `dvs init --help`)
+```
+
+### Local
+
+```shell
+dvs init local --- Initialize a DVS repository via on-disk storage
+
+Usage:
+  dvs init local [OPTIONS]
+
+Required:
+  path to the local storage location (e.g. `/data/`)
+
+Options:
+  --json
+      Output results as JSON
+  --metadata-folder-name
+      If you want to use a folder name other than `.dvs` for storing the metadata files
+  --permissions
+      Unix permissions for storage directory and files (octal, e.g., "770")
+  --group
+      Unix group to set on storage directory and files
+  --no-compression
+      Disable compression of stored files. Compression defaults to zstd
+  -h, --help
+      Print help
+```
+
+## FS / NFS
+
+
+
+Example output:
+
+```shell
+$ dvs init /data/
+DVS Repository created with storage path located at
+```
+
+## R function
+
+```r
+dvs_init <- function(
+  storage_path = character(), # required
+  permissions = NULL,
+  group = NULL,
+  metadata_folder_name = NULL)
+```
+
+Example output:
+
+```r
+> dvs_init()
+> Error: `storage_path` is missing; please provide a location to store dvs objects.
+```
+
+```r
+> dvs_init("/data/projectA_storage")
+> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage"
+```
+
+CLI users do not need the full path shown to them, but R users need that information.
+
+Different storage backends have to be initialized through specialized functions.
+
+- `dvs_init_local` with alias `dvs_init`
+- `dvs_init_fs(...)`
+- `dvs_init_s3(...)`
+- `dvs_init_aws(...)`
+
+## Storage
+
+- (future) Multiple projects can be hosted within the same storage
+
+### Case: No project or specific work directory
+
+Consider the one-off scripts that scientists might create, where there is
+no project surrounding the script.
+
+- (future) User/machine storage
+- (future) A remote project
+- (future) One-off scripts
+
+## Journey 1: Initial Setup with defaults
+
+Expected outcomes:
+
+- `dvs.toml` created in the ancestor directory that contains `.git`, or chosen by other heuristics.
+- shared dir created in specified path, with default permissions of 664
+
+Known Caveats:
+
+- certain linux `umask` setups cause folders to have default permissions like 600, or 644
+where other collaborators could not write by default, therefore,
+
+### CLI flow
+
+1. Initialize dvs from a project directory
+
+```bash
+dvs init /data/dvs/example-proj
+```
+
+### R package flow
+
+1. Initialize DVS in the repo
+
+```r
+dvs_init("/data/shared/project-x-dvs")
+```
+
+## Journey 2: Initial Setup with shared folder locked down to group
+
+- set permissions to writeable by group, not readable if not in group (660)
+- group name projx
+
+Expected outcomes:
+
+- `dvs.toml` created in working directory
+- shared dir created in specified path, with permissions of 660 and owned by group projx
+
+Edge cases:
+
+- group must resolve to known gid on system
+
+### CLI flow
+
+1. Initialize dvs from a project directory
+
+```bash
+dvs init /data/dvs/sensitive-projx --permissions "660" --group projx
+```
+
+### R package flow
+
+1. Initialize DVS in the repo
+
+```r
+dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx")
+```
+
+#### Returns
+
+Return a rich data-frame that the end-user can then further subset/filter
+to fit their needs.
+
+Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`.
+
+- [ ] New format:
+  - `absolute_path`: abbreviated when printed in R (pillar)
+  - `relative_path`: full path
+  - `status`: ordered factor instead of `character()`
+    - `absent|unsync|sync|present|added`
+  - `checksum`: always abbreviated in print (pillar, first 5 characters)
+  - `size`: using units and not raw `double()/numeric()`
+
+## Data formats to track
+
+- `.csv`
+- `.rds`
+- don't track `.RDA` files, as they are a collection of datasets
+
+Configuration: Must add these filters to the `dvs.toml`.
+
+Known annoyance: The verbosity of this can be annoying.
+There should be a way for the user to reduce output about untracked
+data files.
+
+# TODO (to be edited)
+
+Errors
+
+dvs_init could return any of the following error types:
+
+project already initialized: dvs_init has already been run with different initialization attributes.
+
+git repository not found: dvs_init was run outside of a git repository
+
+storage directory input is not a directory: if input was an existing file
+
+storage directory absolute path not found: if the path could not be made absolute
+
+configuration file not created (dvs.yaml): failed to write to or save dvs.yaml
+
+linux primary group not found: if the group was inputted and it doesn't refer to a valid group
+
+storage directory not created: failed to create the storage directory
+
+linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions
+
+could not check if storage directory is empty: error reading the contents of the directory
+
+storage directory permissions not set: couldn't modify the permissions of the storage directory
diff --git a/brainstorm/journey-2-adding-data-files.md b/brainstorm/journey-2-adding-data-files.md
new file mode 100644
index 0000000..94a2ff2
--- /dev/null
+++ b/brainstorm/journey-2-adding-data-files.md
@@ -0,0 +1,68 @@
+# Journey 2: Adding Data Files
+
+Goal: Version a newly created dataset so others can retrieve it.
+
+## CLI flow
+
+1. Produce the data (example)
+
+   ```bash
+   # Your data pipeline or script writes:
+   # data/derived/pk_data.csv
+   ```
+
+2. Add the file to DVS
+
+   ```bash
+   dvs add data/derived/pk_data.csv --message "Initial PK dataset v1"
+   ```
+
+3. Commit DVS metadata
+
+   ```bash
+   git add data/derived/pk_data.csv.dvs data/derived/.gitignore
+   git commit -m "Add processed PK data"
+   git push
+   ```
+
+4. Verify status
+
+   ```bash
+   dvs status data/derived/pk_data.csv
+   ```
+
+## R package flow
+
+1. Produce the data
+
+   ```r
+   write.csv(pk_data, "data/derived/pk_data.csv")
+   ```
+
+2. Add the file to DVS
+
+   ```r
+   dvs_add("data/derived/pk_data.csv", message = "Initial PK dataset v1")
+   ```
+
+3. Commit DVS metadata
+
+   ```bash
+   git add data/derived/pk_data.csv.dvs data/derived/.gitignore
+   git commit -m "Add processed PK data"
+   git push
+   ```
+
+4. Verify status
+
+   ```r
+   dvs_status("data/derived/pk_data.csv")
+   ```
+
+```r
+# dvs_init() has not been run yet
+dvs_add("contingency_table2.csv")
+```
+
+In RStudio: Check whether there is an active folder; if there is none, emit a warning.
+Similarly in VSCode and Positron, as both can be run without an active workspace.
diff --git a/brainstorm/journey-3-getting-latest-files.md b/brainstorm/journey-3-getting-latest-files.md
new file mode 100644
index 0000000..5d681d1
--- /dev/null
+++ b/brainstorm/journey-3-getting-latest-files.md
@@ -0,0 +1,55 @@
+# Journey 3: Getting Latest Files
+
+Goal: Pull metadata from Git and restore the tracked data files.
+
+## CLI flow
+
+1. Pull the latest repo changes
+
+   ```bash
+   git pull
+   ```
+
+2. See what is missing
+
+   ```bash
+   dvs status
+   ```
+
+3. Restore tracked files
+
+   ```bash
+   dvs get data/derived/*
+   ```
+
+4. Verify everything is current
+
+   ```bash
+   dvs status
+   ```
+
+## R package flow
+
+1. Pull the latest repo changes
+
+   ```bash
+   git pull
+   ```
+
+2. See what is missing
+
+   ```r
+   dvs_status()
+   ```
+
+3. Restore tracked files
+
+   ```r
+   dvs_get("data/derived/*")
+   ```
+
+4. Verify everything is current
+
+   ```r
+   dvs_status()
+   ```
diff --git a/brainstorm/journey-4-updating-data-files.md b/brainstorm/journey-4-updating-data-files.md
new file mode 100644
index 0000000..a96d329
--- /dev/null
+++ b/brainstorm/journey-4-updating-data-files.md
@@ -0,0 +1,90 @@
+# Journey 4: Updating Data Files
+
+Goal: Replace an existing tracked dataset with a new version.
+
+## CLI flow
+
+1. Re-run your processing to overwrite the data file
+
+   ```bash
+   # Your data pipeline updates:
+   # data/derived/pk_data.csv
+   ```
+
+2. Check status
+
+   ```bash
+   dvs status data/derived/pk_data.csv
+   ```
+
+3. Add the new version
+
+   ```bash
+   dvs add data/derived/pk_data.csv --message "Updated PK dataset v2"
+   ```
+
+4. Commit updated metadata
+
+   ```bash
+   git add data/derived/pk_data.csv.dvs
+   git commit -m "Update PK data with new processing"
+   git push
+   ```
+
+## R package flow
+
+1. Re-run your processing
+
+   ```r
+   pk_data_v2 <- update_processing(pk_data)
+   write.csv(pk_data_v2, "data/derived/pk_data.csv")
+   ```
+
+2. Check status
+
+   ```r
+   dvs_status("data/derived/pk_data.csv")
+   ```
+
+3. Add the new version
+
+   ```r
+   dvs_add("data/derived/pk_data.csv", message = "Updated PK dataset v2")
+   ```
+
+4. Commit updated metadata
+
+   ```bash
+   git add data/derived/pk_data.csv.dvs
+   git commit -m "Update PK data with new processing"
+   git push
+   ```
+
+## Journey 5: Updating data files with new rows
+
+New data in the same form as before might arrive. For example, new participants are
+added to a clinical trial, and the scientists want the new rows added to data files
+that are already checked in.
+
+```r
+dvs_add("data/registry/participants.csv", message = "added information from the second batch of runs")
+```
+
+this ought to say
+
+```r
+> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`"
+```
+
+Then,
+
+```r
+dvs_add("data/registry/participants.csv", message = "added information from the second batch of runs", amend = TRUE)
+```
+
+could be executed. In that case the previous hash is compared to the hash of the new file `data/registry/participants.csv`
+truncated to the length of the previous file. If they match, we know the change is an addition, so this new event can
+supersede the earlier add events.
+
+The hash itself cannot distinguish between a completely new file and one with appended bytes. In dvs, we only have the current hash,
+so we should consider obtaining this context from the user, i.e. by asking whether it is an addition / amendment.
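The amendment check described above can be sketched as follows. It rests on assumptions that are not in the notes: that dvs would record the previous file's size alongside its hash, and that `sha256sum` is an acceptable stand-in for blake3 in an illustration. `hash_prefix` and `is_amendment` are hypothetical helper names.

```shell
# Hash only the first $2 bytes of file $1.
# Assumption: sha256sum stands in for the blake3 hash that dvs uses.
hash_prefix() {
  head -c "$2" "$1" | sha256sum | cut -d ' ' -f 1
}

# Succeeds when the new file starts with exactly the bytes of the previous
# version, i.e. when the change only appended data.
# Arguments: new file, previous size in bytes, previous hash.
is_amendment() {
  new_file=$1
  prev_size=$2
  prev_hash=$3
  new_size=$(wc -c < "$new_file" | tr -d ' ')
  # An append-only change can never shrink the file.
  [ "$new_size" -ge "$prev_size" ] || return 1
  [ "$(hash_prefix "$new_file" "$prev_size")" = "$prev_hash" ]
}
```

A matching prefix hash only shows that the old bytes are still in place at the start of the file; whether the appended rows are themselves valid remains a question for the user.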
diff --git a/brainstorm/journey-5-working-with-multiple-files.md b/brainstorm/journey-5-working-with-multiple-files.md
new file mode 100644
index 0000000..bf7872e
--- /dev/null
+++ b/brainstorm/journey-5-working-with-multiple-files.md
@@ -0,0 +1,60 @@
+# Journey 5: Working with Multiple Files
+
+Goal: Add and retrieve batches of outputs with glob patterns.
+
+## CLI flow
+
+1. Produce multiple outputs
+
+   ```bash
+   # Your data pipeline writes:
+   # data/derived/pk.csv
+   # data/derived/pd.csv
+   # data/derived/summary.csv
+   ```
+
+2. Add all outputs at once
+
+   ```bash
+   dvs add data/derived/*.csv --message "Analysis outputs batch 1"
+   ```
+
+3. Retrieve all tracked files later
+
+   ```bash
+   dvs get data/derived/*.csv
+   ```
+
+4. Check status for everything
+
+   ```bash
+   dvs status
+   ```
+
+## R package flow
+
+1. Produce multiple outputs
+
+   ```r
+   write.csv(pk_data, "data/derived/pk.csv")
+   write.csv(pd_data, "data/derived/pd.csv")
+   write.csv(summary_stats, "data/derived/summary.csv")
+   ```
+
+2. Add all outputs at once
+
+   ```r
+   dvs_add("data/derived/*.csv", message = "Analysis outputs batch 1")
+   ```
+
+3. Retrieve all tracked files later
+
+   ```r
+   dvs_get("data/derived/*.csv")
+   ```
+
+4. Check status for everything
+
+   ```r
+   dvs_status()
+   ```
diff --git a/brainstorm/log.md b/brainstorm/log.md
new file mode 100644
index 0000000..79fcdfe
--- /dev/null
+++ b/brainstorm/log.md
@@ -0,0 +1,50 @@
+# `dvs log`
+
+Per file logging is inspected via `dvs log` / `dvs_log()`. For project-wide logging, we have `dvs audit` / `dvs_audit()`.
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+# in a folder previously initialized with `dvs init`
+$ dvs log data/derived/model_summary.txt
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+```
+
+```sh
+$ dvs log --interval
+[date -- duration since now]
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+[date -- 2x duration since now]
+...
+[date -- 3x duration since now]
+...
+```
+
+``: `days`, `weeks`, `months`
+
+## R
+
+Signature:
+
+```r
+dvs_log <- function(
+  since = NULL,
+  by_user = NULL
+)
+```
+
+
diff --git a/brainstorm/message.md b/brainstorm/message.md
new file mode 100644
index 0000000..283a92f
--- /dev/null
+++ b/brainstorm/message.md
@@ -0,0 +1,23 @@
+# `dvs_message`
+
+Goal: Add messages to files without re-hashing or replacing them.
+
+## CLI
+
+```sh
+$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repetitions"
+Added message to `data/model_aaabb/model_summary.csv`
+```
+
+## R package
+
+```r
+dvs_message <- function(
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  fail = FALSE
+)
+```
+
+`dvs_message` is equivalent to an idempotent `dvs_add` call.
diff --git a/brainstorm/remote_storage.md b/brainstorm/remote_storage.md
new file mode 100644
index 0000000..ed6f055
--- /dev/null
+++ b/brainstorm/remote_storage.md
@@ -0,0 +1,3 @@
+# dvs supported storage backends
+
+File systems, S3, and S3 hosted by AWS.
diff --git a/brainstorm/revert.md b/brainstorm/revert.md
new file mode 100644
index 0000000..b8e6151
--- /dev/null
+++ b/brainstorm/revert.md
@@ -0,0 +1,19 @@
+# `dvs revert`
+
+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_revert <- function(
+  commit_sha = integer(),
+  date = NULL,
+  before = NULL # date | duration
+)
+```
+
+
diff --git a/brainstorm/root.md b/brainstorm/root.md
new file mode 100644
index 0000000..dcdc4b0
--- /dev/null
+++ b/brainstorm/root.md
@@ -0,0 +1,33 @@
+# `dvs root`
+
+Convenience utility for expert users
+
+Goal: Return the location of the dvs repository root from anywhere within it.
+
+## CLI
+
+
+
+## R package
+
+Signature:
+
+```r
+dvs_root <- function(...)
+# alias
+find_dvs_root <- dvs_root
+```
+
+Convenience:
+
+```r
+dvs_root("model_code")
+# equivalent to
+fs::path(dvs_root(), "model_code")
+# or
+file.path(dvs_root(), "model_code")
+```
+
+The use cases for this function are very limited. We assume heavy use of the
+`{here}`-package in dvs-based projects. But it could be a relevant convenience
+function in certain, specific cases.
diff --git a/brainstorm/status.md b/brainstorm/status.md
new file mode 100644
index 0000000..eee4104
--- /dev/null
+++ b/brainstorm/status.md
@@ -0,0 +1,106 @@
+# `dvs status`
+
+Goal: Provide an overview of the changed data files and potential files to track
+via the traced data file filters.
+
+## CLI
+
+The option to return `--json` must be present.
+
+```shell
+$ dvs status --help
+Status of the DVS repository
+
+Usage:
+  dvs status [FILTERS] [OPTIONS]
+
+Filters:
+  --current
+  --unsynced, --missing
+  --absent
+  --no-current
+  --no-unsynced
+  --no-absent
+
+Options:
+  -s, --state   filter for states to retain
+  -i, --invert  inverts the selection provided by `--state`
+  -h, --help    Print help
+```
+
+When a filter is provided, only files in the selected state(s) are shown.
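The intended relationship between `--state`, `--invert`, and the `--no-*` filters can be illustrated on plain text rows. This is a sketch under assumptions: the row layout (`<state> <path>`) and the helper `filter_status` are hypothetical, and it is assumed that `--no-absent` would behave like `--state absent --invert`.

```shell
# Hypothetical: status rows arrive on stdin, one "<state> <path>" per line.
# Usage: filter_status STATE [--invert]
filter_status() {
  state=$1
  if [ "${2:-}" = "--invert" ]; then
    # Keep every row whose state differs from the selected one.
    awk -v s="$state" '$1 != s'
  else
    # Keep only rows in the selected state.
    awk -v s="$state" '$1 == s'
  fi
}
```

For example, `printf 'current a.csv\nabsent b.csv\n' | filter_status absent --invert` keeps only the `current a.csv` row.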
+
+
+
+```sh
+dvs status
+
+Current files:
+
+
+Changed files (unsynced):
+  new_scenario/model_spec.txt
+
+Untracked and followed files:
+  original_scenario/model_summary.txt
+  original_scenario/tab-0123.tsv
+  original_scenario/tab-0123b.tsv
+  original_scenario/tab-0123c.tsv
+```
+
+We do not need to display the user for unsynced files, as those files are likely to be owned by the current user.
+
+## R
+
+Signature:
+
+```r
+dvs_status <- function(
+  show_storage = FALSE
+)
+```
+
+- `show_storage`:
+  - Show location of storage(s) for the current dvs repository.
+  - Warn the user that they must not alter the state of
+    the storage directory.
+  - (future) Show number of projects that the storage contains
+
+## Return format
+
+### CLI JSON format
+
+
+
+### R format
+
+Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum`
+
+Proposed format:
+
+- `absolute_path`: abbreviated when printed in R (pillar)
+- `relative_path`: full path
+- `status`: ordered factor instead of `character()`
+  - `absent|unsync|sync|present|added`
+- `checksum`: always abbreviated in print (pillar, first 5 characters)
+- `size`: using units and not raw `double()/numeric()`
+
+## Data name format
+
+`dvs_status` should show untracked data files in the current dvs repository, if
+tracking is specified.
+
+## Granularity
+
+We expect the end user to use `{dplyr}` in order to
+filter by users, groups, and/or folders. Therefore it is important to provide consistent data-frames.
+
+## Following Filters in Status
+
+`dvs_track(".csv")`: tracks all CSV files.
+
+`dvs_track("model_data/*")`: all files in a directory will be added to the list of (potentially untracked) files
+
+`dvs_track("results/*.rds")`: glob on all R data files that are saved in a specific directory.
+
+These should result in additions to the `[following]` table in `dvs.toml`. See [Following Formats](tracking.md).
diff --git a/brainstorm/sync.md b/brainstorm/sync.md
new file mode 100644
index 0000000..ab55301
--- /dev/null
+++ b/brainstorm/sync.md
@@ -0,0 +1,48 @@
+# `dvs sync`
+
+Goal: Provide a streamlined way to update a cloned dvs repository.
+
+Synchronization `sync` is an alias for `dvs get **/*`, meant as a
+repository-wide sync from storage (local/remote).
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+$ dvs sync
+[status] [Last modified] [Message]
+...      ...             ...
+```
+
+The sync subcommand should also be able to act as a repository revert,
+for example:
+
+```sh
+$ dvs sync --before
+[status] [Last modified] [Message]
+...      ...             ...
+```
+
+## R
+
+Signature:
+
+```r
+dvs_sync <- function(
+  by_folder = character(),
+  since = NULL, # date | duration (unit)
+  recurse = TRUE
+)
+```
+
+- `path` is a location within a dvs repository.
+  Not necessarily the root of a dvs repository.
+- `by_folder` allows syncing specific folders only
+- `recurse` is whether to sync folders recursively
+
+### `recurse`
+
+When there is no `by_folder`, `recurse` will update the entire dvs repository, even if
+the current directory is a sub-directory in a dvs repository. The current location of the
+user might be incidental to their intent with dvs.
diff --git a/brainstorm/trace.md b/brainstorm/trace.md
new file mode 100644
index 0000000..e69de29
diff --git a/brainstorm/tracking.md b/brainstorm/tracking.md
new file mode 100644
index 0000000..6d6917e
--- /dev/null
+++ b/brainstorm/tracking.md
@@ -0,0 +1,79 @@
+# `dvs track` / `dvs_track`
+
+Goal: Specify which files dvs ought to follow.
+
+User journey:
+
+- [ ] All the .csv files underneath a specific directory.
+- [ ] All the .csv files that are less than 25 MB
+
+- File type
+- Size filters
+
+
+
+
+
+## CLI
+
+```shell
+$ dvs follow --help
+Files that are followed by dvs when untracked.
+
+Usage:
+  dvs follow [COMMANDS] [OPTIONS]
+
+Commands:
+  add
+  list
+  audit
+
+Options:
+  -h, --help  Show help for a command
+```
+
+`add` command:
+
+
+`list` command:
+
+
+`audit` command:
+
+
+## R package
+
+Support the following
+
+- `ext`: follow-filters based on file extensions, e.g. `"csv"`.
+- `glob`: a glob that can enable matching files through their paths and file extension
+- `regex`: a regular expression to match files through their full paths
+
+Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`.
+
+The follow filter must support
+
+- a `glob`, `ext`, or `regex` field
+- an optional `label` that can be used to identify which follow-filter matched a file
+- file size qualifiers:
+  `file_size_gt` (file size greater than the threshold),
+  `file_size_lt` (file size less than the threshold)
+
+Example:
+
+```toml
+[[follow]]
+ext = "parquet"
+
+[[follow]]
+glob = "data/**/*.csv"
+label = "optional label" # `label` is optional
+
+[[follow]]
+# match all nonmem tab files (sdtab001, patab001, ...) over 5MB
+regex = ".+tab[0-9].+"
+file_size_gt = "5MB"
+
+[[follow]]
+glob = "model/nonmem/**/*"
+file_size_gt = "10MB"
+```
+
+## Matcher audit
+
+A helpful utility for end users is a way to figure out why a given file was followed
+by dvs. To that end, `dvs track` ought to display the matching filter next to every
+followed file.
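A matcher audit of this kind could be sketched as follows. Everything here is illustrative: the `FILTERS` table (one `label|kind|pattern` entry per line), the labels, and the helper `match_filter` are hypothetical and not dvs syntax, and the file size qualifiers are left out for brevity.

```shell
# Hypothetical filter table, one "label|kind|pattern" entry per line.
# Note: in a shell `case` pattern `*` also matches `/`, so `data/*.csv`
# approximates the glob `data/**/*.csv`.
FILTERS='parquet-files|ext|parquet
derived-csv|glob|data/*.csv
nonmem-tabs|regex|.+tab[0-9].+'

# Print the label of the first filter that matches path $1, or "-" if none.
match_filter() {
  path=$1
  while IFS='|' read -r label kind pattern; do
    case $kind in
      ext)
        # Compare the text after the last dot with the extension.
        [ "${path##*.}" = "$pattern" ] && { echo "$label"; return 0; }
        ;;
      glob)
        # The pattern is left unquoted on purpose so that it acts as a glob.
        case $path in
          $pattern) echo "$label"; return 0 ;;
        esac
        ;;
      regex)
        printf '%s\n' "$path" | grep -Eq "$pattern" && { echo "$label"; return 0; }
        ;;
    esac
  done <<EOF
$FILTERS
EOF
  echo "-"
}
```

Printing `match_filter "$f"` next to each followed file `f` would give the per-file explanation described above; reporting only the first matching filter is a design choice of this sketch, and listing all matches would work equally well.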