From 65df1bdea589af51cb21f7282da6cf4e1deb47fa Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 15:46:31 +0100
Subject: [PATCH 01/28] ui: added amendment logic

---
 ui/journey-4-updating-data-files.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/ui/journey-4-updating-data-files.md b/ui/journey-4-updating-data-files.md
index c25ebcf..fdaa574 100644
--- a/ui/journey-4-updating-data-files.md
+++ b/ui/journey-4-updating-data-files.md
@@ -59,3 +59,32 @@ Goal: Replace an existing tracked dataset with a new version.
 git commit -m "Update PK data with new processing"
 git push
 ```
+
+## Journey 5: Updating data files with new rows
+
+New data following previous form might come up. Example is new rows from a clinical trial,
+new participants in trials is added, however the scientists want them added to already
+checked data files.
+
+```r
+dvs_add("data/registry/participants.csv", "added information from the second batch of runs")
+```
+
+this ought to say
+
+```r
+> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`"
+```
+
+Then,
+
+```r
+dvs_add("data/registry/participants.csv", "added information from the second batch of runs", amend = TRUE)
+```
+
+could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated
+to the level of the previous file, and then it can be known if this new event can superseed other add events, because we
+know it is an addition.
+
+The hash itself cannot distinguish between a completely new file, or one with new bytes. In dvs, we only have current hash,
+so we should consider adding this context via the user, i.e. by asking if it is an addition / amendment.
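
Note on PATCH 01/28: the prefix comparison described for `amend = TRUE` could be sketched in shell roughly as below. This is an illustrative sketch only, not dvs code. `is_amendment` is a hypothetical helper, `sha256sum` stands in for the blake3 checksum, and it assumes dvs would record the previous version's byte size next to its hash (the patch itself notes that dvs currently keeps only the hash).

```shell
#!/bin/sh
# Hypothetical sketch, not part of dvs: decide whether a new file is an
# append-only extension of a previously tracked version.
# Assumptions: the previous version's hash and byte size were recorded,
# and sha256sum (GNU coreutils) stands in for blake3.

is_amendment() {
    stored_hash=$1   # hash of the previously added version
    stored_size=$2   # size in bytes of the previously added version
    new_file=$3      # candidate file on disk

    new_size=$(wc -c < "$new_file" | tr -d ' ')
    # A file shorter than the old version cannot be an extension of it.
    [ "$new_size" -ge "$stored_size" ] || return 1

    # Hash only the first stored_size bytes of the new file and compare.
    prefix_hash=$(head -c "$stored_size" "$new_file" | sha256sum | cut -d' ' -f1)
    [ "$prefix_hash" = "$stored_hash" ]
}
```

A call such as `is_amendment "$old_hash" "$old_size" data/registry/participants.csv` returns 0 only when the old bytes are still an exact prefix of the file, which is the condition under which the new add event could supersede the earlier one.
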
From 9198416591cee3c68c8ec89a7000fc04cf0eb7f6 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 18:54:16 +0100
Subject: [PATCH 02/28] added a section on non-repository targets

---
 ui/initialization.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/ui/initialization.md b/ui/initialization.md
index 21ed288..4cbf23b 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -36,12 +36,19 @@ Options:

 ## R function

-
-
 ```r
 dvs_init <- function(directory = ".", permissions = NULL, group = NULL, metadata_folder_name = NULL)
 ```

+### Case: No project or specific work directory
+
+Considering the one off scripts that scientists might create, in which there is
+no project surrounding where said script is.
+
+- User/machine storage
+- A remote project
+- One off scripts
+
 ## Journey 1: Initial Setup with defaults

 Expected outcomes:

From 904a835afed82cd5d619734a49632ed98c5c7d45 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 18:54:30 +0100
Subject: [PATCH 03/28] adding files to remote repositories

---
 ui/journey-2-adding-data-files.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/ui/journey-2-adding-data-files.md b/ui/journey-2-adding-data-files.md
index 86a9cb8..b1edf74 100644
--- a/ui/journey-2-adding-data-files.md
+++ b/ui/journey-2-adding-data-files.md
@@ -58,3 +58,19 @@ Goal: Version a newly created dataset so others can retrieve it.
 ```r
 dvs_status("data/derived/pk_data.csv")
 ```
+
+```r
+# no dvs_init ran before
+dvs_add("contingency_table2.csv")
+```
+
+In RStudio: Check if there is an active project, then emit
+
+```r
+> Error: There is no active dvs repository in current location;
+> Use `remote_repository` parameter
+```
+
+```r
+dvs_add("contingency_table2.csv", remote_repository = "~/dvs/projectA/")
+```

From 0d2a42c248b66c0fa05af89b1760fe0bfc88c0df Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 18:54:45 +0100
Subject: [PATCH 04/28] formatting

---
 ui/initialization.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/ui/initialization.md b/ui/initialization.md
index 4cbf23b..7c79143 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -53,12 +53,12 @@ no project surrounding where said script is.

 Expected outcomes:

-* dvs.toml created in working directory
-* shared dir created in specified path, with default permissions of 664
+- dvs.toml created in working directory
+- shared dir created in specified path, with default permissions of 664

 Known Caveats:

-* certain linux umask setups cause folders to have default permissions like 600, or 644
+- certain linux umask setups cause folders to have default permissions like 600, or 644
   where other collaborators could not write by default, therefore,

 ### CLI flow
@@ -79,17 +79,17 @@ dvs_init("/data/shared/project-x-dvs")

 ## Journey 2: Initial Setup with shared folder locked down to group

-* set permissions to writeable by group, not readable if not in group (660)
-* group name projx
+- set permissions to writeable by group, not readable if not in group (660)
+- group name projx

 Expected outcomes:

-* dvs.toml created in working directory
-* shared dir created in specified path, with permissions of 660 and owned by group projx
+- dvs.toml created in working directory
+- shared dir created in specified path, with permissions of 660 and owned by group projx

 Edge cases:

-* group must resolve to
known gid on system
+- group must resolve to known gid on system

 ### CLI flow

From a3c5adee73c59253b18b035c11bf7fcb4996ed32 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 18:58:06 +0100
Subject: [PATCH 05/28] dvs-init: location created

---
 ui/initialization.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/ui/initialization.md b/ui/initialization.md
index 7c79143..925757b 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -34,12 +34,28 @@ Options:
   Print help
 ```

+Example output:
+
+```shell
+$ dvs init
+DVS Repository created
+```
+
 ## R function

 ```r
 dvs_init <- function(directory = ".", permissions = NULL, group = NULL, metadata_folder_name = NULL)
 ```

+Example output:
+
+```r
+> dvs_init("~/Documents/projectA")
+> A DVS repository was initialised in "/Users/elea/Documents/projectA"
+```
+
+CLI users do not need the full path shown to them, but R users need that information.
+
 ### Case: No project or specific work directory

 Considering the one off scripts that scientists might create, in which there is

From 63f7ebae80f7489c6b3982aef93b7640ad31f23e Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 19:42:32 +0100
Subject: [PATCH 06/28] added notion of multiple projects for the same storage location

---
 ui/initialization.md | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/ui/initialization.md b/ui/initialization.md
index 925757b..32faa4b 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -3,12 +3,12 @@
 Goal: Prepare shared storage and initialize DVS in directory

 dvs initialization will create a `dvs.toml` and a directory as specified by the
-shared area in the init command. The shared dir may also need to chown the directory
+shared area in the init command. The shared dir may also need to `chown` the directory
 to specify certain permissions.
For example, for sensitive projects, setting
 ownership to a particular group, allowing write access for the group, and limiting
 read access to those not in the group.

-## cli
+## CLI

 ```default
 dvs init
@@ -44,18 +44,34 @@ DVS Repository created
 ## R function

 ```r
-dvs_init <- function(directory = ".", permissions = NULL, group = NULL, metadata_folder_name = NULL)
+dvs_init <- function(
+  path = ".",
+  storage_directory = getOption("dvs.global_storage") %||%
+  stop("must provide a storage location"),
+  permissions = NULL,
+  group = NULL,
+  metadata_folder_name = NULL)
 ```

 Example output:

 ```r
 > dvs_init("~/Documents/projectA")
-> A DVS repository was initialised in "/Users/elea/Documents/projectA"
+> Error: `storage_path` is missing; Please provide a location to store dvs objects.
+```
+
+```r
+> dvs_init("~/Documents/projectA", storage_directory = "~/Documents/dvs_storage")
+> A DVS repository was initialised in "/Users/elea/Documents/projectA" with storage location at "/Users/elea/Documents/dvs_storage"
 ```

 CLI users do not need the full path shown to them, but R users need that information.

+## Storage
+
+- Multiple projects can be hosted within the same storage
+  - DVS storage locations should contain a list of projects it is currently serving.
+
 ### Case: No project or specific work directory

 Considering the one off scripts that scientists might create, in which there is
@@ -69,12 +85,12 @@ no project surrounding where said script is.
 Expected outcomes:

-- dvs.toml created in working directory
+- `dvs.toml` created in working directory
 - shared dir created in specified path, with default permissions of 664

 Known Caveats:

-- certain linux umask setups cause folders to have default permissions like 600, or 644
+- certain linux `umask` setups cause folders to have default permissions like 600, or 644
   where other collaborators could not write by default, therefore,

 ### CLI flow

From 7a3619398ec0dd36995aac7447651fe626d5f2b7 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 19:46:14 +0100
Subject: [PATCH 07/28] started dvs status specification

---
 ui/status.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 ui/status.md

diff --git a/ui/status.md b/ui/status.md
new file mode 100644
index 0000000..ee1b753
--- /dev/null
+++ b/ui/status.md
@@ -0,0 +1,12 @@
+# `dvs` status
+
+```r
+dvs_status <- function(
+  path = ".",
+  show_storage = FALSE,
+)
+```
+
+- `show_storage`:
+  - Show location of storage(s) for the current dvs repository.
+  - Show number of projects that the storage contains

From c05795152534074ab889edbe5c88ae01b6a76e08 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 19:52:29 +0100
Subject: [PATCH 08/28] tracking of file formats or paths

---
 ui/initialization.md |  8 ++++++++
 ui/status.md         | 11 +++++++++++
 2 files changed, 19 insertions(+)

diff --git a/ui/initialization.md b/ui/initialization.md
index 32faa4b..8f83137 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -138,3 +138,11 @@ dvs init /data/dvs/sensitive-projx --permissions "660" --group projx
 ```r
 dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx")
 ```
+
+## Data formats to track
+
+- `.csv`
+- `.rds`
+- don't track `.RDA` files, as they are a collection of datasets
+
+Configuration: Must add these filters to the `dvs.toml`.
diff --git a/ui/status.md b/ui/status.md
index ee1b753..61be74f 100644
--- a/ui/status.md
+++ b/ui/status.md
@@ -10,3 +10,14 @@ dvs_status <- function(
 - `show_storage`:
   - Show location of storage(s) for the current dvs repository.
   - Show number of projects that the storage contains
+
+## Data name format
+
+`dvs_status` should show untracked data files in the current dvs repository, if
+tracking is specified.
+
+`dvs_track(".csv")`: tracks all CSV files.
+
+`dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked files)
+
+`dvs_track("results/*.rds")`: glob on all r data that are saved in a specific directory.

From 7ad1c7c81e613146cdeebfa71a9cc7a9feb5d4e9 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Mon, 16 Feb 2026 20:43:30 +0100
Subject: [PATCH 09/28] marked things (future) due to conversations with devin on slack

---
 ui/initialization.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/ui/initialization.md b/ui/initialization.md
index 8f83137..352b5a7 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -69,7 +69,7 @@ CLI users do not need the full path shown to them, but R users need that informa

 ## Storage

-- Multiple projects can be hosted within the same storage
+- (future) Multiple projects can be hosted within the same storage
   - DVS storage locations should contain a list of projects it is currently serving.

 ### Case: No project or specific work directory
@@ -77,9 +77,9 @@ CLI users do not need the full path shown to them, but R users need that informa
 Considering the one off scripts that scientists might create, in which there is
 no project surrounding where said script is.
-- User/machine storage
-- A remote project
-- One off scripts
+- (future) User/machine storage
+- (future) A remote project
+- (future) One off scripts

 ## Journey 1: Initial Setup with defaults

@@ -146,3 +146,7 @@ dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx")
 - don't track `.RDA` files, as they are a collection of datasets

 Configuration: Must add these filters to the `dvs.toml`.
+
+Known annoyance: Verbosity of this can be annoying.
+There should be a way to reduce outputs on untracked data files available
+to the user.

From 6a62ac2dcdadaf785b30362f73abb5f83f4cb805 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Tue, 17 Feb 2026 11:24:46 +0100
Subject: [PATCH 10/28] drafting a bunch of specs

---
 ui/add.md            | 31 +++++++++++++++++++++++++++
 ui/audit.md          | 39 +++++++++++++++++++++++++++++++++
 ui/dvs_last.md       | 10 +++++++++
 ui/get.md            | 21 ++++++++++++++++++
 ui/initialization.md |  1 +
 ui/log.md            | 51 ++++++++++++++++++++++++++++++++++++++++++++
 ui/revert.md         | 20 +++++++++++++++++
 ui/status.md         | 18 +++++++++++++---
 ui/sync.md           | 48 +++++++++++++++++++++++++++++++++++++++++
 9 files changed, 236 insertions(+), 3 deletions(-)
 create mode 100644 ui/add.md
 create mode 100644 ui/audit.md
 create mode 100644 ui/dvs_last.md
 create mode 100644 ui/get.md
 create mode 100644 ui/log.md
 create mode 100644 ui/revert.md
 create mode 100644 ui/sync.md

diff --git a/ui/add.md b/ui/add.md
new file mode 100644
index 0000000..7c44c68
--- /dev/null
+++ b/ui/add.md
@@ -0,0 +1,31 @@
+# `dvs add`
+
+Goal: Add files to an initialised dvs repository.
+
+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_add <- function(path = ".",
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  overwrite = FALSE,
+  fail = FALSE
+)
+```
+
+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
+
+## Compression
+
+If the added file exceeds a certain threshold, the
+R package should provide suggest compressing the recently added file.
+
+- [ ] `getOption(dvs.large_file_size = integer()`)
diff --git a/ui/audit.md b/ui/audit.md
new file mode 100644
index 0000000..20b7ed9
--- /dev/null
+++ b/ui/audit.md
@@ -0,0 +1,39 @@
+# `dvs audit`
+
+Goal: Provide a repository wide log of dvs tracked files.
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+$ dvs audit
+[Date] [User] [+{files} -{files}] [Message]
+```
+
+```sh
+$ dvs audit --since
+```
+
+## R
+
+Signature:
+
+```r
+dvs_audit <- function(path = ".",
+  since = NULL, # date | duration (unit)
+  by_user = character())
+
+```
+
+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
+
+```r
+dvs_audit()
+
+```
+
+```r
+dvs_audit(since = NULL)
+```
diff --git a/ui/dvs_last.md b/ui/dvs_last.md
new file mode 100644
index 0000000..cd0a5e9
--- /dev/null
+++ b/ui/dvs_last.md
@@ -0,0 +1,10 @@
+# `dvs_last`
+
+Goal: provide users with the ability to retrieve the result of
+the last executed dvs command within the r package.
+
+Example: Suppose after `dvs_add(by_folder = "data/derived/*")` was executed
+an error occurred, and an overview is displayed as a data-frame. The user
+got a R native result, a data-frame, but if the user wants to act on the
+provided information, we might want to provide a `dvs_last` that contains
+miscellaneous.
diff --git a/ui/get.md b/ui/get.md
new file mode 100644
index 0000000..69c78c1
--- /dev/null
+++ b/ui/get.md
@@ -0,0 +1,21 @@
+# `dvs get`
+
+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_get <- function(path = ".",
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  fail = FALSE # follows fs::dir_ls
+)
+```
+
+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
\ No newline at end of file
diff --git a/ui/initialization.md b/ui/initialization.md
index 352b5a7..94db43e 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -14,6 +14,7 @@ read access to those not in the group.
 dvs init
 Starts a new dvs project. This will create a `dvs.toml` file in the root folder of where the user is calling the CLI from. root folder being the place where we find a `.git` folder

+```shell
 Usage: dvs init [OPTIONS]

 Arguments:
diff --git a/ui/log.md b/ui/log.md
new file mode 100644
index 0000000..e6f2f06
--- /dev/null
+++ b/ui/log.md
@@ -0,0 +1,51 @@
+# `dvs log`
+
+Per file logging is inspected via `dvs log` / `dvs_log()`. For project-wide logging, we have `dvs audit` / `dvs_audit()`.
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+# in a previously `dvs init` folder
+$ dvs log data/derived/model_summary.txt
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+```
+
+```sh
+$ dvs log --interval
+[date -- duration since now]
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+
+Last edited on: 20-10-2020
+checksum: NNNNN
+message: "Ran nonmem model on exposure assumptions"
+[date -- 2x duration since now]
+...
+[date -- 3x duration since now]
+...
+```
+
+``: `days`, `weeks`, `months`
+
+## R
+
+Signature:
+
+```r
+dvs_log <- function(path = ".",
+  since = NULL,
+  by_user = NULL,
+)
+```
+
+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
diff --git a/ui/revert.md b/ui/revert.md
new file mode 100644
index 0000000..fed49a8
--- /dev/null
+++ b/ui/revert.md
@@ -0,0 +1,20 @@
+# `dvs revert`
+
+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
+```r
+dvs_revert <- function(path = ".",
+  commit_sha = integer(),
+  date = NULL,
+  before = NULL, # date | duration
+)
+```
+
+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
diff --git a/ui/status.md b/ui/status.md
index 61be74f..8269ea2 100644
--- a/ui/status.md
+++ b/ui/status.md
@@ -1,15 +1,27 @@
 # `dvs` status

+## CLI
+
+The option to return `--json` must be present.
+
+## R
+
+Signature:
+
 ```r
 dvs_status <- function(
-  path = ".",
-  show_storage = FALSE,
+    path = ".",
+    show_storage = FALSE,
+    by_folder = character(),
+    by_user = character(),
 )
 ```

+- `path` is location of dvs repository; the `dvs.toml` has to be present
+  in an ancestor to `path`.
 - `show_storage`:
   - Show location of storage(s) for the current dvs repository.
-  - Show number of projects that the storage contains
+  - (future) Show number of projects that the storage contains

 ## Data name format

diff --git a/ui/sync.md b/ui/sync.md
new file mode 100644
index 0000000..3d722ca
--- /dev/null
+++ b/ui/sync.md
@@ -0,0 +1,48 @@
+# `dvs sync`
+
+Goal: Provide a streamlined way to update a cloned dvs repository.
+
+Syncronisation `sync` is an alias for `dvs get **/*`, meant as a
+repository wide syncing from storage (local/remote).
+
+## CLI
+
+The option to return `--json` must be present.
+
+```sh
+$ dvs sync
+[status] [Last modified] [Message]
+... ... ...
+```
+
+The sync subcommand should also be able to act as a repository revert,
+and
+
+```sh
+$ dvs sync --before
+[status] [Last modified] [Message]
+... ... ...
+```
+
+## R
+
+Signature:
+
+```r
+dvs_sync <- function(path = ".",
+  by_folder = character(),
+  since = NULL, # date | duration (unit)
+  recurse = TRUE ,
+)
+```
+
+- `path` is a location within a dvs repository.
+  Not necessarily the root a dvs repository.
+- `by_folder` allows to sync specific folders only
+- `recurse` is whether to sync folders recursively
+
+### `recurse`
+
+When there is no `by_folder`, recurse will update the entire dvs repository, even if
+current directory is a sub-directory in a dvs repository. The current location of the
+user might be incidental to their intent with dvs.

From d299cf2216fcb7bbb64b9417f77bbefd88558229 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Tue, 17 Feb 2026 12:20:03 +0100
Subject: [PATCH 11/28] typo

---
 ui/add.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ui/add.md b/ui/add.md
index 7c44c68..18d3f64 100644
--- a/ui/add.md
+++ b/ui/add.md
@@ -1,6 +1,6 @@
 # `dvs add`

-Goal: Add files to an initialised dvs repository.
+Goal: Add files to an initialized dvs repository.

 ## CLI


From e4804ac9fc23986f7947111a300e14eb8d71965e Mon Sep 17 00:00:00 2001
From: Mossa
Date: Tue, 17 Feb 2026 12:50:45 +0100
Subject: [PATCH 12/28] updated ui design specs

---
 ui/add.md            |  4 ++++
 ui/alias_git.md      |  5 +++++
 ui/initialization.md |  7 +++++++
 ui/message.md        | 23 +++++++++++++++++++++++
 ui/remote_storage.md | 22 ++++++++++++++++++++++
 ui/root.md           | 28 ++++++++++++++++++++++++++++
 ui/status.md         | 19 +++++++++++++++++++
 7 files changed, 108 insertions(+)
 create mode 100644 ui/alias_git.md
 create mode 100644 ui/message.md
 create mode 100644 ui/remote_storage.md
 create mode 100644 ui/root.md

diff --git a/ui/add.md b/ui/add.md
index 18d3f64..4d280e1 100644
--- a/ui/add.md
+++ b/ui/add.md
@@ -2,6 +2,10 @@

 Goal: Add files to an initialized dvs repository.

+- [ ] Currently the `message` is attached to all files checked in simultaneously. dvs has a log and audit log to
+illuminate "why" a change occurred in the data.
+
+
 ## CLI

 The option to return `--json` must be present.
diff --git a/ui/alias_git.md b/ui/alias_git.md
new file mode 100644
index 0000000..c248ab9
--- /dev/null
+++ b/ui/alias_git.md
@@ -0,0 +1,5 @@
+# Alias dvs with get terminology
+
+- [ ] (future?)
Should dvs-cli and dvs-rpkg have a --git-mode, where we
+expose a git compatible interface to dvs, in order to
+plug-in dvs as a git replacement?
diff --git a/ui/initialization.md b/ui/initialization.md
index 94db43e..2ff1638 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -8,6 +8,13 @@ to specify certain permissions. For example, for sensitive projects, setting
 ownership to a particular group, allowing write access for the group, and limiting
 read access to those not in the group.

+## User site assumptions
+
+- [ ] Always within operating within a repository/project/workspace.
+- a dvs repository need not fall under a git or any other vcs repository
+- storage is detached from repository root
+
+
 ## CLI

 ```default
diff --git a/ui/message.md b/ui/message.md
new file mode 100644
index 0000000..5734343
--- /dev/null
+++ b/ui/message.md
@@ -0,0 +1,23 @@
+# `dvs_message`
+
+Goal: Add messages to files without re-hashing or replacing them.
+
+## CLI
+
+```sh
+$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repititons"
+Added message to `data/model_aaabb/model_summary.csv`
+```
+
+## R package
+
+```r
+dvs_message <- function(path = ".",
+  files = character(),
+  glob = character(),
+  ignore.case = NULL %||% !is.empty(glob),
+  fail = FALSE
+)
+```
+
+`dvs_message` is a equivalent to an idempotent `dvs_add`-call.
diff --git a/ui/remote_storage.md b/ui/remote_storage.md
new file mode 100644
index 0000000..7fe6678
--- /dev/null
+++ b/ui/remote_storage.md
@@ -0,0 +1,22 @@
+# dvs supported remote storage
+
+## A2-AI hosting
+
+storagemagic/Dumbledore/A2-AI Cloud
+
+## Custom DVS storage
+
+- [ ] (future) dvs server hosted by client.
+
+## Third party storage hosting
+
+### AMazon FSx
+
+
+
+### S3
+
+### Sharepoint
+
+###
+
diff --git a/ui/root.md b/ui/root.md
new file mode 100644
index 0000000..2d094f6
--- /dev/null
+++ b/ui/root.md
@@ -0,0 +1,28 @@
+# `dvs root`
+
+Goal: return the location of the dvs repository root anywhere.
+
+## CLI
+
+Not relevant.
+
+## R package
+
+Signature:
+
+```r
+# note: no `path` parameter, always assume current directory
+dvs_root <- function(...)
+# alias
+find_dvs_root <- dvs_root()
+```
+
+Convenience:
+
+```r
+dvs_root("model_code")
+# equivalent to
+fs::join(dvs_root(), "model_code")
+# or
+file.path(dvs_root(), "model_code")
+```
diff --git a/ui/status.md b/ui/status.md
index 8269ea2..444df28 100644
--- a/ui/status.md
+++ b/ui/status.md
@@ -23,6 +23,25 @@ dvs_status <- function(
   - Show location of storage(s) for the current dvs repository.
   - (future) Show number of projects that the storage contains

+## Return format
+
+### CLI JSON format
+
+
+
+### R format
+
+Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum`
+
+Proposed format:
+
+- `absolute_path`: abbreviated when printed in R (pillar)
+- `relative path`: full path
+- `status`: ordered factor instead of `character()`
+  - `absent|unsync|sync|present|added`
+- `checksum`: always abbreviated in print (pillar, first 5 characters)
+- `size`: using units and not raw `double()/numeric()`
+
 ## Data name format

 `dvs_status` should show untracked data files in the current dvs repository, if

From b179aa27d8f0f2f5cec42455b24259952eb8b893 Mon Sep 17 00:00:00 2001
From: Mossa
Date: Tue, 17 Feb 2026 13:15:50 +0100
Subject: [PATCH 13/28] work in progress: common status

---
 ui/enum_status.md    | 32 ++++++++++++++++++++++++++++++++
 ui/get.md            |  2 +-
 ui/initialization.md | 43 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 74 insertions(+), 3 deletions(-)
 create mode 100644 ui/enum_status.md

diff --git a/ui/enum_status.md b/ui/enum_status.md
new file mode 100644
index
0000000..d7fc405
--- /dev/null
+++ b/ui/enum_status.md
@@ -0,0 +1,32 @@
+# Configuration: Status
+
+  relative_path: relative path to the file with respect to where the operation was called
+
+  status: (doesn’t include error status)
+
+  current: the file is present in the project directory and matches the version in the storage directory
+
+  absent: the file isn't present in the project directory
+
+  unsynced: the file is present in the project directory, but doesn't match the version on in the storage directory
+
+  file_size_bytes: current size of the file in bytes
+
+  time_stamp: the ISO 8601 Zulu time of the most recent file version in the storage directory
+
+  saved_by: the user who uploaded the most recent file version in the storage directory
+
+  message: the message inputted to the dvs_add command that added the most recent file version in the storage directory
+
+  blake3_checksum: hash of the file via the blake3 algorithm
+
+  absolute_path: canonicalized path of the file
+  input:
+
+  If inputted explicitly via file glob or path: the file name
+
+  if inputted implicitly via dvs_status() (without input): NA
+
+error: if the outcome was error, the error type, else NA
+
+error message: if the outcome was error, the error message (if there was one), else NA
diff --git a/ui/get.md b/ui/get.md
index 69c78c1..b0e0454 100644
--- a/ui/get.md
+++ b/ui/get.md
@@ -18,4 +18,4 @@ dvs_get <- function(path = ".",
 ```

 - `path` is location of dvs repository; the `dvs.toml` has to be present
-  in an ancestor to `path`.
\ No newline at end of file
+  in an ancestor to `path`.
diff --git a/ui/initialization.md b/ui/initialization.md
index 2ff1638..79e9206 100644
--- a/ui/initialization.md
+++ b/ui/initialization.md
@@ -1,4 +1,4 @@
-# dvs initialization
+# dvs initialization / `dvs init` / `dvs_init`

 Goal: Prepare shared storage and initialize DVS in directory

@@ -14,7 +14,6 @@ read access to those not in the group.
 - a dvs repository need not fall under a git or any other vcs repository
 - storage is detached from repository root

-
 ## CLI

 ```default
@@ -147,6 +146,20 @@ dvs init /data/dvs/sensitive-projx --permissions "660" --group projx
 dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx")
 ```

+#### Returns
+
+- [ ] implement `split_output` or do we rely on the user being familiar with dplyr?
+
+Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`.
+
+- [ ] New format:
+  - `absolute_path`: abbreviated when printed in R (pillar)
+  - `relative path`: full path
+  - `status`: ordered factor instead of `character()`
+    - `absent|unsync|sync|present|added`
+  - `checksum`: always abbreviated in print (pillar, first 5 characters)
+  - `size`: using units and not raw `double()/numeric()`
+
 ## Data formats to track

 - `.csv`
@@ -158,3 +171,29 @@ Configuration: Must add these filters to the `dvs.toml`.
 Known annoyance: Verbosity of this can be annoying.
 There should be a way to reduce outputs on untracked data files available
 to the user.
+
+# TODO (to be editted)
+
+Errors
+
+dvs_init could return any of the following error types:
+
+project already initialized: dvs_init has already been run with different initialization attributes.
+
+git repository not found: dvs_init was run outside of a git repository
+
+storage directory input is not a directory: if input was an existing file
+
+storage directory absolute path not found: if the path could not be made absolute
+
+configuration file not created (dvs.yaml): failed to write to or save dvs.yaml
+
+linux primary group not found: if the group was inputted and it doesn't refer to a valid group
+
+storage directory not created: failed to create the storage directory
+
+linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions
+
+could not check if storage directory is empty: error reading the contents of the directory
+
+storage directory permissions not set: couldn't modify the permissions of the storage directory

From 84b939d48795bdd60f7843107e7db8e3d25835ce Mon Sep 17 00:00:00 2001
From: Mossa
Date: Wed, 18 Feb 2026 10:45:21 +0100
Subject: [PATCH 14/28] revisions

---
 ui/add.md            | 18 ++++++----
 ui/audit.md          |  2 +-
 ui/configuration.md  |  5 +++
 ui/delete.md         | 29 +++++++++++++++
 ui/enum_status.md    |  7 ++++
 ui/follow.md         | 19 ++++++++++
 ui/initialization.md | 85 ++++++++++++++++++++++++++++++++++++--------
 ui/log.md            |  2 +-
 ui/message.md        |  2 +-
 ui/status.md         |  4 +++
 ui/trace.md          |  0
 ui/tracking.md       | 19 ++++++++++
 12 files changed, 168 insertions(+), 24 deletions(-)
 create mode 100644 ui/configuration.md
 create mode 100644 ui/delete.md
 create mode 100644 ui/follow.md
 create mode 100644 ui/trace.md
 create mode 100644 ui/tracking.md

diff --git a/ui/add.md b/ui/add.md
index 4d280e1..2715da1 100644
--- a/ui/add.md
+++ b/ui/add.md
@@ -2,20 +2,23 @@

 Goal: Add files to an initialized dvs repository.

-- [ ] Currently the `message` is attached to all files checked in simultaneously. dvs has a log and audit log to
-illuminate "why" a change occurred in the data.
-
+- [ ] Currently the `message` is attached to all files checked in simultaneously.
+  dvs has a log and audit log to
+  illuminate "why" a change occurred in the data.
+
+

 ## CLI

-The option to return `--json` must be present.
+- Assume that current directory is a dvs repository, both in cli and R-package.
+- The option to return `--json` must be present.

 ## R

 Signature:

 ```r
-dvs_add <- function(path = ".",
+dvs_add <- function(
   files = character(),
   glob = character(),
   ignore.case = NULL %||% !is.empty(glob),
@@ -24,8 +27,7 @@ dvs_add <- function(path = ".",
 )
 ```

-- `path` is location of dvs repository; the `dvs.toml` has to be present
-  in an ancestor to `path`.
+

 ## Compression

@@ -33,3 +35,5 @@ If the added file exceeds a certain threshold, the
 R package should provide suggest compressing the recently added file.

 - [ ] `getOption(dvs.large_file_size = integer()`)
+  - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size)
+  - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size)
diff --git a/ui/audit.md b/ui/audit.md
index 20b7ed9..962a7d0 100644
--- a/ui/audit.md
+++ b/ui/audit.md
@@ -20,7 +20,7 @@ $ dvs audit --since
 Signature:

 ```r
-dvs_audit <- function(path = ".",
+dvs_audit <- function(
   since = NULL, # date | duration (unit)
   by_user = character())

diff --git a/ui/configuration.md b/ui/configuration.md
new file mode 100644
index 0000000..7649cc2
--- /dev/null
+++ b/ui/configuration.md
@@ -0,0 +1,5 @@
+# `dvs.toml`
+
+
+Configuration should track which patterns are tracked [DVS Tracking](./tracking.md).
+
diff --git a/ui/delete.md b/ui/delete.md
new file mode 100644
index 0000000..d9c4687
--- /dev/null
+++ b/ui/delete.md
@@ -0,0 +1,29 @@
+# `dvs delete`
+
+Goal: Remove tracked files.
+ +## CLI + +```shell +$ dvs delete + +``` + +## R package + +```r +dvs_delete <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE +) +``` + +Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`. + +- `files`: list of files that are to be deleted. + +### Non-existing files + +Emit a warning, but still remove the files that do exist and are tracked. diff --git a/ui/enum_status.md b/ui/enum_status.md index d7fc405..a051258 100644 --- a/ui/enum_status.md +++ b/ui/enum_status.md @@ -1,5 +1,12 @@ # Configuration: Status +- current | absent | unsynced +- tracked file that is un-added + + + +# TODO (editting needed) + relative_path: relative path to the file with respect to where the operation was called status: (doesn’t include error status) diff --git a/ui/follow.md b/ui/follow.md new file mode 100644 index 0000000..f87c953 --- /dev/null +++ b/ui/follow.md @@ -0,0 +1,19 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. + +User journey: + +- [ ] All the .csv files underneath a specific directory. +- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + +- [ ] MOSSA: We may want to not track too large files, even if they are .csv +- [ ] pre-hook 100mb limit see template-PMx-project-starter + +Cloned repositories do not have hooks! + +- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! +- [ ] \ No newline at end of file diff --git a/ui/initialization.md b/ui/initialization.md index 79e9206..b054dc6 100644 --- a/ui/initialization.md +++ b/ui/initialization.md @@ -10,21 +10,80 @@ read access to those not in the group. ## User site assumptions -- [ ] Always within operating within a repository/project/workspace. -- a dvs repository need not fall under a git or any other vcs repository -- storage is detached from repository root +- Always operating within a repository/project/workspace. 
+- A dvs repository need not fall under a git or any other vcs repository +- Storage is detached from repository root ## CLI -```default +```shell +dvs --- Data version control and storage management system + +Usage: + dvs [OPTIONS] + +Commands: + init + add + get + status + audit + log + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) + --version Show version information +``` + +The initialization command will have further subcommands. + +```shell +dvs init --- Initialize a new DVS repository + +Usage: + dvs init [OPTIONS] + +Backends: + local Local, on-disk storage + fs File system storage (e.g. network file system (nfs)) + s3 S3 compatible storage + aws S3 hosted via AWS + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) +``` + +### Local + +```shell +dvs init local --- Initialize a DVS repository via on-disk storage + +Usage: + dvs init local [OPTIONS] + +Required: + path to the local storage locations (e.g. `/data/`) + +``` + +## FS / NFS + + +```shell dvs init Starts a new dvs project. This will create a `dvs.toml` file in the root folder of where the user is calling the CLI from. 
root folder being the place where we find a `.git` folder ```shell -Usage: dvs init [OPTIONS] +Usage: dvs init Arguments: - Where the data will be stored + +```shell +Usage: dvs init local [OPTIONS] + +Arguments: + + Where the data will be stored Options: --json @@ -44,20 +103,18 @@ Options: Example output: ```shell -$ dvs init -DVS Repository created +$ dvs init /data/ +DVS Repository created with storage path located at ``` ## R function ```r dvs_init <- function( - path = ".", - storage_directory = getOption("dvs.global_storage") %||% - stop("must provide a storage location"), - permissions = NULL, - group = NULL, - metadata_folder_name = NULL) + storage_path = character(), # required + permissions = NULL, + group = NULL, + metadata_folder_name = NULL) ``` Example output: diff --git a/ui/log.md b/ui/log.md index e6f2f06..5708736 100644 --- a/ui/log.md +++ b/ui/log.md @@ -41,7 +41,7 @@ message: "Ran nonmem model on exposure assumptions" Signature: ```r -dvs_log <- function(path = ".", +dvs_log <- function( since = NULL, by_user = NULL, ) diff --git a/ui/message.md b/ui/message.md index 5734343..0024275 100644 --- a/ui/message.md +++ b/ui/message.md @@ -12,7 +12,7 @@ Added message to `data/model_aaabb/model_summary.csv` ## R package ```r -dvs_message <- function(path = ".", +dvs_message <- function( files = character(), glob = character(), ignore.case = NULL %||% !is.empty(glob), diff --git a/ui/status.md b/ui/status.md index 444df28..f338645 100644 --- a/ui/status.md +++ b/ui/status.md @@ -1,5 +1,9 @@ # `dvs` status +- [ ] MOSSA: don't show git status for certain folders until models have finished + running. +- [ ] git hooks alert + ## CLI The option to return `--json` must be present. 
diff --git a/ui/trace.md b/ui/trace.md new file mode 100644 index 0000000..e69de29 diff --git a/ui/tracking.md b/ui/tracking.md new file mode 100644 index 0000000..f87c953 --- /dev/null +++ b/ui/tracking.md @@ -0,0 +1,19 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. + +User journey: + +- [ ] All the .csv files underneath a specific directory. +- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + +- [ ] MOSSA: We may want to not track too large files, even if they are .csv +- [ ] pre-hook 100mb limit see template-PMx-project-starter + +Cloned repositories do not have hooks! + +- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! +- [ ] \ No newline at end of file From 4bd43cce2709280f9d555cc7044e4be5b285f567 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 11:27:58 +0100 Subject: [PATCH 15/28] changes --- ui/add.md | 10 ++++-- ui/initialization.md | 72 +++++++++++++++++++++----------------------- 2 files changed, 42 insertions(+), 40 deletions(-) diff --git a/ui/add.md b/ui/add.md index 2715da1..23d9702 100644 --- a/ui/add.md +++ b/ui/add.md @@ -27,8 +27,6 @@ dvs_add <- function( ) ``` - - ## Compression If the added file exceeds a certain threshold, the @@ -37,3 +35,11 @@ R package should provide suggest compressing the recently added file. - [ ] `getOption(dvs.large_file_size = integer()`) - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) + +Advice compression when + +- a single size exceeds size thresholds +- a directory of files exceeds size thresholds + +There are cases where individual files are not large, but the collection of files +starts to amount to a large amount, presumably too large to track. 
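The compression advice added to `ui/add.md` above (single file or whole directory exceeding a threshold) can be sketched as follows. This is a minimal illustration, assuming the 50 MB soft and 100 MB hard limits cited from the PMx project template. The helper name `advise_compression()` and its arguments are hypothetical and not part of dvs; only base R is used.

```r
# Hypothetical helper, not part of dvs: classify files and their parent
# directories against a soft (warning) and a hard size limit, in bytes.
advise_compression <- function(files,
                               soft_limit = 50 * 1024^2,    # assumed: 50 MB
                               hard_limit = 100 * 1024^2) { # assumed: 100 MB
  sizes <- file.size(files)   # NA for files that do not exist
  sizes[is.na(sizes)] <- 0

  classify <- function(bytes) {
    if (bytes > hard_limit) {
      "hard"
    } else if (bytes > soft_limit) {
      "soft"
    } else {
      "ok"
    }
  }

  per_file <- data.frame(
    path = files,
    size_bytes = sizes,
    level = vapply(sizes, classify, character(1)),
    stringsAsFactors = FALSE
  )

  # Directory totals catch the case where no single file is large,
  # but the collection of files is.
  totals <- tapply(sizes, dirname(files), sum)
  per_dir <- data.frame(
    directory = names(totals),
    size_bytes = as.numeric(totals),
    level = vapply(as.numeric(totals), classify, character(1)),
    stringsAsFactors = FALSE
  )

  list(files = per_file, directories = per_dir)
}
```

Returning both tables lets the caller decide whether to warn per file, per directory, or both; how dvs would surface the advice is left open here.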
diff --git a/ui/initialization.md b/ui/initialization.md index b054dc6..c552e69 100644 --- a/ui/initialization.md +++ b/ui/initialization.md @@ -14,6 +14,10 @@ read access to those not in the group. - A dvs repository need not fall under a git or any other vcs repository - Storage is detached from repository root +- [ ] If `git` is not a requirement, what alternative heuristics do we use + for instantiating a dvs repository? Suggestion: If a `.git` directory is not + available, then take current directory as the choice directory for initialization? + ## CLI ```shell @@ -64,42 +68,27 @@ Usage: Required: path to the local storage locations (e.g. `/data/`) -``` - -## FS / NFS - - -```shell -dvs init -Starts a new dvs project. This will create a `dvs.toml` file in the root folder of where the user is calling the CLI from. root folder being the place where we find a `.git` folder - -```shell -Usage: dvs init - -Arguments: - -```shell -Usage: dvs init local [OPTIONS] - -Arguments: - - Where the data will be stored - Options: - --json - Output results as JSON - --metadata-folder-name - If you want to use a folder name other than `.dvs` for storing the metadata files - --permissions - Unix permissions for storage directory and files (octal, e.g., "770") - --group - Unix group to set on storage directory and files - --no-compression - Disable compression of stored files. Compression defaults to zstd + --json + Output results as JSON + --metadata-folder-name + If you want to use a folder name other than `.dvs` for storing the metadata files + --permissions + Unix permissions for storage directory and files (octal, e.g., "770") + --group + Unix group to set on storage directory and files + --no-compression + Disable compression of stored files. Compression defaults to zstd + --no-compression + Disable compression of stored files. 
Compression defaults to zstd -h, --help Print help ``` +## FS / NFS + + + Example output: ```shell @@ -120,21 +109,27 @@ dvs_init <- function( Example output: ```r -> dvs_init("~/Documents/projectA") +> dvs_init() > Error: `storage_path` is missing; Please provide a location to store dvs objects. ``` ```r -> dvs_init("~/Documents/projectA", storage_directory = "~/Documents/dvs_storage") -> A DVS repository was initialised in "/Users/elea/Documents/projectA" with storage location at "/Users/elea/Documents/dvs_storage" +> dvs_init("/data/projectA_storage") +> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage" ``` CLI users do not need the full path shown to them, but R users need that information. +Different storage backends have to be initialized through specialized functions. + +- `dvs_init_local` with alias `dvs_init` +- `dvs_init_fs(...)` +- `dvs_init_s3(...)` +- `dvs_init_aws(...)` + ## Storage - (future) Multiple projects can be hosted within the same storage - - DVS storage locations should contain a list of projects it is currently serving. ### Case: No project or specific work directory @@ -149,7 +144,7 @@ no project surrounding where said script is. Expected outcomes: -- `dvs.toml` created in working directory +- `dvs.toml` created in the ancestral directory that contains `.git`, or other heuristics. - shared dir created in specified path, with default permissions of 664 Known Caveats: @@ -205,7 +200,8 @@ dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx") #### Returns -- [ ] implement `split_output` or do we rely on the user being familiar with dplyr? +Return a rich data-frame that the end-user can then further subset/filter +to fit their needs. Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`. @@ -229,7 +225,7 @@ Known annoyance: Verbosity of this can be annoying. 
There should be a way to reduce outputs on untracked data files available to the user. -# TODO (to be editted) +# TODO (to be edited) Errors From 1b9447d0047e162db5b853e7309c05fe38412286 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 11:31:42 +0100 Subject: [PATCH 16/28] removed remote_repository comment --- ui/journey-2-adding-data-files.md | 12 ++---------- ui/journey-4-updating-data-files.md | 2 +- 2 files changed, 3 insertions(+), 11 deletions(-) diff --git a/ui/journey-2-adding-data-files.md b/ui/journey-2-adding-data-files.md index b1edf74..94a2ff2 100644 --- a/ui/journey-2-adding-data-files.md +++ b/ui/journey-2-adding-data-files.md @@ -64,13 +64,5 @@ Goal: Version a newly created dataset so others can retrieve it. dvs_add("contingency_table2.csv") ``` -In RStudio: Check if there is an active project, then emit - -```r -> Error: There is no active dvs repository in current location; -> Use `remote_repository` parameter -``` - -```r -dvs_add("contingency_table2.csv", remote_repository = "~/dvs/projectA/") -``` +In RStudio: Check if there is no active folder, then emit warning. +Similarly in VSCode and Positron, as both can be run without an active workspace. diff --git a/ui/journey-4-updating-data-files.md b/ui/journey-4-updating-data-files.md index fdaa574..a96d329 100644 --- a/ui/journey-4-updating-data-files.md +++ b/ui/journey-4-updating-data-files.md @@ -83,7 +83,7 @@ dvs_add("data/registry/participants.csv", "added information from the second bat ``` could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated -to the level of the previous file, and then it can be known if this new event can superseed other add events, because we +to the level of the previous file, and then it can be known if this new event can supersede other add events, because we know it is an addition. The hash itself cannot distinguish between a completely new file, or one with new bytes. 
In dvs, we only have current hash, From 191d55eb7dc01febb73d31053e1835a58265b36c Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 11:58:33 +0100 Subject: [PATCH 17/28] updated --- ui/remote_storage.md | 2 +- ui/root.md | 9 +++++++-- ui/sync.md | 2 +- 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/ui/remote_storage.md b/ui/remote_storage.md index 7fe6678..93694f6 100644 --- a/ui/remote_storage.md +++ b/ui/remote_storage.md @@ -1,4 +1,4 @@ -# dvs supported remote storage +# dvs supported storage backends ## A2-AI hosting diff --git a/ui/root.md b/ui/root.md index 2d094f6..9814ee9 100644 --- a/ui/root.md +++ b/ui/root.md @@ -1,6 +1,8 @@ # `dvs root` -Goal: return the location of the dvs repository root anywhere. +Convenience utility for expert users + +Goal: Return the location of the dvs repository root anywhere. ## CLI @@ -11,7 +13,6 @@ Not relevant. Signature: ```r -# note: no `path` parameter, always assume current directory dvs_root <- function(...) # alias find_dvs_root <- dvs_root() @@ -26,3 +27,7 @@ fs::join(dvs_root(), "model_code") # or file.path(dvs_root(), "model_code") ``` + +The use cases for this function is very limited. We assume heavy use of +`{here}`-package in dvs-based projects. But it could be a relevant convenience +function in certain, specific cases. diff --git a/ui/sync.md b/ui/sync.md index 3d722ca..e4e713b 100644 --- a/ui/sync.md +++ b/ui/sync.md @@ -2,7 +2,7 @@ Goal: Provide a streamlined way to update a cloned dvs repository. -Syncronisation `sync` is an alias for `dvs get **/*`, meant as a +Synchronization `sync` is an alias for `dvs get **/*`, meant as a repository wide syncing from storage (local/remote). 
## CLI From 40d0c027932512d57ea85236ba8885d81e05f706 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 12:56:14 +0100 Subject: [PATCH 18/28] updated files --- ui/audit.md | 3 +-- ui/get.md | 3 +-- ui/log.md | 3 +-- ui/revert.md | 3 +-- ui/status.md | 50 +++++++++++++++++++++++++++++++++++++++++++------- 5 files changed, 47 insertions(+), 15 deletions(-) diff --git a/ui/audit.md b/ui/audit.md index 962a7d0..c005291 100644 --- a/ui/audit.md +++ b/ui/audit.md @@ -26,8 +26,7 @@ dvs_audit <- function( ``` -- `path` is location of dvs repository; the `dvs.toml` has to be present - in an ancestor to `path`. + ```r dvs_audit() diff --git a/ui/get.md b/ui/get.md index b0e0454..8f5741c 100644 --- a/ui/get.md +++ b/ui/get.md @@ -17,5 +17,4 @@ dvs_get <- function(path = ".", ) ``` -- `path` is location of dvs repository; the `dvs.toml` has to be present - in an ancestor to `path`. + diff --git a/ui/log.md b/ui/log.md index 5708736..79fcdfe 100644 --- a/ui/log.md +++ b/ui/log.md @@ -47,5 +47,4 @@ dvs_log <- function( ) ``` -- `path` is location of dvs repository; the `dvs.toml` has to be present - in an ancestor to `path`. + diff --git a/ui/revert.md b/ui/revert.md index fed49a8..7ce26cc 100644 --- a/ui/revert.md +++ b/ui/revert.md @@ -16,5 +16,4 @@ dvs_revert <- function(path = ".", ) ``` -- `path` is location of dvs repository; the `dvs.toml` has to be present - in an ancestor to `path`. + diff --git a/ui/status.md b/ui/status.md index f338645..5eb0a53 100644 --- a/ui/status.md +++ b/ui/status.md @@ -1,13 +1,42 @@ # `dvs` status -- [ ] MOSSA: don't show git status for certain folders until models have finished - running. -- [ ] git hooks alert +Goal: Provide an overview of the changed data files and potential files to track +via the traced data file filters. ## CLI The option to return `--json` must be present. 
+```shell +$ dvs status --help +Status of the DVS repository + +Usage: + dvs status [FILTERS] + +Filters: + --unsynced + --absent +``` + +```sh +dvs status + +Current files: + + +Changed files (unsynced): + new_scenario/model_spec.txt + +Untracked and followed files: + orignal_scenario/model_summary.txt + orignal_scenario/tab-0123.tsv + orignal_scenario/tab-0123b.tsv + orignal_scenario/tab-0123c.tsv +``` + +We do not need to display the user in unsynced files, as they are likely to be owned by the current user. + ## R Signature: @@ -16,15 +45,13 @@ Signature: dvs_status <- function( path = ".", show_storage = FALSE, - by_folder = character(), - by_user = character(), ) ``` -- `path` is location of dvs repository; the `dvs.toml` has to be present - in an ancestor to `path`. - `show_storage`: - Show location of storage(s) for the current dvs repository. + - Warn the user that they must not alter the state of + the storage directory. - (future) Show number of projects that the storage contains ## Return format @@ -51,8 +78,17 @@ Proposed format: `dvs_status` should show untracked data files in the current dvs repository, if tracking is specified. +## Granularity + +We expect the end user to use `{dplyr}` in order to +filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. + +## Following Filters in Status + `dvs_track(".csv")`: tracks all CSV files. `dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked files) `dvs_track("results/*.rds")`: glob on all r data that are saved in a specific directory. + +These should result in additions to `[following]` table in `dvs.toml`. See [Following Formats](tracking.md). 
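To illustrate the granularity note in `ui/status.md` above, here is a sketch of how an end user might narrow a status data-frame with `{dplyr}`. The data is mocked: the column names follow the fields listed in `enum_status.md` (`relative_path`, `status`, `saved_by`), but the actual return value of `dvs_status()` is still under design, so the shape shown here is an assumption.

```r
library(dplyr)

# Mocked stand-in for the data-frame that dvs_status() might return.
status_df <- data.frame(
  relative_path = c(
    "data/derived/pk.csv",
    "data/derived/pd.csv",
    "model/run001/sdtab001"
  ),
  status = c("current", "unsynced", "absent"),
  saved_by = c("user_a", "user_b", "user_a"),
  stringsAsFactors = FALSE
)

# Files under one folder that are not in sync with storage.
out_of_sync <- status_df |>
  filter(startsWith(relative_path, "data/derived/"), status != "current")

# Everything last saved by a given user.
by_user_a <- status_df |>
  filter(saved_by == "user_a")
```

This only works smoothly if every dvs function returns the same column names and types, which is the reason the note asks for consistent data-frames.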
From 7961799ba0860b30366c3872a772785982edf0e7 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 13:51:09 +0100 Subject: [PATCH 19/28] updated --- ui/status.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/ui/status.md b/ui/status.md index 5eb0a53..a5d9eaa 100644 --- a/ui/status.md +++ b/ui/status.md @@ -12,13 +12,27 @@ $ dvs status --help Status of the DVS repository Usage: - dvs status [FILTERS] + dvs status [FILTERS] [OPTIONS] Filters: - --unsynced + --current + --unsynced, --missing --absent + --no-current + --no-unsynced + --no-absent + +Options: + -s, --state filter for states to retain + -i, --invert inverts the selection provided by `--state` + -h, --help + Print help ``` +When a filter is provided, only the selected state(s) are provided. + + + ```sh dvs status From 584301712d200e2176634f37f2b9cdf3faab4f7f Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 14:35:35 +0100 Subject: [PATCH 20/28] update tracking --- ui/tracking.md | 74 +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 67 insertions(+), 7 deletions(-) diff --git a/ui/tracking.md b/ui/tracking.md index f87c953..6d6917e 100644 --- a/ui/tracking.md +++ b/ui/tracking.md @@ -1,8 +1,8 @@ # `dvs track` / `dvs_track` -Goal: Purpose is to specify which files we ought to follow in dvs. +Goal: Purpose is to specify which files we ought to follow in dvs. -User journey: +User journey: - [ ] All the .csv files underneath a specific directory. - [ ] All the .csv files that are less than 25 MB @@ -10,10 +10,70 @@ User journey: - File type - Size filters -- [ ] MOSSA: We may want to not track too large files, even if they are .csv -- [ ] pre-hook 100mb limit see template-PMx-project-starter + -Cloned repositories do not have hooks! + -- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! -- [ ] \ No newline at end of file +## CLI + +```shell +$ dvs follow --help +Files that are followed by dvs when untracked. 
+ +Usage: + dvs follow [COMMANDS] [OPTIONS] + +Commands: + add + list + audit + +Options: + -h, --help Show help for a command +``` + +`add` command: + + +`list` command: + + +`add` command: + + +## R package + +Support the following + +- `ext` which are following-filters based on file extensions, e.g. `"csx"`. +- `glob`: a glob that can enable matching files through their paths and file extension +- `regex`: a regular expression to match files through their full paths + +Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`. + +The follow filter must support + +- `glob`, `ext`, `regex` field +- an optional `label` that can be used to identify which follow-filter matched a file +- file size qualifiers: + `file_size_gt` (file size greater than mask), + `file_size_lt` (file size less than mask) + +Example: + +```toml +[[follow]] +{ ext = "parquet" } +[[follow]] +{ glob = "data/**/*.csv", label? = "optional label" } +[[follow]] +{ regex = ".+tab[0-9].+", file_size_gt ="5MB" } # match all nonmem tab files sdtab001 patab001 .... over 5MB +[[follow]] +{ glob = "model/nonmem/**/*", file_size_gt = "10MB" } +``` + +## Matcher audit + +A helpful utility for end users is a way to figure out why a given file was followed +by dvs. To that end, the dvs track ought to display the matching filter next to every +followed file. From 7e07094360ac21399548a1c8671331c279c7915a Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 14:35:59 +0100 Subject: [PATCH 21/28] minor --- ui/status.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/ui/status.md b/ui/status.md index a5d9eaa..f6111e7 100644 --- a/ui/status.md +++ b/ui/status.md @@ -25,8 +25,7 @@ Filters: Options: -s, --state filter for states to retain -i, --invert inverts the selection provided by `--state` - -h, --help - Print help + -h, --help Print help ``` When a filter is provided, only the selected state(s) are provided. 
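For the follow filters proposed in `ui/tracking.md` (PATCH 20/28), the matching rules and the "Matcher audit" idea can be sketched in base R. The function names `matches_follow()` and `which_follow()` are hypothetical, file sizes are given in bytes rather than strings such as `"5MB"`, and `utils::glob2rx()` only approximates `**` (it treats every `*` as "any characters"), so this is an illustration of the idea rather than the intended implementation.

```r
# Hypothetical matcher, not part of dvs: does `path` satisfy one follow filter?
# `flt` is a list with one of `ext`, `glob`, `regex`, plus optional
# `file_size_gt` / `file_size_lt` (bytes, as a simplification).
matches_follow <- function(path, size_bytes, flt) {
  matched <- if (!is.null(flt[["ext"]])) {
    # `ext` is written without the leading dot ("csv", not ".csv").
    tolower(tools::file_ext(path)) == tolower(flt[["ext"]])
  } else if (!is.null(flt[["glob"]])) {
    grepl(utils::glob2rx(flt[["glob"]]), path)
  } else if (!is.null(flt[["regex"]])) {
    grepl(flt[["regex"]], path)
  } else {
    stop("filter needs one of `ext`, `glob`, or `regex`")
  }

  if (!is.null(flt[["file_size_gt"]])) {
    matched <- matched && size_bytes > flt[["file_size_gt"]]
  }
  if (!is.null(flt[["file_size_lt"]])) {
    matched <- matched && size_bytes < flt[["file_size_lt"]]
  }
  matched
}

# Report the label of the first matching filter, so a user can see
# why a file is followed; NA means no filter matched.
which_follow <- function(path, size_bytes, filters) {
  for (flt in filters) {
    if (matches_follow(path, size_bytes, flt)) {
      label <- flt[["label"]]
      return(if (is.null(label)) "<unlabelled>" else label)
    }
  }
  NA_character_
}

filters <- list(
  list(ext = "parquet"),
  list(glob = "data/**/*.csv", label = "csv data"),
  list(regex = ".+tab[0-9].+", file_size_gt = 5 * 1024^2)
)
```

Taking the first match keeps the audit output to one label per file; whether dvs should instead list every matching filter is an open design question.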
From aa82db9106d040152de7f8f3cf64d3b79254adb3 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:16:29 +0100 Subject: [PATCH 22/28] moving things around --- brainstorm/add.md | 45 ++++ brainstorm/alias_git.md | 5 + brainstorm/audit.md | 38 +++ brainstorm/configuration.md | 5 + brainstorm/delete.md | 29 ++ brainstorm/dvs_last.md | 10 + brainstorm/enum_status.md | 39 +++ brainstorm/follow.md | 19 ++ brainstorm/get.md | 20 ++ brainstorm/initialization.md | 252 ++++++++++++++++++ brainstorm/journey-2-adding-data-files.md | 68 +++++ brainstorm/journey-3-getting-latest-files.md | 55 ++++ brainstorm/journey-4-updating-data-files.md | 90 +++++++ .../journey-5-working-with-multiple-files.md | 60 +++++ brainstorm/log.md | 50 ++++ brainstorm/message.md | 23 ++ brainstorm/remote_storage.md | 22 ++ brainstorm/revert.md | 19 ++ brainstorm/root.md | 33 +++ brainstorm/status.md | 107 ++++++++ brainstorm/sync.md | 48 ++++ brainstorm/trace.md | 0 brainstorm/tracking.md | 79 ++++++ 23 files changed, 1116 insertions(+) create mode 100644 brainstorm/add.md create mode 100644 brainstorm/alias_git.md create mode 100644 brainstorm/audit.md create mode 100644 brainstorm/configuration.md create mode 100644 brainstorm/delete.md create mode 100644 brainstorm/dvs_last.md create mode 100644 brainstorm/enum_status.md create mode 100644 brainstorm/follow.md create mode 100644 brainstorm/get.md create mode 100644 brainstorm/initialization.md create mode 100644 brainstorm/journey-2-adding-data-files.md create mode 100644 brainstorm/journey-3-getting-latest-files.md create mode 100644 brainstorm/journey-4-updating-data-files.md create mode 100644 brainstorm/journey-5-working-with-multiple-files.md create mode 100644 brainstorm/log.md create mode 100644 brainstorm/message.md create mode 100644 brainstorm/remote_storage.md create mode 100644 brainstorm/revert.md create mode 100644 brainstorm/root.md create mode 100644 brainstorm/status.md create mode 100644 brainstorm/sync.md create mode 
100644 brainstorm/trace.md create mode 100644 brainstorm/tracking.md diff --git a/brainstorm/add.md b/brainstorm/add.md new file mode 100644 index 0000000..23d9702 --- /dev/null +++ b/brainstorm/add.md @@ -0,0 +1,45 @@ +# `dvs add` + +Goal: Add files to an initialized dvs repository. + +- [ ] Currently the `message` is attached to all files checked in simultaneously. + dvs has a log and audit log to + illuminate "why" a change occurred in the data. + + + +## CLI + +- Assume that current directory is a dvs repository, both in cli and R-package. +- The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_add <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + overwrite = FALSE, + fail = FALSE +) +``` + +## Compression + +If the added file exceeds a certain threshold, the +R package should provide suggest compressing the recently added file. + +- [ ] `getOption(dvs.large_file_size = integer()`) + - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) + - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) + +Advice compression when + +- a single size exceeds size thresholds +- a directory of files exceeds size thresholds + +There are cases where individual files are not large, but the collection of files +starts to amount to a large amount, presumably too large to track. diff --git a/brainstorm/alias_git.md b/brainstorm/alias_git.md new file mode 100644 index 0000000..c248ab9 --- /dev/null +++ b/brainstorm/alias_git.md @@ -0,0 +1,5 @@ +# Alias dvs with get terminology + +- [ ] (future?) Should dvs-cli and dvs-rpkg have a --git-mode, where we +expose a git compatible interface to dvs, in order to +plug-in dvs as a git replacement? 
diff --git a/brainstorm/audit.md b/brainstorm/audit.md new file mode 100644 index 0000000..c005291 --- /dev/null +++ b/brainstorm/audit.md @@ -0,0 +1,38 @@ +# `dvs audit` + +Goal: Provide a repository wide log of dvs tracked files. + +## CLI + +The option to return `--json` must be present. + +```sh +$ dvs audit +[Date] [User] [+{files} -{files}] [Message] +``` + +```sh +$ dvs audit --since +``` + +## R + +Signature: + +```r +dvs_audit <- function( + since = NULL, # date | duration (unit) + by_user = character()) + +``` + + + +```r +dvs_audit() + +``` + +```r +dvs_audit(since = NULL) +``` diff --git a/brainstorm/configuration.md b/brainstorm/configuration.md new file mode 100644 index 0000000..7649cc2 --- /dev/null +++ b/brainstorm/configuration.md @@ -0,0 +1,5 @@ +# `dvs.toml` + + +Configuration should track which patterns are tracked [DVS Tracking](./tracking.md). + diff --git a/brainstorm/delete.md b/brainstorm/delete.md new file mode 100644 index 0000000..d9c4687 --- /dev/null +++ b/brainstorm/delete.md @@ -0,0 +1,29 @@ +# `dvs delete` + +Goal: Remove tracked files. + +## CLI + +```shell +$ dvs delete + +``` + +## R package + +```r +dvs_delete <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE +) +``` + +Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`. + +- `files`: list of files that are to be deleted. + +### Non-existing files + +Emit a warning, but still remove the files that do exist and are tracked. diff --git a/brainstorm/dvs_last.md b/brainstorm/dvs_last.md new file mode 100644 index 0000000..cd0a5e9 --- /dev/null +++ b/brainstorm/dvs_last.md @@ -0,0 +1,10 @@ +# `dvs_last` + +Goal: provide users with the ability to retrieve the result of +the last executed dvs command within the r package. + +Example: Suppose after `dvs_add(by_folder = "data/derived/*")` was executed +an error occurred, and an overview is displayed as a data-frame. 
The user +got a R native result, a data-frame, but if the user wants to act on the +provided information, we might want to provide a `dvs_last` that contains +miscellaneous. diff --git a/brainstorm/enum_status.md b/brainstorm/enum_status.md new file mode 100644 index 0000000..a051258 --- /dev/null +++ b/brainstorm/enum_status.md @@ -0,0 +1,39 @@ +# Configuration: Status + +- current | absent | unsynced +- tracked file that is un-added + + + +# TODO (editting needed) + + relative_path: relative path to the file with respect to where the operation was called + + status: (doesn’t include error status) + + current: the file is present in the project directory and matches the version in the storage directory + + absent: the file isn't present in the project directory + + unsynced: the file is present in the project directory, but doesn't match the version on in the storage directory + + file_size_bytes: current size of the file in bytes + + time_stamp: the ISO 8601 Zulu time of the most recent file version in the storage directory + + saved_by: the user who uploaded the most recent file version in the storage directory + + message: the message inputted to the dvs_add command that added the most recent file version in the storage directory + + blake3_checksum: hash of the file via the blake3 algorithm + + absolute_path: canonicalized path of the file + input: + + If inputted explicitly via file glob or path: the file name + + if inputted implicitly via dvs_status() (without input): NA + +error: if the outcome was error, the error type, else NA + +error message: if the outcome was error, the error message (if there was one), else NA diff --git a/brainstorm/follow.md b/brainstorm/follow.md new file mode 100644 index 0000000..f87c953 --- /dev/null +++ b/brainstorm/follow.md @@ -0,0 +1,19 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. + +User journey: + +- [ ] All the .csv files underneath a specific directory. 
+- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + +- [ ] MOSSA: We may want to not track too large files, even if they are .csv +- [ ] pre-hook 100mb limit see template-PMx-project-starter + +Cloned repositories do not have hooks! + +- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! +- [ ] \ No newline at end of file diff --git a/brainstorm/get.md b/brainstorm/get.md new file mode 100644 index 0000000..8f5741c --- /dev/null +++ b/brainstorm/get.md @@ -0,0 +1,20 @@ +# `dvs get` + +## CLI + +The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_get <- function(path = ".", + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE # follows fs::dir_ls +) +``` + + diff --git a/brainstorm/initialization.md b/brainstorm/initialization.md new file mode 100644 index 0000000..c552e69 --- /dev/null +++ b/brainstorm/initialization.md @@ -0,0 +1,252 @@ +# dvs initialization / `dvs init` / `dvs_init` + +Goal: Prepare shared storage and initialize DVS in directory + +dvs initialization will create a `dvs.toml` and a directory as specified by the +shared area in the init command. The shared dir may also need to `chown` the directory +to specify certain permissions. For example, for sensitive projects, setting +ownership to a particular group, allowing write access for the group, and limiting +read access to those not in the group. + +## User site assumptions + +- Always operating within a repository/project/workspace. +- A dvs repository need not fall under a git or any other vcs repository +- Storage is detached from repository root + +- [ ] If `git` is not a requirement, what alternative heuristics do we use + for instantiating a dvs repository? Suggestion: If a `.git` directory is not + available, then take current directory as the choice directory for initialization? 
+ +## CLI + +```shell +dvs --- Data version control and storage management system + +Usage: + dvs [OPTIONS] + +Commands: + init + add + get + status + audit + log + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) + --version Show version information +``` + +The initialization command will have further subcommands. + +```shell +dvs init --- Initialize a new DVS repository + +Usage: + dvs init [OPTIONS] + +Backends: + local Local, on-disk storage + fs File system storage (e.g. network file system (nfs)) + s3 S3 compatible storage + aws S3 hosted via AWS + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) +``` + +### Local + +```shell +dvs init local --- Initialize a DVS repository via on-disk storage + +Usage: + dvs init local [OPTIONS] + +Required: + path to the local storage locations (e.g. `/data/`) + +Options: + --json + Output results as JSON + --metadata-folder-name + If you want to use a folder name other than `.dvs` for storing the metadata files + --permissions + Unix permissions for storage directory and files (octal, e.g., "770") + --group + Unix group to set on storage directory and files + --no-compression + Disable compression of stored files. Compression defaults to zstd + --no-compression + Disable compression of stored files. Compression defaults to zstd + -h, --help + Print help +``` + +## FS / NFS + + + +Example output: + +```shell +$ dvs init /data/ +DVS Repository created with storage path located at +``` + +## R function + +```r +dvs_init <- function( + storage_path = character(), # required + permissions = NULL, + group = NULL, + metadata_folder_name = NULL) +``` + +Example output: + +```r +> dvs_init() +> Error: `storage_path` is missing; Please provide a location to store dvs objects. 
+``` + +```r +> dvs_init("/data/projectA_storage") +> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage" +``` + +CLI users do not need the full path shown to them, but R users need that information. + +Different storage backends have to be initialized through specialized functions. + +- `dvs_init_local` with alias `dvs_init` +- `dvs_init_fs(...)` +- `dvs_init_s3(...)` +- `dvs_init_aws(...)` + +## Storage + +- (future) Multiple projects can be hosted within the same storage + +### Case: No project or specific work directory + +Considering the one off scripts that scientists might create, in which there is +no project surrounding where said script is. + +- (future) User/machine storage +- (future) A remote project +- (future) One off scripts + +## Journey 1: Initial Setup with defaults + +Expected outcomes: + +- `dvs.toml` created in the ancestral directory that contains `.git`, or other heuristics. +- shared dir created in specified path, with default permissions of 664 + +Known Caveats: + +- certain linux `umask` setups cause folders to have default permissions like 600, or 644 +where other collaborators could not write by default, therefore, + +### CLI flow + +1. initialize dvs from a project directory + +```bash +dvs init /data/dvs/example-proj +``` + +### R package flow + +1. Initialize DVS in the repo + +```r +dvs_init("/data/shared/project-x-dvs") +``` + +## Journey 2: Initial Setup with shared folder locked down to group + +- set permissions to writeable by group, not readable if not in group (660) +- group name projx + +Expected outcomes: + +- dvs.toml created in working directory +- shared dir created in specified path, with permissions of 660 and owned by group projx + +Edge cases: + +- group must resolve to known gid on system + +### CLI flow + +1. 
initialize dvs from a project directory + +```bash +dvs init /data/dvs/sensitive-projx --permissions "660" --group projx +``` + +### R package flow + +1. Initialize DVS in the repo + +```r +dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx") +``` + +#### Returns + +Return a rich data-frame that the end-user can then further subset/filter +to fit their needs. + +Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`. + +- [ ] New format: + - `absolute_path`: abbreviated when printed in R (pillar) + - `relative path`: full path + - `status`: ordered factor instead of `character()` + - `absent|unsync|sync|present|added` + - `checksum`: always abbreviated in print (pillar, first 5 characters) + - `size`: using units and not raw `double()/numeric()` + +## Data formats to track + +- `.csv` +- `.rds` +- don't track `.RDA` files, as they are a collection of datasets + +Configuration: Must add these filters to the `dvs.toml`. + +Known annoyance: Verbosity of this can be annoying. +There should be a way to reduce outputs on untracked data files available +to the user. + +# TODO (to be edited) + +Errors + +dvs_init could return any of the following error types: + +project already initialized: dvs_init has already been run with different initialization attributes. 
+ +git repository not found: dvs_init was run outside of a git repository + +storage directory input is not a directory: if input was an existing file + +storage directory absolute path not found: if the path could not be made absolute + +configuration file not created (dvs.yaml): failed to write to or save dvs.yaml + +linux primary group not found: if the group was inputted and it doesn't refer to a valid group + +storage directory not created: failed to create the storage directory + +linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions + +could not check if storage directory is empty: error reading the contents of the directory + +storage directory permissions not set: couldn't modify the permissions of the storage directory diff --git a/brainstorm/journey-2-adding-data-files.md b/brainstorm/journey-2-adding-data-files.md new file mode 100644 index 0000000..94a2ff2 --- /dev/null +++ b/brainstorm/journey-2-adding-data-files.md @@ -0,0 +1,68 @@ +# Journey 2: Adding Data Files + +Goal: Version a newly created dataset so others can retrieve it. + +## CLI flow + +1. Produce the data (example) + + ```bash + # Your data pipeline or script writes: + # data/derived/pk_data.csv + ``` + +2. Add the file to DVS + + ```bash + dvs add data/derived/pk_data.csv --message "Initial PK dataset v1" + ``` + +3. Commit DVS metadata + + ```bash + git add data/derived/pk_data.csv.dvs data/derived/.gitignore + git commit -m "Add processed PK data" + git push + ``` + +4. Verify status + + ```bash + dvs status data/derived/pk_data.csv + ``` + +## R package flow + +1. Produce the data + + ```r + write.csv(pk_data, "data/derived/pk_data.csv") + ``` + +2. Add the file to DVS + + ```r + dvs_add("data/derived/pk_data.csv", message = "Initial PK dataset v1") + ``` + +3. Commit DVS metadata + + ```bash + git add data/derived/pk_data.csv.dvs data/derived/.gitignore + git commit -m "Add processed PK data" + git push + ``` + +4. 
Verify status + + ```r + dvs_status("data/derived/pk_data.csv") + ``` + +```r +# no dvs_init ran before +dvs_add("contingency_table2.csv") +``` + +In RStudio: Check if there is no active folder, then emit warning. +Similarly in VSCode and Positron, as both can be run without an active workspace. diff --git a/brainstorm/journey-3-getting-latest-files.md b/brainstorm/journey-3-getting-latest-files.md new file mode 100644 index 0000000..5d681d1 --- /dev/null +++ b/brainstorm/journey-3-getting-latest-files.md @@ -0,0 +1,55 @@ +# Journey 3: Getting Latest Files + +Goal: Pull metadata from Git and restore the tracked data files. + +## CLI flow + +1. Pull the latest repo changes + + ```bash + git pull + ``` + +2. See what is missing + + ```bash + dvs status + ``` + +3. Restore tracked files + + ```bash + dvs get data/derived/* + ``` + +4. Verify everything is current + + ```bash + dvs status + ``` + +## R package flow + +1. Pull the latest repo changes + + ```bash + git pull + ``` + +2. See what is missing + + ```r + dvs_status() + ``` + +3. Restore tracked files + + ```r + dvs_get("data/derived/*") + ``` + +4. Verify everything is current + + ```r + dvs_status() + ``` diff --git a/brainstorm/journey-4-updating-data-files.md b/brainstorm/journey-4-updating-data-files.md new file mode 100644 index 0000000..a96d329 --- /dev/null +++ b/brainstorm/journey-4-updating-data-files.md @@ -0,0 +1,90 @@ +# Journey 4: Updating Data Files + +Goal: Replace an existing tracked dataset with a new version. + +## CLI flow + +1. Re-run your processing to overwrite the data file + + ```bash + # Your data pipeline updates: + # data/derived/pk_data.csv + ``` + +2. Check status + + ```bash + dvs status data/derived/pk_data.csv + ``` + +3. Add the new version + + ```bash + dvs add data/derived/pk_data.csv --message "Updated PK dataset v2" + ``` + +4. 
Commit updated metadata + + ```bash + git add data/derived/pk_data.csv.dvs + git commit -m "Update PK data with new processing" + git push + ``` + +## R package flow + +1. Re-run your processing + + ```r + pk_data_v2 <- update_processing(pk_data) + write.csv(pk_data_v2, "data/derived/pk_data.csv") + ``` + +2. Check status + + ```r + dvs_status("data/derived/pk_data.csv") + ``` + +3. Add the new version + + ```r + dvs_add("data/derived/pk_data.csv", message = "Updated PK dataset v2") + ``` + +4. Commit updated metadata + + ```bash + git add data/derived/pk_data.csv.dvs + git commit -m "Update PK data with new processing" + git push + ``` + +## Journey 5: Updating data files with new rows + +New data following previous form might come up. Example is new rows from a clinical trial, +new participants in trials is added, however the scientists want them added to already +checked data files. + +```r +dvs_add("data/registry/participants.csv", "added information from the second batch of runs") +``` + +this ought to say + +```r +> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`" +``` + +Then, + +```r +dvs_add("data/registry/participants.csv", "added information from the second batch of runs", amend = TRUE) +``` + +could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated +to the level of the previous file, and then it can be known if this new event can supersede other add events, because we +know it is an addition. + +The hash itself cannot distinguish between a completely new file, or one with new bytes. In dvs, we only have current hash, +so we should consider adding this context via the user, i.e. by asking if it is an addition / amendment. 
diff --git a/brainstorm/journey-5-working-with-multiple-files.md b/brainstorm/journey-5-working-with-multiple-files.md new file mode 100644 index 0000000..bf7872e --- /dev/null +++ b/brainstorm/journey-5-working-with-multiple-files.md @@ -0,0 +1,60 @@ +# Journey 5: Working with Multiple Files + +Goal: Add and retrieve batches of outputs with glob patterns. + +## CLI flow + +1. Produce multiple outputs + + ```bash + # Your data pipeline writes: + # data/derived/pk.csv + # data/derived/pd.csv + # data/derived/summary.csv + ``` + +2. Add all outputs at once + + ```bash + dvs add data/derived/*.csv --message "Analysis outputs batch 1" + ``` + +3. Retrieve all tracked files later + + ```bash + dvs get data/derived/*.csv + ``` + +4. Check status for everything + + ```bash + dvs status + ``` + +## R package flow + +1. Produce multiple outputs + + ```r + write.csv(pk_data, "data/derived/pk.csv") + write.csv(pd_data, "data/derived/pd.csv") + write.csv(summary_stats, "data/derived/summary.csv") + ``` + +2. Add all outputs at once + + ```r + dvs_add("data/derived/*.csv", message = "Analysis outputs batch 1") + ``` + +3. Retrieve all tracked files later + + ```r + dvs_get("data/derived/*.csv") + ``` + +4. Check status for everything + + ```r + dvs_status() + ``` diff --git a/brainstorm/log.md b/brainstorm/log.md new file mode 100644 index 0000000..79fcdfe --- /dev/null +++ b/brainstorm/log.md @@ -0,0 +1,50 @@ +# `dvs log` + +Per file logging is inspected via `dvs log` / `dvs_log()`. For project-wide logging, we have `dvs audit` / `dvs_audit()`. + +## CLI + +The option to return `--json` must be present. 
+ +```sh +# in a folder previously initialized with `dvs init` +$ dvs log data/derived/model_summary.txt +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" +``` + +```sh +$ dvs log --interval +[date -- duration since now] +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" + +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" + +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" +[date -- 2x duration since now] +... +[date -- 3x duration since now] +... +``` + +``: `days`, `weeks`, `months` + +## R + +Signature: + +```r +dvs_log <- function( + since = NULL, + by_user = NULL +) +``` + +  diff --git a/brainstorm/message.md b/brainstorm/message.md new file mode 100644 index 0000000..0024275 --- /dev/null +++ b/brainstorm/message.md @@ -0,0 +1,23 @@ +# `dvs_message` + +Goal: Add messages to files without re-hashing or replacing them. + +## CLI + +```sh +$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repetitions" +Added message to `data/model_aaabb/model_summary.csv` +``` + +## R package + +```r +dvs_message <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE +) +``` + +`dvs_message` is equivalent to an idempotent `dvs_add` call. diff --git a/brainstorm/remote_storage.md b/brainstorm/remote_storage.md new file mode 100644 index 0000000..93694f6 --- /dev/null +++ b/brainstorm/remote_storage.md @@ -0,0 +1,22 @@ +# dvs supported storage backends + +## A2-AI hosting + +storagemagic/Dumbledore/A2-AI Cloud + +## Custom DVS storage + +- [ ] (future) dvs server hosted by client. 
+ +## Third party storage hosting + +### Amazon FSx + + + +### S3 + +### Sharepoint + +### + diff --git a/brainstorm/revert.md b/brainstorm/revert.md new file mode 100644 index 0000000..7ce26cc --- /dev/null +++ b/brainstorm/revert.md @@ -0,0 +1,19 @@ +# `dvs revert` + +## CLI + +The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_revert <- function(path = ".", + commit_sha = integer(), + date = NULL, + before = NULL # date | duration +) +``` + +  diff --git a/brainstorm/root.md b/brainstorm/root.md new file mode 100644 index 0000000..9814ee9 --- /dev/null +++ b/brainstorm/root.md @@ -0,0 +1,33 @@ +# `dvs root` + +Convenience utility for expert users + +Goal: Return the location of the dvs repository root from anywhere within the repository. + +## CLI + +Not relevant. + +## R package + +Signature: + +```r +dvs_root <- function(...) +# alias +find_dvs_root <- dvs_root +``` + +Convenience: + +```r +dvs_root("model_code") +# equivalent to +fs::path(dvs_root(), "model_code") +# or +file.path(dvs_root(), "model_code") +``` + +The use cases for this function are very limited. We assume heavy use of the +`{here}` package in dvs-based projects. But it could be a relevant convenience +function in certain, specific cases. diff --git a/brainstorm/status.md b/brainstorm/status.md new file mode 100644 index 0000000..f6111e7 --- /dev/null +++ b/brainstorm/status.md @@ -0,0 +1,107 @@ +# `dvs` status + +Goal: Provide an overview of the changed data files and potential files to track +via the traced data file filters. + +## CLI + +The option to return `--json` must be present. 
+ +```shell +$ dvs status --help +Status of the DVS repository + +Usage: + dvs status [FILTERS] [OPTIONS] + +Filters: + --current + --unsynced, --missing + --absent + --no-current + --no-unsynced + --no-absent + +Options: + -s, --state  filter for states to retain + -i, --invert inverts the selection provided by `--state` + -h, --help Print help +``` + +When a filter is provided, only files in the selected state(s) are shown. + + + +```sh +dvs status + +Current files: + + +Changed files (unsynced): + new_scenario/model_spec.txt + +Untracked and followed files: + original_scenario/model_summary.txt + original_scenario/tab-0123.tsv + original_scenario/tab-0123b.tsv + original_scenario/tab-0123c.tsv +``` + +We do not need to display the user in unsynced files, as they are likely to be owned by the current user. + +## R + +Signature: + +```r +dvs_status <- function( + path = ".", + show_storage = FALSE +) +``` + +- `show_storage`: + - Show location of storage(s) for the current dvs repository. + - Warn the user that they must not alter the state of + the storage directory. + - (future) Show number of projects that the storage contains + +## Return format + +### CLI JSON format + + + +### R format + +Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum` + +Proposed format: + +- `absolute_path`: abbreviated when printed in R (pillar) +- `relative_path`: full path +- `status`: ordered factor instead of `character()` + - `absent|unsync|sync|present|added` +- `checksum`: always abbreviated in print (pillar, first 5 characters) +- `size`: using units and not raw `double()/numeric()` + +## Data name format + +`dvs_status` should show untracked data files in the current dvs repository, if +tracking is specified. + +## Granularity + +We expect the end user to use `{dplyr}` in order to +filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. + +## Following Filters in Status + +`dvs_track(".csv")`: tracks all CSV files. 
+ +`dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked) files + +`dvs_track("results/*.rds")`: glob on all R data files that are saved in a specific directory. + +These should result in additions to the `[following]` table in `dvs.toml`. See [Following Formats](tracking.md). diff --git a/brainstorm/sync.md b/brainstorm/sync.md new file mode 100644 index 0000000..e4e713b --- /dev/null +++ b/brainstorm/sync.md @@ -0,0 +1,48 @@ +# `dvs sync` + +Goal: Provide a streamlined way to update a cloned dvs repository. + +Synchronization `sync` is an alias for `dvs get **/*`, meant as a +repository wide syncing from storage (local/remote). + +## CLI + +The option to return `--json` must be present. + +```sh +$ dvs sync +[status] [Last modified] [Message] +... ... ... +``` + +The sync subcommand should also be able to act as a repository revert, +and + +```sh +$ dvs sync --before +[status] [Last modified] [Message] +... ... ... +``` + +## R + +Signature: + +```r +dvs_sync <- function(path = ".", + by_folder = character(), + since = NULL, # date | duration (unit) + recurse = TRUE +) +``` + +- `path` is a location within a dvs repository. + Not necessarily the root of a dvs repository. +- `by_folder` allows syncing specific folders only +- `recurse` is whether to sync folders recursively + +### `recurse` + +When there is no `by_folder`, recurse will update the entire dvs repository, even if +the current directory is a sub-directory in a dvs repository. The current location of the +user might be incidental to their intent with dvs. diff --git a/brainstorm/trace.md b/brainstorm/trace.md new file mode 100644 index 0000000..e69de29 diff --git a/brainstorm/tracking.md b/brainstorm/tracking.md new file mode 100644 index 0000000..6d6917e --- /dev/null +++ b/brainstorm/tracking.md @@ -0,0 +1,79 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. 
+ +User journey: + +- [ ] All the .csv files underneath a specific directory. +- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + + + + + +## CLI + +```shell +$ dvs follow --help +Files that are followed by dvs when untracked. + +Usage: + dvs follow [COMMANDS] [OPTIONS] + +Commands: + add + list + audit + +Options: + -h, --help Show help for a command +``` + +`add` command: + + +`list` command: + + +`audit` command: + + +## R package + +Support the following + +- `ext`: following-filters based on file extensions, e.g. `"csv"`. +- `glob`: a glob that can enable matching files through their paths and file extension +- `regex`: a regular expression to match files through their full paths + +Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`. + +The follow filter must support + +- `glob`, `ext`, `regex` field +- an optional `label` that can be used to identify which follow-filter matched a file +- file size qualifiers: + `file_size_gt` (file size greater than mask), + `file_size_lt` (file size less than mask) + +Example: + +```toml +[[follow]] +{ ext = "parquet" } +[[follow]] +{ glob = "data/**/*.csv", label? = "optional label" } +[[follow]] +{ regex = ".+tab[0-9].+", file_size_gt ="5MB" } # match all nonmem tab files sdtab001 patab001 .... over 5MB +[[follow]] +{ glob = "model/nonmem/**/*", file_size_gt = "10MB" } +``` + +## Matcher audit + +A helpful utility for end users is a way to figure out why a given file was followed +by dvs. To that end, `dvs track` ought to display the matching filter next to every +followed file. 
From 1178e1e3dc383d5e866a959e576e7101d107275d Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:17:20 +0100 Subject: [PATCH 23/28] copied the brainstorm to spec/ --- spec/add.md | 45 ++++ spec/alias_git.md | 5 + spec/audit.md | 38 +++ spec/configuration.md | 5 + spec/delete.md | 29 ++ spec/dvs_last.md | 10 + spec/enum_status.md | 39 +++ spec/follow.md | 19 ++ spec/get.md | 20 ++ spec/initialization.md | 252 ++++++++++++++++++ spec/journey-2-adding-data-files.md | 68 +++++ spec/journey-3-getting-latest-files.md | 55 ++++ spec/journey-4-updating-data-files.md | 90 +++++++ spec/journey-5-working-with-multiple-files.md | 60 +++++ spec/log.md | 50 ++++ spec/message.md | 23 ++ spec/remote_storage.md | 22 ++ spec/revert.md | 19 ++ spec/root.md | 33 +++ spec/status.md | 107 ++++++++ spec/sync.md | 48 ++++ spec/trace.md | 0 spec/tracking.md | 79 ++++++ 23 files changed, 1116 insertions(+) create mode 100644 spec/add.md create mode 100644 spec/alias_git.md create mode 100644 spec/audit.md create mode 100644 spec/configuration.md create mode 100644 spec/delete.md create mode 100644 spec/dvs_last.md create mode 100644 spec/enum_status.md create mode 100644 spec/follow.md create mode 100644 spec/get.md create mode 100644 spec/initialization.md create mode 100644 spec/journey-2-adding-data-files.md create mode 100644 spec/journey-3-getting-latest-files.md create mode 100644 spec/journey-4-updating-data-files.md create mode 100644 spec/journey-5-working-with-multiple-files.md create mode 100644 spec/log.md create mode 100644 spec/message.md create mode 100644 spec/remote_storage.md create mode 100644 spec/revert.md create mode 100644 spec/root.md create mode 100644 spec/status.md create mode 100644 spec/sync.md create mode 100644 spec/trace.md create mode 100644 spec/tracking.md diff --git a/spec/add.md b/spec/add.md new file mode 100644 index 0000000..23d9702 --- /dev/null +++ b/spec/add.md @@ -0,0 +1,45 @@ +# `dvs add` + +Goal: Add files to an initialized dvs 
repository. + +- [ ] Currently the `message` is attached to all files checked in simultaneously. + dvs has a log and audit log to + illuminate "why" a change occurred in the data. + + + +## CLI + +- Assume that current directory is a dvs repository, both in cli and R-package. +- The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_add <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + overwrite = FALSE, + fail = FALSE +) +``` + +## Compression + +If the added file exceeds a certain threshold, the +R package should suggest compressing the recently added file. + +- [ ] `getOption("dvs.large_file_size")` (an integer) + - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) + - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) + +Advise compression when + +- a single file exceeds size thresholds +- a directory of files exceeds size thresholds + +There are cases where individual files are not large, but the collection of files +adds up to a large total, presumably too large to track. diff --git a/spec/alias_git.md b/spec/alias_git.md new file mode 100644 index 0000000..c248ab9 --- /dev/null +++ b/spec/alias_git.md @@ -0,0 +1,5 @@ +# Alias dvs with git terminology + +- [ ] (future?) Should dvs-cli and dvs-rpkg have a --git-mode, where we +expose a git compatible interface to dvs, in order to +plug-in dvs as a git replacement? diff --git a/spec/audit.md b/spec/audit.md new file mode 100644 index 0000000..c005291 --- /dev/null +++ b/spec/audit.md @@ -0,0 +1,38 @@ +# `dvs audit` + +Goal: Provide a repository wide log of dvs tracked files. + +## CLI + +The option to return `--json` must be present. 
+ +```sh +$ dvs audit +[Date] [User] [+{files} -{files}] [Message] +``` + +```sh +$ dvs audit --since  +``` + +## R + +Signature: + +```r +dvs_audit <- function( + since = NULL, # date | duration (unit) + by_user = character()) + +``` + + + +```r +dvs_audit() + +``` + +```r +dvs_audit(since = NULL) +``` diff --git a/spec/configuration.md b/spec/configuration.md new file mode 100644 index 0000000..7649cc2 --- /dev/null +++ b/spec/configuration.md @@ -0,0 +1,5 @@ +# `dvs.toml` + + +Configuration should track which patterns are tracked [DVS Tracking](./tracking.md). + diff --git a/spec/delete.md b/spec/delete.md new file mode 100644 index 0000000..d9c4687 --- /dev/null +++ b/spec/delete.md @@ -0,0 +1,29 @@ +# `dvs delete` + +Goal: Remove tracked files. + +## CLI + +```shell +$ dvs delete  + +``` + +## R package + +```r +dvs_delete <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE +) +``` + +Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`. + +- `files`: list of files that are to be deleted. + +### Non-existing files + +Emit a warning, but still remove the files that do exist and are tracked. diff --git a/spec/dvs_last.md b/spec/dvs_last.md new file mode 100644 index 0000000..cd0a5e9 --- /dev/null +++ b/spec/dvs_last.md @@ -0,0 +1,10 @@ +# `dvs_last` + +Goal: provide users with the ability to retrieve the result of +the last executed dvs command within the R package. + +Example: Suppose after `dvs_add(by_folder = "data/derived/*")` was executed +an error occurred, and an overview is displayed as a data-frame. The user +got an R-native result, a data-frame, but if the user wants to act on the +provided information, we might want to provide a `dvs_last` that contains +miscellaneous. 
diff --git a/spec/enum_status.md b/spec/enum_status.md new file mode 100644 index 0000000..a051258 --- /dev/null +++ b/spec/enum_status.md @@ -0,0 +1,39 @@ +# Configuration: Status + +- current | absent | unsynced +- tracked file that is un-added + + + +# TODO (editing needed) + + relative_path: relative path to the file with respect to where the operation was called + + status: (doesn’t include error status) + + current: the file is present in the project directory and matches the version in the storage directory + + absent: the file isn't present in the project directory + + unsynced: the file is present in the project directory, but doesn't match the version in the storage directory + + file_size_bytes: current size of the file in bytes + + time_stamp: the ISO 8601 Zulu time of the most recent file version in the storage directory + + saved_by: the user who uploaded the most recent file version in the storage directory + + message: the message inputted to the dvs_add command that added the most recent file version in the storage directory + + blake3_checksum: hash of the file via the blake3 algorithm + + absolute_path: canonicalized path of the file + input: + + If inputted explicitly via file glob or path: the file name + + if inputted implicitly via dvs_status() (without input): NA + +error: if the outcome was error, the error type, else NA + +error message: if the outcome was error, the error message (if there was one), else NA diff --git a/spec/follow.md b/spec/follow.md new file mode 100644 index 0000000..f87c953 --- /dev/null +++ b/spec/follow.md @@ -0,0 +1,19 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. + +User journey: + +- [ ] All the .csv files underneath a specific directory. 
+- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + +- [ ] MOSSA: We may want to not track too large files, even if they are .csv +- [ ] pre-hook 100mb limit see template-PMx-project-starter + +Cloned repositories do not have hooks! + +- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! +- [ ] \ No newline at end of file diff --git a/spec/get.md b/spec/get.md new file mode 100644 index 0000000..8f5741c --- /dev/null +++ b/spec/get.md @@ -0,0 +1,20 @@ +# `dvs get` + +## CLI + +The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_get <- function(path = ".", + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE # follows fs::dir_ls +) +``` + + diff --git a/spec/initialization.md b/spec/initialization.md new file mode 100644 index 0000000..c552e69 --- /dev/null +++ b/spec/initialization.md @@ -0,0 +1,252 @@ +# dvs initialization / `dvs init` / `dvs_init` + +Goal: Prepare shared storage and initialize DVS in directory + +dvs initialization will create a `dvs.toml` and a directory as specified by the +shared area in the init command. The shared dir may also need to `chown` the directory +to specify certain permissions. For example, for sensitive projects, setting +ownership to a particular group, allowing write access for the group, and limiting +read access to those not in the group. + +## User site assumptions + +- Always operating within a repository/project/workspace. +- A dvs repository need not fall under a git or any other vcs repository +- Storage is detached from repository root + +- [ ] If `git` is not a requirement, what alternative heuristics do we use + for instantiating a dvs repository? Suggestion: If a `.git` directory is not + available, then take current directory as the choice directory for initialization? 
+ +## CLI + +```shell +dvs --- Data version control and storage management system + +Usage: + dvs [OPTIONS] + +Commands: + init + add + get + status + audit + log + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) + --version Show version information +``` + +The initialization command will have further subcommands. + +```shell +dvs init --- Initialize a new DVS repository + +Usage: + dvs init [OPTIONS] + +Backends: + local Local, on-disk storage + fs File system storage (e.g. network file system (nfs)) + s3 S3 compatible storage + aws S3 hosted via AWS + +Options: + -h, --help Show help for command (e.g. `dvs init --help`) +``` + +### Local + +```shell +dvs init local --- Initialize a DVS repository via on-disk storage + +Usage: + dvs init local [OPTIONS] + +Required: + path to the local storage locations (e.g. `/data/`) + +Options: + --json + Output results as JSON + --metadata-folder-name + If you want to use a folder name other than `.dvs` for storing the metadata files + --permissions + Unix permissions for storage directory and files (octal, e.g., "770") + --group + Unix group to set on storage directory and files + --no-compression + Disable compression of stored files. Compression defaults to zstd + --no-compression + Disable compression of stored files. Compression defaults to zstd + -h, --help + Print help +``` + +## FS / NFS + + + +Example output: + +```shell +$ dvs init /data/ +DVS Repository created with storage path located at +``` + +## R function + +```r +dvs_init <- function( + storage_path = character(), # required + permissions = NULL, + group = NULL, + metadata_folder_name = NULL) +``` + +Example output: + +```r +> dvs_init() +> Error: `storage_path` is missing; Please provide a location to store dvs objects. 
+``` + +```r +> dvs_init("/data/projectA_storage") +> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage" +``` + +CLI users do not need the full path shown to them, but R users need that information. + +Different storage backends have to be initialized through specialized functions. + +- `dvs_init_local` with alias `dvs_init` +- `dvs_init_fs(...)` +- `dvs_init_s3(...)` +- `dvs_init_aws(...)` + +## Storage + +- (future) Multiple projects can be hosted within the same storage + +### Case: No project or specific work directory + +Considering the one off scripts that scientists might create, in which there is +no project surrounding where said script is. + +- (future) User/machine storage +- (future) A remote project +- (future) One off scripts + +## Journey 1: Initial Setup with defaults + +Expected outcomes: + +- `dvs.toml` created in the ancestral directory that contains `.git`, or other heuristics. +- shared dir created in specified path, with default permissions of 664 + +Known Caveats: + +- certain linux `umask` setups cause folders to have default permissions like 600, or 644 +where other collaborators could not write by default, therefore, + +### CLI flow + +1. initialize dvs from a project directory + +```bash +dvs init /data/dvs/example-proj +``` + +### R package flow + +1. Initialize DVS in the repo + +```r +dvs_init("/data/shared/project-x-dvs") +``` + +## Journey 2: Initial Setup with shared folder locked down to group + +- set permissions to writeable by group, not readable if not in group (660) +- group name projx + +Expected outcomes: + +- dvs.toml created in working directory +- shared dir created in specified path, with permissions of 660 and owned by group projx + +Edge cases: + +- group must resolve to known gid on system + +### CLI flow + +1. 
initialize dvs from a project directory + +```bash +dvs init /data/dvs/sensitive-projx --permissions "660" --group projx +``` + +### R package flow + +1. Initialize DVS in the repo + +```r +dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx") +``` + +#### Returns + +Return a rich data-frame that the end-user can then further subset/filter +to fit their needs. + +Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`. + +- [ ] New format: + - `absolute_path`: abbreviated when printed in R (pillar) + - `relative path`: full path + - `status`: ordered factor instead of `character()` + - `absent|unsync|sync|present|added` + - `checksum`: always abbreviated in print (pillar, first 5 characters) + - `size`: using units and not raw `double()/numeric()` + +## Data formats to track + +- `.csv` +- `.rds` +- don't track `.RDA` files, as they are a collection of datasets + +Configuration: Must add these filters to the `dvs.toml`. + +Known annoyance: Verbosity of this can be annoying. +There should be a way to reduce outputs on untracked data files available +to the user. + +# TODO (to be edited) + +Errors + +dvs_init could return any of the following error types: + +project already initialized: dvs_init has already been run with different initialization attributes. 
+ +git repository not found: dvs_init was run outside of a git repository + +storage directory input is not a directory: if input was an existing file + +storage directory absolute path not found: if the path could not be made absolute + +configuration file not created (dvs.yaml): failed to write to or save dvs.yaml + +linux primary group not found: if the group was inputted and it doesn't refer to a valid group + +storage directory not created: failed to create the storage directory + +linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions + +could not check if storage directory is empty: error reading the contents of the directory + +storage directory permissions not set: couldn't modify the permissions of the storage directory diff --git a/spec/journey-2-adding-data-files.md b/spec/journey-2-adding-data-files.md new file mode 100644 index 0000000..94a2ff2 --- /dev/null +++ b/spec/journey-2-adding-data-files.md @@ -0,0 +1,68 @@ +# Journey 2: Adding Data Files + +Goal: Version a newly created dataset so others can retrieve it. + +## CLI flow + +1. Produce the data (example) + + ```bash + # Your data pipeline or script writes: + # data/derived/pk_data.csv + ``` + +2. Add the file to DVS + + ```bash + dvs add data/derived/pk_data.csv --message "Initial PK dataset v1" + ``` + +3. Commit DVS metadata + + ```bash + git add data/derived/pk_data.csv.dvs data/derived/.gitignore + git commit -m "Add processed PK data" + git push + ``` + +4. Verify status + + ```bash + dvs status data/derived/pk_data.csv + ``` + +## R package flow + +1. Produce the data + + ```r + write.csv(pk_data, "data/derived/pk_data.csv") + ``` + +2. Add the file to DVS + + ```r + dvs_add("data/derived/pk_data.csv", message = "Initial PK dataset v1") + ``` + +3. Commit DVS metadata + + ```bash + git add data/derived/pk_data.csv.dvs data/derived/.gitignore + git commit -m "Add processed PK data" + git push + ``` + +4. 
Verify status + + ```r + dvs_status("data/derived/pk_data.csv") + ``` + +```r +# no dvs_init ran before +dvs_add("contingency_table2.csv") +``` + +In RStudio: Check if there is no active folder, then emit warning. +Similarly in VSCode and Positron, as both can be run without an active workspace. diff --git a/spec/journey-3-getting-latest-files.md b/spec/journey-3-getting-latest-files.md new file mode 100644 index 0000000..5d681d1 --- /dev/null +++ b/spec/journey-3-getting-latest-files.md @@ -0,0 +1,55 @@ +# Journey 3: Getting Latest Files + +Goal: Pull metadata from Git and restore the tracked data files. + +## CLI flow + +1. Pull the latest repo changes + + ```bash + git pull + ``` + +2. See what is missing + + ```bash + dvs status + ``` + +3. Restore tracked files + + ```bash + dvs get data/derived/* + ``` + +4. Verify everything is current + + ```bash + dvs status + ``` + +## R package flow + +1. Pull the latest repo changes + + ```bash + git pull + ``` + +2. See what is missing + + ```r + dvs_status() + ``` + +3. Restore tracked files + + ```r + dvs_get("data/derived/*") + ``` + +4. Verify everything is current + + ```r + dvs_status() + ``` diff --git a/spec/journey-4-updating-data-files.md b/spec/journey-4-updating-data-files.md new file mode 100644 index 0000000..a96d329 --- /dev/null +++ b/spec/journey-4-updating-data-files.md @@ -0,0 +1,90 @@ +# Journey 4: Updating Data Files + +Goal: Replace an existing tracked dataset with a new version. + +## CLI flow + +1. Re-run your processing to overwrite the data file + + ```bash + # Your data pipeline updates: + # data/derived/pk_data.csv + ``` + +2. Check status + + ```bash + dvs status data/derived/pk_data.csv + ``` + +3. Add the new version + + ```bash + dvs add data/derived/pk_data.csv --message "Updated PK dataset v2" + ``` + +4. Commit updated metadata + + ```bash + git add data/derived/pk_data.csv.dvs + git commit -m "Update PK data with new processing" + git push + ``` + +## R package flow + +1. 
Re-run your processing + + ```r + pk_data_v2 <- update_processing(pk_data) + write.csv(pk_data_v2, "data/derived/pk_data.csv") + ``` + +2. Check status + + ```r + dvs_status("data/derived/pk_data.csv") + ``` + +3. Add the new version + + ```r + dvs_add("data/derived/pk_data.csv", message = "Updated PK dataset v2") + ``` + +4. Commit updated metadata + + ```bash + git add data/derived/pk_data.csv.dvs + git commit -m "Update PK data with new processing" + git push + ``` + +## Journey 5: Updating data files with new rows + +New data following previous form might come up. Example is new rows from a clinical trial, +new participants in trials is added, however the scientists want them added to already +checked data files. + +```r +dvs_add("data/registry/participants.csv", "added information from the second batch of runs") +``` + +this ought to say + +```r +> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`" +``` + +Then, + +```r +dvs_add("data/registry/participants.csv", "added information from the second batch of runs", amend = TRUE) +``` + +could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated +to the level of the previous file, and then it can be known if this new event can supersede other add events, because we +know it is an addition. + +The hash itself cannot distinguish between a completely new file, or one with new bytes. In dvs, we only have current hash, +so we should consider adding this context via the user, i.e. by asking if it is an addition / amendment. diff --git a/spec/journey-5-working-with-multiple-files.md b/spec/journey-5-working-with-multiple-files.md new file mode 100644 index 0000000..bf7872e --- /dev/null +++ b/spec/journey-5-working-with-multiple-files.md @@ -0,0 +1,60 @@ +# Journey 5: Working with Multiple Files + +Goal: Add and retrieve batches of outputs with glob patterns. + +## CLI flow + +1. 
Produce multiple outputs + + ```bash + # Your data pipeline writes: + # data/derived/pk.csv + # data/derived/pd.csv + # data/derived/summary.csv + ``` + +2. Add all outputs at once + + ```bash + dvs add data/derived/*.csv --message "Analysis outputs batch 1" + ``` + +3. Retrieve all tracked files later + + ```bash + dvs get data/derived/*.csv + ``` + +4. Check status for everything + + ```bash + dvs status + ``` + +## R package flow + +1. Produce multiple outputs + + ```r + write.csv(pk_data, "data/derived/pk.csv") + write.csv(pd_data, "data/derived/pd.csv") + write.csv(summary_stats, "data/derived/summary.csv") + ``` + +2. Add all outputs at once + + ```r + dvs_add("data/derived/*.csv", message = "Analysis outputs batch 1") + ``` + +3. Retrieve all tracked files later + + ```r + dvs_get("data/derived/*.csv") + ``` + +4. Check status for everything + + ```r + dvs_status() + ``` diff --git a/spec/log.md b/spec/log.md new file mode 100644 index 0000000..79fcdfe --- /dev/null +++ b/spec/log.md @@ -0,0 +1,50 @@ +# `dvs log` + +Per file logging is inspected via `dvs log` / `dvs_log()`. For project-wide logging, we have `dvs audit` / `dvs_audit()`. + +## CLI + +The option to return `--json` must be present. + +```sh +# in a previously `dvs init` folder +$ dvs log data/derived/model_summary.txt +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" +``` + +```sh +$ dvs log --interval +[date -- duration since now] +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" + +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" + +Last edited on: 20-10-2020 +checksum: NNNNN +message: "Ran nonmem model on exposure assumptions" +[date -- 2x duration since now] +... +[date -- 3x duration since now] +... 
+``` + +``: `days`, `weeks`, `months` + +## R + +Signature: + +```r +dvs_log <- function( + since = NULL, + by_user = NULL, +) +``` + + diff --git a/spec/message.md b/spec/message.md new file mode 100644 index 0000000..0024275 --- /dev/null +++ b/spec/message.md @@ -0,0 +1,23 @@ +# `dvs_message` + +Goal: Add messages to files without re-hashing or replacing them. + +## CLI + +```sh +$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repititons" +Added message to `data/model_aaabb/model_summary.csv` +``` + +## R package + +```r +dvs_message <- function( + files = character(), + glob = character(), + ignore.case = NULL %||% !is.empty(glob), + fail = FALSE +) +``` + +`dvs_message` is a equivalent to an idempotent `dvs_add`-call. diff --git a/spec/remote_storage.md b/spec/remote_storage.md new file mode 100644 index 0000000..93694f6 --- /dev/null +++ b/spec/remote_storage.md @@ -0,0 +1,22 @@ +# dvs supported storage backends + +## A2-AI hosting + +storagemagic/Dumbledore/A2-AI Cloud + +## Custom DVS storage + +- [ ] (future) dvs server hosted by client. + +## Third party storage hosting + +### AMazon FSx + + + +### S3 + +### Sharepoint + +### + diff --git a/spec/revert.md b/spec/revert.md new file mode 100644 index 0000000..7ce26cc --- /dev/null +++ b/spec/revert.md @@ -0,0 +1,19 @@ +# `dvs revert` + +## CLI + +The option to return `--json` must be present. + +## R + +Signature: + +```r +dvs_revert <- function(path = ".", + commit_sha = integer(), + date = NULL, + before = NULL, # date | duration +) +``` + + diff --git a/spec/root.md b/spec/root.md new file mode 100644 index 0000000..9814ee9 --- /dev/null +++ b/spec/root.md @@ -0,0 +1,33 @@ +# `dvs root` + +Convenience utility for expert users + +Goal: Return the location of the dvs repository root anywhere. + +## CLI + +Not relevant. + +## R package + +Signature: + +```r +dvs_root <- function(...) 
+# alias +find_dvs_root <- dvs_root() +``` + +Convenience: + +```r +dvs_root("model_code") +# equivalent to +fs::join(dvs_root(), "model_code") +# or +file.path(dvs_root(), "model_code") +``` + +The use cases for this function is very limited. We assume heavy use of +`{here}`-package in dvs-based projects. But it could be a relevant convenience +function in certain, specific cases. diff --git a/spec/status.md b/spec/status.md new file mode 100644 index 0000000..f6111e7 --- /dev/null +++ b/spec/status.md @@ -0,0 +1,107 @@ +# `dvs` status + +Goal: Provide an overview of the changed data files and potential files to track +via the traced data file filters. + +## CLI + +The option to return `--json` must be present. + +```shell +$ dvs status --help +Status of the DVS repository + +Usage: + dvs status [FILTERS] [OPTIONS] + +Filters: + --current + --unsynced, --missing + --absent + --no-current + --no-unsynced + --no-absent + +Options: + -s, --state filter for states to retain + -i, --invert inverts the selection provided by `--state` + -h, --help Print help +``` + +When a filter is provided, only the selected state(s) are provided. + + + +```sh +dvs status + +Current files: + + +Changed files (unsynced): + new_scenario/model_spec.txt + +Untracked and followed files: + orignal_scenario/model_summary.txt + orignal_scenario/tab-0123.tsv + orignal_scenario/tab-0123b.tsv + orignal_scenario/tab-0123c.tsv +``` + +We do not need to display the user in unsynced files, as they are likely to be owned by the current user. + +## R + +Signature: + +```r +dvs_status <- function( + path = ".", + show_storage = FALSE, +) +``` + +- `show_storage`: + - Show location of storage(s) for the current dvs repository. + - Warn the user that they must not alter the state of + the storage directory. 
+ - (future) Show number of projects that the storage contains + +## Return format + +### CLI JSON format + + + +### R format + +Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum` + +Proposed format: + +- `absolute_path`: abbreviated when printed in R (pillar) +- `relative path`: full path +- `status`: ordered factor instead of `character()` + - `absent|unsync|sync|present|added` +- `checksum`: always abbreviated in print (pillar, first 5 characters) +- `size`: using units and not raw `double()/numeric()` + +## Data name format + +`dvs_status` should show untracked data files in the current dvs repository, if +tracking is specified. + +## Granularity + +We expect the end user to use `{dplyr}` in order to +filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. + +## Following Filters in Status + +`dvs_track(".csv")`: tracks all CSV files. + +`dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked files) + +`dvs_track("results/*.rds")`: glob on all r data that are saved in a specific directory. + +These should result in additions to `[following]` table in `dvs.toml`. See [Following Formats](tracking.md). diff --git a/spec/sync.md b/spec/sync.md new file mode 100644 index 0000000..e4e713b --- /dev/null +++ b/spec/sync.md @@ -0,0 +1,48 @@ +# `dvs sync` + +Goal: Provide a streamlined way to update a cloned dvs repository. + +Synchronization `sync` is an alias for `dvs get **/*`, meant as a +repository wide syncing from storage (local/remote). + +## CLI + +The option to return `--json` must be present. + +```sh +$ dvs sync +[status] [Last modified] [Message] +... ... ... +``` + +The sync subcommand should also be able to act as a repository revert, +and + +```sh +$ dvs sync --before +[status] [Last modified] [Message] +... ... ... 
+``` + +## R + +Signature: + +```r +dvs_sync <- function(path = ".", + by_folder = character(), + since = NULL, # date | duration (unit) + recurse = TRUE , +) +``` + +- `path` is a location within a dvs repository. + Not necessarily the root a dvs repository. +- `by_folder` allows to sync specific folders only +- `recurse` is whether to sync folders recursively + +### `recurse` + +When there is no `by_folder`, recurse will update the entire dvs repository, even if +current directory is a sub-directory in a dvs repository. The current location of the +user might be incidental to their intent with dvs. diff --git a/spec/trace.md b/spec/trace.md new file mode 100644 index 0000000..e69de29 diff --git a/spec/tracking.md b/spec/tracking.md new file mode 100644 index 0000000..6d6917e --- /dev/null +++ b/spec/tracking.md @@ -0,0 +1,79 @@ +# `dvs track` / `dvs_track` + +Goal: Purpose is to specify which files we ought to follow in dvs. + +User journey: + +- [ ] All the .csv files underneath a specific directory. +- [ ] All the .csv files that are less than 25 MB + +- File type +- Size filters + + + + + +## CLI + +```shell +$ dvs follow --help +Files that are followed by dvs when untracked. + +Usage: + dvs follow [COMMANDS] [OPTIONS] + +Commands: + add + list + audit + +Options: + -h, --help Show help for a command +``` + +`add` command: + + +`list` command: + + +`add` command: + + +## R package + +Support the following + +- `ext` which are following-filters based on file extensions, e.g. `"csx"`. +- `glob`: a glob that can enable matching files through their paths and file extension +- `regex`: a regular expression to match files through their full paths + +Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`. 
+ +The follow filter must support + +- `glob`, `ext`, `regex` field +- an optional `label` that can be used to identify which follow-filter matched a file +- file size qualifiers: + `file_size_gt` (file size greater than mask), + `file_size_lt` (file size less than mask) + +Example: + +```toml +[[follow]] +{ ext = "parquet" } +[[follow]] +{ glob = "data/**/*.csv", label? = "optional label" } +[[follow]] +{ regex = ".+tab[0-9].+", file_size_gt ="5MB" } # match all nonmem tab files sdtab001 patab001 .... over 5MB +[[follow]] +{ glob = "model/nonmem/**/*", file_size_gt = "10MB" } +``` + +## Matcher audit + +A helpful utility for end users is a way to figure out why a given file was followed +by dvs. To that end, the dvs track ought to display the matching filter next to every +followed file. From 73a2c07b2b96d2d47019b452e66385c4dde7392a Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:23:40 +0100 Subject: [PATCH 24/28] pushed reduced spec --- brainstorm/get.md | 2 +- spec/add.md | 9 --- spec/alias_git.md | 5 -- spec/audit.md | 8 +- spec/configuration.md | 6 +- spec/delete.md | 4 +- spec/dvs_last.md | 10 --- spec/enum_status.md | 4 +- spec/follow.md | 19 ----- spec/get.md | 4 +- spec/initialization.md | 54 +------------ ...-files.md => journey_adding_data_files.md} | 0 ...les.md => journey_getting_latest_files.md} | 0 ...ata-files.md => journey_updating_files.md} | 0 ...s.md => journey_working_multiple_files.md} | 0 spec/message.md | 23 ------ spec/remote_storage.md | 22 ------ spec/revert.md | 19 ----- spec/root.md | 33 -------- spec/status.md | 11 --- spec/sync.md | 2 +- spec/trace.md | 0 spec/tracking.md | 79 ------------------- 23 files changed, 14 insertions(+), 300 deletions(-) delete mode 100644 spec/alias_git.md delete mode 100644 spec/dvs_last.md delete mode 100644 spec/follow.md rename spec/{journey-2-adding-data-files.md => journey_adding_data_files.md} (100%) rename spec/{journey-3-getting-latest-files.md => journey_getting_latest_files.md} 
(100%) rename spec/{journey-4-updating-data-files.md => journey_updating_files.md} (100%) rename spec/{journey-5-working-with-multiple-files.md => journey_working_multiple_files.md} (100%) delete mode 100644 spec/message.md delete mode 100644 spec/remote_storage.md delete mode 100644 spec/revert.md delete mode 100644 spec/root.md delete mode 100644 spec/trace.md delete mode 100644 spec/tracking.md diff --git a/brainstorm/get.md b/brainstorm/get.md index 8f5741c..54231fa 100644 --- a/brainstorm/get.md +++ b/brainstorm/get.md @@ -9,7 +9,7 @@ The option to return `--json` must be present. Signature: ```r -dvs_get <- function(path = ".", +dvs_get <- function( files = character(), glob = character(), ignore.case = NULL %||% !is.empty(glob), diff --git a/spec/add.md b/spec/add.md index 23d9702..907190c 100644 --- a/spec/add.md +++ b/spec/add.md @@ -2,12 +2,6 @@ Goal: Add files to an initialized dvs repository. -- [ ] Currently the `message` is attached to all files checked in simultaneously. - dvs has a log and audit log to - illuminate "why" a change occurred in the data. - - - ## CLI - Assume that current directory is a dvs repository, both in cli and R-package. @@ -21,9 +15,6 @@ Signature: dvs_add <- function( files = character(), glob = character(), - ignore.case = NULL %||% !is.empty(glob), - overwrite = FALSE, - fail = FALSE ) ``` diff --git a/spec/alias_git.md b/spec/alias_git.md deleted file mode 100644 index c248ab9..0000000 --- a/spec/alias_git.md +++ /dev/null @@ -1,5 +0,0 @@ -# Alias dvs with get terminology - -- [ ] (future?) Should dvs-cli and dvs-rpkg have a --git-mode, where we -expose a git compatible interface to dvs, in order to -plug-in dvs as a git replacement? 
diff --git a/spec/audit.md b/spec/audit.md index c005291..403a082 100644 --- a/spec/audit.md +++ b/spec/audit.md @@ -20,17 +20,11 @@ $ dvs audit --since Signature: ```r -dvs_audit <- function( - since = NULL, # date | duration (unit) - by_user = character()) - +dvs_audit <- function() ``` - - ```r dvs_audit() - ``` ```r diff --git a/spec/configuration.md b/spec/configuration.md index 7649cc2..e0e9801 100644 --- a/spec/configuration.md +++ b/spec/configuration.md @@ -1,5 +1,7 @@ # `dvs.toml` +The configuration toml file should contain -Configuration should track which patterns are tracked [DVS Tracking](./tracking.md). - +- Backend +- The default compression +- Path to the storage directory diff --git a/spec/delete.md b/spec/delete.md index d9c4687..2b5c74c 100644 --- a/spec/delete.md +++ b/spec/delete.md @@ -14,9 +14,7 @@ $ dvs delete ```r dvs_delete <- function( files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE + glob = character() ) ``` diff --git a/spec/dvs_last.md b/spec/dvs_last.md deleted file mode 100644 index cd0a5e9..0000000 --- a/spec/dvs_last.md +++ /dev/null @@ -1,10 +0,0 @@ -# `dvs_last` - -Goal: provide users with the ability to retrieve the result of -the last executed dvs command within the r package. - -Example: Suppose after `dvs_add(by_folder = "data/derived/*")` was executed -an error occurred, and an overview is displayed as a data-frame. The user -got a R native result, a data-frame, but if the user wants to act on the -provided information, we might want to provide a `dvs_last` that contains -miscellaneous. 
diff --git a/spec/enum_status.md b/spec/enum_status.md index a051258..898b10b 100644 --- a/spec/enum_status.md +++ b/spec/enum_status.md @@ -5,7 +5,9 @@ -# TODO (editting needed) +# TODO (editing needed) + + relative_path: relative path to the file with respect to where the operation was called diff --git a/spec/follow.md b/spec/follow.md deleted file mode 100644 index f87c953..0000000 --- a/spec/follow.md +++ /dev/null @@ -1,19 +0,0 @@ -# `dvs track` / `dvs_track` - -Goal: Purpose is to specify which files we ought to follow in dvs. - -User journey: - -- [ ] All the .csv files underneath a specific directory. -- [ ] All the .csv files that are less than 25 MB - -- File type -- Size filters - -- [ ] MOSSA: We may want to not track too large files, even if they are .csv -- [ ] pre-hook 100mb limit see template-PMx-project-starter - -Cloned repositories do not have hooks! - -- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! -- [ ] \ No newline at end of file diff --git a/spec/get.md b/spec/get.md index 8f5741c..6300b0f 100644 --- a/spec/get.md +++ b/spec/get.md @@ -9,11 +9,9 @@ The option to return `--json` must be present. Signature: ```r -dvs_get <- function(path = ".", +dvs_get <- function( files = character(), glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE # follows fs::dir_ls ) ``` diff --git a/spec/initialization.md b/spec/initialization.md index c552e69..654120c 100644 --- a/spec/initialization.md +++ b/spec/initialization.md @@ -10,14 +10,10 @@ read access to those not in the group. ## User site assumptions -- Always operating within a repository/project/workspace. -- A dvs repository need not fall under a git or any other vcs repository +- Always operating within a repository/project/workspace, and will initialize + at current working directory. - Storage is detached from repository root -- [ ] If `git` is not a requirement, what alternative heuristics do we use - for instantiating a dvs repository? 
Suggestion: If a `.git` directory is not - available, then take current directory as the choice directory for initialization? - ## CLI ```shell @@ -85,10 +81,6 @@ Options: Print help ``` -## FS / NFS - - - Example output: ```shell @@ -123,22 +115,6 @@ CLI users do not need the full path shown to them, but R users need that informa Different storage backends have to be initialized through specialized functions. - `dvs_init_local` with alias `dvs_init` -- `dvs_init_fs(...)` -- `dvs_init_s3(...)` -- `dvs_init_aws(...)` - -## Storage - -- (future) Multiple projects can be hosted within the same storage - -### Case: No project or specific work directory - -Considering the one off scripts that scientists might create, in which there is -no project surrounding where said script is. - -- (future) User/machine storage -- (future) A remote project -- (future) One off scripts ## Journey 1: Initial Setup with defaults @@ -224,29 +200,3 @@ Configuration: Must add these filters to the `dvs.toml`. Known annoyance: Verbosity of this can be annoying. There should be a way to reduce outputs on untracked data files available to the user. - -# TODO (to be edited) - -Errors - -dvs_init could return any of the following error types: - -project already initialized: dvs_init has already been run with different initialization attributes. 
- -git repository not found: dvs_init was run outside of a git repository - -storage directory input is not a directory: if input was an existing file - -storage directory absolute path not found: if the path could not be made absolute - -configuration file not created (dvs.yaml): failed to write to or save dvs.yaml - -linux primary group not found: if the group was inputted and it doesn't refer to a valid group - -storage directory not created: failed to create the storage directory - -linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions - -could not check if storage directory is empty: error reading the contents of the directory - -storage directory permissions not set: couldn't modify the permissions of the storage directory diff --git a/spec/journey-2-adding-data-files.md b/spec/journey_adding_data_files.md similarity index 100% rename from spec/journey-2-adding-data-files.md rename to spec/journey_adding_data_files.md diff --git a/spec/journey-3-getting-latest-files.md b/spec/journey_getting_latest_files.md similarity index 100% rename from spec/journey-3-getting-latest-files.md rename to spec/journey_getting_latest_files.md diff --git a/spec/journey-4-updating-data-files.md b/spec/journey_updating_files.md similarity index 100% rename from spec/journey-4-updating-data-files.md rename to spec/journey_updating_files.md diff --git a/spec/journey-5-working-with-multiple-files.md b/spec/journey_working_multiple_files.md similarity index 100% rename from spec/journey-5-working-with-multiple-files.md rename to spec/journey_working_multiple_files.md diff --git a/spec/message.md b/spec/message.md deleted file mode 100644 index 0024275..0000000 --- a/spec/message.md +++ /dev/null @@ -1,23 +0,0 @@ -# `dvs_message` - -Goal: Add messages to files without re-hashing or replacing them. 
- -## CLI - -```sh -$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repititons" -Added message to `data/model_aaabb/model_summary.csv` -``` - -## R package - -```r -dvs_message <- function( - files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE -) -``` - -`dvs_message` is a equivalent to an idempotent `dvs_add`-call. diff --git a/spec/remote_storage.md b/spec/remote_storage.md deleted file mode 100644 index 93694f6..0000000 --- a/spec/remote_storage.md +++ /dev/null @@ -1,22 +0,0 @@ -# dvs supported storage backends - -## A2-AI hosting - -storagemagic/Dumbledore/A2-AI Cloud - -## Custom DVS storage - -- [ ] (future) dvs server hosted by client. - -## Third party storage hosting - -### AMazon FSx - - - -### S3 - -### Sharepoint - -### - diff --git a/spec/revert.md b/spec/revert.md deleted file mode 100644 index 7ce26cc..0000000 --- a/spec/revert.md +++ /dev/null @@ -1,19 +0,0 @@ -# `dvs revert` - -## CLI - -The option to return `--json` must be present. - -## R - -Signature: - -```r -dvs_revert <- function(path = ".", - commit_sha = integer(), - date = NULL, - before = NULL, # date | duration -) -``` - - diff --git a/spec/root.md b/spec/root.md deleted file mode 100644 index 9814ee9..0000000 --- a/spec/root.md +++ /dev/null @@ -1,33 +0,0 @@ -# `dvs root` - -Convenience utility for expert users - -Goal: Return the location of the dvs repository root anywhere. - -## CLI - -Not relevant. - -## R package - -Signature: - -```r -dvs_root <- function(...) -# alias -find_dvs_root <- dvs_root() -``` - -Convenience: - -```r -dvs_root("model_code") -# equivalent to -fs::join(dvs_root(), "model_code") -# or -file.path(dvs_root(), "model_code") -``` - -The use cases for this function is very limited. We assume heavy use of -`{here}`-package in dvs-based projects. But it could be a relevant convenience -function in certain, specific cases. 
diff --git a/spec/status.md b/spec/status.md index f6111e7..a4365a5 100644 --- a/spec/status.md +++ b/spec/status.md @@ -56,7 +56,6 @@ Signature: ```r dvs_status <- function( - path = ".", show_storage = FALSE, ) ``` @@ -95,13 +94,3 @@ tracking is specified. We expect the end user to use `{dplyr}` in order to filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. - -## Following Filters in Status - -`dvs_track(".csv")`: tracks all CSV files. - -`dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked files) - -`dvs_track("results/*.rds")`: glob on all r data that are saved in a specific directory. - -These should result in additions to `[following]` table in `dvs.toml`. See [Following Formats](tracking.md). diff --git a/spec/sync.md b/spec/sync.md index e4e713b..ab55301 100644 --- a/spec/sync.md +++ b/spec/sync.md @@ -29,7 +29,7 @@ $ dvs sync --before Signature: ```r -dvs_sync <- function(path = ".", +dvs_sync <- function( by_folder = character(), since = NULL, # date | duration (unit) recurse = TRUE , diff --git a/spec/trace.md b/spec/trace.md deleted file mode 100644 index e69de29..0000000 diff --git a/spec/tracking.md b/spec/tracking.md deleted file mode 100644 index 6d6917e..0000000 --- a/spec/tracking.md +++ /dev/null @@ -1,79 +0,0 @@ -# `dvs track` / `dvs_track` - -Goal: Purpose is to specify which files we ought to follow in dvs. - -User journey: - -- [ ] All the .csv files underneath a specific directory. -- [ ] All the .csv files that are less than 25 MB - -- File type -- Size filters - - - - - -## CLI - -```shell -$ dvs follow --help -Files that are followed by dvs when untracked. 
- -Usage: - dvs follow [COMMANDS] [OPTIONS] - -Commands: - add - list - audit - -Options: - -h, --help Show help for a command -``` - -`add` command: - - -`list` command: - - -`add` command: - - -## R package - -Support the following - -- `ext` which are following-filters based on file extensions, e.g. `"csx"`. -- `glob`: a glob that can enable matching files through their paths and file extension -- `regex`: a regular expression to match files through their full paths - -Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`. - -The follow filter must support - -- `glob`, `ext`, `regex` field -- an optional `label` that can be used to identify which follow-filter matched a file -- file size qualifiers: - `file_size_gt` (file size greater than mask), - `file_size_lt` (file size less than mask) - -Example: - -```toml -[[follow]] -{ ext = "parquet" } -[[follow]] -{ glob = "data/**/*.csv", label? = "optional label" } -[[follow]] -{ regex = ".+tab[0-9].+", file_size_gt ="5MB" } # match all nonmem tab files sdtab001 patab001 .... over 5MB -[[follow]] -{ glob = "model/nonmem/**/*", file_size_gt = "10MB" } -``` - -## Matcher audit - -A helpful utility for end users is a way to figure out why a given file was followed -by dvs. To that end, the dvs track ought to display the matching filter next to every -followed file. From 12ac1f32a90158560f9fa2b26fb3273d2293b319 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:27:15 +0100 Subject: [PATCH 25/28] removed nonsense `path = "."` (again!!) 
--- brainstorm/enum_status.md | 2 +- brainstorm/message.md | 2 +- brainstorm/remote_storage.md | 21 +-------------------- brainstorm/revert.md | 2 +- brainstorm/root.md | 2 +- brainstorm/status.md | 1 - brainstorm/sync.md | 2 +- 7 files changed, 6 insertions(+), 26 deletions(-) diff --git a/brainstorm/enum_status.md b/brainstorm/enum_status.md index a051258..611ae46 100644 --- a/brainstorm/enum_status.md +++ b/brainstorm/enum_status.md @@ -5,7 +5,7 @@ -# TODO (editting needed) +# TODO (editing needed) relative_path: relative path to the file with respect to where the operation was called diff --git a/brainstorm/message.md b/brainstorm/message.md index 0024275..283a92f 100644 --- a/brainstorm/message.md +++ b/brainstorm/message.md @@ -5,7 +5,7 @@ Goal: Add messages to files without re-hashing or replacing them. ## CLI ```sh -$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repititons" +$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repetitions" Added message to `data/model_aaabb/model_summary.csv` ``` diff --git a/brainstorm/remote_storage.md b/brainstorm/remote_storage.md index 93694f6..ed6f055 100644 --- a/brainstorm/remote_storage.md +++ b/brainstorm/remote_storage.md @@ -1,22 +1,3 @@ # dvs supported storage backends -## A2-AI hosting - -storagemagic/Dumbledore/A2-AI Cloud - -## Custom DVS storage - -- [ ] (future) dvs server hosted by client. - -## Third party storage hosting - -### AMazon FSx - - - -### S3 - -### Sharepoint - -### - +File systems, S3, and S3 hosted by AWS. diff --git a/brainstorm/revert.md b/brainstorm/revert.md index 7ce26cc..b8e6151 100644 --- a/brainstorm/revert.md +++ b/brainstorm/revert.md @@ -9,7 +9,7 @@ The option to return `--json` must be present. 
Signature: ```r -dvs_revert <- function(path = ".", +dvs_revert <- function( commit_sha = integer(), date = NULL, before = NULL, # date | duration diff --git a/brainstorm/root.md b/brainstorm/root.md index 9814ee9..dcdc4b0 100644 --- a/brainstorm/root.md +++ b/brainstorm/root.md @@ -6,7 +6,7 @@ Goal: Return the location of the dvs repository root anywhere. ## CLI -Not relevant. + ## R package diff --git a/brainstorm/status.md b/brainstorm/status.md index f6111e7..eee4104 100644 --- a/brainstorm/status.md +++ b/brainstorm/status.md @@ -56,7 +56,6 @@ Signature: ```r dvs_status <- function( - path = ".", show_storage = FALSE, ) ``` diff --git a/brainstorm/sync.md b/brainstorm/sync.md index e4e713b..ab55301 100644 --- a/brainstorm/sync.md +++ b/brainstorm/sync.md @@ -29,7 +29,7 @@ $ dvs sync --before Signature: ```r -dvs_sync <- function(path = ".", +dvs_sync <- function( by_folder = character(), since = NULL, # date | duration (unit) recurse = TRUE , From 96a9b2becfd1a93108fe60777f4779f9db29cdd0 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:40:54 +0100 Subject: [PATCH 26/28] notes from devin --- spec/add.md | 3 --- spec/delete.md | 2 +- spec/initialization.md | 15 +++++---------- 3 files changed, 6 insertions(+), 14 deletions(-) diff --git a/spec/add.md b/spec/add.md index 907190c..333f536 100644 --- a/spec/add.md +++ b/spec/add.md @@ -20,9 +20,6 @@ dvs_add <- function( ## Compression -If the added file exceeds a certain threshold, the -R package should provide suggest compressing the recently added file. 
- - [ ] `getOption(dvs.large_file_size = integer()`) - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) diff --git a/spec/delete.md b/spec/delete.md index 2b5c74c..cbc22db 100644 --- a/spec/delete.md +++ b/spec/delete.md @@ -1,6 +1,6 @@ # `dvs delete` -Goal: Remove tracked files. +Goal: tracked files. ## CLI diff --git a/spec/initialization.md b/spec/initialization.md index 654120c..2c3e89e 100644 --- a/spec/initialization.md +++ b/spec/initialization.md @@ -44,22 +44,19 @@ Usage: dvs init [OPTIONS] Backends: - local Local, on-disk storage - fs File system storage (e.g. network file system (nfs)) - s3 S3 compatible storage - aws S3 hosted via AWS + fs Local, on-disk storage backend Options: -h, --help Show help for command (e.g. `dvs init --help`) ``` -### Local +### fs ```shell -dvs init local --- Initialize a DVS repository via on-disk storage +dvs init fs --- Initialize a DVS repository via on-disk storage Usage: - dvs init local [OPTIONS] + dvs init fs [OPTIONS] Required: path to the local storage locations (e.g. `/data/`) @@ -75,8 +72,6 @@ Options: Unix group to set on storage directory and files --no-compression Disable compression of stored files. Compression defaults to zstd - --no-compression - Disable compression of stored files. Compression defaults to zstd -h, --help Print help ``` @@ -114,7 +109,7 @@ CLI users do not need the full path shown to them, but R users need that informa Different storage backends have to be initialized through specialized functions. 
-- `dvs_init_local` with alias `dvs_init` +- `dvs_init_fs` with alias `dvs_init` ## Journey 1: Initial Setup with defaults From 191cef1eeaaea83251d23c601012c7c728ed91ac Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 15:59:08 +0100 Subject: [PATCH 27/28] remove spec directory changes --- spec/add.md | 33 ----- spec/audit.md | 32 ---- spec/configuration.md | 7 - spec/delete.md | 27 ---- spec/enum_status.md | 41 ----- spec/get.md | 18 --- spec/initialization.md | 197 ------------------------- spec/journey_adding_data_files.md | 68 --------- spec/journey_getting_latest_files.md | 55 ------- spec/journey_updating_files.md | 90 ----------- spec/journey_working_multiple_files.md | 60 -------- spec/log.md | 50 ------- spec/status.md | 96 ------------ spec/sync.md | 48 ------ 14 files changed, 822 deletions(-) delete mode 100644 spec/add.md delete mode 100644 spec/audit.md delete mode 100644 spec/configuration.md delete mode 100644 spec/delete.md delete mode 100644 spec/enum_status.md delete mode 100644 spec/get.md delete mode 100644 spec/initialization.md delete mode 100644 spec/journey_adding_data_files.md delete mode 100644 spec/journey_getting_latest_files.md delete mode 100644 spec/journey_updating_files.md delete mode 100644 spec/journey_working_multiple_files.md delete mode 100644 spec/log.md delete mode 100644 spec/status.md delete mode 100644 spec/sync.md diff --git a/spec/add.md b/spec/add.md deleted file mode 100644 index 333f536..0000000 --- a/spec/add.md +++ /dev/null @@ -1,33 +0,0 @@ -# `dvs add` - -Goal: Add files to an initialized dvs repository. - -## CLI - -- Assume that current directory is a dvs repository, both in cli and R-package. -- The option to return `--json` must be present. 
- -## R - -Signature: - -```r -dvs_add <- function( - files = character(), - glob = character(), -) -``` - -## Compression - -- [ ] `getOption(dvs.large_file_size = integer()`) - - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - -Advice compression when - -- a single size exceeds size thresholds -- a directory of files exceeds size thresholds - -There are cases where individual files are not large, but the collection of files -starts to amount to a large amount, presumably too large to track. diff --git a/spec/audit.md b/spec/audit.md deleted file mode 100644 index 403a082..0000000 --- a/spec/audit.md +++ /dev/null @@ -1,32 +0,0 @@ -# `dvs audit` - -Goal: Provide a repository wide log of dvs tracked files. - -## CLI - -The option to return `--json` must be present. - -```sh -$ dvs audit -[Date] [User] [+{files} -{files}] [Message] -``` - -```sh -$ dvs audit --since -``` - -## R - -Signature: - -```r -dvs_audit <- function() -``` - -```r -dvs_audit() -``` - -```r -dvs_audit(since = NULL) -``` diff --git a/spec/configuration.md b/spec/configuration.md deleted file mode 100644 index e0e9801..0000000 --- a/spec/configuration.md +++ /dev/null @@ -1,7 +0,0 @@ -# `dvs.toml` - -The configuration toml file should contain - -- Backend -- The default compression -- Path to the storage directory diff --git a/spec/delete.md b/spec/delete.md deleted file mode 100644 index cbc22db..0000000 --- a/spec/delete.md +++ /dev/null @@ -1,27 +0,0 @@ -# `dvs delete` - -Goal: tracked files. - -## CLI - -```shell -$ dvs delete - -``` - -## R package - -```r -dvs_delete <- function( - files = character(), - glob = character() -) -``` - -Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`. - -- `files`: list of files that are to be deleted. 
- -### Non-existing files - -Emit a warning, but still remove the files that do exist and are tracked. diff --git a/spec/enum_status.md b/spec/enum_status.md deleted file mode 100644 index 898b10b..0000000 --- a/spec/enum_status.md +++ /dev/null @@ -1,41 +0,0 @@ -# Configuration: Status - -- current | absent | unsynced -- tracked file that is un-added - - - -# TODO (editing needed) - - - - relative_path: relative path to the file with respect to where the operation was called - - status: (doesn’t include error status) - - current: the file is present in the project directory and matches the version in the storage directory - - absent: the file isn't present in the project directory - - unsynced: the file is present in the project directory, but doesn't match the version on in the storage directory - - file_size_bytes: current size of the file in bytes - - time_stamp: the ISO 8601 Zulu time of the most recent file version in the storage directory - - saved_by: the user who uploaded the most recent file version in the storage directory - - message: the message inputted to the dvs_add command that added the most recent file version in the storage directory - - blake3_checksum: hash of the file via the blake3 algorithm - - absolute_path: canonicalized path of the file - input: - - If inputted explicitly via file glob or path: the file name - - if inputted implicitly via dvs_status() (without input): NA - -error: if the outcome was error, the error type, else NA - -error message: if the outcome was error, the error message (if there was one), else NA diff --git a/spec/get.md b/spec/get.md deleted file mode 100644 index 6300b0f..0000000 --- a/spec/get.md +++ /dev/null @@ -1,18 +0,0 @@ -# `dvs get` - -## CLI - -The option to return `--json` must be present. 
- -## R - -Signature: - -```r -dvs_get <- function( - files = character(), - glob = character(), -) -``` - - diff --git a/spec/initialization.md b/spec/initialization.md deleted file mode 100644 index 2c3e89e..0000000 --- a/spec/initialization.md +++ /dev/null @@ -1,197 +0,0 @@ -# dvs initialization / `dvs init` / `dvs_init` - -Goal: Prepare shared storage and initialize DVS in directory - -dvs initialization will create a `dvs.toml` and a directory as specified by the -shared area in the init command. The shared dir may also need to `chown` the directory -to specify certain permissions. For example, for sensitive projects, setting -ownership to a particular group, allowing write access for the group, and limiting -read access to those not in the group. - -## User site assumptions - -- Always operating within a repository/project/workspace, and will initialize - at current working directory. -- Storage is detached from repository root - -## CLI - -```shell -dvs --- Data version control and storage management system - -Usage: - dvs [OPTIONS] - -Commands: - init - add - get - status - audit - log - -Options: - -h, --help Show help for command (e.g. `dvs init --help`) - --version Show version information -``` - -The initialization command will have further subcommands. - -```shell -dvs init --- Initialize a new DVS repository - -Usage: - dvs init [OPTIONS] - -Backends: - fs Local, on-disk storage backend - -Options: - -h, --help Show help for command (e.g. `dvs init --help`) -``` - -### fs - -```shell -dvs init fs --- Initialize a DVS repository via on-disk storage - -Usage: - dvs init fs [OPTIONS] - -Required: - path to the local storage locations (e.g. 
`/data/`) - -Options: - --json - Output results as JSON - --metadata-folder-name - If you want to use a folder name other than `.dvs` for storing the metadata files - --permissions - Unix permissions for storage directory and files (octal, e.g., "770") - --group - Unix group to set on storage directory and files - --no-compression - Disable compression of stored files. Compression defaults to zstd - -h, --help - Print help -``` - -Example output: - -```shell -$ dvs init /data/ -DVS Repository created with storage path located at -``` - -## R function - -```r -dvs_init <- function( - storage_path = character(), # required - permissions = NULL, - group = NULL, - metadata_folder_name = NULL) -``` - -Example output: - -```r -> dvs_init() -> Error: `storage_path` is missing; Please provide a location to store dvs objects. -``` - -```r -> dvs_init("/data/projectA_storage") -> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage" -``` - -CLI users do not need the full path shown to them, but R users need that information. - -Different storage backends have to be initialized through specialized functions. - -- `dvs_init_fs` with alias `dvs_init` - -## Journey 1: Initial Setup with defaults - -Expected outcomes: - -- `dvs.toml` created in the ancestral directory that contains `.git`, or other heuristics. -- shared dir created in specified path, with default permissions of 664 - -Known Caveats: - -- certain linux `umask` setups cause folders to have default permissions like 600, or 644 -where other collaborators could not write by default, therefore, - -### CLI flow - -1. initialize dvs from a project directory - -```bash -dvs init /data/dvs/example-proj -``` - -### R package flow - -1. 
Initialize DVS in the repo - -```r -dvs_init("/data/shared/project-x-dvs") -``` - -## Journey 2: Initial Setup with shared folder locked down to group - -- set permissions to writeable by group, not readable if not in group (660) -- group name projx - -Expected outcomes: - -- dvs.toml created in working directory -- shared dir created in specified path, with permissions of 660 and owned by group projx - -Edge cases: - -- group must resolve to known gid on system - -### CLI flow - -1. initialize dvs from a project directory - -```bash -dvs init /data/dvs/sensitive-projx --permissions "660" --group projx -``` - -### R package flow - -1. Initialize DVS in the repo - -```r -dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx") -``` - -#### Returns - -Return a rich data-frame that the end-user can then further subset/filter -to fit their needs. - -Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`. - -- [ ] New format: - - `absolute_path`: abbreviated when printed in R (pillar) - - `relative path`: full path - - `status`: ordered factor instead of `character()` - - `absent|unsync|sync|present|added` - - `checksum`: always abbreviated in print (pillar, first 5 characters) - - `size`: using units and not raw `double()/numeric()` - -## Data formats to track - -- `.csv` -- `.rds` -- don't track `.RDA` files, as they are a collection of datasets - -Configuration: Must add these filters to the `dvs.toml`. - -Known annoyance: Verbosity of this can be annoying. -There should be a way to reduce outputs on untracked data files available -to the user. diff --git a/spec/journey_adding_data_files.md b/spec/journey_adding_data_files.md deleted file mode 100644 index 94a2ff2..0000000 --- a/spec/journey_adding_data_files.md +++ /dev/null @@ -1,68 +0,0 @@ -# Journey 2: Adding Data Files - -Goal: Version a newly created dataset so others can retrieve it. - -## CLI flow - -1. 
Produce the data (example) - - ```bash - # Your data pipeline or script writes: - # data/derived/pk_data.csv - ``` - -2. Add the file to DVS - - ```bash - dvs add data/derived/pk_data.csv --message "Initial PK dataset v1" - ``` - -3. Commit DVS metadata - - ```bash - git add data/derived/pk_data.csv.dvs data/derived/.gitignore - git commit -m "Add processed PK data" - git push - ``` - -4. Verify status - - ```bash - dvs status data/derived/pk_data.csv - ``` - -## R package flow - -1. Produce the data - - ```r - write.csv(pk_data, "data/derived/pk_data.csv") - ``` - -2. Add the file to DVS - - ```r - dvs_add("data/derived/pk_data.csv", message = "Initial PK dataset v1") - ``` - -3. Commit DVS metadata - - ```bash - git add data/derived/pk_data.csv.dvs data/derived/.gitignore - git commit -m "Add processed PK data" - git push - ``` - -4. Verify status - - ```r - dvs_status("data/derived/pk_data.csv") - ``` - -```r -# no dvs_init ran before -dvs_add("contingency_table2.csv") -``` - -In RStudio: Check if there is no active folder, then emit warning. -Similarly in VSCode and Positron, as both can be run without an active workspace. diff --git a/spec/journey_getting_latest_files.md b/spec/journey_getting_latest_files.md deleted file mode 100644 index 5d681d1..0000000 --- a/spec/journey_getting_latest_files.md +++ /dev/null @@ -1,55 +0,0 @@ -# Journey 3: Getting Latest Files - -Goal: Pull metadata from Git and restore the tracked data files. - -## CLI flow - -1. Pull the latest repo changes - - ```bash - git pull - ``` - -2. See what is missing - - ```bash - dvs status - ``` - -3. Restore tracked files - - ```bash - dvs get data/derived/* - ``` - -4. Verify everything is current - - ```bash - dvs status - ``` - -## R package flow - -1. Pull the latest repo changes - - ```bash - git pull - ``` - -2. See what is missing - - ```r - dvs_status() - ``` - -3. Restore tracked files - - ```r - dvs_get("data/derived/*") - ``` - -4. 
Verify everything is current - - ```r - dvs_status() - ``` diff --git a/spec/journey_updating_files.md b/spec/journey_updating_files.md deleted file mode 100644 index a96d329..0000000 --- a/spec/journey_updating_files.md +++ /dev/null @@ -1,90 +0,0 @@ -# Journey 4: Updating Data Files - -Goal: Replace an existing tracked dataset with a new version. - -## CLI flow - -1. Re-run your processing to overwrite the data file - - ```bash - # Your data pipeline updates: - # data/derived/pk_data.csv - ``` - -2. Check status - - ```bash - dvs status data/derived/pk_data.csv - ``` - -3. Add the new version - - ```bash - dvs add data/derived/pk_data.csv --message "Updated PK dataset v2" - ``` - -4. Commit updated metadata - - ```bash - git add data/derived/pk_data.csv.dvs - git commit -m "Update PK data with new processing" - git push - ``` - -## R package flow - -1. Re-run your processing - - ```r - pk_data_v2 <- update_processing(pk_data) - write.csv(pk_data_v2, "data/derived/pk_data.csv") - ``` - -2. Check status - - ```r - dvs_status("data/derived/pk_data.csv") - ``` - -3. Add the new version - - ```r - dvs_add("data/derived/pk_data.csv", message = "Updated PK dataset v2") - ``` - -4. Commit updated metadata - - ```bash - git add data/derived/pk_data.csv.dvs - git commit -m "Update PK data with new processing" - git push - ``` - -## Journey 5: Updating data files with new rows - -New data following previous form might come up. Example is new rows from a clinical trial, -new participants in trials is added, however the scientists want them added to already -checked data files. 
- -```r -dvs_add("data/registry/participants.csv", "added information from the second batch of runs") -``` - -this ought to say - -```r -> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`" -``` - -Then, - -```r -dvs_add("data/registry/participants.csv", "added information from the second batch of runs", amend = TRUE) -``` - -could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated -to the level of the previous file, and then it can be known if this new event can supersede other add events, because we -know it is an addition. - -The hash itself cannot distinguish between a completely new file, or one with new bytes. In dvs, we only have current hash, -so we should consider adding this context via the user, i.e. by asking if it is an addition / amendment. diff --git a/spec/journey_working_multiple_files.md b/spec/journey_working_multiple_files.md deleted file mode 100644 index bf7872e..0000000 --- a/spec/journey_working_multiple_files.md +++ /dev/null @@ -1,60 +0,0 @@ -# Journey 5: Working with Multiple Files - -Goal: Add and retrieve batches of outputs with glob patterns. - -## CLI flow - -1. Produce multiple outputs - - ```bash - # Your data pipeline writes: - # data/derived/pk.csv - # data/derived/pd.csv - # data/derived/summary.csv - ``` - -2. Add all outputs at once - - ```bash - dvs add data/derived/*.csv --message "Analysis outputs batch 1" - ``` - -3. Retrieve all tracked files later - - ```bash - dvs get data/derived/*.csv - ``` - -4. Check status for everything - - ```bash - dvs status - ``` - -## R package flow - -1. Produce multiple outputs - - ```r - write.csv(pk_data, "data/derived/pk.csv") - write.csv(pd_data, "data/derived/pd.csv") - write.csv(summary_stats, "data/derived/summary.csv") - ``` - -2. Add all outputs at once - - ```r - dvs_add("data/derived/*.csv", message = "Analysis outputs batch 1") - ``` - -3. 
Retrieve all tracked files later - - ```r - dvs_get("data/derived/*.csv") - ``` - -4. Check status for everything - - ```r - dvs_status() - ``` diff --git a/spec/log.md b/spec/log.md deleted file mode 100644 index 79fcdfe..0000000 --- a/spec/log.md +++ /dev/null @@ -1,50 +0,0 @@ -# `dvs log` - -Per file logging is inspected via `dvs log` / `dvs_log()`. For project-wide logging, we have `dvs audit` / `dvs_audit()`. - -## CLI - -The option to return `--json` must be present. - -```sh -# in a previously `dvs init` folder -$ dvs log data/derived/model_summary.txt -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" -``` - -```sh -$ dvs log --interval -[date -- duration since now] -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" - -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" - -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" -[date -- 2x duration since now] -... -[date -- 3x duration since now] -... -``` - -``: `days`, `weeks`, `months` - -## R - -Signature: - -```r -dvs_log <- function( - since = NULL, - by_user = NULL, -) -``` - - diff --git a/spec/status.md b/spec/status.md deleted file mode 100644 index a4365a5..0000000 --- a/spec/status.md +++ /dev/null @@ -1,96 +0,0 @@ -# `dvs` status - -Goal: Provide an overview of the changed data files and potential files to track -via the traced data file filters. - -## CLI - -The option to return `--json` must be present. 
- -```shell -$ dvs status --help -Status of the DVS repository - -Usage: - dvs status [FILTERS] [OPTIONS] - -Filters: - --current - --unsynced, --missing - --absent - --no-current - --no-unsynced - --no-absent - -Options: - -s, --state filter for states to retain - -i, --invert inverts the selection provided by `--state` - -h, --help Print help -``` - -When a filter is provided, only the selected state(s) are provided. - - - -```sh -dvs status - -Current files: - - -Changed files (unsynced): - new_scenario/model_spec.txt - -Untracked and followed files: - orignal_scenario/model_summary.txt - orignal_scenario/tab-0123.tsv - orignal_scenario/tab-0123b.tsv - orignal_scenario/tab-0123c.tsv -``` - -We do not need to display the user in unsynced files, as they are likely to be owned by the current user. - -## R - -Signature: - -```r -dvs_status <- function( - show_storage = FALSE, -) -``` - -- `show_storage`: - - Show location of storage(s) for the current dvs repository. - - Warn the user that they must not alter the state of - the storage directory. - - (future) Show number of projects that the storage contains - -## Return format - -### CLI JSON format - - - -### R format - -Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum` - -Proposed format: - -- `absolute_path`: abbreviated when printed in R (pillar) -- `relative path`: full path -- `status`: ordered factor instead of `character()` - - `absent|unsync|sync|present|added` -- `checksum`: always abbreviated in print (pillar, first 5 characters) -- `size`: using units and not raw `double()/numeric()` - -## Data name format - -`dvs_status` should show untracked data files in the current dvs repository, if -tracking is specified. - -## Granularity - -We expect the end user to use `{dplyr}` in order to -filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. 
diff --git a/spec/sync.md b/spec/sync.md deleted file mode 100644 index ab55301..0000000 --- a/spec/sync.md +++ /dev/null @@ -1,48 +0,0 @@ -# `dvs sync` - -Goal: Provide a streamlined way to update a cloned dvs repository. - -Synchronization `sync` is an alias for `dvs get **/*`, meant as a -repository wide syncing from storage (local/remote). - -## CLI - -The option to return `--json` must be present. - -```sh -$ dvs sync -[status] [Last modified] [Message] -... ... ... -``` - -The sync subcommand should also be able to act as a repository revert, -and - -```sh -$ dvs sync --before -[status] [Last modified] [Message] -... ... ... -``` - -## R - -Signature: - -```r -dvs_sync <- function( - by_folder = character(), - since = NULL, # date | duration (unit) - recurse = TRUE , -) -``` - -- `path` is a location within a dvs repository. - Not necessarily the root a dvs repository. -- `by_folder` allows to sync specific folders only -- `recurse` is whether to sync folders recursively - -### `recurse` - -When there is no `by_folder`, recurse will update the entire dvs repository, even if -current directory is a sub-directory in a dvs repository. The current location of the -user might be incidental to their intent with dvs. 
From c299667a428fd18fc34c0dde7cde1f0af0b60844 Mon Sep 17 00:00:00 2001 From: Mossa Date: Wed, 18 Feb 2026 16:07:03 +0100 Subject: [PATCH 28/28] revert ui directory changes --- ui/add.md | 45 ------ ui/alias_git.md | 5 - ui/audit.md | 38 ----- ui/configuration.md | 5 - ui/delete.md | 29 ---- ui/dvs_last.md | 10 -- ui/enum_status.md | 39 ------ ui/follow.md | 19 --- ui/get.md | 20 --- ui/initialization.md | 209 ++++------------------------ ui/journey-2-adding-data-files.md | 8 -- ui/journey-4-updating-data-files.md | 29 ---- ui/log.md | 50 ------- ui/message.md | 23 --- ui/remote_storage.md | 22 --- ui/revert.md | 19 --- ui/root.md | 33 ----- ui/status.md | 107 -------------- ui/sync.md | 48 ------- ui/trace.md | 0 ui/tracking.md | 79 ----------- 21 files changed, 29 insertions(+), 808 deletions(-) delete mode 100644 ui/add.md delete mode 100644 ui/alias_git.md delete mode 100644 ui/audit.md delete mode 100644 ui/configuration.md delete mode 100644 ui/delete.md delete mode 100644 ui/dvs_last.md delete mode 100644 ui/enum_status.md delete mode 100644 ui/follow.md delete mode 100644 ui/get.md delete mode 100644 ui/log.md delete mode 100644 ui/message.md delete mode 100644 ui/remote_storage.md delete mode 100644 ui/revert.md delete mode 100644 ui/root.md delete mode 100644 ui/status.md delete mode 100644 ui/sync.md delete mode 100644 ui/trace.md delete mode 100644 ui/tracking.md diff --git a/ui/add.md b/ui/add.md deleted file mode 100644 index 23d9702..0000000 --- a/ui/add.md +++ /dev/null @@ -1,45 +0,0 @@ -# `dvs add` - -Goal: Add files to an initialized dvs repository. - -- [ ] Currently the `message` is attached to all files checked in simultaneously. - dvs has a log and audit log to - illuminate "why" a change occurred in the data. - - - -## CLI - -- Assume that current directory is a dvs repository, both in cli and R-package. -- The option to return `--json` must be present. 
- -## R - -Signature: - -```r -dvs_add <- function( - files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - overwrite = FALSE, - fail = FALSE -) -``` - -## Compression - -If the added file exceeds a certain threshold, the -R package should provide suggest compressing the recently added file. - -- [ ] `getOption(dvs.large_file_size = integer()`) - - Hard limit 100 MB [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - - Soft limit 50 MB (warning emitted) [PMx-project-template](https://github.com/A2-ai/template-PMx-project-starter/blob/main/.lefthook/pre-commit/file-size) - -Advice compression when - -- a single size exceeds size thresholds -- a directory of files exceeds size thresholds - -There are cases where individual files are not large, but the collection of files -starts to amount to a large amount, presumably too large to track. diff --git a/ui/alias_git.md b/ui/alias_git.md deleted file mode 100644 index c248ab9..0000000 --- a/ui/alias_git.md +++ /dev/null @@ -1,5 +0,0 @@ -# Alias dvs with get terminology - -- [ ] (future?) Should dvs-cli and dvs-rpkg have a --git-mode, where we -expose a git compatible interface to dvs, in order to -plug-in dvs as a git replacement? diff --git a/ui/audit.md b/ui/audit.md deleted file mode 100644 index c005291..0000000 --- a/ui/audit.md +++ /dev/null @@ -1,38 +0,0 @@ -# `dvs audit` - -Goal: Provide a repository wide log of dvs tracked files. - -## CLI - -The option to return `--json` must be present. 
- -```sh -$ dvs audit -[Date] [User] [+{files} -{files}] [Message] -``` - -```sh -$ dvs audit --since -``` - -## R - -Signature: - -```r -dvs_audit <- function( - since = NULL, # date | duration (unit) - by_user = character()) - -``` - - - -```r -dvs_audit() - -``` - -```r -dvs_audit(since = NULL) -``` diff --git a/ui/configuration.md b/ui/configuration.md deleted file mode 100644 index 7649cc2..0000000 --- a/ui/configuration.md +++ /dev/null @@ -1,5 +0,0 @@ -# `dvs.toml` - - -Configuration should track which patterns are tracked [DVS Tracking](./tracking.md). - diff --git a/ui/delete.md b/ui/delete.md deleted file mode 100644 index d9c4687..0000000 --- a/ui/delete.md +++ /dev/null @@ -1,29 +0,0 @@ -# `dvs delete` - -Goal: Remove tracked files. - -## CLI - -```shell -$ dvs delete - -``` - -## R package - -```r -dvs_delete <- function( - files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE -) -``` - -Aliases: `dvs_delete`, `dvs_remove`, `dvs_rm`. - -- `files`: list of files that are to be deleted. - -### Non-existing files - -Emit a warning, but still remove the files that do exist and are tracked. diff --git a/ui/dvs_last.md b/ui/dvs_last.md deleted file mode 100644 index cd0a5e9..0000000 --- a/ui/dvs_last.md +++ /dev/null @@ -1,10 +0,0 @@ -# `dvs_last` - -Goal: provide users with the ability to retrieve the result of -the last executed dvs command within the r package. - -Example: Suppose after `dvs_add(by_folder = "data/derived/*")` was executed -an error occurred, and an overview is displayed as a data-frame. The user -got a R native result, a data-frame, but if the user wants to act on the -provided information, we might want to provide a `dvs_last` that contains -miscellaneous. 
diff --git a/ui/enum_status.md b/ui/enum_status.md deleted file mode 100644 index a051258..0000000 --- a/ui/enum_status.md +++ /dev/null @@ -1,39 +0,0 @@ -# Configuration: Status - -- current | absent | unsynced -- tracked file that is un-added - - - -# TODO (editting needed) - - relative_path: relative path to the file with respect to where the operation was called - - status: (doesn’t include error status) - - current: the file is present in the project directory and matches the version in the storage directory - - absent: the file isn't present in the project directory - - unsynced: the file is present in the project directory, but doesn't match the version on in the storage directory - - file_size_bytes: current size of the file in bytes - - time_stamp: the ISO 8601 Zulu time of the most recent file version in the storage directory - - saved_by: the user who uploaded the most recent file version in the storage directory - - message: the message inputted to the dvs_add command that added the most recent file version in the storage directory - - blake3_checksum: hash of the file via the blake3 algorithm - - absolute_path: canonicalized path of the file - input: - - If inputted explicitly via file glob or path: the file name - - if inputted implicitly via dvs_status() (without input): NA - -error: if the outcome was error, the error type, else NA - -error message: if the outcome was error, the error message (if there was one), else NA diff --git a/ui/follow.md b/ui/follow.md deleted file mode 100644 index f87c953..0000000 --- a/ui/follow.md +++ /dev/null @@ -1,19 +0,0 @@ -# `dvs track` / `dvs_track` - -Goal: Purpose is to specify which files we ought to follow in dvs. - -User journey: - -- [ ] All the .csv files underneath a specific directory. 
-- [ ] All the .csv files that are less than 25 MB - -- File type -- Size filters - -- [ ] MOSSA: We may want to not track too large files, even if they are .csv -- [ ] pre-hook 100mb limit see template-PMx-project-starter - -Cloned repositories do not have hooks! - -- [ ] MOSSA: Filtering **/* but only the tracked files by dvs! -- [ ] \ No newline at end of file diff --git a/ui/get.md b/ui/get.md deleted file mode 100644 index 8f5741c..0000000 --- a/ui/get.md +++ /dev/null @@ -1,20 +0,0 @@ -# `dvs get` - -## CLI - -The option to return `--json` must be present. - -## R - -Signature: - -```r -dvs_get <- function(path = ".", - files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE # follows fs::dir_ls -) -``` - - diff --git a/ui/initialization.md b/ui/initialization.md index c552e69..21ed288 100644 --- a/ui/initialization.md +++ b/ui/initialization.md @@ -1,155 +1,57 @@ -# dvs initialization / `dvs init` / `dvs_init` +# dvs initialization Goal: Prepare shared storage and initialize DVS in directory dvs initialization will create a `dvs.toml` and a directory as specified by the -shared area in the init command. The shared dir may also need to `chown` the directory +shared area in the init command. The shared dir may also need to chown the directory to specify certain permissions. For example, for sensitive projects, setting ownership to a particular group, allowing write access for the group, and limiting read access to those not in the group. -## User site assumptions +## cli -- Always operating within a repository/project/workspace. -- A dvs repository need not fall under a git or any other vcs repository -- Storage is detached from repository root +```default +dvs init +Starts a new dvs project. This will create a `dvs.toml` file in the root folder of where the user is calling the CLI from. 
root folder being the place where we find a `.git` folder -- [ ] If `git` is not a requirement, what alternative heuristics do we use - for instantiating a dvs repository? Suggestion: If a `.git` directory is not - available, then take current directory as the choice directory for initialization? +Usage: dvs init [OPTIONS] -## CLI - -```shell -dvs --- Data version control and storage management system - -Usage: - dvs [OPTIONS] - -Commands: - init - add - get - status - audit - log +Arguments: + Where the data will be stored Options: - -h, --help Show help for command (e.g. `dvs init --help`) - --version Show version information -``` - -The initialization command will have further subcommands. - -```shell -dvs init --- Initialize a new DVS repository - -Usage: - dvs init [OPTIONS] - -Backends: - local Local, on-disk storage - fs File system storage (e.g. network file system (nfs)) - s3 S3 compatible storage - aws S3 hosted via AWS - -Options: - -h, --help Show help for command (e.g. `dvs init --help`) -``` - -### Local - -```shell -dvs init local --- Initialize a DVS repository via on-disk storage - -Usage: - dvs init local [OPTIONS] - -Required: - path to the local storage locations (e.g. `/data/`) - -Options: - --json - Output results as JSON - --metadata-folder-name - If you want to use a folder name other than `.dvs` for storing the metadata files - --permissions - Unix permissions for storage directory and files (octal, e.g., "770") - --group - Unix group to set on storage directory and files - --no-compression - Disable compression of stored files. Compression defaults to zstd - --no-compression - Disable compression of stored files. 
Compression defaults to zstd + --json + Output results as JSON + --metadata-folder-name + If you want to use a folder name other than `.dvs` for storing the metadata files + --permissions + Unix permissions for storage directory and files (octal, e.g., "770") + --group + Unix group to set on storage directory and files + --no-compression + Disable compression of stored files. Compression defaults to zstd -h, --help Print help ``` -## FS / NFS - - - -Example output: - -```shell -$ dvs init /data/ -DVS Repository created with storage path located at -``` - ## R function -```r -dvs_init <- function( - storage_path = character(), # required - permissions = NULL, - group = NULL, - metadata_folder_name = NULL) -``` - -Example output: + ```r -> dvs_init() -> Error: `storage_path` is missing; Please provide a location to store dvs objects. +dvs_init <- function(directory = ".", permissions = NULL, group = NULL, metadata_folder_name = NULL) ``` -```r -> dvs_init("/data/projectA_storage") -> A DVS repository was initialized in "/Users/elea/Documents/projectA" with storage location at "/data/projectA_storage" -``` - -CLI users do not need the full path shown to them, but R users need that information. - -Different storage backends have to be initialized through specialized functions. - -- `dvs_init_local` with alias `dvs_init` -- `dvs_init_fs(...)` -- `dvs_init_s3(...)` -- `dvs_init_aws(...)` - -## Storage - -- (future) Multiple projects can be hosted within the same storage - -### Case: No project or specific work directory - -Considering the one off scripts that scientists might create, in which there is -no project surrounding where said script is. - -- (future) User/machine storage -- (future) A remote project -- (future) One off scripts - ## Journey 1: Initial Setup with defaults Expected outcomes: -- `dvs.toml` created in the ancestral directory that contains `.git`, or other heuristics. 
-- shared dir created in specified path, with default permissions of 664 +* dvs.toml created in working directory +* shared dir created in specified path, with default permissions of 664 Known Caveats: -- certain linux `umask` setups cause folders to have default permissions like 600, or 644 +* certain linux umask setups cause folders to have default permissions like 600, or 644 where other collaborators could not write by default, therefore, ### CLI flow @@ -170,17 +72,17 @@ dvs_init("/data/shared/project-x-dvs") ## Journey 2: Initial Setup with shared folder locked down to group -- set permissions to writeable by group, not readable if not in group (660) -- group name projx +* set permissions to writeable by group, not readable if not in group (660) +* group name projx Expected outcomes: -- dvs.toml created in working directory -- shared dir created in specified path, with permissions of 660 and owned by group projx +* dvs.toml created in working directory +* shared dir created in specified path, with permissions of 660 and owned by group projx Edge cases: -- group must resolve to known gid on system +* group must resolve to known gid on system ### CLI flow @@ -197,56 +99,3 @@ dvs init /data/dvs/sensitive-projx --permissions "660" --group projx ```r dvs_init("/data/shared/project-x-dvs", permissions = "660", group = "projx") ``` - -#### Returns - -Return a rich data-frame that the end-user can then further subset/filter -to fit their needs. - -Old format: `relative_path`, `outcome`, `file_size_bytes`, `blake3_checksum`. 
- -- [ ] New format: - - `absolute_path`: abbreviated when printed in R (pillar) - - `relative path`: full path - - `status`: ordered factor instead of `character()` - - `absent|unsync|sync|present|added` - - `checksum`: always abbreviated in print (pillar, first 5 characters) - - `size`: using units and not raw `double()/numeric()` - -## Data formats to track - -- `.csv` -- `.rds` -- don't track `.RDA` files, as they are a collection of datasets - -Configuration: Must add these filters to the `dvs.toml`. - -Known annoyance: Verbosity of this can be annoying. -There should be a way to reduce outputs on untracked data files available -to the user. - -# TODO (to be edited) - -Errors - -dvs_init could return any of the following error types: - -project already initialized: dvs_init has already been run with different initialization attributes. - -git repository not found: dvs_init was run outside of a git repository - -storage directory input is not a directory: if input was an existing file - -storage directory absolute path not found: if the path could not be made absolute - -configuration file not created (dvs.yaml): failed to write to or save dvs.yaml - -linux primary group not found: if the group was inputted and it doesn't refer to a valid group - -storage directory not created: failed to create the storage directory - -linux file permissions invalid: if the permissions were inputted, they don't refer to actual octal linux file permissions - -could not check if storage directory is empty: error reading the contents of the directory - -storage directory permissions not set: couldn't modify the permissions of the storage directory diff --git a/ui/journey-2-adding-data-files.md b/ui/journey-2-adding-data-files.md index 94a2ff2..86a9cb8 100644 --- a/ui/journey-2-adding-data-files.md +++ b/ui/journey-2-adding-data-files.md @@ -58,11 +58,3 @@ Goal: Version a newly created dataset so others can retrieve it. 
```r dvs_status("data/derived/pk_data.csv") ``` - -```r -# no dvs_init ran before -dvs_add("contingency_table2.csv") -``` - -In RStudio: Check if there is no active folder, then emit warning. -Similarly in VSCode and Positron, as both can be run without an active workspace. diff --git a/ui/journey-4-updating-data-files.md b/ui/journey-4-updating-data-files.md index a96d329..c25ebcf 100644 --- a/ui/journey-4-updating-data-files.md +++ b/ui/journey-4-updating-data-files.md @@ -59,32 +59,3 @@ Goal: Replace an existing tracked dataset with a new version. git commit -m "Update PK data with new processing" git push ``` - -## Journey 5: Updating data files with new rows - -New data following previous form might come up. Example is new rows from a clinical trial, -new participants in trials is added, however the scientists want them added to already -checked data files. - -```r -dvs_add("data/registry/participants.csv", "added information from the second batch of runs") -``` - -this ought to say - -```r -> "Error: file already exists; consider noting if this is an amendment to the previous file via `amend = TRUE`" -``` - -Then, - -```r -dvs_add("data/registry/participants.csv", "added information from the second batch of runs", amend = TRUE) -``` - -could be executed, in which: Previous hash is compared to the new file `data/registry/participants.csv`, but truncated -to the level of the previous file, and then it can be known if this new event can supersede other add events, because we -know it is an addition. - -The hash itself cannot distinguish between a completely new file, or one with new bytes. In dvs, we only have current hash, -so we should consider adding this context via the user, i.e. by asking if it is an addition / amendment. diff --git a/ui/log.md b/ui/log.md deleted file mode 100644 index 79fcdfe..0000000 --- a/ui/log.md +++ /dev/null @@ -1,50 +0,0 @@ -# `dvs log` - -Per file logging is inspected via `dvs log` / `dvs_log()`. 
For project-wide logging, we have `dvs audit` / `dvs_audit()`. - -## CLI - -The option to return `--json` must be present. - -```sh -# in a previously `dvs init` folder -$ dvs log data/derived/model_summary.txt -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" -``` - -```sh -$ dvs log --interval -[date -- duration since now] -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" - -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" - -Last edited on: 20-10-2020 -checksum: NNNNN -message: "Ran nonmem model on exposure assumptions" -[date -- 2x duration since now] -... -[date -- 3x duration since now] -... -``` - -``: `days`, `weeks`, `months` - -## R - -Signature: - -```r -dvs_log <- function( - since = NULL, - by_user = NULL -) -``` - - diff --git a/ui/message.md b/ui/message.md deleted file mode 100644 index 0024275..0000000 --- a/ui/message.md +++ /dev/null @@ -1,23 +0,0 @@ -# `dvs_message` - -Goal: Add messages to files without re-hashing or replacing them. - -## CLI - -```sh -$ dvs message data/model_aaabb/model_summary.csv "this time it was run with 10000 repetitions" -Added message to `data/model_aaabb/model_summary.csv` -``` - -## R package - -```r -dvs_message <- function( - files = character(), - glob = character(), - ignore.case = NULL %||% !is.empty(glob), - fail = FALSE -) -``` - -`dvs_message` is equivalent to an idempotent `dvs_add`-call. diff --git a/ui/remote_storage.md b/ui/remote_storage.md deleted file mode 100644 index 93694f6..0000000 --- a/ui/remote_storage.md +++ /dev/null @@ -1,22 +0,0 @@ -# dvs supported storage backends - -## A2-AI hosting - -storagemagic/Dumbledore/A2-AI Cloud - -## Custom DVS storage - -- [ ] (future) dvs server hosted by client.
- -## Third party storage hosting - -### Amazon FSx - - - -### S3 - -### Sharepoint - -### - diff --git a/ui/revert.md b/ui/revert.md deleted file mode 100644 index 7ce26cc..0000000 --- a/ui/revert.md +++ /dev/null @@ -1,19 +0,0 @@ -# `dvs revert` - -## CLI - -The option to return `--json` must be present. - -## R - -Signature: - -```r -dvs_revert <- function(path = ".", - commit_sha = integer(), - date = NULL, - before = NULL # date | duration -) -``` - - diff --git a/ui/root.md b/ui/root.md deleted file mode 100644 index 9814ee9..0000000 --- a/ui/root.md +++ /dev/null @@ -1,33 +0,0 @@ -# `dvs root` - -Convenience utility for expert users - -Goal: Return the location of the dvs repository root anywhere. - -## CLI - -Not relevant. - -## R package - -Signature: - -```r -dvs_root <- function(...) -# alias -find_dvs_root <- dvs_root -``` - -Convenience: - -```r -dvs_root("model_code") -# equivalent to -fs::path(dvs_root(), "model_code") -# or -file.path(dvs_root(), "model_code") -``` - -The use cases for this function are very limited. We assume heavy use of -`{here}`-package in dvs-based projects. But it could be a relevant convenience -function in certain, specific cases. diff --git a/ui/status.md b/ui/status.md deleted file mode 100644 index f6111e7..0000000 --- a/ui/status.md +++ /dev/null @@ -1,107 +0,0 @@ -# `dvs` status - -Goal: Provide an overview of the changed data files and potential files to track -via the traced data file filters. - -## CLI - -The option to return `--json` must be present. - -```shell -$ dvs status --help -Status of the DVS repository - -Usage: - dvs status [FILTERS] [OPTIONS] - -Filters: - --current - --unsynced, --missing - --absent - --no-current - --no-unsynced - --no-absent - -Options: - -s, --state filter for states to retain - -i, --invert inverts the selection provided by `--state` - -h, --help Print help -``` - -When a filter is provided, only the selected state(s) are provided.
- - - -```sh -dvs status - -Current files: - - -Changed files (unsynced): - new_scenario/model_spec.txt - -Untracked and followed files: - orignal_scenario/model_summary.txt - orignal_scenario/tab-0123.tsv - orignal_scenario/tab-0123b.tsv - orignal_scenario/tab-0123c.tsv -``` - -We do not need to display the user in unsynced files, as they are likely to be owned by the current user. - -## R - -Signature: - -```r -dvs_status <- function( - path = ".", - show_storage = FALSE, -) -``` - -- `show_storage`: - - Show location of storage(s) for the current dvs repository. - - Warn the user that they must not alter the state of - the storage directory. - - (future) Show number of projects that the storage contains - -## Return format - -### CLI JSON format - - - -### R format - -Old format: `relative_path`, `status`, `file_size_bytes`, `blake3_checksum` - -Proposed format: - -- `absolute_path`: abbreviated when printed in R (pillar) -- `relative path`: full path -- `status`: ordered factor instead of `character()` - - `absent|unsync|sync|present|added` -- `checksum`: always abbreviated in print (pillar, first 5 characters) -- `size`: using units and not raw `double()/numeric()` - -## Data name format - -`dvs_status` should show untracked data files in the current dvs repository, if -tracking is specified. - -## Granularity - -We expect the end user to use `{dplyr}` in order to -filter to users, groups, and/or folders. Therefore it is important to provide consistent data-frames. - -## Following Filters in Status - -`dvs_track(".csv")`: tracks all CSV files. - -`dvs_track("model_data/*")`: all files in a directory will be added to the (potentially untracked files) - -`dvs_track("results/*.rds")`: glob on all r data that are saved in a specific directory. - -These should result in additions to `[following]` table in `dvs.toml`. See [Following Formats](tracking.md). 
diff --git a/ui/sync.md b/ui/sync.md deleted file mode 100644 index e4e713b..0000000 --- a/ui/sync.md +++ /dev/null @@ -1,48 +0,0 @@ -# `dvs sync` - -Goal: Provide a streamlined way to update a cloned dvs repository. - -Synchronization `sync` is an alias for `dvs get **/*`, meant as a -repository wide syncing from storage (local/remote). - -## CLI - -The option to return `--json` must be present. - -```sh -$ dvs sync -[status] [Last modified] [Message] -... ... ... -``` - -The sync subcommand should also be able to act as a repository revert, -and - -```sh -$ dvs sync --before -[status] [Last modified] [Message] -... ... ... -``` - -## R - -Signature: - -```r -dvs_sync <- function(path = ".", - by_folder = character(), - since = NULL, # date | duration (unit) - recurse = TRUE , -) -``` - -- `path` is a location within a dvs repository. - Not necessarily the root a dvs repository. -- `by_folder` allows to sync specific folders only -- `recurse` is whether to sync folders recursively - -### `recurse` - -When there is no `by_folder`, recurse will update the entire dvs repository, even if -current directory is a sub-directory in a dvs repository. The current location of the -user might be incidental to their intent with dvs. diff --git a/ui/trace.md b/ui/trace.md deleted file mode 100644 index e69de29..0000000 diff --git a/ui/tracking.md b/ui/tracking.md deleted file mode 100644 index 6d6917e..0000000 --- a/ui/tracking.md +++ /dev/null @@ -1,79 +0,0 @@ -# `dvs track` / `dvs_track` - -Goal: Purpose is to specify which files we ought to follow in dvs. - -User journey: - -- [ ] All the .csv files underneath a specific directory. -- [ ] All the .csv files that are less than 25 MB - -- File type -- Size filters - - - - - -## CLI - -```shell -$ dvs follow --help -Files that are followed by dvs when untracked. 
- -Usage: - dvs follow [COMMANDS] [OPTIONS] - -Commands: - add - list - audit - -Options: - -h, --help Show help for a command -``` - -`add` command: - - -`list` command: - - -`add` command: - - -## R package - -Support the following - -- `ext`: follow-filters based on file extensions, e.g. `"csv"`. -- `glob`: a glob that can enable matching files through their paths and file extension -- `regex`: a regular expression to match files through their full paths - -Provide diagnostics in case users accidentally write `.csv` instead of the correct `csv`. - -The follow filter must support - -- `glob`, `ext`, `regex` field -- an optional `label` that can be used to identify which follow-filter matched a file -- file size qualifiers: - `file_size_gt` (file size greater than mask), - `file_size_lt` (file size less than mask) - -Example: - -```toml -[[follow]] -{ ext = "parquet" } -[[follow]] -{ glob = "data/**/*.csv", label? = "optional label" } -[[follow]] -{ regex = ".+tab[0-9].+", file_size_gt ="5MB" } # match all nonmem tab files sdtab001 patab001 .... over 5MB -[[follow]] -{ glob = "model/nonmem/**/*", file_size_gt = "10MB" } -``` - -## Matcher audit - -A helpful utility for end users is a way to figure out why a given file was followed -by dvs. To that end, `dvs track` ought to display the matching filter next to every -followed file.