Complete comprehensive package review for gdho #8

Merged

larnsce merged 1 commit into main from 1-metadata-fixes

Jul 3, 2025

..Rcheck/00check.log

@@ -0,0 +1,13 @@
+    * using log directory ‘/Users/lschoebitz/Documents/gitrepos/gh-org-openwashdata/data-repos/gdho/..Rcheck’
+    * using R version 4.5.0 (2025-04-11)
+    * using platform: aarch64-apple-darwin20
+    * R was compiled by
+        Apple clang version 14.0.0 (clang-1400.0.29.202)
+        GNU Fortran (GCC) 14.2.0
+    * running under: macOS Sequoia 15.5
+    * using session charset: UTF-8
+    * checking for file ‘./DESCRIPTION’ ... ERROR
+    Required fields missing or empty:
+      ‘Author’ ‘Maintainer’
+    * DONE
+    Status: 1 ERROR
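
The check above stops at the DESCRIPTION file because the `Author` and `Maintainer` fields are absent. One common cause, which the log itself does not confirm, is running the check on a source directory whose DESCRIPTION only carries `Authors@R`, since the derived fields are normally added when the package is built. A minimal sketch of how such a gap could be detected with base R, using a made-up DESCRIPTION (not the gdho file):

```r
# Hypothetical DESCRIPTION content for illustration only; not the gdho file.
desc_lines <- c(
  "Package: examplepkg",
  "Title: Example Data Package",
  "Version: 0.1.0",
  "Authors@R: person(\"Jane\", \"Doe\", email = \"jane@example.org\", role = c(\"aut\", \"cre\"))",
  "License: CC BY 4.0"
)
desc_path <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_path)

# read.dcf() returns a character matrix with one column per field.
fields <- colnames(read.dcf(desc_path))

# Fields named in the check log as missing or empty.
required <- c("Author", "Maintainer")
missing_fields <- setdiff(required, fields)
has_authors_at_r <- "Authors@R" %in% fields

print(missing_fields)
print(has_authors_at_r)
```

If `missing_fields` is non-empty while `has_authors_at_r` is `TRUE`, checking a built tarball instead of the bare directory may be enough to clear the error.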

CLAUDE.md

@@ -0,0 +1,199 @@
+    # CLAUDE.md - OpenWashData R Package Review Guide
+    This guide helps Claude Code review R data packages for the openwashdata organization, ensuring consistency, quality, and completeness across all published datasets.
+    ## Overview
+    The review process follows a PLAN → CREATE → TEST → DEPLOY workflow triggered by a PR from dev to main branch. Each phase requires explicit user approval before proceeding.
+    ## Review Workflow
+    ### 1. PLAN Phase
+    When initiated via `/review-package [package-name]`, Claude will:
+    1. **Analyze Package Structure**
+       - Verify package was created with `washr` template
+       - Check for required directories: R/, data/, data-raw/, inst/extdata/, man/
+       - Confirm presence of key files: DESCRIPTION, README.Rmd, _pkgdown.yml
+    2. **Create Review Issues** (5 GitHub issues)
+       - Issue 1: General Information & Metadata
+       - Issue 2: Data Content & Quality
+       - Issue 3: Data Processing Script Review
+       - Issue 4: Documentation
+       - Issue 5: Tests & CI/CD
+    3. **Present Review Plan**
+       - Summary of findings
+       - List of issues to be addressed
+       - Request user confirmation before proceeding
+    ### 2. CREATE Phase
+    After user approval, work through each issue systematically:
+    #### Issue 1: General Information & Metadata
+    - [ ] DESCRIPTION file completeness
+      - Title (descriptive, <65 characters)
+      - Description (clear purpose statement)
+      - Authors with ORCID IDs
+      - License: CC BY 4.0
+      - Dependencies properly declared
+      - Version follows semantic versioning
+    - [ ] CITATION.cff file present and valid
+    - [ ] Generate citation using `washr::compile_citation()`
+    #### Issue 2: Data Content & Quality
+    - [ ] Data files in data/ directory (.rda format)
+    - [ ] CSV/XLSX exports in inst/extdata/
+    - [ ] Main dataset accessible via function matching package name
+    - [ ] Data quality checks:
+      - No unexpected missing values
+      - Consistent data types
+      - Reasonable value ranges
+      - Proper encoding (UTF-8)
+    #### Issue 3: Data Processing Script Review
+    - [ ] data_processing.R in data-raw/
+    - [ ] Script is reproducible and well-commented
+    - [ ] Raw data files preserved in data-raw/
+    - [ ] dictionary.csv with variable descriptions
+    - [ ] Uses tidyverse conventions
+    - [ ] Handles data cleaning transparently
+    #### Issue 4: Documentation
+    - [ ] README.Rmd follows openwashdata template:
+      - Dynamic content generation
+      - Installation instructions
+      - Data overview with dimensions
+      - Variable dictionary table
+      - License and citation sections
+    - [ ] Roxygen documentation for all exported functions
+    - [ ] _pkgdown.yml configured with:
+      ```yaml
+      template:
+        bootstrap: 5
+        includes:
+          in_header: |
+            <script defer data-domain="openwashdata.github.io" src="https://plausible.io/js/script.js"></script>
+      ```
+    - [ ] Package website builds without errors
+    #### Issue 5: Tests & CI/CD
+    - [ ] GitHub Actions workflow for R-CMD-check
+    - [ ] Package passes `devtools::check()` with no errors/warnings
+    - [ ] Examples run successfully
+    - [ ] Data loads correctly
+    **For each issue**: Present planned changes and request user confirmation before implementing.
+    ### 3. TEST Phase
+    Run comprehensive package checks:
+    ```r
+    devtools::check()
+    devtools::build()
+    pkgdown::build_site()
+    ```
+    Verify:
+    - All tests pass
+    - No R CMD check issues
+    - Documentation renders correctly
+    - Website builds successfully
+    ### 4. DEPLOY Phase
+    1. Build and deploy pkgdown website
+    2. Verify Plausible analytics tracking
+    3. Confirm all changes are committed
+    4. Approve PR merge to main branch
+    ## Key Standards
+    ### Required Files Structure
+    ```
+    package-name/
+    ├── DESCRIPTION
+    ├── NAMESPACE
+    ├── R/
+    │   └── package-name.R
+    ├── data/
+    │   └── package-name.rda
+    ├── data-raw/
+    │   ├── data_processing.R
+    │   └── dictionary.csv
+    ├── inst/
+    │   ├── CITATION
+    │   └── extdata/
+    │       ├── package-name.csv
+    │       └── package-name.xlsx
+    ├── man/
+    ├── README.Rmd
+    ├── README.md
+    ├── CITATION.cff
+    ├── _pkgdown.yml
+    └── .github/
+        └── workflows/
+            └── R-CMD-check.yaml
+    ```
+    ### Package Dependencies
+    Common dependencies for data packages:
+    - dplyr, tidyr (data manipulation)
+    - readr, readxl (data import)
+    - janitor (data cleaning)
+    - desc (DESCRIPTION parsing)
+    - gt, kableExtra (table formatting)
+    ### Quality Criteria
+    1. **Reproducibility**: All data processing steps documented and runnable
+    2. **Transparency**: Raw data preserved with clear transformation pipeline
+    3. **Accessibility**: Multiple export formats (R, CSV, XLSX)
+    4. **Documentation**: Comprehensive variable descriptions and usage examples
+    5. **Consistency**: Follows openwashdata naming and structure conventions
+    ## Commands
+    - `/review-package [package-name]` - Start package review
+    - `/review-status` - Check current review progress
+    - `/review-issue [number]` - Work on specific issue
+    - `/review-pr` - Create pull request for current issue
+    ## Important Notes
+    - Always request user confirmation between phases
+    - Check in with user before implementing changes in CREATE phase
+    - Preserve existing git history and commits
+    - Follow tidyverse style guide for R code
+    - Use semantic versioning for package versions
+    ## Project Management with GitHub CLI
+    - List issues: `gh issue list`
+    - View issue details: `gh issue view 80` (e.g., for issue #80 "Rename geographies parameter")
+    - Create branch for issue: `gh issue develop 80`
+    - Checkout branch: `git checkout 80-rename-geographies-parameter-to-entities`
+    - Create pull request: `gh pr create --title "Rename geographies parameter to entities" --body "Implements #80"`
+    - List pull requests: `gh pr list`
+    - View pull request: `gh pr view PR_NUMBER`
+    ## Build/Test/Check Commands
+    - Build package: `R CMD build .`
+    - Install package: `R CMD INSTALL .`
+    - Run all tests: `R -e "devtools::test()"`
+    - Run single test: `R -e "devtools::test_file('tests/testthat/test-FILE_NAME.R', reporter = 'progress')"`
+    - Run R CMD check: `R -e "devtools::check()"`
+    - Build Roxygen2 documentation: `R -e "devtools::document()"`
+    - Build vignettes: `R -e "devtools::build_vignettes()"`
+    - Build README.md from README.Rmd: `R -e "devtools::build_readme()"`
+    ## Code Style Guidelines
+    - Use 2 spaces for indentation (no tabs)
+    - Maximum 80 characters per line
+    - Use tidyverse style for R code (`dplyr`, `tidyr`, `purrr`)
+    - Use snake_case for function and variable names
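
The data quality checks listed under Issue 2 in the guide above (missing values, consistent types, UTF-8 encoding) could be sketched in base R roughly as follows. The function name `check_data_quality()` and the example data frame are hypothetical and not part of the guide or of `washr`:

```r
# Hypothetical helper, for illustration only: summarises missing values,
# column classes, and invalid UTF-8 strings in a data frame.
check_data_quality <- function(df) {
  chr_cols <- names(df)[vapply(df, is.character, logical(1))]
  list(
    # Number of NA values in each column.
    missing_per_column = colSums(is.na(df)),
    # First class of each column, to spot unexpected types.
    column_types = vapply(df, function(col) class(col)[1], character(1)),
    # Count of non-missing strings that are not valid UTF-8.
    invalid_utf8 = vapply(
      df[chr_cols],
      function(col) sum(!validUTF8(col[!is.na(col)])),
      integer(1)
    )
  )
}

# Made-up data standing in for a real dataset.
example_df <- data.frame(
  id = 1:3,
  name = c("Org A", "Org B", "Org C"),
  staff = c(10L, NA, 250L),
  stringsAsFactors = FALSE
)

report <- check_data_quality(example_df)
print(report)
```

Range checks are left out because what counts as a reasonable value depends on each variable.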

DESCRIPTION

            
@@ -11,9 +11,12 @@ Description: A dataset of global humanitarian organizations collected by Humanit
     License: CC BY 4.0
     Encoding: UTF-8
     Roxygen: list(markdown = TRUE)
-    RoxygenNote: 7.3.1
+    RoxygenNote: 7.3.2
     Depends:
-        R (>= 2.10)
+        R (>= 3.5)
     LazyData: true
     Config/Needs/website: rmarkdown
     Date: 2024-02-29
+    Suggests:
+        testthat (>= 3.0.0)
+    Config/testthat/edition: 3

R/gdho.R

            
@@ -5,7 +5,7 @@
     #' excludes the information about how many operational units a humanitarian organization has
     #' in each country, where you can find it in data gdho_full.
     #'
-    #' @format A tibble with 4556 rows and 33 variables
+    #' @format A tibble with 4548 rows and 35 variables
     #'
     #' \describe{
     #'   \item{id}{A unique Id for each organisation}
@@ -38,10 +38,13 @@
     #'   \item{ope/staff}{Percent of operational program expenditure per staff member}
     #'   \item{ope_inflation_adjusted}{Operational program expenditure adjusted for inflation}
     #'   \item{ope_original_currency}{Actual approximate operational program expenditure in original currency used by organisation}
+    #'   \item{ope_original_amount}{Operational program expenditure amount in original currency}
+    #'   \item{ope_original_currency_code}{Currency code for operational program expenditure}
     #'   \item{humexp_approx_usd}{Approximate humanitarian expenditure in USD}
     #'   \item{humexp_imputed}{Imputed approximate humanitarian expenditure in USD}
     #'   \item{humexp_inflation_adjusted}{Approximate humanitarian expenditure adjusted for inflation}
     #' }
     #'
     #' @source Humanitarian Outcomes <https://www.humanitarianoutcomes.org/projects/gdho>
     "gdho"

R/gdho_full.R

            
@@ -3,10 +3,10 @@
     #' This dataset collected by Humanitarian Outcomes provides insights about humanitarian
     #' organizations, such as name, website, headquarter information, and etc. This full version
     #' includes the information about how many operational units a humanitarian organization has
-    #' in each country, which is represented as one column per country resulting in 273 variables
-    #' compared with the short version "gdho" (33 variables).
+    #' in each country, which is represented as one column per country resulting in 275 variables
+    #' compared with the short version "gdho" (35 variables).
     #'
-    #' @format A tibble with 4556 rows and 273 variables
+    #' @format A tibble with 4548 rows and 275 variables
     #'
     #' \describe{
     #'   \item{id}{A unique Id for each organisation}
@@ -39,10 +39,14 @@
     #'   \item{ope/staff}{Percent of operational program expenditure per staff member}
     #'   \item{ope_inflation_adjusted}{Operational program expenditure adjusted for inflation}
     #'   \item{ope_original_currency}{Actual approximate operational program expenditure in original currency used by organisation}
+    #'   \item{ope_original_amount}{Operational program expenditure amount in original currency}
+    #'   \item{ope_original_currency_code}{Currency code for operational program expenditure}
     #'   \item{humexp_approx_usd}{Approximate humanitarian expenditure in USD}
     #'   \item{humexp_imputed}{Imputed approximate humanitarian expenditure in USD}
     #'   \item{humexp_inflation_adjusted}{Approximate humanitarian expenditure adjusted for inflation}
+    #'   \item{countries}{Individual countries where organisations are operational}
+    #'   \item{afghanistan..zimbabwe}{240 columns representing operational presence in individual countries}
     #' }
     #'
     #' @source Humanitarian Outcomes <https://www.humanitarianoutcomes.org/projects/gdho>
     "gdho_full"
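
Because the `@format` lines are maintained by hand, they can drift from the data, as the row and variable counts corrected in this file suggest. A minimal sketch of one way to compare a documented `@format` line against an object's actual dimensions; the stand-in data frame is hypothetical and merely has the documented shape, it is not the real `gdho_full`:

```r
# The @format line as documented for gdho_full in this change.
format_line <- "#' @format A tibble with 4548 rows and 275 variables"

# Pull the two integers (rows, variables) out of the documentation string.
documented <- as.integer(
  regmatches(format_line, gregexpr("[0-9]+", format_line))[[1]]
)

# Hypothetical stand-in with the documented shape; in a real review this
# would be the dataset loaded from the package.
stand_in <- as.data.frame(matrix(NA, nrow = 4548, ncol = 275))

# TRUE when the documented rows and variables equal dim() of the object.
dims_match <- identical(documented, dim(stand_in))
print(dims_match)
```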

data-raw/data-processing.R

@@ Expand Up @@
     gdho_full <- gdho_raw |>
       dplyr::rename_all(~stringr::str_replace_all(.x, "[\\(\\)]", "")) |>
       dplyr::rename_all(~stringr::str_replace_all(.x, " ", "_")) |>
-      dplyr::rename_all(tolower) #TODO: country names need more cleaning, remove ","
+      dplyr::rename_all(~stringr::str_replace_all(.x, ",", "")) |>
+      dplyr::rename_all(tolower)
     ## Encoding UTF-8 --------------------------------------------------------------
     gdho_full <- gdho_full |>
-      mutate(across(where(is.character), stringi::stri_enc_toutf8, ))
+      mutate(across(where(is.character), \(x) stringi::stri_enc_toutf8(x)))
+    ## Remove duplicate rows -------------------------------------------------------
+    gdho_full <- gdho_full |>
+      dplyr::distinct()
     ## Modify data types -----------------------------------------------------------
     ### to integer:
@@ -47,25 +52,69 @@ gdho_full <- gdho_full |>
                                     "ope_inflation_adjusted", "humexp_approx_usd",
                                     "humexp_inflation_adjusted"), as.double))
-    ### TODO: separate ope_original_currency into 2 columns
-    ### TODO:staff_, natl_, intl_imputed do not indicate numbers but some categories which are not documented
+    ### Separate ope_original_currency into amount and currency columns
+    gdho_full <- gdho_full |>
+      tidyr::separate(ope_original_currency,
+                      into = c("ope_original_amount", "ope_original_currency_code"),
+                      sep = " ",
+                      fill = "right",
+                      remove = FALSE)
+    ### Document imputed categories:
+    # staff_imputed, natl_imputed, intl_imputed categories:
+    # - "small" = estimated 1-50 staff
+    # - "medium" = estimated 51-250 staff
+    # - "large" = estimated 251-1000 staff
+    # - "very large" = estimated 1000+ staff
     ## Build dataset gdho ----------------------------------------------------------
-    gdho <- gdho_full[1:33] # a shorter version that does not include all country columns
+    # Get column names before country columns start
+    non_country_cols <- names(gdho_full)[seq_len(which(names(gdho_full) == "afghanistan") - 1)]
+    gdho <- gdho_full |>
+      dplyr::select(all_of(non_country_cols))
     ## Read and write dictionary ---------------------------------------------------
     original_dict <- read_excel("./data-raw/gdho_read_me.xlsx", skip = 2)
-    gdho_full_dictionary <- tibble(directory = "data",
-           file_name = "gdho_full.rda",
-           variable_name = c(colnames(gdho_full)[1:33], "countries"),
-           variable_type =  c(sapply(gdho_full, typeof)[1:33], "integer"),
-           description = original_dict$`Content description`)
-    gdho_dict <- tibble(directory = "data",
-                        file_name = "gdho.rda",
-                        variable_name = colnames(gdho_full)[1:33],
-                        variable_type = sapply(gdho_full, typeof)[1:33],
-                        description = original_dict$`Content description`[1:33])
+    # Get descriptions for new columns
+    new_column_descriptions <- c(
+      ope_original_amount = "Operational program expenditure amount in original currency",
+      ope_original_currency_code = "Currency code for operational program expenditure"
+    )
+    # Build dictionary for gdho_full
+    gdho_full_vars <- names(gdho_full)
+    country_start <- which(gdho_full_vars == "afghanistan")
+    country_end <- which(gdho_full_vars == "zimbabwe")
+    # Create descriptions vector
+    descriptions_full <- c(
+      original_dict$`Content description`[1:30],  # Original descriptions up to ope_original_currency
+      new_column_descriptions["ope_original_amount"],
+      new_column_descriptions["ope_original_currency_code"],
+      original_dict$`Content description`[31:33],  # Remaining original descriptions
+      rep("Country operational presence indicator", length(country_start:country_end))
+    )
+    gdho_full_dictionary <- tibble(
+      directory = "data",
+      file_name = "gdho_full.rda",
+      variable_name = gdho_full_vars,
+      variable_type = sapply(gdho_full, typeof),
+      description = descriptions_full
+    )
+    # Build dictionary for gdho
+    gdho_vars <- names(gdho)
+    gdho_dict <- tibble(
+      directory = "data",
+      file_name = "gdho.rda",
+      variable_name = gdho_vars,
+      variable_type = sapply(gdho, typeof),
+      description = descriptions_full[1:length(gdho_vars)]
+    )
     dictionary <- rbind(gdho_full_dictionary, gdho_dict)
     write_csv(dictionary, "./data-raw/dictionary.csv")
@@ Expand Down @@
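
The `tidyr::separate()` call in the diff above splits `ope_original_currency` on a space, pads missing pieces on the right, and keeps the original column. A minimal sketch of that behaviour on made-up values (the real column contents are not shown in this diff, so the inputs here are assumptions):

```r
library(tidyr)

# Made-up values; the real ope_original_currency contents may differ.
example <- data.frame(
  id = 1:3,
  ope_original_currency = c("1000 USD", "500", NA),
  stringsAsFactors = FALSE
)

separated <- separate(
  example,
  ope_original_currency,
  into = c("ope_original_amount", "ope_original_currency_code"),
  sep = " ",
  fill = "right",   # a value without a currency code gets NA in the second column
  remove = FALSE    # keep the original combined column
)

print(separated)
```

Both new columns come back as character vectors, so converting `ope_original_amount` to a number would be a separate step. Values containing more than one space would trigger a warning under the default `extra = "warn"` setting.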
