VitalantRI/cephia_public_data

CEPHIA Public Use Dataset

Public use dataset from the Consortium for the Evaluation and Performance of HIV Incidence Assays (CEPHIA)'s evaluations of HIV recency assays. Samples tested by CEPHIA were obtained from numerous collaborators. See the acknowledgements for more details.

This dataset was prepared for public release by Eduard Grebe (egrebe@vitalant.org, eduard@grebe.consulting) and Shelley Facente (shelley@facenteconsulting.com).

Versions

  • Version 1.0.0 released 2021-07-04
  • Version 2.0.0 released 2025-10-25
    • Compressed CSV using lz4
    • Switched type-preserved dataset from feather to parquet format
  • Version 2.0.1 released 2025-11-03
    • Documentation update

Using the CEPHIA public use dataset

The dataset is extracted from the CEPHIA study database in long format. A key concept for interpreting the data is the pair of unique result identifiers:

  • specific_result_identifier
  • generic_result_identifier

These identifiers should be used when subsetting and pivoting the data to wide format. A generic result represents the "final result" obtained for a specific sample tested as part of the panel, while a specific result could identify more than one result for the same sample if the specific recency assay's testing algorithm requires retesting. For example, in the case of the Limiting Antigen Avidity EIA, samples that yield an initial normalised optical density below 2.0 are retested in triplicate, and the median of the three retests produces the final result. In most cases, analysis should be performed on the "generic" final results.
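As a rough illustration of how several specific results can collapse into one generic result, the retesting rule described above can be sketched in Python. This is a simplified sketch, not CEPHIA's implementation: the function name, the constant name, and the assumption that an initial value at or above the threshold stands as the final result are all illustrative.

```python
from statistics import median

# Threshold quoted in the text above; the constant name is hypothetical.
RETEST_THRESHOLD = 2.0


def final_odn(initial_odn, retest_odns=None):
    """Return a 'final' normalised optical density (ODn) for one sample.

    Assumption: an initial ODn at or above the threshold is taken as final.
    Below the threshold, the median of exactly three retests is returned.
    """
    if initial_odn >= RETEST_THRESHOLD:
        return initial_odn
    if retest_odns is None or len(retest_odns) != 3:
        raise ValueError("an initial ODn below the threshold needs three retests")
    return median(retest_odns)


# One specific result, one generic result.
high_sample = final_odn(3.1)
# Four specific results (initial plus three retests), one generic result.
low_sample = final_odn(1.4, [1.2, 1.5, 1.3])
```

In this sketch the second sample contributes four rows at the "specific" level but a single row at the "generic" level, which is why the generic results are usually the right unit of analysis.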

Please note that "visits" identify unique specimens obtained from a particular participant at a particular timepoint. The specimen or sample identifiers are not relevant for analysis.
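A minimal sketch of what this means in practice: rows are grouped by participant_identifier and visit_identifier rather than by any specimen identifier. The rows and identifier values below are made up for illustration and use only the Python standard library.

```python
from collections import defaultdict

# Toy long-format rows; the identifier and assay values are hypothetical.
rows = [
    {"participant_identifier": "P1", "visit_identifier": "V1", "assay": "AssayA"},
    {"participant_identifier": "P1", "visit_identifier": "V1", "assay": "AssayB"},
    {"participant_identifier": "P1", "visit_identifier": "V2", "assay": "AssayA"},
    {"participant_identifier": "P2", "visit_identifier": "V3", "assay": "AssayA"},
]

# Collect the distinct visits seen for each participant.
visits_by_participant = defaultdict(set)
for row in rows:
    visits_by_participant[row["participant_identifier"]].add(row["visit_identifier"])

# Number of distinct visits (timepoints) per participant.
visit_counts = {p: len(v) for p, v in visits_by_participant.items()}
```

Here participant P1 has two visits even though three rows refer to that participant, because two assays were run on the same visit.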

File formats

The dataset is released in two formats: comma-separated values (CSV), compressed using lz4, and the Apache Parquet format, also compressed using lz4. The parquet dataset can be read into R using the arrow package and into Python using the pyarrow or pandas packages. The parquet dataset is strongly recommended over the CSV, given that data types (e.g. dates, numeric variables, booleans and character variables) are preserved.

Here is example code to read the dataset:

# R: requires the arrow package built with lz4 support
library(arrow)
library(tidyverse)
if (codec_is_available("lz4")) {
  cephia_data <- read_parquet("cephia_public_use_dataset_20210604.parquet")
} else {
  stop("Please make sure you have arrow with support for lz4 installed.")
}
glimpse(cephia_data)

# Python: pandas needs a parquet engine such as pyarrow
import pandas as pd
cephia_data = pd.read_parquet("cephia_public_use_dataset_20210604.parquet")
cephia_data.head(5)
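To see why the type-preserving format is recommended, the following standard-library sketch round-trips one toy record through CSV. The values are invented, and the way booleans and dates are encoded in the released CSV is not assumed here; the point is only that CSV returns every field as text, so types have to be restored by hand.

```python
import csv
import io
from datetime import date

# Toy record using three variable names from the dataset; values are made up.
record = {
    "test_date": date(2014, 3, 1),
    "assay_result_value": 1.25,
    "on_treatment_at_visit": False,
}

# Write the record to an in-memory CSV and read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
row = next(csv.DictReader(buffer))

# Every value is now a string, so types must be restored manually.
restored = {
    "test_date": date.fromisoformat(row["test_date"]),
    "assay_result_value": float(row["assay_result_value"]),
    # bool("False") would be True, so compare against the text instead.
    "on_treatment_at_visit": row["on_treatment_at_visit"] == "True",
}
```

With the parquet file this manual conversion step is unnecessary, because the column types travel with the data.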

Variables in the dataset

The dataset contains the following variables. A more detailed codebook is in development and a future version of this posting will include it. However, variable names are generally self-explanatory.

  • assay
  • cephia_panel
  • testing_laboratory
  • test_date
  • assay_result_field
  • assay_result_value
  • assay_result_method
  • specific_result_identifier
  • generic_result_identifier
  • participant_identifier
  • visit_identifier
  • specimen_type
  • hiv_status_at_visit
  • cohort_entry_hiv_status
  • days_since_cohort_entry
  • hiv_subtype
  • hiv_subtype_confirmed
  • country
  • sex
  • age_in_years
  • eddi_interval_size
  • days_since_eddi
  • days_since_ep_ddi
  • days_since_lp_ddi
  • designated_as_elite_controller_at_visit
  • ever_designated_as_elite_controller
  • treatment_naive_at_visit
  • on_treatment_at_visit
  • first_treatment_episode
  • days_since_first_art
  • days_since_current_art
  • days_from_eddi_to_first_art
  • days_from_eddi_to_current_art
  • viral_load_closest_to_visit
  • viral_load_date_offset_from_visit_date
  • viral_load_type
  • viral_load_detectable
  • cd4_count_at_visit

Filtering and pivoting the data for analysis

The following R code snippet demonstrates how to extract the "final" result for the ARCHITECT Avidity assay from the CEPHIA Evaluation Panel and pivot it to wide format. It assumes that the full dataset is in a dataframe called cephia_data and that the tidyverse is loaded.

ArchitectAvidityEPfinal <- cephia_data %>%
  filter(assay == "ArchitectAvidity",
         assay_result_field == "final_result",
         cephia_panel == "CEPHIA 1 Evaluation Panel") %>%
  pivot_wider(names_from = assay_result_field, values_from = assay_result_value)

The following R code snippet demonstrates how to extract the specific results (i.e. potentially multiple rows per sample tested) for the ARCHITECT Avidity assay from the CEPHIA Qualification Panel and pivot them to wide format.

ArchitectAvidityQPspecific <- cephia_data %>%
  filter(assay == "ArchitectAvidity",
         assay_result_field != "final_result",
         cephia_panel == "CEPHIA 1 Qualification Panel") %>%
  pivot_wider(names_from = assay_result_field, values_from = assay_result_value)
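For Python users, the filter-then-pivot logic of the R snippets above can be sketched without any dependencies. This is an illustration on toy rows, not a drop-in replacement: the helper pivot_wider, the visit identifiers, the result values, and the names "OtherAssay" and "hypothetical_raw_field" are invented for the example. With the real data in a pandas dataframe, pandas' own filtering and pivoting methods would be the more natural choice.

```python
def pivot_wider(rows, names_from, values_from):
    """Spread name/value pairs into columns, giving one output row per unique
    combination of the remaining columns (similar in spirit to tidyr)."""
    wide = {}
    for row in rows:
        # All columns except the name/value pair identify the output row.
        key = tuple((k, v) for k, v in row.items()
                    if k not in (names_from, values_from))
        wide.setdefault(key, dict(key))[row[names_from]] = row[values_from]
    return list(wide.values())


def make_row(assay, panel, visit, field, value):
    """Build one toy long-format row with a subset of the dataset's columns."""
    return {"assay": assay, "cephia_panel": panel, "visit_identifier": visit,
            "assay_result_field": field, "assay_result_value": value}


EP = "CEPHIA 1 Evaluation Panel"
QP = "CEPHIA 1 Qualification Panel"

# Toy long-format data; all identifiers and values are hypothetical.
cephia_rows = [
    make_row("ArchitectAvidity", EP, 101, "final_result", 85.2),
    make_row("ArchitectAvidity", EP, 102, "final_result", 62.0),
    make_row("ArchitectAvidity", QP, 201, "final_result", 90.1),
    make_row("OtherAssay", EP, 101, "final_result", 1.5),
    make_row("ArchitectAvidity", EP, 101, "hypothetical_raw_field", 84.0),
]

# Same three conditions as the first R snippet.
final_rows = [r for r in cephia_rows
              if r["assay"] == "ArchitectAvidity"
              and r["assay_result_field"] == "final_result"
              and r["cephia_panel"] == EP]

architect_avidity_ep_final = pivot_wider(
    final_rows, names_from="assay_result_field", values_from="assay_result_value")
```

After pivoting, each row carries a final_result column in place of the assay_result_field and assay_result_value pair.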

License and usage terms

These data are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0). The deed and the legal code can be found on the Creative Commons website.

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.

While we have made all efforts to deidentify the data, by accessing these data you agree not to attempt to identify any participants or violate the legal or moral rights of any person.

Citation

You can cite the dataset as follows:

Grebe, E., Facente, S. N., Hampton, D., Marson, K., Hall, J., McKinney, E., Lebedeva, M., Parkin, N., Keating, S. M., Kassanjee, R., Pilcher, C. D., Murphy, G., Busch, M. P., & Welte, A. (2025). CEPHIA public use data [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17439895

A BibTeX entry for the dataset is provided for your convenience:

@dataset{cephia_v2.0.1_2025,
  author       = {Grebe, Eduard and
                  Facente, Shelley N. and
                  Hampton, Dylan and
                  Marson, Kara and
                  Hall, Jake and
                  McKinney, Elaine and
                  Lebedeva, Mila and
                  Parkin, Neil and
                  Keating, Sheila M. and
                  Kassanjee, Reshma and
                  Pilcher, Christopher D. and
                  Murphy, Gary and
                  Busch, Michael P. and
                  Welte, Alex},
  title        = {CEPHIA public use data},
  month        = oct,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17439895},
  url          = {https://doi.org/10.5281/zenodo.17439895},
}

Acknowledgements

CEPHIA was supported by grants from the Bill and Melinda Gates Foundation (OPP1017716 to G.M., OPP1062806 to C.D.P. and OPP1115799). Additional support for analysis was provided by a grant from the US National Institutes of Health (R34 MH096606 to C.D.P.) and by the South African Department of Science and Technology and the National Research Foundation. Specimen and data collection were funded in part by grants from the NIH (P01 AI071713, R01 HD074511, P30 AI027763, R24 AI067039, U01 AI043638, P01 AI074621 and R24 AI106039); the HIV Prevention Trials Network (HPTN) sponsored by the NIAID, National Institutes of Child Health and Human Development (NICH/HD), National Institute on Drug Abuse, National Institute of Mental Health, and Office of AIDS Research, of the NIH, DHHS (UM1 AI068613 and R01 AI095068); the California HIV-1 Research Program (RN07-SD-702); Brazilian Program for STD and AIDS, Ministry of Health (914/BRA/3014-UNESCO); and the São Paulo City Health Department (2004-0.168.922–7). Selected samples from International AIDS Vaccine Initiative (IAVI)-supported cohorts were funded by IAVI with the generous support of USAID and other donors; a full list of IAVI donors is available at www.iavi.org.

The Consortium for the Evaluation and Performance of HIV Incidence Assays (CEPHIA) comprised (at various times): at South African Centre for Epidemiological Modelling and Analysis, Stellenbosch University – Alex Welte, Eduard Grebe (later at Vitalant Research Institute), Reshma Kassanjee (later at the University of Cape Town), Joseph Sempa, David Matten, Hilmarie Brand, Trust Chibawara; at Public Health England – Gary Murphy, Jake Hall, Elaine McKinney; Michael P. Busch, Dylan Hampton, Sheila Keating, Mila Lebedeva (Vitalant Research Institute, formerly Blood Systems Research Institute); at University of California San Francisco: Christopher D. Pilcher, Shelley Facente (later at Vitalant Research Institute and Facente Consulting), Kara Marson; at National Institutes of Health: Oliver Laeyendecker, Thomas Quinn, David Burns; Susan Little (University of California San Diego); Anita Sands (World Health Organization); Tim Hallett (Imperial College London); Sherry Michele Owen, Bharat Parekh, Connie Sexton (Centers for Disease Control and Prevention); Matthew Price, Anatoli Kamali (International AIDS Vaccine Initiative); Lisa Loeb (The Options Study—University of California San Francisco); Jeffrey Martin, Steven G Deeks, Rebecca Hoh (The SCOPE Study—University of California San Francisco); Zelinda Bartolomei, Natalia Cerqueira (The AMPLIAR Cohort—University of São Paulo); Breno Santos, Kellin Zabtoski, Rita de Cassia Alves Lira (The AMPLIAR Cohort—Grupo Hospital Conceição); Rosa Dea Sperhacke, Leonardo R Motta, Machline Paganella (The AMPLIAR Cohort—Universidade Caxias Do Sul); Esper Kallas, Helena Tomiyama, Claudia Tomiyama, Priscilla Costa, Maria A Nunes, Gisele Reis, Mariana M Sauer, Natalia Cerqueira, Zelinda Nakagawa, Lilian Ferrari, Ana P Amaral, Karine Milani (The São Paulo Cohort—University of São Paulo, Brazil); Salim S Abdool Karim, Quarraisha Abdool Karim, Thumbi Ndungu, Nelisile Majola, Natasha Samsunder (CAPRISA, University of Kwazulu-Natal); Denise Naniche (The GAMA Study—Barcelona Centre for International Health Research); Inácio Mandomando, Eusebio V Macete (The GAMA Study—Fundacao Manhica); Jorge Sanchez, Javier Lama (SABES Cohort—Asociación Civil Impacta Salud y Educación (IMPACTA)); Ann Duerr (The Fred Hutchinson Cancer Research Center); Maria R Capobianchi (National Institute for Infectious Diseases “L. Spallanzani”, Rome); Barbara Suligoi (Istituto Superiore di Sanità, Rome); Susan Stramer (American Red Cross); Phillip Williamson (Creative Testing Solutions / Vitalant Research Institute); Marion Vermeulen (South African National Blood Service); and Ester Sabino (Hemocentro do São Paolo).
