diff --git a/Chapter_1/01-Chapter1.Rmd b/Chapter_1/01-Chapter1.Rmd
deleted file mode 100644
index e1af6d7..0000000
--- a/Chapter_1/01-Chapter1.Rmd
+++ /dev/null
@@ -1,1049 +0,0 @@
-# (PART\*) Chapter 1 Introductory Data Science {-}
-
-# 1.1 FAIR Data Management Practices
-
-This training module was developed by Rebecca Boyles, with contributions from Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-This training module provides a description of FAIR data management practices and points participants to important resources to help ensure that generated data meet current FAIR guidelines. This training module is descriptive and content-based (as opposed to coding-based), in order to present information clearly and serve as an important resource alongside the other scripted training activities.
-
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following questions:
-
-1. What is FAIR?
-2. When was FAIR first developed?
-3. When making data ‘Findable’, who and what should be able to find your data?
-4. When saving/formatting your data, which of the following formats is preferred to meet FAIR principles: .pdf, .csv, or a proprietary output file from your lab instrument?
-5. How can I find a suitable data repository for my data?
-
-
-
-## Introduction to FAIR
-Proper data management is of utmost importance while leading data analyses within the field of environmental health science. A method to ensure proper data management is the implementation of Findability, Accessibility, Interoperability, and Reusability (FAIR) practices. A landmark paper that describes FAIR practices in environmental health research is the following:
-
-+ Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15. PMID: [26978244](https://pubmed.ncbi.nlm.nih.gov/26978244/).
-
-The FAIR principles describe a framework for data management and stewardship aimed at increasing the value of data by enabling sharing and reuse. These principles were originally developed from discussions during the [Jointly Designing a Data FAIRport](https://www.lorentzcenter.nl/jointly-designing-a-data-fairport.html) meeting at the Lorentz Center in Leiden, The Netherlands in 2014, which brought together stakeholders to discuss the creation of an environment for virtual computational science. The resulting principles are technology agnostic, discipline independent, community driven, and internationally adopted.
-
-Below is a schematic providing an overview of this guiding principle:
-```{r 01-Chapter1-1, echo=FALSE, fig.height=3.5, fig.width=3.5, fig.align='center' }
-knitr::include_graphics("Chapter_1/Module1_1_Input/Module1_1_Image1.png")
-```
-
-### Answer to Environmental Health Question 1 & 2
-:::question
-*With this background, we can answer **Environmental Health Question #1 and #2***: What is FAIR and when was it first developed?
-:::
-
-:::answer
-**Answer**: FAIR is a guiding framework that was recently established to promote best data management practices, ensuring that data are Findable, Accessible, Interoperable, and Reusable. It was first developed in 2014, which means that these principles are very new and continuing to evolve!
-:::
-
-
-
-## Breaking Down FAIR, Letter-by-Letter
-
-The aspects of the FAIR principles apply to data and metadata with the aim of making the information available to people and computers as described in the seminal paper by [Wilkinson et al., 2016](https://pubmed.ncbi.nlm.nih.gov/26978244/).
-
-
-### F (Findable) in FAIR
-The F in FAIR identifies the components needed to make (meta)data findable: assigning unique persistent identifiers, describing the data thoroughly, referencing those identifiers in the metadata, and ensuring that the descriptive information (i.e., metadata) can be searched by both *humans and computer systems*.
-
-**F1. (Meta)data are assigned a globally unique and persistent identifier**
-
-+ Each dataset is assigned a globally unique and persistent identifier (PID), for example a DOI. These identifiers allow users to find, cite, and track (meta)data.
-+ A DOI looks like: https://doi.org/10.1109/5.771073
-+ Action: Ensure that each dataset is assigned a globally unique and persistent identifier. Certain repositories automatically assign identifiers to datasets as a service. If not, obtain a PID via a [PID registration service](https://pidservices.org/).
-
-**F2. Data are described with rich metadata**
-
-+ Each dataset is thoroughly described (see R1): these metadata document how the data were generated, under what terms (license) they can be (re)used, and provide the necessary context for proper interpretation. This information needs to be machine-readable.
-+ Action: Fully document each dataset in the metadata, which may include descriptive information about the context, quality and condition, or characteristics of the data. Another researcher in any field, or their computer, should be able to properly understand the nature of your dataset. Be as generous as possible with your metadata (see R1).
-
-**F3. Metadata clearly and explicitly include the identifier of the data it describes**
-
-+ Explanation: The metadata and the dataset they describe are separate files. The association between a metadata file and its dataset is made explicit by including the dataset's PID in the metadata.
-+ Action: Make sure that the metadata contains the dataset’s PID.
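-
-To make this concrete, below is a minimal sketch of what a standalone metadata record referencing its dataset's PID might look like. The field names here are purely illustrative (real repositories prescribe their own metadata schemas), and the DOI shown is the example identifier from F1:
-
-```json
-{
-  "title": "Example environmental exposure dataset",
-  "dataset_pid": "https://doi.org/10.1109/5.771073",
-  "description": "Hypothetical metadata record illustrating principle F3",
-  "creator": "Example Lab"
-}
-```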
-
-**F4. (Meta)data are registered or indexed in a searchable resource**
-
-+ Explanation: Metadata are used to build easily searchable indexes of datasets. These resources allow users to search for existing datasets, much like searching for a book in a library.
-+ Action: Provide detailed and complete metadata for each dataset (see F2).
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: When making data ‘Findable’, who and what should be able to find your data?
-:::
-
-:::answer
-**Answer**: Both humans and computer systems should be able to find your data.
-:::
-
-
-
-### A (Accessible) in FAIR
-The A components are designed to enable (meta)data to be available long-term and accessed by humans and machines using standard communication protocols, with clearly described limitations on reuse.
-
-**A1. (Meta)data are retrievable by their identifier using a standardized communications protocol**
-
-+ Explanation: If one knows a dataset’s identifier and the location where it is archived, one can access at least the metadata. Furthermore, the user knows how to proceed to get access to the data.
-+ Action: Clearly define who can access the actual data and specify how. It is possible that data will not be downloaded, but rather reused *in situ*. If so, the metadata must specify the conditions under which this is allowed (as opposed to the conditions that must be fulfilled for external usage/download).
-
-**A1.1 The protocol is open, free, and universally implementable**
-
-+ Explanation: Anyone with a computer and an internet connection can access at least the metadata.
-
-**A1.2 The protocol allows for an authentication and authorization procedure, where necessary**
-
-+ Explanation: It often makes sense to ask users to create an account on a repository. This allows the repository to authenticate the owner (or contributor) of each dataset and to potentially set user-specific rights.
-
-**A2. Metadata are accessible, even when the data are no longer available**
-
-+ Explanation: Maintaining all datasets in a readily usable state eternally would require an enormous amount of curation work (adapting to new format standards, converting to a different format if required software is discontinued, etc.). Keeping the metadata describing each dataset accessible, however, can be done with fewer resources. This makes it possible to build comprehensive data indexes that include all current, past, and potentially arising datasets.
-+ Action: Provide detailed and complete metadata for each dataset (see R1).
-
-
-
-### I (Interoperable) in FAIR
-The I components of the principles address the needs for data exchange and interpretation by humans and machines, which includes the use of controlled vocabularies or ontologies to describe (meta)data and the description of provenance relationships through appropriate data citation.
-
-**I1. (Meta)data use a formal, accessible, shared, and broadly applicable language**
-
-+ Explanation: Interoperability typically means that each computer system has at least knowledge of the other systems' formats in which data are exchanged. If (meta)data are to be searchable, and if compatible data sources should be combinable in a (semi)automatic way, computer systems need to be able to decide whether the contents of datasets are comparable.
-+ Action: Provide machine-readable data and metadata in an accessible language, using a well-established formalism. Annotate data and metadata with resolvable vocabularies/ontologies/thesauri that are commonly used in the field (see I2).
-
-**I2. (Meta)data use vocabularies that follow FAIR principles**
-
-+ Explanation: The controlled vocabulary (e.g., [MESH](https://www.ncbi.nlm.nih.gov/mesh/)) used to describe datasets needs to be documented. This documentation needs to be easily findable and accessible by anyone who uses the dataset.
-+ Action: The vocabularies/ontologies/thesauri are themselves findable, accessible, interoperable and thoroughly documented, hence FAIR. Lists of these standards can be found at: [NCBO BioPortal](https://bioportal.bioontology.org/), [FAIRSharing](https://fairsharing.org/), [OBO Foundry](http://www.obofoundry.org/).
-
-**I3. (Meta)data include qualified references to other (meta)data**
-
-+ Explanation: If the dataset builds on another dataset, if additional datasets are needed to complete the data, or if complementary information is stored in a different dataset, this needs to be specified. In particular, the scientific link between the datasets needs to be described. Furthermore, all datasets need to be properly cited (i.e. including their persistent identifiers).
-+ Action: Properly cite relevant/associated datasets, by providing their persistent identifiers, in the metadata, and describe the scientific link/relation to your dataset.
-
-
-
-### R (Reusable) in FAIR
-The R components highlight what is needed for (meta)data to be reused and to support integration, such as sufficient description of the data and of data use limitations.
-
-**R1. Meta(data) are richly described with a plurality of accurate and relevant attributes**
-
-Explanation: Description of a dataset is required at two different levels:
-
-+ Metadata describing the dataset: what does the dataset contain, how was the data generated, how has it been processed, how can it be reused.
-+ Metadata describing the data: any needed information to properly use the data, such as definitions of the variable names
-
-Action: Provide complete metadata for each data file, including:
-
-+ Scope of your data: for what purpose was it generated/collected?
-+ Particularities or limitations about the data that other users should be aware of.
-+ Date of the dataset generation, lab conditions, who prepared the data, parameter settings, name and version of the software used.
-+ Variable names are explained or self-explanatory.
-+ Version of the archived and/or reused data is clearly specified and documented.
-
-
-
-## What Does This Mean for You?
-We advise the following 'starting points' for participants to begin meeting FAIR guidance:
-
-+ Learn how to create a [Data Management Plan](https://dmptool.org)
-+ Keep good documentation (project & data-level) while working
-+ Do not use proprietary file formats (.csv is a great go-to format for your data!)
-+ When able, use a domain appropriate metadata standard or ontology
-+ Ruthlessly document any steps in a project
-+ Most of FAIR can be handled by selecting a good data or software repository
-+ Don’t forget to include a [license](https://resources.data.gov/open-licenses/)!
-
-### Answer to Environmental Health Question 4
-:::question
-*With these, we can answer **Environmental Health Question #4***: When saving/formatting your data, which of the following formats is preferred to meet FAIR principles: .pdf, .csv, or a proprietary output file from your lab instrument?
-:::
-
-:::answer
-**Answer**: A .csv file is preferred to enhance data sharing.
-:::
-
-
-
-## Data Repositories for Sharing of Data
-When you are organizing your data to deposit online, it is important to identify an appropriate repository to publish your dataset in. A good starting place is a repository registry such as [FAIRsharing.org](https://fairsharing.org/) or [re3data.org](https://www.re3data.org/). Journals can also provide helpful resources, such as [Nature](https://www.nature.com/sdata/policies/repositories#general) and [PLOS](https://journals.plos.org/plosone/s/recommended-repositories), which have both published lists of recommended repositories. Funding agencies, including the NIH, may also recommend specific repositories.
-
-Below are some examples of two main categories of data repositories:
-
-**1. Domain Agnostic Data Repositories**
-
-Domain agnostic repositories allow the deposition of any data type. Some examples include the following:
-
-+ Data in Brief Articles (e.g., [Elsevier's Data in Brief Journal](https://www.journals.elsevier.com/data-in-brief))
-+ [Dryad](https://www.datadryad.org)
-+ [Figshare](https://figshare.com/)
-+ [The Dataverse Project](https://dataverse.org/)
-+ [Zenodo](https://zenodo.org/)
-
-
-**2. Domain Specific Data Repositories**
-
-Domain specific repositories allow the deposition of specific types of data, produced from specific types of technologies or within specific domains. Some examples include the following:
-
-+ [Database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/)
-+ [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/)
-+ [The Immunology Database and Analysis Portal](https://www.immport.org/home)
-+ [Metabolomics Workbench (National Metabolomics Data Repository)](https://www.metabolomicsworkbench.org/data/index.php)
-+ [Microphysiology Systems Database](https://upddi.pitt.edu/microphysiology-systems-database/)
-+ [Mouse Genome Informatics](http://www.informatics.jax.org/)
-+ [Mouse Phenome Database](https://phenome.jax.org/)
-+ [OpenNeuro](https://openneuro.org/)
-+ [Protein Data Bank](https://www.rcsb.org/)
-+ [ProteomeXchange](http://www.proteomexchange.org/)
-+ [Rat Genome Database](https://rgd.mcw.edu/)
-+ [Zebrafish Model Organism Database](http://zfin.org/)
-+ and many, many, many others...
-
-### Answer to Environmental Health Question 5
-:::question
-*With these, we can answer **Environmental Health Question #5***: How can I find a suitable data repository for my data?
-:::
-
-:::answer
-**Answer**: I can search through a data repository registry service or look for recommendations from NIH or other funding agencies.
-:::
-
-
-
-## Recent Shifts in Regulatory Policies for Data Sharing
-
-### The NIH Data Management and Sharing Policy
-NIH’s data management and sharing (DMS) policy became effective January 2023. This policy specifically lists the expectations that investigators must comply with in order to promote the sharing of scientific data.
-
-Information about this recent policy can be found through updated [NIH websites](https://sharing.nih.gov/data-management-and-sharing-policy).
-
-Information about writing an official Data Management and Sharing (DMS) plan for your research can be found through [NIH's Guidance on Writing a Data Management & Sharing Plan](https://sharing.nih.gov/data-management-and-sharing-policy/planning-and-budgeting-for-data-management-and-sharing/writing-a-data-management-and-sharing-plan#after).
-
-
-### The 2018 Evidence Act
-The Evidence Act, or Foundations for Evidence-Based Policymaking Act of 2018, was signed into U.S. law on January 14, 2019.
-
-The Act requires federal agencies to build the capacity to use evidence and data in their decision-making and policymaking. It also requires agencies to:
-
-+ Develop an evidence-building plan as part of their quadrennial strategic plan
-+ Develop an evaluation plan concurrent with their annual performance plan
-
-The Evidence Act also:
-
-+ Mandates that data be "open by default"
-+ Specifies that a comprehensive data inventory should be created for each agency's open data assets
-
-**How Does the NIH Data Management and Sharing Policy Intersect with the 2018 Evidence Act?**
-
-Making your data FAIR, by definition, makes it more shareable and reusable. Many of the requirements in the NIH DMS policy and the Evidence Act overlap with the FAIR principles.
-
-
-### The CARE Principles for Indigenous Data Governance
-While we are experiencing increased requirements for the open sharing of data, it is important to recognize that there are circumstances and populations that should, at the same time, be carefully protected. Examples include human clinical or epidemiological data that may become identifiable when sensitive data are shared. Another example is the consideration of Indigenous populations. A recent article by [Carroll et al. 2021](https://www.nature.com/articles/s41597-021-00892-0) states in its abstract:
-
-*As big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle.*
-
-
-
-## Additional Training Resources on FAIR
-Many organizations, from specific programs to broad consortia, provide training and resources for scientists on FAIR principles. Some notable global organizations that provide training and offer opportunities for community involvement are:
-
-+ [Committee on Data for Science and Technology (CODATA)](https://www.codata.org/uploads/CODATA@45years.pdf)
-+ [Global Alliance for Genomics & Health](https://pubmed.ncbi.nlm.nih.gov/27149219/)
-+ [GoFAIR](https://www.go-fair.org/)
-+ [Force11](https://www.force11.org/)
-+ [Research Data Alliance](http://www.dlib.org/dlib/january14/01guest_editorial.html)
-
-
-**Example Workshops discussing FAIR**:
-
-+ NAS Implementing FAIR Data for People and Machines: Impacts and Implications (2019). Available at: https://www.nationalacademies.org/our-work/implementing-fair-data-for-people-and-machines-impacts-and-implications
-
-+ NIH Catalyzing Knowledge-driven Discovery in Environmental Health Sciences Through a Harmonized Language, Virtual Workshop (2021). Available at: https://www.niehs.nih.gov/news/events/pastmtg/2021/ehslanguage/index.cfm
-
-+ NIH Trustworthy Data Repositories Workshop (2019). Available at: https://datascience.nih.gov/data-ecosystem/trustworthy-data-repositories-workshop
-
-+ NIH Virtual Workshop on Data Metrics (2020). Available at: https://datascience.nih.gov/data-ecosystem/nih-virtual-workshop-on-data-metrics
-
-+ NIH Workshop on the Role of Generalist Repositories to Enhance Data Discoverability and Reuse: Workshop Summary (2020). Available at: https://datascience.nih.gov/data-ecosystem/nih-data-repository-workshop-summary
-
-
-
-**Example Government Report Documents on FAIR:**
-
-+ Collins S, Genova F, Harrower N, Hodson S, Jones S, Laaksonen L, Mietchen D, Petrauskaite R, Wittenburg P. Turning FAIR into reality: Final report and action plan from the European Commission expert group on FAIR data: European Union; 2018. Available at: https://www.vdu.lt/cris/handle/20.500.12259/103794.
-
-+ EU. FAIR Data Advanced Use Cases: From Principles to Practice in the Netherlands. 2018. European Union. Available at: doi:10.5281/zenodo.1250535.
-
-+ NIH. Final NIH Policy for Data Management and Sharing and Supplemental Information. National Institutes of Health. Federal Register, vol. 85, 2020-23674, 30 Oct. 2020, pp. 68890–900. Available at: https://www.federalregister.gov/d/2020-23674.
-
-+ NIH. NIH Strategic Plan for Data Science 2018. National Institutes of Health. Available at: https://datascience.nih.gov/strategicplan.
-
-+ NLM. NLM Strategic Plan 2017 to 2027. U.S. National Library of Medicine, Feb. 2018. Available at: https://www.nlm.nih.gov/about/strategic-plan.html.
-
-
-
-**Example Related Publications on FAIR:**
-
-+ Comess S, Akbay A, Vasiliou M, Hines RN, Joppa L, Vasiliou V, Kleinstreuer N. Bringing Big Data to Bear in Environmental Public Health: Challenges and Recommendations. Front Artif Intell. 2020 May;3:31. doi: 10.3389/frai.2020.00031. Epub 2020 May 15. PMID: 33184612; PMCID: [PMC7654840](https://pubmed.ncbi.nlm.nih.gov/33184612/).
-
-+ Koers H, Bangert D, Hermans E, van Horik R, de Jong M, Mokrane M. Recommendations for Services in a FAIR Data Ecosystem. Patterns (N Y). 2020 Jul 7;1(5):100058. doi: 10.1016/j.patter.2020.100058. Erratum in: Patterns (N Y). 2020 Sep 11;1(6):100104. PMID: [33205119](https://pubmed.ncbi.nlm.nih.gov/33205119/).
-
-+ Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, Pétavy F, Galvez J, Becnel LB, Zhou FL, Harmon N, Jauregui B, Jackson T, Hudson L. FAIR data sharing: The roles of common data elements and harmonization. J Biomed Inform. 2020 Jul;107:103421. doi: 10.1016/j.jbi.2020.103421. Epub 2020 May 12. PMID: [32407878](https://pubmed.ncbi.nlm.nih.gov/32407878/).
-
-+ Lin D, Crabtree J, Dillo I, Downs RR, Edmunds R, Giaretta D, De Giusti M, L'Hours H, Hugo W, Jenkyns R, Khodiyar V, Martone ME, Mokrane M, Navale V, Petters J, Sierman B, Sokolova DV, Stockhause M, Westbrook J. The TRUST Principles for digital repositories. Sci Data. 2020 May 14;7(1):144. PMID: [32409645](https://pubmed.ncbi.nlm.nih.gov/32409645/).
-
-+ Thessen AE, Grondin CJ, Kulkarni RD, Brander S, Truong L, Vasilevsky NA, Callahan TJ, Chan LE, Westra B, Willis M, Rothenberg SE, Jarabek AM, Burgoon L, Korrick SA, Haendel MA. Community Approaches for Integrating Environmental Exposures into Human Models of Disease. Environ Health Perspect. 2020 Dec;128(12):125002. PMID: [33369481](https://pubmed.ncbi.nlm.nih.gov/33369481/).
-
-+ Roundtable on Environmental Health Sciences, Research, and Medicine; Board on Population Health and Public Health Practice; Health and Medicine Division; National Academies of Sciences, Engineering, and Medicine. Principles and Obstacles for Sharing Data from Environmental Health Research: Workshop Summary. Washington (DC): National Academies Press (US); 2016 Apr 29. PMID: [27227195](https://pubmed.ncbi.nlm.nih.gov/27227195/).
-
-
-
-
-
-:::tyk
-Let’s imagine that you’re a researcher who is planning on gathering a lot of data using the zebrafish model. In order to adequately prepare your studies and steps to ensure data are deposited into proper repositories, you have the idea to check repository information obtained in [FAIRsharing.org](https://fairsharing.org/). What are some example repositories and relevant ontology resources that you could use to organize, deposit, and share your zebrafish data (hint: use the search tool)?
-:::
-
-# 1.2 Data Sharing through Online Repositories
-## An Overview and Example with the Dataverse Repository
-
-This training module was developed by Kyle R. Roell, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Submitting data to publicly available repositories is an essential part of ensuring data meet FAIR guidelines, as discussed in detail in the previous training module. There are many benefits to sharing and submitting your research, such as:
-
-+ Making more use out of data that are generated in your lab
-+ More easily sharing and integrating across datasets
-+ Ensuring reproducibility in analysis findings and conclusions
-+ Improving the tracking and archiving of data sources and data updates
-+ Increasing the awareness and attention surrounding your research as others locate your data through additional online queries
-
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. How should I structure my data for upload into online repositories?
-2. What does the term 'metadata' mean and what does it look like?
-
-
-This module will introduce some of the repositories that are commonly used to deposit data, how to set up metadata files, and how to organize example data in preparation for sharing. We will also provide information surrounding best practices for data organization and sharing through these repositories. Additional resources are also provided throughout, as there are many ways to organize, share, and deposit data depending on your data types and structures and overall research goals.
-
-
-
-## Data Repositories
-
-There are many publicly available repositories that we should consider when depositing data. Some general repository registries that are helpful to search through include [FAIRsharing.org](https://fairsharing.org/) or [re3data.org](https://www.re3data.org/). Journals can also provide helpful resources and starting repository lists, such as [Nature](https://www.nature.com/sdata/policies/repositories#general) and [PLOS](https://journals.plos.org/plosone/s/recommended-repositories), which both have published a list of recommended repositories. As detailed in the FAIR training module, there are two main categories of data repositories:
-
-**1. Domain Agnostic Data Repositories**
-
-Domain agnostic repositories allow the deposition of any data type. Some examples include:
-
-+ Data in Brief Articles (e.g., [Elsevier's Data in Brief Journal](https://www.journals.elsevier.com/data-in-brief))
-+ [Dryad](https://www.datadryad.org)
-+ [Figshare](https://figshare.com/)
-+ [The Dataverse Project](https://dataverse.org/)
-+ [Zenodo](https://zenodo.org/)
-
-
-**2. Domain Specific Data Repositories**
-
-Domain specific repositories allow the deposition of specific types of data, produced from specific types of technologies or within specific domains. Some examples include:
-
-+ [Database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/)
-+ [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/)
-+ [The Immunology Database and Analysis Portal](https://www.immport.org/home)
-+ [Metabolomics Workbench (National Metabolomics Data Repository)](https://www.metabolomicsworkbench.org/data/index.php)
-+ [Microphysiology Systems Database](https://upddi.pitt.edu/microphysiology-systems-database/)
-+ [Mouse Genome Informatics](http://www.informatics.jax.org/)
-+ [Mouse Phenome Database](https://phenome.jax.org/)
-+ [OpenNeuro](https://openneuro.org/)
-+ [Protein Data Bank](https://www.rcsb.org/)
-+ [ProteomeXchange](http://www.proteomexchange.org/)
-+ [Rat Genome Database](https://rgd.mcw.edu/)
-+ [Zebrafish Model Organism Database](http://zfin.org/)
-+ and many, many, many others...
-
-This training module focuses on providing an example of how to organize and upload data into Dataverse, though many of the methods described below pertain to other data repositories as well and incorporate general best practices for data organization and sharing.
-
-
-
-## The Dataverse Project
-Dataverse, organized through [The Dataverse Project](https://dataverse.org/), is a popular repository option that allows for the upload of most types of material without any stringent requirements. The Dataverse organization also provides ample resources on how to organize, upload, and share data through Dataverse, including very thorough and readable user guides and best practices.
-```{r 01-Chapter1-2, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image1.png")
-```
-*Screenshot of the main page of [The Dataverse Project](https://dataverse.org/)*
-
-An easy way to think about Dataverse is as a folder system on your computer. A Dataverse is just an online folder that contains files, data, or datasets that are all related to some topic, project, etc. Although Dataverse was started at Harvard and the base Dataverse lives there, there are many versions of Dataverse that are specific to and supported by various institutions. For example, these training modules are being developed primarily by faculty, staff, and students at the University of North Carolina at Chapel Hill. As such, the examples contained in this module will specifically connect with the [UNC Dataverse](https://dataverse.unc.edu); though many of the methods outlined here are applicable to other Dataverses and additional online repositories, in general.
-
-
-
-## What is a Dataverse?
-
-Remember how we pointed out that a Dataverse is similar to a folder system on a computer? Here we are going to show you what that actually looks like. First, though, something that can be confusing when starting to work with Dataverse is that the term "Dataverse" is used both for the overarching repository and for the individual subsections (or folders) in which data are stored. For example, the UNC Dataverse is called a Dataverse, but to upload data, you need to upload it to a specific sub-Dataverse. So, what is the difference between the high-level UNC Dataverse and the smaller sub-Dataverses? Well, nothing, really. The UNC Dataverse is like a large folder containing all the projects and research related to or contained within UNC. From there, we want to be more specific about where we store our research, so more sub-Dataverses (folders) are created within that higher, overarching UNC Dataverse.
-
-As an example, using the UNC Dataverse, here we can see various sub-Dataverses that have been created as repositories for specific projects or types of data.
-
-```{r 01-Chapter1-3, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image2.png")
-```
-
-As another example looking within a specific Dataverse, here we can see the Dataverse that hosts datasets and publications for Dr. Julia Rager's lab, the [Ragerlab-Dataverse](https://dataverse.unc.edu/dataverse/ragerlab).
-
-```{r 01-Chapter1-4, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image3.png")
-```
-
-Within this Dataverse, we can see various datasets produced by her lab. It is worth noting that the datasets may not necessarily be directly related to each other in terms of exact topic; for example, the Ragerlab-Dataverse hosts data pertaining to wildfire smoke exposure as well as chemical exposures and breast cancer. But they all pertain to experiments and analyses run within her specific lab.
-
-Let's now start talking more specifically about how to organize data and format files for Dataverse, create your own "Dataverse", upload datasets, and what this all means!
-
-
-
-### Dataset Structure
-
-Before uploading your data to any data repository, it is important to structure your data efficiently and effectively, making it easy for others to navigate, understand, and utilize. While we will cover this in various sections throughout these training modules, here are some basic tips for data structure and organization.
-
-+ Keep all data for one participant or subject within one column (or row) of your dataset
- + Genomic data and other analytical assays tend to have subjects as the columns and genes, expression values, etc. as the rows
- + Descriptive and demographic data tend to have subjects or participants as the rows and each descriptor variable (including demographics and any other subject variables) as the columns
-+ Create succinct, descriptive variable names
- + For example, do not use something like "This Variable Contains Information Regarding Smoking Status"; instead, use something like "Smoking_Status"
- + Avoid spaces and special characters, and use consistent capitalization within variable names
-+ Consider transforming data from wide to long format depending on your specific dataset and general conventions
-+ Be sure to follow the specific guidelines of the repository when applicable
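-
-The wide-to-long transformation mentioned above can be sketched in R. Below is a minimal, hedged example using the `tidyr` package; the molecule column names and values are hypothetical placeholders, not taken from any dataset in this module:
-
-```r
-library(tidyr)
-
-# Hypothetical wide-format data: one row per subject,
-# one column per measured molecule
-wide <- data.frame(
-  Subject_Number = c(1, 2),
-  IL6  = c(0.54, 0.62),
-  IL10 = c(1.10, 0.98)
-)
-
-# Pivot to long format: one row per subject-molecule pair
-long <- pivot_longer(wide, cols = c(IL6, IL10),
-                     names_to = "Molecule", values_to = "Concentration")
-```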
-
-**TAME 2.0 Module 1.1 FAIR Data Management Practices** and **TAME 2.0 Module 1.4 Data Wrangling in Excel** are also helpful resources to reference when thinking about organizing your data.
-
-A general example of an organized, long format dataset in Excel is provided below:
-```{r 01-Chapter1-5, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image4.png")
-```
-
-Only .csv or .txt files can be uploaded to Dataverse; therefore, the metadata and data tabs in an Excel file will need to be saved and uploaded as two separate .csv or .txt files.
-
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question 1***: How should I structure my data for upload into online repositories?
-:::
-
-:::answer
-**Answer**: It is ideal to have data clearly organized, with succinct, descriptive variable names and all values filled in. Most commonly, datasets should be saved as separate .csv or .txt files for upload into data repositories.
-:::
-
-
-
-## Metadata
-There are many different definitions of what a metadata file is. Helpful explanations, for example, are provided by the [UNC University Libraries](https://guides.lib.unc.edu/metadata/definition):
-
-:::txtbx
-There are many definitions of metadata, but one of the simplest is *data about data*. More specifically...
-
-+ *Metadata (in terms of data management) describe a dataset:* how they were collected; when they were collected; what assumptions were made in their methodology; their geographic scope; if there are multiple files, how they relate to one another; the definitions of individual variables and, if applicable, what possible answers were (i.e., to survey questions); the calibration of any equipment used in data collection; the version of software used for analysis; etc. Very often, a dataset that has no metadata is incomprehensible.
-
-+ *Metadata ARE data.* They are pieces of information that have some meaning in relation to another piece of information. They can be created, managed, stored, and preserved like any other data.
-
-+ *Metadata can be applied to anything.* A computer file can be described in the same way that a book or piece of art can be described. For example, both can have a title, an author, and a year created. Metadata should be documented for research outputs of any kind.
-
-+ *Metadata generally has little value on their own.* Metadata adds value to other information, but are usually not valuable in themselves. There are exceptions to this rule, such as text transcription of an audio file.
-
-There are three kinds of metadata:
-
-+ Descriptive metadata consist of information about the content and context of your data.
-
- + Examples: title, creator, subject keywords, and description (abstract)
-
-+ Structural metadata describe the physical structure of compound data.
-
- + Examples: camera used, aperture, exposure, file format, and relation to other data or files
-
-+ Administrative metadata are information used to manage your data.
-
- + Examples: when and how they were created, who can access them, software required to use them, and copyright permissions
-
-:::
-
-
-Therefore, after having organized your primary dataset for submission to online repositories, it is equally important to provide a metadata file so that future researchers, or anyone downloading your data, can easily comprehend and utilize it. While most repositories capture some metadata on the dataset page (e.g., description of the data, upload date, contact information), there is generally little information about the specific data values and variables. In this section, we review some general guidelines and tips to better annotate your data.
-
-First, keep in mind that, depending on the specific repository you are using, you may have to follow its metadata standards. If you are uploading to a more generalist repository, however, these may be up to you to define.
-
-Generally, a metadata file consists of a set of descriptors for each variable in the data. If you are uploading data that contain many covariates or descriptive variables, it is essential that you provide a metadata file that describes these covariates, including both a description of each variable and the specific levels of any categorical or factor-type variables.
-
-From the dataset presented previously, here we present an example of an associated metadata file:
-```{r 01-Chapter1-6, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image5.png")
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question 2***: What does the term 'metadata' mean and what does it look like?
-:::
-
-:::answer
-**Answer**: Metadata refers to the information that describes and explains data. It looks like an additional dataset that provides context with details such as the source, type, owner, and relationships to other datasets. This file can help users understand the relevance of a specific dataset and provide guidance on how to use it.
-:::
-
-
-
-## Creating a Dataverse
-
-Now, let's review how to actually create a Dataverse. First, navigate to the parent Dataverse that you would like to use as your primary host website. For example, our group uses the [UNC Dataverse](https://dataverse.unc.edu/). If you do not already have one, create a username and login.
-
-Then, from the home Dataverse page, click "Add Data" and select "New Dataverse".
-```{r 01-Chapter1-7, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image6.png")
-```
-
-And fill in the information necessary.
-
-And that is it! After creating your Dataverse, you will need to publish it before it is accessible to the public. Note that you can actually create a Dataverse within another Dataverse (similar to a folder within a folder on your computer). This makes sense when you consider that even when you are creating a new Dataverse at the home, UNC Dataverse level, you are still technically creating a new Dataverse within an existing one (the larger UNC Dataverse).
-
-Here are some tips as you create your Dataverse:
-
-+ Do not recreate a Dataverse that already exists
-+ Choose a name that is specific, but general enough that it doesn't only pertain to one specific dataset
-+ You can add more than one contact email, if necessary
-
-
-
-## Creating a Dataset
-
-Creating a dataset creates a page for your data containing information about the data, a citation for the data (something valuable and somewhat unique to Dataverse), as well as the place from which your data can be directly accessed or downloaded. First, decide on the specific Dataverse in which your data will live. Then carry out the following steps to create a dataset:
-
-+ Navigate to the Dataverse page under which your dataset will live
-+ Click "Add Data" and then select "New Dataset"
-
-```{r 01-Chapter1-8, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_2_Input/Module1_2_Image7.png")
-```
-
-+ Fill in the necessary information
-+ Upload your data and metadata file(s) structured as detailed above
-
-Now, you have a dataset within your Dataverse. Again, you will have to publish the dataset for others to have access to it. The easy part of using a generalist repository like Dataverse is that you do not have to adhere to a strict data structure. However, this means it is up to you to make sure your data are readable and usable.
-
-
-
-## Concluding Remarks
-In this training module, we set out to express the importance of uploading data to online repositories, demonstrate what the upload process may look like using a generalist repository (Dataverse), and give some examples and tips on structuring data for upload and creating metadata files. It is important to choose the appropriate repository for your data based on your field of study and specifications of your work.
-
-
-
-
-
-:::tyk
-Try creating your own Dataverse repository, format your files to be uploaded to Dataverse, and upload those files to your new repository!
-:::
-
-# 1.3 File Management using Github
-
-This training module was developed by Alexis Payton, Lauren E. Koval, Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Good data practices like file management and code tracking are imperative for data analysis initiatives, especially when working in research teams and/or shared project folders. Oftentimes, analyses and manuscripts are edited many times prior to being submitted for a grant or publication. Analysis methods are also shared between members of a research team and with external communities, as further detailed in **TAME 2.0 Module 1.1 FAIR Data Management Practices**. Therefore, Github has emerged as an effective way to manage, share, and track how code changes over time.
-
-[Github](https://github.com) is a publicly accessible platform designed to facilitate version control and issue tracking of code. It is used by us and many of our colleagues not only to document versions of scripts written for data analysis and visualization, but also to make our code publicly available for open communication and dissemination of results.
-
-This training module serves as a launch pad for getting acclimated with Github and includes...
-
-+ Creating an account
-+ Uploading code
-+ Creating a repository and making it legible for manuscript submission
-
-
-## Creating an Account
-First, users must create their own accounts on Github to start uploading/sharing code. To do this, navigate to [github.com](https://github.com), click "Sign Up", and follow the on-screen instructions.
-```{r 01-Chapter1-9, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image10.png")
-```
-
-
-
-## Creating a Repository
-A repository, also known as a "repo", is similar to a project folder that will contain all code pertaining to a specific project (which can be used for specific research programs, grants, or manuscripts, as examples). A repository can be set to public or private. If a repo is initially set to private to keep findings confidential prior to publication, it can always be updated to public once findings are ready for public dissemination. Multiple people can be allowed to work on a project together within a single repository.
-
-To access the repositories that are currently available to you through your user account, click the circle in the top right-hand corner and select "Your repositories".
-```{r 01-Chapter1-10, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image11.png")
-```
-
-To create a new repository, click on the green button that says "New".
-```{r 01-Chapter1-11, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image12.png")
-```
-
-Then give your repository a descriptive name. We often edit the repo titles to match the title of specific manuscripts, though specific titling formats are up to the users/team's preference.
-
-For more information, visit Github's [Create a repo](https://docs.github.com/en/get-started/quickstart/create-a-repo) documentation.
-
-Then click "Add a README file" to initiate the README file, which is important to continually edit to provide analysis-specific background information, and any additional information that would be helpful during and after code is drafted to better facilitate tracking information and project details. *We provide further details surrounding specific information that can be included within the README file below.*
-```{r 01-Chapter1-12, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image13.png")
-```
-
-
-
-## Uploading Code
-
-The simplest way to upload code is to first navigate to the repository that you would like to upload your code/associated files to. Note that this could represent a repo that you created or that someone granted you access to.
-
-Click “Add file” then click “Upload files”. Drag and drop your file containing your script into github and click “Commit changes”.
-```{r 01-Chapter1-13, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image1.png")
-```
-
-A more advanced way to upload code is by using the command line, which allows a user to directly interact with the computer or software application. Further documentation can be found [here](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository).
-
-
-
-## Adding Subfolders in a Repository
-To keep the repository organized, it might be necessary to create a new folder (like the folder labeled “1.1. Summary Statistics” in the above screenshot). Files can be grouped into these folders based on the type of analysis.
-
-To do so, click on the new file and then click on the pencil icon next to the "Blame" button.
-```{r 01-Chapter1-14, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image2.png")
-```
-
-Click on the box that contains the title of the file. Write the title of your new folder and then end with a forward slash (/). In the screenshot below, we're creating a new folder entitled "New Folder". Click “Commit changes” and your file should now be in a new folder.
-```{r 01-Chapter1-15, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image3.png")
-```
-
-
-
-## Updating Code
-Saving iterations of code can save valuable time later, as analyses are constantly being updated and edited. If your code undergoes substantial changes (e.g., steps are added or removed, or there is code that is likely to be beneficial later on but is no longer relevant to the current analysis), it is helpful to save the previous version in Github for future reference.
-
-To do so, create a subfolder named “Archive” and move the old file into it. If you have multiple versions of a file with the same name, add the current date to prevent the file from being overwritten later on as seen in the screenshot below.
-```{r 01-Chapter1-16, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image4.png")
-```
-
-Once the old file version has been archived, now upload the most recent version of your code to the main folder. Based on the screenshot above, that would be under “3. ML Visualizations”.
-
-
-*Note: If a file is uploaded with the same name, it will be overwritten, which can't be undone! Therefore, put the older file into the archive folder if you'd like it to be saved **PRIOR** to uploading the new version.*
-
-
-
-## Updating Repository Titles and Structure to Support a Manuscript
-
-If the code is for a manuscript, it's helpful to include in parentheses the name of the table or figure it pertains to in the manuscript. For example, "Baseline Clusters (Figure 3)". This allows viewers to find the code for each table or figure faster.
-```{r 01-Chapter1-17, echo=FALSE, fig.width=6, fig.height=7, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image5.png")
-```
-
-
-
-### Using a README.md file
-A README.md file is used to describe the overall aims and purpose of the analyses in the repository or a folder within a repository. It is often the first file that someone will look at in a repo/folder, so it is important to include information that would be valuable to an outsider trying to make use of the work.
-
-To add a README.md file, click “Add file” and then “Create new file”.
-```{r 01-Chapter1-18, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image6.png")
-```
-
-Name your file “README.md”.
-```{r 01-Chapter1-19, echo=FALSE, fig.width=6, fig.height=7, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image7.png")
-```
-
-A README.md file uses Markdown syntax, the same syntax that R Markdown builds upon. This type of syntax is very helpful as you continue to develop R coding skills, as it provides a mechanism through which your code's output can be visualized and saved as a rendered file version. There are many resources for R Markdown; here are some that we find helpful:
-
-+ [R Markdown Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
-+ [R Markdown Syntax Overview](https://bookdown.org/yihui/rmarkdown/markdown-syntax.html)
-
-The final README.md file for the **OVERALL** repository for manuscript submission should look something like the screenshot below. Always include…
-
-+ The main goal of the project
-+ The final manuscript name, year it was published, Pub Med ID (if applicable)
-+ Graphical abstract (if needed for publication)
-+ Names and brief descriptions of each file
- + Include both the goal of the analysis and the methodology used (i.e., using chi-square tests to determine if there are statistically significant differences across demographic groups)
-+ If the code was written in Jupyter (i.e., has the extension .ipynb rather than .R or .Rmd), nbviewer is a website that can render Jupyter notebooks. This is helpful because these files can take too long to render on Github, so you can link to the repository through nbviewer instead.
- + Go to [nbviewer.org](https://nbviewer.org) --> type in the name of the repository --> copy the URL and add it to the README.md file
-```{r 01-Chapter1-20, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image8.png")
-```
-
-The final README.md file for a subfolder within a repository should look something like the screenshot below. Always include…
-
-+ The name of each file
-+ Brief description of each file
- + Include both the goal of the analysis and the methodology used
-+ Table or Figure name in the corresponding manuscript (if applicable)
-```{r 01-Chapter1-21, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image9.png")
-```
-
-**Note**: The organizational structure of these README.md files is simply a recommendation and should be adapted to the needs of the project. However, it is important to include information and organize the repository in a way that helps readers and colleagues who aren't familiar with the project navigate it.
-
-
-
-#### Example Repositories
-Below are links to repositories that contain code for analyses used in published manuscripts. These are examples of well organized Github repositories.
-
-- [Wildfires and Environmental Justice: Future Wildfire Events Predicted to Disproportionally Impact Socioeconomically Vulnerable Communities in North Carolina](https://github.com/UNC-CEMALB/Wildfires-and-Environmental-Justice-Future-Wildfire-Events-Predicted-to-Disproportionally-Impact-So/tree/main)
-
-- [Plasma sterols and vitamin D are correlates and predictors of ozone-induced inflammation in the lung: A pilot study](https://github.com/UNC-CEMALB/Plasma-sterols-and-vitamin-D-are-correlates-and-predictors-of-ozone-induced-inflammation-in-the-lung/tree/main)
-
-- [Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples](https://github.com/Ragerlab/Script_for_Cytokine-Signature-Clusters-as-a-Tool-to-Compare-Changes-associated-with-Tobacco-Product-)
-
-
-
-
-## Tracking Code Changes using Github Branches
-Github is a useful platform for managing and facilitating code tracking performed by different collaborators through branches.
-
-When you create a repository, Github automatically creates a default branch entitled "main". It's possible to create a new **branch**, which allows a programmer to make changes to files in a repository in isolation from the main branch. This is beneficial because the same file can be compared across branches, potentially edited by different scientists, and merged together to reflect those changes. **Note:** In order for this to work, the file in the main branch must have the same name as the file in the newly created branch.
-
-Let's start by creating a new branch. First, navigate to a repository, select "main" and then "View all branches".
-```{r 01-Chapter1-22, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image14.png")
-```
-
-Click "New branch", give your branch a title, and click "Create new branch". In the screenshot, you'll see the new branch entitled "jr-changes".
-```{r 01-Chapter1-23, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image15.png")
-```
-
-As a new collaborator interested in comparing and merging code changes to a file, click on the new branch that was just created. Based on the screenshot, that means click "jr-changes".
-```{r 01-Chapter1-24, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image16.png")
-```
-
-After uploading the file(s) to this branch, you'll see a notification that this branch is now a certain number of commits ahead of the main branch. A **commit** records a set of changes to files in a branch. Based on the screenshot, "jr-changes" is now 2 commits ahead of "main".
-```{r 01-Chapter1-25, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image17.png")
-```
-
-Click on "2 commits ahead" and scroll down to compare versions between the "main" and "jr-changes" branches. A pull request will need to be created. A **pull request** allows other collaborators to see changes made to a file within a branch. These proposed changes can be discussed and amended before merging them into the main branch. For more information, visit Github's [branches](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches), [pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) and [comparing branches in pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-comparing-branches-in-pull-requests) documentation.
-```{r 01-Chapter1-26, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image18.png")
-```
-
-Go ahead and click on "Create pull request". Click on "Create pull request" again on the next screen. Select "Merge pull request" and then "Confirm merge".
-```{r 01-Chapter1-27, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_3_Input/Module1_3_Image19.png")
-```
-
-
-
-## Concluding Remarks
-In summary, this training module serves as a basic tutorial for sharing code on Github in a way that is beneficial for scientific research. Concepts discussed include uploading and updating code, making a repository easily readable for manuscript submissions, and tracking code changes across collaborators. We encourage trainees and data scientists to implement code tracking and sharing through Github and to also keep up with current trends in data analysis documentation that continue to evolve over time.
-
-
-
-:::tyk
-Try creating your own Github profile, setting up a practice repo with subfolders, and writing a detailed README.md file paralleling the suggested formatting and content detailed above for your own data analyses!
-:::
-
-# 1.4 Data Wrangling in Excel
-
-This training module was developed by Alexis Payton, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-This module is intended to be a starting guide to cleaning and organizing an example toxicology dataset in Excel. **Data wrangling** involves the cleaning, removal of erroneous data, and restructuring necessary to prepare wet lab-generated data for downstream analyses. These steps will ensure that:
-
-+ Data are amenable to downstream analyses in R, or your preferred programming language
-+ Data are clear and easily interpretable by collaborators, reviewers, and readers
-
-Click [here](https://www.alteryx.com/glossary/data-wrangling#:~:text=Data%20wrangling%20is%20the%20process,also%20sometimes%20called%20data%20munging.) for more information on data wrangling.
-
-In this training tutorial, we'll make use of an example dataset that needs to be wrangled. The dataset contains concentration values for molecules that were measured using protein-based ELISA technologies. These molecules specifically span 17 sterols and cytokines, selected based upon their important roles in mediating biological responses. These measures were derived from human serum samples. Demographic information also exists for each subject.
-
-The following steps detailed in this training module are by no means exhaustive! Further resources are provided at the end. This module provides example steps that are helpful when wrangling your data in Excel. Datasets often come in many different formats from our wet bench colleagues, therefore some steps will likely need to be added, removed, or amended depending on your specific data.
-
-
-
-## Save a Copy of the Soon-To-Be Organized and Cleaned Dataset as a New File
-Open Microsoft Excel and prior to **ANY** edits, click “File” --> “Save As” to save a new version of the file that can serve as the cleaned version of the data. This is very important for file tracking purposes, and can help in the instance that the original version needs to be referred back to (e.g., if data are accidentally deleted or modified during downstream steps).
-
-+ The file needs to be named something indicative of the data it contains followed by the current date (e.g., "Allostatic Mediator Data_061622").
-+ The title should be succinct and descriptive.
-+ It is okay to use dashes or underscores in the name of the title.
-+ Do not include special characters, such as $, #, @, !, %, &, *, (, ), and +. Special characters tend to generate errors on local hard drives when syncing to cloud-based servers, and they are difficult to upload into programming software.
-
-
-Let's first view what the dataset currently looks like:
-
-```{r 01-Chapter1-28, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image1.png")
-```
-
-
-
-### Helpful Excel Keyboard Shortcuts
-
-The following keyboard shortcuts can help you work more efficiently in Excel:
-
-+ Move to the last cell in use on the sheet
- + Control + Fn + Right arrow key (Mac users)
- + Control + End (PC users)
-+ Move to the beginning of the sheet
- + Control + Fn + Left arrow key, then Control + Fn + Up arrow key (Mac users)
- + Control + Home (PC users)
-+ Highlight and grab all data
- + Click on the first cell in the upper left-hand corner, then press Shift + Command + Down arrow key + Right arrow key (Mac users)
- + Control + Shift + Down arrow key + Right arrow key (PC users)
-
-**Note:** This only works if there are no cells with missing information or gaps in the columns/rows used to define the peripheral area.
-
-For more available shortcuts on various operating systems click [here](https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-excel-1798d9d5-842a-42b8-9c99-9b7213f0040f).
-
-
-
-## Remove Extraneous White Space
-Before we can begin organizing the data, we need to remove the entirely blank rows of cells. This reduces the file size and allows for the use of the filter function in Excel, as well as other organizing functions, which will be used in the next few steps. This step also makes the data look more tidy and amenable to import for coding purposes.
-
-+ **Excel Trick #1:** Select all lines that need to be removed and press Control + minus key for Mac and PC users. (Note that there are other ways to do this for larger datasets, but this works fine for this small example.)
-+ **Excel Trick #2:** An easier way to remove blank rows and cells for larger datasets is to click "Find & Select" --> "Go To Special" --> "Blanks" --> "OK" to select all blank rows and cells, then click "Delete" within the Home tab --> "Delete Sheet Rows".
-
-After removing the blank rows, the file should look like the screenshot below.
-```{r 01-Chapter1-29, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image2.png")
-```
-
-
-
-## Replace Missing Data with “NA”
-There are many ways missing data can be encoded in datasets, including values like "blank", "N/A", or "NA", or simply leaving a cell empty. Replacing all missing values with "NA" is done for two reasons:
-
-+ To confirm that the data is indeed missing
-+ R reads in "NA" values as missing values
-
-To check for missing values, the filter function can be used on each column to select only the cells with missing values. (You may need to scroll to the bottom of the filter pop-up window for numerical data.) Enter "NA" into the first cell of the filtered column, then double click the bottom right corner of that cell to copy the "NA" down the rest of the column.
-
-There were no missing data in this dataset, so this step can be skipped.
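-
-When the cleaned file is later imported into R, blank cells and other placeholders can also be converted to "NA" at read-in. A minimal sketch is below; the file name and the list of placeholder strings are assumptions for illustration:
-
-```r
-# Treat empty cells, "N/A", and "blank" as missing values at import
-df <- read.csv("Allostatic_Mediator_Data.csv",
-               na.strings = c("", "N/A", "blank"))
-
-# Count the missing values in each column to confirm what is missing
-colSums(is.na(df))
-```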
-
-
-
-## Create a Metadata Tab
-Metadata explains what each column represents in the dataset. Metadata is now a required component of data sharing, so it is best to initiate this process prior to data analysis. Ideally, this information is filled in by the scientist(s) who generated the data.
-
-+ Create a new tab (preferably as the first tab) and label it “XXXXX_METADATA” (i.e., “Allostatic_METADATA")
-+ Then relabel the original data tab as “XXXX_DATA” (i.e., “Allostatic_DATA")
-+ Within the metadata tab, create three columns: the first, "Column Identifier", contains each of the column names found in the data tab; the second, "Code", contains the individual variable/abbreviation for each column identifier; the third, "Description", contains additional information and definitions for abbreviations
-
-```{r 01-Chapter1-30, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image3.png")
-```
-
-
-
-## Abbreviate and Capitalize Categorical Data
-Categorical data are easier to handle in programming languages when they are capitalized and abbreviated. This convention also helps reduce typos within your script.
-
-For this dataset, the following variables were edited:
-
-+ Group
- + "control" became "NS" for non-smoker
- + "smoker" became "CS" for cigarette smoker
-+ Sex
- + "f" became "F" for female
- + "m" became "M" for male
-+ Race
- + "AA" became "B" for Black
- + "White" became "W" for White
-
-**Excel Trick:** To change cells that contain the same data simultaneously, navigate to "Edit", click "Find", and then "Replace".
-
-Once the categorical data have been abbreviated, add those abbreviations to the metadata and describe what they symbolize.
-```{r 01-Chapter1-31, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image4.png")
-```
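-
-The same recoding can be scripted in R, which avoids manual find-and-replace errors. A minimal sketch with hypothetical values:
-```{r}
-# Hypothetical example: recode and capitalize categorical values in R
-group <- c("control", "smoker", "control")
-group[group == "control"] <- "NS"  # non-smoker
-group[group == "smoker"]  <- "CS"  # cigarette smoker
-group
-```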
-
-
-
-## Alphabetize (Sort) the Data by the Categorical Variable of Interest
-For this dataset, we will sort by the column "Group". This organizes the data and sets it up for the next step.
-
-+ Highlight all the column headers.
-+ Click on the "Sort & Filter" button and click "Filter".
-+ Click the arrow in the cell that contains the column name "Group" and click "Ascending".
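-
-For reference, the same ascending sort is a one-liner in R with `order()` (data values here are hypothetical):
-```{r}
-# Hypothetical example: sort a data frame by the Group column in R
-df <- data.frame(Group = c("NS", "CS", "NS"), Cortisol = c(10, 12, 9))
-df_sorted <- df[order(df$Group), ]  # "CS" rows sort before "NS" rows
-```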
-
-
-## Create a New Subject Number Column
-An analysis-specific, ordinal subject number is assigned to each subject, which allows the scientist to easily see the total number of subjects. In addition, these new ordinal subject numbers will be used to create a subject identifier that combines each subject's group and subject number, which is helpful for downstream visualization analyses.
-
-+ Relabel the subject number/identifier column as “Original_Subject_Number” and create an ordinal subject number column labeled “Subject_Number”.
-
-When importing data, R converts spaces in column names to periods, so it is common practice to replace spaces with underscores when preparing data for analysis in R. Avoid using dashes in column names or anywhere else in the dataset.
-```{r 01-Chapter1-32, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image5.png")
-```
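-
-The space-to-period behavior can be verified directly with `make.names()`, which base R's `read.csv()` applies to column names by default:
-```{r}
-# Spaces and dashes in column names both become periods in R
-make.names(c("Subject Number", "Group_Subject_No", "IL-10"))
-# "Subject.Number" "Group_Subject_No" "IL.10"
-```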
-
-
-
-## Remove Special Symbols and Dashes
-Programming languages, in general, do not operate well with special symbols and dashes, particularly when included in column identifiers. For this reason, it is best to remove these while cleaning up your data, prior to importing it into R or your preferred programming software.
-
-In this case, the dataset contains dashes and Greek letters within some of the column header identifiers. Here, it is beneficial to remove the dashes (e.g., change IL-10 to IL10) and replace the Greek letters with the first letter of the corresponding English word (e.g., change TNF-$\alpha$ to TNFa).
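-
-If some headers slip through, the same clean-up can be scripted in R. A minimal sketch (the IFN-$\gamma$ header is a hypothetical addition):
-```{r}
-# Hypothetical example: drop dashes and transliterate Greek letters in headers
-cols <- c("IL-10", "TNF-\u03b1", "IFN-\u03b3")
-cols <- gsub("-", "", cols)        # IL-10 -> IL10
-cols <- gsub("\u03b1", "a", cols)  # alpha -> a
-cols <- gsub("\u03b3", "g", cols)  # gamma -> g
-cols
-```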
-
-
-
-## Bold all Column Names and Center all Data
-These data will likely be shared with collaborators, uploaded onto data deposition websites, and used as supporting information in published manuscripts. For these purposes, it is nice to format data in Excel such that it is visually appealing and easy to digest.
-
-For example, here, it is nice to bold column identifiers and center the data, as shown below:
-```{r 01-Chapter1-33, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image6.png")
-```
-
-
-
-## Create a Subject Identifier Column
-The subject identifier column, labeled “Group_Subject_No”, combines the subject number with the variable of interest (i.e., Group for this dataset). This is useful during analyses for identifying outliers by both subject number and group.
-
-+ Insert 2 additional columns where the current "Sex" column is.
-+ To combine values from two different columns, type "=CONCAT(D2,"_",C2)" in the first data cell of the first inserted column (adjusting the cell references to match the columns containing the group and subject number).
-+ Double click the bottom-right corner of the cell to copy the formula down to the last row in the dataset.
-+ Copy the entire column and paste only the values into the second inserted column by clicking the drop-down arrow next to "Paste" and selecting "Paste Values".
-+ Label the second column "Group_Subject_No" and delete the first column.
-
-```{r 01-Chapter1-34, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image7.png")
-```
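-
-For comparison, the same identifier can be built in R with `paste0()` (values hypothetical):
-```{r}
-# Hypothetical example: combine group and subject number into one identifier
-group <- c("CS", "CS", "NS")
-subject_number <- 1:3
-Group_Subject_No <- paste0(group, "_", subject_number)
-Group_Subject_No  # "CS_1" "CS_2" "NS_3"
-```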
-
-## Separate Subject Demographic Data from Experimental Measurements
-This example dataset is very small, so the demographic data (e.g., sex, race, age) was kept within the same file as the experimentally measured molecules. However, in larger datasets (e.g., genome-wide or exposomic data), it is often beneficial to separate the demographic data into its own file, labeled according to the following format: “XXX_Subject_Info_061622” (i.e., “Allostatic_Subject_Info_061622”).
-
-This step was not completed for the current data, given its small size and the simplicity of the downstream analyses.
-
-
-
-## Convert Data from Wide to Long Format
-In a wide format, the subject identifier **DOES NOT** repeat. For this dataset, each subject has one row containing all of its data, so each subject identifier occurs only once in the dataset.
-
-**Wide Format**
-```{r 01-Chapter1-35, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image8.png")
-```
-
-In a long format, the subject identifier **DOES** repeat. For this dataset, that means a new column entitled "Variable" was created containing all the mediator names, along with a column entitled "Value" containing their corresponding values. In the screenshot, an additional column, "Category", was added to help with the categorization of mediators in R analyses.
-
-**Long Format**
-```{r 01-Chapter1-36, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image9.png")
-```
-
-A long format is often preferred because it makes visualizations and statistical analyses more efficient in R. In the long format, we were able to add a column entitled "Category" classifying each mediator as an "AL Biomarker" or "Cytokine", allowing us to more easily subset the mediators in R. Read more about wide and long formats [here](https://towardsdatascience.com/long-and-wide-formats-in-data-explained-e48d7c9a06cb).
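-
-For reference, the equivalent conversion in R takes one function call from the `tidyr` package. This is a minimal sketch with hypothetical values, assuming the mediator columns run from "Cortisol" onward:
-```{r}
-# Hypothetical sketch: wide-to-long conversion with tidyr::pivot_longer()
-library(tidyr)
-wide <- data.frame(Subject_Number = 1:2,
-                   Cortisol = c(10, 12),
-                   IL10 = c(0.5, 0.7))
-long <- pivot_longer(wide, cols = Cortisol:IL10,
-                     names_to = "Variable", values_to = "Value")
-```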
-
-To convert the data from a wide to long format, follow the steps below:
-
-## Pivoting Data from a Wide to Long Format
-To do this, Excel's Power Query will be used. Note: If you are working on a Mac, you will need at least Excel 2016 to follow this tutorial, as Power Query is not available in earlier versions. Add-ins are available for Windows users. See [this link](https://blog.enterprisedna.co/how-to-add-power-query-to-excel/) for more details.
-
-1. Start by copying all of the data, including the column titles. (Hint: Try using the keyboard shortcut mentioned above.)
-2. Click the tab at the top that says "Data". Then click "Get Data (Power Query)" at the far left.
-3. It will ask you to choose a data source. Click "Blank table" in the bottom row.
-4. Paste the data into the table. (Hint: Use the shortcut Ctrl + "v"). At this point, your screen should look like the screenshot below.
-```{r 01-Chapter1-37, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image10.png")
-```
-
-5. Click "Use first row as headers" and then click "Next" in the bottom right hand corner.
-6. Select all the columns with biomarker names. That should be the column "Cortisol" through the end.
-```{r 01-Chapter1-38, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image11.png")
-```
-
-7. Click the "Transform" button in the upper left hand corner. Then click "Unpivot columns" in the middle of the pane. The final result should look like the screenshot below, with all the biomarkers now in one column entitled "Attribute" and their corresponding values in another column entitled "Value".
-```{r 01-Chapter1-39, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image12.png")
-```
-
-8. To save this, go back to the "Home" tab and click "Close & load". You should see something similar to the screenshot below.
-```{r 01-Chapter1-40, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image13.png")
-```
-
-9. In the table style gallery in the upper right (within the "Table" tab), click the arrow to the left of the green table until you see a style with no shading, then click that unshaded style.
-10. Click "Convert to Range" within the "Table" tab. This removes the Power Query capabilities, so that the data becomes a regular Excel sheet.
-11. Now the "Category" column can be created to identify the types of biomarkers in the dataset. The allostatic load (AL) biomarkers denoted in the "Category" column include the variables Cortisol, CRP, Fibrinogen, Hba1c, HDL, and Noradrenaline. The rest of the variables were labeled as cytokines. Additionally, we can make this data more closely resemble the final long format screenshot by bolding the headers, centering all the data, etc.
-
-We have successfully wrangled our data and the final dataset now looks like this:
-```{r 01-Chapter1-41, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image14.png")
-```
-
-
-
-## Generating Summary-Level Statistics with Pivot Tables
-A PivotTable is a tool in Excel used to summarize numerical data. It is called a pivot table because it pivots, or changes, how the data are displayed to make statistical inferences. This can be useful for generating initial summary-level statistics to gauge the distribution of data.
-
-To create a PivotTable, start by selecting all of the data. (Hint: Try using the keyboard shortcut mentioned above.) Click the "Insert" tab on the upper left-hand side, click "PivotTable", and click "OK". The new PivotTable should be available in a new sheet, as seen in the screenshot below.
-```{r 01-Chapter1-42, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image15.png")
-```
-
-A PivotTable is constructed by dragging column headers into the PivotTable fields located on the right-hand side. For example, what if we were interested in determining whether there were differences in average expression between non-smokers and cigarette smokers in each category of biomarkers? As seen below, drag the "Group" variable under the "Rows" field and drag the "Value" variable under the "Values" field.
-```{r 01-Chapter1-43, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image16.png")
-```
-
-Notice that it automatically calculates the sum of the expression values for each group. To change the function to average, click the "i" icon and select "Average". The output should mirror what's below with non-smokers having an average expression that's more than double that of cigarette smokers.
-```{r 01-Chapter1-44, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_1/Module1_4_Input/Module1_4_Image17.png")
-```
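-
-The same group-wise average is a one-liner in R with `aggregate()` (values hypothetical):
-```{r}
-# Hypothetical example: average Value by Group, as in the PivotTable
-long <- data.frame(Group = c("NS", "NS", "CS", "CS"),
-                   Value = c(10, 14, 4, 6))
-aggregate(Value ~ Group, data = long, FUN = mean)
-```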
-
-
-
-## Excel vs. R: Which Should You Use?
-For the most part, it's better to perform final analyses in R (or another programming language) rather than Excel for the following reasons:
-
-+ R clearly shows the code (instructions), which makes editing, interpretability, and sharing easier. This makes analyses more reproducible and can save time.
-+ R has packages that make more complex analyses (e.g., machine learning and heatmaps) possible that aren't available in Excel.
-+ R can handle larger data sets.
-+ R can compute and process data faster.
-
-However, Excel still offers many benefits for running analyses, including:
-
-+ Excel is user-friendly and most people have experience in navigating the software at a basic level.
-+ Excel can be faster for rudimentary statistical analyses and visualizations.
-
-Depending on the scientist's skill level and the complexity of the analysis, either Excel or R may be the better choice.
-
-
-
-
-## Concluding Remarks
-In summary, this training module highlights the importance of data wrangling and how to do so in Microsoft Excel for downstream analyses. Concepts discussed include helpful Excel features like power queries and pivot tables and when to use Microsoft Excel vs. R.
-
-### Additional Resources
-Data wrangling in Excel can be expedited with knowledge of useful features and functions to format data. Check out the resources below for additional information on Excel tricks.
-
-+ [Data Analysis in Excel](https://careerfoundry.com/en/blog/data-analytics/data-analysis-in-excel/)
-+ [Excel Spreadsheet Hacks](https://www.lifehack.org/articles/technology/20-excel-spreadsheet-secrets-youll-never-know-you-dont-read-this.html)
-+ [Excel for Beginners](https://www.udemy.com/course/useful-excel-for-beginners/)
-
-
-
-
-
-:::tyk
-1. Try wrangling the "Module1_4_TYKInput.xlsx" file to mimic the cleaned versions of the data found in "Module1_4_TYKSolution.xlsx". This dataset includes sterol and cytokine concentration levels extracted from induced sputum samples collected after ozone exposure. After wrangling, you should end up with a sheet for subject information and a sheet for experimental data.
-2. Using a PivotTable on the cleaned dataset, find the standard deviation of each cytokine variable stratified by disease status.
-:::
-
diff --git a/Chapter_1/1_1_FAIR/1_1_FAIR.Rmd b/Chapter_1/1_1_FAIR/1_1_FAIR.Rmd
new file mode 100644
index 0000000..9ddf0ee
--- /dev/null
+++ b/Chapter_1/1_1_FAIR/1_1_FAIR.Rmd
@@ -0,0 +1,309 @@
+# (PART\*) Chapter 1 Introductory Data Science {-}
+
+# 1.1 FAIR Data Management Practices
+
+This training module was developed by Rebecca Boyles, with contributions from Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+This training module provides a description of FAIR data management practices and points participants to important resources to help ensure generated data meet current FAIR guidelines. This module is content-based (as opposed to coding-based) in order to present information clearly and serve as an important resource alongside the other scripted training activities.
+
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following questions:
+
+1. What is FAIR?
+2. When was FAIR first developed?
+3. When making data ‘Findable’, who and what should be able to find your data?
+4. When saving/formatting your data, which of the following formats is preferred to meet FAIR principles: .pdf, .csv, or a proprietary output file from your lab instrument?
+5. How can I find a suitable data repository for my data?
+
+
+
+## Introduction to FAIR
+Proper data management is of utmost importance while leading data analyses within the field of environmental health science. A method to ensure proper data management is the implementation of Findability, Accessibility, Interoperability, and Reusability (FAIR) practices. A landmark paper that describes FAIR practices in environmental health research is the following:
+
++ Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15. PMID: [26978244](https://pubmed.ncbi.nlm.nih.gov/26978244/).
+
+The FAIR principles describe a framework for data management and stewardship aimed at increasing the value of data by enabling sharing and reuse. These principles were originally developed from discussions during the [Jointly Designing a Data FAIRport](https://www.lorentzcenter.nl/jointly-designing-a-data-fairport.html) meeting at the Lorentz Center in Leiden, The Netherlands in 2014, which brought together stakeholders to discuss the creation of an environment for virtual computational science. The resulting principles are technology agnostic, discipline independent, community driven, and internationally adopted.
+
+Below is a schematic providing an overview of this guiding principle:
+```{r 1-1-FAIR-1, echo=FALSE, fig.height=3.5, fig.width=3.5, fig.align='center' }
+knitr::include_graphics("Chapter_1/1_1_FAIR/Module1_1_Image1.png")
+```
+
+### Answer to Environmental Health Question 1 & 2
+:::question
+*With this background, we can answer **Environmental Health Question #1 and #2***: What is FAIR and when was it first developed?
+:::
+
+:::answer
+**Answer**: FAIR is a guiding framework that was recently established to promote best data management practices, ensuring that data are Findable, Accessible, Interoperable, and Reusable. It was first developed in 2014, which means that these principles are very new and continuing to evolve!
+:::
+
+
+
+## Breaking Down FAIR, Letter-by-Letter
+
+The aspects of the FAIR principles apply to data and metadata with the aim of making the information available to people and computers as described in the seminal paper by [Wilkinson et al., 2016](https://pubmed.ncbi.nlm.nih.gov/26978244/).
+
+
+### F (Findable) in FAIR
+The F in FAIR describes the components needed to make (meta)data findable: (meta)data should be assigned unique persistent identifiers, be thoroughly described, explicitly reference those identifiers, and have descriptive information (i.e., metadata) that can be searched by both *humans and computer systems*.
+
+**F1. (Meta)data are assigned a globally unique and persistent identifier**
+
++ Each dataset is assigned a globally unique and persistent identifier (PID), for example a DOI. These identifiers allow users to find, cite, and track (meta)data.
++ A DOI looks like: https://doi.org/10.1109/5.771073
++ Action: Ensure that each dataset is assigned a globally unique and persistent identifier. Certain repositories automatically assign identifiers to datasets as a service. If not, obtain a PID via a [PID registration service](https://pidservices.org/).
+
+**F2. Data are described with rich metadata**
+
++ Each dataset is thoroughly described (see R1): the metadata document how the data were generated, under what terms (license) and how they can be (re)used, and provide the necessary context for proper interpretation. This information needs to be machine-readable.
++ Action: Fully document each dataset in the metadata, which may include descriptive information about the context, quality and condition, or characteristics of the data. Another researcher in any field, or their computer, should be able to properly understand the nature of your dataset. Be as generous as possible with your metadata (see R1).
+
+**F3. Metadata clearly and explicitly include the identifier of the data it describes**
+
++ Explanation: The metadata and the dataset they describe are separate files. The association between a metadata file and the dataset is obvious thanks to the mention of the dataset’s PID in the metadata.
++ Action: Make sure that the metadata contains the dataset’s PID.
+
+**F4. (Meta)data are registered or indexed in a searchable resource**
+
++ Explanation: Metadata are used to build easily searchable indexes of datasets. These resources allow users to search for existing datasets, much like searching for a book in a library.
++ Action: Provide detailed and complete metadata for each dataset (see F2).
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: When making data ‘Findable’, who and what should be able to find your data?
+:::
+
+:::answer
+**Answer**: Both humans and computer systems should be able to find your data.
+:::
+
+
+
+### A (Accessible) in FAIR
+The A components are designed to ensure that (meta)data remain available long-term and can be accessed by humans and machines using standard communication protocols, with clearly described limitations on reuse.
+
+**A1. (Meta)data are retrievable by their identifier using a standardized communications protocol**
+
++ Explanation: If one knows a dataset’s identifier and the location where it is archived, one can access at least the metadata. Furthermore, the user knows how to proceed to get access to the data.
++ Action: Clearly define who can access the actual data and specify how. It is possible that data will not be downloaded, but rather reused *in situ*. If so, the metadata must specify the conditions under which this is allowed (which may differ from the conditions that must be fulfilled for external usage/download).
+
+**A1.1 The protocol is open, free, and universally implementable**
+
++ Explanation: Anyone with a computer and an internet connection can access at least the metadata.
+
+**A1.2 The protocol allows for an authentication and authorization procedure, where necessary**
+
++ Explanation: It often makes sense to request that users create an account on a repository. This allows the repository to authenticate the owner (or contributor) of each dataset and, potentially, to set user-specific rights.
+
+**A2. Metadata are accessible, even when the data are no longer available**
+
++ Explanation: Maintaining all datasets in a readily usable state eternally would require an enormous amount of curation work (adapting to new format standards, converting to a different format if the required software is discontinued, etc.). Keeping the metadata describing each dataset accessible, however, can be done with fewer resources. This makes it possible to build comprehensive data indexes including all current, past, and potentially arising datasets.
++ Action: Provide detailed and complete metadata for each dataset (see R1).
+
+
+
+### I (Interoperable) in FAIR
+The I components of the principles address the need for data exchange and interpretation by humans and machines, which includes the use of controlled vocabularies or ontologies to describe (meta)data and the description of provenance relationships through appropriate data citation.
+
+**I1. (Meta)data use a formal, accessible, shared, and broadly applicable language**
+
++ Explanation: Interoperability typically means that each computer system has at least knowledge of the formats in which other systems exchange data. If (meta)data are to be searchable, and if compatible data sources are to be combinable in a (semi)automatic way, computer systems need to be able to decide whether the contents of datasets are comparable.
++ Action: Provide machine-readable data and metadata in an accessible language, using a well-established formalism. Data and metadata are annotated with resolvable vocabularies/ontologies/thesauri that are commonly used in the field (see I2).
+
+**I2. (Meta)data use vocabularies that follow FAIR principles**
+
++ Explanation: The controlled vocabulary (e.g., [MESH](https://www.ncbi.nlm.nih.gov/mesh/)) used to describe datasets needs to be documented. This documentation needs to be easily findable and accessible by anyone who uses the dataset.
++ Action: The vocabularies/ontologies/thesauri are themselves findable, accessible, interoperable and thoroughly documented, hence FAIR. Lists of these standards can be found at: [NCBO BioPortal](https://bioportal.bioontology.org/), [FAIRSharing](https://fairsharing.org/), [OBO Foundry](http://www.obofoundry.org/).
+
+**I3. (Meta)data include qualified references to other (meta)data**
+
++ Explanation: If the dataset builds on another dataset, if additional datasets are needed to complete the data, or if complementary information is stored in a different dataset, this needs to be specified. In particular, the scientific link between the datasets needs to be described. Furthermore, all datasets need to be properly cited (i.e. including their persistent identifiers).
++ Action: Properly cite relevant/associated datasets, by providing their persistent identifiers, in the metadata, and describe the scientific link/relation to your dataset.
+
+
+
+### R (Reusable) in FAIR
+The R components highlight needs for the meta(data) to be reused and support integration such as sufficient description of the data and data use limitations.
+
+**R1. Meta(data) are richly described with a plurality of accurate and relevant attributes**
+
+Explanation: Description of a dataset is required at two different levels:
+
++ Metadata describing the dataset: what does the dataset contain, how was the data generated, how has it been processed, how can it be reused.
++ Metadata describing the data: any information needed to properly use the data, such as definitions of the variable names.
+
+Action: Provide complete metadata for each data file.
+
++ Scope of your data: for what purpose was it generated/collected?
++ Particularities or limitations about the data that other users should be aware of.
++ Date of the dataset generation, lab conditions, who prepared the data, parameter settings, name and version of the software used.
++ Variable names are explained or self-explanatory.
++ Version of the archived and/or reused data is clearly specified and documented.
+
+
+
+## What Does This Mean for You?
+We advise the following as starting points for participants to begin meeting FAIR guidance:
+
++ Learn how to create a [Data Management Plan](https://dmptool.org)
++ Keep good documentation (project & data-level) while working
++ Do not use proprietary file formats (.csv is a great go-to format for your data!)
++ When able, use a domain appropriate metadata standard or ontology
++ Ruthlessly document any steps in a project
++ Most of FAIR can be handled by selecting a good data or software repository
++ Don’t forget to include a [license](https://resources.data.gov/open-licenses/)!
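+
+For example, data held in a proprietary spreadsheet can be exported from R to .csv in one line (the file and column names below are hypothetical):
+```{r}
+# Hypothetical example: save data in the open, non-proprietary .csv format
+df <- data.frame(Subject_Number = 1:3, Group = c("NS", "CS", "NS"))
+write.csv(df, "Allostatic_DATA.csv", row.names = FALSE)
+```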
+
+### Answer to Environmental Health Question 4
+:::question
+*With these, we can answer **Environmental Health Question #4***: When saving/formatting your data, which of the following formats is preferred to meet FAIR principles: .pdf, .csv, or a proprietary output file from your lab instrument?
+:::
+
+:::answer
+**Answer**: A .csv file is preferred to enhance data sharing.
+:::
+
+
+
+## Data Repositories for Sharing of Data
+When you are organizing your data to deposit online, it is important to identify an appropriate repository in which to publish your dataset. A good starting place is a repository registry such as [FAIRsharing.org](https://fairsharing.org/) or [re3data.org](https://www.re3data.org/). Journals can also provide helpful resources and starting repository lists; for example, [Nature](https://www.nature.com/sdata/policies/repositories#general) and [PLOS](https://journals.plos.org/plosone/s/recommended-repositories) have both published lists of recommended repositories. Funding agencies, including the NIH, may also recommend specific repositories.
+
+Below are some examples of two main categories of data repositories:
+
+**1. Domain Agnostic Data Repositories**
+
+Domain agnostic repositories allow the deposition of any data type. Some examples include the following:
+
++ Data in Brief Articles (e.g., [Elsevier's Data in Brief Journal](https://www.journals.elsevier.com/data-in-brief))
++ [Dryad](https://www.datadryad.org)
++ [Figshare](https://figshare.com/)
++ [The Dataverse Project](https://dataverse.org/)
++ [Zenodo](https://zenodo.org/)
+
+
+**2. Domain Specific Data Repositories**
+
+Domain specific repositories allow the deposition of specific types of data, produced from specific types of technologies or within specific domains. Some examples include the following:
+
++ [Database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/)
++ [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/)
++ [The Immunology Database and Analysis Portal](https://www.immport.org/home)
++ [Metabolomics Workbench (National Metabolomics Data Repository)](https://www.metabolomicsworkbench.org/data/index.php)
++ [Microphysiology Systems Database](https://upddi.pitt.edu/microphysiology-systems-database/)
++ [Mouse Genome Informatics](http://www.informatics.jax.org/)
++ [Mouse Phenome Database](https://phenome.jax.org/)
++ [OpenNeuro](https://openneuro.org/)
++ [Protein Data Bank](https://www.rcsb.org/)
++ [ProteomeXchange](http://www.proteomexchange.org/)
++ [Rat Genome Database](https://rgd.mcw.edu/)
++ [Zebrafish Model Organism Database](http://zfin.org/)
++ and many, many, many others...
+
+### Answer to Environmental Health Question 5
+:::question
+*With these, we can answer **Environmental Health Question #5***: How can I find a suitable data repository for my data?
+:::
+
+:::answer
+**Answer**: I can search through a data repository registry service or look for recommendations from NIH or other funding agencies.
+:::
+
+
+
+## Recent Shifts in Regulatory Policies for Data Sharing
+
+### The NIH Data Management and Sharing Policy
+NIH’s data management and sharing (DMS) policy became effective January 2023. This policy specifically lists the expectations that investigators must comply with in order to promote the sharing of scientific data.
+
+Information about this recent policy can be found through updated [NIH websites](https://sharing.nih.gov/data-management-and-sharing-policy).
+
+Information about writing an official Data Management and Sharing (DMS) plan for your research can be found through [NIH's Guidance on Writing a Data Management & Sharing Plan](https://sharing.nih.gov/data-management-and-sharing-policy/planning-and-budgeting-for-data-management-and-sharing/writing-a-data-management-and-sharing-plan#after).
+
+
+### The 2018 Evidence Act
+The Evidence Act, or Foundations for Evidence-Based Policymaking Act of 2018, was signed into U.S. law on January 14, 2019.
+
+The Act requires federal agencies to build the capacity to use evidence and data in their decision-making and policymaking. It also requires agencies to:
+
++ Develop an evidence-building plan as part of their quadrennial strategic plan
++ Develop an evaluation plan concurrent with their annual performance plan
+
+The Evidence Act also:
+
++ Mandates that data be "open by default"
++ Specifies that a comprehensive data inventory should be created for each agency's open data assets
+
+**How Does the NIH Data Management and Sharing Policy Intersect with the 2018 Evidence Act?**
+
+Making your data FAIR, by definition, makes it more shareable and reusable. Many of the requirements in the NIH DMS Policy and the Evidence Act overlap with the FAIR principles.
+
+
+### The CARE Principles for Indigenous Data Governance
+While we are experiencing increased requirements for the open sharing of data, it is important to recognize that there are circumstances and populations that should, at the same time, be carefully protected. Examples include human clinical or epidemiological data that may become identifiable upon the sharing of sensitive data. Another example is the consideration of Indigenous populations. A recent article by [Carroll et al. 2021](https://www.nature.com/articles/s41597-021-00892-0) states in its abstract:
+
+*As big data, open data, and open science advance to increase access to complex and large datasets for innovation, discovery, and decision-making, Indigenous Peoples’ rights to control and access their data within these data environments remain limited. Operationalizing the FAIR Principles for scientific data with the CARE Principles for Indigenous Data Governance enhances machine actionability and brings people and purpose to the fore to resolve Indigenous Peoples’ rights to and interests in their data across the data lifecycle.*
+
+
+
+## Additional Training Resources on FAIR
+Many organizations, ranging from specific programs to broad consortia, provide training and resources on FAIR principles for scientists. Some of the notable global organizations providing such training and offering opportunities for community involvement are:
+
++ [Committee on Data for Science and Technology (CODATA)](https://www.codata.org/uploads/CODATA@45years.pdf)
++ [Global Alliance for Genomics & Health](https://pubmed.ncbi.nlm.nih.gov/27149219/)
++ [GoFAIR](https://www.go-fair.org/)
++ [Force11](https://www.force11.org/)
++ [Research Data Alliance](http://www.dlib.org/dlib/january14/01guest_editorial.html)
+
+
+**Example Workshops discussing FAIR**:
+
++ NAS Implementing FAIR Data for People and Machines: Impacts and Implications (2019). Available at: https://www.nationalacademies.org/our-work/implementing-fair-data-for-people-and-machines-impacts-and-implications
+
++ NIH Catalyzing Knowledge-driven Discovery in Environmental Health Sciences Through a Harmonized Language, Virtual Workshop (2021). Available at: https://www.niehs.nih.gov/news/events/pastmtg/2021/ehslanguage/index.cfm
+
++ NIH Trustworthy Data Repositories Workshop (2019). Available at: https://datascience.nih.gov/data-ecosystem/trustworthy-data-repositories-workshop
+
++ NIH Virtual Workshop on Data Metrics (2020). Available at: https://datascience.nih.gov/data-ecosystem/nih-virtual-workshop-on-data-metrics
+
++ NIH Workshop on the Role of Generalist Repositories to Enhance Data Discoverability and Reuse: Workshop Summary (2020). Available at: https://datascience.nih.gov/data-ecosystem/nih-data-repository-workshop-summary
+
+
+
+**Example Government Report Documents on FAIR:**
+
++ Collins S, Genova F, Harrower N, Hodson S, Jones S, Laaksonen L, Mietchen D, Petrauskaite R, Wittenburg P. Turning FAIR into reality: Final report and action plan from the European Commission expert group on FAIR data: European Union; 2018. Available at: https://www.vdu.lt/cris/handle/20.500.12259/103794.
+
++ EU. FAIR Data Advanced Use Cases: From Principles to Practice in the Netherlands. 2018. European Union. Available at: doi:10.5281/zenodo.1250535.
+
++ NIH. Final NIH Policy for Data Management and Sharing and Supplemental Information. National Institutes of Health. Federal Register, vol. 85, 2020-23674, 30 Oct. 2020, pp. 68890–900. Available at: https://www.federalregister.gov/d/2020-23674.
+
++ NIH. NIH Strategic Plan for Data Science 2018. National Institutes of Health. Available at: https://datascience.nih.gov/strategicplan.
+
++ NLM. NLM Strategic Plan 2017 to 2027. U.S. National Library of Medicine, Feb. 2018. Available at: https://www.nlm.nih.gov/about/strategic-plan.html.
+
+
+
+**Example Related Publications on FAIR:**
+
++ Comess S, Akbay A, Vasiliou M, Hines RN, Joppa L, Vasiliou V, Kleinstreuer N. Bringing Big Data to Bear in Environmental Public Health: Challenges and Recommendations. Front Artif Intell. 2020 May;3:31. doi: 10.3389/frai.2020.00031. Epub 2020 May 15. PMID: 33184612; PMCID: [PMC7654840](https://pubmed.ncbi.nlm.nih.gov/33184612/).
+
++ Koers H, Bangert D, Hermans E, van Horik R, de Jong M, Mokrane M. Recommendations for Services in a FAIR Data Ecosystem. Patterns (N Y). 2020 Jul 7;1(5):100058. doi: 10.1016/j.patter.2020.100058. Erratum in: Patterns (N Y). 2020 Sep 11;1(6):100104. PMID: [33205119](https://pubmed.ncbi.nlm.nih.gov/33205119/).
+
++ Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, Pétavy F, Galvez J, Becnel LB, Zhou FL, Harmon N, Jauregui B, Jackson T, Hudson L. FAIR data sharing: The roles of common data elements and harmonization. J Biomed Inform. 2020 Jul;107:103421. doi: 10.1016/j.jbi.2020.103421. Epub 2020 May 12. PMID: [32407878](https://pubmed.ncbi.nlm.nih.gov/32407878/).
+
++ Lin D, Crabtree J, Dillo I, Downs RR, Edmunds R, Giaretta D, De Giusti M, L'Hours H, Hugo W, Jenkyns R, Khodiyar V, Martone ME, Mokrane M, Navale V, Petters J, Sierman B, Sokolova DV, Stockhause M, Westbrook J. The TRUST Principles for digital repositories. Sci Data. 2020 May 14;7(1):144. PMID: [32409645](https://pubmed.ncbi.nlm.nih.gov/32409645/).
+
++ Thessen AE, Grondin CJ, Kulkarni RD, Brander S, Truong L, Vasilevsky NA, Callahan TJ, Chan LE, Westra B, Willis M, Rothenberg SE, Jarabek AM, Burgoon L, Korrick SA, Haendel MA. Community Approaches for Integrating Environmental Exposures into Human Models of Disease. Environ Health Perspect. 2020 Dec;128(12):125002. PMID: [33369481](https://pubmed.ncbi.nlm.nih.gov/33369481/).
+
++ Roundtable on Environmental Health Sciences, Research, and Medicine; Board on Population Health and Public Health Practice; Health and Medicine Division; National Academies of Sciences, Engineering, and Medicine. Principles and Obstacles for Sharing Data from Environmental Health Research: Workshop Summary. Washington (DC): National Academies Press (US); 2016 Apr 29. PMID: [27227195](https://pubmed.ncbi.nlm.nih.gov/27227195/).
+
+
+
+
+
+:::tyk
+Let’s imagine that you’re a researcher who is planning on gathering a lot of data using the zebrafish model. In order to adequately prepare your studies and steps to ensure data are deposited into proper repositories, you have the idea to check repository information obtained in [FAIRsharing.org](https://fairsharing.org/). What are some example repositories and relevant ontology resources that you could use to organize, deposit, and share your zebrafish data (hint: use the search tool)?
+:::
diff --git a/Chapter_1/Module1_1_Input/Module1_1_Image1.png b/Chapter_1/1_1_FAIR/Module1_1_Image1.png
similarity index 100%
rename from Chapter_1/Module1_1_Input/Module1_1_Image1.png
rename to Chapter_1/1_1_FAIR/Module1_1_Image1.png
diff --git a/Chapter_1/1_2_Data_Sharing/1_2_Data_Sharing.Rmd b/Chapter_1/1_2_Data_Sharing/1_2_Data_Sharing.Rmd
new file mode 100644
index 0000000..054e948
--- /dev/null
+++ b/Chapter_1/1_2_Data_Sharing/1_2_Data_Sharing.Rmd
@@ -0,0 +1,238 @@
+
+# 1.2 Data Sharing through Online Repositories
+## An Overview and Example with the Dataverse Repository
+
+This training module was developed by Kyle R. Roell, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Submitting data to publicly available repositories is an essential part of ensuring data meet FAIR guidelines, as discussed in detail in the previous training module. There are many benefits to sharing and submitting your research data, such as:
+
++ Making more use out of data that are generated in your lab
++ More easily sharing and integrating across datasets
++ Ensuring reproducibility in analysis findings and conclusions
++ Improving the tracking and archiving of data sources, and data updates
++ Increasing the awareness and attention surrounding your research as others locate your data through additional online queries
+
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. How should I structure my data for upload into online repositories?
+2. What does the term 'metadata' mean and what does it look like?
+
+
+This module will introduce some of the repositories that are commonly used to deposit data, how to set up metadata files, and how to organize example data in preparation for sharing. We will also provide information surrounding best practices for data organization and sharing through these repositories. Additional resources are also provided throughout, as there are many ways to organize, share, and deposit data depending on your data types and structures and overall research goals.
+
+
+
+## Data Repositories
+
+There are many publicly available repositories that we should consider when depositing data. Some general repository registries that are helpful to search through include [FAIRsharing.org](https://fairsharing.org/) or [re3data.org](https://www.re3data.org/). Journals can also provide helpful resources and starting repository lists, such as [Nature](https://www.nature.com/sdata/policies/repositories#general) and [PLOS](https://journals.plos.org/plosone/s/recommended-repositories), both of which have published lists of recommended repositories. As detailed in the FAIR training module, there are two main categories of data repositories:
+
+**1. Domain Agnostic Data Repositories**
+Domain agnostic repositories allow the deposition of any data type. Some examples include:
+
++ Data in Brief Articles (e.g., [Elsevier's Data in Brief Journal](https://www.journals.elsevier.com/data-in-brief))
++ [Dryad](https://www.datadryad.org)
++ [Figshare](https://figshare.com/)
++ [The Dataverse Project](https://dataverse.org/)
++ [Zenodo](https://zenodo.org/)
+
+
+**2. Domain Specific Data Repositories**
+Domain specific repositories allow the deposition of specific types of data, produced from specific types of technologies or within specific domains. Some examples include:
+
++ [Database of Genotypes and Phenotypes](https://www.ncbi.nlm.nih.gov/gap/)
++ [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/)
++ [The Immunology Database and Analysis Portal](https://www.immport.org/home)
++ [Metabolomics Workbench (National Metabolomics Data Repository)](https://www.metabolomicsworkbench.org/data/index.php)
++ [Microphysiology Systems Database](https://upddi.pitt.edu/microphysiology-systems-database/)
++ [Mouse Genome Informatics](http://www.informatics.jax.org/)
++ [Mouse Phenome Database](https://phenome.jax.org/)
++ [OpenNeuro](https://openneuro.org/)
++ [Protein Data Bank](https://www.rcsb.org/)
++ [ProteomeXchange](http://www.proteomexchange.org/)
++ [Rat Genome Database](https://rgd.mcw.edu/)
++ [Zebrafish Model Organism Database](http://zfin.org/)
++ and many, many, many others...
+
+This training module focuses on providing an example of how to organize and upload data into Dataverse, though many of the methods described below pertain to other data repositories as well and incorporate general best practices for data organization and sharing.
+
+
+
+## The Dataverse Project
+Dataverse, organized through [The Dataverse Project](https://dataverse.org/), is a popular repository option that allows for upload of most types of material, without any stringent requirements. The Dataverse organization also provides ample resources on how to organize, upload, and share data through Dataverse, including thorough, readable user guides and best practices documentation.
+```{r 1-2-Data-Sharing-1, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image1.png")
+```
+*Screenshot of the main page of [The Dataverse Project](https://dataverse.org/)*
+
+An easy way to think about a Dataverse is as something similar to a folder system on your computer. A Dataverse is just an online folder that contains files, data, or datasets that are all related to some topic, project, etc. Although Dataverse was started at Harvard and the base Dataverse lives there, there are many versions of Dataverse that are specific to and supported by various institutions. For example, these training modules are being developed primarily by faculty, staff, and students at the University of North Carolina at Chapel Hill. As such, the examples contained in this module will specifically connect with the [UNC Dataverse](https://dataverse.unc.edu), though many of the methods outlined here are applicable to other Dataverses and additional online repositories in general.
+
+
+
+## What is a Dataverse?
+
+Remember how we pointed out that a Dataverse is similar to a folder system on a computer? Well, here we are going to show you what that actually looks like. But first, something that can be confusing when starting to work with Dataverse is the fact that the term Dataverse is used for both the overarching repository as well as individual subsections (or folders) in which data are stored. For example, the UNC Dataverse is called a Dataverse, but to upload data, you need to upload it to a specific sub-Dataverse. So, what is the difference between the high-level UNC Dataverse and the smaller sub-Dataverses? Well, nothing, really. The UNC Dataverse is similar to a large folder that says, these are all the projects and research related to or contained within UNC. From there, we want to be more specific about where we store our research, so we create more sub-Dataverses (folders) within that higher, overarching UNC Dataverse.
+
+As an example, using the UNC Dataverse, here we can see various sub-Dataverses that have been created as repositories for specific projects or types of data.
+
+```{r 1-2-Data-Sharing-2, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image2.png")
+```
+
+As another example looking within a specific Dataverse, here we can see the Dataverse that hosts datasets and publications for Dr. Julia Rager's lab, the [Ragerlab-Dataverse](https://dataverse.unc.edu/dataverse/ragerlab).
+
+```{r 1-2-Data-Sharing-3, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image3.png")
+```
+
+Within this Dataverse, we can see various datasets produced by her lab. It is worth noting that the datasets may not necessarily be directly related to each other in terms of exact topic; for example, the Ragerlab-Dataverse hosts data pertaining to wildfire smoke exposure as well as chemical exposures and breast cancer. But they all pertain to experiments and analyses run within her specific lab.
+
+Let's now start talking more specifically about how to organize data and format files for Dataverse, create your own "Dataverse", upload datasets, and what this all means!
+
+
+
+### Dataset Structure
+
+Before uploading your data to any data repository, it is important to structure your data efficiently and effectively, making it easy for others to navigate, understand, and utilize. While we will cover this in various sections throughout these training modules, here are some basic tips for data structure and organization.
+
++ Keep all data for one participant or subject within one column (or row) of your dataset
+ + Genomic data and other analytical assays tend to have subjects on columns and genes, expression, etc. as the rows
+ + Descriptive and demographic data often tend to have subjects or participants as the rows and each descriptor variable (including demographics and any other subject variables) as columns
++ Create succinct, descriptive variable names
+ + For example, do not use something like "This Variable Contains Information Regarding Smoking Status"; instead, just use something like "Smoking_Status"
+ + Be mindful of spacing, special characters, and capitalization within variable names
++ Think about transforming data from wide to long format, depending on your specific dataset and general conventions
++ Be sure to follow the specific guidelines of the repository when applicable
+
+**TAME 2.0 Module 1.1 FAIR Data Management Practices** and **TAME 2.0 Module 1.4 Data Wrangling in Excel** are also helpful resources to reference when thinking about organizing your data.
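To make the wide-to-long transformation mentioned above more concrete, below is a minimal sketch in base R using `reshape()` on a small hypothetical dataset (the subject IDs, variable names, and values are all invented for illustration):

```r
# Hypothetical wide-format dataset: one row per subject,
# with an exposure measurement collected at two timepoints
wide_df <- data.frame(
  Subject_ID  = c("S1", "S2", "S3"),
  Exposure_T1 = c(0.12, 0.45, 0.33),
  Exposure_T2 = c(0.18, 0.51, 0.29)
)

# Convert to long format: one row per subject per timepoint
long_df <- reshape(
  wide_df,
  direction = "long",
  varying   = c("Exposure_T1", "Exposure_T2"),
  v.names   = "Exposure",
  timevar   = "Timepoint",
  times     = c("T1", "T2"),
  idvar     = "Subject_ID"
)
rownames(long_df) <- NULL
long_df
```

The same transformation can also be carried out with `tidyr::pivot_longer()` if you prefer tidyverse conventions.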
+
+A general example of an organized, long format dataset in Excel is provided below:
+```{r 1-2-Data-Sharing-4, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image4.png")
+```
+
+Only .csv or .txt files can be uploaded to Dataverse; therefore, the metadata and data tabs in an Excel file will need to be saved and uploaded as two separate .csv or .txt files.
+
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question 1***: How should I structure my data for upload into online repositories?
+:::
+
+:::answer
+**Answer**: It is ideal to have data clearly organized and filled, with succinct and descriptive variable names clearly labeled and values filled in. Most commonly, datasets should be saved as separate .csv or .txt files for upload into data repositories.
+:::
+
+
+
+## Metadata
+There are many different definitions of what a metadata file is. Helpful explanations, for example, are provided by the [UNC University Libraries](https://guides.lib.unc.edu/metadata/definition):
+
+:::txtbx
+There are many definitions of metadata, but one of the simplest is *data about data*. More specifically...
+
++ *Metadata (in terms of data management) describe a dataset:* how they were collected; when they were collected; what assumptions were made in their methodology; their geographic scope; if there are multiple files, how they relate to one another; the definitions of individual variables and, if applicable, what possible answers were (i.e., to survey questions); the calibration of any equipment used in data collection; the version of software used for analysis; etc. Very often, a dataset that has no metadata is incomprehensible.
+
++ *Metadata ARE data.* They are pieces of information that have some meaning in relation to another piece of information. They can be created, managed, stored, and preserved like any other data.
+
++ *Metadata can be applied to anything.* A computer file can be described in the same way that a book or piece of art can be described. For example, both can have a title, an author, and a year created. Metadata should be documented for research outputs of any kind.
+
++ *Metadata generally has little value on their own.* Metadata adds value to other information, but are usually not valuable in themselves. There are exceptions to this rule, such as text transcription of an audio file.
+
+There are three kinds of metadata:
+
++ Descriptive metadata consist of information about the content and context of your data.
+
+ + Examples: title, creator, subject keywords, and description (abstract)
+
++ Structural metadata describe the physical structure of compound data.
+
+ + Examples: camera used, aperture, exposure, file format, and relation to other data or files
+
++ Administrative metadata are information used to manage your data.
+
+ + Examples: when and how they were created, who can access them, software required to use them, and copyright permissions
+
+:::
+
+
+Therefore, after having organized your primary dataset for submission into online repositories, it is equally important to have a metadata file for easy comprehension and utilization of your data by future researchers or anyone downloading your data. While most repositories capture some metadata on the dataset page (e.g., description of data, upload date, contact information), there is generally little information about the specific data values and variables. In this section, we review some general guidelines and tips to better annotate your data.
+
+First, keep in mind that, depending on the specific repository you are using, you may have to follow its metadata standards. If you are uploading to a more generalist repository, however, the metadata format may be up to you to define.
+
+Generally, a metadata file consists of a set of descriptors for each variable in the data. If you are uploading data that contain many covariates or descriptive variables, it is essential that you provide a metadata file that describes these covariates, including both a description of each variable and the specific levels of any categorical or factor-type variables.
+
+From the dataset presented previously, here we present an example of an associated metadata file:
+```{r 1-2-Data-Sharing-5, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image5.png")
+```
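As a complementary sketch of how such a metadata file can be assembled programmatically, the following base R snippet builds a small variable dictionary for a hypothetical dataset and saves it as its own .csv file (all variable names, descriptions, and units are invented for illustration):

```r
# Hypothetical metadata: one row per variable in the primary dataset
metadata <- data.frame(
  Variable = c("Subject_ID", "Smoking_Status", "Exposure_Conc"),
  Description = c(
    "Unique identifier assigned to each study participant",
    "Participant's self-reported smoking status",
    "Measured chemical exposure concentration"
  ),
  Levels_or_Units = c(
    "Character (e.g., S1, S2)",
    "0 = non-smoker; 1 = smoker",
    "Numeric (ug/L)"
  )
)

# Save as its own .csv file, to be uploaded alongside the primary dataset
write.csv(metadata, "dataset_metadata.csv", row.names = FALSE)
```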
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question 2***: What does the term 'metadata' mean and what does it look like?
+:::
+
+:::answer
+**Answer**: Metadata refers to the information that describes and explains data. It looks like an additional dataset that provides context with details such as the source, type, owner, and relationships to other datasets. This file can help users understand the relevance of a specific dataset and provide guidance on how to use it.
+:::
+
+
+
+## Creating a Dataverse
+
+Now, let's review how to actually create a Dataverse. First, navigate to the parent Dataverse that you would like to use as your primary host website. For example, our group uses the [UNC Dataverse](https://dataverse.unc.edu/). If you do not already have one, create a username and login.
+
+Then, from the home Dataverse page, click "Add Data" and select "New Dataverse".
+```{r 1-2-Data-Sharing-6, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image6.png")
+```
+
+And fill in the information necessary.
+
+And that is it. After creating your Dataverse, you will need to publish it before it is accessible to the public. Note that you can actually create a Dataverse within another Dataverse (similar to a folder within a folder on your computer). Even when you are creating a new Dataverse at the home, UNC Dataverse level, you are still technically creating a new Dataverse within an existing one (the large UNC Dataverse).
+
+Here are some tips as you create your Dataverse:
+
++ Do not recreate a Dataverse that already exists
++ Choose a name that is specific, but general enough that it doesn't only pertain to one specific dataset
++ You can add more than one contact email, if necessary
+
+
+
+## Creating a Dataset
+
+Creating a dataset creates a page for your data containing information about those data, a citation for the data (something valuable and somewhat unique to Dataverse), as well as the place from which your data can be directly accessed or downloaded. First, decide the specific Dataverse in which your data will live and navigate to that Dataverse's site. Then carry out the following steps to create a dataset:
+
++ Navigate to the Dataverse page under which your dataset will live
++ Click "Add Data" and then select "New Dataset"
+
+```{r 1-2-Data-Sharing-7, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_2_Data_Sharing/Module1_2_Image7.png")
+```
+
++ Fill in the necessary information
++ Upload your data and metadata file(s) structured as detailed above
+
+Now, you have a dataset within your Dataverse. Again, you will have to publish the dataset for someone to have access to it. The easy part of using a more generalist repository like Dataverse is that you do not have to adhere to a strict data structure. However, this means it is up to you to make sure your data are readable and usable.
+
+
+
+## Concluding Remarks
+In this training module, we set out to express the importance of uploading data to online repositories, demonstrate what the upload process may look like using a generalist repository (Dataverse), and give some examples and tips on structuring data for upload and creating metadata files. It is important to choose the appropriate repository for your data based on your field of study and specifications of your work.
+
+
+
+
+
+:::tyk
+Try creating your own Dataverse repository, format your files to be uploaded to Dataverse, and upload those files to your new repository!
+:::
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image1.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image1.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image1.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image1.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image2.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image2.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image2.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image2.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image3.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image3.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image3.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image3.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image4.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image4.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image4.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image4.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image5.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image5.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image5.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image5.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image6.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image6.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image6.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image6.png
diff --git a/Chapter_1/Module1_2_Input/Module1_2_Image7.png b/Chapter_1/1_2_Data_Sharing/Module1_2_Image7.png
similarity index 100%
rename from Chapter_1/Module1_2_Input/Module1_2_Image7.png
rename to Chapter_1/1_2_Data_Sharing/Module1_2_Image7.png
diff --git a/Chapter_1/1_3_Github/1_3_Github.Rmd b/Chapter_1/1_3_Github/1_3_Github.Rmd
new file mode 100644
index 0000000..d605de5
--- /dev/null
+++ b/Chapter_1/1_3_Github/1_3_Github.Rmd
@@ -0,0 +1,209 @@
+
+# 1.3 File Management using Github
+
+This training module was developed by Alexis Payton, Lauren E. Koval, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Good data practices like file management and code tracking are imperative for data analysis initiatives, especially when working in research teams and/or shared project folders. Oftentimes, analyses and manuscripts are edited many times prior to being submitted for a grant or publication. Analysis methods are also shared among members of a research team and with external communities, as further detailed in **TAME 2.0 Module 1.1 FAIR Data Management Practices**. Therefore, Github has emerged as an effective way to manage, share, and track how code changes over time.
+
+[Github](https://github.com) is a publicly accessible platform designed to facilitate version control and issue tracking of code. It is used by us and many of our colleagues not only to document versions of script written for data analysis and visualization, but also to make our code publicly available for open communication and dissemination of results.
+
+This training module serves as a launch pad for getting acclimated with Github and includes...
+
++ Creating an account
++ Uploading code
++ Creating a repository and structuring it to support a manuscript submission
+
+
+## Creating an Account
+First, users must create their own accounts within Github to start uploading/sharing code. To do this, navigate to [github.com](https://github.com), click "Sign Up", and follow the on-screen instructions.
+```{r 1-3-Github-1, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image10.png")
+```
+
+
+
+## Creating a Repository
+A repository, also known as a "repo", is similar to a project folder that will contain all code pertaining to a specific project (which can be used for specific research programs, grants, or manuscripts, as examples). A repository can be set to public or private. If a repo is initially set to private to keep findings confidential prior to publication, it can always be updated to public once findings are ready for public dissemination. Multiple people can be allowed to work on a project together within a single repository.
+
+To access the repositories that are currently available to you through your user account, click the circle in top right-hand corner and click "Your repositories".
+```{r 1-3-Github-2, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image11.png")
+```
+
+To create a new repository, click on the green button that says "New".
+```{r 1-3-Github-3, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image12.png")
+```
+
+Then give your repository a descriptive name. We often edit the repo titles to match the title of specific manuscripts, though specific titling formats are up to the users/team's preference.
+
+For more information, visit Github's [Create a repo](https://docs.github.com/en/get-started/quickstart/create-a-repo) documentation.
+
+Then click "Add a README file" to initiate the README file. It is important to continually edit this file to provide analysis-specific background information, as well as any additional information that will help track project details during and after code drafting. *We provide further details surrounding specific information that can be included within the README file below.*
+```{r 1-3-Github-4, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image13.png")
+```
+
+
+
+## Uploading Code
+
+The simplest way to upload code is to first navigate to the repository that you would like to upload your code/associated files to. Note that this could represent a repo that you created or that someone granted you access to.
+
+Click “Add file” then click “Upload files”. Drag and drop your file containing your script into github and click “Commit changes”.
+```{r 1-3-Github-5, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image1.png")
+```
+
+A more advanced way to upload code is by using the command line, which allows a user to directly interact with the computer or software application. Further documentation can be found [here](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository).
+
+
+
+## Adding Subfolders in a Repository
+To keep the repository organized, it might be necessary to create a new folder (like the folder labeled “1.1. Summary Statistics” in the above screenshot). Files can be grouped into these folders based on the type of analysis.
+
+To do so, click on the new file and then click on the pencil icon next to the "Blame" button.
+```{r 1-3-Github-6, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image2.png")
+```
+
+Click on the box that contains the title of the file. Write the title of your new folder and then end with a forward slash (/). In the screenshot below, we're creating a new folder entitled "New Folder". Click “Commit changes” and your file should now be in a new folder.
+```{r 1-3-Github-7, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image3.png")
+```
+
+
+
+## Updating Code
+Saving iterations of code can save valuable time later, as analyses are constantly being updated and edited. If your code undergoes substantial changes (e.g., adding/removing steps), or if there is code that is likely to be beneficial later on but is no longer relevant to the current analysis, it is helpful to save that version in Github for future reference.
+
+To do so, create a subfolder named “Archive” and move the old file into it. If you have multiple versions of a file with the same name, add the current date to prevent the file from being overwritten later on as seen in the screenshot below.
+```{r 1-3-Github-8, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image4.png")
+```
+
+Once the old file version has been archived, upload the most recent version of your code to the main folder. Based on the screenshot above, that would be under “3. ML Visualizations”.
+
+
+*Note: Uploading a file with the same name overwrites the existing file, and this can't be undone! Therefore, move the older file into the archive folder **PRIOR** to uploading the new version if you'd like it to be saved.*
+
+
+
+## Updating Repository Titles and Structure to Support a Manuscript
+
+If the code is for a manuscript, it's helpful to include, in parentheses, the name of the manuscript table or figure it pertains to. For example, "Baseline Clusters (Figure 3)". This allows viewers to find the code for each table or figure faster.
+```{r 1-3-Github-9, echo=FALSE, fig.width=6, fig.height=7, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image5.png")
+```
+
+
+
+### Using a README.md file
+A README.md file is used to describe the overall aims and purpose of the analyses in the repository or a folder within a repository. It is often the first file that someone will look at in a repo/folder, so it is important to include information that would be valuable to an outsider trying to make use of the work.
+
+To add a README.md file, click “Add file” and then “Create new file”.
+```{r 1-3-Github-10, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image6.png")
+```
+
+Name your file “README.md”.
+```{r 1-3-Github-11, echo=FALSE, fig.width=6, fig.height=7, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image7.png")
+```
+
+A README.md file uses Markdown syntax, which R Markdown builds upon. This type of syntax is very helpful as you continue to develop R coding skills, as it provides a mechanism through which your code's output can be visualized and saved as a rendered file. There are many helpful resources for R Markdown, including some that we find helpful:
+
++ [R Markdown Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
++ [R Markdown Syntax Overview](https://bookdown.org/yihui/rmarkdown/markdown-syntax.html)
+
+The final README.md file for the **OVERALL** repository for manuscript submission should look something like the screenshot below. Always include…
+
++ The main goal of the project
++ The final manuscript name, year it was published, and PubMed ID (if applicable)
++ Graphical abstract (if needed for publication)
++ Names and brief descriptions of each file
+ + Include both the goal of the analysis and the methodology used (e.g., using chi-square tests to determine whether there are statistically significant differences across demographic groups)
++ If the code was written in Jupyter (i.e., has the extension .ipynb rather than .R or .Rmd), nbviewer is a website that can render Jupyter notebook files. This is helpful because notebooks can take too long to render on GitHub, so link to the nbviewer rendering of the repository instead.
+ + Go to [nbviewer.org](https://nbviewer.org) --> type in the name of the repository --> copy the URL and add it to the README.md file
+```{r 1-3-Github-12, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image8.png")
+```
+
+The final README.md file for a subfolder within a repository should look something like the screenshot below. Always include…
+
++ The name of each file
++ Brief description of each file
+ + Include both the goal of the analysis and the methodology used
++ Table or Figure name in the corresponding manuscript (if applicable)
+```{r 1-3-Github-13, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image9.png")
+```
+
+**Note**: The organizational structures for the README.md files shown here are simply recommendations and should be adapted to the needs of the project. However, it is important to include information and organize the repository in a way that helps readers and colleagues who aren't familiar with the project navigate it.
+
+
+
+#### Example Repositories
+Below are links to repositories that contain code for analyses used in published manuscripts. These are examples of well-organized GitHub repositories.
+
+- [Wildfires and Environmental Justice: Future Wildfire Events Predicted to Disproportionally Impact Socioeconomically Vulnerable Communities in North Carolina](https://github.com/UNC-CEMALB/Wildfires-and-Environmental-Justice-Future-Wildfire-Events-Predicted-to-Disproportionally-Impact-So/tree/main)
+
+- [Plasma sterols and vitamin D are correlates and predictors of ozone-induced inflammation in the lung: A pilot study](https://github.com/UNC-CEMALB/Plasma-sterols-and-vitamin-D-are-correlates-and-predictors-of-ozone-induced-inflammation-in-the-lung/tree/main)
+
+- [Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples](https://github.com/Ragerlab/Script_for_Cytokine-Signature-Clusters-as-a-Tool-to-Compare-Changes-associated-with-Tobacco-Product-)
+
+
+
+
+## Tracking Code Changes using Github Branches
+GitHub is a useful platform for managing and tracking code changes made by different collaborators through branches.
+
+When you create a repository, GitHub automatically creates a default branch entitled "main". It's possible to create a new **branch**, which allows a programmer to make changes to files in a repository in isolation from the main branch. This is beneficial because the same file can be compared across branches, potentially edited by different scientists, and then merged to reflect those changes. **Note:** For this to work, the file in the main branch must have the same name as the file in the newly created branch.
+
+Let's start by creating a new branch. First, navigate to a repository, select "main" and then "View all branches".
+```{r 1-3-Github-14, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image14.png")
+```
+
+Click "New branch", give your branch a title, and click "Create new branch". In the screenshot, you'll see the new branch entitled "jr-changes".
+```{r 1-3-Github-15, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image15.png")
+```
+
+As a new collaborator interested in comparing and merging code changes to a file, click on the new branch that was just created. Based on the screenshot, that means click "jr-changes".
+```{r 1-3-Github-16, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image16.png")
+```
+
+After uploading the file(s) to this branch, you'll see a notification that this branch is now a certain number of commits ahead of the main branch. A **commit** records a set of changes to the files in a branch. Based on the screenshot, "jr-changes" is now 2 commits ahead of "main".
+```{r 1-3-Github-17, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image17.png")
+```
+
+Click on "2 commits ahead" and scroll down to compare versions between the "main" and "jr-changes" branches. A pull request will need to be created. A **pull request** allows other collaborators to see changes made to a file within a branch. These proposed changes can be discussed and amended before merging them into the main branch. For more information, visit Github's [branches](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-branches), [pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) and [comparing branches in pull requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-comparing-branches-in-pull-requests) documentation.
+```{r 1-3-Github-18, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image18.png")
+```
+
+Go ahead and click on "Create pull request". Click on "Create pull request" again on the next screen. Select "Merge pull request" and then "Confirm merge".
+```{r 1-3-Github-19, echo=FALSE, fig.width=2, fig.height=3, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_3_Github/Module1_3_Image19.png")
+```
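+The same branch-compare-merge cycle can also be run from the command line. The sketch below is a self-contained local demo (the repository and file names are made up) that parallels the screenshots above: create a branch, edit a file in isolation, compare, and merge.
+
+```shell
+# Local sketch of the branch-and-merge workflow shown on github.com.
+cd "$(mktemp -d)"
+git init -q collab && cd collab
+git config user.name "Collaborator"
+git config user.email "c@example.org"
+echo 'x <- 1' > analysis.R
+git add analysis.R && git commit -qm "Initial analysis"
+git branch -M main                  # name the default branch "main"
+git checkout -q -b jr-changes       # new branch, isolated from main
+echo 'x <- 2' > analysis.R          # edit the same file on the branch
+git commit -aqm "Update analysis on jr-changes"
+git diff main jr-changes            # compare branches (GitHub's compare view)
+git checkout -q main
+git merge -q jr-changes             # like "Merge pull request" -> "Confirm merge"
+```
+
+On GitHub, the pull request adds review and discussion in between the compare and merge steps; the underlying branch operations are the same.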
+
+
+
+## Concluding Remarks
+In summary, this training module serves as a basic tutorial for sharing code on Github in a way that is beneficial for scientific research. Concepts discussed include uploading and updating code, making a repository easily readable for manuscript submissions, and tracking code changes across collaborators. We encourage trainees and data scientists to implement code tracking and sharing through Github and to also keep up with current trends in data analysis documentation that continue to evolve over time.
+
+
+
+:::tyk
+Try creating your own GitHub profile, setting up a practice repo with subfolders, and writing a detailed README.md file paralleling the suggested formatting and content detailed above for your own data analyses!
+:::
+
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image1.png b/Chapter_1/1_3_Github/Module1_3_Image1.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image1.png
rename to Chapter_1/1_3_Github/Module1_3_Image1.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image10.png b/Chapter_1/1_3_Github/Module1_3_Image10.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image10.png
rename to Chapter_1/1_3_Github/Module1_3_Image10.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image11.png b/Chapter_1/1_3_Github/Module1_3_Image11.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image11.png
rename to Chapter_1/1_3_Github/Module1_3_Image11.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image12.png b/Chapter_1/1_3_Github/Module1_3_Image12.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image12.png
rename to Chapter_1/1_3_Github/Module1_3_Image12.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image13.png b/Chapter_1/1_3_Github/Module1_3_Image13.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image13.png
rename to Chapter_1/1_3_Github/Module1_3_Image13.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image14.png b/Chapter_1/1_3_Github/Module1_3_Image14.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image14.png
rename to Chapter_1/1_3_Github/Module1_3_Image14.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image15.png b/Chapter_1/1_3_Github/Module1_3_Image15.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image15.png
rename to Chapter_1/1_3_Github/Module1_3_Image15.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image16.png b/Chapter_1/1_3_Github/Module1_3_Image16.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image16.png
rename to Chapter_1/1_3_Github/Module1_3_Image16.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image17.png b/Chapter_1/1_3_Github/Module1_3_Image17.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image17.png
rename to Chapter_1/1_3_Github/Module1_3_Image17.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image18.png b/Chapter_1/1_3_Github/Module1_3_Image18.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image18.png
rename to Chapter_1/1_3_Github/Module1_3_Image18.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image19.png b/Chapter_1/1_3_Github/Module1_3_Image19.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image19.png
rename to Chapter_1/1_3_Github/Module1_3_Image19.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image2.png b/Chapter_1/1_3_Github/Module1_3_Image2.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image2.png
rename to Chapter_1/1_3_Github/Module1_3_Image2.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image20.png b/Chapter_1/1_3_Github/Module1_3_Image20.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image20.png
rename to Chapter_1/1_3_Github/Module1_3_Image20.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image3.png b/Chapter_1/1_3_Github/Module1_3_Image3.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image3.png
rename to Chapter_1/1_3_Github/Module1_3_Image3.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image4.png b/Chapter_1/1_3_Github/Module1_3_Image4.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image4.png
rename to Chapter_1/1_3_Github/Module1_3_Image4.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image5.png b/Chapter_1/1_3_Github/Module1_3_Image5.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image5.png
rename to Chapter_1/1_3_Github/Module1_3_Image5.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image6.png b/Chapter_1/1_3_Github/Module1_3_Image6.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image6.png
rename to Chapter_1/1_3_Github/Module1_3_Image6.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image7.png b/Chapter_1/1_3_Github/Module1_3_Image7.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image7.png
rename to Chapter_1/1_3_Github/Module1_3_Image7.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image8.png b/Chapter_1/1_3_Github/Module1_3_Image8.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image8.png
rename to Chapter_1/1_3_Github/Module1_3_Image8.png
diff --git a/Chapter_1/Module1_3_Input/Module1_3_Image9.png b/Chapter_1/1_3_Github/Module1_3_Image9.png
similarity index 100%
rename from Chapter_1/Module1_3_Input/Module1_3_Image9.png
rename to Chapter_1/1_3_Github/Module1_3_Image9.png
diff --git a/Chapter_1/1_4_Excel/1_4_Excel.Rmd b/Chapter_1/1_4_Excel/1_4_Excel.Rmd
new file mode 100644
index 0000000..66b842d
--- /dev/null
+++ b/Chapter_1/1_4_Excel/1_4_Excel.Rmd
@@ -0,0 +1,294 @@
+
+# 1.4 Data Wrangling in Excel
+
+This training module was developed by Alexis Payton, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+This module is intended to be a starting guide to cleaning and organizing an example toxicology dataset in Excel. **Data wrangling** involves the cleaning, removal of erroneous data, and restructuring needed to prepare wet-lab-generated data for downstream analyses. These steps will ensure that:
+
++ Data are amenable to downstream analyses in R, or your preferred programming language
++ Data are clear and easily interpretable by collaborators, reviewers, and readers
+
+Click [here](https://www.alteryx.com/glossary/data-wrangling#:~:text=Data%20wrangling%20is%20the%20process,also%20sometimes%20called%20data%20munging.) for more information on data wrangling.
+
+In this training tutorial, we'll make use of an example dataset that needs to be wrangled. The dataset contains concentration values for molecules that were measured using protein-based ELISA technologies. These molecules specifically span 17 sterols and cytokines, selected based upon their important roles in mediating biological responses. These measures were derived from human serum samples. Demographic information also exists for each subject.
+
+The following steps detailed in this training module are by no means exhaustive! Further resources are provided at the end. This module provides example steps that are helpful when wrangling your data in Excel. Datasets often come in many different formats from our wet bench colleagues, therefore some steps will likely need to be added, removed, or amended depending on your specific data.
+
+
+
+## Save a Copy of the Soon-To-Be Organized and Cleaned Dataset as a New File
+Open Microsoft Excel and prior to **ANY** edits, click “File” --> “Save As” to save a new version of the file that can serve as the cleaned version of the data. This is very important for file tracking purposes, and can help in the instance that the original version needs to be referred back to (e.g., if data are accidentally deleted or modified during downstream steps).
+
++ The file needs to be named something indicative of the data it contains followed by the current date (e.g., "Allostatic Mediator Data_061622").
++ The title should be succinct and descriptive.
++ It is okay to use dashes or underscores in the name of the title.
++ Do not include special characters, such as $, #, @, !, %, &, *, (, ), and +. Special characters tend to generate errors on local hard drives when syncing to cloud-based servers, and they are difficult to upload into programming software.
+
+
+Let's first view what the dataset currently looks like:
+
+```{r 1-4-Excel-1, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image1.png")
+```
+
+
+
+### Helpful Excel Keyboard Shortcuts
+
+The following keyboard shortcuts can help you work more efficiently in Excel:
+
++ Move to the last cell in use on the sheet
+ + Control + Fn + Right arrow key (Mac users)
+ + Control + End (PC users)
++ Move to the beginning of the sheet
+ + Control + Fn + Left arrow key, followed by Control + Fn + Up arrow key (Mac users)
+ + Control + Home (PC users)
++ Highlight and grab all data
+ + Click on the first cell in the upper left hand corner then click and hold Shift + Command + Down arrow key + Right arrow key (Mac users)
+ + Click on the first cell in the upper left hand corner then press Control + Shift + Down arrow key + Right arrow key (PC users)
+
+**Note:** This only works if there are no cells with missing information or gaps in the columns/rows used to define the peripheral area.
+
+For more available shortcuts on various operating systems click [here](https://support.microsoft.com/en-us/office/keyboard-shortcuts-in-excel-1798d9d5-842a-42b8-9c99-9b7213f0040f).
+
+
+
+## Remove Extraneous White Space
+Before we can begin organizing the data, we need to remove the entirely blank rows of cells. This reduces the file size and allows for the use of the filter function in Excel, as well as other organizing functions, which will be used in the next few steps. This step also makes the data look more tidy and amenable to import for coding purposes.
+
++ **Excel Trick #1:** Select all lines that need to be removed and press Control + minus key for Mac and PC users. (Note that there are other ways to do this for larger datasets, but this works fine for this small example.)
++ **Excel Trick #2:** An easier way to remove blank rows and cells from larger datasets is to click "Find & Select" --> "Go To Special" --> "Blanks" --> "OK" to select all blank rows and cells, and then, within the Home tab, click "Delete" --> "Delete Sheet Rows".
+
+After removing the blank rows, the file should look like the screenshot below.
+```{r 1-4-Excel-2, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image2.png")
+```
+
+
+
+## Replace Missing Data with “NA”
+There are many ways missing data can be encoded in datasets, including values like "blank", "N/A", or "NA", or simply leaving a cell empty. Replacing all missing values with "NA" is done for two reasons:
+
++ To confirm that the data is indeed missing
++ R reads in "NA" values as missing values
+
+To check for missing values, use the filter function on each column and select only the cells with missing values. (You may need to scroll to the bottom of the filter pop-up window for numerical data.) Enter "NA" into the first cell of the filtered column, then double-click the bottom right corner of that cell to copy the "NA" down the rest of the column.
+
+There were no missing data in this dataset, so this step can be skipped.
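+Although no replacements were needed here, the same normalization can be expressed programmatically at import time. The sketch below is a hedged aside using pandas on a small mock CSV (not the module's dataset): however missingness was encoded in the spreadsheet, it is converted to true missing values as the data are read in.
+
+```python
+import io
+import pandas as pd
+
+# Mock CSV with two kinds of missingness: an "N/A" string and an empty cell.
+csv = io.StringIO("Subject_Number,Cortisol\n1,5.2\n2,N/A\n3,\n")
+
+# na_values lists any extra strings that should be read as missing.
+df = pd.read_csv(csv, na_values=["N/A", "blank"])
+print(df["Cortisol"].isna().sum())  # count of missing values
+```
+
+R behaves analogously: `read.csv(..., na.strings = c("N/A", "blank"))` reads the same encodings in as `NA`.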
+
+
+
+## Create a Metadata Tab
+Metadata explains what each column represents in the dataset. Metadata is now a required component of data sharing, so it is best to initiate this process prior to data analysis. Ideally, this information is filled in by the scientist(s) who generated the data.
+
++ Create a new tab (preferably as the first tab) and label it “XXXXX_METADATA” (e.g., “Allostatic_METADATA”).
++ Then relabel the original data tab as “XXXXX_DATA” (e.g., “Allostatic_DATA”).
++ Within the metadata tab, create three columns: the first, "Column Identifier", contains each of the column names found in the data tab; the second, "Code", contains the individual variable/abbreviation for each column identifier; and the third, "Description", contains additional information and definitions for abbreviations.
+
+```{r 1-4-Excel-3, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image3.png")
+```
+
+
+
+## Abbreviate and Capitalize Categorical Data
+Categorical data are easier to handle in programming languages when they are capitalized and abbreviated. Abbreviating also helps reduce typos and other typing mistakes within your script.
+
+For this dataset, the following variables were edited:
+
++ Group
+ + "control" became "NS" for non-smoker
+ + "smoker" became "CS" for cigarette smoker
++ Sex
+ + "f" became "F" for female
+ + "m" became "M" for male
++ Race
+ + "AA" became "B" for Black
+ + "White" became "W" for White
+
+**Excel Trick:** To change cells that contain the same data simultaneously, navigate to "Edit", click "Find", and then "Replace".
+
+Once the categorical data have been abbreviated, add those abbreviations to the metadata and describe what they symbolize.
+```{r 1-4-Excel-4, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image4.png")
+```
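+These Find/Replace recodings can also be written out as explicit mappings, which doubles as documentation of the abbreviations. The sketch below uses pandas on a two-row mock of the dataset (the values shown are illustrative, not the module's data).
+
+```python
+import pandas as pd
+
+# Mock of the demographic columns before recoding.
+df = pd.DataFrame({"Group": ["control", "smoker"],
+                   "Sex": ["f", "m"],
+                   "Race": ["AA", "White"]})
+
+# One dictionary per column: old value -> abbreviated, capitalized code.
+df = df.replace({"Group": {"control": "NS", "smoker": "CS"},
+                 "Sex": {"f": "F", "m": "M"},
+                 "Race": {"AA": "B", "White": "W"}})
+print(df)
+```
+
+Keeping the mapping in one place makes it easy to copy the same codes into the metadata tab.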
+
+
+
+## Alphabetize (Sort) the Data by the Categorical Variable of Interest
+For this dataset, we will sort by the column "Group". This organizes the data and sets it up for the next step.
+
++ Highlight all the column headers.
++ Click on the "Sort & Filter" button and click "Filter".
++ Click on the arrow in the cell that contains the column name "Group" and click "Ascending".
+
+
+## Create a New Subject Number Column
+Analysis-specific subject numbers are created to give an ordinal number to each subject, which allows the scientist to easily identify the total number of subjects. In addition, these new ordinal subject numbers will be used to create a subject identifier that combines a subject's group and subject number, which is helpful for downstream visualization analyses.
+
++ Relabel the subject number/identifier column as “Original_Subject_Number” and create an ordinal subject number column labeled “Subject_Number”.
+
+R converts spaces in column names to periods when data are read in, so it’s common practice to replace spaces with underscores when doing data analysis in R. Avoid using dashes in column names or anywhere else in the dataset.
+```{r 1-4-Excel-5, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image5.png")
+```
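+The sort-then-number steps above have a compact programmatic equivalent. The sketch below uses pandas with the tutorial's column names on mocked values: sort by the categorical variable, then assign ordinal subject numbers.
+
+```python
+import pandas as pd
+
+# Mocked original identifiers and groups (not the module's data).
+df = pd.DataFrame({"Original_Subject_Number": [1044, 1041, 1043, 1042],
+                   "Group": ["NS", "CS", "NS", "CS"]})
+
+# Sort by Group (like Sort & Filter -> Ascending), then renumber rows.
+df = df.sort_values("Group").reset_index(drop=True)
+df["Subject_Number"] = df.index + 1   # ordinal 1, 2, 3, ...
+print(df)
+```
+
+Because the original identifiers are kept in their own column, the new ordinal numbers never overwrite them.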
+
+
+
+## Remove Special Symbols and Dashes
+Programming languages, in general, do not operate well with special symbols and dashes, particularly when included in column identifiers. For this reason, it is best to remove these while cleaning up your data, prior to importing it into R or your preferred programming software.
+
+In this case, the dataset contains dashes and Greek letters within some of the column header identifiers. Here, it is beneficial to remove the dashes (e.g., change IL-10 to IL10) and replace each Greek letter with the first letter of its English name (e.g., change TNF-$\alpha$ to TNFa).
+
+
+
+## Bold all Column Names and Center all Data
+These data will likely be shared with collaborators, uploaded onto data deposition websites, and used as supporting information in published manuscripts. For these purposes, it is nice to format data in Excel such that it is visually appealing and easy to digest.
+
+For example, here, it is nice to bold column identifiers and center the data, as shown below:
+```{r 1-4-Excel-6, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image6.png")
+```
+
+
+
+## Create a Subject Identifier Column
+The subject identifier column, labeled “Group_Subject_No”, combines the subject number with the variable of interest (i.e., Group for this dataset). This is useful in analyses for identifying outliers by both subject number and group.
+
++ Insert 2 additional columns where the current "Sex" column is.
++ To combine values from two different columns, type `=CONCAT(D2,"_",C2)` into the first data cell of the first inserted column.
++ Double-click the bottom right corner of the cell to copy the formula down to the last row in the dataset.
++ Copy the entire column and paste only the values into the second inserted column by navigating to the drop-down arrow next to "Paste" and clicking "Paste Values".
++ Label the second column "Group_Subject_No" and delete the first column.
+
+```{r 1-4-Excel-7, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image7.png")
+```
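+The CONCAT-and-paste-values routine collapses to a single vectorized operation in code. The sketch below is a pandas aside on mocked values, reusing the tutorial's column names.
+
+```python
+import pandas as pd
+
+# Mock subject numbers and groups (illustrative values only).
+df = pd.DataFrame({"Subject_Number": [1, 2, 3],
+                   "Group": ["CS", "CS", "NS"]})
+
+# Concatenate Group and Subject_Number with an underscore separator.
+df["Group_Subject_No"] = df["Group"] + "_" + df["Subject_Number"].astype(str)
+print(df["Group_Subject_No"].tolist())
+```
+
+Unlike the Excel formula, no paste-values step is needed, since the result is stored directly rather than as a live formula.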
+
+## Separate Subject Demographic Data from Experimental Measurements
+This example dataset is very small, so the demographic data (e.g., sex, race, age) were kept within the same file as the experimentally measured molecules. However, in larger datasets (e.g., genome-wide data, exposomic data, etc.), it is often beneficial to separate the demographic data into their own file, labeled according to the following format: “XXX_Subject_Info_061622” (e.g., “Allostatic_Subject_Info_061622”).
+
+This step was not completed for the current dataset, given its small size and the simplicity of the downstream analyses.
+
+
+
+## Convert Data from Wide to Long Format
+In a wide format, the subject identifier column **DOES NOT** repeat. For this dataset, each subject has one row containing all of its data, so the subject identifier occurs once in the dataset.
+
+**Wide Format**
+```{r 1-4-Excel-8, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image8.png")
+```
+
+In a long format, the subject identifier column **DOES** repeat. For this dataset, that means a new column entitled "Variable" was created containing all the mediator names, along with a column entitled "Value" containing all their corresponding values. In the screenshot, an additional column, "Category", was added to help with the categorization of mediators in R analyses.
+
+**Long Format**
+```{r 1-4-Excel-9, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image9.png")
+```
+
+A long format is preferred because it makes visualizations and statistical analyses more efficient in R. In the long format, we were able to add a column entitled "Category" to classify the mediators as "AL Biomarker" or "Cytokine", allowing us to more easily subset the mediators in R. Read more about wide and long formats [here](https://towardsdatascience.com/long-and-wide-formats-in-data-explained-e48d7c9a06cb).
+
+To convert the data from a wide to long format, follow the steps below:
+
+## Pivoting Data from a Wide to Long Format
+To do this, we'll use Power Query in Excel. Note: If you are working on a Mac, you will need at least Excel 2016 to follow this tutorial, as Power Query is not available for earlier versions. Add-ins are available for Windows users. See [this link](https://blog.enterprisedna.co/how-to-add-power-query-to-excel/) for more details.
+
+1. Start by copying all of the data, including the column titles. (Hint: Try using the keyboard shortcut mentioned above.)
+2. Click the tab at the top that says "Data". Then click "Get Data (Power Query)" at the far left.
+3. It will ask you to choose a data source. Click "Blank table" in the bottom row.
+4. Paste the data into the table. (Hint: Use the shortcut Ctrl + "v"). At this point, your screen should look like the screenshot below.
+```{r 1-4-Excel-10, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image10.png")
+```
+
+5. Click "Use first row as headers" and then click "Next" in the bottom right hand corner.
+6. Select all the columns with biomarker names. That should be the column "Cortisol" through the end.
+```{r 1-4-Excel-11, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image11.png")
+```
+
+7. Click the "Transform" button in the upper left hand corner. Then click "Unpivot columns" in the middle of the pane. The final result should look like the screenshot below, with all the biomarkers now in one column entitled "Attribute" and their corresponding values in another column entitled "Value".
+```{r 1-4-Excel-12, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image12.png")
+```
+
+8. To save this, go back to the "Home" tab and click "Close & load". You should see something similar to the screenshot below.
+```{r 1-4-Excel-13, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image13.png")
+```
+
+9. In the upper right, within the "Table" tab, click the arrow to the left of the shaded table styles until you reach a style with no shading, and click that unshaded style.
+10. Click "Convert to Range" within the "Table" tab. This removes the Power Query capabilities, so that the data become a regular Excel sheet.
+11. Now the "Category" column can be created to identify the types of biomarkers in the dataset. The allostatic load (AL) biomarkers denoted in the "Category" column include the variables Cortisol, CRP, Fibrinogen, Hba1c, HDL, and Noradrenaline. The rest of the variables were labeled as cytokines. Additionally, we can make this data more closely resemble the final long format screenshot by bolding the headers, centering all the data, etc.
+
+We have successfully wrangled our data and the final dataset now looks like this:
+```{r 1-4-Excel-14, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image14.png")
+```
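+For comparison, the entire Power Query "Unpivot columns" sequence is a single melt operation in code. The sketch below is a pandas aside on a two-subject mock with the tutorial's column names: identifier columns stay fixed while the biomarker columns stack into name/value pairs.
+
+```python
+import pandas as pd
+
+# Mock wide-format table: one row per subject, one column per biomarker.
+wide = pd.DataFrame({"Group_Subject_No": ["NS_1", "CS_2"],
+                     "Cortisol": [10.1, 12.3],
+                     "CRP": [0.5, 1.7]})
+
+# Unpivot: every column not listed in id_vars becomes a Variable/Value row.
+long = wide.melt(id_vars=["Group_Subject_No"],
+                 var_name="Variable",   # Power Query calls this "Attribute"
+                 value_name="Value")
+print(long)
+```
+
+In R, `tidyr::pivot_longer()` performs the same reshape.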
+
+
+
+## Generating Summary-Level Statistics with Pivot Tables
+A PivotTable is a tool in Excel used to summarize numerical data. It’s called a pivot table because it pivots, or changes, how the data are displayed in order to make statistical inferences. This can be useful for generating initial summary-level statistics to gauge the distribution of the data.
+
+To create a PivotTable, start by selecting all of the data. (Hint: Try using the keyboard shortcut mentioned above.) Click the "Insert" tab on the upper left-hand side, click "PivotTable", and click "OK". The new PivotTable should be available in a new sheet, as seen in the screenshot below.
+```{r 1-4-Excel-15, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image15.png")
+```
+
+A PivotTable will be constructed based on the column headers that can be dragged into the PivotTable fields located on the right-hand side. For example, what if we were interested in determining if there were differences in average expression between non-smokers and cigarette smokers in each category of biomarkers? As seen below, drag the "Group" variable under the "Rows" field and drag the "Value" variable under the "Values" field.
+```{r 1-4-Excel-16, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image16.png")
+```
+
+Notice that it automatically calculates the sum of the expression values for each group. To change the function to average, click the "i" icon and select "Average". The output should mirror what's below with non-smokers having an average expression that's more than double that of cigarette smokers.
+```{r 1-4-Excel-17, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_1/1_4_Excel/Module1_4_Image17.png")
+```
+
+
+
+## Excel vs. R: Which Should You Use?
+For the most part, it's better to perform final analyses in R (or another programming language) rather than Excel for the following reasons...
+
++ R clearly shows the code (instructions), which makes editing, interpretability, and sharing easier. This makes analyses more reproducible and can save time.
++ R has packages that make more complex analyses possible (e.g., machine learning and heatmaps) that aren't available in Excel.
++ R can handle larger data sets.
++ R can compute and process data faster.
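As a point of comparison, the group-average summary produced by the PivotTable above takes only a few lines in R. This is a minimal sketch using hypothetical `Group` and `Value` columns that mirror the module's long-format layout (not the actual dataset):

```r
library(dplyr)

# Hypothetical long-format data standing in for the module's dataset
expr_data <- data.frame(
  Group = c("Non-Smoker", "Non-Smoker", "Cigarette Smoker", "Cigarette Smoker"),
  Value = c(10.2, 12.4, 4.1, 5.3)
)

# Equivalent of the PivotTable's "Average of Value" summarized by Group
expr_data %>%
  group_by(Group) %>%
  summarise(Average = mean(Value))
```

The same code runs unchanged as rows are added, which is part of R's reproducibility and scalability advantage described above.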
+
+However, Excel is still a software that has many benefits for running analyses including...
+
++ Excel is user-friendly and most people have experience in navigating the software at a basic level.
++ Excel can be faster for rudimentary statistical analyses and visualizations.
+
+Depending on the scientist's skill level and the complexity of the analysis, either Excel or R can be the better fit.
+
+
+
+
+## Concluding Remarks
+In summary, this training module highlights the importance of data wrangling and how to do so in Microsoft Excel for downstream analyses. Concepts discussed include helpful Excel features like power queries and pivot tables and when to use Microsoft Excel vs. R.
+
+### Additional Resources
+Data wrangling in Excel can be expedited with knowledge of useful features and functions to format data. Check out the resources below for additional information on Excel tricks.
+
++ [Data Analysis in Excel](https://careerfoundry.com/en/blog/data-analytics/data-analysis-in-excel/)
++ [Excel Spreadsheet Hacks](https://www.lifehack.org/articles/technology/20-excel-spreadsheet-secrets-youll-never-know-you-dont-read-this.html)
++ [Excel for Beginners](https://www.udemy.com/course/useful-excel-for-beginners/)
+
+
+
+
+
+:::tyk
+1. Try wrangling the "Module1_4_TYKInput.xlsx" file to mimic the cleaned versions of the data found in "Module1_4_TYKSolution.xlsx". This dataset includes sterol and cytokine concentration levels extracted from induced sputum samples collected after ozone exposure. After wrangling, you should end up with a sheet for subject information and a sheet for experimental data.
+2. Using a PivotTable on the cleaned dataset, find the standard deviation of each cytokine variable stratified by disease status.
+:::
+
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image1.png b/Chapter_1/1_4_Excel/Module1_4_Image1.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image1.png
rename to Chapter_1/1_4_Excel/Module1_4_Image1.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image10.png b/Chapter_1/1_4_Excel/Module1_4_Image10.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image10.png
rename to Chapter_1/1_4_Excel/Module1_4_Image10.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image11.png b/Chapter_1/1_4_Excel/Module1_4_Image11.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image11.png
rename to Chapter_1/1_4_Excel/Module1_4_Image11.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image12.png b/Chapter_1/1_4_Excel/Module1_4_Image12.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image12.png
rename to Chapter_1/1_4_Excel/Module1_4_Image12.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image13.png b/Chapter_1/1_4_Excel/Module1_4_Image13.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image13.png
rename to Chapter_1/1_4_Excel/Module1_4_Image13.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image14.png b/Chapter_1/1_4_Excel/Module1_4_Image14.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image14.png
rename to Chapter_1/1_4_Excel/Module1_4_Image14.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image15.png b/Chapter_1/1_4_Excel/Module1_4_Image15.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image15.png
rename to Chapter_1/1_4_Excel/Module1_4_Image15.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image16.png b/Chapter_1/1_4_Excel/Module1_4_Image16.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image16.png
rename to Chapter_1/1_4_Excel/Module1_4_Image16.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image17.png b/Chapter_1/1_4_Excel/Module1_4_Image17.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image17.png
rename to Chapter_1/1_4_Excel/Module1_4_Image17.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image2.png b/Chapter_1/1_4_Excel/Module1_4_Image2.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image2.png
rename to Chapter_1/1_4_Excel/Module1_4_Image2.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image3.png b/Chapter_1/1_4_Excel/Module1_4_Image3.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image3.png
rename to Chapter_1/1_4_Excel/Module1_4_Image3.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image4.png b/Chapter_1/1_4_Excel/Module1_4_Image4.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image4.png
rename to Chapter_1/1_4_Excel/Module1_4_Image4.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image5.png b/Chapter_1/1_4_Excel/Module1_4_Image5.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image5.png
rename to Chapter_1/1_4_Excel/Module1_4_Image5.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image6.png b/Chapter_1/1_4_Excel/Module1_4_Image6.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image6.png
rename to Chapter_1/1_4_Excel/Module1_4_Image6.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image7.png b/Chapter_1/1_4_Excel/Module1_4_Image7.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image7.png
rename to Chapter_1/1_4_Excel/Module1_4_Image7.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image8.png b/Chapter_1/1_4_Excel/Module1_4_Image8.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image8.png
rename to Chapter_1/1_4_Excel/Module1_4_Image8.png
diff --git a/Chapter_1/Module1_4_Input/Module1_4_Image9.png b/Chapter_1/1_4_Excel/Module1_4_Image9.png
similarity index 100%
rename from Chapter_1/Module1_4_Input/Module1_4_Image9.png
rename to Chapter_1/1_4_Excel/Module1_4_Image9.png
diff --git a/Chapter_2/02-Chapter2.Rmd b/Chapter_2/02-Chapter2.Rmd
deleted file mode 100644
index fdf734c..0000000
--- a/Chapter_2/02-Chapter2.Rmd
+++ /dev/null
@@ -1,1727 +0,0 @@
-# (PART\*) Chapter 2 Coding in R {-}
-
-
-# 2.1 Downloading and Programming in R
-
-This training module was developed by Kyle Roell, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-In this training module, we will provide a brief introduction of:
-
-+ R
-+ R Studio
-+ Packages in R
-+ Scripting basics
-+ Code troubleshooting
-
-## General Introduction and Installation of R and RStudio
-
-### What is R?
-
-**R** is a programming language. Computer script (lines of code) can be used to increase data analysis reproducibility, transparency, and methods sharing, and is becoming increasingly incorporated into exposure science, toxicology, and environmental health research. One of the most commonly used coding languages in the field of environmental health science is the **R language**. Some advantages of using R include the following:
-
-+ Free, open-source programming language that is licensed under the Free Software Foundation’s GNU General Public License
-+ Can be run across all major platforms and operating systems, including Unix, Windows, and MacOS
-+ Publicly available packages help you carry out analyses efficiently (without you having to code for everything yourself)
-+ Large, diverse collection of packages
-+ Comprehensive documentation
-+ When code is efficiently tracked during development/execution, it promotes reproducible analyses
-
-Because of these advantages, R has emerged as an avenue for worldwide collaboration in data science. Other commonly implemented scripting languages in the field of environmental health research include Python and SAS, among others. These training tutorials focus on R as an important introductory-level language that also houses many relevant packages and example datasets, as further described throughout TAME.
-
-### Downloading and Installing R
-
-To download R, first navigate to [https://cran.rstudio.com/](https://cran.rstudio.com/) and download the installer for your operating system. Install this file according to your computer's typical program installation steps.
-
-### What is RStudio?
-
-**RStudio** is an Integrated Development Environment (IDE) for R, which makes R more 'user friendly' when developing and using R script. It is a desktop application that can be downloaded for free online.
-
-### Downloading and Installing RStudio
-
-To download RStudio:
-
-+ Navigate to: [https://posit.co/download/rstudio-desktop/](https://posit.co/download/rstudio-desktop/)
-+ Scroll down and select "Download RStudio"
-+ Install according to your computer's typical program installation steps
-
-### RStudio Orientation
-
-Here is a screenshot demonstrating what the RStudio desktop app looks like:
-```{r 02-Chapter2-1, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image1.png")
-```
-
-The default RStudio layout has four main panes (numbered above in the blue boxes):
-
-1. **Source Editor:** allows you to open and edit script files and view data.
-2. **Console:** where you can type code that will execute immediately when you press enter/return. This is also where code from script files will appear when you run the code.
-3. **Environment:** shows you the objects in your environment.
-4. **Viewer:** has a number of useful tabs, including:
- 1. **Files:** a file manager that allows you to navigate similar to Finder or File Explorer
- 2. **Plots:** where plots you generate by executing code will appear
- 3. **Packages:** shows you packages that are loaded (checked) and those that can be loaded (unchecked)
- 4. **Help:** where help pages will appear for packages and functions (see below for further instructions on the help option)
-
-Under "Tools" → "Global Options," RStudio panes can be customized to appear in different configurations or with different color themes. A number of other options can also be changed. For example, you can choose to have color names highlighted in the color they represent, or rainbow-colored parentheses that help you visualize nested code.
-```{r 02-Chapter2-2, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image2.png")
-```
-
-## Introduction to R Packages
-
-One of the major benefits to coding in the R language is access to the continually expanding resource of thousands of user-developed **packages**. Packages represent compilations of code and functions tailored to a specialized focus or purpose. These are
-often written by R users and submitted to [CRAN](https://cran.r-project.org/web/packages/) or another host, such as [Bioconductor](https://www.bioconductor.org/) or [GitHub](https://github.com/).
-
-Packages aid in data analysis and methods sharing. Packages have varying utilities, spanning basic organization and manipulation of data, data visualization, and more advanced approaches to parse and analyze data, with examples included throughout the subsequent training modules.
-
-Examples of some common packages that we'll be using throughout these training modules include the following:
-
-+ ***tidyverse***: A collection of open source R packages that share an underlying design philosophy, grammar, and data structures of tidy data. For more information on the *tidyverse* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/tidyverse/index.html), primary [webpage](https://www.tidyverse.org/packages/), and [peer-reviewed article released in 2018](https://onlinelibrary.wiley.com/doi/10.1002/sdr.1600).
-
-+ ***ggplot2***: A system for creating graphics. Users provide the data and tell R what type of graph to use, how to map variables to aesthetics (elements of the graph), and additional stylistic elements to include in the graph. For more information on the *ggplot2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/ggplot2/index.html) and [R Documentation](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5).
-
-More information on these packages, as well as many others, is included throughout TAME training modules.
-
-### Downloading/Installing R Packages
-
-R packages often do not need to be downloaded from a website. Instead, you can install packages and load them through running script in R. Note that you only need to install packages one time, but packages must be loaded each time you start a new R session.
-
-```{r 02-Chapter2-3, eval=FALSE, echo=TRUE}
-# Install the package
-install.packages("tidyverse")
-
-# Load the package for use
-library(tidyverse)
-```
-
-Many packages also exist as part of the baseline configuration of an R working environment, and do not require manual loading each time you launch R. These include the following packages:
-
-+ datasets
-+ graphics
-+ methods
-+ stats
-+ utils
-
-You can learn more about a function by typing one question mark before the name of the function, which will bring up documentation in the Help tab of the Viewer window. Importantly, this documentation includes a description of the different arguments that can be passed to the function and examples for how to use the function.
-
-```{r 02-Chapter2-4, eval=FALSE}
-?install.packages
-```
-
-```{r 02-Chapter2-5, echo=FALSE, fig.align = "center", out.width = "400px" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image3.png")
-```
-
-You can learn more about a package by typing two question marks before the name of the package. This will bring up vignettes and help pages associated with that package.
-
-```{r 02-Chapter2-6, eval=FALSE}
-??tidyverse
-```
-
-```{r 02-Chapter2-7, echo=FALSE, fig.align = "center", out.width = "400px" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image4.png")
-```
-
-
-
-## Scripting Basics
-
-### Data Types
-
-Before writing any script, let's first review different data types in R. Data types are what they imply – the type of data you are handling. It is important to understand data types because functions often require a specific data type as input.
-
-R has 5 basic data types:
-
-+ Logical (e.g., TRUE or FALSE)
-+ Integer (e.g., 1, 2, 3)
-+ Numeric (real or decimal)
-+ Character (e.g., "apple")
-+ Complex (e.g., 1 + 0i)
-
-Numeric variables are often stored as "double" values (sometimes shown as <dbl>), meaning double-precision floating point numbers. Character variables can also be stored as factors, which are data structures implemented to store categorical data with a defined set of categories (known as levels).
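These types can be checked directly with `class()` and `typeof()`:

```r
class(TRUE)      # "logical"
class(3L)        # "integer" (the L suffix marks an integer literal)
class(3.14)      # "numeric"
typeof(3.14)     # "double" - numeric values are stored as doubles
class("apple")   # "character"

# A factor stores categorical data with a defined set of levels
dose <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(dose)     # "low" "high"
```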
-
-Data are stored in data structures. There are many different data structures in R. Some packages even implement unique data structures. The most common data structures are:
-
-+ **Vectors:** also known as an atomic vector, can contain characters, logical values, integers, or numeric values (but all elements must be the same data type).
-+ **Matrices:** a vector with multiple dimensions. Elements must still be all the same data type.
-+ **Data frames:** similar to a matrix but can contain different data types and additional attributes such as row names (and is one of the most common data structures in environmental health research). Tibbles are a stricter type of data frame implemented in the *tidyverse* package.
-+ **Lists:** a special type of vector that acts as a container – other data structures can be stored within the list, and lists can contain other lists. Lists can contain elements that are different data structures.
-
-```{r 02-Chapter2-8, echo=FALSE, fig.align = "center"}
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image5.png")
-```
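A minimal sketch constructing each of these structures:

```r
v <- c(7, 3, 8)                          # vector: all elements one data type
m <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: one data type, two dimensions
df <- data.frame(values = v,
                 color = c("Blue", "Red", "Yellow"))  # mixed column types
l <- list(my_vector = v, my_matrix = m, my_df = df)   # list: holds anything

class(v)   # "numeric"
class(m)   # "matrix" "array"
class(df)  # "data.frame"
class(l)   # "list"
```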
-
-### Writing Script
-
-R code is written line by line. It may take just one line or many lines of code for one step to be executed, depending on the number of arguments to the function you are using. R code is executed (run) by selecting the line(s) of code to run and pressing return/enter (or a keyboard shortcut), or by clicking "Run" in the upper right corner of the script.
-
-A very simple example of running code is as follows:
-```{r 02-Chapter2-9}
-3 + 4
-```
-
-We can see that when we ran our code, the answer was returned. But what if we want to store that answer? We can assign that number to a variable named `x` using the assignment operator `<-`:
-```{r 02-Chapter2-10}
-x <- 3 + 4
-```
-
-Then, if we run a line of code with our variable, we will get that value:
-```{r 02-Chapter2-11}
-x
-```
-
-The assignment operator can also be used to assign values to any of the data structures discussed above, such as vectors and data frames, as shown here:
-```{r 02-Chapter2-12}
-# Creating a vector of values called my_values
-my_values <- c(7, 3, 8, 9)
-
-# Viewing the vector
-my_values
-
-# Creating a data frame of values corresponding to colors
-my_df <- data.frame(values = my_values, color = c("Blue", "Red", "Yellow", "Purple"))
-
-# Viewing the data frame
-my_df
-```
-
-### Comments
-
-You may have noticed in the code chunks above that there were `#` followed by phrases describing the code. R allows for scripts to contain non-code elements, called comments, that will not be run or interpreted. Comments are useful to help make code more interpretable for others or to add reminders of what and why parts of code may have been written.
-
-To make a comment, simply use a `#` followed by the comment. A `#` only comments out a single line of code. In other words, only that line will be commented and therefore not be run, but lines directly above/below it will still be run:
-```{r 02-Chapter2-13}
-# This is an R comment!
-```
-
-For more on comments, see **TAME 2.0 Module 2.2 Coding Best Practices**.
-
-### Autofilling
-
-RStudio will autofill function names and object names as you type, which can save a lot of time. When you are typing a variable or function name, you can press tab while typing. RStudio will look for variables or functions that match the first few letters you've typed. If multiple matches are found, RStudio will provide you with a drop down list to select from, which may be useful when searching through newly installed packages or trying to quickly type variable names in an R script.
-
-For example, let's say we instead named our example data frame something much longer, and we had two data frames with similar names. If we start typing in `my_` and pause our typing, all of the objects that start with that name will appear as options in a list. To select which one to autofill, navigate down the list and click return/enter.
-
-```{r 02-Chapter2-14}
-my_df_with_really_long_name <- data.frame(values = my_values, color = c("Blue", "Red", "Yellow", "Purple"))
-
-my_df_with_really_long_name_2 <- data.frame(values = my_values, color = c("Green", "Teal", "Magenta", "Orange"))
-```
-
-```{r 02-Chapter2-15, echo=FALSE, fig.align = "center"}
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image6.png")
-```
-
-
-### Finding and Setting Your Working Directory
-Another step that is commonly done at the very beginning of your script is setting your working directory. This tells your computer where to look for files that you want to import and where to deposit output files produced during your scripted activities.
-
-To view your current working directory, run the following:
-
-```{r wd, eval=FALSE}
-getwd()
-```
-
-To set or change the location of your working directory, run the following:
-
-```{r 02-Chapter2-16, eval=FALSE, echo=TRUE}
-setwd("/file path to where your input files are")
-```
-
-Note that macOS file paths use `/` to separate folders, whereas Windows file paths use `\`. Within R, however, use `/` on all platforms (or escape the backslash as `\\`), because a single `\` is R's escape character.
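Related to the note above, the `file.path()` function builds paths without hard-coding a separator. A small sketch (the folder names are hypothetical):

```r
# Build a path from folder names; R inserts the separator for you
input_path <- file.path("Project", "Data", "input.csv")
input_path   # "Project/Data/input.csv"
```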
-
-You can easily find the file path to your desired working directory by navigating to "Session", then "Set Working Directory", and "Choose Directory":
-
-```{r 02-Chapter2-17, echo=FALSE, out.width = "500px", fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image7.png")
-```
-
-In the popup box, navigate to the folder you want to set as your working directory and click "Open." Look in the R console, which will now contain a line of code with `setwd()` containing your file path. You can copy this line of code to the top of your script for future use. Alternatively, you can navigate to the folder you want in Finder or File Explorer and right click to see the file path.
-
-Within your working directory, you can make sub-folders to keep your analyses organized. Here is an example folder hierarchy:
-
-```{r 02-Chapter2-18, echo=FALSE, out.width = "300px", fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image8.png")
-```
-
-How you set up your folder hierarchy is highly dependent on your specific analysis and coding style. However, we recommend that you:
-
-+ Name your script something concise, but descriptive (no acronyms)
-+ Consider using dates when appropriate
-+ Separate your analysis into logical sections so that script doesn’t get too long or hard to follow
-+ Revisit and adapt your organization as the project evolves!
-+ Archive old code so you can revisit it
-
-#### A Quick Note About Projects
-
-Creating projects allows you to store your progress (open script, global environment) for one project in an R Project File. This facilitates quick transitions between multiple projects. Find detailed information about how to set up projects [here](https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects).
-
-```{r 02-Chapter2-19, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image9.png")
-```
-
-### Importing Files
-
-After setting the working directory, you can import and export files using various functions based on the type of file being imported or exported. Often, it is easiest to import data into R that are in a comma separated value / comma delimited file (.csv) or tab / text delimited file (.txt).
-
-Other datatypes such as SAS data files or large .csv files may require different functions to be more efficiently read in, and some of these file formats will be discussed in future modules. Files can also be imported and exported from Excel using the [*openxlsx*](https://ycphs.github.io/openxlsx/) package.
-
-Below, we will demonstrate how to read in .csv and .txt files:
-
-```{r 02-Chapter2-20}
-# Read in the .csv data that's located in our working directory
-csv.dataset <- read.csv("Chapter_2/Module2_1_Input/Module2_1_InputData1.csv")
-
-# Read in the .txt data
-txt.dataset <- read.table("Chapter_2/Module2_1_Input/Module2_1_InputData1.txt")
-```
-
-These datasets now appear as saved dataframes ("csv.dataset" and "txt.dataset") in our working environment.
-
-### Viewing Data
-
-After data have been loaded into R, or created within R, you will likely want to view what these datasets look like.
-Datasets can be viewed in their entirety, or datasets can be subsetted to quickly look at part of the data.
-
-Here's some example script to view just the beginnings of a dataframe using the `head()` function:
-```{r 02-Chapter2-21}
-head(csv.dataset)
-```
-
-Here, you can see that this automatically brings up a view of the first six rows of the dataframe, which is the default for `head()`.
-
-We can also view the first five rows of a dataframe through indexing:
-```{r 02-Chapter2-22}
-csv.dataset[1:5,]
-```
-
-This brings us to an important concept - indexing! Brackets are used in R to index. Within the brackets, the first argument represents the row numbers, and the second argument represents the column numbers. A colon between two numbers selects all of the values from the left number through the right number. The above line of code told R to select rows 1 to 5 and, by leaving the column argument blank, all of the columns.
-
-Expanding on this, to view the first 5 rows and 2 columns, we can run the following:
-```{r 02-Chapter2-23}
-csv.dataset[1:5, 1:2]
-```
-
-For another example: What if we want to only view the first and third row, and first and fourth column? We can use a vector within the index to do this:
-```{r 02-Chapter2-24}
-csv.dataset[c(1, 3), c(1, 4)]
-```
-
-To view the entire dataset, use the `View()` function:
-
-```{r 02-Chapter2-25, eval=FALSE, echo=TRUE}
-View(csv.dataset)
-```
-
-Another way to view a dataset is to just click on the name of the data in the environment pane. The view window will pop up in the same way that it did with the `View()` function.
-
-### Determining Data Structures and Data Types
-
-As discussed above, there are a number of different data structures and types that can be used in R. Here, we will demonstrate functions that can be used to identify data structures and types within R objects. The `glimpse()` function, which is part of the *tidyverse* package, is helpful because it allows us to see an overview of our column names and the types of data contained within those columns.
-
-```{r 02-Chapter2-26, message = FALSE}
-# Load tidyverse package
-library(tidyverse)
-
-glimpse(csv.dataset)
-```
-Here, we see that our `Sample` column is a character column, while the rest are integers.
-
-The `class()` function is also helpful for understanding objects in our global environment:
-```{r 02-Chapter2-27}
-# What class (data structure) is our object?
-class(csv.dataset)
-
-# What class (data type) is a specific column in our data?
-class(csv.dataset$Sample)
-```
-
-These functions are particularly helpful when introducing new functions or troubleshooting code because functions often require input data to be a specific structure or data type.
-
-### Exporting Data
-
-Now that we have these datasets saved as dataframes, we can use these as examples to export data files from the R environment back into our local directory.
-
-There are many ways to export data in R. Data can be written out into a .csv file, tab delimited .txt file, or RData file, for example. There are also many functions within packages that write out specific datasets generated by that package.
-
-To write out to a .csv file:
-```{r 02-Chapter2-28, eval=F}
-write.csv(csv.dataset, "Module2_1_SameCSVFileNowOut.csv")
-```
-
-To write out a .txt tab delimited file:
-```{r 02-Chapter2-29, eval=F}
-write.table(txt.dataset, "Module2_1_SameTXTFileNowOut.txt")
-```
-
-R also allows objects to be saved in RData files. These files can be read into R as well, and will load the object into the current workspace. Entire workspaces can also be saved in RData files, such that when you open an RData file, your Global Environment will be just as you saved it. The example code below carries out these tasks; the referenced files are not provided, so the code is included for future reference.
-
-```{r saving, eval = F}
-# Read in saved single R data object
-r.obj <- readRDS("data.rds")
-
-# Write single R object to file
-saveRDS(object, "single_object.rds")
-
-# Read in multiple saved R objects
-load("multiple_data.RData")
-
-# Save multiple R objects
-save(object1, object2, file = "multiple_objects.RData")
-
-# Save entire workspace
-save.image("entire_workspace.RData")
-
-# Load entire workspace
-load("entire_workspace.RData")
-```
-
-## Code Troubleshooting
-
-Learning how to code is an iterative, exploratory process. The secret to coding is to...
-```{r 02-Chapter2-30, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_2/Module2_1_Input/Image10.png")
-```
-
-Make sure to include "R" and the package and/or function name in your search. Don't be afraid to try out different solutions until you find one that works for you, but also know when it is time to ask for help, such as when you have tried the solutions available on forums without success, or when you know a colleague has already spent a significant amount of time developing code for this specific task.
-
-Note that when reading question/answer forums, make sure to look at how recent a post is, as packages are updated frequently, and old answers may or may not work.
-
-Some common reasons that code doesn't work and potential solutions to these problems include:
-
-+ Two packages are loaded that have functions with the same name, and the default function is not the one you are intending to run.
- + Solutions: specify the package that you want the function to be called from each time you use it (e.g., `dplyr::select()`) or re-assign that function at the beginning of your script (e.g., `select <- dplyr::select`)
-
-+ Your data object is the wrong input type (is a data frame and needs to be a matrix, is character but needs to be numeric)
- + Solution: double check the documentation (?functionname) for the input/variable type needed
-
-+ You accidentally wrote over your data frame or variable with another section of code
- + Solution: re-run your code from the beginning, checking that your input is in the correct format
-
-+ There is a bug in the function/package you are trying to use (this is most common after packages are updated or after you update your version of R)
- + Solution: post an issue on GitHub for that package (or StackOverflow if there is not a GitHub) using a reproducible example
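To illustrate the first problem above: both the *MASS* and *dplyr* packages export a `select()` function, so the `::` operator makes the intended one explicit. A minimal sketch:

```r
library(dplyr)

# If MASS were also loaded, a bare select() call could resolve to MASS::select().
# Prefixing with the package name removes the ambiguity:
mtcars_subset <- dplyr::select(mtcars, mpg, cyl)
head(mtcars_subset)

# Alternatively, pin the function once at the top of the script:
select <- dplyr::select
```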
-
-There are a number of forums that can be extremely helpful when troubleshooting your code, such as:
-
-+ [Stack Overflow](https://stackoverflow.com/): one of the most common forums to post questions related to coding and will often be the first few links in a Google search about any code troubleshooting. It is free to make an account, which allows you to post and answer questions.
-+ [Cross Validated](https://stats.stackexchange.com/): a forum focused on statistics, including machine learning, data analysis, data mining, and data visualization, and is best for conceptual questions related to how statistical tests are carried out, when to use specific tests, and how to interpret tests (rather than code execution questions, which are more appropriate to post on Stack Overflow).
-+ [BioConductor Forum](https://support.bioconductor.org/): provides a platform for specific coding and conceptual questions about BioConductor packages.
-+ [GitHub](https://github.com): can also be used to create posts about specific issues/bugs for functions within that package.
-
-
-**Before you post a question, make sure you have thoroughly explored answers to existing similar questions and are able to explain in your question why those haven’t worked for you.** You will also need to provide a **reproducible example** of your error or question, meaning that you provide all information (input data, packages, code) needed such that others can reproduce your exact issues. While demonstrating a reproducible example is beyond the scope of this module, see the below links and packages for help getting started:
-
-+ Detailed step-by-step guides for how to make reproducible examples:
- + [How to Reprex](https://aosmith16.github.io/spring-r-topics/slides/week09_reprex.html#1) by Ariel Muldoon
- + [What's a reproducible example (reprex) and how do I create one?](https://community.rstudio.com/t/faq-whats-a-reproducible-example-reprex-and-how-do-i-create-one/5219)
-+ Helpful packages:
- + [*reprex*](https://reprex.tidyverse.org/): part of tidyverse, useful for preparing reproducible code for posting to forums.
- + [*datapasta*](https://aosmith16.github.io/spring-r-topics/slides/week09_reprex.html#43): useful for creating code you can copy and paste that creates a new data frame as a subset of your original data.
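-
-To make the idea concrete, a reproducible example typically pairs built-in (or simulated) data with only the code needed to trigger the issue, so that anyone can paste and run it. A minimal sketch using R's built-in `mtcars` dataset:
-
-```{r, eval = FALSE}
-# Minimal reproducible example: built-in data, explicit package loading,
-# and only the code needed to reproduce the behavior in question
-library(ggplot2)
-
-# mtcars ships with R, so no external input files are needed
-ggplot(mtcars, aes(x = wt, y = mpg)) +
-  geom_point()
-```
-
-Including the output of `sessionInfo()` alongside your example also helps others match your R and package versions.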
-
-## Concluding Remarks
-
-Together, this training module provides introductory level information on installing and loading packages in R, scripting basics, importing and exporting data, and code troubleshooting.
-
-### Additional Resources
-
-+ [Coursera R Programming](https://www.coursera.org/learn/r-programming) and [other Coursera R courses](https://www.coursera.org/courses?query=r)
-+ [Stack Overflow How to Learn R](https://stackoverflow.com/questions/1744861/how-to-learn-r-as-a-programming-language)
-+ [R for Data Science](https://r4ds.had.co.nz/)
-
-
-
-:::tyk
-1. Install R and RStudio on your computer.
-2. Launch RStudio and explore installing packages (e.g., *tidyverse*) and understanding data types using the [built-in datasets](https://machinelearningmastery.com/built-in-datasets-in-r/) in R.
-3. Make a vector of the letters A-E.
-4. Make a data frame of the letters A-E in one column and their corresponding number in the alphabet order in the second column (e.g., A corresponds with 1).
-:::
-
-# 2.2 Coding "Best" Practices
-
-This training module was developed by Kyle Roell, Alexis Payton, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-In this training module, we will be going over coding "best" practices. We put "best" in quotes because these practices are what we currently consider best or better, though everyone has different coding styles, annotation styles, etc., and these conventions also change over time. Here, we hope to give you a sense of what we do when coding, why we do it, and why we think it is important. We will also point out other guides to style, annotations, and best practices that we suggest implementing into your own coding.
-
-Some of the questions we hope to answer in this module are:
-
-+ What type of scripting file should I use?
-+ What should I name my script?
-+ What should I put at the top of every script and why is it important?
-+ How should I annotate my code?
-+ Why are annotations important?
-+ How do I implement these coding practices into my own code?
-+ Where can I find other resources to help with coding best practices?
-
-In the following sections, we will be addressing these questions. Keep in mind that the advice and suggestions in this section are just that: advice and suggestions. So please take them into consideration and integrate them into your own coding style as appropriate.
-
-## Scripting File Types
-
-Two of the most common scripting file types applicable to the R language are .R (normal R files) and .Rmd (R Markdown). Normal R files appear as plain text and can be used for running any normal R code. R Markdown files are used for more intensive documentation of code and allow for a combination of code, non-code text explaining the code, and viewing of code output, tables, and figures that are rendered together into an output file (typically .html, although other formats such as .pdf are also offered). For example, TAME is coded using R Markdown, which allows us to include blocks of non-code text, hyperlinks, annotated code, schematics, and output figures all in one place. We highly encourage the use of R Markdown as the default scripting file type for R-based projects because it produces a polished final document that is easy for others to follow, whereas .R files are more appropriate for short, one-off analyses and writing in-depth functions and packages. However, code executed in normal .R files and R Markdown will produce the same results, and ultimately, which file type to use is personal preference.
-
-See below for screenshots that demonstrate some of the stylistic differences between .R, .Rmd, and .Rmd knitted to HTML format:
-```{r 02-Chapter2-31, out.width = "1000px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_2/Module2_2_Input/Image1.png")
-```
-
-If you are interested in learning more about the basic features of R Markdown and how to use them, see the following resources:
-
-+ [RStudio introduction to R Markdown](https://rmarkdown.rstudio.com/lesson-1.html)
-+ [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
-+ [Bookdown R Markdown guide](https://bookdown.org/yihui/rmarkdown/html-document.html)
-+ [Including external images in R Markdown with knitr](https://www.r-bloggers.com/2021/02/external-graphics-with-knitr/)
-+ [Interactive plots with plotly](https://cengel.github.io/R-data-viz/interactive-graphs.html)
-+ [Interactive data tables with DT](https://rstudio.github.io/DT/)
-
-### Naming the Script File
-
-The first thing we need to talk about, which is sometimes overlooked in the discussion of coding practices, is script file naming conventions and high level descriptive headers within a script. It is important to remember to name your code something concise, but descriptive. You want to be able to easily recognize what the script is for and does without a cumbersome, lengthy title. Some tips for naming conventions:
-
-+ Be concise, but descriptive
-+ Use dates when appropriate
-+ Avoid special characters
-+ Use full words if possible, avoiding non-standard acronyms
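-
-Applying these tips, script file names for a hypothetical metals analysis (the names below are purely illustrative) might look like:
-
-```
-2023_05_01_metals_data_cleaning.Rmd
-2023_05_01_metals_regression_analysis.Rmd
-```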
-
-Keep in mind that each script should have a clear purpose within a given project. And, it is sometimes necessary, and often common, to have multiple scripts within one project that all pertain to different parts of the analysis. For example, it may be appropriate to have one script for data cleaning and pre-processing and another script for analyzing data. When scripting an analysis with multiple sub-analyses, some prefer to keep code for each sub-analysis separate (e.g., one file for an ANOVA and one file for a k-means analysis on the same data input), while others prefer to have longer code files with more subsections. Whichever method you choose, we recommend maintaining clear documentation that indicates locations for input and output files for each sub-analysis (e.g., whether global environment objects or output files from a previous script are needed to run the current script).
-
-## Script Headers and Annotation
-
-### Script Header
-
-Once your script is created and named, it is generally recommended to include a header at the top of the script. The script header can be used for describing:
-
-+ Title of Script - This can be a longer or more readable name than script file name.
-+ Author(s) - Who wrote the script?
-+ Date - When was the script developed?
-+ Description - Provides a more detailed description of the purpose of the script and any notes or special considerations for this particular script.
-
-In R, it is common to include multiple `#`, the comment operator, or a `#` followed by another special character, to start and end a block of coding annotation or the script header. An example of this in an .R file is shown below:
-
-```{r 02-Chapter2-32}
-########################################################################
-########################################################################
-### Script Longer Title
-###
-### Description of what this script does!
-### Also can include special notes or anything else here.
-###
-### Created by: Kyle Roell and Julia Rager
-### Last updated: 01 May 2023
-########################################################################
-########################################################################
-```
-
-This block of comment operators is common in .R but not .Rmd files because .Rmd files have their own specific type of header, known as the [YAML](https://zsmith27.github.io/rmarkdown_crash-course/lesson-4-yaml-headers.html), which contains the title, author, date, and formatting outputs for the .Rmd file:
-
-```{r 02-Chapter2-33, out.width = "300px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_2/Module2_2_Input/Image2.png")
-```
-
-We will now review how annotations within the script itself can make a huge difference in understanding the code within.
-
-### Annotations
-
-Before we review coding style considerations, it is important to address code annotating. So, what are annotations and why are they important?
-
-Annotations are notes embedded within your code as comments that will not be run. The beauty of annotating your code is that not only others, but also future you, will be able to read through and better understand what a particular piece of code does. We suggest annotating your code while you write it and incorporating plenty of description. While not every single line needs an annotation, or a very detailed one, it is helpful to provide comments and annotations as much as you can while maintaining feasibility.
-
-#### General annotation style
-
-In general, annotations will be short sentences that describe what your code does or why you are executing that specific code. This can be helpful when you are defining a covariate a specific way, performing a specific analytical technique, or just generally explaining why you are doing what you're doing.
-
-```{r 02-Chapter2-34, eval=F}
-
-# Performing logistic regression to assess association between xyz and abc
-# Regression confounders: V1, V2, V3 ...
-
-xyz.regression.output = glm(xyz ~ abc + V1 + V2 + V3, family=binomial(), data=example.data)
-
-```
-
-#### Mid-script headings
-
-Another common approach to annotations is to use mid-script headings to separate the script into various sections. For example, you might want to create distinct sections for "Loading Packages, Data, and Setup", "Covariate Definition", "Correlation Analysis", "Regression Analysis", etc. This can help you, and others reading your script, navigate it more easily. It can also be more visually pleasing to see the script split into multiple sections as opposed to one giant chunk of code interspersed with comments. Similar to above, the following example is specific to .R files. For .Rmd files, subheaders can be created by increasing the number of `#` before the header.
-
-```{r 02-Chapter2-35, eval=F}
-
-###########################################################################
-###########################################################################
-###
-### Regression Analyses
-###
-### You can even add some descriptions or notes here about this section!
-###
-###########################################################################
-
-
-# Performing logistic regression to assess association between xyz and abc
-# Regression confounders: V1, V2, V3 ...
-
-xyz.regression.output = glm(xyz ~ abc + V1 + V2 + V3, family=binomial(), data=example.data)
-
-```
-General tips for annotations:
-
-+ Make comments that are useful and meaningful
-+ You don't need to comment every single line
-+ In general, you probably won't over-comment your script, so more is generally better
-+ That being said, don't write super long paragraphs every few lines
-+ Split up your script into various sections using mid-script headings when appropriate
-
-
-#### Quick, short comments and annotations
-
-While it is important to provide descriptive annotations, not every one needs to be a sentence or longer. As stated previously, it is not necessary to comment every single line. Here is an example of very brief commenting:
-```{r 02-Chapter2-36, eval=F }
-
-# Loading necessary packages
-
-library(ggplot2) # Plotting package
-
-```
-
-In the example above, we can see that these short comments clearly convey what the script does -- load the necessary package and indicate what the package is needed for. Short, one line annotations can also be placed after lines to clarify that specific line or within the larger mid-script headings to split up these larger sections of code.
-
-
-## Coding Style
-
-Coding style is often a contentious topic! There are MANY styles of coding, and no two coders have the same exact style, even if they are following the same reference. Here, we will provide some guides to coding style and go over some of the basic, general tips for making your code readable and efficient. Here is an example showing how you can use spacing to align variable assignment:
-
-```{r 02-Chapter2-37, eval=F}
-
-# Example of using spacing for alignment of variable assignment
-
-Longer_variable_name_x = 1
-Short_name_y = 2
-```
-
-Note that most style guides suggest using `<-` as the assignment operator. However, in most situations, `<-` and `=` will do the same thing.
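-
-One place the two operators do differ is inside a function call, where `=` names an argument rather than creating an object. A quick sketch:
-
-```{r, eval = FALSE}
-# At the top level, `<-` and `=` both assign:
-x <- 5
-y = 5
-
-# Inside a function call, `=` matches an argument name instead:
-mean(x = c(1, 2, 3))      # `x` here is mean()'s argument, not a new object
-mean(vals <- c(1, 2, 3))  # computes the mean AND creates `vals` in the workspace
-```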
-
-For spacing around certain symbols and operators:
-
-+ Include a space after `if`, before parenthesis
-+ Include a space on either side of symbols such as `<`
-+ The first (opening) curly brace should not be on its own line, but the second (closing) should
-
-```{r 02-Chapter2-38, eval = F}
-# Example of poor style
-
-if(Longer_variable_name_x<Short_name_y)
-{
-  print("x is smaller than y")
-}
-```
-
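-Following the spacing rules above, a conditional written in better style might look like this (a sketch reusing the variables defined earlier):
-
-```{r, eval = FALSE}
-# Example of better style: space after `if`, spaces around `<`,
-# opening brace on the same line, closing brace on its own line
-if (Longer_variable_name_x < Short_name_y) {
-  print("x is smaller than y")
-}
-```
-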
-:::tyk
-Using the input file provided ("Module2_2_TYKInput.R"):
-
-1. Convert the script and annotations into R Markdown format.
-2. Improve the organization, comments, and scripting to follow the coding best practices described in this module. List the changes you made at the bottom of the new R Markdown file.
-
-*Notes on the starting code:*
-
-1. This starting code uses dummy data to demonstrate how to make a graph in R that includes bars representing the mean, with standard deviation error bars overlaid.
-2. You don't need to understand every step in the code to be able to improve the existing coding style! You can run each step of the code if needed to understand better what it does.
-:::
-
-# 2.3 Data Manipulation and Reshaping
-
-This training module was developed by Kyle Roell, Alexis Payton, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Data within the fields of exposure science, toxicology, and public health are very rarely prepared and ready for downstream statistical analyses and visualization code. The beginning of almost any scripted analysis includes important formatting steps that make the data easier to read and work with. This can be done in several ways, including:
-
-+ [Base R operations and functions](https://www.r-project.org/about.html), or
-+ A collection of packages (and philosophy) known as [The Tidyverse](https://www.tidyverse.org).
-
-In this training tutorial we will review some of the most common ways you can organize and manipulate data, including:
-
-+ Merging data
-+ Filtering and subsetting data
-+ Pivoting data wider and longer (also known as casting and melting)
-
-These approaches will first be demonstrated using the functions available in base R. Then, the exact same approaches will be demonstrated using the functions and syntax that are part of the Tidyverse package.
-
-We will demonstrate these data manipulation and organization methods using an environmentally relevant example dataset from a human cohort. This dataset was generated by randomly sampling from data distributions in our previously published cohorts, resulting in a unique dataset for these training purposes. The dataset contains environmental exposure metrics (metal levels measured in sources of drinking water and in human urine samples) and associated demographic data.
-
-### Training Module's Environmental Health Question
-
-This training module was specifically developed to answer the following environmental health question using data manipulation and reshaping approaches:
-
-What is the average urinary chromium concentration across different maternal education levels?
-
-We'll use base R and the *Tidyverse* to answer this question, but let's start with base R.
-
-### Workspace Preparation and Data Import
-
-#### Set your working directory
-
-In preparation, first let's set our working directory to the folder path that contains our input files:
-```{r 02-Chapter2-39, eval = FALSE}
-setwd("/file path to where your input files are")
-```
-
-Note that file paths in R should use `/` as the folder separator on both macOS and Windows. Although Windows displays paths with `\`, a single backslash is an escape character in R, so Windows paths must be written with `/` or with doubled backslashes (`\\`).
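-
-For example (using a hypothetical Windows folder path):
-
-```{r, eval = FALSE}
-# Both of these work in R on Windows; single backslashes would not
-setwd("C:/Users/me/project/input")
-setwd("C:\\Users\\me\\project\\input")
-```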
-
-
-#### Importing example datasets
-
-Next, let's read in our example data sets:
-```{r 02-Chapter2-40}
-demographic_data <- read.csv("Chapter_2/Module2_3_Input/Module2_3_InputData1.csv")
-chemical_data <- read.csv("Chapter_2/Module2_3_Input/Module2_3_InputData2.csv")
-```
-
-#### Viewing example datasets
-Let's see what these datasets look like:
-```{r 02-Chapter2-41}
-dim(demographic_data)
-dim(chemical_data)
-```
-
-
-The demographic dataset includes 200 rows x 6 columns, while the chemical measurement dataset includes 200 rows x 7 columns.
-
-We can preview the demographic data frame by using the `head()` function, which displays all the columns and the first 6 rows of a data frame:
-```{r 02-Chapter2-42}
-head(demographic_data)
-```
-
-
-These demographic data are organized according to subject ID (first column), with columns containing the following subject information:
-
-+ `ID`: subject number
-+ `BMI`: body mass index
-+ `MAge`: maternal age in years
-+ `MEdu`: maternal education level; 1 = "less than high school", 2 = "high school or some college", 3 = "college or greater"
-+ `BW`: body weight in grams
-+ `GA`: gestational age in weeks
-
-We can also preview the chemical dataframe:
-```{r 02-Chapter2-43}
-head(chemical_data)
-```
-
-These chemical data are organized according to subject ID (first column), followed by measures of:
-
-+ `DWAs`: drinking water arsenic levels in µg/L
-+ `DWCd`: drinking water cadmium levels in µg/L
-+ `DWCr`: drinking water chromium levels in µg/L
-+ `UAs`: urinary arsenic levels in µg/L
-+ `UCd`: urinary cadmium levels in µg/L
-+ `UCr`: urinary chromium levels in µg/L
-
-## Data Manipulation Using Base R
-
-### Merging Data Using Base R Syntax
-
-Merging datasets represents the joining together of two or more datasets, using a common identifier (generally some sort of ID) to connect the rows. This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc, depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.
-
-Let's start by merging the example demographic data with the chemical measurement data using the base R function `merge()`. To learn more about this function, you can type `?merge`, which brings up helpful information in the R console. To merge these datasets with the merge function, use the following code. The `by =` argument specifies the column used to match the rows of data.
-```{r 02-Chapter2-44}
-full.data <- merge(demographic_data, chemical_data, by = "ID")
-dim(full.data)
-```
-
-This merged dataframe contains 200 rows x 12 columns. Viewing this merged dataframe, we can see that the `merge()` function retained the `ID` column from each original dataframe but did not replicate it, since it was used as the identifier for merging. All other columns include their original data, just merged together by the IDs in the first column.
-```{r 02-Chapter2-45}
-head(full.data)
-```
-
-These datasets were actually quite easy to merge, since they had the exact same column identifier and number of rows. When the identifier column names differ between datasets, you can specify them explicitly. Here, both columns are still named "ID", but the `by.x` and `by.y` arguments allow you to specify the name of the identifier column used in each dataset:
-```{r 02-Chapter2-46}
-full.data <- merge(demographic_data, chemical_data, by.x = "ID", by.y = "ID")
-
-# Viewing data
-head(full.data)
-```
-
-
-Note that after merging datasets, it is always helpful to check that the merging was done properly before proceeding with your data analysis. Helpful checks could include viewing the merged dataset, checking the numbers of rows and columns to make sure chunks of data are not missing, and searching for values (or strings) that exist in one dataset but not the other, among other mechanisms of QA/QC.
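-
-A few of these checks can be sketched in base R (using the dataframes above):
-
-```{r, eval = FALSE}
-# Confirm expected dimensions after merging
-dim(full.data)
-
-# Find IDs present in one dataset but not the other
-setdiff(demographic_data$ID, chemical_data$ID)  # in demographic data only
-setdiff(chemical_data$ID, demographic_data$ID)  # in chemical data only
-
-# Confirm no IDs were duplicated by the merge
-sum(duplicated(full.data$ID))
-```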
-
-
-### Filtering and Subsetting Data Using Base R Syntax
-
-Filtering and subsetting data are useful tools when you need to focus on specific parts of your dataset for downstream analyses. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for removing unneeded variables or samples from dataframes as you are working through your script.
-
-Note that in the examples that follow, we will create new dataframes that are distinguished from our original dataframe by adding sequential numbers to the end of the dataframe name (e.g., subset.data1, subset.data2, subset.data3). This style of dataframe naming is useful for the simple examples we are demonstrating, but in a full scripted analysis, we encourage the use of more descriptive dataframe names. For example, if you are subsetting your data to include only the first 100 rows, you could name that dataframe "data.first100."
-
-For this example, let's first define a vector of columns that we want to keep in our analysis, then subset the data by keeping only the columns specified in our vector:
-```{r 02-Chapter2-47}
-# Defining a vector of columns to keep in the analysis
-subset.columns <- c("BMI", "MAge", "MEdu")
-
-# Subsetting the data by selecting the columns represented in the defined 'subset.columns' vector
-subset.data1 <- full.data[,subset.columns]
-
-# Viewing the top of this subsetted dataframe
-head(subset.data1)
-```
-
-We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:
-```{r 02-Chapter2-48}
-subset.data2 <- full.data[1:100,]
-
-# Viewing the dimensions of this new dataframe
-dim(subset.data2)
-```
-
-To remove the first 100 rows, we use the same code as above, but include a `-` sign before our vector to indicate that these rows should be removed:
-```{r 02-Chapter2-49}
-subset.data3 <- full.data[-c(1:100),]
-
-# Viewing the dimensions of this new dataframe
-dim(subset.data3)
-```
-
-**Conditional statements** can also be used to filter and subset data. A **conditional statement** executes one block of code if the statement is true and a different block of code if the statement is false.
-
-A conditional statement requires a Boolean (true/false) expression that evaluates to either `TRUE` or `FALSE`. Some of the more commonly used constructs for conditional statements include:
-
-+ `if(){}` or an **if statement** means "execute R code when the condition is met".
-+ `if(){} else{}` or an **if/else statement** means "execute R code when condition 1 is met, if not execute R code for condition 2".
-+ `ifelse()` is a function that executes the same logic as an if/else statement. The first argument specifies a condition to be met. If that condition is met, R code in the second argument is executed, and if that condition is not met, R code in the third argument is executed.
-
-There are six comparison operators that are used to create these Boolean values:
-
-+ `==` means "equals".
-+ `!=` means "not equal".
-+ `<` means "less than".
-+ `>` means "greater than".
-+ `<=` means "less than or equal to".
-+ `>=` means "greater than or equal to".
-
-There are also three logical operators that are used to create these Boolean values:
-
-+ `&` means "and".
-+ `|` means "or".
-+ `!` means "not".
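-
-As a brief sketch of these pieces in action (using the merged dataframe above; the `"high"`/`"normal"` labels are purely illustrative):
-
-```{r, eval = FALSE}
-# ifelse(): vectorized condition, labeling each subject's BMI category
-bmi_group <- ifelse(full.data$BMI > 25, "high", "normal")
-
-# if/else statement: a single condition controlling which code block runs
-if (nrow(full.data) >= 100) {
-  print("At least 100 subjects")
-} else {
-  print("Fewer than 100 subjects")
-}
-```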
-
-We can filter data based on conditions using the `subset()` function. For example, the following code filters for subjects whose BMI is greater than 25 and who have a college education:
-```{r 02-Chapter2-50}
-subset.data4 <- subset(full.data, BMI > 25 & MEdu == 3)
-```
-
-Additionally, we can subset and select specific columns we would like to keep, using the `select` argument within the `subset()` function:
-```{r 02-Chapter2-51}
-# Filtering for subjects whose BMI is less than 22 or greater than 27
-# Also selecting the BMI, maternal age, and maternal education columns
-subset.data5 <- subset(full.data, BMI < 22 | BMI > 27, select = subset.columns)
-```
-
-For more information on the `subset()` function, see its associated [documentation](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset).
-
-
-### Melting and Casting Data using Base R Syntax
-
-Melting and casting refers to the conversion of data to "long" or "wide" form as discussed previously in **TAME 2.0 Module 1.4 Data Wrangling in Excel**. You will often see data within the environmental health field in wide format, though long format is necessary for some procedures, such as plotting with [*ggplot2*](https://ggplot2.tidyverse.org) and performing certain analyses.
-
-Here, we'll illustrate some example script to melt and cast data using the [*reshape2*](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4) package.
-Let's first install and load the `reshape2` package:
-```{r 02-Chapter2-52, message = FALSE}
-if (!requireNamespace("reshape2"))
-  install.packages("reshape2")
-```
-
-```{r 02-Chapter2-53}
-library(reshape2)
-```
-
-Using the fully merged dataframe, let's remind ourselves what these data look like in the current dataframe format:
-```{r 02-Chapter2-54}
-head(full.data)
-```
-
-
-These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in **wide (also known as casted)** format.
-
-Let's convert this dataframe to **long (also known as melted)** format. Here, we will specify that we want a row for each unique sample ID + variable measure pair by using `id = "ID"`:
-```{r 02-Chapter2-55}
-full.melted <- melt(full.data, id = "ID")
-
-# Viewing this new dataframe
-head(full.melted)
-```
-
-You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled `variable`. Then, the third column contains the value of this variable.
-
-Let's see an example view of the middle of this new dataframe:
-```{r 02-Chapter2-56}
-full.melted[1100:1110,1:3]
-```
-
-Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:
-```{r 02-Chapter2-57}
-dim(full.melted)
-```
-
-Let's now re-cast this dataframe back into wide format using the `dcast()` function. Here, we are telling the `dcast()` function to give us a sample (ID) for every variable in the column labeled `variable`. The column names from the variable column and corresponding values from the value column are then used to fill in the dataset:
-```{r 02-Chapter2-58}
-full.cast <- dcast(full.melted, ID ~ variable)
-head(full.cast)
-```
-
-Here, we can see that this dataframe is back in its original casted (or wide) format. Now that we're familiar with some base R functions to reshape our data, let's answer our original question: What is the average urinary chromium concentration for each maternal education level?
-
-Although it is not necessary to calculate the average, we could first subset our data frame to only include the two columns we are interested in (MEdu and UCr):
-```{r 02-Chapter2-59}
-subset.data6 <- full.data[,c("MEdu", "UCr")]
-
-head(subset.data6)
-```
-
-Next, we will make a new data frame for each maternal education level:
-```{r 02-Chapter2-60}
-# Creating new data frames based on maternal education category
-data.matedu.1 <- subset(subset.data6, MEdu == 1)
-data.matedu.2 <- subset(subset.data6, MEdu == 2)
-data.matedu.3 <- subset(subset.data6, MEdu == 3)
-
-# Previewing the first data frame to make sure our function is working as specified
-head(data.matedu.1)
-```
-
-Last, we can calculate the average urinary chromium concentration using each of our data frames:
-```{r 02-Chapter2-61}
-mean(data.matedu.1$UCr)
-mean(data.matedu.2$UCr)
-mean(data.matedu.3$UCr)
-```
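-
-As a side note, the same group-wise means can be computed in a single step with base R's `aggregate()` function:
-
-```{r, eval = FALSE}
-# Mean urinary chromium (UCr) for each maternal education level (MEdu)
-aggregate(UCr ~ MEdu, data = full.data, FUN = mean)
-```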
-
-:::question
- With this, we can answer our **Environmental Health Question**:
-
-What is the average urinary chromium concentration across different maternal education levels?
-:::
-
-:::answer
-**Answer:** The average urinary Chromium concentrations are 39.9 µg/L for participants with less than high school education, 40.6 µg/L for participants with high school or some college education, and 40.4 µg/L for participants with college education or greater.
-:::
-
-## Introduction to Tidyverse
-
-[Tidyverse](https://www.tidyverse.org) is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own specific type of syntax and formatting that differ slightly from base R functions. There are eight core tidyverse packages:
-
-+ For data visualization and exploration:
- + *ggplot2*
-+ For data wrangling and transformation:
- + *dplyr*
- + *tidyr*
- + *stringr*
- + *forcats*
-+ For data import and management:
- + *tibble*
- + *readr*
-+ For functional programming:
- + *purrr*
-
-Here, we will carry out all of the same data organization exercises demonstrated above using packages that are part of The Tidyverse, specifically using functions from the *dplyr* and *tidyr* packages.
-
-### Downloading and Loading the Tidyverse Package
-
-If you don't have *tidyverse* already installed, you will need to install it using:
-```{r 02-Chapter2-62, message = FALSE}
-if(!require(tidyverse))
- install.packages("tidyverse")
-```
-
-And then load the *tidyverse* package using:
-```{r 02-Chapter2-63}
-library(tidyverse)
-```
-
-Note that by loading the *tidyverse* package, you are also loading all of the packages included within The Tidyverse and do not need to separately load these packages.
-
-### Merging Data Using Tidyverse Syntax
-
-To merge the same example dataframes using *tidyverse*, you can run the following script:
-```{r 02-Chapter2-64}
-full.data.tidy <- inner_join(demographic_data, chemical_data, by = "ID")
-
-head(full.data.tidy)
-```
-
-Note that you can still merge dataframes whose ID column names differ by using the argument `by = c("ID.x" = "ID.y")`, where `"ID.x"` is the column name in the first dataframe and `"ID.y"` is the column name in the second. *tidyverse* also has other `join` functions, shown in the graphic below ([source](https://tavareshugo.github.io/r-intro-tidyverse-gapminder/08-joins/index.html)):
-```{r 02-Chapter2-65, echo = FALSE, out.width = "400px", fig.align = "center"}
-knitr::include_graphics("Chapter_2/Module2_3_Input/Image1.svg")
-```
-
-+ **inner_join** keeps only rows that have matching ID variables in both datasets
-+ **full_join** keeps the rows in both datasets
-+ **left_join** matches rows based on the ID variables in the first dataset (and omits any rows from the second dataset that do not have matching ID variables in the first dataset)
-+ **right_join** matches rows based on ID variables in the second dataset (and omits any rows from the first dataset that do not have matching ID variables in the second dataset)
-+ **anti_join(x,y)** keeps the rows that are unique to the first dataset
-+ **anti_join(y,x)** keeps the rows that are unique to the second dataset
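-
-To make the differences concrete, here is a small sketch with two toy dataframes (names and values are purely illustrative):
-
-```{r, eval = FALSE}
-a <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
-b <- data.frame(ID = c(2, 3, 4), y = c("B", "C", "D"))
-
-inner_join(a, b, by = "ID")  # 2 rows: IDs 2 and 3 (matched in both)
-full_join(a, b, by = "ID")   # 4 rows: IDs 1-4, with NAs where unmatched
-left_join(a, b, by = "ID")   # 3 rows: IDs 1-3 (all rows of `a`)
-anti_join(a, b, by = "ID")   # 1 row: ID 1 (unique to `a`)
-```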
-
-### The Pipe Operator
-
-One of the most important elements of Tidyverse syntax is the pipe operator (`%>%`). The pipe operator can be used to chain multiple functions together. It takes the object (typically a dataframe) to the left of the pipe operator and passes it to the function to the right. Multiple pipes can be used in a chain to execute multiple data cleaning steps without the need for intermediate dataframes. The pipe operator can be used to pass data to functions within all of the Tidyverse packages, not just the functions demonstrated here.
-
-Below, we can see the same code executed above, but this time with the pipe operator. The `demographic_data` dataframe is passed to `inner_join()` as the first argument to that function, with the following arguments remaining the same.
-```{r 02-Chapter2-66}
-full.data.tidy2 <- demographic_data %>%
- inner_join(chemical_data, by = "ID")
-
-head(full.data.tidy2)
-```
-
-Because the pipe operator is often used in a chain, it is best practice to start a new line after each pipe operator, with the new lines of code indented. This makes code with multiple piped steps easier to follow. However, if just one function is being executed, the pipe operator can be used on the same line as the input and function or omitted altogether (as shown in the previous two code chunks). Here is an example of placing the function to the right of the pipe operator on a new line, with placeholder functions shown as additional steps:
-```{r 02-Chapter2-67, eval = FALSE}
-full.data.tidy3 <- demographic_data %>%
- inner_join(chemical_data, by = "ID") %>%
- additional_function_1() %>%
- additional_function_2()
-```
-
-### Filtering and Subsetting Data Using Tidyverse Syntax
-
-#### Column-wise functions
-
-The `select()` function is used to subset columns in Tidyverse. Here, we can use our previously defined `subset.columns` vector inside `select()` to keep only those columns. The `all_of()` function tells `select()` to keep every column that matches an element of the `subset.columns` vector.
-```{r 02-Chapter2-68}
-subset.tidy1 <- full.data.tidy %>%
- select(all_of(subset.columns))
-
-head(subset.tidy1)
-```
-
-There are many different ways that `select()` can be used. See below for some examples using dummy variable names:
-```{r 02-Chapter2-69, eval = FALSE}
-# Select specific ranges in the dataframe
-data <- data %>%
- select(start_column_1:end_column_1)
-
-data <- data %>%
- select(c(start_column_1:end_column_1, start_column_2:end_column_2))
-
-# Select columns that match the elements in a character vector and an additional range of columns
-data <- data %>%
- select(c(all_of(character_vector), start_column_1:end_column_1))
-```
-
-To select columns whose names contain specific strings, you can use functions such as `starts_with()`, `ends_with()`, and `contains()`. These functions match case-insensitively by default (controlled by the `ignore.case` argument) and can be combined with specific column names and other selection ranges.
-```{r 02-Chapter2-70, eval = FALSE}
-data <- data %>%
- select(starts_with("starting_string"))
-
-data <- data %>%
- select(other_column_to_keep, starts_with("starting_string"))
-```
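-
-The case-insensitive matching can be sketched with a small made-up dataframe (hypothetical column names, not from the module's dataset):
-```{r}
-# Hypothetical dataframe with mixed-case column names
-string_demo <- data.frame(DWAs = 1, dwcd = 2, UAs = 3, ID = 4)
-
-# Keep columns whose names contain "dw" (case-insensitive by default)
-string_demo %>%
-  select(contains("dw"))
-
-# Request exact case matching instead, keeping only the uppercase match
-string_demo %>%
-  select(contains("DW", ignore.case = FALSE))
-```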
-
-To remove columns using tidyverse, you can use similar code, but include a `-` sign before the argument defining the columns.
-```{r 02-Chapter2-71}
-# Removing columns
-subset.tidy2 <- full.data.tidy %>%
- select(-all_of(subset.columns))
-
-# Viewing this new dataframe
-head(subset.tidy2)
-```
-
-#### Row-wise functions
-
-The `slice()` function can be used to keep or remove a certain number of rows based on their position within the dataframe. For example, we can retain only the first 100 rows using the following code:
-
-```{r 02-Chapter2-72}
-subset.tidy3 <- full.data.tidy %>%
- slice(1:100)
-
-dim(subset.tidy3)
-```
-
-Or, we can remove the first 100 rows:
-```{r 02-Chapter2-73}
-subset.tidy4 <- full.data.tidy %>%
- slice(-c(1:100))
-
-dim(subset.tidy4)
-```
-
-The related functions `slice_min()` and `slice_max()` can be used to select rows with the smallest or largest values of a variable.
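-
-As a small self-contained sketch (using a made-up mini-dataframe, since the idea is the same on any dataset), `slice_max()` and `slice_min()` select rows by ranked values of a variable:
-```{r}
-# Hypothetical mini-dataset
-bmi_demo <- data.frame(ID = 1:5, BMI = c(22, 31, 27, 19, 35))
-
-# Two rows with the largest BMI values
-bmi_demo %>%
-  slice_max(BMI, n = 2)
-
-# Two rows with the smallest BMI values
-bmi_demo %>%
-  slice_min(BMI, n = 2)
-```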
-
-The `filter()` function can be used to keep or remove specific rows based on conditional statements. For example, we can keep only rows where BMI is greater than 25 and age is greater than 31:
-```{r 02-Chapter2-74}
-subset.tidy5 <- full.data.tidy %>%
- filter(BMI > 25 & MAge > 31)
-
-dim(subset.tidy5)
-```
-
-#### Combining column and row-wise functions
-
-Now, we can see how Tidyverse makes it easy to chain together multiple data manipulation steps. Here, we first filter rows based on values for BMI and age, then we select our columns of interest:
-```{r 02-Chapter2-75}
-subset.tidy6 <- full.data.tidy %>%
- filter(BMI > 25 & MAge > 31) %>%
- select(BMI, MAge, MEdu)
-
-head(subset.tidy6)
-```
-
-### Melting and Casting Data Using Tidyverse Syntax
-
-To melt and cast data in Tidyverse, you can use the pivot functions (i.e., `pivot_longer()` or `pivot_wider()`).
-
-The first argument in the `pivot_longer()` function specifies which columns should be pivoted. This can be specified with either positive or negative selection - i.e., naming columns to pivot with a vector or range or naming columns not to pivot with a `-` sign. Here, we are telling the function to pivot all of the columns except the ID column, which we need to keep to be able to trace back which values came from which subject. The `names_to =` argument allows you to set what you want to name the column that stores the variable names (the column names in wide format). The `values_to =` argument allows you to set what you want to name the column that stores the values. We almost always call these columns "var" and "value", respectively, but you can name them anything that makes sense for your dataset.
-```{r 02-Chapter2-76}
-full.pivotlong <- full.data.tidy %>%
- pivot_longer(-ID, names_to = "var", values_to = "value")
-
-head(full.pivotlong, 15)
-```
-
-To pivot our data back to wide format, we can use `pivot_wider()`, which will pull the column names from the column specified in the `names_from =` argument and the corresponding values from the column specified in the `values_from = ` argument.
-```{r 02-Chapter2-77}
-full.pivotwide <- full.pivotlong %>%
- pivot_wider(names_from = "var", values_from = "value")
-
-head(full.pivotwide)
-```
-
-Now that we're familiar with some *tidyverse* functions to reshape our data, let's answer our original question: What is the average urinary Chromium concentration for each maternal education level?
-
-We can use the `group_by()` function to group our dataset by education class, then the `summarize()` function to calculate the mean of our variable of interest within each class. Note how much shorter and more efficient this code is than the code we used to calculate the same values using base R!
-```{r 02-Chapter2-78}
-full.data %>%
- group_by(MEdu) %>%
- summarize(Avg_UCr = mean(UCr))
-```
-
-For more detailed and advanced examples of pivoting in Tidyverse, see the [Tidyverse Pivoting Vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/pivot.html).
-
-## Concluding Remarks
-
-This training module provides an introductory level overview of data organization and manipulation basics in base R and Tidyverse, including merging, filtering, subsetting, melting, and casting, and demonstrates these methods with an environmentally relevant dataset. These methods are used regularly in scripted analyses and are important preparation steps for almost all downstream analyses and visualizations.
-
-
-
-
-:::tyk
-Which subjects, arranged from highest to lowest drinking water cadmium levels, delivered their babies at 35 weeks gestation or later and had urinary cadmium levels of at least 1.5 µg/L?
-
-**Hint**: Try using the `arrange()` function from the *tidyverse* package.
-:::
-
-# 2.4 Improving Coding Efficiencies
-
-This training module was developed by Elise Hickman, Alexis Payton, Kyle Roell, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-In this module, we'll explore how to improve coding efficiency. Coding efficiency involves performing a task in as few lines as possible and can...
-
-+ Shorten code by eliminating redundancies
-+ Reduce the number of typos
-+ Help other coders understand script better
-
-Specific approaches that we will discuss in this module include loops, functions, and list operations, which can all be used to make code more succinct. A **loop** is employed when we want to perform a repetitive task, while a **function** contains a block of code organized together to perform one specific task. **List operations**, in which the same function is applied to a list of dataframes, can also be used to code more efficiently.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
-
-2. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI $\geq$ 18.5) subjects?
-
-3. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between non-obese (BMI < 29.9) and obese (BMI $\geq$ 29.9) subjects?
-
-We will demonstrate how this analysis can be approached using for loops, functions, or list operations. We will introduce the syntax and structure of each approach first, followed by application of the approach to our data. First, let's prepare the workspace and familiarize ourselves with the dataset we are going to use.
-
-
-### Data Import and Workspace Preparation
-
-#### Installing required packages
-
-If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you. We will be using the *tidyverse* package for data manipulation steps and the [*rstatix*](https://github.com/kassambara/rstatix) package for statistical tests, as it provides pipe-friendly adaptations of the base R statistical tests and returns results as a dataframe rather than a list, making them easier to access. This brings up an important aspect of coding efficiency: sometimes a package already exists with functions that execute your desired analysis efficiently, so you don't need to write custom functions yourself! Don't forget to explore packages relevant to your analysis before spending a lot of time developing custom solutions (although sometimes this is necessary).
-
-```{r 02-Chapter2-79, message = FALSE}
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse")
-if (!requireNamespace("rstatix"))
- install.packages("rstatix")
-```
-
-#### Loading required packages
-
-```{r 02-Chapter2-80, message = FALSE}
-library(tidyverse)
-library(rstatix)
-```
-
-#### Setting your working directory
-
-```{r 02-Chapter2-81, eval = FALSE}
-setwd("/file path to where your input files are")
-```
-
-#### Importing example dataset
-
-The first example dataset contains subject demographic data, and the second dataset contains corresponding chemical data. These are the same data used previously in **TAME 2.0 Module 2.3 Data Manipulation and Reshaping**.
-
-```{r 02-Chapter2-82}
-# Load the demographic data
-demographic_data <- read.csv("Chapter_2/Module2_4_Input/Module2_4_InputData1.csv")
-
-# View the top of the demographic dataset
-head(demographic_data)
-
-# Load the chemical data
-chemical_data <- read.csv("Chapter_2/Module2_4_Input/Module2_4_InputData2.csv")
-
-# View the top of the chemical dataset
-head(chemical_data)
-```
-
-#### Preparing the example dataset
-
-For ease of analysis, we will merge these two datasets before proceeding.
-```{r 02-Chapter2-83}
-# Merging data
-full_data <- inner_join(demographic_data, chemical_data, by = "ID")
-
-# Previewing new data
-head(full_data)
-```
-
-Continuous demographic variables, like BMI, are often dichotomized (or converted to a categorical variable with two categories representing higher vs. lower values) to increase statistical power in analyses. This is particularly important for clinical data that tend to have smaller sample sizes. In our initial dataframe, BMI is a continuous or numeric variable; however, our questions require us to dichotomize BMI. We can use the following code, which relies on if/else logic (see **TAME 2.0 Module 2.3 Data Manipulation and Reshaping** for more information) to generate a new column representing our dichotomized BMI variable for our first environmental health question.
-```{r 02-Chapter2-84}
-# Adding dichotomized BMI column
-full_data <- full_data %>%
- mutate(Dichotomized_BMI = ifelse(BMI < 25, "Normal", "Overweight"))
-
-# Previewing new data
-head(full_data)
-```
-
-We can see that we now have created a new column entitled `Dichotomized_BMI` that we can use to perform a statistical test to assess if there are differences between drinking water metals between normal and overweight subjects.
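-
-As a quick sanity check (an aside, not part of the original analysis), we can tally how many subjects fall into each group with `count()`:
-```{r}
-# Tallying subjects per dichotomized BMI category
-full_data %>%
-  count(Dichotomized_BMI)
-```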
-
-
-
-## Loops
-
-We will start with loops. There are three main types of loops in R: `for`, `while`, and `repeat`. We will focus on `for` loops in this module, but for more in-depth information on loops, including the additional types of loops, see [here](https://intro2r.com/loops.html). Before applying loops to our data, let's discuss how `for` loops work.
-
-The basic structure of a `for` loop is shown here:
-```{r 02-Chapter2-85}
-# Basic structure of a for loop
-for (i in 1:4){
- print(i)
-}
-```
-
-`for` loops always start with `for` followed by a statement in parentheses. The argument in the parentheses tells R how to iterate (or repeat) through the code in the curly brackets. Here, we are telling R to iterate through the code in the curly brackets 4 times, printing the value of our iterator `i` each time, which takes the values 1, 2, 3, and then 4. Loops can also iterate through the values in a dataframe column. For example, we can use a `for` loop to print the age of each subject:
-```{r 02-Chapter2-86}
-# Creating a smaller dataframe for our loop example
-full_data_subset <- full_data[1:6, ]
-
-# Finding the total number of rows or subjects in the dataset
-number_of_rows <- length(full_data_subset$MAge)
-
-# Creating a for loop to iterate from 1 to the last row
-for (i in 1:number_of_rows){
- # Printing each subject age
- # Need to put `[i]` to index the correct value corresponding to the row we are evaluating
- print(full_data_subset$MAge[i])
-}
-```
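-
-A small defensive note (an aside on the same loop): base R's `seq_len()` generates `1, 2, ..., n` and safely produces an empty sequence when `n` is 0, whereas `1:number_of_rows` would count backwards (`1, 0`) on an empty dataframe:
-```{r}
-# Safer iteration over row indices; the loop body never runs if there are no rows
-for (i in seq_len(nrow(full_data_subset))) {
-  print(full_data_subset$MAge[i])
-}
-```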
-
-Now that we know how a `for` loop works, how can we apply this approach to determine whether there are statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
-
-Because our data are normally distributed and there are two groups that we are comparing, we will use a t-test applied to each metal measured in drinking water. Testing for assumptions is outside the scope of this module, but see **TAME 2.0 Module 3.3 Normality Tests and Data Transformation** for more information on this topic.
-
-Running a t-test in R is very simple, which we can demonstrate by running a t-test on the drinking water arsenic data:
-```{r 02-Chapter2-87}
-# Running t-test and storing results in t_test_res
-t_test_res <- full_data %>%
- t_test(DWAs ~ Dichotomized_BMI)
-
-# Viewing results
-t_test_res
-```
-
-We can see that our p-value is 0.468. Because this is greater than 0.05, we cannot reject the null hypothesis that normal weight and overweight subjects are exposed to the same drinking water arsenic concentrations. Although this was a very simple line of code to run, what if we have many columns we want to run the same t-test on? We can use a `for` loop to iterate through these columns.
-
-Let's break down the steps of our `for` loop before executing the code.
-
-1. First, we will define the variables (columns) we want to run our t-test on. This is different from our approach above, because in those code chunks, we were using numbers to indicate the number of iterations through the loop. Here, we are naming the specific variables instead, and R will iterate through each of these variables. Note that we could omit this step and instead use the numeric column index of our variables of interest `[7:9]`. However, naming the specific columns makes this approach more robust because if additional data are added to or removed from our dataframe, the numeric column index of our variables could change. Which approach you choose really depends on the purpose of your loop!
-
-2. Second, we will create an empty dataframe where we will store the results generated by our `for` loop.
-
-3. Third, we will actually run our for loop. This will tell R: for each variable in our `vars_of_interest` vector, run a t-test with that variable (and store the results in a temporary dataframe called "res"), then add those results to our final results dataframe. A row will be added to the results dataframe each time R iterates through a new variable, resulting in a dataframe that stores the results of all of our t-tests.
-
-```{r 02-Chapter2-88}
-# Defining variables (columns) we want to run a t-test on
-vars_of_interest <- c("DWAs", "DWCd", "DWCr")
-
-# Creating an empty dataframe to store results
-t_test_res_DW <- data.frame()
-
-# Running for loop
-for (i in vars_of_interest) {
-
- # Storing the results of each iteration of the loop in a temporary results dataframe
- res <- full_data %>%
-
- # Writing the formula needed for each iteration of the loop
- t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
-
- # Adding a row to the results dataframe each time the loop is iterated
- t_test_res_DW <- bind_rows(t_test_res_DW, res)
-}
-
-# Viewing our results
-t_test_res_DW
-```
-
-:::question
- With this, we can answer **Environmental Health Question #1**:
-
-Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
-:::
-
-:::answer
-**Answer**: No, there are not any statistically significant differences in drinking water metals between normal weight and overweight subjects.
-:::
-
-
-
-### Formulas and Pasting
-
-Note the use of the code `as.formula(paste(i, "~ Dichotomized_BMI", sep = ""))`. Let's take a quick detour to discuss the use of the `as.formula()` and `paste()` functions, as these are important functions often used in loops and user-defined functions.
-
-Many statistical test functions and regression functions require one argument to be a formula, which is typically formatted as `y ~ x`, where y is the dependent variable of interest and x is an independent variable. For some functions, additional variables can be included on the right side of the formula to represent covariates (additional variables of interest). The function `as.formula()` returns the argument in parentheses in formula format so that it can be correctly passed to other functions. We can demonstrate that here by assigning a dummy variable `j` the character string `var1`:
-
-```{r 02-Chapter2-89}
-# Assigning variable
-j <- "var1"
-
-# Demonstrating output of as.formula()
-as.formula(paste(j, " ~ Dichotomized_BMI", sep = ""))
-```
-
-We can use the `paste()` function to combine strings of characters. The `paste()` function takes each argument (as many arguments as needed) and pastes them together into one character string, with the separator between arguments set by the `sep = ` argument. When our y variable changes with each iteration of our `for` loop, we can use `paste()` to write our formula correctly by pasting the variable `i`, followed by the rest of the formula, which stays the same for each iteration of the loop. Let's examine the output of just the `paste()` part of our code:
-```{r 02-Chapter2-90}
-paste(j, " ~ Dichotomized_BMI", sep = "")
-```
-
-The `paste()` function is very flexible and can be useful in many other settings when you need to create one character string from arguments from different sources! Notice that the output looks different from the output of `as.formula()`. There is a returned index (`[1]`), and there are quotes around the character string. The last function we will highlight here is the `noquote()` function, which can be helpful if you'd like a string without quotes:
-```{r 02-Chapter2-91}
-noquote(paste(j, " ~ Dichotomized_BMI", sep = ""))
-```
-
-However, this still returns an indexed number, so there are times when it will not allow code to execute properly (for example, when we need a formula format).
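-
-As an aside, base R also provides `reformulate()`, which builds a formula object directly from character strings and can stand in for the `as.formula(paste(...))` pattern:
-```{r}
-# Assigning the same dummy variable as above
-j <- "var1"
-
-# Equivalent to as.formula(paste(j, " ~ Dichotomized_BMI", sep = ""))
-reformulate("Dichotomized_BMI", response = j)
-```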
-
-Next, we will learn about functions and apply them to our dataset to answer our additional environmental health questions.
-
-
-
-## Functions
-
-Functions are useful when you want to execute a block of code organized together to perform one specific task, and you want to be able to change parameters for that task easily rather than having to copy and paste code over and over that largely stays the same but might have small modifications in certain arguments. The basic structure of a function is as follows:
-
-```{r 02-Chapter2-92, eval = FALSE}
-function_name <- function(parameter_1, parameter_2...){
-
- # Function body (where the code goes)
- insert_code_here
-
- # What the function returns
- return()
-}
-```
-
-A function requires you to name it as we did with `function_name`. In parentheses, the function requires you to specify the arguments or parameters. Parameters (i.e., `parameter_1`) act as placeholders in the body of the function. This allows us to change the values of the parameters each time a function is called, while the majority of the code remains the same. Lastly, we have a `return()` statement, which specifies what object (i.e., vector, dataframe, etc.) we want to retrieve from a function. Although a function can display the last expression from the function body in the absence of a `return()` statement, it's a good habit to include it as the last expression. It is important to note that, although functions can take many input parameters and execute large code chunks, they can only return one item, whether that is a value, vector, dataframe, plot, code output, or list.
-
-When writing your own functions, it is important to describe the purpose of the function, its input, its parameters, and its output so that others can understand what your function does and how to use it. This can be defined either in text above a code chunk if you are using R Markdown or as comments within the code itself. We'll start with a simple function. Let's say we want to convert temperatures from Fahrenheit to Celsius. We can write a function that takes the temperature in Fahrenheit and converts it to Celsius. Note that we have given our parameters descriptive names (`fahrenheit_temperature`, `celsius_temperature`), which makes our code more readable than if we assigned them dummy names such as x and y.
-
-```{r 02-Chapter2-93}
-# Function to convert temperatures in Fahrenheit to Celsius
-## Parameters: temperature in Fahrenheit (input)
-## Output: temperature in Celsius
-
-fahrenheit_to_celsius <- function(fahrenheit_temperature){
-
- celsius_temperature <- (fahrenheit_temperature - 32) * (5/9)
-
- return(celsius_temperature)
-}
-```
-
-Notice that the above code block was run, but there isn't an output. Rather, running the code assigns the function code to that function. When you run code defining a function, that function will appear in your Global Environment under the "Functions" section. We can see the output of the function by providing an input value. Let's start by converting 41 degrees Fahrenheit to Celsius:
-
-```{r 02-Chapter2-94}
-# Calling the function
-# Here, 41 is the `fahrenheit_temperature` in the function
-fahrenheit_to_celsius(41)
-```
-
-41 degrees Fahrenheit is equivalent to 5 degrees Celsius. We can also have the function convert a vector of values.
-
-```{r 02-Chapter2-95}
-# Defining vector of temperatures
-vector_of_temperatures <- c(81, 74, 23, 65)
-
-# Calling the function
-fahrenheit_to_celsius(vector_of_temperatures)
-```
-
-Before getting back to answer our environmental health related questions, let's look at one more example of a function. This time we'll create a function that can calculate the circumference of a circle based on its radius in inches. Here you can also see a different style of commenting to describe the function's purpose, inputs, and outputs.
-
-```{r 02-Chapter2-96}
-circle_circumference <- function(radius){
- # Calculating a circle's circumference based on the radius in inches
-
- # :parameters: radius
- # :output: circumference and radius
-
- # Calculating diameter first
- diameter <- 2 * radius
-
- # Calculating circumference
- circumference <- pi * diameter
-
- return(circumference)
-}
-
-# Calling function
-circle_circumference(3)
-```
-
-So, if a circle had a radius of 3 inches, its circumference would be ~19 inches. What if we were interested in seeing the diameter to double check our code?
-
-```{r 02-Chapter2-97, error = TRUE, suppress_error_alert = TRUE}
-diameter
-```
-
-R throws an error, because the variable `diameter` was created inside the function and the function only returned the `circumference` variable. This is actually one of the ways that functions can improve coding efficiency - by not needing to store intermediate variables that aren't of interest to the main goal of the code or analysis. However, there are two ways we can still see the `diameter` variable:
-
-1. Put print statements in the body of the function (`print(diameter)`).
-2. Have the function return a list of variables (e.g., `list(circumference, diameter)`). See the below section on **List Operations** for more on this topic.
-
-We can now move on to using a more complicated function to answer all three of our environmental health questions without repeating our earlier code three times. The main difference between each of our first three environmental health questions is the BMI cutoff used to dichotomize the BMI variable, so we can use that as one of the parameters for our function. We can also use arguments in our function to name our groups.
-
-We can adapt our previous `for` loop code into a function that will take different BMI cutoffs and return statistical results by including parameters to define the parts of the analysis that will change with each unique question. For example:
-
-+ Changing the BMI cutoff from a number (in our previous code) to our parameter name that specifies the cutoff
-+ Changing the group names for assigning category (in our previous code) to our parameter names
-
-```{r 02-Chapter2-98}
-# Function to dichotomize BMI into different categories and return results of t-test on drinking water metals between dichotomized groups
-
-## Parameters:
-### input_data: dataframe containing BMI and drinking water metals levels
-### bmi_cutoff: numeric value specifying the cut point for dichotomizing BMI
-### lower_group_name: name for the group of subjects with BMIs lower than the cutoff
-### upper_group_name: name for the group of subjects with BMIs higher than the cutoff
-### variables: vector of variable names that statistical test should be run on
-
-## Output: dataframe with statistical results for each variable in the variables vector
-
-bmi_DW_ttest <- function(input_data, bmi_cutoff, lower_group_name, upper_group_name, variables){
-
- # Creating dichotomized variable
- dichotomized_data <- input_data %>%
- mutate(Dichotomized_BMI = ifelse(BMI < bmi_cutoff, lower_group_name, upper_group_name))
-
- # Creating an empty dataframe to store results
- t_test_res_DW <- data.frame()
-
- # Running for loop
- for (i in variables) {
-
- # Storing the results of each iteration of the loop in a temporary results dataframe
- res <- dichotomized_data %>%
-
- # Writing the formula needed for each iteration of the loop
- t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
-
- # Adding a row to the results dataframe each time the loop is iterated
- t_test_res_DW <- bind_rows(t_test_res_DW, res)
- }
-
- # Return results
- return(t_test_res_DW)
-
-}
-```
-
-For the first example of using the function, we have included the name of each argument for clarity, but this isn't necessary *if* you pass in the arguments *in the order they were defined when writing the function*.
-```{r 02-Chapter2-99}
-# Defining variables (columns) we want to run a t-test on
-vars_of_interest <- c("DWAs", "DWCd", "DWCr")
-
-# Apply function for normal vs. overweight (bmi_cutoff = 25)
-bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal",
- upper_group_name = "Overweight", variables = vars_of_interest)
-```
-
-Here, we can see the same results as above in the **Loops** section. We can next apply the function to answer our additional environmental health questions:
-```{r 02-Chapter2-100}
-# Apply function for underweight vs. non-underweight (bmi_cutoff = 18.5)
-bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)
-
-# Apply function for non-obese vs. obese (bmi_cutoff = 29.9)
-bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)
-```
-
-:::question
- With this, we can answer **Environmental Health Questions #2 & #3**:
-
-Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI $\geq$ 18.5) subjects or between non-obese (BMI < 29.9) and obese (BMI $\geq$ 29.9) subjects?
-:::
-
-:::answer
-**Answer**: No, there are not any statistically significant differences in drinking water metals between underweight and non-underweight subjects or between non-obese and obese subjects.
-:::
-
-Here, we were able to answer all three of our environmental health questions within relatively few lines of code by using a function to efficiently assess different variations on our analysis.
-
-In the last section of this module, we will demonstrate how to use list operations to improve coding efficiency.
-
-
-
-## List operations
-
-Lists are a data type in R that can store other data types (including lists, to make nested lists). This allows you to store multiple dataframes in one object and apply the same functions to each dataframe in the list. Lists can also be helpful for storing the results of a function if you would like to be able to access multiple outputs. For example, if we return to our example of a function that calculates the circumference of a circle, we can store both the diameter and circumference as list objects. The function will then return a list containing both of these values when called.
-```{r 02-Chapter2-101}
-# Adding list element to our function
-circle_circumference_4 <- function(radius){
- # Calculating a circle's circumference and diameter based on the radius in inches
-
- # :parameters: radius
- # :output: list that contains diameter [1] and circumference [2]
-
- # Calculating diameter first
- diameter <- 2 * radius
-
- # Calculating circumference
- circumference <- pi * diameter
-
- # Storing results in a named list
- results <- list("diameter" = diameter, "circumference" = circumference)
-
- # Return results
- results
-}
-
-# Calling function
-circle_circumference_4(10)
-```
-
-We can also call the results individually using the following code:
-```{r 02-Chapter2-102}
-# Storing results of function
-circle_10 <- circle_circumference_4(10)
-
-# Viewing only diameter
-
-## Method 1
-circle_10$diameter
-
-## Method 2
-circle_10[1]
-
-# Viewing only circumference
-
-## Method 1
-circle_10$circumference
-
-## Method 2
-circle_10[2]
-```
-
-In the context of our dataset, we can use list operations to clean up and combine our results from all three BMI stratification approaches. This is often necessary to prepare data to share with collaborators or for supplementary tables in a manuscript. Let's revisit our code for producing our statistical results, this time assigning our results to a dataframe rather than viewing them.
-```{r 02-Chapter2-103}
-# Defining variables (columns) we want to run a t-test on
-vars_of_interest <- c("DWAs", "DWCd", "DWCr")
-
-# Normal vs. overweight (bmi_cutoff = 25)
-norm_vs_overweight <- bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal",
- upper_group_name = "Overweight", variables = vars_of_interest)
-
-# Underweight vs. non-underweight (bmi_cutoff = 18.5)
-under_vs_nonunderweight <- bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)
-
-# Non-obese vs. obese (bmi_cutoff = 29.9)
-nonobese_vs_obese <- bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)
-
-# Viewing one results dataframe as an example
-norm_vs_overweight
-```
-
-For publication purposes, let's say we want to make the following formatting changes:
-
-+ Keep only the comparison of interest (for example Normal vs. Overweight) and the associated p-value, removing columns that are not as useful for interpreting or sharing the results
-+ Rename the `.y.` column so that its contents are clearer
-+ Collapse all of our data into one final dataframe
-
-We can first write a function to execute these cleaning steps:
-```{r 02-Chapter2-104}
-# Function to clean results dataframes
-
-## Parameters:
-### input_data: dataframe containing results of t-test
-
-## Output: cleaned dataframe
-
-data_cleaning <- function(input_data) {
-
- data <- input_data %>%
-
- # Rename .y. column
- rename("Variable" = ".y.") %>%
-
- # Merge group1 and group2
- unite(Comparison, group1, group2, sep = " vs. ") %>%
-
- # Keep only columns of interest
- select(c(Variable, Comparison, p))
-
- return(data)
-}
-```
-
-Then, we can make a list of the dataframes we want to clean:
-```{r 02-Chapter2-105}
-# Making list of dataframes
-t_test_res_list <- list(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese)
-
-# Viewing list of dataframes
-head(t_test_res_list)
-```
-
-And we can apply the cleaning function to each of the dataframes using the `lapply()` function, which takes a list as the first argument and the function to apply to each list element as the second argument:
-```{r 02-Chapter2-106}
-# Applying cleaning function
-t_test_res_list_cleaned <- lapply(t_test_res_list, data_cleaning)
-
-# Viewing cleaned dataframes
-head(t_test_res_list_cleaned)
-```
-
-Last, we can collapse our list down into one dataframe using the `do.call()` and `rbind.data.frame()` functions, which together take the elements of the list and collapse them into a single dataframe by binding their rows:
-```{r 02-Chapter2-107}
-t_test_res_cleaned <- do.call(rbind.data.frame, t_test_res_list_cleaned)
-
-# Viewing final dataframe
-t_test_res_cleaned
-```
-
-The above example is just that - an example to demonstrate the mechanics of using list operations. However, there are actually a couple of even more efficient ways to execute the above cleaning steps:
-
-1. Build cleaning steps into the analysis function if you know you will not need to access the raw results dataframe.
-2. Bind all three dataframes together, then execute the cleaning steps.
-
-We will demonstrate #2 below:
-```{r 02-Chapter2-108}
-# Start by binding the rows of each of the results dataframes
-t_test_res_cleaned_2 <- bind_rows(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese) %>%
-
- # Rename .y. column
- rename("Variable" = ".y.") %>%
-
- # Merge group1 and group2
- unite(Comparison, group1, group2, sep = " vs. ") %>%
-
- # Keep only columns of interest
- select(c(Variable, Comparison, p))
-
-# Viewing results
-t_test_res_cleaned_2
-```
-
-As you can see, this dataframe is the same as the one we produced using list operations. It was produced using fewer lines of code and without the need for a user-defined function! For our purposes, this was the more efficient approach. However, we felt it was important to demonstrate the mechanics of list operations because there may be times when you do need to keep dataframes separate during specific analyses.
-
-
-
-## Concluding Remarks
-
-This module provided an introduction to loops, functions, and list operations and demonstrated how to use them to efficiently analyze an environmentally relevant dataset. When and how you implement these approaches depends on your coding style and the goals of your analysis. Although here we were focused on statistical tests and data cleaning, these flexible approaches can be used in a variety of data analysis steps. We encourage you to implement loops, functions, and list operations in your analyses when you find the need to iterate through statistical tests, visualizations, data cleaning, or other common workflow elements!
-
-## Additional Resources
-
-+ [Intro2r Functions in R](https://intro2r.com/functions-in-r.html)
-+ [Intro2r Loops](https://intro2r.com/prog_r.html)
-+ [Hadley Wickham Advanced R - Functionals](http://adv-r.had.co.nz/Functionals.html)
-
-
-
-
-
-
-:::tyk
-Use the same input data we used in this module to answer the following questions and produce a cleaned, publication-ready data table of results. Note that these data are normally distributed, so you can use a t-test.
-
-1. Are there statistically significant differences in urine metal concentrations (i.e., arsenic levels, cadmium levels, etc.) between younger (MAge < 40) and older (MAge $\geq$ 40) mothers?
-2. Are there statistically significant differences in urine metal concentrations (i.e., arsenic levels, cadmium levels, etc.) between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
-:::
diff --git a/Chapter_2/2_1_R_Programming/2_1_R_Programming.Rmd b/Chapter_2/2_1_R_Programming/2_1_R_Programming.Rmd
new file mode 100644
index 0000000..5ea9401
--- /dev/null
+++ b/Chapter_2/2_1_R_Programming/2_1_R_Programming.Rmd
@@ -0,0 +1,444 @@
+# (PART\*) Chapter 2 Coding in R {-}
+
+
+# 2.1 Downloading and Programming in R
+
+This training module was developed by Kyle Roell, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+In this training module, we will provide a brief introduction of:
+
++ R
++ R Studio
++ Packages in R
++ Scripting basics
++ Code troubleshooting
+
+## General Introduction and Installation of R and RStudio
+
+### What is R?
+
+**R** is a programming language. Computer script (lines of code) can be used to increase data analysis reproducibility, transparency, and methods sharing, and is becoming increasingly incorporated into exposure science, toxicology, and environmental health research. One of the most commonly used coding languages in the field of environmental health science is the **R language**. Some advantages of using R include the following:
+
++ Free, open-source programming language that is licensed under the Free Software Foundation’s GNU General Public License
++ Can be run across all major platforms and operating systems, including Unix, Windows, and MacOS
++ Publicly available packages help you carry out analyses efficiently (without you having to code for everything yourself)
++ Large, diverse collection of packages
++ Comprehensive documentation
++ When code is efficiently tracked during development/execution, it promotes reproducible analyses
+
+Because of these advantages, R has emerged as an avenue for world-wide collaboration in data science. Other commonly implemented scripting languages in the field of environmental health research include Python and SAS, among others; and these training tutorials focus on R as an important introductory-level example that also houses many relevant packages and example datasets as further described throughout TAME.
+
+### Downloading and Installing R
+
+To download R, first navigate to [https://cran.rstudio.com/](https://cran.rstudio.com/) and download the installer file for your operating system. Install this file according to your computer's typical program installation steps.
+
+### What is RStudio?
+
+**RStudio** is an Integrated Development Environment (IDE) for R, which makes developing and using R script more 'user friendly'. It is a desktop application that can be downloaded for free online.
+
+### Downloading and Installing RStudio
+
+To download RStudio:
+
++ Navigate to: [https://posit.co/download/rstudio-desktop/](https://posit.co/download/rstudio-desktop/)
++ Scroll down and select "Download RStudio"
++ Install according to your computer's typical program installation steps
+
+### RStudio Orientation
+
+Here is a screenshot demonstrating what the RStudio desktop app looks like:
+```{r 2-1-R-Programming-1, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image1.png")
+```
+
+The default RStudio layout has four main panes (numbered above in the blue boxes):
+
+1. **Source Editor:** allows you to open and edit script files and view data.
+2. **Console:** where you can type code that will execute immediately when you press enter/return. This is also where code from script files will appear when you run the code.
+3. **Environment:** shows you the objects in your environment.
+4. **Viewer:** has a number of useful tabs, including:
+ 1. **Files:** a file manager that allows you to navigate your files, similar to Finder or File Explorer
+ 2. **Plots:** where plots you generate by executing code will appear
+ 3. **Packages:** shows you packages that are loaded (checked) and those that can be loaded (unchecked)
+ 4. **Help:** where help pages will appear for packages and functions (see below for further instructions on the help option)
+
+Under "Tools" → "Global Options," RStudio panes can be customized to appear in different configurations or with different color themes. A number of other options can also be changed. For example, you can choose to have color names highlighted in the color they represent, or to display rainbow-colored parentheses that help you visualize nested code.
+```{r 2-1-R-Programming-2, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image2.png")
+```
+
+## Introduction to R Packages
+
+One of the major benefits to coding in the R language is access to the continually expanding resource of thousands of user-developed **packages**. Packages are compilations of code and functions fitted for a specialized focus or purpose. They are often written by R users and submitted to [CRAN](https://cran.r-project.org/web/packages/) or another host, such as [Bioconductor](https://www.bioconductor.org/) or [GitHub](https://github.com/).
+
+Packages aid in data analysis and methods sharing. Their utilities vary widely, spanning basic organization and manipulation of data, data visualization, and more advanced approaches to parsing and analyzing data, with examples included throughout the subsequent training modules.
+
+Examples of some common packages that we'll be using throughout these training modules include the following:
+
++ ***tidyverse***: A collection of open source R packages that share an underlying design philosophy, grammar, and data structures of tidy data. For more information on the *tidyverse* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/tidyverse/index.html), primary [webpage](https://www.tidyverse.org/packages/), and [peer-reviewed article released in 2018](https://onlinelibrary.wiley.com/doi/10.1002/sdr.1600).
+
++ ***ggplot2***: A system for creating graphics. Users provide the data and tell R what type of graph to use, how to map variables to aesthetics (elements of the graph), and additional stylistic elements to include in the graph. For more information on the *ggplot2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/ggplot2/index.html) and [R Documentation](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5).
+
+More information on these packages, as well as many others, is included throughout TAME training modules.
+
+### Downloading/Installing R Packages
+
+R packages often do not need to be downloaded from a website. Instead, you can install and load packages by running script in R. Note that you only need to install a package once, but packages must be loaded each time you start a new R session.
+
+```{r 2-1-R-Programming-3, eval=FALSE, echo=TRUE}
+# Install the package
+install.packages("tidyverse")
+
+# Load the package for use
+library(tidyverse)
+```
+
+Many packages also exist as part of the baseline configuration of an R working environment, and do not require manual loading each time you launch R. These include the following packages:
+
++ datasets
++ graphics
++ methods
++ stats
++ utils
+
+You can learn more about a function by typing one question mark before the name of the function, which will bring up documentation in the Help tab of the Viewer window. Importantly, this documentation includes a description of the different arguments that can be passed to the function and examples for how to use the function.
+
+```{r 2-1-R-Programming-4, eval=FALSE}
+?install.packages
+```
+
+```{r 2-1-R-Programming-5, echo=FALSE, fig.align = "center", out.width = "400px" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image3.png")
+```
+
+You can learn more about a package by typing two question marks before the name of the package. This will bring up vignettes and help pages associated with that package.
+
+```{r 2-1-R-Programming-6, eval=FALSE}
+??tidyverse
+```
+
+```{r 2-1-R-Programming-7, echo=FALSE, fig.align = "center", out.width = "400px" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image4.png")
+```
+
+
+
+## Scripting Basics
+
+### Data Types
+
+Before writing any script, let's first review different data types in R. Data types are what they imply – the type of data you are handling. It is important to understand data types because functions often require a specific data type as input.
+
+R has 5 basic data types:
+
++ Logical (e.g., TRUE or FALSE)
++ Integer (e.g., 1, 2, 3)
++ Numeric (real or decimal)
++ Character (e.g., "apple")
++ Complex (e.g., 1 + 0i)
+
+Numeric variables are often stored as "double" values (sometimes shown as `<dbl>`), meaning double-precision floating point numbers. Character variables can also be stored as factors, which are data structures for categorical data in which the allowed categories, known as levels, can be given a specific order.
+
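+These data types can be checked directly in R using the `class()` function. Here is a brief sketch (the chunk label and example values are our own):
+```{r 2-1-R-Programming-data-types-example}
+# Checking the data type of example values
+class(TRUE)      # logical
+class(1L)        # integer (the L suffix creates an integer)
+class(1.5)       # numeric
+class("apple")   # character
+class(1 + 0i)    # complex
+
+# Storing character data as a factor with ordered levels
+sizes <- factor(c("small", "large", "medium"),
+                levels = c("small", "medium", "large"))
+levels(sizes)
+```
+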
+Data are stored in data structures. There are many different data structures in R. Some packages even implement unique data structures. The most common data structures are:
+
++ **Vectors:** also known as atomic vectors; can contain characters, logical values, integers, or numeric values, but all elements must be the same data type.
++ **Matrices:** a vector with multiple dimensions. Elements must still be all the same data type.
++ **Data frames:** similar to a matrix but can contain different data types and additional attributes such as row names (and is one of the most common data structures in environmental health research). Tibbles are a stricter type of data frame implemented in the *tidyverse* package.
++ **Lists:** a special type of vector that acts as a container – other data structures can be stored within the list, and lists can contain other lists. Lists can contain elements that are different data structures.
+
+```{r 2-1-R-Programming-8, echo=FALSE, fig.align = "center"}
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image5.png")
+```
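+
+The four data structures above can be created and inspected with base R functions. Here is a brief sketch (the object names are our own):
+```{r 2-1-R-Programming-data-structures-example}
+# Vector: all elements must share one data type
+v <- c(1, 2, 3)
+
+# Matrix: a vector with dimensions (here, 2 rows by 3 columns)
+m <- matrix(1:6, nrow = 2)
+
+# Data frame: columns can contain different data types
+df <- data.frame(id = 1:3, color = c("Blue", "Red", "Yellow"))
+
+# List: a container that can hold any of the above
+l <- list(my_vector = v, my_matrix = m, my_df = df)
+
+# Confirming the structure of each object
+class(v)
+class(m)
+class(df)
+class(l)
+```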
+
+### Writing Script
+
+R code is written line by line. It may take just one line or many lines of code for one step to be executed, depending on the number of arguments to the function you are using. R code is executed (run) by selecting the line(s) of code to run and pressing return/enter (or a keyboard shortcut), or by clicking "Run" in the upper right corner of the script.
+
+A very simple example of running code is as follows:
+```{r 2-1-R-Programming-9 }
+3 + 4
+```
+
+We can see that when we ran our code, the answer was returned. But what if we want to store that answer? We can assign that number to a variable named `x` using the assignment operator `<-`:
+```{r 2-1-R-Programming-10 }
+x <- 3 + 4
+```
+
+Then, if we run a line of code with our variable, we will get that value:
+```{r 2-1-R-Programming-11 }
+x
+```
+
+The assignment operator can also be used to assign values to any of the data structures discussed above, such as vectors and data frames, as shown here:
+```{r 2-1-R-Programming-12 }
+# Creating a vector of values called my_values
+my_values <- c(7, 3, 8, 9)
+
+# Viewing the vector
+my_values
+
+# Creating a data frame of values corresponding to colors
+my_df <- data.frame(values = my_values, color = c("Blue", "Red", "Yellow", "Purple"))
+
+# Viewing the data frame
+my_df
+```
+
+### Comments
+
+You may have noticed in the code chunks above that there were `#` followed by phrases describing the code. R allows for scripts to contain non-code elements, called comments, that will not be run or interpreted. Comments are useful to help make code more interpretable for others or to add reminders of what and why parts of code may have been written.
+
+To make a comment, simply use a `#` followed by the comment. A `#` only comments out a single line of code. In other words, only that line will be commented and therefore not be run, but lines directly above/below it will still be run:
+```{r 2-1-R-Programming-13 }
+# This is an R comment!
+```
+
+For more on comments, see **TAME 2.0 Module 2.2 Coding Best Practices**.
+
+### Autofilling
+
+RStudio will autofill function names and object names as you type, which can save a lot of time. When you are typing a variable or function name, you can press tab while typing. RStudio will look for variables or functions that match the first few letters you've typed. If multiple matches are found, RStudio will provide you with a drop down list to select from, which may be useful when searching through newly installed packages or trying to quickly type variable names in an R script.
+
+For example, let's say we instead named our example data frame something much longer, and we had two data frames with similar names. If we start typing in `my_` and pause our typing, all of the objects that start with that name will appear as options in a list. To select which one to autofill, navigate down the list and click return/enter.
+
+```{r 2-1-R-Programming-14 }
+my_df_with_really_long_name <- data.frame(values = my_values, color = c("Blue", "Red", "Yellow", "Purple"))
+
+my_df_with_really_long_name_2 <- data.frame(values = my_values, color = c("Green", "Teal", "Magenta", "Orange"))
+```
+
+```{r 2-1-R-Programming-15, echo=FALSE, fig.align = "center"}
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image6.png")
+```
+
+
+### Finding and Setting Your Working Directory
+Another step that is commonly done at the very beginning of your code is setting your working directory. This tells your computer where to look for files that you want to import and where to deposit output files produced during your scripted activities.
+
+To view your current working directory, run the following:
+
+```{r 2-1-R-Programming-16, eval=FALSE}
+getwd()
+```
+
+To set or change the location of your working directory, run the following:
+
+```{r 2-1-R-Programming-17, eval=FALSE, echo=TRUE}
+setwd("/file path to where your input files are")
+```
+
+Note that macOS file paths use `/` to separate folders, whereas PC file paths use `\`. However, backslashes must be escaped in R (written as `\\`), so it is often easiest to use `/` in file paths regardless of operating system.
+
+You can easily find the file path to your desired working directory by navigating to "Session", then "Set Working Directory", and "Choose Directory":
+
+```{r 2-1-R-Programming-18, echo=FALSE, out.width = "500px", fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image7.png")
+```
+
+In the popup box, navigate to the folder you want to set as your working directory and click "Open." Look in the R console, which will now contain a line of code with `setwd()` containing your file path. You can copy this line of code to the top of your script for future use. Alternatively, you can navigate to the folder you want in Finder or File Explorer and right click to see the file path.
+
+Within your working directory, you can make sub-folders to keep your analyses organized. Here is an example folder hierarchy:
+
+```{r 2-1-R-Programming-19, echo=FALSE, out.width = "300px", fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image8.png")
+```
+
+How you set up your folder hierarchy is highly dependent on your specific analysis and coding style. However, we recommend that you:
+
++ Name your script something concise, but descriptive (no acronyms)
++ Consider using dates when appropriate
++ Separate your analysis into logical sections so that script doesn’t get too long or hard to follow
++ Revisit and adapt your organization as the project evolves!
++ Archive old code so you can revisit it
+
+#### A Quick Note About Projects
+
+Creating projects allows you to store your progress (open script, global environment) for one project in an R Project File. This facilitates quick transitions between multiple projects. Find detailed information about how to set up projects [here](https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects).
+
+```{r 2-1-R-Programming-20, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image9.png")
+```
+
+### Importing Files
+
+After setting the working directory, you can import and export files using various functions based on the type of file being imported or exported. Often, it is easiest to import data into R that are in a comma separated value / comma delimited file (.csv) or tab / text delimited file (.txt).
+
+Other file types, such as SAS data files or large .csv files, may require different functions to be read in more efficiently; some of these file formats will be discussed in future modules. Files can also be imported from and exported to Excel using the [*openxlsx*](https://ycphs.github.io/openxlsx/) package.
+
+Below, we will demonstrate how to read in .csv and .txt files:
+
+```{r 2-1-R-Programming-21 }
+# Read in the .csv data that's located in our working directory
+csv.dataset <- read.csv("Chapter_2/2_1_R_Programming/Module2_1_InputData1.csv")
+
+# Read in the .txt data
+txt.dataset <- read.table("Chapter_2/2_1_R_Programming/Module2_1_InputData1.txt")
+```
+
+These datasets now appear as saved dataframes (`csv.dataset` and `txt.dataset`) in our working environment.
+
+### Viewing Data
+
+After data have been loaded into R, or created within R, you will likely want to view what these datasets look like.
+Datasets can be viewed in their entirety, or datasets can be subsetted to quickly look at part of the data.
+
+Here's some example script to view just the beginnings of a dataframe using the `head()` function:
+```{r 2-1-R-Programming-22 }
+head(csv.dataset)
+```
+
+Here, you can see that this automatically brings up a view of the first six rows of the dataframe, which is the default for `head()`.
+
+We can also use indexing to view a specific number of rows, such as the first five:
+```{r 2-1-R-Programming-23 }
+csv.dataset[1:5,]
+```
+
+This brings us to an important concept: indexing! Brackets are used in R to index. Within the brackets, the first argument represents the row numbers, and the second argument represents the column numbers. A colon between two numbers creates a sequence containing those numbers and every integer in between. The above line of code told R to select rows 1 through 5 and, by leaving the column argument blank, all of the columns.
+
+Expanding on this, to view the first 5 rows and 2 columns, we can run the following:
+```{r 2-1-R-Programming-24 }
+csv.dataset[1:5, 1:2]
+```
+
+For another example: what if we want to view only the first and third rows, and the first and fourth columns? We can use vectors within the index to do this:
+```{r 2-1-R-Programming-25 }
+csv.dataset[c(1, 3), c(1, 4)]
+```
+
+To view the entire dataset, use the `View()` function:
+
+```{r 2-1-R-Programming-26, eval=FALSE, echo=TRUE}
+View(csv.dataset)
+```
+
+Another way to view a dataset is to just click on the name of the data in the environment pane. The view window will pop up in the same way that it did with the `View()` function.
+
+### Determining Data Structures and Data Types
+
+As discussed above, there are a number of different data structures and types that can be used in R. Here, we will demonstrate functions that can be used to identify data structures and types within R objects. The `glimpse()` function, which is part of the *tidyverse* package, is helpful because it allows us to see an overview of our column names and the types of data contained within those columns.
+
+```{r 2-1-R-Programming-27, message = FALSE}
+# Load tidyverse package
+library(tidyverse)
+
+glimpse(csv.dataset)
+```
+Here, we see that our `Sample` column is a character column, while the rest are integers.
+
+The `class()` function is also helpful for understanding objects in our global environment:
+```{r 2-1-R-Programming-28 }
+# What class (data structure) is our object?
+class(csv.dataset)
+
+# What class (data type) is a specific column in our data?
+class(csv.dataset$Sample)
+```
+
+These functions are particularly helpful when introducing new functions or troubleshooting code because functions often require input data to be a specific structure or data type.
+
+### Exporting Data
+
+Now that we have these datasets saved as dataframes, we can use these as examples to export data files from the R environment back into our local directory.
+
+There are many ways to export data in R. Data can be written out into a .csv file, tab delimited .txt file, or RData file, for example. There are also many functions within packages that write out specific datasets generated by that package.
+
+To write out to a .csv file:
+```{r 2-1-R-Programming-29, eval=F}
+write.csv(csv.dataset, "Module2_1_SameCSVFileNowOut.csv")
+```
+
+To write out a .txt tab delimited file:
+```{r 2-1-R-Programming-30, eval=F}
+write.table(txt.dataset, "Module2_1_SameTXTFileNowOut.txt")
+```
+
+R also allows objects to be saved in RData files. These files can be read back into R, loading the saved objects into the current workspace. Entire workspaces can also be saved in RData files, such that when you load an RData file, your Global Environment will be just as you saved it. The code below demonstrates these tasks for future reference; the files referenced are examples and are not provided.
+
+```{r 2-1-R-Programming-31, eval = F}
+# Read in saved single R data object
+r.obj <- readRDS("data.rds")
+
+# Write single R object to file
+saveRDS(object, "single_object.rds")
+
+# Read in multiple saved R objects
+load("multiple_data.RData")
+
+# Save multiple R objects
+save(object1, object2, file = "multiple_objects.RData")
+
+# Save entire workspace
+save.image("entire_workspace.RData")
+
+# Load entire workspace
+load("entire_workspace.RData")
+```
+
+## Code Troubleshooting
+
+Learning how to code is an iterative, exploratory process. The secret to coding is to...
+```{r 2-1-R-Programming-32, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_2/2_1_R_Programming/Image10.png")
+```
+
+Make sure to include "R" and the package and/or function name in your search. Don't be afraid to try out different solutions until you find one that works for you, but also know when it is time to ask for help: for example, when the solutions available on forums aren't working for you, or when you know a colleague has already spent a significant amount of time developing code for this specific task.
+
+Note that when reading question/answer forums, make sure to look at how recent a post is; packages are updated frequently, and old answers may no longer work.
+
+Some common reasons that code doesn't work and potential solutions to these problems include:
+
++ Two packages are loaded that have functions with the same name, and the default function is not the one you are intending to run.
+ + Solutions: specify the package that you want the function to be called from each time you use it (e.g., `dplyr::select()`) or re-assign that function at the beginning of your script (e.g., `select <- dplyr::select`)
+
++ Your data object is the wrong input type (is a data frame and needs to be a matrix, is character but needs to be numeric)
+ + Solution: double check the documentation (run `?functionname`) for the input/variable type needed
+
++ You accidentally wrote over your data frame or variable with another section of code
+ + Solution: re-run your code from the beginning, checking that your input is in the correct format
+
++ There is a bug in the function/package you are trying to use (this is most common after packages are updated or after you update your version of R)
+ + Solution: post an issue on GitHub for that package (or StackOverflow if there is not a GitHub) using a reproducible example
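+
+As a brief illustration of the first pitfall above, here is a sketch assuming *dplyr* is loaded alongside another package that also exports a `select()` function (such as *MASS*), using the `csv.dataset` object from earlier (this chunk is not evaluated):
+```{r 2-1-R-Programming-conflict-example, eval=FALSE}
+# If MASS was loaded after dplyr, select() now refers to MASS::select(),
+# and calls intended for dplyr's version will error
+
+# Option 1: specify the package each time you call the function
+dplyr::select(csv.dataset, Sample)
+
+# Option 2: re-assign the function once at the top of your script
+select <- dplyr::select
+select(csv.dataset, Sample)
+```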
+
+There are a number of forums that can be extremely helpful when troubleshooting your code, such as:
+
++ [Stack Overflow](https://stackoverflow.com/): one of the most common forums to post questions related to coding and will often be the first few links in a Google search about any code troubleshooting. It is free to make an account, which allows you to post and answer questions.
++ [Cross Validated](https://stats.stackexchange.com/): a forum focused on statistics, including machine learning, data analysis, data mining, and data visualization, and is best for conceptual questions related to how statistical tests are carried out, when to use specific tests, and how to interpret tests (rather than code execution questions, which are more appropriate to post on Stack Overflow).
++ [BioConductor Forum](https://support.bioconductor.org/): provides a platform for specific coding and conceptual questions about BioConductor packages.
++ [GitHub](https://github.com): can also be used to create posts about specific issues/bugs for functions within that package.
+
+
+**Before you post a question, make sure you have thoroughly explored answers to existing similar questions and are able to explain in your question why those haven’t worked for you.** You will also need to provide a **reproducible example** of your error or question, meaning that you provide all information (input data, packages, code) needed such that others can reproduce your exact issues. While demonstrating a reproducible example is beyond the scope of this module, see the below links and packages for help getting started:
+
++ Detailed step-by-step guides for how to make reproducible examples:
+ + [How to Reprex](https://aosmith16.github.io/spring-r-topics/slides/week09_reprex.html#1) by Ariel Muldoon
+ + [What's a reproducible example (reprex) and how do I create one?](https://community.rstudio.com/t/faq-whats-a-reproducible-example-reprex-and-how-do-i-create-one/5219)
++ Helpful packages:
+ + [*reprex*](https://reprex.tidyverse.org/): part of tidyverse, useful for preparing reproducible code for posting to forums.
+ + [*datapasta*](https://aosmith16.github.io/spring-r-topics/slides/week09_reprex.html#43): useful for creating code you can copy and paste that creates a new data frame as a subset of your original data.
+
+## Concluding Remarks
+
+Together, this training module provides introductory level information on installing and loading packages in R, scripting basics, importing and exporting data, and code troubleshooting.
+
+### Additional Resources
+
++ Coursera: [R Programming](https://www.coursera.org/learn/r-programming) and [additional R courses](https://www.coursera.org/courses?query=r)
++ [Stack Overflow How to Learn R](https://stackoverflow.com/questions/1744861/how-to-learn-r-as-a-programming-language)
++ [R for Data Science](https://r4ds.had.co.nz/)
+
+
+
+:::tyk
+1. Install R and RStudio on your computer.
+2. Launch RStudio and explore installing packages (e.g., *tidyverse*) and understanding data types using the [built-in datasets](https://machinelearningmastery.com/built-in-datasets-in-r/) in R.
+3. Make a vector of the letters A-E.
+4. Make a data frame of the letters A-E in one column and their corresponding number in the alphabet order in the second column (e.g., A corresponds with 1).
+:::
diff --git a/Chapter_2/Module2_1_Input/Image1.png b/Chapter_2/2_1_R_Programming/Image1.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image1.png
rename to Chapter_2/2_1_R_Programming/Image1.png
diff --git a/Chapter_2/Module2_1_Input/Image10.png b/Chapter_2/2_1_R_Programming/Image10.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image10.png
rename to Chapter_2/2_1_R_Programming/Image10.png
diff --git a/Chapter_2/Module2_1_Input/Image2.png b/Chapter_2/2_1_R_Programming/Image2.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image2.png
rename to Chapter_2/2_1_R_Programming/Image2.png
diff --git a/Chapter_2/Module2_1_Input/Image3.png b/Chapter_2/2_1_R_Programming/Image3.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image3.png
rename to Chapter_2/2_1_R_Programming/Image3.png
diff --git a/Chapter_2/Module2_1_Input/Image4.png b/Chapter_2/2_1_R_Programming/Image4.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image4.png
rename to Chapter_2/2_1_R_Programming/Image4.png
diff --git a/Chapter_2/Module2_1_Input/Image5.png b/Chapter_2/2_1_R_Programming/Image5.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image5.png
rename to Chapter_2/2_1_R_Programming/Image5.png
diff --git a/Chapter_2/Module2_1_Input/Image6.png b/Chapter_2/2_1_R_Programming/Image6.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image6.png
rename to Chapter_2/2_1_R_Programming/Image6.png
diff --git a/Chapter_2/Module2_1_Input/Image7.png b/Chapter_2/2_1_R_Programming/Image7.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image7.png
rename to Chapter_2/2_1_R_Programming/Image7.png
diff --git a/Chapter_2/Module2_1_Input/Image8.png b/Chapter_2/2_1_R_Programming/Image8.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image8.png
rename to Chapter_2/2_1_R_Programming/Image8.png
diff --git a/Chapter_2/Module2_1_Input/Image9.png b/Chapter_2/2_1_R_Programming/Image9.png
similarity index 100%
rename from Chapter_2/Module2_1_Input/Image9.png
rename to Chapter_2/2_1_R_Programming/Image9.png
diff --git a/Chapter_2/Module2_1_Input/Module2_1_InputData1.csv b/Chapter_2/2_1_R_Programming/Module2_1_InputData1.csv
similarity index 100%
rename from Chapter_2/Module2_1_Input/Module2_1_InputData1.csv
rename to Chapter_2/2_1_R_Programming/Module2_1_InputData1.csv
diff --git a/Chapter_2/Module2_1_Input/Module2_1_InputData1.txt b/Chapter_2/2_1_R_Programming/Module2_1_InputData1.txt
similarity index 100%
rename from Chapter_2/Module2_1_Input/Module2_1_InputData1.txt
rename to Chapter_2/2_1_R_Programming/Module2_1_InputData1.txt
diff --git a/Chapter_2/2_2_Best_Practices/2_2_Best_Practices.Rmd b/Chapter_2/2_2_Best_Practices/2_2_Best_Practices.Rmd
new file mode 100644
index 0000000..e6ebda4
--- /dev/null
+++ b/Chapter_2/2_2_Best_Practices/2_2_Best_Practices.Rmd
@@ -0,0 +1,240 @@
+# 2.2 Coding "Best" Practices
+
+This training module was developed by Kyle Roell, Alexis Payton, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+In this training module, we will be going over coding "best" practices. We put "best" in quotes because these practices are what we currently consider best (or better), though coding styles, annotation styles, and so on differ from person to person and change over time. Here, we hope to give you a sense of what we do when coding, why we do it, and why we think it is important. We will also point out other guides to style, annotations, and best practices that we suggest implementing into your own coding.
+
+Some of the questions we hope to answer in this module are:
+
++ What type of scripting file should I use?
++ What should I name my script?
++ What should I put at the top of every script and why is it important?
++ How should I annotate my code?
++ Why are annotations important?
++ How do I implement these coding practices into my own code?
++ Where can I find other resources to help with coding best practices?
+
+In the following sections, we will be addressing these questions. Keep in mind that the advice and suggestions in this section are just that: advice and suggestions. So please take them into consideration and integrate them into your own coding style as appropriate.
+
+## Scripting File Types
+
+Two of the most common scripting file types applicable to the R language are .R (normal R files) and .Rmd (R Markdown). Normal R files appear as plain text and can be used for running any normal R code. R Markdown files are used for more intensive documentation of code and allow for a combination of code, non-code text explaining the code, and viewing of code output, tables, and figures that are rendered together into an output file (typically .html, although other formats such as .pdf are also offered). For example, TAME is coded using R Markdown, which allows us to include blocks of non-code text, hyperlinks, annotated code, schematics, and output figures all in one place. We highly encourage the use of R Markdown as the default scripting file type for R-based projects because it produces a polished final document that is easy for others to follow, whereas .R files are more appropriate for short, one-off analyses and writing in-depth functions and packages. However, code executed in normal .R files and R Markdown will produce the same results, and ultimately, which file type to use is personal preference.
+
+See below for screenshots that demonstrate some of the stylistic differences between .R, .Rmd, and .Rmd knitted to HTML format:
+```{r 2-2-Best-Practices-1, out.width = "1000px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_2/2_2_Best_Practices/Image1.png")
+```
+
+If you are interested in learning more about the basic features of R Markdown and how to use them, see the following resources:
+
++ [RStudio introduction to R Markdown](https://rmarkdown.rstudio.com/lesson-1.html)
++ [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
++ [Bookdown R Markdown guide](https://bookdown.org/yihui/rmarkdown/html-document.html)
++ [Including external images in R Markdown with knitr](https://www.r-bloggers.com/2021/02/external-graphics-with-knitr/)
++ [Interactive plots with plotly](https://cengel.github.io/R-data-viz/interactive-graphs.html)
++ [Interactive data tables with DT](https://rstudio.github.io/DT/)
+
+### Naming the Script File
+
+The first thing we need to talk about, which is sometimes overlooked in discussions of coding practices, is script file naming conventions and high-level descriptive headers within a script. It is important to name your script something concise but descriptive. You want to be able to easily recognize what the script is for and does without a cumbersome, lengthy title. Some tips for naming conventions:
+
++ Be concise, but descriptive
++ Use dates when appropriate
++ Avoid special characters
++ Use full words if possible, avoiding non-standard acronyms
+
+Keep in mind that each script should have a clear purpose within a given project. It is often necessary to have multiple scripts within one project, each pertaining to a different part of the analysis. For example, it may be appropriate to have one script for data cleaning and pre-processing and another script for analyzing data. When scripting an analysis with multiple sub-analyses, some prefer to keep code for each sub-analysis separate (e.g., one file for an ANOVA and one file for a k-means analysis on the same data input), while others prefer longer code files with more subsections. Whichever method you choose, we recommend maintaining clear documentation that indicates the locations of input and output files for each sub-analysis (e.g., whether global environment objects or output files from a previous script are needed to run the current script).
+
+## Script Headers and Annotation
+
+### Script Header
+
+Once your script is created and named, it is generally recommended to include a header at the top of the script. The script header can be used for describing:
+
++ Title of Script - This can be a longer or more readable name than script file name.
++ Author(s) - Who wrote the script?
++ Date - When was the script developed?
++ Description - Provides a more detailed description of the purpose of the script and any notes or special considerations for this particular script.
+
+In R, it is common to include multiple `#`, the comment operator, or a `#` followed by another special character, to start and end a block of coding annotation or the script header. An example of this in an .R file is shown below:
+
+```{r 2-2-Best-Practices-2 }
+########################################################################
+########################################################################
+### Script Longer Title
+###
+### Description of what this script does!
+### Also can include special notes or anything else here.
+###
+### Created by: Kyle Roell and Julia Rager
+### Last updated: 01 May 2023
+########################################################################
+########################################################################
+```
+
+This block of comment operators is common in .R but not .Rmd files, because .Rmd files have their own specific type of header, known as the [YAML header](https://zsmith27.github.io/rmarkdown_crash-course/lesson-4-yaml-headers.html), which contains the title, author, date, and formatting outputs for the .Rmd file:
+
+```{r 2-2-Best-Practices-3, out.width = "300px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_2/2_2_Best_Practices/Image2.png")
+```
+
+We will now review how annotations within the script itself can make a huge difference in understanding the code.
+
+### Annotations
+
+Before we review coding style considerations, it is important to address code annotating. So, what are annotations and why are they important?
+
+Annotations are notes embedded within your code as comments that will not be run. The beauty of annotating your code is that not only others, but also future you, will be able to read through and better understand what a particular piece of code does. We suggest annotating your code as you write it and incorporating plenty of description. While not every single line needs an annotation, or a very detailed one, it is helpful to provide comments and annotations as often as is feasible.
+
+#### General annotation style
+
+In general, annotations will be short sentences that describe what your code does or why you are executing that specific code. This can be helpful when you are defining a covariate a specific way, performing a specific analytical technique, or just generally explaining why you are doing what you're doing.
+
+```{r 2-2-Best-Practices-4, eval=F}
+
+# Performing logistic regression to assess association between xyz and abc
+# Regression confounders: V1, V2, V3 ...
+
+xyz.regression.output = glm(xyz ~ abc + V1 + V2 + V3, family=binomial(), data=example.data)
+
+```
+
+#### Mid-script headings
+
+Another common approach to annotation is to use mid-script headings to separate the script into various sections. For example, you might want to create distinct sections for "Loading Packages, Data, and Setup", "Covariate Definition", "Correlation Analysis", "Regression Analysis", etc. This can help you, and others reading your script, navigate it more easily. It also can be more visually pleasing to see the script split into multiple sections as opposed to one giant chunk of code interspersed with comments. Similar to above, the following example is specific to .R files. For .Rmd files, sub-headers can be created by increasing the number of `#` before the header text.
+
+```{r 2-2-Best-Practices-5, eval=F}
+
+###########################################################################
+###########################################################################
+###
+### Regression Analyses
+###
+### You can even add some descriptions or notes here about this section!
+###
+###########################################################################
+
+
+# Performing logistic regression to assess association between xyz and abc
+# Regression confounders: V1, V2, V3 ...
+
+xyz.regression.output = glm(xyz ~ abc + V1 + V2 + V3, family=binomial(), data=example.data)
+
+```
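+
+For comparison, in an .Rmd file these larger sections are instead delineated with markdown headers outside of code chunks, rather than comment blocks. A minimal sketch (the header names here are our own illustration):
+
+```
+## Regression Analyses
+
+### Logistic regression of xyz on abc
+```
+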
+General tips for annotations:
+
++ Make comments that are useful and meaningful
++ You don't need to comment every single line
++ In general, you probably won't over-comment your script, so more is generally better
++ That being said, don't write super long paragraphs every few lines
++ Split up your script into various sections using mid-script headings when appropriate
+
+
+#### Quick, short comments and annotations
+
+While it is important to provide descriptive annotations, not every annotation needs to be a sentence or longer. As stated previously, it is not necessary to comment every single line. Here is an example of very brief commenting:
+```{r 2-2-Best-Practices-6, eval=F }
+
+# Loading necessary packages
+
+library(ggplot2) # Plotting package
+
+```
+
+In the example above, we can see that these short comments clearly convey what the script does: load the necessary package and indicate what the package is needed for. Short, one-line annotations can also be placed after lines of code to clarify that specific line, or within the larger mid-script headings to split up these larger sections of code.
+
+
+## Coding Style
+
+Coding style is often a contentious topic! There are MANY styles of coding, and no two coders have the same exact style, even if they are following the same reference. Here, we will provide some guides to coding style and go over some of the basic, general tips for making your code readable and efficient. Here is an example showing how you can use spacing to align variable assignment:
+
+```{r 2-2-Best-Practices-7, eval=F}
+
+# Example of using spacing for alignment of variable assignment
+
+Longer_variable_name_x = 1
+Short_name_y           = 2
+```
+
+Note that most style guides suggest using `<-` as the assignment operator. However, for most situations, `<-` and `=` will do the same thing.
+
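+As a quick demonstration of that equivalence, both operators below assign the same value, and the resulting objects are identical:
+
+```{r 2-2-Best-Practices-assignment-check}
+x <- 5   # assignment with <-
+y = 5    # assignment with =
+
+identical(x, y)
+```
+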
+For spacing around certain symbols and operators:
+
++ Include a space after `if`, before parenthesis
++ Include a space on either side of symbols such as `<`
++ The first (opening) curly brace should not be on its own line, but the second (closing) should
+
+```{r 2-2-Best-Practices-8, eval = F}
+# Example of poor style: no space after if, no spaces around <,
+# and the opening curly brace is on its own line
+
+if(Longer_variable_name_x<Short_name_y)
+{
+  print("x is less than y")
+}
+```
+
+
+:::tyk
+Using the input file provided ("Module2_2_TYKInput.R"):
+
+1. Convert the script and annotations into R Markdown format.
+2. Improve the organization, comments, and scripting to follow the coding best practices described in this module. List the changes you made at the bottom of the new R Markdown file.
+
+*Notes on the starting code:*
+
+1. This starting code uses dummy data to demonstrate how to make a graph in R that includes bars representing the mean, with standard deviation error bars overlaid.
+2. You don't need to understand every step in the code to be able to improve the existing coding style! You can run each step of the code if needed to understand better what it does.
+:::
diff --git a/Chapter_2/Module2_2_Input/Image1.png b/Chapter_2/2_2_Best_Practices/Image1.png
similarity index 100%
rename from Chapter_2/Module2_2_Input/Image1.png
rename to Chapter_2/2_2_Best_Practices/Image1.png
diff --git a/Chapter_2/Module2_2_Input/Image2.png b/Chapter_2/2_2_Best_Practices/Image2.png
similarity index 100%
rename from Chapter_2/Module2_2_Input/Image2.png
rename to Chapter_2/2_2_Best_Practices/Image2.png
diff --git a/Chapter_2/2_3_Data_Manipulation/2_3_Data_Manipulation.Rmd b/Chapter_2/2_3_Data_Manipulation/2_3_Data_Manipulation.Rmd
new file mode 100644
index 0000000..cb7b6c6
--- /dev/null
+++ b/Chapter_2/2_3_Data_Manipulation/2_3_Data_Manipulation.Rmd
@@ -0,0 +1,483 @@
+
+# 2.3 Data Manipulation and Reshaping
+
+This training module was developed by Kyle Roell, Alexis Payton, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Data within the fields of exposure science, toxicology, and public health are very rarely prepared and ready for downstream statistical analyses and visualization code. The beginning of almost any scripted analysis includes important formatting steps that make the data easier to read and work with. This can be done in several ways, including:
+
++ [Base R operations and functions](https://www.r-project.org/about.html), or
++ A collection of packages (and philosophy) known as [The Tidyverse](https://www.tidyverse.org).
+
+In this training tutorial we will review some of the most common ways you can organize and manipulate data, including:
+
++ Merging data
++ Filtering and subsetting data
++ Pivoting data wider and longer (also known as casting and melting)
+
+These approaches will first be demonstrated using the functions available in base R. Then, the exact same approaches will be demonstrated using the functions and syntax that are part of the Tidyverse package.
+
+We will demonstrate these data manipulation and organization methods using an environmentally relevant example dataset from a human cohort. This dataset was generated by randomly sampling from data distributions in our previously published cohorts, resulting in a unique dataset for these training purposes. The dataset contains metal exposure metrics measured in sources of drinking water and in human urine samples, alongside associated demographic data.
+
+### Training Module's Environmental Health Question
+
+This training module was specifically developed to answer the following environmental health question using data manipulation and reshaping approaches:
+
+What is the average urinary chromium concentration across different maternal education levels?
+
+We'll use base R and the *Tidyverse* to answer this question, but let's start with base R.
+
+### Workspace Preparation and Data Import
+
+#### Set your working directory
+
+In preparation, first let's set our working directory to the folder path that contains our input files:
+```{r 2-3-Data-Manipulation-1, eval = FALSE}
+setwd("/file path to where your input files are")
+```
+
+Note that macOS file paths use `/` as the folder separator, while Windows file paths use `\`. Within R, however, forward slashes (`/`) work on both operating systems, and backslashes in paths must be escaped as `\\`.
+
+
+#### Importing example datasets
+
+Next, let's read in our example data sets:
+```{r 2-3-Data-Manipulation-2 }
+demographic_data <- read.csv("Chapter_2/2_3_Data_Manipulation/Module2_3_InputData1.csv")
+chemical_data <- read.csv("Chapter_2/2_3_Data_Manipulation/Module2_3_InputData2.csv")
+```
+
+#### Viewing example datasets
+Let's see what these datasets look like:
+```{r 2-3-Data-Manipulation-3 }
+dim(demographic_data)
+dim(chemical_data)
+```
+
+
+The demographic data set includes 200 rows x 7 columns, while the chemical measurement data set includes 200 rows x 7 columns.
+
+We can preview the demographic data frame by using the `head()` function, which displays all the columns and the first 6 rows of a data frame:
+```{r 2-3-Data-Manipulation-4 }
+head(demographic_data)
+```
+
+
+These demographic data are organized according to subject ID (first column), followed by columns containing the following subject information:
+
++ `ID`: subject number
++ `BMI`: body mass index
++ `MAge`: maternal age in years
++ `MEdu`: maternal education level; 1 = "less than high school", 2 = "high school or some college", 3 = "college or greater"
++ `BW`: body weight in grams
++ `GA`: gestational age in weeks
+
+We can also preview the chemical dataframe:
+```{r 2-3-Data-Manipulation-5 }
+head(chemical_data)
+```
+
+These chemical data are organized according to subject ID (first column), followed by measures of:
+
++ `DWAs`: drinking water arsenic levels in µg/L
++ `DWCd`: drinking water cadmium levels in µg/L
++ `DWCr`: drinking water chromium levels in µg/L
++ `UAs`: urinary arsenic levels in µg/L
++ `UCd`: urinary cadmium levels in µg/L
++ `UCr`: urinary chromium levels in µg/L
+
+## Data Manipulation Using Base R
+
+### Merging Data Using Base R Syntax
+
+Merging datasets refers to joining together two or more datasets using a common identifier (generally some sort of ID) to connect the rows. This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc., depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.
+
+Let's start by merging the example demographic data with the chemical measurement data using the base R function `merge()`. To learn more about this function, you can type `?merge`, which brings up helpful information in the R console. To merge these datasets with the merge function, use the following code. The `by =` argument specifies the column used to match the rows of data.
+```{r 2-3-Data-Manipulation-6 }
+full.data <- merge(demographic_data, chemical_data, by = "ID")
+dim(full.data)
+```
+
+This merged dataframe contains 200 rows x 12 columns. Viewing this merged dataframe, we can see that the `merge()` function retained the first column in each original dataframe (`ID`), though did not replicate it since it was used as the identifier for merging. All other columns include their original data, just merged together by the IDs in the first column.
+```{r 2-3-Data-Manipulation-7 }
+head(full.data)
+```
+
+These datasets were quite easy to merge, since they had the exact same column identifier and number of rows. In instances where the identifier columns are named differently across the datasets you would like to merge, you can specify the column name to use from each dataframe. Here, both are still named "ID", but the `by.x` and `by.y` arguments allow you to specify different column names for the first and second datasets, respectively.
+```{r 2-3-Data-Manipulation-8 }
+full.data <- merge(demographic_data, chemical_data, by.x = "ID", by.y = "ID")
+
+# Viewing data
+head(full.data)
+```
+
+
+Note that after merging datasets, it is always helpful to check that the merging was done properly before proceeding with your data analysis. Helpful checks could include viewing the merged dataset, checking the numbers of rows and columns to make sure chunks of data are not missing, and searching for values (or strings) that exist in one dataset but not the other, among other mechanisms of QA/QC.
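+
+While the specific checks depend on your data, a brief sketch of this kind of QA/QC (using two small, hypothetical dataframes rather than our example data) could look like:
+
+```{r 2-3-Data-Manipulation-merge-checks}
+# Two small hypothetical dataframes for illustration
+df_a <- data.frame(ID = 1:4, BMI = c(22, 25, 28, 31))
+df_b <- data.frame(ID = c(1, 2, 3, 5), UCr = c(40, 39, 41, 38))
+
+merged <- merge(df_a, df_b, by = "ID")
+
+# Check dimensions: by default, merge() keeps only IDs found in both
+# datasets, so row counts can silently shrink
+dim(merged)
+
+# Identify IDs present in one dataset but missing from the other
+setdiff(df_a$ID, df_b$ID)
+setdiff(df_b$ID, df_a$ID)
+```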
+
+
+### Filtering and Subsetting Data Using Base R Syntax
+
+Filtering and subsetting data are useful tools when you need to focus on specific parts of your dataset for downstream analyses. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for removing unneeded variables or samples from dataframes as you are working through your script.
+
+Note that in the examples that follow, we will create new dataframes that are distinguished from our original dataframe by adding sequential numbers to the end of the dataframe name (e.g., subset.data1, subset.data2, subset.data3). This style of dataframe naming is useful for the simple examples we are demonstrating, but in a full scripted analysis, we encourage the use of more descriptive dataframe names. For example, if you are subsetting your data to include only the first 100 rows, you could name that dataframe "data.first100."
+
+For this example, let's first define a vector of columns that we want to keep in our analysis, then subset the data by keeping only the columns specified in our vector:
+```{r 2-3-Data-Manipulation-9 }
+# Defining a vector of columns to keep in the analysis
+subset.columns <- c("BMI", "MAge", "MEdu")
+
+# Subsetting the data by selecting the columns represented in the defined 'subset.columns' vector
+subset.data1 <- full.data[,subset.columns]
+
+# Viewing the top of this subsetted dataframe
+head(subset.data1)
+```
+
+We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:
+```{r 2-3-Data-Manipulation-10 }
+subset.data2 <- full.data[1:100,]
+
+# Viewing the dimensions of this new dataframe
+dim(subset.data2)
+```
+
+To remove the first 100 rows, we use the same code as above, but include a `-` sign before our vector to indicate that these rows should be removed:
+```{r 2-3-Data-Manipulation-11 }
+subset.data3 <- full.data[-c(1:100),]
+
+# Viewing the dimensions of this new dataframe
+dim(subset.data3)
+```
+
+**Conditional statements** can also be used to filter and subset data. A **conditional statement** executes one block of code if the statement is true and a different block of code if the statement is false.
+
+A conditional statement requires a Boolean, or true/false, expression that evaluates to either `TRUE` or `FALSE`. A couple of the more commonly used constructs for creating conditional statements include...
+
++ `if(){}` or an **if statement** means "execute R code when the condition is met".
++ `if(){} else{}` or an **if/else statement** means "execute R code when condition 1 is met, if not execute R code for condition 2".
++ `ifelse()` is a function that executes the same logic as an if/else statement. The first argument specifies a condition to be met. If that condition is met, R code in the second argument is executed, and if that condition is not met, R code in the third argument is executed.
+
+There are six comparison operators that are used to create these Boolean values:
+
++ `==` means "equals".
++ `!=` means "not equal".
++ `<` means "less than".
++ `>` means "greater than".
++ `<=` means "less than or equal to".
++ `>=` means "greater than or equal to".
+
+There are also three logical operators that are used to create these Boolean values:
+
++ `&` means "and".
++ `|` means "or".
++ `!` means "not".
+
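+Before applying these operators to our dataset, here is a brief sketch (using a hypothetical BMI value rather than our example data) demonstrating an if/else statement and the `ifelse()` function:
+
+```{r 2-3-Data-Manipulation-conditionals}
+# A hypothetical BMI value for illustration
+bmi <- 27
+
+# if/else statement: the first block runs when the condition is TRUE
+if (bmi > 25 & bmi < 30) {
+  bmi_category <- "overweight"
+} else {
+  bmi_category <- "not overweight"
+}
+bmi_category
+
+# ifelse() applies the same logic and also works across whole vectors
+bmi_vector <- c(21, 27, 32)
+ifelse(bmi_vector > 25, "above 25", "25 or below")
+```
+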
+We can filter data based on conditions using the `subset()` function. For example, the following code filters for subjects whose BMI is greater than 25 and who have a college education:
+```{r 2-3-Data-Manipulation-12 }
+subset.data4 <- subset(full.data, BMI > 25 & MEdu == 3)
+```
+
+Additionally, we can subset and select specific columns we would like to keep, using the `select` argument within the `subset()` function:
+```{r 2-3-Data-Manipulation-13 }
+# Filtering for subjects whose BMI is less than 22 or greater than 27
+# Also selecting the BMI, maternal age, and maternal education columns
+subset.data5 <- subset(full.data, BMI < 22 | BMI > 27, select = subset.columns)
+```
+
+For more information on the `subset()` function, see its associated [documentation](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset).
+
+
+### Melting and Casting Data using Base R Syntax
+
+Melting and casting refers to the conversion of data to "long" or "wide" form as discussed previously in **TAME 2.0 Module 1.4 Data Wrangling in Excel**. You will often see data within the environmental health field in wide format, though long format is necessary for some procedures, such as plotting with [*ggplot2*](https://ggplot2.tidyverse.org) and performing certain analyses.
+
+Here, we'll illustrate some example script to melt and cast data using the [*reshape2*](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4) package.
+Let's first install and load the `reshape2` package:
+```{r 2-3-Data-Manipulation-14, message = FALSE}
+if (!requireNamespace("reshape2", quietly = TRUE))
+  install.packages("reshape2")
+```
+
+```{r 2-3-Data-Manipulation-15 }
+library(reshape2)
+```
+
+Using the fully merged dataframe, let's remind ourselves what these data look like in the current dataframe format:
+```{r 2-3-Data-Manipulation-16 }
+head(full.data)
+```
+
+
+These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in **wide (also known as casted)** format.
+
+Let's convert this dataframe to **long (also known as melted)** format. Here, we will specify that we want a row for each unique sample ID + variable measure pair by using `id = "ID"`:
+```{r 2-3-Data-Manipulation-17 }
+full.melted <- melt(full.data, id = "ID")
+
+# Viewing this new dataframe
+head(full.melted)
+```
+
+You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled `variable`. Then, the third column contains the value of this variable.
+
+Let's see an example view of the middle of this new dataframe:
+```{r 2-3-Data-Manipulation-18 }
+full.melted[1100:1110,1:3]
+```
+
+Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:
+```{r 2-3-Data-Manipulation-19 }
+dim(full.melted)
+```
+
+Let's now re-cast this dataframe back into wide format using the `dcast()` function. Here, we are telling the `dcast()` function to give us a sample (ID) for every variable in the column labeled `variable`. The column names from the variable column and corresponding values from the value column are then used to fill in the dataset:
+```{r 2-3-Data-Manipulation-20 }
+full.cast <- dcast(full.melted, ID ~ variable)
+head(full.cast)
+```
+
+Here, we can see that this dataframe is back in its original casted (or wide) format. Now that we're familiar with some base R functions to reshape our data, let's answer our original question: What is the average urinary chromium concentration for each maternal education level?
+
+Although this step is not strictly necessary for calculating the averages, we can first subset our dataframe to only include the two columns we are interested in (`MEdu` and `UCr`):
+```{r 2-3-Data-Manipulation-21 }
+subset.data6 <- full.data[,c("MEdu", "UCr")]
+
+head(subset.data6)
+```
+
+Next, we will make a new data frame for each maternal education level:
+```{r 2-3-Data-Manipulation-22 }
+# Creating new data frames based on maternal education category
+data.matedu.1 <- subset(subset.data6, MEdu == 1)
+data.matedu.2 <- subset(subset.data6, MEdu == 2)
+data.matedu.3 <- subset(subset.data6, MEdu == 3)
+
+# Previewing the first data frame to make sure our function is working as specified
+head(data.matedu.1)
+```
+
+Last, we can calculate the average urinary chromium concentration using each of our data frames:
+```{r 2-3-Data-Manipulation-23 }
+mean(data.matedu.1$UCr)
+mean(data.matedu.2$UCr)
+mean(data.matedu.3$UCr)
+```
+
+:::question
+ With this, we can answer our **Environmental Health Question**:
+
+What is the average urinary chromium concentration across different maternal education levels?
+:::
+
+:::answer
+**Answer:** The average urinary chromium concentrations are 39.9 µg/L for participants with less than a high school education, 40.6 µg/L for participants with a high school or some college education, and 40.4 µg/L for participants with a college education or greater.
+:::
+
+## Introduction to Tidyverse
+
+[Tidyverse](https://www.tidyverse.org) is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own specific type of syntax and formatting that differ slightly from base R functions. There are eight core tidyverse packages:
+
++ For data visualization and exploration:
+ + *ggplot2*
++ For data wrangling and transformation:
+ + *dplyr*
+ + *tidyr*
+ + *stringr*
+ + *forcats*
++ For data import and management:
+ + *tibble*
+ + *readr*
++ For functional programming:
+  + *purrr*
+
+Here, we will carry out all of the same data organization exercises demonstrated above using packages that are part of The Tidyverse, specifically using functions that are part of the *dplyr* and *tidyr* packages.
+
+### Downloading and Loading the Tidyverse Package
+
+If you don't have *tidyverse* already installed, you will need to install it using:
+```{r 2-3-Data-Manipulation-24, message = FALSE}
+if(!require(tidyverse))
+ install.packages("tidyverse")
+```
+
+And then load the *tidyverse* package using:
+```{r 2-3-Data-Manipulation-25 }
+library(tidyverse)
+```
+
+Note that by loading the *tidyverse* package, you are also loading all of the packages included within The Tidyverse and do not need to separately load these packages.
+
+### Merging Data Using Tidyverse Syntax
+
+To merge the same example dataframes using *tidyverse*, you can run the following script:
+```{r 2-3-Data-Manipulation-26 }
+full.data.tidy <- inner_join(demographic_data, chemical_data, by = "ID")
+
+head(full.data.tidy)
+```
+
+Note that you can still merge dataframes whose ID columns have different names by using a named vector in the `by` argument, e.g. `by = c("ID.x" = "ID.y")`, where `ID.x` is the column name in the first dataframe and `ID.y` is the column name in the second. *tidyverse* also has other `join` functions, shown in the graphic below ([source](https://tavareshugo.github.io/r-intro-tidyverse-gapminder/08-joins/index.html)):
+```{r 2-3-Data-Manipulation-27, echo = FALSE, out.width = "400px", fig.align = "center"}
+knitr::include_graphics("Chapter_2/2_3_Data_Manipulation/Image1.svg")
+```
+
++ **inner_join** keeps only rows that have matching ID variables in both datasets
++ **full_join** keeps the rows in both datasets
++ **left_join** matches rows based on the ID variables in the first dataset (and omits any rows from the second dataset that do not have matching ID variables in the first dataset)
++ **right_join** matches rows based on ID variables in the second dataset (and omits any rows from the first dataset that do not have matching ID variables in the second dataset)
++ **anti_join(x,y)** keeps the rows that are unique to the first dataset
++ **anti_join(y,x)** keeps the rows that are unique to the second dataset
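+
+To see how these behave, here is a quick illustration using two small dataframes made up for this example (`df_x` and `df_y` are hypothetical and are not part of our dataset):
+```{r 2-3-Data-Manipulation-27-joins }
+library(dplyr)
+
+# Two small example dataframes with partially overlapping IDs
+df_x <- data.frame(ID = c(1, 2, 3), MAge = c(34, 41, 29))
+df_y <- data.frame(ID = c(2, 3, 4), DWAs = c(6.4, 7.1, 5.9))
+
+# Keeps only IDs 2 and 3, which appear in both dataframes
+inner_join(df_x, df_y, by = "ID")
+
+# Keeps all four IDs, filling in NA where a dataframe has no match
+full_join(df_x, df_y, by = "ID")
+
+# Keeps the row unique to the first dataframe (ID 1)
+anti_join(df_x, df_y, by = "ID")
+```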
+
+### The Pipe Operator
+
+One of the most important elements of Tidyverse syntax is use of the pipe operator (`%>%`). The pipe operator can be used to chain multiple functions together. It takes the object (typically a dataframe) to the left of the pipe operator and passes it to the function to the right of the pipe operator. Multiple pipes can be used in a chain to execute multiple data cleaning steps without the need for intermediate dataframes. The pipe operator can be used to pass data to functions within all of the Tidyverse packages, not just the functions demonstrated here.
+
+Below, we can see the same code executed above, but this time with the pipe operator. The `demographic_data` dataframe is passed to `inner_join()` as the first argument to that function, with the following arguments remaining the same.
+```{r 2-3-Data-Manipulation-28 }
+full.data.tidy2 <- demographic_data %>%
+ inner_join(chemical_data, by = "ID")
+
+head(full.data.tidy2)
+```
+
+Because the pipe operator is often used in a chain, it is best practice to start a new line after each pipe operator, with the new lines of code indented. This makes code with multiple piped steps easier to follow. However, if just one function is being executed, the pipe operator can be used on the same line as the input and function or omitted altogether (as shown in the previous two code chunks). Here is an example of placing the function to the right of the pipe operator on a new line, with placeholder functions shown as additional steps:
+```{r 2-3-Data-Manipulation-29, eval = FALSE}
+full.data.tidy3 <- demographic_data %>%
+ inner_join(chemical_data, by = "ID") %>%
+ additional_function_1() %>%
+ additional_function_2()
+```
+
+### Filtering and Subsetting Data Using Tidyverse Syntax
+
+#### Column-wise functions
+
+The `select()` function is used to subset columns in Tidyverse. Here, we can use our previously defined vector `subset.columns` in the `select()` function to keep only the columns in our `subset.columns` vector. The `all_of()` function tells `select()` to keep all of the columns that match elements of the `subset.columns` vector.
+```{r 2-3-Data-Manipulation-30 }
+subset.tidy1 <- full.data.tidy %>%
+ select(all_of(subset.columns))
+
+head(subset.tidy1)
+```
+
+There are many different ways that `select()` can be used. See below for some examples using dummy variable names:
+```{r 2-3-Data-Manipulation-31, eval = FALSE}
+# Select specific ranges in the dataframe
+data <- data %>%
+ select(start_column_1:end_column_1)
+
+data <- data %>%
+ select(c(start_column_1:end_column_1, start_column_2:end_column_2))
+
+# Select columns that match the elements in a character vector and an additional range of columns
+data <- data %>%
+ select(c(all_of(character_vector), start_column_1:end_column_1))
+```
+
+To select columns that have names that contain specific strings, you can use functions such as `starts_with()`, `ends_with()`, and `contains()`. By default, these functions ignore the case of the strings; set `ignore.case = FALSE` if you need case-sensitive matching. These functions can be combined with specific column names and other selection ranges.
+```{r 2-3-Data-Manipulation-32, eval = FALSE}
+data <- data %>%
+ select(starts_with("starting_string"))
+
+data <- data %>%
+ select(other_column_to_keep, starts_with("starting_string"))
+```
+
+To remove columns using tidyverse, you can use similar code, but include a `-` sign before the argument defining the columns.
+```{r 2-3-Data-Manipulation-33 }
+# Removing columns
+subset.tidy2 <- full.data.tidy %>%
+ select(-all_of(subset.columns))
+
+# Viewing this new dataframe
+head(subset.tidy2)
+```
+
+#### Row-wise functions
+
+The `slice()` function can be used to keep or remove a certain number of rows based on their position within the dataframe. For example, we can retain only the first 100 rows using the following code:
+
+```{r 2-3-Data-Manipulation-34 }
+subset.tidy3 <- full.data.tidy %>%
+ slice(1:100)
+
+dim(subset.tidy3)
+```
+
+Or, we can remove the first 100 rows:
+```{r 2-3-Data-Manipulation-35 }
+subset.tidy4 <- full.data.tidy %>%
+ slice(-c(1:100))
+
+dim(subset.tidy4)
+```
+
+The related functions `slice_min()` and `slice_max()` can be used to select rows with the smallest or largest values of a variable.
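+
+For example, using a small hypothetical dataframe (made up here for illustration), we can pull the rows with the largest or smallest BMI values:
+```{r 2-3-Data-Manipulation-35-slicemax }
+library(dplyr)
+
+# Hypothetical BMI values for five subjects
+bmi_df <- data.frame(ID = 1:5, BMI = c(22.1, 30.4, 27.8, 19.5, 33.2))
+
+# Rows with the 2 largest BMI values
+bmi_df %>%
+  slice_max(BMI, n = 2)
+
+# Row with the smallest BMI value
+bmi_df %>%
+  slice_min(BMI, n = 1)
+```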
+
+The `filter()` function can be used to keep or remove specific rows based on conditional statements. For example, we can keep only rows where BMI is greater than 25 and age is greater than 31:
+```{r 2-3-Data-Manipulation-36 }
+subset.tidy5 <- full.data.tidy %>%
+ filter(BMI > 25 & MAge > 31)
+
+dim(subset.tidy5)
+```
+
+#### Combining column and row-wise functions
+
+Now, we can see how Tidyverse makes it easy to chain together multiple data manipulation steps. Here, we first filter rows based on values for BMI and age, then we select our columns of interest:
+```{r 2-3-Data-Manipulation-37 }
+subset.tidy6 <- full.data.tidy %>%
+ filter(BMI > 25 & MAge > 31) %>%
+ select(BMI, MAge, MEdu)
+
+head(subset.tidy6)
+```
+
+### Melting and Casting Data Using Tidyverse Syntax
+
+To melt and cast data in Tidyverse, you can use the pivot functions (i.e., `pivot_longer()` or `pivot_wider()`).
+
+The first argument in the `pivot_longer()` function specifies which columns should be pivoted. This can be specified with either positive or negative selection - i.e., naming columns to pivot with a vector or range or naming columns not to pivot with a `-` sign. Here, we are telling the function to pivot all of the columns except the ID column, which we need to keep to be able to trace back which values came from which subject. The `names_to =` argument allows you to set what you want to name the column that stores the variable names (the column names in wide format). The `values_to =` argument allows you to set what you want to name the column that stores the values. We almost always call these columns "var" and "value", respectively, but you can name them anything that makes sense for your dataset.
+```{r 2-3-Data-Manipulation-38 }
+full.pivotlong <- full.data.tidy %>%
+ pivot_longer(-ID, names_to = "var", values_to = "value")
+
+head(full.pivotlong, 15)
+```
+
+To pivot our data back to wide format, we can use `pivot_wider()`, which will pull the column names from the column specified in the `names_from =` argument and the corresponding values from the column specified in the `values_from = ` argument.
+```{r 2-3-Data-Manipulation-39 }
+full.pivotwide <- full.pivotlong %>%
+ pivot_wider(names_from = "var", values_from = "value")
+
+head(full.pivotwide)
+```
+
+Now that we're familiar with some *tidyverse* functions to reshape our data, let's answer our original question: What is the average urinary chromium concentration for each maternal education level?
+
+We can use the `group_by()` function to group our dataset by education class, then the summarize function to calculate the mean of our variable of interest within each class. Note how much shorter and more efficient this code is than the code we used to calculate the same values using base R!
+```{r 2-3-Data-Manipulation-40 }
+full.data %>%
+ group_by(MEdu) %>%
+ summarize(Avg_UCr = mean(UCr))
+```
+
+For more detailed and advanced examples of pivoting in Tidyverse, see the [Tidyverse Pivoting Vignette](https://cran.r-project.org/web/packages/tidyr/vignettes/pivot.html).
+
+## Concluding Remarks
+
+This training module provides an introductory level overview of data organization and manipulation basics in base R and Tidyverse, including merging, filtering, subsetting, melting, and casting, and demonstrates these methods with an environmentally relevant dataset. These methods are used regularly in scripted analyses and are important preparation steps for almost all downstream analyses and visualizations.
+
+
+
+
+:::tyk
+Which subjects, arranged from highest to lowest drinking water cadmium levels, delivered their babies at 35 weeks of gestation or later and had urinary cadmium levels of at least 1.5 µg/L?
+
+**Hint**: Try using the `arrange()` function from the *tidyverse* package.
+:::
diff --git a/Chapter_2/Module2_3_Input/Image1.svg b/Chapter_2/2_3_Data_Manipulation/Image1.svg
similarity index 100%
rename from Chapter_2/Module2_3_Input/Image1.svg
rename to Chapter_2/2_3_Data_Manipulation/Image1.svg
diff --git a/Chapter_2/Module2_3_Input/Module2_3_InputData1.csv b/Chapter_2/2_3_Data_Manipulation/Module2_3_InputData1.csv
similarity index 100%
rename from Chapter_2/Module2_3_Input/Module2_3_InputData1.csv
rename to Chapter_2/2_3_Data_Manipulation/Module2_3_InputData1.csv
diff --git a/Chapter_2/Module2_3_Input/Module2_3_InputData2.csv b/Chapter_2/2_3_Data_Manipulation/Module2_3_InputData2.csv
similarity index 100%
rename from Chapter_2/Module2_3_Input/Module2_3_InputData2.csv
rename to Chapter_2/2_3_Data_Manipulation/Module2_3_InputData2.csv
diff --git a/Chapter_2/2_4_Code_Efficiency/2_4_Code_Efficiency.Rmd b/Chapter_2/2_4_Code_Efficiency/2_4_Code_Efficiency.Rmd
new file mode 100644
index 0000000..b9d9efd
--- /dev/null
+++ b/Chapter_2/2_4_Code_Efficiency/2_4_Code_Efficiency.Rmd
@@ -0,0 +1,559 @@
+
+# 2.4 Improving Coding Efficiencies
+
+This training module was developed by Elise Hickman, Alexis Payton, Kyle Roell, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+In this module, we'll explore how to improve coding efficiency. Coding efficiency involves performing a task in as few lines as possible and can...
+
++ Shorten code by eliminating redundancies
++ Reduce the number of typos
++ Help other coders understand script better
+
+Specific approaches that we will discuss in this module include loops, functions, and list operations, which can all be used to make code more succinct. A **loop** is employed when we want to perform a repetitive task, while a **function** contains a block of code organized together to perform one specific task. **List operations**, in which the same function is applied to a list of dataframes, can also be used to code more efficiently.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
+
+2. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI $\geq$ 18.5) subjects?
+
+3. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between non-obese (BMI < 29.9) and obese (BMI $\geq$ 29.9) subjects?
+
+We will demonstrate how this analysis can be approached using for loops, functions, or list operations. We will introduce the syntax and structure of each approach first, followed by application of the approach to our data. First, let's prepare the workspace and familiarize ourselves with the dataset we are going to use.
+
+
+### Data Import and Workspace Preparation
+
+#### Installing required packages
+
+If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you. We will be using the *tidyverse* package for data manipulation steps and the [*rstatix*](https://github.com/kassambara/rstatix) package for statistical tests, as it provides pipe friendly adaptations of the base R statistical tests and returns results in a dataframe rather than a list format, making results easier to access. This brings up an important aspect of coding efficiency - sometimes, there is already a package that has been designed with functions to help you execute your desired analysis in an efficient way, so you don't need to write custom functions yourself! So, don't forget to explore packages relevant to your analysis before spending a lot of time developing custom solutions (although, sometimes this is necessary).
+
+```{r 2-4-Code-Efficiency-1, message = FALSE}
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse")
+if (!requireNamespace("rstatix"))
+ install.packages("rstatix")
+```
+
+#### Loading required packages
+
+```{r 2-4-Code-Efficiency-2, message = FALSE}
+library(tidyverse)
+library(rstatix)
+```
+
+#### Setting your working directory
+
+```{r 2-4-Code-Efficiency-3, eval = FALSE}
+setwd("/file path to where your input files are")
+```
+
+#### Importing example dataset
+
+The first example dataset contains subject demographic data, and the second dataset contains corresponding chemical data. Familiarize yourself with these data used previously in **TAME 2.0 Module 2.3 Data Manipulation and Reshaping**.
+
+```{r 2-4-Code-Efficiency-4 }
+# Load the demographic data
+demographic_data <- read.csv("Chapter_2/2_4_Code_Efficiency/Module2_4_InputData1.csv")
+
+# View the top of the demographic dataset
+head(demographic_data)
+
+# Load the chemical data
+chemical_data <- read.csv("Chapter_2/2_4_Code_Efficiency/Module2_4_InputData2.csv")
+
+# View the top of the chemical dataset
+head(chemical_data)
+```
+
+#### Preparing the example dataset
+
+For ease of analysis, we will merge these two datasets before proceeding.
+```{r 2-4-Code-Efficiency-5 }
+# Merging data
+full_data <- inner_join(demographic_data, chemical_data, by = "ID")
+
+# Previewing new data
+head(full_data)
+```
+
+Continuous demographic variables, like BMI, are often dichotomized (or converted to a categorical variable with two categories representing higher vs. lower values) to increase statistical power in analyses. This is particularly important for clinical data that tend to have smaller sample sizes. In our initial dataframe, BMI is a continuous or numeric variable; however, our questions require us to dichotomize BMI. We can use the following code, which relies on if/else logic (see **TAME 2.0 Module 2.3 Data Manipulation and Reshaping** for more information) to generate a new column representing our dichotomized BMI variable for our first environmental health question.
+```{r 2-4-Code-Efficiency-6 }
+# Adding dichotomized BMI column
+full_data <- full_data %>%
+ mutate(Dichotomized_BMI = ifelse(BMI < 25, "Normal", "Overweight"))
+
+# Previewing new data
+head(full_data)
+```
+
+We can see that we now have created a new column entitled `Dichotomized_BMI` that we can use to perform a statistical test to assess if there are differences between drinking water metals between normal and overweight subjects.
+
+
+
+## Loops
+
+We will start with loops. There are three main types of loops in R: `for`, `while`, and `repeat`. We will focus on `for` loops in this module, but for more in-depth information on loops, including the additional types of loops, see [here](https://intro2r.com/loops.html). Before applying loops to our data, let's discuss how `for` loops work.
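+
+For reference, a `while` loop repeats its body only as long as a condition remains `TRUE`, rather than for a fixed number of iterations. A quick sketch:
+```{r 2-4-Code-Efficiency-while-sketch }
+# Print 1 through 4, incrementing a counter manually
+i <- 1
+while (i <= 4){
+  print(i)
+  i <- i + 1
+}
+```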
+
+The basic structure of a `for` loop is shown here:
+```{r 2-4-Code-Efficiency-7 }
+# Basic structure of a for loop
+for (i in 1:4){
+ print(i)
+}
+```
+
+`for` loops always start with `for` followed by a statement in parentheses. The argument in the parentheses tells R how to iterate (or repeat) through the code in the curly brackets. Here, we are telling R to iterate through the code in curly brackets 4 times. Each time we told R to print the value of our iterator, or `i`, which has a value of 1, 2, 3, and then 4. Loops can also iterate through columns in a dataset. For example, we can use a `for` loop to print the ages of each subject:
+```{r 2-4-Code-Efficiency-8 }
+# Creating a smaller dataframe for our loop example
+full_data_subset <- full_data[1:6, ]
+
+# Finding the total number of rows or subjects in the dataset
+number_of_rows <- length(full_data_subset$MAge)
+
+# Creating a for loop to iterate from 1 to the last row
+for (i in 1:number_of_rows){
+ # Printing each subject age
+ # Need to put `[i]` to index the correct value corresponding to the row we are evaluating
+ print(full_data_subset$MAge[i])
+}
+```
+
+Now that we know how a `for` loop works, how can we apply this approach to determine whether there are statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
+
+Because our data are normally distributed and there are two groups that we are comparing, we will use a t-test applied to each metal measured in drinking water. Testing for assumptions is outside the scope of this module, but see **TAME 2.0 Module 3.3 Normality Tests and Data Transformation** for more information on this topic.
+
+Running a t-test in R is very simple, which we can demonstrate by running a t-test on the drinking water arsenic data:
+```{r 2-4-Code-Efficiency-9 }
+# Running t-test and storing results in t_test_res
+t_test_res <- full_data %>%
+ t_test(DWAs ~ Dichotomized_BMI)
+
+# Viewing results
+t_test_res
+```
+
+We can see that our p-value is 0.468. Because this is greater than 0.05, we cannot reject the null hypothesis that normal weight and overweight subjects are exposed to the same drinking water arsenic concentrations. Although this was a very simple line of code to run, what if we have many columns we want to run the same t-test on? We can use a `for` loop to iterate through these columns.
+
+Let's break down the steps of our `for` loop before executing the code.
+
+1. First, we will define the variables (columns) we want to run our t-test on. This is different from our approach above, because in those code chunks, we were using numbers to indicate the number of iterations through the loop. Here, we are naming the specific variables instead, and R will iterate through each of these variables. Note that we could omit this step and instead use the numeric column index of our variables of interest `[7:9]`. However, naming the specific columns makes this approach more robust because if additional data are added to or removed from our dataframe, the numeric column index of our variables could change. Which approach you choose really depends on the purpose of your loop!
+
+2. Second, we will create an empty dataframe where we will store the results generated by our `for` loop.
+
+3. Third, we will actually run our for loop. This will tell R: for each variable in our `vars_of_interest` vector, run a t-test with that variable (and store the results in a temporary dataframe called "res"), then add those results to our final results dataframe. A row will be added to the results dataframe each time R iterates through a new variable, resulting in a dataframe that stores the results of all of our t-tests.
+
+```{r 2-4-Code-Efficiency-10 }
+# Defining variables (columns) we want to run a t-test on
+vars_of_interest <- c("DWAs", "DWCd", "DWCr")
+
+# Creating an empty dataframe to store results
+t_test_res_DW <- data.frame()
+
+# Running for loop
+for (i in vars_of_interest) {
+
+ # Storing the results of each iteration of the loop in a temporary results dataframe
+ res <- full_data %>%
+
+ # Writing the formula needed for each iteration of the loop
+ t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
+
+ # Adding a row to the results dataframe each time the loop is iterated
+ t_test_res_DW <- bind_rows(t_test_res_DW, res)
+}
+
+# Viewing our results
+t_test_res_DW
+```
+
+:::question
+ With this, we can answer **Environmental Health Question #1**:
+
+Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
+:::
+
+:::answer
+**Answer**: No, there are not any statistically significant differences in drinking water metals between normal weight and overweight subjects.
+:::
+
+
+
+### Formulas and Pasting
+
+Note the use of the code `as.formula(paste(i, "~ Dichotomized_BMI", sep = ""))`. Let's take a quick detour to discuss the use of the `as.formula()` and `paste()` functions, as these are important functions often used in loops and user-defined functions.
+
+Many statistical test functions and regression functions require one argument to be a formula, which is typically formatted as `y ~ x`, where y is the dependent variable of interest and x is an independent variable. For some functions, additional variables can be included on the right side of the formula to represent covariates (additional variables of interest). The function `as.formula()` returns the argument in parentheses in formula format so that it can be correctly passed to other functions. We can demonstrate that here by assigning a dummy variable `j` the character string `var1`:
+
+```{r 2-4-Code-Efficiency-11 }
+# Assigning variable
+j <- "var1"
+
+# Demonstrating output of as.formula()
+as.formula(paste(j, " ~ Dichotomized_BMI", sep = ""))
+```
+
+We can use the `paste()` function to combine strings of characters. The paste function takes each argument (as many arguments as is needed) and pastes them together into one character string, with the separator between arguments set by the `sep = ` argument. When our y variable is changing with each iteration of our for loop, we can use the `paste()` function to write our formula correctly by telling the function to paste the variable `i`, followed by the rest of our formula, which stays the same for each iteration of the loop. Let's examine the output of just the `paste()` part of our code:
+```{r 2-4-Code-Efficiency-12 }
+paste(j, " ~ Dichotomized_BMI", sep = "")
+```
+
+The `paste()` function is very flexible and can be useful in many other settings when you need to create one character string from arguments from different sources! Notice that the output looks different from the output of `as.formula()`. There is a returned index (`[1]`), and there are quotes around the character string. The last function we will highlight here is the `noquote()` function, which can be helpful if you'd like a string without quotes:
+```{r 2-4-Code-Efficiency-13 }
+noquote(paste(j, " ~ Dichotomized_BMI", sep = ""))
+```
+
+However, `noquote()` still returns a character-based object (note the printed index `[1]`) rather than a true formula, so there are times when it will not allow code to execute properly (for example, when we need a formula format).
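+
+As an aside, the related function `paste0()` is a commonly used shorthand for `paste()` with `sep = ""`, so it produces the same character string as our earlier call:
+```{r 2-4-Code-Efficiency-paste0 }
+j <- "var1"
+
+# Equivalent to paste(j, " ~ Dichotomized_BMI", sep = "")
+paste0(j, " ~ Dichotomized_BMI")
+```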
+
+Next, we will learn about functions and apply them to our dataset to answer our additional environmental health questions.
+
+
+
+## Functions
+
+Functions are useful when you want to execute a block of code organized together to perform one specific task, and you want to be able to change parameters for that task easily rather than having to copy and paste code over and over that largely stays the same but might have small modifications in certain arguments. The basic structure of a function is as follows:
+
+```{r 2-4-Code-Efficiency-14, eval = FALSE}
+function_name <- function(parameter_1, parameter_2...){
+
+ # Function body (where the code goes)
+ insert_code_here
+
+ # What the function returns
+ return()
+}
+```
+
+A function requires you to name it as we did with `function_name`. In parentheses, the function requires you to specify the arguments or parameters. Parameters (e.g., `parameter_1`) act as placeholders in the body of the function. This allows us to change the values of the parameters each time a function is called, while the majority of the code remains the same. Lastly, we have a `return()` statement, which specifies what object (e.g., a vector, dataframe, etc.) we want to retrieve from a function. Although a function can display the last expression from the function body in the absence of a `return()` statement, it's a good habit to include it as the last expression. It is important to note that, although functions can take many input parameters and execute large code chunks, they can only return one item, whether that is a value, vector, dataframe, plot, code output, or list.
+
+When writing your own functions, it is important to describe the purpose of the function, its input, its parameters, and its output so that others can understand what your function does and how to use it. This can be defined either in text above a code chunk if you are using R Markdown or as comments within the code itself. We'll start with a simple function. Let's say we want to convert temperatures from Fahrenheit to Celsius. We can write a function that takes the temperature in Fahrenheit and converts it to Celsius. Note that we have given our parameters descriptive names (`fahrenheit_temperature`, `celsius_temperature`), which makes our code more readable than if we assigned them dummy names such as x and y.
+
+```{r 2-4-Code-Efficiency-15 }
+# Function to convert temperatures in Fahrenheit to Celsius
+## Parameters: temperature in Fahrenheit (input)
+## Output: temperature in Celsius
+
+fahrenheit_to_celsius <- function(fahrenheit_temperature){
+
+ celsius_temperature <- (fahrenheit_temperature - 32) * (5/9)
+
+ return(celsius_temperature)
+}
+```
+
+Notice that the above code block was run, but there isn't an output. Rather, running the code assigns the function code to that function. When you run code defining a function, that function will appear in your Global Environment under the "Functions" section. We can see the output of the function by providing an input value. Let's start by converting 41 degrees Fahrenheit to Celsius:
+
+```{r 2-4-Code-Efficiency-16 }
+# Calling the function
+# Here, 41 is the `fahrenheit_temperature` in the function
+fahrenheit_to_celsius(41)
+```
+
+41 degrees Fahrenheit is equivalent to 5 degrees Celsius. We can also have the function convert a vector of values.
+
+```{r 2-4-Code-Efficiency-17 }
+# Defining vector of temperatures
+vector_of_temperatures <- c(81,74,23,65)
+
+# Calling the function
+fahrenheit_to_celsius(vector_of_temperatures)
+```
+
+Before getting back to answer our environmental health related questions, let's look at one more example of a function. This time we'll create a function that can calculate the circumference of a circle based on its radius in inches. Here you can also see a different style of commenting to describe the function's purpose, inputs, and outputs.
+
+```{r 2-4-Code-Efficiency-18 }
+circle_circumference <- function(radius){
+  # Calculating a circle's circumference based on its radius in inches
+
+ # :parameters: radius
+ # :output: circumference and radius
+
+ # Calculating diameter first
+ diameter <- 2 * radius
+
+ # Calculating circumference
+ circumference <- pi * diameter
+
+ return(circumference)
+}
+
+# Calling function
+circle_circumference(3)
+```
+
+So, if a circle had a radius of 3 inches, its circumference would be ~19 inches. What if we were interested in seeing the diameter to double check our code?
+
+```{r 2-4-Code-Efficiency-19, error = TRUE, suppress_error_alert = TRUE}
+diameter
+```
+
+R throws an error, because the variable `diameter` was created inside the function and the function only returned the `circumference` variable. This is actually one of the ways that functions can improve coding efficiency - by not needing to store intermediate variables that aren't of interest to the main goal of the code or analysis. However, there are two ways we can still see the `diameter` variable:
+
+1. Put print statements in the body of the function (`print(diameter)`).
+2. Have the function return a different variable or list of variables (`c(circumference, diameter)`). See the below section on **List Operations** for more on this topic.
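+
+As a minimal sketch of the second option, a hypothetical `circle_measures()` function (named here just for illustration) could bundle both values into a named list:
+```{r 2-4-Code-Efficiency-list-return }
+circle_measures <- function(radius){
+  diameter <- 2 * radius
+  circumference <- pi * diameter
+
+  # Returning both values as one named list
+  return(list(circumference = circumference, diameter = diameter))
+}
+
+# Individual elements can then be accessed with `$`
+results <- circle_measures(3)
+results$diameter
+```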
+
+We can now move on to using a more complicated function to answer all three of our environmental health questions without repeating our earlier code three times. The main difference between each of our first three environmental health questions is the BMI cutoff used to dichotomize the BMI variable, so we can use that as one of the parameters for our function. We can also use arguments in our function to name our groups.
+
+We can adapt our previous `for` loop code into a function that will take different BMI cutoffs and return statistical results by including parameters to define the parts of the analysis that will change with each unique question. For example:
+
++ Changing the BMI cutoff from a number (in our previous code) to our parameter name that specifies the cutoff
++ Changing the group names for assigning category (in our previous code) to our parameter names
+
+```{r 2-4-Code-Efficiency-20 }
+# Function to dichotomize BMI into different categories and return results of t-test on drinking water metals between dichotomized groups
+
+## Parameters:
+### input_data: dataframe containing BMI and drinking water metals levels
+### bmi_cutoff: numeric value specifying the cut point for dichotomizing BMI
+### lower_group_name: name for the group of subjects with BMIs lower than the cutoff
+### upper_group_name: name for the group of subjects with BMIs higher than the cutoff
+### variables: vector of variable names that statistical test should be run on
+
+## Output: dataframe with statistical results for each variable in the variables vector
+
+bmi_DW_ttest <- function(input_data, bmi_cutoff, lower_group_name, upper_group_name, variables){
+
+ # Creating dichotomized variable
+ dichotomized_data <- input_data %>%
+ mutate(Dichotomized_BMI = ifelse(BMI < bmi_cutoff, lower_group_name, upper_group_name))
+
+ # Creating an empty dataframe to store results
+ t_test_res_DW <- data.frame()
+
+ # Running for loop
+ for (i in variables) {
+
+ # Storing the results of each iteration of the loop in a temporary results dataframe
+ res <- dichotomized_data %>%
+
+ # Writing the formula needed for each iteration of the loop
+ t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
+
+ # Adding a row to the results dataframe each time the loop is iterated
+ t_test_res_DW <- bind_rows(t_test_res_DW, res)
+ }
+
+ # Return results
+ return(t_test_res_DW)
+
+}
+```
+
+For the first example of using the function, we have included the name of each argument for clarity, but this isn't necessary *if* you pass in the arguments *in the order they were defined when writing the function*.
+```{r 2-4-Code-Efficiency-21 }
+# Defining variables (columns) we want to run a t-test on
+vars_of_interest <- c("DWAs", "DWCd", "DWCr")
+
+# Apply function for normal vs. overweight (bmi_cutoff = 25)
+bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal",
+ upper_group_name = "Overweight", variables = vars_of_interest)
+```
+
+Here, we can see the same results as above in the **Loops** section. We can next apply the function to answer our additional environmental health questions:
+```{r 2-4-Code-Efficiency-22 }
+# Apply function for underweight vs. non-underweight (bmi_cutoff = 18.5)
+bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)
+
+# Apply function for non-obese vs. obese (bmi_cutoff = 29.9)
+bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)
+```
+
+:::question
+ With this, we can answer **Environmental Health Questions #2 & #3**:
+
+Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI $\geq$ 18.5) subjects or between non-obese (BMI < 29.9) and obese (BMI $\geq$ 29.9) subjects?
+:::
+
+:::answer
+**Answer**: No, there are not any statistically significant differences in drinking water metals between underweight and non-underweight subjects or between non-obese and obese subjects.
+:::
+
+Here, we were able to answer all three of our environmental health questions within relatively few lines of code by using a function to efficiently assess different variations on our analysis.
+
+In the last section of this module, we will demonstrate how to use list operations to improve coding efficiency.
+
+
+
+## List operations
+
+Lists are a data type in R that can store other data types (including lists, to make nested lists). This allows you to store multiple dataframes in one object and apply the same functions to each dataframe in the list. Lists can also be helpful for storing the results of a function if you would like to be able to access multiple outputs. For example, if we return to our example of a function that calculates the circumference of a circle, we can store both the diameter and circumference as list objects. The function will then return a list containing both of these values when called.
+```{r 2-4-Code-Efficiency-23 }
+# Adding list element to our function
+circle_circumference_4 <- function(radius){
+ # Calculating a circle's circumference and diameter based on the radius in inches
+
+ # :parameters: radius
+ # :output: list that contains diameter [1] and circumference [2]
+
+ # Calculating diameter first
+ diameter <- 2 * radius
+
+ # Calculating circumference
+ circumference <- pi * diameter
+
+ # Storing results in a named list
+ results <- list("diameter" = diameter, "circumference" = circumference)
+
+ # Return results
+ results
+}
+
+# Calling function
+circle_circumference_4(10)
+```
+
+We can also call the results individually using the following code. Note that single brackets (`circle_10[1]`) return a one-element list, while double brackets (`circle_10[[1]]`) would extract the value itself:
+```{r 2-4-Code-Efficiency-24 }
+# Storing results of function
+circle_10 <- circle_circumference_4(10)
+
+# Viewing only diameter
+
+## Method 1
+circle_10$diameter
+
+## Method 2
+circle_10[1]
+
+# Viewing only circumference
+
+## Method 1
+circle_10$circumference
+
+## Method 2
+circle_10[2]
+```
+
+In the context of our dataset, we can use list operations to clean up and combine our results from all three BMI stratification approaches. This is often necessary to prepare data to share with collaborators or for supplementary tables in a manuscript. Let's revisit our code for producing our statistical results, this time assigning our results to a dataframe rather than viewing them.
+```{r 2-4-Code-Efficiency-25 }
+# Defining variables (columns) we want to run a t-test on
+vars_of_interest <- c("DWAs", "DWCd", "DWCr")
+
+# Normal vs. overweight (bmi_cutoff = 25)
+norm_vs_overweight <- bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal",
+ upper_group_name = "Overweight", variables = vars_of_interest)
+
+# Underweight vs. non-underweight (bmi_cutoff = 18.5)
+under_vs_nonunderweight <- bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)
+
+# Non-obese vs. obese (bmi_cutoff = 29.9)
+nonobese_vs_obese <- bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)
+
+# Viewing one results dataframe as an example
+norm_vs_overweight
+```
+
+For publication purposes, let's say we want to make the following formatting changes:
+
++ Keep only the comparison of interest (for example Normal vs. Overweight) and the associated p-value, removing columns that are not as useful for interpreting or sharing the results
++ Rename the `.y.` column so that its contents are clearer
++ Collapse all of our data into one final dataframe
+
+We can first write a function to execute these cleaning steps:
+```{r 2-4-Code-Efficiency-26 }
+# Function to clean results dataframes
+
+## Parameters:
+### input_data: dataframe containing results of t-test
+
+## Output: cleaned dataframe
+
+data_cleaning <- function(input_data) {
+
+ data <- input_data %>%
+
+ # Rename .y. column
+ rename("Variable" = ".y.") %>%
+
+ # Merge group1 and group2
+ unite(Comparison, group1, group2, sep = " vs. ") %>%
+
+ # Keep only columns of interest
+ select(c(Variable, Comparison, p))
+
+ return(data)
+}
+```
+
+Then, we can make a list of our dataframes to clean and apply:
+```{r 2-4-Code-Efficiency-27 }
+# Making list of dataframes
+t_test_res_list <- list(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese)
+
+# Viewing list of dataframes
+head(t_test_res_list)
+```
+
+And we can apply the cleaning function to each of the dataframes using the `lapply()` function, which takes a list as the first argument and the function to apply to each list element as the second argument:
+```{r 2-4-Code-Efficiency-28 }
+# Applying cleaning function
+t_test_res_list_cleaned <- lapply(t_test_res_list, data_cleaning)
+
+# Viewing cleaned dataframes
+head(t_test_res_list_cleaned)
+```
+
+Last, we can collapse our list down into one dataframe using the `do.call()` and `rbind.data.frame()` functions, which together take the elements of the list and collapse them into a single dataframe by binding the rows together:
+```{r 2-4-Code-Efficiency-29 }
+t_test_res_cleaned <- do.call(rbind.data.frame, t_test_res_list_cleaned)
+
+# Viewing final dataframe
+t_test_res_cleaned
+```
+
+The above example is just that - an example to demonstrate the mechanics of using list operations. However, there are actually a couple of even more efficient ways to execute the above cleaning steps:
+
+1. Build cleaning steps into the analysis function if you know you will not need to access the raw results dataframe.
+2. Bind all three dataframes together, then execute the cleaning steps.
+
+We will demonstrate #2 below:
+```{r 2-4-Code-Efficiency-30 }
+# Start by binding the rows of each of the results dataframes
+t_test_res_cleaned_2 <- bind_rows(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese) %>%
+
+ # Rename .y. column
+ rename("Variable" = ".y.") %>%
+
+ # Merge group1 and group2
+ unite(Comparison, group1, group2, sep = " vs. ") %>%
+
+ # Keep only columns of interest
+ select(c(Variable, Comparison, p))
+
+# Viewing results
+t_test_res_cleaned_2
+```
+
+As you can see, this dataframe is the same as the one we produced using list operations. It was produced using fewer lines of code and without the need for a user-defined function! For our purposes, this was a more efficient approach. However, we felt it was important to demonstrate the mechanics of list operations because there may be times when you do need to keep dataframes separate during specific analyses.
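+
+For completeness, option #1 could be sketched as follows - a hypothetical wrapper of our own naming that chains together the `bmi_DW_ttest()` and `data_cleaning()` functions defined above:
+```r
+# Hypothetical wrapper combining the analysis and cleaning steps, for cases
+# where the raw t-test output does not need to be accessed separately
+bmi_DW_ttest_cleaned <- function(input_data, bmi_cutoff, lower_group_name,
+                                 upper_group_name, variables){
+  bmi_DW_ttest(input_data, bmi_cutoff, lower_group_name,
+               upper_group_name, variables) %>%
+    data_cleaning()
+}
+
+# Example call, producing cleaned normal vs. overweight results directly
+# bmi_DW_ttest_cleaned(full_data, 25, "Normal", "Overweight", vars_of_interest)
+```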
+
+
+
+## Concluding Remarks
+
+This module provided an introduction to loops, functions, and list operations and demonstrated how to use them to efficiently analyze an environmentally relevant dataset. When and how you implement these approaches depends on your coding style and the goals of your analysis. Although here we were focused on statistical tests and data cleaning, these flexible approaches can be used in a variety of data analysis steps. We encourage you to implement loops, functions, and list operations in your analyses when you find the need to iterate through statistical tests, visualizations, data cleaning, or other common workflow elements!
+
+## Additional Resources
+
++ [Intro2r Loops](https://intro2r.com/functions-in-r.html)
++ [Intro2r Functions in R](https://intro2r.com/prog_r.html)
++ [Hadley Wickham Advanced R - Functionals](http://adv-r.had.co.nz/Functionals.html)
+
+
+
+
+
+
+:::tyk
+Use the same input data we used in this module to answer the following questions and produce a cleaned, publication-ready data table of results. Note that these data are normally distributed, so you can use a t-test.
+
+1. Are there statistically significant differences in urine metal concentrations (i.e., arsenic levels, cadmium levels, etc.) between younger (MAge < 40) and older (MAge $\geq$ 40) mothers?
+2. Are there statistically significant differences in urine metal concentrations (i.e., arsenic levels, cadmium levels, etc.) between normal weight (BMI < 25) and overweight (BMI $\geq$ 25) subjects?
+:::
diff --git a/Chapter_2/Module2_4_Input/Module2_4_InputData1.csv b/Chapter_2/2_4_Code_Efficiency/Module2_4_InputData1.csv
similarity index 100%
rename from Chapter_2/Module2_4_Input/Module2_4_InputData1.csv
rename to Chapter_2/2_4_Code_Efficiency/Module2_4_InputData1.csv
diff --git a/Chapter_2/Module2_4_Input/Module2_4_InputData2.csv b/Chapter_2/2_4_Code_Efficiency/Module2_4_InputData2.csv
similarity index 100%
rename from Chapter_2/Module2_4_Input/Module2_4_InputData2.csv
rename to Chapter_2/2_4_Code_Efficiency/Module2_4_InputData2.csv
diff --git a/Chapter_3/03-Chapter3.Rmd b/Chapter_3/03-Chapter3.Rmd
deleted file mode 100644
index d9c636d..0000000
--- a/Chapter_3/03-Chapter3.Rmd
+++ /dev/null
@@ -1,1582 +0,0 @@
-# (PART\*) Chapter 3 Basics of Data Analysis and Visualizations {-}
-
-# 3.1 Data Visualizations
-
-This training module was developed by Alexis Payton, Kyle Roell, Lauren E. Koval, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Data Visualizations
-
-Selecting an approach to visualize data is an important consideration when presenting scientific research, given that figures have the capability to summarize large amounts of data efficiently and effectively. (At least that's the goal!) This module will focus on basic data visualizations that we view as the most commonly used, both in and outside of the field of environmental health research, many of which you have likely seen before. This module is not meant to be an exhaustive representation of all figure types; rather, it serves as an introduction to some types of figures and how to choose the one that most optimally displays your data and primary findings. When selecting a data visualization approach, here are some helpful questions you should first ask yourself:
-
-+ What message am I trying to convey with this figure?
-+ How does this figure highlight major findings from the paper?
-+ Who is the audience?
-+ What type of data am I working with?
-
-[A Guide To Getting Data Visualization Right](https://www.smashingmagazine.com/2023/01/guide-getting-data-visualization-right/) is a great resource for determining which figure is best suited for various types of data. More complex methodology-specific charts are presented in succeeding TAME modules. These include visualizations for:
-
-+ Two Group Comparisons (e.g., boxplots and logistic regression) in **Module 3.4 Introduction to Statistical Tests** and **Module 4.4 Two Group Comparisons and Visualizations**
-+ Multi-Group Comparisons (e.g., boxplots) in **Module 3.4 Introduction to Statistical Tests** and **Module 4.5 Multi-Group Comparisons and Visualizations**
-+ Supervised Machine Learning (e.g., decision boundary plots, variable importance plots) in **Module 5.3 Supervised ML Model Interpretation**
-+ Unsupervised Machine Learning
- + Principal Component Analysis (PCA) plots and heatmaps in **Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**
- + Dendrograms, clustering visualizations, heatmaps, and variable contribution plots in **Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**
-+ -Omics Expression (e.g., MA plots and volcano plots) in **Module 6.2 -Omics and Systems Biology: Transcriptomic Applications**
-+ Mixtures Methods
- + Forest Plots in **Module 6.3 Mixtures I: Overview and Quantile G-Computation Application**
- + Trace Plots in **Module 6.4 Mixtures II: BKMR Application**
- + Sufficient Similarity (e.g., heatmaps, clustering) in **Module 6.5 Mixtures III: Sufficient Similarity**
-+ Toxicokinetic Modeling (e.g., line graph, dose response) in **Module 6.6 Toxicokinetic Modeling**
-
-
-
-## Introduction to Training Module
-Visualizing data is an important step in any data analysis, including those carried out in environmental health research. Often, visualizations allow scientists to better understand trends and patterns within a particular dataset under evaluation. Even after statistical analysis of a dataset, it is important to then communicate these findings to a wide variety of target audiences. Visualizations are a vital part of communicating complex data and results to target audiences.
-
-In this module, we highlight some figures that can be used to visualize larger, more high-dimensional datasets using figures that are more simple (but still relevant!) than methods presented later on in TAME. This training module specifically reviews the formatting of data in preparation of generating visualizations, scaling datasets, and then guides users through the generation of the following example data visualizations:
-
-+ Density plots
-+ Boxplots
-+ Correlation plots
-+ Heatmaps
-
-These visualization approaches are demonstrated using a large environmental chemistry dataset. This example dataset was generated through chemical speciation analysis of smoke samples collected during lab-based simulations of wildfire events. Specifically, different biomass materials (eucalyptus, peat, pine, pine needles, and red oak) were burned under two combustion conditions of flaming and smoldering, resulting in the generation of 12 different smoke samples. These data have been previously published in the following environmental health research studies, with data made publicly available:
-
-+ Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. doi: 10.1016/j.scitotenv.2021.145759. Epub 2021 Feb 10. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
-+ Kim YH, Warren SH, Krantz QT, King C, Jaskot R, Preston WT, George BJ, Hays MD, Landis MS, Higuchi M, DeMarini DM, Gilmour MI. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ Health Perspect. 2018 Jan 24;126(1):017011. doi: 10.1289/EHP2200. PMID: [29373863](https://pubmed.ncbi.nlm.nih.gov/29373863/).
-
-### ggplot2
-
-*ggplot2* is a powerful package used to create graphics in R. It was designed based on the philosophy that every figure can be built using a dataset, a coordinate system, and a geom that specifies the type of plot. As a result, it is fairly straightforward to create highly customizable figures and is typically preferred over using base R to generate graphics. We'll generate all of the figures in this module using *ggplot2*.
-
-For additional resources on *ggplot2* see [ggplot2 Posit Documentation](https://ggplot2.tidyverse.org/) and [Data Visualization with ggplot2](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html).
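-
-As a minimal sketch of that grammar (using R's built-in `mtcars` dataset rather than the smoke data used later in this module):
-```r
-library(ggplot2)
-
-# dataset + aesthetic mapping + geom = a complete plot
-ggplot(mtcars, aes(x = wt, y = mpg)) + # data and coordinate/aesthetic mapping
-  geom_point()                         # geom specifying a scatterplot
-```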
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 03-Chapter3-1}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
-```{r install_libs1, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
-if (!requireNamespace("GGally"))
- install.packages("GGally");
-if (!requireNamespace("corrplot"))
- install.packages("corrplot");
-if (!requireNamespace("pheatmap"))
- install.packages("pheatmap");
-```
-
-#### Loading R packages required for this session
-```{r 03-Chapter3-2, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
-library(tidyverse)
-library(GGally)
-library(corrplot)
-library(reshape2)
-library(pheatmap)
-```
-
-#### Set your working directory
-```{r 03-Chapter3-3, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-Then let's read in our example dataset. As mentioned in the introduction, this example dataset represents chemical measurements across 12 different biomass burn scenarios representing potential wildfire events. Let's upload and view these data:
-```{r 03-Chapter3-4}
-# Load the data
-smoke_data <- read.csv("Chapter_3/Module3_1_Input/Module3_1_InputData.csv")
-
-# View the top of the dataset
-head(smoke_data)
-```
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
-2. Are there correlations between biomass burn conditions based on the chemical concentration data?
-3. Under which biomass burn conditions are concentrations of certain chemical categories the highest?
-
-
-
-We can create a **density plot** to answer the first question. Similar to a histogram, density plots are an effective way to show overall distributions of data and can be useful to compare across various test conditions or other stratifications of the data.
-
-In this example of a density plot, we'll visualize the distributions of the chemical concentration data on the x axis; the y axis automatically displays the density, showing where values are concentrated. Additionally, we'll overlay a separate density curve for each biomass burn condition within the same figure.
-
-Before the data can be visualized, it needs to be converted from a wide to long format. This is because we need to have variable or column names entitled `Chemical_Concentration` and `Biomass_Burn_Condition` that can be placed into `ggplot()`. For review on converting between long and wide formats and using other tidyverse tools, see **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**.
-```{r 03-Chapter3-5}
-longer_smoke_data = pivot_longer(smoke_data, cols = 4:13, names_to = "Biomass_Burn_Condition",
- values_to = "Chemical_Concentration")
-
-head(longer_smoke_data)
-```
-
-#### Scaling dataframes for downstream data visualizations
-
-A data preparation method that is commonly used to convert values into those that can be used to better illustrate overall data trends is **data scaling**. Scaling can be achieved through data transformations or normalization procedures, depending on the specific dataset and goal of analysis/visualization. Scaling is often carried out using data vectors or columns of a dataframe.
-
-For this example, we will normalize the chemical concentration dataset with a basic centering and scaling procedure using the base R function `scale()`, which subtracts the mean from each value and then divides by the standard deviation. This scaling step will convert the chemical concentration values in our dataset into normalized values across samples, such that each chemical's concentration distribution is more easily comparable between the different biomass burn conditions.
-
-For more information on the `scale()` function, see its associated [RDocumentation](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale) and helpful tutorial on [Implementing the scale() function in R](https://www.journaldev.com/47818/r-scale-function).
-```{r 03-Chapter3-6}
-scaled_longer_smoke_data = longer_smoke_data %>%
- # scaling within each chemical
- group_by(Chemical) %>%
- mutate(Scaled_Chemical_Concentration = scale(Chemical_Concentration)) %>%
- ungroup()
-
-head(scaled_longer_smoke_data) # see the new scaled values now in the last column (column 7)
-```
-
-We can see in the `Scaled_Chemical_Concentration` column that each chemical's concentrations are now centered around 0 (with a standard deviation of 1), with values falling below and above zero.
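-
-As a quick numeric illustration of what `scale()` is doing under the hood (using a small made-up vector):
-```r
-# scale() subtracts the mean and then divides by the standard deviation
-x <- c(2, 4, 6)           # mean = 4, standard deviation = 2
-as.vector(scale(x))       # returns -1  0  1
-(x - mean(x)) / sd(x)     # equivalent manual calculation: -1  0  1
-```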
-
-Now that we have our dataset formatted, let's plot it.
-
-## Density Plot Visualization
-
-The following code can be used to generate a density plot:
-```{r 03-Chapter3-7, fig.align = "center"}
-ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, color = Biomass_Burn_Condition)) +
- geom_density()
-```
-
-### Answer to Environmental Health Question 1, Method I
-:::question
-*With this method, we can answer **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
-:::
-
-:::answer
-**Answer**: In general, there are a high number of chemicals that were measured at relatively lower abundances across all smoke samples (hence, the peak in occurrence density occurring towards the left, before 0). The three conditions of smoldering peat, flaming peat, and flaming pine contained the most chemicals at the highest relative concentrations (hence, these lines are the top three lines towards the right).
-:::
-
-
-
-## Boxplot Visualization
-A **boxplot** can also be used to answer our first environmental health question: **How do the distributions of the chemical concentration data differ based on each biomass burn scenario?** A boxplot also displays a dataset's distribution, but it does so by visualizing the five-number summary (i.e., minimum, first quartile, median, third quartile, and maximum). Any outliers are displayed as dots.
-
-For this example, let's have `Scaled_Chemical_Concentration` on the x axis and `Biomass_Burn_Condition` on the y axis. The `scaled_longer_smoke_data` dataframe is the format we need, so we'll use that for plotting.
-```{r 03-Chapter3-8, fig.align = "center"}
-ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, y = Biomass_Burn_Condition,
- color = Biomass_Burn_Condition)) +
- geom_boxplot()
-```
-
-### Answer to Environmental Health Question 1, Method II
-:::question
-*With this alternative method, we can answer, in a different way, **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
-:::
-
-:::answer
-**Answer, Method II**: The median chemical concentration is fairly low (less than 0) for all biomass burn conditions. Overall, there isn't much variation in chemical concentrations with the exception of smoldering peat, flaming peat, and flaming eucalyptus.
-:::
-
-
-
-## Correlation Visualizations
-Let's turn our attention to the second environmental health question: **Are there correlations between biomass burn conditions based on the chemical concentration data?** We'll use two different correlation visualizations to answer this question using the *GGally* package.
-
-*GGally* is a package that serves as an extension of *ggplot2* and is very useful for creating plots that compare groups or features within a dataset, among many other utilities. Here, we will demonstrate the `ggpairs()` function within *GGally* using the scaled chemistry dataset. This function will produce an image that shows correlation values between pairs of biomass burn samples and also illustrates the overall distributions of values in the samples. For more information on *GGally*, see its associated [RDocumentation](https://www.rdocumentation.org/packages/GGally/versions/1.5.0) and this [helpful tutorial](http://www.sthda.com/english/wiki/ggally-r-package-extension-to-ggplot2-for-correlation-matrix-and-survival-plots-r-software-and-data-visualization).
-
-*GGally* requires a wide dataframe with ids (i.e., `Chemical`) as the rows and the variables that will be compared to each other (i.e., `Biomass_Burn_Condition`) as the columns. Let's create that dataframe.
-```{r 03-Chapter3-9}
-# first selecting the chemical, biomass burn condition, and
-# the scaled chemical concentration columns
-wide_scaled_data = scaled_longer_smoke_data %>%
- pivot_wider(id_cols = Chemical, names_from = "Biomass_Burn_Condition",
- values_from = "Scaled_Chemical_Concentration") %>%
- # converting the chemical names to row names
- column_to_rownames(var = "Chemical")
-
-head(wide_scaled_data)
-```
-
-By default, `ggpairs()` displays Pearson's correlations. Showing Spearman's correlations takes a bit more work, but it can be done using the code that has been commented out below.
-```{r 03-Chapter3-10, fig.align = "center", fig.width = 15, fig.height = 15}
-
-# ggpairs with Pearson's correlations
-wide_scaled_data = data.frame(as.matrix(wide_scaled_data))
-ggpairs(wide_scaled_data)
-
-# ggpairs with Spearman's correlations
-# spearman_correlations = cor(wide_scaled_data, method = "spearman")
-# ggpairs(wide_scaled_data, upper = list(continuous = wrap(ggally_cor, method = "spearman")))
-```
-
-Many of these biomass burn conditions have significant correlations denoted by the asterisks.
-
-+ '*': p value < 0.05
-+ '**': p value < 0.01
-+ '***': p value < 0.001
-
-The upper right portion displays the correlation values, where a value less than 0 indicates negative correlation and a value greater than 0 signifies positive correlation. The diagonal shows the density plots for each variable. The lower left portion visualizes the values of the two variables compared using a scatterplot.
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Are there correlations between biomass burn conditions based on the chemical concentration data?
-:::
-
-:::answer
-**Answer**: There is low correlation between many of the variables (-0.5 < correlation value < 0.5). Eucalyptus flaming and pine flaming are significantly positively correlated along with peat flaming and pine needles flaming (correlation value ~0.7 and p value < 0.001).
-:::
-
-We can visualize correlations in another way using a second function from *GGally*, `ggcorr()`, which displays each correlation as a colored square. Note that this function calculates Pearson's correlations by default; this can be changed using the `method` parameter, as shown in the commented-out code below.
-```{r 03-Chapter3-11, fig.align = "center", fig.width = 10, fig.height = 7}
-# Pearson's correlations
-ggcorr(wide_scaled_data)
-
-# Spearman's correlations
-# ggcorr(wide_scaled_data, method = c("pairwise", "spearman"))
-```
-
-We'll visualize correlations between each of the groups using one more figure using the `corrplot()` function from the *corrplot* package.
-```{r 03-Chapter3-12, fig.align = "center"}
-# Need to supply corrplot with a correlation matrix, here, using the 'cor' function
-corrplot(cor(wide_scaled_data))
-```
-
-Each of these correlation figures displays the same information, but the one you choose to use is a matter of personal preference. Click on the following resources for additional information on [ggpairs()](https://r-charts.com/correlation/ggpairs/) and [corrplot()](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html).
-
-
-
-## Heatmap Visualization
-
-Last, we'll turn our attention to answering the final environmental health question: **Under which biomass burn conditions are concentrations of certain chemical categories the highest?** This can be addressed with the help of a heatmap.
-
-**Heatmaps** are a highly effective method of viewing an entire dataset at once. Heatmaps can appear similar to correlation plots, but they typically illustrate other values (e.g., concentrations, expression levels, presence/absence, etc.) besides correlation values. They are used to draw out patterns between two variables of highest interest (which comprise the x and y axes, though additional bars can be added to display other layers of information). In this instance, we'll use a heatmap to determine whether there are apparent patterns in chemical concentrations across chemical categories and biomass burn conditions.
-
-For this example, we can plot `Biomass_Burn_Condition` and `Chemical.Category` on the axes and fill in the values with `Scaled_Chemical_Concentration`. When generating heatmaps, scaled values are often used to better distinguish patterns between groups/samples.
-
-In this example, we also plan to display the median scaled concentration value within the heatmap as an additional layer of helpful information to aid in interpretation. To do so, we need to calculate the median chemical concentration for each biomass burn condition within each chemical category. This summarization step is needed regardless, since we want `ggplot()` to color the tiles by the median scaled values.
-```{r 03-Chapter3-13}
-# We'll find the median value and add that data to the dataframe as an additional column
-heatmap_df = scaled_longer_smoke_data %>%
- group_by(Biomass_Burn_Condition, Chemical.Category) %>%
- mutate(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration))
-
-head(heatmap_df)
-```
-
-Now we can plot the data and add the `Median_Scaled_Concentration` to the figure using `geom_text()`. Note that specifying the original `Scaled_Chemical_Concentration` in the **fill** parameter will NOT give you the same heatmap as specifying the median values in `ggplot()`.
-```{r 03-Chapter3-14, fig.align = "center", fig.width = 12, fig.height= 5}
-ggplot(data = heatmap_df, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
- fill = Median_Scaled_Concentration)) +
- geom_tile() + # function used to specify a heatmap for ggplot
- geom_text(aes(label = round(Median_Scaled_Concentration, 2))) # adding concentration values as text, rounding to two values after the decimal
-```
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: Under which biomass burn conditions are concentrations of certain chemical categories the highest?
-:::
-
-:::answer
-**Answer**: Peat flaming has the highest concentrations of inorganics and ions. Eucalyptus smoldering has the highest concentrations of levoglucosans. Pine smoldering has the highest concentrations of methoxyphenols. Peat smoldering has the highest concentrations of n-alkanes. Pine needles smoldering has the highest concentrations of PAHs.
-:::
-
-This same heatmap can be achieved another way using the `pheatmap()` function from the *pheatmap* package. Using this function requires a wide-format dataframe, which we'll need to create. It will contain `Chemical.Category`, `Biomass_Burn_Condition`, and `Scaled_Chemical_Concentration`.
-```{r 03-Chapter3-15, message=FALSE}
-heatmap_df2 = scaled_longer_smoke_data %>%
- group_by(Biomass_Burn_Condition, Chemical.Category) %>%
- # using the summarize function instead of mutate function as was done previously since we only need the median values now
- summarize(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration)) %>%
- # transforming the data to a wide format
- pivot_wider(id_cols = Biomass_Burn_Condition, names_from = "Chemical.Category",
- values_from = "Median_Scaled_Concentration") %>%
-  # converting the biomass burn conditions to row names
- column_to_rownames(var = "Biomass_Burn_Condition")
-
-head(heatmap_df2)
-```
-
-Now let's generate the same heatmap this time using the `pheatmap()` function:
-```{r 03-Chapter3-16, fig.align = "center"}
-pheatmap(heatmap_df2,
- # removing the clustering option from both rows and columns
- cluster_rows = FALSE, cluster_cols = FALSE,
- # adding the values for each cell, making those values black, and changing the font size
- display_numbers = TRUE, number_color = "black", fontsize = 12)
-```
-
-Notice that the `pheatmap()` function does not include axis or legend titles as `ggplot()` does; however, those can be added after exporting the figure from R into MS PowerPoint or Adobe. Additional parameters for the `pheatmap()` function, including `cluster_rows`, are discussed further in **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**. For basic heatmaps like the ones shown here, either `ggplot()` or `pheatmap()` can be used, though each has its pros and cons. For example, `ggplot()` figures tend to be more customizable and more easily combined with other figures, while `pheatmap()` has additional parameters, such as clustering, built into the function that can make plotting certain features more convenient.
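-As a brief aside, a couple of those built-in `pheatmap()` parameters can be previewed here. The following is a minimal sketch (assuming the *RColorBrewer* package is loaded), where `main` adds an overall title and `color` swaps in a custom gradient; the specific palette chosen is purely illustrative:
-```r
-# Sketch: a pheatmap() with an overall title and a custom color gradient
-pheatmap(heatmap_df2,
-         cluster_rows = FALSE, cluster_cols = FALSE,
-         display_numbers = TRUE, number_color = "black", fontsize = 12,
-         main = "Median Scaled Chemical Concentrations", # overall plot title
-         # colorRampPalette() interpolates the 7 brewer colors into a 100-step gradient
-         color = colorRampPalette(brewer.pal(n = 7, name = "YlOrRd"))(100))
-```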
-
-
-
-## Concluding Remarks
-In conclusion, this training module provided example code to create highly customizable data visualizations using *ggplot2* pertinent to environmental health research.
-
-
-
-
-
-:::tyk
-Replicate the figure below! The heatmap still visualizes the median chemical concentrations, but this time we're separating the burn conditions, allowing us to determine if the concentrations of chemicals released are contingent upon the burn condition.
-
-For additional figures available and to view aspects of figures that can be changed in *ggplot2*, check out this [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). You might need it to make this figure!
-
-**Hint 1**: Use the `separate()` function from *tidyverse* to split `Biomass_Burn_Condition` into `Biomass` and `Burn_Condition`.
-
-**Hint 2**: Use the function `facet_wrap()` within `ggplot()` to separate the heatmaps by `Burn_Condition`.
-:::
-```{r 03-Chapter3-17, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_3/Module3_1_Input/Module3_1_Image1.png")
-```
-
-# 3.2 Improving Data Visualizations
-
-This training module was developed by Alexis Payton, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Data Visualization Conventions
-
-Data visualizations are used to convey key takeaways from a research study's findings in a clear, succinct manner to highlight data trends, patterns, and/or relationships. In environmental health research, this is of particular importance for high-dimensional datasets that can typically be parsed using multiple methods, potentially resulting in many different approaches to visualize data. As a consequence, researchers are often faced with an overwhelming number of options when deciding which visualization scheme(s) most optimally translate their results for effective dissemination. Effective data visualization approaches are vital to a researcher's success for many reasons. For instance, manuscript readers or peer reviewers often scroll through a study's text and focus on the quality and novelty of study figures before deciding whether to read/review the paper. Therefore, the importance of data visualizations cannot be overstated in any research field.
-
-As a high-level introduction, it is important that we first communicate some traits that we think are imperative towards ensuring a successful data visualization approach as described in more detail below.
-
-Keys to successful data visualizations:
-
-+ **Consider your audience, data type, and research question prior to selecting a figure to visualize your data**
-
- For example, if more computationally complex methods are used in a manuscript that is intended for a journal with an audience that doesn't have that same level of expertise, consider spending time focusing on how those results are presented in an approachable way for that audience. For a review of how to choose a rudimentary chart based on the data type, check out [How to Choose the Right Data Visualization](https://www.atlassian.com/data/charts/how-to-choose-data-visualization). Some of these basic charts will be presented in this module, while more complex analysis-specific visualizations, especially ones developed for high-dimensional data will be presented in later modules.
-
-+ **Take the legibility of the figure into account**
-
-   This includes avoiding abbreviations when possible (if they can't be avoided, explain them in the caption). All titles should be capitalized, including titles for the legend(s) and axes, and underscores and periods between words should be replaced with spaces. Also consider the legibility of the figure if printed in black and white, though this is less of a concern these days. Lastly, feel free to describe your plot in further detail in the caption to aid the reader in understanding the results presented.
-
-+ **Minimize text**
-
- Main titles aren't necessary for single paneled figures (like the examples below), because in a publication the title of the figure is right underneath each figure. It's good practice to remove this kind of extraneous text, which can make the figure seem more cluttered. Titles can be helpful in multi-panel figures, especially if there are multiple panels with the same figure type that present slightly different results. For example, in the Test Your Knowledge section, you'll need to create two heatmaps, but one displays data under smoldering conditions and the other displays data under flaming conditions. In general, try to reduce the amount of extraneous text in a plot to keep a reader focused on the most important elements and takeaways in the plot.
-
-+ **Use the minimal number of figures you need to support your narrative**
-
- It is important to include an optimal number of figures within manuscripts and scientific reports. Too many figures might overwhelm the overall narrative, while too few might not provide enough substance to support your main findings. It can be helpful to also consider placing some figures in supplemental material to aid in the overall flow of your scientific writing.
-
-+ **Select an appropriate color palette**
-
-   Packages have been developed to offer color palettes, including *MetBrewer* and *RColorBrewer*. In addition, *ggsci* is a package that offers a collection of color palettes used in various scientific journals. For more information on *MetBrewer*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/MetBrewer/index.html) and [example tutorial](https://github.com/BlakeRMills/MetBrewer). For more information on *RColorBrewer*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/RColorBrewer/index.html) and [example tutorial](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For more information on *ggsci*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html). In general, it's better to avoid bright and flashy colors that can be difficult to read.
-
- It's advisable to use colors for manuscript figures that are color-blind friendly. Check out these [Stack overflow answers about color blind-safe color palettes and packages](https://stackoverflow.com/questions/57153428/r-plot-color-combinations-that-are-colorblind-accessible). Popular packages for generating colorblind-friendly palettes include [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) and [rcartocolor](https://github.com/Nowosad/rcartocolor).
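-   As a quick, hedged illustration of the point above, *ggplot2* itself ships the colorblind-friendly viridis scales, so a fill scale can be swapped in with one line. The sketch below assumes a long-format dataframe like the `scaled_longer_smoke_data` object created later in this module:
-```r
-# Sketch: scale_fill_viridis_c() applies a perceptually uniform, colorblind-safe
-# continuous palette; scale_fill_viridis_d() is its discrete counterpart
-ggplot(scaled_longer_smoke_data, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
-                                     fill = Scaled_Chemical_Concentration)) +
-  geom_tile() +
-  scale_fill_viridis_c()
-```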
-
-+ **Use color strategically**
-
-   Color can be used to visualize a variable. There are three ways to categorize color schemes: sequential, diverging, and qualitative. Below, definitions are provided for each, along with example figures that we've previously published illustrating each color scheme. In addition, figure titles and captions are provided for context. Note that some of these figures have been simplified from what was originally published to show more streamlined examples for TAME.
-
-   - **Sequential**: intended for ordered categorical data (i.e., disease severity, Likert scales, quintiles). The choropleth map below is from [Winker, Payton et al.](https://doi.org/10.3389/fpubh.2024.1339700).
-```{r 03-Chapter3-18, echo=FALSE, out.width = "65%", fig.align='center'}
-knitr::include_graphics("Chapter_3/Module3_2_Input/Module3_2_Image1.png")
-```
-
-**Figure 1. Geospatial distribution of the risk of future wildfire events across North Carolina.** Census tracts in North Carolina were binned into quintiles based on Wildfire Hazard Potential (WHP), with 1 (pale orange) having the lowest risk and 5 (dark red) having the highest risk. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
-
-   - **Diverging**: intended to emphasize continuous data at the extremes of the data range (typically using darker colors) and mid-range values (typically using lighter colors). This color scheme is ideal for charts like heatmaps. The heatmap below is from [Payton, Perryman et al.](https://doi.org/10.1152/ajplung.00299.2021).
-```{r 03-Chapter3-19, echo=FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics("Chapter_3/Module3_2_Input/Module3_2_Image2.png")
-```
-
-**Figure 2. Individual cytokine expression levels across all subjects.** Cytokine concentrations were derived from nasal lavage fluid samples. On the x axis, subjects were ordered first according to tobacco use status, starting with non-smokers then cigarette smokers and e-cigarette users. Within tobacco use groups, subjects are ordered from lowest to highest average cytokine concentration from left to right. Within each cluster shown on the y axis, cytokines are ordered from lowest to highest average cytokine concentration from bottom to top. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
-
-   - **Qualitative**: intended for nominal categorical data to visualize clear differences between groups (i.e., soil types, exposure groups). The dendrogram below is from [Koval et al.](https://doi.org/10.1038/s41370-022-00451-8).
-```{r 03-Chapter3-20, echo=FALSE, out.width = "75%", fig.align='center'}
-knitr::include_graphics("Chapter_3/Module3_2_Input/Module3_2_Image3.png")
-```
-
-**Figure 3. Translating chemical use inventory data to inform human exposure patterning.** Groups A-I illustrate the identified clusters of exposure source categories. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
-
-+ **Consider ordering axes to reveal patterns relevant to the research questions**
-
- Ordering the axes can reveal potential patterns that may not be clear in the visualization otherwise. In the cytokine expression heatmap above, there are not clear differences in cytokine expression across the tobacco use groups. However, e-cigarette users seem to have slightly more muted responses compared to non-smokers and cigarette smokers in clusters B and C, which was corroborated in subsequent statistical analyses. It is also evident that Cluster A had the lowest cytokine concentrations, followed by Cluster B, and then Cluster C with the greatest concentrations.
-
-What makes these figures so compelling is how the aspects introduced above were thoughtfully incorporated. In the next section, we'll put those principles into practice using data that were described and referenced previously in **TAME 2.0 Module 3.1 Data Visualizations**.
-
-
-
-## Introduction to Training Module
-In this module, *ggplot2*, R's data visualization package, will be used to walk through ways to improve data visualizations. We'll recreate two figures (i.e., the boxplot and heatmap) constructed previously in **TAME 2.0 Module 3.1 Data Visualizations** and improve them so they are publication-ready. Additionally, we'll write figure titles and captions to contextualize the results presented for each visualization. When writing figure titles and captions, it is helpful to address the research question or overall concept that the figure seeks to capture, rather than getting into the weeds of the specific methods the plot is based on. This is especially important when visualizing more complex methods that your audience may be less familiar with.
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 03-Chapter3-21, clear_env, echo=TRUE, eval=TRUE}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the below code, which checks installation status for you:
-```{r 03-Chapter3-22, install_libs2, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
-if (!requireNamespace("MetBrewer", quietly = TRUE))
-  install.packages("MetBrewer")
-if (!requireNamespace("RColorBrewer", quietly = TRUE))
-  install.packages("RColorBrewer")
-if (!requireNamespace("pheatmap", quietly = TRUE))
-  install.packages("pheatmap")
-if (!requireNamespace("cowplot", quietly = TRUE))
-  install.packages("cowplot")
-```
-
-#### Loading required R packages
-```{r 03-Chapter3-23, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
-library(tidyverse)
-library(MetBrewer)
-library(RColorBrewer)
-library(pheatmap)
-library(cowplot)
-```
-
-#### Set your working directory
-```{r 03-Chapter3-24, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-Let's now read in our example dataset. As mentioned in the introduction, this example dataset represents chemical measurements across 12 different biomass burn scenarios, representing chemicals emitted during potential wildfire events. Let's upload and view these data:
-```{r 03-Chapter3-25}
-# Load the data
-smoke_data <- read.csv("Chapter_3/Module3_2_Input/Module3_2_InputData.csv")
-
-# View the top of the dataset
-head(smoke_data)
-```
-
-Now that we've been able to view the dataset, let's come up with questions that can be answered with our boxplot and heatmap figure. This will inform how we format the dataframe for visualization.
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Boxplot: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
-2. Heatmap: Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?
-3. How can these figures be combined into a single plot that can then be exported from R?
-
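-As a hedged preview of the third question (assuming two saved *ggplot2* objects, here hypothetically named `figure_a` and `figure_b`), the *cowplot* package loaded above can arrange and export figures along these lines:
-```r
-# Sketch: arrange two ggplot objects side by side, then export the combined figure
-combined_figure = plot_grid(figure_a, figure_b, ncol = 2, labels = c("A", "B"))
-ggsave("Combined_Figure.png", combined_figure, width = 12, height = 5)
-```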
-#### Formatting dataframes for downstream visualization code
-First, format the dataframe by changing it from a wide to long format and normalizing the chemical concentration data. For more details on this data reshaping visit **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**.
-```{r 03-Chapter3-26}
-scaled_longer_smoke_data = pivot_longer(smoke_data, cols = 4:13, names_to = "Biomass_Burn_Condition",
- values_to = "Chemical_Concentration") %>%
- # scaling within each chemical
- group_by(Chemical) %>%
-  # as.numeric() drops the one-column matrix structure that scale() returns
-  mutate(Scaled_Chemical_Concentration = as.numeric(scale(Chemical_Concentration))) %>%
- ungroup()
-
-head(scaled_longer_smoke_data)
-```
-
-
-## Creating an Improved Boxplot Visualization
-
-As we did in the previous module, a boxplot will be constructed to answer the first environmental health question: **How do the distributions of the chemical concentration data differ based on each biomass burn scenario?** Let's remind ourselves of the original figure from the previous module.
-
-```{r 03-Chapter3-27, fig.align = "center", echo = FALSE, fig.width = 7, fig.height = 5}
-ggplot(data = scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, color = Biomass_Burn_Condition)) +
- geom_boxplot()
-```
-
-Based on the figure above, peat smoldering has the highest median scaled chemical concentration. However, this was difficult to determine given that the burn conditions aren't labeled on the x axis and a sequential color palette was used, making it difficult to match each boxplot with its burn condition in the legend. If you look closely, the colors in the legend are in the reverse order of the colors assigned to the boxplots. Let's identify some elements of this graph that can be modified to make it easier to answer our research question.
-
-:::txtbx
-### There are four main aspects we can adjust on this figure:
-
-**1. The legibility of the text in the legend and axes.**
-
-Exchanging the underscores for spaces in the labels improves the legibility of the figure.
-
-**2. The order of the boxplots.**
-
-Ordering the biomass burn conditions from highest to lowest based on their median scaled chemical concentration allows the reader to easily determine the biomass burn condition that had the greatest or least chemical concentrations relative to each other. In R, this can be done by putting the `Biomass_Burn_Condition` variable into a factor.
-
-**3. Use of color.**
-
-Variables can be visualized using color, text, size, etc. In this figure, it is redundant to encode biomass burn condition in both the legend and the color. Instead, this variable can be put on the y axis and the legend removed to be more concise. The shades of the colors will also be changed; to keep each burn condition visually distinct from the others, we will choose a qualitative color scheme.
-
-**4. Show all data points when possible.**
-
-Many journals now require that authors report every single value when making data visualizations, particularly for small *n* studies using bar graphs and boxplots to show results. Instead of just displaying the mean/median and surrounding data range, it is advised to show how every replicate landed in the study range when possible. Note that this requirement is not feasible for studies with larger sample sizes though should be considered for smaller *in vitro* and animal model studies.
-:::
-
-Let's start with addressing **#1: Legibility of Axis Text**. The legend title and axis titles can easily be changed with `ggplot()`, so that will be done later. To remove the underscore from the `Biomass_Burn_Condition` column, we can use the function `gsub()`, which will replace all of the underscores with spaces, resulting in a cleaner-looking graph.
-```{r 03-Chapter3-28}
-# First adding spaces between the biomass burn conditions
-scaled_longer_smoke_data = scaled_longer_smoke_data %>%
- mutate(Biomass_Burn_Condition = gsub("_", " ", Biomass_Burn_Condition))
-
-# Viewing dataframe
-head(scaled_longer_smoke_data)
-```
-
-**#2. Reordering the boxplots based on the median scaled chemical concentration**.
-After calculating the median scaled chemical concentration for each biomass burn condition, the new dataframe will be arranged from lowest to highest median scaled concentration from the top of the dataframe to the bottom. This order will be saved in a vector, `median_biomass_order`. Although the biomass burn conditions are saved from lowest to highest concentration, `ggplot()` will plot them in reverse order with the highest concentration at the top and the lowest at the bottom of the y axis.
-
-Axis reordering can also be accomplished using `reorder` within the `ggplot()` function as described [here](https://guslipkin.medium.com/reordering-bar-and-column-charts-with-ggplot2-in-r-435fad1c643e) and [here](https://r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html).
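-A minimal, hedged sketch of that `reorder()` alternative (assuming the long-format dataframe from above; this sorts the axis on the fly rather than through separate summarize and factor steps):
-```r
-# Sketch: reorder() orders the y axis by each condition's median scaled concentration
-ggplot(scaled_longer_smoke_data,
-       aes(x = Scaled_Chemical_Concentration,
-           y = reorder(Biomass_Burn_Condition, Scaled_Chemical_Concentration, FUN = median))) +
-  geom_boxplot() +
-  ylab("Biomass Burn Condition")
-```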
-```{r 03-Chapter3-29}
-median_biomass = scaled_longer_smoke_data %>%
- group_by(Biomass_Burn_Condition) %>%
- summarize(Median_Concentration = median(Scaled_Chemical_Concentration)) %>%
- # arranges dataframe from lowest to highest from top to bottom
- arrange(Median_Concentration)
-
-head(median_biomass)
-
-# Saving that order
-median_biomass_order = median_biomass$Biomass_Burn_Condition
-```
-
-
-```{r 03-Chapter3-30}
-# Putting into factor to organize the burn conditions
-scaled_longer_smoke_data$Biomass_Burn_Condition = factor(scaled_longer_smoke_data$Biomass_Burn_Condition,
- levels = median_biomass_order)
-
-# Final dataframe to be used for plotting
-head(scaled_longer_smoke_data)
-```
-
-Now that the dataframe has been finalized, we can plot the new boxplot. The final revision, **#3: Making Use of Color**, will be addressed with `ggplot()`. However, a palette can be chosen from the *MetBrewer* package.
-```{r 03-Chapter3-31}
-# Choosing the "Juarez" palette from the *MetBrewer* package
-# `n = 12`, since there are 12 biomass burn conditions
-juarez_colors = met.brewer(name = "Juarez", n = 12)[1:12]
-```
-
-**#4. Show all data points when possible** will also be addressed with `ggplot()` by simply using `geom_point()`.
-```{r 03-Chapter3-32, fig.align = "center", out.width = "75%", out.height = "75%"}
-FigureX1 = ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, y = Biomass_Burn_Condition,
- color = Biomass_Burn_Condition)) +
- geom_boxplot() +
- # jittering the points, so they're not all on top of each other and adding transparency
- geom_point(position = position_jitter(h = 0.1), alpha = 0.7) +
-
- theme_light() + # changing the theme
- theme(axis.text = element_text(size = 9), # changing size of axis labels
- axis.title = element_text(face = "bold", size = rel(1.5)), # changes axis titles
- legend.position = "none") + # removes legend
-
- xlab('Scaled Chemical Concentration (pg/uL)') + ylab('Biomass Burn Condition') + # changing axis labels
- scale_color_manual(values = c(juarez_colors)) # changing the colors
-
-FigureX1
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Chemical concentration distributions of biomass burn conditions.** The boxplots are based on the scaled chemical concentration values, which are the raw chemical concentration values scaled within each chemical. The individual dots represent the concentrations of each chemical. The biomass burn conditions on the y axis are ordered from greatest (top) to least (bottom) based on median scaled chemical concentration."
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
-:::
-
-:::answer
-**Answer**: Smoldering peat has the highest median scaled chemical concentration, though the medians are comparable across all biomass burn conditions. With the exception of smoldering peat, the flaming conditions have higher median chemical concentrations and more overall variation than their respective smoldering conditions.
-:::
-
-You may notice that the scaled chemical concentration was put on the x axis and burn condition on the y axis, and not vice versa. Longer names are more legible when placed on the y axis.
-
-Other aspects of the figure were changed in the latest version, but those are minor compared to changing the order of the boxplots, revamping the text, and changing the usage of color. For example, the background was changed from gray to white. Figure backgrounds are generally white, because the figure is easier to read if the paper is printed in black and white. A plot's background can easily be changed to white in R using `theme_light()`, `theme_minimal()`, or `theme_bw()`. Posit provides a very helpful [ggplot2 cheat sheet](https://posit.co/resources/cheatsheets/?type=posit-cheatsheets&_page=2/) for changing a figure's parameters.
-
-
-
-## Creating an Improved Heatmap Visualization
-
-We'll use a heatmap to answer the second environmental health question: **Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?** Let's view the original heatmap from the previous module and find aspects of it that can be improved.
-```{r 03-Chapter3-33, fig.align = "center", fig.width = 10, fig.height= 5}
-# Changing the biomass condition variable back to a character from a factor
-scaled_longer_smoke_data$Biomass_Burn_Condition = as.character(scaled_longer_smoke_data$Biomass_Burn_Condition)
-
-# Calculating the median value within each biomass burn condition and category
-scaled_longer_smoke_data = scaled_longer_smoke_data %>%
- group_by(Biomass_Burn_Condition, Chemical.Category) %>%
- mutate(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration))
-
-# Plotting
-ggplot(data = scaled_longer_smoke_data, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
- fill = Median_Scaled_Concentration)) +
- geom_tile() +
-  geom_text(aes(label = round(Median_Scaled_Concentration, 2))) # adding concentration values as text, rounded to two decimal places
-```
-
-From the figure above, it's clear that certain biomass burn conditions are associated with higher chemical concentrations for some of the chemical categories. For example, peat flaming exposure was associated with higher levels of inorganics and ions, while pine smoldering exposure was associated with higher levels of methoxyphenols. Although these are important findings, it is still difficult to determine if there are greater similarities in chemical profiles based on the biomass or the incineration temperature. Therefore, let's identify some elements of this chart that can be modified to make it easier to answer our research question.
-
-:::txtbx
-### There are three main aspects we can adjust on this figure:
-
-**1. The legibility of the text in the legend and axes.**
-Similar to what we did previously, we'll replace underscores and periods with spaces in the axis labels and titles.
-
-**2. The order of the axis labels.**
-Ordering the biomass burn condition and chemical category from highest to lowest based on their median scaled chemical concentration allows the reader to easily determine the biomass burn condition that had the greatest or least total chemical concentrations relative to each other. From the previous boxplot figure, biomass burn condition is already in this order, however we need to order the chemical category by putting the variable into a factor.
-
-**3. Use of color.**
-Notice that in the boxplot we used a qualitative palette, which is best for creating visual differences between classes or groups. In this heatmap, we'll choose a diverging color palette that uses two or more contrasting colors, which highlights mid-range values with a lighter color and values at either extreme with a darker color (or vice versa).
-:::
-
-**#1: Legibility of Text** can be addressed in `ggplot()` and so can **#2: Reordering the heatmap**.
-
-`Biomass_Burn_Condition` has already been reordered and put into a factor, but we need to do the same with `Chemical.Category`. Similar to before, median scaled chemical concentration for each chemical category will be calculated. However, this time the new dataframe will be arranged from highest to lowest median scaled concentration from the top of the dataframe to the bottom. `ggplot()` will plot them in the SAME order with the highest concentration on the left side and the lowest on the right side of the figure.
-```{r 03-Chapter3-34}
-# Order the chemical category by the median scaled chemical concentration
-median_chemical = scaled_longer_smoke_data %>%
- group_by(Chemical.Category) %>%
- summarize(Median_Concentration = median(Scaled_Chemical_Concentration)) %>%
- arrange(-Median_Concentration)
-
-head(median_chemical)
-
-# Saving that order
-median_chemical_order = median_chemical$Chemical.Category
-```
-
-```{r 03-Chapter3-35}
-# Putting into factor to organize the chemical categories
-scaled_longer_smoke_data$Chemical.Category = factor(scaled_longer_smoke_data$Chemical.Category,
- levels = median_chemical_order)
-
-# Putting burn conditions back into a factor to organize them
-scaled_longer_smoke_data$Biomass_Burn_Condition = factor(scaled_longer_smoke_data$Biomass_Burn_Condition,
- levels = median_biomass_order)
-
-# Viewing the dataframe to be plotted
-head(scaled_longer_smoke_data)
-```
-
-Now that the dataframe has been finalized, we can plot the new heatmap. The final revision, **#3: Making Use of Color**, will be addressed with `ggplot()`. Here a palette is chosen from the *RColorBrewer* package.
-```{r 03-Chapter3-36}
-# We only need to choose 2 colors, for 'low' and 'high', in the heatmap
-# Setting `n = 8` generates more colors to choose from
-rcolorbrewer_colors = brewer.pal(n = 8, name = 'Accent')
-```
-
-
-```{r 03-Chapter3-37, fig.align = "center", fig.width = 10, fig.height = 4}
-FigureX2 = ggplot(data = scaled_longer_smoke_data, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
- fill = Median_Scaled_Concentration)) +
- geom_tile(color = 'white') + # adds white space between the tiles
- geom_text(aes(label = round(Median_Scaled_Concentration, 2))) + # adding concentration values as text
-
- theme_minimal() + # changing the theme
- theme(axis.text = element_text(size = 9), # changing size of axis labels
- axis.title = element_text(face = "bold", size = rel(1.5)), # changes axis titles
- legend.title = element_text(face = 'bold', size = 10), # changes legend title
- legend.text = element_text(size = 9)) + # changes legend text
-
- labs(x = 'Chemical Category', y = 'Biomass Burn Condition',
- fill = "Scaled Chemical\nConcentration (pg/mL)") + # changing axis labels
- scale_fill_gradient(low = rcolorbrewer_colors[5], high = rcolorbrewer_colors[6]) # changing the colors
-
-FigureX2
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Chemical category concentrations across biomass burn conditions.** Scaled chemical concentration values are based on the raw chemical concentration values scaled within each chemical. Chemical category on the x axis is ordered from highest to lowest median concentration from left to right. Biomass burn condition on the y axis is ordered from the highest to lowest median concentration from top to bottom. The values in each tile represent the median scaled chemical concentration."
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?
-:::
-
-:::answer
-**Answer**: Ordering the axes from highest to lowest concentration didn't organize the data as much as we would've liked, given the variance in chemical concentrations across the chemical categories. Nevertheless, it's still clear that peat flaming produces the highest concentrations of inorganics and ions; peat smoldering, of n-alkanes; eucalyptus smoldering, of levoglucosan; pine smoldering, of methoxyphenols; and pine flaming, of PAHs. In addition, flaming conditions seem to have higher levels of inorganics and ions, while smoldering conditions seem to have higher levels of levoglucosan and PAHs.
-:::
-
-It would be helpful if there were a way to group these chemical profiles based on similarity, and that's where the `pheatmap()` function comes in: it can reveal patterns that are difficult to spot through visual inspection alone. Just for fun, let's briefly visualize a hierarchical clustering heatmap, which will be used to group both the biomass burn conditions and chemical categories based on their chemical concentrations. In this module, we'll focus only on the `pheatmap()` visualization, but more information on hierarchical clustering can be found in **Module 5.5 Unsupervised Machine Learning II: Additional Clustering Applications**.
-
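-Under the hood, `pheatmap()` builds its dendrograms with base R's distance and hierarchical clustering functions. A minimal sketch of those steps on a small made-up matrix (the values are hypothetical, chosen so that two clear row groups emerge):
-
-```{r}
-# Hypothetical 4 x 3 matrix with two obvious row groups
-mat <- matrix(c(1.0, 2.0, 1.0,
-                1.1, 2.1, 0.9,
-                5.0, 6.0, 5.0,
-                5.2, 5.9, 5.1),
-              nrow = 4, byrow = TRUE,
-              dimnames = list(paste0("Sample", 1:4), paste0("Feature", 1:3)))
-
-# Euclidean distances between rows, then complete-linkage clustering
-# (pheatmap's defaults), then cut the dendrogram into 2 groups
-row_clust <- hclust(dist(mat), method = "complete")
-cutree(row_clust, k = 2)
-```
-
-Cutting the tree with `cutree()` is essentially what the `cutree_rows` argument does internally to insert white space between the clusters.
-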
-As we showed in the previous module, this function requires a wide dataframe, which we'll need to create. It will contain `Chemical.Category`, `Biomass_Burn_Condition`, and `Scaled_Chemical_Concentration`.
-```{r 03-Chapter3-38, message=FALSE}
-heatmap_df2 = scaled_longer_smoke_data %>%
- group_by(Biomass_Burn_Condition, Chemical.Category) %>%
- # using the summarize function instead of mutate function as was done previously since we only need the median values now
- summarize(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration)) %>%
- # transforming the data to a wide format
- pivot_wider(id_cols = Biomass_Burn_Condition, names_from = "Chemical.Category",
- values_from = "Median_Scaled_Concentration") %>%
- # converting the biomass burn conditions to row names
- column_to_rownames(var = "Biomass_Burn_Condition")
-
-head(heatmap_df2)
-```
-
-Now let's generate the same heatmap, this time using the `pheatmap()` function:
-```{r 03-Chapter3-39, fig.align = "center"}
-# creating a color palette
-blue_pink_palette = colorRampPalette(c(rcolorbrewer_colors[5], rcolorbrewer_colors[6]))
-
-pheatmap(heatmap_df2,
- # changing the color scheme
- color = blue_pink_palette(40),
- # hierarchical clustering of the biomass burn conditions
- cluster_rows = TRUE,
- # creating white space between the two largest clusters
- cutree_rows = 2,
- # adding the values for each cell and making those values black
- display_numbers = TRUE, number_color = "black",
- # changing the font size and the angle of the column names
- fontsize = 12, angle_col = 45)
-```
-By incorporating the dendrogram into the visualization, it's easier to see that the chemical profiles are more similar within incineration temperatures (flaming versus smoldering) than within biomass types (with the exception of pine needles smoldering).
-
-
-
-## Creating Multi-Plot Figures
-We can combine figures using the `plot_grid()` function from the *cowplot* package. For additional information on the `plot_grid()` function and the parameters that can be changed, see [Arranging Plots in a Grid](https://wilkelab.org/cowplot/articles/plot_grid.html). Other packages with figure-combining capabilities include the *[patchwork](https://patchwork.data-imaginist.com/)* package and the [`grid.arrange()`](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html) function from the *gridExtra* package.
-
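-For comparison, the same side-by-side layout could be sketched with *patchwork* syntax (not run here, and assuming the *patchwork* package is installed; the object name `FigureX_patchwork` is just for illustration):
-
-```{r eval = FALSE}
-library(patchwork)
-
-# Combine two existing ggplot objects, mirroring the relative widths
-# and automatic "A"/"B" panel labels used with plot_grid() above
-FigureX_patchwork <- FigureX1 + FigureX2 +
-  plot_layout(widths = c(1, 1.5)) +
-  plot_annotation(tag_levels = "A")
-FigureX_patchwork
-```
-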
-Figures can also be combined after they're exported from R using other applications, such as Microsoft PowerPoint and Adobe Acrobat.
-```{r 03-Chapter3-40, fig.align = "center", fig.width = 20, fig.height = 6, fig.retina= 3 }
-FigureX = plot_grid(FigureX1, FigureX2,
- # Adding labels and changing their size and position
- labels = "AUTO", label_size = 15, label_x = 0.04,
- rel_widths = c(1, 1.5))
-FigureX
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Chemical concentration distributions across biomass burn conditions.** (A) The boxplots are based on the scaled chemical concentration values, which used the raw chemical concentration values scaled within each chemical. The individual dots represent the concentrations of each chemical. The biomass burn conditions on the y axis are ordered from greatest (top) to least (bottom) based on median scaled chemical concentration. (B) The heatmap visualizes concentrations across chemical categories. Chemical category on the x axis is ordered from highest to lowest median concentration from left to right. Biomass burn condition on the y axis is ordered from the highest to lowest median concentration from top to bottom. The values in each tile represent the median scaled chemical concentration."
-
-By placing these two figures side by side, it's now easier to compare the distributions of each biomass burn condition in figure A with the median chemical category concentrations in figure B, which underlie the variation seen on the left.
-
-
-
-## Concluding Remarks
-In conclusion, this training module provided information and example code for improving, streamlining, and making *ggplot2* figures publication ready. Keep in mind that concepts and ideas presented in this module can be subjective and might need to be amended given the situation, dataset, and visualization.
-
-
-
-### Additional Resources
-
-+ [Beginner's Guide to Data Visualizations](https://towardsdatascience.com/beginners-guide-to-enhancing-visualizations-in-r-9fa5a00927c9) and [Improving Data Visualizations in R](https://towardsdatascience.com/8-tips-for-better-data-visualization-2f7118e8a9f4)
-+ [Generating Colors for Visualizations](https://blog.datawrapper.de/colorguide/)
-+ [Additional Hands on Training](https://github.com/hbctraining/publication_perfect)
-+ Brewer, Cynthia A. 1994. Color use guidelines for mapping and visualization. Chapter 7 (pp. 123-147) in Visualization in Modern Cartography
-+ Hattab, G., Rhyne, T.-M., & Heider, D. (2020). Ten simple rules to colorize biological data visualization. PLOS Computational Biology, 16(10), e1008259. PMID: [33057327](https://doi.org/10.1371/journal.pcbi.1008259)
-
-Lastly, for researchers who are newer to R programming, [*ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/) is a package specifically designed to create publication-ready graphs similar to *ggplot2* with more concise syntax. This package is particularly useful for statistically relevant visualizations, which are further explored in later modules including, **TAME 2.0 Module 3.4 Introduction to Statistical Tests**, **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations**, and **TAME 2.0 Module 4.5 Multigroup Comparisons and Visualizations**.
-
-
-
-
-
-:::tyk
-Replicate the figure below! The heatmap is the same as the "Test Your Knowledge" figure from **TAME 2.0 Module 3.1 Data Visualizations**. This time we'll focus on making the figure look more publication ready by cleaning up the titles, cleaning up the labels, and changing the colors. The heatmap still visualizes the median chemical concentrations, but this time we're separating the burn conditions, allowing us to determine if the concentrations of chemicals released are contingent upon the burn condition.
-
-**Hint**: To view additional aspects of figures that can be changed in *ggplot2* check out this [GGPlot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). It might come in handy!
-:::
-```{r 03-Chapter3-41, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_3/Module3_2_Input/Module3_2_Image4.png")
-```
-
-# 3.3 Normality Tests and Data Transformations
-
-This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-When selecting the appropriate statistical tests to evaluate potential trends in your data, selection often relies upon whether or not underlying data are normally distributed. Many statistical tests and methods that are commonly implemented in exposure science, toxicology, and environmental health research rely on assumptions of normality. Applying a statistical test intended for data with a specific distribution when your data do not fit within that distribution can generate unreliable results, with the potential for false positive and false negative findings. Thus, one of the most common statistical tests to perform at the beginning of an analysis is a test for normality.
-
-In this training module, we will:
-
-+ Review the normal distribution and why it is important
-+ Demonstrate how to test whether your variable distributions are normal...
- + Qualitatively, with histograms and Q-Q plots
- + Quantitatively, with the Shapiro-Wilk test
-+ Discuss data transformation approaches
-+ Demonstrate log~2~ data transformation for non-normal data
-+ Discuss additional considerations related to normality
-
-We will demonstrate normality assessment using example data derived from a study in which chemical exposure profiles were collected across study participants through silicone wristbands. This exposure monitoring technique has been described through previous publications, including the following examples:
-
-+ O'Connell SG, Kincl LD, Anderson KA. [Silicone wristbands as personal passive samplers](https://pubs.acs.org/doi/full/10.1021/es405022f). Environ Sci Technol. 2014 Mar 18;48(6):3327-35. doi: 10.1021/es405022f. Epub 2014 Feb 26. PMID: 24548134; PMCID: PMC3962070.
-
-+ Kile ML, Scott RP, O'Connell SG, Lipscomb S, MacDonald M, McClelland M, Anderson KA. [Using silicone wristbands to evaluate preschool children's exposure to flame retardants](https://www.sciencedirect.com/science/article/pii/S0013935116300743). Environ Res. 2016 May;147:365-72. doi: 10.1016/j.envres.2016.02.034. Epub 2016 Mar 3. PMID: 26945619; PMCID: PMC4821754.
-
-+ Hammel SC, Hoffman K, Phillips AL, Levasseur JL, Lorenzo AM, Webster TF, Stapleton HM. [Comparing the Use of Silicone Wristbands, Hand Wipes, And Dust to Evaluate Children's Exposure to Flame Retardants and Plasticizers](https://pubs.acs.org/doi/full/10.1021/acs.est.9b07909). Environ Sci Technol. 2020 Apr 7;54(7):4484-4494. doi: 10.1021/acs.est.9b07909. Epub 2020 Mar 11. PMID: 32122123; PMCID: PMC7430043.
-
-+ Levasseur JL, Hammel SC, Hoffman K, Phillips AL, Zhang S, Ye X, Calafat AM, Webster TF, Stapleton HM. [Young children's exposure to phenols in the home: Associations between house dust, hand wipes, silicone wristbands, and urinary biomarkers](https://www.sciencedirect.com/science/article/pii/S0160412020322728). Environ Int. 2021 Feb;147:106317. doi: 10.1016/j.envint.2020.106317. Epub 2020 Dec 17. PMID: 33341585; PMCID: PMC7856225.
-
-
-In the current example dataset, chemical exposure profiles were obtained from the analysis of silicone wristbands worn by 97 participants for one week. Chemical concentrations on the wristbands were measured with gas chromatography-mass spectrometry. The subset of chemical data used in this training module consists of phthalates and phthalate alternatives, groups of chemicals used primarily in plastic products to increase flexibility and durability.
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 03-Chapter3-42}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
-
-```{r 03-Chapter3-43, message = FALSE}
-if (!requireNamespace("openxlsx"))
- install.packages("openxlsx");
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("ggpubr"))
- install.packages("ggpubr");
-```
-
-#### Loading R packages required for this session
-```{r 03-Chapter3-44, message = FALSE}
-library(openxlsx) # for importing data
-library(tidyverse) # for manipulating and plotting data
-library(ggpubr) # for making Q-Q plots with ggplot
-```
-
-#### Set your working directory
-```{r 03-Chapter3-45, eval = FALSE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-```{r 03-Chapter3-46, message = FALSE}
-# Import data
-wrist_data <- read.xlsx("Chapter_3/Module3_3_Input/Module3_3_InputData.xlsx")
-
-# Viewing the data
-head(wrist_data)
-```
-
-Our example dataset contains subject IDs (`S_ID`), subject ages, and measurements of 8 different phthalates and phthalate alternatives from silicone wristbands:
-
-+ `DEP`: Diethyl phthalate
-+ `DBP`: Dibutyl phthalate
-+ `BBP`: Butyl benzyl phthalate
-+ `DEHA`: Di(2-ethylhexyl) adipate
-+ `DEHP`: Di(2-ethylhexyl) phthalate
-+ `DEHT`: Di(2-ethylhexyl) terephthalate
-+ `DINP`: Diisononyl phthalate
-+ `TOTM`: Trioctyl trimellitate
-
-The units for the chemical data are nanogram of chemical per gram of silicone wristband (ng/g) per day the participant wore the wristband. One of the primary questions in this study was whether there were significant differences in chemical exposure between subjects with different levels of social stress or between subjects with differing demographic characteristics. However, before we can analyze the data for significant differences between groups, we first need to assess whether our numeric variables are normally distributed.
-
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are these data normally distributed?
-2. How does the distribution of data influence the statistical tests performed on the data?
-
-Before answering these questions, let's define normality and how to test for it in R.
-
-
-
-## What is a Normal Distribution?
-
-A normal distribution is a distribution of data in which values are distributed roughly symmetrically out from the mean such that 68.3% of values fall within one standard deviation of the mean, 95.4% of values fall within 2 standard deviations of the mean, and 99.7% of values fall within three standard deviations of the mean.
-```{r 03-Chapter3-47, out.width = "800px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_3/Module3_3_Input/Module3_3_Image1.png")
-```
-
-Figure Credit: D Wells, CC BY-SA 4.0, via Wikimedia Commons
-
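-These percentages can be verified numerically from the standard normal cumulative distribution function using base R's `pnorm()` (a short sketch; the helper name `within_sd` is just for illustration):
-
-```{r}
-# Proportion of a standard normal distribution falling within
-# k standard deviations of the mean
-within_sd <- function(k) pnorm(k) - pnorm(-k)
-
-round(within_sd(1), 3) # ~0.683
-round(within_sd(2), 3) # ~0.954
-round(within_sd(3), 3) # ~0.997
-```
-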
-Common parametric statistical tests, such as t-tests, one-way ANOVAs, and Pearson correlations, rely on the assumption that data fall within the normal distribution for calculation of z-scores and p-values. Non-parametric tests, such as the Wilcoxon Rank Sum test, Kruskal-Wallis test, and Spearman Rank correlation, do not rely on assumptions about data distribution. Some of the aforementioned between-group comparisons were introduced in **TAME 2.0 Module 3.4 Introduction to Statistical Tests**. They, along with non-parametric tests, are explored further in later modules including **TAME 2.0 Module 4.4 Two-Group Comparisons & Visualizations** and **TAME 2.0 Module 4.5 Multi-group Comparisons & Visualizations**.
-
-
-
-## Qualitative Assessment of Normality
-
-We can begin by assessing the normality of our data through plots. For example, plotting data using [histograms](https://en.wikipedia.org/wiki/Histogram), [densities](https://www.data-to-viz.com/graph/density.html#:~:text=Definition,used%20in%20the%20same%20concept.), or [Q-Q plots](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) can graphically indicate whether a variable’s values appear to be normally distributed. We will start by visualizing our data distributions with histograms.
-
-### Histograms
-
-Let's start with visualizing the distribution of the participants' ages using the `hist()` function that is part of base R.
-```{r 03-Chapter3-48, fig.align = 'center'}
-hist(wrist_data$Age)
-```
-
-We can edit some of the parameters to improve this basic histogram visualization. For example, we can decrease the size of each bin using the `breaks` parameter:
-```{r 03-Chapter3-49, fig.align = 'center'}
-hist(wrist_data$Age, breaks = 10)
-```
-
-The `hist()` function is useful for plotting single distributions, but what if we have many variables that need normality assessment? We can leverage *ggplot2*'s powerful and flexible graphics functions such as `geom_histogram()` and `facet_wrap()` to inspect histograms of all of our variables in one figure panel. For more information on data manipulation in general, see **TAME 2.0 Module 2.3 Data Manipulation & Reshaping** and for more on *ggplot2* including the use of `facet_wrap()`, see **TAME 2.0 Module 3.2 Improving Data Visualizations**.
-
-First, we'll pivot our data to a longer format to prepare for plotting. Then, we'll make our plot. We can use the `theme_set()` function to set a default graphing theme for the rest of the script. A graphing theme represents a set of default formatting parameters (mostly colors) that ggplot will use to make your graphs. `theme_bw()` is a basic theme that includes a white background for the plot and dark grey axis text and minor axis lines. The theme that you use is a matter of personal preference. For more on the different themes available through *ggplot2*, see [here](https://ggplot2.tidyverse.org/reference/ggtheme.html).
-
-```{r 03-Chapter3-50, message = FALSE, fig.align = 'center'}
-# Pivot data longer to prepare for plotting
-wrist_data_long <- wrist_data %>%
- pivot_longer(!S_ID, names_to = "variable", values_to = "value")
-
-# Set theme for graphing
-theme_set(theme_bw())
-
-# Make figure panel of histograms
-ggplot(wrist_data_long, aes(value)) +
- geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
- facet_wrap(~ variable, scales = "free") +
- labs(y = "# of Observations", x = "Value")
-```
-
-From these histograms, we can see that our chemical variables do not appear to be normally distributed.
-
-### Q-Q Plots
-
-Q-Q (quantile-quantile) plots are another way to visually assess normality. Similar to the histogram above, we can create a single Q-Q plot for the age variable using base R functions. Normal Q-Q plots (Q-Q plots where the theoretical quantiles are based on a normal distribution) have theoretical quantiles on the x-axis and sample quantiles, representing the distribution of the variable of interest from the dataset, on the y-axis. If the variable of interest is normally distributed, the points on the graph will fall along the reference line.
-```{r 03-Chapter3-51, fig.align = 'center'}
-# Plot points
-qqnorm(wrist_data$Age)
-
-# Add a reference line for theoretically normally distributed data
-qqline(wrist_data$Age)
-```
-Small variations from the reference line, as seen above, are to be expected for the most extreme values. Overall, we can see that the age data are relatively normally distributed, as the points fall along the reference line.
-
-To make a figure panel with Q-Q plots for all of our variables of interest, we can use the `ggqqplot()` function within the *[ggpubr](https://rpkgs.datanovia.com/ggpubr/)* package. This function generates Q-Q plots and has arguments that are similar to *ggplot2*.
-```{r 03-Chapter3-52, fig.align = 'center'}
-ggqqplot(wrist_data_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
-```
-With this figure panel, we can see that the chemical data have very noticeable deviations from the reference, suggesting non-normal distributions.
-
-To answer our first environmental health question, age is the only variable that appears to be normally distributed in our dataset. This is based on our histograms and Q-Q plots with data centered in the middle and spreading with a distribution on both the lower and upper sides that follow typical normal data distributions. However, chemical concentrations appear to be non-normally distributed.
-
-Next, we will implement a quantitative approach to assessing normality, based on a statistical test for normality.
-
-
-
-## Quantitative Normality Assessment
-
-### Single Variable Normality Assessment
-
-We will use the Shapiro-Wilk test to quantitatively assess whether our data distribution is normal, again looking at the age data. This test can be carried out simply using the `shapiro.test()` function from the base R stats package. When using this test and interpreting its results, it is important to remember that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
-```{r 03-Chapter3-53}
-shapiro.test(wrist_data$Age)
-```
-This test resulted in a p-value of 0.8143, so we fail to reject the null hypothesis (that the data are normally distributed). This means we can assume that age is normally distributed, which is consistent with our visualizations above.
-
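-Note that `shapiro.test()` returns an `htest` list object, so its components can be pulled out programmatically, which becomes useful when testing many variables at once. A small sketch using simulated data so the example is self-contained:
-
-```{r}
-# Simulate normally distributed data
-set.seed(42)
-sim <- rnorm(100)
-
-res <- shapiro.test(sim)
-res$statistic       # the W test statistic
-res$p.value         # the p-value
-res$p.value < 0.05  # TRUE would suggest a non-normal distribution
-```
-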
-### Multiple Variable Normality Assessment
-
-With a large dataset containing many variables of interest (e.g., our example data with multiple chemicals), it is more efficient to test each column for normality and then store those results in a dataframe. We can use the base R function `apply()` to apply the Shapiro-Wilk test over all of the numeric columns of our dataframe. This function generates a list of results, with a list element for each variable tested. There are also other ways you could iterate through each of your columns, such as a `for` loop or a function, as discussed in **TAME 2.0 Module 2.4 Improving Coding Efficiencies**.
-```{r 03-Chapter3-54}
-# Apply Shapiro-Wilk test
-shapiro_res <- apply(wrist_data %>% select(-S_ID), 2, shapiro.test)
-
-# View first three list elements
-glimpse(shapiro_res[1:3])
-```
-
-We can then convert those list results into a dataframe. Each variable is now in a row, with columns describing outputs of the statistical test.
-```{r 03-Chapter3-55}
-# Create results dataframe
-shapiro_res <- do.call(rbind.data.frame, shapiro_res)
-
-# View results dataframe
-shapiro_res
-```
-
-Finally, we can clean up our results dataframe and add a column that will quickly tell us whether our variables are normally or non-normally distributed based on the Shapiro-Wilk normality test results.
-```{r 03-Chapter3-56}
-# Clean dataframe
-shapiro_res <- shapiro_res %>%
-
- # Add normality conclusion
- mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
-
- # Remove columns that do not contain informative data
- select(c(p.value, normal))
-
-# View cleaned up dataframe
-shapiro_res
-```
-
-The results from the Shapiro-Wilk test demonstrate that age data are normally distributed, while the chemical concentration data are non-normally distributed. These results support the conclusions we made based on our qualitative assessment above with histograms and Q-Q plots.
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can now answer **Environmental Health Question #1***: Are these data normally distributed?
-:::
-
-:::answer
-**Answer:** Age is normally distributed, while chemical concentrations are non-normally distributed.
-:::
-
-### Answer to Environmental Health Question 2
-:::question
-*We can also answer **Environmental Health Question #2***: How does the distribution of data influence the statistical tests performed on the data?
-:::
-
-:::answer
-**Answer:** Parametric statistical tests should be used when analyzing the age data, and non-parametric tests should be used when analyzing the chemical concentration data.
-:::
-
-
-
-## Data Transformation
-
-There are a number of approaches that can be used to change the range and/or distribution of values within each variable. Typically, the purpose for applying these changes is to reduce bias in a dataset, remove known sources of variation, or prepare data for specific downstream analyses. The following are general definitions for common terms used when discussing these changes:
-
-+ **Transformation** refers to any process used to change data into other, related values. Normalization and standardization are types of data transformation. Transformation can also refer to performing the same mathematical operation on every value in your dataframe. For example, taking the log~2~ or log~10~ of every value is referred to as log transformation.
-
- + **Normalization** is the process of transforming variables so that they are on a similar scale and therefore are comparable. This can be important when variables in a dataset contain a mixture of data types that are represented by vastly different numeric magnitudes or when there are known sources of variability across samples. Normalization methods are highly dependent on the type of input data. One example of normalization is min-max scaling, which results in a range for each variable of 0 to 1. Although normalization in computational methodologies typically refers to min-max scaling or other similar methods where the variable's range is bounded by specific values, wet-bench approaches also employ normalization - for example, using a reference gene for RT-qPCR assays or dividing a total protein amount for each sample by the volume of each sample to obtain a concentration.
-
- + **Standardization**, also known as Z-score normalization, is a specific type of normalization that involves subtracting each value from the mean of that variable and dividing by that variable's standard deviation. The standardized values for each variable will have a mean of 0 and a standard deviation of 1. The `scale()` function in R performs standardization by default when the data are centered (argument `center = TRUE` is included within the scale function).
-
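-The difference between standardization and min-max scaling can be illustrated with a short example on a made-up vector:
-
-```{r}
-x <- c(2, 4, 6, 8, 10)
-
-# Standardization (z-scores): subtract the mean, divide by the standard deviation
-z <- as.numeric(scale(x))
-mean(z) # 0 (within floating point error)
-sd(z)   # 1
-
-# Min-max scaling: the minimum maps to 0 and the maximum to 1
-(x - min(x)) / (max(x) - min(x))
-```
-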
-### Transformation of example data
-
-When data are non-normally distributed, such as with the chemical concentrations in our example dataset, it may be desirable to transform the data so that the distribution becomes closer to a normal distribution, particularly if there are only parametric tests available to test your hypothesis. A common transformation used in environmental health research is log~2~ transformation, in which data are transformed by taking the log~2~ of each value in the dataframe.
-
-Let's log~2~ transform our chemical data and examine the resulting histograms and Q-Q plots to qualitatively assess whether data appear more normal following transformation. We will apply a pseudo-log~2~ transformation, where we will add 1 to each value before log~2~ transforming so that all resulting values are positive and any zeroes in the dataframe do not return -Inf.
-```{r 03-Chapter3-57, fig.align = 'center'}
-# Apply pseudo log2 (pslog2) transformation to chemical data
-wrist_data_pslog2 <- wrist_data %>%
- mutate(across(DEP:TOTM, ~ log2(.x + 1)))
-
-# Pivot data longer
-wrist_data_pslog2_long <- wrist_data_pslog2 %>%
- pivot_longer(!S_ID, names_to = "variable", values_to = "value")
-
-# Make figure panel of histograms
-ggplot(wrist_data_pslog2_long, aes(value)) +
- geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
- facet_wrap(~ variable, scales = "free") +
- labs(y = "# of Observations", x = "Value")
-```
-
-```{r 03-Chapter3-58, fig.align = 'center'}
-# Make a figure panel of Q-Q plots
-ggqqplot(wrist_data_pslog2_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
-```
-
-Both the histograms and the Q-Q plots demonstrate that our log~2~ transformed data are more normally distributed than the raw data graphed above. Let's apply the Shapiro-Wilk test to our log~2~ transformed data to determine if the chemical distributions are normally distributed.
-```{r 03-Chapter3-59}
-# Apply Shapiro-Wilk test
-shapiro_res_pslog2 <- apply(wrist_data_pslog2 %>% select(-S_ID), 2, shapiro.test)
-
-# Create results dataframe
-shapiro_res_pslog2 <- do.call(rbind.data.frame, shapiro_res_pslog2)
-
-# Clean dataframe
-shapiro_res_pslog2 <- shapiro_res_pslog2 %>%
-
- ## Add normality conclusion
- mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
-
- ## Remove columns that do not contain informative data
- select(c(p.value, normal))
-
-# View cleaned up dataframe
-shapiro_res_pslog2
-```
-
-The results from the Shapiro-Wilk test demonstrate that the log~2~ chemical concentration data are more normally distributed than the raw data. Overall, the p-values, even for the chemicals that are still non-normally distributed, are much higher, and only 2 out of the 8 chemicals are non-normally distributed by the Shapiro-Wilk test. We can also calculate average p-values across all variables for our raw and log~2~ transformed data to further demonstrate this point.
-```{r 03-Chapter3-60}
-# Calculate the mean Shapiro-Wilk p-value for the raw chemical data
-mean(shapiro_res$p.value)
-
-# Calculate the mean Shapiro-Wilk p-value for the pslog2 transformed chemical data
-mean(shapiro_res_pslog2$p.value)
-```
-
-Therefore, the log~2~ chemical data would be most appropriate to use if researchers want to perform parametric statistical testing (particularly if there is not a non-parametric statistical test for a given experimental design). It is important to note that if you proceed to statistical testing using log~2~ or other transformed data, graphs you make of significant results should use the transformed values on the y-axis, and findings should be interpreted in the context of the transformed values.
-
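-If raw-scale values are needed when describing results, the pseudo-log~2~ transformation can be inverted; a brief sketch (the function names here are just for illustration):
-
-```{r}
-# The pseudo-log2 transformation and its inverse
-pslog2 <- function(x) log2(x + 1)
-pslog2_inverse <- function(y) 2^y - 1
-
-transformed <- pslog2(100)
-pslog2_inverse(transformed) # recovers the original value, 100
-```
-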
-
-
-## Additional Considerations Regarding Normality
-
-The following sections detail additional considerations regarding normality. Similar to other advice in TAME, appropriate methods for handling normality assessment and normal versus non-normal data can be dependent on your field, lab, endpoints of interest, and downstream analyses. We encourage you to take those elements of your study into account, alongside the guidance provided here, when assessing normality. Regardless of the specific steps you take, be sure to report normality assessment steps and the data transformation or statistical test decisions you make based on them in your final report or manuscript.
-
-#### Determining which data should go through normality testing:
-
-Values for all samples (rows) that will be going into statistical testing should be tested for normality. If you are only going to be statistically testing a subset of your data, perform the normality test on that subset. Another way to think of this is that data points that are on the same graph together and/or that have been used as input for a statistical test should be tested for normality together.
-
-#### Analyzing datasets with a mixture of normally and non-normally distributed variables:
-
-There are a couple of different routes you can pursue if you have a mixture of normally and non-normally distributed variables in your dataframe:
-
-+ Perform parametric statistical tests on the normally distributed variables and non-parametric tests on the non-normally distributed variables.
-+ Perform, across all variables, the statistical test that fits the majority of the variable distributions in your dataset.
-
-Our preference is to perform one test across all variables of the same data type/endpoint (e.g., all chemical concentrations, all cytokine concentrations). Aim to choose an approach that fits *best* rather than *perfectly*.
-
-#### Improving efficiency for normality assessment:
-
-If you find yourself frequently performing the same normality assessment workflow, consider writing a function that will execute each normality testing step (making a histogram, making a Q-Q plot, determining Shapiro-Wilk normality variable by variable, and determining the average Shapiro-Wilk p-value across all variables) and store the results in a list for easy inspection.
-
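-A minimal sketch of such a helper function (the function and argument names here are hypothetical, and the histogram and Q-Q plot steps are left out for brevity):
-
-```{r}
-# Hypothetical helper: run the Shapiro-Wilk test on every column of a
-# numeric dataframe and summarize the results
-assess_normality <- function(df, alpha = 0.05) {
-  res <- apply(df, 2, shapiro.test)
-  res_df <- data.frame(
-    p.value = sapply(res, function(x) x$p.value),
-    normal = sapply(res, function(x) x$p.value >= alpha)
-  )
-  list(by_variable = res_df, mean_p = mean(res_df$p.value))
-}
-
-# Example with simulated data: one roughly normal and one skewed variable
-set.seed(1)
-example_df <- data.frame(normal_var = rnorm(50), skewed_var = rexp(50)^3)
-assess_normality(example_df)
-```
-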
-
-
-## Concluding Remarks
-
-In conclusion, this training module serves as an introduction to and step-by-step tutorial for normality assessment and data transformations. Approaches described in this training module include visualizations to qualitatively assess normality, statistical tests to quantitatively assess normality, data transformation, and other distribution considerations relating to normality. These methods are an important step in data characterization and exploration prior to downstream analyses and statistical testing, and they can be applied to nearly all studies carried out in environmental health research.
-
-### Additional Resources
-
-+ [Descriptive Statistics and Normality Tests for Statistical Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/)
-+ [STHDA Normality Test in R](https://www.datanovia.com/en/lessons/normality-test-in-r/)
-+ [Normalization vs. Standardization](https://www.geeksforgeeks.org/normalization-vs-standardization/)
-
-
-
-
-
-:::tyk
-Use the input file provided ("Module3_3_TYKInput.xlsx"), which represents a similar dataset to the one used in the module, to answer the following questions:
-
-1. Are any variables normally distributed in the raw data?
-2. Does pseudo log~2~ transforming the values make the distributions overall more or less normally distributed?
-3. What are the average Shapiro-Wilk p-values for the raw and pseudo log~2~ transformed data?
-:::
-
-# 3.4 Introduction to Statistical Tests
-
-This training module was developed by Alexis Payton, Kyle Roell, Elise Hickman, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-This training module provides a brief introduction to some of the most commonly implemented statistics and associated visualizations used in exposure science, toxicology, and environmental health studies. This module first uploads an example dataset that is similar to the data used in **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**, though it includes some expanded subject information data to allow for more example statistical tests. Then, methods to evaluate data normality are presented, including visualization-based and statistical-based approaches.
-
-Basic statistical tests discussed in this module include:
-
-+ T test
-+ Analysis of Variance (ANOVA) with a Tukey's Post-Hoc test
-+ Regression Modeling (Linear and Logistic)
-+ Chi-squared test
-+ Fisher’s exact test
-
-These statistical tests are relatively simple; more extensive examples and associated descriptions of statistical models are provided in the subsequent applications-based training modules:
-
-+ TAME 2.0 Module 4.4 Two-Group Comparisons & Visualizations
-+ TAME 2.0 Module 4.5 Multi-Group Comparisons & Visualizations
-+ TAME 2.0 Module 4.6 Advanced Multi-Group Comparisons & Visualizations
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 03-Chapter3-61}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you:
-```{r 03-Chapter3-62, results=FALSE, message=FALSE}
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("car"))
- install.packages("car");
-if (!requireNamespace("ggpubr"))
- install.packages("ggpubr");
-if(!requireNamespace("effects"))
- install.packages("effects");
-```
-
-#### Loading R packages required for this session
-```{r 03-Chapter3-63, results=FALSE, message=FALSE}
-library(tidyverse) # all tidyverse packages, including dplyr and ggplot2
-library(car) # package for statistical tests
-library(ggpubr) # ggplot2 based plots
-library(effects) # for linear modeling
-```
-
-#### Set your working directory
-```{r 03-Chapter3-64, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example datasets
-
-Let's read in our example dataset. Note that these data are similar to those used previously, except that demographic and chemical measurement data were previously merged, and a few additional columns of subject information/demographics were added to serve as more thorough examples of data for use in this training module.
-```{r 03-Chapter3-65}
-# Loading data
-full.data <- read.csv("Chapter_3/Module3_4_Input/Module3_4_InputData.csv")
-```
-
-Let's view the first 10 rows of the first 9 columns of data in this dataframe:
-```{r 03-Chapter3-66}
-full.data[1:10,1:9]
-```
-
-These represent the subject information/demographic data, which include the following columns:
-
-+ `ID`: subject number
-+ `BMI`: body mass index
-+ `BMIcat`: BMI <= 18.5 binned as "Underweight", 18.5 < BMI <= 24.5 binned as "Normal", BMI > 24.5 binned as "Overweight"
-+ `MAge`: maternal age in years
-+ `MEdu`: maternal education level; "No_HS_Degree" = "less than high school", "No_College_Degree" = "high school or some college", "College_Degree" = "college or greater"
-+ `BW`: body weight in grams
-+ `GA`: gestational age in weeks
-+ `Smoker`: "NS" = non-smoker, "S" = smoker
-+ `Smoker3`: "Never", "Former", or "Current" smoking status
-
-Let's now view the first 10 rows of the remaining columns (columns 10-15) in this dataframe:
-```{r 03-Chapter3-67}
-full.data[1:10,10:15]
-```
-
-These columns represent the environmental exposure measures, including:
-
-+ `DWAs`: drinking water arsenic levels in µg/L
-+ `DWCd`: drinking water cadmium levels in µg/L
-+ `DWCr`: drinking water chromium levels in µg/L
-+ `UAs`: urinary arsenic levels in µg/L
-+ `UCd`: urinary cadmium levels in µg/L
-+ `UCr`: urinary chromium levels in µg/L
-
-
-Now that the script is prepared and the data are uploaded, we can start by asking some initial questions about the data that can be answered by running some basic statistical tests and visualizations.
-
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are there statistically significant differences in BMI between non-smokers and smokers?
-2. Are there statistically significant differences in BMI between current, former, and never smokers?
-3. Is there a relationship between maternal BMI and birth weight?
-4. Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight?
-5. Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
-6. Is there a relationship between smoking status and BMI?
-
-
-
-## Assessing Normality & Homogeneity of Variance
-Statistical test selection often relies upon whether the underlying data are normally distributed and whether the variance is the same across groups (homogeneity of variance). Many statistical tests and methods that are commonly implemented in exposure science, toxicology, and environmental health research rely on assumptions of normality. Thus, one of the most common statistical tests to perform at the beginning of an analysis is a **test for normality**.
-
-As discussed in the previous module, there are a few ways to evaluate the normality of a dataset:
-
-*First*, you can visually gauge whether a dataset appears to be normally distributed through plots. For example, plotting data using histograms, densities, or Q-Q plots can graphically help inform if a variable's values appear to be normally distributed or not.
-
-*Second*, you can evaluate normality using statistical tests, such as the **Kolmogorov-Smirnov (K-S) test** and **Shapiro-Wilk test**. When using these tests and interpreting their results, it is important to remember that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
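-
-For intuition on how these tests behave, it can help to apply the Shapiro-Wilk test to simulated samples with known distributions (a quick sketch on made-up data, not the module's dataset):
-
-```r
-set.seed(42)
-normal_sample <- rnorm(200)  # drawn from a normal distribution
-skewed_sample <- rexp(200)   # drawn from a right-skewed (exponential) distribution
-
-shapiro.test(normal_sample)$p.value  # typically > 0.05: cannot reject normality
-shapiro.test(skewed_sample)$p.value  # typically far below 0.05: reject normality
-```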
-
-
-
-Let's start with the first approach based on data visualizations. In this module, we'll primarily be generating figures using the ***ggpubr*** package, which is specifically designed to generate ggplot2-based figures using more streamlined coding syntax. In addition, this package has statistical parameters for plotting that are useful for basic statistical analysis, especially for people with introductory experience plotting in R. For further documentation on *ggpubr*, click [here](https://jtr13.github.io/cc20/brief-introduction-and-tutorial-of-ggpubr-package.html).
-
-Let's begin with a [histogram](https://en.wikipedia.org/wiki/Histogram) to view the distribution of BMI data using the `gghistogram()` function from the *ggpubr* package:
-```{r 03-Chapter3-68, fig.width=5, fig.height=4, fig.align = 'center'}
-gghistogram(data = full.data, x = "BMI", bins = 20)
-```
-
-Let's also view the [Q–Q (quantile-quantile) plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) using the `ggqqplot()` function, also from the *ggpubr* package:
-```{r 03-Chapter3-69, fig.width=5, fig.height=5, fig.align = 'center'}
-ggqqplot(full.data$BMI, ylab = "BMI")
-```
-
-From these visualizations, the BMI variable appears to be normally distributed, with values centered around the mean and tapering symmetrically toward the lower and upper tails, as expected for a normal distribution.
-
-
-
-Let's now implement the second approach based on statistical tests for normality. Here, let's use the [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) as an example, again looking at the BMI data.
-```{r 03-Chapter3-70}
-shapiro.test(full.data$BMI)
-```
-
-This test resulted in a p-value of 0.3773, so we cannot reject the null hypothesis (that the BMI data are normally distributed). These findings support the assumption that these data are normally distributed.
-
-Next, we'll assess homogeneity of variance using Levene's test. This will be done using the `leveneTest()` function from the *car* package:
-```{r 03-Chapter3-71}
-# First convert the smoker variable to a factor
-full.data$Smoker = factor(full.data$Smoker, levels = c("NS", "S"))
-leveneTest(BMI ~ Smoker, data = full.data)
-```
-The p-value (`Pr(>F)`) is 0.6086, so we cannot reject the null hypothesis that the variance in BMI is equal across the smoking groups. Therefore, the assumptions of a t-test, including normality and homogeneity of variance, have been met.
-
-
-
-## Two-Group Visualizations and Statistical Comparisons using the T-Test
-T-tests are commonly used to test for a significant difference between the means of two groups in normally distributed data. In this example, we will be answering **Environmental Health Question 1**: Are there statistically significant differences in BMI between non-smokers and smokers?
-
-We will specifically implement a two-sample t-test (also known as an independent samples t-test).
-
-Let’s first visualize the BMI data across these two groups using boxplots:
-```{r 03-Chapter3-72, fig.width=5, fig.height=4, fig.align = 'center'}
-ggboxplot(data = full.data, x = "Smoker", y = "BMI")
-```
-
-From this plot, it looks like non-smokers (labeled "NS") *may* have significantly higher BMI than smokers (labeled "S"), though we need statistical evaluation of these data to more thoroughly evaluate this potential data trend.
-
-It is easy to perform a t-test on these data using the `t.test()` function from the base R stats package:
-```{r 03-Chapter3-73}
-t.test(data = full.data, BMI ~ Smoker)
-```
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: Are there statistically significant differences in BMI between non-smokers and smokers?
-:::
-
-:::answer
-**Answer**: From this statistical output, we can see that the overall mean BMI in non-smokers (group "NS") is 26.1, and the overall mean BMI in smokers (group "S") is 23.4. We can also see that the resulting p-value comparison between the means of these two groups is, indeed, significant (p-value = 0.013), meaning that the means between these groups are significantly different (i.e., are not equal).
-:::
-
-It's also helpful to save these results into a variable within the R global environment, which then allows us to access specific output values and extract them more easily for our records. For example, we can run the following to specifically extract the resulting p-value from this test:
-```{r 03-Chapter3-74, fig.align = 'center'}
-ttest.res <- t.test(data = full.data, BMI ~ Smoker) # making a list in the R global environment with the statistical results
-signif(ttest.res$p.value, 2) # pulling the p-value and using the `signif` function to round to 2 significant figures
-```
-
-
-
-## Three-Group Visualizations and Statistical Comparisons using an ANOVA
-Analysis of Variance (ANOVA) is a statistical method that can be used to compare means across three or more groups in normally distributed data. To demonstrate an ANOVA test on this dataset, let's answer **Environmental Health Question 2**: Are there statistically significant differences in BMI between current, former, and never smokers? To do this we'll use the `Smoker3` variable from our dataset.
-
-Let's again start by viewing these data distributions using a boxplot:
-```{r 03-Chapter3-75, fig.align = 'center'}
-ggboxplot(data = full.data, x = "Smoker3", y = "BMI")
-```
-
-From this cursory review of the data, it looks like the current smokers likely demonstrate significantly different BMI measures than the former and never smokers, though we need statistical tests to verify this potential trend. We also require statistical tests to evaluate potential differences (or lack of differences) between former and never smokers.
-
-Let’s now run the ANOVA to compare BMI between smoking groups, using the `aov()` function to fit an ANOVA model:
-```{r 03-Chapter3-76}
-smoker_anova = aov(data = full.data, BMI ~ Smoker3)
-smoker_anova
-```
-
-We need to extract the typical ANOVA results table using either the `summary()` or `anova()` function on the resulting fitted object:
-```{r 03-Chapter3-77}
-anova(smoker_anova)
-```
-
-This table outputs a lot of information, including the `F value` referring to the resulting F-statistic, `Pr(>F)` referring to the p-value of the F-statistic, and other values that are described in detail through other available resources including this [helpful video](https://online.stat.psu.edu/stat485/lesson/12/12.2) through PennState's statistics online resources.
-
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Are there statistically significant differences in BMI between current, former, and never smokers?
-:::
-
-:::answer
-**Answer**: From this ANOVA output table, we can conclude that the group means across all three groups are not equal, given that the p-value, written as `Pr(>F)`, is significant (p-value = 5.88 x 10^-12^). However, this test doesn't tell us which groups differ from each other; that's where post hoc tests like Tukey's are useful.
-:::
-
-Let's run a Tukey's post hoc test using the `TukeyHSD()` function in base R to determine which of the current, former, and never smokers have significant differences in BMI:
-```{r 03-Chapter3-78}
-smoker_tukey = TukeyHSD(smoker_anova)
-smoker_tukey
-```
-
-Note that the `p adj` column in the Tukey output is already adjusted for the multiple pairwise comparisons made within this test (using Tukey's method). Adjusting p values for multiple comparisons is common practice to limit the reporting of false positives, or significant differences that don't actually exist ([Feise, 2002](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-2-8#:~:text=Thus%2C%20the%20main%20benefit%20of,exists%20%5B10%E2%80%9321%5D.)). Other widely used adjustment methods include the Bonferroni and the Benjamini & Hochberg approaches.
-
-For this example, we'll use the `p.adjust()` function to obtain the Benjamini & Hochberg adjusted p values. Check out the associated [RDocumentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust) to discover other methods that can be used to adjust p values using the `p.adjust()` function:
-```{r 03-Chapter3-79}
-# First converting the Tukey object into a dataframe
-smoker_tukey_df = data.frame(smoker_tukey$Smoker3) %>%
- # renaming the `p adj` to `P Value` for clarity
- rename(`P Value` = p.adj)
-
-# Adding a column with the adjusted p values
-smoker_tukey_df$`P Adj` = p.adjust(smoker_tukey_df$`P Value`, method = "fdr")
-smoker_tukey_df
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*We can use this additional information to further answer **Environmental Health Question #2***: Are there statistically significant differences in BMI between current, former, and never smokers?
-:::
-
-:::answer
-**Answer**: Current smokers have significantly lower BMIs than people who have never smoked and people who have formerly smoked. This is made evident by the 95% confidence intervals (`lwr` and `upr`) that don't cross 0 and the p values that remain less than 0.05 even after adjustment.
-:::
-
-
-
-## Regression Modeling and Visualization: Linear and Logistic Regressions
-Regression modeling aims to find a relationship between a dependent variable (or outcome, response, y) and an independent variable (or predictor, explanatory variable, x). There are many forms of regression analysis, but here we will focus on two: linear regression and logistic regression.
-
-In brief, **linear regression** is generally used when you have a continuous dependent variable and there is assumed to be some sort of linear relationship between the dependent and independent variables. Conversely, **logistic regression** is often used when the dependent variable is dichotomous.
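-
-In R, both models are fit with nearly identical formula syntax; the key difference is that logistic regression uses `glm()` with a binomial family. A quick sketch on simulated data (the variable names here are placeholders, not from the module's dataset):
-
-```r
-set.seed(1)
-toy <- data.frame(x = rnorm(100))
-toy$y_cont <- 2 * toy$x + rnorm(100)         # continuous outcome
-toy$y_bin  <- rbinom(100, 1, plogis(toy$x))  # dichotomous (0/1) outcome
-
-lm_fit  <- lm(y_cont ~ x, data = toy)                       # linear regression
-glm_fit <- glm(y_bin ~ x, family = "binomial", data = toy)  # logistic regression
-
-coef(lm_fit)   # slope: change in y per one-unit increase in x
-coef(glm_fit)  # slope: change in the log-odds of y per one-unit increase in x
-```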
-
-Let's first run through an example linear regression model to answer **Environmental Health Question 3**: Is there a relationship between maternal BMI and birth weight?
-
-### Linear Regression
-We will first visualize the data and run a simple correlation analysis to evaluate whether these data are generally correlated. Then, we will run a linear regression to evaluate the relationship between these variables in more detail.
-
-
-Plotting the variables against one another and adding a linear regression line using the function `ggscatter()` from the *ggpubr* package:
-```{r 03-Chapter3-80, fig.align = 'center'}
-ggscatter(full.data, x = "BMI", y = "BW",
- # Adding a linear regression line with 95% confidence intervals as the shaded region
- add = "reg.line", conf.int = TRUE,
- # Customize reg. line
- add.params = list(color = "blue", fill = "lightgray"),
- # Adding Pearson's correlation coefficient
- cor.coef = TRUE, cor.method = "pearson", cor.coeff.args = list(label.sep = "\n"))
-```
-
-We can also run a basic correlation analysis between these two variables using the `cor.test()` function. This function uses Pearson's correlation test by default, which we can implement here given the previously discussed assumption of normality for this dataset. Note that other tests are needed when data are not normally distributed (e.g., Spearman's rank correlation). This function is used here to extract the Pearson's correlation coefficient and p-value (which also appear above in the upper left corner of the graph):
-```{r 03-Chapter3-81}
-cor.res <- cor.test(full.data$BW, full.data$BMI)
-signif(cor.res$estimate, 2)
-signif(cor.res$p.value, 2)
-```
-
-Together, these correlation results suggest an association between BW and BMI, with a significant p-value of 0.0004.
-
-To test this further, let’s run a linear regression analysis using the `lm()` function, using BMI (X) as the independent variable and BW as the dependent variable (Y):
-```{r 03-Chapter3-82}
-crude_lm <- lm(data = full.data, BW ~ BMI)
-summary(crude_lm) # viewing the results summary
-```
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: Is there a relationship between maternal BMI and birth weight?
-:::
-
-:::answer
-**Answer**: Not only is there a slight positive correlation between maternal BMI and BW, as indicated by the correlation coefficient of ~0.25, but this linear relationship is also significant, with a p-value of ~0.0004.
-:::
-
-
-Additionally, we can derive confidence intervals for the BMI estimate using:
-```{r 03-Chapter3-83}
-confint(crude_lm)["BMI",]
-```
-
-Notice that the r-squared (R^2^) value in regression output is the squared value of the previously calculated correlation coefficient (R).
-```{r 03-Chapter3-84}
-signif(sqrt(summary(crude_lm)$r.squared), 2)
-```
-
-
-
-In epidemiological studies, the potential influence of confounders is considered by including important covariates within the final regression model. Let's go ahead and investigate **Environmental Health Question 4**: Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight? We can do that by adding those variables to the linear model.
-
-```{r 03-Chapter3-85}
-adjusted_lm = lm(data = full.data, BW ~ BMI + MAge + GA)
-summary(adjusted_lm)
-```
-
-
-
-Let's further visualize these regression modeling results by adding a regression line to the original scatterplot. Before doing so, we'll use the `effect()` function from the *effects* package to make estimated predictions of birth weight values for the crude and adjusted linear models. The crude model includes only BMI as the independent variable, while the adjusted model includes BMI, maternal age, and gestational age as independent variables. This function creates a table that contains 5 columns: the BMI values at which predictions were made (`BMI`), the predicted birth weight values (`fit`), the standard errors of those predictions (`se`), and the lower (`lower`) and upper (`upper`) confidence limits. An additional column, `Model`, was added to specify whether the values correspond to the crude or adjusted model.
-
-For additional information on visualizing adjusted linear models, see [Plotting Adjusted Associations in R](https://nickmichalak.com/post/2019-02-13-plotting-adjusted-associations-in-r/plotting-adjusted-associations-in-r/).
-```{r 03-Chapter3-86}
-crude_lm_predtable = data.frame(effect(term = "BMI", mod = crude_lm), Model = "Crude")
-adjusted_lm_predtable = data.frame(effect(term = "BMI", mod = adjusted_lm), Model = "Adjusted")
-
-# Viewing one of the tables
-crude_lm_predtable
-```
-
-Now we can plot each linear model and their corresponding 95% confidence intervals (CI). It's easier to visualize this using *ggplot2* instead of *ggpubr*, so that's what we'll use:
-```{r 03-Chapter3-87, fig.align = 'center'}
-options(repr.plot.width=9, repr.plot.height=6) # changing dimensions of the entire figure
-ggplot(full.data, aes(x = BMI, y = BW)) +
- geom_point() +
- # Crude line
- geom_line(data = crude_lm_predtable, mapping = aes(x = BMI, y = fit, color = Model)) +
- # Adjusted line
- geom_line(data = adjusted_lm_predtable, mapping = aes(x = BMI, y = fit, color = Model)) +
- # Crude 95% CI
- geom_ribbon(data = crude_lm_predtable, mapping = aes(x = BMI, y = fit, ymin = lower, ymax = upper, fill = Model), alpha = 0.25) +
- # Adjusted 95% CI
- geom_ribbon(data = adjusted_lm_predtable, mapping = aes(x = BMI, y = fit, ymin = lower, ymax = upper, fill = Model), alpha = 0.25)
-```
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer **Environmental Health Question #4***: Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight?
-:::
-
-:::answer
-**Answer**: BMI is still significantly associated with BW and the included covariates are also shown to be significantly related to birth weight in this model. However, the addition of gestational age and maternal age did not have much of an impact on modifying the relationship between BMI and birth weight.
-:::
-
-
-
-### Logistic Regression
-To carry out a logistic regression, we need to evaluate one continuous variable (here, we select gestational age, using the `GA` variable) and one dichotomous variable (here, we select smoking status, using the `Smoker` variable) to evaluate **Environmental Health Question 5**: Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
-
-Because smoking status is a dichotomous variable, we will use logistic regression to look at this relationship. Let's first visualize these data using boxplots of gestational age across the two smoking groups:
-```{r 03-Chapter3-88, fig.width=5, fig.height=4, fig.align = 'center'}
-ggboxplot(data = full.data, x = "Smoker", y = "GA")
-```
-
-
-With this visualization, it's difficult to tell whether or not there are significant differences in gestational age based on smoking status.
-
-
-Let's now run the statistical analysis, using logistic regression modeling:
-```{r 03-Chapter3-89}
-# Before running the model, "Smoker" needs to be binarized to 0's and 1's for the glm() function
-glm_data = full.data %>% 
- mutate(Smoker = ifelse(Smoker == "NS", 0, 1))
-
-# Use GLM (generalized linear model) and specify the family as binomial
-# This tells GLM to run a logistic regression
-log.res = glm(Smoker ~ GA, family = "binomial", data = glm_data)
-
-summary(log.res) # viewing the results
-```
-
-Similar to the regression modeling analysis, we can also derive confidence intervals:
-```{r 03-Chapter3-90}
-confint(log.res)["GA",]
-```
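-
-Because logistic regression coefficients are reported on the log-odds scale, it is often helpful to exponentiate the estimate and its confidence limits to obtain an odds ratio, which can be easier to interpret (a short sketch continuing from the model above):
-
-```r
-# Odds ratio: multiplicative change in the odds of being a smoker
-# per one-week increase in gestational age
-exp(coef(log.res)["GA"])
-
-# Corresponding 95% confidence interval on the odds-ratio scale
-exp(confint(log.res)["GA",])
-```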
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can answer **Environmental Health Question #5***: Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
-:::
-
-:::answer
-**Answer**: Collectively, these results show a non-significant p-value relating gestational age to smoking status, and the confidence interval crosses zero. Therefore, these data do not demonstrate a significant association between gestational age and smoking status.
-:::
-
-
-
-## Statistical Evaluations of Categorical Data using the Chi-Squared Test and Fisher's Exact Test
-Chi-squared test and Fisher's exact tests are used primarily when evaluating data distributions between two categorical variables.
-The difference between a Chi-squared test and the Fisher's exact test surrounds the specific procedure being run. The [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) is an approximation and is run with larger sample sizes to determine whether there is a statistically significant difference between the expected vs. observed frequencies in one or more categories of a contingency table. The [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) is similar, though is an exact measure that can be run on any sample size, including smaller sample sizes.
-
-The number of samples or subjects (*n*) considered sufficiently large is subjective and contingent upon the research question being asked and the experimental design. Smaller sample sizes can be more permissible if the sample is normally distributed, but generally speaking, *n* > 30 is a common convention in statistics ([Alexander, 2022](https://datepsychology.com/no-the-sample-size-is-not-too-small/)).
-
-For this example, we are interested in evaluating the potential relationship between two categorical variables: smoking status (using the `Smoker` variable) and categorical BMI group (using the `BMIcat` variable) to address **Environmental Health Question 6**: Is there a relationship between smoking status and BMI?
-
-To run these categorical statistical tests, let's first create and view a 2-way contingency table describing the frequencies of observations across the categorical BMI and smoking groups:
-```{r 03-Chapter3-91}
-ContingencyTable <- with(full.data, table(BMIcat, Smoker))
-ContingencyTable
-```
-
-Now let's run the Chi-squared test on this table:
-```{r 03-Chapter3-92}
-chisq.test(ContingencyTable)
-```
-
-Note that we can also run the Chi-squared test using the following code, without having to generate the contingency table:
-```{r 03-Chapter3-93, warning = FALSE}
-chisq.test(full.data$BMIcat, full.data$Smoker)
-```
-
-Or:
-```{r 03-Chapter3-94, warning = FALSE}
-with(full.data, chisq.test(BMIcat, Smoker))
-```
-
-### Answer to Environmental Health Question 6
-:::question
-Note that these all produce the same results. *With this, we can answer **Environmental Health Question #6***: Is there a relationship between smoking status and BMI?
-:::
-
-:::answer
-**Answer**: This results in a p-value = 0.34, demonstrating that there is no significant relationship between BMI categories and smoking status.
-:::
-
-
-We can also run a Fisher's exact test, which is particularly useful when sample sizes are small. We won't run this here due to computing time, but here is some example code for your records:
-```{r 03-Chapter3-95}
-# With small sample sizes, we can use a Fisher's exact test
-# fisher.test(full.data$BMIcat, full.data$Smoker)
-```
-
-## Concluding Remarks
-In conclusion, this training module serves as a high-level introduction to basic statistics and visualization methods. Statistical approaches described in this training module include tests for normality, t-test, analysis of variance, regression modeling, chi-squared test, and Fisher’s exact test. Visualization approaches include boxplots, histograms, scatterplots, and regression lines. These methods serve as an important foundation for nearly all studies carried out in environmental health research.
-
-
-
-
-
-:::tyk
-1. If we're interested in investigating if there are significant differences in birth weight based on maternal education level, which statistical test should you use?
-2. Is that relationship considered to be statistically significant and how can we visualize the distributions of these groups?
-:::
diff --git a/Chapter_3/3_1_Data_Visualization/3_1_Data_Visualization.Rmd b/Chapter_3/3_1_Data_Visualization/3_1_Data_Visualization.Rmd
new file mode 100644
index 0000000..738efb0
--- /dev/null
+++ b/Chapter_3/3_1_Data_Visualization/3_1_Data_Visualization.Rmd
@@ -0,0 +1,329 @@
+# (PART\*) Chapter 3 Basics of Data Analysis and Visualizations {-}
+
+# 3.1 Data Visualizations
+
+This training module was developed by Alexis Payton, Kyle Roell, Lauren E. Koval, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Data Visualizations
+
+Selecting an approach to visualize data is an important consideration when presenting scientific research, given that figures have the capability to summarize large amounts of data efficiently and effectively. (At least that's the goal!) This module will focus on the basic data visualizations that we view as most commonly used, both in and outside the field of environmental health research, many of which you have likely seen before. This module is not meant to be an exhaustive representation of all figure types; rather, it serves as an introduction to some types of figures and how to choose the one that most optimally displays your data and primary findings. When selecting a data visualization approach, here are some helpful questions to first ask yourself:
+
++ What message am I trying to convey with this figure?
++ How does this figure highlight major findings from the paper?
++ Who is the audience?
++ What type of data am I working with?
+
+[A Guide To Getting Data Visualization Right](https://www.smashingmagazine.com/2023/01/guide-getting-data-visualization-right/) is a great resource for determining which figure is best suited for various types of data. More complex methodology-specific charts are presented in succeeding TAME modules. These include visualizations for:
+
++ Two Group Comparisons (e.g., boxplots and logistic regression) in **Module 3.4 Introduction to Statistical Tests** and **Module 4.4 Two Group Comparisons and Visualizations**
++ Multi-Group Comparisons (e.g., boxplots) in **Module 3.4 Introduction to Statistical Tests** and **Module 4.5 Multi-Group Comparisons and Visualizations**
++ Supervised Machine Learning (e.g., decision boundary plots, variable importance plots) in **Module 5.3 Supervised ML Model Interpretation**
++ Unsupervised Machine Learning
+ + Principal Component Analysis (PCA) plots and heatmaps in **Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**
+ + Dendrograms, clustering visualizations, heatmaps, and variable contribution plots in **Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**
++ -Omics Expression (e.g., MA plots and volcano plots) in **Module 6.2 -Omics and Systems Biology: Transcriptomic Applications**
++ Mixtures Methods
+ + Forest Plots in **Module 6.3 Mixtures I: Overview and Quantile G-Computation Application**
+ + Trace Plots in **Module 6.4 Mixtures II: BKMR Application**
+ + Sufficient Similarity (e.g., heatmaps, clustering) in **Module 6.5 Mixtures III: Sufficient Similarity**
++ Toxicokinetic Modeling (e.g., line graph, dose response) in **Module 6.6 Toxicokinetic Modeling**
+
+
+
+## Introduction to Training Module
+Visualizing data is an important step in any data analysis, including those carried out in environmental health research. Often, visualizations allow scientists to better understand trends and patterns within a particular dataset under evaluation. Even after statistical analysis of a dataset, it is important to then communicate these findings to a wide variety of target audiences. Visualizations are a vital part of communicating complex data and results to target audiences.
+
+In this module, we highlight some simpler (but still relevant!) figures that can be used to visualize larger, high-dimensional datasets, compared to the methods presented later on in TAME. This training module specifically reviews the formatting of data in preparation for generating visualizations and the scaling of datasets, and then guides users through the generation of the following example data visualizations:
+
++ Density plots
++ Boxplots
++ Correlation plots
++ Heatmaps
+
+These visualization approaches are demonstrated using a large environmental chemistry dataset. This example dataset was generated through chemical speciation analysis of smoke samples collected during lab-based simulations of wildfire events. Specifically, different biomass materials (eucalyptus, peat, pine, pine needles, and red oak) were burned under two combustion conditions of flaming and smoldering, resulting in the generation of 12 different smoke samples. These data have been previously published in the following environmental health research studies, with data made publicly available:
+
++ Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. doi: 10.1016/j.scitotenv.2021.145759. Epub 2021 Feb 10. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
++ Kim YH, Warren SH, Krantz QT, King C, Jaskot R, Preston WT, George BJ, Hays MD, Landis MS, Higuchi M, DeMarini DM, Gilmour MI. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ Health Perspect. 2018 Jan 24;126(1):017011. doi: 10.1289/EHP2200. PMID: [29373863](https://pubmed.ncbi.nlm.nih.gov/29373863/).
+
+### ggplot2
+
+*ggplot2* is a powerful package used to create graphics in R. It was designed based on the philosophy that every figure can be built using a dataset, a coordinate system, and a geom that specifies the type of plot. As a result, it is fairly straightforward to create highly customizable figures and is typically preferred over using base R to generate graphics. We'll generate all of the figures in this module using *ggplot2*.
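+
+As a concrete illustration of these three components, here is a minimal sketch (using R's built-in `mtcars` dataset purely for illustration) that combines a dataset, an aesthetic mapping onto the coordinate system, and a geom:
+```{r 3-1-Data-Visualization-ggplot2-sketch, fig.align = "center"}
+library(ggplot2)
+
+ggplot(data = mtcars,           # 1. the dataset
+       aes(x = wt, y = mpg)) +  # 2. the aesthetic mapping onto the coordinate system
+  geom_point()                  # 3. the geom specifying the type of plot
+```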
+
+For additional resources on *ggplot2* see [ggplot2 Posit Documentation](https://ggplot2.tidyverse.org/) and [Data Visualization with ggplot2](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html).
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 3-1-Data-Visualization-1 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you:
+```{r 3-1-Data-Visualization-2, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
+if (!requireNamespace("GGally"))
+ install.packages("GGally");
+if (!requireNamespace("corrplot"))
+ install.packages("corrplot");
+if (!requireNamespace("pheatmap"))
+ install.packages("pheatmap");
+```
+
+#### Loading R packages required for this session
+```{r 3-1-Data-Visualization-3, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
+library(tidyverse)
+library(GGally)
+library(corrplot)
+library(reshape2)
+library(pheatmap)
+```
+
+#### Set your working directory
+```{r 3-1-Data-Visualization-4, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+Then let's read in our example dataset. As mentioned in the introduction, this example dataset represents chemical measurements across 12 different biomass burn scenarios representing potential wildfire events. Let's upload and view these data:
+```{r 3-1-Data-Visualization-5 }
+# Load the data
+smoke_data <- read.csv("Chapter_3/3_1_Data_Visualization/Module3_1_InputData.csv")
+
+# View the top of the dataset
+head(smoke_data)
+```
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
+2. Are there correlations between biomass burn conditions based on the chemical concentration data?
+3. Under which biomass burn conditions are concentrations of certain chemical categories the highest?
+
+
+
+We can create a **density plot** to answer the first question. Similar to a histogram, density plots are an effective way to show overall distributions of data and can be useful to compare across various test conditions or other stratifications of the data.
+
+In this example of a density plot, we'll visualize the distributions of chemical concentration data on the x axis, with the density of values displayed on the y axis. Additionally, we'll overlay a separate density curve for each biomass burn condition within the same figure.
+
+Before the data can be visualized, it needs to be converted from a wide to long format. This is because we need to have variable or column names entitled `Chemical_Concentration` and `Biomass_Burn_Condition` that can be placed into `ggplot()`. For review on converting between long and wide formats and using other tidyverse tools, see **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**.
+```{r 3-1-Data-Visualization-6 }
+longer_smoke_data = pivot_longer(smoke_data, cols = 4:13, names_to = "Biomass_Burn_Condition",
+ values_to = "Chemical_Concentration")
+
+head(longer_smoke_data)
+```
+
+#### Scaling dataframes for downstream data visualizations
+
+A data preparation method that is commonly used to convert values into those that can be used to better illustrate overall data trends is **data scaling**. Scaling can be achieved through data transformations or normalization procedures, depending on the specific dataset and goal of analysis/visualization. Scaling is often carried out using data vectors or columns of a dataframe.
+
+For this example, we will normalize the chemical concentration dataset using a basic scaling and centering procedure using the base R function, `scale()`. This algorithm results in the normalization of a dataset using the mean value and standard deviation. This scaling step will convert chemical concentration values in our dataset into normalized values across samples, such that each chemical's concentration distributions are more easily comparable between the different biomass burn conditions.
+
+For more information on the `scale()` function, see its associated [RDocumentation](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/scale) and helpful tutorial on [Implementing the scale() function in R](https://www.journaldev.com/47818/r-scale-function).
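+
+To make the math behind `scale()` explicit: with its default settings, it subtracts the column mean and divides by the column standard deviation (i.e., it computes z-scores). A quick sketch with made-up numbers confirms this:
+```{r 3-1-Data-Visualization-scale-sketch}
+# Hypothetical toy vector, purely for illustration
+x <- c(2, 4, 6, 8, 10)
+
+# scale() with defaults is equivalent to (x - mean(x)) / sd(x)
+all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
+```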
+```{r 3-1-Data-Visualization-7 }
+scaled_longer_smoke_data = longer_smoke_data %>%
+ # scaling within each chemical
+ group_by(Chemical) %>%
+ mutate(Scaled_Chemical_Concentration = scale(Chemical_Concentration)) %>%
+ ungroup()
+
+head(scaled_longer_smoke_data) # see the new scaled values now in the last column (column 7)
+```
+
+We can see that in the `Scaled_Chemical_Concentration` column, each chemical's concentrations are now centered around 0, with scaled values falling below or above zero.
+
+Now that we have our dataset formatted, let's plot it.
+
+## Density Plot Visualization
+
+The following code can be used to generate a density plot:
+```{r 3-1-Data-Visualization-8, fig.align = "center"}
+ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, color = Biomass_Burn_Condition)) +
+ geom_density()
+```
+
+### Answer to Environmental Health Question 1, Method I
+:::question
+*With this method, we can answer **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
+:::
+
+:::answer
+**Answer**: In general, there are a high number of chemicals that were measured at relatively lower abundances across all smoke samples (hence, the peak in occurrence density occurring towards the left, before 0). The three conditions of smoldering peat, flaming peat, and flaming pine contained the most chemicals at the highest relative concentrations (hence, these lines are the top three lines towards the right).
+:::
+
+
+
+## Boxplot Visualization
+A **boxplot** can also be used to answer our first environmental health question: **How do the distributions of the chemical concentration data differ based on each biomass burn scenario?** A boxplot also displays the data's distribution, but it incorporates a visualization of the five-number summary (i.e., minimum, first quartile, median, third quartile, and maximum). Any outliers are displayed as dots.
+
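+Before plotting, it can help to compute the five-number summary itself. Base R's `fivenum()` returns the minimum, lower hinge, median, upper hinge, and maximum — essentially the five values a boxplot summarizes (note that *ggplot2* computes its quartiles slightly differently than Tukey's hinges, but the idea is the same). A quick sketch with arbitrary values:
+```{r 3-1-Data-Visualization-fivenum-sketch}
+# Hypothetical values, purely for illustration
+x <- c(-1.2, -0.8, -0.5, -0.1, 0, 0.3, 0.9, 2.5)
+
+fivenum(x)  # minimum, lower hinge, median, upper hinge, maximum
+```
+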
+For this example, let's have `Scaled_Chemical_Concentration` on the x axis and `Biomass_Burn_Condition` on the y axis. The `scaled_longer_smoke_data` dataframe is the format we need, so we'll use that for plotting.
+```{r 3-1-Data-Visualization-9, fig.align = "center"}
+ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, y = Biomass_Burn_Condition,
+ color = Biomass_Burn_Condition)) +
+ geom_boxplot()
+```
+
+### Answer to Environmental Health Question 1, Method II
+:::question
+*With this alternative method, we can answer, in a different way, **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
+:::
+
+:::answer
+**Answer, Method II**: The median chemical concentration is fairly low (less than 0) for all biomass burn conditions. Overall, there isn't much variation in chemical concentrations with the exception of smoldering peat, flaming peat, and flaming eucalyptus.
+:::
+
+
+
+## Correlation Visualizations
+Let's turn our attention to the second environmental health question: **Are there correlations between biomass burn conditions based on the chemical concentration data?** We'll use two different correlation visualizations to answer this question using the *GGally* package.
+
+*GGally* is a package that serves as an extension of *ggplot2*, the R plotting system based on the grammar of graphics. *GGally* is very useful for creating plots that compare groups or features within a dataset, among many other utilities. Here we will demonstrate the `ggpairs()` function within *GGally* using the scaled chemistry dataset. This function produces an image that shows correlation values between biomass burn sample pairs and also illustrates the overall distributions of values in the samples. For more information on *GGally*, see its associated [RDocumentation](https://www.rdocumentation.org/packages/GGally/versions/1.5.0) and [helpful example tutorial](http://www.sthda.com/english/wiki/ggally-r-package-extension-to-ggplot2-for-correlation-matrix-and-survival-plots-r-software-and-data-visualization).
+
+*GGally* requires a wide dataframe with ids (i.e., `Chemical`) as the rows and the variables that will be compared to each other (i.e., `Biomass_Burn_Condition`) as the columns. Let's create that dataframe.
+```{r 3-1-Data-Visualization-10 }
+# first selecting the chemical, biomass burn condition, and
+# the scaled chemical concentration columns
+wide_scaled_data = scaled_longer_smoke_data %>%
+ pivot_wider(id_cols = Chemical, names_from = "Biomass_Burn_Condition",
+ values_from = "Scaled_Chemical_Concentration") %>%
+ # converting the chemical names to row names
+ column_to_rownames(var = "Chemical")
+
+head(wide_scaled_data)
+```
+
+By default, `ggpairs()` displays Pearson's correlations. Showing Spearman's correlations instead takes a bit more nuance, but it can be done using the code that has been commented out below.
+```{r 3-1-Data-Visualization-11, fig.align = "center", fig.width = 15, fig.height = 15}
+
+# ggpairs with Pearson's correlations
+wide_scaled_data = data.frame(as.matrix(wide_scaled_data))
+ggpairs(wide_scaled_data)
+
+# ggpairs with Spearman's correlations
+# ggpairs(wide_scaled_data, upper = list(continuous = wrap(ggally_cor, method = "spearman")))
+```
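+
+Why choose one correlation method over the other? Pearson's correlation measures linear association, while Spearman's operates on ranks and therefore captures any monotonic relationship. A toy sketch (made-up values) illustrates the difference:
+```{r 3-1-Data-Visualization-correlation-sketch}
+# Hypothetical data: y increases monotonically, but not linearly, with x
+x <- 1:10
+y <- exp(x)
+
+cor(x, y, method = "pearson")   # well below 1: the relationship is not linear
+cor(x, y, method = "spearman")  # 1: the relationship is perfectly monotonic
+```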
+
+Many of these biomass burn conditions have significant correlations, denoted by asterisks:
+
++ '*': p value < 0.05
++ '**': p value < 0.01
++ '***': p value < 0.001
+
+The upper right portion displays the correlation values, where a value less than 0 indicates negative correlation and a value greater than 0 signifies positive correlation. The diagonal shows the density plots for each variable. The lower left portion visualizes the values of the two variables compared using a scatterplot.
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Are there correlations between biomass burn conditions based on the chemical concentration data?
+:::
+
+:::answer
+**Answer**: There is low correlation between many of the variables (-0.5 < correlation value < 0.5). Eucalyptus flaming and pine flaming are significantly positively correlated along with peat flaming and pine needles flaming (correlation value ~0.7 and p value < 0.001).
+:::
+
+We can visualize correlations another way using a second function from *GGally*, `ggcorr()`, which displays each pairwise correlation as a colored square. Note that this function calculates Pearson's correlations by default; however, this can be changed using the `method` parameter shown in the code commented out below.
+```{r 3-1-Data-Visualization-12, fig.align = "center", fig.width = 10, fig.height = 7}
+# Pearson's correlations
+ggcorr(wide_scaled_data)
+
+# Spearman's correlations (the first element of `method` controls how missing data are handled)
+# ggcorr(wide_scaled_data, method = c("pairwise", "spearman"))
+```
+
+We'll visualize correlations between each of the groups with one more figure, using the `corrplot()` function from the *corrplot* package.
+```{r 3-1-Data-Visualization-13, fig.align = "center"}
+# Need to supply corrplot with a correlation matrix, here, using the 'cor' function
+corrplot(cor(wide_scaled_data))
+```
+
+Each of these correlation figures displays the same information, but the one you choose to use is a matter of personal preference. Click on the following resources for additional information on [ggpairs()](https://r-charts.com/correlation/ggpairs/) and [corrplot()](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html).
+
+
+
+## Heatmap Visualization
+
+Last, we'll turn our attention to answering the final environmental health question: **Under which biomass burn conditions are concentrations of certain chemical categories the highest?** This can be addressed with the help of a heatmap.
+
+**Heatmaps** are a highly effective method of viewing an entire dataset at once. Heatmaps can appear similar to correlation plots, but they typically illustrate other values (e.g., concentrations, expression levels, presence/absence, etc.) besides correlation values. They are used to draw out patterns between the two variables of highest interest (which comprise the x and y axes, though additional bars can be added to display other layers of information). In this instance, we'll use a heatmap to determine whether patterns are apparent between chemical category and biomass burn condition with respect to chemical concentrations.
+
+For this example, we can plot `Biomass_Burn_Condition` and `Chemical.Category` on the axes and fill in the values with `Scaled_Chemical_Concentration`. When generating heatmaps, scaled values are often used to better distinguish patterns between groups/samples.
+
+In this example, we also plan to display the median scaled concentration value within the heatmap as an additional layer of helpful information to aid in interpretation. To do so, we'll need to take the median chemical concentration for each biomass burn condition within each chemical category. This step is necessary anyway, since we want `ggplot()` to color the tiles by these median scaled values.
+```{r 3-1-Data-Visualization-14 }
+# We'll find the median value and add that data to the dataframe as an additional column
+heatmap_df = scaled_longer_smoke_data %>%
+ group_by(Biomass_Burn_Condition, Chemical.Category) %>%
+ mutate(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration))
+
+head(heatmap_df)
+```
+
+Now we can plot the data and add the `Median_Scaled_Concentration` to the figure using `geom_text()`. Note that specifying the original `Scaled_Chemical_Concentration` in the **fill** parameter will NOT give you the same heatmap as specifying the median values in `ggplot()`.
+```{r 3-1-Data-Visualization-15, fig.align = "center", fig.width = 12, fig.height= 5}
+ggplot(data = heatmap_df, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
+ fill = Median_Scaled_Concentration)) +
+ geom_tile() + # function used to specify a heatmap for ggplot
+ geom_text(aes(label = round(Median_Scaled_Concentration, 2))) # adding concentration values as text, rounding to two values after the decimal
+```
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: Under which biomass burn conditions are concentrations of certain chemical categories the highest?
+:::
+
+:::answer
+**Answer**: Peat flaming has the highest concentrations of inorganics and ions. Eucalyptus smoldering has the highest concentrations of levoglucosans. Pine smoldering has the highest concentrations of methoxyphenols. Peat smoldering has the highest concentrations of n-alkanes. Pine needles smoldering has the highest concentrations of PAHs.
+:::
+
+This same heatmap can be achieved another way using the `pheatmap()` function from the *pheatmap* package. This function requires a wide-format dataset, which we'll create next. It will contain `Chemical.Category`, `Biomass_Burn_Condition`, and `Scaled_Chemical_Concentration`.
+```{r 3-1-Data-Visualization-16, message=FALSE}
+heatmap_df2 = scaled_longer_smoke_data %>%
+ group_by(Biomass_Burn_Condition, Chemical.Category) %>%
+ # using the summarize function instead of mutate function as was done previously since we only need the median values now
+ summarize(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration)) %>%
+ # transforming the data to a wide format
+ pivot_wider(id_cols = Biomass_Burn_Condition, names_from = "Chemical.Category",
+ values_from = "Median_Scaled_Concentration") %>%
+ # converting the biomass burn conditions to row names
+ column_to_rownames(var = "Biomass_Burn_Condition")
+
+head(heatmap_df2)
+```
+
+Now let's generate the same heatmap this time using the `pheatmap()` function:
+```{r 3-1-Data-Visualization-17, fig.align = "center"}
+pheatmap(heatmap_df2,
+ # removing the clustering option from both rows and columns
+ cluster_rows = FALSE, cluster_cols = FALSE,
+ # adding the values for each cell, making those values black, and changing the font size
+ display_numbers = TRUE, number_color = "black", fontsize = 12)
+```
+
+Notice that the `pheatmap()` function does not include axes or legend titles as `ggplot()` does; however, those can be added to the figure after exporting from R using MS PowerPoint or Adobe software. Additional parameters for the `pheatmap()` function, including `cluster_rows`, are discussed further in **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**. For basic heatmaps like the ones shown here, either `ggplot()` or `pheatmap()` can be used; however, both have their pros and cons. For example, `ggplot()` figures tend to be more customizable and more easily combined with other figures, while `pheatmap()` has additional parameters built into the function, like clustering, that can make plotting certain features advantageous.
+
+
+
+## Concluding Remarks
+In conclusion, this training module provided example code to create highly customizable data visualizations using *ggplot2* pertinent to environmental health research.
+
+
+
+
+
+:::tyk
+Replicate the figure below! The heatmap still visualizes the median chemical concentrations, but this time we're separating the burn conditions, allowing us to determine if the concentrations of chemicals released are contingent upon the burn condition.
+
+For additional figures available and to view aspects of figures that can be changed in *ggplot2*, check out this [ggplot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). You might need it to make this figure!
+
+**Hint 1**: Use the `separate()` function from *tidyverse* to split `Biomass_Burn_Condition` into `Biomass` and `Burn_Condition`.
+
+**Hint 2**: Use the function `facet_wrap()` within `ggplot()` to separate the heatmaps by `Burn_Condition`.
+:::
+```{r 3-1-Data-Visualization-18, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_3/3_1_Data_Visualization/Module3_1_Image1.png")
+```
diff --git a/Chapter_3/Module3_1_Input/Module3_1_Image1.png b/Chapter_3/3_1_Data_Visualization/Module3_1_Image1.png
similarity index 100%
rename from Chapter_3/Module3_1_Input/Module3_1_Image1.png
rename to Chapter_3/3_1_Data_Visualization/Module3_1_Image1.png
diff --git a/Chapter_3/Module3_1_Input/Module3_1_InputData.csv b/Chapter_3/3_1_Data_Visualization/Module3_1_InputData.csv
similarity index 100%
rename from Chapter_3/Module3_1_Input/Module3_1_InputData.csv
rename to Chapter_3/3_1_Data_Visualization/Module3_1_InputData.csv
diff --git a/Chapter_3/3_2_Improving_Visualization/3_2_Improving_Visualization.Rmd b/Chapter_3/3_2_Improving_Visualization/3_2_Improving_Visualization.Rmd
new file mode 100644
index 0000000..5e3a845
--- /dev/null
+++ b/Chapter_3/3_2_Improving_Visualization/3_2_Improving_Visualization.Rmd
@@ -0,0 +1,441 @@
+
+# 3.2 Improving Data Visualizations
+
+This training module was developed by Alexis Payton, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Data Visualization Conventions
+
+Data visualizations are used to convey key takeaways from a research study's findings in a clear, succinct manner to highlight data trends, patterns, and/or relationships. In environmental health research, this is of particular importance for high-dimensional datasets that can typically be parsed using multiple methods, potentially resulting in many different approaches to visualize data. As a consequence, researchers are often faced with an overwhelming number of options when deciding which visualization scheme(s) most optimally translate their results for effective dissemination. Effective data visualization approaches are vital to a researcher's success for many reasons. For instance, manuscript readers or peer reviewers often scroll through a study's text and focus on the quality and novelty of study figures before deciding whether to read/review the paper. Therefore, the importance of data visualizations cannot be overstated in any research field.
+
+As a high-level introduction, it is important that we first communicate some traits that we think are imperative towards ensuring a successful data visualization approach as described in more detail below.
+
+Keys to successful data visualizations:
+
++ **Consider your audience, data type, and research question prior to selecting a figure to visualize your data**
+
+ For example, if more computationally complex methods are used in a manuscript that is intended for a journal whose audience doesn't have that same level of expertise, consider spending time focusing on how those results can be presented in an approachable way for that audience. For a review of how to choose a rudimentary chart based on the data type, check out [How to Choose the Right Data Visualization](https://www.atlassian.com/data/charts/how-to-choose-data-visualization). Some of these basic charts will be presented in this module, while more complex analysis-specific visualizations, especially ones developed for high-dimensional data, will be presented in later modules.
+
++ **Take the legibility of the figure into account**
+
+ This includes avoiding abbreviations when possible. (If they can't be avoided, explain them in the caption.) All titles should be capitalized, including titles for the legend(s) and axes. Underscores and periods between words should be replaced with spaces. Consider the legibility of the figure if printed in black and white. (However, that's not as important these days.) Lastly, feel free to describe your plot in further detail in the caption to aid the reader in understanding the results presented.
+
++ **Minimize text**
+
+ Main titles aren't necessary for single-paneled figures (like the examples below), because in a publication the figure title appears directly underneath each figure. It's good practice to remove this kind of extraneous text, which can make the figure seem cluttered. Titles can be helpful in multi-panel figures, especially if there are multiple panels of the same figure type that present slightly different results. For example, in the Test Your Knowledge section, you'll need to create two heatmaps, but one displays data under smoldering conditions and the other displays data under flaming conditions. In general, try to reduce the amount of extraneous text in a plot to keep the reader focused on the most important elements and takeaways.
+
++ **Use the minimal number of figures you need to support your narrative**
+
+ It is important to include an optimal number of figures within manuscripts and scientific reports. Too many figures might overwhelm the overall narrative, while too few might not provide enough substance to support your main findings. It can be helpful to also consider placing some figures in supplemental material to aid in the overall flow of your scientific writing.
+
++ **Select an appropriate color palette**
+
+ Packages have been developed to offer color palettes, including *MetBrewer* and *RColorBrewer*. In addition, *ggsci* is a package that offers a collection of color palettes used in various scientific journals. For more information on *MetBrewer*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/MetBrewer/index.html) and [example tutorial](https://github.com/BlakeRMills/MetBrewer). For more information on *RColorBrewer*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/RColorBrewer/index.html) and [example tutorial](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html). For more information on *ggsci*, see its associated [RDocumentation](https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html). In general, it's better to avoid bright and flashy colors that can be difficult to read.
+
+ It's advisable to use colors for manuscript figures that are color-blind friendly. Check out these [Stack overflow answers about color blind-safe color palettes and packages](https://stackoverflow.com/questions/57153428/r-plot-color-combinations-that-are-colorblind-accessible). Popular packages for generating colorblind-friendly palettes include [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) and [rcartocolor](https://github.com/Nowosad/rcartocolor).
+
++ **Use color strategically**
+
+ Color can be used to visualize a variable. There are three ways to categorize color schemes: sequential, diverging, and qualitative. Below, definitions are provided for each, along with example figures that we've previously published illustrating each color scheme. In addition, figure titles and captions are provided for context. Note that some of these figures have been simplified from what was originally published to show more streamlined examples for TAME.
+
+ - **Sequential**: intended for ordered categorical data (i.e., disease severity, Likert scale, quintiles). The choropleth map below is from [Winker, Payton et al.](https://doi.org/10.3389/fpubh.2024.1339700).
+```{r 3-2-Improving-Visualization-1, echo=FALSE, out.width = "65%", fig.align='center'}
+knitr::include_graphics("Chapter_3/3_2_Improving_Visualization/Module3_2_Image1.png")
+```
+
+**Figure 1. Geospatial distribution of the risk of future wildfire events across North Carolina.** Census tracts in North Carolina were binned into quintiles based on Wildfire Hazard Potential (WHP), with 1 (pale orange) having the lowest risk and 5 (dark red) having the highest risk. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
+
+ - **Diverging**: intended to emphasize continuous data at the extremes of the data range (typically using darker colors) and mid-range values (typically using lighter colors). This color scheme is ideal for charts like heatmaps. The heatmap below is from [Payton, Perryman et al.](https://doi.org/10.1152/ajplung.00299.2021).
+```{r 3-2-Improving-Visualization-2, echo=FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics("Chapter_3/3_2_Improving_Visualization/Module3_2_Image2.png")
+```
+
+**Figure 2. Individual cytokine expression levels across all subjects.** Cytokine concentrations were derived from nasal lavage fluid samples. On the x axis, subjects were ordered first according to tobacco use status, starting with non-smokers then cigarette smokers and e-cigarette users. Within tobacco use groups, subjects are ordered from lowest to highest average cytokine concentration from left to right. Within each cluster shown on the y axis, cytokines are ordered from lowest to highest average cytokine concentration from bottom to top. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
+
+ - **Qualitative**: intended for nominal categorical data to visualize clear differences between groups (i.e., soil types and exposure groups). The dendrogram below is from [Koval et al.](https://doi.org/10.1038/s41370-022-00451-8).
+```{r 3-2-Improving-Visualization-3, echo=FALSE, out.width = "75%", fig.align='center'}
+knitr::include_graphics("Chapter_3/3_2_Improving_Visualization/Module3_2_Image3.png")
+```
+
+**Figure 3. Translating chemical use inventory data to inform human exposure patterning.** Groups A-I illustrate the identified clusters of exposure source categories. Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
+
++ **Consider ordering axes to reveal patterns relevant to the research questions**
+
+ Ordering the axes can reveal potential patterns that may not be clear in the visualization otherwise. In the cytokine expression heatmap above, there are not clear differences in cytokine expression across the tobacco use groups. However, e-cigarette users seem to have slightly more muted responses compared to non-smokers and cigarette smokers in clusters B and C, which was corroborated in subsequent statistical analyses. It is also evident that Cluster A had the lowest cytokine concentrations, followed by Cluster B, and then Cluster C with the greatest concentrations.
+
+What makes these figures so compelling is how the aspects introduced above were thoughtfully incorporated. In the next section, we'll put those principles into practice using data that were described and referenced previously in **TAME 2.0 Module 3.1 Data Visualizations**.
+
+
+
+## Introduction to Training Module
+In this module, *ggplot2*, R's data visualization package, will be used to walk through ways to improve data visualizations. We'll recreate two figures (i.e., the boxplot and heatmap) constructed previously in **TAME 2.0 Module 3.1 Data Visualizations** and improve them so they are publication-ready. Additionally, we'll write figure titles and captions to contextualize the results presented for each visualization. When writing figure titles and captions, it is helpful to address the research question or overall concept that the figure seeks to capture rather than getting into the weeds of the specific methods the plot is based on. This is especially important when visualizing more complex methods that your audience might be less familiar with.
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 3-2-Improving-Visualization-4, echo=TRUE, eval=TRUE}
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you:
+```{r 3-2-Improving-Visualization-5, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
+if (!requireNamespace("MetBrewer"))
+ install.packages("MetBrewer");
+if (!requireNamespace("RColorBrewer"))
+ install.packages("RColorBrewer");
+if (!requireNamespace("pheatmap"))
+ install.packages("pheatmap");
+if (!requireNamespace("cowplot"))
+ install.packages("cowplot");
+```
+
+#### Loading required R packages
+```{r 3-2-Improving-Visualization-6, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
+library(tidyverse)
+library(MetBrewer)
+library(RColorBrewer)
+library(pheatmap)
+library(cowplot)
+```
+
+#### Set your working directory
+```{r 3-2-Improving-Visualization-7, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+Let's now read in our example dataset. As mentioned in the introduction, this example dataset represents chemical measurements across 12 different biomass burn scenarios, representing chemicals emitted during potential wildfire events. Let's upload and view these data:
+```{r 3-2-Improving-Visualization-8 }
+# Load the data
+smoke_data <- read.csv("Chapter_3/3_2_Improving_Visualization/Module3_2_InputData.csv")
+
+# View the top of the dataset
+head(smoke_data)
+```
+
+Now that we've been able to view the dataset, let's come up with questions that can be answered with our boxplot and heatmap figures. This will inform how we format the dataframe for visualization.
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Boxplot: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
+2. Heatmap: Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?
+3. How can these figures be combined into a single plot that can then be exported from R?
+
+#### Formatting dataframes for downstream visualization code
+First, format the dataframe by changing it from a wide to long format and normalizing the chemical concentration data. For more details on this data reshaping visit **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**.
+```{r 3-2-Improving-Visualization-9 }
+scaled_longer_smoke_data = pivot_longer(smoke_data, cols = 4:13, names_to = "Biomass_Burn_Condition",
+ values_to = "Chemical_Concentration") %>%
+ # scaling within each chemical
+ group_by(Chemical) %>%
+ mutate(Scaled_Chemical_Concentration = scale(Chemical_Concentration)) %>%
+ ungroup()
+
+head(scaled_longer_smoke_data)
+```
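+
+As a quick aside, `scale()` converts each chemical's concentrations to z-scores, so chemicals measured on very different ranges become comparable. A minimal sketch with made-up values:
+```{r 3-2-scale-sketch}
+# scale() computes (x - mean(x)) / sd(x); the values here are made up
+x <- c(10, 20, 30)
+as.vector(scale(x))   # -1 0 1
+(x - mean(x)) / sd(x) # the same z-scores computed by hand
+```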
+
+
+## Creating an Improved Boxplot Visualization
+
+As we did in the previous module, a boxplot will be constructed to answer the first environmental health question: **How do the distributions of the chemical concentration data differ based on each biomass burn scenario?** Let's remind ourselves of the original figure from the previous module.
+
+```{r 3-2-Improving-Visualization-10, fig.align = "center", echo = FALSE, fig.width = 7, fig.height = 5}
+ggplot(data = scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, color = Biomass_Burn_Condition)) +
+ geom_boxplot()
+```
+
+Based on the figure above, peat smoldering has the highest median scaled chemical concentration. However, this was difficult to determine given that the burn conditions aren't labeled on the x axis and a sequential color palette was used, making it difficult to identify the correct boxplot with its burn condition in the legend. If you look closely, the colors in the legend are in a reverse order of the colors assigned to the boxplots. Let's identify some elements of this graph that can be modified to make it easier to answer our research question.
+
+:::txtbx
+### There are four main aspects we can adjust on this figure:
+
+**1. The legibility of the text in the legend and axes.**
+
+Creating spaces between the text or exchanging the underscores for spaces improves the legibility of the figure.
+
+ **2. The order of the boxplots.**
+
+Ordering the biomass burn conditions from highest to lowest based on their median scaled chemical concentration allows the reader to easily determine the biomass burn condition that had the greatest or least chemical concentrations relative to each other. In R, this can be done by putting the `Biomass_Burn_Condition` variable into a factor.
+
+**3. Use of color.**
+
+Variables can be visualized using color, text, size, etc. In this figure, it is redundant to encode the biomass burn condition in both the legend and the color. Instead, this variable can be placed on the y axis and the legend removed to be more concise. The shades of the colors will also be changed; to keep the burn conditions visually distinct from one another, we will choose a qualitative color scheme.
+
+**4. Show all data points when possible.**
+
+Many journals now require that authors report every single value when making data visualizations, particularly for small *n* studies using bar graphs and boxplots to show results. Instead of just displaying the mean/median and surrounding data range, it is advised to show how every replicate landed in the study range when possible. Note that this requirement is not feasible for studies with larger sample sizes though should be considered for smaller *in vitro* and animal model studies.
+:::
+
+Let's start with addressing **#1: Legibility of Axis Text**. The legend title and axis titles can easily be changed with `ggplot()`, so that will be done later. To remove the underscore from the `Biomass_Burn_Condition` column, we can use the function `gsub()`, which will replace all of the underscores with spaces, resulting in a cleaner-looking graph.
+```{r 3-2-Improving-Visualization-11 }
+# First adding spaces between the biomass burn conditions
+scaled_longer_smoke_data = scaled_longer_smoke_data %>%
+ mutate(Biomass_Burn_Condition = gsub("_", " ", Biomass_Burn_Condition))
+
+# Viewing dataframe
+head(scaled_longer_smoke_data)
+```
+
+**#2. Reordering the boxplots based on the median scaled chemical concentration**.
+After calculating the median scaled chemical concentration for each biomass burn condition, the new dataframe will be arranged from lowest to highest median scaled concentration from the top of the dataframe to the bottom. This order will be saved in a vector, `median_biomass_order`. Although the biomass burn conditions are saved from lowest to highest concentration, `ggplot()` will plot them in reverse order with the highest concentration at the top and the lowest at the bottom of the y axis.
+
+Axis reordering can also be accomplished using `reorder` within the `ggplot()` function as described [here](https://guslipkin.medium.com/reordering-bar-and-column-charts-with-ggplot2-in-r-435fad1c643e) and [here](https://r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html).
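+
+As a brief sketch of that `reorder()` alternative (equivalent in outcome to the factor-based approach used in this module), the y axis can be ordered by median concentration directly inside `aes()`:
+```{r 3-2-reorder-sketch, eval=FALSE}
+# Hypothetical sketch: reorder() sorts the burn conditions by the median
+# of the scaled concentrations, without building a level vector first
+ggplot(scaled_longer_smoke_data,
+       aes(x = Scaled_Chemical_Concentration,
+           y = reorder(Biomass_Burn_Condition, Scaled_Chemical_Concentration,
+                       FUN = median))) +
+  geom_boxplot() +
+  ylab("Biomass Burn Condition")
+```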
+```{r 3-2-Improving-Visualization-12 }
+median_biomass = scaled_longer_smoke_data %>%
+ group_by(Biomass_Burn_Condition) %>%
+ summarize(Median_Concentration = median(Scaled_Chemical_Concentration)) %>%
+ # arranges dataframe from lowest to highest from top to bottom
+ arrange(Median_Concentration)
+
+head(median_biomass)
+
+# Saving that order
+median_biomass_order = median_biomass$Biomass_Burn_Condition
+```
+
+
+```{r 3-2-Improving-Visualization-13 }
+# Putting into factor to organize the burn conditions
+scaled_longer_smoke_data$Biomass_Burn_Condition = factor(scaled_longer_smoke_data$Biomass_Burn_Condition,
+ levels = median_biomass_order)
+
+# Final dataframe to be used for plotting
+head(scaled_longer_smoke_data)
+```
+
+Now that the dataframe has been finalized, we can plot the new boxplot. The next revision, **#3: Making Use of Color**, will be addressed with `ggplot()`. However, a palette can first be chosen from the *MetBrewer* package.
+```{r 3-2-Improving-Visualization-14 }
+# Choosing the "Juarez" palette from the `MetBrewer` package
+# `n = 12`, since there are 12 biomass burn conditions
+juarez_colors = met.brewer(name = "Juarez", n = 12)[1:12]
+```
+
+**#4. Show all data points when possible** will also be addressed with `ggplot()` by simply using `geom_point()`.
+```{r 3-2-Improving-Visualization-15, fig.align = "center", out.width = "75%", out.height = "75%"}
+FigureX1 = ggplot(scaled_longer_smoke_data, aes(x = Scaled_Chemical_Concentration, y = Biomass_Burn_Condition,
+ color = Biomass_Burn_Condition)) +
+ geom_boxplot() +
+ # jittering the points, so they're not all on top of each other and adding transparency
+ geom_point(position = position_jitter(h = 0.1), alpha = 0.7) +
+
+ theme_light() + # changing the theme
+ theme(axis.text = element_text(size = 9), # changing size of axis labels
+ axis.title = element_text(face = "bold", size = rel(1.5)), # changes axis titles
+ legend.position = "none") + # removes legend
+
+ xlab('Scaled Chemical Concentration (pg/uL)') + ylab('Biomass Burn Condition') + # changing axis labels
+ scale_color_manual(values = c(juarez_colors)) # changing the colors
+
+FigureX1
+```
+
+An appropriate title for this figure could be:
+
“**Figure X. Chemical concentration distributions of biomass burn conditions.** The boxplots are based on the scaled chemical concentration values, which used the raw chemical concentration values scaled within each chemical. The individual dots represent the concentrations of each chemical. The biomass burn conditions on the y axis are ordered from greatest (top) to least (bottom) based on median scaled chemical concentration."
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: How do the distributions of the chemical concentration data differ based on each biomass burn scenario?
+:::
+
+:::answer
+**Answer**: Smoldering peat has the highest median scaled chemical concentration, though the medians are comparable across all biomass burn conditions. With the exception of peat, each flaming condition has a higher median concentration and more overall variation than its respective smoldering condition.
+:::
+
+You may notice that the scaled chemical concentration was placed on the x axis and burn condition on the y axis, rather than vice versa. Longer category names are more legible when placed on the y axis.
+
+Other aspects of the figure were changed in the latest version, but those are minor compared to changing the order of the boxplots, revamping the text, and changing the usage of color. For example, the background was changed from gray to white. Figure backgrounds are generally white, because the figure is easier to read if the paper is printed in black and white. A plot's background can easily be changed to white in R using `theme_light()`, `theme_minimal()`, or `theme_bw()`. Posit provides a very helpful [GGplot2 cheat sheet](https://posit.co/resources/cheatsheets/?type=posit-cheatsheets&_page=2/) for changing a figure's parameters.
+
+
+
+## Creating an Improved Heatmap Visualization
+
+We'll use a heatmap to answer the second environmental health question: **Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?** Let's view the original heatmap from the previous module and find aspects of it that can be improved.
+```{r 3-2-Improving-Visualization-16, fig.align = "center", fig.width = 10, fig.height= 5}
+# Changing the biomass condition variable back to a character from a factor
+scaled_longer_smoke_data$Biomass_Burn_Condition = as.character(scaled_longer_smoke_data$Biomass_Burn_Condition)
+
+# Calculating the median value within each biomass burn condition and category
+scaled_longer_smoke_data = scaled_longer_smoke_data %>%
+ group_by(Biomass_Burn_Condition, Chemical.Category) %>%
+ mutate(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration))
+
+# Plotting
+ggplot(data = scaled_longer_smoke_data, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
+ fill = Median_Scaled_Concentration)) +
+ geom_tile() +
+ geom_text(aes(label = round(Median_Scaled_Concentration, 2))) # adding concentration values as text, rounding to two values after the decimal
+```
+
+From the figure above, it's clear that certain biomass burn conditions are associated with higher chemical concentrations for some of the chemical categories. For example, peat flaming exposure was associated with higher levels of inorganics and ions, while pine smoldering exposure was associated with higher levels of methoxyphenols. Although these are important findings, it is still difficult to determine if there are greater similarities in chemical profiles based on the biomass or the incineration temperature. Therefore, let's identify some elements of this chart that can be modified to make it easier to answer our research question.
+
+:::txtbx
+### There are three main aspects we can adjust on this figure:
+
+**1. The legibility of the text in the legend and axes.**
+Similar to what we did previously, we'll replace underscores and periods with spaces in the axis labels and titles.
+
+**2. The order of the axis labels.**
+Ordering the biomass burn condition and chemical category from highest to lowest based on their median scaled chemical concentration allows the reader to easily determine which biomass burn conditions had the greatest or least total chemical concentrations relative to each other. From the previous boxplot figure, biomass burn condition is already in this order; however, we still need to order the chemical category by putting the variable into a factor.
+
+**3. Use of color.**
+Notice that in the boxplot we used a qualitative palette, which is best for creating visual differences between distinct classes or groups. In this heatmap, we'll instead choose a diverging color palette, which uses two or more contrasting colors. A diverging palette highlights mid-range values with a lighter color and values at either extreme with darker colors, or vice versa.
+:::
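+
+For reference, a fully diverging fill scale can be sketched in *ggplot2* with `scale_fill_gradient2()`, which anchors a light midpoint (0 is a natural midpoint for scaled data) between two contrasting extremes. The colors below are arbitrary choices for illustration:
+```{r 3-2-diverging-sketch, eval=FALSE}
+# Hypothetical sketch of a diverging fill scale on the same data
+ggplot(scaled_longer_smoke_data,
+       aes(x = Chemical.Category, y = Biomass_Burn_Condition,
+           fill = Scaled_Chemical_Concentration)) +
+  geom_tile() +
+  scale_fill_gradient2(low = "darkblue", mid = "white", high = "darkred",
+                       midpoint = 0)
+```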
+
+**#1: Legibility of Text** can be addressed in `ggplot()` and so can **#2: Reordering the heatmap**.
+
+`Biomass_Burn_Condition` has already been reordered and put into a factor, but we need to do the same with `Chemical.Category`. Similar to before, median scaled chemical concentration for each chemical category will be calculated. However, this time the new dataframe will be arranged from highest to lowest median scaled concentration from the top of the dataframe to the bottom. `ggplot()` will plot them in the SAME order with the highest concentration on the left side and the lowest on the right side of the figure.
+```{r 3-2-Improving-Visualization-17 }
+# Order the chemical category by the median scaled chemical concentration
+median_chemical = scaled_longer_smoke_data %>%
+ group_by(Chemical.Category) %>%
+ summarize(Median_Concentration = median(Scaled_Chemical_Concentration)) %>%
+ arrange(-Median_Concentration)
+
+head(median_chemical)
+
+# Saving that order
+median_chemical_order = median_chemical$Chemical.Category
+```
+
+```{r 3-2-Improving-Visualization-18 }
+# Putting into factor to organize the chemical categories
+scaled_longer_smoke_data$Chemical.Category = factor(scaled_longer_smoke_data$Chemical.Category,
+ levels = median_chemical_order)
+
+# Putting burn conditions back into a factor to organize them
+scaled_longer_smoke_data$Biomass_Burn_Condition = factor(scaled_longer_smoke_data$Biomass_Burn_Condition,
+ levels = median_biomass_order)
+
+# Viewing the dataframe to be plotted
+head(scaled_longer_smoke_data)
+```
+
+Now that the dataframe has been finalized, we can plot the new heatmap. The final revision, **#3: Making Use of Color**, will be addressed with `ggplot()`. Here a palette is chosen from the *RColorBrewer* package.
+```{r 3-2-Improving-Visualization-19 }
+# Only 2 colors ('low' and 'high') are needed for the heatmap
+# `n = 8` generates a palette of 8 colors to choose from
+rcolorbrewer_colors = brewer.pal(n = 8, name = 'Accent')
+```
+
+
+```{r 3-2-Improving-Visualization-20, fig.align = "center", fig.width = 10, fig.height = 4}
+FigureX2 = ggplot(data = scaled_longer_smoke_data, aes(x = Chemical.Category, y = Biomass_Burn_Condition,
+ fill = Median_Scaled_Concentration)) +
+ geom_tile(color = 'white') + # adds white space between the tiles
+ geom_text(aes(label = round(Median_Scaled_Concentration, 2))) + # adding concentration values as text
+
+ theme_minimal() + # changing the theme
+ theme(axis.text = element_text(size = 9), # changing size of axis labels
+ axis.title = element_text(face = "bold", size = rel(1.5)), # changes axis titles
+ legend.title = element_text(face = 'bold', size = 10), # changes legend title
+ legend.text = element_text(size = 9)) + # changes legend text
+
+ labs(x = 'Chemical Category', y = 'Biomass Burn Condition',
+ fill = "Scaled Chemical\nConcentration (pg/mL)") + # changing axis labels
+ scale_fill_gradient(low = rcolorbrewer_colors[5], high = rcolorbrewer_colors[6]) # changing the colors
+
+FigureX2
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Chemical category concentrations across biomass burn conditions.** Scaled chemical concentration values are based on the raw chemical concentration values scaled within each chemical. Chemical category on the x axis is ordered from highest to lowest median concentration from left to right. Biomass burn condition on the y axis is ordered from the highest to lowest median concentration from top to bottom. The values in each tile represent the median scaled chemical concentration."
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Which classes of chemicals show the highest concentrations across the evaluated biomass burn conditions?
+:::
+
+:::answer
+**Answer**: Ordering the axes from highest to lowest concentration didn't organize the data as much as we would've liked, given the variance of chemical concentrations across the chemical categories. Nevertheless, it's still clear that peat flaming produces the highest concentration of inorganics and ions, peat smoldering of n-alkanes, eucalyptus smoldering of levoglucosan, pine smoldering of methoxyphenols, and pine flaming of PAHs. In addition, flaming conditions seem to have higher levels of inorganics and ions, while smoldering conditions seem to have higher levels of levoglucosan and PAHs.
+:::
+
+It would be helpful if there were a way to group these chemical profiles based on similarity, and that's where the `pheatmap()` function comes in handy, since it can be difficult to spot those patterns through visual inspection alone. Just for fun, let's briefly visualize a hierarchical clustering heatmap, which will group both the biomass burn conditions and chemical categories based on their chemical concentrations. In this module, we'll focus only on the `pheatmap()` visualization, but more information on hierarchical clustering can be found in **TAME 2.0 Module 5.5 Unsupervised Machine Learning II: Additional Clustering Applications**.
+
+As we showed in the previous module, this function requires a wide dataframe which we'll need to create. It will contain `Chemical.Category`, `Biomass_Burn_Condition` and `Scaled_Chemical_Concentration`.
+```{r 3-2-Improving-Visualization-21, message=FALSE}
+heatmap_df2 = scaled_longer_smoke_data %>%
+ group_by(Biomass_Burn_Condition, Chemical.Category) %>%
+ # using the summarize function instead of mutate function as was done previously since we only need the median values now
+ summarize(Median_Scaled_Concentration = median(Scaled_Chemical_Concentration)) %>%
+ # transforming the data to a wide format
+ pivot_wider(id_cols = Biomass_Burn_Condition, names_from = "Chemical.Category",
+ values_from = "Median_Scaled_Concentration") %>%
+ # converting the chemical names to row names
+ column_to_rownames(var = "Biomass_Burn_Condition")
+
+head(heatmap_df2)
+```
+
+Now let's generate the same heatmap this time using the `pheatmap()` function:
+```{r 3-2-Improving-Visualization-22, fig.align = "center"}
+# creating a color palette
+blue_pink_palette = colorRampPalette(c(rcolorbrewer_colors[5], rcolorbrewer_colors[6]))
+
+pheatmap(heatmap_df2,
+ # changing the color scheme
+ color = blue_pink_palette(40),
+ # hierarchical clustering of the biomass burn conditions
+ cluster_rows = TRUE,
+ # creating white space between the two largest clusters
+         cutree_rows = 2,
+ # adding the values for each cell and making those values black
+ display_numbers = TRUE, number_color = "black",
+ # changing the font size and the angle of the column names
+ fontsize = 12, angle_col = 45)
+```
+By incorporating the dendrogram into the visualization, it's easier to see that the chemical profiles have greater similarities within incineration temperatures rather than biomasses (with the exception of pine needles smoldering).
+
+
+
+## Creating Multi-Plot Figures
+We can combine figures using the `plot_grid()` function from the *cowplot* package. For additional information on the `plot_grid()` function and parameters that can be changed see [Arranging Plots in a Grid](https://wilkelab.org/cowplot/articles/plot_grid.html). Other packages with figure-combining capabilities include the *[patchwork](https://patchwork.data-imaginist.com/)* package and the [`grid.arrange()`](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html) function from the *gridExtra* package.
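+
+As a minimal sketch of the *patchwork* syntax mentioned above (assuming the `FigureX1` and `FigureX2` objects created earlier in this module), plots are combined with arithmetic-style operators:
+```{r 3-2-patchwork-sketch, eval=FALSE}
+# Hypothetical sketch: "|" places plots side by side, "/" stacks them
+library(patchwork)
+FigureX1 | FigureX2
+```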
+
+Figures can also be combined after they're exported from R using other applications, such as Microsoft PowerPoint and Adobe Acrobat.
+```{r 3-2-Improving-Visualization-23, fig.align = "center", fig.width = 20, fig.height = 6, fig.retina= 3 }
+FigureX = plot_grid(FigureX1, FigureX2,
+                    # Adding labels and changing their size and position
+ labels = "AUTO", label_size = 15, label_x = 0.04,
+ rel_widths = c(1, 1.5))
+FigureX
+```
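+
+To export the combined figure from R, `ggsave()` can be used. A quick sketch (the filename and dimensions here are illustrative choices):
+```{r 3-2-ggsave-sketch, eval=FALSE}
+# Saves the combined figure to a file at the chosen size and resolution;
+# "FigureX.png" is a hypothetical output filename
+ggsave("FigureX.png", plot = FigureX, width = 20, height = 6, dpi = 300)
+```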
+
+An appropriate title for this figure could be:
+
+“**Figure X. Chemical concentration distributions across biomass burn conditions.** (A) The boxplots are based on the scaled chemical concentration values, which used the raw chemical concentration values scaled within each chemical. The individual dots represent the concentrations of each chemical. The biomass burn conditions on the y axis are ordered from greatest (top) to least (bottom) based on median scaled chemical concentration. (B) The heatmap visualizes concentrations across chemical categories. Chemical category on the x axis is ordered from highest to lowest median concentration from left to right. Biomass burn condition on the y axis is ordered from highest to lowest median concentration from top to bottom. The values in each tile represent the median scaled chemical concentration."
+
+By putting these two figures side by side, it's now easier to compare the distributions of each biomass burn condition in figure A alongside the median chemical category concentrations in figure B that are responsible for the variation seen on the left.
+
+
+
+## Concluding Remarks
+In conclusion, this training module provided information and example code for improving, streamlining, and making *ggplot2* figures publication ready. Keep in mind that concepts and ideas presented in this module can be subjective and might need to be amended given the situation, dataset, and visualization.
+
+
+
+### Additional Resources
+
++ [Beginner's Guide to Data Visualizations](https://towardsdatascience.com/beginners-guide-to-enhancing-visualizations-in-r-9fa5a00927c9) and [Improving Data Visualizations in R](https://towardsdatascience.com/8-tips-for-better-data-visualization-2f7118e8a9f4)
++ [Generating Colors for Visualizations](https://blog.datawrapper.de/colorguide/)
++ [Additional Hands on Training](https://github.com/hbctraining/publication_perfect)
++ Brewer, Cynthia A. 1994. Color use guidelines for mapping and visualization. Chapter 7 (pp. 123-147) in Visualization in Modern Cartography
++ Hattab, G., Rhyne, T.-M., & Heider, D. (2020). Ten simple rules to colorize biological data visualization. PLOS Computational Biology, 16(10), e1008259. PMID: [33057327](https://doi.org/10.1371/journal.pcbi.1008259)
+
+Lastly, for researchers who are newer to R programming, [*ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/) is a package specifically designed to create publication-ready graphs similar to *ggplot2* with more concise syntax. This package is particularly useful for statistically relevant visualizations, which are further explored in later modules including, **TAME 2.0 Module 3.4 Introduction to Statistical Tests**, **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations**, and **TAME 2.0 Module 4.5 Multigroup Comparisons and Visualizations**.
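+
+As a brief sketch of *ggpubr*'s more concise syntax (assuming the dataframe from this module; the arguments shown are illustrative):
+```{r 3-2-ggpubr-sketch, eval=FALSE}
+# A boxplot with jittered points in a single call
+library(ggpubr)
+ggboxplot(scaled_longer_smoke_data,
+          x = "Biomass_Burn_Condition", y = "Scaled_Chemical_Concentration",
+          color = "Biomass_Burn_Condition", add = "jitter")
+```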
+
+
+
+
+
+:::tyk
+Replicate the figure below! The heatmap is the same as the "Test Your Knowledge" figure from **TAME 2.0 Module 3.1 Data Visualizations**. This time we'll focus on making the figure look more publication-ready by cleaning up the titles, cleaning up the labels, and changing the colors. The heatmap still visualizes the median chemical concentrations, but this time we're separating the burn conditions, allowing us to determine if the concentrations of chemicals released are contingent upon the burn condition.
+
+**Hint**: To view additional aspects of figures that can be changed in *ggplot2* check out this [GGPlot2 Cheat Sheet](https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf). It might come in handy!
+:::
+```{r 3-2-Improving-Visualization-24, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_3/3_2_Improving_Visualization/Module3_2_Image4.png")
+```
diff --git a/Chapter_3/Module3_2_Input/Module3_2_Image1.png b/Chapter_3/3_2_Improving_Visualization/Module3_2_Image1.png
similarity index 100%
rename from Chapter_3/Module3_2_Input/Module3_2_Image1.png
rename to Chapter_3/3_2_Improving_Visualization/Module3_2_Image1.png
diff --git a/Chapter_3/Module3_2_Input/Module3_2_Image2.png b/Chapter_3/3_2_Improving_Visualization/Module3_2_Image2.png
similarity index 100%
rename from Chapter_3/Module3_2_Input/Module3_2_Image2.png
rename to Chapter_3/3_2_Improving_Visualization/Module3_2_Image2.png
diff --git a/Chapter_3/Module3_2_Input/Module3_2_Image3.png b/Chapter_3/3_2_Improving_Visualization/Module3_2_Image3.png
similarity index 100%
rename from Chapter_3/Module3_2_Input/Module3_2_Image3.png
rename to Chapter_3/3_2_Improving_Visualization/Module3_2_Image3.png
diff --git a/Chapter_3/Module3_2_Input/Module3_2_Image4.png b/Chapter_3/3_2_Improving_Visualization/Module3_2_Image4.png
similarity index 100%
rename from Chapter_3/Module3_2_Input/Module3_2_Image4.png
rename to Chapter_3/3_2_Improving_Visualization/Module3_2_Image4.png
diff --git a/Chapter_3/Module3_2_Input/Module3_2_InputData.csv b/Chapter_3/3_2_Improving_Visualization/Module3_2_InputData.csv
similarity index 100%
rename from Chapter_3/Module3_2_Input/Module3_2_InputData.csv
rename to Chapter_3/3_2_Improving_Visualization/Module3_2_InputData.csv
diff --git a/Chapter_3/3_3_Normality_Tests/3_3_Normality_Tests.Rmd b/Chapter_3/3_3_Normality_Tests/3_3_Normality_Tests.Rmd
new file mode 100644
index 0000000..c23fff5
--- /dev/null
+++ b/Chapter_3/3_3_Normality_Tests/3_3_Normality_Tests.Rmd
@@ -0,0 +1,356 @@
+
+# 3.3 Normality Tests and Data Transformations
+
+This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+When selecting the appropriate statistical tests to evaluate potential trends in your data, selection often relies upon whether or not underlying data are normally distributed. Many statistical tests and methods that are commonly implemented in exposure science, toxicology, and environmental health research rely on assumptions of normality. Applying a statistical test intended for data with a specific distribution when your data do not fit within that distribution can generate unreliable results, with the potential for false positive and false negative findings. Thus, one of the most common statistical tests to perform at the beginning of an analysis is a test for normality.
+
+In this training module, we will:
+
++ Review the normal distribution and why it is important
++ Demonstrate how to test whether your variable distributions are normal...
+ + Qualitatively, with histograms and Q-Q plots
+ + Quantitatively, with the Shapiro-Wilk test
++ Discuss data transformation approaches
++ Demonstrate log~2~ data transformation for non-normal data
++ Discuss additional considerations related to normality
+
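+
+As a small, self-contained preview of the quantitative approach listed above (using simulated data rather than the wristband dataset):
+```{r 3-3-shapiro-sketch, eval=FALSE}
+# The Shapiro-Wilk test returns a p-value: small p-values suggest the
+# data deviate from a normal distribution
+set.seed(1)
+normal_vals <- rnorm(97)  # simulated normally distributed data
+skewed_vals <- rlnorm(97) # simulated right-skewed (log-normal) data
+shapiro.test(normal_vals) # large p-values are consistent with normality
+shapiro.test(skewed_vals) # small p-values suggest non-normality
+qqnorm(skewed_vals); qqline(skewed_vals) # skewed points bow away from the line
+```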
+We will demonstrate normality assessment using example data derived from a study in which chemical exposure profiles were collected across study participants through silicone wristbands. This exposure monitoring technique has been described through previous publications, including the following examples:
+
++ O'Connell SG, Kincl LD, Anderson KA. [Silicone wristbands as personal passive samplers](https://pubs.acs.org/doi/full/10.1021/es405022f). Environ Sci Technol. 2014 Mar 18;48(6):3327-35. doi: 10.1021/es405022f. Epub 2014 Feb 26. PMID: 24548134; PMCID: PMC3962070.
+
++ Kile ML, Scott RP, O'Connell SG, Lipscomb S, MacDonald M, McClelland M, Anderson KA. [Using silicone wristbands to evaluate preschool children's exposure to flame retardants](https://www.sciencedirect.com/science/article/pii/S0013935116300743). Environ Res. 2016 May;147:365-72. doi: 10.1016/j.envres.2016.02.034. Epub 2016 Mar 3. PMID: 26945619; PMCID: PMC4821754.
+
++ Hammel SC, Hoffman K, Phillips AL, Levasseur JL, Lorenzo AM, Webster TF, Stapleton HM. [Comparing the Use of Silicone Wristbands, Hand Wipes, And Dust to Evaluate Children's Exposure to Flame Retardants and Plasticizers](https://pubs.acs.org/doi/full/10.1021/acs.est.9b07909). Environ Sci Technol. 2020 Apr 7;54(7):4484-4494. doi: 10.1021/acs.est.9b07909. Epub 2020 Mar 11. PMID: 32122123; PMCID: PMC7430043.
+
++ Levasseur JL, Hammel SC, Hoffman K, Phillips AL, Zhang S, Ye X, Calafat AM, Webster TF, Stapleton HM. [Young children's exposure to phenols in the home: Associations between house dust, hand wipes, silicone wristbands, and urinary biomarkers](https://www.sciencedirect.com/science/article/pii/S0160412020322728). Environ Int. 2021 Feb;147:106317. doi: 10.1016/j.envint.2020.106317. Epub 2020 Dec 17. PMID: 33341585; PMCID: PMC7856225.
+
+
+In the current example dataset, chemical exposure profiles were obtained from the analysis of silicone wristbands worn by 97 participants for one week. Chemical concentrations on the wristbands were measured with gas chromatography-mass spectrometry. The subset of chemical data used in this training module comprises phthalates and phthalate alternatives, a group of chemicals used primarily in plastic products to increase flexibility and durability.
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 3-3-Normality-Tests-1 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.
+
+```{r 3-3-Normality-Tests-2, message = FALSE}
+if (!requireNamespace("openxlsx"))
+ install.packages("openxlsx");
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("ggpubr"))
+ install.packages("ggpubr");
+```
+
+#### Loading R packages required for this session
+```{r 3-3-Normality-Tests-3, message = FALSE}
+library(openxlsx) # for importing data
+library(tidyverse) # for manipulating and plotting data
+library(ggpubr) # for making Q-Q plots with ggplot
+```
+
+#### Set your working directory
+```{r 3-3-Normality-Tests-4, eval = FALSE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+```{r 3-3-Normality-Tests-5, message = FALSE}
+# Import data
+wrist_data <- read.xlsx("Chapter_3/3_3_Normality_Tests/Module3_3_InputData.xlsx")
+
+# Viewing the data
+head(wrist_data)
+```
+
+Our example dataset contains subject IDs (`S_ID`), subject ages, and measurements of 8 different phthalates and phthalate alternatives from silicone wristbands:
+
++ `DEP`: Diethyl phthalate
++ `DBP`: Dibutyl phthalate
++ `BBP`: Butyl benzyl phthalate
++ `DEHA`: Di(2-ethylhexyl) adipate
++ `DEHP`: Di(2-ethylhexyl) phthalate
++ `DEHT`: Di(2-ethylhexyl) terephthalate
++ `DINP`: Diisononyl phthalate
++ `TOTM`: Trioctyl trimellitate
+
+The units for the chemical data are nanograms of chemical per gram of silicone wristband (ng/g) per day the participant wore the wristband. One of the primary questions in this study was whether there were significant differences in chemical exposure between subjects with different levels of social stress or between subjects with differing demographic characteristics. However, before we can analyze the data for significant differences between groups, we first need to assess whether our numeric variables are normally distributed.
+
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are these data normally distributed?
+2. How does the distribution of data influence the statistical tests performed on the data?
+
+Before answering these questions, let's define normality and how to test for it in R.
+
+
+
+## What is a Normal Distribution?
+
+A normal distribution is a distribution of data in which values are distributed roughly symmetrically out from the mean such that 68.3% of values fall within one standard deviation of the mean, 95.4% of values fall within 2 standard deviations of the mean, and 99.7% of values fall within three standard deviations of the mean.
+```{r 3-3-Normality-Tests-6, out.width = "800px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_3/3_3_Normality_Tests/Module3_3_Image1.png")
+```
+
+Figure Credit: D Wells, CC BY-SA 4.0, via Wikimedia Commons
+
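These percentages can be checked directly in R using `pnorm()`, which returns the cumulative area under a normal curve (a quick sketch, not part of the module's analysis):

```{r 3-3-Normality-Tests-pnorm-check}
# Proportion of a normal distribution within 1, 2, and 3 SDs of the mean
pnorm(1) - pnorm(-1) # ~0.683
pnorm(2) - pnorm(-2) # ~0.954
pnorm(3) - pnorm(-3) # ~0.997
```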
+Common parametric statistical tests, such as t-tests, one-way ANOVAs, and Pearson correlations, rely on the assumption that data fall within the normal distribution for calculation of z-scores and p-values. Non-parametric tests, such as the Wilcoxon Rank Sum test, Kruskal-Wallis test, and Spearman Rank correlation, do not rely on assumptions about data distribution. Some of the aforementioned between-group comparisons are introduced in **TAME 2.0 Module 3.4 Introduction to Statistical Tests**. They, along with non-parametric tests, are explored further in later modules, including **TAME 2.0 Module 4.4 Two-Group Comparisons & Visualizations** and **TAME 2.0 Module 4.5 Multi-group Comparisons & Visualizations**.
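To make the distinction concrete, here is a minimal sketch (using simulated data, not the wristband dataset) that runs the same two-group comparison with a parametric t-test and its non-parametric counterpart, the Wilcoxon rank-sum test:

```{r 3-3-Normality-Tests-parametric-sketch}
set.seed(10)

# Simulated, approximately normal data for two groups
group_a <- rnorm(30, mean = 5, sd = 1)
group_b <- rnorm(30, mean = 6, sd = 1)

# Parametric: assumes normally distributed data
t.test(group_a, group_b)$p.value

# Non-parametric: rank-based, no normality assumption
wilcox.test(group_a, group_b)$p.value
```

Both tests return a p-value for the same comparison; which one is appropriate depends on the distribution of the underlying data.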
+
+
+
+## Qualitative Assessment of Normality
+
+We can begin by assessing the normality of our data through plots. For example, plotting data using [histograms](https://en.wikipedia.org/wiki/Histogram), [densities](https://www.data-to-viz.com/graph/density.html#:~:text=Definition,used%20in%20the%20same%20concept.), or [Q-Q plots](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) can graphically help inform if a variable’s values appear to be normally distributed or not. We will start with visualizing our data distributions with histograms.
+
+### Histograms
+
+Let's start with visualizing the distribution of the participants' ages using the `hist()` function that is part of base R.
+```{r 3-3-Normality-Tests-7, fig.align = 'center'}
+hist(wrist_data$Age)
+```
+
+We can edit some of the parameters to improve this basic histogram visualization. For example, we can decrease the size of each bin using the `breaks` argument:
+```{r 3-3-Normality-Tests-8, fig.align = 'center'}
+hist(wrist_data$Age, breaks = 10)
+```
+
+The `hist()` function is useful for plotting single distributions, but what if we have many variables that need normality assessment? We can leverage *ggplot2*'s powerful and flexible graphics functions such as `geom_histogram()` and `facet_wrap()` to inspect histograms of all of our variables in one figure panel. For more information on data manipulation in general, see **TAME 2.0 Module 2.3 Data Manipulation & Reshaping** and for more on *ggplot2* including the use of `facet_wrap()`, see **TAME 2.0 Module 3.2 Improving Data Visualizations**.
+
+First, we'll pivot our data to a longer format to prepare for plotting. Then, we'll make our plot. We can use the `theme_set()` function to set a default graphing theme for the rest of the script. A graphing theme represents a set of default formatting parameters (mostly colors) that ggplot will use to make your graphs. `theme_bw()` is a basic theme that includes a white background for the plot and dark grey axis text and minor axis lines. The theme that you use is a matter of personal preference. For more on the different themes available through *ggplot2*, see [here](https://ggplot2.tidyverse.org/reference/ggtheme.html).
+
+```{r 3-3-Normality-Tests-9, message = FALSE, fig.align = 'center'}
+# Pivot data longer to prepare for plotting
+wrist_data_long <- wrist_data %>%
+ pivot_longer(!S_ID, names_to = "variable", values_to = "value")
+
+# Set theme for graphing
+theme_set(theme_bw())
+
+# Make figure panel of histograms
+ggplot(wrist_data_long, aes(value)) +
+ geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
+ facet_wrap(~ variable, scales = "free") +
+ labs(y = "# of Observations", x = "Value")
+```
+
+From these histograms, we can see that our chemical variables do not appear to be normally distributed.
+
+### Q-Q Plots
+
+Q-Q (quantile-quantile) plots are another way to visually assess normality. Similar to the histogram above, we can create a single Q-Q plot for the age variable using base R functions. Normal Q-Q plots (Q-Q plots where the theoretical quantiles are based on a normal distribution) have theoretical quantiles on the x-axis and sample quantiles, representing the distribution of the variable of interest from the dataset, on the y-axis. If the variable of interest is normally distributed, the points on the graph will fall along the reference line.
+```{r 3-3-Normality-Tests-10, fig.align = 'center'}
+# Plot points
+qqnorm(wrist_data$Age)
+
+# Add a reference line for theoretically normally distributed data
+qqline(wrist_data$Age)
+```
+Small variations from the reference line, as seen above, are to be expected for the most extreme values. Overall, we can see that the age data are relatively normally distributed, as the points fall along the reference line.
+
+To make a figure panel with Q-Q plots for all of our variables of interest, we can use the `ggqqplot()` function within the *[ggpubr](https://rpkgs.datanovia.com/ggpubr/)* package. This function generates Q-Q plots and has arguments that are similar to *ggplot2*.
+```{r 3-3-Normality-Tests-11, fig.align = 'center'}
+ggqqplot(wrist_data_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
+```
+With this figure panel, we can see that the chemical data have very noticeable deviations from the reference line, suggesting non-normal distributions.
+
+To answer our first environmental health question, age appears to be the only normally distributed variable in our dataset: its histogram and Q-Q plot show data centered around the mean, with a symmetric spread on both the lower and upper sides that follows a typical normal distribution. The chemical concentrations, by contrast, appear to be non-normally distributed.
+
+Next, we will implement a quantitative approach to assessing normality, based on a statistical test for normality.
+
+
+
+## Quantitative Normality Assessment
+
+### Single Variable Normality Assessment
+
+We will use the Shapiro-Wilk test to quantitatively assess whether our data distribution is normal, again looking at the age data. This test can be carried out simply using the `shapiro.test()` function from the base R stats package. When using this test and interpreting its results, it is important to remember that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
+```{r 3-3-Normality-Tests-12 }
+shapiro.test(wrist_data$Age)
+```
+This test resulted in a p-value of 0.8143, so we cannot reject the null hypothesis (that data are normally distributed). This means that we can assume that age is normally distributed, which is consistent with our visualizations above.
+
+### Multiple Variable Normality Assessment
+
+With a large dataset containing many variables of interest (e.g., our example data with multiple chemicals), it is more efficient to test each column for normality and then store those results in a dataframe. We can use the base R function `apply()` to apply the Shapiro-Wilk test over all of the numeric columns of our dataframe. This function generates a list of results, with a list element for each variable tested. There are also other ways that you could iterate through each of your columns, such as a `for` loop or a function as discussed in **TAME 2.0 Module 2.4 Improving Coding Efficiencies**.
+```{r 3-3-Normality-Tests-13 }
+# Apply Shapiro Wilk test
+shapiro_res <- apply(wrist_data %>% select(-S_ID), 2, shapiro.test)
+
+# View first three list elements
+glimpse(shapiro_res[1:3])
+```
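As a minimal sketch of one such alternative (shown on simulated data rather than the wristband dataset), `sapply()` can return just the p-values as a named numeric vector, skipping the list step entirely:

```{r 3-3-Normality-Tests-sapply-sketch}
set.seed(42)

# Hypothetical example data: one roughly normal and one skewed variable
sim_data <- data.frame(normal_var = rnorm(50), skewed_var = rexp(50))

# Extract only the Shapiro-Wilk p-value from each column
p_values <- sapply(sim_data, function(x) shapiro.test(x)$p.value)
p_values
```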
+
+We can then convert those list results into a dataframe. Each variable is now in a row, with columns describing outputs of the statistical test.
+```{r 3-3-Normality-Tests-14 }
+# Create results dataframe
+shapiro_res <- do.call(rbind.data.frame, shapiro_res)
+
+# View results dataframe
+shapiro_res
+```
+
+Finally, we can clean up our results dataframe and add a column that will quickly tell us whether our variables are normally or non-normally distributed based on the Shapiro-Wilk normality test results.
+```{r 3-3-Normality-Tests-15 }
+# Clean dataframe
+shapiro_res <- shapiro_res %>%
+
+ # Add normality conclusion
+ mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
+
+ # Remove columns that do not contain informative data
+ select(c(p.value, normal))
+
+# View cleaned up dataframe
+shapiro_res
+```
+
+The results from the Shapiro-Wilk test demonstrate that age data are normally distributed, while the chemical concentration data are non-normally distributed. These results support the conclusions we made based on our qualitative assessment above with histograms and Q-Q plots.
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can now answer **Environmental Health Question #1***: Are these data normally distributed?
+:::
+
+:::answer
+**Answer:** Age is normally distributed, while chemical concentrations are non-normally distributed.
+:::
+
+### Answer to Environmental Health Question 2
+:::question
+*We can also answer **Environmental Health Question #2***: How does the distribution of data influence the statistical tests performed on the data?
+:::
+
+:::answer
+**Answer:** Parametric statistical tests should be used when analyzing the age data, and non-parametric tests should be used when analyzing the chemical concentration data.
+:::
+
+
+
+## Data Transformation
+
+There are a number of approaches that can be used to change the range and/or distribution of values within each variable. Typically, the purpose for applying these changes is to reduce bias in a dataset, remove known sources of variation, or prepare data for specific downstream analyses. The following are general definitions for common terms used when discussing these changes:
+
++ **Transformation** refers to any process used to change data into other, related values. Normalization and standardization are types of data transformation. Transformation can also refer to performing the same mathematical operation on every value in your dataframe. For example, taking the log~2~ or log~10~ of every value is referred to as log transformation.
+
+ + **Normalization** is the process of transforming variables so that they are on a similar scale and therefore are comparable. This can be important when variables in a dataset contain a mixture of data types that are represented by vastly different numeric magnitudes or when there are known sources of variability across samples. Normalization methods are highly dependent on the type of input data. One example of normalization is min-max scaling, which results in a range for each variable of 0 to 1. Although normalization in computational methodologies typically refers to min-max scaling or other similar methods where the variable's range is bounded by specific values, wet-bench approaches also employ normalization - for example, using a reference gene for RT-qPCR assays or dividing a total protein amount for each sample by the volume of each sample to obtain a concentration.
+
+ + **Standardization**, also known as Z-score normalization, is a specific type of normalization that involves subtracting each value from the mean of that variable and dividing by that variable's standard deviation. The standardized values for each variable will have a mean of 0 and a standard deviation of 1. The `scale()` function in R performs standardization by default when the data are centered (argument `center = TRUE` is included within the scale function).
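A small sketch of these definitions on a toy vector (illustrative values only, not the study data):

```{r 3-3-Normality-Tests-scaling-sketch}
x <- c(2, 4, 6, 8, 10)

# Min-max scaling (normalization): values bounded between 0 and 1
x_minmax <- (x - min(x)) / (max(x) - min(x))
x_minmax

# Standardization (Z-score normalization): mean of 0, standard deviation of 1
x_z <- as.numeric(scale(x, center = TRUE, scale = TRUE))
mean(x_z)
sd(x_z)
```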
+
+### Transformation of example data
+
+When data are non-normally distributed, such as with the chemical concentrations in our example dataset, it may be desirable to transform the data so that the distribution becomes closer to a normal distribution, particularly if there are only parametric tests available to test your hypothesis. A common transformation used in environmental health research is log~2~ transformation, in which data are transformed by taking the log~2~ of each value in the dataframe.
+
+Let's log~2~ transform our chemical data and examine the resulting histograms and Q-Q plots to qualitatively assess whether data appear more normal following transformation. We will apply a pseudo-log~2~ transformation, where we will add 1 to each value before log~2~ transforming so that all resulting values are positive and any zeroes in the dataframe do not return -Inf.
+```{r 3-3-Normality-Tests-16, fig.align = 'center'}
+# Apply pseudo-log2 (pslog2) transformation to chemical data
+wrist_data_pslog2 <- wrist_data %>%
+ mutate(across(DEP:TOTM, ~ log2(.x + 1)))
+
+# Pivot data longer
+wrist_data_pslog2_long <- wrist_data_pslog2 %>%
+ pivot_longer(!S_ID, names_to = "variable", values_to = "value")
+
+# Make figure panel of histograms
+ggplot(wrist_data_pslog2_long, aes(value)) +
+ geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
+ facet_wrap(~ variable, scales = "free") +
+ labs(y = "# of Observations", x = "Value")
+```
+
+```{r 3-3-Normality-Tests-17, fig.align = 'center'}
+# Make a figure panel of Q-Q plots
+ggqqplot(wrist_data_pslog2_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
+```
+
+Both the histograms and the Q-Q plots demonstrate that our log~2~ transformed data are more normally distributed than the raw data graphed above. Let's apply the Shapiro-Wilk test to our log~2~ transformed data to determine if the chemical distributions are normally distributed.
+```{r 3-3-Normality-Tests-18 }
+# Apply Shapiro Wilk test
+shapiro_res_pslog2 <- apply(wrist_data_pslog2 %>% select(-S_ID), 2, shapiro.test)
+
+# Create results dataframe
+shapiro_res_pslog2 <- do.call(rbind.data.frame, shapiro_res_pslog2)
+
+# Clean dataframe
+shapiro_res_pslog2 <- shapiro_res_pslog2 %>%
+
+ # Add normality conclusion
+ mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
+
+ # Remove columns that do not contain informative data
+ select(c(p.value, normal))
+
+# View cleaned up dataframe
+shapiro_res_pslog2
+```
+
+The results from the Shapiro-Wilk test demonstrate that the log~2~ chemical concentration data are more normally distributed than the raw data. Overall, the p-values, even for the chemicals that are still non-normally distributed, are much higher, and only 2 out of the 8 chemicals are non-normally distributed by the Shapiro-Wilk test. We can also calculate average p-values across all variables for our raw and log~2~ transformed data to further demonstrate this point.
+```{r 3-3-Normality-Tests-19 }
+# Calculate the mean Shapiro-Wilk p-value for the raw chemical data
+mean(shapiro_res$p.value)
+
+# Calculate the mean Shapiro-Wilk p-value for the pslog2 transformed chemical data
+mean(shapiro_res_pslog2$p.value)
+```
+
+Therefore, the log~2~ chemical data would be most appropriate to use if researchers are wanting to perform parametric statistical testing (and particularly if there is not a non-parametric statistical test for a given experimental design). It is important to note that if you proceed to statistical testing using log~2~ or other transformed data, graphs you make of significant results should use the transformed values on the y-axis, and findings should be interpreted in the context of the transformed values.
+
+
+
+## Additional Considerations Regarding Normality
+
+The following sections detail additional considerations regarding normality. Similar to other advice in TAME, appropriate methods for handling normality assessment and normal versus non-normal data can be dependent on your field, lab, endpoints of interest, and downstream analyses. We encourage you to take those elements of your study into account, alongside the guidance provided here, when assessing normality. Regardless of the specific steps you take, be sure to report normality assessment steps and the data transformation or statistical test decisions you make based on them in your final report or manuscript.
+
+#### Determining which data should go through normality testing:
+
+Values for all samples (rows) that will be going into statistical testing should be tested for normality. If you are only going to be statistically testing a subset of your data, perform the normality test on that subset. Another way to think of this is that data points that are on the same graph together and/or that have been used as input for a statistical test should be tested for normality together.
+
+#### Analyzing datasets with a mixture of normally and non-normally distributed variables:
+
+There are a couple of different routes you can pursue if you have a mixture of normally and non-normally distributed variables in your dataframe:
+
++ Perform parametric statistical tests on the normally distributed variables and non-parametric tests on the non-normally distributed variables.
++ Perform the statistical test across all variables that fits with the majority of the variable distributions in your dataset.
+
+Our preference is to perform one test across all variables of the same data type/endpoint (e.g., all chemical concentrations, all cytokine concentrations). Aim to choose an approach that fits *best* rather than *perfectly*.
+
+#### Improving efficiency for normality assessment:
+
+If you find yourself frequently performing the same normality assessment workflow, consider writing a function that will execute each normality testing step (making a histogram, making a Q-Q plot, determining Shapiro-Wilk normality variable by variable, and determining the average Shapiro-Wilk p-value across all variables) and store the results in a list for easy inspection.
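One minimal sketch of such a helper is below; the function name and exact outputs are illustrative assumptions, not part of the module's workflow, and the input is assumed to be a single numeric vector:

```{r 3-3-Normality-Tests-helper-sketch}
# Hypothetical helper: plots plus Shapiro-Wilk results in one call
assess_normality <- function(x, var_name = "variable") {
  hist(x, main = paste("Histogram of", var_name))
  qqnorm(x, main = paste("Q-Q plot of", var_name))
  qqline(x)
  sw <- shapiro.test(x)
  list(variable = var_name,
       shapiro_p = sw$p.value,
       normal = sw$p.value >= 0.05)
}

# Example call on simulated data
set.seed(1)
assess_normality(rnorm(100), "simulated_normal")
```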
+
+
+
+## Concluding Remarks
+
+In conclusion, this training module serves as an introduction to and step-by-step tutorial for normality assessment and data transformations. Approaches described in this training module include visualizations to qualitatively assess normality, statistical tests to quantitatively assess normality, data transformation, and other distribution considerations relating to normality. These methods are an important step in data characterization and exploration prior to downstream analyses and statistical testing, and they can be applied to nearly all studies carried out in environmental health research.
+
+### Additional Resources
+
++ [Descriptive Statistics and Normality Tests for Statistical Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/)
++ [STHDA Normality Test in R](https://www.datanovia.com/en/lessons/normality-test-in-r/)
++ [Normalization vs. Standardization](https://www.geeksforgeeks.org/normalization-vs-standardization/)
+
+
+
+
+
+:::tyk
+Use the input file provided ("Module3_3_TYKInput.xlsx"), which represents a similar dataset to the one used in the module, to answer the following questions:
+
+1. Are any variables normally distributed in the raw data?
+2. Does pseudo log~2~ transforming the values make the distributions overall more or less normally distributed?
+3. What are the average Shapiro-Wilk p-values for the raw and pseudo log~2~ transformed data?
+:::
diff --git a/Chapter_3/Module3_3_Input/Module3_3_Image1.png b/Chapter_3/3_3_Normality_Tests/Module3_3_Image1.png
similarity index 100%
rename from Chapter_3/Module3_3_Input/Module3_3_Image1.png
rename to Chapter_3/3_3_Normality_Tests/Module3_3_Image1.png
diff --git a/Chapter_3/Module3_3_Input/Module3_3_InputData.xlsx b/Chapter_3/3_3_Normality_Tests/Module3_3_InputData.xlsx
similarity index 100%
rename from Chapter_3/Module3_3_Input/Module3_3_InputData.xlsx
rename to Chapter_3/3_3_Normality_Tests/Module3_3_InputData.xlsx
diff --git a/Chapter_3/3_4_Statistical_Tests/3_4_Statistical_Tests.Rmd b/Chapter_3/3_4_Statistical_Tests/3_4_Statistical_Tests.Rmd
new file mode 100644
index 0000000..d72f921
--- /dev/null
+++ b/Chapter_3/3_4_Statistical_Tests/3_4_Statistical_Tests.Rmd
@@ -0,0 +1,456 @@
+
+# 3.4 Introduction to Statistical Tests
+
+This training module was developed by Alexis Payton, Kyle Roell, Elise Hickman, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+This training module provides a brief introduction to some of the most commonly implemented statistics and associated visualizations used in exposure science, toxicology, and environmental health studies. This module first uploads an example dataset that is similar to the data used in **TAME 2.0 Module 2.3 Data Manipulation & Reshaping**, though it includes some expanded subject information data to allow for more example statistical tests. Then, methods to evaluate data normality are presented, including visualization-based and statistical-based approaches.
+
+Basic statistical tests discussed in this module include:
+
++ T-test
++ Analysis of Variance (ANOVA) with a Tukey's Post-Hoc test
++ Regression Modeling (Linear and Logistic)
++ Chi-squared test
++ Fisher’s exact test
+
+These statistical tests are relatively simple; more extensive examples and associated descriptions of statistical models are provided in the subsequent applications-based training modules:
+
++ TAME 2.0 Module 4.4 Two-Group Comparisons & Visualizations
++ TAME 2.0 Module 4.5 Multi-Group Comparisons & Visualizations
++ TAME 2.0 Module 4.6 Advanced Multi-Group Comparisons & Visualizations
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 3-4-Statistical-Tests-1 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.
+```{r 3-4-Statistical-Tests-2, results=FALSE, message=FALSE}
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("car"))
+ install.packages("car");
+if (!requireNamespace("ggpubr"))
+ install.packages("ggpubr");
+if(!requireNamespace("effects"))
+ install.packages("effects");
+```
+
+#### Loading R packages required for this session
+```{r 3-4-Statistical-Tests-3, results=FALSE, message=FALSE}
+library(tidyverse) # all tidyverse packages, including dplyr and ggplot2
+library(car) # package for statistical tests
+library(ggpubr) # ggplot2 based plots
+library(effects) # for linear modeling
+```
+
+#### Set your working directory
+```{r 3-4-Statistical-Tests-4, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example datasets
+
+Let's read in our example dataset. Note that these data are similar to those used previously, except that demographic and chemical measurement data were previously merged, and a few additional columns of subject information/demographics were added to serve as more thorough examples of data for use in this training module.
+```{r 3-4-Statistical-Tests-5 }
+# Loading data
+full.data <- read.csv("Chapter_3/3_4_Statistical_Tests/Module3_4_InputData.csv")
+```
+
+Let's view the first 10 rows of the first 9 columns of data in this dataframe:
+```{r 3-4-Statistical-Tests-6 }
+full.data[1:10,1:9]
+```
+
+These represent the subject information/demographic data, which include the following columns:
+
++ `ID`: subject number
++ `BMI`: body mass index
++ `BMIcat`: BMI <= 18.5 binned as "Underweight", 18.5 < BMI <= 24.5 binned as "Normal", BMI > 24.5 binned as "Overweight"
++ `MAge`: maternal age in years
++ `MEdu`: maternal education level; "No_HS_Degree" = "less than high school", "No_College_Degree" = "high school or some college", "College_Degree" = "college or greater"
++ `BW`: body weight in grams
++ `GA`: gestational age in weeks
++ `Smoker`: "NS" = non-smoker, "S" = smoker
++ `Smoker3`: "Never", "Former", or "Current" smoking status
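As an aside, a binned variable like `BMIcat` can be derived from continuous `BMI` values using `cut()`. This sketch uses made-up BMI values, since the input data already include `BMIcat`:

```{r 3-4-Statistical-Tests-cut-sketch}
bmi_example <- c(17.2, 22.8, 30.1)

# Bins match the BMIcat definition above: <= 18.5, (18.5, 24.5], > 24.5
cut(bmi_example,
    breaks = c(-Inf, 18.5, 24.5, Inf),
    labels = c("Underweight", "Normal", "Overweight"))
```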
+
+Let's now view the remaining columns (columns 10-15) in this dataframe:
+```{r 3-4-Statistical-Tests-7 }
+full.data[1:10,10:15]
+```
+
+These columns represent the environmental exposure measures, including:
+
++ `DWAs`: drinking water arsenic levels in µg/L
++ `DWCd`: drinking water cadmium levels in µg/L
++ `DWCr`: drinking water chromium levels in µg/L
++ `UAs`: urinary arsenic levels in µg/L
++ `UCd`: urinary cadmium levels in µg/L
++ `UCr`: urinary chromium levels in µg/L
+
+
+Now that the script is prepared and the data are uploaded, we can start by asking some initial questions about the data that can be answered by running some basic statistical tests and visualizations.
+
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are there statistically significant differences in BMI between non-smokers and smokers?
+2. Are there statistically significant differences in BMI between current, former, and never smokers?
+3. Is there a relationship between maternal BMI and birth weight?
+4. Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight?
+5. Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
+6. Is there a relationship between smoking status and BMI?
+
+
+
+## Assessing Normality & Homogeneity of Variance
+Statistical test selection often relies upon whether the underlying data are normally distributed and whether the variance across groups is the same (homogeneity of variance). Many statistical tests and methods that are commonly implemented in exposure science, toxicology, and environmental health research rely on assumptions of normality. Thus, one of the most common statistical tests to perform at the beginning of an analysis is a **test for normality**.
+
+As discussed in the previous module, there are a few ways to evaluate the normality of a dataset:
+
+*First*, you can visually gauge whether a dataset appears to be normally distributed through plots. For example, plotting data using histograms, densities, or Q-Q plots can graphically help inform if a variable's values appear to be normally distributed or not.
+
+*Second*, you can evaluate normality using statistical tests, such as the **Kolmogorov-Smirnov (K-S) test** and **Shapiro-Wilk test**. When using these tests and interpreting their results, it is important to remember that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
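The Shapiro-Wilk test is applied to the BMI data later in this module; as a hedged sketch of the K-S test (run here on simulated data, since it is not applied to the study data), `ks.test()` can compare a sample against a reference normal distribution:

```{r 3-4-Statistical-Tests-ks-sketch}
set.seed(123)
sim_vals <- rnorm(100, mean = 25, sd = 4)

# Compare the sample to a normal distribution with matching mean and SD
# (strictly, estimating these parameters from the same sample calls for a
# Lilliefors-corrected version of the test)
ks.test(sim_vals, "pnorm", mean = mean(sim_vals), sd = sd(sim_vals))
```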
+
+
+
+Let's start with the first approach based on data visualizations. In this module, we'll primarily be generating figures using the ***ggpubr*** package, which is specifically designed to generate ggplot2-based figures using more streamlined coding syntax. In addition, this package has statistical parameters for plotting that are useful for basic statistical analysis, especially for people with introductory experience plotting in R. For further documentation on *ggpubr*, click [here](https://jtr13.github.io/cc20/brief-introduction-and-tutorial-of-ggpubr-package.html).
+
+Let's begin with a [histogram](https://en.wikipedia.org/wiki/Histogram) to view the distribution of BMI data using the `gghistogram()` function from the *ggpubr* package:
+```{r 3-4-Statistical-Tests-8, fig.width=5, fig.height=4, fig.align = 'center'}
+gghistogram(data = full.data, x = "BMI", bins = 20)
+```
+
+Let's also view the [Q–Q (quantile-quantile) plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) using the `ggqqplot()` function, also from the *ggpubr* package:
+```{r 3-4-Statistical-Tests-9, fig.width=5, fig.height=5, fig.align = 'center'}
+ggqqplot(full.data$BMI, ylab = "BMI")
+```
+
+From these visualizations, the BMI variable appears to be normally distributed, with data centered in the middle and spreading with a distribution on both the lower and upper sides that follow typical normal data distributions.
+
+
+
+Let's now implement the second approach based on statistical tests for normality. Here, let's use the [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) as an example, again looking at the BMI data.
+```{r 3-4-Statistical-Tests-10 }
+shapiro.test(full.data$BMI)
+```
+
+This test resulted in a p-value of 0.3773, so we cannot reject the null hypothesis (that the BMI data are normally distributed). These findings support the assumption that these data are normally distributed.
+
+Next, we'll assess homogeneity of variance using Levene's test. This will be done using the `leveneTest()` function from the *car* package:
+```{r 3-4-Statistical-Tests-11 }
+# First convert the smoker variable to a factor
+full.data$Smoker = factor(full.data$Smoker, levels = c("NS", "S"))
+leveneTest(BMI ~ Smoker, data = full.data)
+```
+The p-value (`Pr(>F)`) is 0.6086, indicating that the variance in BMI across the smoking groups is the same. Therefore, the assumptions of a t-test, including normality and homogeneity of variance, have been met.
+
+
+
+## Two-Group Visualizations and Statistical Comparisons using the T-Test
+T-tests are commonly used to test for a significant difference between the means of two groups in normally distributed data. In this example, we will be answering **Environmental Health Question 1**: Are there statistically significant differences in BMI between non-smokers and smokers?
+
+We will specifically implement a two-sample t-test (also known as an independent samples t-test).
+
+Let’s first visualize the BMI data across these two groups using boxplots:
+```{r 3-4-Statistical-Tests-12, fig.width=5, fig.height=4, fig.align = 'center'}
+ggboxplot(data = full.data, x = "Smoker", y = "BMI")
+```
+
+From this plot, it looks like non-smokers (labeled "NS") *may* have higher BMI than smokers (labeled "S"), though we need a statistical test to more rigorously evaluate this apparent trend.
+
+It is easy to perform a t-test on these data using the `t.test()` function from the base R stats package:
+```{r 3-4-Statistical-Tests-13 }
+t.test(data = full.data, BMI ~ Smoker)
+```
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: Are there statistically significant differences in BMI between non-smokers and smokers?
+:::
+
+:::answer
+**Answer**: From this statistical output, we can see that the overall mean BMI in non-smokers (group "NS") is 26.1, and the overall mean BMI in smokers (group "S") is 23.4. We can also see that the resulting p-value comparison between the means of these two groups is, indeed, significant (p-value = 0.013), meaning that the means between these groups are significantly different (i.e., are not equal).
+:::
+
+It's also helpful to save these results into a variable within the R global environment, which then allows us to access specific output values and extract them more easily for our records. For example, we can run the following to specifically extract the resulting p-value from this test:
+```{r 3-4-Statistical-Tests-14, fig.align = 'center'}
+ttest.res <- t.test(data = full.data, BMI ~ Smoker) # making a list in the R global environment with the statistical results
+signif(ttest.res$p.value, 2) # pulling the p-value and using the `signif` function to round to 2 significant figures
+```
+
+
+
+## Three-Group Visualizations and Statistical Comparisons using an ANOVA
+Analysis of Variance (ANOVA) is a statistical method that can be used to compare means across three or more groups in normally distributed data. To demonstrate an ANOVA test on this dataset, let's answer **Environmental Health Question 2**: Are there statistically significant differences in BMI between current, former, and never smokers? To do this, we'll use the `Smoker3` variable from our dataset.
+
+Let's again start by viewing these data distributions using a boxplot:
+```{r 3-4-Statistical-Tests-15, fig.align = 'center'}
+ggboxplot(data = full.data, x = "Smoker3", y = "BMI")
+```
+
+From this cursory review, it looks like current smokers likely demonstrate different BMI measures than former and never smokers, though we need statistical tests to verify this potential trend and to evaluate potential differences (or lack thereof) between former and never smokers.
+
+Let’s now run the ANOVA to compare BMI between smoking groups, using the `aov()` function to fit an ANOVA model:
+```{r 3-4-Statistical-Tests-16 }
+smoker_anova = aov(data = full.data, BMI ~ Smoker3)
+smoker_anova
+```
+
+We need to extract the typical ANOVA results table using either the `summary()` or `anova()` function on the resulting fitted object:
+```{r 3-4-Statistical-Tests-17 }
+anova(smoker_anova)
+```
+
+This table outputs a lot of information, including the `F value` referring to the resulting F-statistic, `Pr(>F)` referring to the p-value of the F-statistic, and other values that are described in detail through other available resources including this [helpful video](https://online.stat.psu.edu/stat485/lesson/12/12.2) through PennState's statistics online resources.
+
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Are there statistically significant differences in BMI between current, former, and never smokers?
+:::
+
+:::answer
+**Answer**: From this ANOVA output table, we can conclude that the group means across all three groups are not equal, given that the p-value, written as `Pr(>F)`, is significant (p-value = 5.88 x 10^-12^). However, this doesn't tell us which groups differ from each other, which is where post hoc tests like Tukey's are useful.
+:::
+
+Let's run a Tukey's post hoc test using the `TukeyHSD()` function in base R to determine which of the current, former, and never smokers have significant differences in BMI:
+```{r 3-4-Statistical-Tests-18 }
+smoker_tukey = TukeyHSD(smoker_anova)
+smoker_tukey
+```
+
+Although the above Tukey object contains a column `p adj`, those are the raw unadjusted p values. It is common practice to adjust p values from multiple comparisons to reduce the chance of reporting false positives, i.e., significant differences that don't actually exist ([Feise, 2002](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-2-8#:~:text=Thus%2C%20the%20main%20benefit%20of,exists%20%5B10%E2%80%9321%5D.)). Several methods can be used to adjust p values, including the Bonferroni and the Benjamini & Hochberg approaches.
+
+For this example, we'll use the `p.adjust()` function to obtain the Benjamini & Hochberg adjusted p values. Check out the associated [RDocumentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust) to discover other methods that can be used to adjust p values using the `p.adjust()` function:
+```{r 3-4-Statistical-Tests-19 }
+# First converting the Tukey object into a dataframe
+smoker_tukey_df = data.frame(smoker_tukey$Smoker3) %>%
+ # renaming the `p adj` to `P Value` for clarity
+ rename(`P Value` = p.adj)
+
+# Adding a column with the adjusted p values
+smoker_tukey_df$`P Adj` = p.adjust(smoker_tukey_df$`P Value`, method = "fdr")
+smoker_tukey_df
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*We can use this additional information to further answer **Environmental Health Question #2***: Are there statistically significant differences in BMI between current, former, and never smokers?
+:::
+
+:::answer
+**Answer**: Current smokers have significantly lower BMIs than people who have never smoked and people who have formerly smoked. This is evident from the 95% confidence intervals (`lwr` and `upr`), which do not cross 0, and the p values, which remain less than 0.05 even after adjustment.
+:::
+
+
+
+## Regression Modeling and Visualization: Linear and Logistic Regressions
+Regression modeling aims to find a relationship between a dependent variable (or outcome, response, y) and an independent variable (or predictor, explanatory variable, x). There are many forms of regression analysis, but here we will focus on two: linear regression and logistic regression.
+
+In brief, **linear regression** is generally used when you have a continuous dependent variable and there is assumed to be some sort of linear relationship between the dependent and independent variables. Conversely, **logistic regression** is often used when the dependent variable is dichotomous.
+
+Let's first run through an example linear regression model to answer **Environmental Health Question 3**: Is there a relationship between maternal BMI and birth weight?
+
+### Linear Regression
+We will first visualize the data and run a simple correlation analysis to evaluate whether these data are generally correlated. Then, we will run a linear regression to evaluate the relationship between these variables in more detail.
+
+
+Plotting the variables against one another and adding a linear regression line using the `ggscatter()` function from the *ggpubr* package:
+```{r 3-4-Statistical-Tests-20, fig.align = 'center'}
+ggscatter(full.data, x = "BMI", y = "BW",
+ # Adding a linear regression line with the 95% confidence interval as the shaded region
+ add = "reg.line", conf.int = TRUE,
+ # Customize reg. line
+ add.params = list(color = "blue", fill = "lightgray"),
+ # Adding Pearson's correlation coefficient
+ cor.coef = TRUE, cor.method = "pearson", cor.coeff.args = list(label.sep = "\n"))
+```
+
+We can also run a basic correlation analysis between these two variables using the `cor.test()` function. This function uses Pearson's correlation test by default, which we can implement here given the previously discussed assumption of normality for this dataset. Note that other tests are needed when data are not normally distributed (e.g., Spearman's rank correlation). This function is used here to extract the Pearson correlation coefficient and p-value (which also appear in the upper left corner of the graph above):
+```{r 3-4-Statistical-Tests-21 }
+cor.res <- cor.test(full.data$BW, full.data$BMI)
+signif(cor.res$estimate, 2)
+signif(cor.res$p.value, 2)
+```
+
+Based on these correlation results, it looks like there may be an association between BW and BMI, demonstrated by a significant p-value of 0.0004.
+
+To test this further, let's run a linear regression analysis using the `lm()` function, with BMI as the independent variable (X) and BW as the dependent variable (Y):
+```{r 3-4-Statistical-Tests-22 }
+crude_lm <- lm(data = full.data, BW ~ BMI)
+summary(crude_lm) # viewing the results summary
+```
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: Is there a relationship between maternal BMI and birth weight?
+:::
+
+:::answer
+**Answer**: There is a modest positive correlation between maternal BMI and BW, as indicated by the correlation coefficient of ~0.25, and this linear relationship is significant (p-value ~0.0004).
+:::
+
+
+Additionally, we can derive confidence intervals for the BMI estimate using:
+```{r 3-4-Statistical-Tests-23 }
+confint(crude_lm)["BMI",]
+```
+
+Notice that the r-squared (R^2^) value in the regression output is the square of the previously calculated correlation coefficient (R):
+```{r 3-4-Statistical-Tests-24 }
+signif(sqrt(summary(crude_lm)$r.squared), 2)
+```
+
+
+
+In epidemiological studies, the potential influence of confounders is considered by including important covariates within the final regression model. Let's go ahead and investigate **Environmental Health Question 4**: Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight? We can do that by adding those variables to the linear model.
+
+```{r 3-4-Statistical-Tests-25 }
+adjusted_lm = lm(data = full.data, BW ~ BMI + MAge + GA)
+summary(adjusted_lm)
+```
+
+
+
+Let's further visualize these regression modeling results by adding regression lines to the original scatterplot. Before doing so, we'll use the `effect()` function from the *effects* package to make estimated predictions of birth weight values for the crude and adjusted linear models. The crude model has only BMI as the independent variable, while the adjusted model includes BMI, maternal age, and gestational age as independent variables. This function creates a table that contains 5 columns: values of the predictor BMI (`BMI`), fitted (predicted) birth weight values (`fit`), standard errors of the predictions (`se`), lower confidence limits (`lower`), and upper confidence limits (`upper`). An additional column, `Model`, was added to specify whether the values correspond to the crude or adjusted model.
+
+For additional information on visualizing adjusted linear models, see [Plotting Adjusted Associations in R](https://nickmichalak.com/post/2019-02-13-plotting-adjusted-associations-in-r/plotting-adjusted-associations-in-r/).
+```{r 3-4-Statistical-Tests-26 }
+crude_lm_predtable = data.frame(effect(term = "BMI", mod = crude_lm), Model = "Crude")
+adjusted_lm_predtable = data.frame(effect(term = "BMI", mod = adjusted_lm), Model = "Adjusted")
+
+# Viewing one of the tables
+crude_lm_predtable
+```
+
+Now we can plot each linear model and its corresponding 95% confidence interval (CI). It's easier to visualize this using *ggplot2* instead of *ggpubr*, so that's what we'll use:
+```{r 3-4-Statistical-Tests-27, fig.align = 'center'}
+options(repr.plot.width=9, repr.plot.height=6) # changing dimensions of the entire figure
+ggplot(full.data, aes(x = BMI, y = BW)) +
+ geom_point() +
+ # Crude line
+ geom_line(data = crude_lm_predtable, mapping = aes(x = BMI, y = fit, color = Model)) +
+ # Adjusted line
+ geom_line(data = adjusted_lm_predtable, mapping = aes(x = BMI, y = fit, color = Model)) +
+ # Crude 95% CI
+ geom_ribbon(data = crude_lm_predtable, mapping = aes(x = BMI, y = fit, ymin = lower, ymax = upper, fill = Model), alpha = 0.25) +
+ # Adjusted 95% CI
+ geom_ribbon(data = adjusted_lm_predtable, mapping = aes(x = BMI, y = fit, ymin = lower, ymax = upper, fill = Model), alpha = 0.25)
+```
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can answer **Environmental Health Question #4***: Are maternal age and gestational age considered potential covariates in the relationship between maternal BMI and birth weight?
+:::
+
+:::answer
+**Answer**: BMI is still significantly associated with BW and the included covariates are also shown to be significantly related to birth weight in this model. However, the addition of gestational age and maternal age did not have much of an impact on modifying the relationship between BMI and birth weight.
+:::
+
+
+
+### Logistic Regression
+To carry out a logistic regression, we need to evaluate one continuous variable (here, we select gestational age, using the `GA` variable) and one dichotomous variable (here, we select smoking status, using the `Smoker` variable) to evaluate **Environmental Health Question 5**: Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
+
+Because smoking status is a dichotomous variable, we will use logistic regression to look at this relationship. Let's first visualize these data using boxplots of gestational age across the two smoking groups:
+```{r 3-4-Statistical-Tests-28, fig.width=5, fig.height=4, fig.align = 'center'}
+ggboxplot(data = full.data, x = "Smoker", y = "GA")
+```
+
+
+With this visualization, it's difficult to tell whether or not there are significant differences in gestational age based on smoking status.
+
+
+Let's now run the statistical analysis, using logistic regression modeling:
+```{r 3-4-Statistical-Tests-29 }
+# Before running the model, "Smoker" needs to be binarized to 0's and 1's for the glm() function
+glm_data = full.data %>%
+  mutate(Smoker = ifelse(Smoker == "NS", 0, 1))
+
+# Use GLM (generalized linear model) and specify the family as binomial
+# This tells GLM to run a logistic regression
+log.res = glm(Smoker ~ GA, family = "binomial", data = glm_data)
+
+summary(log.res) # viewing the results
+```
+
+Similar to the regression modeling analysis, we can also derive confidence intervals:
+```{r 3-4-Statistical-Tests-30 }
+confint(log.res)["GA",]
+```
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can answer **Environmental Health Question #5***: Are there statistically significant differences in gestational age based on whether a subject is a non-smoker or a smoker?
+:::
+
+:::answer
+**Answer**: Collectively, these results show a non-significant p-value relating gestational age to smoking status. The confidence interval also spans zero. Therefore, these data do not demonstrate a significant association between gestational age and smoking status.
+:::
+
+
+
+## Statistical Evaluations of Categorical Data using the Chi-Squared Test and Fisher's Exact Test
+The Chi-squared test and Fisher's exact test are used primarily to evaluate associations between two categorical variables.
+The difference between the two lies in the specific procedure being run. The [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) is an approximation, typically run with larger sample sizes, that determines whether there is a statistically significant difference between the expected vs. observed frequencies in one or more categories of a contingency table. [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test) addresses the same question but is an exact test that can be run on any sample size, including smaller ones.
+
+What counts as a sufficiently large number of samples or subjects (*n*) is subjective and contingent upon the research question and experimental design. Smaller sample sizes can be permissible, particularly when data are normally distributed, but generally speaking, *n* > 30 is a common convention in statistics ([Alexander, 2022](https://datepsychology.com/no-the-sample-size-is-not-too-small/)).
+
+For this example, we are interested in evaluating the potential relationship between two categorical variables: smoking status (using the `Smoker` variable) and categorical BMI group (using the `BMIcat` variable) to address **Environmental Health Question 6**: Is there a relationship between smoking status and BMI?
+
+To run these categorical statistical tests, let's first create and view a 2-way contingency table describing the frequencies of observations across the categorical BMI and smoking groups:
+```{r 3-4-Statistical-Tests-31 }
+ContingencyTable <- with(full.data, table(BMIcat, Smoker))
+ContingencyTable
+```
+
+Now let's run the Chi-squared test on this table:
+```{r 3-4-Statistical-Tests-32 }
+chisq.test(ContingencyTable)
+```
+
+Note that we can also run the Chi-squared test using the following code, without having to generate the contingency table:
+```{r 3-4-Statistical-Tests-33, warning = FALSE}
+chisq.test(full.data$BMIcat, full.data$Smoker)
+```
+
+Or:
+```{r 3-4-Statistical-Tests-34, warning = FALSE}
+with(full.data, chisq.test(BMIcat, Smoker))
+```
+
+### Answer to Environmental Health Question 6
+:::question
+Note that these all produce the same results. *With this, we can answer **Environmental Health Question #6***: Is there a relationship between smoking status and BMI?
+:::
+
+:::answer
+**Answer**: This results in a p-value = 0.34, demonstrating that there is no significant relationship between BMI categories and smoking status.
+:::
+
+
+We can also run a Fisher's exact test, which is especially appropriate for small sample sizes. We won't run it here due to computing time, but here is some example code for your records:
+```{r 3-4-Statistical-Tests-35 }
+# With small sample sizes, we can use Fisher's exact test
+#fisher.test(full.data$BMI, full.data$Smoker)
+```
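+
+For reference, here is a quick toy example demonstrating Fisher's exact test on a small 2×2 contingency table, where the Chi-squared approximation would be less reliable. The counts and group labels below are hypothetical and are not from this dataset:
+```r
+# Hypothetical 2x2 table with small cell counts (illustrative only)
+small_table <- matrix(c(8, 2, 3, 7), nrow = 2,
+                      dimnames = list(BMIcat = c("Normal", "Overweight"),
+                                      Smoker = c("NS", "S")))
+fisher.test(small_table)
+```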
+
+## Concluding Remarks
+In conclusion, this training module serves as a high-level introduction to basic statistics and visualization methods. Statistical approaches described in this training module include tests for normality, t-test, analysis of variance, regression modeling, chi-squared test, and Fisher’s exact test. Visualization approaches include boxplots, histograms, scatterplots, and regression lines. These methods serve as an important foundation for nearly all studies carried out in environmental health research.
+
+
+
+
+
+:::tyk
+1. If we're interested in investigating if there are significant differences in birth weight based on maternal education level, which statistical test should you use?
+2. Is that relationship considered to be statistically significant and how can we visualize the distributions of these groups?
+:::
diff --git a/Chapter_3/Module3_4_Input/Module3_4_InputData.csv b/Chapter_3/3_4_Statistical_Tests/Module3_4_InputData.csv
similarity index 100%
rename from Chapter_3/Module3_4_Input/Module3_4_InputData.csv
rename to Chapter_3/3_4_Statistical_Tests/Module3_4_InputData.csv
diff --git a/Chapter_4/04-Chapter4.Rmd b/Chapter_4/04-Chapter4.Rmd
deleted file mode 100644
index 64a60ea..0000000
--- a/Chapter_4/04-Chapter4.Rmd
+++ /dev/null
@@ -1,2577 +0,0 @@
-# (PART\*) Chapter 4 Converting Wet Lab Data into Dry Lab Analyses {-}
-
-# 4.1 Overview of Experimental Design and Example Data
-
-This training module was developed by Elise Hickman, Sarah Miller, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Converting wet lab experimentation data into dry lab analyses facilitates reproducibility and transparency in data analysis. This is helpful for consistency across members of the same research group, review of analyses by collaborators or reviewers, and implementation of similar future analyses. In comparison with analysis workflows that use subscription- or license-based applications, such as Prism or SAS, analysis workflows that leverage open-source programming languages such as R also increase accessibility of analyses. Additionally, scripted analyses minimize the risk for copy-paste error, which can occur when cleaning experimental data, transferring it to an analysis application, and exporting and formatting analysis results.
-
-Some of the barriers to converting wet lab experimentation into dry lab analyses include data cleaning, selection and implementation of appropriate statistical tests, and reporting of results. This chapter will provide introductory material guiding wet-bench scientists in R analyses, bridging the gap between commonly available R tutorials (which, while helpful, may not provide a sufficient level of detail or relevant examples) and intensive data science workflows (which may be too detailed).
-
-In this module, we will provide an overview of key experimental design features and terms that will be used throughout this chapter, and we will provide a detailed overview of the example data. In the subsequent modules, we will dive into analyzing the example data.
-
-## Replicates
-
-One of the most important components of selecting an appropriate analysis is first understanding how data should be compared between samples, which often means addressing experimental replicates. There are two main types of replicates that are used in environmental health research: biological replicates and technical replicates.
-
-### Biological Replicates
-
-Biological replicates are the preferred unit of statistical comparison because they represent biologically distinct samples, demonstrating biological variation in the system. What is considered a biological replicate can depend on the model system being used. For example, in studies with human clinical samples or cells from different human donors, the different humans are considered the biological replicates. In studies using animals as model organisms, individual animals are typically considered biological replicates, although this can vary depending on the experimental design. In studies that use cell lines, which are derived from one human or animal and modified to continuously grow in culture, a biological replicate could be either cells from different passages (different thawed aliquots) grown in completely separate flasks and tested on the same day, or the same cells (one thawed aliquot) used in repeat experiments on separate days, so that the cells have grown/replicated between experiments.
-
-The final "N" that you report should reflect your biological replicates, or independent experiments. What constitutes an independent experiment or biological replicate is highly field-, lab-, organism-, and endpoint-dependent, so make sure to discuss this within your research group in the experiment planning phase and again before your analysis begins. No matter what you choose, ensure that when you report your results, you are transparent about what your biological replicates are. For example, the below diagram (adapted from [BitesizeBio](https://bitesizebio.com/47982/n-number-cell-lines/)) illustrates different ways of defining replicates in experiments with cell lines:
-
-```{r 04-Chapter4-1, echo = FALSE, fig.align = "center", out.width = "650px" }
-knitr::include_graphics("Chapter_4/Module4_1_Input/Module4_1_Image1.png")
-```
-
-N = 3 cells could be considered technical replicates if the endpoint of interest is very low throughput, such as single cell imaging or analyses. N = 3 cell culture wells is a more common approach to technical replicates and is typically used when one sample is collected from each well, such as in the case of media or cell lysate collection. Note that each well within the Week 1 biological replicate would be considered a technical replicate for Week 1's experiment. Similarly, each well within the Week 2 biological replicate would be considered a technical replicate for Week 2's experiment. For more on technical replicates, see the next section.
-
-Although N = 3 cell lines is a less common approach to biological replicates, some argue for this approach because each cell line is typically derived from one biological source. In this scenario, each of the cell lines would be unique but would represent the same cell type or lineage (e.g., for respiratory epithelium, A549, 16HBE, and BEAS-2B cell lines).
-
-Also note that to perform statistical analyses, an N of at least 3 biological replicates is needed, and an even higher N may be needed for a sufficiently powered study. Although power calculations are outside the scope of this module, we encourage you to use power calculation resources, such as [G*Power](https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html) to assist in selecting an appropriate N for your study.
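-
-For R-based workflows, base R's `power.t.test()` function offers a quick way to estimate the N needed for a simple two-group comparison. Below is a minimal sketch; the effect size, standard deviation, power, and significance level are hypothetical placeholders, not values tied to any study in this chapter:
-```r
-# Estimate the n per group needed to detect a hypothetical difference in
-# means of 3 units (sd = 4) with 80% power at alpha = 0.05
-power.t.test(delta = 3, sd = 4, sig.level = 0.05, power = 0.8)
-```
-Leaving `n` unspecified tells `power.t.test()` to solve for the sample size per group; specifying `n` and leaving `power` unspecified would instead solve for power.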
-
-
-### Technical Replicates
-
-Technical replicates are repeated measurements on the same sample or biological source, demonstrating the variation underlying protocols, equipment, and sample handling. In environmental health research, there can be technical replicates separately related to either the experimental design or the downstream analyses. Technical replicates related to experimental design refer to the chemical exposure for cell-based (*in vitro*) experiments, where there may be multiple wells of cells from the same passage or human/mouse exposed to the same treatment. Technical replicates related to downstream analyses refer to the endpoints that are measured after chemical exposure in each sample. To illustrate this, consider an experiment where cells from four unique human donors (D1-D4) are grown in cell culture plates, and then three wells of cells from each donor are exposed to a chemical treatment (Tx) or a vehicle control (Ctrl). The plate layout might look something like this, with technical replicates related to experimental design, i.e. chemical exposure, in the same color:
-
-```{r 04-Chapter4-2, echo = FALSE, fig.align = "center", out.width = "500px" }
-knitr::include_graphics("Chapter_4/Module4_1_Input/Module4_1_Image2.png")
-```
-
-For this experiment, we have four biological replicates (the four donors) and three technical exposure replicates per condition (because three wells from each donor were exposed to each condition). The technical replicates here capture potential unintended variation between wells in cell growth and chemical exposure.
-
-Following the exposure of the cells to a chemical of interest, the media is collected from each well and assayed using a plate reader assay for concentrations of a marker of inflammation. For each sample collected (from each well), there are three technical replicates used to measure the concentration of the inflammatory marker. The purpose of these technical replicates is to capture potential unintended well-to-well variation in the plate reader assay. The plate layout might look something like this, ***with the letter and number in each well of the plate layout representing the well in the exposure plate layout that the media sample being assayed came from***:
-
-```{r 04-Chapter4-3, echo = FALSE, fig.align = "center", out.width = "800px" }
-knitr::include_graphics("Chapter_4/Module4_1_Input/Module4_1_Image3.png")
-```
-
-
-Technical replicates should typically be averaged before performing any statistical analysis. For the experiment described above, we would:
-
-1. Average the technical replicates for the plate reader assay to obtain one value per original cell culture well for inflammatory marker concentration.
-
-2. Then, average the technical replicates for the chemical exposure to obtain one value per biological replicate (donor).
-
-This would result in a dataset with eight values (four control and four treatment) for statistical analysis.
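-
-The two averaging steps above can be sketched with *dplyr*; the data frame and column names below are hypothetical, chosen to mirror the plate layouts described:
-```r
-library(dplyr)
-
-# Hypothetical long-format assay data: one row per plate reader measurement,
-# identified by donor, treatment, exposure well, and assay replicate
-assay_data <- data.frame(
-  Donor = rep(paste0("D", 1:4), each = 18),
-  Treatment = rep(rep(c("Ctrl", "Tx"), each = 9), times = 4),
-  ExposureWell = rep(rep(1:3, each = 3), times = 8),
-  AssayRep = rep(1:3, times = 24),
-  Conc = rnorm(72, mean = 100, sd = 10)
-)
-
-# Step 1: average assay technical replicates to get one value per
-# original cell culture well
-well_means <- assay_data %>%
-  group_by(Donor, Treatment, ExposureWell) %>%
-  summarise(Conc = mean(Conc), .groups = "drop")
-
-# Step 2: average exposure technical replicates (wells) to get one value
-# per biological replicate (donor) per condition
-donor_means <- well_means %>%
-  group_by(Donor, Treatment) %>%
-  summarise(Conc = mean(Conc), .groups = "drop")
-
-donor_means # 8 values: 4 donors x 2 conditions
-```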
-
-#### Number and inclusion of technical replicates
-
-The above example is just one approach to experimental design. As mentioned above in the biological replicates section, selection of appropriate biological and technical replicates can vary greatly depending on your model organism, experimental design, assay, and standards in the field. For example, there may be cases where well-to-well variation for certain assays is minimal compared with variation between biological replicates, or when including technical replicates for each donor is experimentally or financially unfeasible, resulting in a lack of technical replicates.
-
-### Matched Experimental Design
-
-Matching (also known as paired or repeated measures) in an experimental design is also a very important concept when selecting the appropriate statistical analysis. In experiments with matched design, multiple measurements are collected from the same biological replicate. This typically provides increased statistical power because changes are observed within each biological replicate relative to its starting point. In environmental health research, this can include study designs such as:
-
-1. Samples were collected from the same individuals, animals, or cell culture wells pre- and post-exposure.
-
-2. Cells from the same biological replicate were exposed to different doses of a chemical.
-
-The experimental design described above represents a matched design because cells from the same donor are exposed to both the treatment and the vehicle control.
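-
-In practice, matching changes which statistical test is appropriate, e.g., a paired rather than an unpaired t-test when comparing two matched groups. A minimal sketch with toy numbers (not from this experiment):
-```r
-# Toy pre/post measurements from the same 6 biological replicates
-pre  <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.8)
-post <- c(10.9, 10.3, 11.8, 11.0, 10.5, 11.5)
-
-# An unpaired test ignores the matching
-t.test(pre, post)
-
-# A paired test compares within-replicate differences, which typically
-# increases statistical power for matched designs
-t.test(pre, post, paired = TRUE)
-```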
-
-## Orientation to Example Data for Chapter 4
-
-In this chapter, we will be using an example dataset derived from an *in vitro*, or cell culture, experiment. Before diving into analysis of these data in the subsequent modules, we will provide an overview of where these data came from and preview what the input data frames look like.
-
-### Experimental Design
-
-In this experiment, primary human bronchial epithelial cells (HBECs) from sixteen different donors were exposed to the gas acrolein, which is emitted from the combustion of fossil fuels, tobacco, wood, and plastic. Inhalation exposure to acrolein is associated with airway inflammation, and this study aimed to understand how exposure to acrolein changes secretion of markers of inflammation. Prior to experimentation, the HBECs were grown on a permeable membrane support for 24 days with air on one side and liquid media on the other side, allowing them to differentiate into a form that is very similar to what is found in the human body. The cells were then exposed for 2 hours to 0 (filtered air), 0.6, 1, 2, or 4 ppm acrolein, with two technical replicate wells from each donor per dose. Twenty-four hours later, the media was collected, and concentrations of inflammatory markers were measured using an [enzyme-linked immunosorbent assay (ELISA)](https://www.thermofisher.com/us/en/home/life-science/protein-biology/protein-biology-learning-center/protein-biology-resource-library/pierce-protein-methods/overview-elisa.html).
-
-```{r 04-Chapter4-4, echo = FALSE, fig.align = "center", out.width = "900px" }
-knitr::include_graphics("Chapter_4/Module4_1_Input/Module4_1_Image4.png")
-```
-
-Note that this is a matched experimental design because cells from every donor were exposed to every concentration of acrolein, rather than cells from different donors being exposed to each of the different doses.
-
-### Starting Data
-
-Next, let's familiarize ourselves with the data that resulted from this experiment. There are two input data files, one that contains cytokine concentration data and one that contains demographic information about the donors:
-
-```{r 04-Chapter4-5, echo = FALSE, fig.align = "center", out.width = "900px" }
-knitr::include_graphics("Chapter_4/Module4_1_Input/Module4_1_Image5.png")
-```
-
-The cytokine data contain concentrations of each of the six proteins measured in the basolateral media for each sample (units = pg/mL); samples can be identified by the donor, dose, and replicate columns. The demographic data contain the age and sex of each donor. In the subsequent modules, we'll be using these data to assess whether exposure to acrolein significantly changes secretion of inflammatory markers and whether donor characteristics, such as sex and age, modify these responses.
-
-## Concluding Remarks
-
-This module reviewed important components of experimental design, such as replicates and matching, which are critical for data pre-processing and selecting appropriate statistical tests.
-
-
-
-:::tyk
-Read the following experimental design descriptions. For each description, determine the number of biological replicates (per group), the number of technical replicates, and whether the experimental design is matched.
-
-1. One hundred participants are recruited to a study aiming to determine whether people who use e-cigarettes have different concentrations of inflammatory markers in their airways. Fifty participants are non e-cigarette users and 50 participants are e-cigarette users. After the airway samples are collected, each sample is analyzed with an ELISA, with three measurements taken per sample.
-
-2. Twenty mice are used in a study aiming to understand the effects of particulate matter on cardiovascular health. The mice are randomized such that half of the mice are exposed to filtered air and half are exposed to particulate matter. During the exposures, the mice are continuously monitored for endpoints such as heart rate and heart function. One month later, the mice that were exposed to particulate matter are exposed to filtered air, and the mice that were exposed to filtered air are exposed to particulate matter, with the same cardiovascular endpoints collected.
-:::
-
-# 4.2 Data Import, Processing, and Summary Statistics
-
-This training module was developed by Elise Hickman, Alexis Payton, Sarah Miller, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-The first steps in any scripted analysis of wet-bench data include importing the data, cleaning the data to prepare for analyses, and conducting preliminary data exploration steps, such as addressing missing values, calculating summary statistics, and assessing normality. Although less exciting than diving right into the statistical analysis, these steps are crucial in guiding downstream analyses and ensuring accurate results. In this module, we will discuss each of these steps and work through them using an example dataset (introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**) of inflammatory markers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. What is the mean concentration of each inflammatory biomarker by acrolein concentration?
-
-2. Are our data normally distributed?
-
-
-
-## Data Import
-
-First, we need to import our data. Data can be imported into R from many different file formats, including .csv (as demonstrated in previous chapters), .txt, .xlsx, and .pdf. Often, data are formatted in Excel prior to import, and the [*openxlsx*](https://ycphs.github.io/openxlsx/) package provides helpful functions that allow the user to import data from Excel, create workbooks for storing results generated in R, and export data from R to Excel workbooks. Below, we will use the `read.xlsx()` function to import our data directly from Excel. Other useful packages include [*pdftools*](https://github.com/ropensci/pdftools) (PDF import), [*tm*](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) (text mining of PDFs), and [*plater*](https://cran.r-project.org/web/packages/plater/vignettes/plater-basics.html) (plate reader formatted data import).
-```{r 04-Chapter4-6, echo = FALSE, fig.align = "center", out.width = "850px" }
-knitr::include_graphics("Chapter_4/Module4_2_Input/Module4_2_Image1.png")
-```
-
-### Workspace Preparation and Data Import
-
-#### Set working directory
-
-In preparation, first let's set our working directory to the folder path that contains our input files:
-```{r 04-Chapter4-7, eval = FALSE}
-setwd("/filepath to where your input files are")
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
-```{r 04-Chapter4-8, echo=TRUE, eval=FALSE, warning=FALSE, results='hide', message=FALSE}
-if (!requireNamespace("table1"))
- install.packages("table1");
-if (!requireNamespace("vtable"))
- install.packages("vtable");
-# some packages need to be installed through Bioconductor/ BiocManager
-if (!require("BiocManager", quietly = TRUE))
- install.packages("BiocManager")
-BiocManager::install("pcaMethods")
-BiocManager::install("impute")
-BiocManager::install("imputeLCMD")
-```
-#### Load required packages
-
-And load required packages:
-```{r 04-Chapter4-9, message = FALSE}
-library(openxlsx) # for importing Excel files
-library(DT) # for easier viewing of data tables
-library(tidyverse) # for data cleaning and graphing
-library(imputeLCMD) # for data imputation with QRILC
-library(table1) # for summary table
-library(vtable) # for summary table
-library(ggpubr) # for making Q-Q plots with ggplot
-```
-
-#### Import example datasets
-
-Next, let's read in our example datasets:
-```{r 04-Chapter4-10}
-biomarker_data <- read.xlsx("Chapter_4/Module4_2_Input/Module4_2_InputData1.xlsx")
-demographic_data <- read.xlsx("Chapter_4/Module4_2_Input/Module4_2_InputData2.xlsx")
-```
-
-#### View example datasets
-
-First, let's preview our example data. Using the `datatable()` function from the *DT* package allows us to interactively scroll through our biomarker data.
-```{r 04-Chapter4-11}
-datatable(biomarker_data)
-```
-
-We can see that our biomarker data are arranged with samples in rows and sample information and biomarker measurements in the columns.
-```{r 04-Chapter4-12}
-datatable(demographic_data)
-```
-
-Our demographic data provide information about the donors that our cells came from, matching to the `Donor` column in our biomarker data.
-
-
-
-## Handling Missing Values
-
-Next, we will investigate whether we have missing values and which variables and donors have missing values.
-```{r 04-Chapter4-13}
-# Calculate the total number of NAs per variable
-biomarker_data %>%
- summarise(across(IL1B:VEGF, ~sum(is.na(.))))
-
-# Calculate the number of missing values per subject
-biomarker_data %>%
- group_by(Donor) %>%
- summarise(across(IL1B:VEGF, ~sum(is.na(.))))
-```
-
-Here, we can see that we do have a few missing values. What should we do with these values?
-
-### Missing Values and Data Imputation
-
-#### Missing values
-
-Before deciding what to do about our missing values, it's important to understand why they are missing. There are a few different types of missing values that could be present in a dataset:
-
-1. **Missing completely at random (MCAR):** has nothing to do with the experimental unit being studied (e.g., a sample is damaged or lost in the lab)
-
-2. **Missing at random (MAR):** there may be a systematic difference between missing and measured values, but they can be explained by observed differences in the data or experimental unit
-
-3. **Missing not at random (MNAR):** data are missing due to factors that are not observed/measured (e.g., measurement for a specific endpoint is below the limit of detection (LOD) of an assay)
-
-We know from the researchers who generated this dataset that the values are missing because these specific proteins were below the limit of detection for the assay for certain samples; therefore, our data are missing not at random. This can help us with our choice of imputation method, described below.
-
-#### Imputation
-
-Imputation is the assignment of a value to a missing data point by inferring that value from other properties of the dataset or externally defined limits. Whether or not you should impute your data is not a one-size-fits-all decision and may depend on your field, experimental design, the type of data, and the type of missing values in your dataset. Two questions you can ask yourself when deciding whether or not to impute data are:
-
-1. Is imputation needed for downstream analyses? *Some analyses are not permissive to including NAs or 0s; others are.*
-
-2. Will imputing values bias my analyses unnecessarily? *If so, consider analyzing subsets of the data that are complete separately.*
-
-
-There are many different imputation methods (too many to cover them all in this module); here, we will introduce a few that we use most often. We encourage you to explore these in more depth and to understand typical imputation workflows for your lab, data type, and/or discipline.
-
-- For variables where imputed values are expected to be generally bound by the existing range of data (e.g., MCAR): [missForest](https://rpubs.com/lmorgan95/MissForest)
-
-- For variables with samples below the limit of detection for the assay, such as for mass spectrometry or ELISAs (e.g., MNAR)
- - Replace non-detects with the limit of detection divided by the square root of 2
- - [Quantile Regression Imputation of Left-Censored Data (QRILC)](https://www.nature.com/articles/s41598-017-19120-0)
- - [GSimp](https://github.com/WandeRum/GSimp) (can also be used to impute values above a specific threshold)
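
As a simple illustration of the first approach, non-detects can be replaced with the limit of detection divided by the square root of 2 directly in a dataframe. This is a hedged sketch: the LOD value, column name, and data below are all hypothetical.

```r
library(dplyr)

# Hypothetical assay limit of detection (pg/mL) and biomarker values with non-detects (NAs)
lod <- 0.5
df <- data.frame(IL6 = c(2.30, NA, 1.10, NA))

# Replace non-detects with LOD / sqrt(2)
df <- df %>%
  mutate(IL6 = ifelse(is.na(IL6), lod / sqrt(2), IL6))
```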
-
-If you do impute missing values, make sure to include both your raw and imputed data, along with detailed information about the imputation method, within your manuscript, supplemental information, and/or GitHub. You can even present summary statistics for both raw and imputed data for additional transparency.
-
-### Imputation of Our Data
-
-Before imputing our data, it is a good idea to implement a background filter that checks what percentage of values is missing for each variable. For variables with a very high percentage of missing values, imputation can be unreliable because there is not enough information for the imputation algorithm to reference. The appropriate threshold can vary by study design and the extent to which your data are subset into groups that may have differing biomarker profiles; however, a threshold we frequently use is to remove variables with missing data for 25% or more of samples.
-
-We can use the following code to calculate the percentage of values missing for each endpoint:
-```{r 04-Chapter4-14}
-biomarker_data %>%
- summarise(across(IL1B:VEGF, ~sum(is.na(.))/nrow(biomarker_data)*100))
-```
-
-Here, we can see that only about 3-4% of values are missing for our variables with missing data, so we will proceed to imputation with our dataset as-is.
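
For reference, if any variable had exceeded the threshold, the filtering step itself could be sketched as follows. This uses a small toy data frame (column names and values invented) so the cutoff behavior is visible:

```r
library(dplyr)

# Toy data mimicking the structure of biomarker_data (values invented for illustration)
toy <- data.frame(
  Donor = 1:8,
  IL6   = c(1.2, 2.1, NA, 4.0, 5.3, 6.1, 7.4, 8.2),  # 12.5% missing -> kept
  VEGF  = c(NA, NA, NA, 4.5, 5.0, 6.2, 7.1, 8.3)     # 37.5% missing -> removed
)

# Percent missing per biomarker column
missing_pct <- toy %>%
  summarise(across(IL6:VEGF, ~ sum(is.na(.)) / n() * 100))

# Remove columns at or above the 25% threshold
cols_to_drop <- names(missing_pct)[unlist(missing_pct) >= 25]
toy_filtered <- toy %>% select(-all_of(cols_to_drop))
```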
-
-We will impute values using QRILC, which pulls from the left side of the data distribution (the lower values) to impute missing values. We will write a function that will apply QRILC imputation to our dataframe. This function takes a dataframe with missing values as input and returns a dataframe with QRILC imputed values in place of NAs as output.
-```{r 04-Chapter4-15}
-QRILC_imputation = function(df){
- # Normalize data before applying QRILC per QRILC documentation
-  ## Select only numeric columns, pseudo log2 transform, and convert to a matrix
- ### 4 comes from there being 3 metadata columns before the numeric data starts
- QRILC_prep = df[,4:dim(df)[2]] %>%
- mutate_all(., function(x) log2(x + 1)) %>%
- as.matrix()
-
- # QRILC imputation
- imputed_QRILC_object = impute.QRILC(QRILC_prep, tune.sigma = 0.1)
- QRILC_log2_df = data.frame(imputed_QRILC_object[1])
-
- # Converting back the original scale
- QRILC_df = QRILC_log2_df %>%
- mutate_all(., function(x) 2^x - 1)
-
- # Adding back in metadata columns
- QRILC_df = cbind(Donor = df$Donor,
- Dose = df$Dose,
- Replicate = df$Replicate,
- QRILC_df)
-
- return(QRILC_df)
-}
-```
-
-Now we can apply the `QRILC_imputation()` function to our dataframe. We use the function `set.seed()` to ensure that the QRILC function generates the same numbers each time we run the script. For more on setting seeds, see [here](https://www.statology.org/set-seed-in-r/).
-```{r 04-Chapter4-16}
-# Set random seed to ensure reproducibility in results
-set.seed(1104)
-
-# Apply function
-biomarker_data_imp <- QRILC_imputation(biomarker_data)
-```
-
-
-## Averaging Replicates
-
-The last step we need to take before our data are ready for analysis is averaging the two technical replicates for each donor and dose. We will do this by creating an ID column that represents the donor and dose together and using that column to group and average the data. This results in a dataframe where our rows contain data representing each biological replicate exposed to each of the five concentrations of acrolein.
-```{r 04-Chapter4-17}
-biomarker_data_imp_avg <- biomarker_data_imp %>%
-
- # Create an ID column that represents the donor and dose
- unite(Donor_Dose, Donor, Dose, sep = "_") %>%
-
-  # Average replicates within each unique Donor_Dose
- group_by(Donor_Dose) %>%
- summarize(across(IL1B:VEGF, mean)) %>%
-
-  # Round results to the same number of decimal places as the original data
- mutate(across(IL1B:VEGF, \(x) round(x, 2))) %>%
-
- # Separate back out the Donor_Dose column
- separate(Donor_Dose, into = c("Donor", "Dose"), sep = "_")
-
-# View new dataframe
-datatable(biomarker_data_imp_avg)
-```
-
-
-## Descriptive Statistics
-
-Generating descriptive statistics (e.g., mean, median, mode, range, standard deviation) can be helpful for understanding the general distribution of your data and for reporting results either in the main body of a manuscript/report (for small datasets) or in the supplementary material (for larger datasets). There are a number of different approaches that can be used to calculate summary statistics, including functions from base R and from add-on packages. Here, we will demonstrate a few different ways to efficiently calculate descriptive statistics across our dataset.
-
-### Method #1 - Tidyverse and Basic Functions
-
-The mean, or average of data points, is one of the most commonly reported summary statistics and is often reported as mean ± standard deviation to demonstrate the spread in the data. Here, we will make a table of mean ± standard deviation for each of our biomarkers across each of the dose groups using *tidyverse* functions.
-```{r 04-Chapter4-18}
-# Calculate means
-biomarker_group_means <- biomarker_data_imp_avg %>%
- group_by(Dose) %>%
- summarise(across(IL1B:VEGF, \(x) mean(x)))
-
-# View data
-datatable(biomarker_group_means)
-```
-
-You'll notice that there are a lot of decimal places in our calculated means, while in our original data, there are only two decimal places. We can add a rounding step to the above code chunk to produce cleaner results.
-```{r 04-Chapter4-19}
-# Calculate means
-biomarker_group_means <- biomarker_data_imp_avg %>%
- group_by(Dose) %>%
- summarise(across(IL1B:VEGF, \(x) mean(x))) %>%
- mutate(across(IL1B:VEGF, \(x) round(x, 2)))
-
-# View data
-datatable(biomarker_group_means)
-```
-
-### Answer to Environmental Health Question 1
-:::question
-With this, we can answer **Environmental Health Question 1**: What is the mean concentration of each inflammatory biomarker by acrolein concentration?
-:::
-
-:::answer
-**Answer:** With the above table, we can see the mean concentrations for each of our inflammatory biomarkers by acrolein dose. IL-8 overall has the highest concentrations, followed by VEGF and IL-6. For IL-1$\beta$, IL-8, TNF-$\alpha$, and VEGF, it appears that the concentration of the biomarker goes up with increasing dose.
-:::
-
-We can use very similar code to calculate our standard deviations:
-```{r 04-Chapter4-20}
-# Calculate standard deviations
-biomarker_group_sds <- biomarker_data_imp_avg %>%
- group_by(Dose) %>%
- summarise(across(IL1B:VEGF, \(x) sd(x))) %>%
- mutate(across(IL1B:VEGF, \(x) round(x, 1)))
-
-# View data
-datatable(biomarker_group_sds)
-```
-
-Now we've calculated both the means and standard deviations! However, these are typically presented as mean ± standard deviation. We can merge these dataframes by executing the following steps:
-
-1. Pivot each dataframe to a long format, with each row containing the value for one biomarker at one dose.
-2. Create a variable that represents each unique row (combination of `Dose` and `variable`).
-3. Join the dataframes by row.
-4. Unite the two columns with mean and standard deviation, with `±` in between them.
-5. Pivot the dataframe wider so that the dataframe resembles what we started with for the means and standard deviations.
-
-First, we'll pivot each dataframe to a long format and create a variable that represents each unique row.
-```{r 04-Chapter4-21}
-# Pivot dataframes longer and create variable column for each row
-biomarker_group_means_long <- pivot_longer(biomarker_group_means,
- !Dose, names_to = "variable", values_to = "mean") %>%
- unite(Dose_variable, Dose, variable, remove = FALSE)
-
-biomarker_group_sds_long <- pivot_longer(biomarker_group_sds,
- !Dose, names_to = "variable", values_to = "sd") %>%
- unite(Dose_variable, Dose, variable, remove = FALSE)
-
-
-# Preview what dataframe looks like
-datatable(biomarker_group_means_long)
-```
-
-Next, we will join the mean and standard deviation datasets. Notice that we are only joining the `Dose_variable` and `sd` columns from the standard deviation dataframe to prevent duplicate columns (`Dose`, `variable`) from being included.
-```{r 04-Chapter4-22}
-# Merge the dataframes by row
-biomarker_group_summstats <- left_join(biomarker_group_means_long,
- biomarker_group_sds_long %>% select(c(Dose_variable, sd)),
- by = "Dose_variable")
-
-# Preview the new dataframe
-datatable(biomarker_group_summstats)
-```
-
-Then, we can unite the mean and standard deviation columns and add the ± symbol between them by storing that character as a variable and pasting that variable in our `paste()` function.
-```{r 04-Chapter4-23}
-# Store plus/minus character
-plusminus <- "\u00b1"
-Encoding(plusminus) <- "UTF-8"
-
-# Create new column with mean +/- standard deviation
-biomarker_group_summstats <- biomarker_group_summstats %>%
- mutate(mean_sd = paste(mean, plusminus, sd, sep = " "))
-
-# Preview the new dataframe
-datatable(biomarker_group_summstats)
-```
-
-Last, we can pivot the dataframe wider to revert it to its original layout, which is easier to read.
-```{r 04-Chapter4-24}
-# Pivot dataframe wider
-biomarker_group_summstats <- biomarker_group_summstats %>%
-
- # Remove columns we don't need any more
- select(-c(Dose_variable, mean, sd)) %>%
-
- # Pivot wider
- pivot_wider(id_cols = Dose, names_from = "variable", values_from = "mean_sd")
-
-# View final dataframe
-datatable(biomarker_group_summstats)
-```
-
-These data are now in a publication-ready format that can be exported to a .txt, .csv, or .xlsx file for sharing.
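
As a sketch of that export step (continuing from the `biomarker_group_summstats` dataframe created above; the output filenames below are just placeholders):

```r
library(openxlsx)

# Write the summary table to an Excel workbook
write.xlsx(biomarker_group_summstats, file = "biomarker_summary_stats.xlsx")

# Or write to a .csv file with base R
write.csv(biomarker_group_summstats, "biomarker_summary_stats.csv", row.names = FALSE)
```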
-
-### Method #2 - Applying a List of Functions
-
-Calculating our mean and standard deviation separately using *tidyverse* wasn't too difficult, but what if we want to calculate other descriptive statistics, such as minimum, median, and maximum? We could use the above approach, but we would need to make a separate dataframe for each and then merge them all together. Instead, we can use the `map_dfr()` function from the *purrr* package, which is also part of *tidyverse*. This function takes a list of functions you want to apply to your data and applies these functions over specified columns in the data. Let's see how it works:
-```{r 04-Chapter4-25}
-# Define summary functions
-summary_functs <- lst(min, median, mean, max, sd)
-
-# Apply functions to data, grouping by dose
-# .id = "statistic" tells the function to create a column describing which statistic that row is reporting
-biomarker_descriptive_stats_all <- map_dfr(summary_functs,
- ~ summarize(biomarker_data_imp_avg %>% group_by(Dose),
- across(IL1B:VEGF, .x)), .id = "statistic")
-
-# View data
-datatable(biomarker_descriptive_stats_all)
-```
-
-Depending on your final goal, descriptive statistics data can then be extracted from this dataframe and cleaned up or reformatted as needed to create a publication-ready table!
-
-### Other Methods
-
-There are also packages developed specifically for making summary tables, such as [*table1*](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html) and [*vtable*](https://cran.r-project.org/web/packages/vtable/vignettes/sumtable.html). These packages can create summary tables in HTML format, which appear nicely in R Markdown and can be copied and pasted into Word. Here, we will briefly demonstrate how these packages work, and we encourage you to explore more using the package vignettes!
-
-#### Table1
-
-The *table1* package makes summary tables using the function `table1()`, which takes the columns that you want in the rows of the table on the left side of the first argument, followed by `|` and then the grouping variable. The output table can be customized in a number of ways, including what summary statistics are output and whether or not statistical comparisons are run between groups (see package vignette for more details).
-```{r 04-Chapter4-26}
-# Get names of all of the columns to include in the table
-paste(names(biomarker_data_imp_avg %>% select(IL1B:VEGF)), collapse=" + ")
-```
-
-```{r 04-Chapter4-27, eval = FALSE}
-# Make the table
-table1(~ IL1B + IL6 + IL8 + IL10 + TNFa + VEGF | Dose, data = biomarker_data_imp_avg)
-```
-
-```{r 04-Chapter4-28, echo = FALSE, fig.align = "center", out.width = "850px" }
-knitr::include_graphics("Chapter_4/Module4_2_Input/Module4_2_Image2.png")
-```
-
-#### Vtable
-
-The *vtable* package includes the function `st()`, which can also be used to make HTML tables (and other output formats; see `out` argument). For example:
-```{r 04-Chapter4-29}
-# HTML output
-st(biomarker_data_imp_avg, group = 'Dose')
-
-# Dataframe output
-st(biomarker_data_imp_avg, group = 'Dose', out = 'return')
-```
-
-Similar to *table1*, see the package vignette for detailed information about how to customize tables using this package.
-
-
-
-## Normality Assessment and Data Transformation
-
-The last step we will take before beginning to test our data for statistical differences between groups (in the next module) is to understand our data's distribution through normality assessment. This will inform which statistical tests we will perform on our data. For more detail on normality testing, including detailed explanations of each type of normality assessment and explanations of the code underlying the following graphs and tables, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**.
-
-We'll start by looking at histograms of our data for qualitative normality assessment:
-```{r 04-Chapter4-30, message = FALSE, fig.align = 'center'}
-# Set theme
-theme_set(theme_bw())
-
-# Pivot data longer to prepare for plotting
-biomarker_data_imp_avg_long <- biomarker_data_imp_avg %>%
- pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
-
-# Make figure panel of histograms
-ggplot(biomarker_data_imp_avg_long, aes(value)) +
- geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
- facet_wrap(~ variable, scales = "free", nrow = 2) +
- labs(y = "# of Observations", x = "Value")
-```
-
-From these histograms, we can see that IL-1$\beta$ appears to be normally distributed, while the other endpoints do not appear to be normally distributed.
-
-We can also use Q-Q plots to assess normality qualitatively:
-```{r 04-Chapter4-31, fig.align = 'center'}
-ggqqplot(biomarker_data_imp_avg_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
-```
-
-With this figure panel, we can see that most of the variables have very noticeable deviations from the reference line, suggesting non-normal distributions.
-
-To assess normality quantitatively, we can use the Shapiro-Wilk test. Note that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
-```{r 04-Chapter4-32}
-# Apply Shapiro Wilk test to dataframe
-shapiro_res <- apply(biomarker_data_imp_avg %>% select(IL1B:VEGF), 2, shapiro.test)
-
-# Create results dataframe
-shapiro_res <- do.call(rbind.data.frame, shapiro_res)
-
-# Clean dataframe
-shapiro_res <- shapiro_res %>%
-
- ## Add normality conclusion
- mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
-
- ## Remove columns that do not contain informative data
- select(c(p.value, normal))
-
-# View cleaned up dataframe
-datatable(shapiro_res)
-```
-
-### Answer to Environmental Health Question 2
-:::question
-With this, we can answer **Environmental Health Question 2**: Are our data normally distributed?
-:::
-
-:::answer
-**Answer:** The results from the Shapiro-Wilk test demonstrate that the IL-1$\beta$ data are normally distributed, while the other variables are non-normally distributed. These results support the conclusions we made based on our qualitative assessment above with histograms and Q-Q plots.
-:::
-
-### Log~2~ Transforming and Re-Assessing Normality
-
-Log~2~ transformation is a common transformation used in environmental health research and can move data closer to a normal distribution. For more on data transformation, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**. We will pseudo-log~2~ transform our data, which adds a 1 to each value before log~2~ transformation and ensures that resulting values are positive real numbers. Let's see if the log~2~ data are more normally distributed than the raw data.
-```{r 04-Chapter4-33}
-# Apply log2 transformation to data
-biomarker_data_imp_avg_log2 <- biomarker_data_imp_avg %>%
- mutate(across(IL1B:VEGF, ~ log2(.x + 1)))
-```
-
-Make histogram panel:
-```{r 04-Chapter4-34, fig.align = 'center'}
-# Pivot data longer and make figure panel of histograms
-biomarker_data_imp_avg_log2_long <- biomarker_data_imp_avg_log2 %>%
- pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
-
-# Make histogram panel
-ggplot(biomarker_data_imp_avg_log2_long, aes(value)) +
- geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
- facet_wrap(~ variable, scales = "free") +
- labs(y = "# of Observations", x = "Value")
-```
-
-Make Q-Q plot panel:
-```{r 04-Chapter4-35, fig.align = 'center'}
-ggqqplot(biomarker_data_imp_avg_log2_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
-```
-
-Run Shapiro-Wilk test:
-```{r 04-Chapter4-36}
-# Apply Shapiro Wilk test
-shapiro_res_log2 <- apply(biomarker_data_imp_avg_log2 %>% select(IL1B:VEGF), 2, shapiro.test)
-
-# Create results dataframe
-shapiro_res_log2 <- do.call(rbind.data.frame, shapiro_res_log2)
-
-# Clean dataframe
-shapiro_res_log2 <- shapiro_res_log2 %>%
-
- ## Add normality conclusion
- mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
-
- ## Remove columns that do not contain informative data
- select(c(p.value, normal))
-
-# View cleaned up dataframe
-shapiro_res_log2
-```
-
-The histograms and Q-Q plots demonstrate that the log~2~ data are more normally distributed than the raw data. The results from the Shapiro-Wilk test also demonstrate that the log~2~ data are more normally distributed as a whole than the raw data. Overall, the p-values, even for the variables that are still non-normally distributed, are much higher.
-
-So, should we proceed with the raw data or the log~2~ data? This depends on what analyses we plan to do. In general, it is best to keep the data as close to their raw form as possible, so if non-parametric versions of all of our planned analyses exist, we could use our raw data. However, some statistical tests do not have a non-parametric equivalent, in which case it would likely be best to use the log~2~ transformed data. For subsequent modules, we will proceed with the log~2~ data for consistency; however, choices regarding normality assessment can vary, so be sure to discuss these choices within your research group before proceeding with your analysis.
-
-For more on decisions regarding normality, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**. For more on parametric vs. non-parametric tests, see **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations** and **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**.
-
-
-
-## Concluding Remarks
-
-Taken together, this module demonstrates important data processing steps necessary before proceeding with between-group statistical testing, including data import, handling missing values, averaging replicates, generating descriptive statistics tables, and assessing normality. Careful consideration and description of these steps in the methods section of a manuscript or report increases reproducibility of analyses and helps to improve the accuracy and statistical validity of subsequent statistical results.
-
-
-
-
-
-:::tyk
-
-Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). Work through the same processes demonstrated in this module using the provided data ("Module4_2_TYKInput.xlsx") to answer the following questions:
-
-1. How many technical replicates are there for each dose?
-2. Are there any missing values?
-3. What are the average values for each endpoint by dose?
-4. Are the raw data normally distributed?
-:::
-
-# 4.3 Data Import from PDF Sources
-
-This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Most tutorials for R rely on importing .csv, .xlsx, or .txt files, but there are numerous other file formats that can store data, and these file formats can be more difficult to import into R. PDFs can be particularly difficult to interface with in R because they are not formatted with defined rows/columns/cells as is done in Excel or .csv/.txt formatting. In this module, we will demonstrate how to import data from PDFs into R and format it such that it is amenable for downstream analyses or export as a table. Familiarity with *tidyverse*, for loops, and functions will make this module much more approachable, so be sure to review **TAME 2.0 Modules 2.3 Data Manipulation and Reshaping** and **2.4 Improving Coding Efficiencies** if you need a refresher.
-
-
-
-### Overview of Example Data
-
-To demonstrate import of data from PDFs, we will be leveraging two example datasets, described in more detail in their respective sections later on in the module.
-
-1. PDFs generated by Nanoparticle Tracking Analysis (NTA), a technique used to quantify the size and distribution of particles (such as extracellular vesicles) in a sample. We will be extracting data from an experiment in which epithelial cells were exposed to four different environmental chemicals or a vehicle control, and secreted particles were isolated and characterized using NTA.
-
-2. A PDF containing information about variables collected as part of a study whose samples are part of NIH's [BioLINCC Repository](https://biolincc.nhlbi.nih.gov/home/).
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Which chemical(s) increase and decrease the concentration of particles secreted by epithelial cells?
-2. How many total variables are available to us to request from the study whose data are stored in the repository, and what are these variables?
-
-
-
-## Importing Data from Many Single PDFs with the Same Formatting
-
-### Getting Familiar with the Example Dataset
-
-The following example is based on extracting data from PDFs generated by Nanoparticle Tracking Analysis (NTA), a technique used to quantify the size and distribution of particles in a sample. Each PDF file is associated with one sample, and each PDF contains multiple values that we want to extract. Although this is a very specific type of data, keep in mind that this general approach can be applied to any data stored in PDF format - you will just need to make modifications based on the layout of your PDF file!
-
-For this example, we will be extracting data from 5 PDFs that are identically formatted but contain information unique to each sample. The samples represent particles isolated from epithelial cell media following an experiment where cells were exposed to four different environmental chemicals (labeled "A", "B", "C", and "D") or a vehicle control (labeled "Ctrl").
-
-Here is what a full view of one of the PDFs looks like, with values we want to extract highlighted in yellow:
-```{r 04-Chapter4-37, echo = FALSE, out.width = "850px", fig.align = "center"}
-knitr::include_graphics("Chapter_4/Module4_3_Input/Module4_3_Image1.png")
-```
-
-Our goal is to extract these values and end up with a dataframe that looks like this, with each sample in a row and each variable in a column:
-```{r 04-Chapter4-38, echo = FALSE, message = FALSE}
-# Loading packages
-library(tidyverse)
-library(openxlsx)
-library(DT)
-
-# Reading in data
-ending_data <- read.xlsx("Chapter_4/Module4_3_Input/Module4_3_InputData1.xlsx")
-
-# Renaming some of the columns
-ending_data <- ending_data %>%
- rename("Sample Identifier" = "Sample.Identifier",
- "Experiment Number" = "Experiment.Number",
- "Dilution Factor" = "Dilution.Factor",
- "Concentration (Particles/mL)" = "Concentration.(Particles/.mL)")
-
-datatable(ending_data)
-```
-
-If your files are not already named in a way that reflects unique sample information, such as the date of the experiment or sample ID, update your file names to contain this information before proceeding with the script. Here are the names for the example PDF files:
-```{r 04-Chapter4-39, out.width = "400px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_4/Module4_3_Input/Module4_3_Image2.png")
-```
-
-
-
-### Workspace Preparation and Data Import
-
-#### Installing and loading required R packages
-
-If you already have these packages installed, you can skip this step, or you can run the below code, which checks installation status for you. We will be using the *pdftools* and *tm* packages to extract text from the PDFs and the *janitor* package to clean column names. And instead of using `head()` to preview dataframes, we will be using the function `datatable()` from the *DT* package. This function produces interactive tables and generates better formatting for viewing dataframes that have long character strings (like the ones we will be viewing in this section).
-
-```{r 04-Chapter4-40, eval = FALSE}
-if (!requireNamespace("pdftools"))
- install.packages("pdftools")
-if (!requireNamespace("tm"))
- install.packages("tm")
-if (!requireNamespace("DT"))
- install.packages("DT")
-if (!requireNamespace("janitor"))
- install.packages("janitor")
-```
-
-Next, load the packages.
-```{r 04-Chapter4-41, warning = FALSE, message = FALSE}
-library(tidyverse)
-library(pdftools)
-library(tm)
-library(DT)
-library(janitor)
-```
-
-#### Initial data import from PDF files
-
-The following code stores the file names of all of the files in your directory that end in .pdf. To ensure that only PDFs of interest are imported, consider making a subfolder within your directory containing only the PDF extraction script file and the PDFs you want to extract data from.
-```{r 04-Chapter4-42}
-pdf_list <- list.files(path = "./Chapter_4/Module4_3_Input", pattern = "488.pdf$")
-```
-
-We can see that each of our file names is now contained in the list.
-```{r 04-Chapter4-43}
-head(pdf_list)
-```
-
-Next, we need to make a dataframe to store the extracted data. The `PDF Identifier` column will store the file name, and the `Text` column will store extracted text from the PDF.
-```{r 04-Chapter4-44}
-pdf_raw <- data.frame("PDF Identifier" = c(), "Text" = c())
-```
-
-The following code uses a `for` loop to loop through each file (as stored in the pdf_list vector) and extract the text from the PDF. Sometimes this code generates duplicates, so we will also remove the duplicates with `distinct()`.
-```{r 04-Chapter4-45, message = FALSE, warning = FALSE}
-for (i in 1:length(pdf_list)){
-
- # Iterating through each pdf file and separating each line of text
- document_text = pdf_text(paste("./Chapter_4/Module4_3_Input/", pdf_list[i], sep = "")) %>%
- strsplit("\n")
-
- # Saving the name of each PDF file and its text
- document = data.frame("PDF Identifier" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
- "Text" = document_text, stringsAsFactors = FALSE)
-
- colnames(document) <- c("PDF Identifier", "Text")
-
- # Appending the new text data to the dataframe
- pdf_raw <- rbind(pdf_raw, document)
-}
-
-pdf_raw <- pdf_raw %>%
- distinct()
-```
-
-The new dataframe contains the data from all of the PDFs, with the `PDF Identifier` column containing the name of the input PDF file that corresponds to the text in the column next to it.
-```{r 04-Chapter4-46}
-datatable(pdf_raw)
-```
-
-
-### Extracting Variables of Interest
-
-Specific variables of interest can be extracted from the `pdf_raw` dataframe by filtering the dataframe for rows that contain a specific character string. This character string could be the variable of interest (if that word or set of words is unique and only occurs in that one place in the document) or a character string that occurs in the same line of the PDF as your variable of interest. Examples of both of these approaches are shown below.
-
-It is important to note that there can be different numbers of spaces in each row and after each colon, which will change the `sep` argument for each variable. For example, there are a different number of spaces after the colon for "Dilution Factor" than there are for "Concentration" (see above PDF screenshot for reference). We will work through an example for the first variable of interest, dilution factor, in detail.
-
-First, we can filter the dataframe to keep only the rows that contain the string "Dilution Factor" in the `Text` column using the `grepl()` function.
-```{r 04-Chapter4-47}
-dilution_factor_df <- pdf_raw %>%
- filter(grepl("Dilution Factor", Text))
-
-datatable(dilution_factor_df)
-```
-
-The value we are trying to extract is at the end of a long character string. We will want to use the tidyverse function `separate()` to isolate those values, but we need to know what part of the character string will separate the dilution factor values from the rest of the text. To determine this, we can call just one of the data cells and copy the colon and following spaces for use in the `separate()` function.
-```{r 04-Chapter4-48}
-# Return the value in the first row and second column.
-dilution_factor_df[1,2]
-```
-
-Building on top of the previous code, we can now separate the dilution factor value from the rest of the text in the string. The `separate()` function takes an input data column and separates it into two or more columns based on the character string passed to the `sep` argument. Here, everything before the separator is discarded by setting the first new column to NA, and everything after it is stored in a new column called `Dilution Factor`. The starting `Text` column is removed by default.
-```{r 04-Chapter4-49}
-dilution_factor_df <- pdf_raw %>%
- filter(grepl("Dilution Factor", Text)) %>%
- separate(Text, into = c(NA, "Dilution Factor"), sep = ": ")
-
-datatable(dilution_factor_df)
-```
-
-For the "Original Concentration" variable, we filter rows by the string "pH" (which occurs on the same line) because the word "Concentration" is found in multiple locations in the document.
-```{r 04-Chapter4-50}
-concentration_df = pdf_raw %>%
- filter(grepl("pH", Text)) %>%
- separate(Text, c(NA, "Concentration"), sep = ": ")
-
-datatable(concentration_df)
-```
-
-With the dilution factor variable, there were no additional characters after the value of interest, but here, "Particles / mL" remains and needs to be removed so that the data can be used in downstream analyses. We can add an additional cleaning step to remove "Particles / mL" from the data and add the units to the column title. `sep = " P"` matches the space preceding the string to be removed and its first letter.
-```{r 04-Chapter4-51}
-concentration_df = pdf_raw %>%
- filter(grepl("pH", Text)) %>%
- separate(Text, c(NA, "Concentration"), sep = ": ") %>%
- separate(Concentration, c("Concentration (Particles/ mL)", NA), sep = " P")
-
-datatable(concentration_df)
-```
-
-Next, we want to extract size distribution data from the lower table. Note that the space in the first `separate()` function comes from the space between the "Number" and "Concentration" column in the string, and the space in the second `separate()` function comes from the space between the variable name and the number of interest. We can also convert values to numeric since they are currently stored as characters.
-```{r 04-Chapter4-52}
-size_distribution_df = pdf_raw %>%
- filter(grepl("X10", Text)| grepl("X50 ", Text)| grepl("X90", Text) | grepl("Mean", Text)| grepl("StdDev", Text)) %>%
- separate(Text, c("Text", NA), sep = " ") %>%
- separate(Text, c("Text", "Size"), sep = " ") %>%
- mutate(Size = as.numeric(Size)) %>%
- pivot_wider(names_from = Text, values_from = Size)
-
-datatable(size_distribution_df)
-```
-
-### Creating the final dataframe
-
-Now that we have created dataframes for all of the variables that we are interested in, we can join them together into one final dataframe.
-```{r 04-Chapter4-53}
-# Make list of all dataframes to include
-all_variables <- list(dilution_factor_df, concentration_df, size_distribution_df)
-
-# Combine dataframes using reduce function. Sometimes, duplicate rows are generated by full_join.
-full_df = all_variables %>%
- reduce(full_join, by = "PDF Identifier") %>%
- distinct()
-
-# View new dataframe
-datatable(full_df)
-```
-
-For easier downstream analysis, the last step is to separate the `PDF Identifier` column into an informative sample ID that matches up with other experimental data.
-```{r 04-Chapter4-54}
-final_df <- full_df %>%
- separate('PDF Identifier',
- # Split sample identifier column into new columns, retaining the original column
- into = c("Date", "FileNumber", "Experiment Number", "Sample_ID", "Size", "Wavelength"), sep = "_", remove = FALSE) %>%
- select(-c(FileNumber, Size)) %>% # Remove uninformative columns
- mutate(across('Dilution Factor':'StdDev', as.numeric)) # Change variables to numeric where appropriate
-
-datatable(final_df)
-```
-
-Let's make a graph to help us answer Environmental Health Question 1.
-```{r 04-Chapter4-55, message = FALSE}
-theme_set(theme_bw())
-
-data_for_graphing <- final_df %>%
- clean_names()
-
-data_for_graphing$sample_id <- factor(data_for_graphing$sample_id, levels = c("Ctrl", "A", "B", "C", "D"))
-
-ggplot(data_for_graphing, aes(x = sample_id, y = concentration_particles_m_l)) +
- geom_bar(stat = "identity", fill = "gray70", color = "black") +
- ylab("Particle Concentration (Particles/mL)") +
- xlab("Exposure")
-```
-
-:::question
-*With this, we can answer **Environmental Health Question #1***: Which chemical(s) increase and decrease the concentration of particles secreted by epithelial cells?
-:::
-
-:::answer
-**Answer**: Chemicals B and C appear to increase the concentration of secreted particles. However, additional replicates of this experiment are needed to assess statistical significance.
-:::
-
-
-
-## Importing Data Stored in PDF Tables
-
-The above workflow is useful if you just want to extract a few specific values from PDFs, but isn't as useful if data are already in a table format in a PDF. The [*tabulapdf*](https://github.com/ropensci/tabulapdf) package provides helpful functions for extracting dataframes from tables in PDF format.
-
-### Getting Familiar with the Example Dataset
-
-The following example is based on extracting dataframes from a long PDF containing many individual data tables. This particular PDF came from the NIH's BioLINCC Repository and details variables that researchers can request from the repository. Variables are part of larger datasets that contain many variables, with each dataset in a separate table. All of the tables are stored in one PDF file, and some of the tables are longer than one page (this will become relevant later on!). Similar to the first PDF workflow, remember that this is a specific example intended to demonstrate how to work through extracting data from PDFs. Modifications will need to be made for differently formatted PDFs.
-
-Here is what the first three pages of our 75-page starting PDF look like:
-```{r 04-Chapter4-56, echo = FALSE, out.width = "850px", fig.align = "center"}
-knitr::include_graphics("Chapter_4/Module4_3_Input/Module4_3_Image3.png")
-```
-
-If we zoom in a bit more on the first page, we can see that the dataset name is defined in bold above each table. This formatting is consistent throughout the PDF.
-```{r 04-Chapter4-57, echo = FALSE, out.width = "850px", fig.align = "center"}
-knitr::include_graphics("Chapter_4/Module4_3_Input/Module4_3_Image4.png")
-```
-
-The zoomed in view also allows us to see the columns and their contents more clearly. Some are more informative than others. The columns we are most interested in are listed below along with a description to guide you through the contents.
-
-- `Num`: The number assigned to each variable in the dataset. This numbering restarts with 1 for each table.
-- `Variable`: The variable name.
-- `Type`: The type (or class) of the variable, either numeric or character.
-- `Label`: A description of the variable and values associated with the variable.
-
-After extracting the data, we want to end up with a dataframe that contains all of the variables, their corresponding columns, and a column that indicates which dataset the variable is associated with:
-```{r 04-Chapter4-58, echo = FALSE}
-biolincc_final <- read.xlsx("Chapter_4/Module4_3_Input/Module4_3_InputData3.xlsx") %>%
- clean_names()
-
-datatable(biolincc_final)
-```
-
-### Workspace Preparation and Data Import
-
-#### Installing and loading required R packages
-
-Similar to previous sections, we need to install and load a few packages before proceeding. The *tabulapdf* package needs to be installed in a specific way as shown below and can sometimes be difficult to install on Macs. If errors are produced, follow the troubleshooting tips outlined in [this](https://stackoverflow.com/questions/67849830/how-to-install-rjava-package-in-mac-with-m1-architecture) Stack Overflow solution.
-
-```{r 04-Chapter4-59, eval = FALSE}
-# To install all of the packages except for tabulapdf
-if (!requireNamespace("stringr"))
- install.packages("stringr")
-if (!requireNamespace("pdftools"))
- install.packages("pdftools")
-if (!requireNamespace("rJava"))
- install.packages("rJava")
-```
-
-```{r 04-Chapter4-60, message = FALSE, eval = FALSE}
-# To install tabulapdf
-if (!require("remotes")) {
- install.packages("remotes")
-}
-
-library(remotes)
-
-remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulapdf"), force=TRUE, INSTALL_opts = "--no-multiarch")
-```
-
-Load packages:
-```{r 04-Chapter4-61, message = FALSE, eval = FALSE}
-library(tabulapdf)
-library(tidyverse)
-library(janitor)
-library(pdftools)
-library(stringr)
-```
-
-#### Initial data import from PDF file
-
-The `extract_tables()` function automatically extracts tables from PDFs and stores them as tibbles (a specific tidyverse data structure similar to a dataframe) within a list. One table is extracted per page, even if the table spans multiple pages. This line of code can take a few seconds to run depending on the length of your PDF.
-```{r 04-Chapter4-62}
-tables <- tabulapdf::extract_tables("Chapter_4/Module4_3_Input/Module4_3_InputData4.pdf", output = "tibble")
-```
-
-Glimpsing the first three elements in the tables list, we can see that each list element is a dataframe containing the columns from the PDF tables.
-```{r 04-Chapter4-63}
-glimpse(tables[1:3])
-```
-
-Exploring further, here is how each dataframe is formatted:
-```{r 04-Chapter4-64}
-datatable(tables[[1]])
-```
-
-Notice that, although the dataframe format mirrors the PDF table format, the `Label` column text is spread across multiple rows, with NAs in the other columns of those rows, because the text spanned multiple lines in the PDF. In our final dataframe, we will want the entire block of text in one cell. We can also remove the "Len", "Format", and "Informat" columns because they are not informative and they are not found in every table. Next, we will walk through how to clean up this table using a series of steps in tidyverse.
-
-### Cleaning dataframes
-
-First, we will select the columns we are interested in and use the `fill()` function to change the NAs in the "Num" column so that each line of text in the "Label" column has the correct "Num" value in the same row.
-```{r 04-Chapter4-65}
-cleaned_table1 <- data.frame(tables[[1]]) %>% # Extract the first table in the list
-
- # Select only the columns of interest
- select(c(Num, Variable, Type, Label)) %>%
-
- # Change the "Num" column to numeric, which is required for the fill function
- mutate(Num = as.numeric(Num)) %>%
-
- # Fill in the NAs in the "Num" column down the column
- fill(Num, .direction = "down")
-
-datatable(cleaned_table1)
-```
-
-We still need to move all of the Label text for each variable into one cell in one row instead of across multiple rows. For this, we can use the `unlist()` function. Here is a demonstration of how the `unlist()` function works using just the first variable:
-```{r 04-Chapter4-66}
-cleaned_table1_var1 <- cleaned_table1 %>%
-
- # Filter dataframe to just contain rows associated with the first variable
- filter(Num == 1) %>%
-
- # Paste all character strings in the Label column with a space in between them into a new column called "new_label"
- mutate(new_label = paste(unlist(Label), collapse = " "))
-
-datatable(cleaned_table1_var1)
-```
-
-We now have all of the text we want in one cell, but we have duplicate rows that we don't need. We can get rid of these rows by converting blank values to NA and then omitting rows that contain NAs.
-```{r 04-Chapter4-67, warning = FALSE}
-cleaned_table1_var1 <- cleaned_table1_var1 %>%
- mutate(across(Variable, na_if, "")) %>%
- na.omit()
-
-datatable(cleaned_table1_var1)
-```
-
-We need to apply this code to the whole dataframe and not just one variable, so we can add `group_by(Num)` to our cleaning workflow, followed by the code we just applied to our filtered dataframe.
-```{r 04-Chapter4-68, warning = FALSE}
-cleaned_table1 <- data.frame(tables[[1]]) %>% # Extract the first table in the list
-
- # Select only the columns of interest
- select(c(Num, Variable, Type, Label)) %>%
-
- # Change the "Num" column to numeric, which is required for the fill function
- mutate(Num = as.numeric(Num)) %>%
-
- # Fill in the NAs in the "Num" column down the column
- fill(Num, .direction = "down") %>%
-
- # Group by variable number
- group_by(Num) %>%
-  # Unlist the text and replace the text in the "Label" column with the unlisted text
- mutate(Label = paste(unlist(Label), collapse =" ")) %>%
-
- # Make blanks in the "Variable" column into NAs
- mutate(across(Variable, na_if, "")) %>%
-
- # Remove rows with NAs
- na.omit()
-
-datatable(cleaned_table1)
-```
-
-Ultimately, we need to clean up each dataframe in the list the same way, and we need all of the dataframes to be combined into one dataframe, instead of remaining in a list. There are a couple of different ways to do this, both of which rely on the code shown above for cleaning up each dataframe. Option #1 uses a for loop, while Option #2 applies a custom function to each dataframe in the list. Both result in the same ending dataframe!
-
-**Option #1**
-```{r 04-Chapter4-69, warning = FALSE}
-# Create a dataframe for storing variables
-variables <- data.frame()
-
-# Make a for loop to format each dataframe and add it to the variables
-for (i in 1:length(tables)) {
-
- table <- data.frame(tables[[i]]) %>%
- select(c(Num, Variable, Type, Label)) %>%
- mutate(Num = as.numeric(Num)) %>%
- fill(Num, .direction = "down") %>%
- group_by(Num) %>%
- mutate(Label = paste(unlist(Label), collapse =" ")) %>%
- mutate(across(Variable, na_if, "")) %>%
- na.omit()
-
- variables <- bind_rows(variables, table)
-}
-
-# View resulting dataframe
-datatable(variables)
-```
-
-**Option #2**
-```{r 04-Chapter4-70, warning = FALSE}
-# Write a function that applies all of the cleaning steps to a dataframe (output = cleaned dataframe)
-clean_tables <- function(data) {
-
- data <- data %>%
- select(c(Num, Variable, Type, Label)) %>%
- mutate(Num = as.numeric(Num)) %>%
- fill(Num, .direction = "down") %>%
- group_by(Num) %>%
- mutate(Label = paste(unlist(Label), collapse =" ")) %>%
- mutate(across(Variable, na_if, "")) %>%
- na.omit()
-
- return(data)
-}
-
-# Apply the function over each table in the list of tables
-tables_clean <- lapply(X = tables, FUN = clean_tables)
-
-# Unlist the dataframes and combine them into one dataframe
-tables_clean_unlisted <- do.call(rbind, tables_clean)
-
-# View resulting dataframe
-datatable(tables_clean_unlisted)
-```
-
-### Adding Dataset Names
-
-We now have a dataframe with all of the information from the PDFs contained in one long table. However, we still need to add back in the dataset name that appears above each table. We can't do this with the *tabulapdf* package because the name isn't stored in the table itself, but we can use the *pdftools* package for this!
-
-First, we will read in the PDF using the *pdftools* package. This results in a vector containing a long character string for each page of the PDF. Notice a few features of these character strings:
-
-+ Each line is separated by `\n`
-+ Elements [1] and [2] of the vector contain the text "Data Set Name:", while element [3] does not because the third page is a continuation of the table from the second page and therefore does not have a table title.
-
-```{r 04-Chapter4-71}
-table_names <- pdf_text("Chapter_4/Module4_3_Input/Module4_3_InputData4.pdf")
-
-head(table_names[1:3])
-```
-
-Similar to the table cleaning section, we will work through an example of extracting the text of interest from one of these character vectors, then apply the same code to all of the character vectors. First, we will select just the first element in the vector and make it into a dataframe.
-```{r 04-Chapter4-72}
-# Create dataframe
-dataset_name_df_var1 <- data.frame(strsplit(table_names[1], "\n"))
-
-# Clean column name
-colnames(dataset_name_df_var1) <- c("Text")
-
-# View dataframe
-datatable(dataset_name_df_var1)
-```
-
-Next, we will extract the dataset name using the same approach used in extracting values from the nanoparticle tracking example above and assign the name to a variable. We filter by the string "Data Set Name" because this is the start of the text string in the row where our dataset name is stored and is the same across all of our datasets.
-```{r 04-Chapter4-73}
-# Create dataframe
-dataset_name_df_var1 <- dataset_name_df_var1 %>%
- filter(grepl("Data Set Name", dataset_name_df_var1$Text)) %>%
- separate(Text, into = c(NA, "dataset"), sep = "Data Set Name: ")
-
-# Assign variable
-dataset_name_var1 <- dataset_name_df_var1[1,1]
-
-# View variable name
-dataset_name_var1
-```
-
-Now that we have the dataset name stored as a variable, we can create a dataframe that will correspond to the rows in our `variables` dataframe. The challenge is that each dataset contains a different number of variables! We can determine how many rows each dataset contains by returning to our `variables` dataframe and calculating the number of rows associated with each dataset. The following code splits the `variables` dataframe into a list of dataframes by each occurrence of 1 in the "Num" column (when the numbering restarts for a new dataset).
-```{r 04-Chapter4-74}
-# Calculate the number of rows associated with each dataset for reference
-dataset_list <- split(variables, cumsum(variables$Num == 1))
-
-glimpse(dataset_list[1:3])
-```
-
-The number of rows in each list element is the number of variables in that dataset. We can use this value in creating our dataframe of dataset names.
-```{r 04-Chapter4-75}
-# Store the number of rows in a variable
-n_rows = nrow(data.frame(dataset_list[1]))
-
-# Repeat the dataset name for the number of variables there are
-dataset_name_var1 = data.frame("dataset_name" = rep(dataset_name_var1, times = n_rows))
-
-# View dataframe
-datatable(dataset_name_var1)
-```
-
-We now have a dataframe that can be joined with our `variables` dataframe for the first table. We can apply this approach to each table in our original PDF using a `for` loop.
-```{r 04-Chapter4-76}
-# Make dataframe to store dataset names
-dataset_names <- data.frame()
-
-# Create list of datasets
-dataset_list <- split(variables, cumsum(variables$Num == 1))
-
-# Remove elements from the table_names vector that do not contain the string "Data Set Name"
-table_names_filtered <- stringr::str_subset(table_names, 'Data Set Name')
-
-# Populate dataset_names dataframe
-for (i in 1:length(table_names_filtered)) {
-
- # Get dataset name
- dataset_name_df <- data.frame(strsplit(table_names_filtered[i], "\n"))
-
- base::colnames(dataset_name_df) <- c("Text")
-
- dataset_name_df <- dataset_name_df %>%
- filter(grepl("Data Set Name", dataset_name_df$Text)) %>%
- separate(Text, into = c(NA, "dataset"), sep = "Data Set Name: ")
-
- dataset_name <- dataset_name_df[1,1]
-
- # Determine number of variables in that dataset
- data_set <- data.frame(dataset_list[i])
- n_rows = nrow(data_set)
-
- # Repeat the dataset name for the number of variables there are
- dataset_name = data.frame("Data Set Name" = rep(dataset_name, times = n_rows))
-
- # Bind to dataframe
- dataset_names <- bind_rows(dataset_names, dataset_name)
-
-}
-
-
-# Rename column
-colnames(dataset_names) <- c("Data Set Name")
-
-# View
-datatable(dataset_names)
-```
-
-### Combining Dataset Names and Variable Information
-
-Last, we will merge together the dataframe containing dataset names and variable information.
-```{r 04-Chapter4-77}
-# Merge together
-final_variable_df <- cbind(dataset_names, variables) %>%
- rename("Variable Description" = "Label", "Variable Number Within Dataset" = "Num") %>%
- clean_names()
-
-datatable(final_variable_df)
-```
-
-We can also determine how many total variables we have, all of which are accessible via the table we just generated.
-```{r 04-Chapter4-78}
-# Total number of variables
-nrow(final_variable_df)
-```
-
-:::question
-*With this, we can answer **Environmental Health Question #2***: How many variables total are available to us to request from the study whose data are stored in the repository, and what are these variables?
-:::
-
-:::answer
-**Answer**: There are 1,190 variables available to us. We can browse through the variables, including the sub-table they were from, the type of variable they are, and how they were derived, using the table we generated.
-:::
-
-
-
-## Concluding Remarks
-
-This training module provides example case studies demonstrating how to import PDF data into R and clean it so that it is more useful and accessible for analyses. The approaches demonstrated in this module, though tailored to our specific example data, can be adapted to many different types of PDF data.
-
-
-
-
-
-:::tyk
-Using the same input files that we used in part 1, "Importing Data from Many Single PDFs with the Same Formatting", found in the Module4_3_TYKInput folder, extract the remaining variables of interest (Original Concentration and Positions Removed) from the PDFs and summarize them in one dataframe.
-:::
-
-
-# 4.4 Two Group Comparisons and Visualizations
-
-This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Two group statistical comparisons, in which we want to know whether the means of two different groups are significantly different, are among the most common statistical tests in environmental health research, and in biomedical research more broadly. In this training module, we will demonstrate how to run two group statistical comparisons and how to present publication-quality figures and tables of these results. We will continue to use the same example dataset as in this chapter's previous modules, which represents concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are there significant differences in inflammatory biomarker concentrations between cells from male and female donors at baseline?
-2. Are there significant differences in inflammatory biomarker concentrations between cells exposed to 0 and 4 ppm acrolein?
-
-### Workspace Preparation and Data Import
-
-Here, we will import the processed data that we generated at the end of **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**. These data, along with the associated demographic data, were introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load packages that will be needed for the analysis, including previously introduced packages such as *openxlsx*, *tidyverse*, *DT*, and *ggpubr*, and additional packages relevant to statistical analysis and graphing that will be discussed in greater detail below.
-```{r 04-Chapter4-79, message = FALSE}
-# Load packages
-library(openxlsx)
-library(tidyverse)
-library(DT)
-library(rstatix)
-library(ggpubr)
-```
-
-```{r 04-Chapter4-80}
-# Import data
-biomarker_data <- read.xlsx("Chapter_4/Module4_4_Input/Module4_4_InputData1.xlsx")
-demographic_data <- read.xlsx("Chapter_4/Module4_4_Input/Module4_4_InputData2.xlsx")
-
-# View data
-datatable(biomarker_data)
-datatable(demographic_data)
-```
-
-
-
-## Overview of Two Group Statistical Tests
-
-Before applying statistical tests to our data, let's first review common two group statistical tests, their underlying assumptions, and variations on these tests.
-
-### Common Tests
-
-The two most common two group statistical tests are the...
-
-+ **T-test** (also known as the Student's t-test) and the
-+ **Wilcoxon test** (also known as the Wilcoxon rank-sum test or Mann-Whitney U test)
-
-Both of these tests evaluate the null hypothesis that the means of the two populations (groups) are the same; the alternative hypothesis is that they are not the same. A significant p-value means that we can reject the null hypothesis that the means of the two groups are the same. Whether or not a p-value meets the criteria for significance is experiment-specific, though commonly implemented significance thresholds include p < 0.05 and p < 0.01. This threshold is also called the alpha value, and it represents the accepted probability of a **type I error**, or false positive, in which the null hypothesis is rejected despite it actually being true. On the other hand, a **type II error**, or false negative, occurs when the null hypothesis is not rejected when it actually should have been.
-
-### Assumptions
-
-The main difference between these two tests lies in their assumptions about the underlying distribution of the data. T-tests assume that the data are drawn from a normal distribution, while Wilcoxon tests make no such assumption. Therefore, it is most appropriate to use a t-test when data are, in general, normally distributed and a Wilcoxon test when data are not normally distributed.
-
-Additional assumptions underlying both t-tests and Wilcoxon tests are:
-
-- The dependent variable is continuous or ordinal (discrete, ordered values).
-- The data are collected from a representative, random sample.
-
-T-tests also assume that:
-
-- The standard deviations of the two groups are approximately equal (also called homogeneity of variance).
-
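As a brief sketch (using simulated values, not this module's dataset), the normality and homogeneity of variance assumptions can be checked with base R's `shapiro.test()` and `var.test()` before choosing a test:

```r
# Simulated example data (hypothetical; not from this module's dataset)
set.seed(42)
group_a <- rnorm(20, mean = 5, sd = 1)
group_b <- rnorm(20, mean = 6, sd = 1)

# Shapiro-Wilk test of normality (null hypothesis: data are normally distributed)
shapiro.test(group_a)
shapiro.test(group_b)

# F test of equal variances (null hypothesis: the two variances are equal)
var.test(group_a, group_b)
```

Non-significant p-values here are consistent with the t-test assumptions being met; see **TAME 2.0 Module 3.3 Normality Testing & Data Transformations** for a fuller treatment of normality testing.
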
-### When to Use a Parametric vs Non-Parametric Test?
-
-Deciding whether to use a parametric or non-parametric test isn't a one-size-fits-all approach, and the decision should be made holistically for each dataset. Typically, parametric tests should be used when the data are normally distributed, continuous, randomly sampled, without extreme outliers, and representative of independent samples or participants. A non-parametric test can be used when the sample size (*n*) is small, outliers are present in the dataset, and/or the data are not normally distributed.
-
-This decision matters more when dealing with smaller sample sizes (*n* < 10), as smaller sample sizes are more prone to skew, and parametric tests are more sensitive to outliers. Therefore, when dealing with a smaller *n*, it might be best to perform a data transformation as discussed in **TAME 2.0 Module 3.3 Normality Testing & Data Transformations** and then perform a parametric test if the parametric assumptions can be met, or else to use non-parametric tests. For larger sample sizes (*n* > 50), outliers can potentially be removed and the dataset can be retested for assumptions. Lastly, what's considered "small" or "large" with regard to sample size can be subjective and should be considered within the context of the experiment.
-
-### Variations
-
-**Unequal Variance:** When the assumption of homogeneity of variance is not met, a Welch's t-test is generally preferred over a student's t-test. This can be implemented easily by setting `var.equal = FALSE` as an argument to the function executing the t-test (e.g., `t.test()`, `t_test()`). For more on testing homogeneity of variance in R, see [here](https://www.datanovia.com/en/lessons/homogeneity-of-variance-test-in-r/).
-
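For illustration, here is a minimal sketch on simulated data (hypothetical values, not from this module's dataset) comparing the two t-test variants in base R:

```r
# Simulated groups with unequal variances (hypothetical)
set.seed(1)
x <- rnorm(15, mean = 0, sd = 1)
y <- rnorm(15, mean = 1, sd = 3)

t.test(x, y, var.equal = TRUE)   # Student's t-test (assumes equal variances)
t.test(x, y)                     # Welch's t-test (var.equal = FALSE is the default)
```

Note that R defaults to the Welch variant, so the equal-variance assumption is only imposed when you explicitly request it.
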
-**Paired vs Unpaired:** Variations on the t-test and Wilcoxon test are used when the experimental design is paired (also called repeated measures or matching). This occurs when there are different treatments, exposures, or time points collected from the same biological/experimental unit. For example, cells from the same donor or passage number exposed to different concentrations of a chemical represents a paired design. Matched/paired experiments have increased power to detect significant differences because samples can be compared back to their own controls.
-
-**One vs Two-Sided:** A one-sided test evaluates the hypothesis that the mean of the treatment group significantly differs in a specific direction from the control. A two-sided test evaluates the hypothesis that the mean of the treatment group significantly differs from the control but does not specify a direction for that change. A two-sided test is the preferred approach and the default in R because, typically, either direction of change is possible and represents an informative finding. However, one-sided tests may be appropriate if an effect can only possibly occur in one direction. This can be implemented by setting `alternative = "less"` or `alternative = "greater"` within the statistical testing function (the default is `alternative = "two.sided"`).
-
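As a sketch with simulated data (`x` and `y` are hypothetical control and treatment vectors, not from this module's dataset), the direction of a one-sided test is specified through the `alternative` argument:

```r
# Simulated control and treatment groups (hypothetical)
set.seed(7)
x <- rnorm(20, mean = 0)  # hypothetical control
y <- rnorm(20, mean = 1)  # hypothetical treatment

t.test(y, x)                            # two-sided (default)
t.test(y, x, alternative = "greater")   # one-sided: is the mean of y greater than the mean of x?
```
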
-### Which test should I choose?
-
-We provide the following flowchart to help guide your choice of statistical test to compare two groups:
-```{r 04-Chapter4-81, echo = FALSE, fig.align = "center", out.width = "800px" }
-knitr::include_graphics("Chapter_4/Module4_4_Input/Module4_4_Image1.png")
-```
-
-
-
-## Statistical vs. Biological Significance
-
-Another important topic to discuss before proceeding to statistical testing is the true meaning of statistical significance. Statistical significance simply means that it is unlikely that the patterns being observed are due to random chance. However, just because an effect is statistically significant does not mean that it is biologically significant (i.e., has notable biological consequences). Often, there also needs to be a sufficient magnitude of effect (also called effect size) for the effects on a system to be meaningful. Although a p-value < 0.05 is often considered the threshold for significance, this is just a standard threshold set to a generally "acceptable" amount of error (5%). What about a p-value of 0.058 with a very large biological effect? Accounting for effect size is also why filters such as log~2~ fold change are often applied alongside p-value filters in 'omics-based analyses.
-
-In discussions of effect size, the population size is also a consideration - a small percentage increase in a very large population can represent tens of thousands of individuals (or more). Another consideration is that we frequently do not know what magnitude of biological effect should be considered "significant." These discussions can get complicated very quickly, and here we do not propose to have a solution to these thought experiments; rather, we recommend considering both statistical and biological significance when interpreting data. And, as stated in other sections of TAME, transparent reporting of statistical results will aid the audience in interpreting the data through their preferred perspectives.
-
-
-
-## Unpaired Test Example
-
-We will start by performing a statistical test to determine whether there are significant differences in biomarker concentrations between male and female donors at baseline (0 ppm exposure). Previously, we determined that the majority of our data were non-normally distributed (see **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), so we'll skip testing for that assumption in this module. Based on those results, we will use the Wilcoxon test to determine if there are significant differences between groups. The Wilcoxon test does not assume homogeneity of variance, so we do not need to test for that prior to applying the test. This is an unpaired analysis because the cells derived from male donors and the cells derived from female donors are different, independent sets of cells. Thus, the specific statistical test applied will be the Wilcoxon rank-sum test.
-
-First, we will filter our dataframe to only data representing the control (0 ppm) exposure:
-```{r 04-Chapter4-82}
-biomarker_data_malevsfemale <- biomarker_data %>% filter(Dose == "0")
-```
-
-Next, we need to add the demographic data to our dataframe:
-```{r 04-Chapter4-83}
-biomarker_data_malevsfemale <- biomarker_data_malevsfemale %>% left_join(demographic_data %>% select(Donor, Sex), by = "Donor")
-```
-
-Here is what our data look like now:
-```{r 04-Chapter4-84}
-datatable(biomarker_data_malevsfemale)
-```
-
-We can demonstrate the basic anatomy of the Wilcoxon test function `wilcox.test()` by running the function on just one variable.
-```{r 04-Chapter4-85}
-wilcox.test(IL1B ~ Sex, data = biomarker_data_malevsfemale)
-```
-The p-value of 0.8371 indicates that males and females do not have significantly different concentrations of IL-1$\beta$.
-
-The `wilcox.test()` function is part of the pre-loaded package *stats*. The package [*rstatix*](https://rpkgs.datanovia.com/rstatix/) provides identical statistical tests to *stats* but in a pipe-friendly (tidyverse-friendly) format, and these functions output results as dataframes rather than the text displayed above.
-```{r 04-Chapter4-86}
-biomarker_data_malevsfemale %>% wilcox_test(IL1B ~ Sex)
-```
-Here, we can see the exact same results as with the `wilcox.test()` function. For the rest of this module, we'll proceed with using the *rstatix* version of statistical testing functions.
-
-Although it is simple to run the Wilcoxon test with the code above, it's impractical for a large number of endpoints and doesn't store the results in an organized way. Instead, we can run the Wilcoxon test over every variable of interest using a `for` loop. There are also other ways you could approach this, such as a function applied over a list. This `for` loop runs the Wilcoxon test on each endpoint, stores the results in a dataframe, and then binds together the results dataframes for each variable of interest. Note that you could easily change `wilcox_test()` to `t_test()` and add additional arguments to modify the way the statistical test is run.
-```{r 04-Chapter4-87, warning = FALSE}
-# Create a vector with the names of the variables you want to run the test on
-endpoints <- colnames(biomarker_data_malevsfemale %>% select(IL1B:VEGF))
-
-# Create dataframe to store results
-sex_wilcoxres <- data.frame()
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable.
- endpoint <- endpoints[i]
-
- # Run wilcox test and store in results dataframe.
- res_df <- biomarker_data_malevsfemale %>%
- wilcox_test(as.formula(paste0(endpoint, " ~ Sex")))
-
- # Bind results from this test with other tests in this loop
- sex_wilcoxres <- rbind(sex_wilcoxres, res_df)
-
-}
-
-# View results
-sex_wilcoxres
-```
-
-:::question
-With this, we can answer **Environmental Health Question #1**:
-Are there significant differences in inflammatory biomarker concentrations between cells from male and female donors at baseline?
-:::
-
-:::answer
-**Answer**: There are not any significant differences in concentrations of any of our biomarkers between male and female donors at baseline.
-:::
-
-
-
-### Adjusting for Multiple Hypothesis Testing
-
-Above, we compared concentrations between males and females for six different endpoints or variables. Each time we run a comparison (with a p-value threshold of < 0.05), we are accepting that there is a 5% chance that a significant result will actually be due to random chance and that we are rejecting the null hypothesis when it is actually true (type I error).
-
-Since we are testing six different hypotheses simultaneously, what is the probability then of observing at least one significant result due just to chance?
-
-$$\mathbb{P}(\textrm{at least one significant result}) = 1 - \mathbb{P}(\textrm{no significant results}) = 1 - (1 - 0.05)^{6} = 0.26$$
-
-Here, we can see that we have a 26% chance of observing at least one significant result, even if all the tests are actually not significant. This chance increases as our number of endpoints increases; therefore, adjusting for multiple hypothesis testing becomes even more important with larger datasets. Many methods exist for adjusting for multiple hypothesis testing, with some of the most popular including the Bonferroni correction and false discovery rate (FDR) approaches such as the Benjamini-Hochberg (BH) procedure.
-
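The calculation above, and the adjustment methods just mentioned, can be sketched in base R (the raw p-values below are hypothetical, chosen only for illustration):

```r
# Probability of at least one false positive across m = 6 independent tests at alpha = 0.05
m <- 6
fwer <- 1 - (1 - 0.05)^m
fwer  # approximately 0.26

# Adjusting a hypothetical vector of raw p-values
p_raw <- c(0.001, 0.012, 0.030, 0.048, 0.210, 0.640)
p.adjust(p_raw, method = "bonferroni")  # most conservative
p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg (FDR)
```

Adjusted p-values are always at least as large as the raw values, with Bonferroni penalizing more heavily than BH.
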
-However, opinions about when and how to adjust for multiple hypothesis testing can vary and also depend on the question you are trying to answer. For example, when there are a low number of variables (e.g., < 10), it's often not necessary to adjust for multiple hypothesis testing, and when there are many variables (e.g., 100s to 1000s), it is necessary, but what about for an intermediate number of comparisons? Whether or not to apply multiple hypothesis test correction also depends on whether each endpoint is of interest on its own or whether the analysis seeks to make general statements about all of the endpoints together and on whether reducing type I or type II error is most important in the analysis.
-
-For this analysis, we will not adjust for multiple hypothesis testing due to our relatively low number of variables. For more on multiple hypothesis testing, check out the following publications:
-
-+ Jafari M, Ansari-Pour N. Why, When and How to Adjust Your P Values? Cell J. 2019;20(4):604-607. doi: 10.22074/cellj.2019.5992. PMID: [30124010](https://www.celljournal.org/article_250554.html)
-+ Feise RJ. Do multiple outcome measures require p-value adjustment? BMC Med Res Methodol. 2002;2:8. doi: 10.1186/1471-2288-2-8. PMID: [12069695](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-2-8#citeas)
-
-
-
-## Paired Test Example
-
-To demonstrate an example of a paired two group test, we can also determine whether exposure to 4 ppm acrolein significantly changes biomarker concentrations. This is now a paired design because each donor's cells were exposed to both 0 and 4 ppm acrolein.
-
-To prepare the data, we will filter the dataframe to only include 0 and 4 ppm:
-```{r 04-Chapter4-88}
-biomarker_data_0vs4 <- biomarker_data %>%
- filter(Dose == "0" | Dose == "4")
-```
-
-Let's view the dataframe. Note how the measurements for each donor are next to each other - this is an important element of the default handling of the paired analysis in R. The dataframe should have the donors in the same order for the 0 and 4 ppm data.
-```{r 04-Chapter4-89}
-datatable(biomarker_data_0vs4)
-```
-
-We can now run the same type of loop that we ran before, changing the independent variable in the formula to `~ Dose` and adding `paired = TRUE` to the `wilcox_test()` function.
-```{r 04-Chapter4-90}
-# Create a vector with the names of the variables you want to run the test on
-endpoints <- colnames(biomarker_data_0vs4 %>% select(IL1B:VEGF))
-
-# Create dataframe to store results
-dose_wilcoxres <- data.frame()
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable.
- endpoint <- endpoints[i]
-
- # Run wilcox test and store in results dataframe.
- res_df <- biomarker_data_0vs4 %>%
- wilcox_test(as.formula(paste0(endpoint, " ~ Dose")),
- paired = TRUE)
-
- # Bind results from this test with other tests in this loop
- dose_wilcoxres <- rbind(dose_wilcoxres, res_df)
-}
-
-# View results
-dose_wilcoxres
-```
-
-Although this dataframe contains useful information about our statistical test, such as the groups being compared, the sample size (*n*) of each group, and the test statistic, what we really want (and what would likely be shared in supplemental material) is a simplified version of these results in table format, with the more detailed information (*n*, specific statistical test, groups being compared) reported in the table legend. We can clean up the results using the following code to make clearer column names and ensure that the p-values are formatted consistently.
-
-```{r 04-Chapter4-91}
-dose_wilcoxres <- dose_wilcoxres %>%
- select(c(.y., p)) %>%
- mutate(p = format(p, digits = 3, scientific = TRUE)) %>%
- rename("Variable" = ".y.", "P-Value" = "p")
-
-datatable(dose_wilcoxres)
-```
-
-:::question
-With this, we can answer **Environmental Health Question #2**:
-
-Are there significant differences in inflammatory biomarker concentrations between cells exposed to 0 and 4 ppm acrolein?
-:::
-
-:::answer
-**Answer**: Yes, there are significant differences in IL-1$\beta$, IL-6, IL-8, TNF-$\alpha$, and VEGF concentrations between cells exposed to 0 and 4 ppm acrolein.
-:::
-
-
-
-## Visualizing Results
-
-Now, let's visualize our results using *ggplot2*. For an introduction to *ggplot2* visualizations, see **TAME 2.0 Modules 3.1 Data Visualizations** and **3.2 Improving Data Visualizations**, as well as the extensive online documentation available for *ggplot2*.
-
-### Single Plots
-We will start by making a very basic box and whisker plot of the IL-1$\beta$ data with individual data points overlaid. It is best practice to show all data points, allowing the reader to view the whole spread of the data, which can be obscured by plots such as bar plots with mean and standard error.
-```{r 04-Chapter4-92, fig.align = "center"}
-# Setting theme for plot
-theme_set(theme_bw())
-
-# Making plot
-ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
- geom_boxplot() +
- geom_jitter(position = position_jitter(0.15))
-```
-
-We could add statistical markings to denote significance to this graph manually in PowerPoint or Adobe Illustrator, but there are actually R packages that act as extensions to *ggplot2* and will do this for you! Two of our favorites are [*ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/) and [*ggsignif*](https://cran.r-project.org/web/packages/ggsignif/vignettes/intro.html). Here is an example using *ggpubr*:
-```{r 04-Chapter4-93, fig.align = "center"}
-ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
- geom_boxplot() +
- geom_jitter(position = position_jitter(0.15)) +
- # Adding a p value from a paired Wilcoxon test
- stat_compare_means(method = "wilcox.test", paired = TRUE)
-```
-
-We can further clean up our figure by modifying elements of the plot's theme, including the font sizes, axis range, colors, and the way that the statistical results are presented. Perfecting figures can be time consuming but ultimately worth it, because clear figures aid greatly in presenting a coherent story that is understandable to readers/listeners.
-```{r 04-Chapter4-94, fig.align = "center"}
-ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
- # outlier.shape = NA removes outliers
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- # Changing box plot colors
- scale_fill_manual(values = c("#BFBFBF", "#EE2B2B")) +
- geom_jitter(size = 3, position = position_jitter(0.15)) +
- # Adding a p value from a paired Wilcoxon test
- stat_compare_means(method = "wilcox.test", paired = TRUE,
- # Changing the value to asterisks and moving to the middle of the plot
- label = "p.signif", label.x = 1.5, label.y = 4.5, size = 12) +
- ylim(2.5, 5) +
- # Changing y axis label
- labs(y = "Log2(IL-1\u03B2 (pg/mL))") +
- # Removing legend
- theme(legend.position = "none",
- axis.title = element_text(color = "black", size = 15),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 12))
-```
-
-### Multiple plots
-
-Making one plot was relatively straightforward, but to graph all of our endpoints, we would either need to repeat that code chunk for each individual biomarker or write a function to create similar plots given a specific biomarker as input. Then, we would need to stitch together the individual plots in external software or using a package such as [*patchwork*](https://patchwork.data-imaginist.com/) (which is a great package if you need to combine individual figures from different sources or different size ratios!).
-
-While these are workable solutions and would get us to the same place, *ggplot2* actually contains a function - `facet_wrap()` - that can be used to graph multiple endpoints from the same groups in one figure panel, which takes care of a lot of the work for us!
-
-To prepare our data for facet plotting, first we will pivot it longer:
-```{r 04-Chapter4-95}
-biomarker_data_0vs4_long <- biomarker_data_0vs4 %>%
- pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
-
-datatable(biomarker_data_0vs4_long)
-```
-
-Then, we can use similar code to what we used to make our single graph, with a few modifications to plot multiple panels simultaneously and adjust the style of the plot. Although it is beyond the scope of this module to explain the mechanics of each line of code, here are a few specific things to note about the code below that may be helpful when constructing similar plots:
-
-- To create the plot with all six endpoints instead of just one, we:
- - Changed input dataframe from wide to long format
- - Changed `y =` from one specific endpoint to `value`
- - Added the `facet_wrap()` argument
- - `~ variable` tells the function to make an individual plot for each variable
- - `nrow = 2 ` tells the function to put the plots into two rows
- - `scales = "free_y"` tells the function to allow each individual graph to have a unique y-scale that best shows all of the data on that graph
- - `labeller` feeds the edited (more stylistically correct) names for each panel to the function
-
-- To ensure that the statistical results appear cleanly, within `stat_compare_means()`, we:
- - Added `hide.ns = TRUE` so that only significant results are shown
- - Added `label.x.npc = "center"` and `hjust = 0.5` to ensure that asterisks are centered on the plot and that the text is center justified
-
-- To add padding along the y axis, allowing space for significance asterisks, we added `scale_y_continuous(expand = expansion(mult = c(0.1, 0.4)))`
-
-```{r 04-Chapter4-96, warning = FALSE, fig.align = "center"}
-# Create clean labels for the graph titles
-new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
- "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
-
-# Make graph
-ggplot(biomarker_data_0vs4_long, aes(x = Dose, y = value)) +
- # outlier.shape = NA removes outliers
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- # Changing box plot colors
- scale_fill_manual(values = c("#BFBFBF", "#EE2B2B")) +
- geom_jitter(size = 1.5, position = position_jitter(0.15)) +
- # Adding a p value from a paired Wilcoxon test
- stat_compare_means(method = "wilcox.test", paired = TRUE,
- # Changing the value to asterisks and moving to the middle of the plot
- label = "p.signif", size = 10, hide.ns = TRUE, label.x.npc = "center",
- hjust = 0.5) +
- # Adding padding y axis
- scale_y_continuous(expand = expansion(mult = c(0.1, 0.4))) +
- # Changing y axis label
- ylab(expression(Log[2]*"(Concentration (pg/ml))")) +
- # Faceting by each biomarker
- facet_wrap(~ variable, nrow = 2, scales = "free_y", labeller = labeller(variable = new_labels)) +
- # Removing legend
- theme(legend.position = "none",
- axis.title = element_text(color = "black", size = 12),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 10),
- strip.text = element_text(size = 12, face = "bold"))
-```
-
-An appropriate title for this figure could be:
-
-"**Figure X. Exposure to 4 ppm acrolein increases inflammatory biomarker secretion in primary human bronchial epithelial cells.** Groups were compared using the Wilcoxon signed rank test. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001, *n* = 16 per group (paired)."
-
-
-
-## Concluding Remarks
-
-In this module, we introduced two group statistical tests, which are some of the most common statistical tests applied in biomedical research. We applied these tests to our example dataset and demonstrated how to produce publication-quality tables and figures of our results. Implementing a workflow such as this enables efficient analysis of wet-bench generated data and customization of output figures and tables suited to your personal preferences.
-
-
-
-
-
-:::tyk
-Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two non-normally distributed. Due to the relatively low *n* of this dataset, we therefore recommend using non-parametric statistical tests.
-
-Using the same processes demonstrated in this module and the provided data (“Module4_4_TYKInput1.xlsx” (functional data) and “Module4_4_TYKInput2.xlsx” (demographic data)), run analyses and make publication-quality figures and tables to answer the following questions:
-
-1. Are there significant differences in functional endpoints between cells from male and female donors at baseline?
-2. Are there significant differences in functional endpoints between cells exposed to 0 and 4 ppm acrolein? Go ahead and use non-parametric tests for these analyses.
-:::
-
-# 4.5 Multi-Group and Multi-Variable Comparisons and Visualizations
-
-This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-In the previous module, we covered how to apply two-group statistical testing, one of the most basic types of statistical tests. In this module, we will build on the concepts introduced previously to apply statistical testing to datasets with more than two groups, which are also very common in environmental health research. We will review common multi-group overall effects tests and post-hoc tests, and we will demonstrate how to apply these tests and how to graph the results using the same example dataset as in previous modules in this chapter, which represents concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are there significant differences in inflammatory biomarker concentrations between different doses of acrolein?
-2. Do TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein?
-
-### Workspace Preparation and Data Import
-
-Here, we will import the processed data that we generated at the end of **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**, along with the associated demographic data; both were introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load the packages needed for the analysis, including the previously introduced packages *openxlsx*, *tidyverse*, *DT*, *ggpubr*, and *rstatix*.
-
-#### Cleaning the global environment
-```{r 04-Chapter4-97, clear_envi, echo=TRUE, eval=TRUE}
-rm(list=ls())
-```
-
-#### Loading R packages required for this session
-```{r 04-Chapter4-98, load_libs, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
-library(openxlsx)
-library(tidyverse)
-library(DT)
-library(rstatix)
-library(ggpubr)
-```
-
-#### Set your working directory
-```{r 04-Chapter4-99, file_path, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-```{r 04-Chapter4-100, read_data, echo=TRUE, eval=TRUE}
-biomarker_data <- read.xlsx("Chapter_4/Module4_5_Input/Module4_5_InputData1.xlsx")
-demographic_data <- read.xlsx("Chapter_4/Module4_5_Input/Module4_5_InputData2.xlsx")
-
-# View data
-datatable(biomarker_data)
-datatable(demographic_data)
-```
-
-
-## Overview of Multi-Group Statistical Tests
-
-Before applying statistical tests to our data, let's first review the mechanics of multi-group statistical tests, including overall effects tests and post-hoc tests.
-```{r 04-Chapter4-101, echo = FALSE, fig.align = "center", out.width = "600px" }
-knitr::include_graphics("Chapter_4/Module4_5_Input/Module4_5_Image1.png")
-```
-
-### Overall Effects Tests
-
-The first step for multi-group statistical testing is to run an overall effects test. The null hypothesis for the overall effects test is that there are no differences among group means. A significant p-value rejects the null hypothesis that the groups are drawn from populations with the same mean and indicates that at least one group mean differs significantly from the others. Similar to two-group statistical testing, the choice of which specific overall test to run depends on whether the data are normally or non-normally distributed and whether the experimental design is paired:
-
-```{r 04-Chapter4-102, echo = FALSE, fig.align = "center", out.width = "700px" }
-knitr::include_graphics("Chapter_4/Module4_5_Input/Module4_5_Image2.png")
-```
-
-Importantly, overall effects tests return **one** p-value regardless of the number of groups being compared. To determine which pairwise comparisons are significant, post-hoc testing is needed.
-
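To make this concrete, here is a brief sketch of two unpaired overall effects tests in base R, run on simulated data (hypothetical, not this module's dataset):

```r
# Simulated three-group dataset (hypothetical)
set.seed(3)
df <- data.frame(
  value = c(rnorm(8, mean = 0), rnorm(8, mean = 0.5), rnorm(8, mean = 1)),
  group = rep(c("A", "B", "C"), each = 8)
)

summary(aov(value ~ group, data = df))  # one-way ANOVA (parametric)
kruskal.test(value ~ group, data = df)  # Kruskal-Wallis (non-parametric)
```

Each call returns a single overall p-value for all three groups; neither identifies which specific pair of groups differs.
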
-### Post-Hoc Testing
-
-If significance is obtained with an overall effects test, we can use post-hoc testing to determine which specific pairs of groups are significantly different from each other. Just as with two group statistical tests and overall effects multi-group statistical tests, choosing the appropriate post-hoc test depends on the data's normality and whether the experimental design is paired:
-```{r 04-Chapter4-103, echo = FALSE, fig.align = "center", out.width = "700px" }
-knitr::include_graphics("Chapter_4/Module4_5_Input/Module4_5_Image3.png")
-```
-
-Note that the above diagram represents commonly selected post-hoc tests; others may also be appropriate depending on your specific experimental design. As with other aspects of the analysis, be sure to report which post-hoc test(s) you performed!
-
-### Correcting for Multiple Hypothesis Testing
-
-Correcting for multiple hypothesis testing is important for both the overall effects test (if you are running it over many endpoints) and post-hoc tests; however, it is particularly important for post-hoc tests. This is because even an analysis of a relatively small number of experimental groups results in quite a few pairwise comparisons: comparing each of our five dose groups to every other dose group in our example data requires 10 separate statistical tests! Therefore, it is generally advisable to adjust pairwise post-hoc testing p-values. Tukey's HSD function within *rstatix* does this automatically, while the pairwise t-test, pairwise Wilcoxon test, and Dunn's test functions do not; for these, p-value adjustment can be added using the `p.adjust.method = ` argument.
-
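As a sketch of how such an adjustment looks in base R (on simulated, hypothetical data), `pairwise.wilcox.test()` exposes the same `p.adjust.method` idea:

```r
# Simulated three-group dataset (hypothetical)
set.seed(3)
value <- c(rnorm(8, mean = 0), rnorm(8, mean = 0.5), rnorm(8, mean = 1))
group <- rep(c("A", "B", "C"), each = 8)

# All pairwise Wilcoxon rank-sum tests, Benjamini-Hochberg adjusted
pairwise.wilcox.test(value, group, p.adjust.method = "BH")
```

The result is a matrix of adjusted p-values, one per pairwise comparison (three comparisons for three groups).
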
-When applying a post-hoc test, you may choose to compare every group to every other group, or you may only be interested in significant differences between specific groups (e.g., treatment groups vs. a control). This choice will be governed by your hypothesis. Statistical testing functions will typically default to comparing all groups to each other, but the comparisons can be defined using the `comparisons = ` argument if you want to restrict the test to specific comparisons. It is important to decide at the beginning of your analysis which comparisons are relevant to your hypothesis because the number of pairwise tests performed in the post-hoc analysis will influence how much the resulting p-values will be adjusted for multiple hypothesis testing.
-
-### Which test should I choose?
-
-Use the following flowchart to help guide your choice of statistical test to compare multiple groups:
-```{r 04-Chapter4-104, echo = FALSE, fig.align = "center", out.width = "900px" }
-knitr::include_graphics("Chapter_4/Module4_5_Input/Module4_5_Image4.png")
-```
-
-
-
-## Multi-Group Analysis Example
-
-To determine whether there are significant differences across all of our doses, the Friedman test is the most appropriate due to our matched experimental design and non-normally distributed data. The `friedman_test()` function is part of the [rstatix](https://github.com/kassambara/rstatix) package. This package also has many other helpful functions for statistical tests that are pipe/tidyverse friendly. To demonstrate how this test works, we will first perform the test on one variable:
-```{r 04-Chapter4-105}
-biomarker_data %>% friedman_test(IL1B ~ Dose | Donor)
-```
-
-A p-value of 0.01 indicates that we can reject the null hypothesis that all of our dose groups are drawn from the same distribution.
-
-Now, we can run a `for` loop similar to our two-group comparisons in **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations** to determine the overall p-value for each endpoint:
-```{r 04-Chapter4-106}
-# Create a vector with the names of the variables you want to run the test on
-endpoints <- colnames(biomarker_data %>% select(IL1B:VEGF))
-
-# Create data frame to store results
-dose_friedmanres <- data.frame()
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable.
- endpoint <- endpoints[i]
-
- # Run Friedman test and store in results data frame.
- res <- biomarker_data %>%
- friedman_test(as.formula(paste0(endpoint, " ~ Dose | Donor"))) %>%
- select(c(.y., p))
-
- dose_friedmanres <- rbind(dose_friedmanres, res)
-}
-
-# View results
-datatable(dose_friedmanres)
-```
-
-These results demonstrate that all of our endpoints have significant overall differences across doses (p < 0.05). To determine which pairwise comparisons are significant, we next need to apply a post-hoc test. We will apply a pairwise, paired Wilcoxon test due to our experimental design and data distribution, with the Benjamini-Hochberg (BH) correction for multiple testing:
-```{r 04-Chapter4-107}
-dose_wilcox_posthoc_IL1B <- biomarker_data %>%
- pairwise_wilcox_test(IL1B ~ Dose, paired = TRUE, p.adjust.method = "BH")
-
-dose_wilcox_posthoc_IL1B
-```
-
-Here, we can now see whether there are statistically significant differences in IL-1$\beta$ secretion between each of our doses. To generate pairwise comparison results for each of our inflammatory biomarkers, we can run a for loop similar to the one we ran for our overall test:
-```{r 04-Chapter4-108}
-# Create a vector with the names of the variables you want to run the test on
-endpoints <- colnames(biomarker_data %>% select(IL1B:VEGF))
-
-# Create data frame to store results
-dose_wilcox_posthoc <- data.frame()
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable.
- endpoint <- endpoints[i]
-
- # Run wilcox test and store in results data frame.
- res <- biomarker_data %>%
- pairwise_wilcox_test(as.formula(paste0(endpoint, " ~ Dose")), paired = TRUE, p.adjust.method = "BH")
-
- dose_wilcox_posthoc <- rbind(dose_wilcox_posthoc, res)
-}
-
-# View results
-datatable(dose_wilcox_posthoc)
-```
-
-We now have a dataframe storing all of our pairwise comparison results. However, this is a lot to scroll through, making it hard to interpret. We can generate a publication-quality table by reshaping these results and joining them with the overall test results.
-```{r 04-Chapter4-109}
-dose_results_cleaned <- dose_wilcox_posthoc %>%
- unite(comparison, group1, group2, sep = " vs. ") %>%
- select(c(.y., comparison, p.adj)) %>%
- pivot_wider(id_cols = ".y.", names_from = "comparison", values_from = "p.adj") %>%
- left_join(dose_friedmanres, by = ".y.") %>%
- relocate(p, .after = ".y.") %>%
- rename("Variable" = ".y.", "Overall" = "p") %>%
- mutate(across('Overall':'2 vs. 4', \(x) format(x, scientific = TRUE, digits = 3)))
-
-datatable(dose_results_cleaned)
-```
-
-To more easily see overall significance patterns, we could also make the same table but with significance stars instead of p-values by keeping the `p.adj.signif` column instead of the `p.adj` column in our post-hoc test results dataframe:
-```{r 04-Chapter4-110}
-dose_results_cleaned_2 <- dose_wilcox_posthoc %>%
- unite(comparison, group1, group2, sep = " vs. ") %>%
- select(c(.y., comparison, p.adj.signif)) %>%
- pivot_wider(id_cols = ".y.", names_from = "comparison", values_from = "p.adj.signif") %>%
- left_join(dose_friedmanres, by = ".y.") %>%
- relocate(p, .after = ".y.") %>%
- rename("Variable" = ".y.", "Overall" = "p") %>%
- # Only the Overall column is numeric here; the significance-star columns are already character
- mutate(Overall = format(Overall, scientific = TRUE, digits = 3))
-
-datatable(dose_results_cleaned_2)
-```
-
-### Answer to Environmental Health Question 1
-:::question
- With this, we can answer **Environmental Health Question #1**: Are there significant differences in inflammatory biomarker concentrations between different doses of acrolein?
-:::
-
-:::answer
-**Answer**: Yes, there are significant differences in inflammatory biomarker concentrations between different doses of acrolein. The overall p-values for all biomarkers are significant. Within each biomarker, at least one pairwise comparison was significant between doses, with a majority of these significant comparisons being with the highest dose (4 ppm).
-:::
-
-
-
-## Visualization of Multi-Group Statistical Results
-
-The statistical results we generated are a lot to digest in table format, so it can be helpful to graph the results. As our statistical testing becomes more complicated, so does the code used to generate results. The *ggpubr* package can perform statistical testing and overlay the results onto graphs for a specific set of tests, such as overall effects tests and unpaired t-tests or Wilcoxon tests. However, for tests that aren't available by default, the package also contains the helpful `stat_pvalue_manual()` function that can be added to plots. This is what we will need to use to add the results of our pairwise, paired Wilcoxon test with BH correction, as there is no option for BH correction within the default function we might otherwise use (`stat_compare_means()`). We will first work through an example of this using one of our endpoints, and then we will demonstrate how to apply it to facet plotting.
-
-### Single Plot
-
-We first need to format our existing statistical results so that they match the format that the function needs as input. Specifically, the dataframe needs to contain the following columns:
-
-+ `group1` and `group2`: the groups being compared
-+ A column containing the results you want displayed (`p`, `p.adj`, or `p.adj.signif` typically)
-+ `y.position`, which tells the function where to plot the significance markers
-
-Our results dataframe for IL-1$\beta$ already contains our groups and p-values:
-```{r 04-Chapter4-111}
-datatable(dose_wilcox_posthoc_IL1B)
-```
-
-We can add the position columns using the function `add_xy_position()`:
-
-```{r 04-Chapter4-112}
-dose_wilcox_posthoc_IL1B <- dose_wilcox_posthoc_IL1B %>%
- add_xy_position(x = "Dose", step.increase = 2)
-
-datatable(dose_wilcox_posthoc_IL1B)
-```
-
-Now, we are ready to make a graph of our results. We will use `stat_friedman_test()` to add our overall p-value and `stat_pvalue_manual()` to add our pairwise values.
-```{r 04-Chapter4-113, out.width = "600px", message = FALSE, fig.align = "center"}
-# Set graphing theme
-theme_set(theme_bw())
-
-# Make plot
-ggplot(biomarker_data, aes(x = Dose, y = IL1B)) +
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
- geom_jitter(size = 3, position = position_jitter(0.15)) +
- stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
- label.x.npc = "left", label.y = 9.5, hjust = 0.5, size = 6) +
- stat_pvalue_manual(dose_wilcox_posthoc_IL1B, label = "p.adj.signif", size = 12, hide.ns = TRUE) +
- ylim(2.5, 10) +
- labs(y = "Log2(IL-1\u03B2 (pg/mL))", x = "Acrolein (ppm)") +
- theme(legend.position = "none",
- axis.title = element_text(color = "black", size = 15),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 12))
-```
-
-However, to make room for all of our annotations, our data become compressed, making the underlying trends difficult to see. Although presentation of statistical results is largely a matter of personal preference, we could clean up this plot by placing our annotations on top of the bars, with an indication in the figure legend of which dose each comparison refers to. We will do this by:
-
-1. Filtering our results to those that are significant.
-2. Changing the symbol for comparisons that are not to the 0 dose.
-3. Layering this text onto the plot with `geom_text()` rather than `stat_pvalue_manual()`.
-
-First, let's filter our results to the significant ones and change the symbol for comparisons that are not to the 0 dose to a caret (^) instead of stars. We can do this by creating a new column called `label` that keeps the existing label if `group1` is 0 and otherwise changes the label to a caret string of the same length. We then use `summarise()` to paste the labels for each group together, resulting in a final dataframe containing the annotations for our plot.
-
-```{r 04-Chapter4-114}
-dose_wilcox_posthoc_IL1B_2 <- dose_wilcox_posthoc_IL1B %>%
-
- # Filter results to those that are significant
- filter(p.adj <= 0.05) %>%
-
- # Make new symbol
- mutate(label = ifelse(group1 == "0", p.adj.signif, strrep("^", nchar(p.adj.signif)))) %>%
-
- # Select only the columns we need
- select(c(group1, group2, label)) %>%
-
- # Combine symbols for the same group
- group_by(group2) %>% summarise(label = paste(label, collapse=" ")) %>%
-
- # Remove duplicate row
- distinct(group2, .keep_all = TRUE) %>%
-
- # Rename group2 to dose
- rename("Dose" = "group2")
-
-dose_wilcox_posthoc_IL1B_2
-```
-
-Then, we can use the same code as for our previous plot, but instead of using `stat_pvalue_manual()`, we will use `geom_text()` in combination with the dataframe we just created.
-```{r 04-Chapter4-115, out.width = "600px", fig.align = "center"}
-ggplot(biomarker_data, aes(x = Dose, y = IL1B)) +
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
- geom_jitter(size = 3, position = position_jitter(0.15)) +
- stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
- label.x.npc = "left", label.y = 4.85, hjust = 0.5, size = 6) +
- geom_text(data = dose_wilcox_posthoc_IL1B_2, aes(x = Dose, y = 4.5,
- label = paste0(label)), size = 10, hjust = 0.5) +
- ylim(2.5, 5) +
- labs(y = "Log2(IL-1\u03B2 (pg/mL))", x = "Acrolein (ppm)") +
- theme(legend.position = "none",
- axis.title = element_text(color = "black", size = 15),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 12))
-```
-
-An appropriate title for this figure could be:
-
-"**Figure X. Exposure to 0.6-4 ppm acrolein increases IL-1$\beta$ secretion in primary human bronchial epithelial cells.** Groups were compared using the Friedman test to obtain overall p-value and Wilcoxon signed rank test for post-hoc testing. * p < 0.05 in comparison with 0 ppm, ^ p < 0.05 in comparison with 0.6 ppm, n = 16 per group (paired)."
-
-
-### Faceted Plot
-
-Ideally, we would extend this sort of graphical approach to our faceted plot showing all of our endpoints. However, there are quite a few statistically significant comparisons to graph, including comparisons that are significant between different pairs of doses (not just back to the control). While we could attempt to graph all of them, ultimately, this will lead to a cluttered figure panel. When thinking about how to simplify our plots, some options are:
-
-1. Instead of using the number of symbols to represent p-value thresholds, we could use a single symbol to represent any comparison with at least p < 0.05, varying the symbol depending on which group the significance is in comparison to. Symbols can be difficult to parse in R, so we could use letters or even the group names above the column of interest. For example, if the concentration of an endpoint at 2 ppm was significantly different from both 0 and 0.6 ppm, we could annotate "0, 0.6" above the 2 ppm column, or we could choose letters ("a, b") or symbols ("*, ^") to convey these results.
-
-2. If the pattern is the same across many of the endpoints measured, we could graph a subset of the endpoints with the most notable data trends or the most biological meaning for the main body of the manuscript, with data for additional endpoints referred to in the text and shown in the supplemental figures or tables.
-
-3. If most of the significant comparisons are back to the control group, we could choose to only show comparisons with the control group, with textual description of the other significant comparisons and indication that those specific p-values can be viewed in the supplemental table of results.
-
-Which approach you decide to take (or maybe another approach altogether) is a matter of both personal preference and your specific study goals. You may also decide that it is important to you to show all significant comparisons, which will require more careful formatting of the plots to ensure that all text and annotations are legible. For this module, we will proceed with option #3 because many of our comparisons to the control dose (0) are significant, and we have enough groups that there likely will not be space to annotate all of them above our data.
-
-We will take similar steps here that we did when constructing our single endpoint graph, with a couple of small differences. Specifically, we need to:
-
-1. Create a dataframe of labels/annotations as we did above, but now filtered to only significant comparisons with the 0 group.
-2. Add to the label/annotation dataframe what we want the y position for each of the labels to be, which will be different for each endpoint.
-
-First, let's create our annotations dataframe. We will start with the results dataframe from our post-hoc testing:
-```{r 04-Chapter4-116}
-datatable(dose_wilcox_posthoc)
-```
-
-```{r 04-Chapter4-117}
-dose_wilcox_posthoc_forgraph <- dose_wilcox_posthoc %>%
-
- filter(p.adj <= 0.05) %>%
-
- # Filter for only comparisons to 0
- filter(group1 == "0") %>%
-
- # Rename columns
- rename("variable" = ".y.", "Dose" = "group2")
-
-datatable(dose_wilcox_posthoc_forgraph)
-```
-
-The `Dose` column will be used to tell *ggplot2* where to place the annotations on the x axis, but we need to also specify where to add the annotations on the y axis. This will be different for each variable because each variable is on a different scale. We can approach this by computing the maximum value of each variable, then increasing that by 20% to add some space on top of the points.
-
-```{r 04-Chapter4-118}
-sig_labs_y <- biomarker_data %>%
- summarise(across(IL1B:VEGF, \(x) max(x))) %>%
- t() %>% as.data.frame() %>%
- rownames_to_column("variable") %>%
- rename("y_pos" = "V1") %>%
- mutate(y_pos = y_pos*1.2)
-
-sig_labs_y
-```
-
-Then, we can join these data to our labeling dataframe to complete what we need to make the annotations.
-```{r 04-Chapter4-119}
-dose_wilcox_posthoc_forgraph <- dose_wilcox_posthoc_forgraph %>%
- left_join(sig_labs_y, by = "variable")
-```
-
-Now, it's time to graph! Keep in mind that although the plotting script can get long and unwieldy, each line is just a new instruction to ggplot about a formatting element or an additional layer to add to the graph.
-```{r 04-Chapter4-120, out.width = "800px", fig.align = "center"}
-# Pivot data longer
-biomarker_data_long <- biomarker_data %>%
- pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
-
-# Create clean labels for the graph titles
-new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
- "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
-
-# Make graph
-ggplot(biomarker_data_long, aes(x = Dose, y = value)) +
- # outlier.shape = NA hides the boxplot's outlier points, since geom_jitter() displays all points
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- # Changing box plot colors
- scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
- geom_jitter(size = 1.5, position = position_jitter(0.15)) +
- # Adding a p value from Friedman test
- stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
- label.x.npc = "left", vjust = -3.5, hjust = 0.1, size = 3.5) +
- # Add significance labels (size and hjust are fixed settings, so they belong outside aes())
- geom_text(data = dose_wilcox_posthoc_forgraph, aes(x = Dose, y = y_pos, label = p.adj.signif),
- size = 5, hjust = 0.5) +
- # Adding padding to the y axis
- scale_y_continuous(expand = expansion(mult = c(0.1, 0.6))) +
- # Changing y axis label
- ylab(expression(Log[2]*"(Concentration (pg/ml))")) +
- # Changing x axis label
- xlab("Acrolein (ppm)") +
- # Faceting by each biomarker
- facet_wrap(~ variable, nrow = 2, scales = "free_y", labeller = labeller(variable = new_labels)) +
- # Removing legend
- theme(legend.position = "none",
- axis.title = element_text(color = "black", size = 12),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 10),
- strip.text = element_text(size = 12, face = "bold"))
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Exposure to acrolein increases secretion of proinflammatory biomarkers in primary human bronchial epithelial cells.** Groups were compared using the Friedman test to obtain overall p-value and Wilcoxon signed rank test for post-hoc testing. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001 for comparison with control. For additional significant comparisons, see Supplemental Table X. n = 16 per group (paired).”
-
-### Answer to Environmental Health Question 2
-:::question
- With this, we can answer **Environmental Health Question #2**: Do TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein?
-:::
-
-:::answer
-**Answer**: Yes, TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein, which we were able to visualize, along with other mediators, in our facet plot.
-:::
-
-
-
-## Concluding Remarks
-
-In this module, we introduced common multi-group statistical tests, including both overall effects tests and post-hoc testing. We applied these tests to our example dataset and demonstrated how to produce publication-quality tables and figures of our results. Implementing a workflow such as this enables efficient analysis of wet-bench generated data and customization of output figures and tables suited to your personal preferences.
-
-### Additional Resources
-
-- [STHDA: How to Add P-Values and Significance Levels to ggplots using *ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/)
-- [Adding p-values with *ggprism*](https://cran.r-project.org/web/packages/ggprism/vignettes/pvalues.html)
-- [Overview of *ggsignif*](https://const-ae.github.io/ggsignif/)
-
-
-
-
-
-:::tyk
-
-Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two non-normally distributed.
-
-Use the same processes demonstrated in this module and the provided data (“Module4_5_TYKInput.xlsx” (functional data)) to run analyses and make a publication-quality figure panel and table to answer the following question: Are there significant differences in functional endpoints between cells treated with different concentrations of acrolein?
-
-For an extra challenge, try also making your faceted plot in the style of option #1 above, with different symbols, letters, or group names above columns to indicate which group that column is significantly different from.
-:::
-
-# 4.6 Advanced Multi-Group Comparisons
-
-This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-In the previous module, we covered how to apply multi-group statistical testing, in which we tested for significant differences in endpoints across different values for one independent variable. In this module, we will build on the concepts introduced previously to test for significant differences in endpoints while considering two or more independent variables. We will review relevant statistical approaches and demonstrate how to apply these tests using the same example dataset as in previous modules in this chapter. As a reminder, this dataset includes concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Are there significant differences in inflammatory biomarker concentrations between sex and different doses of acrolein?
-
-2. Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?
-
-### Workspace Preparation and Data Import
-
-Here, we will import the processed data that we generated at the end of **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics** (introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**), along with the associated demographic data. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load packages that will be needed for the analysis, including previously introduced packages such as *openxlsx*, *tidyverse*, *DT*, *ggpubr*, and *rstatix*.
-
-#### Cleaning the global environment
-```{r 04-Chapter4-121, echo=TRUE, eval=FALSE}
-rm(list=ls())
-```
-
-#### Loading R packages required for this session
-```{r 04-Chapter4-122, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
-library(openxlsx)
-library(tidyverse)
-library(DT)
-library(rstatix)
-library(ggpubr)
-library(multcomp)
-library(pander)
-
-theme_set(theme_bw()) # Set graphing theme
-```
-
-#### Set your working directory
-```{r 04-Chapter4-123, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-```{r 04-Chapter4-124, echo=TRUE, eval=TRUE}
-biomarker_data <- read.xlsx("Chapter_4/Module4_6_Input/Module4_6_InputData1.xlsx")
-demographic_data <- read.xlsx("Chapter_4/Module4_6_Input/Module4_6_InputData2.xlsx")
-
-# View data
-datatable(biomarker_data)
-datatable(demographic_data)
-```
-
-## Advanced Multi-Group Comparisons
-
-### Two-way ANOVA
-The first test that we'll introduce is a **two-way ANOVA**. This test involves testing for mean differences in a continuous dependent variable across two categorical independent variables. (As a refresher, a one-way ANOVA uses a single independent variable to compare mean differences between groups.) Subjects or samples can be matched based upon their between-group factors (e.g., exposure duration) and/or their within-group factors (e.g., batch effects). Models that include both between-group and within-group factors are known as **mixed two-way ANOVAs**.
-
-Like other parametric tests, two-way ANOVAs assume:
-
-+ Homogeneity of variance
-+ Independent observations
-+ Normal distribution
-
-
-### ANCOVA
-
-An **Analysis of Covariance (ANCOVA)** tests for mean differences in a continuous dependent variable across at least one categorical independent variable. It also includes another variable, known as a covariate, that needs to be controlled or adjusted for to more accurately capture the relationship between the independent and dependent variables. Potential covariates can include between-group factors like exposure duration and/or within-group factors like batch effects or sex. Note that if the dataset has a smaller sample size, stratifying the dataset by the covariate is another option for examining its effects, rather than adjusting for it using an ANCOVA.
-
-ANCOVAs have the same assumptions as those listed above for two-way ANOVAs.
-
-
-**Note**: It is possible to run *two-way ANCOVA* models, where the model contains two independent variables and at least one covariate to be adjusted for.
-
-
-
-## Two-way ANOVA Example
-
-Our first environmental health question can be answered using a two-way ANOVA. We can test three different null hypotheses using this test:
-
-1. There is no difference in average biomarker concentrations based on sex.
-2. There is no difference in average biomarker concentrations based on dose.
-3. The effect of sex on average biomarker concentration does not depend on the effect of dose and vice versa.
-
-
-
-The first step would be to check that the assumptions (independence, homogeneity of variance, and normal distribution) have been met, but this was done previously in **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations**.
-
-To run our two-way ANOVA, we will use the `anova_test()` function from the *rstatix* package. This function allows us to define subject identifiers for matching, between-subject factors (such as sex, which differ between subjects), and within-subject factors (such as dose, which are measured repeatedly within each subject). Since we have both between- and within-subject factors, we will specifically be running a two-way mixed ANOVA.
-
-First, we need to add our demographic data to our biomarker data so that these variables can be incorporated into the analysis. Also, we need to convert `Dose` into a factor to specify the levels.
-```{r 04-Chapter4-125}
-biomarker_data <- biomarker_data %>%
- left_join(demographic_data, by = "Donor") %>%
- mutate(Dose = factor(Dose, levels = c("0", "0.6", "1", "2", "4")))
-
-# viewing data
-datatable(biomarker_data)
-```
-
-Then, we can demonstrate how to run the two-way ANOVA and what the results look like by running the test on just one of our variables (IL-1$\beta$).
-```{r 04-Chapter4-126}
-get_anova_table(anova_test(data = biomarker_data,
- dv = IL1B,
- wid = Donor,
- between = Sex,
- within = Dose))
-```
-The column names are described below:
-
-+ `Effect`: the name of the variable tested
-+ `DFn`: degrees of freedom in the numerator
-+ `DFd`: degrees of freedom in the denominator
-+ `F`: the F statistic
-+ `p`: p-value
-+ `p<.05`: denotes whether the p-value is significant
-+ `ges`: generalized effect size
-
-Based on the table above, there are significant differences in IL-1$\beta$ concentrations based on dose (p-value = 0.02). There are no significant differences in IL-1$\beta$ between the sexes, nor is there a significant interaction between sex and dose.
-
-Similar to previous modules, we now want to apply our two-way ANOVA to each of our variables of interest. To do this, we can use a for loop that will:
-
-1. Loop through each column in the data and apply the test to each column.
-2. Pull out statistics we are interested in (for example, p-value) and bind the results from each column together into a results dataframe.
-```{r 04-Chapter4-127}
-# Create a vector with the names of the variables you want to run the test on
-endpoints <- colnames(biomarker_data %>% dplyr::select(IL1B:VEGF))
-
-# Create data frame to store results
-twoway_aov_res <- data.frame(Factor = c("Dose", "Sex", "Sex:Dose"))
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable
- endpoint <- endpoints[i]
-
- # Run two-way mixed ANOVA and store results in res_aov
- res_aov <- anova_test(data = biomarker_data,
- dv = paste0(endpoint),
- wid = Donor,
- between = Sex,
- within = Dose)
-
- # Extract the results we are interested in (from the ANOVA table)
- res_df <- data.frame(get_anova_table(res_aov)) %>%
- dplyr::select(c(Effect, p)) %>%
- rename("Factor" = "Effect")
-
- # Rename columns in the results dataframe so that the output is more nicely formatted
- names(res_df)[names(res_df) == 'p'] <- endpoint
-
- # Bind the results to the results dataframe
- twoway_aov_res <- merge(twoway_aov_res, res_df, by = "Factor", all.y = TRUE)
-}
-
-# View results
-datatable(twoway_aov_res)
-```
-
-An appropriate title for this table could be:
-
-“**Table X. Statistical test results for differences in cytokine concentrations.** A two-way ANOVA was performed using sex and dose as independent variables to test for statistical differences in concentration across 6 cytokines.”
-
-From this table, we can see that dose is the only factor with significant differences in concentrations across all 6 biomarkers (p-value < 0.05).
-
-Although we know that dose has significant differences overall, an ANOVA doesn't tell us which doses of acrolein differ from each other or the directionality of each biomarker's change in concentration at each dose. Therefore, we need to use a post-hoc test. One common post-hoc test following a one-way or two-way ANOVA is Tukey's HSD; however, there is no way to pass the output of the `anova_test()` function to the `TukeyHSD()` function. A good alternative is a pairwise t-test with a Bonferroni correction. Note that our data are paired in that there are repeated measures (doses) on each subject.
-```{r 04-Chapter4-128}
-# Create data frame to store results
-twoway_aov_pairedt <- data.frame(Comparison = c("0_0.6", "0_1", "0_2", "0_4", "0.6_1", "0.6_2", "0.6_4", "1_2", "1_4", "2_4"))
-
-# Run for loop
-for (i in 1:length(endpoints)) {
-
- # Assign a name to the endpoint variable.
- endpoint <- endpoints[i]
-
- # Run pairwise t-tests
- res_df <- biomarker_data %>%
- pairwise_t_test(as.formula(paste0(endpoint, " ~ Dose")),
- paired = TRUE,
- p.adjust.method = "bonferroni") %>%
- unite(Comparison, group1, group2, sep = "_", remove = FALSE) %>%
- dplyr::select(Comparison, p.adj)
-
- # Rename columns in the results data frame so that the output is more nicely formatted.
- names(res_df)[names(res_df) == 'p.adj'] <- endpoint
-
- # Bind the results to the results data frame.
- twoway_aov_pairedt <- merge(twoway_aov_pairedt, res_df, by = "Comparison", all.y = TRUE)
-}
-
-# View results
-datatable(twoway_aov_pairedt)
-```
-
-An appropriate title for this table could be:
-
-“**Table X. Post-hoc testing for differences in cytokine concentrations.** Paired t-tests were run as a post-hoc test using dose as an independent variable to test for statistical differences in concentration across 6 cytokines.”
-
-Note that this table and the two-way ANOVA table would likely be placed in the supplemental material of a publication. Before including them, it would be best to clean them up (make the two comparison groups clearer and round all results to the same number of decimal places), as demonstrated in **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**.
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: Are there significant differences in inflammatory biomarker concentrations between sex and different doses of acrolein?
-:::
-
-:::answer
-**Answer**: Based on the two-way ANOVA and post-hoc t-tests, there are only significant differences in cytokine concentrations based on dose (p adj < 0.05). All biomarkers, with the exception of IL-6, had at least 1 significantly different concentration when comparing doses.
-:::
-
-### Visualizing Two-Way ANOVA Results
-
-Since our overall p-values associated with dose were significant for a number of mediators, we will proceed with creating our final figures with our endpoints by dose, showing the overall two-way ANOVA p-value and the pairwise comparisons from our post hoc paired t-tests.
-
-To facilitate plotting in a faceted panel, we'll first pivot our `biomarker_data` dataframe longer.
-```{r 04-Chapter4-129}
-biomarker_data_long <- biomarker_data %>%
- dplyr::select(-c(Age_yr, Sex)) %>%
- pivot_longer(-c(Donor, Dose), names_to = "Variable", values_to = "Value")
-
-datatable(biomarker_data_long)
-```
-
-Then, we will create an annotation dataframe for adding our overall two-way ANOVA p-values. This dataframe needs to contain a column for our variables (to match with our variable column in our `biomarker_data_long` dataframe) and the p-value for annotation. We can extract these from our `two_way_aov_res` dataframe generated above.
-```{r 04-Chapter4-130}
-overall_dose_pvals <- twoway_aov_res %>%
- # Transpose dataframe
- column_to_rownames("Factor") %>%
- t() %>% data.frame() %>%
- rownames_to_column("Variable") %>%
- # Keep only the dose results and rename them to p-value
- dplyr::select(c(Variable, Dose)) %>%
- rename(`P Value` = Dose)
-
-datatable(overall_dose_pvals)
-```
-
-We now have our p-values for each biomarker. Next, we'll make a column where our p-values are formatted with "p = " for annotation on the graph.
-```{r 04-Chapter4-131}
-overall_dose_pvals <- overall_dose_pvals %>%
- mutate(`P Value` = formatC(`P Value`, format = "e", digits = 2),
- label = paste("p = ", `P Value`, sep = ""))
-
-datatable(overall_dose_pvals)
-```
-
-Finally, we'll add a column indicating where to add the labels on the y-axis. This will be different for each variable because each variable is on a different scale. We can approach this by computing the maximum value of each variable, then increasing that by 10% to add some space on top of the points.
-```{r 04-Chapter4-132}
-sig_labs_y <- biomarker_data %>%
- summarise(across(IL1B:VEGF, \(x) max(x))) %>%
- t() %>% as.data.frame() %>%
- rownames_to_column("Variable") %>%
- rename("y_pos" = "V1") %>%
- # moving the significance asterisks higher on the y axis
- mutate(y_pos = y_pos * 1.1)
-
-sig_labs_y
-
-
-overall_dose_pvals <- overall_dose_pvals %>%
- left_join(sig_labs_y, by = "Variable")
-
-datatable(overall_dose_pvals)
-```
-
-Now, we'll use the `biomarker_data` dataframe to plot our individual points and boxplots (similar to the plotting demonstrated in previous TAME Chapter 4 modules) and our `overall_dose_pvals` dataframe to add our p value annotation.
-```{r 04-Chapter4-133, fig.width = 12, fig.height = 6, fig.align='center'}
-# Create clean labels for the graph titles
-new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
- "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
-
-# Make graph
-ggplot(biomarker_data_long, aes(x = Dose, y = Value)) +
- # outlier.shape = NA removes outliers
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- geom_jitter(size = 1.5, position = position_jitter(0.15), alpha = 0.7) +
- # Add label
- geom_text(data = overall_dose_pvals, aes(x = 1.3, y = y_pos, label = label),
- size = 5) +
- # Adding padding y axis
- scale_y_continuous(expand = expansion(mult = c(0.1, 0.1))) +
-
- # Faceting by each biomarker
- facet_wrap(~ Variable, nrow = 2, scales = "free_y", labeller = labeller(Variable = new_labels)) +
-
- theme(legend.position = "none", # Removing legend
- axis.title = element_text(face = "bold", size = rel(1.3)),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 10),
- strip.text = element_text(size = 12, face = "bold")) +
-
- # Changing axes labels
- labs(x = "Acrolein (ppm)", y = expression(bold(Log[2]*"(Concentration (pg/ml))")))
-```
-
-
-It's a bit more difficult to add the pairwise t test results to the boxplots comparing each treatment group to each other as was done similarly in **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**, so that addition to the figure was omitted here.
-
-
-
-## ANCOVA Example
-
-In the following ANCOVA example, we'll still investigate potential differences in cytokine concentrations as result of varying doses of acrolein. However, this time we'll adjust for sex and age to answer our second environmental health question: **Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?**.
-
-Let's first demonstrate how to run an ANCOVA and what the results look like by running the test on just one of our variables (IL-1$\beta$). The `Anova()` function was specifically designed to run Type II or III ANOVA tests, which take different approaches to dealing with interaction terms and unbalanced datasets. For more information on Type I, II, and III ANOVA tests, check out [Anova – Type I/II/III SS explained](https://md.psych.bio.uni-goettingen.de/mv/unit/lm_cat/lm_cat_unbal_ss_explained.html). For the purposes of this example, just know that there isn't much of a difference between the Type I, II, and III results.
-```{r 04-Chapter4-134}
-anova_test = aov(IL1B ~ Dose + Sex + Age_yr, data = biomarker_data)
-type3_anova = Anova(anova_test, type = 'III')
-type3_anova
-```
-Based on the table above, there are significant differences in IL-1$\beta$ concentrations in dose after adjusting for sex and age (p-value = 0.009).
-
-Now we'll run ANCOVA tests across all of our biomarkers.
-```{r 04-Chapter4-135}
-# Create data frame to store results
-ancova_res = data.frame()
-
-# Add row names to data frame so that it will be able to add ANCOVA results
-rownames <- c("(Intercept)", "Dose", "Sex", "Age_yr")
-ancova_res <- data.frame(cbind(rownames))
-
-# Assign row names
-ancova_res <- data.frame(ancova_res[, -1], row.names = ancova_res$rownames)
-
-# Perform ANCOVA over all columns
-for (i in 3:8) {
-
- fit = aov(as.formula(paste0(names(biomarker_data)[i], " ~ Dose + Sex + Age_yr")),
- biomarker_data)
- res <- data.frame(car::Anova(fit, type = "III"))
- res <- subset(res, select = Pr..F.)
- names(res)[names(res) == 'Pr..F.'] <- noquote(paste0(names(biomarker_data[i])))
- ancova_res <- transform(merge(ancova_res, res, by = 0), row.names = Row.names, Row.names = NULL)
-
-}
-
-# Transpose for easy viewing, keep columns of interest, and apply BH adjustment
-ancova_res <- data.frame(t(ancova_res)) %>%
- dplyr::select(Dose) %>%
- mutate(across(everything(), \(x) format(p.adjust(x, "BH"), scientific = TRUE)))
-
-# View results
-datatable(ancova_res)
-```
-
-Looking at the table above, there are statistically significant differences in all cytokine concentrations, with the exception of IL-6, based on dose (p adj < 0.05). To determine which doses were significantly different from one another, we'll need to run Tukey's post hoc tests.
-```{r 04-Chapter4-136}
-# Create results data frame with a column showing the comparisons (extracted from single run vs for loop)
-tukey_res <- data.frame(Comparison = c("0.6 - 0", "1 - 0", "2 - 0", "4 - 0", "1 - 0.6", "2 - 0.6",
-"4 - 0.6", "2 - 1", "4 - 1", "4 - 2"))
-
-# Perform Tukey's test
-for (i in 3:8) {
-
- # need to run ANCOVA first
- fit = aov(as.formula(paste0(names(biomarker_data)[i], " ~ Dose + Sex + Age_yr")),
- biomarker_data)
-
- # Tukey's
- posthoc <- summary(glht(fit, linfct = mcp(Dose = "Tukey")), test = adjusted("BH"))
- res <- summary(posthoc)$test
-
- # Formatting the df with the Tukey's values
- res_df <- data.frame(cbind(res$coefficients, res$sigma, res$tstat, res$pvalues))
- colnames(res_df) <- c("Estimate", "Std.Error", "t.value", "Pr(>|t|)")
- res_df <- round(res_df[4], 4)
- names(res_df)[names(res_df) == 'Pr(>|t|)'] <- noquote(paste0(names(biomarker_data[i])))
- res_df <- res_df %>% rownames_to_column("Comparison")
-
- tukey_res <- left_join(tukey_res, res_df, by = "Comparison")
-}
-
-datatable(tukey_res)
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?
-:::
-
-:::answer
-**Answer**: Based on the ANCOVA tests, there are significant differences resulting from various doses of acrolein (p adj < 0.05) across all cytokine concentrations with the exception of IL-6. All biomarkers, with the exception of IL-6, had at least 1 significantly different biomarker concentration when comparing doses.
-:::
-
-### Visualizing ANCOVA Results
-
-Before graphing these results, we first need to think about which ones we want to display. For simplicity's sake, we will demonstrate graphing only comparisons that are with the control ("0") group and that are significant. To do this, we'll:
-
-1. Separate our `Comparison` column into a `group1` and `group2` column.
-2. Filter to comparisons including only the 0 group.
-3. Pivot the dataframe longer, to match the format of our data used as input for facet plotting.
-4. Filter to only p-values that are less than 0.05.
-```{r 04-Chapter4-137}
-tukey_res_forgraph <- tukey_res %>%
- separate(Comparison, into = c("group1", "group2"), sep = " - ") %>%
- filter(group2 == "0") %>%
- dplyr::select(-group2) %>%
- pivot_longer(!group1, names_to = "Variable", values_to = "P Value") %>%
- filter(`P Value` < 0.05) %>%
- # rounding the p values to 4 digits for readability
- mutate(`P Value` = round(`P Value`, 4))
-
-datatable(tukey_res_forgraph)
-```
-
-Next, we can take a few steps to add columns to the dataframe that will aid in graphing:
-
-1. Add a column for significance stars.
-2. Add a column to indicate the y position for the significance annotation (similar to the above example with the two-way ANOVA).
-```{r 04-Chapter4-138}
-# Add column for significance stars
-tukey_res_forgraph <- tukey_res_forgraph %>%
- mutate(p.signif = ifelse(`P Value` < 0.0001, "****",
- ifelse(`P Value` < 0.001, "***",
- ifelse(`P Value` < 0.01, "**",
- ifelse(`P Value` < 0.05, "*", NA)))))
-
-# Calculate y positions to plot significance stars
-sig_labs_y_tukey <- biomarker_data %>%
- summarise(across(IL1B:VEGF, \(x) max(x))) %>%
- t() %>% as.data.frame() %>%
- rownames_to_column("Variable") %>%
- rename("y_pos" = "V1") %>%
- mutate(y_pos = y_pos * 1.15)
-
-sig_labs_y_tukey
-
-# Join y positions to tukey_res
-tukey_res_forgraph <- tukey_res_forgraph %>%
- left_join(sig_labs_y_tukey, by = "Variable") %>%
- rename("Dose" = "group1")
-
-datatable(tukey_res_forgraph)
-```
-
-We also need to prepare our overall p-values from our ANCOVA for display:
-```{r 04-Chapter4-139}
-ancova_res_forgraphing <- ancova_res %>%
- rename(`P Value` = Dose) %>%
- rownames_to_column("Variable") %>%
- left_join(sig_labs_y, by = "Variable") %>%
- mutate(`P Value` = formatC(as.numeric(`P Value`), format = "e", digits = 2),
- label = paste("p = ", `P Value`, sep = ""))
-
-```
-
-Now, we are ready to make our graph! We will use similar code to the above, this time adding in our significance stars over specific columns.
-```{r 04-Chapter4-140, fig.width = 12, fig.height = 7, fig.align='center'}
-# Make graph
-ggplot(biomarker_data_long, aes(x = Dose, y = Value)) +
- # outlier.shape = NA removes outliers
- geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
- # Changing box plot colors
- scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
- geom_jitter(size = 1.5, position = position_jitter(0.15), alpha = 0.7) +
- # Add overall ANCOVA label
- geom_text(data = ancova_res_forgraphing, aes(x = 1.3, y = y_pos * 1.15, label = label), size = 5) +
- # Add tukey annotation
- geom_text(data = tukey_res_forgraph, aes(x = Dose, y = y_pos, label = p.signif), size = 10, hjust = 0.5) +
-
- # Faceting by each biomarker
- facet_wrap(~ Variable, nrow = 2, scales = "free_y", labeller = labeller(Variable = new_labels)) +
- # Removing legend
- theme(legend.position = "none",
- axis.title = element_text(face = "bold", size = rel(1.5)),
- axis.title.x = element_text(vjust = -0.75),
- axis.title.y = element_text(vjust = 2),
- axis.text = element_text(color = "black", size = 10),
- strip.text = element_text(size = 12, face = "bold")) +
-
- # Changing axes labels
- labs(x = "Acrolein (ppm)", y = expression(bold(Log[2]*"(Concentration (pg/ml))")))
-```
-An appropriate title for this figure could be:
-
-“**Figure X. Acrolein exposure increases secretion of most inflammatory cytokines in primary human bronchial epithelial cells.** Overall p-values from ANCOVA tests adjusting for age and sex are in the left-hand corner of each panel. Tukey's post hoc tests were subsequently run, and significant Benjamini-Hochberg adjusted p-values, for comparisons to the control (0 ppm) dose only, are denoted with asterisks. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001, *n* = 16 per group.”
-
-
-
-## Concluding Remarks
-In this module, we introduced advanced multi-group comparisons using two-way ANOVA and ANCOVA tests. These overall effect tests along with post-hoc testing were used on an example dataset to provide a basis for publication-ready tables and figures to present these results. This training module provides code and text for advanced multi-group comparisons necessary to answer more complex research questions.
-
-
-
-
-### Additional Resources
- + [Two-Way ANOVA](https://www.scribbr.com/statistics/two-way-anova/)
- + [Repeated Measure ANOVA in R](https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/)
- + [ANCOVA Example](https://ibecav.github.io/ancova_example/)
- + [Nonparametric ANOVA RDocumentation](https://cran.r-project.org/web/packages/fANCOVA/fANCOVA.pdf)
- + [Nonparametric ANCOVA RDocumentation](https://www.rdocumentation.org/packages/sm/versions/2.2-6.0/topics/sm.ancova)
-
-
-
-
-
-:::tyk
-Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two non-normally distributed.
-Using the data found in “Module4_5_TYKInput.xlsx”, answer the following research question: Are there significant differences in functional endpoints based on doses of acrolein and sex after adjusting for age? To streamline the analysis, we'll only include doses of acrolein at 0, 1, and 4ppm.
-
-**Hint**: You'll need to run a two-way ANCOVA. Given that some of the assumptions for parametric tests (i.e., normality and homogeneity of variance) are not met, and that the dataset is on the smaller side, we likely wouldn't run a parametric test in practice. However, we'll do so here just to illustrate an example of how to run a two-way ANCOVA.
-:::
diff --git a/Chapter_4/4_1_Experimental_Design/4_1_Experimental_Design.Rmd b/Chapter_4/4_1_Experimental_Design/4_1_Experimental_Design.Rmd
new file mode 100644
index 0000000..8b79dc5
--- /dev/null
+++ b/Chapter_4/4_1_Experimental_Design/4_1_Experimental_Design.Rmd
@@ -0,0 +1,115 @@
+# (PART\*) Chapter 4 Converting Wet Lab Data into Dry Lab Analyses {-}
+
+# 4.1 Overview of Experimental Design and Example Data
+
+This training module was developed by Elise Hickman, Sarah Miller, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Converting wet lab experimentation data into dry lab analyses facilitates reproducibility and transparency in data analysis. This is helpful for consistency across members of the same research group, review of analyses by collaborators or reviewers, and implementation of similar future analyses. In comparison with analysis workflows that use subscription- or license-based applications, such as Prism or SAS, analysis workflows that leverage open-source programming languages such as R also increase accessibility of analyses. Additionally, scripted analyses minimize the risk for copy-paste error, which can occur when cleaning experimental data, transferring it to an analysis application, and exporting and formatting analysis results.
+
+Some of the barriers to converting wet lab experimentation into dry lab analyses include data cleaning, selection and implementation of appropriate statistical tests, and reporting results. This chapter will provide introductory material guiding wet-bench scientists in R analyses, bridging the gap between commonly available R tutorials (which, while helpful, may not provide a sufficient level of detail or relevant examples) and intensive data science workflows (which may be too detailed).
+
+In this module, we will provide an overview of key experimental design features and terms that will be used throughout this chapter, and we will provide a detailed overview of the example data. In the subsequent modules, we will dive into analyzing the example data.
+
+## Replicates
+
+One of the most important components of selecting an appropriate analysis is first understanding how data should be compared between samples, which often means addressing experimental replicates. There are two main types of replicates that are used in environmental health research: biological replicates and technical replicates.
+
+### Biological Replicates
+
+Biological replicates are the preferred unit of statistical comparison because they represent biologically distinct samples, demonstrating biological variation in the system. What is considered to be a biological replicate can depend on what model system is being used. For example, in studies with human clinical samples or cells from different human donors, the different humans are considered the biological replicates. In studies using animals as model organisms, individual animals are typically considered biological replicates, although this can vary depending on the experimental design. In studies that use cell lines, which are derived from one human or animal and are modified to continuously grow in culture, a biological replicate could be either: (1) cells from different passages (different thawed aliquots) grown in completely separate flasks and experimented on during the same day, or (2) an experiment repeated on the same set of cells (one thawed aliquot) on separate experimental days, so that the cells have grown/replicated between experiments.
+
+The final "N" that you report should reflect your biological replicates, or independent experiments. What constitutes an independent experiment or biological replicate is highly field-, lab-, organism-, and endpoint-dependent, so make sure to discuss this within your research group in the experiment planning phase and again before your analysis begins. No matter what you choose, ensure that when you report your results, you are transparent about what your biological replicates are. For example, the below diagram (adapted from [BitesizeBio](https://bitesizebio.com/47982/n-number-cell-lines/)) illustrates different ways of defining replicates in experiments with cell lines:
+
+```{r 4-1-Experimental-Design-1, echo = FALSE, fig.align = "center", out.width = "650px" }
+knitr::include_graphics("Chapter_4/4_1_Experimental_Design/Module4_1_Image1.png")
+```
+
+N = 3 cells could be considered technical replicates if the endpoint of interest is very low throughput, such as single cell imaging or analyses. N = 3 cell culture wells is a more common approach to technical replicates and is typically used when one sample is collected from each well, such as in the case of media or cell lysate collection. Note that each well within the Week 1 biological replicate would be considered a technical replicate for Week 1's experiment. Similarly, each well within the Week 2 biological replicate would be considered a technical replicate for Week 2's experiment. For more on technical replicates, see the next section.
+
+Although N = 3 cell lines is a less common approach to biological replicates, some argue for this approach because each cell line is typically derived from one biological source. In this scenario, each of the cell lines would be unique but would represent the same cell type or lineage (e.g., for respiratory epithelium, A549, 16HBE, and BEAS-2B cell lines).
+
+Also note that to perform statistical analyses, an N of at least 3 biological replicates is needed, and an even higher N may be needed for a sufficiently powered study. Although power calculations are outside the scope of this module, we encourage you to use power calculation resources, such as [G*Power](https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html) to assist in selecting an appropriate N for your study.
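+
+As a rough sketch of what such a calculation involves, base R's `power.t.test()` can estimate the N per group needed for a simple two-sample comparison (the effect size and standard deviation below are illustrative placeholders, not values from this study):
+```{r}
+# Estimate the N per group needed to detect a difference of 1 unit between
+# two groups with a standard deviation of 1.2, at 80% power and alpha = 0.05
+power.t.test(delta = 1, sd = 1.2, sig.level = 0.05, power = 0.8)
+```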
+
+
+### Technical Replicates
+
+Technical replicates are repeated measurements on the same sample or biological source, demonstrating the variation underlying protocols, equipment, and sample handling. In environmental health research, there can be technical replicates separately related to either the experimental design or the downstream analyses. Technical replicates related to experimental design refer to the chemical exposure for cell-based (*in vitro*) experiments, where there may be multiple wells of cells from the same passage or human/mouse exposed to the same treatment. Technical replicates related to downstream analyses refer to the endpoints that are measured after chemical exposure in each sample. To illustrate this, consider an experiment where cells from four unique human donors (D1-D4) are grown in cell culture plates, and then three wells of cells from each donor are exposed to a chemical treatment (Tx) or a vehicle control (Ctrl). The plate layout might look something like this, with technical replicates related to experimental design, i.e. chemical exposure, in the same color:
+
+```{r 4-1-Experimental-Design-2, echo = FALSE, fig.align = "center", out.width = "500px" }
+knitr::include_graphics("Chapter_4/4_1_Experimental_Design/Module4_1_Image2.png")
+```
+
+For this experiment, we have four biological replicates (the four donors) and three technical exposure replicates per dose (because three wells from each donor were exposed to each condition). The technical replicates here capture potential unintended variation between wells in cell growth and chemical exposure.
+
+Following the exposure of the cells to a chemical of interest, the media is collected from each well and assayed using a plate reader assay for concentrations of a marker of inflammation. For each sample collected (from each well), there are three technical replicates used to measure the concentration of the inflammatory marker. The purpose of these technical replicates is to capture potential unintended well-to-well variation in the plate reader assay. The plate layout might look something like this, ***with the letter and number in each well of the plate layout representing the well in the exposure plate layout that the media sample being assayed came from***:
+
+```{r 4-1-Experimental-Design-3, echo = FALSE, fig.align = "center", out.width = "800px" }
+knitr::include_graphics("Chapter_4/4_1_Experimental_Design/Module4_1_Image3.png")
+```
+
+
+Technical replicates should typically be averaged before performing any statistical analysis. For the experiment described above, we would:
+
+1. Average the technical replicates for the plate reader assay to obtain one value per original cell culture well for inflammatory marker concentration.
+
+2. Then, average the technical replicates for the chemical exposure to obtain one value per biological replicate (donor).
+
+This would result in a dataset with eight values (four control and four treatment) for statistical analysis.
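+
+As a sketch of that two-step averaging in R (the data frame and column names here are mock illustrations, not the chapter's actual dataset), `group_by()` and `summarise()` from the *dplyr* package handle each step:
+```{r}
+library(dplyr)
+
+# Mock assay data: 4 donors x 2 conditions x 3 exposure wells x 3 assay replicates
+assay_data <- expand.grid(Donor = paste0("D", 1:4),
+                          Treatment = c("Ctrl", "Tx"),
+                          Well = 1:3,
+                          AssayRep = 1:3)
+assay_data$Concentration <- rnorm(nrow(assay_data), mean = 50, sd = 5)
+
+final_data <- assay_data %>%
+  # Step 1: average assay technical replicates -> one value per well
+  group_by(Donor, Treatment, Well) %>%
+  summarise(Concentration = mean(Concentration), .groups = "drop") %>%
+  # Step 2: average exposure technical replicates -> one value per donor
+  group_by(Donor, Treatment) %>%
+  summarise(Concentration = mean(Concentration), .groups = "drop")
+
+# Eight values remain: one per donor per condition
+final_data
+```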
+
+#### Number and inclusion of technical replicates
+
+The above example is just one approach to experimental design. As mentioned above in the biological replicates section, selection of appropriate biological and technical replicates can vary greatly depending on your model organism, experimental design, assay, and standards in the field. For example, there may be cases where well-to-well variation for certain assays is minimal compared with variation between biological replicates, or when including technical replicates for each donor is experimentally or financially unfeasible, resulting in a lack of technical replicates.
+
+### Matched Experimental Design
+
+Matching (also known as paired or repeated measures) in an experimental design is also a very important concept when selecting the appropriate statistical analysis. In experiments with matched design, multiple measurements are collected from the same biological replicate. This typically provides increased statistical power because changes are observed within each biological replicate relative to its starting point. In environmental health research, this can include study designs such as:
+
+1. Samples were collected from the same individuals, animals, or cell culture wells pre- and post-exposure.
+
+2. Cells from the same biological replicate were exposed to different doses of a chemical.
+
+The experimental design described above represents a matched design because cells from the same donor are exposed to both the treatment and the vehicle control.
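+
+Matching matters when it comes time for statistical analysis, because paired tests compare measurements within each biological replicate. As a minimal sketch (with made-up values), the same numbers can give different results depending on whether the pairing is respected:
+```{r}
+# Mock log2 concentrations for 4 donors under control and treatment,
+# where each donor shifts upward by a similar amount
+ctrl <- c(5.1, 6.8, 4.2, 7.5)
+tx   <- c(5.9, 7.5, 5.0, 8.4)
+
+# Paired t-test: tests the within-donor differences (tx - ctrl)
+t.test(tx, ctrl, paired = TRUE)
+
+# Unpaired t-test: ignores the matching, so donor-to-donor variability
+# inflates the standard error
+t.test(tx, ctrl, paired = FALSE)
+```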
+
+## Orientation to Example Data for Chapter 4
+
+In this chapter, we will be using an example dataset derived from an *in vitro*, or cell culture, experiment. Before diving into analysis of these data in the subsequent modules, we will provide an overview of where these data came from and preview what the input data frames look like.
+
+### Experimental Design
+
+In this experiment, primary human bronchial epithelial cells (HBECs) from sixteen different donors were exposed to the gas acrolein, which is emitted from the combustion of fossil fuels, tobacco, wood, and plastic. Inhalation exposure to acrolein is associated with airway inflammation, and this study aimed to understand how exposure to acrolein changes secretion of markers of inflammation. Prior to experimentation, the HBECs were grown on a permeable membrane support for 24 days with air on one side and liquid media on the other side, allowing them to differentiate into a form that is very similar to what is found in the human body. The cells were then exposed for 2 hours to 0 (filtered air), 0.6, 1, 2, or 4 ppm acrolein, with two technical replicate wells from each donor per dose. Twenty-four hours later, the media was collected, and concentrations of inflammatory markers were measured using an [enzyme-linked immunosorbent assay (ELISA)](https://www.thermofisher.com/us/en/home/life-science/protein-biology/protein-biology-learning-center/protein-biology-resource-library/pierce-protein-methods/overview-elisa.html).
+
+```{r 4-1-Experimental-Design-4, echo = FALSE, fig.align = "center", out.width = "900px" }
+knitr::include_graphics("Chapter_4/4_1_Experimental_Design/Module4_1_Image4.png")
+```
+
+Note that this is a matched experimental design because cells from every donor were exposed to every concentration of acrolein, rather than cells from different donors being exposed to each of the different doses.
+
+### Starting Data
+
+Next, let's familiarize ourselves with the data that resulted from this experiment. There are two input data files, one that contains cytokine concentration data and one that contains demographic information about the donors:
+
+```{r 4-1-Experimental-Design-5, echo = FALSE, fig.align = "center", out.width = "900px" }
+knitr::include_graphics("Chapter_4/4_1_Experimental_Design/Module4_1_Image5.png")
+```
+
+The cytokine data contains information about the cytokine measurements for each of the six proteins measured in the basolateral media for each sample (units = pg/mL), which can be identified by the donor, dose, and replicate columns. The demographic data contains information about the age and sex of each donor. In the subsequent modules, we'll be using these data to assess whether exposure to acrolein significantly changes secretion of inflammatory markers and whether donor characteristics, such as sex and age, modify these responses.
+
+## Concluding Remarks
+
+This module reviewed important components of experimental design, such as replicates and matching, which are critical for data pre-processing and selecting appropriate statistical tests.
+
+
+
+:::tyk
+Read the following experimental design descriptions. For each description, determine the number of biological replicates (per group), the number of technical replicates, and whether the experimental design is matched.
+
+1. One hundred participants are recruited to a study aiming to determine whether people who use e-cigarettes have different concentrations of inflammatory markers in their airways. Fifty participants are non e-cigarette users and 50 participants are e-cigarette users. After the airway samples are collected, each sample is analyzed with an ELISA, with three measurements taken per sample.
+
+2. Twenty mice are used in a study aiming to understand the effects of particulate matter on cardiovascular health. The mice are randomized such that half of the mice are exposed to filtered air and half are exposed to particulate matter. During the exposures, the mice are continuously monitored for endpoints such as heart rate and heart function. One month later, the mice that were exposed to particulate matter are exposed to filtered air, and the mice that were exposed to filtered air are exposed to particulate matter, with the same cardiovascular endpoints collected.
+:::
diff --git a/Chapter_4/Module4_1_Input/Module4_1_Image1.png b/Chapter_4/4_1_Experimental_Design/Module4_1_Image1.png
similarity index 100%
rename from Chapter_4/Module4_1_Input/Module4_1_Image1.png
rename to Chapter_4/4_1_Experimental_Design/Module4_1_Image1.png
diff --git a/Chapter_4/Module4_1_Input/Module4_1_Image2.png b/Chapter_4/4_1_Experimental_Design/Module4_1_Image2.png
similarity index 100%
rename from Chapter_4/Module4_1_Input/Module4_1_Image2.png
rename to Chapter_4/4_1_Experimental_Design/Module4_1_Image2.png
diff --git a/Chapter_4/Module4_1_Input/Module4_1_Image3.png b/Chapter_4/4_1_Experimental_Design/Module4_1_Image3.png
similarity index 100%
rename from Chapter_4/Module4_1_Input/Module4_1_Image3.png
rename to Chapter_4/4_1_Experimental_Design/Module4_1_Image3.png
diff --git a/Chapter_4/Module4_1_Input/Module4_1_Image4.png b/Chapter_4/4_1_Experimental_Design/Module4_1_Image4.png
similarity index 100%
rename from Chapter_4/Module4_1_Input/Module4_1_Image4.png
rename to Chapter_4/4_1_Experimental_Design/Module4_1_Image4.png
diff --git a/Chapter_4/Module4_1_Input/Module4_1_Image5.png b/Chapter_4/4_1_Experimental_Design/Module4_1_Image5.png
similarity index 100%
rename from Chapter_4/Module4_1_Input/Module4_1_Image5.png
rename to Chapter_4/4_1_Experimental_Design/Module4_1_Image5.png
diff --git a/Chapter_4/4_2_Data_Import/4_2_Data_Import.Rmd b/Chapter_4/4_2_Data_Import/4_2_Data_Import.Rmd
new file mode 100644
index 0000000..b62930a
--- /dev/null
+++ b/Chapter_4/4_2_Data_Import/4_2_Data_Import.Rmd
@@ -0,0 +1,517 @@
+
+# 4.2 Data Import, Processing, and Summary Statistics
+
+This training module was developed by Elise Hickman, Alexis Payton, Sarah Miller, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+The first steps in any scripted analysis of wet-bench data include importing the data, cleaning the data to prepare for analyses, and conducting preliminary data exploration steps, such as addressing missing values, calculating summary statistics, and assessing normality. Although less exciting than diving right into the statistical analysis, these steps are crucial in guiding downstream analyses and ensuring accurate results. In this module, we will discuss each of these steps and work through them using an example dataset (introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**) of inflammatory markers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. What is the mean concentration of each inflammatory biomarker by acrolein concentration?
+
+2. Are our data normally distributed?
+
+
+
+## Data Import
+
+First, we need to import our data. Data can be imported into R from many different file formats, including .csv (as demonstrated in previous chapters), .txt, .xlsx, and .pdf. Often, data are formatted in Excel prior to import, and the [*openxlsx*](https://ycphs.github.io/openxlsx/) package provides helpful functions that allow the user to import data from Excel, create workbooks for storing results generated in R, and export data from R to Excel workbooks. Below, we will use the `read.xlsx()` function to import our data directly from Excel. Other useful packages include [*pdftools*](https://github.com/ropensci/pdftools) (PDF import), [*tm*](https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) (text mining of PDFs), and [*plater*](https://cran.r-project.org/web/packages/plater/vignettes/plater-basics.html) (plate reader formatted data import).
+```{r 4-2-Data-Import-1, echo = FALSE, fig.align = "center", out.width = "850px" }
+knitr::include_graphics("Chapter_4/4_2_Data_Import/Module4_2_Image1.png")
+```
+
+### Workspace Preparation and Data Import
+
+#### Set working directory
+
+In preparation, first let's set our working directory to the folder path that contains our input files:
+```{r 4-2-Data-Import-2, eval = FALSE}
+setwd("/filepath to where your input files are")
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
+```{r 4-2-Data-Import-3, echo=TRUE, eval=FALSE, warning=FALSE, results='hide', message=FALSE}
+# Install CRAN packages, if not already installed
+cran_pkgs <- c("openxlsx", "DT", "tidyverse", "table1", "vtable", "ggpubr")
+for (pkg in cran_pkgs) {
+  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
+}
+
+# Some packages need to be installed through Bioconductor/BiocManager
+if (!require("BiocManager", quietly = TRUE))
+  install.packages("BiocManager")
+BiocManager::install("pcaMethods")
+BiocManager::install("impute")
+BiocManager::install("imputeLCMD")
+```
+#### Load required packages
+
+And load required packages:
+```{r 4-2-Data-Import-4, message = FALSE}
+library(openxlsx) # for importing Excel files
+library(DT) # for easier viewing of data tables
+library(tidyverse) # for data cleaning and graphing
+library(imputeLCMD) # for data imputation with QRILC
+library(table1) # for summary table
+library(vtable) # for summary table
+library(ggpubr) # for making Q-Q plots with ggplot
+```
+
+#### Import example datasets
+
+Next, let's read in our example datasets:
+```{r 4-2-Data-Import-5 }
+biomarker_data <- read.xlsx("Chapter_4/4_2_Data_Import/Module4_2_InputData1.xlsx")
+demographic_data <- read.xlsx("Chapter_4/4_2_Data_Import/Module4_2_InputData2.xlsx")
+```
+
+#### View example datasets
+
+First, let's preview our example data. Using the `datatable()` function from the *DT* package allows us to interactively scroll through our biomarker data.
+```{r 4-2-Data-Import-6 }
+datatable(biomarker_data)
+```
+
+We can see that our biomarker data are arranged with samples in rows and sample information and biomarker measurements in the columns.
+```{r 4-2-Data-Import-7 }
+datatable(demographic_data)
+```
+
+Our demographic data provide information about the donors that our cells came from, matching to the `Donor` column in our biomarker data.
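+
+If needed for downstream analyses, these two tables can be combined into a single dataframe. Below is a minimal sketch using `left_join()`, assuming we want demographic information alongside each biomarker measurement (the joined dataframe `biomarker_demo` is illustrative and not used elsewhere in this module):
+```{r 4-2-Data-Import-join-sketch, eval = FALSE}
+# Join demographic information onto the biomarker measurements by donor ID
+biomarker_demo <- left_join(biomarker_data, demographic_data, by = "Donor")
+```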
+
+
+
+## Handling Missing Values
+
+Next, we will investigate whether we have missing values and which variables and donors have missing values.
+```{r 4-2-Data-Import-8 }
+# Calculate the total number of NAs per variable
+biomarker_data %>%
+ summarise(across(IL1B:VEGF, ~sum(is.na(.))))
+
+# Calculate the number of missing values per subject
+biomarker_data %>%
+ group_by(Donor) %>%
+ summarise(across(IL1B:VEGF, ~sum(is.na(.))))
+```
+
+Here, we can see that we do have a few missing values. What should we do with these values?
+
+### Missing Values and Data Imputation
+
+#### Missing values
+
+Before deciding what to do about our missing values, it's important to understand why they are missing. There are a few different types of missing values that could be present in a dataset:
+
+1. **Missing completely at random (MCAR):** has nothing to do with the experimental unit being studied (e.g., a sample is damaged or lost in the lab)
+
+2. **Missing at random (MAR):** there may be a systematic difference between missing and measured values, but they can be explained by observed differences in the data or experimental unit
+
+3. **Missing not at random (MNAR):** data are missing due to factors that are not observed/measured (e.g., measurement for a specific endpoint is below the limit of detection (LOD) of an assay)
+
+We know from the researchers who generated this dataset that the values are missing because these specific proteins were below the limit of detection for the assay for certain samples; therefore, our data are missing not at random. This can help us with our choice of imputation method, described below.
+
+#### Imputation
+
+Imputation is the assignment of a value to a missing data point by inferring that value from other properties of the dataset or externally defined limits. Whether or not you should impute your data is not a one-size-fits-all approach and may vary depending on your field, experimental design, the type of data, and the type of missing values in your dataset. Two questions you can ask yourself when deciding whether or not to impute data are:
+
+1. Is imputation needed for downstream analyses? *Some analyses are not permissive to including NAs or 0s; others are.*
+
+2. Will imputing values bias my analyses unnecessarily? *If so, consider analyzing subsets of the data that are complete separately.*
+
+
+There are many different imputation methods (too many to cover them all in this module); here, we will introduce a few that we use most often. We encourage you to explore these in more depth and to understand typical imputation workflows for your lab, data type, and/or discipline.
+
+- For variables where imputed values are expected to be generally bound by the existing range of data (e.g., MCAR): [missForest](https://rpubs.com/lmorgan95/MissForest)
+
+- For variables with samples below the limit of detection for the assay, such as for mass spectrometry or ELISAs (e.g., MNAR)
+ - Replace non-detects with the limit of detection divided by the square root of 2
+ - [Quantile Regression Imputation of Left-Censored Data (QRILC)](https://www.nature.com/articles/s41598-017-19120-0)
+ - [GSimp](https://github.com/WandeRum/GSimp) (can also be used to impute values above a specific threshold)
+
+If you do impute missing values, make sure to include both your raw and imputed data, along with detailed information about the imputation method, within your manuscript, supplemental information, and/or GitHub. You can even present summary statistics for both raw and imputed data for additional transparency.
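+
+As a brief illustration of the first MNAR option listed above, non-detects can be replaced with the limit of detection divided by the square root of 2. This is only a sketch: the `assay_lods` vector of per-biomarker detection limits is hypothetical and would come from your assay's documentation.
+```{r 4-2-Data-Import-lod-sketch, eval = FALSE}
+# Hypothetical per-biomarker limits of detection (from assay documentation)
+assay_lods <- c(IL1B = 0.06, IL6 = 0.09, IL8 = 0.04, IL10 = 0.03, TNFa = 0.05, VEGF = 1.2)
+
+# Replace NAs in each biomarker column with that biomarker's LOD / sqrt(2)
+biomarker_data_lod <- biomarker_data %>%
+  mutate(across(IL1B:VEGF, ~ replace_na(.x, unname(assay_lods[cur_column()]) / sqrt(2))))
+```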
+
+### Imputation of Our Data
+
+Before imputing our data, it is a good idea to implement a background filter that checks to see if a certain percentage of values for each variable are missing. For variables with a very high percentage of missing values, imputation can be unreliable because there is not enough information for the imputation algorithm to reference. The threshold for what this percentage should be can vary by study design and the extent to which your data are subset into groups that may have differing biomarker profiles; however, a common threshold we frequently use is to remove variables with missing data for 25% or more of samples.
+
+We can use the following code to calculate the percentage of values missing for each endpoint:
+```{r 4-2-Data-Import-9 }
+biomarker_data %>%
+ summarise(across(IL1B:VEGF, ~sum(is.na(.))/nrow(biomarker_data)*100))
+```
+
+Here, we can see that only about 3-4% of values are missing for our variables with missing data, so we will proceed to imputation with our dataset as-is.
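+
+For datasets where some variables do exceed the missingness threshold, a filter like the following could be applied before imputation. This is a sketch only, since none of our variables reach the 25% cutoff:
+```{r 4-2-Data-Import-filter-sketch, eval = FALSE}
+# Calculate percent missing per variable
+pct_missing <- biomarker_data %>%
+  summarise(across(IL1B:VEGF, ~ sum(is.na(.))/nrow(biomarker_data)*100))
+
+# Identify variables with >= 25% missing values
+vars_to_drop <- names(pct_missing)[unlist(pct_missing) >= 25]
+
+# Remove those variables before imputation
+biomarker_data_filtered <- biomarker_data %>%
+  select(-all_of(vars_to_drop))
+```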
+
+We will impute values using QRILC, which pulls from the left side of the data distribution (the lower values) to impute missing values. We will write a function that will apply QRILC imputation to our dataframe. This function takes a dataframe with missing values as input and returns a dataframe with QRILC imputed values in place of NAs as output.
+```{r 4-2-Data-Import-10 }
+QRILC_imputation = function(df){
+ # Normalize data before applying QRILC per QRILC documentation
+  ## Select only numeric columns, pseudo-log2 transform, and convert to a matrix
+ ### 4 comes from there being 3 metadata columns before the numeric data starts
+ QRILC_prep = df[,4:dim(df)[2]] %>%
+ mutate_all(., function(x) log2(x + 1)) %>%
+ as.matrix()
+
+ # QRILC imputation
+ imputed_QRILC_object = impute.QRILC(QRILC_prep, tune.sigma = 0.1)
+ QRILC_log2_df = data.frame(imputed_QRILC_object[1])
+
+  # Converting back to the original scale
+ QRILC_df = QRILC_log2_df %>%
+ mutate_all(., function(x) 2^x - 1)
+
+ # Adding back in metadata columns
+ QRILC_df = cbind(Donor = df$Donor,
+ Dose = df$Dose,
+ Replicate = df$Replicate,
+ QRILC_df)
+
+ return(QRILC_df)
+}
+```
+
+Now we can apply the `QRILC_imputation()` function to our dataframe. We use the function `set.seed()` to ensure that the QRILC function generates the same numbers each time we run the script. For more on setting seeds, see [here](https://www.statology.org/set-seed-in-r/).
+```{r 4-2-Data-Import-11 }
+# Set random seed to ensure reproducibility in results
+set.seed(1104)
+
+# Apply function
+biomarker_data_imp <- QRILC_imputation(biomarker_data)
+```
+
+
+## Averaging Replicates
+
+The last step we need to take before our data are ready for analysis is averaging the two technical replicates for each donor and dose. We will do this by creating an ID column that represents the donor and dose together and using that column to group and average the data. This results in a dataframe where our rows contain data representing each biological replicate exposed to each of the five concentrations of acrolein.
+```{r 4-2-Data-Import-12 }
+biomarker_data_imp_avg <- biomarker_data_imp %>%
+
+ # Create an ID column that represents the donor and dose
+ unite(Donor_Dose, Donor, Dose, sep = "_") %>%
+
+  # Average replicates within each unique Donor_Dose
+ group_by(Donor_Dose) %>%
+ summarize(across(IL1B:VEGF, mean)) %>%
+
+  # Round results to the same number of decimal places as the original data
+ mutate(across(IL1B:VEGF, \(x) round(x, 2))) %>%
+
+ # Separate back out the Donor_Dose column
+ separate(Donor_Dose, into = c("Donor", "Dose"), sep = "_")
+
+# View new dataframe
+datatable(biomarker_data_imp_avg)
+```
+
+
+## Descriptive Statistics
+
+Generating descriptive statistics (e.g., mean, median, mode, range, standard deviation) can be helpful for understanding the general distribution of your data and for reporting results either in the main body of a manuscript/report (for small datasets) or in the supplementary material (for larger datasets). There are a number of different approaches that can be used to calculate summary statistics, including functions that are part of base R and that are part of packages. Here, we will demonstrate a few different ways to efficiently calculate descriptive statistics across our dataset.
+
+### Method #1 - Tidyverse and Basic Functions
+
+The mean, or average of data points, is one of the most commonly reported summary statistics and is often reported as mean ± standard deviation to demonstrate the spread in the data. Here, we will make a table of mean ± standard deviation for each of our biomarkers across each of the dose groups using *tidyverse* functions.
+```{r 4-2-Data-Import-13 }
+# Calculate means
+biomarker_group_means <- biomarker_data_imp_avg %>%
+ group_by(Dose) %>%
+ summarise(across(IL1B:VEGF, \(x) mean(x)))
+
+# View data
+datatable(biomarker_group_means)
+```
+
+You'll notice that there are many decimal places in our calculated means, while our original data have only two. We can add a rounding step to the above code chunk to produce cleaner results.
+```{r 4-2-Data-Import-14 }
+# Calculate means
+biomarker_group_means <- biomarker_data_imp_avg %>%
+ group_by(Dose) %>%
+ summarise(across(IL1B:VEGF, \(x) mean(x))) %>%
+ mutate(across(IL1B:VEGF, \(x) round(x, 2)))
+
+# View data
+datatable(biomarker_group_means)
+```
+
+### Answer to Environmental Health Question 1
+:::question
+With this, we can answer **Environmental Health Question 1**: What is the mean concentration of each inflammatory biomarker by acrolein concentration?
+:::
+
+:::answer
+**Answer:** With the above table, we can see the mean concentrations for each of our inflammatory biomarkers by acrolein dose. IL-8 overall has the highest concentrations, followed by VEGF and IL-6. For IL-1$\beta$, IL-8, TNF-$\alpha$, and VEGF, it appears that the concentration of the biomarker goes up with increasing dose.
+:::
+
+We can use very similar code to calculate our standard deviations:
+```{r 4-2-Data-Import-15 }
+# Calculate standard deviations
+biomarker_group_sds <- biomarker_data_imp_avg %>%
+  group_by(Dose) %>%
+  summarise(across(IL1B:VEGF, \(x) sd(x))) %>%
+  mutate(across(IL1B:VEGF, \(x) round(x, 2)))
+
+# View data
+datatable(biomarker_group_sds)
+```
+
+Now we've calculated both the means and standard deviations! However, these are typically presented as mean ± standard deviation. We can merge these dataframes by executing the following steps:
+
+1. Pivot each dataframe to a long format, with each row containing the value for one biomarker at one dose.
+2. Create a variable that represents each unique row (combination of `Dose` and `variable`).
+3. Join the dataframes by row.
+4. Unite the two columns with mean and standard deviation, with `±` in between them.
+5. Pivot the dataframe wider so that the dataframe resembles what we started with for the means and standard deviations.
+
+First, we'll pivot each dataframe to a long format and create a variable that represents each unique row.
+```{r 4-2-Data-Import-16 }
+# Pivot dataframes longer and create variable column for each row
+biomarker_group_means_long <- pivot_longer(biomarker_group_means,
+ !Dose, names_to = "variable", values_to = "mean") %>%
+ unite(Dose_variable, Dose, variable, remove = FALSE)
+
+biomarker_group_sds_long <- pivot_longer(biomarker_group_sds,
+ !Dose, names_to = "variable", values_to = "sd") %>%
+ unite(Dose_variable, Dose, variable, remove = FALSE)
+
+
+# Preview what dataframe looks like
+datatable(biomarker_group_means_long)
+```
+
+Next, we will join the mean and standard deviation datasets. Notice that we are only joining the `Dose_variable` and `sd` columns from the standard deviation dataframe to prevent duplicate columns (`Dose`, `variable`) from being included.
+```{r 4-2-Data-Import-17 }
+# Merge the dataframes by row
+biomarker_group_summstats <- left_join(biomarker_group_means_long,
+ biomarker_group_sds_long %>% select(c(Dose_variable, sd)),
+ by = "Dose_variable")
+
+# Preview the new dataframe
+datatable(biomarker_group_summstats)
+```
+
+Then, we can unite the mean and standard deviation columns, adding the ± symbol between them by storing that character as a variable and including it in our `paste()` call.
+```{r 4-2-Data-Import-18 }
+# Store plus/minus character
+plusminus <- "\u00b1"
+Encoding(plusminus) <- "UTF-8"
+
+# Create new column with mean +/- standard deviation
+biomarker_group_summstats <- biomarker_group_summstats %>%
+ mutate(mean_sd = paste(mean, plusminus, sd, sep = " "))
+
+# Preview the new dataframe
+datatable(biomarker_group_summstats)
+```
+
+Last, we can pivot the dataframe wider to revert it to its original layout, which is easier to read.
+```{r 4-2-Data-Import-19 }
+# Pivot dataframe wider
+biomarker_group_summstats <- biomarker_group_summstats %>%
+
+ # Remove columns we don't need any more
+ select(-c(Dose_variable, mean, sd)) %>%
+
+ # Pivot wider
+ pivot_wider(id_cols = Dose, names_from = "variable", values_from = "mean_sd")
+
+# View final dataframe
+datatable(biomarker_group_summstats)
+```
+
+These data are now in a publication-ready format that can be exported to a .txt, .csv, or .xlsx file for sharing.
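+
+For example, the table could be written out with `write.xlsx()` from the *openxlsx* package we loaded earlier (the output file name here is arbitrary):
+```{r 4-2-Data-Import-export-sketch, eval = FALSE}
+# Export the publication-ready summary table to an Excel file
+write.xlsx(biomarker_group_summstats, "Module4_2_SummaryStatistics.xlsx")
+```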
+
+### Method #2 - Applying a List of Functions
+
+Calculating our mean and standard deviation separately using *tidyverse* wasn't too difficult, but what if we want to calculate other descriptive statistics, such as minimum, median, and maximum? We could use the above approach, but we would need to make a separate dataframe for each statistic and then merge them all together. Instead, we can use the `map_dfr()` function from the *purrr* package, which is also part of *tidyverse*. This function takes a list of functions you want to apply to your data and applies them over specified columns. Let's see how it works:
+```{r 4-2-Data-Import-20 }
+# Define summary functions
+summary_functs <- lst(min, median, mean, max, sd)
+
+# Apply functions to data, grouping by dose
+# .id = "statistic" tells the function to create a column describing which statistic that row is reporting
+biomarker_descriptive_stats_all <- map_dfr(summary_functs,
+ ~ summarize(biomarker_data_imp_avg %>% group_by(Dose),
+ across(IL1B:VEGF, .x)), .id = "statistic")
+
+# View data
+datatable(biomarker_descriptive_stats_all)
+```
+
+Depending on your final goal, descriptive statistics data can then be extracted from this dataframe and cleaned up or reformatted as needed to create a publication-ready table!
+
+### Other Methods
+
+There are also packages that have been developed specifically for making summary tables, such as [*table1*](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html) and [*vtable*](https://cran.r-project.org/web/packages/vtable/vignettes/sumtable.html). These packages can create summary tables in HTML format, which appear nicely in R Markdown and can be copied and pasted into Word. Here, we will briefly demonstrate how these packages work, and we encourage you to explore more using the package vignettes!
+
+#### Table1
+
+The *table1* package makes summary tables using the function `table1()`, whose first argument is a one-sided formula: the columns you want summarized in the rows of the table, separated by `+`, followed by `|` and then the grouping variable. The output table can be customized in a number of ways, including which summary statistics are output and whether or not statistical comparisons are run between groups (see the package vignette for more details).
+```{r 4-2-Data-Import-21 }
+# Get names of all of the columns to include in the table
+paste(names(biomarker_data_imp_avg %>% select(IL1B:VEGF)), collapse=" + ")
+```
+
+```{r 4-2-Data-Import-22, eval = FALSE}
+# Make the table
+table1(~ IL1B + IL6 + IL8 + IL10 + TNFa + VEGF | Dose, data = biomarker_data_imp_avg)
+```
+
+```{r 4-2-Data-Import-23, echo = FALSE, fig.align = "center", out.width = "850px" }
+knitr::include_graphics("Chapter_4/4_2_Data_Import/Module4_2_Image2.png")
+```
+
+#### Vtable
+
+The *vtable* package includes the function `st()`, which can also be used to make HTML tables (and other output formats; see `out` argument). For example:
+```{r 4-2-Data-Import-24 }
+# HTML output
+st(biomarker_data_imp_avg, group = 'Dose')
+
+# Dataframe output
+st(biomarker_data_imp_avg, group = 'Dose', out = 'return')
+```
+
+Similar to *table1*, see the package vignette for detailed information about how to customize tables using this package.
+
+
+
+## Normality Assessment and Data Transformation
+
+The last step we will take before beginning to test our data for statistical differences between groups (in the next module) is to understand our data's distribution through normality assessment. This will inform which statistical tests we will perform on our data. For more detail on normality testing, including detailed explanations of each type of normality assessment and explanations of the code underlying the following graphs and tables, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**.
+
+We'll start by looking at histograms of our data for qualitative normality assessment:
+```{r 4-2-Data-Import-25, message = FALSE, fig.align = 'center'}
+# Set theme
+theme_set(theme_bw())
+
+# Pivot data longer to prepare for plotting
+biomarker_data_imp_avg_long <- biomarker_data_imp_avg %>%
+ pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
+
+# Make figure panel of histograms
+ggplot(biomarker_data_imp_avg_long, aes(value)) +
+ geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
+ facet_wrap(~ variable, scales = "free", nrow = 2) +
+ labs(y = "# of Observations", x = "Value")
+```
+
+From these histograms, we can see that IL-1$\beta$ appears to be normally distributed, while the other endpoints do not appear to be normally distributed.
+
+We can also use Q-Q plots to assess normality qualitatively:
+```{r 4-2-Data-Import-26, fig.align = 'center'}
+ggqqplot(biomarker_data_imp_avg_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
+```
+
+With this figure panel, we can see that most of the variables have very noticeable deviations from the reference, suggesting non-normal distributions.
+
+To assess normality quantitatively, we can use the Shapiro-Wilk test. Note that the null hypothesis is that the sample distribution is normal, and a significant p-value means the distribution is non-normal.
+```{r 4-2-Data-Import-27 }
+# Apply Shapiro Wilk test to dataframe
+shapiro_res <- apply(biomarker_data_imp_avg %>% select(IL1B:VEGF), 2, shapiro.test)
+
+# Create results dataframe
+shapiro_res <- do.call(rbind.data.frame, shapiro_res)
+
+# Clean dataframe
+shapiro_res <- shapiro_res %>%
+
+ ## Add normality conclusion
+ mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
+
+ ## Remove columns that do not contain informative data
+ select(c(p.value, normal))
+
+# View cleaned up dataframe
+datatable(shapiro_res)
+```
+
+### Answer to Environmental Health Question 2
+:::question
+With this, we can answer **Environmental Health Question 2**: Are our data normally distributed?
+:::
+
+:::answer
+**Answer:** The results from the Shapiro-Wilk test demonstrate that the IL-1$\beta$ data are normally distributed, while the other variables are non-normally distributed. These results support the conclusions we made based on our qualitative assessment above with histograms and Q-Q plots.
+:::
+
+### Log~2~ Transforming and Re-Assessing Normality
+
+Log~2~ transformation is a common transformation used in environmental health research and can move data closer to a normal distribution. For more on data transformation, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**. We will pseudo-log~2~ transform our data, which adds 1 to each value before log~2~ transformation, avoiding undefined values for any zeros and ensuring that the resulting values are nonnegative real numbers. Let's see if the log~2~ data are more normally distributed than the raw data.
+```{r 4-2-Data-Import-28 }
+# Apply log2 transformation to data
+biomarker_data_imp_avg_log2 <- biomarker_data_imp_avg %>%
+ mutate(across(IL1B:VEGF, ~ log2(.x + 1)))
+```
+
+Make histogram panel:
+```{r 4-2-Data-Import-29, fig.align = 'center'}
+# Pivot data longer and make figure panel of histograms
+biomarker_data_imp_avg_log2_long <- biomarker_data_imp_avg_log2 %>%
+ pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
+
+# Make histogram panel
+ggplot(biomarker_data_imp_avg_log2_long, aes(value)) +
+ geom_histogram(fill = "gray40", color = "black", binwidth = function(x) {(max(x) - min(x))/25}) +
+ facet_wrap(~ variable, scales = "free") +
+ labs(y = "# of Observations", x = "Value")
+```
+
+Make Q-Q plot panel:
+```{r 4-2-Data-Import-30, fig.align = 'center'}
+ggqqplot(biomarker_data_imp_avg_log2_long, x = "value", facet.by = "variable", ggtheme = theme_bw(), scales = "free")
+```
+
+Run Shapiro-Wilk test:
+```{r 4-2-Data-Import-31 }
+# Apply Shapiro Wilk test
+shapiro_res_log2 <- apply(biomarker_data_imp_avg_log2 %>% select(IL1B:VEGF), 2, shapiro.test)
+
+# Create results dataframe
+shapiro_res_log2 <- do.call(rbind.data.frame, shapiro_res_log2)
+
+# Clean dataframe
+shapiro_res_log2 <- shapiro_res_log2 %>%
+
+ ## Add normality conclusion
+ mutate(normal = ifelse(p.value < 0.05, F, T)) %>%
+
+ ## Remove columns that do not contain informative data
+ select(c(p.value, normal))
+
+# View cleaned up dataframe
+shapiro_res_log2
+```
+
+The histograms and Q-Q plots demonstrate that the log~2~ data are more normally distributed than the raw data. The results from the Shapiro-Wilk test also demonstrate that the log~2~ data are more normally distributed as a whole than the raw data. Overall, the p-values, even for the variables that are still non-normally distributed, are much higher.
+
+So, should we proceed with the raw data or the log~2~ data? This depends on what analyses we plan to do. In general, it is best to keep the data as close to its raw format as possible, so if all of our planned analyses have non-parametric versions, we could use our raw data. However, some statistical tests do not have a non-parametric equivalent, in which case it would likely be best to use the log~2~ transformed data. For subsequent modules, we will proceed with the log~2~ data for consistency; however, choices regarding normality assessment can vary, so be sure to discuss these choices within your research group before proceeding with your analysis.
+
+For more on decisions regarding normality, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**. For more on parametric vs. non-parametric tests, see **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations** and **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**.
+
+
+
+## Concluding Remarks
+
+Taken together, this module demonstrates important data processing steps necessary before proceeding with between-group statistical testing, including data import, handling missing values, averaging replicates, generating descriptive statistics tables, and assessing normality. Careful consideration and description of these steps in the methods section of a manuscript or report increases reproducibility of analyses and helps to improve the accuracy and statistical validity of subsequent statistical results.
+
+
+
+
+
+:::tyk
+
+Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). Work through the same processes demonstrated in this module using the provided data ("Module4_2_TYKInput.xlsx") to answer the following questions:
+
+1. How many technical replicates are there for each dose?
+2. Are there any missing values?
+3. What are the average values for each endpoint by dose?
+4. Are the raw data normally distributed?
+:::
diff --git a/Chapter_4/Module4_2_Input/Module4_2_Image1.png b/Chapter_4/4_2_Data_Import/Module4_2_Image1.png
similarity index 100%
rename from Chapter_4/Module4_2_Input/Module4_2_Image1.png
rename to Chapter_4/4_2_Data_Import/Module4_2_Image1.png
diff --git a/Chapter_4/Module4_2_Input/Module4_2_Image2.png b/Chapter_4/4_2_Data_Import/Module4_2_Image2.png
similarity index 100%
rename from Chapter_4/Module4_2_Input/Module4_2_Image2.png
rename to Chapter_4/4_2_Data_Import/Module4_2_Image2.png
diff --git a/Chapter_4/Module4_2_Input/Module4_2_InputData1.xlsx b/Chapter_4/4_2_Data_Import/Module4_2_InputData1.xlsx
similarity index 100%
rename from Chapter_4/Module4_2_Input/Module4_2_InputData1.xlsx
rename to Chapter_4/4_2_Data_Import/Module4_2_InputData1.xlsx
diff --git a/Chapter_4/Module4_2_Input/Module4_2_InputData2.xlsx b/Chapter_4/4_2_Data_Import/Module4_2_InputData2.xlsx
similarity index 100%
rename from Chapter_4/Module4_2_Input/Module4_2_InputData2.xlsx
rename to Chapter_4/4_2_Data_Import/Module4_2_InputData2.xlsx
diff --git a/Chapter_4/Module4_3_Input/20230214_0002_Expt1_A_size_488.pdf b/Chapter_4/4_3_PDF_Import/20230214_0002_Expt1_A_size_488.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/20230214_0002_Expt1_A_size_488.pdf
rename to Chapter_4/4_3_PDF_Import/20230214_0002_Expt1_A_size_488.pdf
diff --git a/Chapter_4/Module4_3_Input/20230214_0006_Expt1_Ctrl_size_488.pdf b/Chapter_4/4_3_PDF_Import/20230214_0006_Expt1_Ctrl_size_488.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/20230214_0006_Expt1_Ctrl_size_488.pdf
rename to Chapter_4/4_3_PDF_Import/20230214_0006_Expt1_Ctrl_size_488.pdf
diff --git a/Chapter_4/Module4_3_Input/20230214_0014_Expt1_C_size_488.pdf b/Chapter_4/4_3_PDF_Import/20230214_0014_Expt1_C_size_488.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/20230214_0014_Expt1_C_size_488.pdf
rename to Chapter_4/4_3_PDF_Import/20230214_0014_Expt1_C_size_488.pdf
diff --git a/Chapter_4/Module4_3_Input/20230214_0023_Expt1_D_size_488.pdf b/Chapter_4/4_3_PDF_Import/20230214_0023_Expt1_D_size_488.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/20230214_0023_Expt1_D_size_488.pdf
rename to Chapter_4/4_3_PDF_Import/20230214_0023_Expt1_D_size_488.pdf
diff --git a/Chapter_4/Module4_3_Input/20230214_0024_Expt1_B_size_488.pdf b/Chapter_4/4_3_PDF_Import/20230214_0024_Expt1_B_size_488.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/20230214_0024_Expt1_B_size_488.pdf
rename to Chapter_4/4_3_PDF_Import/20230214_0024_Expt1_B_size_488.pdf
diff --git a/Chapter_4/4_3_PDF_Import/4_3_PDF_Import.Rmd b/Chapter_4/4_3_PDF_Import/4_3_PDF_Import.Rmd
new file mode 100644
index 0000000..35566b7
--- /dev/null
+++ b/Chapter_4/4_3_PDF_Import/4_3_PDF_Import.Rmd
@@ -0,0 +1,611 @@
+
+# 4.3 Data Import from PDF Sources
+
+This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Most tutorials for R rely on importing .csv, .xlsx, or .txt files, but there are numerous other file formats that can store data, and these file formats can be more difficult to import into R. PDFs can be particularly difficult to interface with in R because they are not formatted with defined rows/columns/cells as is done in Excel or .csv/.txt formatting. In this module, we will demonstrate how to import data from PDFs into R and format it such that it is amenable for downstream analyses or export as a table. Familiarity with *tidyverse*, for loops, and functions will make this module much more approachable, so be sure to review **TAME 2.0 Modules 2.3 Data Manipulation and Reshaping** and **2.4 Improving Coding Efficiencies** if you need a refresher.
+
+
+
+### Overview of Example Data
+
+To demonstrate import of data from PDFs, we will be leveraging two example datasets, described in more detail in their respective sections later on in the module.
+
+1. PDFs generated by Nanoparticle Tracking Analysis (NTA), a technique used to quantify the size and distribution of particles (such as extracellular vesicles) in a sample. We will be extracting data from an experiment in which epithelial cells were exposed to four different environmental chemicals or a vehicle control, and secreted particles were isolated and characterized using NTA.
+
+2. A PDF containing information about variables collected as part of a study whose samples are part of NIH's [BioLINCC Repository](https://biolincc.nhlbi.nih.gov/home/).
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Which chemical(s) increase and decrease the concentration of particles secreted by epithelial cells?
+2. How many variables total are available to us to request from the study whose data are stored in the repository, and what are these variables?
+
+
+
+## Importing Data from Many Single PDFs with the Same Formatting
+
+### Getting Familiar with the Example Dataset
+
+The following example is based on extracting data from PDFs generated by Nanoparticle Tracking Analysis (NTA), a technique used to quantify the size and distribution of particles in a sample. Each PDF file is associated with one sample, and each PDF contains multiple values that we want to extract. Although this is a very specific type of data, keep in mind that this general approach can be applied to any data stored in PDF format - you will just need to make modifications based on the layout of your PDF file!
+
+For this example, we will be extracting data from 5 PDFs that are identically formatted but contain information unique to each sample. The samples represent particles isolated from epithelial cell media following an experiment where cells were exposed to four different environmental chemicals (labeled "A", "B", "C", and "D") or a vehicle control (labeled "Ctrl").
+
+Here is what a full view of one of the PDFs looks like, with values we want to extract highlighted in yellow:
+```{r 4-3-PDF-Import-1, echo = FALSE, out.width = "850px", fig.align = "center"}
+knitr::include_graphics("Chapter_4/4_3_PDF_Import/Module4_3_Image1.png")
+```
+
+Our goal is to extract these values and end up with a dataframe that looks like this, with each sample in a row and each variable in a column:
+```{r 4-3-PDF-Import-2, echo = FALSE, message = FALSE}
+# Loading packages
+library(tidyverse)
+library(openxlsx)
+library(DT)
+
+# Reading in data
+ending_data <- read.xlsx("Chapter_4/4_3_PDF_Import/Module4_3_InputData1.xlsx")
+
+# Renaming some of the columns
+ending_data <- ending_data %>%
+ rename("Sample Identifier" = "Sample.Identifier",
+ "Experiment Number" = "Experiment.Number",
+ "Dilution Factor" = "Dilution.Factor",
+ "Concentration (Particles/mL)" = "Concentration.(Particles/.mL)")
+
+datatable(ending_data)
+```
+
+If your files are not already named in a way that reflects unique sample information, such as the date of the experiment or sample ID, update your file names to contain this information before proceeding with the script. Here are the names for the example PDF files:
+```{r 4-3-PDF-Import-3, out.width = "400px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_4/4_3_PDF_Import/Module4_3_Image2.png")
+```
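+
+Renaming many files by hand is error-prone, so this step can also be scripted. Below is a minimal sketch using base R's `file.rename()`; the folder path and the date prefix are hypothetical placeholders that would need to be adapted to your own files and naming scheme.
+```r
+# Batch-rename PDFs so each file name encodes sample information before import.
+# "my_pdf_folder" and the "20230214_" prefix are placeholders.
+old_names <- list.files("my_pdf_folder", pattern = ".pdf$", full.names = TRUE)
+
+# Build new names in the same order as old_names; here we simply prepend
+# the experiment date to each existing file name
+new_names <- file.path("my_pdf_folder", paste0("20230214_", basename(old_names)))
+
+file.rename(old_names, new_names)
+```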
+
+
+
+### Workspace Preparation and Data Import
+
+#### Installing and loading required R packages
+
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you. We will be using the *pdftools* and *tm* packages to extract text from the PDFs and the *janitor* package to clean up column names. Instead of using `head()` to preview dataframes, we will be using the function `datatable()` from the *DT* package. This function produces interactive tables and generates better formatting for viewing dataframes that have long character strings (like the ones we will be viewing in this section).
+
+```{r 4-3-PDF-Import-4, eval = FALSE}
+if (!requireNamespace("pdftools"))
+ install.packages("pdftools")
+if (!requireNamespace("tm"))
+ install.packages("tm")
+if (!requireNamespace("DT"))
+ install.packages("DT")
+if (!requireNamespace("janitor"))
+ install.packages("janitor")
+```
+
+Next, load the packages.
+```{r 4-3-PDF-Import-5, warning = FALSE, message = FALSE}
+library(tidyverse)
+library(pdftools)
+library(tm)
+library(DT)
+library(janitor)
+```
+
+#### Initial data import from PDF files
+
+The following code stores the file names of all of the files in your directory that end in .pdf. To ensure that only PDFs of interest are imported, consider making a subfolder within your directory containing only the PDF extraction script file and the PDFs you want to extract data from.
+```{r 4-3-PDF-Import-6 }
+pdf_list <- list.files(path = "./Chapter_4/4_3_PDF_Import", pattern = "488.pdf$")
+```
+
+We can see that each of our file names is now stored in `pdf_list`.
+```{r 4-3-PDF-Import-7 }
+head(pdf_list)
+```
+
+Next, we need to make a dataframe to store the extracted data. The `PDF Identifier` column will store the file name, and the `Text` column will store extracted text from the PDF.
+```{r 4-3-PDF-Import-8 }
+pdf_raw <- data.frame("PDF Identifier" = c(), "Text" = c())
+```
+
+The following code uses a `for` loop to iterate through each file stored in `pdf_list` and extract the text from the corresponding PDF. Sometimes this code generates duplicate rows, so we will also remove them with `distinct()`.
+```{r 4-3-PDF-Import-9, message = FALSE, warning = FALSE}
+for (i in 1:length(pdf_list)){
+
+ # Iterating through each pdf file and separating each line of text
+ document_text = pdf_text(paste("./Chapter_4/4_3_PDF_Import/", pdf_list[i], sep = "")) %>%
+ strsplit("\n")
+
+ # Saving the name of each PDF file and its text
+ document = data.frame("PDF Identifier" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
+ "Text" = document_text, stringsAsFactors = FALSE)
+
+ colnames(document) <- c("PDF Identifier", "Text")
+
+ # Appending the new text data to the dataframe
+ pdf_raw <- rbind(pdf_raw, document)
+}
+
+pdf_raw <- pdf_raw %>%
+ distinct()
+```
+
+The new dataframe contains the data from all of the PDFs, with the `PDF Identifier` column containing the name of the input PDF file that corresponds to the text in the column next to it.
+```{r 4-3-PDF-Import-10 }
+datatable(pdf_raw)
+```
+
+
+### Extracting Variables of Interest
+
+Specific variables of interest can be extracted from the `pdf_raw` dataframe by filtering the dataframe for rows that contain a specific character string. This character string could be the variable of interest (if that word or set of words is unique and only occurs in that one place in the document) or a character string that occurs in the same line of the PDF as your variable of interest. Examples of both of these approaches are shown below.
+
+It is important to note that there can be different numbers of spaces in each row and after each semicolon, which will change the `sep` argument for each variable. For example, there are a different number of spaces after the semicolon for "Dilution Factor" than there are for "Concentration" (see above PDF screen shot for reference). We will work through an example for the first variable of interest, dilution factor, in detail.
+
+First, we can see what the dataframe looks like when we just filter rows based on keeping only rows that contain the string "Dilution Factor" in the text column using the `grepl()` function.
+```{r 4-3-PDF-Import-11 }
+dilution_factor_df <- pdf_raw %>%
+ filter(grepl("Dilution Factor", Text))
+
+datatable(dilution_factor_df)
+```
+
+The value we are trying to extract is at the end of a long character string. We will want to use the tidyverse function `separate()` to isolate those values, but we need to know what part of the character string will separate the dilution factor values from the rest of the text. To determine this, we can call just one of the data cells and copy the semicolon and following spaces for use in the `separate()` function.
+```{r 4-3-PDF-Import-12 }
+# Return the value in the first row and second column.
+dilution_factor_df[1,2]
+```
+
+Building on the previous code, we can now separate the dilution factor value from the rest of the string. The `separate()` function takes an input data column and splits it into two or more columns based on the character string passed to the `sep` argument. Here, everything before the separation string is discarded by setting the first new column to NA, and everything after it is stored in a new column called `Dilution Factor`. The starting `Text` column is removed by default.
+```{r 4-3-PDF-Import-13 }
+dilution_factor_df <- pdf_raw %>%
+ filter(grepl("Dilution Factor", Text)) %>%
+ separate(Text, into = c(NA, "Dilution Factor"), sep = ": ")
+
+datatable(dilution_factor_df)
+```
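+
+As an alternative to `separate()`, a regular expression can pull the trailing number directly. This is a sketch of the same extraction using `str_extract()` from *stringr* (loaded with *tidyverse*); it is not part of the main workflow above.
+```r
+# Extract the digits (and any decimal point) at the end of each text string,
+# then drop the original Text column
+dilution_factor_df <- pdf_raw %>%
+  filter(grepl("Dilution Factor", Text)) %>%
+  mutate(`Dilution Factor` = str_extract(Text, "[0-9.]+$")) %>%
+  select(-Text)
+```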
+
+For the "Original Concentration" variable, we filter rows by the string "pH", because the word "Concentration" is found in multiple locations in the document.
+```{r 4-3-PDF-Import-14 }
+concentration_df = pdf_raw %>%
+ filter(grepl("pH", Text)) %>%
+ separate(Text, c(NA, "Concentration"), sep = ": ")
+
+datatable(concentration_df)
+```
+
+With the dilution factor variable, there were no additional characters after the value of interest, but here, "Particles / mL" remains and needs to be removed so that the data can be used in downstream analyses. We can add an additional cleaning step to remove "Particles / mL" from the data and add the units to the column title. `sep = " P"` refers to the space before and first letter of the string to be removed.
+```{r 4-3-PDF-Import-15 }
+concentration_df = pdf_raw %>%
+ filter(grepl("pH", Text)) %>%
+ separate(Text, c(NA, "Concentration"), sep = ": ") %>%
+ separate(Concentration, c("Concentration (Particles/ mL)", NA), sep = " P")
+
+datatable(concentration_df)
+```
+
+Next, we want to extract size distribution data from the lower table. Note that the space in the first `separate()` function comes from the space between the "Number" and "Concentration" column in the string, and the space in the second `separate()` function comes from the space between the variable name and the number of interest. We can also convert values to numeric since they are currently stored as characters.
+```{r 4-3-PDF-Import-16 }
+size_distribution_df = pdf_raw %>%
+ filter(grepl("X10", Text)| grepl("X50 ", Text)| grepl("X90", Text) | grepl("Mean", Text)| grepl("StdDev", Text)) %>%
+ separate(Text, c("Text", NA), sep = " ") %>%
+ separate(Text, c("Text", "Size"), sep = " ") %>%
+ mutate(Size = as.numeric(Size)) %>%
+ pivot_wider(names_from = Text, values_from = Size)
+
+datatable(size_distribution_df)
+```
+
+### Creating the final dataframe
+
+Now that we have created dataframes for all of the variables that we are interested in, we can join them together into one final dataframe.
+```{r 4-3-PDF-Import-17 }
+# Make list of all dataframes to include
+all_variables <- list(dilution_factor_df, concentration_df, size_distribution_df)
+
+# Combine dataframes using reduce function. Sometimes, duplicate rows are generated by full_join.
+full_df = all_variables %>%
+ reduce(full_join, by = "PDF Identifier") %>%
+ distinct()
+
+# View new dataframe
+datatable(full_df)
+```
+
+For easier downstream analysis, the last step is to separate the `PDF Identifier` column into an informative sample ID that matches up with other experimental data.
+```{r 4-3-PDF-Import-18 }
+final_df <- full_df %>%
+ separate('PDF Identifier',
+ # Split sample identifier column into new columns, retaining the original column
+ into = c("Date", "FileNumber", "Experiment Number", "Sample_ID", "Size", "Wavelength"), sep = "_", remove = FALSE) %>%
+ select(-c(FileNumber, Size)) %>% # Remove uninformative columns
+ mutate(across('Dilution Factor':'StdDev', as.numeric)) # Change variables to numeric where appropriate
+
+datatable(final_df)
+```
+
+Let's make a graph to help us answer Environmental Health Question 1.
+```{r 4-3-PDF-Import-19, message = FALSE}
+theme_set(theme_bw())
+
+data_for_graphing <- final_df %>%
+ clean_names()
+
+data_for_graphing$sample_id <- factor(data_for_graphing$sample_id, levels = c("Ctrl", "A", "B", "C", "D"))
+
+ggplot(data_for_graphing, aes(x = sample_id, y = concentration_particles_m_l)) +
+ geom_bar(stat = "identity", fill = "gray70", color = "black") +
+ ylab("Particle Concentration (Particles/mL)") +
+ xlab("Exposure")
+```
+
+:::question
+*With this, we can answer **Environmental Health Question #1***: Which chemical(s) increase and decrease the concentration of particles secreted by epithelial cells?
+:::
+
+:::answer
+**Answer**: Chemicals B and C appear to increase the concentration of secreted particles. However, additional replicates of this experiment are needed to assess statistical significance.
+:::
+
+
+
+## Importing Data Stored in PDF Tables
+
+The above workflow is useful if you just want to extract a few specific values from PDFs, but isn't as useful if data are already in a table format in a PDF. The [*tabulapdf package*](https://github.com/ropensci/tabulapdf) provides helpful functions for extracting dataframes from tables in PDF format.
+
+### Getting Familiar with the Example Dataset
+
+The following example is based on extracting dataframes from a long PDF containing many individual data tables. This particular PDF came from the NIH's BioLINCC Repository and details variables that researchers can request from the repository. Variables are part of larger datasets that contain many variables, with each dataset in a separate table. All of the tables are stored in one PDF file, and some of the tables are longer than one page (this will become relevant later on!). Similar to the first PDF workflow, remember that this is a specific example intended to demonstrate how to work through extracting data from PDFs. Modifications will need to be made for differently formatted PDFs.
+
+Here is what the first three pages of our 75-page starting PDF look like:
+```{r 4-3-PDF-Import-20, echo = FALSE, out.width = "850px", fig.align = "center"}
+knitr::include_graphics("Chapter_4/4_3_PDF_Import/Module4_3_Image3.png")
+```
+
+If we zoom in a bit more on the first page, we can see that the dataset name is defined in bold above each table. This formatting is consistent throughout the PDF.
+```{r 4-3-PDF-Import-21, echo = FALSE, out.width = "850px", fig.align = "center"}
+knitr::include_graphics("Chapter_4/4_3_PDF_Import/Module4_3_Image4.png")
+```
+
+The zoomed in view also allows us to see the columns and their contents more clearly. Some are more informative than others. The columns we are most interested in are listed below along with a description to guide you through the contents.
+
+- `Num`: The number assigned to each variable in the dataset. This numbering restarts with 1 for each table.
+- `Variable`: The variable name.
+- `Type`: The type (or class) of the variable, either numeric or character.
+- `Label`: A description of the variable and values associated with the variable.
+
+After extracting the data, we want to end up with a dataframe that contains all of the variables, their corresponding columns, and a column that indicates which dataset the variable is associated with:
+```{r 4-3-PDF-Import-22, echo = FALSE}
+biolincc_final <- read.xlsx("Chapter_4/4_3_PDF_Import/Module4_3_InputData3.xlsx") %>%
+ clean_names()
+
+datatable(biolincc_final)
+```
+
+### Workspace Preparation and Data Import
+
+#### Installing and loading required R packages
+
+Similar to previous sections, we need to install and load a few packages before proceeding. The *tabulapdf* package needs to be installed in a specific way as shown below and can sometimes be difficult to install on Macs. If errors are produced, follow the troubleshooting tips outlined in [this](https://stackoverflow.com/questions/67849830/how-to-install-rjava-package-in-mac-with-m1-architecture) Stack Overflow solution.
+
+```{r 4-3-PDF-Import-23, eval = FALSE}
+# To install all of the packages except for tabulapdf
+if (!requireNamespace("stringr"))
+ install.packages("stringr")
+if (!requireNamespace("pdftools"))
+ install.packages("pdftools")
+if (!requireNamespace("rJava"))
+ install.packages("rJava")
+```
+
+```{r 4-3-PDF-Import-24, message = FALSE, eval = FALSE}
+# To install tabulapdf
+if (!require("remotes")) {
+ install.packages("remotes")
+}
+
+library(remotes)
+
+remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulapdf"), force=TRUE, INSTALL_opts = "--no-multiarch")
+```
+
+Load packages:
+```{r 4-3-PDF-Import-25, message = FALSE, eval = FALSE}
+library(tabulapdf)
+library(tidyverse)
+library(janitor)
+library(pdftools)
+library(stringr)
+```
+
+#### Initial data import from PDF file
+
+The `extract_tables()` function automatically extracts tables from PDFs and stores them as tibbles (a specific tidyverse data structure similar to a dataframe) within a list. One table is extracted per page, even if the table spans multiple pages. This line of code can take a few seconds to run depending on the length of your PDF.
+```{r 4-3-PDF-Import-26 }
+tables <- tabulapdf::extract_tables("Chapter_4/4_3_PDF_Import/Module4_3_InputData4.pdf", output = "tibble")
+```
+
+Glimpsing the first three elements in the tables list, we can see that each list element is a dataframe containing the columns from the PDF tables.
+```{r 4-3-PDF-Import-27 }
+glimpse(tables[1:3])
+```
+
+Exploring further, here is how each dataframe is formatted:
+```{r 4-3-PDF-Import-28 }
+datatable(tables[[1]])
+```
+
+Notice that, although the dataframe mirrors the PDF table layout, the `Label` text is spread across multiple rows, with NAs in the other columns of those rows, because the text spanned multiple lines in the PDF. In our final dataframe, we will want the entire block of text in one cell. We can also remove the "Len", "Format", and "Informat" columns because they are not informative and are not found in every table. Next, we will walk through how to clean up this table using a series of tidyverse steps.
+
+### Cleaning dataframes
+
+First, we will select the columns we are interested in and use the `fill()` function to change the NAs in the "Num" column so that each line of text in the "Label" column has the correct "Num" value in the same row.
+```{r 4-3-PDF-Import-29 }
+cleaned_table1 <- data.frame(tables[[1]]) %>% # Extract the first table in the list
+
+ # Select only the columns of interest
+ select(c(Num, Variable, Type, Label)) %>%
+
+ # Change the "Num" column to numeric, which is required for the fill function
+ mutate(Num = as.numeric(Num)) %>%
+
+ # Fill in the NAs in the "Num" column down the column
+ fill(Num, .direction = "down")
+
+datatable(cleaned_table1)
+```
+
+We still need to move all of the Label text for each variable into one cell in one row instead of across multiple rows. For this, we can use the `unlist()` function. Here is a demonstration of how the `unlist()` function works using just the first variable:
+```{r 4-3-PDF-Import-30 }
+cleaned_table1_var1 <- cleaned_table1 %>%
+
+ # Filter dataframe to just contain rows associated with the first variable
+ filter(Num == 1) %>%
+
+ # Paste all character strings in the Label column with a space in between them into a new column called "new_label"
+ mutate(new_label = paste(unlist(Label), collapse = " "))
+
+datatable(cleaned_table1_var1)
+```
+
+We now have all of the text we want in one cell, but we have duplicate rows that we don't need. We can get rid of these rows by assigning blank values "NA" and then omitting rows that contain NAs.
+```{r 4-3-PDF-Import-31, warning = FALSE}
+cleaned_table1_var1 <- cleaned_table1_var1 %>%
+ mutate(across(Variable, na_if, "")) %>%
+ na.omit()
+
+datatable(cleaned_table1_var1)
+```
+
+We need to apply this code to the whole dataframe and not just one variable, so we can add `group_by(Num)` to our cleaning workflow, followed by the code we just applied to our filtered dataframe.
+```{r 4-3-PDF-Import-32, warning = FALSE}
+cleaned_table1 <- data.frame(tables[[1]]) %>% # Extract the first table in the list
+
+ # Select only the columns of interest
+ select(c(Num, Variable, Type, Label)) %>%
+
+ # Change the "Num" column to numeric, which is required for the fill function
+ mutate(Num = as.numeric(Num)) %>%
+
+ # Fill in the NAs in the "Num" column down the column
+ fill(Num, .direction = "down") %>%
+
+ # Group by variable number
+ group_by(Num) %>%
+  # Unlist the text and replace the text in the "Label" column with the unlisted text
+ mutate(Label = paste(unlist(Label), collapse =" ")) %>%
+
+ # Make blanks in the "Variable" column into NAs
+ mutate(across(Variable, na_if, "")) %>%
+
+ # Remove rows with NAs
+ na.omit()
+
+datatable(cleaned_table1)
+```
+
+Ultimately, we need to clean up each dataframe in the list the same way, and we need all of the dataframes to be in one dataframe, instead of in a list. There are a couple of different ways to do this. Both rely on the code shown above for cleaning up each dataframe. Option #1 uses a for loop, while Option #2 uses application of a function on the list of dataframes. Both result in the same ending dataframe!
+
+**Option #1**
+```{r 4-3-PDF-Import-33, warning = FALSE}
+# Create a dataframe for storing variables
+variables <- data.frame()
+
+# Make a for loop to format each dataframe and add it to the variables
+for (i in 1:length(tables)) {
+
+ table <- data.frame(tables[[i]]) %>%
+ select(c(Num, Variable, Type, Label)) %>%
+ mutate(Num = as.numeric(Num)) %>%
+ fill(Num, .direction = "down") %>%
+ group_by(Num) %>%
+ mutate(Label = paste(unlist(Label), collapse =" ")) %>%
+ mutate(across(Variable, na_if, "")) %>%
+ na.omit()
+
+ variables <- bind_rows(variables, table)
+}
+
+# View resulting dataframe
+datatable(variables)
+```
+
+**Option #2**
+```{r 4-3-PDF-Import-34, warning = FALSE}
+# Write a function that applies all of the cleaning steps to a dataframe (output = cleaned dataframe)
+clean_tables <- function(data) {
+
+ data <- data %>%
+ select(c(Num, Variable, Type, Label)) %>%
+ mutate(Num = as.numeric(Num)) %>%
+ fill(Num, .direction = "down") %>%
+ group_by(Num) %>%
+ mutate(Label = paste(unlist(Label), collapse =" ")) %>%
+ mutate(across(Variable, na_if, "")) %>%
+ na.omit()
+
+ return(data)
+}
+
+# Apply the function over each table in the list of tables
+tables_clean <- lapply(X = tables, FUN = clean_tables)
+
+# Unlist the dataframes and combine them into one dataframe
+tables_clean_unlisted <- do.call(rbind, tables_clean)
+
+# View resulting dataframe
+datatable(tables_clean_unlisted)
+```
+
+### Adding Dataset Names
+
+We now have a dataframe with all of the information from the PDF contained in one long table. However, we still need to add back in the dataset name displayed above each table. We can't do this with the *tabulapdf* package because the name isn't stored in the table itself, but we can use the *pdftools* package instead!
+
+First, we will read in the PDF using the *pdftools* package. This results in a vector containing one long character string for each page of the PDF. Notice a few features of these character strings:
+
++ Each line is separated by `\n`
++ Elements [1] and [2] of the vector contain the text "Data Set Name:", while element [3] does not, because the third page is a continuation of the table from the second page and therefore has no table title.
+
+```{r 4-3-PDF-Import-35 }
+table_names <- pdf_text("Chapter_4/4_3_PDF_Import/Module4_3_InputData4.pdf")
+
+head(table_names[1:3])
+```
+
+Similar to the table cleaning section, we will work through an example of extracting the text of interest from one of these character vectors, then apply the same code to all of the character vectors. First, we will select just the first element in the vector and make it into a dataframe.
+```{r 4-3-PDF-Import-36 }
+# Create dataframe
+dataset_name_df_var1 <- data.frame(strsplit(table_names[1], "\n"))
+
+# Clean column name
+colnames(dataset_name_df_var1) <- c("Text")
+
+# View dataframe
+datatable(dataset_name_df_var1)
+```
+
+Next, we will extract the dataset name using the same approach used in extracting values from the nanoparticle tracking example above and assign the name to a variable. We filter by the string "Data Set Name" because this is the start of the text string in the row where our dataset name is stored and is the same across all of our datasets.
+```{r 4-3-PDF-Import-37 }
+# Create dataframe
+dataset_name_df_var1 <- dataset_name_df_var1 %>%
+ filter(grepl("Data Set Name", dataset_name_df_var1$Text)) %>%
+ separate(Text, into = c(NA, "dataset"), sep = "Data Set Name: ")
+
+# Assign variable
+dataset_name_var1 <- dataset_name_df_var1[1,1]
+
+# View variable name
+dataset_name_var1
+```
+
+Now that we have the dataset name stored as a variable, we can create a dataframe that will correspond to the rows in our `variables` dataframe. The challenge is that each dataset contains a different number of variables! We can determine how many rows each dataset contains by returning to our `variables` dataframe and calculating the number of rows associated with each dataset. The following code splits the `variables` dataframe into a list of dataframes by each occurrence of 1 in the "Num" column (when the numbering restarts for a new dataset).
+```{r 4-3-PDF-Import-38 }
+# Calculate the number of rows associated with each dataset for reference
+dataset_list <- split(variables, cumsum(variables$Num == 1))
+
+glimpse(dataset_list[1:3])
+```
+
+The number of rows in each list element is the number of variables in that dataset. We can use this value when creating our dataframe of dataset names.
+```{r 4-3-PDF-Import-39 }
+# Store the number of rows in a variable
+n_rows = nrow(data.frame(dataset_list[1]))
+
+# Repeat the dataset name for the number of variables there are
+dataset_name_var1 = data.frame("dataset_name" = rep(dataset_name_var1, times = n_rows))
+
+# View dataframe
+datatable(dataset_name_var1)
+```
+
+We now have a dataframe that can be joined with our `variables` dataframe for the first table. We can apply this approach to each table in our original PDF using a `for` loop.
+```{r 4-3-PDF-Import-40 }
+# Make dataframe to store dataset names
+dataset_names <- data.frame()
+
+# Create list of datasets
+dataset_list <- split(variables, cumsum(variables$Num == 1))
+
+# Remove elements from the table_names vector that do not contain the string "Data Set Name"
+table_names_filtered <- stringr::str_subset(table_names, 'Data Set Name')
+
+# Populate dataset_names dataframe
+for (i in 1:length(table_names_filtered)) {
+
+ # Get dataset name
+ dataset_name_df <- data.frame(strsplit(table_names_filtered[i], "\n"))
+
+  colnames(dataset_name_df) <- c("Text")
+
+ dataset_name_df <- dataset_name_df %>%
+ filter(grepl("Data Set Name", dataset_name_df$Text)) %>%
+ separate(Text, into = c(NA, "dataset"), sep = "Data Set Name: ")
+
+ dataset_name <- dataset_name_df[1,1]
+
+ # Determine number of variables in that dataset
+ data_set <- data.frame(dataset_list[i])
+ n_rows = nrow(data_set)
+
+ # Repeat the dataset name for the number of variables there are
+ dataset_name = data.frame("Data Set Name" = rep(dataset_name, times = n_rows))
+
+ # Bind to dataframe
+ dataset_names <- bind_rows(dataset_names, dataset_name)
+
+}
+
+
+# Rename column
+colnames(dataset_names) <- c("Data Set Name")
+
+# View
+datatable(dataset_names)
+```
+
+### Combining Dataset Names and Variable Information
+
+Last, we will merge together the dataframe containing dataset names and variable information.
+```{r 4-3-PDF-Import-41 }
+# Merge together
+final_variable_df <- cbind(dataset_names, variables) %>%
+ rename("Variable Description" = "Label", "Variable Number Within Dataset" = "Num") %>%
+ clean_names()
+
+datatable(final_variable_df)
+```
+
+We can also determine how many total variables we have, all of which are accessible via the table we just generated.
+```{r 4-3-PDF-Import-42 }
+# Total number of variables
+nrow(final_variable_df)
+```
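+
+Beyond the total, a quick tally shows how many variables each dataset contributes. This is a sketch that assumes `clean_names()` converted the "Data Set Name" column to `data_set_name`.
+```r
+# Count variables per dataset, largest first
+final_variable_df %>%
+  count(data_set_name, sort = TRUE)
+```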
+
+:::question
+*With this, we can answer **Environmental Health Question #2***: How many variables total are available to us to request from the study whose data are stored in the repository, and what are these variables?
+:::
+
+:::answer
+**Answer**: There are 1,190 variables available to us. We can browse through the variables, including the sub-table each came from, the type of each variable, and how it was derived, using the table we generated.
+:::
+
+
+
+## Concluding Remarks
+
+This training module provides example case studies demonstrating how to import PDF data into R and clean it so that it is more useful and accessible for analyses. The approaches demonstrated here, though tailored to our specific example data, can be adapted to many different types of PDF data.
+
+
+
+
+
+:::tyk
+Using the same input files that we used in part 1, "Importing Data from Many Single PDFs with the Same Formatting", found in the Module4_3_TYKInput folder, extract the remaining variables of interest (Original Concentration and Positions Removed) from the PDFs and summarize them in one dataframe.
+:::
diff --git a/Chapter_4/Module4_3_Input/Module4_3_Image1.png b/Chapter_4/4_3_PDF_Import/Module4_3_Image1.png
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_Image1.png
rename to Chapter_4/4_3_PDF_Import/Module4_3_Image1.png
diff --git a/Chapter_4/Module4_3_Input/Module4_3_Image2.png b/Chapter_4/4_3_PDF_Import/Module4_3_Image2.png
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_Image2.png
rename to Chapter_4/4_3_PDF_Import/Module4_3_Image2.png
diff --git a/Chapter_4/Module4_3_Input/Module4_3_Image3.png b/Chapter_4/4_3_PDF_Import/Module4_3_Image3.png
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_Image3.png
rename to Chapter_4/4_3_PDF_Import/Module4_3_Image3.png
diff --git a/Chapter_4/Module4_3_Input/Module4_3_Image4.png b/Chapter_4/4_3_PDF_Import/Module4_3_Image4.png
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_Image4.png
rename to Chapter_4/4_3_PDF_Import/Module4_3_Image4.png
diff --git a/Chapter_4/Module4_3_Input/Module4_3_InputData1.xlsx b/Chapter_4/4_3_PDF_Import/Module4_3_InputData1.xlsx
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_InputData1.xlsx
rename to Chapter_4/4_3_PDF_Import/Module4_3_InputData1.xlsx
diff --git a/Chapter_4/Module4_3_Input/Module4_3_InputData3.xlsx b/Chapter_4/4_3_PDF_Import/Module4_3_InputData3.xlsx
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_InputData3.xlsx
rename to Chapter_4/4_3_PDF_Import/Module4_3_InputData3.xlsx
diff --git a/Chapter_4/Module4_3_Input/Module4_3_InputData4.pdf b/Chapter_4/4_3_PDF_Import/Module4_3_InputData4.pdf
similarity index 100%
rename from Chapter_4/Module4_3_Input/Module4_3_InputData4.pdf
rename to Chapter_4/4_3_PDF_Import/Module4_3_InputData4.pdf
diff --git a/Chapter_4/4_4_Two_Groups/4_4_Two_Groups.Rmd b/Chapter_4/4_4_Two_Groups/4_4_Two_Groups.Rmd
new file mode 100644
index 0000000..3e81447
--- /dev/null
+++ b/Chapter_4/4_4_Two_Groups/4_4_Two_Groups.Rmd
@@ -0,0 +1,386 @@
+
+# 4.4 Two Group Comparisons and Visualizations
+
+This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Two group statistical comparisons, in which we test whether the means of two groups differ significantly, are among the most common statistical tests in environmental health research and across biomedical research more broadly. In this training module, we will demonstrate how to run two group statistical comparisons and how to present publication-quality figures and tables of the results. We will continue to use the same example dataset as in this chapter's previous modules, which represents concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are there significant differences in inflammatory biomarker concentrations between cells from male and female donors at baseline?
+2. Are there significant differences in inflammatory biomarker concentrations between cells exposed to 0 and 4 ppm acrolein?
+
+### Workspace Preparation and Data Import
+
+Here, we will import the processed data that we generated at the end of **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**. These data, along with the associated demographic data, were introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load packages that will be needed for the analysis, including previously introduced packages such as *openxlsx*, *tidyverse*, *DT*, and *ggpubr*, and additional packages relevant to statistical analysis and graphing that will be discussed in greater detail below.
+```{r 4-4-Two-Groups-1, message = FALSE}
+# Load packages
+library(openxlsx)
+library(tidyverse)
+library(DT)
+library(rstatix)
+library(ggpubr)
+```
+
+```{r 4-4-Two-Groups-2 }
+# Import data
+biomarker_data <- read.xlsx("Chapter_4/4_4_Two_Groups/Module4_4_InputData1.xlsx")
+demographic_data <- read.xlsx("Chapter_4/4_4_Two_Groups/Module4_4_InputData2.xlsx")
+
+# View data
+datatable(biomarker_data)
+datatable(demographic_data)
+```
+
+
+
+## Overview of Two Group Statistical Tests
+
+Before applying statistical tests to our data, let's first review common two group statistical tests, their underlying assumptions, and variations on these tests.
+
+### Common Tests
+
+The two most common two group statistical tests are the...
+
++ **T-test** (also known as Student's t-test) and the
++ **Wilcoxon test** (also known as the Wilcoxon rank sum test or Mann-Whitney test)
+
+Both of these tests evaluate the null hypothesis that the means of the two populations (groups) are the same; the alternative hypothesis is that they are not the same. A significant p-value means that we can reject the null hypothesis that the means of the two groups are the same. Whether or not a p-value meets criteria for significance is experiment-specific, though commonly implemented significance thresholds include p < 0.05 and p < 0.01. This threshold is called the alpha value, and it represents the accepted probability of a **type I error**, or false positive, in which the null hypothesis is rejected despite it actually being true. On the other hand, a **type II error**, or false negative, occurs when the null hypothesis is not rejected when it actually should have been.
+
+### Assumptions
+
+The main difference between these two tests lies in the assumption about the underlying distribution of the data. T-tests assume that the data are drawn from a normal distribution, while Wilcoxon tests make no such assumption. Therefore, it is most appropriate to use a t-test when data are, in general, normally distributed and a Wilcoxon test when data are not normally distributed.
+
+Additional assumptions underlying both t-tests and Wilcoxon tests are:
+
+- The dependent variable is continuous or ordinal (discrete, ordered values).
+- The data are collected from a representative, random sample.
+
+T-tests also assume that:
+
+- The standard deviations of the two groups are approximately equal (also called homogeneity of variance).
+
+### When to Use a Parametric vs Non-Parametric Test?
+
+Deciding whether to use a parametric or non-parametric test isn't a one-size-fits-all choice, and the decision should be made holistically for each dataset. Typically, parametric tests should be used when the data are normally distributed, continuous, randomly sampled, without extreme outliers, and representative of independent samples or participants. A non-parametric test can be used when the sample size (*n*) is small, outliers are present in the dataset, and/or the data are not normally distributed.
+
+This decision matters more when dealing with smaller sample sizes (*n* < 10), as smaller samples are more prone to skew, and parametric tests are more sensitive to outliers. Therefore, when dealing with a smaller *n*, it might be best to perform a data transformation as discussed in **TAME 2.0 Module 3.3 Normality Testing & Data Transformations** and then apply a parametric test if the parametric assumptions can then be met, or otherwise to use a non-parametric test. For larger sample sizes (*n* > 50), outliers can potentially be removed and the dataset retested for assumptions. Lastly, what is considered "small" or "large" with regard to sample size can be subjective and should be weighed within the context of the experiment.
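+
+As a quick check when choosing between test families, normality can be assessed per group with the Shapiro-Wilk test (covered in **TAME 2.0 Module 3.3 Normality Testing & Data Transformations**). Below is a minimal sketch using simulated values rather than our dataset:
+```{r 4-4-Two-Groups-normality-sketch}
+# Simulated example: one roughly normal group and one right-skewed group
+set.seed(42)
+normal_group <- rnorm(20, mean = 5, sd = 1)
+skewed_group <- rexp(20, rate = 0.5)
+
+# Shapiro-Wilk test: p > 0.05 is consistent with normality,
+# while p < 0.05 suggests the data are non-normally distributed
+shapiro.test(normal_group)
+shapiro.test(skewed_group)
+```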
+
+### Variations
+
+**Unequal Variance:** When the assumption of homogeneity of variance is not met, a Welch's t-test is generally preferred over a student's t-test. This can be implemented easily by setting `var.equal = FALSE` as an argument to the function executing the t-test (e.g., `t.test()`, `t_test()`). For more on testing homogeneity of variance in R, see [here](https://www.datanovia.com/en/lessons/homogeneity-of-variance-test-in-r/).
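+
+As a brief sketch of these options (using simulated values, not our dataset), base R's `t.test()` defaults to the Welch version, and `var.test()` offers a quick F test of equal variances for normally distributed data:
+```{r 4-4-Two-Groups-welch-sketch}
+# Simulated example: two groups with clearly unequal variances
+set.seed(10)
+group_a <- rnorm(20, mean = 5, sd = 1)
+group_b <- rnorm(20, mean = 6, sd = 4)
+
+# F test for equality of variances (appropriate for normally distributed data)
+var.test(group_a, group_b)
+
+# Welch's t-test; var.equal = FALSE is the default in t.test(),
+# while var.equal = TRUE would instead run a Student's t-test
+t.test(group_a, group_b, var.equal = FALSE)
+```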
+
+**Paired vs Unpaired:** Variations on the t-test and Wilcoxon test are used when the experimental design is paired (also called repeated measures or matching). This occurs when there are different treatments, exposures, or time points collected from the same biological/experimental unit. For example, cells from the same donor or passage number exposed to different concentrations of a chemical represents a paired design. Matched/paired experiments have increased power to detect significant differences because samples can be compared back to their own controls.
+
+**One vs Two-Sided:** A one-sided test evaluates the hypothesis that the mean of the treatment group significantly differs in a specific direction from the control. A two-sided test evaluates the hypothesis that the mean of the treatment group significantly differs from the control but does not specify a direction for that change. A two-sided test is the preferred approach and the default in R because, typically, either direction of change is possible and represents an informative finding. However, one-sided tests may be appropriate if an effect can only possibly occur in one direction. This can be implemented by setting the `alternative` argument within the statistical testing function to `"less"` or `"greater"` (rather than the default, `"two.sided"`).
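+
+For example (with simulated values, not our dataset), the direction of a one-sided test in base R and *rstatix* is specified as `"less"` or `"greater"` relative to the first group:
+```{r 4-4-Two-Groups-one-sided-sketch}
+# Simulated example: we hypothesize that group_a values tend to be LOWER than group_b values
+set.seed(5)
+group_a <- rnorm(15, mean = 5)
+group_b <- rnorm(15, mean = 6)
+
+# alternative = "less" tests whether the first group is shifted below the second;
+# the default, "two.sided", tests for a shift in either direction
+wilcox.test(group_a, group_b, alternative = "less")
+```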
+
+### Which test should I choose?
+
+We provide the following flowchart to help guide your choice of statistical test to compare two groups:
+```{r 4-4-Two-Groups-3, echo = FALSE, fig.align = "center", out.width = "800px" }
+knitr::include_graphics("Chapter_4/4_4_Two_Groups/Module4_4_Image1.png")
+```
+
+
+
+## Statistical vs. Biological Significance
+
+Another important topic to discuss before proceeding to statistical testing is the true meaning of statistical significance. Statistical significance simply means that it is unlikely that the patterns being observed are due to random chance. However, just because an effect is statistically significant does not mean that it is biologically significant (i.e., has notable biological consequences). Often, there also needs to be a sufficient magnitude of effect (also called effect size) for the effects on a system to be meaningful. Although a p-value < 0.05 is often considered the threshold for significance, this is just a standard threshold set to a generally "acceptable" amount of error (5%). What about a p-value of 0.058 with a very large biological effect? Accounting for effect size is also why filters such as log~2~ fold change are often applied alongside p-value filters in -omics-based analyses.
+
+In discussions of effect size, the population size is also a consideration - a small percentage increase in a very large population can represent tens of thousands of individuals (or more). Another consideration is that we frequently do not know what magnitude of biological effect should be considered "significant." These discussions can get complicated very quickly, and here we do not propose to have a solution to these thought experiments; rather, we recommend considering both statistical and biological significance when interpreting data. And, as stated in other sections of TAME, transparent reporting of statistical results will aid the audience in interpreting the data through their preferred perspectives.
+
+
+
+## Unpaired Test Example
+
+We will start by performing a statistical test to determine whether there are significant differences in biomarker concentrations between male and female donors at baseline (0 ppm exposure). Previously, we determined that the majority of our data were non-normally distributed (see **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), so we'll skip testing for that assumption in this module. Based on those results, we will use the Wilcoxon test to determine if there are significant differences between groups. The Wilcoxon test does not assume homogeneity of variance, so we do not need to test for that prior to applying the test. This is an unpaired analysis because the cells derived from male donors and the cells derived from female donors are independent sets of samples. Thus, the specific statistical test applied will be the Wilcoxon rank sum test.
+
+First, we will filter our dataframe to only data representing the control (0 ppm) exposure:
+```{r 4-4-Two-Groups-4 }
+biomarker_data_malevsfemale <- biomarker_data %>% filter(Dose == "0")
+```
+
+Next, we need to add the demographic data to our dataframe:
+```{r 4-4-Two-Groups-5 }
+biomarker_data_malevsfemale <- biomarker_data_malevsfemale %>% left_join(demographic_data %>% select(Donor, Sex), by = "Donor")
+```
+
+Here is what our data look like now:
+```{r 4-4-Two-Groups-6 }
+datatable(biomarker_data_malevsfemale)
+```
+
+We can demonstrate the basic anatomy of the Wilcoxon test function `wilcox.test()` by running the function on just one variable.
+```{r 4-4-Two-Groups-7 }
+wilcox.test(IL1B ~ Sex, data = biomarker_data_malevsfemale)
+```
+The p-value of 0.8371 indicates that males and females do not have significantly different concentrations of IL-1$\beta$.
+
+The `wilcox.test()` function is part of the pre-loaded package *stats*. The package [*rstatix*](https://rpkgs.datanovia.com/rstatix/) provides identical statistical tests to *stats* but in a pipe-friendly (tidyverse-friendly) format, and these functions output results as dataframes rather than the text displayed above.
+```{r 4-4-Two-Groups-8 }
+biomarker_data_malevsfemale %>% wilcox_test(IL1B ~ Sex)
+```
+Here, we can see the exact same results as with the `wilcox.test()` function. For the rest of this module, we'll proceed with using the *rstatix* version of statistical testing functions.
+
+Although it is simple to run the Wilcoxon test with the code above, it's impractical for a large number of endpoints and doesn't store the results in an organized way. Instead, we can run the Wilcoxon test over every variable of interest using a `for` loop. There are also other ways you could approach this, such as a function applied over a list. This `for` loop runs the Wilcoxon test on each endpoint, stores the results in a dataframe, and then binds together the results dataframes for each variable of interest. Note that you could easily change `wilcox_test()` to `t_test()` and add additional arguments to modify the way the statistical test is run.
+```{r 4-4-Two-Groups-9, warning = FALSE}
+# Create a vector with the names of the variables you want to run the test on
+endpoints <- colnames(biomarker_data_malevsfemale %>% select(IL1B:VEGF))
+
+# Create dataframe to store results
+sex_wilcoxres <- data.frame()
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable.
+ endpoint <- endpoints[i]
+
+ # Run wilcox test and store in results dataframe.
+ res_df <- biomarker_data_malevsfemale %>%
+    wilcox_test(as.formula(paste0(endpoint, " ~ Sex")))
+
+ # Bind results from this test with other tests in this loop
+ sex_wilcoxres <- rbind(sex_wilcoxres, res_df)
+
+}
+
+# View results
+sex_wilcoxres
+```
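+
+As noted above, an alternative to the `for` loop is applying a function over the vector of endpoint names, for example with `map_dfr()` from *purrr* (loaded with the tidyverse). This sketch should produce the same results dataframe as the loop above:
+```{r 4-4-Two-Groups-map-sketch, warning = FALSE}
+# Run the Wilcoxon test for each endpoint and row-bind the resulting dataframes
+sex_wilcoxres_map <- map_dfr(endpoints, function(endpoint) {
+  biomarker_data_malevsfemale %>%
+    wilcox_test(as.formula(paste0(endpoint, " ~ Sex")))
+})
+
+# View results
+sex_wilcoxres_map
+```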
+
+:::question
+With this, we can answer **Environmental Health Question #1**:
+Are there significant differences in inflammatory biomarker concentrations between cells from male and female donors at baseline?
+:::
+
+:::answer
+**Answer**: There are no significant differences in concentrations of any of our biomarkers between male and female donors at baseline.
+:::
+
+
+
+### Adjusting for Multiple Hypothesis Testing
+
+Above, we compared concentrations between males and females for six different endpoints or variables. Each time we run a comparison (with a p-value threshold of < 0.05), we are accepting that there is a 5% chance that a significant result will actually be due to random chance and that we are rejecting the null hypothesis when it is actually true (type I error).
+
+Since we are testing six different hypotheses simultaneously, what is the probability then of observing at least one significant result due just to chance?
+
+$$\mathbb{P}(\text{at least one significant result}) = 1 - \mathbb{P}(\text{no significant results}) = 1 - (1 - 0.05)^{6} = 0.26$$
+
+Here, we can see that we have a 26% chance of observing at least one significant result, even if none of the tests are truly significant. This chance increases as our number of endpoints increases; therefore, adjusting for multiple hypothesis testing becomes even more important with larger datasets. Many methods exist for adjusting for multiple hypothesis testing, with some of the most popular including the Bonferroni correction and false discovery rate (FDR)-controlling procedures such as Benjamini-Hochberg (BH).
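+
+These corrections are available through base R's `p.adjust()` function. Here is a small sketch with hypothetical p-values (not from our dataset) illustrating how Bonferroni (more conservative) and BH (less conservative) rescale the same inputs:
+```{r 4-4-Two-Groups-padjust-sketch}
+# Hypothetical unadjusted p-values from six comparisons
+p_values <- c(0.001, 0.012, 0.030, 0.047, 0.200, 0.650)
+
+# Bonferroni: multiplies each p-value by the number of tests (capped at 1)
+p.adjust(p_values, method = "bonferroni")
+
+# Benjamini-Hochberg: controls the false discovery rate
+p.adjust(p_values, method = "BH")
+```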
+
+However, opinions about when and how to adjust for multiple hypothesis testing can vary and also depend on the question you are trying to answer. For example, when there are a low number of variables (e.g., < 10), it's often not necessary to adjust for multiple hypothesis testing, and when there are many variables (e.g., 100s to 1000s), it is necessary, but what about for an intermediate number of comparisons? Whether or not to apply multiple hypothesis test correction also depends on whether each endpoint is of interest on its own or whether the analysis seeks to make general statements about all of the endpoints together and on whether reducing type I or type II error is most important in the analysis.
+
+For this analysis, we will not adjust for multiple hypothesis testing due to our relatively low number of variables. For more on multiple hypothesis testing, check out the following publications:
+
++ Jafari M, Ansari-Pour N. Why, when and how to adjust your p values? Cell J (Yakhteh). 2018;20(4):604-607. doi: 10.22074/cellj.2019.5992. PMID: [30124010](https://www.celljournal.org/article_250554.html)
++ Feise RJ. Do multiple outcome measures require p-value adjustment? BMC Med Res Methodol. 2002;2:8. doi: 10.1186/1471-2288-2-8. PMID: [12069695](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-2-8#citeas)
+
+
+
+## Paired Test Example
+
+To demonstrate an example of a paired two group test, we can also determine whether exposure to 4 ppm acrolein significantly changes biomarker concentrations. This is now a paired design because each donor's cells were exposed to both 0 and 4 ppm acrolein.
+
+To prepare the data, we will filter the dataframe to only include 0 and 4 ppm:
+```{r 4-4-Two-Groups-10 }
+biomarker_data_0vs4 <- biomarker_data %>%
+ filter(Dose == "0" | Dose == "4")
+```
+
+Let's view the dataframe. Note how the measurements for each donor are next to each other - this is an important element of the default handling of paired analyses in R. The dataframe should have the donors in the same order for the 0 ppm and 4 ppm data.
+```{r 4-4-Two-Groups-11 }
+datatable(biomarker_data_0vs4)
+```
+
+We can now run the same type of loop that we ran before, changing the independent variable in the formula to `~ Dose` and adding `paired = TRUE` to the `wilcox_test()` function.
+```{r 4-4-Two-Groups-12 }
+# Create a vector with the names of the variables you want to run the test on
+endpoints <- colnames(biomarker_data_0vs4 %>% select(IL1B:VEGF))
+
+# Create dataframe to store results
+dose_wilcoxres <- data.frame()
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable.
+ endpoint <- endpoints[i]
+
+ # Run wilcox test and store in results dataframe.
+ res_df <- biomarker_data_0vs4 %>%
+    wilcox_test(as.formula(paste0(endpoint, " ~ Dose")),
+ paired = TRUE)
+
+ # Bind results from this test with other tests in this loop
+ dose_wilcoxres <- rbind(dose_wilcoxres, res_df)
+}
+
+# View results
+dose_wilcoxres
+```
+
+Although this dataframe contains useful information about our statistical test, such as the groups being compared, the sample size (*n*) of each group, and the test statistic, what we really want (and what would likely be shared in supplemental material) is a more simplified version of these results in table format, with more detailed information (*n*, specific statistical test, groups being compared) in the table legend. We can clean up the results using the following code, making clearer column names and ensuring that the p-values are formatted consistently.
+
+```{r 4-4-Two-Groups-13 }
+dose_wilcoxres <- dose_wilcoxres %>%
+ select(c(.y., p)) %>%
+ mutate(p = format(p, digits = 3, scientific = TRUE)) %>%
+ rename("Variable" = ".y.", "P-Value" = "p")
+
+datatable(dose_wilcoxres)
+```
+
+:::question
+With this, we can answer **Environmental Health Question #2**:
+
+Are there significant differences in inflammatory biomarker concentrations between cells exposed to 0 and 4 ppm acrolein?
+:::
+
+:::answer
+**Answer**: Yes, there are significant differences in IL-1$\beta$, IL-6, IL-8, TNF-$\alpha$, and VEGF concentrations between cells exposed to 0 and 4 ppm acrolein.
+:::
+
+
+
+## Visualizing Results
+
+Now, let's visualize our results using *ggplot2*. For an introduction to *ggplot2* visualizations, see **TAME 2.0 Modules 3.1 Data Visualizations** and **3.2 Improving Data Visualizations**, as well as the extensive online documentation available for *ggplot2*.
+
+### Single Plots
+We will start by making a very basic box and whisker plot of the IL-1$\beta$ data with individual data points overlaid. It is best practice to show all data points, allowing the reader to view the whole spread of the data, which can be obscured by plots such as bar plots with mean and standard error.
+```{r 4-4-Two-Groups-14, fig.align = "center"}
+# Setting theme for plot
+theme_set(theme_bw())
+
+# Making plot
+ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
+ geom_boxplot() +
+ geom_jitter(position = position_jitter(0.15))
+```
+
+We could add statistical markings to denote significance to this graph manually in PowerPoint or Adobe Illustrator, but there are actually R packages that act as extensions to *ggplot2* and will do this for you! Two of our favorites are [*ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/) and [*ggsignif*](https://cran.r-project.org/web/packages/ggsignif/vignettes/intro.html). Here is an example using *ggpubr*:
+```{r 4-4-Two-Groups-15, fig.align = "center"}
+ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
+ geom_boxplot() +
+ geom_jitter(position = position_jitter(0.15)) +
+ # Adding a p value from a paired Wilcoxon test
+ stat_compare_means(method = "wilcox.test", paired = TRUE)
+```
+
+We can further clean up our figure by modifying elements of the plot's theme, including the font sizes, axis range, colors, and the way that the statistical results are presented. Perfecting figures can be time consuming but ultimately worth it, because clear figures aid greatly in presenting a coherent story that is understandable to readers/listeners.
+```{r 4-4-Two-Groups-16, fig.align = "center"}
+ggplot(biomarker_data_0vs4, aes(x = Dose, y = IL1B)) +
+ # outlier.shape = NA removes outliers
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ # Changing box plot colors
+ scale_fill_manual(values = c("#BFBFBF", "#EE2B2B")) +
+ geom_jitter(size = 3, position = position_jitter(0.15)) +
+ # Adding a p value from a paired Wilcoxon test
+ stat_compare_means(method = "wilcox.test", paired = TRUE,
+ # Changing the value to asterisks and moving to the middle of the plot
+ label = "p.signif", label.x = 1.5, label.y = 4.5, size = 12) +
+ ylim(2.5, 5) +
+ # Changing y axis label
+ labs(y = "Log2(IL-1\u03B2 (pg/mL))") +
+ # Removing legend
+ theme(legend.position = "none",
+ axis.title = element_text(color = "black", size = 15),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 12))
+```
+
+### Multiple plots
+
+Making one plot was relatively straightforward, but to graph all of our endpoints, we would either need to repeat that code chunk for each individual biomarker or write a function to create similar plots given a specific biomarker as input. Then, we would need to stitch together the individual plots in external software or using a package such as [*patchwork*](https://patchwork.data-imaginist.com/) (which is a great package if you need to combine individual figures from different sources or different size ratios!).
+
+While these are workable solutions and would get us to the same place, *ggplot2* actually contains a function - `facet_wrap()` - that can be used to graph multiple endpoints from the same groups in one figure panel, which takes care of a lot of the work for us!
+
+To prepare our data for facet plotting, first we will pivot it longer:
+```{r 4-4-Two-Groups-17 }
+biomarker_data_0vs4_long <- biomarker_data_0vs4 %>%
+ pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
+
+datatable(biomarker_data_0vs4_long)
+```
+
+Then, we can use similar code to what we used to make our single graph, with a few modifications to plot multiple panels simultaneously and adjust the style of the plot. Although it is beyond the scope of this module to explain the mechanics of each line of code, here are a few specific things to note about the code below that may be helpful when constructing similar plots:
+
+- To create the plot with all six endpoints instead of just one, we:
+ - Changed input dataframe from wide to long format
+ - Changed `y =` from one specific endpoint to `value`
+ - Added the `facet_wrap()` argument
+ - `~ variable` tells the function to make an individual plot for each variable
+ - `nrow = 2 ` tells the function to put the plots into two rows
+ - `scales = "free_y"` tells the function to allow each individual graph to have a unique y-scale that best shows all of the data on that graph
+ - `labeller` feeds the edited (more stylistically correct) names for each panel to the function
+
+- To ensure that the statistical results appear cleanly, within `stat_compare_means()`, we:
+ - Added `hide.ns = TRUE` so that only significant results are shown
+ - Added `label.x.npc = "center"` and `hjust = 0.5` to ensure that asterisks are centered on the plot and that the text is center justified
+
+- To add padding along the y axis, allowing space for significance asterisks, we added `scale_y_continuous(expand = expansion(mult = c(0.1, 0.4)))`
+
+```{r 4-4-Two-Groups-18, warning = FALSE, fig.align = "center"}
+# Create clean labels for the graph titles
+new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
+ "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
+
+# Make graph
+ggplot(biomarker_data_0vs4_long, aes(x = Dose, y = value)) +
+ # outlier.shape = NA removes outliers
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ # Changing box plot colors
+ scale_fill_manual(values = c("#BFBFBF", "#EE2B2B")) +
+ geom_jitter(size = 1.5, position = position_jitter(0.15)) +
+ # Adding a p value from a paired Wilcoxon test
+ stat_compare_means(method = "wilcox.test", paired = TRUE,
+ # Changing the value to asterisks and moving to the middle of the plot
+ label = "p.signif", size = 10, hide.ns = TRUE, label.x.npc = "center",
+ hjust = 0.5) +
+ # Adding padding y axis
+ scale_y_continuous(expand = expansion(mult = c(0.1, 0.4))) +
+ # Changing y axis label
+  ylab(expression(Log[2]*"(Concentration (pg/mL))")) +
+ # Faceting by each biomarker
+ facet_wrap(~ variable, nrow = 2, scales = "free_y", labeller = labeller(variable = new_labels)) +
+ # Removing legend
+ theme(legend.position = "none",
+ axis.title = element_text(color = "black", size = 12),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 10),
+ strip.text = element_text(size = 12, face = "bold"))
+```
+
+An appropriate title for this figure could be:
+
+"**Figure X. Exposure to 4 ppm acrolein increases inflammatory biomarker secretion in primary human bronchial epithelial cells.** Groups were compared using the Wilcoxon signed rank test. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001, *n* = 16 per group (paired)."
+
+
+
+## Concluding Remarks
+
+In this module, we introduced two group statistical tests, which are some of the most common statistical tests applied in biomedical research. We applied these tests to our example dataset and demonstrated how to produce publication-quality tables and figures of our results. Implementing a workflow such as this enables efficient analysis of wet-bench generated data and customization of output figures and tables suited to your personal preferences.
+
+
+
+
+
+:::tyk
+Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two are non-normally distributed. Due to the relatively low *n* of this dataset, we therefore recommend using non-parametric statistical tests.
+
+Using the same processes demonstrated in this module and the provided data (“Module4_4_TYKInput1.xlsx” (functional data) and “Module4_4_TYKInput2.xlsx” (demographic data)), run analyses and make publication-quality figures and tables to answer the following questions:
+
+1. Are there significant differences in functional endpoints between cells from male and female donors at baseline?
+2. Are there significant differences in functional endpoints between cells exposed to 0 and 4 ppm acrolein? Go ahead and use non-parametric tests for these analyses.
+:::
diff --git a/Chapter_4/Module4_4_Input/Module4_4_Image1.png b/Chapter_4/4_4_Two_Groups/Module4_4_Image1.png
similarity index 100%
rename from Chapter_4/Module4_4_Input/Module4_4_Image1.png
rename to Chapter_4/4_4_Two_Groups/Module4_4_Image1.png
diff --git a/Chapter_4/Module4_4_Input/Module4_4_InputData1.xlsx b/Chapter_4/4_4_Two_Groups/Module4_4_InputData1.xlsx
similarity index 100%
rename from Chapter_4/Module4_4_Input/Module4_4_InputData1.xlsx
rename to Chapter_4/4_4_Two_Groups/Module4_4_InputData1.xlsx
diff --git a/Chapter_4/Module4_4_Input/Module4_4_InputData2.xlsx b/Chapter_4/4_4_Two_Groups/Module4_4_InputData2.xlsx
similarity index 100%
rename from Chapter_4/Module4_4_Input/Module4_4_InputData2.xlsx
rename to Chapter_4/4_4_Two_Groups/Module4_4_InputData2.xlsx
diff --git a/Chapter_4/4_5_Multiple_Groups/4_5_Multiple_Groups.Rmd b/Chapter_4/4_5_Multiple_Groups/4_5_Multiple_Groups.Rmd
new file mode 100644
index 0000000..e703e2c
--- /dev/null
+++ b/Chapter_4/4_5_Multiple_Groups/4_5_Multiple_Groups.Rmd
@@ -0,0 +1,438 @@
+
+# 4.5 Multi-Group and Multi-Variable Comparisons and Visualizations
+
+This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+In the previous module, we covered how to apply two-group statistical testing, one of the most basic types of statistical tests. In this module, we will build on the concepts introduced previously to apply statistical testing to datasets with more than two groups, which are also very common in environmental health research. We will review common multi-group overall effects tests and post-hoc tests, and we will demonstrate how to apply these tests and how to graph the results using the same example dataset as in previous modules in this chapter, which represents concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are there significant differences in inflammatory biomarker concentrations between different doses of acrolein?
+2. Do TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein?
+
+### Workspace Preparation and Data Import
+
+Here, we will import the processed data that we generated at the end of **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**, along with the associated demographic data; both datasets were introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data**. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load packages that will be needed for the analysis, including previously introduced packages such as *openxlsx*, *tidyverse*, *DT*, *ggpubr*, and *rstatix*.
+
+#### Cleaning the global environment
+```{r 4-5-Multiple-Groups-1, echo=TRUE, eval=TRUE}
+rm(list=ls())
+```
+
+#### Loading R packages required for this session
+```{r 4-5-Multiple-Groups-2, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
+library(openxlsx)
+library(tidyverse)
+library(DT)
+library(rstatix)
+library(ggpubr)
+```
+
+#### Set your working directory
+```{r 4-5-Multiple-Groups-3, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+```{r 4-5-Multiple-Groups-4, echo=TRUE, eval=TRUE}
+biomarker_data <- read.xlsx("Chapter_4/4_5_Multiple_Groups/Module4_5_InputData1.xlsx")
+demographic_data <- read.xlsx("Chapter_4/4_5_Multiple_Groups/Module4_5_InputData2.xlsx")
+
+# View data
+datatable(biomarker_data)
+datatable(demographic_data)
+```
+
+
+## Overview of Multi-Group Statistical Tests
+
+Before applying statistical tests to our data, let's first review the mechanics of multi-group statistical tests, including overall effects tests and post-hoc tests.
+```{r 4-5-Multiple-Groups-5, echo = FALSE, fig.align = "center", out.width = "600px" }
+knitr::include_graphics("Chapter_4/4_5_Multiple_Groups/Module4_5_Image1.png")
+```
+
+### Overall Effects Tests
+
+The first step for multi-group statistical testing is to run an overall effects test. The null hypothesis for the overall effects test is that there are no differences among group means. A significant p-value rejects the null hypothesis that the groups are drawn from populations with the same mean and indicates that at least one group mean differs significantly from the others. Similar to two-group statistical testing, the choice of the specific overall statistical test to run depends on whether the data are normally or non-normally distributed and whether the experimental design is paired:
+
+```{r 4-5-Multiple-Groups-6, echo = FALSE, fig.align = "center", out.width = "700px" }
+knitr::include_graphics("Chapter_4/4_5_Multiple_Groups/Module4_5_Image2.png")
+```
+
+Importantly, overall effects tests return **one** p-value regardless of the number of groups being compared. To determine which pairwise comparisons are significant, post-hoc testing is needed.
+
+### Post-Hoc Testing
+
+If significance is obtained with an overall effects test, we can use post-hoc testing to determine which specific pairs of groups are significantly different from each other. Just as with two-group statistical tests and overall effects multi-group tests, choosing the appropriate post-hoc test depends on the data's normality and whether the experimental design is paired:
+```{r 4-5-Multiple-Groups-7, echo = FALSE, fig.align = "center", out.width = "700px" }
+knitr::include_graphics("Chapter_4/4_5_Multiple_Groups/Module4_5_Image3.png")
+```
+
+Note that the above diagram represents commonly selected post-hoc tests; others may also be appropriate depending on your specific experimental design. As with other aspects of the analysis, be sure to report which post-hoc test(s) you performed!
+
+### Correcting for Multiple Hypothesis Testing
+
+Correcting for multiple hypothesis testing is important for both the overall effects test (if you are running it over many endpoints) and post-hoc tests; however, it is particularly important for post-hoc tests. This is because even an analysis of a relatively small number of experimental groups results in quite a few pairwise comparisons. Comparing each of our five dose groups to each other in our example data requires 10 separate statistical tests! Therefore, it is generally advisable to adjust pairwise post-hoc testing p-values. Tukey's HSD function within *rstatix* does this automatically, while pairwise t-tests, pairwise Wilcoxon tests, and Dunn's test do not. P-value adjustment can be added to their respective *rstatix* functions using the `p.adjust.method = ` argument.
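+
+To see what this adjustment looks like in practice, here is a minimal sketch (not evaluated here) using base R's `p.adjust()`, the same machinery behind the `p.adjust.method = ` argument, applied to a hypothetical vector of ten raw p-values, one per pairwise comparison in a five-group design:
+
+```{r p-adjust-sketch, eval=FALSE}
+# Hypothetical raw p-values from 10 pairwise comparisons (illustration only)
+p_raw <- c(0.001, 0.004, 0.012, 0.020, 0.031, 0.046, 0.050, 0.120, 0.450, 0.800)
+
+# Benjamini-Hochberg (FDR) adjustment, as used later in this module
+p.adjust(p_raw, method = "BH")
+
+# A more conservative alternative for comparison
+p.adjust(p_raw, method = "bonferroni")
+```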
+
+When applying a post-hoc test, you may choose to compare every group to every other group, or you may only be interested in significant differences between specific groups (e.g., treatment groups vs. a control). This choice will be governed by your hypothesis. Statistical testing functions will typically default to comparing all groups to each other, but the comparisons can be defined using the `comparisons = ` argument if you want to restrict the test to specific comparisons. It is important to decide at the beginning of your analysis which comparisons are relevant to your hypothesis because the number of pairwise tests performed in the post-hoc analysis will influence how much the resulting p-values will be adjusted for multiple hypothesis testing.
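+
+As an illustrative sketch (not evaluated here), restricting our example post-hoc test to treatment-versus-control comparisons could look like the following; *rstatix* also provides a `ref.group` argument as a shortcut for this common case:
+
+```{r comparisons-sketch, eval=FALSE}
+# Sketch: limit pairwise tests to each dose vs. the 0 ppm control
+biomarker_data %>%
+  pairwise_wilcox_test(IL1B ~ Dose, paired = TRUE,
+                       comparisons = list(c("0", "0.6"), c("0", "1"),
+                                          c("0", "2"), c("0", "4")),
+                       p.adjust.method = "BH")
+
+# Equivalent shortcut: compare every group to the "0" reference group
+biomarker_data %>%
+  pairwise_wilcox_test(IL1B ~ Dose, paired = TRUE,
+                       ref.group = "0", p.adjust.method = "BH")
+```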
+
+### Which test should I choose?
+
+Use the following flowchart to help guide your choice of statistical test to compare multiple groups:
+```{r 4-5-Multiple-Groups-8, echo = FALSE, fig.align = "center", out.width = "900px" }
+knitr::include_graphics("Chapter_4/4_5_Multiple_Groups/Module4_5_Image4.png")
+```
+
+
+
+## Multi-Group Analysis Example
+
+To determine whether there are significant differences across all of our doses, the Friedman test is the most appropriate due to our matched experimental design and non-normally distributed data. The `friedman_test()` function is part of the [rstatix](https://github.com/kassambara/rstatix) package. This package also has many other helpful functions for statistical tests that are pipe/tidyverse friendly. To demonstrate how this test works, we will first perform the test on one variable:
+```{r 4-5-Multiple-Groups-9 }
+biomarker_data %>% friedman_test(IL1B ~ Dose | Donor)
+```
+
+A p-value of 0.01 indicates that we can reject the null hypothesis that all of our dose groups are drawn from populations with the same distribution, meaning at least one group differs from the others.
+
+Now, we can run a `for` loop similar to our two-group comparisons in **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations** to determine the overall p-value for each endpoint:
+```{r 4-5-Multiple-Groups-10 }
+# Create a vector with the names of the variables you want to run the test on
+endpoints <- colnames(biomarker_data %>% select(IL1B:VEGF))
+
+# Create data frame to store results
+dose_friedmanres <- data.frame()
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable.
+ endpoint <- endpoints[i]
+
+  # Run Friedman test and store in results data frame.
+ res <- biomarker_data %>%
+      friedman_test(as.formula(paste0(endpoint, " ~ Dose | Donor"))) %>%
+ select(c(.y., p))
+
+ dose_friedmanres <- rbind(dose_friedmanres, res)
+}
+
+# View results
+datatable(dose_friedmanres)
+```
+
+These results demonstrate that all of our endpoints have significant overall differences across doses (p < 0.05). To determine which pairwise comparisons are significant, we next need to apply a post-hoc test. We will apply a pairwise, paired Wilcoxon test due to our experimental design and data distribution, with the Benjamini-Hochberg (BH) correction for multiple testing:
+```{r 4-5-Multiple-Groups-11 }
+dose_wilcox_posthoc_IL1B <- biomarker_data %>%
+ pairwise_wilcox_test(IL1B ~ Dose, paired = TRUE, p.adjust.method = "BH")
+
+dose_wilcox_posthoc_IL1B
+```
+
+Here, we can now see whether there are statistically significant differences in IL-1$\beta$ secretion between each of our doses. To generate pairwise comparison results for each of our inflammatory biomarkers, we can run a for loop similar to the one we ran for our overall test:
+```{r 4-5-Multiple-Groups-12 }
+# Create a vector with the names of the variables you want to run the test on
+endpoints <- colnames(biomarker_data %>% select(IL1B:VEGF))
+
+# Create data frame to store results
+dose_wilcox_posthoc <- data.frame()
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable.
+ endpoint <- endpoints[i]
+
+ # Run wilcox test and store in results data frame.
+ res <- biomarker_data %>%
+    pairwise_wilcox_test(as.formula(paste0(endpoint, " ~ Dose")), paired = TRUE, p.adjust.method = "BH")
+
+ dose_wilcox_posthoc <- rbind(dose_wilcox_posthoc, res)
+}
+
+# View results
+datatable(dose_wilcox_posthoc)
+```
+
+We now have a dataframe storing all of our pairwise comparison results. However, this is a lot to scroll through, making it hard to interpret. We can generate a publication-quality table by manipulating the table and joining it with the overall test data.
+```{r 4-5-Multiple-Groups-13 }
+dose_results_cleaned <- dose_wilcox_posthoc %>%
+ unite(comparison, group1, group2, sep = " vs. ") %>%
+ select(c(.y., comparison, p.adj)) %>%
+ pivot_wider(id_cols = ".y.", names_from = "comparison", values_from = "p.adj") %>%
+ left_join(dose_friedmanres, by = ".y.") %>%
+ relocate(p, .after = ".y.") %>%
+ rename("Variable" = ".y.", "Overall" = "p") %>%
+ mutate(across('Overall':'2 vs. 4', \(x) format(x, scientific = TRUE, digits = 3)))
+
+datatable(dose_results_cleaned)
+```
+
+To more easily see overall significance patterns, we could also make the same table but with significance stars instead of p-values by keeping the `p.adj.signif` column instead of the `p.adj` column from our post-hoc test results dataframe:
+```{r 4-5-Multiple-Groups-14 }
+dose_results_cleaned_2 <- dose_wilcox_posthoc %>%
+ unite(comparison, group1, group2, sep = " vs. ") %>%
+ select(c(.y., comparison, p.adj.signif)) %>%
+ pivot_wider(id_cols = ".y.", names_from = "comparison", values_from = "p.adj.signif") %>%
+ left_join(dose_friedmanres, by = ".y.") %>%
+ relocate(p, .after = ".y.") %>%
+ rename("Variable" = ".y.", "Overall" = "p") %>%
+  mutate(Overall = format(Overall, scientific = TRUE, digits = 3))
+
+datatable(dose_results_cleaned_2)
+```
+
+### Answer to Environmental Health Question 1
+:::question
+  With this, we can answer **Environmental Health Question #1**: Are there significant differences in inflammatory biomarker concentrations between different doses of acrolein?
+:::
+
+:::answer
+**Answer**: Yes, there are significant differences in inflammatory biomarker concentrations between different doses of acrolein. The overall p-values for all biomarkers are significant. Within each biomarker, at least one pairwise comparison was significant between doses, with a majority of these significant comparisons being with the highest dose (4 ppm).
+:::
+
+
+
+## Visualization of Multi-Group Statistical Results
+
+The statistical results we generated are a lot to digest in table format, so it can be helpful to graph the results. As our statistical testing becomes more complicated, so does the code used to generate results. The *ggpubr* package can perform statistical testing and overlay the results onto graphs for a specific set of tests, such as overall effects tests and unpaired t-tests or Wilcoxon tests. However, for tests that aren't available by default, the package also contains the helpful `stat_pvalue_manual()` function that can be added to plots. This is what we will need to use to add the results of our pairwise, paired Wilcoxon test with BH correction, as there is no option for BH correction within the default function we might otherwise use (`stat_compare_means()`). We will first work through an example of this using one of our endpoints, and then we will demonstrate how to apply it to facet plotting.
+
+### Single Plot
+
+We first need to format our existing statistical results so that they match the format that the function needs as input. Specifically, the dataframe needs to contain the following columns:
+
++ `group1` and `group2`: the groups being compared
++ A column containing the results you want displayed (`p`, `p.adj`, or `p.adj.signif` typically)
++ `y.position`, which tells the function where to plot the significance markers
+
+Our results dataframe for IL-1$\beta$ already contains our groups and p-values:
+```{r 4-5-Multiple-Groups-15 }
+datatable(dose_wilcox_posthoc_IL1B)
+```
+
+We can add the position columns using the function `add_xy_position()`:
+
+```{r 4-5-Multiple-Groups-16 }
+dose_wilcox_posthoc_IL1B <- dose_wilcox_posthoc_IL1B %>%
+ add_xy_position(x = "Dose", step.increase = 2)
+
+datatable(dose_wilcox_posthoc_IL1B)
+```
+
+Now, we are ready to make a graph of our results. We will use `stat_friedman_test()` to add our overall p-value and `stat_pvalue_manual()` to add our pairwise p-values.
+```{r 4-5-Multiple-Groups-17, out.width = "600px", message = FALSE, fig.align = "center"}
+# Set graphing theme
+theme_set(theme_bw())
+
+# Make plot
+ggplot(biomarker_data, aes(x = Dose, y = IL1B)) +
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
+ geom_jitter(size = 3, position = position_jitter(0.15)) +
+ stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
+ label.x.npc = "left", label.y = 9.5, hjust = 0.5, size = 6) +
+ stat_pvalue_manual(dose_wilcox_posthoc_IL1B, label = "p.adj.signif", size = 12, hide.ns = TRUE) +
+ ylim(2.5, 10) +
+ labs(y = "Log2(IL-1\u03B2 (pg/mL))", x = "Acrolein (ppm)") +
+ theme(legend.position = "none",
+ axis.title = element_text(color = "black", size = 15),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 12))
+```
+
+However, making room for all of these annotations compresses the data, which makes the underlying distributions difficult to see. Although presentation of statistical results is largely a matter of personal preference, we can clean up this plot by placing the annotations above the boxes and indicating in the figure legend which dose each comparison is made against. We will do this by:
+
+1. Filtering our results to those that are significant.
+2. Changing the symbol for comparisons that are not to the 0 dose.
+3. Layering this text onto the plot with `geom_text()` rather than `stat_pvalue_manual()`.
+
+First, let's filter our results to the significant comparisons and change the symbol for comparisons that are not to the 0 dose to a caret (^) instead of stars. We can do this by creating a new column called `label` that keeps the existing label if `group1` is 0 and otherwise changes it to a caret of the same length. We then use `summarise()` to paste the labels for each group together, resulting in a final dataframe containing the annotations for our plot.
+
+```{r 4-5-Multiple-Groups-18 }
+dose_wilcox_posthoc_IL1B_2 <- dose_wilcox_posthoc_IL1B %>%
+
+ # Filter results to those that are significant
+ filter(p.adj <= 0.05) %>%
+
+ # Make new symbol
+ mutate(label = ifelse(group1 == "0", p.adj.signif, strrep("^", nchar(p.adj.signif)))) %>%
+
+ # Select only the columns we need
+ select(c(group1, group2, label)) %>%
+
+ # Combine symbols for the same group
+ group_by(group2) %>% summarise(label = paste(label, collapse=" ")) %>%
+
+ # Remove duplicate row
+ distinct(group2, .keep_all = TRUE) %>%
+
+ # Rename group2 to dose
+ rename("Dose" = "group2")
+
+dose_wilcox_posthoc_IL1B_2
+```
+
+Then, we can use the same code as for our previous plot, but instead of using `stat_pvalue_manual()`, we will use `geom_text()` in combination with the dataframe we just created.
+```{r 4-5-Multiple-Groups-19, out.width = "600px", fig.align = "center"}
+ggplot(biomarker_data, aes(x = Dose, y = IL1B)) +
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
+ geom_jitter(size = 3, position = position_jitter(0.15)) +
+ stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
+ label.x.npc = "left", label.y = 4.85, hjust = 0.5, size = 6) +
+ geom_text(data = dose_wilcox_posthoc_IL1B_2, aes(x = Dose, y = 4.5,
+ label = paste0(label)), size = 10, hjust = 0.5) +
+ ylim(2.5, 5) +
+ labs(y = "Log2(IL-1\u03B2 (pg/mL))", x = "Acrolein (ppm)") +
+ theme(legend.position = "none",
+ axis.title = element_text(color = "black", size = 15),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 12))
+```
+
+An appropriate title for this figure could be:
+
+"**Figure X. Exposure to 0.6-4 ppm acrolein increases IL-1$\beta$ secretion in primary human bronchial epithelial cells.** Groups were compared using the Friedman test to obtain overall p-value and Wilcoxon signed rank test for post-hoc testing. * p < 0.05 in comparison with 0 ppm, ^ p < 0.05 in comparison with 0.6 ppm, n = 16 per group (paired)."
+
+
+### Faceted Plot
+
+Ideally, we would extend this sort of graphical approach to our faceted plot showing all of our endpoints. However, there are quite a few statistically significant comparisons to graph, including comparisons that are significant between different pairs of doses (not just back to the control). While we could attempt to graph all of them, ultimately, this will lead to a cluttered figure panel. When thinking about how to simplify our plots, some options are:
+
+1. Instead of using the number of symbols to represent p-value thresholds, we could use a single symbol to mark any comparison with p < 0.05, with the symbol varying depending on which group the significance is in comparison to. Symbols can be difficult to parse in R, so we could use letters or even the group names above the column of interest. For example, if the concentration of an endpoint at 2 ppm was significant in comparison with both 0 and 0.6 ppm, we could annotate "0, 0.6" above the 2 ppm column, or we could choose letters ("a, b") or symbols ("*, ^") to convey these results.
+
+2. If the pattern is the same across many of the endpoints measured, we could graph a subset of the endpoints with the most notable data trends or the most biological meaning for the main body of the manuscript, with data for additional endpoints referred to in the text and shown in the supplemental figures or tables.
+
+3. If most of the significant comparisons are back to the control group, we could choose to only show comparisons with the control group, with textual description of the other significant comparisons and indication that those specific p-values can be viewed in the supplemental table of results.
+
+Which approach you decide to take (or maybe another approach altogether) is a matter of both personal preference and your specific study goals. You may also decide that it is important to you to show all significant comparisons, which will require more careful formatting of the plots to ensure that all text and annotations are legible. For this module, we will proceed with option #3 because many of our comparisons to the control dose (0) are significant, and we have enough groups that there likely will not be space to annotate all of them above our data.
+
+We will take similar steps here that we did when constructing our single endpoint graph, with a couple of small differences. Specifically, we need to:
+
+1. Create a dataframe of labels/annotations as we did above, but now filtered to only significant comparisons with the 0 group.
+2. Add to the label/annotation dataframe what we want the y position for each of the labels to be, which will be different for each endpoint.
+
+First, let's create our annotations dataframe. We will start with the results dataframe from our posthoc testing:
+```{r 4-5-Multiple-Groups-20 }
+datatable(dose_wilcox_posthoc)
+```
+
+```{r 4-5-Multiple-Groups-21 }
+dose_wilcox_posthoc_forgraph <- dose_wilcox_posthoc %>%
+
+ filter(p.adj <= 0.05) %>%
+
+ # Filter for only comparisons to 0
+ filter(group1 == "0") %>%
+
+ # Rename columns
+ rename("variable" = ".y.", "Dose" = "group2")
+
+datatable(dose_wilcox_posthoc_forgraph)
+```
+
+The `Dose` column will be used to tell *ggplot2* where to place the annotations on the x axis, but we need to also specify where to add the annotations on the y axis. This will be different for each variable because each variable is on a different scale. We can approach this by computing the maximum value of each variable, then increasing that by 20% to add some space on top of the points.
+
+```{r 4-5-Multiple-Groups-22 }
+sig_labs_y <- biomarker_data %>%
+ summarise(across(IL1B:VEGF, \(x) max(x))) %>%
+ t() %>% as.data.frame() %>%
+ rownames_to_column("variable") %>%
+ rename("y_pos" = "V1") %>%
+ mutate(y_pos = y_pos*1.2)
+
+sig_labs_y
+```
+
+Then, we can join these data to our labeling dataframe to complete what we need to make the annotations.
+```{r 4-5-Multiple-Groups-23 }
+dose_wilcox_posthoc_forgraph <- dose_wilcox_posthoc_forgraph %>%
+ left_join(sig_labs_y, by = "variable")
+```
+
+Now, it's time to graph! Keep in mind that although the plotting script can get long and unwieldy, each line is just a new instruction to *ggplot2* about a formatting element or an additional layer to add to the graph.
+```{r 4-5-Multiple-Groups-24, out.width = "800px", fig.align = "center"}
+# Pivot data longer
+biomarker_data_long <- biomarker_data %>%
+ pivot_longer(-c(Donor, Dose), names_to = "variable", values_to = "value")
+
+# Create clean labels for the graph titles
+new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
+ "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
+
+# Make graph
+ggplot(biomarker_data_long, aes(x = Dose, y = value)) +
+ # outlier.shape = NA removes outliers
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ # Changing box plot colors
+ scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
+ geom_jitter(size = 1.5, position = position_jitter(0.15)) +
+ # Adding a p value from Friedman test
+ stat_friedman_test(wid = "Donor", p.adjust.method = "none", label = "p = {p.format}",
+ label.x.npc = "left", vjust = -3.5, hjust = 0.1, size = 3.5) +
+ # Add label
+  geom_text(data = dose_wilcox_posthoc_forgraph, aes(x = Dose, y = y_pos, label = p.adj.signif),
+            size = 5, hjust = 0.5) +
+ # Adding padding y axis
+ scale_y_continuous(expand = expansion(mult = c(0.1, 0.6))) +
+ # Changing y axis label
+ ylab(expression(Log[2]*"(Concentration (pg/ml))")) +
+ # Changing x axis label
+ xlab("Acrolein (ppm)") +
+ # Faceting by each biomarker
+ facet_wrap(~ variable, nrow = 2, scales = "free_y", labeller = labeller(variable = new_labels)) +
+ # Removing legend
+ theme(legend.position = "none",
+ axis.title = element_text(color = "black", size = 12),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 10),
+ strip.text = element_text(size = 12, face = "bold"))
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Exposure to acrolein increases secretion of proinflammatory biomarkers in primary human bronchial epithelial cells.** Groups were compared using the Friedman test to obtain the overall p-value and the Wilcoxon signed-rank test for post-hoc testing. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001 for comparison with control. For additional significant comparisons, see Supplemental Table X. n = 16 per group (paired).”
+
+### Answer to Environmental Health Question 2
+:::question
+  With this, we can answer **Environmental Health Question #2**: Do TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein?
+:::
+
+:::answer
+**Answer**: Yes, TNF-$\alpha$ concentrations significantly increase with increasing dose of acrolein, which we were able to visualize, along with other mediators, in our facet plot.
+:::
+
+
+
+## Concluding Remarks
+
+In this module, we introduced common multi-group statistical tests, including both overall effects tests and post-hoc testing. We applied these tests to our example dataset and demonstrated how to produce publication-quality tables and figures of our results. Implementing a workflow such as this enables efficient analysis of wet-bench generated data and customization of output figures and tables suited to your personal preferences.
+
+### Additional Resources
+
+- [STHDA: How to Add P-Values and Significance Levels to ggplots using *ggpubr*](http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/)
+- [Adding p-values with *ggprism*](https://cran.r-project.org/web/packages/ggprism/vignettes/pvalues.html)
+- [Overview of *ggsignif*](https://const-ae.github.io/ggsignif/)
+
+
+
+
+
+:::tyk
+
+Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two non-normally distributed.
+
+Use the same processes demonstrated in this module and the provided data (“Module4_5_TYKInput.xlsx” (functional data)) to run analyses and make a publication-quality figure panel and table to answer the following question: Are there significant differences in functional endpoints between cells treated with different concentrations of acrolein?
+
+For an extra challenge, try also making your faceted plot in the style of option #1 above, with different symbols, letters, or group names above columns to indicate which group that column is significant in comparison with.
+:::
diff --git a/Chapter_4/Module4_5_Input/Module4_5_Image1.png b/Chapter_4/4_5_Multiple_Groups/Module4_5_Image1.png
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_Image1.png
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_Image1.png
diff --git a/Chapter_4/Module4_5_Input/Module4_5_Image2.png b/Chapter_4/4_5_Multiple_Groups/Module4_5_Image2.png
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_Image2.png
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_Image2.png
diff --git a/Chapter_4/Module4_5_Input/Module4_5_Image3.png b/Chapter_4/4_5_Multiple_Groups/Module4_5_Image3.png
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_Image3.png
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_Image3.png
diff --git a/Chapter_4/Module4_5_Input/Module4_5_Image4.png b/Chapter_4/4_5_Multiple_Groups/Module4_5_Image4.png
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_Image4.png
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_Image4.png
diff --git a/Chapter_4/Module4_5_Input/Module4_5_Image5.png b/Chapter_4/4_5_Multiple_Groups/Module4_5_Image5.png
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_Image5.png
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_Image5.png
diff --git a/Chapter_4/Module4_5_Input/Module4_5_InputData1.xlsx b/Chapter_4/4_5_Multiple_Groups/Module4_5_InputData1.xlsx
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_InputData1.xlsx
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_InputData1.xlsx
diff --git a/Chapter_4/Module4_5_Input/Module4_5_InputData2.xlsx b/Chapter_4/4_5_Multiple_Groups/Module4_5_InputData2.xlsx
similarity index 100%
rename from Chapter_4/Module4_5_Input/Module4_5_InputData2.xlsx
rename to Chapter_4/4_5_Multiple_Groups/Module4_5_InputData2.xlsx
diff --git a/Chapter_4/4_6_Advanced_Multiple_Groups/4_6_Advanced_Multiple_Groups.Rmd b/Chapter_4/4_6_Advanced_Multiple_Groups/4_6_Advanced_Multiple_Groups.Rmd
new file mode 100644
index 0000000..77d1efe
--- /dev/null
+++ b/Chapter_4/4_6_Advanced_Multiple_Groups/4_6_Advanced_Multiple_Groups.Rmd
@@ -0,0 +1,509 @@
+
+# 4.6 Advanced Multi-Group Comparisons
+
+This training module was developed by Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+In the previous module, we covered how to apply multi-group statistical testing, in which we tested for significant differences in endpoints across different values for one independent variable. In this module, we will build on the concepts introduced previously to test for significant differences in endpoints while considering two or more independent variables. We will review relevant statistical approaches and demonstrate how to apply these tests using the same example dataset as in previous modules in this chapter. As a reminder, this dataset includes concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to different concentrations of acrolein.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Are there significant differences in inflammatory biomarker concentrations between sex and different doses of acrolein?
+
+2. Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?
+
+### Workspace Preparation and Data Import
+
+Here, we will import the processed data that we generated at the end of TAME 2.0 Module 4.2, introduced in **TAME 2.0 Module 4.1 Overview of Experimental Design and Example Data** and associated demographic data. These data represent log~2~ concentrations of inflammatory biomarkers secreted by airway epithelial cells after exposure to four different concentrations of acrolein (plus filtered air as a control). We will also load packages that will be needed for the analysis, including previously introduced packages such as *openxlsx*, *tidyverse*, *DT*, *ggpubr*, and *rstatix*.
+
+#### Cleaning the global environment
+```{r 4-6-Advanced-Multiple-Groups-1, echo=TRUE, eval=FALSE}
+rm(list=ls())
+```
+
+#### Loading R packages required for this session
+```{r 4-6-Advanced-Multiple-Groups-2, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
+library(openxlsx)
+library(tidyverse)
+library(DT)
+library(rstatix)
+library(ggpubr)
+library(multcomp)
+library(pander)
+
+theme_set(theme_bw()) # Set graphing theme
+```
+
+#### Set your working directory
+```{r 4-6-Advanced-Multiple-Groups-3, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+```{r 4-6-Advanced-Multiple-Groups-4, echo=TRUE, eval=TRUE}
+biomarker_data <- read.xlsx("Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData1.xlsx")
+demographic_data <- read.xlsx("Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData2.xlsx")
+
+# View data
+datatable(biomarker_data)
+datatable(demographic_data)
+```
+
+## Advanced Multi-Group Comparisons
+
+### Two-way ANOVA
+The first test that we'll introduce is a **two-way ANOVA**. This test evaluates mean differences in a continuous dependent variable across two categorical independent variables. (As a refresher, a one-way ANOVA uses a single independent variable to compare mean differences between groups.) Subjects or samples can be matched based upon their between-group factors (e.g., exposure duration) and/or their within-group factors (e.g., batch effects). Models that include both between-group and within-group factors are known as **mixed two-way ANOVAs**.
+
+Like other parametric tests, two-way ANOVAs assume:
+
++ Homogeneity of variance
++ Independent observations
++ Normal distribution
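+
+Before fitting the model, these assumptions can be spot-checked with *rstatix* helper functions. A minimal sketch (not evaluated here; it assumes the `biomarker_data` object after it has been joined with the demographic data, as done later in this module):
+
+```{r anova-assumptions-sketch, eval=FALSE}
+# Normality of IL-1B within each dose group (Shapiro-Wilk test)
+biomarker_data %>%
+  group_by(Dose) %>%
+  shapiro_test(IL1B)
+
+# Homogeneity of variance across sexes (Levene's test)
+biomarker_data %>%
+  levene_test(IL1B ~ Sex)
+```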
+
+
+### ANCOVA
+
+An **Analysis of Covariance (ANCOVA)** tests for mean differences in a continuous dependent variable across at least one categorical independent variable. It also includes another variable, known as a covariate, that needs to be controlled or adjusted for to more accurately capture the relationship between the independent and dependent variables. Potential covariates can include between-group factors like exposure duration and/or within-group factors like batch effects or sex. Note that if the dataset has a smaller sample size, stratifying the dataset based on a covariate is another option to determine its effects, rather than adjusting for it with an ANCOVA.
+
+ANCOVAs have the same assumptions listed above.
+
+
+**Note**: It is possible to run *two-way ANCOVA* models, where the model contains two independent variables and at least one covariate to be adjusted for.
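+
+As a preview of the syntax, here is a hedged sketch of an ANCOVA using the `covariate` argument of `anova_test()`. This is illustrative only: it assumes the demographic data (including an `Age` column) have been joined to `biomarker_data`, as done later in this module, and the exact arguments may need tailoring to your design:
+
+```{r ancova-sketch, eval=FALSE}
+# Sketch: test for dose effects on IL-1B while adjusting for Age
+anova_test(data = biomarker_data,
+           dv = IL1B,
+           wid = Donor,
+           within = Dose,
+           covariate = Age)
+```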
+
+
+
+## Two-way ANOVA Example
+
+Our first environmental health question can be answered using a two-way ANOVA. We can test three different null hypotheses using this test:
+
+1. There is no difference in average biomarker concentrations based on sex.
+2. There is no difference in average biomarker concentrations based on dose.
+3. The effect of sex on average biomarker concentration does not depend on the effect of dose and vice versa.
+
+
+
+The first step would be to check that the assumptions (independence, homogeneity of variance, and normal distribution) have been met, but this was done previously in **TAME 2.0 Module 4.4 Two Group Comparisons and Visualizations**.
+
+To run our two-way ANOVA, we will use the `anova_test()` function from the *rstatix* package. This function allows us to define subject identifiers for matching, between-subject factors (such as sex, which differs between subjects), and within-subject factors (such as dose, which is measured within each subject). Since we have both between- and within-subject factors, we will specifically be running a two-way mixed ANOVA.
+
+First, we need to add our demographic data to our biomarker data so that these variables can be incorporated into the analysis. Also, we need to convert `Dose` into a factor to specify the levels.
+```{r 4-6-Advanced-Multiple-Groups-5 }
+biomarker_data <- biomarker_data %>%
+ left_join(demographic_data, by = "Donor") %>%
+ mutate(Dose = factor(Dose, levels = c("0", "0.6", "1", "2", "4")))
+
+# viewing data
+datatable(biomarker_data)
+```
+
+Then, we can demonstrate how to run the two-way ANOVA and what the results look like by running the test on just one of our variables (IL-1$\beta$).
+```{r 4-6-Advanced-Multiple-Groups-6 }
+get_anova_table(anova_test(data = biomarker_data,
+ dv = IL1B,
+ wid = Donor,
+ between = Sex,
+ within = Dose))
+```
+The column names are described below:
+
++ `Effect`: the name of the variable tested
++ `DFn`: degrees of freedom in the numerator
++ `DFd`: degrees of freedom in the denominator
++ `F`: the F statistic
++ `p`: p-value
++ `p<.05`: denotes whether the p-value is significant
++ `ges`: generalized eta squared (a measure of effect size)
+
+Based on the table above, there are significant differences in IL-1$\beta$ concentrations based on dose (p-value = 0.02). There are no significant differences in IL-1$\beta$ between the sexes nor are there significant differences in IL-1$\beta$ with an interaction between sex and dose.
+
+Similar to previous modules, we now want to apply our two-way ANOVA to each of our variables of interest. To do this, we can use a for loop that will:
+
+1. Loop through each column in the data and apply the test to each column.
+2. Pull out statistics we are interested in (for example, p-value) and bind the results from each column together into a results dataframe.
+```{r 4-6-Advanced-Multiple-Groups-7 }
+# Create a vector with the names of the variables you want to run the test on
+endpoints <- colnames(biomarker_data %>% dplyr::select(IL1B:VEGF))
+
+# Create data frame to store results
+twoway_aov_res <- data.frame(Factor = c("Dose", "Sex", "Sex:Dose"))
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable
+ endpoint <- endpoints[i]
+
+ # Run two-way mixed ANOVA and store results in res_aov
+ res_aov <- anova_test(data = biomarker_data,
+ dv = paste0(endpoint),
+ wid = Donor,
+ between = Sex,
+ within = Dose)
+
+ # Extract the results we are interested in (from the ANOVA table)
+ res_df <- data.frame(get_anova_table(res_aov)) %>%
+ dplyr::select(c(Effect, p)) %>%
+ rename("Factor" = "Effect")
+
+ # Rename columns in the results dataframe so that the output is more nicely formatted
+ names(res_df)[names(res_df) == 'p'] <- noquote(paste0(endpoint))
+
+ # Bind the results to the results dataframe
+ twoway_aov_res <- merge(twoway_aov_res, res_df, by = "Factor", all.y = TRUE)
+}
+
+# View results
+datatable(twoway_aov_res)
+```
+
+An appropriate title for this table could be:
+
+“**Table X. Statistical test results for differences in cytokine concentrations.** A two-way ANOVA was performed using sex and dose as independent variables to test for statistical differences in concentration across 6 cytokines.”
+
+From this table, dose is the only variable with significant differences in concentrations in all 6 biomarkers (p-value < 0.05).
+
+Although we know that dose has significant differences overall, an ANOVA test doesn't tell us which doses of acrolein differ from each other or the directionality of each biomarker's change in concentration after exposure to each dose. Therefore, we need to use a post-hoc test. One common post-hoc test following a one-way or two-way ANOVA is a Tukey’s HSD. However, there is no way to pass the output of the `anova_test()` function to the `TukeyHSD()` function. A good alternative is a pairwise t-test with a Bonferroni correction. Our data are paired in that there are repeated measures (doses) on each subject.
+```{r 4-6-Advanced-Multiple-Groups-8 }
+# Create data frame to store results
+twoway_aov_pairedt <- data.frame(Comparison = c("0_0.6", "0_1", "0_2", "0_4", "0.6_1", "0.6_2", "0.6_4", "1_2", "1_4", "2_4"))
+
+# Run for loop
+for (i in 1:length(endpoints)) {
+
+ # Assign a name to the endpoint variable.
+ endpoint <- endpoints[i]
+
+ # Run pairwise t-tests
+ res_df <- biomarker_data %>%
+ pairwise_t_test(as.formula(paste0(paste0(endpoint), "~", "Dose", sep = "")),
+ paired = TRUE,
+ p.adjust.method = "bonferroni") %>%
+ unite(Comparison, group1, group2, sep = "_", remove = FALSE) %>%
+ dplyr::select(Comparison, p.adj)
+
+ # Rename columns in the results data frame so that the output is more nicely formatted.
+ names(res_df)[names(res_df) == 'p.adj'] <- noquote(paste0(endpoint))
+
+ # Bind the results to the results data frame.
+ twoway_aov_pairedt <- merge(twoway_aov_pairedt, res_df, by = "Comparison", all.y = TRUE)
+}
+
+# View results
+datatable(twoway_aov_pairedt)
+```
+
+An appropriate title for this table could be:
+
+“**Table X. Post hoc testing for differences in cytokine concentrations.** Paired t-tests were run as a post hoc test using dose as an independent variable to test for statistical differences in concentration across 6 cytokines.”
+
+Note that this table and the two-way ANOVA table would likely be put into supplemental material for a publication. Before including this table in supplemental material, it would be best to clean it up (clarify the two comparison groups, round all results to the same number of decimal places) as demonstrated in **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**.
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: Are there significant differences in inflammatory biomarker concentrations between sex and different doses of acrolein?
+:::
+
+:::answer
+**Answer**: Based on the two-way ANOVA and post-hoc t-tests, there are only significant differences in cytokine concentrations based on dose (p adj < 0.05). All biomarkers, with the exception of IL-6, had at least 1 significantly different concentration when comparing doses.
+:::
+
+### Visualizing Two-Way ANOVA Results
+
+Since our overall p-values associated with dose were significant for a number of mediators, we will proceed with creating our final figures showing our endpoints by dose, displaying the overall two-way ANOVA p-value and the pairwise comparisons from our post hoc paired t-tests.
+
+To facilitate plotting in a faceted panel, we'll first pivot our `biomarker_data` dataframe longer.
+```{r 4-6-Advanced-Multiple-Groups-9 }
+biomarker_data_long <- biomarker_data %>%
+ dplyr::select(-c(Age_yr, Sex)) %>%
+ pivot_longer(-c(Donor, Dose), names_to = "Variable", values_to = "Value")
+
+datatable(biomarker_data_long)
+```
+
+Then, we will create an annotation dataframe for adding our overall two-way ANOVA p-values. This dataframe needs to contain a column for our variables (to match with our variable column in our `biomarker_data_long` dataframe) and the p-value for annotation. We can extract these from our `twoway_aov_res` dataframe generated above.
+```{r 4-6-Advanced-Multiple-Groups-10 }
+overall_dose_pvals <- twoway_aov_res %>%
+ # Transpose dataframe
+ column_to_rownames("Factor") %>%
+ t() %>% data.frame() %>%
+ rownames_to_column("Variable") %>%
+ # Keep only the dose results and rename them to p-value
+ dplyr::select(c(Variable, Dose)) %>%
+ rename(`P Value` = Dose)
+
+datatable(overall_dose_pvals)
+```
+
+We now have our p-values for each biomarker. Next, we'll make a column where our p-values are formatted with "p = " for annotation on the graph.
+```{r 4-6-Advanced-Multiple-Groups-11 }
+overall_dose_pvals <- overall_dose_pvals %>%
+ mutate(`P Value` = formatC(`P Value`, format = "e", digits = 2),
+ label = paste("p = ", `P Value`, sep = ""))
+
+datatable(overall_dose_pvals)
+```
+
+Finally, we'll add a column indicating where to add the labels on the y-axis. This will be different for each variable because each variable is on a different scale. We can approach this by computing the maximum value of each variable, then increasing that by 10% to add some space on top of the points.
+```{r 4-6-Advanced-Multiple-Groups-12 }
+sig_labs_y <- biomarker_data %>%
+ summarise(across(IL1B:VEGF, \(x) max(x))) %>%
+ t() %>% as.data.frame() %>%
+ rownames_to_column("Variable") %>%
+ rename("y_pos" = "V1") %>%
+ # moving the significance asterisks higher on the y axis
+ mutate(y_pos = y_pos * 1.1)
+
+sig_labs_y
+
+
+overall_dose_pvals <- overall_dose_pvals %>%
+ left_join(sig_labs_y, by = "Variable")
+
+datatable(overall_dose_pvals)
+```
+
+Now, we'll use the `biomarker_data` dataframe to plot our individual points and boxplots (similar to the plotting demonstrated in previous TAME Chapter 4 modules) and our `overall_dose_pvals` dataframe to add our p value annotation.
+```{r 4-6-Advanced-Multiple-Groups-13, fig.width = 12, fig.height = 6, fig.align='center'}
+# Create clean labels for the graph titles
+new_labels <- c("IL10" = "IL-10", "IL1B" = "IL-1\u03B2 ", "IL6" = "IL-6", "IL8" = "IL-8",
+ "TNFa" = "TNF-\u03b1", "VEGF" = "VEGF")
+
+# Make graph
+ggplot(biomarker_data_long, aes(x = Dose, y = Value)) +
+ # outlier.shape = NA removes outliers
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ geom_jitter(size = 1.5, position = position_jitter(0.15), alpha = 0.7) +
+ # Add label (size is set outside aes() so it is a literal point size rather than a mapped aesthetic)
+ geom_text(data = overall_dose_pvals, aes(x = 1.3, y = y_pos, label = label),
+ size = 5) +
+ # Adding padding y axis
+ scale_y_continuous(expand = expansion(mult = c(0.1, 0.1))) +
+
+ # Faceting by each biomarker
+ facet_wrap(~ Variable, nrow = 2, scales = "free_y", labeller = labeller(Variable = new_labels)) +
+
+ theme(legend.position = "none", # Removing legend
+ axis.title = element_text(face = "bold", size = rel(1.3)),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 10),
+ strip.text = element_text(size = 12, face = "bold")) +
+
+ # Changing axes labels
+ labs(x = "Acrolein (ppm)", y = expression(bold(Log[2]*"(Concentration (pg/ml))")))
+```
+
+
+It's a bit more difficult to add the pairwise t-test results comparing each treatment group to each other to these boxplots, as was done in **TAME 2.0 Module 4.5 Multi-Group Comparisons and Visualizations**, so that addition to the figure was omitted here.
+
+
+
+## ANCOVA Example
+
+In the following ANCOVA example, we'll still investigate potential differences in cytokine concentrations as a result of varying doses of acrolein. However, this time we'll adjust for sex and age to answer our second environmental health question: **Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?**
+
+Let's first demonstrate how to run an ANCOVA and what the results look like by running the test on just one of our variables (IL-1$\beta$). The `Anova()` function from the *car* package was specifically designed to run Type II or III ANOVA tests, which take different approaches to dealing with interaction terms and unbalanced datasets. For more information on Type I, II, and III ANOVA tests, check out [Anova – Type I/II/III SS explained](https://md.psych.bio.uni-goettingen.de/mv/unit/lm_cat/lm_cat_unbal_ss_explained.html). For the purposes of this example, just know that there isn't much of a difference between the Type I, II, and III results.
+```{r 4-6-Advanced-Multiple-Groups-14 }
+# Fit the model (avoiding the name `anova_test`, which would mask the rstatix function used earlier)
+aov_fit <- aov(IL1B ~ Dose + Sex + Age_yr, data = biomarker_data)
+type3_anova <- Anova(aov_fit, type = 'III')
+type3_anova
+```
+Based on the table above, there are significant differences in IL-1$\beta$ concentrations based on dose after adjusting for sex and age (p-value = 0.009).
+
+Now we'll run ANCOVA tests across all of our biomarkers.
+```{r 4-6-Advanced-Multiple-Groups-15 }
+# Create a zero-column data frame with the model term names as row names
+# so that the ANCOVA results for each biomarker can be merged in
+ancova_res <- data.frame(row.names = c("(Intercept)", "Dose", "Sex", "Age_yr"))
+
+# Perform ANCOVA over all biomarker columns (columns 3 through 8 of biomarker_data)
+for (i in 3:8) {
+
+ fit = aov(as.formula(paste0(names(biomarker_data)[i], "~ Dose + Sex + Age_yr", sep = "")),
+ biomarker_data)
+ res <- data.frame(car::Anova(fit, type = "III"))
+ res <- subset(res, select = Pr..F.)
+ names(res)[names(res) == 'Pr..F.'] <- noquote(paste0(names(biomarker_data[i])))
+ ancova_res <- transform(merge(ancova_res, res, by = 0), row.names = Row.names, Row.names = NULL)
+
+}
+
+# Transpose for easy viewing, keep columns of interest, and apply BH adjustment
+ancova_res <- data.frame(t(ancova_res)) %>%
+ dplyr::select(Dose) %>%
+ mutate(across(everything(), \(x) format(p.adjust(x, "BH"), scientific = TRUE)))
+
+# View results
+datatable(ancova_res)
+```
+
+Looking at the table above, there are statistically significant differences based on dose in all cytokine concentrations with the exception of IL-6 (p adj < 0.05). To determine which doses were significantly different from one another, we'll need to run Tukey's post hoc tests.
+```{r 4-6-Advanced-Multiple-Groups-16 }
+# Create results data frame with a column showing the comparisons (extracted from single run vs for loop)
+tukey_res <- data.frame(Comparison = c("0.6 - 0", "1 - 0", "2 - 0", "4 - 0", "1 - 0.6", "2 - 0.6",
+"4 - 0.6", "2 - 1", "4 - 1", "4 - 2"))
+
+# Perform Tukey's test
+for (i in 3:8) {
+
+ # need to run ANCOVA first
+ fit = aov(as.formula(paste0(names(biomarker_data)[i], "~ Dose + Sex + Age_yr", sep = "")),
+ biomarker_data)
+
+ # Tukey's HSD with BH-adjusted p-values
+ posthoc <- summary(glht(fit, linfct = mcp(Dose = "Tukey")), test = adjusted("BH"))
+ res <- posthoc$test
+
+ # Formatting the df with the Tukey's values
+ res_df <- data.frame(cbind(res$coefficients, res$sigma, res$tstat, res$pvalues))
+ colnames(res_df) <- c("Estimate", "Std.Error", "t.value", "Pr(>|t|)")
+ res_df <- round(res_df[4], 4)
+ names(res_df)[names(res_df) == 'Pr(>|t|)'] <- noquote(paste0(names(biomarker_data[i])))
+ res_df <- res_df %>% rownames_to_column("Comparison")
+
+ tukey_res <- left_join(tukey_res, res_df, by = "Comparison")
+}
+
+datatable(tukey_res)
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Are there significant differences in inflammatory biomarker concentrations across different doses of acrolein after controlling for sex and age?
+:::
+
+:::answer
+**Answer**: Based on the ANCOVA tests, there are significant differences resulting from various doses of acrolein (p adj < 0.05) across all cytokine concentrations with the exception of IL-6. All biomarkers, with the exception of IL-6, had at least 1 significantly different biomarker concentration when comparing doses.
+:::
+
+### Visualizing ANCOVA Results
+
+Before graphing these results, we first need to think about which ones we want to display. For simplicity's sake, we will only graph significant comparisons involving the control ("0") group. To do this, we'll:
+
+1. Separate our `Comparison` column into a `group1` and `group2` column.
+2. Filter to comparisons including only the 0 group.
+3. Pivot the dataframe longer, to match the format of our data used as input for facet plotting.
+4. Filter to only p-values that are less than 0.05.
+```{r 4-6-Advanced-Multiple-Groups-17 }
+tukey_res_forgraph <- tukey_res %>%
+ separate(Comparison, into = c("group1", "group2"), sep = " - ") %>%
+ filter(group2 == "0") %>%
+ dplyr::select(-group2) %>%
+ pivot_longer(!group1, names_to = "Variable", values_to = "P Value") %>%
+ filter(`P Value` < 0.05) %>%
+ # rounding the p values to 4 digits for readability
+ mutate(`P Value` = round(`P Value`, 4))
+
+datatable(tukey_res_forgraph)
+```
+
+Next, we can take a few steps to add columns to the dataframe that will aid in graphing:
+
+1. Add a column for significance stars.
+2. Add a column to indicate the y position for the significance annotation (similar to the above example with the two-way ANOVA).
+```{r 4-6-Advanced-Multiple-Groups-18 }
+# Add column for significance stars
+tukey_res_forgraph <- tukey_res_forgraph %>%
+ mutate(p.signif = case_when(`P Value` < 0.0001 ~ "****",
+ `P Value` < 0.001 ~ "***",
+ `P Value` < 0.01 ~ "**",
+ `P Value` < 0.05 ~ "*"))
+
+# Calculate y positions to plot significance stars
+sig_labs_y_tukey <- biomarker_data %>%
+ summarise(across(IL1B:VEGF, \(x) max(x))) %>%
+ t() %>% as.data.frame() %>%
+ rownames_to_column("Variable") %>%
+ rename("y_pos" = "V1") %>%
+ mutate(y_pos = y_pos * 1.15)
+
+sig_labs_y_tukey
+
+# Join y positions to tukey_res
+tukey_res_forgraph <- tukey_res_forgraph %>%
+ left_join(sig_labs_y_tukey, by = "Variable") %>%
+ rename("Dose" = "group1")
+
+datatable(tukey_res_forgraph)
+```
+
+We also need to prepare our overall p-values from our ANCOVA for display:
+```{r 4-6-Advanced-Multiple-Groups-19 }
+ancova_res_forgraphing <- ancova_res %>%
+ rename(`P Value` = Dose) %>%
+ rownames_to_column("Variable") %>%
+ left_join(sig_labs_y, by = "Variable") %>%
+ mutate(`P Value` = formatC(as.numeric(`P Value`), format = "e", digits = 2),
+ label = paste("p = ", `P Value`, sep = ""))
+
+```
+
+Now, we are ready to make our graph! We will use similar code to the above, this time adding in our significance stars over specific columns.
+```{r 4-6-Advanced-Multiple-Groups-20, fig.width = 12, fig.height = 7, fig.align='center'}
+# Make graph
+ggplot(biomarker_data_long, aes(x = Dose, y = Value)) +
+ # outlier.shape = NA removes outliers
+ geom_boxplot(aes(fill = Dose), outlier.shape = NA) +
+ # Changing box plot colors
+ scale_fill_manual(values = c("#BFBFBF", "#D5A298", "#E38273", "#EB5F4E", "#EE2B2B")) +
+ geom_jitter(size = 1.5, position = position_jitter(0.15), alpha = 0.7) +
+ # Add overall ANCOVA label (size set outside aes() so it is a literal size, not a mapped aesthetic)
+ geom_text(data = ancova_res_forgraphing, aes(x = 1.3, y = y_pos * 1.15, label = label), size = 5) +
+ # Add Tukey annotation
+ geom_text(data = tukey_res_forgraph, aes(x = Dose, y = y_pos, label = p.signif), size = 10, hjust = 0.5) +
+
+ # Faceting by each biomarker
+ facet_wrap(~ Variable, nrow = 2, scales = "free_y", labeller = labeller(Variable = new_labels)) +
+ # Removing legend
+ theme(legend.position = "none",
+ axis.title = element_text(face = "bold", size = rel(1.5)),
+ axis.title.x = element_text(vjust = -0.75),
+ axis.title.y = element_text(vjust = 2),
+ axis.text = element_text(color = "black", size = 10),
+ strip.text = element_text(size = 12, face = "bold")) +
+
+ # Changing axes labels
+ labs(x = "Acrolein (ppm)", y = expression(bold(Log[2]*"(Concentration (pg/ml))")))
+```
+An appropriate title for this figure could be:
+
+“**Figure X. Acrolein exposure increases inflammatory cytokine secretion in most primary human bronchial epithelial cells.** Overall p-values from ANCOVA tests adjusting for age and sex are shown in the left-hand corner of each panel. Tukey's post hoc tests were subsequently run, and significant Benjamini-Hochberg adjusted p-values, compared to the control (0 ppm) dose only, are denoted with asterisks. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001, *n* = 16 per group.”
+
+
+
+## Concluding Remarks
+In this module, we introduced advanced multi-group comparisons using two-way ANOVA and ANCOVA tests. These overall effect tests along with post-hoc testing were used on an example dataset to provide a basis for publication-ready tables and figures to present these results. This training module provides code and text for advanced multi-group comparisons necessary to answer more complex research questions.
+
+
+
+
+### Additional Resources
+ + [Two-Way ANOVA](https://www.scribbr.com/statistics/two-way-anova/)
+ + [Repeated Measure ANOVA in R](https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/)
+ + [ANCOVA Example](https://ibecav.github.io/ancova_example/)
+ + [Nonparametric ANCOVA (*fANCOVA* package) CRAN Documentation](https://cran.r-project.org/web/packages/fANCOVA/fANCOVA.pdf)
+ + [Nonparametric ANCOVA (`sm.ancova`) RDocumentation](https://www.rdocumentation.org/packages/sm/versions/2.2-6.0/topics/sm.ancova)
+
+
+
+
+
+:::tyk
+Functional endpoints from these cultures were also measured. These endpoints were: 1) Membrane Permeability (MemPerm), 2) Trans-Epithelial Electrical Resistance (TEER), 3) Ciliary Beat Frequency (CBF), and 4) Expression of Mucin (MUC5AC). These data were already processed and tested for normality (see Test Your Knowledge for **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**), with results indicating that two of the endpoints are normally distributed and two non-normally distributed.
+Using the data found in “Module4_5_TYKInput.xlsx”, answer the following research question: Are there significant differences in functional endpoints based on doses of acrolein and sex after adjusting for age? To streamline the analysis, we'll only include doses of acrolein at 0, 1, and 4ppm.
+
+**Hint**: You'll need to run a two-way ANCOVA. Given that some of the assumptions for parametric tests (i.e., normality and homogeneity of variance) are not fully met and the dataset is on the smaller side, we likely wouldn't run a parametric test in practice. However, we'll do so here just to illustrate how to run a two-way ANCOVA.
+:::
diff --git a/Chapter_4/Module4_6_Input/Module4_6_InputData1.xlsx b/Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData1.xlsx
similarity index 100%
rename from Chapter_4/Module4_6_Input/Module4_6_InputData1.xlsx
rename to Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData1.xlsx
diff --git a/Chapter_4/Module4_6_Input/Module4_6_InputData2.xlsx b/Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData2.xlsx
similarity index 100%
rename from Chapter_4/Module4_6_Input/Module4_6_InputData2.xlsx
rename to Chapter_4/4_6_Advanced_Multiple_Groups/Module4_6_InputData2.xlsx
diff --git a/Chapter_5/05-Chapter5.Rmd b/Chapter_5/05-Chapter5.Rmd
deleted file mode 100644
index 4fee44a..0000000
--- a/Chapter_5/05-Chapter5.Rmd
+++ /dev/null
@@ -1,2162 +0,0 @@
-# (PART\*) Chapter 5 Machine Learning & Artificial Intelligence {-}
-
-# 5.1 Introduction to Artificial Intelligence, Machine Learning, and Predictive Modeling for Environmental Health
-
-This training module was developed by David M. Reif, with contributions from Elise Hickman, Alexis Payton, and Julia E. Rager
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Artificial intelligence (AI), machine learning (ML), and predictive modeling are becoming increasingly popular buzzwords, both in the public domain and within research fields including environmental health. Within environmental health, these computational techniques are implemented to integrate large, high dimensional datasets (e.g., chemical, biological, clinical/medical, model estimates, etc.) to better understand links between environmental exposures and biological responses.
-
-In this training module, we will:
-
-+ Provide general historical context and taxonomy of modern AI/ML
-+ Provide an overview of the intersection between environmental health science and ML through discussing...
- + Why there is a need for ML in environmental health science
- + The differences between ML and traditional statistical methods
- + Predictive modeling in the context of environmental health science
- + Additional applications of ML in environmental health science
-
-
-
-### Training Module's Environmental Health Question
-
-This training module was specifically developed to answer the following environmental health question:
-
-+ How and why are artificial intelligence, machine learning, and predictive modeling used in environmental health research?
-
-## General Historical Context and Taxonomy of Modern AI/ML
-
-Before diving into the applications of AI and ML in environmental health, let's first establish what these terms mean and how they are related. Note that the definitions surrounding AI and ML can be subjective; however, the purpose of this module is not to get caught up in semantics but to broadly understand how AI and ML can be applied to environmental health research.
-
-**Artificial Intelligence (AI)** encompasses computer systems that perform tasks typically associated with human cognition and intelligence. AI is found in our everyday lives, for instance, within face recognition, internet search queries, email spam detection, smart home devices, auto-navigation, and digital assistants.
-
-**Machine Learning (ML)** can be thought of as a subset of AI and describes a computer system that iteratively learns and improves from that experience autonomously.
-
-Below is a high level taxonomy of AI. It's not meant to be an exhaustive depiction of all AI techniques but a simple visualization of how some of these methodologies are nested within each other. **Note**: AI can be categorized in different ways and may deviate from what is illustrated below.
-```{r 05-Chapter5-1, out.width = "800px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image1.png")
-```
-
-Advantages of AI and ML include the automation of repetitive tasks, complex problem solving, and reducing human error. However, disadvantages include learning from biased datasets or patterns that are reflected in the decisions of AI/ML and the potential limited interpretability of algorithms created by AI/ML. Check out the following resources for...
-
-+ Further explanation on differences in [Artificial Intelligence vs. Machine Learning](https://cloud.google.com/learn/artificial-intelligence-vs-machine-learning)
-+ Other subsets of AI that fall outside of the scope of these modules in [Types of Artificial Intelligence](https://builtin.com/artificial-intelligence)
-+ Additional discussion on the utility of ML approaches for high-dimensional data common in environmental health research in [Payton et. al](https://www.frontiersin.org/articles/10.3389/ftox.2023.1171175/full)
-
-It is important to understand the methodological "roots" of current methods. Otherwise, it seems like every approach is novel! AI and ML methods have been around since the mid- to late-1900s and continue to evolve in the present day. The earliest conceptual roots for these approaches can be traced to antiquity; however, it is generally thought that the field was named "artificial intelligence" at the ["Dartmouth Workshop"](https://home.dartmouth.edu/about/artificial-intelligence-ai-coined-dartmouth) in 1956, led by John McCarthy and others. The following schematic demonstrates the general taxonomy (categories, sub-fields, and specific methods) of modern AI and ML:
-
-```{r 05-Chapter5-2, out.width = "800px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image2.png")
-```
-
-### A Brief Detour to Discuss ChatGPT
-
-**ChatGPT (Chat Generative Pre-trained Transformer)** is a publicly available chatbot developed by OpenAI. It was released in November of 2022 and quickly gained popularity due to its accessibility and ability to have human-like conversations with the user across almost any imaginable topic.
-
-Large language models (LLMs), such as GPT-3 (a predecessor of the models underlying ChatGPT), generally fall under the "Connectionist AI" category, as they are built on artificial neural networks. They fall under the deep learning subset due to their use of deep neural networks with many layers, allowing them to learn from large amounts of data and find intricate patterns.
-
-LLMs are trained to predict the probability of a word given its context in a dataset (a form of next-word prediction), which is a machine learning methodology. It's notable that they use architectures like [Transformer Networks](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)), which are known for their efficiency in handling sequential data, making them a go-to choice for natural language processing (NLP) tasks. The use of attention mechanisms in these architectures allows the model to focus on different parts of the input sequence when producing an output sequence, offering a substantial improvement in performance for many natural language processing tasks.
-
-The role of ChatGPT and similar tools in the environmental health research space is still being explored. Although ChatGPT has the potential to streamline certain parts of the research process, such as text and language polishing, synthesizing existing information, and suggesting custom coding solutions, it is not an intellectual replacement for the expertise and diverse viewpoints of scientists and must be used transparently and with caution.
-
-
-
-## Application of Machine Learning in Environmental Health Science
-
-For the rest of this module and chapter, we will focus on machine learning (ML). Generally speaking, ML is considered to encompass the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence (AI), discussed broadly above.
-
-### Why do we need machine learning in environmental health science?
-
-There are many avenues to incorporate ML into environmental health research, all aimed at better identifying patterns amongst large datasets spanning medical health records, clinical data, exposure monitoring data, chemistry profiles, and the rapidly expanding realm of biological response data including multiple -omics endpoints.
-
-One well-known problem that can be better addressed by incorporating ML is the 'too many chemicals, too little data' problem. To detail: there are thousands of chemicals in commerce today, and testing them one by one for toxicity using comprehensive animal screening experiments would take decades and would not be financially feasible. Current efforts to address this problem include using cell-based high throughput screening to efficiently determine biological responses to a variety of chemical exposures and treatment conditions.
-
-```{r 05-Chapter5-3, out.width = "700px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image3.png")
-```
-
-These screening efforts result in increasing amounts of data, which can be gathered to start building big databases.
-```{r 05-Chapter5-4, out.width = "700px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image4.png")
-```
-
-When many of these datasets and databases are combined, including diversity across different types of screening platforms, technologies, cell types, species, and other experimental variables, the associated dimensionality of the data gets "big."
-```{r 05-Chapter5-5, out.width = "500px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image5.png")
-```
-
-This presents a problem because these data are diverse and high dimensional (the number of features or endpoints exceeds the number of observations/chemicals). To appropriately analyze and model these data, new approaches beyond traditional statistical methods are needed.
-
-### Machine Learning vs. Traditional Statistical Methods
-
-There is *plenty* of debate as to where the line(s) between ML and traditional statistics should be drawn. In our opinion, a perfect delineation is not necessary for our purposes. Rather, we will focus on the usual goals/intent of each to help us understand the distinction for environmental health research.
-
-Traditional statistics may be able to handle 1:1 or 1:many comparisons of singular quantities (e.g., activity concentrations for two chemicals). However, once the modeling becomes more complex or exploratory, assumptions of most traditional methods will be violated. Furthermore, statistics draws population inferences from a sample, while AI/ML finds generalizable predictive patterns ([Bzdok et al 2018](https://www.nature.com/articles/nmeth.4642)). This is particularly helpful in **predictive toxicology**, in which we leverage high dimensional data to obtain generalizable forecasts for the effects of chemicals on biological systems.
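-
-As a toy illustration of the inference-versus-prediction distinction described above, the sketch below contrasts the two mindsets on simulated data (a non-evaluated sketch; all object names and the simulation itself are hypothetical, for illustration only):
-```{r inference-vs-prediction-sketch, eval = FALSE}
-set.seed(42)
-
-# Simulated exposure-response data
-df <- data.frame(exposure = runif(100, 0, 10))
-df$response <- 2 * df$exposure + rnorm(100, sd = 3)
-
-# Traditional statistics: inference about the population slope
-# (estimate, confidence interval, p-value)
-summary(lm(response ~ exposure, data = df))
-
-# ML mindset: how well do predictions generalize to held-out data?
-train <- df[1:70, ]
-test  <- df[71:100, ]
-fit <- lm(response ~ exposure, data = train)
-preds <- predict(fit, newdata = test)
-sqrt(mean((test$response - preds)^2))  # root mean squared prediction error
-```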
-
-This image shows graphical abstractions of how a "problem" is solved using:
-
-+ Traditional statistics ((A) logistic regression and (B) linear regression), OR
-+ Machine learning ((C) support vector machines, (D) artificial neural networks, and (E) decision trees)
-
-```{r 05-Chapter5-6, out.width = "700px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image6.png")
-```
-
-### Predictive Modeling in the Context of Environmental Health Science
-
-In the previous section, we briefly mentioned **predictive toxicology.** We often think of predictions as having a forward-time component (*i.e. What will happen next?*) ... what about "prediction" in a different sense as applied to toxicology?
-
-Our *working definition* is that **predictive toxicology** describes a multidisciplinary approach to chemical toxicity evaluation that more efficiently uses animal test results, when needed, and leverages expanding non-animal test methods to forecast the effects of a chemical on biological systems. Examples of the questions we can answer using predictive toxicology include:
-
-+ Can we more efficiently design animal studies and analyze data from shorter assays using fewer animals to predict long-term health outcomes?
-+ Can this suite of *in vitro* assays **predict** what would happen in an organism?
-+ Can we use diverse, high dimensional data to cluster chemicals into **predicted** activity classes?
-
-```{r 05-Chapter5-7, out.width = "600px", echo = FALSE, fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_1_Input/Module5_1_Image7.png")
-```
-
-Similar logic applies to the field of exposure science. What about "prediction" applied to exposure science?
-
-Our *working definition* is that **predictive exposure science** describes a multidisciplinary approach to chemical exposure evaluations that more efficiently uses biomonitoring, chemical inventory, and other exposure science-relevant databases to forecast exposure rates in target populations. For example:
-
-+ Can we use existing biomonitoring data from NHANES to predict exposure rates for chemicals that have yet to be measured in target populations? (see ExpoCast program, e.g., [Wambaugh et al 2014](https://pubmed.ncbi.nlm.nih.gov/25343693/))
-+ Can I use chemical product use inventory data to predict the likelihood of a chemical being present in a certain consumer product? (e.g., [Phillips et al 2018](https://pubmed.ncbi.nlm.nih.gov/29405058/))
-
-There are many different types of ML methods that we can employ in predictive toxicology and exposure science, depending on the data type / purpose of data analysis. A recent [review](https://pubmed.ncbi.nlm.nih.gov/34029068/) written together with [Erin Baker's lab](https://bakerlab.wordpress.ncsu.edu/) provides a high-level overview on some of the types of ML methods and challenges to address when analyzing multi-omic data (including chemical signature data).
-
-### Answer to Environmental Health Question
-:::question
-*With this, we can now answer our **Environmental Health Question***: How and why are machine learning, predictive modeling, and artificial intelligence used in environmental health research?
-:::
-
-:::answer
-**Answer:** Machine learning, a subcategory of artificial intelligence, can be used in environmental health science to better understand patterns between chemical exposure and biological response in complex, high dimensional datasets. These datasets are often generated as part of efforts to screen many chemicals efficiently. Predictive modeling, which can include machine learning approaches, leverages these data to forecast the effects of a chemical on biological systems.
-:::
-
-### Additional Applications of Machine Learning in Environmental Health Science
-
-In addition to the predictive toxicology questions above, ML can also be applied in the analysis of complex, high dimensional data in observational clinical (human subjects) studies in environmental health, such as:
-
-+ Do subjects cluster by chemical exposure? Are there similarities between subjects that cluster together for chemical exposure, suggesting underlying factors relevant to chemical exposure?
-+ Are biological signatures in different exposure groups different enough overall that ML can predict which group a subject belongs to based on their signature?
-
-
-
-## Concluding Remarks
-
-In conclusion, this training module provides an overview of the field of AI and ML and discusses applications of these tools in environmental health science through predictive modeling. These methods represent common tools that are used in high dimensional data analyses within the field of environmental health sciences.
-
-In the following modules, we will provide specific examples detailing how to apply both supervised and unsupervised machine learning methods to environmental health questions and how to interpret the results of these analyses.
-
-For a review article on ML, see:
-
-+ Odenkirk MT, Reif DM, Baker ES. Multiomic Big Data Analysis Challenges: Increasing Confidence in the Interpretation of Artificial Intelligence Assessments. Anal Chem. 2021 Jun 8;93(22):7763-7773. PMID: [34029068](https://pubmed.ncbi.nlm.nih.gov/34029068/)
-
-For additional case studies that leverage more advanced ML techniques, see the following recent publications that also address environmental health questions from our research groups, with bracketed tags at the end of each citation denoting ML methods used in that study:
-
-+ Clark J, Avula V, Ring C, Eaves LA, Howard T, Santos HP, Smeester L, Bangma JT, O'Shea TM, Fry RC, Rager JE. Comparing the Predictivity of Human Placental Gene, microRNA, and CpG Methylation Signatures in Relation to Perinatal Outcomes. Toxicol Sci. 2021 Sep 28;183(2):269-284. PMID: [34255065](https://pubmed.ncbi.nlm.nih.gov/34255065/) *[hierarchical clustering, principal component analysis, random forest]*
-
-+ Green AJ, Mohlenkamp MJ, Das J, Chaudhari M, Truong L, Tanguay RL, Reif DM. Leveraging high-throughput screening data, deep neural networks, and conditional generative adversarial networks to advance predictive toxicology. PLoS Comput Biol. 2021 Jul 2;17(7):e1009135. PMID: [34214078](https://pubmed.ncbi.nlm.nih.gov/34214078/) *[conditional generative adversarial network, deep neural network, support vector machine, random forest, multilayer perceptron]*
-
-+ To KT, Truong L, Edwards S, Tanguay RL, Reif DM. Multivariate modeling of engineered nanomaterial features associated with developmental toxicity. NanoImpact. 2019 Apr;16:10.1016. PMID: [32133425](https://pubmed.ncbi.nlm.nih.gov/32133425/) *[random forest]*
-
-+ Ring C, Sipes NS, Hsieh JH, Carberry C, Koval LE, Klaren WD, Harris MA, Auerbach SS, Rager JE. Predictive modeling of biological responses in the rat liver using in vitro Tox21 bioactivity: Benefits from high-throughput toxicokinetics. Comput Toxicol. 2021 May;18:100166. PMID: [34013136](https://pubmed.ncbi.nlm.nih.gov/34013136/) *[random forest]*
-
-+ Hickman E, Payton A, Duffney P, Wells H, Ceppe AS, Brocke S, Bailey A, Rebuli ME, Robinette C, Ring B, Rager JE, Alexis NE, Jaspers I. Biomarkers of Airway Immune Homeostasis Differ Significantly with Generation of E-Cigarettes. Am J Respir Crit Care Med. 2022 Nov 15; 206(10):1248-1258. PMID: [35731626](https://pubmed.ncbi.nlm.nih.gov/35731626/) *[hierarchical clustering, quadratic discriminant analysis, multinomial logistic regression]*
-
-+ Perryman AN, Kim H-YH, Payton A, Rager JE, McNell EE, Rebuli ME, et al. (2023) Plasma sterols and vitamin D are correlates and predictors of ozone-induced inflammation in the lung: A pilot study. PLoS ONE 18(5): e0285721. PMID: [37186612](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285721) *[random forest, support vector machine, k nearest neighbor]*
-
-+ Payton AD, Perryman AN, Hoffman JR, Avula V, Wells H, Robinette C, Alexis NE, Jaspers I, Rager JE, Rebuli ME. Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples. American Journal of Physiology-Lung Cellular and Molecular Physiology 2022 322:5, L722-L736. PMID: [35318855](https://journals.physiology.org/doi/abs/10.1152/ajplung.00299.2021) *[k-means clustering, principal component analysis]*
-
-# 5.2 Supervised Machine Learning
-
-This training module was developed by Alexis Payton, Oyemwenosa N. Avenbuan, Lauren E. Koval, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Machine learning is a field that has been around for decades but has exploded in popularity and utility in recent years due to the proliferation of big and/or high dimensional data. Machine learning has the ability to sift through and learn from large volumes of data and use that knowledge to solve problems. The challenges of high dimensional data as they pertain to environmental health, and the applications of machine learning to mitigate some of those challenges, are discussed further in [Payton et al.](https://www.frontiersin.org/articles/10.3389/ftox.2023.1171175/full). In this module, we will introduce different types of machine learning and then focus on supervised machine learning, including how to train and assess supervised machine learning models. We will then analyze an example dataset using supervised machine learning, highlighting random forest modeling.
-
-
-
-## Types of Machine Learning
-Within the field of machine learning, there are many different types of algorithms that can be leveraged to address environmental health research questions. The two broad categories of machine learning frequently applied to environmental health research are: (1) supervised machine learning and (2) unsupervised machine learning.
-
-**Supervised machine learning** involves training a model using a labeled dataset, where each independent or predictor variable is associated with a dependent variable with a known outcome. This allows the model to learn how to predict the labeled outcome on data it hasn't "seen" before based on the patterns and relationships it previously identified in the data. For example, supervised machine learning has been used for cancer prediction and prognosis based on variables like tumor size, stage, and age ([Lynch et al.](https://www.sciencedirect.com/science/article/abs/pii/S1386505617302368?via%3Dihub), [Asadi et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7416093/)).
-
-Supervised machine learning includes:
-
-+ Classification: Using algorithms to classify a categorical outcome (i.e., plant species, disease status, etc.)
-+ Regression: Using algorithms to predict a continuous outcome (i.e., gene expression, chemical concentration, etc.)
-```{r 05-Chapter5-8, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image1.png")
-```
-
-Soni, D. (2018, March 22). Supervised vs. Unsupervised Learning. Towards Data Science. https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
-
-**Unsupervised machine learning**, on the other hand, involves using models to find patterns or associations between variables in a dataset that lacks a known or labeled outcome. For example, unsupervised machine learning has been used to identify new patterns across genes that are co-expressed, informing potential biological pathways mediating human disease ([Botía et al.](https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-017-0420-6), [Pagnuco et al.](https://www.sciencedirect.com/science/article/pii/S0888754317300575?via%3Dihub)).
-
-```{r 05-Chapter5-9, echo=FALSE, fig.width=52, fig.height=18, fig.align='center', out.width = "75%"}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image2.png")
-```
-
-Langs, G., Röhrich, S., Hofmanninger, J., Prayer, F., Pan, J., Herold, C., & Prosch, H. (2018). Machine learning: from radiomics to discovery and routine. Der Radiologe, 58(S1), 1–6. DOI: [10.1007/s00117-018-0407-3](https://doi.org/10.1007/s00117-018-0407-3). Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
-
-Overall, the distinction between supervised and unsupervised learning is an important concept in machine learning, as it can inform the choice of algorithms and techniques used to analyze and make predictions from data. It is worth noting that there are also other types of machine learning, such as [semi-supervised learning](https://www.altexsoft.com/blog/semi-supervised-learning/), [reinforcement learning](https://www.geeksforgeeks.org/what-is-reinforcement-learning/), and [deep learning](https://www.geeksforgeeks.org/introduction-deep-learning/), though we will not further discuss these topics in this module.
-
-
-
-## Types of Supervised Machine Learning Algorithms
-
-Although this module's example will focus on a random forest model in the coding example below, other commonly used algorithms for supervised machine learning include:
-
-+ **K-Nearest Neighbors (KNN):** Uses distance to classify a data point in the test set based upon the most common class of neighboring data points from the training set. For more information on KNN, see [K-Nearest Neighbor](https://www.ibm.com/topics/knn).
-```{r 05-Chapter5-10, echo=FALSE, out.width = "50%",fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image6.png")
-```
-
-+ **Support Vector Machine (SVM):** Creates a decision boundary line (hyperplane) in n-dimensional space to separate the data into each class so that when new data is presented, they can be easily categorized. For more information on SVM, see [Support Vector Machine](https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm).
-```{r 05-Chapter5-11, echo=FALSE, out.width = "50%", fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image7.png")
-```
-
-+ **Random Forest (RF):** Uses a multitude of decision trees trained on a subset of different samples from the training set and the resulting classification of a data point in the test set is aggregated from all the decision trees. A **decision tree** is a hierarchical model that depicts decisions from predictors and their resulting outcomes. It starts with a root node, which represents an initial test from a single predictor. The root node splits into subsequent decision nodes that test another feature. These decision nodes can either feed into more decision nodes or leaf nodes that represent the predicted class label. A branch or a sub-tree refers to a subsection of an entire decision tree.
-
-Here is an example decision tree with potential variables and decisions informing a college basketball player's likelihood of being drafted to the NBA:
-```{r 05-Chapter5-12, echo=FALSE, out.width = "75%",fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image8.png")
-```
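The tree structure described above can be sketched as a small function of nested rules, loosely mirroring the NBA-draft example: a root node tests one predictor, decision nodes test further features, and leaf nodes return the predicted class. All predictor names and cutoffs below are hypothetical illustrations, not values from the figure.

```r
# A minimal decision tree as nested rules; thresholds are made-up examples
predict_drafted <- function(points_per_game, height_in, assists_per_game) {
  # Root node: initial test on a single predictor
  if (points_per_game >= 15) {
    # Decision node: test another feature on this branch
    if (height_in >= 78) "Drafted" else "Not drafted"
  } else {
    # The other branch (sub-tree) of the tree
    if (assists_per_game >= 6) "Drafted" else "Not drafted"
  }
}

predict_drafted(points_per_game = 20, height_in = 80, assists_per_game = 3)  # "Drafted"
predict_drafted(points_per_game = 10, height_in = 80, assists_per_game = 2)  # "Not drafted"
```

Real decision tree algorithms learn these split variables and cutoffs from the training data rather than hand-coding them, but the resulting model has exactly this nested if/else structure, which is what makes individual trees so interpretable.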
-
-While decision trees are highly interpretable, they are prone to overfitting, so they may not generalize well to data outside of the training set. To address this, random forests combine many different decision trees. Each tree is trained on a subset of the samples in the training data, selected with replacement, and a randomly selected set of predictor variables. For a dataset with *p* predictors, it is common to test $\sqrt{p}$, $\frac{p}{2}$, and *p* predictors to see which gives the best results. This process decorrelates the trees. For a classification problem, the majority vote of the decision trees determines the final predicted class. This sacrifices the interpretability inherent to individual trees but reduces the risk of overfitting.
-
-For more information on RF and decision trees, check out [Random Forest](https://www.ibm.com/in-en/topics/random-forest) and
-[Decision Trees](https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/#What_is_a_Decision_Tree?).
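As a quick, standalone illustration of one of the algorithms above, the sketch below runs KNN on R's built-in `iris` dataset using the *class* package (a recommended package included with standard R installations). This toy example is separate from this module's well water analysis; the split fraction and `k = 5` are arbitrary choices for demonstration.

```r
library(class)  # provides knn()

set.seed(17)
# Hold out a random 40% of the 150 iris rows as a test set
test_idx <- sample(nrow(iris), size = 0.4 * nrow(iris))
train_x <- iris[-test_idx, 1:4]      # four numeric flower measurements
test_x  <- iris[test_idx, 1:4]
train_y <- iris$Species[-test_idx]   # labeled outcome (species)
test_y  <- iris$Species[test_idx]

# Classify each test flower by majority vote of its 5 nearest training neighbors
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Overall accuracy on the held-out data
mean(pred == test_y)
```

Because KNN is distance-based, predictors on very different scales should generally be standardized first; the iris measurements are similar enough in scale that this step is skipped here.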
-
-**Note**: One algorithm is not inherently better than the others with each having their respective advantages and disadvantages. Each algorithm's predictive ability will be largely dependent on the size of the dataset, the distribution of the data points, and the scenario.
-
-
-
-## Training Supervised Machine Learning Models
-
-In supervised machine learning, algorithms need to be trained before they can be used to predict on new data. This involves selecting a smaller portion of the dataset to train the model so it will learn how to predict the outcome as accurately as possible. The process of training an algorithm is essential for enabling the model to learn and improve over time, allowing it to make more accurate predictions and better adapt to new and changing circumstances. Ultimately, the quality and relevance of the training data will have a significant impact on the effectiveness of a machine learning model.
-
-Common partitions of the full dataset used to train and test a supervised machine learning model are the following:
-
-1. **Training Set:** a subset of the data that the algorithm "sees" and uses to identify patterns.
-
-2. **Validation Set**: a subset of the training set that is used to evaluate the model's fit in an unbiased way allowing us to fine-tune its parameters and optimize performance.
-
-3. **Test Set:** a subset of data that is used to evaluate the final model's fit based on the training and validation sets. This provides an objective assessment of the model's ability to generalize to new data.
-
-It is common to split the dataset into a training set that contains 60% of the data and the test set that contains 40% of the data, though other common splits include 70% training / 30% test and 80% training / 20% test.
-
-```{r 05-Chapter5-13, echo=FALSE, out.width = "65%", fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image3.png")
-```
-
-It is important to note that the test set should only be examined after the algorithm has been trained using the training/validation sets. Using the test set during the development process can lead to overfitting, where the model performs well on the test data but poorly on new data. The ideal algorithm is generalizable, or flexible enough to accurately predict unseen data. Balancing a model's fit to the training data against this ability to generalize is known as the bias-variance tradeoff. For further information, see [Understanding the Bias-Variance Tradeoff](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229).
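A random 60/40 train/test split can be sketched in a few lines of base R. The coding example later in this module uses `caret::createDataPartition()`, which additionally preserves the outcome's class proportions across the two sets; the version below, with a hypothetical sample size, is the simplest possible split.

```r
set.seed(17)                               # for reproducibility
n <- 100                                   # hypothetical number of samples
train_idx <- sample(n, size = 0.6 * n)     # randomly choose 60% of row indices

train_rows <- train_idx                    # 60 rows for training
test_rows  <- setdiff(1:n, train_idx)      # the remaining 40 rows for testing

length(train_rows)  # 60
length(test_rows)   # 40
```

With a real dataset, `data[train_rows, ]` and `data[test_rows, ]` would then be passed to model training and final evaluation, respectively.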
-
-### Cross Validation
-
-Finally, we will discuss **cross validation**, an approach used during training to expose the model to more patterns in the data and aid in model evaluation. For example, if a model is trained and tested on a single 60:40 split, its accuracy will likely be influenced by *where* that split occurs in the dataset, which can bias the model and reduce its ability to predict accurately for data outside of the training set. Overall, cross validation (CV) is implemented to fine-tune a model's parameters and improve prediction accuracy and the ability to generalize.
-
-Although there are [a number of cross validation approaches](https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right), we will specifically highlight ***k*-fold cross validation**. k-fold cross validation works by splitting the samples in the training dataset into *k* equally sized folds or groups. For example, if we implement 5-fold CV, we start by...
-
-1. Splitting the training data into 5 groups, or "folds".
-2. Five iterations of training/testing are then run where each of the 5 folds serves as the test data once and as part of the training set four times, as seen in the figure below.
-3. To measure predictive ability of each of the parameters tested, like the number of features to include, values like accuracy and specificity are calculated for each iteration. The parameters that optimize performance are selected for the final model which will be evaluated against the test set not used in training.
-```{r 05-Chapter5-14, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image4.png")
-```
-
-Check out these resources for additional information on [Cross Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) and [Cross Validation Pros & Cons](https://www.geeksforgeeks.org/cross-validation-machine-learning/).
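The fold assignment in step 1 above can be sketched in base R as follows. The training set size is a hypothetical placeholder, and the model-fitting step inside the loop is left as a comment, since it depends on the algorithm being tuned.

```r
set.seed(17)
n <- 50                                   # hypothetical training set size
k <- 5                                    # number of folds for 5-fold CV
fold <- sample(rep(1:k, length.out = n))  # randomly assign each row to a fold

for (i in 1:k) {
  held_out          <- which(fold == i)   # rows evaluated in this iteration
  used_for_training <- which(fold != i)   # rows used for fitting in this iteration
  # ... fit the model on `used_for_training`, predict on `held_out`,
  #     and record metrics such as accuracy and specificity ...
}

table(fold)  # five folds of equal size
```

Each row is held out exactly once and used for training k − 1 times, matching the iterations described in step 2; in practice, packages such as *caret* (`trainControl(method = "cv", number = 5)`) automate this bookkeeping.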
-
-
-
-## Assessing Classification-Based Model Performance
-Evaluation metrics derived from a confusion matrix are often used to determine the best model during training and to measure model performance during testing for classification-based supervised machine learning models. A confusion matrix is a table that displays how often the algorithm correctly and incorrectly predicted each outcome class.
-
-Let's imagine you're interested in predicting whether or not a player will be drafted to the National Basketball Association (NBA) based on a dataset that contains variables regarding a player's assists, points, height, etc. Let's say that this dataset contains information on 253 players with 114 that were actually drafted and 139 that weren't drafted. The confusion matrix below shows a model's results where a player that is drafted is the "positive" class and a player that is not drafted is the "negative" class.
-
-```{r 05-Chapter5-15, echo=FALSE, out.width = "50%", fig.width=4, fig.height=5, fig.align='center'}
-knitr::include_graphics("Chapter_5/Module5_2_Input/Module5_2_Image5.png")
-```
-
-Helpful confusion matrix terminology:
-
-+ **True positive (TP)**: the number of correctly classified "positive" data points (i.e., the number of correctly classified players to be drafted)
-+ **True negative (TN)**: the number of correctly classified "negative" data points (i.e., the number of correctly classified players to be not drafted)
-+ **False positive (FP)**: the number of data points incorrectly classified as "positive" (i.e., the number of players not drafted incorrectly classified as draft picks)
-+ **False negative (FN)**: the number of data points incorrectly classified as "negative" (i.e., the number of draft picks incorrectly classified as players not drafted)
-
-
-Some of the metrics that can be obtained from a confusion matrix are listed below:
-
-+ **Overall Accuracy:** indicates how often the model makes a correct prediction relative to the total number of predictions made and is typically used to assess overall model performance ($\frac{TP+TN}{TP+TN+FP+FN}$).
-
-+ **Sensitivity or Recall:** evaluates how well the model was able to predict the "positive" class. It is calculated as the ratio of correctly classified true positives to the total number of positive cases ($\frac{TP}{TP+FN}$).
-
-+ **Specificity:** evaluates how well the model was able to predict the "negative" class. It is calculated as the ratio of correctly classified true negatives to the total number of negative cases ($\frac{TN}{TN+FP}$).
-
-+ **Balanced Accuracy:** is the mean of sensitivity and specificity and is often used in the case of a class imbalance to gauge how well the model can correctly predict values for both classes ($\frac{sensitivity+specificity}{2}$).
-
-+ **Positive Predictive Value (PPV) or Precision:** evaluates how accurate predictions of the "positive" class are. It is calculated as the ratio of correctly classified true positives to the total number of predicted positives ($\frac{TP}{TP+FP}$).
-
-+ **Negative Predictive Value (NPV):** evaluates how accurate predictions of the "negative" class are. It is calculated as the ratio of correctly classified true negatives to the total number of predicted negatives ($\frac{TN}{TN+FN}$).
-
-For the above metrics, values fall between 0 and 1. Instances of 0 indicate that the model was not able to classify any data points correctly, and instances of 1 indicate that the model was able to classify all test data correctly. Although subjective, an overall accuracy of at least 0.7 is considered respectable ([Barkved, 2022](https://www.obviously.ai/post/machine-learning-model-performance#:~:text=Good%20accuracy%20in%20machine%20learning,also%20consistent%20with%20industry%20standards.)). Furthermore, a variety of additional metrics exist for evaluating model performance for classification problems ([24 Evaluation Metrics for Binary Classification (And When to Use Them)](https://neptune.ai/blog/evaluation-metrics-binary-classification)). Selecting a metric for evaluating model performance varies by situation and is dependent not only on the individual dataset, but also the question being answered.
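As a worked example, the metrics above can be computed directly from the four confusion matrix counts. The counts below are hypothetical, chosen only to be consistent with the 253-player NBA example (114 drafted, 139 not drafted); they are not the values shown in the figure.

```r
# Hypothetical confusion matrix counts for the NBA-draft example
TP <- 90   # drafted players correctly predicted as drafted
FN <- 24   # drafted players predicted as not drafted   (TP + FN = 114)
TN <- 120  # non-drafted players correctly predicted    (TN + FP = 139)
FP <- 19   # non-drafted players predicted as drafted

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # overall accuracy
sensitivity <- TP / (TP + FN)                   # recall for the "positive" class
specificity <- TN / (TN + FP)                   # recall for the "negative" class
balanced    <- (sensitivity + specificity) / 2  # balanced accuracy
ppv         <- TP / (TP + FP)                   # precision
npv         <- TN / (TN + FN)                   # negative predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        ppv = ppv, npv = npv), 3)
```

With these counts, overall accuracy works out to (90 + 120) / 253 ≈ 0.83, which would clear the "respectable" threshold mentioned above, while the lower sensitivity (≈ 0.79) than specificity (≈ 0.86) shows why class-specific metrics are worth reporting alongside accuracy.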
-
-
-**Note**: For multi-class classification (more than two labeled outcomes to be predicted), the same metrics are often used, but are obtained in a slightly different way. Regression-based supervised machine learning models use loss functions to evaluate model performance. For more information regarding confusion matrices and loss functions for regression-based models, see:
-
- + [Additional Confusion Matrix Metrics](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)
- + [Precision vs. Recall or Specificity vs. Sensitivity](https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1)
- + [Loss Functions for Machine Learning Regression](https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3)
-
-
-
-
-## Introduction to Activity and Example Dataset
-
-In this activity, we will analyze an example dataset to see whether we can use environmental monitoring data to predict areas of contamination using random forest (RF). This example model will leverage a dataset of well water variables that span geospatial location, sampling date, and well water attributes, with the goal of predicting whether detectable levels of inorganic arsenic (iAs) are present. This dataset was obtained through the sampling of 713 private wells across North Carolina through the University of North Carolina Superfund Research Program ([UNC-SRP](https://sph.unc.edu/superfund-pages/srp/)) using an analytical method that was capable of detecting levels of iAs greater than 5 ppm. As demonstrated through the script below, the algorithm will first be trained and tested, and then the resulting model performance will be assessed using the previously detailed confusion matrix and related performance metrics.
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Which well water variables, spanning various geospatial locations, sampling dates, and well water attributes, significantly differ between samples containing detectable levels of iAs vs samples that are not contaminated/ non-detectable?
-2. How can we train a random forest (RF) model to predict whether a well might be contaminated with iAs?
-3. With this RF model, can we predict if iAs will be detected based on well water information?
-4. How could this RF model be improved upon, acknowledging that there is class imbalance?
-
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 05-Chapter5-16, echo=TRUE, eval=TRUE}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step. Otherwise, run the code below, which checks installation status for you.
-```{r 05-Chapter5-17, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
-if (!requireNamespace("readxl"))
- install.packages("readxl");
-if (!requireNamespace("lubridate"))
- install.packages("lubridate");
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("gtsummary"))
- install.packages("gtsummary");
-if (!requireNamespace("flextable"))
- install.packages("flextable");
-if (!requireNamespace("caret"))
- install.packages("caret");
-if (!requireNamespace("randomForest"))
- install.packages("randomForest");
-if (!requireNamespace("cardx"))
- install.packages("cardx");
-```
-
-#### Loading R packages required for this session
-```{r 05-Chapter5-18, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
-library(readxl);
-library(lubridate);
-library(tidyverse);
-library(gtsummary);
-library(flextable);
-library(caret);
-library(randomForest);
-library(cardx);
-```
-
-#### Set your working directory
-```{r 05-Chapter5-19, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-```{r 05-Chapter5-20, echo=TRUE, eval=TRUE}
-# Load the data
-arsenic_data <- data.frame(read_xlsx("Chapter_5/Module5_2_Input/Module5_2_InputData.xlsx"))
-
-# View the top of the dataset
-head(arsenic_data)
-```
-
-The columns in this dataset are described below:
-
-+ `Well_ID`: Unique id for each well (This is the sample identifier and not a predictive feature)
-+ `Water_Sample_Date`: Date that the well was sampled
-+ `Casing_Depth`: Depth of the casing of the well (ft)
-+ `Well_Depth`: Depth of the well (ft)
-+ `Static_Water_Depth`: Static water depth in the well (ft)
-+ `Flow_Rate`: Well flow rate (gallons per minute)
-+ `pH`: pH of water sample
-+ `Detect_Concentration`: Binary identifier (either non-detect "ND" or detect "D") if iAs concentration detected in water sample
-
-### Changing Data Types
-First, `Detect_Concentration` needs to be converted from a character to a factor so that the random forest model knows that the non-detect class is the baseline or "negative" class, while the detect class will be the "positive" class. `Water_Sample_Date` will be converted from a character to a date type using the `mdy()` function from the *lubridate* package, so that the model recognizes this column contains dates.
-```{r 05-Chapter5-21, echo=TRUE, eval=TRUE}
-arsenic_data <- arsenic_data %>%
- # Converting `Detect_Concentration` from a character to a factor
- mutate(Detect_Concentration = relevel(factor(Detect_Concentration), ref = "ND"),
- # Converting water sample date from a character to a date type
- Water_Sample_Date = mdy(Water_Sample_Date)) %>%
-  # Removing the well ID and only keeping the predictor and outcome variables in the dataset
- # This allows us to put the entire dataframe as is into RF
- select(-Well_ID)
-
-# Look at the top of the revised dataset
-head(arsenic_data)
-```
-
-
-
-## Testing for Differences in Predictor Variables across the Outcome Classes
-
-It is useful to run summary statistics on the variables that will be used as predictors in the algorithm to see if there are differences in distributions between the outcome classes (either non-detect or detect in this case). Greater significance typically corresponds to better predictivity for a variable, since the model is better able to separate the classes. We'll use the `tbl_summary()` function from the *gtsummary* package. Note, this may only be practical with smaller datasets or for a subset of predictors if there are many.
-
-For more information on the `tbl_summary()` function, check out this helpful [Tutorial](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html).
-```{r 05-Chapter5-22, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
-arsenic_data %>%
- # Displaying the mean and standard deviation in parentheses for all continuous variables
- tbl_summary(
- by = Detect_Concentration,
- statistic = list(all_continuous() ~ "{mean} ({sd})")
- ) %>%
- # Adding a column that displays the total number of samples for each variable
- add_n() %>%
- # Adding a column that displays the p-value from a one-way ANOVA test
- add_p(
- test = list(all_continuous() ~ "oneway.test"),
- test.args = list(all_continuous() ~ list(var.equal = TRUE))
- ) %>%
- as_flex_table() %>%
- bold(bold = TRUE, part = "header")
-
-```
-
-
-Note that N refers to the total sample number; ND refers to the samples that contained non-detectable levels of iAs; and D refers to the samples that contained detectable levels of iAs.
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: Which well water variables, spanning various geospatial locations, sampling dates, and well water attributes, significantly differ between samples containing detectable levels of iAs vs samples that are not contaminated/ non-detect?
-:::
-
-:::answer
-**Answer**: All of the evaluated descriptor variables differ significantly (p < 0.05) between detect and non-detect iAs samples, with the exception of the sample date and the static water depth.
-:::
-
-With these findings, we feel comfortable moving forward with these well water descriptive variables as predictors in our model.
-
-
-
-### Setting up Cross Validation
-At this point, we can move forward with training and testing a RF model aimed at predicting whether or not detectable levels of iAs are present in well water samples. We'll first split the data into training and testing sets, then glance at the distribution of `Detect_Concentration` within each.
-```{r 05-Chapter5-23, train_test, echo=TRUE, eval=TRUE}
-
-# Set seed for reproducibility
-set.seed(17)
-
-# Establish a list of indices that will be used to identify our training and testing data with a 60-40 split
-tt_indices <- createDataPartition(y = arsenic_data$Detect_Concentration, p = 0.6, list = FALSE)
-
-# Use indices to make our training and testing datasets and view the number of Ds and NDs
-iAs_train <- arsenic_data[tt_indices,]
-table(iAs_train$Detect_Concentration)
-
-iAs_test <- arsenic_data[-tt_indices,]
-table(iAs_test$Detect_Concentration)
-```
-
-We can see that there are notably more non-detects (`ND`) than detects (`D`) in both our training and testing sets. This is something important to consider when evaluating our model's performance.
-
-Now we can set up our cross validation and train our model. We will be using the `trainControl()` function from the *caret* package for this task. *caret* is one of the most commonly used libraries for supervised machine learning in R and can be leveraged for a variety of algorithms, including RF, SVM, KNN, and others. This model will be trained with 5-fold cross validation. Additionally, we will test 2, 3, and 6 predictors through the `mtry` parameter.
-
-See the *caret* documentation [here](https://cran.r-project.org/web/packages/caret/vignettes/caret.html).
-```{r 05-Chapter5-24, train, echo=TRUE, eval=TRUE}
-
-# Establish the parameters for our cross validation with 5 folds
-control <- trainControl(method = 'cv',
- number = 5,
- search = 'grid',
- classProbs = TRUE)
-
-# Establish grid of predictors to test in our model as part of hyperparameter tuning
-p <- ncol(arsenic_data) - 1 # p is the total number of predictors in the dataset
-tunegrid_rf <- expand.grid(mtry = c(floor(sqrt(p)), p/2, p)) # We will test sqrt(p), p/2, and p predictors (2,3,& 6 predictors, respectively) to see which performs best
-```
-
-
-
-## Predicting iAs Detection with a Random Forest (RF) Model
-```{r 05-Chapter5-25}
-# Look at the column names in training dataset
-colnames(iAs_train)
-
-# Train model
-rf_train <- train(x = iAs_train[,1:6], # Our predictor variables are in columns 1-6 of the dataframe
- y = iAs_train[,7], # Our outcome variable is in column 7 of the dataframe
- trControl = control, # Specify the cross-validation parameters we defined above
- method = 'rf', # Specify we want to train a Random Forest
- importance = TRUE, # This parameter calculates the variable importance for RF models specifically, which can help with downstream analyses
- tuneGrid = tunegrid_rf, # Specify the number of predictors we want to test as defined above
- metric = "Accuracy") # Specify what evaluation metric we want to use to decide which model is the best
-
-# Look at the results of training
-rf_train
-
-# Save the best model from our training. The best performing model is determined by the number of predictor variables we tested that resulted in the highest accuracy during the cross validation step.
-rf_final <- rf_train$finalModel
-
-# View confusion matrix for best model
-rf_final
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: How can we train a random forest (RF) model to predict whether a well might be contaminated with iAs?
-:::
-
-:::answer
-**Answer**: As is standard practice with supervised ML, we split our full dataset into a training dataset and a test dataset using a 60-40 split. Using the *caret* package, we implemented 5-fold cross validation to train a RF while also testing different numbers of predictors to see which optimized performance. The model that resulted in the greatest accuracy was selected as the final model.
-:::
-
-Now we can see how well our model does on data it hasn't seen before by applying it to our testing data.
-```{r 05-Chapter5-26, test, echo=TRUE, eval=TRUE}
-# Use our best model to predict the classes for our test data. We need to make sure we remove the column of Ds/NDs from our test data.
-rf_res <- predict(rf_final, iAs_test %>%
- select(!Detect_Concentration))
-
-# View a confusion matrix of the results and gauge model performance
-# Be sure to include the 'positive' parameter to specify the correct positive class
-confusionMatrix(rf_res, iAs_test$Detect_Concentration, positive = "D")
-```
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: With this RF model, can we predict if iAs will be detected based on well water information?
-:::
-
-:::answer
-**Answer**: We can use this model to predict whether iAs will be detected in well water, given that an overall accuracy of ~0.72 is decent. However, we should consider other metrics that may influence how good we feel about this model, depending on what is important to the question we are trying to answer. For example, the model did a good job at predicting non-detect data, based on a specificity of ~0.85 and an NPV of ~0.78, but struggled at predicting detect data, based on a sensitivity of ~0.39 and a PPV of ~0.50 (recall that we set the detect class as the "positive" class). Additionally, the balanced accuracy of ~0.62 further emphasizes the difference in the model's predictive ability for non-detects versus detects. If it is highly important to us that detects are classified correctly, we may want to improve this model before implementing it.
-:::
-
-
-
-## Class Imbalance
-
-It is worth noting that this discrepancy in predictive capabilities for detects vs. non-detects makes sense given the observed class imbalance in our training data. There were notably more non-detects than detects in the training set, so the model was exposed to more of these data points and struggles to distinguish the unique characteristics of detects. Additionally, we told the training algorithm to prioritize selecting a final model based on its overall accuracy. In instances of heavy class imbalance, a high accuracy is commonly achieved simply because the more prevalent class is predicted more often, though this doesn't give the full picture of the model's predictive capabilities. For example, consider a dog/cat classifier trained on a set of 90 dogs and 10 cats: a model could achieve 90% accuracy by predicting dog every time, which isn't at all helpful in predicting cats.
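-
-To make the dog/cat illustration concrete, here is a minimal base R sketch (the counts are hypothetical):
-```{r, eval=FALSE}
-# Hypothetical sample of 90 dogs and 10 cats
-truth <- factor(c(rep("dog", 90), rep("cat", 10)))
-
-# A "model" that predicts dog every time
-pred <- factor(rep("dog", 100), levels = levels(truth))
-
-# Overall accuracy is 0.9, yet not a single cat is predicted correctly
-mean(pred == truth)
-```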
-
-This is particularly important because, for toxicology-related datasets, the "positive" class often represents the class with greater public health risk/interest but can have less data. For example, consider classifying subjects based upon whether or not they have asthma using gene expression data. Asthmatics would likely be the "positive" class, but given that asthmatics are less prevalent than non-asthmatics in the general population, they would likely represent the minority class too.
-
-To address this issue, a few methods can be considered. Full implementation of these approaches is beyond the scope of this module, but relevant resources for further exploration are given.
-
-+ **Synthetic Minority Oversampling Technique (SMOTE)**- increases the number of minority class samples in the training data, thereby reducing the class imbalance by synthetically generating additional samples derived from the existing minority class samples.
- + [SMOTE Oversampling & Tutorial On How To Implement In Python And R](https://spotintelligence.com/2023/02/17/smote-oversampling-python-r/#:~:text=Conclusion-,The%20SMOTE%20(Synthetic%20Minority%20Over%2Dsampling%20Technique)%20algorithm%20is,datasets%20that%20aren't%20balanced.)
- + [How to Use SMOTE for Imbalanced Data in R (With Example)](https://www.statology.org/smote-in-r/)
-
-+ **Adjusting the loss function**- Loss functions in machine learning quantify the penalty for a bad prediction. They can be adjusted so that the minority class is penalized more heavily, forcing the model to learn to make fewer mistakes when predicting the minority class.
-
-+ **Alternative Performance Metrics**- When training the model, alternative metrics to overall accuracy may yield a more robust model capable of better predicting the minority class. Example alternatives may include balanced accuracy or an [F1-score](https://thedatascientist.com/f-1-measure-useful-imbalanced-class-problems/). The *caret* package further allows for [custom, user-defined metrics](https://topepo.github.io/caret/model-training-and-tuning.html#alternate-performance-metrics) to be evaluated during training by specifying the *summaryFunction* parameter in the `trainControl()` function, as seen below, in addition to the [`defaultSummary()` and `twoClassSummary()` functions](https://cran.r-project.org/web/packages/caret/vignettes/caret.html).
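-
-As a rough sketch of the first two approaches, the code below uses the standalone `smote()` helper from the *themis* package and the `classwt` argument of `randomForest()`; the exact arguments and weights shown are illustrative, so consult each package's documentation before applying them to your own data.
-```{r, eval=FALSE}
-library(themis)
-library(randomForest)
-
-# Oversampling the minority class so the classes are balanced
-# Note: SMOTE requires numeric predictors, so non-numeric columns
-# (e.g., dates) may need to be converted or dropped first
-balanced_train <- smote(iAs_train, var = "Detect_Concentration")
-table(balanced_train$Detect_Concentration)
-
-# Alternatively, up-weighting the minority class so mistakes on it
-# are penalized more heavily (these weights are illustrative, not tuned)
-rf_weighted <- randomForest(Detect_Concentration ~ ., data = iAs_train,
-                            classwt = c(ND = 1, D = 3))
-```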
-
-In the example code below, we're creating a function (`f1`) that will calculate the F1 score and find the optimal model with the highest F1 score as opposed to the highest accuracy as we did above.
-```{r 05-Chapter5-27, alt_metric, echo=TRUE, eval=FALSE}
-install.packages("MLmetrics")
-library(MLmetrics)
-
-f1 <- function(data, lev = NULL, model = NULL) {
- # Creating a function to calculate the F1 score
- f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
- c(F1 = f1_val)
-}
-
-# 5 fold CV
-ctrl <- trainControl(
- method = "cv",
- number = 5,
- classProbs = TRUE,
- summaryFunction = f1
-)
-
-# Training the RF model; here X represents the predictor columns and Y the outcome vector
-mod <- train(x = X,
- y = Y,
- trControl = ctrl,
- method = "rf",
- tuneGrid = tunegrid_rf,
- importance = TRUE,
- # Basing the best model performance off of the F1 score within 5 CV
- metric = "F1")
-```
-
-For more in-depth information and additional ways to address class imbalance check out [How to Deal with Imbalanced Data in Classification](https://medium.com/game-of-bits/how-to-deal-with-imbalanced-data-in-classification-bd03cfc66066).
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer **Environmental Health Question #4***: How could this RF model be improved upon, acknowledging that there is class imbalance?
-:::
-
-:::answer
-**Answer**: We can implement SMOTE to increase the number of training data points for the minority class, thereby reducing the class imbalance. In conjunction with SMOTE, another approach is selecting an alternative performance metric during training that better takes the existing class imbalance into consideration, such as balanced accuracy or an F1-score, to improve our predictive ability for the minority class.
-:::
-
-
-
-## Concluding Remarks
-
-In conclusion, this training module has provided an introduction to supervised machine learning using classification techniques in R. Machine learning is a powerful tool that can help researchers gain new insights and improve models to analyze complex datasets faster and in a more comprehensive way. The example we've explored demonstrates the utility of supervised machine learning models on an environmentally relevant dataset.
-
-
-
-### Additional Resources
-To learn more check out the following resources:
-
-+ [IBM - What is Machine Learning](https://www.ibm.com/topics/machine-learning)
-+ [Curate List of AI and Machine Learning Resources](https://medium.com/machine-learning-in-practice/my-curated-list-of-ai-and-machine-learning-resources-from-around-the-web-9a97823b8524)
-+ [Introduction to Machine Learning in R](https://machinelearningmastery.com/machine-learning-in-r-step-by-step/)
-+ Machine Learning by Mueller, J. P. (2021). Machine learning for dummies. John Wiley & Sons.
-
-
-
-
-
-:::tyk
-Using the "Module5_2TYKInput.xlsx" file, use RF to determine if well water data can accurately predict Manganese detection. The data are structured similarly to the "Module5_2_InputData.xlsx" file used in this module; however, they now include 4 additional features:
-
-+ `Longitude`: Longitude of address (decimal degrees)
-+ `Latitude`: Latitude of address (decimal degrees)
-+ `Stream_Distance`: Euclidean distance to the nearest stream (feet)
-+ `Elevation`: Surface elevation of the sample location (feet)
-:::
-
-# 5.3 Supervised Machine Learning Model Interpretation
-
-This training module was developed by Alexis Payton, Lauren E. Koval, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Supervised machine learning (ML) represents a subset of ML methods wherein the outcome variable is known or assigned prior to training a model to be able to predict said outcome. As we discussed in previous modules, ML methods are advantageous in that they easily incorporate a multitude of potential predictor variables, which allows these models to more closely consider real-world, complex environmental health scenarios and offer new insights through a more holistic consideration of available data inputs. However, one disadvantage of ML is that it is often not as easily interpretable as traditional statistics (e.g., regression-based methods with defined beta coefficients for each input predictor variable). With this limitation in mind, there are methods and concepts that can be applied to supervised ML algorithms to aid in the understanding of their predictions, including variable (feature) importance and decision boundaries, which we will cover in this module. We will also include example visualization techniques of these methods, representing important aspects contributing to model interpretability, since visualizing helps convey concepts faster and across a broader target audience. In addition, this module addresses methods to communicate these findings in a paper so that a wider span of readers can understand overall take-home points. As with other data analyses, we advise focusing just as much on the **why** components of a study's research question(s) as opposed to only focusing on the **what** or **how**. To elaborate, we explain through this module that it is not as important to explain all the intricacies of how a model works and how its parameters were tuned; rather, it is more important to focus on why a particular model was selected and how it will be leveraged to answer your research questions. This can all be a bit subjective and requires expertise within your research field.
-
-As a first step, let's learn about some model interpretation methodologies, highlighting **Variable Importance** and **Decision Boundaries** as important examples relevant to environmental health research. Then, this training module will further describe approaches to summarize these methods and communicate supervised ML findings to a broader audience.
-
-
-
-## Variable Importance
-
-When a supervised ML algorithm makes predictions, it relies more heavily on some variables than others. How much a variable contributes to classifying data is known as **variable (feature) importance**. Oftentimes, this is thought of as the impact on overall model performance if a variable were to be removed from the model. There are many methods that are used to measure feature importance, including...
-
-+ **SHapley Additive exPlanations (SHAP)**: based on game theory, where each variable is considered a "player" and we seek to determine each player's contribution to the outcome of a "game" (i.e., overall model performance). It divides the model performance metric amongst all the variables, so that the sum of the Shapley values for all the predictors is equal to the overall model performance. For more information on SHAP, see [A Novel Approach to Feature Importance](https://towardsdatascience.com/a-novel-approach-to-feature-importance-shapley-additive-explanations-d18af30fc21b).
-
-+ **Mean decrease gini (gini impurity)**: quantifies the improvement in predictivity with the addition of each predictor in a decision tree, which is then averaged over all the decision trees tested. The higher the value, the greater the variable's importance to the algorithm. This metric can easily be extracted from classification-based models, including random forest (RF) classifications, which is what we will focus on in this module.
-
-Note for RF regression-based models, node purity can be extracted as a measure of feature importance. For more information, please see the following resources regarding [Feature Importance](https://www.baeldung.com/cs/ml-feature-importance) and [Mean Decrease Gini](https://cran.r-project.org/web/packages/rfVarImpOOB/vignettes/rfVarImpOOB-vignette.html).
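-
-As a brief sketch, mean decrease gini can be pulled directly from a fitted *randomForest* object; here `rf_fit` stands in for any classification fit, such as the final model trained in the previous module:
-```{r, eval=FALSE}
-library(randomForest)
-
-# Per-predictor mean decrease gini (type = 2 requests the impurity-based measure)
-importance(rf_fit, type = 2)
-
-# Built-in dot chart of the same values, sorted by importance
-varImpPlot(rf_fit)
-```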
-
-
-
-## Decision Boundary
-Another concept that is pertinent to a model's interpretability is understanding a decision boundary and how visualizing it can further aid in understanding how the model classifies new data points. A **decision boundary** is a line (or a hyperplane) that seeks to separate the training data by class. This line can be linear or non-linear and is formed in n-dimensional space. To clarify, although support vector machine (SVM) specifically uses decision boundaries to classify training data and make predictions on test data, decision boundaries can still be drawn for other algorithms.
-
-A decision boundary can be visualized to convey how well an algorithm is able to classify an outcome based on the data given. It is important to note that most ML models make use of datasets that contain three or more predictors, and it is difficult to visualize a plot in more than three dimensions. Therefore, the number of features and which features to plot need to be narrowed down to two variables. For this reason, the resulting visualization is not a true representation of the decision boundary from the initial model using all predictors, since the visualization only relies on prediction results from two variables. Nevertheless, decision boundary plots can be powerful visualizations to determine thresholds between the outcome classes.
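-
-A common way to draw such a plot, sketched below with placeholder names (`fit2` for a classifier trained on just two predictors, and `dat` for a data frame with predictor columns `x1` and `x2` and outcome `y`), is to predict over a dense grid spanning the two features and shade the plane by predicted class:
-```{r, eval=FALSE}
-library(ggplot2)
-
-# A grid spanning the observed range of the two chosen predictors
-grid <- expand.grid(
- x1 = seq(min(dat$x1), max(dat$x1), length.out = 200),
- x2 = seq(min(dat$x2), max(dat$x2), length.out = 200)
-)
-
-# Predicted class at every grid point from a model fit on these two predictors
-grid$pred_class <- predict(fit2, newdata = grid)
-
-# Shaded regions convey the decision boundary; points show the training data
-ggplot() +
- geom_tile(data = grid, aes(x = x1, y = x2, fill = pred_class), alpha = 0.3) +
- geom_point(data = dat, aes(x = x1, y = x2, color = y)) +
- theme_light()
-```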
-
-When choosing variables for decision boundary plots, the features that have the most influence on the model are often selected, but that is not always the case. Sometimes predictors are selected based upon the environmental health implications relevant to the research question. For example, in [Perryman et al.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285721), lung response following ozone exposure was investigated by sampling derivatives of cholesterol biosynthesis in human subjects. In this paper, these sterol metabolites were used to predict whether a subject would be classified as having a lung response that was considered non-responsive or responsive. A decision boundary plot was made using two predictors:
-
-+ Cholesterol, given that it had the highest variable importance and
-+ Vitamin D, given its synthesis can be affected by ozone despite it having a lower variable importance in the paper's models.
-```{r 05-Chapter5-28, echo=FALSE, fig.align='center', out.width = "80%"}
-knitr::include_graphics("Chapter_5/Module5_3_Input/Module5_3_Image1.png")
-```
-
-**Figure 5. Decision boundary plot for SVM model predicting lung response class.** Cholesterol and 25-hydroxyvitamin D were used as predictors visualizing responder status [non-responders (green) and responders (yellow)] and disease status [non-asthmatics (triangles) and asthmatics (circles)]. The shaded regions are the model’s prediction of a subject’s lung response class at a given cholesterol and 25-hydroxyvitamin D concentration.
-
-Takeaways from this decision boundary plot:
-
-+ Subjects with more lung inflammation ("responders") after ozone exposure tended to have higher Vitamin D levels (> 35pmol/mL) and lower Cholesterol levels (< 675nmol/mL).
-+ These "responder" subjects were more likely to be non-asthmatics.
-
-
-
-## Introduction to Example Dataset and Activity
-
-In the previous module, we investigated whether a classification-based RF model using well water variables could accurately predict inorganic arsenic (iAs) contamination. While it is helpful to know whether certain variables can be used to construct a model that accurately predicts detectability, from a public health standpoint, it is also helpful to know which of those features contribute the most to a model's accuracy. Therefore, if we can identify the features that are associated with lower arsenic detection, we can use that information to inform policies when new wells are constructed. In addition to identifying variables with the greatest importance to the algorithm, it is also pertinent to understand the ranges in which a well is more or less likely to have arsenic detected. For example, are wells with a lower flow rate more likely to have arsenic detected? In this module, this will be addressed by extracting variable importance from the same algorithm and plotting it. The two features with the highest variable importance will be identified and used to construct a decision boundary plot to determine how these features are associated with iAs detection.
-
-The data to be used in this module was described and referenced previously in **TAME 2.0 Module 5.2 Supervised Machine Learning**.
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. After plotting variable importance from highest to lowest, which two predictors have the highest variable importance on the predictive accuracy of iAs detection from a RF algorithm?
-2. Using the two features with the highest variable importance, under what conditions are we more likely to predict detectable iAs in wells based on a decision boundary plot?
-3. How do the decision boundaries shift after incorporating SMOTE to address class imbalance?
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 05-Chapter5-29}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you
-```{r 05-Chapter5-30, message=FALSE}
-if (!requireNamespace("readxl"))
- install.packages("readxl");
-if (!requireNamespace("lubridate"))
- install.packages("lubridate");
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("caret"))
- install.packages("caret");
-if (!requireNamespace("randomForest"))
- install.packages("randomForest");
-if (!requireNamespace("themis"))
- install.packages("themis");
-```
-
-#### Loading R packages required for this session
-```{r 05-Chapter5-31, message=FALSE}
-library(readxl)
-library(lubridate)
-library(tidyverse)
-library(caret)
-library(randomForest)
-library(e1071)
-library(ggsci)
-library(themis)
-```
-
-#### Set your working directory
-```{r 05-Chapter5-32, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-```{r 05-Chapter5-33}
-# Load the data
-arsenic_data <- data.frame(read_excel("Chapter_5/Module5_3_Input/Module5_3_InputData.xlsx"))
-
-# View the top of the dataset
-head(arsenic_data)
-```
-
-### Changing Data Types
-First, `Detect_Concentration` needs to be converted from a character to a factor so that Random Forest knows that the non-detect class is the baseline or "negative" class, while the detect class will be the "positive" class. `Water_Sample_Date` will be converted from a character to a date type using the `mdy()` function from the *lubridate* package. This is done so that the model understands this column contains dates.
-```{r 05-Chapter5-34}
-arsenic_data <- arsenic_data %>%
- # Converting `Detect_Concentration` from a character to a factor
- mutate(Detect_Concentration = relevel(factor(Detect_Concentration), ref = "ND"),
- # Converting water sample date from a character to a date type
- Water_Sample_Date = mdy(Water_Sample_Date)) %>%
- # Removing well id and only keeping the predictor and outcome variables in the dataset
- # This allows us to put the entire dataframe as is into RF
- select(-Well_ID)
-
-# View the top of the current dataset
-head(arsenic_data)
-```
-
-
-### Setting up Cross Validation
-Note that the code below is different than the code presented in the previous module, **TAME 2.0 Module 5.2 Supervised Machine Learning**. Both coding methods are valid and produce comparable results; however, we wanted to present another way to run *k*-fold cross validation and random forest. In 5-fold cross validation (CV), there are 5 equally-sized folds (ideally!). This means that 80% of the original dataset is split into the 4 folds that comprise the training set, and the remaining 20% in the last fold is reserved for the test set.
-
-Previously, the `trainControl()` function was used for CV. This time we'll use the `createFolds()` function also from the *caret* package.
-```{r 05-Chapter5-35}
-# Setting seed for reproducibility
-set.seed(12)
-
-# 5-fold cross validation
-arsenic_index = createFolds(arsenic_data$Detect_Concentration, k = 5)
-
-# Seeing if about 20% of the records are in the testing set
-kfold1 = arsenic_index[[1]]
-length(kfold1)/nrow(arsenic_data)
-
-# Creating vectors for parameters to be tuned
-ntree_values = c(50, 250, 500) # number of decision trees
-p = dim(arsenic_data)[2] - 1 # number of predictor variables in the dataset
-mtry_values = c(sqrt(p), p/2, p) # number of predictors to be used in the model
-```
-
-
-## Predicting iAs Detection with a Random Forest (RF) Model
-Notice that in the code below we are choosing the final RF model to be the one with the lowest out of bag (OOB) error. In the previous module, the final model was chosen based on the highest accuracy; however, this is a similar approach given that OOB error = 1 - accuracy.
-```{r 05-Chapter5-36}
-# Setting the seed again so the predictions are consistent
-set.seed(12)
-
-# Creating an empty dataframe to save the confusion matrix metrics and variable importance
-metrics = data.frame()
-variable_importance_df = data.frame()
-
-# Iterating through the cross validation folds
-for (i in 1:length(arsenic_index)){
- # Training data
- data_train = arsenic_data[-arsenic_index[[i]],]
-
- # Test data
- data_test = arsenic_data[arsenic_index[[i]],]
-
- # Creating empty lists and dataframes to store errors
- reg_rf_pred_tune = list()
- rf_OOB_errors = list()
- rf_error_df = data.frame()
-
- # Tuning parameters: using ntree and mtry values to determine which combination yields the smallest OOB error
- # from the validation datasets
- for (j in 1:length(ntree_values)){
- for (k in 1:length(mtry_values)){
-
- # Running RF to tune parameters
- reg_rf_pred_tune[[k]] = randomForest(Detect_Concentration ~ ., data = data_train,
- ntree = ntree_values[j], mtry = mtry_values[k])
- # Obtaining the OOB error
- rf_OOB_errors[[k]] = data.frame("Tree Number" = ntree_values[j], "Variable Number" = mtry_values[k],
- "OOB_errors" = reg_rf_pred_tune[[k]]$err.rate[ntree_values[j],1])
-
- # Storing the values in a dataframe
- rf_error_df = rbind(rf_error_df, rf_OOB_errors[[k]])
- }
- }
-
- # Finding the tuning combination(s) with the lowest OOB error from the grid search
- best_oob_errors <- which(rf_error_df$OOB_errors == min(rf_error_df$OOB_errors))
-
- # Now running RF on the entire training set with the tuned parameters
- # This will be done 5 times for each fold
- reg_rf <- randomForest(Detect_Concentration ~ ., data = data_train,
- ntree = rf_error_df$Tree.Number[min(best_oob_errors)],
- mtry = rf_error_df$Variable.Number[min(best_oob_errors)])
-
- # Predicting on test set and adding the predicted values as an additional column to the test data
- data_test$Pred_Detect_Concentration = predict(reg_rf, newdata = data_test, type = "response")
- matrix = confusionMatrix(data = data_test$Pred_Detect_Concentration,
- reference = data_test$Detect_Concentration, positive = "D")
-
- # Extracting accuracy, sens, spec, PPV, NPV and adding to the dataframe to take mean later
- matrix_values = data.frame(t(c(matrix$byClass[11])), t(c(matrix$byClass[1:4])))
- metrics = rbind(metrics, matrix_values)
-
- # Extracting variable importance
- variable_importance_values = data.frame(importance(reg_rf)) %>%
- rownames_to_column(var = "Predictor")
- variable_importance_df = rbind(variable_importance_df, variable_importance_values)
-}
-
-# Taking average across the 5 folds
-metrics = metrics %>%
- summarise(`Balanced Accuracy` = mean(Balanced.Accuracy), Sensitivity = mean(Sensitivity),
- Specificity = mean(Specificity), PPV = mean(Pos.Pred.Value), NPV = mean(Neg.Pred.Value))
-
-variable_importance_df = variable_importance_df %>%
- group_by(Predictor) %>%
- summarise(MeanDecreaseGini = mean(MeanDecreaseGini)) %>%
- # Sorting from highest to lowest
- arrange(-MeanDecreaseGini)
-```
-
-The confusion matrix results from the previous module are shown below.
-```{r 05-Chapter5-37, echo=FALSE, fig.align='center', out.width = "80%"}
-knitr::include_graphics("Chapter_5/Module5_3_Input/Module5_3_Image2.png")
-```
-
-Now let's double check that when using this new method, our results are still comparable.
-```{r 05-Chapter5-38}
-# First comparing results to the previous module
-round(metrics, 2)
-```
-
-They are! Now we'll take a look at the model's variable importance.
-```{r 05-Chapter5-39}
-variable_importance_df
-```
-
-Although we have the results we need, let's take it a step further and plot the data.
-
-### Reformatting the dataframe for plotting
-First, the dataframe will be transformed so that the figure is more legible. Specifically, spaces will be added between the variables, and the `Predictor` column will be put into a factor to rearrange the order of the variables from lowest to highest mean decrease gini. For additional information on tricks like this to make visualizations easier to read, see **TAME 2.0 Module 3.2 Improving Data Visualizations**.
-```{r 05-Chapter5-40}
-# Adding spaces between the variables that need the space
-modified_variable_importance_df = variable_importance_df %>%
- mutate(Predictor = gsub("_", " ", Predictor))
-
-# Saving the order of the variables from lowest to highest mean decrease gini by putting into a factor
-predictor_order = rev(modified_variable_importance_df$Predictor)
-modified_variable_importance_df$Predictor = factor(modified_variable_importance_df$Predictor,
- levels = predictor_order)
-
-head(modified_variable_importance_df)
-```
-
-## Variable Importance Plot
-```{r 05-Chapter5-41, fig.align='center', out.width = "65%"}
-ggplot(data = modified_variable_importance_df,
- aes(x = MeanDecreaseGini, y = Predictor, size = 2)) +
- geom_point() +
-
- theme_light() +
- theme(axis.line = element_line(color = "black"), #making x and y axes black
- axis.text = element_text(size = 12), #changing size of x axis labels
- axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
- legend.title = element_text(face = 'bold', size = 14), #changes legend title
- legend.text = element_text(size = 12), #changes legend text
- strip.text.x = element_text(size = 15, face = "bold"), #changes size of facet x axis
- strip.text.y = element_text(size = 15, face = "bold")) + #changes size of facet y axis
- labs(x = 'Variable Importance', y = 'Predictor') + #changing axis labels
-
- guides(size = "none")#removing size legend
-```
-An appropriate title for this figure could be:
-
-“**Figure X. Variable importance from random forest models predicting iAs detection.** Variable importance is derived from mean decrease gini values extracted from random forest models. Features are listed on the y axis from greatest (top) to least (bottom) mean decrease gini."
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: After plotting variable importance from highest to lowest, which two predictors have the highest variable importance on the predictive accuracy of iAs detection from a RF algorithm?
-:::
-
-:::answer
-**Answer**: From the variable importance dataframe and plot, we can see that casing depth and pH had the greatest impact on the RF model, followed by water sample date, flow rate, static water depth, and well depth in descending order.
-:::
-
-Since casing depth and pH have been identified as the predictors with the highest variable importance, they will be prioritized as the two predictors included in the decision boundary plot example below.
-
-
-
-### Decision Boundary Calculation
-
-First, models will be trained using only casing depth and pH as variables. Since the decision boundary plot will be used for visualization purposes, and a 2-D figure can only plot two variables, we will not worry about tuning the parameters as was previously done. In this module, we're creating a decision boundary based on a random forest model; however, we'll also explore what decision boundaries look like for other algorithms, including support vector machine (SVM), k-nearest neighbor (KNN), and logistic regression. Each supervised ML method has its advantages, and performance is dependent upon the situation and the dataset. Therefore, it is common to see multiple models used to predict an outcome of interest in a publication. Let's create additional boundary plots still using casing depth and pH, but this time we will use logistic regression, SVM, and KNN as comparisons to RF.
-```{r 05-Chapter5-42}
-# Creating a dataframe with variables based on the highest predictors
-highest_pred_data = data.frame(arsenic_data[,c("Casing_Depth", "pH", "Detect_Concentration")])
-
-# Training RF
-rf_detect_arsenic = randomForest(Detect_Concentration~., data = highest_pred_data)
-
-# Logistic regression
-lr_detect_arsenic = glm(Detect_Concentration~., data = highest_pred_data, family = binomial(link = 'logit'))
-
-# SVM with a radial kernel (hyperplane)
-svm_detect_arsenic = svm(Detect_Concentration~., data = highest_pred_data, kernel = "radial")
-
-# KNN
-knn_detect_arsenic = knn3(Detect_Concentration~., data = highest_pred_data) # using the default number of neighbors (k = 5)
-```
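Note that, unlike the classifiers above, `glm()` predictions are probabilities rather than class labels, so a cutoff is needed to convert them into classes. As a quick illustration on made-up data (the names and values here are hypothetical):

```r
set.seed(3)

# Toy binary outcome with a single predictor
toy <- data.frame(
  y = factor(rep(c("ND", "D"), each = 25), levels = c("ND", "D")),
  x = c(rnorm(25, mean = 0), rnorm(25, mean = 2))
)

fit <- glm(y ~ x, data = toy, family = binomial(link = "logit"))

# type = "response" returns P(y = "D"); probabilities >= 0.5 become "D" predictions
probs <- predict(fit, type = "response")
pred_class <- ifelse(probs >= 0.5, "D", "ND")

table(Predicted = pred_class, Actual = toy$y)
```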
-
-From these models, decision boundaries will be calculated. This will be done by predicting `Detect_Concentration` across a grid of values spanning the range (minimum to maximum) of the two predictors (casing depth and pH). A non-linear line will then be drawn on the plot to separate the two classes.
-```{r 05-Chapter5-43}
-get_grid_df <- function(classification_model, data, resolution = 100, predict_type) {
- # This function predicts the outcome (Detect_Concentration) at evenly spaced data points using the two variables (pH and casing depth)
- # to create a decision boundary between the outcome classes (detect and non-detect samples).
-
- # :parameters: a classification-based supervised machine learning model, dataset containing the predictors and outcome variable,
- # specifies the number of data points to make between the minimum and maximum predictor values, prediction type
- # :output: a grid of values for both predictors and their corresponding predicted outcome class
-
- # Grabbing only the predictor data
- predictor_data <- data[,1:2]
-
- # Creating a dataframe that contains the min and max for both features
- min_max_df <- sapply(predictor_data, range, na.rm = TRUE)
-
- # Creating a vector of evenly spaced points between the min and max for the first variable (casing depth)
- variable1_vector <- seq(min_max_df[1,1], min_max_df[2,1], length.out = resolution)
- # Creating a vector of evenly spaced points between the min and max for the second variable (pH)
- variable2_vector <- seq(min_max_df[1,2], min_max_df[2,2], length.out = resolution)
-
- # Creating a dataframe of grid values by combining the two vectors
-  grid_df <- data.frame(cbind(rep(variable1_vector, each = resolution),
-                              rep(variable2_vector, times = resolution)))
- colnames(grid_df) <- colnames(min_max_df)
-
- # Predicting class label based on all the predictor pairs of data
- grid_df$Pred_Class = predict(classification_model, grid_df, type = predict_type)
-
- return(grid_df)
-}
-
-# calling function
-# RF
-grid_df_rf = get_grid_df(rf_detect_arsenic, highest_pred_data, predict_type = "class") %>%
- # Adding in a column that indicates the model so all the dataframes can be combined
- mutate(Model = "A. Random Forest")
-
-# SVM with a radial kernel (hyperplane)
-grid_df_svm = get_grid_df(svm_detect_arsenic, highest_pred_data, predict_type = "class") %>%
- mutate(Model = "B. Support Vector Machine")
-
-# KNN
-grid_df_knn = get_grid_df(knn_detect_arsenic, highest_pred_data, predict_type = "class") %>%
- mutate(Model = "C. K Nearest Neighbor")
-
-# Logistic regression
-grid_df_lr = get_grid_df(lr_detect_arsenic, highest_pred_data, predict_type = "response") %>%
- # First specifying the cutoff point for logistic regression predictions
- # If the response is >= 0.5 it will be classified as a detect prediction
- mutate(Pred_Class = relevel(factor(ifelse(Pred_Class >= 0.5, "D", "ND")), ref = "ND"),
- Model = "D. Logistic Regression")
-
-# Creating 1 dataframe
-grid_df = rbind(grid_df_rf, grid_df_lr, grid_df_svm, grid_df_knn)
-
-# Viewing the dataframe to be plotted
-head(grid_df)
-```
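As an aside, the manual `rep()`/`cbind()` construction inside `get_grid_df()` is one way to build the grid; base R's `expand.grid()` produces every pairwise combination in a single call. A minimal sketch with illustrative predictor ranges (the values below are made up, not taken from the well water dataset):

```r
resolution <- 100

# Illustrative ranges standing in for the two predictors
casing_depth_seq <- seq(10, 200, length.out = resolution)
ph_seq <- seq(5.5, 8.5, length.out = resolution)

# One row per combination of the two vectors
grid_df <- expand.grid(Casing_Depth = casing_depth_seq, pH = ph_seq)

dim(grid_df) # 10000 predictor pairs (100 x 100) across 2 columns
```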
-## Decision Boundary Plot
-
-Now let's plot the grid of predictions with the sampled data.
-```{r 05-Chapter5-44, warning = FALSE, fig.width=15, fig.height=10, fig.align='center'}
-# choosing a color palette from the ggsci package
-ggsci_colors = pal_npg()(5)
-
-ggplot() +
- geom_point(data = arsenic_data, aes(x = pH, y = Casing_Depth, color = Detect_Concentration),
- position = position_jitter(w = 0.1, h = 0.1), size = 4, alpha = 0.8) +
- geom_contour(data = grid_df, aes(x = pH, y = Casing_Depth, z = as.numeric(Pred_Class == "D")),
- color = "black", breaks = 0.5) + # adds contour line
- geom_point(data = grid_df, aes(x = pH, y = Casing_Depth, color = Pred_Class),
- size = 0.1) + # shades plot
- xlim(5.9, NA) + # changes the limits of the x axis
-
- facet_wrap(~Model, scales = 'free') +
-
- theme_light() +
- theme(axis.line = element_line(color = "black"), #making x and y axes black
- axis.text = element_text(size = 10), #changing size of x axis labels
- axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
- legend.title = element_text(face = 'bold', size = 12), #changes legend title
- legend.text = element_text(size = 12), #changes legend text
-        legend.position = "bottom", # moves the legend to the bottom
- legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid'), # changes legend background
- strip.text = element_text(size = 15, face = "bold")) + #changes size of facet x axis
- labs(y = 'Casing Depth (ft)') + #changing axis labels
-
- scale_color_manual(name = "Arsenic Detection", # renaming the legend
- values = ggsci_colors[c(4,5)],
- labels = c('Non-Detect','Detect')) # renaming the classes
-
-
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Using the two features with the highest variable importance, under what conditions are we more likely to predict detectable iAs in wells based on a decision boundary plot?
-:::
-
-:::answer
-**Answer**: There is some overlap between detect and non-detect iAs samples; however, it is evident that wells with detectable levels of iAs were more likely to have lower (<80 ft) casing depths and a more basic pH (> 7) based on the RF and KNN models. SVM and logistic regression could likely have captured a greater "detect" region, indicating that these models struggled to predict "detect" values. In the next section, SMOTE will be used to see if these decision boundaries can be improved.
-:::
-
-
-
-## Decision Boundary Plot Incorporating SMOTE
-
-Here, we will create a decision boundary plot still using casing depth and pH, but this time we will make our dataset more balanced to see how this visually improves model performance. The **Synthetic Minority Oversampling Technique (SMOTE)** was introduced in **TAME 2.0 Module 5.2 Supervised Machine Learning** and will be used to make the dataset more balanced by oversampling the minority class (detect values) and undersampling the majority class (non-detect values).
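To build intuition for what balancing does before running `smotenc()`, here is a deliberately simplified sketch on toy data (the variable names and values are made up). It balances classes by resampling the minority class with replacement; SMOTE goes further by interpolating new synthetic minority samples between neighboring observations rather than duplicating existing rows:

```r
set.seed(42)

# Toy imbalanced dataset: 80 non-detects ("ND") vs 20 detects ("D")
toy <- data.frame(
  pH = c(rnorm(80, mean = 6.5), rnorm(20, mean = 7.5)),
  Detect_Concentration = factor(rep(c("ND", "D"), times = c(80, 20)))
)

# Resample the minority class with replacement until the classes are equal in size
minority <- toy[toy$Detect_Concentration == "D", ]
extra <- minority[sample(nrow(minority), 60, replace = TRUE), ]
balanced_toy <- rbind(toy, extra)

table(balanced_toy$Detect_Concentration)
#  D ND
# 80 80
```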
-
-Starting by training each model:
-```{r 05-Chapter5-45}
-# Using SMOTE first to balance classes
-balanced_highest_pred_data = smotenc(highest_pred_data, "Detect_Concentration")
-
-# Training RF
-rf_detect_arsenic = randomForest(Detect_Concentration~., data = balanced_highest_pred_data)
-
-# Logistic regression
-lr_detect_arsenic = glm(Detect_Concentration~., data = balanced_highest_pred_data, family = binomial(link = 'logit'))
-
-# SVM with a radial kernel (hyperplane)
-svm_detect_arsenic = svm(Detect_Concentration~., data = balanced_highest_pred_data, kernel = "radial")
-
-# KNN
-knn_detect_arsenic = knn3(Detect_Concentration~., data = balanced_highest_pred_data) # using the default number of neighbors (k = 5)
-```
-
-Now calling the `get_grid_df()` function we created above to create a grid of predictions.
-```{r 05-Chapter5-46}
-# Calling function
-# RF
-balanced_grid_df_rf = get_grid_df(rf_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
- # Adding in a column that indicates the model so all the dataframes can be combined
- mutate(Model = "A. Random Forest")
-
-# SVM with a radial kernel (hyperplane)
-balanced_grid_df_svm = get_grid_df(svm_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
- mutate(Model = "B. Support Vector Machine")
-
-# KNN
-balanced_grid_df_knn = get_grid_df(knn_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
- mutate(Model = "C. K Nearest Neighbor")
-
-# Logistic regression
-balanced_grid_df_lr = get_grid_df(lr_detect_arsenic, balanced_highest_pred_data, predict_type = "response") %>%
- # First specifying the cutoff point for logistic regression predictions
- # If the response is >= 0.5 it will be classified as a detect prediction
- mutate(Pred_Class = relevel(factor(ifelse(Pred_Class >= 0.5, "D", "ND")), ref = "ND"),
- Model = "D. Logistic Regression")
-
-
-# Creating 1 dataframe
-balanced_grid_df = rbind(balanced_grid_df_rf, balanced_grid_df_lr, balanced_grid_df_svm, balanced_grid_df_knn)
-
-# Viewing the dataframe to be plotted
-head(balanced_grid_df)
-```
-
-```{r 05-Chapter5-47, warning = FALSE, fig.width=15, fig.height=10, fig.align='center'}
-# choosing a color palette from the ggsci package
-ggsci_colors = pal_npg()(5)
-
-ggplot() +
- geom_point(data = arsenic_data, aes(x = pH, y = Casing_Depth, color = Detect_Concentration),
- position = position_jitter(w = 0.1, h = 0.1), size = 4, alpha = 0.8) +
- geom_contour(data = balanced_grid_df, aes(x = pH, y = Casing_Depth, z = as.numeric(Pred_Class == "D")),
- color = "black", breaks = 0.5) + # adds contour line
- geom_point(data = balanced_grid_df, aes(x = pH, y = Casing_Depth, color = Pred_Class),
- size = 0.1) + # shades plot
- xlim(5.9, NA) + # changes the limits of the x axis
-
- facet_wrap(~Model, scales = 'free') +
-
- theme_light() +
- theme(axis.line = element_line(color = "black"), #making x and y axes black
- axis.text = element_text(size = 10), #changing size of x axis labels
- axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
- legend.title = element_text(face = 'bold', size = 12), #changes legend title
- legend.text = element_text(size = 12), #changes legend text
-        legend.position = "bottom", # moves the legend to the bottom
- legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid'), # changes legend background
- strip.text = element_text(size = 15, face = "bold")) + #changes size of facet x axis
- labs(y = 'Casing Depth (ft)') + #changing axis labels
-
- scale_color_manual(name = "Arsenic Detection", # renaming the legend
- values = ggsci_colors[c(4,5)],
- labels = c('Non-Detect','Detect')) # renaming the classes
-```
-An appropriate title for this figure could be:
-
-“**Figure X. Decision boundary plots from supervised machine learning models predicting iAs detection.** The two predictors with the greatest impact on model performance, casing depth and pH, were used to visualize arsenic detection [non-detect (red) and detect (blue)]. The shaded regions represent prediction of a well's detection class based on varying casing depth and pH values using (A) Random Forest, (B) Support Vector Machine, (C) K Nearest Neighbor, and (D) Logistic Regression."
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: How do the decision boundaries shift after incorporating SMOTE to address class imbalance?
-:::
-
-:::answer
-**Answer**: It is still evident that wells with detectable levels of iAs were more likely to have lower (<80 ft) casing depths and a more basic pH (> 7). However, we see the greatest shifts in the decision boundaries of SVM and logistic regression, with both models now assigning greater regions to detectable iAs levels.
-:::
-
-
-
-## Concluding Remarks
-In conclusion, this training module provided methodologies to aid in the interpretation of supervised ML through variable importance and decision boundary plots. Variable importance quantifies the impact of each feature on an algorithm's predictive accuracy. The most important or environmentally relevant predictors can then be selected for a decision boundary plot to further understand and visualize their impact on the model's classifications.
-
-
-
-### Additional Resources
-
-+ Christoph Molnar. (2019, August 27). Interpretable Machine Learning. Github.io. https://christophm.github.io/interpretable-ml-book/
-+ [Variable Importance](https://compgenomr.github.io/book/trees-and-forests-random-forests-in-action.html#variable-importance-1)
-+ [Decision Boundary](https://rpubs.com/ZheWangDataAnalytics/DecisionBoundary)
-
-
-
-
-
-:::tyk
-1. Using "Module5_2_TYKInput.xlsx", use RF to determine if well water data can accurately predict manganese detection as was done in the previous module. However, this time, incorporate SMOTE in the model. Feel free to use either the `trainControl()` or `createFolds()` function for CV. Extract the variable importance for each predictor in the RF model. What two features have the highest variable importance? **Hint**: Regardless of the cross validation function you choose, run SMOTE on the training dataset only to create a more balanced training set while the test set remains unchanged.
-
-2. Using casing depth and the feature with the highest variable importance, construct a decision boundary plot. Under what conditions is a well more likely to have detectable manganese levels based on the decision boundary plot?
-:::
-
-# 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA
-
-This training module was developed by David M. Reif with contributions from Alexis Payton, Lauren E. Koval, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-To reiterate what has been discussed in the previous module, machine learning is a field that has great utility in environmental health sciences, often to investigate high-dimensional datasets. The two main classifications of machine learning discussed throughout the TAME Toolkit are supervised and unsupervised machine learning, though additional classifications exist. Previously, we discussed artificial intelligence and supervised machine learning in **TAME 2.0 Module 5.1 Introduction to Machine Learning & Artificial Intelligence**, **TAME 2.0 Module 5.2 Supervised Machine Learning**, and **TAME 2.0 Module 5.3 Supervised Machine Learning Model Interpretation**. In this module, we'll cover background information on unsupervised machine learning and then work through a scripted example of an unsupervised machine learning analysis.
-
-## Introduction to Unsupervised Machine Learning
-
-**Unsupervised machine learning**, as opposed to supervised machine learning, involves training a model on a dataset lacking ground truths or response variables. In this regard, unsupervised approaches are often used to identify underlying patterns amongst data in a more unbiased manner. This can provide the analyst with insights into the data that may not otherwise be apparent. Unsupervised machine learning has been used for understanding differences in gene expression patterns of breast cancer patients ([Jezequel et. al, 2015](https://link.springer.com/article/10.1186/s13058-015-0550-y)) and evaluating metabolomic signatures of patients with and without cystic fibrosis ([Laguna et. al, 2015](https://onlinelibrary.wiley.com/doi/full/10.1002/ppul.23225?casa_token=Vqlz3JgGm10AAAAA%3A4UFubAP2r97CKl9PK8oYDfgrcjrs_ZySDzDCx1t3qc6XvQRxOqIwjTn_eQxm_lzX8UQLE0zURJu94fI)).
-
-:::moduletextbox
-**Note**: Unsupervised machine learning is used for exploratory purposes, and just because it can find relationships between data points, that doesn't necessarily mean that those relationships have merit, are indicative of causal relationships, or have direct biological implications. Rather, these methods can be used to find new patterns that can also inform future studies testing direct relationships.
-:::
-
-```{r 05-Chapter5-48, echo=FALSE, out.width = "75%", fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_4_Input/Module5_4_Image1.png")
-```
-
-Langs, G., Röhrich, S., Hofmanninger, J., Prayer, F., Pan, J., Herold, C., & Prosch, H. (2018). Machine learning: from radiomics to discovery and routine. Der Radiologe, 58(S1), 1–6. PMID: [34013136](https://doi.org/10.1007/s00117-018-0407-3). Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
-
-Unsupervised machine learning includes:
-
-+ **Clustering**: Involves grouping elements in a dataset such that the elements in the same group are more similar to each other than to the elements in the other groups.
- + Exclusive (*K*-means)
- + Overlapping
- + Hierarchical
- + Probabilistic
-+ **Dimensionality reduction**: Focuses on taking high-dimensional data and transforming it into a lower-dimensional space that has fewer features while preserving important information inherent to the original dataset. This is useful because reducing the number of features makes the data easier to visualize while trying to maintain the initial integrity of the dataset.
- + Principal Component Analysis (PCA)
- + Singular Value Decomposition (SVD)
- + t-Distributed Stochastic Neighbor Embedding (t-SNE)
- + Uniform Manifold Approximation and Projection (UMAP)
- + Partial Least Squares-Discriminant Analysis (PLS-DA)
-
-
-In this module, we'll focus on methods for ***K*-means clustering** and **Principal Component Analysis** described in more detail in the following sections. In the next module, **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**, we'll focus on hierarchical clustering. For further information on types of unsupervised machine learning, check out [Unsupervised Learning](https://cloud.google.com/discover/what-is-unsupervised-learning#section-3).
-
-
-
-
-### *K*-Means Clustering
-
-*K*-means is a common clustering algorithm used to partition quantitative data. The algorithm works by first randomly placing a pre-specified number of cluster centroids, *k*, across the data space. When using a standard Euclidean distance metric, the distance from each observation to each centroid is calculated, and the observation is assigned to the cluster of the closest centroid. After all observations have been assigned to one of the *k* clusters, the average of all observations in each cluster is calculated, and the centroid of the cluster is moved to the location of that mean. The process then repeats, with distances computed between the observations and the updated centroids. Observations may remain in the same cluster or be reassigned if they are now closer to a different centroid. These iterations continue until cluster assignments no longer change, resulting in the final cluster assignments that are then carried forward for analysis/interpretation.
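The iterative steps just described can be sketched in a few lines of base R on simulated 2-D data. This toy loop is for illustration only; the analysis later in this module uses the built-in `kmeans()` function:

```r
set.seed(1)

# Simulated 2-D data: two loose groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
k <- 2

# Step 1: randomly select k observations as the initial centroids
centroids <- x[sample(nrow(x), k), ]

for (iter in 1:10) {
  # Step 2: squared Euclidean distance from every observation to each centroid
  d <- sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))

  # Step 3: assign each observation to the cluster of its closest centroid
  assignment <- max.col(-d)

  # Step 4: move each centroid to the mean of its assigned observations
  centroids <- t(sapply(1:k, function(j) {
    if (any(assignment == j)) colMeans(x[assignment == j, , drop = FALSE])
    else centroids[j, ] # keep a centroid in place if its cluster empties
  }))
}

table(assignment) # final cluster sizes
```

In practice, `kmeans()` additionally handles convergence checks and multiple random starts (`nstart`), which this sketch omits.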
-
-Helpful resources on *k*-means clustering include the following: [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf) &
-[Towards Data Science](https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a).
-
-
-
-### Principal Component Analysis (PCA)
-
-Principal Component Analysis, or PCA, is a dimensionality-reduction technique used to transform high-dimensional data into a lower dimensional space while trying to preserve as much of the variability in the original data as possible. PCA has strong foundations in linear algebra, so background knowledge of eigenvalues and eigenvectors is extremely useful. Though the mathematics of PCA is beyond the scope of this module, a variety of more in-depth resources on PCA exist including this [Towards Data Science Blog](https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643), and this [Sartorius Blog](https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186#:~:text=Principal%20component%20analysis%2C%20or%20PCA,more%20easily%20visualized%20and%20analyzed.). At a higher level, important concepts in PCA include:
-
-1. PCA partitions variance in a dataset into linearly uncorrelated principal components (PCs), which are weighted combinations of the original features.
-
-2. Each PC (starting from the first one) summarizes a decreasing percentage of variance.
-
-3. Every instance (e.g. chemical) in the original dataset has a "weight" or "score" on each PC.
-
-4. Any combination of PCs can be compared to summarize relationships amongst the instances (e.g. chemicals), but typically it's the first two PCs that capture the majority of the variance.
-```{r 05-Chapter5-49, echo=FALSE, out.width= "80%", fig.align = 'center'}
-knitr::include_graphics("Chapter_5/Module5_4_Input/Module5_4_Image2.png")
-```
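These concepts can be demonstrated with base R's `prcomp()` on simulated data (the 30 "chemicals" and 4 "properties" below are random numbers, purely for illustration):

```r
set.seed(10)

# Simulated data: 30 "chemicals" measured on 4 numeric properties
props <- matrix(rnorm(120), ncol = 4)
props[, 2] <- props[, 1] + rnorm(30, sd = 0.2) # make two properties correlated

# Concept 1: PCA partitions the (scaled) variance into uncorrelated PCs
pca <- prcomp(props, scale. = TRUE)

# Concept 2: each successive PC summarizes a decreasing share of the variance
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(variance_explained, 2)

# Concepts 3 & 4: every instance (row) has a score on each PC, and the first
# two PCs are typically the ones plotted against each other
head(pca$x[, 1:2])
```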
-
-
-
-## Introduction to Example Data
-
-In this activity, we are going to analyze an example dataset of physicochemical property information for chemicals spanning **per- and polyfluoroalkyl substances (PFAS) and statins**. PFAS represent a ubiquitous and pervasive class of man-made industrial chemicals that are commonly used in food packaging, commercial household products such as Teflon, cleaning products, and flame retardants. PFAS are recognized as highly stable compounds that, upon entering the environment, can persist for many years and act as harmful sources of exposure. Statins represent a class of lipid-lowering compounds that are commonly used as pharmaceutical treatments for patients at risk of cardiovascular disease. Because of their common use amongst patients, statins can also end up in water and wastewater effluent, making them environmentally relevant as well.
-
-This example analysis was designed to evaluate the chemical space of these diverse compounds and to illustrate the utility of unsupervised machine learning methods to differentiate chemical class and make associations between chemical groupings that can inform a variety of environmental and toxicological applications. The two types of machine learning methods that will be employed are *k*-means and PCA (as described in the introduction).
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying unsupervised machine learning techniques?
-2. If substances are able to be clustered, what are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
-3. How do the data compare when physicochemical properties are reduced using PCA?
-4. Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component?
-5. If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
-6. What kinds of applications/endpoints can be better understood and/or predicted because of these derived chemical groupings?
-
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 05-Chapter5-50, clear_env, echo=TRUE, eval=TRUE}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step. Otherwise, run the code below, which checks installation status for you:
-```{r 05-Chapter5-51, message=FALSE}
-if (!requireNamespace("factoextra"))
- install.packages("factoextra");
-if (!requireNamespace("pheatmap"))
- install.packages("pheatmap");
-if (!requireNamespace("cowplot"))
- install.packages("cowplot");
-```
-
-#### Loading required R packages
-```{r 05-Chapter5-52, results=FALSE, message=FALSE}
-library(tidyverse)
-library(factoextra)
-library(pheatmap) #used to make heatmaps
-library(cowplot)
-```
-
-Getting help with packages and functions
-```{r 05-Chapter5-53, eval = FALSE}
-?tidyverse # Package documentation for tidyverse
-?kmeans # Package documentation for kmeans (a part of the standard stats R package, automatically loaded)
-?prcomp # Package documentation for deriving principal components within a PCA (a part of the standard stats R package, automatically loaded)
-?pheatmap # Package documentation for pheatmap
-```
-
-#### Set your working directory
-```{r 05-Chapter5-54, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Loading the Example Dataset
-Let's start by loading the datasets needed for this training module. We are going to use a dataset of substances that have a diverse chemical space of PFAS and statin compounds. This list of chemicals will be uploaded alongside physicochemical property data. The chemical lists for 'PFAS' and 'Statins' were obtained from the EPA's Computational Toxicology Dashboard [Chemical Lists](https://comptox.epa.gov/dashboard/chemical-lists). The physicochemical properties were obtained by uploading these lists into the National Toxicology Program’s [Integrated Chemical Environment (ICE)](https://ice.ntp.niehs.nih.gov/).
-```{r 05-Chapter5-55}
-dat <- read.csv("Chapter_5/Module5_4_Input/Module5_4_InputData.csv", fileEncoding = "UTF-8-BOM")
-```
-
-#### Data Viewing
-
-Starting with the overall dimensions:
-```{r 05-Chapter5-56}
-dim(dat)
-```
-
-Then looking at the first four rows and five columns of data:
-```{r 05-Chapter5-57}
-dat[1:4,1:5]
-```
-
-Note that the first column, `List`, designates the following two larger chemical classes:
-```{r 05-Chapter5-58}
-unique(dat$List)
-```
-
-Let's lastly view all of the column headers:
-```{r 05-Chapter5-59}
-colnames(dat)
-```
-
-In the data file, the first four columns represent chemical identifier information. All remaining columns represent different physicochemical properties derived from OPERA via [Integrated Chemical Environment (ICE)](https://ice.ntp.niehs.nih.gov/). Because the original titles of these physicochemical properties contained commas and spaces, R automatically converted these into periods. Hence, titles like `OPERA..Boiling.Point`.
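This renaming comes from `read.csv()`'s default `check.names = TRUE`, which passes the headers through base R's `make.names()`. For instance, if a raw header were `OPERA, Boiling Point` (a hypothetical example of the original formatting):

```r
# Each invalid character (here, the comma and the spaces) becomes a period
make.names("OPERA, Boiling Point")
# [1] "OPERA..Boiling.Point"
```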
-
-For ease of downstream data analyses, let's create a more focused dataframe option containing only one chemical identifier (CASRN) as row names and then just the physicochemical property columns.
-```{r 05-Chapter5-60}
-# Creating a new dataframe that contains the physicochemical properties
-chemical_prop_df <- dat[,5:ncol(dat)]
-rownames(chemical_prop_df) <- dat$CASRN
-```
-
-Now explore this data subset:
-```{r 05-Chapter5-61}
-dim(chemical_prop_df) # overall dimensions
-chemical_prop_df[1:4,1:5] # viewing the first four rows and five columns
-colnames(chemical_prop_df)
-```
-
-
-### Evaluating the Original Physicochemical Properties across Substances
-
-Let's first plot two physicochemical properties to determine if and how substances group together without any fancy data reduction or other machine learning techniques. This will answer **Environmental Health Question #1**: Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying unsupervised machine learning techniques?
-
-Let's put molecular weight (`Molecular.Weight`) as one axis and boiling point (`OPERA..Boiling.Point`) on the other. We'll also color by the chemical classes using the `List` column from the original dataframe.
-```{r 05-Chapter5-62, fig.align='center'}
-ggplot(chemical_prop_df[,1:2], aes(x = Molecular.Weight, y = OPERA..Boiling.Point, color = dat$List)) +
- geom_point(size = 2) + theme_bw() +
- ggtitle('Version A: Bivariate Plot of Two Original Physchem Variables') +
- xlab("Molecular Weight") + ylab("Boiling Point")
-```
-
-Let's plot two other physicochemical property variables, Henry's Law constant (`OPERA..Henry.s.Law.Constant`) and melting point (`OPERA..Melting.Point`), to see if the same separation of chemical classes is apparent.
-```{r 05-Chapter5-63, fig.align='center'}
-ggplot(chemical_prop_df[,3:4], aes(x = OPERA..Henry.s.Law.Constant, y = OPERA..Melting.Point,
- color = dat$List)) +
- geom_point(size = 2) + theme_bw() +
- ggtitle('Version B: Bivariate Plot of Two Other Original Physchem Variables') +
- xlab("OPERA..Henry.s.Law.Constant") + ylab("OPERA..Melting.Point")
-```
-
-### Answer to Environmental Health Question 1
-:::question
-*With these, we can answer **Environmental Health Question #1***: Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying machine learning techniques?
-:::
-
-:::answer
-**Answer**: Only in part. From the first plot, we can see that PFAS tend to have lower molecular weight ranges in comparison to the statins, though other property variables clearly overlap in ranges of values making the groupings not entirely clear.
-:::
-
-
-
-## Identifying Clusters of Chemicals through *K*-Means
-
-Let's turn our attention to **Environmental Health Question #2**: If substances are able to be clustered, what are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means? This will be done by deriving clusters of chemicals based on ALL underlying physicochemical property data using *k*-means clustering.
-
-For this example, let's coerce the *k*-means algorithms to calculate 2 distinct clusters (based on their corresponding mean centered values). Here, we choose to derive two distinct clusters, because we are ultimately going to see if we can use this information to predict each chemical's classification into two distinct chemical classes (i.e., PFAS vs statins). Note that we can derive more clusters using similar code depending on the question being addressed.
-
-We can assign the number of clusters to a variable, `num.centers`, so it can easily be reused in the next lines of code:
-```{r 05-Chapter5-64}
-num.centers <- 2
-```
-
-Here we derive chemical clusters using *k*-means:
-```{r 05-Chapter5-65}
-clusters <- kmeans(chemical_prop_df, # input dataframe
- centers = num.centers, # number of cluster centers to calculate
- iter.max = 1000, # the maximum number of iterations allowed
- nstart = 50) # the number of rows used as the random set for the initial centers (during the first iteration)
-```
-
-The resulting property values that were derived as the final cluster centers can be pulled using:
-```{r 05-Chapter5-66}
-clusters$centers
-```
-
-Let's add the cluster assignments to the physicochemical data and create a new dataframe, which can then be used in a heatmap visualization to see how these physicochemical data distributions clustered according to *k*-means.
-
-These cluster assignments can be pulled from the `cluster` list output, where chemicals are designated to each cluster with either a 1 or 2. You can view these using:
-```{r 05-Chapter5-67}
-clusters$cluster
-```
-
-Because these results are listed in the exact same order as the inputted dataframe, we can simply add these assignments to the `chemical_prop_df` dataframe.
-```{r 05-Chapter5-68}
-dat_wclusters <- cbind(chemical_prop_df,clusters$cluster)
-colnames(dat_wclusters)[11] <- "Cluster" # renaming this new column "Cluster"
-dat_wclusters <- dat_wclusters[order(dat_wclusters$Cluster),] # sorting data by cluster assignments
-```
-
-To generate a heatmap, we need to first create a separate dataframe for the cluster assignments, ordered in the same way as the physicochemical data:
-```{r 05-Chapter5-69}
-hm_cluster <- data.frame(dat_wclusters$Cluster, row.names = row.names(dat_wclusters)) # creating the dataframe
-colnames(hm_cluster) <- "Cluster" # reassigning the column name
-hm_cluster$Cluster <- as.factor(hm_cluster$Cluster) # coercing the cluster numbers into factor variables, to make the heatmap prettier
-
-head(hm_cluster) # viewing this new cluster assignment dataframe
-```
-
-We're going to go ahead and clean up the physicochemical property names to make the heatmap a bit tidier.
-```{r 05-Chapter5-70}
-clean_names1 = gsub("OPERA..", "", colnames(dat_wclusters))
-# "\\." denotes a period
-clean_names2 = gsub("\\.", " ", clean_names1)
-
-# Reassigning the cleaner names back to the df
-colnames(dat_wclusters) = clean_names2
-
-# Going back to add in the apostrophe in "Henry's Law Constant"
-colnames(dat_wclusters)[3] = "Henry's Law Constant"
-```
-Then we can call this dataframe (`dat_wclusters`) in the following heatmap visualization code leveraging the `pheatmap()` function. This function was designed specifically to enable clustered heatmap visualizations. Check out the [pheatmap Documentation](https://www.rdocumentation.org/packages/pheatmap/versions/1.0.12/topics/pheatmap) for additional information.
-
-
-
-### Heatmap Visualization of the Resulting *K*-Means Clusters
-```{r 05-Chapter5-71, fig.height=8, fig.width=10}
-pheatmap(dat_wclusters[,1:10],
- cluster_rows = FALSE, cluster_cols = FALSE, # no further clustering, for simplicity
- scale = "column", # scaling the data to make differences across chemicals more apparent
- annotation_row = hm_cluster, # calling the cluster assignment dataframe as a separate color bar
- annotation_names_row = FALSE, # removing the annotation name ("Cluster") from the plot
- angle_col = 45, fontsize_col = 7, fontsize_row = 3, # adjusting size/ orientation of axes labels
- cellheight = 3, cellwidth = 25, # setting height and width for cells
- border_color = NA # specify no border surrounding the cells
-)
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Heatmap of physicochemical properties with *k*-means cluster assignments.** Shown are the relative values for each physicochemical property labeled on the x axis. Individual chemical names are listed on the y axis. The chemicals are grouped based on their *k*-means cluster assignment as denoted by the color bar on the left."
-
-Notice that the `pheatmap()` function does not add axes or legend titles. Adding them can provide clarity; they can be added after exporting the figure from R using software such as MS PowerPoint or Adobe Illustrator.
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: What are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
-:::
-
-:::answer
-**Answer**: Properties that show obvious differences between the resulting clusters include molecular weight, boiling point, the negative log of the acid dissociation constant, the octanol-air partition coefficient, and the octanol-water distribution coefficient.
-:::
-
-
-
-## Principal Component Analysis (PCA)
-Next, we will run through some example analyses applying the common data reduction technique of PCA. We'll start by determining how much of the variance is captured within the first two principal components to answer **Environmental Health Question #3**: How do the data compare when physicochemical properties are reduced using PCA?
-
-
-We can calculate the principal components across ALL physicochemical data across all chemicals using the `prcomp()` function. Always make sure your data are centered and scaled prior to running PCA, since it is sensitive to variables having different scales.
-```{r 05-Chapter5-72}
-my.pca <- prcomp(chemical_prop_df, # input dataframe of physchem data
- scale = TRUE, center = TRUE)
-```
-
-We can see how much of the variance is captured in each of the eigenvectors, or dimensions, using a scree plot.
-```{r 05-Chapter5-73, fig.align='center'}
-fviz_eig(my.pca, addlabels = TRUE)
-```
-
-We can also calculate these values and pull them into a dataframe for future use. For example, to pull the percentage of variance explained by each principal component, we can run the following calculations, where first eigenvalues (eigs) are calculated and then used to calculate percent of variance per principal component:
-```{r 05-Chapter5-74}
-eigs <- my.pca$sdev^2
-Comp.stats <- data.frame(eigs, eigs/sum(eigs), row.names = paste0("PC", seq_along(eigs))) # labeling rows by principal component
-colnames(Comp.stats) <- c("Eigen_Values", "Percent_of_Variance")
-
-head(Comp.stats)
-```
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: How do the data compare when physicochemical properties are reduced using PCA?
-:::
-
-:::answer
-**Answer**: Principal Component 1 captures ~41% of the variance and Principal Component 2 captures ~24% across all physicochemical property values across all chemicals. Together, these two components describe ~65% of the variance in the data.
-:::
-
-
-
-Next, we'll use PCA to answer **Environmental Health Question #4**: Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component (PC1)?
-
-Here are the resulting scores for each chemical's contribution towards each principal component (shown here as components `PC1`-`PC10`).
-```{r 05-Chapter5-75}
-head(my.pca$x)
-```
-
-And the resulting loading factors of each property's contribution towards each principal component.
-```{r 05-Chapter5-76}
-my.pca$rotation
-```
-
-### Answer to Environmental Health Question 4
-:::question
-*With these results, we can answer **Environmental Health Question #4***: Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component (PC1)?
-:::
-
-:::answer
-**Answer**: Boiling point contributes the most towards principal component 1, as it has the largest magnitude (0.464).
-:::
-
-
-
-
-### Visualizing PCA Results
-
-Let's turn our attention to **Environmental Health Question #5**: If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
-
-We can start answering this question by visualizing the first two principal components and coloring each chemical according to class (i.e., PFAS vs statins).
-```{r 05-Chapter5-77, fig.align='center'}
-ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = dat$List)) +
- geom_point(size = 2) + theme_bw() +
- ggtitle('Version C: PCA Plot of the First 2 PCs, colored by Chemical Class') +
- # it's good practice to put the percentage of the variance captured in the axes titles
- xlab("Principal Component 1 (40.9%)") + ylab("Principal Component 2 (23.8%)")
-```
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can answer **Environmental Health Question #5***: If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
-:::
-
-:::answer
- **Answer**: The data become more compressed, with the reduced variables (principal components) capturing the majority of the variance from the original dataset (~65%). This results in improved data visualizations, where all dimensions of the physicochemical dataset are compressed and captured across the displayed components. In addition, the figure above shows a clear separation between the PFAS and statin chemicals when visualizing the reduced dataset.
-:::
-
-
-
-## Incorporating *K*-Means into PCA for Predictive Modeling
-
-We can also identify cluster-based trends within data that are reduced after running PCA. This example analysis does so, expanding upon the previously generated PCA results.
-
-### Estimate *K*-Means Clusters from PCA Results
-
-Let's first run code similar to the previous *k*-means analysis and associated parameters, though here we will instead use the reduced data values from the PCA. Specifically, clusters will be derived across the PCA "scores" values, where scores represent the relative amount each chemical contributed to each principal component.
-```{r 05-Chapter5-78}
-clusters_PCA <- kmeans(my.pca$x, centers = num.centers, iter.max = 1000, nstart = 50)
-```
-
-The resulting PCA score values that were derived as the final cluster centers can be pulled using:
-```{r 05-Chapter5-79}
-clusters_PCA$centers
-```
-
-Viewing the final cluster assignment per chemical:
-```{r 05-Chapter5-80}
-head(cbind(rownames(chemical_prop_df),clusters_PCA$cluster))
-```
-
-
-
-#### Visualizing *K*-Means Clusters from PCA Results
-
-Let's now view, again, the results of the main PCA focusing on the first two principal components; though this time let's color each chemical according to *k*-means cluster.
-```{r 05-Chapter5-81, fig.align='center'}
-ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = as.factor(clusters_PCA$cluster))) +
- geom_point(size = 2) + theme_bw() +
- ggtitle('Version D: PCA Plot of the First 2 PCs, colored by k-means Clustering') +
- # it's good practice to put the percentage of the variance captured in the axes titles
- xlab("Principal Component 1 (40.9%)") + ylab("Principal Component 2 (23.8%)")
-```
-
-Let's put these two PCA plots side by side to compare them more easily. We'll also tidy up the figures a bit so they're closer to publication-ready.
-```{r 05-Chapter5-82, fig.align='center', fig.width = 20, fig.height = 6, fig.retina= 3}
-# PCA plot colored by chemical class
-pcaplot1 = ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = dat$List)) +
- geom_point(size = 2) +
-
- theme_light() +
- theme(axis.text = element_text(size = 9), # changing size of axis labels
- axis.title = element_text(face = "bold", size = rel(1.3)), # changes axis titles
- legend.title = element_text(face = 'bold', size = 10), # changes legend title
- legend.text = element_text(size = 9)) + # changes legend text
-
- labs(x = 'Principal Component 1 (40.9%)', y = 'Principal Component 2 (23.8%)',
- color = "Chemical Class") # changing axis labels
-
-# PCA Plot by k means clusters
-pcaplot2 = ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = as.factor(clusters_PCA$cluster))) +
- geom_point(size = 2) +
-
- theme_light() +
- theme(axis.text = element_text(size = 9), # changing size of axis labels
- axis.title = element_text(face = "bold", size = rel(1.3)), # changes axis titles
- legend.text = element_text(size = 9)) + # changes legend text
-
- labs(x = 'Principal Component 1 (40.9%)', y = 'Principal Component 2 (23.8%)',
- color = expression(bold(bolditalic(K)-Means~Cluster))) # changing axis labels
-
-# Creating 1 figure
-plot_grid(pcaplot1, pcaplot2,
- # Adding labels, changing size their size and position
- labels = "AUTO", label_size = 15, label_x = 0.03)
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Principal Component Analysis (PCA) plots highlight similarities between chemical class and *k*-means clusters.** These PCA plots are based on physicochemical properties and compare (A) chemical class categories and (B) the *k*-means derived cluster assignments."
-
-### Answer to Environmental Health Question 6
-:::question
-*With this we can answer **Environmental Health Question #6***: What kinds of applications/endpoints can be better understood and/or predicted because of these derived chemical groupings?
-:::
-
-:::answer
-**Answer**: With these well-informed chemical groupings, we can now better understand the variables that contribute to the chemical classifications. We can also use this information to better understand data trends and to predict environmental fate and transport for these chemicals. The reduced variables derived through PCA and/or the *k*-means clustering patterns can also be used as input variables to predict toxicological outcomes.
-:::
-
-
-
-## Concluding Remarks
-In conclusion, this training module provides an example exercise on organizing physicochemical data and analyzing trends within these data to determine chemical groupings. Results produced using just the original data are compared against those from *k*-means clustering and from PCA-reduced data. These methods represent common tools that are used in high dimensional data analyses within the field of environmental health sciences.
-
-### Additional Resources
-+ [Detailed study of Principal Component Analysis](https://f0nzie.github.io/machine_learning_compilation/detailed-study-of-principal-component-analysis.html)
-+ [Practical Guide to Cluster Analysis in R](https://xsliulab.github.io/Workshop/2021/week10/r-cluster-book.pdf)
-
-
-
-
-
-:::tyk
-In this training module, we presented an unsupervised machine learning example in which *k*-means clusters were defined based on chemical class, where *k* = 2. Oftentimes, analyses are conducted to explore potential clustering relationships without a preexisting idea of what *k*, or the number of clusters, should be. In this test your knowledge section, we'll go through an example like that.
-
-Using the accompanying flame retardant and pesticide physicochemical property variables found in the file "Module5_4_TYKInput.csv", answer the following questions:
-
-1. What are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
-2. Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component?
-3. If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to accurately predict whether a chemical is a flame retardant or a pesticide?
-:::
-
-# 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications
-
-This training module was developed by Alexis Payton, Lauren E. Koval, David M. Reif, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-The previous module, **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**, served as an introduction to unsupervised machine learning (ML). **Unsupervised ML** involves training a model on a dataset lacking ground truths or response variables. In the previous module, the number of clusters was selected based on prior information (i.e., chemical class), but what if you're in a situation where you don't know how many clusters to investigate a priori? This commonly occurs in the field of environmental health research when investigators want to take a more unbiased view of their data and/or do not have information that can be used to inform the optimal number of clusters to select. In these instances, unsupervised ML techniques can be very helpful, and in this module, we'll explore the following concepts to further understand unsupervised ML:
-
-+ *K*-Means and hierarchical clustering
-+ Deriving the optimal number of clusters
-+ Visualizing clusters through a PCA-based plot, dendrograms, and heatmaps
-+ Determining each variable's contribution to the clusters
-
-
-
-
-## *K*-Means Clustering
-
-As mentioned in the previous module, *k*-means is a common clustering algorithm used to partition quantitative data. The algorithm works by first randomly placing a pre-specified number of cluster centroids, *k*, across the data space. When using a standard Euclidean distance metric, the distance is calculated from each observation to each centroid, and the observation is assigned to the cluster of the closest centroid. After all observations have been assigned to one of the *k* clusters, the average of all observations in a cluster is calculated, and the centroid for that cluster is moved to the location of the mean. The process then repeats, with distances computed between the observations and the updated centroids. Observations may be reassigned to the same cluster or moved to a different cluster if they are closer to another centroid. These iterations continue until cluster assignments no longer change, resulting in the final cluster assignments that are then carried forward for analysis/interpretation.
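-
-The iterative procedure described above can be sketched in a few lines of base R. This is an illustrative toy implementation only, using the built-in `USArrests` dataset as a stand-in; in practice we rely on R's `kmeans()` function, as elsewhere in this chapter.
-```{r}
-set.seed(101)
-dat <- scale(USArrests) # center and scale the data first
-k <- 2
-centroids <- dat[sample(nrow(dat), k), ] # randomly chosen initial centroids
-
-for (iter in 1:100) {
-  # Assign each observation to the cluster of its closest centroid
-  d <- as.matrix(dist(rbind(centroids, dat)))[-(1:k), 1:k]
-  assignment <- apply(d, 1, which.min)
-
-  # Move each centroid to the mean of its assigned observations
-  new_centroids <- apply(dat, 2, function(col) tapply(col, assignment, mean))
-  if (all(abs(new_centroids - centroids) < 1e-10)) break # assignments stabilized
-  centroids <- new_centroids
-}
-
-table(assignment) # final cluster sizes
-```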
-
-Helpful resources on *k*-means clustering include the following: [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf) &
-[Towards Data Science](https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a).
-
-
-
-## Hierarchical Clustering
-**Hierarchical clustering** groups objects into clusters by repetitively joining similar observations until there is one large cluster (aka agglomerative or bottom-up) or repetitively splitting one large cluster until each observation stands alone (aka divisive or top-down). Regardless of whether agglomerative or divisive hierarchical clustering is used, the results can be visually represented in a tree-like figure called a dendrogram. The dendrogram below is based on the `USArrests` dataset available in R. The dataset contains statistics on violent crime rates (murder, assault, and rape) per capita (per 100,000 residents) for each state in the United States in 1973. For more information on the `USArrests` dataset, check out its associated [RDocumentation](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/USArrests).
-
-```{r 05-Chapter5-83, fig.align = 'center', echo=FALSE, out.width = "55%"}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image1.png")
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Hierarchical clustering of states based on violent crime.** The dendrogram uses violent crime data including murder, assault, and rape rates per 100,000 residents for each state in 1973."
-
-Takeaways from this dendrogram:
-
-+ The 50 states can be grouped into 4 clusters based on violent crime statistics from 1973
-+ The dendrogram can only show us clusters of states, not the data trends that led to the clustering patterns we see. Yes, it is useful to know which states have similar violent crime patterns overall, but it is also important to pinpoint the variables (i.e., murder, assault, and rape) that are responsible for those clustering patterns. This idea will be explored later in the module with an environmentally relevant dataset.
-
-Going back to hierarchical clustering, during the repetitive splitting or joining of observations, the similarity between existing clusters is calculated after each iteration. This value informs the formation of subsequent clusters. Different methods, or linkage functions, can be considered when calculating this similarity, particularly for agglomerative clustering, which is often the preferred approach. Some example methods include:
-
-+ **Complete Linkage**: the maximum distance between two data points located in separate clusters.
-+ **Single Linkage**: the minimum distance between two data points located in separate clusters.
-+ **Average Linkage**: the average pairwise distance between all pairs of data points in separate clusters.
-+ **Centroid Linkage**: the distance between the centroids or centers of each cluster.
-+ **Ward Linkage**: seeks to minimize the variance between clusters.
-
-Each method has its advantages and disadvantages and more information on all distance calculations between clusters can be found at the following resource: [Hierarchical Clustering](https://www.learndatasci.com/glossary/hierarchical-clustering/#Hierarchicalclusteringtypes).
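-
-The linkage choices above map directly onto the `method` argument of base R's `hclust()` function. Here's a short sketch using the `USArrests` data from the dendrogram example (illustrative only; the figure above was generated separately):
-```{r}
-d <- dist(scale(USArrests)) # Euclidean distances on centered/scaled data
-
-hc_complete <- hclust(d, method = "complete") # complete linkage
-hc_ward <- hclust(d, method = "ward.D2") # Ward linkage
-
-# Cut each tree into 4 clusters, as in the dendrogram above
-cut_complete <- cutree(hc_complete, k = 4)
-cut_ward <- cutree(hc_ward, k = 4)
-
-# Cross-tabulate to see how the two linkage methods' assignments differ
-table(cut_complete, cut_ward)
-
-# plot(hc_complete) would draw the corresponding dendrogram
-```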
-
-
-### Deriving the Optimal Number of Clusters
-Before clustering can be performed, the function needs to be informed of the number of clusters to group the objects into. In the previous module, an example was explored to see if *k*-means clustering would group the chemicals similarly to their chemical class (either a PFAS or statin). Therefore, we told the *k*-means function to cluster into 2 groups. In situations where there is little to no prior knowledge regarding the "correct" number of clusters to specify, methods exist for deriving the optimal number of clusters. Three common methods to find the optimal *k*, or number of clusters, for both *k*-means and hierarchical clustering include the **elbow method**, the **silhouette method**, and the **gap statistic method**. These techniques help us determine the optimal *k* using visual inspection.
-
-+ **Elbow Method**: uses a plot of the within cluster sum of squares (WCSS) on the y axis and different values of *k* on the x axis. The location where we no longer observe a significant reduction in WCSS, or where an "elbow" can be seen, is the optimal *k* value. As we can see, after a certain point, having more clusters does not lead to a significant reduction in WCSS.
-```{r 05-Chapter5-84, fig.align = 'center', out.width = "75%", echo=FALSE}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image2.png")
-```
-
-Looking at the figures above, the elbow point is much clearer in the first plot than in the second; however, elbow curves from real-world datasets typically resemble the second figure. This is why it's recommended to consider more than one method when determining the optimal number of clusters.
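-
-The WCSS curve underlying the elbow method can be computed by hand with repeated calls to `kmeans()`. Below is a sketch using the built-in `USArrests` data as a stand-in dataset (helper functions such as `fviz_nbclust()`, used later in this module, automate this):
-```{r}
-set.seed(17)
-dat <- scale(USArrests)
-
-# Total within-cluster sum of squares for k = 1 through 10
-wcss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 25)$tot.withinss)
-
-# WCSS always decreases as k grows; the "elbow" is where the drop levels off
-plot(1:10, wcss, type = "b", xlab = "Number of clusters k", ylab = "WCSS")
-```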
-
-+ **Silhouette Method**: uses a plot of the average silhouette width (score) on the y axis and different values of *k* on the x axis. The silhouette score is a measure of each object's similarity to its own cluster and dissimilarity to other clusters. The location where the average silhouette width is *maximized* is the optimal *k* value.
-```{r 05-Chapter5-85, fig.align = 'center', out.width = "65%", echo=FALSE}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image3.png")
-```
-
-Based on the figure above, the optimal number of clusters is 2 using the silhouette method.
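-
-The average silhouette width can likewise be computed directly with the `silhouette()` function from the *cluster* package (a recommended package distributed with R). A sketch, again using `USArrests` as a stand-in dataset:
-```{r}
-library(cluster)
-
-set.seed(17)
-dat <- scale(USArrests)
-
-# Average silhouette width for k = 2 through 10 (silhouettes require k >= 2)
-avg_sil <- sapply(2:10, function(k) {
-  km <- kmeans(dat, centers = k, nstart = 25)
-  mean(silhouette(km$cluster, dist(dat))[, "sil_width"])
-})
-
-# The k that maximizes the average silhouette width is the suggested cluster number
-which.max(avg_sil) + 1
-```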
-
-+ **Gap Statistic Method**: uses a plot of the gap statistic on the y axis and different values of *k* on the x axis. The gap statistic evaluates the intracluster variation in comparison to expected values derived from a Monte Carlo generated, null reference data distribution for varying values of *k*. The optimal number of clusters is the smallest value where the gap statistic of *k* is greater than or equal to the gap statistic of *k*+1 minus the standard deviation of *k*+1. More details can be found [here](https://uc-r.github.io/kmeans_clustering#:~:text=The%20gap%20statistic%20compares%20the,simulations%20of%20the%20sampling%20process.).
-```{r 05-Chapter5-86, fig.align = 'center', out.width = "65%", echo=FALSE}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image4.png")
-```
-
-Based on the figure above, the optimal number of clusters is 2 using the gap statistic method.
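-
-The gap statistic can be computed with `clusGap()` from the *cluster* package, and the selection rule described above corresponds to the `"Tibs2001SEmax"` criterion of its companion function `maxSE()`. A sketch using `USArrests` as a stand-in dataset:
-```{r}
-library(cluster)
-
-set.seed(17)
-dat <- scale(USArrests)
-
-# B = number of Monte Carlo reference datasets; larger B gives more stable estimates
-gap <- clusGap(dat, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
-
-# Smallest k with gap(k) >= gap(k+1) - SE(k+1)
-best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
-best_k
-```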
-
-For additional information and code on all three methods, check out [Determining the Optimal Number of Clusters: 3 Must Know Methods](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/). It is also worth mentioning that while these methods are useful, further interpreting the result in the context of your problem can be beneficial, for example, checking whether clusters make biological sense when working with a genomic dataset.
-
-
-
-
-## Introduction to Example Data
-
-We will apply these techniques using an example dataset from a previously published study in which 22 cytokine concentrations were derived from 44 subjects with varying smoking statuses (14 non-smokers, 17 e-cigarette users, and 13 cigarette smokers) across 4 different sampling regions in the body. These samples were derived from nasal lavage fluid (NLF), nasal epithelial lining fluid (NELF), sputum, and serum, as pictured below. Samples were taken from different regions in the body to compare cytokine expression in the upper respiratory tract, lower respiratory tract, and systemic circulation.
-```{r 05-Chapter5-87, fig.align = 'center', out.width = "75%", echo=FALSE}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image5.png")
-```
-
-A research question that we had was "Does cytokine expression change based on a subject's smoking habits? If so, does cigarette smoke or e-cigarette vapor induce cytokine suppression or proliferation?" Traditionally, these questions would have been answered by analyzing each biomarker individually using a two-group comparison test like a t test (which we completed in this study). However, biomarkers do not work in isolation in the body, suggesting that individual biomarker statistical approaches may not capture the full biological responses occurring. Therefore, we used a clustering approach to group cytokines in an attempt to more closely simulate interactions that occur *in vivo*. From there, statistical tests were run to assess the effects of smoking status on each cluster.
-
-For the purposes of this training exercise, we will focus solely on the nasal epithelial lining fluid, or NELF, samples. In addition, we'll use *k*-means and hierarchical clustering to compare how cytokines cluster at baseline. Full methods are further described in the publication below:
-
-+ Payton AD, Perryman AN, Hoffman JR, Avula V, Wells H, Robinette C, Alexis NE, Jaspers I, Rager JE, Rebuli ME. Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples. American Journal of Physiology-Lung Cellular and Molecular Physiology 2022 322:5, L722-L736. PMID: [35318855](https://journals.physiology.org/doi/abs/10.1152/ajplung.00299.2021)
-
-Let's read in and view the dataset we'll be working with.
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 05-Chapter5-88}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the below code, which checks installation status for you.
-```{r 05-Chapter5-89, message=FALSE}
-if (!requireNamespace("vegan"))
- install.packages("vegan");
-if (!requireNamespace("ggrepel"))
- install.packages("ggrepel");
-if (!requireNamespace("dendextend"))
- install.packages("dendextend");
-if (!requireNamespace("ggsci"))
- install.packages("ggsci");
-if (!requireNamespace("FactoMineR"))
- install.packages("FactoMineR");
-```
-
-#### Loading required R packages
-```{r 05-Chapter5-90, message=FALSE}
-library(readxl)
-library(factoextra)
-library(FactoMineR)
-library(tidyverse)
-library(vegan)
-library(ggrepel)
-library(reshape2)
-library(pheatmap)
-library(ggsci)
-suppressPackageStartupMessages(library(dendextend))
-```
-
-#### Set your working directory
-```{r 05-Chapter5-91, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Importing example dataset
-
-Then let's read in our example dataset. As mentioned in the introduction, this example dataset contains cytokine concentrations derived from 44 subjects. Let's import and view these data:
-```{r 05-Chapter5-92}
-# Reading in file
-cytokines_df <- data.frame(read_excel("Chapter_5/Module5_5_Input/Module5_5_InputData.xlsx", sheet = 2))
-
-# Viewing data
-head(cytokines_df)
-```
-
-These data contain the following information:
-
-+ `Original_Identifier`: initial identifier given to each subject by our wet bench colleagues
-+ `Group`: denotes the smoking status of the subject ("NS" = "non-smoker", "Ecig" = "E-cigarette user", "CS" = "cigarette smoker")
-+ `SubjectNo`: ordinal subject number assigned to each subject after the dataset was wrangled (1-44)
-+ `SubjectID`: unique subject identifier that combines the group and subject number
-+ `Compartment`: region of the body from which the sample was taken ("NLF" = "nasal lavage fluid sample", "NELF" = "nasal epithelial lining fluid sample", "Sputum" = "induced sputum sample", "Serum" = "blood serum sample")
-+ `Protein`: cytokine name
-+ `Conc`: concentration (pg/mL)
-+ `Conc_pslog2`: pseudo-log~2~ concentration
-
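-As a side note on the `Conc_pslog2` column: a pseudo-log transform shifts concentrations before taking the log so that zero values don't map to negative infinity. The exact offset used in the published study isn't shown here, so the `+ 1` below is an assumption for illustration:
-```{r}
-conc <- c(0, 0.8, 12.5, 340) # hypothetical concentrations (pg/mL)
-conc_pslog2 <- log2(conc + 1) # pseudo-log2: zeros map to 0 instead of -Inf
-conc_pslog2
-```
-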
-Now that the data has been read in, we can start by asking some initial questions about the data.
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-1. What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelial lining fluid (NELF) in non-smokers using *k*-means clustering?
-2. After selecting a cluster number, which cytokines were assigned to each *k*-means cluster?
-3. What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelial lining fluid (NELF) in non-smokers using hierarchical clustering?
-4. How do the hierarchical cluster assignments compare to the *k*-means cluster assignments?
-5. Which cytokines have the greatest contributions to the first two eigenvectors?
-
-To answer the first environmental health question, let's start by filtering to include only NELF derived samples and non-smokers.
-```{r 05-Chapter5-93}
-baseline_df <- cytokines_df %>%
- filter(Group == "NS", Compartment == "NELF")
-
-head(baseline_df)
-```
-
-The functions we use will require us to cast the data wider. We will accomplish this using the `dcast()` function from the *reshape2* package.
-```{r 05-Chapter5-94}
-wider_baseline_df <- reshape2::dcast(baseline_df, Protein ~ SubjectID, value.var = "Conc_pslog2") %>%
- column_to_rownames("Protein")
-
-head(wider_baseline_df)
-```
-
-Now we can use the `fviz_nbclust()` function to determine the optimal *k* based on suggestions from the elbow, silhouette, and gap statistic methods. We can use this code for both *k*-means and hierarchical clustering by changing the `FUNcluster` parameter. Let's start with *k*-means:
-```{r 05-Chapter5-95, fig.align = 'center'}
-# Elbow method
-fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "wss") +
- labs(subtitle = "Elbow method")
-
-# Silhouette method
-fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "silhouette") +
- labs(subtitle = "Silhouette method")
-
-# Gap statistic method
-fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "gap_stat") +
- labs(subtitle = "Gap Statisitc method")
-```
-
-The elbow method suggests 2 or 3 clusters, the silhouette method suggests 2, and the gap statistic method suggests 1. Since these methods recommend different *k* values, we can run *k*-means to visualize the clusters and test those different *k*'s. *K*-means clusters will be visualized using the `fviz_cluster()` function.
-```{r 05-Chapter5-96, fig.align = 'center'}
-# Choosing to iterate through 2 or 3 clusters using i as our iterator
-for (i in 2:3){
- # nstart = number of random starting partitions; using nstart > 1 is recommended
- cluster_k <- kmeans(wider_baseline_df, centers = i, nstart = 25)
- cluster_plot <- fviz_cluster(cluster_k, data = wider_baseline_df) + ggtitle(paste0("k = ", i))
- print(cluster_plot)
-}
-```
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: What is the optimal number of clusters for grouping cytokines derived from nasal epithelium lining fluid (NELF) in non-smokers using *k*-means clustering?
-:::
-
-:::answer
-**Answer**: 2 or 3 clusters can be justified here based on the elbow or silhouette method, or based on whether *k*-means groups together cytokines implicated in similar biological pathways. In the final paper, we moved forward with 3 clusters, because it was justifiable from the methods and provided more granularity in the clusters.
-:::
-
-The final cluster assignments can easily be obtained using the `kmeans()` function from the *stats* package.
-```{r 05-Chapter5-97}
-cluster_kmeans_3 <- kmeans(wider_baseline_df, centers = 3, nstart = 25)
-cluster_kmeans_df <- data.frame(cluster_kmeans_3$cluster) %>%
- rownames_to_column("Cytokine") %>%
- rename(`K-Means Cluster` = cluster_kmeans_3.cluster) %>%
- # Ordering the dataframe for easier comparison
- arrange(`K-Means Cluster`)
-
-cluster_kmeans_df
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: After selecting a cluster number, which cytokines were assigned to each *k*-means cluster?
-:::
-
-:::answer
-**Answer**: After choosing the number of clusters to be 3, the cluster assignments are as follows:
-```{r 05-Chapter5-98, fig.align = 'center', echo=FALSE}
-knitr::include_graphics("Chapter_5/Module5_5_Input/Module5_5_Image7.png")
-```
-:::
-
-
-
-## Hierarchical Clustering
-
-Next, we'll turn our attention to answering environmental health questions 3 and 4: What is the optimal number of clusters for grouping cytokines derived from nasal epithelium lining fluid (NELF) in non-smokers using hierarchical clustering, and how do those assignments compare to the *k*-means cluster assignments?
-
-Just as we used the elbow method, silhouette profile, and gap statistic to determine the optimal number of clusters for *k*-means, we can leverage the same approaches for hierarchical clustering by changing the `FUNcluster` parameter.
-```{r 05-Chapter5-99, fig.align = 'center'}
-# Elbow method
-fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "wss") +
- labs(subtitle = "Elbow method")
-
-# Silhouette method
-fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "silhouette") +
- labs(subtitle = "Silhouette method")
-
-# Gap statistic method
-fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "gap_stat") +
- labs(subtitle = "Gap Statisitc method")
-```
-
-We can see the results are quite similar, with 2-3 clusters appearing optimal.
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: What is the optimal number of clusters for grouping cytokines derived from nasal epithelium lining fluid (NELF) in non-smokers using hierarchical clustering?
-:::
-
-:::answer
-**Answer**: Again, 2 or 3 clusters can be justified here, but for the same reasons mentioned for the first environmental health question, we landed on 3 clusters.
-:::
-
-Now we can perform the clustering, then visualize and extract the results. We'll start by using the `dist()` function to calculate the Euclidean distance between the cytokines, followed by the `hclust()` function to obtain the hierarchical clustering assignments.
-```{r 05-Chapter5-100}
-# Viewing the wider dataframe we'll be working with
-head(wider_baseline_df)
-```
-
-
-```{r 05-Chapter5-101}
-# First scaling the data within each subject (down columns)
-scaled_df <- data.frame(apply(wider_baseline_df, 2, scale))
-rownames(scaled_df) = rownames(wider_baseline_df)
-
-head(scaled_df)
-```
-
-The `dist()` function is first used to calculate the Euclidean distance between each pair of cytokines. Next, the `hclust()` function is used to run the hierarchical clustering analysis, which applies complete linkage by default. The linkage method can be changed via the function's `method` parameter.
-```{r 05-Chapter5-102}
-# Calculating Euclidean distance
-dist_matrix <- dist(scaled_df, method = 'euclidean')
-
-# Hierarchical clustering
-cytokines_hc <- hclust(dist_matrix)
-```
-
-Now we can generate a dendrogram to help us evaluate the results using the `fviz_dend()` function from the *factoextra* package. We use `k = 3` to be consistent with the *k*-means analysis.
-```{r 05-Chapter5-103, fig.align = 'center', out.width = "75%", warning=FALSE}
- fviz_dend(cytokines_hc, k = 3, # Specifying k
- cex = 0.85, # Label size
- palette = "futurama", # Color palette see ?ggpubr::ggpar
- rect = TRUE, rect_fill = TRUE, # Add rectangle around groups
- horiz = TRUE, # Changes the orientation of the dendrogram
- rect_border = "futurama", # Rectangle color
- labels_track_height = 0.8 # Changes the room for labels
- )
-```
-
-We can also extract those cluster assignments using the `cutree()` function from the *stats* package.
-```{r 05-Chapter5-104}
-hc_assignments_df <- data.frame(cutree(cytokines_hc, k = 3)) %>%
- rownames_to_column("Cytokine") %>%
- rename(`Hierarchical Cluster` = cutree.cytokines_hc..k...3.) %>%
- # Ordering the dataframe for easier comparison
- arrange(`Hierarchical Cluster`)
-
-# Combining the dataframes to compare the cluster assignments from each approach
- comp <- full_join(cluster_kmeans_df, hc_assignments_df, by = "Cytokine")
-
- comp
-```
-
-For additional resources on running hierarchical clustering in R, see [Visualizing Clustering Dendrogram in R](https://agroninfotech.blogspot.com/2020/06/visualizing-clusters-in-r-hierarchical.html) and [Hierarchical Clustering on Principal Components](http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/).
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer **Environmental Health Question #4***: How do the hierarchical cluster assignments compare to the *k*-means cluster assignments?
-:::
-
-:::answer
-**Answer**: Though this may not always be the case, in this instance, we see that *k*-means and hierarchical clustering with k=3 clusters yield the same groupings despite the clusters being presented in a different order.
-:::
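-
-Because cluster labels are arbitrary integers, comparing two clusterings by label value alone can be misleading. A quick way to check agreement is a cross-tabulation of the two assignment vectors; the sketch below uses hypothetical label vectors rather than the study assignments:
-```{r 05-Chapter5-104b}
-# Cross-tabulating two sets of cluster assignments; perfect agreement shows
-# exactly one nonzero cell per row and per column, even if labels differ
-kmeans_labels <- c(1, 1, 2, 2, 3, 3) # hypothetical k-means assignments
-hc_labels <- c(2, 2, 3, 3, 1, 1) # same grouping, different label order
-table(kmeans = kmeans_labels, hierarchical = hc_labels)
-```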
-
-
-
-## Clustering Plot
-
-One additional way to visualize clustering is to plot the first two principal components on the axes and color the data points based on their corresponding cluster. This visualization can be used for both *k*-means and hierarchical clustering using the `fviz_cluster()` function. This figure is essentially a PCA plot with shapes drawn around each cluster to make them distinct from each other.
-```{r 05-Chapter5-105, fig.align = 'center', fig.height=5.5, fig.width=6}
-fviz_cluster(cluster_kmeans_3, data = wider_baseline_df)
-```
-
-Rather than using the `fviz_cluster()` function as shown in the figure above, we'll extract the underlying data to recreate the same figure using `ggplot()`. For the manuscript this was necessary, since the plots needed to be faceted by compartment (i.e., NLF, NELF, sputum, and serum). For a single plot, this data extraction isn't required, and the figure above can be customized directly within the `fviz_cluster()` function. However, we'll go through the steps of obtaining the indices needed to recreate the same polygons in `ggplot()`.
-
-The `fviz_cluster()` function actually uses principal component analysis (PCA) to reduce the dataset to two dimensions before plotting the cluster assignments. Therefore, to obtain the coordinates of each cytokine within its respective cluster, PCA will need to be run first.
-```{r 05-Chapter5-106}
-# First running PCA
-pca_cytokine <- prcomp(wider_baseline_df, scale = TRUE, center = TRUE)
-# Only need PC1 and PC2 for plotting, so selecting the first two columns
-baseline_scores_df <- data.frame(scores(pca_cytokine)[,1:2])
-baseline_scores_df$Cluster <- cluster_kmeans_3$cluster
-baseline_scores_df$Protein <- rownames(baseline_scores_df)
-
-# Changing cluster to a character for plotting
-baseline_scores_df$Cluster = as.character(baseline_scores_df$Cluster)
-
-head(baseline_scores_df)
-```
-
-Within each cluster, the `chull()` function is used to compute the indices of the points on the convex hull. These are needed for `ggplot()` to create the polygon shapes of each cluster.
-```{r 05-Chapter5-107}
-# hull values for cluster 1
-cluster_1 <- baseline_scores_df[baseline_scores_df$Cluster == 1, ][chull(baseline_scores_df %>%
- filter(Cluster == 1)),]
-# hull values for cluster 2
-cluster_2 <- baseline_scores_df[baseline_scores_df$Cluster == 2, ][chull(baseline_scores_df %>%
- filter(Cluster == 2)),]
-# hull values for cluster 3
-cluster_3 <- baseline_scores_df[baseline_scores_df$Cluster == 3, ][chull(baseline_scores_df %>%
- filter(Cluster == 3)),]
-all_hulls_baseline <- rbind(cluster_1, cluster_2, cluster_3)
-# Changing cluster to a character for plotting
-all_hulls_baseline$Cluster = as.character(all_hulls_baseline$Cluster)
-
-head(all_hulls_baseline)
-```
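-
-To make the role of `chull()` concrete, here is a minimal self-contained sketch on synthetic 2-D points (the point coordinates are illustrative only):
-```{r 05-Chapter5-107b}
-# chull() returns the row indices of the points forming the convex hull;
-# subsetting by those indices yields the polygon vertices in order
-set.seed(7)
-pts <- data.frame(PC1 = rnorm(8), PC2 = rnorm(8))
-idx <- chull(pts$PC1, pts$PC2) # indices of hull vertices
-hull_pts <- pts[idx, ] # polygon enclosing all 8 points
-hull_pts
-```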
-
-Now plotting the clusters using `ggplot()`.
-```{r 05-Chapter5-108, fig.align = 'center', fig.height=5.5, fig.width=6}
-ggplot() +
- geom_point(data = baseline_scores_df, aes(x = PC1, y = PC2, color = Cluster, shape = Cluster), size = 4) +
- # Adding cytokine names
- geom_text_repel(data = baseline_scores_df, aes(x = PC1, y = PC2, color = Cluster, label = Protein),
- show.legend = FALSE, size = 4.5) +
- # Creating polygon shapes of the clusters
- geom_polygon(data = all_hulls_baseline, aes(x = PC1, y = PC2, group = as.factor(Cluster), fill = Cluster,
- color = Cluster), alpha = 0.25, show.legend = FALSE) +
-
- theme_light() +
- theme(axis.text.x = element_text(vjust = 0.5), #rotating x labels/ moving x labels slightly to the left
- axis.line = element_line(colour="black"), #making x and y axes black
- axis.text = element_text(size = 13), #changing size of x axis labels
- axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
- legend.title = element_text(face = 'bold', size = 17), #changes legend title
- legend.text = element_text(size = 14), #changes legend text
- legend.position = 'bottom', # moving the legend to the bottom
- legend.background = element_rect(colour = 'black', fill = 'white', linetype = 'solid'), #changes the legend background
- strip.text.x = element_text(size = 18, face = "bold"), #changes size of facet x axis
- strip.text.y = element_text(size = 18, face = "bold")) + #changes size of facet y axis
- xlab('Dimension 1 (85.1%)') + ylab('Dimension 2 (7.7%)') + #changing axis labels
-
- # Using colors from the startrek palette from ggsci
- scale_color_startrek(name = 'Cluster') +
- scale_fill_startrek(name = 'Cluster')
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. *K*-means clusters of cytokines at baseline.** Cytokine samples were derived from nasal epithelium lining fluid (NELF) in 14 non-smoking subjects. Cytokine concentration values were transformed using a data reduction technique known as Principal Component Analysis (PCA). The first two eigenvectors plotted on the axes were able to capture a majority of the variance across all samples from the original dataset."
-
-Takeaways from this clustering plot:
-
-+ PCA was able to capture almost all (~93%) of the variance from the original dataset
-+ The 22 cytokines were able to be clustered into 3 distinct clusters using *k*-means
-
-
-
-## Hierarchical Clustering Visualization
-
-We can also build a heatmap using the `pheatmap()` function that has the capability to display hierarchical clustering dendrograms. To do so, we'll need to go back and use the `wider_baseline_df` dataframe.
-```{r 05-Chapter5-109, fig.align = 'center', fig.height=7, fig.width=8}
-pheatmap(wider_baseline_df,
- cluster_cols = FALSE, # hierarchical clustering of cytokines
- scale = "column", # scaling the data to make differences across cytokines more apparent
- cutree_rows = 3, # adds a space between the 3 largest clusters
- display_numbers = TRUE, number_color = "black", fontsize = 12, # adding average concentration values
- angle_col = 45, fontsize_col = 12, fontsize_row = 12, # adjusting size/ orientation of axes labels
- cellheight = 17, cellwidth = 30 # setting height and width for cells
-)
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Hierarchical clustering of cytokines at baseline.** Cytokine samples were derived from nasal epithelium lining fluid (NELF) in 14 non-smoking subjects. The heatmap visualizes pseudo-log~2~ cytokine concentrations that were scaled within each subject."
-
-It may be helpful to add axis titles, like "Subject ID" for the x axis, "Cytokine" for the y axis, and "Scaled pslog~2~ Concentration" for the legend, after exporting from R, since the `pheatmap()` function does not have the functionality to add those titles.
-
-Nevertheless, let's identify some key takeaways from this heatmap:
-
-+ The 22 cytokines were able to be clustered into 3 distinct clusters using hierarchical clustering
-+ These clusters are based on cytokine concentration levels with the first cluster having the highest expression, the second cluster having the lowest expression, and the last cluster having average expression
-
-
-
-## Variable Contributions
-To answer our final environmental health question (Which cytokines have the greatest contributions to the first two eigenvectors?), we'll use the `fviz_contrib()` function, which plots the percentage of each variable's contribution to the specified principal component(s). It also displays a red dashed line; variables that fall above this line are considered to have significant contributions to those principal components. For a refresher on PCA and variable contributions, see the previous module, **TAME 2.0 Module 5.4 Unsupervised Machine Learning**.
-```{r 05-Chapter5-110, fig.align = 'center'}
-# Cytokine contributions (cytokines are the rows, i.e., "individuals" in this PCA)
-fviz_contrib(pca_cytokine,
- choice = "ind", addlabels = TRUE,
- axes = 1:2) # specifies to show contribution percentages for first 2 PCs
-
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. Cytokine contributions to principal components.** The bar chart displays each cytokine's contribution to the first two eigenvectors in descending order from left to right. The red dashed line represents the expected contribution of each cytokine if all inputs were uniform, therefore the seven cytokines that fall above this reference line are considered to have significant contributions to the first two principal components."
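-
-As a sanity check on where that reference line sits, the expected contribution under uniformity is simply 100 divided by the number of items; a small sketch using this module's count of 22 cytokines:
-```{r 05-Chapter5-110b}
-# The dashed reference line in fviz_contrib() corresponds to the expected
-# contribution if every item contributed equally: 100 / (number of items)
-n_cytokines <- 22
-expected_contrib <- 100 / n_cytokines
-round(expected_contrib, 2) # ~4.55%; cytokines above this exceed the uniform expectation
-```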
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can answer **Environmental Health Question #5***: Which cytokines have the greatest contributions to the first two eigenvectors?
-:::
-
-:::answer
-**Answer**: The cytokines that have significant contributions to the first two principal components include IL-8, Fractalkine, IP-10, IL-4, MIG, I309, and IL-12p70.
-:::
-
-
-
-## Concluding Remarks
-In this module, we explored scenarios where clustering would be appropriate but lack contextual details informing the number of clusters that should be considered, thus resulting in the need to derive such a number. In addition, methodology for *k*-means and hierarchical clustering was presented, along with corresponding visualizations. Lastly, variable contributions to the eigenvectors were introduced as a means to determine the most influential variables on the principal components' composition.
-
-### Additional Resources
-+ [*K*-Means Cluster Analysis](https://uc-r.github.io/kmeans_clustering#silo)
-+ [*K*-Means Clustering in R](https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)
-+ [Hierarchical Clustering in R](https://uc-r.github.io/hc_clustering)
-
-
-
-
-
-:::tyk
-Using the same dataset, answer the questions below.
-
-1. Determine the optimal number of *k*-means clusters of cytokines derived from the nasal epithelium lining fluid of **e-cigarette users**.
-2. How do those clusters compare to the ones that were derived at baseline (in non-smokers)?
-3. Which cytokines have the greatest contributions to the first two eigenvectors?
-:::
diff --git a/Chapter_5/5_1_AI/5_1_AI.Rmd b/Chapter_5/5_1_AI/5_1_AI.Rmd
new file mode 100644
index 0000000..59a6ff3
--- /dev/null
+++ b/Chapter_5/5_1_AI/5_1_AI.Rmd
@@ -0,0 +1,173 @@
+# (PART\*) Chapter 5 Machine Learning & Artificial Intelligence {-}
+
+# 5.1 Introduction to Artificial Intelligence, Machine Learning, and Predictive Modeling for Environmental Health
+
+This training module was developed by David M. Reif, with contributions from Elise Hickman, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Artificial intelligence (AI), machine learning (ML), and predictive modeling are becoming increasingly popular buzzwords both in the public domain and within research fields, including environmental health. Within environmental health, these computational techniques are implemented to integrate large, high dimensional datasets (e.g., chemical, biological, clinical/medical, model estimates, etc.) to better understand links between environmental exposures and biological responses.
+
+In this training module, we will:
+
++ Provide general historical context and taxonomy of modern AI/ML
++ Provide an overview of the intersection between environmental health science and ML by discussing...
+ + Why there is a need for ML in environmental health science
+ + The differences between ML and traditional statistical methods
+ + Predictive modeling in the context of environmental health science
+ + Additional applications of ML in environmental health science
+
+
+
+### Training Module's Environmental Health Question
+
+This training module was specifically developed to answer the following environmental health question:
+
++ How and why are artificial intelligence, machine learning, and predictive modeling used in environmental health research?
+
+## General Historical Context and Taxonomy of Modern AI/ML
+
+Before diving into the applications of AI and ML in environmental health, let's first establish what these terms mean and how they are related. Note that the definitions surrounding AI and ML can be subjective; however, the purpose of this module is not to get caught up in semantics, but to broadly understand how AI and ML can be applied to environmental health research.
+
+**Artificial Intelligence (AI)** encompasses computer systems that perform tasks typically associated with human cognition and intelligence. AI is found in our everyday lives, for instance, within face recognition, internet search queries, email spam detection, smart home devices, auto-navigation, and digital assistants.
+
+**Machine Learning (ML)** can be thought of as a subset of AI and describes computer systems that iteratively learn from data and improve from that experience autonomously.
+
+Below is a high level taxonomy of AI. It's not meant to be an exhaustive depiction of all AI techniques but a simple visualization of how some of these methodologies are nested within each other. **Note**: AI can be categorized in different ways and may deviate from what is illustrated below.
+```{r 5-1-AI-1, out.width = "800px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image1.png")
+```
+
+Advantages of AI and ML include the automation of repetitive tasks, complex problem solving, and reducing human error. However, disadvantages include biases in training datasets being reflected in the decisions of AI/ML and the potentially limited interpretability of algorithms created by AI/ML. Check out the following resources for...
+
++ Further explanation on differences in [Artificial Intelligence vs. Machine Learning](https://cloud.google.com/learn/artificial-intelligence-vs-machine-learning)
++ Other subsets of AI that fall outside of the scope of these modules in [Types of Artificial Intelligence](https://builtin.com/artificial-intelligence)
++ Additional discussion on the utility of ML approaches for high-dimensional data common in environmental health research in [Payton et. al](https://www.frontiersin.org/articles/10.3389/ftox.2023.1171175/full)
+
+It is important to understand the methodological "roots" of current methods. Otherwise, it seems like every approach is novel! AI and ML methods have been around since the mid- to late-1900s and continue to evolve in the present day. The earliest conceptual roots for these approaches can be traced to antiquity; however, it is generally thought that the field was named "artificial intelligence" at the ["Dartmouth Workshop"](https://home.dartmouth.edu/about/artificial-intelligence-ai-coined-dartmouth) in 1956, led by John McCarthy and others. The following schematic demonstrates the general taxonomy (categories, sub-fields, and specific methods) of modern AI and ML:
+
+```{r 5-1-AI-2, out.width = "800px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image2.png")
+```
+
+### A Brief Detour to Discuss ChatGPT
+
+**ChatGPT (Chat Generative Pre-trained Transformer)** is a publicly available chatbot developed by OpenAI. It was released in November of 2022 and quickly gained popularity due to its accessibility and ability to have human-like conversations with the user across almost any imaginable topic.
+
+Large language models (LLMs), such as GPT-3 (a predecessor to ChatGPT), generally fall under the "Connectionist AI" category: they are built on artificial neural networks and use deep learning techniques. They fall under the deep learning subset due to their use of deep neural networks with many layers, allowing them to learn from large amounts of data and find intricate patterns.
+
+LLMs are trained to predict the probability of a word given its context in a dataset (a form of next-word prediction), which is a machine learning methodology. Notably, they use architectures like [Transformer Networks](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)), which are known for their efficiency in handling sequential data, making them a go-to choice for natural language processing (NLP) tasks. The use of attention mechanisms in these architectures allows the model to focus on different parts of the input sequence when producing an output sequence, offering a substantial improvement in performance for many NLP tasks.
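+
+As a highly simplified illustration of "predicting the probability of a word given its context," the toy sketch below estimates a next-word probability from bigram counts. This is not how LLMs are implemented (they learn the distribution with deep neural networks), but the underlying training objective is analogous; the tiny corpus is invented for illustration:
+```{r 5-1-AI-llm-sketch}
+# Toy next-word prediction from bigram counts
+corpus <- c("the", "dog", "ran", "the", "dog", "sat", "the", "cat", "sat")
+bigrams <- paste(head(corpus, -1), tail(corpus, -1))
+counts <- table(bigrams)
+# Estimated P(next word = "dog" | current word = "the"):
+p_dog_given_the <- counts["the dog"] / sum(counts[startsWith(names(counts), "the ")])
+as.numeric(p_dog_given_the) # 2 of the 3 words following "the" are "dog"
+```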
+
+The role of ChatGPT and similar tools in the environmental health research space is still being explored. Although ChatGPT has the potential to streamline certain parts of the research process, such as text and language polishing, synthesizing existing information, and suggesting custom coding solutions, it is not an intellectual replacement for the expertise and diverse viewpoints of scientists and must be used transparently and with caution.
+
+
+
+## Application of Machine Learning in Environmental Health Science
+
+For the rest of this module and chapter, we will focus on machine learning (ML). Generally speaking, ML is considered to encompass the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence (AI), discussed broadly above.
+
+### Why do we need machine learning in environmental health science?
+
+There are many avenues to incorporate ML into environmental health research, all aimed at better identifying patterns amongst large datasets spanning medical health records, clinical data, exposure monitoring data, chemistry profiles, and the rapidly expanding realm of biological response data including multiple -omics endpoints.
+
+One well-known problem that can be better addressed by incorporating ML is the 'too many chemicals, too little data' problem: there are thousands of chemicals in commerce today, and testing them one by one for toxicity using comprehensive animal screening experiments would take decades and is not financially feasible. Current efforts to address this problem include using cell-based high throughput screening to efficiently determine biological responses to a variety of chemical exposures and treatment conditions.
+
+```{r 5-1-AI-3, out.width = "700px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image3.png")
+```
+
+These screening efforts result in increasing amounts of data, which can be gathered to start building big databases.
+```{r 5-1-AI-4, out.width = "700px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image4.png")
+```
+
+When many of these datasets and databases are combined, including diversity across different types of screening platforms, technologies, cell types, species, and other experimental variables, the associated dimensionality of the data gets "big."
+```{r 5-1-AI-5, out.width = "500px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image5.png")
+```
+
+This presents a problem because these data are diverse and high dimensional (the number of features or endpoints exceeds the number of observations/chemicals). To appropriately analyze and model these data, new approaches beyond traditional statistical methods are needed.
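+
+To make the high dimensionality problem concrete, the sketch below (entirely synthetic numbers) shows that an ordinary linear regression cannot even estimate all of its coefficients once features outnumber observations:
+```{r 5-1-AI-pgtn-sketch}
+# Synthetic p > n illustration: 50 "endpoints" measured on only 10 "chemicals";
+# ordinary least squares leaves most coefficients NA (inestimable)
+set.seed(1)
+n_obs <- 10; n_feat <- 50
+X <- matrix(rnorm(n_obs * n_feat), nrow = n_obs)
+y <- rnorm(n_obs)
+fit <- lm(y ~ X)
+sum(is.na(coef(fit))) # most of the 51 coefficients cannot be estimated
+```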
+
+### Machine Learning vs. Traditional Statistical Methods
+
+There is *plenty* of debate as to where the line(s) between ML and traditional statistics should be drawn. In our opinion, a perfect delineation is not necessary for our purposes. Rather, we will focus on the usual goals/intent of each to help us understand the distinction for environmental health research.
+
+Traditional statistics may be able to handle 1:1 or 1:many comparisons of singular quantities (e.g., activity concentrations for two chemicals). However, once the modeling becomes more complex or exploratory, assumptions of most traditional methods will be violated. Furthermore, statistics draws population inferences from a sample, while AI/ML finds generalizable predictive patterns ([Bzdok et al 2018](https://www.nature.com/articles/nmeth.4642)). This is particularly helpful in **predictive toxicology**, in which we leverage high dimensional data to obtain generalizable forecasts for the effects of chemicals on biological systems.
+
+This image shows graphical abstractions of how a "problem" is solved using:
+
++ Traditional statistics ((A) logistic regression and (B) linear regression), OR
++ Machine learning ((C) support vector machines, (D) artificial neural networks, and (E) decision trees)
+
+```{r 5-1-AI-6, out.width = "700px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image6.png")
+```
+
+### Predictive Modeling in the Context of Environmental Health Science
+
+In the previous section, we briefly mentioned **predictive toxicology.** We often think of predictions as having a forward-time component (*i.e., what will happen next?*) ... what about "prediction" in a different sense as applied to toxicology?
+
+Our *working definition* is that **predictive toxicology** describes a multidisciplinary approach to chemical toxicity evaluation that more efficiently uses animal test results, when needed, and leverages expanding non-animal test methods to forecast the effects of a chemical on biological systems. Examples of the questions we can answer using predictive toxicology include:
+
++ Can we more efficiently design animal studies and analyze data from shorter assays using fewer animals to predict long-term health outcomes?
++ Can this suite of *in vitro* assays **predict** what would happen in an organism?
++ Can we use diverse, high dimensional data to cluster chemicals into **predicted** activity classes?
+
+```{r 5-1-AI-7, out.width = "600px", echo = FALSE, fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_1_AI//Module5_1_Image7.png")
+```
+
+Similar logic applies to the field of exposure science. What about "prediction" applied to exposure science?
+
+Our *working definition* is that **predictive exposure science** describes a multidisciplinary approach to chemical exposure evaluations that more efficiently uses biomonitoring, chemical inventory, and other exposure science-relevant databases to forecast exposure rates in target populations. For example:
+
++ Can we use existing biomonitoring data from NHANES to predict exposure rates for chemicals that have yet to be measured in target populations? (see ExpoCast program, e.g., [Wambaugh et al 2014](https://pubmed.ncbi.nlm.nih.gov/25343693/))
++ Can I use chemical product use inventory data to predict the likelihood of a chemical being present in a certain consumer product? (e.g., [Phillips et al 2018](https://pubmed.ncbi.nlm.nih.gov/29405058/))
+
+There are many different types of ML methods that we can employ in predictive toxicology and exposure science, depending on the data type / purpose of data analysis. A recent [review](https://pubmed.ncbi.nlm.nih.gov/34029068/) written together with [Erin Baker's lab](https://bakerlab.wordpress.ncsu.edu/) provides a high-level overview on some of the types of ML methods and challenges to address when analyzing multi-omic data (including chemical signature data).
+
+### Answer to Environmental Health Question
+:::question
+*With this, we can now answer our **Environmental Health Question***: How and why are artificial intelligence, machine learning, and predictive modeling used in environmental health research?
+:::
+
+:::answer
+**Answer:** Machine learning, a subcategory of artificial intelligence, can be used in environmental health science to better understand patterns between chemical exposure and biological response in complex, high dimensional datasets. These datasets are often generated as part of efforts to screen many chemicals efficiently. Predictive modeling, which can include machine learning approaches, leverages these data to forecast the effects of a chemical on biological systems.
+:::
+
+### Additional Applications of Machine Learning in Environmental Health Science
+
+In addition to the predictive toxicology questions above, ML can also be applied in the analysis of complex, high dimensional data in observational clinical (human subjects) studies in environmental health, such as:
+
++ Do subjects cluster by chemical exposure? Are there similarities between subjects that cluster together for chemical exposure, suggesting underlying factors relevant to chemical exposure?
++ Are biological signatures in different exposure groups different enough overall that ML can predict which group a subject belongs to based on their signature?
+
+
+
+## Concluding Remarks
+
+In conclusion, this training module provides an overview of the field of AI and ML and discusses applications of these tools in environmental health science through predictive modeling. These methods are common tools for high dimensional data analyses within the field of environmental health sciences.
+
+In the following modules, we will provide specific examples detailing how to apply both supervised and unsupervised machine learning methods to environmental health questions and how to interpret the results of these analyses.
+
+For a review article on ML, see:
+
++ Odenkirk MT, Reif DM, Baker ES. Multiomic Big Data Analysis Challenges: Increasing Confidence in the Interpretation of Artificial Intelligence Assessments. Anal Chem. 2021 Jun 8;93(22):7763-7773. PMID: [34029068](https://pubmed.ncbi.nlm.nih.gov/34029068/)
+
+For additional case studies that leverage more advanced ML techniques, see the following recent publications that also address environmental health questions from our research groups, with bracketed tags at the end of each citation denoting ML methods used in that study:
+
++ Clark J, Avula V, Ring C, Eaves LA, Howard T, Santos HP, Smeester L, Bangma JT, O'Shea TM, Fry RC, Rager JE. Comparing the Predictivity of Human Placental Gene, microRNA, and CpG Methylation Signatures in Relation to Perinatal Outcomes. Toxicol Sci. 2021 Sep 28;183(2):269-284. PMID: [34255065](https://pubmed.ncbi.nlm.nih.gov/34255065/) *[hierarchical clustering, principal component analysis, random forest]*
+
++ Green AJ, Mohlenkamp MJ, Das J, Chaudhari M, Truong L, Tanguay RL, Reif DM. Leveraging high-throughput screening data, deep neural networks, and conditional generative adversarial networks to advance predictive toxicology. PLoS Comput Biol. 2021 Jul 2;17(7):e1009135. PMID: [34214078](https://pubmed.ncbi.nlm.nih.gov/34214078/) *[conditional generative adversarial network, deep neural network, support vector machine, random forest, multilayer perceptron]*
+
++ To KT, Truong L, Edwards S, Tanguay RL, Reif DM. Multivariate modeling of engineered nanomaterial features associated with developmental toxicity. NanoImpact. 2019 Apr;16:10.1016. PMID: [32133425](https://pubmed.ncbi.nlm.nih.gov/32133425/) *[random forest]*
+
++ Ring C, Sipes NS, Hsieh JH, Carberry C, Koval LE, Klaren WD, Harris MA, Auerbach SS, Rager JE. Predictive modeling of biological responses in the rat liver using in vitro Tox21 bioactivity: Benefits from high-throughput toxicokinetics. Comput Toxicol. 2021 May;18:100166. PMID: [34013136](https://pubmed.ncbi.nlm.nih.gov/34013136/) *[random forest]*
+
++ Hickman E, Payton A, Duffney P, Wells H, Ceppe AS, Brocke S, Bailey A, Rebuli ME, Robinette C, Ring B, Rager JE, Alexis NE, Jaspers I. Biomarkers of Airway Immune Homeostasis Differ Significantly with Generation of E-Cigarettes. Am J Respir Crit Care Med. 2022 Nov 15; 206(10):1248-1258. PMID: [35731626](https://pubmed.ncbi.nlm.nih.gov/35731626/) *[hierarchical clustering, quadratic discriminant analysis, multinomial logistic regression]*
+
++ Perryman AN, Kim H-YH, Payton A, Rager JE, McNell EE, Rebuli ME, et al. (2023) Plasma sterols and vitamin D are correlates and predictors of ozone-induced inflammation in the lung: A pilot study. PLoS ONE 18(5): e0285721. PMID: [37186612](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285721) *[random forest, support vector machine, k nearest neighbor]*
+
++ Payton AD, Perryman AN, Hoffman JR, Avula V, Wells H, Robinette C, Alexis NE, Jaspers I, Rager JE, Rebuli ME. Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples. American Journal of Physiology-Lung Cellular and Molecular Physiology 2022 322:5, L722-L736. PMID: [35318855](https://journals.physiology.org/doi/abs/10.1152/ajplung.00299.2021) *[k-means clustering, principal component analysis]*
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image1.png b/Chapter_5/5_1_AI/Module5_1_Image1.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image1.png
rename to Chapter_5/5_1_AI/Module5_1_Image1.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image2.png b/Chapter_5/5_1_AI/Module5_1_Image2.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image2.png
rename to Chapter_5/5_1_AI/Module5_1_Image2.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image3.png b/Chapter_5/5_1_AI/Module5_1_Image3.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image3.png
rename to Chapter_5/5_1_AI/Module5_1_Image3.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image4.png b/Chapter_5/5_1_AI/Module5_1_Image4.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image4.png
rename to Chapter_5/5_1_AI/Module5_1_Image4.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image5.png b/Chapter_5/5_1_AI/Module5_1_Image5.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image5.png
rename to Chapter_5/5_1_AI/Module5_1_Image5.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image6.png b/Chapter_5/5_1_AI/Module5_1_Image6.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image6.png
rename to Chapter_5/5_1_AI/Module5_1_Image6.png
diff --git a/Chapter_5/Module5_1_Input/Module5_1_Image7.png b/Chapter_5/5_1_AI/Module5_1_Image7.png
similarity index 100%
rename from Chapter_5/Module5_1_Input/Module5_1_Image7.png
rename to Chapter_5/5_1_AI/Module5_1_Image7.png
diff --git a/Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML.Rmd b/Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML.Rmd
new file mode 100644
index 0000000..d599c46
--- /dev/null
+++ b/Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML.Rmd
@@ -0,0 +1,466 @@
+
+# 5.2 Supervised Machine Learning
+
+This training module was developed by Alexis Payton, Oyemwenosa N. Avenbuan, Lauren E. Koval, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Machine learning is a field that has been around for decades but has exploded in popularity and utility in recent years due to the proliferation of big and/or high dimensional data. Machine learning has the ability to sift through and learn from large volumes of data and use that knowledge to solve problems. The challenges of high dimensional data as they pertain to environmental health, and the applications of machine learning to mitigate some of those challenges, are discussed further in [Payton et al.](https://www.frontiersin.org/articles/10.3389/ftox.2023.1171175/full). In this module, we will introduce different types of machine learning and then focus on supervised machine learning, including how to train and assess supervised machine learning models. We will then analyze an example dataset with supervised machine learning, highlighting random forest modeling.
+
+
+
+## Types of Machine Learning
+Within the field of machine learning, there are many different types of algorithms that can be leveraged to address environmental health research questions. The two broad categories of machine learning frequently applied to environmental health research are: (1) supervised machine learning and (2) unsupervised machine learning.
+
+**Supervised machine learning** involves training a model using a labeled dataset, where each independent or predictor variable is associated with a dependent variable with a known outcome. This allows the model to learn how to predict the labeled outcome on data it hasn't "seen" before based on the patterns and relationships it previously identified in the data. For example, supervised machine learning has been used for cancer prediction and prognosis based on variables like tumor size, stage, and age ([Lynch et al.](https://www.sciencedirect.com/science/article/abs/pii/S1386505617302368?via%3Dihub), [Asadi et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7416093/)).
+
+Supervised machine learning includes:
+
++ Classification: Using algorithms to classify a categorical outcome (e.g., plant species, disease status, etc.)
++ Regression: Using algorithms to predict a continuous outcome (e.g., gene expression, chemical concentration, etc.)
+```{r 5-2-Supervised-ML-1, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image1.png")
+```
+
+Soni, D. (2018, March 22). Supervised vs. Unsupervised Learning. Towards Data Science; Towards Data Science. https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d
+
+**Unsupervised machine learning**, on the other hand, involves using models to find patterns or associations between variables in a dataset that lacks a known or labeled outcome. For example, unsupervised machine learning has been used to identify new patterns across genes that are co-expressed, informing potential biological pathways mediating human disease ([Botía et al.](https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-017-0420-6), [Pagnuco et al.](https://www.sciencedirect.com/science/article/pii/S0888754317300575?via%3Dihub)).
+
+```{r 5-2-Supervised-ML-2, echo=FALSE, fig.width=52, fig.height=18, fig.align='center', out.width = "75%"}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image2.png")
+```
+
+Langs, G., Röhrich, S., Hofmanninger, J., Prayer, F., Pan, J., Herold, C., & Prosch, H. (2018). Machine learning: from radiomics to discovery and routine. Der Radiologe, 58(S1), 1–6. DOI: [10.1007/s00117-018-0407-3](https://doi.org/10.1007/s00117-018-0407-3). Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
+
+Overall, the distinction between supervised and unsupervised learning is an important concept in machine learning, as it can inform the choice of algorithms and techniques used to analyze and make predictions from data. It is worth noting that there are also other types of machine learning, such as [semi-supervised learning](https://www.altexsoft.com/blog/semi-supervised-learning/), [reinforcement learning](https://www.geeksforgeeks.org/what-is-reinforcement-learning/), and [deep learning](https://www.geeksforgeeks.org/introduction-deep-learning/), though we will not further discuss these topics in this module.
+
+
+
+## Types of Supervised Machine Learning Algorithms
+
+Although this module's example will focus on a random forest model in the coding example below, other commonly used algorithms for supervised machine learning include:
+
++ **K-Nearest Neighbors (KNN):** Uses distance to classify a data point in the test set based upon the most common class of neighboring data points from the training set. For more information on KNN, see [K-Nearest Neighbor](https://www.ibm.com/topics/knn).
+```{r 5-2-Supervised-ML-3, echo=FALSE, out.width = "50%",fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image6.png")
+```
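The neighbor-voting idea above can be sketched in a few lines of R using the `class` package (which ships with R); the built-in iris data, the 100-sample training split, and k = 5 below are illustrative stand-ins, not part of this module's dataset:

```r
# Minimal KNN sketch: classify held-out flowers by the majority class of
# their 5 nearest training neighbors (Euclidean distance on 4 measurements)
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)       # 100 randomly chosen training samples

train <- iris[idx, 1:4]              # predictor columns for training set
test  <- iris[-idx, 1:4]             # predictor columns for the 50 held-out samples

pred <- knn(train, test, cl = iris$Species[idx], k = 5)

mean(pred == iris$Species[-idx])     # fraction of test samples classified correctly
```

In practice, the predictors are usually scaled first so that no single variable dominates the distance calculation.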
+
++ **Support Vector Machine (SVM):** Creates a decision boundary line (hyperplane) in n-dimensional space to separate the data into each class so that when new data is presented, they can be easily categorized. For more information on SVM, see [Support Vector Machine](https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm).
+```{r 5-2-Supervised-ML-4, echo=FALSE, out.width = "50%", fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image7.png")
+```
+
++ **Random Forest (RF):** Uses a multitude of decision trees, each trained on a different subset of samples from the training set; the resulting classification of a data point in the test set is aggregated across all the decision trees. A **decision tree** is a hierarchical model that depicts decisions from predictors and their resulting outcomes. It starts with a root node, which represents an initial test on a single predictor. The root node splits into subsequent decision nodes that test other features. These decision nodes can either feed into more decision nodes or into leaf nodes that represent the predicted class label. A branch or sub-tree refers to a subsection of an entire decision tree.
+
+Here is an example decision tree with potential variables and decisions informing a college basketball player's likelihood of being drafted to the NBA:
+```{r 5-2-Supervised-ML-5, echo=FALSE, out.width = "75%",fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image8.png")
+```
+
+While decision trees are highly interpretable, they are prone to overfitting, so they may not generalize well to data outside of the training set. To address this, random forests are composed of many different decision trees. Each tree is trained on a subset of the samples in the training data, selected with replacement, and a randomly selected set of predictor variables. For a dataset with *p* predictors, it is common to test $\sqrt{p}$, $\frac{p}{2}$, and *p* predictors to see which gives the best results. This process decorrelates the trees. For a classification problem, the majority vote of the decision trees determines the final predicted class. This sacrifices the interpretability inherent to individual trees but reduces the risk of overfitting.
+
+For more information on RF and decision trees, check out [Random Forest](https://www.ibm.com/in-en/topics/random-forest) and
+[Decision Trees](https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/#What_is_a_Decision_Tree?).
+
+**Note**: One algorithm is not inherently better than the others with each having their respective advantages and disadvantages. Each algorithm's predictive ability will be largely dependent on the size of the dataset, the distribution of the data points, and the scenario.
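The majority-vote aggregation that a random forest uses for classification can be illustrated with a short base R sketch; the five hypothetical tree "votes" for a single test sample below are made up for illustration:

```r
# Each element is one decision tree's predicted class for the same test sample
tree_votes <- c("D", "ND", "D", "D", "ND")   # hypothetical votes from 5 trees

# The forest's final prediction is the most common class across all trees
majority_vote <- function(votes) {
  names(which.max(table(votes)))             # class with the highest vote count
}

majority_vote(tree_votes)                    # "D" (3 of the 5 trees voted "D")
```

Real implementations such as the *randomForest* package perform this aggregation internally across hundreds of trees.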
+
+
+
+## Training Supervised Machine Learning Models
+
+In supervised machine learning, algorithms need to be trained before they can be used to predict on new data. This involves selecting a smaller portion of the dataset to train the model so it will learn how to predict the outcome as accurately as possible. The process of training an algorithm is essential for enabling the model to learn and improve over time, allowing it to make more accurate predictions and better adapt to new and changing circumstances. Ultimately, the quality and relevance of the training data will have a significant impact on the effectiveness of a machine learning model.
+
+Common partitions of the full dataset used to train and test a supervised machine learning model are the following:
+
+1. **Training Set:** a subset of the data that the algorithm "sees" and uses to identify patterns.
+
+2. **Validation Set**: a subset of the training set that is used to evaluate the model's fit in an unbiased way, allowing us to fine-tune its parameters and optimize performance.
+
+3. **Test Set:** a subset of data that is used to evaluate the final model's fit based on the training and validation sets. This provides an objective assessment of the model's ability to generalize to new data.
+
+It is common to split the dataset into a training set that contains 60% of the data and a test set that contains the remaining 40%, though other common splits include 70% training / 30% test and 80% training / 20% test.
+
+```{r 5-2-Supervised-ML-6, echo=FALSE, out.width = "65%", fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image3.png")
+```
+
+It is important to note that the test set should only be examined after the algorithm has been trained using the training/validation sets. Using the test set during the development process can lead to overfitting, where the model performs well on the test data but poorly on new data. The ideal algorithm is generalizable or flexible enough to accurately predict unseen data. This is known as the bias-variance tradeoff. For further information on the bias-variance tradeoff, see [Understanding the Bias-Variance Tradeoff](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229).
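The partitioning described above can be sketched in base R; the dataset of 100 samples here is hypothetical (the activity later in this module uses `caret::createDataPartition()` instead, which additionally preserves class proportions):

```r
# Sketch of a 60/40 train/test split on a hypothetical dataset of 100 samples
set.seed(17)
n <- 100
train_idx <- sample(n, size = 0.6 * n)   # randomly select 60% of the row indices

train_rows <- (1:n)[train_idx]           # 60 samples the model "sees" during training
test_rows  <- (1:n)[-train_idx]          # 40 held-out samples for final evaluation

length(train_rows)
length(test_rows)
```

Because the split is random, setting a seed makes the partition reproducible across runs.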
+
+### Cross Validation
+
+Finally, we will discuss **cross validation**, an approach used during training to expose the model to more patterns in the data and aid in model evaluation. For example, if a model is trained and tested on a single 60:40 split, its accuracy will likely be influenced by *where* that split falls in the dataset, which can bias the model and reduce its ability to predict accurately for data outside the training set. Overall, cross validation (CV) is implemented to fine-tune a model's parameters and improve its prediction accuracy and ability to generalize.
+
+Although there are [a number of cross validation approaches](https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right), we will specifically highlight ***k*-fold cross validation**, which works by splitting the samples in the training dataset into *k* equally sized folds or groups. For example, if we implement 5-fold CV, we start by...
+
+1. Splitting the training data into 5 groups, or "folds".
+2. Five iterations of training/testing are then run where each of the 5 folds serves as the test data once and as part of the training set four times, as seen in the figure below.
+3. To measure the predictive ability of each parameter setting tested (such as the number of features to include), metrics like accuracy and specificity are calculated for each iteration. The parameter settings that optimize performance are selected for the final model, which is then evaluated against the test set not used in training.
+```{r 5-2-Supervised-ML-7, echo=FALSE, fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image4.png")
+```
+
+Check out these resources for additional information on [Cross Validation in Machine Learning](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f) and [Cross Validation Pros & Cons](https://www.geeksforgeeks.org/cross-validation-machine-learning/).
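The fold assignment underlying *k*-fold CV can be sketched in base R; the sample size of 20 below is illustrative:

```r
# Sketch of 5-fold cross validation fold assignment on 20 training samples
set.seed(17)
n <- 20
k <- 5
folds <- sample(rep(1:k, length.out = n))  # randomly assign each sample to one of k folds

for (i in 1:k) {
  held_out <- which(folds == i)   # this fold is held out in iteration i
  training <- which(folds != i)   # the other 4 folds are used for training
  # ... fit the model on `training`, evaluate on `held_out` ...
}

table(folds)   # each of the 5 folds contains 20/5 = 4 samples
```

Every sample is held out exactly once, so each data point contributes to both training and evaluation across the *k* iterations.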
+
+
+
+## Assessing Classification-Based Model Performance
+Evaluation metrics from a confusion matrix are often used to determine the best model during training and to measure model performance during testing for classification-based supervised machine learning models. A confusion matrix is a table that displays how often the algorithm correctly and incorrectly predicted each outcome class.
+
+Let's imagine you're interested in predicting whether or not a player will be drafted to the National Basketball Association (NBA) based on a dataset that contains variables regarding a player's assists, points, height, etc. Let's say that this dataset contains information on 253 players with 114 that were actually drafted and 139 that weren't drafted. The confusion matrix below shows a model's results where a player that is drafted is the "positive" class and a player that is not drafted is the "negative" class.
+
+```{r 5-2-Supervised-ML-8, echo=FALSE, out.width = "50%", fig.width=4, fig.height=5, fig.align='center'}
+knitr::include_graphics("Chapter_5/5_2_Supervised_ML/Module5_2_Image5.png")
+```
+
+Helpful confusion matrix terminology:
+
++ **True positive (TP)**: the number of correctly classified "positive" data points (i.e., the number of correctly classified players to be drafted)
++ **True negative (TN)**: the number of correctly classified "negative" data points (i.e., the number of correctly classified players to be not drafted)
++ **False positive (FP)**: the number of incorrectly classified "positive" data points (i.e., the number of players not drafted incorrectly classified as draft picks)
++ **False negative (FN)**: the number of incorrectly classified "negative" data points (i.e., the number of draft picks incorrectly classified as players not drafted)
+
+
+Some of the metrics that can be obtained from a confusion matrix are listed below:
+
++ **Overall Accuracy:** indicates how often the model makes a correct prediction relative to the total number of predictions made and is typically used to assess overall model performance ($\frac{TP+TN}{TP+TN+FP+FN}$).
+
++ **Sensitivity or Recall:** evaluates how well the model was able to predict the "positive" class. It is calculated as the ratio of correctly classified true positives to the total number of positive cases ($\frac{TP}{TP+FN}$).
+
++ **Specificity:** evaluates how well the model was able to predict the "negative" class. It is calculated as the ratio of correctly classified true negatives to the total number of negative cases ($\frac{TN}{TN+FP}$).
+
++ **Balanced Accuracy:** is the mean of sensitivity and specificity and is often used in the case of a class imbalance to gauge how well the model can correctly predict values for both classes ($\frac{sensitivity+specificity}{2}$).
+
++ **Positive Predictive Value (PPV) or Precision:** evaluates how accurate predictions of the "positive" class are. It is calculated as the ratio of correctly classified true positives to the total number of predicted positives ($\frac{TP}{TP+FP}$).
+
++ **Negative Predictive Value (NPV):** evaluates how accurate predictions of the "negative" class are. It is calculated as the ratio of correctly classified true negatives to the total number of predicted negatives ($\frac{TN}{TN+FN}$).
+
+For the above metrics, values fall between 0 and 1. A value of 0 indicates that the model classified no data points correctly, while a value of 1 indicates that the model classified all test data correctly. Although subjective, an overall accuracy of at least 0.7 is considered respectable ([Barkved, 2022](https://www.obviously.ai/post/machine-learning-model-performance#:~:text=Good%20accuracy%20in%20machine%20learning,also%20consistent%20with%20industry%20standards.)). Furthermore, a variety of additional metrics exist for evaluating model performance for classification problems ([24 Evaluation Metrics for Binary Classification (And When to Use Them)](https://neptune.ai/blog/evaluation-metrics-binary-classification)). The choice of metric varies by situation and depends not only on the individual dataset, but also on the question being answered.
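The formulas above can be gathered into a small R helper. The counts passed in below are hypothetical: they are chosen to be consistent with the NBA draft example's totals (114 drafted, 139 not drafted), but the actual values come from the confusion matrix figure:

```r
# Compute common confusion matrix metrics from the four raw counts
confusion_metrics <- function(TP, TN, FP, FN) {
  sens <- TP / (TP + FN)   # sensitivity (recall)
  spec <- TN / (TN + FP)   # specificity
  c(accuracy          = (TP + TN) / (TP + TN + FP + FN),
    sensitivity       = sens,
    specificity       = spec,
    balanced_accuracy = (sens + spec) / 2,
    PPV               = TP / (TP + FP),  # precision
    NPV               = TN / (TN + FN))
}

# Hypothetical counts: 90 drafted players correctly predicted, 24 missed;
# 120 undrafted players correctly predicted, 19 incorrectly flagged as drafted
round(confusion_metrics(TP = 90, TN = 120, FP = 19, FN = 24), 3)
```

Functions like `caret::confusionMatrix()` report these same quantities automatically, as shown later in this module's activity.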
+
+
+**Note**: For multi-class classification (more than two labeled outcomes to be predicted), the same metrics are often used, but are obtained in a slightly different way. Regression based supervised machine learning models use loss functions to evaluate model performance. For more information regarding confusion matrices and loss functions for regression-based models, see:
+
+ + [Additional Confusion Matrix Metrics](https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5)
+ + [Precision vs. Recall or Specificity vs. Sensitivity](https://towardsdatascience.com/should-i-look-at-precision-recall-or-specificity-sensitivity-3946158aace1)
+ + [Loss Functions for Machine Learning Regression](https://towardsdatascience.com/understanding-the-3-most-common-loss-functions-for-machine-learning-regression-23e0ef3e14d3)
+
+
+
+
+## Introduction to Activity and Example Dataset
+
+In this activity, we will analyze an example dataset to see whether we can use environmental monitoring data to predict areas of contamination using random forest (RF). This example model will leverage a dataset of well water variables that span geospatial location, sampling date, and well water attributes, with the goal of predicting whether detectable levels of inorganic arsenic (iAs) are present. This dataset was obtained through the sampling of 713 private wells across North Carolina through the University of North Carolina Superfund Research Program ([UNC-SRP](https://sph.unc.edu/superfund-pages/srp/)) using an analytical method capable of detecting levels of iAs greater than 5 ppm. As demonstrated through the script below, the algorithm will first be trained and tested, and the resulting model performance will then be assessed using the previously detailed confusion matrix and related performance metrics.
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Which well water variables, spanning various geospatial locations, sampling dates, and well water attributes, significantly differ between samples containing detectable levels of iAs vs. samples that are not contaminated/non-detect?
+2. How can we train a random forest (RF) model to predict whether a well might be contaminated with iAs?
+3. With this RF model, can we predict if iAs will be detected based on well water information?
+4. How could this RF model be improved upon, acknowledging that there is class imbalance?
+
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 5-2-Supervised-ML-9, echo=TRUE, eval=TRUE}
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step; otherwise, you can run the code below, which checks installation status for you:
+```{r 5-2-Supervised-ML-10, echo=TRUE, eval=TRUE, warning=FALSE, results='hide', message=FALSE}
+if (!requireNamespace("readxl"))
+ install.packages("readxl");
+if (!requireNamespace("lubridate"))
+ install.packages("lubridate");
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("gtsummary"))
+ install.packages("gtsummary");
+if (!requireNamespace("flextable"))
+ install.packages("flextable");
+if (!requireNamespace("caret"))
+ install.packages("caret");
+if (!requireNamespace("randomForest"))
+ install.packages("randomForest");
+if (!requireNamespace("cardx"))
+ install.packages("cardx");
+```
+
+#### Loading R packages required for this session
+```{r 5-2-Supervised-ML-11, echo=TRUE, eval=TRUE, warning=FALSE, error=FALSE, results='hide', message=FALSE}
+library(readxl);
+library(lubridate);
+library(tidyverse);
+library(gtsummary);
+library(flextable);
+library(caret);
+library(randomForest);
+library(cardx);
+```
+
+#### Set your working directory
+```{r 5-2-Supervised-ML-12, echo=TRUE, eval=FALSE, error=FALSE, results='hide', message=FALSE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+```{r 5-2-Supervised-ML-13, echo=TRUE, eval=TRUE}
+# Load the data
+arsenic_data <- data.frame(read_xlsx("Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML_Data.xlsx"))
+
+# View the top of the dataset
+head(arsenic_data)
+```
+
+The columns in this dataset are described below:
+
++ `Well_ID`: Unique id for each well (This is the sample identifier and not a predictive feature)
++ `Water_Sample_Date`: Date that the well was sampled
++ `Casing_Depth`: Depth of the casing of the well (ft)
++ `Well_Depth`: Depth of the well (ft)
++ `Static_Water_Depth`: Static water depth in the well (ft)
++ `Flow_Rate`: Well flow rate (gallons per minute)
++ `pH`: pH of water sample
++ `Detect_Concentration`: Binary indicator (either non-detect "ND" or detect "D") of whether iAs was detected in the water sample
+
+### Changing Data Types
+First, `Detect_Concentration` needs to be converted from a character to a factor so that the random forest model treats the non-detect class as the baseline or "negative" class and the detect class as the "positive" class. `Water_Sample_Date` will be converted from a character to a date type using the `mdy()` function from the *lubridate* package so that the model understands this column contains dates.
+```{r 5-2-Supervised-ML-14, echo=TRUE, eval=TRUE}
+arsenic_data <- arsenic_data %>%
+ # Converting `Detect_Concentration` from a character to a factor
+ mutate(Detect_Concentration = relevel(factor(Detect_Concentration), ref = "ND"),
+ # Converting water sample date from a character to a date type
+ Water_Sample_Date = mdy(Water_Sample_Date)) %>%
+ # Removing the well id and only keeping the predictor and outcome variables in the dataset
+ # This allows us to put the entire dataframe as is into RF
+ select(-Well_ID)
+
+# Look at the top of the revised dataset
+head(arsenic_data)
+```
+
+
+
+## Testing for Differences in Predictor Variables across the Outcome Classes
+
+It is useful to run summary statistics on the variables that will be used as predictors in the algorithm to see if there are differences in their distributions between the outcome classes (either non-detect or detect in this case). Greater significance often translates to better predictivity for a variable, since the model is better able to separate the classes. We'll use the `tbl_summary()` function from the *gtsummary* package. Note that this may only be practical with smaller datasets, or for a subset of predictors if there are many.
+
+For more information on the `tbl_summary()` function, check out this helpful [Tutorial](https://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html).
+```{r 5-2-Supervised-ML-15, echo=TRUE, eval=TRUE, warning=F, message = F}
+arsenic_data %>%
+ # Displaying the mean and standard deviation in parentheses for all continuous variables
+ tbl_summary(
+ by = Detect_Concentration,
+ statistic = list(all_continuous() ~ "{mean} ({sd})")
+ ) %>%
+ # Adding a column that displays the total number of samples for each variable
+ add_n() %>%
+ # Adding a column that displays the p-value from a one-way ANOVA test
+ add_p(
+ test = list(all_continuous() ~ "oneway.test"),
+ test.args = list(all_continuous() ~ list(var.equal = TRUE))
+ ) %>%
+ as_flex_table() %>%
+ bold(bold = TRUE, part = "header")
+
+```
+
+
+Note that N refers to the total sample number; ND refers to the samples that contained non-detectable levels of iAs; and D refers to the samples that contained detectable levels of iAs.
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: Which well water variables, spanning various geospatial locations, sampling dates, and well water attributes, significantly differ between samples containing detectable levels of iAs vs samples that are not contaminated/ non-detect?
+:::
+
+:::answer
+**Answer**: All of the evaluated descriptor variables significantly differ (p < 0.05) between detect and non-detect iAs samples, with the exception of the sample date and the static water depth.
+:::
+
+With these findings, we feel comfortable moving forward with these well water descriptive variables as predictors in our model.
+
+
+
+### Setting up Cross Validation
+At this point, we can move forward with training and testing a RF model aimed at predicting whether or not detectable levels of iAs are present in well water samples. We'll first split the data into training and test sets and take a glance at the distribution of `Detect_Concentration` in each.
+```{r 5-2-Supervised-ML-16, echo=TRUE, eval=TRUE}
+
+# Set seed for reproducibility
+set.seed(17)
+
+# Establish a list of indices that will be used to identify our training and testing data with a 60-40 split
+tt_indices <- createDataPartition(y = arsenic_data$Detect_Concentration, p = 0.6, list = FALSE)
+
+# Use indices to make our training and testing datasets and view the number of Ds and NDs
+iAs_train <- arsenic_data[tt_indices,]
+table(iAs_train$Detect_Concentration)
+
+iAs_test <- arsenic_data[-tt_indices,]
+table(iAs_test$Detect_Concentration)
+```
+
+We can see that there are notably more non-detects (`ND`) than detects (`D`) in both our training and testing sets. This is something important to consider when evaluating our model's performance.
+
+Now we can set up our cross validation and train our model. We will be using the `trainControl()` function from the *caret* package for this task. *caret* is one of the most commonly used libraries for supervised machine learning in R and can be leveraged for a variety of algorithms, including RF, SVM, KNN, and others. This model will be trained with 5-fold cross validation. Additionally, we will test 2, 3, and 6 predictors through the `mtry` parameter.
+
+See the *caret* documentation [here](https://cran.r-project.org/web/packages/caret/vignettes/caret.html).
+```{r 5-2-Supervised-ML-17, echo=TRUE, eval=TRUE}
+
+# Establish the parameters for our cross validation with 5 folds
+control <- trainControl(method = 'cv',
+ number = 5,
+ search = 'grid',
+ classProbs = TRUE)
+
+# Establish grid of predictors to test in our model as part of hyperparameter tuning
+p <- ncol(arsenic_data) - 1 # p is the total number of predictors in the dataset
+tunegrid_rf <- expand.grid(mtry = c(floor(sqrt(p)), p/2, p)) # We will test sqrt(p), p/2, and p predictors (2, 3, and 6 predictors, respectively) to see which performs best
+```
+
+
+
+## Predicting iAs Detection with a Random Forest (RF) Model
+```{r 5-2-Supervised-ML-18 }
+# Look at the column names in training dataset
+colnames(iAs_train)
+
+# Train model
+rf_train <- train(x = iAs_train[,1:6], # Our predictor variables are in columns 1-6 of the dataframe
+ y = iAs_train[,7], # Our outcome variable is in column 7 of the dataframe
+ trControl = control, # Specify the cross-validation parameters we defined above
+ method = 'rf', # Specify we want to train a Random Forest
+ importance = TRUE, # This parameter calculates the variable importance for RF models specifically, which can help with downstream analyses
+ tuneGrid = tunegrid_rf, # Specify the number of predictors we want to test as defined above
+ metric = "Accuracy") # Specify what evaluation metric we want to use to decide which model is the best
+
+# Look at the results of training
+rf_train
+
+# Save the best model from our training. The best performing model is determined by the number of predictor variables we tested that resulted in the highest accuracy during the cross validation step.
+rf_final <- rf_train$finalModel
+
+# View confusion matrix for best model
+rf_final
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: How can we train a random forest (RF) model to predict whether a well might be contaminated with iAs?
+:::
+
+:::answer
+**Answer**: As is standard practice with supervised ML, we split our full dataset into a training dataset and a test dataset using a 60-40 split. Using the *caret* package, we implemented 5-fold cross validation to train a RF while also testing different numbers of predictors to see which optimized performance. The model that resulted in the greatest accuracy was selected as the final model.
+:::
+
+Now we can see how well our model does on data it hasn't seen before by applying it to our testing data.
+```{r 5-2-Supervised-ML-19, echo=TRUE, eval=TRUE}
+# Use our best model to predict the classes for our test data. We need to make sure we remove the column of Ds/NDs from our test data.
+rf_res <- predict(rf_final, iAs_test %>%
+ select(!Detect_Concentration))
+
+# View a confusion matrix of the results and gauge model performance
+# Be sure to include the 'positive' parameter to specify the correct positive class
+confusionMatrix(rf_res, iAs_test$Detect_Concentration, positive = "D")
+```
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: With this RF model, can we predict if iAs will be detected based on well water information?
+:::
+
+:::answer
+**Answer**: We can use this model to predict whether iAs will be detected in well water, given that an overall accuracy of ~0.72 is decent. However, we should consider other metrics that may influence how confident we are in this model, depending on what is important to the question we are trying to answer. For example, with detects as the positive class, the model did a good job of predicting non-detect samples, based on a specificity of ~0.85 and an NPV of ~0.78, but struggled to predict detect samples, based on a sensitivity of ~0.39 and a PPV of ~0.50. Additionally, the balanced accuracy of ~0.62 further emphasizes the difference in the model's predictive ability for non-detects versus detects. If it is highly important to us that detects are classified correctly, we may want to improve this model before implementing it.
+:::
+
+
+
+## Class Imbalance
+
+It is worth noting that this discrepancy in predictive capability for detects vs. non-detects makes sense given the class imbalance observed in our training data. There were notably more non-detects than detects in the training set, so the model was exposed to more of these data points and struggled to distinguish the unique characteristics of detects from those of non-detects. Additionally, we told the training algorithm to prioritize selecting a final model based on its overall accuracy. In instances of heavy class imbalance, a high accuracy is commonly achieved simply because the more prevalent class is predicted more often, though this doesn't give the full picture of the model's predictive capabilities. For example, if you consider a dog/cat case with a set of 90 dogs and 10 cats, a model could achieve 90% accuracy by predicting dog every time, which isn't at all helpful in predicting cats.
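+
+The dog/cat scenario above can be sketched in a few lines of base R, contrasting overall accuracy with balanced accuracy (the mean of the per-class accuracies):
+```{r 5-2-Supervised-ML-imbalance-toy, echo=TRUE, eval=TRUE}
+# Toy example: 90 dogs and 10 cats, with a model that always predicts "dog"
+truth <- factor(c(rep("dog", 90), rep("cat", 10)), levels = c("dog", "cat"))
+pred <- factor(rep("dog", 100), levels = c("dog", "cat"))
+
+# Overall accuracy looks impressive
+mean(pred == truth) # 0.9
+
+# But the per-class accuracies tell a different story
+acc_dog <- mean(pred[truth == "dog"] == "dog") # 1.0
+acc_cat <- mean(pred[truth == "cat"] == "cat") # 0.0
+(acc_dog + acc_cat) / 2 # balanced accuracy = 0.5
+```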
+
+This is particularly important because, in toxicology-related datasets, the "positive" class often represents the class with greater public health risk/interest but can have less data. For example, consider classifying subjects as asthmatic or non-asthmatic based on gene expression data. Asthmatics would likely be the "positive" class, but given that asthmatics are less prevalent than non-asthmatics in the general population, they would likely represent the minority class too.
+
+To address this issue, a few methods can be considered. Full implementation of these approaches is beyond the scope of this module, but relevant resources for further exploration are given.
+
++ **Synthetic Minority Oversampling Technique (SMOTE)**- increases the number of minority class samples in the training data, thereby reducing the class imbalance, by synthetically generating additional samples derived from the existing minority class samples.
+ + [SMOTE Oversampling & Tutorial On How To Implement In Python And R](https://spotintelligence.com/2023/02/17/smote-oversampling-python-r/#:~:text=Conclusion-,The%20SMOTE%20(Synthetic%20Minority%20Over%2Dsampling%20Technique)%20algorithm%20is,datasets%20that%20aren't%20balanced.)
+ + [How to Use SMOTE for Imbalanced Data in R (With Example)](https://www.statology.org/smote-in-r/)
+
++ **Adjusting the loss function**- Loss functions in machine learning quantify the penalty for a bad prediction. They can be adjusted so that mistakes on the minority class are penalized more heavily, forcing the model to learn to make fewer mistakes when predicting the minority class.
+
++ **Alternative Performance Metrics**- When training the model, alternative metrics to overall accuracy may yield a more robust model capable of better predicting the minority class. Example alternatives may include balanced accuracy or an [F1-score](https://thedatascientist.com/f-1-measure-useful-imbalanced-class-problems/). The *caret* package further allows for [custom, user-defined metrics](https://topepo.github.io/caret/model-training-and-tuning.html#alternate-performance-metrics) to be evaluated during training by specifying the *summaryFunction* parameter in the `trainControl()` function, as seen below, in addition to the [`defaultSummary()` and `twoClassSummary()` functions](https://cran.r-project.org/web/packages/caret/vignettes/caret.html).
+
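+As a rough illustration of the loss-adjustment idea above, the *randomForest* package exposes a `classwt` argument that can up-weight the minority class during training. The sketch below is illustrative only; the weights are made up and would need tuning for this dataset:
+```{r 5-2-Supervised-ML-classwt-sketch, echo=TRUE, eval=FALSE}
+library(randomForest)
+
+# Sketch only: supply one weight per factor level of the outcome, in level order,
+# giving the minority detect class a larger weight so its errors are penalized more
+rf_weighted <- randomForest(Detect_Concentration ~ .,
+                            data = iAs_train,
+                            classwt = c(1, 3)) # illustrative weights, not tuned
+```
+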
+In the example code below, we're creating a function (`f1`) that will calculate the F1 score and find the optimal model with the highest F1 score as opposed to the highest accuracy as we did above.
+```{r 5-2-Supervised-ML-20, echo=TRUE, eval=FALSE}
+install.packages("MLmetrics")
+library(MLmetrics)
+
+f1 <- function(data, lev = NULL, model = NULL) {
+ # Creating a function to calculate the F1 score
+ # Note: lev[1] is the first factor level of the outcome; double check that it corresponds to your positive class
+ f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
+ c(F1 = f1_val)
+}
+
+# 5 fold CV
+ctrl <- trainControl(
+ method = "cv",
+ number = 5,
+ classProbs = TRUE,
+ summaryFunction = f1
+)
+
+# Training the RF model
+mod <- train(x = X,
+ y = Y,
+ trControl = ctrl,
+ method = "rf",
+ tuneGrid = tunegrid_rf,
+ importance = TRUE,
+ # Basing the best model performance off of the F1 score within 5 CV
+ metric = "F1")
+```
+
+For more in-depth information and additional ways to address class imbalance check out [How to Deal with Imbalanced Data in Classification](https://medium.com/game-of-bits/how-to-deal-with-imbalanced-data-in-classification-bd03cfc66066).
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can answer **Environmental Health Question #4***: How could this RF model be improved upon, acknowledging that there is class imbalance?
+:::
+
+:::answer
+**Answer**: We can implement SMOTE to increase the number of training data points for the minority class, thereby reducing the class imbalance. In conjunction with SMOTE, we can select an alternative performance metric during training that better accounts for the existing class imbalance, such as balanced accuracy or an F1-score, to improve our predictive ability for the minority class.
+:::
+
+
+
+## Concluding Remarks
+
+In conclusion, this training module has provided an introduction to supervised machine learning using classification techniques in R. Machine learning is a powerful tool that can help researchers gain new insights and improve models to analyze complex datasets faster and in a more comprehensive way. The example we've explored demonstrates the utility of supervised machine learning models on an environmentally relevant dataset.
+
+
+
+### Additional Resources
+To learn more check out the following resources:
+
++ [IBM - What is Machine Learning](https://www.ibm.com/topics/machine-learning)
++ [Curate List of AI and Machine Learning Resources](https://medium.com/machine-learning-in-practice/my-curated-list-of-ai-and-machine-learning-resources-from-around-the-web-9a97823b8524)
++ [Introduction to Machine Learning in R](https://machinelearningmastery.com/machine-learning-in-r-step-by-step/)
++ Machine Learning by Mueller, J. P. (2021). Machine learning for dummies. John Wiley & Sons.
+
+
+
+
+
+:::tyk
+Using the "Module5_2TYKInput.xlsx" file, use RF to determine whether well water data can accurately predict Manganese detection. The dataset is structured similarly to the "5_2_Supervised_ML/Data.xlsx" file used in this module; however, it now includes 4 additional features:
+
++ `Longitude`: Longitude of address (decimal degrees)
++ `Latitude`: Latitude of address (decimal degrees)
++ `Stream_Distance`: Euclidean distance to the nearest stream (feet)
++ `Elevation`: Surface elevation of the sample location (feet)
+:::
diff --git a/Chapter_5/Module5_2_Input/Module5_2_InputData.xlsx b/Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML_Data.xlsx
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_InputData.xlsx
rename to Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML_Data.xlsx
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image1.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image1.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image1.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image1.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image2.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image2.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image2.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image2.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image3.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image3.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image3.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image3.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image4.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image4.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image4.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image4.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image5.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image5.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image5.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image5.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image6.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image6.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image6.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image6.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image7.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image7.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image7.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image7.png
diff --git a/Chapter_5/Module5_2_Input/Module5_2_Image8.png b/Chapter_5/5_2_Supervised_ML/Module5_2_Image8.png
similarity index 100%
rename from Chapter_5/Module5_2_Input/Module5_2_Image8.png
rename to Chapter_5/5_2_Supervised_ML/Module5_2_Image8.png
diff --git a/Chapter_5/5_3_Supervised_ML_Interpretation/5_3_Supervised_ML_Interpretation.Rmd b/Chapter_5/5_3_Supervised_ML_Interpretation/5_3_Supervised_ML_Interpretation.Rmd
new file mode 100644
index 0000000..691771d
--- /dev/null
+++ b/Chapter_5/5_3_Supervised_ML_Interpretation/5_3_Supervised_ML_Interpretation.Rmd
@@ -0,0 +1,540 @@
+
+# 5.3 Supervised Machine Learning Model Interpretation
+
+This training module was developed by Alexis Payton, Lauren E. Koval, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Supervised machine learning (ML) represents a subset of ML methods wherein the outcome variable is known or assigned prior to training a model to predict said outcome. As we discussed in previous modules, ML methods are advantageous in that they easily incorporate a multitude of potential predictor variables, which allows these models to more closely consider real-world, complex environmental health scenarios and offer new insights through a more holistic consideration of available data inputs. However, one disadvantage of ML is that it is often not as easily interpretable as traditional statistics (e.g., regression-based methods with defined beta coefficients for each input predictor variable). With this limitation in mind, there are methods and concepts that can be applied to supervised ML algorithms to aid in the understanding of their predictions, including variable (feature) importance and decision boundaries, which we will cover in this module. We will also include example visualization techniques for these methods, representing important aspects contributing to model interpretability, since visualizations help convey concepts faster and across a broader target audience. In addition, this module addresses methods to communicate these findings in a paper so that a wider span of readers can understand the overall take-home points. As with other data analyses, we advise focusing just as much on the **why** components of a study's research question(s) as on the **what** or **how**. To elaborate, it is not as important to explain all the intricacies of how a model works and how its parameters were tuned; rather, it is more important to focus on why a particular model was selected and how it will be leveraged to answer your research questions. This can all be a bit subjective and requires expertise within your research field.
+
+As a first step, let's learn about some model interpretation methodologies, highlighting **Variable Importance** and **Decision Boundaries** as important examples relevant to environmental health research. Then, this training module will further describe approaches to summarize these methods and communicate supervised ML findings to a broader audience.
+
+
+
+## Variable Importance
+
+When a supervised ML algorithm makes predictions, it relies more heavily on some variables than others. How much a variable contributes to classifying data is known as **variable (feature) importance**. Oftentimes, this is thought of as the impact on overall model performance if that variable were removed from the model. There are many methods used to measure feature importance, including...
+
++ **SHapley Additive exPlanations (SHAP)**: based on game theory, where each variable is considered a "player" and we seek to determine each player's contribution to the outcome of a "game", i.e., the overall model performance. It divides the model performance metric amongst all the variables, so that the sum of the Shapley values for all the predictors equals the overall model performance. For more information on SHAP, see [A Novel Approach to Feature Importance](https://towardsdatascience.com/a-novel-approach-to-feature-importance-shapley-additive-explanations-d18af30fc21b).
+
++ **Mean decrease gini (gini impurity)**: quantifies the improvement in predictivity with the addition of each predictor in a decision tree, averaged over all the decision trees tested. The higher the value, the greater the variable's importance to the algorithm. This metric can easily be extracted from classification-based models, including random forest (RF) classifiers, which is what we will focus on in this module.
+
+Note that for RF regression-based models, node purity can be extracted as a measure of feature importance. For more information, please see the following resources regarding [Feature Importance](https://www.baeldung.com/cs/ml-feature-importance) and [Mean Decrease Gini](https://cran.r-project.org/web/packages/rfVarImpOOB/vignettes/rfVarImpOOB-vignette.html).
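+
+To build intuition for gini impurity itself, the impurity of a single decision tree node can be computed directly from its class proportions. The small helper below is our own sketch, not part of any package:
+```{r 5-3-Supervised-ML-Interpretation-gini-sketch, echo=TRUE, eval=TRUE}
+# Gini impurity of a node: 1 minus the sum of squared class proportions
+gini_impurity <- function(classes) {
+  p <- table(classes) / length(classes)
+  1 - sum(p^2)
+}
+
+gini_impurity(c("D", "D", "ND", "ND")) # 0.5, a maximally mixed two-class node
+gini_impurity(c("ND", "ND", "ND", "ND")) # 0, a perfectly pure node
+```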
+
+
+
+## Decision Boundary
+Another concept that is pertinent to a model's interpretability is understanding a decision boundary and how visualizing it can further aid in understanding how the model classifies new data points. A **decision boundary** is a line (or a hyperplane) that seeks to separate the training data by class. This line can be linear or non-linear and is formed in n-dimensional space. To clarify, although support vector machine (SVM) specifically uses decision boundaries to classify training data and make predictions on test data, decision boundaries can still be drawn for other algorithms.
+
+A decision boundary can be visualized to convey how well an algorithm is able to classify an outcome based on the data given. It is important to note that most ML models make use of datasets that contain three or more predictors, and it is difficult to visualize a plot in more than three dimensions. Therefore, the number of features and which features to plot need to be narrowed down to two variables. For this reason, the resulting visualization is not a true representation of the decision boundary from the initial model using all predictors, since the visualization only relies on prediction results from two variables. Nevertheless, decision boundary plots can be powerful visualizations to determine thresholds between the outcome classes.
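+
+In practice, such plots are often built by predicting the class for every point on a fine grid spanning the two chosen predictors. The sketch below is a generic, hypothetical template (`fit`, `df`, `x1`, `x2`, and `obs_class` are placeholder names, not objects from this module):
+```{r 5-3-Supervised-ML-Interpretation-boundary-sketch, echo=TRUE, eval=FALSE}
+# Build a fine grid over the two chosen predictors
+grid <- expand.grid(
+  x1 = seq(min(df$x1), max(df$x1), length.out = 200),
+  x2 = seq(min(df$x2), max(df$x2), length.out = 200)
+)
+
+# Predict the class at every grid point, then shade the regions by predicted class
+grid$pred_class <- predict(fit, newdata = grid)
+
+ggplot() +
+  geom_tile(data = grid, aes(x = x1, y = x2, fill = pred_class), alpha = 0.3) +
+  geom_point(data = df, aes(x = x1, y = x2, color = obs_class))
+```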
+
+When choosing variables for decision boundary plots, features that have the most influence on the model are often selected, but that is not always the case. Sometimes predictors are selected based upon the environmental health implications relevant to the research question. For example in [Perryman et. al](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285721), lung response following ozone exposure was investigated by sampling derivatives of cholesterol biosynthesis in human subjects. In this paper, these sterol metabolites were used to predict whether a subject would be classified as having a lung response that was considered non-responsive or responsive. A decision boundary plot was made using two predictors:
+
++ Cholesterol, given that it had the highest variable importance and
++ Vitamin D, given its synthesis can be affected by ozone despite it having a lower variable importance in the paper's models.
+```{r 5-3-Supervised-ML-Interpretation-1, echo=FALSE, fig.align='center', out.width = "80%"}
+knitr::include_graphics("Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image1.png")
+```
+
+**Figure 5. Decision boundary plot for SVM model predicting lung response class.** Cholesterol and 25-hydroxyvitamin D were used as predictors visualizing responder status [non-responders (green) and responders (yellow)] and disease status [non-asthmatics (triangles) and asthmatics (circles)]. The shaded regions are the model’s prediction of a subject’s lung response class at a given cholesterol and 25-hydroxyvitamin D concentration.
+
+Takeaways from this decision boundary plot:
+
++ Subjects with more lung inflammation ("responders") after ozone exposure tended to have higher Vitamin D levels (> 35 pmol/mL) and lower Cholesterol levels (< 675 nmol/mL).
++ These "responder" subjects were more likely to be non-asthmatics.
+
+
+
+## Introduction to Example Dataset and Activity
+
+In the previous module, we investigated whether a classification-based RF model using well water variables could accurately predict inorganic arsenic (iAs) contamination. While it is helpful to know whether certain variables can be used to construct a model that accurately predicts detectability, from a public health standpoint it is also helpful to know which of those features contribute the most to the model's accuracy. Therefore, if we can identify the features associated with lower arsenic detection, we can use that information to inform policies when new wells are constructed. In addition to identifying the variables with the greatest importance to the algorithm, it is also pertinent to understand the ranges over which a well is more or less likely to have arsenic detected. For example, are wells with a lower flow rate more likely to have arsenic detected? In this module, we will address this by extracting variable importance from the same algorithm and plotting it. The two features with the highest variable importance will then be identified and used to construct a decision boundary plot to determine how these features are associated with iAs detection.
+
+The data to be used in this module was described and referenced previously in **TAME 2.0 Module 5.2 Supervised Machine Learning**.
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. After plotting variable importance from highest to lowest, which two predictors have the highest variable importance on the predictive accuracy of iAs detection from a RF algorithm?
+2. Using the two features with the highest variable importance, under what conditions are we more likely to predict detectable iAs in wells based on a decision boundary plot?
+3. How do the decision boundaries shift after incorporating SMOTE to address class imbalance?
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 5-3-Supervised-ML-Interpretation-2 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you.
+```{r 5-3-Supervised-ML-Interpretation-3, message=FALSE}
+if (!requireNamespace("readxl"))
+ install.packages("readxl");
+if (!requireNamespace("lubridate"))
+ install.packages("lubridate");
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("caret"))
+ install.packages("caret");
+if (!requireNamespace("randomForest"))
+ install.packages("randomForest");
+if (!requireNamespace("e1071"))
+ install.packages("e1071");
+if (!requireNamespace("ggsci"))
+ install.packages("ggsci");
+if (!requireNamespace("themis"))
+ install.packages("themis");
+
+#### Loading R packages required for this session
+```{r 5-3-Supervised-ML-Interpretation-4, message=FALSE}
+library(readxl)
+library(lubridate)
+library(tidyverse)
+library(caret)
+library(randomForest)
+library(e1071)
+library(ggsci)
+library(themis)
+```
+
+#### Set your working directory
+```{r 5-3-Supervised-ML-Interpretation-5, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+```{r 5-3-Supervised-ML-Interpretation-6 }
+# Load the data
+arsenic_data <- data.frame(read_excel("Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_InputData.xlsx"))
+
+# View the top of the dataset
+head(arsenic_data)
+```
+
+### Changing Data Types
+First, `Detect_Concentration` needs to be converted from a character to a factor so that Random Forest knows that the non-detect class is the baseline or "negative" class, while the detect class will be the "positive" class. `Water_Sample_Date` will be converted from a character to a date type using the `mdy()` function from the *lubridate* package. This is done so that the model understands this column contains dates.
+```{r 5-3-Supervised-ML-Interpretation-7 }
+arsenic_data <- arsenic_data %>%
+ # Converting `Detect_Concentration` from a character to a factor
+ mutate(Detect_Concentration = relevel(factor(Detect_Concentration), ref = "ND"),
+ # Converting water sample date from a character to a date type
+ Water_Sample_Date = mdy(Water_Sample_Date)) %>%
+ # Removing well id and only keeping the predictor and outcome variables in the dataset
+ # This allows us to put the entire dataframe as is into RF
+ select(-Well_ID)
+
+# View the top of the current dataset
+head(arsenic_data)
+```
+
+
+### Setting up Cross Validation
+Note that the code below differs from the code presented in the previous module, **TAME 2.0 Module 5.2 Supervised Machine Learning**. Both coding methods are valid and produce comparable results; however, we wanted to present another way to run *k*-fold cross validation and random forest. In 5-fold cross validation (CV), there are 5 equally sized folds (ideally!). This means that 80% of the original dataset is split into the 4 folds that comprise the training set, and the remaining 20% in the last fold is reserved for the test set.
+
+Previously, the `trainControl()` function was used for CV. This time we'll use the `createFolds()` function also from the *caret* package.
+```{r 5-3-Supervised-ML-Interpretation-8 }
+# Setting seed for reproducibility
+set.seed(12)
+
+# 5-fold cross validation
+arsenic_index = createFolds(arsenic_data$Detect_Concentration, k = 5)
+
+# Seeing if about 20% of the records are in the testing set
+kfold1 = arsenic_index[[1]]
+length(kfold1)/nrow(arsenic_data)
+
+# Creating vectors for parameters to be tuned
+ntree_values = c(50, 250, 500) # number of decision trees
+p = dim(arsenic_data)[2] - 1 # number of predictor variables in the dataset
+mtry_values = c(sqrt(p), p/2, p) # number of predictors to be used in the model
+```
+
+
+## Predicting iAs Detection with a Random Forest (RF) Model
+Notice that in the code below we are choosing the final RF model to be the one with the lowest out of bag (OOB) error. In the previous module, the final model was chosen based on the highest accuracy; however, these approaches are equivalent here, given that OOB error = 1 - accuracy.
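+
+For reference, the OOB error of a fitted *randomForest* object can be read from its error-rate matrix, whose first column tracks the overall OOB error after each successive tree. The sketch below uses a hypothetical stand-alone fit, `rf_fit`, rather than an object from this module's analysis:
+```{r 5-3-Supervised-ML-Interpretation-oob-sketch, echo=TRUE, eval=FALSE}
+# Hypothetical fit, shown only to illustrate where the OOB error is stored
+rf_fit <- randomForest(Detect_Concentration ~ ., data = arsenic_data, ntree = 500)
+
+# OOB error after the final tree (column 1 of err.rate is the overall OOB error)
+oob_error <- rf_fit$err.rate[rf_fit$ntree, 1]
+oob_error
+1 - oob_error # the corresponding OOB accuracy
+```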
+```{r 5-3-Supervised-ML-Interpretation-9 }
+# Setting the seed again so the predictions are consistent
+set.seed(12)
+
+# Creating an empty dataframe to save the confusion matrix metrics and variable importance
+metrics = data.frame()
+variable_importance_df = data.frame()
+
+# Iterating through the cross validation folds
+for (i in 1:length(arsenic_index)){
+ # Training data
+ data_train = arsenic_data[-arsenic_index[[i]],]
+
+ # Test data
+ data_test = arsenic_data[arsenic_index[[i]],]
+
+ # Creating empty lists and dataframes to store errors
+ reg_rf_pred_tune = list()
+ rf_OOB_errors = list()
+ rf_error_df = data.frame()
+
+ # Tuning parameters: using ntree and mtry values to determine which combination yields the smallest OOB error
+ # from the validation datasets
+ for (j in 1:length(ntree_values)){
+ for (k in 1:length(mtry_values)){
+
+ # Running RF to tune parameters
+ reg_rf_pred_tune[[k]] = randomForest(Detect_Concentration ~ ., data = data_train,
+ ntree = ntree_values[j], mtry = mtry_values[k])
+ # Obtaining the OOB error
+ rf_OOB_errors[[k]] = data.frame("Tree Number" = ntree_values[j], "Variable Number" = mtry_values[k],
+ "OOB_errors" = reg_rf_pred_tune[[k]]$err.rate[ntree_values[j],1])
+
+ # Storing the values in a dataframe
+ rf_error_df = rbind(rf_error_df, rf_OOB_errors[[k]])
+ }
+ }
+
+ # Finding the ntree/mtry combination(s) that yielded the lowest OOB error for this fold
+ best_oob_errors <- which(rf_error_df$OOB_errors == min(rf_error_df$OOB_errors))
+
+ # Now running RF on the entire training set with the tuned parameters
+ # This will be done 5 times for each fold
+ reg_rf <- randomForest(Detect_Concentration ~ ., data = data_train,
+ ntree = rf_error_df$Tree.Number[min(best_oob_errors)],
+ mtry = rf_error_df$Variable.Number[min(best_oob_errors)])
+
+ # Predicting on test set and adding the predicted values as an additional column to the test data
+ data_test$Pred_Detect_Concentration = predict(reg_rf, newdata = data_test, type = "response")
+ matrix = confusionMatrix(data = data_test$Pred_Detect_Concentration,
+ reference = data_test$Detect_Concentration, positive = "D")
+
+ # Extracting accuracy, sens, spec, PPV, NPV and adding to the dataframe to take mean later
+ matrix_values = data.frame(t(c(matrix$byClass[11])), t(c(matrix$byClass[1:4])))
+ metrics = rbind(metrics, matrix_values)
+
+ # Extracting variable importance
+ variable_importance_values = data.frame(importance(reg_rf)) %>%
+ rownames_to_column(var = "Predictor")
+ variable_importance_df = rbind(variable_importance_df, variable_importance_values)
+}
+
+# Taking average across the 5 folds
+metrics = metrics %>%
+ summarise(`Balanced Accuracy` = mean(Balanced.Accuracy), Sensitivity = mean(Sensitivity),
+ Specificity = mean(Specificity), PPV = mean(Pos.Pred.Value), NPV = mean(Neg.Pred.Value))
+
+variable_importance_df = variable_importance_df %>%
+ group_by(Predictor) %>%
+ summarise(MeanDecreaseGini = mean(MeanDecreaseGini)) %>%
+ # Sorting from highest to lowest
+ arrange(-MeanDecreaseGini)
+```
+
+The confusion matrix results from the previous module are shown below.
+```{r 5-3-Supervised-ML-Interpretation-10, echo=FALSE, fig.align='center', out.width = "80%"}
+knitr::include_graphics("Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image2.png")
+```
+
+Now let's double check that when using this new method, our results are still comparable.
+```{r 5-3-Supervised-ML-Interpretation-11 }
+# First comparing results to the previous module
+round(metrics, 2)
+```
+
+They are! Now we'll take a look at the model's variable importance.
+```{r 5-3-Supervised-ML-Interpretation-12 }
+variable_importance_df
+```
+
+Although we have the results we need, let's take it a step further and plot the data.
+
+### Reformatting the dataframe for plotting
+First, the dataframe will be transformed so that the figure is more legible. Specifically, underscores in the variable names will be replaced with spaces, and the `Predictor` column will be converted to a factor to order the variables from lowest to highest mean decrease gini. For additional information on tricks like this to make visualizations easier to read, see **TAME 2.0 Module 3.2 Improving Data Visualizations**.
+```{r 5-3-Supervised-ML-Interpretation-13 }
+# Adding spaces between the variables that need the space
+modified_variable_importance_df = variable_importance_df %>%
+ mutate(Predictor = gsub("_", " ", Predictor))
+
+# Saving the order of the variables from lowest to highest mean decrease gini by putting into a factor
+predictor_order = rev(modified_variable_importance_df$Predictor)
+modified_variable_importance_df$Predictor = factor(modified_variable_importance_df$Predictor,
+ levels = predictor_order)
+
+head(modified_variable_importance_df)
+```
+
+## Variable Importance Plot
+```{r 5-3-Supervised-ML-Interpretation-14, fig.align='center', out.width = "65%"}
+ggplot(data = modified_variable_importance_df ,
+ aes(x = MeanDecreaseGini, y = Predictor, size = 2)) +
+ geom_point() +
+
+ theme_light() +
+ theme(axis.line = element_line(color = "black"), #making x and y axes black
+ axis.text = element_text(size = 12), #changing size of x axis labels
+ axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
+ legend.title = element_text(face = 'bold', size = 14), #changes legend title
+ legend.text = element_text(size = 12), #changes legend text
+ strip.text.x = element_text(size = 15, face = "bold"), #changes size of facet x axis
+ strip.text.y = element_text(size = 15, face = "bold")) + #changes size of facet y axis
+ labs(x = 'Variable Importance', y = 'Predictor') + #changing axis labels
+
+ guides(size = "none")#removing size legend
+```
+An appropriate title for this figure could be:
+
+“**Figure X. Variable importance from random forest models predicting iAs detection.** Variable importance is derived from mean decrease gini values extracted from random forest models. Features are listed on the y axis from greatest (top) to least (bottom) mean decrease gini."
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: After plotting variable importance from highest to lowest, which two predictors have the highest variable importance on the predictive accuracy of iAs detection from a RF algorithm?
+:::
+
+:::answer
+**Answer**: From the variable importance dataframe and plot, we can see that casing depth and pH had the greatest impact on RF followed by water sample date, flow rate, static water depth, and well depth in descending order.
+:::
+
+Since casing depth and pH have been identified as the predictors with the highest variable importance, they will be prioritized as the two predictors included in the decision boundary plot example below.
+
+
+
+### Decision Boundary Calculation
+
+First, models will be trained using only casing depth and pH as variables. Since the decision boundary plot will be used for visualization purposes, and a 2-D figure can only plot two variables, we will not worry about tuning the parameters as was previously done. In this module, we're creating a decision boundary based on a random forest model; however, we'll also explore what decision boundaries look like for other algorithms, including support vector machine (SVM), k nearest neighbor (KNN), and logistic regression. Each supervised ML method has its advantages, and performance is dependent upon the situation and the dataset. Therefore, it is common to see multiple models used to predict an outcome of interest in a publication. Let's create additional boundary plots still using casing depth and pH, but this time we will use logistic regression, SVM, and KNN as comparisons to RF.
+```{r 5-3-Supervised-ML-Interpretation-15 }
+# Creating a dataframe with variables based on the highest predictors
+highest_pred_data = data.frame(arsenic_data[,c("Casing_Depth", "pH", "Detect_Concentration")])
+
+# Training RF
+rf_detect_arsenic = randomForest(Detect_Concentration~., data = highest_pred_data)
+
+# Logistic regression
+lr_detect_arsenic = glm(Detect_Concentration~., data = highest_pred_data, family = binomial(link = 'logit'))
+
+# SVM with a radial kernel (hyperplane)
+svm_detect_arsenic = svm(Detect_Concentration~., data = highest_pred_data, kernel = "radial")
+
+# KNN
+knn_detect_arsenic = knn3(Detect_Concentration~., data = highest_pred_data) # k nearest neighbor model using the default number of neighbors
+```
+
+From these models, decision boundaries will be calculated. This will be done by predicting `Detect_Concentration` across a grid of evenly spaced values spanning the minimum and maximum of the two predictors (casing depth and pH). A non-linear line will then be drawn on the plot to separate the two classes.
+```{r 5-3-Supervised-ML-Interpretation-16 }
+get_grid_df <- function(classification_model, data, resolution = 100, predict_type) {
+ # This function predicts the outcome (Detect_Concentration) at evenly spaced data points using the two variables (pH and casing depth)
+ # to create a decision boundary between the outcome classes (detect and non-detect samples).
+
+ # :parameters: a classification-based supervised machine learning model, dataset containing the predictors and outcome variable,
+ # specifies the number of data points to make between the minimum and maximum predictor values, prediction type
+ # :output: a grid of values for both predictors and their corresponding predicted outcome class
+
+ # Grabbing only the predictor data
+ predictor_data <- data[,1:2]
+
+ # Creating a dataframe that contains the min and max for both features
+ min_max_df <- sapply(predictor_data, range, na.rm = TRUE)
+
+ # Creating a vector of evenly spaced points between the min and max for the first variable (casing depth)
+ variable1_vector <- seq(min_max_df[1,1], min_max_df[2,1], length.out = resolution)
+ # Creating a vector of evenly spaced points between the min and max for the second variable (pH)
+ variable2_vector <- seq(min_max_df[1,2], min_max_df[2,2], length.out = resolution)
+
+ # Creating a dataframe of grid values by combining the two vectors
+ grid_df <- data.frame(cbind(rep(variable1_vector, each = resolution),
+ rep(variable2_vector, times = resolution)))
+ colnames(grid_df) <- colnames(min_max_df)
+
+ # Predicting class label based on all the predictor pairs of data
+ grid_df$Pred_Class = predict(classification_model, grid_df, type = predict_type)
+
+ return(grid_df)
+}
+
+# calling function
+# RF
+grid_df_rf = get_grid_df(rf_detect_arsenic, highest_pred_data, predict_type = "class") %>%
+ # Adding in a column that indicates the model so all the dataframes can be combined
+ mutate(Model = "A. Random Forest")
+
+# SVM with a radial kernel (hyperplane)
+grid_df_svm = get_grid_df(svm_detect_arsenic, highest_pred_data, predict_type = "class") %>%
+ mutate(Model = "B. Support Vector Machine")
+
+# KNN
+grid_df_knn = get_grid_df(knn_detect_arsenic, highest_pred_data, predict_type = "class") %>%
+ mutate(Model = "C. K Nearest Neighbor")
+
+# Logistic regression
+grid_df_lr = get_grid_df(lr_detect_arsenic, highest_pred_data, predict_type = "response") %>%
+ # First specifying the cutoff point for logistic regression predictions
+ # If the response is >= 0.5 it will be classified as a detect prediction
+ mutate(Pred_Class = relevel(factor(ifelse(Pred_Class >= 0.5, "D", "ND")), ref = "ND"),
+ Model = "D. Logistic Regression")
+
+# Creating 1 dataframe
+grid_df = rbind(grid_df_rf, grid_df_lr, grid_df_svm, grid_df_knn)
+
+# Viewing the dataframe to be plotted
+head(grid_df)
+```
+## Decision Boundary Plot
+
+Now let's plot the grid of predictions with the sampled data.
+```{r 5-3-Supervised-ML-Interpretation-17, warning = FALSE, fig.width=15, fig.height=10, fig.align='center'}
+# choosing palette from package
+ggsci_colors = pal_npg()(5)
+
+ggplot() +
+ geom_point(data = arsenic_data, aes(x = pH, y = Casing_Depth, color = Detect_Concentration),
+ position = position_jitter(w = 0.1, h = 0.1), size = 4, alpha = 0.8) +
+ geom_contour(data = grid_df, aes(x = pH, y = Casing_Depth, z = as.numeric(Pred_Class == "D")),
+ color = "black", breaks = 0.5) + # adds contour line
+ geom_point(data = grid_df, aes(x = pH, y = Casing_Depth, color = Pred_Class),
+ size = 0.1) + # shades plot
+ xlim(5.9, NA) + # changes the limits of the x axis
+
+ facet_wrap(~Model, scales = 'free') +
+
+ theme_light() +
+ theme(axis.line = element_line(color = "black"), #making x and y axes black
+ axis.text = element_text(size = 10), #changing size of x axis labels
+ axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
+ legend.title = element_text(face = 'bold', size = 12), #changes legend title
+ legend.text = element_text(size = 12), #changes legend text
+ legend.position = "bottom", # moves legend to the bottom of the plot
+ legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid'), # changes legend background
+ strip.text = element_text(size = 15, face = "bold")) + #changes size of facet labels
+ labs(y = 'Casing Depth (ft)') + #changing axis labels
+
+ scale_color_manual(name = "Arsenic Detection", # renaming the legend
+ values = ggsci_colors[c(4,5)],
+ labels = c('Non-Detect','Detect')) # renaming the classes
+
+
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Using the two features with the highest variable importance, under what conditions are we more likely to predict detectable iAs in wells based on a decision boundary plot?
+:::
+
+:::answer
+**Answer**: There is some overlap between detect and non-detect iAs samples; however, it is evident that wells with detectable levels of iAs were more likely to have lower (<80 ft) casing depths and a more basic pH (> 7) based on the RF and KNN models. SVM and logistic regression captured smaller "detect" regions than expected, indicating that these models likely struggled to predict "detect" values. In the next section, SMOTE will be used to see if these decision boundaries can be improved.
+:::
+
+
+
+## Decision Boundary Plot Incorporating SMOTE
+
+Here, we will create a decision boundary plot still using casing depth and pH, but this time we will make our dataset more balanced to see how this visually improves model performance. The **Synthetic Minority Oversampling Technique (SMOTE)** was introduced in **TAME 2.0 Module 5.2 Supervised Machine Learning** and will be used to make the dataset more balanced by oversampling the minority class (detect values) and undersampling the majority class (non-detect values).
+
+Starting by training each model:
+```{r 5-3-Supervised-ML-Interpretation-18 }
+# Using SMOTE first to balance classes
+balanced_highest_pred_data = smotenc(highest_pred_data, "Detect_Concentration")
+
+# Training RF
+rf_detect_arsenic = randomForest(Detect_Concentration~., data = balanced_highest_pred_data)
+
+# Logistic regression
+lr_detect_arsenic = glm(Detect_Concentration~., data = balanced_highest_pred_data, family = binomial(link = 'logit'))
+
+# SVM with a radial kernel (hyperplane)
+svm_detect_arsenic = svm(Detect_Concentration~., data = balanced_highest_pred_data, kernel = "radial")
+
+# KNN
+knn_detect_arsenic = knn3(Detect_Concentration~., data = balanced_highest_pred_data) # k nearest neighbor model using the default number of neighbors
+```
+
+Now calling the `get_grid_df()` function we created above to create a grid of predictions.
+```{r 5-3-Supervised-ML-Interpretation-19 }
+# Calling function
+# RF
+balanced_grid_df_rf = get_grid_df(rf_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
+ # Adding in a column that indicates the model so all the dataframes can be combined
+ mutate(Model = "A. Random Forest")
+
+# SVM with a radial kernel (hyperplane)
+balanced_grid_df_svm = get_grid_df(svm_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
+ mutate(Model = "B. Support Vector Machine")
+
+# KNN
+balanced_grid_df_knn = get_grid_df(knn_detect_arsenic, balanced_highest_pred_data, predict_type = "class") %>%
+ mutate(Model = "C. K Nearest Neighbor")
+
+# Logistic regression
+balanced_grid_df_lr = get_grid_df(lr_detect_arsenic, balanced_highest_pred_data, predict_type = "response") %>%
+ # First specifying the cutoff point for logistic regression predictions
+ # If the response is >= 0.5 it will be classified as a detect prediction
+ mutate(Pred_Class = relevel(factor(ifelse(Pred_Class >= 0.5, "D", "ND")), ref = "ND"),
+ Model = "D. Logistic Regression")
+
+
+# Creating 1 dataframe
+balanced_grid_df = rbind(balanced_grid_df_rf, balanced_grid_df_lr, balanced_grid_df_svm, balanced_grid_df_knn)
+
+# Viewing the dataframe to be plotted
+head(balanced_grid_df)
+```
+
+```{r 5-3-Supervised-ML-Interpretation-20, warning = FALSE, fig.width=15, fig.height=10, fig.align='center'}
+# choosing palette from package
+ggsci_colors = pal_npg()(5)
+
+ggplot() +
+ geom_point(data = arsenic_data, aes(x = pH, y = Casing_Depth, color = Detect_Concentration),
+ position = position_jitter(w = 0.1, h = 0.1), size = 4, alpha = 0.8) +
+ geom_contour(data = balanced_grid_df, aes(x = pH, y = Casing_Depth, z = as.numeric(Pred_Class == "D")),
+ color = "black", breaks = 0.5) + # adds contour line
+ geom_point(data = balanced_grid_df, aes(x = pH, y = Casing_Depth, color = Pred_Class),
+ size = 0.1) + # shades plot
+ xlim(5.9, NA) + # changes the limits of the x axis
+
+ facet_wrap(~Model, scales = 'free') +
+
+ theme_light() +
+ theme(axis.line = element_line(color = "black"), #making x and y axes black
+ axis.text = element_text(size = 10), #changing size of x axis labels
+ axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
+ legend.title = element_text(face = 'bold', size = 12), #changes legend title
+ legend.text = element_text(size = 12), #changes legend text
+ legend.position = "bottom", # moves legend to the bottom of the plot
+ legend.background = element_rect(color = 'black', fill = 'white', linetype = 'solid'), # changes legend background
+ strip.text = element_text(size = 15, face = "bold")) + #changes size of facet labels
+ labs(y = 'Casing Depth (ft)') + #changing axis labels
+
+ scale_color_manual(name = "Arsenic Detection", # renaming the legend
+ values = ggsci_colors[c(4,5)],
+ labels = c('Non-Detect','Detect')) # renaming the classes
+```
+An appropriate title for this figure could be:
+
+“**Figure X. Decision boundary plots from supervised machine learning models predicting iAs detection.** The top two predictors of model performance, casing depth and pH, were used to visualize arsenic detection [non-detect (red) and detect (blue)]. The shaded regions represent prediction of a well's detection class based on varying casing depth and pH values using (A) Random Forest, (B) Support Vector Machine, (C) K Nearest Neighbor, and (D) Logistic Regression."
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: How do the decision boundaries shift after incorporating SMOTE to address class imbalance?
+:::
+
+:::answer
+**Answer**: It is still evident that wells with detectable levels of iAs were more likely to have lower (<80 ft) casing depths and a more basic pH (> 7). However, we see the greatest shifts in the decision boundaries of SVM and logistic regression, with both models now assigning larger regions to detectable iAs levels.
+:::
+
+
+
+## Concluding Remarks
+In conclusion, this training module provided methodologies to aid in the interpretation of supervised ML with variable importance and decision boundary plots. Variable importance quantifies the impact of each feature on an algorithm's predictive accuracy. The most important or environmentally relevant predictors can then be selected for a decision boundary plot to further understand and visualize those features' impact on the model's classification.
+
+
+
+### Additional Resources
+
++ Christoph Molnar. (2019, August 27). Interpretable Machine Learning. Github.io. https://christophm.github.io/interpretable-ml-book/
++ [Variable Importance](https://compgenomr.github.io/book/trees-and-forests-random-forests-in-action.html#variable-importance-1)
++ [Decision Boundary](https://rpubs.com/ZheWangDataAnalytics/DecisionBoundary)
+
+
+
+
+
+:::tyk
+1. Using the "Module5_2_TYKInput.xlsx" file, use RF to determine if well water data can accurately predict manganese detection, as was done in the previous module. However, this time, incorporate SMOTE into the model. Feel free to use either the `trainControl()` or `createFolds()` function for CV. Extract the variable importance for each predictor from the RF model. What two features have the highest variable importance? **Hint**: Regardless of the cross validation function you choose, run SMOTE on the training dataset only to create a more balanced training set while the test set remains unchanged.
+
+2. Using casing depth and the feature with the highest variable importance, construct a decision boundary plot. Under what conditions is a well more likely to have detectable manganese levels based on a decision boundary plot?
+:::
diff --git a/Chapter_5/Module5_3_Input/Module5_3_Image1.png b/Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image1.png
similarity index 100%
rename from Chapter_5/Module5_3_Input/Module5_3_Image1.png
rename to Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image1.png
diff --git a/Chapter_5/Module5_3_Input/Module5_3_Image2.png b/Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image2.png
similarity index 100%
rename from Chapter_5/Module5_3_Input/Module5_3_Image2.png
rename to Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_Image2.png
diff --git a/Chapter_5/Module5_3_Input/Module5_3_InputData.xlsx b/Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_InputData.xlsx
similarity index 100%
rename from Chapter_5/Module5_3_Input/Module5_3_InputData.xlsx
rename to Chapter_5/5_3_Supervised_ML_Interpretation/Module5_3_InputData.xlsx
diff --git a/Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML.Rmd b/Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML.Rmd
new file mode 100644
index 0000000..3c28315
--- /dev/null
+++ b/Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML.Rmd
@@ -0,0 +1,486 @@
+
+# 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA
+
+This training module was developed by David M. Reif with contributions from Alexis Payton, Lauren E. Koval, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+To reiterate what has been discussed in the previous module, machine learning is a field that has great utility in environmental health sciences, often to investigate high-dimensional datasets. The two main classifications of machine learning discussed throughout the TAME Toolkit are supervised and unsupervised machine learning, though additional classifications exist. Previously, we discussed artificial intelligence and supervised machine learning in **TAME 2.0 Module 5.1 Introduction to Machine Learning & Artificial Intelligence**, **TAME 2.0 Module 5.2 Supervised Machine Learning**, and **TAME 2.0 Module 5.3 Supervised Machine Learning Model Interpretation**. In this module, we'll cover background information on unsupervised machine learning and then work through a scripted example of an unsupervised machine learning analysis.
+
+## Introduction to Unsupervised Machine Learning
+
+**Unsupervised machine learning**, as opposed to supervised machine learning, involves training a model on a dataset lacking ground truths or response variables. In this regard, unsupervised approaches are often used to identify underlying patterns amongst data in a more unbiased manner. This can provide the analyst with insights into the data that may not otherwise be apparent. Unsupervised machine learning has been used for understanding differences in gene expression patterns of breast cancer patients ([Jezequel et. al, 2015](https://link.springer.com/article/10.1186/s13058-015-0550-y)) and evaluating metabolomic signatures of patients with and without cystic fibrosis ([Laguna et. al, 2015](https://onlinelibrary.wiley.com/doi/full/10.1002/ppul.23225?casa_token=Vqlz3JgGm10AAAAA%3A4UFubAP2r97CKl9PK8oYDfgrcjrs_ZySDzDCx1t3qc6XvQRxOqIwjTn_eQxm_lzX8UQLE0zURJu94fI)).
+
+:::moduletextbox
+**Note**: Unsupervised machine learning is used for exploratory purposes, and just because it can find relationships between data points, that doesn't necessarily mean that those relationships have merit, are indicative of causal relationships, or have direct biological implications. Rather, these methods can be used to find new patterns that can also inform future studies testing direct relationships.
+:::
+
+```{r 5-4-Unsupervised-ML-1, echo=FALSE, out.width = "75%", fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_4_Unsupervised_ML/Module5_4_Image1.png")
+```
+
+Langs, G., Röhrich, S., Hofmanninger, J., Prayer, F., Pan, J., Herold, C., & Prosch, H. (2018). Machine learning: from radiomics to discovery and routine. Der Radiologe, 58(S1), 1–6. PMID: [34013136](https://doi.org/10.1007/s00117-018-0407-3). Figure regenerated here in alignment with its published [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
+
+Unsupervised machine learning includes:
+
++ **Clustering**: Involves grouping elements in a dataset such that the elements in the same group are more similar to each other than to the elements in the other groups.
+ + Exclusive (*K*-means)
+ + Overlapping
+ + Hierarchical
+ + Probabilistic
++ **Dimensionality reduction**: Focuses on taking high-dimensional data and transforming it into a lower-dimensional space that has fewer features while preserving important information inherent to the original dataset. This is useful because reducing the number of features makes the data easier to visualize while trying to maintain the initial integrity of the dataset.
+ + Principal Component Analysis (PCA)
+ + Singular Value Decomposition (SVD)
+ + t-Distributed Stochastic Neighbor Embedding (t-SNE)
+ + Uniform Manifold Approximation and Projection (UMAP)
+ + Partial Least Squares-Discriminant Analysis (PLS-DA)
+
+
+In this module, we'll focus on methods for ***K*-means clustering** and **Principal Component Analysis** described in more detail in the following sections. In the next module, **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**, we'll focus on hierarchical clustering. For further information on types of unsupervised machine learning, check out [Unsupervised Learning](https://cloud.google.com/discover/what-is-unsupervised-learning#section-3).
+
+
+
+
+### *K*-Means Clustering
+
+*K*-means is a common clustering algorithm used to partition quantitative data. This algorithm works by first randomly selecting a pre-specified number of clusters, *k*, across the data space with each cluster having a data centroid. When using a standard Euclidean distance metric, the distance is calculated from an observation to each centroid, then the observation is assigned to the cluster of the closest centroid. After all observations have been assigned to one of the *k* clusters, the average of all observations in a cluster is calculated, and the centroid for the cluster is moved to the location of the mean. The process then repeats, with the distance computed between the observations and the updated centroids. Observations may be reassigned to the same cluster or moved to a different cluster if it is closer to another centroid. These iterations continue until there are no longer changes between cluster assignments for observations, resulting in the final cluster assignments that are then carried forward for analysis/interpretation.
+
+Helpful resources on *k*-means clustering include the following: [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf) &
+[Towards Data Science](https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a).
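+
+As a minimal sketch of the iterative procedure described above, base R's `kmeans()` function carries out these steps internally. Note this uses simulated toy data, not this module's dataset:
+```r
+# Simulated toy data: two groups of 20 points centered at different locations
+set.seed(42)
+toy_data <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
+                  matrix(rnorm(40, mean = 4), ncol = 2))
+
+# Running k-means with k = 2 clusters; nstart = 25 repeats the algorithm with
+# 25 random centroid initializations and keeps the best solution
+toy_kmeans <- kmeans(toy_data, centers = 2, nstart = 25)
+
+toy_kmeans$cluster # final cluster assignment for each observation
+toy_kmeans$centers # final centroid locations after the iterations converge
+```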
+
+
+
+### Principal Component Analysis (PCA)
+
+Principal Component Analysis, or PCA, is a dimensionality-reduction technique used to transform high-dimensional data into a lower dimensional space while trying to preserve as much of the variability in the original data as possible. PCA has strong foundations in linear algebra, so background knowledge of eigenvalues and eigenvectors is extremely useful. Though the mathematics of PCA is beyond the scope of this module, a variety of more in-depth resources on PCA exist, including this [Towards Data Science Blog](https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643) and this [Sartorius Blog](https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186#:~:text=Principal%20component%20analysis%2C%20or%20PCA,more%20easily%20visualized%20and%20analyzed.). At a higher level, important concepts in PCA include:
+
+1. PCA partitions variance in a dataset into linearly uncorrelated principal components (PCs), which are weighted combinations of the original features.
+
+2. Each PC (starting from the first one) summarizes a decreasing percentage of variance.
+
+3. Every instance (e.g. chemical) in the original dataset has a "weight" or "score" on each PC.
+
+4. Any combination of PCs can be compared to summarize relationships amongst the instances (e.g. chemicals), but typically it's the first two PCs that capture a majority of the variance.
+```{r 5-4-Unsupervised-ML-2, echo=FALSE, out.width= "80%", fig.align = 'center'}
+knitr::include_graphics("Chapter_5/5_4_Unsupervised_ML/Module5_4_Image2.png")
+```
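+
+As a minimal sketch of these concepts (using R's built-in `USArrests` dataset rather than this module's data), `prcomp()` returns both the variance summarized by each PC and each instance's scores:
+```r
+# Running PCA on R's built-in USArrests data; scale. = TRUE standardizes the
+# features so variables on larger scales don't dominate the PCs
+usarrests_pca <- prcomp(USArrests, scale. = TRUE)
+
+# Proportion of variance summarized by each PC (decreasing from PC1 onward)
+summary(usarrests_pca)$importance["Proportion of Variance", ]
+
+# Each instance's scores on the first two PCs, often used for plotting
+head(usarrests_pca$x[, 1:2])
+```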
+
+
+
+## Introduction to Example Data
+
+In this activity, we are going to analyze an example dataset of physicochemical property information for chemicals spanning **per- and polyfluoroalkyl substances (PFAS) and statins**. PFAS represent a ubiquitous and pervasive class of man-made industrial chemicals that are commonly used in food packaging, commercial household products such as Teflon, cleaning products, and flame retardants. PFAS are recognized as highly stable compounds that, upon entering the environment, can persist for many years and act as harmful sources of exposure. Statins represent a class of lipid-lowering compounds that are commonly used as pharmaceutical treatments for patients at risk of cardiovascular disease. Because of their common use amongst patients, statins can also end up in water and wastewater effluent, making them environmentally relevant as well.
+
+This example analysis was designed to evaluate the chemical space of these diverse compounds and to illustrate the utility of unsupervised machine learning methods to differentiate chemical class and make associations between chemical groupings that can inform a variety of environmental and toxicological applications. The two types of machine learning methods that will be employed are *k*-means and PCA (as described in the introduction).
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying unsupervised machine learning techniques?
+2. If substances are able to be clustered, what are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
+3. How do the data compare when physicochemical properties are reduced using PCA?
+4. Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component?
+5. If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
+6. What kinds of applications/endpoints can be better understood and/or predicted because of these derived chemical groupings?
+
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 5-4-Unsupervised-ML-3, echo=TRUE, eval=TRUE}
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.
+```{r 5-4-Unsupervised-ML-4, message=FALSE}
+if (!requireNamespace("factoextra"))
+ install.packages("factoextra");
+if (!requireNamespace("pheatmap"))
+ install.packages("pheatmap");
+if (!requireNamespace("cowplot"))
+ install.packages("cowplot");
+```
+
+#### Loading required R packages
+```{r 5-4-Unsupervised-ML-5, results=FALSE, message=FALSE}
+library(tidyverse)
+library(factoextra)
+library(pheatmap) #used to make heatmaps
+library(cowplot)
+```
+
+Getting help with packages and functions
+```{r 5-4-Unsupervised-ML-6, eval = FALSE}
+?tidyverse # Package documentation for tidyverse
+?kmeans # Package documentation for kmeans (part of the standard stats R package, automatically loaded)
+?prcomp # Package documentation for deriving principal components within a PCA (part of the standard stats R package, automatically loaded)
+?pheatmap # Package documentation for pheatmap
+```
+
+#### Set your working directory
+```{r 5-4-Unsupervised-ML-7, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Loading the Example Dataset
+Let's start by loading the datasets needed for this training module. We are going to use a dataset of substances that have a diverse chemical space of PFAS and statin compounds. This list of chemicals will be uploaded alongside physicochemical property data. The chemical lists for 'PFAS' and 'Statins' were obtained from the EPA's Computational Toxicology Dashboard [Chemical Lists](https://comptox.epa.gov/dashboard/chemical-lists). The physicochemical properties were obtained by uploading these lists into the National Toxicology Program’s [Integrated Chemical Environment (ICE)](https://ice.ntp.niehs.nih.gov/).
+```{r 5-4-Unsupervised-ML-8 }
+dat <- read.csv("Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML_Data.csv", fileEncoding = "UTF-8-BOM")
+```
+
+#### Data Viewing
+
+Starting with the overall dimensions:
+```{r 5-4-Unsupervised-ML-9 }
+dim(dat)
+```
+
+Then looking at the first four rows and five columns of data:
+```{r 5-4-Unsupervised-ML-10 }
+dat[1:4,1:5]
+```
+
+Note that the first column, `List`, designates the following two larger chemical classes:
+```{r 5-4-Unsupervised-ML-11 }
+unique(dat$List)
+```
+
+Let's lastly view all of the column headers:
+```{r 5-4-Unsupervised-ML-12 }
+colnames(dat)
+```
+
+In the data file, the first four columns represent chemical identifier information. All remaining columns represent different physicochemical properties derived from OPERA via [Integrated Chemical Environment (ICE)](https://ice.ntp.niehs.nih.gov/). Because the original titles of these physicochemical properties contained commas and spaces, R automatically converted these into periods. Hence, titles like `OPERA..Boiling.Point`.
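+
+This renaming is the behavior of base R's `make.names()`, which `read.csv()` applies by default (via `check.names = TRUE`). A quick sketch:
+```r
+# make.names() replaces each character that's invalid in an R name
+# (e.g., commas and spaces) with a period
+make.names(c("OPERA, Boiling Point", "Molecular Weight"))
+# "OPERA..Boiling.Point" "Molecular.Weight"
+```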
+
+For ease of downstream data analyses, let's create a more focused dataframe option containing only one chemical identifier (CASRN) as row names and then just the physicochemical property columns.
+```{r 5-4-Unsupervised-ML-13 }
+# Creating a new dataframe that contains the physicochemical properties
+chemical_prop_df <- dat[,5:ncol(dat)]
+rownames(chemical_prop_df) <- dat$CASRN
+```
+
+Now explore this data subset:
+```{r 5-4-Unsupervised-ML-14 }
+dim(chemical_prop_df) # overall dimensions
+chemical_prop_df[1:4,1:5] # viewing the first four rows and five columns
+colnames(chemical_prop_df)
+```
+
+
+### Evaluating the Original Physicochemical Properties across Substances
+
+Let's first plot two physicochemical properties to determine if and how substances group together without any fancy data reduction or other machine learning techniques. This will answer **Environmental Health Question #1**: Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying unsupervised machine learning techniques?
+
+Let's put molecular weight (`Molecular.Weight`) as one axis and boiling point (`OPERA..Boiling.Point`) on the other. We'll also color by the chemical classes using the `List` column from the original dataframe.
+```{r 5-4-Unsupervised-ML-15, fig.align='center'}
+ggplot(chemical_prop_df[,1:2], aes(x = Molecular.Weight, y = OPERA..Boiling.Point, color = dat$List)) +
+ geom_point(size = 2) + theme_bw() +
+ ggtitle('Version A: Bivariate Plot of Two Original Physchem Variables') +
+ xlab("Molecular Weight") + ylab("Boiling Point")
+```
+
+Let's plot two other physicochemical property variables, Henry's Law constant (`OPERA..Henry.s.Law.Constant`) and melting point (`OPERA..Melting.Point`), to see if the same separation of chemical classes is apparent.
+```{r 5-4-Unsupervised-ML-16, fig.align='center'}
+ggplot(chemical_prop_df[,3:4], aes(x = OPERA..Henry.s.Law.Constant, y = OPERA..Melting.Point,
+ color = dat$List)) +
+ geom_point(size = 2) + theme_bw() +
+ ggtitle('Version B: Bivariate Plot of Two Other Original Physchem Variables') +
+ xlab("OPERA..Henry.s.Law.Constant") + ylab("OPERA..Melting.Point")
+```
+
+### Answer to Environmental Health Question 1
+:::question
+*With these, we can answer **Environmental Health Question #1***: Can we differentiate between PFAS and statin chemical classes when considering just the raw physicochemical property variables without applying machine learning techniques?
+:::
+
+:::answer
+**Answer**: Only in part. From the first plot, we can see that PFAS tend to have lower molecular weight ranges in comparison to the statins, though other property variables clearly overlap in ranges of values making the groupings not entirely clear.
+:::
+
+
+
+## Identifying Clusters of Chemicals through *K*-Means
+
+Let's turn our attention to **Environmental Health Question #2**: If substances are able to be clustered, what are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means? This will be done by deriving clusters of chemicals based on ALL underlying physicochemical property data using *k*-means clustering.
+
+For this example, let's coerce the *k*-means algorithm to calculate 2 distinct clusters (based on their corresponding mean centered values). Here, we choose to derive two distinct clusters because we are ultimately going to see if we can use this information to predict each chemical's classification into two distinct chemical classes (i.e., PFAS vs statins). Note that we can derive more clusters using similar code depending on the question being addressed.
+
+We can give a name to this variable to easily provide the number of clusters in the next lines of code, `num.centers`:
+```{r 5-4-Unsupervised-ML-17 }
+num.centers <- 2
+```
+
+Here we derive chemical clusters using *k*-means:
+```{r 5-4-Unsupervised-ML-18 }
+clusters <- kmeans(chemical_prop_df, # input dataframe
+ centers = num.centers, # number of cluster centers to calculate
+ iter.max = 1000, # the maximum number of iterations allowed
+ nstart = 50) # the number of rows used as the random set for the initial centers (during the first iteration)
+```
+
+The resulting property values that were derived as the final cluster centers can be pulled using:
+```{r 5-4-Unsupervised-ML-19 }
+clusters$centers
+```
+
+Let's add the cluster assignments to the physicochemical data and create a new dataframe, which can then be used in a heatmap visualization to see how these physicochemical data distributions clustered according to *k*-means.
+
+These cluster assignments can be pulled from the `cluster` list output, where chemicals are designated to each cluster with either a 1 or 2. You can view these using:
+```{r 5-4-Unsupervised-ML-20 }
+clusters$cluster
+```
+
+Because these results are listed in the exact same order as the inputted dataframe, we can simply add these assignments to the `chemical_prop_df` dataframe.
+```{r 5-4-Unsupervised-ML-21 }
+dat_wclusters <- cbind(chemical_prop_df,clusters$cluster)
+colnames(dat_wclusters)[11] <- "Cluster" # renaming this new column "Cluster"
+dat_wclusters <- dat_wclusters[order(dat_wclusters$Cluster),] # sorting data by cluster assignments
+```
+
+To generate a heatmap, we need to first create a separate dataframe for the cluster assignments, ordered in the same way as the physicochemical data:
+```{r 5-4-Unsupervised-ML-22 }
+hm_cluster <- data.frame(dat_wclusters$Cluster, row.names = row.names(dat_wclusters)) # creating the dataframe
+colnames(hm_cluster) <- "Cluster" # reassigning the column name
+hm_cluster$Cluster <- as.factor(hm_cluster$Cluster) # coercing the cluster numbers into factor variables, to make the heatmap prettier
+
+head(hm_cluster) # viewing this new cluster assignment dataframe
+```
+
+We're going to go ahead and clean up the physicochemical property names to make the heatmap a bit tidier.
+```{r 5-4-Unsupervised-ML-23 }
+clean_names1 = gsub("OPERA..", "", colnames(dat_wclusters))
+# "\\." denotes a period
+clean_names2 = gsub("\\.", " ", clean_names1)
+
+# Reassigning the cleaner names back to the df
+colnames(dat_wclusters) = clean_names2
+
+# Going back to add in the apostrophe in "Henry's Law Constant"
+colnames(dat_wclusters)[3] = "Henry's Law Constant"
+```
+Then we can call this dataframe (`dat_wclusters`) into the following heatmap visualization code leveraging the `pheatmap()` function. This function was designed specifically to enable clustered heatmap visualizations. Check out the [pheatmap Documentation](https://www.rdocumentation.org/packages/pheatmap/versions/1.0.12/topics/pheatmap) for additional information.
+
+
+
+### Heatmap Visualization of the Resulting *K*-Means Clusters
+```{r 5-4-Unsupervised-ML-24, fig.height=8, fig.width=10}
+pheatmap(dat_wclusters[,1:10],
+ cluster_rows = FALSE, cluster_cols = FALSE, # no further clustering, for simplicity
+ scale = "column", # scaling the data to make differences across chemicals more apparent
+ annotation_row = hm_cluster, # calling the cluster assignment dataframe as a separate color bar
+                 annotation_names_row = FALSE, # removing the annotation name ("Cluster") from the x axis
+ angle_col = 45, fontsize_col = 7, fontsize_row = 3, # adjusting size/ orientation of axes labels
+ cellheight = 3, cellwidth = 25, # setting height and width for cells
+ border_color = FALSE # specify no border surrounding the cells
+)
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Heatmap of physicochemical properties with *k*-means cluster assignments.** Shown are the relative values for each physicochemical property labeled on the x axis. Individual chemical names are listed on the y axis. The chemicals are grouped based on their *k*-means cluster assignment as denoted by the color bar on the left."
+
+Notice that the `pheatmap()` function does not add axis or legend titles. Adding these can provide clarity; they can be added to the figure after exporting from R using MS PowerPoint or Adobe.
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: What are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
+:::
+
+:::answer
+**Answer**: Properties that show obvious differences in values between the resulting clusters include molecular weight, boiling point, negative log of acid dissociation constant, octanol air partition coefficient, and octanol water distribution coefficient.
+:::
+
+
+
+## Principal Component Analysis (PCA)
+Next, we will run through some example analyses applying the common data reduction technique of PCA. We'll start by determining how much of the variance is able to be captured within the first two principal components to answer **Environmental Health Question #3**: How do the data compare when physicochemical properties are reduced using PCA?
+
+
+We can calculate the principal components across ALL physicochemical data across all chemicals using the `prcomp()` function. Always make sure your data are centered and scaled prior to running PCA, since it is sensitive to variables having different scales.
+```{r 5-4-Unsupervised-ML-25 }
+my.pca <- prcomp(chemical_prop_df, # input dataframe of physchem data
+ scale = TRUE, center = TRUE)
+```
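+
+To see why centering and scaling matter, here is a quick standalone sketch on simulated data (separate from the module's dataset):
+
+```r
+set.seed(10)
+# Two uncorrelated variables on very different scales
+toy <- data.frame(small = rnorm(100, sd = 1),
+                  large = rnorm(100, sd = 1000))
+
+# Without scaling, the large-scale variable dominates PC1
+pca_unscaled <- prcomp(toy, center = TRUE, scale = FALSE)
+summary(pca_unscaled)$importance["Proportion of Variance", ]
+
+# With scaling, the variance is spread across components
+pca_scaled <- prcomp(toy, center = TRUE, scale = TRUE)
+summary(pca_scaled)$importance["Proportion of Variance", ]
+```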
+
+We can see how much of the variance was able to be captured in each of the eigenvectors or dimensions using a scree plot.
+```{r 5-4-Unsupervised-ML-26, fig.align='center'}
+fviz_eig(my.pca, addlabels = TRUE)
+```
+
+We can also calculate these values and pull them into a dataframe for future use. For example, to pull the percentage of variance explained by each principal component, we can run the following calculations, where first eigenvalues (eigs) are calculated and then used to calculate percent of variance per principal component:
+```{r 5-4-Unsupervised-ML-27 }
+eigs <- my.pca$sdev^2
+Comp.stats <- data.frame(eigs, eigs/sum(eigs), row.names = names(eigs))
+colnames(Comp.stats) <- c("Eigen_Values", "Percent_of_Variance")
+
+head(Comp.stats)
+```
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: How do the data compare when physicochemical properties are reduced using PCA?
+:::
+
+:::answer
+**Answer**: Principal Component 1 captures ~41% of the variance and Principal Component 2 captures ~24% of the variance in the physicochemical property values across all chemicals. Together, these two components capture ~65% of the variance in the data.
+:::
+
+
+
+Next, we'll use PCA to answer **Environmental Health Question #4**: Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component (PC1)?
+
+Here are the resulting scores for each chemical's contribution towards each principal component (shown here as components `PC1`-`PC10`).
+```{r 5-4-Unsupervised-ML-28 }
+head(my.pca$x)
+```
+
+And the resulting loading factors of each property's contribution towards each principal component.
+```{r 5-4-Unsupervised-ML-29 }
+my.pca$rotation
+```
+
+### Answer to Environmental Health Question 4
+:::question
+*With these results, we can answer **Environmental Health Question #4***: Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component (PC1)?
+:::
+
+:::answer
+**Answer**: Boiling point contributes the most towards principal component 1, as it has the largest loading magnitude (0.464).
+:::
+
+
+
+
+### Visualizing PCA Results
+
+Let's turn our attention to **Environmental Health Question #5**: If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
+
+We can start by answering this question by visualizing the first two principal components and coloring each chemical according to class (i.e. PFAS vs statins).
+```{r 5-4-Unsupervised-ML-30, fig.align='center'}
+ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = dat$List)) +
+ geom_point(size = 2) + theme_bw() +
+ ggtitle('Version C: PCA Plot of the First 2 PCs, colored by Chemical Class') +
+ # it's good practice to put the percentage of the variance captured in the axes titles
+ xlab("Principal Component 1 (40.9%)") + ylab("Principal Component 2 (23.8%)")
+```
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can answer **Environmental Health Question #5***: If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to inform whether a chemical is more similar to a PFAS or a statin?
+:::
+
+:::answer
+**Answer**: Yes. The data become more compressed, with the reduced variables across the principal components capturing the majority of the variance from the original dataset (~65%). This results in improved data visualizations, where all dimensions of the physicochemical dataset are compressed and captured across the displayed components. In addition, the figure above shows a clear separation between PFAS and statin chemicals when visualizing the reduced dataset.
+:::
+
+
+
+## Incorporating *K*-Means into PCA for Predictive Modeling
+
+We can also identify cluster-based trends within data that are reduced after running PCA. This example analysis does so, expanding upon the previously generated PCA results.
+
+### Estimate *K*-Means Clusters from PCA Results
+
+Let's first run code similar to the previous *k*-means analysis with the same associated parameters, though here we will instead use the data-reduced values from the PCA. Specifically, clusters will be derived across the PCA "scores" values, where scores represent the relative amount each chemical contributed to each principal component.
+```{r 5-4-Unsupervised-ML-31 }
+clusters_PCA <- kmeans(my.pca$x, centers = num.centers, iter.max = 1000, nstart = 50)
+```
+
+The resulting PCA score values that were derived as the final cluster centers can be pulled using:
+```{r 5-4-Unsupervised-ML-32 }
+clusters_PCA$centers
+```
+
+Viewing the final cluster assignment per chemical:
+```{r 5-4-Unsupervised-ML-33 }
+head(cbind(rownames(chemical_prop_df),clusters_PCA$cluster))
+```
+
+
+
+#### Visualizing *K*-Means Clusters from PCA Results
+
+Let's now view, again, the results of the main PCA focusing on the first two principal components; though this time let's color each chemical according to *k*-means cluster.
+```{r 5-4-Unsupervised-ML-34, fig.align='center'}
+ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = as.factor(clusters_PCA$cluster))) +
+ geom_point(size = 2) + theme_bw() +
+ ggtitle('Version D: PCA Plot of the First 2 PCs, colored by k-means Clustering') +
+  # it's good practice to put the percentage of the variance captured in the axes titles
+ xlab("Principal Component 1 (40.9%)") + ylab("Principal Component 2 (23.8%)")
+```
+
+Let's put these two PCA plots side by side to compare them more easily. We'll also tidy up the figures a bit so they're closer to publication-ready.
+```{r 5-4-Unsupervised-ML-35, fig.align='center', fig.width = 20, fig.height = 6, fig.retina= 3}
+# PCA plot colored by chemical class
+pcaplot1 = ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = dat$List)) +
+ geom_point(size = 2) +
+
+ theme_light() +
+ theme(axis.text = element_text(size = 9), # changing size of axis labels
+ axis.title = element_text(face = "bold", size = rel(1.3)), # changes axis titles
+ legend.title = element_text(face = 'bold', size = 10), # changes legend title
+ legend.text = element_text(size = 9)) + # changes legend text
+
+ labs(x = 'Principal Component 1 (40.9%)', y = 'Principal Component 2 (23.8%)',
+ color = "Chemical Class") # changing axis labels
+
+# PCA Plot by k means clusters
+pcaplot2 = ggplot(data.frame(my.pca$x), aes(x = PC1, y = PC2, color = as.factor(clusters_PCA$cluster))) +
+ geom_point(size = 2) +
+
+ theme_light() +
+ theme(axis.text = element_text(size = 9), # changing size of axis labels
+ axis.title = element_text(face = "bold", size = rel(1.3)), # changes axis titles
+ legend.text = element_text(size = 9)) + # changes legend text
+
+ labs(x = 'Principal Component 1 (40.9%)', y = 'Principal Component 2 (23.8%)',
+ color = expression(bold(bolditalic(K)-Means~Cluster))) # changing axis labels
+
+# Creating 1 figure
+plot_grid(pcaplot1, pcaplot2,
+          # Adding labels, changing their size and position
+ labels = "AUTO", label_size = 15, label_x = 0.03)
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Principal Component Analysis (PCA) plots highlight similarities between chemical class and *k*-means clusters.** These PCA plots are based on physicochemical properties and compare (A) chemical class categories and (B) the *k*-means derived cluster assignments."
+
+### Answer to Environmental Health Question 6
+:::question
+*With this we can answer **Environmental Health Question #6***: What kinds of applications/endpoints can be better understood and/or predicted because of these derived chemical groupings?
+:::
+
+:::answer
+**Answer**: With these well-informed chemical groupings, we can now better understand the variables that contribute to the chemical classifications. We can also use this information to better understand data trends and predict environmental fate and transport for these chemicals. The reduced variables derived through PCA and/or the *k*-means clustering patterns can also be used as input variables to predict toxicological outcomes.
+:::
+
+
+
+## Concluding Remarks
+In conclusion, this training module provides an example exercise on organizing physicochemical data and analyzing trends within these data to determine chemical groupings. Results are compared from those produced using just the original data vs. clustered data from *k*-means vs. reduced data from PCA. These methods represent common tools that are used in high dimensional data analyses within the field of environmental health sciences.
+
+### Additional Resources
++ [Detailed study of Principal Component Analysis](https://f0nzie.github.io/machine_learning_compilation/detailed-study-of-principal-component-analysis.html)
++ [Practical Guide to Cluster Analysis in R](https://xsliulab.github.io/Workshop/2021/week10/r-cluster-book.pdf)
+
+
+
+
+
+:::tyk
+In this training module, we presented an unsupervised machine learning example in which *k*-means clusters were defined based on chemical class, with *k* = 2. Oftentimes, analyses are conducted to explore potential clustering relationships without a preexisting idea of what *k*, or the number of clusters, should be. In this test your knowledge section, we'll go through an example like that.
+
+Using the accompanying flame retardant and pesticide physicochemical property variables found in the file ("Module5_4_TYKInput.csv"), answer the following questions:
+
+1. What are some of the physicochemical properties that seem to be driving chemical clustering patterns derived through *k*-means?
+2. Upon reducing the data through PCA, which physicochemical property contributes the most towards informing data variance captured in the primary principal component?
+3. If we did not have information telling us which chemical belonged to which class, could we use PCA and *k*-means to accurately predict whether a chemical is a flame retardant or a pesticide?
+:::
diff --git a/Chapter_5/Module5_4_Input/Module5_4_InputData.csv b/Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML_Data.csv
similarity index 100%
rename from Chapter_5/Module5_4_Input/Module5_4_InputData.csv
rename to Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML_Data.csv
diff --git a/Chapter_5/Module5_4_Input/Module5_4_Image1.png b/Chapter_5/5_4_Unsupervised_ML/Module5_4_Image1.png
similarity index 100%
rename from Chapter_5/Module5_4_Input/Module5_4_Image1.png
rename to Chapter_5/5_4_Unsupervised_ML/Module5_4_Image1.png
diff --git a/Chapter_5/Module5_4_Input/Module5_4_Image2.png b/Chapter_5/5_4_Unsupervised_ML/Module5_4_Image2.png
similarity index 100%
rename from Chapter_5/Module5_4_Input/Module5_4_Image2.png
rename to Chapter_5/5_4_Unsupervised_ML/Module5_4_Image2.png
diff --git a/Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2.Rmd b/Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2.Rmd
new file mode 100644
index 0000000..5de9c5a
--- /dev/null
+++ b/Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2.Rmd
@@ -0,0 +1,497 @@
+
+# 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications
+
+This training module was developed by Alexis Payton, Lauren E. Koval, David M. Reif, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+The previous module, **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**, served as an introduction to unsupervised machine learning (ML). **Unsupervised ML** involves training a model on a dataset lacking ground truths or response variables. In the previous module, the number of clusters was selected based on prior information (i.e., chemical class), but what if you're in a situation where you don't know how many clusters to investigate a priori? This commonly occurs in the field of environmental health research, particularly when investigators want to take a more unbiased view of their data and/or do not have information that can be used to inform the optimal number of clusters to select. In these instances, unsupervised ML techniques can be very helpful, and in this module, we'll explore the following concepts to further understand unsupervised ML:
+
++ *K*-Means and hierarchical clustering
++ Deriving the optimal number of clusters
++ Visualizing clusters through a PCA-based plot, dendrograms, and heatmaps
++ Determining each variable's contribution to the clusters
+
+
+
+
+## *K*-Means Clustering
+
+As mentioned in the previous module, *k*-means is a common clustering algorithm used to partition quantitative data. This algorithm works by first randomly placing a pre-specified number of cluster centroids, *k*, across the data space. When using a standard Euclidean distance metric, the distance is calculated from each observation to each centroid, and the observation is assigned to the cluster of the closest centroid. After all observations have been assigned to one of the *k* clusters, the average of all observations in a cluster is calculated, and the centroid for the cluster is moved to the location of the mean. The process then repeats, with the distance computed between the observations and the updated centroids. Observations may be reassigned to the same cluster or moved to a different cluster if they are closer to another centroid. These iterations continue until cluster assignments no longer change, resulting in the final cluster assignments that are then carried forward for analysis/interpretation.
+
+Helpful resources on *k*-means clustering include the following: [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf) &
+[Towards Data Science](https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a).
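+
+The iterative procedure above can be sketched with base R's `kmeans()` function; the following is a standalone example on simulated data, separate from this module's dataset:
+
+```r
+set.seed(10)
+# Simulate two well-separated groups of 2D points
+toy_data <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
+                  matrix(rnorm(50, mean = 5), ncol = 2))
+
+# kmeans() alternates between assigning points to the nearest
+# centroid and moving each centroid to its cluster mean
+toy_km <- kmeans(toy_data, centers = 2, nstart = 25)
+
+toy_km$centers        # final centroid locations
+table(toy_km$cluster) # number of points assigned to each cluster
+```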
+
+
+
+## Hierarchical Clustering
+**Hierarchical clustering** groups objects into clusters by repetitively joining similar observations until there is one large cluster (aka agglomerative or bottom-up) or repetitively splitting one large cluster until each observation stands alone (aka divisive or top-down). Regardless of whether agglomerative or divisive hierarchical clustering is used, the results can be visually represented in a tree-like figure called a dendrogram. The dendrogram below is based on the `USArrests` dataset available in R. The dataset contains statistics on violent crime rates (murder, assault, and rape) per capita (per 100,000 residents) for each state in the United States in 1973. For more information on the `USArrests` dataset, check out its associated [RDocumentation](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/USArrests).
+
+```{r 5-5-Unsupervised-ML-2-1, fig.align = 'center', echo=FALSE, out.width = "55%"}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image1.png")
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Hierarchical clustering of states based on violent crime.** The dendrogram uses violent crime data including murder, assault, and rape rates per 100,000 residents for each state in 1973."
+
+Takeaways from this dendrogram:
+
++ The 50 states can be grouped into 4 clusters based on violent crime statistics from 1973
++ The dendrogram can only show us clusters of states but not the data trends that led to the clustering patterns that we see. Yes, it is useful to know which states have similar violent crime patterns overall, but it is important to pinpoint the variables (i.e., murder, assault, and rape) that are responsible for the clustering patterns we're seeing. This idea will be explored later in the module with an environmentally relevant dataset.
+
+Going back to hierarchical clustering, during the repetitive splitting or joining of observations, the similarity between existing clusters is calculated after each iteration. This value informs the formation of subsequent clusters. Different methods, or linkage functions, can be considered when calculating this similarity, particularly for agglomerative clustering, which is often the preferred approach. Some example methods include:
+
++ **Complete Linkage**: the maximum distance between two data points located in separate clusters.
++ **Single Linkage**: the minimum distance between two data points located in separate clusters.
++ **Average Linkage**: the average pairwise distance between all pairs of data points in separate clusters.
++ **Centroid Linkage**: the distance between the centroids or centers of each cluster.
++ **Ward Linkage**: seeks to minimize the variance between clusters.
+
+Each method has its advantages and disadvantages and more information on all distance calculations between clusters can be found at the following resource: [Hierarchical Clustering](https://www.learndatasci.com/glossary/hierarchical-clustering/#Hierarchicalclusteringtypes).
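+
+As a brief standalone sketch, these linkage functions map to the `method` argument of base R's `hclust()` function, illustrated here with the `USArrests` dataset referenced above:
+
+```r
+# Scale the crime variables so no single variable dominates the distances
+arrests_scaled <- scale(USArrests)
+arrests_dist <- dist(arrests_scaled) # Euclidean distance matrix
+
+# Agglomerative clustering under two different linkage functions
+hc_complete <- hclust(arrests_dist, method = "complete")
+hc_ward <- hclust(arrests_dist, method = "ward.D2")
+
+# plot(hc_complete) draws the dendrogram, and cutree()
+# extracts cluster assignments at a chosen number of clusters
+table(cutree(hc_complete, k = 4))
+```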
+
+
+### Deriving the Optimal Number of Clusters
+Before clustering can be performed, the function needs to be informed of the number of clusters to group the objects into. In the previous module, an example was explored to see if *k*-means clustering would group the chemicals similarly to their chemical class (either a PFAS or statin). Therefore, we told the *k*-means function to cluster into 2 groups. In situations where there is little to no prior knowledge regarding the "correct" number of clusters to specify, methods exist for deriving the optimal number of clusters. Three common methods to find the optimal *k*, or number of clusters, for both *k*-means and hierarchical clustering are the **elbow method**, the **silhouette method**, and the **gap statistic method**. These techniques help us determine the optimal *k* through visual inspection.
+
++ **Elbow Method**: uses a plot of the within cluster sum of squares (WCSS) on the y axis and different values of *k* on the x axis. The location where we no longer observe a significant reduction in WCSS, or where an "elbow" can be seen, is the optimal *k* value. As we can see, after a certain point, having more clusters does not lead to a significant reduction in WCSS.
+```{r 5-5-Unsupervised-ML-2-2, fig.align = 'center', out.width = "75%", echo=FALSE}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image2.png")
+```
+
+Looking at the figures above, the elbow point is much clearer in the first plot than in the second; however, elbow curves from real-world datasets typically resemble the second figure. This is why it's recommended to consider more than one method to determine the optimal number of clusters.
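+
+The WCSS curve underlying the elbow method can be computed in a few lines of base R; the following is a standalone sketch using `USArrests` rather than this module's data:
+
+```r
+set.seed(10)
+arrests_scaled <- scale(USArrests)
+
+# Total within-cluster sum of squares (WCSS) for k = 1 through 10
+wcss <- sapply(1:10, function(k) {
+  kmeans(arrests_scaled, centers = k, nstart = 25)$tot.withinss
+})
+
+# Plotting WCSS against k reveals the "elbow"
+plot(1:10, wcss, type = "b", xlab = "Number of clusters (k)",
+     ylab = "Within-cluster sum of squares")
+```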
+
++ **Silhouette Method**: uses a plot of the average silhouette width (score) on the y axis and different values of *k* on the x axis. The silhouette score is a measure of each object's similarity to its own cluster and dissimilarity to other clusters. The location where the average silhouette width is *maximized* is the optimal *k* value.
+```{r 5-5-Unsupervised-ML-2-3, fig.align = 'center', out.width = "65%", echo=FALSE}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image3.png")
+```
+
+Based on the figure above, the optimal number of clusters is 2 using the silhouette method.
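+
+Average silhouette widths can likewise be computed directly with the `cluster` package (distributed with R); the following is a standalone sketch using `USArrests` rather than this module's data:
+
+```r
+library(cluster) # provides silhouette()
+
+set.seed(10)
+arrests_scaled <- scale(USArrests)
+
+# Average silhouette width for k = 2 through 6
+avg_sil <- sapply(2:6, function(k) {
+  km <- kmeans(arrests_scaled, centers = k, nstart = 25)
+  mean(silhouette(km$cluster, dist(arrests_scaled))[, "sil_width"])
+})
+names(avg_sil) <- 2:6
+avg_sil # the k that maximizes the average width is the optimal choice
+```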
+
++ **Gap Statistic Method**: uses a plot of the gap statistic on the y axis and different values of *k* on the x axis. The gap statistic evaluates the intracluster variation in comparison to expected values derived from a Monte Carlo generated, null reference data distribution for varying values of *k*. The optimal number of clusters is the smallest value where the gap statistic of *k* is greater than or equal to the gap statistic of *k*+1 minus the standard deviation of *k*+1. More details can be found [here](https://uc-r.github.io/kmeans_clustering#:~:text=The%20gap%20statistic%20compares%20the,simulations%20of%20the%20sampling%20process.).
+```{r 5-5-Unsupervised-ML-2-4, fig.align = 'center', out.width = "65%", echo=FALSE}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image4.png")
+```
+
+Based on the figure above, the optimal number of clusters is 2 using the gap statistic method.
+
+For additional information and code on all three methods, check out [Determining the Optimal Number of Clusters: 3 Must Know Methods](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/). It is also worth mentioning that while these methods are useful, further interpreting the result in the context of your problem can be beneficial, for example, checking whether clusters make biological sense when working with a genomic dataset.
+
+
+
+
+## Introduction to Example Data
+
+We will apply these techniques using an example dataset from a previously published study where 22 cytokine concentrations were derived from 44 subjects with varying smoking statuses (14 non-smokers, 17 e-cigarette users, and 13 cigarette smokers) from 4 different sampling regions in the body. These samples were derived from nasal lavage fluid (NLF), nasal epithelial lining fluid (NELF), sputum, and serum as pictured below. Samples were taken from different regions in the body to compare cytokine expression in the upper respiratory tract, lower respiratory tract, and systemic circulation.
+```{r 5-5-Unsupervised-ML-2-5, fig.align = 'center', out.width = "75%", echo=FALSE}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image5.png")
+```
+
+A research question that we had was "Does cytokine expression change based on a subject's smoking habits? If so, does cigarette smoke or e-cigarette vapor induce cytokine suppression or proliferation?" Traditionally, these questions would have been answered by analyzing each biomarker individually using a two-group comparison test like a t-test (which we completed in this study). However, biomarkers do not work in isolation in the body, suggesting that individual biomarker statistical approaches may not capture the full biological responses occurring. Therefore, we used a clustering approach to group cytokines in an attempt to more closely simulate the interactions that occur *in vivo*. From there, statistical tests were run to assess the effects of smoking status on each cluster.
+
+For the purposes of this training exercise, we will focus solely on the nasal epithelial lining fluid, or NELF, samples. In addition, we'll use *k*-means and hierarchical clustering to compare how cytokines cluster at baseline. Full methods are further described in the publication below:
+
++ Payton AD, Perryman AN, Hoffman JR, Avula V, Wells H, Robinette C, Alexis NE, Jaspers I, Rager JE, Rebuli ME. Cytokine signature clusters as a tool to compare changes associated with tobacco product use in upper and lower airway samples. American Journal of Physiology-Lung Cellular and Molecular Physiology 2022 322:5, L722-L736. PMID: [35318855](https://journals.physiology.org/doi/abs/10.1152/ajplung.00299.2021)
+
+Let's read in and view the dataset we'll be working with.
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 5-5-Unsupervised-ML-2-6 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you
+```{r 5-5-Unsupervised-ML-2-7, message=FALSE}
+if (!requireNamespace("vegan"))
+  install.packages("vegan")
+if (!requireNamespace("ggrepel"))
+  install.packages("ggrepel")
+if (!requireNamespace("dendextend"))
+  install.packages("dendextend")
+if (!requireNamespace("ggsci"))
+  install.packages("ggsci")
+if (!requireNamespace("FactoMineR"))
+  install.packages("FactoMineR")
+```
+
+#### Loading required R packages
+```{r 5-5-Unsupervised-ML-2-8, message=FALSE}
+library(readxl)
+library(factoextra)
+library(FactoMineR)
+library(tidyverse)
+library(vegan)
+library(ggrepel)
+library(reshape2)
+library(pheatmap)
+library(ggsci)
+suppressPackageStartupMessages(library(dendextend))
+```
+
+#### Set your working directory
+```{r 5-5-Unsupervised-ML-2-9, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Importing example dataset
+
+Then let's read in our example dataset. As mentioned in the introduction, this example dataset contains cytokine concentrations derived from 44 subjects. Let's import and view these data:
+```{r 5-5-Unsupervised-ML-2-10 }
+# Reading in file
+cytokines_df <- data.frame(read_excel("Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2_Data.xlsx", sheet = 2))
+
+# Viewing data
+head(cytokines_df)
+```
+
+These data contain the following information:
+
++ `Original_Identifier`: initial identifier given to each subject by our wet bench colleagues
++ `Group`: denotes the smoking status of the subject ("NS" = "non-smoker", "Ecig" = "E-cigarette user", "CS" = "cigarette smoker")
++ `SubjectNo`: ordinal subject number assigned to each subject after the dataset was wrangled (1-44)
++ `SubjectID`: unique subject identifier that combines the group and subject number
++ `Compartment`: region of the body from which the sample was taken ("NLF" = "nasal lavage fluid sample", "NELF" = "nasal epithelium lining fluid sample", "Sputum" = "induced sputum sample", "Serum" = "blood serum sample")
++ `Protein`: cytokine name
++ `Conc`: concentration (pg/mL)
++ `Conc_pslog2`: pseudo-log~2~ concentration
+
+Now that the data has been read in, we can start by asking some initial questions about the data.
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+1. What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelium fluid (NELF) in non-smokers using *k*-means clustering?
+2. After selecting a cluster number, which cytokines were assigned to each *k*-means cluster?
+3. What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelium fluid (NELF) in non-smokers using hierarchical clustering?
+4. How do the hierarchical cluster assignments compare to the *k*-means cluster assignments?
+5. Which cytokines have the greatest contributions to the first two eigenvectors?
+
+To answer the first environmental health question, let's start by filtering to include only NELF derived samples and non-smokers.
+```{r 5-5-Unsupervised-ML-2-11 }
+baseline_df <- cytokines_df %>%
+ filter(Group == "NS", Compartment == "NELF")
+
+head(baseline_df)
+```
+
+The functions we use will require us to cast the data wider. We will accomplish this using the `dcast()` function from the *reshape2* package.
+```{r 5-5-Unsupervised-ML-2-12 }
+wider_baseline_df <- reshape2::dcast(baseline_df, Protein ~ SubjectID, value.var = "Conc_pslog2") %>%
+ column_to_rownames("Protein")
+
+head(wider_baseline_df)
+```
+
+Now we can determine the optimal *k* using the `fviz_nbclust()` function, which provides suggestions based on the elbow, silhouette, and gap statistic methods. We can use this code for both *k*-means and hierarchical clustering by changing the `FUNcluster` parameter. Let's start with *k*-means:
+```{r 5-5-Unsupervised-ML-2-13, fig.align = 'center'}
+# Elbow method
+fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "wss") +
+ labs(subtitle = "Elbow method")
+
+# Silhouette method
+fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "silhouette") +
+ labs(subtitle = "Silhouette method")
+
+# Gap statistic method
+fviz_nbclust(wider_baseline_df, FUNcluster = kmeans, method = "gap_stat") +
+  labs(subtitle = "Gap Statistic method")
+```
+
+The elbow method suggests 2 or 3 clusters, the silhouette method suggests 2, and the gap statistic method suggests 1. Since these methods recommend different *k* values, we can run *k*-means with each candidate *k* and visualize the resulting clusters. *K*-means clusters will be visualized using the `fviz_cluster()` function.
+```{r 5-5-Unsupervised-ML-2-14, fig.align = 'center'}
+# Choosing to iterate through 2 or 3 clusters using i as our iterator
+for (i in 2:3){
+  # nstart = number of random starting partitions; nstart > 1 is recommended
+ cluster_k <- kmeans(wider_baseline_df, centers = i, nstart = 25)
+ cluster_plot <- fviz_cluster(cluster_k, data = wider_baseline_df) + ggtitle(paste0("k = ", i))
+ print(cluster_plot)
+}
+```
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelium fluid (NELF) in non-smokers using *k*-means clustering?
+:::
+
+:::answer
+**Answer**: Either 2 or 3 clusters can be justified here based on the elbow and silhouette methods, and the choice can also be informed by whether *k*-means groups together cytokines implicated in similar biological pathways. In the final paper, we moved forward with 3 clusters, because it was justifiable from these methods and provided more granularity in the clusters.
+:::
+
+The final cluster assignments can easily be obtained using the `kmeans()` function from the *stats* package.
+```{r 5-5-Unsupervised-ML-2-15 }
+cluster_kmeans_3 <- kmeans(wider_baseline_df, centers = 3, nstart = 25)
+cluster_kmeans_df <- data.frame(cluster_kmeans_3$cluster) %>%
+ rownames_to_column("Cytokine") %>%
+ rename(`K-Means Cluster` = cluster_kmeans_3.cluster) %>%
+ # Ordering the dataframe for easier comparison
+ arrange(`K-Means Cluster`)
+
+cluster_kmeans_df
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: After selecting a cluster number, which cytokines were assigned to each *k*-means cluster?
+:::
+
+:::answer
+**Answer**: After choosing the number of clusters to be 3, the cluster assignments are as follows:
+```{r 5-5-Unsupervised-ML-2-16, fig.align = 'center', echo=FALSE}
+knitr::include_graphics("Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image7.png")
+```
+:::
+
+
+
+## Hierarchical Clustering
+
+Next, we'll turn our attention to answering environmental health questions 3 and 4: What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelium fluid (NELF) in non-smokers using hierarchical clustering? How do the hierarchical cluster assignments compare to the *k*-means cluster assignments?
+
+Just as we used the elbow method, silhouette profile, and gap statistic to determine the optimal number of clusters for *k*-means, we can leverage the same approaches for hierarchical clustering by changing the `FUNcluster` parameter.
+```{r 5-5-Unsupervised-ML-2-17, fig.align = 'center'}
+# Elbow method
+fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "wss") +
+ labs(subtitle = "Elbow method")
+
+# Silhouette method
+fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "silhouette") +
+ labs(subtitle = "Silhouette method")
+
+# Gap statistic method
+fviz_nbclust(wider_baseline_df, FUNcluster = hcut, method = "gap_stat") +
+  labs(subtitle = "Gap Statistic method")
+```
+
+We can see the results are quite similar, with 2-3 clusters appearing optimal.
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: What is the optimal number of clusters the cytokines can be grouped into that were derived from nasal epithelium fluid (NELF) in non-smokers using hierarchical clustering?
+:::
+
+:::answer
+**Answer**: Again, 2 or 3 clusters can be justified here, but for the same reasons mentioned for the first environmental health question, we landed on 3 clusters.
+:::
+
+Now we can perform the clustering, then visualize and extract the results. We'll first scale the data, then use the `dist()` function to calculate the Euclidean distance between cytokines, followed by the `hclust()` function to obtain the hierarchical clustering assignments.
+```{r 5-5-Unsupervised-ML-2-18 }
+# Viewing the wider dataframe we'll be working with
+head(wider_baseline_df)
+```
+
+
+```{r 5-5-Unsupervised-ML-2-19 }
+# First scaling the data within each subject (down columns)
+scaled_df <- data.frame(apply(wider_baseline_df, 2, scale))
+rownames(scaled_df) = rownames(wider_baseline_df)
+
+head(scaled_df)
+```
+
+The `dist()` function is first used to calculate the Euclidean distance between each pair of cytokines. Next, the `hclust()` function runs the hierarchical clustering analysis, using complete linkage by default; this can be changed through the `method` parameter.
+```{r 5-5-Unsupervised-ML-2-20 }
+# Calculating Euclidean distance
+dist_matrix <- dist(scaled_df, method = 'euclidean')
+
+# Hierarchical clustering
+cytokines_hc <- hclust(dist_matrix)
+```
+
+Now we can generate a dendrogram to help us evaluate the results using the `fviz_dend()` function from the *factoextra* package. We use k=3 to be consistent with the *k*-means analysis.
+```{r 5-5-Unsupervised-ML-2-21, fig.align = 'center', out.width = "75%", warning=FALSE}
+ fviz_dend(cytokines_hc, k = 3, # Specifying k
+ cex = 0.85, # Label size
+ palette = "futurama", # Color palette see ?ggpubr::ggpar
+ rect = TRUE, rect_fill = TRUE, # Add rectangle around groups
+          horiz = TRUE, # Changes the orientation of the dendrogram
+ rect_border = "futurama", # Rectangle color
+ labels_track_height = 0.8 # Changes the room for labels
+ )
+```
+
+We can also extract those cluster assignments using the `cutree()` function from the *stats* package.
+```{r 5-5-Unsupervised-ML-2-22 }
+hc_assignments_df <- data.frame(cutree(cytokines_hc, k = 3)) %>%
+ rownames_to_column("Cytokine") %>%
+ rename(`Hierarchical Cluster` = cutree.cytokines_hc..k...3.) %>%
+ # Ordering the dataframe for easier comparison
+ arrange(`Hierarchical Cluster`)
+
+# Combining the dataframes to compare the cluster assignments from each approach
+comp <- full_join(cluster_kmeans_df, hc_assignments_df, by = "Cytokine")
+
+comp
+```
+
+For additional resources on running hierarchical clustering in R, see [Visualizing Clustering Dendrogram in R](https://agroninfotech.blogspot.com/2020/06/visualizing-clusters-in-r-hierarchical.html) and [Hierarchical Clustering on Principal Components](http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/).
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can answer **Environmental Health Question #4***: How do the hierarchical cluster assignments compare to the *k*-means cluster assignments?
+:::
+
+:::answer
+**Answer**: Though this may not always be the case, in this instance, we see that *k*-means and hierarchical clustering with k=3 clusters yield the same groupings despite the clusters being presented in a different order.
+:::
+
+
+
+## Clustering Plot
+
+One additional way to visualize clustering is to plot the first two principal components on the axes and color the data points based on their corresponding cluster. This visualization can be used for both *k*-means and hierarchical clustering using the `fviz_cluster()` function. This figure is essentially a PCA plot with shapes drawn around each cluster to make them distinct from each other.
+```{r 5-5-Unsupervised-ML-2-23, fig.align = 'center', fig.height=5.5, fig.width=6}
+fviz_cluster(cluster_kmeans_3, data = wider_baseline_df)
+```
+
+Rather than using the `fviz_cluster()` function as shown in the figure above, we'll extract the underlying data to recreate the same figure using `ggplot()`. For the manuscript this was necessary, since it was important to facet the plots for each compartment (i.e., NLF, NELF, sputum, and serum). For a single plot, this data extraction isn't required, and the figure above can be further customized within the `fviz_cluster()` function. However, we'll go through the steps of obtaining the indices needed to recreate the same polygons in `ggplot()` directly.
+
+Note that `fviz_cluster()` uses principal component analysis (PCA) to reduce the dataset's dimensionality to two dimensions before plotting the cluster assignments. Therefore, to obtain the coordinates of each cytokine within its respective cluster, PCA will need to be run first.
+```{r 5-5-Unsupervised-ML-2-24 }
+# First running PCA
+pca_cytokine <- prcomp(wider_baseline_df, scale = TRUE, center = TRUE)
+# Only need PC1 and PC2 for plotting, so selecting the first two columns
+baseline_scores_df <- data.frame(scores(pca_cytokine)[,1:2])
+baseline_scores_df$Cluster <- cluster_kmeans_3$cluster
+baseline_scores_df$Protein <- rownames(baseline_scores_df)
+
+# Changing cluster to a character for plotting
+baseline_scores_df$Cluster = as.character(baseline_scores_df$Cluster)
+
+head(baseline_scores_df)
+```
+
+Within each cluster, the `chull()` function is used to compute the indices of the points on the convex hull. These are needed for `ggplot()` to create the polygon shapes of each cluster.
+```{r 5-5-Unsupervised-ML-2-25 }
+# hull values for cluster 1
+cluster_1 <- baseline_scores_df[baseline_scores_df$Cluster == 1, ][chull(baseline_scores_df %>%
+ filter(Cluster == 1)),]
+# hull values for cluster 2
+cluster_2 <- baseline_scores_df[baseline_scores_df$Cluster == 2, ][chull(baseline_scores_df %>%
+ filter(Cluster == 2)),]
+# hull values for cluster 3
+cluster_3 <- baseline_scores_df[baseline_scores_df$Cluster == 3, ][chull(baseline_scores_df %>%
+ filter(Cluster == 3)),]
+all_hulls_baseline <- rbind(cluster_1, cluster_2, cluster_3)
+# Changing cluster to a character for plotting
+all_hulls_baseline$Cluster = as.character(all_hulls_baseline$Cluster)
+
+head(all_hulls_baseline)
+```
+
+Now plotting the clusters using `ggplot()`.
+```{r 5-5-Unsupervised-ML-2-26, fig.align = 'center', fig.height=5.5, fig.width=6}
+ggplot() +
+ geom_point(data = baseline_scores_df, aes(x = PC1, y = PC2, color = Cluster, shape = Cluster), size = 4) +
+ # Adding cytokine names
+ geom_text_repel(data = baseline_scores_df, aes(x = PC1, y = PC2, color = Cluster, label = Protein),
+ show.legend = FALSE, size = 4.5) +
+ # Creating polygon shapes of the clusters
+ geom_polygon(data = all_hulls_baseline, aes(x = PC1, y = PC2, group = as.factor(Cluster), fill = Cluster,
+ color = Cluster), alpha = 0.25, show.legend = FALSE) +
+
+ theme_light() +
+  theme(axis.text.x = element_text(vjust = 0.5), #adjusting vertical justification of x axis labels
+ axis.line = element_line(colour="black"), #making x and y axes black
+ axis.text = element_text(size = 13), #changing size of x axis labels
+ axis.title = element_text(face = "bold", size = rel(1.7)), #changes axis titles
+ legend.title = element_text(face = 'bold', size = 17), #changes legend title
+ legend.text = element_text(size = 14), #changes legend text
+ legend.position = 'bottom', # moving the legend to the bottom
+ legend.background = element_rect(colour = 'black', fill = 'white', linetype = 'solid'), #changes the legend background
+ strip.text.x = element_text(size = 18, face = "bold"), #changes size of facet x axis
+ strip.text.y = element_text(size = 18, face = "bold")) + #changes size of facet y axis
+ xlab('Dimension 1 (85.1%)') + ylab('Dimension 2 (7.7%)') + #changing axis labels
+
+ # Using colors from the startrek palette from ggsci
+ scale_color_startrek(name = 'Cluster') +
+ scale_fill_startrek(name = 'Cluster')
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. *K*-means clusters of cytokines at baseline.** Cytokine samples were derived from nasal epithelium lining fluid (NELF) samples in 14 non-smoking subjects. Cytokine concentration values were transformed using a data reduction technique known as Principal Component Analysis (PCA). The first two eigenvectors plotted on the axes were able to capture a majority of the variance across all samples from the original dataset."
+
+Takeaways from this clustering plot:
+
++ PCA was able to capture almost all (~93%) of the variance from the original dataset
++ The 22 cytokines were able to be clustered into 3 distinct clusters using *k*-means
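+
+The variance percentages hard-coded into the axis labels above can instead be pulled directly from the `pca_cytokine` object, which helps avoid transcription errors if the data change:
+```{r 5-5-Unsupervised-ML-2-26b}
+# Proportion of variance explained by the first two principal components
+var_explained <- summary(pca_cytokine)$importance["Proportion of Variance", 1:2]
+var_explained
+
+# Total variance captured by the first two PCs
+sum(var_explained)
+```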
+
+
+
+## Hierarchical Clustering Visualization
+
+We can also build a heatmap using the `pheatmap()` function that has the capability to display hierarchical clustering dendrograms. To do so, we'll need to go back and use the `wider_baseline_df` dataframe.
+```{r 5-5-Unsupervised-ML-2-27, fig.align = 'center', fig.height=7, fig.width=8}
+pheatmap(wider_baseline_df,
+         cluster_cols = FALSE, # clustering cytokines (rows) only, keeping subject columns in order
+         scale = "column", # scaling the data within each subject to make differences across cytokines more apparent
+         cutree_rows = 3, # adds a space between the 3 largest clusters
+         display_numbers = TRUE, number_color = "black", fontsize = 12, # displaying the scaled concentration values
+ angle_col = 45, fontsize_col = 12, fontsize_row = 12, # adjusting size/ orientation of axes labels
+ cellheight = 17, cellwidth = 30 # setting height and width for cells
+)
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Hierarchical clustering of cytokines at baseline.** Cytokine samples were derived from nasal epithelium lining fluid (NELF) samples in 14 non-smoking subjects. The heatmap visualizes pseudo-log~2~ cytokine concentrations that were scaled within each subject."
+
+It may be helpful to add axis titles, such as "Subject ID" for the x axis, "Cytokine" for the y axis, and "Scaled pslog~2~ Concentration" for the legend, after exporting from R, because the `pheatmap()` function does not have the functionality to add those titles.
+
+Nevertheless, let's identify some key takeaways from this heatmap:
+
++ The 22 cytokines were able to be clustered into 3 distinct clusters using hierarchical clustering
++ These clusters are based on cytokine concentration levels with the first cluster having the highest expression, the second cluster having the lowest expression, and the last cluster having average expression
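+
+As a quick numeric check of these interpretations, we can average the scaled concentrations within each hierarchical cluster, reusing the `scaled_df` and `cytokines_hc` objects created earlier:
+```{r 5-5-Unsupervised-ML-2-27b, message=FALSE}
+# Mean scaled concentration per hierarchical cluster
+data.frame(Cluster = cutree(cytokines_hc, k = 3),
+           Mean_Scaled_Conc = rowMeans(scaled_df)) %>%
+  group_by(Cluster) %>%
+  summarize(Mean_Scaled_Conc = mean(Mean_Scaled_Conc))
+```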
+
+
+
+## Variable Contributions
+To answer our final environmental health question ("Which cytokines have the greatest contributions to the first two eigenvectors?"), we'll use the `fviz_contrib()` function, which plots the percentage of each variable's contribution to the selected principal component(s). It also displays a red dashed line; variables that fall above this line are considered to have significant contributions to those principal components. For a refresher on PCA and variable contributions, see the previous module, **TAME 2.0 Module 5.4 Unsupervised Machine Learning**.
+```{r 5-5-Unsupervised-ML-2-28, fig.align = 'center'}
+# Cytokine contributions to the first two principal components
+fviz_contrib(pca_cytokine,
+             choice = "ind", addlabels = TRUE, # cytokines are the rows ("individuals") of the PCA input
+             axes = 1:2) # specifies to show contribution percentages for first 2 PCs
+
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. Cytokine contributions to principal components.** The bar chart displays each cytokine's contribution to the first two eigenvectors in descending order from left to right. The red dashed line represents the expected contribution of each cytokine if all contributions were uniform; therefore, the seven cytokines that fall above this reference line are considered to have significant contributions to the first two principal components."
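+
+The height of that red dashed reference line can be computed by hand: with uniform contributions, each of the 22 cytokines would account for 1/22 of the total, or roughly 4.5%:
+```{r 5-5-Unsupervised-ML-2-28b}
+# Expected contribution (%) per cytokine if all contributed equally
+100 / nrow(wider_baseline_df)
+```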
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can answer **Environmental Health Question #5***: Which cytokines have the greatest contributions to the first two eigenvectors?
+:::
+
+:::answer
+**Answer**: The cytokines that have significant contributions to the first two principal components include IL-8, Fractalkine, IP-10, IL-4, MIG, I309, and IL-12p70.
+:::
+
+
+
+## Concluding Remarks
+In this module, we explored scenarios where clustering would be appropriate but lack contextual details informing the number of clusters that should be considered, thus resulting in the need to derive such a number. In addition, methodology for *k*-means and hierarchical clustering was presented, along with corresponding visualizations. Lastly, variable contributions to the eigenvectors were introduced as a means to determine the most influential variables on the principal components' composition.
+
+### Additional Resources
++ [*K*-Means Cluster Analysis](https://uc-r.github.io/kmeans_clustering#silo)
++ [*K*-Means Clustering in R](https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)
++ [Hierarchical Clustering in R](https://uc-r.github.io/hc_clustering)
+
+
+
+
+
+:::tyk
+Using the same dataset, answer the questions below.
+
+1. Determine the optimal number of *k*-means clusters of cytokines derived from the nasal epithelium lining fluid of **e-cigarette users**.
+2. How do those clusters compare to the ones that were derived at baseline (in non-smokers)?
+3. Which cytokines have the greatest contributions to the first two eigenvectors?
+:::
diff --git a/Chapter_5/Module5_5_Input/Module5_5_InputData.xlsx b/Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2_Data.xlsx
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_InputData.xlsx
rename to Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2_Data.xlsx
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image1.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image1.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image1.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image1.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image2.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image2.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image2.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image2.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image3.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image3.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image3.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image3.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image4.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image4.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image4.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image4.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image5.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image5.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image5.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image5.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image7.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image7.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image7.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image7.png
diff --git a/Chapter_5/Module5_5_Input/Module5_5_Image8.png b/Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image8.png
similarity index 100%
rename from Chapter_5/Module5_5_Input/Module5_5_Image8.png
rename to Chapter_5/5_5_Unsupervised_ML_2/Module5_5_Image8.png
diff --git a/Chapter_6/06-Chapter6.Rmd b/Chapter_6/06-Chapter6.Rmd
deleted file mode 100644
index 00d7b26..0000000
--- a/Chapter_6/06-Chapter6.Rmd
+++ /dev/null
@@ -1,4385 +0,0 @@
-# (PART\*) Chapter 6 Applications in Toxicology & Exposure Science {-}
-
-# 6.1 Descriptive Cohort Analyses
-
-This training module was developed by Elise Hickman, Kyle Roell, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Human cohort datasets are very commonly analyzed and integrated in environmental health research. Common research study designs that incorporate human data include clinical, epidemiological, biomonitoring, and/or biomarker study designs. These datasets represent metrics of health and exposure collected from human participants at one or many points in time. Although these datasets can lend themselves to highly complex analyses, it is important to first explore the basic dataset properties to understand data missingness, filter data appropriately, generate demographic tables and summary statistics, and identify outliers. In this module, we will work through these common steps with an example dataset and discuss additional considerations when working with human cohort datasets.
-
-Our example data are derived from a study in which chemical exposure profiles were collected using silicone wristbands. Silicone wristbands are an affordable and minimally invasive method for sampling personal chemical exposure profiles. This exposure monitoring technique has been described through previous publications (see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**). The example workflow can also apply to other study designs, including biomonitoring and biomarker studies, which require careful consideration of chemical or biological marker detection filters, transparent reporting of descriptive statistics, and demographics tables.
-
-### Training Module's Environmental Health Questions
-
-1. What proportion of participants wore their wristbands for all seven days?
-2. How many chemicals were detected in at least 20% of participants?
-3. What are the demographics of the study participants?
-
-### Workspace Preparation and Data Import
-
-```{r 06-Chapter6-1, message = FALSE}
-# Load packages
-library(tidyverse) # for data organization and manipulation
-library(janitor) # for data cleaning
-library(openxlsx) # for reading in and writing out files
-library(DT) # for displaying tables
-library(table1) # for making tables, including demographics tables
-library(patchwork) # for graphing
-library(purrr) # for summary stats
-library(factoextra) # for PCA outlier detection
-
-# Make sure select is calling the correct function
-select <- dplyr::select
-
-# Set graphing theme
-theme_set(theme_bw())
-```
-
-First, we will import our raw chemical data and preview it.
-```{r 06-Chapter6-2, warning = FALSE}
-wrist_data <- read.xlsx("Chapter_6/Module6_1_Input/Module6_1_InputData1.xlsx") %>%
- mutate(across(everything(), \(x) as.numeric(x)))
-
-datatable(wrist_data[ , 1:6])
-```
-
-In this study, 97 participants wore silicone wristbands for one week, and chemical concentrations on the wristbands were measured with gas chromatography-mass spectrometry. This dataframe consists of a column with a unique identifier for each participant (`S_ID`), a column describing the number of days that participant wore the wristband (`Ndays`), and subsequent columns containing the amount of each chemical detected (nanograms of chemical per gram of wristband). The chemical columns are labeled with the chemical class first (e.g., alkyl OPE, or alkyl organophosphate ester), followed by an underscore and the chemical name (e.g., 2IPPDPP). This dataset contains 110 different chemicals categorized into 8 chemical classes (listed below with their abbreviations):
-
-+ Brominated diphenyl ether (BDE)
-+ Brominated flame retardant (BFR)
-+ Organophosphate ester (OPE)
-+ Polycyclic aromatic hydrocarbon (PAH)
-+ Polychlorinated biphenyl (PCB)
-+ Pesticide (Pest)
-+ Phthalate (Phthal)
-+ Alkyl organophosphate ester (alkylOPE)
-
-Through the data exploration and cleaning process, we will aim to:
-
-+ Understand participant behaviors
-+ Filter out chemicals with low detection
-+ Generate a supplemental table containing chemical detection information and summary statistics such as minimum, mean, median, and maximum
-+ Identify participant outliers
-+ Generate a demographics table
-
-Although these steps are somewhat specific to our example dataset, similar steps can be taken with other datasets. We recommend thinking through the structure of your data and outlining data exploration and cleaning steps prior to starting your analysis. This process can be somewhat time-consuming and tedious but is important to ensure that your data are well-suited for downstream analyses. In addition, these steps should be included in any resulting manuscript as part of the narrative relating to the study cohort and data cleaning.
-
-## Participant Exploration
-
-We can use *tidyverse* functions to quickly tabulate how many days participants wore the wristbands.
-
-```{r 06-Chapter6-3}
-wrist_data %>%
-
- # Count number of participants for each number of days
- dplyr::count(Ndays) %>%
-
-  # Calculate proportion of participants for each number of days
- mutate(prop = prop.table(n)) %>%
-
- # Arrange the table from highest to lowest number of days
- arrange(-Ndays) %>%
-
- # Round the proportion column to two decimal places
- mutate(across(prop, \(x) round(x, 2)))
-```
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can now answer **Environmental Health Question #1***: What proportion of participants wore their wristbands for all seven days?
-:::
-
-:::answer
-**Answer:** 86% of participants wore their wristbands for all seven days.
-:::
-
-Because a few participants did not wear their wristbands for all seven days, it will be important to further explore whether there are outlier participants and to normalize the chemical concentrations by number of days the wristband was worn. We can first assess whether any participants have a particularly low or high number of chemicals detected relative to the other participants.
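-
-One straightforward normalization, sketched below, is to divide each chemical concentration by the number of days the wristband was worn, yielding nanograms of chemical per gram of wristband per day (assuming this simple per-day normalization suits the downstream analysis):
-```{r 06-Chapter6-3b}
-# Dividing every chemical column (all columns except S_ID and Ndays)
-# by each participant's number of days worn; NAs (non-detects) stay NA
-chem_cols <- setdiff(names(wrist_data), c("S_ID", "Ndays"))
-wrist_data_norm <- wrist_data
-wrist_data_norm[chem_cols] <- wrist_data[chem_cols] / wrist_data$Ndays
-```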
-
-We'll prepare the data for graphing by creating a dataframe containing information about how many chemicals were detected per participant.
-```{r 06-Chapter6-4}
-wrist_det_by_participant <- wrist_data %>%
-
- # Remove Ndays column because we don't need it for this step
- select(-Ndays) %>%
-
- # Move S_ID to rownames so it doesn't interfere with count
- column_to_rownames("S_ID") %>%
-
- # Create a new column for number of chemicals detected
- mutate(n_det = rowSums(!is.na(.))) %>%
-
- # Clean dataframe
- rownames_to_column("S_ID") %>%
- select(c(S_ID, n_det))
-
-datatable(wrist_det_by_participant)
-```
-
-Then, we can make our histogram:
-```{r 06-Chapter6-5, warning = FALSE, fig.align = "center"}
-det_per_participant_graph <- ggplot(wrist_det_by_participant, aes(x = n_det)) +
- geom_histogram(color = "black",
- fill = "gray60",
- alpha = 0.7,
- binwidth = 2) +
- ggtitle("Distribution of Number of Chemicals Detected Per Participant") +
- ylab("Number of Participants") +
- xlab("Number of Chemicals Detected") +
- scale_x_continuous(breaks = seq(0, 70, by = 10), limits = c(0, 70), expand = c(0.025, 0.025)) +
- scale_y_continuous(breaks = seq(0, 15, by = 5), limits = c(0, 15), expand = c(0, 0)) +
- theme(plot.title = element_text(hjust = 0.5, size = 16),
- axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
- axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
- axis.text = element_text(size = 12))
-
-det_per_participant_graph
-```
-
-From this histogram, we can see that the number of chemicals detected per participant ranges from about 30-65 chemicals, with no participants standing out as being well above or below the distribution.
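-
-The exact range can be confirmed from the per-participant detection counts:
-```{r 06-Chapter6-5b}
-# Minimum and maximum number of chemicals detected across participants
-range(wrist_det_by_participant$n_det)
-```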
-
-## Chemical Detection Filtering
-
-Next, we want to apply a chemical detection filter to remove chemicals from the dataset with very low detection. To start, let's make a dataframe summarizing the percentage of participants in which each chemical was detected and graph this distribution using a histogram.
-
-```{r 06-Chapter6-6}
-# Create dataframe where n_detected is the sum of the rows where there are not NA values
-chemical_counts <- data.frame(n_detected = colSums(!is.na(wrist_data %>% select(-c(S_ID, Ndays))))) %>%
-
- # Move rownames to a column
- rownames_to_column("class_chemical") %>%
-
- # Add n_undetected and percentage detected and undetected columns
- mutate(n_undetected = nrow(wrist_data) - n_detected,
- perc_detected = n_detected/nrow(wrist_data)*100,
- perc_undetected = n_undetected/nrow(wrist_data)*100) %>%
-
- # Round percentages to two decimal places
- mutate(across(c(perc_detected, perc_undetected), \(x) round(x, 2)))
-
-# View dataframe
-datatable(chemical_counts)
-```
-
-```{r 06-Chapter6-7, fig.align = "center"}
-det_per_chemical_graph <- ggplot(chemical_counts, aes(x = perc_detected)) +
- geom_histogram(color = "black",
- fill = "gray60",
- alpha = 0.7,
- binwidth = 1) +
- scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
- scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
- ggtitle("Distribution of Percentage Chemical Detection") +
- ylab("Number of Chemicals") +
- xlab("Percentage of Detection Across All Participants") +
- theme(plot.title = element_text(hjust = 0.5),
- axis.title.x = element_text(margin = ggplot2::margin(t = 10)),
- axis.title.y = element_text(margin = ggplot2::margin(r = 10)))
-
-det_per_chemical_graph
-```
-
-From this histogram, we can see that many of the chemicals fall in the < 15% or > 90% detection range, with the others distributed evenly between 20 and 90% detection. How we choose to filter our data in part depends on the goals of our analysis. For example, if we only want to keep chemicals detected for almost all of the participants, we could set our threshold at 90% detection:
-```{r 06-Chapter6-8, fig.align = "center"}
-# Add annotation column
-chemical_counts <- chemical_counts %>%
- mutate(det_filter_90 = ifelse(perc_detected > 90, "Yes", "No"))
-
-# How many chemicals pass this filter?
-nrow(chemical_counts %>% filter(det_filter_90 == "Yes"))
-
-# Make graph
-det_per_chemical_graph_90 <- ggplot(chemical_counts, aes(x = perc_detected, fill = det_filter_90)) +
- geom_histogram(color = "black",
- alpha = 0.7,
- binwidth = 1) +
- scale_fill_manual(values = c("gray87", "gray32"), guide = "none") +
- geom_segment(aes(x = 90, y = 0, xend = 90, yend = 25), color = "firebrick", linetype = 2) +
- scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
- scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
- ggtitle("Distribution of Percentage Chemical Detection") +
- ylab("Number of Chemicals") +
- xlab("Percentage of Detection Across All Participants") +
- theme(plot.title = element_text(hjust = 0.5, size = 16),
- axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
- axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
- axis.text = element_text(size = 12))
-
-det_per_chemical_graph_90
-```
-
-However, this only keeps 34 chemicals in our dataset, removing a significant proportion of the chemicals measured. We could also consider setting the filter at 20% detection to maximize inclusion of as many chemicals as possible.
-
-```{r 06-Chapter6-9, fig.align = "center"}
-# Add annotation column
-chemical_counts <- chemical_counts %>%
- mutate(det_filter_20 = ifelse(perc_detected > 20, "Yes", "No"))
-
-# How many chemicals pass this filter?
-nrow(chemical_counts %>% filter(det_filter_20 == "Yes"))
-
-# Make graph
-det_per_chemical_graph_20 <- ggplot(chemical_counts, aes(x = perc_detected, fill = det_filter_20)) +
- geom_histogram(color = "black",
- alpha = 0.7,
- binwidth = 1) +
- scale_fill_manual(values = c("gray87", "gray32"), guide = "none") +
- geom_segment(aes(x = 20, y = 0, xend = 20, yend = 25), color = "firebrick", linetype = 2) +
- scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
- scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
- ggtitle("Distribution of Percentage Chemical Detection") +
- ylab("Number of Chemicals") +
- xlab("Percentage of Detection Across All Participants") +
- theme(plot.title = element_text(hjust = 0.5, size = 16),
- axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
- axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
- axis.text = element_text(size = 12))
-
-det_per_chemical_graph_20
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can now answer **Environmental Health Question #2***: How many chemicals were detected in at least 20% of participants?
-:::
-
-:::answer
-**Answer:** 62 chemicals were detected in at least 20% of participants.
-:::
-
-We'll use the 20% detection filter for downstream analyses to maximize inclusion of data for our study. Note that selection of data filters is highly project- and goal-dependent, so be sure to take into consideration typical workflows for your type of data, study, or lab group.
-
-```{r 06-Chapter6-10}
-# Create vector of chemicals to keep
-chemicals_20perc <- chemical_counts %>%
- filter(perc_detected > 20) %>%
- pull(class_chemical)
-
-# Filter dataframe
-wrist_data_filtered <- wrist_data %>%
- column_to_rownames("S_ID") %>%
- dplyr::select(all_of(chemicals_20perc))
-```
-
-We can also summarize chemical detection vs. non-detection by chemical class to understand the number of chemicals in each class that were 1) detected in any participant or 2) detected in more than 20% of participants.
-
-```{r 06-Chapter6-11}
-chemical_count_byclass <- chemical_counts %>%
- separate(class_chemical, into = c("class", NA), remove = FALSE, sep = "_") %>%
- group_by(class) %>%
- summarise(n_chemicals = n(),
- n_chemicals_det = sum(n_detected > 0),
- n_chemicals_det_20perc = sum(perc_detected > 20)) %>%
- bind_rows(summarise(., across(where(is.numeric), sum),
- across(where(is.character), ~'Total')))
-
-datatable(chemical_count_byclass)
-```
-
-From these data, we can see that, of the 62 chemicals retained by our detection filter, some classes were retained more than others. For example, 8 of the 10 phthalates (80%) were retained by the 20% detection filter, while only 2 of the 11 PCBs (18%) were retained.
-
-## Outlier Identification
-
-Next, we will check to see if any participants are outliers based on the entire chemical signature for each participant using principal component analysis (PCA). Prior to checking for outliers, a few final data cleaning steps are required, which are beyond the scope of this specific module, though we encourage participants to research these methods as they are important in general data pre-processing. These data cleaning steps were:
-
-1. Imputing missing values.
-2. Calculating time-weighted average values by dividing each value by the number of days the participant wore the wristband.
-3. Assessing normality of data with and without log2 transformation.
-
-Here, we'll read in the fully cleaned and processed data, which contains data for all 97 participants and the 62 chemicals that passed the detection filter (imputed, time-weighted). We will also apply log2 transformation to move the data closer to a normal distribution. For more on these steps, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations** and **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**.
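-To illustrate the time-weighting step described above, here is a minimal sketch using a small hypothetical dataframe (not the study data), dividing each chemical measurement by the number of days the wristband was worn:
-```{r, eval = FALSE}
-# Hypothetical example of time-weighted averaging (chemA and chemB are made-up chemicals)
-example_data <- data.frame(S_ID = c("P1", "P2"),
-                           Ndays = c(5, 7),
-                           chemA = c(10, 21),
-                           chemB = c(NA, 14))
-
-example_weighted <- example_data %>%
-  mutate(across(c(chemA, chemB), \(x) x / Ndays)) %>% # divide each value by days worn
-  select(-Ndays)
-```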
-
-```{r 06-Chapter6-12}
-wrist_data_cleaned <- read.xlsx("Chapter_6/Module6_1_Input/Module6_1_InputData2.xlsx") %>%
- column_to_rownames("S_ID") %>%
- mutate(across(everything(), \(x) log2(x+1)))
-
-datatable(wrist_data_cleaned[, 1:6])
-```
-
-First, let's run PCA and plot our data.
-```{r 06-Chapter6-13, fig.align = "center"}
-# Prepare dataframe
-wrist_data_cleaned_scaled <- wrist_data_cleaned %>%
- scale() %>% data.frame()
-
-# Run PCA
-pca <- prcomp(wrist_data_cleaned_scaled)
-
-# Visualize PCA
-pca_chemplot <- fviz_pca_ind(pca,
- label = "none",
- pointsize = 3) +
-  theme(axis.title = element_text(face = "bold", size = rel(1.1)),
-        panel.border = element_rect(fill = NA, color = "black", linewidth = 0.3),
-        panel.grid.minor = element_blank(),
-        panel.grid.major = element_blank(),
-        plot.title = element_text(hjust = 0.5),
-        legend.position = "none")
-
-pca_chemplot
-```
-
-By visual inspection, it looks like there may be some outliers, so we can apply a quantitative criterion to detect them. One standard approach is to flag samples that are "more than 6 standard deviations away from the mean" ([Source](https://privefl.github.io/blog/detecting-outlier-samples-in-pca/)).
-
-We can apply this approach to our data by first creating a function to detect PCA outliers based on whether or not that participant passed a certain standard deviation cutoff.
-
-```{r 06-Chapter6-14}
-# Create a function to detect PCA sample outliers. The input is the PCA results data frame and the number of standard deviations for the cutoff. The output is outlier names.
-outlier_detection = function(pca_df, sd){
-
- # getting scores
- scores = pca_df$x
-
- # identifying samples that are more than the specified number of standard deviations away from the mean on any PC
- outlier_indices = apply(scores, 2, function(x) which( abs(x - mean(x)) > (sd * sd(x)) )) %>%
- Reduce(union, .)
-
- # getting sample names
- outliers = rownames(scores)[outlier_indices]
-
- return(outliers)
-}
-
-# Call function with different standard deviation cutoffs
-outliers_6 <- outlier_detection(pca, 6)
-outliers_5 <- outlier_detection(pca, 5)
-outliers_4 <- outlier_detection(pca, 4)
-outliers_3 <- outlier_detection(pca, 3)
-
-# Summary data frame
-outlier_summary <- data.frame(sd_cutoff = c(6, 5, 4, 3), n_outliers = c(length(outliers_6), length(outliers_5), length(outliers_4), length(outliers_3)))
-
-outlier_summary
-```
-
-From these results, we see that there are no outliers that are > 6 standard deviations from the mean, so we will proceed with the dataset without filtering any participants out.
-
-## Summary Statistics Tables
-
-Now that we have explored our dataset and finished processing the data, we can make a summary table that includes descriptive statistics (minimum, mean, median, maximum) for each of our chemicals. This table would go into supplementary material when the project is submitted for publication. It is a good idea to make this table using both the raw data and the cleaned data (imputed and normalized by time-weighted average) because different readers may have different interests in the data. For example, they may want to see the raw data so that they can understand chemical detection versus non-detection and absolute minimums or maximums of detection. Or, they may want to use the cleaned data for their own analyses. This table can also include information about whether or not the chemical passed our 20% detection filter.
-
-There are many ways to generate summary statistics tables in R. Here, we will demonstrate a method using the `map_dfr()` function, which takes a list of functions and applies them across columns of the data. The summary statistics are then placed in rows, with each column representing a variable.
-
-```{r 06-Chapter6-15, warning = FALSE}
-# Define summary functions
-summary_functs <- lst(min, median, mean, max)
-
-# Apply summary functions to raw data
-summarystats_raw <- map_dfr(summary_functs, ~ summarise(wrist_data, across(3:ncol(wrist_data), .x, na.rm = TRUE)), .id = "statistic")
-
-# View data
-datatable(summarystats_raw[, 1:6])
-```
-
-Through a few cleaning steps, we can transpose and format these data so that they are publication-quality.
-```{r 06-Chapter6-16}
-summarystats_raw <- summarystats_raw %>%
-
- # Transpose dataframe and return to dataframe class
- t() %>% as.data.frame() %>%
-
- # Make the first row the column names
- row_to_names(1) %>%
-
- # Remove rows with NAs (those where data are completely missing)
- na.omit() %>%
-
- # Move chemical identifier to a column
- rownames_to_column("class_chemical") %>%
-
- # Round data
- mutate(across(min:max, as.numeric)) %>%
- mutate(across(where(is.numeric), \(x) round(x, 2))) %>%
-
- # Add a suffix to column titles so we know that these came from the raw data
- rename_with(~paste0(., "_raw"), min:max)
-
-datatable(summarystats_raw)
-```
-
-We can apply the same steps to the cleaned data.
-
-```{r 06-Chapter6-17}
-summarystats_cleaned <- map_dfr(summary_functs, ~ summarise(wrist_data_cleaned, across(1:ncol(wrist_data_cleaned), .x, na.rm = TRUE)),
- .id = "statistic") %>%
- t() %>% as.data.frame() %>%
- row_to_names(1) %>%
- na.omit() %>%
- rownames_to_column("class_chemical") %>%
- mutate(across(min:max, as.numeric)) %>%
- mutate(across(where(is.numeric), \(x) round(x, 2))) %>%
- rename_with(~paste0(., "_cleaned"), min:max)
-
-datatable(summarystats_cleaned)
-```
-
-Finally, we will merge the data from our `chemical_counts` dataframe (which contains detection information for all of our chemicals) with our summary statistics dataframes.
-
-```{r 06-Chapter6-18}
-summarystats_final <- chemical_counts %>%
-
- # Remove 90% detection filter column
- select(-det_filter_90) %>%
-
- # Add raw summary stats
- left_join(summarystats_raw, by = "class_chemical") %>%
-
- # Add cleaned summary stats
- left_join(summarystats_cleaned, by = "class_chemical")
-
-datatable(summarystats_final, width = 600)
-```
-
-## Demographics Table
-
-Another important element of any analysis of human data is the demographics table. The demographics table provides key information about the study participants and can help inform downstream analyses, such as exploration of the impact of covariates on the endpoint of interest. There are many different ways to make demographics tables in R. Here, we will demonstrate making a demographics table with the *table1* package. For more on this package, including making tables with multiple groups and testing for statistical differences in demographics between groups, see the *table1* vignette [here](https://benjaminrich.github.io/table1/vignettes/table1-examples.html).
-
-First, we'll read in and view our demographic data:
-```{r 06-Chapter6-19}
-demo_data <- read.xlsx("Chapter_6/Module6_1_Input/Module6_1_InputData3.xlsx")
-
-datatable(demo_data)
-```
-
-Then, we can create new labels for our variables so that they are more nicely formatted and more intuitive for display in the table.
-```{r 06-Chapter6-20}
-# Create new labels for the demographics table
-label(demo_data$mat_age_birth) <- "Age at Childbirth"
-label(demo_data$pc_sex) <- "Sex"
-label(demo_data$pc_gender) <- "Gender"
-label(demo_data$pc_latino_hispanic) <- "Latino or Hispanic"
-label(demo_data$pc_race_cleaned) <- "Race"
-label(demo_data$pc_ed) <- "Educational Attainment"
-```
-
-Our demographics data also had "F" for female in the sex column. We can change this to "Female" so that the demographics table is more readable.
-```{r 06-Chapter6-21}
-demo_data <- demo_data %>%
- mutate(pc_sex = dplyr::recode(pc_sex, "F" = "Female"))
-
-label(demo_data$pc_sex) <- "Sex"
-```
-
-Now, let's make the table. The formula lists all of the columns you want to include in the table, and the `data` argument specifies the input dataframe.
-```{r 06-Chapter6-22}
-table1(~ mat_age_birth + pc_sex + pc_gender + pc_latino_hispanic + pc_race_cleaned + pc_ed, data = demo_data)
-```
-
-
-
-There are a couple of steps we could take to clean up the table:
-
-1. Change the rendering for our continuous variable (age) to just mean (SD).
-2. Order educational attainment so that it progresses from least to most education.
-
-We can change the rendering for our continuous variable by defining our own rendering function (as demonstrated in the package's vignette).
-```{r 06-Chapter6-23}
-# Create function for custom table so that Mean (SD) is shown for continuous variables
-my.render.cont <- function(x) {
- with(stats.apply.rounding(stats.default(x), digits=2),
- c("", "Mean (SD)"=sprintf("%s (± %s)", MEAN, SD)))
-}
-```
-
-We can order educational attainment by converting it to a factor and defining the levels.
-```{r 06-Chapter6-24}
-demo_data <- demo_data %>%
- mutate(pc_ed = factor(pc_ed, levels = c("High School or GED", "Associate Degree", "Four-Year Degree",
- "Master's Degree", "Professional Degree or PhD")))
-
-label(demo_data$pc_ed) <- "Educational Attainment"
-```
-
-Then, we can make our final table.
-```{r 06-Chapter6-25}
-table1(~ mat_age_birth + pc_sex + pc_gender + pc_latino_hispanic + pc_race_cleaned + pc_ed,
- data = demo_data,
- render.continuous = my.render.cont)
-```
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can now answer **Environmental Health Question #3***: What are the demographics of the study participants?
-:::
-
-:::answer
-**Answer:** The study participants were all females who identified as women and were, on average, 31 years old when they gave birth. Participants were mostly non-Latino/non-Hispanic and White. Participants were spread across educational attainment levels, with the smallest educational attainment group being those with an associate degree and the largest being those with a four-year degree.
-:::
-
-## Concluding Remarks
-
-In conclusion, this training module serves as an introduction to human cohort data exploration and preliminary analysis, including data filtering, summary statistics, and multivariate outlier detection. These methods are an important step at the beginning of human cohort analyses, and the concepts introduced in this module can be applied to a wide variety of datasets.
-
-
-
-
-
-:::tyk
-Using a more expanded demographics file ("Module6_1_TYKInput.xlsx"), create a demographics table with:
-
-+ The two new variables (home location and home type) included
-+ The table split by which site the participant visited
-+ Variable names and values presented in a publication-quality format (first letters capitalized, spaces between words, no underscores)
-:::
-
-# 6.2 -Omics and Systems Biology: Transcriptomic Applications
-
-This training module was developed by Lauren E. Koval, Dr. Kyle Roell, and Dr. Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-
-## Introduction to Training Module
-
-This training module incorporates the highly relevant example of RNA sequencing to evaluate the impacts of environmental exposures on cellular responses and general human health. **RNA sequencing** is the most common method currently implemented to measure the transcriptome. Results from an RNA sequencing platform are often summarized as count data, representing the relative number of times a gene (or other annotated portion of the genome) was 'read' in a given sample. For more details surrounding the methodological underpinnings of RNA sequencing, see the following recent review:
-
-+ Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019 Nov;20(11):631-656. doi: 10.1038/s41576-019-0150-2. Epub 2019 Jul 24. PMID: [31341269](https://pubmed.ncbi.nlm.nih.gov/31341269/).
-
-
-In this training module, we guide participants through an example RNA sequencing analysis. Here, we analyze RNA sequencing data collected in a toxicology study evaluating the effects of biomass smoke exposure, representing wildfire-relevant exposure conditions. This study has previously been described in the following publications:
-
-+ Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. doi: 10.1016/j.scitotenv.2021.145759. Epub 2021 Feb 10. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
-
-+ Kim YH, Warren SH, Krantz QT, King C, Jaskot R, Preston WT, George BJ, Hays MD, Landis MS, Higuchi M, DeMarini DM, Gilmour MI. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ Health Perspect. 2018 Jan 24;126(1):017011. doi: 10.1289/EHP2200. PMID: [29373863](https://pubmed.ncbi.nlm.nih.gov/29373863/).
-
-Here, we specifically analyze mRNA sequencing profiles collected in mouse lung tissues. These mice were exposed to two different biomass burn scenarios: smoldering pine needles and flaming pine needles, representing certain wildfire smoke exposure scenarios that can occur. The goal of these analyses is to identify which genes demonstrate altered expression in response to these wildfire-relevant exposures, and identify which biological pathways these genes influence to evaluate findings at the systems biology level.
-
-This training module begins by guiding users through the loading, viewing, and formatting of the example transcriptomics datasets and associated metadata. Methods to carry out quality assurance (QA) / quality control (QC) of the transcriptomics data are then described, which are advantageous to ensure high quality data are included in the final statistical analysis. Because these transcriptomic data were derived from bulk lung tissue samples, consisting of mixed cell populations that could have shifted in response to exposures, data are then adjusted for potential sources of heterogeneity using the R package [RUVSeq](https://bioconductor.org/packages/release/bioc/html/RUVSeq.html).
-
-Statistical models are then implemented to identify genes that were significantly differentially expressed between exposed vs unexposed samples. Models are fit using algorithms within the widely used R package [DESeq2](https://doi.org/10.1186/s13059-014-0550-8). This package is convenient and well written, and its main advantage is that it allows you to perform differential expression analyses and easily obtain various statistics and results with minimal script development on the user end.
-
-After obtaining results from differential gene expression analyses, we visualize these results using both MA and volcano plots. Finally, we carry out a systems level analysis through pathway enrichment using the R package [PIANO](https://doi.org/10.1093/nar/gkt111) to identify which biological pathways were altered in response to these wildfire-relevant exposure scenarios.
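-As a preview of the differential expression step described above, a minimal DESeq2 workflow generally takes the following form (a sketch with hypothetical object names; the full analysis is carried out later in this module):
-```{r, eval = FALSE}
-# Sketch of a minimal DESeq2 workflow (hypothetical inputs; not run here)
-dds <- DESeqDataSetFromMatrix(countData = count_matrix,   # genes x samples matrix of raw counts
-                              colData = sample_metadata,  # dataframe of sample annotations
-                              design = ~ Treatment)       # model formula for the comparison of interest
-dds <- DESeq(dds)                                         # fit the models
-res <- results(dds)                                       # extract log2 fold changes and adjusted p-values
-```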
-
-## Introduction to the Field of "-Omics"
-
-The field of "-omics" has rapidly evolved since its inception in the mid-1990s, initiated from information obtained through sequencing of the human genome (see the [Human Genome Project](https://www.genome.gov/human-genome-project)) as well as the advent of high-content technologies. High-content technologies have allowed the rapid and economical assessment of genome-wide, or 'omics'-based, endpoints.
-
-Traditional molecular biology techniques typically evaluate the function(s) of individual genes and gene products. Omics-based methods, on the other hand, utilize non-targeted methods to identify many to all genes or gene products in a given environmental/biological sample. These non-targeted approaches allow for the unbiased investigation of potentially unknown or understudied molecular mediators involved in regulating cell health and disease. These molecular profiles have the potential of being altered in response to toxicant exposures and/or during disease initiation/progression.
-
-To further understand the molecular consequences of -omics-based alterations, molecules can be overlaid onto molecular networks to uncover biological pathways and molecular functions that are perturbed at the systems biology level. An overview of these general methods, starting with high-content technologies and ending with systems biology, is provided in the figure below (created with BioRender.com).
-
-```{r 06-Chapter6-26, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_6/Module6_2_Input/Module6_2_Image1.png")
-```
-
-
-A helpful introduction to the field of -omics in relation to environmental health, as well as methods used to relate -omic-level alterations to systems biology, is provided in the following book chapter:
-
-+ Rager JE, Fry RC. Systems Biology and Environmental Exposures. Chpt 4 of 'Network Biology' edited by WenJun Zhang. 2013. ISBN: 978-1-62618-941-3. Nova Science Publishers, Inc. Available at: https://www.novapublishers.com/wp-content/uploads/2019/07/978-1-62618-942-3_ch4.pdf.
-
-
-An additional helpful resource describing computational methods that can be used in systems level analyses is the following book chapter:
-
-+ Meisner M, Reif DM. Computational Methods Used in Systems Biology. Chpt 5 of 'Systems Biology in Toxicology and Environmental Health' edited by Fry RC. 2015: 85-115. ISBN 9780128015643. Academic Press. Available at: https://www.sciencedirect.com/science/article/pii/B9780128015643000055.
-
-
-Parallel to human genomics/epigenomics-based research is the newer "-omics" topic of the **exposome**. The exposome was originally conceptualized as 'all life-course environmental exposures (including lifestyle factors), from the prenatal period onwards' ([Wild et al. 2005](https://cebp.aacrjournals.org/content/14/8/1847.long)). Since then, this concept has received much attention and additional associated definitions. We like to think of the exposome as including anything in one's environment that may impact the overall health of an individual, excluding the individual's genome/epigenome. Common elements evaluated as part of the exposome include environmental exposures, such as chemicals and other substances that may impart toxicity. Additional potential stressors include lifestyle factors, socioeconomic factors, infectious agents, therapeutics, and other stressors that may be altered internally (e.g., the microbiome). A helpful review of this research field is provided in the following publication:
-
-+ Wild CP. The exposome: from concept to utility. Int J Epidemiol. 2012 Feb;41(1):24-32. doi: 10.1093/ije/dyr236. Epub 2012 Jan 31. PMID: [22296988](https://pubmed.ncbi.nlm.nih.gov/22296988/).
-
-
-
-## Introduction to Transcriptomics
-One of the most widely evaluated -omics endpoints is messenger RNA (mRNA) expression (also termed gene expression). As a reminder, mRNA molecules are a major type of RNA produced as the "middle step" in the [Central Dogma Theory](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology#:~:text=The%20central%20dogma%20of%20molecular,The%20Central%20Dogma), which describes how genetic DNA is first transcribed into RNA and then translated into protein. Protein molecules are ultimately the major regulators of cellular processes and overall health. Therefore, any perturbations to this process (including changes to mRNA expression levels) can have tremendous consequences on overall cell function and health. A visualization of these steps in the Central Dogma theory is included below.
-
-```{r 06-Chapter6-27, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_6/Module6_2_Input/Module6_2_Image2.png")
-```
-
-
-mRNA expression can be evaluated in a high-throughput/high-content manner across the genome; when measured at this scale, the resulting profile is referred to as the **transcriptome**. Transcriptomics can be measured using a variety of technologies, including high-density nucleic acid arrays (e.g., DNA microarrays or GeneChip arrays), high-throughput PCR technologies, or RNA sequencing technologies. These methods obtain relative measures of the genes being expressed, or transcribed from DNA, by measuring the abundance of mRNA molecules. Results of these methods are often described as gene expression signatures, or 'transcriptomes', of a sample under evaluation.
-
-
-### Training Module's **Environmental Health Questions**
-
-This training module was specifically developed to answer the following environmental health questions:
-
-(1) What two types of data are commonly needed in the analysis of transcriptomics data?
-
-(2) When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?
-
-(3) When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?
-
-(4) What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?
-
-(5) How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-
-(6) How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-
-(7) How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-
-(8) What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?
-
-
-### Workspace Preparation and Data Import
-
-
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the below code, which checks installation status for you:
-```{r packages, message=FALSE, warning=FALSE, error=FALSE}
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("BiocManager"))
- install.packages("BiocManager");
-if (!requireNamespace("DESeq2"))
- BiocManager::install("DESeq2");
-if (!requireNamespace("edgeR"))
- BiocManager::install("edgeR");
-if (!requireNamespace("RUVSeq"))
- BiocManager::install("RUVSeq");
-if (!requireNamespace("janitor"))
- install.packages("janitor");
-if (!requireNamespace("pheatmap"))
- install.packages("pheatmap");
-if (!requireNamespace("factoextra"))
- install.packages("factoextra");
-if (!requireNamespace("RColorBrewer"))
- install.packages("RColorBrewer");
-if (!requireNamespace("data.table"))
- install.packages("data.table");
-if (!requireNamespace("EnhancedVolcano"))
- BiocManager::install("EnhancedVolcano");
-if (!requireNamespace("piano"))
- BiocManager::install("piano");
-```
-
-
-#### Loading R packages required for this session
-```{r 06-Chapter6-28, message=FALSE, warning=FALSE, error=FALSE}
-library(tidyverse)
-library(DESeq2)
-library(edgeR)
-library(RUVSeq)
-library(janitor)
-library(factoextra)
-library(pheatmap)
-library(data.table)
-library(RColorBrewer)
-library(EnhancedVolcano)
-library(piano)
-```
-
-
-#### Set your working directory
-```{r 06-Chapter6-29, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-
-### Loading the Example Transcriptomic Dataset and Metadata
-
-First, let's read in the transcriptional signature data, previously summarized as the number of sequence reads per gene (also simply referred to as 'count data'), and its associated metadata file:
-```{r loaddata, message=F, warning=F, error=F}
-# Read in the count data
-countdata <- read.csv(file = 'Chapter_6/Module6_2_Input/Module6_2_InputData1_GeneCounts.csv', check.names = FALSE)
-
-# Read in the metadata (describing information on each sample)
-sampleinfo <- read.csv(file = "Chapter_6/Module6_2_Input/Module6_2_InputData2_SampleInfo.csv", check.names = FALSE)
-```
-
-
-### Data Viewing
-
-Let's see how many rows and columns of data are present in the countdata dataframe:
-```{r 06-Chapter6-30}
-dim(countdata)
-```
-
-Let's also view the column headers:
-```{r 06-Chapter6-31}
-colnames(countdata)
-```
-
-And finally, let's view the top few rows of data:
-```{r 06-Chapter6-32}
-head(countdata)
-```
-Together, this dataframe contains information across 30146 mRNA identifiers, that are labeled according to "Gene name" followed by an underscore and probe number assigned by the platform used in this analysis, BioSpyder TempoSeq Technologies.
-
-A total of 23 columns are included in this dataframe, the first of which represents the gene identifier, followed by gene count data across 22 samples.
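As a side note, these combined identifiers can be split back into a gene symbol and a probe number when needed. Below is a minimal base R sketch using made-up identifiers that mimic this convention:

```{r geneid-split-sketch}
# Hypothetical identifiers mimicking the "GeneSymbol_ProbeNumber" convention
ids <- c("Abca1_3834", "Tnf_12012", "Il6_9981")

# Split at the final underscore into gene symbol and probe number
gene_symbol <- sub("_[0-9]+$", "", ids)
probe_num <- sub("^.*_", "", ids)

gene_symbol
probe_num
```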
-
-
-Let's also see what the metadata dataframe looks like
-```{r 06-Chapter6-33}
-dim(sampleinfo)
-```
-
-Let's also view the column headers
-```{r 06-Chapter6-34}
-colnames(sampleinfo)
-```
-
-And finally let's view the top few rows of data
-```{r 06-Chapter6-35}
-head(sampleinfo)
-```
-Together, this dataframe contains information for all 22 samples, which are labeled in the `SampleID_BioSpyderCountFile` column. These identifiers match those used as column headers in the countdata dataframe.
-
-A total of 9 columns are included in this dataframe, including the following:
-
-+ `SampleID_BioSpyderCountFile`: The unique sample identifiers (total n=22)
-+ `PlateBatch`: The plate number that was used in the generation of these data
-+ `MouseID`: The unique identifier for each mouse used in this study, starting with "M" followed by a number
-+ `NumericID`: The unique numeric identifier for each mouse
-+ `Treatment`: The type of exposure condition that each mouse was administered. These include smoldering pine needles, flaming pine needles, vehicle control (saline), and positive inflammation control (LPS, or lipopolysaccharide)
-+ `ID`: Another form of identifier that combines the mouse identifier with the exposure condition
-+ `Timepoint`: The timepoint at which samples were collected (here, all 4h post-exposure)
-+ `Tissue`: The type of tissue that was collected and analyzed (here, all lung tissue)
-+ `Group`: The higher level identifier that groups samples together based on exposure condition, timepoint, and tissue
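To confirm the group sizes in a metadata file like this one, samples per exposure condition can be tallied with `table()`. The sketch below uses a toy vector mirroring this study design (six mice per pine needle and saline condition, four for LPS); on the real data, `table(sampleinfo$Treatment)` would do the same:

```{r group-tally-sketch}
# Toy Treatment vector mirroring this study design (illustrative labels)
treatment <- c(rep("PineNeedlesSmolder", 6), rep("PineNeedlesFlame", 6),
               rep("Saline", 6), rep("LPS", 4))

# Tally samples per exposure condition; on the real data: table(sampleinfo$Treatment)
table(treatment)
```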
-
-### Checking for Duplicate mRNA IDs
-
-One common QC/preparation step that is helpful when organizing transcriptomics data is to check for potential duplicate mRNA IDs in the countdata.
-```{r 06-Chapter6-36}
-# Visualize this data quickly by viewing top left corner, to check where ID column is located:
-countdata[1:3,1:5]
-
-# Then check for duplicates within column 1 (where the ID column is located):
-Dups <- duplicated(countdata[,1])
-summary(Dups)
-```
-
-In this case, because all of the duplicate checks return "FALSE", these data do not contain duplicate mRNA identifiers in their current format.
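If duplicates had been found, one common strategy (not the only one) is to collapse them by summing counts per identifier. A minimal base R sketch on made-up data:

```{r dup-collapse-sketch}
# Toy count table with a deliberately duplicated gene ID (illustrative data)
toy <- data.frame(Gene = c("Tnf", "Il6", "Tnf"),
                  S1 = c(10, 5, 2),
                  S2 = c(8, 4, 1))

# Collapse duplicates by summing counts per gene
collapsed <- aggregate(cbind(S1, S2) ~ Gene, data = toy, FUN = sum)
collapsed
```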
-
-### Answer to Environmental Health Question 1
-
-:::question
-*With this, we can now answer **Environmental Health Question #1***: What two types of data are commonly needed in the analysis of transcriptomics data?
-:::
-
-:::answer
-**Answer:** Two files are commonly needed: one containing the raw -omics signatures (in this case, count data summarized per gene acquired from RNA sequencing technologies), and one containing the associated metadata describing the actual samples (where they were derived from, what they represent, etc.).
-:::
-
-
-## Formatting Data for Downstream Statistics
-
-Most of the statistical analyses included in this training module will be carried out using the DESeq2 pipeline. This package requires that the count data and sample information data be formatted in a certain manner, which will expedite the downstream coding needed to carry out the statistics. Here, we will walk users through these initial formatting steps.
-
-DESeq2 first requires a `coldata` dataframe, which includes the sample information (i.e., metadata). Let's create this new dataframe based on the original `sampleinfo` dataframe:
-```{r 06-Chapter6-37, message=F, warning=F, error=F}
-coldata <- sampleinfo
-```
-
-
-DESeq2 also requires a `countdata` dataframe, which we've previously created; however, this dataframe requires some minor formatting before it can be used as input for downstream script.
-
-First, the gene identifiers need to be converted into row names:
-```{r 06-Chapter6-38, message=F, warning=F, error=F}
-countdata <- countdata %>% column_to_rownames("Gene")
-```
-
-Then, the column names need to be edited. Let's remind ourselves what the column names are currently:
-```{r 06-Chapter6-39, message=F, warning=F, error=F}
-colnames(countdata)
-```
-
-These column identifiers need to be converted into more intuitive sample IDs that also indicate treatment. This information can be found in the coldata dataframe; specifically, the column labeled `SampleID_BioSpyderCountFile` will be helpful for these purposes.
-
-To replace these original column identifiers with more helpful sample identifiers, let's first make sure the countdata columns are in the same order as the coldata column `SampleID_BioSpyderCountFile`:
-```{r 06-Chapter6-40, message=F, warning=F, error=F}
-countdata <- setcolorder(countdata, as.character(coldata$SampleID_BioSpyderCountFile))
-```
-
-Now, we can rename the column names within the countdata dataframe with these more helpful identifiers, since both dataframes are now arranged in the same order:
-```{r 06-Chapter6-41, message=F, warning=F, error=F}
-colnames(countdata) <- coldata$ID # Rename the countdata column names with the treatment IDs.
-colnames(countdata) # Viewing these new column names
-```
-These new column identifiers look much better and make downstream statistical analysis scripts easier to follow. Remember that these identifiers indicate that these are mouse samples ("M"), with unique numbers, followed by an underscore and the exposure condition.
-
-
-When relabeling dataframes, it's always important to triple check major edits like these. For example, let's double check that the same samples appear in the same order between the two working dataframes required for downstream DESeq2 code:
-```{r 06-Chapter6-42, message=F, warning=F, error=F}
-setequal(as.character(coldata$ID), colnames(countdata))
-identical(as.character(coldata$ID), colnames(countdata))
-```
-
-
-
-## Transcriptomics Data QA/QC
-After preparing your transcriptomic data and sample information dataframes for statistical analyses, it is very important to carry out QA/QC on your organized datasets, prior to including all samples and all genes in the actual statistical model. It is critical to only include high quality data that inform underlying biology of exposure responses/disease etiology, rather than data that may contribute noise to the overall data distributions. Some common QA/QC steps and associated data pre-filters carried out in transcriptomics analyses are detailed below.
-
-
-### Background Filter
-It is very common to perform a background filter step when preparing transcriptomic data for statistical analyses. The goal of this step is to remove genes that are very lowly expressed across the majority of samples, and thus are referred to as universally lowly expressed. Signals from these genes can mute the overall signals that may be identified in -omics analyses. The specific threshold that you may want to apply as the background filter to your dataset will depend on the distribution of your dataset and analysis goal(s).
-
-For this example, we apply a background filter that removes genes whose expression is at or below the median expression level (calculated across all genes and all samples) in more than 80% of samples. In other words, we retain only genes whose expression exceeds the overall median in at least 20% of samples. Script to apply this filter is detailed below:
-
-```{r backfilt, message=F, warning=F, error=F}
-# First count the total number of samples, and save it as a value in the global environment
-nsamp <- ncol(countdata)
-
-# Then, calculate the median expression level across all genes and all samples, and save it as a value
-total_median <- median(as.matrix(countdata))
-
-
-# We need to temporarily add back in the Gene column to the countdata so we can filter for genes that pass the background filter
-countdata <- countdata %>% rownames_to_column("Gene")
-
-# Then we can apply a set of filters and organization steps (using the tidyverse) to result in a list of genes that have an expression greater than the total median in at least 20% of the samples
-genes_above_background <- countdata %>% # Start from the 'countdata' dataframe
- # Melt the data so that we have three columns: gene, sample ID, and expression counts
- pivot_longer(cols=!Gene, names_to = "sampleID", values_to="expression") %>%
- # Add a column that indicates whether the expression of a gene in the corresponding sample is above (1) or not above (0) the median of all count data
- mutate(above_median=ifelse(expression>total_median,1,0)) %>%
- group_by(Gene) %>% # Group the dataframe by gene
- # For each gene, count the number of samples where the expression was greater than the median of all count data
- summarize(total_above_median=sum(above_median)) %>%
- # Keep genes that have expression above the median in at least 20% of the samples
- filter(total_above_median>=.2*nsamp) %>%
- # Select just the genes that pass the filter
- select(Gene)
-
-# Then filter the original 'countdata' dataframe for only the genes above background.
-countdata <- left_join(genes_above_background, countdata, by="Gene")
-```
-
-Here, the `countdata` dataframe went from 30,146 rows of data (representing genes) to 16,664 rows (representing genes with expression levels that passed this background filter).
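The logic of this filter can be illustrated on a small toy matrix in base R (all numbers below are made up for illustration):

```{r backfilt-toy-sketch}
# Toy 4-gene x 5-sample count matrix (illustrative values only)
m <- matrix(c( 0,  0,  0,  0,  0,   # gene A: universally unexpressed
              50, 60, 55, 70, 65,   # gene B: well above background in all samples
               5,  0,  0,  0, 40,   # gene C: above background in a subset of samples
               2,  3,  1,  2,  2),  # gene D: low, but above the overall median in one sample
            nrow = 4, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), paste0("S", 1:5)))

total_median <- median(m)                           # median over all genes and samples
keep <- rowSums(m > total_median) >= 0.2 * ncol(m)  # above median in at least 20% of samples
rownames(m)[keep]                                   # gene A is removed; B, C, D are retained
```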
-
-
-
-### Sample Filtering
-Another common QA/QC check is to evaluate whether there are any samples that did not produce adequate RNA material to be measured using the technology employed. Thus, a sample filter can be applied to remove samples that have inadequate data. Here, we demonstrate this filter by checking to see whether there were any samples that resulted in mRNA expression values of zero across all genes. If any sample demonstrates this issue, it should be removed prior to any statistical analysis. Note, there are other filter cut-offs you can use depending on your specific study.
-
-Below is example script that checks for the presence of samples that meet the above criteria:
-```{r sampfilt, message=FALSE, warning=FALSE, error=FALSE}
-# Transpose filtered 'countdata', while keeping data in dataframe format, to allow for script that easily sums the total expression levels per sample
-countdata_T <- countdata %>%
- pivot_longer(cols=!Gene, names_to="sampleID",values_to="expression") %>%
- pivot_wider(names_from=Gene, values_from=expression)
-
-# Then add a column to the transposed countdata dataframe that sums expression across all genes for each sample
-countdata_T$rowsum <- rowSums(countdata_T[2:ncol(countdata_T)])
-
-# Remove samples that have no expression. All samples have some expression in this example, so all samples are retained.
-countdata_T <- countdata_T %>% filter(rowsum!=0)
-
-# Take the count data filtered for retained samples and remove the 'rowsum' column
-countdata_T <- countdata_T %>% select(!rowsum)
-
-# Then, transpose it back to the correct format for analysis
-countdata <- countdata_T %>%
- pivot_longer(cols=!sampleID, names_to = "Gene",values_to="expression") %>%
- pivot_wider(names_from = sampleID, values_from = "expression")
-```
-
-
-### Identifying & Removing Sample Outliers
-Prior to final statistical analysis, raw transcriptomic data are commonly evaluated for the presence of potential sample outliers. Outliers can result from experimental error, technical/measurement error, and/or large sources of biological variation. For many analyses, it is beneficial to remove such outliers to enhance the ability to identify biologically meaningful signals across the data. Here, we present two methods to check for the presence of sample outliers:
-
-**1. Principal component analysis (PCA)** can be used to identify potential outliers in a dataset through visualization of summary-level values illustrating reduced representations of the entire dataset. Note that a more detailed description of PCA is provided in **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means & PCA**. Here, PCA is run on the raw count data and further analyzed using scree plots, assessing principal components (PCs), and visualized using biplots displaying the first two principal components as a scatter plot.
-
-
-**2. Hierarchical clustering** is another approach that can be used to identify potential outliers. Hierarchical clustering aims to cluster data based on a similarity measure, defined by the function and/or specified by the user. There are several R packages and functions that will run hierarchical clustering, but it is often helpful to visualize the results in conjunction with a heatmap. Here, we use the package *pheatmap* (introduced in **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**) with hierarchical clustering across samples to identify potential outliers.
-
-
-Let's start by using PCA to identify potential outliers, while providing a visualization of potential sources of variation across the dataset.
-
-First, we need to move the Gene column back to the rownames so our dataframe is numeric and we can run the PCA script:
-```{r 06-Chapter6-43, message=FALSE, warning=FALSE, error=FALSE}
-countdata <- countdata %>% column_to_rownames("Gene")
-
-# Let's remind ourselves what these data look like
-countdata[1:10,1:5] #viewing first 10 rows and 5 columns
-```
-
-
-Then we can calculate principal components using transposed count data
-```{r 06-Chapter6-44}
-pca <- prcomp(t(countdata))
-```
-
-
-And visualize the percent variation captured by each principal component (PC) with a scree plot
-```{r 06-Chapter6-45, fig.align='center'}
-# We can generate a scree plot that shows the eigenvalues of each component, indicating how much of the total variation is captured by each component
-fviz_eig(pca, addlabels = TRUE)
-```
-
-This scree plot indicates that nearly all variation is explained by PC1 and PC2, so we are comfortable viewing these first two PCs when evaluating whether potential outliers exist in this dataset.
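The numbers behind a scree plot can also be pulled directly from the `prcomp` object. A sketch on a small random matrix standing in for the real count data:

```{r screevalues-sketch}
# Percent variance per PC, computed from a prcomp object (toy random data)
set.seed(1)
toy_pca <- prcomp(matrix(rnorm(100), nrow = 10))

pct_var <- 100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(pct_var, 1)          # percent variance explained by each PC
round(cumsum(pct_var), 1)  # cumulative percent variance
```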
-
-#### Visualization of Transcriptomic Data using PCA
-
-Further visualization of how these transcriptomic data appear through PCA can be produced through a scatter plot showing the data reduced values per sample:
-```{r 06-Chapter6-46, fig.align='center', warning = FALSE}
-# Calculate the percent variation captured by each PC
-pca_percent <- round(100*pca$sdev^2/sum(pca$sdev^2),1)
-
-# Make dataframe for PCA plot generation using first two components and the sample name
-pca_df <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], Sample=colnames(countdata))
-
-# Organize dataframe so we can color our points by the exposure condition
-pca_df <- pca_df %>% separate(Sample, into = c("mouse_num", "expo_cond"), sep="_")
-
-# Plot PC1 and PC2 for each sample and color the point by the exposure condition
-ggplot(pca_df, aes(PC1,PC2, color = expo_cond))+
- geom_hline(yintercept = 0, size=0.3)+
- geom_vline(xintercept = 0, size=0.3)+
- geom_point(size=3) +
- geom_text(aes(label=mouse_num), vjust =-1, size=4)+
- labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
- ggtitle("PCA for 4h Lung Pine Needles & Control Exposure Conditions")
-```
-
-With this plot, we can see that samples do not demonstrate obvious outlying behavior, with no individual sample positioned far apart from the rest. Therefore, our PCA indicates that there are unlikely to be any sample outliers in this dataset.
-
-
-#### Now let's implement hierarchical clustering to identify potential outliers
-
-First, we need to create a transposed version of `countdata`, such that samples are rows and genes are columns, to input into the clustering algorithm.
-```{r 06-Chapter6-47}
-countdata_for_clustering <- t(countdata)
-countdata_for_clustering[1:5,1:10] # Viewing what this transposed matrix looks like
-```
-
-
-Next we can run hierarchical clustering in conjunction with the generation of a heatmap. Note that we scale these data for improved visualization.
-```{r 06-Chapter6-48, fig.align='center'}
-pheatmap(scale(countdata_for_clustering), main="Hierarchical Clustering",
- cluster_rows=TRUE, cluster_cols = FALSE,
- fontsize_col = 7, treeheight_row = 60, show_colnames = FALSE)
-```
-
-Like the PCA findings, hierarchical clustering demonstrated an overall lack of potential sample outliers, as no samples grouped separately from the rest along the clustering dendrograms.
-Therefore, *neither approach points to outliers that should be removed in this analysis.*
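As a numeric complement to the heatmap, the sample dendrogram can be cut into a small number of clusters, and any tiny cluster flagged as a potential outlier. A base R sketch on toy data (on the real data, `scale(countdata_for_clustering)` would take the place of `toy`):

```{r cutree-outlier-sketch}
# Toy data: five similar samples plus one engineered outlier in the last row
set.seed(2)
toy <- rbind(matrix(rnorm(50, mean = 0), nrow = 5),
             matrix(rnorm(10, mean = 8), nrow = 1))
rownames(toy) <- paste0("S", 1:6)

hc <- hclust(dist(toy))       # hierarchical clustering on Euclidean distances
clusters <- cutree(hc, k = 2) # cut the dendrogram into two groups
table(clusters)               # a singleton cluster flags a potential outlier
```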
-
-
-
-
-### Answer to Environmental Health Question 2
-
-:::question
-*With this, we can now answer **Environmental Health Question #2***: When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?
-:::
-
-:::answer
-**Answer:** (1) Background filter to remove genes that are universally lowly expressed; (2) Sample filter to remove samples that may not have any detectable mRNA; (3) Sample outlier filter to remove samples with underlying data distributions outside of the overall, collective dataset.
-:::
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can now also answer **Environmental Health Question #3***: When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?
-:::
-
-:::answer
-**Answer:** Principal component analysis (PCA) and hierarchical clustering.
-:::
-
-
-
-## Controlling for Sources of Sample Heterogeneity
-Because these transcriptomic data were generated from mouse lung tissues, there is potential for these samples to show heterogeneity based on underlying shifts in cell populations (e.g., neutrophil influx) or other aspects of sample heterogeneity (e.g., batch effects from plating, among other sources that we may want to control for). For these kinds of complex samples, there are data processing methods that can be leveraged to minimize the influence of these sources of heterogeneity. Example methods include Remove Unwanted Variation (RUV), which is discussed here, as well as others (e.g., [Surrogate Variable Analysis (SVA)](https://academic.oup.com/nar/article/42/21/e161/2903156)).
-
-Here, we leverage the package *RUVSeq* to employ RUV on this sequencing dataset. Script was developed based on the [Bioconductor website](https://bioconductor.org/packages/release/bioc/html/RUVSeq.html), [vignette](http://bioconductor.org/packages/release/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf), and original [publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/).
-
-
-#### Steps in carrying out RUV using RUVseq on this example dataset:
-```{r 06-Chapter6-49, message=F, warning=F, error=F}
-# First we store the treatment IDs and exposure conditions as a separate vector
-ID <- coldata$ID
-
-# And differentiate our treatments and control conditions, first by grabbing the groups associated with each sample
-groups <- as.factor(coldata$Group)
-
-# Let's view all the groups
-groups
-
-# then setting a control label
-ctrl <- "Saline_4h_Lung"
-
-# and extracting a vector of just our treatment groups
-trt_groups <- setdiff(groups,ctrl)
-
-# let's view this vector
-trt_groups
-```
-
-*RUVSeq* contains its own set of plotting and normalization functions, though it requires as input an object of the S4 class SeqExpressionSet. Let's go ahead and make this object using the *RUVSeq* function `newSeqExpressionSet()`:
-```{r 06-Chapter6-50}
-exprSet <- newSeqExpressionSet(as.matrix(countdata),phenoData = data.frame(groups,row.names=colnames(countdata)))
-```
-
-
-And then use this object to generate some exploratory plots using built-in tools within *RUVseq*.
-First starting with relative log expression (RLE) box plots summarizing overall data distributions per sample:
-```{r 06-Chapter6-51, fig.align='center'}
-colors <- brewer.pal(4, "Set2")
-plotRLE(exprSet, outline=FALSE, ylim=c(-4, 4), col=colors[groups])
-```
-
-We can see from this plot that some of the samples show distributions that vary from the overall trend; for instance, one of the flaming pine needles-exposed samples (in orange) sits far lower than the rest.
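To make concrete what `plotRLE()` displays, relative log expression can be computed by hand: each gene's log-count minus that gene's median log-count across samples. A toy sketch (on the real data, `counts(exprSet)` would replace `m`):

```{r rle-byhand-sketch}
# Toy counts: samples S1-S3 are similar, S4 is globally shifted upward
m <- matrix(c(10, 12,  9, 40,
              20, 18, 22, 80,
               5,  6,  4, 21),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("g", 1:3), paste0("S", 1:4)))

log_m <- log(m + 1)                     # pseudo-count of 1 avoids log(0)
rle <- log_m - apply(log_m, 1, median)  # subtract each gene's median log-count
round(apply(rle, 2, median), 2)         # per-sample median RLE; S4 stands out
```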
-
-
-Then viewing a PCA plot of these samples:
-```{r 06-Chapter6-52, fig.align='center'}
-colors <- brewer.pal(4, "Set2")
-plotPCA(exprSet, col=colors[groups], cex=1.2)
-```
-
-This PCA plot shows reasonable data distributions, with samples mainly grouping by exposure condition (e.g., LPS), which is to be expected. With this, we can conclude that there may be some sources of unwanted variation, but not a large amount. Let's see what the data look like after running RUV.
-
-
-Now to actually run the RUVseq algorithm, to control for potential sources of sample heterogeneity, we need to first construct a matrix specifying the replicates (samples of the same exposure condition):
-```{r 06-Chapter6-53}
-# Construct a matrix specifying the replicates (samples of the same exposure condition) for running RUV
-differences <- makeGroups(groups)
-
-# Viewing this new matrix
-head(differences)
-```
-
-This matrix groups the samples by exposure condition. Here, each of the four rows represents one of the four exposure conditions, and each of the six columns represents a possible sample. Since the LPS exposure condition only had four samples, instead of six like the rest of the exposure conditions, a value of -1 is automatically used as a placeholder to fill out the matrix. The samples in the matrix are identified by the index of the sample in the previously defined 'groups' factor that was used to generate the matrix. For example, the PineNeedlesSmolder_4h_Lung samples are the first six samples contained in the 'groups' factor, so in the matrix, samples of this exposure condition are identified as '1', '2', '3', '4', '5', and '6'.
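The structure returned by `makeGroups()` can be reconstructed in a few lines of base R on a toy factor, which makes the -1 padding explicit (a sketch of the idea, not the package's actual implementation):

```{r makegroups-toy-sketch}
# Toy factor with unequal group sizes: three "A" samples and two "B" samples
grp <- factor(c("A", "A", "A", "B", "B"))

idx <- split(seq_along(grp), grp)   # sample indices per group
width <- max(lengths(idx))          # size of the largest group
mat <- t(sapply(idx, function(i) c(i, rep(-1, width - length(i)))))
mat                                 # row "B" is padded with -1
```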
-
-
-Let's now implement the RUVseq algorithm and, for this example, capture one factor (k=1) of unwanted variation. Note that the k parameter can be modified to capture additional factors, if necessary.
-```{r 06-Chapter6-54}
-# Now capture 1 factor (k=1) of unwanted variation
-ruv_set <- RUVs(exprSet, rownames(countdata), k=1, differences)
-```
-
-
-This results in a list of objects within `ruv_set`, which include the following important pieces of information:
-
-(1) Estimated factors of unwanted variation are provided in the phenoData object, as viewed using the following:
-```{r 06-Chapter6-55}
-# viewing the estimated factors of unwanted variation in the column W_1
-pData(ruv_set)
-```
-
-
-(2) Normalized counts obtained by regressing the original counts on the unwanted factors (normalizedCounts object within `ruv_set`). Note that the normalized counts should only be used for exploratory purposes and not for subsequent differential expression analyses. For additional information on this topic, please refer to the official *RUVSeq* documentation. The normalized counts can be viewed using the following:
-```{r 06-Chapter6-56}
-# Viewing the head of the normalized count data, accounting for unwanted variation
-head(normCounts(ruv_set))
-```
-
-
-Let's again generate an exploratory plot using this updated dataset, focusing on the RLE view since that was the most informative pre-RUV. Here are the updated RLE plots summarizing overall data distributions per sample:
-```{r 06-Chapter6-57, fig.align='center'}
-colors <- brewer.pal(4, "Set2")
-plotRLE(ruv_set, outline=FALSE, ylim=c(-4, 4), col=colors[groups])
-```
-
-This plot shows overall tighter data that are more similarly distributed across samples. Therefore, this RUV step appears to have improved the overall distribution of this dataset. It is important not to over-correct/over-smooth your datasets, so implement these types of pre-processing steps with caution. One strategy that we commonly employ to gauge whether data smoothing is needed/applied correctly is to run the statistical models with and without correction for potential sources of heterogeneity, and critically evaluate similarities vs. differences in the results.
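That with/without comparison often comes down to simple set operations on the two lists of significant genes. A sketch with hypothetical gene lists (the gene names below are made up):

```{r ruv-compare-sketch}
# Hypothetical DEG lists from models run with and without the RUV factor
degs_with_ruv    <- c("Tnf", "Il6", "Cxcl1", "Saa3")
degs_without_ruv <- c("Tnf", "Il6", "Mt1")

intersect(degs_with_ruv, degs_without_ruv)  # genes robust to the correction
setdiff(degs_with_ruv, degs_without_ruv)    # genes gained after correction
setdiff(degs_without_ruv, degs_with_ruv)    # genes lost after correction
```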
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can now answer **Environmental Health Question #4***: What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?
-:::
-
-:::answer
-**Answer:** Remove unwanted variation (RUV), among other approaches, including surrogate variable analysis (SVA).
-:::
-
-
-
-
-## Identifying Genes that are Significantly Differentially Expressed by Environmental Exposure Conditions (e.g., Biomass Smoke Exposure)
-At this point, we have completed several data pre-processing, QA/QC, and additional steps to prepare our example transcriptomics data for statistical analysis. And finally, we are ready to run the overall statistical model to identify genes that are altered in expression in association with different biomass burn conditions.
-
-Here we leverage the *DESeq2* package to carry out these statistical comparisons. This package is one of the most commonly implemented analysis pipelines for transcriptomic data, including sequencing data as well as transcriptomic data produced via other technologies (e.g., NanoString, Fluidigm, and other gene expression technologies). This package is extremely well documented, and we encourage trainees to leverage these resources in parallel with the current training module when carrying out their own transcriptomics analyses in R:
-
-
-+ [Bioconductor website](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
-+ [Vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
-+ [Manual](https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf)
-+ Primary citation: Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: [25516281](https://pubmed.ncbi.nlm.nih.gov/25516281/)
-
-
-In brief, the basic calculations employed within the DESeq2 underlying algorithms include the following:
-
-**1. Estimate size factors.**
-In the first step, size factors are estimated to help account for potential differences in the sequencing depth across samples. It is similar to a normalization parameter in the model.
-
-**2. Normalize count data.**
-DESeq2 employs different normalization algorithms depending on the parameters selected and the stage of analysis. The most commonly employed method is called the **median of ratios**, which takes into account sequencing depth and RNA composition, as described [here](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html). Specifically, normalized values are calculated as counts divided by sample-specific size factors, where each size factor is the median ratio of that sample's gene counts relative to the per-gene geometric mean. DESeq2 then transforms these data using variance stabilization within the final statistical model. Because of these two steps, we prefer to export both the median-of-ratios normalized data and the variance stabilization transformed data, to save in our records and use when generating plots of expression levels for specific genes of interest. These steps are detailed below.
-
-**3. Estimate dispersion.**
-The dispersion estimate takes into account the relationship between the variance of an observed count and its mean value. It is similar to a variance parameter. In DESeq2, dispersion is estimated using a maximum likelihood and empirical Bayes approach.
-
-**4. Fit negative binomial generalized linear model (GLM).**
-Finally, a negative binomial model is fit for each gene using the design formula that will be described within the following code. The Wald test is performed to test whether log fold changes in expression (typically calculated as log(average exposed / average unexposed)) significantly differ from zero. Statistical p-values are reported from this test and are also adjusted for multiple testing using the Benjamini-Hochberg procedure.
-
-Note that these calculations, among others, are embedded within the DESeq2 functions, so we do not need to code them ourselves. Instead, we just need to make sure that we set up the DESeq2 functions correctly, such that these calculations are carried out appropriately in our final transcriptomics analyses.
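To demystify step 2, the median-of-ratios size factors can be computed by hand in base R on a toy count matrix (values chosen so that S2 has exactly twice, and S3 half, the sequencing depth of S1):

```{r medianratios-sketch}
# Toy counts: three genes across three samples with different sequencing depths
cts <- matrix(c(100, 200,  50,
                300, 600, 150,
                 10,  20,   5),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("g", 1:3), paste0("S", 1:3)))

geo_means <- exp(rowMeans(log(cts)))   # geometric mean per gene (pseudo-reference sample)
ratios <- cts / geo_means              # each count relative to its gene's reference
size_factors <- apply(ratios, 2, median)
size_factors                           # S1 = 1, S2 = 2, S3 = 0.5

sweep(cts, 2, size_factors, "/")       # normalized counts, comparable across samples
```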
-
-
-#### Setting up the DESeq2 experiment
-Here we provide example script that is used to identify which genes are significantly differentially expressed in association with the example biomass smoke exposures, smoldering pine needles and flaming pine needles, as well as a positive inflammation control, LPS.
-
-First, we need to set up the DESeq2 experiment:
-```{r 06-Chapter6-58, message=FALSE, warning=FALSE, error=FALSE}
-# Set up our experiment using our RUV adjusted count and phenotype data.
-# Our design indicates that our count data are dependent on the exposure condition (groups variable) and our factor of unwanted variation, and we have specified that there be no intercept term through the use of '~0'
-dds <- DESeqDataSetFromMatrix(countData = counts(ruv_set), # Grabbing count data from the 'ruv_set' object
- colData = pData(ruv_set), # Grabbing the phenotype data and corresponding factor of unwanted variation from the 'ruv_set' object
- design = ~0+groups+W_1) # Setting up the statistical formula (see below)
-```
-
-For the formula design, we use a '~0' at the front to not include an intercept term, and then also account for the exposure condition (groups) and the previously calculated factors of unwanted variation (W_1) of the samples. Formula design is an important step and should be carefully considered for each individual analysis. Other resources, including official *DESeq2* documentation, are available for consultation regarding formula design, as the specifics of formula design are beyond the scope of this training module.
-
-It is worth noting that, by default, *DESeq2* will use the last variable in the design formula (`W_1` in this case) as the default variable to be output from the `results()` function. Additionally, if the variable is categorical, it will display results comparing the reference level to the last level of that variable. To get results for other variables, or to see other comparisons within a categorical variable, we can use the `contrast` parameter, which will be demonstrated below.
-
-
-#### Estimating size factors
-```{r, message=FALSE, warning=FALSE, error=FALSE}
-# Estimate size factors from the dds object that was just created as the experiment above
-dds <- estimateSizeFactors(dds)
-sizeFactors(dds) # viewing the size factors
-```
-
-#### Calculating and exporting normalized counts
-
-Here, we extract normalized counts and variance stabilized counts.
-```{r, message=FALSE, warning=FALSE, error=FALSE}
-# Extract normalized count data
-normcounts <- as.data.frame(counts(dds, normalized=TRUE))
-
-# Transforming normalized counts through variance stabilization
-vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
-vsd_matrix <- as.matrix(assay(vsd))
-```
-
-We could also export them using code such as:
-```{r 06-Chapter6-59, eval = FALSE}
-# Export data
-write.csv(normcounts, "Chapter_6/Module6_2_Input/Module6_2_Output_NormalizedCounts.csv")
-write.csv(vsd_matrix, "Chapter_6/Module6_2_Input/Module6_2_Output_VSDCounts.csv", row.names=TRUE)
-```
-
-
-#### Running the final DESeq2 experiment
-Here, we are finally ready to run the actual statistical comparisons (exposed vs control samples) to calculate fold changes and p-values that describe the degree to which each gene may or may not be altered at the expression level in association with treatment.
-
-For this example, we would like to run three different comparisons:
-(1) Smoldering Pine Needles vs. Control
-(2) Flaming Pine Needles vs. Control
-(3) LPS vs. Control
-which we can easily code for using a loop function, as detailed below.
-
-Note that we have commented out the line of code for writing out the CSV because we do not need it for the rest of the module, but this could be used if you need to write out and view results in an external application such as Excel for supplementary materials.
-
-```{r 06-Chapter6-60, message=FALSE, warning=FALSE, error=FALSE}
-# Run experiment
-dds_run <- DESeq(dds, betaPrior=FALSE)
-
-# Loop through and extract and export results for all contrasts (treatments vs. control)
-for (trt in trt_groups){ # Iterate for each of the treatments listed in 'trt_groups'
- cat(trt) # Print which treatment group we are on in the loop
- res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups",trt,ctrl)) # Extract the results of the DESeq2 analysis specifically for the comparison of the treatment group for the current iteration of the loop with the control group
- summary(res) # Print out a high-level summary of the results
- ordered <- as.data.frame(res[order(res$padj),]) # Make a dataframe of the results and order them by adjusted p-value from lowest to highest
- top10 <- head(ordered, n=10) # Make dataframe of the first ten rows of the ordered results
- cat("\nThe 10 most significantly differentially expressed genes by adjusted p-value:\n\n")
- print(top10) # View the first ten rows of the ordered results
- pfilt.05 <- nrow(ordered %>% filter(padj<0.05)) # Get the number of genes that are significantly differentially expressed where padj < 0.05
- cat("\nThe number of genes showing significant differential expression where padj < 0.05 is ", pfilt.05)
- pfilt.10 <- nrow(ordered %>% filter(padj<0.1)) # Get the number of genes that are significantly differentially expressed where padj < 0.10
- cat("\nThe number of genes showing significant differential expression where padj < 0.10 is ", pfilt.10,"\n\n")
- # write.csv(ordered, paste0("Module6_2_Output_StatisticalResults_",trt ,".csv")) ## Export the full dataframe of ordered results as a csv
-}
-```
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can now answer **Environmental Health Question #5***: How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-:::
-
-:::answer
-**Answer:** 515 genes
-:::
-
-### Answer to Environmental Health Question 6
-:::question
-*With this, we can also now answer **Environmental Health Question #6***: How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-:::
-
-:::answer
-**Answer:** 679 genes
-:::
-
-### Answer to Environmental Health Question 7
-:::question
-*And, we can answer **Environmental Health Question #7***: How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
-:::
-
-:::answer
-**Answer:** 4,813 genes
-:::
-
-
-*Together, we find that exposure to both flaming and smoldering of pine needles caused substantial disruptions in gene expression profiles. LPS serves as a positive control for inflammation and produced the greatest transcriptomic response.*
-
-
-
-
-## Visualizing Statistical Results using MA Plots
-[MA plots](https://en.wikipedia.org/wiki/MA_plot) are a common visualization method that illustrates differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales and then plotting these values.
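-As a quick sketch of the underlying transformation (made-up values for two hypothetical samples, not this module's data; note that the plot below uses DESeq2's normalized mean count on the x-axis rather than the classic A value):
-
-```{r}
-# Made-up expression values for two hypothetical samples
-x <- c(10, 100, 1000)
-y <- c(20, 100, 500)
-
-M <- log2(x / y)        # log ratio (y-axis): -1, 0, 1
-A <- 0.5 * log2(x * y)  # mean average (x-axis)
-```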
-
-Here, we leverage MA plots to show how log fold changes relate to expression levels. In these plots, the log fold change is plotted on the y-axis and expression values are plotted along the x-axis, and dots are colored according to statistical significance (using padj<0.05 as the statistical filter). Here we will generate an MA plot for Flaming Pine Needles.
-
-```{r 06-Chapter6-61, message=F, warning=F, error=F, fig.align='center'}
-
-res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups","PineNeedlesFlame_4h_Lung",ctrl)) # Re-extract the DESeq2 results for the flaming pine needles
-MA <- data.frame(res) # Make a preliminary dataframe of the flaming pine needle results
-MA_ns <- MA[ which(MA$padj>=0.05),] # Non-significant genes to plot
-MA_up <- MA[ which(MA$padj<0.05 & MA$log2FoldChange > 0),] # Significant up-regulated genes to plot
-MA_down <- MA[ which(MA$padj<0.05 & MA$log2FoldChange < 0),] #Significant down-regulated genes to plot
-
-ggplot(MA_ns, aes(x = baseMean, y = log2FoldChange)) + # Plot data with counts on x-axis and log2 fold change on y-axis
- geom_point(color="gray75", size = .5) + # Set point size and color
-
- geom_point(data = MA_up, color="firebrick", size=1, show.legend = TRUE) + # Plot the up-regulated significant genes
- geom_point(data = MA_down, color="dodgerblue2", size=1, show.legend = TRUE) + # Plot down-regulated significant genes
-
- theme_bw() + # Change theme of plot from gray to black and white
-
- # We want to log10 transform x-axis for better visualizations
- scale_x_continuous(trans = "log10", breaks=c(1,10,100, 1000, 10000, 100000, 1000000), labels=c("1","10","100", "1000", "10000", "100000", "1000000")) +
- # We will bound y axis as well to better fit data while not leaving out too many points
- scale_y_continuous(limits=c(-2, 2)) +
-
- xlab("Expression (Normalized Count)") + ylab(expression(Log[2]*" Fold Change")) + # Add labels for axes
- geom_hline(yintercept=0) # Add horizontal line at 0
-```
-
-An appropriate title for this figure could be:
-
-“**Figure X. MA plot of fold change in expression as a function of gene expression resulting from 4 hours of exposure to flaming pine needles in mouse lung tissue.** Significantly upregulated genes (log~2~FC > 0 and p adjust < 0.05) are shown in red and significantly downregulated genes (log~2~FC < 0 and p adjust < 0.05) are shown in blue. Genes without significant differential expression are displayed in gray.”
-
-
-## Visualizing Statistical Results using Volcano Plots
-
-Similar to MA plots, volcano plots provide visualizations of fold changes in expression from transcriptomic data. However, instead of plotting these values against expression, log fold change is plotted against (adjusted) p-values in volcano plots. Here, we use functions within the *[EnhancedVolcano package](https://www.rdocumentation.org/packages/EnhancedVolcano/versions/1.11.3/topics/EnhancedVolcano)* to generate a volcano plot for Flaming Pine Needles.
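-The coordinates of a volcano plot can be sketched directly (toy fold changes and adjusted p-values, not this module's results):
-
-```{r}
-# Toy fold changes and adjusted p-values, for illustration only
-l2fc <- c(-1.5, 0.1, 2.0)
-padj <- c(0.001, 0.600, 0.010)
-
-neg_log_p <- -log10(padj)            # y-axis of a volcano plot
-sig <- padj < 0.05 & abs(l2fc) > 1   # example significance cut-offs
-data.frame(l2fc, neg_log_p, sig)
-```
-
-Genes in the upper left and upper right corners of such a plot (large |log~2~FC| and small padj) are the ones typically highlighted.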
-
-Running the `EnhancedVolcano()` function to generate an example volcano plot:
-```{r 06-Chapter6-62, message=FALSE, warning=FALSE, error=FALSE, fig.align='center', out.width = 700, out.height = 580}
-Vol <- data.frame(res) # Dataset to use for plotting
-
-EnhancedVolcano(Vol,
- lab = rownames(res), # Label significant genes from dataset (can be a column name)
- x = 'log2FoldChange', # Column name in dataset with l2fc information
- y = 'padj', # Column name in dataset with adjusted p-value information
- ylab = "-Log(FDR-adjusted p value)", # Y-axis label
- pCutoff= 0.05, # Set p-value cutoff
- ylim=c(0,5), # Limit y-axis for better plot visuals
- xlim=c(-2,2), # Limit x-axis (similar to in MA plot y-axis)
- title= NULL, # Removing title
- subtitle = NULL, # Removing subtitle
- legendPosition = 'bottom') # Put legend on bottom
-```
-
-
-An appropriate title for this figure could be:
-
-“**Figure X. Volcano plot of lung genes resulting from 4 hours of exposure to flaming pine needles.** Genes are colored according to the level of significant differential expression in exposed vs unexposed (vehicle control) samples, using the following statistical cut-offs: p adjust (multiple test corrected p-value) < 0.05 and fold change (FC) ≥ ±1.3 (log~2~FC ≥ ±0.3785).”
-
-
-
-## Interpreting Findings at the Systems Level through Pathway Enrichment Analysis
-
-Pathway enrichment analysis is a very helpful tool that can be applied to interpret transcriptomic changes of interest in terms of systems biology. In these types of analyses, gene lists of interest are used to identify biological pathways that include genes present in your dataset more often than expected by chance alone. There are many tools that can be used to carry out pathway enrichment analyses. Here, we are using the R package, *PIANO*, to carry out the statistical enrichment analysis based on the lists of genes we previously identified with differential expression associated with flaming pine needles exposure.
-
-To detail, the following input data are required to run *PIANO*:
-(1) Your background gene sets, which represent all genes queried from your experiment (aka your 'gene universe')
-
-(2) The list of genes you are interested in evaluating pathway enrichment of; here, this represents the genes identified with significant differential expression associated with flaming pine needles
-
-(3) An underlying pathway dataset; here, we're using the KEGG PATHWAY Database ([KEGG](https://www.genome.jp/kegg/pathway.html)), summarized through the Molecular Signature Database ([MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/)) into pre-formatted input files (.gmt) ready for *PIANO*.
-
-*Let's organize these three required data inputs.*
-
-
-(1) Background gene set:
-```{r 06-Chapter6-63}
-# First, grab the rownames of the 'res' object (redefined above as the DESeq2 results
-# for flaming pine needles prior to MA plot generation), and use gsub() to strip the
-# BioSpyder numeric identifier while retaining the gene symbol, saving these IDs
-# into a new column of the 'res' object ('id')
-res$id <- gsub("_.*", "", rownames(res))
-
-# Because these IDs now contain duplicate gene symbols, we need to remove duplicates
-# One way to do this is to preferentially retain rows of data with the largest fold change (it doesn't really matter here, because we're just identifying unique genes within the background set)
-res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] # sort by id and reverse of abs(log2foldchange)
-res.ordered <- res.ordered[ !duplicated(res.ordered$id), ] # removing gene duplicates
-
-# Setting this as the background list
-Background <- toupper(as.character(res.ordered$id))
-Background[1:200] # viewing the first 200 genes in this background list
-```
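-The sort-then-deduplicate idiom used above can be seen on a toy table (hypothetical IDs and fold changes): because rows are first ordered by ID and then by descending absolute fold change, `!duplicated()` retains the row with the largest absolute fold change for each gene:
-
-```{r}
-# Toy example of the order + !duplicated idiom (hypothetical values)
-df <- data.frame(id = c("GeneA", "GeneA", "GeneB"),
-                 log2FoldChange = c(0.5, -2.0, 1.0))
-
-df <- df[order(df$id, -abs(df$log2FoldChange)), ]  # largest |log2FC| first within each id
-df[!duplicated(df$id), ]                           # one row per gene remains
-```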
-
-(2) The list of genes identified with significant differential expression associated with flaming pine needles:
-```{r 06-Chapter6-64}
-# Similar to the above script, but starting with the res$id object
-# and filtering for genes with padj < 0.05
-
-res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] #sort by id and reverse of abs(log2FC)
-SigGenes <- toupper(as.character(res.ordered[which(res.ordered$padj<.05),"id"])) # pulling the genes with padj < 0.05
-SigGenes <- SigGenes[ !duplicated(SigGenes)] # removing gene duplicates
-
-length(SigGenes) # viewing the length of this significant gene list
-```
-
-Therefore, this gene set includes 488 *unique* genes significantly associated with the Flaming Pine Needles condition, based on padj<0.05.
-
-
-(3) The underlying KEGG pathway dataset.
-Note that this file was simply downloaded from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/), ready for upload as a .gmt file. Here, we use the `loadGSC()` function enabled through the *PIANO* package to upload and organize these pathways.
-```{r 06-Chapter6-65}
-KEGG_Pathways <- loadGSC(file="Chapter_6/Module6_2_Input/Module6_2_InputData3_KEGGv7.gmt", type="gmt")
-
-length(KEGG_Pathways$gsc) # viewing the number of biological pathways contained in the database
-```
-This KEGG pathway database therefore includes 186 biological pathways available to query.
-
-
-With these data inputs ready, we can now run the pathway enrichment analysis. The enrichment statistic commonly employed through the *PIANO* package is based on a hypergeometric test, run through the `runGSAhyper()` function. This returns a p-value for each gene set, from which you can determine enrichment status.
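-The hypergeometric statistic itself can be reproduced in base R with `phyper()`. As a minimal sketch with made-up numbers (not this module's results), suppose a universe of 3,000 genes contains a 100-gene pathway, and 30 of 400 significant genes fall in that pathway:
-
-```{r}
-overlap   <- 30    # significant genes that are in the pathway
-path_size <- 100   # genes in the pathway
-universe  <- 3000  # genes in the background set
-n_sig     <- 400   # significant genes overall
-
-# P(observing >= 30 pathway genes among the significant genes by chance)
-phyper(overlap - 1, path_size, universe - path_size, n_sig, lower.tail = FALSE)
-```
-
-An observed overlap well above the expected value (here, 400 × 100 / 3000 ≈ 13 genes) yields a small p-value, indicating enrichment.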
-```{r 06-Chapter6-66, message=F, warning=F, error=F}
-# Running the piano function based on the hypergeometric statistic
-Results_GSA <- piano::runGSAhyper(genes=SigGenes, universe=Background,gsc=KEGG_Pathways, gsSizeLim=c(1,Inf), adjMethod = "fdr")
-
-# Pulling the pathway enrichment results into a separate dataframe
-PathwayResults <- as.data.frame(Results_GSA$resTab)
-
-# Viewing the top of these pathway enrichment results (which are not ordered at the moment)
-head(PathwayResults)
-
-```
-This dataframe therefore summarizes the enrichment p-value for each pathway, FDR adjusted p-value, number of significant genes in the gene set that intersect with genes in the pathway, etc.
-
-
-With these results, let's identify which pathways meet a statistical enrichment p-value filter of 0.05:
-```{r 06-Chapter6-67}
-SigPathways <- PathwayResults[which(PathwayResults$`p-value` < 0.05),]
-rownames(SigPathways)
-```
-
-
-### Answer to Environmental Health Question 8
-:::question
-*With this, we can now answer **Environmental Health Question #8***: What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?
-:::
-
-:::answer
-**Answer:** Biological pathways involved in cardiopulmonary function (e.g., arrhythmogenic right ventricular cardiomyopathy, hypertrophic cardiomyopathy, vascular smooth muscle contraction), carcinogenesis signaling (e.g., Wnt signaling pathway, hedgehog signaling pathway), and hormone signaling (e.g., Gnrh signaling pathway), among others.
-:::
-
-
-
-## Concluding Remarks
-
-In this module, users are guided through the uploading, organization, QA/QC, statistical analysis, and systems level analysis of an example -omics dataset based on transcriptomic responses to biomass burn scenarios, representing environmental exposure scenarios of growing concern worldwide. It is worth noting that the methods described herein represent a fraction of the approaches and tools that can be leveraged in the analysis of -omics datasets, and methods should be tailored to the goals of each individual analysis. For additional example research projects that have leveraged -omics and systems biology to address environmental health questions, see the following select relevant publications:
-
-
-**Genomic publications evaluating gene-environment interactions and relations to disease etiology:**
-
-+ Balik-Meisner M, Truong L, Scholl EH, La Du JK, Tanguay RL, Reif DM. Elucidating Gene-by-Environment Interactions Associated with Differential Susceptibility to Chemical Exposure. Environ Health Perspect. 2018 Jun 28;126(6):067010. PMID: [29968567](https://pubmed.ncbi.nlm.nih.gov/29968567/).
-
-+ Ward-Caviness CK, Neas LM, Blach C, Haynes CS, LaRocque-Abramson K, Grass E, Dowdy ZE, Devlin RB, Diaz-Sanchez D, Cascio WE, Miranda ML, Gregory SG, Shah SH, Kraus WE, Hauser ER. A genome-wide trans-ethnic interaction study links the PIGR-FCAMR locus to coronary atherosclerosis via interactions between genetic variants and residential exposure to traffic. PLoS One. 2017 Mar 29;12(3):e0173880. PMID: [28355232](https://pubmed.ncbi.nlm.nih.gov/28355232/).
-
-
-**Transcriptomic publications evaluating gene expression responses to environmental exposures and relations to disease etiology:**
-
-+ Chang Y, Rager JE, Tilton SC. Linking Coregulated Gene Modules with Polycyclic Aromatic Hydrocarbon-Related Cancer Risk in the 3D Human Bronchial Epithelium. Chem Res Toxicol. 2021 Jun 21;34(6):1445-1455. PMID: [34048650](https://pubmed.ncbi.nlm.nih.gov/34048650/).
-
-+ Chappell GA, Rager JE, Wolf J, Babic M, LeBlanc KJ, Ring CL, Harris MA, Thompson CM. Comparison of Gene Expression Responses in the Small Intestine of Mice Following Exposure to 3 Carcinogens Using the S1500+ Gene Set Informs a Potential Common Adverse Outcome Pathway. Toxicol Pathol. 2019 Oct;47(7):851-864. PMID: [31558096](https://pubmed.ncbi.nlm.nih.gov/31558096/).
-
-+ Manuck TA, Eaves LA, Rager JE, Fry RC. Mid-pregnancy maternal blood nitric oxide-related gene and miRNA expression are associated with preterm birth. Epigenomics. 2021 May;13(9):667-682. PMID: [33890487](https://pubmed.ncbi.nlm.nih.gov/33890487/).
-
-
-**Epigenomic publications** evaluating microRNA, CpG methylation, and/or histone methylation responses to environmental exposures and relations to disease etiology:
-
-+ Chappell GA, Rager JE. Epigenetics in chemical-induced genotoxic carcinogenesis. Curr Opinion Toxicol. [2017 Oct; 6:10-17](https://www.sciencedirect.com/science/article/abs/pii/S2468202017300396).
-
-+ Rager JE, Bailey KA, Smeester L, Miller SK, Parker JS, Laine JE, Drobná Z, Currier J, Douillet C, Olshan AF, Rubio-Andrade M, Stýblo M, García-Vargas G, Fry RC. Prenatal arsenic exposure and the epigenome: altered microRNAs associated with innate and adaptive immune signaling in newborn cord blood. Environ Mol Mutagen. 2014 Apr;55(3):196-208. PMID: [24327377](https://pubmed.ncbi.nlm.nih.gov/24327377/).
-
-+ Rager JE, Bauer RN, Müller LL, Smeester L, Carson JL, Brighton LE, Fry RC, Jaspers I. DNA methylation in nasal epithelial cells from smokers: identification of ULBP3-related effects. Am J Physiol Lung Cell Mol Physiol. 2013 Sep 15;305(6):L432-8. PMID: [23831618](https://pubmed.ncbi.nlm.nih.gov/23831618/).
-
-+ Smeester L, Rager JE, Bailey KA, Guan X, Smith N, García-Vargas G, Del Razo LM, Drobná Z, Kelkar H, Stýblo M, Fry RC. Epigenetic changes in individuals with arsenicosis. Chem Res Toxicol. 2011 Feb 18;24(2):165-7. PMID: [21291286](https://pubmed.ncbi.nlm.nih.gov/21291286/).
-
-
-**Metabolomic publications** evaluating changes in the metabolome in response to environmental exposures and involved in disease etiology:
-
-+ Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: [24413286](https://pubmed.ncbi.nlm.nih.gov/24413286/).
-
-+ Manuck TA, Lai Y, Ru H, Glover AV, Rager JE, Fry RC, Lu K. Metabolites from midtrimester plasma of pregnant patients at high risk for preterm birth. Am J Obstet Gynecol MFM. 2021 Jul;3(4):100393. PMID: [33991707](https://pubmed.ncbi.nlm.nih.gov/33991707/).
-
-
-**Microbiome publications** evaluating changes in microbiome profiles in relation to the environment and human disease:
-
-+ Chi L, Bian X, Gao B, Ru H, Tu P, Lu K. Sex-Specific Effects of Arsenic Exposure on the Trajectory and Function of the Gut Microbiome. Chem Res Toxicol. 2016 Jun 20;29(6):949-51. PMID: [27268458](https://pubmed.ncbi.nlm.nih.gov/27268458/).
-
-+ Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012 Mar 13;13(4):260-70. PMID: [22411464](https://pubmed.ncbi.nlm.nih.gov/22411464/).
-
-+ Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: [24413286](https://pubmed.ncbi.nlm.nih.gov/24413286/).
-
-
-
-**Exposome publications** evaluating changes in chemical signatures in relation to the environment and human disease:
-
-+ Rager JE, Strynar MJ, Liang S, McMahen RL, Richard AM, Grulke CM, Wambaugh JF, Isaacs KK, Judson R, Williams AJ, Sobus JR. Linking high resolution mass spectrometry data with exposure and toxicity forecasts to advance high-throughput environmental monitoring. Environ Int. 2016 Mar;88:269-280. PMID: [26812473](https://pubmed.ncbi.nlm.nih.gov/26812473/).
-
-+ Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A. The blood exposome and its role in discovering causes of disease. Environ Health Perspect. 2014 Aug;122(8):769-74. PMID: [24659601](https://pubmed.ncbi.nlm.nih.gov/24659601/).
-
-+ Viet SM, Falman JC, Merrill LS, Faustman EM, Savitz DA, Mervish N, Barr DB, Peterson LA, Wright R, Balshaw D, O'Brien B. Human Health Exposure Analysis Resource (HHEAR): A model for incorporating the exposome into health studies. Int J Hyg Environ Health. 2021 Jun;235:113768. PMID: [34034040](https://pubmed.ncbi.nlm.nih.gov/34034040/).
-
-
-
-
-
-
-:::tyk
-Using "Module6_2_TYKInput1.csv" (gene counts) and "Module6_2_TYKInput2.csv" (sample info) datasets, which have already been run through the QC process described in this module and are ready for analysis:
-
-1. Conduct a differential expression analysis associated with "Season" using DESeq2. (Don't worry about including any covariates or using RUV).
-2. Find the number of significant differentially expressed genes associated with "Season", at the .05 level.
-:::
-
-# 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation
-
-This training module was developed by Dr. Lauren Eaves, Dr. Kyle Roell, and Dr. Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-
-## Introduction to Training Module
-
-Historically, toxicology and epidemiology studies have largely focused on analyzing relationships between one chemical and one outcome at a time. This is still important in identifying the degree to which a single chemical exposure is associated with a disease outcome (e.g., the [UNC Superfund Research Program's](https://sph.unc.edu/superfund-pages/srp/) focus on inorganic arsenic exposure and its influence on metabolic disease). However, we are exposed, every day, to many different stressors in our environment. It is therefore critical to deconvolute how co-occurring stressors (i.e., mixtures) in our environment impact human health! The field of mixtures research continues to grow to address this need, with the goal of developing methods to study environmental exposures using approaches that better capture the mixture of exposures humans experience in real life. In this module, we will provide an overview of mixtures analysis methods and demonstrate how to use one of these methods, quantile g-computation, to analyze chemical mixtures in a large geospatial epidemiologic study.
-
-
-## Overview of Mixtures Analysis
-
-### Mixtures Methods Relevance and Challenges
-
-**Mixtures approaches are recently becoming more routine in environmental health because methodological advancements are just now making mixtures research more feasible.** These advancements parallel the following:
-
-+ Advances in the ability to measure many different chemicals (e.g., through suspect screening and non-targeted chemical analysis approaches) and stressors (e.g., through improved collection and storage of survey data and clinical data) in our environment
-+ Improvements in data science to organize, store, and analyze big data
-+ Developments in statistical methodologies to parse relationships within these data
-
-Though statistical methodologies are still evolving, we will be discussing our current knowledge in this module.
-
-**Some challenges that data analysts may experience when analyzing data from mixtures studies include the following:**
-
-1. Size of mixture:
-+ As the number of components evaluated increases, your available analysis methods and statistical power may decrease
-
-2. Correlated data structure:
-+ Statistical challenge of collinearity: If data include large amounts of collinearity, this may dampen the observed effects from components that are highly correlated with other components (e.g., those that commonly co-occur)
-+ Methodological challenge of co-occurring contaminant confounding: Co-occurring contaminant confounding may make it difficult to discern the true driver of an observed effect.
-
-3. Data analysis method selection:
-+ There are many different methods to choose from!
-+ A critical rule to address this challenge is to, first and foremost, *lay out your study's question*. This question will then help guide your method selection, as discussed below.
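-The collinearity challenge above can be made concrete with a small simulation (entirely synthetic data): when two "exposures" are highly correlated, the standard error of each individual coefficient inflates, even though both exposures truly affect the outcome:
-
-```{r}
-set.seed(1)
-n  <- 500
-x1 <- rnorm(n)
-x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is nearly collinear with x1
-y  <- x1 + x2 + rnorm(n)       # both exposures truly affect the outcome
-
-cor(x1, x2)  # close to 1
-
-# Standard error of x1's coefficient, without vs. with the collinear covariate
-summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
-summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]  # substantially larger
-```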
-
-
-### Overview of Mixtures Methods
-
-There are many methods that can be implemented to elucidate relationships between individual chemicals/chemical groups in complex mixtures and their resulting toxicity/health effects. Some of the more common methods used in mixtures analyses, as identified by our team, are summarized in the below figure according to potential questions that could be asked in a study. Two of the methods, quantile-based g-computation (qgcomp) and Bayesian kernel machine regression (BKMR), are highlighted as example mixtures scripted activities (qgcomp in this script and BKMR in Mixtures Methods 2). Throughout TAME 2.0 training materials, other methods are included such as Principal Component Analysis (PCA), K-means clustering, hierarchical clustering, and predictive modeling / machine learning (e.g., Random Forest modeling and variable selection). The following figure provides an overview of the types of questions that can be asked regarding mixtures and models that are commonly used to answer these questions:
-
-```{r 06-Chapter6-68, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_6/Module6_3_Input/Module6_3_Mixtures_Methods_Overview.png")
-```
-
-In this module, we will be using quantile-based g-computation to analyze our data. This method estimates a total mixture effect, as opposed to individual effects of mixture components. It is similar to previous, popular methods such as weighted quantile sum (WQS) regression, but does not assume directional homogeneity. It also provides models for non-additive and non-linear effects of the individual mixture components and the overall mixture. Additionally, it runs very quickly and is less computationally demanding than other methods, making it an accessible option for those without access to extensive computational resources.
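-To give a feel for the "quantile" step that qgcomp performs internally (a base-R sketch with synthetic exposure values; the `qgcomp` package handles this automatically, using `q = 4` quantiles by default):
-
-```{r}
-set.seed(2)
-exposure <- rlnorm(100)  # synthetic, right-skewed exposure values
-
-# Score each observation by its quartile (0-3)
-q_breaks   <- quantile(exposure, probs = seq(0, 1, 0.25))
-exposure_q <- cut(exposure, breaks = q_breaks, labels = FALSE,
-                  include.lowest = TRUE) - 1
-
-table(exposure_q)  # 25 observations per quartile score
-```
-
-Modeling these quantized scores, rather than raw concentrations, reduces the influence of outliers and puts all mixture components on a common scale.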
-
-
-## Introduction to Example Data
-
-This script outlines single-contaminant (logistic regression) and multi-contaminant (quantile g-computation (qgcomp)) modeling approaches. The workflow follows the steps used to generate the results published in [Eaves et al. 2023](https://pubmed.ncbi.nlm.nih.gov/37845729/). This study examined the relationship between metals in private well water and the risk of preterm birth. The study population was all singleton, non-anomalous births in NC from 2003 to 2015. Pregnancies were assigned tract-level metal exposure based on maternal residence at delivery. The relationship with each single metal exposure was examined with logistic regression, and metal mixtures were examined with qgcomp.
-
-For more info on qgcomp, see [Keil et al. 2020](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP5838) and the associated [vignette](https://cran.r-project.org/web/packages/qgcomp/vignettes/qgcomp-vignette.html).
-
-Note that for educational purposes, in this example we are using a randomly sampled dataset of 100,000 births, rather than the full dataset of >1.3 million (i.e., less than 10% of the full study population). Therefore, the actual results of the analysis outlined below do not match the results published in the paper.
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following questions:
-
-1. What is the risk of preterm birth associated with exposure to each of arsenic, lead, cadmium, chromium, manganese, copper and zinc via private well water individually?
-
-2. What is the risk of preterm birth associated with combined exposure to arsenic, lead, cadmium, chromium, manganese, copper, and zinc (i.e., a mixture) via private well water?
-
-3. Which of these chemicals has the strongest effect on preterm birth risk?
-
-4. Which of these chemicals increases the risk of preterm birth and which decreases the risk of preterm birth?
-
-### Workspace Preparation
-
-Install packages as needed, then load the following packages:
-```{r cars_, message = FALSE}
-#load packages
-library(tidyverse)
-library(ggplot2)
-library(knitr)
-library(yaml)
-library(rmarkdown)
-library(broom)
-library(ggpubr)
-library(qgcomp)
-```
-
-
-Optionally, you can also create a current date variable to name output files, and create an output folder.
-```{r 06-Chapter6-69, eval = FALSE}
-# Create a current date variable to name output files
-cur_date <- str_replace_all(Sys.Date(),"-","")
-
-# Create an output folder if it does not already exist
-Output_Folder <- "Module6_3_Output/"
-if (!dir.exists(Output_Folder)) dir.create(Output_Folder)
-```
-
-### Data Import
-```{r 06-Chapter6-70}
-cohort <- read.csv(file="Chapter_6/Module6_3_Input/Module6_3_InputData.csv")
-colnames(cohort)
-head(cohort)
-```
-
-Note: there are many steps prior to the modeling steps outlined below, which are being skipped for educational purposes. These additional steps include assessment of normality and transformation as needed, generation of a demographics table, assessment of and (if needed) imputation of missing data, visualization of trends and distributions in the data, functional form assessments, and decisions regarding which confounders to include.
-
-The following are the metals of interest: arsenic, lead, cadmium, chromium, manganese, copper, zinc.
-
-For each metal there are three exposure variables:
-
-1. `[metal]_perc`: 0: less than or equal to the 50th percentile, 1: above the 50th percentile and less than or equal to the 90th percentile, 3: above the 90th percentile
-2. `[metal]_limit`: 0: less than 25% of well water tests for a given metal exceeded the EPA regulatory standard, 1: 25% or more of well water tests for a given metal exceeded the EPA regulatory standard
-3. `[metal].Mean_avg`: the mean concentration of the metal in the tract (ppb).
-Please see the Eaves et al. 2023 paper linked above for further information on these variables.
-
-Other variables of interest (outcome and covariates) in this dataset:
-
- * `preterm`: 0= 37 weeks gestational age or greater, 1= less than 37 weeks gestational age
- * `mage`: maternal age in years, continuous
- * `sex`: sex of baby at birth: 1=M, 2=F
- * `racegp`: maternal race ethnicity: 1=white non-Hispanic, 2=Black non-Hispanic, 3=Hispanic, 4=Asian/Pacific Islander, 5=American Indian, 6=other/unknown
- * `smoke`: maternal smoking in pregnancy: 0=non-smoker, 1=smoker
- * `season_concep`: season of conception: 1=winter (Dec, Jan, Feb), 2=spring (Mar, Apr, May), 3=summer (June, Jul, Aug), 4=fall (Sept, Oct, Nov)
- * `mothed`: mother's education level (see the Eaves et al. 2023 paper linked above for category definitions)
-
-#### Preparing variables for modeling
-
-Before modeling, we ensure the outcome, exposure, and covariate variables are stored as the correct variable types:
-```{r 06-Chapter6-71, message=F, warning=F, error=F}
-# Outcome variable
-cohort <- cohort %>%
-  mutate(preterm = as.factor(preterm))
-cohort$preterm <- relevel(cohort$preterm, ref = "0")
-
-#exposure variables
-cohort <- cohort %>%
- mutate(Arsenic_perc=as.factor(Arsenic_perc)) %>%
- mutate(Cadmium_perc=as.factor(Cadmium_perc)) %>%
- mutate(Chromium_perc=as.factor(Chromium_perc)) %>%
- mutate(Copper_perc=as.factor(Copper_perc)) %>%
- mutate(Lead_perc=as.factor(Lead_perc)) %>%
- mutate(Manganese_perc=as.factor(Manganese_perc)) %>%
- mutate(Zinc_perc=as.factor(Zinc_perc)) %>%
- mutate(Arsenic_limit=as.factor(Arsenic_limit)) %>%
- mutate(Cadmium_limit=as.factor(Cadmium_limit)) %>%
- mutate(Chromium_limit=as.factor(Chromium_limit)) %>%
- mutate(Copper_limit=as.factor(Copper_limit)) %>%
- mutate(Lead_limit=as.factor(Lead_limit)) %>%
- mutate(Manganese_limit=as.factor(Manganese_limit)) %>%
- mutate(Zinc_limit=as.factor(Zinc_limit))
-
-
-#ensure covariates are in correct variable type form
-cohort <- cohort %>%
- mutate(racegp = as.factor(racegp)) %>%
- mutate(mage = as.numeric(mage)) %>%
- mutate(mage_sq = as.numeric(mage_sq)) %>%
- mutate(smoke = as.numeric(smoke)) %>%
- mutate(season_concep = as.factor(season_concep)) %>%
- mutate(mothed = as.numeric(mothed)) %>%
- mutate(Nitr_perc = as.numeric(Nitr_perc)) %>%
- mutate(sex = as.factor(sex))%>%
- mutate(pov_perc = as.factor(pov_perc))
-
-```
-
-#### Fit adjusted logistic regression models for each metal, for each categorical variable
-
-First, we will fit an adjusted logistic regression model for each metal, for each categorical variable, to demonstrate a variable by variable approach before diving into mixtures methods. Note that there are different regression techniques (linear and logistic are covered in another TAME module) and that here we will start with using percentage variables.
-```{r 06-Chapter6-72, message=F, warning=F, error=F}
-
-metals <- c("Arsenic","Cadmium","Chromium", "Copper","Lead","Manganese","Zinc")
-
-for (i in 1:length(metals)) {
- metal <- metals[[i]]
- metal <- as.name(metal)
- print(metal)
-
- print(is.factor(eval(parse(text = paste0("cohort$",metal,"_perc"))))) #check that metal var is a factor
-
- mod <- glm(preterm ~ eval(parse(text = paste0(metal,"_perc"))) + mage + mage_sq+ racegp + smoke + season_concep + mothed + Nitr_perc + pov_perc, family=binomial, data=cohort)
-
- mod_tid <- tidy(mod, conf.int=TRUE, conf.level=0.95) %>%
- mutate(model_name=paste0(metal,"_adj_perc")) %>%
- mutate(OR = exp(estimate)) %>%
- mutate(OR.conf.high = exp(conf.high)) %>%
- mutate(OR.conf.low = exp(conf.low))
-
- mod_tid[2,1] <- paste0(metal,"_perc_50to90")
- mod_tid[3,1] <- paste0(metal,"_perc_over90")
-
- plot <- mod_tid %>%
- filter(grepl('perc_', term))%>%
- ggplot(aes(OR, term, xmin = OR.conf.low, xmax = OR.conf.high, height = 0)) +
- geom_point() +
- scale_x_continuous(trans="log10")+
- geom_errorbarh()
-
- assign(paste0(metal,"_adj_perc"),mod_tid)
- assign(paste0(metal,"_adj_perc_plot"),plot)
-
-}
-
-```
-
-Plot the results:
-```{r 06-Chapter6-73, message=F, warning=F, error=F, fig.align='center'}
-
-
-perc_plots <- ggarrange(Arsenic_adj_perc_plot,
- Cadmium_adj_perc_plot,
- Chromium_adj_perc_plot,
- Copper_adj_perc_plot)
-plot(perc_plots)
-
-perc_plots1 <- ggarrange(Lead_adj_perc_plot,
- Manganese_adj_perc_plot,
- Zinc_adj_perc_plot)
-plot(perc_plots1)
-```
-
-Save the plots:
-```{r 06-Chapter6-74, eval = FALSE}
-tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_percplots_1.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
-plot(perc_plots)
-dev.off()
-
-tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_percplots_2.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
-plot(perc_plots1)
-dev.off()
-
-```
-
-We can also run the analysis using limit variables:
-```{r 06-Chapter6-75, message=F, warning=F, error=F, fig.align='center'}
-
- for (i in 1:length(metals)) {
- metal <- metals[[i]]
- metal <- as.name(metal)
- print(metal)
-
- print(is.factor(eval(parse(text = paste0("cohort$",metal,"_limit"))))) #check that metal var is a factor
-
- mod <- glm(preterm ~ eval(parse(text = paste0(metal,"_limit")))+ mage + mage_sq+ racegp + smoke + season_concep + mothed + Nitr_perc + pov_perc, family=binomial, data=cohort)
-
- mod_tid <- tidy(mod, conf.int=TRUE, conf.level=0.95) %>%
- mutate(model_name=paste0(metal,"_adj_limit")) %>%
- mutate(OR = exp(estimate)) %>%
- mutate(OR.conf.high = exp(conf.high)) %>%
- mutate(OR.conf.low = exp(conf.low))
-
- mod_tid[2,1] <- paste0(metal,"_limit_over25perc")
-
- plot <- mod_tid %>%
- filter(grepl('limit', term))%>%
- ggplot(aes(OR, term, xmin = OR.conf.low, xmax = OR.conf.high, height = 0)) +
- geom_point() +
- scale_x_continuous(trans="log10")+
- geom_errorbarh()
-
- assign(paste0(metal,"_adj_limit"),mod_tid)
- assign(paste0(metal,"_adj_limit_plot"),plot)
-
-}
-```
-Note: you will get this warning for some of the models:
-"Warning: glm.fit: fitted probabilities numerically 0 or 1".
-
-This warning occurs because, given the variability in the exposure data, the sample size here is too small for some models to estimate stable probabilities (as noted above, the analysis this example draws from was completed on >1.3 million observations).
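-
-To see what triggers this warning, consider a toy example (hypothetical data, not from this module) in which the exposure perfectly separates the outcome, so the maximum likelihood estimate pushes fitted probabilities to exactly 0 or 1:
-```{r, warning = TRUE}
-toy <- data.frame(outcome  = c(0, 0, 0, 1, 1, 1),
-                  exposure = factor(c("low", "low", "low", "high", "high", "high")))
-# every "high" row is a case and every "low" row a non-case,
-# so glm() warns that fitted probabilities are numerically 0 or 1
-mod_toy <- glm(outcome ~ exposure, family = binomial, data = toy)
-```
-With more observations per exposure category, such separation becomes less likely.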
-
-Plot the results:
-```{r 06-Chapter6-76, message=F, warning=F, error=F, fig.align='center'}
-limit_plots <- ggarrange(Arsenic_adj_limit_plot,
- Cadmium_adj_limit_plot,
- Chromium_adj_limit_plot,
- Copper_adj_limit_plot)
-
-plot(limit_plots)
-
-limit_plots1 <- ggarrange(Lead_adj_limit_plot,
- Manganese_adj_limit_plot,
- Zinc_adj_limit_plot)
-
-plot(limit_plots1)
-```
-
-Save the plots:
-```{r 06-Chapter6-77, eval = FALSE}
-tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_limitplots1.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
-plot(limit_plots)
-dev.off()
-
-tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_limitplots2.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
-plot(limit_plots1)
-dev.off()
-```
-
-Merge all of the logistic regression model results. This is the data frame that you could export for supplementary material or to view the results in Excel.
-```{r 06-Chapter6-78, message=F, warning=F, error=F}
-#merge all model output
-results_df <- rbind(Arsenic_adj_perc, Arsenic_adj_limit,
- Cadmium_adj_perc, Cadmium_adj_limit,
- Chromium_adj_perc, Chromium_adj_limit,
- Copper_adj_perc, Copper_adj_limit,
- Lead_adj_perc, Lead_adj_limit,
- Manganese_adj_perc, Manganese_adj_limit,
- Zinc_adj_perc, Zinc_adj_limit)
-```
-
-To select only the coefficients related to the primary exposures:
-```{r 06-Chapter6-79}
-results_df <- results_df %>% filter(str_detect(term, 'limit|50to90|over90'))
-```
-
-This file outputs the coefficients and the odds ratios (ORs) of the logistic regression models all together.
-+ The ORs associated with [metal]_perc_50to90 compare the odds of preterm birth among individuals in the 50th to 90th percentile of [metal] exposure to those below the 50th percentile.
-+ The ORs associated with [metal]_perc_over90 compare the odds of preterm birth among individuals above the 90th percentile of [metal] exposure to those below the 50th percentile.
-+ The ORs associated with [metal]_limit_over25perc compare the odds of preterm birth among individuals living in census tracts in which 25% or more of tests exceeded an EPA standard for [metal] to those in tracts where less than 25% of tests exceeded the standard.
-
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer also **Environmental Health Question #1***: What is the risk of preterm birth associated with exposure to each of arsenic, lead, cadmium, chromium, manganese, copper and zinc via private well water individually?
-:::
-
-:::answer
-**Answer**: Using the interpretation guides described in the prior paragraph and the "_NCbirths_pretermbirth_singlemetal_adjusted_models.csv" file, you can answer this question. For example, for cadmium: compared to individuals residing in census tracts with cadmium below the 50th percentile, those residing in tracts with cadmium between the 50th and 90th percentile had a 7% increase in the adjusted odds of PTB (aOR 1.07 (95% CI: 1.00, 1.14)), and those in tracts with cadmium above the 90th percentile had an 8% increase in the adjusted odds of PTB (aOR 1.08 (95% CI: 0.97, 1.20)). For lead: compared to individuals in tracts with less than 25% of tests exceeding the standard for lead (note this is the EPA treatment technique action level = 15 ppb), individuals residing in census tracts where 25% or more of tests exceeded this level had 1.23 (95% CI: 0.81, 1.81) times the adjusted odds of preterm birth. IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
-:::
-
-While the single contaminant models provide useful information, they cannot inform us of the effect of multiple simultaneous exposures or account for confounding by co-occurring contaminants. Therefore, we want to utilize quantile g-computation to assess mixtures.
-
-## Mixtures Model with Standard qgcomp
-```{r 06-Chapter6-80, message=F, warning=F, error=F}
-#list of exposure variables
-Xnm <- c('Arsenic.Mean_avg', 'Cadmium.Mean_avg', 'Lead.Mean_avg', 'Manganese.Mean_avg', 'Chromium.Mean_avg', 'Copper.Mean_avg', 'Zinc.Mean_avg')
-#list of covariates
-covars = c('mage','mage_sq','racegp','smoke','season_concep','mothed','Nitr_perc','pov_perc')
-
-#fit adjusted model
-PTB_adj_ppb <- qgcomp.noboot(preterm~.,
- expnms=Xnm, dat=cohort[,c(Xnm,covars,'preterm')], family=binomial(), q=4)
-
-```
-
-In English, `preterm~.` says: fit a model with preterm (1/0) as the dependent variable and all other variables in the dataset (`.`) as the independent variables (exposures and covariates). `expnms=Xnm` specifies that the exposure mixture is given by the vector `Xnm` defined above. `dat=cohort[,c(Xnm,covars,'preterm')]` specifies that the dataset used to fit the model includes all columns of the cohort dataset listed in the `Xnm` and `covars` vectors, plus the `preterm` variable. `family=binomial()` indicates that the outcome is binary, so a logistic regression model will be fit. `q=4` breaks the exposures into quartiles; other options include q=3 for tertiles, q=5 for quintiles, and so forth.
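-
-To illustrate what `q=4` does internally, here is a minimal base R sketch (hypothetical data; qgcomp performs this step for you) that scores a continuous exposure into quartiles:
-```{r}
-set.seed(1)
-x <- rlnorm(100)                                # hypothetical skewed exposure
-breaks <- quantile(x, probs = seq(0, 1, 0.25))  # quartile cut points
-xq <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE) - 1
-table(xq)  # scores 0-3, ~25 observations each
-```
-qgcomp then fits the model on these integer scores, so each exposure coefficient reflects a one-quantile increase in that exposure.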
-
-This is a summary of the qgcomp model output
-```{r 06-Chapter6-81, message=F, warning=F, error=F}
-PTB_adj_ppb
-```
-This output can be interpreted as:
-
- * Cadmium, chromium, manganese, and zinc had positive effects, meaning they increased the risk of preterm birth. Arsenic, copper, and lead had negative effects, meaning they reduced the risk of preterm birth.
- * The total effect of all positively acting mixture components is given by the sum of the positive coefficients (0.0969); the total effect of all negatively acting mixture components is given by the sum of the negative coefficients (-0.0532).
- * The numbers underneath each individual mixture component are the weights assigned to that component. These sum to 1 in each direction and represent the relative contribution of each component to the effect in that direction. If only one component were acting in the positive or negative direction, it would have a weight of 1. A component's weight multiplied by the sum of the coefficients in the relevant direction gives that component's coefficient, which represents the independent effect of that component (e.g., cadmium log(OR) = 0.0969*0.4556 = 0.0441).
- * The overall mixture effect (i.e., the log(OR) when all exposures are increased by one quartile) is given by psi1. Here it equals 0.0437. Note that this value equals the sum of the positive coefficients plus the sum of the negative coefficients (i.e., 0.0969 - 0.0532 = 0.0437).
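-
-These relationships can be verified with quick arithmetic using the values reported above:
-```{r}
-pos_sum <- 0.0969    # sum of positive coefficients
-neg_sum <- -0.0532   # sum of negative coefficients
-cd_weight <- 0.4556  # cadmium's weight in the positive direction
-
-# overall mixture effect psi1 = positive sum + negative sum
-psi1 <- pos_sum + neg_sum       # 0.0437
-
-# cadmium's independent effect = weight x directional sum
-cd_coef <- cd_weight * pos_sum  # ~0.0441
-
-# on the odds ratio scale
-exp(psi1)                       # ~1.045
-```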
-
-IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
-
-This is the plot that gives you the weights of the components
-```{r 06-Chapter6-82, message=F, warning=F, error=F, fig.align='center'}
-plot(PTB_adj_ppb)
-```
-
-To save the plot:
-```{r 06-Chapter6-83, eval = FALSE}
-tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_qgcomp_weights.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
-plot(PTB_adj_ppb)
-dev.off()
-```
-
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: What is the risk of preterm birth associated with combined exposure to arsenic, lead, cadmium, chromium, manganese, copper and zinc (ie. a mixture) via private well water?
-:::
-
-:::answer
-**Answer**: When all exposures (arsenic, lead, cadmium, chromium, manganese, copper and zinc) are increased in concentration by one quartile the odds ratio is 1.044 (exp(0.043705)). IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
-:::
-
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer also **Environmental Health Question #3***: Which of these chemicals has the strongest effect on preterm birth risk?
-:::
-
-:::answer
-**Answer**: The mixture component with the strongest effect is the one with the largest independent effect, given by the component's coefficient (calculated as (sum of coefficients in the relevant direction)*(component weight); as shown below, these can also be output into results files). In this case, the component with the largest independent effect is cadmium (0.0969*0.4556 = 0.0441). IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
-:::
-
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer also **Environmental Health Question #4***: Which of these chemicals increases the risk of preterm birth and which decreases the risk of preterm birth?
-:::
-
-:::answer
-**Answer**: This is indicated by the direction of effect for each component. Thus, the mixture components that increase the risk of preterm birth are cadmium, chromium, manganese and zinc, while the mixture components that decrease the risk of preterm birth are arsenic, copper and lead. IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
-:::
-
-
-We can export the mixtures modeling results using the following code, which stores the data in three different files:
-+ Results_SlopeParams outputs the overall mixture effect results
-+ Results_MetalCoeffs outputs the coefficients of the individual mixture components (metals). Note that this will also output coefficients for the covariates included in the model.
-+ Results_MetalWeights outputs the weights of the individual mixture components (metals)
-```{r 06-Chapter6-84, message=F, warning=F, error=F, eval = FALSE}
-allmodels <- c("PTB_adj_ppb") #if you run more than one qgcomp model, list them here and the following code can output the results in clean format all together
-
-clean_print <- function(x){
- output = data.frame(
- x$coef,
- sqrt(x$var.coef),
- x$ci.coef,
- x$pval
- )
- names(output) = c("Estimate", "Std. Error", "Lower CI", "Upper CI", "p value")
- return(output)
-}
-
-Results_SlopeParams <- data.frame() #empty data frame to append results to
-for (i in allmodels){
- print(i)
- df <- eval(parse(text = paste0("clean_print(",i,")"))) %>%
- rownames_to_column("Parameter") %>%
- mutate("Model" = i)
- Results_SlopeParams <- rbind(Results_SlopeParams,df)
-}
-Results_SlopeParams <- Results_SlopeParams %>%
- mutate(OR=exp(Estimate)) %>%
- mutate(UpperCI_OR=exp(`Upper CI`)) %>%
- mutate(LowerCI_OR=exp(`Lower CI`))
-
-Results_MetalCoeffs <- data.frame()
-for (i in allmodels){
- print(i)
- df <- eval(parse(text = paste0("as.data.frame(summary(",i,"$fit)$coefficients[,])"))) %>%
- mutate("Model" = i)
- df <- df %>% rownames_to_column(var="variable")
- Results_MetalCoeffs<- rbind(Results_MetalCoeffs,df)
-}
-
-Results_MetalWeights <- data.frame()
-for (i in allmodels){
- Results_PWeights <- eval(parse(text = paste0("as.data.frame(",i,"$pos.weights)"))) %>%
- rownames_to_column("Metal") %>%
- dplyr::rename("Weight" = 2) %>%
- mutate("Weight Direction" = "Positive")
- Results_NWeights <- eval(parse(text = paste0("as.data.frame(",i,"$neg.weights)"))) %>%
- rownames_to_column("Metal") %>%
- dplyr::rename("Weight" = 2) %>%
- mutate("Weight Direction" = "Negative")
- Results_Weights <- rbind(Results_PWeights, Results_NWeights) %>%
- mutate("Model" = i) %>% as.data.frame()
- Results_MetalWeights <- rbind(Results_MetalWeights, Results_Weights)
-}
-
-write.csv(Results_SlopeParams, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_SlopeParams.csv"), row.names=TRUE)
-write.csv(Results_MetalCoeffs, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_MetalCoeffs.csv"), row.names=TRUE)
-write.csv(Results_MetalWeights, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_MetalWeights.csv"), row.names=TRUE)
-```
-
-
-## Concluding Remarks
-In conclusion, this module reviews a suite of methodologies researchers can use to answer different questions relevant to environmental mixtures and their relationships to health outcomes. In this scripted example, we utilized a large epidemiological dataset (subsetted to a reduced sample size for educational purposes) to demonstrate logistic regression for assessing single contaminant associations with a health outcome (preterm birth) and quantile g-computation for assessing mixture effects on that outcome.
-
-## Additional Resources
-The field of mixtures is vast, with many different approaches and example studies to learn from as analysts lead in their own analyses. Some resources that can be helpful include the following reviews:
-
-+ Our recent review on mixtures methodologies, particularly in the field of sufficient similarity, titled [Wrangling whole mixtures risk assessment: Recent advances in determining sufficient similarity](https://www.sciencedirect.com/science/article/abs/pii/S2468202023000323?via%3Dihub)
-+ Two more general, epidemiology-focused reviews on mixtures questions and methodologies, titled [Complex Mixtures, Complex Analyses: an Emphasis on Interpretable Results](https://link.springer.com/article/10.1007/s40572-019-00229-5) and [Environmental exposure mixtures: questions and methods to address them](https://pubmed.ncbi.nlm.nih.gov/30643709/)
-+ [A helpful online toolkit](https://bookdown.org/andreabellavia/mixtures/preface.html) for mixtures analyses generated by Andrea Bellavia, PhD
-
-Some helpful mixtures case studies include the following:
-
-+ Our recent study that implemented quantile g-computation statistics to identify chemicals present in wildfire smoke emissions that impact toxicity, published as the following: Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
-+ Another study from our group that implemented quantile g-computation to identify placental gene networks that had altered expression in response to cord tissue mixtures of metals, published as the following: Eaves LA, Bulka CM, Rager JE, Galusha AL, Parsons PJ, O'Shea TM and Fry RC. Metals mixtures modeling identifies birth weight-associated gene networks in the placentas of children born extremely preterm. Chemosphere. 2022;137469. PMID: [36493891](https://pubmed.ncbi.nlm.nih.gov/36493891/)
-
-Many other groups also leverage quantile g-computation, with the following as exemplar case studies:
-
-+ [Prenatal exposure to consumer product chemical mixtures and size for gestational age at delivery](https://link.springer.com/article/10.1186/s12940-021-00724-z)
-+ [Use of personal care product mixtures and incident hormone-sensitive cancers in the Sister Study: A U.S.-wide prospective cohort](https://www.sciencedirect.com/science/article/pii/S0160412023005718)
-
-
-
-
-:::tyk
-
-Using the metals dataset within the *qgcomp* package (see the [package vignette](https://cran.r-project.org/web/packages/qgcomp/vignettes/qgcomp-vignette.html) for how to access), answer the following three mixtures-related environmental health questions using quantile g-computation, focusing on a mixture of arsenic, copper, zinc and lead:
-
-1. What is the risk of disease associated with combined exposure to each of the chemicals?
-2. Which of these chemicals has the strongest effect on disease?
-3. Which of these chemicals increases the risk of disease and which decreases the risk of disease?
-
-Note that disease is given by the variable `disease_state` (1 = case, 0 = non-case).
-
-:::
-
-# 6.4 Mixtures Analysis Methods Part 2: Bayesian Kernel Machine Regression
-
-This training module was developed by Dr. Lauren Eaves, Dr. Kyle Roell, and Dr. Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-
-## Introduction to Training Module
-
-In this training module, we will continue to explore mixtures analysis methods, this time with a scripted example of Bayesian Kernel Machine Regression (BKMR). Please refer to **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation** for an overview of mixtures methodologies and a scripted example using Quantile g-Computation.
-
-## Introduction to Example Data
-
-In this scripted example, we will use a dataset from the [Extremely Low Gestational Age Newborn (ELGAN) cohort](https://elgan.fpg.unc.edu/). Specifically, we will analyze metal mixtures assessed in cord tissue collected at delivery with neonatal inflammation measured over the first two weeks of life.
-
-For more information on the cord tissue metals data, please see the following two publications:
-
- + Eaves LA, Bulka CM, Rager JE, Galusha AL, Parsons PJ, O’Shea TM and Fry RC. Metals mixtures modeling identifies birth weight-associated gene networks in the placentas of children born extremely preterm. Chemosphere. 2022;137469. PMID: [36493891](https://pubmed.ncbi.nlm.nih.gov/36493891/)
-
-+ Bulka CM, Eaves LA, Gardner AJ, Parsons PJ, Kyle RR, Smeester L, O'Shea TM, Fry RC. Prenatal exposure to multiple metallic and metalloid trace elements and the risk of bacterial sepsis in extremely low gestational age newborns: A prospective cohort study. Front Epidemiol. 2022;2. PMID: [36405975](https://pubmed.ncbi.nlm.nih.gov/36405975/)
-
-For more information on the neonatal inflammation data, please see the following publication:
-
- + Eaves LA, Enggasser AE, Camerota M, Gogcu S, Gower WA, Hartwell H, Jackson WM, Jensen E, Joseph RM, Marsit CJ, Roell K, Santos HP Jr, Shenberger JS, Smeester L, Yanni D, Kuban KCK, O'Shea TM, Fry RC. CpG methylation patterns in placenta and neonatal blood are differentially associated with neonatal inflammation. Pediatr Res. June 2022. PMID: [35764815](https://pubmed.ncbi.nlm.nih.gov/35764815/)
-
-Here, we have a dataset of n=254 participants for which we have complete data on neonatal inflammation, cord tissue metals and key demographic variables that will be included as confounders in the analysis.
-
-Extensive research in the ELGAN study has demonstrated that neonatal inflammation is predictive of cerebral palsy, ASD, ADHD, obesity, cognitive impairment, attention problems, cerebral white matter damage, and decreased total brain volume, among other adverse outcomes. Therefore, identifying exposures that lead to neonatal inflammation, and that could be intervened upon to reduce its risk, is critical to improving neonatal health. Environmental exposures during pregnancy, such as metals, may contribute to neonatal inflammation. As is often the case in environmental health, these chemical exposures are likely co-occurring, and therefore mixtures methods are needed.
-
-## Introduction to BKMR
-
-BKMR offers a flexible, non-parametric method to estimate:
-
-1) The single exposure effect: odds ratio of inflammation when a single exposure is at its 75th percentile compared to its 25th percentile, with other exposures at their 50th percentile and covariates held constant
-2) The overall mixture effect: odds ratio of inflammation when all exposures are fixed at their 75th percentile compared to when all of the factors are fixed to their 25th percentile;
-3) The interactive effect: the difference in the single-exposure effect when all of the other exposures are fixed at their 75th percentile, as compared to when all of the other factors are fixed at their 25th percentile;
-
-
-There are numerous excellent summaries of BKMR, including the publications in which it was first introduced:
-
- + Bobb et al. [Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures](https://academic.oup.com/biostatistics/article/16/3/493/269719)
- + Bobb et al. [Statistical software for analyzing the health effects of multiple concurrent exposures via Bayesian kernel machine regression](https://ehjournal.biomedcentral.com/articles/10.1186/s12940-018-0413-y)
-
-And other vignettes and toolkits including:
-
- + Jennifer Bobb's [Introduction to Bayesian kernel machine regression and the bkmr R package](https://jenfb.github.io/bkmr/overview.html)
- + Andrea Bellavia's [Bayesian kernel machine regression](https://bookdown.org/andreabellavia/mixtures/bayesian-kernel-machine-regression.html)
-
-
-While BKMR can do many things other methods cannot, it can require substantial computational resources and take a long time to run. If your final dataset or analysis is very large or complex, it is often recommended to start with a smaller sample to make sure everything is working correctly before launching an analysis that may take days to complete.
-
-
-### Training Module's **Environmental Health Questions**
-This training module was specifically developed to answer the following questions, which mirror the questions in **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1**, but are just in a different order:
-
-1. Which of these chemicals has the strongest effect on neonatal inflammation risk?
-2. Which of these chemicals increases the risk of neonatal inflammation and which decreases the risk of neonatal inflammation?
-3. What is the risk of neonatal inflammation associated with exposure to each of manganese, copper, zinc, arsenic, selenium, cadmium, mercury, lead individually?
-4. What is the risk of neonatal inflammation associated with combined exposure to manganese, copper, zinc, arsenic, selenium, cadmium, mercury, lead (ie. a mixture)?
-In addition to the questions addressed in Mixtures Methods Part 1, we can also answer:
-5. Are there interactions among manganese, copper, zinc, arsenic, selenium, cadmium, mercury, lead in relation to neonatal inflammation?
-
-## Run BKMR
-
-### Workspace Preparation
-
-Install packages as needed, then load the following packages:
-```{r cars, message = FALSE}
-#load packages
-library(tidyverse)
-library(ggplot2)
-library(knitr)
-library(yaml)
-library(rmarkdown)
-library(broom)
-library(ggpubr)
-library(bkmr)
-```
-
-Optionally, you can also create a current date variable to name output files, and create an output folder.
-```{r 06-Chapter6-85, eval = FALSE}
-#Create a current date variable to name output files
-cur_date <- str_replace_all(Sys.Date(),"-","")
-
-#Create an output folder
-Output_Folder <- ("Module6_4_Output/")
-```
-
-### Data Import
-```{r 06-Chapter6-86}
-cohort <- read.csv(file="Chapter_6/Module6_4_Input/Module6_4_InputData.csv")
-colnames(cohort)
-head(cohort)
-```
-
-The variables in this dataset include sample and demographic information and cord tissue metal exposure in $\mu$g/g or ng/g.
-
-*Sample and Demographic Variables*
-
-+ `id`: unique study ID
-
-*Outcome*
-
-+ `inflam_intense`: 1=high inflammation, 0=low inflammation
-
-*Covariates*
-
-+ `race1`: maternal race, 1=White, 2=Black, 0=Other
-+ `sex`: neonatal sex, 0=female, 1=male
-+ `gadays`: gestational age at delivery in days
-+ `magecat`: maternal age, 1= <21, 2=21-35, 3= >35
-+ `medu`: maternal education: 1= <12, 2=12, 3=13-15, 4=16, 5= >16
-+ `smoke`: maternal smoking while pregnant, 0=no, 1=yes
-
-*Exposure Variables*
-
-+ `Mn_ugg`
-+ `Cu_ugg`
-+ `Zn_ugg`
-+ `As_ngg`
-+ `Se_ugg`
-+ `Cd_ngg`
-+ `Hg_ngg`
-+ `Pb_ngg`
-
-
-There are many steps prior to the modeling steps outlined below. These are being skipped for educational purposes. Additional steps include assessment of normality and transformations as needed, generation of a demographics table and assessing for missing data, imputation of missing data if needed, visualizing trends and distributions in the data, assessing correlations between exposures, functional form assessments, and decisions regarding what confounders to include.
-
-In addition, it is highly recommended to conduct single-contaminant modeling initially to understand individual chemical relationships with the outcomes of focus before conducting mixtures assessment. For an example of this, see **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation**. BKMR, as a flexible non-parametric modeling approach, does not allow for classical null-hypothesis testing, and 95% CI are interpreted as credible intervals, not confidence intervals. One approach therefore could be to explore non-linearities and interactions within BKMR to then validate generated hypotheses using quantile g-computation.
-
-### Fit the BKMR Model
-First, define a matrix/vector of the exposure mixture, outcome, and confounders/covariates. BKMR performs better when the exposures are on a similar scale and when there are not outliers. Thus, we center and scale the exposure variables first. As noted above, in a complete analysis, thorough examination of exposure variable distributions, including outliers and normality, would be conducted before any exposure-outcome modeling. For more information on normality testing, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations.**
-
-First, we'll assign the matrix variables to their own data frame and scale the data.
-```{r 06-Chapter6-87, message=F, warning=F, error=F}
-#exposure mixture variables: log-transform, then center and scale
-mixture <- as.matrix(cohort[,10:17])
-mixture <- log(mixture)
-mixture <- scale(mixture, center=TRUE)
-summary(mixture)
-```
-
-Then, we'll define the outcome variable and ensure it is the proper class and leveling.
-```{r 06-Chapter6-88}
-#outcome variable
-cohort$inflam_intense <-as.factor(cohort$inflam_intense)
-cohort$inflam_intense <- relevel(cohort$inflam_intense, ref = "0")
-y<-as.numeric(as.character(cohort$inflam_intense))
-```
-
-Next, we'll assign the covariates to a matrix.
-```{r 06-Chapter6-89}
-#covariates
-covariates<-as.matrix(cohort[,7:9])
-```
-
-Then, we can fit the BKMR model. Note that this script will take a few minutes to run.
-```{r 06-Chapter6-90}
-set.seed(111)
-fitkm <- kmbayes(y = y, Z = mixture, X = covariates, iter = 5000, verbose = FALSE, varsel = TRUE, family="binomial", est.h = TRUE)
-```
-
-For full information regarding options for the kmbayes function, refer to the BKMR reference manual: https://cran.r-project.org/web/packages/bkmr/bkmr.pdf
-
-### Assess Variable Importance
-BKMR conducts a variable selection procedure and generates posterior inclusion probabilities (PIPs). The larger the PIP, the more a variable contributes to the overall exposure-outcome effect. PIPs are relative to each other, so there is no threshold at which a variable becomes an "important" contributor (similar to the weights in quantile g-computation).
-```{r 06-Chapter6-91, message=F, warning=F, error=F}
-ExtractPIPs(fitkm)
-```
-
-Relative to each other, the contributions of each mixture component to the effect of the mixture on neonatal inflammation are shown above. Note that if a variable's PIP = 0, BKMR will drop it from the model, and the overall mixture effect will not include that exposure.
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: Which of these chemicals has the strongest effect on neonatal inflammation risk?
-:::
-
-:::answer
-**Answer**: Based on the PIPs: Cadmium.
-:::
-
-
-### Assess Model Convergence
-
-We can use trace plots to evaluate how the parameters in the model converge over the many iterations. We hope to see that the line moves randomly but centers around a straight line.
-
-```{r 06-Chapter6-92, message=F, warning=F, error=F, fig.align = "center"}
-sel <- seq(1, 5000, by = 1)
-TracePlot(fit = fitkm, par = "beta", sel=sel)
-```
-
-Based on this plot, it looks like the burn-in period is roughly 1,000 iterations. We will remove these iterations from the results.
-```{r 06-Chapter6-93, message=F, warning=F, error=F, fig.align = "center"}
-sel <- seq(1000, 5000, by = 1)
-TracePlot(fit = fitkm, par = "beta", sel=sel)
-```
-
-### Presenting Model Results
-
-#### Single exposure effects
-As described above, one way to examine single-exposure effects is to calculate the odds ratio of inflammation when a single exposure is at its 75th percentile compared to its 25th percentile, while the other exposures are at their 50th percentile and covariates are held constant.
-
-Here, we use the `PredictorResponseUnivar()` function to generate a dataset that details, at varying levels of each exposure (`z`), the relationship between that exposure and the outcome, holding the other exposures at their 50th percentile and covariates constant. This relationship is given by a beta value (`est`), which, because we have a binomial outcome and fit a probit model, represents the log(odds). The standard error for the beta value is also calculated (`se`).
-
-```{r 06-Chapter6-94, message=F, warning=F, error=F, fig.align = "center"}
-pred.resp.univar <- PredictorResponseUnivar(fit=fitkm, sel=sel,
- method="approx", q.fixed = 0.5)
-
-head(pred.resp.univar)
-```
-
-We can then plot these data for each exposure to visualize the exposure-response function for each exposure.
-
-```{r 06-Chapter6-95}
-ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se,
- ymax = est + 1.96*se)) +
- geom_smooth(stat = "identity") + ylab("h(z)") + facet_wrap(~ variable)
-```
-
-Then, we can generate a dataset that contains, for each exposure (`variable`), the log(OR) (`est`) and its standard deviation (`sd`) corresponding to the odds of neonatal inflammation when an exposure is at its 75th percentile compared to the odds when it is at its 25th percentile. The log(OR) is estimated at three levels of the other exposures (25th, 50th, and 75th percentiles). We can use this dataset to identify odds ratios for neonatal inflammation (comparing the 75th to 25th percentile odds) for each exposure at differing levels of the other exposures. These odds ratios approximate risk, whereby an odds ratio >1 means there is increased risk of neonatal inflammation when that exposure is at its 75th percentile compared to its 25th percentile. We can then plot these data to see the log(OR) for each metal in relation to neonatal inflammation at varying levels of the rest of the exposures.
-
-```{r 06-Chapter6-96, message=F, warning=F, error=F, fig.align = "center"}
-risks.singvar <- SingVarRiskSummaries(fit=fitkm, qs.diff = c(0.25, 0.75),
- q.fixed = c(0.25, 0.50, 0.75),
- method = "approx")
-
-ggplot(risks.singvar, aes(variable, est, ymin = est - 1.96*sd,
- ymax = est + 1.96*sd, col = q.fixed)) +
- geom_hline(aes(yintercept=0), linetype="dashed", color="gray") +
- geom_pointrange(position = position_dodge(width = 0.75)) +
- coord_flip() + theme(legend.position="none")+scale_x_discrete(name="") +
- scale_y_continuous(name="estimate")
-
-```
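-The intervals in the plot above are constructed as `est ± 1.96*sd` on the log(OR) scale; exponentiating each bound gives an interval on the odds ratio scale. A minimal base R sketch with hypothetical values (not taken from the fitted model):
-
-```r
-# Hypothetical log(OR) estimate and posterior standard deviation
-est <- 0.024
-sd_est <- 0.15
-
-# 95% interval on the log(OR) scale, then exponentiated to the OR scale
-ci_log <- est + c(-1.96, 1.96) * sd_est
-round(exp(ci_log), 2)
-```
-
-An interval that spans 1 on the OR scale (equivalently, 0 on the log scale) indicates we cannot rule out a null effect.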
-
-
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: Which of these chemicals increases the risk of neonatal inflammation and which decreases the risk of neonatal inflammation?
-:::
-
-:::answer
-**Answer**: At all levels of the other exposures, lead, cadmium, selenium, arsenic, and zinc reduce the odds of neonatal inflammation, while manganese and mercury appear to increase the odds. Copper appears to have a null effect. Notice, however, that the credible intervals for all metals span the null, meaning we are not confident in the independent effect of any of the metals.
-:::
-
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can also answer **Environmental Health Question #3***: What is the risk of neonatal inflammation associated with exposure to each of manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead individually?
-:::
-
-:::answer
-**Answer**: As an example, take manganese: when all other exposures are at their 50th percentile, the log(OR) for Mn comparing the 75th to the 25th percentile is 0.024, which equals an odds ratio of 1.02. From this, you should be able to calculate the odds ratios for the other metals yourself.
-:::
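-Converting a log(OR) to an odds ratio is a single exponentiation in base R (the 0.024 value for Mn comes from the `risks.singvar` output above; the other metals' estimates can be substituted in the same way):
-
-```r
-# Convert a log odds ratio to an odds ratio
-log_or <- 0.024        # Mn, 75th vs. 25th percentile (others at their 50th)
-round(exp(log_or), 2)  # about 1.02
-```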
-
-
-#### Calculating the overall mixture effect
-
-Next, we can generate a dataset that details the effect (i.e., log(OR) (`est`) and corresponding standard deviation (`sd`)) on neonatal inflammation of all exposures when at a particular quantile (`quantile`) compared to all exposures being at the 50th percentile. We can use this dataset to identify odds ratios for neonatal inflammation upon simultaneous exposure to the entire mixture for different quantile threshold comparisons. These odds ratios approximate risk, whereby an odds ratio >1 means there is increased risk of neonatal inflammation when the entire mixture is set at the index quantile compared to the 50th percentile. We can also plot these results to visualize the overall mixture effect dose-response relationship.
-
-```{r 06-Chapter6-97, message=F, warning=F, error=F, fig.align = "center"}
-risks.overall <- OverallRiskSummaries(fit=fitkm, qs=seq(0.25, 0.75, by=0.05),
- q.fixed = 0.5, method = "approx",
- sel=sel)
-
-ggplot(risks.overall, aes(quantile, est, ymin = est - 1.96*sd,
- ymax = est + 1.96*sd)) +
-  geom_hline(yintercept=0, linetype="dashed", color="gray") +
- geom_pointrange() + scale_y_continuous(name="estimate")
-```
-
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer **Environmental Health Question #4***: What is the risk of neonatal inflammation associated with combined exposure to manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead (i.e., a mixture)?
-:::
-
-:::answer
-**Answer**: When every exposure is at its 25th percentile concentration compared to its 50th percentile concentration, the odds ratio for neonatal inflammation is 1.11 (exp(0.10073680)). When every exposure is at its 75th percentile concentration compared to its 50th percentile concentration, the odds ratio for neonatal inflammation is 0.89 (exp(-0.12000889)).
-:::
-
-
-### Evaluating interactive effects
-
-To understand bivariate interactions, we can generate a dataset that, for each pairing of exposures, details the log(odds) (`est`, and associated standard deviation (`sd`)) of neonatal inflammation at varying levels of both exposures, when all the other exposures are held constant. These plots can be tricky to interpret, so another way of looking at these results is to take "cross sections" at specific quantiles of the second exposure (see next step).
-```{r 06-Chapter6-98, message=F, warning=F, error=F, fig.align = "center"}
-pred.resp.bivar <- PredictorResponseBivar(fit=fitkm, min.plot.dist = 1,
- sel=sel, method="approx")
-
-ggplot(pred.resp.bivar, aes(z1, z2, fill = est)) +
- geom_raster() +
- facet_grid(variable2 ~ variable1) +
- scale_fill_gradientn(colours=c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
- xlab("expos1") +
- ylab("expos2") +
- ggtitle("h(expos1, expos2)")
-```
-
-Next, we generate a dataset that includes, for each pairing of exposures, the log(odds) (`est`, and associated standard deviation `sd`) of neonatal inflammation at varying concentrations (`z1`) of the first exposure (`variable1`) when the second exposure (`variable2`) is at its 25th, 50th, and 75th percentiles (`quantile`).
-
-```{r 06-Chapter6-99, message=F, warning=F, error=F, fig.align = "center"}
-pred.resp.bivar.levels <- PredictorResponseBivarLevels(pred.resp.df=
- pred.resp.bivar, Z = mixture, both_pairs=TRUE,
- qs = c(0.25, 0.5, 0.75))
-
-ggplot(pred.resp.bivar.levels, aes(z1, est)) +
- geom_smooth(aes(col = quantile), stat = "identity") +
- facet_grid(variable2 ~ variable1) +
- ggtitle("h(expos1 | quantiles of expos2)") +
- xlab("expos1")
-```
-
-There is evidence of an interactive effect between two exposures when the exposure-response function for exposure 1 varies in form between the different quantiles of exposure 2. You can also zoom in on one plot, for example:
-
-```{r 06-Chapter6-100, message=F, warning=F, error=F, fig.align = "center"}
-HgCd <- pred.resp.bivar.levels %>%
- filter(variable1=="Hg_ngg") %>%
- filter(variable2=="Cd_ngg")
-
-ggplot(HgCd, aes(z1, est)) +
- geom_smooth(aes(col = quantile), stat = "identity") +
- ggtitle("h(expos1 | quantiles of expos2)") +
- xlab("expos1")
-
-
-CdHg <- pred.resp.bivar.levels %>%
- filter(variable1=="Cd_ngg") %>%
- filter(variable2=="Hg_ngg")
-
-ggplot(CdHg, aes(z1, est)) +
- geom_smooth(aes(col = quantile), stat = "identity") +
- ggtitle("h(expos1 | quantiles of expos2)") +
- xlab("expos1")
-```
-
-To visualize interactions between one exposure and the rest of the exposure components, we generate a dataset that details the difference in each exposure's (`variable`) log(OR) comparing the 75th to the 25th percentile (`est`, and associated standard deviation `sd`) when the other exposure components are at their 75th versus 25th percentiles. Perhaps more intuitively, these estimates represent the difference between the blue and red points plotted in the second figure under the single exposure effects section.
-```{r 06-Chapter6-101, message=F, warning=F, error=F, fig.align = "center"}
-
-risks.int <- SingVarIntSummaries(fit=fitkm, qs.diff = c(0.25, 0.75),
- qs.fixed = c(0.25, 0.75))
-
-
-ggplot(risks.int, aes(variable, est, ymin = est - 1.96*sd,
- ymax = est + 1.96*sd)) +
- geom_pointrange(position = position_dodge(width = 0.75)) +
- geom_hline(yintercept = 0, lty = 2, col = "brown") + coord_flip()
-```
-
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can answer **Environmental Health Question #5***: Are there interactions among manganese, copper, zinc, arsenic, selenium, cadmium, mercury, lead in relation to neonatal inflammation?
-:::
-
-:::answer
-**Answer**: There do not appear to be any interactions between a single exposure and the rest of the mixture (previous plot); however, there is suggestive evidence of a bivariate interaction between cadmium and mercury.
-:::
-
-
-## Concluding Remarks
-In conclusion, this module builds upon **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation**. In this scripted example, we used a dataset from a human population study (n=246) of cord tissue metals and examined the outcome of neonatal inflammation. We found that increasing the entire mixture of metals reduced the risk of neonatal inflammation; however, certain individual metals increased the risk while others decreased it. There was also a suggestive interactive effect found between cadmium and mercury.
-
-## Additional Resources
-The field of mixtures is vast, with many different approaches and example studies to learn from as analysts lead in their own analyses. Some resources that can be helpful include the following reviews:
-
-+ Our recent review on mixtures methodologies, particularly in the field of sufficient similarity, titled [Wrangling whole mixtures risk assessment: Recent advances in determining sufficient similarity](https://www.sciencedirect.com/science/article/abs/pii/S2468202023000323?via%3Dihub)
-+ Two more general, epidemiology-focused reviews on mixtures questions and methodologies, titled [Complex Mixtures, Complex Analyses: an Emphasis on Interpretable Results](https://link.springer.com/article/10.1007/s40572-019-00229-5) and [Environmental exposure mixtures: questions and methods to address them](https://pubmed.ncbi.nlm.nih.gov/30643709/)
-+ [A helpful online toolkit](https://bookdown.org/andreabellavia/mixtures/preface.html) for mixtures analyses generated by Andrea Bellavia, PhD
-
-Some helpful mixtures case studies using BKMR include the following:
-
-+ [Prenatal metal concentrations and childhood cardio-metabolic risk using Bayesian Kernel Machine Regression to assess mixture and interaction effects](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6402346/)
-+ [Associations between Phthalate Metabolite Concentrations in Follicular Fluid and Reproductive Outcomes among Women Undergoing in Vitro Fertilization/Intracytoplasmic Sperm Injection Treatment](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP11998)
-+ [Associations of Prenatal Per- and Polyfluoroalkyl Substance (PFAS) Exposures with Offspring Adiposity and Body Composition at 16–20 Years of Age: Project Viva](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP12597)
-
-
-
-
-:::tyk
-
-Using the simulated dataset within the bkmr package (see below code for how to call and store this dataset), answer the key environmental health questions using BKMR.
-
-1. Which of these chemicals has the strongest effect on the outcome?
-2. Which of these chemicals increases the outcome and which decreases the outcome?
-3. What is the effect on the outcome with exposure to each of the chemicals individually?
-4. What is the effect on the outcome associated with combined exposure to all chemicals?
-5. Are there interactions among the chemicals in relation to the outcome?
-
-Note that the outcome (y) variable is a continuous variable here, rather than binary as in the scripted example.
-:::
-
-```{r 06-Chapter6-102}
-# Set seed for reproducibility
-set.seed(111)
-
-# Create a dataset with 100 participants and 4 mixtures components
-data <- SimData(n = 100, M = 4)
-
-# Save outcome variable (y)
-y <- data$y
-
-# Save mixtures variables (Z and X)
-Z <- data$Z
-X <- data$X
-```
-
-# 6.5 Mixtures Analysis Methods Part 3: Sufficient Similarity
-
-This training module was developed by Cynthia Rider, with contributions from Lauren E. Koval and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-Humans are rarely, if ever, exposed to single chemicals at a time. Instead, humans are often exposed to multiple stressors in their everyday environments in the form of mixtures. These stressors can include environmental chemicals and pharmaceuticals, and they can also include other types of stressors such as socioeconomic factors and other attributes that can place individuals at increased risk of acquiring disease. Because it is not possible to test every possible combination of exposure that an individual might experience in their lifetime, approaches that take into account variable and complex exposure conditions through mixtures modeling are needed.
-
-There are different computational approaches that can be implemented to address this research topic. In this training module, we will demonstrate how to use **sufficient similarity** to determine which groups of exposure conditions are chemically/biologically similar enough to be regulated for safety together, based on the same set of regulatory criteria. Here, our example mixtures analysis will focus on characterizing the nutritional supplement *Ginkgo biloba*.
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. Based on the chemical analysis, which *Ginkgo biloba* extract looks the most different?
-2. When viewing the variability between chemical profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
-3. Based on the chemical analysis, which chemicals do you think are important in differentiating between the different *Ginkgo biloba* samples?
-4. After removing two samples that have the most different chemical profiles (and are thus, potential outliers), do we obtain similar chemical groupings?
-5. When viewing the variability between toxicity profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
-6. Based on the toxicity analysis, which genes do you think are important in differentiating between the different *Ginkgo biloba* samples?
-7. Were similar chemical groups identified when looking at just the chemistry vs. just the toxicity? How could this impact regulatory decisions, if we only had one of these datasets?
-
-
-## Introduction to Toxicant and Dataset
-
-*Ginkgo biloba* represents a popular type of botanical supplement currently on the market. People take *Ginkgo biloba* to improve brain function, but there is conflicting data on its efficacy. Like other botanicals, *Ginkgo biloba* is a complex mixture with 100s-1000s of constituents. Here, the variability in chemical and toxicological profiles across samples of *Ginkgo biloba* purchased from different commercial sources is evaluated. We can use data from a well-characterized sample (reference sample) to evaluate the safety of other samples that are ‘sufficiently similar’ to the reference sample. Samples that are different (i.e., do not meet the standards of sufficient similarity) from the reference sample would require additional safety data.
-
-A total of 29 *Ginkgo biloba* extract samples were analyzed. These samples are abbreviated as “GbE_” followed by a unique sample identifier (GbE = *Ginkgo biloba* Extract). These data have been previously published:
-
-+ Catlin NR, Collins BJ, Auerbach SS, Ferguson SS, Harnly JM, Gennings C, Waidyanatha S, Rice GE, Smith-Roe SL, Witt KL, Rider CV. How similar is similar enough? A sufficient similarity case study with Ginkgo biloba extract. Food Chem Toxicol. 2018 Aug;118:328-339. PMID: [29752982](https://pubmed.ncbi.nlm.nih.gov/29752982/).
-
-+ Collins BJ, Kerns SP, Aillon K, Mueller G, Rider CV, DeRose EF, London RE, Harnly JM, Waidyanatha S. Comparison of phytochemical composition of Ginkgo biloba extracts using a combination of non-targeted and targeted analytical approaches. Anal Bioanal Chem. 2020 Oct;412(25):6789-6809. PMID: [32865633](https://pubmed.ncbi.nlm.nih.gov/32865633/).
-
-
-### *Ginkgo biloba* Chemistry Dataset Overview
-
-The chemical profiles of these sample extracts were first analyzed using targeted mass spectrometry-based approaches. The concentrations of 12 *Ginkgo biloba* marker compounds were measured in units of mean weight as a ratio [g chemical / g sample]. Note that in this dataset, non-detects have been replaced with values of zero for simplicity, though there are more advanced methods to impute values for non-detects. Script is provided to evaluate how *Ginkgo biloba* extracts group together based on chemical profiles.
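-As a simple illustration of the substitution described above, non-detects stored as `NA` can be replaced with zero in base R (the vector here is hypothetical; more rigorous alternatives, such as substituting a fraction of the detection limit or model-based imputation, are generally preferred):
-
-```r
-# Hypothetical measurements with two non-detects stored as NA
-conc <- c(0.52, NA, 1.20, NA, 0.08)
-
-# Replace non-detects with zero (simple, but can bias estimates downward)
-conc[is.na(conc)] <- 0
-conc
-```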
-
-### *Ginkgo biloba* Toxicity Dataset Overview
-
-The toxicological profiles of these samples were also analyzed using *in vitro* test methods. These data represent area under the curve (AUC) values indicating changes in gene expression across various concentrations of the *Ginkgo biloba* extract samples. Positive AUC values indicate a gene that was collectively increased in expression as concentration increased, and a negative AUC value indicates a gene that was collectively decreased in expression as exposure concentration increased. Script is provided to evaluate how *Ginkgo biloba* extracts group together, based on toxicity profiles.
-
-
-## Workspace Preparation and Data Import
-
-#### Install required R packages
-If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you.
-```{r 06-Chapter6-103, results=FALSE, message=FALSE}
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse");
-if (!requireNamespace("readxl"))
- install.packages("readxl");
-if (!requireNamespace("factoextra"))
- install.packages("factoextra");
-if (!requireNamespace("pheatmap"))
- install.packages("pheatmap");
-if (!requireNamespace("gridExtra"))
- install.packages("gridExtra");
-if (!requireNamespace("ggplotify"))
- install.packages("ggplotify")
-```
-
-#### Loading required packages
-```{r 06-Chapter6-104, results=FALSE, message=FALSE}
-library(readxl) #used to read in and work with excel files
-library(factoextra) #used to run and visualize multivariate analyses, here PCA
-library(pheatmap) #used to make heatmaps. This can be done in ggplot2 but pheatmap is easier and nicer
-library(gridExtra) #used to arrange and visualize multiple figures at once
-library(ggplotify) #used to make non ggplot figures (like a pheatmap) gg compatible
-library(tidyverse) #all tidyverse packages, including dplyr and ggplot2
-```
-
-#### Set your working directory
-```{r 06-Chapter6-105, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-#### Import example *Ginkgo biloba* dataset
-
-We need to first read in the chemistry and toxicity data from the provided excel file. Here, data were originally organized such that the actual observations start on row 2 (dataset descriptions were in the first row). So let's implement skip=1, which skips reading in the first row.
-
-```{r 06-Chapter6-106}
-chem <- read_xlsx("Chapter_6/Module6_5_Input/Module6_5_InputData.xlsx" , sheet = "chemistry data", skip=1) # loads the chemistry data tab
-tox <- read_xlsx("Chapter_6/Module6_5_Input/Module6_5_InputData.xlsx" , sheet = "in vitro data", skip=1) # loads the toxicity data tab
-```
-
-### View example dataset
-
-Let's first see how many rows and columns of data are present in both datasets:
-```{r 06-Chapter6-107}
-dim(chem)
-```
-
-The chemistry dataset contains information on 29 samples (rows); and 1 sample identifier + 12 chemicals (total of 13 columns).
-
-```{r 06-Chapter6-108}
-dim(tox)
-```
-
-The tox dataset contains information on 29 samples (rows); and 1 sample identifier + 5 genes (total of 6 columns).
-
-
-Let's also see what kind of data are organized within the datasets:
-```{r 06-Chapter6-109}
-colnames(chem)
-```
-
-```{r 06-Chapter6-110}
-head(chem)
-```
-
-```{r 06-Chapter6-111}
-colnames(tox)
-```
-
-```{r 06-Chapter6-112}
-head(tox)
-```
-
-
-## Chemistry-Based Sufficient Similarity Analysis
-
-The first method employed in this Sufficient Similarity analysis is Principal Component Analysis (PCA). PCA is a very common dimensionality reduction technique, as detailed in **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**.
-
-In summary, PCA finds dimensions (eigenvectors) in the higher dimensional original data that capture as much of the variation as possible, which you can then plot. This allows you to project higher dimensional data, in this case 12 dimensions (representing the 12 measured chemicals), into fewer dimensions (we'll use 2). These dimensions, or components, capture the "essence" of the original dataset.
-
-Before we can run PCA on this chemistry dataset, we first need to scale the data across samples. We do this here for the chemistry dataset, because we specifically want to evaluate and potentially highlight/emphasize chemicals that may be at relatively low abundance. These low-abundance chemicals may actually be contaminants that drive toxicological effects.
-
-Let's first re-save the original chemistry dataset to compare against:
-```{r 06-Chapter6-113}
-chem_original <- chem
-```
-
-Then, we'll make a scaled version to carry forward in this analysis. To do this, we move the sample column to the row names and then scale and center the data.
-```{r 06-Chapter6-114}
-chem <- chem %>% column_to_rownames("Sample")
-chem <- as.data.frame(scale(as.matrix(chem)))
-```
-
-Let's now compare one of the rows of data (here, sample GbE_E) to see what scaling did:
-```{r 06-Chapter6-115}
-chem_original[5,]
-chem[5,]
-```
-
-You can see that scaling centered the distribution of concentrations for each chemical around 0.
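-To see what `scale()` does in isolation, here is a minimal base R sketch on a toy matrix: after scaling, each column is centered at mean 0 with standard deviation 1.
-
-```r
-# Toy matrix: two columns on very different scales
-x <- matrix(c(1, 2, 3, 4, 100, 200, 300, 400), ncol = 2)
-s <- scale(x)  # centers each column at 0 and divides by its standard deviation
-
-round(colMeans(s), 10)      # both columns: 0
-round(apply(s, 2, sd), 10)  # both columns: 1
-```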
-
-Now, we can run PCA on the scaled data:
-```{r 06-Chapter6-116}
-chem_pca <- princomp(chem)
-```
-
-Looking at the scree plot, we see the first two principal components capture most of the variance in the data (~64%):
-```{r 06-Chapter6-117, fig.align = "center"}
-fviz_eig(chem_pca, addlabels = TRUE)
-```
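-The percentages shown by `fviz_eig()` can also be computed directly from the `princomp` object: each component's share of the total variance is its squared standard deviation divided by the sum across all components. A minimal sketch with simulated data (not the chemistry dataset):
-
-```r
-set.seed(1)
-toy <- as.data.frame(matrix(rnorm(120), ncol = 4))
-p <- princomp(toy)
-
-# Percent of total variance captured by each component; sums to 100
-pct <- round(100 * p$sdev^2 / sum(p$sdev^2), 1)
-pct
-```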
-
-
-Here are the resulting PCA scores for each sample, for each principal component (shown here as components 1-12):
-```{r 06-Chapter6-118}
-head(chem_pca$scores)
-```
-
-And the resulting loading factors of each chemical's contribution towards each principal component. Results are arranged by a chemical's contribution to PC1, the component accounting for most of the variation in the data.
-```{r 06-Chapter6-119}
-head(chem_pca$loadings)
-```
-
-We can save the chemical-specific loadings into a separate matrix and view them from highest to lowest values for PC1.
-```{r 06-Chapter6-120}
-loadings <- as.data.frame.matrix(chem_pca$loadings)
-loadings %>% arrange(desc(Comp.1))
-```
-
-These resulting loading factors allow us to identify which constituents (of the 12 total) contribute to the principal components explaining data variability. For instance, we can see here that **Quercetin** is listed at the top, with the largest loading value for principal component 1. Thus, Quercetin is the constituent that contributes to the overall variability in the dataset to the greatest extent. The next three chemicals are all **Ginkgolide** constituents, followed by **Bilobalide** and **Kaempferol**, and so forth.
-
-If we look at principal component 2 (PC2), we can now see a different set of chemicals contributing to the variability captured in this component:
-```{r 06-Chapter6-121}
-loadings %>% arrange(desc(Comp.2))
-```
-
-Here, **Ginkgolic Acids** are listed first.
-
-We can also visualize sample groupings based on these principal components 1 & 2:
-
-```{r 06-Chapter6-122, warning=FALSE, message=FALSE, fig.height=6, fig.width=8, fig.align = "center"}
-# First pull the percent variation captured by each component
-pca_percent <- round(100*chem_pca$sdev^2/sum(chem_pca$sdev^2),1)
-
-# Then make a dataframe for the PCA plot generation script using the first two components
-pca_df <- data.frame(PC1 = chem_pca$scores[,1], PC2 = chem_pca$scores[,2])
-
-# Plot this dataframe
-chem_pca_plt <- ggplot(pca_df, aes(PC1,PC2))+
- geom_hline(yintercept = 0, size=0.3)+
- geom_vline(xintercept = 0, size=0.3)+
- geom_point(size=3, color="deepskyblue3") +
- geom_text(aes(label=rownames(pca_df)), fontface="bold", position=position_jitter(width=0.4,height=0.4))+
- labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
- ggtitle("GbE Sample PCA by Chemistry Profiles")
-
-
-# Changing the colors of the titles and axis text
-chem_pca_plt <- chem_pca_plt + theme(plot.title=element_text(color="deepskyblue3", face="bold"),
- axis.title.x=element_text(color="deepskyblue3", face="bold"),
- axis.title.y=element_text(color="deepskyblue3", face="bold"))
-
-# Viewing this resulting plot
-chem_pca_plt
-```
-
-This plot tells us a lot about sample groupings based on chemical profiles!
-
-### Answer to Environmental Health Question 1
-:::question
-With this, we can answer **Environmental Health Question 1**: Based on the chemical analysis, which *Ginkgo biloba* extract looks the most different?
-:::
-
-:::answer
-**Answer:** GbE_G
-:::
-
-### Answer to Environmental Health Question 2
-:::question
- We can also answer **Environmental Health Question 2**: When viewing the variability between chemical profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
-:::
-
-:::answer
-**Answer:** Approximately 4 (though one could argue ±1): a bottom left group; a bottom right group; and two completely separate samples, GbE_G and GbE_N
-:::
-
-
-As an alternative way of viewing the chemical profile data, we can make a heatmap of the scaled chemistry data. We concurrently run hierarchical clustering, which shows us how closely samples are related to each other using different algorithms than data reduction-based PCA. Samples that fall on nearby branches are more similar. Samples that don't share branches with many/any other samples are often considered outliers.
-
-By default, `pheatmap()` uses Euclidean distance, a very common distance metric, to cluster the observations.
-For more details, see the following description of [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) and for more information on hierarchical clustering, see **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**.
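-As a quick refresher, Euclidean distance is the straight-line distance between two points, and base R's `dist()` computes it pairwise across rows, which is what the clustering operates on:
-
-```r
-# Euclidean distance between (0, 0) and (3, 4): sqrt(3^2 + 4^2) = 5
-a <- c(0, 0)
-b <- c(3, 4)
-sqrt(sum((a - b)^2))
-
-# dist() returns the same pairwise distance
-as.numeric(dist(rbind(a, b)))
-```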
-```{r 06-Chapter6-123, warning=FALSE, message=FALSE, fig.align = "center"}
-chem_hm <- pheatmap(chem, main="GbE Sample Heatmap by Chemistry Profiles",
- cluster_rows=TRUE, cluster_cols = FALSE,
- angle_col = 45, fontsize_col = 7, treeheight_row = 60)
-```
-
-This plot tells us a lot about the individual chemicals that differentiate the sample groupings.
-
-### Answer to Environmental Health Question 3
-:::question
-With this, we can answer **Environmental Health Question 3**: Based on the chemical analysis, which chemicals do you think are important in differentiating between the different *Ginkgo biloba* samples?
-:::
-
-:::answer
-**Answer:** All of the chemicals technically contribute to these sample patterns, but here are some that stand out: (i) Ginkgolic_Acid_C15 and Ginkgolic_Acid_C17 appear to drive the clustering of one particular GbE sample, GbE_G, as well as potentially GbE_N; (ii) Isorhamnetin influences the clustering of GbE_T; (iii) Bilobalide, Ginkgolides A & B, and Quercetin are also important because they show a general abundance pattern of decreased levels at the bottom of the heatmap and increased levels at the top.
-:::
-
-Let's now revisit the PCA plot:
-```{r 06-Chapter6-124, warning=FALSE, message=FALSE, fig.height=3, fig.width=5, fig.align = "center"}
-chem_pca_plt
-```
-
-GbE_G and GbE_N look so different from the rest of the samples that they could be outliers that are potentially influencing overall data trends. Let's make sure that, if we remove these two samples, our sample groupings still look the same.
-
-First, we remove those two samples from the dataframe:
-```{r 06-Chapter6-125, warning=FALSE, message=FALSE}
-chem_filt <- chem %>%
- rownames_to_column("Sample") %>%
- filter(!Sample %in% c("GbE_G","GbE_N")) %>%
- column_to_rownames("Sample")
-```
-
-Then, we can re-run PCA and generate a heatmap of the chemical data with these outlier samples removed:
-```{r 06-Chapter6-126, warning=FALSE, message=FALSE, fig.align = "center"}
-chem_filt_pca <- princomp(chem_filt)
-
-# Get the percent variation captured by each component
-pca_percent_filt <- round(100*chem_filt_pca$sdev^2/sum(chem_filt_pca$sdev^2),1)
-
-# Make dataframe for PCA plot generation using the first two components
-pca_df_filt <- data.frame(PC1 = chem_filt_pca$scores[,1], PC2 = chem_filt_pca$scores[,2])
-
-# Plot this dataframe
-chem_filt_pca_plt <- ggplot(pca_df_filt, aes(PC1,PC2))+
- geom_hline(yintercept = 0, size=0.3)+
- geom_vline(xintercept = 0, size=0.3)+
- geom_point(size=3, color="aquamarine2") +
- geom_text(aes(label=rownames(pca_df_filt)), fontface="bold", position=position_jitter(width=0.5,height=0.5))+
-  labs(x=paste0("PC1 (",pca_percent_filt[1],"%)"), y=paste0("PC2 (",pca_percent_filt[2],"%)"))+
- ggtitle("GbE Sample PCA by Chemistry Profiles excluding Potential Outliers")
-
-# Changing the colors of the titles and axis text
-chem_filt_pca_plt <- chem_filt_pca_plt + theme(plot.title=element_text(color="aquamarine2", face="bold"),
- axis.title.x=element_text(color="aquamarine2", face="bold"),
- axis.title.y=element_text(color="aquamarine2", face="bold"))
-
-# Viewing this resulting plot
-chem_filt_pca_plt
-```
-
-
-To view the PCA plots of all samples vs filtered samples:
-```{r 06-Chapter6-127, warning=FALSE, message=FALSE, fig.height=9, fig.width=8, fig.align = "center"}
-grid.arrange(chem_pca_plt, chem_filt_pca_plt)
-```
-
-
-### Answer to Environmental Health Question 4
-:::question
-With this, we can answer **Environmental Health Question 4**: After removing two samples that have the most different chemical profiles (and are thus, potential outliers), do we obtain similar chemical groupings?
-:::
-
-:::answer
-**Answer:** Yes! Removal of the potential outliers basically spreads the rest of the remaining data points out, since there is less variance in the overall dataset, and thus, more room to show variance amongst the remaining samples. The general locations of the samples on the PCA plot, however, remain consistent. We now feel confident that our similarity analysis is producing consistent grouping results.
-:::
-
-
-
-## Toxicity-Based Sufficient Similarity Analysis
-
-Now, we will perform sufficient similarity analysis using the toxicity data. Unlike the chemistry dataset, we can use the toxicity dataset as is, without scaling, because we want to emphasize genes that show a strong response to the exposure conditions and de-emphasize genes that show little response. If we scaled these data, we would reduce this needed variability.
-
-Here, we first move the sample column to row names:
-```{r 06-Chapter6-128, warning=FALSE, message=FALSE}
-tox <- tox %>% column_to_rownames("Sample")
-```
-
-Then, we can run PCA on this tox dataframe:
-```{r 06-Chapter6-129, warning=FALSE, message=FALSE}
-tox_pca <- princomp(tox)
-```
-
-Looking at the scree plot, we see the first two principal components capture most of the variation (~93%):
-```{r 06-Chapter6-130, warning=FALSE, message=FALSE, fig.align = "center"}
-fviz_eig(tox_pca, addlabels = TRUE)
-```
-
-We can then create a plot of the samples by principal components:
-```{r 06-Chapter6-131, warning=FALSE, message=FALSE, fig.height=7, fig.width=6, fig.align = "center"}
-# Get the percent variation captured by each component
-pca_percent <- round(100*tox_pca$sdev^2/sum(tox_pca$sdev^2),1)
-
-# Make dataframe for PCA plot generation using the first two components
-tox_pca_df <- data.frame(PC1 = tox_pca$scores[,1], PC2 = tox_pca$scores[,2])
-
-# Plot the first two components
-tox_pca_plt <- ggplot(tox_pca_df, aes(PC1,PC2))+
- geom_hline(yintercept = 0, size=0.3)+
- geom_vline(xintercept = 0, size=0.3)+
- geom_point(size=3, color="deeppink3") +
- geom_text(aes(label=rownames(tox_pca_df)), fontface="bold", position=position_jitter(width=0.25,height=0.25))+
- labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
- ggtitle("GbE Sample PCA by Toxicity Profiles")
-
-# Changing the colors of the titles and axis text
-tox_pca_plt <- tox_pca_plt + theme(plot.title=element_text(color="deeppink3", face="bold"),
- axis.title.x=element_text(color="deeppink3", face="bold"),
- axis.title.y=element_text(color="deeppink3", face="bold"))
-
-tox_pca_plt
-```
-
-This plot tells us a lot about sample groupings based on toxicity profiles!
-
-### Answer to Environmental Health Question 5
-:::question
-With this, we can answer **Environmental Health Question 5**: When viewing the variability between toxicity profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
-:::
-
-:::answer
-**Answer:** Approximately 3 (though one could argue for one more or one fewer): a top left group; a top right group; and GbE_M with GbE_W.
-:::
-
-
-Similar to the chemistry data, as an alternative way of viewing the toxicity profile data, we can make a heatmap of the toxicity data:
-```{r 06-Chapter6-132, warning=FALSE, message=FALSE, fig.align = "center"}
-tox_hm <- pheatmap(tox, main="GbE Sample Heatmap by Toxicity Profiles",
- cluster_rows=TRUE, cluster_cols = FALSE,
- angle_col = 45, fontsize_col = 7, treeheight_row = 60)
-```
-
-This plot tells us a lot about the individual genes that differentiate the sample groupings!
-
-### Answer to Environmental Health Question 6
-:::question
-With this, we can answer **Environmental Health Question 6**: Based on the toxicity analysis, which genes do you think are important in differentiating between the different *Ginkgo biloba* samples?
-:::
-
-:::answer
-**Answer:** It looks like the CYP enzyme genes, particularly CYP2B6, are highly up-regulated in response to several of these sample exposures, and thus dictate a lot of these groupings.
-:::
-
-
-
-## Comparing Chemistry vs. Toxicity Sufficient Similarity Analyses
-
-Let's view the PCA plots for both datasets together, side-by-side:
-```{r 06-Chapter6-133, fig.height=8, fig.width=11, fig.align = "center"}
-pca_compare <- grid.arrange(chem_pca_plt,tox_pca_plt, nrow=1)
-```
-
-Let's also view the PCA plots for both datasets together, top-to-bottom, to visualize the trends along both axes better between these two views:
-```{r 06-Chapter6-134, fig.height=10, fig.width=10, fig.align = "center"}
-pca_compare <- grid.arrange(chem_pca_plt,tox_pca_plt)
-```
-
-Here is an edited version of the above figures, highlighting with colored circles some chemical groups of interest identified through chemistry vs toxicity-based sufficient similarity analyses:
-
-```{r 06-Chapter6-135, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_6/Module6_5_Input/Module6_5_Image1.png")
-```
-
-
-### Answer to Environmental Health Question 7
-:::question
-With this, we can answer **Environmental Health Question 7**: Were similar chemical groups identified when looking at just the chemistry vs. just the toxicity? How could this impact regulatory action, if we only had one of these datasets?
-:::
-
-:::answer
-**Answer:** There are some similarities between groupings, though there are also notable differences. For example, samples GbE_A, GbE_B, GbE_C, GbE_F, and GbE_H group together in both the chemistry and toxicity similarity analyses. However, samples such as GbE_G, GbE_W, and GbE_N clearly differ in their grouping assignments. These differences could impact the accuracy of regulatory decisions: if regulation were based solely on the chemistry (without toxicity data), or vice versa, we might miss important information needed for accurate health risk evaluations.
-:::
-
-### Additional Methods
-
-Although we focused on sufficient similarity in this module, a number of other approaches exist to evaluate mixtures. For example, the **relative potency factor** approach is another component-based method that can be used to evaluate mixtures. Component-based approaches use data from individual chemicals (the components of the mixture) and additivity models to estimate the effects of the mixture. For other methods, also see **TAME 2.0 Module 6.3 Mixtures I: Overview and Quantile G-Computation Application** and **TAME 2.0 Module 6.4 Mixtures II: BKMR Application**.
-
-
-
-## Concluding Remarks
-
-In this module, we evaluated the similarity between variable lots of *Ginkgo biloba* and identified sample groupings that could be used for chemical risk assessment purposes. Together, this example highlights the utility of sufficient similarity analyses to address environmental health research questions.
-
-### Additional Resources
-
-Some helpful resources that provide further background on the topic of mixtures toxicology and mixtures modeling include the following:
-
-+ Carlin DJ, Rider CV, Woychik R, Birnbaum LS. Unraveling the health effects of environmental mixtures: an NIEHS priority. Environ Health Perspect. 2013 Jan;121(1):A6-8. PMID: [23409283](https://pubmed.ncbi.nlm.nih.gov/23409283/).
-
-+ Drakvik E, Altenburger R, Aoki Y, Backhaus T, Bahadori T, Barouki R, Brack W, Cronin MTD, Demeneix B, Hougaard Bennekou S, van Klaveren J, Kneuer C, Kolossa-Gehring M, Lebret E, Posthuma L, Reiber L, Rider C, Rüegg J, Testa G, van der Burg B, van der Voet H, Warhurst AM, van de Water B, Yamazaki K, Öberg M, Bergman Å. Statement on advancing the assessment of chemical mixtures and their risks for human health and the environment. Environ Int. 2020 Jan;134:105267. PMID: [31704565](https://pubmed.ncbi.nlm.nih.gov/31704565/).
-
-+ Rider CV, McHale CM, Webster TF, Lowe L, Goodson WH 3rd, La Merrill MA, Rice G, Zeise L, Zhang L, Smith MT. Using the Key Characteristics of Carcinogens to Develop Research on Chemical Mixtures and Cancer. Environ Health Perspect. 2021 Mar;129(3):35003. PMID: [33784186](https://pubmed.ncbi.nlm.nih.gov/33784186/).
-
-
-+ Taylor KW, Joubert BR, Braun JM, Dilworth C, Gennings C, Hauser R, Heindel JJ, Rider CV, Webster TF, Carlin DJ. Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology: Lessons from an Innovative Workshop. Environ Health Perspect. 2016 Dec 1;124(12):A227-A229. PMID: [27905274](https://pubmed.ncbi.nlm.nih.gov/27905274/).
-
-
-For more information and additional examples in environmental health research, see the following relevant publications implementing sufficient similarity methods to address complex mixtures:
-
-+ Catlin NR, Collins BJ, Auerbach SS, Ferguson SS, Harnly JM, Gennings C, Waidyanatha S, Rice GE, Smith-Roe SL, Witt KL, Rider CV. How similar is similar enough? A sufficient similarity case study with Ginkgo biloba extract. Food Chem Toxicol. 2018 Aug;118:328-339. PMID: [29752982](https://pubmed.ncbi.nlm.nih.gov/29752982/).
-
-+ Collins BJ, Kerns SP, Aillon K, Mueller G, Rider CV, DeRose EF, London RE, Harnly JM, Waidyanatha S. Comparison of phytochemical composition of Ginkgo biloba extracts using a combination of non-targeted and targeted analytical approaches. Anal Bioanal Chem. 2020 Oct;412(25):6789-6809. PMID: [32865633](https://pubmed.ncbi.nlm.nih.gov/32865633/).
-
-+ Ryan KR, Huang MC, Ferguson SS, Waidyanatha S, Ramaiahgari S, Rice JR, Dunlap PE, Auerbach SS, Mutlu E, Cristy T, Peirfelice J, DeVito MJ, Smith-Roe SL, Rider CV. Evaluating Sufficient Similarity of Botanical Dietary Supplements: Combining Chemical and In Vitro Biological Data. Toxicol Sci. 2019 Dec 1;172(2):316-329. PMID: [31504990](https://pubmed.ncbi.nlm.nih.gov/31504990/).
-
-+ Rice GE, Teuschler LK, Bull RJ, Simmons JE, Feder PI. Evaluating the similarity of complex drinking-water disinfection by-product mixtures: overview of the issues. J Toxicol Environ Health A. 2009;72(7):429-36. PMID: [19267305](https://pubmed.ncbi.nlm.nih.gov/19267305/).
-
-
-
-
-
-:::tyk
-We recently published a study evaluating similarities across wildfire chemistry profiles using a more advanced analysis approach than the one described in this module (PMID: [36399130](https://pubmed.ncbi.nlm.nih.gov/36399130/)). For this test your knowledge box, let’s implement the simpler, PCA-based sufficient similarity analysis to identify groups of biomass smoke exposure signatures using chemical profiles. The relevant dataset is included in the file *Module6_5_TYKInput.csv*. Specifically:
-
-1. Perform a PCA on the chemistry data and visualize the proximity of each chemical signature to other signatures according to the first two principal components.
-
-2. Identify major groupings of biomass smoke exposure signatures.
-:::
-
-# 6.6 Toxicokinetic Modeling
-
-This training module was developed by Caroline Ring, Lauren E. Koval, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-*Disclaimer: The views expressed in this document are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.*
-
-
-## Introduction to Training Module
-
-This module serves as an example to guide trainees through the basics of toxicokinetic (TK) modeling and how this type of modeling can be used in the high-throughput setting for environmental health research applications.
-
-In this activity, the capabilities of a high-throughput toxicokinetic modeling package titled 'httk' are demonstrated and explored on a suite of environmentally relevant chemicals. The httk R package implements high-throughput toxicokinetic modeling (hence, 'httk'), including a generic physiologically based toxicokinetic (PBTK) model as well as tables of the chemical-specific parameters needed to solve the model for hundreds of chemicals. Example modeling estimates are produced for the high-interest environmental chemical bisphenol-A, including an example script that derives its plasma concentration at steady state.
-
-The concept of reverse toxicokinetics is explained and demonstrated, again using bisphenol-A as an example chemical.
-
-This module then demonstrates the derivation of the bioactivity-exposure ratio (BER) across many chemicals, leveraging the capabilities of httk while incorporating exposure measures. BERs are particularly useful in the evaluation of chemical risk, as they take into account both toxicity (i.e., *in vitro* potency) and exposure rates, the two essential components used in risk calculations for chemical safety and prioritization evaluations. Therefore, estimates of both potency and exposure are needed to calculate BERs, as described in this training module.
-
-For potency estimates, the ToxCast high-throughput screening library is introduced as an example high-throughput dataset to carry out in vitro to in vivo extrapolation (IVIVE) modeling through httk. ToxCast activity concentrations that elicit 50% maximal bioactivity (AC50 values) are uploaded and organized as inputs, and then the tenth-percentile ToxCast AC50 is calculated for each chemical (in other words, across all ToxCast screening assays, the tenth percentile of AC50 values is carried forward). These concentrations then serve as the potency estimates. For exposure estimates, previously generated values inferred from CDC NHANES urinary biomonitoring data are used.
-
-The bioactivity-exposure ratio (BER) is then calculated across chemicals with both potency and exposure estimates. This ratio is simply the lower-end equivalent dose (for the most-sensitive 5\% of the population) divided by the upper-end estimated exposure (here, the upper bound on the inferred population median exposure). Chemicals are then ranked based on the resulting BERs and visualized through plots. The importance of these chemical prioritization results is then discussed in relation to environmental health research and corresponding regulatory decisions.
-
-## Introduction to Toxicokinetic Modeling
-
-To understand what toxicokinetic modeling is, consider the following scenario:
-
-```{r 06-Chapter6-136, echo=FALSE, fig.align = "center" }
-knitr::include_graphics("Chapter_6/Module6_6_Input/Module6_6_Image1.png")
-```
-
-Simply put, toxicokinetics answers these questions by describing "what the body does to the chemical" after an exposure scenario.
-
-More technically, **toxicokinetic modeling** refers to the evaluation of the uptake and disposition of a chemical in the body.
-
-### Notes on terminology
-Pharmacokinetics (PK) is a synonym for toxicokinetics (TK). They are often used interchangeably. PK connotes pharmaceuticals; TK connotes environmental chemicals – but those connotations are weak.
-
-A common abbreviation that you will also see in this research field is **ADME**, which stands for:
-
-- **Absorption:** How does the chemical get absorbed into the body tissues?
-- **Distribution:** Where does the chemical go inside the body?
-- **Metabolism:** How do enzymes in the body break apart the chemical molecules?
-- **Excretion:** How does the chemical leave the body?
-
-To place this term into the context of TK, TK models describe ADME mathematically by representing the body as compartments and flows.
-
-
-### Types of TK models
-TK models describe the body mathematically as one or more "compartments" connected by "flows." The compartments represent organs or tissues. Using mass balance equations, the amount or concentration of chemical in each compartment is described as a function of time.
-
-Types of models discussed throughout this training module are described here.
-
-#### 1 Compartment Model
-The simplest TK model is a 1-compartment model, where the body is assumed to be one big well-mixed compartment.
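-
-As a minimal illustration of what this looks like mathematically, the sketch below plots the analytic solution of a 1-compartment model for a single oral bolus dose with instantaneous absorption and first-order elimination, $C(t) = \frac{F \times D}{V_d} e^{-k_e t}$. All parameter values here are hypothetical (not chemical-specific); in practice, *httk* supplies real chemical-specific parameters, as described later in this module.
-
-```{r, eval=FALSE}
-# Minimal 1-compartment sketch: single oral bolus dose with
-# instantaneous absorption and first-order elimination.
-# All parameter values below are hypothetical.
-Fgutabs <- 1    # fraction of the oral dose absorbed (unitless)
-dose    <- 1    # single bolus dose, mg/kg
-Vd      <- 5    # volume of distribution, L/kg
-ke      <- 0.5  # first-order elimination rate constant, 1/day
-
-t <- seq(0, 10, by = 0.1)                   # time, days
-C <- (Fgutabs * dose / Vd) * exp(-ke * t)   # concentration, mg/L
-
-plot(t, C, type = "l", xlab = "Time, days", ylab = "Concentration, mg/L")
-```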
-
-#### 3 Compartment Model
-A 3-compartment model mathematically incorporates three distinct body compartments that can exhibit different parameters contributing to their individual mass balances. Commonly used compartments in 3-compartment modeling include tissues like blood plasma, liver, gut, kidney, and/or a 'rest of body' term, though the specific compartments included depend on the chemical under evaluation, the exposure scenario, and modeling assumptions.
-
-#### PBTK Model
-A physiologically-based TK (PBTK) model incorporates compartments and flows that represent real physiological quantities (as opposed to the aforementioned empirical 1- and 3-compartment models). PBTK models have more parameters overall, including parameters representing physiological quantities that are known *a priori* based on studies of anatomy. The only PBTK model parameters that need to be estimated for each new chemical are parameters representing chemical-body interactions, which can include the following:
-
-- Rate of hepatic metabolism: How fast does the liver break down the chemical?
-- Plasma protein binding: How tightly does the chemical bind to proteins in blood plasma? The liver may not be able to break down chemical that is bound to plasma proteins.
-- Blood:tissue partition coefficients: Assuming the chemical diffuses between blood and other tissues very fast compared to the rate of blood flow, the ratio of the concentration in blood to the concentration in each tissue is approximately constant; this constant is the partition coefficient.
-- Rate of active transport into/out of a tissue: Relevant if the chemical moves between blood and tissues not just by passive diffusion, but by cells actively transporting it in or out of the tissue.
-- Binding to other tissues: Some chemical may be bound inside a tissue and not available for diffusion or transport in/out.
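-
-To see what these chemical-specific parameters look like in practice, *httk* provides functions that assemble the full parameter set for a chemical. As a sketch (element names may vary by *httk* version):
-
-```{r, eval=FALSE}
-# Assemble the pbtk model parameters for a given chemical.
-# parameterize_pbtk() returns a named list combining physiological
-# constants with chemical-specific values.
-bpa_params <- parameterize_pbtk(chem.name = "Bisphenol-A")
-
-names(bpa_params)          # all parameters in the set
-bpa_params$Clint           # intrinsic hepatic clearance
-bpa_params$Funbound.plasma # fraction unbound to plasma protein
-```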
-
-
-Types of TK modeling can also fall into the following major categories:
-
-1. **Forward TK Modeling:** External exposure doses are converted into internal doses (i.e., concentrations of chemicals/drugs in one or more body tissues of interest).
-2. **Reverse TK Modeling:** The reverse of the above, where internal doses are converted into external exposure doses.
-
-
-### Other TK modeling resources
-
-For further information on TK modeling background, math, and example models, there are additional resources online including a helpful course website on [Basic Pharmacokinetics](https://www.boomer.org/c/p4/) by Dr. Bourne.
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 06-Chapter6-137}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you:
-```{r 06-Chapter6-138, results=FALSE, message=FALSE}
-if(!nzchar(system.file(package = "ggplot2"))){
- install.packages("ggplot2")}
-if(!nzchar(system.file(package = "reshape2"))){
- install.packages("reshape2")}
-if(!nzchar(system.file(package = "stringr"))){
- install.packages("stringr")}
-if(!nzchar(system.file(package = "httk"))){
- install.packages("httk")}
-if(!nzchar(system.file(package = "eulerr"))){
- install.packages("eulerr")}
-```
-
-
-#### Loading R packages required for this session
-```{r 06-Chapter6-139, results=FALSE, message=FALSE}
-library(ggplot2) # ggplot2 will be used to generate associated graphics
-library(reshape2) # reshape2 will be used to organize and transform datasets
-library(stringr) # stringr will be used to aid in various data manipulation steps through this module
-library(httk) # httk package will be used to carry out all toxicokinetic modeling steps
-library(eulerr) # eulerr package will be used to generate Venn/Euler diagram graphics
-```
-
-
-For more information on the *ggplot2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/ggplot2/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5).
-
-For more information on the *reshape2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/reshape2/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4).
-
-For more information on the *stringr* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/stringr/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/stringr/versions/1.4.0).
-
-For more information on the *httk* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/httk/index.html) and parent publication by [Pearce et al. (2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134854/).
-
-For more information on the *eulerr* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/eulerr/index.html).
-
-#### More information on the httk package
-You can see an overview of the *httk* package by typing `?httk` at the R command line.
-
-You can see a browsable index of all functions in the *httk* package by typing `help(package="httk")` at the R command line.
-
-You can see a browsable list of vignettes by typing `browseVignettes("httk")` at the R command line. (Please note that some of these vignettes were written using older versions of the package and may no longer work as written -- specifically the Ring (2017) vignette, which I wrote back in 2016. The *httk* team is actively working on updating these.)
-
-You can get information about any function in *httk*, or indeed any function in any R package, by typing `help()` and placing the function name in quotation marks inside the parentheses. For example, to get information about the *httk* function `solve_model()`, type this:
-
-```{r 06-Chapter6-140, eval=FALSE}
-help("solve_model")
-```
-
-Note that this module was run with `httk` version 2.4.0.
-
-#### Set your working directory
-```{r 06-Chapter6-141, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-(1) After solving the TK model that evaluates bisphenol-A, what is the maximum concentration of bisphenol-A estimated to occur in human plasma, after 1 exposure dose of 1 mg/kg/day?
-
-(2) After solving the TK model that evaluates bisphenol-A, what is the steady-state concentration of bisphenol-A estimated to occur in human plasma, for a long-term oral infusion dose of 1 mg/kg/day?
-
-(3) What is the predicted range of bisphenol-A concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentiles.
-
-(4) Considering the chemicals evaluated in the above TK modeling example, do the $C_{ss}$-dose slope distributions become wider as the median $C_{ss}$-dose slope increases?
-
-(5) How many chemicals have available AC50 values to evaluate in the current ToxCast/Tox21 high-throughput screening database?
-
-(6) What are the chemicals with the three lowest predicted equivalent doses (for tenth-percentile ToxCast AC50s), for the most-sensitive 5\% of the population?
-
-(7) Based on httk modeling estimates, are chemicals with higher bioactivity-exposure ratios always less potent than chemicals with lower bioactivity-exposure ratios?
-
-(8) Based on httk modeling estimates, do chemicals with higher bioactivity-exposure ratios always have lower estimated exposures than chemicals with lower bioactivity-exposure ratios?
-
-(9) How are chemical prioritization results different when using only hazard information vs. only exposure information vs. bioactivity-exposure ratios?
-
-(10) Of the three datasets used in this training module -- bioactivity from ToxCast, TK data from *httk*, and exposure inferred from NHANES urinary biomonitoring -- which one most limits the number of chemicals that can be prioritized using BERs?
-
-
-
-## Data and Models used in Toxicokinetic Modeling (TK)
-
-### Common Models used in TK Modeling, that are Provided as Built-in Models in httk
-
-There are five TK models currently built into *httk*. They are:
-
-* **pbtk**: A physiologically-based TK model with oral absorption. Contains the following compartments: gutlumen, gut, liver, kidneys, veins, arteries, lungs, and the rest of the body. Chemical is metabolized by the liver and excreted by the kidneys via glomerular filtration.
-* **gas_pbtk**: A PBTK model with absorption via inhalation. Contains the same compartments as `pbtk`.
-* **1compartment**: A simple one-compartment TK model with oral absorption.
-* **3compartment**: A three-compartment TK model with oral absorption. Compartments are gut, liver, and rest of body.
-* **3compartmentss**: The steady-state solution to the 3-compartment model under an assumption of constant infusion dosing, without considering tissue partitioning. This was the first *httk* model (see Wambaugh et al. 2015, Wetmore et al. 2012, Rotroff et al. 2010).
-
-### Chemical-Specific TK Data Built Into 'httk'
-
-Each of these TK models has chemical-specific parameters. The chemical-specific TK information needed to parameterize these models is built into `httk`, in the form of a built-in lookup table in a data.frame called `chem.physical_and_invitro.data`. This lookup table means that in order to run a TK model for a particular chemical, you only need to specify the chemical.
-
-Look at the first few rows of this data.frame to see everything that's in there (it is a lot of information).
-
-```{r 06-Chapter6-142}
-head(chem.physical_and_invitro.data)
-```
-
-The table contains chemical identifiers: name, CASRN (Chemical Abstract Service Registry Number), and DTXSID (DSSTox ID, a chemical identifier from the EPA Distributed Structure-Searchable Toxicity Database, DSSTox for short -- more information can be found at https://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database). The table also contains physical-chemical properties for each chemical. These are used in predicting tissue partitioning.
-
-The table contains *in vitro* measured chemical-specific TK parameters, if available. These chemical-specific parameters include intrinsic hepatic clearance (`Clint`) and fraction unbound to plasma protein (`Funbound.plasma`) for each chemical. It also contains measured values for oral absorption fraction `Fgutabs`, and for the partition coefficient between blood and plasma `Rblood2plasma`, if these values have been measured for a given chemical. If available, there may be chemical-specific TK values for multiple species.
-
-#### Listing chemicals for which a TK model can be parameterized
-
-You can easily get a list of all the chemicals for which a specific TK model can be parameterized (for a given species, if needed) using the function `get_cheminfo()`.
-
-For example, here is how you get a list of all the chemicals for which the PBTK model can be parameterized for humans.
-
-```{r 06-Chapter6-143, warning = FALSE}
-chems_pbtk <- get_cheminfo(info = c("Compound", "CAS", "DTXSID"),
- model = "pbtk",
- species = "Human")
-
-head(chems_pbtk) #first few rows
-```
-
-
-How many such chemicals have parameter data to run a PBTK model in this package?
-```{r 06-Chapter6-144}
-nrow(chems_pbtk)
-```
-
-Here is how you get all the chemicals for which the 3-compartment steady-state model can be parameterized for humans.
-```{r 06-Chapter6-145}
-chems_3compss <- get_cheminfo(info = c("Compound", "CAS", "DTXSID"),
- model = "3compartmentss",
- species = "Human")
-```
-
-How many such chemicals have parameter data to run a 3-compartment steady-state model in this package?
-```{r 06-Chapter6-146}
-nrow(chems_3compss)
-```
-
-The 3-compartment steady-state model can be parameterized for a few more chemicals than the PBTK model, because it is a simpler model and requires less data to parameterize. Specifically, the 3-compartment steady-state model does not require estimating tissue partition coefficients, unlike the PBTK model.
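-
-As a quick sketch using the *eulerr* package loaded earlier, we could also visualize the overlap in chemical coverage between the two models; this assumes the `chems_pbtk` and `chems_3compss` data frames created above:
-
-```{r, eval=FALSE}
-# Compare chemical coverage between the two models using DTXSIDs
-pbtk_ids   <- chems_pbtk$DTXSID
-compss_ids <- chems_3compss$DTXSID
-
-# Chemicals covered by the 3-compartment steady-state model but not pbtk
-length(setdiff(compss_ids, pbtk_ids))
-
-# Euler diagram of the overlap in chemical coverage
-plot(euler(list(pbtk = pbtk_ids, `3compartmentss` = compss_ids)))
-```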
-
-### Solving Toxicokinetic Models to Obtain Internal Chemical Concentration vs. Time Predictions
-
-You can solve any of the models for a specified chemical and specified dosing protocol, and get concentration vs. time predictions, using the function `solve_model()`. For example:
-
-```{r 06-Chapter6-147, warning=FALSE}
-sol_pbtk <- solve_model(chem.name = "Bisphenol-A", #chemical to simulate
- model = "pbtk", #TK model to use
- dosing = list(initial.dose = NULL, #for repeated dosing, if first dose is different from the rest, specify first dose here
- doses.per.day = 1, #number of doses per day
- daily.dose = 1, #total daily dose in mg/kg units
- dosing.matrix = NULL), #used to specify more complicated dosing protocols
- days = 1) #number of days to simulate
-```
-
-There are some cryptic-sounding warnings that can safely be ignored. (They provide information about certain assumptions made while solving the model.) Then there is a final message providing the units of the output.
-
-The output, assigned to `sol_pbtk`, is a matrix with concentration vs. time data for each of the compartments in the pbtk model. Time is in units of days. Additionally, the output traces the amount excreted via passive renal filtration (`Atubules`), the amount metabolized in the liver (`Ametabolized`), and the cumulative area under the curve for plasma concentration vs. time (`AUC`). Here are the first few rows of `sol_pbtk` so you can see the format.
-
-```{r 06-Chapter6-148}
-head(sol_pbtk)
-```
-
-You can plot the results, for example plasma concentration vs. time.
-
-```{r 06-Chapter6-149, fig.align = "center"}
-sol_pbtk <- as.data.frame(sol_pbtk) #because ggplot2 requires data.frame input, not matrix
-
-ggplot(sol_pbtk) +
- geom_line(aes(x = time,
- y = Cplasma)) +
- theme_bw() +
- xlab("Time, days") +
- ylab("Cplasma, uM") +
- ggtitle("Plasma concentration vs. time for single dose 1 mg/kg Bisphenol-A")
-```
-
-### Calculating summary metrics of internal dose produced from TK models
-
-We can calculate summary metrics of internal dose -- peak concentration, average concentration, and AUC -- using the function `calc_tkstats()`. We have to specify the dosing protocol and length of simulation. Here, we use the same dosing protocol and simulation length as in the plot above.
-
-```{r 06-Chapter6-150, warning = FALSE}
-tkstats <- calc_tkstats(chem.name = "Bisphenol-A", #chemical to simulate
- stats = c("AUC", "peak", "mean"), #which metrics to return (these are the only three choices)
- model = "pbtk", #model to use
- tissue = "plasma", #tissue for which to return internal dose metrics
- days = 1, #length of simulation
- daily.dose = 1, #total daily dose in mg/kg/day
- doses.per.day = 1) #number of doses per day
-
-print(tkstats)
-```
-
-
-### Answer to Environmental Health Question 1
-:::question
-*With this, we can answer **Environmental Health Question #1***: After solving the TK model that evaluates bisphenol-A, what is the maximum concentration of bisphenol-A estimated to occur in human plasma, after 1 exposure dose of 1 mg/kg/day?
-:::
-
-:::answer
-**Answer**: The peak plasma concentration estimate for bisphenol-A, under the conditions tested, is 0.3779 uM.
-:::
-
-
-### Calculating steady-state concentration
-
-Another summary metric is the steady-state concentration: If the same dose is given repeatedly over many days, the body concentration will (usually) reach a steady state after some time. The value of this steady-state concentration, and the time needed to achieve steady state, are different for different chemicals. Steady-state concentrations are useful when considering long-term, low-level exposures, which is frequently the situation in environmental health.
-
-For example, here is a plot of plasma concentration vs. time for 1 mg/kg/day Bisphenol-A, administered for 12 days. You can see how the average plasma concentration reaches a steady state around 1.5 uM. Each peak represents one day's dose.
-
-```{r 06-Chapter6-151, warning = FALSE, fig.align = "center"}
-foo <- as.data.frame(solve_pbtk(
- chem.name='Bisphenol-A',
- daily.dose=1,
- days=12,
- doses.per.day=1,
- tsteps=2))
-
-ggplot(foo) +
- geom_line(aes(x = time,
- y= Cplasma)) +
- scale_x_continuous(breaks = seq(0,12)) +
- xlab("Time, days") +
- ylab("Cplasma, uM")
-```
-
-*httk* includes a function `calc_analytic_css()` to calculate the steady-state plasma concentration ($C_{ss}$ for short) analytically for each model, for a specified chemical and daily oral dose. This function assumes that the daily oral dose is administered as an oral infusion, rather than a single oral bolus dose -- in effect, that the daily dose is divided into many small doses over the day. Therefore, the result of `calc_analytic_css()` may differ slightly from our previous estimate, which was based on the concentration vs. time plot for a single oral bolus dose each day.
-
-Here is the result of `calc_analytic_css()` for a 1 mg/kg/day dose of bisphenol-A.
-
-```{r 06-Chapter6-152, warning = FALSE}
-calc_analytic_css(chem.name = "Bisphenol-A",
- daily.dose = 1,
- output.units = "uM",
- model = "pbtk",
- concentration = "plasma")
-```
-
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: After solving the TK model that evaluates bisphenol-A, what is the steady-state concentration of bisphenol-A estimated to occur in human plasma, for a long-term oral infusion dose of 1 mg/kg/day?
-:::
-
-:::answer
-**Answer**: The steady-state plasma concentration estimate for bisphenol-A, under the conditions tested, is 0.9417 uM.
-:::
-
-
-
-### Steady-state concentration is linear with dose for httk models
-
-For the TK models included in the *httk* package, steady-state concentration is linear with dose for a given chemical. The slope of the line is simply the steady-state concentration for a dose of 1 mg/kg/day. This can be shown by calling `calc_analytic_css()` for several doses, and plotting the dose-$C_{ss}$ points along a line whose slope is equal to $C_{ss}$ for 1 mg/kg/day.
-
-```{r 06-Chapter6-153, fig.align = "center"}
-#choose five doses at which to find the Css
-doses <- c(0.1, #all mg/kg/day
- 0.5,
- 1.0,
- 1.5,
- 2.0)
-suppressWarnings(bpa_css <- sapply(doses,
- function(dose) calc_analytic_css(chem.name = "Bisphenol-A",
- daily.dose = dose,
- output.units = "uM",
- model = "pbtk",
- concentration = "plasma",
- suppress.messages = TRUE)))
-
-DF <- data.frame(dose = doses,
- Css = bpa_css)
-
-#Plot the results
-Cssdosefig <- ggplot(DF) +
- geom_point(aes(x = dose,
- y = Css),
- size = 3) +
- geom_abline( #plot a straight line
- intercept = 0, #intercept 0
- slope = DF[DF$dose==1, #slope = Css for 1 mg/kg/day
- "Css"],
- linetype = 2
- ) +
- xlab("Daily dose, mg/kg/day") +
- ylab("Css, uM")
-
-print(Cssdosefig)
-
-```
-
-## Reverse Toxicokinetics
-
-In the previous TK examples, we started with a specified dosing protocol, then solved the TK models to find the resulting concentration in the body (e.g., in plasma). This allows us to convert from external exposure metrics to internal exposure metrics. However, many environmental health questions require the reverse: converting from internal exposure metrics to external exposure metrics.
-
-For example, when health effects of environmental chemicals are studied in epidemiological cohorts, those effects are often related to *internal* exposure metrics, such as blood or plasma concentration of a chemical. Similarly, *in vitro* studies of chemical bioactivity (for example, the ToxCast program) relate bioactivity to *in vitro* concentration, which can be considered analogous to internal exposure or body concentration. So we may know the *internal* exposure level associated with some adverse health effect of a chemical.
-
-However, risk assessors and risk managers typically control *external* exposure to reduce the risk of adverse health effects. They need some way to start from an internal exposure associated with adverse health effects, and convert to the corresponding external exposure.
-
-The solution is *reverse toxicokinetics* (reverse TK). Starting with a specified internal exposure metric (body concentration), solve the TK model *in reverse* to find the corresponding external exposure that produced that concentration.
-
-When exposures are long-term and low-level (as environmental exposures often are), then the relevant internal exposure metric is the steady-state concentration. In this case, it is useful to remember the linear relationship between $C_{ss}$ and dose for the *httk* TK models. It gives you a quick and easy way to perform reverse TK for the steady-state case.
-
-The procedure is illustrated graphically below.
-
-1. Begin with a "target" concentration on the y-axis (labeled $C_{\textrm{target}}$). For example, $C_{\textrm{target}}$ may be the *in vitro* concentration associated with bioactivity in a ToxCast assay, or the plasma concentration associated with an adverse health effect in an epidemiological study.
-2. Draw a horizontal line over to the $C_{ss}$-dose line.
-3. Drop down vertically to the x-axis and read off the corresponding dose. This is the *administered equivalent dose* (AED): the external dose or exposure rate, in mg/kg/day, that would produce an internal steady-state plasma concentration equal to the target concentration.
-
-```{r 06-Chapter6-154, echo = FALSE, warning = FALSE, fig.align = "center"}
-reverseTKfig <- Cssdosefig +
- geom_segment(aes(x = -Inf, y = 0.8671, xend = 0.75, yend = 0.8671),
- size = 2,
- arrow = arrow(angle = 30, length = unit(5, "mm"), type = "closed"),
- color = "#fc8d62") +
- geom_segment(aes(x = 0.75, y = 0.8671, xend = 0.75, yend = -Inf),
- size = 2,
- arrow = arrow(angle = 30, length = unit(5, "mm"), type = "closed"),
- color = "#fc8d62") +
- ggplot2::annotate("text",
- x = 0,
- y = 1,
- label = "1",
- size = 8,
- color = "#fc8d62",
- vjust = "bottom") +
- ggplot2::annotate("text",
- x = 0.75,
- y = 1,
- label = "2",
- size = 8,
- color = "#fc8d62",
- vjust = "bottom") +
- ggplot2::annotate("text",
- x = 0.8,
- y = 0,
- label = "3",
- size = 8,
- color = "#fc8d62",
- hjust = "left") +
- scale_y_continuous(breaks = c(seq(0, 2, by = 0.5), 0.8671),
- labels = c(seq(0, 2, by = 0.5), "Ctarget"),
- minor_breaks = seq(0.25, 1.75, by = 0.5)) +
- scale_x_continuous(breaks = c(seq(0, 2, by = 0.5), 0.75),
- labels = c(seq(0, 2, by = 0.5), "AED"),
- minor_breaks = seq(0.25, 1.75, by = 0.5)) +
- theme(
- axis.text.y = element_text(
- color = c(rep("black", length(seq(0, 2, by = 0.5))), "#fc8d62"),
- size = 12
- ),
- axis.text.x = element_text(
- color = c(rep("black", length(seq(0, 2, by = 0.5))), "#fc8d62"),
- size = 12
- )
- )
-
-print(reverseTKfig)
-```
-
-Mathematically, the relation is very simple:
-
-$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{-dose slope}} $$
-
-Since the $C_{ss}$-dose slope is simply $C_{ss}$ for a daily dose of 1 mg/kg/day, this equation can be rewritten as
-
-$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{ for 1 mg/kg/day}} $$
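-
-As a quick worked example, using the steady-state result computed earlier for bisphenol-A ($C_{ss} = 0.9417$ uM for 1 mg/kg/day): if the target plasma concentration were, say, 1 uM, then
-
-$$ AED = \frac{1 \textrm{ uM}}{0.9417 \textrm{ uM per mg/kg/day}} \approx 1.06 \textrm{ mg/kg/day} $$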
-
-
-## Capturing Population Variability in Toxicokinetics, and Uncertainty in Chemical-Specific Parameters
-
-For a given dose, $C_{ss}$ is determined by the values of the parameters of the TK model. These parameters describe absorption, distribution, metabolism, and excretion (ADME) of each chemical. They include both chemical-specific parameters, describing hepatic clearance and protein binding, and chemical-independent parameters, describing physiology. A table of these parameters is presented below.
-
-```{r 06-Chapter6-155, results = "asis", echo = FALSE}
-paramtable <- data.frame("Parameter" = c("Intrinsic hepatic clearance rate",
- "Fraction unbound to plasma protein",
- "Tissue:plasma partition coefficients",
- "Tissue masses",
- "Tissue blood flows",
- "Glomerular filtration rate",
- "Hepatocellularity"),
- "Details" = c("Rate at which liver removes chemical from blood",
- "Free fraction of chemical in plasma",
- "Ratio of concentration in body tissues to concentration in plasma",
- "Mass of each body tissue (including total body weight)",
- "Blood flow rate to each body tissue",
- "Rate at which kidneys remove chemical from blood",
- "Number of cells per mg liver"),
- "Estimated" = c("Measured *in vitro*",
- "Measured *in vitro*",
- "Estimated from chemical and tissue properties",
- rep("From anatomical literature", 4)
- ),
- "Type" = c(rep("Chemical-specific", 3),
- rep("Chemical-independent", 4))
-)
-
-knitr::kable(paramtable)
-```
-
-Because these parameters represent physiology and chemical-body interactions, their exact values will vary across individuals in a population, reflecting population physiological variability. Additionally, parameters are subject to measurement uncertainty.
-
-Since the $C_{ss}$-dose relation is determined by these parameters, variability and uncertainty in the TK parameters translates directly into variability and uncertainty in $C_{ss}$ for a given dose. In other words, there is a distribution of $C_{ss}$ values for each daily dose level of a chemical.
-
-The $C_{ss}$-dose relationship is still linear when variability and uncertainty are taken into account. However, rather than a single $C_{ss}$-dose slope, there is a distribution of $C_{ss}$-dose slopes. Because the $C_{ss}$-dose slope is simply the $C_{ss}$ value for an exposure rate of 1 mg/kg/day, the distribution of the $C_{ss}$-dose slope is the same as the $C_{ss}$ distribution for an exposure rate of 1 mg/kg/day.
-
-A distribution of $C_{ss}$-dose slopes is illustrated in the figure below, along with boxplots showing the distributions of $C_{ss}$ itself at five different dose levels: 0.1, 0.5, 1.0, 1.5, and 2.0 mg/kg/day.
-
-
-```{r 06-Chapter6-156, echo = FALSE, warning = FALSE, fig.align = "center"}
-
-suppressWarnings(css_examp <- calc_mc_css(chem.name = "Bisphenol-A",
- which.quantile = c(0.05, #specify which quantiles to return
- 0.25,
- 0.5,
- 0.75,
- 0.95),
- output.units = "uM",
- suppress.messages = TRUE,
- model = "3compartmentss" #which model to use to calculate Css
- ))
-
-#Css for various doses
-css_dist_wide <- as.data.frame(
- t(
- sapply(doses,
- function(x) x * css_examp
- )
- )
-)
-
-#add column defining daily doses
-css_dist_wide$dose <- doses
-
-#data.frame of slope percentiles
-slope_dist <- data.frame(slope = css_examp,
- quantile= factor(names(css_examp),
- levels = names(css_examp)))
-
-#colors for plotting -- specify order to be consistent with color use later
-#This is a slight re-ordering of ColorBrewer2's "Set2" palette
-plotcols <- c("5%" = "#66c2a5",
- "50%" = "#fc8d62",
- "95%" = "#8da0cb",
- "25%" = "#e78ac3",
- "75%" = "#a6d854")
-
-
-ggplot(css_dist_wide) +
- geom_boxplot(aes(x = dose,
- group = dose,
- lower = `25%`,
- upper = `75%`,
- middle = `50%`,
- ymin = `5%`,
- ymax = `95%`),
- stat = "identity") +
- geom_abline(data = slope_dist,
- aes(intercept =0,
- slope = slope,
- color = quantile),
- size = 1) +
- scale_color_manual(values = plotcols,
- limits = levels(slope_dist$quantile),
- name = "Percentile") +
- xlab("Daily dose, mg/kg/day") +
- ylab("Css, uM") +
- theme(legend.position = c(0.1,0.7))
-
-```
-
-An appropriate title for this figure could be:
-
-"**Boxplots: Distributions of Css for five daily dose levels of Bisphenol-A.** Boxes extend from 25th to 75th percentile. Lower whisker = 5th percentile; upper whisker = 95th percentile. Lines: Css-dose relations for each quantile."
-
-### Variability and Uncertainty in Reverse Toxicokinetics
-
-Earlier, we found that with a linear $C_{ss}$-dose relation, reverse toxicokinetics became a matter of a simple linear equation. For a given target concentration -- for example, a plasma concentration associated with adverse health effects *in vivo*, or a concentration associated with bioactivity *in vitro* -- we could predict an AED (administered equivalent dose), the external exposure rate in mg/kg/day that would produce the target concentration at steady state.
-
-$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{-dose slope}} $$
-
-Since AED depends on the $C_{ss}$-dose slope, variability and uncertainty in that slope will induce variability and uncertainty in the AED. A distribution of slopes will lead to a distribution of AEDs for the same target concentration.
-
-For example, a graphical representation of finding the AED distribution for a target concentration of 1 uM looks like this, for the same example chemical (bisphenol-A) used to illustrate the distribution of $C_{ss}$-dose slopes above. (The lines shown in this plot are the same as in the previous plot, but the plot has been "zoomed in" on the y-axis.)
-
-The steps are the same as before:
-
-1. Begin with a "target" concentration on the y-axis, here 1 uM.
-2. Draw a horizontal line over to intersect each $C_{ss}$-dose line.
-3. Where the horizontal line intersects each $C_{ss}$-dose line, drop down vertically to the x-axis and read off each corresponding AED (marked with colored circles matching the color of each $C_{ss}$-dose line).
-
-```{r 06-Chapter6-157, echo = FALSE, warning = FALSE, fig.align = "center"}
-
-ggplot(css_dist_wide,
- aes(x=dose, y = `95%`)) +
- geom_blank() +
- geom_abline(data = slope_dist,
- aes(intercept =0,
- slope = slope,
- color = quantile),
- size = 1) +
- geom_hline(aes(yintercept = 1)) +
- geom_segment(aes(x = 1/css_examp,
- xend = 1/css_examp,
- y = 1,
- yend = -Inf)) +
- geom_point(aes(x = 1/css_examp,
- y = -Inf,
- color = factor(c("5%", "25%", "50%", "75%", "95%"),
- levels = levels(slope_dist$quantile))
- ),
- size = 5) +
- scale_color_manual(values = plotcols,
- limits = levels(slope_dist$quantile),
- name = "Percentile") +
- xlab("Daily dose, mg/kg/day") +
- ylab("Css, uM") +
- theme(legend.position = "right") +
- coord_cartesian(ylim = c(0,5),
- clip = "off")
-```
-
-Notice that the line with the steepest, 95th-percentile slope (the purple line) yields the lowest AED (the purple dot, approximately 0.07 mg/kg/day for this example chemical), and the line with the shallowest, 5th-percentile slope (the turquoise blue line) yields the highest AED (the turquoise dot, approximately 2 mg/kg/day for this example chemical).
-
-In general, the 95th-percentile $C_{ss}$-dose slope represents the most-sensitive 5\% of the population -- individuals who will reach the target concentration in their body with the smallest daily doses. Therefore, using the AED for the 95th-percentile $C_{ss}$-dose slope is a conservative choice, health-protective for an estimated 95\% of the population.
-
-
-### Monte Carlo approach to simulating variability and uncertainty
-The *httk* package implements a Monte Carlo approach for simulating variability and uncertainty in TK.
-
-*httk* first defines distributions for the TK model parameters, representing population variability. These distributions are defined based on real data about U.S. population demographics and physiology collected as part of the Centers for Disease Control's National Health and Nutrition Examination Survey (NHANES) [(Ring et al., 2017)](https://pubmed.ncbi.nlm.nih.gov/28628784/). TK parameters with known measurement uncertainty (intrinsic hepatic clearance rate and fraction of chemical unbound in plasma) additionally have distributions defined to represent their uncertainty [(Wambaugh et al., 2019)](https://pubmed.ncbi.nlm.nih.gov/31532498/).
-
-Then, *httk* samples sets of TK parameter values from these distributions (including appropriate correlations: for example, liver mass is correlated with body weight). Each sampled set of TK parameter values represents one "simulated individual."
-
-Next, *httk* calculates the $C_{ss}$-dose slope for each "simulated individual." The resulting sample of $C_{ss}$-dose slopes can be used to characterize the distribution of $C_{ss}$-dose slopes -- for example, by calculating percentiles.
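-
-In pseudocode, the overall Monte Carlo procedure looks roughly like this (a conceptual sketch only, not *httk*'s actual implementation):
-
-```
-for i in 1..N_simulated_individuals:
-    params_i <- sample from population/uncertainty distributions (with correlations)
-    slope_i  <- steady-state Css for a 1 mg/kg/day dose, given params_i
-report quantiles of {slope_1, ..., slope_N}, e.g. the 5th, 50th, and 95th percentiles
-```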
-
-*httk* makes this whole Monte Carlo process simple and transparent for the user. You just need to call one function, `calc_mc_css()`, specifying the chemical whose $C_{ss}$-dose slope distribution you want to calculate. Behind the scenes, *httk* will perform all the Monte Carlo calculations. It will return percentiles of the $C_{ss}$-dose slope (by default), or it can return all individual samples of $C_{ss}$-dose slope (if you want to do some calculations of your own).
-
-### Chemical-Specific Example Capturing Population Variability for Bisphenol-A Plasma Concentration Estimates
-
-The following code estimates the 5th percentile, 50th percentile, and 95th percentile of the $C_{ss}$-dose slope for the chemical bisphenol-A. For the sake of simplicity, we will use the 3-compartment steady-state model (rather than the PBTK model used in the previous examples).
-
-```{r 06-Chapter6-158, warning=FALSE}
-css_examp <- calc_mc_css(chem.name = "Bisphenol-A",
- which.quantile = c(0.05, #specify which quantiles to return
- 0.5,
- 0.95),
- model = "3compartmentss", #which model to use to calculate Css
-                    output.units = "uM") #could also choose mg/L
-
-print(css_examp)
-```
-
-Recall that the $C_{ss}$-dose slope is the same as $C_{ss}$ for a daily dose of 1 mg/kg/day. The function `calc_mc_css()` therefore assumes a dose of 1 mg/kg/day and calculates the resulting $C_{ss}$ distribution. If you need to calculate the $C_{ss}$ distribution for a different dose, e.g. 2 mg/kg/day, you can simply multiply the $C_{ss}$ percentiles from `calc_mc_css()` by your desired dose.
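-
-In equation form, because the model is linear in dose, every percentile of the $C_{ss}$ distribution scales proportionally:
-
-$$ C_{ss}^{(q)}(\textrm{dose}) = \textrm{dose} \times C_{ss}^{(q)}(1 \textrm{ mg/kg/day}) $$
-
-For example, doubling the dose to 2 mg/kg/day doubles the 5th, 50th, and 95th percentile $C_{ss}$ values returned by `calc_mc_css()`.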
-
-The steady-state plasma concentration for a 1 mg/kg/day dose is returned in units of uM. The three requested quantiles are returned as a named numeric vector (whose names in this case are `5%`, `50%`, and `95%`).
-
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: What is the predicted range of bisphenol-A concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentiles.
-:::
-
-:::answer
-**Answer**: For a human population exposed to 1 mg/kg/day bisphenol-A, plasma concentrations are estimated to be `r unname(css_examp[1])` uM at the 5th percentile, `r unname(css_examp[2])` uM at the 50th percentile, and `r unname(css_examp[3])` uM at the 95th percentile.
-:::
-
-
-### High-Throughput Example Capturing Population Variability for ~1000 Chemicals
-
-We can easily and (fairly) quickly do this for all 998 chemicals for which the 3-compartment steady-state model can be parameterized, using `sapply()` to loop over the chemicals. This will take a few minutes to run (for example, it takes about 10-15 minutes on a Dell Latitude with an Intel i7 processor).
-
-In order to make the Monte Carlo sampling reproducible, set a seed for the random number generator. It doesn't matter what seed you choose -- it can be any integer. Here, the seed is set to 42, because it's the answer to the ultimate question of life, the universe, and everything [(Adams, 1979)](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy_(novel)).
-
-```{r 06-Chapter6-159}
-set.seed(42)
-
-system.time(
- suppressWarnings(
- css_3compss <- sapply(chems_3compss$CAS,
- calc_mc_css,
- #additional arguments to calc_mc_css()
- model = "3compartmentss",
- which.quantile = c(0.05, 0.5, 0.95),
- output.units = "uM",
- suppress.messages = TRUE)
- )
-)
-```
-
-Organizing the results:
-```{r 06-Chapter6-160}
-#css_3compss comes out as a 3 x 998 array,
-#where rows are quantiles and columns are chemicals
-#transpose it so that rows are chemicals and columns are quantiles
-css_3compss <- t(css_3compss)
-#convert to data.frame
-css_3compss <- as.data.frame(css_3compss)
-#make a column for CAS, rather than just leaving it as the row names
-css_3compss$CAS <- row.names(css_3compss)
-
-head(css_3compss) #View first few rows
-```
-
-
-### Plotting the $C_{ss}$-dose slope distribution quantiles across these ~1000 chemicals
-
-Here, we will plot the resulting concentration distribution quantiles for each chemical, while sorting the chemicals from lowest to highest median value.
-
-By default, *ggplot2* will plot the chemical CASRNs in alphabetically-sorted order. To force it to plot them in another order, we have to explicitly specify the desired order. The easiest way to do this is to add a column in the data.frame that contains the chemical names as a factor (categorical) variable, whose levels (categories) are explicitly set to be the CASRNs in our desired plotting order. Then we can tell *ggplot2* to plot that factor variable on the x-axis, rather than the original CASRN variable.
-
-Set the ordering of the chemical CASRNs from lowest to highest median value
-```{r 06-Chapter6-161}
-chemical_order <- order(css_3compss$`50%`)
-```
-
-Create a factor (categorical) CAS column where the factor levels are given by the CASRNs with this ordering.
-```{r 06-Chapter6-162}
-css_3compss$CAS_factor <- factor(css_3compss$CAS, levels = css_3compss$CAS[chemical_order])
-```
-
-For plotting ease, reshape the data.frame into "long" format -- rather than having one column for each quantile of the $C_{ss}$ distribution, have a row for each chemical/quantile combination. We use the `melt()` function from the *reshape2* package.
-```{r 06-Chapter6-163, warning = FALSE}
-css_3compss_melt <- reshape2::melt(css_3compss,
- id.vars = "CAS_factor",
- measure.vars = c("5%", "50%", "95%"),
- variable.name = "Percentile",
- value.name = "Css_slope")
-head(css_3compss_melt)
-```
-
-Plot the slope percentiles. Use a log scale for the y-axis because the slopes span orders of magnitude. Suppress the x-axis labels (the CASRNs) because they are not readable anyway.
-```{r 06-Chapter6-164, fig.align = "center"}
-ggplot(css_3compss_melt) +
- geom_point(aes(x=CAS_factor,
- y = Css_slope,
- color = Percentile)) +
- scale_color_brewer(palette = "Set2") + #use better color scheme than default
- scale_y_log10() + #use log scale for y axis
- xlab("Chemical") +
- ylab("Css-dose slope (uM per mg/kg/day)") +
- annotation_logticks(sides = "l") + #add log ticks to y axis
- theme_bw() + #plot with white plot background instead of gray
- theme(axis.text.x = element_blank(), #suppress x-axis labels
- panel.grid.major.x = element_blank(), #suppress vertical grid lines
-         legend.position = c(0.1,0.8) #place legend in upper left corner
- )
-```
-
-Chemicals along the x-axis are ordered from lowest to highest median (50th percentile) predicted $C_{ss}$-dose slope. For each chemical, the orange point marks the 50th percentile $C_{ss}$-dose slope, the green point the 5th percentile, and the purple point the 95th percentile; together, these three points characterize the distribution of $C_{ss}$-dose slopes across the U.S. population for that chemical. The width of the distribution for each chemical is roughly represented by the vertical distance between its green and purple points.
-
-
-### Answer to Environmental Health Question 4
-:::question
-*With this, we can answer **Environmental Health Question #4***: Considering the chemicals evaluated in the above TK modeling example, do the $C_{ss}$-dose slope distributions become wider as the median $C_{ss}$-dose slope increases?
-:::
-
-:::answer
-**Answer**: No -- the $C_{ss}$-dose slope distributions generally become narrower as the median $C_{ss}$-dose slope increases. This can be seen by looking at the right end of the plot, where the highest-median chemicals are located -- the distance between the green points and purple points, representing the 5th and 95th percentiles, is much smaller for these higher-median chemicals.
-:::
-
-
-
-## Reverse TK: Calculating Administered Equivalent Doses for ToxCast Bioactive Concentrations
-
-As described in an earlier section of this document, the slope defining the linear relation between $C_{ss}$ and dose is useful for reverse toxicokinetics: converting an internal dose metric to an external dose metric. The internal dose metric may, for example, be a concentration associated with an *in vivo* health effect, or with *in vitro* bioactivity. Here, we will consider *in vitro* bioactivity -- specifically, from the ToxCast program. ToxCast tests chemicals in concentration-response format (each chemical at multiple concentrations) across a battery of *in vitro* assays that measure activity in a wide variety of biological endpoints. If a chemical showed activity in an assay at any of its tested concentrations, then one metric of the concentration associated with bioactivity is the AC50 -- the concentration at which the assay response is halfway between its minimum and its maximum.
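-
-As background, concentration-response data of this kind are often summarized with a Hill-type model (shown here as a generic sketch; the actual ToxCast curve fits, handled by the *tcpl* package, include several model forms). For a response
-
-$$ R(C) = \frac{R_{\textrm{max}}}{1 + \left( AC_{50}/C \right)^{n}} $$
-
-with concentration $C$ and Hill coefficient $n$, we have $R(AC_{50}) = R_{\textrm{max}}/2$: the response is halfway between its minimum (here, 0) and its maximum, which is exactly what the AC50 summarizes.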
-
-The module won't address the details of how ToxCast determines assay activity and AC50s from raw concentration-response data. There is an entire R package for the ToxCast data processing workflow, called *tcpl*. If you want to learn more about those details, [start here](https://www.epa.gov/chemical-research/toxcast-data-generation-toxcast-pipeline-tcpl). Lots of information is available if you install the *tcpl* R package and look at the package vignette; it essentially walks you through the full ToxCast data processing workflow.
-
-In this module, we will begin with pre-computed ToxCast AC50 values for various chemicals and assays. We will use `httk` to convert ToxCast AC50 values into administered equivalent doses (AEDs).
-
-### Loading ToxCast AC50s
-
-The latest public release of ToxCast high-throughput screening assay data can be downloaded [here](https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data). Previous public releases of ToxCast data included a matrix of AC50s by chemical and assay. The data format of the latest public release does not contain this kind of matrix. So this dataset was pre-processed to prepare a simple data.frame of AC50s for each chemical/assay combination for the purposes of this training module.
-
-Read in the pre-processed dataset and view the first few rows.
-
-```{r 06-Chapter6-165}
-toxcast <- read.csv("Chapter_6/Module6_6_Input/Module6_6_InputData1.csv")
-head(toxcast)
-```
-
-The columns of this data frame are:
-
-* `Compound`: The compound name.
-* `CAS`: The compound's CASRN.
-* `DTXSID`: The compound's DSSTox Substance ID.
-* `aenm`: Assay identifier. "aenm" stands for "Assay Endpoint Name." More information about the ToxCast assays is available on the [ToxCast data download page](https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data).
-* `log10_ac50`: The AC50 for the chemical/assay combination on each row, in log10 uM units.
-
-How many ToxCast chemicals are in this dataset?
-
-```{r 06-Chapter6-166}
-length(unique(toxcast$DTXSID))
-```
-
-
-### Answer to Environmental Health Question 5
-:::question
-*With this, we can answer **Environmental Health Question #5***: How many chemicals have available AC50 values to evaluate in the current ToxCast/Tox21 high-throughput screening database?
-:::
-
-:::answer
-**Answer**: 7863 chemicals.
-:::
-
-
-### Subsetting the ToxCast Chemicals to include those that are also in httk
-
-Not all of the ToxCast chemicals have TK data built into *httk* such that we can perform reverse TK using the *httk* models. Let's subset the ToxCast data to include only the chemicals for which we can run the 3-compartment steady-state model.
-
-Previously, we used `get_cheminfo()` to get a list of chemicals for which we could run the 3-compartment steady state model, including the names, CASRNs, and DSSTox IDs of those chemicals. That list is stored in variable `chems_3compss`, a data.frame with compound name, CASRN, and DTXSID. Now, we can use that chemical list to subset the ToxCast data.
-
-```{r 06-Chapter6-167}
-toxcast_httk <- subset(toxcast,
- subset = toxcast$DTXSID %in%
- chems_3compss$DTXSID)
-```
-
-How many chemicals are in this subset?
-
-```{r 06-Chapter6-168}
-length(unique(toxcast_httk$DTXSID))
-```
-
-There were 998 *httk* chemicals for which we could run the 3-compartment steady-state model; only 869 of them had ToxCast data. Conversely, most of the 7863 ToxCast chemicals do not have TK data in *httk* such that we can run the 3-compartment steady-state model.
-
-### Identifying the Lower-Bound *In Vitro* AC50 Value per Chemical
-ToxCast/Tox21 screens chemicals across multiple assays, such that each chemical has multiple resulting AC50 values, spanning a range of values. For example, here are boxplots of the AC50s for the first 20 chemicals listed in `chems_3compss`. Note that the chemical identifiers, DTXSID, are used here in these visualizations to represent unique chemicals.
-
-```{r 06-Chapter6-169, fig.align = "center"}
-ggplot(toxcast_httk[toxcast_httk$DTXSID %in%
- chems_3compss[1:20,
- "DTXSID"],
- ]
- ) +
- geom_boxplot(aes(x=DTXSID, y = log10_ac50)) +
- ylab("log10 AC50") +
- theme_bw() +
- theme(axis.text.x = element_text(angle = 45,
- hjust = 1))
-```
-
-
-Sometimes we have an interest in getting the equivalent dose for an AC50 for one specific assay. For example, if we happen to be interested in estrogen-receptor activity, we might look specifically at one of the assays that measures estrogen receptor activity.
-
-However, sometimes we just want a general idea of what concentrations showed bioactivity in *any* of the ToxCast assays, regardless of the specific biological endpoint of each assay. In this case, typically, we are interested in a "reasonable lower bound" of bioactive concentrations across assays for each chemical. Intuitively, we suspect that the very lowest AC50s for each chemical might represent false activity. Therefore, we often select the tenth percentile of ToxCast AC50s for each chemical as that "reasonable lower bound" on bioactive concentrations.
-
-Let's calculate the tenth percentile ToxCast AC50 for each chemical. Here, we use the base R function `aggregate()`, which groups a vector (specified in the `x` argument) by a list of factors (specified in the `by` argument), and applies a function to each group (specified in the `FUN` argument). You can add any extra arguments to the `FUN` function as named arguments to `aggregate()`.
-
-```{r 06-Chapter6-170}
-toxcast_httk_P10 <- aggregate(x = toxcast_httk$log10_ac50, #aggregate the AC50s
- by = list(DTXSID = toxcast_httk$DTXSID), #group AC50s by DTXSID
- FUN = quantile, #the function to apply to each group
- prob = 0.1) #an argument to the quantile() function
-#by default the names of the output data.frame will be 'DTXSID' and 'x'
-#let's change 'x' to be a more informative name
-names(toxcast_httk_P10) <- c("DTXSID", "log10_ac50_P10")
-```
-
-Let's transform the tenth-percentile AC50 values back to the natural scale (they are currently on the log10 scale) and put them in a new column `AC50`. These AC50s will be in uM.
-
-```{r 06-Chapter6-171}
-toxcast_httk_P10$AC50 <- 10^(toxcast_httk_P10$log10_ac50_P10)
-```
-
-View the first few rows:
-
-```{r 06-Chapter6-172}
-head(toxcast_httk_P10)
-```
-
-
-### Calculating Equivalent Doses for 10th Percentile ToxCast AC50s
-
-We can calculate equivalent doses in one line of R code -- again including all of the Monte Carlo for TK uncertainty and variability -- just by using the *httk* function `calc_mc_oral_equiv()`.
-
-Note that in `calc_mc_oral_equiv()`, the `which.quantile` argument refers to the quantile of the $C_{ss}$-dose slope, not the quantile of the equivalent dose itself. So specifying `which.quantile = 0.95` will yield a *lower* equivalent dose than `which.quantile = 0.05`.
-
-Under the hood, `calc_mc_oral_equiv()` first calls `calc_mc_css()` to get percentiles of the $C_{ss}$-dose slope for a chemical. It then divides a user-specified target concentration (specified in argument `conc`) by each quantile of $C_{ss}$-dose slope to get the equivalent dose corresponding to that target concentration for each slope quantile.
-
-Here, we're using the `mapply()` function in base R to call `calc_mc_oral_equiv()` in a loop over chemicals. This is because `calc_mc_oral_equiv()` requires two chemical-specific arguments -- the chemical identifier and the concentration for which to compute the equivalent dose. `mapply()` lets us provide vectors of values for each argument (in the named arguments `dtxsid` and `conc`), and will automatically loop over those vectors. We also use the argument `MoreArgs`, a named list of additional arguments to the function in `FUN` that will be the same for every iteration of the loop. Note that this line of code takes a few minutes to run.
-
-```{r 06-Chapter6-173, results="hide"}
-set.seed(42)
-
-system.time(
-  suppressWarnings(
-    toxcast_equiv_dose <- mapply(FUN = calc_mc_oral_equiv,
-                                 conc = toxcast_httk_P10$AC50,
-                                 dtxsid = toxcast_httk_P10$DTXSID,
-                                 MoreArgs = list(model = "3compartmentss", #model to use
-                                                 which.quantile = c(0.05, 0.5, 0.95), #quantiles of Css-dose slope
-                                                 suppress.messages = TRUE)
-                                 )
-  )
-)
-
-#by default, the result is a 3 x 869 matrix, where rows are quantiles and columns are chemicals
-
-toxcast_equiv_dose <- t(toxcast_equiv_dose) #transpose so that rows are chemicals
-toxcast_equiv_dose <- as.data.frame(toxcast_equiv_dose) #convert to data.frame
-head(toxcast_equiv_dose) #look at first few rows
-```
-
-Let's add the DTXSIDs back into this data.frame.
-
-```{r 06-Chapter6-174}
-toxcast_equiv_dose$DTXSID <- toxcast_httk_P10$DTXSID
-```
-
-We can get the names of these chemicals by using the list of chemicals for which the 3-compartment steady-state model can be parameterized, which was stored in the variable `chems_3compss`. In that dataframe, we have the compound name and CASRN corresponding to each DTXSID.
-
-```{r 06-Chapter6-175}
-head(chems_3compss)
-```
-
-Merge `chems_3compss` with `toxcast_equiv_dose`.
-
-```{r 06-Chapter6-176}
-toxcast_equiv_dose <- merge(chems_3compss,
-                            toxcast_equiv_dose,
-                            by = "DTXSID",
-                            all.x = FALSE,
-                            all.y = TRUE)
-
-head(toxcast_equiv_dose)
-```
-
-To find the chemicals with the lowest equivalent doses at the 95th percentile level (corresponding to the most-sensitive 5\% of the population), sort this data.frame in ascending order on the `95%` column.
-
-```{r 06-Chapter6-177}
-toxcast_equiv_dose <- toxcast_equiv_dose[order(toxcast_equiv_dose$`95%`), ]
-head(toxcast_equiv_dose, 10) #first ten rows of sorted table
-```
-
-
-### Answer to Environmental Health Question 6
-:::question
-*With this, we can answer **Environmental Health Question #6***: What are the chemicals with the three lowest predicted equivalent doses (for tenth-percentile ToxCast AC50s), for the most-sensitive 5\% of the population?
-:::
-
-:::answer
-**Answer**: 2,4-D, secbumeton, and 1,4-dioxane
-:::
-
-
-
-## Comparing Equivalent Doses Estimated to Elicit Toxicity (Hazard) to External Exposure Estimates (Exposure), for Chemical Prioritization by Bioactivity-Exposure Ratios (BERs)
-
-To estimate potential risk, hazard -- in the form of the equivalent dose for the 10th-percentile ToxCast AC50 -- now needs to be compared to exposure. A quantitative metric for this comparison is the ratio of the lower-end equivalent dose (protective of the most-sensitive 5\% of the population) to the upper-end exposure estimate. This metric is termed the Bioactivity-Exposure Ratio, or BER; a lower BER corresponds to higher potential risk. With BERs calculated for each chemical, we can rank all of the chemicals from lowest to highest BER, yielding a chemical prioritization based on potential risk.
-
-### Human Exposure Estimates
-
-Here, we will use exposure estimates that have been inferred from CDC NHANES urinary biomonitoring data (Ring et al., 2019). These estimates consist of an estimated median exposure rate, along with upper and lower 95\% credible interval bounds representing uncertainty in that estimated median. They are provided in the following CSV file:
-
-```{r 06-Chapter6-178}
-exposure <- read.csv("Chapter_6/Module6_6_Input/Module6_6_InputData2.csv")
-head(exposure) #view first few rows
-```
-
-### Merging Exposure Estimates with Equivalent Dose Estimates of Toxicity (Hazard)
-
-To calculate a BER for a chemical, it needs to have both an equivalent dose and an exposure estimate. Not all of the chemicals for which equivalent doses could be computed (*i.e.*, chemicals with both ToxCast AC50s and `httk` data) also have exposure estimates inferred from NHANES. Find out how many do.
-
-```{r 06-Chapter6-179}
-length(intersect(toxcast_equiv_dose$DTXSID, exposure$DTXSID))
-```
-
-This means that, using the ToxCast AC50 data for bioactive concentrations, the NHANES urinary inference data for exposures, and the *httk* package to convert bioactive concentrations to equivalent doses, we can compute BERs for `r length(intersect(toxcast_equiv_dose$DTXSID, exposure$DTXSID))` chemicals.
-
-Merge together the ToxCast equivalent doses and the exposure data into a single data frame. Keep only the chemicals that have data in both ToxCast equivalent doses and exposure data frames.
-
-```{r 06-Chapter6-180}
-hazard_exposure <- merge(toxcast_equiv_dose,
-                         exposure,
-                         by = "DTXSID",
-                         all = FALSE)
-head(hazard_exposure) #view first few rows of result
-```
-
-### Plotting Hazard and Exposure Together
-
-We can visually compare the equivalent doses and the inferred exposure estimates by plotting them together.
-
-```{r 06-Chapter6-181, fig.align = "center"}
-ggplot(hazard_exposure) +
-  geom_crossbar(aes(x = Compound.x, #Boxes for equivalent doses
-                    y = `50%`,
-                    ymax = `5%`,
-                    ymin = `95%`,
-                    color = "Equiv. dose")) +
-  geom_crossbar(aes(x = Compound.x, #Boxes for exposures
-                    y = Median,
-                    ymax = up95,
-                    ymin = low95,
-                    color = "Exposure")) +
-  scale_color_manual(values = c("Equiv. dose" = "black",
-                                "Exposure" = "orange"),
-                     name = NULL) +
-  scale_x_discrete(labels = function(x) str_trunc(x, 20)) + #truncate chemical names to 20 chars
-  scale_y_log10() +
-  annotation_logticks(sides = "l") +
-  ylab("Equiv. dose or Exposure, mg/kg/day") +
-  theme_bw() +
-  theme(axis.text.x = element_text(angle = 45,
-                                   hjust = 1,
-                                   size = 6),
-        axis.title.x = element_blank(),
-        legend.position = "top")
-```
-
-### Calculating Bioactivity-Exposure Ratios (BERs)
-
-The bioactivity-exposure ratio (BER) is simply the ratio of the lower-end equivalent dose (for the most-sensitive 5\% of the population) divided by the upper-end estimated exposure (here, the upper bound on the inferred population median exposure). In the data frame `hazard_exposure` containing the hazard and exposure data, the lower-end equivalent dose is in column `95%` (corresponding to the 95th-percentile $C_{ss}$-dose slope) and the upper-end exposure is in column `up95`. Calculate the BER, and assign the result to a new column in the `hazard_exposure` data frame called `BER`.
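-In formula form, using the column names in `hazard_exposure`:
-
-$$\mathrm{BER} = \frac{D_{\mathrm{equiv},\,95\%}}{\mathrm{Exposure}_{\mathrm{up95}}}$$
-
-where $D_{\mathrm{equiv},\,95\%}$ is the column `95%` and $\mathrm{Exposure}_{\mathrm{up95}}$ is the column `up95`.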
-
-```{r 06-Chapter6-182}
-hazard_exposure[["BER"]] <- hazard_exposure[["95%"]]/hazard_exposure[["up95"]]
-```
-
-### Prioritizing Chemicals by BER
-
-To prioritize chemicals according to potential risk, they can be sorted from lowest to highest BER. The lower the BER, the higher the priority.
-
-Sort the rows of the data.frame from lowest to highest BER.
-
-```{r 06-Chapter6-183}
-hazard_exposure <- hazard_exposure[order(hazard_exposure$BER), ]
-head(hazard_exposure)
-```
-
-The hazard-exposure plot above showed chemicals in alphabetical order. It can be revised to show chemicals in order of priority, from lowest to highest BER.
-
-First, create a categorical (factor) variable for the compound names, whose levels are in order of increasing BER. (Since we already sorted the data.frame in order of increasing BER, we can just take the compound names in the order that they appear.)
-
-```{r 06-Chapter6-184}
-hazard_exposure$Compound_factor <- factor(hazard_exposure$Compound.x,
-                                          levels = hazard_exposure$Compound.x)
-```
-
-Now, make the same plot as before, but use `Compound_factor` as the x-axis variable instead of `Compound.x`.
-
-```{r 06-Chapter6-185, fig.align = "center"}
-ggplot(hazard_exposure) +
-  geom_crossbar(aes(x = Compound_factor, #Boxes for equivalent dose
-                    y = `50%`,
-                    ymax = `5%`,
-                    ymin = `95%`,
-                    color = "Equiv. dose")) +
-  geom_crossbar(aes(x = Compound_factor, #Boxes for exposure
-                    y = Median,
-                    ymax = up95,
-                    ymin = low95,
-                    color = "Exposure")) +
-  scale_color_manual(values = c("Equiv. dose" = "black",
-                                "Exposure" = "orange"),
-                     name = NULL) +
-  scale_x_discrete(labels = function(x) str_trunc(x, 20)) + #truncate chemical names
-  scale_y_log10() +
-  ylab("Equiv. dose or Exposure, mg/kg/day") +
-  annotation_logticks(sides = "l") +
-  theme_bw() +
-  theme(axis.text.x = element_text(angle = 45,
-                                   hjust = 1,
-                                   size = 6),
-        axis.title.x = element_blank(),
-        legend.position = "top")
-```
-
-
-Now, the chemicals are displayed in order of increasing BER. From left to right, you can see the gap widen between the lower bound of the equivalent doses (the bottom of the black boxes) and the upper bound of the exposure estimates (the top of the orange boxes). Because the y-axis is on a log~10~ scale, the vertical distance between the boxes corresponds to the log~10~ of the BER. We can gather a lot of information from this plot!
-
-
-### Answer to Environmental Health Question 7
-:::question
-*With this, we can answer **Environmental Health Question #7***: Based on httk modeling estimates, are chemicals with higher bioactivity-exposure ratios always less potent than chemicals with lower bioactivity-exposure ratios?
-:::
-
-:::answer
-**Answer**: No -- some chemicals with high potency (low equivalent doses) demonstrate high BERs because they have relatively low human exposure estimates, and vice versa.
-:::
-
-
-
-### Answer to Environmental Health Question 8
-:::question
-*With this, we can also answer **Environmental Health Question #8***: Based on httk modeling estimates, do chemicals with higher bioactivity-exposure ratios always have lower estimated exposures than chemicals with lower bioactivity-exposure ratios?
-:::
-
-:::answer
-**Answer**: No -- some chemicals with high estimated exposures have equivalent doses that are higher still, resulting in a high BER despite the higher estimated exposure. Likewise, some chemicals with low estimated exposures also have lower equivalent doses, resulting in a low BER despite the low estimated exposure.
-:::
-
-
-### Answer to Environmental Health Question 9
-:::question
-*With this, we can also answer **Environmental Health Question #9***: How are chemical prioritization results different when using only hazard information vs. only exposure information vs. bioactivity-exposure ratios?
-:::
-
-:::answer
-**Answer**: When chemicals are prioritized solely on the basis of hazard, more-potent chemicals will be highly prioritized. However, if humans are never exposed to these chemicals, or exposure is extremely low compared to potency, then despite the high potency, the potential risk may be low. Conversely, if chemicals are prioritized solely on the basis of exposure, then ubiquitous chemicals will be highly prioritized. However, if these chemicals are inert and do not produce adverse effects, then despite the high exposure, the potential risk may be low. For these reasons, risk-based chemical prioritization efforts consider both hazard (toxicity) and exposure, for instance through bioactivity-exposure ratios.
-:::
-
-
-### Filling Hazard and Exposure Data Gaps to Prioritize More Chemicals
-
-To calculate a BER for a chemical, both bioactivity and exposure data are required, as well as sufficient TK data to perform reverse TK. In this training module, bioactivity data came from ToxCast AC50s; exposure data consisted of exposure inferences made from NHANES urinary biomonitoring data; and TK data consisted of parameter values measured *in vitro* and built into the *httk* R package. The intersections are illustrated in an Euler diagram below. BERs can only be calculated for chemicals in the triple intersection.
-
-```{r 06-Chapter6-186, fig.align = "center"}
-fit <- eulerr::euler(list('ToxCast AC50s' = unique(toxcast$DTXSID),
-                          'HTTK' = unique(chems_3compss$DTXSID),
-                          'NHANES inferred exposure' = unique(exposure$DTXSID)),
-                     shape = "ellipse")
-plot(fit,
-     legend = TRUE,
-     quantities = TRUE)
-```
-
-Clearly, it would be useful to gather more data to allow calculation of BERs for more chemicals.
-
-
-
-### Answer to Environmental Health Question 10
-:::question
-*With this, we can also answer **Environmental Health Question #10***: Of the three datasets used in this training module -- bioactivity from ToxCast, TK data from *httk*, and exposure inferred from NHANES urinary biomonitoring -- which one most limits the number of chemicals that can be prioritized using BERs?
-:::
-
-:::answer
-**Answer**: The exposure dataset includes the fewest chemicals and is therefore the most limiting.
-:::
-
-
-The exposure dataset used in this training module is limited to chemicals for which NHANES performed urinary biomonitoring for markers of exposure -- a fairly small set of chemicals that were of interest to NHANES because of existing concerns about health effects of exposure, among other reasons. This dataset was chosen because it is a convenient set of exposure estimates for demonstration purposes, but it could be expanded by including other sources of exposure data and exposure model predictions. Further discussion is beyond the scope of this training module, but as an example of this kind of high-throughput exposure modeling, see [Ring et al., 2019](https://pubmed.ncbi.nlm.nih.gov/30516957/).
-
-It would additionally be useful to gather TK data for more chemicals. *In vitro* measurement efforts are ongoing, and efforts are also underway to develop *in silico* models that predict TK parameters from chemical structure and properties, which can further facilitate chemical prioritization.
-
-## Concluding Remarks
-
-This training module provides an overview of toxicokinetic modeling using the *httk* R package, and its application to *in vitro*-to-*in vivo* extrapolation: placing *in vitro* bioactivity data in the context of exposure by calculating equivalent doses for *in vitro* bioactive concentrations.
-
-We would like to acknowledge the developers of the *httk* package, as detailed below via the CRAN website:
-
-```{r 06-Chapter6-187, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_6/Module6_6_Input/Module6_6_Image2.png")
-```
-
-This module also summarizes the use of the Bioactivity-Exposure Ratio (BER) for chemical prioritization, and provides examples of calculating the BER and ranking chemicals accordingly.
-
-Together, these approaches can be used to more efficiently identify chemicals present in the environment that pose a potential risk to human health.
-
-
-For additional case studies that leverage TK and/or httk modeling techniques, see the following publications that also address environmental health questions:
-
-+ Breen M, Ring CL, Kreutz A, Goldsmith MR, Wambaugh JF. High-throughput PBTK models for in vitro to in vivo extrapolation. Expert Opin Drug Metab Toxicol. 2021 Aug;17(8):903-921. PMID: [34056988](https://pubmed.ncbi.nlm.nih.gov/34056988/).
-
-+ Klaren WD, Ring C, Harris MA, Thompson CM, Borghoff S, Sipes NS, Hsieh JH, Auerbach SS, Rager JE. Identifying Attributes That Influence In Vitro-to-In Vivo Concordance by Comparing In Vitro Tox21 Bioactivity Versus In Vivo DrugMatrix Transcriptomic Responses Across 130 Chemicals. Toxicol Sci. 2019 Jan 1;167(1):157-171. PMID: [30202884](https://pubmed.ncbi.nlm.nih.gov/30202884/).
-
-+ Pearce RG, Setzer RW, Strope CL, Wambaugh JF, Sipes NS. httk: R Package for High-Throughput Toxicokinetics. J Stat Softw. 2017;79(4):1-26. PMID: [30220889](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134854/).
-
-+ Ring CL, Pearce RG, Setzer RW, Wetmore BA, Wambaugh JF. Identifying populations sensitive to environmental chemicals by simulating toxicokinetic variability. Environ Int. 2017 Sep;106:105-118. PMID: [28628784](https://pubmed.ncbi.nlm.nih.gov/28628784/).
-
-+ Ring C, Sipes NS, Hsieh JH, Carberry C, Koval LE, Klaren WD, Harris MA, Auerbach SS, Rager JE. Predictive modeling of biological responses in the rat liver using in vitro Tox21 bioactivity: Benefits from high-throughput toxicokinetics. Comput Toxicol. 2021 May;18:100166. PMID: [34013136](https://pubmed.ncbi.nlm.nih.gov/34013136/).
-
-+ Rotroff DM, Wetmore BA, Dix DJ, Ferguson SS, Clewell HJ, Houck KA, Lecluyse EL, Andersen ME, Judson RS, Smith CM, Sochaski MA, Kavlock RJ, Boellmann F, Martin MT, Reif DM, Wambaugh JF, Thomas RS. Incorporating human dosimetry and exposure into high-throughput in vitro toxicity screening. Toxicol Sci. 2010 Oct;117(2):348-58. PMID: [20639261](https://pubmed.ncbi.nlm.nih.gov/20639261/).
-
-+ Wetmore BA, Wambaugh JF, Ferguson SS, Sochaski MA, Rotroff DM, Freeman K, Clewell HJ 3rd, Dix DJ, Andersen ME, Houck KA, Allen B, Judson RS, Singh R, Kavlock RJ, Richard AM, Thomas RS. Integration of dosimetry, exposure, and high-throughput screening data in chemical toxicity assessment. Toxicol Sci. 2012 Jan;125(1):157-74. PMID: [21948869](https://pubmed.ncbi.nlm.nih.gov/21948869/).
-
-+ Wambaugh JF, Wetmore BA, Pearce R, Strope C, Goldsmith R, Sluka JP, Sedykh A, Tropsha A, Bosgra S, Shah I, Judson R, Thomas RS, Setzer RW. Toxicokinetic Triage for Environmental Chemicals. Toxicol Sci. 2015 Sep;147(1):55-67. PMID: [26085347](https://pubmed.ncbi.nlm.nih.gov/26085347/).
-
-+ Wambaugh JF, Wetmore BA, Ring CL, Nicolas CI, Pearce RG, Honda GS, Dinallo R, Angus D, Gilbert J, Sierra T, Badrinarayanan A, Snodgrass B, Brockman A, Strock C, Setzer RW, Thomas RS. Assessing Toxicokinetic Uncertainty and Variability in Risk Prioritization. Toxicol Sci. 2019 Dec 1;172(2):235-251. doi: 10.1093/toxsci/kfz205. PMID: [31532498](https://pubmed.ncbi.nlm.nih.gov/31532498/).
-
-
-
-
-
-
-:::tyk
-1. After exposure to a single daily dose of 1 mg/kg/day methylparaben, what is the maximum concentration of methylparaben in human liver, as estimated by the 3-compartment model implemented in *httk*?
-2. What is the predicted range of methylparaben concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and 3-compartment steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentile.
-:::
-
-# 6.7 Chemical Read-Across for Toxicity Predictions
-
-This training module was developed by Grace Patlewicz, Lauren E. Koval, Alexis Payton, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-*Disclaimer: The views expressed in this document are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.*
-
-```{r 06-Chapter6-188, include=FALSE}
-#set default values for R Markdown "knitting" to HTML, Word, or PDF
-knitr::opts_chunk$set(echo = TRUE) #print code chunks
-```
-
-## Introduction to Training Module
-
-The method of **read-across** represents one type of computational approach commonly used to predict a chemical's toxicological effects from its properties. Other approaches you will commonly hear used in this field include **SAR** and **QSAR** analyses. A high-level overview of these definitions, along with simple illustrative examples of the three computational modeling approaches, is provided in the following schematic:
-```{r 06-Chapter6-189, echo=FALSE }
-knitr::include_graphics("Chapter_6/Module6_7_Input/Module6_7_Image1.png")
-```
-
-Focusing on read-across specifically: this computational approach fills a data gap by using a chemical with existing data values to make a prediction for a 'similar' chemical, typically one that is structurally similar. Thus, information from chemicals with data is read across to chemical(s) without data.
-
-In a typical read-across workflow, the first step is to determine the problem definition: what question are we trying to address? The second step begins the process of identifying chemical analogues whose existing data can be used to address this question for a chemical of interest that is lacking data. A specific type of read-across that is commonly employed is termed Generalized Read-Across (GenRA), which is based upon similarity-weighted activity predictions. This read-across approach will be used in this training module's example analysis, and it has been previously described and published:
-
-+ Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G. Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information. Regul Toxicol Pharmacol. 2016 79:12-24. PMID: [27174420](https://pubmed.ncbi.nlm.nih.gov/27174420/)
-
-
-
-## Introduction to Activity
-
-In this activity we are going to consider a chemical of interest (which we call the target chemical) that is lacking acute oral toxicity information. Specifically, we would like to obtain estimates of the dose that causes lethality after acute (meaning, short-term) exposure. These dose values are typically presented as LD50 values and are usually collected through animal testing. There is great interest in reducing reliance upon animal testing, so we would like to avoid further animal testing as much as possible. With this goal in mind, this activity aims to estimate an LD50 value for the target chemical using entirely computational approaches, leveraging existing data as best we can. To achieve this aim, we explore ways to search for structurally similar chemicals that already have acute toxicity data. Data on these structurally similar chemicals, termed 'source analogues', are then used to predict acute toxicity for the target chemical of interest using the GenRA approach.
-
-The dataset used for this training module was previously compiled and published in the following manuscript:
-
-+ Helman G, Shah I, Patlewicz G. Transitioning the Generalised Read-Across approach (GenRA) to quantitative predictions: A case study using acute oral toxicity data. Comput Toxicol. 2019 Nov;12:100097. doi: 10.1016/j.comtox.2019.100097. PMID: [33623834](https://pubmed.ncbi.nlm.nih.gov/33623834/)
-
-+ With associated data available at: https://github.com/USEPA/CompTox-GenRA-acutetox-comptoxicol/tree/master/input
-
-This exercise will specifically predict LD50 values for the chemical 1-chloro-4-nitrobenzene (DTXSID5020281). This chemical is an organic compound with the formula ClC~6~H~4~NO~2~, and is a common intermediate in the production of a number of industrial compounds, including common antioxidants found in rubber.
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-1. How many chemicals with acute toxicity data are structurally similar to 1-chloro-4-nitrobenzene?
-2. What is the predicted LD50 for 1-chloro-4-nitrobenzene using the GenRA approach?
-3. How different is the predicted vs. experimentally observed LD50 for 1-chloro-4-nitrobenzene?
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 06-Chapter6-190}
-rm(list=ls())
-```
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you:
-```{r 06-Chapter6-191, results=FALSE, message=FALSE}
-if (!requireNamespace("tidyverse"))
-  install.packages("tidyverse")
-if (!requireNamespace("fingerprint"))
-  install.packages("fingerprint")
-if (!requireNamespace("rcdk"))
-  install.packages("rcdk")
-```
-
-#### Loading R packages required for this session
-```{r 06-Chapter6-192, results=FALSE, message=FALSE}
-library(tidyverse) #all tidyverse packages, including dplyr and ggplot2
-library(fingerprint) # a package that supports operations on molecular fingerprint data
-library(rcdk) # a package that interfaces with the 'CDK', a Java framework for chemoinformatics libraries packaged for R
-```
-
-#### Set your working directory
-```{r 06-Chapter6-193, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-
-
-## Read-Across Example Analysis
-
-#### Loading Example Datasets
-Let's start by loading the datasets needed for this training module. We are going to use a dataset of substances that have chemical identification information ready in the form of SMILES, as well as acute toxicity data, in the form of LD50 values.
-
-The first file to upload is named `Module6_7_InputData1.csv` and contains the list of substances and their structural information, in the form of SMILES nomenclature. SMILES stands for simplified molecular-input line-entry system, a form of line notation used to describe the structure of a chemical.
-
-The second file to upload is named `Module6_7_InputData2.csv` and contains the same substances and their acute toxicity information.
-```{r 06-Chapter6-194}
-substances <- read.csv("Chapter_6/Module6_7_Input/Module6_7_InputData1.csv")
-acute_data <- read.csv("Chapter_6/Module6_7_Input/Module6_7_InputData2.csv")
-```
-
-Let's first view the substances dataset:
-```{r 06-Chapter6-195}
-dim(substances)
-```
-
-```{r 06-Chapter6-196}
-colnames(substances)
-```
-
-```{r 06-Chapter6-197}
-head(substances)
-```
-
-We can see that this dataset contains information on 6955 chemicals (rows). The columns are further described below:
-
-+ `DTXSID`: a substance identifier provided through the [U.S. EPA's Computational Toxicology Dashboard](https://comptox.epa.gov/dashboard)
-+ `SMILES`: a chemical structure identifier
-+ `QSAR_READY_SMILES`: `SMILES` that have been standardized with respect to salts, tautomers, inorganics, aromaticity, and stereochemistry (among other factors) prior to any QSAR modeling or prediction. These are the values we will use in a later step to construct chemical fingerprints.
-
-Let's make sure the `QSAR_READY_SMILES` values are recognized in character format and place them in their own vector, to ensure proper execution of functions throughout this script:
-```{r 06-Chapter6-198}
-all_smiles <- as.character(substances$QSAR_READY_SMILES)
-```
-
-Now let's view the acute toxicity dataset:
-```{r 06-Chapter6-199}
-dim(acute_data)
-```
-
-```{r 06-Chapter6-200}
-colnames(acute_data)
-```
-
-```{r 06-Chapter6-201}
-head(acute_data)
-```
-
-We can see that this dataset contains information on the same 6955 chemicals (rows). Some notable columns are explained below:
-
-+ `DTXSID`: a substance identifier provided through the [U.S. EPA's Computational Toxicology Dashboard](https://comptox.epa.gov/dashboard)
-+ `casrn`: Chemical Abstracts Service Registry Number (CASRN)
-+ `mol_weight`: molecular weight (g/mol)
-+ `LD50_LM`: the -log~10~ of the millimolar LD50. LD stands for 'Lethal Dose'. The LD50 value is the dose of a substance, given all at once, which causes the death of 50% of a group of test animals. The lower the LD50 in mg/kg, the more toxic the substance.
-
-#### Important Notes on Units
-In modeling studies, the convention is to convert toxicity values expressed in mg/kg into their molar or millimolar equivalents and then take the base-10 logarithm. To increase clarity when plotting, such that higher toxicities are expressed by higher values, the negative logarithm is then used. For example, substance DTXSID00142939 has a molecular weight of 99.089 g/mol and an LD50 of 32 mg/kg. This converts to $\frac{32}{99.089} = 0.322942~mmol/kg$. The logarithm of that is -0.4908755, so the negative logarithm of the millimolar dose, *i.e.* -log~10~[mmol/kg], is 0.4908755. This conversion was used to create the `LD50_LM` values in the acute toxicity dataset.
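-In general, this conversion can be written as:
-
-$$\mathrm{LD50\_LM} = -\log_{10}\left(\frac{\mathrm{LD50}~\mathrm{(mg/kg)}}{\mathrm{MW}~\mathrm{(g/mol)}}\right)$$
-
-so for DTXSID00142939, $-\log_{10}(32/99.089) \approx 0.491$.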
-
-Let's check to see whether the same chemicals are present in both datasets:
-```{r 06-Chapter6-202}
-# First need to make sure that both dataframes are sorted by the identifier, DTXSID
-substances <- substances[order(substances$DTXSID),]
-acute_data <- acute_data[order(acute_data$DTXSID),]
-# Then test to see whether data in these columns are equal
-unique(substances$DTXSID == acute_data$DTXSID)
-```
-The only value returned is `TRUE`, meaning the DTXSID columns match exactly (the same chemicals, in the same order).
-
-
-### Data Visualizations of Acute Toxicity Values
-
-Let's create a plot to show the distribution of the LD50 values in the dataset.
-```{r 06-Chapter6-203, fig.align = "center"}
-ggplot(data = acute_data, aes(LD50_mgkg)) +
- stat_ecdf(geom = "point")
-
-ggplot(data = acute_data, aes(LD50_LM)) +
- stat_ecdf(geom = "point")
-```
-
-**Can you see a difference between these two plots?**
-Yes, if the LD50 mg/kg values are converted into -log[mmol/kg] scale (LD50_LM), then the distribution resembles a normal cumulative distribution curve.
-
-
-### Selecting the 'Target' Chemical of Interest for Read-Across Analysis
-For this exercise, we will select a 'target' substance of interest from our dataset, assume that we have no acute toxicity data for it, and perform read-across for this target chemical. Note that this module's example dataset actually has full data coverage (meaning all chemicals have acute toxicity data). That is what makes this exercise useful: we can make toxicity predictions and then check how close they are to the experimentally observed values.
-
-Our target substance for this exercise is going to be DTXSID5020281, which is 1-chloro-4-nitrobenzene. This chemical is an organic compound with the formula ClC~6~H~4~NO~2~, and is a common intermediate in the production of a number of industrially useful compounds, including common antioxidants found in rubber. Here is an image of the chemical structure (https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID5020281):
-
-```{r 06-Chapter6-204, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_6/Module6_7_Input/Module6_7_Image2.png")
-```
-
-Filtering the dataframes for only data on this target substance:
-```{r 06-Chapter6-205}
-target_substance <- filter(substances, DTXSID == 'DTXSID5020281')
-target_acute_data <- filter(acute_data, DTXSID == 'DTXSID5020281')
-```
-
-
-
-### Calculating Structural Similarities between Substances
-
-To eventually identify chemical analogues with information that can be 'read-across' to our target chemical (1-chloro-4-nitrobenzene), we first need to evaluate how similar each chemical is to one another. In this example, we will base our search for similar substances upon similarities between chemical structure fingerprint representations. Once these chemical structure fingerprints are derived, they will be used to calculate the degree to which each possible pair of chemicals is similar, leveraging the Tanimoto metric. These findings will yield a similarity matrix of all possible pairwise similarity scores.
-
-
-#### Converting Chemical Identifiers into Molecular Objects (MOL)
-
-To derive structure fingerprints across all evaluated substances, we need to first convert the chemical identifiers originally provided as `QSAR_READY_SMILES` into molecular objects. The standard exchange format for molecular information is a MOL file. This is a chemical file format that contains plain text information and stores information about atoms, bonds and their connections.
-
-We can carry out these identifier conversions using the `parse.smiles()` function within the rcdk package. Here, we do this for the target chemical of interest, as well as for all substances in the dataset.
-```{r 06-Chapter6-206}
-target_mol <- parse.smiles(as.character(target_substance$QSAR_READY_SMILES))
-all_mols <- parse.smiles(all_smiles)
-```
-
-#### Computing chemical fingerprints
-
-With these MOL objects, we can now compute fingerprints for our target substance, as well as for all substances in the dataset, using the `get.fingerprint()` function. Let's first run it on the target chemical:
-```{r 06-Chapter6-207}
-target.fp <- get.fingerprint(target_mol[[1]], type = 'standard')
-target.fp # View fingerprint
-```
-
-We can run the same function over the entire `all_mols` dataset, leveraging the `lapply()` function:
-```{r 06-Chapter6-208}
-all.fp <- lapply(all_mols, get.fingerprint, type = 'standard')
-```
-
-
-
-## Calculating Chemical Similarities
-
-Using these molecular fingerprint data, we can now calculate the degree to which each chemical is structurally similar to every other chemical. The method employed in this example is the Tanimoto method. The Tanimoto similarity metric is a unitless number between zero and one that measures how similar two sets (in this case, two chemical fingerprints) are to one another. A Tanimoto index of 1 means the two chemicals are identical, whereas an index of 0 means the chemicals share nothing in common. In the context of fingerprints, a Tanimoto index of 0.5 means that half of the fingerprint bits match between the two chemicals whilst the other half do not.
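-For two binary fingerprints $A$ and $B$, the Tanimoto (Jaccard) index can be written as:
-
-$$T(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}$$
-
-where $a$ and $b$ are the numbers of bits set in each fingerprint, and $c$ is the number of bits set in both.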
-
-Once these Tanimoto similarity indices are calculated between every possible chemical pair, the similarity results can be viewed in the form of a similarity matrix. In this matrix, all substances are listed across the rows and columns, and the degree to which every possible chemical pair is similar is summarized through values contained within the matrix. Further information about chemical similarity can be found here: https://en.wikipedia.org/wiki/Chemical_similarity
-
-Steps to generate this similarity matrix are detailed here:
-```{r 06-Chapter6-209}
-all.fp.sim <- fingerprint::fp.sim.matrix(all.fp, method = 'tanimoto')
-all.fp.sim <- as.data.frame(all.fp.sim) # Convert the outputted matrix to a dataframe
-colnames(all.fp.sim) = substances$DTXSID # Placing chemical identifiers back as column headers
-row.names(all.fp.sim) = substances$DTXSID # Placing chemical identifiers back as row names
-```
-
-Since we are querying a large number of chemicals, it is difficult to view the entire resulting similarity matrix. Let's instead view portions of these results:
-```{r 06-Chapter6-210}
-all.fp.sim[1:5,1:5] # Viewing the first five rows and columns of data
-```
-
-
-```{r 06-Chapter6-211}
-all.fp.sim[6:10,6:10] # Viewing the next five rows and columns of data
-```
-You can see the identity line within this similarity matrix: whenever a chemical's structure is compared to itself, the similarity value is 1.00000.
-
-All other possible chemical pairings show variable similarity scores, ranging from:
-```{r 06-Chapter6-212}
-min(all.fp.sim)
-```
-
-a minimum of zero, indicating no similarities between chemical structures.
-```{r 06-Chapter6-213}
-max(all.fp.sim)
-```
-
-a maximum of 1, indicating identical chemical structures (which occurs when comparing a chemical to itself).
-
-### Identifying Chemical Analogues
-This step will find substances that are structurally similar to the target chemical, 1-chloro-4-nitrobenzene (with DTXSID5020281). Structurally similar chemicals are referred to as 'source analogues', with information that will be carried forward in this read-across analysis.
-
-The first step to identifying chemical analogues is to subset the full similarity matrix to focus just on our target chemical.
-```{r 06-Chapter6-214}
-target.sim <- all.fp.sim %>%
- filter(row.names(all.fp.sim) == 'DTXSID5020281')
-```
-
-Then we'll extract the substances that exceed a similarity threshold of 0.75 by keeping only the columns containing values > 0.75.
-```{r 06-Chapter6-215}
-target.sim <- target.sim %>%
- select_if(function(x) any(x > 0.75))
-
-dim(target.sim) # Show dimensions of subsetted matrix
-```
-
-This gives us our analogues list! Specifically, we selected 12 columns of data, representing our target chemical plus 11 structurally similar chemicals. Let's create a dataframe of these substance identifiers to carry forward in the read-across analysis:
-```{r 06-Chapter6-216}
-source_analogues <- t(target.sim) # Transposing the filtered similarity matrix results
-DTXSID <-rownames(source_analogues) # Temporarily grabbing the dtxsid identifiers from this matrix
-source_analogues <- cbind(DTXSID, source_analogues) # Adding these identifiers as a column
-rownames(source_analogues) <- NULL # Removing the rownames from this dataframe, to land on a cleaned dataframe
-colnames(source_analogues) <- c('DTXSID', 'Target_TanimotoSim') # Renaming column headers
-source_analogues[1:12,1:2] # Viewing the cleaned dataframe of analogues
-```
-
-### Answer to Environmental Health Question 1
-:::question
-*With these, we can answer **Environmental Health Question #1***: How many chemicals with acute toxicity data are structurally similar to 1-chloro-4-nitrobenzene?
-:::
-
-:::answer
-**Answer**: In this dataset, 11 chemicals are structurally similar to the target chemical, based on a Tanimoto similarity score of > 0.75.
-:::
-
-
-
-## Chemical Read-Across to Predict Acute Toxicity
-Acute toxicity data from these chemical analogues can now be extracted and read across to the target chemical (1-chloro-4-nitrobenzene) to make predictions about its toxicity.
-
-Let's first merge the acute data for these analogues into our working dataframe:
-```{r 06-Chapter6-217}
-source_analogues <- merge(source_analogues, acute_data, by.x = 'DTXSID', by.y = 'DTXSID')
-```
-
-Then, let's remove the target chemical of interest and create a new dataframe of just the source analogues:
-```{r 06-Chapter6-218}
-source_analogues_only <- source_analogues %>%
- filter(Target_TanimotoSim!=1) # Removing the row of data with the target chemical, identified as the chemical with a similarity of 1 to itself
-
-source_analogues_only[1:11,1:10] # Viewing the combined dataset of source analogues
-```
-
-### Read-across Calculations using GenRA
-The final generalized read-across (GenRA) prediction is based on a similarity-weighted activity score. This score is specifically calculated as the following weighted average:
-
-(pairwise similarity between the target and a source analogue) * (toxicity of that source analogue), summed across all individual analogues; this sum is then divided by the sum of all pairwise similarities. For further details surrounding this algorithm and its full formulation, see [Shah et al.](https://pubmed.ncbi.nlm.nih.gov/27174420/).
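-
-This weighted average can be written compactly as (using notation introduced here for illustration, not taken verbatim from Shah et al.):
-
-$$\text{prediction} = \frac{\sum_{i=1}^{n} s_i \, y_i}{\sum_{i=1}^{n} s_i}$$
-
-where $s_i$ is the pairwise Tanimoto similarity between the target and source analogue $i$, $y_i$ is the toxicity of source analogue $i$, and $n$ is the number of source analogues.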
-
-Here are the underlying calculations needed to derive the similarity weighted activity score for this current exercise:
-```{r 06-Chapter6-219}
-source_analogues_only$wt_tox_calc <-
- as.numeric(source_analogues_only$Target_TanimotoSim) * source_analogues_only$LD50_LM
-# Calculating (pairwise similarity between the target and source analogue) * (the toxicity of the source analogue)
-# for each analogue, and saving it as a new column titled 'wt_tox_calc'
-
-source_analogues_only[1:3,1:11] # Viewing a portion of the updated dataframe with the 'wt_tox_calc' column
-
-sum_tox <- sum(source_analogues_only$wt_tox_calc) #Summing this wt_tox_calc value across all analogues
-
-sum_sims <- sum(as.numeric(source_analogues_only$Target_TanimotoSim)) # Summing all of the pairwise Tanimoto similarity scores
-
-ReadAcross_Pred <- sum_tox/sum_sims # Final calculation for the weighted activity score (i.e., read-across prediction)
-```
-
-### Converting LD50 Units
-Right now, these results are in units of -log~10~ millimolar. So we still need to convert them into mg/kg equivalents, by converting out of -log~10~ and multiplying by the molecular weight of 1-chloro-4-nitrobenzene (157.55 g/mol):
-```{r 06-Chapter6-220}
-ReadAcross_Pred <- (10^(-ReadAcross_Pred))*157.55
-ReadAcross_Pred
-```
-
-### Answer to Environmental Health Question 2
-:::question
-*With this, we can answer **Environmental Health Question #2***: What is the predicted LD50 for 1-chloro-4-nitrobenzene, using the GenRA approach?
-:::
-
-:::answer
-**Answer**: 1-chloro-4-nitrobenzene has a predicted LD50 of 471 mg/kg.
-:::
-
-### Visual Representation of this Read-Across Approach
-
-Here is a schematic summarizing the steps we employed in this analysis:
-```{r 06-Chapter6-221, echo=FALSE }
-knitr::include_graphics("Chapter_6/Module6_7_Input/Module6_7_Image3.png")
-```
-
-
-### Comparing Read-Across Predictions to Experimental Observations
-
-Let's now compare how close this computationally-based prediction is to the experimentally observed LD50 value:
-```{r 06-Chapter6-222}
-target_acute_data$LD50_mgkg
-```
-We can see that the experimentally observed LD50 value for this chemical is 460 mg/kg.
-
-### Answer to Environmental Health Question 3
-:::question
-*With this, we can answer **Environmental Health Question #3***: How different is the predicted vs. experimentally observed LD50 for 1-chloro-4-nitrobenzene?
-:::
-
-:::answer
-**Answer**: The predicted LD50 is 471 mg/kg, and the experimentally observed LD50 is 460 mg/kg, which is reasonably close!
-:::
-
-
-
-## Concluding Remarks
-
-In conclusion, this training module leverages a dataset of substances with structural representations and toxicity data to create chemical fingerprint representations. We have selected a chemical of interest (target) and used the most similar analogues based on a similarity threshold to predict the acute toxicity of that target using the generalized read-across formula of weighted activity by similarity. We have seen that the prediction is in close agreement with that already reported for the target chemical in the dataset. Similar methods can be used to predict other toxicity endpoints, based on other datasets of chemicals. Additionally, further efforts are aimed at expanding read-across approaches to integrate *in vitro* data.
-
-More information on the GenRA approach as implemented in the EPA CompTox Chemicals Dashboard, as well as the extension of read-across to include bioactivity information, are described in the following manuscripts:
-
-+ Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G. Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information. Regul Toxicol Pharmacol. 2016 79:12-24. PMID: [27174420](https://pubmed.ncbi.nlm.nih.gov/27174420/)
-
-+ Helman G, Shah I, Williams AJ, Edwards J, Dunne J, Patlewicz G. Generalized Read-Across (GenRA): A workflow implemented into the EPA CompTox Chemicals Dashboard. ALTEX. 2019;36(3):462-465. PMID: [30741315](https://pubmed.ncbi.nlm.nih.gov/30741315/).
-
-+ GenRA has also been implemented as a standalone [python package](https://pypi.org/project/genra/#description).
-
-
-
-
-
-:::tyk
-Use the same input data we used in this module to answer the following questions.
-
-1. How many source analogues are structurally similar to methylparaben (DTXSID4022529) when considering a similarity threshold of 0.75?
-2. What is the predicted LD50 for methylparaben in mg/kg, and how does this compare to the measured LD50 for methylparaben?
-:::
-
-
-
-
-
-
diff --git a/Chapter_6/6_1_Descriptive_Cohort_Analyses/6_1_Descriptive_Cohort_Analyses.Rmd b/Chapter_6/6_1_Descriptive_Cohort_Analyses/6_1_Descriptive_Cohort_Analyses.Rmd
new file mode 100644
index 0000000..2210ba5
--- /dev/null
+++ b/Chapter_6/6_1_Descriptive_Cohort_Analyses/6_1_Descriptive_Cohort_Analyses.Rmd
@@ -0,0 +1,528 @@
+# (PART\*) Chapter 6 Applications in Toxicology & Exposure Science {-}
+
+# 6.1 Descriptive Cohort Analyses
+
+This training module was developed by Elise Hickman, Kyle Roell, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Human cohort datasets are very commonly analyzed and integrated in environmental health research. Common research study designs that incorporate human data include clinical, epidemiological, biomonitoring, and/or biomarker study designs. These datasets represent metrics of health and exposure collected from human participants at one or many points in time. Although these datasets can lend themselves to highly complex analyses, it is important to first explore the basic dataset properties to understand data missingness, filter data appropriately, generate demographic tables and summary statistics, and identify outliers. In this module, we will work through these common steps with an example dataset and discuss additional considerations when working with human cohort datasets.
+
+Our example data are derived from a study in which chemical exposure profiles were collected using silicone wristbands. Silicone wristbands are an affordable and minimally invasive method for sampling personal chemical exposure profiles. This exposure monitoring technique has been described through previous publications (see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations**). The example workflow can also apply to other study designs, including biomonitoring and biomarker studies, which require careful consideration of chemical or biological marker detection filters, transparent reporting of descriptive statistics, and demographics tables.
+
+### Training Module's Environmental Health Questions
+
+1. What proportion of participants wore their wristbands for all seven days?
+2. How many chemicals were detected in at least 20% of participants?
+3. What are the demographics of the study participants?
+
+### Workspace Preparation and Data Import
+
+```{r 6-1-Descriptive-Cohort-Analyses-1, message = FALSE}
+# Load packages
+library(tidyverse) # for data organization and manipulation
+library(janitor) # for data cleaning
+library(openxlsx) # for reading in and writing out files
+library(DT) # for displaying tables
+library(table1) # for making tables
+library(patchwork) # for graphing
+library(purrr) # for summary stats
+library(factoextra) # for PCA outlier detection
+
+# Make sure select is calling the correct function
+select <- dplyr::select
+
+# Set graphing theme
+theme_set(theme_bw())
+```
+
+First, we will import our raw chemical data and preview it.
+```{r 6-1-Descriptive-Cohort-Analyses-2, warning = FALSE}
+wrist_data <- read.xlsx("Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData1.xlsx") %>%
+ mutate(across(everything(), \(x) as.numeric(x)))
+
+datatable(wrist_data[ , 1:6])
+```
+
+In this study, 97 participants wore silicone wristbands for one week, and chemical concentrations on the wristbands were measured with gas chromatography-mass spectrometry. This dataframe consists of a column with a unique identifier for each participant (`S_ID`), a column describing the number of days that participant wore the wristband (`Ndays`), and subsequent columns containing the amount of each chemical detected (nanograms of chemical per gram of wristband). The chemical columns are labeled with the chemical class first (e.g., alkylOPE, or alkyl organophosphate ester), followed by an underscore and the chemical name (e.g., 2IPPDPP). This dataset contains 110 different chemicals categorized into 8 chemical classes (listed below with their abbreviations):
+
++ Brominated diphenyl ether (BDE)
++ Brominated flame retardant (BFR)
++ Organophosphate ester (OPE)
++ Polycyclic aromatic hydrocarbon (PAH)
++ Polychlorinated biphenyl (PCB)
++ Pesticide (Pest)
++ Phthalate (Phthal)
++ Alkyl organophosphate ester (alkylOPE)
+
+Through the data exploration and cleaning process, we will aim to:
+
++ Understand participant behaviors
++ Filter out chemicals with low detection
++ Generate a supplemental table containing chemical detection information and summary statistics such as minimum, mean, median, and maximum
++ Identify participant outliers
++ Generate a demographics table
+
+Although these steps are somewhat specific to our example dataset, similar steps can be taken with other datasets. We recommend thinking through the structure of your data and outlining data exploration and cleaning steps prior to starting your analysis. This process can be somewhat time-consuming and tedious but is important to ensure that your data are well-suited for downstream analyses. In addition, these steps should be included in any resulting manuscript as part of the narrative relating to the study cohort and data cleaning.
+
+## Participant Exploration
+
+We can use *tidyverse* functions to quickly tabulate how many days participants wore the wristbands.
+
+```{r 6-1-Descriptive-Cohort-Analyses-3 }
+wrist_data %>%
+
+ # Count number of participants for each number of days
+ dplyr::count(Ndays) %>%
+
+ # Calculate proportion of participants for each number of days
+ mutate(prop = prop.table(n)) %>%
+
+ # Arrange the table from highest to lowest number of days
+ arrange(-Ndays) %>%
+
+ # Round the proportion column to two decimal places
+ mutate(across(prop, \(x) round(x, 2)))
+```
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can now answer **Environmental Health Question #1***: What proportion of participants wore their wristbands for all seven days?
+:::
+
+:::answer
+**Answer:** 86% of participants wore their wristbands for all seven days.
+:::
+
+Because a few participants did not wear their wristbands for all seven days, it will be important to further explore whether there are outlier participants and to normalize the chemical concentrations by the number of days the wristband was worn. We can first assess whether any participants have a particularly low or high number of chemicals detected relative to the other participants.
+
+We'll prepare the data for graphing by creating a dataframe containing information about how many chemicals were detected per participant.
+```{r 6-1-Descriptive-Cohort-Analyses-4 }
+wrist_det_by_participant <- wrist_data %>%
+
+ # Remove Ndays column because we don't need it for this step
+ select(-Ndays) %>%
+
+ # Move S_ID to rownames so it doesn't interfere with count
+ column_to_rownames("S_ID") %>%
+
+ # Create a new column for number of chemicals detected
+ mutate(n_det = rowSums(!is.na(.))) %>%
+
+ # Clean dataframe
+ rownames_to_column("S_ID") %>%
+ select(c(S_ID, n_det))
+
+datatable(wrist_det_by_participant)
+```
+
+Then, we can make our histogram:
+```{r 6-1-Descriptive-Cohort-Analyses-5, warning = FALSE, fig.align = "center"}
+det_per_participant_graph <- ggplot(wrist_det_by_participant, aes(x = n_det)) +
+ geom_histogram(color = "black",
+ fill = "gray60",
+ alpha = 0.7,
+ binwidth = 2) +
+ ggtitle("Distribution of Number of Chemicals Detected Per Participant") +
+ ylab("Number of Participants") +
+ xlab("Number of Chemicals Detected") +
+ scale_x_continuous(breaks = seq(0, 70, by = 10), limits = c(0, 70), expand = c(0.025, 0.025)) +
+ scale_y_continuous(breaks = seq(0, 15, by = 5), limits = c(0, 15), expand = c(0, 0)) +
+ theme(plot.title = element_text(hjust = 0.5, size = 16),
+ axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
+ axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
+ axis.text = element_text(size = 12))
+
+det_per_participant_graph
+```
+
+From this histogram, we can see that the number of chemicals detected per participant ranges from about 30 to 65 chemicals, with no participants standing out as being well above or below the distribution.
+
+## Chemical Detection Filtering
+
+Next, we want to apply a chemical detection filter to remove chemicals from the dataset with very low detection. To start, let's make a dataframe summarizing the percentage of participants in which each chemical was detected and graph this distribution using a histogram.
+
+```{r 6-1-Descriptive-Cohort-Analyses-6 }
+# Create dataframe where n_detected is the sum of the rows where there are not NA values
+chemical_counts <- data.frame(n_detected = colSums(!is.na(wrist_data %>% select(-c(S_ID, Ndays))))) %>%
+
+ # Move rownames to a column
+ rownames_to_column("class_chemical") %>%
+
+ # Add n_undetected and percentage detected and undetected columns
+ mutate(n_undetected = nrow(wrist_data) - n_detected,
+ perc_detected = n_detected/nrow(wrist_data)*100,
+ perc_undetected = n_undetected/nrow(wrist_data)*100) %>%
+
+ # Round percentages to two decimal places
+ mutate(across(c(perc_detected, perc_undetected), \(x) round(x, 2)))
+
+# View dataframe
+datatable(chemical_counts)
+```
+
+```{r 6-1-Descriptive-Cohort-Analyses-7, fig.align = "center"}
+det_per_chemical_graph <- ggplot(chemical_counts, aes(x = perc_detected)) +
+ geom_histogram(color = "black",
+ fill = "gray60",
+ alpha = 0.7,
+ binwidth = 1) +
+ scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
+ scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
+ ggtitle("Distribution of Percentage Chemical Detection") +
+ ylab("Number of Chemicals") +
+ xlab("Percentage of Detection Across All Participants") +
+ theme(plot.title = element_text(hjust = 0.5),
+ axis.title.x = element_text(margin = ggplot2::margin(t = 10)),
+ axis.title.y = element_text(margin = ggplot2::margin(r = 10)))
+
+det_per_chemical_graph
+```
+
+From this histogram, we can see that many of the chemicals fall in the < 15% or > 90% detection range, with the others distributed evenly between 20 and 90% detection. How we choose to filter our data in part depends on the goals of our analysis. For example, if we only want to keep chemicals detected for almost all of the participants, we could set our threshold at 90% detection:
+```{r 6-1-Descriptive-Cohort-Analyses-8, fig.align = "center"}
+# Add annotation column
+chemical_counts <- chemical_counts %>%
+ mutate(det_filter_90 = ifelse(perc_detected > 90, "Yes", "No"))
+
+# How many chemicals pass this filter?
+nrow(chemical_counts %>% filter(det_filter_90 == "Yes"))
+
+# Make graph
+det_per_chemical_graph_90 <- ggplot(chemical_counts, aes(x = perc_detected, fill = det_filter_90)) +
+ geom_histogram(color = "black",
+ alpha = 0.7,
+ binwidth = 1) +
+ scale_fill_manual(values = c("gray87", "gray32"), guide = "none") +
+ geom_segment(aes(x = 90, y = 0, xend = 90, yend = 25), color = "firebrick", linetype = 2) +
+ scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
+ scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
+ ggtitle("Distribution of Percentage Chemical Detection") +
+ ylab("Number of Chemicals") +
+ xlab("Percentage of Detection Across All Participants") +
+ theme(plot.title = element_text(hjust = 0.5, size = 16),
+ axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
+ axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
+ axis.text = element_text(size = 12))
+
+det_per_chemical_graph_90
+```
+
+However, this keeps only 34 chemicals in our dataset, excluding a significant proportion of the chemicals measured. We could instead set the filter at 20% detection to maximize inclusion of as many chemicals as possible.
+
+```{r 6-1-Descriptive-Cohort-Analyses-9, fig.align = "center"}
+# Add annotation column
+chemical_counts <- chemical_counts %>%
+ mutate(det_filter_20 = ifelse(perc_detected > 20, "Yes", "No"))
+
+# How many chemicals pass this filter?
+nrow(chemical_counts %>% filter(det_filter_20 == "Yes"))
+
+# Make graph
+det_per_chemical_graph_20 <- ggplot(chemical_counts, aes(x = perc_detected, fill = det_filter_20)) +
+ geom_histogram(color = "black",
+ alpha = 0.7,
+ binwidth = 1) +
+ scale_fill_manual(values = c("gray87", "gray32"), guide = "none") +
+ geom_segment(aes(x = 20, y = 0, xend = 20, yend = 25), color = "firebrick", linetype = 2) +
+ scale_x_continuous(breaks = seq(0, 100, by = 10), expand = c(0.025, 0.025)) +
+ scale_y_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 25), expand = c(0, 0)) +
+ ggtitle("Distribution of Percentage Chemical Detection") +
+ ylab("Number of Chemicals") +
+ xlab("Percentage of Detection Across All Participants") +
+ theme(plot.title = element_text(hjust = 0.5, size = 16),
+ axis.title.x = element_text(margin = ggplot2::margin(t = 10), size = 13),
+ axis.title.y = element_text(margin = ggplot2::margin(r = 10), size = 13),
+ axis.text = element_text(size = 12))
+
+det_per_chemical_graph_20
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can now answer **Environmental Health Question #2***: How many chemicals were detected in at least 20% of participants?
+:::
+
+:::answer
+**Answer:** 62 chemicals were detected in at least 20% of participants.
+:::
+
+We'll use the 20% detection filter for downstream analyses to maximize inclusion of data for our study. Note that selection of data filters is highly project- and goal-dependent, so be sure to take into consideration typical workflows for your type of data, study, or lab group.
+
+```{r 6-1-Descriptive-Cohort-Analyses-10 }
+# Create vector of chemicals to keep
+chemicals_20perc <- chemical_counts %>%
+ filter(perc_detected > 20) %>%
+ pull(class_chemical)
+
+# Filter dataframe
+wrist_data_filtered <- wrist_data %>%
+ column_to_rownames("S_ID") %>%
+ dplyr::select(all_of(chemicals_20perc))
+```
+
+We can also summarize chemical detection vs. non-detection by chemical class to understand the number of chemicals in each class that were 1) detected in any participant or 2) detected in more than 20% of participants.
+
+```{r 6-1-Descriptive-Cohort-Analyses-11 }
+chemical_count_byclass <- chemical_counts %>%
+ separate(class_chemical, into = c("class", NA), remove = FALSE, sep = "_") %>%
+ group_by(class) %>%
+ summarise(n_chemicals = n(),
+ n_chemicals_det = sum(n_detected > 0),
+ n_chemicals_det_20perc = sum(perc_detected >= 20)) %>%
+ bind_rows(summarise(., across(where(is.numeric), sum),
+ across(where(is.character), ~'Total')))
+
+datatable(chemical_count_byclass)
+```
+
+From these data, we can see that, of the 62 chemicals retained by our detection filter, some classes were retained more than others. For example, 8 of the 10 phthalates (80%) were retained by the 20% detection filter, while only 2 of the 11 PCBs (18%) were retained.
+
+## Outlier Identification
+
+Next, we will check to see if any participants are outliers based on the entire chemical signature for each participant using principal component analysis (PCA). Prior to checking for outliers, a few final data cleaning steps are required, which are beyond the scope of this specific module, though we encourage participants to research these methods as they are important in general data pre-processing. These data cleaning steps were:
+
+1. Imputing missing values.
+2. Calculating time-weighted average values by dividing each value by the number of days the participant wore the wristband.
+3. Assessing normality of data with and without log2 transformation.
+
+Here, we'll read in the fully cleaned and processed data, which contains data for all 97 participants and the 62 chemicals that passed the detection filter (imputed, time-weighted). We will also apply log2 transformation to move the data closer to a normal distribution. For more on these steps, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations** and **TAME 2.0 Module 4.2 Data Import, Processing, and Summary Statistics**.
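+
+The first two of these cleaning steps can be sketched on a toy dataset as follows. This is a minimal illustration using hypothetical example values and a simple minimum/2 imputation rule, not the exact procedure used to generate the cleaned input file:
+
+```{r 6-1-Descriptive-Cohort-Analyses-cleaning-sketch}
+# Toy example: two participants, two chemicals (NA = non-detect)
+toy <- data.frame(S_ID = c("P1", "P2"),
+                  Ndays = c(7, 5),
+                  chemA = c(10, NA),
+                  chemB = c(NA, 4))
+
+toy_cleaned <- toy %>%
+
+  # Step 1: impute each non-detect as half the minimum detected value per chemical
+  mutate(across(c(chemA, chemB),
+                \(x) ifelse(is.na(x), min(x, na.rm = TRUE)/2, x))) %>%
+
+  # Step 2: divide by the number of days the wristband was worn (time-weighted average)
+  mutate(across(c(chemA, chemB), \(x) x/Ndays))
+
+toy_cleaned
+```
+
+Here, each non-detect is replaced with half of the lowest detected value for that chemical, and each concentration is then divided by `Ndays` to yield a time-weighted average.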
+
+```{r 6-1-Descriptive-Cohort-Analyses-12 }
+wrist_data_cleaned <- read.xlsx("Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData2.xlsx") %>%
+ column_to_rownames("S_ID") %>%
+ mutate(across(everything(), \(x) log2(x+1)))
+
+datatable(wrist_data_cleaned[, 1:6])
+```
+
+First, let's run PCA and plot our data.
+```{r 6-1-Descriptive-Cohort-Analyses-13, fig.align = "center"}
+# Prepare dataframe
+wrist_data_cleaned_scaled <- wrist_data_cleaned %>%
+ scale() %>% data.frame()
+
+# Run PCA
+pca <- prcomp(wrist_data_cleaned_scaled)
+
+# Visualize PCA
+pca_chemplot <- fviz_pca_ind(pca,
+ label = "none",
+ pointsize = 3) +
+theme(axis.title = element_text(face = "bold", size = rel(1.1)),
+ panel.border = element_rect(fill = NA, color = "black", linewidth = 0.3),
+ panel.grid.minor = element_blank(),
+ panel.grid.major = element_blank(),
+ plot.title = element_text(hjust = 0.5),
+ legend.position = "none")
+
+pca_chemplot
+```
+
+By visual inspection, it looks like there may be some outliers, so we can apply a quantitative criterion to identify them. One standard criterion for detecting outliers is being “more than 6 standard deviations away from the mean” ([Source](https://privefl.github.io/blog/detecting-outlier-samples-in-pca/)).
+
+We can apply this approach to our data by first creating a function to detect PCA outliers based on whether or not that participant passed a certain standard deviation cutoff.
+
+```{r 6-1-Descriptive-Cohort-Analyses-14 }
+# Create a function to detect PCA sample outliers. The inputs are the PCA results object and the number of standard deviations to use as the cutoff. The output is the outlier sample names.
+outlier_detection = function(pca_df, sd){
+
+ # getting scores
+ scores = pca_df$x
+
+ # identifying samples that are more than sd standard deviations away from the mean
+ outlier_indices = apply(scores, 2, function(x) which( abs(x - mean(x)) > (sd * sd(x)) )) %>%
+ Reduce(union, .)
+
+ # getting sample names
+ outliers = rownames(scores)[outlier_indices]
+
+ return(outliers)
+}
+
+# Call function with different standard deviation cutoffs
+outliers_6 <- outlier_detection(pca, 6)
+outliers_5 <- outlier_detection(pca, 5)
+outliers_4 <- outlier_detection(pca, 4)
+outliers_3 <- outlier_detection(pca, 3)
+
+# Summary data frame
+outlier_summary <- data.frame(sd_cutoff = c(6, 5, 4, 3), n_outliers = c(length(outliers_6), length(outliers_5), length(outliers_4), length(outliers_3)))
+
+outlier_summary
+```
+
+From these results, we see that there are no outliers that are > 6 standard deviations from the mean, so we will proceed with the dataset without filtering any participants out.
+
+## Summary Statistics Tables
+
+Now that we have explored our dataset and finished processing the data, we can make a summary table that includes descriptive statistics (minimum, mean, median, maximum) for each of our chemicals. This table would go into supplementary material when the project is submitted for publication. It is a good idea to make this table using both the raw data and the cleaned data (imputed and normalized by time-weighted average) because different readers may have different interests in the data. For example, they may want to see the raw data so that they can understand chemical detection versus non-detection and absolute minimums or maximums of detection. Or, they may want to use the cleaned data for their own analyses. This table can also include information about whether or not the chemical passed our 20% detection filter.
+
+There are many ways to generate summary statistics tables in R. Here, we will demonstrate a method using the `map_dfr()` function, which takes a list of functions and applies them across columns of the data. The summary statistics are then placed in rows, with each column representing a variable.
+
+```{r 6-1-Descriptive-Cohort-Analyses-15, warning = FALSE}
+# Define summary functions
+summary_functs <- lst(min, median, mean, max)
+
+# Apply summary functions to raw data
+summarystats_raw <- map_dfr(summary_functs, ~ summarise(wrist_data, across(3:ncol(wrist_data), \(col) .x(col, na.rm = TRUE))), .id = "statistic")
+
+# View data
+datatable(summarystats_raw[, 1:6])
+```
+
+Through a few cleaning steps, we can transpose and format these data so that they are publication-quality.
+```{r 6-1-Descriptive-Cohort-Analyses-16 }
+summarystats_raw <- summarystats_raw %>%
+
+ # Transpose dataframe and return to dataframe class
+ t() %>% as.data.frame() %>%
+
+ # Make the first row the column names
+ row_to_names(1) %>%
+
+ # Remove rows with NAs (those where data are completely missing)
+ na.omit() %>%
+
+ # Move chemical identifier to a column
+ rownames_to_column("class_chemical") %>%
+
+ # Round data
+ mutate(across(min:max, as.numeric)) %>%
+ mutate(across(where(is.numeric), \(x) round(x, 2))) %>%
+
+ # Add a suffix to column titles so we know that these came from the raw data
+ rename_with(~paste0(., "_raw"), min:max)
+
+datatable(summarystats_raw)
+```
+
+We can apply the same steps to the cleaned data.
+
+```{r 6-1-Descriptive-Cohort-Analyses-17 }
+summarystats_cleaned <- map_dfr(summary_functs, ~ summarise(wrist_data_cleaned, across(1:ncol(wrist_data_cleaned), \(col) .x(col, na.rm = TRUE))),
+ .id = "statistic") %>%
+ t() %>% as.data.frame() %>%
+ row_to_names(1) %>%
+ na.omit() %>%
+ rownames_to_column("class_chemical") %>%
+ mutate(across(min:max, as.numeric)) %>%
+ mutate(across(where(is.numeric), \(x) round(x, 2))) %>%
+ rename_with(~paste0(., "_cleaned"), min:max)
+
+datatable(summarystats_cleaned)
+```
+
+Finally, we will merge the data from our `chemical_counts` dataframe (which contains detection information for all of our chemicals) with our summary statistics dataframes.
+
+```{r 6-1-Descriptive-Cohort-Analyses-18 }
+summarystats_final <- chemical_counts %>%
+
+ # Remove 90% detection filter column
+ select(-det_filter_90) %>%
+
+ # Add raw summary stats
+ left_join(summarystats_raw, by = "class_chemical") %>%
+
+ # Add cleaned summary stats
+ left_join(summarystats_cleaned, by = "class_chemical")
+
+datatable(summarystats_final, width = 600)
+```
+
+## Demographics Table
+
+Another important element of any analysis of human data is the demographics table. The demographics table provides key information about the study participants and can help inform downstream analyses, such as exploration of the impact of covariates on the endpoint of interest. There are many different ways to make demographics tables in R. Here, we will demonstrate making a demographics table with the *table1* package. For more on this package, including making tables with multiple groups and testing for statistical differences in demographics between groups, see the *table1* vignette [here](https://benjaminrich.github.io/table1/vignettes/table1-examples.html).
+
+First, we'll read in and view our demographic data:
+```{r 6-1-Descriptive-Cohort-Analyses-19 }
+demo_data <- read.xlsx("Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData3.xlsx")
+
+datatable(demo_data)
+```
+
+Then, we can create new labels for our variables so that they are more nicely formatted and more intuitive for display in the table.
+```{r 6-1-Descriptive-Cohort-Analyses-20 }
+# Create new labels for the demographics table
+label(demo_data$mat_age_birth) <- "Age at Childbirth"
+label(demo_data$pc_sex) <- "Sex"
+label(demo_data$pc_gender) <- "Gender"
+label(demo_data$pc_latino_hispanic) <- "Latino or Hispanic"
+label(demo_data$pc_race_cleaned) <- "Race"
+label(demo_data$pc_ed) <- "Educational Attainment"
+```
+
+Our demographics data also use "F" for female in the sex column. We can change this to "Female" so that the demographics table is more readable.
+```{r 6-1-Descriptive-Cohort-Analyses-21 }
+demo_data <- demo_data %>%
+ mutate(pc_sex = dplyr::recode(pc_sex, "F" = "Female"))
+
+label(demo_data$pc_sex) <- "Sex"
+```
+
+Now, let's make the table. The one-sided formula lists all of the columns you want to include in the table, and the input dataframe is supplied via the `data` argument.
+```{r 6-1-Descriptive-Cohort-Analyses-22 }
+table1(~ mat_age_birth + pc_sex + pc_gender + pc_latino_hispanic + pc_race_cleaned + pc_ed, data = demo_data)
+```
+
+
+
+There are a couple of steps we could take to clean up the table:
+
+1. Change the rendering for our continuous variable (age) to just mean (SD).
+2. Order educational attainment so that it progresses from least to most education.
+
+We can change the rendering for our continuous variable by defining our own rendering function (as demonstrated in the package's vignette).
+```{r 6-1-Descriptive-Cohort-Analyses-23 }
+# Create function for custom table so that Mean (SD) is shown for continuous variables
+my.render.cont <- function(x) {
+ with(stats.apply.rounding(stats.default(x), digits=2),
+ c("", "Mean (SD)"=sprintf("%s (± %s)", MEAN, SD)))
+}
+```
+
+We can order educational attainment by converting it to a factor and defining the levels.
+```{r 6-1-Descriptive-Cohort-Analyses-24 }
+demo_data <- demo_data %>%
+ mutate(pc_ed = factor(pc_ed, levels = c("High School or GED", "Associate Degree", "Four-Year Degree",
+ "Master's Degree", "Professional Degree or PhD")))
+
+label(demo_data$pc_ed) <- "Educational Attainment"
+```
+
+Then, we can make our final table.
+```{r 6-1-Descriptive-Cohort-Analyses-25 }
+table1(~ mat_age_birth + pc_sex + pc_gender + pc_latino_hispanic + pc_race_cleaned + pc_ed,
+ data = demo_data,
+ render.continuous = my.render.cont)
+```
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can now answer **Environmental Health Question #3***: What are the demographics of the study participants?
+:::
+
+:::answer
+**Answer:** The study participants were all females who identified as women and were, on average, 31 years old when they gave birth. Participants were mostly non-Latino/non-Hispanic and White. Participants were spread across educational attainment levels, with the smallest educational attainment group being those with an associate degree and the largest being those with a four-year degree.
+:::
+
+## Concluding Remarks
+
+In conclusion, this training module serves as an introduction to human cohort data exploration and preliminary analysis, including data filtering, summary statistics, and multivariate outlier detection. These methods are an important step at the beginning of human cohort analyses, and the concepts introduced in this module can be applied to a wide variety of datasets.
+
+
+
+
+
+:::tyk
+Using a more expanded demographics file ("Module6_1_TYKInput.xlsx"), create a demographics table with:
+
++ The two new variables (home location and home type) included
++ The table split by which site the participant visited
++ Variable names and values presented in a publication-quality format (first letters capitalized, spaces between words, no underscores)
+:::
diff --git a/Chapter_6/Module6_1_Input/Module6_1_InputData1.xlsx b/Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData1.xlsx
similarity index 100%
rename from Chapter_6/Module6_1_Input/Module6_1_InputData1.xlsx
rename to Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData1.xlsx
diff --git a/Chapter_6/Module6_1_Input/Module6_1_InputData2.xlsx b/Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData2.xlsx
similarity index 100%
rename from Chapter_6/Module6_1_Input/Module6_1_InputData2.xlsx
rename to Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData2.xlsx
diff --git a/Chapter_6/Module6_1_Input/Module6_1_InputData3.xlsx b/Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData3.xlsx
similarity index 100%
rename from Chapter_6/Module6_1_Input/Module6_1_InputData3.xlsx
rename to Chapter_6/6_1_Descriptive_Cohort_Analyses/Module6_1_InputData3.xlsx
diff --git a/Chapter_6/6_2_Omics_System_Biology/6_2_Omics_System_Biology.Rmd b/Chapter_6/6_2_Omics_System_Biology/6_2_Omics_System_Biology.Rmd
new file mode 100644
index 0000000..f4910f0
--- /dev/null
+++ b/Chapter_6/6_2_Omics_System_Biology/6_2_Omics_System_Biology.Rmd
@@ -0,0 +1,913 @@
+
+# 6.2 -Omics and Systems Biology: Transcriptomic Applications
+
+This training module was developed by Lauren E. Koval, Dr. Kyle Roell, and Dr. Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+
+## Introduction to Training Module
+
+This training module incorporates the highly relevant example of RNA sequencing to evaluate the impacts of environmental exposures on cellular responses and general human health. **RNA sequencing** is the most common method that is currently implemented to measure the transcriptome. Results from an RNA sequencing platform are often summarized as count data, representing the relative number of times a gene (or other annotated portion of the genome) was 'read' in a given sample. For more details surrounding the methodological underpinnings of RNA sequencing, see the following recent review:
+
++ Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019 Nov;20(11):631-656. doi: 10.1038/s41576-019-0150-2. Epub 2019 Jul 24. PMID: [31341269](https://pubmed.ncbi.nlm.nih.gov/31341269/).
+
+
+In this training module, we guide participants through an example RNA sequencing analysis. Here, we analyze RNA sequencing data collected in a toxicology study evaluating the effects of biomass smoke exposure, representing wildfire-relevant exposure conditions. This study has previously been described in the following publications:
+
++ Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. doi: 10.1016/j.scitotenv.2021.145759. Epub 2021 Feb 10. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
+
++ Kim YH, Warren SH, Krantz QT, King C, Jaskot R, Preston WT, George BJ, Hays MD, Landis MS, Higuchi M, DeMarini DM, Gilmour MI. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ Health Perspect. 2018 Jan 24;126(1):017011. doi: 10.1289/EHP2200. PMID: [29373863](https://pubmed.ncbi.nlm.nih.gov/29373863/).
+
+Here, we specifically analyze mRNA sequencing profiles collected in mouse lung tissues. These mice were exposed to two different biomass burn scenarios: smoldering pine needles and flaming pine needles, representing certain wildfire smoke exposure scenarios that can occur. The goal of these analyses is to identify which genes demonstrate altered expression in response to these wildfire-relevant exposures, and identify which biological pathways these genes influence to evaluate findings at the systems biology level.
+
+This training module begins by guiding users through the loading, viewing, and formatting of the example transcriptomics datasets and associated metadata. Methods to carry out quality assurance (QA) / quality control (QC) of the transcriptomics data are then described, which are advantageous to ensure high quality data are included in the final statistical analysis. Because these transcriptomic data were derived from bulk lung tissue samples, consisting of mixed cell populations that could have shifted in response to exposures, data are then adjusted for potential sources of heterogeneity using the R package [RUVseq](https://bioconductor.org/packages/release/bioc/html/RUVSeq.html).
+
+Statistical models are then implemented to identify genes that were significantly differentially expressed between exposed vs unexposed samples. Models are implemented using algorithms within the commonly implemented R package [DESeq2](https://doi.org/10.1186/s13059-014-0550-8). This package is very convenient, well written, and widely used. The main advantage of this package is that it allows you to perform differential expression analyses and easily obtain various statistics and results with minimal script development on the user's end.
+
+After obtaining results from differential gene expression analyses, we visualize these results using both MA and volcano plots. Finally, we carry out a systems level analysis through pathway enrichment using the R package [PIANO](https://doi.org/10.1093/nar/gkt111) to identify which biological pathways were altered in response to these wildfire-relevant exposure scenarios.
+
+## Introduction to the Field of "-Omics"
+
+The field of "-omics" has rapidly evolved since its inception in the mid-1990s, initiated from information obtained through sequencing of the human genome (see the [Human Genome Project](https://www.genome.gov/human-genome-project)) as well as the advent of high-content technologies. High-content technologies have allowed the rapid and economical assessment of genome-wide, or ‘omics’-based, endpoints.
+
+Traditional molecular biology techniques typically evaluate the function(s) of individual genes and gene products. Omics-based methods, on the other hand, utilize non-targeted methods to identify many to all genes or gene products in a given environmental/biological sample. These non-targeted approaches allow for the unbiased investigation of potentially unknown or understudied molecular mediators involved in regulating cell health and disease. These molecular profiles have the potential of being altered in response to toxicant exposures and/or during disease initiation/progression.
+
+To further understand the molecular consequences of -omics-based alterations, molecules can be overlaid onto molecular networks to uncover biological pathways and molecular functions that are perturbed at the systems biology level. An overview of these general methods, starting with high-content technologies and ending with systems biology, is provided in the figure below (created with BioRender.com).
+
+```{r 6-2-Omics-System-Biology-1, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_6/6_2_Omics_System_Biology/Module6_2_Image1.png")
+```
+
+
+A helpful introduction to the field of -omics in relation to environmental health, as well as methods used to relate -omic-level alterations to systems biology, is provided in the following book chapter:
+
++ Rager JE, Fry RC. Systems Biology and Environmental Exposures. Chpt 4 of 'Network Biology' edited by WenJun Zhang. 2013. ISBN: 978-1-62618-941-3. Nova Science Publishers, Inc. Available at: https://www.novapublishers.com/wp-content/uploads/2019/07/978-1-62618-942-3_ch4.pdf.
+
+
+An additional helpful resource describing computational methods that can be used in systems level analyses is the following book chapter:
+
++ Meisner M, Reif DM. Computational Methods Used in Systems Biology. Chpt 5 of 'Systems Biology in Toxicology and Environmental Health' edited by Fry RC. 2015: 85-115. ISBN 9780128015643. Academic Press. Available at: https://www.sciencedirect.com/science/article/pii/B9780128015643000055.
+
+
+Parallel to human genomics/epigenomics-based research is the newer "-omics" topic of the **exposome**. The exposome was originally conceptualized as 'all life-course environmental exposures (including lifestyle factors), from the prenatal period onwards' ([Wild et al. 2005](https://cebp.aacrjournals.org/content/14/8/1847.long)). Since then, this concept has received much attention and additional associated definitions. We like to think of the exposome as including anything in one's environment that may impact the overall health of an individual, excluding the individual's genome/epigenome. Common elements evaluated as part of the exposome include environmental exposures, such as chemicals and other substances that may impart toxicity. Additional potential stressors include lifestyle factors, socioeconomic factors, infectious agents, therapeutics, and other stressors that may be altered internally (e.g., the microbiome). A helpful review of this research field is provided in the following publication:
+
++ Wild CP. The exposome: from concept to utility. Int J Epidemiol. 2012 Feb;41(1):24-32. doi: 10.1093/ije/dyr236. Epub 2012 Jan 31. PMID: [22296988](https://pubmed.ncbi.nlm.nih.gov/22296988/).
+
+
+
+## Introduction to Transcriptomics
+One of the most widely evaluated -omics endpoints is messenger RNA (mRNA) expression (also termed gene expression). As a reminder, mRNA molecules are a major type of RNA produced as the "middle step" in the [Central Dogma Theory](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology#:~:text=The%20central%20dogma%20of%20molecular,The%20Central%20Dogma), which describes how genetic DNA is first transcribed into RNA and then translated into protein. Protein molecules are ultimately the major regulators of cellular processes and overall health. Therefore, any perturbations to this process (including changes to mRNA expression levels) can have tremendous consequences on overall cell function and health. A visualization of these steps in the Central Dogma Theory is included below.
+
+```{r 6-2-Omics-System-Biology-2, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_6/6_2_Omics_System_Biology/Module6_2_Image2.png")
+```
+
+
+mRNA expression can be evaluated in a high-throughput/high-content manner across the genome; when measured in this genome-wide fashion, the resulting profile is referred to as the **transcriptome**. Transcriptomics can be measured using a variety of technologies, including high-density nucleic acid arrays (e.g., DNA microarrays or GeneChip arrays), high-throughput PCR technologies, or RNA sequencing technologies. These methods are used to obtain relative measures of genes that are being expressed, or transcribed from DNA, by measuring the abundance of mRNA molecules. Results of these methods are often described as gene expression signatures or 'transcriptomes' of the sample under evaluation.
+
+
+### Training Module's **Environmental Health Questions**
+
+This training module was specifically developed to answer the following environmental health questions:
+
+(1) What two types of data are commonly needed in the analysis of transcriptomics data?
+
+(2) When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?
+
+(3) When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?
+
+(4) What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?
+
+(5) How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+
+(6) How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+
+(7) How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+
+(8) What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?
+
+
+### Workspace Preparation and Data Import
+
+
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
+```{r 6-2-Omics-System-Biology-3, message=FALSE, warning=FALSE, error=FALSE}
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("BiocManager"))
+  install.packages("BiocManager");
+if (!requireNamespace("DESeq2"))
+ BiocManager::install("DESeq2");
+if (!requireNamespace("edgeR"))
+ BiocManager::install("edgeR");
+if (!requireNamespace("RUVSeq"))
+ BiocManager::install("RUVSeq");
+if (!requireNamespace("janitor"))
+ install.packages("janitor");
+if (!requireNamespace("pheatmap"))
+ install.packages("pheatmap");
+if (!requireNamespace("factoextra"))
+ install.packages("factoextra");
+if (!requireNamespace("RColorBrewer"))
+ install.packages("RColorBrewer");
+if (!requireNamespace("data.table"))
+ install.packages("data.table");
+if (!requireNamespace("EnhancedVolcano"))
+ BiocManager::install("EnhancedVolcano");
+if (!requireNamespace("piano"))
+ BiocManager::install("piano");
+```
+
+
+#### Loading R packages required for this session
+```{r 6-2-Omics-System-Biology-4, message=FALSE, warning=FALSE, error=FALSE}
+library(tidyverse)
+library(DESeq2)
+library(edgeR)
+library(RUVSeq)
+library(janitor)
+library(factoextra)
+library(pheatmap)
+library(data.table)
+library(RColorBrewer)
+library(EnhancedVolcano)
+library(piano)
+```
+
+
+#### Set your working directory
+```{r 6-2-Omics-System-Biology-5, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+
+### Loading the Example Transcriptomic Dataset and Metadata
+
+First, let's read in the transcriptional signature data, previously summarized as number of sequence reads per gene (also simply referred to as 'count data') and its associated metadata file:
+```{r 6-2-Omics-System-Biology-6, message=F, warning=F, error=F}
+# Read in the count data
+countdata <- read.csv(file = 'Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData1_GeneCounts.csv', check.names = FALSE)
+
+# Read in the metadata (describing information on each sample)
+sampleinfo <- read.csv(file = "Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData2_SampleInfo.csv", check.names = FALSE)
+```
+
+
+### Data Viewing
+
+Let's see how many rows and columns of data are present in the countdata dataframe
+```{r 6-2-Omics-System-Biology-7 }
+dim(countdata)
+```
+
+Let's also view the column headers
+```{r 6-2-Omics-System-Biology-8 }
+colnames(countdata)
+```
+
+And finally let's view the top few rows of data
+```{r 6-2-Omics-System-Biology-9 }
+head(countdata)
+```
+Together, this dataframe contains information across 30,146 mRNA identifiers, which are labeled according to "Gene name" followed by an underscore and the probe number assigned by the platform used in this analysis (TempO-Seq, BioSpyder Technologies).
+
+A total of 23 columns are included in this dataframe, the first of which represents the gene identifier, followed by gene count data across 22 samples.
+
+
+Let's also see what the metadata dataframe looks like
+```{r 6-2-Omics-System-Biology-10 }
+dim(sampleinfo)
+```
+
+Let's also view the column headers
+```{r 6-2-Omics-System-Biology-11 }
+colnames(sampleinfo)
+```
+
+And finally let's view the top few rows of data
+```{r 6-2-Omics-System-Biology-12 }
+head(sampleinfo)
+```
+Together, this dataframe contains information across the 22 total samples, which are labeled according to the `SampleID_BioSpyderCountFile` header. These identifiers match those used as column headers in the countdata dataframe.
+
+A total of 9 columns are included in this dataframe, including the following:
+
++ `SampleID_BioSpyderCountFile`: The unique sample identifiers (total n=22)
++ `PlateBatch`: The plate number that was used in the generation of these data.
++ `MouseID`: The unique identifier for each mouse used in this study, which starts with "M" followed by a number
++ `NumericID`: The unique numeric identifier for each mouse.
++ `Treatment`: The type of exposure condition that each mouse was administered. These include smoldering pine needles, flaming pine needles, vehicle control (saline), and positive inflammation control (LPS, or lipopolysaccharide)
++ `ID`: Another form of identifier that combines the mouse identifier with the exposure condition
++ `Timepoint`: The timepoint at which samples were collected (here, all 4h post-exposure)
++ `Tissue`: The type of tissue that was collected and analyzed (here, all lung tissue)
++ `Group`: The higher level identifier that groups samples together based on exposure condition, timepoint, and tissue
+
+### Checking for Duplicate mRNA IDs
+
+One common QC/preparation step that is helpful when organizing transcriptomics data is to check for potential duplicate mRNA IDs in the countdata.
+```{r 6-2-Omics-System-Biology-13 }
+# Visualize this data quickly by viewing top left corner, to check where ID column is located:
+countdata[1:3,1:5]
+
+# Then check for duplicates within column 1 (where the ID column is located):
+Dups <- duplicated(countdata[,1])
+summary(Dups)
+```
+
+In this case, because all potential duplicate checks turn up "FALSE", this dataset does not contain duplicate mRNA identifiers in its current organized format.
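+Had any of these checks turned up "TRUE", a common remedy is to collapse duplicate identifiers before analysis, for example by summing counts within each gene ID. Below is a minimal base R sketch using made-up toy data (the gene names and counts are hypothetical; this step is not needed for the current dataset):
+
+```{r 6-2-duplicate-collapse-sketch}
+# Hypothetical toy count table containing a duplicated gene identifier
+toy_counts <- data.frame(Gene = c("Actb_1", "Actb_1", "Gapdh_2"),
+                         Sample1 = c(10, 5, 20),
+                         Sample2 = c(4, 6, 8))
+
+# Collapse duplicate identifiers by summing counts within each gene
+toy_collapsed <- aggregate(cbind(Sample1, Sample2) ~ Gene, data = toy_counts, FUN = sum)
+
+toy_collapsed
+# Actb_1 is now a single row with Sample1 = 15 and Sample2 = 10
+```
+
+Whether summing, averaging, or keeping the highest-expressed probe is most appropriate depends on the platform and on why the duplicates exist.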
+
+### Answer to Environmental Health Question 1
+
+:::question
+*With this, we can now answer **Environmental Health Question #1***: What two types of data are commonly needed in the analysis of transcriptomics data?
+:::
+
+:::answer
+**Answer:** A file containing the raw -omics signatures is needed (in this case, the count data summarized per gene, acquired from RNA sequencing technologies), as well as a file containing the associated metadata describing the actual samples: where they were derived from, what they represent, etc.
+:::
+
+
+## Formatting Data for Downstream Statistics
+
+Most of the statistical analyses included in this training module will be carried out using the DESeq2 pipeline. This package requires that the count data and sample information data be formatted in a certain manner, which will expedite the downstream coding needed to carry out the statistics. Here, we will walk users through these initial formatting steps.
+
+DESeq2 first requires a `coldata` dataframe, which includes the sample information (i.e., metadata). Let's create this new dataframe based on the original `sampleinfo` dataframe:
+```{r 6-2-Omics-System-Biology-14, message=F, warning=F, error=F}
+coldata <- sampleinfo
+```
+
+
+DESeq2 also requires a `countdata` dataframe, which we've previously created; however, this dataframe requires some minor formatting before it can be used as input for downstream script.
+
+First, the gene identifiers need to be converted into row names:
+```{r 6-2-Omics-System-Biology-15, message=F, warning=F, error=F}
+countdata <- countdata %>% column_to_rownames("Gene")
+```
+
+Then, the column names need to be edited. Let's remind ourselves what the column names are currently:
+```{r 6-2-Omics-System-Biology-16, message=F, warning=F, error=F}
+colnames(countdata)
+```
+
+These column identifiers need to be converted into more intuitive sample IDs that also indicate treatment. This information can be found in the coldata dataframe. Specifically, information in the column labeled `SampleID_BioSpyderCountFile` will be helpful for these purposes.
+
+To replace these original column identifiers with these more helpful sample identifiers, let's first make sure the countdata columns are in the same order as the coldata `SampleID_BioSpyderCountFile` column:
+```{r 6-2-Omics-System-Biology-17, message=F, warning=F, error=F}
+countdata <- setcolorder(countdata, as.character(coldata$SampleID_BioSpyderCountFile))
+```
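+For reference, the same reordering can be accomplished in base R (without *data.table*) by indexing columns with a character vector; a quick sketch with hypothetical sample names:
+
+```{r 6-2-column-reorder-sketch}
+# Toy dataframe whose columns are out of order relative to the desired sample order
+toy_df <- data.frame(M3_LPS = 1:2, M1_Saline = 3:4, M2_Flaming = 5:6)
+desired_order <- c("M1_Saline", "M2_Flaming", "M3_LPS")
+
+# Reorder the columns to match the desired sample order
+toy_df <- toy_df[, desired_order]
+
+colnames(toy_df)
+```
+
+`setcolorder()` from *data.table* does the same thing by reference, which is more memory-efficient for large count matrices.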
+
+Now, we can rename the column names within the countdata dataframe with these more helpful identifiers, since both dataframes are now arranged in the same order:
+```{r 6-2-Omics-System-Biology-18, message=F, warning=F, error=F}
+colnames(countdata) <- coldata$ID # Rename the countdata column names with the treatment IDs.
+colnames(countdata) # Viewing these new column names
+```
+These new column identifiers look much better, and can better inform downstream statistical analysis script. Remember that these identifiers indicate that these are mouse samples ("M"), with unique numbers, followed by an underscore and the exposure condition.
+
+
+When relabeling dataframes, it's always important to triple check any of these major edits. For example, here, let's double check that the same samples appear in the same order between the two working dataframes required for downstream DESeq2 code:
+```{r 6-2-Omics-System-Biology-19, message=F, warning=F, error=F}
+setequal(as.character(coldata$ID), colnames(countdata))
+identical(as.character(coldata$ID), colnames(countdata))
+```
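+These two checks are complementary: `setequal()` confirms that the same identifiers are present while ignoring order, whereas `identical()` additionally requires that they appear in exactly the same order (which is what DESeq2 needs). A quick toy illustration with hypothetical sample IDs:
+
+```{r 6-2-setequal-vs-identical-sketch}
+a <- c("M1_Saline", "M2_Flaming")
+b <- c("M2_Flaming", "M1_Saline")
+
+setequal(a, b)   # TRUE: same elements, order ignored
+identical(a, b)  # FALSE: same elements, but order differs
+```
+
+So if `setequal()` returns TRUE but `identical()` returns FALSE, the samples match but still need to be reordered before proceeding.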
+
+
+
+## Transcriptomics Data QA/QC
+After preparing your transcriptomic data and sample information dataframes for statistical analyses, it is very important to carry out QA/QC on your organized datasets prior to including all samples and all genes in the actual statistical model. It is critical to only include high-quality data that inform the underlying biology of exposure responses/disease etiology, rather than data that may contribute noise to the overall data distributions. Some common QA/QC steps and associated data pre-filters carried out in transcriptomics analyses are detailed below.
+
+
+### Background Filter
+It is very common to perform a background filter step when preparing transcriptomic data for statistical analyses. The goal of this step is to remove genes that are very lowly expressed across the majority of samples, and thus are referred to as universally lowly expressed. Signals from these genes can mute the overall signals that may be identified in -omics analyses. The specific threshold that you may want to apply as the background filter to your dataset will depend on the distribution of your dataset and analysis goal(s).
+
+For this example, we apply a background threshold to remove genes that are lowly expressed across the majority of samples, specifically defined as genes whose expression is less than (or equal to) the median expression of all genes across all samples in more than 80% of the samples. Equivalently, this retains only genes that are expressed above background, meaning their expression exceeds the overall median in at least 20% of the samples. Script to apply this filter is detailed below:
+
+```{r 6-2-Omics-System-Biology-20, message=F, warning=F, error=F}
+# First count the total number of samples, and save it as a value in the global environment
+nsamp <- ncol(countdata)
+
+# Then, calculate the median expression level across all genes and all samples, and save it as a value
+total_median <- median(as.matrix(countdata))
+
+
+# We need to temporarily add back in the Gene column to the countdata so we can filter for genes that pass the background filter
+countdata <- countdata %>% rownames_to_column("Gene")
+
+# Then we can apply a set of filters and organization steps (using the tidyverse) to result in a list of genes that have an expression greater than the total median in at least 20% of the samples
+genes_above_background <- countdata %>% # Start from the 'countdata' dataframe
+ # Melt the data so that we have three columns: gene, exposure condition, and expression counts
+ pivot_longer(cols=!Gene, names_to = "sampleID", values_to="expression") %>%
+ # Add a column that indicates whether the expression of a gene for the corresponding exposure condition is above (1) or not above (0) the median of all count data
+ mutate(above_median=ifelse(expression>total_median,1,0)) %>%
+ group_by(Gene) %>% # Group the dataframe by the gene
+ # For each gene, count the number of exposure conditions where the expression was greater than the median of all count data
+ summarize(total_above_median=sum(above_median)) %>%
+ # Filter for genes that have expression above the median in at least 20% of the samples
+ filter(total_above_median>=.2*nsamp) %>%
+ # Select just the genes that pass the filter
+ select(Gene)
+
+# Then filter the original 'countdata' dataframe for only the genes above background.
+countdata <- left_join(genes_above_background, countdata, by="Gene")
+```
+
+Here, the `countdata` dataframe went from having 30,146 rows of data (representing genes) to 16,664 rows of data (representing genes with expression levels that passed this background filter).
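+To see the logic of this filter in isolation, here is a minimal base R sketch on a made-up toy matrix: a gene is kept only if it exceeds the overall median (computed across all genes and samples) in at least 20% of the samples.
+
+```{r 6-2-background-filter-sketch}
+# Toy count matrix: 3 genes x 5 samples; Gene3 is universally lowly expressed
+toy <- rbind(Gene1 = c(10, 12, 15, 11, 14),
+             Gene2 = c( 9, 20, 13, 18, 16),
+             Gene3 = c( 0,  1,  0,  2,  1))
+
+overall_median <- median(toy)  # median of all 15 values (= 11 here)
+
+# Fraction of samples in which each gene exceeds the overall median
+frac_above <- rowMeans(toy > overall_median)
+
+# Keep genes above the overall median in at least 20% of the samples
+toy_filtered <- toy[frac_above >= 0.2, , drop = FALSE]
+rownames(toy_filtered)  # Gene3 is removed
+```
+
+The tidyverse pipeline above implements this same criterion on the full dataset, just in long format.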
+
+
+
+### Sample Filtering
+Another common QA/QC check is to evaluate whether there are any samples that did not produce adequate RNA material to be measured using the technology employed. Thus, a sample filter can be applied to remove samples that have inadequate data. Here, we demonstrate this filter by checking whether any samples resulted in mRNA expression values of zero across all genes. If any sample demonstrates this issue, it should be removed prior to any statistical analysis. Note that there are other filter cut-offs you can use, depending on your specific study.
+
+Below is example script that checks for the presence of samples that meet the above criteria:
+```{r 6-2-Omics-System-Biology-21, message=FALSE, warning=FALSE, error=FALSE}
+# Transpose filtered 'countdata', while keeping data in dataframe format, to allow for script that easily sums the total expression levels per sample
+countdata_T <- countdata %>%
+ pivot_longer(cols=!Gene, names_to="sampleID",values_to="expression") %>%
+ pivot_wider(names_from=Gene, values_from=expression)
+
+# Then add in a column to the transposed countdata dataframe that sums expression across all genes for each exposure condition
+countdata_T$rowsum <- rowSums(countdata_T[2:ncol(countdata_T)])
+
+# Remove samples that have no expression. All samples have some expression in this example, so all samples are retained.
+countdata_T <- countdata_T %>% filter(rowsum!=0)
+
+# Take the count data filtered for correct samples, remove the 'rowsums' column
+countdata_T <- countdata_T %>% select(!rowsum)
+
+# Then, transpose it back to the correct format for analysis
+countdata <- countdata_T %>%
+ pivot_longer(cols=!sampleID, names_to = "Gene",values_to="expression") %>%
+ pivot_wider(names_from = sampleID, values_from = "expression")
+```
+
+
+### Identifying & Removing Sample Outliers
+Prior to final statistical analysis, raw transcriptomic data are commonly evaluated for the presence of potential sample outliers. Outliers can result from experimental error, technical/measurement error, and/or large sources of biological variation. For many analyses, it is beneficial to remove such outliers to improve the ability to identify biologically meaningful signals across the data. Here, we present two methods to check for the presence of sample outliers:
+
+**1. Principal component analysis (PCA)** can be used to identify potential outliers in a dataset through visualization of summary-level values illustrating reduced representations of the entire dataset. Note that a more detailed description of PCA is provided in **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means & PCA**. Here, PCA is run on the raw count data, assessed using a scree plot of the principal components (PCs), and visualized using a scatter plot displaying the first two principal components.
+
+
+**2. Hierarchical clustering** is another approach that can be used to identify potential outliers. Hierarchical clustering aims to cluster data based on a similarity measure, defined by the function and/or specified by the user. There are several R packages and functions that will run hierarchical clustering, but it is often visually helpful to do this in conjunction with a heatmap. Here, we use the package *pheatmap* (introduced in **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**) with hierarchical clustering across samples to identify potential outliers.
+
+
+Let's start by using PCA to identify potential outliers, while providing a visualization of potential sources of variation across the dataset.
+
+First we need to move the Gene column back to the rownames so our dataframe is numeric and we can run the PCA script
+```{r 6-2-Omics-System-Biology-22, message=FALSE, warning=FALSE, error=FALSE}
+countdata <- countdata %>% column_to_rownames("Gene")
+
+# Let's remind ourselves what these data look like
+countdata[1:10,1:5] #viewing first 10 rows and 5 columns
+```
+
+
+Then we can calculate principal components using transposed count data
+```{r 6-2-Omics-System-Biology-23 }
+pca <- prcomp(t(countdata))
+```
+
+
+And visualize the percent variation captured by each principal component (PC) with a scree plot
+```{r 6-2-Omics-System-Biology-24, fig.align='center'}
+# We can generate a scree plot that shows the eigenvalues of each component, indicating how much of the total variation is captured by each component
+fviz_eig(pca, addlabels = TRUE)
+```
+
+This scree plot indicates that nearly all variation is explained by PC1 and PC2, so we are comfortable with viewing these first two PCs when evaluating whether or not potential outliers exist in this dataset.
+
+#### Visualization of Transcriptomic Data using PCA
+
+Further visualization of how these transcriptomic data appear through PCA can be produced through a scatter plot showing the data reduced values per sample:
+```{r 6-2-Omics-System-Biology-25, fig.align='center', warning = FALSE}
+# Calculate the percent variation captured by each PC
+pca_percent <- round(100*pca$sdev^2/sum(pca$sdev^2),1)
+
+# Make dataframe for PCA plot generation using first two components and the sample name
+pca_df <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], Sample=colnames(countdata))
+
+# Organize dataframe so we can color our points by the exposure condition
+pca_df <- pca_df %>% separate(Sample, into = c("mouse_num", "expo_cond"), sep="_")
+
+# Plot PC1 and PC2 for each sample and color the point by the exposure condition
+ggplot(pca_df, aes(PC1,PC2, color = expo_cond))+
+ geom_hline(yintercept = 0, size=0.3)+
+ geom_vline(xintercept = 0, size=0.3)+
+ geom_point(size=3) +
+ geom_text(aes(label=mouse_num), vjust =-1, size=4)+
+ labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
+ ggtitle("PCA for 4h Lung Pine Needles & Control Exposure Conditions")
+```
+
+With this plot, we can see that no samples group far apart from the others. Therefore, our PCA indicates that there are unlikely to be any sample outliers in this dataset.
+
+
+#### Now let's implement hierarchical clustering to identify potential outliers
+
+First we need to create a dataframe of our transposed `countdata` such that samples are rows and genes are columns to input into the clustering algorithm.
+```{r 6-2-Omics-System-Biology-26 }
+countdata_for_clustering <- t(countdata)
+countdata_for_clustering[1:5,1:10] # Viewing what this transposed dataframe looks like
+```
+
+
+Next we can run hierarchical clustering in conjunction with the generation of a heatmap. Note that we scale these data for improved visualization.
+```{r 6-2-Omics-System-Biology-27, fig.align='center'}
+pheatmap(scale(countdata_for_clustering), main="Hierarchical Clustering",
+ cluster_rows=TRUE, cluster_cols = FALSE,
+ fontsize_col = 7, treeheight_row = 60, show_colnames = FALSE)
+```
+
+Like the PCA findings, hierarchical clustering demonstrated an overall lack of potential sample outliers because there were no obvious sample(s) that grouped separately from the rest along the clustering dendrograms.
+Therefore, *neither approach points to outliers that should be removed in this analysis.*
+
+
+
+
+### Answer to Environmental Health Question 2
+
+:::question
+*With this, we can now answer **Environmental Health Question #2***: When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?
+:::
+
+:::answer
+**Answer:** (1) Background filter to remove genes that are universally lowly expressed; (2) Sample filter to remove samples that may not have any detectable mRNA; (3) Sample outlier filter to remove samples with underlying data distributions outside of the overall, collective dataset.
+:::
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can now also answer **Environmental Health Question #3***: When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?
+:::
+
+:::answer
+**Answer:** Principal component analysis (PCA) and hierarchical clustering.
+:::
+
+
+
+## Controlling for Sources of Sample Heterogeneity
+Because these transcriptomic data were generated from mouse lung tissues, there is potential for these samples to show heterogeneity based on underlying shifts in cell populations (e.g., neutrophil influx) or other aspects of sample heterogeneity (e.g., batch effects from plating, among other sources of heterogeneity that we may want to control for). For these kinds of complex samples, there are data processing methods that can be leveraged to minimize the influence of these sources of heterogeneity. Example methods include Remove Unwanted Variation (RUV), which is discussed here, as well as others (e.g., [Surrogate Variable Analysis (SVA)](https://academic.oup.com/nar/article/42/21/e161/2903156)).
+
+Here, we leverage the package *RUVSeq* to employ RUV on this sequencing dataset. Script was developed based on the [Bioconductor website](https://bioconductor.org/packages/release/bioc/html/RUVSeq.html), the [vignette](http://bioconductor.org/packages/release/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf), and the original [publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/).
+
+
+#### Steps in carrying out RUV using RUVseq on this example dataset:
+```{r 6-2-Omics-System-Biology-28, message=F, warning=F, error=F}
+# First we store the treatment IDs and exposure conditions as a separate vector
+ID <- coldata$ID
+
+# And differentiate our treatments and control conditions, first by grabbing the groups associated with each sample
+groups <- as.factor(coldata$Group)
+
+# Let's view all the groups
+groups
+
+# then setting a control label
+ctrl <- "Saline_4h_Lung"
+
+# and extracting a vector of just our treatment groups
+trt_groups <- setdiff(groups,ctrl)
+
+# let's view this vector
+trt_groups
+```
+
+*RUVSeq* contains its own set of plotting and normalization functions, though it requires as input an object of S4 class SeqExpressionSet. Let's go ahead and make this object using the *RUVSeq* function `newSeqExpressionSet()`:
+```{r 6-2-Omics-System-Biology-29 }
+exprSet <- newSeqExpressionSet(as.matrix(countdata),phenoData = data.frame(groups,row.names=colnames(countdata)))
+```
+
+
+And then use this object to generate some exploratory plots using built-in tools within *RUVseq*.
+First starting with some bar charts summarizing overall data distributions per sample:
+```{r 6-2-Omics-System-Biology-30, fig.align='center'}
+colors <- brewer.pal(4, "Set2")
+plotRLE(exprSet, outline=FALSE, ylim=c(-4, 4), col=colors[groups])
+```
+
+We can see from this plot that some of the samples show distributions that vary from the overall trend; for instance, one of the flaming pine needle-exposed samples (in orange) sits far lower than the rest.
+
+
+Then viewing a PCA plot of these samples:
+```{r 6-2-Omics-System-Biology-31, fig.align='center'}
+colors <- brewer.pal(4, "Set2")
+plotPCA(exprSet, col=colors[groups], cex=1.2)
+```
+
+This PCA plot shows pretty good data distributions, with samples mainly showing groupings based upon exposure condition (e.g., LPS), which is to be expected. With this, we can conclude that there may be some sources of unwanted variation, but not a huge amount. Let's see what the data look like after running RUV.
+
+
+Now to actually run the RUVseq algorithm, to control for potential sources of sample heterogeneity, we need to first construct a matrix specifying the replicates (samples of the same exposure condition):
+```{r 6-2-Omics-System-Biology-32 }
+# Construct a matrix specifying the replicates (samples of the same exposure condition) for running RUV
+differences <- makeGroups(groups)
+
+# Viewing this new matrix
+head(differences)
+```
+
+This matrix groups the samples by exposure condition. Here, each of the four rows represents one of the four exposure conditions, and each of the six columns represents a possible sample. Since the LPS exposure condition only had four samples, instead of six like the rest of the exposure conditions, a value of -1 is automatically used as a placeholder to fill out the matrix. The samples in the matrix are identified by the index of the sample in the previously defined 'groups' factor that was used to generate the matrix. For example, the PineNeedlesSmolder_4h_Lung samples are the first six samples contained in the 'groups' factor, so in the matrix, samples of this exposure condition are identified as '1', '2', '3', '4', '5', and '6'.
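+
+For intuition, the structure of this replicate matrix can be sketched in base R. This is an illustration only, not the actual `makeGroups()` internals, and the toy group labels below are hypothetical:
+
+```r
+# Build a matrix with one row per group, listing the sample indices belonging
+# to that group and padding smaller groups with -1 placeholders
+make_groups_sketch <- function(groups) {
+  groups <- as.factor(groups)
+  max_n <- max(table(groups))               # size of the largest group
+  t(sapply(levels(groups), function(g) {
+    idx <- which(groups == g)               # sample indices for this group
+    c(idx, rep(-1, max_n - length(idx)))    # pad with -1 placeholders
+  }))
+}
+
+# Toy example: one condition with three replicates, another with only two
+make_groups_sketch(c("A", "A", "A", "B", "B"))
+#   [,1] [,2] [,3]
+# A    1    2    3
+# B    4    5   -1
+```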
+
+
+Let's now implement the RUVseq algorithm and, for this example, capture one factor (k=1) of unwanted variation. Note that the k parameter can be modified to capture additional factors, if necessary.
+```{r 6-2-Omics-System-Biology-33 }
+# Now capture 1 factor (k=1) of unwanted variation
+ruv_set <- RUVs(exprSet, rownames(countdata), k=1, differences)
+```
+
+
+This results in a list of objects within `ruv_set`, which include the following important pieces of information:
+
+(1) Estimated factors of unwanted variation are provided in the phenoData object, as viewed using the following:
+```{r 6-2-Omics-System-Biology-34 }
+# viewing the estimated factors of unwanted variation in the column W_1
+pData(ruv_set)
+```
+
+
+(2) Normalized counts obtained by regressing the original counts on the unwanted factors (normalizedCounts object within `ruv_set`). Note that the normalized counts should only be used for exploratory purposes and not for subsequent differential expression analyses. For additional information on this topic, please refer to the official *RUVSeq* documentation. The normalized counts can be viewed using the following:
+```{r 6-2-Omics-System-Biology-35 }
+# Viewing the head of the normalized count data, accounting for unwanted variation
+head(normCounts(ruv_set))
+```
+
+
+Let's again generate an exploratory plot using this updated dataset, focusing on the bar chart view since that was the most informative pre-RUV. Here are the updated bar charts summarizing overall data distributions per sample:
+```{r 6-2-Omics-System-Biology-36, fig.align='center'}
+colors <- brewer.pal(4, "Set2")
+plotRLE(ruv_set, outline=FALSE, ylim=c(-4, 4), col=colors[groups])
+```
+
+This plot shows overall tighter data that are more similarly distributed across samples. Therefore, it appears that this RUV step improved the overall distribution of this dataset. It is important not to over-correct/over-smooth your datasets, so implement these types of pre-processing steps with caution. One strategy that we commonly employ to gauge whether data smoothing is needed/applied correctly is to run the statistical models with and without correction for potential sources of heterogeneity, and critically evaluate similarities vs. differences in the results produced.
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can now answer **Environmental Health Question #4***: What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?
+:::
+
+:::answer
+**Answer:** Remove unwanted variation (RUV), among other approaches, including surrogate variable analysis (SVA).
+:::
+
+
+
+
+## Identifying Genes that are Significantly Differentially Expressed by Environmental Exposure Conditions (e.g., Biomass Smoke Exposure)
+At this point, we have completed several data pre-processing, QA/QC, and additional steps to prepare our example transcriptomics data for statistical analysis. And finally, we are ready to run the overall statistical model to identify genes that are altered in expression in association with different biomass burn conditions.
+
+Here we leverage the *DESeq2* package to carry out these statistical comparisons. This package is now the most commonly implemented analysis pipeline used for transcriptomic data, including sequencing data as well as transcriptomic data produced via other technologies (e.g., Nanostring, Fluidigm, and other gene expression technologies). This package is extremely well-documented and we encourage trainees to leverage these resources in parallel with the current training module when carrying out their own transcriptomics analyses in R:
+
+
++ [Bioconductor website](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
++ [Vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html)
++ [Manual](https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf)
++ Primary citation: Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: [25516281](https://pubmed.ncbi.nlm.nih.gov/25516281/)
+
+
+In brief, the basic calculations employed within DESeq2's underlying algorithms include the following:
+
+**1. Estimate size factors.**
+In the first step, size factors are estimated to help account for potential differences in the sequencing depth across samples. It is similar to a normalization parameter in the model.
+
+**2. Normalize count data.**
+DESeq2 employs different normalization algorithms depending on the parameters selected and the stage of analysis. The most commonly employed method is called the **median of ratios**, which takes into account sequencing depth and RNA composition, as described [here](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html). Specifically, these normalized values are calculated as counts divided by sample-specific size factors, where each size factor is the median ratio of that sample's gene counts relative to the per-gene geometric mean. DESeq2 then transforms these data using variance stabilization within the final statistical model. Because of these two steps, we prefer to export both the median-of-ratios normalized data and the variance-stabilization transformed data, to save in our records and to use when generating plots of expression levels for specific genes of interest. These steps are detailed below.
+
+**3. Estimate dispersion.**
+The dispersion estimate takes into account the relationship between the variance of an observed count and its mean value. It is similar to a variance parameter. In DESeq2, dispersion is estimated using a maximum likelihood and empirical Bayes approach.
+
+**4. Fit negative binomial generalized linear model (GLM).**
+Finally, a negative binomial model is fit for each gene using the design formula that will be described in the code that follows. The Wald test is performed to test whether log fold changes in expression (typically calculated as log(average exposed / average unexposed)) significantly differ from zero. Statistical p-values are reported from this test and also adjusted for multiple testing using the Benjamini-Hochberg procedure.
+
+Note that these calculations, among others, are embedded within the DESeq2 functions, so we do not need to code them ourselves. Instead, we just need to make sure that we set up the DESeq2 functions correctly, such that these calculations are carried out appropriately in our final transcriptomics analyses.
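+
+As a concrete illustration of step 2, the median-of-ratios size factor calculation can be sketched in base R. This is a simplification for intuition only; DESeq2's `estimateSizeFactors()` additionally handles zero counts and other edge cases, and the toy count matrix below is hypothetical:
+
+```r
+# Simplified median-of-ratios size factors (for intuition only)
+median_of_ratios <- function(cts) {
+  log_geo_means <- rowMeans(log(cts))        # log geometric mean per gene
+  apply(cts, 2, function(col) {
+    ratios <- log(col) - log_geo_means       # per-gene log ratio to the geometric mean
+    exp(median(ratios[is.finite(ratios)]))   # median ratio = that sample's size factor
+  })
+}
+
+# Toy matrix where sample S2 is sequenced at twice the depth of S1
+cts <- matrix(c(10, 20, 30, 20, 40, 60), nrow = 3,
+              dimnames = list(paste0("gene", 1:3), c("S1", "S2")))
+sf <- median_of_ratios(cts)
+sf                       # S1 ~0.71, S2 ~1.41
+sweep(cts, 2, sf, "/")   # normalized counts now match across the two samples
+```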
+
+
+#### Setting up the DESeq2 experiment
+Here we provide example script that is used to identify which genes are significantly differentially expressed in association with the example biomass smoke exposures, smoldering pine needles and flaming pine needles, as well as a positive inflammation control, LPS.
+
+First, we need to set-up the DESeq2 experiment:
+```{r 6-2-Omics-System-Biology-37, message=FALSE, warning=FALSE, error=FALSE}
+# Set up our experiment using our RUV adjusted count and phenotype data.
+# Our design indicates that our count data is dependent on the exposure condition (groups variable) and our factor of unwanted variation, and we have specified that there not be an intercept term through the use of '~0'
+dds <- DESeqDataSetFromMatrix(countData = counts(ruv_set), # Grabbing count data from the 'ruv_set' object
+ colData = pData(ruv_set), # Grabbing the phenotype data and corresponding factor of unwanted variation from the 'ruv_set' object
+ design = ~0+groups+W_1) # Setting up the statistical formula (see below)
+```
+
+For the formula design, we use a '~0' at the front to not include an intercept term, and then also account for the exposure condition (groups) and the previously calculated factors of unwanted variation (W_1) of the samples. Formula design is an important step and should be carefully considered for each individual analysis. Other resources, including official *DESeq2* documentation, are available for consultation regarding formula design, as the specifics of formula design are beyond the scope of this training module.
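+
+To see concretely what this kind of design expands to, it can help to inspect the model matrix for a toy dataset (the group labels and `W_1` values below are hypothetical):
+
+```r
+# With '~0', each exposure group receives its own indicator column (no shared
+# intercept), plus one column for the continuous unwanted-variation factor W_1
+toy <- data.frame(
+  groups = factor(c("Control", "Control", "Treated", "Treated")),
+  W_1    = c(0.10, -0.20, 0.30, 0.05)
+)
+model.matrix(~0 + groups + W_1, data = toy)
+#   groupsControl groupsTreated   W_1
+# 1             1             0  0.10
+# 2             1             0 -0.20
+# 3             0             1  0.30
+# 4             0             1  0.05
+```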
+
+It is worth noting that, by default, *DESeq2* will use the last variable in the design formula (`W_1` in this case) as the default variable to be output from the `results()` function. Additionally, if the variable is categorical, it will display results comparing the reference level to the last level of that variable. To get results for other variables, or to see other comparisons within a categorical variable, we can use the `contrast` parameter, which will be demonstrated below.
+
+
+#### Estimating size factors
+``` {r, message=FALSE, warning=FALSE, error=FALSE}
+# Estimate size factors from the dds object that was just created as the experiment above
+dds <- estimateSizeFactors(dds)
+sizeFactors(dds) # viewing the size factors
+```
+
+#### Calculating and exporting normalized counts
+
+Here, we extract normalized counts and variance stabilized counts.
+``` {r, message=FALSE, warning=FALSE, error=FALSE}
+# Extract normalized count data
+normcounts <- as.data.frame(counts(dds, normalized=TRUE))
+
+# Transforming normalized counts through variance stabilization
+vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
+vsd_matrix <- as.matrix(assay(vsd))
+```
+
+We could also export them using code such as:
+```{r 6-2-Omics-System-Biology-38, eval = FALSE}
+# Export data
+write.csv(normcounts, "Chapter_6/6_2_Omics_System_Biology/Module6_2_Output_NormalizedCounts.csv")
+write.csv(vsd_matrix, "Chapter_6/6_2_Omics_System_Biology/Module6_2_Output_VSDCounts.csv", row.names=TRUE)
+```
+
+
+#### Running the final DESeq2 experiment
+Here, we are finally ready to run the actual statistical comparisons (exposed vs control samples) to calculate fold changes and p-values that describe the degree to which each gene may or may not be altered at the expression level in association with treatment.
+
+For this example, we would like to run three different comparisons:
+
+(1) Smoldering Pine Needles vs. Control
+
+(2) Flaming Pine Needles vs. Control
+
+(3) LPS vs. Control
+
+These comparisons can easily be coded using a loop, as detailed below.
+
+Note that we have commented out the line of code for writing out the CSV because we do not need it for the rest of the module, but this could be used if you need to write out and view results in an external application such as Excel for supplementary materials.
+
+```{r 6-2-Omics-System-Biology-39, message=FALSE, warning=FALSE, error=FALSE}
+# Run experiment
+dds_run <- DESeq(dds, betaPrior=FALSE)
+
+# Loop through and extract and export results for all contrasts (treatments vs. control)
+for (trt in trt_groups){ # Iterate for each of the treatments listed in 'trt_groups'
+ cat(trt) # Print which treatment group we are on in the loop
+ res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups",trt,ctrl)) # Extract the results of the DESeq2 analysis specifically for the comparison of the treatment group for the current iteration of the loop with the control group
+ summary(res) # Print out a high-level summary of the results
+ ordered <- as.data.frame(res[order(res$padj),]) # Make a dataframe of the results and order them by adjusted p-value from lowest to highest
+ top10 <- head(ordered, n=10) # Make dataframe of the first ten rows of the ordered results
+ cat("\nThe 10 most significantly differentially expressed genes by adjusted p-value:\n\n")
+ print(top10) # View the first ten rows of the ordered results
+ pfilt.05 <- nrow(ordered %>% filter(padj<0.05)) # Get the number of genes that are significantly differentially expressed where padj < 0.05
+ cat("\nThe number of genes showing significant differential expression where padj < 0.05 is ", pfilt.05)
+ pfilt.10 <- nrow(ordered %>% filter(padj<0.1)) # Get the number of genes that are significantly differentially expressed where padj < 0.10
+ cat("\nThe number of genes showing significant differential expression where padj < 0.10 is ", pfilt.10,"\n\n")
+ # write.csv(ordered, paste0("Module6_2_Output_StatisticalResults_",trt ,".csv")) ## Export the full dataframe of ordered results as a csv
+}
+```
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can now answer **Environmental Health Question #5***: How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+:::
+
+:::answer
+**Answer:** 515 genes
+:::
+
+### Answer to Environmental Health Question 6
+:::question
+*With this, we can also now answer **Environmental Health Question #6***: How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+:::
+
+:::answer
+**Answer:** 679 genes
+:::
+
+### Answer to Environmental Health Question 7
+:::question
+*And, we can answer **Environmental Health Question #7***: How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
+:::
+
+:::answer
+**Answer:** 4,813 genes
+:::
+
+
+*Together, we find that exposure to both flaming and smoldering of pine needles caused substantial disruptions in gene expression profiles. LPS serves as a positive control for inflammation and produced the greatest transcriptomic response.*
+
+
+
+
+## Visualizing Statistical Results using MA Plots
+[MA plots](https://en.wikipedia.org/wiki/MA_plot) represent a common method of visualization that illustrates differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
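+
+For intuition, the M and A values for a single gene measured in two conditions can be computed directly (the counts below are hypothetical; note that the DESeq2-based plot further down instead places `baseMean` on the x-axis and the model-estimated log~2~ fold change on the y-axis):
+
+```r
+# M/A transform for one gene measured in two conditions (hypothetical counts)
+a <- 200                        # normalized count in condition 1
+b <- 50                         # normalized count in condition 2
+M <- log2(a) - log2(b)          # M: log ratio, equivalent to log2(a/b)
+A <- (log2(a) + log2(b)) / 2    # A: average log2 intensity
+c(M = M, A = A)                 # M = 2 (a 4-fold difference), A ~6.64
+```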
+
+Here, we leverage MA plots to show how log fold changes relate to expression levels. In these plots, the log fold change is plotted on the y-axis and expression values are plotted along the x-axis, and dots are colored according to statistical significance (using padj<0.05 as the statistical filter). Here we will generate an MA plot for Flaming Pine Needles.
+
+```{r 6-2-Omics-System-Biology-40, message=F, warning=F, error=F, fig.align='center'}
+
+res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups","PineNeedlesFlame_4h_Lung",ctrl)) # Re-extract the DESeq2 results for the flaming pine needles
+MA <- data.frame(res) # Make a preliminary dataframe of the flaming pine needle results
+MA_ns <- MA[ which(MA$padj>=0.05),] # Non-significant genes to plot
+MA_up <- MA[ which(MA$padj<0.05 & MA$log2FoldChange > 0),] # Significant up-regulated genes to plot
+MA_down <- MA[ which(MA$padj<0.05 & MA$log2FoldChange < 0),] #Significant down-regulated genes to plot
+
+ggplot(MA_ns, aes(x = baseMean, y = log2FoldChange)) + # Plot data with counts on x-axis and log2 fold change on y-axis
+ geom_point(color="gray75", size = .5) + # Set point size and color
+
+ geom_point(data = MA_up, color="firebrick", size=1, show.legend = TRUE) + # Plot the up-regulated significant genes
+ geom_point(data = MA_down, color="dodgerblue2", size=1, show.legend = TRUE) + # Plot down-regulated significant genes
+
+ theme_bw() + # Change theme of plot from gray to black and white
+
+ # We want to log10 transform x-axis for better visualizations
+ scale_x_continuous(trans = "log10", breaks=c(1,10,100, 1000, 10000, 100000, 1000000), labels=c("1","10","100", "1000", "10000", "100000", "1000000")) +
+ # We will bound y axis as well to better fit data while not leaving out too many points
+ scale_y_continuous(limits=c(-2, 2)) +
+
+ xlab("Expression (Normalized Count)") + ylab(expression(Log[2]*" Fold Change")) + # Add labels for axes
+ geom_hline(yintercept=0) # Add horizontal line at 0
+```
+
+An appropriate title for this figure could be:
+
+“**Figure X. MA plot of fold change in expression as a function of gene expression resulting from 4 hours of exposure to flaming pine needles in mouse lung tissues.** Significantly upregulated genes (log~2~FC > 0 and p adjust < 0.05) are shown in red and significantly downregulated genes (log~2~FC < 0 and p adjust < 0.05) are shown in blue. Genes not showing significant differential expression are displayed in gray.”
+
+
+## Visualizing Statistical Results using Volcano Plots
+
+Similar to MA plots, volcano plots provide visualizations of fold changes in expression from transcriptomic data. However, instead of plotting these values against expression, log fold change is plotted against (adjusted) p-values in volcano plots. Here, we use functions within the *[EnhancedVolcano package](https://www.rdocumentation.org/packages/EnhancedVolcano/versions/1.11.3/topics/EnhancedVolcano)* to generate a volcano plot for Flaming Pine Needles.
+
+Running the `EnhancedVolcano()` function to generate an example volcano plot:
+```{r 6-2-Omics-System-Biology-41, message=FALSE, warning=FALSE, error=FALSE, fig.align='center', out.width = 700, out.height = 580}
+Vol <- data.frame(res) # Dataset to use for plotting
+
+EnhancedVolcano(Vol,
+ lab = rownames(res), # Label significant genes from dataset (can be a column name)
+ x = 'log2FoldChange', # Column name in dataset with l2fc information
+ y = 'padj', # Column name in dataset with adjusted p-value information
+ ylab = "-Log(FDR-adjusted p value)", # Y-axis label
+ pCutoff= 0.05, # Set p-value cutoff
+ ylim=c(0,5), # Limit y-axis for better plot visuals
+ xlim=c(-2,2), # Limit x-axis (similar to in MA plot y-axis)
+ title= NULL, # Removing title
+ subtitle = NULL, # Removing subtitle
+ legendPosition = 'bottom') # Put legend on bottom
+```
+
+
+An appropriate title for this figure could be:
+
+“**Figure X. Volcano plot of lung genes resulting from 4 hours of exposure to flaming pine needles.** Genes are colored according to the significance of their differential expression in exposed vs. unexposed (vehicle control) samples, using the following statistical cut-offs: p adjust (multiple test corrected p-value) < 0.05 and fold change (FC) ±1.3 (log~2~FC ≥ ±0.3785).”
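+
+The log~2~FC cutoff quoted in this caption follows directly from the ±1.3 fold-change threshold:
+
+```r
+# A 1.3-fold change in either direction corresponds to the quoted log2 cutoff
+log2(1.3)
+# [1] 0.3785116
+```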
+
+
+
+## Interpreting Findings at the Systems Level through Pathway Enrichment Analysis
+
+Pathway enrichment analysis is a very helpful tool that can be applied to interpret transcriptomic changes of interest in terms of systems biology. In these types of analyses, gene lists of interest are used to identify biological pathways that include genes present in your dataset more often than expected by chance alone. There are many tools that can be used to carry out pathway enrichment analyses. Here, we are using the R package, *PIANO*, to carry out the statistical enrichment analysis based on the lists of genes we previously identified with differential expression associated with flaming pine needles exposure.
+
+In detail, the following input data are required to run *PIANO*:
+
+(1) Your background gene set, which represents all genes queried in your experiment (aka your 'gene universe')
+
+(2) The list of genes you are interested in evaluating pathway enrichment of; here, this represents the genes identified with significant differential expression associated with flaming pine needles
+
+(3) An underlying pathway dataset; here, we're using the KEGG PATHWAY Database ([KEGG](https://www.genome.jp/kegg/pathway.html)), summarized through the Molecular Signature Database ([MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/)) into pre-formatted input files (.gmt) ready for PIANO.
+
+*Let's organize these three required data inputs.*
+
+
+(1) Background gene set:
+```{r 6-2-Omics-System-Biology-42 }
+# First grab the rownames of the 'res' object (which was redefined as the DESeq2
+# results for flaming pine needles prior to MA plot generation), remove the
+# BioSpyder numeric identifier using gsub() while retaining the gene symbol,
+# and place these IDs into a new column within the 'res' object (saved as 'id')
+res$id <- gsub("_.*", "", rownames(res))
+
+# Because these IDs now contain duplicate gene symbols, we need to remove duplicates
+# One way to do this is to preferentially retain rows of data with the largest fold change (it doesn't really matter here, because we're just identifying unique genes within the background set)
+res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] # sort by id and reverse of abs(log2foldchange)
+res.ordered <- res.ordered[ !duplicated(res.ordered$id), ] # removing gene duplicates
+
+# Setting this as the background list
+Background <- toupper(as.character(res.ordered$id))
+Background[1:200] # viewing the first 200 genes in this background list
+```
+
+(2) The list of genes identified with significant differential expression associated with flaming pine needles:
+```{r 6-2-Omics-System-Biology-43 }
+# Similar to the above script, but starting with the res$id object
+# and filtering for genes with padj < 0.05
+
+res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] #sort by id and reverse of abs(log2FC)
+SigGenes <- toupper(as.character(res.ordered[which(res.ordered$padj<.05),"id"])) # pulling the genes with padj < 0.05
+SigGenes <- SigGenes[ !duplicated(SigGenes)] # removing gene duplicates
+
+length(SigGenes) # viewing the length of this significant gene list
+```
+
+Therefore, this gene set includes 488 *unique* genes significantly associated with the Flaming Pine Needles condition, based on padj<0.05.
+
+
+(3) The underlying KEGG pathway dataset.
+Note that this file was simply downloaded from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/), ready for upload as a .gmt file. Here, we use the `loadGSC()` function enabled through the *PIANO* package to upload and organize these pathways.
+```{r 6-2-Omics-System-Biology-44 }
+KEGG_Pathways <- loadGSC(file="Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData3_KEGGv7.gmt", type="gmt")
+
+length(KEGG_Pathways$gsc) # viewing the number of biological pathways contained in the database
+```
+This KEGG pathway database therefore includes 186 biological pathways available to query.
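+
+If you'd like a feel for the structure of this database, you can peek at a few of the pathway names and the member genes of one gene set (a quick optional sketch; `gsc` is the list of gene sets returned by `loadGSC()`):
+```{r 6-2-Omics-System-Biology-44b }
+# Viewing the names of the first few pathways in the gene set collection
+head(names(KEGG_Pathways$gsc))
+
+# Viewing the first few member genes of the first pathway
+head(KEGG_Pathways$gsc[[1]])
+```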
+
+
+With these data inputs ready, we can now run the pathway enrichment analysis. The enrichment statistic that is commonly employed through the *PIANO* package is based on a hypergeometric test, run through the `runGSAhyper()` function. This returns a p-value for each gene set, from which you can determine enrichment status.
+```{r 6-2-Omics-System-Biology-45, message=F, warning=F, error=F}
+# Running the piano function based on the hypergeometric statistic
+Results_GSA <- piano::runGSAhyper(genes=SigGenes, universe=Background,gsc=KEGG_Pathways, gsSizeLim=c(1,Inf), adjMethod = "fdr")
+
+# Pulling the pathway enrichment results into a separate dataframe
+PathwayResults <- as.data.frame(Results_GSA$resTab)
+
+# Viewing the top of these pathway enrichment results (which are not ordered at the moment)
+head(PathwayResults)
+
+```
+This dataframe therefore summarizes, for each pathway, the enrichment p-value, the FDR-adjusted p-value, the number of significant genes that intersect with genes in the pathway, etc.
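+
+To make the underlying statistic concrete, the hypergeometric enrichment p-value for a single pathway can be computed by hand with base R's `phyper()` function. This is an illustrative sketch using hypothetical counts, not values taken from the results above:
+```{r 6-2-Omics-System-Biology-45b }
+# Suppose a pathway contains 60 genes from a background universe of 2000 genes,
+# and 30 of our 488 significant genes fall within that pathway
+# One-sided hypergeometric (over-representation) p-value:
+phyper(30 - 1, 60, 2000 - 60, 488, lower.tail = FALSE)
+```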
+
+
+With these results, let's identify which pathways meet a statistical enrichment p-value filter of 0.05:
+```{r 6-2-Omics-System-Biology-46 }
+SigPathways <- PathwayResults[which(PathwayResults$`p-value` < 0.05),]
+rownames(SigPathways)
+```
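+
+When reporting these results, it can also be helpful to order the enriched pathways from most to least significant and view the top hits (a small optional step):
+```{r 6-2-Omics-System-Biology-46b }
+# Ordering the significantly enriched pathways by enrichment p-value
+SigPathways <- SigPathways[order(SigPathways$`p-value`), ]
+
+# Viewing the most significantly enriched pathways
+head(rownames(SigPathways))
+```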
+
+
+### Answer to Environmental Health Question 8
+:::question
+*With this, we can now answer **Environmental Health Question #8***: What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?
+:::
+
+:::answer
+**Answer:** Biological pathways involved in cardiopulmonary function (e.g., arrhythmogenic right ventricular cardiomyopathy, hypertrophic cardiomyopathy, vascular smooth muscle contraction), carcinogenesis signaling (e.g., Wnt signaling pathway, hedgehog signaling pathway), and hormone signaling (e.g., Gnrh signaling pathway), among others.
+:::
+
+
+
+## Concluding Remarks
+
+In this module, users are guided through the uploading, organization, QA/QC, statistical analysis, and systems level analysis of an example -omics dataset based on transcriptomic responses to biomass burn scenarios, representing environmental exposure scenarios of growing concern worldwide. It is worth noting that the methods described herein represent a fraction of the approaches and tools that can be leveraged in the analysis of -omics datasets, and methods should be tailored to the goals of each individual analysis. For additional example research projects that have leveraged -omics and systems biology to address environmental health questions, see the following select relevant publications:
+
+
+**Genomic publications evaluating gene-environment interactions and relations to disease etiology:**
+
++ Balik-Meisner M, Truong L, Scholl EH, La Du JK, Tanguay RL, Reif DM. Elucidating Gene-by-Environment Interactions Associated with Differential Susceptibility to Chemical Exposure. Environ Health Perspect. 2018 Jun 28;126(6):067010. PMID: [29968567](https://pubmed.ncbi.nlm.nih.gov/29968567/).
+
++ Ward-Caviness CK, Neas LM, Blach C, Haynes CS, LaRocque-Abramson K, Grass E, Dowdy ZE, Devlin RB, Diaz-Sanchez D, Cascio WE, Miranda ML, Gregory SG, Shah SH, Kraus WE, Hauser ER. A genome-wide trans-ethnic interaction study links the PIGR-FCAMR locus to coronary atherosclerosis via interactions between genetic variants and residential exposure to traffic. PLoS One. 2017 Mar 29;12(3):e0173880. PMID: [28355232](https://pubmed.ncbi.nlm.nih.gov/28355232/).
+
+
+**Transcriptomic publications evaluating gene expression responses to environmental exposures and relations to disease etiology:**
+
++ Chang Y, Rager JE, Tilton SC. Linking Coregulated Gene Modules with Polycyclic Aromatic Hydrocarbon-Related Cancer Risk in the 3D Human Bronchial Epithelium. Chem Res Toxicol. 2021 Jun 21;34(6):1445-1455. PMID: [34048650](https://pubmed.ncbi.nlm.nih.gov/34048650/).
+
++ Chappell GA, Rager JE, Wolf J, Babic M, LeBlanc KJ, Ring CL, Harris MA, Thompson CM. Comparison of Gene Expression Responses in the Small Intestine of Mice Following Exposure to 3 Carcinogens Using the S1500+ Gene Set Informs a Potential Common Adverse Outcome Pathway. Toxicol Pathol. 2019 Oct;47(7):851-864. PMID: [31558096](https://pubmed.ncbi.nlm.nih.gov/31558096/).
+
++ Manuck TA, Eaves LA, Rager JE, Fry RC. Mid-pregnancy maternal blood nitric oxide-related gene and miRNA expression are associated with preterm birth. Epigenomics. 2021 May;13(9):667-682. PMID: [33890487](https://pubmed.ncbi.nlm.nih.gov/33890487/).
+
+
+**Epigenomic publications** evaluating microRNA, CpG methylation, and/or histone methylation responses to environmental exposures and relations to disease etiology:
+
++ Chappell GA, Rager JE. Epigenetics in chemical-induced genotoxic carcinogenesis. Curr Opinion Toxicol. [2017 Oct; 6:10-17](https://www.sciencedirect.com/science/article/abs/pii/S2468202017300396).
+
++ Rager JE, Bailey KA, Smeester L, Miller SK, Parker JS, Laine JE, Drobná Z, Currier J, Douillet C, Olshan AF, Rubio-Andrade M, Stýblo M, García-Vargas G, Fry RC. Prenatal arsenic exposure and the epigenome: altered microRNAs associated with innate and adaptive immune signaling in newborn cord blood. Environ Mol Mutagen. 2014 Apr;55(3):196-208. PMID: [24327377](https://pubmed.ncbi.nlm.nih.gov/24327377/).
+
++ Rager JE, Bauer RN, Müller LL, Smeester L, Carson JL, Brighton LE, Fry RC, Jaspers I. DNA methylation in nasal epithelial cells from smokers: identification of ULBP3-related effects. Am J Physiol Lung Cell Mol Physiol. 2013 Sep 15;305(6):L432-8. PMID: [23831618](https://pubmed.ncbi.nlm.nih.gov/23831618/).
+
++ Smeester L, Rager JE, Bailey KA, Guan X, Smith N, García-Vargas G, Del Razo LM, Drobná Z, Kelkar H, Stýblo M, Fry RC. Epigenetic changes in individuals with arsenicosis. Chem Res Toxicol. 2011 Feb 18;24(2):165-7. PMID: [21291286](https://pubmed.ncbi.nlm.nih.gov/21291286/).
+
+
+**Metabolomic publications** evaluating changes in the metabolome in response to environmental exposures and involved in disease etiology:
+
++ Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: 24413286; PMCID: [PMC3948040](https://pubmed.ncbi.nlm.nih.gov/24413286/).
+
++ Manuck TA, Lai Y, Ru H, Glover AV, Rager JE, Fry RC, Lu K. Metabolites from midtrimester plasma of pregnant patients at high risk for preterm birth. Am J Obstet Gynecol MFM. 2021 Jul;3(4):100393. PMID: [33991707](https://pubmed.ncbi.nlm.nih.gov/33991707/).
+
+
+**Microbiome publications** evaluating changes in microbiome profiles in relation to the environment and human disease:
+
++ Chi L, Bian X, Gao B, Ru H, Tu P, Lu K. Sex-Specific Effects of Arsenic Exposure on the Trajectory and Function of the Gut Microbiome. Chem Res Toxicol. 2016 Jun 20;29(6):949-51. PMID: [27268458](https://pubmed.ncbi.nlm.nih.gov/27268458/).
+
++ Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012 Mar 13;13(4):260-70. PMID: [22411464](https://pubmed.ncbi.nlm.nih.gov/22411464/).
+
++ Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: [24413286](https://pubmed.ncbi.nlm.nih.gov/24413286/).
+
+
+
+**Exposome publications** evaluating changes in chemical signatures in relation to the environment and human disease:
+
++ Rager JE, Strynar MJ, Liang S, McMahen RL, Richard AM, Grulke CM, Wambaugh JF, Isaacs KK, Judson R, Williams AJ, Sobus JR. Linking high resolution mass spectrometry data with exposure and toxicity forecasts to advance high-throughput environmental monitoring. Environ Int. 2016 Mar;88:269-280. PMID: [26812473](https://pubmed.ncbi.nlm.nih.gov/26812473/).
+
++ Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A. The blood exposome and its role in discovering causes of disease. Environ Health Perspect. 2014 Aug;122(8):769-74. PMID: [24659601](https://pubmed.ncbi.nlm.nih.gov/24659601/).
+
++ Viet SM, Falman JC, Merrill LS, Faustman EM, Savitz DA, Mervish N, Barr DB, Peterson LA, Wright R, Balshaw D, O'Brien B. Human Health Exposure Analysis Resource (HHEAR): A model for incorporating the exposome into health studies. Int J Hyg Environ Health. 2021 Jun;235:113768. PMID: [34034040](https://pubmed.ncbi.nlm.nih.gov/34034040/).
+
+
+
+
+
+
+:::tyk
+Using "Module6_2_TYKInput1.csv" (gene counts) and "Module6_2_TYKInput2.csv" (sample info) datasets, which have already been run through the QC process described in this module and are ready for analysis:
+
+1. Conduct a differential expression analysis associated with "Season" using DESeq2. (Don't worry about including any covariates or using RUV).
+2. Find the number of significant differentially expressed genes associated with "Season", at the .05 level.
+:::
diff --git a/Chapter_6/Module6_2_Input/Module6_2_Image1.png b/Chapter_6/6_2_Omics_System_Biology/Module6_2_Image1.png
similarity index 100%
rename from Chapter_6/Module6_2_Input/Module6_2_Image1.png
rename to Chapter_6/6_2_Omics_System_Biology/Module6_2_Image1.png
diff --git a/Chapter_6/Module6_2_Input/Module6_2_Image2.png b/Chapter_6/6_2_Omics_System_Biology/Module6_2_Image2.png
similarity index 100%
rename from Chapter_6/Module6_2_Input/Module6_2_Image2.png
rename to Chapter_6/6_2_Omics_System_Biology/Module6_2_Image2.png
diff --git a/Chapter_6/Module6_2_Input/Module6_2_InputData1_GeneCounts.csv b/Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData1_GeneCounts.csv
similarity index 100%
rename from Chapter_6/Module6_2_Input/Module6_2_InputData1_GeneCounts.csv
rename to Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData1_GeneCounts.csv
diff --git a/Chapter_6/Module6_2_Input/Module6_2_InputData2_SampleInfo.csv b/Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData2_SampleInfo.csv
similarity index 100%
rename from Chapter_6/Module6_2_Input/Module6_2_InputData2_SampleInfo.csv
rename to Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData2_SampleInfo.csv
diff --git a/Chapter_6/Module6_2_Input/Module6_2_InputData3_KEGGv7.gmt b/Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData3_KEGGv7.gmt
similarity index 100%
rename from Chapter_6/Module6_2_Input/Module6_2_InputData3_KEGGv7.gmt
rename to Chapter_6/6_2_Omics_System_Biology/Module6_2_InputData3_KEGGv7.gmt
diff --git a/Chapter_6/6_3_Mixtures_Analysis/6_3_Mixtures_Analysis.Rmd b/Chapter_6/6_3_Mixtures_Analysis/6_3_Mixtures_Analysis.Rmd
new file mode 100644
index 0000000..50e2aba
--- /dev/null
+++ b/Chapter_6/6_3_Mixtures_Analysis/6_3_Mixtures_Analysis.Rmd
@@ -0,0 +1,501 @@
+
+# 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation
+
+This training module was developed by Dr. Lauren Eaves, Dr. Kyle Roell, and Dr. Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+
+## Introduction to Training Module
+
+Historically, toxicology and epidemiology studies have largely focused on analyzing relationships between one chemical and one outcome at a time. This remains important for identifying the degree to which a single chemical exposure is associated with a disease outcome (e.g., the [UNC Superfund Research Program's](https://sph.unc.edu/superfund-pages/srp/) focus on inorganic arsenic exposure and its influence on metabolic disease). However, we are exposed, every day, to many different stressors in our environment. It is therefore critical to deconvolute which co-occurring stressors (i.e., mixtures) in our environment impact human health! The field of mixtures research continues to grow to address this need, with the goal of developing analysis methods that better capture the mixture of exposures humans experience in real life. In this module, we will provide an overview of mixtures analysis methods and demonstrate how to use one of these methods, quantile g-computation, to analyze chemical mixtures in a large geospatial epidemiologic study.
+
+
+## Overview of Mixtures Analysis
+
+### Mixtures Methods Relevance and Challenges
+
+**Mixtures approaches are recently becoming more routine in environmental health because methodological advancements are just now making mixtures research more feasible.** These advancements parallel the following:
+
++ Advances in the ability to measure many different chemicals (e.g., through suspect screening and non-targeted chemical analysis approaches) and stressors (e.g., through improved collection and storage of survey data and clinical data) in our environment
++ Improvements in data science to organize, store, and analyze big data
++ Developments in statistical methodologies to parse relationships within these data
+
+Though statistical methodologies are still evolving, we will be discussing our current knowledge in this module.
+
+**Some challenges that data analysts may experience when analyzing data from mixtures studies include the following:**
+
+1. Size of mixture:
++ As the number of components evaluated increases, your available analysis methods and statistical power may decrease
+
+2. Correlated data structure:
++ Statistical challenge of collinearity: If data include large amounts of collinearity, this may dampen the observed effects of components that are highly correlated with (e.g., commonly co-occur with) other components
++ Methodological challenge of co-occurring contaminant confounding: Co-occurring contaminant confounding may make it difficult to discern the true driver of an observed effect.
+
+3. Data analysis method selection:
++ There are many different methods to choose from!
++ A critical rule to address this challenge is to, first and foremost, *lay out your study's question*. This question will then help guide your method selection, as discussed below.
+
+
+### Overview of Mixtures Methods
+
+There are many methods that can be implemented to elucidate relationships between individual chemicals/chemical groups in complex mixtures and their resulting toxicity/health effects. Some of the more common methods used in mixtures analyses, as identified by our team, are summarized in the figure below according to potential questions that could be asked in a study. Two of the methods, quantile-based g-computation (qgcomp) and Bayesian kernel machine regression (BKMR), are highlighted as example mixtures scripted activities (qgcomp in this script and BKMR in Mixtures Methods 2). Throughout TAME 2.0 training materials, other methods are included, such as Principal Component Analysis (PCA), k-means clustering, hierarchical clustering, and predictive modeling / machine learning (e.g., random forest modeling and variable selection). The following figure provides an overview of the types of questions that can be asked regarding mixtures and models that are commonly used to answer these questions:
+
+```{r 6-3-Mixtures-Analysis-1, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_6/6_3_Mixtures_Analysis/Module6_3_Mixtures_Methods_Overview.png")
+```
+
+In this module, we will be using quantile-based g-computation to analyze our data. This method estimates a total mixture effect, as opposed to the individual effects of mixture components. It is similar to previous, popular methods such as weighted quantile sum (WQS) regression, but does not assume directional homogeneity. It also provides access to models for non-additive and non-linear effects of the individual mixture components and the overall mixture. Additionally, it runs very quickly and is less computationally demanding than other methods, making it an accessible option for those with limited computational resources.
+
+
+## Introduction to Example Data
+
+This script outlines single-contaminant (logistic regression) and multi-contaminant (quantile g-computation (qgcomp)) modeling approaches. The workflow follows the steps used to generate results published in [Eaves et al. 2023](https://pubmed.ncbi.nlm.nih.gov/37845729/). This study examined the relationship between metals in private well water and the risk of preterm birth. The study population was all singleton, non-anomalous births in NC between 2003 and 2015. Pregnancies were assigned tract-level metal exposure based on maternal residence at delivery. Relationships with single-metal exposures were examined with logistic regression, and metal mixtures were examined with qgcomp.
+
+For more info on qgcomp, see [Keil et al. 2020](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP5838) and the associated [vignette](https://cran.r-project.org/web/packages/qgcomp/vignettes/qgcomp-vignette.html).
+
+Note that for educational purposes, in this example we are using a randomly sampled dataset of 100,000 births, rather than the full dataset of >1.3 million (i.e., less than 10% of the full study population). Therefore, the actual results of the analysis outlined below do not match the results published in the paper.
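+
+For reference, a random subsample of this kind can be drawn reproducibly with base R, as sketched below (`full_cohort` is a hypothetical placeholder for the full dataset, which is not distributed with this module):
+```{r 6-3-Mixtures-Analysis-subsample, eval = FALSE}
+# Hypothetical sketch: drawing a reproducible random subsample of 100,000 births
+set.seed(1234) # setting a seed makes the random draw reproducible
+cohort <- full_cohort[sample(nrow(full_cohort), 100000), ]
+```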
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following questions:
+
+1. What is the risk of preterm birth associated with exposure to each of arsenic, lead, cadmium, chromium, manganese, copper and zinc via private well water individually?
+
+2. What is the risk of preterm birth associated with combined exposure to arsenic, lead, cadmium, chromium, manganese, copper and zinc (i.e., a mixture) via private well water?
+
+3. Which of these chemicals has the strongest effect on preterm birth risk?
+
+4. Which of these chemicals increases the risk of preterm birth and which decreases the risk of preterm birth?
+
+### Workspace Preparation
+
+Install packages as needed, then load the following packages:
+```{r 6-3-Mixtures-Analysis-2, message = FALSE}
+#load packages
+library(tidyverse)
+library(ggplot2)
+library(knitr)
+library(yaml)
+library(rmarkdown)
+library(broom)
+library(ggpubr)
+library(qgcomp)
+```
+
+
+Optionally, you can also create a current date variable to name output files, and create an output folder.
+```{r 6-3-Mixtures-Analysis-3, eval = FALSE}
+# Create a current date variable to name output files
+cur_date <- str_replace_all(Sys.Date(),"-","")
+
+#Create an output folder
+Output_Folder <- ("Module6_3_Output/")
+```
+
+### Data Import
+```{r 6-3-Mixtures-Analysis-4 }
+cohort <- read.csv(file="Chapter_6/6_3_Mixtures_Analysis/Module6_3_InputData.csv")
+colnames(cohort)
+head(cohort)
+```
+
+Note: there are many steps prior to the modeling steps outlined below. These are being skipped for educational purposes. Additional steps include assessment of normality and transformations as needed, generation of a demographics table and assessing for missing data, imputation of missing data if needed, visualizing trends and distributions in the data, functional form assessments, decisions regarding what confounders to include etc.
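+
+As a minimal sketch of a few of those skipped steps, quick checks like the following can be run before modeling (using variable names from the dataset imported above):
+```{r 6-3-Mixtures-Analysis-4b }
+# Counting missing values across the first ten columns
+colSums(is.na(cohort))[1:10]
+
+# Viewing the outcome distribution
+table(cohort$preterm)
+
+# Viewing the distribution of one continuous exposure
+summary(cohort$Arsenic.Mean_avg)
+```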
+
+The following are the metals of interest: arsenic, lead, cadmium, chromium, manganese, copper, zinc.
+
+For each metal there are three exposure variables:
+
+1. `[metal]_perc`: 0: less than or equal to the 50th percentile, 1: above the 50th percentile and less than or equal to the 90th percentile, 3: above the 90th percentile
+2. `[metal]_limit`: 0: less than 25% of well water tests for a given metal exceeded the EPA regulatory standard, 1: 25% or more of well water tests for a given metal exceeded the EPA regulatory standard
+3. `[metal].Mean_avg`: the mean concentration of the metal in the census tract (ppb).
+Please see the Eaves et al. 2023 paper linked above for further information on these variables.
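+
+For illustration, a percentile-based category like `Arsenic_perc` could be derived from the continuous tract-level mean roughly as follows (a hedged sketch only; the published exposure variables were constructed upstream of this module, and this demo object is not used in later steps):
+```{r 6-3-Mixtures-Analysis-percdemo }
+# Deriving the 50th and 90th percentile cutpoints for arsenic
+cutpts <- quantile(cohort$Arsenic.Mean_avg, probs = c(0.5, 0.9), na.rm = TRUE)
+
+# Binning the continuous exposure into three percentile-based categories
+Arsenic_perc_demo <- cut(cohort$Arsenic.Mean_avg,
+                         breaks = c(-Inf, cutpts, Inf),
+                         labels = c("<=50th", "50th-90th", ">90th"))
+table(Arsenic_perc_demo)
+```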
+
+Other variables of interest (outcome and covariates) in this dataset:
+
+ * `preterm`: 0= 37 weeks gestational age or greater, 1= less than 37 weeks gestational age
+ * `mage`: maternal age in years, continuous
+ * `sex`: sex of baby at birth: 1=M, 2=F
+ * `racegp`: maternal race ethnicity: 1=white non-Hispanic, 2=Black non-Hispanic, 3=Hispanic, 4=Asian/Pacific Islander, 5=American Indian, 6=other/unknown
+ * `smoke`: maternal smoking in pregnancy: 0=non-smoker, 1=smoker
+ * `season_conep`: season of conception: 1=winter (Dec, Jan, Feb), 2=spring (Mar, Apr, May), 3=summer (June, Jul, Aug), 4=fall (Sept, Oct, Nov)
+ * `mothed`: mother's education level (see the Eaves et al. 2023 paper linked above for coding details)
+
+Before modeling, we need to ensure each variable is encoded as the correct type (factor vs. numeric):
+```{r 6-3-Mixtures-Analysis-5 }
+#outcome variable
+cohort <- cohort %>%
+  mutate(preterm = as.factor(preterm))
+cohort$preterm <- relevel(cohort$preterm, ref = "0")
+
+#exposure variables
+cohort <- cohort %>%
+ mutate(Arsenic_perc=as.factor(Arsenic_perc)) %>%
+ mutate(Cadmium_perc=as.factor(Cadmium_perc)) %>%
+ mutate(Chromium_perc=as.factor(Chromium_perc)) %>%
+ mutate(Copper_perc=as.factor(Copper_perc)) %>%
+ mutate(Lead_perc=as.factor(Lead_perc)) %>%
+ mutate(Manganese_perc=as.factor(Manganese_perc)) %>%
+ mutate(Zinc_perc=as.factor(Zinc_perc)) %>%
+ mutate(Arsenic_limit=as.factor(Arsenic_limit)) %>%
+ mutate(Cadmium_limit=as.factor(Cadmium_limit)) %>%
+ mutate(Chromium_limit=as.factor(Chromium_limit)) %>%
+ mutate(Copper_limit=as.factor(Copper_limit)) %>%
+ mutate(Lead_limit=as.factor(Lead_limit)) %>%
+ mutate(Manganese_limit=as.factor(Manganese_limit)) %>%
+ mutate(Zinc_limit=as.factor(Zinc_limit))
+
+
+#ensure covariates are in correct variable type form
+cohort <- cohort %>%
+ mutate(racegp = as.factor(racegp)) %>%
+ mutate(mage = as.numeric(mage)) %>%
+ mutate(mage_sq = as.numeric(mage_sq)) %>%
+ mutate(smoke = as.numeric(smoke)) %>%
+ mutate(season_concep = as.factor(season_concep)) %>%
+ mutate(mothed = as.numeric(mothed)) %>%
+ mutate(Nitr_perc = as.numeric(Nitr_perc)) %>%
+ mutate(sex = as.factor(sex))%>%
+ mutate(pov_perc = as.factor(pov_perc))
+
+```
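+
+As an aside, the repeated `mutate()` calls above can be written more concisely with `dplyr::across()`; the following is an equivalent, optional alternative for the metal exposure variables:
+```{r 6-3-Mixtures-Analysis-5b, eval = FALSE}
+# Converting all metal _perc and _limit variables to factors in one step
+metals <- c("Arsenic","Cadmium","Chromium","Copper","Lead","Manganese","Zinc")
+cohort <- cohort %>%
+  mutate(across(all_of(c(paste0(metals, "_perc"), paste0(metals, "_limit"))), as.factor))
+```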
+
+#### Fit adjusted logistic regression models for each metal, for each categorical variable
+
+First, we will fit an adjusted logistic regression model for each metal, for each categorical variable, to demonstrate a variable-by-variable approach before diving into mixtures methods. Note that there are different regression techniques (linear and logistic regression are covered in another TAME module) and that here we will start with the percentage-based (`_perc`) variables.
+```{r 6-3-Mixtures-Analysis-6, message=F, warning=F, error=F}
+
+metals <- c("Arsenic","Cadmium","Chromium", "Copper","Lead","Manganese","Zinc")
+
+for (i in 1:length(metals)) {
+ metal <- metals[[i]]
+ metal <- as.name(metal)
+ print(metal)
+
+ print(is.factor(eval(parse(text = paste0("cohort$",metal,"_perc"))))) #check that metal var is a factor
+
+ mod <- glm(preterm ~ eval(parse(text = paste0(metal,"_perc"))) + mage + mage_sq+ racegp + smoke + season_concep + mothed + Nitr_perc + pov_perc, family=binomial, data=cohort)
+
+ mod_tid <- tidy(mod, conf.int=TRUE, conf.level=0.95) %>%
+ mutate(model_name=paste0(metal,"_adj_perc")) %>%
+ mutate(OR = exp(estimate)) %>%
+ mutate(OR.conf.high = exp(conf.high)) %>%
+ mutate(OR.conf.low = exp(conf.low))
+
+ mod_tid[2,1] <- paste0(metal,"_perc_50to90")
+ mod_tid[3,1] <- paste0(metal,"_perc_over90")
+
+ plot <- mod_tid %>%
+ filter(grepl('perc_', term))%>%
+ ggplot(aes(OR, term, xmin = OR.conf.low, xmax = OR.conf.high, height = 0)) +
+ geom_point() +
+ scale_x_continuous(trans="log10")+
+ geom_errorbarh()
+
+ assign(paste0(metal,"_adj_perc"),mod_tid)
+ assign(paste0(metal,"_adj_perc_plot"),plot)
+
+}
+
+```
+
+Plot the results:
+```{r 6-3-Mixtures-Analysis-7, message=F, warning=F, error=F, fig.align='center'}
+
+
+perc_plots <- ggarrange(Arsenic_adj_perc_plot,
+ Cadmium_adj_perc_plot,
+ Chromium_adj_perc_plot,
+ Copper_adj_perc_plot)
+plot(perc_plots)
+
+perc_plots1 <- ggarrange(Lead_adj_perc_plot,
+ Manganese_adj_perc_plot,
+ Zinc_adj_perc_plot)
+plot(perc_plots1)
+```
+
+Save the plots:
+```{r 6-3-Mixtures-Analysis-8, eval = FALSE}
+tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_percplots_1.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
+plot(perc_plots)
+dev.off()
+
+tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_percplots_2.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
+plot(perc_plots1)
+dev.off()
+
+```
+
+We can also run the analysis using limit variables:
+```{r 6-3-Mixtures-Analysis-9, message=F, warning=F, error=F, fig.align='center'}
+
+ for (i in 1:length(metals)) {
+ metal <- metals[[i]]
+ metal <- as.name(metal)
+ print(metal)
+
+ print(is.factor(eval(parse(text = paste0("cohort$",metal,"_limit"))))) #check that metal var is a factor
+
+ mod <- glm(preterm ~ eval(parse(text = paste0(metal,"_limit")))+ mage + mage_sq+ racegp + smoke + season_concep + mothed + Nitr_perc + pov_perc, family=binomial, data=cohort)
+
+ mod_tid <- tidy(mod, conf.int=TRUE, conf.level=0.95) %>%
+ mutate(model_name=paste0(metal,"_adj_limit")) %>%
+ mutate(OR = exp(estimate)) %>%
+ mutate(OR.conf.high = exp(conf.high)) %>%
+ mutate(OR.conf.low = exp(conf.low))
+
+ mod_tid[2,1] <- paste0(metal,"_limit_over25perc")
+
+ plot <- mod_tid %>%
+ filter(grepl('limit', term))%>%
+ ggplot(aes(OR, term, xmin = OR.conf.low, xmax = OR.conf.high, height = 0)) +
+ geom_point() +
+ scale_x_continuous(trans="log10")+
+ geom_errorbarh()
+
+ assign(paste0(metal,"_adj_limit"),mod_tid)
+ assign(paste0(metal,"_adj_limit_plot"),plot)
+
+}
+```
+Note: you will get the following warning for some of the models:
+"Warning: glm.fit: fitted probabilities numerically 0 or 1".
+
+This occurs because, given the variability in the exposure data, a larger sample size would ideally be used (as noted above, the analysis this example draws from was completed on >1.3 million observations).
+
+Plot the results:
+```{r 6-3-Mixtures-Analysis-10, message=F, warning=F, error=F, fig.align='center'}
+limit_plots <- ggarrange(Arsenic_adj_limit_plot,
+ Cadmium_adj_limit_plot,
+ Chromium_adj_limit_plot,
+ Copper_adj_limit_plot)
+
+plot(limit_plots)
+
+limit_plots1 <- ggarrange(Lead_adj_limit_plot,
+ Manganese_adj_limit_plot,
+ Zinc_adj_limit_plot)
+
+plot(limit_plots1)
+```
+
+Save the plots:
+```{r 6-3-Mixtures-Analysis-11, eval = FALSE}
+tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_limitplots1.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
+plot(limit_plots)
+dev.off()
+
+tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_singlemetal_adjusted_models_limitplots2.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
+plot(limit_plots1)
+dev.off()
+```
+
+Merge all of the logistic regression model results. This is the data frame that you could export for supplementary material or to view the results in Excel.
+```{r 6-3-Mixtures-Analysis-12, message=F, warning=F, error=F}
+#merge all model output
+results_df <- rbind(Arsenic_adj_perc, Arsenic_adj_limit,
+ Cadmium_adj_perc, Cadmium_adj_limit,
+ Chromium_adj_perc, Chromium_adj_limit,
+ Copper_adj_perc, Copper_adj_limit,
+ Lead_adj_perc, Lead_adj_limit,
+ Manganese_adj_perc, Manganese_adj_limit,
+ Zinc_adj_perc, Zinc_adj_limit)
+```
+
+To select only the coefficients related to the primary exposures:
+```{r 6-3-Mixtures-Analysis-13 }
+results_df <- results_df %>% filter(str_detect(term, 'limit|50to90|over90'))
+```
+
+This file outputs the coefficients and odds ratios (ORs) of the logistic regression models all together:
+
++ The ORs associated with [metal]_perc_50to90 can be interpreted as the odds of preterm birth among individuals in the 50th to 90th percentile of [metal] exposure compared to those below the 50th percentile.
++ The ORs associated with [metal]_perc_over90 can be interpreted as the odds of preterm birth among individuals above the 90th percentile of [metal] exposure compared to those below the 50th percentile.
++ The ORs associated with [metal]_limit_over25perc can be interpreted as the odds of preterm birth among individuals living in census tracts where 25% or more of well water tests exceeded the EPA standard for [metal], versus tracts where less than 25% of tests exceeded the standard.
+
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer also **Environmental Health Question #1***: What is the risk of preterm birth associated with exposure to each of arsenic, lead, cadmium, chromium, manganese, copper and zinc via private well water individually?
+:::
+
+:::answer
+**Answer**: Using the interpretation guides described in the prior paragraph and the "_NCbirths_pretermbirth_singlemetal_adjusted_models.csv" file, you can answer this question. For example, for cadmium, compared to individuals residing in census tracts with cadmium below the 50th percentile, those residing in tracts with cadmium between the 50th and 90th percentile had a 7% increase in the adjusted odds of PTB (aOR 1.07 (95% CI: 1.00, 1.14)), and those in tracts with cadmium above the 90th percentile had an 8% increase in the adjusted odds of PTB (aOR 1.08 (95% CI: 0.97, 1.20)). For lead (note the EPA treatment technique action level = 15 ppb), compared to individuals in tracts with less than 25% of tests exceeding this standard, individuals residing in census tracts where 25% or more of tests exceeded it had 1.23 (95% CI: 0.81, 1.81) times the adjusted odds of preterm birth. IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example was conducted on a smaller, subsetted dataset.
+:::
+
+While the single-contaminant models provide useful information, they cannot inform us of the effect of multiple simultaneous exposures or account for co-occurring contaminant confounding. Therefore, we want to utilize quantile g-computation to assess mixtures.
+
+## Mixtures Model with Standard qgcomp
+```{r 6-3-Mixtures-Analysis-14, message=F, warning=F, error=F}
+#list of exposure variables
+Xnm <- c('Arsenic.Mean_avg', 'Cadmium.Mean_avg', 'Lead.Mean_avg', 'Manganese.Mean_avg', 'Chromium.Mean_avg', 'Copper.Mean_avg', 'Zinc.Mean_avg')
+#list of covariates
+covars = c('mage','mage_sq','racegp','smoke','season_concep','mothed','Nitr_perc','pov_perc')
+
+#fit adjusted model
+PTB_adj_ppb <- qgcomp.noboot(preterm~.,
+ expnms=Xnm, dat=cohort[,c(Xnm,covars,'preterm')], family=binomial(), q=4)
+
+```
+
+In plain English, `preterm~.` says to fit a model with preterm (1/0) as the dependent variable and all other variables in the dataset (`.`) as the independent variables (exposures and covariates). `expnms=Xnm` specifies that the mixture of exposures is given by the vector `Xnm`, defined above. `dat=cohort[,c(Xnm,covars,'preterm')]` specifies that the dataset used to fit this model includes all columns of the cohort dataset listed in the `Xnm` and `covars` vectors, plus the `preterm` variable. `family=binomial()` indicates that the outcome is binary and therefore a logistic regression model will be fit. `q=4` breaks the exposures into quartiles; other options would be q=3 for tertiles, q=5 for quintiles, and so forth.
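+
+Conceptually, `q=4` converts each exposure to quartile scores of 0-3 before model fitting. The following base-R sketch illustrates that idea for one exposure (for intuition only; this is not how the package internals are implemented):
+```{r 6-3-Mixtures-Analysis-14b }
+# Binning one exposure into quartiles scored 0-3, mirroring what q=4 does
+# unique() guards against non-distinct quantile breaks when values are tied
+qcuts <- unique(quantile(cohort$Arsenic.Mean_avg, probs = 0:4/4, na.rm = TRUE))
+q_demo <- cut(cohort$Arsenic.Mean_avg, breaks = qcuts,
+              labels = FALSE, include.lowest = TRUE) - 1
+table(q_demo)
+```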
+
+This is a summary of the qgcomp model output:
+```{r 6-3-Mixtures-Analysis-15, message=F, warning=F, error=F}
+PTB_adj_ppb
+```
+This output can be interpreted as:
+
+ * Cadmium, chromium, manganese, and zinc had positive effects, meaning they increased the risk of preterm birth. Arsenic, copper, and lead had negative effects, meaning they reduced the risk of preterm birth.
+ * The total effect of all positively acting mixture components is given by the sum of the positive coefficients (0.0969); the total effect of all negatively acting mixture components is given by the sum of the negative coefficients (-0.0532).
+ * The numbers underneath each individual mixture component are the weights assigned to each component. These sum to 1 in each direction and represent the relative contribution of each component to the effect in that direction. If only one component were acting in the positive or negative direction, it would have a weight of 1. A component's weight multiplied by the sum of the coefficients in the relevant direction is that individual component's coefficient, and represents the independent effect of that component (e.g. cadmium log(OR) = 0.0969*0.4556 = 0.0441).
+ * The overall mixture effect (i.e. the log(OR) when all exposures are increased by one quartile) is given by psi1. Here it equals 0.0437. Note that this value equals the sum of the coefficients in the positive direction combined with the sum in the negative direction (i.e. 0.0969 - 0.0532 = 0.0437).
+
+IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
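+
+The arithmetic linking the weights, the directional sums, and psi1 can be verified directly. This is a quick sketch using the values printed in the summary above (rounded, so results are approximate):
+```{r 6-3-Mixtures-Analysis-15b, message=F, warning=F, error=F}
+pos_sum <- 0.0969   #sum of positive coefficients
+neg_sum <- -0.0532  #sum of negative coefficients
+cd_weight <- 0.4556 #cadmium's positive-direction weight
+
+#a component's independent effect = directional sum * its weight
+pos_sum * cd_weight #cadmium log(OR), ~0.0441
+
+#the overall mixture effect psi1 combines the two directional sums
+psi1 <- pos_sum + neg_sum
+psi1
+
+#the mixture odds ratio per one-quartile increase in all exposures
+exp(psi1) #~1.04
+```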
+
+This is the plot that gives you the weights of the components:
+```{r 6-3-Mixtures-Analysis-16, message=F, warning=F, error=F, fig.align='center'}
+plot(PTB_adj_ppb)
+```
+
+To save the plot:
+```{r 6-3-Mixtures-Analysis-17, eval = FALSE}
+tiff(file = (paste0(Output_Folder,"/", cur_date, "_NCbirths_pretermbirth_qgcomp_weights.tiff")), width = 10, height = 8, units = "in", pointsize = 12, res = 600)
+plot(PTB_adj_ppb)
+dev.off()
+```
+
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: What is the risk of preterm birth associated with combined exposure to arsenic, lead, cadmium, chromium, manganese, copper, and zinc (i.e. a mixture) via private well water?
+:::
+
+:::answer
+**Answer**: When all exposures (arsenic, lead, cadmium, chromium, manganese, copper, and zinc) are increased in concentration by one quartile, the odds ratio is 1.044 (exp(0.043705)). IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
+:::
+
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can also answer **Environmental Health Question #3***: Which of these chemicals has the strongest effect on preterm birth risk?
+:::
+
+:::answer
+**Answer**: The mixture component with the strongest effect is the one with the largest independent effect, given by the component's coefficient (calculated as (sum of coefficients in the relevant direction)*(component weight); as shown below, these can also be output into results files). In this case, the component with the largest independent effect is cadmium (0.0969*0.4556 = 0.0441). IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
+:::
+
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can also answer **Environmental Health Question #4***: Which of these chemicals increases the risk of preterm birth and which decreases the risk of preterm birth?
+:::
+
+:::answer
+**Answer**: This is indicated by the direction of effect for each component. Thus, the mixture components that increase the risk of preterm birth are cadmium, chromium, manganese and zinc, while the mixture components that decrease the risk of preterm birth are arsenic, copper and lead. IMPORTANT NOTE: as described above, these results differ from the publication (Eaves et al. 2023) because this scripted example is conducted on a smaller subsetted dataset.
+:::
+
+
+We can export the mixtures modeling results using the following code, which stores the data in three different files:
++ Results_SlopeParams outputs the overall mixture effect results
++ Results_MetalCoeffs outputs the individual mixture component (metal) coefficients. Note that this will also output coefficients for covariates included in the model.
++ Results_MetalWeights outputs the individual mixture component (metal) weights
+```{r 6-3-Mixtures-Analysis-18, message=F, warning=F, error=F, eval = FALSE}
+allmodels <- c("PTB_adj_ppb") #if you run more than one qgcomp model, list them here and the following code can output the results in clean format all together
+
+clean_print <- function(x){
+ output = data.frame(
+ x$coef,
+ sqrt(x$var.coef),
+ x$ci.coef,
+ x$pval
+ )
+ names(output) = c("Estimate", "Std. Error", "Lower CI", "Upper CI", "p value")
+ return(output)
+}
+
+Results_SlopeParams <- data.frame() #empty data frame to append results to
+for (i in allmodels){
+ print(i)
+ df <- eval(parse(text = paste0("clean_print(",i,")"))) %>%
+ rownames_to_column("Parameter") %>%
+ mutate("Model" = i)
+ Results_SlopeParams <- rbind(Results_SlopeParams,df)
+}
+Results_SlopeParams <- Results_SlopeParams %>%
+ mutate(OR=exp(Estimate)) %>%
+ mutate(UpperCI_OR=exp(`Upper CI`)) %>%
+ mutate(LowerCI_OR=exp(`Lower CI`))
+
+Results_MetalCoeffs <- data.frame()
+for (i in allmodels){
+ print(i)
+ df <- eval(parse(text = paste0("as.data.frame(summary(",i,"$fit)$coefficients[,])"))) %>%
+ mutate("Model" = i)
+ df <- df %>% rownames_to_column(var="variable")
+ Results_MetalCoeffs<- rbind(Results_MetalCoeffs,df)
+}
+
+Results_MetalWeights <- data.frame()
+for (i in allmodels){
+ Results_PWeights <- eval(parse(text = paste0("as.data.frame(",i,"$pos.weights)"))) %>%
+ rownames_to_column("Metal") %>%
+ dplyr::rename("Weight" = 2) %>%
+ mutate("Weight Direction" = "Positive")
+ Results_NWeights <- eval(parse(text = paste0("as.data.frame(",i,"$neg.weights)"))) %>%
+ rownames_to_column("Metal") %>%
+ dplyr::rename("Weight" = 2) %>%
+ mutate("Weight Direction" = "Negative")
+ Results_Weights <- rbind(Results_PWeights, Results_NWeights) %>%
+ mutate("Model" = i) %>% as.data.frame()
+ Results_MetalWeights <- rbind(Results_MetalWeights, Results_Weights)
+}
+
+write.csv(Results_SlopeParams, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_SlopeParams.csv"), row.names=TRUE)
+write.csv(Results_MetalCoeffs, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_MetalCoeffs.csv"), row.names=TRUE)
+write.csv(Results_MetalWeights, paste0(Output_Folder,"/", cur_date, "_qgcomp_Results_MetalWeights.csv"), row.names=TRUE)
+```
+
+
+## Concluding Remarks
+In conclusion, this module reviews a suite of methodologies researchers can use to answer different questions about environmental mixtures and their relationships to health outcomes. In this scripted example, we utilized a large epidemiological dataset (subsetted to a reduced sample size for educational purposes) to demonstrate the use of logistic regression to assess single-contaminant associations with a health outcome (preterm birth) and quantile g-computation to assess mixture effects on that outcome.
+
+## Additional Resources
+The field of mixtures is vast, with many different approaches and example studies to learn from as analysts carry out their own analyses. Some resources that can be helpful include the following reviews:
+
++ Our recent review on mixtures methodologies, particularly in the field of sufficient similarity, titled [Wrangling whole mixtures risk assessment: Recent advances in determining sufficient similarity](https://www.sciencedirect.com/science/article/abs/pii/S2468202023000323?via%3Dihub)
++ Two more general, epidemiology-focused reviews on mixtures questions and methodologies, titled [Complex Mixtures, Complex Analyses: an Emphasis on Interpretable Results](https://link.springer.com/article/10.1007/s40572-019-00229-5) and [Environmental exposure mixtures: questions and methods to address them](https://pubmed.ncbi.nlm.nih.gov/30643709/)
++ [A helpful online toolkit](https://bookdown.org/andreabellavia/mixtures/preface.html) for mixtures analyses generated by Andrea Bellavia, PhD
+
+Some helpful mixtures case studies include the following:
+
++ Our recent study that implemented quantile g-computation statistics to identify chemicals present in wildfire smoke emissions that impact toxicity, published as the following: Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. PMID: [33611182](https://pubmed.ncbi.nlm.nih.gov/33611182/).
++ Another study from our group that implemented quantile g-computation to identify placental gene networks with altered expression in response to cord tissue mixtures of metals, published as the following: Eaves LA, Bulka CM, Rager JE, Galusha AL, Parsons PJ, O’Shea TM and Fry RC. Metals mixtures modeling identifies birth weight-associated gene networks in the placentas of children born extremely preterm. Chemosphere. 2022;137469. PMID: [36493891](https://pubmed.ncbi.nlm.nih.gov/36493891/)
+
+Many other groups also leverage quantile g-computation, with the following as exemplar case studies:
+
++ [Prenatal exposure to consumer product chemical mixtures and size for gestational age at delivery](https://link.springer.com/article/10.1186/s12940-021-00724-z)
++ [Use of personal care product mixtures and incident hormone-sensitive cancers in the Sister Study: A U.S.-wide prospective cohort](https://www.sciencedirect.com/science/article/pii/S0160412023005718)
+
+
+
+
+:::tyk
+
+Using the metals dataset within the *qgcomp* package (see the [package vignette](https://cran.r-project.org/web/packages/qgcomp/vignettes/qgcomp-vignette.html) for how to access), answer the following three mixtures-related environmental health questions using quantile g-computation, focusing on a mixture of arsenic, copper, zinc and lead:
+
+1. What is the risk of disease associated with combined exposure to each of the chemicals?
+2. Which of these chemicals has the strongest effect on disease?
+3. Which of these chemicals increases the risk of disease and which decreases the risk of disease?
+
+Note that disease is given by the variable `disease_state` (1 = case, 0 = non-case).
+
+:::
diff --git a/Chapter_6/Module6_3_Input/Module6_3_InputData.csv b/Chapter_6/6_3_Mixtures_Analysis/Module6_3_InputData.csv
similarity index 100%
rename from Chapter_6/Module6_3_Input/Module6_3_InputData.csv
rename to Chapter_6/6_3_Mixtures_Analysis/Module6_3_InputData.csv
diff --git a/Chapter_6/Module6_3_Input/Module6_3_InputData.csv.zip b/Chapter_6/6_3_Mixtures_Analysis/Module6_3_InputData.csv.zip
similarity index 100%
rename from Chapter_6/Module6_3_Input/Module6_3_InputData.csv.zip
rename to Chapter_6/6_3_Mixtures_Analysis/Module6_3_InputData.csv.zip
diff --git a/Chapter_6/Module6_3_Input/Module6_3_Mixtures_Methods_Overview.png b/Chapter_6/6_3_Mixtures_Analysis/Module6_3_Mixtures_Methods_Overview.png
similarity index 100%
rename from Chapter_6/Module6_3_Input/Module6_3_Mixtures_Methods_Overview.png
rename to Chapter_6/6_3_Mixtures_Analysis/Module6_3_Mixtures_Methods_Overview.png
diff --git a/Chapter_6/6_4_Mixtures_Analysis_2/6_4_Mixtures_Analysis_2.Rmd b/Chapter_6/6_4_Mixtures_Analysis_2/6_4_Mixtures_Analysis_2.Rmd
new file mode 100644
index 0000000..42edb05
--- /dev/null
+++ b/Chapter_6/6_4_Mixtures_Analysis_2/6_4_Mixtures_Analysis_2.Rmd
@@ -0,0 +1,405 @@
+
+# 6.4 Mixtures Analysis Methods Part 2: Bayesian Kernel Machine Regression
+
+This training module was developed by Dr. Lauren Eaves, Dr. Kyle Roell, and Dr. Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+
+## Introduction to Training Module
+
+In this training module, we will continue to explore mixtures analysis methods, this time with a scripted example of Bayesian Kernel Machine Regression (BKMR). Please refer to **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation** for an overview of mixtures methodologies and a scripted example using quantile g-computation.
+
+## Introduction to Example Data
+
+In this scripted example, we will use a dataset from the [Extremely Low Gestational Age Newborn (ELGAN) cohort](https://elgan.fpg.unc.edu/). Specifically, we will analyze metal mixtures assessed in cord tissue collected at delivery in relation to neonatal inflammation measured over the first two weeks of life.
+
+For more information on the cord tissue metals data, please see the following two publications:
+
+ + Eaves LA, Bulka CM, Rager JE, Galusha AL, Parsons PJ, O’Shea TM and Fry RC. Metals mixtures modeling identifies birth weight-associated gene networks in the placentas of children born extremely preterm. Chemosphere. 2022;137469. PMID: [36493891](https://pubmed.ncbi.nlm.nih.gov/36493891/)
+
+ + Bulka CM, Eaves LA, Gardner AJ, Parsons PJ, Kyle RR, Smeester L, O'Shea TM, Fry RC. Prenatal exposure to multiple metallic and metalloid trace elements and the risk of bacterial sepsis in extremely low gestational age newborns: A prospective cohort study. Front Epidemiol. 2022;2. PMID: [36405975](https://pubmed.ncbi.nlm.nih.gov/36405975/)
+
+For more information on the neonatal inflammation data, please see the following publication:
+
+ + Eaves LA, Enggasser AE, Camerota M, Gogcu S, Gower WA, Hartwell H, Jackson WM, Jensen E, Joseph RM, Marsit CJ, Roell K, Santos HP Jr, Shenberger JS, Smeester L, Yanni D, Kuban KCK, O'Shea TM, Fry RC. CpG methylation patterns in placenta and neonatal blood are differentially associated with neonatal inflammation. Pediatr Res. June 2022. PMID: [35764815](https://pubmed.ncbi.nlm.nih.gov/35764815/)
+
+Here, we have a dataset of n=254 participants for which we have complete data on neonatal inflammation, cord tissue metals and key demographic variables that will be included as confounders in the analysis.
+
+Extensive research in the ELGAN study has demonstrated that neonatal inflammation is predictive of cerebral palsy, ASD, ADHD, obesity, cognitive impairment, attention problems, cerebral white matter damage, and decreased total brain volume, among other adverse outcomes. Therefore, identifying exposures that lead to neonatal inflammation, and that could be intervened upon to reduce its risk, is critical to improving neonatal health. Environmental exposures during pregnancy, such as metals, may contribute to neonatal inflammation. As is often the case in environmental health, these chemical exposures are likely co-occurring, and therefore mixtures methods are needed.
+
+## Introduction to BKMR
+
+BKMR offers a flexible, non-parametric method to estimate:
+
+1) The single-exposure effect: the odds ratio of inflammation when a single exposure is at its 75th percentile compared to its 25th percentile, with the other exposures at their 50th percentile and covariates held constant;
+2) The overall mixture effect: the odds ratio of inflammation when all exposures are fixed at their 75th percentile compared to when all are fixed at their 25th percentile;
+3) The interactive effect: the difference in the single-exposure effect when all of the other exposures are fixed at their 75th percentile, compared to when they are fixed at their 25th percentile.
+
+
+There are numerous excellent summaries of BKMR, including the publications in which it was first introduced:
+
+ + Bobb et al. [Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures](https://academic.oup.com/biostatistics/article/16/3/493/269719)
+ + Bobb et al. [Statistical software for analyzing the health effects of multiple concurrent exposures via Bayesian kernel machine regression](https://ehjournal.biomedcentral.com/articles/10.1186/s12940-018-0413-y)
+
+And other vignettes and toolkits including:
+
+ + Jennifer Bobb's [Introduction to Bayesian kernel machine regression and the bkmr R package](https://jenfb.github.io/bkmr/overview.html)
+ + Andrea Bellavia's [Bayesian kernel machine regression](https://bookdown.org/andreabellavia/mixtures/bayesian-kernel-machine-regression.html)
+
+
+While BKMR can do many things other methods cannot, it can require substantial computational resources and take a long time to run. If your final dataset or analysis is very large or complex, it is often recommended to start with a smaller sample to make sure everything is working correctly before starting an analysis that may take days to complete.
+
+
+### Training Module's **Environmental Health Questions**
+This training module was specifically developed to answer the following questions, which mirror the questions in **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1**, but in a different order:
+
+1. Which of these chemicals has the strongest effect on neonatal inflammation risk?
+2. Which of these chemicals increases the risk of neonatal inflammation and which decreases the risk of neonatal inflammation?
+3. What is the risk of neonatal inflammation associated with exposure to each of manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead individually?
+4. What is the risk of neonatal inflammation associated with combined exposure to manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead (i.e. a mixture)?
+
+In addition to the questions addressed in Mixtures Methods Part 1, we can also answer:
+
+5. Are there interactions among manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead in relation to neonatal inflammation?
+
+## Run BKMR
+
+### Workspace Preparation
+
+Install packages as needed, then load the following packages:
+```{r 6-4-Mixtures-Analysis-2-1, message = FALSE}
+#load packages
+library(tidyverse)
+library(ggplot2)
+library(knitr)
+library(yaml)
+library(rmarkdown)
+library(broom)
+library(ggpubr)
+library(bkmr)
+```
+
+Optionally, you can also create a current date variable to name output files, and create an output folder.
+```{r 6-4-Mixtures-Analysis-2-2, eval = FALSE}
+#Create a current date variable to name output files
+cur_date <- str_replace_all(Sys.Date(),"-","")
+
+#Create an output folder
+Output_Folder <- ("Module6_4_Output/")
+```
+
+### Data Import
+```{r 6-4-Mixtures-Analysis-2-3 }
+cohort <- read.csv(file="Chapter_6/6_4_Mixtures_Analysis_2/Module6_4_InputData.csv")
+colnames(cohort)
+head(cohort)
+```
+
+The variables in this dataset include sample and demographic information and cord tissue metal exposures in $\mu$g/g or ng/g.
+
+*Sample and Demographic Variables*
+
++ `id`: unique study ID
+
+*Outcome Variable*
+
++ `inflam_intense`: 1 = high inflammation, 0 = low inflammation
+
+*Covariates*
+
++ `race1`: maternal race, 1 = White, 2 = Black, 0 = Other
++ `sex`: neonatal sex, 0 = female, 1 = male
++ `gadays`: gestational age at delivery in days
++ `magecat`: maternal age, 1 = <21, 2 = 21-35, 3 = >35
++ `medu`: maternal education, 1 = <12, 2 = 12, 3 = 13-15, 4 = 16, 5 = >16
++ `smoke`: maternal smoking while pregnant, 0 = no, 1 = yes
+
+*Exposure Variables*
+
++ `Mn_ugg`
++ `Cu_ugg`
++ `Zn_ugg`
++ `As_ngg`
++ `Se_ugg`
++ `Cd_ngg`
++ `Hg_ngg`
++ `Pb_ngg`
+
+
+There are many steps prior to the modeling steps outlined below. These are being skipped for educational purposes. Additional steps include assessment of normality and transformations as needed, generation of a demographics table and assessing for missing data, imputation of missing data if needed, visualizing trends and distributions in the data, assessing correlations between exposures, functional form assessments, and decisions regarding what confounders to include.
+
+In addition, it is highly recommended to conduct single-contaminant modeling initially to understand individual chemical relationships with the outcomes of focus before conducting mixtures assessment. For an example of this, see **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation**. BKMR, as a flexible non-parametric modeling approach, does not allow for classical null-hypothesis testing, and 95% CI are interpreted as credible intervals, not confidence intervals. One approach therefore could be to explore non-linearities and interactions within BKMR to then validate generated hypotheses using quantile g-computation.
+
+### Fit the BKMR Model
+First, define a matrix/vector of the exposure mixture, outcome, and confounders/covariates. BKMR performs better when the exposures are on a similar scale and when there are not outliers. Thus, we center and scale the exposure variables first. As noted above, in a complete analysis, thorough examination of exposure variable distributions, including outliers and normality, would be conducted before any exposure-outcome modeling. For more information on normality testing, see **TAME 2.0 Module 3.3 Normality Tests and Data Transformations.**
+
+First, we'll assign the matrix variables to their own data frame and scale the data.
+```{r 6-4-Mixtures-Analysis-2-4, message=F, warning=F, error=F}
+#exposure mixture variables
+mixture <- as.matrix(cohort[,10:17])
+mixture <- log(mixture)
+mixture <-scale(mixture, center=TRUE)
+summary(mixture)
+```
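+
+As a sanity check on the centering and scaling step, note that after `scale(..., center=TRUE)` every column should have mean ~0 and standard deviation 1. Here is a quick sketch on toy data (the same check can be applied to `mixture`):
+```{r 6-4-Mixtures-Analysis-2-4b, message=F, warning=F, error=F}
+#toy matrix standing in for the exposure data
+set.seed(1)
+toy <- matrix(rlnorm(60), ncol = 3)
+toy_scaled <- scale(log(toy), center = TRUE)
+
+round(colMeans(toy_scaled), 10) #effectively zero
+apply(toy_scaled, 2, sd)        #all equal to 1
+```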
+
+Then, we'll define the outcome variable and ensure it is the proper class and leveling.
+```{r 6-4-Mixtures-Analysis-2-5 }
+#outcome variable
+cohort$inflam_intense <-as.factor(cohort$inflam_intense)
+cohort$inflam_intense <- relevel(cohort$inflam_intense, ref = "0")
+y<-as.numeric(as.character(cohort$inflam_intense))
+```
+
+Next, we'll assign the covariates to a matrix.
+```{r 6-4-Mixtures-Analysis-2-6 }
+#covariates
+covariates<-as.matrix(cohort[,7:9])
+```
+
+Then, we can fit the BKMR model. Note that this script will take a few minutes to run.
+```{r 6-4-Mixtures-Analysis-2-7 }
+set.seed(111)
+fitkm <- kmbayes(y = y, Z = mixture, X = covariates, iter = 5000, verbose = FALSE, varsel = TRUE, family="binomial", est.h = TRUE)
+```
+
+For full information regarding options for the kmbayes function, refer to the BKMR reference manual: https://cran.r-project.org/web/packages/bkmr/bkmr.pdf
+
+### Assess Variable Importance
+BKMR conducts a variable selection procedure and generates posterior inclusion probabilities (PIPs). The larger the PIP, the more a variable contributes to the overall exposure-outcome effect. These are relative to each other, so there is no threshold at which a variable becomes an "important" contributor (similar to the weights in quantile g-computation).
+```{r 6-4-Mixtures-Analysis-2-8, message=F, warning=F, error=F}
+ExtractPIPs(fitkm)
+```
+
+Relative to each other, the contributions of each mixture component to the effect of the mixture on neonatal inflammation are shown above. Note that if a variable's PIP equals 0, BKMR will drop it from the model, and the overall mixture effect will not include that exposure.
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: Which of these chemicals has the strongest effect on neonatal inflammation risk?
+:::
+
+:::answer
+**Answer**: Based on the PIPs: Cadmium.
+:::
+
+
+### Assess Model Convergence
+
+We can use trace plots to evaluate how the parameters in the model converge over the many iterations. We hope to see that the line moves randomly but centers around a straight line.
+
+```{r 6-4-Mixtures-Analysis-2-9, message=F, warning=F, error=F, fig.align = "center"}
+sel<-seq(0,5000,by=1)
+TracePlot(fit = fitkm, par = "beta", sel=sel)
+```
+
+Based on this plot, it looks like the burn-in period is roughly 1,000 iterations. We will remove these iterations from the results.
+```{r 6-4-Mixtures-Analysis-2-10, message=F, warning=F, error=F, fig.align = "center"}
+sel<-seq(1000,5000,by=1)
+TracePlot(fit = fitkm, par = "beta", sel=sel)
+```
+
+### Presenting Model Results
+
+#### Single exposure effects
+As described above, one way to examine single-exposure effects is to calculate the odds ratio of inflammation when a single exposure is at its 75th percentile compared to its 25th percentile, with the other exposures at their 50th percentile and covariates held constant.
+
+Here, we use the `PredictorResponseUnivar()` function to generate a dataset that details, at varying levels of each exposure (`z`), the relationship between that exposure and the outcome, holding the other exposures at their 50th percentile and covariates constant. This relationship is given by a beta value (`est`), which, because we have a binomial outcome and fit a probit model, represents the log(odds). The standard error of the beta value is also calculated (`se`).
+
+```{r 6-4-Mixtures-Analysis-2-11, message=F, warning=F, error=F, fig.align = "center"}
+pred.resp.univar <- PredictorResponseUnivar(fit=fitkm, sel=sel,
+ method="approx", q.fixed = 0.5)
+
+head(pred.resp.univar)
+```
+
+We can then plot these data for each exposure to visualize the exposure-response function for each exposure.
+
+```{r 6-4-Mixtures-Analysis-2-12 }
+ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se,
+ ymax = est + 1.96*se)) +
+ geom_smooth(stat = "identity") + ylab("h(z)") + facet_wrap(~ variable)
+```
+
+Then, we can generate a dataset that contains, for each exposure (`variable`), the log(OR) (`est`) and its standard deviation (`sd`) corresponding to the odds of neonatal inflammation when that exposure is at its 75th percentile compared to its 25th percentile. The log(OR) is estimated at three levels of the other exposures (25th, 50th, and 75th percentiles). We can use this dataset to identify odds ratios for neonatal inflammation (comparing the 75th to 25th percentile odds) for each exposure at differing levels of the other exposures. These odds ratios approximate risk: an odds ratio >1 means there is increased risk of neonatal inflammation when that exposure is at its 75th percentile compared to its 25th percentile. We can then plot these data to see the log(OR) for each metal in relation to neonatal inflammation at varying levels of the rest of the exposures.
+
+```{r 6-4-Mixtures-Analysis-2-13, message=F, warning=F, error=F, fig.align = "center"}
+risks.singvar <- SingVarRiskSummaries(fit=fitkm, qs.diff = c(0.25, 0.75),
+ q.fixed = c(0.25, 0.50, 0.75),
+ method = "approx")
+
+ggplot(risks.singvar, aes(variable, est, ymin = est - 1.96*sd,
+ ymax = est + 1.96*sd, col = q.fixed)) +
+ geom_hline(aes(yintercept=0), linetype="dashed", color="gray") +
+ geom_pointrange(position = position_dodge(width = 0.75)) +
+ coord_flip() + theme(legend.position="none")+scale_x_discrete(name="") +
+ scale_y_continuous(name="estimate")
+
+```
+
+
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: Which of these chemicals increases the risk of neonatal inflammation and which decreases the risk of neonatal inflammation?
+:::
+
+:::answer
+**Answer**: At all levels of the other exposures, lead, cadmium, selenium, arsenic, and zinc reduce the odds of neonatal inflammation, while manganese and mercury appear to increase the odds of neonatal inflammation. Copper appears to have a null effect. Notice, however, that the credible intervals for all metals span the null, meaning we are not confident in the independent effect of any of the metals.
+:::
+
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can also answer **Environmental Health Question #3***: What is the risk of neonatal inflammation associated with exposure to each of manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead individually?
+:::
+
+:::answer
+**Answer**: As an example, take manganese: when all other exposures are at their 50th percentile, the log(OR) for Mn comparing being at the 75th to the 25th percentile is 0.024, which equals an odds ratio of 1.02. From this, you should be able to calculate the odds ratios for the other metals yourself.
+:::
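+
+To carry out that calculation for any metal, a small helper can convert a log(OR) and its standard deviation (the `est` and `sd` columns of the `risks.singvar` output generated above) into an odds ratio with a 95% credible interval. The `sd` value below is hypothetical and for illustration only; read the actual values off your own `risks.singvar` output:
+```{r 6-4-Mixtures-Analysis-2-13b, message=F, warning=F, error=F}
+#convert a log(OR) and its sd into an OR with a 95% credible interval
+logOR_to_OR <- function(est, sd) {
+  c(OR    = exp(est),
+    lower = exp(est - 1.96 * sd),
+    upper = exp(est + 1.96 * sd))
+}
+
+#manganese example from the answer above (sd is hypothetical)
+round(logOR_to_OR(est = 0.024, sd = 0.05), 2)
+```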
+
+
+#### Calculating the overall mixture effect
+
+Next, we can generate a dataset that details the effect (i.e. the log(OR), `est`, and its standard deviation, `sd`) on neonatal inflammation of all exposures held at a particular quantile (`quantile`) compared to all exposures being at the 50th percentile. We can use this dataset to identify odds ratios for neonatal inflammation upon simultaneous exposure to the entire mixture for different quantile comparisons. These odds ratios approximate risk: an odds ratio >1 means there is increased risk of neonatal inflammation when the entire mixture is set at the index quantile compared to the 50th percentile. We can also plot these results to visualize the overall mixture effect dose-response relationship.
+
+```{r 6-4-Mixtures-Analysis-2-14, message=F, warning=F, error=F, fig.align = "center"}
+risks.overall <- OverallRiskSummaries(fit=fitkm, qs=seq(0.25, 0.75, by=0.05),
+ q.fixed = 0.5, method = "approx",
+ sel=sel)
+
+ggplot(risks.overall, aes(quantile, est, ymin = est - 1.96*sd,
+ ymax = est + 1.96*sd)) +
+ geom_hline(yintercept=0, linetype="dashed", color="gray") +
+ geom_pointrange() + scale_y_continuous(name="estimate")
+```
+
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can answer **Environmental Health Question #4***: What is the risk of neonatal inflammation associated with combined exposure to manganese, copper, zinc, arsenic, selenium, cadmium, mercury, lead (ie. a mixture)?
+:::
+
+:::answer
+**Answer**: When every exposure is at its 25th percentile concentration compared to its 50th percentile concentration, the odds ratio for neonatal inflammation is 1.11 (exp(0.10073680)). When every exposure is at its 75th percentile concentration compared to its 50th percentile concentration, the odds ratio for neonatal inflammation is 0.87 (exp(-0.12000889)).
+:::
+
+
+### Evaluating interactive effects
+
+To understand bivariate interactions, we can generate a dataset that, for each pairing of exposures, details the log(odds) (`est`, with associated standard deviation `sd`) of neonatal inflammation at varying levels of both exposures, holding all other exposures constant. These plots can be tricky to interpret, so another way of looking at these results is to take "cross sections" at specific quantiles of the second exposure (see next step).
+```{r 6-4-Mixtures-Analysis-2-15, message=F, warning=F, error=F, fig.align = "center"}
+pred.resp.bivar <- PredictorResponseBivar(fit=fitkm, min.plot.dist = 1,
+ sel=sel, method="approx")
+
+ggplot(pred.resp.bivar, aes(z1, z2, fill = est)) +
+ geom_raster() +
+ facet_grid(variable2 ~ variable1) +
+ scale_fill_gradientn(colours=c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
+ xlab("expos1") +
+ ylab("expos2") +
+ ggtitle("h(expos1, expos2)")
+```
+
+Next, we generate a dataset that includes, for each pairing of exposures, the log(odds) (`est`, with associated standard deviation `sd`) of neonatal inflammation at varying concentrations (`z1`) of the first exposure (`variable1`) when the second exposure (`variable2`) is at its 25th, 50th, and 75th percentiles (`quantile`).
+
+```{r 6-4-Mixtures-Analysis-2-16, message=F, warning=F, error=F, fig.align = "center"}
+pred.resp.bivar.levels <- PredictorResponseBivarLevels(pred.resp.df=
+ pred.resp.bivar, Z = mixture, both_pairs=TRUE,
+ qs = c(0.25, 0.5, 0.75))
+
+ggplot(pred.resp.bivar.levels, aes(z1, est)) +
+ geom_smooth(aes(col = quantile), stat = "identity") +
+ facet_grid(variable2 ~ variable1) +
+ ggtitle("h(expos1 | quantiles of expos2)") +
+ xlab("expos1")
+```
+
+There is evidence of an interactive effect between two exposures when the exposure-response function for exposure 1 varies in form between the different quantiles of exposure 2. You can also zoom in on one plot, for example:
+
+```{r 6-4-Mixtures-Analysis-2-17, message=F, warning=F, error=F, fig.align = "center"}
+HgCd <- pred.resp.bivar.levels %>%
+ filter(variable1=="Hg_ngg") %>%
+ filter(variable2=="Cd_ngg")
+
+ggplot(HgCd, aes(z1, est)) +
+ geom_smooth(aes(col = quantile), stat = "identity") +
+ ggtitle("h(expos1 | quantiles of expos2)") +
+ xlab("expos1")
+
+
+CdHg <- pred.resp.bivar.levels %>%
+ filter(variable1=="Cd_ngg") %>%
+ filter(variable2=="Hg_ngg")
+
+ggplot(CdHg, aes(z1, est)) +
+ geom_smooth(aes(col = quantile), stat = "identity") +
+ ggtitle("h(expos1 | quantiles of expos2)") +
+ xlab("expos1")
+```
+
+To visualize interactions between one exposure and the rest of the exposure components, we generate a dataset that details, for each exposure (`variable`), the difference in the log(OR) comparing the 75th to the 25th percentile of that exposure (`est`, with associated standard deviation `sd`) when the other exposure components are fixed at their 75th versus their 25th percentiles. Perhaps more intuitively, these estimates represent the differences between the blue and red points plotted in the second figure under the single exposure effects section.
+```{r 6-4-Mixtures-Analysis-2-18, message=F, warning=F, error=F, fig.align = "center"}
+
+risks.int <- SingVarIntSummaries(fit=fitkm, qs.diff = c(0.25, 0.75),
+ qs.fixed = c(0.25, 0.75))
+
+
+ggplot(risks.int, aes(variable, est, ymin = est - 1.96*sd,
+ ymax = est + 1.96*sd)) +
+ geom_pointrange(position = position_dodge(width = 0.75)) +
+ geom_hline(yintercept = 0, lty = 2, col = "brown") + coord_flip()
+```
+
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can answer **Environmental Health Question #5***: Are there interactions among manganese, copper, zinc, arsenic, selenium, cadmium, mercury, and lead in relation to neonatal inflammation?
+:::
+
+:::answer
+**Answer**: There do not appear to be any single exposure and rest of mixture interactions (previous plot); however, there is suggestive evidence of a bivariate interaction between cadmium and mercury.
+:::
+
+
+## Concluding Remarks
+In conclusion, this module extends upon **TAME 2.0 Module 6.3 Mixtures Analysis Methods Part 1: Overview and Example with Quantile G-Computation**. In this scripted example, we used a dataset from a human population study (n=246) of cord tissue metals and examined the outcome of neonatal inflammation. We found that increasing the entire mixture of metals reduced the risk of neonatal inflammation; however, certain metals increased the risk and others decreased the risk. There was also a suggestive interactive effect found between cadmium and mercury.
+
+## Additional Resources
+The field of mixtures is vast, with many different approaches and example studies to learn from as analysts lead in their own analyses. Some resources that can be helpful include the following reviews:
+
++ Our recent review on mixtures methodologies, particularly in the field of sufficient similarity, titled [Wrangling whole mixtures risk assessment: Recent advances in determining sufficient similarity](https://www.sciencedirect.com/science/article/abs/pii/S2468202023000323?via%3Dihub)
++ Two more general, epidemiology-focused reviews on mixtures questions and methodologies, titled [Complex Mixtures, Complex Analyses: an Emphasis on Interpretable Results](https://link.springer.com/article/10.1007/s40572-019-00229-5) and [Environmental exposure mixtures: questions and methods to address them](https://pubmed.ncbi.nlm.nih.gov/30643709/)
++ [A helpful online toolkit](https://bookdown.org/andreabellavia/mixtures/preface.html) for mixtures analyses generated by Andrea Bellavia, PhD
+
+Some helpful mixtures case studies using BKMR include the following:
+
++ [Prenatal metal concentrations and childhood cardio-metabolic risk using Bayesian Kernel Machine Regression to assess mixture and interaction effects](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6402346/)
++ [Associations between Phthalate Metabolite Concentrations in Follicular Fluid and Reproductive Outcomes among Women Undergoing in Vitro Fertilization/Intracytoplasmic Sperm Injection Treatment](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP11998)
++ [Associations of Prenatal Per- and Polyfluoroalkyl Substance (PFAS) Exposures with Offspring Adiposity and Body Composition at 16–20 Years of Age: Project Viva](https://ehp.niehs.nih.gov/doi/full/10.1289/EHP12597)
+
+
+
+
+:::tyk
+
+Using the simulated dataset within the bkmr package (see the code below for how to call and store this dataset), answer the following key environmental health questions using BKMR.
+
+1. Which of these chemicals has the strongest effect on the outcome?
+2. Which of these chemicals increases the outcome and which decreases the outcome?
+3. What is the effect on the outcome with exposure to each of the chemicals individually?
+4. What is the effect on the outcome associated with combined exposure to all chemicals?
+5. Are there interactions among the chemicals in relation to the outcome?
+
+Note that the outcome (y) variable is a continuous variable here, rather than binary as in the scripted example.
+:::
+
+```{r 6-4-Mixtures-Analysis-2-19 }
+# Set seed for reproducibility
+set.seed(111)
+
+# Create a dataset with 100 participants and 4 mixture components
+data <- SimData(n = 100, M = 4)
+
+# Save outcome variable (y)
+y <- data$y
+
+# Save exposure matrix (Z) and covariate data (X)
+Z <- data$Z
+X <- data$X
+```
diff --git a/Chapter_6/Module6_4_Input/Module6_3_Mixtures_Methods_Overview.png b/Chapter_6/6_4_Mixtures_Analysis_2/Module6_3_Mixtures_Methods_Overview.png
similarity index 100%
rename from Chapter_6/Module6_4_Input/Module6_3_Mixtures_Methods_Overview.png
rename to Chapter_6/6_4_Mixtures_Analysis_2/Module6_3_Mixtures_Methods_Overview.png
diff --git a/Chapter_6/Module6_4_Input/Module6_4_InputData.csv b/Chapter_6/6_4_Mixtures_Analysis_2/Module6_4_InputData.csv
similarity index 100%
rename from Chapter_6/Module6_4_Input/Module6_4_InputData.csv
rename to Chapter_6/6_4_Mixtures_Analysis_2/Module6_4_InputData.csv
diff --git a/Chapter_6/6_5_Mixtures_Analysis_3/6_5_Mixtures_Analysis_3.Rmd b/Chapter_6/6_5_Mixtures_Analysis_3/6_5_Mixtures_Analysis_3.Rmd
new file mode 100644
index 0000000..d674fb5
--- /dev/null
+++ b/Chapter_6/6_5_Mixtures_Analysis_3/6_5_Mixtures_Analysis_3.Rmd
@@ -0,0 +1,468 @@
+# 6.5 Mixtures Analysis Methods Part 3: Sufficient Similarity
+
+This training module was developed by Cynthia Rider, with contributions from Lauren E. Koval and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+Humans are rarely, if ever, exposed to single chemicals at a time. Instead, humans are often exposed to multiple stressors in their everyday environments in the form of mixtures. These stressors can include environmental chemicals and pharmaceuticals, and they can also include other types of stressors such as socioeconomic factors and other attributes that can place individuals at increased risk of acquiring disease. Because it is not possible to test every possible combination of exposure that an individual might experience in their lifetime, approaches that take into account variable and complex exposure conditions through mixtures modeling are needed.
+
+There are different computational approaches that can be implemented to address this research topic. In this training module, we will demonstrate how to use **sufficient similarity** to determine which groups of exposure conditions are chemically/biologically similar enough to be regulated for safety together, based on the same set of regulatory criteria. Here, our example mixtures analysis will focus on characterizing the nutritional supplement *Ginkgo biloba*.
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. Based on the chemical analysis, which *Ginkgo biloba* extract looks the most different?
+2. When viewing the variability between chemical profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
+3. Based on the chemical analysis, which chemicals do you think are important in differentiating between the different *Ginkgo biloba* samples?
+4. After removing two samples that have the most different chemical profiles (and are thus, potential outliers), do we obtain similar chemical groupings?
+5. When viewing the variability between toxicity profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
+6. Based on the toxicity analysis, which genes do you think are important in differentiating between the different *Ginkgo biloba* samples?
+7. Were similar chemical groups identified when looking at just the chemistry vs. just the toxicity? How could this impact regulatory decisions, if we only had one of these datasets?
+
+
+## Introduction to Toxicant and Dataset
+
+*Ginkgo biloba* is a popular botanical supplement currently on the market. People take *Ginkgo biloba* to improve brain function, but there are conflicting data on its efficacy. Like other botanicals, *Ginkgo biloba* is a complex mixture containing hundreds to thousands of constituents. Here, we evaluate the variability in chemical and toxicological profiles across samples of *Ginkgo biloba* purchased from different commercial sources. We can use data from a well-characterized sample (reference sample) to evaluate the safety of other samples that are ‘sufficiently similar’ to the reference sample. Samples that differ from the reference sample (i.e., do not meet the standards of sufficient similarity) would require additional safety data.
+
+A total of 29 *Ginkgo biloba* extract samples were analyzed. These samples are abbreviated as “GbE_” followed by a unique sample identifier (GbE = *Ginkgo biloba* Extract). These data have been previously published:
+
++ Catlin NR, Collins BJ, Auerbach SS, Ferguson SS, Harnly JM, Gennings C, Waidyanatha S, Rice GE, Smith-Roe SL, Witt KL, Rider CV. How similar is similar enough? A sufficient similarity case study with Ginkgo biloba extract. Food Chem Toxicol. 2018 Aug;118:328-339. PMID: [29752982](https://pubmed.ncbi.nlm.nih.gov/29752982/).
+
++ Collins BJ, Kerns SP, Aillon K, Mueller G, Rider CV, DeRose EF, London RE, Harnly JM, Waidyanatha S. Comparison of phytochemical composition of Ginkgo biloba extracts using a combination of non-targeted and targeted analytical approaches. Anal Bioanal Chem. 2020 Oct;412(25):6789-6809. PMID: [32865633](https://pubmed.ncbi.nlm.nih.gov/32865633/).
+
+
+### *Ginkgo biloba* Chemistry Dataset Overview
+
+The chemical profiles of these sample extracts were first analyzed using targeted mass spectrometry-based approaches. The concentrations of 12 *Ginkgo biloba* marker compounds were measured in units of mean weight as a ratio [g chemical / g sample]. Note that in this dataset, non-detects have been replaced with values of zero for simplicity, though more advanced methods exist to impute values for non-detects. Script is provided to evaluate how *Ginkgo biloba* extracts group together, based on chemical profiles.
+
+### *Ginkgo biloba* Toxicity Dataset Overview
+
+The toxicological profiles of these samples were also analyzed using *in vitro* test methods. These data represent area under the curve (AUC) values indicating changes in gene expression across various concentrations of the *Ginkgo biloba* extract samples. Positive AUC values indicate a gene that was collectively increased in expression as concentration increased, and a negative AUC value indicates a gene that was collectively decreased in expression as exposure concentration increased. Script is provided to evaluate how *Ginkgo biloba* extracts group together, based on toxicity profiles.
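+
+To build intuition for what these AUC values capture, here is a minimal sketch using made-up numbers (not the actual study data) showing how an area under a concentration-response curve can be computed with the trapezoidal rule:
+
+```{r 6-5-auc-sketch}
+# Hypothetical expression responses for one gene across 5 concentrations
+conc <- c(0, 5, 10, 25, 50)      # exposure concentrations
+resp <- c(0, 0.4, 0.9, 1.6, 2.1) # expression response at each concentration
+
+# Trapezoidal rule: sum the areas of the trapezoids between adjacent points
+auc <- sum(diff(conc) * (head(resp, -1) + tail(resp, -1)) / 2)
+auc # positive value: expression collectively increased with concentration
+```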
+
+
+## Workspace Preparation and Data Import
+
+#### Install required R packages
+If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you:
+```{r 6-5-Mixtures-Analysis-3-1, results=FALSE, message=FALSE}
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse");
+if (!requireNamespace("readxl"))
+ install.packages("readxl");
+if (!requireNamespace("factoextra"))
+ install.packages("factoextra");
+if (!requireNamespace("pheatmap"))
+ install.packages("pheatmap");
+if (!requireNamespace("gridExtra"))
+ install.packages("gridExtra");
+if (!requireNamespace("ggplotify"))
+ install.packages("ggplotify")
+```
+
+#### Loading required packages
+```{r 6-5-Mixtures-Analysis-3-2, results=FALSE, message=FALSE}
+library(readxl) #used to read in and work with excel files
+library(factoextra) #used to run and visualize multivariate analyses, here PCA
+library(pheatmap) #used to make heatmaps. This can be done in ggplot2 but pheatmap is easier and nicer
+library(gridExtra) #used to arrange and visualize multiple figures at once
+library(ggplotify) #used to make non ggplot figures (like a pheatmap) gg compatible
+library(tidyverse) #all tidyverse packages, including dplyr and ggplot2
+```
+
+#### Set your working directory
+```{r 6-5-Mixtures-Analysis-3-3, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+#### Import example *Ginkgo biloba* dataset
+
+We first need to read in the chemistry and toxicity data from the provided Excel file. Here, data were originally organized such that the actual observations start on row 2 (dataset descriptions were in the first row), so let's use `skip=1`, which skips reading in the first row.
+
+```{r 6-5-Mixtures-Analysis-3-4 }
+chem <- read_xlsx("Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_InputData.xlsx", sheet = "chemistry data", skip = 1) # loads the chemistry data tab
+tox <- read_xlsx("Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_InputData.xlsx", sheet = "in vitro data", skip = 1) # loads the toxicity data tab
+```
+
+### View example dataset
+
+Let's first see how many rows and columns of data are present in both datasets:
+```{r 6-5-Mixtures-Analysis-3-5 }
+dim(chem)
+```
+
+The chemistry dataset contains information on 29 samples (rows), with 1 sample identifier + 12 chemicals (13 total columns).
+
+```{r 6-5-Mixtures-Analysis-3-6 }
+dim(tox)
+```
+
+The tox dataset contains information on 29 samples (rows), with 1 sample identifier + 5 genes (6 total columns).
+
+
+Let's also see what kind of data are organized within the datasets:
+```{r 6-5-Mixtures-Analysis-3-7 }
+colnames(chem)
+```
+
+```{r 6-5-Mixtures-Analysis-3-8 }
+head(chem)
+```
+
+```{r 6-5-Mixtures-Analysis-3-9 }
+colnames(tox)
+```
+
+```{r 6-5-Mixtures-Analysis-3-10 }
+head(tox)
+```
+
+
+## Chemistry-Based Sufficient Similarity Analysis
+
+The first method employed in this Sufficient Similarity analysis is Principal Component Analysis (PCA). PCA is a very common dimensionality reduction technique, as detailed in **TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means Clustering & PCA**.
+
+In summary, PCA finds dimensions (eigenvectors) in the higher dimensional original data that capture as much of the variation as possible, which you can then plot. This allows you to project higher dimensional data, in this case 12 dimensions (representing 12 measured chemicals), in fewer dimensions (we'll use 2). These dimensions, or components, capture the "essence" of the original dataset.
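+
+As a toy illustration of this idea (using simulated numbers, not the module data), note how PCA concentrates the variance of correlated variables into the first component:
+
+```{r 6-5-pca-toy-sketch}
+set.seed(1)
+
+# Simulate 3 correlated variables, all partially driven by one latent factor
+latent <- rnorm(50)
+toy <- data.frame(v1 = latent + rnorm(50, sd = 0.3),
+                  v2 = latent + rnorm(50, sd = 0.3),
+                  v3 = latent + rnorm(50, sd = 0.3))
+
+toy_pca <- princomp(toy)
+
+# Proportion of total variance captured by each component; the first
+# component captures most of the shared (latent) variation
+round(toy_pca$sdev^2 / sum(toy_pca$sdev^2), 2)
+```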
+
+Before we can run PCA on this chemistry dataset, we first need to scale the data across samples. We do this here for the chemistry dataset, because we specifically want to evaluate and potentially highlight/emphasize chemicals that may be at relatively low abundance. These low-abundance chemicals may actually be contaminants that drive toxicological effects.
+
+Let's first re-save the original chemistry dataset to compare off of:
+```{r 6-5-Mixtures-Analysis-3-11 }
+chem_original <- chem
+```
+
+Then, we'll make a scaled version to carry forward in this analysis. To do this, we move the sample column to the row names and then scale and center the data.
+```{r 6-5-Mixtures-Analysis-3-12 }
+chem <- chem %>% column_to_rownames("Sample")
+chem <- as.data.frame(scale(as.matrix(chem)))
+```
+
+Let's now compare one of the rows of data (here, sample GbE_E) to see what scaling did:
+```{r 6-5-Mixtures-Analysis-3-13 }
+chem_original[5,]
+chem[5,]
+```
+
+You can see that scaling centered the concentration distribution of each chemical around 0.
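+
+We can verify this behavior of `scale()` on a small toy matrix: each column of the scaled output has a mean of ~0 and a standard deviation of 1.
+
+```{r 6-5-scale-check-sketch}
+# A small toy matrix with two columns on different scales
+m <- matrix(c(1, 5, 9, 2, 4, 12), nrow = 3)
+m_scaled <- scale(m)
+
+round(colMeans(m_scaled), 10) # each column mean is now 0
+apply(m_scaled, 2, sd)        # each column standard deviation is now 1
+```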
+
+Now, we can run PCA on the scaled data:
+```{r 6-5-Mixtures-Analysis-3-14 }
+chem_pca <- princomp(chem)
+```
+
+Looking at the scree plot, we see the first two principal components capture most of the variance in the data (~64%):
+```{r 6-5-Mixtures-Analysis-3-15, fig.align = "center"}
+fviz_eig(chem_pca, addlabels = TRUE)
+```
+
+
+Here are the resulting PCA scores for each sample, for each principal component (shown here as components 1-12):
+```{r 6-5-Mixtures-Analysis-3-16 }
+head(chem_pca$scores)
+```
+
+And here are the resulting loading factors, representing each chemical's contribution toward each principal component. In the next step, we'll arrange these by each chemical's contribution to PC1, the component accounting for the most variation in the data.
+```{r 6-5-Mixtures-Analysis-3-17 }
+head(chem_pca$loadings)
+```
+
+We can save the chemical-specific loadings into a separate matrix and view them from highest to lowest values for PC1.
+```{r 6-5-Mixtures-Analysis-3-18 }
+loadings <- as.data.frame.matrix(chem_pca$loadings)
+loadings %>% arrange(desc(Comp.1))
+```
+
+These resulting loading factors allow us to identify which constituents (of the 12 total) contribute most to the principal components explaining variability in the data. For instance, we can see here that **Quercetin** is listed at the top, with the largest loading value for principal component 1. Thus, Quercetin is the constituent that contributes to the overall variability in the dataset to the greatest extent. The next three chemicals are all **Ginkgolide** constituents, followed by **Bilobalide** and **Kaempferol**, and so forth.
+
+If we look at principal component 2 (PC2), we can now see a different set of chemicals contributing to the variability captured in this component:
+```{r 6-5-Mixtures-Analysis-3-19 }
+loadings %>% arrange(desc(Comp.2))
+```
+
+Here, **Ginkgolic Acids** are listed first.
+
+We can also visualize sample groupings based on these principal components 1 & 2:
+
+```{r 6-5-Mixtures-Analysis-3-20, warning=FALSE, message=FALSE, fig.height=6, fig.width=8, fig.align = "center"}
+# First pull the percent variation captured by each component
+pca_percent <- round(100*chem_pca$sdev^2/sum(chem_pca$sdev^2),1)
+
+# Then make a dataframe for the PCA plot generation script using the first two components
+pca_df <- data.frame(PC1 = chem_pca$scores[,1], PC2 = chem_pca$scores[,2])
+
+# Plot this dataframe
+chem_pca_plt <- ggplot(pca_df, aes(PC1,PC2))+
+ geom_hline(yintercept = 0, size=0.3)+
+ geom_vline(xintercept = 0, size=0.3)+
+ geom_point(size=3, color="deepskyblue3") +
+ geom_text(aes(label=rownames(pca_df)), fontface="bold", position=position_jitter(width=0.4,height=0.4))+
+ labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
+ ggtitle("GbE Sample PCA by Chemistry Profiles")
+
+
+# Changing the colors of the titles and axis text
+chem_pca_plt <- chem_pca_plt + theme(plot.title=element_text(color="deepskyblue3", face="bold"),
+ axis.title.x=element_text(color="deepskyblue3", face="bold"),
+ axis.title.y=element_text(color="deepskyblue3", face="bold"))
+
+# Viewing this resulting plot
+chem_pca_plt
+```
+
+This plot tells us a lot about sample groupings based on chemical profiles!
+
+### Answer to Environmental Health Question 1
+:::question
+With this, we can answer **Environmental Health Question 1**: Based on the chemical analysis, which *Ginkgo biloba* extract looks the most different?
+:::
+
+:::answer
+**Answer:** GbE_G
+:::
+
+### Answer to Environmental Health Question 2
+:::question
+ We can also answer **Environmental Health Question 2**: When viewing the variability between chemical profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
+:::
+
+:::answer
+**Answer:** Approximately 4 (though one could argue +1/-1): a bottom left group, a bottom right group, and the two completely separate samples GbE_G and GbE_N.
+:::
+
+
+As an alternative way of viewing the chemical profile data, we can make a heatmap of the scaled chemistry data. We concurrently run hierarchical clustering, which shows us how closely samples are related to each other based on algorithms distinct from the data-reduction approach of PCA. Samples that fall on nearby branches are more similar, while samples that don't share branches with many/any others are often considered outliers.
+
+By default, `pheatmap()` uses Euclidean distance, a very common distance metric, to cluster the observations.
+For more details, see the following description of [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), and for more information on hierarchical clustering, see **TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications**.
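+
+To make this distance metric concrete, here is a small sketch showing that R's `dist()` function returns the familiar Euclidean formula (the square root of the summed squared differences):
+
+```{r 6-5-euclidean-sketch}
+# Two hypothetical samples with three measured features each
+a <- c(1, 2, 3)
+b <- c(4, 6, 3)
+
+# Euclidean distance computed by hand
+sqrt(sum((a - b)^2))
+
+# The same value from dist(), the default metric used for clustering
+dist(rbind(a, b))
+```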
+```{r 6-5-Mixtures-Analysis-3-21, warning=FALSE, message=FALSE, fig.align = "center"}
+chem_hm <- pheatmap(chem, main="GbE Sample Heatmap by Chemistry Profiles",
+ cluster_rows=TRUE, cluster_cols = FALSE,
+ angle_col = 45, fontsize_col = 7, treeheight_row = 60)
+```
+
+This plot tells us a lot about the individual chemicals that differentiate the sample groupings.
+
+### Answer to Environmental Health Question 3
+:::question
+With this, we can answer **Environmental Health Question 3**: Based on the chemical analysis, which chemicals do you think are important in differentiating between the different *Ginkgo biloba* samples?
+:::
+
+:::answer
+**Answer:** All of the chemicals technically contribute to these sample patterns, but here are some that stand out: (i) Ginkgolic_Acid_C15 and Ginkgolic_Acid_C17 appear to drive the clustering of one particular GbE sample, GbE_G, as well as potentially GbE_N; (ii) Isorhamnetin influences the clustering of GbE_T; (iii) Bilobalide, Ginkgolides A & B, and Quercetin are also important because they show a general cluster of abundance at decreased levels at the bottom and increased levels at the top.
+:::
+
+Let's now revisit the PCA plot:
+```{r 6-5-Mixtures-Analysis-3-22, warning=FALSE, message=FALSE, fig.height=3, fig.width=5, fig.align = "center"}
+chem_pca_plt
+```
+
+GbE_G and GbE_N look so different from the rest of the samples that they could be outliers and potentially influencing overall data trends. Let's make sure that, if we remove these two samples, our sample groupings still look the same.
+
+First, we remove those two samples from the dataframe:
+```{r 6-5-Mixtures-Analysis-3-23, warning=FALSE, message=FALSE}
+chem_filt <- chem %>%
+ rownames_to_column("Sample") %>%
+ filter(!Sample %in% c("GbE_G","GbE_N")) %>%
+ column_to_rownames("Sample")
+```
+
+Then, we can re-run PCA and generate a heatmap of the chemical data with these outlier samples removed:
+```{r 6-5-Mixtures-Analysis-3-24, warning=FALSE, message=FALSE, fig.align = "center"}
+chem_filt_pca <- princomp(chem_filt)
+
+# Get the percent variation captured by each component
+pca_percent_filt <- round(100*chem_filt_pca$sdev^2/sum(chem_filt_pca$sdev^2),1)
+
+# Make dataframe for PCA plot generation using the first two components
+pca_df_filt <- data.frame(PC1 = chem_filt_pca$scores[,1], PC2 = chem_filt_pca$scores[,2])
+
+# Plot this dataframe
+chem_filt_pca_plt <- ggplot(pca_df_filt, aes(PC1,PC2))+
+ geom_hline(yintercept = 0, size=0.3)+
+ geom_vline(xintercept = 0, size=0.3)+
+ geom_point(size=3, color="aquamarine2") +
+ geom_text(aes(label=rownames(pca_df_filt)), fontface="bold", position=position_jitter(width=0.5,height=0.5))+
+ labs(x=paste0("PC1 (",pca_percent_filt[1],"%)"), y=paste0("PC2 (",pca_percent_filt[2],"%)"))+
+ ggtitle("GbE Sample PCA by Chemistry Profiles excluding Potential Outliers")
+
+# Changing the colors of the titles and axis text
+chem_filt_pca_plt <- chem_filt_pca_plt + theme(plot.title=element_text(color="aquamarine2", face="bold"),
+ axis.title.x=element_text(color="aquamarine2", face="bold"),
+ axis.title.y=element_text(color="aquamarine2", face="bold"))
+
+# Viewing this resulting plot
+chem_filt_pca_plt
+```
+
+
+To view the PCA plots of all samples vs filtered samples:
+```{r 6-5-Mixtures-Analysis-3-25, warning=FALSE, message=FALSE, fig.height=9, fig.width=8, fig.align = "center"}
+grid.arrange(chem_pca_plt, chem_filt_pca_plt)
+```
+
+
+### Answer to Environmental Health Question 4
+:::question
+With this, we can answer **Environmental Health Question 4**: After removing two samples that have the most different chemical profiles (and are thus, potential outliers), do we obtain similar chemical groupings?
+:::
+
+:::answer
+**Answer:** Yes! Removal of the potential outliers basically spreads the rest of the remaining data points out, since there is less variance in the overall dataset, and thus, more room to show variance amongst the remaining samples. The general locations of the samples on the PCA plot, however, remain consistent. We now feel confident that our similarity analysis is producing consistent grouping results.
+:::
+
+
+
+## Toxicity-Based Sufficient Similarity Analysis
+
+Now, we will perform sufficient similarity analysis using the toxicity data. Unlike the chemistry dataset, we use the toxicity dataset as is, without scaling, because we want to focus on genes that show a large response to the exposure condition and de-emphasize genes that show little response. If we scaled these data, we would reduce this needed variability.
+
+Here, we first move the sample column to row names:
+```{r 6-5-Mixtures-Analysis-3-26, warning=FALSE, message=FALSE}
+tox <- tox %>% column_to_rownames("Sample")
+```
+
+Then, we can run PCA on this tox dataframe:
+```{r 6-5-Mixtures-Analysis-3-27, warning=FALSE, message=FALSE}
+tox_pca <- princomp(tox)
+```
+
+Looking at the scree plot, we see the first two principal components capture most of the variation (~93%):
+```{r 6-5-Mixtures-Analysis-3-28, warning=FALSE, message=FALSE, fig.align = "center"}
+fviz_eig(tox_pca, addlabels = TRUE)
+```
+
+We can then create a plot of the samples by principal components:
+```{r 6-5-Mixtures-Analysis-3-29, warning=FALSE, message=FALSE, fig.height=7, fig.width=6, fig.align = "center"}
+# Get the percent variation captured by each component
+pca_percent <- round(100*tox_pca$sdev^2/sum(tox_pca$sdev^2),1)
+
+# Make dataframe for PCA plot generation using the first two components
+tox_pca_df <- data.frame(PC1 = tox_pca$scores[,1], PC2 = tox_pca$scores[,2])
+
+# Plot the first two components
+tox_pca_plt <- ggplot(tox_pca_df, aes(PC1,PC2))+
+ geom_hline(yintercept = 0, size=0.3)+
+ geom_vline(xintercept = 0, size=0.3)+
+ geom_point(size=3, color="deeppink3") +
+ geom_text(aes(label=rownames(tox_pca_df)), fontface="bold", position=position_jitter(width=0.25,height=0.25))+
+ labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
+ ggtitle("GbE Sample PCA by Toxicity Profiles")
+
+# Changing the colors of the titles and axis text
+tox_pca_plt <- tox_pca_plt + theme(plot.title=element_text(color="deeppink3", face="bold"),
+ axis.title.x=element_text(color="deeppink3", face="bold"),
+ axis.title.y=element_text(color="deeppink3", face="bold"))
+
+tox_pca_plt
+```
+
+This plot tells us a lot about sample groupings based on toxicity profiles!
+
+### Answer to Environmental Health Question 5
+:::question
+With this, we can answer **Environmental Health Question 5**: When viewing the variability between toxicity profiles, how many groupings of potentially ‘sufficiently similar’ *Ginkgo biloba* samples do you see?
+:::
+
+:::answer
+**Answer:** Approximately 3 (though one could argue +1/-1): a top left group, a top right group, and GbE_M and GbE_W.
+:::
+
+
+Similar to the chemistry data, as an alternative way of viewing the toxicity profile data, we can make a heatmap of the toxicity data:
+```{r 6-5-Mixtures-Analysis-3-30, warning=FALSE, message=FALSE, fig.align = "center"}
+tox_hm <- pheatmap(tox, main="GbE Sample Heatmap by Toxicity Profiles",
+ cluster_rows=TRUE, cluster_cols = FALSE,
+ angle_col = 45, fontsize_col = 7, treeheight_row = 60)
+```
+
+This plot tells us a lot about the individual genes that differentiate the sample groupings!
+
+### Answer to Environmental Health Question 6
+:::question
+With this, we can answer **Environmental Health Question 6**: Based on the toxicity analysis, which genes do you think are important in differentiating between the different *Ginkgo biloba* samples?
+:::
+
+:::answer
+**Answer:** It looks like the CYP enzyme genes, particularly CYP2B6, are highly up-regulated in response to several of these sample exposures, and thus dictate a lot of these groupings.
+:::
+
+
+
+## Comparing Chemistry vs. Toxicity Sufficient Similarity Analyses
+
+Let's view the PCA plots for both datasets together, side-by-side:
+```{r 6-5-Mixtures-Analysis-3-31, fig.height=8, fig.width=11, fig.align = "center"}
+pca_compare <- grid.arrange(chem_pca_plt,tox_pca_plt, nrow=1)
+```
+
+Let's also view the PCA plots for both datasets together, top-to-bottom, to visualize the trends along both axes better between these two views:
+```{r 6-5-Mixtures-Analysis-3-32, fig.height=10, fig.width=10, fig.align = "center"}
+pca_compare <- grid.arrange(chem_pca_plt,tox_pca_plt)
+```
+
+Here is an edited version of the above figures, highlighting with colored circles some chemical groups of interest identified through chemistry vs toxicity-based sufficient similarity analyses:
+
+```{r 6-5-Mixtures-Analysis-3-33, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_Image1.png")
+```
+
+
+### Answer to Environmental Health Question 7
+:::question
+With this, we can answer **Environmental Health Question 7**: Were similar chemical groups identified when looking at just the chemistry vs. just the toxicity? How could this impact regulatory action, if we only had one of these datasets?
+:::
+
+:::answer
+**Answer:** There are some similarities between groupings, though there are also notable differences. For example, samples GbE_A, GbE_B, GbE_C, GbE_F, and GbE_H group together in both the chemistry and toxicity similarity analyses. However, samples GbE_G, GbE_W, GbE_N, and others clearly demonstrate differences in grouping assignments. These differences could impact the accuracy of regulatory decisions: if regulation were dictated solely by the chemistry (without toxicity data), or vice versa, we might miss important information that could aid in accurate health risk evaluations.
+:::
+
+### Additional Methods
+
+Although we focused on sufficient similarity for this module, a number of other approaches exist to evaluate mixtures. For example, **relative potency factors** represent another component-based approach that can be used to evaluate mixtures. Component-based approaches use data from individual chemicals (components of the mixture) and additivity models to estimate the effects of the mixture. For other methods, also see **TAME 2.0 Module 6.3 Mixtures I: Overview and Quantile G-Computation Application** and **TAME 2.0 Module 6.4 Mixtures II: BKMR Application**.
+
+
+
+## Concluding Remarks
+
+In this module, we evaluated the similarity between variable lots of *Ginkgo biloba* and identified sample groupings that could be used for chemical risk assessment purposes. Together, this example highlights the utility of sufficient similarity analyses to address environmental health research questions.
+
+### Additional Resources
+
+Some helpful resources that provide further background on the topic of mixtures toxicology and mixtures modeling include the following:
+
++ Carlin DJ, Rider CV, Woychik R, Birnbaum LS. Unraveling the health effects of environmental mixtures: an NIEHS priority. Environ Health Perspect. 2013 Jan;121(1):A6-8. PMID: [23409283](https://pubmed.ncbi.nlm.nih.gov/23409283/).
+
++ Drakvik E, Altenburger R, Aoki Y, Backhaus T, Bahadori T, Barouki R, Brack W, Cronin MTD, Demeneix B, Hougaard Bennekou S, van Klaveren J, Kneuer C, Kolossa-Gehring M, Lebret E, Posthuma L, Reiber L, Rider C, Rüegg J, Testa G, van der Burg B, van der Voet H, Warhurst AM, van de Water B, Yamazaki K, Öberg M, Bergman Å. Statement on advancing the assessment of chemical mixtures and their risks for human health and the environment. Environ Int. 2020 Jan;134:105267. PMID: [31704565](https://pubmed.ncbi.nlm.nih.gov/31704565/).
+
++ Rider CV, McHale CM, Webster TF, Lowe L, Goodson WH 3rd, La Merrill MA, Rice G, Zeise L, Zhang L, Smith MT. Using the Key Characteristics of Carcinogens to Develop Research on Chemical Mixtures and Cancer. Environ Health Perspect. 2021 Mar;129(3):35003. PMID: [33784186](https://pubmed.ncbi.nlm.nih.gov/33784186/).
+
+
++ Taylor KW, Joubert BR, Braun JM, Dilworth C, Gennings C, Hauser R, Heindel JJ, Rider CV, Webster TF, Carlin DJ. Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology: Lessons from an Innovative Workshop. Environ Health Perspect. 2016 Dec 1;124(12):A227-A229. PMID: [27905274](https://pubmed.ncbi.nlm.nih.gov/27905274/).
+
+
+For more information and additional examples in environmental health research, see the following relevant publications implementing sufficient similarity methods to address complex mixtures:
+
++ Catlin NR, Collins BJ, Auerbach SS, Ferguson SS, Harnly JM, Gennings C, Waidyanatha S, Rice GE, Smith-Roe SL, Witt KL, Rider CV. How similar is similar enough? A sufficient similarity case study with Ginkgo biloba extract. Food Chem Toxicol. 2018 Aug;118:328-339. PMID: [29752982](https://pubmed.ncbi.nlm.nih.gov/29752982/).
+
++ Collins BJ, Kerns SP, Aillon K, Mueller G, Rider CV, DeRose EF, London RE, Harnly JM, Waidyanatha S. Comparison of phytochemical composition of Ginkgo biloba extracts using a combination of non-targeted and targeted analytical approaches. Anal Bioanal Chem. 2020 Oct;412(25):6789-6809. PMID: [32865633](https://pubmed.ncbi.nlm.nih.gov/32865633/).
+
++ Ryan KR, Huang MC, Ferguson SS, Waidyanatha S, Ramaiahgari S, Rice JR, Dunlap PE, Auerbach SS, Mutlu E, Cristy T, Peirfelice J, DeVito MJ, Smith-Roe SL, Rider CV. Evaluating Sufficient Similarity of Botanical Dietary Supplements: Combining Chemical and In Vitro Biological Data. Toxicol Sci. 2019 Dec 1;172(2):316-329. PMID: [31504990](https://pubmed.ncbi.nlm.nih.gov/31504990/).
+
++ Rice GE, Teuschler LK, Bull RJ, Simmons JE, Feder PI. Evaluating the similarity of complex drinking-water disinfection by-product mixtures: overview of the issues. J Toxicol Environ Health A. 2009;72(7):429-36. PMID: [19267305](https://pubmed.ncbi.nlm.nih.gov/19267305/).
+
+
+
+
+
+:::tyk
+We recently published a study evaluating similarities across wildfire chemistry profiles using a more advanced analysis approach than the one described in this module (PMID: [36399130](https://pubmed.ncbi.nlm.nih.gov/36399130/)). For this test your knowledge box, let’s implement the simpler, PCA-based sufficient similarity analysis to identify groups of biomass smoke exposure signatures using chemical profiles. The relevant dataset is included in the file *Module6_5_TYKInput.csv*. Specifically:
+
+1. Perform a PCA on the chemistry data and visualize the proximity of each chemical signature to other signatures according to the first two principal components.
+
+2. Identify major groupings of biomass smoke exposure signatures.
+:::
diff --git a/Chapter_6/Module6_5_Input/Module6_5_Image1.png b/Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_Image1.png
similarity index 100%
rename from Chapter_6/Module6_5_Input/Module6_5_Image1.png
rename to Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_Image1.png
diff --git a/Chapter_6/Module6_5_Input/Module6_5_InputData.xlsx b/Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_InputData.xlsx
similarity index 100%
rename from Chapter_6/Module6_5_Input/Module6_5_InputData.xlsx
rename to Chapter_6/6_5_Mixtures_Analysis_3/Module6_5_InputData.xlsx
diff --git a/Chapter_6/6_6_Toxicokinetic_Modeling/6_6_Toxicokinetic_Modeling.Rmd b/Chapter_6/6_6_Toxicokinetic_Modeling/6_6_Toxicokinetic_Modeling.Rmd
new file mode 100644
index 0000000..f58bdbc
--- /dev/null
+++ b/Chapter_6/6_6_Toxicokinetic_Modeling/6_6_Toxicokinetic_Modeling.Rmd
@@ -0,0 +1,1170 @@
+
+# 6.6 Toxicokinetic Modeling
+
+This training module was developed by Caroline Ring, Lauren E. Koval, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+*Disclaimer: The views expressed in this document are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.*
+
+
+## Introduction to Training Module
+
+This module serves as an example to guide trainees through the basics of toxicokinetic (TK) modeling and how this type of modeling can be used in the high-throughput setting for environmental health research applications.
+
+In this activity, the capabilities of the high-throughput toxicokinetic modeling package 'httk' are demonstrated on a suite of environmentally relevant chemicals. The httk R package implements high-throughput toxicokinetic modeling (hence, 'httk'), including a generic physiologically based toxicokinetic (PBTK) model as well as tables of the chemical-specific parameters needed to solve the model for hundreds of chemicals. Example modeling estimates are produced for the high-interest environmental chemical bisphenol-A, including an example script that derives its plasma concentration at steady state.
+
+The concept of reverse toxicokinetics is explained and demonstrated, again using bisphenol-A as an example chemical.
+
+This module then demonstrates the derivation of the bioactivity-exposure ratio (BER) across many chemicals leveraging the capabilities of httk, while incorporating exposure measures. BERs are particularly useful in the evaluation of chemical risk, as they take into account both toxicity (i.e., *in vitro* potency) and exposure rates, the two essential components used in risk calculations for chemical safety and prioritization evaluations. Therefore, estimates of both potency and exposure are needed to calculate BERs, as described in this training module.
+
+For potency estimates, the ToxCast high-throughput screening library is introduced as an example high-throughput dataset to carry out in vitro to in vivo extrapolation (IVIVE) modeling through httk. ToxCast activity concentrations that elicit 50% maximal bioactivity (AC50) are uploaded and organized as inputs, and then the tenth percentile ToxCast AC50 is calculated for each chemical (in other words, across all ToxCast screening assays, the tenth percentile of AC50 values was carried forward). These concentrations then serve as the potency estimates. For exposure estimates, previously generated estimates inferred from CDC NHANES urinary biomonitoring data are used.
+
+The bioactivity-exposure ratio (BER) is then calculated across chemicals with both potency and exposure estimate information. This ratio is calculated as the lower-end equivalent dose (for the most-sensitive 5\% of the population) divided by the upper-end estimated exposure (here, the upper bound on the inferred population median exposure). Chemicals are then ranked based on the resulting BERs and visualized through plots. The importance of these chemical prioritizations is then discussed in relation to environmental health research and corresponding regulatory decisions.
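+
+In equation form, the BER described above can be sketched as:
+
+$$ BER = \frac{AED_{5\textrm{th percentile}}}{\textrm{Exposure}_{\textrm{upper bound}}} $$
+
+where the numerator is the lower-end equivalent dose for the most-sensitive 5\% of the population and the denominator is the upper bound on the inferred population median exposure (the labels here are generic descriptors, not *httk* variable names).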
+
+## Introduction to Toxicokinetic Modeling
+
+To understand what toxicokinetic modeling is, consider the following scenario:
+
+```{r 6-6-Toxicokinetic-Modeling-1, echo=FALSE, fig.align = "center" }
+knitr::include_graphics("Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image1.png")
+```
+
+Simply put, toxicokinetics answers these questions by describing "what the body does to the chemical" after an exposure scenario.
+
+More technically, **toxicokinetic modeling** refers to the evaluation of the uptake and disposition of a chemical in the body.
+
+### Notes on terminology
+Pharmacokinetics (PK) is a synonym for toxicokinetics (TK). They are often used interchangeably. PK connotes pharmaceuticals; TK connotes environmental chemicals – but those connotations are weak.
+
+A common abbreviation that you will also see in this research field is **ADME**, which stands for:
+
++ **Absorption:** How does the chemical get absorbed into the body tissues?
++ **Distribution:** Where does the chemical go inside the body?
++ **Metabolism:** How do enzymes in the body break apart the chemical molecules?
++ **Excretion:** How does the chemical leave the body?
+
+To place this term into the context of TK, TK models describe ADME mathematically by representing the body as compartments and flows.
+
+
+### Types of TK models
+TK models describe the body mathematically as one or more "compartments" connected by "flows." The compartments represent organs or tissues. Using mass balance equations, the amount or concentration of chemical in each compartment is described as a function of time.
+
+Types of models discussed throughout this training module are described here.
+
+#### 1 Compartment Model
+The simplest TK model is a 1-compartment model, where the body is assumed to be one big well-mixed compartment.
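+
+As a generic textbook sketch (not the exact equations used in *httk*), a 1-compartment model with first-order oral absorption and first-order elimination can be written as:
+
+$$ \frac{dA_{\textrm{gut}}}{dt} = -k_a A_{\textrm{gut}}, \qquad \frac{dA_{\textrm{body}}}{dt} = k_a A_{\textrm{gut}} - k_e A_{\textrm{body}}, \qquad C = \frac{A_{\textrm{body}}}{V_d} $$
+
+where $A_{\textrm{gut}}$ and $A_{\textrm{body}}$ are amounts of chemical, $k_a$ is the absorption rate constant, $k_e$ is the elimination rate constant, and $V_d$ is the volume of distribution relating the amount in the body to a concentration $C$.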
+
+#### 3 Compartment Model
+A 3-compartment model mathematically incorporates three distinct body compartments that can exhibit different parameters contributing to their individual mass balances. Commonly used compartments in 3-compartment modeling include tissues such as blood plasma, liver, gut, and kidney, and/or a 'rest of body' term; the specific compartments included depend on the chemical under evaluation, the exposure scenario, and modeling assumptions.
+
+#### PBTK Model
+A physiologically-based TK (PBTK) model incorporates compartments and flows that represent real physiological quantities (as opposed to the aforementioned empirical 1- and 3-compartment models). PBTK models have more parameters overall, including parameters representing physiological quantities that are known *a priori* based on studies of anatomy. The only PBTK model parameters that need to be estimated for each new chemical are parameters representing chemical-body interactions, which can include the following:
+
+- Rate of hepatic metabolism: How fast does the liver break down the chemical?
+- Plasma protein binding: How tightly does the chemical bind to proteins in blood plasma? The liver may not be able to break down chemical that is bound to plasma protein.
+- Blood:tissue partition coefficients: Assuming the chemical diffuses between blood and other tissues very quickly compared to the rate of blood flow, the ratio of the concentration in blood to the concentration in each tissue is approximately constant; this constant is the partition coefficient.
+- Rate of active transport into/out of a tissue: Relevant if the chemical moves between blood and tissues not just by passive diffusion, but also by cells actively transporting it into or out of the tissue.
+- Binding to other tissues: Some chemical may be bound inside a tissue and not available for diffusion or transport in/out.
+
+
+Types of TK modeling can also fall into the following major categories:
+
+1. **Forward TK Modeling:** External exposure doses are converted into internal doses (i.e., concentrations of chemicals/drugs in one or more body tissues of interest).
+2. **Reverse TK Modeling:** The reverse of the above, where internal doses are converted into external exposure doses.
+
+
+### Other TK modeling resources
+
+For further information on TK modeling background, math, and example models, there are additional resources online including a helpful course website on [Basic Pharmacokinetics](https://www.boomer.org/c/p4/) by Dr. Bourne.
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 6-6-Toxicokinetic-Modeling-2 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step. Otherwise, run the code below, which checks installation status for you:
+```{r 6-6-Toxicokinetic-Modeling-3, results=FALSE, message=FALSE}
+if(!nzchar(system.file(package = "ggplot2"))){
+ install.packages("ggplot2")}
+if(!nzchar(system.file(package = "reshape2"))){
+ install.packages("reshape2")}
+if(!nzchar(system.file(package = "stringr"))){
+ install.packages("stringr")}
+if(!nzchar(system.file(package = "httk"))){
+ install.packages("httk")}
+if(!nzchar(system.file(package = "eulerr"))){
+ install.packages("eulerr")}
+```
+
+
+#### Loading R packages required for this session
+```{r 6-6-Toxicokinetic-Modeling-4, results=FALSE, message=FALSE}
+library(ggplot2) # ggplot2 will be used to generate associated graphics
+library(reshape2) # reshape2 will be used to organize and transform datasets
+library(stringr) # stringr will be used to aid in various data manipulation steps through this module
+library(httk) # httk package will be used to carry out all toxicokinetic modeling steps
+library(eulerr) #eulerr package will be used to generate Venn/Euler diagram graphics
+```
+
+
+For more information on the *ggplot2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/ggplot2/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5).
+
+For more information on the *reshape2* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/reshape2/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4).
+
+For more information on the *stringr* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/stringr/index.html) and [RDocumentation webpage](https://www.rdocumentation.org/packages/stringr/versions/1.4.0).
+
+For more information on the *httk* package, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/httk/index.html) and parent publication by [Pearce et al. (2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134854/).
+
+#### More information on the httk package
+You can see an overview of the *httk* package by typing `?httk` at the R command line.
+
+You can see a browsable index of all functions in the *httk* package by typing `help(package="httk")` at the R command line.
+
+You can see a browsable list of vignettes by typing `browseVignettes("httk")` at the R command line. (Please note that some of these vignettes were written using older versions of the package and may no longer work as written -- specifically the Ring (2017) vignette, which I wrote back in 2016. The *httk* team is actively working on updating these.)
+
+You can get information about any function in *httk*, or indeed any function in any R package, by typing `help()` and placing the function name in quotation marks inside the parentheses. For example, to get information about the *httk* function `solve_model()`, type this:
+
+```{r 6-6-Toxicokinetic-Modeling-5, eval=FALSE}
+help("solve_model")
+```
+
+Note that this module was run with `httk` version 2.4.0.
+
+#### Set your working directory
+```{r 6-6-Toxicokinetic-Modeling-6, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+(1) After solving the TK model that evaluates bisphenol-A, what is the maximum concentration of bisphenol-A estimated to occur in human plasma, after 1 exposure dose of 1 mg/kg/day?
+
+(2) After solving the TK model that evaluates bisphenol-A, what is the steady-state concentration of bisphenol-A estimated to occur in human plasma, for a long-term oral infusion dose of 1 mg/kg/day?
+
+(3) What is the predicted range of bisphenol-A concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentiles.
+
+(4) Considering the chemicals evaluated in the above TK modeling example, do the $C_{ss}$-dose slope distributions become wider as the median $C_{ss}$-dose slope increases?
+
+(5) How many chemicals have available AC50 values to evaluate in the current ToxCast/Tox21 high-throughput screening database?
+
+(6) What are the chemicals with the three lowest predicted equivalent doses (for tenth-percentile ToxCast AC50s), for the most-sensitive 5\% of the population?
+
+(7) Based on httk modeling estimates, are chemicals with higher bioactivity-exposure ratios always less potent than chemicals with lower bioactivity-exposure ratios?
+
+(8) Based on httk modeling estimates, do chemicals with higher bioactivity-exposure ratios always have lower estimated exposures than chemicals with lower bioactivity-exposure ratios?
+
+(9) How are chemical prioritization results different when using only hazard information vs. only exposure information vs. bioactivity-exposure ratios?
+
+(10) Of the three datasets used in this training module -- bioactivity from ToxCast, TK data from *httk*, and exposure inferred from NHANES urinary biomonitoring -- which one most limits the number of chemicals that can be prioritized using BERs?
+
+
+
+## Data and Models used in Toxicokinetic Modeling (TK)
+
+### Common Models used in TK Modeling, that are Provided as Built-in Models in httk
+
+There are five TK models currently built into *httk*. They are:
+
+* **pbtk**: A physiologically-based TK model with oral absorption. Contains the following compartments: gutlumen, gut, liver, kidneys, veins, arteries, lungs, and the rest of the body. Chemical is metabolized by the liver and excreted by the kidneys via glomerular filtration.
+* **gas_pbtk**: A PBTK model with absorption via inhalation. Contains the same compartments as `pbtk`.
+* **1compartment**: A simple one-compartment TK model with oral absorption.
+* **3compartment**: A three-compartment TK model with oral absorption. Compartments are gut, liver, and rest of body.
+* **3compartmentss**: The steady-state solution to the 3-compartment model under an assumption of constant infusion dosing, without considering tissue partitioning. This was the first *httk* model (see Wambaugh et al. 2015, Wetmore et al. 2012, Rotroff et al. 2010).
+
+### Chemical-Specific TK Data Built Into 'httk'
+
+Each of these TK models has chemical-specific parameters. The chemical-specific TK information needed to parameterize these models is built into `httk`, in the form of a built-in lookup table in a data.frame called `chem.physical_and_invitro.data`. This lookup table means that in order to run a TK model for a particular chemical, you only need to specify the chemical.
+
+Look at the first few rows of this data.frame to see everything that's in there (it is a lot of information).
+
+```{r 6-6-Toxicokinetic-Modeling-7 }
+head(chem.physical_and_invitro.data)
+```
+
+The table contains chemical identifiers: name, CASRN (Chemical Abstract Service Registry Number), and DTXSID (DSSTox ID, a chemical identifier from the EPA Distributed Structure-Searchable Toxicity Database, DSSTox for short -- more information can be found at https://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database). The table also contains physical-chemical properties for each chemical. These are used in predicting tissue partitioning.
+
+The table contains *in vitro* measured chemical-specific TK parameters, if available. These chemical-specific parameters include intrinsic hepatic clearance (`Clint`) and fraction unbound to plasma protein (`Funbound.plasma`) for each chemical. It also contains measured values for oral absorption fraction `Fgutabs`, and for the partition coefficient between blood and plasma `Rblood2plasma`, if these values have been measured for a given chemical. If available, there may be chemical-specific TK values for multiple species.
+
+#### Listing chemicals for which a TK model can be parameterized
+
+You can easily get a list of all the chemicals for which a specific TK model can be parameterized (for a given species, if needed) using the function `get_cheminfo()`.
+
+For example, here is how you get a list of all the chemicals for which the PBTK model can be parameterized for humans.
+
+```{r 6-6-Toxicokinetic-Modeling-8, warning = FALSE}
+chems_pbtk <- get_cheminfo(info = c("Compound", "CAS", "DTXSID"),
+ model = "pbtk",
+ species = "Human")
+
+head(chems_pbtk) #first few rows
+```
+
+
+How many such chemicals have parameter data to run a PBTK model in this package?
+```{r 6-6-Toxicokinetic-Modeling-9 }
+nrow(chems_pbtk)
+```
+
+Here is how you get all the chemicals for which the 3-compartment steady-state model can be parameterized for humans.
+```{r 6-6-Toxicokinetic-Modeling-10 }
+chems_3compss <- get_cheminfo(info = c("Compound", "CAS", "DTXSID"),
+ model = "3compartmentss",
+ species = "Human")
+```
+
+How many such chemicals have parameter data to run a 3-compartment steady-state model in this package?
+```{r 6-6-Toxicokinetic-Modeling-11 }
+nrow(chems_3compss)
+```
+
+The 3-compartment steady-state model can be parameterized for a few more chemicals than the PBTK model, because it is a simpler model and requires less data to parameterize. Specifically, the 3-compartment steady-state model does not require estimating tissue partition coefficients, unlike the PBTK model.
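+
+If you want to see exactly which chemicals fall into one list but not the other, you can compare the identifier columns returned by `get_cheminfo()`. The small sketch below assumes the `DTXSID` column requested above is present in both data frames:
+
+```{r 6-6-Toxicokinetic-Modeling-cheminfo-sketch, eval=FALSE}
+# Chemicals with enough data for the 3-compartment steady-state model,
+# but not enough for the full PBTK model
+extra_chems <- setdiff(chems_3compss$DTXSID, chems_pbtk$DTXSID)
+length(extra_chems) #number of additional chemicals
+```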
+
+### Solving Toxicokinetic Models to Obtain Internal Chemical Concentration vs. Time Predictions
+
+You can solve any of the models for a specified chemical and specified dosing protocol, and get concentration vs. time predictions, using the function `solve_model()`. For example:
+
+```{r 6-6-Toxicokinetic-Modeling-12, warning=FALSE}
+sol_pbtk <- solve_model(chem.name = "Bisphenol-A", #chemical to simulate
+ model = "pbtk", #TK model to use
+ dosing = list(initial.dose = NULL, #for repeated dosing, if first dose is different from the rest, specify first dose here
+ doses.per.day = 1, #number of doses per day
+ daily.dose = 1, #total daily dose in mg/kg units
+ dosing.matrix = NULL), #used to specify more complicated dosing protocols
+ days = 1) #number of days to simulate
+```
+
+There are some cryptic-sounding warnings that can safely be ignored. (They are providing information about certain assumptions that were made while solving the model). Then there is a final message providing the units of the output.
+
+The output, assigned to `sol_pbtk`, is a matrix with concentration vs. time data for each of the compartments in the pbtk model. Time is in units of days. Additionally, the output traces the amount excreted via passive renal filtration (`Atubules`), the amount metabolized in the liver (`Ametabolized`), and the cumulative area under the curve for plasma concentration vs. time (`AUC`). Here are the first few rows of `sol_pbtk` so you can see the format.
+
+```{r 6-6-Toxicokinetic-Modeling-13 }
+head(sol_pbtk)
+```
+
+You can plot the results, for example plasma concentration vs. time.
+
+```{r 6-6-Toxicokinetic-Modeling-14, fig.align = "center"}
+sol_pbtk <- as.data.frame(sol_pbtk) #because ggplot2 requires data.frame input, not matrix
+
+ggplot(sol_pbtk) +
+ geom_line(aes(x = time,
+ y = Cplasma)) +
+ theme_bw() +
+ xlab("Time, days") +
+ ylab("Cplasma, uM") +
+ ggtitle("Plasma concentration vs. time for single dose 1 mg/kg Bisphenol-A")
+```
+
+### Calculating summary metrics of internal dose produced from TK models
+
+We can calculate summary metrics of internal dose -- peak concentration, average concentration, and AUC -- using the function `calc_tkstats()`. We have to specify the dosing protocol and length of simulation. Here, we use the same dosing protocol and simulation length as in the plot above.
+
+```{r 6-6-Toxicokinetic-Modeling-15, warning = FALSE}
+tkstats <- calc_tkstats(chem.name = "Bisphenol-A", #chemical to simulate
+ stats = c("AUC", "peak", "mean"), #which metrics to return (these are the only three choices)
+ model = "pbtk", #model to use
+ tissue = "plasma", #tissue for which to return internal dose metrics
+ days = 1, #length of simulation
+ daily.dose = 1, #total daily dose in mg/kg/day
+ doses.per.day = 1) #number of doses per day
+
+print(tkstats)
+```
+
+
+### Answer to Environmental Health Question 1
+:::question
+*With this, we can answer **Environmental Health Question #1***: After solving the TK model that evaluates bisphenol-A, what is the maximum concentration of bisphenol-A estimated to occur in human plasma, after 1 exposure dose of 1 mg/kg/day?
+:::
+
+:::answer
+**Answer**: The peak plasma concentration estimate for bisphenol-A, under the conditions tested, is 0.3779 uM.
+:::
+
+
+### Calculating steady-state concentration
+
+Another summary metric is the steady-state concentration: If the same dose is given repeatedly over many days, the body concentration will (usually) reach a steady state after some time. The value of this steady-state concentration, and the time needed to achieve steady state, are different for different chemicals. Steady-state concentrations are useful when considering long-term, low-level exposures, which is frequently the situation in environmental health.
+
+For example, here is a plot of plasma concentration vs. time for 1 mg/kg/day Bisphenol-A, administered for 12 days. You can see how the average plasma concentration reaches a steady state around 1.5 uM. Each peak represents one day's dose.
+
+```{r 6-6-Toxicokinetic-Modeling-16, warning = FALSE, fig.align = "center"}
+foo <- as.data.frame(solve_pbtk(
+ chem.name='Bisphenol-A',
+ daily.dose=1,
+ days=12,
+ doses.per.day=1,
+ tsteps=2))
+
+ggplot(foo) +
+ geom_line(aes(x = time,
+ y= Cplasma)) +
+ scale_x_continuous(breaks = seq(0,12)) +
+ xlab("Time, days") +
+ ylab("Cplasma, uM")
+```
+
+*httk* includes a function `calc_analytic_css()` to calculate the steady-state plasma concentration ($C_{ss}$ for short) analytically for each model, for a specified chemical and daily oral dose. This function assumes that the daily oral dose is administered as an oral infusion, rather than a single oral bolus dose -- in effect, that the daily dose is divided into many small doses over the day. Therefore, the result of `calc_analytic_css()` may be slightly different than our previous estimate based on the concentration vs. time plot from a single oral bolus dose every day.
+
+Here is the result of `calc_analytic_css()` for a 1 mg/kg/day dose of bisphenol-A.
+
+```{r 6-6-Toxicokinetic-Modeling-17, warning = FALSE}
+calc_analytic_css(chem.name = "Bisphenol-A",
+ daily.dose = 1,
+ output.units = "uM",
+ model = "pbtk",
+ concentration = "plasma")
+```
+
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: After solving the TK model that evaluates bisphenol-A, what is the steady-state concentration of bisphenol-A estimated to occur in human plasma, for a long-term oral infusion dose of 1 mg/kg/day?
+:::
+
+:::answer
+**Answer**: The steady-state plasma concentration estimate for bisphenol-A, under the conditions tested, is 0.9417 uM.
+:::
+
+
+
+### Steady-state concentration is linear with dose for httk models
+
+For the TK models included in the *httk* package, steady-state concentration is linear with dose for a given chemical. The slope of the line is simply the steady-state concentration for a dose of 1 mg/kg/day. This can be shown by solving `calc_analytic_css()` for several doses, and plotting the dose-$C_{ss}$ points along a line whose slope is equal to $C_{ss}$ for 1 mg/kg/day.
+
+```{r 6-6-Toxicokinetic-Modeling-18, fig.align = "center"}
+#choose five doses at which to find the Css
+doses <- c(0.1, #all mg/kg/day
+ 0.5,
+ 1.0,
+ 1.5,
+ 2.0)
+suppressWarnings(bpa_css <- sapply(doses,
+ function(dose) calc_analytic_css(chem.name = "Bisphenol-A",
+ daily.dose = dose,
+ output.units = "uM",
+ model = "pbtk",
+ concentration = "plasma",
+ suppress.messages = TRUE)))
+
+DF <- data.frame(dose = doses,
+ Css = bpa_css)
+
+#Plot the results
+Cssdosefig <- ggplot(DF) +
+ geom_point(aes(x = dose,
+ y = Css),
+ size = 3) +
+ geom_abline( #plot a straight line
+ intercept = 0, #intercept 0
+ slope = DF[DF$dose==1, #slope = Css for 1 mg/kg/day
+ "Css"],
+ linetype = 2
+ ) +
+ xlab("Daily dose, mg/kg/day") +
+ ylab("Css, uM")
+
+print(Cssdosefig)
+
+```
+
+## Reverse Toxicokinetics
+
+In the previous TK examples, we started with a specified dosing protocol, then solved the TK models to find the resulting concentration in the body (e.g., in plasma). This allows us to convert from external exposure metrics to internal exposure metrics. However, many environmental health questions require the reverse: converting from internal exposure metrics to external exposure metrics.
+
+For example, when health effects of environmental chemicals are studied in epidemiological cohorts, adverse health effects are often related to *internal* exposure metrics, such as the blood or plasma concentration of a chemical. Similarly, *in vitro* studies of chemical bioactivity (for example, the ToxCast program) relate bioactivity to *in vitro* concentration, which can be considered analogous to internal exposure or body concentration. So we may know the *internal* exposure level associated with some adverse health effect of a chemical.
+
+However, risk assessors and risk managers typically control *external* exposure to reduce the risk of adverse health effects. They need some way to start from an internal exposure associated with adverse health effects, and convert to the corresponding external exposure.
+
+The solution is *reverse toxicokinetics* (reverse TK). Starting with a specified internal exposure metric (body concentration), solve the TK model *in reverse* to find the corresponding external exposure that produced that concentration.
+
+When exposures are long-term and low-level (as environmental exposures often are), then the relevant internal exposure metric is the steady-state concentration. In this case, it is useful to remember the linear relationship between $C_{ss}$ and dose for the *httk* TK models. It gives you a quick and easy way to perform reverse TK for the steady-state case.
+
+The procedure is illustrated graphically below.
+
+1. Begin with a "target" concentration on the y-axis (labeled $C_{\textrm{target}}$). For example, $C_{\textrm{target}}$ may be the *in vitro* concentration associated with bioactivity in a ToxCast assay, or the plasma concentration associated with an adverse health effect in an epidemiological study.
+2. Draw a horizontal line over to the $C_{ss}$-dose line.
+3. Drop down vertically to the x-axis and read off the corresponding dose. This is the *administered equivalent dose* (AED): the external dose or exposure rate, in mg/kg/day, that would produce an internal steady-state plasma concentration equal to the target concentration.
+
+```{r 6-6-Toxicokinetic-Modeling-19, echo = FALSE, warning = FALSE, fig.align = "center"}
+reverseTKfig <- Cssdosefig +
+ geom_segment(aes(x = -Inf, y = 0.8671, xend = 0.75, yend = 0.8671),
+ size = 2,
+ arrow = arrow(angle = 30, length = unit(5, "mm"), type = "closed"),
+ color = "#fc8d62") +
+ geom_segment(aes(x = 0.75, y = 0.8671, xend = 0.75, yend = -Inf),
+ size = 2,
+ arrow = arrow(angle = 30, length = unit(5, "mm"), type = "closed"),
+ color = "#fc8d62") +
+ ggplot2::annotate("text",
+ x = 0,
+ y = 1,
+ label = "1",
+ size = 8,
+ color = "#fc8d62",
+ vjust = "bottom") +
+ ggplot2::annotate("text",
+ x = 0.75,
+ y = 1,
+ label = "2",
+ size = 8,
+ color = "#fc8d62",
+ vjust = "bottom") +
+ ggplot2::annotate("text",
+ x = 0.8,
+ y = 0,
+ label = "3",
+ size = 8,
+ color = "#fc8d62",
+ hjust = "left") +
+ scale_y_continuous(breaks = c(seq(0, 2, by = 0.5), 0.8671),
+ labels = c(seq(0, 2, by = 0.5), "Ctarget"),
+ minor_breaks = seq(0.25, 1.75, by = 0.5)) +
+ scale_x_continuous(breaks = c(seq(0, 2, by = 0.5), 0.75),
+ labels = c(seq(0, 2, by = 0.5), "AED"),
+ minor_breaks = seq(0.25, 1.75, by = 0.5)) +
+ theme(
+ axis.text.y = element_text(
+ color = c(rep("black", length(seq(0, 2, by = 0.5))), "#fc8d62"),
+ size = 12
+ ),
+ axis.text.x = element_text(
+ color = c(rep("black", length(seq(0, 2, by = 0.5))), "#fc8d62"),
+ size = 12
+ )
+ )
+
+print(reverseTKfig)
+```
+
+Mathematically, the relation is very simple:
+
+$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{-dose slope}} $$
+
+Since the $C_{ss}$-dose slope is simply $C_{ss}$ for a daily dose of 1 mg/kg/day, this equation can be rewritten as
+
+$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{ for 1 mg/kg/day}} $$
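+
+To make this concrete, here is a minimal R sketch of steady-state reverse TK (the chemical and the 1 uM target concentration are purely illustrative). It uses *httk*'s `calc_analytic_css()`, whose default dose rate is 1 mg/kg/day, so its output is exactly the $C_{ss}$-dose slope:
+
+```r
+library(httk)
+
+# Illustrative target concentration, e.g. an in vitro bioactive concentration
+Ctarget <- 1 # uM
+
+# Css-dose slope = Css for a dose rate of 1 mg/kg/day (the function's default)
+slope <- calc_analytic_css(chem.name = "Bisphenol-A",
+                           model = "3compartmentss",
+                           output.units = "uM")
+
+# Administered equivalent dose, in mg/kg/day
+AED <- Ctarget / slope
+```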
+
+
+## Capturing Population Variability in Toxicokinetics, and Uncertainty in Chemical-Specific Parameters
+
+For a given dose, $C_{ss}$ is determined by the values of the parameters of the TK model. These parameters describe absorption, distribution, metabolism, and excretion (ADME) of each chemical. They include both chemical-specific parameters, describing hepatic clearance and protein binding, and chemical-independent parameters, describing physiology. A table of these parameters is presented below.
+
+```{r 6-6-Toxicokinetic-Modeling-20, results = "asis", echo = FALSE}
+paramtable <- data.frame("Parameter" = c("Intrinsic hepatic clearance rate",
+ "Fraction unbound to plasma protein",
+ "Tissue:plasma partition coefficients",
+ "Tissue masses",
+ "Tissue blood flows",
+ "Glomerular filtration rate",
+ "Hepatocellularity"),
+ "Details" = c("Rate at which liver removes chemical from blood",
+ "Free fraction of chemical in plasma",
+ "Ratio of concentration in body tissues to concentration in plasma",
+ "Mass of each body tissue (including total body weight)",
+ "Blood flow rate to each body tissue",
+ "Rate at which kidneys remove chemical from blood",
+ "Number of cells per mg liver"),
+ "Estimated" = c("Measured *in vitro*",
+ "Measured *in vitro*",
+ "Estimated from chemical and tissue properties",
+ rep("From anatomical literature", 4)
+ ),
+ "Type" = c(rep("Chemical-specific", 3),
+ rep("Chemical-independent", 4))
+)
+
+knitr::kable(paramtable)
+```
+
+Because these parameters represent physiology and chemical-body interactions, their exact values will vary across individuals in a population, reflecting population physiological variability. Additionally, parameters are subject to measurement uncertainty.
+
+Since the $C_{ss}$-dose relation is determined by these parameters, variability and uncertainty in the TK parameters translates directly into variability and uncertainty in $C_{ss}$ for a given dose. In other words, there is a distribution of $C_{ss}$ values for each daily dose level of a chemical.
+
+The $C_{ss}$-dose relationship is still linear when variability and uncertainty are taken into account. However, rather than a single $C_{ss}$-dose slope, there is a distribution of $C_{ss}$-dose slopes. Because the $C_{ss}$-dose slope is simply the $C_{ss}$ value for an exposure rate of 1 mg/kg/day, the distribution of the $C_{ss}$-dose slope is the same as the $C_{ss}$ distribution for an exposure rate of 1 mg/kg/day.
+
+A distribution of $C_{ss}$-dose slopes is illustrated in the figure below, along with boxplots illustrating the distributions for $C_{ss}$ itself at five different dose levels: 0.05, 0.25, 0.5, 0.75, and 0.95 mg/kg/day.
+
+
+```{r 6-6-Toxicokinetic-Modeling-21, echo = FALSE, warning = FALSE, fig.align = "center"}
+
+suppressWarnings(css_examp <- calc_mc_css(chem.name = "Bisphenol-A",
+ which.quantile = c(0.05, #specify which quantiles to return
+ 0.25,
+ 0.5,
+ 0.75,
+ 0.95),
+ output.units = "uM",
+ suppress.messages = TRUE,
+ model = "3compartmentss" #which model to use to calculate Css
+ ))
+
+#Css for various doses
+css_dist_wide <- as.data.frame(
+ t(
+ sapply(doses,
+ function(x) x * css_examp
+ )
+ )
+)
+
+#add column defining daily doses
+css_dist_wide$dose <- doses
+
+#data.frame of slope percentiles
+slope_dist <- data.frame(slope = css_examp,
+ quantile= factor(names(css_examp),
+ levels = names(css_examp)))
+
+#colors for plotting -- specify order to be consistent with color use later
+#This is a slight re-ordering of ColorBrewer2's "Set2" palette
+plotcols <- c("5%" = "#66c2a5",
+ "50%" = "#fc8d62",
+ "95%" = "#8da0cb",
+ "25%" = "#e78ac3",
+ "75%" = "#a6d854")
+
+
+ggplot(css_dist_wide) +
+ geom_boxplot(aes(x = dose,
+ group = dose,
+ lower = `25%`,
+ upper = `75%`,
+ middle = `50%`,
+ ymin = `5%`,
+ ymax = `95%`),
+ stat = "identity") +
+ geom_abline(data = slope_dist,
+ aes(intercept =0,
+ slope = slope,
+ color = quantile),
+ size = 1) +
+ scale_color_manual(values = plotcols,
+ limits = levels(slope_dist$quantile),
+ name = "Percentile") +
+ xlab("Daily dose, mg/kg/day") +
+ ylab("Css, uM") +
+ theme(legend.position = c(0.1,0.7))
+
+```
+
+An appropriate title for this figure could be:
+
+"**Boxplots: Distributions of Css for five daily dose levels of Bisphenol-A.** Boxes extend from the 25th to the 75th percentile. Lower whisker = 5th percentile; upper whisker = 95th percentile. Lines: Css-dose relations for each quantile."
+
+### Variability and Uncertainty in Reverse Toxicokinetics
+
+Earlier, we found that with a linear $C_{ss}$-dose relation, reverse toxicokinetics became a matter of a simple linear equation. For a given target concentration -- for example, a plasma concentration associated with adverse health effects *in vivo*, or a concentration associated with bioactivity *in vitro* -- we could predict an AED (administered equivalent dose), the external exposure rate in mg/kg/day that would produce the target concentration at steady state.
+
+$$ AED = \frac{C_{\textrm{target}}}{C_{ss}\textrm{-dose slope}} $$
+
+Since AED depends on the $C_{ss}$-dose slope, variability and uncertainty in that slope will induce variability and uncertainty in the AED. A distribution of slopes will lead to a distribution of AEDs for the same target concentration.
+
+For example, a graphical representation of finding the AED distribution for a target concentration of 1 uM looks like this, for the same example chemical (bisphenol-A) used to illustrate the distribution of $C_{ss}$-dose slopes above. (The lines shown in this plot are the same as in the previous plot, but the plot has been "zoomed in" on the y-axis.)
+
+The steps are the same as before:
+
+1. Begin with a "target" concentration on the y-axis, here 1 uM.
+2. Draw a horizontal line over to intersect each $C_{ss}$-dose line.
+3. Where the horizontal line intersects each $C_{ss}$-dose line, drop down vertically to the x-axis and read off each corresponding AED (marked with colored circles matching the color of each $C_{ss}$-dose line).
+
+```{r 6-6-Toxicokinetic-Modeling-22, echo = FALSE, warning = FALSE, fig.align = "center"}
+
+ggplot(css_dist_wide,
+ aes(x=dose, y = `95%`)) +
+ geom_blank() +
+ geom_abline(data = slope_dist,
+ aes(intercept =0,
+ slope = slope,
+ color = quantile),
+ size = 1) +
+ geom_hline(aes(yintercept = 1)) +
+ geom_segment(aes(x = 1/css_examp,
+ xend = 1/css_examp,
+ y = 1,
+ yend = -Inf)) +
+ geom_point(aes(x = 1/css_examp,
+ y = -Inf,
+ color = factor(c("5%", "25%", "50%", "75%", "95%"),
+ levels = levels(slope_dist$quantile))
+ ),
+ size = 5) +
+ scale_color_manual(values = plotcols,
+ limits = levels(slope_dist$quantile),
+ name = "Percentile") +
+ xlab("Daily dose, mg/kg/day") +
+ ylab("Css, uM") +
+ theme(legend.position = "right") +
+ coord_cartesian(ylim = c(0,5),
+ clip = "off")
+```
+
+Notice that the line with the steepest, 95th-percentile slope (the purple line) yields the lowest AED (the purple dot, approximately 0.07 mg/kg/day for this example chemical), and the line with the shallowest, 5th-percentile slope (the turquoise blue line) yields the highest AED (the turquoise dot, approximately 2 mg/kg/day for this example chemical).
+
+In general, the 95th-percentile $C_{ss}$-dose slope represents the most-sensitive 5\% of the population -- individuals who will reach the target concentration in their body with the smallest daily doses. Therefore, using the AED for the 95th-percentile $C_{ss}$-dose slope is a conservative choice, health-protective for 95\% of the estimated population.
+
+
+### Monte Carlo approach to simulating variability and uncertainty
+The *httk* package implements a Monte Carlo approach for simulating variability and uncertainty in TK.
+
+*httk* first defines distributions for the TK model parameters, representing population variability. These distributions are defined based on real data about U.S. population demographics and physiology collected as part of the Centers for Disease Control's National Health and Nutrition Examination Survey (NHANES) [(Ring et al., 2017)](https://pubmed.ncbi.nlm.nih.gov/28628784/). TK parameters with known measurement uncertainty (intrinsic hepatic clearance rate and fraction of chemical unbound in plasma) additionally have distributions defined to represent their uncertainty [(Wambaugh et al., 2019)](https://pubmed.ncbi.nlm.nih.gov/31532498/).
+
+Then, *httk* samples sets of TK parameter values from these distributions (including appropriate correlations: for example, liver mass is correlated with body weight). Each sampled set of TK parameter values represents one "simulated individual."
+
+Next, *httk* calculates the $C_{ss}$-dose slope for each "simulated individual." The resulting sample of $C_{ss}$-dose slopes can be used to characterize the distribution of $C_{ss}$-dose slopes -- for example, by calculating percentiles.
+
+*httk* makes this whole Monte Carlo process simple and transparent for the user. You just need to call one function, `calc_mc_css()`, specifying the chemical whose $C_{ss}$-dose slope distribution you want to calculate. Behind the scenes, *httk* performs all the Monte Carlo calculations. It returns percentiles of the $C_{ss}$-dose slope (by default), or it can return all individual samples of the $C_{ss}$-dose slope (if you want to do some calculations of your own).
+
+### Chemical-Specific Example Capturing Population Variability for Bisphenol-A Plasma Concentration Estimates
+
+The following code estimates the 5th percentile, 50th percentile, and 95th percentile of the $C_{ss}$-dose slope for the chemical bisphenol-A. For the sake of simplicity, we will use the 3-compartment steady-state model (rather than the PBTK model used in the previous examples).
+
+```{r 6-6-Toxicokinetic-Modeling-23, warning=FALSE}
+css_examp <- calc_mc_css(chem.name = "Bisphenol-A",
+ which.quantile = c(0.05, #specify which quantiles to return
+ 0.5,
+ 0.95),
+ model = "3compartmentss", #which model to use to calculate Css
+                         output.units = "uM") #could also choose mg/L
+
+print(css_examp)
+```
+
+Recall that the $C_{ss}$-dose slope is the same as $C_{ss}$ for a daily dose of 1 mg/kg/day. The function `calc_mc_css()` therefore assumes a dose of 1 mg/kg/day and calculates the resulting $C_{ss}$ distribution. If you need to calculate the $C_{ss}$ distribution for a different dose, e.g. 2 mg/kg/day, you can simply multiply the $C_{ss}$ percentiles from `calc_mc_css()` by your desired dose.
+
+The steady-state plasma concentration for a 1 mg/kg/day dose is returned in units of uM. The three requested quantiles are returned as a named numeric vector (whose names in this case are `5%`, `50%`, and `95%`).
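+
+As a quick sketch of that scaling (reusing the `css_examp` vector computed above; the 2 mg/kg/day dose is just an example):
+
+```r
+# Css percentiles scale linearly with dose, so the distribution for
+# 2 mg/kg/day is simply twice the 1 mg/kg/day output of calc_mc_css()
+css_2mgkgday <- 2 * css_examp
+```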
+
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: What is the predicted range of bisphenol-A concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentiles.
+:::
+
+:::answer
+**Answer**: For a human population exposed to 1 mg/kg/day bisphenol-A, plasma concentrations are estimated to be `r unname(css_examp[1])` uM at the 5th percentile, `r unname(css_examp[2])` uM at the 50th percentile, and `r unname(css_examp[3])` uM at the 95th percentile.
+:::
+
+
+### High-Throughput Example Capturing Population Variability for ~1000 Chemicals
+
+We can easily and (fairly) quickly do this for all 998 chemicals for which the 3-compartment steady-state model can be parameterized, using `sapply()` to loop over the chemicals. This will take a few minutes to run (for example, it takes about 10-15 minutes on a Dell Latitude with an Intel i7 processor).
+
+In order to make the Monte Carlo sampling reproducible, set a seed for the random number generator. It doesn't matter what seed you choose -- it can be any integer. Here, the seed is set to 42, because it's the answer to the ultimate question of life, the universe, and everything [(Adams, 1979)](https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy_(novel)).
+
+```{r 6-6-Toxicokinetic-Modeling-24 }
+set.seed(42)
+
+system.time(
+ suppressWarnings(
+ css_3compss <- sapply(chems_3compss$CAS,
+ calc_mc_css,
+ #additional arguments to calc_mc_css()
+ model = "3compartmentss",
+ which.quantile = c(0.05, 0.5, 0.95),
+ output.units = "uM",
+ suppress.messages = TRUE)
+ )
+)
+```
+
+Organizing the results:
+```{r 6-6-Toxicokinetic-Modeling-25 }
+#css_3compss comes out as a 3 x 998 array,
+#where rows are quantiles and columns are chemicals
+#transpose it so that rows are chemicals and columns are quantiles
+css_3compss <- t(css_3compss)
+#convert to data.frame
+css_3compss <- as.data.frame(css_3compss)
+#make a column for CAS, rather than just leaving it as the row names
+css_3compss$CAS <- row.names(css_3compss)
+
+head(css_3compss) #View first few rows
+```
+
+
+### Plotting the $C_{ss}$-dose slope distribution quantiles across these ~1000 chemicals
+
+Here, we will plot the resulting concentration distribution quantiles for each chemical, while sorting the chemicals from lowest to highest median value.
+
+By default, *ggplot2* will plot the chemical CASRNs in alphabetical order. To force it to plot them in another order, we have to explicitly specify the desired order. The easiest way to do this is to add a column to the data.frame that contains the chemical identifiers as a factor (categorical) variable, whose levels (categories) are explicitly set to be the CASRNs in our desired plotting order. Then we can tell *ggplot2* to plot that factor variable on the x-axis, rather than the original CASRN variable.
+
+Set the ordering of the chemical CASRNs from lowest to highest median value
+```{r 6-6-Toxicokinetic-Modeling-26 }
+chemical_order <- order(css_3compss$`50%`)
+```
+
+Create a factor (categorical) CAS column where the factor levels are given by the CASRNs with this ordering.
+```{r 6-6-Toxicokinetic-Modeling-27 }
+css_3compss$CAS_factor <- factor(css_3compss$CAS, levels = css_3compss$CAS[chemical_order])
+```
+
+For plotting ease, reshape the data.frame into "long" format -- rather than having one column for each quantile of the $C_{ss}$ distribution, have a row for each chemical/quantile combination. We use the `melt()` function from the *reshape2* package.
+```{r 6-6-Toxicokinetic-Modeling-28, warning = FALSE}
+css_3compss_melt <- reshape2::melt(css_3compss,
+ id.vars = "CAS_factor",
+ measure.vars = c("5%", "50%", "95%"),
+ variable.name = "Percentile",
+ value.name = "Css_slope")
+head(css_3compss_melt)
+```
+
+Plot the slope percentiles. Use a log scale for the y-axis because the slopes span orders of magnitude. Suppress the x-axis labels (the CASRNs) because they are not readable anyway.
+```{r 6-6-Toxicokinetic-Modeling-29, fig.align = "center"}
+ggplot(css_3compss_melt) +
+ geom_point(aes(x=CAS_factor,
+ y = Css_slope,
+ color = Percentile)) +
+ scale_color_brewer(palette = "Set2") + #use better color scheme than default
+ scale_y_log10() + #use log scale for y axis
+ xlab("Chemical") +
+ ylab("Css-dose slope (uM per mg/kg/day)") +
+ annotation_logticks(sides = "l") + #add log ticks to y axis
+ theme_bw() + #plot with white plot background instead of gray
+ theme(axis.text.x = element_blank(), #suppress x-axis labels
+ panel.grid.major.x = element_blank(), #suppress vertical grid lines
+        legend.position = c(0.1,0.8) #place legend inside plot area, upper left
+ )
+```
+
+Chemicals along the x-axis are in order from lowest to highest median (50th percentile) predicted $C_{ss}$-dose slope. The orange points represent the 50th-percentile $C_{ss}$-dose slope for each chemical. The green points represent the 5th-percentile $C_{ss}$-dose slopes, and the purple points represent the 95th-percentile $C_{ss}$-dose slope for each chemical. Each chemical has one orange point (50th percentile), one green point (5th percentile), and one purple point (95th percentile), characterizing the distribution of $C_{ss}$-dose slopes across the U.S. population for that chemical. The width of the distribution for each chemical is roughly represented by the vertical distance between the green and purple points for that chemical.
+
+
+### Answer to Environmental Health Question 4
+:::question
+*With this, we can answer **Environmental Health Question #4***: Considering the chemicals evaluated in the above TK modeling example, do the $C_{ss}$-dose slope distributions become wider as the median $C_{ss}$-dose slope increases?
+:::
+
+:::answer
+**Answer**: No -- the $C_{ss}$-dose slope distributions generally become narrower as the median $C_{ss}$-dose slope increases. This can be seen by looking at the right end of the plot, where the highest-median chemicals are located -- the distance between the green points and purple points, representing the 5th and 95th percentiles, is much smaller for these higher-median chemicals.
+:::
+
+
+
+## Reverse TK: Calculating Administered Equivalent Doses for ToxCast Bioactive Concentrations
+
+As described in an earlier section of this document, the slope defining the linear relation between $C_{ss}$ and dose is useful for reverse toxicokinetics: converting an internal dose metric to an external dose metric. The internal dose metric may, for example, be a concentration associated with an *in vivo* health effect, or with *in vitro* bioactivity. Here, we will consider *in vitro* bioactivity -- specifically, from the ToxCast program. ToxCast tests chemicals in concentration-response format across a battery of *in vitro* assays that measure activity in a wide variety of biological endpoints. If a chemical showed activity in an assay at any of its tested concentrations, then one metric of the concentration associated with bioactivity is the AC50 -- the concentration at which the assay response is halfway between its minimum and its maximum.
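+
+As an intuition-builder only (this is *not* the actual *tcpl* fitting procedure), a simple Hill concentration-response function makes the AC50 definition concrete; the parameter values below are made up:
+
+```r
+# Hypothetical Hill concentration-response curve: response rises from zero
+# toward a maximum ("top"); AC50 is the concentration at half-maximal response
+hill <- function(conc, top, AC50, n = 1) {
+  top * conc^n / (AC50^n + conc^n)
+}
+
+# Made-up parameters: top = 100 (% response), AC50 = 5 uM
+hill(5, top = 100, AC50 = 5) # at conc = AC50, response is half of top
+```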
+
+The module won't address the details of how ToxCast determines assay activity and AC50s from raw concentration-response data. There is an entire R package for the ToxCast data processing workflow, called *tcpl*. If you want to learn more about those details, [start here](https://www.epa.gov/chemical-research/toxcast-data-generation-toxcast-pipeline-tcpl). Lots of information is available if you install the *tcpl* R package and look at the package vignette; it essentially walks you through the full ToxCast data processing workflow.
+
+In this module, we will begin with pre-computed ToxCast AC50 values for various chemicals and assays. We will use `httk` to convert ToxCast AC50 values into administered equivalent doses (AEDs).
+
+### Loading ToxCast AC50s
+
+The latest public release of ToxCast high-throughput screening assay data can be downloaded [here](https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data). Previous public releases of ToxCast data included a matrix of AC50s by chemical and assay. The data format of the latest public release does not contain this kind of matrix. So this dataset was pre-processed to prepare a simple data.frame of AC50s for each chemical/assay combination for the purposes of this training module.
+
+Read in the pre-processed dataset and view the first few rows.
+
+```{r 6-6-Toxicokinetic-Modeling-30 }
+toxcast <- read.csv("Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData1.csv")
+head(toxcast)
+```
+
+The columns of this data frame are:
+
+* `Compound`: The compound name.
+* `CAS`: The compound's CASRN.
+* `DTXSID`: The compound's DSSTox Substance ID.
+* `aenm`: Assay identifier. "aenm" stands for "Assay Endpoint Name." More information about the ToxCast assays is available on the [ToxCast data download page](https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data).
+* `log10_ac50`: The AC50 for the chemical/assay combination on each row, in log10 uM units.
+
+How many ToxCast chemicals are in this dataset?
+
+```{r 6-6-Toxicokinetic-Modeling-31 }
+length(unique(toxcast$DTXSID))
+```
+
+
+### Answer to Environmental Health Question 5
+:::question
+*With this, we can answer **Environmental Health Question #5***: How many chemicals have available AC50 values to evaluate in the current ToxCast/Tox21 high-throughput screening database?
+:::
+
+:::answer
+**Answer**: 7863 chemicals.
+:::
+
+
+### Subsetting the ToxCast Chemicals to include those that are also in httk
+
+Not all of the ToxCast chemicals have TK data built into *httk* such that we can perform reverse TK using the HTTK models. Let's subset the ToxCast data to include only the chemicals for which we can run the 3-compartment steady-state model.
+
+Previously, we used `get_cheminfo()` to get a list of chemicals for which we could run the 3-compartment steady-state model, including the names, CASRNs, and DSSTox IDs of those chemicals. That list is stored in variable `chems_3compss`, a data.frame with compound name, CASRN, and DTXSID. Now, we can use that chemical list to subset the ToxCast data.
+
+```{r 6-6-Toxicokinetic-Modeling-32 }
+toxcast_httk <- subset(toxcast,
+ subset = toxcast$DTXSID %in%
+ chems_3compss$DTXSID)
+```
+
+How many chemicals are in this subset?
+
+```{r 6-6-Toxicokinetic-Modeling-33 }
+length(unique(toxcast_httk$DTXSID))
+```
+
+There were 998 *httk* chemicals for which we could run the 3-compartment steady-state model; only 869 of them had ToxCast data. Conversely, most of the 7863 ToxCast chemicals do not have TK data in *httk* such that we can run the 3-compartment steady-state model.
+
+### Identifying the Lower-Bound *In Vitro* AC50 Value per Chemical
+ToxCast/Tox21 screens chemicals across multiple assays, such that each chemical has multiple resulting AC50 values, spanning a range of values. For example, here are boxplots of the AC50s for the first 20 chemicals listed in `chems_3compss`. Note that the chemical identifiers, DTXSID, are used here in these visualizations to represent unique chemicals.
+
+```{r 6-6-Toxicokinetic-Modeling-34, fig.align = "center"}
+ggplot(toxcast_httk[toxcast_httk$DTXSID %in%
+ chems_3compss[1:20,
+ "DTXSID"],
+ ]
+ ) +
+ geom_boxplot(aes(x=DTXSID, y = log10_ac50)) +
+ ylab("log10 AC50") +
+ theme_bw() +
+ theme(axis.text.x = element_text(angle = 45,
+ hjust = 1))
+```
+
+
+Sometimes we have an interest in getting the equivalent dose for an AC50 for one specific assay. For example, if we happen to be interested in estrogen-receptor activity, we might look specifically at one of the assays that measures estrogen receptor activity.
+
+However, sometimes we just want a general idea of what concentrations showed bioactivity in *any* of the ToxCast assays, regardless of the specific biological endpoint of each assay. In this case, typically, we are interested in a "reasonable lower bound" of bioactive concentrations across assays for each chemical. Intuitively, we suspect that the very lowest AC50s for each chemical might represent false activity. Therefore, we often select the tenth percentile of ToxCast AC50s for each chemical as that "reasonable lower bound" on bioactive concentrations.
+
+Let's calculate the tenth percentile ToxCast AC50 for each chemical. Here, we use the base R function `aggregate()`, which groups a vector (specified in the `x` argument) by a list of factors (specified in the `by` argument), and applies a function to each group (specified in the `FUN` argument). You can add any extra arguments to the `FUN` function as named arguments to `aggregate()`.
+
+```{r 6-6-Toxicokinetic-Modeling-35 }
+toxcast_httk_P10 <- aggregate(x = toxcast_httk$log10_ac50, #aggregate the AC50s
+ by = list(DTXSID = toxcast_httk$DTXSID), #group AC50s by DTXSID
+ FUN = quantile, #the function to apply to each group
+ prob = 0.1) #an argument to the quantile() function
+#by default the names of the output data.frame will be 'DTXSID' and 'x'
+#let's change 'x' to be a more informative name
+names(toxcast_httk_P10) <- c("DTXSID", "log10_ac50_P10")
+```
+
+Let's transform the tenth-percentile AC50 values back to the natural scale (they are currently on the log10 scale) and put them in a new column `AC50`. These AC50s will be in uM.
+
+```{r 6-6-Toxicokinetic-Modeling-36 }
+toxcast_httk_P10$AC50 <- 10^(toxcast_httk_P10$log10_ac50_P10)
+```
+
+View the first few rows:
+
+```{r 6-6-Toxicokinetic-Modeling-37 }
+head(toxcast_httk_P10)
+```
+
+
+### Calculating Equivalent Doses for 10th Percentile ToxCast AC50s
+
+We can calculate equivalent doses in one line of R code -- again including all of the Monte Carlo for TK uncertainty and variability -- just by using the *httk* function `calc_mc_oral_equiv()`.
+
+Note that in `calc_mc_oral_equiv()`, the `which.quantile` argument refers to the quantile of the $C_{ss}$-dose slope, not the quantile of the equivalent dose itself. So specifying `which.quantile = 0.95` will yield a *lower* equivalent dose than `which.quantile = 0.05`.
+
+Under the hood, `calc_mc_oral_equiv()` first calls `calc_mc_css()` to get percentiles of the $C_{ss}$-dose slope for a chemical. It then divides a user-specified target concentration (specified in argument `conc`) by each quantile of $C_{ss}$-dose slope to get the equivalent dose corresponding to that target concentration for each slope quantile.
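+
+A sketch of that under-the-hood logic, done manually for a single illustrative chemical and a 1 uM target concentration (the quantile division shown here approximates what `calc_mc_oral_equiv()` does internally):
+
+```r
+set.seed(42)
+
+# Quantiles of the Css-dose slope (Css in uM for a 1 mg/kg/day dose rate)
+slopes <- calc_mc_css(chem.name = "Bisphenol-A",
+                      model = "3compartmentss",
+                      which.quantile = c(0.05, 0.5, 0.95),
+                      output.units = "uM",
+                      suppress.messages = TRUE)
+
+# Dividing the target concentration by each slope quantile gives the
+# corresponding equivalent doses, in mg/kg/day
+aed_manual <- 1 / slopes
+```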
+
+Here, we're using the `mapply()` function in base R to call `calc_mc_oral_equiv()` in a loop over chemicals. This is because `calc_mc_oral_equiv()` requires two chemical-specific arguments -- the chemical identifier and the concentration for which to compute the equivalent dose. `mapply()` lets us provide vectors of values for each argument (in the named arguments `dtxsid` and `conc`), and will automatically loop over those vectors. We also use the argument `MoreArgs`, a named list of additional arguments to the function in `FUN` that will be the same for every iteration of the loop. Note that this line of code takes a few minutes to run.
+
+```{r 6-6-Toxicokinetic-Modeling-38, results="hide"}
+set.seed(42)
+
+system.time(
+ suppressWarnings(
+ toxcast_equiv_dose <- mapply(FUN = calc_mc_oral_equiv,
+ conc = toxcast_httk_P10$AC50,
+ dtxsid = toxcast_httk_P10$DTXSID,
+ MoreArgs = list(model = "3compartmentss", #model to use
+ which.quantile = c(0.05, 0.5, 0.95), #quantiles of Css-dose slope
+ suppress.messages = TRUE)
+ )
+)
+)
+
+#by default, the results are a 3 x 869 matrix, where rows are quantiles and columns are chemicals
+
+toxcast_equiv_dose <- t(toxcast_equiv_dose) #transpose so that rows are chemicals
+toxcast_equiv_dose <- as.data.frame(toxcast_equiv_dose) #convert to data.frame
+head(toxcast_equiv_dose) #look at first few rows
+```
+
+Let's add the DTXSIDs back into this data.frame.
+
+```{r 6-6-Toxicokinetic-Modeling-39 }
+toxcast_equiv_dose$DTXSID <- toxcast_httk_P10$DTXSID
+```
+
+We can get the names of these chemicals by using the list of chemicals for which the 3-compartment steady-state model can be parameterized, which was stored in the variable `chems_3compss`. In that dataframe, we have the compound name and CASRN corresponding to each DTXSID.
+
+```{r 6-6-Toxicokinetic-Modeling-40 }
+head(chems_3compss)
+```
+
+Merge `chems_3compss` with `toxcast_equiv_dose`.
+
+```{r 6-6-Toxicokinetic-Modeling-41 }
+toxcast_equiv_dose <- merge(chems_3compss,
+ toxcast_equiv_dose,
+ by = "DTXSID",
+ all.x = FALSE,
+ all.y = TRUE)
+
+head(toxcast_equiv_dose)
+```
+
+To find the chemicals with the lowest equivalent doses at the 95th percentile level (corresponding to the most-sensitive 5\% of the population), sort this data.frame in ascending order on the `95%` column.
+
+```{r 6-6-Toxicokinetic-Modeling-42 }
+toxcast_equiv_dose <- toxcast_equiv_dose[order(toxcast_equiv_dose$`95%`), ]
+head(toxcast_equiv_dose, 10) #first ten rows of sorted table
+```
+
+
+### Answer to Environmental Health Question 6
+:::question
+*With this, we can answer **Environmental Health Question #6***: What are the chemicals with the three lowest predicted equivalent doses (for tenth-percentile ToxCast AC50s), for the most-sensitive 5\% of the population?
+:::
+
+:::answer
+**Answer**: 2,4-D; secbumeton; and 1,4-dioxane.
+:::
+
+
+
+## Comparing Equivalent Doses Estimated to Elicit Toxicity (Hazard) to External Exposure Estimates (Exposure), for Chemical Prioritization by Bioactivity-Exposure Ratios (BERs)
+
+To estimate potential risk, hazard -- in the form of the equivalent dose for the 10th-percentile ToxCast AC50 -- now needs to be compared to exposure. A quantitative metric for this comparison is the ratio of a lower-bound (5th percentile) equivalent dose to an upper-bound (95th percentile) exposure estimate. This metric is termed the Bioactivity-Exposure Ratio, or BER. A lower BER corresponds to higher potential risk. With BERs calculated for each chemical, we can ultimately rank all of the chemicals from lowest to highest BER, to achieve a chemical prioritization based on potential risk.
+
+### Human Exposure Estimates
+
+Here, we will use exposure estimates that have been inferred from CDC NHANES urinary biomonitoring data (Ring et al., 2019). These estimates consist of an estimated median, and estimated upper and lower 95\% credible interval bounds representing uncertainty in that estimated median. These estimates are provided here in the following csv file:
+
+```{r 6-6-Toxicokinetic-Modeling-43 }
+exposure <- read.csv("Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData2.csv")
+head(exposure) #view first few rows
+```
+
+### Merging Exposure Estimates with Equivalent Dose Estimates of Toxicity (Hazard)
+
+To calculate a BER for a chemical, it needs to have both an equivalent dose and an exposure estimate. Not all of the chemicals for which equivalent doses could be computed (*i.e.*, chemicals with both ToxCast AC50s and `httk` data) also have exposure estimates inferred from NHANES. Find out how many do.
+
+```{r 6-6-Toxicokinetic-Modeling-44 }
+length(intersect(toxcast_equiv_dose$DTXSID, exposure$DTXSID))
+```
+
+This means that, using the ToxCast AC50 data for bioactive concentrations, the NHANES urinary inference data for exposures, and the *httk* package to convert bioactive concentrations to equivalent doses, we can compute BERs for `r length(intersect(toxcast_equiv_dose$DTXSID, exposure$DTXSID))` chemicals.
+
+Merge together the ToxCast equivalent doses and the exposure data into a single data frame. Keep only the chemicals that have data in both ToxCast equivalent doses and exposure data frames.
+
+```{r 6-6-Toxicokinetic-Modeling-45 }
+hazard_exposure <- merge(toxcast_equiv_dose,
+ exposure,
+ by = "DTXSID",
+ all = FALSE)
+head(hazard_exposure) #view first few rows of result
+```
+
+### Plotting Hazard and Exposure Together
+
+We can visually compare the equivalent doses and the inferred exposure estimates by plotting them together.
+
+```{r 6-6-Toxicokinetic-Modeling-46, fig.align = "center"}
+ggplot(hazard_exposure) +
+ geom_crossbar(aes(x = Compound.x, #Boxes for equivalent doses
+ y = `50%`,
+ ymax = `5%`,
+ ymin = `95%`,
+ color = "Equiv. dose")) +
+  geom_crossbar(aes(x = Compound.x, #Boxes for exposures
+ y = Median,
+ ymax = up95,
+ ymin = low95,
+ color = "Exposure")) +
+  scale_color_manual(values = c("Equiv. dose" = "black",
+                                "Exposure" = "orange"),
+                     name = NULL) +
+  scale_x_discrete(labels = function(x) str_trunc(x, 20)) + #truncate chemical names to 20 chars
+ scale_y_log10() +
+ annotation_logticks(sides = "l") +
+ ylab("Equiv. dose or Exposure, mg/kg/day") +
+ theme_bw() +
+ theme(axis.text.x = element_text(angle = 45,
+ hjust = 1,
+ size = 6),
+ axis.title.x = element_blank(),
+ legend.position = "top")
+```
+
+### Calculating Bioactivity-Exposure Ratios (BERs)
+
+The bioactivity-exposure ratio (BER) is simply the ratio of the lower-end equivalent dose (for the most-sensitive 5\% of the population) to the upper-end estimated exposure (here, the upper bound on the inferred population median exposure). In the data frame `hazard_exposure` containing the hazard and exposure data, the lower-end equivalent dose is in column `95%` (corresponding to the 95th-percentile $C_{ss}$-dose slope) and the upper-end exposure is in column `up95`. Calculate the BER, and assign the result to a new column in the `hazard_exposure` data frame called `BER`.
+
+```{r 6-6-Toxicokinetic-Modeling-47 }
+hazard_exposure[["BER"]] <- hazard_exposure[["95%"]]/hazard_exposure[["up95"]]
+```
+
+### Prioritizing Chemicals by BER
+
+To prioritize chemicals according to potential risk, they can be sorted from lowest to highest BER. The lower the BER, the higher the priority.
+
+Sort the rows of the data.frame from lowest to highest BER.
+
+```{r 6-6-Toxicokinetic-Modeling-48 }
+hazard_exposure <- hazard_exposure[order(hazard_exposure$BER), ]
+head(hazard_exposure)
+```
+
+The hazard-exposure plot above showed chemicals in alphabetical order. It can be revised to show chemicals in order of priority, from lowest to highest BER.
+
+First, create a categorical (factor) variable for the compound names, whose levels are in order of increasing BER. (Since we already sorted the data.frame in order of increasing BER, we can just take the compound names in the order that they appear.)
+
+```{r 6-6-Toxicokinetic-Modeling-49 }
+hazard_exposure$Compound_factor <- factor(hazard_exposure$Compound.x,
+ levels = hazard_exposure$Compound.x)
+
+```
+
+Now, make the same plot as before, but use `Compound_factor` as the x-axis variable instead of `Compound.x`.
+
+```{r 6-6-Toxicokinetic-Modeling-50, fig.align = "center"}
+ggplot(hazard_exposure) +
+ geom_crossbar(aes(x = Compound_factor, #Boxes for equivalent dose
+ y = `50%`,
+ ymax = `5%`,
+ ymin = `95%`,
+ color = "Equiv. dose")) +
+  geom_crossbar(aes(x = Compound_factor, #Boxes for exposure
+ y = Median,
+ ymax = up95,
+ ymin = low95,
+ color = "Exposure")) +
+  scale_color_manual(values = c("Equiv. dose" = "black",
+                                "Exposure" = "orange"),
+                     name = NULL) +
+  scale_x_discrete(labels = function(x) str_trunc(x, 20)) + #truncate chemical names
+ scale_y_log10() +
+ ylab("Equiv. dose or Exposure, mg/kg/day") +
+ annotation_logticks(sides = "l") +
+ theme_bw() +
+ theme(axis.text.x = element_text(angle = 45,
+ hjust = 1,
+ size = 6),
+ axis.title.x = element_blank(),
+ legend.position = "top")
+```
+
+
+Now, the chemicals are displayed in order of increasing BER. From left to right, you can see the distance increase between the lower bound of equivalent doses (the bottom of the black boxes) and the upper bound of exposure estimates (the top of the orange boxes). Since the y-axis is on a log~10~ scale, the distance between the boxes corresponds to the BER. We can gather a lot of information from this plot!
+
+
+### Answer to Environmental Health Question 7
+:::question
+*With this, we can answer **Environmental Health Question #7***: Based on httk modeling estimates, are chemicals with higher bioactivity-exposure ratios always less potent than chemicals with lower bioactivity-exposure ratios?
+:::
+
+:::answer
+**Answer**: No -- some chemicals with high potency (low equivalent doses) demonstrate high BERs because they have relatively low human exposure estimates, and vice versa.
+:::
+
+
+
+### Answer to Environmental Health Question 8
+:::question
+*With this, we can also answer **Environmental Health Question #8***: Based on httk modeling estimates, do chemicals with higher bioactivity-exposure ratios always have lower estimated exposures than chemicals with lower bioactivity-exposure ratios?
+:::
+
+:::answer
+**Answer**: No -- some chemicals with high estimated exposures have equivalent doses that are higher still, resulting in a high BER despite the higher estimated exposure. Likewise, some chemicals with low estimated exposures also have lower equivalent doses, resulting in a low BER despite the low estimated exposure.
+:::
+
+
+### Answer to Environmental Health Question 9
+:::question
+*With this, we can also answer **Environmental Health Question #9***: How are chemical prioritization results different when using only hazard information vs. only exposure information vs. bioactivity-exposure ratios?
+:::
+
+:::answer
+**Answer**: When chemicals are prioritized solely on the basis of hazard, more-potent chemicals will be highly prioritized. However, if humans are never exposed to these chemicals, or exposure is extremely low compared to potency, then despite the high potency, the potential risk may be low. Conversely, if chemicals are prioritized solely on the basis of exposure, then ubiquitous chemicals will be highly prioritized. However, if these chemicals are inert and do not produce adverse effects, then despite the high exposure, the potential risk may be low. For these reasons, risk-based chemical prioritization efforts consider both hazard (toxicity) and exposure, for instance through bioactivity-exposure ratios.
+:::
+
+
+### Filling Hazard and Exposure Data Gaps to Prioritize More Chemicals
+
+To calculate a BER for a chemical, both bioactivity and exposure data are required, as well as sufficient TK data to perform reverse TK. In this training module, bioactivity data came from ToxCast AC50s; exposure data consisted of exposure inferences made from NHANES urinary biomonitoring data; and TK data consisted of parameter values measured *in vitro* and built into the *httk* R package. The intersections are illustrated in an Euler diagram below. BERs can only be calculated for chemicals in the triple intersection.
+
+```{r 6-6-Toxicokinetic-Modeling-51, fig.align = "center"}
+fit <- eulerr::euler(list('ToxCast AC50s' = unique(toxcast$DTXSID),
+ 'HTTK' = unique(chems_3compss$DTXSID),
+ 'NHANES inferred exposure' = unique(exposure$DTXSID)
+ ),
+ shape = "ellipse")
+plot(fit,
+ legend = TRUE,
+ quantities = TRUE
+ )
+```
+
+Clearly, it would be useful to gather more data to allow calculation of BERs for more chemicals.
+
+
+
+### Answer to Environmental Health Question 10
+:::question
+*With this, we can also answer **Environmental Health Question #10***: Of the three datasets used in this training module -- bioactivity from ToxCast, TK data from *httk*, and exposure inferred from NHANES urinary biomonitoring -- which one most limits the number of chemicals that can be prioritized using BERs?
+:::
+
+:::answer
+**Answer**: The exposure dataset includes the fewest chemicals and is therefore the most limiting.
+:::
+
+
+The exposure dataset used in this training module is limited to chemicals for which NHANES performed urinary biomonitoring of exposure markers -- a fairly small set of chemicals that were of interest to NHANES due to existing concerns about health effects of exposure, among other reasons. This dataset was chosen because it is a convenient set of exposure estimates for demonstration purposes, but it could be expanded by including other sources of exposure data and exposure model predictions. Further discussion is beyond the scope of this training module, but as an example of this kind of high-throughput exposure modeling, see [Ring et al., 2019](https://pubmed.ncbi.nlm.nih.gov/30516957/).
+
+It would additionally be useful to gather TK data for more chemicals. *In vitro* measurement efforts are ongoing. Additionally, efforts are underway to develop *in silico* models that predict TK parameters from chemical structure and properties, producing predictions that can usefully facilitate chemical prioritization.
+
+## Concluding Remarks
+
+This training module provides an overview of toxicokinetic modeling using the *httk* R package and its application to *in vitro*-to-*in vivo* extrapolation, placing *in vitro* bioactivity data in the context of exposure by calculating equivalent doses for *in vitro* bioactive concentrations.
+
+We would like to acknowledge the developers of the *httk* package, as detailed below via the CRAN website:
+
+```{r 6-6-Toxicokinetic-Modeling-52, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image2.png")
+```
+
+This module also summarizes the use of the Bioactivity-Exposure Ratio (BER) for chemical prioritization, and provides examples of calculating the BER and ranking chemicals accordingly.
+
+Together, these approaches can be used to more efficiently identify chemicals present in the environment that pose a potential risk to human health.
+
+
+For additional case studies that leverage TK and/or httk modeling techniques, see the following publications that also address environmental health questions:
+
++ Breen M, Ring CL, Kreutz A, Goldsmith MR, Wambaugh JF. High-throughput PBTK models for in vitro to in vivo extrapolation. Expert Opin Drug Metab Toxicol. 2021 Aug;17(8):903-921. PMID: [34056988](https://pubmed.ncbi.nlm.nih.gov/34056988/).
+
++ Klaren WD, Ring C, Harris MA, Thompson CM, Borghoff S, Sipes NS, Hsieh JH, Auerbach SS, Rager JE. Identifying Attributes That Influence In Vitro-to-In Vivo Concordance by Comparing In Vitro Tox21 Bioactivity Versus In Vivo DrugMatrix Transcriptomic Responses Across 130 Chemicals. Toxicol Sci. 2019 Jan 1;167(1):157-171. PMID: [30202884](https://pubmed.ncbi.nlm.nih.gov/30202884/).
+
++ Pearce RG, Setzer RW, Strope CL, Wambaugh JF, Sipes NS. httk: R Package for High-Throughput Toxicokinetics. J Stat Softw. 2017;79(4):1-26. PMID: [30220889](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134854/).
+
++ Ring CL, Pearce RG, Setzer RW, Wetmore BA, Wambaugh JF. Identifying populations sensitive to environmental chemicals by simulating toxicokinetic variability. Environ Int. 2017 Sep;106:105-118. PMID: [28628784](https://pubmed.ncbi.nlm.nih.gov/28628784/).
+
++ Ring C, Sipes NS, Hsieh JH, Carberry C, Koval LE, Klaren WD, Harris MA, Auerbach SS, Rager JE. Predictive modeling of biological responses in the rat liver using in vitro Tox21 bioactivity: Benefits from high-throughput toxicokinetics. Comput Toxicol. 2021 May;18:100166. PMID: [34013136](https://pubmed.ncbi.nlm.nih.gov/34013136/).
+
++ Rotroff DM, Wetmore BA, Dix DJ, Ferguson SS, Clewell HJ, Houck KA, Lecluyse EL, Andersen ME, Judson RS, Smith CM, Sochaski MA, Kavlock RJ, Boellmann F, Martin MT, Reif DM, Wambaugh JF, Thomas RS. Incorporating human dosimetry and exposure into high-throughput in vitro toxicity screening. Toxicol Sci. 2010 Oct;117(2):348-58. PMID: [20639261](https://pubmed.ncbi.nlm.nih.gov/20639261/).
+
++ Wetmore BA, Wambaugh JF, Ferguson SS, Sochaski MA, Rotroff DM, Freeman K, Clewell HJ 3rd, Dix DJ, Andersen ME, Houck KA, Allen B, Judson RS, Singh R, Kavlock RJ, Richard AM, Thomas RS. Integration of dosimetry, exposure, and high-throughput screening data in chemical toxicity assessment. Toxicol Sci. 2012 Jan;125(1):157-74. PMID: [21948869](https://pubmed.ncbi.nlm.nih.gov/21948869/).
+
++ Wambaugh JF, Wetmore BA, Pearce R, Strope C, Goldsmith R, Sluka JP, Sedykh A, Tropsha A, Bosgra S, Shah I, Judson R, Thomas RS, Setzer RW. Toxicokinetic Triage for Environmental Chemicals. Toxicol Sci. 2015 Sep;147(1):55-67. PMID: [26085347](https://pubmed.ncbi.nlm.nih.gov/26085347/).
+
++ Wambaugh JF, Wetmore BA, Ring CL, Nicolas CI, Pearce RG, Honda GS, Dinallo R, Angus D, Gilbert J, Sierra T, Badrinarayanan A, Snodgrass B, Brockman A, Strock C, Setzer RW, Thomas RS. Assessing Toxicokinetic Uncertainty and Variability in Risk Prioritization. Toxicol Sci. 2019 Dec 1;172(2):235-251. doi: 10.1093/toxsci/kfz205. PMID: [31532498](https://pubmed.ncbi.nlm.nih.gov/31532498/).
+
+
+
+
+
+
+:::tyk
+1. After exposure to a single daily dose of 1 mg/kg/day methylparaben, what is the maximum concentration of methylparaben estimated to occur in human liver, according to the 3-compartment model implemented in *httk*?
+2. What is the predicted range of methylparaben concentrations in plasma that can occur in a human population, assuming a long-term exposure rate of 1 mg/kg/day and 3-compartment steady-state conditions? Provide estimates at the 5th, 50th, and 95th percentile.
+:::
diff --git a/Chapter_6/Module6_6_Input/Module6_6_Image1.png b/Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image1.png
similarity index 100%
rename from Chapter_6/Module6_6_Input/Module6_6_Image1.png
rename to Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image1.png
diff --git a/Chapter_6/Module6_6_Input/Module6_6_Image2.png b/Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image2.png
similarity index 100%
rename from Chapter_6/Module6_6_Input/Module6_6_Image2.png
rename to Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_Image2.png
diff --git a/Chapter_6/Module6_6_Input/Module6_6_InputData1.csv b/Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData1.csv
similarity index 100%
rename from Chapter_6/Module6_6_Input/Module6_6_InputData1.csv
rename to Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData1.csv
diff --git a/Chapter_6/Module6_6_Input/Module6_6_InputData2.csv b/Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData2.csv
similarity index 100%
rename from Chapter_6/Module6_6_Input/Module6_6_InputData2.csv
rename to Chapter_6/6_6_Toxicokinetic_Modeling/Module6_6_InputData2.csv
diff --git a/Chapter_6/6_7_Chemical_Read_Across/6_7_Chemical_Read_Across.Rmd b/Chapter_6/6_7_Chemical_Read_Across/6_7_Chemical_Read_Across.Rmd
new file mode 100644
index 0000000..ddef5b3
--- /dev/null
+++ b/Chapter_6/6_7_Chemical_Read_Across/6_7_Chemical_Read_Across.Rmd
@@ -0,0 +1,400 @@
+
+# 6.7 Chemical Read-Across for Toxicity Predictions
+
+This training module was developed by Grace Patlewicz, Lauren E. Koval, Alexis Payton, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+*Disclaimer: The views expressed in this document are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.*
+
+```{r 6-7-Chemical-Read-Across-1, include=FALSE}
+#set default values for R Markdown "knitting" to HTML, Word, or PDF
+knitr::opts_chunk$set(echo = TRUE) #print code chunks
+```
+
+## Introduction to Training Module
+
+The method of **read-across** represents one type of computational approach that is commonly used to predict a chemical's toxicological effects based on its properties. Other commonly used approaches in this field include **SAR** and **QSAR** analyses. A high-level overview of these definitions, along with simple illustrative examples of the three computational modeling approaches, is provided in the following schematic:
+```{r 6-7-Chemical-Read-Across-2, echo=FALSE }
+knitr::include_graphics("Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image1.png")
+```
+
+Focusing on read-across: this computational approach fills a data gap by using a chemical with existing data to make a prediction for a 'similar' chemical, typically one that is structurally similar. Thus, information from chemicals with data is read across to chemical(s) without data.
+
+In a typical read-across workflow, the first step is to define the problem: what question are we trying to address? The second step is to identify chemical analogues with existing information that can inform this question for a chemical of interest that is lacking data. A specific type of read-across that is commonly employed is termed 'Generalized Read-Across', or GenRA, which is based upon similarity-weighted activity predictions. This approach will be used when conducting the example read-across analysis in this training module, and it has been previously described and published:
+
++ Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G. Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information. Regul Toxicol Pharmacol. 2016 79:12-24. PMID: [27174420](https://pubmed.ncbi.nlm.nih.gov/27174420/)
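+
+The similarity-weighted activity prediction at the core of GenRA can be sketched in a few lines of R. Note this is a simplified illustration with hypothetical analogue values, not the published GenRA implementation:
+
+```{r 6-7-Chemical-Read-Across-genra-sketch}
+# Hypothetical source analogues: similarity to the target, and their activity values
+sim      <- c(0.90, 0.80, 0.76)  # Tanimoto similarities (made-up values)
+activity <- c(1.2, 0.9, 1.5)     # analogue toxicity values (made-up values)
+
+# Predicted activity for the target = similarity-weighted mean of analogue activities
+pred <- sum(sim * activity) / sum(sim)
+pred
+```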
+
+
+
+## Introduction to Activity
+
+In this activity we are going to consider a chemical of interest (which we call the target chemical) that is lacking acute oral toxicity information. Specifically, we would like to obtain estimates of the dose that causes lethality after acute (meaning, short-term) exposure conditions. These dose values are typically presented as LD50 values, and are usually collected through animal testing. There is great interest in reducing reliance upon animal testing, and we would like to avoid further animal testing as much as possible. With this goal in mind, this activity aims to estimate an LD50 value for the target chemical using completely computational approaches, leveraging existing data as best we can. To achieve this aim, we explore ways in which we can search for structurally similar chemicals that have acute toxicity data already available. Data on these structurally similar chemicals, termed 'source analogues', are then used to predict acute toxicity for the target chemical of interest using the GenRA approach.
+
+The dataset used for this training module was previously compiled and published in the following manuscript:
+
++ Helman G, Shah I, Patlewicz G. Transitioning the Generalised Read-Across approach (GenRA) to quantitative predictions: A case study using acute oral toxicity data. Comput Toxicol. 2019 Nov;12:100097. doi: 10.1016/j.comtox.2019.100097. PMID: [33623834](https://pubmed.ncbi.nlm.nih.gov/33623834/)
+
++ With associated data available at: https://github.com/USEPA/CompTox-GenRA-acutetox-comptoxicol/tree/master/input
+
+This exercise will specifically predict LD50 values for the chemical, 1-chloro-4-nitrobenzene (DTXSID5020281). This chemical is an organic compound with the formula ClC~6~H~4~NO~2~, and is a common intermediate in the production of a number of industrial compounds, including common antioxidants found in rubber.
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+1. How many chemicals with acute toxicity data are structurally similar to 1-chloro-4-nitrobenzene?
+2. What is the predicted LD50 for 1-chloro-4-nitrobenzene using the GenRA approach?
+3. How different is the predicted vs. experimentally observed LD50 for 1-chloro-4-nitrobenzene?
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 6-7-Chemical-Read-Across-3 }
+rm(list=ls())
+```
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you:
+```{r 6-7-Chemical-Read-Across-4, results=FALSE, message=FALSE}
+if (!requireNamespace("tidyverse", quietly = TRUE))
+  install.packages("tidyverse")
+if (!requireNamespace("fingerprint", quietly = TRUE))
+  install.packages("fingerprint")
+if (!requireNamespace("rcdk", quietly = TRUE))
+  install.packages("rcdk")
+```
+
+#### Loading R packages required for this session
+```{r 6-7-Chemical-Read-Across-5, results=FALSE, message=FALSE}
+library(tidyverse) #all tidyverse packages, including dplyr and ggplot2
+library(fingerprint) # a package that supports operations on molecular fingerprint data
+library(rcdk) # a package that interfaces with the 'CDK', a Java framework for chemoinformatics libraries packaged for R
+```
+
+#### Set your working directory
+```{r 6-7-Chemical-Read-Across-6, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+
+
+## Read-Across Example Analysis
+
+#### Loading Example Datasets
+Let's start by loading the datasets needed for this training module. We are going to use a dataset of substances that have chemical identification information ready in the form of SMILES, as well as acute toxicity data, in the form of LD50 values.
+
+The first file to upload is named `Module6_7_InputData1.csv` and contains the list of substances and their structural information, in the form of SMILES nomenclature. SMILES stands for Simplified Molecular-Input Line-Entry System, a form of line notation used to describe the structure of a chemical.
+
+The second file to upload is named `Module6_7_InputData2.csv` and contains the substances and their acute toxicity information.
+```{r 6-7-Chemical-Read-Across-7 }
+substances <- read.csv("Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData1.csv")
+acute_data <- read.csv("Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData2.csv")
+```
+
+Let's first view the substances dataset:
+```{r 6-7-Chemical-Read-Across-8 }
+dim(substances)
+```
+
+```{r 6-7-Chemical-Read-Across-9 }
+colnames(substances)
+```
+
+```{r 6-7-Chemical-Read-Across-10 }
+head(substances)
+```
+
+We can see that this dataset contains information on 6955 chemicals (rows). The columns are further described below:
+
++ `DTXSID`: a substance identifier provided through the [U.S. EPA's Computational Toxicology Dashboard](https://comptox.epa.gov/dashboard)
++ `SMILES`: a chemical structure identifier in line notation
++ `QSAR_READY_SMILES`: `SMILES` that have been standardized with respect to salts, tautomers, inorganics, aromaticity, and stereochemistry (among other factors) prior to any QSAR modeling or prediction. These are the values we will need in a later step to construct chemical fingerprints.
+
+Let's make sure that these values are recognized as character format and place them in their own vector, to ensure proper execution of functions throughout this script:
+```{r 6-7-Chemical-Read-Across-11 }
+all_smiles <- as.character(substances$QSAR_READY_SMILES)
+```
+
+Now let's view the acute toxicity dataset:
+```{r 6-7-Chemical-Read-Across-12 }
+dim(acute_data)
+```
+
+```{r 6-7-Chemical-Read-Across-13 }
+colnames(acute_data)
+```
+
+```{r 6-7-Chemical-Read-Across-14 }
+head(acute_data)
+```
+
+We can see that this dataset contains information on 6955 chemicals (rows). Some notable columns are explained below:
+
++ `DTXSID`: a substance identifier provided through the [U.S. EPA's Computational Toxicology Dashboard](https://comptox.epa.gov/dashboard)
++ `casrn`: Chemical Abstracts Service Registry Number (CASRN)
++ `mol_weight`: molecular weight
++ `LD50_LM`: the -log~10~ of the millimolar LD50. LD stands for 'Lethal Dose'. The LD50 value is the dose of a substance, given all at once, that causes the death of 50% of a group of test animals. The lower the LD50 in mg/kg, the more toxic the substance.
+
+#### Important Notes on Units
+In modeling studies, the convention is to convert toxicity values expressed in mass units (e.g., mg/kg) into their molar or millimolar equivalents, and then to take the base-10 logarithm. So that higher toxicities are expressed by higher values (which increases clarity when plotting), the negative logarithm is then taken. For example, substance DTXSID00142939 has a molecular weight of 99.089 g/mol and an LD50 of 32 mg/kg. This converts to $\frac{32}{99.089} = 0.322942$ mmol/kg, whose base-10 logarithm is -0.4908755. Taking the negative logarithm, *i.e.* $-\log_{10}[\text{mmol/kg}]$, yields a value of 0.4908755. This conversion has been used to create the `LD50_LM` values in the acute toxicity dataset.
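+
+We can check this worked example directly in R, using the values quoted above:
+
+```{r 6-7-Chemical-Read-Across-units-check}
+# Worked example for DTXSID00142939 (values from the text above)
+mol_weight <- 99.089  # grams per mole
+ld50_mgkg  <- 32      # LD50 in mg/kg
+
+ld50_mmolkg <- ld50_mgkg / mol_weight  # convert to mmol/kg
+ld50_lm     <- -log10(ld50_mmolkg)     # take the negative base-10 logarithm
+c(ld50_mmolkg, ld50_lm)  # approximately 0.322942 and 0.4908755
+```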
+
+Let's check to see whether the same chemicals are present in both datasets:
+```{r 6-7-Chemical-Read-Across-15 }
+# First need to make sure that both dataframes are sorted by the identifier, DTXSID
+substances <- substances[order(substances$DTXSID),]
+acute_data <- acute_data[order(acute_data$DTXSID),]
+# Then test to see whether data in these columns are equal
+unique(substances$DTXSID == acute_data$DTXSID)
+```
+The only value returned is `TRUE`, meaning the identifiers are equal row-for-row (the same chemicals appear in both datasets, in the same order).
+
+
+### Data Visualizations of Acute Toxicity Values
+
+Let's create a plot to show the distribution of the LD50 values in the dataset.
+```{r 6-7-Chemical-Read-Across-16, fig.align = "center"}
+ggplot(data = acute_data, aes(LD50_mgkg)) +
+ stat_ecdf(geom = "point")
+
+ggplot(data = acute_data, aes(LD50_LM)) +
+ stat_ecdf(geom = "point")
+```
+
+**Can you see a difference between these two plots?**
+
+Yes, when the LD50 mg/kg values are converted to the -log[mmol/kg] scale (`LD50_LM`), the distribution resembles a normal cumulative distribution curve.
+
+
+### Selecting the 'Target' Chemical of Interest for Read-Across Analysis
+For this exercise, we will select a 'target' substance of interest from our dataset, assume that we have no acute toxicity data for it, and perform read-across for this target chemical. Note that this module's example dataset actually has full data coverage (meaning all chemicals have acute toxicity data), which makes this exercise beneficial: we can make toxicity predictions and then check how close we are against the experimentally observed values.
+
+Our target substance for this exercise is going to be DTXSID5020281, which is 1-chloro-4-nitrobenzene. This chemical is an organic compound with the formula ClC~6~H~4~NO~2~, and is a common intermediate in the production of a number of industrially useful compounds, including common antioxidants found in rubber. Here is an image of the chemical structure (https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID5020281):
+
+```{r 6-7-Chemical-Read-Across-17, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image2.png")
+```
+
+Filtering the dataframes for only data on this target substance:
+```{r 6-7-Chemical-Read-Across-18 }
+target_substance <- filter(substances, DTXSID == 'DTXSID5020281')
+target_acute_data <- filter(acute_data, DTXSID == 'DTXSID5020281')
+```
+
+
+
+### Calculating Structural Similarities between Substances
+
+To eventually identify chemical analogues with information that can be 'read-across' to our target chemical (1-chloro-4-nitrobenzene), we first need to evaluate how similar each chemical is to one another. In this example, we will base our search for similar substances upon similarities between chemical structure fingerprint representations. Once these chemical structure fingerprints are derived, they will be used to calculate the degree to which each possible pair of chemicals is similar, leveraging the Tanimoto metric. These findings will yield a similarity matrix of all possible pairwise similarity scores.
+
+
+#### Converting Chemical Identifiers into Molecular Objects (MOL)
+
+To derive structure fingerprints across all evaluated substances, we need to first convert the chemical identifiers originally provided as `QSAR_READY_SMILES` into molecular objects. The standard exchange format for molecular information is a MOL file. This is a chemical file format that contains plain text information and stores information about atoms, bonds and their connections.
+
+We can carry out these identifier conversions using the `parse.smiles()` function within the *rcdk* package. Here, we do this for the target chemical of interest, as well as all substances in the dataset.
+```{r 6-7-Chemical-Read-Across-19 }
+target_mol <- parse.smiles(as.character(target_substance$QSAR_READY_SMILES))
+all_mols <- parse.smiles(all_smiles)
+```
+
+#### Computing chemical fingerprints
+
+With these mol data, we can now compute the fingerprints for our target substance, as well as all the substances in the dataset. We can compute fingerprints leveraging the `get.fingerprint()` function. Let's first run it on the target chemical:
+```{r 6-7-Chemical-Read-Across-20 }
+target.fp <- get.fingerprint(target_mol[[1]], type = 'standard')
+target.fp # View fingerprint
+```
+
+We can run the same function over the entire `all_mols` dataset, leveraging the `lapply()` function:
+```{r 6-7-Chemical-Read-Across-21 }
+all.fp <- lapply(all_mols, get.fingerprint, type = 'standard')
+```
+
+
+
+## Calculating Chemical Similarities
+
+Using these molecular fingerprint data, we can now calculate the degree to which each chemical is similar to another, based on structural similarity. The method employed in this example is the Tanimoto method. The Tanimoto similarity metric is a unitless number between zero and one that measures how similar two sets (in this case, 2 chemicals) are to one another. A Tanimoto index of 1 means the 2 chemicals are identical, whereas an index of 0 means that the chemicals share nothing in common. In the context of the fingerprints, a Tanimoto index of 0.5 means that half of the fingerprint matches between two chemicals whilst the other half does not match.
+
+Once these Tanimoto similarity indices are calculated between every possible chemical pair, the similarity results can be viewed in the form of a similarity matrix. In this matrix, all substances are listed across the rows and columns, and the degree to which every possible chemical pair is similar is summarized through values contained within the matrix. Further information about chemical similarity can be found here: https://en.wikipedia.org/wiki/Chemical_similarity
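+
+As a toy illustration (real fingerprints from `get.fingerprint()` are much longer bit strings), the Tanimoto index between two short hypothetical binary fingerprints can be computed by hand:
+
+```{r 6-7-Chemical-Read-Across-tanimoto-toy}
+# Two hypothetical 6-bit fingerprints
+fp_a <- c(1, 1, 0, 1, 0, 1)
+fp_b <- c(1, 0, 0, 1, 1, 1)
+
+# Tanimoto = (bits set in both) / (bits set in either)
+tanimoto <- sum(fp_a & fp_b) / sum(fp_a | fp_b)
+tanimoto # 3 shared bits out of 5 total set bits = 0.6
+```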
+
+Steps to generate this similarity matrix are detailed here:
+```{r 6-7-Chemical-Read-Across-22 }
+all.fp.sim <- fingerprint::fp.sim.matrix(all.fp, method = 'tanimoto')
+all.fp.sim <- as.data.frame(all.fp.sim) # Convert the outputted matrix to a dataframe
+colnames(all.fp.sim) = substances$DTXSID # Placing chemical identifiers back as column headers
+row.names(all.fp.sim) = substances$DTXSID # Placing chemical identifiers back as row names
+```
+
+Since we are querying a large number of chemicals, it is difficult to view the entire resulting similarity matrix. Let's instead view portions of these results:
+```{r 6-7-Chemical-Read-Across-23 }
+all.fp.sim[1:5,1:5] # Viewing the first five rows and columns of data
+```
+
+
+```{r 6-7-Chemical-Read-Across-24 }
+all.fp.sim[6:10,6:10] # Viewing the next five rows and columns of data
+```
+You can see that there is an identity line within this similarity matrix, corresponding to instances where a chemical's structure is compared to itself and the similarity value is 1.00000.
+
+All other possible chemical pairings show variable similarity scores, ranging from:
+```{r 6-7-Chemical-Read-Across-25 }
+min(all.fp.sim)
+```
+
+a minimum of zero, indicating no shared features between two chemical structures.
+```{r 6-7-Chemical-Read-Across-26 }
+max(all.fp.sim)
+```
+
+a maximum of 1, indicating identical chemical structures (which occurs here when comparing a chemical to itself).
+
+### Identifying Chemical Analogues
+This step will find substances that are structurally similar to the target chemical, 1-chloro-4-nitrobenzene (with DTXSID5020281). Structurally similar chemicals are referred to as 'source analogues', with information that will be carried forward in this read-across analysis.
+
+The first step to identifying chemical analogues is to subset the full similarity matrix to focus just on our target chemical.
+```{r 6-7-Chemical-Read-Across-27 }
+target.sim <- all.fp.sim %>%
+ filter(row.names(all.fp.sim) == 'DTXSID5020281')
+```
+
+Then we'll extract the substances that exceed a similarity threshold of 0.75, by keeping only the columns containing values > 0.75.
+```{r 6-7-Chemical-Read-Across-28 }
+target.sim <- target.sim %>%
+ select_if(function(x) any(x > 0.75))
+
+dim(target.sim) # Show dimensions of subsetted matrix
+```
+
+This gives us our analogues list! Specifically, we selected 12 columns of data, representing our target chemical plus 11 structurally similar chemicals. Let's create a dataframe of these substance identifiers to carry forward in the read-across analysis:
+```{r 6-7-Chemical-Read-Across-29 }
+source_analogues <- t(target.sim) # Transposing the filtered similarity matrix results
+DTXSID <- rownames(source_analogues) # Temporarily grabbing the DTXSID identifiers from this matrix
+source_analogues <- cbind(DTXSID, source_analogues) # Adding these identifiers as a column
+rownames(source_analogues) <- NULL # Removing the rownames from this dataframe, to land on a cleaned dataframe
+colnames(source_analogues) <- c('DTXSID', 'Target_TanimotoSim') # Renaming column headers
+source_analogues[1:12,1:2] # Viewing the cleaned dataframe of analogues
+```
+
+### Answer to Environmental Health Question 1
+:::question
+*With these, we can answer **Environmental Health Question #1***: How many chemicals with acute toxicity data are structurally similar to 1-chloro-4-nitrobenzene?
+:::
+
+:::answer
+**Answer**: In this dataset, 11 chemicals are structurally similar to the target chemical, based on a Tanimoto similarity score of > 0.75.
+:::
+
+
+
+## Chemical Read-Across to Predict Acute Toxicity
+Acute toxicity data from these chemical analogues can now be extracted and read across to the target chemical (1-chloro-4-nitrobenzene) to make predictions about its toxicity.
+
+Let's first merge the acute data for these analogues into our working dataframe:
+```{r 6-7-Chemical-Read-Across-30 }
+source_analogues <- merge(source_analogues, acute_data, by.x = 'DTXSID', by.y = 'DTXSID')
+```
+
+Then, let's remove the target chemical of interest and create a new dataframe of just the source analogues:
+```{r 6-7-Chemical-Read-Across-31 }
+source_analogues_only <- source_analogues %>%
+ filter(Target_TanimotoSim!=1) # Removing the row of data with the target chemical, identified as the chemical with a similarity of 1 to itself
+
+source_analogues_only[1:11,1:10] # Viewing the combined dataset of source analogues
+```
+
+### Read-across Calculations using GenRA
+The final generalized read-across (GenRA) prediction is based on a similarity-weighted activity score. This score is calculated as the following weighted average:
+
+for each source analogue, the pairwise similarity between the target and that analogue is multiplied by the analogue's toxicity value; these products are summed across all analogues; and that sum is then divided by the sum of all pairwise similarities. For further details on this algorithm and its full formulation, see [Shah et al.](https://pubmed.ncbi.nlm.nih.gov/27174420/).
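+
+This weighted average can be written compactly as:
+
+$$\text{Read-across prediction} = \frac{\sum_{i=1}^{n} s_i \, y_i}{\sum_{i=1}^{n} s_i}$$
+
+where $s_i$ is the Tanimoto similarity between the target and source analogue $i$, $y_i$ is the toxicity value of analogue $i$ (here, the -log~10~ millimolar LD50), and $n$ is the number of source analogues.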
+
+Here are the underlying calculations needed to derive the similarity weighted activity score for this current exercise:
+```{r 6-7-Chemical-Read-Across-32 }
+source_analogues_only$wt_tox_calc <-
+ as.numeric(source_analogues_only$Target_TanimotoSim) * source_analogues_only$LD50_LM
+# Calculating (pairwise similarity between the target and source analogue) * (the toxicity of the source analogue)
+# for each analogue, and saving it as a new column titled 'wt_tox_calc'
+
+source_analogues_only[1:3,1:11] # Viewing a portion of the updated dataframe with the 'wt_tox_calc' column
+
+sum_tox <- sum(source_analogues_only$wt_tox_calc) # Summing this wt_tox_calc value across all analogues
+
+sum_sims <- sum(as.numeric(source_analogues_only$Target_TanimotoSim)) # Summing all of the pairwise Tanimoto similarity scores
+
+ReadAcross_Pred <- sum_tox/sum_sims # Final calculation for the weighted activity score (i.e., read-across prediction)
+```
+
+### Converting LD50 Units
+Right now, these results are in units of -log~10~ millimolar. We still need to convert them into a mg/kg equivalent, by converting out of -log~10~ units and multiplying by the molecular weight of 1-chloro-4-nitrobenzene (157.55 g/mol):
+```{r 6-7-Chemical-Read-Across-33 }
+ReadAcross_Pred <- (10^(-ReadAcross_Pred))*157.55
+ReadAcross_Pred
+```
+
+### Answer to Environmental Health Question 2
+:::question
+*With this, we can answer **Environmental Health Question #2***: What is the predicted LD50 for 1-chloro-4-nitrobenzene, using the GenRA approach?
+:::
+
+:::answer
+**Answer**: 1-chloro-4-nitrobenzene has a predicted LD50 of 471 mg/kg.
+:::
+
+### Visual Representation of this Read-Across Approach
+
+Here is a schematic summarizing the steps we employed in this analysis:
+```{r 6-7-Chemical-Read-Across-34, echo=FALSE }
+knitr::include_graphics("Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image3.png")
+```
+
+
+### Comparing Read-Across Predictions to Experimental Observations
+
+Let's now compare how close this computationally-based prediction is to the experimentally observed LD50 value:
+```{r 6-7-Chemical-Read-Across-35 }
+target_acute_data$LD50_mgkg
+```
+We can see that the experimentally observed LD50 value for this chemical is 460 mg/kg.
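+
+As a quick numerical check, we can compute the relative difference between the prediction and the observation. This is a small sketch assuming the `ReadAcross_Pred` value and `target_acute_data` dataframe created earlier in this module:
+```{r 6-7-Chemical-Read-Across-percent-diff}
+# Percent difference between the read-across prediction and the observed LD50
+percent_diff <- 100 * abs(ReadAcross_Pred - target_acute_data$LD50_mgkg) /
+  target_acute_data$LD50_mgkg
+percent_diff
+```
+A small percent difference here indicates close agreement between the computational prediction and the experimental observation.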
+
+### Answer to Environmental Health Question 3
+:::question
+*With this, we can answer **Environmental Health Question #3***: How different is the predicted vs. experimentally observed LD50 for 1-chloro-4-nitrobenzene?
+:::
+
+:::answer
+**Answer**: The predicted LD50 is 471 mg/kg, and the experimentally observed LD50 is 460 mg/kg, which is reasonably close!
+:::
+
+
+
+## Concluding Remarks
+
+In conclusion, this training module leverages a dataset of substances with structural representations and toxicity data to create chemical fingerprint representations. We have selected a chemical of interest (target) and used the most similar analogues based on a similarity threshold to predict the acute toxicity of that target using the generalized read-across formula of weighted activity by similarity. We have seen that the prediction is in close agreement with that already reported for the target chemical in the dataset. Similar methods can be used to predict other toxicity endpoints, based on other datasets of chemicals. Additionally, further efforts are aimed at expanding read-across approaches to integrate *in vitro* data.
+
+More information on the GenRA approach as implemented in the EPA CompTox Chemicals Dashboard, as well as the extension of read-across to include bioactivity information, are described in the following manuscripts:
+
++ Shah I, Liu J, Judson RS, Thomas RS, Patlewicz G. Systematically evaluating read-across prediction and performance using a local validity approach characterized by chemical structure and bioactivity information. Regul Toxicol Pharmacol. 2016;79:12-24. PMID: [27174420](https://pubmed.ncbi.nlm.nih.gov/27174420/)
+
++ Helman G, Shah I, Williams AJ, Edwards J, Dunne J, Patlewicz G. Generalized Read-Across (GenRA): A workflow implemented into the EPA CompTox Chemicals Dashboard. ALTEX. 2019;36(3):462-465. PMID: [30741315](https://pubmed.ncbi.nlm.nih.gov/30741315/).
+
++ GenRA has also been implemented as a standalone [python package](https://pypi.org/project/genra/#description).
+
+
+
+
+
+:::tyk
+Use the same input data we used in this module to answer the following questions.
+
+1. How many source analogues are structurally similar to methylparaben (DTXSID4022529) when considering a similarity threshold of 0.75?
+2. What is the predicted LD50 for methylparaben in mg/kg, and how does this compare to the measured LD50 for methylparaben?
+:::
+
+
+
+
+
+
diff --git a/Chapter_6/Module6_7_Input/Module6_7_Image1.png b/Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image1.png
similarity index 100%
rename from Chapter_6/Module6_7_Input/Module6_7_Image1.png
rename to Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image1.png
diff --git a/Chapter_6/Module6_7_Input/Module6_7_Image2.png b/Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image2.png
similarity index 100%
rename from Chapter_6/Module6_7_Input/Module6_7_Image2.png
rename to Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image2.png
diff --git a/Chapter_6/Module6_7_Input/Module6_7_Image3.png b/Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image3.png
similarity index 100%
rename from Chapter_6/Module6_7_Input/Module6_7_Image3.png
rename to Chapter_6/6_7_Chemical_Read_Across/Module6_7_Image3.png
diff --git a/Chapter_6/Module6_7_Input/Module6_7_InputData1.csv b/Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData1.csv
similarity index 100%
rename from Chapter_6/Module6_7_Input/Module6_7_InputData1.csv
rename to Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData1.csv
diff --git a/Chapter_6/Module6_7_Input/Module6_7_InputData2.csv b/Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData2.csv
similarity index 100%
rename from Chapter_6/Module6_7_Input/Module6_7_InputData2.csv
rename to Chapter_6/6_7_Chemical_Read_Across/Module6_7_InputData2.csv
diff --git a/Chapter_7/7_1_Comparative_Toxicogenomics_Database/7_1_Comparative_Toxicogenomics_Database.Rmd b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/7_1_Comparative_Toxicogenomics_Database.Rmd
new file mode 100644
index 0000000..49b221c
--- /dev/null
+++ b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/7_1_Comparative_Toxicogenomics_Database.Rmd
@@ -0,0 +1,368 @@
+# (PART\*) Chapter 7 Environmental Health Database Mining {-}
+
+# 7.1 Comparative Toxicogenomics Database
+
+This training module was developed by Lauren E. Koval, Kyle R. Roell, and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+
+## Introduction to Training Module
+
+The Comparative Toxicogenomics Database (CTD) is a publicly available, online database that provides manually curated information about chemical-gene/protein interactions, chemical-disease relationships, and gene-disease relationships. CTD also recently incorporated curation of exposure data and chemical-phenotype relationships.
+
+CTD is located at: http://ctdbase.org/. Here is a screenshot of the CTD homepage (as of August 5, 2021):
+```{r 7-1-Comparative-Toxicogenomics-Database-1, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image1.jpg")
+```
+
+In this module, we will use CTD to access and download data, then organize and analyze those data as an applications-based example in environmental health research. This activity demonstrates basic data manipulation, filtering, and organization steps in R, while highlighting the utility of CTD for identifying novel genomic/epigenomic relationships to environmental exposures. Example visualizations of the gene list comparison results are also included in this training module's script.
+
+
+
+### Training Module's Environmental Health Questions
+This training module was specifically developed to answer the following environmental health questions:
+
+(1) Which genes show altered expression in response to arsenic exposure?
+(2) Of the genes showing altered expression, which may be under epigenetic control?
+
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 7-1-Comparative-Toxicogenomics-Database-2 }
+rm(list=ls())
+```
+
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you.
+```{r 7-1-Comparative-Toxicogenomics-Database-3, results=FALSE, message=FALSE}
+if (!requireNamespace("tidyverse"))
+  install.packages("tidyverse")
+if (!requireNamespace("VennDiagram"))
+  install.packages("VennDiagram")
+if (!requireNamespace("grid"))
+  install.packages("grid")
+```
+
+
+#### Loading R packages required for this session
+```{r 7-1-Comparative-Toxicogenomics-Database-4, results=FALSE, message=FALSE}
+library(tidyverse)
+library(VennDiagram)
+library(grid)
+```
+
+
+#### Set your working directory
+```{r 7-1-Comparative-Toxicogenomics-Database-5, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+
+
+## CTD Data in R
+
+### Organizing Example Dataset from CTD
+
+CTD requires manual querying of its database, outside of the R scripting environment. Because of this, let's first manually pull the data we need for this example analysis. We can answer both of the example questions by pulling all chemical-gene relationship data for arsenic, which we can do by following the below steps:
+
+Navigate to the main CTD website: http://ctdbase.org/.
+
+Select at the top, 'Search' -> 'Chemical-Gene Interactions'.
+
+```{r 7-1-Comparative-Toxicogenomics-Database-6, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image2.jpg")
+```
+
+
+
+Select to query all chemical-gene interaction data for arsenic.
+
+```{r 7-1-Comparative-Toxicogenomics-Database-7, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image3.jpg")
+```
+
+
+
+Note that there are lots of results, represented by many rows of data! Scroll to the bottom of the webpage and select to download as 'CSV'.
+
+```{r 7-1-Comparative-Toxicogenomics-Database-8, echo=FALSE, fig.align='center' }
+knitr::include_graphics("Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image4.jpg")
+```
+
+
+
+This is the file that we can now import into the R environment and analyze!
+Note that the data pulled here represent data available as of August 1, 2021.
+
+
+
+### Loading the Example CTD Dataset into R
+
+
+
+Read in the csv file of the results from CTD query:
+```{r 7-1-Comparative-Toxicogenomics-Database-9, results=FALSE, message=FALSE}
+ctd = read_csv("Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_InputData1.csv")
+```
+
+
+
+Let's first see how many rows and columns of data this file contains:
+```{r 7-1-Comparative-Toxicogenomics-Database-10 }
+dim(ctd)
+```
+This dataset includes 6280 observations (represented by rows) linking arsenic exposure to gene-level alterations, with information spanning 9 columns.
+
+
+
+Let's also see what kind of data are organized within the columns:
+```{r 7-1-Comparative-Toxicogenomics-Database-11 }
+colnames(ctd)
+```
+
+
+```{r 7-1-Comparative-Toxicogenomics-Database-12 }
+# Viewing the first five rows of data, across all 9 columns
+ctd[1:5,1:9]
+```
+
+
+
+
+#### Filtering data for genes with altered expression
+
+
+
+To identify genes with altered expression in association with arsenic, we can leverage the results of our CTD query and filter this dataset to include only the rows that contain the term "expression" in the "Interaction Actions" column.
+```{r 7-1-Comparative-Toxicogenomics-Database-13 }
+exp_filt = ctd %>% filter(grepl("expression", `Interaction Actions`))
+```
+
+We now have 2586 observations, representing instances of arsenic exposure causing changes in a target gene's expression levels.
+```{r 7-1-Comparative-Toxicogenomics-Database-14 }
+dim(exp_filt)
+```
+
+
+
+Let's see how many unique genes this represents:
+```{r 7-1-Comparative-Toxicogenomics-Database-15 }
+length(unique(exp_filt$`Gene Symbol`))
+```
+This reflects 1878 unique genes that show altered expression in association with arsenic.
+
+
+
+Let's make a separate dataframe that includes only the unique genes, based on the "Gene Symbol" column.
+```{r 7-1-Comparative-Toxicogenomics-Database-16 }
+exp_genes = exp_filt %>% distinct(`Gene Symbol`, .keep_all=TRUE)
+
+# Removing columns besides gene identifier
+exp_genes = exp_genes[,4]
+
+# Viewing the first 10 genes listed
+exp_genes[1:10,]
+```
+This now provides us a list of 1878 genes showing altered expression in association with arsenic.
+
+
+##### Technical notes on running the distinct function within tidyverse:
+By default, the distinct function keeps the first instance of a duplicated value. This has implications if the remaining values in those rows differ: you will only retain the data associated with the first instance of the duplicated value (which is why we retained just the gene column here). It may be useful to first find the rows with duplicated values and verify that the results are as you would expect before removing observations. For example, in this dataset, expression levels can increase or decrease. If you were looking for just increases in expression, and a gene showed both increased and decreased expression across different samples, using the distinct function on "Gene Symbol" alone would not give you the results you wanted: if the first instance of the gene symbol noted decreased expression, that gene would be dropped from the results even though it might be one you would want. For this example, we only care about expression change regardless of direction, so this is not an issue. If you are concerned about this, the distinct function can also take multiple columns to consider jointly as the value to check for duplicates.
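+
+As a sketch of these checks, one might first inspect the duplicated rows and then run distinct on multiple columns. This example assumes the `exp_filt` dataframe created above, with its `Gene Symbol` and `Interaction Actions` columns:
+```{r 7-1-Comparative-Toxicogenomics-Database-distinct-check, eval=FALSE}
+# Inspect rows whose gene symbol appears more than once, to preview what distinct() will drop
+exp_filt %>%
+  group_by(`Gene Symbol`) %>%
+  filter(n() > 1) %>%
+  arrange(`Gene Symbol`)
+
+# Keep one row per unique combination of gene symbol AND interaction action
+exp_filt %>%
+  distinct(`Gene Symbol`, `Interaction Actions`, .keep_all = TRUE)
+```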
+
+
+
+### Answer to Environmental Health Question 1
+
+:::question
+*With this, we can answer **Environmental Health Question 1***:
+Which genes show altered expression in response to arsenic exposure?
+:::
+:::answer
+**Answer**: These 1878 genes have published evidence supporting their altered expression levels in association with arsenic exposure.
+:::
+
+
+
+## Identifying Genes Under Epigenetic Control
+
+
+For this dataset, let's focus on gene-level methylation as a marker of epigenetic regulation. Let's return to our main dataframe, representing the results of the CTD query, and filter these results for only the rows that contain the term "methylation" in the "Interaction Actions" column.
+```{r 7-1-Comparative-Toxicogenomics-Database-17 }
+met_filt = ctd %>% filter(grepl("methylation",`Interaction Actions`))
+```
+
+We now have 3211 observations, representing instances of arsenic exposure causing changes in a target gene's methylation levels.
+```{r 7-1-Comparative-Toxicogenomics-Database-18 }
+dim(met_filt)
+```
+
+
+Let's see how many unique genes this represents.
+```{r 7-1-Comparative-Toxicogenomics-Database-19 }
+length(unique(met_filt$`Gene Symbol`))
+```
+This reflects 3142 unique genes that show altered methylation in association with arsenic.
+
+
+
+Let's make a separate dataframe that includes only the unique genes, based on the "Gene Symbol" column.
+```{r 7-1-Comparative-Toxicogenomics-Database-20 }
+met_genes = met_filt %>% distinct(`Gene Symbol`, .keep_all=TRUE)
+
+# Removing columns besides gene identifier
+met_genes = met_genes[,4]
+```
+This now provides us a list of 3142 genes showing altered methylation in association with arsenic.
+
+
+
+With this list of genes with altered methylation, we can now compare it to the previous list of genes with altered expression to yield our final list of genes of interest. To achieve this last step, we present two different methods for carrying out list comparisons below.
+
+
+
+#### Method 1 for list comparisons: Merging
+
+
+
+Merge the expression results with the methylation results on the "Gene Symbol" column found in both datasets.
+```{r 7-1-Comparative-Toxicogenomics-Database-21 }
+merge_df = merge(exp_genes, met_genes, by = "Gene Symbol")
+```
+We end up with 315 rows, reflecting the 315 genes that show both altered expression and altered methylation.
+
+Let's view these genes:
+```{r 7-1-Comparative-Toxicogenomics-Database-22 }
+merge_df[1:315,]
+```
+
+
+
+### Answer to Environmental Health Question 2
+
+:::question
+*With this, we can answer **Environmental Health Question 2***:
+Of the genes showing altered expression, which may be under epigenetic control?
+:::
+:::answer
+**Answer**: We identified 315 genes with altered expression resulting from arsenic exposure that also demonstrate epigenetic modifications from arsenic. These genes include many high-interest molecules involved in regulating cell health, including several cyclin-dependent kinases (e.g., CDK2, CDK4, CDK5, CDK6), molecules involved in oxidative stress (e.g., FOSB, NOS2), and cytokines involved in inflammatory response pathways (e.g., IFNG, IL10, IL16, IL1R1, IL1RAP, TGFB1, TGFB3).
+:::
+
+
+
+#### Method 2 for list comparisons: Intersection
+For further training, shown here is another method for pulling this list of interest, through the use of the 'intersect' function.
+
+
+
+Obtain a list of the overlapping genes in the overall expression results and the methylation results.
+```{r 7-1-Comparative-Toxicogenomics-Database-23 }
+inxn = intersect(exp_filt$`Gene Symbol`,met_filt$`Gene Symbol`)
+```
+Again, we end up with a list of 315 unique genes that show altered expression and altered methylation.
+
+
+
+This list can be viewed on its own or converted to a dataframe (df).
+```{r 7-1-Comparative-Toxicogenomics-Database-24 }
+inxn_df = data.frame(genes=inxn)
+```
+
+
+
+This list can also be conveniently used to filter the original query results.
+```{r 7-1-Comparative-Toxicogenomics-Database-25 }
+inxn_df_all_data = ctd %>% filter(`Gene Symbol` %in% inxn)
+```
+
+
+
+Note that in this last case, the same 315 genes are present, but this time the results contain all records from the original query results, hence the 875 rows (875 observations reflecting the 315 genes).
+```{r 7-1-Comparative-Toxicogenomics-Database-26 }
+summary(unique(sort(inxn_df_all_data$`Gene Symbol`))==sort(merge_df$`Gene Symbol`))
+dim(inxn_df_all_data)
+```
+
+
+Visually we can represent this as a Venn diagram. Here, we use the ["VennDiagram"](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-35) R package.
+
+```{r 7-1-Comparative-Toxicogenomics-Database-27, message=F, eval=F, fig.align = "center"}
+# Use the data we previously used for intersection in the venn diagram function
+venn.plt = venn.diagram(
+ x = list(exp_filt$`Gene Symbol`, met_filt$`Gene Symbol`),
+ category.names = c("Altered Expression" , "Altered Methylation"),
+ filename = NULL,
+
+ # Change font size, type, and position
+ cat.cex = 1.15,
+ cat.fontface = "bold",
+ cat.default.pos = "outer",
+ cat.pos = c(-27, 27),
+ cat.dist = c(0.055, 0.055),
+
+ # Change color of ovals
+ col=c("#440154ff", '#21908dff'),
+  fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3))
+)
+
+```
+
+```{r 7-1-Comparative-Toxicogenomics-Database-28, fig.width = 7, fig.height = 7, echo=F, message=F, fig.align = "center"}
+# Use the data we previously used for intersection in the venn diagram function
+venn.plt = venn.diagram(
+ x = list(exp_filt$`Gene Symbol`, met_filt$`Gene Symbol`),
+ category.names = c("Altered Expression" , "Altered Methylation"),
+ filename = NULL,
+ output=F,
+
+ # Change font size, type, and position
+ cat.cex = 1.15,
+ cat.fontface = "bold",
+ cat.default.pos = "outer",
+ cat.pos = c(-27, 27),
+ cat.dist = c(0.055, 0.055),
+
+ # Change color of ovals
+ col=c("#440154ff", '#21908dff'),
+  fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3))
+)
+
+grid::grid.draw(venn.plt)
+```
+
+
+## Concluding Remarks
+In conclusion, we identified 315 genes that show altered expression in response to arsenic exposure and that may be under epigenetic control. These genes represent critical mediators of oxidative stress and inflammation, among other important cellular processes. Results yielded an important list of genes representing potential targets for further evaluation, to better understand mechanisms of environmental exposure-induced disease. Together, this example highlights the utility of CTD for addressing environmental health research questions.
+
+For more information, see the recently updated primary CTD publication:
+
++ Davis AP, Grondin CJ, Johnson RJ, Sciaky D, Wiegers J, Wiegers TC, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D1138-D1143. PMID: [33068428](https://pubmed.ncbi.nlm.nih.gov/33068428/).
+
+Additional case studies relevant to environmental health research include the following:
+
++ An example publication leveraging CTD findings to identify mechanisms of metals-induced birth defects: Ahir BK, Sanders AP, Rager JE, Fry RC. Systems biology and birth defects prevention: blockade of the glucocorticoid receptor prevents arsenic-induced birth defects. Environ Health Perspect. 2013 Mar;121(3):332-8. PMID: [23458687](https://pubmed.ncbi.nlm.nih.gov/23458687/).
+
++ An example publication leveraging CTD to help fill data gaps on data poor chemicals, in combination with ToxCast/Tox21 data streams, to elucidate environmental influences on disease pathways: Kosnik MB, Planchart A, Marvel SW, Reif DM, Mattingly CJ. Integration of curated and high-throughput screening data to elucidate environmental influences on disease pathways. Comput Toxicol. 2019 Nov;12:100094. PMID: [31453412](https://pubmed.ncbi.nlm.nih.gov/31453412/).
+
++ An example publication leveraging CTD to extract chemical-disease relationships used to derive new chemical risk values, with the goal of prioritizing connections between environmental factors, genetic variants, and human diseases: Kosnik MB, Reif DM. Determination of chemical-disease risk values to prioritize connections between environmental factors, genetic variants, and human diseases. Toxicol Appl Pharmacol. 2019 Sep 15;379:114674. [PMID: 31323264](https://pubmed.ncbi.nlm.nih.gov/31323264/).
+
+
+
+
+
+
+:::tyk
+
+Using the same dataset from this module (available at the GitHub site and as Module7_1_TYKInput.csv):
+
+1. Filter the data using the grepl function to look at only those observations that specifically decrease the target gene's "expression" level. How many observations are there?
+2. Similarly, filter the data to identify how many observations there are where the target gene's "expression" level is simply "affected". Create a Venn diagram to help visualize any overlap between these two filtered datasets.
+
+:::
diff --git a/Chapter_7/Module7_1_Input/Module7_1_Image1.jpg b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image1.jpg
similarity index 100%
rename from Chapter_7/Module7_1_Input/Module7_1_Image1.jpg
rename to Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image1.jpg
diff --git a/Chapter_7/Module7_1_Input/Module7_1_Image2.jpg b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image2.jpg
similarity index 100%
rename from Chapter_7/Module7_1_Input/Module7_1_Image2.jpg
rename to Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image2.jpg
diff --git a/Chapter_7/Module7_1_Input/Module7_1_Image3.jpg b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image3.jpg
similarity index 100%
rename from Chapter_7/Module7_1_Input/Module7_1_Image3.jpg
rename to Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image3.jpg
diff --git a/Chapter_7/Module7_1_Input/Module7_1_Image4.jpg b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image4.jpg
similarity index 100%
rename from Chapter_7/Module7_1_Input/Module7_1_Image4.jpg
rename to Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_Image4.jpg
diff --git a/Chapter_7/Module7_1_Input/Module7_1_InputData1.csv b/Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_InputData1.csv
similarity index 100%
rename from Chapter_7/Module7_1_Input/Module7_1_InputData1.csv
rename to Chapter_7/7_1_Comparative_Toxicogenomics_Database/Module7_1_InputData1.csv
diff --git a/Chapter_7/7_2_Gene_Expression_Omnibus/7_2_Gene_Expression_Omnibus.Rmd b/Chapter_7/7_2_Gene_Expression_Omnibus/7_2_Gene_Expression_Omnibus.Rmd
new file mode 100644
index 0000000..16bd6bf
--- /dev/null
+++ b/Chapter_7/7_2_Gene_Expression_Omnibus/7_2_Gene_Expression_Omnibus.Rmd
@@ -0,0 +1,638 @@
+
+# 7.2 Gene Expression Omnibus
+
+This training module was developed by Kyle R. Roell and Julia E. Rager.
+
+All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
+
+## Introduction to Training Module
+
+[GEO](https://www.ncbi.nlm.nih.gov/geo/) is a publicly available database repository of high-throughput gene expression data and hybridization arrays, chips, and microarrays that span genome-wide endpoints of genomics, transcriptomics, and epigenomics. This training module specifically guides trainees through the loading of required packages and data, including the manual upload of GEO data as well as the upload/organization of data leveraging the [GEOquery package](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html). Data are then further organized and combined with gene annotation information through the merging of platform annotation files. Example visualizations are then produced, including boxplots to evaluate the overall distribution of expression data across samples, as well as heat map visualizations that compare unscaled versus scaled gene expression values. Statistical analyses are then included to identify which genes are significantly altered in expression upon exposure to formaldehyde. Together, this training module serves as a simple example showing methods to access and download GEO data and to perform data organization, analysis, and visualization tasks through applications-based questions.
+
+
+## Introduction to GEO
+
+The GEO repository is organized and managed by the [National Center for Biotechnology Information (NCBI)](https://www.ncbi.nlm.nih.gov/), which seeks to advance science and health by providing access to biomedical and genomic information. The three [overall goals](https://www.ncbi.nlm.nih.gov/geo/info/overview.html) of GEO are to: (1) Provide a robust, versatile database in which to efficiently store high-throughput functional genomic data, (2) Offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community, and (3) Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest.
+
+Of high relevance to environmental health, data organized within GEO can be pulled and analyzed to address new environmental health questions, leveraging previously generated data. For example, we have pulled gene expression data from acute myeloid leukemia patients and re-analyzed these data to elucidate new mechanisms of epigenetically-regulated networks involved in cancer that, in turn, may be modified by environmental insults, as previously published in [Rager et al. 2012](https://pubmed.ncbi.nlm.nih.gov/22754483/). We have also pulled and analyzed gene expression data from published studies evaluating toxicity resulting from hexavalent chromium exposure, to further substantiate the role of epigenetic mediators in hexavalent chromium-induced carcinogenesis (see [Rager et al. 2019](https://pubmed.ncbi.nlm.nih.gov/30690063/)). This training exercise leverages an additional dataset that we published and deposited through GEO to evaluate the effects of formaldehyde inhalation exposure, as detailed below.
+
+
+## Introduction to Example Data
+
+In this training module, data will be pulled from the published GEO dataset recorded through the online series [GSE42394](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394). This series represents Affymetrix rat genome-wide microarray data generated from our previous study, aimed at evaluating the transcriptomic effects of formaldehyde across three tissues: the nose, blood, and bone marrow. For the purposes of this training module, we will focus on evaluating gene expression profiles from nasal samples after 7 days of exposure, collected from rats exposed to 2 ppm formaldehyde via inhalation. These findings, in addition to other epigenomic endpoint measures, have been previously published (see [Rager et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24304932/)).
+
+
+### Training Module's Environmental Health Questions
+
+This training module was specifically developed to answer the following environmental health questions:
+
+(1) What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
+(2) How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
+(3) Why do we often scale gene expression signatures prior to heat map visualizations?
+(4) What genes are altered in expression by formaldehyde inhalation exposure?
+(5) What are the potential biological consequences of these gene-level perturbations?
+
+
+
+### Script Preparations
+
+#### Cleaning the global environment
+```{r 7-2-Gene-Expression-Omnibus-1 }
+rm(list=ls())
+```
+
+
+#### Installing required R packages
+If you already have these packages installed, you can skip this step. Otherwise, you can run the code below, which checks installation status for you:
+```{r 7-2-Gene-Expression-Omnibus-2, results=FALSE, message=FALSE}
+if (!requireNamespace("tidyverse"))
+ install.packages("tidyverse")
+if (!requireNamespace("reshape2"))
+ install.packages("reshape2")
+
+# GEOquery, this will install BiocManager if you don't have it installed
+if (!requireNamespace("BiocManager"))
+ install.packages("BiocManager")
+BiocManager::install("GEOquery")
+```
+
+
+#### Loading R packages required for this session
+```{r 7-2-Gene-Expression-Omnibus-3, results=FALSE, message=FALSE, warning=FALSE}
+library(tidyverse)
+library(reshape2)
+library(GEOquery)
+```
+For more information on the **tidyverse package**, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/tidyverse/index.html), primary [webpage](https://www.tidyverse.org/packages/), and peer-reviewed [article released in 2018](https://onlinelibrary.wiley.com/doi/10.1002/sdr.1600).
+
+For more information on the **reshape2 package**, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/reshape2/index.html), [R Documentation](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4), and [helpful website](https://seananderson.ca/2013/10/19/reshape/) providing an introduction to the reshape2 package.
+
+For more information on the **GEOquery package**, see its associated [Bioconductor website](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html) and [R Documentation file](https://www.rdocumentation.org/packages/GEOquery/versions/2.38.4).
+
+
+
+#### Set your working directory
+```{r 7-2-Gene-Expression-Omnibus-4, eval=FALSE, echo=TRUE}
+setwd("/filepath to where your input files are")
+```
+
+```{r 7-2-Gene-Expression-Omnibus-5, echo=FALSE}
+#setwd("/Users/juliarager/IEHS Dropbox/Julia Rager/Research Projects/1_SRP/4_DMAC/DMAC Training Modules/Training_Modules/3_Chapter 3/3_2_Database_GEO/Clean_Files/")
+```
+
+## GEO Data in R
+
+Let's start by loading the GEO dataset needed for this training module. As explained in the introduction, this module walks through two methods of loading GEO data: a manual option vs. an automatic option using the GEOquery package. These two methods are detailed below.
+
+### 1. Manually Downloading and Uploading GEO Files
+
+In this first method, we will navigate to the dataset within the GEO website, manually download its associated text data file, save it in our working directory, and then upload it into our global environment in R.
+
+For the purposes of this training exercise, we manually downloaded the GEO series matrix file from the GEO series webpage, located at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394. The specific file that was downloaded was noted as "GSE42394_series_matrix.txt", pulled by clicking on the link indicated by the red arrow from the GEO series webpage:
+
+```{r 7-2-Gene-Expression-Omnibus-6, echo=FALSE, fig.width=4, fig.height=5, fig.align = "center"}
+knitr::include_graphics("Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image1.png")
+```
+
+
+For simplicity, we also have already pre-filtered this file for the samples we are interested in, focusing on the rat nasal gene expression data after 7 days of exposure to gaseous formaldehyde. This filtered file was saved as "GSE42394_series_matrix_filtered.txt", then renamed "Module7_2_InputData1.txt" for use in this module.
+
+
+At this point, we can simply read in this pre-filtered text file for the purposes of this training module:
+```{r 7-2-Gene-Expression-Omnibus-7 }
+geodata_manual = read.table(file="Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData1.txt",
+ header=T)
+```
+
+
+Because this is a manual approach, we also have to manually define the treated and untreated samples (based on reviewing the accompanying metadata on the GEO webpage).
+
+Manually defining treated and untreated for these samples of interest:
+```{r 7-2-Gene-Expression-Omnibus-8 }
+exposed_manual = c("GSM1150940", "GSM1150941", "GSM1150942")
+unexposed_manual = c("GSM1150937", "GSM1150938", "GSM1150939")
+```
+
+
+
+### 2. Loading and Organizing GEO Files through the GEOquery Package
+In this second method, we will leverage the GEOquery package, which allows for easier downloading and reading in of data from GEO, without having to manually download raw text files or manually assign sample attributes (e.g., exposed vs. unexposed). This package is set up to automatically merge sample information from GEO metadata files with raw genome-wide datasets.
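+
+As a side note (a minimal sketch, not run in this module): `getGEO()` can also query GEO directly by accession number over the internet, rather than reading a local file. The accession shown here is the series used in this module; downloading can take a while.
+
+```r
+# Query GEO directly by series accession (requires an internet connection)
+# getGEO() returns a list of ExpressionSet objects, one per platform
+gse.list = GEOquery::getGEO("GSE42394", GSEMatrix = TRUE)
+geo.online = gse.list[[1]]   # First (and here, only) platform
+```
+
+Reading a previously downloaded series matrix file with `getGEO(filename=...)` instead avoids re-downloading the data each time the script is run.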
+
+
+Let's first use the getGEO function (from the GEOquery package) to load data from our series matrix ("GSE42394_series_matrix.txt", renamed "Module7_2_InputData2.txt" for use in this module). *Note that this line of code may take a couple of minutes to run.*
+```{r 7-2-Gene-Expression-Omnibus-9, message=FALSE}
+geo.getGEO.data = getGEO(filename='Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData2.txt')
+```
+
+
+
+One of the reasons the GEOquery package is so helpful is that we can automatically pull nicely organized sample information for a dataset using the `pData()` function.
+```{r 7-2-Gene-Expression-Omnibus-10 }
+sampleInfo = pData(geo.getGEO.data)
+```
+
+
+Let's view this sample information / metadata file, first by viewing what the column headers are.
+```{r 7-2-Gene-Expression-Omnibus-11 }
+colnames(sampleInfo)
+```
+
+Then viewing the first ten rows of the first five columns.
+```{r 7-2-Gene-Expression-Omnibus-12 }
+sampleInfo[1:10,1:5]
+```
+
+This shows that each sample is provided with a unique number starting with "GSM", and these are described by information summarized in the "title" column. We can also see that these data were made public on Jan 7, 2014.
+
+
+Let's view the next five columns.
+```{r 7-2-Gene-Expression-Omnibus-13 }
+sampleInfo[1:10,6:10]
+```
+
+We can see the type of sample that was analyzed (i.e., RNA), more information about the collected samples in the `source_name_ch1` column, and the organism (rat) in the `organism_ch1` column.
+
+
+More detailed metadata information is provided throughout this file, as seen when viewing the column headers above.
+
+
+#### Defining samples
+
+Now, we can use this information to define the samples we want to analyze. Note that this is the same step we did manually above.
+
+In this training exercise, we are focusing on responses in the nose, so we can easily filter for cell type = Nasal epithelial cells (specifically in the `cell type:ch1` variable). We are also focusing on responses collected after 7 days of exposure, which we can filter for using time = 7 day (specifically in the `time:ch1` variable). We will also define exposed and unexposed samples using the variable `treatment:ch1`.
+
+First, let's subset the sampleInfo dataframe to just keep the samples we're interested in
+```{r 7-2-Gene-Expression-Omnibus-14 }
+# Define a vector variable (here we call it 'keep') that will store rows we want to keep
+keep = rownames(sampleInfo[which(sampleInfo$`cell type:ch1`=="Nasal epithelial cells"
+ & sampleInfo$`time:ch1`=="7 day"),])
+
+# Then subset the sample info for just those samples we defined in keep variable
+sampleInfo = sampleInfo[keep,]
+```
+
+
+Next, we can pull the exposed and unexposed animal IDs. Let's first see how these are labeled within the `treatment:ch1` variable.
+```{r 7-2-Gene-Expression-Omnibus-15 }
+unique(sampleInfo$`treatment:ch1`)
+```
+
+
+And then search for the rows of data, pulling the sample animal IDs (which are in the variable `geo_accession`).
+```{r 7-2-Gene-Expression-Omnibus-16 }
+exposedIDs = sampleInfo[which(sampleInfo$`treatment:ch1`=="2 ppm formaldehyde"),
+ "geo_accession"]
+unexposedIDs = sampleInfo[which(sampleInfo$`treatment:ch1`=="unexposed"),
+ "geo_accession"]
+```
+
+
+The next step is to pull the expression data we want to use in our analyses. The `exprs()` function allows us to easily extract these data, here subsetting to the samples defined in our previously generated 'keep' vector.
+```{r 7-2-Gene-Expression-Omnibus-17 }
+# As a reminder, this is what the 'keep' vector includes
+# (i.e., animal IDs that we're interested in)
+keep
+```
+
+```{r 7-2-Gene-Expression-Omnibus-18 }
+# Using the exprs() function
+geodata = exprs(geo.getGEO.data[,keep])
+```
+
+
+Let's view the full dataset as is now:
+```{r 7-2-Gene-Expression-Omnibus-19 }
+head(geodata)
+```
+This now represents a matrix of data, with animal IDs as column headers and expression levels within the matrix.
+
+
+#### Simplifying column names
+These column names are not the easiest to interpret, so let's rename these columns to indicate which animals were from the exposed vs. unexposed groups.
+
+We first need to convert our expression dataset to a dataframe so we can edit column names and continue with downstream data manipulations that require dataframe formats.
+```{r 7-2-Gene-Expression-Omnibus-20 }
+geodata = data.frame(geodata)
+```
+
+
+Let's remind ourselves what the column names are:
+```{r 7-2-Gene-Expression-Omnibus-21 }
+colnames(geodata)
+```
+
+We can determine which of these are exposed vs. unexposed animals by viewing our previously defined vectors.
+```{r 7-2-Gene-Expression-Omnibus-22 }
+exposedIDs
+unexposedIDs
+```
+
+With this we can tell that the first three listed IDs are from unexposed animals, and the last three IDs are from exposed animals.
+
+Let's simplify the names of these columns to indicate exposure status and replicate number.
+```{r 7-2-Gene-Expression-Omnibus-23 }
+colnames(geodata) = c("Control_1", "Control_2", "Control_3", "Exposed_1",
+ "Exposed_2", "Exposed_3")
+```
+
+
+And we'll now need to re-define our 'exposed' vs. 'unexposed' vectors for the downstream script.
+```{r 7-2-Gene-Expression-Omnibus-24 }
+exposedIDs = c("Exposed_1", "Exposed_2", "Exposed_3")
+unexposedIDs = c("Control_1", "Control_2", "Control_3")
+```
+
+
+
+Viewing the data again:
+```{r 7-2-Gene-Expression-Omnibus-25 }
+head(geodata)
+```
+
+These data are now looking easier to interpret and analyze. Still, the row identifiers are 8-digit numbers starting with "107...". We know that this is a gene expression dataset, but these identifiers, by themselves, don't tell us which genes they refer to. These numeric IDs are microarray probeset IDs produced by the Affymetrix platform used in the original study.
+
+**But how can we tell which genes are represented by these data?!**
+
+
+#### Adding gene symbol information
+
+Each -omics dataset contained within GEO points to a specific platform that was used to obtain measurements.
+In instances where we want more information surrounding the molecular identifiers, we can merge the platform-specific annotation file with the molecular IDs given in the full dataset.
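+
+As a toy illustration of this merging step (using made-up probe IDs and gene symbols, not the real annotation file):
+
+```r
+# Hypothetical expression data keyed by probe ID
+expr.toy = data.frame(ID = c("p1", "p2", "p3"),
+                      Control_1 = c(5.1, 7.2, 6.3))
+
+# Hypothetical platform annotation mapping probe IDs to gene symbols
+annot.toy = data.frame(ID = c("p1", "p2", "p3"),
+                       `Gene symbol` = c("Actb", "Tp53", ""),
+                       check.names = FALSE)
+
+# merge() joins the two tables on the shared "ID" column
+merge(expr.toy, annot.toy, by = "ID")
+```
+
+This is the same pattern applied later in this module to the real dataset and its platform annotation file.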
+
+For example, let's pull the platform-specific annotation file for this experiment. Let's revisit the [website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394) that contained the original dataset on GEO. Scroll down to where it lists "Platforms", and there is a hyperlinked platform number "GPL6247" (see arrow below).
+
+```{r 7-2-Gene-Expression-Omnibus-26, echo=FALSE, fig.width=4, fig.height=5, fig.align = "center"}
+knitr::include_graphics("Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image2.png")
+```
+
+
+Click on this, and you will be navigated to a different GEO website describing the Affymetrix rat array platform that was used in this analysis. Note that this website also includes information on when this array became available, links to other experiments that have used this platform within GEO, and much more.
+
+Here, we're interested in pulling the corresponding gene symbol information for the probeset IDs. To do so, scroll to the bottom, and click "Annotation SOFT table..." and download the corresponding .gz file within your working directory. Unzip this, and you will find the master annotation file: "GPL6247.annot".
+
+In this exercise, we've already done these steps and unzipped the file in our working directory. So at this point, we can simply read in this annotation dataset, renamed "Module7_2_InputData3.annot", still using the `getGEO()` function from the GEOquery package to help automate this step.
+
+```{r 7-2-Gene-Expression-Omnibus-27, warning=FALSE}
+geo.annot = GEOquery::getGEO(filename="Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData3.annot")
+```
+
+Now we can use the `Table()` function from GEOquery to pull data from the annotation dataset.
+```{r 7-2-Gene-Expression-Omnibus-28 }
+id.gene.table = GEOquery::Table(geo.annot)[,c("ID", "Gene symbol")]
+id.gene.table[1:10,1:2]
+```
+
+With these two columns of data, we now have the needed IDs and gene symbols to match with our dataset.
+
+Within the full dataset, we need to add a new column for the probeset ID, taken from the rownames, in preparation for the merging step.
+```{r 7-2-Gene-Expression-Omnibus-29 }
+geodata$ID = rownames(geodata)
+```
+
+We can now merge the gene symbol information by ID with our expression data.
+```{r 7-2-Gene-Expression-Omnibus-30 }
+geodata_genes = merge(geodata, id.gene.table, by="ID")
+head(geodata_genes)
+```
+
+Note that many of the probeset IDs do not map to full gene symbols, as shown in the top few rows; this is expected in genome-wide analyses based on microarray platforms.
+
+Let's look at the first 25 unique genes in these data:
+```{r 7-2-Gene-Expression-Omnibus-31 }
+UniqueGenes = unique(geodata_genes$`Gene symbol`)
+UniqueGenes[1:25]
+```
+
+Again, you can see that the first value listed is blank, representing probeset IDs that do not match to fully annotated gene symbols. The rest correspond to gene symbols annotated to the rat genome.
+
+You can also see that some gene symbols have multiple entries, separated by "///".
+
+To simplify identifiers, we can keep just the first gene symbol and remove the rest using `gsub()`.
+```{r 7-2-Gene-Expression-Omnibus-32 }
+geodata_genes$`Gene symbol` = gsub("///.*", "", geodata_genes$`Gene symbol`)
+```
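+
+As a quick demonstration of the pattern on a made-up multi-symbol entry: "///.*" matches the first "///" and everything after it, which `gsub()` replaces with an empty string.
+
+```r
+# Keeps only the text before the first "///"
+gsub("///.*", "", "Olr633 /// Olr634 /// Olr635")
+# Returns "Olr633 " (note the leftover trailing space)
+```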
+
+Let's alphabetize the main expression dataframe by gene symbol.
+```{r 7-2-Gene-Expression-Omnibus-33 }
+geodata_genes = geodata_genes[order(geodata_genes$`Gene symbol`),]
+```
+
+And then re-view these data:
+```{r 7-2-Gene-Expression-Omnibus-34 }
+geodata_genes[1:5,]
+```
+
+In preparation for the visualization steps below, let's reset the probeset IDs to rownames.
+```{r 7-2-Gene-Expression-Omnibus-35 }
+rownames(geodata_genes) = geodata_genes$ID
+
+# Can then remove this column within the dataframe
+geodata_genes$ID = NULL
+```
+
+Finally let's rearrange this dataset to include gene symbols as the first column, right after rownames (probeset IDs).
+```{r 7-2-Gene-Expression-Omnibus-36 }
+geodata_genes = geodata_genes[,c(ncol(geodata_genes),1:(ncol(geodata_genes)-1))]
+geodata_genes[1:5,]
+dim(geodata_genes)
+```
+
+Note that this dataset includes expression measures across **29,214 probes, representing 14,019 unique genes**.
+For simplicity in the final exercises, let's just filter for rows representing mapped genes.
+
+```{r 7-2-Gene-Expression-Omnibus-37 }
+geodata_genes = geodata_genes[!(geodata_genes$`Gene symbol` == ""), ]
+dim(geodata_genes)
+```
+
+Note that this dataset now includes 16,024 rows with mapped gene symbol identifiers.
+
+### Answer to Environmental Health Question 1
+
+:::question
+With this, we can now answer **Environmental Health Question 1**:
+What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
+:::
+:::answer
+**Answer**: Platform-specific probeset IDs.
+:::
+
+
+### Answer to Environmental Health Question 2
+
+:::question
+We can also answer **Environmental Health Question 2**:
+How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
+:::
+:::answer
+**Answer**: We can merge platform-specific IDs with gene-level information using annotation files.
+:::
+
+
+## Visualizing Data
+
+### Visualizing Gene Expression Data using Boxplots and Heat Maps
+
+To visualize the -omics data, we can generate boxplots, heat maps, and many other types of visualizations. Here, we provide an example boxplot, which can be used to visualize the variability among samples, and an example heat map comparing unscaled vs. scaled gene expression profiles. These visualizations are useful both for simply inspecting the data and for identifying patterns across samples or genes.
+
+#### Boxplot visualizations
+For this example, let's simply use R's built-in `boxplot()` function.
+
+We only want to use columns with our expression data (2 to 7), so let's pull those columns when running the boxplot function.
+```{r 7-2-Gene-Expression-Omnibus-38, fig.width=5, fig.height=4, fig.align = "center"}
+boxplot(geodata_genes[,2:7])
+```
+
+There seems to be a lot of variability within each sample's range of expression levels, with many outliers. This makes sense given that we are analyzing expression levels across the rat's entire genome, where some genes won't be expressed at all while others will be highly expressed, due to biological and/or potential technical variability.
+
+To show plots without outliers, we can simply use outline=F.
+```{r 7-2-Gene-Expression-Omnibus-39, fig.width=5, fig.height=4, fig.align = "center"}
+boxplot(geodata_genes[,2:7], outline=F)
+```
+
+
+#### Heat Map visualizations
+Heat maps are also useful when evaluating large datasets.
+
+There are many different packages you can use to generate heat maps. Here, we use the *superheat* package.
+
+It takes a while to plot all genes across the genome, so to save time in this training module, let's randomly select 100 rows to plot.
+
+```{r 7-2-Gene-Expression-Omnibus-40, fig.width=9, fig.height=7, fig.align = "center"}
+# To ensure that the same subset of genes is selected each time
+set.seed(101)
+
+# Random selection of 100 rows
+row.sample = sample(1:nrow(geodata_genes),100)
+
+# Heat map code
+superheat::superheat(geodata_genes[row.sample,2:7], # Only want to plot non-id/gene symbol columns (2 to 7)
+ pretty.order.rows = TRUE,
+ pretty.order.cols = TRUE,
+ col.dendrogram = T,
+ row.dendrogram = T)
+```
+
+This produces a heat map with sample IDs along the x-axis and probeset IDs along the y-axis. Here, the values being displayed represent normalized expression values.
+
+
+One way to improve our ability to distinguish differences between samples is to **scale expression values** across probes.
+
+**Scaling data**
+
+The z-score is a very common scaling method that transforms each data point to reflect the number of standard deviations it lies from the mean. Scaling each probe's values by z-score results in each probe having a mean of 0 and a standard deviation of 1.
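+
+As a quick sanity check with made-up numbers, the z-score of a value x is (x - mean) / sd, and `scale()` reproduces the manual calculation:
+
+```r
+x = c(2, 4, 6, 8)
+(x - mean(x)) / sd(x)   # Manual z-scores
+as.numeric(scale(x))    # Same values via scale(), which returns a one-column matrix
+```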
+
+Let's see what happens when we scale this gene expression dataset by z-score across each probe. This can be easily done using the `scale()` function.
+
+This specific `scale()` function works by centering and scaling across columns, but since we want to use it across probesets (organized as rows), we need to first transpose our dataset, then run the scale function.
+```{r 7-2-Gene-Expression-Omnibus-41 }
+geodata_genes_scaled = scale(t(geodata_genes[,2:7]), center=T, scale=T)
+```
+
+Now we can transpose it back to the original format (i.e., before it was transposed).
+```{r 7-2-Gene-Expression-Omnibus-42 }
+geodata_genes_scaled = t(geodata_genes_scaled)
+```
+
+
+And then view what the normalized, scaled expression data look like for the same random subset of 100 probesets (representing genes).
+```{r 7-2-Gene-Expression-Omnibus-43, echo=FALSE, fig.width=9, fig.height=7, fig.align = "center"}
+superheat::superheat(geodata_genes_scaled[row.sample,],
+ pretty.order.rows = TRUE,
+ pretty.order.cols = TRUE,
+ col.dendrogram = T,
+ row.dendrogram = T)
+```
+
+With these data now scaled, we can more easily visualize patterns between samples.
+
+
+### Answer to Environmental Health Question 3
+
+:::question
+*We can also answer **Environmental Health Question 3***:
+Why do we often scale gene expression signatures prior to heat map visualizations?
+:::
+:::answer
+**Answer**: To better visualize patterns in expression signatures between samples.
+:::
+
+
+Now, with these data nicely organized, we can next explore how statistics can help us find which genes show trends in expression associated with formaldehyde exposure.
+
+
+## Statistical Analyses
+
+### Statistical Analyses to Identify Genes altered by Formaldehyde
+
+A simple way to identify differences between formaldehyde-exposed and unexposed samples is to use a t-test. Because there are so many tests being performed, one for each gene, it is also important to carry out multiple test corrections through a p-value adjustment method.
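+
+As a small illustration with simulated values (not the real expression data), a single t-test and a BH/FDR adjustment of a vector of hypothetical raw p-values look like this:
+
+```r
+# Simulated expression values for one gene: n=3 exposed vs. n=3 unexposed
+exposed.toy   = c(8.1, 8.4, 8.3)
+unexposed.toy = c(6.9, 7.1, 7.0)
+t.test(exposed.toy, unexposed.toy)$p.value
+
+# p.adjust() applies the multiple test correction across many p-values at once
+p.adjust(c(0.001, 0.01, 0.04, 0.20), method = "fdr")
+```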
+
+We need to run a t-test for each row of our dataset. This exercise demonstrates two different methods to run a t-test:
+
++ Method 1: using a 'for loop'
++ Method 2: using the apply function (more computationally efficient)
+
+#### Method 1 (m1): 'For Loop'
+
+Let's first re-save the molecular probe IDs to a column within the dataframe, since we need those values in the loop function.
+```{r 7-2-Gene-Expression-Omnibus-44 }
+geodata_genes$ID = rownames(geodata_genes)
+```
+
+
+We also need to first create an empty matrix in which to eventually store the p-values.
+```{r 7-2-Gene-Expression-Omnibus-45 }
+pValue_m1 = matrix(0, nrow=nrow(geodata_genes), ncol=3)
+colnames(pValue_m1) = c("ID", "pval", "padj")
+head(pValue_m1)
+```
+
+You can see the empty matrix that was generated through this code.
+
+Then we can loop through the entire dataset to acquire p-values from t-test statistics, comparing n=3 exposed vs n=3 unexposed samples.
+```{r 7-2-Gene-Expression-Omnibus-46 }
+for (i in 1:nrow(geodata_genes)) {
+
+ #Get the ID
+ ID.i = geodata_genes[i, "ID"];
+
+ #Run the t-test and get the p-value
+ pval.i = t.test(geodata_genes[i,exposedIDs], geodata_genes[i,unexposedIDs])$p.value;
+
+ #Store the results in the matrix
+ pValue_m1[i,"ID"] = ID.i;
+ pValue_m1[i,"pval"] = pval.i
+
+}
+```
+
+View the results:
+```{r 7-2-Gene-Expression-Omnibus-47 }
+# Note that we're not pulling the last column (padj) since we haven't calculated these yet
+pValue_m1[1:5,1:2]
+```
+
+
+
+#### Method 2 (m2): Apply Function
+For the second method, we can use the *apply()* function to calculate the t-test p-values more efficiently.
+
+```{r 7-2-Gene-Expression-Omnibus-48 }
+pValue_m2 = apply(geodata_genes[,2:7], 1, function(x) t.test(x[unexposedIDs],
+ x[exposedIDs])$p.value)
+names(pValue_m2) = geodata_genes[,"ID"]
+```
+
+We can convert the results into a dataframe to match the format of the m1 results we created above.
+```{r 7-2-Gene-Expression-Omnibus-49 }
+pValue_m2 = data.frame(pValue_m2)
+
+# Now create an ID column
+pValue_m2$ID = rownames(pValue_m2)
+```
+
+Then we can view the two sets of results to confirm they produce the same p-values.
+```{r 7-2-Gene-Expression-Omnibus-50 }
+head(pValue_m1)
+head(pValue_m2)
+```
+We can see from these results that both methods (m1 and m2) generate the same statistical p-values.
+
+#### Interpreting Results
+
+Let's again merge these data with the gene symbols to tell which genes are significant.
+
+First, let's convert to a dataframe and then merge as before, using one of the above methods (m1) as an example.
+```{r 7-2-Gene-Expression-Omnibus-51 }
+pValue_m1 = data.frame(pValue_m1)
+pValue_m1 = merge(pValue_m1, id.gene.table, by="ID")
+```
+
+We can also add a multiple test correction by applying a false discovery rate-adjusted p-value; here, using the Benjamini-Hochberg (BH) method.
+```{r 7-2-Gene-Expression-Omnibus-52 }
+# Here, 'fdr' is an alias for the BH method
+pValue_m1[,"padj"] = p.adjust(pValue_m1[,"pval"], method=c("fdr"))
+```
+
+Now, we can sort these statistical results by adjusted p-values.
+```{r 7-2-Gene-Expression-Omnibus-53 }
+pValue_m1.sorted = pValue_m1[order(pValue_m1[,'padj']),]
+head(pValue_m1.sorted)
+```
+
+Pulling just the significant genes using an adjusted p-value threshold of 0.05.
+```{r 7-2-Gene-Expression-Omnibus-54 }
+adj.pval.sig = pValue_m1[which(pValue_m1[,'padj'] < .05),]
+
+# Viewing these genes
+adj.pval.sig
+```
+
+
+### Answer to Environmental Health Question 4
+
+:::question
+*With this, we can answer **Environmental Health Question 4***:
+What genes are altered in expression by formaldehyde inhalation exposure?
+:::
+:::answer
+**Answer**: Olr633 and Slc7a8.
+:::
+
+Finally, let's plot these using a mini heat map.
+Note that we can use probesetIDs, then gene symbols, in rownames to have them show in heat map labels.
+```{r 7-2-Gene-Expression-Omnibus-55, echo=FALSE, fig.width=8, fig.height=4, fig.align = "center"}
+rownames(geodata_genes) = paste(geodata_genes$ID, ": ",geodata_genes$`Gene symbol`)
+superheat::superheat(geodata_genes[which(geodata_genes$ID %in% adj.pval.sig[,"ID"]),2:7])
+```
+
+Note that this statistical filter is pretty strict when comparing only n=3 vs. n=3 biological replicates. If we loosen the statistical criteria to an unadjusted p-value < 0.05, this is what we find:
+```{r 7-2-Gene-Expression-Omnibus-56 }
+pval.sig = pValue_m1[which(pValue_m1[,'pval'] < .05),]
+nrow(pval.sig)
+```
+
+5,327 probes with significantly altered expression!
+
+Note that other filters are commonly applied to further focus these lists (e.g., background and fold change filters) prior to statistical evaluation, which can impact the final results. See [Rager et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24304932/) for further statistical approaches and visualizations.
+
+
+
+### Answer to Environmental Health Question 5
+
+:::question
+*With this, we can answer **Environmental Health Question 5***:
+What are the potential biological consequences of these gene-level perturbations?
+:::
+:::answer
+**Answer**: Olr633 stands for 'olfactory receptor 633'. Olr633 is up-regulated in expression, consistent with formaldehyde's odor activating olfactory receptors in the noses of the exposed rats. Slc7a8 stands for 'solute carrier family 7 member 8'. Slc7a8 is down-regulated in expression; it plays a role in many biological processes that, when altered, can lead to changes in cellular homeostasis and disease.
+:::
+
+
+
+## Concluding Remarks
+
+In conclusion, this training module provides an overview of pulling, organizing, visualizing, and analyzing -omics data from the online repository, Gene Expression Omnibus (GEO). Trainees are guided through the overall organization of an example high dimensional dataset, focusing on transcriptomic responses in the nasal epithelium of rats exposed to formaldehyde. Data are visualized and then analyzed using standard two-group comparisons. Findings are interpreted for biological relevance, yielding insight into the effects resulting from formaldehyde exposure.
+
+For additional case studies that leverage GEO, see the following publications that also address environmental health questions from our research group:
+
++ Rager JE, Fry RC. The aryl hydrocarbon receptor pathway: a key component of the microRNA-mediated AML signalisome. Int J Environ Res Public Health. 2012 May;9(5):1939-53. doi: 10.3390/ijerph9051939. Epub 2012 May 18. PMID: 22754483; PMCID: [PMC3386597](https://pubmed.ncbi.nlm.nih.gov/22754483/).
+
++ Rager JE, Suh M, Chappell GA, Thompson CM, Proctor DM. Review of transcriptomic responses to hexavalent chromium exposure in lung cells supports a role of epigenetic mediators in carcinogenesis. Toxicol Lett. 2019 May 1;305:40-50. PMID: [30690063](https://pubmed.ncbi.nlm.nih.gov/30690063/).
+
+
+
+
+
+
+:::tyk
+
+Using the same dataset that was used in this module, available from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2):
+
+1. Load the downloaded GEO dataset into R using the packages and functions mentioned in this tutorial.
+2. Filter the data to just those with "cell type" of "Circulating white blood cells".
+3. Report the means of the first 5 rows of the gene expression data (10700001, 10700002, 10700003, 10700004, 10700005), across all samples.
+
+:::
diff --git a/Chapter_7/Module7_2_Input/Module7_2_Image1.png b/Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image1.png
similarity index 100%
rename from Chapter_7/Module7_2_Input/Module7_2_Image1.png
rename to Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image1.png
diff --git a/Chapter_7/Module7_2_Input/Module7_2_Image2.png b/Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image2.png
similarity index 100%
rename from Chapter_7/Module7_2_Input/Module7_2_Image2.png
rename to Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_Image2.png
diff --git a/Chapter_7/Module7_2_Input/Module7_2_InputData1.txt b/Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData1.txt
similarity index 100%
rename from Chapter_7/Module7_2_Input/Module7_2_InputData1.txt
rename to Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData1.txt
diff --git a/Chapter_7/Module7_2_Input/Module7_2_InputData2.txt b/Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData2.txt
similarity index 100%
rename from Chapter_7/Module7_2_Input/Module7_2_InputData2.txt
rename to Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData2.txt
diff --git a/Chapter_7/Module7_2_Input/Module7_2_InputData3.annot b/Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData3.annot
similarity index 100%
rename from Chapter_7/Module7_2_Input/Module7_2_InputData3.annot
rename to Chapter_7/7_2_Gene_Expression_Omnibus/Module7_2_InputData3.annot
diff --git a/Chapter_7/07-Chapter7.Rmd b/Chapter_7/7_3_CompTox_Dashboard/7_3_CompTox_Dashboard.Rmd
similarity index 55%
rename from Chapter_7/07-Chapter7.Rmd
rename to Chapter_7/7_3_CompTox_Dashboard/7_3_CompTox_Dashboard.Rmd
index 6c3d55b..b39f0ec 100644
--- a/Chapter_7/07-Chapter7.Rmd
+++ b/Chapter_7/7_3_CompTox_Dashboard/7_3_CompTox_Dashboard.Rmd
@@ -1,1014 +1,8 @@
-# (PART\*) Chapter 7 Environmental Health Database Mining {-}
-
-# 7.1 Comparative Toxicogenomics Database
-
-This training module was developed by Lauren E. Koval, Kyle R. Roell, and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-
-## Introduction to Training Module
-
-The Comparative Toxicogenomics Database (CTD) is a publicly available, online database that provides manually curated information about chemical-gene/protein interactions, chemical-disease and gene-disease relationships. CTD also recently incorporated curation of exposure data and chemical-phenotype relationships.
-
-CTD is located at: http://ctdbase.org/. Here is a screenshot of the CTD homepage (as of August 5, 2021):
-```{r 07-Chapter7-1, echo=FALSE, fig.align='center' }
-#knitr::include_graphics("_book/TAME_Toolkit_files/figure-html/Module3_1_CTD_homepage.jpg")
-knitr::include_graphics("Chapter_7/Module7_1_Input/Module7_1_Image1.jpg")
-```
-
-In this module, we will be using CTD to access and download data for organization and analysis as an applications-based example relevant to environmental health research. This activity demonstrates basic data manipulation, filtering, and organization steps in R, while highlighting the utility of CTD for identifying novel genomic/epigenomic relationships to environmental exposures. Example visualizations are also included in this training module's script, illustrating gene list comparison results.
-
-
-
-### Training Module's Environmental Health Questions
-This training module was specifically developed to answer the following environmental health questions:
-
-(1) Which genes show altered expression in response to arsenic exposure?
-(2) Of the genes showing altered expression, which may be under epigenetic control?
-
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 07-Chapter7-2}
-rm(list=ls())
-```
-
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.
-```{r 07-Chapter7-3, results=FALSE, message=FALSE}
-if (!requireNamespace("tidyverse"))
-  install.packages("tidyverse")
-if (!requireNamespace("VennDiagram"))
-  install.packages("VennDiagram")
-if (!requireNamespace("grid"))
-  install.packages("grid")
-```
-
-
-#### Loading R packages required for this session
-```{r 07-Chapter7-4, results=FALSE, message=FALSE}
-library(tidyverse)
-library(VennDiagram)
-library(grid)
-```
-
-
-#### Set your working directory
-```{r 07-Chapter7-5, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-
-
-## CTD Data in R
-
-### Organizing Example Dataset from CTD
-
-CTD requires manual querying of its database, outside of the R scripting environment. Because of this, let's first manually pull the data we need for this example analysis. We can answer both of the example questions by pulling all chemical-gene relationship data for arsenic, which we can do by following the below steps:
-
-Navigate to the main CTD website: http://ctdbase.org/.
-
-Select at the top, 'Search' -> 'Chemical-Gene Interactions'.
-
-```{r 07-Chapter7-6, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_7/Module7_1_Input/Module7_1_Image2.jpg")
-```
-
-
-
-Select to query all chemical-gene interaction data for arsenic.
-
-```{r 07-Chapter7-7, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_7/Module7_1_Input/Module7_1_Image3.jpg")
-```
-
-
-
-Note that there are many results, represented by numerous rows of data! Scroll to the bottom of the webpage and select to download as 'CSV'.
-
-```{r 07-Chapter7-8, echo=FALSE, fig.align='center' }
-knitr::include_graphics("Chapter_7/Module7_1_Input/Module7_1_Image4.jpg")
-```
-
-
-
-This is the file that we can now import into the R environment and analyze!
-Note that the data pulled here represent data available on August 1, 2021.
-
-
-
-### Loading the Example CTD Dataset into R
-
-
-
-Read in the CSV file of results from the CTD query:
-```{r 07-Chapter7-9, results=FALSE, message=FALSE}
-ctd = read_csv("Chapter_7/Module7_1_Input/Module7_1_InputData1.csv")
-```
-
-
-
-Let's first see how many rows and columns of data this file contains:
-```{r 07-Chapter7-10}
-dim(ctd)
-```
-This dataset includes 6280 observations (represented by rows) linking arsenic exposure to gene-level alterations, with information spanning 9 columns.
-
-
-
-Let's also see what kind of data are organized within the columns:
-```{r 07-Chapter7-11}
-colnames(ctd)
-```
-
-
-```{r 07-Chapter7-12}
-# Viewing the first five rows of data, across all 9 columns
-ctd[1:5,1:9]
-```
-
-
-
-
-#### Filtering data for genes with altered expression
-
-
-
-To identify genes with altered expression in association with arsenic, we can leverage the results of our CTD query and filter this dataset to include only the rows that contain the term "expression" in the "Interaction Actions" column.
-```{r 07-Chapter7-13}
-exp_filt = ctd %>% filter(grepl("expression", `Interaction Actions`))
-```
-
-We now have 2586 observations, representing instances of arsenic exposure causing a change in a target gene's expression levels.
-```{r 07-Chapter7-14}
-dim(exp_filt)
-```
-
-
-
-Let's see how many unique genes this represents:
-```{r 07-Chapter7-15}
-length(unique(exp_filt$`Gene Symbol`))
-```
-This reflects 1878 unique genes that show altered expression in association with arsenic.
-
-
-
-Let's make a separate dataframe that includes only the unique genes, based on the "Gene Symbol" column.
-```{r 07-Chapter7-16}
-exp_genes = exp_filt %>% distinct(`Gene Symbol`, .keep_all=TRUE)
-
-# Removing columns besides gene identifier
-exp_genes = exp_genes[,4]
-
-# Viewing the first 10 genes listed
-exp_genes[1:10,]
-```
-This now provides us a list of 1878 genes showing altered expression in association with arsenic.
-
-
-##### Technical notes on running the distinct function within tidyverse:
-By default, the `distinct` function keeps the first instance of a duplicated value. This has implications if the other values in those rows differ: you only retain the data associated with the first instance of the duplicated value (which is why we retained just the gene column here). It may be useful to first find the rows containing the duplicate value and verify that results are as you expect before removing observations.
-
-For example, in this dataset, expression levels can increase or decrease. If you were looking only for increases in expression, and a gene showed both increased and decreased expression across different samples, using `distinct` on "Gene Symbol" alone would not give you the results you wanted: if the first instance of the gene symbol noted decreased expression, that gene would not be returned even though it might be one you want. For this example, we only care about expression change, regardless of direction, so this is not an issue. If this is a concern, the `distinct` function can also take multiple columns to consider jointly when checking for duplicates.
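-To illustrate this caveat, here is a minimal sketch using a toy dataframe (the data and chunk name are hypothetical, not part of the CTD results):
-
-```{r distinct-caveat-sketch}
-# Toy data: one gene appears twice with opposite directions of change
-toy = tibble::tibble(
-  `Gene Symbol` = c("GENE1", "GENE1", "GENE2"),
-  Direction     = c("decreased", "increased", "increased")
-)
-
-# distinct() on the gene column alone keeps only the FIRST row per gene,
-# so GENE1's 'increased' record is silently dropped
-toy %>% distinct(`Gene Symbol`, .keep_all=TRUE)
-
-# Considering both columns jointly retains both GENE1 records
-toy %>% distinct(`Gene Symbol`, Direction, .keep_all=TRUE)
-```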
-
-
-
-### Answer to Environmental Health Question 1
-
-:::question
-*With this, we can answer **Environmental Health Question 1***:
-Which genes show altered expression in response to arsenic exposure?
-:::
-:::answer
-**Answer**: These 1878 genes have published evidence supporting their altered expression levels in association with arsenic exposure.
-:::
-
-
-
-## Identifying Genes Under Epigenetic Control
-
-
-For this dataset, let's focus on gene-level methylation as a marker of epigenetic regulation. Let's return to our main dataframe, representing the results of the CTD query, and filter these results for only the rows that contain the term "methylation" in the "Interaction Actions" column.
-```{r 07-Chapter7-17}
-met_filt = ctd %>% filter(grepl("methylation",`Interaction Actions`))
-```
-
-We now have 3211 observations, representing instances of arsenic exposure causing a change in a target gene's methylation levels.
-```{r 07-Chapter7-18}
-dim(met_filt)
-```
-
-
-Let's see how many unique genes this represents.
-```{r 07-Chapter7-19}
-length(unique(met_filt$`Gene Symbol`))
-```
-This reflects 3142 unique genes that show altered methylation in association with arsenic.
-
-
-
-Let's make a separate dataframe that includes only the unique genes, based on the "Gene Symbol" column.
-```{r 07-Chapter7-20}
-met_genes = met_filt %>% distinct(`Gene Symbol`, .keep_all=TRUE)
-
-# Removing columns besides gene identifier
-met_genes = met_genes[,4]
-```
-This now provides us a list of 3142 genes showing altered methylation in association with arsenic.
-
-
-
-With this list of genes with altered methylation, we can now compare it to our previous list of genes with altered expression to yield our final list of genes of interest. To achieve this last step, we present two different methods for carrying out list comparisons below.
-
-
-
-#### Method 1 for list comparisons: Merging
-
-
-
-Merge the expression results with the methylation results on the "Gene Symbol" column found in both datasets.
-```{r 07-Chapter7-21}
-merge_df = merge(exp_genes, met_genes, by = "Gene Symbol")
-```
-We end up with 315 rows, reflecting the 315 genes that show both altered expression and altered methylation.
-
-Let's view these genes:
-```{r 07-Chapter7-22}
-merge_df[1:315,]
-```
-
-
-
-### Answer to Environmental Health Question 2
-
-:::question
-*With this, we can answer **Environmental Health Question 2***:
-Of the genes showing altered expression, which may be under epigenetic control?
-:::
-:::answer
-**Answer**: We identified 315 genes with altered expression resulting from arsenic exposure that also demonstrate epigenetic modifications from arsenic. These genes include many high interest molecules involved in regulating cell health, including several cyclin dependent kinases (e.g., CDK2, CDK4, CDK5, CDK6), molecules involved in oxidative stress (e.g., FOSB, NOS2), and cytokines involved in inflammatory response pathways (e.g., IFNG, IL10, IL16, IL1R1, IL1RAP, TGFB1, TGFB3).
-:::
-
-
-
-#### Method 2 for list comparisons: Intersection
-For further training, shown here is another method for pulling this list of interest, through use of the `intersect()` function.
-
-
-
-Obtain a list of the overlapping genes in the overall expression results and the methylation results.
-```{r 07-Chapter7-23}
-inxn = intersect(exp_filt$`Gene Symbol`,met_filt$`Gene Symbol`)
-```
-Again, we end up with a list of 315 unique genes that show altered expression and altered methylation.
-
-
-
-This list can be viewed on its own or converted to a dataframe (df).
-```{r 07-Chapter7-24}
-inxn_df = data.frame(genes=inxn)
-```
-
-
-
-This list can also be conveniently used to filter the original query results.
-```{r 07-Chapter7-25}
-inxn_df_all_data = ctd %>% filter(`Gene Symbol` %in% inxn)
-```
-
-
-
-Note that in this last case, the same 315 genes are present, but the results now contain all records from the original query, hence the 875 rows (875 observations reflecting the 315 genes).
-```{r 07-Chapter7-26}
-summary(unique(sort(inxn_df_all_data$`Gene Symbol`))==sort(merge_df$`Gene Symbol`))
-dim(inxn_df_all_data)
-```
-
-
-Visually we can represent this as a Venn diagram. Here, we use the ["VennDiagram"](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-35) R package.
-
-```{r venn, message=F, eval=F, fig.align = "center"}
-# Use the data we previously used for intersection in the venn diagram function
-venn.plt = venn.diagram(
- x = list(exp_filt$`Gene Symbol`, met_filt$`Gene Symbol`),
- category.names = c("Altered Expression" , "Altered Methylation"),
- filename = NULL,
-
- # Change font size, type, and position
- cat.cex = 1.15,
- cat.fontface = "bold",
- cat.default.pos = "outer",
- cat.pos = c(-27, 27),
- cat.dist = c(0.055, 0.055),
-
- # Change color of ovals
- col=c("#440154ff", '#21908dff'),
- fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3))
-)
-
-```
-
-```{r print-venn, fig.width = 7, fig.height = 7, echo=F, message=F, fig.align = "center"}
-# Use the data we previously used for intersection in the venn diagram function
-venn.plt = venn.diagram(
- x = list(exp_filt$`Gene Symbol`, met_filt$`Gene Symbol`),
- category.names = c("Altered Expression" , "Altered Methylation"),
- filename = NULL,
- output=F,
-
- # Change font size, type, and position
- cat.cex = 1.15,
- cat.fontface = "bold",
- cat.default.pos = "outer",
- cat.pos = c(-27, 27),
- cat.dist = c(0.055, 0.055),
-
- # Change color of ovals
- col=c("#440154ff", '#21908dff'),
- fill = c(alpha("#440154ff",0.3), alpha('#21908dff',0.3))
-)
-
-grid::grid.draw(venn.plt)
-```
-
-
-## Concluding Remarks
-In conclusion, we identified 315 genes that show altered expression in response to arsenic exposure and that may be under epigenetic control. These genes represent critical mediators of oxidative stress and inflammation, among other important cellular processes. Results yielded an important list of genes representing potential targets for further evaluation to better understand mechanisms of environmental exposure-induced disease. Together, this example highlights the utility of CTD in addressing environmental health research questions.
-
-For more information, see the recently updated primary CTD publication:
-
-+ Davis AP, Grondin CJ, Johnson RJ, Sciaky D, Wiegers J, Wiegers TC, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D1138-D1143. PMID: [33068428](https://pubmed.ncbi.nlm.nih.gov/33068428/).
-
-Additional case studies relevant to environmental health research include the following:
-
-+ An example publication leveraging CTD findings to identify mechanisms of metals-induced birth defects: Ahir BK, Sanders AP, Rager JE, Fry RC. Systems biology and birth defects prevention: blockade of the glucocorticoid receptor prevents arsenic-induced birth defects. Environ Health Perspect. 2013 Mar;121(3):332-8. PMID: [23458687](https://pubmed.ncbi.nlm.nih.gov/23458687/).
-
-+ An example publication leveraging CTD to help fill data gaps on data poor chemicals, in combination with ToxCast/Tox21 data streams, to elucidate environmental influences on disease pathways: Kosnik MB, Planchart A, Marvel SW, Reif DM, Mattingly CJ. Integration of curated and high-throughput screening data to elucidate environmental influences on disease pathways. Comput Toxicol. 2019 Nov;12:100094. PMID: [31453412](https://pubmed.ncbi.nlm.nih.gov/31453412/).
-
-+ An example publication leveraging CTD to extract chemical-disease relationships used to derive new chemical risk values, with the goal of prioritizing connections between environmental factors, genetic variants, and human diseases: Kosnik MB, Reif DM. Determination of chemical-disease risk values to prioritize connections between environmental factors, genetic variants, and human diseases. Toxicol Appl Pharmacol. 2019 Sep 15;379:114674. [PMID: 31323264](https://pubmed.ncbi.nlm.nih.gov/31323264/).
-
-
-
-
-
-
-:::tyk
-
-Using the same dataset from this module (available at the GitHub site and as Module7_1_TYKInput.csv):
-
-1. Filter the data using the grepl function to look at only those observations that specifically decrease the target gene's "expression" level. How many observations are there?
-2. Similarly, filter the data to identify how many observations there are where the target gene's "expression" level is simply "affected". Create a Venn diagram to help visualize any overlap between these two filtered datasets.
-
-:::
-
-# 7.2 Gene Expression Omnibus
-
-This training module was developed by Kyle R. Roell and Julia E. Rager.
-
-All input files (script, data, and figures) can be downloaded from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2).
-
-## Introduction to Training Module
-
-[GEO](https://www.ncbi.nlm.nih.gov/geo/) is a publicly available database repository of high-throughput gene expression data and hybridization arrays, chips, and microarrays that span genome-wide endpoints of genomics, transcriptomics, and epigenomics. This training module specifically guides trainees through the loading of required packages and data, including the manual upload of GEO data as well as the upload/organization of data leveraging the [GEOquery package](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html). Data are then further organized and combined with gene annotation information through the merging of platform annotation files. Example visualizations are then produced, including boxplots to evaluate the overall distribution of expression data across samples, as well as heat map visualizations that compare unscaled versus scaled gene expression values. Statistical analyses are then included to identify which genes are significantly altered in expression upon exposure to formaldehyde. Together, this training module serves as a simple example showing methods to access and download GEO data and to perform data organization, analysis, and visualization tasks through applications-based questions.
-
-
-## Introduction to GEO
-
-The GEO repository is organized and managed by the [The National Center for Biotechnology Information (NCBI)](https://www.ncbi.nlm.nih.gov/), which seeks to advance science and health by providing access to biomedical and genomic information. The three [overall goals](https://www.ncbi.nlm.nih.gov/geo/info/overview.html) of GEO are to: (1) Provide a robust, versatile database in which to efficiently store high-throughput functional genomic data, (2) Offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community, and (3) Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest.
-
-Of high relevance to environmental health, data organized within GEO can be pulled and analyzed to address new environmental health questions, leveraging previously generated data. For example, we have pulled gene expression data from acute myeloid leukemia patients and re-analyzed these data to elucidate new mechanisms of epigenetically-regulated networks involved in cancer that, in turn, may be modified by environmental insults, as previously published in [Rager et al. 2012](https://pubmed.ncbi.nlm.nih.gov/22754483/). We have also pulled and analyzed gene expression data from published studies evaluating toxicity resulting from hexavalent chromium exposure, to further substantiate the role of epigenetic mediators in hexavalent chromium-induced carcinogenesis (see [Rager et al. 2019](https://pubmed.ncbi.nlm.nih.gov/30690063/)). This training exercise leverages an additional dataset that we published and deposited through GEO to evaluate the effects of formaldehyde inhalation exposure, as detailed below.
-
-
-## Introduction to Example Data
-
-In this training module, data will be pulled from the published GEO dataset recorded through the online series [GSE42394](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394). This series represents Affymetrix rat genome-wide microarray data generated from our previous study, aimed at evaluating the transcriptomic effects of formaldehyde across three tissues: the nose, blood, and bone marrow. For the purposes of this training module, we will focus on evaluating gene expression profiles from nasal samples after 7 days of exposure, collected from rats exposed to 2 ppm formaldehyde via inhalation. These findings, in addition to other epigenomic endpoint measures, have been previously published (see [Rager et al. 2014](https://pubmed.ncbi.nlm.nih.gov/24304932/)).
-
-
-### Training Module's Environmental Health Questions
-
-This training module was specifically developed to answer the following environmental health questions:
-
-(1) What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
-(2) How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
-(3) Why do we often scale gene expression signatures prior to heat map visualizations?
-(4) What genes are altered in expression by formaldehyde inhalation exposure?
-(5) What are the potential biological consequences of these gene-level perturbations?
-
-
-
-### Script Preparations
-
-#### Cleaning the global environment
-```{r 07-Chapter7-27}
-rm(list=ls())
-```
-
-
-#### Installing required R packages
-If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you.
-```{r 07-Chapter7-28, results=FALSE, message=FALSE}
-if (!requireNamespace("tidyverse"))
- install.packages("tidyverse")
-if (!requireNamespace("reshape2"))
- install.packages("reshape2")
-
-# GEOquery, this will install BiocManager if you don't have it installed
-if (!requireNamespace("BiocManager"))
-  install.packages("BiocManager")
-if (!requireNamespace("GEOquery"))
-  BiocManager::install("GEOquery")
-```
-
-
-#### Loading R packages required for this session
-```{r 07-Chapter7-29, results=FALSE, message=FALSE, warning=FALSE}
-library(tidyverse)
-library(reshape2)
-library(GEOquery)
-```
-For more information on the **tidyverse package**, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/tidyverse/index.html), primary [webpage](https://www.tidyverse.org/packages/), and peer-reviewed [article released in 2018](https://onlinelibrary.wiley.com/doi/10.1002/sdr.1600).
-
-For more information on the **reshape2 package**, see its associated [CRAN webpage](https://cran.r-project.org/web/packages/reshape2/index.html), [R Documentation](https://www.rdocumentation.org/packages/reshape2/versions/1.4.4), and [helpful website](https://seananderson.ca/2013/10/19/reshape/) providing an introduction to the reshape2 package.
-
-For more information on the **GEOquery package**, see its associated [Bioconductor website](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html) and [R Documentation file](https://www.rdocumentation.org/packages/GEOquery/versions/2.38.4).
-
-
-
-#### Set your working directory
-```{r 07-Chapter7-30, eval=FALSE, echo=TRUE}
-setwd("/filepath to where your input files are")
-```
-
-
-## GEO Data in R
-
-Let's start by loading the GEO dataset needed for this training module. As explained in the introduction, this module walks through two methods of uploading GEO data: a manual option vs. an automated option using the GEOquery package. These two methods are detailed below.
-
-### 1. Manually Downloading and Uploading GEO Files
-
-In this first method, we will navigate to the dataset within the GEO website, manually download its associated text data file, save it in our working directory, and then upload it into our global environment in R.
-
-For the purposes of this training exercise, we manually downloaded the GEO series matrix file from the GEO series webpage, located at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394. The specific file that was downloaded was noted as "GSE42394_series_matrix.txt", pulled by clicking on the link indicated by the red arrow from the GEO series webpage:
-
-```{r 07-Chapter7-32, echo=FALSE, fig.width=4, fig.height=5, fig.align = "center"}
-knitr::include_graphics("Chapter_7/Module7_2_Input/Module7_2_Image1.png")
-```
-
-
-For simplicity, we also have already pre-filtered this file for the samples we are interested in, focusing on the rat nasal gene expression data after 7 days of exposure to gaseous formaldehyde. This filtered file was saved as "GSE42394_series_matrix_filtered.txt", then renamed "Module7_2_InputData1.txt" for use in this module.
-
-
-At this point, we can simply read in this pre-filtered text file for the purposes of this training module:
-```{r 07-Chapter7-33}
-geodata_manual = read.table(file="Chapter_7/Module7_2_Input/Module7_2_InputData1.txt",
- header=T)
-```
-
-
-Because this is a manual approach, we also have to manually define the treated and untreated samples (based on manually reviewing the associated metadata on the GEO webpage).
-
-Manually defining treated and untreated for these samples of interest:
-```{r 07-Chapter7-34}
-exposed_manual = c("GSM1150940", "GSM1150941", "GSM1150942")
-unexposed_manual = c("GSM1150937", "GSM1150938", "GSM1150939")
-```
-
-
-
-### 2. Loading and Organizing GEO Files through the GEOquery Package
-In this second method, we will leverage the GEOquery package, which allows for easier downloading and reading in of data from GEO without having to manually download raw text files or manually assign sample attributes (e.g., exposed vs. unexposed). This package is set up to automatically merge sample information from GEO metadata files with raw genome-wide datasets.
-
-
-Let's first use the getGEO function (from the GEOquery package) to load data from our series matrix ("GSE42394_series_matrix.txt", renamed "Module7_2_InputData2.txt" for use in this module). *Note that this line of code may take a couple of minutes to run.*
-```{r 07-Chapter7-35, message=FALSE}
-geo.getGEO.data = getGEO(filename='Chapter_7/Module7_2_Input/Module7_2_InputData2.txt')
-```
-
-
-
-One of the reasons the GEOquery package is so helpful is that we can automatically link a dataset with nicely organized sample information using the `pData()` function.
-```{r 07-Chapter7-36}
-sampleInfo = pData(geo.getGEO.data)
-```
-
-
-Let's view this sample information/metadata file, starting with its column headers.
-```{r 07-Chapter7-37}
-colnames(sampleInfo)
-```
-
-Then let's view the first ten rows of the first five columns.
-```{r 07-Chapter7-38}
-sampleInfo[1:10,1:5]
-```
-
-This shows that each sample is assigned a unique identifier starting with "GSM", described by information summarized in the "title" column. We can also see that these data were made public on January 7, 2014.
-
-
-Let's view the next five columns.
-```{r 07-Chapter7-39}
-sampleInfo[1:10,6:10]
-```
-
-We can see information on the type of molecule that was analyzed (i.e., RNA), further details on the collected samples in the `source_name_ch1` column, and the organism (rat) in the `organism_ch1` column.
-
-
-More detailed metadata information is provided throughout this file, as seen when viewing the column headers above.
-
-
-#### Defining samples
-
-Now, we can use this information to define the samples we want to analyze. Note that this is the same step we did manually above.
-
-In this training exercise, we are focusing on responses in the nose, so we can easily filter for cell type = Nasal epithelial cells (specifically in the `cell type:ch1` variable). We are also focusing on responses collected after 7 days of exposure, which we can filter for using time = 7 day (specifically in the `time:ch1` variable). We will also define exposed and unexposed samples using the variable `treatment:ch1`.
-
-First, let's subset the sampleInfo dataframe to just keep the samples we're interested in:
-```{r 07-Chapter7-40}
-# Define a vector variable (here we call it 'keep') that will store rows we want to keep
-keep = rownames(sampleInfo[which(sampleInfo$`cell type:ch1`=="Nasal epithelial cells"
- & sampleInfo$`time:ch1`=="7 day"),])
-
-# Then subset the sample info for just those samples we defined in keep variable
-sampleInfo = sampleInfo[keep,]
-```
-
-
-Next, we can pull the exposed and unexposed animal IDs. Let's first see how these are labeled within the `treatment:ch1` variable.
-```{r 07-Chapter7-41}
-unique(sampleInfo$`treatment:ch1`)
-```
-
-
-And then search for the rows of data, pulling the sample animal IDs (which are in the variable `geo_accession`).
-```{r 07-Chapter7-42}
-exposedIDs = sampleInfo[which(sampleInfo$`treatment:ch1`=="2 ppm formaldehyde"),
- "geo_accession"]
-unexposedIDs = sampleInfo[which(sampleInfo$`treatment:ch1`=="unexposed"),
- "geo_accession"]
-```
-
-
-The next step is to pull the expression data we want to use in our analyses. The `exprs()` function (from the Biobase package, which loads alongside GEOquery) allows us to easily pull these data. Here, we pull the data we're interested in based on our previously generated 'keep' vector.
-```{r 07-Chapter7-43}
-# As a reminder, this is what the 'keep' vector includes
-# (i.e., animal IDs that we're interested in)
-keep
-```
-
-```{r 07-Chapter7-44}
-# Using the exprs() function
-geodata = exprs(geo.getGEO.data[,keep])
-```
-
-
-Let's view the full dataset as is now:
-```{r 07-Chapter7-45}
-head(geodata)
-```
-This now represents a matrix of data, with animal IDs as column headers and expression levels within the matrix.
-
-
-#### Simplifying column names
-These column names are not the easiest to interpret, so let's rename these columns to indicate which animals were from the exposed vs. unexposed groups.
-
-We need to first convert our expression dataset to a dataframe so we can edit columns names, and continue with downstream data manipulations that require dataframe formats.
-```{r 07-Chapter7-46}
-geodata = data.frame(geodata)
-```
-
-
-Let's remind ourselves what the column names are:
-```{r 07-Chapter7-47}
-colnames(geodata)
-```
-
-Which of these are exposed vs. unexposed animals can be determined by viewing our previously defined vectors.
-```{r 07-Chapter7-48}
-exposedIDs
-unexposedIDs
-```
-
-With this we can tell that the first three listed IDs are from unexposed animals, and the last three IDs are from exposed animals.
-
-Let's simplify the names of these columns to indicate exposure status and replicate number.
-```{r 07-Chapter7-49}
-colnames(geodata) = c("Control_1", "Control_2", "Control_3", "Exposed_1",
- "Exposed_2", "Exposed_3")
-```
-
-
-And we'll now need to re-define our 'exposed' vs. 'unexposed' vectors for the downstream code.
-```{r 07-Chapter7-50}
-exposedIDs = c("Exposed_1", "Exposed_2", "Exposed_3")
-unexposedIDs = c("Control_1", "Control_2", "Control_3")
-```
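-As an aside, assigning names by position assumes the column order never changes. A less fragile alternative is to build the new names from the original ID vectors before overwriting them. Below is a sketch (not run here, since it assumes the ID vectors and column names still hold the GSM accessions):
-
-```{r rename-by-id-sketch, eval=FALSE}
-# Build exposure labels plus replicate numbers from the GSM accession vectors,
-# rather than assuming the first three columns are controls
-is_exposed = colnames(geodata) %in% exposedIDs
-new_names = ifelse(is_exposed,
-                   paste0("Exposed_", cumsum(is_exposed)),
-                   paste0("Control_", cumsum(!is_exposed)))
-colnames(geodata) = new_names
-```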
-
-
-
-Viewing the data again:
-```{r 07-Chapter7-51}
-head(geodata)
-```
-
-These data are now looking easier to interpret/analyze. Still, the row identifiers are 8-digit numbers starting with "107...". We know that this is a gene expression dataset, but these identifiers don't, in themselves, tell us which genes they refer to. These numeric IDs represent microarray probeset IDs produced by the Affymetrix platform used in the original study.
-
-**But how can we tell which genes are represented by these data?!**
-
-
-#### Adding gene symbol information
-
-Each -omics dataset contained within GEO points to a specific platform that was used to obtain measurements.
-In instances where we want more information surrounding the molecular identifiers, we can merge the platform-specific annotation file with the molecular IDs given in the full dataset.
-
-For example, let's pull the platform-specific annotation file for this experiment. Let's revisit the [website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394) that contained the original dataset on GEO. Scroll down to where it lists "Platforms", and there is a hyperlinked platform number "GPL6247" (see arrow below).
-
-```{r 07-Chapter7-52, echo=FALSE, fig.width=4, fig.height=5, fig.align = "center"}
-knitr::include_graphics("Chapter_7/Module7_2_Input/Module7_2_Image2.png")
-```
-
-
-Click on this, and you will be navigated to a different GEO website describing the Affymetrix rat array platform that was used in this analysis. Note that this website also includes information on when this array became available, links to other experiments that have used this platform within GEO, and much more.
-
-Here, we're interested in pulling the corresponding gene symbol information for the probeset IDs. To do so, scroll to the bottom, click "Annotation SOFT table...", and download the corresponding .gz file into your working directory. Unzip this, and you will find the master annotation file: "GPL6247.annot".
-
-In this exercise, we've already done these steps and unzipped the file into our working directory. So at this point, we can simply read in this annotation dataset, renamed "Module7_2_InputData3.annot", using the `getGEO()` function from the GEOquery package to help automate this step.
-
-```{r 07-Chapter7-53, warning=FALSE}
-geo.annot = GEOquery::getGEO(filename="Chapter_7/Module7_2_Input/Module7_2_InputData3.annot")
-```
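-As an aside (not required for this module), `getGEO()` can also fetch a platform annotation directly by its accession, which avoids the manual download/unzip steps entirely. A sketch, assuming an internet connection is available:
-
-```{r getgeo-accession-sketch, eval=FALSE}
-# Downloads and parses the GPL6247 annotation from GEO on the fly
-geo.annot = GEOquery::getGEO("GPL6247")
-```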
-
-Now we can use the `Table()` function from GEOquery to pull data from the annotation dataset.
-```{r 07-Chapter7-54}
-id.gene.table = GEOquery::Table(geo.annot)[,c("ID", "Gene symbol")]
-id.gene.table[1:10,1:2]
-```
-
-With these two columns of data, we now have the needed IDs and gene symbols to match with our dataset.
-
-Within the full dataset, we need to add a new column for the probeset ID, taken from the rownames, in preparation for the merging step.
-```{r 07-Chapter7-55}
-geodata$ID = rownames(geodata)
-```
-
-We can now merge the gene symbol information by ID with our expression data.
-```{r 07-Chapter7-56}
-geodata_genes = merge(geodata, id.gene.table, by="ID")
-head(geodata_genes)
-```
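For intuition, the same `merge()` behavior can be sketched on a tiny made-up example (hypothetical probe IDs and values, not from this dataset):

```r
# Hypothetical expression table and annotation table (made-up values)
expr  <- data.frame(ID = c("p1", "p2", "p3"), s1 = c(5.0, 6.1, 4.2))
annot <- data.frame(ID = c("p1", "p2", "p3"),
                    `Gene symbol` = c("Olr633", "", "Slc7a8"),
                    check.names = FALSE)

# merge() matches rows by the shared "ID" column, combining the remaining columns
merged <- merge(expr, annot, by = "ID")
merged
```

Note that, as in the real data, an un-annotated probe simply carries a blank gene symbol through the merge.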
-
-Note that many of the probeset IDs do not map to full gene symbols, which is shown here by viewing the top few rows - this is expected in genome-wide analyses based on microarray platforms.
-
-Let's look at the first 25 unique genes in these data:
-```{r 07-Chapter7-57}
-UniqueGenes = unique(geodata_genes$`Gene symbol`)
-UniqueGenes[1:25]
-```
-
-Again, you can see that the first value listed is blank, representing probeset IDs that do not map to fully annotated gene symbols, though the rest pertain to gene symbols annotated to the rat genome.
-
-You can also see that some gene symbols have multiple entries, separated by "///".
-
-To simplify identifiers, we can pull just the first gene symbol and remove the rest using `gsub()`.
-```{r 07-Chapter7-58}
-geodata_genes$`Gene symbol` = gsub("///.*", "", geodata_genes$`Gene symbol`)
-```
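As a quick illustration of this pattern (using made-up multi-mapped symbols, not values from this dataset), `gsub("///.*", "", x)` keeps everything before the first "///"; a `trimws()` can then drop the trailing space the split leaves behind:

```r
# Hypothetical multi-mapped gene symbols (not from the actual dataset)
symbols <- c("Olr633", "Slc7a8 /// Slc7a9", "Lta /// Tnf /// Lst1")

# Strip everything from the first "///" onward, then trim trailing whitespace
first_symbol <- trimws(gsub("///.*", "", symbols))
first_symbol
```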
-
-Let's alphabetize the main expression dataframe by gene symbol.
-```{r 07-Chapter7-59}
-geodata_genes = geodata_genes[order(geodata_genes$`Gene symbol`),]
-```
-
-And then re-view these data:
-```{r 07-Chapter7-60}
-geodata_genes[1:5,]
-```
-
-In preparation for the visualization steps below, let's reset the probeset IDs to rownames.
-```{r 07-Chapter7-61}
-rownames(geodata_genes) = geodata_genes$ID
-
-# Can then remove this column within the dataframe
-geodata_genes$ID = NULL
-```
-
-Finally let's rearrange this dataset to include gene symbols as the first column, right after rownames (probeset IDs).
-```{r 07-Chapter7-62}
-geodata_genes = geodata_genes[,c(ncol(geodata_genes),1:(ncol(geodata_genes)-1))]
-geodata_genes[1:5,]
-dim(geodata_genes)
-```
-
-Note that this dataset includes expression measures across **29,214 probes, representing 14,019 unique genes**.
-For simplicity in the final exercises, let's just filter for rows representing mapped genes.
-
-```{r 07-Chapter7-63}
-geodata_genes = geodata_genes[!(geodata_genes$`Gene symbol` == ""), ]
-dim(geodata_genes)
-```
-
-Note that this dataset now includes 16,024 rows with mapped gene symbol identifiers.
-
-### Answer to Environmental Health Question 1
-
-:::question
-With this, we can now answer **Environmental Health Question 1**:
-What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
-:::
-:::answer
-**Answer**: Platform-specific probeset IDs.
-:::
-
-
-### Answer to Environmental Health Question 2
-
-:::question
-We can also answer **Environmental Health Question 2**:
-How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
-:::
-:::answer
-**Answer**: We can merge platform-specific IDs with gene-level information using annotation files.
-:::
-
-
-## Visualizing Data
-
-### Visualizing Gene Expression Data using Boxplots and Heat Maps
-
-To visualize the -omics data, we can generate boxplots, heat maps, and many other types of visualizations. Here, we provide an example boxplot, which can be used to visualize the variability among samples, and an example heat map comparing unscaled vs scaled gene expression profiles. These visualizations are useful both to simply view the data and to identify patterns across samples or genes.
-
-#### Boxplot visualizations
-For this example, let's simply use R's built-in `boxplot()` function.
-
-We only want to use columns with our expression data (2 to 7), so let's pull those columns when running the boxplot function.
-```{r 07-Chapter7-64, fig.width=5, fig.height=4, fig.align = "center"}
-boxplot(geodata_genes[,2:7])
-```
-
-There seems to be a lot of variability within each sample's range of expression levels, with many outliers. This makes sense given that we are analyzing the expression levels across the rat's entire genome, where some genes won't be expressed at all while others will be highly expressed due to biological and/or potential technical variability.
-
-To show plots without outliers, we can simply set `outline = F` (i.e., `FALSE`).
-```{r 07-Chapter7-65, fig.width=5, fig.height=4, fig.align = "center"}
-boxplot(geodata_genes[,2:7], outline=F)
-```
-
-
-#### Heat Map visualizations
-Heat maps are also useful when evaluating large datasets.
-
-There are many different packages you can use to generate heat maps. Here, we use the *superheat* package.
-
-It can also take a while to plot all genes across the genome, so to save time in this training module, let's randomly select 100 rows to plot.
-
-```{r 07-Chapter7-66, fig.width=9, fig.height=7, fig.align = "center"}
-# To ensure that the same subset of genes is selected each time
-set.seed(101)
-
-# Random selection of 100 rows
-row.sample = sample(1:nrow(geodata_genes),100)
-
-# Heat map code
-superheat::superheat(geodata_genes[row.sample,2:7], # Only want to plot non-id/gene symbol columns (2 to 7)
- pretty.order.rows = TRUE,
- pretty.order.cols = TRUE,
- col.dendrogram = T,
- row.dendrogram = T)
-```
-
-This produces a heat map with sample IDs along the x-axis and probeset IDs along the y-axis. Here, the values being displayed represent normalized expression values.
-
-
-One way to improve our ability to distinguish differences between samples is to **scale expression values** across probes.
-
-**Scaling data**
-
-Z-score is a very common method of scaling that transforms data points to reflect the number of standard deviations they are from the overall mean. Scaling a dataset by z-score transforms it to have a mean of 0 and a standard deviation of 1.
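As a small, self-contained sketch of what z-score scaling does (with made-up numbers, independent of the expression data):

```r
# Toy vector of hypothetical expression values
x <- c(2, 4, 6, 8, 10)

# Z-score: center by the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# The scaled values now have mean 0 and standard deviation 1
round(mean(z), 10)  # effectively 0
sd(z)
```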
-
-Let's see what happens when we scale this gene expression dataset by z-score across each probe. This can be easily done using the `scale()` function.
-
-This specific `scale()` function works by centering and scaling across columns, but since we want to use it across probesets (organized as rows), we need to first transpose our dataset, then run the scale function.
-```{r 07-Chapter7-67}
-geodata_genes_scaled = scale(t(geodata_genes[,2:7]), center=T, scale=T)
-```
-
-Now we can transpose it back to the original format (i.e., before it was transposed).
-```{r 07-Chapter7-68}
-geodata_genes_scaled = t(geodata_genes_scaled)
-```
-
-
-And then view what the normalized and now scaled expression data look like for the same random subset of 100 probesets (representing genes).
-```{r 07-Chapter7-69, echo=FALSE, fig.width=9, fig.height=7, fig.align = "center"}
-superheat::superheat(geodata_genes_scaled[row.sample,],
- pretty.order.rows = TRUE,
- pretty.order.cols = TRUE,
- col.dendrogram = T,
- row.dendrogram = T)
-```
-
-With these data now scaled, we can more easily visualize patterns between samples.
-
-
-### Answer to Environmental Health Question 3
-
-:::question
-*We can also answer **Environmental Health Question 3***:
-Why do we often scale gene expression signatures prior to heat map visualizations?
-:::
-:::answer
-**Answer**: To better visualize patterns in expression signatures between samples.
-:::
-
-
-Now, with these data nicely organized, we can next explore how statistics can help us find which genes show trends in expression associated with formaldehyde exposure.
-
-
-## Statistical Analyses
-
-### Statistical Analyses to Identify Genes altered by Formaldehyde
-
-A simple way to identify differences between formaldehyde-exposed and unexposed samples is to use a t-test. Because there are so many tests being performed, one for each gene, it is also important to carry out multiple test corrections through a p-value adjustment method.
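To sketch the overall idea on simulated data (assumed values, not from this study), one can run a t-test per feature and then adjust the resulting p-values for multiple comparisons:

```r
set.seed(1)

# Simulate 5 "genes" measured in 3 exposed and 3 unexposed samples
exposed   <- matrix(rnorm(15, mean = 1), nrow = 5)
unexposed <- matrix(rnorm(15, mean = 0), nrow = 5)

# One t-test per row (gene)
pvals <- sapply(1:5, function(i) t.test(exposed[i, ], unexposed[i, ])$p.value)

# Multiple-test correction (Benjamini-Hochberg / FDR)
padj <- p.adjust(pvals, method = "fdr")

cbind(pvals, padj)
```

Note that the adjusted p-values are always at least as large as the raw p-values, reflecting the penalty for testing many features at once.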
-
-We need to run a t-test for each row of our dataset. This exercise demonstrates two different methods to run a t-test:
-
-+ Method 1: using a 'for loop'
-+ Method 2: using the apply function (more computationally efficient)
-
-#### Method 1 (m1): 'For Loop'
-
-Let's first re-save the molecular probe IDs to a column within the dataframe, since we need those values in the loop function.
-```{r 07-Chapter7-70}
-geodata_genes$ID = rownames(geodata_genes)
-```
-
-
-We also need to initially create an empty matrix (later converted to a dataframe) to eventually store p-values.
-```{r 07-Chapter7-71}
-pValue_m1 = matrix(0, nrow=nrow(geodata_genes), ncol=3)
-colnames(pValue_m1) = c("ID", "pval", "padj")
-head(pValue_m1)
-```
-
-You can see the empty matrix that was generated through this code.
-
-Then we can loop through the entire dataset to acquire p-values from t-test statistics, comparing n=3 exposed vs n=3 unexposed samples.
-```{r 07-Chapter7-72}
-for (i in 1:nrow(geodata_genes)) {
-
- #Get the ID
- ID.i = geodata_genes[i, "ID"];
-
- #Run the t-test and get the p-value
- pval.i = t.test(geodata_genes[i,exposedIDs], geodata_genes[i,unexposedIDs])$p.value;
-
- #Store the data in the empty dataframe
- pValue_m1[i,"ID"] = ID.i;
- pValue_m1[i,"pval"] = pval.i
-
-}
-```
-
-View the results:
-```{r 07-Chapter7-73}
-# Note that we're not pulling the last column (padj) since we haven't calculated these yet
-pValue_m1[1:5,1:2]
-```
-
-
-
-#### Method 2 (m2): Apply Function
-For the second method, we can use the `apply()` function to calculate the t-test p-values more efficiently.
-
-```{r 07-Chapter7-74}
-pValue_m2 = apply(geodata_genes[,2:7], 1, function(x) t.test(x[unexposedIDs],
- x[exposedIDs])$p.value)
-names(pValue_m2) = geodata_genes[,"ID"]
-```
-
-We can convert the results into a dataframe to make it similar to the m1 matrix we created above.
-```{r 07-Chapter7-75}
-pValue_m2 = data.frame(pValue_m2)
-
-# Now create an ID column
-pValue_m2$ID = rownames(pValue_m2)
-```
-
-Then we can view the two datasets to see that they result in the same p-values.
-```{r 07-Chapter7-76}
-head(pValue_m1)
-head(pValue_m2)
-```
-We can see from these results that both methods (m1 and m2) generate the same statistical p-values.
-
-#### Interpreting Results
-
-Let's again merge these data with the gene symbols so we can tell which genes are significant.
-
-First, let's convert the results to a dataframe and then merge as before, using method 1 (m1) as an example.
-```{r 07-Chapter7-77}
-pValue_m1 = data.frame(pValue_m1)
-pValue_m1 = merge(pValue_m1, id.gene.table, by="ID")
-```
-
-We can also add a multiple test correction by applying a false discovery rate-adjusted p-value; here, using the Benjamini-Hochberg (BH) method.
-```{r 07-Chapter7-78}
-# Here, 'fdr' is an alias for the Benjamini-Hochberg (BH) method
-pValue_m1[,"padj"] = p.adjust(pValue_m1[,"pval"], method=c("fdr"))
-```
-
-Now, we can sort these statistical results by adjusted p-values.
-```{r 07-Chapter7-79}
-pValue_m1.sorted = pValue_m1[order(pValue_m1[,'padj']),]
-head(pValue_m1.sorted)
-```
-
-Pulling just the significant genes using an adjusted p-value threshold of 0.05.
-```{r 07-Chapter7-80}
-adj.pval.sig = pValue_m1[which(pValue_m1[,'padj'] < .05),]
-
-# Viewing these genes
-adj.pval.sig
-```
-
-
-### Answer to Environmental Health Question 4
-
-:::question
-*With this, we can answer **Environmental Health Question 4***:
-What genes are altered in expression by formaldehyde inhalation exposure?
-:::
-:::answer
-**Answer**: Olr633 and Slc7a8.
-:::
-
-Finally, let's plot these using a mini heat map.
-Note that we can paste probeset IDs and gene symbols into the rownames so that both appear in the heat map labels.
-```{r 07-Chapter7-81, echo=FALSE, fig.width=8, fig.height=4, fig.align = "center"}
-rownames(geodata_genes) = paste(geodata_genes$ID, ": ",geodata_genes$`Gene symbol`)
-superheat::superheat(geodata_genes[which(geodata_genes$ID %in% adj.pval.sig[,"ID"]),2:7])
-```
-
-Note that this statistical filter is pretty strict when comparing only n=3 vs n=3 biological replicates. If we loosen the criterion to an unadjusted p-value < 0.05, this is what we find:
-```{r 07-Chapter7-82}
-pval.sig = pValue_m1[which(pValue_m1[,'pval'] < .05),]
-nrow(pval.sig)
-```
-
-5,327 probes (representing genes) with significantly altered expression!
-
-Note that other filters are commonly applied to further focus these lists (e.g., background and fold change filters) prior to statistical evaluation, which can impact the final results. See [Rager et al. 2013](https://pubmed.ncbi.nlm.nih.gov/24304932/) for further statistical approaches and visualizations.
-
-
-
-### Answer to Environmental Health Question 5
-
-:::question
-*With this, we can answer **Environmental Health Question 5***:
-What are the potential biological consequences of these gene-level perturbations?
-:::
-:::answer
-**Answer**: Olr633 stands for 'olfactory receptor 633'. Olr633 is up-regulated in expression, suggesting that inhaled formaldehyde activated olfactory receptors in the noses of the exposed rats. Slc7a8 stands for 'solute carrier family 7 member 8'. Slc7a8 is down-regulated in expression; it plays a role in many biological processes that, when altered, can lead to changes in cellular homeostasis and disease.
-:::
-
-
-
-## Concluding Remarks
-
-In conclusion, this training module provides an overview of pulling, organizing, visualizing, and analyzing -omics data from the online repository, Gene Expression Omnibus (GEO). Trainees are guided through the overall organization of an example high dimensional dataset, focusing on transcriptomic responses in the nasal epithelium of rats exposed to formaldehyde. Data are visualized and then analyzed using standard two-group comparisons. Findings are interpreted for biological relevance, yielding insight into the effects resulting from formaldehyde exposure.
-
-For additional case studies that leverage GEO, see the following publications that also address environmental health questions from our research group:
-
-+ Rager JE, Fry RC. The aryl hydrocarbon receptor pathway: a key component of the microRNA-mediated AML signalisome. Int J Environ Res Public Health. 2012 May;9(5):1939-53. doi: 10.3390/ijerph9051939. Epub 2012 May 18. PMID: 22754483; PMCID: [PMC3386597](https://pubmed.ncbi.nlm.nih.gov/22754483/).
-
-+ Rager JE, Suh M, Chappell GA, Thompson CM, Proctor DM. Review of transcriptomic responses to hexavalent chromium exposure in lung cells supports a role of epigenetic mediators in carcinogenesis. Toxicol Lett. 2019 May 1;305:40-50. PMID: [30690063](https://pubmed.ncbi.nlm.nih.gov/30690063/).
-
-
-
-
-
-
-:::tyk
-
-Using the same dataset that was used in this module, available from the [UNC-SRP TAME2 GitHub website](https://github.com/UNCSRP/TAME2):
-
-1. Load the downloaded GEO dataset into R using the packages and functions mentioned in this tutorial.
-2. Filter the data to just those samples with a "cell type" of "Circulating white blood cells".
-3. Report the means of the first 5 rows of the gene expression data (10700001, 10700002, 10700003, 10700004, 10700005) across all samples.
-
-:::
# 7.3 CompTox Dashboard Data through APIs
-```{r 07-Chapter7-83, include = FALSE}
+```{r 7-3-CompTox-Dashboard-1, include = FALSE}
tpl <- knitr::opts_template$get("TAME_options")
merged <- c(
list(collapse = TRUE, comment = "#>"),
@@ -1085,14 +79,14 @@ This training module was specifically developed to answer the following question
### Cleaning the Global Environment
-```{r 07-Chapter7-84, eval=FALSE}
+```{r 7-3-CompTox-Dashboard-2, eval=FALSE}
rm(list=ls())
```
### Installing Required R Packages
-```{r 07-Chapter7-85, eval=FALSE}
+```{r 7-3-CompTox-Dashboard-3, eval=FALSE}
if (!requireNamespace('ctxR'))
install.packages('ctxR')
@@ -1102,7 +96,7 @@ if (!requireNamespace('ggplot2'))
### Loading R Packages
-```{r 07-Chapter7-86}
+```{r 7-3-CompTox-Dashboard-4 }
# Used to interface with CompTox Chemicals Dashboard
library(ctxR)
@@ -1123,8 +117,8 @@ The CCD can be searched either one chemical at a time, or using a batch search.
In single-substance search, the user types a full or partial chemical identifier (name, CASRN, InChiKey, or DSSTox ID) into a search box on the CCD homepage. Autocomplete provides a list of possible matches; the user selects one by clicking on it, and is then taken to the CCD page for that substance. Here is an example of the CCD page for the chemical Bisphenol A:
-```{r 07-Chapter7-87, echo = FALSE, out.width= "90%", fig.align= 'center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image1.png')
+```{r 7-3-CompTox-Dashboard-5, echo = FALSE, out.width= "90%", fig.align= 'center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image1.png')
```
@@ -1135,14 +129,14 @@ The different domains of data available for this chemical are shown by the tabs
In batch search, the user enters a list of search inputs, separated by newlines, into a batch-search box on https://comptox.epa.gov/dashboard/batch-search . The user selects the type(s) of inputs by selecting one or more checkboxes – these may include chemical identifiers, monoisotopic masses, or molecular formulas. Then, the user selects “Display All Chemicals” to display the list of substances matching the batch-search inputs, or “Choose Export Options” to choose options for exporting the batch-search results as a spreadsheet. The exported spreadsheet may include data from most of the domains available on an individual substance’s CCD page.
-```{r 07-Chapter7-88, echo = FALSE, out.width = "90%", fig.align = 'center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image2.png')
+```{r 7-3-CompTox-Dashboard-6, echo = FALSE, out.width = "90%", fig.align = 'center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image2.png')
```
The user can download the selected information in various formats, such as Excel (.xlsx), comma-separated values (.csv), or different types of chemical table files (e.g., MOL).
-```{r 07-Chapter7-89, echo=FALSE, out.width="90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image3.png')
+```{r 7-3-CompTox-Dashboard-7, echo=FALSE, out.width="90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image3.png')
```
@@ -1192,8 +186,8 @@ For more information on the data accessible through the CTX APIs and related too
The APIs are organized into four sets of "endpoints" (chemical data domains): `Chemical`, `Hazard`, `Bioactivity`, and `Exposure`. Pictured below is what the `Chemical` section looks like and can be found at [CTX API Chemical Endpoints](https://api-ccte.epa.gov/docs/chemical.html).
-```{r 07-Chapter7-90, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image4.png')
+```{r 7-3-CompTox-Dashboard-8, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image4.png')
```
The APIs can be explored through the pictured web interface at https://api-ccte.epa.gov/docs/chemical.html .
@@ -1202,8 +196,8 @@ The APIs can be explored through the pictured web interface at https://api-ccte.
`Authentication` is the first tab on the left. Authentication is required to use the APIs. To authenticate yourself in the API web interface, input your unique API key.
-```{r 07-Chapter7-91, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image5.png')
+```{r 7-3-CompTox-Dashboard-9, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image5.png')
```
@@ -1221,8 +215,8 @@ In the CTX API web interface, the colored boxes next to each endpoint indicate t
Click on the second item under `Chemical Details Resource`, the tab labeled `Get data by dtxsid`. The following page will appear.
-```{r 07-Chapter7-92, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image6.png')
+```{r 7-3-CompTox-Dashboard-10, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image6.png')
```
@@ -1232,15 +226,15 @@ This page has two subheadings: "Path Parameters" and "Query-String Parameters".
The default return format is displayed below and includes a variety of fields with data types represented.
-```{r 07-Chapter7-93, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image7.png')
+```{r 7-3-CompTox-Dashboard-11, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image7.png')
```
We show what the returned data from searching Bisphenol A looks like using this endpoint with the `chemicaldetailstandard` value for `projection` selected.
-```{r 07-Chapter7-94, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image8.png')
+```{r 7-3-CompTox-Dashboard-12, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image8.png')
```
@@ -1254,7 +248,7 @@ Formatting an http request is not necessarily intuitive nor worth the time for s
We store the API key required to access the APIs. To do this for the current session, run the first command. If you want to store your key across multiple sessions, run the second command.
-```{r 07-Chapter7-95, eval=FALSE}
+```{r 7-3-CompTox-Dashboard-13, eval=FALSE}
# This stores the key in the current session
register_ctx_api_key(key = '')
@@ -1263,7 +257,7 @@ register_ctx_api_key(key = '')
register_ctx_api_key(key = '', write = TRUE)
```
-```{r 07-Chapter7-96, echo=FALSE, warning = FALSE}
+```{r 7-3-CompTox-Dashboard-14, echo=FALSE, warning = FALSE}
# This stores the key in the current session
register_ctx_api_key(key = '706401cd-8bda-469d-9cdb-ac27f489c93a')
@@ -1274,7 +268,7 @@ register_ctx_api_key(key = '706401cd-8bda-469d-9cdb-ac27f489c93a', write = TRUE)
To check that your key has successfully been stored for the session, run the following command.
-```{r 07-Chapter7-97, eval=FALSE}
+```{r 7-3-CompTox-Dashboard-15, eval=FALSE}
ctx_key()
```
@@ -1282,7 +276,7 @@ ctx_key()
Now, we demonstrate how to retrieve the information for BPA given by the `Chemical Detail Resource` endpoint under the `chemicaldetailstandard` value for `projection`. Note, this `projection` value is the default value for the function `get_chemical_details()`.
-```{r 07-Chapter7-98}
+```{r 7-3-CompTox-Dashboard-16 }
BPA_chemical_detail <- get_chemical_details(DTXSID = 'DTXSID7020182')
dim(BPA_chemical_detail)
class(BPA_chemical_detail)
@@ -1301,7 +295,7 @@ These lists can be found in the CCD at [CCL4](https://comptox.epa.gov/dashboard/
We explore details about these two lists of chemicals before diving into analyzing the data contained in each list.
-```{r 07-Chapter7-99}
+```{r 7-3-CompTox-Dashboard-17 }
options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
@@ -1312,7 +306,7 @@ print(natadb_information, trunc.cols = TRUE)
Now we pull the actual chemicals contained in the lists using the APIs.
-```{r 07-Chapter7-100}
+```{r 7-3-CompTox-Dashboard-18 }
ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)
@@ -1322,7 +316,7 @@ natadb <- data.table::as.data.table(natadb)
We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.
-```{r 07-Chapter7-101}
+```{r 7-3-CompTox-Dashboard-19 }
dim(ccl4)
dim(natadb)
@@ -1335,7 +329,7 @@ head(ccl4, 1)
Once we have the chemicals in each list, we access their physico-chemical properties. We will use the batch search forms of the function `get_chem_info()`, to which we supply a list of DTXSIDs.
-```{r 07-Chapter7-102}
+```{r 7-3-CompTox-Dashboard-20 }
ccl4$dtxsid
natadb$dtxsid
@@ -1347,7 +341,7 @@ Observe that this returns a single data.table for each query, and the data.table
Before any deeper analysis, let's take a look at the dimensions of the data and the column names.
-```{r 07-Chapter7-103}
+```{r 7-3-CompTox-Dashboard-21 }
dim(ccl4_phys_chem)
colnames(ccl4_phys_chem)
```
@@ -1357,14 +351,14 @@ Next, we display the unique values for the columns `propName` and `propType`.
-```{r 07-Chapter7-104}
+```{r 7-3-CompTox-Dashboard-22 }
ccl4_phys_chem[, unique(propName)]
ccl4_phys_chem[, unique(propType)]
```
Let's explore this further by examining the mean of the "boiling-point" and "melting-point" data.
-```{r 07-Chapter7-105}
+```{r 7-3-CompTox-Dashboard-23 }
ccl4_phys_chem[propName == 'Boiling Point', .(Mean = mean(propValue, na.rm = TRUE))]
ccl4_phys_chem[propName == 'Boiling Point', .(Mean = mean(propValue, na.rm = TRUE)),
by = .(propType)]
@@ -1388,7 +382,7 @@ These results tell us about some of the reported physico-chemical properties of
To explore **all** the values of the physico-chemical properties and calculate their means, we can do the following procedure. First we look at all the physico-chemical properties individually, then group them by each property ("Boiling Point", "Melting Point", etc.), and then additionally group those by property type ("experimental" vs "predicted"). In the grouping, we look at the columns `propValue`, `unit`, `propName` and `propType`. We also demonstrate how to take the mean of the values for each grouping. We examine the chemical with `DTXSID` "DTXSID0020153" from CCL4.
-```{r 07-Chapter7-106}
+```{r 7-3-CompTox-Dashboard-24 }
head(ccl4_phys_chem[dtxsid == 'DTXSID0020153', ])
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(propType, propValue, propUnit),
by = .(propName)]
@@ -1413,7 +407,7 @@ We first examine the vapor pressures for all the chemicals in each list. We then
Group first by DTXSID.
-```{r 07-Chapter7-107}
+```{r 7-3-CompTox-Dashboard-25 }
ccl4_vapor_all <- ccl4_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
@@ -1424,7 +418,7 @@ natadb_vapor_all <- natadb_phys_chem[propName %in% 'Vapor Pressure',
Then group by DTXSID and then by property type.
-```{r 07-Chapter7-108}
+```{r 7-3-CompTox-Dashboard-26 }
ccl4_vapor_grouped <- ccl4_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
@@ -1438,7 +432,7 @@ natadb_vapor_grouped <- natadb_phys_chem[propName %in% 'Vapor Pressure',
Then examine the summary statistics of the data.
-```{r 07-Chapter7-109}
+```{r 7-3-CompTox-Dashboard-27 }
summary(ccl4_vapor_all)
summary(ccl4_vapor_grouped)
summary(natadb_vapor_all)
@@ -1447,7 +441,7 @@ summary(natadb_vapor_grouped)
With such a large range of values covering several orders of magnitude, we log transform the data. Since some of these values are non-positive, some transformations may result in non-numeric values. These will be removed when plotting. We expect these values to be positive in general, so we go ahead with these transformations.
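As a minimal illustration (toy numbers, not values from these lists) of why non-positive values are an issue for log transformation:

```r
vals <- c(-2, 0, 1, 10)

# log() of a negative value returns NaN (with a warning); log(0) returns -Inf
logged <- suppressWarnings(log(vals))
logged

# Such non-finite values are what get dropped when plotting
is.finite(logged)
```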
-```{r 07-Chapter7-110}
+```{r 7-3-CompTox-Dashboard-28 }
ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
@@ -1461,7 +455,7 @@ natadb_vapor_grouped[, log_transform_mean_vapor_pressure :=
Now we plot the log transformed data.
First plot the CCL4 data.
-```{r 07-Chapter7-111, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-29, fig.align='center'}
ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot() +
coord_flip()
@@ -1471,7 +465,7 @@ ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
Then plot the NATA data.
-```{r 07-Chapter7-112, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-30, fig.align='center'}
ggplot(natadb_vapor_all, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot() + coord_flip()
ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
@@ -1480,7 +474,7 @@ ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
Finally, we compare both sets simultaneously. We add in a column to each data.table denoting to which data set the rows correspond and then combine the rows from both data sets together using the function `rbind()`.
-```{r 07-Chapter7-113}
+```{r 7-3-CompTox-Dashboard-31 }
ccl4_vapor_grouped[, set := 'CCL4']
natadb_vapor_grouped[, set := 'NATADB']
@@ -1489,7 +483,7 @@ all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)
Now we plot the combined data. First we color the boxplots based on the property type, with mean log transformed vapor pressure plotted for each data set and property type.
-```{r 07-Chapter7-114, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-32, fig.align='center'}
vapor_box <- ggplot(all_vapor_grouped,
aes(set, log_transform_mean_vapor_pressure)) +
geom_boxplot(aes(color = propType))
@@ -1498,7 +492,7 @@ vapor_box
Next we color the boxplots based on the data set.
-```{r 07-Chapter7-115,, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-33, fig.align='center'}
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot((aes(color = set))) +
coord_flip()
@@ -1511,7 +505,7 @@ We also explore Henry's Law constant and boiling point in a similar fashion.
Group by DTXSID.
-```{r 07-Chapter7-116}
+```{r 7-3-CompTox-Dashboard-34 }
ccl4_hlc_all <- ccl4_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
@@ -1522,7 +516,7 @@ natadb_hlc_all <- natadb_phys_chem[propName %in% "Henry's Law Constant",
Group by DTXSID and property type.
-```{r 07-Chapter7-117}
+```{r 7-3-CompTox-Dashboard-35 }
ccl4_hlc_grouped <- ccl4_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
@@ -1535,7 +529,7 @@ natadb_hlc_grouped <- natadb_phys_chem[propName %in% "Henry's Law Constant",
Examine summary statistics.
-```{r 07-Chapter7-118}
+```{r 7-3-CompTox-Dashboard-36 }
summary(ccl4_hlc_all)
summary(ccl4_hlc_grouped)
summary(natadb_hlc_all)
@@ -1544,7 +538,7 @@ summary(natadb_hlc_grouped)
Again, we log transform the data as it covers several orders of magnitude. We expect these values to be positive in general so we go ahead with these transformations.
-```{r 07-Chapter7-119}
+```{r 7-3-CompTox-Dashboard-37 }
ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
@@ -1557,7 +551,7 @@ We compare both sets simultaneously. We add in a column to each data.table denot
Label and combine data.
-```{r 07-Chapter7-120}
+```{r 7-3-CompTox-Dashboard-38 }
ccl4_hlc_grouped[, set := 'CCL4']
natadb_hlc_grouped[, set := 'NATADB']
@@ -1566,7 +560,7 @@ all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)
Plot data. Some rows are removed due to transformations above that result in invalid values.
-```{r 07-Chapter7-121,, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-39, fig.align='center'}
hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) +
geom_boxplot(aes(color = propType))
hlc_box
@@ -1583,7 +577,7 @@ Finally, we consider boiling point.
Group by DTXSID.
-```{r 07-Chapter7-122}
+```{r 7-3-CompTox-Dashboard-40 }
ccl4_boiling_all <- ccl4_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
@@ -1595,7 +589,7 @@ natadb_boiling_all <- natadb_phys_chem[propName %in% 'Boiling Point',
Group by DTXSID and property type.
-```{r 07-Chapter7-123}
+```{r 7-3-CompTox-Dashboard-41 }
ccl4_boiling_grouped <- ccl4_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point =
sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
@@ -1610,7 +604,7 @@ natadb_boiling_grouped <- natadb_phys_chem[propName %in% 'Boiling Point',
Calculate summary statistics.
-```{r 07-Chapter7-124}
+```{r 7-3-CompTox-Dashboard-42 }
summary(ccl4_boiling_all)
summary(ccl4_boiling_grouped)
summary(natadb_boiling_all)
@@ -1619,7 +613,7 @@ summary(natadb_boiling_grouped)
Since some of the boiling point values are negative, we cannot log-transform them. If we try, as shown below, warnings about NaNs being produced will appear.
-```{r 07-Chapter7-125, eval}
+```{r 7-3-CompTox-Dashboard-43 }
ccl4_boiling_all[, log_transform := log(mean_boiling_point)]
ccl4_boiling_grouped[, log_transform := log(mean_boiling_point)]
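One hedged alternative when values can be negative (a sketch, not part of the original module) is the inverse hyperbolic sine, `asinh()`, which is defined for all real numbers and behaves like `log(2*x)` for large positive values:

```r
# Sketch only: asinh() handles zero and negative inputs, unlike log().
# The example values below are illustrative, not drawn from the data sets.
library(data.table)
dt <- data.table(mean_boiling_point = c(-56.6, 0, 100, 250))
dt[, asinh_transform := asinh(mean_boiling_point)]
dt
```

Whether such a transform is appropriate depends on the downstream analysis; the module instead simply plots the untransformed boiling points.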
@@ -1631,7 +625,7 @@ We compare both sets simultaneously. We add in a column to each data.table denot
Label and combine data.
-```{r 07-Chapter7-126}
+```{r 7-3-CompTox-Dashboard-44 }
ccl4_boiling_grouped[, set := 'CCL4']
natadb_boiling_grouped[, set := 'NATADB']
@@ -1640,7 +634,7 @@ all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)
Plot the data.
-```{r 07-Chapter7-127,, fig.align='center'}
+```{r 7-3-CompTox-Dashboard-45, fig.align='center'}
boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) +
geom_boxplot(aes(color = propType))
boiling_box
@@ -1672,22 +666,22 @@ Now, having examined some of the distributions of the physico-chemical propertie
Using the standard CompTox Chemicals Dashboard approach to accessing genotoxicity data, one would again navigate to the individual chemical page.
-```{r 07-Chapter7-128, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image9.png')
+```{r 7-3-CompTox-Dashboard-46, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image9.png')
```
Once one navigates to the genotoxicity tab highlighted on the previous page, the following is displayed:
-```{r 07-Chapter7-129, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image10.png')
+```{r 7-3-CompTox-Dashboard-47, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image10.png')
```
This page includes two sets of information: the first provides a summary of available genotoxicity data, while the second provides the individual reports and samples of such data.
We again use the CTX APIs to streamline retrieving this information programmatically. To this end, we will use the genotoxicity resources found within the `Hazard` endpoints of the CTX APIs; the particular set available is pictured below.
-```{r 07-Chapter7-130, echo = FALSE, out.width = "90%", fig.align='center'}
-knitr::include_graphics('Chapter_7/Module7_3_Input/Module7_3_Image11.png')
+```{r 7-3-CompTox-Dashboard-48, echo = FALSE, out.width = "90%", fig.align='center'}
+knitr::include_graphics('Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image11.png')
```
There are both summary and detail resources, reflecting the information one can find on the CompTox Chemicals Dashboard Genotoxicity page for a given chemical.
@@ -1695,21 +689,21 @@ There are both summary and detail resources, reflecting the information one can
To access the genetox summary endpoint, we will use the function `get_genetox_summary()`. Since we have a list of chemicals, rather than searching for each chemical individually, we use the batch search version of the function, named `get_genetox_summary_batch()`. We will examine this summary information and then access the details.
Grab the data using the APIs.
-```{r 07-Chapter7-131}
+```{r 7-3-CompTox-Dashboard-49 }
ccl4_genotox <- get_genetox_summary_batch(DTXSID = ccl4$dtxsid)
natadb_genetox <- get_genetox_summary_batch(DTXSID = natadb$dtxsid)
```
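Batch API requests can fail midway (network hiccups, rate limits). A defensive sketch, reusing the module's `get_genetox_summary_batch()` function; the retry wrapper itself is illustrative and not part of the module:

```r
# Hypothetical helper: retry a batch request a few times before failing.
safe_batch <- function(dtxsids, attempts = 3) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(get_genetox_summary_batch(DTXSID = dtxsids),
                       error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(2)  # short pause before retrying
  }
  stop("Batch request failed after ", attempts, " attempts")
}
# ccl4_genotox <- safe_batch(ccl4$dtxsid)
```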
Examine the dimensions.
-```{r 07-Chapter7-132}
+```{r 7-3-CompTox-Dashboard-50 }
dim(ccl4_genotox)
dim(natadb_genetox)
```
Examine the column names and data from the first six chemicals with genetox data from CCL4.
-```{r 07-Chapter7-133}
+```{r 7-3-CompTox-Dashboard-51 }
colnames(ccl4_genotox)
head(ccl4_genotox)
```
@@ -1718,7 +712,7 @@ The information returned is of the first variety highlighted in the image above,
Observe that we have information on 71 chemicals from the CCL4 data and 153 from the NATADB data. We note the chemicals not included in the results and then dig into the returned results.
-```{r 07-Chapter7-134}
+```{r 7-3-CompTox-Dashboard-52 }
ccl4[!(dtxsid %in% ccl4_genotox$dtxsid),
.(dtxsid, casrn, preferredName, molFormula)]
natadb[!(dtxsid %in% natadb_genetox$dtxsid),
@@ -1729,21 +723,21 @@ Now, we access the genotoxicity details of the chemicals in each data set using
Grab the data from the CTX APIs.
-```{r 07-Chapter7-135}
+```{r 7-3-CompTox-Dashboard-53 }
ccl4_genetox_details <- get_genetox_details_batch(DTXSID = ccl4$dtxsid)
natadb_genetox_details <- get_genetox_details_batch(DTXSID = natadb$dtxsid)
```
Examine the dimensions.
-```{r 07-Chapter7-136}
+```{r 7-3-CompTox-Dashboard-54 }
dim(ccl4_genetox_details)
dim(natadb_genetox_details)
```
Look at the column names and the first six rows of the data from the CCL4 chemicals.
-```{r 07-Chapter7-137}
+```{r 7-3-CompTox-Dashboard-55 }
colnames(ccl4_genetox_details)
head(ccl4_genetox_details)
```
@@ -1752,14 +746,14 @@ We examine the information returned for the first chemical in each set of result
Look at the dimensions first.
-```{r 07-Chapter7-138}
+```{r 7-3-CompTox-Dashboard-56 }
dim(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
dim(natadb_genetox_details[dtxsid %in% 'DTXSID0020153', ])
```
Now examine the first few rows.
-```{r 07-Chapter7-139}
+```{r 7-3-CompTox-Dashboard-57 }
head(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
```
@@ -1770,13 +764,13 @@ We now explore the assays present for chemicals in each data set. We first deter
Determine the unique assay categories.
-```{r 07-Chapter7-140}
+```{r 7-3-CompTox-Dashboard-58 }
ccl4_genetox_details[, unique(assayCategory)]
natadb_genetox_details[, unique(assayCategory)]
```
Determine the unique assays for each data set and list them.
-```{r 07-Chapter7-141}
+```{r 7-3-CompTox-Dashboard-59 }
ccl4_genetox_details[, unique(assayType)]
natadb_genetox_details[, unique(assayType)]
@@ -1788,7 +782,7 @@ natadb_genetox_details[, unique(assayType)]
Determine the number of assays per unique `assayCategory` value.
-```{r 07-Chapter7-142}
+```{r 7-3-CompTox-Dashboard-60 }
ccl4_genetox_details[, .(Assays = length(unique(assayType))),
by = .(assayCategory)]
@@ -1799,14 +793,14 @@ natadb_genetox_details[, .(Assays = length(unique(assayType))),
We can analyze these results more closely, counting the number of assay results and grouping by `assayCategory` and `assayType`. We also examine the different numbers of `assayCategory` and `assayType` values used.
-```{r 07-Chapter7-143}
+```{r 7-3-CompTox-Dashboard-61 }
ccl4_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
ccl4_genetox_details[, .N, by = .(assayCategory)]
```
We look at the `assayType` values and numbers of each for the three different `assayCategory` values.
-```{r 07-Chapter7-144}
+```{r 7-3-CompTox-Dashboard-62 }
ccl4_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
ccl4_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
ccl4_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
@@ -1814,14 +808,14 @@ ccl4_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
Now we repeat this for NATADB.
-```{r 07-Chapter7-145}
+```{r 7-3-CompTox-Dashboard-63 }
natadb_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
natadb_genetox_details[, .N, by = .(assayCategory)]
```
Examine the number of rows for each `assayType` value by each `assayCategory` value.
-```{r 07-Chapter7-146, R.options=list(width=150) }
+```{r 7-3-CompTox-Dashboard-64, R.options=list(width=150) }
natadb_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
natadb_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
natadb_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
@@ -1839,7 +833,7 @@ natadb_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
Next, we dig into the results of the assays. For instance, one may be interested in the number of chemicals for which an assay returned a positive or negative result. We group by `assayResult` and determine the number of unique `dtxsid` values associated with each `assayResult` value.
-```{r 07-Chapter7-147}
+```{r 7-3-CompTox-Dashboard-65 }
ccl4_genetox_details[, .(DTXSIDs = length(unique(dtxsid))), by = .(assayResult)]
natadb_genetox_details[, .(DTXSIDs = length(unique(dtxsid))),
by = .(assayResult)]
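A complementary summary (a sketch, not in the module's code) is the proportion of positive results per chemical, which separates chemicals with a single positive hit from those positive across most assays:

```r
# Sketch: per-chemical count of assay rows and fraction that are 'positive'.
# Assumes the ccl4_genetox_details data.table retrieved below.
positive_rate <- ccl4_genetox_details[,
  .(n_assays = .N,
    pct_positive = mean(assayResult == 'positive')),
  by = .(dtxsid)][order(-pct_positive)]
head(positive_rate)
```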
@@ -1857,7 +851,7 @@ natadb_genetox_details[, .(DTXSIDs = length(unique(dtxsid))),
We now determine the chemicals from each data set that are known to have genotoxic effects. For this, we look to see which chemicals produce at least one positive response in the `assayResult` column.
-```{r 07-Chapter7-148}
+```{r 7-3-CompTox-Dashboard-66 }
ccl4_genetox_details[, .(is_positive = any(assayResult == 'positive')),
by = .(dtxsid)][is_positive == TRUE, dtxsid]
natadb_genetox_details[, .(is_positive = any(assayResult == 'positive')),
@@ -1866,7 +860,7 @@ natadb_genetox_details[, .(is_positive = any(assayResult == 'positive')),
With so much genotoxicity data, let us explore the data for one chemical more deeply to get a sense of the assays and results present for it. We will explore the chemical with DTXSID0020153, looking at the assays, the number of each type of result, and which results correspond to "positive". To determine this, we group by `assayResult` and calculate `.N` for each group. We also isolate which results were positive and output a data.table with the number of each type.
-```{r 07-Chapter7-149}
+```{r 7-3-CompTox-Dashboard-67 }
ccl4_genetox_details[dtxsid == 'DTXSID0020153', .(Number = .N),
by = .(assayResult)]
ccl4_genetox_details[dtxsid == 'DTXSID0020153' & assayResult == 'positive',
@@ -1889,20 +883,20 @@ ccl4_genetox_details[dtxsid == 'DTXSID0020153' & assayResult == 'positive',
Finally, we examine the hazard data associated with the chemicals in each data set. For each chemical, there will be potentially hundreds of rows of hazard data, so the returned results will be much larger than in most other API endpoints.
-```{r 07-Chapter7-150}
+```{r 7-3-CompTox-Dashboard-68 }
ccl4_hazard <- get_hazard_by_dtxsid_batch(DTXSID = ccl4$dtxsid)
natadb_hazard <- get_hazard_by_dtxsid_batch(DTXSID = natadb$dtxsid)
```
We do some preliminary exploration of the data. First we determine the dimensions of the data sets.
-```{r 07-Chapter7-151}
+```{r 7-3-CompTox-Dashboard-69 }
dim(ccl4_hazard)
dim(natadb_hazard)
```
Next we record the column names and display the first six results in the CCL4 hazard data.
-```{r 07-Chapter7-152}
+```{r 7-3-CompTox-Dashboard-70 }
colnames(ccl4_hazard)
head(ccl4_hazard)
```
@@ -1911,26 +905,26 @@ We determine the number of unique values in the `criticalEffect`, `toxvalTypeSup
The number of unique values for `criticalEffect`.
-```{r 07-Chapter7-153}
+```{r 7-3-CompTox-Dashboard-71 }
length(ccl4_hazard[, unique(criticalEffect)])
length(natadb_hazard[, unique(criticalEffect)])
```
The number of unique values of `toxvalTypeSuperCategory`.
-```{r 07-Chapter7-154}
+```{r 7-3-CompTox-Dashboard-72 }
length(ccl4_hazard[, unique(toxvalTypeSuperCategory)])
length(natadb_hazard[, unique(toxvalTypeSuperCategory)])
```
The number of unique values for `toxvalType`.
-```{r 07-Chapter7-155}
+```{r 7-3-CompTox-Dashboard-73 }
length(ccl4_hazard[, unique(toxvalType)])
length(natadb_hazard[, unique(toxvalType)])
```
Now we look at the number of entries per `toxvalTypeSuperCategory`.
-```{r 07-Chapter7-156}
+```{r 7-3-CompTox-Dashboard-74 }
ccl4_hazard[, .N, by = .(toxvalTypeSuperCategory)]
natadb_hazard[, .N, by = .(toxvalTypeSuperCategory)]
@@ -1938,7 +932,7 @@ natadb_hazard[, .N, by = .(toxvalTypeSuperCategory)]
With over 7,000 results for the `toxvalTypeSuperCategory` value "Dose Response Summary Value" in each data set, we dig into this further.
We determine the number of rows, grouped by `toxvalType`, that have the "Dose Response Summary Value" `toxvalTypeSuperCategory` value, and display the counts in descending order.
-```{r 07-Chapter7-157}
+```{r 7-3-CompTox-Dashboard-75 }
ccl4_hazard[toxvalTypeSuperCategory %in% 'Dose Response Summary Value', .N,
by = .(toxvalType)][order(-N),]
natadb_hazard[toxvalTypeSuperCategory %in% 'Dose Response Summary Value', .N,
@@ -1949,7 +943,7 @@ We explore "NOAEL", "LOAEL", and "NOEL" further. Let us look at the the case whe
First, we look at "food". We order by `toxvalType` and by the minimum `toxvalNumeric` value in each group, descending.
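A related data.table idiom (a sketch, not part of the module's code) keeps the entire row achieving each group's minimum, rather than just the minimum value, via `.SD[which.min(...)]`:

```r
# Sketch: retain the full row with the smallest toxvalNumeric per toxvalType.
ccl4_hazard[media %in% 'food' &
              toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
            .SD[which.min(toxvalNumeric)],
            by = .(toxvalType)]
```

This is handy when the source, species, or units of the minimum value matter for interpretation.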
-```{r 07-Chapter7-158}
+```{r 7-3-CompTox-Dashboard-76 }
ccl4_hazard[media %in% 'food' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
@@ -1962,7 +956,7 @@ natadb_hazard[media %in% 'food' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
Next we look at "culture", repeating the same grouping and ordering as in the previous case.
-```{r 07-Chapter7-159}
+```{r 7-3-CompTox-Dashboard-77 }
ccl4_hazard[media %in% 'culture' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
@@ -1977,7 +971,7 @@ Now, let us restrict our attention to human hazard and focus on the exposure rou
First, let us determine the exposure routes in general.
-```{r 07-Chapter7-160}
+```{r 7-3-CompTox-Dashboard-78 }
ccl4_hazard[humanEco %in% 'human health', unique(exposureRoute)]
natadb_hazard[humanEco %in% 'human health', unique(exposureRoute)]
```
@@ -1988,7 +982,7 @@ Then, let's focus on the inhalation and oral exposure routes for human hazard.
To answer this, filter the data to the corresponding exposure routes, then group by `exposureRoute` and `riskAssessmentClass`, and finally count the number of instances in each group. To determine the most represented class, one can order the results in descending order.
-```{r 07-Chapter7-161}
+```{r 7-3-CompTox-Dashboard-79 }
ccl4_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation', 'oral'), .(Hits = .N),
by = .(exposureRoute, riskAssessmentClass)][order(exposureRoute,
@@ -2018,7 +1012,7 @@ To answer this, we filter the rows to the "human health" `humanEco` value and "i
First we look at CCL4.
-```{r 07-Chapter7-162}
+```{r 7-3-CompTox-Dashboard-80 }
ccl4_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation'), unique(toxvalType)]
ccl4_hazard[humanEco %in% 'human health' &
@@ -2028,7 +1022,7 @@ intersect(ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'inhalat
Then we look at NATADB.
-```{r 07-Chapter7-163}
+```{r 7-3-CompTox-Dashboard-81 }
natadb_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation'), unique(toxvalType)]
natadb_hazard[humanEco %in% 'human health' &
@@ -2049,7 +1043,7 @@ intersect(natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'inhal
For the next data exploration, we will examine the "NOAEL" and "LOAEL" values for chemicals with oral exposure and human hazard. We also examine the units to determine whether any unit conversions are necessary to compare numeric values.
-```{r 07-Chapter7-164}
+```{r 7-3-CompTox-Dashboard-82 }
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL'), ]
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
@@ -2062,7 +1056,7 @@ natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
Observe that for both CCL4 and NATADB, the units are given by "mg/kg-day", "ppm", "mg/L" and additionally "-" for NATADB. In this case, we treat "mg/kg-day" and "ppm" the same and exclude "-" and "mg/L". We group by DTXSID to find the lowest or highest value.
-```{r 07-Chapter7-165}
+```{r 7-3-CompTox-Dashboard-83 }
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL') & !(toxvalUnits %in% c('-', 'mg/L')),
.(numeric_value = min(toxvalNumeric),
@@ -2077,7 +1071,7 @@ natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
Now, we also explore the rows whose `toxvalType` is "RfD", "RfC", or "cancer slope factor". We first determine the set of units for each, make appropriate conversions if necessary, and then make comparisons.
-```{r 07-Chapter7-166}
+```{r 7-3-CompTox-Dashboard-84 }
ccl4_hazard[humanEco %in% 'human health' & toxvalType %in%
c('cancer slope factor', 'RfD', 'RfC'), .N,
by = .(toxvalType, toxvalUnits)][order(toxvalType, -N)]
@@ -2089,7 +1083,7 @@ For CCL4 and NATADB, there is a single unit type for each `toxvalType` value, so
First, we filter and separate out the relevant data subsets.
-```{r 07-Chapter7-167}
+```{r 7-3-CompTox-Dashboard-85 }
# Separate out into relevant data subsets
ccl4_csf <- ccl4_hazard[humanEco %in% 'human health' &
toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
@@ -2101,7 +1095,7 @@ ccl4_rfd <- ccl4_hazard[humanEco %in% 'human health' &
While there are no unit conversions needed, we demonstrate how we would convert units if they were required.
-```{r 07-Chapter7-168}
+```{r 7-3-CompTox-Dashboard-86 }
# Set mass by volume units to mg/m3, so scale g/m3 by 1E3 and ug/m3 by 1E-3
ccl4_rfc[toxvalUnits == 'mg/m3', conversion := 1]
ccl4_rfc[toxvalUnits == 'g/m3', conversion := 1E3]
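The per-unit assignments above can equivalently be expressed as an update join against a small lookup table, which scales better as the set of units grows; a sketch assuming the `toxvalUnits` and `toxvalNumeric` columns as above:

```r
# Sketch: map each unit to a multiplicative factor via a data.table update join.
unit_lookup <- data.table(
  toxvalUnits = c('mg/m3', 'g/m3', 'ug/m3'),
  conversion  = c(1, 1e3, 1e-3)
)
# i.conversion refers to the lookup table's column during the join.
ccl4_rfc[unit_lookup, conversion := i.conversion, on = 'toxvalUnits']
```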
@@ -2114,7 +1108,7 @@ ccl4_rfd[toxvalUnits %in% c('mg/kg-day', 'mg/kg'), units := 'mg/kg']
Then aggregate the data.
-```{r 07-Chapter7-169}
+```{r 7-3-CompTox-Dashboard-87 }
# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
ccl4_csf[,.(numeric_value = max(toxvalNumeric),
@@ -2130,7 +1129,7 @@ ccl4_rfd[,.(numeric_value = min(toxvalNumeric*conversion),
Repeat the process for NATADB, first separating out the relevant subsets of the data.
-```{r 07-Chapter7-170}
+```{r 7-3-CompTox-Dashboard-89 }
# Separate out into relevant data subsets
natadb_csf <- natadb_hazard[humanEco %in% 'human health' &
toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
@@ -2142,7 +1141,7 @@ natadb_rfd <- natadb_hazard[humanEco %in% 'human health' &
Now handle the unit conversions.
-```{r 07-Chapter7-171}
+```{r 7-3-CompTox-Dashboard-90 }
# Set mass by mass units to mg/kg. Note that ppm is already in mg/kg
natadb_rfc <- natadb_rfc[toxvalUnits != 'ppm',]
natadb_rfd[, units := 'mg/kg-day']
@@ -2150,7 +1149,7 @@ natadb_rfd[, units := 'mg/kg-day']
Finally, aggregate the data.
-```{r 07-Chapter7-172}
+```{r 7-3-CompTox-Dashboard-91 }
# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
natadb_csf[, .(numeric_value = max(toxvalNumeric),
@@ -2191,7 +1190,7 @@ Try running the same analysis of physical-chemical properties, genotoxicity data
-```{r breakdown, echo = FALSE, results = 'hide'}
+```{r 7-3-CompTox-Dashboard-92, echo = FALSE, results = 'hide'}
# This chunk will be hidden in the final product. It serves to undo defining the
# custom print function to prevent unexpected behavior after this module during
# the final knitting process
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image1.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image1.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image1.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image1.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image10.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image10.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image10.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image10.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image11.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image11.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image11.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image11.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image2.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image2.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image2.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image2.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image3.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image3.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image3.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image3.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image4.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image4.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image4.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image4.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image5.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image5.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image5.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image5.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image6.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image6.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image6.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image6.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image7.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image7.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image7.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image7.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image8.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image8.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image8.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image8.png
diff --git a/Chapter_7/Module7_3_Input/Module7_3_Image9.png b/Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image9.png
similarity index 100%
rename from Chapter_7/Module7_3_Input/Module7_3_Image9.png
rename to Chapter_7/7_3_CompTox_Dashboard/Module7_3_Image9.png
diff --git a/Chapter_7/Module7_4_Input/Module7_4_InputData.RData b/Chapter_7/Module7_4_Input/Module7_4_InputData.RData
deleted file mode 100644
index 449f93c..0000000
Binary files a/Chapter_7/Module7_4_Input/Module7_4_InputData.RData and /dev/null differ
diff --git a/_bookdown.yml b/_bookdown.yml
index 6cfcdcf..f9de649 100644
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -2,13 +2,46 @@ delete_merged_file: true
rmd_files:
- "index.Rmd"
- - "Chapter_1/01-Chapter1.Rmd"
- - "Chapter_2/02-Chapter2.Rmd"
- - "Chapter_3/03-Chapter3.Rmd"
- - "Chapter_4/04-Chapter4.Rmd"
- - "Chapter_5/05-Chapter5.Rmd"
- - "Chapter_6/06-Chapter6.Rmd"
- - "Chapter_7/07-Chapter7.Rmd"
+
+ - "Chapter_1/1_1_FAIR/1_1_FAIR.Rmd"
+ - "Chapter_1/1_2_Data_Sharing/1_2_Data_Sharing.Rmd"
+ - "Chapter_1/1_3_Github/1_3_Github.Rmd"
+ - "Chapter_1/1_4_Excel/1_4_Excel.Rmd"
+
+ - "Chapter_2/2_1_R_Programming/2_1_R_Programming.Rmd"
+ - "Chapter_2/2_2_Best_Practices/2_2_Best_Practices.Rmd"
+ - "Chapter_2/2_3_Data_Manipulation/2_3_Data_Manipulation.Rmd"
+ - "Chapter_2/2_4_Code_Efficiency/2_4_Code_Efficiency.Rmd"
+ - "Chapter_3/3_1_Data_Visualization/3_1_Data_Visualization.Rmd"
+
+ - "Chapter_3/3_2_Improving_Visualization/3_2_Improving_Visualization.Rmd"
+ - "Chapter_3/3_3_Normality_Tests/3_3_Normality_Tests.Rmd"
+ - "Chapter_3/3_4_Statistical_Tests/3_4_Statistical_Tests.Rmd"
+ - "Chapter_4/4_1_Experimental_Design/4_1_Experimental_Design.Rmd"
+
+ - "Chapter_4/4_2_Data_Import/4_2_Data_Import.Rmd"
+ - "Chapter_4/4_3_PDF_Import/4_3_PDF_Import.Rmd"
+ - "Chapter_4/4_4_Two_Groups/4_4_Two_Groups.Rmd"
+ - "Chapter_4/4_5_Multiple_Groups/4_5_Multiple_Groups.Rmd"
+ - "Chapter_4/4_6_Advanced_Multiple_Groups/4_6_Advanced_Multiple_Groups.Rmd"
+
+ - "Chapter_5/5_1_AI/5_1_AI.Rmd"
+ - "Chapter_5/5_2_Supervised_ML/5_2_Supervised_ML.Rmd"
+ - "Chapter_5/5_3_Supervised_ML_Interpretation/5_3_Supervised_ML_Interpretation.Rmd"
+ - "Chapter_5/5_4_Unsupervised_ML/5_4_Unsupervised_ML.Rmd"
+ - "Chapter_5/5_5_Unsupervised_ML_2/5_5_Unsupervised_ML_2.Rmd"
+
+ - "Chapter_6/6_1_Descriptive_Cohort_Analyses/6_1_Descriptive_Cohort_Analyses.Rmd"
+ - "Chapter_6/6_2_Omics_System_Biology/6_2_Omics_System_Biology.Rmd"
+ - "Chapter_6/6_3_Mixtures_Analysis/6_3_Mixtures_Analysis.Rmd"
+ - "Chapter_6/6_4_Mixtures_Analysis_2/6_4_Mixtures_Analysis_2.Rmd"
+ - "Chapter_6/6_5_Mixtures_Analysis_3/6_5_Mixtures_Analysis_3.Rmd"
+ - "Chapter_6/6_6_Toxicokinetic_Modeling/6_6_Toxicokinetic_Modeling.Rmd"
+ - "Chapter_6/6_7_Chemical_Read_Across/6_7_Chemical_Read_Across.Rmd"
+
+ - "Chapter_7/7_1_Comparative_Toxicogenomics_Database/7_1_Comparative_Toxicogenomics_Database.Rmd"
+ - "Chapter_7/7_2_Gene_Expression_Omnibus/7_2_Gene_Expression_Omnibus.Rmd"
+ - "Chapter_7/7_3_CompTox_Dashboard/7_3_CompTox_Dashboard.Rmd"
#prefix name for each chapter
language:
diff --git a/index.Rmd b/index.Rmd
index ecc96f9..880fc90 100644
--- a/index.Rmd
+++ b/index.Rmd
@@ -55,7 +55,7 @@ Training modules were developed to provide applications-driven examples of data
The overall organization of this TAME toolkit is summarized below. Modules are organized into seven chapters that are listed on the left side of this website.
-```{r ModuleOverview, out.width="70%", echo=FALSE, fig.align='center'}
+```{r index-2, out.width="70%", echo=FALSE, fig.align='center'}
knitr::include_graphics("images/index_images/Module0_Image1.png")
```
@@ -83,7 +83,7 @@ This study was supported by the National Institutes of Health (NIH) from the Nat
**P42ES031007**: The [University of North Carolina (UNC)-Superfund Research Program](https://sph.unc.edu/superfund-pages/srp/) (SRP) seeks to develop new solutions for reducing exposure to inorganic arsenic and prevent arsenic-induced diabetes through mechanistic and translational research. The [UNC-SRP Data Analysis and Management Core (UNC-SRP-DMAC)](https://sph.unc.edu/superfund-pages/dmac/) provides the UNC-SRP with critical expertise in bioinformatics, statistics, data management, and data integration.
-```{r index-2, echo=FALSE, out.width="40%", fig.align='center'}
+```{r index-3, echo=FALSE, out.width="40%", fig.align='center'}
knitr::include_graphics("images/index_images/Module0_Image2.png")
```
@@ -91,7 +91,7 @@ knitr::include_graphics("images/index_images/Module0_Image2.png")
**T32ES007126**: The [UNC Curriculum in Toxicology and Environmental Medicine (CiTEM)](https://www.med.unc.edu/toxicology/) seeks to provide a cutting edge research and mentoring environment to train students and postdoctoral fellows in environmental health and toxicology. Towards this goal, the CiTEM has a T32 Training Program for Pre- and Postdoctoral Training in Toxicology to support the development of future investigators in environmental health and toxicology.
-```{r index-3, echo=FALSE, out.width="15%",fig.align='center'}
+```{r index-4, echo=FALSE, out.width="15%",fig.align='center'}
knitr::include_graphics("images/index_images/Module0_Image3.png")
```
@@ -99,7 +99,7 @@ knitr::include_graphics("images/index_images/Module0_Image3.png")
Support was additionally provided through the [Institute for Environmental Health Solutions (IEHS)](https://sph.unc.edu/iehs/institute-for-environmental-health-solutions/) at the University of North Carolina (UNC) Gillings School of Global Public Health. The IEHS is aimed at protecting those who are particularly vulnerable to diseases caused by environmental factors, putting solutions directly into the hands of individuals and communities of North Carolina and beyond.
-```{r index-4, echo=FALSE, out.width="60%", fig.align='center'}
+```{r index-5, echo=FALSE, out.width="60%", fig.align='center'}
knitr::include_graphics("images/index_images/Module0_Image4.png")
```