# Hands-on sharing data

```{r setup2, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  error = FALSE,
  message = FALSE)
```

<style>
.blauw {
  background-color: #f3fbfe;
}
</style>


**Note: for this half of the workshop we split the participants based on their previous experience and the tools they use. If you can answer "yes" to the following statements, you are on the correct page:**

 - I don't use code (R, Python, Matlab, Julia, etc.) to analyse my data
 - I would like to use a tidy data format to make my data machine readable

## **Exercise B1:** reflection {.blauw}

5-10 minutes

Go through your e-mail and find the last time you sent data to a colleague or collaborator or, if all your data is shared in the cloud, the last time you uploaded new data. Suppose you move to a different university, your mailbox is closed and some stranger needs to use your data for a project. Would they have enough information to do so? What would you do differently next time?

Discuss in pairs.

## Introduction
Whether you are working in an experimental, computational or design-oriented scientific field, data is often the final product, or at least an important product, of your scientific research. In science we seldom work in isolation, so sharing data and being able to track changes to and versions of the data and its connected metadata should be an important aspect of your work. We usually grow into a project without much education on very practical things such as where to store data, how to handle file structure, how to define versions of the data and how to handle changes between versions. We therefore decided to focus on these aspects in a very practical manner.

In this part of the workshop, we will provide you with hands-on practical handles on how to collaborate on data.


## Data flows in VHP4Safety

VHP4Safety is a data-intensive project, meaning that a lot of data collection is going on. While there are several ongoing developments to automate and standardize the data formats in VHP4Safety, many of us use a spreadsheet program such as MS Excel to gather and document data. In a large AI-based project such as VHP4Safety, we need to keep in mind that the data does not only need to be understandable for us. It also needs to be machine readable (or at the very least: really easy to convert to machine readable data). Therefore, we will discuss how to best store data in tabular (in this case spreadsheet) formats.

## **Exercise B2:** What can go wrong with data in Excel {.blauw}

Exercise from the [FAIR in (biological) practice course on carpentries-incubator](https://carpentries-incubator.github.io/fair-bio-practice/05-intro-to-metadata/index.html#types-of-metadata) (see below for the full reference).

15 minutes

Tables are one of the best forms of storing and representing information. That is why we find them almost everywhere, from a menu in a restaurant, a bank statement, to professional databases. No surprise then that we biologists love tables and we tend to predominantly use Excel.

Excel is easy to use, flexible and powerful. However, it often gives us too much freedom, which leads to bad practices and to data and metadata that are difficult to re-use.

Have a look at the following example data-file in Excel (you can download it [here](data/04-bad-excel.xlsx)):

```{r, echo=FALSE,  out.width = "900px"}
knitr::include_graphics(
  here::here(
    "images",
    "bad-excel.png"))
```

It contains data from experiments on plants in different light conditions, similar to the data presented before. Imagine you need to work with this file, and you received it by email from a collaborator.

 1. What do you find confusing?
   What would you try to clarify with the author before doing anything with the file?
 1. What will be the issues with calculating the average biomass and the biomass per genotype?
 1. Typically, more advanced data analysis is done programmatically, which requires either conversion to a text format such as csv or tsv, or using a library that reads Excel files and “kind of makes this conversion on the fly”. Save this file in a text format, close Excel and reopen the saved file. What has changed?
 1. Have you seen similar tables? Do you believe this example is realistic and replicates real-life issues?

## {.unlisted .unnumbered}

<details><summary>Click here for a solution</summary>

 1. This file (hopefully unrealistically) exaggerates typical bad practices in Excel. Some things that may be confusing:
   - Why are there two tables? Are the period measurements related to the metabolics, i.e. taken from the same samples?
   - Do colors in the period table have the same meaning? It seems not.
   - Why can row 22 be read, whilst row 13 says error?
   - What is the meaning of values in the media column?
   - Are the genotypes the same in different blocks or not?
   - What is the meaning behind bold text in the table?
   - What is the definition of the terms, and why are units missing or inconsistent?
 1. Issues with the calculations:
   - Before averaging the biomass weight, the values need to be converted to the same unit and the text entries need to be replaced by numbers.
   - Averaging per genotype needs manual selection of suitable entries.
 1. Changes after saving in a text format:
   - Information about light conditions is completely lost.
   - Header columns are scrambled.
   - The update date may change its meaning depending on the location (switch year with day).

</details>
<br>

## Some common spreadsheet errors (that have been spotted in VHP4Safety or similar projects)
(slightly shortened and reworded from the source at carpentries-incubator.github.io)

 - **Not filling in zeros**
It might be that when you’re measuring something, it’s usually a zero, say the number of times an adverse effect is observed in the survey. Why bother writing in the number zero in that column, when it’s mostly zeros? However, "I counted zero adverse effects" is different from "I did not measure this, this data is missing". Make sure to record zeros as zeros and truly missing data as empty / missing (so not -999 or 999, and definitely not 0, because computers will conclude that there is data).
 - **Using formatting to convey information or to organise data**
Never highlight cells, rows or columns that should be excluded from an analysis, or to mark particular properties/conditions. Also never merge cells. Formatting information is not available to analysis software and almost certainly will be lost during processing.
 - **Placing comments or units or more than one variable in a cell**
Each cell in a spreadsheet should contain only one piece of information.

Source: [FAIR in (biological) practice](https://carpentries-incubator.github.io/fair-bio-practice/05-intro-to-metadata/index.html#types-of-metadata). Licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) 2022. We took [Exercise 1: What can go wrong with data in Excel](https://carpentries-incubator.github.io/fair-bio-practice/07-data-in-excel/index.html#exercise-1-what-can-go-wrong-with-data-in-excel-4-min) out of context.
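
To see why the difference between a zero and a missing value matters, here is a minimal R sketch. The counts are made up for illustration and do not come from the workshop data. It computes the average number of adverse effects for five participants when the one unmeasured participant is left missing (`NA`), typed in as `0`, or typed in as `-999`.

```r
# Hypothetical counts of adverse effects for five participants.
# The third participant was not measured at all.
recorded_as_missing <- c(2, 0, NA, 1, 0)    # missing value left empty (NA)
recorded_as_zero    <- c(2, 0, 0, 1, 0)     # missing value wrongly typed as 0
recorded_as_code    <- c(2, 0, -999, 1, 0)  # missing value typed as -999

# na.rm = TRUE tells R to skip the missing value
mean_missing <- mean(recorded_as_missing, na.rm = TRUE)  # 0.75
mean_zero    <- mean(recorded_as_zero)                   # 0.6
mean_code    <- mean(recorded_as_code)                   # -199.2
```

Only the first version gives the average over the four participants that were actually measured. The other two silently treat the placeholder as a real measurement.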

## Tidy machine readable data

As discussed, tidy data is a way of organising your data in a neat and structured way. Tidy data is machine readable (and can thus be "shared with AIs"). This is important in VHP4Safety, as we want to use the data to build an *in silico* model.

So, for future reference, what is tidy data again?

1. Each variable must have its own column.
1. Each observation must have its own row.
1. Each value must have its own cell.

*Table 1: Example of tidy data*

| country | year | population | birth rate |
|:--------|:-----|:-----------|:-----------|
| mali    | 2001 | 10000000   | 6.88       |
| mali    | 2010 | 15000000   | 6.06       |
| sweden  | 2001 | 9000000    | 1.57       |
| sweden  | 2010 | 10000000   | 1.85       |


Each variable must have its own column.

```{r , echo=FALSE, message=FALSE, out.width = "60%"}
knitr::include_graphics("images/03_4_tidy1.jpg")
```

Each observation must have its own row.

```{r , echo=FALSE, message=FALSE, out.width = "60%"}
knitr::include_graphics("images/03_5_tidy2.jpg")
```

Each value must have its own cell.

```{r , echo=FALSE, message=FALSE, out.width = "60%"}
knitr::include_graphics("images/03_6_tidy3.jpg")
```

So the following table would be untidy, as there are multiple observations per row:

*Table 2: Example of untidy data*

| participant  	| sample1 	| sample2 	| sample3 	| sample4 	|
|:----------	|:------	|:--------	|:-------------	|:-----	|
| John   	| 7.5  	| 6      	| 8.2         	| 8   	|
| Mary 	| 8    	| 7.9    	| 5           	| 9   	|


As you can see, this table is not very machine-readable, and it is also not immediately clear from the headers what we are looking at. You would need metadata to tell you what this is about.

This would be the tidy version:

*Table 3: Example of untidy data made tidy*

| participant  	| sample      	| stairs_walked_monday 	|
|:----------	|:-------------	|:-------	|
| John   	| 1        	| 7.5   	|
| John   	| 2      	| 6     	|
| John   	| 3 	| 8.2   	|
| John   	| 4         	| 8     	|
| Mary 	| 1        	| 8     	|
| Mary 	| 2      	| 7.9   	|
| Mary 	| 3 	| 5     	|
| Mary 	| 4         	| 9     	|
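
For those curious how such a conversion looks in code, the following is a minimal sketch in base R that reshapes the untidy table into the tidy layout. It is one possible approach, not the method used in the workshop. The object names `untidy` and `tidy` are made up for this example, and the values are typed in by hand rather than read from a file.

```r
# The untidy table: one row per participant, one column per sample
untidy <- data.frame(
  participant = c("John", "Mary"),
  sample1 = c(7.5, 8),
  sample2 = c(6, 7.9),
  sample3 = c(8.2, 5),
  sample4 = c(8, 9)
)

# Reshape from wide to long: one row per participant and sample
tidy <- reshape(
  untidy,
  direction = "long",
  varying = c("sample1", "sample2", "sample3", "sample4"),
  v.names = "stairs_walked_monday",  # name of the new value column
  timevar = "sample",                # name of the new sample-number column
  times = 1:4,
  idvar = "participant"
)

# Sort by participant and sample, and drop the generated row names
tidy <- tidy[order(tidy$participant, tidy$sample), ]
rownames(tidy) <- NULL
print(tidy)
```

The result has eight rows and the three columns `participant`, `sample` and `stairs_walked_monday`, so each measured value sits in its own row.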


## Key Points

 - keep your Excel files machine readable and unambiguous
 - don't use formatting to convey information in Excel
 - use tidy data if possible
 - version control takes some effort, but saves a lot of confusion (and thus time) in the long run
 - always let name_of_your_document.docx be the current version.

## **EXTRA Exercise B3:** tidy data {.blauw}

(15 minutes)

Take a look at an example (fake) dataset on blood levels for several compounds in hepatitis patients, blood donors and suspected hepatitis patients in Utrecht, the Netherlands [here](data/hepatitis_example_short.xlsx). Data was collected between January 2022 and May 2022 by K. the Frog, PhD. It doesn't look that bad, especially compared to the example from the previous exercise. But it is not machine readable.

Make this dataset machine readable and store it in a reproducible way. Keep the following in mind:

 1. Tidy data: Each variable its own column. No duplicate columns. Each observation its own row. Each value must have its own cell.
 1. One datatype per column; be consistent.
 1. Put notes in a separate column.
 1. Don't use formatting to convey information.
 1. Never throw away or change a raw datafile. Raw data = however YOU received the data, be it from a machine, handmade in Excel yourself, or received through email.
 1. Data always needs metadata.
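
As an illustration of the "one datatype per column" point, the following minimal R sketch flags cells that would prevent a column from being read as numbers. The values in `weight_raw` are hypothetical and are not taken from the hepatitis file.

```r
# Hypothetical measurements typed into a spreadsheet and read in as text
weight_raw <- c("7.5", "6", "8.2 mg", "approx. 8")

# as.numeric() returns NA for every entry that is not a plain number
weight_num <- suppressWarnings(as.numeric(weight_raw))

# List the entries that need cleaning before the column can be analysed
problem_entries <- weight_raw[is.na(weight_num)]
print(problem_entries)  # "8.2 mg" "approx. 8"
```

Units and remarks such as "approx." belong in the column header, the metadata or a separate notes column, so that the value column contains numbers only.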


## EXTRA Version control on documents and data
You may know from experience how an unnoticed change in a document, the use of the wrong version or the loss of edits can lead to lost time, frustration and above all confusion! This is why keeping a record of all changes to a document or to the data is important. There usually will be [quite a lot](https://phdcomics.com/comics/archive.php?comicid=1531). Tracking changes to the data is especially important for another reason: the reproducibility and validity of conclusions (or any other interpretation or communication) connected to the data.

Version control is the process of tracking and managing different versions (or drafts) of a file / document / dataset. By doing so, you always know what is the most recent version of a file, when something was changed and what was changed.

Think of it as keeping track of any changes, on top of keeping your file.

>So instead of:
>
>file --> file-version_2 --> file_version_final -->   file_version_final_2 = file_current
>
>You keep track of:
>
>file + changes_06_06_2022 + changes_07_06_2022  + changes_22_06_2022 = file_current

This has the added advantage of making it easier to merge changes from two separate collaborators. It is an "added advantage" with respect to reproducibility, but if the file is a manuscript written with more than two co-authors, we all know this could be a huge help.

If you use code to analyse your data and/or write your manuscripts (e.g. LaTeX, R Markdown), stop here and change to the "advanced" exercises. You should really take the easy route (why not!) and use GitHub to do the version control for you.

If you use Word and Excel / ImageJ / SPSS / etc., keep reading.

## EXTRA Project Version Control: manually

One option that does not store changes separately is the following (proposed) method of doing version control manually. It is, however, very easy to implement starting today:

 1. Whenever you change something significant in your project, just copy the whole folder, and save the old version in a folder with the date of today as the folder name. This may seem wasteful, but remember that storage nowadays is a whole lot cheaper than your time or confusion.
 1. Document all changes in a changelog.txt file in your main project folder. Write what you changed, who changed it, when and why.
 1. Save everything on OneDrive, Google Docs, or somewhere else that keeps backups.
 1. Keep the /current folder on your laptop if you want to work locally, but remember to back up and sync your project daily with the remote folder.

Your folder structure will look something like this:

```
laptop/projects
├── project A
│   ├── changelog.txt
│   ├── current
│   │   ├── analyses
│   │   ├── data_raw
│   │   ├── documentation
│   │   ├── manuscript
│   │   ├── methods
│   ├── version_01_02_2022
│   ├── version_14_12_2021
```

And your changelog may look like this:

```
├── changelog.txt
01_02_2022
Alyanne: Removed chapter 2 from /manuscript/manuscript.docx and changed statistics to Bayesian (updated folder: /analyses) as per reviewer 1's request.

03_01_2022
optional: keep record of smaller changes as well

14_12_2021
Marc: Added dataset 3 (/data_raw/dataset3_th.csv) from Theme Hospital and wrote script to tidy data (/analyses/clean_th_data.R).
Marc: Added feedback from Francois (/docs/mail_Francois.docx) to /manuscript/manuscript.docx.
```
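
The manual steps above can also be scripted. The sketch below, in base R, is one possible way to do so and is not part of the proposed method itself. The folder name `project_A`, the file `dataset1.csv` and the changelog text are hypothetical, and everything is created in a temporary location so that nothing in your own project folders is touched. Note that it appends new entries at the end of `changelog.txt`, whereas the example changelog above lists the newest entry first.

```r
# Hypothetical project folder, created in a temporary location for this demo
project <- tempfile("project_A_")
current <- file.path(project, "current")
dir.create(file.path(current, "data_raw"), recursive = TRUE)
writeLines("id,value", file.path(current, "data_raw", "dataset1.csv"))

# Step 1: copy everything in 'current' into a folder named after today's date
snapshot <- file.path(project, format(Sys.Date(), "version_%d_%m_%Y"))
dir.create(snapshot)
copied <- file.copy(list.files(current, full.names = TRUE), snapshot,
                    recursive = TRUE)

# Step 2: append what changed, who changed it, when and why to changelog.txt
entry <- c(
  format(Sys.Date(), "%d_%m_%Y"),
  "Alyanne: Added dataset 1 (/data_raw/dataset1.csv) for the pilot analysis.",
  ""  # blank line to separate entries
)
changelog <- file(file.path(project, "changelog.txt"), open = "a")
writeLines(entry, changelog)
close(changelog)
```

After running this, the project folder contains `current`, a dated `version_...` folder with a copy of the data, and a `changelog.txt` with one entry.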

## EXTRA Document Version control using OneDrive / Google Docs

If you are only working on a document, e.g. writing a proposal together with 3 colleagues with no other files involved, you may instead choose to version control the document only. You can do this by using OneDrive or Google Docs to store the document.

Choose one of the two links below, based on whether your organisation uses OneDrive or Google Docs:

[Office](https://support.microsoft.com/en-us/office/restore-a-previous-version-of-a-file-stored-in-onedrive-159cad6d-d76e-4981-88ef-de6e96c93893) and also [here](https://support.microsoft.com/en-us/office/view-the-version-history-of-an-item-or-file-in-a-list-or-library-53262060-5092-424d-a50b-c798b0ec32b1)

[Google docs](https://www.techrepublic.com/article/version-history-essentials-for-google-docs-sheets-and-slides/)

## EXTRA Document Version control manually
You could also do this manually, especially if you are not collaborating a lot on this particular file.

Do not fall into the trap of copying your document, then renaming it document_version2.docx, and repeating this step every time. Think the other way around:

 - document.docx should always be the current version
 - whenever you make bigger changes, keep a copy of the version before those changes with added numbers or dates in the file name as a version controlled log.

Keep a changelog in table format at the beginning of your document or in a file called `changelog.txt`. e.g.:

| Version          | Date       | Author  | Change                        |
|------------------|------------|---------|-------------------------------|
| 0.1              | 01-02-2022 | Alyanne | First draft                   |
| 0.2              | 10-02-2022 | Marc    | Added exercise 3.2            |
| 0.3              | 15-02-2022 | Marie   | Reviewed and corrected French |
| 1.0              | 01-03-2022 | Alyanne | Added chapter 2               |


## **EXTRA Exercise B4:** version control {.blauw}

(5 minutes)

Take a look at your research project folders.

 - Are there multiple versions of the same document? Is it immediately clear which one is the current / used / published one?
 - Can you quickly locate the original version?
 - Which labeling technique did you use? (e.g. date included in file name, version number included in file name, copying the whole project folder, adding "final" several times.)
 - How would you prefer to version control your next project?
 - What would you need to start doing so?

## **EXTRA Exercise B5:** workflow {.blauw}

(15 minutes)

Create a diagram of your research process. Do this however you prefer, with software,
with pen and paper, etc. Locate the materials that you are creating, collecting, and
editing as part of your research process and answer the following questions:

 1. Will your research materials be edited or transformed between the time that you
collect them and the time you publish them in your thesis or dissertation? If so, is
it important to save the intermediate versions? What conventions might you use to
track the versions?
 1. Do you expect to combine multiple files into a single file (e.g., editing multiple clips
into a single video)? If so, how do you plan to manage the relationships between the
original objects and the combined object (e.g., naming conventions, folder context,
readme files, etc.)?

Discuss the issues you see within the breakout room, and schedule one morning in your calendar in the next month to build a plan to tackle them. Then, stick to the plan!

(Exercise borrowed from the [Educopia / Institute of Museum and Library Services, Electronic Theses and Dissertations (ETD)plus Guidance Briefs](https://educopia.org/etd-plus-guidance-briefs/), shared under Creative Commons license [CC-BY](https://creativecommons.org/licenses/by/4.0/).)