File Management
Here we describe the file naming and organisation standards used by Kim Plummer's research group. The aim of this document is to address the following concerns:
- If you don't know what a file is, you can't use it
- If no one else knows what your data are, they can't use them
- If your data are not organised well, they will probably get lost
The solution is to treat your data exactly as if they were a biological sample. These guidelines demonstrate how you and other members of the Plummer research group can:
- Label data well
- Know where data are kept
- Keep data safe
The benefits of these standards rely on everyone adhering to them, so please do.
This document is based on the recommendations provided by Andrew Robinson in April 2015.
The primary file storage service is the La Trobe research data storage (rdfs) system. For instructions on connecting to the rdfs drive see the Connecting to the Plummer drive protocol.
For the purpose of this document we take the term 'project' to mean something that produces one or a few specific datasets that can be easily separated from the procedures of other projects. You can think of it as a bit like a biological experiment. An example of a project in this context might be annotating a genome, sequencing a genome, a differential transcriptomics analysis, or cloning a gene. The result of one project can become the foundation of another project (e.g. a more detailed analysis of a differentially transcribed gene). Essentially, data are stored as a series of results from numerous projects.
All project-specific data live in the Projects folder.
Each project goes in its own folder, named with the project start date in year-month-day order (yyyy-mm-dd), followed by a dash and then a short title.
For example 2015-04-27-AvrRvi5_cloning would be a good project folder name.
Note that the start date is arbitrary; it is simply meant to keep the folders ordered chronologically and to avoid naming conflicts (there are only so many short titles).
If you don't remember the start date, simply use today's date; it isn't too important.
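As a minimal sketch of the naming rule above, the folder name can be built from a date and a short title. The function name `project_folder_name` is made up for this example and is not part of any group tooling.

```python
import re
from datetime import date


def project_folder_name(short_title, start=None):
    """Return '<yyyy-mm-dd>-<short_title>' for a new project folder."""
    # Fall back to today's date when no start date is given.
    start = start or date.today()
    # Replace runs of whitespace with underscores to keep the name shell-friendly.
    title = re.sub(r"\s+", "_", short_title.strip())
    return f"{start.isoformat()}-{title}"


print(project_folder_name("AvrRvi5 cloning", date(2015, 4, 27)))
# prints 2015-04-27-AvrRvi5_cloning
```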
Every project should contain a README.txt file to describe the project in general.
We use plain text description files so that everyone can read the files regardless of their operating system, available software or method of access (you can't really read a Word document via a terminal connection to a server). If you are looking for a simple way to introduce some basic formatting to the document to make things clearer, I suggest looking at the Markdown syntax.
Things to include in the README might be:
- Any long title that you want to use for the project
- An overview of the project
- What are you trying to do?
- What kind of data will you end up with?
- A list of contributors and what they did
- The organism(s) used during the project
- Where are they from?
- Isolate numbers
- The originating data used
- Where is it from
- Permissions or restrictions associated with it
- Materials or communications associated with this project
- Papers
- Presentations
- Theses
- Lab books
- Submissions to public databases
- Git repositories or similar
- Permissions associated with the results. That is, are we free to share this information with collaborators or the public?
For each data file there should be a plain text file describing:
- What the data are.
- What procedure you used to create this file or where it came from (e.g. this database on this date, database version 1). If describing the procedure, this should include both a plain language description of what happened and the specific commands (code or clicks) used.
- Software used to produce the file (including version numbers or commit numbers).
- Models of any instruments used to produce the data (e.g. which sequencing platform, which library kit etc).
- Any other files that were required to produce this file.
- The date that the file was created (as yyyy-mm-dd).
Each descriptor file should have the same name as the file that it describes, with .txt appended.
For example:
my_sequence.fasta # file
my_sequence.fasta.txt # descriptor
I can imagine a situation where describing each file individually would become painful. In this case you might have a descriptor file that describes a group of files, with a name that makes it clear that it refers to all of those files. The main thing is that it should be clear to everyone what files a descriptor refers to.
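One way to check that nothing has been left undescribed is to look for data files without a matching descriptor. The sketch below assumes the one-descriptor-per-file convention, so it would wrongly flag files that are covered by a group descriptor, and it skips every .txt file (READMEs and descriptors alike). The function name `missing_descriptors` is hypothetical.

```python
from pathlib import Path


def missing_descriptors(project_dir):
    """Return files under project_dir that have no '<name>.txt' descriptor."""
    missing = []
    for path in sorted(Path(project_dir).rglob("*")):
        if not path.is_file():
            continue
        # Skip plain text files: README.txt and the descriptors themselves.
        if path.suffix == ".txt":
            continue
        # my_sequence.fasta is described by my_sequence.fasta.txt
        descriptor = path.with_name(path.name + ".txt")
        if not descriptor.exists():
            missing.append(path)
    return missing
```

Running `missing_descriptors("2015-04-27-example_project")` on a project folder would return the list of files that still need a descriptor.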
The data folder is where the results of previous projects that the current project depends on are kept.
All data files in this directory should be considered to be (if not explicitly made) read-only.
All data files should have a descriptor file describing at least where the data came from (which project, which database, which collaborator, which version), sharing permission and what the data are.
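If you want to make the data folder explicitly read-only, something like the following sketch would clear the write permission on each file. This is an assumption about how you might do it, not a group requirement; the name `make_read_only` is made up, and a network drive such as rdfs may handle permissions differently from a local disk.

```python
import stat
from pathlib import Path


def make_read_only(data_dir):
    """Clear the write permission bits on every file under data_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            # Keep the existing mode but drop write access for everyone.
            path.chmod(path.stat().st_mode & ~write_bits)
```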
The structure of the projects directory depends on the complexity of the project.
Use step based subfolders to organise your project and emphasise the procedure that you used.
Step folders should be named with sequential numbers followed by a dash and a short title.
Add leading zeroes to the numbers to keep the folders ordered properly.
For example, 02 will be ordered before 11, whereas an unpadded 2 would be ordered after it.
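The effect of leading zeroes is easy to see with a plain alphabetical sort, which is how most file browsers and terminals order folders. This is only an illustration; `step_folder_names` is a hypothetical helper.

```python
def step_folder_names(titles, width=2):
    """Number the step titles from 1, zero-padded to the given width."""
    return [f"{i:0{width}d}-{title}" for i, title in enumerate(titles, start=1)]


print(step_folder_names(["primer_design", "pcr_electrophoresis", "sequencing"]))
# prints ['01-primer_design', '02-pcr_electrophoresis', '03-sequencing']

# Without padding, step 11 is sorted before step 2.
print(sorted(["2-b", "11-k", "1-a"]))
# prints ['1-a', '11-k', '2-b']

# With padding, the steps stay in numerical order.
print(sorted(["02-b", "11-k", "01-a"]))
# prints ['01-a', '02-b', '11-k']
```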
For a simple project, you might only need one subfolder to contain your results, and you might decide to omit the step number.
2015-04-27-example_project
├── README.txt
└── growth_rate_invitro
    ├── growth_rate_in_vitro.csv
    └── growth_rate_in_vitro.csv.txt
Whereas a more complex project might have multiple subfolders.
2015-04-27-example_project
├── data
│   ├── my_genome.fasta
│   └── my_genome.fasta.txt
├── README.txt
├── 01-primer_design
│   ├── atg123_primers.fasta
│   └── atg123_primers.fasta.txt
├── 02-pcr_electrophoresis
│   ├── atg123_gel_image.jpeg
│   └── atg123_gel_image.jpeg.txt
└── 03-sequencing
    ├── atg123_sequenced.abi
    ├── atg123_sequenced.abi.txt
    ├── atg123_sequenced.fasta
    └── atg123_sequenced.fasta.txt
This sample project involves a PCR amplification followed by Sanger sequencing.
In step 01 we have designed some primers using my_genome.fasta to amplify atg123.
In step 02 we have amplified atg123 and visualised it in a gel.
Gel images count as data, so include them and add a descriptor file for each.
Finally, in step 03 we have sent off the PCR product for Sanger sequencing and have included the returned results.
If your sequencing provider just gave you the chromatogram, you might have needed to add another step to get the actual sequence.
Here we see that a project can be organised into a logical structure where we catalogue intermediate files and build on data from previous steps or projects to produce new data.
Does it warrant a new step? If you need the data from a step to generate something else, the new output belongs in a separate step.
See the notes from Andrew Robinson for more on file naming guidelines. The main points are:
- Put important words first
- Group related files by shared keywords
- Use date stamps (e.g. 2015-05-01) instead of version or revision numbers
- Avoid generic terms. Words like document or stuff aren't useful. Be descriptive so that you can find things easily later.
Be friendly to your UNIX friends! Try to stick to the following characters for your filenames and directories:
- a-z
- A-Z
- 0-9
- underscore (_)
- dash (-)
- period (.)
Other characters may need to be escaped on the command line, or they might even run commands and cause big problems for us. Notice that the space character is not included in this list, for the same reasons.
Consider using underscores_to_indicate_spaces or CamelCaseToSeparateWords instead of whitespace.
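The character list above can be checked with a short regular expression. This is a rough sketch; the helper names `is_safe_name` and `suggest_safe_name` are invented for the example, and the suggested name simply swaps whitespace for underscores and drops anything else that is not on the list.

```python
import re

# Letters, digits, underscore, dash and period only.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_name(name):
    """Return True if name uses only the recommended characters."""
    return SAFE_NAME.fullmatch(name) is not None


def suggest_safe_name(name):
    """Replace whitespace with underscores and drop other unsafe characters."""
    name = re.sub(r"\s+", "_", name.strip())
    return re.sub(r"[^A-Za-z0-9_.-]", "", name)


print(is_safe_name("my_sequence.fasta"))            # prints True
print(is_safe_name("gel image (final).jpeg"))       # prints False
print(suggest_safe_name("gel image (final).jpeg"))  # prints gel_image_final.jpeg
```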
Data should be stored as plain text files wherever possible.
Rather than saving your tables as Excel spreadsheets, save them as comma-separated values (CSV) files (or similar). This allows users of R, Excel, and SPSS to use the tables. It also means that in 10 years, when the software formats change, we can still access the data.
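For those who work with scripts, writing a table as CSV needs nothing beyond the Python standard library. The sketch below is one way to do it; `write_table` is a hypothetical helper, and the column names in the usage note are placeholders.

```python
import csv


def write_table(path, rows, fieldnames):
    """Write a list of dictionaries to a UTF-8 encoded CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_table("growth_rate_in_vitro.csv", rows, ["isolate", "day", "diameter_mm"])` would write one line per dictionary in `rows`, with the three column names as the first line.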
Export any software specific or binary files to a plain text version. In CLC workbench, for example, export sequence data as fasta and GFF formatted files. Again, this means that everyone can use these data and we can still read them years down the line.
There are possible exceptions to these rules. For example, where a binary format has become the standard for working with data and is widely supported by software, you might as well just leave it as the binary format. BAM would be a good example of a file format that you wouldn't necessarily need to convert to the plain text equivalent format (SAM; though it might be worth considering for long-term storage).
Microsoft Word (.docx or .doc) is not a good data storage format.
Use UTF-8 text encoding for all of your plain text documents when you have a choice. Unicode is generally the preferred character set, and UTF-8 is the default Unicode encoding on the web and on Unix systems. Windows tends to use ANSI text encoding by default; however, this older format is not supported by some programs (e.g. some Unix-based bioinformatics programs).
When saving a text file, especially in Windows, look for a section that says 'encoding' and look for a UTF-8 option to save in the preferred format. Likewise, if you try to view a CSV file in Excel and it looks like gibberish, make sure that you set the encoding to UTF-8 when reading the file.
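If you already have a text file in an older Windows encoding, it can be re-saved as UTF-8 with a few lines of Python. This sketch assumes the source file uses code page 1252, which is what 'ANSI' usually means on Western European and English Windows installations; check that assumption before relying on it. `convert_to_utf8` is a made-up name.

```python
from pathlib import Path


def convert_to_utf8(src, dest, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dest as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dest).write_text(text, encoding="utf-8")
```

Writing to a new file rather than overwriting the original means that you can compare the two before deleting anything.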
Suppose you have a large table and you want to remove or add some rows. How would you keep track of this? Try to make a specific note of what you have changed. For programmers, you might add or remove data using a script so that you and others can re-apply the changes reproducibly. For those who like to click, you might create a mini-table showing the rows that you have added or deleted, or you might describe the specific row numbers that you changed.
This might seem trivial, but consider a project where you are filtering BLAST results in a table or manually curating gene models in a GFF file. Details like this will help you and others understand what has happened and why. Some US government bodies already require information like this for evidence based policy submissions or drug approval applications. We can probably expect the same here in the future. See the general topic of reproducible research for more on this. The bottom line is that both your lab work and your data analysis should be reproducible.
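As a rough sketch of the scripted approach, the function below splits a CSV table into the rows that were kept and the rows that were removed, so the removed rows are recorded rather than silently lost. The function name `filter_rows` and the `evalue` column used in the usage note are assumptions for illustration only.

```python
import csv


def filter_rows(src, kept_path, removed_path, keep):
    """Write rows where keep(row) is true to kept_path and the rest to removed_path."""
    with open(src, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    kept = [row for row in rows if keep(row)]
    removed = [row for row in rows if not keep(row)]

    # Both outputs keep the original header so they can be compared or re-merged.
    for path, subset in ((kept_path, kept), (removed_path, removed)):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)

    return len(kept), len(removed)
```

For a table of BLAST hits with an `evalue` column, you might call `filter_rows("hits.csv", "hits_filtered.csv", "hits_removed.csv", lambda row: float(row["evalue"]) <= 1e-5)` and then record that command in the descriptor file for hits_filtered.csv.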
If anything in this document is unclear to you or you feel is incorrect, please raise an issue on the repository landing page or edit the page yourself.
We want these guidelines to be clear to everyone and allow all users to work together relatively efficiently, regardless of operating system or computing ability.