4. Tutorial

Max Brown edited this page Jul 1, 2022 · 7 revisions

Tutorial for goat-cli

Welcome to a whistle-stop tour of goat-cli. We are going to attempt to uncover as much functionality as we can in a small amount of space and time. I am assuming the reader knows a bit about the shell/terminal, but if not I have annotated the code as we go.

Note that this tutorial focuses on the taxon index of GoaT, i.e. goat-cli taxon <subcommand>. Most of the syntax is the same for the assembly index of GoaT, i.e. goat-cli assembly <subcommand>.

Installation

First things first. Download goat-cli:

# this code block is a small Bash script

# make a new directory and go into it
# just for cleanliness
mkdir goat-cli-tutorial
cd goat-cli-tutorial

# for MAC
curl -L "https://github.com/genomehubs/goat-cli/releases/download/0.2.0/goat_mac_0.2.0" > goat && chmod +x goat
# and LINUX (ubuntu)
curl -L "https://github.com/genomehubs/goat-cli/releases/download/0.2.0/goat_ubuntu_0.2.0" > goat && chmod +x goat
# the executable is `./goat`
./goat

# or if you have conda
conda install -c bioconda goat
# executable is `goat-cli`
goat-cli

The curl commands above put the goat executable in the directory you are in. Run it with:

./goat
#^^^^^
# this will show a bunch of help stuff by default
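If you script the download, picking the right binary by hand is easy to get wrong. The following is a minimal sketch, assuming that `uname -s` reports `Darwin` on a Mac and `Linux` on Ubuntu, and that the Ubuntu build is the one you want on Linux. The function name `goat_release_url` is made up for this example; the two URLs are the ones used above.

```shell
# Print the 0.2.0 release URL for a kernel name as reported by `uname -s`.
# `goat_release_url` is a hypothetical helper, not part of goat-cli.
goat_release_url() {
    base="https://github.com/genomehubs/goat-cli/releases/download/0.2.0"
    case "$1" in
        Darwin) echo "$base/goat_mac_0.2.0" ;;
        Linux)  echo "$base/goat_ubuntu_0.2.0" ;;
        *)      echo "no 0.2.0 download listed in this tutorial for: $1" >&2
                return 1 ;;
    esac
}

# Usage (needs network access, so it is left commented out here):
# curl -L "$(goat_release_url "$(uname -s)")" > goat && chmod +x goat
```

Any other platform makes the function fail with a message on stderr, so a script can stop early instead of downloading the wrong file.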

Most of this tutorial will focus on the taxon search functionality in GoaT. We can see the help for the search sub-command:

# add the search sub-command
# add the `-h` flag
./goat taxon search -h
#      ^^^^^^^^^
# notice there are some options which take values, and
# others which do not.

Exploring taxa

We will concentrate our attention on flowering plants (Magnoliopsida), because I like plants. But you can follow with any taxon you want!

Let's do the simplest search we can.

# `-t` for taxon!
./goat taxon search -t Magnoliopsida
#                   ^^
# equivalent to
./goat taxon search -t 3398
#                   ^^
# or
./goat taxon search -t 'flowering plants'
#                   ^^

This returns a TSV with a single row of ancestral estimates for some variables, including assembly level and span, chromosome numbers, etc. GoaT currently works by averaging values up and down the tree of life for nodes which cannot be directly measured, so we can get an estimate of any variable at any node (given sufficient data at the tips of the tree of life, i.e. the species). But perhaps we are not so interested in the ancestor of all flowering plants; we are interested in living species.
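A single wide row is hard to read in a terminal. Here is a small sketch that prints each column name next to its value. It assumes the TSV arrives on standard output with the header on the first line and the data row on the second; `tsv_first_row` is a name invented for this example.

```shell
# Print "column name <TAB> value" for the first data row of a TSV.
# Assumes line 1 is the header and line 2 is the row of interest.
# Reads from a file argument, or from stdin if none is given.
tsv_first_row() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        NR == 2 { for (i = 1; i <= NF; i++) printf "%s\t%s\n", name[i], $i }
    ' "$@"
}

# Usage:
# ./goat taxon search -t Magnoliopsida | tsv_first_row
```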

Getting descendents

One of the most powerful flags in goat-cli is the -d, or --descendents, flag. If we set this, we get information about all the nodes that are contained within flowering plants. Let's do it.

./goat taxon search -dt Magnoliopsida
#                    ^^

That returned ~10,000 results, yet over 300,000 species of flowering plants have been described. A taxon is returned if at least one of the variables in the search has a direct estimate, so from this output alone it's impossible to tell (on the CLI) which values are direct estimates and which are not. We can only say that there is at least some direct measurement data for ~10,000 taxa of flowering plant for the default variables. To view only direct estimates for each taxon, use the -r flag.

./goat taxon search -dt Magnoliopsida -r
#                                     ^^
# this is also fine to do
./goat taxon search -rdt Magnoliopsida
#                    ^^
# or this
./goat taxon search -t Magnoliopsida -dr
#                                      ^

The -d flag returns taxa of every rank, because every node on the tree within flowering plants is reported! Let's get just species:

# I removed the `-r` flag here
./goat taxon search -dt Magnoliopsida --tax-rank species
#                                     ^^^^^^^^^^^^^^^^^^

What's that warning message at the top?

[-]     For search query Magnoliopsida, size specified (50) was less than the number of results returned, (9614).

By default goat-cli returns a maximum of 50 results. Luckily we have a --size parameter so we can crank up the number of values we return:

./goat taxon search -dt Magnoliopsida --tax-rank species --size 10000
#                                                        ^^^^^^^^^^^^

And now the warning should disappear, and you'll have to wait longer for those results to appear (poor GoaT is doing more work).
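To check that nothing was cut off, you can save the output and count the rows. This is a sketch under two assumptions: the TSV is written to standard output, and its first line is the column header. `tsv_row_count` and the file name `species.tsv` are made up for illustration.

```shell
# Count data rows in a TSV file, excluding the header line.
# Assumes the first line is the column header; blank lines are ignored.
tsv_row_count() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Usage:
# ./goat taxon search -dt Magnoliopsida --tax-rank species --size 10000 > species.tsv
# tsv_row_count species.tsv
```

If the count equals the value you passed to --size, the search may have been truncated and is worth re-running with a larger size.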

We can get GoaT to do even more work if we also want indirect values, which is achieved with the -i flag. Be aware that adding this flag returns every species in the NCBI database for the clade you specify.

We can also add a progress bar to see our download in progress.

# add a progress bar!
./goat taxon search -dit Magnoliopsida --tax-rank species --size 50000 --progress-bar
#                                                                      ^^^^^^^^^^^^^^
# poor goat
# hit Ctrl + C to kill that request!

Notice I've also ratcheted up the --size value, as this search matches over 170,000 results. Currently it's not possible to return everything for very large searches: this would put heavy load on our servers, and it's rarely useful in practice. So you're limited to 50,000 results per search. Usually we'll want to filter results anyway, either by choosing which variable columns to return or by applying some operation to those variables.

Looking up multiple species

So far, we have focused on the -d flag with a higher-rank taxon, as that is a fairly natural way to search. However, we can look at individual species too! Use a comma-separated list to look up several species at once.

# we can look up a single species
# Corn/Maize
./goat taxon search -t 'zea mays'
# Corn & barley!
./goat taxon search -t 'zea mays, hordeum vulgare'
#                               ^^
# corn & barley & rye!
./goat taxon search -t 'zea mays, hordeum vulgare, secale cereale'
#                               ^^               ^^
# we could go on with cereal crops...

The other way you can look up multiple species is using a file which consists of one taxon per line. Something like this:

zea mays
hordeum vulgare
secale cereale

We can make this quickly on the command line as a proof of concept.

# or you could put all of your species/taxa into a file
# and give that file to `goat-cli`

printf "zea mays\nhordeum vulgare\nsecale cereale" > my_species.txt
#               ^^               ^^
# view the file to check those newlines
cat my_species.txt
# give it to goat!
./goat taxon search -f my_species.txt
#                   ^^
# -f, not -t as we have been using

Combining variable flags

As you saw in the help for ./goat taxon search, there are lots of flags you can combine to get a tailored dataset back.

I'm pretty interested in plant organellar genomes at the moment, so I'm going to search for these. But try any combination of the flags that interest you.

# -m == mitochondrial genome length/gc content
# -p == plastid genome length/gc content
# so this should return four variable columns
./goat taxon search -dmpt Magnoliopsida
#                     ^^

This returns four variable columns: length and GC content for each of the two organelles. Again, if we want direct only values for these four variables, add that -r flag.

./goat taxon search -rdmpt Magnoliopsida
#                    ^

Variables and expressions

Continuing the theme, for now I am interested only in plant mitochondrial genome size. What's the range of plant mitochondrial genome sizes out there? Here I'm using the -v flag to request a single variable, because the -m flag (see above!) would also return mitochondrial GC%, which I don't want.

# add the `-v` flag
# I also put the size up so we get all those results
./goat taxon search -dt Magnoliopsida -v "mitochondrion_assembly_span" --size 210
#                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Hang on, how did I know what to type there in that variable string? We can print a table of all the variables to remind ourselves. Or to see what they all are for the first time.

# this prints quite a large (and growing) table
./goat taxon search --print-expression
#                   ^^^^^^^^^^^^^^^^^^
# now this command hopefully makes a bit more sense:
./goat taxon search -dt Magnoliopsida -v "mitochondrion_assembly_span" --size 210
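To answer the "what's the range?" question without scanning the table by eye, the saved output can be summarised with awk. This is a sketch only: it assumes the header is on the first line and that the column is named exactly like the variable. `tsv_column_range` and `mito.tsv` are names invented here.

```shell
# Print the minimum and maximum of a named numeric column in a TSV file.
# Assumes the header is on line 1; rows whose value in that column is
# empty or not a plain number are skipped.
tsv_column_range() {
    awk -F '\t' -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c ~ /^[0-9]+(\.[0-9]+)?$/ {
            v = $c + 0
            if (!seen || v < min) { min = v; min_s = $c }
            if (!seen || v > max) { max = v; max_s = $c }
            seen = 1
        }
        END { if (seen) print min_s, max_s }
    ' "$1"
}

# Usage:
# ./goat taxon search -dt Magnoliopsida -v "mitochondrion_assembly_span" --size 210 > mito.tsv
# tsv_column_range mito.tsv mitochondrion_assembly_span
```

The function prints nothing if the column is missing or holds no numeric values, which is a hint to check the header names in your file.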

I can already see a pretty wide variation. Let's filter this to look at genomes greater than 1Mb! We can do this by passing an expression as an argument.

# 1Mb == 1 megabase == 1 million bases
./goat taxon search -dt Magnoliopsida -v "mitochondrion_assembly_span" -e "mitochondrion_assembly_span > 1000000"
#                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Only 7 species so far, huh. We can add higher taxon ranks for a bit of context, to quickly see whether these species belong to related clades.

# add the --ranks flag
./goat taxon search -dt Magnoliopsida -v "mitochondrion_assembly_span" -e "mitochondrion_assembly_span > 1000000" --ranks family
#                                                                                                                 ^^^^^^^^^^^^^^

We can become very specific in our searches. Changing gears, here is a small list of other questions we can ask:

  • What are all the family representatives for flowering plants in the Darwin Tree of Life?
# there are around 150 families of flowering plant in the UK
./goat taxon search -dt Magnoliopsida -v "family_representative" -e "family_representative == dtol" --size 150
  • Which species have genome sequencing in progress?
# add commas to separate variables
# this will return two variable columns
./goat taxon search -dt eukaryota -v "sequencing_status, long_list" -e "sequencing_status == in_progress" --size 4000
#                                                        ^^^^^^^^^

But actually for this I only want those species which are on the Aquatic Symbiosis Genomics long list!

# we can use logical AND's to filter on multiple variables.
./goat taxon search -dt eukaryota -v "sequencing_status, long_list" -e "sequencing_status == in_progress AND long_list == asg" --size 150
#                                                                                                        ^^^^^^^^^^^^^^^^^^^^
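Longer filters get fiddly to type inside one quoted string. The sketch below builds the expression from separate conditions and joins them with AND, the only operator shown in this tutorial. `join_and` is a hypothetical helper, not a goat-cli feature.

```shell
# Join any number of conditions into one expression string with " AND ".
join_and() {
    expr_out=""
    for cond in "$@"; do
        if [ -z "$expr_out" ]; then
            expr_out="$cond"
        else
            expr_out="$expr_out AND $cond"
        fi
    done
    printf '%s\n' "$expr_out"
}

# Usage:
# ./goat taxon search -dt eukaryota -v "sequencing_status, long_list" \
#     -e "$(join_and 'sequencing_status == in_progress' 'long_list == asg')" --size 150
```

Keeping each condition as its own argument makes it easy to add or comment out one filter at a time.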

Other goat-cli functionality

We've spent most of our time on ./goat taxon search, as this is where goat-cli is most useful. There are however a few other things goat-cli can do. Let's check them out.

./goat taxon lookup gives you back authorities, scientific names, common names, and synonyms for an input taxon. Euphrasia officinalis is wild.

./goat taxon lookup -t "Euprhasia officinalis"
#            ^^^^^^        ^^
#                          the spelling mistake should be found by GoaT

./goat taxon newick returns a Newick representation of a cladogram (a kind of tree). Say we wanted to see how all the genera of legumes are related.

# `-r` for rank
./goat taxon newick -t "Fabaceae" -r genus
#            ^^^^^^               ^^
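A quick sanity check on a Newick string is to count its tips. In any tree, the number of tips equals the number of commas plus one, so a rough count needs no tree library. This sketch assumes the Newick string is written to standard output and that no label contains a quoted comma; `newick_tip_count` is a made-up name.

```shell
# Count tips in a Newick string read from stdin: tips = commas + 1.
# Assumes no quoted labels containing commas.
newick_tip_count() {
    tr -cd ',' | wc -c | awk '{ print $1 + 1 }'
}

# Usage:
# ./goat taxon newick -t "Fabaceae" -r genus | newick_tip_count
```

For example, `printf '(A,(B,C));' | newick_tip_count` prints 3.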

Wrap up

Thank you for joining this tutorial. Be sure to contact me or the GoaT team if anything could be improved or made clearer, or if things are broken!

Stay tuned for another tutorial focusing on the assembly index!