OMA standalone is a piece of software which makes it possible to run the OMA algorithm for inferring homology information on your custom data. This includes generating pairwise orthologs, Hierarchical Orthologous Groups, as well as OMA Groups. It takes as input the coding sequences of genomes or transcriptomes, in FASTA format. The recommended input type is amino acid sequences, but OMA also supports nucleotide sequences. With amino acid sequences, users can combine their own data with publicly available genomes from the OMA database, including pre-computed all-against-all sequence comparisons (the first and computationally most intensive step), using the export function on the OMA website.
In this exercise, we will run OMA standalone to obtain gene families and other orthology information for a few bacterial species. We will download four genomes from the OMA browser, before adding our own custom genome as an example.
For more information on OMA standalone, please see this blog post and the extensive documentation available here.
2. Open the archive and examine the contents. The proteomes are stored in the DB/ folder (DB stands for database). In what format are the proteome files?
tar -zxvf AllAll-...
Now we want to add our own, newly sequenced genome. (For demonstration purposes, this genome is reduced to cut down on computation time.)
Note: it is important to know that when using your own genomes, or adding a genome to the exported all-against-all data, the name of the proteome file will be used as the name of the genome throughout the rest of the analysis.
3. Add the following dummy bacterial genome to your dataset: my_bacterial_genome.fa
4. Which species has the largest proteome? How many predicted proteins are in its file?
Use grep to select the header lines (starting with ">") and count them. Alternatively, use Bio.SeqIO or similar.
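The same count can be scripted in Python without extra dependencies. The following is a minimal sketch, assuming the proteomes sit in the DB/ folder with a .fa extension (adjust the pattern if your export differs). The function names count_fasta_records and proteome_sizes are illustrative and not part of OMA.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def proteome_sizes(db_dir="DB", pattern="*.fa"):
    """Map each proteome file name in db_dir to its number of sequences."""
    return {p.name: count_fasta_records(p) for p in sorted(Path(db_dir).glob(pattern))}


if __name__ == "__main__":
    sizes = proteome_sizes()
    # Print the largest proteome first.
    for name, n in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        print(f"{n}\t{name}")
```

Run from the directory that contains DB/, this prints one line per proteome, so the largest one appears at the top.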
Before running OMA, we next have to modify the parameters.drw file. This file is located in the main OMA directory and should be edited by the user. There are many options that can be tweaked, but two deserve particular attention: SpeciesTree and OutgroupSpecies.
Note: here, we shall not edit the SpeciesTree parameter. Instead, we shall let OMA estimate it. For future reference, this estimation should be used with extreme caution and the resulting EstimatedSpeciesTree.nwk file should be examined.
5. Edit the parameters.drw file and specify the outgroup species to be Magnetococcus marinus.
The OMA algorithm runs in three main steps: 1) Quality and consistency checks of the genomes that will be used to run OMA Standalone; 2) All-against-all alignments of every protein sequence to all other protein sequences; and 3) Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020). The all-against-all step is the most computationally intensive and takes the longest amount of time. This is why it is beneficial to export the precomputed all-against-all for genomes in the OMA browser.
1. In which folder are the pre-computed all-against-all alignments stored?
Cache/AllAll
2. Now run OMA standalone on the five bacterial genomes we prepared previously. Which command should be used to start OMA standalone and run it to completion, without stopping after each of the aforementioned three steps?
`bin/oma`
Now that OMA standalone has finished, the Output folder should have been created; have a look at the contents. Note: Familiarity with command line scripting is preferable to complete this section.
1. OMA has estimated a rough species tree from orthologous groups, using a distance-based method. (If you know the phylogeny of the species, you can instead give a predefined species tree in the parameters file.) Examine the tree: to which species is your new genome closest?
2. Examine the pairwise orthologs. Which pair of genomes has the most orthologous pairs of genes? How many?
wc -l *
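Note that wc -l counts every line, including any header lines. The sketch below counts only data lines per file and ranks the genome pairs. It assumes that comment lines start with "#", that the files end in .txt, and that they live in Output/PairwiseOrthologs; treat all three as assumptions to check against your own output. The function names are illustrative.

```python
from pathlib import Path


def count_pairs(path):
    """Count data lines in one pairwise ortholog file, skipping blanks and '#' comments."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def rank_genome_pairs(directory):
    """Return (file name, number of pairs) tuples, largest first."""
    counts = [(p.name, count_pairs(p)) for p in Path(directory).glob("*.txt")]
    return sorted(counts, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed location of the pairwise ortholog files; adjust to your Output folder.
    for name, n in rank_genome_pairs("Output/PairwiseOrthologs"):
        print(f"{n}\t{name}")
```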
3. How many 1:1 orthologous genes does BACAA have with STRZN?
cat STRZN-BACAA.txt | cut -f 5 | sort | uniq -c
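For readers who prefer Python, the tally can be sketched as follows. Like the cut command above, it assumes the relation type (such as 1:1) sits in the fifth tab-separated column. It additionally assumes that comment lines start with "#". The function name relation_counts is illustrative.

```python
import sys
from collections import Counter


def relation_counts(path, column=5):
    """Tally the values found in one tab-separated column (1-based) of a file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and assumed '#' comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= column:
                counts[fields[column - 1]] += 1
    return counts


if __name__ == "__main__":
    # Usage: python relation_counts.py STRZN-BACAA.txt
    for path in sys.argv[1:]:
        for relation, n in relation_counts(path).most_common():
            print(f"{n}\t{relation}")
```

The number of 1:1 orthologs is then the count reported for the 1:1 key.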
4. Examine the root HOGs. How many root HOGs were inferred? What is the average size of the root HOGs in terms of number of genes per HOG?
Go to the HOGFasta folder and loop through each file to count the number of genes.
ls -1 | grep ".fa" | sed "s/.*/grep -c \">\" &/" | bash | awk '{ total += $1 } END { print total/NR }'
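The same calculation is easier to read as a short Python script. This is a sketch under the assumption, implied by the approach above, that the HOGFasta folder holds one .fa file per root HOG. The path Output/HOGFasta and the function names are illustrative.

```python
from pathlib import Path


def hog_sizes(directory, pattern="*.fa"):
    """Return the number of sequences (header lines) in each HOG FASTA file."""
    sizes = {}
    for path in Path(directory).glob(pattern):
        with open(path) as handle:
            sizes[path.name] = sum(1 for line in handle if line.startswith(">"))
    return sizes


def mean_size(sizes):
    """Average number of genes per HOG; 0.0 when no files were found."""
    return sum(sizes.values()) / len(sizes) if sizes else 0.0


if __name__ == "__main__":
    sizes = hog_sizes("Output/HOGFasta")  # assumed path; adjust as needed
    print(f"{len(sizes)} root HOGs, mean size {mean_size(sizes):.2f}")
```

The number of files gives the number of root HOGs, and the mean of the per-file counts gives the average HOG size.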
5. Examine the OMA Groups. How many species are represented in the group containing the entry STAA002275? What is the putative function?
Look in OrthologousGroups.txt. In the file, each line is an OMA group and each tab-separated column contains the whole gene description (from the FASTA header).
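A lookup of this kind can be sketched in Python as below. Because an OMA Group contains at most one gene per species, the number of members in the matching line equals the number of species represented. The sketch assumes that the first column of each line is a group identifier rather than a member, and that comment lines start with "#"; verify both against your file. The function names are illustrative.

```python
import sys


def find_group(path, entry_id):
    """Return the tab-separated fields of the first group line mentioning entry_id, or None."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip assumed comment lines
            fields = line.rstrip("\n").split("\t")
            if any(entry_id in field for field in fields):
                return fields
    return None


def group_members(fields):
    """Drop the first column (assumed to be the group identifier) and keep the members."""
    return fields[1:]


if __name__ == "__main__":
    # Usage: python find_group.py OrthologousGroups.txt STAA002275
    if len(sys.argv) == 3:
        fields = find_group(sys.argv[1], sys.argv[2])
        if fields is None:
            print("entry not found")
        else:
            members = group_members(fields)
            print(f"{len(members)} members")
            for member in members:
                print(member)
```

Printing the members also shows the FASTA descriptions, from which the putative function can be read.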