OMA

Obtaining Data and Getting Set Up

OMA standalone is a piece of software which makes it possible to run the OMA algorithm for inferring homology information on your custom data. This includes generating pairwise orthologs, Hierarchical Orthologous Groups, as well as OMA Groups. It takes as input the coding sequences of genomes or transcriptomes, in FASTA format. The recommended input type is amino acid sequences, but OMA also supports nucleotide sequences. With amino acid sequences, users can combine their own data with publicly available genomes from the OMA database, including pre-computed all-against-all sequence comparisons (the first and computationally most intensive step), using the export function on the OMA website.

In this exercise, we will run OMA standalone to obtain gene families and other orthology information for a few bacterial species. We will download four genomes from the OMA browser, before adding our own custom genome as an example.

For more information on OMA standalone, please see this blog post and the extensive documentation available here.

1. Export data for the following genomes (Export All-All) from omabrowser.org:
- Bacillus anthracis (strain A0248)
- Staphylococcus aureus (strain TW20 / 0582)
- Streptococcus pneumoniae serotype 14 (strain INV200)
- Magnetococcus marinus (strain ATCC BAA-1437 / JCM 17883 / MC-1)
From the OMA home page, click Download → Export All/All.

Once on the export page, click on the (?) for more information on how to export genomes.

Create a working directory and copy the downloaded .tgz file into it.
2. Open the archive and examine the contents. The proteomes are stored in the DB/ folder (DB stands for database). In what format are the proteome files?

To decompress the archive, use tar -zxvf AllAll-...

The proteome files are stored in FASTA format, a common format for sequence data.
Now we want to add our own, newly sequenced genome. (For demonstration purposes, this genome is reduced to cut down on computation time.)

Note: when using your own genomes, or adding a genome to the exported all-against-all data, the name of the proteome file is used as the name of the genome throughout the rest of the analysis.

3. Add the following dummy bacterial genome to your dataset: my_bacterial_genome.fa

Right click to save the file. (Also, re-read question 2!)
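As a minimal sketch of this step, the helper below copies a proteome file into the DB/ folder after checking that it looks like FASTA. The function name add_genome is hypothetical and not part of OMA, and the sketch assumes the exported archive has been unpacked so that DB/ sits in the working directory.

```shell
# Hypothetical helper (not part of OMA): copy a proteome into the DB/ folder
# after checking that its first line looks like a FASTA header.
add_genome() {
  fasta=$1
  db_dir=${2:-DB}   # assumption: DB/ is in the current working directory
  # A FASTA file starts with a ">" header line.
  if ! head -n 1 "$fasta" | grep -q '^>'; then
    echo "error: $fasta does not start with a FASTA header" >&2
    return 1
  fi
  cp "$fasta" "$db_dir/"
}

# Example: add_genome my_bacterial_genome.fa DB
```

Because the file name is used as the genome name, it is worth choosing a short, recognisable name before copying.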
4. Which species has the largest proteome? How many predicted proteins are in its file?
FASTA headers always start with ">".

You can search for this character in a text editor to count its occurrences.

On the command line, you may use grep to select the header lines (starting with ">") and count them.

If you are more familiar with Python, you could use Bio.SeqIO or similar.
Bacillus anthracis (strain A0248), 5291 proteins
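The grep approach from the hints can be sketched as a small function that counts header lines in each proteome and lists the largest first. The name count_proteins is hypothetical, and the sketch assumes every file passed to it is a FASTA file.

```shell
# Hypothetical helper: print "<number of records><TAB><file>" for each FASTA
# file given, sorted so that the largest proteome comes first.
count_proteins() {
  for f in "$@"; do
    # grep -c counts the lines that start with ">", i.e. the FASTA headers.
    printf '%s\t%s\n' "$(grep -c '^>' "$f")" "$f"
  done | sort -rn
}

# Example: count_proteins DB/*
```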
Before running OMA, we next have to modify the parameters.drw file. This file is located in the main OMA directory and should be edited by the user. There are many options that can be tweaked, but there are two options to specifically pay attention to: SpeciesTree and OutgroupSpecies.

Note: here, we shall not edit the SpeciesTree parameter. Instead, we shall let OMA estimate it. For future reference, this estimation should be used with extreme caution and the resulting EstimatedSpeciesTree.nwk file should be examined.

5. Edit the parameters.drw file and specify the outgroup species to be Magnetococcus marinus.

Hint: the outgroup name given in the parameters file must match the name of the corresponding FASTA file.
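The edit can also be scripted. The sketch below rests on two assumptions that should be checked against the comments in your own parameters.drw: that the file contains a line beginning with "OutgroupSpecies :=", and that a bracketed list of quoted genome names is accepted there. The function name set_outgroup and the placeholder MY_OUTGROUP are hypothetical; replace the placeholder with the genome name as it appears in DB/.

```shell
# Hypothetical helper: rewrite the OutgroupSpecies line of a parameters file.
# Assumption: the line has the form "OutgroupSpecies := ...;" and a list of
# quoted genome names is valid syntax for it.
set_outgroup() {
  params=$1
  genome=$2
  # Keep a backup, then write the edited copy over the original.
  cp "$params" "$params.bak" &&
  sed "s/^OutgroupSpecies[[:space:]]*:=.*/OutgroupSpecies := ['$genome'];/" \
    "$params.bak" > "$params"
}

# Example (MY_OUTGROUP is a placeholder): set_outgroup parameters.drw MY_OUTGROUP
```

The backup copy parameters.drw.bak makes it easy to compare the two versions with diff before running OMA.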

Running OMA Standalone

The OMA algorithm runs in three main steps: 1) Quality and consistency checks of the genomes that will be used to run OMA Standalone; 2) All-against-all alignments of every protein sequence to all other protein sequences; and 3) Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020). The all-against-all step is the most computationally intensive and takes the longest amount of time. This is why it is beneficial to export the precomputed all-against-all for genomes in the OMA browser.

1. In which folder are the pre-computed all-against-all alignments stored?

Hint: check the README in the exported all-against-all files.

Cache/AllAll
2. Now run OMA standalone on the five bacterial genomes we prepared previously. Which command should be used to start OMA standalone and run it to completion, without stopping after each of the aforementioned three steps?

Hints:
- Although it may be more convenient to install OMA standalone on your system, you don’t have to: simply navigate inside the OMA.2.2.0 folder, then run it from there.
- See the OMA standalone documentation at omabrowser.org/standalone/

`bin/oma`
It should take between 5 and 10 minutes for OMA standalone to finish, because the genomes are relatively small and most of the all-against-all is pre-computed. For future reference, when working with larger or more numerous genomes, it is recommended to parallelise the all-against-all and run it on a high-performance computing (HPC) cluster. See this cheatsheet for more information.
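For larger datasets, the three steps can be run one at a time so that the all-against-all can be spread over several processes. The sketch below is only an outline under stated assumptions: the options -c (stop after database conversion), -s (stop after the all-against-all) and -n (number of parallel processes) are recalled from the OMA standalone usage text and should be verified against the documentation for your version. The function name run_oma_staged and the OMA_BIN variable are hypothetical.

```shell
# Hypothetical wrapper: run OMA standalone in three stages.
# OMA_BIN defaults to bin/oma, i.e. the launcher inside the OMA folder.
run_oma_staged() {
  oma=${OMA_BIN:-bin/oma}
  n_jobs=${1:-1}
  # Assumption: -c stops after the genome checks and database conversion.
  "$oma" -c || return 1
  # Assumption: -s stops after the all-against-all; -n sets parallel processes.
  "$oma" -s -n "$n_jobs" || return 1
  # Final run: orthology inference on the completed all-against-all.
  "$oma" -n "$n_jobs"
}

# Example: run_oma_staged 4
```

For the small dataset in this exercise, a single plain invocation is enough; the staged form only pays off when the all-against-all dominates the run time.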

Interpreting Results

Now that OMA standalone has finished, an Output folder should have been created. Have a look at its contents. Note: familiarity with command-line scripting is helpful for completing this section.

1. OMA has estimated a rough species tree from orthologous groups, using a distance-based method. (If you know the phylogeny of the species, you can instead give a predefined species tree in the parameters file.) Examine the tree: to which species is your new genome closest?

View the tree using the online tree viewer PhyloIO.

It is a sister clade to STRZN, STAA0, and BACAA, thus equally close to all of those species.
2. Examine the pairwise orthologs. Which pair of genomes has the most orthologous pairs of genes? How many?

Hint: wc -l *

Bacillus anthracis (strain A0248) and Staphylococcus aureus (strain TW20 / 0582); 2009 pairwise orthologs
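The wc -l hint can be extended into a ranking of genome pairs by file length. This is a sketch under assumptions: the pairwise files are taken to be .txt files named after the two genomes (as with STRZN-BACAA.txt), and the directory holding them is passed in as an argument because its exact name is not given here. Like wc -l, it counts every line, so any header or comment lines in the files are included. The function name rank_pairs is hypothetical.

```shell
# Hypothetical helper: list "<line count><TAB><genome pair>" for every .txt
# file in the given directory, with the longest file first.
rank_pairs() {
  for f in "$1"/*.txt; do
    # awk prints the number of lines read once it reaches the end of the file.
    printf '%s\t%s\n' "$(awk 'END { print NR }' "$f")" "$(basename "$f" .txt)"
  done | sort -rn
}

# Example: rank_pairs path/to/pairwise/orthologs | head -n 3
```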
3. How many 1:1 orthologous genes does BACAA have with STRZN?

Hint: look in the “Orthology type” column of the pairwise ortholog file for these two genomes.

788
cat STRZN-BACAA.txt | cut -f 5 | sort | uniq -c
4. Examine the root HOGs. How many root HOGs were inferred? What is the average size of the root HOGs in terms of number of genes per HOG?

Hint: look at the HOGFasta folder and loop through each file to count the number of genes.

1874 HOGs in total, with a mean of 3.21238 genes per HOG.
ls -1 | grep ".fa" | sed "s/.*/grep -c \">\" &/" | bash | awk '{ total += $1 } END { print total/NR }'
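The same calculation can be written as a loop, which reports both the number of HOG files and the mean number of genes per file. This is a sketch assuming one .fa file per root HOG in the HOGFasta folder, as the hint suggests; the function name hog_stats is hypothetical.

```shell
# Hypothetical helper: count FASTA records in every .fa file of a directory,
# then report the number of files and the mean number of records per file.
hog_stats() {
  for f in "$1"/*.fa; do
    grep -c '^>' "$f"   # one count per HOG file
  done | awk '{ total += $1 }
              END { if (NR) printf "%d HOGs, mean %.5f genes per HOG\n", NR, total / NR }'
}

# Example: hog_stats HOGFasta
```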
5. Examine the OMA Groups. How many species are represented in the group containing the entry STAA002275? What is the putative function?

Hint: OMA Groups are also known as orthologous groups. Look at OrthologousGroups.txt. In this file, each line is an OMA group and each tab-separated column holds the whole gene description (from the FASTA header).

There are two species represented in this OMA Group - STAA0 and STRZN. The predicted function is a Lactose phosphotransferase system repressor.
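To inspect a group like this one, it can help to print its members one per line. The sketch below assumes only what the hint states (one group per line, tab-separated columns) plus one labelled assumption: lines starting with "#" are treated as comments and skipped. Any leading group identifier column would be printed as the first line. The function name show_group is hypothetical.

```shell
# Hypothetical helper: print each tab-separated column of the group line(s)
# containing the given entry ID on its own line.
show_group() {
  entry=$1
  groups_file=$2
  # Assumption: lines starting with "#" are comments, not groups.
  grep -v '^#' "$groups_file" | grep -F -- "$entry" | tr '\t' '\n'
}

# Example: show_group STAA002275 OrthologousGroups.txt
```

Reading the descriptions in the output then gives both the species represented and the annotated function.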


