Biodiversity Genomics Academia : OMA and OMArk for homology exploration and gene annotation quality control

Welcome to the OMA and OMArk session. To go through this session, please use our GitPod for the command-line steps and any step that needs computing. For the first module, you will need to open the OMA Browser.

1. The OMA Browser and HOGs

The OMA database is a publicly available orthology database with a yearly release cycle. As of the latest release, it contains orthology information for 2,851 species - 1,965 bacteria, 173 archaea, and 713 eukaryotes. The OMA Browser database serves as the basis for both OMAmer and OMArk. In this session, we will go through the easiest way to get information from the Browser and get the most out of OMArk.

For this module, go to the OMA Browser.

Browsing the species content in the OMA Browser:

The species content of the current release of the OMA Browser is listed under the 'Explore' > 'Species/release information' section of the website.

1. How many arthropod and mammalian genomes are represented in OMA?

Under the release page, search for clades in the search bar

75 arthropods and 77 mammals

If you are interested in specific clades, you can also search for them using the main search bar on top of the Browser.

2. Look up the page about the Obtectomera clade. How many species of this clade are in OMA?

Type Obtectomera, then select “Taxon” as the search field.

There are 7 species, including 5 butterflies.

3. Look up information about the species Papilio machaon. What is the identifier of its assembly? What is its release date?

Click on the species name, then look at the DB release category.

The assembly identifier is GCF_912999745. It comes from the 2022 release of RefSeq.

Browsing HOGs

HOGs, or Hierarchical Orthologous Groups, are representations of gene families. A HOG is a set of genes that descended from a common ancestral gene in a specific ancestral species (i.e., at a specific taxonomic level). HOGs are hierarchical because groups defined at more recent clades are encompassed within larger groups that are defined at older clades, making them nested subfamilies. HOGs are the main data representation in OMA and allow for exploring the evolution of a gene family. They are also the main data used for OMArk's quality assessment.

The following exercises focus on analyzing the evolutionary history of a gene family. For an introduction to the iHam graphical viewer (needed to answer the following questions), see our documentation and YouTube video.

Getting HOGs from genes

From the OMA Browser, click on the link to the rat P53 gene or search for the protein P53_RAT in the search bar. Then click on the Groups button and select HOG. We display the deepest HOG in which the protein is present - a gene sub-family. You can go to the largest HOG in which this gene is present (known as a “Root HOG” in OMA) by clicking on the first taxon on the taxa list below the header.

4. From which taxon did this gene family originate?

Euteleostomi

5. In which species is this gene most commonly duplicated?

The number of boxes on each row indicates the number of genes in this family per species.

Loxodonta africana

Duplications cause new HOGs to be created under the Root HOG. When they do, the identifier of each new HOG is that of the top-level HOG appended with ‘.[number][letter]’.
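The naming scheme above means that the ancestry of a sub-HOG can be read directly from its identifier. The following is a minimal sketch, assuming identifiers look like HOG:D0628802.7a.1a (the form seen in this tutorial); the helper name hog_lineage is hypothetical and not part of any OMA tool.

```python
import re


def hog_lineage(hog_id: str) -> list[str]:
    """Return the identifiers from the Root HOG down to hog_id.

    Assumes each duplication adds one '.[number][letter]' suffix,
    as described in the tutorial text.
    """
    body = hog_id.removeprefix("HOG:")
    root, *suffixes = body.split(".")
    for suffix in suffixes:
        # A suffix such as '7a': digits followed by lowercase letter(s).
        if not re.fullmatch(r"\d+[a-z]+", suffix):
            raise ValueError(f"unexpected sub-HOG suffix: {suffix!r}")
    lineage = ["HOG:" + root]
    for suffix in suffixes:
        lineage.append(lineage[-1] + "." + suffix)
    return lineage


# Each entry is nested inside the one before it.
for level in hog_lineage("HOG:D0628802.7a.1a"):
    print(level)
```

The first element is the Root HOG; every following element corresponds to one more duplication event on the way down to the queried sub-HOG.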

6. Look for information about the HOG D0637001.1a. It originated from a duplication in Neopterygii. Click on the Neopterygii node on the tree. How does the duplication appear in the visualization?

The boxes are separated by a line, which shows that this gene was already present in two copies in the last common ancestor of this clade. Genes from HOG:D0637001.1a are shown in green; those from the other subHOG are shown in grey.

It is possible to look at member proteins of a HOG and download corresponding sequences through the "Members" tab.

7. How many sequences are in the HOG D0637001.1a? You can download their sequences in the FASTA format.

There are 45 protein members in this HOG.

The classification of genes into gene families and subfamilies through HOGs provides information about their evolutionary relationships. By taking all the HOGs at a given taxonomic level, we can also estimate the ancestral gene content for any given clade.

2. Placements with OMAmer

This session needs GitPod and should be run from /workspace/oma-omark/working_dir/
The input FASTA files for OMAmer are in the subfolder Proteomes and the OMAmer database in the folder DB

OMArk is a pipeline that assesses the quality of proteomes, i.e. the protein-coding gene repertoire of a species. It is based on comparison with OMA HOGs. As a practical example, we will assess the quality of the proteome of the narwhal, Monodon monoceros.
The first step of OMArk is the placement of genes into HOGs with the OMAmer software, which provides a fast way to find representatives of known gene families in a query proteome. OMAmer compares the k-mers (words of k characters) of each query sequence with those of OMA HOGs to place the sequence into an existing HOG.
To accomplish this, a special database is required, which mirrors the content of the OMA database's HOGs and their sequences' k-mers. OMA also offers lightweight, clade-specific databases. However, OMArk works best with a comprehensive database called LUCA.h5, available at https://oma-stage.vital-it.ch.org/All/LUCA.h5. For this session, it is provided in the DB folder.
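To make the idea of k-mer comparison concrete, here is a toy sketch. It is not OMAmer's actual algorithm or scoring, only an illustration of comparing the k-mers of a query with the k-mers of candidate groups. The sequences, the group names HOG_A and HOG_B, and the choice of k = 3 are all made up for the example.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_overlap(query: str, reference_kmers: set[str], k: int) -> float:
    """Fraction of the query's distinct k-mers that occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & reference_kmers) / len(query_kmers)


K = 3  # illustrative value, not the one used by OMAmer

# Hypothetical reference groups, each reduced to its set of k-mers.
references = {
    "HOG_A": kmers("MEEPQSDPSV", K),
    "HOG_B": kmers("MKTAYIAKQR", K),
}

query = "MEEPQSD"
scores = {name: kmer_overlap(query, ref, K) for name, ref in references.items()}
best = max(scores, key=scores.get)
print(scores, "-> best placement:", best)
```

In this toy case every k-mer of the query occurs in HOG_A and none in HOG_B, so the query would be placed into HOG_A.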

An OMAmer database is big (>10GB) and, for now, running the software requires a large amount of RAM. We recommend performing the analysis on a computing cluster rather than on a laptop. As running it on GitPod could take time, we provide pre-computed OMAmer result files for this session. If you accidentally overwrite the ones in the omamer folder of your workspace, you can copy the files in workspace/oma-omark/expected_outputs/omamer/ back into it.

1. The command that performs gene placement with OMAmer is omamer search. What command would you write to run OMAmer with the query proteome Monmon.fa?

Use omamer search --help to get documentation about the parameters.

omamer search --db DB/LUCA.h5 --query Proteomes/Monmon.fa --out omamer/Monmon.omamer.txt

2. Take a look at the output file (omamer/Monmon.omamer.txt). The results indicate, for each protein, its best placement into HOGs. What is the placement for the protein A0A4U1FQC7_MONMO? What is this HOG's taxonomic rank in the OMA Browser?

HOG:D0628802.7a.1a. Its rank is Cetacea.

The family and subfamily scores represent the proportion of k-mers of the sequence that overlap with k-mers of the HOG in which it is placed, excluding those already shared with the encompassing HOG (for subfamilies).

3. Based on these scores, would you consider the placement of A0A4U1FQC7_MONMO to be of high confidence?

Family score and subfamily score: close to 1. Very high confidence.

Some proteins are not assigned to any HOGs as the k-mer overlap was not significant. These proteins are marked with "na" indicating their unassigned status.

4. How many proteins are not assigned to any HOG in this file?

Use grep '\sna\s' <file> | wc -l

32 sequences have no hits, out of 20,731.
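The same count can be obtained by reading the placement table programmatically. The sketch below works on a small made-up table rather than the real file: the column names (qseqid, hogid, family_score, subfamily_score), the protein names, and the values are assumptions for illustration, so check the header of your own OMAmer output before adapting it.

```python
import csv
import io

# Hypothetical, simplified placement table (tab-separated with a header).
SAMPLE = (
    "qseqid\thogid\tfamily_score\tsubfamily_score\n"
    "protA\tHOG:D0000001.1a\t0.98\t0.95\n"
    "protB\tna\tna\tna\n"
    "protC\tHOG:D0000002\t0.71\t0.71\n"
)


def count_unassigned(handle) -> tuple[int, int]:
    """Return (number of proteins without a placement, total proteins)."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    unassigned = 0
    for row in reader:
        total += 1
        if row["hogid"] == "na":  # 'na' marks an unassigned protein
            unassigned += 1
    return unassigned, total


unassigned, total = count_unassigned(io.StringIO(SAMPLE))
print(f"{unassigned} of {total} proteins have no placement")
```

To run this on a real file, replace io.StringIO(SAMPLE) with an opened file handle and adjust the column name if your header differs.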

OMAmer placements can be used to easily obtain information about, and the sequences of, homologs for every sequence of a proteome. They can be combined with the OMA Browser to start a phylogenomic analysis while skipping time-consuming orthology prediction. They can also serve as input for OMArk quality assessments.

3. Proteome quality assessment with OMArk

This session needs GitPod and should be run from /workspace/oma-omark/working_dir/
The input FASTA files are in the subfolder Proteomes, the OMAmer results in the subfolder omamer, and the OMAmer database in the folder DB.

OMArk uses the output from OMAmer and OMA's data on ancestral gene content to assess the quality of a proteome in terms of completeness and consistency. In the final part of the tutorial, we will explore how to execute it and interpret the results.

You can execute OMArk with the omark command using the command line, as the software should be installed in your workspace. If it is not already installed, you can do so by running pip install omark in your environment.

1. What command would you write to run OMArk on the Monmon OMAmer file?

Type omark --help to see possible parameters

omark -d DB/LUCA.h5 -f omamer/Monmon.omamer.txt -o omark/Monmon_results -v

OMArk will take around 10 minutes to complete. Some information is already available in the command line output. When no taxonomy information is given by the user, OMArk determines the species composition based on the OMAmer placements.

2. Which ancestral lineage did OMArk select?

Ancestral lineage: Artiodactyla. The ancestral lineage is the most specific taxon containing the query species that is represented by at least 5 species in OMA. It is the one used for the completeness and consistency assessment.

Completeness assessment

OMArk searches for HOGs that contain genes for at least 80% of the ancestral lineage’s descendant species that are present in OMA. These conserved genes are expected to be present in the proteome and serve as a proxy for completeness.
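The selection rule described above can be sketched as a simple coverage filter. This is an illustration of the 80% criterion only, not OMArk's implementation; the HOG names, species names, and the function conserved_hogs are hypothetical.

```python
def conserved_hogs(
    hog_species: dict[str, set[str]],
    lineage_species: set[str],
    min_fraction: float = 0.8,
) -> list[str]:
    """Return HOGs with genes in at least min_fraction of the lineage's species."""
    selected = []
    for hog_id, species in hog_species.items():
        coverage = len(species & lineage_species) / len(lineage_species)
        if coverage >= min_fraction:
            selected.append(hog_id)
    return sorted(selected)


# Hypothetical lineage with five species present in the database.
lineage = {"sp1", "sp2", "sp3", "sp4", "sp5"}

# Hypothetical HOGs and the species in which they have at least one gene.
hogs = {
    "HOG_X": {"sp1", "sp2", "sp3", "sp4", "sp5"},  # 5/5 species
    "HOG_Y": {"sp1", "sp2", "sp3", "sp4"},         # 4/5 species
    "HOG_Z": {"sp1", "sp2", "sp3"},                # 3/5 species
}

print(conserved_hogs(hogs, lineage))
```

With five species, a HOG needs genes in at least four of them to count as conserved, so HOG_Z is left out of the completeness set.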

3. In the output of OMArk, open either the file with the extension "_detailed_summary.txt" (human-readable) or ".sum" (machine-readable) to access the completeness assessment. How many HOGs were used for the completeness assessment?

13,050

4. How many genes are reported as missing in the proteome?

957 / 7.33%

OMArk reports two categories of duplicates: Expected duplicates occur when conserved ancestral HOGs have known sub-HOGs to which sequences are mapped, while Unexpected duplicates occur when multiple genes are placed into the same HOGs at the ancestral level.

5. To which category do the majority of duplicated genes in the proteome belong?

Most of them - 1302 of the 1351 - are reported as unexpected duplicates.
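As a quick check of the figures quoted in the answers above, the proportions can be recomputed from the raw counts. The numbers are the ones given in this tutorial; the helper percent is just a convenience for the example.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


conserved_hogs_total = 13050   # HOGs used for the completeness assessment
missing = 957                  # conserved HOGs reported as missing
duplicated_total = 1351        # HOGs reported as duplicated
duplicated_unexpected = 1302   # of which unexpected

duplicated_expected = duplicated_total - duplicated_unexpected

print("missing:", percent(missing, conserved_hogs_total), "%")
print("unexpected among duplicates:", percent(duplicated_unexpected, duplicated_total), "%")
print("expected duplicates:", duplicated_expected)
```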

You can explore which HOGs are reported in each category by opening the output file with the extension “.omq”. The categories in this file, each marked by a starting ‘>’, are:

Single: Single copies
Lost: Missing
Duplicated: Duplicated unexpected
Underspecific: Single copies - when the placement is into a parent HOG
Overspecific_S: Single copies - when the placement is into a single child HOG
Overspecific_D: Duplicated expected
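A file organised into sections that start with ‘>’ can be read into a dictionary of categories. The sketch below assumes that each section header is followed by one entry per line; the exact layout of the entries in a real “.omq” file is not described here, so the sample content and HOG identifiers are hypothetical.

```python
import io

# Hypothetical content mimicking a file with '>'-prefixed category headers.
SAMPLE_OMQ = (
    ">Single\n"
    "HOG:D0000001\n"
    "HOG:D0000002\n"
    ">Duplicated\n"
    "HOG:D0000003\n"
    ">Lost\n"
    "HOG:D0000004\n"
)


def read_categories(handle) -> dict[str, list[str]]:
    """Group the lines of a '>'-sectioned file by category name."""
    categories: dict[str, list[str]] = {}
    current = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].strip()
            categories.setdefault(current, [])
        elif current is not None:
            categories[current].append(line)
    return categories


categories = read_categories(io.StringIO(SAMPLE_OMQ))
for name, entries in categories.items():
    print(name, len(entries))
```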

Consistency assessment

The consistency assessment in OMArk uses OMA information to address several questions: Do we find homologs for the predicted proteins? Are those homologs found in clades consistent with the query taxonomy? Do they exhibit similar protein structures?

6. How taxonomically consistent are the proteins in this proteome?

Most of the proteins (94.84%) are taxonomically consistent. We note, however, a few taxonomically inconsistent ones (possible undetected contaminants, genes hitherto unknown in the clade, or errors).

7. Would you say the gene structures in this proteome are of high quality?

Structurally inconsistent genes are marked as Partial hits (less than 80% of the total sequence length was used for the placement) or fragments (they have a length less than half of the median sequence length for the clade)

Gene structures appear to be of medium quality, with around 11.70% partial placements and 13.52% fragments.

8. How many Unknown proteins are there?

32. It is the same number as the proteins marked with na in the OMAmer output.

The sequences for each category can be found in the output file with the '.ump' extension. You can search for the identifier of these fragmented genes or partial hit genes in the OMAmer output and refer to the corresponding HOGs to investigate the results.

9. For instance, let's look up the fragmented protein A0A4U1EN85_MONMO in the OMAmer output. What can you conclude about it?

You can use grep A0A4U1EN85_MONMO omamer/Monmon.omamer.txt
Pay particular attention to the last two columns that indicate the query sequence length and the median sequence length in the subfamily.

The protein is placed into a HOG with a median protein length of 803, but it is only 344 amino acids long.
By searching for the HOG identifier in the OMAmer file, we can observe that another gene is also assigned to this same HOG (HOG:D0610699), with a sequence length of 499. This HOG is also reported as duplicated in OMArk's results. This “duplication” is likely caused by a fragmented gene sequence.
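The two structural criteria given in the hint for question 7 can be written as simple threshold checks and applied to the lengths from this example. This is a sketch of the stated definitions only, not OMArk's code; the function names are hypothetical.

```python
def is_fragment(query_length: int, median_length: float) -> bool:
    """True if the query is shorter than half the median length of the group."""
    return query_length < 0.5 * median_length


def is_partial_hit(fraction_of_sequence_used: float) -> bool:
    """True if less than 80% of the sequence length was used for the placement."""
    return fraction_of_sequence_used < 0.8


MEDIAN_LENGTH = 803  # median protein length of the HOG, from the example above

# A0A4U1EN85_MONMO: 344 amino acids, below the 401.5 threshold.
print(is_fragment(344, MEDIAN_LENGTH))

# The other gene placed in the same HOG: 499 amino acids, above the threshold.
print(is_fragment(499, MEDIAN_LENGTH))
```

The second gene does not fall under the fragment threshold on its own, yet together the two placements are consistent with a single gene that was split during annotation.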

10. We have another proteome available for this species, obtained from the NCBI. The proteome from the NCBI contains isoforms of the same gene. Therefore, we will provide a file to OMArk that specifies which proteins come from the same gene (available in Proteomes/Monmon_NCBI.splice). What is the command to run OMArk on the MonMon_NCBI.omamer.txt file?

omark -f omamer/Monmon_NCBI.1.txt -i Proteomes/Monmon_NCBI.splice -d DB/LUCA.h5 -o omark/Monmon_NCBI_results/ -v

11. What can you say about this proteome in comparison to the previous one?

It is both more complete and more consistent in terms of gene content.
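If you need to prepare an isoform file for your own proteome, the sketch below shows one way to generate it from a gene-to-isoform mapping. It assumes the layout of one gene per line with its isoform identifiers separated by semicolons; please verify this against the OMArk documentation before relying on it. The gene and protein identifiers are made up.

```python
def format_splice(gene_to_isoforms: dict[str, list[str]]) -> str:
    """Return one line per gene, listing its isoform IDs separated by ';'.

    The layout is an assumption for this example, to be checked against
    the OMArk documentation.
    """
    lines = [";".join(isoforms) for isoforms in gene_to_isoforms.values()]
    return "\n".join(lines) + "\n"


# Hypothetical mapping: two genes, the first with two isoforms.
mapping = {
    "geneA": ["protA.1", "protA.2"],
    "geneB": ["protB.1"],
}

splice_text = format_splice(mapping)
print(splice_text, end="")
```

The resulting text can be written to a file and passed to omark with the -i option, as in the command for question 10.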

OMArk website

OMArk can also be used on its website: https://omark.omabrowser.org/. The website contains data about public proteomes from multiple sources, making it easy to compare your results with those public data. This is particularly relevant because OMArk results can vary between clades and are most useful for relative comparisons.

12. Look for the clade Cetacea on the website. What can you say about the proteomes with outlier proteome size (number of proteins)?

Click on “Select Taxon” and type Cetacea.

Proteomes with a low gene count are reported as incomplete by OMArk. The one proteome with a higher gene count has a higher proportion of fragments (Monodon monoceros from UniProt).

13. Is the NCBI Monodon monoceros proteome of good quality for its clade?

Yes, its proportions of inconsistent genes and missing genes are close to what is observed in other proteomes. (Note that getting 0% may be unrealistic.)

14. In this view, open the proteome of Physeter catodon. What is it contaminated with?

A parasitic apicomplexan from the Eimeriorina clade.

Finally, upload a proteome of your choice (perhaps downloaded from UniProt) on the OMArk website. You can easily obtain the same results file as shown before by clicking the Download button. You can also easily compare your proteome to closely related ones by clicking on the “Change to comparison view” button.