The OMA database can be obtained and queried in a variety of ways. Pairwise orthologs, HOGs, OMA Groups, paralogs and homoeologs can be accessed through the browser, API, or SPARQL. They can be inferred on user-input genomes using the OMA Standalone software.
From the landing page, users can search for their gene, group, or genome of interest. This can be done by clicking on the drop down menu to the left of the search bar.
Searching for a gene, group, or genome takes you to the gene-, group-, or genome-centric pages.
Gene-centric pages in OMA give all the information specific to a single gene in OMA. The gene is found at the top, with its OMA ID and UniProt ID. More information about the orthologs of this gene can be found in several different sub-pages are available on the side menu, including the orthologs, paralogs, gene information, isoforms, GO annotations, sequences, and local synteny. Each will be discussed in the following sections.
After searching a particular gene on OMA, a table of the orthologs is shown by default, also reachable from the side menu.
Paralogs are those genes which started diverging due to a duplication. They are defined in OMA by the taxonomic level at which the duplication occurred. The table is in the same format as the orthologs table.
Here one can get general information about the gene of interest. This includes the gene description, organism, locus coordinates, number of exons, and various ids and cross-references.
This page contains a table for any isoforms, or alternative splice variants, for the given gene, if it is stored in OMA. (Note, if no isoforms are available it doesn’t mean that isoforms don’t exist, but rather we don’t have the information in OMA.)
This page contains the GO annotations, including evidence codes and reference numbers, for a given gene. One can jump to Biological Process, Cellular Component, or Molecular Function in the side menu.
This page contains the protein and CDS sequences for the gene, available to download as a fasta file.
In the absence of genome rearrangement, orthology relationships can be expected to be consistent across neighbouring genes—a concept commonly referred to as ‘shared synteny’. Patterns of syntenic conservation or divergence can shed light on the evolutionary history of genomic loci of interest; they can also reveal sequencing artefacts, misannotations or orthology inference errors.
The OMA synteny viewer uses a typical layout: genes are represented by boxes, with neighbouring genes displayed in adjacent columns and orthologous regions displayed in different rows. The reference syntenic block, centred on a query gene, is displayed in the first row. The other rows are centred on genes that are orthologous to the query gene, ordered by increasing taxonomic distance to the query gene species. Orthology relationships to each gene contained in the reference syntenic block are coded using different colours. To convey many-to-one and many-to-many relationships, we use stripes of the relevant colours. To aid clarity, hovering over a gene highlights all orthologs of the same colour including those with stripes. The data can be conveniently explored by clicking on any gene, which recenters the display on that gene as a new query.Please see (Altenhoff et al. 2015) for more information.
The Extant and Ancestral genomes search page is an integrated page to search for information about the proteome of an extant species or ancestral one. If interested in a current species, you only need to write its scientific name or NCBI taxonomy ID in the corresponding text search boxes. If interested in an ancestral species, you can write the clade which the species is the common ancestor of in the scientific name box.
Both the Extant and Ancestral options allow not only to inspect and explore the genes content of a genome (or HOGs in ancestral genomes), but also to visualize the gene order in an interactive way. Moreover, if looking for an extant species, you can visualize the synteny with other species or clade of choice in a dotplot representation.
You can find it by expanding the “Explore” tab on the top right menu. Once here, click on “Extant and ancestral genomes”. This will take you to the Ancestral and extant genomes search page.
When selecting an extant genome, the Gene Order viewer is a representation of the protein-coding genes present in each chromosome/ scaffold/ contig. The grey rectangles or bars represent the genes and the links connect genes that are found next to each other (independently of the length of the sequence between them).
The synteny tab on the left hand menu opens a tool to visualise the gene synteny between the selected browsed extant species with another species of choice.
When selecting a second species in the text box and pressing enter, a matrix representing an overview of the synteny between both genomes shows up. Here, you can select two chromosomes (one for each species) for closer comparison with the synteny dot plot by clicking on their intersecting square. The colour intensity of a square represents the number of 1:1 orthologs shared between chromosomes, or the number of dots in that specific dot plot. The synteny dot plot can also be opened here by selecting the chromosome of each species that you want to compare on the drop down boxes of the bottom left menu.
The Ancestral Gene Order (AGO) viewer is an interactive tool to view and explore the order of genes in a specific ancestral species.
Firstly, you need to open the Extant and Ancestral genomes search page. In order to view the gene order of an ancestral species, you need to select it. You can do so by writing the name of the clade whose common ancestor species you are interested in (e.g. Mammalia, Chordata, Hominidae).
The ancestral genome page will first show, where you can access and download information like the common and scientific names of the species in the clade selected, their gene number, their OMA taxon ID or their OMA code.
The AGO viewer can be accessed by simply clicking on the Ancestral Gene Order tab on the left hand side.
The ancestral genes tab consists of a table of all the genes present in the ancestral genome with additional information about each of them. This is presented as the HOGs present at this level, which can be interactively accessed by clicking on them and downloaded (top right button).
Each line or group of lines represents an ancestral contig or CAR (Contiguous Ancestral Region) and each vertical bar on them a gene (HOG present at the ancestral genome level). These contigs are ordered by decreasing length.
By default, the height of the bars show the number of genes in the HOG (number of extant genes) while the colouring represents the completeness of the HOG (proportion of descendant species represented with at least one gene in this HOG). You can see this information and customise it on the settings option (top right button).
The Ancestral Gene Order information shown here has been reconstructed from extant species gene order using a parsimony approach. Two HOGs (bars) are connected by a link if they have been inferred to be of closest proximity, with the weight (colour of the line) indicating the number of times these two HOGs are also of closest proximity in descendant extant genomes.
There are two main types of Group-centric pages: the HOG and OMA Group pages. These can be accessed by searching for a specific group by one of their group identifiers (HOG id, OMA Group id, OMA fingerprint) or by a member gene of the group.
The HOG page gives information on a HOG of interest. It can be accessed by searching for a specific HOG, or through one of its member genes.
The landing HOG page gives information about the Root HOG (i.e. gene family), including the Root HOG id, the taxonomic level at which this HOG is inferred, and the number of member genes in this HOG.
The Graphical viewer is displayed on the HOG landing page. This is an interactive display in which all the species in the HOG are organized by their species tree. All the genes in the HOG are represented as boxes. Ancestral genes and evolutionary events such as duplications and losses can be visualized by interacting with the graph. For more information on the iham visualization and the options, see (Train et al. 2019) and YouTube video.
By clicking on the Members tab in the side menu for a given HOG, one has access to all the genes in that particular HOG at that taxonomic level. Displayed in the table are the Protein IDs , Cross References , Domain Architectures , and Taxon . The table can be filtered to show the sub-HOGs at various taxonomic levels.
The last tab in the side menu is for similar HOGs. Genes that belong to distinct but similar HOGs can be paralogs separated by a very deep duplication, orthologs misclassified by OMA in separate groups, or genes that are homologous for only part of their sequence (e.g. genes spanning over a domain fusion or fission event, artefactual fragments, etc.)Please see (Altenhoff et al. 2018) for more information.
Domains can be used to establish links between HOGs. Given an initial HOG, a user can retrieve a table of the most similar HOGs based on conserved domain architecture. The similarity is computed by counting the number of domains in common between two HOGs. At the top a visual representation of the overall domain architecture of the HOG is displayed, as well as the prevalence of this representative domain within the HOG.
In the domain architecture view of a HOG, information about the HOG (on the top) is followed by the table containing information about other HOGs that share at least one domain in common with the HOG of interest. The table can be sorted by any of the attributes.
Here one can find similar HOGs based on the number of orthologous relations which link the HOGs. This can be useful for finding split HOGs which have not been merged by OMA.
The OMA Group pages are similar to the HOG pages, but display information about an OMA Group, which are more stringent groups of orthologs. (See Different types of orthologs in OMA ). The header of the page contains the OMA Group ID, the numbers, and the fingerprint. The side menu contains the following: Members, Alignment, Gene Ontology, and Similar OMA groups.
The genome-centric pages can be reached by clicking on the Genome tab on the right hand side of the header from individual gene pages, or by searching for a genome on the search bar.
The header of the genome-centric pages include the 5-letter uniprot species code, the scientific name, the proteome annotation source, date, and the number of proteins in this genome. The side menu includes the Genome information, a list of genes, related species, and synteny.
At the top of this page the taxonomy of the species is displayed, starting at the root of the NCBI species tree, going to the species of interest. Shown underneath are the common name, UniProt identifier (i.e. 5-letter species code), NCBI taxonomy ID, DB release, and the number of proteins which have been assigned to OMA Groups and HOGs.
The list of genes contains a table with all the proteins in the genome, their cross references and domain architecture. The protein IDs are in the form of OMA ID, which goes in order with the locus position in the genome. Since these are displayed in order, one can easily see potential tandem duplications as those with identical domain architecture.
Here one can find the most and least related species to the species of interest based on the number of shared OMA Groups or HOGs.
See Synteny Dot Plot
The OMA database is accessible programmatically for users who wish to obtain the data with python and R.
The REST API provides programmatic access to a comprehensive set of features provided through the web server. This API can be used to automate almost any analysis that a user could do on the website. On the REST API documentation page, which is accessible under https://omabrowser.org/api , all the endpoints and their parameters are described. Each endpoint includes also a live example.Please see (Altenhoff et al. 2018) and (Kaleb et al. 2019) for more information.
In addition, for R and python users, we provide native libraries wrapping around the REST API that further facilitates querying the OMA database in these languages.
The API wrapper package in R is available in Bioconductor (http://bioconductor.org/packages/OmaDB/). The package consists of a collection of functions that import OMA data into R objects, the type of which depends on the query supplied. Due to the volume of the data available, some selected object attributes are at first given as URL endpoints. However, these are automatically loaded upon accession. OmaDB also facilitates further downstream analyses with other Bioconductor packages, such as GO enrichment analysis with topGO, sequence analysis with BioStrings, phylogenetic analyses using ggtree or gene locus analyses with the help of GenomicRanges.
The open source code is hosted at https://github.com/DessimozLab/OmaDB/. The latest version of the package (v2.0) requires R version >= 3.6 and Bioconductor version >= 3.9.Please see (Kaleb et al. 2019) for more information.
For Python users, we provide an analogous package named omadb. Results are supplied to users as a hybrid attribute-dictionary object. As such, both attribute and key-based access is possible. Where the URL of a further API call is listed in a response, this has been designed to be automatically requested for the user.
For data that can be represented as a table, the pandas package is supported. HOGs can be analysed or displayed using the pyham library. Trees are retrievable as DendroPy or ETE3 Tree objects. Gene Ontology enrichment analyses are possible through the use of the goatools package.
The open source code is hosted at https://github.com/DessimozLab/pyomadb/ . The package requires Python >=3.6, as well as a stable internet connection. It is also available to download from PyPI, installable using pip.Please see (Kaleb et al. 2019) for more information.
The entire OMA database is available for download in several formats. This option is available under Download -> Current Release.
HDF5 download files from the current releases of OMA are still available under Download -> Archives.
Recent years have witnessed an increasing adoption of Resource Description Framework (RDF) among bioinformatics databases to model their data, as evidenced by the YummyData monitor (see https://yummydata.org), which currently highlights more than 65 biological and biomedical resources with SPARQL endpoints for querying their data (SPARQL is the main query language for RDF). The attractiveness of using RDF and SPARQL for bioinformatics databases can be explained by three main facts: the virtuous cycle of adopting a common data syntax and model that leverages data interchange on the Web; the SPARQL 1.1 specification makes it possible to run federated queries, which allow bioinformaticians to jointly retrieve data from multiple resources in one single query; and the existing growing number of RDF-based ontologies, controlled vocabularies, and taxonomies in life sciences. For example, the Bioportal, a repository of biomedical ontologies, contains more than 850 ontologies (see https://bioportal.bioontology.org).
Currently, the OMA RDF data are fully structured based on the Orthology Ontology (ORTH) version 2 (see documentation at http://qfo.github.io/OrthologyOntology/). As a result, we also enhance the interoperability among other resources such as the Microbial Genome Database (MBGD), another orthology database. The webpage of the OMA SPARQL endpoint is https://sparql.omabrowser.org and provides several example queries. To build different queries and to understand how the RDF data are structured, please refer to the tutorial paper at https://f1000research.com/articles/8-1822 and the full documentation of ORTH at http://qfo.github.io/OrthologyOntology/. For programmatic requests and/or to invoke a portion of a SPARQL query against a remote SPARQL endpoint with the SERVICE2 keyword (federated queries), we recommend to directly use the SPARQL endpoint query processor at https://sparql.omabrowser.org/sparql/. To implement and execute SPARQL queries within software tools, specialized packages for SPARQL are often provided in different programming languages such as the SPARQLWrapper package in Python ( pip install SPARQLWrapper, see https://pypi.org/project/SPARQLWrapper/). An example of SPARQLWrapper usage with OMA is available as Orthology_SPARQL_Notebook.ipynb (a Jupyter notebook) at https://github.com/biosoda/tutorial_orthology. To learn the SPARQL language, we suggest the introductory book chapter at https://doi.org/10.1007/978-1-4939-9074-0_22.
In addition to accessing orthologs through the OMA database, one can run the OMA Standalone software on their own genomes (see OMA Standalone in Catalog of OMA Tools). This can be combined with any of the species in OMA, so to facilitate saving time on the all-against-all computations, they are available for download as well (see Export all-against-all in Catalog of OMA Tools).Please see (Altenhoff et al. 2019) for more information.