Module 3: FastOMA

3.1 Obtaining data and getting setup

FastOMA is a software package for inferring homologous relationships between the coding genes of multiple species, including generating Hierarchical Orthologous Groups (HOGs). It takes as input several files in FASTA format, each containing the sequences of the proteins encoded by all genes in a species' genome (the proteome). It also requires a species tree that resolves the relationships between the species.


For this module, you will need to use our GitPod instance.


In this exercise, we will run FastOMA standalone to infer the orthology information for five yeast species. We have already provided the proteomes of these five species in the GitPod environment, located at /workspace/FastOMA-academy/Module_FastOMA/working_dir/in_folder/proteome.

Another input needed by FastOMA is the species tree. In our case, the species tree in Newick format is provided in the GitPod workspace: Module_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows:

(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;

You can visualize the species tree using the phylo.io website.
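
As a quick sanity check, the leaves of a small Newick tree can also be listed from the shell. The sketch below is only an illustration and not part of FastOMA. It assumes that leaf names contain only letters, digits and underscores, and it relies on the fact that a leaf label directly follows "(" or "," while internal node labels follow ")".

```shell
# Species tree from this module, stored in a variable for illustration.
tree='(((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;'

# Keep every label that directly follows "(" or ",", then strip that character.
leaves=$(printf '%s\n' "$tree" | grep -oE '[(,][A-Za-z0-9_]+' | tr -d '(,')

# Count the extracted labels (tr removes the padding some wc versions add).
leaf_count=$(printf '%s\n' "$leaves" | wc -l | tr -d ' ')

printf '%s\n' "$leaves"
echo "Number of leaves: $leaf_count"
```

To run it on the workspace file instead, the variable could be filled with tree=$(cat in_folder/species_tree.nwk).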

In the GitPod, FastOMA is already installed – you should be able to use it after logging into your GitPod workspace.


Optional (If you are not using GitPod)

If you want to install FastOMA on your system, you can follow the installation instructions on the FastOMA GitHub page (https://github.com/DessimozLab/fastoma).

If you want to download the proteomes on your own system, check out the following hint:

Instructions for downloading proteomes

The UniProt database includes the proteomes of many species. You can download the reference proteomes of the following species from UniProt by clicking on “Download one protein sequence per gene (FASTA)”:

Right-click on “Download one protein sequence per gene (FASTA)” and copy the link. Then, use wget to download the file and gunzip to decompress it. For example, for Schizosaccharomyces pombe:

wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz


  • 1. In which format are the proteome files?

    FASTA format

  • 2. How many proteins are there in the Schizosaccharomyces pombe proteome?

    Each record in a FASTA file always starts with ">". You can use the following command to calculate the number of records in the FASTA file: grep ">" in_folder/proteome/Schizosaccharomyces_pombe.fa | wc -l

    There are 5,122 proteins in this FASTA file.

  • 3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?

    The answer to both questions is 5.
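
The record count from question 2 can be repeated for every proteome in one go. The following is a minimal sketch: count_fasta_records is a hypothetical helper name (not a FastOMA command), and the toy FASTA file with made-up sequences is there only so that the example runs anywhere.

```shell
# count_fasta_records is a hypothetical helper, not a FastOMA command.
# It prints "<file><TAB><number of records>" for every file given.
count_fasta_records() {
    for fasta in "$@"; do
        # grep -c counts matching lines; '^>' matches header lines only.
        printf '%s\t%s\n' "$fasta" "$(grep -c '^>' "$fasta")"
    done
}

# Toy data so the sketch is self-contained; the sequences are made up.
demo_dir=$(mktemp -d)
printf '>protA\nMKV\n>protB\nMLS\nAAG\n>protC\nMTT\n' > "$demo_dir/toy_species.fa"

count_fasta_records "$demo_dir"/*.fa
```

In the GitPod workspace, the call would be count_fasta_records in_folder/proteome/*.fa, assuming all proteomes use the .fa extension like Schizosaccharomyces_pombe.fa.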

3.2 Running FastOMA

The FastOMA algorithm runs in three main steps:

  1. Mapping input proteins to existing Hierarchical Orthologous Groups (HOGs) in the OMA database using OMAmer (see Module 2)
  2. Inferring gene families based on the OMAmer results
  3. Orthology inference in the form of new HOGs

Note that these steps are efficiently executed, thanks to our highly-parallelized pipeline implemented in Nextflow. The output of FastOMA is reported in OrthoXML, which is the standard format for HOGs. For more information on HOGs, see Module 1 and also Page 4 of Zahn-Zabal et al. F1000, 2020.

First, change directory to Module_FastOMA/working_dir/, where the folder in_folder is located.

cd /workspace/FastOMA-academy/Module_FastOMA/working_dir

Then, check whether Nextflow is installed on your system by running nextflow -h. Now we can use the command line to run FastOMA on the five proteomes in in_folder, also using the species tree from in_folder/species_tree.nwk.
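
Before launching the pipeline, it can help to confirm that the input folder is consistent. The sketch below is a hypothetical pre-flight check and not part of FastOMA. It assumes the layout used in this module (a proteome subfolder with .fa files and a species_tree.nwk file), and it assumes that each FASTA file name without its extension should appear as a leaf label in the species tree, as is the case for Schizosaccharomyces_pombe.fa here. The toy folder exists only to make the example runnable.

```shell
# check_fastoma_inputs is a hypothetical helper: check_fastoma_inputs <in_folder>
check_fastoma_inputs() {
    in_dir=$1
    status=0
    tree_file="$in_dir/species_tree.nwk"
    if [ ! -f "$tree_file" ]; then
        echo "missing species tree: $tree_file" >&2
        return 1
    fi
    for fasta in "$in_dir"/proteome/*.fa; do
        # If the glob matched nothing, the literal pattern is not a file.
        if [ ! -f "$fasta" ]; then
            echo "no .fa files in $in_dir/proteome" >&2
            return 1
        fi
        species=$(basename "$fasta" .fa)
        # A leaf label follows "(" or "," and is followed by ":", "," or ")".
        if ! grep -qE "[(,]${species}[:,)]" "$tree_file"; then
            echo "not a leaf in the tree: $species" >&2
            status=1
        fi
    done
    return $status
}

# Toy input folder (made-up names and sequences) to exercise the check.
demo_in=$(mktemp -d)
mkdir "$demo_in/proteome"
printf '>a1\nMKV\n' > "$demo_in/proteome/sp_a.fa"
printf '>b1\nMLS\n' > "$demo_in/proteome/sp_b.fa"
printf '(sp_a:1,sp_b:1)root;\n' > "$demo_in/species_tree.nwk"

check_fastoma_inputs "$demo_in" && echo "inputs look consistent"
```

In the workspace, the equivalent call would be check_fastoma_inputs in_folder, run from the working_dir folder.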

  • 1. What is the command line to run FastOMA using a local OMAmer database?

    Check the FastOMA GitHub page. Important note: without specifying --omamer_db, FastOMA will download a large database covering the entire OMA database. This is not a problem for most machines, but could be problematic on GitPod.

    nextflow /workspace/FastOMA/FastOMA.nf --input_folder in_folder --output_folder out_folder --omamer_db in_folder/omamerdb.h5

Execute the above command to run FastOMA.

  • 2. Where is the output orthoXML file?

    Check the parent directory of in_folder. What does it contain?

    It is in out_folder. The output of FastOMA includes three folders (hogmap, OrthologousGroupsFasta, RootHOGsFasta) and eight files (FastOMA_HOGs.orthoxml, orthologs.tsv.gz, OrthologousGroups.tsv, phylostratigraphy.html, report.ipynb, RootHOGs.tsv, report.html, species_tree_checked.nwk).

3.3 Interpreting the results

Recall that Orthologous Groups are groups of strict orthologs, with at most 1 representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.

The output of FastOMA includes three folders:

  • hogmap: contains the OMAmer results used by FastOMA
  • OrthologousGroupsFasta: contains FASTA files of marker orthologous groups
  • RootHOGsFasta: contains a FASTA file of the sequences in each HOG

and eight files:

  • The main orthology results: FastOMA_HOGs.orthoxml, orthologs.tsv.gz, RootHOGs.tsv, OrthologousGroups.tsv
  • Summary reports: phylostratigraphy.html, report.ipynb, report.html
  • The transformed input species tree: species_tree_checked.nwk

The hogmap folder includes the output of OMAmer (described in the OMAmer module); each file corresponds to an input proteome.

Orthologous groups

The folder OrthologousGroupsFasta includes FASTA files, in which all proteins inside each FASTA file are orthologous to each other. These could be used as gene markers for species tree inference (Module 3).

  • 1. How many Orthologous Groups are there?

    You can count the number of FASTA files in the folder OrthologousGroupsFasta.

    6,773. Note: The number could slightly change in different runs.

  • 2. How many genes in total are present in all Orthologous Groups?

    Genes are coded in the OrthoXML file as <geneRef id="1002001760"/>. We need to count the number of lines including "geneRef": grep geneRef FastOMA_HOGs.orthoxml | wc -l

    There are 22,606 genes in the groups.
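
Counting lines works as long as each geneRef element sits on its own line. A slightly more defensive variant, sketched below, counts the elements themselves with grep -o. The toy file is a heavily simplified stand-in for an OrthoXML file (it is not a complete, valid document) and is there only so that the example runs anywhere.

```shell
# Simplified, made-up OrthoXML-like fragment; two geneRef tags share one line.
demo_xml=$(mktemp)
cat > "$demo_xml" <<'EOF'
<orthoXML>
  <groups>
    <orthologGroup id="HOG:0000001">
      <geneRef id="1"/><geneRef id="2"/>
      <paralogGroup>
        <geneRef id="3"/>
        <geneRef id="4"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
EOF

# grep -o prints every match on its own line, so tags sharing a line are all counted.
gene_refs=$(grep -o '<geneRef ' "$demo_xml" | wc -l | tr -d ' ')

# Line-based count for comparison; it undercounts on this toy file.
gene_ref_lines=$(grep -c 'geneRef' "$demo_xml")

echo "geneRef elements: $gene_refs (lines containing geneRef: $gene_ref_lines)"
```

If the two numbers agree on the real output file, the simpler line-based command is sufficient.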

Orthologous Groups which have a representative gene in every species could be considered as the “core genome” of the species of interest - genes that are conserved in all of the species.

  • 3. How many Orthologous Groups include one representative gene for each species?

    Count how many groups in OrthologousGroups.tsv have five genes. You can count how many times each group appears and keep only the groups that appear exactly five times using this command: cut -f 1 OrthologousGroups.tsv | sort | uniq -c | awk '$1 == 5' | wc -l

    There are 1,618 Orthologous Groups having five genes.
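
The same count can be done in a single awk pass, which also makes it easy to ask for other group sizes. This is a sketch under assumptions: count_groups_of_size is a hypothetical helper name, and the table is assumed to have a header line followed by one tab-separated "group, protein" row per gene. The column names and the toy rows are made up.

```shell
# Toy table in the assumed layout: header, then one "group<TAB>protein" row per gene.
size_tsv=$(mktemp)
{
    printf 'Group\tProtein\n'
    printf 'OG_1\tp1\nOG_1\tp2\nOG_1\tp3\nOG_1\tp4\nOG_1\tp5\n'
    printf 'OG_2\tq1\nOG_2\tq2\n'
    printf 'OG_3\tr1\nOG_3\tr2\nOG_3\tr3\nOG_3\tr4\nOG_3\tr5\n'
} > "$size_tsv"

# count_groups_of_size is a hypothetical helper: count_groups_of_size <tsv> <size>
count_groups_of_size() {
    awk -F '\t' -v size="$2" '
        NR > 1 { members[$1]++ }      # skip the header, tally rows per group
        END {
            for (g in members) if (members[g] == size) n++
            print n + 0               # print 0 instead of an empty line
        }' "$1"
}

count_groups_of_size "$size_tsv" 5
```

With five input species, count_groups_of_size OrthologousGroups.tsv 5 would correspond to the groups that have a representative in every species.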

Genes in orthologous groups could be used for tasks such as resolving a species tree. One possible use of FastOMA would be to generate orthologous groups with a mostly unresolved species tree, then use those groups to infer a better-resolved one.

  • 4. Which genes are orthologous to the gene A7EQW0_SCLS1?

    Find the corresponding line in the OrthologousGroups.tsv using $ grep A7EQW0_SCLS1 OrthologousGroups.tsv. Then, use the grep command on the first column $ grep "OG_0003358" OrthologousGroups.tsv.

    'tr|Q4WEI0|Q4WEI0_ASPFU', 'tr|A7EQW0|A7EQW0_SCLS1', 'sp|P32468|CDC12_YEAST', 'tr|Q6C7L3|Q6C7L3_YARLI', 'sp|P48009|SPN4_SCHPO'
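
The two grep steps can be combined into one lookup. The sketch below is an illustration under assumptions: orthologs_of is a hypothetical helper name, the table is assumed to hold the group identifier in the first tab-separated column and the protein identifier in the second, and the header names and the extra OG_toy row are made up.

```shell
# Toy table: the group from question 4 plus one made-up group.
og_tsv=$(mktemp)
{
    printf 'Group\tProtein\n'
    printf 'OG_0003358\ttr|Q4WEI0|Q4WEI0_ASPFU\n'
    printf 'OG_0003358\ttr|A7EQW0|A7EQW0_SCLS1\n'
    printf 'OG_0003358\tsp|P32468|CDC12_YEAST\n'
    printf 'OG_0003358\ttr|Q6C7L3|Q6C7L3_YARLI\n'
    printf 'OG_0003358\tsp|P48009|SPN4_SCHPO\n'
    printf 'OG_toy\ttr|X00001|X00001_TOYSP\n'
} > "$og_tsv"

# orthologs_of is a hypothetical helper: orthologs_of <gene> <tsv>
# The file is read twice: pass 1 records the group(s) whose protein column
# contains the gene name, pass 2 prints every protein in those groups.
orthologs_of() {
    awk -F '\t' -v gene="$1" '
        NR == FNR { if (index($2, gene)) wanted[$1] = 1; next }
        $1 in wanted { print $2 }
    ' "$2" "$2"
}

orthologs_of A7EQW0_SCLS1 "$og_tsv"
```

The output includes the query gene itself, just like the second grep in the hint above.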

Root HOGs

The file RootHOGs.tsv and the RootHOGsFasta folder contain information about the highest level of HOGs included in the OrthoXML file. Contrary to Orthologous Groups, Root HOGs represent families of genes that all descend from one common ancestral gene in the ancestor of the species represented in the dataset. As such genes may have undergone duplication during their evolutionary histories, Root HOGs may contain more than one gene per species, which differentiates them from Orthologous Groups. Root HOGs may be used to resolve the evolutionary history of a given gene family.
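
To see which species contribute more than one gene to a root HOG, the members can be tallied per species. The following is a minimal sketch, not FastOMA functionality: duplicated_species is a hypothetical helper name, the table is assumed to hold the root HOG identifier in the first tab-separated column and the protein identifier in the second, and the species is assumed to be the part of the identifier after the last underscore (as in RL15A_SCHPO). All rows in the toy table are made up.

```shell
# Toy table with made-up identifiers; the species code follows the last "_".
roothog_tsv=$(mktemp)
{
    printf 'RootHOG\tProtein\n'
    printf 'HOG:0000001\tG1_SPA\nHOG:0000001\tG2_SPA\n'
    printf 'HOG:0000001\tG3_SPB\n'
    printf 'HOG:0000001\tG4_SPC\nHOG:0000001\tG5_SPC\nHOG:0000001\tG6_SPC\n'
    printf 'HOG:0000002\tH1_SPA\n'
} > "$roothog_tsv"

# duplicated_species is a hypothetical helper: duplicated_species <hog_id> <tsv>
# It prints "<species><TAB><count>" for species with more than one member.
duplicated_species() {
    awk -F '\t' -v hog="$1" '
        $1 == hog {
            n = split($2, parts, "_")   # species code is the last "_" field
            count[parts[n]]++
        }
        END { for (s in count) if (count[s] > 1) print s "\t" count[s] }
    ' "$2" | sort
}

duplicated_species HOG:0000001 "$roothog_tsv"
```

A species listed here has several genes in the family, which is what distinguishes a root HOG from an Orthologous Group.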

  • 5. How many root HOGs are in the HOG output file?

    Each line in the output file RootHOGs.tsv denotes a gene. You can count the number of distinct root HOGs using the command below. Note that the first line is the header, so the answer is the output value minus 1.

    cat RootHOGs.tsv | cut -f 1 | sort | uniq -c | wc -l

    There are 6,793 root HOGs (gene families) in this file.

  • 6. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)? What are their identifiers?

    Find the corresponding line in RootHOGs.tsv using $ grep RL15A_SCHPO RootHOGs.tsv. Then, use the grep command on the first column $ grep "HOG:0006324" RootHOGs.tsv

    There are 7 proteins in this family. RL15A_YEAST, Q4WJV5_ASPFU, sp|P54780|RL15B_YEAST, RL15B_SCHPO, Q6C7Y3_YARLI, A7EYU6_SCLS1, sp|O74895|RL15A_SCHPO

  • 7. In which species is there more than one protein? What does it mean?

    YEAST (Saccharomyces cerevisiae) and SCHPO (Schizosaccharomyces pombe) have more than one protein in the root HOG, which indicates that there have been one or more duplications in this gene family.

  • 8. Find the FASTA file for this root HOG. It could be used as an input for more analysis, such as gene tree analysis.

    FASTA files are in the folder RootHOGsFasta. They can be aligned using MAFFT, and gene trees can then be inferred using FastTree (both are already installed in the environment).
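
The alignment and tree-building steps mentioned in question 8 could be chained as sketched below. This is an illustration under assumptions: build_gene_tree is a hypothetical helper name, the output file names are arbitrary, and the FastTree executable is assumed to be called FastTree (on some installations it is named fasttree).

```shell
# build_gene_tree is a hypothetical helper: build_gene_tree <input.fa> <output_prefix>
build_gene_tree() {
    if [ ! -s "$1" ]; then
        echo "input FASTA missing or empty: $1" >&2
        return 1
    fi
    if ! command -v mafft >/dev/null 2>&1 || ! command -v FastTree >/dev/null 2>&1; then
        echo "mafft and FastTree must both be on PATH" >&2
        return 1
    fi
    # --auto lets MAFFT choose an alignment strategy based on the input size.
    mafft --auto "$1" > "$2.aln.fa" || return 1
    # FastTree assumes protein input by default and writes a Newick tree to stdout.
    FastTree "$2.aln.fa" > "$2.nwk"
}
```

It would be called with the path of the root HOG FASTA file found in question 8 and a prefix of your choice, and it writes the alignment to <prefix>.aln.fa and the tree to <prefix>.nwk.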

Phylostratigraphy

By reconstructing HOGs, FastOMA also models the evolutionary histories of gene families: at which taxonomic level they are gained, lost and duplicated. The results of this are contained in the phylostratigraphy files.

  • 9. How many genes are duplicated at the level of Saccharomyceta?

    Download the phylostratigraphy.html file in the out_folder to your computer and open it in a browser.

    265

  • 10. Which species appears to have the most unique proteins?

    Proteins unique to a species are gained on the terminal branch and reported at the leaf level.

    Sclerotinia sclerotiorum, with 5,363 genes gained.

Report from FastOMA

FastOMA produces a report in HTML format (report.html) that summarizes the input proteomes and provides statistics on the output. This report can also be explored using the Jupyter Notebook (report.ipynb).

  • 11. Which species has the most proteins in its proteome?

    Check the section "Stats on input dataset" in the report.html file.

    Sclerotinia sclerotiorum

The report also contains basic statistics about HOGs in the dataset.

  • 12. What is the size of the HOG with the most members? What does it mean to have so many members?

    One HOG has 30 members. This indicates it likely underwent many duplications. This HOG contains proteins from the hexose transporter family, which is highly duplicated in yeast (https://pubmed.ncbi.nlm.nih.gov/20660490/).

FastOMA computes a so-called completeness score, which indicates how many species below the defined taxonomic level are found in a given HOG. A high completeness score indicates genes have been found in all species in the clade, which typically gives high confidence in the HOG.

  • 13. What is the maximum value of the CompletenessScore?

    Check the section "Roothogs (deepest levels for every HOG)" in the report.html file.

    1

  • 14. Are there many HOGs with a high completeness score?

    Yes, most HOGs have a completeness score of 1.
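
The idea behind the completeness score can be illustrated with a small calculation. The sketch below is not FastOMA's implementation. It assumes the score is the fraction of species in the clade that have at least one gene in the HOG, which is consistent with a maximum value of 1. The helper name completeness, the table layout (root HOG identifier, then protein identifier, tab-separated) and all toy rows are made up, and the species is assumed to be the part of the protein identifier after the last underscore.

```shell
# Toy table for a clade of five species (SPA to SPE); all rows are made up.
hog_tsv=$(mktemp)
{
    printf 'RootHOG\tProtein\n'
    printf 'HOG:1\tG1_SPA\nHOG:1\tG2_SPB\nHOG:1\tG3_SPC\nHOG:1\tG4_SPD\nHOG:1\tG5_SPE\n'
    printf 'HOG:2\tH1_SPA\nHOG:2\tH2_SPA\nHOG:2\tH3_SPB\nHOG:2\tH4_SPC\nHOG:2\tH5_SPD\n'
} > "$hog_tsv"

# completeness is a hypothetical helper: completeness <hog_id> <tsv> <species in clade>
completeness() {
    awk -F '\t' -v hog="$1" -v total="$3" '
        $1 == hog { n = split($2, parts, "_"); seen[parts[n]] = 1 }
        END {
            k = 0
            for (s in seen) k++            # number of distinct species in the HOG
            printf "%.2f\n", k / total
        }
    ' "$2"
}

completeness HOG:1 "$hog_tsv" 5   # all five species are present
completeness HOG:2 "$hog_tsv" 5   # four species present, one of them twice
```

Note that the duplicated gene in HOG:2 does not raise the score, because only the presence of a species counts under this assumption.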

If FastOMA is installed on your local computer and you have a Jupyter Notebook installation, you can dynamically explore the contents of the HOGs in the FastOMA results using the PyHAM library, which is already installed in the FastOMA environment.