FastOMA is a software package for inferring homologous relationships between the protein-coding genes of multiple species, including Hierarchical Orthologous Groups (HOGs). It takes as input several files in FASTA format, each containing the protein sequences of all protein-coding genes in one species' genome (its proteome). It also requires a species tree, which it uses to group homologous genes at each taxonomic level of the species of interest. Find out more about the method in the FastOMA paper.
In this exercise, we will run FastOMA to infer the orthology information for five yeast species. You will need to use our Gitpod instance.
We already provided the proteomes of the five species in the Gitpod environment, located at /workspace/SIBBiodiversityBioinformatics2025/Module4_FastOMA/working_dir/in_folder/proteome.
Another input needed by FastOMA is the species tree. For this exercise, the species tree is provided in Newick format at /workspace/SIBBiodiversityBioinformatics2025/Module4_FastOMA/working_dir/in_folder/species_tree.nwk. It is as follows: (((Yarrowia_lipolytica:1,Saccharomyces_cerevisiae:1)Saccharomycetales:1,(Neosartorya_fumigata:1,Sclerotinia_sclerotiorum:1)leotiomyceta:1)Saccharomyceta:1,Schizosaccharomyces_pombe:1)Ascomycota;. You can visualize the species tree using the phylo.io website.
In the Gitpod, FastOMA is already installed – you should be able to use it after logging into your Gitpod workspace.
Optional (If you are not using Gitpod)
If you want to install FastOMA on your system, you can follow the installation instructions on the FastOMA GitHub page.
If you want to download the proteomes on your own system, check out the following hint:
The UniProt database includes the proteomes of many species. You can download the reference proteomes of the following species from UniProt by clicking on “Download one protein sequence per gene (FASTA)”:
Right-click on “Download one protein sequence per gene (FASTA)” and copy the link. Then, download the file with wget and decompress it with gunzip. For example, for Schizosaccharomyces pombe:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000002485/UP000002485_284812.fasta.gz
gunzip -k UP000002485_284812.fasta.gz
1. In which format are the proteome files?
2. How many proteins are there in the Schizosaccharomyces pombe proteome?
grep ">" in_folder/proteome/Schizosaccharomyces_pombe.fa | wc -l
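To get the count for every proteome at once, the same idea can be wrapped in a small loop. This is only a sketch: the function name count_proteins is ours, and it assumes that every proteome file in the folder ends in .fa, as the S. pombe file above does.

```shell
# count_proteins DIR: print "<file name><TAB><number of sequences>" for each
# FASTA file in DIR. Hypothetical helper; assumes the files end in .fa.
count_proteins() {
    for fasta in "$1"/*.fa; do
        [ -e "$fasta" ] || continue   # nothing matched the pattern
        # Every FASTA record starts with a ">" header line, so count those.
        printf '%s\t%s\n' "$(basename "$fasta")" "$(grep -c '^>' "$fasta")"
    done
}
```

In the exercise it would be called as count_proteins in_folder/proteome. Anchoring the pattern with ^ counts only header lines, even if a ">" character appears elsewhere in a description.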
3. How many leaves are in the species tree? For how many species does the species tree provide evolutionary information?
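If you would like to check your leaf count on the command line, one possible shortcut is to count commas: a Newick tree with n leaves contains exactly n - 1 commas. The helper below is a sketch with a name of our choosing (count_leaves), and it assumes that no species label itself contains a comma.

```shell
# count_leaves FILE: print the number of leaves in the Newick tree in FILE.
# Relies on the fact that a tree with n leaves has n - 1 commas; assumes
# that labels contain no commas (hypothetical helper).
count_leaves() {
    commas=$(tr -cd ',' < "$1" | wc -c)   # keep only the commas, count them
    echo $((commas + 1))
}
```

In the exercise it would be called as count_leaves in_folder/species_tree.nwk.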
The FastOMA algorithm runs in three main steps:
Note that these steps are executed efficiently thanks to our highly parallelized pipeline, implemented in Nextflow. The output of FastOMA is reported in OrthoXML, which is the standard format for HOGs. For more information on HOGs, see Module 1 and also Zahn-Zabal et al. F1000, 2020 (page 4).
First, change to the Module4_FastOMA/working_dir/ directory, which contains the folder in_folder.
cd /workspace/SIBBiodiversityBioinformatics2025/Module4_FastOMA/working_dir/
Then, check whether Nextflow is installed on your system by running nextflow -h. We can now run FastOMA from the command line on the five proteomes in in_folder, using the species tree in in_folder/species_tree.nwk.
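Before launching a run, it can help to confirm that the tool and both inputs are in place. The following is a minimal sketch, not part of FastOMA: the function name preflight is hypothetical, and it only checks for the proteome folder and species_tree.nwk described above.

```shell
# preflight IN_FOLDER [TOOL]: check that TOOL (default: nextflow) is on the
# PATH and that IN_FOLDER holds the proteome folder and the species tree.
# Prints one line per problem; returns 0 only if everything was found.
preflight() {
    in_dir=$1
    tool=${2:-nextflow}
    status=0
    command -v "$tool" >/dev/null 2>&1 || { echo "not on PATH: $tool"; status=1; }
    [ -d "$in_dir/proteome" ] || { echo "missing folder: $in_dir/proteome"; status=1; }
    [ -f "$in_dir/species_tree.nwk" ] || { echo "missing file: $in_dir/species_tree.nwk"; status=1; }
    return "$status"
}
```

From the working directory, preflight in_folder prints nothing when all three checks pass.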
1. What is the command to run FastOMA using a local OMAmer database?
Without the --omamer_db option, FastOMA will download a large database covering the entire OMA database. This is not a problem for most machines, but it could be problematic on Gitpod.
nextflow FastOMA/FastOMA.nf --input_folder in_folder --output_folder out_folder --omamer_db in_folder/omamerdb.h5
Note: if the analysis is interrupted, you can use the -resume flag.
FastOMA is fast, but still takes ~45 minutes to run for these genomes. Due to time constraints, please use the results in /workspace/SIBBiodiversityBioinformatics2025/Module4_FastOMA/expected_output to answer the following questions.
2. Where is the output OrthoXML file?
in_folder. What does it contain?
Recall that Orthologous Groups are groups of strict orthologs, with at most one representative per species. Hierarchical Orthologous Groups are groups of orthologs and paralogs, defined at each taxonomic level.
The output of FastOMA includes 3 folders and 7 additional files:
hogmap: contains the OMAmer results used by FastOMA (Described in the OMAmer Module); each file corresponds to an input proteome.
OrthologousGroupsFasta: contains FASTA files of marker orthologous groups
RootHOGsFasta: contains FASTA files of the sequences in each root HOG
The main orthology results: FastOMA_HOGs.orthoxml, orthologs.tsv.gz, RootHOGs.tsv, OrthologousGroups.tsv
Summary reports: phylostratigraphy.html, report.ipynb, report.html
The transformed input species tree: species_tree_checked.nwk
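To confirm that a run produced everything listed above, the names can be checked in a loop. This is a sketch under the assumption that the output folder uses exactly the names given in this list; the function name check_fastoma_output is ours.

```shell
# check_fastoma_output DIR: report which of the folders and files listed
# above are absent from DIR. Returns 0 only if nothing is missing.
# Hypothetical helper; the names are taken from the list in this section.
check_fastoma_output() {
    missing=0
    for name in hogmap OrthologousGroupsFasta RootHOGsFasta \
                FastOMA_HOGs.orthoxml orthologs.tsv.gz RootHOGs.tsv \
                OrthologousGroups.tsv phylostratigraphy.html \
                report.ipynb report.html species_tree_checked.nwk; do
        if [ ! -e "$1/$name" ]; then
            echo "missing: $name"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, check_fastoma_output out_folder prints nothing and succeeds when all entries are present.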
Orthologous groups
The folder OrthologousGroupsFasta includes FASTA files, in which all proteins inside each file are orthologous to each other. These could be used as marker genes for species tree inference (Module 5).
Note: FastOMA is not deterministic, so the answers to the questions below could slightly change in different runs.
1. How many Orthologous Groups are there?
2. How many genes in total are present in all Orthologous Groups?
Orthologous Groups that have a representative gene in every species can be considered the “core genome” of the species of interest: genes that are conserved in all of the species.
3. How many Orthologous Groups include one representative gene for each species?
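Count the groups in OrthologousGroups.tsv that have five genes. You can count how many times each group appears, and keep only the groups that appear exactly five times, using this command: cat OrthologousGroups.tsv | cut -f 1 | sort | uniq -c | awk '{print $1}' | grep -x 5 | wc -l. The -x option makes grep match the whole line, so only a count of exactly 5 is kept.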
Genes in orthologous groups could also be used for tasks such as resolving a species tree (see Module 5 Estimating a Species Tree).
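The three counting questions above can also be answered in a single pass with awk. The sketch below assumes that OrthologousGroups.tsv is tab-separated, starts with one header line, and has the group ID in the first column with one gene per row; check the file with head before relying on it. The function name summarize_ogs is hypothetical.

```shell
# summarize_ogs FILE N: print the number of groups, the total number of
# genes, and the number of groups with exactly N genes.
# Assumes a tab-separated file with one header line and the group ID in
# column 1 (hypothetical helper).
summarize_ogs() {
    awk -F '\t' -v n="$2" '
        NR > 1 { size[$1]++; genes++ }   # skip the header, tally each group
        END {
            for (g in size) {
                groups++
                if (size[g] == n) complete++
            }
            printf "groups\t%d\n", groups
            printf "genes\t%d\n", genes
            printf "groups_with_%d_genes\t%d\n", n, complete
        }' "$1"
}
```

With five species, summarize_ogs OrthologousGroups.tsv 5 reports the groups that have a representative in every species on its last line.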
4. Which genes are orthologous to the gene A7EQW0_SCLS1 (strict orthology)?
Find the group ID in OrthologousGroups.tsv using $ grep A7EQW0_SCLS1 OrthologousGroups.tsv. Then, search for that group ID, which is in the first column: $ grep "OG_XXXXXXX" OrthologousGroups.tsv.
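The two grep steps can be combined by letting awk read the file twice: once to find the group ID, and once to print every row of that group. This is a sketch that assumes a tab-separated file with the group ID in column 1 and the gene identifier in column 2; the function name og_members is ours.

```shell
# og_members FILE GENE: print every row of the group(s) that contain GENE.
# Assumes tab-separated columns: group ID, then gene identifier
# (hypothetical helper). FILE is read twice.
og_members() {
    awk -F '\t' -v gene="$2" '
        # First pass: remember the ID of each group whose gene column
        # contains GENE (substring match, like grep).
        NR == FNR { if (index($2, gene)) hit[$1] = 1; next }
        # Second pass: print all rows that belong to a remembered group.
        $1 in hit
    ' "$1" "$1"
}
```

For example: og_members OrthologousGroups.tsv A7EQW0_SCLS1. The same function should work on any file with this two-column layout.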
Root HOGs
The file RootHOGs.tsv and the RootHOGsFasta folder contain information about the highest level of HOGs included in the OrthoXML file. Unlike Orthologous Groups, Root HOGs represent families of genes that all descend from a single gene in the common ancestor of the species represented in the dataset. Because such genes may have undergone duplication during their evolutionary history, a Root HOG may contain more than one gene per species, which differentiates it from an Orthologous Group.
RootHOGs may be used to help resolve the evolutionary history of a certain gene family as they should contain all the homologs of the gene family.
5. How many root HOGs are in the HOG output file?
Each line in RootHOGs.tsv denotes a gene. You can count the distinct root HOGs using this command: cat RootHOGs.tsv | cut -f 1 | sort | uniq -c | wc -l
Note that the first line is the header, so the answer is the output value minus 1.
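To avoid the manual subtraction, the header can be dropped before counting. A small sketch, with a function name of our choosing, assuming exactly one header line and the root HOG ID in the first tab-separated column:

```shell
# count_root_hogs FILE: print the number of distinct root HOG IDs.
# Assumes one header line and the ID in column 1 (hypothetical helper).
count_root_hogs() {
    # tail -n +2 starts at line 2, which skips the header.
    tail -n +2 "$1" | cut -f 1 | sort -u | wc -l
}
```

For example: count_root_hogs RootHOGs.tsv.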
6. Consider the gene “60S ribosomal protein L15-A” in Schizosaccharomyces pombe with protein ID: RL15A_SCHPO. How many proteins are in the gene family (for these 5 species of interest)? What are their identifiers?
Find the root HOG ID in RootHOGs.tsv using $ grep RL15A_SCHPO RootHOGs.tsv. Then, search for that ID, which is in the first column: $ grep "HOG:XXXXXXX" RootHOGs.tsv
7. With regard to the previous question, which species have more than one protein in the HOG? What does this mean?
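Once you know the root HOG ID, the number of genes per species can be tallied with awk. This sketch rests on two assumptions you should verify against your file: the protein identifier is in the second tab-separated column, and it ends in a UniProt-style species mnemonic after the last underscore (as in RL15A_SCHPO). The function name hog_species_counts is hypothetical.

```shell
# hog_species_counts FILE HOG_ID: print "<species mnemonic><TAB><count>"
# for the genes of HOG_ID. Assumes column 1 is the root HOG ID and that
# column 2 ends in "_<MNEMONIC>" (hypothetical helper).
hog_species_counts() {
    awk -F '\t' -v hog="$2" '
        $1 == hog {
            n = split($2, part, "_")   # mnemonic = text after the last "_"
            count[part[n]]++
        }
        END { for (s in count) printf "%s\t%d\n", s, count[s] }
    ' "$1" | sort
}
```

A species with a count above one has several members of the gene family in that root HOG.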
Phylostratigraphy
By reconstructing HOGs, FastOMA also models the evolutionary histories of gene families: at which taxonomic level they are gained, lost, and duplicated. The results are contained in the phylostratigraphy files. See a phylostratigraphy example in Train et al. Fig.1B.
9. How many genes are duplicated at the level of Saccharomyceta?
Download the phylostratigraphy.html file in the out_folder to your computer by navigating to the file in the Gitpod workspace explorer on the left-hand side of the browser, right-clicking, and selecting download. Open the phylostratigraphy.html file in a browser.
10. Which species shows the highest number of lineage-specific gene gains (potential genetic innovations)?
Report from FastOMA
FastOMA produces a report in HTML format (report.html) with information about the input proteomes and statistics about the output. This report can also be explored as a Jupyter Notebook (report.ipynb).
11. Which species has the most proteins in its proteome?
Download the report.html file in the same manner and check the section "Stats on input dataset".
The report also contains basic statistics about HOGs in the dataset.
12. What is the size of the HOG with the most members? What does it mean to have so many members?
Check the report.html file.
FastOMA also computes a "Completeness Score," which indicates how many of the species below the defined taxonomic level are found in a given HOG. A high Completeness Score indicates that genes have been found in all or nearly all species in the clade, which typically means a high-confidence HOG.
13. What is the maximum value of the Completeness Score?
14. Are there many HOGs with a high Completeness Score?