To compute a species tree using concatenated alignments, we need the sequences of orthologous genes for each species. If the species of interest are all in OMA, it is easy to get this information through the browser.
For this exercise, we are interested in the phylogeny of 20 yeast species in OMA. You will need to download OMA Groups present in most of the species to have a dataset for species tree estimation. In the OMA browser, go to "Download > Export Marker Genes".
2. In which clade was their last common ancestor?
3. You can vary two parameters for this step: number of genes you wish to export (too many increases the required computation) and the proportion of species that must be in a group. Set the number of gene to -1 (no limit) amd chose 0.5, 0.70, 0.9 and 1.0 as proportion of species. How many groups of genes do you have in each case?
4. What would be the best choice?
To compute a species tree, the sequences of orthologous genes of each species is needed. If the species of interest are not all in OMA, you have to run OMA Standalone to compute the marker gene you’ll need. However, you can still use the computations done from the species in OMA to speed up the process.
For this exercise, you are interested in the phylogeny of 2 species hypothetically not in OMA (either you just sequenced it or it was not included in the browser) and 18 yeasts species in OMA. As orthology between the 18 species were already computed, you need to download their all-vs-all data. In the OMA Browser, go to Download > Export All-All
2.Examine the contents of the archive. We have exported the precomputed all-by-all for our 18 fungal genomes in order to save time when running OMA standalone. The genomes are stored in the DB/ folder. Which species has the biggest genome and how many predicted proteins are in the file?
grep
doc to select the header lines (starting with ">") and count them.Bio.SeqIO
or similar.3. In what folder are the precomputed all-by-all alignments stored in?
4. Now we want to add our genomes. Add the following fungal genome to your dataset: YEAST.fa and SPOMBE.fa
Now run OMA standalone on the 20 fungal genomes.
Before starting the computation, it is wise to adjust the parameters file, called “parameters.drw”, which can be edited with any text editor. If the goal is to only generate a dataset for species phylogeny inference (and not keeping other unrelated orthology inferences), one can avoid doing computations and generating output files that are not needed by the following:
To run the software, type the following command on the command line from within the OMA.2.4.2 folder:
>bin/oma
If not done on a parallel framework, this can take hours. To speed up time, you can get the completed file by downloading this Figshare archive, under Protocol_2/data/OMA.2.3.1
5. OMA has estimated a rough species tree from orthologous groups, using a distance-based method. (If you know the phylogeny of the species, you can give a predefined species tree in the parameters file. The is written in the newick format. How do you read this format?
((species A, species B), species C);
would have the species A and B together on a terminal branch, and C splitting off before.
6. You can visualize the tree by pasting the newick line into phylo.io. Examine the tree. Where do you think it should be rooted? Keep this tree aside, we will discuss it in more depth in the ‘Visualizing and interpreting tree’ module.
7. Not trusting this lowly distance tree reconstructed by OMA standalone, you set out to perform a proper tree inference. For that you’ll need orthologous groups containing genes from each genome. Go to the OrthologousGroups : How many are there?
8. Optional : Use the Python script provided to obtain groups with at least 100% species (Click the hint button for the command line). How many are there now?
python filter_groups.py --min-nr-species <MIN_NR_SPECIES> --input Output/OrthologousGroupFasta/ --output <destination/directory>
With marker genes obtained, for example, from the OMA Browser or from an OMA standalone computation, one way to obtain the species phylogeny is to align all sequences within each orthologous group, before combining these alignments into a single concatenated alignment: this is called a supermatrix and is then used as input to tree estimation software to obtain the species tree.
Note: this exercise requires multiple pieces of command line software, as well as Python scripts. If you wish to proceed with this exercise you will need to install these on your computer. Links will take you to the required software. If not, please go directly to the "Visualising and Understanding Trees" section.
1. First, we need to align the sequences within the orthologous groups. For this, it is possible to use any software for multiple sequences alignment, such as MAFFT. For example, this command should work from within the directory containing the FASTA files for the marker genes:
$ for i in $(ls -1 *.fa); do mafft --maxiterate 1000 --localpair $i > $i.aln; done </code>
Note : Computation can take time. If you do not wish to wait for the results, you can find example alignment files in this Figshare archive (under Protocol1/Data/Alignment) for example.
2. Open one of the resulting aligment files. What format is it in?
3. The next step is to concatenate the alignments into the supermatrix. A Python script (concat_alignments.py
) is available in this git repository to perform this. This can be run as so:
$ python concat_alignments.py <path>/<to>/<alignments>/*aln --format-output [fasta/phylip] > output
What size is the new, concatenated alignment?
$ raxmlHPC -f a -m PROTGAMMALG -p 12345 -x 12345 -# 100 -s MSA.fa -n tree.nwk
This should result in a tree in the newick format.
Note : Computation can take time. If you do not wish to wait for the results, you can find example alignment files in this Figshare archive (under Protocol1/trees/RAxML/)RAxML_bipartitionsBranchLabels.100.phylo_4 for example)
We will analyse this tree in the next module - "Visualising and Understanding Trees".
Once a phylogeny has been obtained, it is important to have the tools to visualise it, as well as the difficulties in interpretation. This is particularly the case because, even if they are displayed in an unambiguous way, uncertainty may exist in the topology.
For this section, we will use the tree you obtained in the last module. If you skipped this part, please use the RAxML tree available in FigShare under Protocol_1 (Protocol_1/trees/RAXML/RAxML_bipartitionsBranchLabels.100.phylo_4).
We will visualise the tree through a browser on the phylo.io website. Copy the newick line into the text panel.
1. Unless an outgroup was specified in the parameters, undirected trees are often arbitrarily or approximately rooted. What is rooting?
2. An outgroup is sometimes included in the analysis, which can then be used to choose the root by maintaining maximum distance between the outgroup and the rest of the species. Here we do not have any outgroup species, how can we choose the root?
3. After rooting, we can see the topology of the tree. However, we do not have equal certainty in each of the branches. Click on "Setting > Branch label". What is a bootstrap value?
4. What branches are the least certain?
5. The branch lengths give information about the distance between species under the topology. You can show the actual value of the length under "Setting > Branch length". What species are the closest?
Protocol_2/data/OMA.2.3.1/Output/EstimatedSpeciesTree.nwk
, from FigShare. What is different?