For programmatic access, we provide an easy to use REST API documentation. We have also released libraries for both Python and R, in order to provide a native experience to those fluent in those languages. For those particularly interested in semantic web technologies - we also provide an Resource Description Framework (RDF) representation of the OMA data, which can be queried using the RDF query language SPARQL. This is, however, outside the scope of these exercises - for more information, see here. These exercises can be performed in the language of your choice, however the solutions will be provided in Python. The R package is available on Bioconductor, whilst the Python module is available from PyPI and installable using
pip install omadb
using a terminal. For these exercises we will also use the extra parts of PyOMADB for HOG analysis and GO enrichment analyses. To ensure you install the required extra packages, use the command
pip install "omadb[hog_analysis,goea]"
After installing the package, we can load it in Python (for instance, within a Jupyter Notebook), as follows
from omadb import Client
c = Client()
Note: there are a number of examples using the PyOMADB package available here, which may help when completing these exercises.
Your colleague provides you with the following excerpt of a protein sequence
SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN
Unfortunately they have misplaced the sequence identifier, but can remember that it was a Human protein. Using the API, identify which OMA entry this is.
Note: OMA IDs contain the UniProt 5 character species code as the first 5 characters of the ID.
1. How many targets contain this excerpt? How were these identified?
unknown_seq = 'SCHTGLRRTAGWNVPIGTLRPFLNWTGPPEPIEAAVARFFSASCVPGADKGQFPNLCRLCAGTGEN'
results = c.proteins.search(unknown_seq)
print('Number of results:', len(results.targets))
print('Identified by:', results['identified_by'])
2. What is the OMA ID of the human protein which contains this subsequence?
human_ids = list(filter(lambda target: target.omaid.startswith('HUMAN'), results.targets))
print('Number of human entries:', len(human_ids))
human_entry = human_ids[0]
print('OMA ID:', human_entry.omaid)
3. How many orthologs does this entry have?
print('Number of orthologs:', len(human_entry.orthologs))
4. How many species have a protein member of the OMA Group of which this protein is part?
og = human_entry.oma_group_url
print('OMA group: ', og.group_nr)
print('Fingerprint:', og.fingerprint)
species = {m.omaid[:5] for m in og.members}
print('Number of species:', len(species))
5. Which HOG is this protein a member of? What is the top level family ID?
print('HOG ID:', human_entry.oma_hog_id)
print('HOG level:', human_entry.oma_hog_members.level)
print('Top-level (root) HOG ID:', human_entry.roothog_id)
In this exercise, we will retrieve the Gene Ontology (GO) annotations for the members of a particular gene family, at the Eukaryota level. We can then perform a GO enrichment analysis (GOEA) in order to see if a sub-family, defined at a lower level (Amniota), is enriched in any particular GO terms.
This will focus on the gene family (top level HOG) found in the previous exercise
1. Identify the members of the HOG that the protein in the previous exercise is part of.
hog_members = human_entry.oma_hog_members.members
2. Generate a list of OMA IDs for these members. This is the foreground set.
foreground = [m.omaid for m in hog_members]
3. Generate a list of OMA IDs for the gene family (HOG) defined at the Eukarayota level. This defines the background set of genes in our GOEA.
euk_hog_members = c.hogs.members('TRFL_HUMAN', level='Eukaryota')
background = [m.omaid for m in euk_hog_members]
We can use the GOATOOLs library to perform the GOEA. There is handy functionality in the PyOMADB library to load this automatically for us - it can even filter to a desired aspect of the GO. For instance, the following loads for the biological process aspect only.
goea = c.entries.gene_ontology(background, # Background list of IDs
aspect='BP', # Filtering (optional)
as_goatools=True, # To return GOATOOLS GOEA object
progress=True, # Show progress of GO retrieval
methods=['fdr_bh']) # Passed to GOATOOLS (methods for GOEA)
4. Run the GOEA, on the biological process annotations, using the foreground selected earlier.
import pandas as pd
res = pd.DataFrame.from_records((dict(zip(z.get_prtflds_default(),
z.get_field_values(z.get_prtflds_default())))
for z in results))
results = goea.run_study(foreground)
Using the OMA API, we have access to the pairwise relationships that OMA computes. In the Python client, this is available through the Client.pairwise
method. In this exercise, we will compare the distribution of the evolutionary distance (in PAM units) between the 1:1 pairwise orthologs between human and Mus musculus (Mouse) as well as human and Canis lupus familiaris (Dog).
Hint: If you find this difficult, look at example 4 here (for R), or the notebook of examples (for PyOMADB).
1. Do you expect humans to be more similar to mice or dogs?
2. Load the 1:1 pairwise orthologs between human and mouse.
import numpy as np
human_mouse_iter = c.pairwise(genome_id1='HUMAN',
genome_id2='MOUSE',
rel_type='1:1',
progress=True)
human_mouse = np.array([x.distance for x in human_mouse_iter])
3. Repeat, loading the distances between 1:1 orthologs for human and dog.
human_dog_iter = c.pairwise(genome_id1='HUMAN',
genome_id2='CANLF',
rel_type='1:1',
progress=True)
human_dog = np.array([x.distance for x in human_dog_iter])
4. Using a plotting library (for instance, seaborn in Python), plot the two distributions PAM distances.
%matplotlib inline
import seaborn as sns
sns.set()
ax = sns.distplot(human_dog, label='Canis lupus familiaris')
ax = sns.distplot(human_mouse, label='Mus musculus', ax=ax)
ax.set_title('Distribution of Evolutionary Distance between\n1:1 Orthologous Pairs HUMAN-{CANLF, MOUSE}')
ax.set_xlabel('Evolutionary Distance (PAM)')
ax.set_ylabel('Density')
ax.legend()
5. Look at the median and mean of the two samples (human - mouse and human - dog)
print('Median:',
'\nHUMAN-MOUSE', np.median(human_mouse),
'\nHUMAN-CANLF', np.median(human_dog))
print('Mean:'
'\nHUMAN-MOUSE', np.mean(human_mouse),
'\nHUMAN-CANLF', np.mean(human_dog))
6. Use the two-sample Kolmogorov-Smirnov test, to test whether the two samples are drawn from the same distribution.
scipy.stats.ks_2samp
.
from scipy import stats
z = stats.ks_2samp(human_mouse, human_dog)
print('p-value:', z.pvalue)
7. Are the results to (5) and (6) what you expected in (1)? Does it correspond with the literature?