Basic usage#
This notebook shows some basic usage of the genomic-features package.
import genomic_features as gf
Retrieving Ensembl gene annotations#
We can load annotation tables using genomic_features.ensembl.annotation().
ensdb = gf.ensembl.annotation(species="Hsapiens", version="108")
ensdb
EnsemblDB(organism='Homo sapiens', ensembl_release='108')
These tables have been created for the ensembldb Bioconductor package [RGW19], and are automatically downloaded and cached from the AnnotationHub resource.
We can check which Ensembl versions are available for a given species using the genomic_features.ensembl.list_ensdb_annotations() utility function.
gf.ensembl.list_ensdb_annotations(species="Mmusculus")
| | Species | Ensembl_version |
|---|---|---|
| 37 | Mmusculus | 87 |
| 105 | Mmusculus | 88 |
| 173 | Mmusculus | 89 |
| 247 | Mmusculus | 90 |
| 330 | Mmusculus | 91 |
| 419 | Mmusculus | 92 |
| 510 | Mmusculus | 93 |
| 621 | Mmusculus | 94 |
| 748 | Mmusculus | 95 |
| 894 | Mmusculus | 96 |
| 1062 | Mmusculus | 97 |
| 1239 | Mmusculus | 98 |
| 1429 | Mmusculus | 99 |
| 1648 | Mmusculus | 100 |
| 1875 | Mmusculus | 101 |
| 2118 | Mmusculus | 102 |
| 2361 | Mmusculus | 103 |
| 2604 | Mmusculus | 104 |
| 2847 | Mmusculus | 105 |
| 3089 | Mmusculus | 106 |
| 3332 | Mmusculus | 107 |
| 3575 | Mmusculus | 108 |
| 3863 | Mmusculus | 109 |
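A listing like this can be used to pick the most recent release programmatically. The sketch below works on a hand-built stand-in for the returned table (only three of the rows above are reproduced); whether the real `Ensembl_version` column holds integers or strings is not stated here, so the integer cast is a defensive assumption.

```python
import pandas as pd

# Hypothetical stand-in for the table returned by list_ensdb_annotations();
# only a few of the rows shown above are reproduced.
versions = pd.DataFrame(
    {
        "Species": ["Mmusculus", "Mmusculus", "Mmusculus"],
        "Ensembl_version": [99, 108, 109],
    }
)

# Cast to int before comparing, so that "99" would not sort after "109"
# if the column happened to hold strings.
latest = int(versions["Ensembl_version"].astype(int).max())

# annotation() was called with a string version earlier, so convert back.
latest_version = str(latest)
print(latest_version)
```

The resulting string could then be passed as the `version` argument, in the same way as `version="108"` above.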
Using EnsemblDB objects#
The genomic_features.ensembl.EnsemblDB class is the access point to an annotation. It is an interface to a SQLite database retrieved from AnnotationHub (as shown above). Provenance information can be accessed via the metadata attribute:
ensdb.metadata
{'Db type': 'EnsDb',
'Type of Gene ID': 'Ensembl Gene ID',
'Supporting package': 'ensembldb',
'Db created by': 'ensembldb package from Bioconductor',
'script_version': '0.3.7',
'Creation time': 'Fri Oct 28 05:24:43 2022',
'ensembl_version': '108',
'ensembl_host': 'localhost',
'Organism': 'Homo sapiens',
'taxonomy_id': '9606',
'genome_build': 'GRCh38',
'DBSCHEMAVERSION': '2.2'}
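Since the annotation lives in SQLite, provenance fields like these can be pictured as rows of a key-value table. The sketch below is not how genomic-features reads its metadata; it builds an in-memory database with the standard library, and the table name `metadata` and the column names `name` and `value` are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for an annotation database file. The table name
# "metadata" and its "name"/"value" columns are assumptions for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (name TEXT, value TEXT)")
con.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [
        ("Organism", "Homo sapiens"),
        ("ensembl_version", "108"),
        ("genome_build", "GRCh38"),
    ],
)

# Each row is a (name, value) pair, so the cursor converts directly to a dict.
metadata = dict(con.execute("SELECT name, value FROM metadata"))
con.close()

print(metadata["genome_build"])
```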
The database can be queried for genomic features:
genes = ensdb.genes()
genes.head()
| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version | canonical_transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000000003 | TSPAN6 | protein_coding | 100627108 | 100639991 | X | -1 | chromosome | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | ENSG00000000003.15 | ENST00000373020 |
| 1 | ENSG00000000005 | TNMD | protein_coding | 100584936 | 100599885 | X | 1 | chromosome | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | ENSG00000000005.6 | ENST00000373031 |
| 2 | ENSG00000000419 | DPM1 | protein_coding | 50934867 | 50959140 | 20 | -1 | chromosome | dolichyl-phosphate mannosyltransferase subunit... | ENSG00000000419.14 | ENST00000371588 |
| 3 | ENSG00000000457 | SCYL3 | protein_coding | 169849631 | 169894267 | 1 | -1 | chromosome | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | ENSG00000000457.14 | ENST00000367771 |
| 4 | ENSG00000000460 | C1orf112 | protein_coding | 169662007 | 169854080 | 1 | 1 | chromosome | chromosome 1 open reading frame 112 [Source:HG... | ENSG00000000460.17 | ENST00000359326 |
Filters#
genomic_features.filters defines a number of filters to use with these annotations. You can filter by specific columns:
ensdb.genes(filter=gf.filters.GeneBioTypeFilter("Mt_tRNA")).head()
| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version | canonical_transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000209082 | MT-TL1 | Mt_tRNA | 3230 | 3304 | MT | 1 | chromosome | mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... | ENSG00000209082.1 | ENST00000386347 |
| 1 | ENSG00000210049 | MT-TF | Mt_tRNA | 577 | 647 | MT | 1 | chromosome | mitochondrially encoded tRNA-Phe (UUU/C) [Sour... | ENSG00000210049.1 | ENST00000387314 |
| 2 | ENSG00000210077 | MT-TV | Mt_tRNA | 1602 | 1670 | MT | 1 | chromosome | mitochondrially encoded tRNA-Val (GUN) [Source... | ENSG00000210077.1 | ENST00000387342 |
| 3 | ENSG00000210100 | MT-TI | Mt_tRNA | 4263 | 4331 | MT | 1 | chromosome | mitochondrially encoded tRNA-Ile (AUU/C) [Sour... | ENSG00000210100.1 | ENST00000387365 |
| 4 | ENSG00000210107 | MT-TQ | Mt_tRNA | 4329 | 4400 | MT | -1 | chromosome | mitochondrially encoded tRNA-Gln (CAA/G) [Sour... | ENSG00000210107.1 | ENST00000387372 |
Or by genomic range:
ensdb.genes(filter=gf.filters.GeneRangesFilter("1:10000-20000"))
| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version | canonical_transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000223972 | DDX11L1 | transcribed_unprocessed_pseudogene | 12010 | 13670 | 1 | 1 | chromosome | DEAD/H-box helicase 11 like 1 (pseudogene) [So... | ENSG00000223972.6 | ENST00000450305 |
| 1 | ENSG00000227232 | WASH7P | unprocessed_pseudogene | 14404 | 29570 | 1 | -1 | chromosome | WASP family homolog 7, pseudogene [Source:HGNC... | ENSG00000227232.5 | ENST00000488147 |
| 2 | ENSG00000278267 | MIR6859-1 | miRNA | 17369 | 17436 | 1 | -1 | chromosome | microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:5... | ENSG00000278267.1 | ENST00000619216 |
| 3 | ENSG00000290825 | DDX11L2 | lncRNA | 11869 | 14409 | 1 | 1 | chromosome | DEAD/H-box helicase 11 like 2 (pseudogene) [So... | ENSG00000290825.1 | ENST00000456328 |
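Note that WASH7P ends at 29570, beyond the requested end of 20000, which suggests that genes are matched when they overlap the range rather than when they are fully contained in it. This is an inference from the output above, not a statement about the library's implementation. A minimal standard-library sketch of that overlap test, with hypothetical helper names `parse_range` and `overlaps`:

```python
import re
from typing import NamedTuple


class GenomicRange(NamedTuple):
    seq_name: str
    start: int
    end: int


def parse_range(text: str) -> GenomicRange:
    """Parse a 'seq_name:start-end' string such as '1:10000-20000'."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", text)
    if match is None:
        raise ValueError(f"not a valid range: {text!r}")
    seq_name, start, end = match.groups()
    return GenomicRange(seq_name, int(start), int(end))


def overlaps(gene_start: int, gene_end: int, query: GenomicRange) -> bool:
    """True if the two closed intervals share at least one position."""
    return gene_start <= query.end and gene_end >= query.start


query = parse_range("1:10000-20000")

# (gene_name, gene_seq_start, gene_seq_end) copied from the table above.
genes_on_chr1 = [
    ("DDX11L1", 12010, 13670),
    ("WASH7P", 14404, 29570),
    ("MIR6859-1", 17369, 17436),
    ("DDX11L2", 11869, 14409),
]
hits = [name for name, start, end in genes_on_chr1 if overlaps(start, end, query)]
print(hits)
```

All four genes pass the overlap test, including WASH7P, whereas a containment test would have excluded it.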
Logical operations (&, |, and ~) on filters are also possible:
ensdb.genes(
    filter=gf.filters.GeneBioTypeFilter("lncRNA")
    & gf.filters.GeneRangesFilter("1:10000-20000")
)
| | gene_id | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version | canonical_transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENSG00000290825 | DDX11L2 | lncRNA | 11869 | 14409 | 1 | 1 | chromosome | DEAD/H-box helicase 11 like 2 (pseudogene) [So... | ENSG00000290825.1 | ENST00000456328 |
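In Python, this kind of composition is typically built on the `__and__`, `__or__`, and `__invert__` special methods. The sketch below shows the general pattern with a hypothetical `Predicate` class operating on plain dictionaries; it is not a genomic_features class and makes no claim about how the library's filters are implemented.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict  # one annotation record, e.g. {"gene_biotype": "lncRNA", ...}


@dataclass(frozen=True)
class Predicate:
    """Hypothetical composable filter; not part of genomic_features."""

    test: Callable[[Row], bool]

    def __and__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda row: self.test(row) and other.test(row))

    def __or__(self, other: "Predicate") -> "Predicate":
        return Predicate(lambda row: self.test(row) or other.test(row))

    def __invert__(self) -> "Predicate":
        return Predicate(lambda row: not self.test(row))

    def __call__(self, row: Row) -> bool:
        return self.test(row)


is_lncrna = Predicate(lambda row: row["gene_biotype"] == "lncRNA")
starts_before_12k = Predicate(lambda row: row["gene_seq_start"] < 12000)

# A few records with values taken from the range-filter table above.
rows = [
    {"gene_name": "DDX11L1", "gene_biotype": "transcribed_unprocessed_pseudogene", "gene_seq_start": 12010},
    {"gene_name": "DDX11L2", "gene_biotype": "lncRNA", "gene_seq_start": 11869},
    {"gene_name": "MIR6859-1", "gene_biotype": "miRNA", "gene_seq_start": 17369},
]

both = [r["gene_name"] for r in rows if (is_lncrna & starts_before_12k)(r)]
neither = [r["gene_name"] for r in rows if (~(is_lncrna | starts_before_12k))(r)]
print(both, neither)
```

Because `~` binds more tightly than `&`, and `&` more tightly than `|`, explicit parentheses are the safest way to state the intended grouping when combining several filters.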
Column selectors#
Using the cols argument, you can get annotations from other tables in the database.
ensdb.genes(cols=["gene_id", "tx_id", "gene_name", "protein_id", "uniprot_id"]).head()
| | gene_id | tx_id | gene_name | protein_id | uniprot_id |
|---|---|---|---|---|---|
| 0 | ENSG00000000003 | ENST00000373020 | TSPAN6 | ENSP00000362111 | O43657.176 |
| 1 | ENSG00000000003 | ENST00000612152 | TSPAN6 | ENSP00000482130 | A0A087WYV6.48 |
| 2 | ENSG00000000003 | ENST00000614008 | TSPAN6 | ENSP00000482894 | A0A087WZU5.51 |
| 3 | ENSG00000000005 | ENST00000373031 | TNMD | ENSP00000362122 | Q9H2S6.148 |
| 4 | ENSG00000000419 | ENST00000466152 | DPM1 | ENSP00000507119 | A0A804HIK9.2 |
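Requesting columns from other tables can repeat a gene across several rows, as with the three TSPAN6 transcripts above. A short pandas sketch of two common follow-ups, counting transcripts per gene and collapsing back to one row per gene, using a hand-copied subset of the table (the `uniprot_id` and `protein_id` columns are omitted for brevity):

```python
import pandas as pd

# Rows copied from the table above; some columns omitted.
annot = pd.DataFrame(
    {
        "gene_id": [
            "ENSG00000000003",
            "ENSG00000000003",
            "ENSG00000000003",
            "ENSG00000000005",
            "ENSG00000000419",
        ],
        "tx_id": [
            "ENST00000373020",
            "ENST00000612152",
            "ENST00000614008",
            "ENST00000373031",
            "ENST00000466152",
        ],
        "gene_name": ["TSPAN6", "TSPAN6", "TSPAN6", "TNMD", "DPM1"],
    }
)

# Number of distinct transcripts listed for each gene.
tx_per_gene = annot.groupby("gene_id")["tx_id"].nunique()

# Keep only the first row for each gene.
one_per_gene = annot.drop_duplicates(subset="gene_id")

print(tx_per_gene)
print(len(one_per_gene))
```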
chromosomes()#
Chromosome lengths for this annotation (useful for downstream operations) are also available through the chromosomes() method.
ensdb.chromosomes()
| | seq_name | seq_length | is_circular |
|---|---|---|---|
| 0 | X | 156040895 | 0 |
| 1 | 20 | 64444167 | 0 |
| 2 | 1 | 248956422 | 0 |
| 3 | 6 | 170805979 | 0 |
| 4 | 3 | 198295559 | 0 |
| ... | ... | ... | ... |
| 452 | LRG_741 | 231167 | 0 |
| 453 | LRG_763 | 176286 | 0 |
| 454 | LRG_792 | 42144 | 0 |
| 455 | LRG_793 | 38439 | 0 |
| 456 | LRG_93 | 22459 | 0 |
457 rows × 3 columns
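One example of such a downstream operation is extending a gene interval by a flanking window without running past the chromosome ends. The sketch below uses a hypothetical helper `flank` and assumes 1-based, inclusive coordinates; the lengths are copied from the table above, and the second interval is made up for illustration.

```python
# seq_name -> seq_length, copied from the table above.
chrom_lengths = {"X": 156040895, "20": 64444167, "1": 248956422}


def flank(seq_name, start, end, width, lengths):
    """Extend [start, end] by `width` on each side, clipped to the chromosome.

    Assumes 1-based, inclusive coordinates.
    """
    new_start = max(1, start - width)
    new_end = min(lengths[seq_name], end + width)
    return new_start, new_end


# DDX11L1 spans 12010-13670 on chromosome 1; a 20 kb flank would start
# before position 1, so the start is clipped.
near_start = flank("1", 12010, 13670, 20000, chrom_lengths)

# A made-up interval close to the end of chromosome 20; the end is clipped
# to the chromosome length.
near_end = flank("20", 64444000, 64444100, 500, chrom_lengths)

print(near_start, near_end)
```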
Adding annotations to an AnnData object#
import pandas as pd
import scanpy as sc
pbmc = sc.datasets.pbmc3k()
pbmc.var.head()
| index | gene_ids |
|---|---|
| MIR1302-10 | ENSG00000243485 |
| FAM138A | ENSG00000237613 |
| OR4F5 | ENSG00000186092 |
| RP11-34P13.7 | ENSG00000238009 |
| RP11-34P13.8 | ENSG00000239945 |
pbmc.var = pd.merge(
    (
        pbmc.var.reset_index().rename(
            columns={"gene_ids": "gene_id", "index": "orig_gene_name"}
        )
    ),
    genes,
    on="gene_id",
    how="left",
).set_index("gene_id")
pbmc.var.head()
| gene_id | orig_gene_name | gene_name | gene_biotype | gene_seq_start | gene_seq_end | seq_name | seq_strand | seq_coord_system | description | gene_id_version | canonical_transcript |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000243485 | MIR1302-10 | MIR1302-2HG | lncRNA | 29554.0 | 31109.0 | 1 | 1.0 | chromosome | MIR1302-2 host gene [Source:HGNC Symbol;Acc:HG... | ENSG00000243485.5 | ENST00000473358 |
| ENSG00000237613 | FAM138A | FAM138A | lncRNA | 34554.0 | 36081.0 | 1 | -1.0 | chromosome | family with sequence similarity 138 member A [... | ENSG00000237613.2 | ENST00000417324 |
| ENSG00000186092 | OR4F5 | OR4F5 | protein_coding | 65419.0 | 71585.0 | 1 | 1.0 | chromosome | olfactory receptor family 4 subfamily F member... | ENSG00000186092.7 | ENST00000641515 |
| ENSG00000238009 | RP11-34P13.7 | | lncRNA | 89295.0 | 133723.0 | 1 | -1.0 | chromosome | novel transcript | ENSG00000238009.6 | ENST00000477740 |
| ENSG00000239945 | RP11-34P13.8 | | lncRNA | 89551.0 | 91105.0 | 1 | -1.0 | chromosome | novel transcript | ENSG00000239945.1 | ENST00000495576 |
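The coordinates now display as floats (for example 29554.0). With `how="left"`, pandas keeps every row of the left table and fills unmatched rows with NaN, which upcasts integer columns to float. The toy sketch below reproduces that effect and uses the `indicator` and `validate` options of `pd.merge` to list identifiers without a match; the second gene ID is a made-up placeholder.

```python
import pandas as pd

# Left table: one real ID from above and one made-up ID with no annotation.
var = pd.DataFrame(
    {
        "gene_id": ["ENSG00000243485", "ENSG_RETIRED"],
        "orig_gene_name": ["MIR1302-10", "hypothetical"],
    }
)

# Right table: a single annotated gene with an integer coordinate.
annotation = pd.DataFrame(
    {"gene_id": ["ENSG00000243485"], "gene_seq_start": [29554]}
)

merged = pd.merge(
    var,
    annotation,
    on="gene_id",
    how="left",
    indicator=True,          # adds a "_merge" column describing each row's origin
    validate="many_to_one",  # raises if gene_id is duplicated in `annotation`
)

# IDs present in the left table but missing from the annotation.
unmatched = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()

print(merged["gene_seq_start"].dtype)
print(unmatched)
```

Checking the unmatched IDs in this way can help spot genes whose identifiers are absent from the chosen Ensembl release.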