Basic usage#

This notebook shows some basic usage of the genomic-features package.

import genomic_features as gf

Retrieving Ensembl gene annotations#

We can load annotation tables using genomic_features.ensembl.annotation().

ensdb = gf.ensembl.annotation(species="Hsapiens", version="108")
ensdb
EnsemblDB(organism='Homo sapiens', ensembl_release='108')

These tables have been created for the ensembldb Bioconductor package [RGW19], and are automatically downloaded and cached from the AnnotationHub resource.

We can check which Ensembl versions are available for different species using the genomic_features.ensembl.list_ensdb_annotations() util.

gf.ensembl.list_ensdb_annotations(species="Mmusculus")
Species Ensembl_version
37 Mmusculus 87
105 Mmusculus 88
173 Mmusculus 89
247 Mmusculus 90
330 Mmusculus 91
419 Mmusculus 92
510 Mmusculus 93
621 Mmusculus 94
748 Mmusculus 95
894 Mmusculus 96
1062 Mmusculus 97
1239 Mmusculus 98
1429 Mmusculus 99
1648 Mmusculus 100
1875 Mmusculus 101
2118 Mmusculus 102
2361 Mmusculus 103
2604 Mmusculus 104
2847 Mmusculus 105
3089 Mmusculus 106
3332 Mmusculus 107
3575 Mmusculus 108
3863 Mmusculus 109

Using EnsemblDB objects#

The genomic_features.ensembl.EnsemblDB is the access point to an annotation. This is an interface to a sqlite table retrieved from AnnotationHub (as shown above). Information on provenance can be accessed via the metadata attribute:

ensdb.metadata
{'Db type': 'EnsDb',
 'Type of Gene ID': 'Ensembl Gene ID',
 'Supporting package': 'ensembldb',
 'Db created by': 'ensembldb package from Bioconductor',
 'script_version': '0.3.7',
 'Creation time': 'Fri Oct 28 05:24:43 2022',
 'ensembl_version': '108',
 'ensembl_host': 'localhost',
 'Organism': 'Homo sapiens',
 'taxonomy_id': '9606',
 'genome_build': 'GRCh38',
 'DBSCHEMAVERSION': '2.2'}

And can be queried for genomic features:

genes = ensdb.genes()
genes.head()
gene_id gene_name gene_biotype gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system description gene_id_version canonical_transcript
0 ENSG00000000003 TSPAN6 protein_coding 100627108 100639991 X -1 chromosome tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] ENSG00000000003.15 ENST00000373020
1 ENSG00000000005 TNMD protein_coding 100584936 100599885 X 1 chromosome tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] ENSG00000000005.6 ENST00000373031
2 ENSG00000000419 DPM1 protein_coding 50934867 50959140 20 -1 chromosome dolichyl-phosphate mannosyltransferase subunit... ENSG00000000419.14 ENST00000371588
3 ENSG00000000457 SCYL3 protein_coding 169849631 169894267 1 -1 chromosome SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... ENSG00000000457.14 ENST00000367771
4 ENSG00000000460 C1orf112 protein_coding 169662007 169854080 1 1 chromosome chromosome 1 open reading frame 112 [Source:HG... ENSG00000000460.17 ENST00000359326

Filters#

genomic_features.filters defines a number of filters to use with these annotations. You can filter by specific columns:

ensdb.genes(filter=gf.filters.GeneBioTypeFilter("Mt_tRNA")).head()
gene_id gene_name gene_biotype gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system description gene_id_version canonical_transcript
0 ENSG00000209082 MT-TL1 Mt_tRNA 3230 3304 MT 1 chromosome mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... ENSG00000209082.1 ENST00000386347
1 ENSG00000210049 MT-TF Mt_tRNA 577 647 MT 1 chromosome mitochondrially encoded tRNA-Phe (UUU/C) [Sour... ENSG00000210049.1 ENST00000387314
2 ENSG00000210077 MT-TV Mt_tRNA 1602 1670 MT 1 chromosome mitochondrially encoded tRNA-Val (GUN) [Source... ENSG00000210077.1 ENST00000387342
3 ENSG00000210100 MT-TI Mt_tRNA 4263 4331 MT 1 chromosome mitochondrially encoded tRNA-Ile (AUU/C) [Sour... ENSG00000210100.1 ENST00000387365
4 ENSG00000210107 MT-TQ Mt_tRNA 4329 4400 MT -1 chromosome mitochondrially encoded tRNA-Gln (CAA/G) [Sour... ENSG00000210107.1 ENST00000387372

Or by genomic range:

ensdb.genes(filter=gf.filters.GeneRangesFilter("1:10000-20000"))
gene_id gene_name gene_biotype gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system description gene_id_version canonical_transcript
0 ENSG00000223972 DDX11L1 transcribed_unprocessed_pseudogene 12010 13670 1 1 chromosome DEAD/H-box helicase 11 like 1 (pseudogene) [So... ENSG00000223972.6 ENST00000450305
1 ENSG00000227232 WASH7P unprocessed_pseudogene 14404 29570 1 -1 chromosome WASP family homolog 7, pseudogene [Source:HGNC... ENSG00000227232.5 ENST00000488147
2 ENSG00000278267 MIR6859-1 miRNA 17369 17436 1 -1 chromosome microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:5... ENSG00000278267.1 ENST00000619216
3 ENSG00000290825 DDX11L2 lncRNA 11869 14409 1 1 chromosome DEAD/H-box helicase 11 like 2 (pseudogene) [So... ENSG00000290825.1 ENST00000456328

Logical operations (&, |, and ~) on filters are also possible:

ensdb.genes(
    filter=gf.filters.GeneBioTypeFilter("lncRNA")
    & gf.filters.GeneRangesFilter("1:10000-20000")
)
gene_id gene_name gene_biotype gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system description gene_id_version canonical_transcript
0 ENSG00000290825 DDX11L2 lncRNA 11869 14409 1 1 chromosome DEAD/H-box helicase 11 like 2 (pseudogene) [So... ENSG00000290825.1 ENST00000456328

Column selectors#

Using the cols argument, you can get annotations from other tables in the database.

ensdb.genes(cols=["gene_id", "tx_id", "gene_name", "protein_id", "uniprot_id"]).head()
gene_id tx_id gene_name protein_id uniprot_id
0 ENSG00000000003 ENST00000373020 TSPAN6 ENSP00000362111 O43657.176
1 ENSG00000000003 ENST00000612152 TSPAN6 ENSP00000482130 A0A087WYV6.48
2 ENSG00000000003 ENST00000614008 TSPAN6 ENSP00000482894 A0A087WZU5.51
3 ENSG00000000005 ENST00000373031 TNMD ENSP00000362122 Q9H2S6.148
4 ENSG00000000419 ENST00000466152 DPM1 ENSP00000507119 A0A804HIK9.2

chromosomes()#

Information on chromosome length for this annotation (useful for downstream operations) is also available through the chromosomes function.

ensdb.chromosomes()
seq_name seq_length is_circular
0 X 156040895 0
1 20 64444167 0
2 1 248956422 0
3 6 170805979 0
4 3 198295559 0
... ... ... ...
452 LRG_741 231167 0
453 LRG_763 176286 0
454 LRG_792 42144 0
455 LRG_793 38439 0
456 LRG_93 22459 0

457 rows × 3 columns

Adding annotations to an AnnData object:#

import pandas as pd
import scanpy as sc

pbmc = sc.datasets.pbmc3k()
pbmc.var.head()
gene_ids
index
MIR1302-10 ENSG00000243485
FAM138A ENSG00000237613
OR4F5 ENSG00000186092
RP11-34P13.7 ENSG00000238009
RP11-34P13.8 ENSG00000239945
pbmc.var = pd.merge(
    (
        pbmc.var.reset_index().rename(
            columns={"gene_ids": "gene_id", "index": "orig_gene_name"}
        )
    ),
    genes,
    on="gene_id",
    how="left",
).set_index("gene_id")
pbmc.var.head()
orig_gene_name gene_name gene_biotype gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system description gene_id_version canonical_transcript
gene_id
ENSG00000243485 MIR1302-10 MIR1302-2HG lncRNA 29554.0 31109.0 1 1.0 chromosome MIR1302-2 host gene [Source:HGNC Symbol;Acc:HG... ENSG00000243485.5 ENST00000473358
ENSG00000237613 FAM138A FAM138A lncRNA 34554.0 36081.0 1 -1.0 chromosome family with sequence similarity 138 member A [... ENSG00000237613.2 ENST00000417324
ENSG00000186092 OR4F5 OR4F5 protein_coding 65419.0 71585.0 1 1.0 chromosome olfactory receptor family 4 subfamily F member... ENSG00000186092.7 ENST00000641515
ENSG00000238009 RP11-34P13.7 lncRNA 89295.0 133723.0 1 -1.0 chromosome novel transcript ENSG00000238009.6 ENST00000477740
ENSG00000239945 RP11-34P13.8 lncRNA 89551.0 91105.0 1 -1.0 chromosome novel transcript ENSG00000239945.1 ENST00000495576