Basic usage#

This notebook shows some basic usage of the genomic-features package.

import genomic_features as gf

Retrieving Ensembl gene annotations#

We can load annotation tables using genomic_features.ensembl.annotation().

ensdb = gf.ensembl.annotation(species="Hsapiens", version="108")
ensdb

EnsemblDB(organism='Homo sapiens', ensembl_release='108')

These tables have been created for the ensembldb Bioconductor package [RGW19], and are automatically downloaded and cached from the AnnotationHub resource.

We can check which Ensembl versions are available for a given species using the genomic_features.ensembl.list_ensdb_annotations() utility function.

gf.ensembl.list_ensdb_annotations(species="Mmusculus")

	Species	Ensembl_version
37	Mmusculus	87
105	Mmusculus	88
173	Mmusculus	89
247	Mmusculus	90
330	Mmusculus	91
419	Mmusculus	92
510	Mmusculus	93
621	Mmusculus	94
748	Mmusculus	95
894	Mmusculus	96
1062	Mmusculus	97
1239	Mmusculus	98
1429	Mmusculus	99
1648	Mmusculus	100
1875	Mmusculus	101
2118	Mmusculus	102
2361	Mmusculus	103
2604	Mmusculus	104
2847	Mmusculus	105
3089	Mmusculus	106
3332	Mmusculus	107
3575	Mmusculus	108
3863	Mmusculus	109

Using `EnsemblDB` objects#

genomic_features.ensembl.EnsemblDB is the access point to an annotation. It is an interface to an SQLite database retrieved from AnnotationHub (as shown above). Provenance information can be accessed via the metadata attribute:

ensdb.metadata

{'Db type': 'EnsDb',
 'Type of Gene ID': 'Ensembl Gene ID',
 'Supporting package': 'ensembldb',
 'Db created by': 'ensembldb package from Bioconductor',
 'script_version': '0.3.7',
 'Creation time': 'Fri Oct 28 05:24:43 2022',
 'ensembl_version': '108',
 'ensembl_host': 'localhost',
 'Organism': 'Homo sapiens',
 'taxonomy_id': '9606',
 'genome_build': 'GRCh38',
 'DBSCHEMAVERSION': '2.2'}
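Gene coordinates depend on the genome build, so it can be worth checking the provenance before combining the annotation with other data. The following is a minimal sketch that uses a plain dictionary copied from the output above instead of a live ensdb.metadata. The helper name check_provenance is hypothetical and not part of genomic-features.

```python
# Subset of the metadata shown above, copied by hand so the example runs offline.
metadata = {
    "ensembl_version": "108",
    "Organism": "Homo sapiens",
    "genome_build": "GRCh38",
}


def check_provenance(metadata, genome_build, ensembl_version):
    """Raise ValueError if the annotation does not match the expected build/release.

    Hypothetical helper for illustration; not part of genomic-features.
    """
    expected = {
        "genome_build": genome_build,
        "ensembl_version": str(ensembl_version),  # metadata values are strings
    }
    mismatches = {
        key: (metadata.get(key), value)
        for key, value in expected.items()
        if metadata.get(key) != value
    }
    if mismatches:
        raise ValueError(
            f"Annotation provenance mismatch (found, expected): {mismatches}"
        )


# Passes silently because the metadata above reports GRCh38 and release 108.
check_provenance(metadata, genome_build="GRCh38", ensembl_version=108)
```

With a real annotation, the same check could presumably be run on ensdb.metadata directly, since it is displayed as a dictionary of strings above.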

The database can also be queried for genomic features:

genes = ensdb.genes()
genes.head()

	gene_id	gene_name	gene_biotype	gene_seq_start	gene_seq_end	seq_name	seq_strand	seq_coord_system	description	gene_id_version	canonical_transcript
0	ENSG00000000003	TSPAN6	protein_coding	100627108	100639991	X	-1	chromosome	tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]	ENSG00000000003.15	ENST00000373020
1	ENSG00000000005	TNMD	protein_coding	100584936	100599885	X	1	chromosome	tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]	ENSG00000000005.6	ENST00000373031
2	ENSG00000000419	DPM1	protein_coding	50934867	50959140	20	-1	chromosome	dolichyl-phosphate mannosyltransferase subunit...	ENSG00000000419.14	ENST00000371588
3	ENSG00000000457	SCYL3	protein_coding	169849631	169894267	1	-1	chromosome	SCY1 like pseudokinase 3 [Source:HGNC Symbol;A...	ENSG00000000457.14	ENST00000367771
4	ENSG00000000460	C1orf112	protein_coding	169662007	169854080	1	1	chromosome	chromosome 1 open reading frame 112 [Source:HG...	ENSG00000000460.17	ENST00000359326

Filters#

genomic_features.filters defines a number of filters to use with these annotations. You can filter by specific columns:

ensdb.genes(filter=gf.filters.GeneBioTypeFilter("Mt_tRNA")).head()

	gene_id	gene_name	gene_biotype	gene_seq_start	gene_seq_end	seq_name	seq_strand	seq_coord_system	description	gene_id_version	canonical_transcript
0	ENSG00000209082	MT-TL1	Mt_tRNA	3230	3304	MT	1	chromosome	mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...	ENSG00000209082.1	ENST00000386347
1	ENSG00000210049	MT-TF	Mt_tRNA	577	647	MT	1	chromosome	mitochondrially encoded tRNA-Phe (UUU/C) [Sour...	ENSG00000210049.1	ENST00000387314
2	ENSG00000210077	MT-TV	Mt_tRNA	1602	1670	MT	1	chromosome	mitochondrially encoded tRNA-Val (GUN) [Source...	ENSG00000210077.1	ENST00000387342
3	ENSG00000210100	MT-TI	Mt_tRNA	4263	4331	MT	1	chromosome	mitochondrially encoded tRNA-Ile (AUU/C) [Sour...	ENSG00000210100.1	ENST00000387365
4	ENSG00000210107	MT-TQ	Mt_tRNA	4329	4400	MT	-1	chromosome	mitochondrially encoded tRNA-Gln (CAA/G) [Sour...	ENSG00000210107.1	ENST00000387372

Or by genomic range:

ensdb.genes(filter=gf.filters.GeneRangesFilter("1:10000-20000"))

	gene_id	gene_name	gene_biotype	gene_seq_start	gene_seq_end	seq_name	seq_strand	seq_coord_system	description	gene_id_version	canonical_transcript
0	ENSG00000223972	DDX11L1	transcribed_unprocessed_pseudogene	12010	13670	1	1	chromosome	DEAD/H-box helicase 11 like 1 (pseudogene) [So...	ENSG00000223972.6	ENST00000450305
1	ENSG00000227232	WASH7P	unprocessed_pseudogene	14404	29570	1	-1	chromosome	WASP family homolog 7, pseudogene [Source:HGNC...	ENSG00000227232.5	ENST00000488147
2	ENSG00000278267	MIR6859-1	miRNA	17369	17436	1	-1	chromosome	microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:5...	ENSG00000278267.1	ENST00000619216
3	ENSG00000290825	DDX11L2	lncRNA	11869	14409	1	1	chromosome	DEAD/H-box helicase 11 like 2 (pseudogene) [So...	ENSG00000290825.1	ENST00000456328
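The output above suggests that the range filter keeps genes that overlap the query range, not only genes fully contained in it: WASH7P ends at 29570, beyond the end of the requested window. The sketch below reproduces that behaviour in plain Python on the coordinates shown in this notebook. The functions parse_range and overlaps are illustrative names, not the library's implementation, and the overlap rule is an assumption inferred from the output.

```python
import re


def parse_range(text):
    """Split 'seq_name:start-end' into (seq_name, start, end)."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", text)
    if match is None:
        raise ValueError(f"Cannot parse range: {text!r}")
    seq_name, start, end = match.groups()
    return seq_name, int(start), int(end)


def overlaps(gene, query):
    """True if the gene shares at least one base with the query range."""
    seq_name, start, end = query
    return (
        gene["seq_name"] == seq_name
        and gene["start"] <= end
        and gene["end"] >= start
    )


# Coordinates copied from the tables in this notebook.
genes_chr1 = [
    {"gene_name": "DDX11L1", "seq_name": "1", "start": 12010, "end": 13670},
    {"gene_name": "WASH7P", "seq_name": "1", "start": 14404, "end": 29570},
    {"gene_name": "MIR6859-1", "seq_name": "1", "start": 17369, "end": 17436},
    {"gene_name": "DDX11L2", "seq_name": "1", "start": 11869, "end": 14409},
    # Starts after the query window, so it should be excluded.
    {"gene_name": "MIR1302-2HG", "seq_name": "1", "start": 29554, "end": 31109},
]

query = parse_range("1:10000-20000")
hits = [gene["gene_name"] for gene in genes_chr1 if overlaps(gene, query)]
print(hits)  # ['DDX11L1', 'WASH7P', 'MIR6859-1', 'DDX11L2']
```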

Logical operations (&, |, and ~) on filters are also possible:

ensdb.genes(
    filter=gf.filters.GeneBioTypeFilter("lncRNA")
    & gf.filters.GeneRangesFilter("1:10000-20000")
)

	gene_id	gene_name	gene_biotype	gene_seq_start	gene_seq_end	seq_name	seq_strand	seq_coord_system	description	gene_id_version	canonical_transcript
0	ENSG00000290825	DDX11L2	lncRNA	11869	14409	1	1	chromosome	DEAD/H-box helicase 11 like 2 (pseudogene) [So...	ENSG00000290825.1	ENST00000456328
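To make the semantics of combined filters concrete, here is a small self-contained sketch of how &, | and ~ can compose row predicates through operator overloading. It is only an illustration of the idea: the Predicate class and the helpers biotype_is and overlaps_range are hypothetical and say nothing about how genomic_features.filters is implemented internally.

```python
class Predicate:
    """Wrap a row -> bool function and support &, | and ~ (illustration only)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, row):
        return self.func(row)

    def __and__(self, other):
        return Predicate(lambda row: self(row) and other(row))

    def __or__(self, other):
        return Predicate(lambda row: self(row) or other(row))

    def __invert__(self):
        return Predicate(lambda row: not self(row))


def biotype_is(biotype):
    return Predicate(lambda row: row["gene_biotype"] == biotype)


def overlaps_range(seq_name, start, end):
    return Predicate(
        lambda row: row["seq_name"] == seq_name
        and row["gene_seq_start"] <= end
        and row["gene_seq_end"] >= start
    )


def select(rows, predicate):
    """Return the names of the genes that satisfy the predicate."""
    return [row["gene_name"] for row in rows if predicate(row)]


# Rows copied from the range-filter output above.
rows = [
    {"gene_name": "DDX11L1", "gene_biotype": "transcribed_unprocessed_pseudogene",
     "seq_name": "1", "gene_seq_start": 12010, "gene_seq_end": 13670},
    {"gene_name": "WASH7P", "gene_biotype": "unprocessed_pseudogene",
     "seq_name": "1", "gene_seq_start": 14404, "gene_seq_end": 29570},
    {"gene_name": "MIR6859-1", "gene_biotype": "miRNA",
     "seq_name": "1", "gene_seq_start": 17369, "gene_seq_end": 17436},
    {"gene_name": "DDX11L2", "gene_biotype": "lncRNA",
     "seq_name": "1", "gene_seq_start": 11869, "gene_seq_end": 14409},
]

lnc = biotype_is("lncRNA")
in_window = overlaps_range("1", 10000, 20000)

print(select(rows, lnc & in_window))   # ['DDX11L2']
print(select(rows, ~lnc & in_window))  # ['DDX11L1', 'WASH7P', 'MIR6859-1']
```

In Python, ~ binds more tightly than &, which in turn binds more tightly than |, so ~lnc & in_window reads as (~lnc) & in_window. Parentheses are still a good habit when mixing several operators.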

Column selectors#

Using the cols argument, you can get annotations from other tables in the database.

ensdb.genes(cols=["gene_id", "tx_id", "gene_name", "protein_id", "uniprot_id"]).head()

	gene_id	tx_id	gene_name	protein_id	uniprot_id
0	ENSG00000000003	ENST00000373020	TSPAN6	ENSP00000362111	O43657.176
1	ENSG00000000003	ENST00000612152	TSPAN6	ENSP00000482130	A0A087WYV6.48
2	ENSG00000000003	ENST00000614008	TSPAN6	ENSP00000482894	A0A087WZU5.51
3	ENSG00000000005	ENST00000373031	TNMD	ENSP00000362122	Q9H2S6.148
4	ENSG00000000419	ENST00000466152	DPM1	ENSP00000507119	A0A804HIK9.2
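Note that requesting transcript- or protein-level columns changes the granularity of the result: in the output above, TSPAN6 (ENSG00000000003) appears once per transcript, so gene_id is no longer unique. A minimal sketch of collapsing such rows back to one entry per gene, using the five rows shown above as plain tuples so that it runs without the database:

```python
from collections import defaultdict

# (gene_id, tx_id, gene_name) copied from the first five rows of the output above.
rows = [
    ("ENSG00000000003", "ENST00000373020", "TSPAN6"),
    ("ENSG00000000003", "ENST00000612152", "TSPAN6"),
    ("ENSG00000000003", "ENST00000614008", "TSPAN6"),
    ("ENSG00000000005", "ENST00000373031", "TNMD"),
    ("ENSG00000000419", "ENST00000466152", "DPM1"),
]

# Collect every transcript ID under its gene ID.
transcripts_per_gene = defaultdict(list)
for gene_id, tx_id, _gene_name in rows:
    transcripts_per_gene[gene_id].append(tx_id)

# Counts cover only the five rows listed here, not the full annotation.
counts = {gene_id: len(tx_ids) for gene_id, tx_ids in transcripts_per_gene.items()}
print(counts)
# {'ENSG00000000003': 3, 'ENSG00000000005': 1, 'ENSG00000000419': 1}
```

On the real DataFrame, a pandas groupby on gene_id would be the natural equivalent.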

`chromosomes()`#

Chromosome lengths for this annotation (useful for downstream operations) are available through the chromosomes() method.

ensdb.chromosomes()

	seq_name	seq_length	is_circular
0	X	156040895	0
1	20	64444167	0
2	1	248956422	0
3	6	170805979	0
4	3	198295559	0
...	...	...	...
452	LRG_741	231167	0
453	LRG_763	176286	0
454	LRG_792	42144	0
455	LRG_793	38439	0
456	LRG_93	22459	0

457 rows × 3 columns
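One downstream operation where chromosome lengths matter is extending a gene interval by a flanking window without running past the ends of the chromosome. The sketch below uses a hand-copied dictionary of three lengths from the table above; extend_interval is a hypothetical helper, and the coordinates are treated as 1-based and inclusive, as in Ensembl.

```python
# seq_name -> seq_length, copied from the table above.
seq_lengths = {"X": 156040895, "20": 64444167, "1": 248956422}


def extend_interval(seq_name, start, end, flank, seq_lengths):
    """Extend [start, end] by `flank` bases on each side, clipped to the chromosome.

    Hypothetical helper; coordinates are 1-based and inclusive.
    """
    length = seq_lengths[seq_name]
    return max(1, start - flank), min(length, end + flank)


# DDX11L1 (1:12010-13670) with a 20 kb flank is clipped at the chromosome start.
print(extend_interval("1", 12010, 13670, 20000, seq_lengths))  # (1, 33670)

# An interval near the end of chromosome X is clipped at its length.
print(extend_interval("X", 156040000, 156040800, 5000, seq_lengths))
# (156035000, 156040895)
```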

Adding annotations to an AnnData object#

import pandas as pd
import scanpy as sc

pbmc = sc.datasets.pbmc3k()

pbmc.var.head()

	gene_ids
index
MIR1302-10	ENSG00000243485
FAM138A	ENSG00000237613
OR4F5	ENSG00000186092
RP11-34P13.7	ENSG00000238009
RP11-34P13.8	ENSG00000239945

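# Left-join the gene annotation onto pbmc.var by Ensembl gene ID,
# keeping the original gene symbols in "orig_gene_name".
pbmc.var = pd.merge(
    pbmc.var.reset_index().rename(
        columns={"gene_ids": "gene_id", "index": "orig_gene_name"}
    ),
    genes,
    on="gene_id",
    how="left",
).set_index("gene_id")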

pbmc.var.head()

	orig_gene_name	gene_name	gene_biotype	gene_seq_start	gene_seq_end	seq_name	seq_strand	seq_coord_system	description	gene_id_version	canonical_transcript
gene_id
ENSG00000243485	MIR1302-10	MIR1302-2HG	lncRNA	29554.0	31109.0	1	1.0	chromosome	MIR1302-2 host gene [Source:HGNC Symbol;Acc:HG...	ENSG00000243485.5	ENST00000473358
ENSG00000237613	FAM138A	FAM138A	lncRNA	34554.0	36081.0	1	-1.0	chromosome	family with sequence similarity 138 member A [...	ENSG00000237613.2	ENST00000417324
ENSG00000186092	OR4F5	OR4F5	protein_coding	65419.0	71585.0	1	1.0	chromosome	olfactory receptor family 4 subfamily F member...	ENSG00000186092.7	ENST00000641515
ENSG00000238009	RP11-34P13.7		lncRNA	89295.0	133723.0	1	-1.0	chromosome	novel transcript	ENSG00000238009.6	ENST00000477740
ENSG00000239945	RP11-34P13.8		lncRNA	89551.0	91105.0	1	-1.0	chromosome	novel transcript	ENSG00000239945.1	ENST00000495576
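The coordinates in the merged table are displayed as floats (for example 29554.0). A likely explanation is that some gene IDs in the dataset have no match in this Ensembl release: a left join fills the missing values with NaN, and pandas then upcasts integer columns to float. The toy example below shows the effect and one way to restore integers with the nullable Int64 dtype. The second gene ID is a made-up placeholder, and the frame names var_table and annotation are chosen only for this sketch.

```python
import pandas as pd

# Two gene IDs on the left; only the first exists in the annotation.
var_table = pd.DataFrame(
    {"gene_id": ["ENSG00000243485", "ENSG_NOT_IN_ANNOTATION"]}  # second ID is made up
)
annotation = pd.DataFrame(
    {"gene_id": ["ENSG00000243485"], "gene_seq_start": [29554]}
)

merged = pd.merge(var_table, annotation, on="gene_id", how="left")

# The unmatched row gets NaN, which forces the integer column to float64.
dtype_after_merge = merged["gene_seq_start"].dtype
n_unmatched = int(merged["gene_seq_start"].isna().sum())
print(dtype_after_merge, n_unmatched)  # float64 1

# The nullable integer dtype keeps integers and represents gaps as <NA>.
merged["gene_seq_start"] = merged["gene_seq_start"].astype("Int64")
print(merged["gene_seq_start"].dtype)  # Int64
```

Counting the unmatched rows after such a merge is a quick way to see how many genes in a dataset are missing from the chosen annotation release.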
