userguide.Rmd
The gesel package implements an R client to the Gesel database for gene set searching. The Gesel database is a collection of static files containing information about gene sets, which can be hosted anywhere - via a file server, on a HPC’s shared filesystem etc. Clients like the gesel package then use HTTP range requests to perform a variety of queries. No custom server logic is required, greatly reducing effort and cost required to keep Gesel up and running. Most database files do not need to be downloaded to the client, allowing us to easily scale with increasing numbers of gene sets in the database.
The raison d’etre of the Gesel database is to find overlaps between known gene sets and a user-supplied list of genes. Let’s say we have several human genes of interest:
my.genes <- c("SNAP25", "NEUROD6", "GAD1", "GAD2", "RELN")
We map these to Gesel gene indices, which are Gesel’s internal identifiers for each gene.
library(gesel)
gene.idx <- searchGenes("9606", my.genes)
gene.idx # this is a list of integer vectors, in case of 1:many mappings.
## [[1]]
## [1] 4640
##
## [[2]]
## [1] 12768
##
## [[3]]
## [1] 1759
##
## [[4]]
## [1] 1760
##
## [[5]]
## [1] 3913
gene.idx <- unlist(gene.idx)
fetchAllGenes("9606")[gene.idx,] # double-checking that we got it right.
## symbol entrez ensembl
## 4640 SNAP25 6616 ENSG0000....
## 12768 NEUROD6 63974 ENSG0000....
## 1759 GAD1 2571 ENSG0000....
## 1760 GAD2 2572 ENSG0000....
## 3913 RELN 5649 ENSG0000....
Now we find all sets with overlapping genes. The
findOverlappingSets()
identifies all sets with one or more
overlaps with the user-suppleid set.
overlaps <- findOverlappingSets("9606", gene.idx, counts.only=FALSE)
head(overlaps$overlap) # set index and the identities of overlapping genes.
## set genes
## 1 2421 4640, 17....
## 2 2522 4640, 17....
## 3 21749 4640, 12....
## 4 39867 4640, 17....
## 5 2311 4640, 17....
## 6 3332 4640, 17....
The set index can be converted back to information about each set, along with the collection from which it came:
set.info <- fetchSomeSets("9606", head(overlaps$overlap$set)) # getting details of the top sets
set.info
## name
## 1 GO:0005737
## 2 GO:0005886
## 3 BLALOCK_ALZHEIMERS_DISEASE_DN
## 4 MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 5 GO:0005515
## 6 GO:0007268
## description
## 1 cytoplasm
## 2 plasma membrane
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BLALOCK_ALZHEIMERS_DISEASE_DN
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 5 protein binding
## 6 chemical synaptic transmission
## size collection number
## 1 5010 1 2421
## 2 4778 1 2522
## 3 1248 3 2516
## 4 1106 15 93
## 5 12505 1 2311
## 6 248 1 3332
col.info <- fetchSomeCollections("9606", set.info$collection) # getting details of the collections
col.info
## title
## 1 Gene ontology
## 2 Gene ontology
## 3 Gene ontology
## 4 MSigDB chemical and genetic perturbations (C2 CGP)
## 5 Gene ontology
## 6 Gene ontology
## description
## 1 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 2 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 3 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 4 Gene sets that represent expression signatures of genetic and chemical perturbations.
## 5 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 6 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## maintainer
## 1 Aaron Lun
## 2 Aaron Lun
## 3 Aaron Lun
## 4 Aaron Lun
## 5 Aaron Lun
## 6 Aaron Lun
## source
## 1 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 2 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 3 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 4 https://github.com/LTLA/gesel-feedstock/blob/master/msigdb/build.R
## 5 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 6 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## start size
## 1 1 18933
## 2 1 18933
## 3 19234 3405
## 4 39775 830
## 5 1 18933
## 6 1 18933
The various statistics produced by gesel can be also used to perform a simple hypergeometric test for enrichment. This often yields a more relevant ranking than the absolute number of overlaps as the hypergeometric p-value considers the size of the gene sets.
set.sizes <- fetchSetSizes("9606")[overlaps$overlap$set]
p <- phyper(
# Subtract 1 to account for probability mass at the observed number of
# overlaps when computing the upper tail.
q=lengths(overlaps$overlap$genes) - 1L,
# Number of genes in the gene set.
m=set.sizes,
# Number of genes not in the gene set. We use effectiveNumberOfGenes() to
# ignore irrelevant pseudogenes, predicted genes, etc. in the database.
n=effectiveNumberOfGenes("9606") - set.sizes,
# Number of genes in the user-supplied list of genes that are present in
# at least one Gesel set (ignore typos and other unknown genes).
k=overlaps$present,
lower.tail=FALSE
)
head(p)
## [1] 9.166142e-04 7.619258e-04 3.795345e-06 2.346103e-06 1.602832e-01
## [6] 2.019335e-06
We can also search for gene sets based on the text in their names or
descriptions. The searchSetText()
function accepts a query
string with multiple words and the usual *
(any characters)
and ?
(one character) wildcards.
chits <- searchSetText("9606", "cancer")
fetchSomeSets("9606", head(chits))
## name
## 1 SOGA_COLORECTAL_CANCER_MYC_DN
## 2 SOGA_COLORECTAL_CANCER_MYC_UP
## 3 WATANABE_RECTAL_CANCER_RADIOTHERAPY_RESPONSIVE_UP
## 4 LIU_PROSTATE_CANCER_UP
## 5 BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_UP
## 6 WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## description
## 1 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SOGA_COLORECTAL_CANCER_MYC_DN
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SOGA_COLORECTAL_CANCER_MYC_UP
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_RECTAL_CANCER_RADIOTHERAPY_RESPONSIVE_UP
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_PROSTATE_CANCER_UP
## 5 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_UP
## 6 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## size collection number
## 1 71 3 2
## 2 82 3 3
## 3 113 3 65
## 4 99 3 67
## 5 207 3 69
## 6 29 3 79
ihits <- searchSetText("9606", "innate immun*")
fetchSomeSets("9606", head(ihits))
## name
## 1 REACTOME_INNATE_IMMUNE_SYSTEM
## 2 REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 3 REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 4 REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 5 WP_SARSCOV2_B117_VARIANT_ANTAGONISES_INNATE_IMMUNE_ACTIVATION
## 6 WP_SARS_CORONAVIRUS_AND_INNATE_IMMUNITY
## description
## 1 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_INNATE_IMMUNE_SYSTEM
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 5 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_SARSCOV2_B117_VARIANT_ANTAGONISES_INNATE_IMMUNE_ACTIVATION
## 6 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_SARS_CORONAVIRUS_AND_INNATE_IMMUNITY
## size collection number
## 1 1118 5 230
## 2 15 5 553
## 3 41 5 1563
## 4 126 5 1581
## 5 9 6 186
## 6 31 6 190
thits <- searchSetText("9606", "cd? t cell")
fetchSomeSets("9606", head(thits))
## name
## 1 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_DN
## 2 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_UP
## 3 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_7DY_UP
## 4 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_DN
## 5 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## 6 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## description
## 1 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_DN
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_UP
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_7DY_UP
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_DN
## 5 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## 6 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## size collection number
## 1 47 14 50
## 2 28 14 132
## 3 40 14 205
## 4 16 14 253
## 5 24 14 297
## 6 42 14 317
Users can construct powerful queries by intersecting the sets
recovered from searchSetText()
with those from
findOverlappingSets()
.
cancer.sets <- intersect(chits, overlaps$overlap$set)
info <- fetchSomeSets("9606", cancer.sets)
m <- match(cancer.sets, overlaps$overlap$set)
info$count <- lengths(overlaps$overlap$genes)[m]
info$pvalue <- p[m]
info[head(order(info$pvalue)),]
## name
## 4 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 13 SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 1 WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 5 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 3 WANG_HCP_PROSTATE_CANCER
## description
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 13 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 1 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 5 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## size collection number count pvalue
## 4 28 3 904 1 0.003330713
## 13 28 3 2865 1 0.003330713
## 1 29 3 79 1 0.003449503
## 2 37 3 281 1 0.004399413
## 5 97 3 923 1 0.011500672
## 3 106 3 444 1 0.012562357
Gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into the R session at once. This is most useful for the gene set details, which can be retrieved en masse:
set.info <- fetchAllSets("9606")
nrow(set.info)
## [1] 40654
head(set.info)
## name description size collection
## 1 GO:0000002 mitochondrial genome maintenance 11 1
## 2 GO:0000003 reproduction 4 1
## 3 GO:0000009 alpha-1,6-mannosyltransferase activity 2 1
## 4 GO:0000010 trans-hexaprenyltranstransferase activity 2 1
## 5 GO:0000012 single strand break repair 10 1
## 6 GO:0000014 single-stranded DNA endodeoxyribonuclease activity 9 1
## number
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
The set indices can then be used to directly subset the
set.info
data frame by row. In fact, calling
fetchSomeSets()
after fetchAllSets()
will
automatically use the data frame created by the latter, instead of
attempting another request to the database.
head(set.info[cancer.sets,])
## name
## 19312 WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 19514 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 19677 WANG_HCP_PROSTATE_CANCER
## 20137 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 20156 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20221 SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
## description
## 19312 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 19514 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 19677 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## 20137 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 20156 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20221 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
## size collection number
## 19312 29 3 79
## 19514 37 3 281
## 19677 106 3 444
## 20137 28 3 904
## 20156 97 3 923
## 20221 324 3 988
head(fetchSomeSets("9606", cancer.sets)) # same results
## name
## 1 WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3 WANG_HCP_PROSTATE_CANCER
## 4 LOPES_METHYLATED_IN_COLON_CANCER_UP
## 5 KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 6 SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
## description
## 1 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 5 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 6 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
## size collection number
## 1 29 3 79
## 2 37 3 281
## 3 106 3 444
## 4 28 3 904
## 5 97 3 923
## 6 324 3 988
The same approach can be used to extract collection information, via
fetchAllCollections()
; gene set membership, via
fetchGenesForAllSets()
; and the sets containing each gene,
via fetchSetsForAllGenes()
.
gesel uses a lot of in-memory caching to reduce the number of requests to the database files within a single R session. On rare occasions, the cache may become outdated, e.g., if the database files are updated while an R session is running. Users can prompt gesel to re-acquire all data by flusing the cache:
Applications can specify their own functions for obtaining files (or
byte ranges thereof) by passing a custom config=
in each
gesel
function. For example, on a shared HPC filesystem, we could point gesel towards a
directory of Gesel database files. This provides a performant
alternative to HTTP requests for an institutional collection of gene
sets.
## R Under development (unstable) (2024-11-13 r87330)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gesel_0.1.2 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.3 knitr_1.49 rlang_1.1.4
## [4] xfun_0.49 textshaping_0.4.0 jsonlite_1.8.9
## [7] glue_1.8.0 htmltools_0.5.8.1 ragg_1.3.3
## [10] sass_0.4.9 rappdirs_0.3.3 rmarkdown_2.29
## [13] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0
## [16] yaml_2.3.10 lifecycle_1.0.4 httr2_1.0.6
## [19] bookdown_0.41 BiocManager_1.30.25 compiler_4.5.0
## [22] fs_1.6.5 Rcpp_1.0.13-1 htmlwidgets_1.6.4
## [25] systemfonts_1.1.0 digest_0.6.37 R6_2.5.1
## [28] curl_6.0.1 magrittr_2.0.3 bslib_0.8.0
## [31] tools_4.5.0 pkgdown_2.1.1 cachem_1.1.0
## [34] desc_1.4.3