Introduction

The gesel package implements an R client to the Gesel database for gene set searching. The Gesel database is a collection of static files containing information about gene sets, which can be hosted anywhere - via a file server, on a HPC’s shared filesystem etc. Clients like the gesel package then use HTTP range requests to perform a variety of queries. No custom server logic is required, greatly reducing effort and cost required to keep Gesel up and running. Most database files do not need to be downloaded to the client, allowing us to easily scale with increasing numbers of gene sets in the database.

Finding overlaps

The raison d’etre of the Gesel database is to find overlaps between known gene sets and a user-supplied list of genes. Let’s say we have several human genes of interest:

my.genes <- c("SNAP25", "NEUROD6", "GAD1", "GAD2", "RELN")

We map these to Gesel gene indices, which are Gesel’s internal identifiers for each gene.

library(gesel)
gene.idx <- searchGenes("9606", my.genes)
gene.idx # this is a list of integer vectors, in case of 1:many mappings.
## [[1]]
## [1] 4640
## 
## [[2]]
## [1] 12768
## 
## [[3]]
## [1] 1759
## 
## [[4]]
## [1] 1760
## 
## [[5]]
## [1] 3913
gene.idx <- unlist(gene.idx)
fetchAllGenes("9606")[gene.idx,] # double-checking that we got it right.
##        symbol entrez      ensembl
## 4640   SNAP25   6616 ENSG0000....
## 12768 NEUROD6  63974 ENSG0000....
## 1759     GAD1   2571 ENSG0000....
## 1760     GAD2   2572 ENSG0000....
## 3913     RELN   5649 ENSG0000....

Now we find all sets with overlapping genes. The findOverlappingSets() identifies all sets with one or more overlaps with the user-suppleid set.

overlaps <- findOverlappingSets("9606", gene.idx, counts.only=FALSE)
head(overlaps$overlap) # set index and the identities of overlapping genes.
##     set        genes
## 1  2421 4640, 17....
## 2  2522 4640, 17....
## 3 21749 4640, 12....
## 4 39867 4640, 17....
## 5  2311 4640, 17....
## 6  3332 4640, 17....

The set index can be converted back to information about each set, along with the collection from which it came:

set.info <- fetchSomeSets("9606", head(overlaps$overlap$set)) # getting details of the top sets
set.info
##                              name
## 1                      GO:0005737
## 2                      GO:0005886
## 3   BLALOCK_ALZHEIMERS_DISEASE_DN
## 4 MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 5                      GO:0005515
## 6                      GO:0007268
##                                                                            description
## 1                                                                            cytoplasm
## 2                                                                      plasma membrane
## 3   http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BLALOCK_ALZHEIMERS_DISEASE_DN
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/MANNO_MIDBRAIN_NEUROTYPES_HGABA
## 5                                                                      protein binding
## 6                                                       chemical synaptic transmission
##    size collection number
## 1  5010          1   2421
## 2  4778          1   2522
## 3  1248          3   2516
## 4  1106         15     93
## 5 12505          1   2311
## 6   248          1   3332
col.info <- fetchSomeCollections("9606", set.info$collection) # getting details of the collections
col.info
##                                                title
## 1                                      Gene ontology
## 2                                      Gene ontology
## 3                                      Gene ontology
## 4 MSigDB chemical and genetic perturbations (C2 CGP)
## 5                                      Gene ontology
## 6                                      Gene ontology
##                                                                                                                 description
## 1 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 2 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 3 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 4                                     Gene sets that represent expression signatures of genetic and chemical perturbations.
## 5 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
## 6 Gene sets defined from the Gene Ontology (version 2022-07-01), sourced from the Bioconductor package org.Hs.eg.db 3.16.0.
##   maintainer
## 1  Aaron Lun
## 2  Aaron Lun
## 3  Aaron Lun
## 4  Aaron Lun
## 5  Aaron Lun
## 6  Aaron Lun
##                                                                         source
## 1 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 2 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 3 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 4           https://github.com/LTLA/gesel-feedstock/blob/master/msigdb/build.R
## 5 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
## 6 https://github.com/LTLA/gesel-feedstock/blob/gene-ontology-v1.0.0/go/build.R
##   start  size
## 1     1 18933
## 2     1 18933
## 3 19234  3405
## 4 39775   830
## 5     1 18933
## 6     1 18933

The various statistics produced by gesel can be also used to perform a simple hypergeometric test for enrichment. This often yields a more relevant ranking than the absolute number of overlaps as the hypergeometric p-value considers the size of the gene sets.

set.sizes <- fetchSetSizes("9606")[overlaps$overlap$set]
p <- phyper(
    # Subtract 1 to account for probability mass at the observed number of
    # overlaps when computing the upper tail.
    q=lengths(overlaps$overlap$genes) - 1L,

    # Number of genes in the gene set.
    m=set.sizes,

    # Number of genes not in the gene set. We use effectiveNumberOfGenes() to
    # ignore irrelevant pseudogenes, predicted genes, etc. in the database.
    n=effectiveNumberOfGenes("9606") - set.sizes,

    # Number of genes in the user-supplied list of genes that are present in
    # at least one Gesel set (ignore typos and other unknown genes).
    k=overlaps$present,

    lower.tail=FALSE
)
head(p)
## [1] 9.166142e-04 7.619258e-04 3.795345e-06 2.346103e-06 1.602832e-01
## [6] 2.019335e-06

Searching on text

We can also search for gene sets based on the text in their names or descriptions. The searchSetText() function accepts a query string with multiple words and the usual * (any characters) and ? (one character) wildcards.

chits <- searchSetText("9606", "cancer")
fetchSomeSets("9606", head(chits))
##                                                name
## 1                     SOGA_COLORECTAL_CANCER_MYC_DN
## 2                     SOGA_COLORECTAL_CANCER_MYC_UP
## 3 WATANABE_RECTAL_CANCER_RADIOTHERAPY_RESPONSIVE_UP
## 4                            LIU_PROSTATE_CANCER_UP
## 5     BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_UP
## 6               WATANABE_COLON_CANCER_MSI_VS_MSS_UP
##                                                                                              description
## 1                     http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SOGA_COLORECTAL_CANCER_MYC_DN
## 2                     http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SOGA_COLORECTAL_CANCER_MYC_UP
## 3 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_RECTAL_CANCER_RADIOTHERAPY_RESPONSIVE_UP
## 4                            http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LIU_PROSTATE_CANCER_UP
## 5     http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/BERTUCCI_MEDULLARY_VS_DUCTAL_BREAST_CANCER_UP
## 6               http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
##   size collection number
## 1   71          3      2
## 2   82          3      3
## 3  113          3     65
## 4   99          3     67
## 5  207          3     69
## 6   29          3     79
ihits <- searchSetText("9606", "innate immun*")
fetchSomeSets("9606", head(ihits))
##                                                                           name
## 1                                                REACTOME_INNATE_IMMUNE_SYSTEM
## 2              REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 3              REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 4 REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 5                WP_SARSCOV2_B117_VARIANT_ANTAGONISES_INNATE_IMMUNE_ACTIVATION
## 6                                      WP_SARS_CORONAVIRUS_AND_INNATE_IMMUNITY
##                                                                                                                         description
## 1                                                http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_INNATE_IMMUNE_SYSTEM
## 2              http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_REGULATION_OF_INNATE_IMMUNE_RESPONSES_TO_CYTOSOLIC_DNA
## 3              http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_1_ACTIVATES_MODULATES_INNATE_IMMUNE_RESPONSES
## 4 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/REACTOME_SARS_COV_2_ACTIVATES_MODULATES_INNATE_AND_ADAPTIVE_IMMUNE_RESPONSES
## 5                http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_SARSCOV2_B117_VARIANT_ANTAGONISES_INNATE_IMMUNE_ACTIVATION
## 6                                      http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WP_SARS_CORONAVIRUS_AND_INNATE_IMMUNITY
##   size collection number
## 1 1118          5    230
## 2   15          5    553
## 3   41          5   1563
## 4  126          5   1581
## 5    9          6    186
## 6   31          6    190
thits <- searchSetText("9606", "cd? t cell")
fetchSomeSets("9606", head(thits))
##                                                                                                                     name
## 1    HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_DN
## 2                       HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_UP
## 3                                              HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_7DY_UP
## 4                       HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_DN
## 5    HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## 6 HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
##                                                                                                                                                                   description
## 1    http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_DN
## 2                       http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_UP
## 3                                              http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_7DY_UP
## 4                       http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_7DY_TOP_100_DEG_EX_VIVO_DN
## 5    http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
## 6 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HOFT_CD4_POSITIVE_ALPHA_BETA_MEMORY_T_CELL_BCG_VACCINE_AGE_18_45YO_ID_56D_TOP_100_DEG_AFTER_IN_VITRO_RE_STIMULATION_UP
##   size collection number
## 1   47         14     50
## 2   28         14    132
## 3   40         14    205
## 4   16         14    253
## 5   24         14    297
## 6   42         14    317

Users can construct powerful queries by intersecting the sets recovered from searchSetText() with those from findOverlappingSets().

cancer.sets <- intersect(chits, overlaps$overlap$set)
info <- fetchSomeSets("9606", cancer.sets)
m <- match(cancer.sets, overlaps$overlap$set)
info$count <- lengths(overlaps$overlap$genes)[m]
info$pvalue <- p[m]
info[head(order(info$pvalue)),]
##                                                       name
## 4                      LOPES_METHYLATED_IN_COLON_CANCER_UP
## 13 SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 1                      WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2          WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 5                  KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 3                                 WANG_HCP_PROSTATE_CANCER
##                                                                                                     description
## 4                      http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 13 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SCHLESINGER_H3K27ME3_IN_NORMAL_AND_METHYLATED_IN_CANCER
## 1                      http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2          http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 5                  http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 3                                 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
##    size collection number count      pvalue
## 4    28          3    904     1 0.003330713
## 13   28          3   2865     1 0.003330713
## 1    29          3     79     1 0.003449503
## 2    37          3    281     1 0.004399413
## 5    97          3    923     1 0.011500672
## 3   106          3    444     1 0.012562357

Fetching all data

Gesel is designed around partial extraction from the database files, but it may be more efficient to pull all of the data into the R session at once. This is most useful for the gene set details, which can be retrieved en masse:

set.info <- fetchAllSets("9606")
nrow(set.info)
## [1] 40654
head(set.info)
##         name                                        description size collection
## 1 GO:0000002                   mitochondrial genome maintenance   11          1
## 2 GO:0000003                                       reproduction    4          1
## 3 GO:0000009             alpha-1,6-mannosyltransferase activity    2          1
## 4 GO:0000010          trans-hexaprenyltranstransferase activity    2          1
## 5 GO:0000012                         single strand break repair   10          1
## 6 GO:0000014 single-stranded DNA endodeoxyribonuclease activity    9          1
##   number
## 1      1
## 2      2
## 3      3
## 4      4
## 5      5
## 6      6

The set indices can then be used to directly subset the set.info data frame by row. In fact, calling fetchSomeSets() after fetchAllSets() will automatically use the data frame created by the latter, instead of attempting another request to the database.

head(set.info[cancer.sets,])
##                                                  name
## 19312             WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 19514 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 19677                        WANG_HCP_PROSTATE_CANCER
## 20137             LOPES_METHYLATED_IN_COLON_CANCER_UP
## 20156         KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20221           SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
##                                                                                                description
## 19312             http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 19514 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 19677                        http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## 20137             http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 20156         http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 20221           http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
##       size collection number
## 19312   29          3     79
## 19514   37          3    281
## 19677  106          3    444
## 20137   28          3    904
## 20156   97          3    923
## 20221  324          3    988
head(fetchSomeSets("9606", cancer.sets)) # same results
##                                              name
## 1             WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3                        WANG_HCP_PROSTATE_CANCER
## 4             LOPES_METHYLATED_IN_COLON_CANCER_UP
## 5         KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 6           SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
##                                                                                            description
## 1             http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WATANABE_COLON_CANCER_MSI_VS_MSS_UP
## 2 http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_BARRETTS_ESOPHAGUS_AND_ESOPHAGUS_CANCER_DN
## 3                        http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/WANG_HCP_PROSTATE_CANCER
## 4             http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/LOPES_METHYLATED_IN_COLON_CANCER_UP
## 5         http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/KONDO_PROSTATE_CANCER_HCP_WITH_H3K27ME3
## 6           http://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SMID_BREAST_CANCER_RELAPSE_IN_BONE_DN
##   size collection number
## 1   29          3     79
## 2   37          3    281
## 3  106          3    444
## 4   28          3    904
## 5   97          3    923
## 6  324          3    988

The same approach can be used to extract collection information, via fetchAllCollections(); gene set membership, via fetchGenesForAllSets(); and the sets containing each gene, via fetchSetsForAllGenes().

Advanced use

gesel uses a lot of in-memory caching to reduce the number of requests to the database files within a single R session. On rare occasions, the cache may become outdated, e.g., if the database files are updated while an R session is running. Users can prompt gesel to re-acquire all data by flusing the cache:

Applications can specify their own functions for obtaining files (or byte ranges thereof) by passing a custom config= in each gesel function. For example, on a shared HPC filesystem, we could point gesel towards a directory of Gesel database files. This provides a performant alternative to HTTP requests for an institutional collection of gene sets.

Session information

## R Under development (unstable) (2024-11-13 r87330)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gesel_0.1.2      BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.3           knitr_1.49          rlang_1.1.4        
##  [4] xfun_0.49           textshaping_0.4.0   jsonlite_1.8.9     
##  [7] glue_1.8.0          htmltools_0.5.8.1   ragg_1.3.3         
## [10] sass_0.4.9          rappdirs_0.3.3      rmarkdown_2.29     
## [13] evaluate_1.0.1      jquerylib_0.1.4     fastmap_1.2.0      
## [16] yaml_2.3.10         lifecycle_1.0.4     httr2_1.0.6        
## [19] bookdown_0.41       BiocManager_1.30.25 compiler_4.5.0     
## [22] fs_1.6.5            Rcpp_1.0.13-1       htmlwidgets_1.6.4  
## [25] systemfonts_1.1.0   digest_0.6.37       R6_2.5.1           
## [28] curl_6.0.1          magrittr_2.0.3      bslib_0.8.0        
## [31] tools_4.5.0         pkgdown_2.1.1       cachem_1.1.0       
## [34] desc_1.4.3