findOverlappingSets.RdFind all sets overlapping any gene in a user-supplied list, and return the number of overlaps per set.
findOverlappingSets(species, genes, counts.only = TRUE, config = NULL)String containing the NCBI taxonomy ID of the species of interest.
Integer vector containing gene indices.
Each gene index refers to a row of the data frame returned by fetchAllGenes.
Boolean indicating whether to only report the number of overlapping genes for each set.
Configuration list, typically created by newConfig.
If NULL, the default configuration is used.
A list containing:
overlap, a data frame of the overlapping sets.
Each row represents a set that is identified by the set index in the set column.
(This set index refers to a row of the data frame returned by fetchAllSets.)
It also has:
The count column, if counts.only=TRUE.
This specifies the number of overlaps between the genes in the set and those in genes.
The genes column, if counts.only=FALSE.
This is a list that contains the entries of genes that overlap with those in the set.
The row order is arbitrary.
present, an integer indicating the number of genes in genes that are present in at least one set in the Gesel database for species.
present should be used as the number of draws when performing a hypergeomtric test for gene set enrichment (see phyper), instead of length(genes).
This ensures that genes outside of the Gesel universe are ignored, e.g., due to user error or different genome versions.
Otherwise, unknown genes would inappropriately increase the number of draws and inflate the enrichment p-value.
overlaps <- findOverlappingSets("9606", 1:10)
head(overlaps$overlap)
#> set count
#> 1 25 1
#> 2 395 1
#> 3 413 1
#> 4 605 1
#> 5 701 1
#> 6 790 1
# More details on the overlapping sets.
all.sets <- fetchAllSets("9606")
all.sets[head(overlaps$overlap$set),]
#> name description
#> 25 GO:0000049 tRNA binding
#> 395 GO:0001525 angiogenesis
#> 413 GO:0001553 luteinization
#> 605 GO:0001869 negative regulation of complement activation, lectin pathway
#> 701 GO:0002020 protease binding
#> 790 GO:0002161 aminoacyl-tRNA editing activity
#> size collection number
#> 25 73 1 25
#> 395 236 1 395
#> 413 8 1 413
#> 605 2 1 605
#> 701 107 1 701
#> 790 10 1 790
# Computing an enrichment p-value. We take the upper tail after
# subtracting 1 to ensure that the probability mass of the observed
# number of overlapping genes is included in the p-value.
set.size <- all.sets$size[overlaps$overlap$set]
universe <- effectiveNumberOfGenes("9606")
p <- phyper(
q = overlaps$overlap$count - 1,
m = set.size,
n = universe - set.size,
k = overlaps$present,
lower.tail=FALSE
)
# For multiple testing correction, it is necessary to consider all sets
# in the database, as these were implicitly considered during the search
# though only a subset of them are reported by findOverlappingSets.
fdr <- p.adjust(p, method="BH", n=nrow(all.sets))
summary(fdr <= 0.05)
#> Mode FALSE TRUE
#> logical 1309 17