Prepare the Gesel database — prepareDatabaseFiles • gesel

Prepare Gesel database files from various pieces of gene set information.

prepareDatabaseFiles(
  species,
  collections,
  set.info,
  set.membership,
  num.genes,
  path = "."
)

Arguments

species: String specifying the species in the form of its NCBI taxonomy ID, e.g., "9606" for human.
collections: Data frame of information about each gene set collection, where each row corresponds to a collection. This data frame should contain the same columns as the data frame returned by fetchAllCollections.
set.info: Data frame of information about each gene set, where each row corresponds to a set. This data frame should contain the same columns as the data frame returned by fetchAllSets.
set.membership: List of integer vectors, where each vector corresponds to a gene set and contains the indices of its constituent genes. All gene indices should be positive, no greater than num.genes, and unique within each set.
num.genes: Integer scalar specifying the total number of genes available for this species.
path: String containing the path to a directory in which to create the database files.
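
The constraints on set.membership can be checked before the database files are written. The following is a minimal sketch in base R; checkSetMembership is a hypothetical helper, not a function exported by gesel, and it only tests the conditions stated above (integer indices that are positive, no greater than num.genes, and unique within each set).

```r
# Hypothetical helper (not part of gesel): throws an error if any set
# violates the documented constraints, and returns TRUE invisibly otherwise.
checkSetMembership <- function(set.membership, num.genes) {
    stopifnot(is.list(set.membership))
    for (i in seq_along(set.membership)) {
        genes <- set.membership[[i]]
        if (!is.integer(genes) || anyNA(genes)) {
            stop("set ", i, " should be an integer vector without missing values")
        }
        if (any(genes < 1L | genes > num.genes)) {
            stop("set ", i, " contains gene indices outside [1, ", num.genes, "]")
        }
        if (anyDuplicated(genes) > 0L) {
            stop("set ", i, " contains duplicated gene indices")
        }
    }
    invisible(TRUE)
}

# A valid list of sets passes silently.
good <- list(c(1L, 5L, 10L), c(2L, 3L))
checkSetMembership(good, num.genes = 10L)

# A set with a repeated index is rejected; the error message is captured here.
bad <- list(c(1L, 1L, 2L))
result <- tryCatch(
    checkSetMembership(bad, num.genes = 10L),
    error = function(e) conditionMessage(e)
)
```

Running such a check first gives an error that names the offending set, which may be easier to act on than a failure raised partway through file creation.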

Value

Several files are produced at path with the <species>_ prefix. These can be hosted so that users can retrieve them with downloadDatabaseFile.

Author

Aaron Lun

Examples

# Mocking up some information.
collections <- data.frame(
    title=c("FOO", "BAR"),
    description=c("I am a foo", "I am a bar"),
    maintainer=c("Aaron", "Aaron"),
    source=c("https://foo", "https://bar"),
    start=c(1L, 21L),
    size=c(20L, 50L)
)

set.info <- data.frame(
    name=c(
        sprintf("FOO_%i", seq_len(20)),
        sprintf("BAR_%i", seq_len(50))
    ),
    description=c(
        sprintf("this is FOO %i", seq_len(20)),
        sprintf("this is BAR %i", seq_len(50))
    ),
    collection=rep(1:2, c(20L, 50L))
)

# Mocking up the gene sets.
num.genes <- 10000L
set.membership <- split(
    sample(num.genes, 5000, replace=TRUE),
    factor(
        sample(nrow(set.info), 5000, replace=TRUE),
        seq_len(nrow(set.info))
    )
)
set.membership <- lapply(set.membership, unique)
set.info$size <- lengths(set.membership)

# Now making the database files.
output <- tempfile()
dir.create(output)
prepareDatabaseFiles(
    "9606",
    collections,
    set.info,
    set.membership,
    num.genes,
    output
)

# We can then read directly from them:
config <- newConfig(fetch.file=function(x) file.path(output, x))
head(fetchAllSets("9606", config))
#>    name   description size collection number
#> 1 FOO_1 this is FOO 1   71          1      1
#> 2 FOO_2 this is FOO 2   77          1      2
#> 3 FOO_3 this is FOO 3   62          1      3
#> 4 FOO_4 this is FOO 4   76          1      4
#> 5 FOO_5 this is FOO 5   64          1      5
#> 6 FOO_6 this is FOO 6   79          1      6
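
In the mock data above, the start and size columns of collections line up with the collection column of set.info: the first collection covers rows 1 to 20 and the second covers rows 21 to 70. The sketch below derives both columns from the per-set collection assignments instead of typing them by hand. It assumes that sets from the same collection occupy consecutive rows and that start is the 1-based row of each collection's first set; this reading is inferred from the example rather than stated in the documentation.

```r
# Mock assignment of each set to a collection, as in the example above.
collection <- rep(1:2, c(20L, 50L))

# Assumption: sets of the same collection sit in consecutive rows,
# ordered by collection index.
stopifnot(!is.unsorted(collection))

# Number of sets in each collection.
size <- tabulate(collection, nbins = max(collection))

# 1-based row of the first set in each collection.
start <- cumsum(c(1L, head(size, -1L)))

data.frame(start = start, size = size)
```

For this mock input, start and size match the values written by hand in the collections data frame.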