Prepare Gesel database files from various pieces of gene set information.

prepareDatabaseFiles(
  species,
  collections,
  set.info,
  set.membership,
  num.genes,
  path = "."
)

Arguments

species

String specifying the species in the form of its NCBI taxonomy ID, e.g., "9606" for human.

collections

Data frame of information about each gene set collection, where each row corresponds to a collection. This data frame should contain the same columns as that returned by fetchAllCollections.

set.info

Data frame of information about each gene set, where each row corresponds to a set. This data frame should contain the same columns as that returned by fetchAllSets.

set.membership

List of integer vectors, where each vector corresponds to a gene set and contains the indices of its constituent genes. All gene indices should be positive, no greater than num.genes, and unique within each set.

num.genes

Integer scalar specifying the total number of genes available for this species.

path

String containing the path to a directory in which to create the database files.
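
The constraints on set.membership described above (positive indices, no greater than num.genes, unique within each set) can be checked before any files are written. The following is a minimal sketch in base R; validateSetMembership is a hypothetical helper for illustration and is not part of gesel.

```r
# Hypothetical helper (not part of gesel): checks the documented
# constraints on 'set.membership' against 'num.genes'.
validateSetMembership <- function(set.membership, num.genes) {
    stopifnot(is.list(set.membership))
    for (i in seq_along(set.membership)) {
        current <- set.membership[[i]]
        # Indices should be non-missing whole numbers.
        if (!is.numeric(current) || anyNA(current) || any(current != round(current))) {
            stop("set ", i, " should contain non-missing integer indices")
        }
        # Indices should be positive and no greater than 'num.genes'.
        if (any(current < 1 | current > num.genes)) {
            stop("set ", i, " contains indices outside of [1, num.genes]")
        }
        # Indices should be unique within each set.
        if (anyDuplicated(current) > 0L) {
            stop("set ", i, " contains duplicated indices")
        }
    }
    invisible(TRUE)
}

# Passes silently: all indices are in [1, 10] and unique within each set.
validateSetMembership(list(c(1L, 5L, 10L), integer(0), 3L), num.genes = 10L)
```

An empty set is accepted by this sketch; whether empty sets are desirable in a real database is left to the caller.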

Value

Several files are created in the directory at path, each with the <species>_ prefix. These can be made available for download with downloadDatabaseFile.

Author

Aaron Lun

Examples

# Mocking up some information.
collections <- data.frame(
    title=c("FOO", "BAR"),
    description=c("I am a foo", "I am a bar"),
    maintainer=c("Aaron", "Aaron"),
    source=c("https://foo", "https://bar"),
    start=c(1L, 21L),
    size=c(20L, 50L)
)

set.info <- data.frame(
    name=c(
        sprintf("FOO_%i", seq_len(20)),
        sprintf("BAR_%i", seq_len(50))
    ),
    description=c(
        sprintf("this is FOO %i", seq_len(20)),
        sprintf("this is BAR %i", seq_len(50))
    ),
    collection=rep(1:2, c(20L, 50L))
)

# Mocking up the gene sets.
num.genes <- 10000
set.membership <- split(
    sample(num.genes, 5000, replace=TRUE),
    factor(
        sample(nrow(set.info), 5000, replace=TRUE),
        seq_len(nrow(set.info))
    )
)
set.membership <- lapply(set.membership, unique)
set.info$size <- lengths(set.membership)

# Now making the database files.
output <- tempfile()
dir.create(output)
prepareDatabaseFiles(
    "9606",
    collections,
    set.info,
    
    set.membership,
    num.genes,
    output
)

# We can then read directly from them:
config <- newConfig(fetch.file=function(x) file.path(output, x))
head(fetchAllSets("9606", config))
#>    name   description size collection number
#> 1 FOO_1 this is FOO 1   71          1      1
#> 2 FOO_2 this is FOO 2   77          1      2
#> 3 FOO_3 this is FOO 3   62          1      3
#> 4 FOO_4 this is FOO 4   76          1      4
#> 5 FOO_5 this is FOO 5   64          1      5
#> 6 FOO_6 this is FOO 6   79          1      6
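
In the mock-up above, collections and set.info are mutually consistent: the sizes sum to the number of sets, and each collection's start is the row of its first set. The sketch below checks this on a trimmed copy of the mock data. It assumes that each collection occupies a contiguous block of rows in set.info, which is inferred from the example values and is not stated in this documentation.

```r
# Trimmed copy of the mocked-up inputs from the example.
collections <- data.frame(
    title = c("FOO", "BAR"),
    start = c(1L, 21L),
    size = c(20L, 50L)
)
set.info <- data.frame(
    name = c(sprintf("FOO_%i", seq_len(20)), sprintf("BAR_%i", seq_len(50))),
    collection = rep(1:2, c(20L, 50L))
)

# Assumption: sets of the same collection are stored contiguously, so the
# collection index and the start of each block follow from the sizes alone.
expected.collection <- rep(seq_len(nrow(collections)), collections$size)
expected.start <- cumsum(c(1L, head(collections$size, -1L)))

stopifnot(
    nrow(set.info) == sum(collections$size),
    identical(set.info$collection, expected.collection),
    all(collections$start == expected.start)
)
```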