gesel package

Module contents

gesel.cache_directory(path=None)[source]

Get or set the default cache directory for download_database_file() and download_gene_file().

Parameters:

path (str | None) – Path to a new cache directory.

Return type:

str

Returns:

If path = None, the path to the Gesel cache directory is returned. This defaults to a location defined by the appdirs package, and can be changed by setting the GESEL_CACHE_DIRECTORY environment variable before the first call to this function.

If path is provided, it is used to set the location of the cache directory, and the previous location is returned.

Examples

>>> import gesel
>>> gesel.cache_directory()
>>> old = gesel.cache_directory("/tmp/foo/bar") # setting it.
>>> gesel.cache_directory() # now it's changed.
>>> gesel.cache_directory(old) # setting it back.
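
The get-or-set behavior described above follows a common pattern: a module-level value that is initialised lazily from an environment variable and swapped on request. The following is a minimal sketch of that pattern only, not gesel's actual implementation; the names setting and _state and the fallback path are hypothetical.

```python
import os

# Hypothetical module-level state; None means "not initialised yet".
_state = {"cache": None}


def setting(path=None, env="GESEL_CACHE_DIRECTORY", fallback="/tmp/gesel-cache"):
    """Return the current value if path is None; otherwise set it and return the old value."""
    if _state["cache"] is None:
        # The environment variable is only consulted on the first call.
        _state["cache"] = os.environ.get(env, fallback)
    previous = _state["cache"]
    if path is not None:
        _state["cache"] = path
    return previous


original = setting()            # get the current value
old = setting("/tmp/foo/bar")   # set a new value; the previous one comes back
current = setting()             # now reports the new value
setting(old)                    # restore the original value
```

Returning the previous value from the setter is what makes the "set, do work, restore" idiom in the examples above a two-liner.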
gesel.database_url(url=None)[source]

Get or set the base URL to the Gesel database files, which is used in download_database_file().

Parameters:

url (str | None) – The new database URL.

Return type:

str

Returns:

If url = None, the URL to the Gesel database is returned. The URL defaults to the GitHub releases page; this can be altered by setting the GESEL_DATABASE_URL environment variable before the first call to this function.

If url is provided, this function sets the database URL to url, and returns the previous value of the URL.

Examples

>>> import gesel
>>> gesel.database_url()
>>> old = gesel.database_url("https://foo.bar")
>>> gesel.database_url(old)
>>> gesel.database_url()
gesel.download_database_file(name, url=None, cache=None, overwrite=False)[source]

Download Gesel database files and store them in a cache on the local file system.

Parameters:
  • name (str) – Name of the file. This usually has the species identifier as a prefix.

  • url (str | None) – Base URL to the Gesel database files. If None, it is set to database_url().

  • cache (str | None) – Path to a cache directory. If None, it is set to cache_directory().

  • overwrite (bool) – Boolean indicating whether any cached file should be overwritten with a new download.

Return type:

str

Returns:

Path to the downloaded file on the local file system.

Examples

>>> import gesel
>>> gesel.download_database_file("9606_collections.tsv.gz")
gesel.download_database_ranges(name, start, end, url=None, multipart=False, concurrency=None)[source]

Download any number of byte ranges from a Gesel database file.

Parameters:
  • name (str) – Name of the file. This usually has the species identifier as a prefix.

  • start (list[int]) – List of integers containing the zero-indexed closed start of each byte range to extract from the file. This list may be empty.

  • end (list[int]) – List of integers containing the zero-indexed open end of each byte range to extract from the file. This should have the same length as start such that the i-th range is defined as [start[i], end[i]). All ranges supplied in a single call to this function should be non-overlapping.

  • url (str | None) – Base URL to the Gesel database files. If None, it is set to database_url().

  • multipart (bool) – Whether the server at url supports multi-part range requests.

  • concurrency (int | None) – Maximum number of concurrent range requests. If None, defaults to range_concurrency(). Only used if multipart = False.

Return type:

list[str]

Returns:

List of byte strings containing the requested bytes for each range. For ranges where end <= start, an empty string is returned.

Examples

>>> import gesel
>>> gesel.download_database_ranges("9606_set2gene.tsv", [0], [100])
>>> gesel.download_database_ranges("9606_set2gene.tsv", [10, 100, 1000], [20, 150, 1100])
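
The half-open range convention for start and end can be illustrated without any network access. The sketch below applies the same [start[i], end[i]) semantics to an in-memory byte string; extract_ranges is a hypothetical helper for illustration and is not part of gesel.

```python
def extract_ranges(payload, start, end):
    """Return payload[start[i]:end[i]] for each i, with b"" for empty ranges."""
    if len(start) != len(end):
        raise ValueError("'start' and 'end' should have the same length")
    output = []
    for s, e in zip(start, end):
        # Half-open interval [s, e): byte s is included, byte e is not.
        output.append(payload[s:e] if e > s else b"")
    return output


payload = bytes(range(256))  # stand-in for the contents of a database file
chunks = extract_ranges(payload, [0, 10, 200], [4, 10, 203])
```

Here the second range has end equal to start, so its entry is an empty byte string, matching the documented behavior for end <= start.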
gesel.download_gene_file(name, url=None, cache=None, overwrite=False)[source]

Download Gesel gene files and store them in a cache on the local file system.

Parameters:
  • name (str) – Name of the file. This usually has the species identifier as a prefix.

  • url (str | None) – Base URL to the Gesel gene files. If None, it is set to gene_url().

  • cache (str | None) – Path to a cache directory. If None, it is set to cache_directory().

  • overwrite (bool) – Boolean indicating whether any cached file should be overwritten with a new download.

Return type:

str

Returns:

Path to the downloaded file on the local file system.

Examples

>>> import gesel
>>> gesel.download_gene_file("9606_symbol.tsv.gz")
gesel.download_multipart_ranges(url, start, end, _mock=None)[source]

Perform a multi-part range request on a Gesel database file.

Parameters:
  • url (str) – URL to a specific Gesel database file.

  • start (list[int]) – List of integers containing the zero-indexed closed start of each byte range to extract from the file. This list may be empty.

  • end (list[int]) – List of integers containing the zero-indexed open end of each byte range to extract from the file. This should have the same length as start such that the i-th range is defined as [start[i], end[i]). All ranges supplied in a single call to this function should be non-overlapping.

  • _mock – Internal use only.

Return type:

list[str]

Returns:

List of byte strings containing the requested bytes for each range. For ranges where end <= start, an empty string is returned.

Examples

>>> import gesel
>>> url = gesel.database_url() + "/9606_set2gene.tsv"
>>> gesel.download_multipart_ranges(url, [0], [100])
>>> gesel.download_multipart_ranges(url, [10, 100, 1000], [10, 150, 900])
>>> # Note: as of writing, GitHub releases don't support multi-part range requests.
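
A multi-part range request places all ranges in a single HTTP Range header, where byte ranges are inclusive at both ends. The sketch below shows how the half-open ranges used here could be converted to that form. It is an illustration of the HTTP convention rather than a description of how gesel builds its requests; to_range_header is a hypothetical helper.

```python
def to_range_header(start, end):
    """Format half-open [start, end) ranges as an HTTP Range header value."""
    parts = []
    for s, e in zip(start, end):
        if e <= s:
            continue  # empty ranges cannot be expressed in a Range header
        # HTTP byte ranges are inclusive at both ends, hence the 'e - 1'.
        parts.append(f"{s}-{e - 1}")
    return ("bytes=" + ",".join(parts)) if parts else None


# Same ranges as the last example above: the first and third are empty.
header = to_range_header([10, 100, 1000], [10, 150, 900])
```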
gesel.effective_number_of_genes(species, config=None)[source]
Return type:

int

gesel.fetch_all_collections(species, config=None)[source]

Fetch information about all gene set collections in the Gesel database.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

BiocFrame

Returns:

A BiocFrame where each row represents a collection. This contains the following columns:

  • title, string containing the title of the collection.

  • description, string containing a description of the collection.

  • maintainer, string containing the identity of the collection’s maintainer.

  • source, string containing the source of origin of the collection.

  • start, integer containing the set index of the first gene set in this collection. The set index refers to a row in the data frame returned by fetch_all_sets().

  • size, integer specifying the number of gene sets in the collection.

After the first call to this function, the data frame is cached in memory and re-used in subsequent calls. The cached information is also used to speed up fetch_some_collections().

Examples

>>> import gesel
>>> df = gesel.fetch_all_collections("9606")
>>> print(df)
gesel.fetch_all_genes(species, types=['symbol', 'entrez', 'ensembl'], config=None)[source]

Fetch names and identifiers of various types for all genes.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • types (list) – Types of gene names to return. Typically one or more of symbol, entrez, and/or ensembl.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

BiocFrame

Returns:

A BiocFrame where each row represents a gene. Each column corresponds to one of the types and is a list of lists. Each inner list in the column contains the names of the specified type for each gene.

Examples

>>> import gesel
>>> df = gesel.fetch_all_genes("9606")
>>> print(df)
>>> print(df["symbol"][1:10])
gesel.fetch_all_sets(species, config=None)[source]

Fetch information about all gene sets in the Gesel database.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

BiocFrame

Returns:

A BiocFrame where each row represents a gene set. This contains the following columns:

  • name, string containing the name of the gene set.

  • description, string containing a description of the gene set.

  • size, integer specifying the number of genes in this gene set.

  • collection, integer containing the collection index of the collection that contains this gene set. The collection index refers to a row of the data frame returned by fetch_all_collections().

  • number, integer containing the position of the gene set inside the specified collection. The set index of the current gene set is defined by adding number to the collection’s start.

After the first call to this function, the data frame is cached in memory and re-used in subsequent calls. The cached information is also used to speed up fetch_some_sets().

Examples

>>> import gesel
>>> df = gesel.fetch_all_sets("9606")
>>> print(df)
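
The relationship between the collection and number columns here and the start and size columns of fetch_all_collections() can be checked on toy data. The values below are made up for illustration and do not come from the Gesel database.

```python
# Toy stand-ins for the 'start' and 'size' columns of fetch_all_collections().
collection_start = [0, 3]
collection_size = [3, 2]

# Toy stand-ins for the 'collection' and 'number' columns of fetch_all_sets().
set_collection = [0, 0, 0, 1, 1]
set_number = [0, 1, 2, 0, 1]

# The set index of each row is the collection's start plus the set's number.
set_index = [collection_start[c] + n for c, n in zip(set_collection, set_number)]

# Conversely, the sets of the second collection occupy rows start .. start + size - 1.
sets_in_second = list(range(collection_start[1], collection_start[1] + collection_size[1]))
```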
gesel.fetch_collection_sizes(species, config=None)[source]

Quickly get the sizes of the collections in the Gesel database. This is more efficient than fetch_all_collections() when only the sizes are of interest.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List containing the size of each collection (i.e., the number of gene sets in each collection).

Examples

>>> import gesel
>>> gesel.fetch_collection_sizes("9606")
gesel.fetch_genes_for_all_sets(species, config=None)[source]

Fetch the identities for genes in all sets in the Gesel database.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of lists. Each inner list represents a gene set, corresponding to the rows of the data frame returned by fetch_all_sets(). Each inner list contains the identities of the genes in that set, where each integer is a gene index that refers to a row of the data frame returned by fetch_all_genes().

Examples

>>> import gesel
>>> set_to_gene = gesel.fetch_genes_for_all_sets("9606")
>>> len(set_to_gene)
>>> set_to_gene[0]
>>>
>>> # Genes in the first set:
>>> gene_symbols = gesel.fetch_all_genes("9606")["symbol"]
>>> import biocutils
>>> biocutils.subset(gene_symbols, set_to_gene[0])
>>>
>>> # Identity of the first set.
>>> set_info = gesel.fetch_all_sets("9606")
>>> print(set_info[0,:])
gesel.fetch_genes_for_some_sets(species, sets, config=None)[source]

Fetch genes for some sets in the Gesel database. This can be more efficient than fetch_genes_for_all_sets() if only a few sets are of interest.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • sets (list) – List of set indices, where each set index refers to a row in the data frame returned by fetch_all_sets().

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of integer vectors. Each vector corresponds to a set in sets and contains the identities of its member genes. Each gene is defined by its gene index, which refers to a row of the data frame returned by fetch_all_genes().

Examples

>>> import gesel
>>> first_set = gesel.fetch_genes_for_some_sets("9606", [0, 10, 20])
>>> len(first_set)
>>> first_set[0]
>>>
>>> # Genes in the first set:
>>> gene_symbols = gesel.fetch_all_genes("9606")["symbol"]
>>> import biocutils
>>> biocutils.subset(gene_symbols, first_set[0])
>>>
>>> # Identities of the sets used above.
>>> set_info = gesel.fetch_all_sets("9606")
>>> print(set_info[[0, 10, 20], :])
gesel.fetch_set_sizes(species, config=None)[source]

Quickly get the sizes of the sets in the Gesel database. This is more efficient than fetch_all_sets() when only the sizes are of interest.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List containing the size of each set (i.e., the number of genes in each set).

Examples

>>> import gesel
>>> gesel.fetch_set_sizes("9606")
gesel.fetch_sets_for_all_genes(species, config=None)[source]

Fetch the identities of the sets that contain each gene in the Gesel database.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of lists. Each inner list represents a gene, corresponding to the rows of the data frame returned by fetch_all_genes(). Each inner list contains the identities of the sets that include that gene, where each integer is a set index that refers to a row of the data frame returned by fetch_all_sets().

Examples

>>> import gesel
>>> gene_to_set = gesel.fetch_sets_for_all_genes("9606")
>>> len(gene_to_set)
>>> gene_to_set[0]
>>>
>>> # Sets containing the first gene.
>>> set_info = gesel.fetch_all_sets("9606")
>>> print(set_info[gene_to_set[0],:])
>>>
>>> # Identity of the first gene.
>>> gene_info = gesel.fetch_all_genes("9606")
>>> print(gene_info[0,:])
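
The list returned here and the one returned by fetch_genes_for_all_sets() can be read as two views of the same membership relation, one indexed by gene and the other by set. The toy sketch below shows how one layout maps to the other under that reading; invert_mapping is a hypothetical helper and the data are made up.

```python
def invert_mapping(set_to_gene, num_genes):
    """Turn a per-set list of gene indices into a per-gene list of set indices."""
    gene_to_set = [[] for _ in range(num_genes)]
    for set_index, genes in enumerate(set_to_gene):
        for gene_index in genes:
            gene_to_set[gene_index].append(set_index)
    return gene_to_set


# Three toy sets over five toy genes; gene 4 belongs to no set.
set_to_gene = [[0, 2], [1, 2, 3], [0]]
gene_to_set = invert_mapping(set_to_gene, num_genes=5)
```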
gesel.fetch_sets_for_some_genes(species, genes, config=None)[source]

Fetch all sets that contain some genes in the Gesel database. This can be more efficient than fetch_sets_for_all_genes() if only a few genes are of interest.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • genes (list) – List of integers containing the gene indices. Each gene index refers to a row of the data frame returned by fetch_all_genes().

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of integer vectors. Each vector corresponds to a gene in genes and contains the identities of the sets containing that gene. Each set is defined by its set index, which refers to a row of the data frame returned by fetch_all_sets().

Examples

>>> import gesel
>>> has_genes = gesel.fetch_sets_for_some_genes("9606", [0, 5, 10])
>>> has_genes[0]
>>>
>>> # Sets containing the first gene:
>>> set_info = gesel.fetch_all_sets("9606")
>>> print(set_info[has_genes[0], :])
>>>
>>> # Identities of the genes used above:
>>> gene_symbols = gesel.fetch_all_genes("9606")["symbol"]
>>> import biocutils
>>> print(biocutils.subset(gene_symbols, [0, 5, 10]))
gesel.fetch_some_collections(species, collections, config=None)[source]

Fetch the details of some gene set collections from the Gesel database. This can be more efficient than fetch_all_collections() when only a few collections are of interest.

Every time this function is called, information from the requested collections will be added to an in-memory cache. Subsequent calls to this function will re-use as many of the cached collections as possible.

If fetch_all_collections() was previously called, information from all collections is already cached in memory and will be retrieved when this function is called. If collections is large, it may be beneficial to call fetch_all_collections() first before calling this function.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • collections (list) – List of collection indices. Each entry refers to a row of the data frame returned by fetch_all_collections().

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

BiocFrame

Returns:

A BiocFrame with the same columns as that returned by fetch_all_collections(), where each row corresponds to an entry of collections.

Examples

>>> import gesel
>>> gesel.fetch_some_collections("9606", [0])
gesel.fetch_some_sets(species, sets, config=None)[source]

Fetch the details of some gene sets from the Gesel database. This can be more efficient than fetch_all_sets() when only a few sets are of interest.

Every time this function is called, information from the requested sets will be added to an in-memory cache. Subsequent calls to this function will re-use as many of the cached sets as possible.

If fetch_all_sets() was previously called, information from all sets is already cached in memory and will be retrieved when this function is called. If sets is large, it may be beneficial to call fetch_all_sets() first before calling this function.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • sets (list) – List of set indices, where each set index refers to a row in the data frame returned by fetch_all_sets().

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

BiocFrame

Returns:

A BiocFrame with the same columns as that returned by fetch_all_sets(), where each row corresponds to an entry of sets.

Examples

>>> import gesel
>>> gesel.fetch_some_sets("9606", [0, 10, 20])
gesel.find_overlapping_sets(species, genes, counts_only=True, config=None)[source]

Find all sets overlapping any gene in a user-supplied list, and return the number of overlaps per set.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • genes (list) – List of integers containing gene indices. Each gene index refers to a row of the data frame returned by fetch_all_genes().

  • counts_only (bool) – Whether to only report the number of overlapping genes for each set.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

tuple

Returns:

A tuple of length 2.

The first element is a BiocFrame describing the overlapping sets. Each row represents a set that is identified by the set index in the set column. (This set index refers to a row of the data frame returned by fetch_all_sets().) It also has:

  • The counts column, if counts_only = True. This specifies the number of overlaps between the genes in the set and those in genes.

  • The genes column, if counts_only = False. This is a list that contains the entries of genes that overlap with those in the set.

Rows are sorted by the number of overlapping genes, in decreasing order.

The second element is an integer containing the number of genes in genes that are present in at least one set in the Gesel database for species. This is intended for use as the number of draws when performing a hypergeometric test for gene set enrichment, instead of len(genes). It ensures that genes outside of the Gesel universe are ignored, e.g., due to user error or different genome versions. Otherwise, unknown genes would inappropriately increase the number of draws and inflate the enrichment p-value.

Examples

>>> # Pretend that the first 10 genes are what we're interested in.
>>> genes_of_interest = range(10)
>>>
>>> import gesel
>>> overlaps, present = gesel.find_overlapping_sets("9606", genes_of_interest)
>>> print(overlaps)
>>> present
>>>
>>> # More details on the overlapping sets.
>>> all_sets = gesel.fetch_all_sets("9606")
>>> print(all_sets[overlaps["set"],:])
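
The hypergeometric test mentioned above can be sketched with the standard library alone. In this sketch, overlap would come from the counts column, draws from the second element of the returned tuple, and set_size from fetch_set_sizes(); the universe size is assumed to be supplied separately (possibly by effective_number_of_genes(), which is undocumented here, so treat that as a guess). The function name and the toy numbers are hypothetical.

```python
from math import comb  # requires Python 3.8 or later


def hypergeometric_upper_tail(overlap, set_size, draws, universe):
    """P(X >= overlap) when 'draws' genes are sampled without replacement
    from 'universe' genes, of which 'set_size' belong to the gene set."""
    total = comb(universe, draws)
    upper = min(set_size, draws)
    favourable = sum(
        comb(set_size, k) * comb(universe - set_size, draws - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 4 of our 10 recognised genes fall in a 20-gene set,
# out of a universe of 1000 genes.
p = hypergeometric_upper_tail(overlap=4, set_size=20, draws=10, universe=1000)
```

Using the number of recognised genes as draws, rather than len(genes), keeps genes outside the universe from affecting the tail probability.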
gesel.flush_memory_cache(config=None)[source]

Flush the in-memory cache for Gesel data structures in the current Python session. By default, Gesel functions cache the data structures in the current session to avoid unnecessary requests to the filesystem and remote server. On rare occasions, these cached data structures may be out of date when the Gesel database files change. In such cases, the cache can be flushed to ensure that the various Gesel functions operate on the latest version of the database.

Parameters:

config (dict | None) – Configuration object created by new_config(). If None, the default configuration is used.

Returns:

The in-memory cache in config is cleared.

Examples

>>> import gesel
>>> gesel.flush_memory_cache()
gesel.gene_url(url=None)[source]

Get or set the base URL to the Gesel gene files, which is used in download_gene_file().

Parameters:

url (str | None) – The new gene URL.

Return type:

str

Returns:

If url = None, the URL to the Gesel gene files is returned. The default gene URL is set to the GitHub releases page. This can be altered by setting the GESEL_GENE_URL environment variable.

If url is provided, this function sets the gene URL to url, and returns the previous value of the URL.

Examples

>>> import gesel
>>> gesel.gene_url()
>>> old = gesel.gene_url("https://foo.bar")
>>> gesel.gene_url(old)
>>> gesel.gene_url()
gesel.map_genes_by_name(species, type, ignore_case=False, config=None)[source]

Create a mapping of gene names (Ensembl, symbol, etc.) to their gene indices.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • type (str) – Type of gene name/identifier. Typically one of symbol, entrez, or ensembl.

  • ignore_case (bool) – Whether case should be ignored when creating the mapping.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

dict

Returns:

Dictionary where each key is the gene name/identifier of the specified type and each value is a list of integers. Each list contains the genes associated with that name (after ignoring case, if ignore_case = True). List entries should be interpreted as indices into any of the lists returned by fetch_all_genes().

Examples

>>> import gesel
>>> mapping = gesel.map_genes_by_name("9606", type="symbol")
>>>
>>> # Taking it for a spin:
>>> found = mapping["SNAP25"]
>>> print(found)
>>>
>>> # Cross-checking it
>>> ref = gesel.fetch_all_genes("9606")["symbol"]
>>> import biocutils
>>> biocutils.subset(ref, found)
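
The shape of the returned dictionary can be illustrated by building one from a list of lists, such as a single column of fetch_all_genes(). This is a sketch under stated assumptions: build_name_mapping is a hypothetical helper, the symbols are toy data, and lower-casing the keys is only one possible way to ignore case, which may differ from what gesel does.

```python
def build_name_mapping(names_per_gene, ignore_case=False):
    """Map each name to the list of gene indices that carry it."""
    mapping = {}
    for gene_index, names in enumerate(names_per_gene):
        for name in names:
            # Assumption: case is ignored by lower-casing the dictionary keys.
            key = name.lower() if ignore_case else name
            indices = mapping.setdefault(key, [])
            if gene_index not in indices:
                indices.append(gene_index)
    return mapping


# Toy column: gene 1 has no symbol, gene 2 has two, genes 0 and 3 share one.
symbols = [["SNAP25"], [], ["Neurod6", "ATOH2"], ["SNAP25"]]
mapping = build_name_mapping(symbols, ignore_case=True)
```

A name can map to several gene indices, which is why each value is a list rather than a single integer.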
gesel.new_config(fetch_gene=None, fetch_gene_kwargs={}, fetch_file=None, fetch_file_kwargs={}, fetch_ranges=None, fetch_ranges_kwargs={})[source]

Create a new configuration object to specify how the Gesel database should be queried. This can be passed to each Gesel function to alter its behavior in a consistent manner. For example, we could point to a different Gesel database from the default, or override fetch_file to retrieve database files from a shared filesystem instead of performing an HTTP request.

The configuration object also contains a cache of data structures that can be populated by Gesel functions. This avoids unnecessary fetch requests upon repeated calls to the same function. If the cache becomes stale or too large, it can be cleared by calling flush_memory_cache().

If no configuration object is supplied to Gesel functions, the default configuration is used. The default is created by calling new_config() without any arguments.

Parameters:
  • fetch_gene (Callable | None) – Function that accepts the name of the file in the Gesel gene descriptions and returns an absolute path to the file. If None, it defaults to download_gene_file().

  • fetch_gene_kwargs (dict) – Dictionary of name:value pairs containing extra arguments to pass to fetch_gene.

  • fetch_file (Callable | None) – Function that accepts the name of the file in the Gesel database and returns an absolute path to the file. If None, it defaults to download_database_file().

  • fetch_file_kwargs (dict) – Dictionary of name:value pairs containing extra arguments to pass to fetch_file.

  • fetch_ranges (Callable | None) – Function that accepts three arguments - the name of the file in the Gesel database, an integer vector containing the starts of the byte ranges, and another vector containing the ends of the byte ranges - and returns a list of byte strings containing the contents of the specified ranges. If None, it defaults to download_database_ranges().

  • fetch_ranges_kwargs (dict) – Dictionary of name:value pairs containing extra arguments to pass to fetch_ranges.

Return type:

dict

Returns:

Dictionary of Gesel configuration settings.

Examples

>>> import gesel
>>> gesel.new_config()
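
As an illustration of the shared-filesystem use case mentioned above, the sketch below defines replacement functions with the shapes described for fetch_file and fetch_ranges. The function names, the directory argument and the toy file are hypothetical, and the commented-out new_config() call at the end has not been tested against gesel; it assumes that the *_kwargs dictionaries are forwarded as keyword arguments.

```python
import os
import tempfile


def fetch_from_directory(name, directory):
    """Return the absolute path to 'name' inside a local mirror of the database."""
    path = os.path.abspath(os.path.join(directory, name))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no file named '{name}' in '{directory}'")
    return path


def ranges_from_directory(name, start, end, directory):
    """Read half-open byte ranges [start[i], end[i]) from a local file."""
    output = []
    with open(fetch_from_directory(name, directory), "rb") as handle:
        for s, e in zip(start, end):
            if e <= s:
                output.append(b"")  # empty range
                continue
            handle.seek(s)
            output.append(handle.read(e - s))
    return output


# Demonstration against a throwaway directory standing in for a shared filesystem.
mirror = tempfile.mkdtemp()
with open(os.path.join(mirror, "toy.tsv"), "wb") as handle:
    handle.write(b"alpha\nbeta\ngamma\n")

found = fetch_from_directory("toy.tsv", mirror)
pieces = ranges_from_directory("toy.tsv", [0, 6], [5, 10], mirror)

# With gesel installed, these could plausibly be wired in as follows (untested):
# config = gesel.new_config(
#     fetch_file=fetch_from_directory,
#     fetch_file_kwargs={"directory": mirror},
#     fetch_ranges=ranges_from_directory,
#     fetch_ranges_kwargs={"directory": mirror},
# )
```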
gesel.range_concurrency(concurrency=None)[source]

Get or set the default number of threads for concurrent range requests in download_database_ranges().

Parameters:

concurrency (int | None) – Number of threads.

Returns:

If concurrency = None, the number of threads is returned (default 10).

If concurrency is provided, it is used to set the number of threads, and the previous number of threads is returned.

Examples

>>> import gesel
>>> gesel.range_concurrency()
>>> old = gesel.range_concurrency(5)
>>> print(old)
>>> gesel.range_concurrency(old)
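
To show what a cap on concurrent range requests means in practice, the sketch below runs one task per range on a bounded thread pool. It is not gesel's implementation; fetch_concurrently and fetch_one are hypothetical names, and an in-memory byte string stands in for the remote file.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_concurrently(fetch_one, start, end, concurrency=10):
    """Call fetch_one(s, e) for each range using at most 'concurrency' threads."""
    if not start:
        return []
    workers = max(1, min(concurrency, len(start)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, regardless of completion order.
        return list(pool.map(fetch_one, start, end))


payload = bytes(range(100))  # stand-in for a remote file
results = fetch_concurrently(
    lambda s, e: payload[s:e], [0, 10, 50], [2, 12, 53], concurrency=2
)
```

A lower cap reduces the load on the server at the cost of more waiting when many ranges are requested.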
gesel.search_genes(species, genes, types=['entrez', 'ensembl', 'symbol'], ignore_case=True, config=None)[source]

Search for genes by converting gene identifiers to gene indices.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • genes (list) – List of gene names of any type specified in types.

  • types (list) – Types of gene names to search against. Typically one or more of symbol, entrez, and/or ensembl.

  • ignore_case (bool) – Whether case should be ignored when creating the mapping.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of the same length as genes. Each entry is an integer vector of gene indices that refer to rows of the data frame returned by fetch_all_genes(). These rows represent the genes that match the corresponding entry of genes.

Examples

>>> import gesel
>>> out = gesel.search_genes("9606", ["SNAP25", "NEUROD6", "ENSG00000139618"])
>>> print(out)
>>>
>>> # Round-tripping them:
>>> genes = gesel.fetch_all_genes("9606")
>>> print(genes[out[0],:])
>>> print(genes[out[1],:])
>>> print(genes[out[2],:])
gesel.search_set_text(species, query, use_name=True, use_description=True, config=None)[source]

Search for sets based on their names and descriptions.

Parameters:
  • species (str) – NCBI taxonomy ID of the species of interest.

  • query (str) – One or more words to search on. A set is only matched if its name/description matches all of the tokens in the query. The * and ? wildcards can be used to match any number of characters or exactly one character, respectively.

  • use_name (bool) – Whether to search for the query string in the name of the set.

  • use_description (bool) – Whether to search for the query string in the description of the set.

  • config (dict | None) – Configuration object, typically created by new_config(). If None, the default configuration is used.

Return type:

list

Returns:

List of set indices for the matching gene sets. Each set index refers to a row in the data frame returned by fetch_all_sets().

Examples

>>> import gesel
>>> out = gesel.search_set_text("9606", "cancer")
>>> print(gesel.fetch_all_sets("9606")[out, :])
>>>
>>> out = gesel.search_set_text("9606", "innate immun*")
>>> print(gesel.fetch_all_sets("9606")[out, :])
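
The matching rule for query (all tokens must match, with * and ? as wildcards) can be sketched on toy descriptions. This is an illustration only: the lower-casing and the splitting of text into alphanumeric words are assumptions that may not reflect how gesel tokenises its text, and token_to_regex and matches_query are hypothetical helpers.

```python
import re


def token_to_regex(token):
    """Translate a query token with * and ? wildcards into a compiled regex."""
    pieces = []
    for char in token:
        if char == "*":
            pieces.append(".*")  # any number of characters
        elif char == "?":
            pieces.append(".")   # exactly one character
        else:
            pieces.append(re.escape(char))
    return re.compile("".join(pieces))


def matches_query(text, query):
    """True if every token in 'query' matches at least one word in 'text'."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Keep the wildcard characters when tokenising the query.
    tokens = re.findall(r"[a-z0-9*?]+", query.lower())
    return all(
        any(token_to_regex(token).fullmatch(word) for word in words)
        for token in tokens
    )


descriptions = [
    "Innate immune response",
    "Adaptive immunity in cancer",
    "Innate behaviour",
]
hits = [i for i, d in enumerate(descriptions) if matches_query(d, "innate immun*")]
```

Only the first description contains a match for both tokens, so it is the only hit.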