speos.preprocessing#

Speos automatically integrates various types of inputs, namely GWAS summary statistics, gene expression data and different biological networks. The documentation below describes the classes involved, so that new data sources can be integrated easily.

Quickstart#

Speos has an InputHandler class that requires only a config file and returns the fully equipped preprocessor. Use this class if all you want is the data for a given run:

class speos.preprocessing.handler.InputHandler(config, **preprocessor_kwargs)#

get_data(*args, **kwargs)#

Utility function that calls get_data of the preprocessor

Returns:: Input matrix X, ground truth y and adjacency matrix adj as PyTorch tensors.
Return type:: tuple(Tensor, Tensor, Tensor)

get_preprocessor()#

Returns:: The Preprocessor object that holds all the data necessary for the run in graph format.
Return type:: speos.preprocessing.preprocessor.Preprocessor
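A minimal usage sketch, based only on the signatures listed above. How the config object is created is not covered in this section, so it is simply passed in as an argument here; the helper name `load_run_data` is hypothetical and not part of Speos.

```python
def load_run_data(config):
    """Return the preprocessor and the (X, y, adj) tensors for one run.

    `config` is assumed to be an already loaded Speos config object.
    """
    # Imported lazily so the sketch can be read and defined without Speos installed.
    from speos.preprocessing.handler import InputHandler

    handler = InputHandler(config)
    # The Preprocessor object that holds the data in graph format.
    preprocessor = handler.get_preprocessor()
    # Utility call that forwards to get_data() of the preprocessor.
    X, y, adj = handler.get_data()
    return preprocessor, X, y, adj
```

Use get_data() alone if only the tensors are needed; get_preprocessor() is useful when the graph itself should be inspected.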

Preprocessor#

The Preprocessor is a rather extensive class that strings together all preprocessing operations, such as reading files, building the graph from edgelists and normalizing input features. It has a few useful methods for users, such as get_data(), get_graph() and get_feature_names().

class speos.preprocessing.preprocessor.PreProcessor(config, mapping_list: list, adjacency_list: list, translation_table: str = 'data/hgnc_official_list.tsv', expression_files=['data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct', 'data/human_protein_atlas_rna_blood_cell.tsv'], extension_inputs: str = './extensions/datasets.json')#

assign_new_ground_truth(mapping_list, compile=True) → None#

Goes through mapping list and extracts new ground truth.

If compile is set to True, calls self._add_y_label and adds labels to nodes in the graph. To do this, the graph has to be already built.

Explicitly call this method only when you want to plot the same adjacency with different labels. For ML runs, initialize graphs from scratch each time.

build_graph(features=True, use_embeddings=None, adjacency=True)#

Builds the graph given the adjacency matrices and input features specified during initialization.

Parameters:

features (bool) – If features should be added to nodes. This leads to longer compilation times and also changes the number and indices of nodes, because nodes with missing features are removed from the graph.
use_embeddings (bool) – If node embeddings should be added to node features. If None, the respective setting will be read from the config provided during initialization.
adjacency (bool) – If adjacencies should be loaded and the graph constructed or if only features and labels should be loaded.
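The effect of the features flag on node indices can be illustrated with a toy example. This is a sketch of the general idea only, not the Speos implementation; the function name and the gene names are made up for illustration.

```python
def drop_nodes_with_missing_features(nodes, features):
    """Keep only nodes that have a feature vector and assign new indices."""
    kept = [node for node in nodes if features.get(node) is not None]
    # Indices are reassigned consecutively, so they can differ from the
    # positions the nodes had before filtering.
    return {node: index for index, node in enumerate(kept)}


nodes = ["GENE_A", "GENE_B", "GENE_C"]
features = {"GENE_A": [0.1, 0.2], "GENE_C": [0.3, 0.4]}  # GENE_B has no features

index_with_features = drop_nodes_with_missing_features(nodes, features)
index_without_features = {node: index for index, node in enumerate(nodes)}
```

Here GENE_C has index 2 without features but index 1 once GENE_B is dropped, which is why indices from graphs built with different settings should not be mixed.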

contains_directed_graphs() → bool#: Returns True if any of the adjacency matrices is directed

contains_only_directed_graphs() → bool#: Returns True if all of the adjacency matrices are directed
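The difference between the two checks amounts to "any" versus "all". The sketch below assumes each adjacency is described by a dict with a boolean 'directed' key, as in the mapping dicts returned by AdjacencyMapper; whether the Preprocessor evaluates exactly this key internally is an assumption, and the function names are hypothetical.

```python
def contains_directed(mappings):
    """True if at least one adjacency is directed."""
    return any(mapping["directed"] for mapping in mappings)


def contains_only_directed(mappings):
    """True if every adjacency is directed."""
    return all(mapping["directed"] for mapping in mappings)


mixed = [
    {"name": "BioPlex30293T", "directed": False},
    {"name": "Recon3D", "directed": True},
]
directed_only = [{"name": "Recon3D", "directed": True}]

mixed_result = (contains_directed(mixed), contains_only_directed(mixed))
directed_result = (contains_directed(directed_only), contains_only_directed(directed_only))
```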

dump_edgelist(path, symbol='hgnc')#

Dumps the edgelist that is currently held by the graph in tab-separated format. If you need other formats, call self.get_graph() and write the edgelists from the graph object.

Parameters:

path (str) – Path to the file where the edgelist will be written to.
symbol (str) – Specify in which vocabulary/symbol the genes should be identified. Either ‘hgnc’, ‘entrez’ or ‘ensembl’.
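If a format other than tab-separated is needed, the edges can be written by hand. The following is a sketch using only the standard library; the function name, the column headers and the example triples are placeholders. On a networkx.MultiDiGraph, graph.edges(keys=True) yields (source, target, key) triples that could be passed in the same way, but what Speos stores as edge keys is not stated here.

```python
import csv
import io


def write_edgelist(edges, handle, delimiter=","):
    """Write (source, target, relation) triples to an open text handle."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["source", "target", "relation"])
    for source, target, relation in edges:
        writer.writerow([source, target, relation])


# Placeholder edges; gene names and relation labels are made up.
edges = [("GENE_A", "GENE_B", "ppi"), ("GENE_B", "GENE_C", "grn")]

buffer = io.StringIO()
write_edgelist(edges, buffer)
csv_text = buffer.getvalue()
```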

get_data(features=True)#

Returns the data in the same format produced by self.format_for_pygeo(), but compiles the data beforehand (i.e. builds the graph etc.) if that has not happened yet.

Returns:: Input matrix X, ground truth y and adjacency matrix adj as PyTorch tensors.
Return type:: tuple(Tensor, Tensor, Tensor)

get_graph(features=False, use_embeddings=False)#

Returns a networkx graph object with the required settings. If the graph hasn't been built yet, it is built first.

Parameters:

features (bool) – If the node features should be read and added to the nodes. Nodes that have missing features will be removed, so this setting changes the size of the graph returned. Only relevant if the graph hasn't been built yet.
use_embeddings (bool) – If node embeddings should be added to node features. If None, the respective setting will be read from the config provided during initialization. Only relevant if the graph hasn't been built yet.

Returns:

The graph object holding all input adjacencies, node labels and input features.

Return type:

networkx.MultiDiGraph

get_num_relations()#

Returns:: Number of different relations / edge types in the graph.
Return type:: int

GWAS Data#

The mapping of phenotypes to appropriate GWAS traits is done by the speos.preprocessing.mappers.GWASMapper:

class speos.preprocessing.mappers.GWASMapper(mapping_file: str = './speos/mapping.json', extension_mappings: str = './extensions/mapping.json', **kwargs)#

Handles the mapping of y labels to GWAS feature files.

Enables simple matching of multiple GWAS to individual Phenotypes via its get_mappings() method.

Parameters:

mapping_file (str) – The path to the file that maps ground truths (labels) to sets of feature (GWAS) files. (default: ./speos/mapping.json)
extension_mappings (str) – The path to the file that maps ground truths (labels) to sets of feature (GWAS) files for user-defined extensions. (default: ./extensions/mapping.json)

get_mappings(*args, **kwargs)#: Returns the mappings that fit the description. If the description yields no mappings because GWAS files are missing, one of the mappings is returned anyway, so that the mapping to the labels is still available.

Biological Networks#

The mapping of networks and filtering by their properties is done by the speos.preprocessing.mappers.AdjacencyMapper:

class speos.preprocessing.mappers.AdjacencyMapper(mapping_file: str = 'speos/adjacencies.json', extension_mappings: str = './extensions/adjacencies.json', **kwargs)#

Handles the mapping of names and network types to their respective files.

Enables simple matching of multiple Networks to individual queries via its get_mappings() method.

Parameters:

mapping_file (str) – The path to the file that describes the networks and where they are stored. (default: speos/adjacencies.json)
extension_mappings (str) – The path to the file that describes the networks and where they are stored for user-defined extensions. (default: ./extensions/adjacencies.json)

get_mappings(tags: str = '', fields: str = 'name')#

Goes through the mapping list and returns all mappings that include the provided tag in the provided field (the default is the name field).

If called without arguments, returns all mappings (tags = "").

Parameters:

tags (str/list) – The tag or a list of tags that should be searched for in the given field of the adjacencies (i.e. a name, a type etc.).
fields (str/list) – The field in which the tag should be searched for. If fields is a string, all tags are searched in that field. In case of multiple tags and multiple fields, the lengths must match and the nth tag is searched in the nth field.

Returns:

List of adjacencies that match the tag/field mapping and are not blacklisted.

Return type:

list

For example:

>>> from speos.preprocessing.mappers import AdjacencyMapper
>>> mapper = AdjacencyMapper()     # initialize with default
>>>
>>> # get BioPlex 3.0 293T
>>> mapper.get_mappings(tags="BioPlex 3.0 293T", fields="name")
[{'name': 'BioPlex30293T', 'type': 'ppi', 'file_path': 'data/ppi/BioPlex_293T_Network_10K_Dec_2019.tsv', 'source': 'SymbolA', 'target': 'SymbolB', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': False}]
>>>
>>> # also possible without punctuation and spaces
>>> mapper.get_mappings(tags="BioPlex30293T", fields="name")
[{'name': 'BioPlex30293T', 'type': 'ppi', 'file_path': 'data/ppi/BioPlex_293T_Network_10K_Dec_2019.tsv', 'source': 'SymbolA', 'target': 'SymbolB', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': False}]
>>>
>>> # get both implemented BioPlex networks
>>> mapper.get_mappings(tags="BioPlex", fields="name")
[{'name': 'BioPlex30HCT116', 'type': 'ppi', 'file_path': 'data/ppi/BioPlex_HCT116_Network_5.5K_Dec_2019.tsv', 'source': 'SymbolA', 'target': 'SymbolB', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': False}, {'name': 'BioPlex30293T', 'type': 'ppi', 'file_path': 'data/ppi/BioPlex_293T_Network_10K_Dec_2019.tsv', 'source': 'SymbolA', 'target': 'SymbolB', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': False}]
>>>
>>> # get all implemented Gene regulatory networks
>>> mapper.get_mappings(tags="grn", fields="type")
[{'name': 'hetionetregulates', 'type': 'grn', 'file_path': 'data/hetionet/hetionet_regulates.tsv', 'source': 'GeneA', 'target': 'GeneB', 'sep': ' ', 'symbol': 'entrez', 'weight': 'None', 'directed': True}, {'name': 'GRNDBadrenalgland', 'type': 'grn', 'file_path': 'data/grndb/adrenal_gland.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBbloodx', 'type': 'grn', 'file_path': 'data/grndb/blood.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBbloodvessel', 'type': 'grn', 'file_path': 'data/grndb/blood_vessel.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBbrain', 'type': 'grn', 'file_path': 'data/grndb/brain.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBbreast', 'type': 'grn', 'file_path': 'data/grndb/breast.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBcolon', 'type': 'grn', 'file_path': 'data/grndb/colon.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBesophagus', 'type': 'grn', 'file_path': 'data/grndb/esophagus.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBheart', 'type': 'grn', 'file_path': 'data/grndb/heart.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBkidney', 'type': 'grn', 'file_path': 'data/grndb/kidney.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBliver', 'type': 'grn', 'file_path': 'data/grndb/liver.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 
'directed': True}, {'name': 'GRNDBlung', 'type': 'grn', 'file_path': 'data/grndb/lung.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBmuscle', 'type': 'grn', 'file_path': 'data/grndb/muscle.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBnerve', 'type': 'grn', 'file_path': 'data/grndb/nerve.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBovary', 'type': 'grn', 'file_path': 'data/grndb/ovary.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBpancreas', 'type': 'grn', 'file_path': 'data/grndb/pancreas.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBpituitary', 'type': 'grn', 'file_path': 'data/grndb/pituitary.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBprostate', 'type': 'grn', 'file_path': 'data/grndb/prostate.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBsalivarygland', 'type': 'grn', 'file_path': 'data/grndb/salivary_gland.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBskin', 'type': 'grn', 'file_path': 'data/grndb/skin.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBsmallintestine', 'type': 'grn', 'file_path': 'data/grndb/small_intestine.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBspleen', 'type': 'grn', 'file_path': 'data/grndb/spleen.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 
'None', 'directed': True}, {'name': 'GRNDBstomach', 'type': 'grn', 'file_path': 'data/grndb/stomach.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBtestis', 'type': 'grn', 'file_path': 'data/grndb/testis.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBthyroid', 'type': 'grn', 'file_path': 'data/grndb/thyroid.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDButerus', 'type': 'grn', 'file_path': 'data/grndb/uterus.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}, {'name': 'GRNDBvagina', 'type': 'grn', 'file_path': 'data/grndb/vagina.txt', 'source': 'TF', 'target': 'gene', 'sep': '\t', 'symbol': 'hgnc', 'weight': 'None', 'directed': True}]
>>>
>>> # get all implemented metabolic networks
>>> mapper.get_mappings(tags="metabolic", fields="type")
[{'name': 'Recon3D', 'type': 'metabolic', 'file_path': 'data/recon/reconparser/data/recon_directed.tsv', 'source': 'EntrezA', 'target': 'EntrezB', 'sep': '\t', 'symbol': 'entrez', 'weight': 'None', 'directed': True}]
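The matching behavior seen in the examples above can be approximated with a short, self-contained sketch. It is not the AdjacencyMapper implementation: the function names are hypothetical, the mapping dicts are reduced to two keys, blacklisting is not modeled, and requiring all tag/field pairs to match at once is an assumption.

```python
import re


def normalize(text):
    """Drop everything except letters and digits, e.g. spaces and dots."""
    return re.sub(r"[^0-9A-Za-z]", "", str(text))


def match_mappings(mappings, tags="", fields="name"):
    """Return all mappings whose field contains the (normalized) tag."""
    tags = [tags] if isinstance(tags, str) else list(tags)
    fields = [fields] if isinstance(fields, str) else list(fields)
    if len(fields) == 1:
        # A single field is used for every tag.
        fields = fields * len(tags)
    if len(tags) != len(fields):
        raise ValueError("tags and fields must have the same length")
    return [
        mapping
        for mapping in mappings
        # Assumption: every tag has to be found in its field.
        if all(normalize(tag) in normalize(mapping[field])
               for tag, field in zip(tags, fields))
    ]


# Reduced copies of mappings shown in the examples above.
mappings = [
    {"name": "BioPlex30293T", "type": "ppi"},
    {"name": "BioPlex30HCT116", "type": "ppi"},
    {"name": "Recon3D", "type": "metabolic"},
]

by_full_name = match_mappings(mappings, tags="BioPlex 3.0 293T")
by_prefix = match_mappings(mappings, tags="BioPlex")
by_type = match_mappings(mappings, tags="metabolic", fields="type")
everything = match_mappings(mappings)
```

Because the empty string is contained in every string, calling the function without arguments returns all mappings, which mirrors the documented default of tags = "".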