speos.postprocessing#

The PostProcessor handles everything that happens after the crossvalidation ensemble has been trained. It gathers the candidate genes and runs several external validation tasks on them. Usually, the same external validation tasks are also performed on the positively labeled genes so that users can judge how well their candidate genes compare to the ‘gold standard’ positives.

If the crossvalidation pipeline has been chosen, these tasks are run automatically and the user does not need to interact with this class directly. If, however, more detailed results are needed, an example at the end of this page shows how to obtain them (TODO).

For a more detailed description of the external validation tasks, consult the methods section of the accompanying paper.

class speos.postprocessing.postprocessor.PostProcessor(config, translation_table='data/hgnc_official_list.tsv')#

Reads a results file and generates reports and analyses on it. The results file must contain identifiers, labels and predictions per gene.

check_overlap(results_paths: list, cutoff_value, cutoff_type: str, plot=True)#

Checks the overlap of multiple runs and returns the results.

Parameters:
  • results_paths (list) – Paths to the results files that should be compared for overlap (i.e. all results files from one outer cv run)

  • cutoff_value (int/float) – Cutoff value; its interpretation depends on cutoff_type.

  • cutoff_type (str) – Type of cutoff that should be applied to find candidates. See High-Level API for possible values.

  • plot (bool) – Whether the overlap bins should be plotted.

Returns:

Multiple results, most of which are summarized in the DataFrame at the end (i.e. tuple[-1])

Return type:

tuple([…, pd.DataFrame])

dge(results_path=None, plot=True, save=True, convergence_score=1) → DataFrame#

Runs Differential Gene Expression enrichment on the results of the outer crossvalidation. Uses only unknown genes as background; Mendelian genes are removed.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Contains the results of the Differential Gene Expression Enrichment Analysis.

Return type:

pandas.DataFrame

druggable(results_path=None, plot=False, save=True)#

Takes the results of the outer crossvalidation and analyzes if there is an enrichment of druggable genes among the predicted genes.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Returns a tuple of various results, most of which are summarized in the DataFrame at the end (tuple[-1]).

ResultA is the enrichment of druggable genes in positively labeled genes, ResultB is the enrichment in the candidates. ResultC is the enrichment of druggable genes among the non-drug-target genes in the positively labeled genes, ResultD is the same enrichment in the candidates.

Return type:

tuple(list[ResultA, ResultB], list[ResultC, ResultD], pd.DataFrame)

drugtarget(results_path=None, plot=True, save=True) → tuple#

Takes the results of the outer crossvalidation and analyzes if there is an enrichment of drug targets among the predicted genes.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Returns a tuple of various results, most of which are summarized in the DataFrame at the end (tuple[-1]).

Return type:

tuple([…], pd.DataFrame)

get_mouse_knockout_genes(tag=None, mapping='./data/mgi/query_mapping.yaml', main_dir=None) → list#

Reads the mouse knockout (KO) genes from the mapping file, matches them against the mouse-to-human homologs (self.mouse2human) and returns the human homologs that have a corresponding mouse KO gene.

Mouse KO genes which do not have human homologs will not be returned.
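The homolog filtering described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper name `map_to_human_homologs`, the assumption that the homolog table is a plain dict from mouse symbol to human symbol, and the toy gene lists are all made up for the example.

```python
def map_to_human_homologs(mouse_ko_genes, mouse2human):
    """Return human homologs of mouse KO genes.

    Hypothetical helper: `mouse2human` is assumed to be a dict that maps
    a mouse gene symbol to a human gene symbol.
    """
    human_genes = []
    for gene in mouse_ko_genes:
        homolog = mouse2human.get(gene)
        if homolog is not None:  # mouse genes without a human homolog are skipped
            human_genes.append(homolog)
    return human_genes


# Toy data for illustration only; "Gm12345" stands in for a gene without homolog
mouse2human = {"Trp53": "TP53", "Brca1": "BRCA1"}
mouse_ko = ["Trp53", "Gm12345", "Brca1"]
human_ko = map_to_human_homologs(mouse_ko, mouse2human)
print(human_ko)
```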

get_random_overlap(eligible_genes, kept_genes, algorithm='fast', n_models=None)#

Draws the same number of random genes as in kept_genes from eligible_genes and repeats this procedure self.num_runs_for_random_experiments times to obtain the mean and standard deviation of the overlaps.

If algorithm="descriptive", the genes are sampled from an actual list of gene symbols. If algorithm="fast", the sampling is recreated as a Bernoulli experiment in scipy, which is much faster.
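The idea behind the "descriptive" variant can be sketched with the standard library alone. This is an illustration under stated assumptions, not the method's actual code: the function name `random_overlap`, the `kept_sizes` argument (one set size per model), and the choice to record how many genes are picked by exactly k models are all hypothetical. The faster Bernoulli-based variant is not reproduced here.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def random_overlap(eligible_genes, kept_sizes, num_runs=100, seed=0):
    """Estimate how much randomly drawn gene sets overlap by chance.

    In every run, one random set per model is drawn (same size as that
    model's kept set) and the number of genes picked by exactly k models
    is recorded. Returns {k: (mean, std)} over all runs.
    """
    rng = random.Random(seed)
    eligible = list(eligible_genes)
    n_models = len(kept_sizes)
    per_level = {k: [] for k in range(1, n_models + 1)}
    for _ in range(num_runs):
        hits = Counter()
        for size in kept_sizes:
            hits.update(rng.sample(eligible, size))  # sample without replacement
        level_counts = Counter(hits.values())  # overlap level -> number of genes
        for k in per_level:
            per_level[k].append(level_counts.get(k, 0))
    return {k: (mean(v), pstdev(v)) for k, v in per_level.items()}


# Toy setup: 200 eligible genes, three models keeping 20 genes each
genes = [f"gene_{i}" for i in range(200)]
stats = random_overlap(genes, kept_sizes=[20, 20, 20], num_runs=200)
for k, (mu, sd) in stats.items():
    print(f"picked by {k} model(s): mean={mu:.2f}, std={sd:.2f}")
```

Comparing the observed overlap against such a random baseline indicates whether the agreement between models exceeds what chance alone would produce.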

go_enrichment(results_path=None, plot=True, save=True) → DataFrame#

Runs GO Term enrichment on the results of the outer crossvalidation. Uses only unknown genes as background; Mendelian genes are removed.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Contains the results of the GO Term Enrichment Analysis.

Return type:

pandas.DataFrame
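To make the role of the restricted background concrete, the following sketch tests a single term for enrichment among candidates after removing the positively labeled genes from the background. It is only an illustration: the use of a hypergeometric upper-tail test, the function names, and the toy gene sets are assumptions, and the statistical procedure actually used by go_enrichment is not documented here. Multiple-testing correction across terms is not shown.

```python
from math import comb


def hypergeom_pvalue(n_background, n_term, n_candidates, n_overlap):
    """P(X >= n_overlap) when drawing n_candidates genes without replacement
    from n_background genes, of which n_term are annotated with the term."""
    upper = min(n_term, n_candidates)
    total = comb(n_background, n_candidates)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_candidates - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def term_enrichment(all_genes, positives, candidates, term_genes):
    """Enrichment p-value of one term among candidates, positives excluded."""
    background = set(all_genes) - set(positives)  # only unknown genes remain
    cands = set(candidates) & background
    term = set(term_genes) & background
    overlap = cands & term
    return hypergeom_pvalue(len(background), len(term), len(cands), len(overlap))


# Toy data for illustration only
all_genes = [f"g{i}" for i in range(20)]
positives = ["g0", "g1", "g2", "g3"]
candidates = ["g4", "g5", "g6", "g7"]
term_genes = ["g0", "g4", "g5", "g6", "g8"]  # g0 is a positive and is dropped
p_value = term_enrichment(all_genes, positives, candidates, term_genes)
print(f"p = {p_value:.4f}")
```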

hpo_enrichment(results_path=None, plot=True, save=True) → DataFrame#

Runs HPO Term enrichment on the results of the outer crossvalidation. Uses only unknown genes as background; Mendelian genes are removed.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Contains the results of the HPO Enrichment Analysis.

Return type:

pandas.DataFrame

lof_intolerance(results_path=None, plot=True, save=False)#

Takes the results of the outer crossvalidation and analyzes if there is an enrichment of loss of function and missense mutation intolerant genes among the predicted genes. Genes for which we have no LoF or missense information have been excluded.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Returns a tuple of various results.

ResultA is the enrichment of genes with pLI > 0.8 in positively labeled genes, ResultB is the enrichment in the candidates. ArrayA and ArrayB are the contingency tables for ResultA and ResultB. ResultC is the result of a Tukey’s HSD test for LoF mutation intolerance among positives, candidates and noncandidates. ResultD is the result of a Tukey’s HSD test for missense mutation intolerance among positives, candidates and noncandidates.

Return type:

tuple(list[ResultA, ArrayA, ResultB, ArrayB], list[ResultC, ResultD])

mouseKO(results_path=None, plot=False, save=True)#

Takes the results of the outer crossvalidation and analyzes if there is an enrichment of mouse KO genes among the predicted genes. Genes that have not been tested in mouse KO experiments at all have been excluded.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Returns a tuple of various results.

ResultA is the enrichment of mouse KO genes in positively labeled genes, ResultB is the enrichment in the candidates. ArrayA is the contingency table of the mouse KO genes with positively labeled genes, ArrayB is the contingency table of mouse KO genes with candidate genes.

Return type:

tuple(ResultA, ResultB, ArrayA, ArrayB)
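A contingency table of the kind returned as ArrayA and ArrayB could be built as in the sketch below. The helper names, the row and column order of the table, and the toy gene sets are assumptions for illustration; the layout and the test statistic that mouseKO actually uses are not specified on this page.

```python
def contingency_table(group, ko_genes, tested_genes):
    """Build a 2x2 table restricted to genes that were tested at all.

    Assumed layout (hypothetical):
        [[in group and KO,     in group and not KO],
         [not in group and KO, not in group and not KO]]
    """
    tested = set(tested_genes)
    group = set(group) & tested  # untested genes are excluded
    ko = set(ko_genes) & tested
    a = len(group & ko)
    b = len(group - ko)
    c = len(ko - group)
    d = len(tested - group - ko)
    return [[a, b], [c, d]]


def odds_ratio(table):
    """Sample odds ratio of a 2x2 table; infinite if an off-diagonal cell is 0."""
    (a, b), (c, d) = table
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


# Toy data for illustration only; "g10" was never tested and is dropped
tested = [f"g{i}" for i in range(10)]
candidates = ["g0", "g1", "g2", "g10"]
mouse_ko = ["g0", "g1", "g5"]
table = contingency_table(candidates, mouse_ko, tested)
print(table, odds_ratio(table))
```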

overlap_analysis(write=True, plot=True)#

Takes the crossval suffix from the config and performs an overlap analysis by calling check_overlap() with the given values. Stores the results in output_dir if write is set to True, which is the default.

Returns a dictionary that maps the counts to the genes and a list of runs that have been considered for the analysis.
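A dictionary that maps counts to genes can be pictured as in the following sketch. It assumes that the count is the number of runs that selected a gene as a candidate and that each gene is listed under exactly one count; the function name and the toy candidate sets are hypothetical, and the real return value may be structured differently.

```python
from collections import Counter, defaultdict


def overlap_by_count(candidate_sets):
    """Map each overlap count to the sorted genes selected by exactly that many runs."""
    counts = Counter(gene for genes in candidate_sets for gene in set(genes))
    by_count = defaultdict(list)
    for gene, n in counts.items():
        by_count[n].append(gene)
    return {n: sorted(genes) for n, genes in sorted(by_count.items())}


# Toy candidate sets from three runs, for illustration only
runs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}]
overlap = overlap_by_count(runs)
print(overlap)
```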

pathway(results_path=None, plot=True, save=True) → DataFrame#

Runs Pathway enrichment on the results of the outer crossvalidation. Uses only unknown genes as background; Mendelian genes are removed.

Parameters:
  • results_path (str) – Path to a results file from which the positive labels are extracted. Not necessary if the task overlap_analysis has been run before, since the results paths are then already known to the postprocessor.

  • plot (bool) – Whether plots should be produced. If True, the plots are placed in config.pp.plot_dir.

  • save (bool) – Whether results should be saved. If True, the results are placed in config.pp.save_dir.

Returns:

Contains the results of the Pathway Enrichment Analysis.

Return type:

pandas.DataFrame

run()#

Runs all tasks that are specified in the config as pp.tasks.

Returns:

The results of all individual tasks, in the same order as specified in the config file.

Return type:

list
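The behavior of run() can be pictured as a name-based dispatch over the configured task list. The class below is a toy stand-in written for this illustration, not the speos PostProcessor: its constructor, its two dummy tasks, and the error handling are assumptions. It only shows how results end up in the same order as the task names.

```python
class ToyPostProcessor:
    """Hypothetical stand-in used for illustration; not the speos class."""

    def __init__(self, tasks):
        self.tasks = tasks  # plays the role of config.pp.tasks

    def overlap_analysis(self):
        return "overlap done"

    def pathway(self):
        return "pathway done"

    def run(self):
        """Call every configured task by name and collect the results in order."""
        results = []
        for name in self.tasks:
            task = getattr(self, name, None)
            if not callable(task):
                raise ValueError(f"Unknown task: {name}")
            results.append(task())
        return results


pp = ToyPostProcessor(tasks=["overlap_analysis", "pathway"])
results = pp.run()
print(results)
```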