Training the Ensemble#
As mentioned earlier, the heart of Speos is the model ensemble, which consists of a nested crossvalidation and a postprocessing step that takes the predictions of all models, assesses their overlap and returns a set of candidate genes. For details on how the data is partitioned between the individual models and how the overlap is compared to a statistical cutoff, please consult the methods section in our preprint.
Running an Outer Crossvalidation Ensemble#
After testing several hyperparameters, networks and input variants via Benchmarking and selecting a promising combination, it is time to run the full outer crossvalidation. Keep in mind that this will train n*(n+1) = 110 models with default settings.
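The model count follows directly from the nested scheme. As a quick sanity check, assuming each of the n + 1 outer folds trains n inner models (the function name below is ours and not part of Speos):

```python
def n_models(n_folds: int = 10) -> int:
    """Number of models in a nested crossvalidation with n_folds inner
    models per outer fold and n_folds + 1 outer folds."""
    inner_per_outer = n_folds
    outer_folds = n_folds + 1
    return inner_per_outer * outer_folds


print(n_models())   # 110 with the default n_folds = 10
print(n_models(5))  # 30
```

Lowering n_folds is therefore the most direct way to reduce the training cost, at the price of fewer votes per gene.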
First, let's draft a suitable config file:
name: cardiovascular_gcn

input:
  adjacency: BioPlex30293T
  tag: Cardiovascular Disease

model:
  mp:
    type: gcn

crossval:
  mode: kfold
  n_folds: 10
  positive_only: True

inference:
  save_dir: ~/results/
  save_sorted: True
  save_tsv: True
and save it as cardiovascular_gcn.yaml.
Note: inference.save_tsv (in the last line) must be set to True for the postprocessing that happens after training to work!
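If you want to guard against forgetting this key, a small pre-flight check can help. The sketch below works on a plain nested dict, such as the one a YAML parser would return for the config above; the helper name is hypothetical and not part of Speos:

```python
def postprocessing_ready(config: dict) -> bool:
    """True if inference.save_tsv is set to True in a nested config dict."""
    return config.get("inference", {}).get("save_tsv") is True


# The relevant part of the config above, written as a Python dict.
config = {
    "name": "cardiovascular_gcn",
    "crossval": {"mode": "kfold", "n_folds": 10, "positive_only": True},
    "inference": {"save_dir": "~/results/", "save_sorted": True, "save_tsv": True},
}

print(postprocessing_ready(config))  # True
```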
Here we use mostly the same settings as in the API section earlier; you should adapt them to the settings that produced the best results in your benchmark runs. The only keys that should be identical to the config above are those below crossval and inference.
Then, you run it with the following command:
$ python outer_crossval.py -c cardiovascular_gcn.yaml
Then wait for the results to roll in. At the end of the ensemble training, the postprocessor gathers the results and assesses the overlap of the predictions. You can find this section towards the end of the logfile:
cardiovascular_gcn 2023-02-22 14:43:50,309 [INFO] speos.postprocessing.postprocessor: Applying concensus strategy: top_down
cardiovascular_gcn 2023-02-22 14:43:50,309 [INFO] speos.postprocessing.postprocessor: Starting Overlap Analysis.
cardiovascular_gcn 2023-02-22 14:44:19,159 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_0_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:44:44,929 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_1_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:45:06,524 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_2_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:45:44,366 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_3_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:46:03,944 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_4_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:46:29,244 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_5_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:46:45,441 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_6_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:47:07,211 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_7_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:47:36,267 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_8_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:48:07,654 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_9_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:48:34,821 [INFO] speos.postprocessing.postprocessor: Plotting overlap plot to cardiovascular_gcn_outer_10_fold__overlap.svg
cardiovascular_gcn 2023-02-22 14:48:35,133 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #0: 7; Returned 560 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,147 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #1: 8; Returned 351 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,167 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #2: 7; Returned 352 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,184 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #3: 8; Returned 428 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,199 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #4: 6; Returned 559 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,222 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #5: 8; Returned 284 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,238 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #6: 6; Returned 558 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,255 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #7: 7; Returned 425 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,273 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #8: 7; Returned 287 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,332 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #9: 8; Returned 252 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,356 [INFO] speos.postprocessing.postprocessor: Consensus Score for Outer Crossval #10: 6; Returned 512 Candidate Genes
cardiovascular_gcn 2023-02-22 14:48:35,431 [INFO] speos.postprocessing.postprocessor: Outer Crossvalidation results in 781 candidate genes in total. Results written to ./results/cardiovascular_gcnouter_results.json
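If you would rather collect these numbers programmatically than read them off the log, a regular expression over the logfile is enough. This is a minimal sketch that assumes the message format shown in the log excerpt; the function name is ours and not part of Speos:

```python
import re

# Matches the "Consensus Score" messages as they appear in the log excerpt.
CONSENSUS_LINE = re.compile(
    r"Consensus Score for Outer Crossval #(\d+): (\d+); Returned (\d+) Candidate Genes"
)


def parse_consensus_lines(log_text: str) -> dict:
    """Map outer fold index -> (consensus score, number of candidate genes)."""
    results = {}
    for match in CONSENSUS_LINE.finditer(log_text):
        fold, score, n_genes = (int(group) for group in match.groups())
        results[fold] = (score, n_genes)
    return results


sample = (
    "cardiovascular_gcn 2023-02-22 14:48:35,133 [INFO] speos.postprocessing.postprocessor: "
    "Consensus Score for Outer Crossval #0: 7; Returned 560 Candidate Genes\n"
    "cardiovascular_gcn 2023-02-22 14:48:35,147 [INFO] speos.postprocessing.postprocessor: "
    "Consensus Score for Outer Crossval #1: 8; Returned 351 Candidate Genes\n"
)
parsed = parse_consensus_lines(sample)
print(parsed)  # {0: (7, 560), 1: (8, 351)}
```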
As we see, each of the n+1=11 outer crossvalidation folds produced an overlap plot, from which a consensus score (CS) has been chosen as the overlap cutoff, which in the end results in a set of candidate genes for each outer fold. In the last line, we see that the union of the 11 sets contains 781 candidate genes. Since the individual folds returned between 252 and 560 genes each, several genes must have been predicted by more than one outer crossvalidation fold. How often each of the candidate genes has been predicted, i.e. its CS, can be seen in the JSON file that the log refers to:
[
  {
    "A1BG": 9,
    "A2M": 9,
    "ACAA2": 11,
    "ACO2": 11,
    "ACSL1": 9,
    "ACTB": 6,
    "ACTG2": 5,
    "ADH1B": 10,
    "ADH4": 11,
    "ADH5": 1,
    "ADI1": 5,
    [...]
We see the sorted HGNC symbols of the 781 candidate genes accompanied by their CS. The higher the CS, the more outer crossvalidations have predicted the gene to be a candidate, i.e. the higher the confidence.
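Conceptually, the per-gene CS is a vote count over the candidate sets of the outer folds. The following sketch illustrates the idea with three toy folds; it is not the Speos implementation, and the function name is hypothetical:

```python
from collections import Counter


def count_outer_votes(candidate_sets):
    """Count in how many outer folds each gene was returned as a candidate."""
    votes = Counter()
    for candidates in candidate_sets:
        # Each fold contributes at most one vote per gene.
        votes.update(set(candidates))
    return dict(sorted(votes.items()))


# Toy example with three outer folds instead of eleven.
toy_folds = [{"ACAA2", "A1BG"}, {"ACAA2"}, {"ACAA2", "ADH5"}]
votes = count_outer_votes(toy_folds)
print(votes)       # {'A1BG': 1, 'ACAA2': 3, 'ADH5': 1}
print(len(votes))  # 3 genes in the union
```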
At the very end of the file, you will find summary statistics of how many genes have received which CS:
  {
    "9": 48,
    "11": 163,
    "6": 40,
    "5": 41,
    "10": 66,
    "1": 161,
    "3": 58,
    "7": 39,
    "2": 89,
    "8": 37,
    "4": 39
  }
]
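As we see, a total of 163 genes have received a CS of 11, which is more than for any of the intermediate CS of 6, 7 and 8 (40, 39 and 37 genes, respectively). The counts sum to the 781 candidate genes reported in the log.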
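To work with the results file in Python, the standard json module suffices. The sketch below assumes the layout suggested by the excerpts above, i.e. a JSON list whose first element maps gene symbols to their CS and whose last element holds the summary; both helper names are ours:

```python
import json
from collections import Counter


def high_confidence_genes(results_json: str, min_cs: int) -> list:
    """Sorted gene symbols whose CS is at least min_cs."""
    gene_to_cs = json.loads(results_json)[0]
    return sorted(gene for gene, cs in gene_to_cs.items() if cs >= min_cs)


def cs_histogram(results_json: str) -> dict:
    """Recompute how many genes received which CS (keys as strings, as in the file)."""
    gene_to_cs = json.loads(results_json)[0]
    return {str(cs): n for cs, n in Counter(gene_to_cs.values()).items()}


# Toy file content with three genes taken from the excerpt above.
toy = '[{"A1BG": 9, "ACAA2": 11, "ADH5": 1}, {"9": 1, "11": 1, "1": 1}]'
print(high_confidence_genes(toy, min_cs=9))        # ['A1BG', 'ACAA2']
print(cs_histogram(toy) == json.loads(toy)[-1])    # True
```

For the real file, read its content with open(...).read() and pass it to the same functions.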
Additionally, a postprocessing table is placed in the config.pp.save_dir directory that contains the prediction for every gene as well as the results of several external validation tasks which will be discussed in the next chapter:
	Is Included	Mendelian	Candidate	CS	DGE: Coronary Artery Disease	DGE: Atrial Fibrillation	DGE: Aortic Aneurysm	DGE: Ischemia	DGE: Hypertension	DGE: Atherosclerosis	DGE: Union	Drug Target	Number of Drug Interactions	Druggable	Included in Mouse KO	Is Mouse KO	pLI>0.9
A1BG	True	False	True	9.0	False	False	False	False	False	False	False	True	5.0	True	True	False	False
A1CF	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
A2M	True	False	True	9.0	False	False	False	False	False	False	False	True	29.0	True	True	False	False
A2ML1	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	True	False	False	False
A3GALT2	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
A4GALT	True	False	False	0.0	False	False	False	False	False	False	False	True	2.0	False	True	False	False
A4GNT	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
AAAS	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
AACS	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
AADAC	True	False	False	0.0	False	False	False	False	False	False	False	True	1.0	True	True	False	False
AADACL2	True	False	False	0.0	False	False	False	False	False	False	False	True	2.0	True	False	False	False
AADACL3	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
AADACL4	True	False	False	0.0	False	False	False	False	False	False	False	False	0.0	False	True	False	False
AADAT	True	False	False	0.0	False	False	False	False	False	False	False	True	2.0	True	True	False	False
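A table like this is easy to filter with the standard csv module. The sketch below assumes the table is tab-separated, with the gene symbol in the first, unnamed column and booleans stored as the strings True and False; the function name is hypothetical, and the toy input only carries a subset of the columns:

```python
import csv
import io


def select_genes(tsv_text: str, required: dict) -> list:
    """Gene symbols of all rows whose columns match the required values."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    columns = {name: header.index(name) for name in required}
    return [
        row[0]
        for row in reader
        if all(row[columns[name]] == str(value) for name, value in required.items())
    ]


# Four rows from the table above, reduced to three columns.
toy = (
    "\tCandidate\tCS\tDrug Target\n"
    "A1BG\tTrue\t9.0\tTrue\n"
    "A1CF\tFalse\t0.0\tFalse\n"
    "A2M\tTrue\t9.0\tTrue\n"
    "A4GALT\tFalse\t0.0\tTrue\n"
)
selected = select_genes(toy, {"Candidate": True, "Drug Target": True})
print(selected)  # ['A1BG', 'A2M']
```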
Configuring the Postprocessor#
There are plenty of ways to adapt the postprocessing. We encourage you to stick to the defaults first, but you might want to tweak your settings later.
Let’s have a closer look at the default values for our postprocessing routine (as defined in the config_default.yaml):
pp: # postprocessing
  switch: on # on, off, True, False
  tasks: [overlap_analysis, dge, pathway, hpo_enrichment, go_enrichment, drugtarget, druggable, mouseKO, lof_intolerance] # this is the full set of postprocessing options
  consensus: top_down # either int specifying the min bin for consensus or bottom_up, or top_down for p-val search starting from 0 up or from 10 down
  cutoff_value: 0.7 # float in case of cutoff_type split, else int
  cutoff_type: split # split, top or bottom
  save: True
  save_dir: ./results/
  plot: True
  plot_dir: ./plots/
Let us walk through the keys one by one.
switch: on
    This is easy: setting this value to off or False disables postprocessing.
tasks: [...]
    This describes the tasks that should be run on the ensemble. Here, overlap_analysis is necessary to find the convergence properties of the ensembles and count the votes for every gene. Without this task, the postprocessing won’t work. The other tasks are pretty self-explanatory. If, for example, you would like to only do the drug target analysis and not waste time on the other tasks, add the following lines to your cardiovascular_gcn.yaml:

    pp:
      tasks: [overlap_analysis, drugtarget]

    and it will skip the other tasks.
consensus: top_down/bottom_up
    This key regulates the application of the consensus score. top_down means that it will start at bin 10 (unanimous decision) and go down until it finds the first bin that is not significantly enriched for the positive holdout set. The consensus score will be the last bin that is still significant. bottom_up means that it will start at bin 1 and go up until it reaches the first bin that is significant. This bin will be the consensus score. As noted in the default config, an integer can be given instead to specify the minimum bin directly.
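The two search directions can be sketched as follows. This is an illustration of the description above and not the Speos implementation: the significance test itself is abstracted into a precomputed mapping from bin to True/False, and returning None when no bin qualifies is our assumption:

```python
def consensus_score(significant: dict, strategy="top_down"):
    """Pick a consensus score from a mapping {bin: is_significantly_enriched}."""
    if isinstance(strategy, int):
        # An integer is taken as the consensus bin directly.
        return strategy
    bins = sorted(significant)
    if strategy == "top_down":
        score = None
        for b in reversed(bins):
            if not significant[b]:
                break  # first non-significant bin ends the search
            score = b  # last bin that was still significant
        return score
    if strategy == "bottom_up":
        for b in bins:
            if significant[b]:
                return b  # first significant bin
        return None
    raise ValueError(f"unknown consensus strategy: {strategy!r}")


# Toy significance pattern: bins 3 and 7 to 10 are significant.
toy = {b: b in {3, 7, 8, 9, 10} for b in range(1, 11)}
print(consensus_score(toy, "top_down"))   # 7
print(consensus_score(toy, "bottom_up"))  # 3
```

The toy pattern shows why the two strategies can disagree: bottom_up stops at an isolated significant bin, whereas top_down requires an unbroken run of significant bins down from the unanimous one.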
cutoff_value: (float)/(int)
    A number between 0 and 1, indicating the cutoff on the uncalibrated probabilities assigned to the genes. In other words, setting it to 0.7 means that all genes with a prediction higher than 0.7 will receive a vote from this model. Increasing the cutoff value decreases the number of genes that are voted to be disease genes. Can also be an integer above 1, in case cutoff_type is set to top or bottom.
cutoff_type: split/top/bottom
    Indicates the way the cutoff is applied. split means that the predictions are split at the given float, as explained right above. top means that the top k genes are selected; in this case cutoff_value has to be an integer (e.g. 100 as in Schulte-Sasse et al.). bottom means that the bottom k genes are ignored and the rest is selected.
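To make the three cutoff types concrete, here is a minimal sketch of how a single model's predictions could be turned into votes. It follows the description above but is not the Speos code; the function name and the tie handling (plain sort order) are assumptions:

```python
def apply_cutoff(predictions: dict, cutoff_value, cutoff_type: str = "split") -> set:
    """Return the genes that receive a vote from one model."""
    # Genes ranked from highest to lowest prediction.
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    if cutoff_type == "split":
        return {gene for gene, p in predictions.items() if p > cutoff_value}
    if cutoff_type == "top":
        return set(ranked[: int(cutoff_value)])
    if cutoff_type == "bottom":
        n_keep = max(len(ranked) - int(cutoff_value), 0)
        return set(ranked[:n_keep])
    raise ValueError(f"unknown cutoff_type: {cutoff_type!r}")


# Toy predictions for four placeholder genes.
toy = {"GENE_A": 0.9, "GENE_B": 0.75, "GENE_C": 0.4, "GENE_D": 0.1}
print(sorted(apply_cutoff(toy, 0.7, "split")))  # ['GENE_A', 'GENE_B']
print(sorted(apply_cutoff(toy, 1, "top")))      # ['GENE_A']
print(sorted(apply_cutoff(toy, 1, "bottom")))   # ['GENE_A', 'GENE_B', 'GENE_C']
```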
save: True
save_dir: ./results/
plot: True
plot_dir: ./plots/
These keys are pretty self-explanatory. So, if you have lost your results or ask yourself where the plots went, this is where you should look.
Now, if you want to try some changes to these settings, e.g. a different cutoff_value or different tasks, just add the respective lines to your cardiovascular_gcn.yaml and re-run the postprocessing only:
$ python postprocessing.py -c cardiovascular_gcn.yaml
Have a little fun with it, but don’t forget that hyperparameter hacking and repeated hypothesis testing with different settings like this weaken your evidence!