Machine-Learning on RNA-Seq data¶

This chapter describes how to run the machine learning script and how to restrict its input features to genes from selected KEGG pathways.

Machine Learning¶

Run this script inside the Docker container, using the environment defined in python_scrips/environment.yml. To compare algorithms efficiently, the Bash wrapper below runs the machine learning script once for each implemented algorithm and appends STDOUT and STDERR to a log file.

Outputs:

Plot of the 10 most important features
Feature importances as a CSV file
Plot of confusion matrix
Plot of ROC-AUC curve

# Bash wrapper script
# Replace the <...> placeholders with real values; -k/--kegg is optional.

declare -a classifiers=("LinearSVC" "SVC" "RandomForest" "MultiLayerPerceptron")

for val in "${classifiers[@]}"; do

    echo "Starting to train $val classifier"

    # ">> out.log 2>&1" appends both STDOUT and STDERR to the log file
    python container/app/machine_learning_tool.py \
    -i/--inpath <gene_counts/TPM.txt> \
    -m/--metadata <metadata.csv> \
    -a/--algorithm "$val" \
    [-k/--kegg <KEGG_filter.txt>] \
    -o/--outpath <destination directory> \
    -c/--cores <INT> \
    -t/--title <tissue_dataset> >> out.log 2>&1

done
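
The same loop can also be driven from Python, which makes the optional -k/--kegg flag easier to handle. The following is a minimal sketch and not part of the repository: the function names build_command and run_all are hypothetical, and only the short flags listed in the wrapper above are used.

```python
import subprocess
import sys

CLASSIFIERS = ["LinearSVC", "SVC", "RandomForest", "MultiLayerPerceptron"]


def build_command(algorithm, inpath, metadata, outpath, cores, title, kegg=None):
    """Assemble the argument list for one run of machine_learning_tool.py."""
    cmd = [
        sys.executable, "container/app/machine_learning_tool.py",
        "-i", inpath,
        "-m", metadata,
        "-a", algorithm,
        "-o", outpath,
        "-c", str(cores),
        "-t", title,
    ]
    if kegg is not None:  # the KEGG filter is optional
        cmd += ["-k", kegg]
    return cmd


def run_all(log_path="out.log", **kwargs):
    """Run every classifier in turn, appending STDOUT and STDERR to one log."""
    with open(log_path, "a") as log:
        for algorithm in CLASSIFIERS:
            print(f"Starting to train {algorithm} classifier")
            # stderr=subprocess.STDOUT merges both streams into the log file
            subprocess.run(build_command(algorithm, **kwargs),
                           stdout=log, stderr=subprocess.STDOUT, check=False)
```

A call such as run_all(inpath="TPM.txt", metadata="metadata.csv", outpath="results", cores=4, title="liver") would then train all four classifiers one after another.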

KEGG database¶

Human genes, or the subset of cancer-related genes, can be obtained from the KEGG database using the sources listed below. The tables must be joined on their shared identifiers, as in a database join, to produce the filter file that is passed to the ML script.

Source of human KEGG genes
Source of human KEGG pathways
Source of mapping NCBI Entrez identifiers to KEGG identifiers
Source of mapping NCBI Entrez identifiers to Ensembl IDs

Cancer-related KEGG pathways:

05200 Pathways in cancer (pathway list 5200)
05202 Transcriptional misregulation in cancer (pathway list 5202)
05206 MicroRNAs in cancer (pathway list 5206)
05205 Proteoglycans in cancer (pathway list 5205)
05204 Chemical carcinogenesis (pathway list 5204)
05203 Viral carcinogenesis (pathway list 5203)
05230 Central carbon metabolism in cancer (pathway list 5230)
05231 Choline metabolism in cancer (pathway list 5231)
05235 PD-L1 expression and PD-1 checkpoint pathway in cancer (pathway list 5235)
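
The database-like merge can be sketched with plain Python dictionaries. This is only an illustration under stated assumptions: the three in-memory tables are toy stand-ins for the downloaded files (whose real formats should be checked), and the function names build_kegg_filter and to_tsv are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the downloaded tables (illustrative rows only).
# KEGG gene ID -> pathway links
gene_to_pathway = [
    ("hsa:8813", "path:hsa01100"),
    ("hsa:8813", "path:hsa00510"),
    ("hsa:2268", "path:hsa04062"),
]
# NCBI Entrez ID -> KEGG gene ID
entrez_to_kegg = {"8813": "hsa:8813", "2268": "hsa:2268"}
# NCBI Entrez ID -> (Ensembl ID, gene name)
entrez_to_ensembl = {
    "8813": ("ENSG00000000419", "DPM1"),
    "2268": ("ENSG00000000938", "FGR"),
}


def build_kegg_filter(gene_to_pathway, entrez_to_kegg, entrez_to_ensembl):
    """Inner-join the three tables into (ensembl_id, GeneName, pathways) rows."""
    pathways = defaultdict(list)
    for kegg_id, pathway in gene_to_pathway:
        pathways[kegg_id].append(pathway)

    rows = []
    for entrez_id, kegg_id in entrez_to_kegg.items():
        # Keep only genes that are present in every table (inner join)
        if kegg_id in pathways and entrez_id in entrez_to_ensembl:
            ensembl_id, name = entrez_to_ensembl[entrez_id]
            rows.append((ensembl_id, name, pathways[kegg_id]))
    return sorted(rows)


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    lines = ["ensembl_id\tGeneName\tpathway"]
    lines += [f"{ensembl_id}\t{name}\t{paths}" for ensembl_id, name, paths in rows]
    return "\n".join(lines) + "\n"


rows = build_kegg_filter(gene_to_pathway, entrez_to_kegg, entrez_to_ensembl)
print(to_tsv(rows), end="")
```

To keep only cancer genes, the pathway links could be restricted to the pathway identifiers listed above before the join.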

Excerpt of a KEGG_filter.txt file¶
ensembl_id	GeneName	pathway
ENSG00000000419	DPM1	['path:hsa01100', 'path:hsa00510']
ENSG00000000938	FGR	['path:hsa04062']
ENSG00000000971	CFH	['path:hsa04610', 'path:hsa05150']
ENSG00000001036	FUCA2	['path:hsa04142', 'path:hsa00511']
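
A file in this layout can be read with the standard library. The sketch below assumes the file is tab-separated and that the pathway column holds a Python-style list written with plain ASCII quotes; the function name genes_in_pathways is hypothetical and does not describe how machine_learning_tool.py itself parses the filter.

```python
import ast
import csv
import io

# In-memory copy of the excerpt; a real run would open KEGG_filter.txt instead.
EXCERPT = (
    "ensembl_id\tGeneName\tpathway\n"
    "ENSG00000000419\tDPM1\t['path:hsa01100', 'path:hsa00510']\n"
    "ENSG00000000938\tFGR\t['path:hsa04062']\n"
    "ENSG00000000971\tCFH\t['path:hsa04610', 'path:hsa05150']\n"
    "ENSG00000001036\tFUCA2\t['path:hsa04142', 'path:hsa00511']\n"
)


def genes_in_pathways(handle, wanted):
    """Return the Ensembl IDs whose pathway list overlaps the wanted set."""
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        # The pathway column is a list literal, so parse it safely
        pathways = ast.literal_eval(row["pathway"])
        if wanted.intersection(pathways):
            selected.append(row["ensembl_id"])
    return selected


selected = genes_in_pathways(io.StringIO(EXCERPT), {"path:hsa04610", "path:hsa04142"})
print(selected)
```

The returned Ensembl IDs could then be used to subset the rows of the gene count table before training.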

Feature importances¶

The feature importance table can be annotated with cancer prognostic information using the get_protein_atlas.py script. The script queries the API of The Human Protein Atlas programmatically and retrieves the information in JSON format.

get_protein_atlas.py
-i/--inpath:            Path to the input table with feature importance values (CSV)
-c/--cancer:            Cancer type used for the query
-o/--outpath:           Path for the annotated feature importance table (CSV)
--latex:                If set, print the annotated table in LaTeX format to STDOUT

python get_protein_atlas.py -i <feature_importance.csv> -c <Pancreatic/Liver> -o <annotated_feature_importance.csv>
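
For orientation, programmatic access to The Human Protein Atlas might look like the sketch below. It is not the implementation of get_protein_atlas.py: the function names are hypothetical, the per-gene JSON record is assumed to be a flat dictionary, and the exact key names for prognostic data are an assumption that should be checked against a downloaded record.

```python
import json
import urllib.request


def fetch_gene_record(ensembl_id):
    """Download the JSON record of one gene from The Human Protein Atlas."""
    url = f"https://www.proteinatlas.org/{ensembl_id}.json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def prognostic_entries(record, cancer):
    """Collect entries whose key mentions both 'prognostic' and the cancer type.

    Matching on substrings avoids hard-coding key names, which are an
    assumption here and may differ between Human Protein Atlas releases.
    """
    cancer = cancer.lower()
    return {
        key: value
        for key, value in record.items()
        if "prognostic" in key.lower() and cancer in key.lower()
    }
```

Calling prognostic_entries(fetch_gene_record("ENSG00000000419"), "Pancreatic") would return whatever prognostic fields the record contains for that cancer type, or an empty dictionary if none match.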