Getting Started

This chapter introduces the preprocessing steps for RNA-Seq expression data from different public repositories. It documents how to download data and metadata, and how to set up a conda environment or Docker container for running the scripts, which are written in Bash, Nextflow, Python and R.

Features

Download RNA-Seq expression data from repositories
Convert BAM to FASTQ and use nf-core/rnaseq
Download metadata from TCGA, ICGC, GTEx, SRA
Extract metadata into a table in csv format
Merge TPM values from nf-core/rnaseq/stringTieFPKM
Merge raw featureCounts from nf-core/rnaseq/featureCounts
Dimensionality reduction with PCA, t-SNE and UMAP
Batch correction in R with ComBat, ComBat-seq and removeBatchEffect
Supervised classification with machine learning: LinearSVM, SVM, RandomForest, MultiLayerPerceptron
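
As an illustration of the merge step in the list above, the following is a minimal sketch rather than the repository's own script. It assumes each sample has a tab-separated abundance table with a gene identifier column and a TPM column; the column names `Gene ID` and `TPM`, the function names, and the choice to fill missing genes with 0.0 are all assumptions to adjust to the actual pipeline output.

```python
import csv
import io


def read_tpm(handle, id_col="Gene ID", tpm_col="TPM"):
    """Read one tab-separated abundance table into {gene_id: tpm}.

    The default column names are assumptions, not confirmed pipeline output.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[tpm_col]) for row in reader}


def merge_tpm(tables, fill=0.0):
    """Merge {sample: {gene_id: tpm}} into a gene x sample matrix.

    Returns (genes, samples, rows) where rows[i][j] is the TPM of genes[i]
    in samples[j]. Genes absent from a sample get `fill` (an assumption).
    """
    samples = sorted(tables)
    genes = sorted({gene for table in tables.values() for gene in table})
    rows = [[tables[s].get(g, fill) for s in samples] for g in genes]
    return genes, samples, rows


def write_matrix(handle, genes, samples, rows):
    """Write the merged matrix as CSV with one row per gene."""
    writer = csv.writer(handle)
    writer.writerow(["gene_id", *samples])
    for gene, row in zip(genes, rows):
        writer.writerow([gene, *row])


# Toy in-memory tables stand in for per-sample files on disk.
sample_a = io.StringIO("Gene ID\tTPM\nENSG_A\t1.5\nENSG_B\t0.0\n")
sample_b = io.StringIO("Gene ID\tTPM\nENSG_A\t2.0\nENSG_C\t4.0\n")
tables = {"sample_a": read_tpm(sample_a), "sample_b": read_tpm(sample_b)}
genes, samples, rows = merge_tpm(tables)

out = io.StringIO()
write_matrix(out, genes, samples, rows)
print(out.getvalue())
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, one per sample.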

Main Workflow Overview

Prerequisites

Setting up the conda environment

Requires conda
Requires python version 3.8.8
Requires python_scripts/environment.yml

conda env create -f environment.yml

Activate the environment before running the scripts:

conda activate python_scripts
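
A quick sanity check can confirm that the scripts will run under the intended interpreter. The sketch below is hypothetical and not part of the repository; it assumes the environment is named `python_scripts` (taken from the activate command above) and relies on conda setting the `CONDA_DEFAULT_ENV` variable on activation.

```python
import os
import sys

REQUIRED_PYTHON = (3, 8, 8)      # version stated in the prerequisites
EXPECTED_ENV = "python_scripts"  # name used in `conda activate` (assumed env name)


def check_environment(version_info=None, environ=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version_info or sys.version_info)[:3]
    env = os.environ if environ is None else environ
    problems = []
    if version != REQUIRED_PYTHON:
        found = ".".join(str(part) for part in version)
        wanted = ".".join(str(part) for part in REQUIRED_PYTHON)
        problems.append(f"Python {found} is active, expected {wanted}")
    active = env.get("CONDA_DEFAULT_ENV")
    if active != EXPECTED_ENV:
        problems.append(f"conda env {active!r} is active, expected {EXPECTED_ENV!r}")
    return problems


# Report any mismatch for the interpreter running this script.
for problem in check_environment():
    print("WARNING:", problem)
```

The function takes the version and environment mapping as optional arguments so that it can be exercised without changing the real interpreter or shell.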

Alternative setup with a Docker container

Requires python_scripts/Dockerfile
Requires python_scripts/environment.yml

# folder structure within the container

├── app/
│   ├── tools.py
│   ├── ...
├── data/
│   ├── ...
├── Dockerfile
├── environment.yml
└── results/

# copy the code to run in the container to app/ and the data to data/

# from within folder containing scripts, Dockerfile and environment.yml
docker build -t <name> .

# <name> is the image tag chosen in the build step above
docker run -it --rm -w <work_dir> -v <host_dir>:<container_dir> <name>

# run script from command-line
# note that the conda env is already activated
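
A script run inside the container might locate its inputs and outputs as sketched below. This is a minimal example assuming the `data/` and `results/` folders from the folder structure above sit under a common base directory; the function names `project_dirs` and `list_inputs` and the demo file name are hypothetical and not part of the repository.

```python
import tempfile
from pathlib import Path


def project_dirs(base):
    """Return (data_dir, results_dir) under `base`, creating results/ if missing."""
    base = Path(base)
    data_dir = base / "data"
    results_dir = base / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, results_dir


def list_inputs(data_dir, pattern="*"):
    """Return the files in `data_dir` matching `pattern`, sorted by path."""
    return sorted(p for p in Path(data_dir).glob(pattern) if p.is_file())


# Demo against a throwaway directory so the sketch runs anywhere;
# inside the container, `base` would be the mounted working directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    (Path(tmp) / "data" / "counts.tsv").write_text("gene\tcount\n")
    data_dir, results_dir = project_dirs(tmp)
    for path in list_inputs(data_dir):
        print(path.name)
```

Passing the base directory as an argument keeps the script independent of where `<host_dir>` is mounted in the container.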