Getting Started

This chapter introduces the preprocessing of RNA-Seq expression data from different public repositories. It documents how to download data and metadata, and how to set up a conda environment or Docker container for running the scripts, which are written in Bash, Nextflow, Python and R.

Features

  • Download RNA-Seq expression data from repositories

  • Convert BAM to FASTQ and use nf-core/rnaseq

  • Download metadata from TCGA, ICGC, GTEx, SRA

  • Extract metadata into a table in CSV format

  • Merge TPM values from nf-core/rnaseq/stringTieFPKM

  • Merge raw featureCounts from nf-core/rnaseq/featureCounts

  • Dimensionality reduction with PCA, t-SNE and UMAP

  • Batch correction in R with ComBat, ComBat-seq and removeBatchEffect

  • Supervised classification with machine learning: LinearSVM, SVM, RandomForest, MultiLayerPerceptron
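As an illustration of the TPM merge step listed above, the following is a minimal sketch and not the script shipped with this project. It assumes each sample has a tab-separated abundance table with a "Gene ID" and a "TPM" column; the function names merge_tpm and write_matrix, the column names and the "NA" placeholder are hypothetical and should be adapted to the actual nf-core/rnaseq output.

```python
import csv


def merge_tpm(sample_files, gene_col="Gene ID", value_col="TPM"):
    """Merge per-sample abundance tables into {gene: {sample: tpm}}.

    sample_files maps a sample name to the path of a tab-separated table.
    The column names are assumptions; adjust them to the real files.
    """
    merged = {}
    for sample, path in sample_files.items():
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                merged.setdefault(row[gene_col], {})[sample] = float(row[value_col])
    return merged


def write_matrix(merged, samples, out_path, missing="NA"):
    """Write a genes x samples CSV; genes absent from a sample get `missing`."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["gene_id"] + list(samples))
        for gene in sorted(merged):
            writer.writerow([gene] + [merged[gene].get(s, missing) for s in samples])
```

Keeping the merge and the export separate makes it possible to inspect or filter the merged dictionary before writing the matrix used for dimensionality reduction or batch correction.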

Main Workflow Overview

_images/workflow_scripts_final.png

Prerequisites

See also

Ensure that Nextflow, Docker and Singularity are installed

Setting up the conda environment

  • Requires conda

  • Requires Python version 3.8.8

  • Requires python_scripts/environment.yml

conda env create -f environment.yml

Activate the environment to run the scripts

conda activate python_scripts
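After activation, a quick sanity check can confirm that the scripts will run in the intended environment. The sketch below is an illustration only: the helper name environment_problems is hypothetical, it compares only the major and minor Python version against the 3.8.8 requirement, and it relies on conda exporting the CONDA_DEFAULT_ENV variable for the active environment.

```python
import os
import sys

REQUIRED_PYTHON = (3, 8)          # the chapter pins 3.8.8; only major.minor is compared
EXPECTED_ENV = "python_scripts"   # environment name used with `conda activate`


def environment_problems(version_info=sys.version_info, environ=os.environ):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(version_info[:2]) != REQUIRED_PYTHON:
        problems.append(
            "Python {}.{} found, expected {}.{}".format(
                version_info[0], version_info[1], *REQUIRED_PYTHON
            )
        )
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != EXPECTED_ENV:
        problems.append(
            "active conda env is {!r}, expected {!r}".format(active, EXPECTED_ENV)
        )
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print("WARNING:", problem)
```

Run as a script, it prints one warning per failed check and nothing when the environment matches.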

Alternative setup with a Docker container

  • Requires python_scripts/Dockerfile

  • Requires python_scripts/environment.yml

# folder structure within the container

├── app/
│   ├── tools.py
│   ├── ...
├── data/
│   ├── ...
├── Dockerfile
├── environment.yml
└── results/

# copy the code to run in the container to app/ and the data to data/

# from within folder containing scripts, Dockerfile and environment.yml
docker build -t <name> .

# start an interactive container from the image built above
docker run -it --rm -w <work_dir> -v <host_dir>:<container_dir> <name>

# run script from command-line
# note that the conda env is already activated