Preprocessing¶
This chapter outlines how to use the command-line tools.
Parsing stringTie-TPM¶
The script merges TPM values from rnaseq/stringTieFPKM/ and maps them to all possible Ensembl gene_ids
and Ensembl gene_names of the nf-core/rnaseq pipeline.
parseTPM.py
-i/--inpath: Path to stringtie folder
-g/--genes: Path to extracted genes from rnaseq file
-o/--outpath: Path to output file - tsv table containing merged TPM values
python parseTPM.py -i <path_to/stringTieFPKM/> -g <path_to/genes_id2name> -o <path_to/mergedTPM>
Parsing FeatureCounts¶
Depending on whether the pipeline has run through without issues it is possible to
directly take the rnaseq/featureCounts/merged_gene_counts.txt.
merge_gene_counts.py
-i/--inpath: Path to folder with nf-core/rnaseq/featureCounts/merged_gene_counts.txt tables to be merged
-o/--outpath: Destination directory. Extension of file filled automatically
-p/--pseudogenes: Path to file with tab separated file with pseudogenes as EnsemblID
python merge_gene_counts.py -i <merged_gene_counts.txt> -p <pseudogenes.txt> -o <gene_table.txt>
If the pipeline had errors and stopped, it might be necessary to merge the raw counts
separately from rnaseq/featureCounts/gene_counts/.
parseFeatureCounts.py
-i/--inpath: Path to featureCounts/gene_counts/
-g/--genes: Path to extracted genes from rnaseq file
--lengths/--no-lengths: Extract featureLengths (into csv) for raw count to TPM conversion (do this only once)
-o/--outpath: Path to output file - tsv table containing merged raw counts
python parseFeatureCounts.py -i <path_to/gene_counts/> --no-lengths -g <path_to/genes_id2name> -o <path_to/mergedFeatureCounts>
Merging Replica¶
If biological replicates are present, it is necessary to merge them into a single observation.
For this purpose the merge_replica.py script needs the associated metadata table.
It produces two gene_count output tables which can be both input into the clustering analysis.
The outputs are:
gene_count_table [GeneID, GeneName, ID1, ID2, …]
transformed metadata [FileID, CaseID, SampleType, Project]
combined gene_count_metadata_table [CaseID, SampleType, Condition, Project, UUID, ID1, ID2, …]
merge_replica.py
-i/--inpath: Path to merged_gene_counts.txt/ mergedFeatureCounts.txt
-m/--metadata: Path to metadata from metadata_processing.py
-o/--outpath: Output directory
Batch Correction¶
Adjust for known batch effects arising from technical variation with ComBat, ComBat-Seq (svaseq) and removeBatchEffect() from Limma.
Needs raw featureCounts as input and converts the counts to TPM units.
The batch_correction.R script should be run
from within the R_docker_container.
Additionally performs Principal Component Analysis for a first overview of the effect of batch correction.
# run in R_container
Rscript container/app/batch_correction.R
-c/--counts: Path to gene_count_table.txt
-m/--metadata: Path to metadata.csv
-l/--lengths: Path to featureLengths.txt - can be obtained by running parseFeatureCounts.py once with --lengths
-t/--title: Title for plots
-o/--outpath: Destination outpath
Analysis¶
Dimension Reduction / Embeddings with PCA, t-SNE and UMAP¶
Note
The mergedTPM/gene_counts.txt file has 63,675 unique genes and are highly dimensional.
Without pseudogenes there are still 47,224 genes left.
In order to get a feeling for the structure of the data, it is helpful to
reduce the number of dimensions to a lower dimensional embedding.
It is advisable to run the script before and after batch correction to get additional plots
with non-linear dimensionality reduction techniques.
dim_reduction.py()
scales the data on the gene features with sklearn.preprocessing.MinMaxScaler() to preserve the variances
can perform unsupervised linear reduction via Principal Component Analysis - PCA
can perform unsupervised non-linear reduction via T-distributed Stochastic Neighbor Embedding - t-SNE
can perform non-linear stochastic reduction with Uniform Manifold Approximation Projection - UMAP
outputs interactive html plots displaying metadata on hover (produced with plotly) and png images
can produce a comparative embedding in one image
can produce a silhouette plot to get an estimate of number of clusters in the data (k-means clustering)
dim_reduction.py
-i/--inpath: Path to gene_expression_table
-m/--metadata: Path to file with metadata for hover_info in plotly plots: [FileID, CaseID, SampleType, Project]
-o/--outpath: Path to output images folder
--pca/--no-pca: Conditional flag to perform PCA
--tsne/--no-tsne: Conditional flag to perform t-SNE
--umap/--no-umap: Conditional flag to perform UMAP
--comparison/--no-comparison: Conditional flag to plot PCA, t-SNE, UMAP in one plot
--silhouette/--no-silhouette: Conditional flag to perform silhouette analysis
-t/--title: Title for plots
python dim_reduction.py -i <mergedTPM> -m <metadata> -o <images/> --pca --tsne --umap --no-comparison --no-silhouette