Metadata acquisition

Acquisition of RNA-Seq data

TCGA

Download the GDC Data Transfer Tool and place it in /usr/local/bin/ or another directory on your $PATH:

wget 'https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip' -nv -O gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip
unzip gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip

Obtain a manifest for data download and an authentication token. Download data with the following command:

gdc-client download -m <manifest> -t <token> -d <outdir>
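Once the download has finished, it can be useful to check which manifest entries actually arrived on disk. The following is a minimal sketch, not part of this repository: it assumes the usual tab-separated GDC manifest with `id`, `filename`, and `md5` columns, and the default gdc-client layout `<outdir>/<id>/<filename>`. The function names `md5sum` and `verify_downloads` are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(manifest_path, outdir):
    """Map each manifest file ID to 'ok', 'missing', or 'md5 mismatch'.

    Assumes a tab-separated manifest with 'id', 'filename', and 'md5'
    columns, and files stored as <outdir>/<id>/<filename>.
    """
    status = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = Path(outdir) / row["id"] / row["filename"]
            if not target.is_file():
                status[row["id"]] = "missing"
            elif md5sum(target) != row["md5"]:
                status[row["id"]] = "md5 mismatch"
            else:
                status[row["id"]] = "ok"
    return status
```

Calling `verify_downloads("<manifest>", "<outdir>")` returns a dictionary that can be filtered for entries other than `ok` to decide which files to download again.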

ICGC

Note

ICGC currently allows data from the Collaboratory repository to be downloaded without dedicated cloud access. All other repositories require cloud access.

Download the score-client (requires OpenJDK 11):

wget -O score-client.tar.gz https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/\[RELEASE\]/score-client-\[RELEASE\]-dist.tar.gz
tar -xvzf score-client.tar.gz
cd score-client-5.1.0  # or newer version
bin/score-client

Obtain an access token and store it in score-client/conf/application.properties.

Download Data

score-client --profile collab download --manifest <manifest.tsv> --output-dir <dir>

GTEx / NCBI SRA

Download data from the GTEx project via the Sequence Read Archive (SRA) using the qbic-pipelines/sradownloader pipeline. The current version does not yet support automatic metadata download; this feature is planned for a future release.

Obtain RNA-Seq expression data

Metadata

Download TCGA metadata in JSON format via the API (case endpoint):

bash tcga_metadata.sh <manifest> <json_files>
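For orientation, a query against the case endpoint might look like the sketch below. It is not taken from tcga_metadata.sh, whose exact request is not shown here. It assumes the public GDC API at `https://api.gdc.cancer.gov/cases`, filters cases by file UUID, and requests a small set of fields chosen for illustration. The helper names `build_cases_query` and `fetch_cases` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: the public GDC API "cases" endpoint.
GDC_CASES_URL = "https://api.gdc.cancer.gov/cases"


def build_cases_query(file_ids, size=100):
    """Build a query selecting the cases that own the given file UUIDs."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": list(file_ids)},
        },
        # Illustrative field selection; adjust to the metadata you need.
        "fields": "case_id,submitter_id,project.project_id,samples.sample_type",
        "format": "JSON",
        "size": size,
    }


def fetch_cases(file_ids):
    """POST the query to the GDC API and return the decoded JSON response."""
    body = json.dumps(build_cases_query(file_ids)).encode("utf-8")
    request = urllib.request.Request(
        GDC_CASES_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The file UUIDs would typically come from the `id` column of the manifest. Building the payload separately from the network call keeps the query easy to inspect before anything is sent.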

Download ICGC metadata in JSON format via the API (file and donor endpoints):

bash icgc_metadata.sh <manifest> <json_file_endpt> <json_donor_endpt>

Download SRA metadata from NCBI SRA in CSV and XML format:

nextflow run metadata.nf --run_acc_list <SRA.txt> --outdir <results>
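Before launching the workflow, the accession list can be sanity-checked. The sketch below assumes that `<SRA.txt>` holds one run accession per line (an assumption about the input format) and that run accessions follow the INSDC pattern of `SRR`, `ERR`, or `DRR` followed by digits. The function name `split_accessions` is hypothetical.

```python
import re

# INSDC run accessions: SRR/ERR/DRR prefix followed by at least six digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d{6,}$")


def split_accessions(lines):
    """Split lines into (valid, invalid) run accessions, skipping blank lines."""
    valid, invalid = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        (valid if RUN_ACCESSION.match(entry) else invalid).append(entry)
    return valid, invalid
```

Passing an open file handle, for example `split_accessions(open("<SRA.txt>"))`, works because the function only iterates over lines.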

Extract metadata from TCGA, ICGC, and SRA into a single CSV table:

Note

Each input option accepts multiple paths.

metadata_processing.py --icgc <path_to_json> --sra <path_to_csv> --tcga <path_to_json> -o <all_metadata.csv>
Structure of the table produced by metadata_processing.py:

FileID     CaseID      SampleType   Project
FileID1    SampleID1   normal       Project1
FileID1    SampleID2   tumor        Project2
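As a quick check of the combined table, the rows can be counted per project and sample type. This is a minimal sketch assuming the output is comma-separated with a header row using the column names shown in the table above. The function name `count_samples` is hypothetical, and the real file may contain additional columns.

```python
import csv
import io
from collections import Counter


def count_samples(csv_handle):
    """Count rows per (Project, SampleType) pair in the combined metadata table."""
    reader = csv.DictReader(csv_handle)
    return Counter((row["Project"], row["SampleType"]) for row in reader)


# Example rows mirroring the table above.
example = io.StringIO(
    "FileID,CaseID,SampleType,Project\n"
    "FileID1,SampleID1,normal,Project1\n"
    "FileID1,SampleID2,tumor,Project2\n"
)
counts = count_samples(example)
```

For a real run, replace the in-memory example with `open("<all_metadata.csv>", newline="")`.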

Extract rich metadata from TCGA or ICGC, including primary diagnosis, tumor subtype, gender, vital status, age, survival time, tumor stage, and ICD-10 code:

tcga_metadata_processing.py -i <inpath> -o <outpath>
icgc_metadata_processing.py -i <inpath> -o <outpath>

Extract rich metadata from NCBI SRA XML files and annotate it with conditions:

xml_soup.py -x <inpath_xml_dir> -o <outpath>
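To illustrate the kind of information held in the SRA XML, the sketch below collects sample attributes per run accession. It is not the implementation of xml_soup.py: it uses only the Python standard library, and the embedded record is a hypothetical, heavily trimmed example in the style of an SRA experiment package. The function name `sample_attributes` is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record in the style of an SRA experiment package.
EXAMPLE_XML = """
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE accession="SRS000001">
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE><TAG>tissue</TAG><VALUE>liver</VALUE></SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE><TAG>disease</TAG><VALUE>normal</VALUE></SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
    <RUN_SET><RUN accession="SRR000001"/></RUN_SET>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
"""


def sample_attributes(xml_text):
    """Return {run accession: {tag: value}} for each experiment package."""
    root = ET.fromstring(xml_text)
    result = {}
    for package in root.iter("EXPERIMENT_PACKAGE"):
        # Gather every TAG/VALUE pair attached to the package's sample.
        attrs = {
            attr.findtext("TAG"): attr.findtext("VALUE")
            for attr in package.iter("SAMPLE_ATTRIBUTE")
        }
        # Attach the sample attributes to each run in the same package.
        for run in package.iter("RUN"):
            result[run.get("accession")] = attrs
    return result
```

Free-text tags such as `tissue` or `disease` vary between submitters, so any annotation with conditions would still need a mapping step on top of this.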