Metadata acquisition
Acquisition of RNA-Seq data
TCGA
Download the GDC Data Transfer Tool (gdc-client) and place it in /usr/local/bin/ or another location on your $PATH:
wget 'https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip' -nv -O gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip
unzip gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip
Obtain a manifest for data download and an authentication token. Download data with the following command:
gdc-client download -m <manifest> -t <token> -d <outdir>
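After the download finishes, it can be useful to confirm that every file listed in the manifest arrived intact. The following is a minimal sketch, not part of the pipeline: it assumes the manifest is tab-separated with `id`, `filename` and `md5` columns and that files are stored as `<outdir>/<id>/<filename>`. The function names `md5sum` and `check_downloads` are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(manifest_path, outdir):
    """Compare downloaded files against the manifest.

    Assumption: tab-separated manifest with `id`, `filename` and `md5`
    columns, and files stored as <outdir>/<id>/<filename>.
    Returns a dict mapping file id to "ok", "missing" or "md5 mismatch".
    """
    status = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = Path(outdir) / row["id"] / row["filename"]
            if not target.is_file():
                status[row["id"]] = "missing"
            elif md5sum(target) != row["md5"]:
                status[row["id"]] = "md5 mismatch"
            else:
                status[row["id"]] = "ok"
    return status
```

Calling `check_downloads("<manifest>", "<outdir>")` returns one status per file id, so any entry that is not `"ok"` can be re-downloaded selectively.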
ICGC
Note
ICGC currently allows data download from the Collaboratory Repository without specific cloud access. Other repositories require cloud access.
Download the score-client (requires OpenJDK 11):
wget -O score-client.tar.gz https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/\[RELEASE\]/score-client-\[RELEASE\]-dist.tar.gz
tar -xvzf score-client.tar.gz
cd score-client-5.1.0 # or newer version
bin/score-client
Obtain an access token and store it in score-client/conf/application.properties.
Download the data:
score-client --profile collab download --manifest <manifest.tsv> --output-dir <dir>
GTEx / NCBI SRA
Download data from the GTEx project via the Sequence Read Archive (SRA) using the qbic-pipelines/sradownloader pipeline. The current version does not yet support automatic metadata download; this feature is planned for a future release.
Obtain RNA-Seq expression data
Run qbic-pipelines/bamtofastq v.1.1.0
Run nf-core/rnaseq v.1.4.2
Metadata
Download TCGA metadata as JSON via the API (cases endpoint):
bash tcga_metadata.sh <manifest> <json_files>
Download ICGC metadata as JSON via the API (file and donor endpoints):
bash icgc_metadata.sh <manifest> <json_file_endpt> <json_donor_endpt>
Download SRA metadata from NCBI SRA (CSV and XML):
nextflow run metadata.nf --run_acc_list <SRA.txt> --outdir <results>
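Before launching the workflow, the accession list can be sanity-checked. This sketch assumes that `<SRA.txt>` holds one run accession per line (the expected file layout is not stated here) and relies on the INSDC convention that run accessions start with `SRR`, `ERR` or `DRR` followed by digits. The helper name `split_accessions` is hypothetical.

```python
import re

# INSDC run accessions: SRR (NCBI), ERR (ENA) or DRR (DDBJ) plus digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def split_accessions(lines):
    """Split lines into (valid, invalid) run accessions.

    Blank lines and repeated entries are skipped; input order is preserved.
    """
    valid, invalid, seen = [], [], set()
    for line in lines:
        accession = line.strip()
        if not accession or accession in seen:
            continue
        seen.add(accession)
        if RUN_ACCESSION.fullmatch(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid
```

A typical use would be `valid, invalid = split_accessions(open("<SRA.txt>"))`, stopping early if `invalid` is not empty.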
Merge the metadata from TCGA, ICGC, and SRA into a single CSV table:
Note
Multiple paths can be given as input.
metadata_processing.py --icgc <path_to_json> --sra <path_to_csv> --tcga <path_to_json> -o <all_metadata.csv>
| FileID | CaseID | SampleType | Project |
|---|---|---|---|
| FileID1 | SampleID1 | normal | Project1 |
| FileID1 | SampleID2 | tumor | Project2 |
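A quick way to inspect the merged table is to count samples per project and sample type. The sketch below assumes the output file is comma-separated with a header row containing the `Project` and `SampleType` columns shown above. The function name `count_sample_types` is hypothetical.

```python
import csv
from collections import Counter


def count_sample_types(csv_path):
    """Count rows per (Project, SampleType) pair in the merged metadata CSV.

    Assumption: comma-separated file with a header row that includes
    the `Project` and `SampleType` columns.
    """
    counts = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            counts[(row["Project"], row["SampleType"])] += 1
    return counts
```

For example, `count_sample_types("<all_metadata.csv>")` gives a `Counter` keyed by `(Project, SampleType)`, which makes projects lacking normal or tumor samples easy to spot.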
Extract rich metadata from TCGA or ICGC, including primary diagnosis, tumor subtype, gender, vital status, age, survival time, tumor stage, and ICD-10 code. Choose the tcga or icgc script prefix to match the data source:
tcga|icgc_metadata_processing.py -i <inpath> -o <outpath>
Extract rich metadata from the NCBI SRA XML files and annotate it with conditions:
xml_soup.py -x <inpath_xml_dir> -o <outpath>