[IGBF-2984] Investigate tools for detecting strandedness in RNA-Seq - JIRA UNCC

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
0.25
Epic Link:
Deploy and maintain infrastructure
Sprint:
Fall 5 2021 Oct 11 - Oct 22

Description

Some confusions have emerged regarding the strandedness of RNA-Seq data sets from tomato. Investigate tooling that can detect this in a dataset.

Attachments

Activity

Ascending order - Click to sort in descending order

Hide

Permalink

Ann Loraine added a comment - 20/Oct/21 4:22 AM - edited

RSeqQC is a suite of python programs for analyzing RNA-Seq data. One of these programs is "infer_experiment.py", which can detect whether a stranded library synthesis protocol was used.

To install and then use RSeqQC, use pip3 to install virtualenv for your user. Then start the virtual environment and install RSeqQC and its dependencies, using pip3.

pip3 install virtualenv --user
mkdir ~/envs
cd envs
virtualenv rseqc 
source ~/envs/rseqc/bin/activate 
pip3 install RSeQC

Show

Ann Loraine added a comment - 20/Oct/21 4:22 AM - edited RSeqQC is a suite of python programs for analyzing RNA-Seq data. One of these programs is "infer_experiment.py", which can detect whether a stranded library synthesis protocol was used. To install and then use RSeqQC, use pip3 to install virtualenv for your user. Then start the virtual environment and install RSeqQC and its dependencies, using pip3. pip3 install virtualenv --user mkdir ~/envs cd envs virtualenv rseqc source ~/envs/rseqc/bin/activate pip3 install RSeQC

Hide

Permalink

Ann Loraine added a comment - 20/Oct/21 4:28 AM

Options and arguments of "infer_experiment.py":

$ infer_experiment.py -h
Usage: infer_experiment.py [options]


Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input-file=INPUT_FILE
                        Input alignment file in SAM or BAM format
  -r REFGENE_BED, --refgene=REFGENE_BED
                        Reference gene model in bed fomat.
  -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE
                        Number of reads sampled from SAM/BAM file.
                        default=200000
  -q MAP_QUAL, --mapq=MAP_QUAL
                        Minimum mapping quality (phred scaled) for an
                        alignment to be considered as "uniquely mapped".
                        default=30

Show

Ann Loraine added a comment - 20/Oct/21 4:28 AM Options and arguments of "infer_experiment.py": $ infer_experiment.py -h Usage: infer_experiment.py [options] Options: --version show program's version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input-file=INPUT_FILE Input alignment file in SAM or BAM format -r REFGENE_BED, --refgene=REFGENE_BED Reference gene model in bed fomat. -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE Number of reads sampled from SAM/BAM file. default =200000 -q MAP_QUAL, --mapq=MAP_QUAL Minimum mapping quality (phred scaled) for an alignment to be considered as "uniquely mapped" . default =30

Hide

Permalink

Ann Loraine added a comment - 20/Oct/21 4:37 AM

Examples of running infer_experiment.py, with output:

Unstranded protocol:

$ infer_experiment.py -i VF36_C_R1.bam -r S_lycopersicum_Sep_2019.bed
Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0087
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4945
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4969

Intepretation: About half of the time, read 1 has the same sequence as its gene's plus strand. The other half of the time, it matches the gene's minus strand.

Stranded synthesis protocol:

$ infer_experiment.py -i A_C_3S.bam -r O_sativa_japonica_Oct_2011.bed 
[W::hts_idx_load3] The index file is older than the data file: A_C_3S.bam.bai
Reading reference gene model O_sativa_japonica_Oct_2011.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is SingleEnd Data
Fraction of reads failed to determine: 0.0201
Fraction of reads explained by "++,--": 0.0095
Fraction of reads explained by "+-,-+": 0.9704

Interpretation: Almost every sequenced read matches the antisense strand of a gene, never the sense strand of a gene.

Older BAM files (from SRP252265), aligned with hisat2 and run using the wrong strandedness parameter:

infer_experiment.py -i N3-S.bam -r S_lycopersicum_Sep_2019.bed 
[W::hts_idx_load3] The index file is older than the data file: N3-S.bam.bai
Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0013
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5170
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4817

Show

Ann Loraine added a comment - 20/Oct/21 4:37 AM Examples of running infer_experiment.py, with output: Unstranded protocol: $ infer_experiment.py -i VF36_C_R1.bam -r S_lycopersicum_Sep_2019.bed Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is PairEnd Data Fraction of reads failed to determine: 0.0087 Fraction of reads explained by "1++,1--,2+-,2-+" : 0.4945 Fraction of reads explained by "1+-,1-+,2++,2--" : 0.4969 Intepretation: About half of the time, read 1 has the same sequence as its gene's plus strand. The other half of the time, it matches the gene's minus strand. Stranded synthesis protocol: $ infer_experiment.py -i A_C_3S.bam -r O_sativa_japonica_Oct_2011.bed [W::hts_idx_load3] The index file is older than the data file: A_C_3S.bam.bai Reading reference gene model O_sativa_japonica_Oct_2011.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is SingleEnd Data Fraction of reads failed to determine: 0.0201 Fraction of reads explained by "++,--" : 0.0095 Fraction of reads explained by "+-,-+" : 0.9704 Interpretation: Almost every sequenced read matches the antisense strand of a gene, never the sense strand of a gene. Older BAM files (from SRP252265), aligned with hisat2 and run using the wrong strandedness parameter: infer_experiment.py -i N3-S.bam -r S_lycopersicum_Sep_2019.bed [W::hts_idx_load3] The index file is older than the data file: N3-S.bam.bai Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is PairEnd Data Fraction of reads failed to determine: 0.0013 Fraction of reads explained by "1++,1--,2+-,2-+" : 0.5170 Fraction of reads explained by "1+-,1-+,2++,2--" : 0.4817

Hide

Permalink

Ann Loraine added a comment - 20/Oct/21 5:45 AM - edited

Retrieved SRA-stored data from a previous experiment and converted from sra to fastq using fastq-dump. However, the defline did not include anything about the read number. Needed to use a different option.

Trying fasterq-dump, with options:

  -S|--split-files                 write reads into different files
  -p|--progress                    show progress (not possible if stdout used)
  -P|--print-read-nr               include read-number in defline

Command line:

fasterq-dump -S -P -p SRR11284{116..139}

Because I need internet connection for this to work, ran it on the head node, in a tmux session. Started 14:52 2021-10-24.

Ref:

Command-line number expansion syntax from "Bash tricks" by Julia Evans https://www.reddit.com/r/linux/comments/cn5ea3/bash_tricks_by_julia_evans/

Show

Ann Loraine added a comment - 20/Oct/21 5:45 AM - edited Retrieved SRA-stored data from a previous experiment and converted from sra to fastq using fastq-dump. However, the defline did not include anything about the read number. Needed to use a different option. Trying fasterq-dump, with options: -S|--split-files write reads into different files -p|--progress show progress (not possible if stdout used) -P|--print-read-nr include read-number in defline Command line: fasterq-dump -S -P -p SRR11284{116..139} Because I need internet connection for this to work, ran it on the head node, in a tmux session. Started 14:52 2021-10-24. Ref: Command-line number expansion syntax from "Bash tricks" by Julia Evans https://www.reddit.com/r/linux/comments/cn5ea3/bash_tricks_by_julia_evans/

Hide

Permalink

Ann Loraine added a comment - 20/Oct/21 5:48 AM

See: https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/19/sra-lite/

Show

Ann Loraine added a comment - 20/Oct/21 5:48 AM See: https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/19/sra-lite/

Hide

Permalink

Ann Loraine added a comment - 24/Oct/21 5:16 PM

The software is installed on the system. To run it:

module load rseqc

It dumps you into a virtual environment for the software.

Show

Ann Loraine added a comment - 24/Oct/21 5:16 PM The software is installed on the system. To run it: module load rseqc It dumps you into a virtual environment for the software.

Hide

Permalink

Ann Loraine added a comment - 24/Oct/21 5:17 PM

Moving to done.

Show

Ann Loraine added a comment - 24/Oct/21 5:17 PM Moving to done.

People

Assignee:

Ann Loraine

Reporter:

Ann Loraine

Votes:

0 Vote for this issue

Watchers:

1 Start watching this issue

Dates

Created:

20/Oct/21 4:19 AM

Updated:

24/Oct/21 5:17 PM

Resolved:

24/Oct/21 5:17 PM