Uploaded image for project: 'IGB'
  1. IGB
  2. IGBF-2984

Investigate tools for detecting strandedness in RNA-Seq

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Some confusions have emerged regarding the strandedness of RNA-Seq data sets from tomato. Investigate tooling that can detect this in a dataset.

        Attachments

          Activity

          Hide
          ann.loraine Ann Loraine added a comment - - edited

          RSeqQC is a suite of python programs for analyzing RNA-Seq data. One of these programs is "infer_experiment.py", which can detect whether a stranded library synthesis protocol was used.

          To install and then use RSeqQC, use pip3 to install virtualenv for your user. Then start the virtual environment and install RSeqQC and its dependencies, using pip3.

          pip3 install virtualenv --user
          mkdir ~/envs
          cd envs
          virtualenv rseqc 
          source ~/envs/rseqc/bin/activate 
          pip3 install RSeQC
          
          Show
          ann.loraine Ann Loraine added a comment - - edited RSeqQC is a suite of python programs for analyzing RNA-Seq data. One of these programs is "infer_experiment.py", which can detect whether a stranded library synthesis protocol was used. To install and then use RSeqQC, use pip3 to install virtualenv for your user. Then start the virtual environment and install RSeqQC and its dependencies, using pip3. pip3 install virtualenv --user mkdir ~/envs cd envs virtualenv rseqc source ~/envs/rseqc/bin/activate pip3 install RSeQC
          Hide
          ann.loraine Ann Loraine added a comment -

          Options and arguments of "infer_experiment.py":

          $ infer_experiment.py -h
          Usage: infer_experiment.py [options]
          
          
          Options:
            --version             show program's version number and exit
            -h, --help            show this help message and exit
            -i INPUT_FILE, --input-file=INPUT_FILE
                                  Input alignment file in SAM or BAM format
            -r REFGENE_BED, --refgene=REFGENE_BED
                                  Reference gene model in bed fomat.
            -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE
                                  Number of reads sampled from SAM/BAM file.
                                  default=200000
            -q MAP_QUAL, --mapq=MAP_QUAL
                                  Minimum mapping quality (phred scaled) for an
                                  alignment to be considered as "uniquely mapped".
                                  default=30
          
          Show
          ann.loraine Ann Loraine added a comment - Options and arguments of "infer_experiment.py": $ infer_experiment.py -h Usage: infer_experiment.py [options] Options: --version show program's version number and exit -h, --help show this help message and exit -i INPUT_FILE, --input-file=INPUT_FILE Input alignment file in SAM or BAM format -r REFGENE_BED, --refgene=REFGENE_BED Reference gene model in bed fomat. -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE Number of reads sampled from SAM/BAM file. default =200000 -q MAP_QUAL, --mapq=MAP_QUAL Minimum mapping quality (phred scaled) for an alignment to be considered as "uniquely mapped" . default =30
          Hide
          ann.loraine Ann Loraine added a comment -

          Examples of running infer_experiment.py, with output:

          Unstranded protocol:

          $ infer_experiment.py -i VF36_C_R1.bam -r S_lycopersicum_Sep_2019.bed
          Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done
          Loading SAM/BAM file ...  Total 200000 usable reads were sampled
          This is PairEnd Data
          Fraction of reads failed to determine: 0.0087
          Fraction of reads explained by "1++,1--,2+-,2-+": 0.4945
          Fraction of reads explained by "1+-,1-+,2++,2--": 0.4969
          

          Intepretation: About half of the time, read 1 has the same sequence as its gene's plus strand. The other half of the time, it matches the gene's minus strand.

          Stranded synthesis protocol:

          $ infer_experiment.py -i A_C_3S.bam -r O_sativa_japonica_Oct_2011.bed 
          [W::hts_idx_load3] The index file is older than the data file: A_C_3S.bam.bai
          Reading reference gene model O_sativa_japonica_Oct_2011.bed ... Done
          Loading SAM/BAM file ...  Total 200000 usable reads were sampled
          This is SingleEnd Data
          Fraction of reads failed to determine: 0.0201
          Fraction of reads explained by "++,--": 0.0095
          Fraction of reads explained by "+-,-+": 0.9704
          

          Interpretation: Almost every sequenced read matches the antisense strand of a gene, never the sense strand of a gene.

          Older BAM files (from SRP252265), aligned with hisat2 and run using the wrong strandedness parameter:

          infer_experiment.py -i N3-S.bam -r S_lycopersicum_Sep_2019.bed 
          [W::hts_idx_load3] The index file is older than the data file: N3-S.bam.bai
          Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done
          Loading SAM/BAM file ...  Total 200000 usable reads were sampled
          This is PairEnd Data
          Fraction of reads failed to determine: 0.0013
          Fraction of reads explained by "1++,1--,2+-,2-+": 0.5170
          Fraction of reads explained by "1+-,1-+,2++,2--": 0.4817
          
          Show
          ann.loraine Ann Loraine added a comment - Examples of running infer_experiment.py, with output: Unstranded protocol: $ infer_experiment.py -i VF36_C_R1.bam -r S_lycopersicum_Sep_2019.bed Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is PairEnd Data Fraction of reads failed to determine: 0.0087 Fraction of reads explained by "1++,1--,2+-,2-+" : 0.4945 Fraction of reads explained by "1+-,1-+,2++,2--" : 0.4969 Intepretation: About half of the time, read 1 has the same sequence as its gene's plus strand. The other half of the time, it matches the gene's minus strand. Stranded synthesis protocol: $ infer_experiment.py -i A_C_3S.bam -r O_sativa_japonica_Oct_2011.bed [W::hts_idx_load3] The index file is older than the data file: A_C_3S.bam.bai Reading reference gene model O_sativa_japonica_Oct_2011.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is SingleEnd Data Fraction of reads failed to determine: 0.0201 Fraction of reads explained by "++,--" : 0.0095 Fraction of reads explained by "+-,-+" : 0.9704 Interpretation: Almost every sequenced read matches the antisense strand of a gene, never the sense strand of a gene. Older BAM files (from SRP252265), aligned with hisat2 and run using the wrong strandedness parameter: infer_experiment.py -i N3-S.bam -r S_lycopersicum_Sep_2019.bed [W::hts_idx_load3] The index file is older than the data file: N3-S.bam.bai Reading reference gene model S_lycopersicum_Sep_2019.bed ... Done Loading SAM/BAM file ... Total 200000 usable reads were sampled This is PairEnd Data Fraction of reads failed to determine: 0.0013 Fraction of reads explained by "1++,1--,2+-,2-+" : 0.5170 Fraction of reads explained by "1+-,1-+,2++,2--" : 0.4817
          Hide
          ann.loraine Ann Loraine added a comment - - edited

          Retrieved SRA-stored data from a previous experiment and converted from sra to fastq using fastq-dump. However, the defline did not include anything about the read number. Needed to use a different option.

          Trying fasterq-dump, with options:

            -S|--split-files                 write reads into different files
            -p|--progress                    show progress (not possible if stdout used)
            -P|--print-read-nr               include read-number in defline
          

          Command line:

          fasterq-dump -S -P -p SRR11284{116..139}
          

          Because I need internet connection for this to work, ran it on the head node, in a tmux session. Started 14:52 2021-10-24.

          Ref:

          Show
          ann.loraine Ann Loraine added a comment - - edited Retrieved SRA-stored data from a previous experiment and converted from sra to fastq using fastq-dump. However, the defline did not include anything about the read number. Needed to use a different option. Trying fasterq-dump, with options: -S|--split-files write reads into different files -p|--progress show progress (not possible if stdout used) -P|--print-read-nr include read-number in defline Command line: fasterq-dump -S -P -p SRR11284{116..139} Because I need internet connection for this to work, ran it on the head node, in a tmux session. Started 14:52 2021-10-24. Ref: Command-line number expansion syntax from "Bash tricks" by Julia Evans https://www.reddit.com/r/linux/comments/cn5ea3/bash_tricks_by_julia_evans/
          Show
          ann.loraine Ann Loraine added a comment - See: https://ncbiinsights.ncbi.nlm.nih.gov/2021/10/19/sra-lite/
          Hide
          ann.loraine Ann Loraine added a comment -

          The software is installed on the system. To run it:

          module load rseqc
          

          It dumps you into a virtual environment for the software.

          Show
          ann.loraine Ann Loraine added a comment - The software is installed on the system. To run it: module load rseqc It dumps you into a virtual environment for the software.
          Hide
          ann.loraine Ann Loraine added a comment -

          Moving to done.

          Show
          ann.loraine Ann Loraine added a comment - Moving to done.

            People

            • Assignee:
              ann.loraine Ann Loraine
              Reporter:
              ann.loraine Ann Loraine
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: