Uploaded image for project: 'IGB'
  1. IGB
  2. IGBF-3040

Create samplesheet with SRR run identifiers and experimental attributes

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Previous analysis of the RNA-Seq data found that the treatments triggered changes in expression and splicing. Since our first processing of the data, we added a new dataset in which a methylation inhibitor was applied, but this dataset has not been processed as yet.

      A new RNA-Seq data analysis pipeline has been developed that uses the Nextflow workflow system, and we've been using this workflow system to process data from the pollen heat stress project. Also, this workflow management system is better equipped to accommodate diverse samples, e.g., sample libraries sequenced using different strategies (single- versus paired-end) and read lengths.

      Since we need to process a new methylation-inhibitor RNA-Seq dataset and incorporate it into our analysis, let's reprocess the data using a more up-to-date workflow - the nc-core/rnaseq pipeline.

      The first steps in doing this will be to:

      • generate a comma-separated sample sheet data file that relates SRA run identifiers to experimental attributes, required for running nextflow. Note that we can also use the sample sheet as inputs for statistical analyses.
      • generate a script that will download the SRA data files and convert them to fastq, required for running the pipeline. (Note: The project identifier is: PRJNA481973/)

      The sample sheet data file columns will include the following fields:

      1. SRA run identifier (e.g, SRR7591232)
      2. fastq_1 - SRR name with _1 appended (e.g, SRR7591232_1)
      3. fastq_2 - SRR name with _2 appended, or blank for single-end samples ( (e.g, SRR7591232_2)
      4. strandedness - should be "reverse" for Truseq Illumina protocol (see attached image from nf-core/rnaseq slack)
      5. genotype - A (Agami), M (M103)
      6. treatment - C (control), E (salt)
      7. 5-Azacytidine treatment - Y (treated), N (not treated)
      8. tissue - S (shoot), R (root)
      9. replicate - 1, 2, 3
      10. read length

        Attachments

          Issue Links

            Activity

            Hide
            ann.loraine Ann Loraine added a comment - - edited

            Used https://sra-explorer.info/ (SRA Explorer) to get tsv file (attached) listing SRR numbers for RNA-Seq data files associated with the experiment.

            Next, we can use fasterq-dump (installed on the Charlotte HPC system) to download the files from the SRA and simultaneously convert them to fastq format.

            See: this comment from IGBF-2984 "Investigate tools for detecting strandedness in RNA-Seq" for the correct options to use.

            Show
            ann.loraine Ann Loraine added a comment - - edited Used https://sra-explorer.info/ (SRA Explorer) to get tsv file (attached) listing SRR numbers for RNA-Seq data files associated with the experiment. Next, we can use fasterq-dump (installed on the Charlotte HPC system) to download the files from the SRA and simultaneously convert them to fastq format. See: this comment from IGBF-2984 "Investigate tools for detecting strandedness in RNA-Seq" for the correct options to use.
            Hide
            ann.loraine Ann Loraine added a comment - - edited

            Used cut, uniq, and sort to retrieve the 29 SRR identifiers, in order, for the RNA-Seq datasets listed in the attached tsv file.

            Using faster-dump, a utility part of SRA tools (released from NCBI), we can retrieve data using the following bash commands.

            Dividing into 6 sections of consecutive numbers, with bash commands to run:

            fasterq-dump -S -P -p SRR759123{2..4}
            SRR7591232
            SRR7591233
            SRR7591234
            
            fasterq-dump -S -P -p SRR759124{3..5}
            SRR7591243
            SRR7591244
            SRR7591245
            
            fasterq-dump -S -P -p SRR7591249
            SRR7591249
            
            fasterq-dump -S -P -p SRR75912{58..66}
            SRR7591258
            SRR7591259
            SRR7591260
            SRR7591261
            SRR7591262
            SRR7591263
            SRR7591264
            SRR7591265
            SRR7591266
            
            fasterq-dump -S -P -p SRR75912{69..73}
            SRR7591269
            SRR7591270
            SRR7591271
            SRR7591272
            SRR7591273
            
            fasterq-dump -S -P -p SRR759128{2..9}
            SRR7591282
            SRR7591283
            SRR7591284
            SRR7591285
            SRR7591286
            SRR7591287
            SRR7591288
            SRR7591289
            
            Show
            ann.loraine Ann Loraine added a comment - - edited Used cut, uniq, and sort to retrieve the 29 SRR identifiers, in order, for the RNA-Seq datasets listed in the attached tsv file. Using faster-dump, a utility part of SRA tools (released from NCBI), we can retrieve data using the following bash commands. Dividing into 6 sections of consecutive numbers, with bash commands to run: fasterq-dump -S -P -p SRR759123{2..4} SRR7591232 SRR7591233 SRR7591234 fasterq-dump -S -P -p SRR759124{3..5} SRR7591243 SRR7591244 SRR7591245 fasterq-dump -S -P -p SRR7591249 SRR7591249 fasterq-dump -S -P -p SRR75912{58..66} SRR7591258 SRR7591259 SRR7591260 SRR7591261 SRR7591262 SRR7591263 SRR7591264 SRR7591265 SRR7591266 fasterq-dump -S -P -p SRR75912{69..73} SRR7591269 SRR7591270 SRR7591271 SRR7591272 SRR7591273 fasterq-dump -S -P -p SRR759128{2..9} SRR7591282 SRR7591283 SRR7591284 SRR7591285 SRR7591286 SRR7591287 SRR7591288 SRR7591289
            Hide
            ann.loraine Ann Loraine added a comment - - edited

            Nowlan Freese - is the above information enough to enable you to create a first draft of the sample sheet file?

            Show
            ann.loraine Ann Loraine added a comment - - edited Nowlan Freese - is the above information enough to enable you to create a first draft of the sample sheet file?
            Hide
            nfreese Nowlan Freese added a comment - - edited

            I have attached the RNA-Seq sample sheet (samplesheet_RNA-Seq.csv).
            Note that the file has a header line.

            A few additional notes:

            • Not all samples were sequenced with paired end (we did not pay for paired end, but some samples were placed on flow cells that were running paired end samples for other clients).
            • The Agami salt treated Az treated shoot samples only had two biological replicates (2 and 3). My recollection is that something had happened to the biological replicate 1 sample during library preparation and we were unable to sequence it.
            • Some samples (the Az samples) were sequenced multiple times. These samples have paired end reads and an additional single end sequencing run that was done on a separate date. All of the fastq files were submitted to the SRA (the initial paired fastq files as well as the additional single end fastq file). However, it is unclear to me how those files are stored by the SRA. It could be that the SRA has combined the two R1 fastq submissions, and this is why there is a discrepancy in the spot lengths (see comment below). We may need to look at this more closely when aligning the Az samples.
            • The following samples had abnormal average spot lengths according to the SRA Run Selector: SRR7591233, SRR7591234, SRR7591249, SRR7591266, SRR7591286. These samples are the Az treated samples and are the same samples that were resequenced. This may be due to an issue with the sequencing itself or it may be due to the SRA combining R1 fastq files from two files (see above comment).
            • I double-checked the forward/reverse logic and it does appear correct that these samples would be labeled as reverse for strandedness.
            • The read lengths all appear to be 125bp.
            Show
            nfreese Nowlan Freese added a comment - - edited I have attached the RNA-Seq sample sheet (samplesheet_RNA-Seq.csv). Note that the file has a header line. A few additional notes: Not all samples were sequenced with paired end (we did not pay for paired end, but some samples were placed on flow cells that were running paired end samples for other clients). The Agami salt treated Az treated shoot samples only had two biological replicates (2 and 3). My recollection is that something had happened to the biological replicate 1 sample during library preparation and we were unable to sequence it. Some samples (the Az samples) were sequenced multiple times. These samples have paired end reads and an additional single end sequencing run that was done on a separate date. All of the fastq files were submitted to the SRA (the initial paired fastq files as well as the additional single end fastq file). However, it is unclear to me how those files are stored by the SRA. It could be that the SRA has combined the two R1 fastq submissions, and this is why there is a discrepancy in the spot lengths (see comment below). We may need to look at this more closely when aligning the Az samples. The following samples had abnormal average spot lengths according to the SRA Run Selector: SRR7591233, SRR7591234, SRR7591249, SRR7591266, SRR7591286. These samples are the Az treated samples and are the same samples that were resequenced. This may be due to an issue with the sequencing itself or it may be due to the SRA combining R1 fastq files from two files (see above comment). I double-checked the forward/reverse logic and it does appear correct that these samples would be labeled as reverse for strandedness. The read lengths all appear to be 125bp.
            Hide
            ann.loraine Ann Loraine added a comment -

            Adding sample sheet to the repository. Moving to Done.

            Show
            ann.loraine Ann Loraine added a comment - Adding sample sheet to the repository. Moving to Done.
            Hide
            nfreese Nowlan Freese added a comment -

            Note that the sample sheet attached to this ticket is outdated. The newest version of the sample sheet can be found in the bseq_rice repository.

            Show
            nfreese Nowlan Freese added a comment - Note that the sample sheet attached to this ticket is outdated. The newest version of the sample sheet can be found in the bseq_rice repository.

              People

              • Assignee:
                nfreese Nowlan Freese
                Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: