IGB / IGBF-3507

Re-run Nextflow Muday time course data with SL4 and data downloaded from SRA

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      SRP460750

      Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/

      Only SL5 was rerun with the SRA data; SL4 needs to be rerun with the data as well.

      For this task, we need to confirm and sanity-check the Muday time course data that Rob recently uploaded and submitted to the Sequence Read Archive.
      If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
      For this task:

      • Check SRP on NCBI and review submission
      • Download the data onto the cluster by using the SRP name
      • Run nf-core/rnaseq pipeline
      • Run our coverage graph and junctions scripts on the data

      Note that all files should now use their "SRR" names instead of the existing file names.
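The renaming requirement above can be sketched as a small shell loop. The mapping file name (`name_to_srr.txt`), its layout, and the sample names below are illustrative assumptions, since the ticket does not show how the existing file names map to SRR accessions:

```shell
#!/bin/bash
# Hypothetical sketch: rename processed outputs to their SRR accessions using
# a two-column mapping file (old_name<TAB>SRR_accession). The file name
# "name_to_srr.txt" and the sample stems are invented for illustration.
set -euo pipefail

mkdir -p rename_demo
printf 'sample_A\tSRR25478240\nsample_B\tSRR25478241\n' \
    > rename_demo/name_to_srr.txt
touch rename_demo/sample_A.bam rename_demo/sample_B.bam   # demo fixtures

while IFS=$'\t' read -r old srr; do
    for f in rename_demo/"${old}".*; do
        ext="${f#rename_demo/${old}.}"        # everything after the old stem
        mv "$f" "rename_demo/${srr}.${ext}"
    done
done < rename_demo/name_to_srr.txt
```

The same loop generalizes to any suffix (.bam, .bai, .bedgraph.gz) because it keys on the stem, not the extension.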

        Attachments

          Issue Links

            Activity

            Mdavis4290 Molly Davis created issue -
            Mdavis4290 Molly Davis made changes -
            Field Original Value New Value
            Epic Link IGBF-2993 [ 21429 ]
            Mdavis4290 Molly Davis made changes -
            Link This issue relates to IGBF-3406 [ IGBF-3406 ]
            Mdavis4290 Molly Davis made changes -
            Sprint Fall 7 [ 183 ]
            Mdavis4290 Molly Davis made changes -
            Description SRP460750

            *Directory*: /projects/tomato_genome/fnb/dataprocessing/SRP460750/

            Only SL5 was rerun with the SRA data and SL4 needs to be run with data as well.
            SRP460750

            *Directory*: /projects/tomato_genome/fnb/dataprocessing/SRP460750/

            Only SL5 was rerun with the SRA data and SL4 needs to be run with data as well.

            For this task, we need to confirm and sanity-check the muday time course data that Rob recently uploaded and submitted to the Sequence Read Archive.
            If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
            For this task:
            * Check SRP on NCBI and review submission
            * Download the data onto the cluster by using the SRP name
            * Run nf-core/rnaseq pipeline
            * Run our coverage graph and junctions scripts on the data

            Note that all files should now use their "SRR" names instead of the existing file names.
            Mdavis4290 Molly Davis made changes -
            Sprint Fall 7 [ 183 ]
            Mdavis4290 Molly Davis made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 7 [ 183 ] Fall 7, Fall 8 [ 183, 184 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis made changes -
            Comment [ *Re-run Directory*: /projects/tomato_genome/fnb/dataprocessing/SRP460750

            *Prefetch SRR Script*:


            {code:java}

            #!/bin/bash

            #SBATCH --job-name=prefetch_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=4gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00

            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750
            module load sra-tools/2.11.0
            # Note: vdb-config --interactive expects a terminal and will not complete
            # in a batch job; run the configuration once interactively beforehand.
            vdb-config --interactive

            files=(
            SRR25478240
            SRR25478241
            SRR25478242
            SRR25478243
            SRR25478244
            SRR25478245
            SRR25478246
            SRR25478247
            SRR25478248
            SRR25478249
            SRR25478250
            SRR25478251
            SRR25478252
            SRR25478253
            SRR25478254
            SRR25478255
            SRR25478256
            SRR25478257
            SRR25478258
            SRR25478259
            SRR25478260
            SRR25478261
            SRR25478262
            SRR25478263
            SRR25478264
            SRR25478265
            SRR25478266
            SRR25478267
            SRR25478268
            SRR25478269
            SRR25478270
            SRR25478271
            SRR25478272
            SRR25478273
            SRR25478274
            SRR25478275
            SRR25478276
            SRR25478277
            SRR25478278
            SRR25478279
            SRR25478280
            SRR25478281
            SRR25478282
            SRR25478283
            SRR25478284
            SRR25478285
            SRR25478286
            SRR25478287
            SRR25478288
            SRR25478289
            SRR25478290
            SRR25478291
            SRR25478292
            SRR25478293
            SRR25478294
            SRR25478295
            SRR25478296
            SRR25478297
            SRR25478298
            SRR25478299
            SRR25478300
            SRR25478301
            SRR25478302
            SRR25478303
            SRR25478304
            SRR25478305
            SRR25478306
            SRR25478307
            SRR25478308
            SRR25478309
            SRR25478310
            SRR25478311

            )

            for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done


            {code}

            *Execute*:


            {code:java}
            chmod u+x prefetch.slurm
            {code}


            {code:java}
            sbatch prefetch.slurm
            {code}

            *Faster Dump Script*:


            {code:java}
            #!/bin/bash

            #SBATCH --job-name=fastqdump_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=40gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            #SBATCH --array=1-72

            #setting up where to grab files from
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /projects/tomato_genome/fnb/dataprocessing/SRP460750/Sra_ids.txt)


            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750
            module load sra-tools/2.11.0

            echo "Starting faster-qdump on $file";

            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/$file

            fasterq-dump ${file}.sra

            perl /projects/tomato_genome/scripts/validateHiseqPairs.pl ${file}_1.fastq ${file}_2.fastq

            cp ${file}_1.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/${file}_1.fastq
            cp ${file}_2.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/${file}_2.fastq

            echo "finished"
            {code}


            *Execute*:

            {code:java}
            chmod u+x fasterdump.slurm
            {code}


            {code:java}
            sbatch fasterdump.slurm
            {code}
            ]
            Mdavis4290 Molly Davis added a comment -

            Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4

            Prefetch SRR Script:

            #!/bin/bash
            
            #SBATCH --job-name=prefetch_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=4gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            
            cd  /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4
            module load sra-tools/2.11.0
            # Note: vdb-config --interactive expects a terminal and will not complete
            # in a batch job; run the configuration once interactively beforehand.
            vdb-config --interactive
            
            files=(
            SRR25478240
            SRR25478241
            SRR25478242
            SRR25478243
            SRR25478244
            SRR25478245
            SRR25478246
            SRR25478247
            SRR25478248
            SRR25478249
            SRR25478250
            SRR25478251
            SRR25478252
            SRR25478253
            SRR25478254
            SRR25478255
            SRR25478256
            SRR25478257
            SRR25478258
            SRR25478259
            SRR25478260
            SRR25478261
            SRR25478262
            SRR25478263
            SRR25478264
            SRR25478265
            SRR25478266
            SRR25478267
            SRR25478268
            SRR25478269
            SRR25478270
            SRR25478271
            SRR25478272
            SRR25478273
            SRR25478274
            SRR25478275
            SRR25478276
            SRR25478277
            SRR25478278
            SRR25478279
            SRR25478280
            SRR25478281
            SRR25478282
            SRR25478283
            SRR25478284
            SRR25478285
            SRR25478286
            SRR25478287
            SRR25478288
            SRR25478289
            SRR25478290
            SRR25478291
            SRR25478292
            SRR25478293
            SRR25478294
            SRR25478295
            SRR25478296
            SRR25478297
            SRR25478298
            SRR25478299
            SRR25478300
            SRR25478301
            SRR25478302
            SRR25478303
            SRR25478304
            SRR25478305
            SRR25478306
            SRR25478307
            SRR25478308
            SRR25478309
            SRR25478310
            SRR25478311
            
            )
            
            for f in "${files[@]}"; do echo "$f"; prefetch "$f"; done
            
            
            

            Execute:

            chmod u+x prefetch.slurm
            
            sbatch prefetch.slurm
            

            Faster Dump Script:

            #!/bin/bash
            
            #SBATCH --job-name=fastqdump_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=40gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            #SBATCH --array=1-72
            
            #setting up where to grab files from
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p"  /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/Sra_ids.txt)
            
            
            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4
            module load sra-tools/2.11.0
            
            echo "Starting faster-qdump on $file";
            
            cd /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/$file
            
            fasterq-dump ${file}.sra
            
            perl /projects/tomato_genome/scripts/validateHiseqPairs.pl ${file}_1.fastq ${file}_2.fastq
            
            cp ${file}_1.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/${file}_1.fastq
            cp ${file}_2.fastq /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/${file}_2.fastq 
            
            echo "finished"
            

            Execute:

            chmod u+x fasterdump.slurm
            
            sbatch fasterdump.slurm
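            The validateHiseqPairs.pl script itself is not shown in the ticket; a minimal shell stand-in, assuming its purpose is to confirm that both mate files contain the same number of reads (4 lines per FASTQ record), might look like this:

```shell
#!/bin/bash
# Hedged sketch of a paired-FASTQ sanity check in the spirit of
# validateHiseqPairs.pl (whose internals are not shown here): confirm both
# mates have the same record count. Demo fixtures stand in for
# ${file}_1.fastq / ${file}_2.fastq.
set -euo pipefail

printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo_1.fastq
printf '@r1\nTGCA\n+\nIIII\n@r2\nAAAA\n+\nIIII\n' > demo_2.fastq

n1=$(( $(wc -l < demo_1.fastq) / 4 ))   # reads in mate 1
n2=$(( $(wc -l < demo_2.fastq) / 4 ))   # reads in mate 2

if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs" > pair_check.txt
else
    echo "MISMATCH: $n1 vs $n2" > pair_check.txt
fi
```

A real validator would also compare read IDs mate-by-mate; this only checks counts.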
            
            Mdavis4290 Molly Davis made changes -
            Status In Progress [ 3 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 7, Fall 8 [ 183, 184 ] Fall 7, Spring 2 [ 183, 186 ]
            Mdavis4290 Molly Davis made changes -
            Sprint Fall 7, Spring 2 [ 183, 186 ] Fall 7, Spring 1 [ 183, 185 ]
            Mdavis4290 Molly Davis made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            Mdavis4290 Molly Davis added a comment - - edited

            The Nextflow pipeline ran successfully with the SL4 genome.
            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4
            MultiQC report notes: No errors or warnings were present in the report. The output file is named 'SRP460750_SL4_multiqc_report.html'.

            Mdavis4290 Molly Davis added a comment -

            Next steps:
            Commit multiqc report to Flavonoid repo on bitbucket
            Change sorted bam names
            Create junction files
            Create Coverage graphs

            Mdavis4290 Molly Davis added a comment -

            Launch renameBams.sh script:
            ./renameBams.sh
            Launch Scaled Coverage graphs script:
            ./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
            Launch Junction files script:
            ./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
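            The contents of sbatch-doIt.sh are not shown in the ticket; a minimal sketch of such a driver, guessing at the pattern of submitting one job per file with a given suffix, might look like this (setting SUBMIT=echo gives a dry run without SLURM):

```shell
#!/bin/bash
# Hypothetical sketch of an sbatch-doIt.sh-style driver: submit one job per
# file ending in SUFFIX. This is a guess at the pattern, not the real script.
# SUBMIT defaults to sbatch; SUBMIT=echo makes it a dry run.
set -euo pipefail

submit_per_file() {
    local suffix="$1" script="$2" f
    for f in *"$suffix"; do
        [ -e "$f" ] || continue              # no matches: skip
        "${SUBMIT:-sbatch}" "$script" "$f"
    done
}

# Dry-run demo on two placeholder BAMs:
touch demo_one.bam demo_two.bam
SUBMIT=echo submit_per_file .bam bamCoverage.sh > jobs.out 2> jobs.err
```

Redirecting stdout and stderr to jobs.out/jobs.err mirrors the invocation shown in the comment above.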

            Mdavis4290 Molly Davis added a comment -

            Directory: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/results/star_salmon
            Reviewer:
            Check that files have reasonable sizes (no "zero" size files, for example)
            Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
            Check that every bam file has a corresponding "FJ.bed.gz" file
            Check that every bam file has a corresponding "scaled.bedgraph.gz" file
            Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"

            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            Mdavis4290 Molly Davis made changes -
            Assignee Robert Reid [ robertreid ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 7, Spring 1 [ 183, 185 ] Fall 7, Spring 1, Spring 2 [ 183, 185, 186 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            robofjoy Robert Reid added a comment -

            The folder exists.

            There are 72 files that end with:

            .tbi
            .bam
            .bai
            .gz

            The .err files are all empty, implying no issues with the individual runs.

            All BAM files are about 2-3 GB.
            All .bai files are about 750 KB, which seems correct.
            All of the bed.gz files are about 3-4 MB.

            Everything looks great!
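            The spot checks described above (per-extension counts, empty .err logs) can be reproduced with a short loop; the directory and file names here are demo stand-ins for the real results folder:

```shell
#!/bin/bash
# Sketch of the review spot checks: count files per extension and flag any
# non-empty .err logs. Demo fixtures stand in for the real results folder.
set -euo pipefail

mkdir -p rev
touch rev/a.bam rev/b.bam rev/a.err
printf 'traceback\n' > rev/b.err        # one deliberately non-empty log

for ext in tbi bam bai gz err; do
    count=$(find rev -maxdepth 1 -name "*.${ext}" | wc -l)
    printf '%s %s\n' "$ext" "$count"
done > rev_counts.txt

# List any .err file larger than 0 bytes (should be empty on a clean run):
find rev -maxdepth 1 -name '*.err' -size +0c > rev_bad_err.txt
```

On the real data, each count in rev_counts.txt should be 72 and rev_bad_err.txt should be empty.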

            robofjoy Robert Reid made changes -
            Assignee Robert Reid [ robertreid ] Molly Davis [ molly ]
            robofjoy Robert Reid made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            Mdavis4290 Molly Davis made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            Mdavis4290 Molly Davis added a comment - Branch: https://bitbucket.org/mdavis4290/molly5-flavonoid-rnaseq/branch/IGBF-3507
            Mdavis4290 Molly Davis made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis added a comment - PR: https://bitbucket.org/hotpollen/flavonoid-rnaseq/pull-requests/40
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            Mdavis4290 Molly Davis made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 7, Spring 1, Spring 2 [ 183, 185, 186 ] Fall 7, Spring 1, Spring 2, Spring 3 [ 183, 185, 186, 187 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine added a comment -

            Suggestions for testing:

            • Review the newly added reports and check for problems.
            • Compare this new report to the one we made for the "original" data files - are they consistent? The statistics for the new files should match the statistics for the old ones.
            • However, recall that the "original" data files have the "original" file names assigned by the sequencer. Later we learned that the files were mis-named. When we submitted the files to the SRA, we submitted them using corrected, revised sample names.
            ann.loraine Ann Loraine added a comment -

            Molly Davis - please see the above comment on how to test. I don't know whether you have already compared the files or not.

            If not, it would be good to do that now.

            The QC reports provide a great overview of a data processing run. Comparing the QC reports pre- and post-SRA submission will tell us a lot. For example, if there are big differences between the pre- and post-SRA submission files, or if something went wrong with the sample switching, the QC report will likely show it.

            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 7, Spring 1, Spring 2, Spring 3 [ 183, 185, 186, 187 ] Fall 7, Spring 1, Spring 2, Spring 3, Spring 4 [ 183, 185, 186, 187, 188 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis added a comment - - edited

            Testing:

            • Compared MultiQC reports from the original Muday SL4 data to this rerun's SL4 report.
            • The mapping numbers are not identical between the two reports, but they are close enough that there is no cause for concern; the plots and averages show the same overall patterns, and the small differences are likely due to the sample switching in the original data.
            • I noticed that the original Muday MultiQC reports for SL4 and SL5 are not in the flavonoid repo, so I am attaching them to this ticket just in case; since that data has the sample switching, I assume we don't want them officially in the flavonoid repo.

            Moving to done!

            Mdavis4290 Molly Davis made changes -
            Attachment muday-144-SL4-multiqc_report.html [ 18224 ]
            Attachment muday-144-SL5-multiqc_report.html [ 18225 ]
            Mdavis4290 Molly Davis made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            Mdavis4290 Molly Davis made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
            Mdavis4290 Molly Davis made changes -
            Link This issue relates to IGBF-3627 [ IGBF-3627 ]
            Mdavis4290 Molly Davis made changes -
            Link This issue relates to IGBF-3720 [ IGBF-3720 ]

              People

              • Assignee:
                Mdavis4290 Molly Davis
                Reporter:
                Mdavis4290 Molly Davis
              • Votes:
                0 Vote for this issue
                Watchers:
                Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: