  1. IGB
  2. IGBF-3500

Re-run mark-2022-timeseries data with data downloaded from SRA

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.

      For this task, we need to confirm and sanity-check the mark 2022 time series data that Rob uploaded and submitted to the Sequence Read Archive.

      If the data are good, we will replace all the existing BAM, junctions, etc. files deployed in the "hotpollen" quickload site with newly processed data.
      For this task:

      • Check SRP on NCBI and review submission
      • Download the data onto the cluster by using the SRP name
      • Run nf-core/rnaseq pipeline
      • Run our coverage graph and junctions scripts on the data

      Note that all files should now use their "SRR" names instead of the existing file names.

            Activity

            Mdavis4290 Molly Davis created issue -
            Mdavis4290 Molly Davis made changes -
            Field Original Value New Value
            Epic Link IGBF-2993 [ 21429 ]
            Mdavis4290 Molly Davis made changes -
            Link This issue relates to IGBF-3406 [ IGBF-3406 ]
            Mdavis4290 Molly Davis made changes -
            Link This issue relates to IGBF-3499 [ IGBF-3499 ]
            Mdavis4290 Molly Davis made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            Mdavis4290 Molly Davis added a comment - - edited

            Re-run Directory: /projects/tomato_genome/fnb/dataprocessing/SRP441343
            SL4: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
            SL5: /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5

            Prefetch SRR Script:

            #! /bin/bash
            
            #SBATCH --job-name=prefetch_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=4gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            
            cd   /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4
            module load sra-tools/2.11.0
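            # Configure sra-tools once beforehand with "vdb-config --interactive" on a login node;
            # it opens a terminal UI and should not be called from a batch job.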
            
            files=(
            SRR24836276
            SRR24836277
            SRR24836278
            SRR24836279
            SRR24836280
            SRR24836281
            SRR24836282
            SRR24836283
            SRR24836284
            SRR24836285
            SRR24836286
            SRR24836287
            SRR24836288
            SRR24836289
            SRR24836290
            SRR24836291
            SRR24836292
            SRR24836293
            SRR24836294
            SRR24836295
            SRR24836296
            SRR24836297
            SRR24836298
            SRR24836299
            SRR24836300
            SRR24836301
            SRR24836302
            SRR24836303
            SRR24836304
            SRR24836305
            SRR24836306
            SRR24836307
            SRR24836308
            SRR24836309
            SRR24836310
            SRR24836311
            SRR24836312
            SRR24836313
            SRR24836314
            SRR24836315
            SRR24836316
            SRR24836317
            SRR24836318
            SRR24836319
            SRR24836320
            SRR24836321
            SRR24836322
            SRR24836323
            SRR24836324
            SRR24836325
            SRR24836326
            SRR24836327
            SRR24836328
            SRR24836329
            )
            
            for f in "${files[@]}"; do
                echo "$f"
                prefetch "$f"
            done
            

            Execute:

            chmod u+x prefetch.slurm
            
            sbatch prefetch.slurm
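
            A quick completeness check after the prefetch job can catch partial downloads before the fasterq-dump array starts. The sketch below is not one of the ticket's scripts: `missing_sra` is a hypothetical helper, and it assumes prefetch wrote each run to `<accession>/<accession>.sra` under the working directory, which is the layout the fasterq-dump script reads from.

```shell
#!/bin/bash
# Sketch: print every accession whose prefetch output is missing or empty.
# Assumed layout: <dir>/<accession>/<accession>.sra

missing_sra() {
    local dir=$1
    shift
    local acc
    for acc in "$@"; do
        # -s is true only for files that exist and are larger than zero bytes
        if [ ! -s "$dir/$acc/$acc.sra" ]; then
            echo "$acc"
        fi
    done
}
```

            Called as `missing_sra /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4 "${files[@]}"`, it prints nothing when all runs are present.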
            

            fasterq-dump Script:

            #! /bin/bash
            
            #SBATCH --job-name=fastqdump_SRR
            #SBATCH --partition=Orion
            #SBATCH --nodes=1
            #SBATCH --ntasks-per-node=1
            #SBATCH --mem=40gb
            #SBATCH --output=%x_%j.out
            #SBATCH --time=24:00:00
            #SBATCH --array=1-54
            
            # Look up the accession for this array task (line N of Sra_ids.txt for task N)
            file=$(sed -n -e "${SLURM_ARRAY_TASK_ID}p" /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/Sra_ids.txt)

            module load sra-tools/2.11.0

            echo "Starting fasterq-dump on $file"

            # Work inside the accession's own download directory; stop if it is missing
            cd "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/$file" || exit 1

            fasterq-dump "${file}.sra"

            # Validate the paired FASTQ files with the lab's script
            perl /projects/tomato_genome/scripts/validateHiseqPairs.pl "${file}_1.fastq" "${file}_2.fastq"

            # Copy the FASTQ files up to the run directory
            cp "${file}_1.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_1.fastq"
            cp "${file}_2.fastq" "/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/${file}_2.fastq"

            echo "finished"
            

            Execute:

            chmod u+x fasterdump.slurm
            
            sbatch fasterdump.slurm
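
            The array job reads its accession from Sra_ids.txt, which this ticket does not show. As a sketch, assuming the file holds one accession per line and that the runs form the contiguous range listed in the prefetch script, it could be generated as follows (`make_sra_ids` is a hypothetical helper, not one of the lab's scripts):

```shell
#!/bin/bash
# Sketch: print one SRR accession per line for a contiguous numeric range.
# Arguments: first and last run number (the digits after "SRR").

make_sra_ids() {
    n=$1
    while [ "$n" -le "$2" ]; do
        echo "SRR$n"
        n=$((n + 1))
    done
}
```

            For example, `make_sra_ids 24836276 24836329 > Sra_ids.txt` writes 54 lines, matching `--array=1-54`.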
            
            Mdavis4290 Molly Davis made changes -
            Description SRP441343 Re-run mark 2022 timeseries data with the name SRP441343 from SRA for both SL4 and SL5 genomes.
            ann.loraine Ann Loraine made changes -
            Sprint Fall 6 [ 182 ] Fall 6, Fall 7 [ 182, 183 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis added a comment - - edited

            The Nextflow pipeline ran successfully with both the SL4 and SL5 genomes.
            Directories:

            • /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5
            • /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4

            MultiQC report notes: No errors or warnings were present in the reports. The output files are named 'SRP441343_SL5_multiqc_report.html' & 'SRP441343_SL4_multiqc_report.html'.

            Mdavis4290 Molly Davis added a comment -

            Next steps:
            Commit CSV and multiqc report to Flavonoid repo on bitbucket
            Change sorted bam names
            Create junction files
            Create Coverage graphs

            Mdavis4290 Molly Davis added a comment -

            Launch renameBams.sh script:
            ./renameBams.sh
            Launch Scaled Coverage graphs script:
            ./sbatch-doIt.sh .bam bamCoverage.sh >jobs.out 2>jobs.err
            Launch Junction files script:
            ./sbatch-doIt.sh .bam find_junctions.sh >jobs.out 2>jobs.err
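
            The contents of renameBams.sh are not shown in this ticket. Purely as an illustration of the renaming step, and assuming the pipeline wrote files named like `SRR24836276.markdup.sorted.bam` (an assumption about the output names, not something the ticket confirms), a rename to plain SRR names might look like this. `strip_bam_infix` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: drop an infix such as ".markdup.sorted" from BAM and BAI file names,
# so <name><infix>.bam becomes <name>.bam (and likewise for .bam.bai).
# This is NOT the lab's renameBams.sh; the infix is an assumed value.

strip_bam_infix() {
    local dir=$1 infix=$2 ext f base stem
    for ext in .bam .bam.bai; do
        for f in "$dir"/*"$infix$ext"; do
            [ -e "$f" ] || continue           # glob matched nothing
            base=${f##*/}                     # file name without directory
            stem=${base%"$infix$ext"}         # e.g. the SRR accession
            mv -n -- "$f" "$dir/$stem$ext"    # -n: never overwrite an existing file
        done
    done
}
```

            Running it on a copy of the directory first is a cheap way to confirm the resulting names before touching the real outputs.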

            Mdavis4290 Molly Davis added a comment -

            Directories:

            • /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon/
            • /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL5/results/star_salmon/

            Reviewer:
            Check that files have reasonable sizes (no "zero" size files, for example)
            Check that every "FJ.bed.gz" file has a corresponding "FJ.bed.gz.tbi" index file
            Check that every bam file has a corresponding "FJ.bed.gz" file
            Check that every bam file has a corresponding "scaled.bedgraph.gz" file
            Check that every "scaled.bedgraph.gz" has a corresponding "scaled.bedgraph.gz.tbi"
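
            The reviewer checks above can be sketched as one function. It assumes each `<name>.bam` sits next to `<name>.FJ.bed.gz`, `<name>.FJ.bed.gz.tbi`, `<name>.scaled.bedgraph.gz` and `<name>.scaled.bedgraph.gz.tbi`; that naming pattern is inferred from the checklist and may differ from what the scripts produce. `check_outputs` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: for every BAM file in a directory, report companion files that are
# missing or zero bytes long. File naming is an assumption (see lead-in).

check_outputs() {
    local dir=$1 bam base suffix problems=0
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue             # glob matched nothing
        base=${bam%.bam}
        for suffix in .bam .FJ.bed.gz .FJ.bed.gz.tbi \
                      .scaled.bedgraph.gz .scaled.bedgraph.gz.tbi; do
            # -s fails for files that are missing or empty
            if [ ! -s "$base$suffix" ]; then
                echo "missing or empty: $base$suffix"
                problems=$((problems + 1))
            fi
        done
    done
    # Exit status 0 when everything is present, 1 otherwise
    return $((problems > 0))
}
```

            It starts from the BAM files, so a stray `.bed.gz` without a BAM would not be reported; the file counts per extension still need a separate look.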

            Branch: https://bitbucket.org/mdavis4290/molly-splicing-analysis/branch/IGBF-3500

            • SRP441343_SL4_multiqc_report.html
            • SRP441343_SL5_multiqc_report.html
            • SRP441343.csv
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            Mdavis4290 Molly Davis made changes -
            Assignee Robert Reid [ robertreid ]
            robofjoy Robert Reid added a comment -

            SL4 Folder:

            Overall NFCore structure appears as expected.

            We see the expected number of bam, bai, bedgraph, and bed files: 54 of each.
            /projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bam | wc -l
            54
            rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bai | wc -l
            54
            rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *gz | wc -l
            108
            rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bedgraph.gz | wc -l
            54
            rreid2@str-i1:/projects/tomato_genome/fnb/dataprocessing/SRP441343/nfcore-SL4/results/star_salmon$ ll *bed.gz | wc -l
            54
            There are 54 tbi files as expected. (*scaled.bedgraph.gz.tbi)

            All of the .err files are 0 bytes in size, as expected. All other files are non-empty.

            The scaled.bedgraph.gz.tbi files are all around 60-70 KB in size.
            All of the BAM files are 1.3 GB to 2.7 GB in size, as expected.
            The .bai files look fine.

            Next up SL5.

            Mdavis4290 Molly Davis made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            robofjoy Robert Reid added a comment -

            SL5 folder

            The NFcore file structure is as expected.

            There are 54 files each for bam, bai, bed.gz, and bed.gz.tbi.

            The BAM files are as expected.

            All .err files are 0 bytes.

            All looks good.

            Next up:
            csv files and html in bitbucket.

            robofjoy Robert Reid added a comment -

            In https://bitbucket.org/mdavis4290/molly-splicing-analysis/branch/IGBF-3500

            the HTML files both exist with content.

            The .csv file looks to have the full 55 lines (54 expts plus a header) with the proper fastq in each column.

            I call this done!

            robofjoy Robert Reid made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            robofjoy Robert Reid made changes -
            Assignee Robert Reid [ robertreid ] Molly Davis [ molly ]
            Mdavis4290 Molly Davis added a comment -

            Thank you! Robert Reid
            PR: https://bitbucket.org/hotpollen/splicing-analysis/pull-requests/13
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Sprint Fall 6, Fall 7 [ 182, 183 ] Fall 6, Fall 7, Fall 8 [ 182, 183, 184 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            Mdavis4290 Molly Davis made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Assignee Ann Loraine [ aloraine ]
            ann.loraine Ann Loraine added a comment -

            Testing suggestions:

            • Open the newly added .html files in a Web browser to check that they didn't get corrupted somehow (by mistake, of course)
            • Check that the .html files mention the expected SRA identifiers
            • Check that the SRA identifiers listed in the added csv files match up with the .html files
            • Check that the csv file SRA identifiers are repeated in the expected way in the expected columns (e.g., sample names match up with file names)
            • Make a note of any interesting (or not so interesting!) differences in results obtained for SL4 and SL5, recalling whether SL5 has more or fewer gene models and genes than SL4
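
            The csv checks in this list can be partly automated. The sketch below assumes the samplesheet's first three columns are `sample,fastq_1,fastq_2` (the nf-core/rnaseq layout) and that each FASTQ file name starts with `<sample>_1` or `<sample>_2`; SRP441343.csv is not shown in this ticket, so both points are assumptions. `check_samplesheet` is a hypothetical helper and does not handle quoted fields.

```shell
#!/bin/bash
# Sketch: check that a samplesheet has the expected number of samples and that
# the FASTQ file names in columns 2 and 3 begin with the sample name in column 1.

check_samplesheet() {
    local csv=$1 expected=$2
    awk -F, -v expected="$expected" '
        NR == 1 { next }                      # skip the header row
        {
            n++
            f1 = $2; sub(/.*\//, "", f1)      # strip any directory part
            f2 = $3; sub(/.*\//, "", f2)
            if (index(f1, $1 "_1") != 1 || index(f2, $1 "_2") != 1) {
                print "mismatch on line " NR ": " $0
                bad++
            }
        }
        END {
            if (n + 0 != expected + 0) {
                print "expected " expected " samples, found " n + 0
                bad++
            }
            exit (bad > 0)                    # status 0 only when all checks pass
        }
    ' "$csv"
}
```

            For this ticket the call would be `check_samplesheet SRP441343.csv 54`, which prints nothing and returns 0 when every row is consistent.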
            ann.loraine Ann Loraine made changes -
            Sprint Fall 6, Fall 7, Fall 8 [ 182, 183, 184 ] Fall 6, Fall 7, Spring 2 [ 182, 183, 186 ]
            Mdavis4290 Molly Davis made changes -
            Sprint Fall 6, Fall 7, Spring 2 [ 182, 183, 186 ] Fall 6, Fall 7, Spring 1 [ 182, 183, 185 ]
            Mdavis4290 Molly Davis made changes -
            Assignee Molly Davis [ molly ]
            Mdavis4290 Molly Davis added a comment -

            Testing:

            • html files open and report accurate information
            • SRA identifiers are present
            • csv SRA identifiers match the SRA identifiers in the html files
            • the fastq file SRA identifiers match the sample SRA identifiers in the csv file
            • There seem to be more 'reads mapped' for SL5 than for SL4, but SL4 has a higher '% Proper Pairs' than SL5.

            Next step: prepare the data to be moved from the cluster to the IGB Quickload site. Refer to IGBF-3499.

            Moving ticket to done!

            Mdavis4290 Molly Davis made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            Mdavis4290 Molly Davis made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

              People

              • Assignee: Molly Davis
              • Reporter: Molly Davis
              • Votes: 0
              • Watchers: 3
