  1. IGB
  2. IGBF-3708

Download tardigrade RNA-Seq data onto cluster

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      Download the data sets identified in the linked ticket into the "data processing" directory on the cluster.

      Instructions:

      1) For each "SRP" accession (the identifier that uniquely identifies a data set), create a folder with the same name as the dataset. For example, if the dataset accession is SRP123456, make a folder named SRP123456. Make sure the folder is group-writeable so that other people in the group can add files to it. (By default, it will likely already be group-executable and group-readable, but please check it just in case.)

      2) Use the "run selector" tool to make a metadata file listing all the "runs" (SRR IDs) plus library names, sample names, etc. for each "SRP" accession. Add it to the data set's folder.

      3) Download fastq files and save them to a subfolder named "fastq" within each "SRP" folder.

      4) Check that the data downloaded correctly. Talk with Molly and Rob Reid about how best to do that. Also, do some research (Google, Biostars, etc.) for fast and easy ways to check that the download process did not fail. Another possibility: run FastQC on the data and compare the output to the metadata file from the run selector tool? If you notice any problems, fix them.

      5) Compress all the fastq files using gzip.
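Instructions 1 and 3 could be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `make_dataset_dirs` and `is_group_writeable` are hypothetical, and the base directory is whatever path the "data processing" directory has on the cluster.

```python
import os
import stat
from pathlib import Path

# Group read, write, and execute (traverse) bits for directories.
GROUP_RWX = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP


def make_dataset_dirs(base_dir, accession):
    """Create <base_dir>/<accession>/fastq and open both folders to the group."""
    dataset_dir = Path(base_dir) / accession
    fastq_dir = dataset_dir / "fastq"
    fastq_dir.mkdir(parents=True, exist_ok=True)
    for folder in (dataset_dir, fastq_dir):
        current = stat.S_IMODE(folder.stat().st_mode)
        # chmod is not affected by the umask, unlike the mode passed to mkdir.
        os.chmod(folder, current | GROUP_RWX)
    return dataset_dir


def is_group_writeable(path):
    """Return True if the group write bit is set on path."""
    return bool(Path(path).stat().st_mode & stat.S_IWGRP)
```

For example, `make_dataset_dirs(base_dir, "SRP123456")` would create the folder named in instruction 1 together with its "fastq" subfolder, and `is_group_writeable` can be used afterwards to double-check the result.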

        Attachments

          Issue Links

            Activity

            ann.loraine Ann Loraine created issue -
            ann.loraine Ann Loraine made changes -
            Field Original Value New Value
            Epic Link IGBF-1395 [ 17470 ]
            ann.loraine Ann Loraine made changes -
            Link This issue relates to IGBF-3685 [ IGBF-3685 ]
            pkulzer Paige Kulzer made changes -
            Sprint Spring 8 [ 192 ] Spring 9 [ 193 ]
            pkulzer Paige Kulzer made changes -
            Assignee Paige Kulzer [ pkulzer ]
            pkulzer Paige Kulzer made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            pkulzer Paige Kulzer added a comment - - edited

            The "data processing" directory on the cluster can be found here: /projects/tomato_genome/fnb/dataprocessing
            Most scripts can be found here: /projects/tomato_genome/fnb/scripts/flavonoid-rnaseq/src

            Here is the entire workflow for processing the tardigrade data:

            1. Run prefetch.sh to get .sra files for the list of SRR accessions.
            2. Check that the .sra files got downloaded correctly.
            3. Run fasterq-dump to make fastq files from the .sra files.
            4. Compress the fastq files with gzip using gzip.sh, submitting the job to the Slurm queue with sbatch-doIt.sh.
            5. Run the latest version of the nf-core/rnaseq pipeline. For this, you have to make a "parameter" file and provide gene annotations (.gtf file), genome annotations (.bed file), and genome assembly (.fasta file).
            6. Check that the pipeline ran correctly.
            7. Rename the BAM files that the pipeline made using rename.sh
            8. Make a folder with links to the bam files and their indexes. Make coverage graphs with bamCoverage.sh by submitting to the slurm cluster using sbatch-doIt.sh.
            9. Make junction files with find_junctions.sh (located here: /projects/tomato_genome/fnb/alt_splicing/SRP371294/results/star_salmon) by submitting to the slurm cluster using sbatch-doIt.sh.
            10. Check that steps 8 and 9 ran correctly.
            11. Make a folder called "for_quickload" as a direct child folder in each "SRP" folder. Make a directory named for the genome assembly, same name as used in IGB. Inside there, make another folder named for the SRP accession. We will use this file subtree to copy the data files to a new host using rsync.
            12. Move (don't copy) bam, bam.bai, coverage graph files, and junction files into the above child folder.
            13. Make sure that all the files everywhere are group-readable and group-writeable. Make sure that all the folders (directories) are group-readable, group-executable, and group-writeable.

            Based on the mastersheet in the linked ticket, there should be 10 "SRP" folders at the end of this process (11 if we're including the experiment with only an ERR accession).
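Steps 11 and 12 could be sketched in Python roughly as follows. The function name `stage_for_quickload` is hypothetical, and the coverage-graph and junction file suffixes in `SUFFIXES` are assumptions; they should be replaced with whatever bamCoverage.sh and find_junctions.sh actually produce.

```python
import shutil
from pathlib import Path

# Suffixes of the files to move. Only .bam and .bam.bai come from the ticket;
# the coverage-graph and junction suffixes are assumptions for illustration.
SUFFIXES = (".bam", ".bam.bai", ".bedgraph.gz", ".bed.gz")


def stage_for_quickload(srp_dir, genome, results_dir, suffixes=SUFFIXES):
    """Build <srp_dir>/for_quickload/<genome>/<SRP> and move result files into it."""
    srp_dir = Path(srp_dir)
    target = srp_dir / "for_quickload" / genome / srp_dir.name
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(results_dir).iterdir()):
        if path.is_file() and path.name.endswith(suffixes):
            # Move rather than copy, so only one copy of each file exists.
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return target, moved
```

The `genome` argument should be the assembly name exactly as it is used in IGB, so that the resulting subtree can be copied to the new host with rsync without renaming anything.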

            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3721 [ IGBF-3721 ]
            pkulzer Paige Kulzer made changes -
            Status In Progress [ 3 ] To-Do [ 10305 ]
            pkulzer Paige Kulzer added a comment - - edited

            I'm running this pipeline with the smallest dataset from the mastersheet (ERP134165, six samples) so that I can optimize it before running all of the tardigrade data through it.

            pkulzer Paige Kulzer added a comment -
            1. The prefetch script ran without issue. The .sra files can be found in their respective SRR-numbered folders in /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165/
            2. I've confirmed that the .sra files downloaded correctly by 1) ensuring that vdb-validate finished running on each .sra file without a non-zero exit code (i.e., checking the .out files), 2) ensuring that there are no error messages in the .err files, and 3) ensuring that an .sra file of non-zero size is present in each SRR-numbered folder.
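Checks 2) and 3) above could be automated with a short Python sketch like the one below. It is not the script used for this ticket: the function names are hypothetical, and it assumes each run folder holds a file named `<run>.sra` and that the Slurm logs end in `.err`.

```python
from pathlib import Path


def check_sra_downloads(dataset_dir, run_ids):
    """Return {run_id: problem} for runs whose .sra file is missing or empty."""
    problems = {}
    for run_id in run_ids:
        # Assumed layout: <dataset_dir>/<run_id>/<run_id>.sra
        sra = Path(dataset_dir) / run_id / f"{run_id}.sra"
        if not sra.is_file():
            problems[run_id] = "missing .sra file"
        elif sra.stat().st_size == 0:
            problems[run_id] = "empty .sra file"
    return problems


def nonempty_logs(log_dir, pattern="*.err"):
    """List the names of log files matching pattern that contain any text."""
    return sorted(
        p.name for p in Path(log_dir).glob(pattern) if p.stat().st_size > 0
    )
```

An empty dictionary from `check_sra_downloads` and an empty list from `nonempty_logs` would correspond to the outcome described in this comment. The vdb-validate results still need to be read from the .out files separately.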
            pkulzer Paige Kulzer added a comment - - edited

            3. The fasterq-dump script did not run successfully. In each of the .out files produced from this script, there was an error message that the projects folder had hit its capacity for disk space.

            This was happening because the script was copying fastq files from one location to another, thereby creating two copies of each fastq file every time it ran (and these files are not small!).

            To fix this issue, the script has been modified such that fastq files are only ever moved, not copied, and are gzipped to further reduce the amount of space they take up on the cluster. This makes step 4 of the pipeline above redundant, so it should be removed from future versions.
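The idea behind the fix (compress straight into the destination, then delete the uncompressed source) could look like this in Python. This is a sketch of the approach, not the modified script itself, and `gzip_and_move` is a hypothetical name.

```python
import gzip
import shutil
from pathlib import Path


def gzip_and_move(fastq_path, dest_dir):
    """Write a gzipped copy of fastq_path into dest_dir, then delete the original."""
    src = Path(fastq_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.name + ".gz")
    # Stream in chunks so the whole file is never held in memory.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Remove the uncompressed source only after compression has finished.
    src.unlink()
    return dest
```

With this approach, the only time two representations of a file exist is while it is being compressed, and the second one is the smaller gzipped version.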

            ann.loraine Ann Loraine made changes -
            Sprint Spring 9 [ 193 ] Spring 9, Spring 10 [ 193, 194 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            pkulzer Paige Kulzer made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            pkulzer Paige Kulzer added a comment - - edited

            The fastq files output from the fasterq-dump script are located here: /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165/fastq/
            The FastQC reports used to confirm that the fastq files downloaded properly are located here: /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165/fastq/fastqc/

            For review:

            • Navigate to: /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165/
            • Have I set up the "SRP folder" and all of its subfolders with the correct permissions thus far?
            • Are the names of these subfolders logical such that they clue you in to what they contain?
            • Are there folders for each SRR number/sample?
            • Is there an .sra file in each of those folders that corresponds to that sample?
            • Navigate to: /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165/fastq/
            • Is there a .fastq file downloaded for each sample?

            Please see the linked ticket for documentation of the steps I took to produce this data.
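A reviewer could answer the last question in the list with a small Python sketch like this one. The function names are hypothetical, and it assumes fastq files are named `<run>.fastq` or `<run>_1.fastq` / `<run>_2.fastq`, optionally with a `.gz` suffix.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in a plain or gzipped fastq file (four lines per record)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = sum(1 for _ in handle)
    if lines % 4 != 0:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


def runs_without_fastq(fastq_dir, run_ids):
    """Return the run IDs that have no fastq file in fastq_dir."""
    names = [p.name for p in Path(fastq_dir).iterdir() if p.is_file()]
    missing = []
    for run_id in run_ids:
        # Match "<run>." or "<run>_" so that one ID cannot match a longer one.
        prefixes = (run_id + ".", run_id + "_")
        if not any(n.startswith(prefixes) and ".fastq" in n for n in names):
            missing.append(run_id)
    return missing
```

Reading each file to the end should also surface a truncated gzip stream as an exception, and the record counts could then be compared against the run selector metadata if that file includes read counts.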

            pkulzer Paige Kulzer made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            pkulzer Paige Kulzer made changes -
            Assignee Paige Kulzer [ pkulzer ] Ann Loraine [ aloraine ]
            pkulzer Paige Kulzer made changes -
            Link This issue relates to IGBF-3735 [ IGBF-3735 ]
            pkulzer Paige Kulzer made changes -
            Summary Download tardigrade RNA-Seq onto cluster Download tardigrade RNA-Seq data onto cluster
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9, Spring 10 [ 193, 194 ] Spring 9, Spring 10, Summer 1 [ 193, 194, 195 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9, Spring 10, Summer 1 [ 193, 194, 195 ] Spring 9, Spring 10, Summer 1, Summer 2 [ 193, 194, 195, 196 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] To-Do [ 10305 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9, Spring 10, Summer 1, Summer 2 [ 193, 194, 195, 196 ] Spring 9, Spring 10, Summer 1, Summer 2, Summer 3 [ 193, 194, 195, 196, 197 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine made changes -
            Epic Link IGBF-1395 [ 17470 ] IGBF-3778 [ 22997 ]
            ann.loraine Ann Loraine made changes -
            Sprint Spring 9, Spring 10, Summer 1, Summer 2, Summer 3 [ 193, 194, 195, 196, 197 ] Spring 9, Spring 10, Summer 1, Summer 2, Summer 3, Summer 4 [ 193, 194, 195, 196, 197, 198 ]
            ann.loraine Ann Loraine made changes -
            Rank Ranked higher
            ann.loraine Ann Loraine added a comment -

            Everything looks OK, except that the symbolic links in /projects/tomato_genome/fnb/dataprocessing/tardigrade/ERP134165 are flashing red. I don't think this is a problem, however.
            Moving to Done.
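Symbolic links shown in flashing red by `ls` often indicate links whose targets no longer exist, although this depends on the terminal's color settings. A minimal Python sketch for listing such links is below; the function name `find_broken_symlinks` is hypothetical.

```python
import os
from pathlib import Path


def find_broken_symlinks(root):
    """Return all symbolic links under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so it is False for a dangling link.
            if path.is_symlink() and not path.exists():
                broken.append(path)
    return sorted(broken)
```

Running it on the dataset folder would show whether any links point to files that were moved or deleted after the links were made.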

            ann.loraine Ann Loraine made changes -
            Status To-Do [ 10305 ] In Progress [ 3 ]
            ann.loraine Ann Loraine made changes -
            Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
            ann.loraine Ann Loraine made changes -
            Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
            ann.loraine Ann Loraine made changes -
            Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
            ann.loraine Ann Loraine made changes -
            Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
            ann.loraine Ann Loraine made changes -
            Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
            ann.loraine Ann Loraine made changes -
            Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
            ann.loraine Ann Loraine made changes -
            Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
            ann.loraine Ann Loraine made changes -
            Resolution Done [ 10000 ]
            Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]

              People

              • Assignee: Ann Loraine
              • Reporter: Ann Loraine