  1. IGB
  2. IGBF-3701

Muday time course: Check first sequences in renamed FASTQ files vs. what is in SRA

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

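      To confirm that our renamed sequence files match what the SRA has, I came up with a simple check: take the first 10 bases of the first read of a given SRA run and search for them at the start of the first few reads in every sequence file.

      For speed, head the decompressed stream to get only the first 5 reads (20 lines, at 4 lines per FASTQ record).
      Use zcat to stream the contents so the files on disk stay compressed.
      Grep for the sequence of interest, anchored to the start of the line.

      for f in *gz; do echo "$f"; zcat < "$f" | head -n 20 | grep "^CTGGCTTTTC"; done

      Probability dictates that we should find just 1 result: a 10-base prefix has 4^10 (about a million) possible values, so a chance match among the first 5 reads of any other file is unlikely.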
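
The one-liner above prints every file name, so the hit has to be spotted by eye. As a rough sketch (not the command that was actually run), the same check can be wrapped in a function that prints only the files containing a hit. The name `find_read_prefix` and its arguments are made up for illustration, and restricting the match to sequence lines with awk is an added assumption about what is wanted.

```shell
# find_read_prefix PREFIX N_READS FILE...   (hypothetical helper)
# Print the name of every gzipped FASTQ file in which one of the first
# N_READS reads starts with PREFIX.
find_read_prefix() {
    local prefix="$1" n_reads="$2" f
    shift 2
    for f in "$@"; do
        # A FASTQ record is 4 lines long and the sequence is line 2 of each
        # record, so keep N_READS * 4 lines and test only the sequence lines.
        # gzip's stderr is silenced because head closes the pipe early.
        if gzip -dc < "$f" 2>/dev/null | head -n $(( n_reads * 4 )) \
            | awk 'NR % 4 == 2' | grep -q "^${prefix}"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `find_read_prefix CTGGCTTTTC 5 *gz` should list only the files whose first 5 reads include a read starting with that prefix.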

            Activity

            robofjoy Robert Reid added a comment -

            When successful, the results will look something like this:

            .....
            V.28.45.9_R2.fastq.gz
            V.28.75.7_R1.fastq.gz
            V.28.75.7_R2.fastq.gz
            V.28.75.8_R1.fastq.gz
            V.28.75.8_R2.fastq.gz
            V.28.75.9_R1.fastq.gz
            V.28.75.9_R2.fastq.gz
            V.34.15.7_R1.fastq.gz
            V.34.15.7_R2.fastq.gz
            V.34.15.8_R1.fastq.gz
            V.34.15.8_R2.fastq.gz
            V.34.15.9_R1.fastq.gz
            V.34.15.9_R2.fastq.gz
            V.34.30.7_R1.fastq.gz
            V.34.30.7_R2.fastq.gz
            V.34.30.8_R1.fastq.gz
            V.34.30.8_R2.fastq.gz
            V.34.30.9_R1.fastq.gz
            CTGGCTTTTCAGATTTCTCATCCCTGTATGCTTTTCTTCGAGGTGGAGACACCTTCGGCACCTTGTCCACTACATCAGCTGAACTTTGCAAATTGGTTGTCGAGTACAGTTTCTGACCAGCTGGAATGCTGTACGCATTCTTCACCTCAA
            V.34.30.9_R2.fastq.gz
            V.34.45.7_R1.fastq.gz
            V.34.45.7_R2.fastq.gz
            V.34.45.8_R1.fastq.gz
            V.34.45.8_R2.fastq.gz
            V.34.45.9_R1.fastq.gz
            V.34.45.9_R2.fastq.gz
            V.34.75.7_R1.fastq.gz
            V.34.75.7_R2.fastq.gz
            V.34.75.8_R1.fastq.gz
            V.34.75.8_R2.fastq.gz
            V.34.75.9_R2.fastq.gz

            We see just 1 result (the read printed under V.34.30.9_R1.fastq.gz), and it matches the SRA title and ID for this sample. (This is also one of the 16 samples that got switched!)
            Will spot-check a few more.

            robofjoy Robert Reid added a comment -

            Now one can go to the SRA Run Selector:

            https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=2&WebEnv=MCID_66269cb9913cfa406c7fb09c&o=acc_s%3Aa

            Navigate to a sample and click the Reads tab to see the first read.
            Copy a portion of that read and grep all the files for it, as described above.

            TTCATGAAGAGATCTCTTC

            for f in *gz; do echo "$f"; zcat < "$f" | head | grep "^TTCATGAAGAGATCTCTTC"; done

            A.28.45.7_R1.fastq.gz
            A.28.45.7_R2.fastq.gz
            A.28.45.8_R1.fastq.gz
            TTCATGAAGAGATCTCTTCTTTGTTGTGTCACCATTACCTCAATATTTGTTCCCTCCTTAGATTTTTGTTGAGAAGGAACAAATTCAACAATATAATCTGATACAGATAAAAGCCAGTTGATTTCTTTTTTCCATCTTGATTTTCTTTCT
            A.28.45.8_R2.fastq.gz
            A.28.45.9_R1.fastq.gz
            A.28.45.9_R2.fastq.gz

            (Results above are truncated to the part that matters.)

            It matches:
            RNA-Seq of Solanum lycopersicum:anthocyanin reduced (are)pollen tube at 28C, 45 minutes, replicate 8 (SRR25478240)

            Success.
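
The copy-and-paste step above could also be scripted. The sketch below only covers pulling a prefix out of FASTQ text on standard input; the function name `first_read_prefix` and its default length of 20 bases are assumptions made for illustration.

```shell
# first_read_prefix [N]   (hypothetical helper)
# Read FASTQ text on stdin and print the first N bases (default 20)
# of the first read.
first_read_prefix() {
    local n="${1:-20}"
    # Line 2 of a FASTQ stream is the sequence of the first record;
    # print its first n characters and stop reading.
    awk -v n="$n" 'NR == 2 { print substr($0, 1, n); exit }'
}
```

Assuming sra-tools is installed and the machine has network access (neither is stated in this ticket), something like `fastq-dump -X 1 -Z SRR25478240 | first_read_prefix 19` should print the start of the first spot, which can then be used as the grep pattern.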

            robofjoy Robert Reid added a comment -

            Testing one more:
            RNA-Seq of Solanum lycopersicum:VF-36 pollen tube at 34C, 30 minutes, replicate 9 (SRR25478288)
            First sequence is:
            CTGGCTTTTCAGATTTCTCA

            for f in *gz; do echo "$f"; zcat < "$f" | head | grep "^CTGGCTTTTCAGATTTCTC"; done

            V.34.30.7_R2.fastq.gz
            V.34.30.8_R1.fastq.gz
            V.34.30.8_R2.fastq.gz
            V.34.30.9_R1.fastq.gz
            CTGGCTTTTCAGATTTCTCATCCCTGTATGCTTTTCTTCGAGGTGGAGACACCTTCGGCACCTTGTCCACTACATCAGCTGAACTTTGCAAATTGGTTGTCGAGTACAGTTTCTGACCAGCTGGAATGCTGTACGCATTCTTCACCTCAA
            V.34.30.9_R2.fastq.gz
            V.34.45.7_R1.fastq.gz
            V.34.45.7_R2.fastq.gz

            Success!
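
If more samples need spot-checking, the checks above could be driven from a small table instead of being run one at a time. The following is a sketch under assumptions: the manifest file (tab-separated: SRA run, read prefix, expected file name) and the function name `check_manifest` are hypothetical, and only the first 5 reads of each file are examined.

```shell
# check_manifest MANIFEST FILE...   (hypothetical helper)
# MANIFEST is a tab-separated file: SRA run, read prefix, expected file name.
# Prints one OK/FAIL line per run; returns non-zero if any run fails.
check_manifest() {
    local manifest="$1" tab run prefix expected f n_hits found status
    shift
    tab=$(printf '\t')
    status=0
    while IFS="$tab" read -r run prefix expected; do
        n_hits=0
        found=none
        for f in "$@"; do
            # Look at the sequence lines (line 2 of each 4-line record)
            # of the first 5 reads only.
            if gzip -dc < "$f" 2>/dev/null | head -n 20 \
                | awk 'NR % 4 == 2' | grep -q "^${prefix}"; then
                n_hits=$(( n_hits + 1 ))
                found=${f##*/}
            fi
        done
        # Pass only when exactly one file matches and it is the expected one.
        if [ "$n_hits" -eq 1 ] && [ "$found" = "$expected" ]; then
            printf 'OK\t%s\t%s\n' "$run" "$expected"
        else
            printf 'FAIL\t%s\texpected %s, hits=%s, last hit: %s\n' \
                "$run" "$expected" "$n_hits" "$found"
            status=1
        fi
    done < "$manifest"
    return "$status"
}
```

With the two runs tested in this ticket, the manifest would hold the rows `SRR25478240`, `TTCATGAAGAGATCTCTTC`, `A.28.45.8_R1.fastq.gz` and `SRR25478288`, `CTGGCTTTTCAGATTTCTC`, `V.34.30.9_R1.fastq.gz`, and the call would be `check_manifest manifest.tsv *gz`.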

            robofjoy Robert Reid added a comment -

            Nothing more to this. Moving to done and adding it to the archive.


              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid