[IGBF-3596] Compare original Muday flavonoid data to SRA Re-run data - JIRA UNCC

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
5
Epic Link:
Support NSF pollen grant
Sprint:
Spring 3, Spring 4, Spring 5, Spring 6, Spring 7

Description

Data has been submitted to SRA and now needs to be reviewed. To do this we perform a rerun of the data by pulling it from SRA directly and pushing it through the nextflow pipeline and prepping it for IGB quick load. To confirm that the data on SRA is correct we can make a few comparisons with the original data.

Compare files sizes of the original data to the new rerun file sizes.
Open IGB and compare bam and coverage graphs with the original and rerun data.
Perform these comparisons for SL4 and SL5

Muday cluster Directories:

Original: /projects/tomato_genome/fnb/dataprocessing/muday-144
Re-run: /projects/tomato_genome/fnb/dataprocessing/SRP460750

Attachments

Options
- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

Attachments

7-are-15-min-28C-multiqcreport.png
68 kB
21/Feb/24 2:09 PM
8-VF36-30-min-28C-multiqcreport-A.28.15.8.png
75 kB
21/Feb/24 2:24 PM
megaqc_homepage.png
665 kB
20/Mar/24 4:39 PM
SRR25478244-mutliqcreport.png
70 kB
21/Feb/24 2:24 PM
SRR25478244-Switch-IGB-Comparison.png
690 kB
21/Feb/24 2:49 PM
SRR25478302-multiqcreport.png
71 kB
21/Feb/24 2:09 PM
SRR25478302-Noswitch-Igb-Comparison .png
1.15 MB
21/Feb/24 2:44 PM

Issue Links

relates to

IGBF-3627 Create sample sheet for SRP460750

Closed

IGBF-3639 Compare original mark-2022-timeseries data to SRA Re-run data

Closed

IGBF-3683 Update SRA to use the correct sample codes for Muday lab time course data

Closed

IGBF-3346 Prepping data and entries for SRA for Muday time course

To-Do

IGBF-3290 Use new sample labels recommended by Muday Lab

Closed

Activity

Ascending order - Click to sort in descending order

Hide

Permalink

Molly Davis added a comment - 13/Feb/24 4:37 PM - edited

Download Data from Cluster:

Original Muday Bam files:

scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/muday-144/for_quickload/S_lycopersicum_Jun_2022/muday-2022-timeseries/*.bam.gz /Volumes/Molly-drive/Review-reruns/Original-Muday-Bam-files

Rerun Muday Bam files:

scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/results/star_salmon/*.bam.gz /Volumes/Molly-drive/Review-reruns/Rerun-Muday-Bam-files

Original Muday Scaled Coverage Graphs:

scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/muday-144/for_quickload/S_lycopersicum_Jun_2022/muday-2022-timeseries/*.scaled.bedgraph.gz /Volumes/Molly-drive/Review-reruns/Original-Muday-Scaled-Coverage-Graphs

Rerun Muday Scaled Coverage Graphs:

scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/results/star_salmon/*.scaled.bedgraph.gz /Volumes/Molly-drive/Review-reruns/Rerun-Muday-Scaled-Coverage-Graphs

Notes: Had to compress bam files in each directory before downloading. I also have to use 'Muday-lab-RNA-samples-for-sample-name-conversion' file from flavonoid repo to check the correct names of the samples due to mistakes in the lab which led to sample switching.

Show

Molly Davis added a comment - 13/Feb/24 4:37 PM - edited Download Data from Cluster : Original Muday Bam files: scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/muday-144/for_quickload/S_lycopersicum_Jun_2022/muday-2022-timeseries/*.bam.gz /Volumes/Molly-drive/Review-reruns/Original-Muday-Bam-files Rerun Muday Bam files: scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/results/star_salmon/*.bam.gz /Volumes/Molly-drive/Review-reruns/Rerun-Muday-Bam-files Original Muday Scaled Coverage Graphs: scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/muday-144/for_quickload/S_lycopersicum_Jun_2022/muday-2022-timeseries/*.scaled.bedgraph.gz /Volumes/Molly-drive/Review-reruns/Original-Muday-Scaled-Coverage-Graphs Rerun Muday Scaled Coverage Graphs: scp mdavi258@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5/results/star_salmon/*.scaled.bedgraph.gz /Volumes/Molly-drive/Review-reruns/Rerun-Muday-Scaled-Coverage-Graphs Notes : Had to compress bam files in each directory before downloading. I also have to use 'Muday-lab-RNA-samples-for-sample-name-conversion' file from flavonoid repo to check the correct names of the samples due to mistakes in the lab which led to sample switching.

Hide

Permalink

Molly Davis added a comment - 21/Feb/24 11:10 AM - edited

Next step: Use multiqc reports to sanity check the sample switch. Product of this work would be a visualization/picture that would allow other people to reach a similar conclusion that the SRA submissions were accurate. Also that the sample switching was accurate.

Method 1: Use the multiqc reports from the original data and compare it with the report for the SRA rerun data. Make conclusions whether the data are the same and why with visuals from the reports. Do this with a sample that didn't have sample switching and one that did.
Method 2: Use IGB to show coverage graphs of one chromosome for original and rerun. Do this twice, one without sample switching names and one with sample switching. Screenshot the comparisons.

Show

Molly Davis added a comment - 21/Feb/24 11:10 AM - edited Next step : Use multiqc reports to sanity check the sample switch. Product of this work would be a visualization/picture that would allow other people to reach a similar conclusion that the SRA submissions were accurate. Also that the sample switching was accurate. Method 1: Use the multiqc reports from the original data and compare it with the report for the SRA rerun data. Make conclusions whether the data are the same and why with visuals from the reports. Do this with a sample that didn't have sample switching and one that did. Method 2: Use IGB to show coverage graphs of one chromosome for original and rerun. Do this twice, one without sample switching names and one with sample switching. Screenshot the comparisons.

Hide

Permalink

Molly Davis added a comment - 21/Feb/24 1:23 PM - edited

Summary

Goal: Now that the original flavonoid muday-144 data has been submitted to SRA and rerun with nextflow, we can compare the original data to the SRA data to make sure everything was submitted correctly. To check this we want to compare the original data to the SRA data to ensure consistency. Comparisons will be made with multiqc reports and IGB visualizations of comparing coverage graphs. As we know there was a sample switching issue that was resolved. But during this comparison I will compare a sample that had no sample switching and one that did to double check the names.

Method 1: MultiQc reports SL5 Comparisons:

No sample Switch name = SRR25478302 = A.28.15.7 = 7-are-15-min-28C (Cluster Name)

Sample switch name = SRR25478244 = V.28.30.8 (Original Name) = A.28.15.8 (Actual Name) = 8-VF36-30-min-28C (Cluster Name)

Conclusion of Method 1:

The No sample Switch name is accurate and have the same exact nextflow mapping results. In other words, the original sample and the SRA rerun sample have the exact same results in the multiqc report after being run with nextflow. The sample switch name is accurate with the same exact results and the fix to the naming is correct. In other words, by using the corrected name we can confirm with the report that there was a switching issue and the names are now fixed. The SRA submission in summary is the same as the original data but now with the fixed names.

Method 2: IGB Coverage Graphs SL5 Comparisons

Conclusion of Method 2:

Visually we can see that the data is similar. The coverage graphs are the same, indicating that the original data matches with the SRA submission data. The sample name was also switched to the correct sample name.

Questions and Review:

Do we think these methods achieve our goal of comparing original data to SRA rerun data?
Should we perform these methods for every single sample in the experiment or are these samples enough proof?
Once we can confirm that the new SRA submission is accurate should we move the original data from the cluster to an off sight location like 'Glacier'?
Should we show muday lab these comparisons or go ahead and submit the SRA data to quick load and then show them the new SRA data that is available?

Next step: Create sample sheet for SRA data and submit to quick load.

Show

Molly Davis added a comment - 21/Feb/24 1:23 PM - edited Summary Goal : Now that the original flavonoid muday-144 data has been submitted to SRA and rerun with nextflow, we can compare the original data to the SRA data to make sure everything was submitted correctly. To check this we want to compare the original data to the SRA data to ensure consistency. Comparisons will be made with multiqc reports and IGB visualizations of comparing coverage graphs. As we know there was a sample switching issue that was resolved. But during this comparison I will compare a sample that had no sample switching and one that did to double check the names. Method 1 : MultiQc reports SL5 Comparisons : No sample Switch name = SRR25478302 = A.28.15.7 = 7-are-15-min-28C (Cluster Name) Sample switch name = SRR25478244 = V.28.30.8 (Original Name) = A.28.15.8 (Actual Name) = 8-VF36-30-min-28C (Cluster Name) Conclusion of Method 1: The No sample Switch name is accurate and have the same exact nextflow mapping results. In other words, the original sample and the SRA rerun sample have the exact same results in the multiqc report after being run with nextflow. The sample switch name is accurate with the same exact results and the fix to the naming is correct. In other words, by using the corrected name we can confirm with the report that there was a switching issue and the names are now fixed. The SRA submission in summary is the same as the original data but now with the fixed names. Method 2 : IGB Coverage Graphs SL5 Comparisons No sample Switch name = SRR25478302 = A.28.15.7 = 7-are-15-min-28C (Cluster Name) Sample switch name = SRR25478244 = V.28.30.8 (Original Name) = A.28.15.8 (Actual Name) = 8-VF36-30-min-28C (Cluster Name) Conclusion of Method 2: Visually we can see that the data is similar. The coverage graphs are the same, indicating that the original data matches with the SRA submission data. The sample name was also switched to the correct sample name. Questions and Review : Do we think these methods achieve our goal of comparing original data to SRA rerun data? Should we perform these methods for every single sample in the experiment or are these samples enough proof? Once we can confirm that the new SRA submission is accurate should we move the original data from the cluster to an off sight location like 'Glacier'? Should we show muday lab these comparisons or go ahead and submit the SRA data to quick load and then show them the new SRA data that is available? Next step: Create sample sheet for SRA data and submit to quick load.

Hide

Permalink

Robert Reid added a comment - 26/Feb/24 9:16 AM

This all looks good to me and we should charge forward!

But Let's get Ann's thoughts first!

Show

Robert Reid added a comment - 26/Feb/24 9:16 AM This all looks good to me and we should charge forward! But Let's get Ann's thoughts first!

Hide

Permalink

Ann Loraine added a comment - 13/Mar/24 4:26 PM - edited

I liked the visualization from the multi qc report. The images from nextflow runs of the same sequence data before and after submission to SRA were the same.

Based on this, I think we should now compare all the samples in exactly this way. If there is a problem with any of the samples, we will likely find out immediately.

My recommendation for the next step:

I suggest that Molly Davis create a ticket to use the "compare images from multiqc" on all the SRA files for the Muday flavonoid data set. To make it easier to look at all the data, she could create a Power Point document with the images shown side-by-side - "pre" and "post" SRA-submission.

Once the PowerPoint is made, we could add it to the BitBucket repository as documentation that this QA set was completed.

Once this new ticket is made and added to the board (current sprint is fine) let's move this ticket to DONE.

Thank you Molly Davis and Robert Reid !!!!

Show

Ann Loraine added a comment - 13/Mar/24 4:26 PM - edited I liked the visualization from the multi qc report. The images from nextflow runs of the same sequence data before and after submission to SRA were the same. Based on this, I think we should now compare all the samples in exactly this way. If there is a problem with any of the samples, we will likely find out immediately. My recommendation for the next step: I suggest that Molly Davis create a ticket to use the "compare images from multiqc" on all the SRA files for the Muday flavonoid data set. To make it easier to look at all the data, she could create a Power Point document with the images shown side-by-side - "pre" and "post" SRA-submission. Once the PowerPoint is made, we could add it to the BitBucket repository as documentation that this QA set was completed. Once this new ticket is made and added to the board (current sprint is fine) let's move this ticket to DONE. Thank you Molly Davis and Robert Reid !!!!

Hide

Permalink

Molly Davis added a comment - 20/Mar/24 4:27 PM - edited

As of right now comparing many samples from two mutliQC report seems like it would take a lot of time. I found this tool MegaQC which allows you to look at and compare many multiQC reports at once:

https://megaqc.info/docs/index.html
https://github.com/MultiQC/MegaQC?tab=readme-ov-file

MegaQC is a web application that you can install and run on your own network. It collects and visualises data parsed by MultiQC across multiple runs.

Other options: https://multiqc.info/docs/usage/downstream/

Chronqc:
https://chronqc.readthedocs.io/en/latest/run_chronqc.html#generating-chronqc-plots

Show

Molly Davis added a comment - 20/Mar/24 4:27 PM - edited As of right now comparing many samples from two mutliQC report seems like it would take a lot of time. I found this tool MegaQC which allows you to look at and compare many multiQC reports at once: https://megaqc.info/docs/index.html https://github.com/MultiQC/MegaQC?tab=readme-ov-file MegaQC is a web application that you can install and run on your own network. It collects and visualises data parsed by MultiQC across multiple runs. Other options : https://multiqc.info/docs/usage/downstream/ Chronqc : https://chronqc.readthedocs.io/en/latest/run_chronqc.html#generating-chronqc-plots

Hide

Permalink

Ann Loraine added a comment - 21/Mar/24 10:31 AM - edited

Just a quick comment on above:

The main (only) goal here is: Show that the original data and the SRA-downloaded data are the same.

Obviously, since there are millions (lots and lots!) of sequence reads per sample, we can't compare each one.

However, the "horizontal strip" images you showed in a previous comment (e.g., 7-are-15-min-28C-multiqcreport.png) really seemed to have the perfect amount of detail required to compare samples before and after upload to the SRA.

Indeed, that image looked like a kind of summary profile of the much bigger data set as it behaved when we aligned it to a reference.

Show

Ann Loraine added a comment - 21/Mar/24 10:31 AM - edited Just a quick comment on above: The main (only) goal here is: Show that the original data and the SRA-downloaded data are the same. Obviously, since there are millions (lots and lots!) of sequence reads per sample, we can't compare each one. However, the "horizontal strip" images you showed in a previous comment (e.g., 7-are-15-min-28C-multiqcreport.png) really seemed to have the perfect amount of detail required to compare samples before and after upload to the SRA. Indeed, that image looked like a kind of summary profile of the much bigger data set as it behaved when we aligned it to a reference.

Hide

Permalink

Ann Loraine added a comment - 21/Mar/24 10:44 AM - edited

How about making a google slidedeck:

https://docs.google.com/presentation/d/17NeX_extZ6pTOBlImOs2-Bp5V6wNZP_O5KSCCTu5t5g/edit?usp=sharing

Show

Ann Loraine added a comment - 21/Mar/24 10:44 AM - edited How about making a google slidedeck: https://docs.google.com/presentation/d/17NeX_extZ6pTOBlImOs2-Bp5V6wNZP_O5KSCCTu5t5g/edit?usp=sharing

Hide

Permalink

Molly Davis added a comment - 25/Mar/24 1:43 PM - edited

Check Method:

The method of comparing mutliQC reports has been useful and accurate. It actually found some original file names that were not converted correctly. To check that this method is sound, Dr. Reid and I decided it would be best to compare the original sequence files (FASTQ) on the cluster. These comparisons were made with the following files locations and scripts:

Original fastq file location: /projects/tomato_genome/rnaseq/muday144-timeSeries-checkReadMEFIRST/00_fastq
SRA fastq files location: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5

Sequence information script:

for file in *gz; do /projects/tomato_genome/sw/sequence_info -n $file >> seqsummary.txt; done

The goal of comparing the original sequences is to confirm that the files were not matched up to the corrected sample names for some of the samples.

Show

Molly Davis added a comment - 25/Mar/24 1:43 PM - edited Check Method: The method of comparing mutliQC reports has been useful and accurate. It actually found some original file names that were not converted correctly. To check that this method is sound, Dr. Reid and I decided it would be best to compare the original sequence files (FASTQ) on the cluster. These comparisons were made with the following files locations and scripts: Original fastq file location: /projects/tomato_genome/rnaseq/muday144-timeSeries-checkReadMEFIRST/00_fastq SRA fastq files location: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL5 Sequence information script : for file in *gz; do /projects/tomato_genome/sw/sequence_info -n $file >> seqsummary.txt; done The goal of comparing the original sequences is to confirm that the files were not matched up to the corrected sample names for some of the samples.

Hide

Permalink

Ann Loraine added a comment - 28/Mar/24 9:49 AM - edited

Loraine update:

I wrote a Markdown to compare General Statistics pre- and post-submission.
Preliminary results (assuming no bugs in my code) is that a lot of samples look like they got mixed up somehow

See files:

72_F3H_PollenTube/Documentation/CheckSRA-SL5.xlsx
72_F3H_PollenTube/CheckSRA.Rmd
72_F3H_PollenTube/CheckSRA.html

in https://bitbucket.org/hotpollen/flavonoid-rnaseq

Next steps: Check for bugs in my code.

Show

Ann Loraine added a comment - 28/Mar/24 9:49 AM - edited Loraine update: I wrote a Markdown to compare General Statistics pre- and post-submission. Preliminary results (assuming no bugs in my code) is that a lot of samples look like they got mixed up somehow See files: 72_F3H_PollenTube/Documentation/CheckSRA-SL5.xlsx 72_F3H_PollenTube/CheckSRA.Rmd 72_F3H_PollenTube/CheckSRA.html in https://bitbucket.org/hotpollen/flavonoid-rnaseq Next steps: Check for bugs in my code.

Hide

Permalink

Ann Loraine added a comment - 28/Mar/24 4:46 PM

I have added testing code that double-checks what I've been assuming are the correct sample codes for each original file name using the Muday lab spreadsheet mentioned above.

My code appears correct.

I have pushed a new version to the repository.

Moving this forward to review. Also, we have re-opened and linked this ticket to the original SRA submission ticket to make it easier for Robert Reid to review this work and determine the best way to update the SRA records or re-submit things as needed.

Show

Ann Loraine added a comment - 28/Mar/24 4:46 PM I have added testing code that double-checks what I've been assuming are the correct sample codes for each original file name using the Muday lab spreadsheet mentioned above. My code appears correct. I have pushed a new version to the repository. Moving this forward to review. Also, we have re-opened and linked this ticket to the original SRA submission ticket to make it easier for Robert Reid to review this work and determine the best way to update the SRA records or re-submit things as needed.

Hide

Permalink

Ann Loraine added a comment - 28/Mar/24 5:45 PM

Lastly, I added some new code that suggests new sample codes be assigned to 16 sample records.

The suggested new sample codes are in a new file "updateSRA.csv"

See:

72_F3H_PollenTube/results/updateSRA.csv

in https://bitbucket.org/hotpollen/flavonoid-rnaseq "main" branch.

attn: Robert Reid and Molly Davis

Show

Ann Loraine added a comment - 28/Mar/24 5:45 PM Lastly, I added some new code that suggests new sample codes be assigned to 16 sample records. The suggested new sample codes are in a new file "updateSRA.csv" See: 72_F3H_PollenTube/results/updateSRA.csv in https://bitbucket.org/hotpollen/flavonoid-rnaseq "main" branch. attn: Robert Reid and Molly Davis

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 9:18 AM

Confirming that the Fastq file headers get changed after SRA submission is true.

In this folder I have summaries of the MD5 results.

/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files

I did the MD5 of the zipped fastq files prior to them being loaded to the SRA.
I also did the MD5 of the SRA post downloaded files which can be found here on the HPC:

/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4

Comparing he MD5 codes after sorting, there are no overlaps of ANY of the 144 files.

Show

Robert Reid added a comment - 01/Apr/24 9:18 AM Confirming that the Fastq file headers get changed after SRA submission is true. In this folder I have summaries of the MD5 results. /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files I did the MD5 of the zipped fastq files prior to them being loaded to the SRA. I also did the MD5 of the SRA post downloaded files which can be found here on the HPC: /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4 Comparing he MD5 codes after sorting, there are no overlaps of ANY of the 144 files.

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 9:19 AM

Running the MD5 checks

/projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4
for file in *gz; do echo $file; md5sum $file >> /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files/postdownloaded-md5.txt; done

/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files$
for file in *gz; do echo $file; md5sum $file >> md5.txt; done

Show

Robert Reid added a comment - 01/Apr/24 9:19 AM Running the MD5 checks /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4 for file in *gz; do echo $file; md5sum $file >> /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files/postdownloaded-md5.txt; done /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files$ for file in *gz; do echo $file; md5sum $file >> md5.txt; done

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 9:23 AM - edited

Checking the first 20 sequences of each file

The next step is this:

Zcat each fastq file, and awk to capture every 4th line (the actual sequence line). Head this result to get just the 20 lines of sequence. Write these results to a new file. This will act as a fingerprint (150 characters * 5) (( 20 lines will be 5 sequences total, still plenty for a finger print))
Compare each of the pre and post fingerprints and summarize into table. Compare to Ann's results above, should match up.
And then plan on changing SRA accordingly.

Oh this all fails if the SRA changes the order of the sequences or the composition in any way!! Which would be horrifying.

Show

Robert Reid added a comment - 01/Apr/24 9:23 AM - edited Checking the first 20 sequences of each file The next step is this: Zcat each fastq file, and awk to capture every 4th line (the actual sequence line). Head this result to get just the 20 lines of sequence. Write these results to a new file. This will act as a fingerprint (150 characters * 5) (( 20 lines will be 5 sequences total, still plenty for a finger print)) Compare each of the pre and post fingerprints and summarize into table. Compare to Ann's results above, should match up. And then plan on changing SRA accordingly. Oh this all fails if the SRA changes the order of the sequences or the composition in any way!! Which would be horrifying.

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 9:56 AM - edited

Awk command to catch the 2nd line of the fastq

This will take 20 lines from the zipped fastq, and print out JUST the 2nd line (aka the sequence).

zcat A.28.45.9_R2.fastq.gz | head -n 20 | awk ' NR%4==2

{ print $1 }

With this we make our fingerprints.

Show

Robert Reid added a comment - 01/Apr/24 9:56 AM - edited Awk command to catch the 2nd line of the fastq This will take 20 lines from the zipped fastq, and print out JUST the 2nd line (aka the sequence). zcat A.28.45.9_R2.fastq.gz | head -n 20 | awk ' NR%4==2 { print $1 } ' With this we make our fingerprints.

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 11:01 AM

Folder for COMPARING fastq pre and post:

/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra

Fingerprints for Pre fastq files:

Fingerprints for Post fastq files:
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles

The 1 line command to make fingerprints:
PRE
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles$
while read file; do echo $

{file}; zcat /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files/${file}

.fastq.gz | head -n 20 | awk ' NR%4==2

{print $1}' > ./pre-finalfiles/${file}-5lines.txt ; done < prenames.txt

POST
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles$
while read file; do echo ${file}; zcat /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/${file}.fastq.gz | head -n 20 | awk ' NR%4==2 {print $1}

' > ./post-srafiles/$

{file}

-5lines.txt ; done < postsrafiles.txt

Show

Robert Reid added a comment - 01/Apr/24 11:01 AM Folder for COMPARING fastq pre and post: /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra Fingerprints for Pre fastq files: Fingerprints for Post fastq files: /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles The 1 line command to make fingerprints: PRE /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles$ while read file; do echo $ {file}; zcat /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/final-files/${file} .fastq.gz | head -n 20 | awk ' NR%4==2 {print $1}' > ./pre-finalfiles/${file}-5lines.txt ; done < prenames.txt POST /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/troubleshoot-sra/post-srafiles$ while read file; do echo ${file}; zcat /projects/tomato_genome/fnb/dataprocessing/SRP460750/nfcore-SL4/${file}.fastq.gz | head -n 20 | awk ' NR%4==2 {print $1} ' > ./post-srafiles/$ {file} -5lines.txt ; done < postsrafiles.txt

Hide

Permalink

Robert Reid added a comment - 01/Apr/24 11:02 AM

Fingerprints complete.

Now to compare them!

We need to iterate through each PRE fingerprint file and find its matching partner in POST sra.

Show

Robert Reid added a comment - 01/Apr/24 11:02 AM Fingerprints complete. Now to compare them! We need to iterate through each PRE fingerprint file and find its matching partner in POST sra.

Hide

Permalink

Ann Loraine added a comment - 03/Apr/24 11:55 AM

This is now done.

Summarizing the above:

We think the sequence data itself is fine because the pre- and post-SRA submission nf-core/rnaseq profiles were identical
However, we found that 16 of the SRR entries had the wrong sample codes.
We identified the affected files using two methods - (1) comparing multiqc-generated General Statistics profiles (see Ann's R code) and (2) comparing the first 20 lines from the sequence data files (see Rob's comments above).
In addition, we identified the files that need to have their sample codes re-assigned, and how. A csv file reporting these is in the "results" folder of the git repository. See above comments.

Now that we know which samples were affected and how to re-label them in the SRA, we can close this ticket.

Show

Ann Loraine added a comment - 03/Apr/24 11:55 AM This is now done. Summarizing the above: We think the sequence data itself is fine because the pre- and post-SRA submission nf-core/rnaseq profiles were identical However, we found that 16 of the SRR entries had the wrong sample codes. We identified the affected files using two methods - (1) comparing multiqc-generated General Statistics profiles (see Ann's R code) and (2) comparing the first 20 lines from the sequence data files (see Rob's comments above). In addition, we identified the files that need to have their sample codes re-assigned, and how. A csv file reporting these is in the "results" folder of the git repository. See above comments. Now that we know which samples were affected and how to re-label them in the SRA, we can close this ticket.

People

Assignee:

Robert Reid

Reporter:

Molly Davis

Votes:

0 Vote for this issue

Watchers:

3 Start watching this issue

Dates

Created:

05/Feb/24 2:53 PM

Updated:

03/Apr/24 11:58 AM

Resolved:

03/Apr/24 11:55 AM