Details
- Type: Task
- Status: To-Do
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 2
- Epic Link:
- Sprint: Summer 2 2023 May 29, Summer 3 2023 June 12, Summer 6 2023 July 24, Summer 7 2023 Aug 7
Description
Starting the data prepping and table prepping for the Muday time course data.
EXTRA step will be needed due to the labelling errors!!
A script to rename the files will be needed and this script will need review to ensure we get that step right!
Otherwise, the SRA submission should be straightforward, similar to IGBF-3255.
Activity
The original untouched data from Azenta is located here:
/projects/tomato_genome/rnaseq/muday144-timeSeries-checkReadMEFIRST/00_fastq
I plan to make a working copy of this folder for the purposes of renaming some files.
COPY TIME:
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences$ rsync -aP /projects/tomato_genome/rnaseq/muday144-timeSeries-checkReadMEFIRST/* ./
Next step: CHECK MD5 SUMS to ensure the copy was correct.
COPY ERROR.
1 file didn't make it.
So copied that file again.
Rerunning the MD5 checksums again:
WHERE: /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/00_fastq
`while read file; do echo $file; md5sum ${file} >> md5-version2.txt; done < allfiles.txt`
`diff md5.orig md5-version2.txt`
It is now the same. Copy is Successful.
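The check above can also be done file by file, so that a missing or corrupted file is named directly instead of showing up as a diff. This is only a sketch: the function name `verify_copy` and its three arguments are hypothetical, and it assumes `md5sum` is available and that the manifest lists one file name per line, like allfiles.txt.

```shell
#!/usr/bin/env bash
# Sketch: compare MD5 sums of the original files and the working copy.
# verify_copy ORIG_DIR COPY_DIR MANIFEST   (hypothetical helper)
verify_copy() {
    local orig_dir="$1" copy_dir="$2" manifest="$3"
    local file a b status=0
    while IFS= read -r file; do
        [ -n "$file" ] || continue            # skip blank manifest lines
        if [ ! -f "$copy_dir/$file" ]; then
            echo "MISSING  $file"
            status=1
            continue
        fi
        # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
        a=$(md5sum < "$orig_dir/$file" | cut -d' ' -f1)
        b=$(md5sum < "$copy_dir/$file" | cut -d' ' -f1)
        if [ "$a" != "$b" ]; then
            echo "MISMATCH $file"
            status=1
        fi
    done < "$manifest"
    return "$status"                          # 0 only if every file matched
}
```

Called as `verify_copy <original 00_fastq> <copied 00_fastq> allfiles.txt`, it prints nothing and returns 0 when the copy is complete.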
Playing in the Sandbox
To test the renaming scheme to reflect the errors made by Azenta, I am making a sandbox area. Rather than full sequences, I head each sequence file to grab the first 20 lines in the file and store that in the sandbox.
The sandbox is here:
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/testingRenameZone
To grab the 20 lines and make new files in the above location:
`while read file; do echo $file; zcat ../00_fastq/${file}.gz | head -n 20 > ${file}; gzip ${file}; done < files.txt`
(we are unzipping via zcat, grabbing the 20 lines and gzipping again). Success.
Check that each file is 20 lines long.
`for file in *gz; do zcat $file | wc -l; done`
Toy data ready to be played with!
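For anyone rebuilding the sandbox, the truncation step can be wrapped in a small function. This is a sketch under assumptions, not the command that was actually run: `make_toy_fastq` and its arguments are hypothetical names, and it loops over `*.fastq.gz` in a source directory rather than reading files.txt.

```shell
#!/usr/bin/env bash
# Sketch: write a truncated copy of every gzipped FASTQ file.
# make_toy_fastq SRC_DIR DEST_DIR [N_LINES]   (hypothetical helper)
make_toy_fastq() {
    local src_dir="$1" dest_dir="$2" n_lines="${3:-20}"
    local src name
    mkdir -p "$dest_dir"
    for src in "$src_dir"/*.fastq.gz; do
        [ -e "$src" ] || continue     # nothing matched the glob
        name=$(basename "$src")
        # Decompress, keep the first N lines, recompress in one pass.
        # 20 lines is 5 complete FASTQ records (4 lines per record).
        gzip -dc "$src" | head -n "$n_lines" | gzip -c > "$dest_dir/$name"
    done
}
```

Keeping the line count a multiple of 4 means the toy files remain valid FASTQ, which helps if they are later fed to other tools.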
Which files need renaming??
We know samples were mislabeled at Azenta. As a reminder, the details are in this ticket: https://jira.bioviz.org/browse/IGBF-3290
From there:
A.34.15.8 is actually F.34.15.8 and vice versa
A.28.30.8 is actually V.28.30.8 and vice versa
A.34.45.8 is actually F.34.45.8 and vice versa
A.28.75.8 is actually V.28.75.8 and vice versa
The Excel sheet that conveys the changes is this one: https://jira.bioviz.org/secure/attachment/17863/Muday%20lab%20RNA%20samples%20for%20sample%20name%20conversion.xls
(or see ticket IGBF-3290 above)
So we need to swap using a temp file!
And at the same time let's rename the files too to the 4 code format.
Example:
A.28.15.9 is genotype ARE, 28 degrees C temperature, 15 minutes time point, and replicate 9.
So 9-VF36-75-min-28C_R1_001.fastq.gz would become V.28.75.9_R1.fastq.gz. (NEED TO MAINTAIN THE R1 and R2 pairs!!)
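A minimal sketch of that name conversion follows. It is not the renameANDrelabel.bash script. It assumes every Azenta file name follows the pattern of the one example above (replicate-genotype-minutes-min-temperatureC_R1_001.fastq.gz), that the one-letter code is the first letter of the genotype field (VF36 gives V, ARE gives A), and that the fields are written in the genotype.temperature.time.replicate order used by the sample names listed in this ticket. The function name `to_four_code` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: convert an Azenta file name to the 4 code format.
# to_four_code FILE_NAME   (hypothetical helper)
to_four_code() {
    local name="$1"
    # Groups: 1 replicate, 2 first letter of genotype, 3 minutes,
    #         4 temperature, 5 read mate (R1 or R2).
    local pattern='^([0-9]+)-([A-Za-z])[A-Za-z0-9]*-([0-9]+)-min-([0-9]+)C_(R[12])_001\.fastq\.gz$'
    if printf '%s\n' "$name" | grep -Eq "$pattern"; then
        # Reorder to genotype.temperature.time.replicate and keep the mate.
        printf '%s\n' "$name" | sed -E "s/$pattern/\2.\4.\3.\1_\5.fastq.gz/"
    else
        echo "unrecognized file name: $name" >&2
        return 1
    fi
}
```

Because the read mate is captured and written back unchanged, R1 and R2 of a pair always end up under the same sample name.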
"F.28.15.7" "F.34.15.7" "F.28.30.7" "F.34.30.7"
[6] "F.28.45.7" "F.34.45.7" "F.28.75.7" "F.34.75.7" "V.28.15.7"
[11] "V.34.15.7" "V.28.30.7" "V.34.30.7" "V.28.45.7" "V.34.45.7"
[16] "V.28.75.7" "V.34.75.7" "A.28.15.7" "A.34.15.7" "A.28.30.7"
[21] "A.34.30.7" "A.28.45.7" "A.34.45.7" "A.28.75.7" "A.34.75.7"
[26] "F.28.15.8" "F.34.15.8" "F.28.30.8" "F.34.30.8" "F.28.45.8"
[31] "F.34.45.8" "F.28.75.8" "F.34.75.8" "V.28.15.8" "V.34.15.8"
[36] "V.28.30.8" "V.34.30.8" "V.28.45.8" "V.34.45.8" "V.28.75.8"
[41] "V.34.75.8" "A.28.15.8" "A.34.15.8" "A.28.30.8" "A.34.30.8"
[46] "A.28.45.8" "A.34.45.8" "A.28.75.8" "A.34.75.8" "F.28.15.9"
[51] "F.34.15.9" "F.28.30.9" "F.34.30.9" "F.28.45.9" "F.34.45.9"
[56] "F.28.75.9" "F.34.75.9" "V.28.15.9" "V.34.15.9" "V.28.30.9"
[61] "V.34.30.9" "V.28.45.9" "V.34.45.9" "V.28.75.9" "V.34.75.9"
[66] "A.28.15.9" "A.34.15.9" "A.28.30.9" "A.34.30.9" "A.28.45.9"
[71] "A.34.45.9" "A.28.75.9" "A.34.75.9"
Just a quick comment in case this is useful for [~RobertReid]: Muday lab sent us a spreadsheet that documents which samples need to be renamed. There were a lot more than we initially thought.
The spreadsheet is in this folder on Bitbucket: https://bitbucket.org/hotpollen/flavonoid-rnaseq/src/main/72_F3H_PollenTube/Documentation/
Ha, yes! I found the sheet and was surprised that 2/3 of the samples were wrong!
I assumed that the red cells were wrong and did a MATCH check to confirm.
A useful table generated by Ann during the re-analysis is this:
original new changed
F.28.15.7 F.28.15.7 FALSE
F.34.15.7 F.34.15.7 FALSE
F.28.30.7 F.28.30.7 FALSE
F.34.30.7 F.34.30.7 FALSE
F.28.45.7 F.28.45.7 FALSE
F.34.45.7 F.34.45.7 FALSE
F.28.75.7 F.28.75.7 FALSE
F.34.75.7 F.34.75.7 FALSE
V.28.15.7 V.28.15.7 FALSE
V.34.15.7 V.34.15.7 FALSE
V.28.30.7 V.28.30.7 FALSE
V.34.30.7 V.34.30.7 FALSE
V.28.45.7 V.28.45.7 FALSE
V.34.45.7 V.34.45.7 FALSE
V.28.75.7 V.28.75.7 FALSE
V.34.75.7 V.34.75.7 FALSE
A.28.15.7 A.28.15.7 FALSE
A.34.15.7 A.34.15.7 FALSE
A.28.30.7 A.28.30.7 FALSE
A.34.30.7 A.34.30.7 FALSE
A.28.45.7 A.28.45.7 FALSE
A.34.45.7 A.34.45.7 FALSE
A.28.75.7 A.28.75.7 FALSE
A.34.75.7 A.34.75.7 FALSE
F.28.15.8 V.34.15.8 TRUE
F.34.15.8 A.34.30.8 TRUE
F.28.30.8 F.34.30.8 TRUE
F.34.30.8 V.34.30.8 TRUE
F.28.45.8 F.28.75.8 TRUE
F.34.45.8 A.28.45.8 TRUE
F.28.75.8 V.28.45.8 TRUE
F.34.75.8 F.28.45.8 TRUE
V.28.15.8 F.28.15.8 TRUE
V.34.15.8 V.28.15.8 TRUE
V.28.30.8 A.28.15.8 TRUE
V.34.30.8 F.28.30.8 TRUE
V.28.45.8 V.34.75.8 TRUE
V.34.45.8 F.34.75.8 TRUE
V.28.75.8 A.34.75.8 TRUE
V.34.75.8 V.34.45.8 TRUE
A.28.15.8 A.28.30.8 TRUE
A.34.15.8 F.34.15.8 TRUE
A.28.30.8 V.28.30.8 TRUE
A.34.30.8 A.34.15.8 TRUE
A.28.45.8 A.34.45.8 TRUE
A.34.45.8 V.28.75.8 TRUE
A.28.75.8 F.34.45.8 TRUE
A.34.75.8 A.28.75.8 TRUE
F.28.15.9 V.34.30.9 TRUE
F.34.15.9 V.28.30.9 TRUE
F.28.30.9 V.34.15.9 TRUE
F.34.30.9 V.28.15.9 TRUE
F.28.45.9 V.34.75.9 TRUE
F.34.45.9 V.28.75.9 TRUE
F.28.75.9 V.34.45.9 TRUE
F.34.75.9 V.28.45.9 TRUE
V.28.15.9 F.34.30.9 TRUE
V.34.15.9 F.28.30.9 TRUE
V.28.30.9 F.34.15.9 TRUE
V.34.30.9 F.28.15.9 TRUE
V.28.45.9 F.34.75.9 TRUE
V.34.45.9 F.28.75.9 TRUE
V.28.75.9 F.34.45.9 TRUE
V.34.75.9 F.28.45.9 TRUE
A.28.15.9 A.34.30.9 TRUE
A.34.15.9 A.28.30.9 TRUE
A.28.30.9 A.34.15.9 TRUE
A.34.30.9 A.28.15.9 TRUE
A.28.45.9 A.34.75.9 TRUE
A.34.45.9 A.28.75.9 TRUE
A.28.75.9 A.34.45.9 TRUE
A.34.75.9 A.28.45.9 TRUE
A simple copy script exists!
- It converts each sequence file into the coded single letter format.
- It renames the files to match the Excel table above.
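To illustrate why copying is safer than renaming in place here: the table is a permutation, not a set of simple pairwise swaps (for example F.28.15.8 becomes V.34.15.8 while V.34.15.8 becomes V.28.15.8), so an in-place rename would overwrite files unless temp names were used. The sketch below is not the renameANDrelabel.bash script. It assumes the table is saved as whitespace-separated text with the "original new changed" header, and that the source files are already named `<original>_R1.fastq.gz` and `<original>_R2.fastq.gz`. The function name `relabel_by_table` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy files to their corrected names in a separate directory.
# relabel_by_table TABLE SRC_DIR DEST_DIR   (hypothetical helper)
relabel_by_table() {
    local table="$1" src_dir="$2" dest_dir="$3"
    local original new changed mate src dest status=0
    mkdir -p "$dest_dir"
    while read -r original new changed; do
        # Skip blank lines and the "original new changed" header row.
        case "$original" in ''|original) continue ;; esac
        for mate in R1 R2; do
            src="$src_dir/${original}_${mate}.fastq.gz"
            dest="$dest_dir/${new}_${mate}.fastq.gz"
            if [ ! -f "$src" ]; then
                echo "missing: $src" >&2
                status=1
                continue
            fi
            if [ -e "$dest" ]; then
                # Two table rows pointing at the same new name is an error.
                echo "refusing to overwrite: $dest" >&2
                status=1
                continue
            fi
            cp -p "$src" "$dest"
        done
    done < "$table"
    return "$status"
}
```

Since the source directory is never modified, the run can be repeated from scratch if a row in the table turns out to be wrong.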
Testing area is on HPC here:
/projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/testingRenameZone
This now needs a review and a test by another person.
To be reviewed:
- User needs to log into the HPC cluster.
- Navigate to /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/testingRenameZone
- Spot check the script to see that each line is picking the correct NEW sample name. To do that, one will need to look at the Excel sheet mentioned above (in https://bitbucket.org/hotpollen/flavonoid-rnaseq/src/main/72_F3H_PollenTube/Documentation/)
- Run the following command
- bash renameANDrelabel.bash
- Check that there are 144 new fastq.gz files, then check that each file is 20 lines long (zcat file | wc -l).
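The last review step can be scripted so the reviewer gets a single pass/fail answer. A sketch only: `check_toy_files` is a hypothetical name, and the defaults of 144 files and 20 lines come from the checklist above.

```shell
#!/usr/bin/env bash
# Sketch: verify the number of fastq.gz files and the line count of each.
# check_toy_files DIR [EXPECTED_FILES] [EXPECTED_LINES]   (hypothetical helper)
check_toy_files() {
    local dir="$1" expected_files="${2:-144}" expected_lines="${3:-20}"
    local f n count=0 status=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue       # nothing matched the glob
        count=$((count + 1))
        n=$(gzip -dc "$f" | wc -l | tr -d ' ')
        if [ "$n" -ne "$expected_lines" ]; then
            echo "$f has $n lines, expected $expected_lines"
            status=1
        fi
    done
    if [ "$count" -ne "$expected_files" ]; then
        echo "found $count files, expected $expected_files"
        status=1
    fi
    return "$status"
}
```

Run from the sandbox after the rename script, for example `check_toy_files <output directory> && echo OK`.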
NCBI Submission ID: SUB13519532
2 more tables to review as well!!
These are the Biosample and the SRA tables.
They are located in the Google drive at this location:
https://drive.google.com/drive/folders/1EaCt42IuxWd--1kKZW931PWw9N5OWpw4?usp=drive_link
Like previous tables, need to check everything aligns and is correct.
We can't truly confirm the SRA table until the renaming step is reviewed and signed off on.
Need a design Description for SRA sheet.
The Azenta summary is this:
The RNA sample received was quantified using Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and RNA integrity was checked using TapeStation (Agilent Technologies, Palo Alto, CA, USA). The RNA sequencing library was prepared using the NEBNext Ultra II RNA Library Prep Kit for Illumina using manufacturer’s instructions (NEB, Ipswich, MA, USA). mRNAs were initially enriched with Oligod(T) beads. Enriched mRNAs were fragmented for 15 minutes at 94 °C. First strand and second strand cDNA were subsequently synthesized. cDNA fragments were end repaired and adenylated at 3’ends, and universal adapters were ligated to cDNA fragments, followed by index addition and library enrichment by PCR with limited cycles. The sequencing library was validated on the Agilent TapeStation (Agilent Technologies, Palo Alto, CA, USA), and quantified by using Qubit 2.0 Fluorometer (Invitrogen, Carlsbad, CA) as well as by quantitative PCR (KAPA Biosystems, Wilmington, MA, USA). The sequencing library was clustered on one lane of a flowcell. After clustering, the flowcell was loaded on the Illumina HiSeq instrument (4000) according to manufacturer’s instructions. The sample was sequenced using a 2x150bp Paired End (PE) configuration. Image analysis and base calling were conducted by the HiSeq Control Software (HCS). Raw sequence
data (.bcl files) generated from Illumina HiSeq was converted into fastq files and de-multiplexed
using Illumina's bcl2fastq 2.17 software. One mismatch was allowed for index sequence
identification.
Reviewing:
/testingRenameZone
- I checked line by line the script renameANDrelabel.bash compared to the Muday-lab-RNA-samples-for-sample-name-conversion.xlsx file. Everything matched correctly. Files that were to be renamed matched the corrected name, and the conversion to the 4 code format was correct.
- I was not able to run the script as I did not have permission, but I don't know that it matters since I have reviewed the script itself.
SRA_metadata_Muday144.xlsx
- I compared the sample_name, library_ID, and title and they all matched.
- I compared the filename to the sample_name. The filenames appear to be using the mislabeled names, but with the updated 4 code format. I'm not sure what the expectation here is, but these may need to be double-checked.
- There is only a single design_description that is split across multiple lines (I assume this is a placeholder).
- Oligod(T) -> Oligo d(T)
_SRABiosampleForm-muday144.xlsx
- Compared sample_name, sample_title, cultivar, temp, treatment, description, and Replicate Code. Everything appears to match.
Nowlan Freese: suggests Molly also look at it and try to run the script in the "sandbox" space with smaller versions of the files.
The design description was messed up, as Nowlan pointed out. Carriage returns in the text created havoc when pasting into an Excel sheet.
Reworked the description to be just one line below. This has now been updated on the SRA excel sheet and pasted in properly.
design description:
The RNA sample received was quantified using Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and RNA integrity was checked using TapeStation (Agilent Technologies, Palo Alto, CA, USA). The RNA sequencing library was prepared using the NEBNext Ultra II RNA Library Prep Kit for Illumina using manufacturer’s instructions (NEB, Ipswich, MA, USA). mRNAs were initially enriched with Oligod(T) beads. Enriched mRNAs were fragmented for 15 minutes at 94 °C. First strand and second strand cDNA were subsequently synthesized. cDNA fragments were end repaired and adenylated at 3’ends, and universal adapters were ligated to cDNA fragments, followed by index addition and library enrichment by PCR with limited cycles. The sequencing library was validated on the Agilent TapeStation (Agilent Technologies, Palo Alto, CA, USA), and quantified by using Qubit 2.0 Fluorometer (Invitrogen, Carlsbad, CA) as well as by quantitative PCR (KAPA Biosystems, Wilmington, MA, USA). The sequencing library was clustered on one lane of a flowcell. After clustering, the flowcell was loaded on the Illumina HiSeq instrument (4000) according to manufacturer’s instructions. The sample was sequenced using a 2x150bp Paired End (PE) configuration. Image analysis and base calling were conducted by the HiSeq Control Software (HCS). Raw sequence data (.bcl files) generated from Illumina HiSeq was converted into fastq files and de-multiplexed using Illumina's bcl2fastq 2.17 software.
Corrected the description to no longer refer to Oligods !!!! ( Is now Oligo D(T) )
I do like the term Oligod though, sounds like a deity we can pray to so that sequencing projects run smoothly.
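For future descriptions pasted from a vendor report, the line breaks can be stripped before the text goes anywhere near Excel. A sketch, assuming the text has been saved to a plain text file; `flatten_text` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: join a multi-line text file into a single line.
# flatten_text FILE   (hypothetical helper)
flatten_text() {
    # Drop a trailing carriage return from each line (Windows line endings),
    # then print all lines separated by single spaces and end with one newline.
    awk '{ sub(/\r$/, ""); printf "%s%s", (NR > 1 ? " " : ""), $0 }
         END { print "" }' "$1"
}
```

The output is one line that can be pasted into a single spreadsheet cell.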
Round 2 of review: This time Molly.
2 parts to this review:
#1
To be reviewed:
User needs to log into the HPC cluster.
Navigate to /projects/tomato_genome/rnaseq/renamed_MudayTimeCourseSequences/testingRenameZone
Spot check the script to see that each line is picking the correct NEW sample name. To do that, one will need to look at the Excel sheet mentioned above (in https://bitbucket.org/hotpollen/flavonoid-rnaseq/src/main/72_F3H_PollenTube/Documentation/)
Run the following command (just check the script for logic)
bash renameANDrelabel.bash
Check that there are 144 new fastq.gz files, then check that each file is 20 lines long (zcat file | wc -l).
#2 Checking the Biosample and SRA sheets:
These are the Biosample and the SRA tables.
They are located in the Google drive at this location:
https://drive.google.com/drive/folders/1EaCt42IuxWd--1kKZW931PWw9N5OWpw4?usp=drive_link
Like previous tables, need to check everything aligns and is correct.
I need to walk through this all again. As a final check.
And then go through the NCBI submission portal and ensure all of those pieces are intact and correct.
The FTP site data, I need to look and see that it is ready to go as well.
But once that is checked, I think we are ready to submit.
Nowlan Freese: says check file name concordance
Sorting folly on my part!
I have corrected the file names so that they are now in line with the ID and names in the SRA tables.
Great catch Nowlan!
When I listed and copied the file names, they were sorted by date generated (ls -lrt) rather than by name, which altered the list order.
The files all line up correctly now in the table.
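A mechanical concordance check would catch this kind of ordering slip before review. The sketch below makes several assumptions: the SRA sheet has been exported as tab-separated text, it has columns named `library_ID`, `filename`, and `filename2`, and each file is expected to be named `<library_ID>_R1.fastq.gz` or `<library_ID>_R2.fastq.gz`. The function name `check_concordance` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag rows whose file names do not match the library_ID.
# check_concordance SRA_TABLE.tsv   (hypothetical helper)
check_concordance() {
    awk -F '\t' '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            id = $(col["library_ID"])
            f1 = $(col["filename"])
            f2 = $(col["filename2"])
            if (f1 != id "_R1.fastq.gz" || f2 != id "_R2.fastq.gz") {
                print "row " NR ": " id " -> " f1 ", " f2
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Building the file column from a name-sorted listing (plain `ls`, not `ls -lrt`) avoids the problem in the first place; the check is a backstop.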
It would be good for Nowlan to double check this however!!
I will bump to review and assign.
File to review is this one:
https://docs.google.com/spreadsheets/d/1n4nsE4E8lykivizPtQyf17XJnL7FRELR/edit?usp=sharing&ouid=100714234126361751017&rtpof=true&sd=true
Also, the FTP environment for NCBI was updated in June 2023.
Need to upload Muday Data again. Process for that has begun. 72 files corresponding to google sheet above.
This is an FTP note for Rob:
/uploads/rreid2_uncc.edu_eIUyy48y/muday144
It's the location where this data is going short term at NCBI.
72 pairs of files. 144 total sequence files.
142 successfully transferred. 2 did not, server disconnection. Resending the final 2.
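Finding which files did not arrive is a set difference between the local file names and a listing of the remote folder. A sketch, assuming the remote listing has been saved to a text file with one file name per line (how that listing is obtained from the FTP client is not shown here); `missing_uploads` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: print local fastq.gz files that are absent from a remote listing.
# missing_uploads LOCAL_DIR REMOTE_LISTING   (hypothetical helper)
missing_uploads() {
    local local_dir="$1" remote_listing="$2"
    local tmp_local tmp_remote f
    tmp_local=$(mktemp)
    tmp_remote=$(mktemp)
    for f in "$local_dir"/*.fastq.gz; do
        [ -e "$f" ] && basename "$f"
    done | sort > "$tmp_local"
    sort "$remote_listing" > "$tmp_remote"
    # comm -23 keeps lines found only in the first (local) list.
    comm -23 "$tmp_local" "$tmp_remote"
    rm -f "$tmp_local" "$tmp_remote"
}
```

An empty output means every local file has a counterpart in the listing; it says nothing about whether the remote copies are complete, so a size or checksum comparison is still worthwhile.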
Testing SRA metadata:
[~RobertReid] - I think the lines 50-57 and 58-65 for the file names are mismatched/swapped.
Corrected these file names in the table.
Ready for another Nowlan inspection.
SRA metadata table looks good, I couldn't find any issues.
Dear Robert Reid,
Your submission SUB13519532 has failed with the following error:
1. Similar projects already exist: PRJNA980666
That is a new complaint!!!!
I think I will try to change the title and description to highlight that these are varieties related to flavonoid production.
I changed the title and the description.
After that, I will need to reach out and chat with the SRA people.
This produces an outright rejection, with no ability to edit.
So I will reach out to the SRA people and provide some changes that hopefully make them happy.
Or I will have them add this SRA under the BioProject of Mark's recent submission. Maybe that will appease them!
I have reached out to the SRA with the following email:
Hi, I made this submission and it has been rejected due to the BioProject being too similar to PRJNA980666.
Is it possible to add this to the previous BioProject? The data came from 2 different labs (Brown University versus Wake Forest) and are completely different tomato varieties.
But they are all the same species and are a similar time course experiment.
And they are all the same tissue, pollen tube.
Rob Reid
The SRA have responded favorably!!
SRA submission SUB13519532 is now re-processing with BioProject PRJNA980666.
Best,
Rick Lapoint
SRA Curator
SUCCESS !!
Dear Robert Reid,
This is an automatic acknowledgment that your recent submission to the SRA database has been successfully processed and will be released on the date specified.
Please reference PRJNA980666 in your publication. This BioProject accession number is provided instead of SRP and should be used in your publication as it will allow better searching in Entrez.
Accession to cite for these SRA data: PRJNA980666
Temporary Submission ID: SUB13519532
Release date: 2023-08-02
Your SRA records will be accessible with the following link after the indicated release date:
https://www.ncbi.nlm.nih.gov/sra/PRJNA980666
Send questions and update requests to sra@ncbi.nlm.nih.gov; include the citation accession PRJNA980666 in any correspondence.
Regards,
NCBI SRA Submissions Staff
Bethesda, Maryland USA
We need to add all of the Accession IDs before we close this out!
Stay tuned.
SRR Accessions:
SRR25478240
SRR25478241
SRR25478242
SRR25478243
SRR25478244
SRR25478245
SRR25478246
SRR25478247
SRR25478248
SRR25478249
SRR25478250
SRR25478251
SRR25478252
SRR25478253
SRR25478254
SRR25478255
SRR25478256
SRR25478257
SRR25478258
SRR25478260
SRR25478261
SRR25478262
SRR25478263
SRR25478264
SRR25478265
SRR25478266
SRR25478267
SRR25478268
SRR25478269
SRR25478270
SRR25478272
SRR25478273
SRR25478275
SRR25478276
SRR25478277
SRR25478278
SRR25478279
SRR25478280
SRR25478281
SRR25478282
SRR25478283
SRR25478284
SRR25478285
SRR25478286
SRR25478287
SRR25478289
SRR25478290
SRR25478291
SRR25478292
SRR25478293
SRR25478294
SRR25478295
SRR25478296
SRR25478297
SRR25478298
SRR25478299
SRR25478300
SRR25478301
SRR25478302
SRR25478304
SRR25478305
SRR25478306
SRR25478307
SRR25478308
SRR25478309
SRR25478310
SRR25478311
SRR25478259
SRR25478271
SRR25478274
SRR25478303
SRR25478288
[~RobertReid] Hi Dr. Reid, I was wondering why the SRR names are the same as seedling and mature pollen submission IGBF-3347. Is that correct, because then I would only need to rerun the data once for one ticket. Let me know!
For the SRA dataset SRP441343, both sets of data are contained therein. (Muday's and the mature pollen/seedling)
https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP441343
NCBI forced me to combine them due to the similarities.
There are 126 total. 72 of those are Muday's time course.
Is this what you were looking for?
Ok! I have the 72 Muday SRR names, but you pasted the same 72 in the seedling and mature pollen experiment IGBF-3347, and I was wondering what those names are actually supposed to be. Thank you for your help! [~RobertReid]
And the adventure begins:
With another biosample sheet for submission.
Working copy can be found here:
https://docs.google.com/spreadsheets/d/1F39JAFfvyct4hdpfV2narT4BH6T13mBgJPvfl5BkbzM/edit?usp=sharing
Still in progress atm.