Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Done
- Affects Version/s: None
- Fix Version/s: None
- Labels: None
- Story Points: 2
- Epic Link:
- Sprint: Fall 4, Fall 5
Description
GOAL: Now that the UNIVEC adapters have been removed successfully, the TSA has flagged a new list of sequences that it suspects are contaminants. We will write a Python script to remove these regions.
Location of the data on the cluster:
/projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData
Files to use:
-rw-r--r-- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt
FLOW:
# Read contaminant2.txt into a dictionary, with the header as the key and the region to be removed as the value.
# Iterate through each FASTA file, checking whether each header is in the dictionary.
# If NO, write the sequence out to a new file unchanged.
# If YES, chop away the region:
- If the region is in the middle of the sequence, make 2 new sequences.
- If the region is within 50 bp of the beginning or the end, truncate the sequence.
- Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
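The flow above could be sketched roughly as follows. This is only a minimal sketch and not the script that was actually written for this ticket. Several things are assumptions: the layout of contaminant2.txt is not given here, so the sketch assumes one tab-separated line per sequence in the form `header<TAB>start..end` with 1-based inclusive coordinates; the dictionary key is assumed to be the full FASTA header line without the `>`; the `_A` / `_B` suffixes are one guess at the "A and B" renaming; and "within 50 bp" is read as "the flank left on that side is 50 bp or shorter, so it is discarded". The function names are hypothetical.

```python
EDGE = 50  # from the ticket: regions within 50 bp of either end are truncated


def read_contaminants(lines):
    """Return {header: (start, end)}, 1-based inclusive.

    ASSUMED line format (not confirmed by the ticket): header<TAB>start..end
    """
    regions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        header, span = line.split("\t")
        start, end = span.split("..")
        regions[header] = (int(start), int(end))
    return regions


def read_fasta(lines):
    """Yield (header, sequence) pairs; the header is the text after '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_region(header, seq, start, end):
    """Cut the 1-based inclusive region start..end out of seq.

    Returns a list of (header, sequence) records: two for a middle cut,
    one for a truncation, none if nothing useful remains.
    """
    left = seq[:start - 1]   # bases before the region
    right = seq[end:]        # bases after the region
    near_start = len(left) <= EDGE
    near_end = len(right) <= EDGE
    if near_start and near_end:
        return []
    if near_start:
        return [(header, right)]
    if near_end:
        return [(header, left)]
    # Assumed naming for the two halves of a split sequence.
    return [(header + "_A", left), (header + "_B", right)]


def clean(fasta_lines, regions):
    """Yield cleaned (header, sequence) records for one FASTA input."""
    for header, seq in read_fasta(fasta_lines):
        if header not in regions:
            yield header, seq
        else:
            yield from remove_region(header, seq, *regions[header])
```

To run it on the real data, open contaminant2.txt and one of the `.fna` files, pass the file objects to `read_contaminants` and `clean`, and write each returned record as `>header` followed by the sequence. A header with more than one flagged region would need extra handling, since a plain dictionary keeps only one region per key.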
Issue Links
- blocks: IGBF-3928 Shotgun assembly submission to the TSA (Closed)