[IGBF-3944] Python script to remove NCBI's contaminants - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 2
Epic Link: Support NSF pollen grant
Sprint: Fall 4, Fall 5

Description

GOAL: After the UNIVEC adapters were successfully removed, the TSA returned a new list of sequence regions it suspects are contaminants. We will now write a Python script to remove these regions.

Location of the data on the cluster:
/projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

Files to use:
-rw-r--r-- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt

FLOW:
# Read contaminant2.txt into a dictionary, with the sequence header as the key and the region to be removed as the value.
# Iterate through the FASTA file, checking whether each header is in the dictionary.
# If NO, write the sequence out to a new file unchanged.
# If YES, chop away the region:
## If the region is in the middle of the sequence, make 2 new sequences.
## If the region is within 50 bp of the beginning or the end, truncate the sequence.
# Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
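
A minimal sketch of the flow above, under stated assumptions rather than the script actually written for this ticket. The layout of contaminant2.txt is not given here, so the sketch assumes one tab-separated line per sequence: the header, then a 1-based inclusive span written as start..end. The function names (load_regions, read_fasta, chop, clean) and the _A/_B header suffixes are hypothetical. It also assumes that "within 50 bp" means 50 or fewer bases remain on that side of the region, and that a sequence whose region is close to both ends is dropped.

```python
EDGE = 50  # bp threshold taken from the ticket


def load_regions(lines):
    """Map header -> (start, end) as 0-based, end-exclusive coordinates.

    Assumed input: 'header<TAB>start..end' with 1-based inclusive positions.
    """
    regions = {}
    for line in lines:
        if not line.strip():
            continue
        header, span = line.rstrip("\n").split("\t")[:2]
        start, end = span.split("..")
        regions[header] = (int(start) - 1, int(end))
    return regions


def read_fasta(lines):
    """Yield (header, sequence); header is the first word after '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def chop(header, seq, start, end, edge=EDGE):
    """Remove seq[start:end] and return the surviving (header, sequence) pairs."""
    left, right = seq[:start], seq[end:]
    near_start = start <= edge            # few bases before the region
    near_end = len(seq) - end <= edge     # few bases after the region
    if near_start and near_end:
        return []                          # assumption: nothing worth keeping
    if near_start:
        return [(header, right)]           # truncate the beginning
    if near_end:
        return [(header, left)]            # truncate the end
    # Region is in the middle: split into two renamed sequences.
    return [(header + "_A", left), (header + "_B", right)]


def clean(fasta_lines, regions):
    """Yield cleaned records: untouched if not flagged, chopped otherwise."""
    for header, seq in read_fasta(fasta_lines):
        if header not in regions:
            yield header, seq
        else:
            start, end = regions[header]
            yield from chop(header, seq, start, end)


if __name__ == "__main__":
    regions = load_regions(["seq1\t101..120\n"])
    fasta = [">seq1\n", "A" * 100 + "N" * 20 + "C" * 100 + "\n", ">seq2\n", "ACGT\n"]
    for header, seq in clean(fasta, regions):
        print(f">{header}\n{seq}")
```

For the real files, the lists would be replaced by open file handles, e.g. open("contaminant2.txt") and open("Heinz_new_ID_clean.fna"), with the output written to a new .fna file. Note that this sketch keeps only the first word of each FASTA header and stores a single region per header, so a sequence with several flagged regions would need a list of spans instead.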

Issue Links

blocks: IGBF-3928 Shotgun assembly submission to the TSA (Closed)

People

Assignee: Robert Reid
Reporter: Robert Reid

Dates

Created: 16/Oct/24 1:19 PM
Updated: 07/Nov/24 12:11 PM
Resolved: 07/Nov/24 12:11 PM