[IGBF-3944] Python script to remove NCBI's contaminants - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels:
None

Story Points:
2
Epic Link:
Support NSF pollen grant
Sprint:
Fall 4, Fall 5

Description

GOAL: After we successfully removed the UNIVEC adapters, the TSA came back with a new list of sequences they suspect are contaminants. We will now write a Python script to remove these regions.

Location of the data on the cluster:
/projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

Files to use:
-rw-r--r-- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
-rw-r--r-- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt

FLOW:
1. Read in contaminant2.txt as a dictionary, with the header as the key and the region to be removed as the value.
2. Iterate through the FASTA file, checking whether each header is in the dictionary.
3. If NO, write the sequence out to a new file.
4. If YES, chop away the region:
   a. If the region is in the middle of the sequence, make 2 new sequences.
   b. If the region is within 50 bp of the beginning or the end, truncate the sequence.
5. Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
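
The chopping step in the flow above could be sketched roughly as follows. This is only an illustration, not the script that was actually written: the function name `remove_region`, the 1-based inclusive coordinates, the `_A`/`_B` suffix format, and the choice to drop the whole sequence when the region is close to both ends are all assumptions.

```python
def remove_region(header, seq, start, end, edge=50):
    """Remove seq[start..end] and return a list of (header, sequence) pairs.

    Assumptions (not confirmed by the ticket): start and end are 1-based and
    inclusive, and "within 50 bp" means 50 or fewer bases remain on that side.
    """
    left = seq[: start - 1]   # bases before the flagged region
    right = seq[end:]         # bases after the flagged region
    near_start = len(left) <= edge
    near_end = len(right) <= edge

    if near_start and near_end:
        return []                       # nothing useful left on either side
    if near_start:
        return [(header, right)]        # truncate the beginning
    if near_end:
        return [(header, left)]         # truncate the end
    # Region sits in the middle: split into two renamed sequences.
    return [(f"{header}_A", left), (f"{header}_B", right)]
```

A sequence whose header is not in the dictionary would skip this function and be written out unchanged.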

Issue Links

blocks

IGBF-3928 Shotgun assembly submission to the TSA

Closed

Activity

Brandon Bendickson added a comment - 24/Oct/24 2:28 PM

Completed main script, gonna debug and make it more usable, then upload to cluster.

Brandon Bendickson added a comment - 24/Oct/24 3:06 PM

Script ran without error. Script and results are found in: /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

Moving to first level review.

Usage for script is: python ./remove_contams.py ./Tamaulipas_new_ID_clean.fna ./seqs_to_exclude.txt ./seqs_to_trim.txt

Brandon Bendickson added a comment - 24/Oct/24 3:08 PM

Required files are also in same location.

Robert Reid added a comment - 25/Oct/24 11:42 AM

In the latest iteration of the TSA submission, they do not like OUR smaller fragments!!

Every sequence needs to be at least 200NT.
Can you revise the script and throw away every sequence less than 200NT?

Moving back to do.
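
A minimal sketch of such a length filter, assuming plain FASTA input where each header line starts with ">". The names `read_fasta` and `keep_long_enough` are made up for illustration and are not taken from remove_contams.py.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_long_enough(records, min_len=200):
    """Keep only records with at least min_len bases (200 per the comment above)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

With an open file handle `fh`, `keep_long_enough(read_fasta(fh))` would stream through the records without loading the whole file into memory.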

Robert Reid added a comment - 25/Oct/24 11:48 AM

In the report there are not many sequences remaining that they are complaining about.

Let's just remove these.

I have added the report to the cluster:
rsync -aP ~/Downloads/report-tsa-ver3.txt rreid2@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData/

Revise a python script to remove each sequence mentioned in report-tsa-ver3.txt

Write out YET another version of the fna files.
And then we will submit again.
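
One way the removal step might look, as a rough sketch. The layout of report-tsa-ver3.txt is not shown in this ticket, so the sketch assumes the flagged sequence IDs have already been pulled out of it into a collection. The names `drop_flagged` and `write_fasta` are hypothetical, as is the assumption that the ID is the first whitespace-separated token of the header.

```python
def drop_flagged(records, flagged_ids):
    """Yield (header, sequence) pairs whose ID is not in flagged_ids.

    Assumption: the sequence ID is the first whitespace-separated token
    of the FASTA header.
    """
    flagged = set(flagged_ids)
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id not in flagged:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write records to an open text handle, wrapping sequence lines at width."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i : i + width] + "\n")
```

Matching on a set keeps the lookup cheap even if the report lists many IDs.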

Brandon Bendickson added a comment - 29/Oct/24 10:56 AM

Wrote a small subscript to finish cleaning up files for TSA submission. Script and results are located in /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData. The script removes the contigs according to the report file. Moving to first level review.

Brandon Bendickson added a comment - 29/Oct/24 10:58 AM

final outputs look like this: [VARIETY]_final_submission.fna

Robert Reid added a comment - 31/Oct/24 10:03 AM

Testing the TSA submission with the 4 new files.

After removing the N's, some sequences become too short!!!

And since they get mad if there are stretches of N's, after the script clips away N's, we need to rerun the N's check because the ends could still have too many N's.

Last step needs to be checking for size ( < 200NT) and pitch those away!

Rob

Brandon Bendickson added a comment - 31/Oct/24 12:31 PM

Modified final cleanup script, checked to make sure ends didn't have any Ns and removed sequences that were less than or equal to 200bps. Files are in the same location, but the new files are named [Variety]_final_submission_2.fna

Robert Reid added a comment - 01/Nov/24 9:52 AM

Reuploading to TSA now.

Robert Reid added a comment - 01/Nov/24 10:09 AM

There are still sequences that contain N's on the ends.

For example:
Malintka-contig-7313-Solyc10T002558.2-351_A

It ends like this .......... GANNNNNNNNNNNNC

So the C breaks your script and makes the TSA Gods unhappy.

Also, 2 entries ended up with empty sequences. That also breaks the TSA.

Brandon Bendickson added a comment - 06/Nov/24 1:42 PM

Modified script, results found in Kelsie data. I wrote a function that looked at the last 12 characters, and if there was an N in those 12 it removed all 12. Should be ready for resubmission. May it please the TSA gods.
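
A hedged sketch of the end-clipping idea described above, not the actual script. Three things here are assumptions: repeating the check until nothing changes, applying the same check to the start of the sequence, and the names `clip_n_ends` and `passes_length`. The repetition is included because a single pass over an end like GANNNNNNNNNNNNC removes 11 Ns plus the C and still leaves one N at the very end.

```python
def clip_n_ends(seq, window=12):
    """Drop `window` bases from an end while that end's window contains an N.

    Repeats until neither end changes, so a leftover N exposed by one
    clip is caught on the next pass.
    """
    while seq:
        before = seq
        if "N" in seq[-window:].upper():
            seq = seq[:-window]
        if "N" in seq[:window].upper():
            seq = seq[window:]
        if seq == before:
            break
    return seq


def passes_length(seq, min_len=200):
    """Final check, run last: keep only sequences of at least min_len bases."""
    return len(seq) >= min_len
```

Running the length check after all clipping also discards entries that end up empty, since an empty sequence has length 0.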

People

Assignee:

Robert Reid

Reporter:

Robert Reid

Votes:

0

Watchers:

2

Dates

Created:

16/Oct/24 1:19 PM

Updated:

07/Nov/24 12:11 PM

Resolved:

07/Nov/24 12:11 PM