  1. IGB
  2. IGBF-3944

Python script to remove NCBI's contaminants

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL: After we successfully removed the UNIVEC adapters, the TSA came up with a new list of sequences they fear might be contaminants. So we will now write a Python script to remove these bits.

      Location on the cluster where this data is:
      /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

      Files to use:
      -rw-r--r-- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt

      FLOW:
      1. Read in contaminant2.txt as a dictionary, with the header as the key and the region to be removed as the value.
      2. Iterate through the FASTA file, checking whether each header is in the dict.
      3. If NO, write the sequence out to a new file.
      4. If YES, chop away the region:
        1. If the region is in the middle of the sequence, make 2 new sequences.
        2. If it is within 50 bp of the beginning or the end, truncate the sequence.
      5. Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
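
The flow above could be sketched roughly as follows. This is only an illustration, not the script that was written for this ticket. The names `read_fasta`, `excise_region`, and `clean` are hypothetical, the layout of contaminant2.txt is not shown here so the regions are assumed to be already parsed into a dict of 1-based inclusive (start, end) pairs keyed by the full header, and dropping a sequence whose region is near both ends is my own assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def excise_region(header, seq, start, end, edge=50):
    """Remove bases start..end (1-based, inclusive) from seq.

    Returns a list of (header, sequence) pairs:
    - region within `edge` bases of one end: truncate, one pair
    - region in the middle: split into two pairs, suffixed _A and _B
    """
    left = seq[:start - 1]
    right = seq[end:]
    near_start = start - 1 <= edge
    near_end = len(seq) - end <= edge
    if near_start and near_end:
        # Assumption: only short flanks would remain, so keep nothing.
        return []
    if near_start:
        return [(header, right)]
    if near_end:
        return [(header, left)]
    return [(header + "_A", left), (header + "_B", right)]


def clean(records, regions, edge=50):
    """Pass clean records through; trim or split the flagged ones.

    `regions` maps header -> (start, end); this structure is assumed.
    """
    for header, seq in records:
        if header not in regions:
            yield header, seq
        else:
            start, end = regions[header]
            yield from excise_region(header, seq, start, end, edge)
```

Keeping the coordinate logic in one small function makes the split-versus-truncate rule easy to check on toy sequences before running it on the full .fna files.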

            Activity

            bbendick Brandon Bendickson added a comment -

            Completed main script, gonna debug and make it more usable, then upload to cluster.

            bbendick Brandon Bendickson added a comment -

            Script ran without error. Script and results are found in: /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

            Moving to first level review.

            Usage for script is: python ./remove_contams.py ./Tamaulipas_new_ID_clean.fna ./seqs_to_exclude.txt ./seqs_to_trim.txt

            bbendick Brandon Bendickson added a comment -

            Required files are also in same location.

            robofjoy Robert Reid added a comment -

            In the latest iteration of the TSA submission, they do not like OUR smaller fragments!!

            Every sequence needs to be at least 200 nt.
            Can you revise the script and throw away every sequence shorter than 200 nt?

            Moving back to do.
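
A length filter of this kind could look like the sketch below. `drop_short` is a hypothetical name, and the records are assumed to be (header, sequence) pairs; "at least 200" is read as keeping sequences of exactly 200 nt.

```python
MIN_LEN = 200  # minimum length stated in this ticket


def drop_short(records, min_len=MIN_LEN):
    """Yield only the (header, seq) pairs with at least min_len bases."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

The same check also discards records whose sequence is empty.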

            robofjoy Robert Reid added a comment -

            In the report there are not many sequences remaining that they are complaining about.

            Let's just remove these.

            I have added the report to the cluster:
            rsync -aP ~/Downloads/report-tsa-ver3.txt rreid2@hpc.uncc.edu:/projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData/

            Revise the Python script to remove each sequence mentioned in report-tsa-ver3.txt.

            Write out YET another version of the fna files.
            And then we will submit again.
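
Removing the flagged sequences might be sketched as below. The layout of report-tsa-ver3.txt is not shown in this ticket, so `ids_from_report` assumes (purely for illustration) that each non-blank, non-`#` line starts with the sequence ID as its first whitespace-separated field; adjust the parsing to the real report. Both function names are hypothetical.

```python
def ids_from_report(lines):
    """Collect sequence IDs from report lines.

    Assumption: the ID is the first whitespace-separated field of every
    line that is neither blank nor a '#' comment.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split()[0])
    return ids


def drop_flagged(records, flagged):
    """Yield the (header, seq) pairs whose ID is not in `flagged`.

    The ID is taken to be the first word of the FASTA header.
    """
    for header, seq in records:
        parts = header.split()
        seq_id = parts[0] if parts else ""
        if seq_id not in flagged:
            yield header, seq
```

Using a set for the flagged IDs keeps each lookup constant-time, which matters little for a short report but costs nothing.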

            bbendick Brandon Bendickson added a comment -

            Wrote a small subscript to finish cleaning up files for TSA submission. Script and results are located in /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData. The script removes the contigs according to the report file. Moving to first level review.

            bbendick Brandon Bendickson added a comment -

            final outputs look like this: [VARIETY]_final_submission.fna

            robofjoy Robert Reid added a comment -

            Testing the TSA submission with the 4 new files.

            After removing the N's, some sequences become too short!!!

            And since they get mad if there are stretches of N's, after the script clips away N's, we need to rerun the N's check because the ends could still have too many N's.

            The last step needs to be checking for size (< 200 nt) and pitching those away!

            Rob
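
The ordering described here (clip, re-check the ends until nothing changes, then check the length last) could be sketched like this. `finalize` is a hypothetical name, and `clip` stands in for whatever N-clipping step the existing script performs; it is assumed only ever to remove characters.

```python
def finalize(records, clip, min_len=200):
    """Clip, re-strip terminal Ns until stable, then filter by length.

    `clip` is a placeholder for the existing N-clipping function
    (assumed to return a string no longer than its input).
    """
    for header, seq in records:
        while True:
            trimmed = clip(seq).strip("Nn")
            if len(trimmed) >= len(seq):  # nothing more was removed
                break
            seq = trimmed
        # Length check comes last, after all trimming is done.
        if len(seq) >= min_len:
            yield header, seq
```

Putting the length test after the loop means a sequence is judged on what will actually be submitted, not on its length before trimming.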

            bbendick Brandon Bendickson added a comment -

            Modified the final cleanup script, checked to make sure the ends didn't have any Ns, and removed sequences that were less than or equal to 200 bp. Files are in the same location, but the new files are named [Variety]_final_submission_2.fna

            robofjoy Robert Reid added a comment -

            Reuploading to TSA now.

            robofjoy Robert Reid added a comment -

            There are still sequences that contain N's on the ends.

            For example:
            Malintka-contig-7313-Solyc10T002558.2-351_A

            It ends like this .......... GANNNNNNNNNNNNC

            So the C breaks your script and makes the TSA Gods unhappy.

            Also, 2 entries ended up with empty sequences. That also breaks the TSA.

            bbendick Brandon Bendickson added a comment -

            Modified the script; results are in kelsieData. I wrote a function that looks at the last 12 characters and, if there is an N in those 12, removes all 12. Should be ready for resubmission. May it please the TSA gods.
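
A window check like the one described might be sketched as follows; this is not the actual function from the script, and `trim_tail` is a hypothetical name. One detail worth noting: for a tail such as a run of 12 Ns followed by a single base, removing the last 12 characters once still leaves one N at the end, so this sketch repeats the check until the window is clean.

```python
def trim_tail(seq, window=12):
    """Drop the last `window` bases while that window contains an N.

    Repeating the check handles N runs longer than one window can cover.
    """
    while seq and "N" in seq[-window:].upper():
        seq = seq[:-window]
    return seq
```

A sequence shorter than the window that contains an N comes back empty, so a length filter should still run afterwards.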


              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid