  1. IGB
  2. IGBF-3944

Python script to remove NCBI's contaminants

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL: After we successfully removed the UNIVEC adapters, the TSA has come up with a new list of sequences that it fears might be contaminants. So we will now write a Python script to remove these regions.

      Location on the cluster where this data is:
      /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

      Files to use:
      rw-rr- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
      rw-rr- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
      rw-rr- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
      rw-rr- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
      rw-rr- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt

      FLOW:
      1. Read in contaminant2.txt as a dictionary, with the header as the key and the region to be removed as the value.
      2. Iterate through the fasta file, checking whether each header is in the dict.
      3. If NO, write the sequence out to a new file.
      4. If YES, chop away the region.
        1. If the region is in the middle of the sequence, make 2 new sequences.
        2. If it is within 50 bp of the beginning or the end, truncate the sequence.
      5. Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
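
      The flow above could be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the layout of contaminant2.txt is not shown in this ticket, so the sketch assumes one region per header with 1-based, inclusive start and end coordinates, and the function names remove_region and clean_records are hypothetical.

```python
EDGE = 50  # bp; taken from the "within 50BP" rule in the flow


def remove_region(header, seq, start, end, edge=EDGE):
    """Cut seq[start-1:end] out of one record.

    Coordinates are assumed to be 1-based and inclusive (an assumption;
    the real layout of contaminant2.txt is not shown in this ticket).
    Returns a list of (header, sequence) tuples.
    """
    left, right = seq[:start - 1], seq[end:]
    if len(left) <= edge:
        # Region is near the beginning: truncate, keep only the right side.
        return [(header, right)]
    if len(right) <= edge:
        # Region is near the end: truncate, keep only the left side.
        return [(header, left)]
    # Region is in the middle: split into two renamed sequences.
    return [(header + "_A", left), (header + "_B", right)]


def clean_records(records, contaminants):
    """Yield cleaned (header, seq) pairs.

    records: iterable of (header, seq) tuples.
    contaminants: dict mapping header -> (start, end), assumed layout.
    """
    for header, seq in records:
        if header not in contaminants:
            # Not flagged: pass the record through unchanged.
            yield header, seq
        else:
            start, end = contaminants[header]
            yield from remove_region(header, seq, start, end)
```

      Keeping remove_region separate from the file handling makes the 50 bp rule easy to check on small hand-made sequences before running it on the full .fna files.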

            Activity

            robofjoy Robert Reid added a comment -

            Testing the TSA submission with the 4 new files.

            After removing the N's, some sequences become too short!!!

            And since they get mad if there are stretches of N's, after the script clips away N's, we need to rerun the N's check because the ends could still have too many N's.

            The last step needs to be checking for size (< 200 nt) and pitching those away!
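
            A possible shape for that last step, as a rough sketch only: strip any N's left on either end, then drop whatever falls under the size cutoff. The names strip_terminal_ns and final_filter are hypothetical, and the cutoff here keeps sequences of exactly 200 nt, which is one reading of "< 200NT".

```python
def strip_terminal_ns(seq):
    """Remove every N (upper or lower case) from both ends of the sequence."""
    return seq.strip("Nn")


def final_filter(records, min_len=200):
    """Trim terminal N's, then keep only records with at least min_len bases.

    records: iterable of (header, seq) tuples.
    Empty sequences are dropped as well, since their length is 0.
    """
    kept = []
    for header, seq in records:
        trimmed = strip_terminal_ns(seq)
        if len(trimmed) >= min_len:
            kept.append((header, trimmed))
    return kept
```

            Because the length check runs after the trimming, a sequence that only becomes too short once its N's are clipped is still caught.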

            Rob

            bbendick Brandon Bendickson added a comment -

            Modified the final cleanup script, checked to make sure the ends didn't have any N's, and removed sequences that were less than or equal to 200 bp. The files are in the same location, but the new files are named [Variety]_final_submission_2.fna

            robofjoy Robert Reid added a comment -

            Reuploading to TSA now.

            robofjoy Robert Reid added a comment -

            There are still sequences that contain N's on the ends.

            For example:
            Malintka-contig-7313-Solyc10T002558.2-351_A

            It ends like this .......... GANNNNNNNNNNNNC

            So the C breaks your script and makes the TSA Gods unhappy.

            Also, 2 entries ended up with empty sequences. That also breaks the TSA.

            bbendick Brandon Bendickson added a comment -

            Modified the script; the results are in the kelsieData directory. I wrote a function that looks at the last 12 characters and, if there is an N in those 12, removes all 12. Should be ready for resubmission. May it please the TSA gods.
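
            The function described here is not attached to the ticket; a minimal reconstruction of the idea, under the assumption that it works on a plain sequence string, might look like this (the name trim_tail_if_n is hypothetical):

```python
def trim_tail_if_n(seq, window=12):
    """Drop the last `window` characters if any of them is an N.

    Sequences with no N in the final window are returned unchanged.
    """
    tail = seq[-window:]
    if "N" in tail.upper():
        return seq[:-window]
    return seq
```

            Two caveats with this approach: a sequence no longer than the window comes back empty, and a run of N's longer than the window can still leave N's at the new end, so a length check and a second pass would still be needed afterwards.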


              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid
