  1. IGB
  2. IGBF-3944

Python script to remove NCBI's contaminants

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL: Now that the UniVec adapters have been removed successfully, NCBI's TSA (Transcriptome Shotgun Assembly) review has come back with a new list of sequence regions flagged as possible contaminants. We will write a Python script to remove these regions.

      Location of the data on the cluster:
      /projects/tomato_genome/fnb/dataprocessing/TSA-transcriptomeShotgunAssembly/kelsieData

      Files to use:
      -rw-r--r-- 1 rreid2 tomato_genome 80M Oct 14 12:38 Heinz_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 146M Oct 14 12:38 Nagcarlang_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 69M Oct 14 12:38 Malintka_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 52M Oct 14 12:38 Tamaulipas_new_ID_clean.fna
      -rw-r--r-- 1 rreid2 tomato_genome 1.4M Oct 15 09:30 contaminant2.txt

      FLOW:
      1. Read in contaminant2.txt as a dictionary, with the header as the key and the region to be removed as the value.
      2. Iterate through the FASTA file, checking whether each header is in the dictionary.
      3. If NO, write the sequence out to a new file unchanged.
      4. If YES, chop away the region.
        1. If the region is in the middle of the sequence, make 2 new sequences.
        2. If it is within 50 bp of the beginning or the end, truncate the sequence.
      5. Write the new sequence(s) to file, renaming the header if making 2 sequences (A and B).
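
      The flow above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually written for this task. Several things are assumptions because the ticket does not specify them: contaminant2.txt is assumed to hold one tab-separated line per sequence in the form header<TAB>start..end with 1-based inclusive coordinates and at most one region per header; "within 50 bp" is taken to mean that the flank left on that side would be 50 bp or shorter, so it is discarded; a sequence whose region is close to both ends is dropped entirely; and the _A / _B suffixes and all function names are hypothetical.

```python
EDGE = 50  # bp; flanks this short (or shorter) are discarded rather than kept


def read_contaminants(path):
    """Return {header: (start, end)}.

    ASSUMED format: 'header<TAB>start..end', 1-based inclusive coordinates.
    """
    regions = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            header, span = line.split("\t")
            start, end = span.split("..")
            regions[header] = (int(start), int(end))
    return regions


def read_fasta(path):
    """Yield (header, sequence); header is the first word after '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                words = line[1:].split()
                header, chunks = (words[0] if words else ""), []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def remove_region(header, seq, start, end, edge=EDGE):
    """Cut out seq[start-1:end] and return a list of (header, sequence)."""
    left, right = seq[:start - 1], seq[end:]
    near_start = len(left) <= edge
    near_end = len(right) <= edge
    if near_start and near_end:
        return []                     # assumption: nothing usable remains
    if near_start:
        return [(header, right)]      # truncate at the 5' side
    if near_end:
        return [(header, left)]       # truncate at the 3' side
    # Region sits in the middle: split into two renamed sequences.
    return [(header + "_A", left), (header + "_B", right)]


def clean_fasta(fasta_path, contaminant_path, out_path):
    """Write a copy of fasta_path with the flagged regions removed."""
    regions = read_contaminants(contaminant_path)
    with open(out_path, "w") as out:
        for header, seq in read_fasta(fasta_path):
            if header in regions:
                pieces = remove_region(header, seq, *regions[header])
            else:
                pieces = [(header, seq)]
            for name, piece in pieces:
                out.write(f">{name}\n{piece}\n")
```

      Under these assumptions, a call such as clean_fasta("Heinz_new_ID_clean.fna", "contaminant2.txt", "Heinz_decontaminated.fna") would process one of the files listed above (the output file name is made up here). Note that the sketch writes each sequence on a single line and keeps only the first word of each header, so any description text after the ID is lost.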

              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid