  1. IGB
  2. IGBF-3878

Adding SolyIds back to the NEXTFLOW de novo results via a Python Script

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL: Prepare a new table that builds on the results from ticket 3772 (the salmon counts).
      The first step is to make a new counts table with the SolyIDs added back in. We associate the SolyIDs with the de novo contigs via a previous BLAT alignment. This task is the

      We take the 4 tables and a SolyID table produced previously (ticket # I don't recall).

      Make a Python script that reads in all the data and builds one large table where each row is a SolyID gene and each column is an experiment.

      Will need good column labels!
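
      The large table described above could be built along these lines. This is only a sketch under stated assumptions, not the script used for this project: the function name merge_count_tables is hypothetical, column labels are assumed to be the variety name joined to the sample name, counts for a SolyID that appears more than once within a variety are assumed to be summed, and a SolyID missing from a variety is assumed to get 0.0. Whether to keep the union or only the intersection of SolyIDs across varieties is an open choice; the sketch keeps the union.

```python
import csv
import io
from collections import defaultdict


def merge_count_tables(tables):
    """Merge per-variety count tables into one SolyID-by-experiment table.

    `tables` maps a variety label (e.g. 'Mal') to an open TSV handle whose
    first column is the SolyID and whose remaining columns are counts.
    Returns (column_labels, rows), where rows maps SolyID -> list of floats.

    ASSUMPTIONS (not from the ticket): repeated SolyIDs within one table
    are summed, and a SolyID absent from a variety gets 0.0 there.
    """
    labels = []
    per_variety = []
    for variety, handle in tables.items():
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        n = len(samples)
        # Prefix every sample with its variety so columns stay unambiguous.
        labels.extend(f"{variety}_{sample}" for sample in samples)
        sums = defaultdict(lambda n=n: [0.0] * n)
        for row in reader:
            if not row:
                continue
            totals = sums[row[0]]
            for i, value in enumerate(row[1:]):
                totals[i] += float(value)
        per_variety.append((n, sums))

    all_ids = sorted(set().union(*(sums for _, sums in per_variety)))
    merged = {}
    for soly_id in all_ids:
        merged_row = []
        for n, sums in per_variety:
            merged_row.extend(sums.get(soly_id, [0.0] * n))
        merged[soly_id] = merged_row
    return labels, merged


if __name__ == "__main__":
    # Tiny made-up tables, only to show the call shape.
    demo = {
        "Mal": io.StringIO("soly_id\tS1\nSolycA\t1\nSolycB\t5\n"),
        "Hei": io.StringIO("soly_id\tS1\tS2\nSolycB\t3\t4\n"),
    }
    columns, rows = merge_count_tables(demo)
    print("\t".join(["soly_id"] + columns))
    for soly_id, values in rows.items():
        print("\t".join([soly_id] + [str(v) for v in values]))
```

      Prefixing each sample column with its variety is one way to get the "good column labels" asked for; any other unambiguous scheme would work equally well.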

        Attachments

          Activity

          robofjoy Robert Reid added a comment -

          This will be a two-task process; both tasks involve writing Python scripts.

          This task is step 1: adding a SolyID to a salmon counts table using our BLAT results from many steps ago.

          We run this script repeatedly, once for each plant variety.

          1. We need the BLAT result .fna file in which we aligned the rnaSPAdes contigs to SL5.
          That can be found in this location:
          /projects/tomato_genome/fnb/dataprocessing/brandon_work/mal/malintka-spades/spades_blat/blat-SL5-CDS-malintka-bestLongHit.fna

          We read this file into a dict with the NODE ID as the key and the SolyID as the value.

          2. We need the salmon gene count file for the same variety:
          /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh/Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts.tsv

          The first column in the table is the NODE ID; we ignore the 2nd column and keep all of the remaining read-count columns.
          We read a line, parse it, and check whether the ID in the 1st column is in our dict from above.
          If so, we write out a line using the SolyID as the first column, followed by all of the remaining fields.

          In the end we write out a table in which each row has a SolyID and all of the gene counts.

          We then repeat this script, pointing it at a new plant variety (e.g. MAL, etc.).
          After that we move to the next phase of merging the 4 tables into 1 (new ticket that is not yet created).
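
          The two steps above can be sketched in Python as follows. This is a minimal illustration and not the contents of add_soly_back.py, which is not shown in this ticket. The function names are hypothetical, and the FASTA header layout (NODE ID first, SolyID second, separated by whitespace) is an assumption, because the real header format of the bestLongHit .fna file is not given here. Only parse_header would need to change if the layout differs.

```python
import csv
import io


def parse_header(header):
    """Split a FASTA header line into (node_id, soly_id).

    ASSUMPTION: headers look like '>NODE_1_length_900 Solyc01g005000.1',
    i.e. NODE ID first and SolyID second. Adjust to the real file layout.
    """
    fields = header.lstrip(">").split()
    return fields[0], fields[1]


def load_node_to_soly(fna_handle):
    """Build {NODE ID: SolyID} from the header lines of the BLAT .fna file."""
    node_to_soly = {}
    for line in fna_handle:
        if line.startswith(">"):
            node_id, soly_id = parse_header(line.strip())
            node_to_soly[node_id] = soly_id
    return node_to_soly


def add_soly_ids(counts_handle, out_handle, node_to_soly):
    """Rewrite a salmon gene-count table so that it is keyed by SolyID.

    Column 1 (NODE ID) is replaced by its SolyID, column 2 is dropped, and
    all remaining count columns are kept. Rows whose NODE ID is not in the
    dict are skipped. Returns the number of data rows written.
    """
    reader = csv.reader(counts_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow(["soly_id"] + header[2:])
    written = 0
    for row in reader:
        if not row:
            continue
        soly_id = node_to_soly.get(row[0])
        if soly_id is not None:
            writer.writerow([soly_id] + row[2:])
            written += 1
    return written


if __name__ == "__main__":
    # Made-up in-memory data; real runs would open the two files by path.
    fna = io.StringIO(">NODE_1_length_900 Solyc01g005000.1\nATGC\n")
    counts = io.StringIO(
        "gene_id\tgene_name\tS1\tS2\n"
        "NODE_1_length_900\tNODE_1_length_900\t10\t20\n"
    )
    out = io.StringIO()
    add_soly_ids(counts, out, load_node_to_soly(fna))
    print(out.getvalue(), end="")
```

          Note that several NODE IDs can map to the same SolyID, in which case the output contains repeated SolyID rows; how to combine them is left to the later merging step.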

          bbendick Brandon Bendickson added a comment -

          Completed the Python script and added SolyIDs back to the de novo results. I am moving the ticket to first-level review. I want to make sure the files are in the format we are looking for; if so, we can close this and move on to the next step.

          Results are located in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/counts_with_solID
          The script is located in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/add_soly_back.py

          robofjoy Robert Reid added a comment -

          The script add_soly_back.py looks logically correct.

          However, the data going into the script is off!

          /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/counts_with_solID$ for file in *.tsv; do wc -l $file; done
          7842 all_counts_with_solID_denovo.tsv
          7842 all_counts_with_solID.tsv
          25080 Hei_counts_with_solID.tsv
          25092 Mal_counts_with_solID.tsv
          11112 Nag_counts_with_solID.tsv
          25616 Tam_counts_with_solID.tsv

          First check why Nag does not have the same number as the other 3 varieties. This might become a new ticket.
          Do we need to rerun Nextflow or did something not copy correctly?
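
          The line-count comparison above can also be automated. The sketch below is an illustration only: the function names and the 0.8 threshold are hypothetical choices, not something used in this project. It flags any per-variety table whose line count falls well below the median of the group.

```python
from statistics import median


def count_lines(path):
    """Count the lines in a text file (like `wc -l` for newline-terminated files)."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def flag_short_tables(line_counts, min_fraction=0.8):
    """Return the names whose line count is below min_fraction of the median.

    `line_counts` maps a file name to its number of lines.
    The 0.8 default is an arbitrary, hypothetical threshold.
    """
    if not line_counts:
        return []
    typical = median(line_counts.values())
    return sorted(
        name for name, n in line_counts.items() if n < min_fraction * typical
    )


if __name__ == "__main__":
    # Line counts copied from the wc -l output in this comment.
    observed = {
        "Hei_counts_with_solID.tsv": 25080,
        "Mal_counts_with_solID.tsv": 25092,
        "Nag_counts_with_solID.tsv": 11112,
        "Tam_counts_with_solID.tsv": 25616,
    }
    print(flag_short_tables(observed))
```

          In practice the dict would be filled with count_lines(path) for each per-variety file; the merged all_counts tables are left out because they are not expected to match the per-variety sizes.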

          robofjoy Robert Reid added a comment -

          Start with the Nag NEXTFLOW results and make sure that the salmon count folder has files of the same size as the other varieties.

          bbendick Brandon Bendickson added a comment -

          The Nag NEXTFLOW results have the same size as the Nag_counts_with_solID.tsv file, so they were copied correctly. The pipeline also completed without error.

          bbendick Brandon Bendickson added a comment -

          Reran Nag. Again, the pipeline completed without error, but the resulting gene counts file still has only 11,112 lines.
          Results are found in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh/test_Nag/results-3.14.0/star_salmon

          robofjoy Robert Reid added a comment -

          This is now held up by the rnaSPAdes run task (Fall 2 epic board).

          This remains a to-do for now. I can't find the ticket ID that blocks this.

          robofjoy Robert Reid added a comment -

          This is blocked by 3901. I am not quite sure how to tag this as blocked within Jira. I only see that option when setting up a ticket / task for the first time.

          bbendick Brandon Bendickson added a comment -

          Added soly IDs to all NEXTFLOW results. Moving this to done. Results are found in:
          /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/counts_with_solID


            People

            • Assignee:
              bbendick Brandon Bendickson
              Reporter:
              robofjoy Robert Reid