[IGBF-3878] Adding SolyIds back to the NEXTFLOW de novo results via a Python Script - JIRA UNCC

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Labels: None
Story Points: 3
Epic Link: Support NSF pollen grant
Sprint: Fall 1, Fall 2

Description

We take the 4 tables and a Soly ID table produced previously (ticket # I don't recall)

Write a Python script that reads in all the data and builds a large table where each row is a SolyID gene and each column is an experiment.

Will need good column labels!
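
The merge described above could look roughly like the sketch below. It is only an illustration under stated assumptions: each per-variety table is already loaded as a list of sample names plus a dict keyed by SolyID, each SolyID appears at most once per table, column labels are built as variety_sample, and genes missing from a variety are filled with "NA". The function name merge_count_tables and the "NA" filler are hypothetical choices, not something decided in this ticket.

```python
def merge_count_tables(tables):
    """Merge per-variety count tables into one wide table.

    tables: dict mapping variety name -> (sample_names, rows), where rows
    maps a SolyID to its list of counts (one per sample). Hypothetical
    layout, chosen only for this sketch.
    """
    # Build labelled columns such as "Mal_S1" so every experiment is identifiable.
    header = ["soly_id"]
    for variety, (samples, _rows) in tables.items():
        header += [f"{variety}_{sample}" for sample in samples]

    # Every SolyID seen in any variety gets one row in the merged table.
    all_ids = sorted(set().union(*(rows.keys() for _samples, rows in tables.values())))

    merged = []
    for soly_id in all_ids:
        line = [soly_id]
        for samples, rows in tables.values():
            # Assumption: fill with "NA" when a variety has no counts for this gene.
            line += rows.get(soly_id, ["NA"] * len(samples))
        merged.append(line)
    return header, merged
```

Prefixing each sample name with its variety is one simple way to get unambiguous column labels; whether missing genes should be "NA" or 0 would need to be agreed on before the table is used downstream.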

Activity

Robert Reid created issue - 27/Aug/24 9:58 AM

Robert Reid made changes - 27/Aug/24 9:58 AM

Field	Original Value	New Value
Epic Link		IGBF-2993 [ 21429 ]

Brandon Bendickson made changes - 28/Aug/24 10:02 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Robert Reid added a comment - 28/Aug/24 3:41 PM

This will be a two-task process; both tasks involve writing Python scripts.

This task is step 1: adding a SolyID to a salmon counts table using our BLAT results from many steps ago.

We run this script repeatedly, once for each plant variety.

1. We need the BLAT result .fna file in which we BLAT-aligned the rnaSPAdes contigs to SL5.
It can be found at this location:
/projects/tomato_genome/fnb/dataprocessing/brandon_work/mal/malintka-spades/spades_blat/blat-SL5-CDS-malintka-bestLongHit.fna

We read this file into a dict with the NODE ID as the key and the SolyID as the value.

2. We need the salmon gene count file for the same variety:
/projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh/Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts.tsv

The first column in the table is the NODE ID; we ignore the 2nd column and keep all of the remaining columns of read counts.
We read a line, parse it, and check whether the ID in the 1st column is in our dict from above.
If so, we write out a line using the SolyID as the first column, followed by all of the remaining fields.

In the end we write out a table in which each row has a SolyID and all of the gene counts.

We then rerun the script, pointing it at the next plant variety (MAL, etc.).
After that we move to the next phase: merging the 4 tables into 1 (new ticket, not yet created).
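
The counts-table step described above can be sketched as follows. This is a minimal illustration, not the actual add_soly_back.py: it assumes the NODE-to-SolyID dict has already been built (parsing the .fna file is not shown because its header layout is not described here), that the counts file is tab-separated with a header row, and that the output column name "soly_id" is acceptable. The function name add_soly_ids is hypothetical.

```python
import csv


def add_soly_ids(node_to_soly, counts_in, counts_out):
    """Copy a salmon counts table, replacing NODE IDs with SolyIDs.

    node_to_soly: dict of NODE ID -> SolyID (built elsewhere from the BLAT hits).
    counts_in / counts_out: open text file handles (tab-separated).
    Returns the number of data rows written.
    """
    reader = csv.reader(counts_in, delimiter="\t")
    writer = csv.writer(counts_out, delimiter="\t", lineterminator="\n")

    # Assumption: the first line is a header; drop its first two labels
    # and keep the sample column names.
    header = next(reader)
    writer.writerow(["soly_id"] + header[2:])

    kept = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        soly_id = node_to_soly.get(row[0])
        if soly_id is None:
            continue  # contig had no BLAT hit in the dict, so it is dropped
        # The 2nd column is ignored; all remaining count columns are kept.
        writer.writerow([soly_id] + row[2:])
        kept += 1
    return kept
```

Returning the number of rows written makes it easy to compare varieties after each run. Note that if several contigs map to the same SolyID, this sketch writes one row per contig; collapsing them would be a separate decision.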

Robert Reid made changes - 29/Aug/24 9:19 AM

Assignee	Robert Reid [ robertreid ]	Brandon Bendickson [ bbendick ]

Robert Reid made changes - 29/Aug/24 10:25 AM

Summary	Blend 4 salmon count tables into 1 and add SolyIds back	Adding SolyIds back to the NEXTFLOW de novo results via a Python Script

Robert Reid made changes - 29/Aug/24 10:26 AM

Description

GOAL: Prep a new table that harnesses the results from Ticket 3772, The salmon counts.

We take the 4 tables and a Soly ID table produced previously (ticket # I don't recall)

Make a python script that will read in all the data, and make a large table where each row is a SolyId gene, each column is an experiment.

Will need good column labels!

GOAL: Prep a new table that harnesses the results from Ticket 3772, The salmon counts.
The first step is to make a new counts table but we add SolyIDs back in. We associate the Soly Ids with the de novo contigs via a previous BLAT alignment. This task is the

We take the 4 tables and a Soly ID table produced previously (ticket # I don't recall)

Make a python script that will read in all the data, and make a large table where each row is a SolyId gene, each column is an experiment.

Will need good column labels!

Brandon Bendickson added a comment - 04/Sep/24 10:45 AM

Completed the Python script and added SolyIDs back to the de novo results. I am moving the ticket to first-level review. I want to make sure the files are in the format we are looking for; if so, we can close this and move on to the next step.

Results are located in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/counts_with_solID
The script is located in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/add_soly_back.py

Brandon Bendickson made changes - 04/Sep/24 10:45 AM

Status	In Progress [ 3 ]	Needs 1st Level Review [ 10005 ]

Robert Reid made changes - 05/Sep/24 9:37 AM

Assignee	Brandon Bendickson [ bbendick ]	Robert Reid [ robertreid ]

Robert Reid added a comment - 06/Sep/24 10:06 AM

The script add_soly_back.py looks logically correct.

However, the data going into the script is off!

/projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/result_processing/counts_with_solID$ for file in *.tsv; do wc -l $file; done
7842 all_counts_with_solID_denovo.tsv
7842 all_counts_with_solID.tsv
25080 Hei_counts_with_solID.tsv
25092 Mal_counts_with_solID.tsv
11112 Nag_counts_with_solID.tsv
25616 Tam_counts_with_solID.tsv

First check why Nag does not have the same number as the other 3 varieties. This might become a new ticket.
Do we need to rerun Nextflow or did something not copy correctly?
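
The same line-count comparison could be scripted so that an undersized table is flagged automatically. The sketch below is only an illustration: the function names and the 0.5 ratio threshold are hypothetical choices, not part of add_soly_back.py.

```python
def count_lines(path):
    """Return the number of lines in a file (comparable to wc -l)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def flag_short_tables(line_counts, ratio=0.5):
    """Return names whose line count is below ratio * the largest count.

    line_counts: dict of table name -> number of lines.
    ratio: hypothetical threshold; 0.5 means "less than half the largest table".
    """
    largest = max(line_counts.values())
    return sorted(name for name, n in line_counts.items() if n < ratio * largest)


# Per-variety line counts reported in this comment.
reported = {"Hei": 25080, "Mal": 25092, "Nag": 11112, "Tam": 25616}
print(flag_short_tables(reported))
```

With the per-variety counts reported above, only Nag falls below half of the largest table.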

Robert Reid made changes - 06/Sep/24 10:06 AM

Assignee	Robert Reid [ robertreid ]	Brandon Bendickson [ bbendick ]

Robert Reid made changes - 06/Sep/24 10:09 AM

Status	Needs 1st Level Review [ 10005 ]	First Level Review in Progress [ 10301 ]

Robert Reid made changes - 06/Sep/24 10:09 AM

Status	First Level Review in Progress [ 10301 ]	To-Do [ 10305 ]

Robert Reid added a comment - 06/Sep/24 10:12 AM

Start with the Nag NEXTFLOW results and make sure that the salmon count folder has files of the same size as those of the other varieties.

Brandon Bendickson made changes - 06/Sep/24 10:34 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]

Brandon Bendickson made changes - 06/Sep/24 10:34 AM

Status	In Progress [ 3 ]	To-Do [ 10305 ]

Brandon Bendickson added a comment - 06/Sep/24 10:46 AM

The Nag NEXTFLOW results have the same size as the Nag_counts_with_solID.tsv file, so it copied correctly. The pipeline was also completed without error.

Brandon Bendickson added a comment - 11/Sep/24 9:43 AM

Reran Nag. Again, the pipeline was completed without error, but the resulting gene counts file still only has 11,112 lines.
Results are found in: /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh/test_Nag/results-3.14.0/star_salmon

Robert Reid added a comment - 13/Sep/24 10:03 AM

This is now held up by the rnaSPAdes run task (Fall 2 epic board).

This remains a To-Do for now. I can't find the ID of the ticket that blocks this.

Ann Loraine made changes - 16/Sep/24 8:34 AM

Sprint	Fall 1 [ 202 ]	Fall 1, Fall 2 [ 202, 203 ]

Ann Loraine made changes - 16/Sep/24 8:34 AM

Rank		Ranked higher

Robert Reid added a comment - 16/Sep/24 9:29 AM

This is blocked by 3901. I am not quite sure how to tag this as blocked within Jira. I only see that option when setting up a ticket / task for the first time.

Brandon Bendickson made changes - 19/Sep/24 11:30 AM

Status	To-Do [ 10305 ]	In Progress [ 3 ]