  1. IGB
  2. IGBF-3647

De novo Assembly Trinity Run on Kelsey data

    Details

    • Type: Task
    • Status: Closed
    • Priority: Trivial
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL: to set up a script to run Trinity on Kelsey data.

      Eventually get this script integrated into bitbucket for future purposes.
      This dataset could then be aligned as contigs to SL4 and SL5.
      This dataset could also be fed into IGB as a "reference" and we can look at how the raw sequences align the newly created contigs.

          Activity

          robofjoy Robert Reid created issue -
          robofjoy Robert Reid made changes -
          Field Original Value New Value
          Epic Link IGBF-2993 [ 21429 ]
          robofjoy Robert Reid made changes -
          Status To-Do [ 10305 ] In Progress [ 3 ]
          robofjoy Robert Reid made changes -
          Sprint Spring 6 [ 190 ]
          robofjoy Robert Reid added a comment -

          So Rob has run the scripts a few years ago!

          /projects/tomato_genome/scripts/rob/scripts/

          rwx-w--- 1 rreid2 tomato_genome 2.7K Jun 30 2021 trinity-tam.slurm
          rwx-w--- 1 rreid2 tomato_genome 2.7K Jun 30 2021 trinity-mal.slurm
          rwx-w--- 1 rreid2 tomato_genome 2.7K Jun 30 2021 trinity-hei.slurm
          rwx-w--- 1 rreid2 tomato_genome 2.7K Jul 7 2021 trinity-nag.slurm

          These 4 scripts cover the 4 original varieties, one script each,
          and are a good starting point going forward.

          Plan:
          1. Make a copy of the script.
          2. Gather Mark's previous sequence data for the TMNH varieties PLUS Kelsey's data and make one MEGA large raw sequence file per variety.
          3. Run Trinity on each variety to make a set of de novo contigs.
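A minimal sketch of the merge step in the plan above, assuming the reads are plain-text (uncompressed) FASTQ. The function name `merge_reads` and the file names in the usage comment are hypothetical and are not taken from the existing scripts.

```shell
#!/usr/bin/env bash
# Sketch only: combine several FASTQ files into one large file.
# merge_reads and all file names below are hypothetical.

# merge_reads OUT IN1 [IN2 ...]
# Concatenates the inputs in the order given. Writes to a temporary file
# first so a failed copy never leaves a partial OUT behind, and refuses
# to overwrite an existing OUT.
merge_reads() {
    local out=$1
    shift
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi
    if [ "$#" -eq 0 ]; then
        echo "no input files given for $out" >&2
        return 1
    fi
    cat -- "$@" > "$out.tmp" && mv "$out.tmp" "$out"
}

# Example (hypothetical names); list R1 and R2 inputs in the SAME order
# so that read pairs stay in sync between the two merged files:
#   merge_reads tam_r1_all.fq old_tam_r1.fq kelsey_tam_r1.fq
#   merge_reads tam_r2_all.fq old_tam_r2.fq kelsey_tam_r2.fq
```

Passing the inputs explicitly, rather than through a glob, keeps the R1 and R2 order under direct control.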

          robofjoy Robert Reid added a comment - - edited

          Where this project will reside on HPC:

          /projects/tomato_genome/fnb/dataprocessing/trinity

          OK, this next step is too involved for a one-line command.
          Bash script needed!

          ann.loraine Ann Loraine made changes -
          Sprint Spring 6 [ 190 ] Spring 6, Spring 7 [ 190, 191 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          ann.loraine Ann Loraine made changes -
          Sprint Spring 6, Spring 7 [ 190, 191 ] Spring 6, Spring 7, Spring 8 [ 190, 191, 192 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          robofjoy Robert Reid added a comment -

          Putting notes related to the Trinity reference-free strategy here.
          This folder will hold miscellaneous notes related to Trinity.

          https://drive.google.com/drive/folders/1yMG7gWB67-falaLF1o-kDCQIK_E8mwgu?usp=drive_link

          robofjoy Robert Reid added a comment -

          Ran into a SNAFU. My mega collection of reads might be corrupted. Trinity does not like the files.

          Using a smaller subset to ensure that the code works correctly.

          And this smaller subset works great, creating 158,000 contigs.

          Now to rebuild the larger read set that will be used...
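One cheap check before relaunching on a rebuilt read set, sketched under the assumption that the files are uncompressed FASTQ with exactly four lines per record. The function names are hypothetical. This only catches truncated files and unequal R1/R2 read counts, not every kind of corruption.

```shell
#!/usr/bin/env bash
# Sketch only: basic sanity checks for paired FASTQ files.
# Assumes uncompressed, 4-line-per-record FASTQ; names are hypothetical.

# fastq_reads FILE
# Prints the number of records, or fails if the file is unreadable or its
# line count is not a multiple of 4 (a sign of truncation).
fastq_reads() {
    local lines
    if [ ! -r "$1" ]; then
        echo "$1: cannot read file" >&2
        return 1
    fi
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$1: $lines lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# check_pair R1 R2
# Fails unless both files pass fastq_reads and hold the same number of reads.
check_pair() {
    local n1 n2
    n1=$(fastq_reads "$1") || return 1
    n2=$(fastq_reads "$2") || return 1
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    echo "$n1 read pairs"
}

# Example, using the file names from the Trinity command in this ticket:
#   check_pair foo_r1_paired.fq foo_r2_paired.fq
```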

          robofjoy Robert Reid added a comment -

          The command line for a Trinity de novo run is:

          Trinity --seqType fq --max_memory 400G --output foo-trinity \
          --left foo_r1_paired.fq \
          --right foo_r2_paired.fq --CPU 24
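Since the same command is needed once per variety, it can be generated from the variety name. The sketch below only prints the commands and does not run Trinity. The flags are the ones shown above; the input naming pattern `<name>_r1_paired.fq` / `<name>_r2_paired.fq` is an assumption modelled on the `foo` placeholder, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print one Trinity command per variety.
# Input file naming is an assumption based on the "foo" placeholder above.

# trinity_cmd NAME
# Prints the Trinity command line for the variety NAME.
trinity_cmd() {
    local name=$1
    printf 'Trinity --seqType fq --max_memory 400G --output %s-trinity --left %s_r1_paired.fq --right %s_r2_paired.fq --CPU 24\n' \
        "$name" "$name" "$name"
}

# all_trinity_cmds
# Prints the command for each of the four varieties.
all_trinity_cmds() {
    local v
    for v in tamaulipas malintka heinz nagcarlang; do
        trinity_cmd "$v"
    done
}

# Example:
#   all_trinity_cmds
```

Printing first makes it easy to eyeball the paths before pasting each line into its own slurm script.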

          ann.loraine Ann Loraine made changes -
          Sprint Spring 6, Spring 7, Spring 8 [ 190, 191, 192 ] Spring 6, Spring 7, Spring 8, Spring 9 [ 190, 191, 192, 193 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          robofjoy Robert Reid added a comment -

          1 of 4 runs is complete.

          Got an error in 2 of the runs that I have never seen before: a ParaFly error.

          Apparently it is related to multithreading.
          The first bit of advice is to simply relaunch the runs.
          Will try this first, and then work backwards from there.

          MANY steps in the process have been completed.
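The "simply relaunch" advice can be wrapped in a small retry helper. This is a generic sketch, not part of the existing slurm scripts, and `retry` is a hypothetical name. It relies on Trinity generally picking up from the steps it already finished when it is rerun with the same `--output` directory, which should be confirmed against the run logs.

```shell
#!/usr/bin/env bash
# Sketch only: rerun a command until it succeeds or attempts run out.
# "retry" is a hypothetical helper name.

# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND up to MAX_ATTEMPTS times and returns 0 on the first success,
# or 1 if every attempt fails.
retry() {
    local max=$1 attempt=1
    shift
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, relaunching: $*" >&2
        attempt=$((attempt + 1))
    done
}

# Example, with the command from this ticket:
#   retry 3 Trinity --seqType fq --max_memory 400G --output foo-trinity \
#       --left foo_r1_paired.fq --right foo_r2_paired.fq --CPU 24
```

If the same error comes back on every attempt, the retry is not helping and it is time to work backwards from there.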

          robofjoy Robert Reid added a comment -

          For the Tam run, a totally different error from the one above.

          Transcript polishing issue... stay tuned.

          robofjoy Robert Reid added a comment -

          Will start step 2, the blat script to align reads back to the Heinz de novo transcripts.

          robofjoy Robert Reid added a comment -

          Success on all 4 trinity runs.

          Locations of the assembled contigs:

          • /projects/tomato_genome/fnb/dataprocessing/trinity/mal/malintka-trinity.Trinity.fasta
          • /projects/tomato_genome/fnb/dataprocessing/trinity/hei/heinz-trinity.Trinity.fasta
          • /projects/tomato_genome/fnb/dataprocessing/trinity/nag/nagcarlang-trinity.Trinity.fasta
          • /projects/tomato_genome/fnb/dataprocessing/trinity/tam/tamaulipas-trinity.Trinity.fasta

          Now we need to decide on next steps:
          • Blat to get annotations.
          • STAR-align the experiment reads to these contigs, or to a subset, to get the read counts (Nextflow / salmon).

          We expect 35,000 genes. We HAVE MANY more sequences than that:

          File: tamaulipas-trinity.Trinity.fasta
          Number of sequences: 797,447

          Hopefully many of these are isoforms. In reality we will also have many chimeras that are not biologically real.
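To see how much of the sequence count comes from isoforms, the number of sequences can be compared with the number of distinct gene IDs. This sketch assumes Trinity's usual header naming, where the isoform suffix `_i<number>` follows the gene ID (for example `TRINITY_DN1_c0_g1_i2`). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: compare sequence count with distinct gene count in a
# Trinity FASTA. Assumes IDs end in _i<number>; names are hypothetical.

# count_seqs FASTA
# Prints the number of FASTA records (header lines starting with ">").
count_seqs() {
    grep -c '^>' "$1"
}

# count_genes FASTA
# Prints the number of distinct IDs left after taking the first word of
# each header and stripping the trailing _i<number> isoform suffix.
count_genes() {
    grep '^>' "$1" \
        | sed -E 's/^>([^ ]+).*/\1/; s/_i[0-9]+$//' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Example:
#   count_seqs  tamaulipas-trinity.Trinity.fasta
#   count_genes tamaulipas-trinity.Trinity.fasta
```

A gene count that is still far above 35,000 would point toward fragments and chimeras rather than isoforms alone.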

          robofjoy Robert Reid added a comment -

          First blat run was a success.

          As per Ann's suggestion to use pslx to get the sequence also, I will run again like so:

          blat /projects/tomato_genome/db/SL5/SL5.cds.fa ./infile.fa blat-${file}.pslx \
          -ooc=${file}.11.ooc \
          -t=dna -q=dna \
          -maxIntron=10000 \
          -out=pslx

          This slurm script can be found at:
          /projects/tomato_genome/fnb/dataprocessing/trinity

          Once this works, we will repeat for the other 3 varieties. Then create a new ticket for the next sprint to parse the results, create a refined set of contigs with good annotations to become the NEW reference genome, and prepare Nextflow.

          ann.loraine Ann Loraine made changes -
          Sprint Spring 6, Spring 7, Spring 8, Spring 9 [ 190, 191, 192, 193 ] Spring 6, Spring 7, Spring 8, Spring 9, Spring 10 [ 190, 191, 192, 193, 194 ]
          ann.loraine Ann Loraine made changes -
          Rank Ranked higher
          robofjoy Robert Reid added a comment -

          Blat runs were a success.
          Going to close this task and start a number of related new ones.

          1. Need to have these runs tested. New student Brandon starts this week, so this is a good task for him to learn the HPC cluster and how to use slurm.
          2. Need a new task to start a Google slide deck outlining this project. This will be suitable for a biweekly tomato meeting. This will also allow me to ponder the next best steps.
          3. Need a new task to parse the results generated.
          robofjoy Robert Reid made changes -
          Status In Progress [ 3 ] Needs 1st Level Review [ 10005 ]
          robofjoy Robert Reid made changes -
          Status Needs 1st Level Review [ 10005 ] First Level Review in Progress [ 10301 ]
          robofjoy Robert Reid made changes -
          Status First Level Review in Progress [ 10301 ] Ready for Pull Request [ 10304 ]
          robofjoy Robert Reid made changes -
          Status Ready for Pull Request [ 10304 ] Pull Request Submitted [ 10101 ]
          robofjoy Robert Reid made changes -
          Status Pull Request Submitted [ 10101 ] Reviewing Pull Request [ 10303 ]
          robofjoy Robert Reid made changes -
          Status Reviewing Pull Request [ 10303 ] Merged Needs Testing [ 10002 ]
          robofjoy Robert Reid made changes -
          Status Merged Needs Testing [ 10002 ] Post-merge Testing In Progress [ 10003 ]
          robofjoy Robert Reid made changes -
          Resolution Done [ 10000 ]
          Status Post-merge Testing In Progress [ 10003 ] Closed [ 6 ]
          robofjoy Robert Reid added a comment -

          Script to parse the blat results.
          ~/scripts/python/identifyBlatMatches.py ./infile.fa ./blat-tamaulipas.pslx

          Command for later to summarize the Trinity contigs.
          /projects/tomato_genome/fnb/dataprocessing/trinity/tam/blat$ awk -F "_i" 'BEGIN { OFS = "\t" } { print $1 }' tmp2
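For the parsing task, one simple rule is to keep the hit with the most matching bases for each contig. The sketch below is not the logic of identifyBlatMatches.py, which is not shown in this ticket. It assumes the standard tab-separated PSL column order (column 1 = matches, column 10 = query name, column 14 = target name), which pslx shares, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: best blat hit per query from a PSL/PSLX file.
# Not the logic of identifyBlatMatches.py; best_hits is a hypothetical name.

# best_hits PSLX
# Prints "query<TAB>target<TAB>matches" for the hit with the most matching
# bases per query, sorted by query name. On ties the first hit seen is kept.
best_hits() {
    awk -F '\t' '
        $1 ~ /^[0-9]+$/ {                  # data rows only; skips the header
            if (!($10 in best) || $1 + 0 > best[$10]) {
                best[$10] = $1 + 0         # column 1: matches
                target[$10] = $14          # column 14: target (CDS) name
            }
        }
        END {
            for (q in best) print q "\t" target[q] "\t" best[q]
        }
    ' "$1" | sort
}

# Example:
#   best_hits blat-tamaulipas.pslx > best-hits.tsv
```

Ranking by raw match count favors long alignments; a percent-identity or coverage cutoff may be a better filter for spotting chimeras.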


            People

            • Assignee:
              robofjoy Robert Reid
              Reporter:
              robofjoy Robert Reid