  1. IGB
  2. IGBF-3772

Run NEXTFLOW on contigs that are NOT SL4 or SL5

    Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
      None

      Description

      GOAL:
      To set up and run the nf-core/rnaseq pipeline to align our RNA-Seq sequences against the de novo assembled contigs (Trinity, SPAdes).

      Nextflow requires a GTF file and a BED file along with the contigs. There are various ways to generate these, and problems in these files may cause issues downstream.

      Otherwise, we can follow Molly's protocol:
      https://docs.google.com/document/d/1ig9ET-ykXF5nAX3P487cXWmZDGUlQpcwrvFXpbyP5vw/edit?usp=sharing

            Activity

            robofjoy Robert Reid added a comment -

            And all the bam files exist:

            To check:
            /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh$ ls -lrt Tam-run-2/results-3.14.0/star_salmon/*bam

            The BAM files vary in size, from 800 MB for the smallest to 5 GB for the largest. It is not clear whether that is anything to be concerned about.

            Next phase: Create 1 large salmon counts file using the results from these 4 salmon files. That means new ticket!
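
For the follow-up ticket, a minimal Python sketch of how the per-run count tables could be combined. It assumes each table is tab-separated with `gene_id` and `gene_name` as the leading columns and one column per sample after that; the function name, the run labels, and the `NA` placeholder are illustrative choices, not anything decided in this ticket. Because each variety was aligned to its own contigs, IDs may not match across runs, so this is an outer join that marks absent IDs as `NA` rather than 0.

```python
import csv
from typing import Dict, Iterable, List

# Assumed identifier columns in the per-run count tables.
ID_COLUMNS = ("gene_id", "gene_name")


def merge_count_tables(tables: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Outer-join tab-separated count tables on their first column (the ID)."""
    columns: List[str] = []
    merged: Dict[str, Dict[str, str]] = {}
    for run, lines in tables.items():
        reader = csv.reader(lines, delimiter="\t")
        header = next(reader)
        # Prefix sample names with the run label so columns cannot collide.
        samples = [(i, f"{run}:{name}") for i, name in enumerate(header)
                   if name not in ID_COLUMNS]
        columns.extend(label for _, label in samples)
        for row in reader:
            if not row:
                continue
            counts = merged.setdefault(row[0], {})
            for i, label in samples:
                counts[label] = row[i]
    result = [["gene_id"] + columns]
    for gene_id in sorted(merged):
        # "NA" means the ID was absent from that run, which is not a zero count.
        result.append([gene_id] + [merged[gene_id].get(c, "NA") for c in columns])
    return result


# Tiny in-memory demonstration with made-up values.
tam = ["gene_id\tgene_name\tT.1\n", "g1\tg1\t5\n"]
mal = ["gene_id\tgene_name\tM.1\n", "g1\tg1\t7\n", "g2\tg2\t1\n"]
for row in merge_count_tables({"Tam": tam, "Mal": mal}):
    print("\t".join(row))
```

In practice each value in the dictionary would be an open file handle for one run's gene-count table.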

            robofjoy Robert Reid added a comment -

            Looks like we now have the proper tsv salmon count files needed for the next step.

            Lots of them.
            rreid2@str-i2:/projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh$ ls Tam-run-2/results-3.14.0/star_salmon/*tsv
            Tam-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_length_scaled.tsv Tam-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_counts.tsv
            Tam-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_scaled.tsv Tam-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_lengths.tsv
            Tam-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts.tsv Tam-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_tpm.tsv
            Tam-run-2/results-3.14.0/star_salmon/salmon.merged.gene_lengths.tsv Tam-run-2/results-3.14.0/star_salmon/tx2gene.tsv
            Tam-run-2/results-3.14.0/star_salmon/salmon.merged.gene_tpm.tsv
            rreid2@str-i2:/projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh$ ls Mal-run-2/results-3.14.0/star_salmon/*tsv
            Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_length_scaled.tsv Mal-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_counts.tsv
            Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_scaled.tsv Mal-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_lengths.tsv
            Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts.tsv Mal-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_tpm.tsv
            Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_lengths.tsv Mal-run-2/results-3.14.0/star_salmon/tx2gene.tsv
            Mal-run-2/results-3.14.0/star_salmon/salmon.merged.gene_tpm.tsv
            rreid2@str-i2:/projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh$ ls Nag-run-2/results-3.14.0/star_salmon/*tsv
            Nag-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_length_scaled.tsv Nag-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_counts.tsv
            Nag-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts_scaled.tsv Nag-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_lengths.tsv
            Nag-run-2/results-3.14.0/star_salmon/salmon.merged.gene_counts.tsv Nag-run-2/results-3.14.0/star_salmon/salmon.merged.transcript_tpm.tsv
            Nag-run-2/results-3.14.0/star_salmon/salmon.merged.gene_lengths.tsv Nag-run-2/results-3.14.0/star_salmon/tx2gene.tsv
            Nag-run-2/results-3.14.0/star_salmon/salmon.merged.gene_tpm.tsv

            bbendick Brandon Bendickson added a comment -

            Results are found in /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW/start_fresh
            for Nag, Tam, and Mal

            robofjoy Robert Reid added a comment -

            In Mal, Hei, Nag, and Tam, there is the full complement of BAM files and BAI index files,
            and the sizes appear correct.

            For the TSV salmon counts, it looks as though only Heinz has the TSV file of counts; the other 3 runs do not.
            Something is amiss!

            In Mal, there is an error along the Nextflow path:

            Command executed:

            inner_distance.py \
            -i M.25C.0hr.U.1.sorted.bam \
            -r blat-mal.bed \
            -o M.25C.0hr.U.1 \
            \
            > stdout.txt
            head -n 2 stdout.txt > M.25C.0hr.U.1.inner_distance_mean.txt

            cat <<-END_VERSIONS > versions.yml
            "NFCORE_RNASEQ:RNASEQ:BAM_RSEQC:RSEQC_INNERDISTANCE":
            rseqc: $(inner_distance.py --version | sed -e "s/inner_distance.py //g")
            END_VERSIONS

            And in Tam, there is a totally different error; the run gets further along before the error occurs.

            These errors are nasty in that they do not crash the pipeline but let it continue rolling along.
            We need to back up a few steps and rerun Mal, Tam, and Nag.

            robofjoy Robert Reid added a comment -

            Rob needs to review that all 4 runs appear complete.

            Will check for proper bam sizes, bai, and bedgraphs, etc.
            And that the salmon table exists.
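
The review could be partly automated. Below is a minimal Python sketch, assuming the BAM files, their indexes, and the salmon tables all sit in one `star_salmon` directory per run. The function name and the list of expected tables are illustrative assumptions, and bedgraph checks are left out because their location is not recorded in this ticket.

```python
from pathlib import Path
from typing import Dict, List

# Assumed set of tables that should exist in a finished run.
EXPECTED_TABLES = (
    "salmon.merged.gene_counts.tsv",
    "salmon.merged.transcript_counts.tsv",
    "tx2gene.tsv",
)


def check_run(star_salmon_dir: Path) -> Dict[str, List[str]]:
    """Report missing indexes, empty BAM files, and missing salmon tables."""
    bams = sorted(star_salmon_dir.glob("*.bam"))

    def has_index(bam: Path) -> bool:
        # Accept either sample.bam.bai or sample.bai next to the BAM file.
        return Path(str(bam) + ".bai").exists() or bam.with_suffix(".bai").exists()

    return {
        "bam_files": [b.name for b in bams],
        "bam_without_index": [b.name for b in bams if not has_index(b)],
        "empty_bam": [b.name for b in bams if b.stat().st_size == 0],
        "missing_tables": [t for t in EXPECTED_TABLES
                           if not (star_salmon_dir / t).is_file()],
    }


def report(run_dirs: List[Path]) -> None:
    """Print one line per problem category for each run directory."""
    for run_dir in run_dirs:
        result = check_run(run_dir)
        print(run_dir, "BAM files:", len(result["bam_files"]))
        for key in ("bam_without_index", "empty_bam", "missing_tables"):
            if result[key]:
                print("  ", key, result[key])
```

Calling `report([...])` with the four `star_salmon` directories would flag a run whose count tables are missing even though the pipeline itself did not crash.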

            bbendick Brandon Bendickson added a comment -

            The final run on Nag just completed. All results are found in /projects/tomato_genome/fnb/dataprocessing/brandon_work/NEXTFLOW

            bbendick Brandon Bendickson added a comment -

            I am running into issues with the BED file. Heinz ran without error, despite having an empty BED file. I tried using the commands in an earlier comment to generate a BED file, but that resulted in an error and the death of the run. I also tried using an empty BED file, as for Heinz, but that also resulted in an error.

            INFO: Converting SIF file to temporary sandbox...
            [E::idx_find_and_load] Could not retrieve index file for 'M.25C.3hr.U.3.sorted.bam'
            Get exon regions from blat-mal.bed ...
            Traceback (most recent call last):
            File "/usr/local/bin/inner_distance.py", line 95, in <module>
            main()
            File "/usr/local/bin/inner_distance.py", line 87, in main
            obj.mRNA_inner_distance(outfile=options.output_prefix,low_bound=options.lower_bound_size,up_bound=options.upper_bound_size,step=options.step_size,refbed=options.ref_gene,sample_size=options.sampleSize, q_cut = options.map_qual)
            File "/usr/local/lib/python3.9/site-packages/qcmodule/SAM.py", line 3580, in mRNA_inner_distance
            for exn in bed_obj.getExon():
            File "/usr/local/lib/python3.9/site-packages/qcmodule/BED.py", line 478, in getExon
            chrom_start = int(f[1])
            ValueError: invalid literal for int() with base 10: 'Solyc08T000468.1'
            INFO: Cleaning up image...
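
The ValueError above means that the second whitespace-separated field of a line in blat-mal.bed was 'Solyc08T000468.1' where an integer start coordinate was expected. A minimal Python sketch for catching this before launching a run is shown below. The function name is hypothetical, and the default of 12 columns is an assumption based on RSeQC generally expecting a 12-column BED gene model; pass a smaller `min_columns` to check coordinates only.

```python
from typing import Iterable, List


def find_bad_bed_lines(lines: Iterable[str], min_columns: int = 12) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems: List[str] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        # Split on any whitespace, which mirrors how the failing field was read.
        fields = line.split()
        if len(fields) < 3:
            problems.append(f"line {number}: fewer than 3 columns")
            continue
        for label, value in (("start", fields[1]), ("end", fields[2])):
            if not value.isdigit():
                problems.append(
                    f"line {number}: {label} is not a non-negative integer: {value!r}"
                )
        if len(fields) < min_columns:
            problems.append(
                f"line {number}: {len(fields)} columns, expected at least {min_columns}"
            )
    return problems


# Demonstration: a name containing a space shifts every later column.
example = ["contig1 Solyc08T000468.1\t1\t2649\tcontig1 Solyc08T000468.1\n"]
for problem in find_bad_bed_lines(example, min_columns=3):
    print(problem)
```

The demonstration line is made up to show one way such a value could land in the start column; the ticket does not confirm that this is what happened in blat-mal.bed.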

            robofjoy Robert Reid added a comment -

            Update on these Nextflow runs.

            Brandon is running into an additional tmp folder write / remove issue that needs resolving.
            Otherwise it is VERY close to the end.
            Hopefully we will meet today and hash out any other problems that arise.

            Once the run reaches the end, we will rerun on the other 3 varieties to confirm all is working OK.

            Also, this ticket needs to be moved back to In Progress, but I was not able to do that; I only have the option to move it in the forward direction.

            robofjoy Robert Reid added a comment -

            Ann recently ran Nextflow on the tardigrade data:

            https://jira.bioviz.org/browse/IGBF-3790

            This is a bit better than Molly's doc (Only took 2 months to become antiquated!!!!)

            ann.loraine Ann Loraine added a comment - edited

            Notes from Ann:

            • Use the latest version of the nf-core/rnaseq pipeline, or another pipeline as appropriate given the goals. Ann asked the URC team to install a newer version of Nextflow, which the latest rnaseq pipeline requires.
            • Rob: Investigate whether there is another nf-core pipeline that is better for what we want to do: "Count the number of aligned reads for each assembled contig, where the contigs are hypothetical transcript sequences"
            • Rob: Also, before doing anything further, check results from contig quality checking - "RNA-quast"
            robofjoy Robert Reid added a comment -

            cat blat-heinz-bestLongHit.fna | awk '$0 ~ "^>" {name=substr($0, 2); printf name"\t1\t"}' > blat-heinz.bed

            robofjoy Robert Reid added a comment -

            Mental note for Rob:

            Add the bloody modules!

            srun --partition Draco --cpus-per-task 16 --mem-per-cpu 12000 --time 60:00:00 --pty bash

            module load nf-core
            module load singularity

            robofjoy Robert Reid added a comment -

            Wrong Heinz FASTA file. I tried using the full set of 800,000 contigs as the reference file. It crashed; samtools could not even index it.

            Using the blat-heinz-bestLongHit.fna file instead. These are the FASTA sequences that had the best matches to the original Heinz reference via blat.

            /projects/tomato_genome/fnb/dataprocessing/trinity/NEXTFLOW/hei-run1$ seqinfo blat-heinz-bestLongHit.fna

            File blat-heinz-bestLongHit.fna

            Number of sequences 38857

            Residue counts:
            Total 102938111

            Sequence lengths:
            Minimum 184
            Maximum 28109
            Average 2649.15
            N50 3592
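
For reference, the summary numbers above can be reproduced from a list of sequence lengths. This is a minimal Python sketch using one common N50 definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of all residues). Whether seqinfo uses exactly this definition is an assumption.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length at which the longest-first running total reaches half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# The reported average follows directly from the two totals in the output above.
average = 102938111 / 38857
print(round(average, 2))
```

The average check uses only figures from the seqinfo output; computing N50 for the real file would need the individual contig lengths.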

            Rerunning script as follows:
            ./doIt.sh heinz-SRP499796-files.csv blat-heinz-bestLongHit.fna Trinity.gtf blat-heinz.bed tomato.config 1> out.txt 2> err.txt1

            robofjoy Robert Reid added a comment -

            Location:
            /projects/tomato_genome/fnb/dataprocessing/trinity/NEXTFLOW/hei-run1

            Initial run:

            ./doIt.sh heinz-SRP499796-files.csv ../heinz-trinity.Trinity.fasta Trinity.gtf blat-heinz.bed tomato.config 1> out.txt 2> err.txt

            robofjoy Robert Reid added a comment -

            Trinity has a tool to make a GTF via perl.

            perl /apps/pkg/trinity/2.14.0/util/misc/cdna_fasta_file_to_transcript_gtf.pl tamaulipas-trinity.Trinity.fasta | grep -w "exon" - > Trinity.gtf

            robofjoy Robert Reid added a comment -

            To make a bed file:

            cat ../postblat/blat-heinz-bestLongHit.fna | awk '
            $0 ~ "^>" {name=substr($0, 2); printf name"\t1\t"}
            $0 !~ "^>" {printf length($0)"\t"name"\n"}
            ' > blat-heinz.bed
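
A more defensive variant can be sketched in Python. It rests on several assumptions that this ticket does not confirm: FASTA headers may carry a description after the contig ID (so only the first token is kept), sequences may be wrapped over several lines (so lengths are summed), BED starts are 0-based, and the downstream RSeQC steps want a 12-column BED with one block per contig on the "+" strand. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (contig_name, sequence_length) for each FASTA record."""
    name = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the first token so a description cannot leak into the BED.
            tokens = line[1:].split()
            name = tokens[0] if tokens else "unnamed"
            length = 0
        elif name is not None:
            # Wrapped sequences span several lines, so accumulate the length.
            length += len(line)
    if name is not None:
        yield name, length


def bed12_line(name: str, length: int) -> str:
    """One single-block BED12 record covering the whole contig."""
    fields = [name, 0, length, name, 0, "+", 0, length, 0, 1, f"{length},", "0,"]
    return "\t".join(str(field) for field in fields)


# Demonstration on an in-memory record with a description and wrapped sequence.
demo = [">contig1 some description\n", "ACGTAC\n", "GGTT\n"]
for contig, size in read_fasta_lengths(demo):
    print(bed12_line(contig, size))
```

To use it on a real file, pass an open file handle for the `.fna` file to `read_fasta_lengths` and write each `bed12_line` result to the BED file.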


              People

              • Assignee:
                robofjoy Robert Reid
                Reporter:
                robofjoy Robert Reid