IGB / IGBF-2970

Re-run nf-core/rnaseq using proper strand designation and better sample prefix

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Description

      The first run of the rnaseq pipeline used an input samples.csv file that specified the wrong strandedness.

      The protocol used to synthesize the libraries (see attached) was not strand-specific, so the samples should be marked as unstranded.

      Also, the sample name prefixes are awkward and long. Since every sample name contains "120min", let's drop it and shorten the prefixes to:

      GENOTYPE_TREATMENT_REPLICATE

      Genotypes:

      • VF36 (wildtype)
      • F3H-OX3 (F3H over-expressing line 3)
      • F3H-OX4 (F3H over-expressing line 4)
      • are (mutant, anthocyanin reduced)

      Treatments (duration 2 hours):

      • 28 degrees C for 2 hours (control), harvested at 120 minutes
      • 28 degrees C for 30 minutes, then shifted to 34 degrees C (treatment), harvested at 120 minutes

      Sample codes:

      genotype: VF36, F3H-OX3, F3H-OX4
      treatment: C (control), T (treatment)
      replicate: R1, R2, R3
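
The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not the script used for the actual run: it assumes paired-end reads, a hypothetical `fastq/` directory with hypothetical `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` file names, and the `sample,fastq_1,fastq_2,strandedness` samplesheet layout that nf-core/rnaseq 3.x expects. The genotype list is taken from the sample codes as written; extend it if other genotypes were sequenced.

```python
import csv
import io
from itertools import product

# Codes copied from the "Sample codes" list in this ticket.
GENOTYPES = ["VF36", "F3H-OX3", "F3H-OX4"]
TREATMENTS = ["C", "T"]          # C = control, T = heat treatment
REPLICATES = ["R1", "R2", "R3"]


def sample_names(genotypes=GENOTYPES, treatments=TREATMENTS, replicates=REPLICATES):
    """Build GENOTYPE_TREATMENT_REPLICATE prefixes for every combination."""
    return ["_".join(parts) for parts in product(genotypes, treatments, replicates)]


def samplesheet(names, fastq_dir="fastq", strandedness="unstranded"):
    """Return samplesheet text with one row per sample.

    fastq_dir and the FASTQ file naming pattern are hypothetical;
    replace them with the real locations of the read files.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for name in names:
        writer.writerow([
            name,
            f"{fastq_dir}/{name}_1.fastq.gz",
            f"{fastq_dir}/{name}_2.fastq.gz",
            strandedness,
        ])
    return buf.getvalue()


if __name__ == "__main__":
    print(samplesheet(sample_names()), end="")
```

Writing the strandedness value once, as a function argument, keeps every row consistent, which avoids the kind of per-sample mistake that made this re-run necessary.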

            Activity

            Ann Loraine added a comment (edited)

            Trying test profile with my custom configuration file:

            nextflow run nf-core-rnaseq-3.4/workflow -profile test,singularity -c intron_max.conf
            

            Note I'm running this on the head node because compute nodes lack internet access.

            Ann Loraine added a comment (edited)

            Output from above:

            Core Nextflow options
              runName                   : awesome_archimedes
              containerEngine           : singularity
              launchDir                 : /nobackup/tomato_genome/testing
              workDir                   : /nobackup/tomato_genome/testing/work
              projectDir                : /nobackup/tomato_genome/testing/nf-core-rnaseq-3.4/workflow
              userName                  : aloraine
              profile                   : test,singularity
              configFiles               : /nobackup/tomato_genome/testing/nf-core-rnaseq-3.4/workflow/nextflow.config, /nobackup/tomato_genome/testing/intron_max.conf
            

            But then it failed with an error:

            The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: 4D4FXNNYMZ34CY64; S3 Extended Request ID: bFzqa/jwya1G8EE8LOcSA7+sMo7GnEnR6CtxUEVpJcGgeNq+/fMFM9s9rsO/Ixn/6nFJnOd9nO0=)
            
             -- Check script 'nf-core-rnaseq-3.4/workflow/./workflows/../subworkflows/local/input_check.nf' at line: 33 or see '.nextflow.log' file for more details
            
            Ann Loraine added a comment (edited)

            Tried again with:

            nextflow run nf-core/rnaseq -profile test,singularity -c intron_max.conf
            

            The BAM header shows that the correct parameter (--alignIntronMax 13000) was used:

            [aloraine@str-i1 testing]$ samtools view -H work/13/7f46a902a683e41db4d94b8609a88d/WT_REP2.sorted.bam
            @HD	VN:1.4	SO:coordinate
            @SQ	SN:I	LN:230218
            @SQ	SN:Gfp_transgene	LN:729
            @PG	ID:STAR	PN:STAR	VN:STAR_2.6.1d	CL:STAR   --runThreadN 2   --runRNGseed 0   --genomeDir star   --readFilesIn WT_REP2_1_val_1.fq.gz   WT_REP2_2_val_2.fq.gz      --readFilesCommand zcat      --outFileNamePrefix WT_REP2.   --outSAMtype BAM   Unsorted      --outSAMattributes NH   HI   AS   NM   MD      --outSAMattrRGline ID:WT_REP2   SM:WT_REP2      --outFilterMultimapNmax 20   --alignIntronMax 13000   --alignSJDBoverhangMin 1   --sjdbGTFfile genome_gfp.gtf   --quantMode TranscriptomeSAM      --quantTranscriptomeBan Singleend   --twopassMode Basic
            @PG	ID:samtools	PN:samtools	PP:STAR	VN:1.12	CL:samtools sort -@ 2 -o WT_REP2.sorted.bam -T WT_REP2.sorted WT_REP2.Aligned.out.bam
            @PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.10	CL:samtools view -H work/13/7f46a902a683e41db4d94b8609a88d/WT_REP2.sorted.bam
            @RG	ID:WT_REP2	SM:WT_REP2
            @CO	user command line: STAR --genomeDir star --readFilesIn WT_REP2_1_val_1.fq.gz WT_REP2_2_val_2.fq.gz --runThreadN 2 --outFileNamePrefix WT_REP2. --sjdbGTFfile genome_gfp.gtf --outSAMattrRGline ID:WT_REP2 SM:WT_REP2 --alignIntronMax 13000 --quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand zcat --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
            

            Note, however, that this run used the default aligner, "star_salmon".

            Ann Loraine added a comment (edited)

            Running the test profile again, this time with star_rsem, which is the aligner I was trying to use with my data.

            The initial output appears to have picked up this option:

            Alignment options
              aligner                   : star_rsem
              pseudo_aligner            : salmon
            

            With star_rsem, the maximum intron parameter was not set as I requested in my custom config file: STAR ran with --alignIntronMax 1000000 instead of 13000, as shown by the output of samtools view -H:

            [aloraine@str-i1 star_rsem]$ samtools view -H WT_REP1.markdup.sorted.bam
            @HD	VN:1.6	SO:coordinate
            @SQ	SN:I	LN:230218
            @SQ	SN:Gfp_transgene	LN:729
            @PG	ID:STAR	PN:STAR	VN:2.7.6a	CL:STAR   --runThreadN 2   --genomeDir ./rsem   --genomeLoad NoSharedMemory   --readFilesIn WT_REP1_1_val_1.fq.gz   WT_REP1_2_val_2.fq.gz      --readFilesCommand zcat      --outFileNamePrefix ./tmp//WT_REP1   --outSAMtype BAM   Unsorted      --outSAMattributes NH   HI   AS   NM   MD      --outSAMunmapped Within      --outSAMheaderHD @HD   VN:1.4   SO:unsorted      --outFilterType BySJout   --outFilterMultimapNmax 20   --outFilterMismatchNmax 999   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000   --alignMatesGapMax 1000000   --alignSJoverhangMin 8   --alignSJDBoverhangMin 1   --sjdbScore 1   --quantMode TranscriptomeSAM   
            @PG	ID:samtools	PN:samtools	PP:STAR	VN:1.12	CL:samtools sort -@ 2 -o WT_REP1.sorted.bam -T WT_REP1.sorted WT_REP1.STAR.genome.bam
            @PG	ID:MarkDuplicates	VN:2.23.9	CL:MarkDuplicates INPUT=[WT_REP1.sorted.bam] OUTPUT=WT_REP1.markdup.sorted.bam METRICS_FILE=WT_REP1.markdup.sorted.MarkDuplicates.metrics.txt REMOVE_DUPLICATES=false ASSUME_SORTED=true TMP_DIR=[tmp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false	PN:MarkDuplicates
            @PG	ID:samtools.1	PN:samtools	PP:samtools	VN:1.10	CL:samtools view -H WT_REP1.markdup.sorted.bam
            @PG	ID:samtools.2	PN:samtools	PP:MarkDuplicates	VN:1.10	CL:samtools view -H WT_REP1.markdup.sorted.bam
            @CO	user command line: STAR --genomeDir ./rsem --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 2 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD @HD VN:1.4 SO:unsorted --outFileNamePrefix ./tmp//WT_REP1 --readFilesCommand zcat --readFilesIn WT_REP1_1_val_1.fq.gz WT_REP1_2_val_2.fq.gz
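
The manual check above (reading the STAR command line out of the BAM header) can be automated. The following is a minimal sketch, not part of the pipeline: the function names are made up for illustration, it assumes `samtools` is on the PATH, and the two header strings are abbreviated copies of the `@PG ID:STAR` lines quoted in this ticket.

```python
import re
import subprocess


def read_bam_header(bam_path):
    """Return the text header of a BAM file via `samtools view -H`."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def star_option(header_text, option):
    """Return the first value STAR recorded for `option`, or None if absent.

    Only the @PG record whose ID is STAR is inspected.
    """
    pattern = re.compile(r"--" + re.escape(option) + r"\s+(\S+)")
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "@PG" and "ID:STAR" in fields:
            match = pattern.search(line)
            if match:
                return match.group(1)
    return None


# Abbreviated @PG lines from the two test runs described in this ticket.
STAR_SALMON_HEADER = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@PG\tID:STAR\tPN:STAR\tCL:STAR   --outFilterMultimapNmax 20"
    "   --alignIntronMax 13000   --alignSJDBoverhangMin 1\n"
)
STAR_RSEM_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@PG\tID:STAR\tPN:STAR\tCL:STAR   --alignIntronMin 20"
    "   --alignIntronMax 1000000   --alignMatesGapMax 1000000\n"
)

if __name__ == "__main__":
    for label, header in [("star_salmon", STAR_SALMON_HEADER),
                          ("star_rsem", STAR_RSEM_HEADER)]:
        print(label, star_option(header, "alignIntronMax"))
```

For a real file, pass `read_bam_header("WT_REP1.markdup.sorted.bam")` as the header text and compare the returned value with the one requested in the custom config.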
            
            Ann Loraine added a comment (edited)

            Re-ran the pipeline and transferred the data to Google Drive using rclone.
            Also copied the data to the RENCI Quickload site.
            See: http://lorainelab-quickload.scidas.org/hotpollen/


              People

              • Assignee: Ann Loraine
              • Reporter: Ann Loraine