bioinformatics_for_rnaseq_day1.html 32.4 KB
Newer Older
Rebecca E Batorsky's avatar
Rebecca E Batorsky committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title>Bioinformatics for RNAseq Workshop</title>
    <meta charset="utf-8" />
    <meta name="author" content="Rebecca Batorsky, Sr. Bioinformatics Specialist at Tufts" />
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <link href="libs/remark-css-0.0.1/default-fonts.css" rel="stylesheet" />
    <link href="libs/remark-css-0.0.1/hygge.css" rel="stylesheet" />
    <link rel="stylesheet" href="custom.css" type="text/css" />
    <link rel="stylesheet" href="footer-header.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Bioinformatics for RNAseq Workshop
### Rebecca Batorsky, Sr. Bioinformatics Specialist at Tufts
### April 2019

---


## Experimental Methods
RNA Extraction + QC  
Library Prep methods  
Illumina Sequencing  
---
## Experimental Design
Avoiding bias: Randomization, blocking  
Statistical power  
Capturing variability 
---
## Bioinformatics Workflow
Our analysis will have the following steps with QC at each stage 

&lt;img src="fig/wf0.png" width="500px" style="display: block; margin: auto;" /&gt;
---
## Data for the course
.pull-left[
- mRNA data from 48 replicates of two Saccromyces cerevisiae populations 

- Wildtype (WT) and `\(\Delta\)`SNF2

- Unusually comprehensive analysis of variability in sequencing replicates
]
.pull-right[
&lt;img src="fig/paper.png" width="100%"&gt;
&lt;img src="fig/gier.png" width="100%"&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;Gierlinski et al Bioinformatics 2015&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754627&lt;/span&gt;&lt;/div&gt; 

---
## Start Jupyter Lab on the HPC cluster On Demand
.pull-left[
1. In **Chrome** web browser: [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
2. Interactive Apps -&gt; Jupyter Lab
  + Hours: 3 hours
  + Core: 8 cores
  + Memory: 64 GB
4. Press "Connect to Jupyter Lab"
5. Choose "Terminal" from the Launcher menu
6. A terminal will appear on the compute node where your job is running, where you can type bash commands  
]
.pull-right[
&lt;img src="fig/jn.png" width="70%" style="display: block; margin: auto;" /&gt;&lt;img src="fig/terminal.png" width="70%" style="display: block; margin: auto;" /&gt;
]
---
## Setting up our working directory

Make a directory in the training space called `your-user-name` e.g.rbator01:
```
cd /cluster/tufts/bio/tools/   
mkdir training/&lt;your-user-name&gt;
```
Copy the workshop materials into the new directory:
```
cp -r tutorials/bioinformatics-rnaseq/ training/&lt;your-user-name&gt;
```
Make a soft link into your home directory:
```
ln –s training/&lt;your-user-name&gt;/bioinformatics-rnaseq/ ~/ 
```
You now have a directory called "bioinformatics_rnaseq" in your home directory containing:
```
tree .
```
---
## Downloading data from a public archive
In "bioinformatics_rnaseq/data" we have a tab delimited file with samples information for study ERP004763 at [European Nucleotide Archive](https://www.ebi.ac.uk/ena)

Use bash utility **head** to look at the first few lines   

```bash
cd bioinformatics_rnaseq/data
head ERP004763_info.txt 
```
```
run_accession  condition  biol_rep  fastq_ftp
ERR458493      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
ERR458494      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458494/ERR458494.fastq.gz
ERR458495      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458495/ERR458495.fastq.gz
ERR458496      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458496/ERR458496.fastq.gz
ERR458497      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458497/ERR458497.fastq.gz
ERR458498      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458498/ERR458498.fastq.gz
ERR458499      WT         1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458499/ERR458499.fastq.gz
ERR458500      SNF2       1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458500/ERR458500.fastq.gz
ERR458501      SNF2       1         ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458501/ERR458501.fastq.gz
```
---
## Downloading data from a public archive
Find the accession numbers corresponding to WT and replicate 1 using contents using bash utlilty **awk** 


```bash
cat ERP004763_info.txt | awk '$2=="WT"'
```
```
ERR458493	WT	1	ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR459206	WT	48	ftp.sra.ebi.ac.uk/vol1/fastq/ERR459/ERR459206/ERR459206.fastq.gz
```

Find the accession numbers corresponding to the first biological replicate:
```bash
cat ERP004763_info.txt | awk '$3==1'
```
```
ERR458493	WT	1	ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR458506	SNF2	1	ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458506/ERR458506.fastq.gz
```

The pipe symbol "|" will tell bash to send the output of one program to another program for further processing  

---
.content-box-yellow[
**Exercise** (2 minutes):
Can you combine the above commands using a pipe in order to output only the lines corresponding to WT replicate 1?
]

---
## Downloading fastq files
We'll open a script in a text editor in Jupyter lab

.pull-left[
1. Click on "Files" in the left menu
2. Click the "Home" icon and navigate to "bioinformatics-rnaseq"
3. Double click the file "download_samples.sh" and it will open in a text editor
]
.pull-right[
&lt;img src="fig/jn_files.png" width="70%" /&gt;
]

```bash
#!/bin/bash
mkdir WT_1
cd WT_1
cat ERP004763_info.txt | awk '$3==1'| awk '$2=="WT"'| cut -f4 | xargs wget
cd ..
```
--
The script can be run by returning to your terminal and doing:

```bash
cd ../ 
./scripts/download_samples.sh
```
You should now have a folder WT_1 in your bioinformatics-rnaseq directory with seven fastq.gz files.

---
## Fastq files
&lt;img src="fig/wf1.png" width="500px" style="display: block; margin: auto;" /&gt;
---
## Fastq format

View the first 4 lines of one of the gzipped fastq files  without decompressing:

```bash
gzip -cd WT_1/ERR458493.fastq.gz | head -n 4
```
```
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
B@@FFFFFHHHGHJJJJJJIJJGIGIIIGI9DGGIIIEIGIIFHHGGHJIB
```
The four lines corresponding to a single read are:

1. Sequence identifier: \@Read ID and sequencing run info
2. Sequence
3. \+ optionally followed by sequence identifier again
4. Quality scores

---
## Quality Scores
Base quality scores encode the prediction of the probabilty of an error in base calling by the sequencing instrument.
The sequencing quality score of a given base, Q, is defined by: `$$Q = -10log_{10}(e)$$` where e is the estimated probability that the call is wrong. 

| Quality Score | Probability of an Incorrect Base Call | Inferred Base Call Accuracy |  
| ----- |:-----:|:-----:|
| 10 (Q10) | 1 in 10 | 90% | 
| 20 (Q20) | 1 in 100 | 99% | 
| 30 (Q30) | 1 in 1000 | 99.9% | 

--
So why don't we see numbers in the quality score?

```
B@@FFFFFHHHGHJJJJJJIJJGIGIIIGI9DGGIIIEIGIIFHHGGHJIB
```
---
### Quality Scores
In FASTQ files produced by Illumia software 1.8+, quality scores are encoded as the character with an ASCII code equal to its value + 33.
&lt;img src="fig/illumina_fastq_coding.png" width="60%" /&gt;
```
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
B@@FFFFFHHHGHJJJJJJIJJGIGIIIGI9DGGIIIEIGIIFHHGGHJIB
```
B = Q(33), probability of error of 1 in 2000
&lt;div class="my-footer"&gt;&lt;span&gt;https://www.illumina.com/science/education/sequencing-quality-scores.html/&lt;/span&gt;&lt;/div&gt; 
---
## FastQC
Widely used tool for both DNA and RNA sequencing data that is run on each fastq file.
To use, in the terminal

```bash
module load fastqc/0.11.8
mkdir fastqc
```

Since FastQC can run on multiple files at once, we'll use a wildcard "*" to indicate each file in the folder "WT_1":

```bash
fastqc WT_1/*.fastq.gz -o fastqc --extract
```

For a quick check, we can look at "summary.txt" file for each fastq, which has a line for each test that fastQC performs

```bash
cat fastqc/ERR458498_fastqc/summary.txt 
```
.small[
```
PASS	Basic Statistics	ERR458498.fastq.gz
PASS	Per base sequence quality	ERR458498.fastq.gz
WARN	Per tile sequence quality	ERR458498.fastq.gz
PASS	Per sequence quality scores	ERR458498.fastq.gz
FAIL	Per base sequence content	ERR458498.fastq.gz
PASS	Per sequence GC content	ERR458498.fastq.gz
PASS	Per base N content	ERR458498.fastq.gz
PASS	Sequence Length Distribution	ERR458498.fastq.gz
WARN	Sequence Duplication Levels	ERR458498.fastq.gz
PASS	Overrepresented sequences	ERR458498.fastq.gz
PASS	Adapter Content	ERR458498.fastq.gz
```
]
---
## MultiQC
This tool combines QC output across multiple samples.
To use on the HPC (not installed as a module yet):

```bash
module load anaconda
source activate /cluster/tufts/bio/tools/conda_envs/multiqc/1.7/
```

To run multiQC on all the fastq files for WT_1:

```bash
multiqc fastqc/ -o multiqc
```

Deactivate the conda environment
```
source deactivate
```

---
## Download MultiQC report 
- Click on Files-&gt; Home Directory -&gt; bioinformatics_rnaseq -&gt; multiqc

- Right click on multiqc_report.html -&gt; Download

- Double click on file and it will open in a web browser

&lt;img src="fig/jn_files_2.png" width="40%" style="display: block; margin: auto auto auto 0;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://multiqc.info/&lt;/span&gt;&lt;/div&gt; 
---
## FastQC results

We'll go through two of the plots that FastQC produces
- Sequence Quality Histograms

- Per Base Sequence Content
---
## FastQC results - Sequence Quality Histograms
- Distribution of quality scores across all bases at each position in the reads. 

- Drops at the end of reads due to molecules in a given sequencing cluster getting out of sync

- Drops in quality below Phred score of ~20 can be handled by read trimming.
&lt;img src="fig/fastqc0.png" width="100%" style="display: block; margin: auto auto auto 0;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://sequencing.qcfail.com//&lt;/span&gt;&lt;/div&gt; 
---
## FastQC results - Sequence Quality Histograms
Drops in quality below Phred score of ~20 can be handled by read trimming.
&lt;img src="fig/fastqc4.png" width="100%" style="display: block; margin: auto auto auto 0;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://sequencing.qcfail.com//&lt;/span&gt;&lt;/div&gt; 
---
## FastQC results - Per Base Sequence Content 
&lt;img src="fig/fastqc1.png" width="60%" style="display: block; margin: auto auto auto 0;" /&gt;
- Proportion of each position for which each DNA base has been called
- RNAseq data tends to show a positional sequence bias in the first ~12 bases
- The "random" priming step during library construction is not truly random and certain hexamers are more prevalent than others
- Studies have shown that this does NOT cause mis-called bases or drastic bias in sequenced fragments
&lt;div class="my-footer"&gt;&lt;span&gt;https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/&lt;/span&gt;&lt;/div&gt; 
---
## FastQC results - Per Base Sequence Content 

The right plot results show a strong positional bias throughout the reads, which in this case is due to the library having a certain sequence that is overrepresented

&lt;img src="fig/fastqc5.png" width="100%" style="display: block; margin: auto auto auto 0;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/&lt;/span&gt;&lt;/div&gt; 
---
## Read Alignment
&lt;img src="fig/wf2.png" width="500px" style="display: block; margin: auto;" /&gt;
---
## Read Alignment
.pull-left[
- Find the genomic origin of sequence fragments

- RNAseq data originates from spliced mRNA, ni introns

- When aligning to the genome, our aligner must find a spliced alignment for reads
]
.pull-right[
&lt;img src="fig/align0.png" width="100%" style="display: block; margin: auto;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;http://chagall.med.cornell.edu/RNASEQcourse/&lt;/span&gt;&lt;/div&gt;
]
---
## STAR (Spliced Transcripts Alignment to a Reference)

.pull-left[ 
- Highly accurate, memory intensive aligner
- Two phase mapping process
1. Find Maximum Mappable Prefix (MMP) in a read
.small[ a contiguous sequence in the read that matches a segment of the genome
Continue with the unmapped portion of the read. If a read is not completely covered by MMPs, the MMP are extended with mismatches (a) indels (b) or soft-clipped (c in the Figure below) ]
2. Clustering, stitching and scoring
.small[
Using MMP a anchors, reads are stitched together. All seeds that fall within a user-defined genomic window (which determines the maximum intron length) will be clustered. If all seeds in a read are not within the window, chimeric alignment is produced, such as would happen in gene fusion.
]
]
.pull-right[
&lt;img src="fig/align1.png" width="1139" style="display: block; margin: auto auto auto 0;" /&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;Dobin et al Bioinformatics 2013&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/&lt;/span&gt;&lt;/div&gt; 
---
## Genome annotation standards

- STAR can use an annotation file gives the location and structure of genes in order to improve alignment in known splice junctions

- Annotation is dynamic and there are at least three major sources of annotation

&lt;img src="fig/ann0.png" width="50%" style="display: block; margin: auto;" /&gt;

- The intersection among RefGene, UCSC, and Ensembl annotations shows high overlap. RefGene has the fewest unique genes, while more than 50% of genes in Ensembl are unique

- Be consistent!

&lt;div class="my-footer"&gt;&lt;span&gt;Zhao et al Bioinformatics 2015&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp; https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8&lt;/span&gt;&lt;/div&gt; 

---
## Genome annotation standards
RefSeq and Ensemble have different gene definitions for gene PIK3CA can give rise to differences in gene quantification.

&lt;img src="fig/ann1.png" width="800px" style="display: block; margin: auto auto auto 0;" /&gt;

.small[" We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis."]

&lt;div class="my-footer"&gt;&lt;span&gt;https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8&lt;/span&gt;&lt;/div&gt; 
---
## Final note on standards

&lt;img src="fig/ann2.png" width="600px" style="display: block; margin: auto;" /&gt;

&lt;div class="my-footer"&gt;&lt;span&gt;https://xkcd.com/927/&lt;/span&gt;&lt;/div&gt; 
---
## Reference files on the HPC
Tufts HPC hosts genome reference data from [UCSC](https://genome.ucsc.edu/cgi-bin/hgTables
) at the following location
```
/cluster/tufts/bio/data/genomes
```

For our data, we will need reference files from Saccharomyces_cerevisiae genome version sacCer3.
We can explore available files like this:


```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/
tree -d
```
.pull-left[

```bash
├── Annotation
│   ├── Genes -
│   └── SmallRNA
└── Sequence
    ├── AbundantSequences
    ├── Bowtie2Index
    ├── BowtieIndex
    ├── BWAIndex
    │   ├── version0.5.x
    │   └── version0.6.0
    ├── Chromosomes
    ├── HISAT2
    ├── STAR
    └── WholeGenomeFasta
```
]
.pull-right[
The reference files that we need for this analysis are:
1. Genome indexed for STAR aligner, under Sequence/STAR
2. Annotation file in GTF and BED formats, under Annotation/Genes/
]
---
## Annotation file formats
STAR uses a GTF format for genome annotation

```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/Genes
head sacCer3.gtf | column -ts $'\t'
```
```
chrI  sacCer3.genepred  transcript   335  649  .  +  .  gene_id "YAL069W"; transcript_id "YAL069W"; 
chrI  sacCer3.genepred  exon         335  649  .  +  .  gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI  sacCer3.genepred  CDS          335  646  .  +  0  gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI  sacCer3.genepred  start_codon  335  337  .  +  0  gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI  sacCer3.genepred  stop_codon   647  649  .  +  0  gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI  sacCer3.genepred  transcript   538  792  .  +  .  gene_id "YAL068W-A"; transcript_id "YAL068W-A"; 
chrI  sacCer3.genepred  exon         538  792  .  +  .  gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI  sacCer3.genepred  CDS          538  789  .  +  0  gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI  sacCer3.genepred  start_codon  538  540  .  +  0  gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI  sacCer3.genepred  stop_codon   790  792  .  +  0  gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
```

There is a simpler format which contains only one feature type, called BED.
Since there is no feature type, we download a table for each feature type of interest, e.g.:

```bash
head sacCer3.sgdGene.wholegene.bed
```
```
chrI  130798  131983  YAL012W    0  +  130798  131983  0  1  1185,  0,
chrI  334     649     YAL069W    0  +  334     649     0  1  315,   0,
chrI  537     792     YAL068W-A  0  +  537     792     0  1  255,   0,
```
---
.content-box-yellow[
**Exercise** 3 (10 minutes):
The gene we'll be analyzing is called SNF2 or YOR290C. Check to make sure it's represented consistently in the GTF and BED files using bash utility **grep**, e.g.:
```
grep YOR290C &lt;file name&gt;
```

Using the files above, how long is the gene? Does it have any introns?
]

--
&lt;img src="fig/ucsc_conventions.png" width="500px" style="display: block; margin: auto auto auto 0;" /&gt;

&lt;div class="my-footer"&gt;&lt;span&gt;http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/&lt;/span&gt;&lt;/div&gt; 
---
## Read Alignment

STAR alignment is a 2 step process:
1. Index Genome: genomeGenerate, needs to be done only once per genome
2. Align reads 

Although we'll use pre-indexed genome it's worth knowing how it's done  
--

Open up bioinformatics-rnaseq/scripts/sbatch_star_index.sh in a text editor

```bash
# load the module 
module load STAR/2.5.2b

# create a directory to store the index in
REF_DIR=/cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3

# Run STAR in "genomeGenerate" mode
STAR --runMode genomeGenerate \
--genomeDir ${REF_DIR}/Sequence/STAR \
--genomeFastaFiles ${REF_DIR}/Sequence/WholeGenomeFasta/genome.fa \
--runThreadN 12 
```
---
## Read Alignment: running the command

Open file 
The basic command

```bash
STAR --genomeDir &lt;directory of indexed genome&gt; \
--readFilesIn &lt;CSV list of gzipped fastq files&gt; \
--readFilesCommand zcat \
--outFileNamePrefix &lt;prefix for output file&gt; \
--outFilterMultimapNmax &lt;max number of alignment positions to allow&gt; \
--outSAMtype BAM SortedByCoordinate \
--runThreadN &lt;threads to use - must match SLURM request&gt; \
--alignIntronMin &lt;min size of intron&gt; \
--alignIntronMax &lt;max size of intron&gt;
--sjdbGTFfile &lt;path to GTF file&gt; \
--sjdbOverhang &lt;read length -1 &gt;
```
---
Open up bioinformatics-rnaseq/scripts/star_align.sh in a text editor
.small[ 

```bash
## Use STAR aligner to align all fastq files in a directory
## This step must be done for each sample
module load STAR/2.7.0a
mkdir -p STAR

## File is run as sbatch sbatch_align_star.sh &lt;folder&gt;
## Where &lt;folder&gt; contains fastq.gz files for 1 sample
SAMPLE=$1

## Obtain a comma separated list of files
FILES=`ls -m ${SAMPLE}/*fastq.gz | tr -d ' ' | tr -d '\n'`

## Name the output file, for example if the folder is /cluster/tufts/sample_1
## The output will have prefix sample_1_
OUT=$(basename $SAMPLE)

echo "Starting to align: $FILES"
echo "Output file will have prefix: $OUT"

REF_DIR=/cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3
# execute STAR in the runMode "alignReads"
STAR --genomeDir ${REF_DIR}/Sequence/STAR \
--readFilesIn $FILES \
--readFilesCommand zcat \
--outFileNamePrefix STAR/${OUT}_ \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--runThreadN 12 \
--alignIntronMin 1 \
--alignIntronMax 2500 \
--sjdbGTFfile ${REF_DIR}/Annotation/Genes/sacCer3.gtf \
--sjdbOverhang 49

# generate the bam index
module load samtools/1.2
samtools index STAR/${OUT}_Aligned.sortedByCoord.out.bam
```
]
---
## Read Alignment
.content-box-yellow[
**Exercise** (5 minutes): Run the command in the terminal using sbatch 
```
sbatch scripts/sbatch_star_align.sh WT_1
```
Check the result of your job submission
```
squeue -u &lt;your user name&gt;
```
View the outputs of your job while it's running like this:

```
cat &lt;job-number&gt;.err
cat &lt;job-number&gt;.out
```
Modify the script to ignore small introns and run it again:

```
--outFileNamePrefix STAR/${OUT}_nosmallintron_ \
--alignIntronMin 2500 \
--alignIntronMax 2500 \
```
]


---
## Read Alignment: Result files
```
ls -lh STAR/
```
.small[
```
-rw-rw-r-- 1 rbator01 biotools 272M Mar 25 15:45 WT_1_Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 rbator01 biotools  34K Mar 25 15:45 WT_1_Aligned.sortedByCoord.out.bam.bai
-rw-rw-r-- 1 rbator01 biotools 1.8K Mar 25 15:45 WT_1_Log.final.out
-rw-rw-r-- 1 rbator01 biotools  24K Mar 25 15:45 WT_1_Log.out
-rw-rw-r-- 1 rbator01 biotools  364 Mar 25 15:45 WT_1_Log.progress.out
-rw-rw-r-- 1 rbator01 biotools  46K Mar 25 15:45 WT_1_SJ.out.tab
drwx------ 2 rbator01 biotools 4.0K Mar 25 15:43 WT_1__STARgenome
```
]
---
```bash
cat WT_1_Log.final.out
```
.small[
```bash
                          Number of input reads |	7014609
                      Average input read length |	51
                                    UNIQUE READS:
                   Uniquely mapped reads number |	6014703
                        Uniquely mapped reads % |	85.75%
                          Average mapped length |	50.72
                       Number of splices: Total |	55354
            Number of splices: Annotated (sjdb) |	47840
                       Number of splices: GT/AG |	50848
                       Number of splices: GC/AG |	50
                       Number of splices: AT/AC |	3
               Number of splices: Non-canonical |	4453
                      Mismatch rate per base, % |	0.36%
                         Deletion rate per base |	0.00%
                        Deletion average length |	0.00
                        Insertion rate per base |	0.00%
                       Insertion average length |	1.04
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	0
             % of reads mapped to multiple loci |	0.00%
        Number of reads mapped to too many loci |	792806
             % of reads mapped to too many loci |	11.30%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |	0.00%
                 % of reads unmapped: too short |	2.92%
                     % of reads unmapped: other |	0.04%
                                  CHIMERIC READS:
                       Number of chimeric reads |	0
                            % of chimeric reads |	0.00%
```
- We should look for &gt;75% of the reads to be uniquely mapped 
- Further QC options are available with RSEQC and samtools packages (see scripts/sbatch_rseqc.sh )
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://github.com/hbc/NGS_Data_Analysis_Course/ and http://rseqc.sourceforge.net/&lt;/span&gt;&lt;/div&gt; 
---
## BAM format


---
## Visualizing reads with IGV
- Return to On Demand "Dashboard" tab: https://ondemand.cluster.tufts.edu
- Choose Interactive Apps-&gt;IGV
  - hours: 1
  - cores: 4
  - memory: 64 Gb
  - directory: /cluster/tufts/bio/tools/igv/igv_home/ 

&lt;img src="fig/igv_od.png" width="50%" /&gt;
---
## IGV settings for RNAseq
.pull-left[
- In top menu click **View** -&gt; **Preferences**

&lt;img src="fig/igv_pref1.png" width="100%" /&gt;
]
.pull-right[
- Click **Alignments** -&gt; **Track Display Options**, check **Splice Junction Track**
&lt;img src="fig/igv_pref2.png" width="1424" /&gt;
]
---
## Visualizing Reads with IGV
.pull-left[
- Choose Genome sacCer3
&lt;img src="fig/igv_genome.png" width="620" /&gt;
]
.pull-right[
- Choose the two BAM files we generated, with and without introns
&lt;img src="fig/igv_files.png" width="681" /&gt;
]

---
### Viewing splice junctionsin IGV
Sashimi plots are used for showing spliced reads, especially useful for isoform visualization

.pull-left[
&lt;img src="fig/igv_sashimi1.png" width="1357" /&gt;
]
.pull-right[
&lt;img src="fig/igv_sashimi2.png" width="1552" /&gt;
]

When available, IGV uses the "XS" tag provided by the alignment to determine strandedness. If missing, strandeness is inferred from the read strand. For paired-end data the strand of the alignment marked "first in pair" is used.   

---
.content-box-yellow[
**Exercise** (2 minutes):
View a few genes that have introns: SUS1,ACT1, GLC7, and PRE3. The splice reads track shows the strand direction of the mapped reads. Based on this, do you think we have stranded or unstranded reads?
]

---
## Counting reads

&lt;img src="fig/wf3.png" width="500px" style="display: block; margin: auto;" /&gt;
---
## Counting reads: three classes of tools

Method | Tools
------------- | -------------
align to a gene | featureCounts, HTSeq
align to a transcript | RSEM, Cufflinks
consistent with a transcript without alignment | Sailfish, Kallisto

&lt;img src="fig/gene_transcript.png" width="100%" style="display: block; margin: auto;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;http://chagall.med.cornell.edu/RNASEQcourse/&lt;/span&gt;&lt;/div&gt; 

--
We'll use method 1, featureCounts.
RSEM is reccomended if you are trying to compare isoforms of a gene.
---
## Counting reads: featureCounts

.pull-left[
- The mapped coordinates of each read are compared with the "features" in the GTF file
- Reads that overlap with a gene by &gt;=1 bp are counted as belonging to that feature
- Ambiguous reads will be discarded
- Output is a read table

Gene | SampleA | SampleB | SampleC
------------- | -------------
 A1BG | 57 | 41 | 3
 A1CF | 4 | 56 | 78
 A2M | 0 | 0 | 10
... ||| 
]
.pull-right[
&lt;img src="fig/fc.png" width="821" /&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;http://bioinf.wehi.edu.au/featureCounts/&lt;/span&gt;&lt;/div&gt; 
---
## Counting reads: running the script

Open bioinformatics-rnaseq/scripts/sbatch_featurecounts_gene.sh in Jupyter Lab text editor

```bash
module load subread/1.6.3

mkdir featurecounts

REF=/cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/Genes/sacCer3.gtf
DIR=$1

featureCounts \
-a $REF \
-o featurecounts/featurecounts_results.txt \
$DIR/*bam
```

To run:

```bash
./scripts/featurecounts.sh STAR

```
Output files:
featurecounts_results.txt
featurecounts_results.txt.summary

---
## Counting reads: results
```
cat featurecounts_results.txt.summary | column -ts $'\t'
```
.small[
```
Status                         STAR//WT_1_Aligned.sortedByCoord.out.bam  STAR//WT_1_nosmall__Aligned.sortedByCoord.out.bam
Assigned                       5395145                                   5396147
Unassigned_Unmapped            0                                         0
Unassigned_MappingQuality      0                                         0
Unassigned_Chimera             0                                         0
Unassigned_FragmentLength      0                                         0
Unassigned_Duplicate           0                                         0
Unassigned_MultiMapping        1561475                                   0
Unassigned_Secondary           0                                         0
Unassigned_Nonjunction         0                                         0
Unassigned_NoFeatures          232980                                    233079
Unassigned_Overlapping_Length  0                                         0
Unassigned_Ambiguity           386578                                    386500
```
]
---
## Counting reads: notes

- note on strandedness: -s 1 (+ stranded), 2 (- stranded), 0 (unstranded, default)
- note on paired end reads: For paired-end (PE) data, the bam file contains information about whether both read1 and read2 mapped and if they were at roughly the correct distance from each other, that is to say if they were "properly" paired. For most counting tools, only properly paired reads are considered by default, and each read pair is counted only once as a single "fragment".

---
## Tracking read numbers

As the analysis progresses you should keep track of the following:

Number of reads | Source | Result
------------- | ------------- | -------------
raw | FastQC |   7.1 M
( after trimming ) | ( FastQC after trimming )  |  NA  
aligned to genome| STAR log |  6 M
associated with genes | featureCounts log | 5.4 M 

&lt;div class="my-footer"&gt;&lt;span&gt;https://github.com/hbc/NGS_Data_Analysis_Course&lt;/span&gt;&lt;/div&gt; 
---
## Summary
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();</script>

<script>
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>