bioinformatics_for_rnaseq_day2.html 25.1 KB
Newer Older
Rebecca E Batorsky's avatar
Rebecca E Batorsky committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title>Bioinformatics for RNAseq Workshop - day 2</title>
    <meta charset="utf-8" />
    <meta name="author" content="Rebecca Batorsky, Sr. Bioinformatics Specialist at Tufts" />
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <link href="libs/remark-css-0.0.1/default-fonts.css" rel="stylesheet" />
    <link href="libs/remark-css-0.0.1/hygge.css" rel="stylesheet" />
    <link rel="stylesheet" href="custom.css" type="text/css" />
    <link rel="stylesheet" href="footer-header.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Bioinformatics for RNAseq Workshop - day 2
### Rebecca Batorsky, Sr. Bioinformatics Specialist at Tufts
### April 2019

---

## Differential Expression with DESeq2

&lt;img src="fig/wf4.png" width="500px" style="display: block; margin: auto;" /&gt;
---
## Differential Expression with DESeq2

- A common goal for RNAseq analysis is to identify genes that are differentially expressed between conditions

- The treatment here follows closely the course from Harvard Chan Bioinformatics Core DGE workshop:
https://hbctraining.github.io/DGE_workshop

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Getting set up
.pull-left[
Jupyter Lab on the Tufts HPC cluster via the "On Demand" interface:

1. Point **Chrome** web browser to [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
2. Choose Rstudio from the Interactive Apps Menu
3. Choose 
  - Number of hours: 3
  - Number of cores: 4
  - Amount of Memory: 64 Gb
  - R version: 3.5.0
4. Press "Connect to Rstudio"
]
.pull-right[
&lt;img src="fig/rstudio_od.png" width="400px" style="display: block; margin: auto auto auto 0;" /&gt;
]
---
## DESeq2: Workflow overview
&lt;img src="fig/deseq_workflow_full_2018.png" width="500px" style="display: block; margin: auto;" /&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Loading libraries
Open the R file 'bioinformatics_rnaseq/deg.R' and execute

```r
# Put HPC biotools R libraries on your R path
.libPaths(c('', '/cluster/tufts/bio/tools/R_libs/3.5/'))

# load required libraries
library(DESeq2)
library(vsn) 
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggrepel)
library(DEGreport)
library(pheatmap)
library(org.Sc.sgd.db)
library(clusterProfiler)
```
---
## Reading in data
We'll switch to analyzing a preprocessed data set of 5 WT replicates and 5 SNF2 knockouts
```r
## Read in preprocessed count and meta data
course_data_path="~/bioinformatics_rnaseq/"
setwd(course_data_path)
data &lt;-read.table("sacCerfeatureCounts_gene_results.formatted.txt",header=TRUE)
meta &lt;- read.table("sample_info.txt", header=TRUE)

## View first few rows of tables
head(data)
head(meta)

## View table in new window
View(data)
View(meta)
```

Take a look at the first few lines of "data" and "meta" using **head()** or open the files in another tab using **View()**  

&lt;img src="fig/head.png" width="90%"&gt;
---
## Normalization

&lt;img src="fig/deseq_workflow_normalization_2018.png" width="400"&gt;

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
##Normalization
.pull-left[
The number of sequenced reads mapped to a gene depends on: 
- Gene Length
]
.pull-right[
&lt;img src="fig/normalization_methods_length.png" width="400"&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Normalization
.pull-left[
The number of sequenced reads mapped to a gene depends on: 
- Gene Length

- Sequencing depth
]
.pull-right[
&lt;img src="fig/normalization_methods_depth.png" width="500"&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Normalization
.pull-left[
The number of sequenced reads mapped to a gene depends on: 
- Gene Length

- Sequencing depth

- The expression level of other genes in the sample

- **It's own expression level**
]
.pull-right[
&lt;img src="fig/normalization_methods_composition.png" width="400"&gt;

]

Normalization eliminates the factors that are not of interest!
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Normalization
.pull-left[
The **naive** approach: divide by total library size for each sample is NOT applied anymore (CPM, TPM)

Why not? **Composition** matters!
]
.pull-right[
&lt;img src="fig/abs_vs_rel.png" width="100%"&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://ro-che.info/articles/2016-11-28-rna-seq-normalization&lt;/span&gt;&lt;/div&gt; 

---
##  Median of ratios method

High level: Accounts for both sequencing depth and composition

**Step 1: creates a pseudo-reference sample (row-wise geometric mean)**

For each gene, a pseudo-reference sample is created that is equal to the geometric mean across all samples.
.small[
| gene | sampleA | sampleB | pseudo-reference sample  |
| ----- |:-----:|:-----:|:-----:|
| EF2A | 1489 | 906 | `\(\sqrt(1489 * 906)\)` = **1161.5** |
| ABCD1 | 22 | 13 | `\(\sqrt(22 * 13)\)` = **17.7** |
| ... | ... | ... | ... |
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Normalization with DESeq2: Median of ratios method
**Step 2: calculates ratio of each sample to the reference**

Calculate the ratio of each sample to the pseudo-reference.
Since most genes aren't differentially expressed, ratios should be similar.
.small[
| gene | sampleA | sampleB | pseudo-reference sample  | ratio of sampleA/ref | ratio of sampleB/ref |
| ----- |:-----:|:-----:|:-----:| :-----: | :-----: |
| EF2A | 1489 | 906 | 1161.5 | 1489/1161.5 = **1.28** | 906/1161.5 = **0.78** |
| ABCD1 | 22 | 13 | 16.9 | 22/16.9 = **1.30** | 13/16.9 = **0.77** |
| ... | ... | ... | ... |
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
##  Median of ratios method
**Step 3: calculate the normalization factor for each sample (size factor)**

The median value (column-wise for the above table) of all ratios for a given sample is taken as the normalization factor (size factor) for that sample, as calculated below. Notice that the differentially expressed genes should not affect the median value:

`normalization_factor_sampleA &lt;- median(c(1.28, 1.3, ...))`   
`normalization_factor_sampleB &lt;- median(c(0.78, 0.77, ...))`

.pull-left[
- The figure illustrates the median value for the distribution of all gene ratios for a single sample (frequency is on the y-axis)  
- **This method is robust to imbalance in up-/down-regulation and large numbers of differentially expressed genes.**]
.pull-right[
&lt;img src="fig/deseq_median_of_ratios.png" width="400"&gt;
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Median of ratios method
**Step 4: calculate the normalized count values using the normalization factor**

This is performed by dividing each raw count value in a given sample by that sample's normalization factor to generate normalized count values. This is performed for all count values (every gene in every sample). For example, if the median ratio for SampleA was 1.3 and the median ratio for SampleB was 0.77, you could calculate normalized counts as follows:

SampleA median ratio = 1.3

SampleB median ratio = 0.77

.pull-left[
**Raw Counts**

| gene | sampleA | sampleB |  
| ----- |:-----:|:-----:|
| EF2A | 1489 | 906 | 
| ABCD1 | 22 | 13 | 
]
.pull-right[
**Normalized Counts**

| gene | sampleA | sampleB |
| ----- |:-----:|:-----:|
| EF2A | 1489 / 1.3 = **1145.39** | 906 / 0.77 = **1176.62** | 
| ABCD1 | 22 / 1.3 = **16.92** | 13 / 0.77 = **16.88** | 
]
---
.content-box-yellow[**Exercise**

Determine the normalized counts for your gene of interest, PD1, given the raw counts and size factors below. 

NOTE: You will need to run the code below to generate the raw counts dataframe (PD1) and the size factor vector (size_factors), then use these objects to determine the normalized counts values. Open a new r file.

```r

# Raw counts for PD1
PD1 &lt;- c(21, 58, 17, 97, 83, 10)
names(PD1) &lt;- paste0("Sample", 1:6)
PD1 &lt;- data.frame(PD1)
PD1 &lt;- t(PD1)

# Size factors for each sample
size_factors &lt;- c(1.32, 0.70, 1.04, 1.27, 1.11, 0.85)

```
]
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Create DESeq2 Dataset object 

Check to make sure that all rows labels in meta are columns in data!
```r
all(colnames(data) %in% rownames(meta))
all(colnames(data) == rownames(meta))
```
--

Create the dataset and run the analysis
```r
dds  &lt;- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ condition)
dds  &lt;- DESeq(dds)
```

.small[
Behind the scenes these steps were run:
- estimating size factors
- estimating dispersions
- gene-wise dispersion estimates
- mean-dispersion relationship
- final dispersion estimates
- fitting model and testing
]

--
Have a look at the size factors and normalized counts:

```r
sizeFactors(dds)
```
---
## DESeq2: Design formula

```r
dds  &lt;- DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ condition)
```
The design formula `design = ~condition`  
- Tells DESeq2 which factors in the metadata to test
- Can include multiple factors and combinations that are columns in the metadata
- The factor that you are testing for comes last, and factors that you want to account for come first.

E.g. design &lt;- ~ sex + age + condition

---

.content-box-yellow[**Exercise**

With the following table, if you wanted to test to difference in the two age groups, how would you write the design formula?

&lt;img src="fig/meta_example.png" width="300"&gt;
]
---
## Unsupervised Clustering

&lt;img src="fig/deseq_workflow_qc_2018.png" width="400"&gt;

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Unsupervised Clustering

QC step to asses overall similarity between samples: 

- Which samples are similar to each other, which are different? 

- Does this fit to the expectation from the experiment’s design? 

- What are the major sources of variation in the dataset?

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Principle Components Analysis

Here's an example dataset with 4 genes and two samples  
We'd like to use PCA to reduce the dimensions of the data

&lt;img src="fig/PCA_2sample_genes.png" width="100%"&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 

---
## PCA

.pull-left[
PCA starts by finding the path through the data that shows the largest spread.

**This is called the first principal component, or PC1.**  

Each point has a different "influence" on the direction of PC1

Then, the score is computed for each sample
]
.pull-right[
&lt;img src="fig/pca_with_PC1_line.png" width="300"&gt;
]

```
Sample1 PC1 score = (read count Gene A * influence Gene A) + 
    (read count Gene B * influence Gene B) + .. for all genes
```
--

For a more mathematical treatment of PCA using R: https://uc-r.github.io/pca

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## PCA

We obtain a 2x2 matrix for the first two PC:

&lt;img src="fig/PCA_samples.png" width="600"&gt;

In reality we'd have many more samples and many, many more genes.

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Make a PCA plot

- The regularized log transform improves visualization  
- This uses the built in function plotPCA from DESeq2

```r
rld &lt;- rlog(dds, blind=TRUE)
plotPCA(rld, intgroup="condition") + geom_text(aes(label=name),vjust=2)
```
&lt;img src="fig/pca_wrong.png" width="600"&gt;

---
.content-box-yellow[**Exercise**

Does something look wrong with the PCA plot?

Suppose we go back over the data and verify that somewhere in the processing steps, the headers for WT_rep1 and SNF_rep5 were switched.  Can you fix them and verify that the PCA looks as expected?

]
---
## Heirarchical Clustering

Another common method is to look at the sample-sample correlation in a heatmap.

- Check that samples are grouping as expected 
- Overall correlation is good

.pull-left[
```r
## Heirarchcal Clustering of sample correlation
rld_counts &lt;- assay(rld)  
rld_cor    &lt;- cor(rld_counts) 
pheatmap(rld_cor)
```
]
.pull-right[
&lt;img src="fig/hc.png" width="500"&gt;
]
---
## DESeq2 Workflow
&lt;img src="fig/DESeq2_workflow_2018.png" width="70%", align="center"&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Modeling count data

Our goal in modeling is to test for significant difference in expression

- DESeq2 will seek to fit a probability distibution to each gene and condition
- Wald test will tell us if the difference in gene expression is statistically significant.

&lt;img src="fig/expression_level.png" width="600"&gt;

---
## Modeling count data - which statistical distribution?
.pull-left[
```r
## Mean and variance for WT replicates
mean_counts &lt;- apply(data[, 1:5], 1, mean)
variance_counts &lt;- apply(data[, 1:5], 1, var)
df &lt;- data.frame(mean_counts, variance_counts)

ggplot(df) +
  geom_point(aes(x=mean_counts, y=variance_counts)) + 
  geom_line(aes(x=mean_counts, y=mean_counts, color="red")) +
  scale_y_log10() +
  scale_x_log10()
```
]
.pull-right[
&lt;img src="fig/mean_variance2.png" width="400"&gt;
]
---
## Modeling count data - which statistical distribution?

.pull-left[
How many parameters do we need?

- Poisson distribution 
  - 1 parameter P ( `\(\mu\)` )
  - mean = variance = `\(\mu\)`

- Negative binomial distribution
  - 2 parameters NB( `\(\mu\)` , `\(\alpha\)`)
  - allows extra source of variation
]
.pull-right[
&lt;img src="fig/mean_variance2.png" width="400"&gt;
]

---
## Modeling count data -  Negative Binomial

For Negative Binomial the variance has this form:

`\(Var = \mu + \alpha \mu^2\)`

For low average count `\(\mu\)` is small -&gt; Variance is dominated by technical noise (Poisson)  
For high average count, neglect the first term -&gt; Variance is dominated by the dispersion

Intuitively:

Var = techincal variation + biological variation

---
## Fitting the gene-wise dispersion estimates

```r
plotDispEsts(dds)
```
.pull-left[
Once each gene is fit, DESeq2 uses information across genes to improve confidence in parameters 
- Fit a curve to the dispersion estimates for all genes (red line)
- Points are moved closer to the predicted curve depending on:
  - how close it is to the curve
  - number of samples that were used in fit
- This shrinkage is necessary to avoid false positives 

]
.pull-right[
&lt;img src="fig/deseq_dispersion1.png" width="400"&gt;
]
---
## How well does the model fit our data?

Summary:
- A NB model is fit to each gene in our dataset, estimates for mean and dispersion
- A curve is fit to all genes, giving information about how dispersion varies with mean count
- Dispersion values are adjusted to be closer to the curve

Never work without biological replicates, try to have at least 4.

---
## The Generalized Linear Model

 &lt;img src="fig/NB_model_formula.png" width="600"&gt;

Then, coefficients are estimated which take into account comparisons for each group:

 &lt;img src="fig/NB_model_formula_betas.png" width="600"&gt;
`\(x_{jr}\)` is the model design. In our simple case x [0,1] whether it's  a mutant or not

This is called the linker function
The coefficient estimate the most likely log2 fold changes

&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Testing for Differential Expression
&lt;img src="fig/deseq2_workflow_separate_2018.png" width="400"&gt;
&lt;div class="my-footer"&gt;&lt;span&gt;https://hbctraining.github.io/DGE_workshop&lt;/span&gt;&lt;/div&gt; 
---
## Creating contrasts and running a Wald test

The null hypothesis: log fold change = 0 for across conditions

P-values are the probability of rejecting the null hypothesis 
```r
## Creating contrasts
contrast &lt;- c("condition", "SNF2", "WT")
res_unshrunken &lt;- results(dds, contrast=contrast)
summary(res_unshrunken)
```

```r
out of 6391 with nonzero total read count
adjusted p-value &lt; 0.1
LFC &gt; 0 (up)       : 1464, 23%
LFC &lt; 0 (down)     : 1623, 25%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count &lt; 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
```
---
## Shrinkage of the log2 fold changes

One more shrinking step - shrink the estimated log2 fold changes

Estimates of log fold change do not account for the large dispersion we observe with low read counts. To avoid this, the log2 fold changes calculated by the model need to be adjusted.


&lt;img src="fig/deseq2_shrunken_lfc.png" width="500"&gt;

This is not done by default, so we run the code:
```r
res &lt;- lfcShrink(dds, contrast=contrast, res=res_unshrunken)
```
&lt;div class="my-footer"&gt;&lt;span&gt;Love et al Genome Biology 2014&lt;/span&gt;&lt;/div&gt; 
---
## MA plot: Log ratio vs. average for comparison 
 
- Shows the mean of the normalized counts versus the log2 foldchanges for all genes tested 
- Genes that are significantly DE are colored to be easily identified and should span the range of fold changes 


```r
plotMA(res_unshrunken, ylim=c(-2,2))
plotMA(res, ylim=c(-2,2))
```
&lt;img src="fig/MA.png" width="70%"&gt;
---
## Exploring results

There are two ways to quickly check our data
```r
head(res)
```

```r
summary(res)
```
```bash
log2 fold change (MAP): condition SNF2 vs WT 
Wald test p-value: condition SNF2 vs WT 
DataFrame with 6 rows and 6 columns
                   baseMean     log2FoldChange             lfcSE              stat               pvalue                 padj
                  &lt;numeric&gt;          &lt;numeric&gt;         &lt;numeric&gt;         &lt;numeric&gt;            &lt;numeric&gt;            &lt;numeric&gt;
YAL069W                   0                 NA                NA                NA                   NA                   NA
YAL068W-A                 0                 NA                NA                NA                   NA                   NA
YAL068C   0.967880032164114  0.130810394924328 0.392843011761567 0.328000845094124    0.742911023648992    0.819176044192669
YAL067W-A                 0                 NA                NA                NA                   NA                   NA
YAL067C      40.87932305681   1.01424015380012 0.212805370098983  4.74630178261535 2.07169548468324e-06 1.31612384121377e-05
YAL066W   0.140275942926642 0.0448864068979152 0.208358779835131 0.217860827508585     0.82753754913751    0.880584827928377

```
---
## Plotting a single gene SNF2

We first need to find the Open Reading Frame that corresponds to SNF2

```r
## convert results to tibble
res_tb &lt;- res %&gt;%  
  data.frame() %&gt;%
  rownames_to_column(var="gene") %&gt;%
  as_tibble()

## annotation table to find a gene name 
keytypes(org.Sc.sgd.db)
anno &lt;- AnnotationDbi::select(org.Sc.sgd.db,                        # database table
                              keys = res_tb$gene,                   # keys from "gene" column
                              keytype = "ORF",                      # start with ORF
                              columns = c("ENTREZID", "GENENAME"))  # return these from database

## find the ORF corresponding to SNF2
subset(anno, GENENAME == "SNF2")
```
```r
&gt; keytypes(org.Sc.sgd.db)
 [1] "ALIAS"        "COMMON"       "DESCRIPTION"  "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"      
 [9] "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "INTERPRO"     "ONTOLOGY"     "ONTOLOGYALL" 
[17] "ORF"          "PATH"         "PFAM"         "PMID"         "REFSEQ"       "SGD"          "SMART"        "UNIPROT"
&gt; subset(anno, GENENAME == "SNF2")
         ORF        SGD ENTREZID GENENAME
6055 YOR290C S000005816   854465     SNF2
```
---
## Plotting a single gene SNF2

```r
## simple plot for a single gene
plotCounts(dds, gene="YOR290C", intgroup="condition") 
```
&lt;img src="fig/single_gene.pdf" width="40%"&gt;

---
## Filtering results

We may choose to view a subset of significant results
```r
# filtering significant genes
padj.cutoff &lt;- 0.05 # False Discovery Rate
lfc.cutoff  &lt;- 0.58 # log fold change 0.58 corresponds to a fold change of 1.5 

## filter results using cutoffs and sort by adjusted pvalue
sig_tb &lt;- res_tb %&gt;%
  filter(padj &lt; padj.cutoff &amp; abs(log2FoldChange) &gt; lfc.cutoff) %&gt;%  #tibble filtering
  arrange(pvalue)                                                    #sort by adjusted pvalue

## export results
file_name='results_pval_0.05_lfc_0.58.csv'
write.csv(res_tb, file)
```

---
## Plot multiple genes in a heatmap

We'll first convert the metadata, rld counts and significant results to tibble

```r
# plot multiple genes
## take the 50 most significant genes
sig_tb_50 &lt;- sig_tb[1:50,] # select top 50 DE genes

## convert metadata to tibble'
meta_tb &lt;- meta %&gt;% 
  rownames_to_column(var="samplename") %&gt;% 
  as_tibble()

## convert counts to tibble
rld_counts &lt;- rld_counts %&gt;% 
  data.frame() %&gt;%
  rownames_to_column(var="gene") %&gt;% 
  as_tibble()

## extract counts for significant genes
rld_sig &lt;- rld_counts %&gt;% 
  filter(gene %in% sig_tb_50$gene) %&gt;% 
  data.frame() %&gt;%
  column_to_rownames(var = "gene") 
```
---
## Plot multiple genes in a heatmap
.pull-left[
```r
# heatmap for top 50 most significant genes
pheatmap(rld_sig,    
         cluster_rows = T, 
         show_rownames = T,
         annotation = meta, 
         border_color = NA, 
         fontsize = 10, 
         scale = "row", 
         fontsize_row = 10, 
         height = 20)
```
]
.pull-right[
&lt;img src="fig/50_gene_heatmap.png" width="100%"&gt;
]
---
## summary

Remember to copy your data to a permanent location before it's deleted
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();</script>

<script>
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>