- mRNA data from 48 replicates of two Saccharomyces cerevisiae populations
- Wildtype (WT) and $\Delta$SNF2
- Unusually comprehensive analysis of variability in sequencing replicates
]
.pull-right[
<img src="fig/paper.png" width="100%">
<img src="fig/gier.png" width="100%">
]
<div class="my-footer"><span>Gierlinski et al Bioinformatics 2015              https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4754627</span></div>
---
## Start Jupyter Lab on the HPC cluster On Demand
.pull-left[
1. In the **Chrome** web browser, go to [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
2. Interactive Apps -> Jupyter Lab
    + Hours: 3 hours
    + Core: 8 cores
    + Memory: 64 GB
4. Press "Connect to Jupyter Lab"
5. Choose "Terminal" from the Launcher menu
6. A terminal will appear on the compute node where your job is running, where you can type bash commands
You now have a directory called "bioinformatics_rnaseq" in your home directory containing:
```
tree .
```
---
## Downloading data from a public archive
In "bioinformatics_rnaseq/data" we have a tab-delimited file with sample information for study ERP004763 at the [European Nucleotide Archive](https://www.ebi.ac.uk/ena)
Use the bash utility **head** to look at the first few lines
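
A minimal sketch of inspecting a tab-delimited sample table with **head**. The temporary file, column names, and values are made up for illustration; substitute the actual file in "bioinformatics_rnaseq/data".

```shell
# Build a tiny tab-delimited stand-in for the sample table.
# File name, column names and values are hypothetical, for illustration only.
sample_table="$(mktemp)"
printf 'run_accession\tsample_title\n' >  "$sample_table"
printf 'RUN0001\tWT_rep01\n'           >> "$sample_table"
printf 'RUN0002\tSNF2_rep01\n'         >> "$sample_table"
printf 'RUN0003\tWT_rep02\n'           >> "$sample_table"

# head prints the first 10 lines by default; -n sets how many to show.
head -n 2 "$sample_table"

# Count the columns in the header row (awk splits fields on tabs with -F'\t').
n_cols=$(head -n 1 "$sample_table" | awk -F'\t' '{print NF}')
echo "columns: $n_cols"
```

Checking the header and column count first makes it easier to pick the right field when the table is later used to build download commands.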
The plot on the right shows a strong positional bias throughout the reads, which in this case is due to an overrepresented sequence in the library
## STAR (Spliced Transcripts Alignment to a Reference)
.pull-left[
- Highly accurate, memory intensive aligner
- Two phase mapping process
1. Find Maximum Mappable Prefix (MMP) in a read
.small[ An MMP is the longest contiguous sequence, starting at the first base of the read, that exactly matches a segment of the genome.
The search then continues with the unmapped portion of the read. If a read is not completely covered by MMPs, the MMPs are extended with mismatches (a) or indels (b), or the remaining bases are soft-clipped (c in the Figure below) ]
2. Clustering, stitching and scoring
.small[
Using MMPs as anchors, reads are stitched together. All seeds that fall within a user-defined genomic window (which determines the maximum intron length) are clustered. If not all seeds in a read fall within the window, a chimeric alignment is produced, as would happen with a gene fusion.
]
]
.pull-right[
```{r,echo=FALSE,fig.align="left"}
knitr::include_graphics("fig/align1.png")
```
]
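
The MMP idea can be sketched with a toy example. This is a naive substring scan on made-up sequences, written only to show what "maximum mappable prefix" means; STAR itself searches an uncompressed suffix array, and the function name `mmp_length` is hypothetical.

```shell
# Toy illustration of a Maximum Mappable Prefix (MMP) search.
# Sequences are made up; this naive scan is NOT how STAR is implemented.
genome="ACGTTGCAAGGCTTACGGATCCA"
read_seq="GCAAGGCTTTTT"

# Print the length of the longest prefix of $1 that occurs exactly in $2.
mmp_length() {
  query=$1
  ref=$2
  len=${#query}
  while [ "$len" -gt 0 ]; do
    prefix=$(printf '%s' "$query" | cut -c "1-$len")
    case $ref in
      *"$prefix"*) echo "$len"; return ;;
    esac
    len=$((len - 1))
  done
  echo 0
}

mmp_len=$(mmp_length "$read_seq" "$genome")
mapped=$(printf '%s' "$read_seq" | cut -c "1-$mmp_len")
# The rest of the read is what the next MMP search would start from.
remainder=$(printf '%s' "$read_seq" | cut -c "$((mmp_len + 1))-")
echo "MMP: $mapped (length $mmp_len)"
echo "Unmapped remainder: $remainder"
```

In a real aligner the remainder would be searched again, extended with mismatches or indels, or soft-clipped.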
<div class="my-footer"><span>Dobin et al Bioinformatics 2013                        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3530905/</span></div>
---
## Genome annotation standards
- STAR can use an annotation file, which gives the location and structure of genes, to improve alignment at known splice junctions
- Annotation is dynamic and there are at least three major sources of annotation
- The intersection among RefGene, UCSC, and Ensembl annotations shows high overlap. RefGene has the fewest unique genes, while more than 50% of genes in Ensembl are unique
- Be consistent!
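
A hedged sketch of passing an annotation file to STAR when building a genome index. The paths are placeholders, and `--sjdbOverhang 100` is an assumed value (ideally read length minus 1); the command runs only if STAR and the input files are present.

```shell
# Hypothetical paths; substitute the reference files used in the workshop.
genome_dir="genome_index"
genome_fasta="genome.fa"
annotation_gtf="genes.gtf"

# --sjdbGTFfile supplies the known splice junctions from the annotation.
# --sjdbOverhang 100 is an assumption; ideally it is read length minus 1.
# --runThreadN 8 matches the 8 cores requested for the Jupyter Lab job.
build_star_index() {
  STAR --runMode genomeGenerate \
       --genomeDir "$genome_dir" \
       --genomeFastaFiles "$genome_fasta" \
       --sjdbGTFfile "$annotation_gtf" \
       --sjdbOverhang 100 \
       --runThreadN 8
}

# Only run when STAR is installed and both input files exist.
if command -v STAR >/dev/null 2>&1 && [ -f "$genome_fasta" ] && [ -f "$annotation_gtf" ]; then
  mkdir -p "$genome_dir"
  build_star_index && index_status="built"
else
  index_status="skipped"
  echo "STAR or input files not found; skipping index build"
fi
```

Whichever annotation source is chosen here should also be the one used for quantification downstream.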
<div class="my-footer"><span>Zhao et al BMC Genomics 2015               https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8</span></div>
---
## Genome annotation standards
RefSeq and Ensembl have different gene definitions for PIK3CA, which can give rise to differences in gene quantification.
.small["We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis."]
The gene we'll be analyzing is called SNF2 or YOR290C. Check that it is represented consistently in the GTF and BED files using the bash utility **grep**, e.g.:
```
grep YOR290C <file name>
```
Using the files above, how long is the gene? Does it have any introns?
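
One way to approach this, sketched on a toy GTF. The coordinates and the `toy` source field below are made up, so the numbers are not the real answer; run the same commands on the workshop files.

```shell
# Toy GTF with made-up coordinates, for illustration only.
toy_gtf="$(mktemp)"
printf 'chrXV\ttoy\tgene\t1001\t1600\t.\t-\t.\tgene_id "YOR290C"; gene_name "SNF2";\n' >  "$toy_gtf"
printf 'chrXV\ttoy\texon\t1001\t1600\t.\t-\t.\tgene_id "YOR290C"; gene_name "SNF2";\n' >> "$toy_gtf"
printf 'chrXV\ttoy\tgene\t2001\t2300\t.\t+\t.\tgene_id "YOR291W";\n'                   >> "$toy_gtf"

# GTF columns 4 and 5 are the 1-based, inclusive start and end,
# so the feature length is end - start + 1.
gene_length=$(grep YOR290C "$toy_gtf" | awk -F'\t' '$3 == "gene" {print $5 - $4 + 1}')

# Count exon rows for the gene; a single exon means no introns.
exon_count=$(grep YOR290C "$toy_gtf" | awk -F'\t' '$3 == "exon"' | wc -l | tr -d ' ')

echo "gene length: $gene_length bp, exons: $exon_count"
```

BED files use a 0-based start with an exclusive end, so the length there is simply end minus start; the two formats should agree once that offset is taken into account.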