In "bioinformatics_rnaseq/data" we have a tab-delimited file with sample information for study ERP004763 from the [European Nucleotide Archive](https://www.ebi.ac.uk/ena)
Use the bash utility **head** to look at the first few lines
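As a minimal sketch of this step, the snippet below previews a small tab-delimited table with **head** and aligns its columns with **column**. The file contents and the variable names (`demo_tsv`, `first_lines`) are hypothetical stand-ins, since the real sample table and its file name are not shown here.

```bash
# Hypothetical stand-in for the sample table (not the real ENA file).
demo_tsv="$(mktemp)"
printf 'run_accession\tsample_alias\tcondition\n' > "$demo_tsv"
printf 'RUN_A\tsample_1\tcontrol\n' >> "$demo_tsv"
printf 'RUN_B\tsample_2\ttreated\n' >> "$demo_tsv"

# head -n 2 keeps only the first two lines of the file.
first_lines="$(head -n 2 "$demo_tsv")"

# column -t aligns the fields into a table; -s sets the input delimiter to a tab.
printf '%s\n' "$first_lines" | column -t -s "$(printf '\t')"
```

Piping through **column** only changes how the preview is displayed; the file itself is left untouched.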
```{bash, echo=TRUE, eval=FALSE}
Use the bash utilities **head** and **column** to look at the first few lines
.small[ a contiguous sequence in the read that matches a segment of the genome
Continue with the unmapped portion of the read. If a read is not completely covered by MMPs, the MMPs are extended with mismatches (a) or indels (b), or the read is soft-clipped (c in the Figure below) ]
2. Clustering, stitching and scoring
.small[
Using MMPs as anchors, reads are stitched together. All seeds that fall within a user-defined genomic window (which determines the maximum intron length) are clustered. If not all seeds in a read fall within the window, a chimeric alignment is produced, as would happen with a gene fusion.
]
1. Find Maximum Mappable Prefixes (MMPs) in a read. MMPs can be extended by
a. mismatches
b. indels
c. soft-clipping
2. Cluster, stitch, and score MMPs to determine the final read location
- The intersection among RefGene, UCSC, and Ensembl annotations shows high overlap. RefGene has the fewest unique genes, while more than 50% of genes in Ensembl are unique
- Be consistent!
- Be consistent with your choice of annotation source!
<div class="my-footer"><span>Zhao et al BMC Genomics 2015 https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8</span></div>
...
...
@@ -413,12 +432,12 @@ Tufts HPC hosts genome reference data from [UCSC](https://genome.ucsc.edu/cgi-bi
For our data, we will need reference files from the Saccharomyces cerevisiae genome, version sacCer3.
We can explore available files like this:
```{bash, echo=TRUE, eval=FALSE}
```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/
tree -d
```
.pull-left[
```{bash, echo=TRUE, eval=FALSE}
```bash
├── Annotation
│ ├── Genes -
│ └── SmallRNA
...
...
@@ -443,29 +462,24 @@ The reference files that we need for this analysis are:
---
## Annotation file formats
STAR uses the GTF format for genome annotation
```{bash, echo=TRUE, eval=FALSE}
```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/Genes