Commit 4ab31250 authored by Rebecca E Batorsky's avatar Rebecca E Batorsky
Browse files

test md

parent c0e04970
......@@ -23,7 +23,6 @@ Statistical power
Capturing variability
---
## Bioinformatics Workflow
Our analysis will have the following steps with QC at each stage
```{r, out.width = "500px",echo=FALSE,fig.align="center"}
knitr::include_graphics("fig/wf0.png")
......@@ -33,7 +32,7 @@ knitr::include_graphics("fig/wf0.png")
.pull-left[
- mRNA data from 48 replicates of two Saccromyces cerevisiae populations
- Wildtype (WT) and $\Delta$SNF2
- Wildtype (WT) and SNF2 knock-out ( $\Delta$snf2 )
- Unusually comprehensive analysis of variability in sequencing replicates
]
......@@ -46,14 +45,18 @@ knitr::include_graphics("fig/wf0.png")
---
## Start Jupyter Lab on the HPC cluster On Demand
.pull-left[
1. In **Chrome** web browser: [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
1. **Chrome** web browser: [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
2. Interactive Apps -> Jupyter Lab
+ Hours: 3 hours
+ Core: 8 cores
+ Memory: 64 GB
4. Press "Connect to Jupyter Lab"
5. Choose "Terminal" from the Launcher menu
6. A terminal will appear on the compute node where your job is running, where you can type bash commands
+ Hours: 4
+ Core: 4
+ Memory: 32 GB
3. Press "Connect to Jupyter Lab"
4. Choose "Terminal" from the Launcher menu
5. A terminal will appear on the compute node where your job is running, where you can type bash commands
]
.pull-right[
```{r, out.width = "70%",echo=FALSE,fig.align="center"}
......@@ -65,32 +68,54 @@ knitr::include_graphics("fig/terminal.png")
## Setting up our working directory
Make a directory in the training space called `your-user-name` e.g.rbator01:
```
```bash
cd /cluster/tufts/bio/tools/
mkdir training/<your-user-name>
```
Copy the workshop materials into the new directory:
```
```bash
cp -r tutorials/bioinformatics-rnaseq/ training/<your-user-name>
```
Make a soft link into your home directory:
```
```bash
ln –s training/<your-user-name>/bioinformatics-rnaseq/ ~/
```
You now have a directory called "bioinformatics_rnaseq" in your home directory containing:
You now have a directory called "bioinformatics_rnaseq" in your home directory:
```bash
cd ~/bioinformatics-rnaseq
```
---
## Setting up our working directory
Look at the contents of that directory:
```bash
tree .
```
```bash
├── data
│ ├── ERP004763_info.txt
│ ├── sacCerfeatureCounts_gene_results.formatted.mod.txt
│ └── sample_info.txt
└── scripts
├── bamqc.sh
├── deseq2_v2.R
├── fastqc_multiqc.sh
├── fastqc.sh
├── featurecounts.sh
├── sbatch_star_align.sh
└── sbatch_star_index.sh
```
---
## Downloading data from a public archive
In "bioinformatics_rnaseq/data" we have a tab delimited file with samples information for study ERP004763 at [European Nucleotide Archive](https://www.ebi.ac.uk/ena)
Use bash utility **head** to look at the first few lines
```{bash, echo=TRUE, eval=FALSE}
Use bash utility **head** and **column** to look at the first few lines
```bash
cd bioinformatics_rnaseq/data
head ERP004763_info.txt
```
head ERP004763_info.txt | column -ts $'\t'
```
```bash
run_accession condition biol_rep fastq_ftp
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
ERR458494 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458494/ERR458494.fastq.gz
......@@ -104,12 +129,12 @@ ERR458501 SNF2 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR45850
```
---
## Downloading data from a public archive
Find the accession numbers corresponding to WT and replicate 1 using contents using bash utlilty **awk**
Find the accession numbers corresponding to WT and replicate 1 using contents using bash utlilty **cat** and **awk**
```{bash, echo=TRUE, eval=FALSE}
cat ERP004763_info.txt | awk '$2=="WT"'
```
```
```bash
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR459206 WT 48 ftp.sra.ebi.ac.uk/vol1/fastq/ERR459/ERR459206/ERR459206.fastq.gz
......@@ -119,7 +144,7 @@ Find the accession numbers corresponding to the first biological replicate:
```bash
cat ERP004763_info.txt | awk '$3==1'
```
```
```bash
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR458506 SNF2 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458506/ERR458506.fastq.gz
......@@ -129,7 +154,7 @@ The pipe symbol "|" will tell bash to send the output of one program to another
---
.content-box-yellow[
**Exercise** (2 minutes):
**Exercise** (5 minutes):
Can you combine the above commands using a pipe in order to output only the lines corresponding to WT replicate 1?
]
......@@ -154,9 +179,13 @@ cd WT_1
cat ERP004763_info.txt | awk '$3==1'| awk '$2=="WT"'| cut -f4 | xargs wget
cd ..
```
--
The script can be run by returning to your terminal and doing:
```{r, engine = 'bash', echo=TRUE, eval=FALSE}
---
## Downloading fastq files
1. Click on "Files" in the left menu
2. Click the "Home" icon and navigate to "bioinformatics-rnaseq"
3. Double click the file "download_samples.sh" and it will open in a text editor
4. The script can be run by returning to your terminal and doing:
```bash
cd ../
./scripts/download_samples.sh
```
......@@ -171,10 +200,10 @@ knitr::include_graphics("fig/wf1.png")
## Fastq format
View the first 4 lines of one of the gzipped fastq files without decompressing:
```{bash, echo=TRUE, eval=FALSE}
```bash
gzip -cd WT_1/ERR458493.fastq.gz | head -n 4
```
```
```bash
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
......@@ -205,12 +234,12 @@ So why don't we see numbers in the quality score?
B@@FFFFFHHHGHJJJJJJIJJGIGIIIGI9DGGIIIEIGIIFHHGGHJIB
```
---
### Quality Scores
### Quality Score Encoding
In FASTQ files produced by Illumia software 1.8+, quality scores are encoded as the character with an ASCII code equal to its value + 33.
```{r, out.width = "60%",echo=FALSE}
knitr::include_graphics("fig/illumina_fastq_coding.png")
```
```
```bash
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
......@@ -222,22 +251,22 @@ B = Q(33), probability of error of 1 in 2000
## FastQC
Widely used tool for both DNA and RNA sequencing data that is run on each fastq file.
To use, in the terminal
```{bash, echo=TRUE, eval=FALSE}
```bash
module load fastqc/0.11.8
mkdir fastqc
```
Since FastQC can run on multiple files at once, we'll use a wildcard "*" to indicate each file in the folder "WT_1":
```{bash, echo=TRUE, eval=FALSE}
```bash
fastqc WT_1/*.fastq.gz -o fastqc --extract
```
For a quick check, we can look at "summary.txt" file for each fastq, which has a line for each test that fastQC performs
```{bash, echo=TRUE, eval=FALSE}
```bash
cat fastqc/ERR458498_fastqc/summary.txt
```
.small[
```
```bash
PASS Basic Statistics ERR458498.fastq.gz
PASS Per base sequence quality ERR458498.fastq.gz
WARN Per tile sequence quality ERR458498.fastq.gz
......@@ -253,23 +282,16 @@ PASS Adapter Content ERR458498.fastq.gz
]
---
## MultiQC
This tool combines QC output across multiple samples.
To use on the HPC (not installed as a module yet):
```{bash, echo=TRUE, eval=FALSE}
module load anaconda
source activate /cluster/tufts/bio/tools/conda_envs/multiqc/1.7/
We ran FastQC for 7 fastq files: MultiQC combines QC output for convenient viewing
```bash
module load multiqc/1.7.0
```
To run multiQC on all the fastq files for WT_1:
```{r, engine = 'bash', echo=TRUE, eval=FALSE}
```bash
multiqc fastqc/ -o multiqc
```
Deactivate the conda environment
```
source deactivate
```
---
## Download MultiQC report
- Click on Files-> Home Directory -> bioinformatics_rnaseq -> multiqc
......@@ -293,20 +315,18 @@ We'll go through two of the plots that FastQC produces
## FastQC results - Sequence Quality Histograms
- Distribution of quality scores across all bases at each position in the reads.
- Drops at the end of reads due to molecules in a given sequencing cluster getting out of sync
- Drops in quality below Phred score of ~20 can be handled by read trimming.
```{r, out.width = "100%",echo=FALSE,fig.align="left"}
knitr::include_graphics("fig/fastqc0.png")
```
<div class="my-footer"><span>https://sequencing.qcfail.com//</span></div>
<div class="my-footer"><span>https://sequencing.qcfail.com/articles/loss-of-base-call-accuracy-with-increasing-sequencing-cycles</span></div>
---
## FastQC results - Sequence Quality Histograms
Drops in quality below Phred score of ~20 can be handled by read trimming.
- Drops at the end of reads due to molecules in a given sequencing cluster getting out of sync
- Drops in quality below Phred score of ~20 can be handled by read trimming.
```{r, out.width = "100%",echo=FALSE,fig.align="left"}
knitr::include_graphics("fig/fastqc4.png")
```
<div class="my-footer"><span>https://sequencing.qcfail.com//</span></div>
<div class="my-footer"><span>https://sequencing.qcfail.com/articles/loss-of-base-call-accuracy-with-increasing-sequencing-cycles</span></div>
---
## FastQC results - Per Base Sequence Content
```{r, out.width = "60%",echo=FALSE,fig.align="left"}
......@@ -336,7 +356,7 @@ knitr::include_graphics("fig/wf2.png")
.pull-left[
- Find the genomic origin of sequence fragments
- RNAseq data originates from spliced mRNA, ni introns
- RNAseq data originates from spliced mRNA, in introns
- When aligning to the genome, our aligner must find a spliced alignment for reads
]
......@@ -352,13 +372,12 @@ knitr::include_graphics("fig/align0.png")
.pull-left[
- Highly accurate, memory intensive aligner
- Two phase mapping process
1. Find Maximum Mappable Prefix (MMP) in a read
.small[ a contiguous sequence in the read that matches a segment of the genome
Continue with the unmapped portion of the read. If a read is not completely covered by MMPs, the MMP are extended with mismatches (a) indels (b) or soft-clipped (c in the Figure below) ]
2. Clustering, stitching and scoring
.small[
Using MMP a anchors, reads are stitched together. All seeds that fall within a user-defined genomic window (which determines the maximum intron length) will be clustered. If all seeds in a read are not within the window, chimeric alignment is produced, such as would happen in gene fusion.
]
1. Find Maximum Mappable Prefixes (MMP) in a read. MMP can be extended by
a. mismatches
b. indels
c. soft-clipping
2. Clustering MMP, stitching and scoring to determine final read location
]
.pull-right[
```{r,echo=FALSE,fig.align="left"}
......@@ -379,7 +398,7 @@ knitr::include_graphics("fig/ann0.png")
- The intersection among RefGene, UCSC, and Ensembl annotations shows high overlap. RefGene has the fewest unique genes, while more than 50% of genes in Ensembl are unique
- Be consistent!
- Be consistent with your choice of annotation source!
<div class="my-footer"><span>Zhao et al Bioinformatics 2015&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1308-8</span></div>
......@@ -413,12 +432,12 @@ Tufts HPC hosts genome reference data from [UCSC](https://genome.ucsc.edu/cgi-bi
For our data, we will need reference files from Saccharomyces_cerevisiae genome version sacCer3.
We can explore available files like this:
```{bash, echo=TRUE, eval=FALSE}
```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/
tree -d
```
.pull-left[
```{bash, echo=TRUE, eval=FALSE}
```bash
├── Annotation
│ ├── Genes -
│ └── SmallRNA
......@@ -443,29 +462,24 @@ The reference files that we need for this analysis are:
---
## Annotation file formats
STAR uses a GTF format for genome annotation
```{bash, echo=TRUE, eval=FALSE}
```bash
cd /cluster/tufts/bio/data/genomes/Saccharomyces_cerevisiae/UCSC/sacCer3/Annotation/Genes
head sacCer3.gtf | column -ts $'\t'
```
```
```bash
chrI sacCer3.genepred transcript 335 649 . + . gene_id "YAL069W"; transcript_id "YAL069W";
chrI sacCer3.genepred exon 335 649 . + . gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI sacCer3.genepred CDS 335 646 . + 0 gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI sacCer3.genepred start_codon 335 337 . + 0 gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI sacCer3.genepred stop_codon 647 649 . + 0 gene_id "YAL069W"; transcript_id "YAL069W"; exon_number "1"; exon_id "YAL069W.1";
chrI sacCer3.genepred transcript 538 792 . + . gene_id "YAL068W-A"; transcript_id "YAL068W-A";
chrI sacCer3.genepred exon 538 792 . + . gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI sacCer3.genepred CDS 538 789 . + 0 gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI sacCer3.genepred start_codon 538 540 . + 0 gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
chrI sacCer3.genepred stop_codon 790 792 . + 0 gene_id "YAL068W-A"; transcript_id "YAL068W-A"; exon_number "1"; exon_id "YAL068W-A.1";
```
There is a simpler format which contains only one feature type, called BED.
Since there is no feature type, we download a table for each feature type of interest, e.g.:
```{r, engine = 'bash', echo=TRUE, eval=FALSE}
head sacCer3.sgdGene.wholegene.bed
```
head sacCer3.sgdGene.wholegene.bed | column -ts $'\t'
```
```bash
chrI 130798 131983 YAL012W 0 + 130798 131983 0 1 1185, 0,
chrI 334 649 YAL069W 0 + 334 649 0 1 315, 0,
chrI 537 792 YAL068W-A 0 + 537 792 0 1 255, 0,
......@@ -516,7 +530,7 @@ STAR --runMode genomeGenerate \
Open file
The basic command
```{bash, echo=TRUE, eval=FALSE}
```bash
STAR --genomeDir <directory of indexed genome> \
--readFilesIn <CSV list of gzipped fastq files> \
--readFilesCommand zcat \
......@@ -532,7 +546,7 @@ STAR --genomeDir <directory of indexed genome> \
---
Open up bioinformatics-rnaseq/scripts/star_align.sh in a text editor
.small[
```{bash, echo=TRUE, eval=FALSE}
```bash
## Use STAR aligner to align all fastq files in a directory
## This step must be done for each sample
module load STAR/2.7.0a
......@@ -597,14 +611,12 @@ Modify the script to ignore small introns and run it again:
```
]
---
## Read Alignment: Result files
```
ls -lh STAR/
```
.small[
```
```bash
-rw-rw-r-- 1 rbator01 biotools 272M Mar 25 15:45 WT_1_Aligned.sortedByCoord.out.bam
-rw-rw-r-- 1 rbator01 biotools 34K Mar 25 15:45 WT_1_Aligned.sortedByCoord.out.bam.bai
-rw-rw-r-- 1 rbator01 biotools 1.8K Mar 25 15:45 WT_1_Log.final.out
......@@ -613,10 +625,9 @@ ls -lh STAR/
-rw-rw-r-- 1 rbator01 biotools 46K Mar 25 15:45 WT_1_SJ.out.tab
drwx------ 2 rbator01 biotools 4.0K Mar 25 15:43 WT_1__STARgenome
```
]
---
```bash
cat WT_1_Log.final.out
cat STAR/WT_1_Log.final.out
```
.small[
```bash
......@@ -656,8 +667,22 @@ cat WT_1_Log.final.out
<div class="my-footer"><span>https://github.com/hbc/NGS_Data_Analysis_Course/ and http://rseqc.sourceforge.net/</span></div>
---
## BAM format
The BAM file is compressed, but we can view it using samtools
```bash
module load samtools/1.2
samtools view STAR/WT_1_Aligned.sortedByCoord.out.bam | head
```
```bash
ERR458498.771047 16 chrI 541 1 51M * 0 0 CACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCC AAIIHDCGGEIGIIIIIIIIIIGIGEGGIIGEIIHFGDBDHFDFDDBB+@@ NH:i:3 HI:i:1 AS:i:50 nM:i:0
...
```
---
## BAM format
```{r, out.width = "100%",echo=FALSE,fig.align="left"}
knitr::include_graphics("fig/sam1.png")
```
<div class="my-footer"><span>https://www.samformat.info/images/sam_format_annotated_example.5108a0cd.jpg
</span></div>
---
## Visualizing reads with IGV
- Return to On Demand "Dashboard" tab: https://ondemand.cluster.tufts.edu
......@@ -772,7 +797,7 @@ knitr::include_graphics("fig/fc.png")
## Counting reads: running the script
Open bioinformatics-rnaseq/scripts/sbatch_featurecounts_gene.sh in Jupyter Lab text editor
```{bash, echo=TRUE, eval=FALSE}
```bash
module load subread/1.6.3
mkdir featurecounts
......@@ -787,21 +812,23 @@ $DIR/*bam
```
To run:
```{bash, echo=TRUE, eval=FALSE}
```bash
./scripts/featurecounts.sh STAR
```
Output files:
featurecounts_results.txt
featurecounts_results.txt.summary
Output files
```bash
featurecounts/
├── featurecounts_results.txt
└── featurecounts_results.txt.summary
```
---
## Counting reads: results
```
cat featurecounts_results.txt.summary | column -ts $'\t'
```bash
cat featurecounts/featurecounts_results.txt.summary | column -ts $'\t'
```
.small[
```
```bash
Status STAR//WT_1_Aligned.sortedByCoord.out.bam STAR//WT_1_nosmall__Aligned.sortedByCoord.out.bam
Assigned 5395145 5396147
Unassigned_Unmapped 0 0
......@@ -837,4 +864,7 @@ associated with genes | featureCounts log | 5.4 M
<div class="my-footer"><span>https://github.com/hbc/NGS_Data_Analysis_Course</span></div>
---
## Summary
\ No newline at end of file
## Summary
```{r, out.width = "80%",echo=FALSE}
knitr::include_graphics("fig/wf0.png")
```
......@@ -32,7 +32,6 @@ Statistical power
Capturing variability
---
## Bioinformatics Workflow
Our analysis will have the following steps with QC at each stage
&lt;img src="fig/wf0.png" width="500px" style="display: block; margin: auto;" /&gt;
---
......@@ -40,7 +39,7 @@ Our analysis will have the following steps with QC at each stage
.pull-left[
- mRNA data from 48 replicates of two Saccromyces cerevisiae populations
- Wildtype (WT) and `\(\Delta\)`SNF2
- Wildtype (WT) and SNF2 knock-out ( `\(\Delta\)`snf2 )
- Unusually comprehensive analysis of variability in sequencing replicates
]
......@@ -53,14 +52,18 @@ Our analysis will have the following steps with QC at each stage
---
## Start Jupyter Lab on the HPC cluster On Demand
.pull-left[
1. In **Chrome** web browser: [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
1. **Chrome** web browser: [https://ondemand.cluster.tufts.edu](http://ondemand.cluster.tufts.edu)
2. Interactive Apps -&gt; Jupyter Lab
+ Hours: 3 hours
+ Core: 8 cores
+ Memory: 64 GB
4. Press "Connect to Jupyter Lab"
5. Choose "Terminal" from the Launcher menu
6. A terminal will appear on the compute node where your job is running, where you can type bash commands
+ Hours: 4
+ Core: 4
+ Memory: 32 GB
3. Press "Connect to Jupyter Lab"
4. Choose "Terminal" from the Launcher menu
5. A terminal will appear on the compute node where your job is running, where you can type bash commands
]
.pull-right[
&lt;img src="fig/jn.png" width="70%" style="display: block; margin: auto;" /&gt;&lt;img src="fig/terminal.png" width="70%" style="display: block; margin: auto;" /&gt;
......@@ -69,33 +72,54 @@ Our analysis will have the following steps with QC at each stage
## Setting up our working directory
Make a directory in the training space called `your-user-name` e.g.rbator01:
```
```bash
cd /cluster/tufts/bio/tools/
mkdir training/&lt;your-user-name&gt;
```
Copy the workshop materials into the new directory:
```
```bash
cp -r tutorials/bioinformatics-rnaseq/ training/&lt;your-user-name&gt;
```
Make a soft link into your home directory:
```
```bash
ln –s training/&lt;your-user-name&gt;/bioinformatics-rnaseq/ ~/
```
You now have a directory called "bioinformatics_rnaseq" in your home directory containing:
You now have a directory called "bioinformatics_rnaseq" in your home directory:
```bash
cd ~/bioinformatics-rnaseq
```
---
## Setting up our working directory
Look at the contents of that directory:
```bash
tree .
```
```bash
├── data
│ ├── ERP004763_info.txt
│ ├── sacCerfeatureCounts_gene_results.formatted.mod.txt
│ └── sample_info.txt
└── scripts
├── bamqc.sh
├── deseq2_v2.R
├── fastqc_multiqc.sh
├── fastqc.sh
├── featurecounts.sh
├── sbatch_star_align.sh
└── sbatch_star_index.sh
```
---
## Downloading data from a public archive
In "bioinformatics_rnaseq/data" we have a tab delimited file with samples information for study ERP004763 at [European Nucleotide Archive](https://www.ebi.ac.uk/ena)
Use bash utility **head** to look at the first few lines
Use bash utility **head** and **column** to look at the first few lines
```bash
cd bioinformatics_rnaseq/data
head ERP004763_info.txt
```
head ERP004763_info.txt | column -ts $'\t'
```
```bash
run_accession condition biol_rep fastq_ftp
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
ERR458494 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458494/ERR458494.fastq.gz
......@@ -109,13 +133,13 @@ ERR458501 SNF2 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR45850
```
---
## Downloading data from a public archive
Find the accession numbers corresponding to WT and replicate 1 using contents using bash utlilty **awk**
Find the accession numbers corresponding to WT and replicate 1 using contents using bash utlilty **cat** and **awk**
```bash
cat ERP004763_info.txt | awk '$2=="WT"'
```
```
```bash
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR459206 WT 48 ftp.sra.ebi.ac.uk/vol1/fastq/ERR459/ERR459206/ERR459206.fastq.gz
......@@ -125,7 +149,7 @@ Find the accession numbers corresponding to the first biological replicate:
```bash
cat ERP004763_info.txt | awk '$3==1'
```
```
```bash
ERR458493 WT 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz
...
ERR458506 SNF2 1 ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458506/ERR458506.fastq.gz
......@@ -135,7 +159,7 @@ The pipe symbol "|" will tell bash to send the output of one program to another
---
.content-box-yellow[
**Exercise** (2 minutes):
**Exercise** (5 minutes):
Can you combine the above commands using a pipe in order to output only the lines corresponding to WT replicate 1?
]
......@@ -159,9 +183,12 @@ cd WT_1
cat ERP004763_info.txt | awk '$3==1'| awk '$2=="WT"'| cut -f4 | xargs wget
cd ..
```
--
The script can be run by returning to your terminal and doing:
---
## Downloading fastq files
1. Click on "Files" in the left menu
2. Click the "Home" icon and navigate to "bioinformatics-rnaseq"
3. Double click the file "download_samples.sh" and it will open in a text editor
4. The script can be run by returning to your terminal and doing:
```bash
cd ../
./scripts/download_samples.sh
......@@ -175,11 +202,10 @@ You should now have a folder WT_1 in your bioinformatics-rnaseq directory with s
## Fastq format
View the first 4 lines of one of the gzipped fastq files without decompressing:
```bash
gzip -cd WT_1/ERR458493.fastq.gz | head -n 4
```
```
```bash
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
......@@ -210,10 +236,10 @@ So why don't we see numbers in the quality score?
B@@FFFFFHHHGHJJJJJJIJJGIGIIIGI9DGGIIIEIGIIFHHGGHJIB
```
---
### Quality Scores
### Quality Score Encoding
In FASTQ files produced by Illumia software 1.8+, quality scores are encoded as the character with an ASCII code equal to its value + 33.
&lt;img src="fig/illumina_fastq_coding.png" width="60%" /&gt;
```
```bash
@ERR458493.1 DHKW5DQ1:219:D0PT7ACXX:1:1101:1724:2080/1
CGCAAGACAAGGCCCAAACGAGAGATTGAGCCCAATCGGCAGTGTAGTGAA
+
......@@ -225,25 +251,22 @@ B = Q(33), probability of error of 1 in 2000
## FastQC