Table of Contents
Raw read processing
FASTA and FASTQ format
Many high-throughput sequencing methods produce reads in a format called FASTQ. This format is very similar to the FASTA format, but in addition to the nucleotide sequence itself it also contains some quality information for each position.
In the FASTA format, each record has two elements:
- the line starting with
>
contains the record name - the next line contain the nucleotide sequence
>myLittleSeq AATTCCCACAGAATCCCTCGANGGACTGCAAGGCAGCAGCCCATTGCCTAAAAAGGAAGAGTGCACACAGA
In the FASTQ format, each record has four elements:
- the line starting with
@
contains the record name and precedes the sequence - the next line is the nucleotide sequence
- the line starting with
+
repeats the record name and precedes the quality information - the next line contains the quality score for each position (one character per position)
@myLittleSeq AATTCCCACAGAATCCCTCGANGGACTGCAAGGCAGCAGCCCATTGCCTAAAAAGGAAGAGTGCACACAGA +myLittleSeq BBBBBBB?ABABB@BBB6B?6%6>9B;@BA=;ABAACBABA?=?@AAB>8.(.);@:>BBA7@C@.<(0B8
The quality score for each position is encoded in one character and can be converted to a numerical value. This score tells how confident the sequencer is about the identity of the base at this position of the sequence. One format for quality score used by Illumina is the Phred score.
Phred quality score
The quality information in the FASTQ can be encoded as a quality score such as the commonly used Phred score. The type of score encoding for a FASTQ file depends on the sequencing platform format.
The Phred score gives the probability that the base identity in the sequence is
correct at this position. The higher the score, the higher probability that the
base call is correct. The relation between the quality score Q
and the
probability of incorrect base assignment P
is:
Q = -10.log(P) where log is the decimal log operation
| Phred score | Prob(incorrect base) | Prob(correct base) | |-------------+-------------------+-----------------| | 10 | 0.1 | 0.90000 | | 20 | 0.01 | 0.99000 | | 30 | 0.001 | 0.99900 | | 40 | 0.0001 | 0.99990 | | 50 | 0.00001 | 0.99999 |
Typically, the Phred score numerical value is encoded to the FASTQ file by adding a fixed, constant value and converting the resulting number to an ASCII character. For example, if the encoding is [Phred+64] as it is for Illumina 1.5+, then a Phred score of 50 is encoded as "r" in the FASTQ file since 50+64=114 and the ASCII character for 114 is "r". The wikipedia page about the quality score encoding in FASTQ file has more detailed information and examples about it.
Raw reads files
Note: the raw reads files were prepared from the files submitted to the Dryad
repository. For convenience, dos2unix
was run on the files to make them more
easy to work with on GNU/Linux systems.
TODO Have a look at the raw reads files
Go into the raw reads folder and list its content by typing:
cd 01_raw_reads ls -lh
ls
lists the content of the current folder, and the -l
and -h
options ask
for a detailed list and for human-readable sizes, respectively.
The raw reads are separated into several files, one per lane per sense (forward or reverse reads):
s_1_1_sequence.fastq s_1_2_sequence.fastq s_2_1_sequence.fastq s_2_2_sequence.fastq s_3_1_sequence.fastq s_3_2_sequence.fastq s_5_1_sequence.fastq s_5_2_sequence.fastq s_6_1_sequence.fastq s_6_2_sequence.fastq
The file name format is: s_xxx_yyy_sequence.fastq
where:
xxx
is the sequencing lane (1, 2, 3, 5 or 6 here)yyy
is the read direction (1 for FORWARD reads, 2 for REVERSE reads - we used paired-end sequencing, remember)
For example, s_1_1_sequence.fastq
and s_1_2_sequence.fastq
are the
sequences from lane 1, forward and reverse reads, respectively.
Get a look at the file by typing:
head s_1_1_sequence.fastq
@HWI-EAS418:1:1:3:665#0/1 AATTCATTTACTGTGTAGTTTNTTTGCNGCAAATGAAAAGCAGNCTACATAATGCATAAACAGGCACTGCAAGA +HWI-EAS418:1:1:3:665#0/1 \`bba`ababbaa`a_[a`a^D^aa`^D[a^S]```URW^b`RD^aabb^^]aabbaSOS__^`bab^a^___a @HWI-EAS418:1:1:3:1917#0/1 AATTCACATGTGCTCTCTTCCNTTGAGNCGATAAACGCCTCAGAGGTTTTCCTTGTAATCGTGGATGGATGACA +HWI-EAS418:1:1:3:1917#0/1 `baaa^aa``[\_^a_a__a^DX\^[TDZ\[RL\^[NX\L\HRHVVMM\[]BBBBBBBBBBBBBBBBBBBBBBB @HWI-EAS418:1:1:3:1141#0/1 AATTCTCTATGGCAACCAATGNACAAATAGTACTAACAGCTTAAAATGTTGGGACACAGTTAAGTGCTCAGCTA
The head
command displays the first 10 lines of a file. Can you recognize
which lines contain record names? Record sequences? Quality scores?
Count the number of reverse reads for each population
Each population pool was barcoded on one side of the RAD fragment, which corresponds to the reverse reads here. The barcodes were three letter long, and were linked to the restriction site of ENZ1. This means that the reverse reads must start by one of those eight sequences:
| Pop | Lane | Barcode | Rev start | |-----+------+---------+-----------| | BYN | 1 | CAC | CACTCC | | RYT | 1 | CTT | CTTTCC | | HKI | 2 | TCT | TCTTCC | | PYÖ | 3 | CTT | CTTTCC | | ABB | 3 | TTG | TTGTCC | | SKA | 5 | CAC | CACTCC | | LEV | 5 | TCT | TCTTCC | | POR | 6 | TTG | TTGTCC |
We see that there are two populations pooled in each of lanes 1, 3 and 5, but only one population in each of lanes 2 and 6.
TODO Count the number of reverse reads for HKI population
Let's start with something simple and look at lane 2. There is only one
population, HKI, so all the reverse reads in s_2_2_sequence.fastq
should
start with TCTTCC
.
head s_2_2_sequence.fastq
@HWI-EAS418:2:1:2:1978#0/2 TCTTCCCTACACTGCGTGTCGTCTCAATCGCGGGAGCAGCAGTAGACACAGCTAGGGGTGATGTGNGTGTGTGT +HWI-EAS418:2:1:2:1978#0/2 abbbbbbbbbbbabbb`aabbaba_^`abbaaa_^_aaaa_Z[__\`a`_`aZFY_]YQ^^[_XVDWYXP\T\U @HWI-EAS418:2:1:4:1179#0/2 TCTTCCCAGCTACGCAGACAATGGCTATCCTTAAAAAGAAAAGTGTGGTTTTCTTACTTTTAACCNTTGAGCCA +HWI-EAS418:2:1:4:1179#0/2 aabaaaabbaabaaaa`aaa``aaaaa^`aa_aa`aa`_``^`V`\aaX`__]_`a`aaaa``[WD[`ZN^_\\ @HWI-EAS418:2:1:4:1998#0/2 TCTTCCACACCGGGTCAGTCTCACTTTGAAGGAAACTTGGTCCCCTCTAAACTGGAGTTAATCTCNTTGGTTGC
Does the first sequences fulfil this expectation?
Of course we do not really want to check manually that each sequence starts with the correct pattern by visual inspection. One way is to count the total number of sequences in the file, and another is to count the number of sequences starting with the correct pattern. Hopefully the numbers match…
To count the total number of sequences in the file, we can count its lines with
the command wc -l
and then divide by four since each record has four lines:
wc -l s_2_2_sequence.fastq
1942408 s_2_2_sequence.fastq
How many reads are there in total in this file?
Now we can count the number of sequences starting with TCTTCC
by using the
grep
command and counting the number of lines of its output with wc -l
. The
grep
command looks for a given pattern in the lines of a file, and output the
matching lines only. The pattern we use for the match is ="TCTTCC". The =^
tells grep
that the string should match at the beginning of a line only, not
in the middle of it. The output of grep
is then sent to wc -l
with a pipe
|
so that wc
will tell us how many matching lines grep
returned. A pipe
sends the output of the first command as an input to the second command.
grep "^TCTTCC" s_2_2_sequence.fastq | wc -l
485602
Do the two numbers match? Is everything as expected? Is our approach robust or can it fail in some cases?
TODO Count the number of reverse reads for BYN and RYT populations
The BYN population shared lane 1 with the RYT population. Reverse reads from
both populations are in s_1_2_sequence.fastq
. To count the number of reads
for BYN, we can again use grep
and the appropriate pattern:
grep "^CACTCC" s_1_2_sequence.fastq | wc -l
187008
And we can do the same for RYT:
grep "^CTTTCC" s_1_2_sequence.fastq | wc -l
599202
Finally, we can check that everything makes sense by counting the total number
of reads in s_1_2_sequence.fastq
:
wc -l s_1_2_sequence.fastq
3144840
Do the number match? Does everything make sense?
TODO Count the number of reverse reads for the other populations
You can now obtain the number of reverse reads for all the populations. Compare the number of reads between populations. Is the coverage homogeneous? Can you explain what you observe?
| Pop | Lane | Barcode | Rev start | N reads | |-----+------+---------+-----------+---------| | BYN | 1 | CAC | CACTCC | 187008 | | RYT | 1 | CTT | CTTTCC | 599202 | | HKI | 2 | TCT | TCTTCC | 485602 | | PYÖ | 3 | CTT | CTTTCC | 675123 | | ABB | 3 | TTG | TTGTCC | 428807 | | SKA | 5 | CAC | CACTCC | 81633 | | LEV | 5 | TCT | TCTTCC | 376011 | | POR | 6 | TTG | TTGTCC | 333744 |
Plot your results in a bar plot with R. You can do it by yourself if you know how to do, or you can use the code below:
# *** R script *** # Summary plot for reverse reads abundance # ---------------------------------------- # Input the data pop = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR") habitat = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake") n_reads = c(187008, 599202, 485602, 675123, 428807, 81633, 376011, 333744) # Bar plot barplot(n_reads, # numerical values used for the plot names.arg = pop, # names for each bar col = as.factor(habitat), # col is determined by the factor "habitat" las = 1, # orientation of the y-axis numbers ylab = "N reads") # label for y-axis
Demultiplexing SKA and LEV reads
As is common in RAD projects, we have several populations sharing the same lane but we can differentiate them using their specific barcode. The demultiplexing operation consists in sorting the reads from each population into separate files for downstream processing.
How to demultiplex
We will start by demultiplexing the reverse reads (those which have the specific barcodes). Once we have the names of the reverse reads for each population, we'll match those names in the forward files. This is because, in paired-end sequencing, paired sequences have matching names:
# Somewhere in the reverse reads file: @HWI-EAS418:1:1:3:665#0/1 # Somewhere in the forward reads file: @HWI-EAS418:1:1:3:665#0/2
In this example, the name of the RAD fragment is HWI-EAS418:1:1:3:665#0
. The
forward and reverse reads have the suffix /1
and /2
, respectively. Once we
have the names of the reverse reads for one population, that we identified with
the specific barcode, we can extract the forward reads (which do not have a
barcode) by matching the fragment names.
Reads in s_5_2_sequence.fastq
should start with either CACTCC
(SKA
population) or TCTTCC
(LEV population). We will filter the reads from this
file based on this expectation.
Note that it is possible to have reads that do have mistakes at the beginning of the sequence in the barcode or in the restriction site and that do not match the expectations. If the read cannot be assigned to a population unambiguously, it should be discarded.
In practice, there are methods to correct the barcode for one base mismatch when the barcodes used for different populations are sufficiently different from each other (see for example the STACKS pipeline).
TODO Extract the reverse reads from SKA
We use grep
again. Since we want to extract full records this time (the four
lines with the name of the record, the sequence and the quality score), not
only the nucleotide sequences that matches the pattern, we use grep
with the
options -B 1
(extract one line before the match) and -A 2
(extract two
lines after the match). We send grep
output to a file by using the
redirection operator >
.
grep -B 1 -A 2 "^CACTCC" s_5_2_sequence.fastq > SKA-rev.tmp.fastq
In addition, grep
will add a --
line between groups of contiguous matches
(you can check that with less SKA-rev.tmp.fastq
). We do not want to keep it
in our output file, so we filter that out with a reverse grep
: a grep
call
with the -v
option which asks grep
to output only the lines that do not
match the pattern.
grep -v "^\-\-" SKA-rev.tmp.fastq > SKA-rev.final.fastq # Here the pattern is "^\-\-" and means a "--" at the beginning of the line. # Since "-" is a special character for grep, we have to escape it with "\-" # so that grep considers it as a normal character.
We could do all in one go, without the intermediate SKA-rev.tmp.fastq
file
which contains the --
lines, by using a pipe between the two grep
calls:
grep -B 1 -A 2 "^CACTCC" s_5_2_sequence.fastq | grep -v "^\-" > SKA-rev.fastq
Now we can count the number of sequence in SKA-rev.fastq
:
grep "^@" SKA-rev.fastq | wc -l
81235
TODO Extract the reverse reads from LEV
Let's do the same for the LEV population:
grep -B 1 -A 2 "^TCTTCC" s_5_2_sequence.fastq | grep -v "^\-" > LEV-rev.fastq grep "^@" LEV-rev.fastq | wc -l
374311
How would you check that all reads were sent to the SKA or to the LEV file? Perform the check: was there any reads not assigned to a population?
TODO Get the reads names for SKA and LEV
Now we have a file with the full reverse reads for each population. What we would really like to have is just a list of the names of the reads for each population, to use them to match the forward reads.
The approach is straightforward: we can just grep
the lines starting with @
in the sorted reverse reads files, and then remove the suffix /2
from the
record name. Let's do it for SKA:
grep "^@" SKA-rev.fastq > SKA-rev-records-names head SKA-rev-records-names
To remove the ending /2
we use sed
. sed
is a program which can replace
one string by another in a text file:
sed -e "s/\/2//g" SKA-rev-records-names > SKA.rev.names
The sed
commands takes an expression (introduced by -e
) of the format
s/PATTERN/REPLACE/g
, where PATTERN
is the string to be searched for in the
file and REPLACE
is the replacement string. Here PATTERN
is \/2
(we have
to escape the /
because it is a special character for sed
) and the
replacement is the empty string!
Let's check that everything makes sense:
grep "^@" SKA-rev.fastq | wc -l wc -l SKA-rev-records-names wc -l SKA.rev.names
Do you have the same number of records in each file? Extract the names in the same way for LEV.
TODO Getting the reverse reads names for the other populations
Since we are only interested in getting the reads names, we can bypass the creation of all the intermediate files and just pipe the commands together.
As an example, let's get all the names of the reverse reads for BYN. We can do it in three steps:
- first we extract the full records of the reverse reads starting with
CACTCC
(BYN specific pattern) - then we extract only the lines containing the reads names
- finally we remove the suffix
/2
and send the result to a file
Let's grep
again!
grep -B 1 -A 2 "^CACTCC" s_1_2_sequence.fastq | grep "^@" | sed -e "s/\/2//g" > BYN.rev.names
This gives you an idea of the versatility of the command line when combining simple programs with the pipe.
Now, you can prepare a list of reverse reads names for each population.
TODO Final step: getting all the read pairs for all populations
Now that we have the reverse reads names for each population, can you find a
way, using only grep
, sed
and pipes, to create for each
population two files containing the forward and reverse reads involved in
matching pairs? (The order within each file is not important)
As an example, this is how to do it for BYN:
# Get the list of reverse names for BYN grep -B 1 -A 2 "^CACTCC" s_1_2_sequence.fastq | grep "^@" | sed -e "s/\/2//g" > BYN.rev.names # Get the forward records that have a match in the rev list grep -F -A 3 -f BYN.rev.names s_1_1_sequence.fastq | grep -v "^\-\-" > BYN.pair.for.fastq # Extract the names of those forward reads grep "^@" BYN.pair.for.fastq | sed -e "s/\/1//g" > BYN.pair.for.names # Get the reverse records which had a match in the forward file grep -F -A 3 -f BYN.pair.for.names s_1_2_sequence.fastq | grep -v "^\-\-" > BYN.pair.rev.fastq rm BYN.*.names
If you have time, prepare the paired forward and reverse files for each
population and count how many pairs of reads are available for each population.
If you don't have time, you can run this bash script, extract_pairs.sh
:
bash extract_pairs.sh
You can plot the results with R:
# *** R script *** # Summary plot for paired reads abundance # --------------------------------------- # Input the data pop = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR") habitat = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake") n_pairs = c(175685, 577420, 446765, 652455, 396861, 78983, 361187, 321539) # Bar plot barplot(n_pairs, # numerical values used for the plot names.arg = pop, # names for each bar col = as.factor(habitat), # col is determined by the factor "habitat" las = 1, # orientation of the y-axis numbers ylab = "N pairs") # label for y-axis
TODO Barcode and restriction site removal
All forward reads start with AATTC
, due to the digestion by EcoRI
(G-AATTC
). Similarly, all the reverse reads start with XXXTCC
, due to the
barcodes (XXX
) and the digestion by HaeIII (GG-CC
) and the addition of a
T
base in between. Those nucleotides can be removed by simply removing the 5
first nucleotides in all forward reads and the 6 first nucleotides in all
reverse reads.
A small homemade python script (remove_N_start_fastq.py
) is used for
that:
python remove_N_start_fastq.py *.for.fastq 5 python remove_N_start_fastq.py *.rev.fastq 6
Check for the first lines of a few files that the operation worked. Why do you think it is important to cut those nucleotides? What would happen if we would keep them?