Raw reads processing - Complete workflow

Raw read processing

Raw read processing

FASTA and FASTQ format

Many high-throughput sequencing methods produce reads in a format called FASTQ. This format is very similar to the FASTA format, but in addition to the nucleotide sequence itself it also contains some quality information for each position.

In the FASTA format, each record has two elements:

the line starting with > contains the record name
the next line contain the nucleotide sequence

>myLittleSeq
AATTCCCACAGAATCCCTCGANGGACTGCAAGGCAGCAGCCCATTGCCTAAAAAGGAAGAGTGCACACAGA

In the FASTQ format, each record has four elements:

the line starting with @ contains the record name and precedes the sequence
the next line is the nucleotide sequence
the line starting with + repeats the record name and precedes the quality information
the next line contains the quality score for each position (one character per position)

@myLittleSeq
AATTCCCACAGAATCCCTCGANGGACTGCAAGGCAGCAGCCCATTGCCTAAAAAGGAAGAGTGCACACAGA
+myLittleSeq
BBBBBBB?ABABB@BBB6B?6%6>9B;@BA=;ABAACBABA?=?@AAB>8.(.);@:>BBA7@C@.<(0B8

The quality score for each position is encoded in one character and can be converted to a numerical value. This score tells how confident the sequencer is about the identity of the base at this position of the sequence. One format for quality score used by Illumina is the Phred score.

Phred quality score

The quality information in the FASTQ can be encoded as a quality score such as the commonly used Phred score. The type of score encoding for a FASTQ file depends on the sequencing platform format.

The Phred score gives the probability that the base identity in the sequence is correct at this position. The higher the score, the higher probability that the base call is correct. The relation between the quality score Q and the probability of incorrect base assignment P is:

Q = -10.log(P) where log is the decimal log operation

| Phred score | Prob(incorrect base) | Prob(correct base) |
|-------------+-------------------+-----------------|
|          10 |               0.1 |         0.90000 |
|          20 |              0.01 |         0.99000 |
|          30 |             0.001 |         0.99900 |
|          40 |            0.0001 |         0.99990 |
|          50 |           0.00001 |         0.99999 |

Typically, the Phred score numerical value is encoded to the FASTQ file by adding a fixed, constant value and converting the resulting number to an ASCII character. For example, if the encoding is [Phred+64] as it is for Illumina 1.5+, then a Phred score of 50 is encoded as "r" in the FASTQ file since 50+64=114 and the ASCII character for 114 is "r". The wikipedia page about the quality score encoding in FASTQ file has more detailed information and examples about it.

Raw reads files

Note: the raw reads files were prepared from the files submitted to the Dryad repository. For convenience, dos2unix was run on the files to make them more easy to work with on GNU/Linux systems.

TODO Have a look at the raw reads files

Go into the raw reads folder and list its content by typing:

cd 01_raw_reads
ls -lh

ls lists the content of the current folder, and the -l and -h options ask for a detailed list and for human-readable sizes, respectively.

The raw reads are separated into several files, one per lane per sense (forward or reverse reads):

s_1_1_sequence.fastq
s_1_2_sequence.fastq
s_2_1_sequence.fastq
s_2_2_sequence.fastq
s_3_1_sequence.fastq
s_3_2_sequence.fastq
s_5_1_sequence.fastq
s_5_2_sequence.fastq
s_6_1_sequence.fastq
s_6_2_sequence.fastq

The file name format is: s_xxx_yyy_sequence.fastq where:

xxx is the sequencing lane (1, 2, 3, 5 or 6 here)
yyy is the read direction (1 for FORWARD reads, 2 for REVERSE reads - we used paired-end sequencing, remember)

For example, s_1_1_sequence.fastq and s_1_2_sequence.fastq are the sequences from lane 1, forward and reverse reads, respectively.

Get a look at the file by typing:

head s_1_1_sequence.fastq

@HWI-EAS418:1:1:3:665#0/1
AATTCATTTACTGTGTAGTTTNTTTGCNGCAAATGAAAAGCAGNCTACATAATGCATAAACAGGCACTGCAAGA
+HWI-EAS418:1:1:3:665#0/1
\`bba`ababbaa`a_[a`a^D^aa`^D[a^S]```URW^b`RD^aabb^^]aabbaSOS__^`bab^a^___a
@HWI-EAS418:1:1:3:1917#0/1
AATTCACATGTGCTCTCTTCCNTTGAGNCGATAAACGCCTCAGAGGTTTTCCTTGTAATCGTGGATGGATGACA
+HWI-EAS418:1:1:3:1917#0/1
`baaa^aa``[\_^a_a__a^DX\^[TDZ\[RL\^[NX\L\HRHVVMM\[]BBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS418:1:1:3:1141#0/1
AATTCTCTATGGCAACCAATGNACAAATAGTACTAACAGCTTAAAATGTTGGGACACAGTTAAGTGCTCAGCTA

The head command displays the first 10 lines of a file. Can you recognize which lines contain record names? Record sequences? Quality scores?

Count the number of reverse reads for each population

Each population pool was barcoded on one side of the RAD fragment, which corresponds to the reverse reads here. The barcodes were three letter long, and were linked to the restriction site of ENZ1. This means that the reverse reads must start by one of those eight sequences:

| Pop | Lane | Barcode | Rev start |
|-----+------+---------+-----------|
| BYN |    1 | CAC     | CACTCC    |
| RYT |    1 | CTT     | CTTTCC    |
| HKI |    2 | TCT     | TCTTCC    |
| PYÖ |    3 | CTT     | CTTTCC    |
| ABB |    3 | TTG     | TTGTCC    |
| SKA |    5 | CAC     | CACTCC    |
| LEV |    5 | TCT     | TCTTCC    |
| POR |    6 | TTG     | TTGTCC    |

We see that there are two populations pooled in each of lanes 1, 3 and 5, but only one population in each of lanes 2 and 6.

TODO Count the number of reverse reads for HKI population

Let's start with something simple and look at lane 2. There is only one population, HKI, so all the reverse reads in s_2_2_sequence.fastq should start with TCTTCC.

head s_2_2_sequence.fastq

@HWI-EAS418:2:1:2:1978#0/2
TCTTCCCTACACTGCGTGTCGTCTCAATCGCGGGAGCAGCAGTAGACACAGCTAGGGGTGATGTGNGTGTGTGT
+HWI-EAS418:2:1:2:1978#0/2
abbbbbbbbbbbabbb`aabbaba_^`abbaaa_^_aaaa_Z[__\`a`_`aZFY_]YQ^^[_XVDWYXP\T\U
@HWI-EAS418:2:1:4:1179#0/2
TCTTCCCAGCTACGCAGACAATGGCTATCCTTAAAAAGAAAAGTGTGGTTTTCTTACTTTTAACCNTTGAGCCA
+HWI-EAS418:2:1:4:1179#0/2
aabaaaabbaabaaaa`aaa``aaaaa^`aa_aa`aa`_``^`V`\aaX`__]_`a`aaaa``[WD[`ZN^_\\
@HWI-EAS418:2:1:4:1998#0/2
TCTTCCACACCGGGTCAGTCTCACTTTGAAGGAAACTTGGTCCCCTCTAAACTGGAGTTAATCTCNTTGGTTGC

Does the first sequences fulfil this expectation?

Of course we do not really want to check manually that each sequence starts with the correct pattern by visual inspection. One way is to count the total number of sequences in the file, and another is to count the number of sequences starting with the correct pattern. Hopefully the numbers match…

To count the total number of sequences in the file, we can count its lines with the command wc -l and then divide by four since each record has four lines:

wc -l s_2_2_sequence.fastq

1942408 s_2_2_sequence.fastq

How many reads are there in total in this file?

Now we can count the number of sequences starting with TCTTCC by using the grep command and counting the number of lines of its output with wc -l. The grep command looks for a given pattern in the lines of a file, and output the matching lines only. The pattern we use for the match is ="^TCTTCC". The =^ tells grep that the string should match at the beginning of a line only, not in the middle of it. The output of grep is then sent to wc -l with a pipe | so that wc will tell us how many matching lines grep returned. A pipe sends the output of the first command as an input to the second command.

grep "^TCTTCC" s_2_2_sequence.fastq | wc -l

Do the two numbers match? Is everything as expected? Is our approach robust or can it fail in some cases?

TODO Count the number of reverse reads for BYN and RYT populations

The BYN population shared lane 1 with the RYT population. Reverse reads from both populations are in s_1_2_sequence.fastq. To count the number of reads for BYN, we can again use grep and the appropriate pattern:

grep "^CACTCC" s_1_2_sequence.fastq | wc -l

And we can do the same for RYT:

grep "^CTTTCC" s_1_2_sequence.fastq | wc -l

Finally, we can check that everything makes sense by counting the total number of reads in s_1_2_sequence.fastq:

wc -l s_1_2_sequence.fastq

Do the number match? Does everything make sense?

TODO Count the number of reverse reads for the other populations

You can now obtain the number of reverse reads for all the populations. Compare the number of reads between populations. Is the coverage homogeneous? Can you explain what you observe?

| Pop | Lane | Barcode | Rev start | N reads |
|-----+------+---------+-----------+---------|
| BYN |    1 | CAC     | CACTCC    |  187008 |
| RYT |    1 | CTT     | CTTTCC    |  599202 |
| HKI |    2 | TCT     | TCTTCC    |  485602 |
| PYÖ |    3 | CTT     | CTTTCC    |  675123 |
| ABB |    3 | TTG     | TTGTCC    |  428807 |
| SKA |    5 | CAC     | CACTCC    |   81633 |
| LEV |    5 | TCT     | TCTTCC    |  376011 |
| POR |    6 | TTG     | TTGTCC    |  333744 |

Plot your results in a bar plot with R. You can do it by yourself if you know how to do, or you can use the code below:

# *** R script ***

# Summary plot for reverse reads abundance
# ----------------------------------------

# Input the data
pop = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR")
habitat = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake")
n_reads = c(187008, 599202, 485602, 675123, 428807, 81633, 376011, 333744)

# Bar plot 
barplot(n_reads,            # numerical values used for the plot
  names.arg = pop,          # names for each bar
  col = as.factor(habitat), # col is determined by the factor "habitat"
  las = 1,                  # orientation of the y-axis numbers
  ylab = "N reads")         # label for y-axis

Demultiplexing SKA and LEV reads

As is common in RAD projects, we have several populations sharing the same lane but we can differentiate them using their specific barcode. The demultiplexing operation consists in sorting the reads from each population into separate files for downstream processing.

How to demultiplex

We will start by demultiplexing the reverse reads (those which have the specific barcodes). Once we have the names of the reverse reads for each population, we'll match those names in the forward files. This is because, in paired-end sequencing, paired sequences have matching names:

# Somewhere in the reverse reads file:
@HWI-EAS418:1:1:3:665#0/1
# Somewhere in the forward reads file:
@HWI-EAS418:1:1:3:665#0/2

In this example, the name of the RAD fragment is HWI-EAS418:1:1:3:665#0. The forward and reverse reads have the suffix /1 and /2, respectively. Once we have the names of the reverse reads for one population, that we identified with the specific barcode, we can extract the forward reads (which do not have a barcode) by matching the fragment names.

Reads in s_5_2_sequence.fastq should start with either CACTCC (SKA population) or TCTTCC (LEV population). We will filter the reads from this file based on this expectation.

Note that it is possible to have reads that do have mistakes at the beginning of the sequence in the barcode or in the restriction site and that do not match the expectations. If the read cannot be assigned to a population unambiguously, it should be discarded.

In practice, there are methods to correct the barcode for one base mismatch when the barcodes used for different populations are sufficiently different from each other (see for example the STACKS pipeline).

TODO Extract the reverse reads from SKA

We use grep again. Since we want to extract full records this time (the four lines with the name of the record, the sequence and the quality score), not only the nucleotide sequences that matches the pattern, we use grep with the options -B 1 (extract one line before the match) and -A 2 (extract two lines after the match). We send grep output to a file by using the redirection operator >.

grep -B 1 -A 2 "^CACTCC" s_5_2_sequence.fastq > SKA-rev.tmp.fastq

In addition, grep will add a -- line between groups of contiguous matches (you can check that with less SKA-rev.tmp.fastq). We do not want to keep it in our output file, so we filter that out with a reverse grep: a grep call with the -v option which asks grep to output only the lines that do not match the pattern.

grep -v "^\-\-" SKA-rev.tmp.fastq > SKA-rev.final.fastq
# Here the pattern is "^\-\-" and means a "--" at the beginning of the line.
# Since "-" is a special character for grep, we have to escape it with "\-"
# so that grep considers it as a normal character.

We could do all in one go, without the intermediate SKA-rev.tmp.fastq file which contains the -- lines, by using a pipe between the two grep calls:

grep -B 1 -A 2 "^CACTCC" s_5_2_sequence.fastq | grep -v "^\-" > SKA-rev.fastq

Now we can count the number of sequence in SKA-rev.fastq:

grep "^@" SKA-rev.fastq | wc -l

TODO Extract the reverse reads from LEV

Let's do the same for the LEV population:

grep -B 1 -A 2 "^TCTTCC" s_5_2_sequence.fastq | grep -v "^\-" > LEV-rev.fastq
grep "^@" LEV-rev.fastq | wc -l

How would you check that all reads were sent to the SKA or to the LEV file? Perform the check: was there any reads not assigned to a population?

TODO Get the reads names for SKA and LEV

Now we have a file with the full reverse reads for each population. What we would really like to have is just a list of the names of the reads for each population, to use them to match the forward reads.

The approach is straightforward: we can just grep the lines starting with @ in the sorted reverse reads files, and then remove the suffix /2 from the record name. Let's do it for SKA:

grep "^@" SKA-rev.fastq > SKA-rev-records-names
head SKA-rev-records-names

To remove the ending /2 we use sed. sed is a program which can replace one string by another in a text file:

sed -e "s/\/2//g" SKA-rev-records-names > SKA.rev.names

The sed commands takes an expression (introduced by -e) of the format s/PATTERN/REPLACE/g, where PATTERN is the string to be searched for in the file and REPLACE is the replacement string. Here PATTERN is \/2 (we have to escape the / because it is a special character for sed) and the replacement is the empty string!

Let's check that everything makes sense:

grep "^@" SKA-rev.fastq | wc -l
wc -l SKA-rev-records-names
wc -l SKA.rev.names

Do you have the same number of records in each file? Extract the names in the same way for LEV.

TODO Getting the reverse reads names for the other populations

Since we are only interested in getting the reads names, we can bypass the creation of all the intermediate files and just pipe the commands together.

As an example, let's get all the names of the reverse reads for BYN. We can do it in three steps:

first we extract the full records of the reverse reads starting with CACTCC (BYN specific pattern)
then we extract only the lines containing the reads names
finally we remove the suffix /2 and send the result to a file

Let's grep again!

grep -B 1 -A 2 "^CACTCC" s_1_2_sequence.fastq | grep "^@" | sed -e "s/\/2//g" > BYN.rev.names

This gives you an idea of the versatility of the command line when combining simple programs with the pipe.

Now, you can prepare a list of reverse reads names for each population.

TODO Final step: getting all the read pairs for all populations

Now that we have the reverse reads names for each population, can you find a way, using only grep, sed and pipes, to create for each population two files containing the forward and reverse reads involved in matching pairs? (The order within each file is not important)

As an example, this is how to do it for BYN:

# Get the list of reverse names for BYN
grep -B 1 -A 2 "^CACTCC" s_1_2_sequence.fastq | grep "^@" | sed -e "s/\/2//g" > BYN.rev.names
# Get the forward records that have a match in the rev list
grep -F -A 3 -f BYN.rev.names s_1_1_sequence.fastq | grep -v "^\-\-" > BYN.pair.for.fastq
# Extract the names of those forward reads
grep "^@" BYN.pair.for.fastq | sed -e "s/\/1//g" > BYN.pair.for.names
# Get the reverse records which had a match in the forward file
grep -F -A 3 -f BYN.pair.for.names s_1_2_sequence.fastq | grep -v "^\-\-" > BYN.pair.rev.fastq
rm BYN.*.names

If you have time, prepare the paired forward and reverse files for each population and count how many pairs of reads are available for each population. If you don't have time, you can run this bash script, extract_pairs.sh:

bash extract_pairs.sh

You can plot the results with R:

# *** R script ***

# Summary plot for paired reads abundance
# ---------------------------------------

# Input the data
pop = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR")
habitat = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake")
n_pairs = c(175685, 577420, 446765, 652455, 396861, 78983, 361187, 321539)

# Bar plot 
barplot(n_pairs,            # numerical values used for the plot
  names.arg = pop,          # names for each bar
  col = as.factor(habitat), # col is determined by the factor "habitat"
  las = 1,                  # orientation of the y-axis numbers
  ylab = "N pairs")         # label for y-axis

TODO Barcode and restriction site removal

All forward reads start with AATTC, due to the digestion by EcoRI (G-AATTC). Similarly, all the reverse reads start with XXXTCC, due to the barcodes (XXX) and the digestion by HaeIII (GG-CC) and the addition of a T base in between. Those nucleotides can be removed by simply removing the 5 first nucleotides in all forward reads and the 6 first nucleotides in all reverse reads.

A small homemade python script (remove_N_start_fastq.py) is used for that:

python remove_N_start_fastq.py *.for.fastq 5
python remove_N_start_fastq.py *.rev.fastq 6

Check for the first lines of a few files that the operation worked. Why do you think it is important to cut those nucleotides? What would happen if we would keep them?

Next: De novo assembly

Roscoff course 2015 - RAD tags

Table of Contents