Variant calling and genotyping

Variant calling and genotyping

Variant calling and genotyping

Mapping back the reads to the consensus sequences

TODO Examine Phred score profiles

Now that we have the consensus sequences, we need to map back each read to them to align matching reads together and detect polymorphic positions.

Read calling during sequencing is not perfect, and for Illumina the miscall rate is around 1% (Nielsen et al. 2011). To minimise the errors during SNP detection (false positives), we should process the reads to remove bases with low quality score from the analysis.

Let's examine the distribution of quality scores for each set of reads. We can use a homemade python script (extract_phred_profiles.py) to get the Phred score in a tabular format. We can then use R to plot the average profiles.

python extract_phred_profiles.py s_6_2_sequence.fastq

# R script
#
# Plot the Phred profile for a FASTQ file
#----------------------------------------

p = read.table("s_6_2_sequence.fastq.phred_profile", sep = "\t", header = F)
boxplot(p, outline = F)

How does the quality score change along the reads?

TODO Read trimming

To remove low quality bases from the 3' end, we will remove bases from this end until the Phred quality score reached 20 (error rate = 1%). Remaining reads with length less than 40 will be discarded.

We could do that using Fastq Quality trimmer as implemented in Galaxy tools, but we can also use a homemade python script (trim_reads.py):

# Usage: python script.py inputFile phredThreshold lengthThreshold
python trim_reads.py s_6_2_sequence.fastq 20 40

We apply the quality trimming to each forward and reverse reads files for each population:

ls -1

ABB.pair.for.fastq.clean
ABB.pair.rev.fastq.clean
BYN.pair.for.fastq.clean
BYN.pair.rev.fastq.clean
HKI.pair.for.fastq.clean
HKI.pair.rev.fastq.clean
LEV.pair.for.fastq.clean
LEV.pair.rev.fastq.clean
POR.pair.for.fastq.clean
POR.pair.rev.fastq.clean
PYO.pair.for.fastq.clean
PYO.pair.rev.fastq.clean
RYT.pair.for.fastq.clean
RYT.pair.rev.fastq.clean
SKA.pair.for.fastq.clean
SKA.pair.rev.fastq.clean

python trim_reads.py *.fastq.clean 20 40

Also apply this trimming to the files containing all the forward reads and all the reverse reads used for the de novo assembly. How would you examine the resulting distribution of read lengths in those trimmed files? If you have time, do it.

Variant calling

TODO Prepare input files for Maq

We will use the Maq program to perform the read alignment. A detailed explanation of the Maq workflow is available here.

We have the reference files:

consensus-for.fasta
consensus-rev.fasta

and the reads files:

ABB.pair.for.fastq.clean.trimmed
ABB.pair.rev.fastq.clean.trimmed
BYN.pair.for.fastq.clean.trimmed
BYN.pair.rev.fastq.clean.trimmed
HKI.pair.for.fastq.clean.trimmed
HKI.pair.rev.fastq.clean.trimmed
LEV.pair.for.fastq.clean.trimmed
LEV.pair.rev.fastq.clean.trimmed
POR.pair.for.fastq.clean.trimmed
POR.pair.rev.fastq.clean.trimmed
PYO.pair.for.fastq.clean.trimmed
PYO.pair.rev.fastq.clean.trimmed
RYT.pair.for.fastq.clean.trimmed
RYT.pair.rev.fastq.clean.trimmed
SKA.pair.for.fastq.clean.trimmed
SKA.pair.rev.fastq.clean.trimmed

First, we convert the fasta reference to Maq bfa format:

maq fasta2bfa consensus-for.fasta consensus-for.bfa
maq fasta2bfa consensus-rev.fasta consensus-rev.bfa

Then we convert the fastq read files to Maq bfq format using a small bash script, convert-trimmed-to-bfq.sh. The script file is:

#!/bin/bash
FILES=`ls *.trimmed`   # get the list of trimmed read files

for f in $FILES
do
  echo $f
  maq sol2sanger $f $f.sanger       # convert Illumina 1.5+ to Sanger scores
  maq fastq2bfq $f.sanger $f.bfq    # convert fastq to bfq
done

We use it with:

bash convert-trimmed-to-bfq.sh

We also generate the bfq files for all the concatenated forward reads and all the concatenated reverse reads:

# Concatenate the forward and reverse reads
cat *for*.trimmed > reads.for.trimmed
cat *rev*.trimmed > reads.rev.trimmed
# Convert to bfq
maq sol2sanger reads.for.trimmed reads.for.sanger
maq sol2sanger reads.rev.trimmed reads.rev.sanger
maq fastq2bfq reads.for.sanger reads.for.bfq
maq fastq2bfq reads.rev.sanger reads.rev.bfq

At this point, the fasta and fastq files have been converted to the binary format used by Maq.

TODO Read alignment to the consensus sequences

Here we will use all the forward reads and all the reverse reads and align them to the consensus sequences. For the practicals, we work only with the forward reads from now on.

Again, this step could be long, so we submit it as a job in the file map-maq-for-job. The map file is a binary file, but we also convert it to a human-readable form with mapview:

#!/bin/bash
#$ -S /bin/bash
#$ -M matthieu.bruneaux@ens-lyon.org
#$ -V
#$ -m bea
#$ -cwd
# Maq map forward reads
maq map for.map consensus-for.bfa reads.for.bfq
maq mapview for.map > for.map.txt

We submit the job with qsub -q long.q map-maq-for-job. Have a look at the map with less -S for.map.txt. Using the man page (search for "mapview" with CTRL+F), understand the output format, and in particular look for the number of mismatches of the best hits. Can you identify which reads are similar to the consensus sequence and which harbour potential SNPs?

TODO SNP calling

Now that the reads have been mapped back, we can perform the SNP calling operation.

We use the assemble and the cns2snp function of Maq to extract the SNP information. The job file is maq-assemble-cns2snp-job:

#!/bin/bash
#$ -S /bin/bash
#$ -M matthieu.bruneaux@ens-lyon.org
#$ -V
#$ -m bea
#$ -cwd
# Maq assemble
maq assemble for.cns consensus-for.bfa for.map 2> for-assemble.log
# Maq cns2snp
maq cns2snp for.cns > for.snp

As usual, we submit the job with qsub -q long.q map-assemble-cns2snp-job. Have a look at the output file for.snp. How many putative SNPs were detected? The format of the output file is explained on the man page (search for "cns2snp" with CTRL+F).

A very useful file for detailed SNP filtering is the pileup file. We can produce it with:

maq pileup consensus-for.bfa for.map > for.pileup

Have a look at the pileup file with less -S for.pileup. Can you understand its format? (see the man page, search for "pileup") Can you identify the contigs with low coverage and high coverage? Can you identify putative polymorphic positions? Can you identify positions that look like genuine SNPs?

SNP filtering

The SNP filtering is a delicate operation, since it is important to make a distinction between genuine SNPs and false positives due to sequencing errors. In this case, we used the following criteria:

alleles with a frequency equal to or less than the error rate (i.e. <=0.01) were removed
retained SNP should be biallelic
read depth (coverage) should be greater than 10 but less than 1000
at least 3 reads should be observed per SNP allele
maximum two SNPs per consensus sequence
not falling less than 5 bases away from identified microsatellites (if microsatellites were kept)

Those criteria are rather on the conservative side and might result in losing genuine SNPs. Explain why and propose better criteria.

Some SNP calling methods use more elaborate approaches such as maximum likelihood and bayesian methods. See the bibliography notes section.

Since the SNP filtering as done here is mainly an exercise in scripting (using R or Python for example), we will not perform it during the practicals and we will use ready-made output files containing the SNP information.

Summary plots for reality check

We give you a list of SNP summary files, with information about coverage and allele frequency in the full dataset (Total) and per populations, per habitat or per water type (freshwater vs marine).

ls -1

0_Ninesp_020212_Final_SNPs_Totals.txt
1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt
1_Ninesp_020212_Final_SNPs_Totals_and_Populations_Horizontal.txt
2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Horizontal.txt
2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Stacked.txt
3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt
3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Stacked.txt

Stacked and Horizontal correspond to different ways of organizing the data in the table, but the information is the same.

TODO Have a look at the SNP files

Look at the SNP files with, for example:

column -t 0_Ninesp_020212_Final_SNPs_Totals.txt | less -S

Here, we use the colum command to produce a nice, human-readable display of the tabular file, and we send it through a pipe to less for easy navigation through the file.

Have a look to a few different files. Can you understand what the different columns stand for? Can you see the difference between a Stacked file and a Horizontal file?

TODO Count the number of SNP filtered

Count the number of SNP which were filtered and kept:

wc -l 0_Ninesp_020212_Final_SNPs_Totals.txt

Note that the header line is also counted here.

How many SNP do we have? How many consensus sequences did we have initially? What is the approximate proportion of consensus sequences with a SNP?

Actually there are several SNP types: some are single SNPs on the consensus sequence, some are associated into haplotypes of several SNPs on the same sequence. Let's count how many SNPs of each type we have:

cut -f 13 0_Ninesp_020212_Final_SNPs_Totals.txt | sort | uniq -c

We use cut to extract the column 13 of the file (which contains the SNP type information) and pipe it to sort and uniq -c to count the number of occurrences of the different entries.

TODO Examine the coverage per population

When we looked at the number of reads per population, we noticed that there were large discrepancies between populations. Let's have a look at how the coverage per SNP per population compares.

# R script

# Compare the distribution of SNP coverage between populations
#-------------------------------------------------------------

# Load the data
d = read.table("1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt", header = T,
      stringsAsFactors = F)

# Get the columns names
names(d)

# Store the habitat types per population
pop = c("ABB", "BYN", "HKI", "LEV", "POR", "PYO", "RYT", "SKA")
habitat = c("Pond", "Pond", "Marine", "Marine", "Lake", "Pond", "Pond", "Lake")
habColors = c("green", "green", "red", "red", "grey", "green", "green", "grey")

# Boxplot of SNP coverage
boxplot(d2$AlignDepth ~ d2$Population, col = habColors)
# note: the colors match because boxplot sorts the population alphabetically

# Remove outliers display
boxplot(d2$AlignDepth ~ d2$Population, col = habColors, outline = F)

# As a comparison, we will plot again the read distribution per population

popReads = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR")
habitatReads = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake")
habColReads = c("green", "green", "red", "green", "green", "grey", "red", "grey")
n_pairs = c(175685, 577420, 446765, 652455, 396861, 78983, 361187, 321539)

# Sort the data alphabetically
n_pairs = n_pairs[order(popReads)]
habColReads = habColReads[order(popReads)]
popReads = popReasd[order(popReads)]

# Barplot
barplot(n_pairs, names.arg = popReads, col = habColReads, ylab = "N pairs", las = 1)

# The two plots together for easier comparison
par(mfrow = c(2, 1), mar = c(2.5, 4, 1, 1))
boxplot(d2$AlignDepth ~ d2$Population, col = habColors, outline = F, las = 1, 
  ylab = "Coverage")
barplot(n_pairs, names.arg = popReads, col = habColReads, ylab = "N pairs", las = 1)

Is the coverage homogeneous between populations? What is your feeling about estimating allele frequencies per population based on those coverage?

How do the numbers of pairs per population and the SNP coverage compare? Can you explain the differences?

TODO Compare ABB and BYN more thoroughly

Let's have a closer look at ABB and BYN:

# R script

# Load the data
d = read.table("1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt", header = T,
      stringsAsFactors = F)

# Get the data for ABB and BYN
d_ABB = subset(d, d$Population == "ABB")
d_BYN = subset(d, d$Population == "BYN")

# Look at the count of SNP per AlignDepth
table(d_ABB$AlignDepth)
table(d_BYN$AlignDepth)

# Plots
par(mfrow = c(2, 1), mar = c(2.5, 4, 1, 1))
plot(table(d_ABB$AlignDepth))
plot(table(d_BYN$AlignDepth))

# Same xlim
plot(table(d_ABB$AlignDepth), xlim = c(0, 650))
plot(table(d_BYN$AlignDepth), xlim = c(0, 650))

Can you now explain why ABB and BYN differ in their comparison if we consider read pairs or SNP coverage?

TODO SNP coverage per habitat type and per water type

We will have to pool our populations together by habitat type or by water type (freshwater vs marine) if we want to improve a bit allele frequency estimates. Note that this approach is far from perfect! Populations are different, and different demographic histories and evolutionary trajectories means that we might be pooling things which are different in essence and that our allele frequency estimates might have little meaning. However, since we are basically looking for signals of parallel evolution in one water type vs another, we can hope that, if such parallel evolution occurred for some markers, we might be able to recover some meaningful information when averaging the values during the genome scan step.

Let's have a look at coverage when we pool the data per habitat type:

# R script

# Load the data
d = read.table("2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Stacked.txt", header = T)
head(d)

# Boxplot
dev.off() # To have only one plot displayed again
boxplot(d$AlignDepth ~ d$HabitatType, col = c("grey", "red", "green"),
  ylab = "Coverage", las = 1)

# Remove outliers
boxplot(d$AlignDepth ~ d$HabitatType, col = c("grey", "red", "green"),
  ylab = "Coverage", las = 1, outline = F)

How do the coverages per habitat type compare?

In the end we will pool the data by water type. Let's look at the coverages in more details:

# R script

# Load the data pooled by water type
d = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Stacked.txt", header = T)

# Boxplot
boxplot(d$AlginDepth ~ d$WaterType, ylab = "Coverage", las = 1)
boxplot(d$AlginDepth ~ d$WaterType, ylab = "Coverage", las = 1, outline = F)

# Load the data in the horizontal format
d2 = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt", header = T)

# Plot the relation between coverage in marine populations and in freshwater populations
plot(d2$Marine_AlignDepth, d2$Freshwater_AlignDepth, xlab = "Marine coverage",
  ylab = "Freshwater coverage", pch = 16, col = rgb(0, 0, 1, 0.2))

# Use log scale (note that we had 1 to all values to see also the zeros)
plot(d2$Marine_AlignDepth + 1, d2$Freshwater_AlignDepth + 1, xlab = "Marine coverage + 1",
  ylab = "Freshwater coverage + 1", pch = 16, col = rgb(0, 0, 1, 0.2), log = "xy")

# Add a bit of jittering to reduce points overlap
plot(jitter(d2$Marine_AlignDepth + 1), d2$Freshwater_AlignDepth + 1, xlab = "Marine coverage + 1",
  ylab = "Freshwater coverage + 1", pch = 16, col = rgb(0, 0, 1, 0.2), log = "xy")

We can also compare the number of fixed alleles in each water type. Questions are embedded in the following code:

# R script

# Load the data in the horizontal format
d2 = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt", header = T)

# Compare the allele frequencies
plot(d2$Marine_snpfreq1, d2$Freshwater_snpfreq1, xlab = "Marine freq", 
  ylab = "Freshwater freq", pch = 16, color = rgb(0, 0, 1, 0.2))

# *Q*: Can you see the fixed alleles in the freshwater habitats? In the marine
# habitats?

# Another view of the same data
smoothScatter(d2$Marine_snpfreq1, d2$Freshwater_snpfreq1, xlab = "Marine freq", 
  ylab = "Freshwater freq")

# *Q*: In the density scatter plot, compare the frequencies of alleles fixed in
# the marine populations with the frequencies in the freshwater populations.

# Let's count the number of fixed alleles in each water type
sum(d2$Marine_snpfreq1 == 0 | d2$Marine_snpfreq1 == 1, na.rm = T)
sum(d2$Freshwater_snpfreq1 == 0 | d2$Freshwater_snpfreq1 == 1, na.rm = T)

# *Q*: Is this what you expected? If not, how can you explain it?

Next steps

We now have the consensus sequences, a list of variants on those sequences and some estimates of their allele frequencies in the different water types. The more technical part concerning the processing of raw RAD-seq data into genetic data of interest is now complete, and we can proceed to the downstream analyses.

Next: Downstream analysis

Roscoff course 2015 - RAD tags

Table of Contents