Table of Contents
Variant calling and genotyping
Mapping back the reads to the consensus sequences
TODO Examine Phred score profiles
Now that we have the consensus sequences, we need to map back each read to them to align matching reads together and detect polymorphic positions.
Read calling during sequencing is not perfect, and for Illumina the miscall rate is around 1% (Nielsen et al. 2011). To minimise the errors during SNP detection (false positives), we should process the reads to remove bases with low quality score from the analysis.
Let's examine the distribution of quality scores for each set of reads. We can
use a homemade python script (extract_phred_profiles.py
) to get the Phred
score in a tabular format. We can then use R to plot the average profiles.
python extract_phred_profiles.py s_6_2_sequence.fastq
# R script # # Plot the Phred profile for a FASTQ file #---------------------------------------- p = read.table("s_6_2_sequence.fastq.phred_profile", sep = "\t", header = F) boxplot(p, outline = F)
How does the quality score change along the reads?
TODO Read trimming
To remove low quality bases from the 3' end, we will remove bases from this end until the Phred quality score reached 20 (error rate = 1%). Remaining reads with length less than 40 will be discarded.
We could do that using Fastq Quality trimmer as implemented in Galaxy tools,
but we can also use a homemade python script (trim_reads.py
):
# Usage: python script.py inputFile phredThreshold lengthThreshold python trim_reads.py s_6_2_sequence.fastq 20 40
We apply the quality trimming to each forward and reverse reads files for each population:
ls -1
ABB.pair.for.fastq.clean ABB.pair.rev.fastq.clean BYN.pair.for.fastq.clean BYN.pair.rev.fastq.clean HKI.pair.for.fastq.clean HKI.pair.rev.fastq.clean LEV.pair.for.fastq.clean LEV.pair.rev.fastq.clean POR.pair.for.fastq.clean POR.pair.rev.fastq.clean PYO.pair.for.fastq.clean PYO.pair.rev.fastq.clean RYT.pair.for.fastq.clean RYT.pair.rev.fastq.clean SKA.pair.for.fastq.clean SKA.pair.rev.fastq.clean
python trim_reads.py *.fastq.clean 20 40
Also apply this trimming to the files containing all the forward reads and all the reverse reads used for the de novo assembly. How would you examine the resulting distribution of read lengths in those trimmed files? If you have time, do it.
Variant calling
TODO Prepare input files for Maq
We will use the Maq
program to perform the read alignment. A detailed
explanation of the Maq workflow is available here.
We have the reference files:
consensus-for.fasta consensus-rev.fasta
and the reads files:
ABB.pair.for.fastq.clean.trimmed ABB.pair.rev.fastq.clean.trimmed BYN.pair.for.fastq.clean.trimmed BYN.pair.rev.fastq.clean.trimmed HKI.pair.for.fastq.clean.trimmed HKI.pair.rev.fastq.clean.trimmed LEV.pair.for.fastq.clean.trimmed LEV.pair.rev.fastq.clean.trimmed POR.pair.for.fastq.clean.trimmed POR.pair.rev.fastq.clean.trimmed PYO.pair.for.fastq.clean.trimmed PYO.pair.rev.fastq.clean.trimmed RYT.pair.for.fastq.clean.trimmed RYT.pair.rev.fastq.clean.trimmed SKA.pair.for.fastq.clean.trimmed SKA.pair.rev.fastq.clean.trimmed
First, we convert the fasta reference to Maq bfa format:
maq fasta2bfa consensus-for.fasta consensus-for.bfa maq fasta2bfa consensus-rev.fasta consensus-rev.bfa
Then we convert the fastq read files to Maq bfq format using a small bash
script, convert-trimmed-to-bfq.sh
. The script file is:
#!/bin/bash FILES=`ls *.trimmed` # get the list of trimmed read files for f in $FILES do echo $f maq sol2sanger $f $f.sanger # convert Illumina 1.5+ to Sanger scores maq fastq2bfq $f.sanger $f.bfq # convert fastq to bfq done
We use it with:
bash convert-trimmed-to-bfq.sh
We also generate the bfq files for all the concatenated forward reads and all the concatenated reverse reads:
# Concatenate the forward and reverse reads cat *for*.trimmed > reads.for.trimmed cat *rev*.trimmed > reads.rev.trimmed # Convert to bfq maq sol2sanger reads.for.trimmed reads.for.sanger maq sol2sanger reads.rev.trimmed reads.rev.sanger maq fastq2bfq reads.for.sanger reads.for.bfq maq fastq2bfq reads.rev.sanger reads.rev.bfq
At this point, the fasta and fastq files have been converted to the binary format used by Maq.
TODO Read alignment to the consensus sequences
Here we will use all the forward reads and all the reverse reads and align them to the consensus sequences. For the practicals, we work only with the forward reads from now on.
Again, this step could be long, so we submit it as a job in the file
map-maq-for-job
. The map file is a binary file, but we also convert it to a
human-readable form with mapview
:
#!/bin/bash #$ -S /bin/bash #$ -M matthieu.bruneaux@ens-lyon.org #$ -V #$ -m bea #$ -cwd # Maq map forward reads maq map for.map consensus-for.bfa reads.for.bfq maq mapview for.map > for.map.txt
We submit the job with qsub -q long.q map-maq-for-job
. Have a look at the map
with less -S for.map.txt
. Using the man page (search for "mapview" with
CTRL+F), understand the output format, and in particular look for the number of
mismatches of the best hits. Can you identify which reads are similar to the
consensus sequence and which harbour potential SNPs?
TODO SNP calling
Now that the reads have been mapped back, we can perform the SNP calling operation.
We use the assemble
and the cns2snp
function of Maq to extract the SNP
information. The job file is maq-assemble-cns2snp-job
:
#!/bin/bash #$ -S /bin/bash #$ -M matthieu.bruneaux@ens-lyon.org #$ -V #$ -m bea #$ -cwd # Maq assemble maq assemble for.cns consensus-for.bfa for.map 2> for-assemble.log # Maq cns2snp maq cns2snp for.cns > for.snp
As usual, we submit the job with qsub -q long.q
map-assemble-cns2snp-job
. Have a look at the output file for.snp
. How many
putative SNPs were detected? The format of the output file is explained on the
man page (search for "cns2snp" with CTRL+F).
A very useful file for detailed SNP filtering is the pileup file. We can produce it with:
maq pileup consensus-for.bfa for.map > for.pileup
Have a look at the pileup file with less -S for.pileup
. Can you understand
its format? (see the man page, search for "pileup") Can you identify the
contigs with low coverage and high coverage? Can you identify putative
polymorphic positions? Can you identify positions that look like genuine SNPs?
SNP filtering
The SNP filtering is a delicate operation, since it is important to make a distinction between genuine SNPs and false positives due to sequencing errors. In this case, we used the following criteria:
- alleles with a frequency equal to or less than the error rate (i.e. <=0.01) were removed
- retained SNP should be biallelic
- read depth (coverage) should be greater than 10 but less than 1000
- at least 3 reads should be observed per SNP allele
- maximum two SNPs per consensus sequence
- not falling less than 5 bases away from identified microsatellites (if microsatellites were kept)
Those criteria are rather on the conservative side and might result in losing genuine SNPs. Explain why and propose better criteria.
Some SNP calling methods use more elaborate approaches such as maximum likelihood and bayesian methods. See the bibliography notes section.
Since the SNP filtering as done here is mainly an exercise in scripting (using R or Python for example), we will not perform it during the practicals and we will use ready-made output files containing the SNP information.
Summary plots for reality check
We give you a list of SNP summary files, with information about coverage and
allele frequency in the full dataset (Total
) and per populations, per habitat
or per water type (freshwater vs marine).
ls -1
0_Ninesp_020212_Final_SNPs_Totals.txt 1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt 1_Ninesp_020212_Final_SNPs_Totals_and_Populations_Horizontal.txt 2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Horizontal.txt 2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Stacked.txt 3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt 3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Stacked.txt
Stacked
and Horizontal
correspond to different ways of organizing the data
in the table, but the information is the same.
TODO Have a look at the SNP files
Look at the SNP files with, for example:
column -t 0_Ninesp_020212_Final_SNPs_Totals.txt | less -S
Here, we use the colum
command to produce a nice, human-readable display of
the tabular file, and we send it through a pipe to less
for easy navigation
through the file.
Have a look to a few different files. Can you understand what the different
columns stand for? Can you see the difference between a Stacked
file and a
Horizontal
file?
TODO Count the number of SNP filtered
Count the number of SNP which were filtered and kept:
wc -l 0_Ninesp_020212_Final_SNPs_Totals.txt
Note that the header line is also counted here.
How many SNP do we have? How many consensus sequences did we have initially? What is the approximate proportion of consensus sequences with a SNP?
Actually there are several SNP types: some are single SNPs on the consensus sequence, some are associated into haplotypes of several SNPs on the same sequence. Let's count how many SNPs of each type we have:
cut -f 13 0_Ninesp_020212_Final_SNPs_Totals.txt | sort | uniq -c
We use cut
to extract the column 13 of the file (which contains the SNP type
information) and pipe it to sort
and uniq -c
to count the number of
occurrences of the different entries.
TODO Examine the coverage per population
When we looked at the number of reads per population, we noticed that there were large discrepancies between populations. Let's have a look at how the coverage per SNP per population compares.
# R script # Compare the distribution of SNP coverage between populations #------------------------------------------------------------- # Load the data d = read.table("1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt", header = T, stringsAsFactors = F) # Get the columns names names(d) # Store the habitat types per population pop = c("ABB", "BYN", "HKI", "LEV", "POR", "PYO", "RYT", "SKA") habitat = c("Pond", "Pond", "Marine", "Marine", "Lake", "Pond", "Pond", "Lake") habColors = c("green", "green", "red", "red", "grey", "green", "green", "grey") # Boxplot of SNP coverage boxplot(d2$AlignDepth ~ d2$Population, col = habColors) # note: the colors match because boxplot sorts the population alphabetically # Remove outliers display boxplot(d2$AlignDepth ~ d2$Population, col = habColors, outline = F) # As a comparison, we will plot again the read distribution per population popReads = c("BYN", "RYT", "HKI", "PYO", "ABB", "SKA", "LEV", "POR") habitatReads = c("Pond", "Pond", "Marine", "Pond", "Pond", "Lake", "Marine", "Lake") habColReads = c("green", "green", "red", "green", "green", "grey", "red", "grey") n_pairs = c(175685, 577420, 446765, 652455, 396861, 78983, 361187, 321539) # Sort the data alphabetically n_pairs = n_pairs[order(popReads)] habColReads = habColReads[order(popReads)] popReads = popReasd[order(popReads)] # Barplot barplot(n_pairs, names.arg = popReads, col = habColReads, ylab = "N pairs", las = 1) # The two plots together for easier comparison par(mfrow = c(2, 1), mar = c(2.5, 4, 1, 1)) boxplot(d2$AlignDepth ~ d2$Population, col = habColors, outline = F, las = 1, ylab = "Coverage") barplot(n_pairs, names.arg = popReads, col = habColReads, ylab = "N pairs", las = 1)
Is the coverage homogeneous between populations? What is your feeling about estimating allele frequencies per population based on those coverage?
How do the numbers of pairs per population and the SNP coverage compare? Can you explain the differences?
TODO Compare ABB and BYN more thoroughly
Let's have a closer look at ABB and BYN:
# R script # Load the data d = read.table("1_Ninesp_020212_Final_SNPs_Populations_Stacked.txt", header = T, stringsAsFactors = F) # Get the data for ABB and BYN d_ABB = subset(d, d$Population == "ABB") d_BYN = subset(d, d$Population == "BYN") # Look at the count of SNP per AlignDepth table(d_ABB$AlignDepth) table(d_BYN$AlignDepth) # Plots par(mfrow = c(2, 1), mar = c(2.5, 4, 1, 1)) plot(table(d_ABB$AlignDepth)) plot(table(d_BYN$AlignDepth)) # Same xlim plot(table(d_ABB$AlignDepth), xlim = c(0, 650)) plot(table(d_BYN$AlignDepth), xlim = c(0, 650))
Can you now explain why ABB and BYN differ in their comparison if we consider read pairs or SNP coverage?
TODO SNP coverage per habitat type and per water type
We will have to pool our populations together by habitat type or by water type (freshwater vs marine) if we want to improve a bit allele frequency estimates. Note that this approach is far from perfect! Populations are different, and different demographic histories and evolutionary trajectories means that we might be pooling things which are different in essence and that our allele frequency estimates might have little meaning. However, since we are basically looking for signals of parallel evolution in one water type vs another, we can hope that, if such parallel evolution occurred for some markers, we might be able to recover some meaningful information when averaging the values during the genome scan step.
Let's have a look at coverage when we pool the data per habitat type:
# R script # Load the data d = read.table("2_Ninesp_020212_Final_SNPs_Totals_and_Habitats_Stacked.txt", header = T) head(d) # Boxplot dev.off() # To have only one plot displayed again boxplot(d$AlignDepth ~ d$HabitatType, col = c("grey", "red", "green"), ylab = "Coverage", las = 1) # Remove outliers boxplot(d$AlignDepth ~ d$HabitatType, col = c("grey", "red", "green"), ylab = "Coverage", las = 1, outline = F)
How do the coverages per habitat type compare?
In the end we will pool the data by water type. Let's look at the coverages in more details:
# R script # Load the data pooled by water type d = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Stacked.txt", header = T) # Boxplot boxplot(d$AlginDepth ~ d$WaterType, ylab = "Coverage", las = 1) boxplot(d$AlginDepth ~ d$WaterType, ylab = "Coverage", las = 1, outline = F) # Load the data in the horizontal format d2 = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt", header = T) # Plot the relation between coverage in marine populations and in freshwater populations plot(d2$Marine_AlignDepth, d2$Freshwater_AlignDepth, xlab = "Marine coverage", ylab = "Freshwater coverage", pch = 16, col = rgb(0, 0, 1, 0.2)) # Use log scale (note that we had 1 to all values to see also the zeros) plot(d2$Marine_AlignDepth + 1, d2$Freshwater_AlignDepth + 1, xlab = "Marine coverage + 1", ylab = "Freshwater coverage + 1", pch = 16, col = rgb(0, 0, 1, 0.2), log = "xy") # Add a bit of jittering to reduce points overlap plot(jitter(d2$Marine_AlignDepth + 1), d2$Freshwater_AlignDepth + 1, xlab = "Marine coverage + 1", ylab = "Freshwater coverage + 1", pch = 16, col = rgb(0, 0, 1, 0.2), log = "xy")
We can also compare the number of fixed alleles in each water type. Questions are embedded in the following code:
# R script # Load the data in the horizontal format d2 = read.table("3_Ninesp_020212_Final_SNPs_Totals_and_Waters_Horizontal.txt", header = T) # Compare the allele frequencies plot(d2$Marine_snpfreq1, d2$Freshwater_snpfreq1, xlab = "Marine freq", ylab = "Freshwater freq", pch = 16, color = rgb(0, 0, 1, 0.2)) # *Q*: Can you see the fixed alleles in the freshwater habitats? In the marine # habitats? # Another view of the same data smoothScatter(d2$Marine_snpfreq1, d2$Freshwater_snpfreq1, xlab = "Marine freq", ylab = "Freshwater freq") # *Q*: In the density scatter plot, compare the frequencies of alleles fixed in # the marine populations with the frequencies in the freshwater populations. # Let's count the number of fixed alleles in each water type sum(d2$Marine_snpfreq1 == 0 | d2$Marine_snpfreq1 == 1, na.rm = T) sum(d2$Freshwater_snpfreq1 == 0 | d2$Freshwater_snpfreq1 == 1, na.rm = T) # *Q*: Is this what you expected? If not, how can you explain it?
Next steps
We now have the consensus sequences, a list of variants on those sequences and some estimates of their allele frequencies in the different water types. The more technical part concerning the processing of raw RAD-seq data into genetic data of interest is now complete, and we can proceed to the downstream analyses.