Table of Contents
De novo assembly
Assembly into consensus sequences
We choose to go for a de novo assembly since we do not have a genome for this species. Several tools were tested, and in the end we decided to go with a bit older tool, CAP3 (original publication), which can incorporate quality information in the assembly process.
We perform two independent assemblies, one for the forward reads and one for the reverse reads.
Concatenate all reads for a given sense
We have to pool all the reads of a given sense together:
ls -1
ABB.pair.for.fastq.clean ABB.pair.rev.fastq.clean BYN.pair.for.fastq.clean BYN.pair.rev.fastq.clean HKI.pair.for.fastq.clean HKI.pair.rev.fastq.clean LEV.pair.for.fastq.clean LEV.pair.rev.fastq.clean POR.pair.for.fastq.clean POR.pair.rev.fastq.clean PYO.pair.for.fastq.clean PYO.pair.rev.fastq.clean RYT.pair.for.fastq.clean RYT.pair.rev.fastq.clean SKA.pair.for.fastq.clean SKA.pair.rev.fastq.clean
cat *for* > for.fastq cat *rev* > rev.fastq
We obtain two large files with all the reads.
TODO Run CAP3 (forward reads)
CAP3 takes for input two files, one containing the FASTA sequences and one
containing the corresponding quality scores. We can split the FASTQ files into
those two files using a python script (split_fasta_qual.py
):
python split_fasta_qual.py for.fastq python split_fasta_qual.py rev.fastq
We submit the CAP3 assembly as a job on the ABiMS server. We need to prepare
our job file (cap3-for-job
) (see the ABiMS how-to page for more details about
job files and how to submit them):
#!/bin/bash #$ -S /bin/bash #$ -M matthieu.bruneaux@ens-lyon.org #$ -V #$ -m bea #$ -cwd cap3 for.fastq.fasta > for.cap3.log
Here the command is simply cap3 for.fastq.fasta > for.cap3.log
. CAP3
automatically gets the quality information from for.fastq.fasta.qual
.
The job file is submitted to the long queue with:
qsub -q long.q cap3-for-job
TODO Run CAP3 (reverse reads)
Submit a similar assembly for the reverse reads.
An alternative approach would have been to map the reads back to the reference genome. Why did we not use the threespine genome as a reference?
(When mapping back we have to be quite stringent and allow only for example one or two mismatch, and one unique good quality mapping location. With the reference genome of a different species, this can be problematic and the mapping back might be less efficient that creating consensus contigs or stacks.)
CAP3 run
The CAP3 run is quite long (a bit more than 15 hours for the reverse reads). We
give you the assembly for the reverse reads (CAP3-rev.fasta
), using the
default parameters of CAP3. Note that for a real analysis we would need to made
several tests with different parameters and select the ones that give the
better assembly.
Quality filtering of consensus sequences
The de novo assembly with CAP3 can be tuned by changing the parameters from the default ones (which we should do for a real analysis). An assembly is likely not to be perfect and the resulting consensus sequences have to be quality checked before being used for read mapping.
TODO Examine read length
Because of the ddRAD protocol, we expect to produce consensus sequences by piling up identical or very similar reads into stacks. Since most of the reads have the same length to start with (xx for forward reads and xx for reverse reads after cutting the barcode and restriction sites), we expect to obtain consensus sequences of the same length.
We can get the length of the reads by using a homemade python script
(get_fasta_length.py
):
python get_fasta_length.py cap3-rev.fasta
We can plot the read length distribution with R:
# R script # # Plot read length distribution #------------------------------ d = read.table("cap3-rev.fasta.lengths") plot(hist(d[, 1])
How does the length distribution look like? What is the length of forward reads? Of reverse reads? How would you remove the sequences that do not match the median length?
TODO Identify repetitive motives or extremely similar consensus sequences
CAP3 sometimes produces consensus which contain repetitive motives or very similar consensus sequences. Those sequences are problematic for the read mapping back. We perform a blast of the consensus sequences against themselves to identify sequences which have motives with many matches, or duplicate consensus sequences.
To perform the blast search, we firt prepare a blast database from the
consensus sequences and then we use blastn
:
# Prepare the blast database makeblastdb -help makeblastdb -in cap3-rev.fasta -dbtype nucl
We have to perform the blastn
by submitting a job because it is a long
process. See blastn -help
for more details. The job file is
cap3-blastn-job
:
#!/bin/bash #$ -S /bin/bash #$ -M matthieu.bruneaux@ens-lyon.org #$ -V #$ -m bea #$ -cwd # Perform the blast search (output in tabular format) blastn -task blastn -db cap3-rev.fasta -query cap3-rev.fasta -out cap3-blastn-results -outfmt 6
We submit the job with qsub -q long.q cap3-for-job
. Here we ask for a tabular
output format.
Next, we parse the results to identify consensus sequences to be removed or to
be merged together. We must remember that the default output for blastn
in
tabular format is (from blastn -help
):
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
We can first have a look at the distribution of the size of the matches;
cut -f 4 cap3-blastn-results > lengths
# R script d = read.table("lengths")[, 1] hist(d, xlim = c(0,120), breaks = 0:120, col = "cornflowerblue")
What is the minimal match length observed in this distribution? How would you differentiate between repeated motives and duplicate consensus sequences?
We can also look at the number of matches for each consensus sequence:
# command line cut -f 1 cap3-blastn-results > seq-names sort seq-names > seq-names-sorted uniq -c seq-names-sorted > seq-names-count sort -nr seq-names-count > seq-match-count
# R script d = read.table("seq-match-count") names(d) = c("count", "consensus") plot(d$count, pch = ".") plot(d$count, pch = ".", ylim = c(0, 100)) hist(d$count, col = "cornflowerblue", xlab = "N matches") hist(d$count, col = "cornflowerblue", ylim = c(0, 10000), xlab = "N matches") hist(d$count, breaks = 1000, col = "cornflowerblue", xlab = "N matches", xlim = c(0, 200))
How does the distribution of matches look like? How would you select a threshold? How would you apply it?
Here is some R code to extract the names of the consensus sequences for which the number of matches:
# R d = read.table("seq-match-count") names(d) = c("count", "consensus") threshold = xx # Replace xx with your threshold for the number of matches allowed d2 = subset(d, d$count <= threshold) nrow(d) nrow(d2) write.table(d2$consensus, file = "names-kept-consensus", quote = F, row.names = F, col.names = F)
Have a look at names-kept-consensus
with less
. How would you use this file
to filter only the consensus sequences we want to keep? If you have some time,
do it.
TODO Remove microsatellite repeats
The last quality control step is to remove consensus sequences that contains
microsatellite repeats. To do this, we use Sputnik on the consensus
sequences. We should use it on the consensus sequence file after removing the
sequences with repeated motives, but since the intermediate file is missing
we'll run it on the full consensus file. We have to submit a job again,
sputnik-job
:
#!/bin/bash #$ -S /bin/bash #$ -M matthieu.bruneaux@ens-lyon.org #$ -V #$ -m bea #$ -cwd # Perform the blast search (output in tabular format) sputnik cap3-rev.fasta
We submit it with qsub -q long.q sputnik-job
.
Examine the output of Sputnik and count the number of contigs with microsatellite repeats. How would you proceed to remove them from the initial consensus file?
We will now give you two fasta files containing the filtered consensus sequences that we decide to keep in the end.