De novo assembly

Assembly into consensus sequences

We choose to perform a de novo assembly since we do not have a genome for this species. Several tools were tested, and in the end we decided to go with a somewhat older tool, CAP3 (original publication), which can incorporate quality information into the assembly process.

We perform two independent assemblies, one for the forward reads and one for the reverse reads.

Concatenate all reads for a given sense

We have to pool all the reads of a given sense together:

ls -1
ABB.pair.for.fastq.clean
ABB.pair.rev.fastq.clean
BYN.pair.for.fastq.clean
BYN.pair.rev.fastq.clean
HKI.pair.for.fastq.clean
HKI.pair.rev.fastq.clean
LEV.pair.for.fastq.clean
LEV.pair.rev.fastq.clean
POR.pair.for.fastq.clean
POR.pair.rev.fastq.clean
PYO.pair.for.fastq.clean
PYO.pair.rev.fastq.clean
RYT.pair.for.fastq.clean
RYT.pair.rev.fastq.clean
SKA.pair.for.fastq.clean
SKA.pair.rev.fastq.clean
cat *for* > for.fastq
cat *rev* > rev.fastq

We obtain two large files with all the reads.
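Before launching the assembly, it is worth checking that the pooling went well by counting the reads in each pooled file. A minimal Python sketch, assuming standard four-line FASTQ records (the function name is ours, not part of the pipeline):

```python
def count_fastq_reads(path):
    """Count the records in a FASTQ file, assuming 4 lines per record."""
    with open(path) as f:
        n_lines = sum(1 for _ in f)
    if n_lines % 4 != 0:
        raise ValueError("line count not a multiple of 4: truncated FASTQ?")
    return n_lines // 4

# Example: count_fastq_reads("for.fastq")
```

The counts for for.fastq and rev.fastq should be identical, since the reads are paired.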

TODO Run CAP3 (forward reads)

CAP3 takes as input two files, one containing the FASTA sequences and one containing the corresponding quality scores. We can split the FASTQ files into those two files using a python script (split_fasta_qual.py):

python split_fasta_qual.py for.fastq
python split_fasta_qual.py rev.fastq
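split_fasta_qual.py is a homemade script, but the split it performs is simple. Here is a minimal sketch of what such a script might do; it assumes Phred+33 quality encoding (older Illumina data can be Phred+64), and the function name and details are illustrative, not the actual script:

```python
def split_fasta_qual(fastq_path):
    """Split a FASTQ file into <input>.fasta (sequences) and
    <input>.fasta.qual (space-separated integer qualities), the
    paired files expected by CAP3. Assumes Phred+33 encoding."""
    with open(fastq_path) as f, \
         open(fastq_path + ".fasta", "w") as fasta, \
         open(fastq_path + ".fasta.qual", "w") as qual:
        while True:
            header = f.readline()
            if not header:
                break
            seq = f.readline().strip()
            f.readline()  # "+" separator line
            quals = f.readline().strip()
            name = header[1:].split()[0]  # drop the leading "@"
            fasta.write(">%s\n%s\n" % (name, seq))
            # Convert ASCII-encoded qualities to integers (Phred+33)
            qual.write(">%s\n%s\n" % (name,
                       " ".join(str(ord(c) - 33) for c in quals)))
```

The important point is that the sequence names match between the .fasta and .fasta.qual files, which is how CAP3 pairs them.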

We submit the CAP3 assembly as a job on the ABiMS server. We need to prepare our job file (cap3-for-job) (see the ABiMS how-to page for more details about job files and how to submit them):

#!/bin/bash
#$ -S /bin/bash
#$ -M matthieu.bruneaux@ens-lyon.org
#$ -V
#$ -m bea
#$ -cwd
cap3 for.fastq.fasta > for.cap3.log

Here the command is simply cap3 for.fastq.fasta > for.cap3.log. CAP3 automatically gets the quality information from for.fastq.fasta.qual.

The job file is submitted to the long queue with:

qsub -q long.q cap3-for-job

TODO Run CAP3 (reverse reads)

Submit a similar assembly for the reverse reads.

An alternative approach would have been to map the reads back to a reference genome. Why did we not use the threespine stickleback genome as a reference?

(When mapping back we have to be quite stringent, allowing for example only one or two mismatches and a single unique, good-quality mapping location. With the reference genome of a different species this can be problematic, and the mapping back might be less efficient than building consensus contigs or stacks.)

CAP3 run

The CAP3 run is quite long (a bit more than 15 hours for the reverse reads). We give you the assembly for the reverse reads (cap3-rev.fasta), produced with the default parameters of CAP3. Note that for a real analysis we would need to run several tests with different parameters and select the ones that give the best assembly.

Quality filtering of consensus sequences

The de novo assembly with CAP3 can be tuned by changing the parameters from the default ones (which we should do for a real analysis). The assembly is unlikely to be perfect, and the resulting consensus sequences have to be quality-checked before being used for read mapping.

TODO Examine read length

Because of the ddRAD protocol, we expect to produce consensus sequences by piling up identical or very similar reads into stacks. Since most of the reads have the same length to start with (xx for forward reads and xx for reverse reads after cutting the barcode and restriction sites), we expect to obtain consensus sequences of the same length.

We can get the length of the reads by using a homemade python script (get_fasta_length.py):

python get_fasta_length.py cap3-rev.fasta
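get_fasta_length.py is a homemade script; presumably it writes one length per sequence to cap3-rev.fasta.lengths, the file read by the R code below. A minimal sketch of the core computation (the actual script may differ; the function name is ours):

```python
def fasta_lengths(path):
    """Compute the length of each sequence in a FASTA file.
    Handles sequences wrapped over several lines."""
    lengths = {}
    name = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # sequence name, without ">"
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```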

We can plot the read length distribution with R:

# R script
#
# Plot read length distribution
#------------------------------

d = read.table("cap3-rev.fasta.lengths")
hist(d[, 1])

What does the length distribution look like? What is the length of the forward reads? Of the reverse reads? How would you remove the sequences that do not match the median length?
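For the last question, one possible approach is sketched below in Python. It assumes the sequences have already been loaded as a list of (name, sequence) tuples; the function name is illustrative:

```python
import statistics

def filter_by_median_length(records):
    """Keep only the (name, sequence) records whose length equals the
    median sequence length. `records` is a list of (name, seq) tuples."""
    med = statistics.median(len(seq) for _, seq in records)
    return [(name, seq) for name, seq in records if len(seq) == med]
```

Since most reads have the same length to start with, the median length is a natural reference point for spotting aberrant consensus sequences.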

TODO Identify repetitive motifs or extremely similar consensus sequences

CAP3 sometimes produces consensus sequences which contain repetitive motifs, or several very similar consensus sequences. Those sequences are problematic when mapping the reads back. We perform a blast of the consensus sequences against themselves to identify sequences whose motifs have many matches, or duplicate consensus sequences.

To perform the blast search, we first prepare a blast database from the consensus sequences and then we use blastn:

# Prepare the blast database
makeblastdb -help
makeblastdb -in cap3-rev.fasta -dbtype nucl

We have to perform the blastn by submitting a job because it is a long process. See blastn -help for more details. The job file is cap3-blastn-job:

#!/bin/bash
#$ -S /bin/bash
#$ -M matthieu.bruneaux@ens-lyon.org
#$ -V
#$ -m bea
#$ -cwd
# Perform the blast search (output in tabular format)
blastn -task blastn -db cap3-rev.fasta -query cap3-rev.fasta -out cap3-blastn-results -outfmt 6

We submit the job with qsub -q long.q cap3-blastn-job. Here we ask for a tabular output format.

Next, we parse the results to identify consensus sequences to be removed or to be merged together. We must remember that the default output for blastn in tabular format is (from blastn -help):

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

We can first have a look at the distribution of the sizes of the matches:

cut -f 4 cap3-blastn-results > lengths
# R script
d = read.table("lengths")[, 1]
hist(d, xlim = c(0,120), breaks = 0:120, col = "cornflowerblue")

What is the minimal match length observed in this distribution? How would you differentiate between repeated motifs and duplicate consensus sequences?
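Using the tabular columns listed above (qseqid, sseqid, pident, length, ...), one way to think about this question is sketched below. The thresholds (95% identity, 90% of an assumed 100 bp consensus length) are illustrative assumptions, not values from this analysis:

```python
def classify_hit(qseqid, sseqid, pident, length, consensus_len=100):
    """Classify one self-BLAST hit (fields from outfmt 6).
    Thresholds here are illustrative, not tuned values."""
    if qseqid == sseqid and length >= consensus_len:
        return "self"          # a sequence matching itself over its full length
    if pident >= 95 and length >= 0.9 * consensus_len:
        return "duplicate"     # near-identical, near-full-length match
    return "repeat-like"       # short or partial match: possible repeated motif
```

Every sequence produces one trivial full-length self hit, which has to be discounted before counting matches.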

We can also look at the number of matches for each consensus sequence:

# command line
cut -f 1 cap3-blastn-results > seq-names
sort seq-names > seq-names-sorted
uniq -c seq-names-sorted > seq-names-count
sort -nr seq-names-count > seq-match-count
# R script
d = read.table("seq-match-count")
names(d) = c("count", "consensus")
plot(d$count, pch = ".")
plot(d$count, pch = ".", ylim = c(0, 100))
hist(d$count, col = "cornflowerblue", xlab = "N matches")
hist(d$count, col = "cornflowerblue", ylim = c(0, 10000), xlab = "N matches")
hist(d$count, breaks = 1000, col = "cornflowerblue", xlab = "N matches",
     xlim = c(0, 200))

What does the distribution of matches look like? How would you select a threshold? How would you apply it?

Here is some R code to extract the names of the consensus sequences for which the number of matches is below a given threshold:

# R
d = read.table("seq-match-count")
names(d) = c("count", "consensus")
threshold = xx # Replace xx with your threshold for the number of matches allowed
d2 = subset(d, d$count <= threshold)
nrow(d)
nrow(d2)
write.table(d2$consensus, file = "names-kept-consensus", quote = F,
  row.names = F, col.names = F)

Have a look at names-kept-consensus with less. How would you use this file to filter only the consensus sequences we want to keep? If you have some time, do it.
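One way to use names-kept-consensus is sketched below in Python: read the list of names to keep, then copy only the matching FASTA records to a new file. The function name and output file are illustrative:

```python
def filter_fasta_by_names(fasta_path, names_path, out_path):
    """Write to out_path only the FASTA records whose name appears in
    names_path (one name per line, as in names-kept-consensus)."""
    keep = set(line.strip() for line in open(names_path))
    write = False
    with open(fasta_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                name = line[1:].split()[0]
                write = name in keep
            if write:
                fout.write(line)

# Example: filter_fasta_by_names("cap3-rev.fasta", "names-kept-consensus",
#                                "cap3-rev.kept.fasta")
```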

TODO Remove microsatellite repeats

The last quality control step is to remove consensus sequences that contain microsatellite repeats. To do this, we use Sputnik on the consensus sequences. We should run it on the consensus sequence file obtained after removing the sequences with repeated motifs, but since that intermediate file is missing we'll run it on the full consensus file. We have to submit a job again, sputnik-job:

#!/bin/bash
#$ -S /bin/bash
#$ -M matthieu.bruneaux@ens-lyon.org
#$ -V
#$ -m bea
#$ -cwd
# Run Sputnik to detect microsatellite repeats
sputnik cap3-rev.fasta

We submit it with qsub -q long.q sputnik-job.

Examine the output of Sputnik and count the number of contigs with microsatellite repeats. How would you proceed to remove them from the initial consensus file?

We will now give you two fasta files containing the filtered consensus sequences that we decided to keep in the end.