Assembling E. coli sequences with Velvet

The goal of this tutorial is to show you the basics of assembly using the Velvet assembler.

We’ll be using data from Efficient de novo assembly of single-cell bacterial genomes from short-read data sets, Chitsaz et al., 2011.

Booting an Amazon AMI

Start up an Amazon computer (m1.large or m1.xlarge) using AMI ami-d25283ba (see Start up an EC2 instance and Starting up a custom operating system).

Log in with Windows or from Mac OS X.

Logging in

Log in and type:

sudo bash

to change into superuser mode.

Packages to install

Softwares and packages are already installed:

  • velvet (An assmbler)
  • khmer (A package for preprocess and assemble large data)
  • quast (A tools for evaluating assembly with known genomes)

Data

Data is already downloaded in /mnt/assembly:

ls /mnt/assembly

to see data in “/mnt/assembly”:

  • ecoli_ref-5m-trim.fastq.gz: quality trimmed pair end data sets from E. coli genome sequence data.
  • ecoli-reads-5m-dn-paired.fa.gz: data set is a specially processed data set using digital normalization that will assemble quickly.

Running an assembly

Go inside the data directory:

cd /mnt/assembly

Now... assemble the small, fast data sets, using the Velvet assembler. Here we will set the required parameter k=21:

velveth ecoli.21 21 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.21 -exp_cov auto

Check out the stats for the assembled contigs for a cutoff of 1000:

python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa

Also try assembling with k=23 and k=25:

velveth ecoli.23 23 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.23 -exp_cov auto

velveth ecoli.25 25 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.25 -exp_cov auto

Now check out the stats for the assembled contigs for a cutoff of 1000:

python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa

(Also read: What does k control in de Bruijn graph assemblers?.)

Comparing and evaluating assemblies - QUAST

Run QUAST to mapping contigs (ecoli.23/contigs.fa, and ecoli.25/contigs.fa) reference genome (ecoliMG1655.fa):

gunzip ecoliMG1655.fa.gz
/root/quast-2.3/quast.py -R ecoliMG1655.fa ecoli.*/contigs.fa

Note that here we’re looking at all the assemblies we’ve generated.

Now look at the results:

more quast_results/latest/report.txt

The first bits to look at are Genome fraction (%) and # misassembled contigs, I think.

Searching assemblies – BLAST

Build BLAST databases for the assemblies you’ve done:

cd /mnt/assembly

for i in 21 23 25
do
   extract-long-sequences.py -o ecoli-$i.fa -l 500 ecoli.$i/contigs.fa
   formatdb -i ecoli-$i.fa -o T -p F
done

BLAST is already installed. Let’s search for a specific gene (CRP, a transcription regulator).

The commandline for BLAST is:

blastall -i crp.fa -d ecoli-21.fa -p tblastn -b 1 -v 1

Questions and Discussion Points

Why do we use a lower cutoff of 1kb for the assemstats3 links, above? Why not 0?