Assembling E. coli sequences with Velvet¶
The goal of this tutorial is to show you the basics of assembly using the Velvet assembler.
We’ll be using data from Efficient de novo assembly of single-cell bacterial genomes from short-read data sets, Chitsaz et al., 2011.
Booting an Amazon AMI¶
Start up an Amazon computer (m1.large or m1.xlarge) using AMI ami-d25283ba (see Start up an EC2 instance and Starting up a custom operating system).
Log in with Windows or from Mac OS X.
Packages to install¶
Softwares and packages are already installed:
Data¶
Data is already downloaded in /mnt/assembly:
ls /mnt/assembly
to see data in “/mnt/assembly”:
- ecoli_ref-5m-trim.fastq.gz: quality trimmed pair end data sets from E. coli genome sequence data.
- ecoli-reads-5m-dn-paired.fa.gz: data set is a specially processed data set using digital normalization that will assemble quickly.
Running an assembly¶
Go inside the data directory:
cd /mnt/assembly
Now... assemble the small, fast data sets, using the Velvet assembler. Here we will set the required parameter k=21:
velveth ecoli.21 21 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.21 -exp_cov auto
Check out the stats for the assembled contigs for a cutoff of 1000:
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
Also try assembling with k=23 and k=25:
velveth ecoli.23 23 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.23 -exp_cov auto
velveth ecoli.25 25 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.25 -exp_cov auto
Now check out the stats for the assembled contigs for a cutoff of 1000:
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
(Also read: What does k control in de Bruijn graph assemblers?.)
Comparing and evaluating assemblies - QUAST¶
Run QUAST to mapping contigs (ecoli.23/contigs.fa, and ecoli.25/contigs.fa) reference genome (ecoliMG1655.fa):
gunzip ecoliMG1655.fa.gz
/root/quast-2.3/quast.py -R ecoliMG1655.fa ecoli.*/contigs.fa
Note that here we’re looking at all the assemblies we’ve generated.
Now look at the results:
more quast_results/latest/report.txt
The first bits to look at are Genome fraction (%) and # misassembled contigs, I think.
Searching assemblies – BLAST¶
Build BLAST databases for the assemblies you’ve done:
cd /mnt/assembly
for i in 21 23 25
do
extract-long-sequences.py -o ecoli-$i.fa -l 500 ecoli.$i/contigs.fa
formatdb -i ecoli-$i.fa -o T -p F
done
BLAST is already installed. Let’s search for a specific gene (CRP, a transcription regulator).
The commandline for BLAST is:
blastall -i crp.fa -d ecoli-21.fa -p tblastn -b 1 -v 1
Questions and Discussion Points¶
Why do we use a lower cutoff of 1kb for the assemstats3 links, above? Why not 0?