MSc Bioinformatics

QA and Assembly

bmpvieira.com/assembly14

Bruno Vieira | @bmpvieira

PhD Student @ QMUL

Bioinformatics and Population Genomics

Supervisor:
Yannick Wurm | @yannick__

© 2014 Bruno Vieira CC-BY 4.0

Download data

bit.ly/ant-reads

Useful books

Papers

De novo genome assembly: what every biologist should know

Assemblathon 2: evaluating de novo methods of genome assembly[...]

Genome Assembly

Chen 2011

Types

Algorithms

  • Overlap Layout Consensus
  • De Bruijn

Strategies

  • De Novo
  • Reference guided


Assembly paradigms

Overlap/Layout/Consensus

  • A node corresponds to a read, an edge denotes an overlap between two reads.
  • The overlap graph is used to compute a layout of reads and consensus sequence of contigs by pair-wise sequence alignment.
  • Good for data sets with a limited number of reads but significant overlap. Computationally intensive for short reads (short length and high error rate).
  • Example assemblers: Celera Assembler, Arachne, CAP and PCAP
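
To make the overlap graph concrete, here is a minimal Python sketch. It assumes error-free reads and exact suffix/prefix matches; real OLC assemblers use pairwise alignment that tolerates sequencing errors. The function names, the `min_length` parameter and the toy reads are illustrative, not taken from any assembler.

```python
from itertools import permutations


def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of `a` that is also a prefix of `b` (0 if none)."""
    start = 0
    while True:
        # Find the next place in `a` where the first bases of `b` occur.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # If the rest of `a` from here is a prefix of `b`, this is the overlap.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Nodes are reads; an edge (a, b) stores the exact overlap length between them."""
    graph = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_length)
        if length > 0:
            graph[(a, b)] = length
    return graph


reads = ["ACGTAC", "TACGGA", "GGATTT"]
graph = overlap_graph(reads)
```

Comparing every ordered pair of reads is what makes this approach expensive when there are many short reads.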

Chen 2011

de Bruijn

  • No need for all against all overlap discovery.
  • Break reads into smaller sequences of DNA (K-mers, K denotes the length in bases of these sequences).
  • Captures overlaps of length K-1 between these K-mers.
  • More sensitive to repeats and sequencing errors.
  • By construction, the graph contains a path corresponding to the original sequence.
  • Example assemblers: Euler, Velvet, ABySS, AllPaths, SOAPdenovo, CLC Bio
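
The construction can be sketched in a few lines of Python. This is a simplified illustration, assuming error-free reads from one strand; here nodes are K-mers and an edge joins two K-mers that follow each other in a read (so they overlap by K-1 bases). The function names and toy reads are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every K-mer (substring of length k) of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Link each K-mer to the K-mer that follows it in a read (overlap of k-1)."""
    graph = defaultdict(set)
    for read in reads:
        read_kmers = kmers(read, k)
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)
    return graph


def walk(graph, start):
    """Follow unambiguous edges from `start` and spell out the sequence."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on cycles, e.g. repeats
            break
        seen.add(node)
        sequence += node[-1]
    return sequence


graph = de_bruijn_graph(["ACGTC", "GTCAA"], k=3)
contig = walk(graph, "ACG")
```

No read-against-read comparison is needed: the two reads are joined simply because they share the K-mer GTC.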

Chen 2011

Schatz 2012

Too many assemblers

seqanswers.com/wiki/De-novo_assembly


A5, ABySS, ALLPATHS, CABOG, CLCbio, Contrail, Curtain, DecGPU, Forge, Geneious, GenoMiner, IDBA, Lasergene, MIRA, Newbler, PE-Assembler, QSRA, Ray, SeqMan NGen, SeqPrep, Sequencher, SHARCGS, SHORTY, SHRAP, SOAPdenovo, SR-ASM, SuccinctAssembly, SUTTA, Taipan, VCAKE, Velvet

Benchmarking


Why we need the assemblathon

Assembly quality assessment

  • Accuracy or “Correctness”
    • Base accuracy – the frequency of calling the correct nucleotide at a given position in the assembly.
    • Mis-assembly rate – the frequency of rearrangements, significant insertions, deletions and inversions.

Assembly quality assessment

  • Continuity
    • Length distribution of contigs/scaffolds.
    • Average length, minimum and maximum lengths, combined total lengths.
    • N50 captures how much of the assembly is covered by relatively large contigs.

Assembly quality assessment

  • N50
  • NG50
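
As a worked example, both statistics can be computed from a list of contig lengths. The sketch below uses the usual definitions: N50 is the length of the contig at which the running total, taken from the longest contig down, reaches half of the assembly size; NG50 uses half of the estimated genome size instead. The function names and the example lengths are made up for illustration.

```python
def nx(lengths, total, fraction=0.5):
    """Length of the contig at which the running sum reaches `fraction` of `total`."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return None  # the contigs never reach the requested fraction


def n50(lengths):
    """N50: the reference total is the assembly size itself."""
    return nx(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the reference total is the estimated genome size."""
    return nx(lengths, genome_size)


contig_lengths = [80, 70, 50, 40, 30, 20]  # total assembly size: 290
print(n50(contig_lengths))         # 70
print(ng50(contig_lengths, 400))   # 50
```

Because NG50 is tied to the genome size rather than to each assembly's own total, it is the fairer number when comparing assemblies of the same genome.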


N50 must die?

Assembly quality assessment

  • Fragment analysis - Count how many randomly chosen fragments from the species A genome can be found in the assembly
  • Repeat analysis - Choose fragments that either overlap or don’t overlap a known repeat
  • Gene finding - How many genes are present in each assembly? (CEGMA)
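
The fragment analysis idea can be sketched as follows. This is a rough illustration only: it assumes exact string matching on either strand, whereas a real evaluation would align fragments and allow mismatches. All names and default values are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence made of A, C, G and T."""
    return sequence.translate(COMPLEMENT)[::-1]


def fragment_recovery(reference, contigs, fragment_length=100, n_fragments=1000, seed=0):
    """Fraction of random reference fragments found exactly in any contig."""
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    found = 0
    for _ in range(n_fragments):
        start = rng.randrange(len(reference) - fragment_length + 1)
        fragment = reference[start:start + fragment_length]
        reverse = reverse_complement(fragment)
        # A fragment counts as found if it occurs on either strand of a contig.
        if any(fragment in contig or reverse in contig for contig in contigs):
            found += 1
    return found / n_fragments
```

For example, `fragment_recovery(reference, contigs, fragment_length=20)` returns 1.0 when the contigs contain the whole reference and 0.0 when there are no contigs.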

Assembly quality assessment

  • Contamination - “all libraries will contain some bacterial contamination”
  • Mauve analysis - Uses whole genome alignment to reveal
  • BWA analysis - Align contigs to genome
  • Optical Maps / Irys

FastQC

FastQC Documentation

Diginorm

"(...)systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors."

Diginorm

"(...)reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs."

Magic? No, Bloom filters
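
For readers new to the data structure, here is a minimal Bloom filter sketch in Python. It only answers "have I seen this K-mer?"; counting K-mer abundance needs a counting variant, which is not shown. The class, its default sizes and the use of SHA-256 as the hash are assumptions for illustration, not how any particular tool implements it.

```python
import hashlib


class BloomFilter:
    """Set-like structure: no false negatives, occasional false positives."""

    def __init__(self, size=1024, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive n_hashes positions by salting one hash function with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        # Present only if every position is set; unrelated items may collide.
        return all(self.bits[position] for position in self._positions(item))


seen = BloomFilter()
seen.add("ACGTACGT")
print("ACGTACGT" in seen)  # True
```

Memory use is fixed by `size` no matter how many K-mers are added; the price is a false-positive rate that grows as the filter fills up.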

Diginorm

What is digital normalization, anyway?
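
In short, a read is kept only if the median abundance of its K-mers, among the reads kept so far, is still below a coverage cutoff. The sketch below illustrates that rule under simplifying assumptions: it uses an exact `Counter` instead of a memory-efficient probabilistic counter, and the function names and default `k` and `cutoff` values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every K-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its median K-mer count is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


reads = ["ACGTTGCA"] * 10 + ["GGATCCAT"]
kept = digital_normalization(reads, k=4, cutoff=3)
```

Here the ten identical reads are reduced to three copies, while the single low-coverage read is kept.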

Why you shouldn't use digital normalization

Fasta
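
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. A minimal parser sketch, assuming well-formed input (the function name is illustrative; in practice a library such as Biopython would normally be used):

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = ">seq1 example\nACGT\nTTGA\n>seq2\nGG\n"
records = list(parse_fasta(example.splitlines()))
```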

Fastq

Fastq
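
A FASTQ record takes four lines: a header starting with "@", the sequence, a separator line starting with "+", and one quality character per base. The sketch below assumes unwrapped four-line records and the Phred+33 quality encoding; the function names are illustrative.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return  # end of input or incomplete trailing record
        header, sequence, _separator, quality = record
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to Phred scores (Phred+33 by default)."""
    return [ord(character) - offset for character in quality]


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for header, sequence, quality in parse_fastq(example):
    print(header, sequence, phred_scores(quality))  # read1 ACGT [40, 40, 20, 0]
```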

Interleaved format
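
In an interleaved file, the two reads of each pair sit next to each other: read 1 of a pair, then read 2 of the same pair, then the next pair. A small sketch of how two paired FASTQ inputs could be merged, assuming four-line records in the same order in both inputs (the function names and the "/1", "/2" header suffixes are illustrative):

```python
from itertools import islice


def fastq_records(lines):
    """Group an iterable of FASTQ lines into four-line records."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if len(record) < 4:
            return
        yield record


def interleave(forward_lines, reverse_lines):
    """Yield lines alternating one forward record and its reverse mate."""
    pairs = zip(fastq_records(forward_lines), fastq_records(reverse_lines))
    for forward, reverse in pairs:
        yield from forward
        yield from reverse


forward = ["@pair1/1", "AAAA", "+", "IIII", "@pair2/1", "CCCC", "+", "IIII"]
reverse = ["@pair1/2", "TTTT", "+", "IIII", "@pair2/2", "GGGG", "+", "IIII"]
merged = list(interleave(forward, reverse))
```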

Practical

bmpvieira.com/assembly14-practical