Assessment of the CHM13 genome (120X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta

  1. Download reads
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/nanopore/rel3/rel3.fastq.gz
  1. Prepare input file (input.fofn)
ls rel3.fastq.gz > input.fofn
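
Because input.fofn is just a list of read files, one path per line, it can be worth confirming that every listed path exists before starting a long run. The following is a minimal sketch; `check_fofn` is a hypothetical helper written for this example and is not part of NextDenovo.

```shell
# Hypothetical helper (not part of NextDenovo): report any path listed in a
# file-of-filenames that is missing or empty. Returns non-zero if any is.
check_fofn() {
    fofn=$1
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip blank lines.
        [ -z "$path" ] && continue
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done < "$fofn"
    return "$status"
}
```

For example, `check_fofn input.fofn && echo "all inputs present"` prints the message only when every listed file exists and is non-empty. Relative paths are resolved against the current directory, so the check should be run from the directory where the assembly will be launched.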
  1. Prepare config file (run.cfg)

    [General]
    job_type = sge # here we use SGE to manage jobs
    job_prefix = nextDenovo
    task = all
    rewrite = yes
    deltmp = yes
    parallel_jobs = 25
    input_type = raw
    read_type = ont # clr, ont, hifi
    input_fofn = input.fofn
    workdir = chm13_asm
    
    [correct_option]
    read_cutoff = 20k
    genome_size = 3.1g # estimated genome size
    sort_options = -m 150g -t 30
    minimap2_options_raw = -t 8
    pa_correction = 5
    correction_options = -p 30
    
    [assemble_option]
    minimap2_options_cns = -t 8
    nextgraph_options = -a 1
    
  2. Run

nohup nextDenovo run.cfg &
  1. Get result
  • Final corrected reads file (use the -b parameter to get more corrected reads):

    chm13_asm/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
    
  • Final assembly result:

    chm13_asm/03.ctg_graph/nd.asm.fasta
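
The corrected reads are spread over several seed_cns* directories, so a downstream tool that expects a single file needs them concatenated first. A minimal sketch follows; `collect_corrected_reads` is a hypothetical helper, and the glob simply mirrors the path pattern given above.

```shell
# Hypothetical helper: concatenate every per-seed cns.fasta under a NextDenovo
# work directory into one file, then print the number of sequences written.
collect_corrected_reads() {
    workdir=$1
    out=$2
    # Fails (and returns 1) if the glob matches no file.
    cat "$workdir"/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta > "$out" || return 1
    # Each FASTA record starts with a '>' header line.
    grep -c '^>' "$out"
}
```

With the workdir used in this example, the call would be `collect_corrected_reads chm13_asm corrected_reads.fasta`, where the output file name is an arbitrary choice.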
    

The following are the assembly statistics:

Type           Length (bp)            Count (#)
N10            179297054                   2
N20            169128386                   3
N30            131652719                   6
N40            120761272                   8
N50            106090521                  10
N60             95206689                  13
N70             80513393                  16
N80             59725892                  21
N90             39058727                  27

Min.               84432                   -
Max.           237405279                   -
Ave.            35344197                   -
Total         2898224197                  82
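
The Nx rows can be cross-checked independently of the assembler. The sketch below assumes the usual definition (sort contigs by length in descending order and report the length and rank of the contig at which the running sum first reaches x% of the total length); it is not confirmed to be the exact procedure that produced the table above. `fasta_lengths` and `nx` are hypothetical helper names.

```shell
# Print the length of every sequence in a FASTA file, one per line.
# Sequences wrapped over several lines are summed.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Print "N<x> <length> <count>" for a percentage x (for example 50).
nx() {
    fasta=$1
    pct=$2
    fasta_lengths "$fasta" | sort -rn | awk -v pct="$pct" '
        { lens[NR] = $1; total += $1 }
        END {
            target = total * pct / 100
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                # First contig at which the running sum reaches the target.
                if (cum >= target) { print "N" pct, lens[i], i; exit }
            }
        }'
}
```

For example, `nx chm13_asm/03.ctg_graph/nd.asm.fasta 50` prints one line of the form `N50 <length> <count>`, which can be compared with the N50 row above.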
  1. Download reference
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.7.fasta.gz
gzip -d chm13.draft_v0.7.fasta.gz
  1. Run Quast v5.0.2
quast.py --eukaryote --large --min-identity 80 --threads 30 -r ./chm13.draft_v0.7.fasta --fragmented nd.asm.fasta
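
Individual metrics can be pulled out of the tab-separated QUAST report for comparison across assemblers. The sketch below assumes the report is a file with the metric name in the first column and one value per assembly in the following columns; `quast_metric` is a hypothetical helper, and the report location mentioned afterwards is an assumption about the QUAST default output directory because the command above does not set one.

```shell
# Hypothetical helper: print the value(s) of one metric from a tab-separated
# QUAST report, matching the metric name exactly against the first column.
quast_metric() {
    report=$1
    metric=$2
    awk -F '\t' -v m="$metric" '
        $1 == m {
            for (i = 2; i <= NF; i++)
                printf "%s%s", $i, (i < NF ? "\t" : "\n")
        }' "$report"
}
```

Assuming the report was written to quast_results/latest/report.tsv, `quast_metric quast_results/latest/report.tsv N50` would print the N50 value for nd.asm.fasta. A metric name that does not occur in the report produces no output.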
Quast result

Metric                            NextDenovo            Canu            Flye          Shasta
# contigs                                 82            1223             472             297
Largest contig                     237405279       139909728       132009996       130803838
Total length                      2898224197      2991947723      2920201070      2823384269
# misassemblies                         1227            6396            3230             187
# misassembled contigs                    61             875             193              78
Misassembled contigs length       2740877545      2458710426      2440399207      1351075153
# local misassemblies                    433            1164             981             129
# possible TEs                            42             160              96              14
# unaligned mis. contigs                  11              73              17               0
# unaligned contigs              0 + 64 part  168 + 248 part    8 + 135 part     0 + 37 part
Unaligned length                    22021119        30076945        14583673          393547
Genome fraction (%)                   97.421          98.391          97.392          96.149
Duplication ratio                      1.007           1.027           1.018           1.002
# mismatches per 100 kbp               29.43           77.26           74.04           15.56
# indels per 100 kbp                  170.98          327.08          447.97          141.25
Largest alignment                  111497488       104447985       111814657       111679369
Total aligned length              2865321418      2943726417      2894073152      2821352191
N50                                106090521        77964612        70319350        58111632
NG50                               106090521        77964612        70319350        58088067
L50                                       10              15              16              17
LG50                                      10              15              16              18
NA50                                57779597        47440498        46858921        47392260
NGA50                               57779597        47440498        46546094        44539326
LA50                                      18              21              19              19
LGA50                                     18              21              20              20

Note

The results for Canu, Flye and Shasta are copied from here; the complete NextDenovo result can be seen here.