Assessment of the Drosophila melanogaster ISO1 ref. strain genome (69X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta and Wtdbg

  1. Download reads
SRA Accession: SRR6702603, SRR6821890
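One possible way to fetch the two runs as gzipped FASTA is the SRA Toolkit. The sketch below assumes `fastq-dump` from sra-tools is installed and on PATH; the helper name `fetch_reads` and the `DRY_RUN` switch are illustrative and not part of NextDenovo. With these options the output is expected to be one `<accession>.fasta.gz` file per run in the current directory.

```shell
# Hypothetical helper: download SRA runs as gzipped FASTA.
# Assumes fastq-dump (SRA Toolkit) is on PATH.
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_reads() {
    run=${DRY_RUN:+echo}
    for acc in "$@"; do
        # --fasta 0: FASTA output without line wrapping; --gzip: compress output
        $run fastq-dump --fasta 0 --gzip "$acc" || return 1
    done
}

# Usage (needs network access):
#   fetch_reads SRR6702603 SRR6821890
```

Running with `DRY_RUN=1` first is a cheap way to check the accessions before starting a long download.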
  1. Prepare input file (input.fofn)
ls SRR6702603.fasta.gz SRR6821890.fasta.gz > input.fofn
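Because the file list is read line by line, a missing or empty read file is easier to catch before the run than after. A minimal sketch of such a check, assuming one path per line in the fofn; `check_fofn` is a hypothetical helper name:

```shell
# Hypothetical helper: report files listed in a fofn that are missing or empty.
# Returns 0 only if every listed file exists and is non-empty.
check_fofn() {
    fofn=$1
    missing=0
    while IFS= read -r path; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -s "$path" ]; then           # -s: exists and is non-empty
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$fofn"
    [ "$missing" -eq 0 ]
}

# Usage:
#   check_fofn input.fofn && echo "all input files present"
```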
  1. Prepare config file (run.cfg)
[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 12
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 130m # estimated genome size
sort_options = -m 30g -t 35
minimap2_options_raw = -t 20
pa_correction = 6
correction_options = -p 35

[assemble_option]
minimap2_options_cns = -t 20
nextgraph_options = -a 1
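The config uses `key = value` lines with optional trailing `#` comments. To read a setting back before submitting the run, something like the following sketch can be used; `cfg_get` is a hypothetical helper, not a NextDenovo command, and it returns the first match regardless of section.

```shell
# Hypothetical helper: print the value of KEY from a "key = value # comment"
# style config file, without the inline comment or surrounding whitespace.
# Usage: cfg_get FILE KEY
cfg_get() {
    awk -v key="$2" -F'=' '
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)         # trim the key
            if (k != key) next                     # also skips sections and blanks
            v = substr($0, index($0, "=") + 1)     # everything after the first "="
            sub(/[ \t]*#.*$/, "", v)               # strip inline comment
            gsub(/^[ \t]+|[ \t]+$/, "", v)         # trim the value
            print v
            exit
        }' "$1"
}

# Usage:
#   cfg_get run.cfg genome_size
#   cfg_get run.cfg parallel_jobs
```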
  1. Run
nohup nextDenovo run.cfg &
  1. Get result
  • Final corrected reads file (use the -b parameter to get more corrected reads):

    01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
    
  • Final assembly result:

    01_rundir/03.ctg_graph/nd.asm.fasta
    

    The following are the assembly statistics:

    Type           Length (bp)            Count (#)
    N10             25701192                   1
    N20             22251987                   2
    N30             22251987                   2
    N40             21195733                   3
    N50             21195733                   3
    N60             18110856                   4
    N70             13648743                   5
    N80              6408543                   6
    N90              1033518                  12
    
    Min.               18454                   -
    Max.            25701192                   -
    Ave.             1826448                   -
    Total          133330776                  73
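The Nx rows above can be recomputed from any assembly FASTA. The sketch below assumes the usual definition: sort contigs from longest to shortest and report the length (and number) of contigs at which the running total first reaches x% of the total length. The function names `fasta_lengths` and `nx` are illustrative.

```shell
# Print one length per sequence in a FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# nx PERCENT: read lengths on stdin, print "<Nx length> <contig count>".
nx() {
    sort -rn | awk -v p="$1" '
        { len[NR] = $1; total += $1 }
        END {
            target = total * p / 100
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= target) { print len[i], i; exit }
            }
        }'
}

# Usage:
#   fasta_lengths 01_rundir/03.ctg_graph/nd.asm.fasta | nx 50
```

The two printed fields correspond to the Length and Count columns, so the output for 50 can be compared with the N50 row.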
    
  1. Assemble with shasta
shasta-Linux-0.5.1  --input SRR6702603.fasta --input SRR6821890.fasta --threads 30
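The Shasta command reads uncompressed `.fasta` files, while `input.fofn` lists `.fasta.gz` files. Assuming the reads are stored compressed, uncompressed copies can be written alongside them so both assemblers use the same data. A minimal sketch; `gunzip_keep` is a hypothetical helper name:

```shell
# Hypothetical helper: write an uncompressed copy next to each .gz file,
# keeping the compressed original in place.
gunzip_keep() {
    for gz in "$@"; do
        out=${gz%.gz}
        [ -s "$out" ] && continue                  # already decompressed
        # Write to a temporary name first so a failed run leaves no partial file
        gzip -dc "$gz" > "$out.tmp" && mv "$out.tmp" "$out" || return 1
    done
}

# Usage:
#   gunzip_keep SRR6702603.fasta.gz SRR6821890.fasta.gz
```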
  1. Download reference
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
gzip -d GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
  1. Run Quast v5.0.2
quast.py --large --eukaryote --min-identity 80 --threads 30 -r GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna nextDenovo.asm.fa Canu.asm.fa Flye.asm.fa Shasta.asm.fa Wtdbg.asm.fa
Quast result
                              NextDenovo          Canu          Flye        Shasta         Wtdbg
# contigs                             73           424           461           872           510
Largest contig                  25701192      14715425      12613153       1801407      23221757
Total length                   133330776     140540470     135880693     129225244     132926651
N50                             21195733       4298595       6016667        535885      12028162
NG50                            18110856       4298595       6016667        440773      10631323
N75                             13648743        777595       2182645        244480       3308195
NG75                             3925274        714013       1367004        182722       1752322
LG50                                   4            11             9            92             5
LG75                                   7            36            20           218            13
# misassemblies                      345           971           724           262           616
# misassembled contigs                48           226           217            78           191
# local misassemblies                137           433           670           123           185
# unaligned mis. contigs               1             3             5             7            36
# unaligned contigs          1 + 36 part  8 + 122 part 11 + 118 part 191 + 76 part 89 + 291 part
Unaligned length                  603053        769264        811595       1660668       2264882
Genome fraction (%)               92.109        93.614        91.799        88.085        91.504
Duplication ratio                  1.011         1.047         1.032         1.016         1.002
# mismatches per 100 kbp           90.86        183.12        220.48        609.69        179.86
# indels per 100 kbp              567.78        831.54       1334.52       1428.10       1081.15
Largest alignment               25696021      11699048      11981267       1799773      18844039
Total aligned length           132416893     139189216     134650393     127438699     130313270
NA50                             6618721       3863099       5596752        527231       4309906
NGA50                            6618721       3863099       5143715        434179       4174617
NA75                             3269191        670044       1955654        230034       1573933
NGA75                            2125978        611559       1267543        168924        928918
LGA50                                  5            13            11            94            10
LGA75                                 14            42            24           227            27
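The mismatch and indel rates in the table can be folded into a single approximate accuracy figure. The sketch below is an approximation and not a Quast output: it counts every mismatch and every indel as one error per aligned base, then reports percent identity and a Phred-scaled QV, $-10 \log_{10}(\text{error rate})$. The helper name `consensus_qv` is hypothetical.

```shell
# Hypothetical helper: approximate identity (%) and Phred-scaled QV from
# Quast's "# mismatches per 100 kbp" and "# indels per 100 kbp" values,
# counting every mismatch and indel as one error.
# Usage: consensus_qv MISMATCHES_PER_100KBP INDELS_PER_100KBP
consensus_qv() {
    awk -v mm="$1" -v indel="$2" 'BEGIN {
        err = (mm + indel) / 100000                 # errors per aligned base
        printf "%.3f %.2f\n", 100 * (1 - err), -10 * log(err) / log(10)
    }'
}

# Usage with the NextDenovo column:
#   consensus_qv 90.86 567.78
```

For the NextDenovo column this works out to roughly 99.34% identity (QV about 21.8), which is the accuracy of the assembly as produced, before any polishing.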

Note

The Canu, Flye and Wtdbg results are copied from ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/dm-ISO1, published with the wtdbg2 paper. The complete Quast result can be seen here.