Assessment of the Drosophila melanogaster ISO1 ref. strain genome (69X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta and Wtdbg

  1. Download reads

SRA Accession: SRR6702603, SRR6821890
  1. Prepare input file (input.fofn)

ls SRR6702603.fasta.gz SRR6821890.fasta.gz > input.fofn
  1. Prepare config file (run.cfg)

[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 12
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 130m # estimated genome size
sort_options = -m 30g -t 35
minimap2_options_raw = -t 20
pa_correction = 6
correction_options = -p 35

[assemble_option]
minimap2_options_cns = -t 20
nextgraph_options = -a 1
  1. Run

nohup nextDenovo run.cfg &
  1. Get result

  • Final corrected reads file (use the -b parameter to get more corrected reads):

    01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
    
  • Final assembly result:

    01_rundir/03.ctg_graph/nd.asm.fasta
    

    The folowing is the assembly statistics:

    Type           Length (bp)            Count (#)
    N10             25701192                   1
    N20             22251987                   2
    N30             22251987                   2
    N40             21195733                   3
    N50             21195733                   3
    N60             18110856                   4
    N70             13648743                   5
    N80              6408543                   6
    N90              1033518                  12
    
    Min.               18454                   -
    Max.            25701192                   -
    Ave.             1826448                   -
    Total          133330776                  73
    
  1. Assemble with shasta

shasta-Linux-0.5.1  --input SRR6702603.fasta --input SRR6821890.fasta --threads 30
  1. Download reference

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
gzip -d GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
  1. Run Quast v5.0.2

quast.py --large --eukaryote --min-identity 80 --threads 30 -r GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna nextDenovo.asm.fa Canu.asm.fa Flye.asm.fa Shasta.asm.fa Wtdbg.asm.fa
Quast result

NextDenovo

Canu

Flye

Shasta

Wtdbg

# contigs

73

424

461

872

510

Largest contig

25701192

14715425

12613153

1801407

23221757

Total length

133330776

140540470

135880693

129225244

132926651

N50

21195733

4298595

6016667

535885

12028162

NG50

18110856

4298595

6016667

440773

10631323

N75

13648743

777595

2182645

244480

3308195

NG75

3925274

714013

1367004

182722

1752322

LG50

4

11

9

92

5

LG75

7

36

20

218

13

# misassemblies

345

971

724

262

616

# misassembled contigs

48

226

217

78

191

# local misassemblies

137

433

670

123

185

# unaligned mis. contigs

1

3

5

7

36

# unaligned contigs

1 + 36 part

8 + 122 part

11 + 118 part

191 + 76 part

89 + 291 part

Unaligned length

603053

769264

811595

1660668

2264882

Genome fraction (%)

92.109

93.614

91.799

88.085

91.504

Duplication ratio

1.011

1.047

1.032

1.016

1.002

# mismatches per 100 kbp

90.86

183.12

220.48

609.69

179.86

# indels per 100 kbp

567.78

831.54

1334.52

1428.10

1081.15

Largest alignment

25696021

11699048

11981267

1799773

18844039

Total aligned length

132416893

139189216

134650393

127438699

130313270

NA50

6618721

3863099

5596752

527231

4309906

NGA50

6618721

3863099

5143715

434179

4174617

NA75

3269191

670044

1955654

230034

1573933

NGA75

2125978

611559

1267543

168924

928918

LGA50

5

13

11

94

10

LGA75

14

42

24

227

27

Note

The results of Canu, Flye and Wtdbg are copied from ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/dm-ISO1, published by wtdbg2 paper, the complete result of Quast can be seen from here.