Assessment of the CHM13 genome (120X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta
Download reads
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/nanopore/rel3/rel3.fastq.gz
Prepare input file (input.fofn)
ls rel3.fastq.gz > input.fofn
Prepare config file (run.cfg)
[General] job_type = sge # here we use SGE to manage jobs job_prefix = nextDenovo task = all rewrite = yes deltmp = yes parallel_jobs = 25 input_type = raw read_type = ont # clr, ont, hifi input_fofn = input.fofn workdir = chm13_asm [correct_option] read_cutoff = 20k genome_size = 3.1g # estimated genome size sort_options = -m 150g -t 30 minimap2_options_raw = -t 8 pa_correction = 5 correction_options = -p 30 [assemble_option] minimap2_options_cns = -t 8 nextgraph_options = -a 1
Run
nohup nextDenovo run.cfg &
Get result
Final corrected reads file (use the
-b
parameter to get more corrected reads):chm13_asm/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fastaFinal assembly result:
chm13_asm/03.ctg_graph/nd.asm.fastaThe folowing is the assembly statistics:
Type Length (bp) Count (#) N10 179297054 2 N20 169128386 3 N30 131652719 6 N40 120761272 8 N50 106090521 10 N60 95206689 13 N70 80513393 16 N80 59725892 21 N90 39058727 27 Min. 84432 - Max. 237405279 - Ave. 35344197 - Total 2898224197 82
Download reference
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.7.fasta.gz gzip -d chm13.draft_v0.7.fasta.gz
Run Quast v5.0.2
quast.py --eukaryote --large --min-identity 80 --threads 30 -r ./chm13.draft_v0.7.fasta --fragmented nd.asm.fasta
- Quast result
NextDenovo
Canu
Flye
Shasta
# contigs
82
1223
472
297
Largest contig
237405279
139909728
132009996
130803838
Total length
2898224197
2991947723
2920201070
2823384269
# misassemblies
1227
6396
3230
187
# misassembled contigs
61
875
193
78
Misassembled contigs length
2740877545
2458710426
2440399207
1351075153
# local misassemblies
433
1164
981
129
# possible TEs
42
160
96
14
# unaligned mis. contigs
11
73
17
0
# unaligned contigs
0 + 64 part
168 + 248 part
8 + 135 part
0 + 37 part
Unaligned length
22021119
30076945
14583673
393547
Genome fraction (%)
97.421
98.391
97.392
96.149
Duplication ratio
1.007
1.027
1.018
1.002
# mismatches per 100 kbp
29.43
77.26
74.04
15.56
# indels per 100 kbp
170.98
327.08
447.97
141.25
Largest alignment
111497488
104447985
111814657
111679369
Total aligned length
2865321418
2943726417
2894073152
2821352191
N50
106090521
77964612
70319350
58111632
NG50
106090521
77964612
70319350
58088067
L50
10
15
16
17
LG50
10
15
16
18
NA50
57779597
47440498
46858921
47392260
NGA50
57779597
47440498
46546094
44539326
LA50
18
21
19
19
LGA50
18
21
20
20