Assessment of assemblies of the Arabidopsis thaliana F1 genome (Col-0 x Cvi-0 cross, ~1% heterozygosity, 192X PacBio CLR reads) using NextDenovo, Canu, Falcon, Flye, Shasta, Mecat and Wtdbg
Download reads
SRA Accession: SRX1715706, SRX1715705, SRX1715704, SRX1715703
Prepare input file (input.fofn)
ls f1.fasta.gz > input.fofn
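input.fofn is a plain list of read files, one path per line. Before a long run it can help to confirm that every listed file exists and is non-empty. The following is a minimal sketch; the function name check_fofn is hypothetical and not part of NextDenovo, and relative paths are resolved against the current directory.

```shell
# check_fofn FOFN: succeed only if every path listed in FOFN
# points to an existing, non-empty file.
check_fofn() {
    fofn=$1
    missing=0
    # "|| [ -n "$path" ]" also handles a last line without a trailing newline
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue            # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$fofn"
    [ "$missing" -eq 0 ]
}

# Usage:
#   check_fofn input.fofn && echo "input.fofn looks fine"
```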
Prepare config file (run.cfg)
[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 12
input_type = raw
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 120m # estimated genome size
sort_options = -m 50g -t 35
minimap2_options_raw = -t 20
pa_correction = 6
correction_options = -p 35

[assemble_option]
minimap2_options_cns = -t 20
nextgraph_options = -a 1
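As a rough sanity check on genome_size, read depth can be estimated as total read bases divided by the estimated genome size; for this data set the result should be near the 192X quoted in the title. The sketch below assumes FASTA input; read_depth is a hypothetical helper, not a NextDenovo command.

```shell
# read_depth FASTA GENOME_SIZE_BP: print estimated depth (total bases / genome size).
# Pass "-" as FASTA to read from standard input.
read_depth() {
    fasta=$1
    genome_size=$2    # in bp, e.g. 120000000 for genome_size = 120m
    # Sum the length of every non-header line, then divide by the genome size
    awk -v g="$genome_size" '!/^>/ { bases += length($0) }
        END { printf "%.1f\n", bases / g }' "$fasta"
}

# Usage with the gzipped reads from input.fofn:
#   gzip -dc f1.fasta.gz | read_depth - 120000000
```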
Run
nohup nextDenovo run.cfg &
Get result
Final corrected reads file (use the -b parameter to get more corrected reads): 01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
Final assembly result: 01_rundir/03.ctg_graph/nd.asm.fasta
The following are the assembly statistics:
Type    Length (bp)   Count (#)
N10     13144176      1
N20     13090493      2
N30     9367478       4
N40     9212899       5
N50     8798661       6
N60     5544810       8
N70     3588034       11
N80     2192782       16
N90     688550        25
Min.    26566         -
Max.    13144176      -
Ave.    1434812       -
Total   126263508     88
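The NX rows above follow the usual definition: sort contigs from longest to shortest and report the length (and running count) of the contig at which the cumulative length first reaches X% of the total. The following is a minimal sketch of that calculation for an uncompressed FASTA file; asm_nx is a hypothetical helper, and tools may differ in how they treat the boundary case, so small discrepancies are possible.

```shell
# asm_nx FASTA [X]: print "<NX length> <contig count>" (X defaults to 50).
asm_nx() {
    fasta=$1
    x=${2:-50}
    # 1) total the sequence length of each record,
    # 2) sort the lengths in descending order,
    # 3) walk down until the running sum reaches X% of the total.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$fasta" |
    sort -rn |
    awk -v x="$x" '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 100 >= total * x) { print lens[i], i; exit }
             }
         }'
}

# Usage:
#   asm_nx 01_rundir/03.ctg_graph/nd.asm.fasta 50
```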
Assemble with shasta
shasta-Linux-0.5.1 --input f1.fasta --threads 30
Download reference
wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
Run Quast v5.0.2
quast.py --large --eukaryote --min-identity 80 --threads 30 -r TAIR10_chr_all.fas nextDenovo.asm.fa Canu.asm.fa Falcon.asm.fa Flye.asm.fa Shasta.asm.fa Mecat.asm.fa Wtdbg.asm.fa
- Quast result
Metric                    NextDenovo      Canu            Falcon          Flye            Shasta          Mecat           Wtdbg
# contigs                 88              2107            171             1097            1468            1243            703
Largest contig            13144176        3980575         13319401        4836132         4378421         12631656        14128365
Total length              126263508       229056851       140024465       131553479       143148140       202215921       132890796
N50                       8798661         231924          7960654         325940          357597          688687          5479602
NG50                      8798661         873036          7979657         370306          560105          3525236         8707235
N75                       2323231         69274           1507122         137772          93305           85155           1095469
NG75                      3588034         460325          4810976         180227          185928          1096121         2182254
LG50                      6               40              6               71              50              8               6
LG75                      11              86              10              190             149             22              13
# misassemblies           1314            2314            1607            1570            1631            1783            1529
# misassembled contigs    63              383             89              362             357             250             156
# local misassemblies     1128            2571            1437            1189            1077            2196            1086
# unaligned mis. contigs  0               8               0               39              79              0               25
# unaligned contigs       13 + 57 part    278 + 511 part  48 + 63 part    27 + 494 part   81 + 528 part   1 + 355 part    253 + 256 part
Unaligned length          5577991         13404835        6336453         4365056         11810280        5760459         12620722
Genome fraction (%)       96.006          99.528          96.938          96.517          97.774          98.166          93.695
Duplication ratio         1.052           1.813           1.154           1.103           1.124           1.675           1.074
# mismatches per 100 kbp  668.46          1299.53         822.92          753.04          763.33          1052.95         722.82
# indels per 100 kbp      193.40          281.21          127.09          212.74          727.64          338.60          303.37
Largest alignment         5887963         3963652         10477942        4820655         3059195         5451806         7529822
Total aligned length      120235666       214635623       133317043       126764931       131090282       196116682       120017897
NA50                      1136416         115341          1459104         280334          255952          202014          756810
NGA50                     1504454         539509          1909294         328298          384761          901832          945708
NA75                      354228          48301           270481          93990           41634           62905           192079
NGA75                     472949          246039          676191          128725          118594          339389          316618
LGA50                     21              60              15              82              60              27              27
LGA75                     61              140             41              230             202             80              82
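QUAST also writes these metrics to a tab-separated report.tsv in its output directory, which makes it easy to pull out a few rows for side-by-side comparison. The sketch below assumes the default output location (quast_results/latest), since the command above sets no -o option; quast_rows is a hypothetical helper, and metric names must match the first column of the report exactly.

```shell
# quast_rows REPORT METRIC...: print the header line of a QUAST report.tsv
# followed by the rows whose first column equals one of the given metrics.
quast_rows() {
    report=$1
    shift
    head -n 1 "$report"                       # assembly names
    for metric in "$@"; do
        awk -F '\t' -v m="$metric" '$1 == m' "$report"
    done
}

# Usage:
#   quast_rows quast_results/latest/report.tsv "N50" "NGA50" "# misassemblies"
```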
Note
The results for Canu, Falcon, Flye, Mecat and Wtdbg are copied from ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/at-f1, published with the wtdbg2 paper. The complete Quast result can be seen here.