Assessment of the Arabidopsis thaliana F1 generation of Col-0 and Cvi-0 strains genome (~1% heterozygosity, 192X PacBio CLR reads) assemblies using NextDenovo, Canu, Falcon, Flye, Shasta, Mecat and Wtdbg
- Download reads
SRA Accession: SRX1715706, SRX1715705, SRX1715704, SRX1715703
- Prepare input file (input.fofn)
ls f1.fasta.gz > input.fofn
- Prepare config file (run.cfg)
[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 12
input_type = raw
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 120m # estimated genome size
sort_options = -m 50g -t 35
minimap2_options_raw = -t 20
pa_correction = 6
correction_options = -p 35

[assemble_option]
minimap2_options_cns = -t 20
nextgraph_options = -a 1
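Before submitting a long run, it can help to sanity-check the config. The following is a minimal sketch that reads an excerpt of the run.cfg above with Python's standard configparser (not NextDenovo's own parser) and converts the size shorthand to a number. The helper names `parse_size` and `load_cfg` are hypothetical, and treating `k`, `m` and `g` as decimal multipliers is an assumption.

```python
import configparser

# Excerpt of the run.cfg shown above, embedded so the sketch is self-contained.
RUN_CFG = """
[General]
job_type = sge # here we use SGE to manage jobs
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 120m # estimated genome size
"""

# Assumption: k/m/g are decimal multipliers (1e3, 1e6, 1e9).
SUFFIXES = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_size(text):
    """Convert shorthand such as '1k' or '120m' to an integer (hypothetical helper)."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def load_cfg(text):
    """Parse INI-style text, stripping inline '# ...' comments from values."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None
    )
    parser.read_string(text)
    return parser


cfg = load_cfg(RUN_CFG)
genome_size = parse_size(cfg["correct_option"]["genome_size"])
depth = 192  # read depth stated in the page title
print(cfg["General"]["job_type"], genome_size, genome_size * depth)
```

Under these assumptions, 192X depth over a 120 Mb genome corresponds to roughly 23 Gb of raw read bases, which gives a rough feel for the input size the job settings have to handle.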
- Run
nohup nextDenovo run.cfg &
- Get result
Final corrected reads file (use the -b parameter to get more corrected reads):
01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
Final assembly result:
01_rundir/03.ctg_graph/nd.asm.fasta
The following are the assembly statistics:
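| Type | Length (bp) | Count (#) |
|---|---|---|
| N10 | 13144176 | 1 |
| N20 | 13090493 | 2 |
| N30 | 9367478 | 4 |
| N40 | 9212899 | 5 |
| N50 | 8798661 | 6 |
| N60 | 5544810 | 8 |
| N70 | 3588034 | 11 |
| N80 | 2192782 | 16 |
| N90 | 688550 | 25 |
| Min. | 26566 | - |
| Max. | 13144176 | - |
| Ave. | 1434812 | - |
| Total | 126263508 | 88 |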
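For readers unfamiliar with the N10 to N90 rows, the sketch below shows one common way such values are computed from a list of contig lengths: sort the contigs from longest to shortest and report the length (and the number of contigs used) at which the running total first reaches x% of the assembly. The function name `nx_stats` and the toy lengths are illustrative assumptions, not part of NextDenovo.

```python
def nx_stats(lengths, percents=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Return {x: (Nx length, contig count)} for each percentage x."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    stats = {}
    running = 0  # bases covered so far
    used = 0     # contigs consumed so far
    for x in sorted(percents):
        # Integer comparison avoids floating-point rounding at the threshold.
        while running * 100 < total * x:
            running += ordered[used]
            used += 1
        stats[x] = (ordered[used - 1], used)
    return stats


# Toy example with seven contigs totalling 300 bp.
toy = [80, 70, 50, 40, 30, 20, 10]
for x, (length, count) in nx_stats(toy).items():
    print(f"N{x}\t{length}\t{count}")
```

In the toy example the two longest contigs (80 + 70) already cover half of the 300 bp, so N50 is 70 with a count of 2, mirroring how the Length and Count columns relate in the table above.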
- Assemble with shasta
shasta-Linux-0.5.1 --input f1.fasta --threads 30
- Download reference
wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas
- Run Quast v5.0.2
quast.py --large --eukaryote --min-identity 80 --threads 30 -r TAIR10_chr_all.fas nextDenovo.asm.fa Canu.asm.fa Falcon.asm.fa Flye.asm.fa Shasta.asm.fa Mecat.asm.fa Wtdbg.asm.fa
Quast result
| | NextDenovo | Canu | Falcon | Flye | Shasta | Mecat | Wtdbg |
|---|---|---|---|---|---|---|---|
| # contigs | 88 | 2107 | 171 | 1097 | 1468 | 1243 | 703 |
| Largest contig | 13144176 | 3980575 | 13319401 | 4836132 | 4378421 | 12631656 | 14128365 |
| Total length | 126263508 | 229056851 | 140024465 | 131553479 | 143148140 | 202215921 | 132890796 |
| N50 | 8798661 | 231924 | 7960654 | 325940 | 357597 | 688687 | 5479602 |
| NG50 | 8798661 | 873036 | 7979657 | 370306 | 560105 | 3525236 | 8707235 |
| N75 | 2323231 | 69274 | 1507122 | 137772 | 93305 | 85155 | 1095469 |
| NG75 | 3588034 | 460325 | 4810976 | 180227 | 185928 | 1096121 | 2182254 |
| LG50 | 6 | 40 | 6 | 71 | 50 | 8 | 6 |
| LG75 | 11 | 86 | 10 | 190 | 149 | 22 | 13 |
| # misassemblies | 1314 | 2314 | 1607 | 1570 | 1631 | 1783 | 1529 |
| # misassembled contigs | 63 | 383 | 89 | 362 | 357 | 250 | 156 |
| # local misassemblies | 1128 | 2571 | 1437 | 1189 | 1077 | 2196 | 1086 |
| # unaligned mis. contigs | 0 | 8 | 0 | 39 | 79 | 0 | 25 |
| # unaligned contigs | 13 + 57 part | 278 + 511 part | 48 + 63 part | 27 + 494 part | 81 + 528 part | 1 + 355 part | 253 + 256 part |
| Unaligned length | 5577991 | 13404835 | 6336453 | 4365056 | 11810280 | 5760459 | 12620722 |
| Genome fraction (%) | 96.006 | 99.528 | 96.938 | 96.517 | 97.774 | 98.166 | 93.695 |
| Duplication ratio | 1.052 | 1.813 | 1.154 | 1.103 | 1.124 | 1.675 | 1.074 |
| # mismatches per 100 kbp | 668.46 | 1299.53 | 822.92 | 753.04 | 763.33 | 1052.95 | 722.82 |
| # indels per 100 kbp | 193.40 | 281.21 | 127.09 | 212.74 | 727.64 | 338.60 | 303.37 |
| Largest alignment | 5887963 | 3963652 | 10477942 | 4820655 | 3059195 | 5451806 | 7529822 |
| Total aligned length | 120235666 | 214635623 | 133317043 | 126764931 | 131090282 | 196116682 | 120017897 |
| NA50 | 1136416 | 115341 | 1459104 | 280334 | 255952 | 202014 | 756810 |
| NGA50 | 1504454 | 539509 | 1909294 | 328298 | 384761 | 901832 | 945708 |
| NA75 | 354228 | 48301 | 270481 | 93990 | 41634 | 62905 | 192079 |
| NGA75 | 472949 | 246039 | 676191 | 128725 | 118594 | 339389 | 316618 |
| LGA50 | 21 | 60 | 15 | 82 | 60 | 27 | 27 |
| LGA75 | 61 | 140 | 41 | 230 | 202 | 80 | 82 |

Note: the results of Canu, Falcon, Flye, Mecat and Wtdbg are copied from ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/at-f1, published with the wtdbg2 paper. The complete Quast result can be seen here.
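With seven assemblers and many metrics, it can be easier to compare the Quast result programmatically. The sketch below assumes a tab-separated report whose first column is the metric name and whose header row lists the assemblies (the layout of Quast's report.tsv, to the best of our knowledge). The function names `read_quast_tsv` and `rank` are hypothetical, and the two embedded rows are copied from the table above.

```python
import csv
import io


def read_quast_tsv(handle):
    """Read a metric-per-row TSV into {metric: {assembly: value_string}}."""
    rows = csv.reader(handle, delimiter="\t")
    header = next(rows)
    assemblies = header[1:]
    table = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(assemblies, row[1:]))
    return table


def rank(table, metric, higher_is_better=True):
    """Return assembly names ordered from best to worst for one numeric metric."""
    values = {name: float(value) for name, value in table[metric].items()}
    return sorted(values, key=values.get, reverse=higher_is_better)


# Two rows copied from the Quast table above; a real run would open the report file instead.
EXCERPT = "\n".join("\t".join(fields) for fields in [
    ["Assembly", "NextDenovo", "Canu", "Falcon", "Flye", "Shasta", "Mecat", "Wtdbg"],
    ["NGA50", "1504454", "539509", "1909294", "328298", "384761", "901832", "945708"],
    ["# misassemblies", "1314", "2314", "1607", "1570", "1631", "1783", "1529"],
])

table = read_quast_tsv(io.StringIO(EXCERPT))
print(rank(table, "NGA50"))
print(rank(table, "# misassemblies", higher_is_better=False))
```

Ranking on a single metric is only a starting point: contiguity (NGA50), correctness (misassemblies, mismatches, indels) and completeness (genome fraction, duplication ratio) should be read together.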