Assessment of the CHM13 genome (120X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta
- Download reads
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/nanopore/rel3/rel3.fastq.gz
- Prepare input file (input.fofn)
ls rel3.fastq.gz > input.fofn
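When the reads are split across several files, the same list can be generated programmatically. The following is a minimal sketch, assuming input.fofn simply holds one read-file path per line (which is what the `ls` redirect above produces); the helper name `fofn_text` is hypothetical and not part of NextDenovo.

```python
def fofn_text(read_files):
    """Return file-of-file-names content: one read-file path per line."""
    paths = [str(path) for path in read_files]
    if not paths:
        raise ValueError("at least one read file is required")
    return "\n".join(paths) + "\n"


# Usage (writes the same content as `ls rel3.fastq.gz > input.fofn`):
#     with open("input.fofn", "w") as handle:
#         handle.write(fofn_text(["rel3.fastq.gz"]))
single = fofn_text(["rel3.fastq.gz"])
```

Building the text separately from writing it keeps the helper easy to check before any file is touched.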
- Prepare config file (run.cfg)
[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 25
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = chm13_asm

[correct_option]
read_cutoff = 20k
genome_size = 3.1g # estimated genome size
sort_options = -m 150g -t 30
minimap2_options_raw = -t 8
pa_correction = 5
correction_options = -p 30

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
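Before submitting a long run, it can help to sanity-check the values in run.cfg. The sketch below is an illustration only: it reads an excerpt of the config with Python's standard `configparser` and expands size strings such as `20k` and `3.1g`. Treating the k/m/g suffixes as powers of 1000 is an assumption, and `parse_size` and `load_cfg` are hypothetical helper names, not part of NextDenovo.

```python
import configparser
from decimal import Decimal

# Assumption: k/m/g are decimal multipliers (10^3, 10^6, 10^9).
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_size(text):
    """Convert strings such as '20k' or '3.1g' to an integer number of bases."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return int(Decimal(text[:-1]) * _SUFFIX[text[-1]])
    return int(Decimal(text))


def load_cfg(cfg_text):
    """Parse run.cfg-style text, dropping trailing '# ...' comments."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(cfg_text)
    return parser


# Excerpt of the config shown above.
EXAMPLE = """
[General]
job_type = sge # here we use SGE to manage jobs
read_type = ont # clr, ont, hifi
parallel_jobs = 25

[correct_option]
read_cutoff = 20k
genome_size = 3.1g # estimated genome size
"""

cfg = load_cfg(EXAMPLE)
genome_size = parse_size(cfg["correct_option"]["genome_size"])
read_cutoff = parse_size(cfg["correct_option"]["read_cutoff"])
```

`Decimal` is used instead of `float` so that values like `3.1g` expand without rounding surprises.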
- Run
nohup nextDenovo run.cfg &
- Get result
Final corrected reads file (use the -b parameter to get more corrected reads):
chm13_asm/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
Final assembly result:
chm13_asm/03.ctg_graph/nd.asm.fasta
The following are the assembly statistics:
Type     Length (bp)    Count (#)
N10      179297054      2
N20      169128386      3
N30      131652719      6
N40      120761272      8
N50      106090521      10
N60      95206689       13
N70      80513393       16
N80      59725892       21
N90      39058727       27
Min.     84432          -
Max.     237405279      -
Ave.     35344197       -
Total    2898224197     82
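For readers unfamiliar with the Nx rows above: Nx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the total, and the Count column is the number of contigs in that set. The sketch below uses this standard definition on a toy list of lengths; the function name `nx_stats` and the optional `genome_size` argument (which gives NGx-style values against a reference size) are illustrative, not taken from NextDenovo.

```python
def nx_stats(lengths, percent, genome_size=None):
    """Return (Nx length, contig count) for an integer percent in 1..100.

    If genome_size is given, the threshold is computed against it
    rather than the assembly total, giving NGx-style values.
    """
    ordered = sorted(lengths, reverse=True)
    base = genome_size if genome_size is not None else sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point thresholds.
        if running * 100 >= base * percent:
            return length, count
    return None  # the contigs never reach the requested threshold


toy = [10, 8, 6, 4, 2]        # toy contig lengths, total 30
n50 = nx_stats(toy, 50)       # 10 + 8 = 18 >= 15
n90 = nx_stats(toy, 90)       # 10 + 8 + 6 + 4 = 28 >= 27

# The Ave. row is consistent with integer division of Total by Count.
average = 2898224197 // 82
```

In the table above, for instance, the N50 row means that the 10 longest contigs cover at least half of the 2,898,224,197 assembled bases, and the shortest of those 10 is 106,090,521 bp.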
- Download reference
wget https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v0.7.fasta.gz
gzip -d chm13.draft_v0.7.fasta.gz
- Run Quast v5.0.2
quast.py --eukaryote --large --min-identity 80 --threads 30 -r ./chm13.draft_v0.7.fasta --fragmented nd.asm.fasta
- Quast result
Metric                        NextDenovo      Canu            Flye            Shasta
# contigs                     82              1223            472             297
Largest contig                237405279       139909728       132009996       130803838
Total length                  2898224197      2991947723      2920201070      2823384269
# misassemblies               1227            6396            3230            187
# misassembled contigs        61              875             193             78
Misassembled contigs length   2740877545      2458710426      2440399207      1351075153
# local misassemblies         433             1164            981             129
# possible TEs                42              160             96              14
# unaligned mis. contigs      11              73              17              0
# unaligned contigs           0 + 64 part     168 + 248 part  8 + 135 part    0 + 37 part
Unaligned length              22021119        30076945        14583673        393547
Genome fraction (%)           97.421          98.391          97.392          96.149
Duplication ratio             1.007           1.027           1.018           1.002
# mismatches per 100 kbp      29.43           77.26           74.04           15.56
# indels per 100 kbp          170.98          327.08          447.97          141.25
Largest alignment             111497488       104447985       111814657       111679369
Total aligned length          2865321418      2943726417      2894073152      2821352191
N50                           106090521       77964612        70319350        58111632
NG50                          106090521       77964612        70319350        58088067
L50                           10              15              16              17
LG50                          10              15              16              18
NA50                          57779597        47440498        46858921        47392260
NGA50                         57779597        47440498        46546094        44539326
LA50                          18              21              19              19
LGA50                         18              21              20              20
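To compare assemblers programmatically, the Quast metrics can be loaded into a dictionary. The sketch below assumes a tab-separated report in the style of Quast's report.tsv, with a first row of `Assembly` followed by one column per assembly; the exact file name, location, and column labels depend on how Quast was run, so treat them as assumptions. The sample text reuses three rows from the table above, and `parse_quast_tsv` is a hypothetical helper name.

```python
import csv
import io


def parse_quast_tsv(text):
    """Map assembly name -> {metric: value string} from report.tsv-style text."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    assemblies = header[1:]  # first cell is the 'Assembly' label
    report = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            report[name][metric] = value
    return report


# Three rows copied from the result table; the layout is assumed.
SAMPLE = (
    "Assembly\tNextDenovo\tCanu\tFlye\tShasta\n"
    "# contigs\t82\t1223\t472\t297\n"
    "# misassemblies\t1227\t6396\t3230\t187\n"
    "NGA50\t57779597\t47440498\t46546094\t44539326\n"
)

report = parse_quast_tsv(SAMPLE)
# Assembly with the largest NGA50 and the one with the fewest misassemblies.
best_nga50 = max(report, key=lambda name: int(report[name]["NGA50"]))
fewest_mis = min(report, key=lambda name: int(report[name]["# misassemblies"]))
```

Values are kept as strings because some rows, such as `# unaligned contigs`, are not plain numbers; convert only the metrics you need.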