Assessment of the Drosophila melanogaster ISO1 ref. strain genome (69X NanoPore data) assemblies using NextDenovo, Canu, Flye, Shasta and Wtdbg
- Download reads
SRA Accession: SRR6702603, SRR6821890
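One possible way to fetch the two runs as gzipped FASTA is sketched below. It assumes the SRA Toolkit is installed and uses its `fastq-dump` tool; the original page does not say which downloader was used, so treat the tool choice and the helper name `fetch_reads` as assumptions. The `DRY_RUN` switch is a hypothetical convenience for printing the commands without running them.

```shell
# fetch_reads: download SRA runs as gzipped FASTA with SRA Toolkit's fastq-dump.
# Hypothetical helper, not part of NextDenovo.
# Set DRY_RUN=1 to print the commands instead of executing them.
fetch_reads() (
    for acc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fastq-dump --fasta 0 --gzip $acc"
        else
            # --fasta 0: FASTA output without line wrapping; --gzip: compress output
            fastq-dump --fasta 0 --gzip "$acc"
        fi
    done
)

# Usage:
#   fetch_reads SRR6702603 SRR6821890
```

With these options each run should end up in a file named after its accession, which is what the `ls` command in the next step expects.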
- Prepare input file (input.fofn)
ls SRR6702603.fasta.gz SRR6821890.fasta.gz > input.fofn
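Before starting a long run it can be worth checking that every path in the file of filenames actually exists. The helper below is a minimal sketch of such a check; `check_fofn` is a hypothetical name and is not provided by NextDenovo.

```shell
# check_fofn: verify that every path listed in a file-of-filenames exists
# and is non-empty. Returns non-zero if any entry is missing or empty.
check_fofn() (
    fofn="$1"
    missing=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue            # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$fofn"
    [ "$missing" -eq 0 ]
)

# Usage:
#   check_fofn input.fofn && echo "input.fofn looks fine"
```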
- Prepare config file (run.cfg)
[General]
job_type = sge # here we use SGE to manage jobs
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 12
input_type = raw
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 130m # estimated genome size
sort_options = -m 30g -t 35
minimap2_options_raw = -t 20
pa_correction = 6
correction_options = -p 35

[assemble_option]
minimap2_options_cns = -t 20
nextgraph_options = -a 1
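The config sets `genome_size = 130m`, and the title quotes roughly 69X of data. A quick way to sanity-check that pairing is to divide the total number of read bases by the assumed genome size. The sketch below does this for gzipped FASTA input; `read_depth` is a hypothetical helper name, and the genome size must be given in plain base pairs.

```shell
# read_depth: estimate sequencing depth as total read bases / genome size.
# Usage: read_depth <genome_size_bp> <reads.fasta.gz> [more.fasta.gz ...]
read_depth() (
    genome_size="$1"
    shift
    # Decompress all inputs to one stream and count bases on non-header lines
    gzip -dc "$@" |
    awk -v g="$genome_size" '
        !/^>/ { bases += length($0) }
        END   { printf "%.1f\n", bases / g }'
)

# Usage:
#   read_depth 130000000 SRR6702603.fasta.gz SRR6821890.fasta.gz
```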
- Run
nohup nextDenovo run.cfg &
- Get result
Final corrected reads file (use the -b parameter to get more corrected reads):
01_rundir/02.cns_align/01.seed_cns.sh.work/seed_cns*/cns.fasta
Final assembly result:
01_rundir/03.ctg_graph/nd.asm.fasta
The following are the assembly statistics:
Type     Length (bp)    Count (#)
N10      25701192       1
N20      22251987       2
N30      22251987       2
N40      21195733       3
N50      21195733       3
N60      18110856       4
N70      13648743       5
N80      6408543        6
N90      1033518        12
Min.     18454          -
Max.     25701192       -
Ave.     1826448        -
Total    133330776      73
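The Nx values in the table can be re-derived from the assembly FASTA. The sketch below uses the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches x% of the assembly size. `nx_stat` is a hypothetical helper, not the tool that produced the table above.

```shell
# nx_stat: print the Nx value (e.g. N50) of a FASTA file.
# Usage: nx_stat <x> <assembly.fasta>
nx_stat() (
    x="$1"
    fasta="$2"
    # Step 1: emit one length per contig (sequences may span several lines)
    awk '
        /^>/ { if (len) print len; len = 0; next }
             { len += length($0) }
        END  { if (len) print len }' "$fasta" |
    sort -rn |
    # Step 2: walk contigs from longest to shortest until x% of the total is covered
    awk -v x="$x" '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum * 100 >= total * x) { print lens[i]; exit }
            }
        }'
)

# Usage:
#   nx_stat 50 01_rundir/03.ctg_graph/nd.asm.fasta
```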
- Assemble with shasta
shasta-Linux-0.5.1 --input SRR6702603.fasta --input SRR6821890.fasta --threads 30
- Download reference
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
gzip -d GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz
- Run Quast v5.0.2
quast.py --large --eukaryote --min-identity 80 --threads 30 -r GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna nextDenovo.asm.fa Canu.asm.fa Flye.asm.fa Shasta.asm.fa Wtdbg.asm.fa
Quast result
Metric                    NextDenovo     Canu           Flye           Shasta         Wtdbg
# contigs                 73             424            461            872            510
Largest contig            25701192       14715425       12613153       1801407        23221757
Total length              133330776      140540470      135880693      129225244      132926651
N50                       21195733       4298595        6016667        535885         12028162
NG50                      18110856       4298595        6016667        440773         10631323
N75                       13648743       777595         2182645        244480         3308195
NG75                      3925274        714013         1367004        182722         1752322
LG50                      4              11             9              92             5
LG75                      7              36             20             218            13
# misassemblies           345            971            724            262            616
# misassembled contigs    48             226            217            78             191
# local misassemblies     137            433            670            123            185
# unaligned mis. contigs  1              3              5              7              36
# unaligned contigs       1 + 36 part    8 + 122 part   11 + 118 part  191 + 76 part  89 + 291 part
Unaligned length          603053         769264         811595         1660668        2264882
Genome fraction (%)       92.109         93.614         91.799         88.085         91.504
Duplication ratio         1.011          1.047          1.032          1.016          1.002
# mismatches per 100 kbp  90.86          183.12         220.48         609.69         179.86
# indels per 100 kbp      567.78         831.54         1334.52        1428.10        1081.15
Largest alignment         25696021       11699048       11981267       1799773        18844039
Total aligned length      132416893      139189216      134650393      127438699      130313270
NA50                      6618721        3863099        5596752        527231         4309906
NGA50                     6618721        3863099        5143715        434179         4174617
NA75                      3269191        670044         1955654        230034         1573933
NGA75                     2125978        611559         1267543        168924         928918
LGA50                     5              13             11             94             10
LGA75                     14             42             24             227            27
Note
The results for Canu, Flye and Wtdbg are copied from ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/dm-ISO1, published with the wtdbg2 paper. The complete Quast result can be seen here.
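To compare a single metric across assemblers without reading the whole table, one option is to pull the matching row out of the tab-separated `report.tsv` that Quast writes to its output directory. The sketch below assumes that layout (a header row of assembly names, then one row per metric); `quast_metric` is a hypothetical helper name.

```shell
# quast_metric: print one metric from a Quast report.tsv as
# "<assembly><TAB><value>" lines, one per assembly.
# Usage: quast_metric <metric name> <report.tsv>
quast_metric() (
    metric="$1"
    report="$2"
    awk -F '\t' -v m="$metric" '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }   # header: assembly names
        $1 == m { for (i = 2; i <= NF; i++) print name[i] "\t" $i }' "$report"
)

# Usage (the output directory is an assumption; adjust to your Quast run):
#   quast_metric "NGA50" quast_results/latest/report.tsv
```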