NextDenovo Parameter Reference
NextDenovo requires at least one read file as input (option: input_fofn). It accepts FASTA and FASTQ files, optionally gzip-compressed, and takes its options from a config file.
Input
input_fofn
(one file per line)
ls reads1.fasta reads2.fastq reads3.fasta.gz reads4.fastq.gz ... > input.fofn
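When the read files are collected by a script rather than by a shell glob, the same list can be written from Python. This is a minimal sketch, not part of NextDenovo: the helper name write_fofn is hypothetical, and it only assumes what the text above states, namely that the file holds one read-file path per line.

```python
from pathlib import Path


def write_fofn(read_files, fofn_path="input.fofn"):
    """Write one read-file path per line (hypothetical helper, not a NextDenovo API)."""
    # Normalise each entry to a plain string path; order is preserved.
    paths = [str(Path(p)) for p in read_files]
    # One path per line, with a trailing newline after the last entry.
    Path(fofn_path).write_text("\n".join(paths) + "\n")
    return paths
```

Calling write_fofn(["reads1.fasta", "reads2.fastq.gz"]) writes a two-line input.fofn in the current directory.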
config file
A config file is a text file that contains a set of parameters (key=value pairs) to set runtime parameters for NextDenovo. The following is a typical config file, which is also located in doc/run.cfg.

[General]
job_type = local
job_prefix = nextDenovo
task = all
rewrite = yes
deltmp = yes
parallel_jobs = 20
input_type = raw
read_type = clr # clr, ont, hifi
input_fofn = input.fofn
workdir = 01_rundir

[correct_option]
read_cutoff = 1k
genome_size = 1g # estimated genome size
sort_options = -m 20g -t 15
minimap2_options_raw = -t 8
pa_correction = 3
correction_options = -p 15

[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
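Because the config file is laid out as INI-style sections of key = value pairs, it can be inspected with Python's standard configparser before a run. This is a sketch under assumptions: whether NextDenovo itself reads the file this way is not stated here, and load_run_cfg is a hypothetical helper. It treats a '#' preceded by whitespace as the start of an inline comment and checks only the two keys the option list marks as required.

```python
import configparser

# Keys marked "(required)" in the option list.
REQUIRED_GENERAL_KEYS = ("input_fofn", "read_type")


def load_run_cfg(text):
    """Parse run.cfg-style text into {section: {key: value}} (hypothetical helper)."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",),  # strip trailing "# ..." comments
        interpolation=None,              # keep option strings such as "-m 20g" literal
    )
    parser.read_string(text)
    missing = [
        key for key in REQUIRED_GENERAL_KEYS
        if not parser.has_option("General", key)
    ]
    if missing:
        raise ValueError("missing required [General] keys: " + ", ".join(missing))
    return {section: dict(parser[section]) for section in parser.sections()}
```

For a file on disk, pass its contents, for example load_run_cfg(open("run.cfg").read()).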
Output
workdir/03.ctg_graph/nd.asm.fasta
Contigs in FASTA format. The FASTA header includes the ID, type, length and node count. A consecutive lowercase region in the sequence implies a weak connection, and a single lowercase base marks a low-quality base.
workdir/03.ctg_graph/nd.asm.fasta.stat
Some basic statistics (N10-N90, total size, etc.).
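For readers unfamiliar with the N10-N90 values, the following sketch computes them from a list of contig lengths using the standard definition: Nx is the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches x% of the total assembly size. The function name nx_stats is hypothetical, and the sketch does not reproduce the exact layout of nd.asm.fasta.stat.

```python
def nx_stats(contig_lengths, percents=range(10, 100, 10)):
    """Return ({"N10": ..., ..., "N90": ...}, total_size) for the given lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    total = sum(lengths)
    stats = {}
    for pct in percents:
        threshold = total * pct / 100
        running = 0
        for length in lengths:
            running += length
            if running >= threshold:
                # First contig that pushes the running total past x% of the assembly.
                stats[f"N{pct}"] = length
                break
    return stats, total
```

With lengths 100, 80, 60, 40 and 20 the total is 300, so N50 is 80: the two longest contigs together (180) are the first to reach half of the total.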
Options
Global options
- job_type = sge
local, sge, pbs, lsf, slurm… (default: sge)
- job_prefix = nextDenovo
prefix tag for jobs. (default: nextDenovo)
- task = <all, correct, assemble>
task to run: correct = only do the correction step; assemble = only do the assembly step (only works if input_type = corrected or read_type = hifi); all = correct + assemble. (default: all)
- rewrite = no
overwrite an existing directory [yes, no]. (default: no)
- deltmp = yes
delete intermediate results. (default: yes)
- rerun = 3
re-run unfinished jobs until they finish or rerun loops have been reached, 0 = no. (default: 3)
- parallel_jobs = 10
number of tasks used to run in parallel. (default: 10)
- input_type = raw
input reads type [raw, corrected]. (default: raw)
- input_fofn = input.fofn
input file list, one file per line. (required)
- read_type = {clr, hifi, ont}
read type: clr = PacBio continuous long reads, hifi = PacBio highly accurate long reads, ont = Nanopore 1D reads. (required)
- workdir = 01.workdir
work directory. (default: ./)
- usetempdir = /tmp/test
temporary directory in compute nodes to avoid high IO wait. (default: None)
- nodelist = avanode.list.fofn
a list of hostnames of available nodes, one node one line, used with usetempdir for non-sge job_type.
- submit = auto
command to submit a job, auto = automatically set by Paralleltask.
- kill = auto
command to kill a job, auto = automatically set by Paralleltask.
- check_alive = auto
command to check a job status, auto = automatically set by Paralleltask.
- job_id_regex = auto
the regular expression used to parse the job ID from the output of submit, auto = automatically set by Paralleltask.
- use_drmaa = no
use drmaa to submit and control jobs.
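The defaults and constraints listed for the global options can be gathered into a small pre-flight check. This is an illustrative sketch only: resolve_global_options and GLOBAL_DEFAULTS are hypothetical names, the defaults are copied from the list above, and the only cross-option rule enforced is the documented one that task = assemble needs input_type = corrected or read_type = hifi.

```python
# Defaults as documented in the global option list.
GLOBAL_DEFAULTS = {
    "job_type": "sge",
    "job_prefix": "nextDenovo",
    "task": "all",
    "rewrite": "no",
    "deltmp": "yes",
    "rerun": "3",
    "parallel_jobs": "10",
    "input_type": "raw",
    "workdir": "./",
}
REQUIRED_KEYS = ("input_fofn", "read_type")


def resolve_global_options(user_options):
    """Merge user options over the documented defaults and validate them."""
    opts = {**GLOBAL_DEFAULTS, **user_options}
    missing = [key for key in REQUIRED_KEYS if not opts.get(key)]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    if opts["task"] not in ("all", "correct", "assemble"):
        raise ValueError("task must be all, correct or assemble")
    if opts["input_type"] not in ("raw", "corrected"):
        raise ValueError("input_type must be raw or corrected")
    if opts["read_type"] not in ("clr", "hifi", "ont"):
        raise ValueError("read_type must be clr, hifi or ont")
    # Assembly-only runs need reads that are already accurate enough.
    if (opts["task"] == "assemble"
            and opts["input_type"] != "corrected"
            and opts["read_type"] != "hifi"):
        raise ValueError(
            "task = assemble requires input_type = corrected or read_type = hifi"
        )
    return opts
```

Values are kept as strings, as they would appear in the config file; converting them to integers or booleans is left to the caller.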
Correction options
- read_cutoff = 1k
filter reads with length < read_cutoff. (default: 1k)
- genome_size = 1g
estimated genome size, suffix K/M/G recognized, used to calculate seed_cutoff/seed_cutfiles/blocksize and average depth; it can be omitted when seed_cutoff is set manually.
- seed_depth = 45
expected seed depth, used together with genome_size to calculate seed_cutoff; try values of 30-45 to get a better assembly result. (default: 45)
- seed_cutoff = 0
minimum seed length, <=0 means calculate it automatically using bin/seq_stat.
- seed_cutfiles = 5
split seed reads into seed_cutfiles subfiles. (default: pa_correction)
- blocksize = 10g
block size for parallel running: non-seed reads are split into small files, each at most blocksize in size. (default: 10g)
- pa_correction = 3
number of correction tasks to run in parallel; each correction task requires ~TOTAL_INPUT_BASES/4 bytes of memory. Overrides parallel_jobs for this step only. (default: 3)
- minimap2_options_raw = -t 10
minimap2 options, used to find overlaps between raw reads, see minimap2-nd for details.
- correction_options = -p 10
correction options, as follows:
-p, --process, set the number of processes used for correcting. (default: 10)
-b, --blacklist, disable the filter step and produce more corrected data.
-s, --split, split the corrected seed with un-corrected regions. (default: False)
-fast, 0.5-1 times faster mode with a little lower accuracy. (default: False)
-dbuf, disable caching 2bit files and reduce ~TOTAL_INPUT_BASES/4 bytes of memory usage. (default: False)
-max_lq_length, maximum length of a continuous low quality region in a corrected seed, larger max_lq_length will produce more corrected data with lower accuracy. (default: auto [pb/1k, ont/10k])
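The relationship between genome_size, seed_depth, seed_cutoff and the memory note under pa_correction can be made concrete with a little arithmetic. The sketch below rests on assumptions that are not confirmed by the text above: K/M/G are taken as decimal multipliers (10^3, 10^6, 10^9), the automatic seed_cutoff is approximated as the shortest read length whose longer-or-equal reads together reach genome_size * seed_depth bases (bin/seq_stat may differ in detail), and the total memory figure assumes all pa_correction tasks run on one machine. All function names are hypothetical.

```python
_SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}  # assumed decimal multipliers


def parse_size(value):
    """Convert values such as '1k', '10g' or 5000 to an integer count."""
    text = str(value).strip().lower()
    if text and text[-1] in _SUFFIX:
        return int(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(float(text))


def correction_memory_bytes(total_input_bases, pa_correction=3):
    """Return (bytes per correction task, bytes for all parallel tasks)."""
    per_task = total_input_bases / 4  # ~TOTAL_INPUT_BASES/4 bytes per task
    return per_task, per_task * pa_correction


def estimate_seed_cutoff(read_lengths, genome_size, seed_depth=45, read_cutoff="1k"):
    """Approximate seed_cutoff: the longest reads that sum to the target seed bases."""
    target = parse_size(genome_size) * seed_depth
    min_len = parse_size(read_cutoff)
    # Reads shorter than read_cutoff are filtered out before seeds are chosen.
    kept = sorted((n for n in read_lengths if n >= min_len), reverse=True)
    running = 0
    for length in kept:
        running += length
        if running >= target:
            return length
    # Too little data to reach the target depth: keep every retained read as a seed.
    return min_len
```

As an example of the memory note, 4 Gb of input bases gives about 1 GB per correction task, or about 3 GB in total with pa_correction = 3. Lowering seed_depth shrinks the target and therefore raises the estimated cutoff, which matches the advice to try values between 30 and 45.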