Utilities
seq_stat
seq_stat can be used to perform some simple statistics (such as length distribution, total amount of data and sequencing depth) on the input data, and give the recommended minimum seed length.
- INPUT
read files list, one line one file
- OUTPUT (stdout)
Read length histogram
Read length info.
Total Bases info.
Recommended minimum seed length
- OPTIONS
- -f
skip reads with length shorter than this value [1kb].
- -g
estimated genome size [5Mb].
- -d
expected seed depth (30-45), used to be corrected [45].
- -a
disable automatic adjustment.
- -o
output file [stdout].
seq_dump
sql_dump is used to classify reads based on a given seed length threshold, and split and compress different categories to subfiles (bit format).
- INPUT
read files list, one line one file
- OUTPUT
The output consists of four parts:
- input.part*2bit (non-seed reads) - .input.part*idx (index of non-seed reads) - input.seed*2bit (seed reads) - .input.seed*idx (index of seed reads)
- OPTIONS
- -f
minimum read length.
- -s
minimum seed length.
- -b
block size (Mb or Gb, < 16Gb).
- -n
number of seed subfiles in total.
- -d
output directory.
seq_bit
seq_bit can be used to compress fasta files to bit files or uncompress bit files to fasta files.
- INPUT
one seq file.
- OUTPUT (stdout)
sequences with fasta or bit format.
minimap2-nd
minimap2-nd is a modified version of minimap2, which is used to find all overlaps between raw reads and dovetail overlaps between corrected seeds. Compared to minimap2, minimap-nd has five minor modifications:
1. Add support for input files in bit format. 2. Add a filter step for output. 3. Compress output when output to a file. 4. Add a re-align step for potential dovetail overlaps. 5. Optimize overlapping for PacBio Hifi reads.
- EXTRA OPTIONS
- --step <1,2,3>
preset options for NextDenovo, [required].
- --minlen INT
min overlap length [500]
- --minmatch INT
min match length [100]
- --minide FLOAT
min identity [0.05]
- --mode <0,1,2>
re-align mode, 0:disable 1:fast mode, low accuracy 2:slow mode, high accuracy [2]
- --kn INT
k-mer size (no larger than 28), used to re-align [17]
- --wn INT
minizer window size, used to re-align [10]
- --cn INT
do re-align for every INT reads, larger is faster [20]
- --maxhan1 INT
max over hang length, used to re-align [5000]
- --maxhan2 INT
max over hang length, used to filter contained reads [500]
- -x ava-hifi
Hifi read overlap
ovl_sort
ovl_sort is used to sort and remove redundancy overlaps by number of matches for a given seed.
- INPUT
overlap files, one line one file.
index file of seeds need to be sorted.
- OUTPUT
sorted overlap file.
- OPTIONS
- -i
index file of seeds need to be sorted [required]
- -m
set max total available buffer size, suffix K/M/G [40G]
- -t
number of threads to use [8]
- -k
max depth of each overlap, should <= average sequencing depth [40]
- -l
max over hang length to filter [300]
- -o
output file name [required]
- -d
temporary directory [$CWD]
ovl_cvt
ovl_cvt can be used to compress or uncompress overlap files.
- INPUT
one overlap file
- OUTPUT (stdout)
compressed or uncompressed overlaps
- OPTIONS
- -m INT
conversion mode (0 for compress, 1 for uncompress)
nextgraph
NextGraph is used to construct a string graph with corrected reads. The main algorithms are similar to other mainstream assemblers except using a graph-based algorithm to identify chimeric nodes and a scoring-based strategy to identify incorrect edges. It can output an assembly in Fasta, GFA2, GraphML, Path formats, or only statistical information (for quickly optimize parameters).
- INPUT
read files list, one line one file.
overlap files list, one line one file.
- OUTPUT
assembly statistical information.
assembly sequences.
- OPTIONS
- -f FILE
input seq list [required]
- -o FILE
output file [stdout]
- -c
disable pre-filter chimeric reads
- -G
retain potential chimeric edges
- -k
delete complex bubble paths
- -A
output alternative contigs, for highly heterozygous genomes, it will increase assembly size.
- -a, --out_format INT
output format, 0=None, 1=fasta, 2=graphml, 3=gfa2, 4=path [1]
- -E, --out_ctg_len INT
min contig length for output [1000]
- -q, --out_spath_len INT
min short branch len for output, 0=disable, set 5-16 to adjust the assembly size [0]
- -i, --min_ide FLOAT
min identity of alignments [0.10]
- -I, --min_ide_ratio FLOAT
min test-to-best identity ratio [0.70]
- -R, --max_ide_ratio FLOAT
min test-to-best identity ratio of a low quality edge [0.00]
- -S, --min_sco_ratio FLOAT
min test-to-best aligned length ratio [0.40]
- -r, --max_sco_ratio FLOAT
max test-to-best score ratio of a low quality edge [0.50]
- -M, --min_mat_ratio FLOAT
min test-to-best aligned matches ratio [0.90]
- -T, --min_depth_ratio FLOAT
min test-to-best depth ratio of an edge [0.60]
- -N, --min_node_count <1,2>
min valid nodes of a read [2]
- -u, --min_con_count <1,2>
min contained number to filter contained reads [2]
- -w, --min_edge_cov INT
min depth of an edge [3]
- -D, --bfs_depth INT
depth of BFS to identify chimeric nodes [2]
- -P, --bfs_depth_multi INT
max depth multiple of a node for BFS [2]
- -m, --min_depth_multi FLOAT
min depth multiple of a repeat node [1.50]
- -n, --max_depth_multi FLOAT
max depth multiple of a node [2000.00]
- -B, --bubble_len INT
max len of a bubble [500]
- -C, --cpath_len INT
max len of a compound path [20]
- -z, --zbranch_len INT
max len of a z branch [8]
- -l, --sbranch_len INT
max len of a short branch [15]
- -L, --sloop_len INT
max len of a short loop [5]
- -t, --max_hang_len INT
max over hang length of dovetails [500]
- -F, --fuzz_len INT
fuzz len for trans-reduction [1000]
bam_sort
bam_sort is used to sort bam files.
- INPUT
bam file need to be sorted.
- OUTPUT
sorted bam file.
index file.
- OPTIONS
- -i
Write index file.
- -m INT
Set maximum memory per thread; suffix K/M/G recognized [1024M]
- -o FILE
Write final output to FILE rather than standard output.
- -T PREFIX
Write temporary files to PREFIX.nnnn.bam.
- -@ INT
Number of additional threads to use [0]