Utilities

seq_stat

seq_stat computes simple statistics on the input data (read length distribution, total amount of data, and sequencing depth) and reports a recommended minimum seed length.

INPUT

read file list, one file per line
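The input is a plain-text list with one read-file path per line. A minimal Python sketch for producing such a list is shown here. The helper name `write_read_list`, the list file name and the accepted file suffixes are illustrative assumptions, not requirements stated by this documentation:

```python
from pathlib import Path

# Assumed suffixes, for illustration only; adjust to the files you actually have.
READ_SUFFIXES = (".fasta", ".fa", ".fastq", ".fq",
                 ".fasta.gz", ".fa.gz", ".fastq.gz", ".fq.gz")

def write_read_list(read_dir, list_path):
    """Write the absolute path of every read file in read_dir, one per line."""
    read_dir = Path(read_dir)
    files = sorted(p.resolve() for p in read_dir.iterdir()
                   if p.is_file() and p.name.endswith(READ_SUFFIXES))
    Path(list_path).write_text("".join(f"{p}\n" for p in files))
    return files
```

The resulting file can then be passed as the read file list to the utilities described on this page.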

OUTPUT (stdout)

Read length histogram

Read length info.

Total Bases info.

Recommended minimum seed length

OPTIONS

-f skip reads with length shorter than this value [1kb].

-g estimated genome size [5Mb].

-d expected depth of the seeds to be corrected (30-45) [45].

-a disable automatic adjustment.

-o output file [stdout].
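To make the relationship between genome size (-g), seed depth (-d) and the recommended minimum seed length concrete, here is a minimal Python sketch. It assumes the cutoff is the largest read length at which the reads at or above that length still supply `seed_depth * genome_size` bases. That is one plausible reading of the description above, not a statement of how seq_stat is implemented, and the automatic adjustment controlled by -a is not modelled. The function name `seed_length_cutoff` is hypothetical.

```python
def seed_length_cutoff(read_lengths, genome_size, seed_depth, min_read_len=1000):
    """Return (total_bases, depth, cutoff) for a collection of read lengths.

    cutoff is the largest read length L such that reads of length >= L
    together provide at least seed_depth * genome_size bases
    (assumed definition, see the text above).
    """
    # Mirror -f: ignore reads shorter than min_read_len.
    lengths = sorted((n for n in read_lengths if n >= min_read_len), reverse=True)
    total_bases = sum(lengths)
    depth = total_bases / genome_size
    target = seed_depth * genome_size

    cumulative = 0
    cutoff = None  # stays None when the data cannot reach the target depth
    for n in lengths:
        cumulative += n
        if cumulative >= target:
            cutoff = n
            break
    return total_bases, depth, cutoff
```

For example, with read lengths of 5000, 4000, 3000, 2000 and 1000 bases, a genome size of 1000 and a seed depth of 9, the two longest reads already supply 9000 bases, so the sketch returns a cutoff of 4000.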

seq_dump

seq_dump is used to classify reads based on a given seed length threshold, and to split and compress the different categories into subfiles (bit format).

INPUT

read file list, one file per line

OUTPUT

The output consists of four parts:

- input.part*2bit (non-seed reads)
- .input.part*idx (index of non-seed reads)
- input.seed*2bit (seed reads)
- .input.seed*idx (index of seed reads)

OPTIONS

-f minimum read length.

-s minimum seed length.

-b block size (Mb or Gb, < 16Gb).

-n number of seed subfiles in total.

-d output directory.
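The classification step can be pictured with a short Python sketch. It assumes that reads shorter than the minimum read length (-f) are dropped and that reads at least as long as the minimum seed length (-s) are treated as seeds. Whether the real comparison is inclusive is not stated here, so treat the boundary handling as an assumption. The function name `classify_reads` is hypothetical, and splitting into subfiles and compression are not shown.

```python
def classify_reads(reads, min_read_len, min_seed_len):
    """Split (name, sequence) pairs into (seeds, non_seeds) by length."""
    seeds, non_seeds = [], []
    for name, seq in reads:
        if len(seq) < min_read_len:
            continue  # assumed behaviour: too-short reads are discarded
        if len(seq) >= min_seed_len:
            seeds.append((name, seq))
        else:
            non_seeds.append((name, seq))
    return seeds, non_seeds
```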

seq_bit

seq_bit can be used to compress fasta files to bit files or uncompress bit files to fasta files.

INPUT

one seq file.

OUTPUT (stdout)

sequences in fasta or bit format.
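The on-disk bit format is not described here. As a general illustration of why a bit encoding saves space, the following Python sketch packs four bases into each byte using two bits per base. The bit layout and the function names `pack_2bit` and `unpack_2bit` are assumptions for illustration only and are not the format written by seq_bit. Ambiguous bases such as N are not handled.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack_2bit(seq):
    """Pack an ACGT string into bytes, four bases per byte (low bits first)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(out)

def unpack_2bit(data, length):
    """Recover the first `length` bases from bytes produced by pack_2bit."""
    return "".join(_BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))
```

The sequence length has to be stored separately, because the last byte may be only partly filled.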

minimap2-nd

minimap2-nd is a modified version of minimap2 that is used to find all overlaps between raw reads and dovetail overlaps between corrected seeds. Compared to minimap2, minimap2-nd has five minor modifications:

1. Add support for input files in bit format.
2. Add a filter step for output.
3. Compress output when output to a file.
4. Add a re-align step for potential dovetail overlaps.
5. Optimize overlapping for PacBio HiFi reads.

EXTRA OPTIONS

--step <1,2,3> preset options for NextDenovo, [required].

--minlen INT min overlap length [500]

--minmatch INT min match length [100]

--minide FLOAT min identity [0.05]

--mode <0,1,2> re-align mode: 0 = disabled, 1 = fast mode (low accuracy), 2 = slow mode (high accuracy) [2]

--kn INT k-mer size (no larger than 28), used for re-alignment [17]

--wn INT minimizer window size, used for re-alignment [10]

--cn INT re-align once for every INT reads; larger is faster [20]

--maxhan1 INT max overhang length, used for re-alignment [5000]

--maxhan2 INT max overhang length, used to filter contained reads [500]

-x ava-hifi HiFi read overlap
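The output filter thresholds (--minlen, --minmatch, --minide) can be illustrated with a small Python sketch. The `Overlap` record and the `passes_filter` function are hypothetical. The sketch also assumes that identity is computed as matches divided by alignment length and that --minlen applies to the alignment length; the options above do not spell out either point.

```python
from dataclasses import dataclass

@dataclass
class Overlap:
    query: str
    target: str
    matches: int   # number of matching bases
    aln_len: int   # alignment length (assumed meaning of "overlap length")

    @property
    def identity(self):
        # Assumed definition: matching bases over alignment length.
        return self.matches / self.aln_len if self.aln_len else 0.0

def passes_filter(ovl, minlen=500, minmatch=100, minide=0.05):
    """Apply the three thresholds, using the documented default values."""
    return (ovl.aln_len >= minlen
            and ovl.matches >= minmatch
            and ovl.identity >= minide)
```

With the defaults, an overlap with 400 matches over 800 aligned bases is kept, while one with only 90 matches is dropped regardless of its length.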

ovl_sort

ovl_sort is used to sort the overlaps of a given seed by number of matches and to remove redundant overlaps.

INPUT

overlap file list, one file per line.

index file of the seeds that need to be sorted.

OUTPUT

sorted overlap file.

OPTIONS

-i index file of the seeds that need to be sorted [required]

-m set max total available buffer size, suffix K/M/G [40G]

-t number of threads to use [8]

-k max depth of each overlap, should be <= average sequencing depth [40]

-l max overhang length to filter [300]

-o output file name [required]

-d temporary directory [$CWD]
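As a rough illustration of sorting by number of matches and removing redundancy, the following Python sketch keeps the best overlap for each (seed, read) pair and then the highest-scoring overlaps per seed. It simplifies -k to a cap on the number of overlaps kept per seed, which is not the same as a true depth limit, and it ignores the overhang filter (-l) and the memory buffer (-m). The function name and the tuple layout are hypothetical.

```python
import heapq
from collections import defaultdict

def top_overlaps_per_seed(overlaps, max_depth=40):
    """overlaps: iterable of (seed_id, read_id, matches) tuples.

    Returns {seed_id: [(read_id, matches), ...]} sorted by matches, descending.
    """
    by_seed = defaultdict(dict)
    for seed, read, matches in overlaps:
        # Redundancy removal: keep only the best overlap per (seed, read) pair.
        if matches > by_seed[seed].get(read, -1):
            by_seed[seed][read] = matches
    return {
        seed: heapq.nlargest(max_depth, best.items(), key=lambda kv: kv[1])
        for seed, best in by_seed.items()
    }
```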

ovl_cvt

ovl_cvt can be used to compress or uncompress overlap files.

INPUT

one overlap file

OUTPUT (stdout)

compressed or uncompressed overlaps

OPTIONS

-m INT conversion mode (0 for compress, 1 for uncompress)

nextgraph

NextGraph is used to construct a string graph from corrected reads. The main algorithms are similar to those of other mainstream assemblers, except that it uses a graph-based algorithm to identify chimeric nodes and a scoring-based strategy to identify incorrect edges. It can output an assembly in Fasta, GFA2, GraphML, or Path format, or only statistical information (for quickly optimizing parameters).

INPUT

read file list, one file per line.

overlap file list, one file per line.

OUTPUT

assembly statistical information.

assembly sequences.

OPTIONS

-f FILE input seq list [required]

-o FILE output file [stdout]

-c disable pre-filtering of chimeric reads

-G retain potential chimeric edges

-k delete complex bubble paths

-A output alternative contigs; for highly heterozygous genomes, this will increase the assembly size.

-a, --out_format INT

output format, 0=None, 1=fasta, 2=graphml, 3=gfa2, 4=path [1]

-E, --out_ctg_len INT

min contig length for output [1000]

-q, --out_spath_len INT

min short branch len for output, 0=disable, set 5-16 to adjust the assembly size [0]

-i, --min_ide FLOAT

min identity of alignments [0.10]

-I, --min_ide_ratio FLOAT

min test-to-best identity ratio [0.70]

-R, --max_ide_ratio FLOAT

max test-to-best identity ratio of a low quality edge [0.00]

-S, --min_sco_ratio FLOAT

min test-to-best aligned length ratio [0.40]

-r, --max_sco_ratio FLOAT

max test-to-best score ratio of a low quality edge [0.50]

-M, --min_mat_ratio FLOAT

min test-to-best aligned matches ratio [0.90]

-T, --min_depth_ratio FLOAT

min test-to-best depth ratio of an edge [0.60]

-N, --min_node_count <1,2>

min valid nodes of a read [2]

-u, --min_con_count <1,2>

min contained number to filter contained reads [2]

-w, --min_edge_cov INT

min depth of an edge [3]

-D, --bfs_depth INT

depth of BFS to identify chimeric nodes [2]

-P, --bfs_depth_multi INT

max depth multiple of a node for BFS [2]

-m, --min_depth_multi FLOAT

min depth multiple of a repeat node [1.50]

-n, --max_depth_multi FLOAT

max depth multiple of a node [2000.00]

-B, --bubble_len INT

max len of a bubble [500]

-C, --cpath_len INT

max len of a compound path [20]

-z, --zbranch_len INT

max len of a z branch [8]

-l, --sbranch_len INT

max len of a short branch [15]

-L, --sloop_len INT

max len of a short loop [5]

-t, --max_hang_len INT

max overhang length of dovetails [500]

-F, --fuzz_len INT

fuzz len for trans-reduction [1000]
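Several of the options above are expressed as "test-to-best" ratios. The sketch below shows one way to read that term, assuming it means the value of a candidate edge divided by the best value among all edges of the same node. This interpretation, the function name `filter_edges_by_best` and the data layout are assumptions for illustration and do not describe nextgraph's actual scoring strategy.

```python
def filter_edges_by_best(edges, min_ide_ratio=0.70):
    """edges: dict mapping node -> list of (neighbor, identity) tuples.

    Keeps an edge when identity / best_identity_at_node >= min_ide_ratio.
    """
    kept = {}
    for node, out_edges in edges.items():
        best = max((ide for _, ide in out_edges), default=0.0)
        if best <= 0:
            # No usable reference value: leave the edges untouched.
            kept[node] = list(out_edges)
            continue
        kept[node] = [(nb, ide) for nb, ide in out_edges
                      if ide / best >= min_ide_ratio]
    return kept
```

With the default ratio of 0.70 and a best identity of 0.9 at a node, an edge with identity 0.7 is kept (ratio about 0.78) and an edge with identity 0.5 is removed (ratio about 0.56).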

bam_sort

bam_sort is used to sort bam files.

INPUT

bam file to be sorted.

OUTPUT

sorted bam file.

index file.

OPTIONS

-i Write index file.

-m INT Set maximum memory per thread; suffix K/M/G recognized [1024M]

-o FILE Write final output to FILE rather than standard output.

-T PREFIX Write temporary files to PREFIX.nnnn.bam.

-@ INT

Number of additional threads to use [0]