Utilities

seq_stat

seq_stat computes simple statistics on the input data (read length distribution, total amount of data, and sequencing depth) and reports a recommended minimum seed length.

INPUT
  • read files list, one line one file
OUTPUT (stdout)
  • Read length histogram
  • Read length info.
  • Total Bases info.
  • Recommended minimum seed length
OPTIONS
-f skip reads with length shorter than this value [1kb].
-g estimated genome size [5Mb].
-d expected depth of the seeds to be corrected (30-45) [45].
-a disable automatic adjustment.
-o output file [stdout].
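
As a rough illustration of how a recommended minimum seed length can follow from the -g and -d values, the Python sketch below picks the largest length cutoff at which the reads at least that long still total genome size times seed depth bases. The selection rule and the function name recommend_seed_length are assumptions made for illustration; they are not taken from the seq_stat source, and the automatic adjustment that -a disables is not modelled.

```python
def recommend_seed_length(read_lengths, genome_size, seed_depth, min_read_len=1000):
    """Return the largest cutoff L such that reads with length >= L
    still total at least genome_size * seed_depth bases.

    Hypothetical helper: the selection rule is an assumption,
    not the actual seq_stat algorithm.
    """
    target_bases = genome_size * seed_depth
    # Mirror -f: ignore reads shorter than the minimum read length.
    kept = sorted((n for n in read_lengths if n >= min_read_len), reverse=True)
    total = 0
    for length in kept:
        total += length
        if total >= target_bases:
            return length
    # Not enough data to reach the requested seed depth.
    return None


# Toy data: a 10 kb "genome" with a requested seed depth of 3.
lengths = [500, 1200, 3000, 8000, 15000, 20000]
print(recommend_seed_length(lengths, genome_size=10000, seed_depth=3))
```

With the toy values, the two longest reads already cover the 30,000 target bases, so the cutoff is the length of the second-longest read.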

seq_dump

seq_dump classifies reads based on a given seed length threshold, then splits and compresses each category into subfiles (bit format).

INPUT
  • read files list, one line one file
OUTPUT

The output consists of four parts:

- input.part*2bit (non-seed reads)
- .input.part*idx (index of non-seed reads)
- input.seed*2bit (seed reads)
- .input.seed*idx (index of seed reads)
OPTIONS
-f minimum read length.
-s minimum seed length.
-b block size (Mb or Gb, < 16Gb).
-n number of seed subfiles in total.
-d output directory.
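
The classification step can be pictured with the short Python sketch below, which applies a minimum read length (as with -f) and a minimum seed length (as with -s) and then spreads the seeds over a fixed number of groups (as with -n). The function names and the round-robin split are assumptions for illustration only; the real tool also compresses and indexes its output, which is not shown.

```python
def classify_reads(reads, min_read_len, min_seed_len):
    """Split (name, sequence) pairs into seed and non-seed reads.

    Hypothetical helper: reads shorter than min_read_len are dropped,
    reads of at least min_seed_len become seeds, the rest are non-seeds.
    """
    seeds, non_seeds = [], []
    for name, seq in reads:
        if len(seq) < min_read_len:
            continue  # too short to keep at all
        if len(seq) >= min_seed_len:
            seeds.append((name, seq))
        else:
            non_seeds.append((name, seq))
    return seeds, non_seeds


def split_round_robin(items, n):
    """Distribute items over n groups in turn (an assumed strategy)."""
    groups = [[] for _ in range(n)]
    for i, item in enumerate(items):
        groups[i % n].append(item)
    return groups


reads = [("r1", "A" * 500), ("r2", "A" * 2000), ("r3", "A" * 9000), ("r4", "A" * 12000)]
seeds, non_seeds = classify_reads(reads, min_read_len=1000, min_seed_len=8000)
print([name for name, _ in seeds], [name for name, _ in non_seeds])
print([[name for name, _ in g] for g in split_round_robin(seeds, 2)])
```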

seq_bit

seq_bit can be used to compress fasta files to bit files or uncompress bit files to fasta files.

INPUT
  • one seq file.
OUTPUT (stdout)
  • sequences with fasta or bit format.
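
To show why a bit representation saves space, the Python sketch below packs each base into two bits, so four bases fit in one byte. The encoding table, the bit order, and the function names are assumptions for illustration; the actual on-disk layout used by seq_bit is not described here and may differ (for example in how it handles N bases or read names).

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per base
BASES = "ACGT"


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


seq = "GATTACA"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
print(unpack_2bit(packed, len(seq)))
```

The sequence length has to be stored alongside the packed bytes, because the last byte may be only partly used.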

minimap2-nd

minimap2-nd is a modified version of minimap2 that finds all overlaps between raw reads and dovetail overlaps between corrected seeds. Compared to minimap2, minimap2-nd has five minor modifications:

1. Add support for input files in bit format.
2. Add a filter step for output.
3. Compress output when output to a file.
4. Add a re-align step for potential dovetail overlaps.
5. Optimize overlapping for PacBio HiFi reads.
EXTRA OPTIONS
--step <1,2,3> preset options for NextDenovo, [required].
--minlen INT min overlap length [500]
--minmatch INT min match length [100]
--minide FLOAT min identity [0.05]
--mode <0,1,2> re-align mode: 0 = disabled; 1 = fast mode, low accuracy; 2 = slow mode, high accuracy [2]
--kn INT k-mer size (no larger than 28), used to re-align [17]
--wn INT minimizer window size, used to re-align [10]
--cn INT do re-align for every INT reads, larger is faster [20]
--maxhan1 INT max overhang length, used to re-align [5000]
--maxhan2 INT max overhang length, used to filter contained reads [500]
-x ava-hifi HiFi read overlap
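
The effect of the --minlen, --minmatch and --minide thresholds can be sketched in Python as a filter over overlap records. The sketch assumes PAF-style tab-separated records as written by stock minimap2 (query and target coordinates in columns 3-4 and 8-9, number of matches in column 10, alignment block length in column 11); whether minimap2-nd applies its filter to exactly these quantities is an assumption, and passes_filter is a hypothetical name.

```python
def passes_filter(paf_line, minlen=500, minmatch=100, minide=0.05):
    """Return True if a PAF-style overlap record meets all three thresholds.

    Hypothetical helper: the exact definitions of overlap length and
    identity used by minimap2-nd are assumed, not confirmed.
    """
    fields = paf_line.rstrip("\n").split("\t")
    qstart, qend = int(fields[2]), int(fields[3])
    tstart, tend = int(fields[7]), int(fields[8])
    matches, aln_len = int(fields[9]), int(fields[10])
    # Use the shorter of the two aligned spans as the overlap length.
    overlap_len = min(qend - qstart, tend - tstart)
    identity = matches / aln_len if aln_len else 0.0
    return overlap_len >= minlen and matches >= minmatch and identity >= minide


good = "q1\t10000\t100\t5100\t+\tt1\t12000\t0\t5000\t4500\t5100\t255"
short = "q2\t3000\t0\t300\t+\tt1\t12000\t0\t300\t280\t300\t255"
print(passes_filter(good), passes_filter(short))
```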

ovl_sort

ovl_sort sorts the overlaps of each given seed by number of matches and removes redundant overlaps.

INPUT
  • overlap files, one line one file.
  • index file of the seeds that need to be sorted.
OUTPUT
  • sorted overlap file.
OPTIONS
-i index file of the seeds that need to be sorted [required]
-m set max total available buffer size, suffix K/M/G [40G]
-t number of threads to use [8]
-k max depth of each overlap, should be <= average sequencing depth [40]
-l max overhang length to filter [300]
-o output file name [required]
-d temporary directory [$CWD]
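
The sorting and de-duplication idea can be sketched in Python as follows: overlaps are grouped by seed, only the best record per seed/read pair is kept, and the survivors are ranked by number of matches. In this sketch -k is simplified to a cap on the number of overlaps kept per seed, which is an assumption rather than the tool's exact definition of depth; the record layout and function name are also hypothetical, and the overhang filter (-l) is not modelled.

```python
from collections import defaultdict


def sort_overlaps(overlaps, max_depth=40):
    """Group (seed_id, read_id, matches) records by seed, drop duplicate
    seed/read pairs, and rank the rest by matches (highest first).

    Hypothetical helper: max_depth is treated as a simple per-seed cap.
    """
    by_seed = defaultdict(dict)
    for seed_id, read_id, matches in overlaps:
        best = by_seed[seed_id].get(read_id)
        if best is None or matches > best:
            # Keep only the best overlap for each seed/read pair.
            by_seed[seed_id][read_id] = matches
    result = {}
    for seed_id, partners in by_seed.items():
        ranked = sorted(partners.items(), key=lambda kv: kv[1], reverse=True)
        result[seed_id] = ranked[:max_depth]
    return result


overlaps = [("s1", "r1", 900), ("s1", "r2", 1500), ("s1", "r1", 1200), ("s2", "r3", 700)]
print(sort_overlaps(overlaps))
```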

ovl_cvt

ovl_cvt can be used to compress or uncompress overlap files.

INPUT
  • one overlap file
OUTPUT (stdout)
  • compressed or uncompressed overlaps
OPTIONS
-m INT conversion mode (0 for compress, 1 for uncompress)

nextgraph

NextGraph constructs a string graph from corrected reads. The main algorithms are similar to those of other mainstream assemblers, except that it uses a graph-based algorithm to identify chimeric nodes and a scoring-based strategy to identify incorrect edges. It can output an assembly in Fasta, GFA2, GraphML, or Path format, or only statistical information (for quickly optimizing parameters).

INPUT
  • read files list, one line one file.
  • overlap files list, one line one file.
OUTPUT
  • assembly statistical information.
  • assembly sequences.
OPTIONS
-f FILE input seq list [required]
-o FILE output file [stdout]
-c disable pre-filtering of chimeric reads
-G retain potential chimeric edges
-k delete complex bubble paths
-A output alternative contigs (intended for highly heterozygous genomes); this increases the assembly size.
-a, --out_format INT
 output format, 0=None, 1=fasta, 2=graphml, 3=gfa2, 4=path [1]
-E, --out_ctg_len INT
 min contig length for output [1000]
-q, --out_spath_len INT
 min short branch len for output, 0=disable, set 5-16 to adjust the assembly size [0]
-i, --min_ide FLOAT
 min identity of alignments [0.10]
-I, --min_ide_ratio FLOAT
 min test-to-best identity ratio [0.70]
-R, --max_ide_ratio FLOAT
 max test-to-best identity ratio of a low quality edge [0.00]
-S, --min_sco_ratio FLOAT
 min test-to-best aligned length ratio [0.40]
-r, --max_sco_ratio FLOAT
 max test-to-best score ratio of a low quality edge [0.50]
-M, --min_mat_ratio FLOAT
 min test-to-best aligned matches ratio [0.90]
-T, --min_depth_ratio FLOAT
 min test-to-best depth ratio of an edge [0.60]
-N, --min_node_count <1,2>
 min valid nodes of a read [2]
-u, --min_con_count <1,2>
 min contained number to filter contained reads [2]
-w, --min_edge_cov INT
 min depth of an edge [3]
-D, --bfs_depth INT
 depth of BFS to identify chimeric nodes [2]
-P, --bfs_depth_multi INT
 max depth multiple of a node for BFS [2]
-m, --min_depth_multi FLOAT
 min depth multiple of a repeat node [1.50]
-n, --max_depth_multi FLOAT
 max depth multiple of a node [2000.00]
-B, --bubble_len INT
 max len of a bubble [500]
-C, --cpath_len INT
 max len of a compound path [20]
-z, --zbranch_len INT
 max len of a z branch [8]
-l, --sbranch_len INT
 max len of a short branch [15]
-L, --sloop_len INT
 max len of a short loop [5]
-t, --max_hang_len INT
 max overhang length of dovetails [500]
-F, --fuzz_len INT
 fuzz len for trans-reduction [1000]
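
Several of the options above are "test-to-best" ratios. One plausible reading, sketched in Python below, is that each edge leaving a node is compared with the best edge leaving that same node, and edges falling below the ratio are discarded. This interpretation, the edge tuple layout, and the function name are assumptions for illustration; NextGraph's actual scoring strategy combines several criteria and is not reproduced here.

```python
def filter_edges_by_best_ratio(edges, min_ide_ratio=0.70):
    """Keep edges whose identity is at least min_ide_ratio times the
    best identity among edges leaving the same source node.

    edges: list of (source, target, identity) tuples (hypothetical layout).
    """
    best = {}
    for src, _tgt, ide in edges:
        best[src] = max(best.get(src, 0.0), ide)
    return [
        (src, tgt, ide)
        for src, tgt, ide in edges
        if best[src] > 0 and ide / best[src] >= min_ide_ratio
    ]


edges = [("a", "b", 0.98), ("a", "c", 0.60), ("b", "c", 0.95)]
print(filter_edges_by_best_ratio(edges))
```

In the toy graph, the edge from a to c is dropped because its identity is only about 61% of the best edge leaving a, which is below the 0.70 default.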

bam_sort

bam_sort is used to sort bam files.

INPUT
  • bam file that needs to be sorted.
OUTPUT
  • sorted bam file.
  • index file.
OPTIONS
-i Write index file.
-m INT Set maximum memory per thread; suffix K/M/G recognized [1024M]
-o FILE Write final output to FILE rather than standard output.
-T PREFIX Write temporary files to PREFIX.nnnn.bam.
-@ INT Number of additional threads to use [0]