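fqkit: a small tool for processing FASTQ sequences (Part 1)

A small command-line tool for processing FASTQ sequencing files. It is still under active development and the set of subcommands is still small. It supports gzip-compressed input and output: when the output file name ends in .gz, the result is compressed automatically.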
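
To illustrate the "compress when the name ends in .gz" behaviour, here is a minimal Python sketch. It is not fqkit's code; the helper name open_fastq is made up for this example, and it only handles gzip.

```python
import gzip
import os
import tempfile

def open_fastq(path, mode="rt"):
    """Open a FASTQ file, switching to gzip when the name ends in .gz (illustrative helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)

# Demo: round-trip one record through a gzip-compressed file.
record = "@r1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fq.gz")
    with open_fastq(path, "wt") as fh:
        fh.write(record)
    with open(path, "rb") as fh:
        magic = fh.read(2)  # gzip streams start with the bytes 1f 8b
    with open_fastq(path) as fh:
        round_trip = fh.read()
```

The caller never asks for compression explicitly; the file suffix alone decides it.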

repo:

https://github.com/sharkLoc/fqkit

install:

cargo install fqkit

usage:

FqKit -- A simple and cross-platform program for fastq file manipulation

Version: 0.4.5

Authors: sharkLoc <mmtinfo@163.com>
Source code: https://github.com/sharkLoc/fqkit.git

Fqkit supports reading and writing gzip (.gz) format.
Bzip2 (.bz2) format is supported since v0.3.8.
Xz (.xz) format is supported since v0.3.9.
Under the same compression level, xz has the highest compression ratio but consumes more time. 

Compression level:
  format   range   default   crate
  gzip     1-9     6         https://crates.io/crates/flate2
  bzip2    1-9     6         https://crates.io/crates/bzip2
  xz       1-9     6         https://crates.io/crates/xz2


Usage: fqkit [OPTIONS] <COMMAND>

Commands:
  topn     get first N records from fastq file [aliases: head]
  tail     get last N records from fastq file
  concat   concat fastq files from different lanes
  subfq    subsample sequences from big fastq file [aliases: sample]
  select   select pair-end reads by read id
  trim     trim fastq reads by position
  adapter  cut the adapter sequence on the reads
  filter   a simple filter for pair end fastq sqeuence
  range    print fastq records in a range
  search   search reads/motifs from fastq file
  grep     grep fastq sequence by read id or full name
  stats    summary for fastq format file [aliases: stat]
  shuffle  shuffle fastq sequences
  size     report the number sequences and bases
  slide    extract subsequences in sliding windows
  sort     sort fastq file by name/seq/gc/length
  plot     line plot for A T G C N percentage in read position
  fq2fa    translate fastq to fasta
  fq2sam   converts a fastq file to an unaligned SAM file
  fqscore  converts the fastq file quality scores
  flatten  flatten fastq sequences [aliases: flat]
  barcode  perform demultiplex for pair-end fastq reads [aliases: demux]
  check    check the validity of a fastq record
  remove   remove reads by read name
  rename   rename sequence id in fastq file
  reverse  get a reverse-complement of fastq file [aliases: rev]
  split    split interleaved fastq file
  merge    merge PE reads as interleaved fastq file
  mask     convert any low quality base to 'N' or other chars
  split2   split fastq file by records number
  gcplot   get GC content result and plot
  length   get reads length count [aliases: len]
  view     view fastq file page by page
  help     Print this message or the help of the given subcommand(s)

Global Arguments:
      --compress-level <INT>  set gzip/bzip2/xz compression level 1 (compress faster) - 9 (compress better) for
                              gzip/bzip2/xz output file, just work with option -o/--out [default: 6]
      --log <FILE>            if file name specified, write log message to this file, or write to stderr
  -v, --verbosity <STR>       control verbosity of logging, possible values: {error, warn, info, debug, trace} [default:
                              debug]

Global FLAGS:
  -q, --quiet    be quiet and do not show any extra information
  -h, --help     prints help information
  -V, --version  prints version information

Use "fqkit help [command]" for more information about a command

topn:

Print the first N reads of a FASTQ file. The -n option sets the number of reads; -q turns off logging.

subfq:

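Randomly sample a given number of reads from a FASTQ file (reservoir sampling). For very large files where many reads are sampled, the -r option reduces memory usage at the cost of a longer run time; -q turns off logging.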
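
Reservoir sampling picks k items uniformly at random in a single pass, without knowing the total number of records in advance. The sketch below is the textbook "Algorithm R" in Python, shown only to explain the idea. It is not taken from fqkit's source, and the function name and the seed parameter are assumptions of this example.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Keep record i with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir

reads = [f"read{i}" for i in range(1000)]
sample = reservoir_sample(reads, 10, seed=42)
```

The reservoir holds all k selected records in memory at once, which is why memory use grows with the number of reads requested.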

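search:

Search a FASTQ file for reads that contain a target pattern/motif. The -p option gives the pattern/motif (it must be uppercase), and regular expressions are accepted; -q turns off logging.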
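
The idea behind a motif search can be sketched in a few lines of Python. This is an illustration only: it assumes four lines per record, the helper names read_fastq and search_reads are made up, and Python's re syntax may differ in detail from the regex engine fqkit uses.

```python
import io
import re

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def search_reads(handle, pattern):
    """Return the records whose sequence line matches the regular expression."""
    regex = re.compile(pattern)
    return [rec for rec in read_fastq(handle) if regex.search(rec[1])]

demo = io.StringIO(
    "@r1\nACGTTTGACA\n+\nIIIIIIIIII\n"
    "@r2\nCCCCCCCCCC\n+\nIIIIIIIIII\n"
    "@r3\nGGTTGACAGG\n+\nIIIIIIIIII\n"
)
hits = search_reads(demo, r"TTGAC[AG]")  # uppercase motif with a character class
```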

stats:

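Report basic statistics for a FASTQ file, including the count of each quality score at every cycle (read position).
summary.txt: summary of the basic statistics: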

read average length:    126
read max length:        126
total gc content(%):    57.52
total read count:       2000
total base count:       252000

base A count:   53864   (21.37%)
base T count:   53136   (21.09%)
base G count:   70989   (28.17%)
base C count:   73967   (29.35%)
base N count:   44      (0.02%)

Number of base calls with quality value of 20 or higher (Q20+) (%)      237670  (94.31%)
Number of base calls with quality value of 30 or higher (Q30+) (%)      223461  (88.67%)

cycle.txt: the count of each quality score at every cycle (read position)
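
The figures in summary.txt can be reproduced with simple counting. In the sample output above, total GC content is (70989 + 73967) / 252000 = 57.52%, so the denominator includes N bases. The Python sketch below computes the same kind of summary. It assumes four lines per record and Phred+33 quality encoding; the function names and dictionary keys are made up for this example and are not fqkit's.

```python
import io
from collections import Counter

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def summarize(handle, offset=33):
    """Count reads, bases, GC percentage and Q20+/Q30+ base calls (Phred+33 assumed)."""
    bases = Counter()
    lengths = []
    q20 = q30 = 0
    for _, seq, _, qual in read_fastq(handle):
        bases.update(seq)
        lengths.append(len(seq))
        for ch in qual:
            q = ord(ch) - offset
            q20 += q >= 20
            q30 += q >= 30
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "max_len": max(lengths),
        "gc_pct": round(100 * (bases["G"] + bases["C"]) / total, 2),
        "q20": q20,
        "q30": q30,
    }

# 'I' is Q40, '5' is Q20 and '#' is Q2 under Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nII5#\n@r2\nGGCCNN\n+\nIIII55\n")
summary = summarize(demo)
```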

plot:

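Visualizes the output of the stats command; the figure can be saved as PNG or SVG.
Adding the -s option also prints the proportion of A, T, G, C, and N at each position in the terminal.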

fq2fa:

Convert a FASTQ file to FASTA format.
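
The conversion keeps the read name and sequence and drops the separator and quality lines. A minimal Python sketch, assuming four lines per record; the function names are made up for this example and this is not fqkit's implementation:

```python
import io

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def fastq_to_fasta(fq_handle, fa_handle):
    """Write each record as '>name' plus sequence; return the number of records written."""
    n = 0
    for header, seq, _, _ in read_fastq(fq_handle):
        fa_handle.write(">" + header[1:] + "\n" + seq + "\n")  # swap '@' for '>'
        n += 1
    return n

fq = io.StringIO("@r1 sample\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
fa = io.StringIO()
count = fastq_to_fasta(fq, fa)
fasta_text = fa.getvalue()
```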

barcode:

Demultiplex pooled-library sequencing data into individual samples by barcode sequence.

remove:

Remove reads from a FASTQ file by read name. The -n option gives a file of read names, one per line, without the leading @ prefix.
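
The name list works as a lookup set: a read is dropped when its name is in the set. The Python sketch below shows the idea. It assumes four lines per record and, as an assumption of this example, compares only the first whitespace-delimited token of the header; the function names are made up.

```python
import io

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def load_names(names_handle):
    """Read one name per line (no leading '@'), skipping blank lines."""
    return {line.strip() for line in names_handle if line.strip()}

def remove_reads(fq_handle, names):
    """Yield the records whose read id is NOT in the given set of names."""
    for rec in read_fastq(fq_handle):
        tokens = rec[0][1:].split()
        read_id = tokens[0] if tokens else ""
        if read_id not in names:
            yield rec

names = load_names(io.StringIO("r1\nr3\n"))
fq = io.StringIO("@r1\nAC\n+\nII\n@r2 lane1\nGT\n+\nII\n@r3\nTT\n+\nII\n")
kept = list(remove_reads(fq, names))
```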

merge:

Interleave paired-end (PE) reads into a single FASTQ file.
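
Interleaving means writing record 1 of R1, then record 1 of R2, then record 2 of R1, and so on. A minimal Python sketch, assuming four lines per record and both files listing mates in the same order; the function names are made up for this example:

```python
import io
from itertools import zip_longest

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def interleave(r1_handle, r2_handle):
    """Yield records alternately from R1 and R2; fail if the files differ in length."""
    for a, b in zip_longest(read_fastq(r1_handle), read_fastq(r2_handle)):
        if a is None or b is None:
            raise ValueError("R1 and R2 contain different numbers of records")
        yield a
        yield b

r1 = io.StringIO("@r1/1\nAC\n+\nII\n@r2/1\nGT\n+\nII\n")
r2 = io.StringIO("@r1/2\nTG\n+\nII\n@r2/2\nCA\n+\nII\n")
merged = list(interleave(r1, r2))
```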

split:

The inverse of the merge command: split an interleaved FASTQ file back into its two paired-end files.
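
Splitting an interleaved file sends records at even positions (0, 2, 4, ...) to R1 and records at odd positions to R2. A minimal Python sketch, assuming four lines per record; the function names are made up for this example:

```python
import io

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def deinterleave(handle):
    """Return (r1_records, r2_records) from an interleaved FASTQ stream."""
    r1, r2 = [], []
    for i, rec in enumerate(read_fastq(handle)):
        (r1 if i % 2 == 0 else r2).append(rec)
    if len(r1) != len(r2):
        raise ValueError("interleaved file holds an odd number of records")
    return r1, r2

interleaved = io.StringIO(
    "@r1/1\nAC\n+\nII\n@r1/2\nTG\n+\nII\n"
    "@r2/1\nGT\n+\nII\n@r2/2\nCA\n+\nII\n"
)
first, second = deinterleave(interleaved)
```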

gcplot:

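Report the GC content of a FASTQ file and plot it.
The -s option displays a histogram of the GC content distribution in the terminal.
The -o option sets the output file for the GC content table: the number and proportion of reads at each percentage from 0% to 100% GC.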
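
The table of reads per GC percentage can be sketched as a 101-bin histogram. The Python example below assumes four lines per record and rounds each read's GC percentage down to an integer. That rounding rule is an assumption of this example, and fqkit may bin differently; the function names are made up.

```python
import io

def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a FASTQ stream with 4 lines per record."""
    while True:
        lines = [handle.readline().rstrip("\r\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def gc_histogram(handle):
    """Return a list of (gc_percent, read_count, proportion) for 0..100."""
    counts = [0] * 101
    for _, seq, _, _ in read_fastq(handle):
        if not seq:
            continue  # skip empty sequences to avoid dividing by zero
        gc = sum(seq.count(base) for base in "GCgc")
        counts[100 * gc // len(seq)] += 1  # floor to an integer percentage
    total = sum(counts)
    return [(pct, n, n / total if total else 0.0) for pct, n in enumerate(counts)]

demo = io.StringIO(
    "@a\nACGT\n+\nIIII\n@b\nGGGG\n+\nIIII\n"
    "@c\nAATT\n+\nIIII\n@d\nGCAT\n+\nIIII\n"
)
hist = gc_histogram(demo)
```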

Posted by 天使不设防