[Linux] Commands for file content statistics (wc | sort | uniq | head)

1. Counting lines, words, and characters

The wc command reports line, word, character, and byte counts for a file.
Its help text is:

[chaoql@localhost c]$ wc --help
Usage: wc [OPTION]... [FILE]...
  or:  wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified.  With no FILE, or when FILE is -,
read standard input.  A word is a non-zero-length sequence of characters
delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
      --files0-from=F    read input from the files specified by
                           NUL-terminated names in file F;
                           If F is - then read names from standard input
  -L, --max-line-length  print the length of the longest line
  -w, --words            print the word counts
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'wc invocation'

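Where:
-c prints the number of bytes in the file
-m prints the number of characters
-l prints the number of lines (newline characters)
-L prints the length of the longest line
-w prints the number of words; a word is a run of characters delimited by whitespace
A demonstration: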

[chaoql@localhost c]$ cat hello.txt 
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
[chaoql@localhost c]$ wc -c hello.txt 
78 hello.txt
[chaoql@localhost c]$ wc -m hello.txt 
78 hello.txt
[chaoql@localhost c]$ wc -l hello.txt 
6 hello.txt
[chaoql@localhost c]$ wc -L hello.txt 
12 hello.txt
[chaoql@localhost c]$ wc -w hello.txt 
12 hello.txt

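Both wc -c and wc -m return 78, so the file contains 78 bytes and 78 characters (the two counts agree because the content is plain ASCII). wc -L returns 12, meaning the longest line is 12 characters long, and there are 6 lines, so 12 × 6 = 72. That differs from the 78 reported by wc -m because wc -m also counts the newline at the end of each line, while wc -L does not: 12 × 6 + 6 = 78.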
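The arithmetic above can be checked with a small script. This is a minimal sketch, assuming GNU wc (the -L option is a GNU extension); the temporary file and the variable names are hypothetical and only serve the illustration.

```shell
# Recreate the six-line sample in a temporary file (hypothetical location).
tmp=$(mktemp)
for i in 1 2 3 4 5 6; do echo 'Hello world!'; done > "$tmp"

bytes=$(wc -c < "$tmp")     # byte count, newlines included
lines=$(wc -l < "$tmp")     # number of newline characters
longest=$(wc -L < "$tmp")   # longest line, newline excluded

# Every line has the same length here, so
# longest * lines + lines should equal the byte count.
total=$((longest * lines + lines))
echo "$total = $((bytes))"
rm -f "$tmp"
```

Reading from standard input (with <) keeps the file name out of the output, so each variable holds only the number.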

2. Sorting file contents

The sort command sorts the lines of a file.
Its help text is:

[chaoql@localhost c]$ sort --help 
Usage: sort [OPTION]... [FILE]...
  or:  sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.
Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order      consider only blanks and alphanumeric characters
  -f, --ignore-case           fold lower case to upper case characters
  -g, --general-numeric-sort  compare according to general numerical value
  -i, --ignore-nonprinting    consider only printable characters
  -M, --month-sort            compare (unknown) < 'JAN' < ... < 'DEC'
  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)
  -n, --numeric-sort          compare according to string numerical value
  -R, --random-sort           sort by random hash of keys
      --random-source=FILE    get random bytes from FILE
  -r, --reverse               reverse the result of comparisons
      --sort=WORD             sort according to WORD:
                                general-numeric -g, human-numeric -h, month -M,
                                numeric -n, random -R, version -V
  -V, --version-sort          natural sort of (version) numbers within text

Other options:

      --batch-size=NMERGE   merge at most NMERGE inputs at once;
                            for more use temp files
  -c, --check, --check=diagnose-first  check for sorted input; do not sort
  -C, --check=quiet, --check=silent  like -c, but do not report first bad line
      --compress-program=PROG  compress temporaries with PROG;
                              decompress them with PROG -d
      --debug               annotate the part of the line used to sort,
                              and warn about questionable usage to stderr
      --files0-from=F       read input from the files specified by
                            NUL-terminated names in file F;
                            If F is - then read names from standard input
  -k, --key=KEYDEF          sort via a key; KEYDEF gives location and type
  -m, --merge               merge already sorted files; do not sort
  -o, --output=FILE         write result to FILE instead of standard output
  -s, --stable              stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE    use SIZE for main memory buffer
  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                              multiple options specify multiple directories
      --parallel=N          change the number of sorts run concurrently to N
  -u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run
  -z, --zero-terminated     end lines with 0 byte, not newline
      --help     display this help and exit
      --version  output version information and exit

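Only four commonly used options are covered here:
-n sorts by numeric value
-r reverses the sort order
-k sorts by the specified column (field)
-o writes the result to a file
A demonstration: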

[chaoql@localhost c]$ cat num.txt 
3
2
9
10
1
[chaoql@localhost c]$ sort num.txt 
1
10
2
3
9
[chaoql@localhost c]$ sort -n num.txt 
1
2
3
9
10
[chaoql@localhost c]$ sort -nr num.txt 
10
9
3
2
1
[chaoql@localhost c]$ cat num2.txt 
aa 9
ax 1
bc 2
dd 7
xc 15
[chaoql@localhost c]$ sort -n num2.txt 
aa 9
ax 1
bc 2
dd 7
xc 15
[chaoql@localhost c]$ sort -k 2 -n num2.txt 
ax 1
bc 2
dd 7
aa 9
xc 15
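One detail worth knowing about -k: written as -k 2, the key starts at field 2 and runs to the end of the line. Writing -k2,2 limits the key to the second field alone, and a trailing n or r applies numeric or reverse ordering to just that key. The sketch below reuses the rows from num2.txt; the variable names are hypothetical.

```shell
# Sample rows copied from num2.txt above.
data='aa 9
ax 1
bc 2
dd 7
xc 15'

# -k2,2 restricts the key to field 2; the trailing n compares it numerically.
asc=$(printf '%s\n' "$data" | sort -k2,2n)
# Adding r reverses the comparison for that key.
desc=$(printf '%s\n' "$data" | sort -k2,2nr)

printf 'ascending:\n%s\n' "$asc"
printf 'descending:\n%s\n' "$desc"
```

With a single numeric column both spellings give the same result; the restricted form matters once a line has more fields after the key.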

To write the sorted result back to the original file:

[chaoql@localhost c]$ cat hello.txt 
Hello world!
Hello world!
abc
Hello world!
Hello world!
[chaoql@localhost c]$ sort -n hello.txt -o hello.txt 
[chaoql@localhost c]$ cat hello.txt 
abc
Hello world!
Hello world!
Hello world!
Hello world!
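The -o option is the safe way to sort a file in place: sort reads all of its input before it opens the output file. A shell redirection such as sort file > file does not work, because the shell truncates the file before sort gets to read it. A minimal sketch, using a hypothetical temporary file:

```shell
# Hypothetical scratch file holding the numbers from num.txt.
f=$(mktemp)
printf '%s\n' 3 2 9 10 1 > "$f"

# Safe in-place sort: input and output may be the same file with -o.
sort -n -o "$f" "$f"

result=$(cat "$f")
printf '%s\n' "$result"
rm -f "$f"
```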

3. Checking for duplicate lines

The uniq command detects duplicate lines in a file.
Its help text is:

[chaoql@localhost c]$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).

With no options, matching lines are merged to the first occurrence.

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines, one for each group
  -D, --all-repeated[=METHOD]  print all duplicate lines
                          groups can be delimited with an empty line
                          METHOD={none(default),prepend,separate}
  -f, --skip-fields=N   avoid comparing the first N fields
      --group[=METHOD]  show all items, separating groups with an empty line
                          METHOD={separate(default),prepend,append,both}
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated  end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'

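Only two commonly used options are covered here:
-c prefixes each line with the number of times it occurs
-u prints only the lines that are not repeated
A demonstration: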

[chaoql@localhost c]$ cat hello.txt 
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
abc
abc
bcde
[chaoql@localhost c]$ uniq hello.txt 
Hello world!
abc
bcde
[chaoql@localhost c]$ uniq -c hello.txt 
      6 Hello world!
      2 abc
      1 bcde
[chaoql@localhost c]$ uniq -u hello.txt 
bcde

Note the following pitfall, however:

[chaoql@localhost c]$ cat hello.txt 
Hello world!
Hello world!
abc
Hello world!
Hello world!
[chaoql@localhost c]$ uniq hello.txt 
Hello world!
abc
Hello world!

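Intuitively, uniq hello.txt should not print the third line, since it duplicates the first. It does so because uniq only compares adjacent lines, as the note in its help text says. The fix is simple: sort the file first, then remove duplicates from the sorted result: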

[chaoql@localhost c]$ sort hello.txt | uniq
abc
Hello world!
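The same sort-then-uniq pattern extends naturally to counting how often each line occurs. The sketch below is an illustration under assumptions: LC_ALL=C is set so that the ordering is byte-wise and does not depend on the locale (which is why "Hello world!" comes before "abc" here, unlike in the transcript above), and the variable names are hypothetical.

```shell
# The unsorted sample from the pitfall above.
input='Hello world!
Hello world!
abc
Hello world!
Hello world!'

# Sorting first makes equal lines adjacent, so uniq can merge them.
dedup=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)

# uniq -c adds occurrence counts; a second, numeric reverse sort
# puts the most frequent line first.
counts=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq -c | sort -nr)

printf '%s\n' "$dedup"
printf '%s\n' "$counts"
```

As the uniq help text points out, sort -u achieves the plain deduplication in a single command.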

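4. Reading the first N lines of a file

The head command prints the first lines of a file.
By default head prints the first ten lines, but the count can be changed: head -3 hello.txt prints the first three lines of hello.txt (the same as head -n 3 hello.txt).
A demonstration: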

[chaoql@localhost c]$ cat hello.txt 
Hello world!1
Hello world!2
Hello world!3
Hello world!4
Hello world!5
Hello world!6
Hello world!7
Hello world!8
Hello world!9
Hello world!10
Hello world!11
Hello world!12
[chaoql@localhost c]$ head hello.txt 
Hello world!1
Hello world!2
Hello world!3
Hello world!4
Hello world!5
Hello world!6
Hello world!7
Hello world!8
Hello world!9
Hello world!10
[chaoql@localhost c]$ head -3 hello.txt 
Hello world!1
Hello world!2
Hello world!3

Combining head with sort yields the top N entries of a file. Here the three smallest values are printed:

[chaoql@localhost c]$ cat num.txt 
3
2
9
10
1
[chaoql@localhost c]$ sort -n num.txt | head -3
1
2
3
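To get the three largest values instead, reversing the numeric sort with -r before piping into head should be enough. A minimal sketch with the same numbers as num.txt; the variable names are hypothetical.

```shell
# The numbers from num.txt above.
nums=$(printf '%s\n' 3 2 9 10 1)

# Sort numerically in descending order, then keep the first three lines.
top3=$(printf '%s\n' "$nums" | sort -nr | head -n 3)
printf '%s\n' "$top3"
```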

posted @ 2023-01-13 12:17  ccql