[Linux] Commands for file content statistics (wc | sort | uniq | head)
1. Counting lines, words, and characters
The command for counting lines, words, and characters is wc. Its help text is:
[chaoql@localhost c]$ wc --help
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. With no FILE, or when FILE is -,
read standard input. A word is a non-zero-length sequence of characters
delimited by white space.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'wc invocation'
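Where:
-c prints the number of bytes in the file
-m prints the number of characters
-l prints the number of lines (newline characters)
-L prints the length of the longest line
-w prints the number of words; words are separated by whitespace (spaces, tabs, or newlines)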
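As a small illustration of the whitespace rule for -w, the sketch below feeds wc a string in which the words are separated by a double space, a tab, and newlines. The input string and the variable name words are made up for this example.

```shell
# Words are maximal runs of non-whitespace characters, so repeated
# spaces, tabs, and newlines all act as separators.
# The input below is illustrative: "a", "b", "c", "d".
words=$(printf 'a  b\tc\nd\n' | wc -w)
echo "$words"   # prints 4
```

Because wc reads standard input when no file is given, it can be placed at the end of any pipeline in the same way.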
Demonstration:
[chaoql@localhost c]$ cat hello.txt
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
[chaoql@localhost c]$ wc -c hello.txt
78 hello.txt
[chaoql@localhost c]$ wc -m hello.txt
78 hello.txt
[chaoql@localhost c]$ wc -l hello.txt
6 hello.txt
[chaoql@localhost c]$ wc -L hello.txt
12 hello.txt
[chaoql@localhost c]$ wc -w hello.txt
12 hello.txt
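Both wc -c and wc -m return 78, so the file contains 78 characters (and, because the text is pure ASCII, 78 bytes). wc -L returns 12, meaning the longest line has 12 characters, and there are 6 lines, so 12 × 6 = 72. This differs from the value returned by wc -m because wc -m also counts the newline character at the end of each line, while wc -L does not. Adding one newline per line gives 12 × 6 + 6 = 78.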
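The arithmetic above can be checked with a short script. This is a minimal sketch that recreates the six-line file in a temporary location (the variable names are made up for this example) and assumes GNU wc, since -L is a GNU extension.

```shell
# Recreate the six-line sample file in a temporary location.
f=$(mktemp)
for i in 1 2 3 4 5 6; do echo 'Hello world!'; done > "$f"

lines=$(wc -l < "$f")     # number of newline characters
longest=$(wc -L < "$f")   # longest line, newline excluded (GNU extension)
chars=$(wc -m < "$f")     # characters, newlines included

# Every line has the same length here, so:
# characters = longest * lines + one newline per line
expected=$((longest * lines + lines))
echo "$chars = $longest * $lines + $lines"   # prints 78 = 12 * 6 + 6
rm -f "$f"
```

Reading the file through a redirect (wc -l < file) makes wc print only the number, without the file name, which is convenient when the value is stored in a variable.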
2. Sorting file contents
The command for sorting file contents is sort. Its help text is:
[chaoql@localhost c]$ sort --help
Usage: sort [OPTION]... [FILE]...
or: sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
-n, --numeric-sort compare according to string numerical value
-R, --random-sort sort by random hash of keys
--random-source=FILE get random bytes from FILE
-r, --reverse reverse the result of comparisons
--sort=WORD sort according to WORD:
general-numeric -g, human-numeric -h, month -M,
numeric -n, random -R, version -V
-V, --version-sort natural sort of (version) numbers within text
Other options:
--batch-size=NMERGE merge at most NMERGE inputs at once;
for more use temp files
-c, --check, --check=diagnose-first check for sorted input; do not sort
-C, --check=quiet, --check=silent like -c, but do not report first bad line
--compress-program=PROG compress temporaries with PROG;
decompress them with PROG -d
--debug annotate the part of the line used to sort,
and warn about questionable usage to stderr
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
--parallel=N change the number of sorts run concurrently to N
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit
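Only four commonly used options are covered here:
-n sorts by numeric value
-r reverses the sort order
-k sorts by the specified column (key)
-o writes the result to a file
Demonstration: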
[chaoql@localhost c]$ cat num.txt
3
2
9
10
1
[chaoql@localhost c]$ sort num.txt
1
10
2
3
9
[chaoql@localhost c]$ sort -n num.txt
1
2
3
9
10
[chaoql@localhost c]$ sort -nr num.txt
10
9
3
2
1
[chaoql@localhost c]$ cat num2.txt
aa 9
ax 1
bc 2
dd 7
xc 15
[chaoql@localhost c]$ sort -n num2.txt
aa 9
ax 1
bc 2
dd 7
xc 15
[chaoql@localhost c]$ sort -k 2 -n num2.txt
ax 1
bc 2
dd 7
aa 9
xc 15
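The -k option can be combined with -t (listed in the help text above) when the columns are separated by something other than blanks. The sketch below uses a comma-separated variant of num2.txt that is made up for this example. It writes the key as -k2,2n so that the key starts and ends at field 2 and is compared numerically.

```shell
# Hypothetical comma-separated variant of num2.txt.
f=$(mktemp)
printf '%s\n' 'aa,9' 'ax,1' 'bc,2' 'dd,7' 'xc,15' > "$f"

# -t,     use a comma as the field separator
# -k2,2n  sort on field 2 only, comparing numerically
# LC_ALL=C keeps the result independent of the current locale.
sorted=$(LC_ALL=C sort -t, -k2,2n "$f")
echo "$sorted"
rm -f "$f"
```

This should print ax,1, bc,2, dd,7, aa,9, xc,15 on separate lines, matching the order of the blank-separated example. Note that a bare -k 2 means "from field 2 to the end of the line", so giving the end field explicitly is the safer habit when a file has more than two columns.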
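To write the sorted result back to the original file, pass the same file name to -o (the -n here is not needed for text lines: they all compare as 0 numerically, and sort falls back to comparing the whole lines):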
[chaoql@localhost c]$ cat hello.txt
Hello world!
Hello world!
abc
Hello world!
Hello world!
[chaoql@localhost c]$ sort -n hello.txt -o hello.txt
[chaoql@localhost c]$ cat hello.txt
abc
Hello world!
Hello world!
Hello world!
Hello world!
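Two details of the in-place sort are worth a sketch. First, sort -o reads all of its input before it opens the output file, so using the same name for both is safe, whereas sort file > file would let the shell truncate the file before sort reads it. Second, the order of letters depends on the locale: the transcript above presumably ran under a locale such as en_US.UTF-8 (an assumption), where abc sorts before Hello world!. Under LC_ALL=C the comparison is by byte value and uppercase letters come first. The temporary file below is made up for this example.

```shell
# Recreate the unsorted sample file in a temporary location.
f=$(mktemp)
printf '%s\n' 'Hello world!' 'Hello world!' 'abc' 'Hello world!' 'Hello world!' > "$f"

# Sort in place; -o may safely name the input file.
# LC_ALL=C forces plain byte-order comparison ('H' is 0x48, 'a' is 0x61).
LC_ALL=C sort -o "$f" "$f"

first=$(head -n 1 "$f")
last=$(tail -n 1 "$f")
echo "first: $first"   # prints first: Hello world!
echo "last: $last"     # prints last: abc
rm -f "$f"
```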
3. Detecting duplicate lines
The command for detecting duplicate lines is uniq. Its help text is:
[chaoql@localhost c]$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D, --all-repeated[=METHOD] print all duplicate lines
groups can be delimited with an empty line
METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line
METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated end lines with 0 byte, not newline
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit
A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters. Fields are skipped before chars.
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
For complete documentation, run: info coreutils 'uniq invocation'
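Only two commonly used options are covered here:
-c prefixes each line with the number of times it occurs consecutively
-u prints only the lines that are not repeated
Demonstration: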
[chaoql@localhost c]$ cat hello.txt
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
Hello world!
abc
abc
bcde
[chaoql@localhost c]$ uniq hello.txt
Hello world!
abc
bcde
[chaoql@localhost c]$ uniq -c hello.txt
6 Hello world!
2 abc
1 bcde
[chaoql@localhost c]$ uniq -u hello.txt
bcde
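The counterpart of -u is -d, which appears in the help text above and prints one copy of each repeated line. The sketch below runs both on a shortened version of the sample file (the file contents and variable names are made up for this example).

```shell
# Shortened sample: two repeated groups and one single line.
f=$(mktemp)
printf '%s\n' 'Hello world!' 'Hello world!' 'abc' 'abc' 'bcde' > "$f"

dups=$(uniq -d "$f")      # one copy of each line that repeats
singles=$(uniq -u "$f")   # only the lines that do not repeat

echo "repeated lines:"; echo "$dups"
echo "single lines:";   echo "$singles"
rm -f "$f"
```

Here -d should report Hello world! and abc, while -u reports only bcde.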
Note the following pitfall, however:
[chaoql@localhost c]$ cat hello.txt
Hello world!
Hello world!
abc
Hello world!
Hello world!
[chaoql@localhost c]$ uniq hello.txt
Hello world!
abc
Hello world!
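Intuitively, uniq hello.txt should not print the third line of its output, since it duplicates the first. The cause is that uniq only compares adjacent lines, so repeated lines that are separated by other content are not detected. The fix is simple: sort the file first, then deduplicate the sorted result: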
[chaoql@localhost c]$ sort hello.txt | uniq
abc
Hello world!
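The same sort-then-uniq pattern extends to counting how often each line occurs. The following is a minimal sketch on the sample data above; the temporary file and variable names are made up for this example.

```shell
# Recreate the sample file with non-adjacent duplicates.
f=$(mktemp)
printf '%s\n' 'Hello world!' 'Hello world!' 'abc' 'Hello world!' 'Hello world!' > "$f"

# sort groups identical lines, uniq -c counts each group,
# and sort -nr puts the most frequent line first.
top=$(sort "$f" | uniq -c | sort -nr | head -n 1)
echo "$top"        # count 4 followed by: Hello world!

# sort -u deduplicates in one step, as the uniq help text suggests.
distinct=$(sort -u "$f" | wc -l)
echo "$distinct"   # prints 2
rm -f "$f"
```

uniq -c pads the count with leading spaces, so the first line of output is the number 4 followed by the line text.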
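4. Reading the first N lines of a file
The command for reading the first N lines of a file is head. By default head prints the first ten lines, but this can be changed with an option: for example, head -3 hello.txt prints the first three lines of hello.txt.
Demonstration: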
[chaoql@localhost c]$ cat hello.txt
Hello world!1
Hello world!2
Hello world!3
Hello world!4
Hello world!5
Hello world!6
Hello world!7
Hello world!8
Hello world!9
Hello world!10
Hello world!11
Hello world!12
[chaoql@localhost c]$ head hello.txt
Hello world!1
Hello world!2
Hello world!3
Hello world!4
Hello world!5
Hello world!6
Hello world!7
Hello world!8
Hello world!9
Hello world!10
[chaoql@localhost c]$ head -3 hello.txt
Hello world!1
Hello world!2
Hello world!3
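The short form head -3 works with GNU head. The form specified by POSIX is head -n 3, which is the safer choice in scripts. The sketch below recreates the twelve-line file in a temporary location (the variable names are made up for this example) and checks both the default and the explicit line count.

```shell
# Recreate the twelve-line sample file.
f=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do echo "Hello world!$i"; done > "$f"

first3=$(head -n 3 "$f")               # explicit count
default_count=$(head "$f" | wc -l)     # default is ten lines

echo "$first3"
echo "$default_count"   # prints 10
rm -f "$f"
```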
Combining head with sort prints the top N entries of a file. Here the sort is ascending, so the result is the three smallest values:
[chaoql@localhost c]$ cat num.txt
3
2
9
10
1
[chaoql@localhost c]$ sort -n num.txt | head -3
1
2
3
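To take the largest values instead of the smallest, the sort can be reversed with -r before head is applied. A minimal sketch on the same numbers, with a temporary file and variable names made up for this example:

```shell
# Recreate num.txt in a temporary location.
f=$(mktemp)
printf '%s\n' 3 2 9 10 1 > "$f"

smallest=$(sort -n "$f" | head -n 3)    # ascending: 1 2 3
largest=$(sort -nr "$f" | head -n 3)    # descending: 10 9 3

echo "smallest:"; echo "$smallest"
echo "largest:";  echo "$largest"
rm -f "$f"
```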