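26. Script for Fetching a Daily Quote

After a quick look, the website used in the book is still reachable, so I will follow the book's example first.

Chinese daily-quote sources, as far as I can see, are pre-written lists. For those you could generate a random number within the size of the list, use sed or gawk to pick the quote at that position (each quote ends with a full stop or an exclamation mark), and print it to the screen.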
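The random-pick idea can be sketched roughly like this. It is only a sketch under assumptions: the quotes are stored one per line in a plain-text file (a temporary file stands in for it here), and the names `random_quote`, `quotes_file` and `ending` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical quotes file, one quote per line; the last line is not a full
# sentence and should never be picked.
quotes_file=$(mktemp)
cat > "$quotes_file" <<'EOF'
We don't see things as they are, we see things as we are.
Setting a good example for children takes all the fun out of middle age.
Never give up!
(to be continued
EOF

# Matches lines ending in a full stop or exclamation mark, ASCII or Chinese.
ending='(\.|!|。|！)$'

# Print one randomly chosen quote from the file given as $1.
random_quote() {
    total=$(grep -cE "$ending" "$1")
    [ "$total" -gt 0 ] || return 1
    # $RANDOM (bash) is 0..32767, so pick becomes a line number 1..total.
    pick=$(( RANDOM % total + 1 ))
    grep -E "$ending" "$1" | sed -n "${pick}p"
}

random_quote "$quotes_file"
```

Filtering first and numbering afterwards means the random number always lands on a complete quote, at the cost of running grep twice.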

The wget command

Downloading a web page

[root@tzPC 26Unit]# wget www.quotationspage.com/qotd.html
--2020-09-11 08:06:29--  http://www.quotationspage.com/qotd.html
Resolving www.quotationspage.com (www.quotationspage.com)... 74.208.47.119
Connecting to www.quotationspage.com (www.quotationspage.com)|74.208.47.119|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘qotd.html’
    [  <=>                                         ] 13,360      48.4KB/s   in 0.3s   
2020-09-11 08:06:30 (48.4 KB/s) - ‘qotd.html’ saved [13360]

[root@tzPC 26Unit]# ls
mu.sh  qotd.html

The -o option saves wget's log messages to a file instead of printing them on the terminal

[root@tzPC 26Unit]# url=www.quotationspage.com/qotd.html
[root@tzPC 26Unit]# wget -o quote.log $url
[root@tzPC 26Unit]# ls
mu.sh  qotd.html  quote.log

The -O option sets the name of the file the downloaded page is saved to

[root@tzPC 26Unit]# wget -o quote.log -O Daily_quote.html $url
#The page is saved as Daily_quote.html and the log as quote.log

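The --no-cookies option disables the use of cookies

By default, wget does not store cookies.

Testing whether a web address is valid

The wget --spider option (as in a web spider: wget only checks that the page exists and does not download it)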

[root@tzPC 26Unit]# wget --spider $url
Spider mode enabled. Check if remote file exists.
--2020-09-11 08:20:10--  http://www.quotationspage.com/qotd.html
Resolving www.quotationspage.com (www.quotationspage.com)... 74.208.47.119
Connecting to www.quotationspage.com (www.quotationspage.com)|74.208.47.119|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.

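The -nv option (short for non-verbose)

Note that the OK at the end of the line does not by itself mean the address is valid. It is the status of the final response wget received. To judge validity, check whether the URL printed on the line is the same as the one you requested.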

[root@tzPC 26Unit]# wget -nv --spider $url
2020-09-11 08:22:23 URL: http://www.quotationspage.com/qotd.html 200 OK

Here is a request for an invalid address. Asking for robots.txt under Baidu's /search/ path comes back as error.html, and the line still ends in OK.

[root@tzPC 26Unit]# url=https://www.baidu.com/search/robots.txt
[root@tzPC 26Unit]# wget -nv --spider $url
2020-09-11 08:26:43 URL: https://www.baidu.com/search/error.html 200 OK
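One way to act on this is to compare the address that came back with the address that was sent. The sketch below works on captured output lines, so it runs without network access. It assumes the `DATE TIME URL: <url> <status>` layout shown in the transcripts above, and the function names `final_url` and `same_url` are hypothetical.

```shell
#!/bin/bash
# Extract the final URL from one line of `wget -nv --spider` output,
# assuming the "DATE TIME URL: <url> <status>" layout shown above.
final_url() {
    printf '%s\n' "$1" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "URL:") { print $(i + 1); exit } }'
}

# Succeed only when wget ended up at the address that was requested.
# $1 = requested URL (including the scheme), $2 = wget output line.
same_url() {
    [ "$(final_url "$2")" = "$1" ]
}

sample='2020-09-11 08:26:43 URL: https://www.baidu.com/search/error.html 200 OK'
if same_url 'https://www.baidu.com/search/robots.txt' "$sample"; then
    echo "address is valid"
else
    echo "redirected to $(final_url "$sample")"
fi
```

The requested URL has to be written with its scheme (http:// or https://), because wget prints it that way. A harmless redirect, for example from http to https, would also count as a mismatch here, so treat the result as a hint rather than proof.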

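String parameter expansion

Search the string stored in check_url: ${check_url/*error*/error404} replaces the whole value with error404 when it contains error, and leaves the value unchanged otherwise.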

[root@tzPC 26Unit]# echo $url
https://www.baidu.com/search/robots.txt
[root@tzPC 26Unit]# check_url=$(wget -nv --spider $url 2>&1)
[root@tzPC 26Unit]# echo ${check_url/*error*/error404}
error404
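#Note: in my case the Baidu robots.txt request comes back as error.html, so the string between the * wildcards is error. In the book the returned address contains error404. Choose the pattern according to what the site actually returns.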
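The same test can be written as a small function, which makes it easy to change the marker per site. This is a sketch on captured output lines, not the book's code: `is_bad_url` and its optional second argument are names invented for this example.

```shell
#!/bin/bash
# Return success when the wget output contains the given marker.
# The default marker "error" matches the error.html redirect seen above;
# per the book, its site needs "error404" instead.
is_bad_url() {
    local output=$1 marker=${2:-error}
    case $output in
        *"$marker"*) return 0 ;;
        *)           return 1 ;;
    esac
}

good='2020-09-11 08:22:23 URL: http://www.quotationspage.com/qotd.html 200 OK'
bad='2020-09-11 08:26:43 URL: https://www.baidu.com/search/error.html 200 OK'

for out in "$good" "$bad"; do
    if is_bad_url "$out"; then echo "bad"; else echo "ok"; fi
done

# The expansion form: the value collapses to error404 only when it matches.
echo "${bad/*error*/error404}"
echo "${good/*error*/error404}"
```

The loop prints ok and then bad. The last two lines print error404 for the bad address and the untouched output line for the good one.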

Removing all HTML tags

sed 's/<[^>]*//g' quote.html
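To see what this pattern does and does not remove, here it is on a single line. The HTML fragment is made up for illustration, and `strip_tags` is a hypothetical helper showing a variant, not something taken from the book.

```shell
#!/bin/bash
# A made-up HTML fragment, only for illustration.
sample='<dt class="quote"><a href="/quote/1.html">We see things as we are.</a></dt>'

# The pattern used above: removes "<..." but leaves each closing ">" behind.
printf '%s\n' "$sample" | sed 's/<[^>]*//g'

# Variant that also consumes the closing ">", so no clean-up pass is needed.
strip_tags() {
    sed 's/<[^>]*>//g'
}
printf '%s\n' "$sample" | strip_tags
```

The first command prints `>>We see things as we are.>>`, which is why a later step has to delete the > characters. The variant prints the bare sentence. Neither form handles a tag that is split across two lines.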

Use grep to match the date in the text, and the -A2 option to pull in the two lines that follow it.

[root@tzPC 26Unit]# sed 's/<[^>]*//g' qotd.html |
> grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2

>>Selected from Michael Moncur's Collection of Quotations - September 10, 2020>>
>>>Setting a good example for children takes all the fun out of middle age.> >>>>>>>>>>>>>>>>>>William Feather> (1908 - 1976)> &nbsp; >>>
>>We don't see things as they are, we see things as we are.> >>>>>>>>>>>>>>>>>>Anais Nin> (1903 - 1977)> &nbsp; >>>

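Note: here it is September 11, 2020, but in US time it is still September 10, 2020. Because of the time difference, the date to search for is yesterday's.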
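The date string can be built in a slightly more defensive way. The sketch below assumes GNU date. Both function names are hypothetical, and the time zone in `site_today` is a guess, not something the site states.

```shell
#!/bin/bash
# Print the date to search for: the day before the reference date,
# formatted like "September 10, 2020". Requires GNU date.
# LC_ALL=C forces English month names even on a non-English system.
quote_date() {
    ref=${1:-now}
    LC_ALL=C date -d "$ref -1 day" +'%B %-d, %Y'
}

# Alternative: ask for "today" in the site's own time zone instead of
# relying on "yesterday". The default zone is an assumption.
site_today() {
    LC_ALL=C TZ=${1:-America/New_York} date +'%B %-d, %Y'
}

quote_date 2020-09-11
```

With the fixed reference date the function prints September 10, 2020. Without an argument it falls back to the current time, which matches the `date -d"yesterday"` call used above.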

Removing the > characters

[root@tzPC 26Unit]# sed 's/<[^>]*//g' qotd.html |
> grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2 |
> sed 's/>//g'

Selected from Michael Moncur's Collection of Quotations - September 10, 2020
Setting a good example for children takes all the fun out of middle age. William Feather (1908 - 1976) &nbsp; 
We don't see things as they are, we see things as we are. Anais Nin (1903 - 1977) &nbsp;

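Removing the extra quote

First locate the line containing the non-breaking space entity &nbsp;, then delete the line that follows it.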

[root@tzPC 26Unit]# sed 's/<[^>]*//g' qotd.html |
> grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2 |
> sed 's/>//g' |
> sed '/&nbsp;/{n ; d }'
Selected from Michael Moncur's Collection of Quotations - September 10, 2020
Setting a good example for children takes all the fun out of middle age. William Feather (1908 - 1976) &nbsp;
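The `n ; d` pair is easier to follow on a tiny input. The three lines below only imitate the shape of the real output, with shortened placeholder text.

```shell
#!/bin/bash
# Three lines shaped like the output above (texts are placeholders).
sample='Selected from a collection - September 10, 2020
First quote. First Author &nbsp;
Second quote. Second Author &nbsp;'

# On a line containing "&nbsp;": print it and read the next line (n),
# then delete that next line (d).
printf '%s\n' "$sample" | sed '/&nbsp;/{n;d;}'
```

This prints the header and the first quote only. It works here because grep -A2 yields exactly three lines. With more quotes, every second matching line would be dropped, not just the last one.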

Removing the trailing &nbsp;

[root@tzPC 26Unit]# sed 's/<[^>]*//g' qotd.html |
> grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2 |
> sed 's/>//g' |
> sed '/&nbsp;/{n ; d }' |
> gawk 'BEGIN{FS="&nbsp;"}  {print $1}'
Selected from Michael Moncur's Collection of Quotations - September 10, 2020
Setting a good example for children takes all the fun out of middle age. William Feather (1908 - 1976)

Saving the quote to a file with the tee command

[root@tzPC 26Unit]# sed 's/<[^>]*//g' qotd.html |
> grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2 |
> sed 's/>//g' |
> sed '/&nbsp;/{n ; d }' |
> gawk 'BEGIN{FS="&nbsp;"}  {print $1}' |
> tee daily_quote.txt > /dev/null
[root@tzPC 26Unit]# cat daily_quote.txt 
Selected from Michael Moncur's Collection of Quotations - September 10, 2020
Setting a good example for children takes all the fun out of middle age. William Feather (1908 - 1976)
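Since the page changes every day, it can help to try the whole pipeline on a fixed sample first. The following sketch wraps the steps above in a function that takes the date as an argument. The HTML is invented and much simpler than the real page, `extract_quote` and `sample_html` are hypothetical names, and plain awk stands in for gawk so that it also runs where gawk is not installed.

```shell
#!/bin/bash
# Invented sample page, far simpler than the real qotd.html.
sample_html() {
cat <<'EOF'
<html><body>
<b>Selected from a Collection of Quotations - September 10, 2020</b>
<dl><dt>First quote.</dt> <dd><b>First Author</b> &nbsp; </dd>
<dt>Second quote.</dt> <dd><b>Second Author</b> &nbsp; </dd></dl>
</body></html>
EOF
}

# Read HTML on stdin and print the header line plus the first quote.
# $1 is the date string to look for, e.g. "September 10, 2020".
extract_quote() {
    sed 's/<[^>]*//g' |
    grep -A2 "$1" |
    sed 's/>//g' |
    sed '/&nbsp;/{n;d;}' |
    awk 'BEGIN { FS = "&nbsp;" } { sub(/[ \t]+$/, "", $1); print $1 }'
}

sample_html | extract_quote "September 10, 2020"
```

On the real page the call would look like `extract_quote "$(date -d yesterday +'%B %-d, %Y')" < qotd.html`. The only change to the steps is the `sub()` call, which trims the space left in front of &nbsp;.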

All the pieces are now in place. The complete script is as follows:

[root@tzPC 26Unit]# cat daily_quote.sh
#!/bin/bash
#Get a Daily Inspirational Quote
#Script Variables
quote_url=www.quotationspage.com/qotd.html
#Check url validity
check_url=$(wget -nv --spider "$quote_url" 2>&1)
#check_url holds wget's whole output line, so look for error404 anywhere in it
if [[ $check_url == *error404* ]]
then
    echo "Bad web address"
    echo "$quote_url invalid"
    echo "Exiting script..."
    exit 1
fi
#Download Web Site's Information
wget -o /tmp/quote.log -O /tmp/quote.html $quote_url
#Extract the Desired Data
sed 's/<[^>]*//g' /tmp/quote.html |
grep "$(date -d"yesterday" +%B' '%-d,' '%Y)" -A2 |
sed 's/>//g' |
sed '/&nbsp;/{n ; d }' |
gawk 'BEGIN{FS="&nbsp;"}  {print $1}' |
tee /tmp/daily_quote.txt > /dev/null
cat /tmp/daily_quote.txt
exit

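Learned from: 《Linux命令行与Shell脚本大全第3版》 (Linux Command Line and Shell Scripting Bible, 3rd Edition), Chapter 26

The example on P587 is dated 2015-09-24, and this third edition was published in August 2016. That suggests the 600-plus pages may have been translated in less than half a year, which must have been a heavy workload. Many thanks to the two translators!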

posted @ 2020-09-11 10:10 努力吧阿团
