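Fetching web page data with a CentOS shell script and storing it in MySQL and Redis
Contents
Components:
Script file: spider.sh
Configuration file: stcn_kuaixun.cnf
Function file: skincare.fn
Environment:
Redis 5.0.10, MySQL 8.0.22, CentOS 7, dos2unix, lsof
Overall design:
1 Base variables and functions (spider.sh provides the basic execution framework; stcn_kuaixun.cnf is the link between spider.sh and skincare.fn)
2 Overridable functions (code that overrides or adds to the functions of spider.sh goes in skincare.fn)
Sample run: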
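The override idea in the two points above can be tried in isolation. The following is a minimal sketch, assuming GNU sed and bash; the Greet function, the Greet_2.x labels, the ExtractLabelBlock helper and the temporary file are hypothetical and exist only for this illustration. It extracts one labeled block with the same sed range expression that spider.sh uses and applies it with eval, which replaces the base function.

```shell
#!/bin/bash
# Sketch of the label-based override mechanism (all names here are illustrative).

# Base implementation, as the framework script would define it.
Greet() {
    echo "base: $1"
}

# A throwaway function file with two labeled blocks separated by a blank line.
demo_fn_file=$(mktemp)
cat > "${demo_fn_file}" <<'EOF'
[Greet_2.1]
Greet()
{
    echo "override 2.1: $1"
}

[Greet_2.2]
Greet()
{
    echo "override 2.2: $1"
}

EOF

# Print one label block: from "[label]" up to the first blank line or the first
# line starting with "}", dropping the label line, blank lines and comment
# lines that start at column 0.
ExtractLabelBlock() {
    local file=$1
    local label=$2
    sed -rn '/^\['"${label}"'\]/,/^$|^}/{/^$|^#|^\[/d;p}' "${file}"
}

before=$(Greet world)
# Loading the block redefines Greet in the running shell.
eval "$(ExtractLabelBlock "${demo_fn_file}" 'Greet_2.2')"
after=$(Greet world)
rm -f "${demo_fn_file}"

echo "${before}"
echo "${after}"
```

Because a block ends at the first line that starts with "}", the body of a function has to be indented and must not contain blank lines, which is the formatting rule the configuration and function files state in their header comments.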
[ root@stock.data.news.100 pts/0 2021-01-16/31@6 16:13:06 /Server/datas/downloads/scripts ]
# bash spider.sh lock_file=/tmp/spider.sh_stcn_kuaixun.lf config_file=stcn_kuaixun.cnf skincare_file=skincare.fn
2021-01-16 16:12:18 bash ./spider.sh process is 4416
current directory: /Server/datas/downloads/scripts
base function: receive parameters
lock file = /tmp/spider.sh_stcn_kuaixun.lf
config file = stcn_kuaixun.cnf
skincare file = skincare.fn
base function: Initialize
load config file .
-- variables is loading .
Configuration[stcn_kuaixun.cnf] complete loading .
-- Log_2.1 is loading .
-- ConvertToSqlFieldFormat_2.1 is loading .
-- BuildingMySqlInsertStatement_2.1 is loading .
-- Main_2.2 is loading .
Configuration[skincare.fn] complete loading .
check script is already running .
creating lock file .
lock file is completed
Main
-- Get Web Page Urls From https://kuaixun.stcn.com/
https://kuaixun.stcn.com/ https://kuaixun.stcn.com/index_1.html ...... https://kuaixun.stcn.com/index_19.html
-- Get Web Page Datas From https://kuaixun.stcn.com/
15:53 ./ss/202101/t20210116_2741808.html 新疆铁路首次引入“北斗”系统检测钢轨 2021-01-16
15:43 ./ss/202101/t20210116_2741797.html 河北本轮疫情排除与我国既往本土疫情相关性由境外输入 2021-01-16
......
-- Adjust data format
2021-01-16 2741808 15:53 ss ss/202101/t20210116_2741808.html 新疆铁路首次引入“北斗”系统检测钢轨
2021-01-16 2741797 15:43 ss ss/202101/t20210116_2741797.html 河北本轮疫情排除与我国既往本土疫情相关性由境外输入
......
-- Is the latest data[is null] of redis in this page ?
-- The data of this page will be stored.
-- Get Web Page Datas From https://kuaixun.stcn.com/index_1.html
13:18 ./cj/202101/t20210116_2741713.html 北京发布投资领域审批事项清单 2021-01-16
13:07 ./egs/202101/t20210116_2741709.html ST乐凯:目前生产经营正常暂未受到疫情影响 2021-01-16
......
......
-- Data Sorting
2021-01-16 2741808 15:53 ss ss/202101/t20210116_2741808.html 新疆铁路首次引入“北斗”系统检测钢轨
2021-01-16 2741798 15:42 egs egs/202101/t20210116_2741798.html 河北本轮疫情排除与我国既往本土疫情相关性由境外输入
......
-- Collate Unsaved Data
-- Convert To SQL Field Format
20210116 2741808 '15:53' 'ss' 'https://kuaixun.stcn.com/ss/202101/t20210116_2741808.html' '新疆铁路首次引入“北斗”系统检测钢轨'
20210116 2741798 '15:42' 'egs' 'https://kuaixun.stcn.com/egs/202101/t20210116_2741798.html' '河北本轮疫情排除与我国既往本土疫情相关性由境外输入'
......
-- Building MySQL Insert Statements
insert into news values(20210116,2741808,'15:53','ss','https://kuaixun.stcn.com/ss/202101/t20210116_2741808.html','新疆铁路首次引入“北斗”系统检测钢轨'),......
-- Latest Data Update To Redis
OK
(integer) 400
End unlock file .
lock file is deleted
2021-01-16 16:13:06 | finish
[ root@stock.data.news.100 pts/0 2021-01-16/31@6 16:39:53 /Server/datas/downloads/scripts ]
# redis-cli
127.0.0.1:6379> get stock_news-string-latest_stored_title_data_id
"20210116_2741808"
127.0.0.1:6379> LRANGE stock_news-llist-web_pages_titles_of_urls 0 3
1) "https://kuaixun.stcn.com/yb/202101/t20210115_2736688.html"
2) "https://kuaixun.stcn.com/cj/202101/t20210115_2736700.html"
3) "https://kuaixun.stcn.com/yb/202101/t20210115_2736704.html"
4) "https://kuaixun.stcn.com/egs/202101/t20210115_2736708.html"
127.0.0.1:6379>
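The key stock_news-string-latest_stored_title_data_id shown above is what makes the crawl incremental. The sketch below, which needs no Redis server, replays the trimming step of Main_2.2 on made-up records: the ids 2741900 and 2741850 and the title-a to title-d placeholders are hypothetical, while the stored id is the one from the session above.

```shell
#!/bin/bash
# Value as returned by "get stock_news-string-latest_stored_title_data_id".
latest_stored_id='20210116_2741808'

# Newest-first records, six whitespace-separated fields each (sample data).
sorted_records='2021-01-16 2741900 16:05 cj cj/202101/t20210116_2741900.html title-a
2021-01-16 2741850 15:58 egs egs/202101/t20210116_2741850.html title-b
2021-01-16 2741808 15:53 ss ss/202101/t20210116_2741808.html title-c
2021-01-16 2741797 15:43 ss ss/202101/t20210116_2741797.html title-d'

# Keep only the numeric id after the underscore: 2741808.
stored_numeric_id=${latest_stored_id##*_}

# Cut everything from the first occurrence of that id to the end of the data.
unsaved_records=${sorted_records%%${stored_numeric_id}*}

# The date field of the already stored record is left dangling at the end,
# so only complete groups of six fields count as records.
field_count=$(echo "${unsaved_records}" | wc -w)
complete_records=$(( field_count / 6 ))
echo "unsaved complete records: ${complete_records}"
```

The dangling date field is harmless in the script only because the later steps consume fields in groups of six; cutting at the record boundary instead would be a cleaner variant.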
The MySQL table has not been created yet, so the MySQL output is omitted.
Script (basic execution framework): spider.sh
#!/bin/bash
VERSION=2.0.2
Help()
{
cat <<-EOF
# synopsis: bash $0
# description: crawl data from https://kuaixun.stcn.com/
# date: 2020-11-03
# version: ${VERSION} [ adds script parameters, multiple processes and other improvements ]
# author: LBC
# -- Usage
# Single process (by default ${0%.*}.cnf and ${0%.*}.fn are looked up in the script directory; without them the script runs with its built-in defaults.)
#   bash $0
# Multiple processes
#   bash $0 lock_file='sample1.lf' config_file='file1.cnf' skincare_file='file1.fn'
#   bash $0 lock_file='sample2.lf' config_file='file2.cnf' skincare_file='file2.fn'
# Parameters
#   lock_file      lock file, distinguishes processes that use different configuration files
#   config_file    configuration file, changes the global variables of the script
#   skincare_file  function file, overrides or adds script functions
# -- Improvements in this version
#   Variables and functions renamed to be easier to understand.
#   Added a parameter function (ReceiveParameters) as the basis for multiple processes, similar to MySQL multithreading.
#   Added log output to every function to make the execution steps easier to follow.
EOF
}
# Version definition
#   version: 2.0.0 [ structure (major version, implemented logic).base feature change (added or removed).base feature optimization ]
# Naming conventions
#   Global variables: upper case with underscores
#   Functions: first letter of every word capitalized
#   Local variables: lower case with underscores
# Version notes
#   2.0.0 [ base version ]
#   2.0.1 [ GetWebPageDatas: grep + tr replaces sed ]
#   2.0.2 [ LoadConfigFile: uses a for loop ]
# Global variable initialization
export PATH='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin'
WORKING_DIRECTORY=$(dirname "$0")
LOCK_FILE="/tmp/$(basename "$0").lf"
FILE_NAME=$(basename "${0%.*}")
CONFIG_FILE="${WORKING_DIRECTORY}/${FILE_NAME}.cnf"
SKINCARE_FILE="${WORKING_DIRECTORY}/${FILE_NAME}.fn"
LOG_ON=1
LOCK_ON=1
#
WEB_URL_INDEX='https://kuaixun.stcn.com/'
WEB_URL_PATH_PREFIX='index_'
WEB_URL_PATH_NUMBER=1
WEB_URL_PATH_SUFFIX='.html'
#
HTML_TITLE_CODE_BEGIN_LABEL='news_list2'
HTML_TITLE_CODE_END_LABEL='<\/ul>'
# Regular expression that extracts the data from the HTML code
HTML_TITLE_DATA_MATCHING_PATTERN='([0-9.a-z/_]+).html|>.*<'
#HTML_TITLE_DATA_MATCHING_PATTERN='[0-9:-]+|([0-9.a-z/_]+).html'
#
# Log output
Log()
{
    if [ ${LOG_ON:-0} -eq 1 ] ; then
        echo "$1 $2"
    fi
}
# [ base function ] Lock file, prevents the script from being started again while it is running.
# $1=lock file path, $2=[0|1], 0: check whether the lock file exists, 1: create or delete the lock file
LockFile()
{
    if [ ${LOCK_ON:-0} -eq 0 ] ; then
        return 0
    fi
    local lock_file=$1
    if [ ${2:-0} -eq 0 ] ; then
        if [ -e ${lock_file} ] ; then
            Log " " "error: $(stat -c %y ${lock_file}) script is already running."
            Log ' ' "repair: Stop the process($(cat ${lock_file})) and delete the file(${lock_file})"
            exit
        fi
    else
        if [ -e ${lock_file} ] ; then
            rm -f ${lock_file}
            Log ' ' 'lock file is deleted'
        else
            echo $$ > ${lock_file}
            Log ' ' "lock file is completed"
        fi
    fi
}
# [ base function ] Load configuration: $1=configuration file, $2=configuration labels.
LoadConfigToRunning()
{
    local config_file=$1
    local config_label_list=$2
    local config_variables=''
    local config_label=''
    for config_label in ${config_label_list}
    do
        Log ' --' "${config_label} is loading ."
        config_variables=$(sed -rn '/^\['${config_label}'\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${config_file})
        # A check for an empty config_variables could be added here.
        eval "${config_variables}"
        #echo "${config_label_list}"
        #echo "${config_variables}"
    done
}
# [ base function ] Load the script configuration files (by default named after the script)
LoadConfigFile()
{
    local config_file=''
    local config_labels=''
    for config_file in "$@"
    do
        if [ -e ${config_file} ] ; then
            dos2unix "${config_file}" &>/dev/null
            case ${config_file} in
                ${CONFIG_FILE})
                    config_labels='variables'
                    ;;
                ${SKINCARE_FILE})
                    config_labels="$(sed -rn '/\[skincare\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${CONFIG_FILE})"
                    ;;
            esac
            LoadConfigToRunning "${config_file}" "${config_labels:-''}"
            Log ' ' "Configuration[${config_file}] complete loading ."
        else
            Log ' ' "${config_file} does not exist ."
            return 0
        fi
    done
}
# [ base function ] Previous version of LoadConfigFile, kept for reference
# LoadConfigFile()
# {
#     local config_file="${0%.*}.conf"
#     if [ -e ${config_file} ] ; then
#         dos2unix "${config_file}.conf" &>/dev/null
#         LoadConfigToRunning "${config_file}" 'variables'
#     else
#         echo "${config_file} does not exist ."
#         return 0
#     fi
#     echo "Configuration[${config_file}.conf] complete loading ."
#     #
#     # local skincare_file="$(dirname $0)/skincare.conf"
#     local skincare_file="${0%.*}.sf"
#     if [ -e ${skincare_file} ] ; then
#         dos2unix "${skincare_file}" &>/dev/null
#         local variables=$(sed -rn '/\[skincare\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${config_file})
#         LoadConfigToRunning "${skincare_file}" "${variables}"
#     else
#         echo "${skincare_file} does not exist ."
#         return 0
#     fi
#     echo "overwrite[${skincare_file}] function complete loading ."
# }
# Get the URLs of all paginated pages of a web page
GetWebPageUrls()
{
    # $1=index page
    local http_url=$1
    # $3=page number
    local http_url_page_number=$3
    # https://kuaixun.stcn.com/index_1.html
    # $1=https://kuaixun.stcn.com/ ; $2=index_ ; $4=.html
    local http_url_page='$1$2${http_url_page_number}$4'
    # Complete URLs of the pages
    local return_url_list=''
    while true
    do
        local http_status_code=$(curl -Is "${http_url}" | awk 'NR==1{print $2}')
        if [ ${http_status_code:-0} -eq 200 ] ; then
            return_url_list="${return_url_list} ${http_url}"
            http_url=$(eval "echo ${http_url_page}")
            http_url_page_number=$((${http_url_page_number} + 1))
            sleep 1
        else
            break
        fi
    done
    echo "${return_url_list}"
}
# Get the data of a web page
GetWebPageTitleDatas()
{
    # Get the content between the begin and end labels
    local web_data=$( \
        curl -s $1 | \
        sed -rn '/'${HTML_TITLE_CODE_BEGIN_LABEL}'/,/'${HTML_TITLE_CODE_END_LABEL}'/p' | \
        sed -r \
            -e 's/[[:blank:]]*//g' \
            -e 's/[[:space:]]*//g' \
    )
    # Extract the data from that content
    local return_results=$( \
        echo -e "${web_data}" | \
        grep -Eo "${HTML_TITLE_DATA_MATCHING_PATTERN}" | \
        tr -d '><'
        # sed performs slightly worse than grep. The sed below formatted the data, but this step is too early to adjust the raw data.
        #sed -rn \
        # -e 's#.*(\.[a-z0-9./]+)t([0-9]+)_([0-9]+)(\.[a-z]+).*>(.*)</a>#\2\3 \1\2_\3\4 \5#p' \
        # -e 's/<span>([0-9-]+)<.span>/\1/p' \
        # -e 's#<i>([0-9:]+)</i>#\1#p'
    )
    echo "${return_results}"
}
# [ base function ] Receive the script parameters
ReceiveParameters()
{
    Log "base function:" "receive parameters"
    #while [ "${1}" != "${1#*=}" ] ; do
    while [ 0 -ne $# ] ; do
        case ${1,,} in
            lock_file=?*)
                LOCK_FILE=${1#*=}
                shift
                ;;
            config_file=?*)
                CONFIG_FILE=${1#*=}
                shift
                ;;
            skincare_file=?*)
                SKINCARE_FILE=${1#*=}
                shift
                ;;
            *)
                Help
                exit
                ;;
        esac
    done
    Log " " "lock file = ${LOCK_FILE}"
    Log " " "config file = ${CONFIG_FILE}"
    Log " " "skincare file = ${SKINCARE_FILE}"
}
# [ base function ] Initialization, calls the base functions
Initialize()
{
    Log "base function:" "Initialize"
    Log ' ' 'load config file . '
    LoadConfigFile ${CONFIG_FILE} ${SKINCARE_FILE}
    Log ' ' 'check script is already running .'
    LockFile ${LOCK_FILE}
    Log ' ' 'creating lock file .'
    LockFile ${LOCK_FILE} 1
}
Main()
{
    Log 'Main'
    Log ' --' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4
    local web_page_data=''
    for web_page_url in $web_pages_urls
    do
        Log ' --' "Get Web Page Datas From ${web_page_url}"
        web_page_data=$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4)
        Log '' "$web_page_data" 4
        Log ' ' 'format data'
        web_page_data=$( Log '' "$web_page_data" 4|sed -rn 's/[ ]+([0-9:]+)[ ]+([0-9a-z./]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]+(.*)[ ]+([0-9-]+)/\3\4 \2\3_\4\5 \7 \1 \6/p' )
        Log '' "$web_page_data" 5
    done
}
Log "$(date +%F\ %H:%M:%S) bash ${WORKING_DIRECTORY}/$(basename "$0") process is $$"
Log ' ' "current directory: $(pwd)"
ReceiveParameters "$@"
Initialize
wait
Main
Log 'End' 'unlock file .'
LockFile ${LOCK_FILE} 1
Log "$(date +%F\ %H:%M:%S) | finish"
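The key=value argument handling of ReceiveParameters can be checked on its own. This is a minimal sketch assuming bash 4 or later (for the ${var,,} lowercase expansion); the ParseKeyValueArgs name and the demo file names are hypothetical. It uses a for loop instead of while/shift and returns an error instead of printing the help text, but matches the keys the same way.

```shell
#!/bin/bash
# Hypothetical helper that mirrors the matching rules of ReceiveParameters.
ParseKeyValueArgs() {
    local argument
    for argument in "$@"
    do
        # Lowercase only for matching the key; the value is taken from the
        # original argument, so the case of file names is preserved.
        case ${argument,,} in
            lock_file=?*)     LOCK_FILE=${argument#*=} ;;
            config_file=?*)   CONFIG_FILE=${argument#*=} ;;
            skincare_file=?*) SKINCARE_FILE=${argument#*=} ;;
            *)
                echo "unknown argument: ${argument}" >&2
                return 1
                ;;
        esac
    done
}

ParseKeyValueArgs lock_file=/tmp/Demo.lf CONFIG_FILE=demo.cnf skincare_file=demo.fn
echo "${LOCK_FILE} ${CONFIG_FILE} ${SKINCARE_FILE}"
```

Giving each process its own lock_file value is what lets several configurations run side by side, since the lock check only looks at that one path.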
Configuration file for one service: stcn_kuaixun.cnf
# Note: labels must be separated by a blank line, and the code inside a label must not contain blank lines
[variables]
export PATH="/Server/program-files/redis/bin:/Server/program-files/mysql/bin:${PATH}"
LOG_ON=1
LOCK_ON=1
# LOCK_FILE="/tmp/$(basename $0).lockfile"
#
# Index page
WEB_URL_INDEX='https://kuaixun.stcn.com/'
# First paginated URL: https://kuaixun.stcn.com/index_1.html
WEB_URL_PATH_PREFIX='index_'
WEB_URL_PATH_NUMBER=1
WEB_URL_PATH_SUFFIX='.html'
#
HTML_TITLE_CODE_BEGIN_LABEL='news_list2'
HTML_TITLE_CODE_END_LABEL='<\/ul>'
# Regular expression that extracts the data from the HTML code
HTML_TITLE_DATA_MATCHING_PATTERN='([0-9.a-z/_]+).html|>.*<'
#
MYSQL_DATABASE='stock'
MYSQL_USER='stock'
MYSQL_PASSWORD='111111'
MYSQL_HOSTNAME="$(lsof -i:3306 | grep -Eo '[0-9.]{6,}' )"
MYSQL_PORT='3306'
MYSQL_TABLE="news"
#
REDIS_PORT='6379'
REDIS_HOSTNAME="$(ps -ef | grep [r]edis | grep -Eo '([0-9.]+){6,}')"

[skincare]
# Labels in skincare.fn whose functions override or extend spider.sh
Log_2.1
ConvertToSqlFieldFormat_2.1
BuildingMySqlInsertStatement_2.1
#Main_2.1
Main_2.2
Function file for one service: skincare.fn
# description: overrides functions of the script
# note: separate labels with a blank line; statements inside a label must not contain blank lines; the file must end with a blank line.
# date: 2020-11-07
# version: 2.0.0
# author: LBC
# Naming conventions
#   Global variables: upper case with underscores
#   Functions: first letter of every word capitalized
#   Local variables: lower case with underscores
# Version number definition
#   FunctionName_ScriptMajorVersion.SequenceNumber

# Formatted log output
[Log_2.1]
# $1: title printed at the start of the line
# $2: text content
# $3: number of text items printed per line (unit: words)
Log()
{
    if [ ${LOG_ON:-0} -eq 1 ] ; then
        local text_counter=0
        local print_text=''
        # $1: leading word of the line (a sequence number, a title, etc.)
        local text_title=$1
        # $2: text content to print
        local text_list=$2
        # $3: number of words per output line; 0 (the default) prints everything on one line
        local text_merge_line=${3:-0}
        if [ ${text_merge_line} -eq 0 ] ; then
            echo "${text_title} ${text_list}"
            return 1
        fi
        #
        print_text="${text_title}"
        for text in ${text_list:-" "}
        do
            if [ ${text_counter} -eq ${text_merge_line} ] ; then
                print_text="${print_text}\n ${text}"
                text_counter=0
            else
                print_text="${print_text} ${text}"
            fi
            text_counter=$((${text_counter}+1))
        done
        echo -e "${print_text}"
    fi
}

# Convert to the SQL field format
[ConvertToSqlFieldFormat_2.1]
# $1 data source
# $2 number of text items that form one group in the data source
# The formatted result has 6 text items per group
ConvertToSqlFieldFormat()
{
    local news_yymmdd=''
    local news_id=''
    local news_hhmm=''
    local news_category=''
    local news_web_path=''
    local news_title=''
    #
    local return_format_data=''
    # $2: every 6 text items form one row
    local text_group=$2
    local text_counter=1
    #
    for text in $1
    do
        case ${text} in
            ????-??-??)
                news_yymmdd=${text}
                ;;
            ???????)
                news_id=${text}
                ;;
            ??:??)
                news_hhmm=${text}
                ;;
            ??|???|????)
                news_category=${text}
                ;;
            ????*_*.*)
                news_web_path=${text}
                ;;
            *)
                news_title=${text}
                ;;
        esac
        #
        # Every 6 text items form one record; the field order matches the column order of the SQL table
        if [ $(( ${text_counter} % ${text_group} )) -eq 0 ] ; then
            return_format_data=" ${return_format_data} \
            ${news_yymmdd//-/} \
            ${news_id} \
            '${news_hhmm}' \
            '${news_category}' \
            '${WEB_URL_INDEX}${news_web_path}' \
            '${news_title}' \
            "
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    echo "${return_format_data}"
}

# Building MySQL Insert Statements
# $1 statement prefix (for example: insert into <table> values)
# $2 data source
# $3 number of text items that form one row in the data source
[BuildingMySqlInsertStatement_2.1]
BuildingMySqlInsertStatement()
{
    local text_group=$3
    local text_counter=1
    #
    local fields_value=''
    #local MYSQL_INSERT_STATEMENT=''
    local insert_datas=''
    #
    for text in $2
    do
        fields_value="${fields_value}${text},"
        if [ $(( ${text_counter} % ${text_group} )) -eq 0 ] ; then
            # One complete row
            insert_datas="${insert_datas}(${fields_value%,}),"
            fields_value=''
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    echo "${1}${insert_datas%,}"
}

# Main_2.1 is for testing only: it checks whether Main in the script can be replaced
[Main_2.1]
Main()
{
    Log 'Main'
    Log '--' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4
    #
    # Get the title data of the web pages
    local web_pages_titles_data=''
    for web_page_url in $web_pages_urls
    do
        Log '--' "Get Web Page Datas From ${web_page_url}"
        web_pages_titles_data=$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4)
        Log '' "${web_pages_titles_data}" 4
    done
}

# Main_2.2 is the production version
[Main_2.2]
Main()
{
    Log 'Main'
    # Check whether redis and mysql are running.
    if [ -z "${REDIS_HOSTNAME}" ] ; then
        Log ' --' 'Redis is not running.'
        return
    fi
    if [ -z "${MYSQL_HOSTNAME}" ] ; then
        Log ' --' 'MySQL is not running.'
        return
    fi
    #
    # Get all page URLs
    # Fetched in order from https://kuaixun.stcn.com/index_1.html to https://kuaixun.stcn.com/index_19.html
    # web_pages_urls='https://kuaixun.stcn.com/index_1.html https://kuaixun.stcn.com/index_2.html ... index_19.html'
    Log ' --' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4
    #
    # Title data of all pages
    local web_pages_titles_data=''
    #
    # Title data of a single page
    local web_page_title_data=''
    #
    local mysql_stock="mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -h${MYSQL_HOSTNAME} -P${MYSQL_PORT} ${MYSQL_DATABASE} -e "
    #
    # Redis key naming convention: service name-data type-description
    ## Example: stock_news-string-latest_stored_title_data_id
    ## stock_news: service name
    ## string: data type of the value
    ## latest_stored_title_data_id: variable of the script
    #
    local redis_cli="redis-cli -h ${REDIS_HOSTNAME} -p ${REDIS_PORT} "
    #
    # The newest title data already saved to MySQL is stored in Redis. The next run uses it to decide which data is already saved locally.
    # Format of the value stored in Redis, a fragment of the URL: 20210107_2712148
    # Redis key: stock_news-string-latest_stored_title_data_id
    local latest_stored_title_data_id_at_redis=$(${redis_cli} get stock_news-string-latest_stored_title_data_id)
    if [ -z "${latest_stored_title_data_id_at_redis}" ] ; then
        ${redis_cli} set stock_news-string-latest_stored_title_data_id 'is null'
        latest_stored_title_data_id_at_redis='is null'
    fi
    #
    # Whether the page contains data that is already saved to MySQL
    local is_the_page_data_stored=0
    #
    for web_page_url in $web_pages_urls
    do
        # date id time category url title
        Log ' --' "Get Web Page Datas From ${web_page_url} "
        #
        # Get the title data of the page, format (4 items): time url title date
        # Raw format of a title on the page: 20:22 ./ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升 2021-01-07
        Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4
        #
        Log ' --' "Adjust data format"
        web_page_title_data="$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4 |sed -rn 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 \2\3\4_\5\6 \7#p')"
        # Adjusted format (6 items): date id (extracted from the url) time category (extracted from the url) url title
        # Adjusted sample: 2021-01-07 2712148 20:22 ss ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升
        Log '' "${web_page_title_data}" 6
        #
        web_pages_titles_data="${web_pages_titles_data} ${web_page_title_data}"
        #web_pages_titles_data="${web_pages_titles_data} $(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4 |sed -rn 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 '${WEB_URL_INDEX}' \2\3\4_\5\6 \7#p')"
        #
        # Does the current page contain data that is already saved locally?
        Log ' --' "Is the latest data[${latest_stored_title_data_id_at_redis}] of redis in this page ?"
        # Look for the id from Redis in this page. If it is present, leave the loop: the data of all following pages is already in the database.
        echo "${web_page_title_data}" | grep -Fq "${latest_stored_title_data_id_at_redis}"
        if [ $? -eq 0 ] ; then
            Log ' --' "The latest data from Redis exist in this page, which stops data crawling from subsequent pages."
            is_the_page_data_stored=1
            break
        else
            Log ' --' 'The data of this page will be stored.'
        fi
    done
    #
    #
    # Titles are not always published in chronological order, so the data is sorted again (descending), newest first.
    Log ' --' 'Data Sorting'
    local web_pages_titles_data_of_sort_desc="$(Log '' " ${web_pages_titles_data} " 6 | sort -n -k1 -k2 -t' ' -r)"
    Log '' "${web_pages_titles_data_of_sort_desc}" 6
    unset web_pages_titles_data
    #
    #
    # Remove the data that is already saved locally from the sorted data
    Log ' --' 'Collate Unsaved Data'
    if [ ${is_the_page_data_stored} -eq 1 ] ; then
        web_pages_titles_data_of_sort_desc="${web_pages_titles_data_of_sort_desc%%${latest_stored_title_data_id_at_redis##*_}*}"
    fi
    #
    #
    Log ' --' 'Convert To SQL Field Format'
    # Adjusted format (6 items): 2021-01-07 2712148 20:22 ss ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升
    # MySQL format (6 items): 20210107 2712148 '20:22' 'ss' 'https://kuaixun.stcn.com/ss/202101/t20210107_2712148.html' '2020年珠江水运内河货运量企稳回升'
    # In the MySQL table design, the first 6 columns use fixed-width data types
    local web_pages_titles_data_convert_to_sql_field="$(ConvertToSqlFieldFormat "${web_pages_titles_data_of_sort_desc}" 6)"
    Log '' "${web_pages_titles_data_convert_to_sql_field}" 6
    unset web_pages_titles_data_of_sort_desc
    #
    #
    # Empty data means that everything on the crawled pages is already saved to MySQL.
    if [ -z "${web_pages_titles_data_convert_to_sql_field}" ] ; then
        Log ' --' 'There Is No Data To Save'
        return
    fi
    #
    #
    # Take the first two text items (date, id) of the newest record; they are saved to Redis
    local text_counter=1
    for text in ${web_pages_titles_data_convert_to_sql_field}
    do
        if [ ${text_counter} -eq 2 ] ; then
            latest_stored_title_data_id_at_redis="${latest_stored_title_data_id_at_redis}_${text}"
            break
        else
            latest_stored_title_data_id_at_redis="${text}"
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    #
    #
    Log ' --' 'Building MySQL Insert Statements'
    local mysql_insert_statement="$(BuildingMySqlInsertStatement "insert into ${MYSQL_TABLE} values" "${web_pages_titles_data_convert_to_sql_field}" 6)"
    Log '' "${mysql_insert_statement}" 6
    #
    #
    Log ' --' 'Data Insert To MySQL'
    # The command is only printed, not executed, because the MySQL table does not exist yet.
    echo "${mysql_stock} ${mysql_insert_statement}"
    #
    #
    Log ' --' 'Latest Data Update To Redis'
    ${redis_cli} set stock_news-string-latest_stored_title_data_id "${latest_stored_title_data_id_at_redis}"
    # Save the URLs of all page titles to Redis under the key stock_news-llist-web_pages_titles_of_urls
    # stock_news: service name
    # llist: the values are pushed with lpush
    # web_pages_titles_of_urls: matching variable of the script
    ${redis_cli} LPUSH stock_news-llist-web_pages_titles_of_urls $(echo ${web_pages_titles_data_convert_to_sql_field} | grep -Eo 'http[0-9a-z./_:]+')
}
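The "Adjust data format" step in Main_2.2 is the densest line of the file, so a worked example may help. The sketch below, assuming GNU sed, wraps the same substitution in a hypothetical AdjustTitleFields function and feeds it the sample title line quoted in the comments of Main_2.2.

```shell
#!/bin/bash
# Hypothetical wrapper around the substitution used in Main_2.2.
# Capture groups: \1 time, \2 category, \3 path up to the letter "t",
# \4 date part of the file name, \5 numeric id, \6 extension, \7 title, \8 date.
AdjustTitleFields() {
    sed -rn 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 \2\3\4_\5\6 \7#p'
}

# Raw order: time url title date
raw_line='20:22 ./ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升 2021-01-07'

# Adjusted order: date id time category url title
adjusted_line=$(printf '%s\n' "${raw_line}" | AdjustTitleFields)
echo "${adjusted_line}"
```

The id and the category are not present as separate fields on the page; both come out of the relative URL, which is why the expression splits the path into so many groups.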
posted on 2021-01-16 16:47 by 3L·BoNuo·Lotus