
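CentOS Shell: fetching web page data and storing it in MySQL and Redis

Content

Components:
Script file: spider.sh
Configuration file: stcn_kuaixun.cnf
Function file: skincare.fn

Environment:
Redis 5.0.10, MySQL 8.0.22, CentOS 7, dos2unix, lsof

Overall design of the script:
1 Base variables and functions (spider.sh provides the basic execution framework; stcn_kuaixun.cnf is the link between spider.sh and skincare.fn)
2 Overridable functions (code that overrides or extends the functions of spider.sh goes in skincare.fn)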
Sample run:
[ root@stock.data.news.100 pts/0 2021-01-16/31@6 16:13:06 /Server/datas/downloads/scripts ]
# bash spider.sh lock_file=/tmp/spider.sh_stcn_kuaixun.lf config_file=stcn_kuaixun.cnf skincare_file=skincare.fn
2021-01-16 16:12:18 bash ./spider.sh process is 4416
  current directory: /Server/datas/downloads/scripts
base function: receive parameters
  lock file = /tmp/spider.sh_stcn_kuaixun.lf
  config file = stcn_kuaixun.cnf
  skincare file = skincare.fn
base function: Initialize
  load config file .
  -- variables is loading .
  Configuration[stcn_kuaixun.cnf] complete loading .
  -- Log_2.1 is loading .
  -- ConvertToSqlFieldFormat_2.1 is loading .
  -- BuildingMySqlInsertStatement_2.1 is loading .
  -- Main_2.2 is loading .
  Configuration[skincare.fn] complete loading .
  check script is already running .
  creating lock file .
  lock file is completed
Main
  -- Get Web Page Urls From https://kuaixun.stcn.com/
  https://kuaixun.stcn.com/  https://kuaixun.stcn.com/index_1.html  ......  https://kuaixun.stcn.com/index_19.html
  -- Get Web Page Datas From https://kuaixun.stcn.com/
  15:53  ./ss/202101/t20210116_2741808.html  新疆铁路首次引入“北斗”系统检测钢轨  2021-01-16
  15:43  ./ss/202101/t20210116_2741797.html  河北本轮疫情排除与我国既往本土疫情相关性由境外输入  2021-01-16
  ......
  -- Adjust data format
  2021-01-16  2741808  15:53  ss  ss/202101/t20210116_2741808.html  新疆铁路首次引入“北斗”系统检测钢轨
  2021-01-16  2741797  15:43  ss  ss/202101/t20210116_2741797.html  河北本轮疫情排除与我国既往本土疫情相关性由境外输入
......
  -- Is the latest data[is null] of redis in this page ?
  -- The data of this page will be stored.
  -- Get Web Page Datas From https://kuaixun.stcn.com/index_1.html
  13:18  ./cj/202101/t20210116_2741713.html  北京发布投资领域审批事项清单  2021-01-16
  13:07  ./egs/202101/t20210116_2741709.html  ST乐凯:目前生产经营正常暂未受到疫情影响  2021-01-16
......
......
  -- Data Sorting
  2021-01-16  2741808  15:53  ss  ss/202101/t20210116_2741808.html  新疆铁路首次引入“北斗”系统检测钢轨
  2021-01-16  2741798  15:42  egs  egs/202101/t20210116_2741798.html  河北本轮疫情排除与我国既往本土疫情相关性由境外输入
......
-- Collate Unsaved Data
  -- Convert To SQL Field Format
  20210116  2741808  '15:53'  'ss'  'https://kuaixun.stcn.com/ss/202101/t20210116_2741808.html'  '新疆铁路首次引入“北斗”系统检测钢轨'
  20210116  2741798  '15:42'  'egs'  'https://kuaixun.stcn.com/egs/202101/t20210116_2741798.html'  '河北本轮疫情排除与我国既往本土疫情相关性由境外输入'
......
-- Building MySQL Insert Statements
  insert  into  news  values(20210116,2741808,'15:53','ss','https://kuaixun.stcn.com/ss/202101/t20210116_2741808.html','新疆铁路首次引入“北斗”系统检测钢轨'),......
  -- Latest Data Update To Redis
OK
(integer) 400
End unlock file .
  lock file is deleted
2021-01-16 16:13:06 | finish
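
The "Adjust data format" step in the log turns each 4-field record (time, relative URL, title, date) into 6 fields (date, id, time, category, path, title). A minimal sketch of that rewrite on a single record, using the same substitution expression as skincare.fn; the helper name AdjustRecord and the sample title are assumptions made for this example, and -E is used as the portable spelling of the -r option seen in the scripts:

```shell
#!/bin/bash
# AdjustRecord is a hypothetical helper name used only in this sketch.
AdjustRecord()
{
    # Capture groups: 1=time 2=category 3=rest of the path up to "t"
    # 4=date digits 5=id 6=extension 7=title 8=date
    sed -En 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 \2\3\4_\5\6 \7#p'
}

# One record in the 4-field layout printed by the crawler (title shortened for the demo).
echo '15:53 ./ss/202101/t20210116_2741808.html Demo-title 2021-01-16' | AdjustRecord
# prints: 2021-01-16 2741808 15:53 ss ss/202101/t20210116_2741808.html Demo-title
```

The expression only works when the title is a single whitespace-free token, which is also what the word-by-word loops in the scripts assume.
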

[ root@stock.data.news.100 pts/0 2021-01-16/31@6 16:39:53 /Server/datas/downloads/scripts ]
# redis-cli
127.0.0.1:6379> get stock_news-string-latest_stored_title_data_id
"20210116_2741808"
127.0.0.1:6379> LRANGE stock_news-llist-web_pages_titles_of_urls 0 3
1) "https://kuaixun.stcn.com/yb/202101/t20210115_2736688.html"
2) "https://kuaixun.stcn.com/cj/202101/t20210115_2736700.html"
3) "https://kuaixun.stcn.com/yb/202101/t20210115_2736704.html"
4) "https://kuaixun.stcn.com/egs/202101/t20210115_2736708.html"
127.0.0.1:6379>
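
The checkpoint value stored in Redis ("20210116_2741808") is the date and id fragment of the newest article URL. The script assembles it from the first two fields of the converted data; as an alternative sketch, the same value can be cut out of a URL with parameter expansion alone. The function name CheckpointIdFromUrl is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: derive the checkpoint id from an article URL.
CheckpointIdFromUrl()
{
    local file_name=${1##*/}      # strip the directory part: t20210116_2741808.html
    file_name=${file_name%.*}     # strip the extension:      t20210116_2741808
    echo "${file_name#t}"         # strip the leading "t":    20210116_2741808
}

CheckpointIdFromUrl 'https://kuaixun.stcn.com/ss/202101/t20210116_2741808.html'
# prints: 20210116_2741808
```
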

The MySQL table has not been created yet, so the MySQL output is omitted.
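
Once a table exists, the generated insert statement wraps each title in single quotes, so a title that itself contains a single quote would break the statement. A minimal sketch of one way to guard against that by doubling single quotes, which is standard SQL and accepted by MySQL; the helper name SqlQuote is hypothetical, and backslashes are deliberately not handled here:

```shell
#!/bin/bash
# Hypothetical helper: wrap a value in single quotes for use as a SQL string literal.
SqlQuote()
{
    local value=$1
    local single="'"
    # Double every single quote inside the value, then add the outer quotes.
    printf '%s\n' "${single}${value//${single}/${single}${single}}${single}"
}

SqlQuote "it's"
# prints: 'it''s'
```
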

 

Script (base execution framework): spider.sh

#!/bin/bash

VERSION=2.0.2
Help()
{
    cat <<-EOF
    # synopsis: bash $0
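    # description: fetch data from https://kuaixun.stcn.com/
    # date: 2020-11-03
    # version: ${VERSION} [ adds script parameters, multi-process support and other improvements ]
    # author: LBC

    # -- Usage
        #   Single process (by default looks for ${0%.*}.cnf and ${0%.*}.fn in the script's directory; without them the script runs with its built-in defaults.)
            # bash $0

        #   Multiple processes
            # bash $0 lock_file='sample1.lf' config_file='file1.cnf' skincare_file='file1.fn'
            # bash $0 lock_file='sample2.lf' config_file='file2.cnf' skincare_file='file2.fn'
        
        #   Parameters
            # lock_file: lock file, distinguishes processes started with different configuration files
            # config_file: configuration file, overrides the script's global variables
            # skincare_file: function file, overrides or adds script functions
    
    # -- Improvements in this version
        # Renamed variables and functions so they are easier to understand.
        # Added the parameter function (ReceiveParameters) as the basis for multi-process runs, similar to MySQL multithreading.
        # Added log output to each function so the execution steps are easier to follow.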

EOF
}

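# Version scheme
# version: 2.0.0 [ structure (major version, the implemented logic).base feature change (added or removed).base feature optimization ]

# Naming conventions
    # Global variables: upper case with underscores
    # Functions: first letter of each word capitalized
    # Local variables: lower case with underscores

# Version history
    # 2.0.0 [ base version ]
    # 2.0.1 [ GetWebPageDatas: grep + tr replace sed ]
    # 2.0.2 [ LoadConfigFile: uses a for loop ]

# Global variable initialization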
export PATH='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin'

WORKING_DIRECTORY=$(dirname $0)
LOCK_FILE="/tmp/$(basename $0).lf"
FILE_NAME=$(basename ${0%.*})
CONFIG_FILE="${WORKING_DIRECTORY}/${FILE_NAME}.cnf"
SKINCARE_FILE="${WORKING_DIRECTORY}/${FILE_NAME}.fn"
LOG_ON=1
LOCK_ON=1
#
WEB_URL_INDEX='https://kuaixun.stcn.com/'
WEB_URL_PATH_PREFIX='index_'
WEB_URL_PATH_NUMBER=1
WEB_URL_PATH_SUFFIX='.html'
#
HTML_TITLE_CODE_BEGIN_LABEL='news_list2'
HTML_TITLE_CODE_END_LABEL='<\/ul>'
# Regular expression that extracts the data from the HTML code
HTML_TITLE_DATA_MATCHING_PATTERN='([0-9.a-z/_]+).html|>.*<'
#HTML_TITLE_DATA_MATCHING_PATTERN='[0-9:-]+|([0-9.a-z/_]+).html'
#

# Log output
Log()
{
    if [ ${LOG_ON:-0} -eq 1 ] ; then
        echo "$1 $2"               
    fi
}

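# [ Base function ] Lock file: prevents the script from being started again while it is still running.
# $1=lock file path, $2=[0|1], 0: check whether the lock file exists, 1: create or delete the lock file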
LockFile()
{
    if [ ${LOCK_ON:-0} -eq 0 ] ; then
        return 0
    fi
    local lock_file=$1
    if [ ${2:-0} -eq 0 ] ; then 
        if [ -e ${lock_file} ] ; then
            Log " " "error: $(stat -c %y ${lock_file}) script is already running."
            Log ' ' "repair: Stop the process($(cat ${lock_file})) and delete the file(${lock_file})"
            exit
        fi
    else
        if [ -e ${lock_file} ] ; then
            rm -f ${lock_file}
            Log ' ' 'lock file is deleted'
        else
            echo $$ > ${lock_file}
            Log ' ' "lock file is completed"
        fi
    fi
}

# [ Base function ] Load configuration: $1=configuration file, $2=configuration labels.
LoadConfigToRunning()
{
    local config_file=$1
    local config_label_list=$2
    local config_variables=''
    local config_label=''
    for config_label in ${config_label_list}
    do
        Log '  --' "${config_label} is loading ."
        config_variables=$(sed -rn '/^\['${config_label}'\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${config_file})
        # A check for an empty config_variables could be added here.
        eval "${config_variables}"
        
        #echo "${config_label_list}"
        #echo "${config_variables}"
    done
}

# [ Base function ] Load the script's configuration files (by default named after the script)
LoadConfigFile()
{
    local config_file=''
    local config_labels=''
    for config_file in $@
    do 
        if [ -e ${config_file} ] ; then
            dos2unix "${config_file}" &>/dev/null
            case ${config_file} in
                ${CONFIG_FILE})
                    config_labels='variables'
                    ;;
                ${SKINCARE_FILE})
                    config_labels="$(sed -rn '/\[skincare\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${CONFIG_FILE})"
                    ;;
            esac
            LoadConfigToRunning "${config_file}" "${config_labels:-''}"
            Log ' ' "Configuration[${config_file}] complete loading ."
        else
            Log ' ' "${config_file} does not exist ."
            return 0
        fi
    done
}

# [ Base function ] Load the script's configuration files (by default named after the script)
# LoadConfigFile()
# {
#     local config_file="${0%.*}.conf"
#     if [ -e ${config_file} ] ; then
#         dos2unix "${config_file}.conf" &>/dev/null
#         LoadConfigToRunning "${config_file}" 'variables'
#     else
#         echo "${config_file} does not exist ."
#         return 0
#     fi
#     echo "Configuration[${config_file}.conf] complete loading ."
#     #
#     # local skincare_file="$(dirname $0)/skincare.conf"
#     local skincare_file="${0%.*}.sf"
#     if [ -e ${skincare_file} ] ; then
#         dos2unix "${skincare_file}" &>/dev/null
#         local variables=$(sed -rn '/\[skincare\]/,/^$|^}/{/^$|^#|^\[/d;p}' ${config_file})
#         LoadConfigToRunning "${skincare_file}" "${variables}"
#     else
#         echo "${skincare_file} does not exist ."
#         return 0
#     fi
#     echo "overwrite[${skincare_file}] function complete loading ."
# }


# Collect the URLs of all pagination pages of a site
GetWebPageUrls()
{
    # $1=index page
    local http_url=$1
    # $3: starting page number
    local http_url_page_number=$3
    
    # https://kuaixun.stcn.com/index_1.html
    # $1=https://kuaixun.stcn.com/ ; $2=index_ ; $4=.html
    local http_url_page='$1$2${http_url_page_number}$4'  # full URL of a pagination page
    local return_url_list=''
    while true
    do
        local http_status_code=$(curl -Is "${http_url}" | awk 'NR==1{print $2}')
        if [ ${http_status_code:-0} -eq 200 ] ; then
            return_url_list="${return_url_list} ${http_url}"
            http_url=$(eval "echo ${http_url_page}")
            http_url_page_number=$((${http_url_page_number} + 1))
            sleep 1
        else
            break
        fi
    done

    echo "${return_url_list}"
}

# Fetch the data of a web page
GetWebPageTitleDatas()
{
    # Extract the content between the begin and end labels
    local web_data=$( \
        curl -s $1 | \
        sed -rn '/'${HTML_TITLE_CODE_BEGIN_LABEL}'/,/'${HTML_TITLE_CODE_END_LABEL}'/p' | \
        sed -r \
        -e 's/[[:blank:]]*//g' \
        -e 's/[[:space:]]*//g' \
    )

    # Extract the data from that content
    local return_results=$( \
        echo -e "${web_data}" | \
        grep -Eo "${HTML_TITLE_DATA_MATCHING_PATTERN}" | \
        tr -d '><'
        # sed is slightly slower than grep here. The sed below formats the data, but at this step it would reshape the raw data too early.
        #sed -rn \
        #    -e 's#.*(\.[a-z0-9./]+)t([0-9]+)_([0-9]+)(\.[a-z]+).*>(.*)</a>#\2\3 \1\2_\3\4 \5#p' \
        #    -e 's/<span>([0-9-]+)<.span>/\1/p' \
        #    -e 's#<i>([0-9:]+)</i>#\1#p'
    )

    echo "${return_results}"
}

# [ Base function ] Receive script parameters
ReceiveParameters()
{
    Log "base function:" "receive parameters"
    #while [ "${1}" != "${1#*=}" ] ; do
    while [ 0 -ne $# ] ; do
        case ${1,,} in
            lock_file=?*)
                LOCK_FILE=${1#*=}                
                shift
                ;;
            config_file=?*)
                CONFIG_FILE=${1#*=}                
                shift
                ;;
            skincare_file=?*)
                SKINCARE_FILE=${1#*=}                
                shift
                ;;
            *)
                Help
                exit
                ;;
        esac
    done
    Log " " "lock file = ${LOCK_FILE}"
    Log " " "config file = ${CONFIG_FILE}"
    Log " " "skincare file = ${SKINCARE_FILE}"
}

# [ Base function ] Initialization, calls the base functions
Initialize()
{
    Log "base function:" "Initialize"
    Log ' ' 'load config file . '
    LoadConfigFile ${CONFIG_FILE} ${SKINCARE_FILE}
    
    Log ' ' 'check script is already running .'
    LockFile ${LOCK_FILE}
    
    Log ' ' 'creating lock file .'
    LockFile ${LOCK_FILE} 1
}

Main()
{

    Log 'Main'
    Log '  --' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4

    local web_page_data=''
    for web_page_url in $web_pages_urls
    do
        Log '  --' "Get Web Page Datas From ${web_page_url}"
        web_page_data=$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4)
        Log '' "$web_page_data" 4

        Log ' ' 'format data'
        web_page_data=$( Log '' "$web_page_data" 4|sed -rn 's/[ ]+([0-9:]+)[ ]+([0-9a-z./]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]+(.*)[ ]+([0-9-]+)/\3\4 \2\3_\4\5 \7 \1 \6/p' )
        Log '' "$web_page_data" 5
    done
      
}

Log "$(date +%F\ %H:%M:%S) bash ${WORKING_DIRECTORY}/$(basename $0) process is $$"
Log ' ' "current directory: $(pwd)" 

ReceiveParameters $@
Initialize

wait
Main

Log 'End' 'unlock file .'
LockFile ${LOCK_FILE} 1
Log "$(date +%F\ %H:%M:%S) | finish"
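
LockFile above checks for the lock file and creates it in two separate calls, so two instances started at almost the same moment could both pass the check. A minimal sketch of a one-step alternative using the shell's noclobber option; the helper names AcquireLock and ReleaseLock and the demo path are assumptions, not part of spider.sh:

```shell
#!/bin/bash
# Hypothetical helpers, not part of spider.sh.
AcquireLock()
{
    local lock_file=$1
    # With noclobber set, '>' fails if the file already exists,
    # so the existence test and the creation happen in one step.
    if ( set -o noclobber ; echo "$$" > "${lock_file}" ) 2>/dev/null ; then
        return 0
    fi
    return 1
}

ReleaseLock()
{
    rm -f "$1"
}

demo_lock="${TMPDIR:-/tmp}/demo_spider_$$.lf"
if AcquireLock "${demo_lock}" ; then
    echo "lock acquired: ${demo_lock}"
    ReleaseLock "${demo_lock}"
else
    echo "already running"
fi
```

Pairing AcquireLock with a `trap 'ReleaseLock "${demo_lock}"' EXIT` would also remove the file when the script stops early.
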

 

Configuration file for a specific service: stcn_kuaixun.cnf

# Note: labels must be separated by an empty line; the code inside a label must not contain empty lines
[variables]
export PATH="/Server/program-files/redis/bin:/Server/program-files/mysql/bin:${PATH}"
LOG_ON=1
LOCK_ON=1
# LOCK_FILE="/tmp/$(basename $0).lockfile"
#
# Index page
WEB_URL_INDEX='https://kuaixun.stcn.com/'
# First pagination URL: https://kuaixun.stcn.com/index_1.html
WEB_URL_PATH_PREFIX='index_'
WEB_URL_PATH_NUMBER=1
WEB_URL_PATH_SUFFIX='.html'
#
HTML_TITLE_CODE_BEGIN_LABEL='news_list2'
HTML_TITLE_CODE_END_LABEL='<\/ul>'
# Regular expression that extracts the data from the HTML code
HTML_TITLE_DATA_MATCHING_PATTERN='([0-9.a-z/_]+).html|>.*<' 
#
MYSQL_DATABASE='stock'
MYSQL_USER='stock'
MYSQL_PASSWORD='111111'
MYSQL_HOSTNAME="$(lsof -i:3306 | grep -Eo '[0-9.]{6,}' )"
MYSQL_PORT='3306'
MYSQL_TABLE="news"
#
REDIS_PORT='6379'
REDIS_HOSTNAME="$(ps -ef | grep [r]edis | grep -Eo '([0-9.]+){6,}')"


[skincare]
# Labels in skincare.fn whose functions override or extend those of spider.sh (one label per line; lines starting with # are skipped)
Log_2.1
ConvertToSqlFieldFormat_2.1
BuildingMySqlInsertStatement_2.1
#Main_2.1
Main_2.2
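
The empty-line rule in the note at the top of this file follows from how the labelled blocks are read: a sed address range starts at the "[label]" line and stops at the first empty line. A minimal sketch of that extraction on a throwaway file; the helper name ExtractSection and the demo contents are assumptions, and the expression is the one from LoadConfigToRunning written with -E:

```shell
#!/bin/bash
# Hypothetical helper mirroring the sed range used by LoadConfigToRunning.
ExtractSection()
{
    local label=$1
    local file=$2
    # Range: from the "[label]" line to the first empty line (or a line starting with "}").
    # Inside the range, drop empty lines, "#" comment lines and the "[label]" line itself.
    sed -En '/^\['"${label}"'\]/,/^$|^}/{/^$|^#|^\[/d;p;}' "${file}"
}

demo_file=$(mktemp)
cat > "${demo_file}" <<'EOF'
# demo configuration
[variables]
# inner comment
DEMO_A=1
DEMO_B='two'

[skincare]
Log_2.1
#Main_2.1
Main_2.2

EOF

ExtractSection variables "${demo_file}"
ExtractSection skincare "${demo_file}"
rm -f "${demo_file}"
```

An empty line inside a block would end the range early, and everything after it in that block would be silently ignored.
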

 

Function file for a specific service: skincare.fn

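# description: overrides the script's functions
# note: labels are separated by an empty line, statements inside a label must not contain empty lines, and the file must end with an empty line.
# date: 2020-11-07
# version: 2.0.0
# author: LBC

# Naming conventions
    # Global variables: upper case with underscores
    # Functions: first letter of each word capitalized
    # Local variables: lower case with underscores

# Version number scheme
    # FunctionName_ScriptMajorVersion.SequenceNumber

# Formatted log output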
[Log_2.1]
# $1: title at the start of the line
# $2: text content
# $3: number of text items printed per line (unit: words)
Log()
{
    if [ ${LOG_ON:-0} -eq 1 ] ; then
        local text_counter=0
        local print_text=''
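        # $1: leading word of the line (a sequence number, a title, etc.)
        local text_title=$1
        # $2: text content to print
        local text_list=$2
        # $3: number of text items per output line (unit: words); defaults to 0, which prints the title and text unchanged on one line.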
        local text_merge_line=${3:-0}
        if [ ${text_merge_line} -eq 0 ] ; then
            echo "${text_title} ${text_list}"
            return 1  
        fi
        #
        print_text="${text_title}"
        for text in ${text_list:-" "}
        do
            if [ ${text_counter} -eq ${text_merge_line} ] ; then
                print_text="${print_text}\n  ${text}"
                text_counter=0
            else
                print_text="${print_text}  ${text}"
            fi
            text_counter=$((${text_counter}+1))
        done
        echo -e "${print_text}"
    fi
}

# Convert to SQL field format
[ConvertToSqlFieldFormat_2.1]
# $1 data source
# $2 number of text items per group in the data source
# The formatted result has 6 text items per group
ConvertToSqlFieldFormat()
{
    local news_yymmdd=''
    local news_id=''
    local news_hhmm=''
    local news_category=''
    local news_web_path=''
    local news_title=''
    #
    local return_format_data=''
    # $2: number of text items per row (6)
    local text_group=$2
    local text_counter=1
    #
    for text in $1
    do
        case ${text} in 
            ????-??-??)
                news_yymmdd=${text}
            ;;
            ???????)
                news_id=${text}
            ;;
            ??:??)
                news_hhmm=${text}
            ;;
            ??|???|????)
                news_category=${text}
            ;;
            ????*_*.*)
                news_web_path=${text}
            ;;
            *)
                news_title=${text}
            ;;
        esac
        #
        # Every 6 text items form one record; their order matches the column order of the SQL table
        if [ $(( ${text_counter} % ${text_group} )) -eq 0 ] ; then
            return_format_data=" ${return_format_data} \
                ${news_yymmdd//-/} \
                ${news_id} \
                '${news_hhmm}' \
                '${news_category}' \
                '${WEB_URL_INDEX}${news_web_path}' \
                '${news_title}' \
            "
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    echo "${return_format_data}"
}

# Building MySQL Insert Statements
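# $1 statement prefix containing the table name (for example "insert into news values")
# $2 data source
# $3 number of text items that make up one row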
[BuildingMySqlInsertStatement_2.1]
BuildingMySqlInsertStatement()
{
    local text_group=$3
    local text_counter=1
    #
    local fields_value=''
    #local MYSQL_INSERT_STATEMENT=''
    local insert_datas=''
    #
    for text in $2
    do
        fields_value="${fields_value}${text},"
        if [ $(( ${text_counter} % ${text_group} )) -eq 0 ] ; then
            # one complete row
            insert_datas="${insert_datas}(${fields_value%,}),"
            fields_value=''
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    echo "${1}${insert_datas%,}"
}

# Main_2.1: for testing only, to check whether the Main of the script can be replaced
[Main_2.1]
Main()
{
    Log 'Main'
    Log '--' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4
    #
    # Fetch the title data of the pages
    local web_pages_titles_data=''
    for web_page_url in $web_pages_urls
    do
        Log '--' "Get Web Page Datas From ${web_page_url}"
        web_pages_titles_data=$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4)
        Log '' "${web_pages_titles_data}" 4
    done
}

# Main_2.2: production version
[Main_2.2]
Main()
{
    Log 'Main'
    # Check that Redis and MySQL are running.
    if [ -z "${REDIS_HOSTNAME}" ] ; then
        Log ' --' 'Redis is not running.'
        return
    fi
    if [ -z "${MYSQL_HOSTNAME}" ] ; then
        Log ' --' 'MySQL is not running.'
        return
    fi
    #
    # Collect all page URLs,
    # in order from https://kuaixun.stcn.com/index_1.html to https://kuaixun.stcn.com/index_19.html
    # web_pages_urls='https://kuaixun.stcn.com/index_1.html https://kuaixun.stcn.com/index_2.html ... index_19.html'
    Log ' --' "Get Web Page Urls From ${WEB_URL_INDEX}"
    local web_pages_urls=$(GetWebPageUrls "${WEB_URL_INDEX}" "${WEB_URL_PATH_PREFIX}" "${WEB_URL_PATH_NUMBER}" "${WEB_URL_PATH_SUFFIX}")
    Log '' "${web_pages_urls}" 4
    #
    # Holds the title data of all pages
    local web_pages_titles_data=''
    #
    # Holds the title data of a single page
    local web_page_title_data=''
    #
    local mysql_stock="mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -h${MYSQL_HOSTNAME} -P${MYSQL_PORT} ${MYSQL_DATABASE} -e "
    #
    # Redis key naming convention: service name-data type-description
    ## Example: stock_news-string-latest_stored_title_data_id
    ## stock_news: service name
    ## string: data type of the value
    ## latest_stored_title_data_id: the script variable
    #
    local redis_cli="redis-cli -h ${REDIS_HOSTNAME} -p ${REDIS_PORT} "
    #
    # The newest title already saved to MySQL is kept in Redis, so the next crawl can tell which data is already stored locally
    # Format of the value stored in Redis: a fragment of the URL, e.g. 20210107_2712148
    # Redis key: stock_news-string-latest_stored_title_data_id
    local latest_stored_title_data_id_at_redis=$(${redis_cli} get stock_news-string-latest_stored_title_data_id)
    if [ -z "${latest_stored_title_data_id_at_redis}" ] ; then
        ${redis_cli} set stock_news-string-latest_stored_title_data_id 'is null'
        latest_stored_title_data_id_at_redis='is null'
    fi
    #
    # Whether a page contains data already saved to MySQL
    local is_the_page_data_stored=0
    #
    for web_page_url in $web_pages_urls
    do
        # date, id, time, category, url, title
        Log ' --' "Get Web Page Datas From ${web_page_url} "
        #
        # Fetch the title data of the page, format (4 fields): time url title date
        # Raw title format on the page: 20:22 ./ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升 2021-01-07
        Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4
        #
        Log ' --' "Adjust data format"
        web_page_title_data="$(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4 |sed -rn 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 \2\3\4_\5\6 \7#p')"
        # Adjusted format (6 fields): date id (taken from the url) time category (taken from the url) url title
        # Adjusted example: 2021-01-07 2712148 20:22 ss ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升
        Log '' "${web_page_title_data}" 6
        #
        web_pages_titles_data="${web_pages_titles_data} ${web_page_title_data}"
        #web_pages_titles_data="${web_pages_titles_data} $(Log '' "$(GetWebPageTitleDatas ${web_page_url})" 4 |sed -rn 's#[ ]*([0-9:]+)[ ]*\./+([a-z]+)([0-9a-z/]+t)([0-9]+)_([0-9]+)([.a-z]+)[ ]*(.*)[ ]+([0-9-]+)#\8 \5 \1 \2 '${WEB_URL_INDEX}' \2\3\4_\5\6 \7#p')"
        #
        # Does the current page contain data that is already stored locally?
        Log ' --' "Is the latest data[${latest_stored_title_data_id_at_redis}] of redis in this page ?"
        # Compare with the value from Redis: if it appears on this page, leave the loop and stop loading, because the following pages are already in the database.
        echo "${web_page_title_data}" | fgrep -q "${latest_stored_title_data_id_at_redis}"
        if [ $? -eq 0 ] ; then
            Log ' --' "The latest data from Redis exist in this page, which stops data crawling from subsequent pages."
            is_the_page_data_stored=1
            break
        else
            Log ' --' 'The data of this page will be stored.'
        fi
    done
    #
    #
    # Titles are not always published in chronological order, so the data is re-sorted (descending) to put the newest record first.
    Log ' --' 'Data Sorting'
    local web_pages_titles_data_of_sort_desc="$(Log '' " ${web_pages_titles_data} " 6 | sort -n -k1 -k2 -t' ' -r)"
    Log '' "${web_pages_titles_data_of_sort_desc}" 6
    unset web_pages_titles_data
    #
    #
    # From the sorted data, drop the records that are already stored locally
    Log ' --' 'Collate Unsaved Data'
    if [ ${is_the_page_data_stored} -eq 1 ] ; then
        web_pages_titles_data_of_sort_desc="${web_pages_titles_data_of_sort_desc%%${latest_stored_title_data_id_at_redis##*_}*}"
    fi
    #
    #
    Log ' --' 'Convert To SQL Field Format'
    # Adjusted format (6 fields): 2021-01-07 2712148 20:22 ss ss/202101/t20210107_2712148.html 2020年珠江水运内河货运量企稳回升
    # Stored in MySQL (6 fields): 20210107 2712148 '20:22' 'ss' 'https://kuaixun.stcn.com/ss/202101/t20210107_2712148.html' '2020年珠江水运内河货运量企稳回升'
    # In the MySQL table design, the first 6 columns use fixed-length data types
    local web_pages_titles_data_convert_to_sql_field="$(ConvertToSqlFieldFormat "${web_pages_titles_data_of_sort_desc}" 6)"
    Log '' "${web_pages_titles_data_convert_to_sql_field}" 6
    unset web_pages_titles_data_of_sort_desc
    #
    #
    # If the data is empty, everything on the crawled pages is already saved in MySQL.
    if [ -z "${web_pages_titles_data_convert_to_sql_field}" ] ; then
        Log ' --' 'There Is No Data To Save'
        return
    fi
    #
    #
    # Take the first two text items (date, id) of the newest record; they are saved to Redis
    local text_counter=1
    for text in ${web_pages_titles_data_convert_to_sql_field}
    do
        if [ ${text_counter} -eq 2 ] ; then
            latest_stored_title_data_id_at_redis="${latest_stored_title_data_id_at_redis}_${text}"
            break
        else
            latest_stored_title_data_id_at_redis="${text}"
        fi
        text_counter=$(( ${text_counter} + 1 ))
    done
    #
    #
    Log ' --' 'Building MySQL Insert Statements'
    local mysql_insert_statement="$(BuildingMySqlInsertStatement "insert into ${MYSQL_TABLE} values" "${web_pages_titles_data_convert_to_sql_field}" 6)"
    Log '' "${mysql_insert_statement}" 6
    #
    #
    Log ' --' 'Data Insert To MySQL'
    echo "${mysql_stock} ${mysql_insert_statement}"
    #
    #
    Log ' --' 'Latest Data Update To Redis'
    ${redis_cli} set stock_news-string-latest_stored_title_data_id "${latest_stored_title_data_id_at_redis}"
    # Save the URLs of all page titles to Redis under the key stock_news-llist-web_pages_titles_of_urls
    # stock_news: service name
    # llist: a list filled with LPUSH
    # web_pages_titles_of_urls: the corresponding script variable
    ${redis_cli} LPUSH stock_news-llist-web_pages_titles_of_urls $(echo ${web_pages_titles_data_convert_to_sql_field} | grep -Eo 'http[0-9a-z./_:]+')
}

 

posted on 2021-01-16 16:47 by 3L·BoNuo·Lotus