Data Download Work Notes 3: Scripts

I wrote three scripts in total. This was my first time writing shell scripts, so they are pretty clumsy. Looking back over them after finishing, the quality really is poor.

Script 1: addLinkFiles.sh

In the current directory there is a file named xmlfile.xml, which is edited by hand. Tags in the file mark out two levels of directories, for example:

<class>Life-sciences</class>

<dataset>uniprot</dataset>

<location>http://www.example.com/example.nt.gz</location>

I did not use a real XML tree structure, because parsing one in shell is just too painful. Instead, relative position expresses the parent-child relationship, so this is not strictly an XML file at all; it's a hybrid. The three lines above mean that the class contains a dataset, and the dataset contains one download link, corresponding to the directory structure "Life-sciences/uniprot/link", where the link file stores the links. On startup, addLinkFiles.sh checks this file, and under ./download/data/ checks whether the file linkpath = ${class}/${dataset}/link exists. If not, it creates the directory and file, appends the download link to the link file, and appends ${linkpath} to ${modifiedLinkFile}. Every ${interval} seconds, the script checks the modification time of ${xmlfile}; if it has changed, a new location has been added, so the script compares the directory tree against the xml file's content and creates directories and files for the newly added entries.
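The tag-and-value extraction that the script performs with awk field separators can be sketched in isolation (a minimal example; the sample line is hypothetical):

```shell
#!/bin/bash
# Pull the tag name and its value out of one pseudo-XML line,
# the same way addLinkFiles.sh does with awk field separators.
line='<class>Life-sciences</class>'

# Split on '<' or '>': field 2 is the tag name.
tag=$(echo "$line" | awk -F '<|>' '{print $2}')

# Split on the opening/closing tag: field 2 is the value.
value=$(echo "$line" | awk -F "<$tag>|</$tag>" '{print $2}')

echo "$tag=$value"    # prints: class=Life-sciences
```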

#!/bin/bash
#*********************************************************
#addLinkFiles.sh
#Keep checking the $xmlfile.
#The $xmlfile should have only 3 tags: class, name, location.
#
#last edited 2013.09.03 by Lyuxd.
#
#*********************************************************

#******************
#----init----------
#******************
interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
link="link"
log="add.log"
modifiedLinkFile="modifiedLinkFile"
xmlfile="xmlfile.xml"
level1="class"
level2="name"
level3="location"
currentClass=$rootDir
currentDataSet=$rootDir
xmlLastMT=0
cd $rootDir

#****************************************
#------Create Data, Log Directories------
#****************************************
if [ ! -d "$dataDir" ];then
    mkdir "$dataDir"
fi
if [ ! -d "$logDir" ];then
    mkdir "$logDir"
fi

#****************************************
#------Parsing the xmlfile------
#****************************************
if [ ! -f "$xmlfile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`No xmlfile found. exit." >> "$logDir/$log"
    exit 1
fi

#Check the modified-time of xmlfile every $interval sec.
#If the modified-time changed, parse xmlfile.
while true
do
    xmlMT=$(stat -c %Y "$xmlfile")
    if [ "$xmlLastMT" -lt "$xmlMT" ];then
        xmlLastMT=$xmlMT
        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`parsing $xmlfile..." >> "$logDir/$log"
        while read line
        do
            #skip empty lines.
            if [ "$line"x != x ];then
                tag=$(echo $line | awk -F "<|>| " '{print $2}')
                #check if the "class" directory exists. If not, create it.
                if [ "$tag"x = "$level1"x ]; then
                    currentClass=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    currentDataSet=$rootDir
                    if [ ! -z "$currentClass" ] && [ ! -d "$dataDir/$currentClass" ]; then
                        mkdir "$dataDir/$currentClass"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass" >> "$logDir/$log"
                    fi
                #check if the "name" directory exists. If not, create it.
                elif [ "$tag"x = "$level2"x ] && [ "$currentClass" != "$rootDir" ]; then
                    currentDataSet=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    if [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ ! -d "$dataDir/$currentClass/$currentDataSet" ]; then
                        mkdir "$dataDir/$currentClass/$currentDataSet"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`mkdir $dataDir/$currentClass/$currentDataSet" >> "$logDir/$log"
                    fi
                #check if the "link" file exists. If not, create it.
                elif [ "$tag"x = "$level3"x ] && [ ! -z "$currentClass" ] && [ ! -z "$currentDataSet" ] && [ "$currentDataSet" != "$rootDir" ] && [ -d "$dataDir/$currentClass/$currentDataSet" ]; then
                    if [ ! -f "$dataDir/$currentClass/$currentDataSet/$link" ]; then
                        touch "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Create link file : $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                    fi
                    newRecord=$(echo $line | awk -F "<$tag>|</$tag>" '{print $2}')
                    ifexist=$(grep "$newRecord" "$dataDir/$currentClass/$currentDataSet/$link")
                    if [ ! -z "$newRecord" ] && [ -z "$ifexist" ]; then
                        #no identical record exists yet.
                        echo "$newRecord" >> "$dataDir/$currentClass/$currentDataSet/$link"
                        echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Add new link $newRecord to $dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/$log"
                        echo "$dataDir/$currentClass/$currentDataSet/$link" >> "$logDir/modifiedLinkFile.tmp"
                    fi
                else
                    echo "`date "+%Y.%m.%d-%H:%M:%S--ERROR: "`Failed to process $line" >> "$logDir/$log"
                fi
            fi
        done <$xmlfile
        #******************************
        #modifiedLinkFile.tmp contains the paths that were modified in the last loop.
        #Deduplicate modifiedLinkFile.tmp --> modifiedLinkFile
        #******************************
        if [ -f "$logDir/modifiedLinkFile.tmp" ]; then
            cat "$logDir/modifiedLinkFile.tmp" | awk '!a[$0]++{"date \"+%Y%m%d%H%M%S\""|getline time; print time,$0}' >> "$logDir/$modifiedLinkFile"
            rm "$logDir/modifiedLinkFile.tmp"
        else
            touch "$logDir/$modifiedLinkFile"
        fi
    fi
    #echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`parsing end." >> "$logDir/$log"
    sleep $interval
done
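The deduplication near the end relies on the classic awk idiom `!a[$0]++`, which keeps only the first occurrence of each line while preserving order. In isolation, with hypothetical input:

```shell
#!/bin/bash
# a[$0] counts how often a line has been seen; !a[$0]++ is true
# only on the first occurrence, so duplicates are dropped.
printf 'path/a\npath/b\npath/a\n' | awk '!a[$0]++'
# prints:
# path/a
# path/b
```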

Script 2: checkmodifiedLinkFiles.sh

The modifiedLinkFile is filled in by script 1 above. Script 2 checks the modification time of modifiedLinkFile every interval seconds. If the modification time changed, the file was modified by script 1; in other words, new download links were added to xmlfile and the corresponding directories were created. Script 2 then takes the records out of modifiedLinkFile (each record is the absolute path of a newly created link file) and calls script 3, monitor.sh, to run the download task.
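Each record in modifiedLinkFile has the form `timestamp /absolute/path/to/link`, and the path has to be split into a download directory and a link-file name before monitor.sh is called. A self-contained sketch of that splitting (the record below is hypothetical; parameter expansion stands in for the awk calls the script uses):

```shell
#!/bin/bash
# Split one modifiedLinkFile record the way checkmodifiedLinkFiles.sh
# does before invoking monitor.sh.
record='20130910120000 /home/data/Life-sciences/uniprot/link'

newLink=$(echo "$record" | awk '{print $2}')   # the absolute path
linkfileName=${newLink##*/}                    # strip up to the last '/': link
downloadDir=${newLink%/*}/                     # keep everything up to the last '/'

echo "$downloadDir $linkfileName"    # prints: /home/data/Life-sciences/uniprot/ link
```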

#!/bin/bash
#*************************************************
#This script reads in modifiedLinkFile, 
#for every record calling monitor.sh.
#monitor.sh /home/class/name "wget -c -i link -b"
#
#last edited 2013.09.10 by lyuxd.
#
#*************************************************



interval=10
rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
modifiedLinkFile="$logDir/modifiedLinkFile"
modifiedLinkFileMT="$logDir/modifiedLinkFile.MT"
log=$logDir"/check.log"
maxWgetProcess=5
echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`check is running...">>$log
#*****************************************
#-----------restart interrupted tasks-----
#*****************************************
if [ -f "$runningTask" ]; then
   while read line
   do
    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    while [ $counterWgetProcess -ge $maxWgetProcess ]
    do
        sleep 20
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
    done
    echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $line." >> $log 
    nohup "./monitor.sh" "$line" "wget -nd -c -i link -b" >> /dev/null &
    sleep 1
   done <$runningTask 
fi 


#*********************************
#------------failedQueue-----
#*********************************
#if [ -f "$failedqueue" ] && [ `ls -l "$failedqueue"|awk '{print $5}'` -gt "0" ];then
#    line=($(awk '{print $0}' $failedqueue))
#    echo ${line[1]}
#    :>"$failedqueue"
#    for ((i=0;i<${#line[@]};i++))
#    do
#    counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        while [ $counterWgetProcess -ge $maxWgetProcess ]
#        do
#            sleep 20
#            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
#        done
#        echo "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b"
#        "./monitor.sh" "${line[i]}" "wget -nd -c -i link -b" >> /dev/null &
#ex "$failedqueue" <<EOF
#1d
#wq
#EOF
#    done
#fi
#***************************************************
#------------check new task in modifiedLinkFile-----
#***************************************************
if [ ! -f "$modifiedLinkFile" ];then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`No modifiedLinkFile found. checkmodifiedLinkFiles.sh exit 1." >> $log
    exit 1
fi
if [ ! -f "$modifiedLinkFileMT" ];then
    echo "0" > "$modifiedLinkFileMT"
fi
while true
do

newMT=$(stat -c %Y $modifiedLinkFile|awk '{print $0}')
oldMT=$(awk '{print $0}' "$modifiedLinkFileMT")

if [ "$newMT" != "$oldMT" ]; then
while read line
do    
    if [ ! -z "$line" ] && [ "$line" != "" ]; then
        counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        while [ $counterWgetProcess -ge $maxWgetProcess ]
        do
            #echo "waiting 20sec"
            sleep 20
            counterWgetProcess=$(ps -A|grep -c "monitor.sh")
        done
            newLink=$(echo $line |awk '{print $2}')
            
            linkfileName=$(echo $newLink |awk -F "/" '{print $NF}')
            downloadDir=$(echo $newLink|awk -F "$linkfileName" '{print $1}')
            echo "`date "+%Y.%m.%d-%H:%M:%S--INFO: "`Call ./monitor for $downloadDir." >> $log
            "./monitor.sh" "$downloadDir" "wget -nd -c -i $linkfileName -b" >> /dev/null &
            sleep 1
    fi
done <$modifiedLinkFile
: > $modifiedLinkFile
echo $(stat -c %Y $modifiedLinkFile|awk '{print $0}') > "$modifiedLinkFileMT"
#else
    #echo "nothing to do"
fi
sleep $interval
done

Script 3: monitor.sh This script is mainly called by script 2 and performs the actual download. Before downloading, it creates a wgetlog directory under, e.g., Life-sciences/uniprot, which holds the wget download log. While downloading, monitor.sh keeps checking the size of the log file every interval seconds. As soon as the size has not changed between two consecutive checks, it inspects the last three lines of the log; if it finds a keyword such as FINISH or failed, it stops the download and sends a notification by mail. If no keyword is found in those last three lines, the assumption is that the network is the problem, i.e. the download speed dropped to 0 and the log stopped growing; after another interval it re-checks the log size, repeating this up to maxtrytimes times in total. If the log still has not grown, the error is reported by mail.
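The stall-handling decision can be sketched on its own: once the log size stops growing, the last lines of the log decide whether the task finished, failed, or merely stalled. A minimal version with a fake log (the log content below is hypothetical; real wget logs end with a line containing FINISHED):

```shell
#!/bin/bash
# Classify the state of a download from the tail of its wget log,
# the way monitor.sh does once the log size stops changing.
wgetlog=$(mktemp)
printf 'saving file...\nFINISHED --2013-09-04--\n' > "$wgetlog"

message=$(tail -n3 "$wgetlog")
if echo "$message" | grep -q "FINISH"; then
    echo "download finished"
elif echo "$message" | grep -q "ERROR\|failed"; then
    echo "download failed: report by mail"
else
    echo "stalled: re-check after the interval"
fi
rm -f "$wgetlog"
```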

#!/bin/bash
#*********************************************************
#Monitor a download directory.
#One monitor.sh process is started per download task.
#If some url in $downloadDir/link can't be reached, monitor
#will log "WARNING". If the download failed, log "ERROR". If 
#finished, log "FINISH".
#mail to $mailAddress.
#
#Last edited 2013.09.04 by Lyuxd. 
#
#*********************************************************


#every $interval sec check the size of wgetlog.
interval=30

#if the size of wgetlog stays the same, re-check up to $maxtrytimes times
maxtrytimes=5


downloadDir=$1
command=$2

rootDir=${PWD}
dataDir=$rootDir"/data"
logDir=$rootDir"/log"
log=$logDir"/monitor.log"
wgetlogDir="$downloadDir/wgetlog"
wgetlogname="`date +%Y%m%d%H%M%S`-wgetlog"
wgetlog="$wgetlogDir/$wgetlogname"
failedqueue="$logDir/failedQueue"
runningTask="$logDir/runningTask"
mailAddress="15822834587@139.com"
lastERROR="e"
addtoBoolean=0


cd $downloadDir
sleep 1
counterMail=0


echo "`date "+%Y.%m.%d-%H:%M:%S--"`Monitor for directory: ${PWD}.">> $log
whereAmI=$(echo ${PWD} | awk -F "/" '{print $NF}')
if [ ! -d $wgetlogDir ]; then
mkdir $wgetlogDir
fi
# Put the current task into runningTask in case of a power-off. When checkmodifiedLinkFiles.sh starts up, runningTask is checked for interrupted tasks, and any interrupted task is started again by checkmodifiedLinkFiles.sh.
isexist=$(grep "$downloadDir" "$runningTask" 2>/dev/null)
if [ -z "$isexist" ];then
echo "$downloadDir" >> "$runningTask"
fi

#Begin downloading ($command already contains -b, so wget backgrounds itself).
$command -o "$wgetlog"


#Check the size of the logfile every $interval seconds.
#Keep checking until the size is the same as in the
#last check, then wait another $interval-long period and
#try again, try again... ($maxtrytimes tries in total).
#Read in wgetlog to find out whether something
#went wrong.
#Mail to $mailAddress.
trytimesRemain=$maxtrytimes
logoldsize=0
sleep 10
lognewsize=$(echo $(ls -l $wgetlog | awk '{print $5}'))
while [ ! -z "$lognewsize" ] && [ "$trytimesRemain" -gt 0 ]
do


# If log's size stays unchanging in $interval*$maxtrytime
# find "FINISH" from log. 
# 
    if [ "$lognewsize" -eq "$logoldsize" ];then
        message=$(tail -n3 "$wgetlog")
        level=$(echo $message|grep "FINISH")
        if [ -z "$level" ];then
            trytimesRemain=`expr $trytimesRemain - 1`
            echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir Download speed 0.0 KB/s. MaxTryTimes=$maxtrytimes. Try(`expr $maxtrytimes - $trytimesRemain`). ">> $log
        else
            break
        fi
    else
        trytimesRemain=$maxtrytimes
    fi


    ERROR=$(tail -n250 "$wgetlog" | grep "ERROR\|failed")
    if [ ! -z "$ERROR" ] && [ "$ERROR" != "$lastERROR" ] && [ "$counterMail" -lt 5 ]
        then
        echo "`date "+%Y.%m.%d-%H:%M:%S--"`WARNING: $downloadDir $ERROR. mail to $mailAddress.">> $log
        echo -e "${PWD}\n$ERROR\n"|mutt -s "Wget Running State : WARNING in $whereAmI" $mailAddress
        counterMail=$((counterMail+1))
        lastERROR=$ERROR
        addtoBoolean=1
    fi
    logoldsize=$lognewsize
    sleep $interval
    lognewsize=$(echo $(ls -l $wgetlog | awk '{print $5}'))
done

if [ ! -z "$level" ]
    then
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`FINISH: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : FINISH $whereAmI--RUNNING $(ps -A|grep -c wget)" $mailAddress
    counterMail=$((counterMail+1))
else
    echo "`date "+%Y.%m.%d-%H:%M:%S--"`ERROR: $message. mail to $mailAddress.">> $log
    echo -e "`date '+%Y-%m-%d +%H:%M:%S'`\n${PWD}\n$message\n"|mutt -s "Wget Report : ERROR in $whereAmI" $mailAddress
    addtoBoolean=1
    counterMail=$((counterMail+1))
fi

if [ "$addtoBoolean" -eq "1" ];then
echo "$downloadDir" >> "$failedqueue"
fi


#Remove the interrupted task from runningTask.
sed -i "/$whereAmI/d" "$runningTask"
echo "`date "+%Y.%m.%d-%H:%M:%S--"`$downloadDir Monitor ending.">> $log

Summary: This was my first time writing shell scripts, and pretty much every edit introduced a pile of new errors. The quality of the scripts is poor, but fortunately the coupling between the three is not too high and the division of labor is fairly clear, which made things a lot easier. My work machine is on the education network, while the downloads run over a China Unicom PPPoE dial-up connection, so ssh access is quite slow. Even though all the work has been reduced to maintaining a single xml file (well, strictly speaking it is not an XML file at all, just text with tags), waiting three or four seconds for every character typed over ssh is unbearable. So the next step is to rewrite the work of the first script in Java and manage the xml file on the web.

posted on 2013-09-11 11:12 by 甲马