shell实现大批量word转码然后分析相关字段

需求

需要从服务器中的所有附件(2013-2019) 共60G查找相关字段

在linux上面直接打开doc等是乱码的

思路

先全部附件转码为txt, 然后用grep遍历查找字段实现

转码shell

#!/bin/bash
#*************************************************************************
#         > File Name: doc.sh
#         > Author: chenglee
#         > Main : chengkenlee@sina.com
#         > Blog : http://www.cnblogs.com/chenglee/
#         > Created Time : 2019年04月10日 星期三 15时16分41秒
#*************************************************************************
Year="2018"
format="txt"
savedir=$(cd `dirname $0`; pwd)
filetxt="filetxt"
 
ls -l ${Year}/ |awk '/^d/ {print $NF}' > ${Year}.logs
 
function Find(){
    for element in `ls $1`
    do
        dir_or_file=$1"/"$element
        if [ -d $dir_or_file ]
        then
            Find $dir_or_file
        else
            echo $dir_or_file
        fi
    done
}
function Filter(){
    cat filelogs | grep doc | grep -v 'pdf\|zip\|rar\|pptv' > filedir
}
function Unoconv(){
    exec 2<"filedir"
    while read line2<&2
    do
        unoconv -f ${format} ${line2}
        echo "[${line2}] 已转码..."
        #mv *.txt ${filetxt}/${Year}
    done
}
function Move(){
    exec 4<"${Year}.logs"
    while read line4<&4
    do
        mv ${Year}/${line4}/*.txt ${savedir}/${filetxt}/${Year}/
    done
}
function Filetxt(){
    if [ -d "${filetxt}/${Year}" ];then
        root_dir="${Year}"
        Find $root_dir > filelogs
        Filter
        sum=`cat filedir | wc -l`
        echo "总数为:${sum}"
        Unoconv
    else
        mkdir -p ${filetxt}/${Year}
        root_dir="${Year}"
        Find $root_dir > filelogs
        Filter
        sum=`cat filedir | wc -l`
        echo "总数为:${sum}"
        Unoconv
    fi
}
function main(){
    Filetxt
    echo "全部文件已实现转码为txt类型"
    Move
    echo "已转码的文件已转移到${savedir}/${filetxt}/${Year}/下"
}
main

注:先遍历附件中列出日期扔进filelogs这个文件和新建相对文件夹, 然后把所有能转码的doc和docx文件全部扔进filedir文件, 然后脚本直接识别这个文件中的目录文件, 转码方式是libreoffice+unoconv, 全部转码完成会自动把已转好的txt文件转移到filetxt这个文件夹中.

注:我这是双开

工具

1	`yum` `install` `libreoffice unoconv -y`

注:也可以自己下载包安装, 我偷个懒是直接yum拉取的

检索shell

#!/bin/bash
#*************************************************************************
#         > File Name: crawler.sh
#         > Author: chenglee
#         > Main : chengkenlee@sina.com
#         > Blog : http://www.cnblogs.com/chenglee/
#         > Created Time : 2019年04月10日 星期三 10时52分31秒
#*************************************************************************
 
filetxt="TXT"
 
function If(){
    exec 6<"NameFile"
    while read line6<&6
    do
        grep -rn "${line6}" ${filetxt}/ > logs/result-${line6}.logs
        echo "检索${line6}完毕..."
    done
}
function main(){
    If
}
main

注:全部转好之后,新建一个文件, 名称为NameFile, 里面换行写入需要查找的字段, 然后脚本会自动去读每行字符作为变量, 然后把所有结果扔进logs这个文件夹.

维护shell

#*************************************************************************
#         > File Name: unockill.sh
#         > Author: chenglee
#         > Main : chengkenlee@sina.com
#         > Blog : http://www.cnblogs.com/chenglee/
#         > Created Time : 2019年04月10日 星期三 22时20分45秒
#*************************************************************************
#!/bin/bash
 
function killAll(){
    echo "等待10秒开始判断"
    sleep 10;
    StringName=`ps aux | grep unoconv | grep -v grep | awk -F '/' '{print$NF}' | awk -F '.' '{print$1}'`
    if [ "$stringname" != "$StringName" ];then
        echo "[转码正常]"
    else
        echo "[卡住了]... 准备干掉当前进程"
        ps aux | grep unoconv | grep -v grep | awk -F ' ' '{print$2}' | xargs kill -9
    fi
}
function main(){
    while [ "1" = "1" ]
    do
        stringname=`ps aux | grep unoconv | grep -v grep | awk -F '/' '{print$NF}' | awk -F '.' '{print$1}'`
        killAll
    done
}
main

注:这个是配合转码shell一起使用的, 每10秒检测一下进程(时间可以根据自己调, 一个一般5秒之内能转好), 如果卡住了, 干掉当前的进行下一个.

posted @ 2019-04-11 14:59 扶苏公子x 阅读(529) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

阅读排行：
· 传国玉玺易主，ai.com竟然跳转到国产AI
· 本地部署 DeepSeek：小白也能轻松搞定！
· 自己如何在本地电脑从零搭建DeepSeek！手把手教学，快来看看！ (建议收藏)
· 我们是如何解决abp身上的几个痛点
· 普通人也能轻松掌握的20个DeepSeek高频提示词（2025版）

阅读目录(Content)

此页目录为空

夏季节

动心忍性，增益其所不能

念两句诗

shell实现大批量word转码然后分析相关字段

需求

思路

转码shell

工具

检索shell

维护shell

公告

个人信息

日历

积分与排名

随笔分类 (337)

文章档案 (2)

阅读排行榜