A script for automatically consolidating multiple Word files from a certain reading course

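- Preface

　　I have been wanting to practice my English recently, and I found a lot of 流利阅读 material on my disk. The folders are laid out as: 流利阅读201X年 > X月 > 0101 2019 年度色发布：活力珊瑚橘 > mp3, word. I wanted to merge the articles (news body only) so that they are easy to print. After collating a few by hand I found it quite time-consuming, so I decided to write a script to automate the job.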

- Environment setup

　　Runtime: Python 3.6, Spyder

　　Dependencies: win32com (from the pywin32 package), python-docx, etc.

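- Getting to work

1. Extracting the news body

　　A 流利阅读 article has five parts: 今日导读 (today's introduction), 带着问题听讲解 (questions to keep in mind), 新闻正文 (news body), 重点词汇 (key vocabulary), and 拓展内容 (further reading). My goal is to extract the news body of each article. The approach is to locate the paragraphs where the body starts and ends, which can be found by searching for the headings “新闻正文” and “重点词汇”.

　　The code is as follows: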

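import re

import docx  # provided by the python-docx package


def getText(file):
    """Return the news body of one article as a single string."""
    doc=docx.Document(file)
    start=None  # index of the paragraph containing "新闻正文" (body starts after it)
    end=None    # index of the paragraph containing "重点词汇" (body ends here)
    for i, para in enumerate(doc.paragraphs):
        if re.search('新闻正文',para.text):
            start=i
        if re.search('重点词汇',para.text):
            end=i
    # Compare against None so that a marker in paragraph 0 still counts as found.
    if start is None or end is None or end<=start:
        raise ValueError(file+':  read failure!!!')
    fullText = []
    for para in doc.paragraphs[start+1:end+1]:
        # replace() removes the literal marker; str.strip('重点词汇') would instead
        # strip any of those four characters from both ends of every paragraph.
        fullText.append(para.text.replace('重点词汇',''))
    return '\n'.join(fullText)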
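
　　The marker search can be tried out without any Word file. Below is a minimal sketch that applies the same idea to a plain list of paragraph strings. `extract_between` and its parameter names are hypothetical, introduced only for illustration; unlike getText it stops at the first end marker that follows the start marker and leaves the end-marker paragraph out entirely.

```python
def extract_between(paragraphs, start_marker='新闻正文', end_marker='重点词汇'):
    """Return the paragraphs strictly between the two marker paragraphs.

    Hypothetical helper: `paragraphs` is a list of strings, for example
    [p.text for p in doc.paragraphs] when working with python-docx.
    """
    start = end = None
    for i, text in enumerate(paragraphs):
        if start is None:
            if start_marker in text:
                start = i          # first paragraph holding the start marker
        elif end_marker in text:
            end = i                # first end marker after the start marker
            break
    if start is None or end is None:
        raise ValueError('markers not found')
    return paragraphs[start + 1:end]


sample = ['今日导读', 'intro', '新闻正文', 'First line.', 'Second line.',
          '重点词汇', 'coral']
print('\n'.join(extract_between(sample)))
```

　　Working on plain strings keeps the search logic separate from file handling, so it can be checked on a hand-made list before running it over real documents.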

　　One remaining problem is that some articles are in .doc format. python-docx only supports .docx, so those files have to be converted to .docx first:

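    # Requires: import os; import win32com.client as wc
    try:
        doc=docx.Document(file)
    except Exception:           # not a valid .docx: convert .doc to .docx
        word = wc.Dispatch("Word.Application")
        doc = word.Documents.Open(file)
        (file_path, tempfilename) = os.path.split(file)
        (filename, extension) = os.path.splitext(tempfilename)
        #print(filename)
        # os.path.join adds the separator that file_path+filename left out
        file=os.path.join(file_path, filename+'.docx')
        doc.SaveAs(file, 12)   # 12 = wdFormatXMLDocument (.docx)
        doc.Close()
        doc=docx.Document(file)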
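
　　The name of the converted file can be worked out with `os.path` alone, which makes that part easy to check without starting Word. A minimal sketch, assuming the .docx should sit next to the original; `needs_conversion` and `docx_path_for` are hypothetical helper names, not part of the script above.

```python
import os


def needs_conversion(path):
    """True when the file has the legacy .doc extension (case-insensitive)."""
    return os.path.splitext(path)[1].lower() == '.doc'


def docx_path_for(path):
    """Return `path` with its extension replaced by .docx, in the same folder."""
    root, _ext = os.path.splitext(path)
    return root + '.docx'


source = os.path.join('2019', '01', 'article.doc')
if needs_conversion(source):
    print(docx_path_for(source))
```

　　Checking the extension up front is an alternative to waiting for docx.Document to fail; which of the two fits better depends on whether the extensions in the folder can be trusted.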

2. Collecting file names

　　Here I mainly record each article's name together with the path of its Word file.

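import os
import re

# Root folder of the material (left blank here; set it before running)
path=''
files=[]        # full path of each Word file
files_name=[]   # matching article title (the name of the article's folder)
for title1 in os.listdir(path):                  # first level under the root
    dir1=os.path.join(path, title1)
    for title2 in os.listdir(dir1):              # one folder per article
        dir2=os.path.join(dir1, title2)
        for title3 in os.listdir(dir2):          # files of a single article
            # 'doc' matches both the .doc and the .docx extension
            if re.search('doc',os.path.splitext(title3)[1]):
                files.append(os.path.join(dir2, title3))
                files_name.append(title2)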
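
　　The same collection step can also be written with `os.walk`, which does not depend on the folders being nested exactly three levels deep. This is only a sketch under that assumption; `find_word_files` is a hypothetical name, and skipping names that start with `~$` (the lock files Word creates next to an open document) is an addition that the loops above do not make.

```python
import os


def find_word_files(root):
    """Return (title, path) pairs for every .doc/.docx file under root.

    The title is the name of the folder that directly contains the file,
    mirroring how files_name is filled in the loops above.
    """
    found = []
    for folder, subdirs, names in os.walk(root):
        subdirs.sort()                     # visit folders in a stable order
        for name in sorted(names):
            ext = os.path.splitext(name)[1].lower()
            # skip Word's "~$" lock files, which carry the same extension
            if ext in ('.doc', '.docx') and not name.startswith('~$'):
                found.append((os.path.basename(folder),
                              os.path.join(folder, name)))
    return found
```

　　A call such as `find_word_files(path)` returns the titles and paths together, so the two parallel lists cannot drift out of step.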

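3. Merging a chosen range

　　Since I do not want to print that many articles in one go, I merge a fixed number of articles starting from a chosen position.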

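import docx
from docx.oxml.ns import qn

split=10  # number of articles to merge
where=10  # index of the first article (0-based)
new_doc=docx.Document()
for j in range(split):
    try:
        text=getText(files[j+where])
        new_doc.add_heading(files_name[j+where], 2)
        print(text)

        new_doc.add_paragraph(text)
    except Exception as err:
        print(err)  # report the skipped article instead of hiding the error
new_doc.styles['Normal'].font.name = 'Times New Roman' # Western font
new_doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'微软雅黑') # East Asian font
# name the file after the 1-based range of articles it covers
new_doc.save('xx//流利阅读2019_'+str(where+1)+'_'+str(where+split)+'.docx')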
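
　　The snippet above produces one file per run, with `where` edited by hand each time. As a rough sketch of how the ranges and file names could be generated for the whole collection, the following uses two hypothetical helpers, `batches` and `output_name`; the total of 25 articles is made up for the example.

```python
def batches(total, size, start=0):
    """Yield (first, last) 0-based index pairs, `size` articles at a time."""
    for first in range(start, total, size):
        last = min(first + size, total) - 1   # the final batch may be shorter
        yield first, last


def output_name(first, last, prefix='流利阅读2019'):
    """Build a file name from the 1-based range covered by one batch."""
    return '{}_{}_{}.docx'.format(prefix, first + 1, last + 1)


for first, last in batches(total=25, size=10):
    print(output_name(first, last))
# prints:
# 流利阅读2019_1_10.docx
# 流利阅读2019_11_20.docx
# 流利阅读2019_21_25.docx
```

　　Each (first, last) pair could then drive the merging loop in place of the fixed `split` and `where` values.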

- Results

TIM截图20190920113018.png

TIM截图20190920112650.png

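- Closing remarks

　　Because my ability is limited and I was short on time, the script is crude and verbose (well, it meets my needs for now, and that is good enough). Please bear with it; corrections are welcome.

　　I also wrote this post to share a few practical examples of working with Word documents, and I hope you find it helpful.

　　PS: The XX阅读 material mentioned in this post is for personal study only and is not distributed online; I accept no related legal liability.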

posted on 2019-09-20 11:32 云帆sc
