Batch-processing Word files with python-docx: pictures

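Pictures are a special kind of content in Word. This post covers how to batch-extract the pictures from a Word file with python-docx, and how to insert a picture into a Word file.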

1. Extracting the pictures in a Word file and saving them in a given format

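python-docx does not seem to offer a direct way to get pictures, and there is little material online. The only useful resource I found was this post:
How to get an image (InlineShape) from a python-docx paragraph
To be honest, I did not fully understand that post, and its method only retrieves inline pictures. I am not sure exactly what counts as an inline picture. What I do know is that the pictures already present in my Word file were not found by that method, while a picture I inserted with add_picture() was. Prompted by that post, I read through the python-docx source. I did not follow all of it, but I did pick up one useful fact: python-docx wraps the content of a Word file in proxy objects (the source calls them Proxy Type, and I keep that term here). Underneath, each one is XML, and different kinds of content get different proxy types. Taking Document as an example, document._element.xml shows the underlying XML: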
[Screenshot: output of document._element.xml]
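This is what the Word content looks like as Proxy Type XML (I have collapsed most of it). I have not worked with XML much, but the structure is visible: most tags have the form <w:x>, the whole document sits inside <w:document></w:document>, and each paragraph starts with <w:p> and ends with </w:p>. In a docx a picture also lives inside a paragraph, so we can find the paragraphs that hold pictures by walking the XML. For that to work, a picture paragraph must have some distinguishing feature, otherwise there is nothing to test for. Below is the Proxy Type XML of a paragraph that holds a picture: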

<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" w:rsidR="00B20677" w:rsidRDefault="00D47C7B" w:rsidP="00ED22C2">
  <w:pPr>
    <w:rPr>
      <w:rFonts w:ascii="宋体" w:eastAsia="宋体" w:hAnsi="宋体"/>
      <w:lang w:eastAsia="zh-CN"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="宋体" w:eastAsia="宋体" w:hAnsi="宋体"/>
      <w:lang w:eastAsia="zh-CN"/>
    </w:rPr>
    <w:pict>
      <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
        <v:stroke joinstyle="miter"/>
        <v:formulas>
          <v:f eqn="if lineDrawn pixelLineWidth 0"/>
          <v:f eqn="sum @0 1 0"/>
          <v:f eqn="sum 0 0 @1"/>
          <v:f eqn="prod @2 1 2"/>
          <v:f eqn="prod @3 21600 pixelWidth"/>
          <v:f eqn="prod @3 21600 pixelHeight"/>
          <v:f eqn="sum @0 0 1"/>
          <v:f eqn="prod @6 1 2"/>
          <v:f eqn="prod @7 21600 pixelWidth"/>
          <v:f eqn="sum @8 21600 0"/>
          <v:f eqn="prod @7 21600 pixelHeight"/>
          <v:f eqn="sum @10 21600 0"/>
        </v:formulas>
        <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
        <o:lock v:ext="edit" aspectratio="t"/>
      </v:shapetype>
      <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:6in;height:214.5pt">
        <v:imagedata r:id="rId8" o:title="syh"/>
      </v:shape>
    </w:pict>
  </w:r>
</w:p>

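The picture information sits inside the <w:pict></w:pict> tag, so we can use that tag to locate picture paragraphs.
document has a part attribute, and part has a related_parts property, defined as follows: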

@property
def related_parts(self):
    """
    Dictionary mapping related parts by rId, so child objects can resolve
    explicit relationships present in the part XML, e.g. sldIdLst to a
    specific |Slide| instance.
    """
    return self.rels.related_parts

Now the definition of rels.related_parts:

@property
def related_parts(self):
    """
    dict mapping rIds to target parts for all the internal relationships
    in the collection.
    """
    return self._target_parts_by_rId

self.rels.related_parts is a dictionary that maps an rId to the corresponding part. Conveniently, the picture's Proxy Type XML (the imagedata tag) carries exactly such an attribute:

<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:6in;height:214.5pt">
     <v:imagedata r:id="rId8" o:title="syh"/>
</v:shape>

As shown, the rId of this picture is rId8. Running

doc.part.related_parts['rId8']

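raised no error, and after writing the result to a file as an image came a pleasant surprise: it was exactly that picture.
Putting this together, getting the pictures takes three steps: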
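
Why an rId resolves to a picture can be seen without python-docx. A .docx file is a zip archive, and the relationships of the main document part are stored in word/_rels/document.xml.rels, one Relationship element per rId. The sketch below is only an illustration using the standard library: read_relationships and build_demo_package are names invented for this example, and the in-memory archive is a stand-in that mimics the relevant entries, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used inside word/_rels/document.xml.rels
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def read_relationships(docx_bytes):
    """Return {rId: target path} for the main document part of a .docx."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
    return {
        rel.attrib["Id"]: rel.attrib["Target"]
        for rel in root.findall(RELS_NS + "Relationship")
    }


def build_demo_package():
    """Build a tiny in-memory zip with just a rels file and one media entry (illustration only)."""
    rels_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId8" '
        'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" '
        'Target="media/image1.jpeg"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/_rels/document.xml.rels", rels_xml)
        # Placeholder bytes, not a real JPEG
        zf.writestr("word/media/image1.jpeg", b"\xff\xd8\xff\xe0 placeholder")
    return buf.getvalue()


rels = read_relationships(build_demo_package())
print(rels["rId8"])  # prints media/image1.jpeg
```

With a real file, passing the bytes of the .docx to read_relationships should list the same rIds that appear in the paragraph XML, with image targets pointing into word/media/.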

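Get the Proxy Type object of each paragraph, which is XML;
Walk that XML; if a pict element is present, the paragraph holds a picture, so keep walking to get the rId;
Use related_parts to get the picture content.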

The steps in detail:

1.1 Get the `Proxy Type xml` data for each paragraph

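from docx import Document
doc=Document('demo.docx')  # placeholder path (an assumption); point this at your own file
proxy=[]
for p in doc.paragraphs:
    # p._element is the paragraph's underlying XML element; .xml serializes it to a string
    proxy.append(p._element.xml)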

1.2 Walk the `xml`, find the paragraphs that hold pictures, and get the `rId`

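import re
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# name of the r:id attribute in ElementTree's {namespace}name form
RID_ATTR='{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'
# collected across all paragraphs, so it must be created outside the loop
rIds=[]
for p in proxy:
    # each paragraph is parsed into its own tree, rooted at <w:p>
    root=ET.fromstring(p)
    # "{namespace}" prefix of the w: namespace, taken from the root tag
    w_ns=re.match(r'{\S+}',root.tag).group(0)
    # every <w:pict> is a child of some <w:r>
    pictrs=root.findall("%sr" % w_ns)
    picts=[]
    for pictr in pictrs:
        # collect every <w:pict> in this run (none if the run holds no picture)
        picts.extend(pictr.findall("%spict" % w_ns))
    for pict in picts:
        if len(pict)==0:
            continue
        # the children of <w:pict> are in the VML (v:) namespace
        v_ns=re.match(r'{\S+}',pict[0].tag).group(0)
        # <v:shape> elements
        for shape in pict.findall("%sshape" % v_ns):
            # <v:imagedata> carries the r:id of the picture
            for imagedata in shape.findall("%simagedata" % v_ns):
                rIds.append(imagedata.attrib[RID_ATTR])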

PS: this code is easiest to follow with the XML above open next to it.
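
The same lookup can be written with an ElementTree namespace map instead of regular expressions. The sketch below is one possible variant, not the method used above: find_image_rids and the prefix names in NS are choices made for this example. It also looks for <a:blip r:embed="..."> inside <w:drawing>, on the assumption that this is where DrawingML pictures, such as those written by add_picture(), keep their rId.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefixes are local names chosen for this sketch
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "v": "urn:schemas-microsoft-com:vml",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}
R_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def find_image_rids(paragraph_xml):
    """Return the relationship ids of the pictures referenced in one <w:p> XML string."""
    root = ET.fromstring(paragraph_xml)
    rids = []
    # VML pictures: <w:pict> ... <v:imagedata r:id="..."/>
    for node in root.findall(".//w:pict//v:imagedata", NS):
        rid = node.get(R_NS + "id")
        if rid:
            rids.append(rid)
    # DrawingML pictures: <w:drawing> ... <a:blip r:embed="..."/>
    for node in root.findall(".//w:drawing//a:blip", NS):
        rid = node.get(R_NS + "embed")
        if rid:
            rids.append(rid)
    return rids


# A cut-down copy of the picture paragraph shown earlier
sample = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     xmlns:v="urn:schemas-microsoft-com:vml"
     xmlns:o="urn:schemas-microsoft-com:office:office"
     xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:r>
    <w:pict>
      <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:6in;height:214.5pt">
        <v:imagedata r:id="rId8" o:title="syh"/>
      </v:shape>
    </w:pict>
  </w:r>
</w:p>
"""

print(find_image_rids(sample))  # prints ['rId8']
```

Applied to every string in proxy, this would give one list of rIds per paragraph, and paragraphs without pictures give an empty list.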

1.3 Get the image data

imgs=[]
for rid in rIds:
    imgs.append(doc.part.related_parts[rid])

1.4 Save the pictures locally

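for i,img in enumerate(imgs,start=1):
    # img.blob holds the raw bytes of the picture; they are written unchanged,
    # so the .jpg extension is only right if the embedded picture is a JPEG
    with open("img%d.jpg" % i,'wb') as f:
        f.write(img.blob)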
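
Because the bytes are written unchanged, the file extension should match what is actually embedded. One way to choose it is to inspect the first bytes of the blob. The sketch below is a minimal version of that idea: guess_image_extension is a name made up for this example, and it recognizes only a few common formats, falling back to .bin for anything else.

```python
def guess_image_extension(blob):
    """Guess a file extension from the leading bytes of an image blob."""
    signatures = [
        (b"\xff\xd8\xff", ".jpg"),
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, ext in signatures:
        if blob.startswith(magic):
            return ext
    # Unknown format: keep the bytes, but do not claim a type
    return ".bin"


print(guess_image_extension(b"\x89PNG\r\n\x1a\n" + b"rest of file"))  # prints .png
```

In the save loop, the file name could then be built as "img%d%s" % (i, guess_image_extension(img.blob)).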

2. Inserting a picture into Word

Inserting a picture is much simpler:

from docx.shared import Cm
doc.add_picture('img_path',width=Cm(16),height=Cm(12))

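Postscript: reading pictures out of Word is somewhat involved. This code certainly will not handle every Word file and may well have other problems, since none of this is covered in the official API. I offer it only as a starting point; if you have a better approach, I would be glad to hear it.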

posted @ 2018-10-30 13:36 xtfge0915
