NPOI处理Word文本中段落编号
NPOI的XWPFParagraph对象中,是无法直接读取段落编号的,然而可以读取的是编号的样式名称(GetNumFmt),编号分组ID(GetNumID),编号样式(NumLevelText)等。具体如下:
/* * 若干格式信息 * GetNumFmt: decimal, GetNumID: 1, GetNumIlvl: 0, NumLevelText: %1. => 1. * GetNumFmt: decimal, GetNumID: 4, GetNumIlvl: 0, NumLevelText: %1) => 1) * GetNumFmt: chineseCountingThousand, GetNumID: 2, GetNumIlvl: 0, NumLevelText: (%1) => (一) * GetNumFmt: chineseCountingThousand, GetNumID: 3, GetNumIlvl: 0, NumLevelText: %1、 => 一、 * GetNumFmt: upperLetter, GetNumID: 5, GetNumIlvl: 0, NumLevelText: %1. => A. * GetNumFmt: decimal, GetNumID: 6, GetNumIlvl: 0, NumLevelText: %1、 => 1、 */
于是封装了段落编号的处理类,几个关键点:
1、考虑频繁调用,使用单例。
2、依照NumLevelText内容替换编号的样式
3、编号分组发生变化时,编号要重置为1,采用字典记录
4、汉字、字母统一处理为数字编号
5、读取一个新Word时,字典内容要清空
段落处理类:
1 /// <summary> 2 /// 段落处理类 3 /// Author: Matsuyoi 4 /// </summary> 5 class ParagraphNumHandle 6 { 7 #region 封装为单例 8 private static ParagraphNumHandle singleton = null; 9 public static ParagraphNumHandle GetInstance() 10 { 11 if (singleton == null) 12 singleton = new ParagraphNumHandle(); 13 //获取单例后重置一次变量 14 singleton.Reset(); 15 return singleton; 16 } 17 #endregion 18 //Num字典 19 private Dictionary<string, int> _Count; 20 private ParagraphNumHandle() 21 { 22 _Count = new Dictionary<string, int>(); 23 } 24 /// <summary> 25 /// 重置 26 /// </summary> 27 private void Reset() 28 { 29 //清空字典 30 _Count.Clear(); 31 } 32 /// <summary> 33 /// 处理段落中的编号,汉字与字母编号统一转为数字编号 34 /// </summary> 35 /// <param name="paragraph"></param> 36 /// <returns></returns> 37 public string GetParagraphNum(XWPFParagraph paragraph) 38 { 39 string result = ""; 40 //若无编号格式信息,则返回空 41 if (string.IsNullOrEmpty(paragraph.GetNumFmt()) || 42 string.IsNullOrEmpty(paragraph.GetNumID()) || 43 string.IsNullOrEmpty(paragraph.NumLevelText)) 44 { 45 return result; 46 } 47 48 string key = paragraph.GetNumID() ?? ""; 49 if (!_Count.ContainsKey(key)) 50 { 51 //编号从1开始 52 _Count.Add(key, 1); 53 } 54 else 55 { 56 _Count[key] += 1; 57 } 58 59 string fmt = paragraph.NumLevelText.Replace("%1", "{0}"); 60 result = string.Format(fmt, _Count[key].ToString()) + " "; 61 return result; 62 } 63 }
调用方式:
//段落编号处理 ParagraphNumHandle pnc = ParagraphNumHandle.GetInstance(); //正文段落 foreach (XWPFParagraph paragraph in document.Paragraphs) { //获取段楼中的编号 string num = pnc.GetParagraphNum(paragraph); ... }
延续上一篇《NPOI处理Word文本中上下角标》的示例,完整代码如下:
/// <summary> /// 读取Word,并识别文本中的上下角标 /// </summary> /// <param name="fileName"></param> /// <returns></returns> public static string ReadWordTextExWithSubscript(string fileName) { string fileText = string.Empty; StringBuilder sbFileText = new StringBuilder(); #region 打开文档 XWPFDocument document = null; try { using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read)) { document = new XWPFDocument(file); } } catch (Exception e) { throw e; } #endregion //段落编号处理 ParagraphNumHandle pnc = ParagraphNumHandle.GetInstance(); //正文段落 foreach (XWPFParagraph paragraph in document.Paragraphs) { //获取段楼中的句列表 IList<XWPFRun> runsLists = paragraph.Runs; //获取段楼中的编号 string num = pnc.GetParagraphNum(paragraph); sbFileText.Append("<p>" + num); foreach (XWPFRun run in runsLists) { switch (run.Subscript) { case VerticalAlign.BASELINE: sbFileText.Append(run.Text); break; //上角标 case VerticalAlign.SUPERSCRIPT: sbFileText.Append("<sup>" + run.Text + "</sup>"); break; //下角标 case VerticalAlign.SUBSCRIPT: sbFileText.Append("<sub>" + run.Text + "</sub>"); break; default: sbFileText.Append(run.Text); break; } } sbFileText.AppendLine("</p>"); } fileText = sbFileText.ToString(); return fileText; }
测试:
Word文档:
输出:
<p>1. 第一段</p>
<p>2. 第二段</p>
<p>1) 第三段</p>
<p>(1) 第四段</p>
<p>(2) 第五段</p>
<p>1、 第六段</p>
<p>2、 第七段</p>
<p>1. 第八段</p>
<p>1、 第九段</p>
<p>2、 第十段</p>
<p>测试<sup>上</sup><sub>下</sub>ok。</p>
<p>CO<sub>2</sub></p>
<p>面积约6000km<sup>2</sup></p>
Html预览: