
Repost: How to convert Office 2003 documents (.doc, .xls, .ppt) to MHT documents

2008-02-15 18:03  Koy

Reposted from: http://www.cnblogs.com/shanyou/archive/2007/11/28/975941.html

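Converting an Office document to MHTML takes two steps: first convert the Office document to HTML, then convert the HTML to MHTML. The first step needs the Microsoft.HtmlTrans.Interface assembly, which is only available after installing the "HTML Conversion Server". The HTML Conversion Server is an optional component of a Windows SharePoint Services server farm. Its installation files can be found on the Microsoft website.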

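Install it as follows:

1. Extract the downloaded archive. It contains:
eng11probypass.mst
htmltrbackend.msi
HTML Viewer WhitePaper document
2. If Office is already installed, uninstall it first, then install Office with HTML Viewer Services support:
In the Office installation source, locate the folder that contains the Setup file;
Copy eng11probypass.mst into that folder;
At a command prompt, run Setup transforms=eng11probypass.mst to install Office with HTML Viewer support.
3. Install the HTML Viewer Server by running htmltrbackend.msi.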

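After installation, locate Microsoft.HtmlTrans.Interface.dll, copy it into the project folder, and add a reference to it in the project. The conversion from Office documents to HTML uses two remoting objects from the Microsoft.HtmlTrans namespace, htmlTrLoadBalancer and htmlTrLauncher. Note the following limitations: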

Document types not supported are:
Master documents in Word (see Word Help for an explanation of Master document)
Password protected documents, workbooks, and presentations (encrypted)
Word documents that use framesets
Files that contain Excel 4.0 macros
WordPerfect files
For files with embedded objects, VBA, scripts, etc, the following rules apply:
VBA is ignored and not executed; However, the VBA project (source code, dialog definitions, etc) is retained
Embedded and linked objects are converted to graphic images and displayed in the approximate location where they were in the source file
Linked or embedded objects with password protection are not converted

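The other difficulty in the implementation is converting HTML to MHTML. MHTML is short for MIME Encapsulation of Aggregate HTML. It is a MIME standard that defines how HTML content is carried in an e-mail body. Put simply, an HTML file and every resource it uses (.css files, .js files, images, and so on) are combined into a single MHTML file. Below is a typical MHTML file (text after ";" is explanation):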

Mime-Version: 1.0
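; Content-Location is the address of the main file; it can be set freely
Content-Location: http://www.ietf.cnri.reston.va.us/
; Content-Type is the type of the MHTML file; here it says the file contains parts of several types
; boundary defines the separator between parts; it can be chosen freely
; type is the format of the main file
Content-Type: multipart/related; boundary="boundary-example";type="text/html"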

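; a boundary prefixed with "--" marks the start of a part
--boundary-example
; the part headers follow
; text/html is the content type of this part; charset is the character set it uses
Content-Type: text/html; charset="ISO-8859-1"
; Content-Transfer-Encoding is the encoding of this part;
; there are generally two kinds: text parts usually use "QUOTED-PRINTABLE";
; binary parts usually use "BASE64"
Content-Transfer-Encoding: QUOTED-PRINTABLE

; the body follows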
... text of the HTML document, which might contain URIs
referencing resources in other body parts, for example through
statements such as:

<IMG SRC="images/ietflogo1.gif" ALT="IETF logo1">
<IMG SRC="images/ietflogo2.gif" ALT="IETF logo2">
<IMG SRC="images/ietflogo3.gif" ALT="IETF logo3">

Example of a copyright sign encoded with Quoted-Printable: =A9
Example of a copyright sign mapped onto HTML markup: &#168;

--boundary-example
; Content-Location is the address of this part; it can be absolute or relative to the main file
; here it is an absolute address
Content-Location:http://www.ietf.cnri.reston.va.us/images/ietflogo1.gif
Content-Type: IMAGE/GIF
; binary file, BASE64-encoded
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/////ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...

--boundary-example
; here it is a relative address
Content-Location: images/ietflogo2.gif
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/////ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...

--boundary-example
Content-Location:http://www.ietf.cnri.reston.va.us/images/ietflogo3.gif
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/////ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...
; note: this is the closing marker, meaning the MHTML file has ended; it is the boundary with "--" both before and after it
--boundary-example--
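
The layout of the annotated file can be reproduced with a few lines of string assembly. The following is a minimal sketch rather than the article's implementation: MhtmlPart, MhtmlSketch and Build are hypothetical names, and each part's body is assumed to be already encoded (quoted-printable or Base64) by the caller.

```csharp
using System.Collections.Generic;
using System.Text;

// One part of the archive (hypothetical type, for illustration only).
// EncodedBody must already be quoted-printable or Base64 text.
public sealed class MhtmlPart
{
    public string Location { get; private set; }
    public string ContentType { get; private set; }
    public string TransferEncoding { get; private set; }
    public string EncodedBody { get; private set; }

    public MhtmlPart(string location, string contentType, string transferEncoding, string encodedBody)
    {
        Location = location;
        ContentType = contentType;
        TransferEncoding = transferEncoding;
        EncodedBody = encodedBody;
    }
}

public static class MhtmlSketch
{
    // Assembles the top-level headers, one delimited section per part, and the closing delimiter.
    public static string Build(string boundary, IList<MhtmlPart> parts)
    {
        const string CRLF = "\r\n";
        StringBuilder sb = new StringBuilder();

        // Top-level headers: MIME version, multipart type, boundary, and main file type.
        sb.Append("MIME-Version: 1.0").Append(CRLF);
        sb.Append("Content-Type: multipart/related; boundary=\"").Append(boundary)
          .Append("\"; type=\"text/html\"").Append(CRLF);

        foreach (MhtmlPart part in parts)
        {
            // A blank line, then the boundary prefixed with "--", opens each part.
            sb.Append(CRLF).Append("--").Append(boundary).Append(CRLF);
            sb.Append("Content-Location: ").Append(part.Location).Append(CRLF);
            sb.Append("Content-Type: ").Append(part.ContentType).Append(CRLF);
            sb.Append("Content-Transfer-Encoding: ").Append(part.TransferEncoding).Append(CRLF);
            // A blank line separates the part headers from the part body.
            sb.Append(CRLF);
            sb.Append(part.EncodedBody).Append(CRLF);
        }

        // The closing delimiter is the boundary with "--" on both sides.
        sb.Append(CRLF).Append("--").Append(boundary).Append("--").Append(CRLF);
        return sb.ToString();
    }
}
```

Passing one text/html part followed by the image parts yields a file with the same shape as the sample; the first part plays the role of the main file.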

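The annotated sample file above is in the standard MHTML format, but a file that follows only that format will not display correctly in IE. The following points also need attention:

1. In every text part, replace each "=" with "=3D". For example,
<IMG SRC="images/ietflogo3.gif" ALT="IETF logo3">
becomes
<IMG SRC=3D"images/ietflogo3.gif" ALT=3D"IETF logo3">
2. Every BASE64-encoded part must be broken into lines;
3. The boundary that opens each part is prefixed with "--", and the final boundary has "--" both before and after it;
4. A line break must separate a part's headers from its body, and the body from the next boundary.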
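
Rules 1 and 2 can be sketched as two small helpers. This is an illustration under stated assumptions, not the article's code: MhtEncodingSketch, EscapeEquals and ToWrappedBase64 are hypothetical names, EscapeEquals covers only the "=" rule and is not a complete quoted-printable encoder, and the line wrapping relies on the framework's Base64FormattingOptions.InsertLineBreaks option instead of a hand-written loop.

```csharp
using System;
using System.Collections.Generic;

public static class MhtEncodingSketch
{
    // Rule 1: every '=' byte (0x3D) is followed by the two bytes "3D", so "=" becomes "=3D".
    // Other quoted-printable rules (non-ASCII bytes, soft line breaks) are not handled here.
    public static byte[] EscapeEquals(byte[] input)
    {
        List<byte> output = new List<byte>(input.Length);
        foreach (byte b in input)
        {
            output.Add(b);
            if (b == (byte) '=')
            {
                output.Add((byte) '3');
                output.Add((byte) 'D');
            }
        }
        return output.ToArray();
    }

    // Rule 2: Base64 text broken into lines; the framework inserts CR LF after every 76 characters.
    public static string ToWrappedBase64(byte[] data)
    {
        return Convert.ToBase64String(data, Base64FormattingOptions.InsertLineBreaks);
    }
}
```

EscapeEquals works on bytes, so it can be applied to a text part before any decision about the character set is made.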

Implementation code:

using System;
using System.Collections;
using System.IO;
using System.Text;
using Microsoft.HtmlTrans;

namespace MSOfficeHelper
{
    public class Conversion
    {
        // Encoding used for strings
        protected static Encoding encoding = Encoding.Default;

        // URL used to create the IHtmlTrLoadBalancer remoting object
        protected static string strServiceUrl = System.Configuration.ConfigurationSettings.AppSettings["OfficeHtmlViewService"];

        public static void ConvertMHT(string inputfile, string outputfile)
        {
            // Get an IHtmlTrLoadBalancer remoting object from the URL (strServiceUrl)
            IHtmlTrLoadBalancer htmlTrLoadBalancer =
                (IHtmlTrLoadBalancer) System.Activator.GetObject(
                    typeof (IHtmlTrLoadBalancer), strServiceUrl);

            // Use the input file name (inputfile) as the task identifier (strTask)
            string strTask = inputfile;

            // Create a task for the identifier (strTask) and get the task's URL (strLauncherUri)
            string strLauncherUri = htmlTrLoadBalancer.StrGetLauncher(strTask);

            // Get an IHtmlTrLauncher remoting object (htmlTrLauncher) from the task URL (strLauncherUri)
            // and use it to run the task
            IHtmlTrLauncher htmlTrLauncher =
                (IHtmlTrLauncher) System.Activator.GetObject(
                    typeof (IHtmlTrLauncher), strLauncherUri);
            
            // Next, read the contents of the input file (inputfile) into a byte array (bFile)
            byte[] bFile = null;
            FileStream fsInputMht = null;
            BinaryReader bwInputMht = null;
            try
            {
                fsInputMht = new FileStream(inputfile, FileMode.Open, FileAccess.Read);
                bwInputMht = new BinaryReader(fsInputMht, encoding);
                bFile = bwInputMht.ReadBytes((int) fsInputMht.Length);
            }
            finally
            {
                // Release the handles on success and on failure; the null checks avoid a
                // NullReferenceException when opening the file itself failed
                if (bwInputMht != null) bwInputMht.Close();
                if (fsInputMht != null) fsInputMht.Close();
            }
            
            // CHICreateHtml creates the HTML file and its supporting files from the Office document
            // CHICreateHtml(
            //     string strLauncherUri,               URL of the task
            //     byte[] rgbFile,                      binary content of the Office document
            //     Microsoft.HtmlTrans.BrowserType bt,  target browser type (an enumeration)
            //     string strReqFile,                   path/URL of the Office document
            //     string strTaskName,                  task identifier the HTML conversion server uses to track the request
            //     int timeout,                         conversion timeout; use a larger value on a slow network
            //     bool fReturnFileBits                 whether to return the binary content, stored in the
            //                                          rgbMainFile and rgrgbThicketFiles properties of CreateHtmlInfo
            // );
            CreateHtmlInfo chi = htmlTrLauncher.CHICreateHtml(strLauncherUri, bFile,
                                                              BrowserType.BT_IE4, inputfile, strTask,
                                                              200, true);

            // Finish the conversion task
            htmlTrLoadBalancer.LauncherTaskCompleted(strLauncherUri, strTask);
            
            // If the HTML conversion reported no error and a main file exists, write the MHTML file
            if (chi.ce == CreationErrorType.CE_NONE && chi.fHasMainFile)
            {
                FileStream fsOutputMht = null;
                BinaryWriter bwOutputMht = null;
                try
                {
                    fsOutputMht = new FileStream(outputfile, FileMode.Create);
                    bwOutputMht = new BinaryWriter(fsOutputMht, encoding);

                    // Convert the HTML file and its supporting files into an MHTML file
                    byte[] bMHTMLBody = CreateMHTMLBody(chi);
                    string temp = System.Text.Encoding.Default.GetString(bMHTMLBody);

                    // Replace every non-ASCII character with a numeric character reference so that
                    // the ASCII encoding below does not turn it into '?'
                    StringBuilder sb = new StringBuilder();
                    foreach (char c in temp)
                    {
                        if (c > 127)
                            sb.Append("&#" + ((uint) c) + ";");
                        else
                            sb.Append(c);
                    }

                    bMHTMLBody = Encoding.ASCII.GetBytes(sb.ToString());
                    bwOutputMht.Write(bMHTMLBody);
                }
                finally
                {
                    // Close the writer and stream whether or not the conversion succeeded
                    if (bwOutputMht != null) bwOutputMht.Close();
                    if (fsOutputMht != null) fsOutputMht.Close();
                }
            }
        }
        
        // Top-level MHTML headers
        protected static string MIME =
            "MIME-Version: 1.0" + Environment.NewLine +
            "Content-Type: multipart/related; boundary=\"{0}\"" + Environment.NewLine +
            Environment.NewLine;

        // Headers for each part of the MHTML file
        protected static string HEADER =
            Environment.NewLine + "--{0}" + Environment.NewLine +
            "Content-Location: {1}" + Environment.NewLine +
            "Content-Transfer-Encoding: {2}" + Environment.NewLine +
            "Content-Type: {3}" + Environment.NewLine +
            Environment.NewLine;

        // Boundary that separates the parts of the MHTML file
        protected static string BOUNDARY = "Define_It_Youself";

        // URL of the MHTML main file
        protected static string LOCATION = string.Format(@"file:///c:/{0}/", Guid.NewGuid());
        
        private static byte[] CreateMHTMLBody(CreateHtmlInfo creatHtmlInfo)
        {
            // Line break (Environment.NewLine, CR LF on Windows) encoded as bytes
            byte[] bNewLine = Encoding.UTF8.GetBytes(Environment.NewLine);
            // "3D" encoded as bytes
            byte[] bAfterEquals = encoding.GetBytes("3D");
            // The byte value of '=' is 61
            byte bEquals = 61;
            // Length of the MHTML file
            long lMHTMLBodyLength = 0;
            // Zero-based byte offset
            long lOffset = 0;

            // Build the top-level MHTML headers from BOUNDARY
            string strMIME = string.Format(MIME, BOUNDARY);
            // Encode the headers into a byte array
            byte[] bMIME = encoding.GetBytes(strMIME);
            // Add bMIME.LongLength to the MHTML file length
            lMHTMLBodyLength += bMIME.LongLength;

            // Build the headers of the main file
            string strMainHeader = string.Format(HEADER,
                                                 BOUNDARY,
                                                 LOCATION + creatHtmlInfo.strMainFileName,
                                                 TransferEncoding.QUOTED_PRINTABLE,
                                                 ContentType.TEXT_HTML);
            byte[] bMainHeader = encoding.GetBytes(strMainHeader);
            lMHTMLBodyLength += bMainHeader.LongLength;
            
            // Temporary growable array
            ArrayList alTempArray = new ArrayList();

            // In the main file body, replace every "=" with "=3D"
            for (int i = 0; i < creatHtmlInfo.rgbMainFile.Length; i++)
            {
                alTempArray.Add(creatHtmlInfo.rgbMainFile[i]);
                if (creatHtmlInfo.rgbMainFile[i] == bEquals)
                {
                    alTempArray.Add(bAfterEquals[0]);
                    alTempArray.Add(bAfterEquals[1]);
                }
            }

            // Copy the rewritten main file body into a byte array
            byte[] bMainBody = new byte[alTempArray.Count];
            alTempArray.CopyTo(bMainBody);
            lMHTMLBodyLength += bMainBody.LongLength;
            alTempArray.Clear();
            
            // Bodies of the MHTML supporting files (an array of byte arrays, one per file)
            byte[][] bThicketContent = null;
            // Headers of the MHTML supporting files
            string[] strThicketHeaders = null;
            // If there are supporting files, process them
            if (creatHtmlInfo.fHasThicket)
            {
                bThicketContent = new byte[creatHtmlInfo.rgrgbThicketFiles.Length][];
                strThicketHeaders = new string[creatHtmlInfo.rgrgbThicketFiles.Length];
                for (int i = 0; i < strThicketHeaders.Length; i++)
                {
                    // Build the headers of the supporting file
                    string strLocation = LOCATION +
                        creatHtmlInfo.strThicketFolderName + "/" +
                        creatHtmlInfo.rgstrThicketFileNames[i];
                    string strTransferEncoding = TransferEncoding.GetTransferEncodingByFileName
                        (creatHtmlInfo.rgstrThicketFileNames[i]);
                    string strContentType = ContentType.GetContentTypeByFileName
                        (creatHtmlInfo.rgstrThicketFileNames[i]);
                    strThicketHeaders[i] = string.Format(HEADER,
                                                         BOUNDARY,
                                                         strLocation,
                                                         strTransferEncoding,
                                                         strContentType);
                    byte[] bThicketHeader = encoding.GetBytes(strThicketHeaders[i]);

                    StringBuilder strBase64ThicketBody = new StringBuilder();
                    byte[] bThicketBody = null;
                    
                    // If the supporting file is binary, encode it as BASE64
                    if (strTransferEncoding == TransferEncoding.BASE64)
                    {
                        // First convert the bytes to a Base64 string
                        strBase64ThicketBody.Append(
                            Convert.ToBase64String(creatHtmlInfo.rgrgbThicketFiles[i]));
                        // Then encode the string into a new byte array
                        bThicketBody = encoding.GetBytes(strBase64ThicketBody.ToString());
                        // Insert a line break after every 76 bytes
                        int BUFFER_SIZE = 76;
                        for (int j = 0; j < bThicketBody.Length; j++)
                        {
                            alTempArray.Add(bThicketBody[j]);
                            if (j % BUFFER_SIZE == BUFFER_SIZE - 1)
                            {
                                alTempArray.Add(bNewLine[0]);
                                alTempArray.Add(bNewLine[1]);
                            }
                        }
                        bThicketBody = new byte[alTempArray.Count];
                        alTempArray.CopyTo(bThicketBody);
                        alTempArray.Clear();
                    }
                    // Otherwise the supporting file is text: keep it as text and replace every "=" with "=3D"
                    else
                    {
                        for (int j = 0; j < creatHtmlInfo.rgrgbThicketFiles[i].Length; j++)
                        {
                            alTempArray.Add(creatHtmlInfo.rgrgbThicketFiles[i][j]);
                            if (creatHtmlInfo.rgrgbThicketFiles[i][j] == bEquals)
                            {
                                alTempArray.Add(bAfterEquals[0]);
                                alTempArray.Add(bAfterEquals[1]);
                            }
                        }
                        bThicketBody = new byte[alTempArray.Count];
                        alTempArray.CopyTo(bThicketBody);
                        alTempArray.Clear();
                    }
                    
                    // For .htm supporting files, insert a <base> tag before the first <link> tag.
                    // The body is already "=3D"-escaped at this point, so the inserted markup is escaped too.
                    string ext = Path.GetExtension(creatHtmlInfo.rgstrThicketFileNames[i]).ToLower();
                    if (ext == ".htm")
                    {
                        string body = Encoding.Default.GetString(bThicketBody);
                        int start = body.IndexOf("<link");
                        if (start > -1)
                        {
                            body = body.Insert(
                                start,
                                string.Format(
                                    "\r\n<![if IE]>\r\n"
                                    + "<base href=3D\"{0}\"\r\n"
                                    + "id=3D\"webarch_temp_base_tag\">\r\n"
                                    + "<![endif]>\r\n",
                                    LOCATION + creatHtmlInfo.strThicketFolderName + @"/" + creatHtmlInfo.rgstrThicketFileNames[i]));
                        }

                        bThicketBody = Encoding.Default.GetBytes(body);
                    }
                    
                    // Concatenate the headers and body of the supporting file into bThicketContent[i]
                    // and add their length to lMHTMLBodyLength
                    bThicketContent[i] = new byte[bThicketHeader.LongLength + bThicketBody.LongLength + bNewLine.LongLength];
                    Array.Copy(bThicketHeader, 0, bThicketContent[i], 0, bThicketHeader.LongLength);
                    Array.Copy(bThicketBody, 0, bThicketContent[i], bThicketHeader.LongLength, bThicketBody.LongLength);
                    Array.Copy(bNewLine, 0, bThicketContent[i],
                               bThicketHeader.LongLength + bThicketBody.LongLength, bNewLine.LongLength);
                    lMHTMLBodyLength += bThicketContent[i].LongLength;
                }
            }
            
            // Closing boundary of the MHTML file, encoded as bytes
            byte[] bEndBoundary = encoding.GetBytes(
                Environment.NewLine + "--" + BOUNDARY + "--" + Environment.NewLine);
            lMHTMLBodyLength += bEndBoundary.LongLength;

            // Allocate the array that holds the entire MHTML file
            byte[] bMHTMLBody = new byte[lMHTMLBodyLength];
            // Copy every piece into bMHTMLBody in order
            Array.Copy(bMIME, 0, bMHTMLBody, lOffset, bMIME.LongLength);
            lOffset += bMIME.LongLength;
            Array.Copy(bMainHeader, 0, bMHTMLBody, lOffset, bMainHeader.LongLength);
            lOffset += bMainHeader.LongLength;
            Array.Copy(bMainBody, 0, bMHTMLBody, lOffset, bMainBody.LongLength);
            lOffset += bMainBody.LongLength;
            if (bThicketContent != null)
            {
                for (int i = 0; i < bThicketContent.Length; i++)
                {
                    Array.Copy(bThicketContent[i], 0, bMHTMLBody, lOffset, bThicketContent[i].LongLength);
                    lOffset += bThicketContent[i].LongLength;
                }
            }
            Array.Copy(bEndBoundary, 0, bMHTMLBody, lOffset, bEndBoundary.LongLength);

            return bMHTMLBody;
        }
    }
    
    // Chooses the transfer encoding from the file extension
    class TransferEncoding
    {
        public const string QUOTED_PRINTABLE = "quoted-printable";
        public const string BASE64 = "base64";

        public static string GetTransferEncodingByFileName(string fileName)
        {
            string strResult = string.Empty;
            // Path.GetExtension returns an empty string when the name has no extension
            string strExtension = Path.GetExtension(fileName).ToUpper();
            switch (strExtension)
            {
                // These extensions, and any unlisted one, are stored as text in the MHTML file
                default:
                case ".HTM":
                case ".HTML":
                case ".XML":
                    strResult = TransferEncoding.QUOTED_PRINTABLE;
                    break;
                // These extensions are stored BASE64-encoded in the MHTML file
                case ".JPG":
                case ".JPEG":
                case ".PNG":
                case ".MSO":
                case ".EMZ":
                case ".GIF":
                case ".WMF":
                case ".WMZ":
                case ".CSS":
                    strResult = TransferEncoding.BASE64;
                    break;
            }

            return strResult;
        }
    }
    
    // Chooses the content type from the file extension
    class ContentType
    {
        public const string TEXT_HTML = "text/html; charset=\"us-ascii\"";
        public const string APPLICATION_XMSO = "application/x-mso";
        public const string IMAGE_XEMZ = "image/x-emz";
        public const string IMAGE_GIF = "image/gif";
        public const string TEXT_CSS = "text/css";
        public const string TEXT_XML = "text/xml; charset=\"utf-8\"";
        public const string IMAGE_XWMF = "image/x-wmf";
        public const string IMAGE_PNG = "image/png";
        public const string IMAGE_JPEG = "image/jpeg";
        public const string TEXT_JS = "application/javascript; charset=\"us-ascii\"";
        public const string IMAGE_WMZ = "image/x-wmz";

        public static string GetContentTypeByFileName(string fileName)
        {
            // Path.GetExtension returns an empty string when the name has no extension
            string strExtension = Path.GetExtension(fileName).ToUpper();
            switch (strExtension)
            {
                // text/html; charset="us-ascii"
                case ".HTM":
                case ".HTML":
                    return ContentType.TEXT_HTML;
                // application/x-mso
                case ".MSO":
                    return ContentType.APPLICATION_XMSO;
                // image/x-emz
                case ".EMZ":
                    return ContentType.IMAGE_XEMZ;
                // image/gif
                case ".GIF":
                    return ContentType.IMAGE_GIF;
                // text/css
                case ".CSS":
                    return ContentType.TEXT_CSS;
                // text/xml; charset="utf-8"
                case ".XML":
                    return ContentType.TEXT_XML;
                // image/x-wmf
                case ".WMF":
                    return ContentType.IMAGE_XWMF;
                // image/png
                case ".PNG":
                    return ContentType.IMAGE_PNG;
                // image/jpeg
                case ".JPG":
                case ".JPEG":
                    return ContentType.IMAGE_JPEG;
                // application/javascript; charset="us-ascii"
                case ".JS":
                    return ContentType.TEXT_JS;
                // image/x-wmz
                case ".WMZ":
                    return ContentType.IMAGE_WMZ;
            }

            // Unknown extension: no content type
            return string.Empty;
        }
    }
}
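
Before writing the file, ConvertMHT rewrites high characters as numeric character references so that the result can be stored as plain ASCII. That step can be isolated as below. This is a sketch under stated assumptions: HtmlEntitySketch and ToAsciiWithEntities are hypothetical names, and, unlike the listing, the helper combines surrogate pairs into a single code point.

```csharp
using System.Text;

public static class HtmlEntitySketch
{
    // Returns text in which every character above U+007F is written as a decimal
    // numeric character reference (for example U+4E2D becomes "&#20013;").
    public static string ToAsciiWithEntities(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c <= 127)
            {
                sb.Append(c);
                continue;
            }

            int codePoint = c;
            // A surrogate pair encodes one code point above U+FFFF; emit it as one reference.
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, text[i + 1]);
                i++;
            }
            sb.Append("&#").Append(codePoint).Append(';');
        }
        return sb.ToString();
    }
}
```

The output contains only ASCII characters, so Encoding.ASCII.GetBytes can encode it without substituting '?'.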