XML Text White Space Handling

White space in XML can be any of the following characters:

• Space (ASCII space, 0x20)

• Carriage return (CR, 0x0D)

• Line feed (LF, 0x0A)

• Horizontal tab (0x09)

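Whether pressing Enter produces 0D 0A or a single 0A is decided by the operating system; the BIOS merely tells the operating system that someone pressed the Enter key. Windows treats it as 0D 0A, Unix needs only 0A, and classic Mac OS (before Mac OS X) treats it as 0D.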
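Because the same text can arrive with any of these three line endings, XML processors normalize CR LF pairs and lone CRs to a single LF before parsing. A minimal Python sketch of that idea follows; the names `XML_WHITESPACE`, `is_xml_whitespace`, and `normalize_line_endings` are made up for illustration and are not part of any library.

```python
# The four XML white space characters listed above: space, CR, LF, tab.
XML_WHITESPACE = {" ", "\r", "\n", "\t"}


def is_xml_whitespace(text: str) -> bool:
    """Return True if text is non-empty and consists only of XML white space."""
    return bool(text) and all(ch in XML_WHITESPACE for ch in text)


def normalize_line_endings(text: str) -> str:
    """Turn CR LF pairs and lone CRs into a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


windows_text = "a\r\nb"   # 0D 0A
unix_text = "a\nb"        # 0A
old_mac_text = "a\rb"     # 0D

for sample in (windows_text, unix_text, old_mac_text):
    before = sample.encode("ascii").hex(" ")
    after = normalize_line_endings(sample).encode("ascii").hex(" ")
    print(before, "->", after)
```

All three samples end up as the same byte sequence, which is why an application reading XML through a parser normally sees only LF.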

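Generally, the first thing a program does when it processes text is determine which character encoding the text was saved in. The most standard approach is to examine the first few bytes of the text.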
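One common form of this check looks for a byte order mark (BOM) at the start of the data. The sketch below is only an illustration of that check: `detect_bom` and the `BOMS` table are hypothetical names, while the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Known BOMs, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbf<?xml version='1.0'?>"))  # ('utf-8', 3)
print(detect_bom(b"<?xml version='1.0'?>"))              # (None, 0)
```

When no BOM is found, the caller still has to fall back on something else, such as the XML declaration's `encoding` attribute or a guess based on the content.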

Below is an XML file:

<?xml version="1.0" encoding="utf-8" ?>

<book>

<name>

C++

</name>

<value>

123

</value>

</book>

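[Figure: the same file viewed in a binary editor.]

The leading bytes EF BB BF show that the file is encoded in UTF-8. Notepad also supports encodings that have no such prefix, such as ANSI; when the prefix is missing, Notepad can only guess the encoding from the content. See Note 1.

The figure also shows that the document contains a large number of carriage returns, line feeds, and spaces.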
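To look at these bytes without a binary editor, a short Python sketch can dump them. The sample is a shortened stand-in for the file above, and it assumes the file was saved with a UTF-8 BOM and Windows CR LF line endings; the two-space indentation and the names `hex_dump` and `count_whitespace` are illustrative only.

```python
from collections import Counter

# Assumption: a shortened stand-in for the file, saved with CR LF line endings.
sample = "<book>\r\n  <name>\r\n    C++\r\n  </name>\r\n</book>\r\n"
# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) when encoding.
data = sample.encode("utf-8-sig")

WHITESPACE_NAMES = {0x20: "space", 0x0D: "CR", 0x0A: "LF", 0x09: "tab"}


def hex_dump(data: bytes, width: int = 16) -> list:
    """Return hex-editor style rows: offset, hex bytes, printable text."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        rows.append(f"{offset:08X}  {hex_part.ljust(width * 3)} {text_part}")
    return rows


def count_whitespace(data: bytes) -> Counter:
    """Count how often each XML white space byte occurs."""
    return Counter(WHITESPACE_NAMES[b] for b in data if b in WHITESPACE_NAMES)


print("\n".join(hex_dump(data)))
print(dict(count_whitespace(data)))
```

Even in this tiny sample, a noticeable share of the bytes is white space rather than markup or text.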

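White space can also be classified into two kinds: significant white space and insignificant white space. Significant white space is any white space inside a mixed content model defined by the document type definition (DTD), inside the text of an element, or within the scope of the special attribute xml:space when xml:space is set to "preserve". Significant white space is any white space that needs to be preserved from the original document to the final document. Insignificant white space is white space that does not need to be preserved from the document that is read to the document that is output.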

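For XML, the W3C standard specifies that white space must be handled differently depending on where it appears in the document and on the setting of the xml:space attribute. If the characters appear within mixed element content or within the scope of xml:space="preserve", they must be preserved and passed unchanged to the application. No other white space needs to be preserved.

So what is mixed content?

Mixed content is an option in XSD schemas indicating that a complexType element can contain text nodes interleaved with other elements. A typical example of mixed content is XHTML: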

<xhtml:p>This text is

<xhtml:b>bold</xhtml:b> and this text is

<xhtml:u>underlined</xhtml:u>!

</xhtml:p>
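To make the interleaving concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`, which stores the text before the first child in `text` and the text following each child in that child's `tail`. The namespace declaration on `xhtml:p` is an assumption added so the fragment parses on its own, and `mixed_content_pieces` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
# Assumption: the xmlns:xhtml declaration is added here; the fragment above omits it.
doc = (
    f'<xhtml:p xmlns:xhtml="{XHTML}">This text is\n'
    "<xhtml:b>bold</xhtml:b> and this text is\n"
    "<xhtml:u>underlined</xhtml:u>!\n"
    "</xhtml:p>"
)
p = ET.fromstring(doc)


def mixed_content_pieces(elem):
    """Flatten elem's direct content into (kind, value) pairs in document order."""
    pieces = []
    if elem.text:
        pieces.append(("text", elem.text))
    for child in elem:
        # Drop the "{namespace}" prefix ElementTree puts in front of the tag.
        pieces.append(("element", child.tag.split("}")[-1]))
        if child.tail:
            pieces.append(("text", child.tail))
    return pieces


for kind, value in mixed_content_pieces(p):
    print(kind, repr(value))
```

The text pieces keep their spaces and line feeds, because in mixed content that white space is part of the data.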

Here is an example of XML that contains white space and has the xml:space attribute set to "preserve". The newline character is illustrated as a special white space character at the end of the lines in this example.

<!DOCTYPE test [

<!ELEMENT test (item | book)*> <!-- element content model -->

<!ELEMENT item (item*)> <!-- element content model -->

<!ATTLIST item xml:space (default | preserve) #IMPLIED>

<!ELEMENT book (#PCDATA | b | i)*> <!-- mixed content model -->

<!ELEMENT b (#PCDATA)> <!-- mixed content model -->

<!ELEMENT i (#PCDATA)> <!-- mixed content model -->

]>•

<test>•

••••<item>•

••••••••<item xml:space="preserve">º

ºººººººººººº<item/>º

ºººººººº</item>•

••••</item>•

••••<book>º

ººººººººThisº

ººººººººisº

ººººººººa testº

ºººº</book>•

</test>•

The white space shown as (•) is insignificant white space. The white space shown as (º) is significant white space.
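The classification in this example can be approximated in Python with the standard `xml.etree.ElementTree` module. This is only a sketch under stated assumptions: ElementTree does not validate against a DTD, so the set of mixed-content element names (`MIXED`) is supplied by hand from the DTD above, and `strip_insignificant` is a hypothetical helper, not a library function.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the predefined XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
# Assumption: no DTD validation is available, so the mixed-content
# element names are copied by hand from the DTD.
MIXED = {"book", "b", "i"}
WS = " \t\r\n"


def strip_insignificant(elem, preserve=False):
    """Drop white-space-only text unless xml:space="preserve" or mixed content applies."""
    space = elem.get(XML_SPACE)
    if space == "preserve":
        preserve = True
    elif space == "default":
        preserve = False
    keep = preserve or elem.tag in MIXED
    if not keep and elem.text is not None and not elem.text.strip(WS):
        elem.text = None
    for child in elem:
        strip_insignificant(child, preserve)
        # A child's tail is part of this element's content, so this element's rule applies.
        if not keep and child.tail is not None and not child.tail.strip(WS):
            child.tail = None


doc = (
    "<test>\n"
    "    <item>\n"
    '        <item xml:space="preserve">\n'
    "            <item/>\n"
    "        </item>\n"
    "    </item>\n"
    "    <book>\n"
    "        This\n"
    "        is\n"
    "        a test\n"
    "    </book>\n"
    "</test>"
)
root = ET.fromstring(doc)
strip_insignificant(root)
print(ET.tostring(root, encoding="unicode"))
```

After the call, only the white space inside the `xml:space="preserve"` item and inside `book` remains, which corresponds to the characters marked (º) above.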

Note 1:

Below are the bytes that represent the string "Hello" in different encodings.

48 65 6C 6C 6F

This is the traditional ANSI encoding.

48 00 65 00 6C 00 6C 00 6F 00

This is the Unicode (little-endian) encoding with no BOM. Each character takes two bytes.

EF BB BF 48 65 6C 6C 6F

This is UTF-8 encoding. The first three bytes are the UTF-8 encoding of the BOM.

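The BOM (byte order mark) is a special marker at the start of a Unicode file encoded in UTF-8, UTF-16, or UTF-32. It matters most for UTF-16 and UTF-32, where it indicates the byte order (the XML specification requires it for entities encoded in UTF-16), but it is optional for UTF-8.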
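The byte sequences in this note can be reproduced with Python's built-in codecs. In this sketch, "ANSI" is assumed to mean the Windows-1252 code page (on a real system it is whatever the active Windows code page is), and `hex_bytes` is a helper name invented for the example.

```python
text = "Hello"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# Assumption: "ANSI" is taken to be Windows-1252 here.
print(hex_bytes(text.encode("cp1252")))     # 48 65 6C 6C 6F
# UTF-16 little-endian without a BOM: two bytes per character.
print(hex_bytes(text.encode("utf-16-le")))  # 48 00 65 00 6C 00 6C 00 6F 00
# "utf-8-sig" writes the BOM EF BB BF before the UTF-8 bytes.
print(hex_bytes(text.encode("utf-8-sig")))  # EF BB BF 48 65 6C 6C 6F
```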

http://unicode.org/faq/utf_bom.html#22

http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx

http://msdn.microsoft.com/zh-cn/library/6f00zs65.aspx

posted @ 2008-09-10 11:12 keep complex...
