A Misconception About the CSV File Format
Xiao Yang suddenly felt deflated: he had assumed that one line in a CSV file is one record, but it turns out that's not so.
When a field contains an embedded line break, a single record gets split across multiple lines, for example:
aaaaaa,aaaaaaaaa,aaaaaaaaa,aaaaaaaaaaa,aaaaaaa,aaaa,aaaaaaaa"kkk
eeegg",bbbbbbbbbbbbbbbbbbbbbbbbbbb
SHIT. Below is a reference article: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
As is the case with most exchange formats since XML, CSV files have become somewhat of a legacy format. New applications that wish to include an export format will generally use XML today (though there may be exceptions). In legacy systems though (pre-XML), CSV files had indeed become a de facto industry standard. Just as there are still billions of lines of COBOL code in use today that need to be maintained, support for a legacy standard such as CSV is likely to be required long after it has stopped being implemented in new designs.
- Each record is one line ...but a record separator may consist of a line feed (ASCII/LF=0x0A), or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A). ...but: fields may contain embedded line-breaks (see below), so a record may span more than one line.
- Fields are separated with commas.
  Example: John,Doe,120 any st.,"Anytown, WW",08123
- Leading and trailing space-characters adjacent to comma field separators are ignored. So John , Doe ,... resolves to "John" and "Doe", etc. Space characters can be spaces or tabs.
- Fields with embedded commas must be delimited with double-quote characters. In the above example, "Anytown, WW" had to be delimited in double quotes because it had an embedded comma.
- Fields that contain double quote characters must be surrounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes. So, John "Da Man" Doe would convert to "John ""Da Man""",Doe,120 any st.,...
- A field that contains embedded line-breaks must be surrounded by double-quotes.
So:
Field 1: Conference room 1
Field 2:
John,
Please bring the M. Mathers file for review
-J.L.
Field 3: 10/18/2002
...would convert to:
Conference room 1, "John,
Please bring the M. Mathers file for review
-J.L.
",10/18/2002,...Note that this is a single CSV record, even though it takes up more than one line in the CSV file. This works because the line breaks are embedded inside the double quotes of the field.
- Fields with leading or trailing spaces must be delimited with double-quote characters.
So to preserve the leading and trailing spaces around the last name above: John ," Doe ",...
Usage note: Some applications will insist on helping you by removing leading and trailing spaces from all fields, regardless of whether the CSV used quotes to preserve them. They may also insist on removing leading zeros from all fields, regardless of whether you need them. One such application is Excel. :-(
- Fields may always be delimited with double quotes.
The delimiters will always be discarded.
Implementation note: When importing CSV, do not reach down a layer and try to use the quotes to impart type information to fields. Also, when exporting CSV, you may want to be defensive of apps that improperly try to do this. Though, to be honest, I have not found any examples of applications that try to do this. If you have encountered any apps that attempt to use the quotes to glean type information from CSV files (like assuming quoted fields are strings even if they are numeric), please let me know about it.
- The first record in a CSV file may be a header record containing column (field) names
There is no mechanism for automatically discerning if the first record is a header row, so in the general case, this will have to be provided by an outside process (such as prompting the user). The header row is encoded just like any other CSV record in accordance with the rules above. A header row for the multi-line example above, might be:
Location, Notes, "Start Date", ...
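These rules are exactly what a conforming CSV writer has to implement. As a minimal sketch, Python's standard csv module follows the same convention (doubled quotes, minimal quoting) and reproduces the article's examples:

```python
import csv
import io

# Reproduce the article's examples; csv.writer applies the rules above:
# fields with embedded commas, quotes, or line breaks get double-quoted,
# and embedded double quotes are doubled. Plain fields stay bare.
buf = io.StringIO()
w = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
w.writerow(['John', 'Doe', '120 any st.', 'Anytown, WW', '08123'])
w.writerow(['John "Da Man"', 'Doe', '120 any st.'])
w.writerow(['Conference room 1',
            'John,\nPlease bring the M. Mathers file for review\n -J.L.',
            '10/18/2002'])
print(buf.getvalue())
# John,Doe,120 any st.,"Anytown, WW",08123
# "John ""Da Man""",Doe,120 any st.
# Conference room 1,"John,
# Please bring the M. Mathers file for review
#  -J.L.",10/18/2002
```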
There is nothing inherently wrong with using CSV to maintain data written in alternate character encodings such as utf-8 and utf-16. CSV's syntax is durable enough to deal with these encoding schemes. Problems arise, however, in two areas when attempting to transport data of different encodings as plain binary.
First, complexity can arise from mixed encoding schemes. That is, the encodings of the three characters used by CSV (, " and line-feed) may be the same width as the elemental character widths used in the binary field data, or they may be different. For example, utf-8 or utf-16 may be embedded in the fields of any CSV file that uses normal 8 bit characters for comma, quote, and line-feed. The CSV file may alternately use utf-16 for its commas, quotes, and line-feeds to enclose binary fields that contain utf-8 encoded characters. Such complexity must be dealt with deliberately somewhere by the applications that handle the data in those fields.
Problems also arise because the application or display system on which the data is conveyed may not be equipped to handle the encoding, or it may interpret it in unfamiliar ways.
The CSV to XML converter form used in this article, for example, knows its output comes from an HTML form and goes to XML which must be displayed in a browser's HTML. To be compatible with this environment it intentionally converts many 8-bit characters to HTML entities (e.g. &amp;, &quot;, etc.).
This will cause all kinds of problems for character encodings other than the usual ISO-8859-x used for these pages. This isn't an incompatibility between CSV and utf-8, it is just an incompatibility between the character encodings used at the different presentation levels.
This issue arises in other areas as well. Spreadsheet programs will often assume all data in a CSV file is in the OEM's or system's default character encoding and attempt to interpret all characters accordingly.
The CSV to XML converter application used here can easily be configured to send output to a binary file with binary, un-cooked characters. In this case it will correctly produce the proper CSV for any character-encoding scheme including utf-8. The applications that will be asked to interpret those octets found in the CSV fields will have to know how to deal with them though. Also, this application will always use an 8-bit encoding for its CSV delimiter characters, and so may cause mixed encoding confusion when used for wide characters.
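A small illustration of keeping the encoding decision explicit at the file boundary, using Python's standard csv module (the file name and choice of utf-8 are assumptions for the example):

```python
import csv

# Be explicit about the encoding when the file is opened; the csv layer
# itself only sees decoded text and is indifferent to the scheme chosen.
with open('data-utf8.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerow(['café', 'naïve', '東京'])

# Reading back with the wrong encoding is the failure mode described
# above: the CSV syntax survives, but the characters do not.
with open('data-utf8.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)   # ['café', 'naïve', '東京']
```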
For a wealth of introductory and advanced information regarding character encoding issues there is a great resource on the web. You will find this and other helpful links listed on our Links We Like page.
Usage: Simply paste your CSV file into the Input area and hit Convert. You may also specify the element names for the columns (fields) along with a name for the document element. If you don't specify column names or if you specify too few, those without names will be labeled "coln" where n is the column number within the record starting from zero. If you don't specify a document element name, the document element will be "document". When your file is finished the XML will display in the Output area. Hit the Select button to select it, then use your clipboard to cut and paste it wherever you'd like.
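The converter itself isn't shown here, but a minimal sketch of the naming behavior just described (default "document" element, "colN" fallback names counted from zero) might look like this; the function signature and the <record> row element name are assumptions of the sketch, not the actual converter's API:

```python
import csv
import io
from xml.sax.saxutils import escape

def csv_to_xml(text, col_names=None, doc_element='document'):
    """Unnamed columns fall back to "colN" (N from zero); the default
    document element is "document", as the usage notes above describe."""
    col_names = col_names or []
    out = ['<%s>' % doc_element]
    for row in csv.reader(io.StringIO(text)):
        out.append('  <record>')
        for i, field in enumerate(row):
            name = col_names[i] if i < len(col_names) else 'col%d' % i
            out.append('    <%s>%s</%s>' % (name, escape(field), name))
        out.append('  </record>')
    out.append('</%s>' % doc_element)
    return '\n'.join(out)

print(csv_to_xml('John,Doe,"Anytown, WW"', ['first', 'last']))
# the third, unnamed column comes out as <col2>Anytown, WW</col2>
```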
Especially in designs that involve high-cost bandwidth, where large amounts of data must be moved often, CSV may be the better format to specify. Hybrid implementations that convert to and from CSV where bandwidth is critical may also be a workable solution in some designs.
The absolute theoretical best advantage for XML is one-letter element names in the XML and every field quoted in the CSV. Even with this unrealistic advantage, XML's overhead will be slightly greater than 200% of CSV's. Typically though, XML's overhead will be 900% to over 9000% greater than CSV's.
This is an apples to apples comparison and so only assumes transfers of regular tables (all rows of a column are the same type). XML and CSV will both transfer column names. In this comparison XML and CSV would both require extra overhead to transfer type data.
Please note: These numbers and analysis are for overhead only and do not attempt to measure or analyze the entire data file (overhead plus data). Because there is no typical data field size there is no typical ratio of overhead to data, so such an analysis would be meaningless for comparison purposes.
Lastly, when the data is very sparse, XML may be able to make up much of the overhead that CSV will use up in commas (though, this is being charitable).
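A quick back-of-the-envelope check of the best-case figure above (counting per-field markup only, as the analysis does):

```python
# Best theoretical case for XML: one-letter element names; for CSV:
# every field quoted. Assumption: per-field overhead only, data ignored.
xml_best = len('<a>') + len('</a>')  # 7 bytes per field with one-letter tags
csv_best = len('""') + len(',')      # 3 bytes per fully quoted CSV field
print(xml_best / csv_best)           # ~2.33 -> "slightly greater than 200%"
```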
The CSV format described in this article has been called the "800 pound gorilla of CSV". It is not the prettiest or most technically elegant method by any means. But it is the one supported and used by the world's largest software development company. For this reason it should be supported by anyone doing serious software development and looking for a universal import/export mechanism for the data produced.
There are many other CSV formats besides the variation described here. For many software designs, companies use CSV to transfer data between a number of their own applications and never intend it to be written or read by others. In such cases, the variations are almost endless.
In other cases, software developers produce their own formats with the expectation that others will adapt. One reason for such a design choice, even in the face of the 800 pound gorilla, might be that a much more technically advanced format is not that hard to conceive of. While the temptation might be to go with a more technically proficient design, for those producing new designs, we don't recommend it. Why? Can you say "BetaMax"?
For those who are developing new designs we recommend following the big hairy one. However, taking some simple steps might help to improve how your design inter-operates with many of the non-Excel variations that are out there.
- Considerations When Exporting CSV
The biggest differences are in how these three characters are handled:
- Embedded double quotes in fields.
  An escape character is sometimes used to introduce a double quote, or in place of it.
- Embedded line-feeds in fields.
  This one is also escaped sometimes, often C-style ("\n").
- Embedded commas in fields.
  Again, an escape character is sometimes used in place of the comma.
Theoretically, if you could avoid exporting any fields that include these characters, you would greatly improve the chances that your CSV file could be read by these variations.
Is it worth reducing your application's functionality or convenience? No. But if the existence of these characters is entirely superfluous, you might consider taking them out.
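As a sketch of that defensive-export idea (the function name and strip_specials flag are illustrative, and stripping is only sensible when these characters really are superfluous to your data):

```python
import csv
import io

def export_rows(rows, strip_specials=False):
    """Write rows in the Excel-style convention described in this article.
    strip_specials drops embedded quotes, commas, and line breaks before
    writing, to maximize readability across CSV variations."""
    if strip_specials:
        rows = [[str(f).replace('"', '').replace(',', ' ').replace('\n', ' ')
                 for f in row] for row in rows]
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

print(export_rows([['John "Da Man"', 'notes,\nwith specials']], True))
# John Da Man,notes  with specials
```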
- Considerations When Importing CSV
First you'd have to consider if extra import options would ever even be used or understood by your typical users. Second, you'd have to consider whether such extras would be used enough to merit adding the extra complexity to your application and to your development effort. If (and that's a big if) you get past these hurdles, there are some options you might consider including, such as:
- [ ] Check if line feeds are \n
- [ ] Check if embedded double quotes are \"
- [ ] Check if ???
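Where such options make sense, detection can sometimes be partially automated. A rough sketch of sniffing for those escape conventions (the heuristics and the function name are illustrative assumptions; real detection may still need user confirmation):

```python
import re

def sniff_escape_style(sample):
    """Guess whether a CSV sample uses C-style escapes instead of the
    doubled-quote convention described in this article."""
    return {
        'backslash_n_linefeeds': '\\n' in sample,
        'backslash_quotes': re.search(r'(?<!\\)\\"', sample) is not None,
        'doubled_quotes': '""' in sample,
    }

print(sniff_escape_style(r'a,"He said \"hi\" to me",line one\nline two'))
# {'backslash_n_linefeeds': True, 'backslash_quotes': True,
#  'doubled_quotes': False}
```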
Excel is an application that produces and uses CSV. One particular aspect of how Excel uses CSV has become a considerable source of confusion and uncertainty.
Excel will always remove leading zeros from fields before displaying them. It will also always remove leading spaces. It insists on helping you, even if removing these leading zeros will, for example, cause your numerically keyed records to sort out of order.
There is a solution
If you have a table for export in which leading zeros and spaces have significance AND that file will only be viewed in Excel, then you can use an Excel specific convention in those fields where leading zeros or spaces should be preserved (in Excel).
To implement this fix, simply place a leading single quote character ahead of each field you wish to preserve. This will tell Excel --the application-- to treat the field value as a string and preserve it verbatim.
...But the solution has a caveat
You can't just use the single quote qualifier all the time if you are exporting tables that may be used in applications other than Excel. In many applications the extra single quote will simply show up as corrupted data. Even in Excel applications, there are times when you will want to treat numbers as numbers. The leading quote will force Excel to treat numbers as strings of digits (not numbers).
A partial solution?
If you are exporting tables for general use that you know will often be viewed in Excel, you should probably provide a checkbox that the user making the export can check if he or she wishes to make "Excel-only CSV files with preserved leading characters."
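As a sketch of that export path (the function name, flag, and column-selection scheme are all illustrative; the single-quote convention itself is the one described above, and applies only to files destined exclusively for Excel):

```python
import csv
import io

def export_excel_only(rows, preserve=()):
    """Prefix a single quote to the fields in the given column indexes,
    so Excel treats them as strings and keeps leading zeros/spaces."""
    buf = io.StringIO()
    w = csv.writer(buf)
    for row in rows:
        w.writerow(["'" + str(f) if i in preserve else f
                    for i, f in enumerate(row)])
    return buf.getvalue()

print(export_excel_only([['00123', 'Doe']], preserve={0}))
# '00123,Doe   <- Excel will keep the leading zero in the first field
```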