C#操作Excel(读取)

一.使用OleDb,这个法子好像不大好使.容易读错.
引用System.Data.OleDb;

/**//// <summary>
/// 返回Excel数据源
/// </summary>
/// <param name="filename">文件路径</param>
/// <returns></returns>
static public DataSet ExcelToDataSet(string filename)
{
        DataSet ds;
string strCon = "Provider=Microsoft.Jet.OLEDB.4.0;" +
"Extended Properties=Excel 8.0;" +
"data source=" + filename;
        OleDbConnection myConn = new OleDbConnection(strCon);
string strCom = " SELECT * FROM [Sheet1$]";
        myConn.Open();
        OleDbDataAdapter myCommand = new OleDbDataAdapter(strCom, myConn);
        ds = new DataSet();
        myCommand.Fill(ds);
        myConn.Close();
return ds;
    }

二.使用com.
导入Microsoft.Excel
使用命名空间
using Excel= Microsoft.Office.Interop.Excel;
using System.Diagnostics;

public class ExcelHelper
{
private Excel._Application excelApp;
private string fileName=string.Empty;
private Excel.WorkbookClass wbclass;
public ExcelHelper(string _filename)
{
          excelApp = new Excel.Application();
object   objOpt   =   System.Reflection.Missing.Value;
          wbclass = (Excel.WorkbookClass)excelApp.Workbooks.Open(_filename, objOpt, false, objOpt, objOpt, objOpt, true, objOpt, objOpt, true, objOpt, objOpt, objOpt, objOpt, objOpt);
      }
/**//// <summary>
/// 所有sheet的名称列表
/// </summary>
/// <returns></returns>
public List<string> GetSheetNames()
{
          List<string> list = new List<string>();
          Excel.Sheets sheets = wbclass.Worksheets;
string sheetNams = string.Empty;
foreach (Excel.Worksheet sheet in sheets)
{
            list.Add(sheet.Name);
          }
return list;
      }
public Excel.Worksheet GetWorksheetByName(string name)
{
          Excel.Worksheet sheet=null;
          Excel.Sheets sheets= wbclass.Worksheets;
foreach (Excel.Worksheet s in sheets)
{
if (s.Name == name)
{
                  sheet = s;
break;
              }
          }
return sheet;
      }
/**//// <summary>
///
/// </summary>
/// <param name="sheetName">sheet名称</param>
/// <returns></returns>
public Array GetContent(string sheetName)
{
          Excel.Worksheet sheet = GetWorksheetByName(sheetName);
//获取A1 到AM24范围的单元格
          Excel.Range rang = sheet.get_Range("A1", "AM24");
//读一个单元格内容
//sheet.get_Range("A1", Type.Missing);
//不为空的区域,列,行数目
//   int l = sheet.UsedRange.Columns.Count;
// int w = sheet.UsedRange.Rows.Count;
//  object[,] dell = sheet.UsedRange.get_Value(Missing.Value) as object[,];
          System.Array values = (Array)rang.Cells.Value2;
return values;
      }

public void Close()
{
          excelApp.Quit();
          excelApp = null;
      }

    }

通常读取Excel文件的方法有两种,一是通过ADO的OleDb,一是通过Com组件;
这里要说的是通过OleDb读取时,遇到的内容丢失问题。
症状:使用Microsoft.Jet.OLEDB.4.0读取数据时,当某一字段内分别含有文本和数字的混合数据时,某一类型的数据会产生丢失。
读取方式:
string connString = " Provider=Microsoft.Jet.OLEDB.4.0; Data Source=文件名; Extended Properties='Excel 8.0;'";
OleDbConnection conn = new OleDbConnection(connString);
string sql = " SELECT * FROM [Sheet1$] ";
conn.Open();
OleDbDataAdapter adapter = new OleDbDataAdapter(sql, conn);
DataSet ds = new DataSet();
adapter.Fill(ds, "[Sheet1$]");
案例1
Excel数据如下:(一共9行)
"入库指令单号","SF公司","物料编号","实收数量","包装","批号","生产日期","入库时间","PO#","状态"
"11-20080808-000001","01","11111","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
"11-20080808-000001","01","11111","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
......
"11-20080808-000001","01","11111","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
"11-20080808-000001","01","xxxx","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
读取结果并不理想,第9行的xxxx数据丢失了
案例2
Excel数据如下:(一共9行)
"入库指令单号","SF公司","物料编号","实收数量","包装","批号","生产日期","入库时间","PO#","状态"
"11-20080808-000001","01","xxxx","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
"11-20080808-000001","01","11111","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
......
"11-20080808-000001","01","11111","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
"11-20080808-000001","01","xxxx","100","fasdfas","fdafad","08/08/2008","2008-08-08","SF001","1"
读取结果也不理想,第1行和第9行的xxxx数据都丢失了
问题分析:
产生这种问题的根源与Excel ISAM(Indexed Sequential Access Method,即索引顺序存取方法)驱动程序的限制有关,Excel ISAM 驱动程序通过检查前几行中实际值确定一个 Excel 列的类型,然后选择能够代表其样本中大部分值的数据类型。也即Excel ISAM查找某列前几行(默认情况下是8行),把占多的类型作为其处理类型。例如如果数字占多,那么其它含有字母等文本的数据项就会置空;相反如果文本居多,纯数字的数据项就会被置空。
现具体分析Extended Properties项中的HDR和IMEX所代表的含义。HDR用来设置是否将Excel表中第一行作为字段名,“YES”代表是,“NO”代表不是即也为数据内容;IMEX是用来告诉驱动程序使用Excel文件的模式,其值有0、1、2三种,分别代表导出、导入、混合模式。当我们设置IMEX=1时将强制混合数据转换为文本,但仅仅这种设置并不可靠,IMEX=1只确保在某列前8行数据至少有一个是文本项的时候才起作用,它只是把查找前8行数据中数据类型占优选择的行为作了略微的改变。例如某列前8行数据全为纯数字,那么它仍然以数字类型作为该列的数据类型,随后行里的含有文本的数据仍然变空。
解决方案1
前提,你的Excel包括Header,且Header都是文本,如案例1和案例2
使用:Provider=Microsoft.Jet.OLEDB.4.0; Data Source=文件名; Extended Properties='Excel 8.0;HDR=NO;IMEX=1'
IMEX=1将强制混合数据转换为文本,HDR=NO将第一行作为内容,由于第一行Header都是文本,因此所有列的类型都将转换成文本。
解决方案2
如果你的Excel不包含Header,且某列的数据又是混合数据,这种情况只能通过Com组件来读取Excel。
使用OleDb读取Excel的条件总结:
只有一条,保证Excel中每一列的前8行数据中至少有一个文本。(多适用于有Header的内容)

 

---------------

The OleDb library will, more often than not, mess up your data in an Excel spreadsheet. This is largely because it forces everything into a fixed-type column layout, guessing at the type of each column from the values in the first 8 cells in each column. If it guesses wrong, you end up with digit strings converted to scientific-notation. Blech!

To avoid this you're better off skipping the OleDb and reading the sheet directly yourself. You can do this using the COM interface of Excel (also blech!), or a third-party .NET Excel-compatible reader. SpreadsheetGear is one such library that works reasonably well, and has an interface that's very similar to Excel's COM interface.

 

Microsoft ADO.NET provides a handy, if quirky way to access Excel spreadsheets from Windows applications. The idea is to treat spreadsheets like databases, with each worksheet represented as a "table". Worksheets are expected to be in a table-like format with column headings in the first row and rows of data beneath. For example, the following code reads worksheet "foo" from spreadsheet file C:\BAR.XLS into a DataTable:

DataTable fooData = new DataTable ();
OleDbConnection dbConnection =
  new OleDbConnection
    (@"Provider=Microsoft.Jet.OLEDB.4.0;"
     + @"Data Source=C:\BAR.XLS;"
     + @"Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    OleDbDataAdapter dbAdapter =
        new OleDbDataAdapter
            ("SELECT * FROM [foo$]", dbConnection);
    dbAdapter.Fill (fooData);
}
finally
{
    dbConnection.Close ();
}

If you want to process the data row by row rather than snarfing it into a DataTable, you can do it this way:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [foo$]", dbConnection);
    OleDbDataReader dbReader = dbCommand.ExecuteReader ();

    // Say we are interested only in the columns "YearOfBirth" and "Country":
    int yearOfBirthIndex = dbReader.GetOrdinal ("YearOfBirth");
    int countryIndex = dbReader.GetOrdinal ("Country");

    while (dbReader.Read ())
    {
	string yearOfBirth = dbReader.GetValue (yearOfBirthIndex).ToString ();
	string country = dbReader.GetValue (countryIndex).ToString ();

	// ...
    }
}
finally
{
    dbConnection.Close ();
}

But what if you don’t know the name of the sheet you want to read? As you can see from the examples, the ADO.NET interface requires you to name the worksheet – but in many cases you just want to read the first worksheet regardless of its name. It would be cool if ADO.NET provided a suitable notation like this:

// I WISH THIS WORKED
OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [0#]", dbConnection);

… but it doesn’t. You must tell ADO.NET the specific name of the sheet you want to read. The solution is therefore to read the spreadsheet schema to find out the sheet names, thus reducing the second problem to the first one, like this:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;""");
dbConnection.Open ();
try
{
    // Get the name of the first worksheet:
    DataTable dbSchema = dbConnection.GetOleDbSchemaTable (OleDbSchemaGuid.Tables, null);
    if (dbSchema == null || dbSchema.Rows.Count < 1)
    {
        throw new Exception ("Error: Could not determine the name of the first worksheet.");
    }
    string firstSheetName = dbSchema.Rows [0] ["TABLE_NAME"].ToString ();

    // Now we have the table name; proceed as before:
    OleDbCommand dbCommand = new OleDbCommand ("SELECT * FROM [" + firstSheetName + "]", dbConnection);
    OleDbDataReader dbReader = dbCommand.ExecuteReader ();

    // And so on...
}
finally
{
    dbConnection.Close ();
}

The main quirk about the ADO.NET interface is how datatypes are handled. (You'll notice I've been carefully avoiding the question of which datatypes are returned when reading the spreadsheet.) Are you ready for this? ADO.NET scans the first 8 rows of data, and based on that guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!

I suppose that makes a twisted kind of sense if you have spent a lot of time working with relational databases, but there are some unfortunate consequences of this design which aren't obvious at first. For example, say your spreadsheet contains the following columns:

YearOfBirth    Country	PostalCode
1964	       USA	10005
1970	       USA	10001
1952	       Canada	K2P1R6
1981	       Canada	L3R3R2
1974	       USA	10013

ADO.NET will correctly guess that the YearOfBirth column is numeric, and that the Country column is of type string. But what about the PostalCode column, which contains a mix of numbers and strings? In this case ADO.NET chooses the type based on the majority of the values (with a tie going to numeric). In this example 3 of the 5 postal codes are numeric, so ADO.NET will declare the column to be numeric. Therefore it will attempt to cast each cell to a number, which will fail for the Canadian postal codes - which will therefore come out as NULL values. Ha ha. Isn't that fun?

Even more entertaining, there is absolutely no way to make this 100% reliable - although with some pain, you can improve the situation. Here's what you need to do. First add the "IMEX=1" option to your connection string like this:

OleDbConnection dbConnection = new OleDbConnection (@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\BAR.XLS;Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1;""");

That tells ADO.NET to honor the following registry key when reading the spreadsheet:

Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes

This registry key tells ADO.NET what to do when it encounters mixed types in the first 8 rows. It can either be set to the string "Majority Type" (for the default behavior) or to "Text" (which forces the column to be of type string). Note that you are still screwed if the first 8 postal codes are numeric and the 9th is Canadian. (Not to mention that the "Text" option invokes handling that fails on strings over 255 characters, but let's skip that for now.)

There's also a second relevant registry setting (which is honored regardless of the IMEX option):

Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/TypeGuessRows

That says how many rows to scan to guess the datatype. The default is 8, but you can set it anywhere from 0-16 decimal (0 meaning "scan the first 16384 rows", and all other values meaning what they say). Putting this all together, the most reliable way to read a US/Canadian postal code is to use the following registry settings:

TypeGuessRows = 0
ImportMixedTypes = Text

That's pretty close to perfect, although it will still fail if the first 16384 postal codes are numeric and any of the subsequent ones aren't.

This is a Bad Design for so many reasons I don't know where to start. First, the behavior of the spreadsheet importer should not depend on global registry settings - that is just inviting mysterious, data-dependent errors whenever other applications or users change those settings. All of those settings should be in the connect string and nowhere else. Second, there should be an option to say "I'm not sure what data is coming, but I want all of it - please coerce everything to something universal like an arbitrary-length string". Third, the interface should be stream-based, not file-based. If you are reading the spreadsheet data from the network, you shouldn't have to save it to a temporary file in order to parse it. Fourth, you shouldn't have to read the spreadsheet schema if you just want to select the worksheet by index (e.g. you want to read the first worksheet, whatever it happens to be called).

For an example of a nice interface for reading and writing Excel spreadsheets, check out Jakarta POI (an open source Java library).

posted @ 2009-12-16 11:21  汤包  阅读(1970)  评论(0编辑  收藏  举报