A Complaint About Seek and Read on File Streams

The Problem

To read a chunk of data starting at a given position in a file, we would typically write something like this:

Read File Stream Content
    private static string ReadContent(string fileName, int position, int length)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            // position is a byte offset into the underlying stream.
            reader.BaseStream.Seek(position, SeekOrigin.Begin);
            char[] buffer = new char[length];
            // Read may return fewer characters than requested (e.g. near end of file).
            int read = reader.Read(buffer, 0, length);

            return new string(buffer, 0, read);
        }
    }

 

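This reads naturally and is easy to understand. If we want to read several such segments from the same file, we would typically write the following (taking multiple positions and the corresponding lengths to read; the parameter list is for illustration only):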

Read Content With Seeking
    private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];

            for (int i = 0; i < positions.Length; i++)
            {
                // Reposition the underlying stream, then read through the same reader.
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                char[] buffer = new char[lengths[i]];
                int read = reader.Read(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }

            return contents;
        }
    }

This does not look wrong either. But if we run a small test program, the result is surprising:

Test App
    static void Main(string[] args)
    {
        string fileName = @"text.txt";

        // Create a small test file with known content.
        using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.Write("ABCDEFGHIJKLMNOPQ");
        }

        // One reader per segment.
        Console.WriteLine(ReadContent(fileName, 4, 2));
        Console.WriteLine(ReadContent(fileName, 10, 2));
        Console.WriteLine(ReadContent(fileName, 7, 2));
        Console.WriteLine();

        // The same segments, read through a single reader.
        string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
        foreach (var item in contents)
        {
            Console.WriteLine(item);
        }

        Console.ReadKey();
    }

The output is:

[Screenshot "Capture": console output of the test program]

 

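So when we try to reposition within the same stream, the class library API does not return the content we expect. It looks as if, after the first Seek on a file stream object, every subsequent Seek is ignored! Why is that?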

 

Analysis

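In fact, for performance reasons, StreamReader keeps and maintains its own internal byte buffer. If no buffer size is specified when the StreamReader is constructed, this buffer defaults to 1 KB. The 4 KB figure in the source below is a separate setting: it is the buffer size of the FileStream that StreamReader opens itself when it is constructed from a file path. Every Read call is served, directly or indirectly, from the reader's internal buffer.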

Buffer Size
    // Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
    // perf-wise.  On even a 40 MB text file, any perf loss by using a 4K
    // buffer is negated by the win of allocating a smaller byte[], which
    // saves construction time.  This does break adaptive buffering,
    // but this is slightly faster.
    internal const int DefaultBufferSize = 1024;  // Byte buffer size
    private const int DefaultFileStreamBufferSize = 4096;
    private const int MinBufferSize = 128;
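
The buffering can be made visible with a small experiment. The following is a minimal sketch, assuming a MemoryStream stands in for the file and ASCII text is used so that bytes and characters line up; the names BufferPeek and PositionAfterSmallRead are illustrative and do not come from the class library. It asks the reader for only two characters and then reports where the underlying stream ended up.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferPeek
{
    // Reads just two characters through a StreamReader and returns the
    // position of the underlying stream afterwards.
    public static long PositionAfterSmallRead(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
            char[] buffer = new char[2];
            reader.Read(buffer, 0, 2);

            // The reader filled its byte buffer in one go, so the stream
            // has moved much further than the two characters we asked for.
            return reader.BaseStream.Position;
        }
    }

    public static void Main()
    {
        Console.WriteLine(PositionAfterSmallRead("ABCDEFGHIJKLMNOPQ"));
    }
}
```

With the 17-byte sample string, this should report a position of 17 rather than 2: the whole content already sits in the reader's buffer after the first small read.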

 

Read Buffer
    // This version has a perf optimization to decode data DIRECTLY into the
    // user's buffer, bypassing StreamWriter's own buffer.
    // This gives a > 20% perf improvement for our encodings across the board,
    // but only when asking for at least the number of characters that one
    // buffer's worth of bytes could produce.
    // This optimization, if run, will break SwitchEncoding, so we must not do
    // this on the first call to ReadBuffer.
    private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
        charLen = 0;
        charPos = 0;
        if (!_checkPreamble)
            byteLen = 0;
        int charsRead = 0;
        // As a perf optimization, we can decode characters DIRECTLY into a
        // user's char[].  We absolutely must not write more characters
        // into the user's buffer than they asked for.  Calculating
        // encoding.GetMaxCharCount(byteLen) each time is potentially very
        // expensive - instead, cache the number of chars a full buffer's
        // worth of data may produce.  Yes, this makes the perf optimization
        // less aggressive, in that all reads that asked for fewer than AND
        // returned fewer than _maxCharsPerBuffer chars won't get the user
        // buffer optimization.  This affects reads where the end of the
        // Stream comes in the middle somewhere, and when you ask for
        // fewer chars than than your buffer could produce.
        readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        do {
            if (_checkPreamble) {
                BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble.  Are two threads using this StreamReader at the same time?");
                int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
                BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");

                if (len == 0) {
                    // EOF but we might have buffered bytes from previous
                    // attempts to detecting preamble that needs to decoded now
                    if (byteLen > 0) {
                        if (readToUserBuffer) {
                            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                            charLen = 0;  // StreamReader's buffer is empty.
                        }
                        else {
                            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                            charLen += charsRead;  // Number of chars in StreamReader's buffer.
                        }
                    }
                    return charsRead;
                }

                byteLen += len;
            }
            else {
                BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble.  Are two threads using this StreamReader at the same time?");
                byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
                BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");
                if (byteLen == 0)  // EOF
                    return charsRead;
            }

            // _isBlocked == whether we read fewer bytes than we asked for.
            // Note we must check it here because CompressBuffer or
            // DetectEncoding will ---- with byteLen.
            _isBlocked = (byteLen < byteBuffer.Length);
            // Check for preamble before detect encoding. This is not to override the
            // user suppplied Encoding for the one we implicitly detect. The user could
            // customize the encoding which we will loose, such as ThrowOnError on UTF8
            // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
            // doesn't change the encoding or affect _maxCharsPerBuffer
            if (IsPreamble())
                continue;

            // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
            if (_detectEncoding && byteLen >= 2) {
                DetectEncoding();
                // DetectEncoding changes some buffer state.  Recompute this.
                readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
            }

            charPos = 0;
            if (readToUserBuffer) {
                charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                charLen = 0;  // StreamReader's buffer is empty.
            }
            else {
                charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                charLen += charsRead;  // Number of chars in StreamReader's buffer.
            }
        } while (charsRead == 0);

        _isBlocked &= charsRead < desiredChars;
        //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+"  readToUserBuffer: "+readToUserBuffer);
        return charsRead;
    }



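So the problem comes down to this: when BaseStream.Seek is called the second time, the reader's buffer is not refilled! The second Read is therefore served from whatever was buffered after the first Seek, that is, up to one buffer's worth of data (1 KB by default for a reader constructed this way) starting at the first Seek position. The position requested by the new Seek may sit at a completely different offset in that cache, or not be in the cache at all.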

To make the reader drop its cache when seeking again, you have to call DiscardBufferedData() explicitly:

Code Snippet
    // DiscardBufferedData tells StreamReader to throw away its internal
    // buffer contents.  This is useful if the user needs to seek on the
    // underlying stream to a known location then wants the StreamReader
    // to start reading from this new point.  This method should be called
    // very sparingly, if ever, since it can lead to very poor performance.
    // However, it may be the only way of handling some scenarios where
    // users need to re-read the contents of a StreamReader a second time.
    public void DiscardBufferedData() {
        byteLen = 0;
        charLen = 0;
        charPos = 0;
        decoder = encoding.GetDecoder();
        _isBlocked = false;
    }
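
Applied to the multi-segment read, the fix might look like the sketch below. It is only an illustration under a few assumptions: the method takes a Stream instead of a file name so that it can be tried against a MemoryStream, the text is ASCII so byte offsets equal character offsets, and the class name SeekingReader is made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class SeekingReader
{
    // Reads several (position, length) segments through one StreamReader.
    // Positions are byte offsets; this sketch assumes a single-byte encoding.
    public static string[] ReadContents(Stream stream, int[] positions, int[] lengths)
    {
        string[] contents = new string[positions.Length];

        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
            for (int i = 0; i < positions.Length; i++)
            {
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                // Throw away whatever was buffered before the seek, so the
                // next read starts at the new stream position.
                reader.DiscardBufferedData();

                char[] buffer = new char[lengths[i]];
                int read = reader.ReadBlock(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }
        }

        return contents;
    }

    public static void Main()
    {
        var data = new MemoryStream(Encoding.ASCII.GetBytes("ABCDEFGHIJKLMNOPQ"));
        foreach (string item in ReadContents(data, new[] { 4, 10, 7 }, new[] { 2, 2, 2 }))
        {
            Console.WriteLine(item); // EF, KL, HI on separate lines
        }
    }
}
```

The price is the one the source comment warns about: every seek discards the buffer, so each segment costs a fresh read from the underlying stream.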

 

A Small Complaint

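I remember the book "Framework Design" discussing some regrets in the design of the .NET class library; I don't know whether this one counts. I consider myself at least a seasoned developer, yet my first reaction on hitting this problem was bewilderment. When I read the source, it felt tricky and smelly. The library designers clearly made uncharitable assumptions about programmers' intentions and abilities. They believed they had found a clever balance between performance and usability, but in practice the API is ambiguous and can plainly produce wrong results. Granted, statistically speaking, reads tend to happen close to one another; in other words, cached content has a good chance of being read next. But performance must always rest on correctness. The regrettable part of this API is that it overlooks the need to Seek more than once.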

 

Let us speculate about how it could have been designed.

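If the goal is a large, full-featured design, the cache can certainly stay. But then positioning clearly cannot rely on BaseStream.Seek alone: StreamReader, or its base class TextReader, should offer its own Seek API that wraps the positioning of BaseStream together with the positioning of the cached data. Would such an API be friendlier to programmers? I think so; at the very least it would not invite misunderstanding.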
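
As a rough illustration of that idea, here is a minimal sketch of a wrapper that exposes Seek on the reader itself. SeekableTextReader is a hypothetical class, not part of the .NET class library. To stay short it simply discards the cache on every Seek instead of repositioning inside it, and it treats offsets as byte offsets, which only matches character offsets for single-byte encodings.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper: stream positioning and cache invalidation
// are offered as a single operation.
public sealed class SeekableTextReader : IDisposable
{
    private readonly StreamReader _reader;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        _reader = new StreamReader(stream, encoding);
    }

    // Moves the underlying stream and drops the stale buffered data together.
    public void Seek(long byteOffset, SeekOrigin origin)
    {
        _reader.BaseStream.Seek(byteOffset, origin);
        _reader.DiscardBufferedData();
    }

    // Reads up to count characters from the current position.
    public string Read(int count)
    {
        char[] buffer = new char[count];
        int read = _reader.ReadBlock(buffer, 0, count);
        return new string(buffer, 0, read);
    }

    public void Dispose()
    {
        _reader.Dispose();
    }
}

public static class SeekableTextReaderDemo
{
    public static void Main()
    {
        var data = new MemoryStream(Encoding.ASCII.GetBytes("ABCDEFGHIJKLMNOPQ"));
        using (var reader = new SeekableTextReader(data, Encoding.ASCII))
        {
            reader.Seek(4, SeekOrigin.Begin);
            Console.WriteLine(reader.Read(2)); // EF
            reader.Seek(10, SeekOrigin.Begin);
            Console.WriteLine(reader.Read(2)); // KL
        }
    }
}
```

A fuller version could check whether the target offset still falls inside the buffered range and only move the read cursor in that case, which would keep the performance benefit of the cache.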

 

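If the goal is a small, focused design, the caching mechanism could be dropped altogether and replaced by a buffer that the programmer supplies. Whether to implement a cache at all would then be entirely up to the user of the language. A language or class library built this way is just as robust, and just as acceptable to programmers.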
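
A minimal sketch of what that style could look like with today's API, bypassing StreamReader and reading through the Stream directly into a buffer owned by the caller. RawSegmentReader and ReadSegment are illustrative names; the sketch assumes the requested byte range does not cut a multi-byte character in half.

```csharp
using System;
using System.IO;
using System.Text;

public static class RawSegmentReader
{
    // Reads length bytes starting at position into the caller's buffer
    // and decodes them. No state is kept between calls.
    public static string ReadSegment(Stream stream, long position, int length,
                                     byte[] buffer, Encoding encoding)
    {
        if (buffer.Length < length)
        {
            throw new ArgumentException("Buffer is smaller than the requested length.", "buffer");
        }

        stream.Seek(position, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop
        // until the segment is complete or the stream ends.
        int total = 0;
        while (total < length)
        {
            int n = stream.Read(buffer, total, length - total);
            if (n == 0)
            {
                break; // end of stream
            }
            total += n;
        }

        return encoding.GetString(buffer, 0, total);
    }

    public static void Main()
    {
        var data = new MemoryStream(Encoding.ASCII.GetBytes("ABCDEFGHIJKLMNOPQ"));
        byte[] buffer = new byte[16]; // owned and reused by the caller

        Console.WriteLine(ReadSegment(data, 4, 2, buffer, Encoding.ASCII));  // EF
        Console.WriteLine(ReadSegment(data, 10, 2, buffer, Encoding.ASCII)); // KL
        Console.WriteLine(ReadSegment(data, 7, 2, buffer, Encoding.ASCII));  // HI
    }
}
```

Because nothing is cached behind the caller's back, every Seek takes effect immediately; any caching policy becomes an explicit decision in the calling code.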

 

Summary

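Recently the discussion on cnblogs about the C# language itself and the .NET class library has been deep and heated. Personally, I think debate is what drives every language forward. I wanted to say something, and this small example came to mind. As programmers, we may not really care whether a pattern is supported by the language or by the class library. We only hope that class library design will have a little less of the "flash of inspiration" seen in this example, and a little more solid, down-to-earth work.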

posted @ 2010-07-11 19:12  Jeffrey Sun