A Complaint About Seek and Read on File Streams
The Problem
To read a segment of data starting at a specified position in a file, we would normally write something like this:
private static string ReadContent(string fileName, int position, int length)
{
    if (!File.Exists(fileName))
    {
        throw new FileNotFoundException("The specified file is not found : " + fileName);
    }

    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        // Position the underlying stream, then read 'length' characters from there.
        reader.BaseStream.Seek(position, SeekOrigin.Begin);
        char[] buffer = new char[length];
        reader.Read(buffer, 0, length);
        return new string(buffer, 0, length);
    }
}
The code above is direct and easy to understand. If we want to read several such segments from the same file, we would typically write something like the following (multiple positions and the corresponding lengths to read; the parameter list is only for illustration):
private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
{
    if (!File.Exists(fileName))
    {
        throw new FileNotFoundException("The specified file is not found : " + fileName);
    }

    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string[] contents = new string[positions.Length];
        for (int i = 0; i < positions.Length; i++)
        {
            // Intent: seek to each position and read the corresponding segment.
            reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
            char[] buffer = new char[lengths[i]];
            reader.Read(buffer, 0, lengths[i]);
            contents[i] = new string(buffer, 0, lengths[i]);
        }
        return contents;
    }
}
This also looks perfectly fine. But if we write a small test program, the result is surprising:
static void Main(string[] args)
{
    string fileName = @"text.txt";

    using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
    using (StreamWriter writer = new StreamWriter(stream))
    {
        writer.Write("ABCDEFGHIJKLMNOPQ");
    }

    Console.WriteLine(ReadContent(fileName, 4, 2));
    Console.WriteLine(ReadContent(fileName, 10, 2));
    Console.WriteLine(ReadContent(fileName, 7, 2));
    Console.WriteLine();

    string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
    foreach (var item in contents)
    {
        Console.WriteLine(item);
    }

    Console.ReadKey();
}
The output is:
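The three standalone ReadContent calls print the expected segments, but the batch version does not; the output should look like this (the second group simply keeps reading from where the first read stopped):

EF
KL
HI

EF
GH
IJ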
So when we try to reposition within the same stream, the library does not return the content we expect. It looks as if, once the first Seek has been issued on the file stream, every subsequent Seek is simply ignored! Why is that?
Analysis
In fact, for performance reasons StreamReader keeps and maintains its own internal byte buffer. If no size is specified when the StreamReader is constructed, that buffer defaults to 1 KB (the FileStream underneath has its own, separate 4 KB buffer). Every Read call is, directly or indirectly, served out of this internal buffer.
// Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
// perf-wise. On even a 40 MB text file, any perf loss by using a 4K
// buffer is negated by the win of allocating a smaller byte[], which
// saves construction time. This does break adaptive buffering,
// but this is slightly faster.
internal const int DefaultBufferSize = 1024; // Byte buffer size
private const int DefaultFileStreamBufferSize = 4096;
private const int MinBufferSize = 128;
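As a side note, the size of this internal buffer can be chosen explicitly: one of the StreamReader constructor overloads takes a bufferSize parameter. A minimal sketch (nothing in this article depends on it):

// Sketch: override the 1 KB default with an explicit internal buffer size.
// Overload used: StreamReader(Stream, Encoding, bool detectEncodingFromByteOrderMarks, int bufferSize).
using (FileStream stream = new FileStream("text.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true, 4096))
{
    Console.WriteLine(reader.ReadToEnd());
}

Whatever the size, whenever the buffer runs dry it is refilled by the internal ReadBuffer method: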
// This version has a perf optimization to decode data DIRECTLY into the
// user's buffer, bypassing StreamWriter's own buffer.
// This gives a > 20% perf improvement for our encodings across the board,
// but only when asking for at least the number of characters that one
// buffer's worth of bytes could produce.
// This optimization, if run, will break SwitchEncoding, so we must not do
// this on the first call to ReadBuffer.
private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
    charLen = 0;
    charPos = 0;

    if (!_checkPreamble)
        byteLen = 0;

    int charsRead = 0;

    // As a perf optimization, we can decode characters DIRECTLY into a
    // user's char[]. We absolutely must not write more characters
    // into the user's buffer than they asked for. Calculating
    // encoding.GetMaxCharCount(byteLen) each time is potentially very
    // expensive - instead, cache the number of chars a full buffer's
    // worth of data may produce. Yes, this makes the perf optimization
    // less aggressive, in that all reads that asked for fewer than AND
    // returned fewer than _maxCharsPerBuffer chars won't get the user
    // buffer optimization. This affects reads where the end of the
    // Stream comes in the middle somewhere, and when you ask for
    // fewer chars than than your buffer could produce.
    readToUserBuffer = desiredChars >= _maxCharsPerBuffer;

    do {
        if (_checkPreamble) {
            BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble. Are two threads using this StreamReader at the same time?");
            int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
            BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number! This is a bug in your stream class.");

            if (len == 0) {
                // EOF but we might have buffered bytes from previous
                // attempts to detecting preamble that needs to decoded now
                if (byteLen > 0) {
                    if (readToUserBuffer) {
                        charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                        charLen = 0; // StreamReader's buffer is empty.
                    }
                    else {
                        charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                        charLen += charsRead; // Number of chars in StreamReader's buffer.
                    }
                }
                return charsRead;
            }

            byteLen += len;
        }
        else {
            BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble. Are two threads using this StreamReader at the same time?");
            byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
            BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number! This is a bug in your stream class.");

            if (byteLen == 0) // EOF
                return charsRead;
        }

        // _isBlocked == whether we read fewer bytes than we asked for.
        // Note we must check it here because CompressBuffer or
        // DetectEncoding will ---- with byteLen.
        _isBlocked = (byteLen < byteBuffer.Length);

        // Check for preamble before detect encoding. This is not to override the
        // user suppplied Encoding for the one we implicitly detect. The user could
        // customize the encoding which we will loose, such as ThrowOnError on UTF8
        // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
        // doesn't change the encoding or affect _maxCharsPerBuffer
        if (IsPreamble())
            continue;

        // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
        if (_detectEncoding && byteLen >= 2) {
            DetectEncoding();
            // DetectEncoding changes some buffer state. Recompute this.
            readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        }

        charPos = 0;
        if (readToUserBuffer) {
            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
            charLen = 0; // StreamReader's buffer is empty.
        }
        else {
            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
            charLen += charsRead; // Number of chars in StreamReader's buffer.
        }
    } while (charsRead == 0);

    _isBlocked &= charsRead < desiredChars;

    //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+" readToUserBuffer: "+readToUserBuffer);
    return charsRead;
}
So the problem boils down to this: when BaseStream.Seek is called for the second time, the reader's buffer is not refilled. The second Read therefore returns whatever is still sitting in the buffer, that is, data read from the file starting at the first Seek position (up to one buffer's worth, 1 KB by default). The position we just sought to has nothing to do with what the buffer holds, and may not be in the buffer at all.
If you want the buffered data to be thrown away so that the next Seek actually takes effect, you must call DiscardBufferedData() explicitly:
// DiscardBufferedData tells StreamReader to throw away its internal
// buffer contents. This is useful if the user needs to seek on the
// underlying stream to a known location then wants the StreamReader
// to start reading from this new point. This method should be called
// very sparingly, if ever, since it can lead to very poor performance.
// However, it may be the only way of handling some scenarios where
// users need to re-read the contents of a StreamReader a second time.
public void DiscardBufferedData() {
    byteLen = 0;
    charLen = 0;
    charPos = 0;
    decoder = encoding.GetDecoder();
    _isBlocked = false;
}
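With that, fixing the ReadContents method above is simple. The following is just a sketch (ReadContentsFixed is my name, not a library API); it discards the reader's buffered data every time it seeks:

private static string[] ReadContentsFixed(string fileName, int[] positions, int[] lengths)
{
    if (!File.Exists(fileName))
    {
        throw new FileNotFoundException("The specified file is not found : " + fileName);
    }

    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (StreamReader reader = new StreamReader(stream))
    {
        string[] contents = new string[positions.Length];
        for (int i = 0; i < positions.Length; i++)
        {
            reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
            // Throw away the stale buffered data so the next Read really
            // starts at the position we just sought to.
            reader.DiscardBufferedData();

            char[] buffer = new char[lengths[i]];
            int read = reader.Read(buffer, 0, lengths[i]);
            contents[i] = new string(buffer, 0, read);
        }
        return contents;
    }
}

With this change the batch read should return the same segments as the three standalone calls.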
A Bit of Griping
I remember that the book "Framework Design" discusses some of the regrets in the design of the .NET class libraries; I do not know whether this one counts. I consider myself at least a reasonably experienced .NET programmer, yet my first reaction on hitting this problem was sheer surprise, and when I read the source, the code struck me as tricky and smelly. The library designers clearly made uncharitable assumptions about programmers' intentions and abilities. They believed they had found a clever balance between performance and usability, but in practice the result is an ambiguous API that quietly produces wrong results. Granted, statistically most reads happen close to one another, so buffered content has a good chance of being read again; but performance must always be built on top of correctness. The regrettable thing about this API is that it simply ignores the multiple-Seek scenario.
Let us speculate for a moment about how this could have been designed.
If the goal is to be comprehensive, keeping such a cache is perfectly fine, but the design should not rely on the caller seeking the BaseStream directly. Instead, StreamReader, or its base class TextReader, should expose a Seek API that encapsulates both the repositioning of the BaseStream and the handling of the buffered data. Wouldn't such an API be friendlier to programmers? I think so; at the very least it would not invite misunderstanding. A rough sketch of the idea follows.
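StreamReader offers no such method today, but as an illustration the seek-and-discard pair could be wrapped in an extension method (SeekTo and StreamReaderSeekExtensions are hypothetical names, not BCL APIs):

using System.IO;

public static class StreamReaderSeekExtensions
{
    // Hypothetical helper: reposition the underlying stream AND reset the
    // reader's internal buffers in a single call, roughly the API that
    // StreamReader or TextReader could have exposed.
    public static void SeekTo(this StreamReader reader, long position)
    {
        reader.BaseStream.Seek(position, SeekOrigin.Begin);
        reader.DiscardBufferedData();
    }
}

With such a helper, the loop in ReadContents would simply call reader.SeekTo(positions[i]) before each read and behave as one would expect.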
If the goal is to be small and lean, the caching mechanism could be dropped entirely and replaced by whatever buffering the programmer chooses to provide, leaving it to the user of the language to decide whether to implement a cache at all. Such a library would be just as robust, and just as acceptable to programmers. A sketch of that caller-managed approach follows.
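For comparison, here is a minimal sketch of the caller-managed style as it can be written today, bypassing StreamReader entirely (ReadSegment is a hypothetical helper; it assumes the usual System.IO and System.Text usings and treats position and length as byte offsets, which matches the single-byte sample text used above):

// Hypothetical caller-managed alternative: no hidden buffering, the caller
// reads raw bytes at an absolute offset and decodes them explicitly.
private static string ReadSegment(string fileName, long position, int byteCount)
{
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        stream.Seek(position, SeekOrigin.Begin);

        byte[] buffer = new byte[byteCount];
        int total = 0;
        // FileStream.Read may return fewer bytes than requested, so loop until
        // the request is satisfied or the end of the file is reached.
        while (total < byteCount)
        {
            int read = stream.Read(buffer, total, byteCount - total);
            if (read == 0)
                break;
            total += read;
        }

        return Encoding.UTF8.GetString(buffer, 0, total);
    }
}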
Summary
Recently the discussions on cnblogs about the C# language itself and the .NET class libraries have been thorough and lively. Privately I think debate is what drives any language forward. I wanted to say something, and this little example came to mind. As programmers we may not care much whether a feature lives in the language or in the class library; we only hope that library design contains a little less of the cleverness shown in the example above, and a little more plain solidity.
Author: Jeffrey Sun
Source: http://sun.cnblogs.com/
This article is provided "as is" with no warranties, and confers no rights. Copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise this notice must be kept and a clearly visible link to the original article must appear on the page; otherwise the author reserves the right to pursue legal liability.