Wrapping I/O in C++

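While working on a small distributed-computing project, I used file I/O heavily. To simplify the code, cut down on repetition, and reduce the chance of errors, I wrapped the I/O module.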

At the time I needed to read and write large text files with a regular layout, for example:

label feature1 feature2 feature3
1 3.4 2.2 1
0 1.1 0.7 4

So a new requirement will still mean rewriting parts of it; this is where the module needs improvement.

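To improve my English reading and writing, some of the code comments are in English. I wonder how I'll feel reading this clumsy English a few years from now!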

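1. Low-level design

There should be three operations on the underlying file (here, a text file): read, write, and append. We can therefore declare the open mode with an enum: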

enum class FileOpenMode:int
{
    Write = 0,
    Read = 1,
    Append = 2
};

Only what was needed at the time is implemented; further functionality will be added when the need arises.

class LocalStream
{
public:
    LocalStream(const std::string& path, FileOpenMode mode);
    ~LocalStream();

    void Write(const void *buf, size_t size);
    
    /*
     * @param buf pointer to a memory buffer
     * @param size maximum number of bytes to read
     * @return the number of bytes actually read
     */
    size_t Read(void *buf, size_t size);

    //void Append(const void *buf, size_t size);

    bool Good();

private:
    bool is_good_;
    FILE *fp_;
    std::string path_;
};

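The member functions are straightforward to implement; the code is at the end of the post.

2. Middle layer

A TextReader wraps the operations on text files. Because the file is large, reading it into memory in one go is neither feasible nor necessary, so only part of the data is read at a time. The key parameters of TextReader are therefore the file name and the buffer size (how much data is read at a time).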

class TextReader
{
public:
    TextReader(const std::string &path, size_t buf_size = 1024);
    ~TextReader();
    size_t GetLine(std::string &line);
private:
    size_t LoadBuffer();
    char* buf_;
    size_t pos_, buf_size_, length_;
    LocalStream* stream_;
};
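
The chunked-reading idea described above can be sketched on its own, independent of the classes in this post. This is a minimal illustration assuming plain C stdio; `CountNewlinesChunked` and its parameters are hypothetical names that do not appear in the original module.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a file in fixed-size chunks and counts
// newline characters, so memory use is bounded by buf_size.
// Returns the number of '\n' bytes, or -1 if the file cannot be opened.
long CountNewlinesChunked(const char *path, std::size_t buf_size = 1024)
{
    assert(buf_size > 0);
    std::FILE *fp = std::fopen(path, "r");
    if (fp == nullptr)
        return -1;
    std::vector<char> buf(buf_size);
    long count = 0;
    std::size_t n;
    // fread returns the number of bytes actually read; 0 means EOF or error.
    while ((n = std::fread(buf.data(), 1, buf.size(), fp)) > 0)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if (buf[i] == '\n')
                ++count;
        }
    }
    std::fclose(fp);
    return count;
}
```

However large the file is, only `buf_size` bytes are held in memory at any moment, which is the property TextReader relies on.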

The constructor and destructor are simple; the destructor is at the end of the post.

TextReader::TextReader(const std::string &path, size_t buf_size)
{
    stream_ = new LocalStream(path, FileOpenMode::Read);
    buf_size_ = buf_size;
    pos_ = length_ = 0;
    buf_ = new char[buf_size_];
}

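As required, the file is read line by line: each call to GetLine reads one line. I'm not fond of pointers, so the data is passed back through a std::string reference.

Here length_ is the number of valid bytes currently in buf_, and pos_ marks how far the caller has read. When a newline is read, or the whole file has been consumed, the outer loop exits and the function returns; otherwise the next character is read. While reading, the contents of buf_ may run out, which ends the inner while loop; LoadBuffer is then called to refill the buffer from the file. Note that GetLine returns the length of the line, so a caller that treats 0 as end of file will also stop at an empty line.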

size_t TextReader::GetLine(std::string &line)
{
    line.clear();
    bool isEnd = false;
    while (!isEnd)
    {
        while (pos_ < length_)
        {
            char &c = buf_[pos_++];
            if (c == '\n')
            {
                isEnd = true;
                break;
            }
            else
            {
                line += c;
            }
        }
        if (isEnd || LoadBuffer() == 0)
            break;
    }
    return line.size();
}

size_t TextReader::LoadBuffer()
{
    pos_ = length_ = 0;
    return length_ = stream_->Read(buf_, buf_size_ - 1);
}
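
Because GetLine reports the line length, an empty line and end of file both come back as 0. One way around this, shown here as a sketch and not as the post's implementation, is to return a bool instead. `LineReader` and `ReadLine` are hypothetical names, and the class borrows a `std::FILE*` instead of owning a LocalStream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical variant of TextReader: ReadLine reports success with a bool,
// so an empty line is not mistaken for end of file.
class LineReader
{
public:
    explicit LineReader(std::FILE *fp, std::size_t buf_size = 1024)
        : fp_(fp), buf_(buf_size), pos_(0), length_(0)
    {
        assert(fp != nullptr && buf_size > 0);
    }

    // Returns true if a line (possibly empty) was read, false at end of file.
    bool ReadLine(std::string &line)
    {
        line.clear();
        bool got_data = false;
        for (;;)
        {
            while (pos_ < length_)
            {
                got_data = true;
                char c = buf_[pos_++];
                if (c == '\n')
                    return true;
                line += c;
            }
            // Buffer exhausted: refill it from the file.
            pos_ = 0;
            length_ = std::fread(buf_.data(), 1, buf_.size(), fp_);
            if (length_ == 0)
                return got_data; // the last line may lack a trailing '\n'
        }
    }

private:
    std::FILE *fp_; // not owned; the caller closes it
    std::vector<char> buf_;
    std::size_t pos_, length_;
};
```

A caller would loop with `while (reader.ReadLine(line))` and decide for itself whether to skip empty lines.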

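3. Top-level design

This layer can be changed to suit the actual requirements. To save memory and improve CPU utilization, we read part of the data and process part of the data: while the CPU processes one batch, the file can continue to be read, which keeps the CPU busy. So we consider a multithreaded design.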
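
The read-while-processing idea is a producer/consumer arrangement. The sketch below shows the general pattern with a bounded queue of strings; it is not the SampleReader design itself, and `BoundedQueue` and `PipeThrough` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded queue: the producer blocks when it is full,
// the consumer blocks when it is empty.
class BoundedQueue
{
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity), closed_(false)
    {
        assert(capacity > 0);
    }

    void Push(std::string item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Signals that no more items will be pushed.
    void Close()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Returns false once the queue is closed and drained.
    bool Pop(std::string &item)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty())
            return false;
        item = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    bool closed_;
    std::queue<std::string> items_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
};

// Runs one producer thread and consumes on the calling thread.
std::vector<std::string> PipeThrough(const std::vector<std::string> &lines, std::size_t capacity)
{
    BoundedQueue queue(capacity);
    std::thread producer([&] {
        for (const std::string &l : lines)
            queue.Push(l);
        queue.Close();
    });
    std::vector<std::string> out;
    std::string item;
    while (queue.Pop(item))
        out.push_back(item);
    producer.join();
    return out;
}
```

The capacity bounds memory use in the same way the sample buffer size does in the class that follows: the producer can run ahead of the consumer only by that many items.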

Some key parts were originally commented in Chinese, for fear that later on even I wouldn't understand them 😃

template <typename ElemType>
class SampleReader
{
public:
    /**
     * @brief Constructor
     * @param filepath path of the sample file to read
     * @param read_buffer_size number of samples held in this instance's buffer_
     * @param input_dimention number of features per sample, not counting the trailing 1 that corresponds to the bias
     */
    SampleReader(std::string filepath, int read_buffer_size, int input_dimention);
    ~SampleReader();
    /**
     * @brief load data into the buffer
     * @param buffer_size the number of samples you want to load
     * @param buffer receives pointers to the loaded samples
     * @return the number of rows actually loaded
     */
    int GetSample(int buffer_size, Sample<ElemType> **buffer);
    /**
     * @brief release some slots so that the reader thread can continue
     * @param row_num number of samples to release
     */
    void Free(int row_num);
    /**
     * @brief reset; used when epoch > 1 to read the file again from the start and reuse the instance
     */
    void Reset();
    /**
     * @brief stop the SampleReader; use it when the model has converged and no more training is needed
     */
    void Stop();
    /**
     * @brief whether all data in the file has been read
     */
    bool IsEndOfFile() const;
private:
    /**
     * @brief main function of the class; runs on a separate thread and is terminated via stop_
     */
    void Read();
    /**
     * @param str string where data stored
     * @param idx the index of buffer_
     */
    void ParseLine(std::string &str, int idx);
    std::thread *th_;
    bool stop_;
    bool eof_;
    int buffer_size_;
    int input_dimention_;
    int length_;
    int readlength_;
    int start_;
    int end_;
    std::string file_;
    TextReader *reader_;
    Sample<ElemType> **buffer_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

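The main job of this class is to read the file and store its contents in a purpose-built data structure, while exposing an interface through which the external compute thread can fetch the data.

Each sample is stored in the struct below, and two helper functions create and destroy the array of Samples: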

template <typename ElemType>
struct Sample
{
    int label;
    std::vector<ElemType> features;
    Sample(int size)
    {
        features.reserve(size);
    }
};

/**
 * @brief allocate a buffer of samples
 * @param num {int} the number of samples
 * @param size {int} capacity reserved for each sample's features
 * @return pointer to the buffer
 */
template <typename ElemType>
Sample<ElemType> **CreateSampleBuff(int num, int size)
{
    Sample<ElemType> **samplep = new Sample<ElemType> *[num];
    for (int i = 0; i < num; ++i)
    {
        samplep[i] = new Sample<ElemType>(size);
    }
    return samplep;
}

template <typename ElemType>
void DeleteSampleBuff(Sample<ElemType> **samplep, int num)
{
    for (int i = 0; i < num; i++)
        delete samplep[i];
}
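
Note that DeleteSampleBuff frees each sample but not the pointer array itself, so the caller still has to release that array. As an alternative sketch (not the post's code), ownership can be handed to standard containers so that nothing needs deleting by hand. `OwnedSample` and `MakeSampleBuff` are hypothetical names mirroring Sample and CreateSampleBuff.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in that mirrors the Sample struct above.
template <typename ElemType>
struct OwnedSample
{
    int label = 0;
    std::vector<ElemType> features;
    explicit OwnedSample(std::size_t size)
    {
        features.reserve(size);
    }
};

// Builds num samples; the vector and the unique_ptrs release everything
// automatically when the returned buffer goes out of scope.
template <typename ElemType>
std::vector<std::unique_ptr<OwnedSample<ElemType>>> MakeSampleBuff(std::size_t num, std::size_t size)
{
    assert(num > 0);
    std::vector<std::unique_ptr<OwnedSample<ElemType>>> buff;
    buff.reserve(num);
    for (std::size_t i = 0; i < num; ++i)
    {
        buff.push_back(std::unique_ptr<OwnedSample<ElemType>>(new OwnedSample<ElemType>(size)));
    }
    return buff;
}
```

Code that needs raw pointers, as GetSample does, can still obtain them with `buff[i].get()` without taking ownership.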


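The SampleReader constructor can then be implemented as follows. Its main job is to create a buffer to hold the data that is read, and to start the reader thread: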

    template <typename ElemType>
    SampleReader<ElemType>::SampleReader(std::string filepath, int read_buffer_size, int input_dimention): stop_(false),
                                            eof_(false),
                                            file_(filepath),
                                            buffer_size_(read_buffer_size),input_dimention_(input_dimention),length_(0),
                                            readlength_(0),
                                            start_(0),
                                            end_(0)
    {
        buffer_ = CreateSampleBuff<ElemType>(buffer_size_, input_dimention + 1);
        reader_ = new TextReader(file_);
        // Log::Info("SampleReader begin to read data from %s\n", file_.c_str());
        th_ = new std::thread(&SampleReader<ElemType>::Read, this);
    }

The actual file reading is delegated to the Read function, which runs on a separate thread.

template <typename ElemType>
void SampleReader<ElemType>::Read()
{
    while (1)
    {
        while (eof_ && !stop_)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        }
        std::string line;
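        // Read data from disk into memory; the compute thread cannot fetch data while the lock is held
        std::unique_lock<std::mutex> lock(mutex_);
        while (reader_->GetLine(line) && (!stop_))
        {
            // Buffer full: wait (releasing the lock) until Free() makes room.
            // The loop guards against spurious wakeups.
            while (length_ == buffer_size_ && !stop_)
            {
                cv_.wait(lock);
            }
            if (stop_)
                break;
            ParseLine(line, end_);
            ++length_;
            end_ = round(end_ + 1, buffer_size_);
        }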
        eof_ = true;
        lock.unlock();
        if (stop_)
            break;
    }
}

ParseLine parses the string obtained from the reader and stores the data in the struct.

template <typename ElemType>
void SampleReader<ElemType>::ParseLine(std::string &str, int idx)
{
    std::stringstream ss(str);
    Sample<ElemType> *data = buffer_[idx];
    data->features.clear();
    ss >> data->label;
    ElemType feature;
    while (ss >> feature)
        data->features.emplace_back(feature);
    data->features.emplace_back(1);
}
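
The parsing step can be tried in isolation. The sketch below follows the same approach as ParseLine (a label, then whitespace-separated features, then a trailing 1 for the bias), but it also reports whether the label could be parsed, which is one way to skip a header row such as the one in the example file. `ParsedRow` and `ParseRow` are hypothetical names, and the feature type is fixed to double for brevity.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, non-template stand-in for Sample<double>.
struct ParsedRow
{
    int label = 0;
    std::vector<double> features;
};

// Parses "label f1 f2 ..." and appends a constant 1 for the bias term.
// Returns false if the line does not start with an integer label.
bool ParseRow(const std::string &line, ParsedRow &row)
{
    std::istringstream ss(line);
    row.features.clear();
    if (!(ss >> row.label))
        return false;
    double feature;
    while (ss >> feature)
        row.features.push_back(feature);
    row.features.push_back(1.0); // bias input
    return true;
}
```

For the line `1 3.4 2.2 1` this yields label 1 and four feature values, the last being the appended bias input.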

The compute thread fetches the parsed data through GetSample:

template <typename ElemType>
int SampleReader<ElemType>::GetSample(int buffer_size, Sample<ElemType> **buffer)
{
    int size;
    {
        std::lock_guard<std::mutex> lock(mutex_);
        size = length_ - readlength_;
        size = size > buffer_size ? buffer_size : size;
        readlength_ += size;
    }
    for (int i = 0; i < size; ++i)
    {
        buffer[i] = buffer_[round(start_ + i, buffer_size_)];
    }
    start_ = round(start_ + size, buffer_size_);
    return size;
}

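Once the compute thread has processed a batch, that data is no longer needed, so its slots can be reclaimed for new data. The compute thread therefore calls Free to release part of the buffer.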

template <typename ElemType>
void SampleReader<ElemType>::Free(int row_num)
{
    {
        std::lock_guard<std::mutex> lock(mutex_);
        length_ -= row_num;
        readlength_ -= row_num;
    }
    cv_.notify_one();
}
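
The bookkeeping shared by Read, GetSample and Free is easier to follow without the threads. Below is a single-threaded model of the four counters, written as a sketch under the assumption that they behave as in the functions above; `RingCounters`, `Produce` and `Take` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical single-threaded model of the counters used by SampleReader.
struct RingCounters
{
    int capacity;
    int length = 0;     // slots filled and not yet freed
    int readlength = 0; // slots handed to the consumer and not yet freed
    int start = 0;      // next slot to hand out
    int end = 0;        // next slot to fill

    explicit RingCounters(int cap) : capacity(cap)
    {
        assert(cap > 0);
    }

    // Same idea as round(): valid only while a < 2 * b.
    static int Wrap(int a, int b)
    {
        return a >= b ? a - b : a;
    }

    // Models one iteration of Read(); returns false when the buffer is full.
    bool Produce()
    {
        if (length == capacity)
            return false;
        ++length;
        end = Wrap(end + 1, capacity);
        return true;
    }

    // Models GetSample(); returns how many slots were handed out.
    int Take(int want)
    {
        int n = length - readlength;
        if (n > want)
            n = want;
        readlength += n;
        start = Wrap(start + n, capacity);
        return n;
    }

    // Models Free(); the consumer gives back slots it has finished with.
    void Free(int n)
    {
        assert(n >= 0 && n <= readlength);
        length -= n;
        readlength -= n;
    }
};
```

With a capacity of 4, filling the buffer, taking 3 and freeing 3 leaves one unread slot at index 3, and the next two Produce calls reuse indices 0 and 1.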

4. Complete code

1. io.h

#include <cstddef>
#include <cstdio>
#include <string>

enum class FileOpenMode : int
{
   Write = 0,
   Read = 1,
   Append = 2,
   // Binary variants, handled by the LocalStream constructor in io.cpp
   BinaryRead,
   BinaryWrite,
   BinaryAppend
};

class LocalStream
{
public:
   LocalStream(const std::string &path, FileOpenMode mode);
   ~LocalStream();

   void Write(const void *buf, size_t size);

   /*
    * @param buf pointer to a memory buffer
    * @param size maximum number of bytes to read
    * @return the number of bytes actually read
    */
   size_t Read(void *buf, size_t size);

   // void Append(const void *buf, size_t size);

   bool Good();

private:
   bool is_good_;
   FILE *fp_;
   std::string path_;
};

class TextReader
{
public:
   TextReader(const std::string &path, size_t buf_size = 1024);
   ~TextReader();
   size_t GetLine(std::string &line);

private:
   size_t LoadBuffer();
   char *buf_;
   size_t pos_, buf_size_, length_;
   LocalStream *stream_;
};

2. io.cpp

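#include <cstdio>
#include "io.h"
// Log::Error comes from the project's logging module (not shown here).

LocalStream::LocalStream(const std::string &path, FileOpenMode mode) : path_(path)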
{
    std::string mode_str;
    switch (mode)
    {
    case FileOpenMode::Read:
        mode_str = "r";
        break;
    case FileOpenMode::Write:
        mode_str = "w";
        break;
    case FileOpenMode::Append:
        mode_str = "a";
        break;
    case FileOpenMode::BinaryRead:
        mode_str = "rb";
        break;
    case FileOpenMode::BinaryWrite:
        mode_str = "wb";
        break;
    case FileOpenMode::BinaryAppend:
        mode_str = "ab";
    }
    fp_ = fopen(path_.c_str(), mode_str.c_str());
    if (fp_ == nullptr)
    {
        is_good_ = false;
        Log::Error("Failed to open LocalStream %s\n", path_.c_str());
    }
    else
    {
        is_good_ = true;
    }
}

LocalStream::~LocalStream()
{
    is_good_ = false;
    if (fp_ != nullptr)
        std::fclose(fp_);
}

void LocalStream::Write(const void *buf, size_t size)
{
    if (std::fwrite(buf, 1, size, fp_) != size)
    {
        is_good_ = false;
        Log::Error("LocalStream.Write incomplete\n");
    }
}

size_t LocalStream::Read(void *buf, size_t size)
{
    return std::fread(buf, 1, size, fp_);
}

bool LocalStream::Good()
{
    return is_good_;
}

TextReader::TextReader(const std::string &path, size_t buf_size)
{
    stream_ = new LocalStream(path, FileOpenMode::Read);
    buf_size_ = buf_size;
    pos_ = length_ = 0;
    buf_ = new char[buf_size_];
}

size_t TextReader::GetLine(std::string &line)
{
    line.clear();
    bool isEnd = false;
    while (!isEnd)
    {
        while (pos_ < length_)
        {
            char &c = buf_[pos_++];
            if (c == '\n')
            {
                isEnd = true;
                break;
            }
            else
            {
                line += c;
            }
        }
        if (isEnd || LoadBuffer() == 0)
            break;
    }
    return line.size();
}

size_t TextReader::LoadBuffer()
{
    pos_ = length_ = 0;
    return length_ = stream_->Read(buf_, buf_size_ - 1);
}

TextReader::~TextReader()
{
    delete stream_;
    delete[] buf_;
}

3. samplereader.h


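#include <chrono>
#include <condition_variable>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>
#include "io.h"
// Sample, CreateSampleBuff and DeleteSampleBuff (shown in section 3) must be declared before this point.

template <typename ElemType>
class SampleReader
{
public:
    /**
     * @brief Constructor
     * @param filepath path of the sample file to read
     * @param read_buffer_size number of samples held in this instance's buffer_
     * @param input_dimention number of features per sample, not counting the trailing 1 that corresponds to the bias
     */
    SampleReader(std::string filepath, int read_buffer_size, int input_dimention);

    ~SampleReader();

    /**
     * @brief load data into the buffer
     * @param buffer_size the number of samples you want to load
     * @param buffer receives pointers to the loaded samples
     * @return the number of rows actually loaded
     */
    int GetSample(int buffer_size, Sample<ElemType> **buffer);

    /**
     * @brief release some slots so that the reader thread can continue
     * @param row_num number of samples to release
     */
    void Free(int row_num);

    /**
     * @brief reset; used when epoch > 1 to read the file again from the start and reuse the instance
     */
    void Reset();

    /**
     * @brief stop the SampleReader; use it when the model has converged and no more training is needed
     */
    void Stop();

    /**
     * @brief whether all data in the file has been read
     */
    bool IsEndOfFile() const;

private:
    /**
     * @brief main function of the class; runs on a separate thread and is terminated via stop_
     */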
    void Read();
    /**
     * @param str string where data stored
     * @param idx the index of buffer_
     */
    void ParseLine(std::string &str, int idx);

    std::thread *th_;
    bool stop_;
    bool eof_;
    int buffer_size_;
    int input_dimention_;
    int length_;
    int readlength_;
    int start_;
    int end_;
    std::string file_;
    TextReader *reader_;
    Sample<ElemType> **buffer_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

template <typename ElemType>
/**
 * @brief wrap an index around the end of a circular buffer
 * @return if (a >= b) return a - b, else return a
 */
inline ElemType round(ElemType a, ElemType b)
{
    return a >= b ? (a - b) : a;
}

template <typename ElemType>
SampleReader<ElemType>::SampleReader(std::string filepath, int read_buffer_size, int input_dimention) : stop_(false),
                                                                                                        eof_(false),
                                                                                                        file_(filepath),
                                                                                                        buffer_size_(read_buffer_size),
                                                                                                        input_dimention_(input_dimention),
                                                                                                        length_(0),
                                                                                                        readlength_(0),
                                                                                                        start_(0),
                                                                                                        end_(0)
{
    buffer_ = CreateSampleBuff<ElemType>(buffer_size_, input_dimention + 1);
    reader_ = new TextReader(file_);
    // Log::Info("SampleReader begin to read data from %s\n", file_.c_str());
    th_ = new std::thread(&SampleReader<ElemType>::Read, this);
}

template <typename ElemType>
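SampleReader<ElemType>::~SampleReader()
{
    // Stop and join the reader thread first so that it no longer touches buffer_.
    {
        std::lock_guard<std::mutex> lock(mutex_);
        stop_ = true;
    }
    cv_.notify_all(); // wake the reader if it is waiting for free slots
    th_->join();
    DeleteSampleBuff<ElemType>(buffer_, buffer_size_);
    delete[] buffer_; // the pointer array was allocated with new[]
    delete reader_;
    delete th_;
}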

template <typename ElemType>
void SampleReader<ElemType>::Free(int row_num)
{
    {
        std::lock_guard<std::mutex> lock(mutex_);
        length_ -= row_num;
        readlength_ -= row_num;
    }
    cv_.notify_one();
}

template <typename ElemType>
int SampleReader<ElemType>::GetSample(int buffer_size, Sample<ElemType> **buffer)
{
    // Log::Debug("begin to load data from SampleReader\n");
    int size;
    {
        std::lock_guard<std::mutex> lock(mutex_);
        size = length_ - readlength_;
        size = size > buffer_size ? buffer_size : size;
        readlength_ += size;
    }
    for (int i = 0; i < size; ++i)
    {
        buffer[i] = buffer_[round(start_ + i, buffer_size_)];
    }
    start_ = round(start_ + size, buffer_size_);
    return size;
}

template <typename ElemType>
void SampleReader<ElemType>::Read()
{
    // Log::Debug("Start read thread!\n");
    while (1)
    {
        while (eof_ && !stop_)
        {
            // Log::Info("file %s read end, read thread sleep\n", file_.c_str());
            std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        }
        std::string line;
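        std::unique_lock<std::mutex> lock(mutex_);
        // int count = 0;
        while (reader_->GetLine(line) && (!stop_))
        {
            // Log::Debug("Read %dth line\n", ++count);
            // Buffer full: wait (releasing the lock) until Free() makes room.
            // The loop guards against spurious wakeups.
            while (length_ == buffer_size_ && !stop_)
            {
                cv_.wait(lock);
            }
            if (stop_)
                break;
            ParseLine(line, end_);
            ++length_;
            end_ = round(end_ + 1, buffer_size_);
        }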
        eof_ = true;
        lock.unlock();
        if (stop_)
            break;
    }
}

template <typename ElemType>
void SampleReader<ElemType>::ParseLine(std::string &str, int idx)
{
    std::stringstream ss(str);
    Sample<ElemType> *data = buffer_[idx];
    data->features.clear();
    ss >> data->label;
    ElemType feature;
    while (ss >> feature)
        data->features.emplace_back(feature);
    data->features.emplace_back(1);
}

// This function doesn't work properly. The bug still needs to be fixed.
template <typename ElemType>
void SampleReader<ElemType>::Reset()
{
    stop_ = false;
    eof_ = false;
    std::lock_guard<std::mutex> lock(mutex_);
    start_ = 0;
    end_ = 0;
    length_ = 0;
    readlength_ = 0;
    delete reader_;
    reader_ = new TextReader(file_);
}

template <typename ElemType>
void SampleReader<ElemType>::Stop()
{
    stop_ = true;
}
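
Stop() writes stop_ from one thread while Read() polls it from another, and a plain bool gives no guarantee about when the change becomes visible. One common remedy, sketched here under the assumption that a polling loop is acceptable, is std::atomic<bool>. `Worker` and its members are hypothetical and stand in for the reader thread; this is not a drop-in change to SampleReader.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical worker that polls an atomic stop flag, the way Read() polls stop_.
class Worker
{
public:
    Worker() : stop_(false), ticks_(0), th_(&Worker::Run, this) {}

    ~Worker()
    {
        Stop();
        if (th_.joinable())
            th_.join();
    }

    // Safe to call from any thread.
    void Stop()
    {
        stop_.store(true);
    }

    int Ticks() const
    {
        return ticks_.load();
    }

private:
    void Run()
    {
        while (!stop_.load())
        {
            ++ticks_;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stop_;
    std::atomic<int> ticks_;
    std::thread th_; // declared last so the flags are initialized before the thread starts
};
```

A thread that is blocked on a condition variable still needs a notify to observe the flag, so an atomic alone does not replace the wakeup.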

template <typename ElemType>
inline bool SampleReader<ElemType>::IsEndOfFile() const
{
    return (eof_ && length_ == readlength_);
}

posted @ caieleven