Detecting Whether a File Is Binary

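At work I have to process STL files. Sometimes the file I receive is binary and sometimes it is ASCII, so I wrote a function that detects the type first and then picks the appropriate way to open the file.

Without further ado, here is the code!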

#include <cstdio>  // FILE, fopen, fread, fclose

enum FileTypeEnum
  { 
    FileTypeUnknown,
    FileTypeBinary,
    FileTypeText
  };

FileTypeEnum
DetectFileType(const char *filename,
                            unsigned long length,
                            double percent_bin)
{
  if (!filename || length == 0 || percent_bin < 0)
    {
    return FileTypeUnknown;
    }

  FILE *fp = fopen(filename, "rb");
  if (!fp)
    {
    return FileTypeUnknown;
    }

  // Allocate buffer and read bytes

  unsigned char *buffer = new unsigned char [length];
  size_t read_length = fread(buffer, 1, length, fp);
  fclose(fp);
  if (read_length == 0)
    {
    delete [] buffer;  // free the buffer on this early return too
    return FileTypeUnknown;
    }

  // Loop over contents and count

  size_t text_count = 0;

  const unsigned char *ptr = buffer;
  const unsigned char *buffer_end = buffer + read_length;

  while (ptr != buffer_end)
    {
    // 0x7F (DEL) is a control character, so the printable range ends at 0x7E
    if ((*ptr >= 0x20 && *ptr <= 0x7E) ||
        *ptr == '\n' ||
        *ptr == '\r' ||
        *ptr == '\t')
      {
      text_count++;
      }
    ptr++;
    }

  delete [] buffer;

  double current_percent_bin =
    (static_cast<double>(read_length - text_count) /
     static_cast<double>(read_length));

  if (current_percent_bin >= percent_bin)
    {
    return FileTypeBinary;
    }

  return FileTypeText;
}

Example call:

DetectFileType(filename, 256, 0.05);

The principle behind the algorithm is simple:

  Up to 'length' bytes are read from the file. If the fraction of those bytes
  that are non-textual is at least 'percent_bin', the file is considered
  binary, otherwise textual. Textual elements are bytes in the ASCII
  [0x20, 0x7E] range, plus \n, \r and \t.

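In other words, the function reads a chunk of bytes from the file and counts how many of them are not text characters. If those bytes make up at least the fraction percent_bin of the bytes read (0.05 means 5%), the file is treated as binary.

Text characters here are \n, \r, \t, and bytes whose ASCII value lies in the range [0x20, 0x7E].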
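
The counting step can also be tried on an in-memory buffer, without touching a file. This is a minimal sketch under the rules just described; the names IsTextByte and NonTextFraction are illustrative and do not appear in the original code.

```cpp
#include <cstddef>

// A byte counts as text if it is printable ASCII [0x20, 0x7E]
// or one of \n, \r, \t.
bool IsTextByte(unsigned char c)
{
  return (c >= 0x20 && c <= 0x7E) || c == '\n' || c == '\r' || c == '\t';
}

// Fraction of the bytes in data[0..n) that are not text bytes.
// Returns 0.0 for an empty buffer (an assumption made for this sketch).
double NonTextFraction(const unsigned char *data, std::size_t n)
{
  if (n == 0)
    {
    return 0.0;
    }
  std::size_t non_text = 0;
  for (std::size_t i = 0; i < n; ++i)
    {
    if (!IsTextByte(data[i]))
      {
      ++non_text;
      }
    }
  return static_cast<double>(non_text) / static_cast<double>(n);
}
```

With length = 256 and percent_bin = 0.05, as in the example call, 12 non-text bytes give 12/256 ≈ 0.047 and the sample still counts as text, while 13 give 13/256 ≈ 0.051 and the sample is classified as binary.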

The whole file never has to be read into memory.
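
Reading only a prefix is what keeps memory use bounded. As an alternative sketch, the same read can be done with std::ifstream and std::vector, so that the buffer is released automatically on every return path. ReadPrefix is a hypothetical name, not something from the original code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads at most max_bytes from the start of the file at 'path'.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<unsigned char> ReadPrefix(const std::string &path,
                                      std::size_t max_bytes)
{
  std::ifstream in(path.c_str(), std::ios::binary);
  if (!in || max_bytes == 0)
    {
    return std::vector<unsigned char>();
    }
  std::vector<unsigned char> buffer(max_bytes);
  in.read(reinterpret_cast<char *>(buffer.data()),
          static_cast<std::streamsize>(max_bytes));
  // gcount() is the number of bytes actually read, which is smaller
  // than max_bytes when the file is shorter than the requested prefix.
  buffer.resize(static_cast<std::size_t>(in.gcount()));
  return buffer;
}
```

The size of the returned vector plays the role of read_length in the function above.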

posted @ 2017-04-14 20:56  Louie-Liu