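Detecting whether a file is binary
At work I deal with STL files, and sometimes the file I am handed is binary and sometimes it is ASCII,
so I wanted a function that detects which one it is before choosing how to open the file.
Without further ado, here is the code!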
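#include <cstddef>
#include <cstdio>
#include <vector>

enum FileTypeEnum
{
  FileTypeUnknown,
  FileTypeBinary,
  FileTypeText
};

// Reads up to `length` bytes from the start of the file and classifies it.
// `percent_bin` is a fraction (0.05 means 5%), not a percentage.
FileTypeEnum
DetectFileType(const char *filename,
               unsigned long length,
               double percent_bin)
{
  if (!filename || length == 0 || percent_bin < 0)
  {
    return FileTypeUnknown;
  }
  // Open in binary mode so that no newline translation is applied
  FILE *fp = std::fopen(filename, "rb");
  if (!fp)
  {
    return FileTypeUnknown;
  }
  // Allocate buffer and read bytes; std::vector releases the memory
  // on every return path, including the early return below
  std::vector<unsigned char> buffer(length);
  std::size_t read_length = std::fread(buffer.data(), 1, buffer.size(), fp);
  std::fclose(fp);
  if (read_length == 0)
  {
    return FileTypeUnknown;
  }
  // Loop over contents and count the textual bytes
  std::size_t text_count = 0;
  const unsigned char *ptr = buffer.data();
  const unsigned char *buffer_end = ptr + read_length;
  while (ptr != buffer_end)
  {
    // Printable ASCII is [0x20, 0x7E]; 0x7F (DEL) is a control character
    if ((*ptr >= 0x20 && *ptr <= 0x7E) ||
        *ptr == '\n' ||
        *ptr == '\r' ||
        *ptr == '\t')
    {
      text_count++;
    }
    ptr++;
  }
  // Fraction of the bytes read that are not textual
  double current_percent_bin =
    (static_cast<double>(read_length - text_count) /
     static_cast<double>(read_length));
  if (current_percent_bin >= percent_bin)
  {
    return FileTypeBinary;
  }
  return FileTypeText;
}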
Example call:
DetectFileType(filename, 256, 0.05);
The idea behind the algorithm is simple:
Up to 'length' bytes are read from the file. If the fraction of those bytes
that are non-textual is at least 'percent_bin' (0.05 means 5%), the file is
considered binary, otherwise textual. Textual elements are bytes in the
ASCII [0x20, 0x7E] range, plus \n, \r and \t.
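In other words, read a chunk of bytes from the file and count how many of them are not text characters.
If that count reaches the fraction percent_bin of the bytes read, the file is binary.
Text characters here are \n, \r, \t, and any byte whose ASCII value is in the range [0x20, 0x7E].
The whole file never has to be read into memory.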
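To make the counting rule concrete, here is a small sketch that applies it to an in-memory sample instead of a file. NonTextFraction and LooksBinary are hypothetical helper names introduced only for this illustration; they are not part of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: fraction of the bytes in `sample` that are outside
// the textual set ([0x20, 0x7E] plus \n, \r, \t).
double NonTextFraction(const std::string &sample)
{
  assert(!sample.empty());  // the fraction is undefined for an empty sample
  std::size_t non_text = 0;
  for (unsigned char c : sample)
  {
    const bool is_text = (c >= 0x20 && c <= 0x7E) ||
                         c == '\n' || c == '\r' || c == '\t';
    if (!is_text)
    {
      ++non_text;
    }
  }
  return static_cast<double>(non_text) /
         static_cast<double>(sample.size());
}

// Hypothetical helper: applies the threshold as a fraction, using the
// same >= comparison as the rule described above.
bool LooksBinary(const std::string &sample, double percent_bin)
{
  return NonTextFraction(sample) >= percent_bin;
}
```

With a 256-byte sample and a threshold of 0.05, 12 non-text bytes give 12/256 ≈ 0.047 and the sample counts as text, while 13 non-text bytes give 13/256 ≈ 0.051 and it counts as binary.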