CED（Compact Encoding Detection）技术

在处理文本时，如果不知道文本是什么格式的编码，是UTF-8还是GBK还是其他的格式，会导致文本显示乱码。

CED（Compact Encoding Detection）技术就是能检测文本编码格式的一种技术。

1.1 CED

CED（Compact Encoding Detection）是基于C/C++开源的组件，能够扫描输入的字节来判断输入的字节最可能是什么文本编码格式。CED由google 提供并开源。

https://github.com/google/compact_enc_det/

CED的设计目标

1）快速高效。能识别跳过7位的ASCII数据。

2）基于线程安全的。

3）对于50字节的少量字节查询，对于5000字节的电子邮件内容查询，对于50000字节的网页查询都能提供相同的良好输出。

使用接口如下：

Encoding DetectEncoding(

const char* text, int text_length, const char* url_hint,

const char* http_charset_hint, const char* meta_charset_hint,

const int encoding_hint,

const Language language_hint, // User interface lang

const TextCorpusType corpus_type, bool ignore_7bit_mail_encodings,

int* bytes_consumed, bool* is_reliable);

例程如下：

/////////////////////////////////////////////////////////////

#include "compact_enc_det/compact_enc_det.h"

const char* text = "Input text";

bool is_reliable;

int bytes_consumed;

CompactEncDet::DetectEncoding(text, length,

nullptr, // URL hint

nullptr, // HTTP hint

nullptr, // Meta hint

UNKNOWN_ENCODING,

UNKNOWN_LANGUAGE,

CompactEncDet::WEB_CORPUS,

false, // Include 7-bit encodings?

&bytes_consumed, &is_reliable);

支持检测的类型如下：

enum Encoding {

ISO_8859_1 = 0, // Teragram ASCII

ISO_8859_2 = 1, // Teragram Latin2

ISO_8859_3 = 2, // in BasisTech but not in Teragram

ISO_8859_4 = 3, // Teragram Latin4

ISO_8859_5 = 4, // Teragram ISO-8859-5

ISO_8859_6 = 5, // Teragram Arabic

ISO_8859_7 = 6, // Teragram Greek

ISO_8859_8 = 7, // Teragram Hebrew

ISO_8859_9 = 8, // in BasisTech but not in Teragram

ISO_8859_10 = 9, // in BasisTech but not in Teragram

JAPANESE_EUC_JP = 10, // Teragram EUC_JP

JAPANESE_SHIFT_JIS = 11, // Teragram SJS

JAPANESE_JIS = 12, // Teragram JIS

CHINESE_BIG5 = 13, // Teragram BIG5

CHINESE_GB = 14, // Teragram GB

CHINESE_EUC_CN = 15, // Misnamed. Should be EUC_TW. Was Basis Tech

// CNS11643EUC, before that Teragram EUC-CN(!)

// See //i18n/basistech/basistech_encodings.h

KOREAN_EUC_KR = 16, // Teragram KSC

UNICODE = 17, // Teragram Unicode

CHINESE_EUC_DEC = 18, // Misnamed. Should be EUC_TW. Was Basis Tech

// CNS11643EUC, before that Teragram EUC.

CHINESE_CNS = 19, // Misnamed. Should be EUC_TW. Was Basis Tech

// CNS11643EUC, before that Teragram CNS.

CHINESE_BIG5_CP950 = 20, // Teragram BIG5_CP950

JAPANESE_CP932 = 21, // Teragram CP932

UTF8 = 22,

UNKNOWN_ENCODING = 23,

ASCII_7BIT = 24, // ISO_8859_1 with all characters <= 127.

// Should be present only in the crawler

// and in the repository,

// *never* as a result of Document::encoding().

RUSSIAN_KOI8_R = 25, // Teragram KOI8R

RUSSIAN_CP1251 = 26, // Teragram CP1251

//----------------------------------------------------------

// These are _not_ output from teragram. Instead, they are as

// detected in the headers of usenet articles.

MSFT_CP1252 = 27, // 27: CP1252 aka MSFT euro ascii

RUSSIAN_KOI8_RU = 28, // CP21866 aka KOI8-U, used for Ukrainian.

// Misnamed, this is _not_ KOI8-RU but KOI8-U.

// KOI8-U is used much more often than KOI8-RU.

MSFT_CP1250 = 29, // CP1250 aka MSFT eastern european

ISO_8859_15 = 30, // aka ISO_8859_0 aka ISO_8859_1 euroized

//----------------------------------------------------------

// These are in BasisTech but not in Teragram. They are

// needed for new interface languages. Now detected by

// research langid

MSFT_CP1254 = 31, // used for Turkish

MSFT_CP1257 = 32, // used in Baltic countries

//----------------------------------------------------------

// New encodings detected by Teragram

ISO_8859_11 = 33, // aka TIS-620, used for Thai

MSFT_CP874 = 34, // used for Thai

MSFT_CP1256 = 35, // used for Arabic

//----------------------------------------------------------

// Detected as ISO_8859_8 by Teragram, but can be found in META tags

MSFT_CP1255 = 36, // Logical Hebrew Microsoft

ISO_8859_8_I = 37, // Iso Hebrew Logical

HEBREW_VISUAL = 38, // Iso Hebrew Visual

//----------------------------------------------------------

// Detected by research langid

CZECH_CP852 = 39,

CZECH_CSN_369103 = 40, // aka ISO_IR_139 aka KOI8_CS

MSFT_CP1253 = 41, // used for Greek

RUSSIAN_CP866 = 42,

//----------------------------------------------------------

// Handled by iconv in glibc

ISO_8859_13 = 43,

ISO_2022_KR = 44,

GBK = 45,

GB18030 = 46,

BIG5_HKSCS = 47,

ISO_2022_CN = 48,

//-----------------------------------------------------------

// Detected by xin liu's detector

// Handled by transcoder

// (Indic encodings)

TSCII = 49,

TAMIL_MONO = 50,

TAMIL_BI = 51,

JAGRAN = 52,

MACINTOSH_ROMAN = 53,

UTF7 = 54,

BHASKAR = 55, // Indic encoding - Devanagari

HTCHANAKYA = 56, // 56 Indic encoding - Devanagari

//-----------------------------------------------------------

// These allow a single place (inputconverter and outputconverter)

// to do UTF-16 <==> UTF-8 bulk conversions and UTF-32 <==> UTF-8

// bulk conversions, with interchange-valid checking on input and

// fallback if needed on ouput.

UTF16BE = 57, // big-endian UTF-16

UTF16LE = 58, // little-endian UTF-16

UTF32BE = 59, // big-endian UTF-32

UTF32LE = 60, // little-endian UTF-32

//-----------------------------------------------------------

// An encoding that means "This is not text, but it may have some

// simple ASCII text embedded". Intended input conversion (not yet

// implemented) is to keep strings of >=4 seven-bit ASCII characters

// (follow each kept string with an ASCII space), delete the rest of

// the bytes. This will pick up and allow indexing of e.g. captions

// in JPEGs. No output conversion needed.

BINARYENC = 61,

//-----------------------------------------------------------

// Some Web pages allow a mixture of HZ-GB and GB-2312 by using

// ~{ ... ~} for 2-byte pairs, and the browsers support this.

HZ_GB_2312 = 62,

//-----------------------------------------------------------

// Some external vendors make the common input error of

// converting MSFT_CP1252 to UTF8 *twice*. No output conversion needed.

UTF8UTF8 = 63,

//-----------------------------------------------------------

// Handled by transcoder for tamil language specific font

// encodings without the support for detection at present.

TAM_ELANGO = 64, // Elango - Tamil

TAM_LTTMBARANI = 65, // Barani - Tamil

TAM_SHREE = 66, // Shree - Tamil

TAM_TBOOMIS = 67, // TBoomis - Tamil

TAM_TMNEWS = 68, // TMNews - Tamil

TAM_WEBTAMIL = 69, // Webtamil - Tamil

//-----------------------------------------------------------

// Shift_JIS variants used by Japanese cell phone carriers.

KDDI_SHIFT_JIS = 70,

DOCOMO_SHIFT_JIS = 71,

SOFTBANK_SHIFT_JIS = 72,

// ISO-2022-JP variants used by KDDI and SoftBank.

KDDI_ISO_2022_JP = 73,

SOFTBANK_ISO_2022_JP = 74,

//-----------------------------------------------------------

NUM_ENCODINGS = 75, // Always keep this at the end. It is not a

// valid Encoding enum, it is only used to

// indicate the total number of Encodings.

};

1.2 CED获取

在windows平台

git clone https://github.com/google/compact_enc_det.git

cd compact_enc_det

cmake .

1.3 总结

CED提供了高效小巧的检测文本编码功能，可以使用到很多领域。

posted @ 2020-01-19 17:27 鹏小鹕阅读(1232) 评论(0) 编辑收藏举报

刷新页面返回顶部

鹏小鹕

CED（Compact Encoding Detection）技术

1.1 CED

1.2 CED获取

1.3 总结

公告