[Repost] A Composite Approach to Language/Encoding Detection

Original article: https://www-archive.mozilla.org/projects/intl/universalcharsetdetection

A composite approach to language/encoding detection

 

Shanjian Li (shanjian@netscape.com)
Katsuhiko Momoi (momoi@netscape.com)
Netscape Communications Corp.


[Note: This paper was originally presented at the 19th International Unicode Conference (San Jose). Since then the implementation has gone through a period of real-world usage and we made many improvements along the way. A major change is that we now use positive sequences to detect single-byte charsets, cf. Sections 4.7 and 4.7.1. This paper was written when the universal charset detection code was not part of the Mozilla main source (see Section 8). Since then, the code has been checked into the tree. For a more up-to-date implementation, see our open source code in the Mozilla source tree. - The authors, 2002-11-25.]

1. Summary:


This paper presents three types of auto-detection methods to determine encodings of documents without an explicit charset declaration. We discuss the merits and demerits of each method and propose a composite approach in which all 3 types of detection methods are used in such a way as to maximize their strengths and complement the other detection methods. We argue that auto-detection can play an important role in helping browser users transition from frequent use of a character encoding menu to a more desirable state where an encoding menu is rarely, if ever, used. We envision that the transition to Unicode would have to be transparent to the users. Users need not know how characters are displayed as long as they are displayed correctly -- whether it's a native encoding or one of the Unicode encodings. Good auto-detection service could help significantly in this effort, as it takes most encoding issues out of the user's concerns.

2. Background:


Since the beginning of the computer age, many encoding schemes have been created to represent various writing scripts/characters for computerized data. With the advent of globalization and the development of the Internet, information exchanges crossing both language and regional boundaries are becoming ever more important. But the existence of multiple coding schemes presents a significant barrier. Unicode has provided a universal coding scheme, but it has not so far replaced existing regional coding schemes, for a variety of reasons. This is so despite the fact that many W3C and IETF recommendations, e.g. XML, XHTML, and RDF, list UTF-8 as the default encoding. Thus, today's global software applications are required to handle multiple encodings in addition to supporting Unicode.

The current work has been conducted in the context of developing an Internet browser. To deal with the variety of languages using different encodings on the web today, a great deal of effort has been expended. In order to get the correct display result, browsers should be able to utilize the encoding information provided by HTTP servers, web pages, or end users via a character encoding menu. Unfortunately, this type of information is missing from many HTTP servers and web pages. Moreover, most average users are unable to provide this information via manual operation of a character encoding menu. Without this charset information, web pages are sometimes displayed as "garbage" characters, and users are unable to access the desired information. This also leads users to conclude that their browser is malfunctioning or buggy.

As more Internet standard protocols designate Unicode as the default encoding, there will undoubtedly be a significant shift toward the use of Unicode on web pages. Good universal auto-detection can make an important contribution toward such a shift if it works seamlessly without the user ever having to use an encoding menu. Under such a condition, a gradual shift to Unicode could be painless and without noticeable effects on web users, since for users, pages simply display correctly without them doing anything or paying attention to an encoding menu. Such a smooth transition could be aided by making encoding issues less and less noticeable to the users. Auto-detection would play an important role in such a scenario.

3. Problem Scope:

 

3.1. General Schema

Let us begin with a general schema. For most applications, the following represents a general framework of auto-detection use:

Input Data -> Auto-detector -> Returned Results

An application/program takes the returned result(s) from an auto-detector and then uses this information for a variety of purposes, such as setting the encoding for the data, displaying the data as intended by the original creator, passing it on to other programs, and so on.

The auto-detection methods discussed in this paper use an Internet Browser application as an example. These auto-detection methods, however, can be easily adapted for other types of applications. 

3.2. Browser and auto-detection


Browsers may use certain detection algorithms to auto-detect the encoding of web pages. A program can potentially interpret a piece of text in any number of ways assuming different encodings, but except in some extremely rare situations, only one interpretation is desired by the page's author. This is normally the only reasonable way for the user to see that page correctly in the intended language.

To list the major factors in designing an auto-detection algorithm, we begin with certain assumptions about input text and approaches to them. Taking web page data as an example,

1. Input text is composed of words/sentences readable to readers of a particular language. (= The data is not gibberish.)

2. Input text is from typical web pages on the Internet. (= The data is usually not from some dead or ancient language.)

3. The input text may contain extraneous noise which has no relation to its encoding, e.g. HTML tags, non-native words (e.g. English words in Chinese documents), space and other format/control characters.

To cover all the known languages and encodings for auto-detection is nearly an impossible task. In the current approach, we tried to cover all popular encodings used in East Asian languages, and provided a generic model to handle single-byte encodings at the same time. Russian language encodings were chosen as an implementation example of the latter type and also as our test bed for single-byte encodings. Specifically:

4. Target multi-byte encodings include UTF-8, Shift_JIS, EUC-JP, GB2312, Big5, EUC-TW, EUC-KR, ISO-2022-XX, and HZ.

5. A generic model is provided to handle single-byte encodings; Russian language encodings (KOI8-R, ISO-8859-5, Windows-1251, Mac-Cyrillic, IBM866, IBM855) are covered in a test bed and as an implementation example.

4. Three Methods of Auto-detection:

 

4.1. Introduction:


In this section, we discuss 3 different methods for detecting the encoding of text data. They are 1) the Coding Scheme method, 2) Character Distribution, and 3) 2-Char Sequence Distribution. Each has its strengths and weaknesses when used on its own, but if we use all 3 in a complementary manner, the results can be quite satisfying.

4.2. Coding Scheme Method:


This method is probably the most obvious and the one most often tried first for multi-byte encodings. In any of the multi-byte coding schemes, not all possible code points are used. If an illegal byte or byte sequence (i.e. an unused code point) is encountered when verifying a certain encoding, we can immediately conclude that this is not the right guess. A small number of code points are also specific to a certain encoding, and that fact can lead to an immediate positive conclusion. Frank Tang (Netscape Communications) developed a very efficient algorithm for detecting character sets by coding scheme, using a parallel state machine. His basic idea is:

For each coding scheme, a state machine is implemented to verify a byte sequence for this particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. There are 3 states in a state machine that are of interest to an auto-detector:

  • START state: This is the state to start with, or the state reached when a legal byte sequence (i.e. a valid code point) for a character has been identified.
  • ME state: This indicates that the state machine identified a byte sequence that is specific to the charset it is designed for and that there is no other possible encoding which can contain this byte sequence. This will lead to an immediate positive answer for the detector.
  • ERROR state: This indicates that the state machine identified an illegal byte sequence for that encoding. This will lead to an immediate negative answer for this encoding. The detector will exclude this encoding from consideration from here on.

In a typical example, one state machine will eventually provide a positive answer and all others will provide a negative answer. 

The version of PSM (Parallel State Machine) used in the current work is a modification of Frank Tang's original work. Whenever a state machine reaches the START state, meaning it has successfully identified a legal character, we query the state machine to see how many bytes this character has. This information is used in 2 ways. 

  • First, for UTF-8 encoding, if several multi-byte characters are identified, the input data is very unlikely to be anything other than UTF-8. So we count the number of multi-byte characters identified by the UTF-8 state machine. When it reaches a certain number (= the threshold), a conclusion is made.
  • Second, for other multi-byte encodings, this information is fed to Character Distribution analyzer (see below) so that the analyzer can deal with character data rather than raw data.
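To make the state-machine idea concrete, here is a minimal sketch in Python of a single coding-scheme verifier with the START/ME/ERROR states described above. It is not Mozilla's PSM code; it only verifies UTF-8 byte sequences and counts complete multi-byte characters, and the threshold value is an assumption chosen for illustration.

START, ME, ERROR = "START", "ME", "ERROR"

class Utf8SchemeDetector:
    def __init__(self, threshold=6):
        self.state = START
        self.remaining = 0          # continuation bytes still expected
        self.multibyte_chars = 0    # complete multi-byte characters seen so far
        self.threshold = threshold  # assumed count needed for a positive answer

    def feed(self, byte: int) -> str:
        """Feed one byte; return the resulting state (START, ME, or ERROR)."""
        if self.state in (ME, ERROR):
            return self.state                      # answer already decided
        if self.remaining == 0:                    # expecting a lead byte
            if byte < 0x80:
                pass                               # ASCII byte, stay in START
            elif 0xC2 <= byte <= 0xDF:
                self.remaining = 1                 # 2-byte sequence
            elif 0xE0 <= byte <= 0xEF:
                self.remaining = 2                 # 3-byte sequence
            elif 0xF0 <= byte <= 0xF4:
                self.remaining = 3                 # 4-byte sequence
            else:
                self.state = ERROR                 # illegal lead byte for UTF-8
        else:                                      # expecting a continuation byte
            if 0x80 <= byte <= 0xBF:
                self.remaining -= 1
                if self.remaining == 0:
                    self.multibyte_chars += 1
                    if self.multibyte_chars >= self.threshold:
                        self.state = ME            # confident: this is UTF-8
            else:
                self.state = ERROR                 # broken multi-byte sequence
        return self.state

A parallel state machine runs one such verifier per candidate encoding, feeds every incoming byte to all of them, and drops any verifier that reaches the ERROR state.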

 

4.3. Character Distribution Method:


In any given language, some characters are used more often than other characters. This fact can be used to devise a data model for each language script. This is particularly useful for languages with a large number of characters such as Chinese, Japanese and Korean. We often hear anecdotally about such distributional statistics, but we have not found many published results. Thus for the following discussions, we relied mostly on our own collected data.

4.3.1. Simplified Chinese:



Our research on text data covering the 6763 Chinese characters encoded in GB2312 shows the following distributional results:

Number of Most Frequent Characters    Accumulated Percentage
10                                    0.11723
64                                    0.31983
128                                   0.45298
256                                   0.61872
512                                   0.79135
1024                                  0.92260
2048                                  0.98505
4096                                  0.99929
6763                                  1.00000

Table 1. Simplified Chinese Character Distribution Table

 

4.3.2. Traditional Chinese:


Annual research conducted by Taiwan's Mandarin Promotion Council shows a similar result for Traditional Chinese encoded in Big5.


Number of Most Frequent Characters    Accumulated Percentage
10                                    0.11713
64                                    0.29612
128                                   0.42261
256                                   0.57851
512                                   0.74851
1024                                  0.89384
2048                                  0.97583
4096                                  0.99910

Table 2. Traditional Chinese Character Distribution Table



4.3.3. Japanese:


We collected our own data for Japanese, then wrote a utility to analyze it. The following table shows the results:

Number of Most Frequent Characters    Accumulated Percentage
10                                    0.27098
64                                    0.66722
128                                   0.77094
256                                   0.85710
512                                   0.92635
1024                                  0.97130
2048                                  0.99431
4096                                  0.99981
                                      1.00000

Table 3. Japanese Character Distribution Table

4.3.4. Korean:


Similarly for Korean, we collected our own data from the Internet and ran our utility on it. The results are as follows:


Number of Most Frequent Characters    Accumulated Percentage
10                                    0.25620
64                                    0.64293
128                                   0.79290
256                                   0.92329
512                                   0.98653
1024                                  0.99944
2048                                  0.99999
4096                                  0.99999

Table 4. Korean Character Distribution Table

 

 

4.4. General characteristics of the distributional results:


In all four of these languages, we find that a rather small set of code points covers a significant percentage of the characters used in our defined application scope. Moreover, closer examination of those frequently used code points shows that they are scattered over a rather wide coding range. This gives us a way to overcome a common problem encountered in the Coding Scheme analyzer, namely that different national encodings may share overlapping code points. Because the most frequently occurring sets for these languages have the characteristics described above, the overlap problem between different encodings seen in the Code Scheme method will be insignificant in the Distribution method.

4.5. Algorithm for analysis:


In order to identify characteristics of a language based on character frequency/distribution statistics, we need an algorithm to calculate a value from a stream of text input. This value should show the likelihood of this stream of text being in a certain character encoding. A natural choice might be to calculate this value based on each character's frequency weight. But from our experiments with various character encodings, we find that this approach is not necessary, and that it uses too much memory and CPU power. A simplified version provides a very satisfactory result, uses far fewer resources, and runs faster.

In the current approach, all characters in a given encoding are classified into 2 categories, "frequently used" and "not frequently used". If a character is among the top 512 characters in the frequency distribution table, it is categorized as a "frequently used" character. The number 512 is chosen because it covers a significant portion of the accumulated percentages for any of the 4 languages' input text while occupying only a small percentage of code points. We count the number of characters in each category in a batch of input text, and then calculate a float value we call the Distribution Ratio.

The Distribution Ratio is defined as follows: 

Distribution Ratio = (number of occurrences of the 512 most frequently used characters) / (number of occurrences of the rest of the characters)

Each of the multi-byte encodings tested actually shows a distinct Distribution Ratio. From this ratio, we can then calculate the confidence level of the raw input text for a given encoding. The following discussion of each encoding should make this clearer.

4.6. Distribution Ratio and Confidence Level:


Let us look at the 4 languages' data to see the differences in Distribution Ratios. Note first that we use the term Distribution Ratio in two ways. An "ideal" Distribution Ratio is defined for language scripts/character sets rather than for encodings. If a language script/character set is represented by more than one encoding, then, for each encoding, we calculate the "actual" Distribution Ratio of the input data by sorting characters into the "frequently used" and "not frequently used" categories. This value is then compared against the ideal Distribution Ratio of the language script/character set. Based on the actual Distribution Ratios obtained, we can calculate the Confidence level for each set of input data as described below.

4.6.1. Simplified Chinese (GB2312):


The GB2312 encoding contains two levels of Chinese characters. Level 1 contains 3755 characters, and Level 2, 3008 characters. Level 1 characters are more frequently used than Level 2 ones, and it is no surprise that all 512 characters on the most frequently used character list for GB2312 are within Level 1. Because Level 1 characters are sorted by pronunciation, those 512 characters are evenly scattered among the 3755 Level 1 code points. These characters occupy 13.64% of all the code points in Level 1, but they cover 79.135% of the character occurrences in a typical Chinese text. In an ideal situation, a piece of Chinese text that contains enough characters should give us something like:

Distribution Ratio = 0.79135 / (1 - 0.79135) = 3.79

And for a randomly generated text using the same encoding scheme, the ratio should be around 512 / (3755 - 512) = 0.157 if no Level 2 characters are used.

If we take Level 2 characters into consideration, we can assume that the average probability of each Level 1 character is p1, and that of each Level 2 character is p2. The calculation then would be:

512*p1 / (3755*p1 + 3008*p2 - 512*p1) = 512 / (3755 + 3008*(p2/p1) - 512)

Obviously, this value is even smaller. In a later analysis, we just use the worst case for comparison.

4.6.2. Big5:


Big5 and EUC-TW (i.e. the CNS character set) encodings have a very similar story. Big5 also encodes Chinese characters in 2 levels. The 512 most frequently used characters are evenly scattered among the 5401 Level 1 characters. The ideal ratio we can get from a Big5-encoded text is:

Distribution Ratio = 0.74851 / (1 - 0.74851) = 2.98

And a randomly generated text should have a ratio near

512 / (5401 - 512) = 0.105

Since Big5 Level 1 characters are nearly identical to CNS plane 1 characters, the same analysis applies to EUC-TW.

4.6.3. Japanese Shift_JIS & EUC-JP:


For the Japanese language, Hiragana and Katakana are usually more frequently used than Kanji. Because Shift_JIS and EUC-JP encode Hiragana and Katakana in different coding ranges, we are still able to use this method to distinguish between the two encodings.
Those Kanji characters that are among the 512 most frequently used characters are also scattered evenly among the 2965 characters of the JIS Level 1 Kanji set. The same analysis leads to the following distribution ratio:

Distribution Ratio = 0.92635 / (1-0.92635) = 12.58

For randomly generated Japanese text data, the ratio should be at least 

512 / (2965+63+83+86-512) = 0.191. 

The calculation includes Hankaku Katakana (63), Hiragana (83), and Katakana (86).


4.6.4. Korean EUC-KR:


In the EUC-KR encoding, the number of Hanja (Chinese) characters actually used in a typical Korean text is insignificant. The 2350 Hangul characters coded in this encoding are arranged by pronunciation. In the frequency table we obtained by analyzing a large amount of Korean text data, the most frequently used characters are evenly distributed among these 2350 code points. Using the same analysis, in an ideal situation, we get:

Distribution Ratio = 0.98653 / (1-0.98653) = 73.24

For randomly generated Korean text, it should be:

512 / (2350-512) = 0.279.


4.6.5. Calculating Confidence Level:


From the foregoing discussions for each language script, we can define the Confidence level for each data set as follows:


Confidence Detecting(InputText)
{
  for each multi-byte character in InputText
  {
      TotalCharacterCount++;
      if the character is among the 512 most frequent ones
          FrequentCharacterCount++;
  }

  Ratio = FrequentCharacterCount
              / (TotalCharacterCount - FrequentCharacterCount);
  Confidence = Ratio / CHARSET_RATIO;
  Return Confidence;
}


The Confidence level for a given data set is defined as the Distribution Ratio of the input data divided by the ideal Distribution Ratio obtained by the analyses in the preceding sections.
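As a concrete illustration, the following is a minimal runnable Python sketch of this calculation. It assumes the caller supplies the set of the 512 most frequent characters and the ideal ratio (CHARSET_RATIO) for the candidate charset; the parameter names are illustrative, not taken from the Mozilla code.

def distribution_confidence(chars, frequent_set, charset_ratio):
    """Confidence = (actual Distribution Ratio) / (ideal Distribution Ratio)."""
    total = frequent = 0
    for ch in chars:                   # chars: decoded multi-byte characters from the Coding Scheme layer
        total += 1
        if ch in frequent_set:         # among the 512 most frequent characters for this charset
            frequent += 1
    rare = total - frequent
    if total == 0:
        return 0.0
    if rare == 0:                      # tiny input made up only of frequent characters
        return 1.0
    ratio = frequent / rare            # actual Distribution Ratio of the input
    return min(ratio / charset_ratio, 1.0)   # an implementation might cap the value at 1.0

For GB2312, for example, charset_ratio would be the ideal value 3.79 derived in Section 4.6.1.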


4.7. Two-Char Sequence Distribution Method:


In languages that use only a small number of characters, we need to go further than counting the occurrences of each single character. Combinations of characters reveal more language-characteristic information. We define a 2-Char Sequence as 2 characters appearing immediately one after another in input text, and the order is significant in this case. Just as not all characters are used equally frequently in a language, 2-Char Sequence distribution also turns out to be extremely language/encoding dependent. This characteristic can be used in language detection. It leads to better confidence in detecting a character encoding, and is very useful in detecting single-byte languages.

Let's use the Russian language as an example. We downloaded around 20MB of Russian plain text and wrote a program to analyze the text. The program found 21,199,528 2-Char sequence occurrences in total. Among the sequences we found, some are irrelevant for our consideration, e.g. the space-space combination. These sequences are considered noise, and their occurrences are not included in the analysis. In the data we used to detect the Russian language encodings, this left 20,134,122 2-Char sequence occurrences, covering about 95% of all the sequence occurrences found in the data. The sequences used in building our language model fall into 4096 different sequences, and 1961 of them appear fewer than 3 times in our 20,134,122 samples. We call these 1961 sequences the Negative Sequence Set of this language.
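For illustration, a language model of this kind could be generated from sample text roughly as follows. This is a hypothetical sketch, not the authors' actual tooling: it assumes the sample set is simply the 64 most frequent non-space characters of the corpus and that any sequence seen fewer than 3 times is negative.

from collections import Counter

def build_sequence_model(text, sample_size=64, negative_threshold=3):
    # Pick the most frequent non-space characters as the sample set.
    char_freq = Counter(ch for ch in text if not ch.isspace())
    sample_set = {ch for ch, _ in char_freq.most_common(sample_size)}

    # Count ordered 2-char sequences whose characters are both in the sample set.
    seq_counts = Counter()
    for first, second in zip(text, text[1:]):
        if first in sample_set and second in sample_set:
            seq_counts[(first, second)] += 1

    # Sequences occurring fewer than negative_threshold times (possibly never)
    # form the Negative Sequence Set for this language/encoding.
    negative_set = {(a, b)
                    for a in sample_set for b in sample_set
                    if seq_counts[(a, b)] < negative_threshold}
    return sample_set, seq_counts, negative_set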

4.7.1. Algorithm for determining Confidence Level


For single-byte languages, we define the Confidence Level as follows:

Confidence Detecting(InputText)
{
  for each character in InputText
  {
      If the character is not a symbol or punctuation character
          TotalCharacters++;
      Find its frequency order in the frequency table;
      If (Frequency order < SampleSize)
      {
          FrequentCharCount++;
          If we do not have lastChar
          {
              lastChar = thisChar;
              continue;
          }
          if both lastChar and thisChar are within our sample range
          {
              TotalSequence++;
              If Sequence(lastChar, thisChar) belongs to NegativeSequenceSet
                  NegativeSequenceCount++;
          }
          lastChar = thisChar;
      }
  }
  Confidence = (TotalSequence - NegativeSequenceCount) / TotalSequence
               * FrequentCharCount / TotalCharacters;
  return Confidence;
}
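The pseudocode above can be rendered as the following runnable Python sketch. The helper names (freq_order, negative_set) are illustrative assumptions, not identifiers from the Mozilla source, and the text is assumed to be already decoded with the candidate single-byte encoding.

import unicodedata

def sequence_confidence(text, freq_order, negative_set, sample_size=64):
    total_chars = frequent_chars = 0
    total_seq = negative_seq = 0
    last_char = None
    for ch in text:
        # Skip symbols, punctuation, separators, and control characters as noise.
        if unicodedata.category(ch)[0] in ("P", "S", "Z", "C"):
            continue
        total_chars += 1
        order = freq_order.get(ch)          # rank in the frequency table, 0 = most frequent
        if order is not None and order < sample_size:
            frequent_chars += 1
            if last_char is not None:       # both characters fall within the sample range
                total_seq += 1
                if (last_char, ch) in negative_set:
                    negative_seq += 1
            last_char = ch
    if total_chars == 0 or total_seq == 0:
        return 0.0
    return ((total_seq - negative_seq) / total_seq) * (frequent_chars / total_chars)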



There are several things in the algorithm that need to be explained. 

First, this sequence analysis is not applied to all characters. We could build a 256 by 256 matrix to cover all such sequences, but many of them are irrelevant to language/encoding analysis and thus unnecessary. Since most single-byte languages use fewer than 64 letters, the most frequently used 64 characters seem to cover almost all the language-specific characters. This way, the matrix can be reduced to 64 by 64, which is much smaller. So we use 64 as our SampleSize in this work. The 64 characters we choose to build our model are mostly based on the frequency statistics, with some adjustment allowed. Some characters, such as 0x0d and 0x0a, play roles very similar to the space character (0x20) from our perspective, and thus have been eliminated from the sampling.

Second, of all the sequences covered by this 64 by 64 model, some are also irrelevant to detecting language/encoding. Since almost all single-byte language encodings include ASCII as a subset, it is very common to see a lot of English words in data from other languages, especially on web sites. It is also obvious that the space-space sequence has no connection with any language encoding. These are considered "noise" in our detection and are removed by filtering.

Third, in calculating confidence, we also need to count the number of characters that fall into our sample range and those that do not. If most of the characters in a small data sample do not fall into our sampling range, the sequence distribution itself may return a high value, since very few negative sequences might be found in such a case. After filtering, most of the characters fed to the detector should fall into the sampling range if the text is in the desired encoding. So the confidence obtained from counting negative sequences needs to be adjusted by this number.

To summarize the foregoing:

  • Only a subset of all the characters is used for character set identification. This keeps our model small. It also improves detection accuracy by reducing noise.
  • Each language model is generated by a script/tool.
  • Handling of Latin Alphabet characters:
  • If the language does not use Latin alphabet letters, Latin-letter to Latin-letter sequences are removed as noise for detection. (E.g., English words frequently appear in web pages of other languages.)
  • If the language does use Latin Alphabet letters, those sequences are kept for analysis.
  • The number of characters that fall into our sample range and those that do not are counted so that they can be used in calculating the Confidence Level.

 

5. Comparison of the 3 methods:

 

5.1. Code scheme:


For many single-byte encodings, all code points are used fairly evenly. And even for those encodings that do contain some unused code points, those unused code points are seldom used in other encodings and are thus unsuitable for encoding detection. 

For some multi-byte encodings, this method leads to a very good result and is very efficient. However, because some multi-byte encodings such as EUC-CN and EUC-KR share almost identical code points, it is very hard to distinguish among such encodings with this method. Considering the fact that a browser normally does not have a large amount of text, we must resort to other methods to decide on an encoding.

For 7-bit multi-byte encodings like ISO-2022-xx and HZ, which use easily recognizable escape or shift sequences, this method produces satisfactory results. Summarizing, the Code Scheme method:

  • is very good for 7-bit multi-byte encodings like ISO-2022-xx and HZ.
  • is good for some multi-byte encodings like Shift_JIS and EUC-JP, but not for others like EUC-CN and EUC-KR.
  • is not very useful for single-byte encodings.
  • can apply to any kind of text.
  • is fast and efficient.



5.2. Character Distribution:


For multi-byte encodings, and especially those that cannot be handled reliably by the Code Scheme method, Character Distribution offers strong help without digging into complicated context analysis. For single-byte encodings, because the input data size is usually small and there are so many possible encodings, it is unlikely to produce good results except in some special situations. Since the 2-Char Sequence Distribution method leads to a very good detection result in such cases, we have not gone further with this method for single-byte encodings. Summarizing these points, the Character Distribution method:

  • is very good for multi-byte encodings.
  • only applies to typical text.
  • is fast and efficient.



5.3. 2-Char Sequence Distribution:


The 2-Char Sequence Distribution method lets us use more of the information in the data when detecting language/encodings. That leads to good results even with a very small data sample. But because sequences are used instead of words (separated by a space), the matrix would be very large if it were applied to multi-byte languages. Thus this method:

  • is very good for single-byte encodings.
  • is not efficient for multi-byte encodings.
  • can lead to good results with even small sample size.
  • only applies to typical text.

 

6. A Composite Approach:

 

6.1. Combining the 3 methods:


The languages/encodings we want to cover with our charset auto-detector include a number of multi-byte and single-byte encodings. Given the deficiencies of each method, none of the 3 methods alone can produce truly satisfactory results. We therefore propose a composite approach which can deal with both types of encodings.

The 2-Char Sequence Distribution method is used for all single-byte encoding detection.
The Code Scheme method is used for UTF-8, ISO-2022-xx and HZ detection. In UTF-8 detection, a small modification has been made to the existing state machine. The UTF-8 detector declares its success after several multi-byte sequences have been identified. (See Duerst (1997) for details.) Both the Code Scheme and Character Distribution methods are used for major East Asian character encodings such as GB2312, Big5, EUC-TW, EUC-KR, Shift_JIS, and EUC-JP.

For Japanese encodings like Shift_JIS and EUC-JP, the 2-Char Sequence Distribution method can also be used, because they contain a significant number of Hiragana syllabary characters, which work like letters in single-byte languages. The 2-Char Sequence Distribution method can achieve an accurate result with less text material.

We tried both approaches -- one with the 2-Char Distribution method and the other without. Both led to quite satisfactory results. There are some web sites which contain a lot of Kanji and Katakana characters but only a few Hiragana characters. To achieve the best possible result, we use both the Character Distribution and 2-Char Distribution methods for Japanese encoding detection.

Here then is one example of how these 3 detection methods are used together. The uppermost control module (for the auto-detectors) has an algorithm like the following:


Charset AutoDetection (InputText)
{
   if (all characters in InputText are ASCII)
   {
       if InputText contains ESC or "~{"
       {
           call the ISO-2022 and HZ detectors with InputText;
           if one of them succeeds, return that charset, otherwise return ASCII;
       }
       else
           return ASCII;
   }
   else if (InputText starts with a BOM)
   {
       return UCS2;
   }
   else
   {
       Call all multi-byte detectors and single-byte detectors;
       Return the one with the best confidence;
   }
}
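A minimal Python sketch of this control flow follows. The detector registry, the list of escape-based charsets, and the confidence threshold are simplified assumptions for illustration, not Mozilla's implementation.

def autodetect_charset(data, detectors, shortcut_threshold=0.95):
    # detectors: dict mapping a charset name to a callable returning a confidence in [0, 1]
    if all(b < 0x80 for b in data):                          # pure 7-bit input
        if b"\x1b" in data or b"~{" in data:                 # ESC or the HZ lead-in
            for name in ("ISO-2022-JP", "ISO-2022-KR", "ISO-2022-CN", "HZ-GB-2312"):
                if name in detectors and detectors[name](data) > shortcut_threshold:
                    return name
        return "ASCII"
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):               # BOM found
        return "UCS2"
    # Otherwise run every multi-byte and single-byte detector and keep the best.
    best_name, best_conf = "unknown", 0.0
    for name, detect in detectors.items():
        conf = detect(data)
        if conf > best_conf:
            best_name, best_conf = name, conf
    return best_name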




Summarizing the sequences in the code fragment above, 

  • Most web pages are still encoded in ASCII. This top-level control algorithm begins with an ASCII verifier. If all characters are ASCII, there is no need to launch other detectors except ISO-2022-xx and HZ ones.
  • The ISO-2022-xx and HZ detectors are launched only after encountering ESC or "~{", and they are abandoned immediately when an 8-bit byte is met.
  • A BOM is searched for to identify UCS2. We found that some web sites send 0x00 bytes inside the HTTP stream, so using that byte for identifying UCS2 proved to be unreliable.
  • If any one of the active detectors receives enough data and reaches a high level of confidence, the entire auto-detection process is terminated and that charset is returned as the result. This is called a shortcut.

 

6.2. Test Results:


As a test of the approach advocated in this paper, we applied our detector(s) to the home pages of 100 popular international web sites that declare no charset either in the document or via server-sent HTTP headers. For all the encodings covered by our detector(s), we were able to achieve a 100% accuracy rate.

For example, when visiting a web site that provides no charset information (e.g. the web site at http://www.yahoo.co.jp before its server started sending the charset info), our charset detector(s) generates output like the following:

[UTF8] is inactive
[SJIS] is inactive
[EUCJP] detector has confidence 0.950000
[GB2312] detector has confidence 0.150852
[EUCKR] is inactive
[Big5] detector has confidence 0.129412
[EUCTW] is inactive
[Windows-1251 ] detector has confidence 0.010000 
[KOI8-R] detector has confidence 0.010000 
[ISO-8859-5] detector has confidence 0.010000 
[x-mac-cyrillic] detector has confidence 0.010000 
[IBM866] detector has confidence 0.010000 
[IBM855] detector has confidence 0.010000 

This then leads to the determination that EUC-JP is the most likely encoding for this site.

7. Conclusion:


The composite approach that utilizes the Code Scheme, Character Distribution and 2-Char Sequence Distribution methods to identify language/encodings has proven to be very effective and efficient in our environment. We covered Unicode encodings, multi-byte encodings and single-byte encodings. These are representative of the encodings used in today's digital text on the Internet. It is reasonable to believe that this method can be extended to cover the rest of the encodings not covered in this paper.

Though only encoding information is desired in our detection results at this time, the language is also identified in most cases. In fact, both the Character Distribution and 2-Char Distribution methods rely on characteristic distributional patterns of different languages' characters. Only in the case of UTF-16 and UTF-8 is the encoding detected while the language remains unknown. But even in such cases, this work can still be easily extended to cover language detection in the future.

The 3 methods outlined here have been implemented in Netscape 6.1 PR1 and later versions as the "Detect All" option. We expect our work in auto-detection to further free our users from having to deal with cumbersome manipulations of the Character Coding menu. The Character Coding menu (or Encoding menu in other products) is different from other UI items in an Internet client in that it exposes part of the i18n backend to general users. Its very existence is a mirror of how messy today's web pages are when it comes to language/encoding.

We hope that offering good encoding defaults and universal auto-detection will help alleviate most of the encoding problems our users encounter in surfing the net. Web standards are shifting toward Unicode, particularly toward UTF-8, as the default encoding. We expect a gradual increase in its use on the web. Such a shift need not be overt, as more and more users are freed from confronting encoding-related issues while browsing or reading/sending messages, thanks in part to auto-detection. This is why we advocate good auto-detection and good default encoding settings for Internet clients.

8. Future Work:


Our auto-detection identifies a language; the encoding determination is a byproduct of that identification. For the current work, we only covered Russian as an example of a single-byte implementation. Since the detector identifies a language first and only then the encoding it uses, the more language data models there are, the better the quality of encoding detection.

To add other single-byte languages/encodings, we need a large amount of text sample data for each language and a certain degree of language knowledge/analysis. We currently use a script to generate a language model for all the encodings of that language.

This work is at present not in the Mozilla source, but we hope to make it public in the near future. When we do, we hope people with the above qualifications will contribute in this area. Because we have not yet tested many single-byte encodings, it is likely that the model we propose here will need to be fine-tuned, modified or possibly even redesigned when applied to other languages/encodings.

9. References:

Duerst, Martin. 1997. The Properties and Promises of UTF-8. 11th International Unicode Conference.
     http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/IUC11-UTF-8.pdf
Mandarin Promotion Council, Taiwan. Annual survey results of Traditional Chinese character usage.
     http://www.edu.tw/mandr/result/87news/index1.htm
Mozilla Internationalization Projects. http://www.mozilla.org/projects/intl
Mozilla.org. http://www.mozilla.org/
Mozilla source viewing. http://lxr.mozilla.org/
 

 

 
