Character Sets, Collation, Unicode :: utf8_unicode_ci vs utf8_general_ci


Hi, 

You can check and compare sort orders provided by these two collations here: 

http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html 
http://www.collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html 

utf8_general_ci is a very simple collation. For each character it just 
- removes all accents 
- then converts to upper case 
and uses the code of the resulting "base letter" for comparison. 

For example, these Latin letters: ÀÁÅåāă (and all other Latin letters "a" 
with any accents and in any cases) are all compared as equal to "A". 
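
The two steps above can be imitated in a few lines of Python. This is only a
sketch, not MySQL's actual weight table: the function name general_ci_key is
made up for illustration, and accent removal is approximated with Unicode NFD
decomposition, which happens to work for the letters shown here.

```python
import unicodedata

def general_ci_key(text: str) -> str:
    """Rough imitation of the two steps: drop accents, then upper-case."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category "Mn"), keeping only the base letters.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.upper()

letters = "ÀÁÅåāă"
keys = {general_ci_key(ch) for ch in letters}
print(keys)  # {'A'}: every letter collapses to the same base letter
```

Two strings whose keys are equal would compare as equal under this simplified
model, which is why all the accented variants of "a" match "A".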


utf8_unicode_ci uses the Default Unicode Collation Element Table (DUCET), 
the weight table defined by the Unicode Collation Algorithm. 


The main differences are: 

1. utf8_unicode_ci supports so-called expansions and ligatures, for example: 
German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss" 
Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE". 

utf8_general_ci does not support expansions or ligatures; it sorts 
all these letters as single characters, and sometimes in the wrong order. 

2. utf8_unicode_ci is *generally* more accurate for all scripts. 
For example, in the Cyrillic block: 
utf8_unicode_ci is fine for all these languages: 
Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. 
utf8_general_ci, on the other hand, is fine only for the Russian and 
Bulgarian subset of Cyrillic. The extra letters used in Belarusian, 
Macedonian, Serbian, and Ukrainian are not sorted well. 
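
Both differences can be illustrated with a small Python sketch. It is a
simplified model under stated assumptions, not MySQL's implementation: the
tables EXPANSIONS and SINGLE_WEIGHTS and both key functions are hypothetical
and hand-written, they cover only the letters used here, and the real DUCET
is far larger. The "general-like" key simply compares upper-cased code
points, which is enough to show the effect on extra Cyrillic letters.

```python
# Hypothetical, hand-written tables; they only cover the letters used below.
EXPANSIONS = {"ß": "SS", "Œ": "OE", "œ": "OE"}  # one character -> two weights
SINGLE_WEIGHTS = {"ß": "S"}                     # one character -> one weight

def unicode_like_key(text: str) -> str:
    """Expand ligatures before comparing (the idea behind difference 1)."""
    return "".join(EXPANSIONS.get(ch, ch.upper()) for ch in text)

def general_like_key(text: str) -> str:
    """Treat every character as a single upper-cased code point."""
    return "".join(SINGLE_WEIGHTS.get(ch, ch.upper()) for ch in text)

words = ["Mast", "Maße", "Maser", "Masse"]
print(sorted(words, key=unicode_like_key))  # ['Maser', 'Maße', 'Masse', 'Mast']
print(sorted(words, key=general_like_key))  # ['Maße', 'Maser', 'Masse', 'Mast']

# Difference 2: the Ukrainian letter Є (U+0404) lies outside the basic
# Russian range, so a plain code-point comparison puts it before Б (U+0411),
# although the Ukrainian alphabet places it after Д and Е.
names = ["Дар", "Єва", "Борис"]
print(sorted(names, key=general_like_key))  # ['Єва', 'Борис', 'Дар']
```

With the expanding key, "Maße" lands right next to "Masse"; with the
single-character key it drifts away from it. The sketch does not model how
DUCET orders Cyrillic, so only the general-like key is shown for the names.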


The disadvantage of utf8_unicode_ci is that it is a little bit 
slower than utf8_general_ci. 

So when you need a better sorting order, use utf8_unicode_ci, 
and when performance matters most to you, use utf8_general_ci.

 

posted @ 2017-03-22 10:48  papering