
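Notes on Character Encoding Conversion

What is a character encoding?

A character encoding is the format in which a computer stores text. English letters, for example, are stored as ASCII, one byte per character. Other encodings include UTF-8 (a universal encoding) and regional encodings such as the ISO-8859 family (ISO-8859-1 covers Western Europe), windows-1251 (Russian), and the Chinese GB encodings.

Why is conversion needed?

Because different regions use different encodings, exchanging information requires converting the same characters from one encoding to another.

The universal encoding is UTF-8, which can represent every character in Unicode. For portability, files are therefore usually stored as UTF-8 (for example, web pages that must display several languages), and text in other encodings is usually converted to UTF-8.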
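To make the difference between encodings concrete, the sketch below prints the bytes of one character, the Cyrillic capital letter "А" (U+0410), in windows-1251 and in UTF-8. The helper name hex is only illustrative.

```lua
-- The same character stored in two encodings.
local cp1251 = "\192"       -- one byte, 0xC0, in windows-1251
local utf8   = "\208\144"   -- two bytes, 0xD0 0x90, in UTF-8

-- Illustrative helper: render every byte of s as two hex digits.
local function hex(s)
  return (s:gsub(".", function(c)
    return string.format("%02X ", c:byte())
  end))
end

print(#cp1251, hex(cp1251))
print(#utf8, hex(utf8))
```

Converting between the two encodings means replacing the single byte 0xC0 with the two bytes 0xD0 0x90, and likewise for every other character.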

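The libiconv conversion library

http://www.gnu.org/software/libiconv/#introduction

The GNU project provides an open-source conversion library, libiconv, that converts between a number of encodings and Unicode. It can be used on systems that do not provide encoding conversion themselves.

Project page: http://savannah.gnu.org/projects/libiconv/

Recent versions of the Linux C library (glibc) already provide iconv conversion, so libiconv does not have to be installed:

http://davidgao.github.io/LFSCN/chapter06/glibc.html

Some packages outside of LFS suggest installing GNU libiconv to convert text between encodings. The project's home page (http://www.gnu.org/software/libiconv/) says "This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode." Glibc provides an iconv() implementation and can convert from/to Unicode, so libiconv is not required on an LFS system.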

 

LUAICONV

For Lua, the iconv functionality has been wrapped in a dedicated library, lua-iconv, for use by Lua application scripts.

Official documentation:

http://ittner.github.io/lua-iconv/#download-and-installation

 

 local iconv = require("iconv")
  cd = iconv.new(to, from)
  cd = iconv.open(to, from)

  nstr, err = cd:iconv(str)

    Converts the 'str' string to the desired charset. This method always
    returns two arguments: the converted string and an error code, which
    may have any of the following values:

    nil
        No error. Conversion was successful.

    iconv.ERROR_NO_MEMORY
        Failed to allocate enough memory in the conversion process.

    iconv.ERROR_INVALID
        An invalid character was found in the input sequence.

    iconv.ERROR_INCOMPLETE
        An incomplete character was found in the input sequence.

    iconv.ERROR_FINALIZED
        Trying to use an already-finalized converter. This usually means
        that the user was tweaking the garbage collector private methods.

    iconv.ERROR_UNKNOWN
        There was an unknown error.
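The error codes listed above can be folded into a small wrapper. This is a minimal sketch, not part of lua-iconv: the names describe_error and convert are hypothetical, and the iconv module is passed in as a parameter so the helpers can be exercised even where the library is not installed. Only iconv.new, cd:iconv and the ERROR_* constants are taken from the documentation quoted above.

```lua
-- Map an error code returned by cd:iconv() to a readable message.
local function describe_error(iconv, err)
  if err == nil then return nil end
  if err == iconv.ERROR_NO_MEMORY then return "out of memory" end
  if err == iconv.ERROR_INVALID then return "invalid character in input" end
  if err == iconv.ERROR_INCOMPLETE then return "incomplete character in input" end
  if err == iconv.ERROR_FINALIZED then return "converter already finalized" end
  return "unknown error"
end

-- Convert text from charset `from` to charset `to`.
-- Returns the converted string, or nil plus a message on failure.
local function convert(iconv, to, from, text)
  local cd = iconv.new(to, from)
  if not cd then
    return nil, "cannot create converter " .. from .. " -> " .. to
  end
  local out, err = cd:iconv(text)
  if err ~= nil then
    return nil, describe_error(iconv, err)
  end
  return out
end

-- Usage: the real conversion only runs if the module can be loaded.
local ok, iconv = pcall(require, "iconv")
if ok then
  print(convert(iconv, "utf-8", "windows-1251", "\192"))
else
  print("lua-iconv is not available: " .. tostring(iconv))
end
```

Wrapping require in pcall also turns a load failure of the module into an ordinary message instead of aborting the script.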

 

For Lua 5.1, the lua-iconv-5 release is recommended; the latest release, lua-iconv-7, is compatible with Lua 5.2.

https://github.com/ittner/lua-iconv/releases/tag/lua-iconv-5

 

After installation, running the test script fails with an error:

:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ lua test_iconv.lua
lua: error loading module 'iconv' from file './iconv.so':
    ./iconv.so: undefined symbol: libiconv_open
stack traceback:
    [C]: ?
    [C]: in function 'require'
    test_iconv.lua:1: in main chunk
    [C]: ?

 

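After some investigation (inspired by this article: http://tonybai.com/2013/04/25/a-libiconv-linkage-problem/), the cause turned out to be the following.

libiconv had been installed first, which copied its iconv.h to /usr/local/include/iconv.h.

When the lua-iconv project was then built, gcc found /usr/local/include/iconv.h first while compiling luaiconv.c, so iconv.so was compiled against the function declarations in that file.

The header that should have been used is the one provided by the system, /usr/include/iconv.h.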

 

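Search order. Note that ld -verbose, used below, lists the linker's library search directories; gcc's header search order can be listed with gcc -xc -E -v - </dev/null, and by default it puts /usr/local/include ahead of /usr/include: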

:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ ld -verbose | grep SEARCH
SEARCH_DIR("=/usr/i686-linux-gnu/lib32"); SEARCH_DIR("=/usr/local/lib32"); SEARCH_DIR("=/lib32"); SEARCH_DIR("=/usr/lib32"); SEARCH_DIR("=/usr/i686-linux-gnu/lib"); SEARCH_DIR("=/usr/local/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib/i386-linux-gnu"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/lib");

 

libiconv: /usr/local/include/iconv.h

#ifndef LIBICONV_PLUG
#define iconv_open libiconv_open
#endif
extern LIBICONV_DLL_EXPORTED iconv_t iconv_open (const char* tocode, const char* fromcode);

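In libiconv's iconv.c, the aliasing between iconv_open and libiconv_open is controlled by a macro, which was probably not enabled here; or lua-iconv was built without linking against the libiconv library. Either way, the reference to libiconv_open in iconv.so is left unresolved.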

#if defined __FreeBSD__ && !defined __gnu_freebsd__
/* GNU libiconv is the native FreeBSD iconv implementation since 2002.
   It wants to define the symbols 'iconv_open', 'iconv', 'iconv_close'.  */
#define strong_alias(name, aliasname) _strong_alias(name, aliasname)
#define _strong_alias(name, aliasname) \
  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
#undef iconv_open
#undef iconv
#undef iconv_close
strong_alias (libiconv_open, iconv_open)
strong_alias (libiconv, iconv)
strong_alias (libiconv_close, iconv_close)
#endif
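The analysis above can be checked with a short script: look for iconv.h in the candidate locations, in search order, and see whether the first one found renames iconv_open to libiconv_open. This is only a sketch. The function name first_iconv_header is hypothetical, and the two paths in the usage lines are the ones mentioned in this post; the actual search order on a given machine may differ.

```lua
-- Return the first readable path in `candidates`, plus a flag telling
-- whether that header renames iconv_open to libiconv_open (as the
-- GNU libiconv header does).
local function first_iconv_header(candidates)
  for _, path in ipairs(candidates) do
    local f = io.open(path, "r")
    if f then
      local text = f:read("*a")
      f:close()
      local renames =
        text:find("#define%s+iconv_open%s+libiconv_open") ~= nil
      return path, renames
    end
  end
  return nil, false
end

-- Usage: paths listed in the order assumed for the header search.
local path, renames = first_iconv_header({
  "/usr/local/include/iconv.h",
  "/usr/include/iconv.h",
})
print(path, renames)
```

If the first header found reports true while the module is not linked against libiconv, the undefined symbol error shown earlier is the expected result.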

 

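Fix: in the implementation file, change how iconv.h is included, from the standard angle-bracket form to a quoted include with the full path /usr/include/iconv.h.

Then run make && make install again; the test now runs fine.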

vim luaiconv.c


#include <lua.h>
#include <lauxlib.h>
#include <stdlib.h>

#include "/usr/include/iconv.h"
#include <errno.h>

 

For other installation and runtime errors, see:

https://github.com/ittner/lua-iconv/issues/3

 

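Experiment: generating a conversion table

Some embedded systems have neither libiconv installed nor an iconv implementation in libc, yet still need character conversion.

In that case, lua-iconv can be installed on the build server and, using the system's iconv, used to generate a mapping table from one encoding to another. The conversion is then implemented with that table.

For example, converting windows-1251 to UTF-8.

windows-1251 character code reference:

http://www.science.co.il/language/Character-code.asp?s=1251

Lua code that generates the table: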

function serializeTable(val, name, skipnewlines, depth)
    skipnewlines = skipnewlines or false
    depth = depth or 0
    local tmp = string.rep(" ", depth)
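    -- Numeric keys must be written as [n] for the output to be valid Lua.
    if name then tmp = tmp .. (type(name) == "number" and "[" .. name .. "]" or name) .. " = " end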
    if type(val) == "table" then
        tmp = tmp .. "{" .. (not skipnewlines and "\n" or "")
        for k, v in pairs(val) do
            tmp = tmp .. serializeTable(v, k, skipnewlines, depth + 1) .. "," .. (not skipnewlines and "\n" or "")
        end
        tmp = tmp .. string.rep(" ", depth) .. "}"
    elseif type(val) == "number" then
        tmp = tmp .. tostring(val)
    elseif type(val) == "string" then
        tmp = tmp .. string.format("%q", val)
    elseif type(val) == "boolean" then
        tmp = tmp .. (val and "true" or "false")
    else
        tmp = tmp .. "\"[inserializeable datatype:" .. type(val) .. "]\""
    end
    return tmp
end

local iconv = require("iconv")
-- Set your terminal encoding here
-- local termcs = "iso-8859-1"
local termcs = "utf-8"

function check_one(to, from, text)
  print("\n-- Testing conversion from " .. from .. " to " .. to)
  local cd = iconv.new(to .. "//TRANSLIT", from)
  assert(cd, "Failed to create a converter object.")
  local ostr, err = cd:iconv(text)
  if err == iconv.ERROR_INCOMPLETE then
    print("ERROR: Incomplete input.")
  elseif err == iconv.ERROR_INVALID then
    print("ERROR: Invalid input.")
  elseif err == iconv.ERROR_NO_MEMORY then
    print("ERROR: Failed to allocate memory.")
  elseif err == iconv.ERROR_UNKNOWN then
    print("ERROR: There was an unknown error.")
  end

  print(ostr)
  return ostr
end
 
local result = {}
local num = 255
for i = 0, num do
  print("----------------------------------- i="..i)
  local char = string.char(i)
  local ostr = check_one(termcs, "windows-1251", char)
  print(string.len(ostr))
  local byteStr = ""
  for j = 1, string.len(ostr) do
      local byteVal = string.byte(ostr,j)
      print("byte j=" ..j .. " byteVal=".. byteVal)
      byteStr = byteStr .. "\\" .. byteVal
  end
  print("char i=" ..i .. " byteStr=".. byteStr)
  table.insert(result, byteStr)
end

print("-----------------------------------!!")
local s = serializeTable(result)
print(s)
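The script stores each result as text made of decimal escapes and then passes it through %q, which doubles the backslashes, so the printed table still needs tidying by hand. As a sketch of an alternative, the helpers below format one entry directly as a valid Lua table field. The names escape_bytes and format_entry are hypothetical and not part of the original script.

```lua
-- Turn every byte of s into a decimal escape, e.g. \208\144.
local function escape_bytes(s)
  return (s:gsub(".", function(c)
    return string.format("\\%d", c:byte())
  end))
end

-- Format one table entry for windows-1251 byte `byte` (0..255).
-- The index is byte + 1, matching table.insert starting from byte 0.
local function format_entry(byte, utf8_bytes)
  return string.format('[%d] = "%s",', byte + 1, escape_bytes(utf8_bytes))
end

print(format_entry(192, "\208\144"))

-- Round trip: the escaped literal evaluates back to the original bytes.
local loader = loadstring or load   -- loadstring in Lua 5.1, load in 5.2+
local chunk = loader('return "' .. escape_bytes("\208\144") .. '"')
assert(chunk() == "\208\144")
```

Because every byte is escaped, an escape is never followed directly by a digit character, so the short form without zero padding stays unambiguous.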

 

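The tidied-up windows-1251 to UTF-8 table. Entry i+1 holds the UTF-8 bytes for windows-1251 byte i, because the generator appends entries starting from byte 0. Entry 153 (byte 0x98) is empty because that byte is unassigned in windows-1251.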

-- index = windows-1251 byte value + 1, value = UTF-8 bytes
local transTbl_1251toutf8 = {
 [1] = "\0",
 [2] = "\1",
 [3] = "\2",
 [4] = "\3",
 [5] = "\4",
 [6] = "\5",
 [7] = "\6",
 [8] = "\7",
 [9] = "\8",
 [10] = "\9",
 [11] = "\10",
 [12] = "\11",
 [13] = "\12",
 [14] = "\13",
 [15] = "\14",
 [16] = "\15",
 [17] = "\16",
 [18] = "\17",
 [19] = "\18",
 [20] = "\19",
 [21] = "\20",
 [22] = "\21",
 [23] = "\22",
 [24] = "\23",
 [25] = "\24",
 [26] = "\25",
 [27] = "\26",
 [28] = "\27",
 [29] = "\28",
 [30] = "\29",
 [31] = "\30",
 [32] = "\31",
 [33] = "\32",
 [34] = "\33",
 [35] = "\34",
 [36] = "\35",
 [37] = "\36",
 [38] = "\37",
 [39] = "\38",
 [40] = "\39",
 [41] = "\40",
 [42] = "\41",
 [43] = "\42",
 [44] = "\43",
 [45] = "\44",
 [46] = "\45",
 [47] = "\46",
 [48] = "\47",
 [49] = "\48",
 [50] = "\49",
 [51] = "\50",
 [52] = "\51",
 [53] = "\52",
 [54] = "\53",
 [55] = "\54",
 [56] = "\55",
 [57] = "\56",
 [58] = "\57",
 [59] = "\58",
 [60] = "\59",
 [61] = "\60",
 [62] = "\61",
 [63] = "\62",
 [64] = "\63",
 [65] = "\64",
 [66] = "\65",
 [67] = "\66",
 [68] = "\67",
 [69] = "\68",
 [70] = "\69",
 [71] = "\70",
 [72] = "\71",
 [73] = "\72",
 [74] = "\73",
 [75] = "\74",
 [76] = "\75",
 [77] = "\76",
 [78] = "\77",
 [79] = "\78",
 [80] = "\79",
 [81] = "\80",
 [82] = "\81",
 [83] = "\82",
 [84] = "\83",
 [85] = "\84",
 [86] = "\85",
 [87] = "\86",
 [88] = "\87",
 [89] = "\88",
 [90] = "\89",
 [91] = "\90",
 [92] = "\91",
 [93] = "\92",
 [94] = "\93",
 [95] = "\94",
 [96] = "\95",
 [97] = "\96",
 [98] = "\97",
 [99] = "\98",
 [100] = "\99",
 [101] = "\100",
 [102] = "\101",
 [103] = "\102",
 [104] = "\103",
 [105] = "\104",
 [106] = "\105",
 [107] = "\106",
 [108] = "\107",
 [109] = "\108",
 [110] = "\109",
 [111] = "\110",
 [112] = "\111",
 [113] = "\112",
 [114] = "\113",
 [115] = "\114",
 [116] = "\115",
 [117] = "\116",
 [118] = "\117",
 [119] = "\118",
 [120] = "\119",
 [121] = "\120",
 [122] = "\121",
 [123] = "\122",
 [124] = "\123",
 [125] = "\124",
 [126] = "\125",
 [127] = "\126",
 [128] = "\127",
 [129] = "\208\130",
 [130] = "\208\131",
 [131] = "\226\128\154",
 [132] = "\209\147",
 [133] = "\226\128\158",
 [134] = "\226\128\166",
 [135] = "\226\128\160",
 [136] = "\226\128\161",
 [137] = "\226\130\172",
 [138] = "\226\128\176",
 [139] = "\208\137",
 [140] = "\226\128\185",
 [141] = "\208\138",
 [142] = "\208\140",
 [143] = "\208\139",
 [144] = "\208\143",
 [145] = "\209\146",
 [146] = "\226\128\152",
 [147] = "\226\128\153",
 [148] = "\226\128\156",
 [149] = "\226\128\157",
 [150] = "\226\128\162",
 [151] = "\226\128\147",
 [152] = "\226\128\148",
 [153] = "",
 [154] = "\226\132\162",
 [155] = "\209\153",
 [156] = "\226\128\186",
 [157] = "\209\154",
 [158] = "\209\156",
 [159] = "\209\155",
 [160] = "\209\159",
 [161] = "\194\160",
 [162] = "\208\142",
 [163] = "\209\158",
 [164] = "\208\136",
 [165] = "\194\164",
 [166] = "\210\144",
 [167] = "\194\166",
 [168] = "\194\167",
 [169] = "\208\129",
 [170] = "\194\169",
 [171] = "\208\132",
 [172] = "\194\171",
 [173] = "\194\172",
 [174] = "\194\173",
 [175] = "\194\174",
 [176] = "\208\135",
 [177] = "\194\176",
 [178] = "\194\177",
 [179] = "\208\134",
 [180] = "\209\150",
 [181] = "\210\145",
 [182] = "\194\181",
 [183] = "\194\182",
 [184] = "\194\183",
 [185] = "\209\145",
 [186] = "\226\132\150",
 [187] = "\209\148",
 [188] = "\194\187",
 [189] = "\209\152",
 [190] = "\208\133",
 [191] = "\209\149",
 [192] = "\209\151",
 [193] = "\208\144",
 [194] = "\208\145",
 [195] = "\208\146",
 [196] = "\208\147",
 [197] = "\208\148",
 [198] = "\208\149",
 [199] = "\208\150",
 [200] = "\208\151",
 [201] = "\208\152",
 [202] = "\208\153",
 [203] = "\208\154",
 [204] = "\208\155",
 [205] = "\208\156",
 [206] = "\208\157",
 [207] = "\208\158",
 [208] = "\208\159",
 [209] = "\208\160",
 [210] = "\208\161",
 [211] = "\208\162",
 [212] = "\208\163",
 [213] = "\208\164",
 [214] = "\208\165",
 [215] = "\208\166",
 [216] = "\208\167",
 [217] = "\208\168",
 [218] = "\208\169",
 [219] = "\208\170",
 [220] = "\208\171",
 [221] = "\208\172",
 [222] = "\208\173",
 [223] = "\208\174",
 [224] = "\208\175",
 [225] = "\208\176",
 [226] = "\208\177",
 [227] = "\208\178",
 [228] = "\208\179",
 [229] = "\208\180",
 [230] = "\208\181",
 [231] = "\208\182",
 [232] = "\208\183",
 [233] = "\208\184",
 [234] = "\208\185",
 [235] = "\208\186",
 [236] = "\208\187",
 [237] = "\208\188",
 [238] = "\208\189",
 [239] = "\208\190",
 [240] = "\208\191",
 [241] = "\209\128",
 [242] = "\209\129",
 [243] = "\209\130",
 [244] = "\209\131",
 [245] = "\209\132",
 [246] = "\209\133",
 [247] = "\209\134",
 [248] = "\209\135",
 [249] = "\209\136",
 [250] = "\209\137",
 [251] = "\209\138",
 [252] = "\209\139",
 [253] = "\209\140",
 [254] = "\209\141",
 [255] = "\209\142",
 [256] = "\209\143",
}
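On the target system the table can then stand in for iconv. The following is a minimal sketch of such a lookup, assuming the layout of the table above (index = byte value + 1). The function name cp1251_to_utf8 is hypothetical, and sample_tbl is a small excerpt with two Cyrillic entries copied from the table so that the example is self-contained.

```lua
-- Small excerpt in the same layout as the full table:
-- index = windows-1251 byte value + 1, value = UTF-8 bytes.
local sample_tbl = {}
for b = 0, 127 do
  sample_tbl[b + 1] = string.char(b)   -- ASCII maps to itself
end
sample_tbl[196 + 1] = "\208\148"       -- byte 0xC4, Cyrillic "Д"
sample_tbl[224 + 1] = "\208\176"       -- byte 0xE0, Cyrillic "а"

-- Convert a windows-1251 string byte by byte through the table.
-- Bytes with no entry in the table are passed through unchanged.
local function cp1251_to_utf8(s, tbl)
  return (s:gsub(".", function(c)
    return tbl[c:byte() + 1]
  end))
end

local out = cp1251_to_utf8("\196\224!", sample_tbl)
print(#out)   -- 5 bytes: two 2-byte characters plus "!"
```

With the full table, a byte whose entry is the empty string (such as 0x98) is simply dropped from the output.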

 

posted @ 2015-07-10 00:55  lightsong