lua数据类型 - string

源码版本：lua 5.4.3

一、类型 TString 概述

lua字符串与java，python等语言的字符串不一样，后两者有字符型字符串这个概念，字符型字符串在内存中一般都是以unicode码的形式存在。

lua的字符串是以字节码的形式存在的。例如在代码文件编码是utf-8，那么字符串常量加载后以utf-8编码的字节码形式保存。

这种机制使得lua的字符串可以用char数组保存，但是要获取字符串长度就不容易了。

lua的字符串内部分为两种类型——短字符串（LUA_VSHRSTR）及长字符串（LUA_VLNGSTR），对外部的统一类型为LUA_TSTRING。

以下是字符串的结构体：

typedef struct TString {
  CommonHeader;
  lu_byte extra;  /* reserved words for short strings; "has hash" for longs */
  lu_byte shrlen;  /* length for short strings */
  unsigned int hash;
  union {
    size_t lnglen;  /* length for long strings */
    struct TString *hnext;  /* linked list for hash table */
  } u;
  char contents[1];
} TString;

1.变量的作用

contents：在老版本的lua中并没有这个字段。老版本的字符串内容指针是紧跟在结构体TString后面的，获取字符串内容的指针方法如下：

#define getstr(ts)  \
  check_exp(sizeof((ts)->extra), cast_charp((ts)) + sizeof(TString))

cast_charp 将 ts 转换为 char * 类型，加上 sizeof(TString) 将指针移到 TString 结构后面。

而在现在的版本中，该方法为：

#define getstr(ts)  ((ts)->contents)

由数组名与指针的相似性，推测为了代码的可读性，增加的contents字段代表字符串内容指针。

u：对于长字符串，用于记录长度；对于短字符串，用于记录在字符串常量池的哈希表中，哈希冲突时延伸的链表的下一个字符串。

hash：哈希值

shrlen：短字符串长度，由于短字符串长度一般最大是40（后面提到），8位足矣。

extra：对于短字符串 extra 用来记录这个字符串是否为保留字，这个标记用于词法分析器对保留字的快速判断；对于长字符串，可以用于惰性求哈希值，==0表示还没有哈希（见字符串哈希小节）。

CommonHeader：可回收对象的公共结构，结构如下：

#define CommonHeader    struct GCObject *next; lu_byte tt; lu_byte marked

2.关于长短字符串类型的区分

lua的弱数据类型用类型 TValue 表示，一个变量用一个TValue储存。TValue结构如下：

#define TValuefields    Value value_; lu_byte tt_

typedef struct TValue {
  TValuefields;
} TValue;

其中 tt_ 用以表述类型，用一个byte表示。其中0-3位表示大类型LUA_TSTRING，4-5位表示小类型LUA_VSHRSTR及LUA_VLNGSTR。

LUA_VSHRSTR及LUA_VLNGSTR的值分别为：

#define makevariant(t,v)    ((t) | ((v) << 4))

// ...

//构造小类型
#define LUA_VSHRSTR    makevariant(LUA_TSTRING, 0)  /* short strings */
#define LUA_VLNGSTR    makevariant(LUA_TSTRING, 1)  /* long strings */

默认<= 40 byte的字符串是短字符串

#define LUAI_MAXSHORTLEN    40

二、字符串的哈希

字符串的哈希分为短哈希与长哈希两个方法：

unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ cast_uint(l);
  for (; l > 0; l--)
    h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
  return h;
}

unsigned int luaS_hashlongstr (TString *ts) {
  lua_assert(ts->tt == LUA_VLNGSTR);
  if (ts->extra == 0) {  /* no hash? */
    size_t len = ts->u.lnglen;
    ts->hash = luaS_hash(getstr(ts), len, ts->hash);
    ts->extra = 1;  /* now it has its hash */
  }
  return ts->hash;
}

长字符串的哈希只有table的mainposition方法用到。老版本的lua对长字符串哈希使用跳级计算，新版本是逐个字符计算，是为了提高安全性？总之，太长的字符串不适宜用作table的key。

关于短字符串哈希，seed有不用来源，有lua state保存的全局随机种子（见四小节字符串创建），也有用时间函数构造的种子。

长字符串之所以能用hash做种子，是因为创建时将hash设置为全局随机种子了。

三、字符串的比较

字符串的比较也有两个版本。因为短串和长串可以用类型区分，新版本lua没有提供字符串比较的泛用方法。下面是同类型字符串的比较方法：

#define eqshrstr(a,b)    check_exp((a)->tt == LUA_VSHRSTR, (a) == (b))

短字符串会放入字符串常量池中，因此短串在内存中总是只有一份，直接比较地址即可。

int luaS_eqlngstr (TString *a, TString *b) {
  size_t len = a->u.lnglen;
  lua_assert(a->tt == LUA_VLNGSTR && b->tt == LUA_VLNGSTR);
  return (a == b) ||  /* same instance or... */
    ((len == b->u.lnglen) &&  /* equal length and ... */
     (memcmp(getstr(a), getstr(b), len) == 0));  /* equal contents */
}

长串的判断顺序依次是：是否同一个对象，长度是否一致，最后用memcmp函数进行逐个字符比较。

四、字符串的创建

new新的字符串代码如下：

TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
  if (l <= LUAI_MAXSHORTLEN)  /* short string? */
    return internshrstr(L, str, l);
  else {
    TString *ts;
    if (l_unlikely(l >= (MAX_SIZE - sizeof(TString))/sizeof(char)))
      luaM_toobig(L);
    ts = luaS_createlngstrobj(L, l);
    memcpy(getstr(ts), str, l * sizeof(char));
    return ts;
  }
}

可以看到短串和长串的创建是分开处理的。下面分开讨论。

1.短字符串的内部化

上面提到，短字符串会放入字符串常量池，在内存中总是只有一份。内部化指这个过程。

字符串常量池在lua state中，结构为：

typedef struct stringtable {
  TString **hash;
  int nuse;  /* number of elements */
  int size;
} stringtable;

**hash是一个开散列表，nuse表示字符串数量，size表示散列表数组的长度。

短串的new方法如下：

 1 static TString *internshrstr (lua_State *L, const char *str, size_t l) {
 2   TString *ts;
 3   global_State *g = G(L);
 4   stringtable *tb = &g->strt;
 5   unsigned int h = luaS_hash(str, l, g->seed);
 6   TString **list = &tb->hash[lmod(h, tb->size)];
 7   lua_assert(str != NULL);  /* otherwise 'memcmp'/'memcpy' are undefined */
 8   for (ts = *list; ts != NULL; ts = ts->u.hnext) {
 9     if (l == ts->shrlen && (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
10       /* found! */
11       if (isdead(g, ts))  /* dead (but not collected yet)? */
12         changewhite(ts);  /* resurrect it */
13       return ts;
14     }
15   }
16   /* else must create a new string */
17   if (tb->nuse >= tb->size) {  /* need to grow string table? */
18     growstrtab(L, tb);
19     list = &tb->hash[lmod(h, tb->size)];  /* rehash with new size */
20   }
21   ts = createstrobj(L, l, LUA_VSHRSTR, h);
22   memcpy(getstr(ts), str, l * sizeof(char));
23   ts->shrlen = cast_byte(l);
24   ts->u.hnext = *list;
25   *list = ts;
26   tb->nuse++;
27   return ts;
28 }

字符串常量池strt维护一个开散列表，在6行hash到相应的链表首元素后，在8行遍历链表找字符串。

11,12行跟gc有关，对已经标记为死亡的字符串对象，需要重新标记颜色。有关gc的三色标记和回收相关不在这里详述。

17行，当哈希表中字符串的数量nuse超过预定容量size时，可以预计hash冲突必然发生。这时候需要对strt进行resize操作扩容。

createstrobj()用于创建新的字符串，这个后面再说。

24-16行，新字符串插入到链表头部，然后nuse自增。

下面是strt的resize操作方法 growstrtab()：

static void growstrtab (lua_State *L, stringtable *tb) {
  if (l_unlikely(tb->nuse == MAX_INT)) {  /* too many strings? */
    luaC_fullgc(L, 1);  /* try to free some... */
    if (tb->nuse == MAX_INT)  /* still too many? */
      luaM_error(L);  /* cannot even create a message... */
  }
  if (tb->size <= MAXSTRTB / 2)  /* can grow string table? */
    luaS_resize(L, tb->size * 2);
}

这个方法主要是进行一些容量检查，最后一步将哈希表的数组扩展为2倍，再看看luaS_resize()：

 1 static void tablerehash (TString **vect, int osize, int nsize) {
 2   int i;
 3   for (i = osize; i < nsize; i++)  /* clear new elements */
 4     vect[i] = NULL;
 5   for (i = 0; i < osize; i++) {  /* rehash old part of the array */
 6     TString *p = vect[i];
 7     vect[i] = NULL;
 8     while (p) {  /* for each string in the list */
 9       TString *hnext = p->u.hnext;  /* save next */
10       unsigned int h = lmod(p->hash, nsize);  /* new position */
11       p->u.hnext = vect[h];  /* chain it into array */
12       vect[h] = p;
13       p = hnext;
14     }
15   }
16 }
17 
18 void luaS_resize (lua_State *L, int nsize) {
19   stringtable *tb = &G(L)->strt;
20   int osize = tb->size;
21   TString **newvect;
22   if (nsize < osize)  /* shrinking table? */
23     tablerehash(tb->hash, osize, nsize);  /* depopulate shrinking part */
24   newvect = luaM_reallocvector(L, tb->hash, osize, nsize, TString*);
25   if (l_unlikely(newvect == NULL)) {  /* reallocation failed? */
26     if (nsize < osize)  /* was it shrinking table? */
27       tablerehash(tb->hash, nsize, osize);  /* restore to original size */
28     /* leave table as it was */
29   }
30   else {  /* allocation succeeded */
31     tb->hash = newvect;
32     tb->size = nsize;
33     if (nsize > osize)
34       tablerehash(newvect, osize, nsize);  /* rehash for new size */
35   }
36 }

24行 luaM_reallocvector() 用于分配新的空间，这里先不深入。关键是34行tablerehash方法。

tablerehash()对数组扩容前的部分中的每个链表进行遍历，对每个String重新hash，并插入到正确的数组下标链表中。

这种方法在遍历过程中不可避免地对某些字符串重复计算，而且看起来很不直观。可能是为了避免额外的内存开销。

2.长字符串的创建

现在，回到luaS_newlstr()中，看下长字符串的创建方法luaS_createlngstrobj()：

TString *luaS_createlngstrobj (lua_State *L, size_t l) {
  TString *ts = createstrobj(L, l, LUA_VLNGSTR, G(L)->seed);
  ts->u.lnglen = l;
  return ts;
}

和短串的创建方法一致，都是createstrobj()。在创建后，将长串的长度赋值到u.lnglen。

3.string obj的创建方法createstrobj()

 1 static TString *createstrobj (lua_State *L, size_t l, int tag, unsigned int h) {
 2   TString *ts;
 3   GCObject *o;
 4   size_t totalsize;  /* total size of TString object */
 5   totalsize = sizelstring(l);
 6   o = luaC_newobj(L, tag, totalsize);
 7   ts = gco2ts(o);
 8   ts->hash = h;
 9   ts->extra = 0;
10   getstr(ts)[l] = '\0';  /* ending 0 */
11   return ts;
12 }

5行的代码如下：

#define sizelstring(l)  (offsetof(TString, contents) + ((l) + 1) * sizeof(char))

长度等于TString结构的大小，加上字符串长度再加1，因为10行在字符串最后加入了'\0'

6行创建了一个可回收对象，这里不深入。

7行是一个检测。

8行长短串的处理不一样，短串传入真正的hash值，长串传入的是全局随机种子。

9行将extra复制为0，对于长串，表示未计算hash；对于短串，表示不是系统保留字。系统保留字的创建在llex.c文件中：

 1 /* ORDER RESERVED */
 2 static const char *const luaX_tokens [] = {
 3     "and", "break", "do", "else", "elseif",
 4     "end", "false", "for", "function", "goto", "if",
 5     "in", "local", "nil", "not", "or", "repeat",
 6     "return", "then", "true", "until", "while",
 7     "//", "..", "...", "==", ">=", "<=", "~=",
 8     "<<", ">>", "::", "<eof>",
 9     "<number>", "<integer>", "<name>", "<string>"
10 };
11 
12 void luaX_init (lua_State *L) {
13   int i;
14   TString *e = luaS_newliteral(L, LUA_ENV);  /* create env name */
15   luaC_fix(L, obj2gco(e));  /* never collect this name */
16   for (i=0; i<NUM_RESERVED; i++) {
17     TString *ts = luaS_new(L, luaX_tokens[i]);
18     luaC_fix(L, obj2gco(ts));  /* reserved words are never collected */
19     ts->extra = cast_byte(i+1);  /* reserved word */
20   }
21 }

14行luaS_newliteral()为：

#define luaS_newliteral(L, s)    (luaS_newlstr(L, "" s, \
                                 (sizeof(s)/sizeof(char))-1))

由luaX_init()的19行，可以看到系统保留字都是事先创建好，然后赋值extra为相应的正整数。

至此，创建字符串对象完成。

四、字符串的缓存

lua为luaS_newlstr()做了一层缓存。即在创建字符串前，先从缓存中查找。这个缓存也保存在lua state中。

TString *strcache[STRCACHE_N][STRCACHE_M];  /* cache for strings in API */

其中STRCACHE_N = 53，STRCACHE_M = 2。

下面是luaS_newlstr()的上级函数：

TString *luaS_new (lua_State *L, const char *str) {
  unsigned int i = point2uint(str) % STRCACHE_N;  /* hash */
  int j;
  TString **p = G(L)->strcache[i];
  for (j = 0; j < STRCACHE_M; j++) {
    if (strcmp(str, getstr(p[j])) == 0)  /* hit? */
      return p[j];  /* that is it */
  }
  /* normal route */
  for (j = STRCACHE_M - 1; j > 0; j--)
    p[j] = p[j - 1];  /* move out last element */
  /* new element is first in the list */
  p[0] = luaS_newlstr(L, str, strlen(str));
  return p[0];
}

L->strcache也是一个哈希表，里面每个元素是一个数组，保存着 STRCACHE_M 个字符串。用strcmp做字符串命中判断。每次有新的字符串，就把数组最老的缓存字符串替换掉。

四、字符串的常用扩展方法

1.获取字符型长度

lua的字符串都是按char保存的字节码，但实际业务需求中，很多时候需要用到字符型长度。以下是utf-8编码的string的字符型长度求法：

local _utf8Num = {0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc}
local _notUTF8Num = {}
for _, v in ipairs(_utf8Num) do
    table.insert(_notUTF8Num, bit.bnot(v))
end

function string.utf8len(input)
    local len  = string.len(input)
    local left = len
    local cnt  = 0
    while left ~= 0 do
        local tmp = string.byte(input, -left)
        local i = #_utf8Num
        while true do
            if tmp >= _utf8Num[i] then
                left = left - i
                break
            end
            i = i - 1
        end
        cnt = cnt + 1
    end
    return cnt
end

这里主要是利用了在utf-8编码中，字符的第一个字节阐述了该字符占据多少个字节。

posted @ 2021-04-12 01:40 枫林晚月阅读(523) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

枫林晚月

顺天，应时，依人，以利胜