Base-40 包装字符串(字符串压缩)

看一好文，译之，权当温故而知新了。理解错误及笔误请见谅。

原文参考：http://drdobbs.com/blogs/embedded-systems/229400732

可能不是完全正确，我认为，嵌入式系统编程与桌面级或server系统编程最大的不同就是：你花费了大量时间来优化嵌入式软件的大小。当我在2K的EEPROM上运行我的嵌入式程序时，而我的桌面电脑具有几个GB的内存。最近，我使用了一个嵌入式系统，其具有一个ARM处理器，运行Linux，具有一个LCD触摸屏。

这么小的嵌入式设备是没有巨大的硬盘和内存的，因此空间仍然昂贵。你可以随时优化你的C代码，当今的编译器优化以及做的非常好了。不过，我发现，所有这些用户接口，往往包含大量的字符串，文本消息等给出帮助信息。以及文本消息用于日志。而文本的空间效率较低。

以前，我们习惯于使用非常简单的压缩格式，节省有限的磁带空间。这也是促使BASE-40打包——这种压缩技术出现的原因，而且这个技术非常易于在嵌入式系统中实现。这个算法假定你的硬件支持整数除法就OK了，当然，设备也必须支持（整数）乘法（这基本是必定的）。这些基本条件是被硬件支持的，因此算法非常简单，可以非常快的执行。

压缩并不一定是激动人心的，但对于很多字符串，压缩可能很显著。BASE-40算法总能将3个字符串压缩成2byte。因此33%的压缩率还是不错的（没有免费的午餐）。这种压缩的代价就是你只能编码40个字符（本文会解决这个问题，使之无限制）。对某些应用程序（实际上很多）这已经足够了。让我们了解这个基本算法。

当你有一个16进制数，它有3个数字，假定这个数字代号是count，值就是123（hex）。它的意思就是1*256+2*16+3.更确切的说，它是1乘以16的平方+2乘以16的1次方+3乘以16的0次方。这种编码方法演示了如何基于BASE40编码。我们将形成1个三位数，使用BASE40，每个数字值为0—39.

40这个值不是随意选的。如果你仅仅是为了分隔一个16bit的数字，你最多只能分隔成3个5bit的数字（和额外的1个bit）。5bit你只能表达32个符号。因此，简单的打包不能在相同的空间中得到多于32个符号。使用BASE40，最大的数字就是39*1600+39*40+39=63999，（非常接近16bit的字符空间65535，译者按）。

如果你只使用全大写或全小写字符，40个字符也不错。你可以存储26个字母，10个数字，还有4个空间留给space，new line等等。不幸的是，当今很多系统需要大小写混写，需要更多其他符号。要解决这个问题，你可以借用老式的电传打字机。假定大写字母和标点符号少见，那么，你可以定义一个切换键（shift）选择第二个字符集。为了使这个事情简单，我通常仅是shift对下一字符有效。但是，如果你更吝啬，你可以使shift一直应用到遇到unshift字符。怎样工作的最好，取决于你要编码的字符串。

作为一个例子，考虑下面的头文件，定义了一个字符集，具有2个cases：

   1:  #ifndef __CTABLE_H

2:

   3:  #define __CTABLE_H

4:

   5:  char ctable[2][39]=

6:

   7:  {

8:

   9:  {

10:

  11:  '\0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','0','1','2','3','4','5','6','7','8','9',' ','\n'

12:

  13:  },

14:

  15:  {

16:

  17:  '\0','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',')','!','@','#',',','.','?','/','*',')','<','>'

18:

  19:  }

20:

  21:  };

22:

  23:  #endif

24:

上面的代码中有很多字符。我有意的浪费了空间，在子数组的第一个位置放0。这只是为了使调试容易点。此外，没有任何理由“upper”case不能使用一个额外的字符（即2个shift实际就是1个字符）。我只是没那么做。

上面的编码很草率，因为你完全可以在PC上执行输出，并将输出结果用于嵌入式系统的文本定义（例如：C的hex word数组，或assembler的DW指令）。下面是一个简单的编码器：

   1:  /* Encode strings as base 40 triplets.

   2:     Al Williams DDJ

   3:  */

4:

   5:  #include <stdio.h>

6:

   7:  // The character table is a 2x38 array. The 2 arrays are for

   8:  // the shift state and by convention 0 is always 0

   9:  // code 39 (not in the array which tops out at index 38)

  10:  // causes the shift to be set for the next character

11:

  12:  #include "ctable.h"   // needs to be the same as decoder!

13:

14:

  15:  #define ZEROCHAR '~'  // use this to represent zero in input

  16:  // what to use for unknown symbols -- note must not be shifted!

  17:  #define UNKNOWN_SYM 37  // space in the default encoding

18:

  19:  int pending=0;  // encode needs to send more than one symbol

20:

21:

  22:  // return a symbol (or symbols, see pending) for a given character

  23:  int encode(int c)

  24:  {

  25:    int i;

  26:    static int shifted=-1;

  27:  // if we are mid shift, just send the pending symbol

  28:    if (shifted!=-1)

  29:      {

  30:        i=shifted;

  31:        shifted=-1;

  32:        pending=0;

  33:        return i;

  34:      }

  35:    // make it easy for the user to enter a zero

  36:    if (c==ZEROCHAR) c='\0';

  37:    // simple search for the symbol -- run time

  38:    // performance isn't a big deal here because

  39:    // we will just run this on the PC at design time

  40:    for (i=0;i<39;i++)

  41:      {

  42:        if (ctable[0][i]==c)

  43:          {

  44:            pending=0;

  45:            return i;  // b40 symbol

  46:          }

  47:        if (ctable[1][i]==c)

  48:          {

  49:            pending=1;

  50:            shifted=i;

  51:            return 39;    // shift code

  52:          }

  53:      }

  54:    // use 0 for any unmatched character

  55:    // you might prefer using your symbol for '.' or '?'

  56:    // and you can set that in the #define

  57:    fprintf(stderr,"Warning character %c (%x) unmatched\n",c,c);

  58:    pending=0;

  59:    return UNKNOWN_SYM;  // unmatched character

  60:  }

61:

  62:  // Emit a 16 bit quantity

  63:  // in the d array (d[2] is first digit, d[0] is last digit)

  64:    void emit(int *d)

  65:    {

  66:      unsigned int v;

  67:      v=d[2]*1600+d[1]*40+d[0];

  68:      printf("%04X,",v);

  69:    }

70:

71:

72:

  73:  int main(int argc, char *argv[])

  74:  {

  75:    int d[3];  // current B40 "digits"

  76:    int cp=3;  // pointer in d array

  77:    int c;     // current character

78:

79:

  80:    // Note you have to press ^D twice if not at the start of a line!

  81:    while ((c=getchar())!=EOF)

  82:      {

  83:        int cval;

  84:        pending=1;   // make sure we call encode once

  85:        while (pending)  // while encode has something to say

  86:          {

  87:            int cenc;

  88:            cenc=encode(c);  // could be index (0-37) or shiftcode

  89:            d[--cp]=cenc;

  90:            // every 3 characters we have to emit something

  91:            if (cp==0)

  92:              {

  93:                emit(d);

  94:                cp=3;  // reset d array

  95:              }

  96:          }

97:

  98:      }

99:

 100:    // flush any left overs

 101:    if (cp!=3)

 102:      {

 103:      while (cp!=-1) d[--cp]=0;

 104:      emit(d);  // write that last word

 105:      }

 106:    // done!

 107:    return 0;

 108:

 109:  }

 110:

唯一棘手的部分是处理文件末尾。否则，它只是一个简单的查找表中找到三个“数字”，构建一个16bit的word。Decoder非常简单，你甚至可以手工编写汇编实现.

   1:  /* Decode strings from base 40 triplets.

   2:     Al Williams DDJ

   3:  */

4:

   5:  #define TESTMAIN 1   // build test harness?

6:

7:

   8:  #include <stdio.h>

9:

  10:  // The character table is a 2x38 array. The 2 arrays are for

  11:  // the shift state and by convention 0 is always 0

  12:  // code 39 (not in the array which tops out at index 38)

  13:  // causes the shift to be set for the next character

  14:  #include "ctable.h"   // needs to be the same as decoder!

15:

  16:  // Decode a triplet into out[start] and beyond

  17:  // since shift characters don't create characters you could have

  18:  // from 1 to 3 characters in out (since [shift]X[shift] is legal)

  19:  // so the return value is how many characters were set

  20:  int decode(int in,char *out, int start)

  21:  {

  22:    static shiftstate=0;

  23:    int count=3;  // assume we will emit 3 characters

24:

  25:    unsigned int tmp;

  26:    // get first character

  27:    tmp=in/1600;

  28:    if (tmp==39)  // test for shift

  29:      {

  30:      shiftstate=1;

  31:      count--;  // won't count as output

  32:      }

  33:    else

  34:      {

  35:        // real character, so look it up and reset shift

  36:        // which may not have been set anyway

  37:       out[start++]=ctable[shiftstate][tmp];

  38:       shiftstate=0;

  39:      }

  40:    // get 2nd character and repeat logic

  41:    // keep in mind that X*1600 => X*1024+X*512+X*64

  42:    // so this could be optimized if necessary

  43:    tmp=(in-tmp*1600)/40;

  44:    if (tmp==39)

  45:      {

  46:        shiftstate=1;

  47:        count--;

  48:      }

  49:    else

  50:      {

  51:      out[start++]=ctable[shiftstate][tmp];

  52:      shiftstate=0;

  53:      }

  54:    // get 3rd character.. if you don't have a mod

  55:    // function you could say tmp=(in-tmp*40)

  56:    // don't forget that x*40 = x*32+x*8 so you could

  57:    // make * 40 very efficient if you needed to optimize

  58:    tmp=in%40;

  59:    if (tmp==39)

  60:      {

  61:        shiftstate=1;

  62:        count--;

  63:      }

  64:    else

  65:      {

  66:        out[start++]=ctable[shiftstate][tmp];

  67:        shiftstate=0;

  68:      }

  69:    return count;   // return count

  70:  }

71:

  72:  #if TESTMAIN==1

  73:  int main(int argc, char *argv[])

  74:  {

  75:    int v,c;

  76:    char buf[4];

  77:    // grab a triplet

  78:    while (scanf("%x,",&v)>0)

  79:      {

  80:        // decode it

  81:        c=decode(v,buf,0);

  82:        buf[c]='\0';

  83:        // for us printing it is fine

  84:        printf("%s",buf);

  85:      }

  86:    printf("\n");

  87:  }

88:

  89:  #endif

90:

当然，上面的解决方案仅当你的字符串多过look-table和上面的解码器代码才有效。你可以进一步优化上面代码中的乘法，换成移位与加法（见注释）。

虽然这是一个古老的技术，但它很很适合简单的CPU，因为那些地方空间是有限的。转换后的代码节省了33%的空间。当然，你可以进一步优化，尝试不同的look-table。例如，你的字符串中包含大量的q，很少有大写的S，那么你可以将S编码到lower case中，并将q表达为一个shifted character。

这仅仅是节省空间的一种简单武器。希望能留言，告知你们节省空间的办法。

posted on 2011-04-03 21:24 海天一鸥阅读(827) 评论(0) 编辑收藏举报

刷新页面返回顶部

海鸥航迹

学习之笔记，好文之收集。

导航

公告

Base-40 包装字符串(字符串压缩)