Some implementation details of the str and bytes classes in CPython 3

The official CPython repo on GitHub: https://github.com/python/cpython
The notes below are based on the latest version available at the time of writing (2019-05), 3.8.0 alpha 4.

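In Python 2, the C struct behind the str class is PyStringObject. In Python 3 that struct was renamed to PyBytesObject, and it is no longer the underlying implementation of str. The reason is that in Python 2 a string literal is bytes by default and needs a 'u' prefix to become unicode, whereas in Python 3 a string literal is unicode by default and needs a 'b' prefix to become bytes.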

#python2
s = "abc" # bytes
s = u"abc" # unicode

#python3
s = "abc" # unicode
s = b"abc" # bytes
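
To make the Python 3 side concrete, here is a minimal sketch (the variable names are my own) showing that the two literal forms produce different types and that converting between them always goes through an explicit encoding:

```python
# Python 3 only: str holds Unicode code points, bytes holds raw bytes.
text = "abc"
raw = b"abc"

print(type(text).__name__)
print(type(raw).__name__)

# Conversion is explicit and needs an encoding.
encoded = "中文".encode("utf-8")
print(len("中文"), len(encoded))   # code points vs. encoded bytes
print(encoded.decode("utf-8") == "中文")
```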

PyBytesObject

PyBytesObject is defined in Include/bytesobject.h:

#ifndef Py_LIMITED_API
typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     */
} PyBytesObject;
#endif

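ob_shash is the cached hash value: it is computed once and then reused. ob_sval is not a pointer. It is a one-element char array that marks where the byte data begins, and the data itself is stored inline, directly after the header, in the same allocation.
PyObject_VAR_HEAD is a macro defined in Include/object.h. It marks most variable-length objects (note: variable-length, which is not the same thing as mutable).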

#define PyObject_VAR_HEAD      PyVarObject ob_base;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

Here ob_size is the number of elements, and ob_base is the struct shared by every Python object, defined as follows:

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

ob_refcnt is the reference count, and ob_type points to the object's type object (not to the object itself).
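
Both header fields can be observed from Python. The following is a small sketch, assuming CPython and an ordinary freshly created object: sys.getrefcount reports the reference count (including the temporary reference held by the call itself), and type() returns what ob_type points to.

```python
import sys

obj = []                        # a fresh, ordinary object
before = sys.getrefcount(obj)   # includes the call's own temporary reference
alias = obj                     # bind one more name to the same object
after = sys.getrefcount(obj)

print(after - before)           # the extra name adds one reference
print(type(obj) is list)        # ob_type is exposed as type(obj)
print(type(b"abc") is bytes)
```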

At this point the overall layout of the struct is fairly clear and easy to follow.
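
The inline, variable-length layout can be checked from Python with sys.getsizeof. This is only a sketch, assuming CPython: it shows that every extra byte of payload grows the object by exactly one byte, which is what storing the data directly behind the header implies.

```python
import sys

# bytes is a variable-length type: a fixed header plus one byte per item.
print(bytes.__itemsize__)

empty = sys.getsizeof(b"")
for payload in (b"a", b"abc", b"x" * 100):
    # Growth over the empty object equals the payload length.
    print(len(payload), sys.getsizeof(payload) - empty)

# The fixed part already includes the trailing NUL required by the
# invariant ob_sval[ob_size] == 0.
print(bytes.__basicsize__, empty)
```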

The functions that operate on PyBytesObject are defined in Objects/bytesobject.c. A few of them are picked out for analysis here.

The first is PyBytes_FromString, which creates a PyBytesObject from a char*:

PyObject *
PyBytes_FromString(const char *str)
{
    size_t size;
    PyBytesObject *op;

    assert(str != NULL);
    size = strlen(str);
    // Check whether the incoming string is too long
    if (size > PY_SSIZE_T_MAX - PyBytesObject_SIZE) {
        PyErr_SetString(PyExc_OverflowError,
            "byte string is too long");
        return NULL;
    }

    // If the length is 0, return the shared nullstring object
    if (size == 0 && (op = nullstring) != NULL) {
#ifdef COUNT_ALLOCS
        _Py_null_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }

    // characters is a cache of single-byte objects; if the length is 1, return the cached object
    if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
#ifdef COUNT_ALLOCS
        _Py_one_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }

    /* Inline PyObject_NewVar */
    // Allocate PyBytesObject_SIZE+size bytes; note that op->ob_sval then has room for size+1 bytes
    op = (PyBytesObject *)PyObject_MALLOC(PyBytesObject_SIZE + size);
    if (op == NULL)
        return PyErr_NoMemory();
    (void)PyObject_INIT_VAR(op, &PyBytes_Type, size);
    op->ob_shash = -1;
    // Copy the incoming str (including its trailing NUL) into the op struct
    memcpy(op->ob_sval, str, size+1);
    /* share short strings */
    if (size == 0) {
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}
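
The two caches in this function can be observed by calling it through ctypes. This is a hedged sketch, assuming a CPython build where ctypes.pythonapi exposes PyBytes_FromString; the variable names are my own.

```python
import ctypes

# Assumption: CPython, where ctypes.pythonapi gives access to the C API.
from_string = ctypes.pythonapi.PyBytes_FromString
from_string.argtypes = [ctypes.c_char_p]
from_string.restype = ctypes.py_object

one_a = from_string(b"a")
one_b = from_string(b"a")
long_a = from_string(b"abc")
long_b = from_string(b"abc")

print(one_a is one_b)     # single-byte results come from the shared cache
print(long_a is long_b)   # longer inputs get a fresh allocation per call
print(long_a == long_b == b"abc")
```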

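As far as I can tell from searching the repo, PyBytesObject has no interning mechanism (that is, no general cache) beyond the empty and single-byte objects shared in the function above, and a quick test confirmed this.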
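
A quick check along these lines might look like the following sketch (not necessarily the test the author ran): sys.intern refuses bytes, and two equal bytes objects built at run time stay separate objects.

```python
import sys

# sys.intern only accepts str; bytes objects are rejected with TypeError.
try:
    sys.intern(b"abc")
    rejected = False
except TypeError:
    rejected = True
print(rejected)

# Two equal bytes objects built at run time are distinct objects.
first = bytes(bytearray(b"hello world"))
second = bytes(bytearray(b"hello world"))
print(first == second, first is second)
```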

PyUnicodeObject

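PyUnicodeObject is mainly declared in Include/unicodeobject.h and implemented in Objects/unicodeobject.c.
Python 3.3 introduced the current variable-width Unicode representation. The structs are defined roughly as follows (see reference [3] below):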

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
  PyCompactUnicodeObject _base;
  union {
      void *any;
      Py_UCS1 *latin1;
      Py_UCS2 *ucs2;
      Py_UCS4 *ucs4;
  } data;
} PyUnicodeObject;

As the union shows, PyUnicodeObject supports character widths of 1 byte, 2 bytes and 4 bytes.
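
The three widths show up in sys.getsizeof. The sketch below, assuming CPython 3.3 or later, measures how much a string grows when one more character is appended; bytes_per_char is a helper name of my own.

```python
import sys

def bytes_per_char(ch):
    """Return how much sys.getsizeof grows when one more `ch` is appended."""
    return sys.getsizeof(ch * 11) - sys.getsizeof(ch * 10)

print(bytes_per_char("a"))            # ASCII
print(bytes_per_char("\u00e9"))       # Latin-1 range
print(bytes_per_char("\u4e2d"))       # needs the 2-byte representation
print(bytes_per_char("\U0001F600"))   # needs the 4-byte representation
```
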
For PyUnicodeObject, the function most worth looking at is PyUnicode_InternInPlace, which performs the interning:

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    // If the interned dict does not exist yet, create it
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION

    // Note: s is stored in interned as both the key and the value
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    // s is now stored in interned as key and value, so subtract those two extra references; they must not keep the object alive
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}
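
The core of the function is the PyDict_SetDefault(interned, s, s) call. Below is a hypothetical pure-Python model of that step; `interned` and `intern_in_place` are illustrative names, not CPython APIs, and the reference-count adjustment has no equivalent here.

```python
# Hypothetical model of the interning dict; not CPython's real table.
interned = {}

def intern_in_place(s):
    """Return the canonical object for s, registering s if it is the first."""
    if type(s) is not str:        # mirrors the PyUnicode_CheckExact guard
        return s
    # setdefault stores s as both key and value when no equal key exists,
    # otherwise it returns the object that was stored earlier.
    return interned.setdefault(s, s)

a = "".join(["wtf", "!"])   # two equal strings built at run time,
b = "".join(["wtf", "!"])   # so they start out as distinct objects
distinct_before = a is not b

a = intern_in_place(a)
b = intern_in_place(b)
print(distinct_before, a is b)
```

After the second call, b refers to the object that was registered first, which corresponds to the `t != s` branch in the C code.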

The official documentation for PyUnicodeObject includes this passage:

In almost all cases, they shouldn’t be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.

In other words, most code in the CPython source that operates on a PyUnicodeObject receives it as a PyObject pointer.

The interning mechanism in Python

Put plainly, interning means caching objects such as str. Because they are immutable, objects with the same value can safely refer to the same block of memory.
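
From Python code, interning can be requested explicitly with sys.intern. A minimal sketch (the variable names are my own):

```python
import sys

# Built at run time, so the two equal strings are separate objects.
left = "".join(["inter", "ning!"])
right = "".join(["inter", "ning!"])
separate_before = left is not right

# After interning, both names refer to the single cached object.
left = sys.intern(left)
right = sys.intern(right)
print(separate_before, left is right)
```
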
Reference [1] below describes several interesting behaviors, which I summarize here.

1 For str, only strings made up entirely of common characters are interned. The common characters are a-z, A-Z, 0-9 and the underscore. For example:

a="wtf"
b="wtf"
a is b # True
a="wtf!"
b="wtf!"
a is b # False

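2 However, if strings containing other characters are assigned on the same line, they do end up as the same object, because the interpreter recognizes that the two have the same value: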

a,b="wtf!","wtf!"
a is b # True

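3 Within a single .py file, str interning works in almost all cases, because the interpreter processes the whole file in one pass before running it.
4 However, strings that have to be computed are not interned once they exceed a certain length, because forcing it could bloat the compiled .pyc file. (This point is stated in terms of files, but it applies to the interactive interpreter as well. The exact threshold depends on the CPython version.)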

a="c"*20
a is "cccccccccccccccccccc"  # True
a="c"*21
a is "ccccccccccccccccccccc" # False
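
Points 1 to 3 depend on what gets compiled together, which is hard to see in an interactive session. The sketch below makes the compilation unit explicit with compile() and exec(); `run` is a helper name of my own, and the results are CPython implementation details rather than language guarantees.

```python
def run(source):
    """Compile and execute `source` as its own compilation unit."""
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace

# Identifier-like constants are interned, so separate units share one object.
first = run('s = "wtf"')["s"]
second = run('s = "wtf"')["s"]
print(first is second)

# Equal constants inside a single unit are stored once, even with "!".
same_unit = run('a = "wtf!"; b = "wtf!"')
print(same_unit["a"] is same_unit["b"])

# Across separate units the result for "wtf!" varies by version and build,
# so it is printed here without claiming a particular answer.
print(run('s = "wtf!"')["s"] is run('s = "wtf!"')["s"])
```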

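References:
[1] do you really think you know strings in python?
[2] How Python string objects are implemented (this article mainly covers Python 2)
[3] Official Python PEP: since 3.3, Unicode strings are stored in a flexible-width object
[4] Official Python 3 notes on the changes to strings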

posted on 2019-05-24 10:51 daghlny
