How Python Dictionaries Are Implemented
Introduction
Python's dictionary is implemented on top of a hash table. For a detailed introduction to hash tables as a data structure, see this blog's article 数据结构之哈希表. This article explains how the hash table is concretely implemented in Python; another article on this blog, Redis数据类型:字典, covers the hash table implementation in Redis, and the two make an interesting comparison.
As that hash-table article points out toward the end, implementing a hash table means addressing three concerns:
- designing a suitable hash function
- choosing a suitable collision-resolution strategy
- defining a load-factor threshold and designing a dynamic resizing strategy
This article therefore focuses on how Python makes and implements these three choices. Beyond that, it also covers the low-level implementation of operations such as dictionary creation and element lookup, as well as a favorite Python trick: the dictionary object free list (caching pool).
The dictionary implementation
Definition of a dictionary entry
A single dictionary element (a key-value pair) is called an entry. It is defined as follows:
// [dictobject.h]
typedef struct {
    /* Cached hash code of me_key. Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;
The me_hash field caches the hash of the key me_key, so it does not have to be recomputed on every lookup. me_key and me_value are the key and value of the key-value pair. Note that both are of type PyObject *, which is why a Python dictionary can store objects of any type.
An entry is in one of three states:
- Unused: both me_key and me_value are NULL; the entry holds nothing.
- Active: both me_key and me_value are non-NULL; the entry holds a key-value pair.
- Dummy: when an Active entry is deleted, it moves to the Dummy state. me_key then points to a global dummy object and me_value is NULL. Why have a Dummy state at all? It follows from Python's collision-resolution strategy: with open addressing, an entry cannot simply be erased, so deletion is done by marking, a "pseudo-deletion". Collision resolution is discussed in more detail below.
State transition diagram: Unused -> Active (on insert); Active -> Dummy (on pseudo-delete); Dummy -> Active (when the slot is reused).
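The three states and the pseudo-deletion transition can be modeled with a short Python sketch. The Entry class and DUMMY sentinel here are illustrative stand-ins, not CPython API:

```python
DUMMY = object()  # stand-in for CPython's global dummy key object

class Entry:
    def __init__(self):
        self.me_key = None     # Unused: key and value both NULL/None
        self.me_value = None

    @property
    def state(self):
        if self.me_key is None:
            return "Unused"
        if self.me_key is DUMMY:
            return "Dummy"
        return "Active"

    def set(self, key, value):       # Unused/Dummy -> Active
        self.me_key, self.me_value = key, value

    def pseudo_delete(self):         # Active -> Dummy ("mark, don't erase")
        self.me_key, self.me_value = DUMMY, None

e = Entry()
assert e.state == "Unused"
e.set("spam", 1)
assert e.state == "Active"
e.pseudo_delete()
assert e.state == "Dummy"
```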
Definition of the dictionary object
// [dictobject.h]
typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */

    /* The table contains ma_mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t ma_mask;

    /* ma_table points to ma_smalltable for small tables, else to
     * additional malloc'ed memory. ma_table is never NULL! This rule
     * saves repeated runtime null-tests in the workhorse getitem and
     * setitem calls.
     */
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
- ma_fill: the number of entries in the Active or Dummy state.
- ma_used: the number of entries in the Active state.
- ma_mask: the number of slots minus one (Active + Dummy + Unused - 1).
- ma_smalltable: every dictionary object is created with a PyDictEntry array of size PyDict_MINSIZE == 8. While the dictionary holds no more than 8 entries, ma_smalltable is where they are actually stored.
- ma_table: while the entry count stays within PyDict_MINSIZE, ma_table points at ma_smalltable. Once it exceeds 8, Python treats it as a large dictionary, allocates additional memory, and points ma_table at that block.
- ma_lookup: the search (probing) function for dictionary entries.
Note that PyDictObject uses PyObject_HEAD rather than PyObject_VAR_HEAD. Although a dictionary is a variable-length object, the number of elements is not tracked through ob_size but through the ma_used field.
The hash function
The slot index (the element's position in the array) is computed as:
index = PyObject_Hash(key) & mask
where the PyObject_Hash function is defined as follows:
long
PyObject_Hash(PyObject *v)
{
    PyTypeObject *tp = v->ob_type;
    if (tp->tp_hash != NULL)
        return (*tp->tp_hash)(v);
    /* To keep to the general practice that inheriting
     * solely from object in C code should work without
     * an explicit call to PyType_Ready, we implicitly call
     * PyType_Ready here and then check the tp_hash slot again
     */
    if (tp->tp_dict == NULL) {
        if (PyType_Ready(tp) < 0)
            return -1;
        if (tp->tp_hash != NULL)
            return (*tp->tp_hash)(v);
    }
    if (tp->tp_compare == NULL && RICHCOMPARE(tp) == NULL) {
        return _Py_HashPointer(v); /* Use address as hash value */
    }
    /* If there's a cmp but no hash defined, the object can't be hashed */
    return PyObject_HashNotImplemented(v);
}
As the code shows, the call ultimately dispatches to tp->tp_hash, each type's own hash function.
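The index calculation is easy to reproduce in pure Python. This sketch relies on CPython's behavior of hashing small non-negative integers to themselves, which makes the results deterministic; slot_index is a hypothetical helper name:

```python
# Sketch: mapping a key to a slot index, i.e. index = hash(key) & mask.
# A fresh dict has 8 slots, so mask == PyDict_MINSIZE - 1 == 7.
MASK = 7

def slot_index(key, mask=MASK):
    return hash(key) & mask

# CPython hashes small ints to themselves, so these are deterministic:
assert slot_index(5) == 5
assert slot_index(13) == 5   # 13 & 7 == 5: 13 collides with 5 in an 8-slot table
```

Two distinct keys landing on the same index, as in the last line, is exactly the collision case that the probing strategy below must resolve.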
Collision resolution: open addressing
Python resolves collisions with open addressing, but not with textbook linear or quadratic probing: it uses a custom probe sequence built on the recurrence j = 5*j + 1, plus a perturbation term derived from the hash code. The source documentation explains:
The first half of collision resolution is to visit table indices via this
recurrence:
j = ((5*j) + 1) mod 2**i
For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof). By itself, this doesn't help much: like linear probing (setting
j += 1, or j -= 1, on each loop trip), it scans the table entries in a fixed
order. This would be bad, except that's not the only thing we do, and it's
actually *good* in the common cases where hash keys are consecutive. In an
example that's really too small to make this entirely clear, for a table of
size 2**3 the order of indices is:
0 -> 1 -> 6 -> 7 -> 4 -> 5 -> 2 -> 3 -> 0 [and here it's repeating]
If two things come in at index 5, the first place we look after is index 2,
not 6, so if another comes in at index 6 the collision at 5 didn't hurt it.
Linear probing is deadly in this case because there the fixed probe order
is the *same* as the order consecutive keys are likely to arrive. But it's
extremely unlikely hash codes will follow a 5*j+1 recurrence by accident,
and certain that consecutive hash codes do not.
The other half of the strategy is to get the other bits of the hash code
into play. This is done by initializing a (unsigned) vrbl "perturb" to the
full hash code, and changing the recurrence to:
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;
Now the probe sequence depends (eventually) on every bit in the hash code,
and the pseudo-scrambling property of recurring on 5*j+1 is more valuable,
because it quickly magnifies small differences in the bits that didn't affect
the initial index. Note that because perturb is unsigned, if the recurrence
is executed often enough perturb eventually becomes and remains 0. At that
point (very rarely reached) the recurrence is on (just) 5*j+1 again, and
that's certain to find an empty slot eventually (since it generates every int
in range(2**i), and we make sure there's always at least one empty slot).
Selecting a good value for PERTURB_SHIFT is a balancing act. You want it
small so that the high bits of the hash code continue to affect the probe
sequence across iterations; but you want it large so that in really bad cases
the high-order hash bits have an effect on early iterations. 5 was "the
best" in minimizing total collisions across experiments Tim Peters ran (on
both normal and pathological cases), but 4 and 6 weren't significantly worse.
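The probe order quoted above (for a table of size 2**3, without the perturb term) can be verified with a few lines of Python; probe_order is an illustrative helper, not CPython code:

```python
# Sketch of the basic probe recurrence j = (5*j + 1) % size for size == 2**3.
def probe_order(size=8, start=0):
    j, seen = start, []
    for _ in range(size):
        seen.append(j)
        j = (5 * j + 1) % size
    return seen

order = probe_order()
assert order == [0, 1, 6, 7, 4, 5, 2, 3]   # matches the sequence in the doc
assert sorted(order) == list(range(8))      # every slot is visited exactly once
```

In the real implementation the perturb term is folded in as well, so the full sequence eventually depends on every bit of the hash code, not just the low bits.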
Implementation of dictionary operations
Creating a dictionary object
// [dictobject.c]
static PyObject *dummy = NULL; /* Initialized by first call to newPyDictObject() */

PyObject *
PyDict_New(void)
{
    register PyDictObject *mp;
    if (dummy == NULL) { /* Auto-initialize dummy */
        dummy = PyString_FromString("<dummy key>");
        if (dummy == NULL)
            return NULL;
    }
    if (numfree) {
        mp = free_list[--numfree];
        assert (mp != NULL);
        assert (Py_TYPE(mp) == &PyDict_Type);
        _Py_NewReference((PyObject *)mp);
        if (mp->ma_fill) {
            EMPTY_TO_MINSIZE(mp);
        } else {
            /* At least set ma_table and ma_mask; these are wrong
               if an empty but presized dict is added to freelist */
            INIT_NONZERO_DICT_SLOTS(mp);
        }
        assert (mp->ma_used == 0);
        assert (mp->ma_table == mp->ma_smalltable);
        assert (mp->ma_mask == PyDict_MINSIZE - 1);
    } else {
        mp = PyObject_GC_New(PyDictObject, &PyDict_Type);
        if (mp == NULL)
            return NULL;
        EMPTY_TO_MINSIZE(mp); // set up ma_smalltable and initialize ma_mask
    }
    mp->ma_lookup = lookdict_string; // install the entry search function
    return (PyObject *)mp;
}
Steps:
- Initialize the dummy object. As its declaration shows, dummy is a global static object and is initialized only once.
- If the free list has a cached object available, reuse it.
- Otherwise, allocate a new PyDictObject and initialize it.
- Assign ma_lookup, the probing function that implements the entry search strategy used when adding and looking up elements.
The entry search function
This function finds the entry for a given key. It underpins getting, updating, adding, and deleting elements, since each of those operations must first locate the entry. As noted above, ma_lookup defaults to lookdict_string, an implementation specialized for string keys; the general implementation is lookdict, to which lookdict_string falls back for non-string keys.
static PyDictEntry *
lookdict(PyDictObject *mp, PyObject *key, register long hash)
{
    register size_t i;
    register size_t perturb;
    register PyDictEntry *freeslot;
    register size_t mask = (size_t)mp->ma_mask;
    PyDictEntry *ep0 = mp->ma_table;
    register PyDictEntry *ep;
    register int cmp;
    PyObject *startkey;

    i = (size_t)hash & mask;
    ep = &ep0[i];
    if (ep->me_key == NULL || ep->me_key == key)
        return ep;

    if (ep->me_key == dummy)
        freeslot = ep;
    else {
        if (ep->me_hash == hash) {
            startkey = ep->me_key;
            Py_INCREF(startkey);
            cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
            Py_DECREF(startkey);
            if (cmp < 0)
                return NULL;
            if (ep0 == mp->ma_table && ep->me_key == startkey) {
                if (cmp > 0)
                    return ep;
            }
            else {
                /* The compare did major nasty stuff to the
                 * dict: start over.
                 * XXX A clever adversary could prevent this
                 * XXX from terminating.
                 */
                return lookdict(mp, key, hash);
            }
        }
        freeslot = NULL;
    }

    /* In the loop, me_key == dummy is by far (factor of 100s) the
       least likely outcome, so test for that last. */
    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        i = (i << 2) + i + perturb + 1;
        ep = &ep0[i & mask];
        if (ep->me_key == NULL)
            return freeslot == NULL ? ep : freeslot;
        if (ep->me_key == key)
            return ep;
        if (ep->me_hash == hash && ep->me_key != dummy) {
            startkey = ep->me_key;
            Py_INCREF(startkey);
            cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
            Py_DECREF(startkey);
            if (cmp < 0)
                return NULL;
            if (ep0 == mp->ma_table && ep->me_key == startkey) {
                if (cmp > 0)
                    return ep;
            }
            else {
                /* The compare did major nasty stuff to the
                 * dict: start over.
                 * XXX A clever adversary could prevent this
                 * XXX from terminating.
                 */
                return lookdict(mp, key, hash);
            }
        }
        else if (ep->me_key == dummy && freeslot == NULL)
            freeslot = ep;
    }
    assert(0);  /* NOT REACHED */
    return 0;
}
Both insertion and lookup rely on this search strategy. If the key is not present, the search returns an entry in the Unused state; if the probe ends on a deleted slot, it returns an entry in the Dummy state; otherwise it returns the Active entry holding the key. Insertion then acts accordingly: for an Active entry it simply replaces me_value, while for an Unused or Dummy entry it must set me_key, me_hash, and me_value together.
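The search logic can be condensed into a pure-Python sketch. This is a simplified model, not the C code: reference counting and the restart after a mutating comparison are omitted, entries are plain tuples, and the hash-equality fast path is skipped:

```python
# Simplified sketch of lookdict: returns the slot index where `key`
# lives, or where it should be inserted. None marks an Unused slot,
# DUMMY marks a pseudo-deleted one.
PERTURB_SHIFT = 5
DUMMY = object()

def lookdict(table, key, key_hash):
    mask = len(table) - 1
    i = key_hash & mask
    perturb = key_hash
    freeslot = None            # first Dummy slot seen, reusable on insert
    while True:
        entry = table[i & mask]
        if entry is None:      # Unused slot: the search failed
            return (i & mask) if freeslot is None else freeslot
        if entry is not DUMMY and entry[0] == key:
            return i & mask    # hit on an Active entry
        if entry is DUMMY and freeslot is None:
            freeslot = i & mask
        i = 5 * i + 1 + perturb
        perturb >>= PERTURB_SHIFT

table = [None] * 8
table[5] = ("a", 1)
assert lookdict(table, "a", 5) == 5   # hit on the Active entry
table[5] = DUMMY
assert lookdict(table, "b", 5) == 5   # miss returns the reusable Dummy slot
```

The last assertion shows why Dummy slots matter: a failed search hands back the first pseudo-deleted slot it passed, so deleted space gets reused.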
Dynamic growth and shrinkage
Load factor definition
load_factor = mp->ma_fill / (mp->ma_mask + 1)
Trigger condition
When load_factor >= 2/3, a resize of ma_table is triggered; depending on the situation, the table may either grow or shrink.
The new size
mp->ma_used * (mp->ma_used > 50000 ? 2 : 4)
When two (or four) times ma_used is smaller than the current table, the resize shrinks the table; otherwise it grows it.
The code that resizes ma_table:
/*
Restructure the table by allocating a new table and reinserting all
items again. When entries have been deleted, the new table may
actually be smaller than the old one.
*/
static int
dictresize(PyDictObject *mp, Py_ssize_t minused)
{
    Py_ssize_t newsize;
    PyDictEntry *oldtable, *newtable, *ep;
    Py_ssize_t i;
    int is_oldtable_malloced;
    PyDictEntry small_copy[PyDict_MINSIZE];

    assert(minused >= 0);

    /* Find the smallest table size > minused. */
    for (newsize = PyDict_MINSIZE;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;
    if (newsize <= 0) {
        PyErr_NoMemory();
        return -1;
    }

    /* Get space for a new table. */
    oldtable = mp->ma_table;
    assert(oldtable != NULL);
    is_oldtable_malloced = oldtable != mp->ma_smalltable;

    if (newsize == PyDict_MINSIZE) {
        /* A large table is shrinking, or we can't get any smaller. */
        newtable = mp->ma_smalltable;
        if (newtable == oldtable) {
            if (mp->ma_fill == mp->ma_used) {
                /* No dummies, so no point doing anything. */
                return 0;
            }
            /* We're not going to resize it, but rebuild the
               table anyway to purge old dummy entries.
               Subtle: This is *necessary* if fill==size,
               as lookdict needs at least one virgin slot to
               terminate failing searches. If fill < size, it's
               merely desirable, as dummies slow searches. */
            assert(mp->ma_fill > mp->ma_used);
            memcpy(small_copy, oldtable, sizeof(small_copy));
            oldtable = small_copy;
        }
    }
    else {
        newtable = PyMem_NEW(PyDictEntry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the dict empty, using the new table. */
    assert(newtable != oldtable);
    mp->ma_table = newtable;
    mp->ma_mask = newsize - 1;
    memset(newtable, 0, sizeof(PyDictEntry) * newsize);
    mp->ma_used = 0;
    i = mp->ma_fill;
    mp->ma_fill = 0;

    /* Copy the data over; this is refcount-neutral for active entries;
       dummy entries aren't copied over, of course */
    for (ep = oldtable; i > 0; ep++) {
        if (ep->me_value != NULL) { /* active entry */
            --i;
            insertdict_clean(mp, ep->me_key, (long)ep->me_hash,
                             ep->me_value);
        }
        else if (ep->me_key != NULL) { /* dummy entry */
            --i;
            assert(ep->me_key == dummy);
            Py_DECREF(ep->me_key);
        }
        /* else key == value == NULL: nothing to do */
    }

    if (is_oldtable_malloced)
        PyMem_DEL(oldtable);
    return 0;
}
The flow, roughly: allocate a new table, move every Active entry into it, and adjust the bookkeeping fields, all in one go. This is quite different from Redis's incremental (progressive) rehashing.
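The size arithmetic can be sketched in Python. new_table_size is a hypothetical helper that combines the request formula above with dictresize's "smallest power of two > minused" loop:

```python
PyDict_MINSIZE = 8

def new_table_size(ma_used):
    # Requested size: ma_used * 4, or * 2 for dicts beyond 50000 entries.
    minused = ma_used * (2 if ma_used > 50000 else 4)
    # dictresize rounds the request up to the next power of two.
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize

assert new_table_size(5) == 32          # 5 * 4 = 20  -> 32
assert new_table_size(10) == 64         # 10 * 4 = 40 -> 64
assert new_table_size(100000) == 262144 # 100000 * 2 = 200000 -> 2**18
```

Note how the 4x multiplier for small dicts leaves plenty of headroom, which keeps the load factor low and resizes infrequent.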
The object caching pool
The dictionary's caching pool is the free_list shown below:
static PyDictObject *free_list[PyDict_MAXFREELIST];
static int numfree = 0;
Objects are placed into the pool when a dictionary is destroyed, in dict_dealloc:
static void
dict_dealloc(register PyDictObject *mp)
{
    register PyDictEntry *ep;
    Py_ssize_t fill = mp->ma_fill;
    /* bpo-31095: UnTrack is needed before calling any callbacks */
    PyObject_GC_UnTrack(mp);
    Py_TRASHCAN_SAFE_BEGIN(mp)
    for (ep = mp->ma_table; fill > 0; ep++) {
        if (ep->me_key) {
            --fill;
            Py_DECREF(ep->me_key);
            Py_XDECREF(ep->me_value);
        }
    }
    if (mp->ma_table != mp->ma_smalltable)
        PyMem_DEL(mp->ma_table); // release the separately malloc'ed ma_table
    if (numfree < PyDict_MAXFREELIST && Py_TYPE(mp) == &PyDict_Type)
        free_list[numfree++] = mp;
    else
        Py_TYPE(mp)->tp_free((PyObject *)mp);
    Py_TRASHCAN_SAFE_END(mp)
}
Note in particular that any separately allocated heap memory (a large ma_table) is released first; only then is the object itself placed into the pool.
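The free-list pattern itself can be sketched in a few lines of Python. The Pool class and its method names are illustrative, not CPython API; real Python dicts manage this in C, invisibly to user code:

```python
PyDict_MAXFREELIST = 80  # cap on cached objects, as in CPython

class Pool:
    def __init__(self):
        self.free_list = []
        self.allocated = 0           # counts real allocations only

    def new(self):
        if self.free_list:           # like PyDict_New: reuse a cached object
            return self.free_list.pop()
        self.allocated += 1          # otherwise pay for a real allocation
        return {}

    def dealloc(self, obj):          # like dict_dealloc: cache if there's room
        obj.clear()                  # release contents first, then pool it
        if len(self.free_list) < PyDict_MAXFREELIST:
            self.free_list.append(obj)

pool = Pool()
d = pool.new()
pool.dealloc(d)
assert pool.new() is d       # the retired object is handed straight back
assert pool.allocated == 1   # no second real allocation happened
```

The payoff is the same as in CPython: programs that create and discard dictionaries in a tight loop skip most allocator round-trips.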
References
- The book 《Python源码剖析》 (Python Source Code Analysis)
- The Python 2.7 source code
- Python字典实现原理: https://foofish.net/python_dict_implements.html