Going through parsers.pyx, particularly _string_box_utf8, I'm trying to figure out what the point of the hashtable checks is:

        k = kh_get_strbox(table, word)

        # in the hash table
        if k != table.n_buckets:
            # this increments the refcount, but need to test
            pyval = <object>table.vals[k]
        else:
            # box it. new ref?
            pyval = PyUnicode_Decode(word, strlen(word), "utf-8", encoding_errors)

            k = kh_put_strbox(table, word, &ret)
            table.vals[k] = <PyObject *>pyval

        result[i] = pyval
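
In pure-Python terms my reading is that the loop amounts to roughly this (just a sketch with a plain dict and a made-up `words` list, not the actual Cython):

    # rough pure-Python equivalent (a sketch, not the real implementation):
    # one dict shared across the column so repeated byte strings map to a
    # single Python str object
    cache = {}
    result = []
    for word in words:  # `words` stands in for the C char* values
        pyval = cache.get(word)
        if pyval is None:
            pyval = word.decode("utf-8")
            cache[word] = pyval
        result.append(pyval)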

This was introduced in 2012 (a9db003). I don't see a clear reason why this isn't just:

    result[i] = PyUnicode_Decode(word, strlen(word), "utf-8", encoding_errors)

My best guess is that it involves string interning. Prior to py37, only small strings were interned; now, I think, most strings up to 4096 characters are interned. My first thought was that under the old rules the hashtable could prevent a ton of memory allocation and the broader interning might have made it redundant, but that interning doesn't apply to runtime-created strings like these. So deduplicating through our own hashtable is probably the reason, and if so it is still a valid one.
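
To sanity-check the runtime part of that: in plain CPython, decoding the same bytes twice gives two distinct str objects unless you intern them yourself (the `is` results are an implementation detail, but this is what I see today):

    import sys

    a = b"some_column_value".decode("utf-8")
    b = b"some_column_value".decode("utf-8")
    print(a == b)  # True: equal values
    print(a is b)  # False: two separate objects, no automatic interning
    print(sys.intern(a) is sys.intern(b))  # True: explicit interning dedups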

Does anyone have a longer memory than me on this?

Comment From: jorisvandenbossche

Maybe for the case where you have a string column with repeated values? In that case the above approach, which keeps a hashtable of all encountered values, might make it faster(?) or at least reduce memory usage.

The arrow->python conversion in pyarrow has a similar option, deduplicate_objects (https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_pandas), which was introduced in https://github.com/apache/arrow/pull/3257, and that PR has some context / benchmarks.
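
For reference, the pyarrow usage looks like this (I haven't benchmarked it myself, so treat the memory benefit as their docs' claim rather than mine):

    import pyarrow as pa

    # a string column with heavy repetition
    arr = pa.array(["foo", "bar", "foo", "foo"] * 250_000)

    # deduplicate_objects=True reuses one Python str per distinct value when
    # boxing to object dtype; False creates a new str for every row
    deduped = arr.to_pandas(deduplicate_objects=True)
    boxed = arr.to_pandas(deduplicate_objects=False)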

Comment From: jbrockmendel

That's my guess too. I suspect that if we ever get to all-pyarrow-strings we can replace this, and an ndarray[object] allocation, with something a lot more efficient.