Skip to content

Add a hash-based factorize primitive #212

Description

@flexatone

Summary

Add a hash-based factorize(array) that returns (uniques, codes) for a 1D array of
any dtype, in O(n), where array[i] == uniques[codes[i]]. This is the standard
"dense integer coding" of a categorical column (cf. pandas.factorize,
numpy.unique(..., return_inverse=True)), but computed via arraykit's existing hash
table rather than a sort.

Motivation

Grouping, pivoting, and index de-duplication all reduce to the same primitive: map each
element of an array to a dense integer code and recover the distinct values. Today the
only way to do this in the numpy/arraykit stack is sort-based:

  • numpy.unique(a, return_inverse=True) — argsort of all n elements.
  • arraykit's own ufunc_unique1d_indexer (used by array_to_groups_and_locations) — also argsort.

The sort is the bottleneck, and it is worst exactly where grouping is most common:
string keys. Measured on n = 100_000 rows, k = 5_000 groups (<U array):

step time
np.unique(a, return_inverse=True) (sort) 14.9 ms
pd.factorize(a) on the same <U array (khash) 7.8 ms
pd.factorize on the object-dtype equivalent (khash + cached hashes) 2.9 ms
downstream reduction (np.bincount) 0.12 ms

Factorization is ~99% of the cost; the reduction is free. A hash-based factorize roughly
halves the time on <U arrays, and the gap to pandas' 2.9 ms is then purely the string
representation (object-interned, cached-hash strings vs. fixed-width <U), not the
algorithm.

Concrete downstream case: static-frame's Frame.pivot on a string key currently
factorizes with np.unique and is ~3.6× slower than the pandas equivalent end-to-end
(13.8 ms vs 3.8 ms), with essentially all of the difference in the sort.

Proposed API

def factorize(
    array: np.ndarray,
    *,
    sort: bool = False,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Return (uniques, codes) such that array[i] == uniques[codes[i]].

    uniques: 1D array of the distinct values, in first-appearance order
             (sort=False) or ascending sorted order (sort=True).
    codes:   1D intp array, len == len(array), values in [0, len(uniques)).
    """

Semantics mirror pandas.factorize / numpy.unique(return_inverse=True):

  • Ordering. Default sort=False returns uniques in first-appearance (hash-insertion)
    order — the cheap, natural output. sort=True additionally argsorts the k uniques
    and remaps, which is O(n + k log k) and still far cheaper than sorting n elements
    when k << n. (Callers that need sorted output — e.g. index construction — get it
    without a full re-sort.)
  • codes dtype intp, matching what np.bincount and fancy-indexing consume with no
    copy.

Why arraykit, and why this is low-risk

arraykit already contains the entire hash table this needs: AutoMap / FrozenAutoMap
map values to sequential integers in insertion order, and FrozenAutoMap.get_all(array)
is a vectorized O(n) code lookup that already works across every dtype we tried —
int*, uint*, float64, <U, S, datetime64, and object (verified: uniques[codes]
round-trips for all of them).

The only reason these maps can't factorize today is that they raise NonUniqueError
on a duplicate insert
— they are maps, not factorizers. A factorize entry point is
therefore a small addition over proven code: run the existing insertion hashing, but on a
repeated key return the already-assigned code instead of raising. No new hash table, no
new dtype handling.

Benefits across dtypes

Because it reuses the existing all-dtype hash, factorize helps everywhere, though the
magnitude and kind of benefit vary:

dtype today with hash factorize
U / S (strings) sort — slowest case large speedup (sort of variable-length strings is the worst case)
object sort via Python compares — very slow large speedup
float no O(n) coding possible; grouping falls back to sort new O(n) path (floats can't be dense-coded)
sparse int / datetime64 dense coding bails when range ≫ n → sort fallback new O(n) path for the sparse case
dense int, bool already O(n) via key - min dense coding little/no change (dense coding already optimal)

So it is not only a string optimization: it gives numeric/temporal float keys and
sparse-integer keys a vectorized grouping path they don't have at all today
, while
leaving the already-optimal dense-integer path alone.

Design considerations / edge cases

  • NaN. FrozenAutoMap currently raises KeyError on a NaN float key (since
    nan != nan). factorize must define this explicitly — either treat all NaN as one
    group, or emit a sentinel code (-1, as pandas.factorize(use_na_sentinel=True) does).
  • Immutability. arraykit's maps require immutable arrays; factorize should either
    accept mutable input or document the requirement.
  • Empty / single-value / all-unique arrays should be handled (degenerate but common).
  • Relationship to existing API. factorize generalizes ufunc_unique1d_indexer
    (which is sort-based) and complements get_new_indexers_and_screen (which already
    re-screens integer codes). A shared C core could back all three.

Downstream impact in static-frame

  • pivot_group_reduce_1d gains a hash path for string keys and, newly, for float and
    sparse-integer keys.
  • array_to_groups_and_locations / iter_group can drop their sort for the unsorted case.
  • Index construction / set operations that currently sort to dedup.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions