Summary
Add a hash-based factorize(array) that returns (uniques, codes) for a 1D array of
any dtype, in O(n), where array[i] == uniques[codes[i]]. This is the standard
"dense integer coding" of a categorical column (cf. pandas.factorize,
numpy.unique(..., return_inverse=True)), but computed via arraykit's existing hash
table rather than a sort.
Motivation
Grouping, pivoting, and index de-duplication all reduce to the same primitive: map each
element of an array to a dense integer code and recover the distinct values. Today the
only way to do this in the numpy/arraykit stack is sort-based:
numpy.unique(a, return_inverse=True) — argsort of all n elements.
- arraykit's own
ufunc_unique1d_indexer (used by array_to_groups_and_locations) — also argsort.
The sort is the bottleneck, and it is worst exactly where grouping is most common:
string keys. Measured on n = 100_000 rows, k = 5_000 groups (<U array):
| step |
time |
np.unique(a, return_inverse=True) (sort) |
14.9 ms |
pd.factorize(a) on the same <U array (khash) |
7.8 ms |
pd.factorize on the object-dtype equivalent (khash + cached hashes) |
2.9 ms |
downstream reduction (np.bincount) |
0.12 ms |
Factorization is ~99% of the cost; the reduction is free. A hash-based factorize roughly
halves the time on <U arrays, and the gap to pandas' 2.9 ms is then purely the string
representation (object-interned, cached-hash strings vs. fixed-width <U), not the
algorithm.
Concrete downstream case: static-frame's Frame.pivot on a string key currently
factorizes with np.unique and is ~3.6× slower than the pandas equivalent end-to-end
(13.8 ms vs 3.8 ms), with essentially all of the difference in the sort.
Proposed API
def factorize(
array: np.ndarray,
*,
sort: bool = False,
) -> tuple[np.ndarray, np.ndarray]:
"""
Return (uniques, codes) such that array[i] == uniques[codes[i]].
uniques: 1D array of the distinct values, in first-appearance order
(sort=False) or ascending sorted order (sort=True).
codes: 1D intp array, len == len(array), values in [0, len(uniques)).
"""
Semantics mirror pandas.factorize / numpy.unique(return_inverse=True):
- Ordering. Default
sort=False returns uniques in first-appearance (hash-insertion)
order — the cheap, natural output. sort=True additionally argsorts the k uniques
and remaps, which is O(n + k log k) and still far cheaper than sorting n elements
when k << n. (Callers that need sorted output — e.g. index construction — get it
without a full re-sort.)
- codes dtype
intp, matching what np.bincount and fancy-indexing consume with no
copy.
Why arraykit, and why this is low-risk
arraykit already contains the entire hash table this needs: AutoMap / FrozenAutoMap
map values to sequential integers in insertion order, and FrozenAutoMap.get_all(array)
is a vectorized O(n) code lookup that already works across every dtype we tried —
int*, uint*, float64, <U, S, datetime64, and object (verified: uniques[codes]
round-trips for all of them).
The only reason these maps can't factorize today is that they raise NonUniqueError
on a duplicate insert — they are maps, not factorizers. A factorize entry point is
therefore a small addition over proven code: run the existing insertion hashing, but on a
repeated key return the already-assigned code instead of raising. No new hash table, no
new dtype handling.
Benefits across dtypes
Because it reuses the existing all-dtype hash, factorize helps everywhere, though the
magnitude and kind of benefit vary:
| dtype |
today |
with hash factorize |
U / S (strings) |
sort — slowest case |
large speedup (sort of variable-length strings is the worst case) |
object |
sort via Python compares — very slow |
large speedup |
float |
no O(n) coding possible; grouping falls back to sort |
new O(n) path (floats can't be dense-coded) |
sparse int / datetime64 |
dense coding bails when range ≫ n → sort fallback |
new O(n) path for the sparse case |
dense int, bool |
already O(n) via key - min dense coding |
little/no change (dense coding already optimal) |
So it is not only a string optimization: it gives numeric/temporal float keys and
sparse-integer keys a vectorized grouping path they don't have at all today, while
leaving the already-optimal dense-integer path alone.
Design considerations / edge cases
- NaN.
FrozenAutoMap currently raises KeyError on a NaN float key (since
nan != nan). factorize must define this explicitly — either treat all NaN as one
group, or emit a sentinel code (-1, as pandas.factorize(use_na_sentinel=True) does).
- Immutability. arraykit's maps require immutable arrays;
factorize should either
accept mutable input or document the requirement.
- Empty / single-value / all-unique arrays should be handled (degenerate but common).
- Relationship to existing API.
factorize generalizes ufunc_unique1d_indexer
(which is sort-based) and complements get_new_indexers_and_screen (which already
re-screens integer codes). A shared C core could back all three.
Downstream impact in static-frame
pivot_group_reduce_1d gains a hash path for string keys and, newly, for float and
sparse-integer keys.
array_to_groups_and_locations / iter_group can drop their sort for the unsorted case.
- Index construction / set operations that currently sort to dedup.
Summary
Add a hash-based
factorize(array)that returns(uniques, codes)for a 1D array ofany dtype, in O(n), where
array[i] == uniques[codes[i]]. This is the standard"dense integer coding" of a categorical column (cf.
pandas.factorize,numpy.unique(..., return_inverse=True)), but computed via arraykit's existing hashtable rather than a sort.
Motivation
Grouping, pivoting, and index de-duplication all reduce to the same primitive: map each
element of an array to a dense integer code and recover the distinct values. Today the
only way to do this in the numpy/arraykit stack is sort-based:
numpy.unique(a, return_inverse=True)— argsort of all n elements.ufunc_unique1d_indexer(used byarray_to_groups_and_locations) — alsoargsort.The sort is the bottleneck, and it is worst exactly where grouping is most common:
string keys. Measured on
n = 100_000rows,k = 5_000groups (<Uarray):np.unique(a, return_inverse=True)(sort)pd.factorize(a)on the same<Uarray (khash)pd.factorizeon theobject-dtype equivalent (khash + cached hashes)np.bincount)Factorization is ~99% of the cost; the reduction is free. A hash-based factorize roughly
halves the time on
<Uarrays, and the gap to pandas' 2.9 ms is then purely the stringrepresentation (object-interned, cached-hash strings vs. fixed-width
<U), not thealgorithm.
Concrete downstream case: static-frame's
Frame.pivoton a string key currentlyfactorizes with
np.uniqueand is ~3.6× slower than the pandas equivalent end-to-end(13.8 ms vs 3.8 ms), with essentially all of the difference in the sort.
Proposed API
Semantics mirror
pandas.factorize/numpy.unique(return_inverse=True):sort=Falsereturns uniques in first-appearance (hash-insertion)order — the cheap, natural output.
sort=Trueadditionally argsorts the k uniquesand remaps, which is O(n + k log k) and still far cheaper than sorting n elements
when k << n. (Callers that need sorted output — e.g. index construction — get it
without a full re-sort.)
intp, matching whatnp.bincountand fancy-indexing consume with nocopy.
Why arraykit, and why this is low-risk
arraykit already contains the entire hash table this needs:
AutoMap/FrozenAutoMapmap values to sequential integers in insertion order, and
FrozenAutoMap.get_all(array)is a vectorized O(n) code lookup that already works across every dtype we tried —
int*,uint*,float64,<U,S,datetime64, andobject(verified:uniques[codes]round-trips for all of them).
The only reason these maps can't factorize today is that they raise
NonUniqueErroron a duplicate insert — they are maps, not factorizers. A
factorizeentry point istherefore a small addition over proven code: run the existing insertion hashing, but on a
repeated key return the already-assigned code instead of raising. No new hash table, no
new dtype handling.
Benefits across dtypes
Because it reuses the existing all-dtype hash,
factorizehelps everywhere, though themagnitude and kind of benefit vary:
factorizeU/S(strings)objectfloatint/datetime64int,boolkey - mindense codingSo it is not only a string optimization: it gives numeric/temporal float keys and
sparse-integer keys a vectorized grouping path they don't have at all today, while
leaving the already-optimal dense-integer path alone.
Design considerations / edge cases
FrozenAutoMapcurrently raisesKeyErroron aNaNfloat key (sincenan != nan).factorizemust define this explicitly — either treat all NaN as onegroup, or emit a sentinel code (
-1, aspandas.factorize(use_na_sentinel=True)does).factorizeshould eitheraccept mutable input or document the requirement.
factorizegeneralizesufunc_unique1d_indexer(which is sort-based) and complements
get_new_indexers_and_screen(which alreadyre-screens integer codes). A shared C core could back all three.
Downstream impact in static-frame
pivot_group_reduce_1dgains a hash path for string keys and, newly, for float andsparse-integer keys.
array_to_groups_and_locations/iter_groupcan drop their sort for the unsorted case.