Hardware-portable Vector API bit-packing for VectorFastPFOR by raunaqmorarka · Pull Request #71 · fast-pack/JavaFastPFOR

raunaqmorarka · 2026-06-22T17:24:20Z

Hardware-portable Vector API bit-packing for `VectorFastPFOR`

Summary

VectorFastPFOR previously had a single bit-packing kernel hardwired to 512-bit lanes (VectorBitPacker). On any machine whose preferred vector width is below 512 bits (all Graviton, all AMD without AVX-512, Intel forced to AVX2), that kernel emulates the wide lanes in software and runs tens of times slower than the scalar FastPFOR codec it is meant to beat. The codec was effectively Intel-AVX-512-only.

This PR adds 128-bit and 256-bit kernels alongside the existing 512-bit one, tags each stream with the lane width it was packed for, and selects the matching kernel on decode. The result runs natively across NEON/SVE (ARM), Zen (AMD), and AVX2/AVX-512 (Intel) without a code change or rebuild.

Why a width tag is needed

The three kernels produce different packed byte layouts that are not interchangeable: a stream packed at one width cannot be decoded at another. Each page header therefore carries a 2-bit lane-width code in the top 2 bits of the existing wheremeta word (previously unused, so the format does not grow), and decode dispatches to the kernel that produced the stream.
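As a rough illustration of the tagging idea, the sketch below stores a 2-bit code in the top two bits of a 32-bit header word and reads it back. The class name, method names, and the assumption that the remaining 30 bits hold the payload are all hypothetical; the PR's actual field layout and code values are not shown here.

```java
// Sketch only: names and bit positions are illustrative assumptions,
// not the PR's actual header code.
final class WidthTagSketch {
    private static final int TAG_SHIFT = 30;                      // top 2 bits of a 32-bit word
    private static final int PAYLOAD_MASK = (1 << TAG_SHIFT) - 1; // low 30 bits

    /** Stores a 2-bit code in the top bits and keeps the low 30 bits as payload. */
    static int withTag(int payload, int code) {
        if ((payload & ~PAYLOAD_MASK) != 0) {
            throw new IllegalArgumentException("payload already uses the top 2 bits");
        }
        if (code < 0 || code > 3) {
            throw new IllegalArgumentException("code must fit in 2 bits");
        }
        return (code << TAG_SHIFT) | payload;
    }

    /** Reads the 2-bit code; the unsigned shift avoids sign extension. */
    static int tagOf(int word) {
        return word >>> TAG_SHIFT;
    }

    /** Reads the payload with the tag bits masked off. */
    static int payloadOf(int word) {
        return word & PAYLOAD_MASK;
    }

    public static void main(String[] args) {
        int word = withTag(1234, 2);
        System.out.println(tagOf(word) + " " + payloadOf(word)); // prints "2 1234"
    }
}
```

Because the tag lives in bits that were previously unused, the header word keeps its size; the cost is that the payload must fit in the remaining bits.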

Encode/decode model

- Encode defaults to the host's preferred width (LaneWidth.PREFERRED), so each machine packs with its widest native kernel.
- Decode reads the stream's width tag and uses the matching kernel.
- A kernel runs natively whenever the host's preferred width is ≥ the kernel width (small penalty), and is emulated (large penalty) when the host is narrower. Decode therefore fails loudly (checkDecodable) if a stream was packed for wider lanes than the host runs natively, rather than silently falling off a performance cliff.
- For heterogeneous clusters, the VectorFastPFOR(int pageSize, LaneWidth) constructor pins encode to a lowest-common-denominator width (e.g. BITS_128, which decodes natively on every machine).
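The selection and guard rules above can be sketched in a few lines. This is a stand-in written for illustration, not the PR's LaneWidth or checkDecodable code: the nested enum, the method signatures, and the choice to pass the host width as a plain int are assumptions. In real code the host width would presumably come from the Vector API (for example `IntVector.SPECIES_PREFERRED.vectorBitSize()`), which is left out so the sketch runs without the incubator module.

```java
// Sketch only: mirrors the rules described in the PR text, not its source.
final class LaneWidthSketch {
    enum Width {
        BITS_128(128), BITS_256(256), BITS_512(512);

        final int bits;

        Width(int bits) {
            this.bits = bits;
        }
    }

    /** Widest kernel whose width does not exceed the host's preferred width. */
    static Width preferredFor(int hostBits) {
        Width best = null;
        for (Width w : Width.values()) { // values() is in ascending width order
            if (w.bits <= hostBits) {
                best = w;
            }
        }
        if (best == null) {
            // Assumption of this sketch: hosts below 128 bits are rejected.
            throw new IllegalArgumentException("no native kernel for " + hostBits + "-bit vectors");
        }
        return best;
    }

    /** Decode guard: reject streams packed for lanes wider than the host runs natively. */
    static void checkDecodable(Width streamWidth, int hostBits) {
        if (streamWidth.bits > hostBits) {
            throw new IllegalStateException("stream packed for " + streamWidth.bits
                    + "-bit lanes, but host prefers " + hostBits + "-bit vectors");
        }
    }

    public static void main(String[] args) {
        System.out.println(preferredFor(256));   // prints "BITS_256"
        checkDecodable(Width.BITS_128, 256);     // narrower stream: accepted
        try {
            checkDecodable(Width.BITS_512, 256); // wider stream: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this model, pinning encode to the narrowest width trades some speed on wide hosts for streams that pass the guard everywhere.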

Hardware validation

Measured against the scalar FastPFOR codec on each platform listed here (decode and encode of 256-int blocks):

| Platform | ISA / preferred width (bits) | Kernel | Decode speedup | Encode |
| --- | --- | --- | --- | --- |
| Graviton2 (Neoverse N1) | NEON 128 | VectorBitPacker128 | 2.0–2.9x (~2.7x) | ~15% faster |
| Graviton4 (Neoverse V2) | SVE 128 | VectorBitPacker128 | 2.3–4.1x (~3.0x) | ~8% faster |
| Graviton3 | SVE 256 | VectorBitPacker256 | 2.8–4.5x (~3.4x) | ~13% faster |
| AMD EPYC (Zen 1) | AVX2 256 | VectorBitPacker256 | 2.4–3.1x (~2.8x) | ~15% faster |

128-bit lanes are the universal floor (native on every machine tested). The 256-bit kernel earns its place on genuinely 256-capable hardware (Graviton3 SVE, AMD Zen 1 AVX2). Forcing a kernel wider than the host's native width triggers software emulation and a 40–100x+ slowdown, which is what the decode guard exists to prevent.

Also fixed

A pre-existing correctness bug: the slow-path packer OR-accumulated into its output without zeroing first, so compressing into a reused (non-zero) buffer could carry stale bits. Fixed by zeroing the target words first; covered by dirtyOutputBufferRoundTrip.
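To make the failure mode concrete, here is a minimal scalar bit-packer that writes with `|=`, with a flag that toggles the zeroing step. It is a simplified stand-in for the slow path, not the PR's slowpack code; the class, method names, and signatures are invented for this illustration.

```java
import java.util.Arrays;

// Sketch only: a simplified OR-accumulating packer, not the PR's slowpack.
final class DirtyBufferSketch {
    /** Packs count values of `bits` bits each (1..32) into out, starting at word outPos. */
    static void pack(int[] in, int count, int bits, int[] out, int outPos, boolean zeroFirst) {
        int words = (count * bits + 31) >>> 5; // number of 32-bit words written
        if (zeroFirst) {
            Arrays.fill(out, outPos, outPos + words, 0); // the fix: clear stale bits
        }
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = bits == 32 ? in[i] : in[i] & ((1 << bits) - 1);
            int word = outPos + (bitPos >>> 5);
            int shift = bitPos & 31;
            out[word] |= v << shift;                 // OR-accumulate: assumes zeroed target
            if (shift + bits > 32) {
                out[word + 1] |= v >>> (32 - shift); // value straddles two words
            }
            bitPos += bits;
        }
    }

    /** Reverses pack for the same count and bit width. */
    static int[] unpack(int[] packed, int pos, int count, int bits) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            int bitPos = i * bits;
            int word = pos + (bitPos >>> 5);
            int shift = bitPos & 31;
            long lo = packed[word] & 0xFFFFFFFFL;
            long hi = (shift + bits > 32) ? (packed[word + 1] & 0xFFFFFFFFL) : 0L;
            out[i] = (int) (((lo | (hi << 32)) >>> shift) & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = {5, 3, 7, 1};
        int[] dirty = {-1, -1}; // reused buffer with every bit set
        pack(in, 4, 3, dirty, 0, false);
        System.out.println(Arrays.toString(unpack(dirty, 0, 4, 3))); // prints "[7, 7, 7, 7]"
        int[] reused = {-1, -1};
        pack(in, 4, 3, reused, 0, true);
        System.out.println(Arrays.toString(unpack(reused, 0, 4, 3))); // prints "[5, 3, 7, 1]"
    }
}
```

Only the words the packer is about to write are cleared, so data beyond the packed region in the caller's buffer is left alone.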

Commits

Fix VectorFastPFOR corruption on reused output buffers — slow-path zeroing fix + test.
Add 128-bit vector bit-packing with width-tagged streams — VectorBitPacker128, the LaneWidth registry, per-page width tag, decode guard, and the LCD constructor.
Add 256-bit vector bit-packing kernel — VectorBitPacker256, plugging into the existing infra.
Build and test the vector module by default — pom + module-info + README.
Cover VectorFastPFOR with the shared codec test suites — BasicTest / SkippableBasicTest entries.

Requirements

JDK 21+ (project baseline). On aarch64, JDK 24+ is needed for the SVE intrinsics; earlier releases fall back to a path that is slower than the scalar codec.

Add 128-bit vector bit-packing with width-tagged streams

Introduce a VectorBitPackerKernels interface with a 128-bit VectorBitPacker128 for Arm NEON and other 128-bit hardware, alongside the existing 512-bit kernel. A LaneWidth enum selects the encode kernel from the preferred vector width and tags each stream; decode dispatches to the tagged width's kernel and fails loudly when the host runs only narrower lanes natively. slowpack/slowunpack move into VectorFastPFOR.

Decode speedup over the scalar FastPFOR codec:
Graviton2 (Neoverse N1, NEON 128): 2.0-2.9x (~2.7x), encode ~15% faster
Graviton4 (Neoverse V2, SVE 128): 2.3-4.1x (~3.0x), encode ~8% faster

Add 256-bit vector bit-packing kernel

Add VectorBitPacker256 (AVX2, 256-bit SVE) and register it in the LaneWidth enum, so 256-bit hosts pack natively instead of stepping down to the 128-bit kernel.

Decode speedup over the scalar FastPFOR codec:
Graviton3 (SVE 256): 2.8-4.5x (~3.4x), encode ~13% faster
AMD EPYC Zen 1 (AVX2 256): 2.4-3.1x (~2.8x), encode ~15% faster

Fix VectorFastPFOR corruption on reused output buffers

40e2c33

slowpack OR-accumulates into the output, so a reused buffer kept stale bits. Zero the target words first.

raunaqmorarka force-pushed the vectorbitpacker128 branch from 8444b7a to 7e2cbf3 on June 22, 2026 17:39

raunaqmorarka added 4 commits June 22, 2026 23:10

Build and test the vector module by default

0f4d659

Declare jdk.incubator.vector as requires static so scalar consumers resolve the module without --add-modules; only VectorFastPFOR users need it.
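For readers unfamiliar with the directive, `requires static` makes a module mandatory at compile time but optional at run time. A module declaration using it might look like the fragment below; the module name is a placeholder, not the project's actual module name, and the real descriptor will contain further directives.

```java
// module-info.java (sketch; "example.fastpfor" is a placeholder name)
module example.fastpfor {
    // Needed to compile the vector codec, but not resolved at run time
    // unless the application adds the incubator module itself.
    requires static jdk.incubator.vector;
}
```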

Cover VectorFastPFOR with the shared codec test suites

9093b1a

SkippableBasicTest exercises maxHeadlessCompressedLength, so implement it (mirroring FastPFOR) rather than throwing.

raunaqmorarka force-pushed the vectorbitpacker128 branch from 7e2cbf3 to 9093b1a on June 22, 2026 17:40

raunaqmorarka wants to merge 5 commits into fast-pack:master from raunaqmorarka:vectorbitpacker128
