Hardware-portable Vector API bit-packing for VectorFastPFOR#71

Open
raunaqmorarka wants to merge 5 commits into fast-pack:master from raunaqmorarka:vectorbitpacker128

raunaqmorarka commented Jun 22, 2026

Contributor

Hardware-portable Vector API bit-packing for VectorFastPFOR

Summary

VectorFastPFOR previously had a single bit-packing kernel hardwired to 512-bit lanes (VectorBitPacker). On any machine whose preferred vector width is below 512 bits (all Graviton, all AMD without AVX-512, Intel forced to AVX2), that kernel emulates the wide lanes in software and runs tens of times slower than the scalar FastPFOR codec it is meant to beat. The codec was effectively Intel-AVX-512-only.

This PR adds 128-bit and 256-bit kernels alongside the existing 512-bit one, tags each stream with the lane width it was packed for, and selects the matching kernel on decode. The result runs natively across NEON/SVE (ARM), Zen (AMD), and AVX2/AVX-512 (Intel) without a code change or rebuild.

Why a width tag is needed

The three kernels produce different, non-interchangeable packed byte layouts. A stream packed at one width cannot be decoded at another. So each page header now carries a 2-bit lane-width code (top 2 bits of the existing wheremeta word, previously unused, so no format-size cost), and decode dispatches to the kernel that produced the stream.
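The tagging scheme can be sketched as plain bit manipulation on a 32-bit header word. This is a minimal illustration, not the PR's code: the class name, the helper names, and the numeric values of the width codes are all assumptions, and the sketch assumes the untagged value fits in the low 30 bits.

```java
final class WidthTagSketch {
    // Hypothetical 2-bit codes; the PR does not state the actual values.
    static final int CODE_512 = 0;
    static final int CODE_128 = 1;
    static final int CODE_256 = 2;

    private static final int TAG_SHIFT = 30;                    // top 2 bits of a 32-bit word
    private static final int META_MASK = (1 << TAG_SHIFT) - 1;  // low 30 bits

    // Store a 2-bit width code in the top bits of the header word.
    static int withTag(int whereMeta, int code) {
        if ((whereMeta & ~META_MASK) != 0) {
            throw new IllegalArgumentException("value already uses the top 2 bits");
        }
        if (code < 0 || code > 3) {
            throw new IllegalArgumentException("width code must fit in 2 bits");
        }
        return whereMeta | (code << TAG_SHIFT);
    }

    // Read the width code back; >>> avoids sign extension when the top bit is set.
    static int tagOf(int word) {
        return word >>> TAG_SHIFT;
    }

    // Recover the original header value with the tag masked off.
    static int metaOf(int word) {
        return word & META_MASK;
    }

    public static void main(String[] args) {
        int word = withTag(1234, CODE_256);
        System.out.println(tagOf(word) + " " + metaOf(word));
    }
}
```

Because the tag lives in bits that were previously unused, the header stays one word wide, which matches the "no format-size cost" point above.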

Encode/decode model

  • Encode defaults to the host's preferred width (LaneWidth.PREFERRED), so each machine packs with its widest native kernel.
  • Decode reads the stream's width tag and uses the matching kernel.
  • A kernel runs natively whenever the host's preferred width is at least the kernel width (with only a small penalty), and is emulated (with a large penalty) when the host is narrower. Decode therefore fails loudly (checkDecodable) if a stream was packed for wider lanes than the host runs natively, rather than silently falling off a performance cliff.
  • For heterogeneous clusters, the VectorFastPFOR(int pageSize, LaneWidth) constructor pins encode to a lowest-common-denominator width (e.g. BITS_128, which decodes natively on every machine).
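The selection and guard rules above can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: LaneWidth, BITS_128 and checkDecodable are names from the PR, but the signatures, the BITS_256 and BITS_512 constants, and the preferredFor helper are hypothetical. The host width is passed in as a plain int so the sketch runs without the incubator module; real code would presumably query the Vector API (for example `IntVector.SPECIES_PREFERRED.vectorBitSize()`).

```java
final class LaneWidthSketch {
    enum LaneWidth {
        BITS_128(128), BITS_256(256), BITS_512(512);

        final int bits;

        LaneWidth(int bits) {
            this.bits = bits;
        }

        // Widest kernel that does not exceed the host's preferred vector width.
        // Falls back to BITS_128 if the host reports something narrower.
        static LaneWidth preferredFor(int hostBits) {
            LaneWidth best = BITS_128;
            for (LaneWidth w : values()) {
                if (w.bits <= hostBits) {
                    best = w;
                }
            }
            return best;
        }
    }

    // Fail loudly instead of letting a too-wide kernel run emulated.
    static void checkDecodable(LaneWidth streamWidth, int hostBits) {
        if (streamWidth.bits > hostBits) {
            throw new IllegalStateException(
                "stream packed for " + streamWidth.bits + "-bit lanes, host prefers " + hostBits);
        }
    }

    public static void main(String[] args) {
        System.out.println(LaneWidth.preferredFor(256)); // encode choice on a 256-bit host
        checkDecodable(LaneWidth.BITS_128, 256);         // narrower stream: accepted
    }
}
```

Under this model, pinning encode to BITS_128 in a mixed cluster means checkDecodable never rejects a stream, at the cost of not using wider kernels where they are available.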

Hardware validation

Measured against the scalar FastPFOR codec across the matrix (decode and encode of 256-int blocks):

| Platform | ISA / preferred width | Kernel | Decode | Encode |
|---|---|---|---|---|
| Graviton2 (Neoverse N1) | NEON 128 | VectorBitPacker128 | 2.0–2.9x (~2.7x) | ~15% faster |
| Graviton4 (Neoverse V2) | SVE 128 | VectorBitPacker128 | 2.3–4.1x (~3.0x) | ~8% faster |
| Graviton3 | SVE 256 | VectorBitPacker256 | 2.8–4.5x (~3.4x) | ~13% faster |
| AMD EPYC (Zen 1) | AVX2 256 | VectorBitPacker256 | 2.4–3.1x (~2.8x) | ~15% faster |

128-bit lanes are the universal floor (native on every machine tested). The 256-bit kernel earns its place on genuinely 256-capable hardware (Graviton3 SVE, AMD Zen1 AVX2). Forcing a kernel wider than the host's native width emulates at a 40–100x+ cliff, which is what the decode guard exists to prevent.

Also fixed

A pre-existing correctness bug: the slow-path packer OR-accumulated into its output without zeroing first, so compressing into a reused (non-zero) buffer could carry stale bits. Fixed by zeroing the target words first; covered by dirtyOutputBufferRoundTrip.
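To illustrate why OR-accumulation needs a zeroed target, here is a minimal scalar bit-packer in the same spirit. It is a sketch, not the PR's slowpack: the names slowpack and slowunpack come from the PR, but the signatures and the packing layout here are assumptions, and bit is assumed to be in the range 1 to 32.

```java
import java.util.Arrays;

final class SlowPackSketch {
    // Pack the low `bit` bits of in[0..n) into out starting at outPos.
    static void slowpack(int[] in, int n, int[] out, int outPos, int bit) {
        int words = (n * bit + 31) >>> 5;
        // The fix: clear the target words so the |= below cannot keep stale bits.
        Arrays.fill(out, outPos, outPos + words, 0);
        int mask = (int) ((1L << bit) - 1);
        int bitPos = 0;
        for (int i = 0; i < n; i++) {
            int v = in[i] & mask;
            int word = outPos + (bitPos >>> 5);
            int shift = bitPos & 31;
            out[word] |= v << shift;
            if (shift + bit > 32) {
                out[word + 1] |= v >>> (32 - shift); // value straddles a word boundary
            }
            bitPos += bit;
        }
    }

    // Inverse of slowpack: read n values of `bit` bits each from in at inPos.
    static void slowunpack(int[] in, int inPos, int[] out, int n, int bit) {
        int mask = (int) ((1L << bit) - 1);
        int bitPos = 0;
        for (int i = 0; i < n; i++) {
            int word = inPos + (bitPos >>> 5);
            int shift = bitPos & 31;
            int v = in[word] >>> shift;
            if (shift + bit > 32) {
                v |= in[word + 1] << (32 - shift);
            }
            out[i] = v & mask;
            bitPos += bit;
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 31, 0, 17, 8, 22, 19, 4, 30, 2, 11, 25};
        int[] packed = new int[2];
        Arrays.fill(packed, -1); // simulate a reused, non-zero buffer
        slowpack(values, values.length, packed, 0, 5);
        int[] decoded = new int[values.length];
        slowunpack(packed, 0, decoded, values.length, 5);
        System.out.println(Arrays.equals(values, decoded));
    }
}
```

Removing the Arrays.fill call in slowpack makes the dirty-buffer round trip fail, because every `|=` then merges new bits with whatever the buffer already held.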

Commits

  1. Fix VectorFastPFOR corruption on reused output buffers: slow-path zeroing fix + test.
  2. Add 128-bit vector bit-packing with width-tagged streams: VectorBitPacker128, the LaneWidth registry, per-page width tag, decode guard, and the LCD constructor.
  3. Add 256-bit vector bit-packing kernel: VectorBitPacker256, plugging into the existing infra.
  4. Build and test the vector module by default: pom + module-info + README.
  5. Cover VectorFastPFOR with the shared codec test suites: BasicTest / SkippableBasicTest entries.

Requirements

JDK 21+ (project baseline). On aarch64, JDK 24+ for the SVE intrinsics; earlier releases run a fallback slower than the scalar codec.

Fix VectorFastPFOR corruption on reused output buffers

slowpack OR-accumulates into the output, so a reused buffer kept stale
bits. Zero the target words first.

Add 128-bit vector bit-packing with width-tagged streams

Introduce a VectorBitPackerKernels interface with a 128-bit
VectorBitPacker128 for Arm NEON and other 128-bit hardware, alongside
the existing 512-bit kernel. A LaneWidth enum selects the encode kernel
from the preferred vector width and tags each stream; decode dispatches
to the tagged width's kernel and fails loudly when the host runs only
narrower lanes natively. slowpack/slowunpack move into VectorFastPFOR.

Decode speedup over the scalar FastPFOR codec:
  Graviton2 (Neoverse N1, NEON 128): 2.0-2.9x (~2.7x), encode ~15% faster
  Graviton4 (Neoverse V2, SVE 128): 2.3-4.1x (~3.0x), encode ~8% faster

Add 256-bit vector bit-packing kernel

Add VectorBitPacker256 (AVX2, 256-bit SVE) and register it in the
LaneWidth enum, so 256-bit hosts pack natively instead of stepping down
to the 128-bit kernel.

Decode speedup over the scalar FastPFOR codec:
  Graviton3 (SVE 256): 2.8-4.5x (~3.4x), encode ~13% faster
  AMD EPYC Zen 1 (AVX2 256): 2.4-3.1x (~2.8x), encode ~15% faster

Build and test the vector module by default

Declare jdk.incubator.vector as requires static so scalar consumers
resolve the module without --add-modules; only VectorFastPFOR users need it.

Cover VectorFastPFOR with the shared codec test suites

SkippableBasicTest exercises maxHeadlessCompressedLength, so implement it
(mirroring FastPFOR) rather than throwing.