Hardware-portable Vector API bit-packing for VectorFastPFOR #71
Open
raunaqmorarka wants to merge 5 commits into
slowpack OR-accumulates into the output, so a reused buffer kept stale bits. Zero the target words first.
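The failure mode and the fix can be illustrated with a minimal scalar sketch. This is not the project's slowpack: the class name, method signatures, and bit layout below are assumptions chosen for illustration. It only shows why a packer that ORs bits into place has to clear its target words before writing.

```java
import java.util.Arrays;

/** Illustrative scalar bit-packer; names and layout are hypothetical, not the PR's code. */
final class SlowPackSketch {

    /** Packs count values of width bit (1..32) into out, starting at out[outPos]. */
    static void pack(int[] in, int inPos, int count, int[] out, int outPos, int bit) {
        if (bit < 1 || bit > 32) throw new IllegalArgumentException("bit must be in 1..32");
        int words = (count * bit + 31) >>> 5;
        // The fix: clear the target words, because the loop below only ORs bits in.
        Arrays.fill(out, outPos, outPos + words, 0);
        long mask = (1L << bit) - 1;
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            long v = in[inPos + i] & mask;
            int word = outPos + (bitPos >>> 5);
            int shift = bitPos & 31;
            out[word] |= (int) (v << shift);
            if (shift + bit > 32) {
                // The value straddles a word boundary; spill the high bits.
                out[word + 1] |= (int) (v >>> (32 - shift));
            }
            bitPos += bit;
        }
    }

    /** Reverses pack: reads count values of width bit (1..32) from in[inPos]. */
    static void unpack(int[] in, int inPos, int[] out, int outPos, int count, int bit) {
        if (bit < 1 || bit > 32) throw new IllegalArgumentException("bit must be in 1..32");
        long mask = (1L << bit) - 1;
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = inPos + (bitPos >>> 5);
            int shift = bitPos & 31;
            long v = (in[word] & 0xFFFFFFFFL) >>> shift;
            if (shift + bit > 32) {
                v |= (in[word + 1] & 0xFFFFFFFFL) << (32 - shift);
            }
            out[outPos + i] = (int) (v & mask);
            bitPos += bit;
        }
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};
        int[] dirty = new int[2];
        Arrays.fill(dirty, -1); // simulate a reused, non-zero output buffer
        pack(values, 0, values.length, dirty, 0, 3);
        int[] back = new int[values.length];
        unpack(dirty, 0, back, 0, values.length, 3);
        System.out.println(Arrays.toString(back)); // prints [1, 2, 3, 4, 5]
    }
}
```

Without the `Arrays.fill` call, the all-ones buffer would survive the OR and every unpacked 3-bit value in this example would come back as 7.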
8444b7a to
7e2cbf3
Introduce a VectorBitPackerKernels interface with a 128-bit VectorBitPacker128 kernel for Arm NEON and other 128-bit hardware, alongside the existing 512-bit kernel. A LaneWidth enum selects the encode kernel from the preferred vector width and tags each stream; decode dispatches to the tagged width's kernel and fails loudly when the host natively runs only narrower lanes. slowpack/slowunpack move into VectorFastPFOR.
Decode speedup over the scalar FastPFOR codec:
- Graviton2 (Neoverse N1, NEON 128): 2.0-2.9x (~2.7x), encode ~15% faster
- Graviton4 (Neoverse V2, SVE 128): 2.3-4.1x (~3.0x), encode ~8% faster
Add VectorBitPacker256 (AVX2, 256-bit SVE) and register it in the LaneWidth enum, so 256-bit hosts pack natively instead of stepping down to the 128-bit kernel.
Decode speedup over the scalar FastPFOR codec:
- Graviton3 (SVE 256): 2.8-4.5x (~3.4x), encode ~13% faster
- AMD EPYC Zen 1 (AVX2 256): 2.4-3.1x (~2.8x), encode ~15% faster
Declare jdk.incubator.vector as requires static so scalar consumers resolve the module without --add-modules; only VectorFastPFOR users need it.
SkippableBasicTest exercises maxHeadlessCompressedLength, so implement it (mirroring FastPFOR) rather than throwing.
7e2cbf3 to
9093b1a
Hardware-portable Vector API bit-packing for VectorFastPFOR
Summary
VectorFastPFOR previously had a single bit-packing kernel hardwired to 512-bit lanes (VectorBitPacker). On any machine whose preferred vector width is below 512 bits (all Graviton, all AMD without AVX-512, Intel forced to AVX2), that kernel emulates the wide lanes in software and runs tens of times slower than the scalar FastPFOR codec it is meant to beat. The codec was effectively Intel-AVX-512-only.
This PR adds 128-bit and 256-bit kernels alongside the existing 512-bit one, tags each stream with the lane width it was packed for, and selects the matching kernel on decode. The result runs natively across NEON/SVE (ARM), Zen (AMD), and AVX2/AVX-512 (Intel) without a code change or rebuild.
Why a width tag is needed
The three kernels produce different, non-interchangeable packed byte layouts. A stream packed at one width cannot be decoded at another. So each page header now carries a 2-bit lane-width code (top 2 bits of the existing wheremeta word, previously unused, so no format-size cost), and decode dispatches to the kernel that produced the stream.
Encode/decode model
- Encode uses the host's preferred vector width (LaneWidth.PREFERRED), so each machine packs with its widest native kernel.
- Decode fails loudly (checkDecodable) if a stream was packed for wider lanes than the host runs natively, rather than silently falling off a performance cliff.
- The VectorFastPFOR(int pageSize, LaneWidth) constructor pins encode to a lowest-common-denominator width (e.g. BITS_128, which decodes natively on every machine).
Hardware validation
Measured against the scalar FastPFOR codec across the matrix (decode and encode of 256-int blocks):
128-bit lanes are the universal floor (native on every machine tested). The 256-bit kernel earns its place on genuinely 256-bit-capable hardware (Graviton3 SVE, AMD Zen 1 AVX2). Forcing a kernel wider than the host's native width falls back to software emulation, a 40–100x+ slowdown, which is what the decode guard exists to prevent.
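The width tag and the decode guard can be sketched in plain Java as below. This is a hedged illustration, not the PR's code: the nested LaneWidth enum is a hypothetical stand-in for the real one, the mapping of widths to 2-bit codes is assumed, and withTag, widthOf, metaOf, and widestFor are invented helper names. Only the placement of the tag in the top 2 bits of a 32-bit header word and the fail-on-wider-stream behaviour come from the description.

```java
/** Illustrative width tagging and decode guard; all names here are hypothetical. */
final class WidthTagSketch {

    /** Stand-in for the PR's LaneWidth; the 2-bit codes are an assumption. */
    enum LaneWidth {
        BITS_128(0, 128), BITS_256(1, 256), BITS_512(2, 512);

        final int code;
        final int bits;

        LaneWidth(int code, int bits) {
            this.code = code;
            this.bits = bits;
        }

        static LaneWidth fromCode(int code) {
            for (LaneWidth w : values()) {
                if (w.code == code) return w;
            }
            throw new IllegalArgumentException("unknown lane-width code " + code);
        }

        /** Widest kernel that does not exceed the host's vector width; 128 is the floor. */
        static LaneWidth widestFor(int hostBits) {
            if (hostBits >= 512) return BITS_512;
            if (hostBits >= 256) return BITS_256;
            return BITS_128;
        }
    }

    private static final int TAG_SHIFT = 30;                   // top 2 bits of the word
    private static final int META_MASK = (1 << TAG_SHIFT) - 1; // low 30 bits

    /** Stores the width code in the top 2 bits of a header word. */
    static int withTag(int meta, LaneWidth width) {
        if ((meta & ~META_MASK) != 0) {
            throw new IllegalArgumentException("top 2 bits of the header word must be free");
        }
        return meta | (width.code << TAG_SHIFT);
    }

    static LaneWidth widthOf(int header) {
        return LaneWidth.fromCode(header >>> TAG_SHIFT);
    }

    static int metaOf(int header) {
        return header & META_MASK;
    }

    /** Rejects streams packed for wider lanes than the host runs natively. */
    static void checkDecodable(LaneWidth streamWidth, int hostBits) {
        if (streamWidth.bits > hostBits) {
            throw new UnsupportedOperationException(
                    "stream packed for " + streamWidth.bits + "-bit lanes, host runs " + hostBits);
        }
    }

    public static void main(String[] args) {
        int header = withTag(12345, LaneWidth.widestFor(256));
        LaneWidth width = widthOf(header);
        checkDecodable(width, 256); // passes: the stream is no wider than the host
        System.out.println(width + " " + metaOf(header)); // prints BITS_256 12345
    }
}
```

In this sketch a stream tagged BITS_512 decoded on a 128-bit host throws immediately instead of running the emulated path, while a BITS_128 stream is accepted on any host.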
Also fixed
A pre-existing correctness bug: the slow-path packer OR-accumulated into its output without zeroing first, so compressing into a reused (non-zero) buffer could carry stale bits. Fixed by zeroing the target words first; covered by dirtyOutputBufferRoundTrip.
Commits
1. VectorFastPFOR corruption on reused output buffers: slow-path zeroing fix + test.
2. VectorBitPacker128, the LaneWidth registry, per-page width tag, decode guard, and the LCD constructor.
3. VectorBitPacker256, plugging into the existing infra.
4. module-info + README.
5. VectorFastPFOR in the shared codec test suites: BasicTest/SkippableBasicTest entries.
Requirements
JDK 21+ (project baseline). On aarch64, JDK 24+ is needed for the SVE intrinsics; earlier releases run a fallback that is slower than the scalar codec.
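A caller that wants to warn about the slow aarch64 fallback at startup could encode these requirements in a small check. The sketch below is an assumption, not part of the PR: the class, enum, and method names are hypothetical, and it inspects only the JDK feature version and the os.arch property, not the actual vector hardware.

```java
/** Illustrative requirements check; class and enum names are hypothetical. */
final class RequirementsSketch {

    enum Support { UNSUPPORTED, SLOW_FALLBACK, OK }

    /** Classifies a JDK feature version and os.arch value against the stated requirements. */
    static Support assess(String osArch, int jdkFeature) {
        if (jdkFeature < 21) {
            return Support.UNSUPPORTED;   // below the project baseline
        }
        if ("aarch64".equals(osArch) && jdkFeature < 24) {
            return Support.SLOW_FALLBACK; // no SVE intrinsics before JDK 24
        }
        return Support.OK;
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        int feature = Runtime.version().feature();
        System.out.println(arch + " / JDK " + feature + ": " + assess(arch, feature));
    }
}
```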