feat(physical-plan): add GroupColumn support for FixedSizeList<primitive> in multi-column GROUP BY#23128
Open
zhuqi-lucas wants to merge 2 commits into
Open
Conversation
…ive> in multi-column GROUP BY PR 2 of the EPIC apache#22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (apache#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map). ## Which issue does this PR close? Part of apache#22682 (nested type coverage) and apache#22715 (EPIC). ## Rationale for this change Today a single `FixedSizeList<primitive>` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries. ## What changes are included in this PR? - `FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>` in a new `fixed_size_list` submodule. Storage: outer null bitmap + a child `PrimitiveGroupValueBuilder<T, true>` that holds every element flat (length = `outer_len * list_len`). Element `j` of outer row `i` lives at child index `i * list_len + j`. - `group_column_supported_type` accepts `FixedSizeList<primitive>` for the primitive subset wired through the dispatcher (Int8..Int64, UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children (Struct / List inside FSL) are deferred to a follow-up after their respective builders land. - `make_group_column` dispatches `DataType::FixedSizeList(...)` to the appropriate `FixedSizeListGroupValueBuilder<T>` via an `instantiate_fsl!` macro mirroring the existing `instantiate_primitive!` pattern. ## Are these changes tested? Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz. Per-builder: - append / build round trip with mixed outer-null and inner-null rows - `equal_to` for identical / different / outer-null / inner-null rows - `take_n` boundary cases (`n=0`, `n=len`, prefix containing null rows) - sliced input array (offset != 0) - `vectorized_equal_to` / `vectorized_append` match per-row reference - `size()` grows with appends - `build` on empty builder returns empty array - end-to-end dispatcher → `GroupValuesColumn` routing Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR. 127/127 aggregates tests pass, clippy clean, fmt clean. ## What follows in the EPIC - PR 3: `Struct<...>` builder + dispatcher. - PR 4: `List<T>` builder + dispatcher (recursive child via factory). - PR 5: `LargeList<T>` builder + dispatcher. - PR 6: Relax FSL child restriction to allow `FSL<Struct>` / `FSL<List>` once the prerequisite child builders are in. - PR 7: `Map<K,V>` (`List<Struct<keys, values>>` Arrow representation).
Comment on lines
+927
to
+945
| // FixedSizeList<primitive> is the only nested type currently supported. | ||
| // List / LargeList / Struct will follow in subsequent PRs of #22715. | ||
| if let DataType::FixedSizeList(child_field, _) = data_type { | ||
| return matches!( | ||
| child_field.data_type(), | ||
| DataType::Int8 | ||
| | DataType::Int16 | ||
| | DataType::Int32 | ||
| | DataType::Int64 | ||
| | DataType::UInt8 | ||
| | DataType::UInt16 | ||
| | DataType::UInt32 | ||
| | DataType::UInt64 | ||
| | DataType::Float32 | ||
| | DataType::Float64 | ||
| | DataType::Date32 | ||
| | DataType::Date64 | ||
| ); | ||
| } |
Comment on lines
+1090
to
+1103
| DataType::FixedSizeList(ref child_field, _) => { | ||
| // `group_column_supported_type` already restricts the child to | ||
| // the primitive subset supported here. Any unsupported child | ||
| // type returned `false` upstream and was routed to the | ||
| // `GroupValuesRows` fallback, so the wildcard arm below is | ||
| // only reachable from the consistency-fuzz test. | ||
| macro_rules! instantiate_fsl { | ||
| ($t:ty) => {{ | ||
| let b = fixed_size_list::FixedSizeListGroupValueBuilder::<$t>::new( | ||
| data_type, | ||
| ); | ||
| v.push(Box::new(b) as _); | ||
| }}; | ||
| } |
Comment on lines
+90
to
+103
| let list_array = array | ||
| .as_any() | ||
| .downcast_ref::<FixedSizeListArray>() | ||
| .expect("FixedSizeListGroupValueBuilder called with non-FixedSizeList array"); | ||
| let rhs_sublist: ArrayRef = list_array.value(rhs_row); | ||
| let list_len = self.list_len_usize(); | ||
| let lhs_base = lhs_row * list_len; | ||
| for j in 0..list_len { | ||
| if !self.child.equal_to(lhs_base + j, &rhs_sublist, j) { | ||
| return false; | ||
| } | ||
| } | ||
| true | ||
| } |
Comment on lines
+73
to
+75
| fn list_len_usize(&self) -> usize { | ||
| self.list_len as usize | ||
| } |
… size, drop per-compare allocation in equal_to Four Copilot inline comments handled together: * `group_column_supported_type` and `make_group_column` both now reject `FixedSizeList` with a negative `list_size`. Arrow defines that size as non-negative, but the enum lets callers construct a negative value programmatically; without this guard the `i32 as usize` cast inside the builder would wrap and risk panics / OOM. Guards are duplicated on both sides so direct callers of `make_group_column` (and the `supported_schema` planner gate) both fail safely. Consistency fuzz already covers the `make_group_column` side via the existing unsupported_cases path; an explicit dedicated test below covers both surfaces. * `FixedSizeListGroupValueBuilder::new` asserts `list_len >= 0` so any direct construction that bypasses the factory fails fast with a clear message instead of silently producing a broken builder. `list_len_usize` now uses `usize::try_from(...).expect(...)` instead of `as usize`; with the constructor's assertion the conversion is provably infallible, but going through `try_from` means a future invariant break shows up as an explicit panic rather than a silent wrap. * `equal_to` no longer calls `list_array.value(rhs_row)`, which constructs a sliced child `ArrayRef` on every comparison. The hot path now borrows `list_array.values()` and uses `value_offset(rhs_row)` to compute the child base index, mirroring the existing approach in `append_val`. Zero allocations per equality check. New test: `negative_list_size_is_rejected_by_allow_list_and_dispatcher` locks in the negative-size guard at both surfaces. 13/13 FSL tests pass, 128/128 aggregates pass, clippy clean.
Contributor
Author
|
Addressed all four Copilot comments in 29114c3:
New test |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR 2 of the EPIC #22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map).
Which issue does this PR close?
Part of #22682 (nested type coverage) and #22715 (EPIC).
Rationale for this change
Today a single `FixedSizeList` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries.
What changes are included in this PR?
Are these changes tested?
Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz.
Per-builder:
Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR.
127/127 aggregates tests pass, clippy clean, fmt clean.
What follows in the EPIC
Are there any user-facing changes?
No behavior change for users whose GROUP BY did not previously contain a `FixedSizeList` column. For users who did, the grouping now uses the column-native fast path instead of falling back to `GroupValuesRows` — same results, less memory and CPU.