Skip to content

feat(physical-plan): add GroupColumn support for FixedSizeList<primitive> in multi-column GROUP BY#23128

Open
zhuqi-lucas wants to merge 2 commits into
apache:mainfrom
zhuqi-lucas:qizhu/group-column-fixed-size-list-primitive
Open

feat(physical-plan): add GroupColumn support for FixedSizeList<primitive> in multi-column GROUP BY#23128
zhuqi-lucas wants to merge 2 commits into
apache:mainfrom
zhuqi-lucas:qizhu/group-column-fixed-size-list-primitive

Conversation

@zhuqi-lucas

Copy link
Copy Markdown
Contributor

PR 2 of the EPIC #22715 sequence. Builds on the dispatcher refactor + factory framework that landed in PR 1 (#22751) to bring the first nested type into `GroupValuesColumn`, unblocking the rest of the sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL, PR 7 Map).

Which issue does this PR close?

Part of #22682 (nested type coverage) and #22715 (EPIC).

Rationale for this change

Today a single `FixedSizeList` column in a GROUP BY drags the whole grouping onto the byte-encoded `GroupValuesRows` fallback, even when every other column would have qualified for the column-wise + cross-column short-circuit fast path in `GroupValuesColumn`. With the recursive `make_group_column` factory + `group_column_supported_type` allow-list in place from PR 1, this PR is now a self-contained addition of one builder and two dispatcher entries.

What changes are included in this PR?

  • `FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>` in a new `fixed_size_list` submodule. Storage: outer null bitmap + a child `PrimitiveGroupValueBuilder<T, true>` that holds every element flat (length = `outer_len * list_len`). Element `j` of outer row `i` lives at child index `i * list_len + j`.
  • `group_column_supported_type` accepts `FixedSizeList` for the primitive subset wired through the dispatcher (Int8..Int64, UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children (Struct / List inside FSL) are deferred to a follow-up after their respective builders land.
  • `make_group_column` dispatches `DataType::FixedSizeList(...)` to the appropriate `FixedSizeListGroupValueBuilder` via an `instantiate_fsl!` macro mirroring the existing `instantiate_primitive!` pattern.

Are these changes tested?

Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus extensions to the existing `group_column_supported_type` ⇔ `make_group_column` consistency fuzz.

Per-builder:

  • append / build round trip with mixed outer-null and inner-null rows
  • `equal_to` for identical / different / outer-null / inner-null rows
  • `take_n` boundary cases (`n=0`, `n=len`, prefix containing null rows)
  • sliced input array (offset != 0)
  • `vectorized_equal_to` / `vectorized_append` match per-row reference
  • `size()` grows with appends
  • `build` on empty builder returns empty array
  • end-to-end dispatcher → `GroupValuesColumn` routing

Consistency fuzz now covers four primitive FSL children (signed, unsigned, float, date) on the supported side and three non-primitive FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking in the scope boundary for this PR.

127/127 aggregates tests pass, clippy clean, fmt clean.

What follows in the EPIC

  • PR 3: `Struct<...>` builder + dispatcher.
  • PR 4: `List` builder + dispatcher (recursive child via factory).
  • PR 5: `LargeList` builder + dispatcher.
  • PR 6: Relax FSL child restriction to allow `FSL` / `FSL` once the prerequisite child builders are in.
  • PR 7: `Map<K,V>` (`List<Struct<keys, values>>` Arrow representation).

Are there any user-facing changes?

No behavior change for users whose GROUP BY did not previously contain a `FixedSizeList` column. For users who did, the grouping now uses the column-native fast path instead of falling back to `GroupValuesRows` — same results, less memory and CPU.

…ive> in multi-column GROUP BY

PR 2 of the EPIC apache#22715 sequence. Builds on the dispatcher refactor +
factory framework that landed in PR 1 (apache#22751) to bring the first
nested type into `GroupValuesColumn`, unblocking the rest of the
sequence (PR 3 Struct, PR 4 List, PR 5 LargeList, PR 6 composite FSL,
PR 7 Map).

## Which issue does this PR close?

Part of apache#22682 (nested type coverage) and apache#22715 (EPIC).

## Rationale for this change

Today a single `FixedSizeList<primitive>` column in a GROUP BY drags
the whole grouping onto the byte-encoded `GroupValuesRows` fallback,
even when every other column would have qualified for the column-wise
+ cross-column short-circuit fast path in `GroupValuesColumn`. With
the recursive `make_group_column` factory + `group_column_supported_type`
allow-list in place from PR 1, this PR is now a self-contained
addition of one builder and two dispatcher entries.

## What changes are included in this PR?

- `FixedSizeListGroupValueBuilder<T: ArrowPrimitiveType>` in a new
  `fixed_size_list` submodule. Storage: outer null bitmap + a child
  `PrimitiveGroupValueBuilder<T, true>` that holds every element flat
  (length = `outer_len * list_len`). Element `j` of outer row `i`
  lives at child index `i * list_len + j`.
- `group_column_supported_type` accepts `FixedSizeList<primitive>` for
  the primitive subset wired through the dispatcher (Int8..Int64,
  UInt8..UInt64, Float32, Float64, Date32, Date64). Composite children
  (Struct / List inside FSL) are deferred to a follow-up after their
  respective builders land.
- `make_group_column` dispatches `DataType::FixedSizeList(...)` to the
  appropriate `FixedSizeListGroupValueBuilder<T>` via an
  `instantiate_fsl!` macro mirroring the existing `instantiate_primitive!`
  pattern.

## Are these changes tested?

Yes. 12 new unit tests on `FixedSizeListGroupValueBuilder` plus
extensions to the existing `group_column_supported_type` ⇔
`make_group_column` consistency fuzz.

Per-builder:
- append / build round trip with mixed outer-null and inner-null rows
- `equal_to` for identical / different / outer-null / inner-null rows
- `take_n` boundary cases (`n=0`, `n=len`, prefix containing null rows)
- sliced input array (offset != 0)
- `vectorized_equal_to` / `vectorized_append` match per-row reference
- `size()` grows with appends
- `build` on empty builder returns empty array
- end-to-end dispatcher → `GroupValuesColumn` routing

Consistency fuzz now covers four primitive FSL children (signed,
unsigned, float, date) on the supported side and three non-primitive
FSL children (Utf8, Decimal128, Boolean) on the rejected side, locking
in the scope boundary for this PR.

127/127 aggregates tests pass, clippy clean, fmt clean.

## What follows in the EPIC

- PR 3: `Struct<...>` builder + dispatcher.
- PR 4: `List<T>` builder + dispatcher (recursive child via factory).
- PR 5: `LargeList<T>` builder + dispatcher.
- PR 6: Relax FSL child restriction to allow `FSL<Struct>` / `FSL<List>`
  once the prerequisite child builders are in.
- PR 7: `Map<K,V>` (`List<Struct<keys, values>>` Arrow representation).
Copilot AI review requested due to automatic review settings June 23, 2026 15:32
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines +927 to +945
// FixedSizeList<primitive> is the only nested type currently supported.
// List / LargeList / Struct will follow in subsequent PRs of #22715.
if let DataType::FixedSizeList(child_field, _) = data_type {
return matches!(
child_field.data_type(),
DataType::Int8
| DataType::Int16
| DataType::Int32
| DataType::Int64
| DataType::UInt8
| DataType::UInt16
| DataType::UInt32
| DataType::UInt64
| DataType::Float32
| DataType::Float64
| DataType::Date32
| DataType::Date64
);
}
Comment on lines +1090 to +1103
DataType::FixedSizeList(ref child_field, _) => {
// `group_column_supported_type` already restricts the child to
// the primitive subset supported here. Any unsupported child
// type returned `false` upstream and was routed to the
// `GroupValuesRows` fallback, so the wildcard arm below is
// only reachable from the consistency-fuzz test.
macro_rules! instantiate_fsl {
($t:ty) => {{
let b = fixed_size_list::FixedSizeListGroupValueBuilder::<$t>::new(
data_type,
);
v.push(Box::new(b) as _);
}};
}
Comment on lines +90 to +103
let list_array = array
.as_any()
.downcast_ref::<FixedSizeListArray>()
.expect("FixedSizeListGroupValueBuilder called with non-FixedSizeList array");
let rhs_sublist: ArrayRef = list_array.value(rhs_row);
let list_len = self.list_len_usize();
let lhs_base = lhs_row * list_len;
for j in 0..list_len {
if !self.child.equal_to(lhs_base + j, &rhs_sublist, j) {
return false;
}
}
true
}
Comment on lines +73 to +75
fn list_len_usize(&self) -> usize {
self.list_len as usize
}
… size, drop per-compare allocation in equal_to

Four Copilot inline comments handled together:

* `group_column_supported_type` and `make_group_column` both now reject
  `FixedSizeList` with a negative `list_size`. Arrow defines that size
  as non-negative, but the enum lets callers construct a negative value
  programmatically; without this guard the `i32 as usize` cast inside
  the builder would wrap and risk panics / OOM. Guards are duplicated
  on both sides so direct callers of `make_group_column` (and the
  `supported_schema` planner gate) both fail safely. Consistency fuzz
  already covers the `make_group_column` side via the existing
  unsupported_cases path; an explicit dedicated test below covers both
  surfaces.

* `FixedSizeListGroupValueBuilder::new` asserts `list_len >= 0` so any
  direct construction that bypasses the factory fails fast with a
  clear message instead of silently producing a broken builder.
  `list_len_usize` now uses `usize::try_from(...).expect(...)` instead
  of `as usize`; with the constructor's assertion the conversion is
  provably infallible, but going through `try_from` means a future
  invariant break shows up as an explicit panic rather than a silent
  wrap.

* `equal_to` no longer calls `list_array.value(rhs_row)`, which
  constructs a sliced child `ArrayRef` on every comparison. The hot
  path now borrows `list_array.values()` and uses `value_offset(rhs_row)`
  to compute the child base index, mirroring the existing approach in
  `append_val`. Zero allocations per equality check.

New test: `negative_list_size_is_rejected_by_allow_list_and_dispatcher`
locks in the negative-size guard at both surfaces. 13/13 FSL tests
pass, 128/128 aggregates pass, clippy clean.
@zhuqi-lucas

Copy link
Copy Markdown
Contributor Author

Addressed all four Copilot comments in 29114c3:

Comment Fix
mod.rs:945 — allow-list missing negative-size check group_column_supported_type now rejects FixedSizeList(_, n) with n < 0 before falling through to the inner-type matcher.
mod.rs:1103 — dispatcher missing negative-size check make_group_column returns not_impl_err! for FixedSizeList(_, n < 0) (defense in depth so direct callers also fail safely).
fixed_size_list.rs:75as usize can wrap Constructor now assert!(list_len >= 0) with a clear message; list_len_usize switched to usize::try_from(...).expect(...) so a future invariant break surfaces as a panic, not a silent wrap.
fixed_size_list.rs:103value(rhs_row) per compare equal_to now borrows list_array.values() + value_offset(rhs_row) like append_val does; zero allocations per equality check on the grouping hot path.

New test negative_list_size_is_rejected_by_allow_list_and_dispatcher locks in the guard at both surfaces. 13/13 FSL tests pass, 128/128 aggregates pass, clippy clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants