[core] Align full-text search with JSON DSL by JingsongLi · Pull Request #8308 · apache/paimon

JingsongLi · 2026-06-21T03:00:55Z

Summary

Align Paimon full-text search with the LanceDB-style FTS query DSL and make JSON DSL the public full-text search API surface. This covers Java, PyPaimon, Spark SQL table-valued functions, Tantivy JNI parsing, hybrid full-text routes, docs, and index scan correctness.

Changes

Add structured FullTextQuery JSON support for match, match_phrase / phrase, boost, multi_match, and boolean.
Replace the old full-text builder / Spark TVF shape with JSON DSL inputs, including hybrid full-text routes using DSL strings.
Execute multi_match, cross-column boolean queries, and boost demotion by composing per-column full-text index reads in Paimon's read layer.
Read full leaf candidates before applying final top-k for compound queries so final scoring is applied to the full candidate set.
Filter Java and PyPaimon full-text scans to full-text index types, so other indexes on the same column are not mixed into full-text splits.
Mirror the DSL in PyPaimon and update CLI, examples, docs, and tests.
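
The composition and top-k ordering described in the list above can be illustrated with a minimal sketch. This is not Paimon's implementation: the function names, the row-id-to-score dictionaries, and the additive score combination are assumptions made for illustration. It shows why compound queries read the full leaf candidates first: truncating each per-column leaf to its own top-k can drop a row that only ranks first once its scores from several columns are combined.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical leaf result: row id -> score from one single-column index read.
LeafScores = Dict[int, float]


def _best(scores: LeafScores, k: int) -> List[Tuple[int, float]]:
    """Return the k highest-scoring rows, breaking ties by row id."""
    return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


def combine_then_top_k(leaves: List[LeafScores], k: int) -> List[Tuple[int, float]]:
    """Sum per-column scores over the full candidate set, then keep the best k."""
    combined: Dict[int, float] = {}
    for leaf in leaves:
        for row_id, score in leaf.items():
            combined[row_id] = combined.get(row_id, 0.0) + score
    return _best(combined, k)


def top_k_then_combine(leaves: List[LeafScores], k: int) -> List[Tuple[int, float]]:
    """Truncate each leaf to its own top-k first (the ordering to avoid)."""
    truncated = [dict(_best(leaf, k)) for leaf in leaves]
    return combine_then_top_k(truncated, k)


# Row 2 is second in each column but first once both columns are combined.
title = {1: 1.0, 2: 0.5}
body = {3: 0.875, 2: 0.75}

print(combine_then_top_k([title, body], 1))  # row 2 wins with the combined score
print(top_k_then_combine([title, body], 1))  # row 2 was dropped before combining
```

With these made-up scores, the first call returns row 2 and the second returns row 1, which is the discrepancy that applying the final top-k to the full candidate set avoids.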

Testing

mvn -pl paimon-common,paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test
mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
mvn -pl paimon-common,paimon-core,paimon-tantivy/paimon-tantivy-index -DskipTests -DfailIfNoTests=false spotless:check
python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q
python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text
cargo check in paimon-tantivy/paimon-tantivy-jni/rust
mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test
mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.catalyst.plans.logical.VectorSearchQueryTest -Dtest=none test
mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common,paimon-spark/paimon-spark-ut -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.sql.FullTextSearchTest,org.apache.paimon.spark.sql.HybridSearchTest -Dtest=none test
git diff --check

Notes

This is a breaking API change for the unreleased full-text search API. multi_match and cross-column boolean queries are supported by composing per-column full-text index reads in Paimon's read layer; leaf execution still uses existing single-column full-text indexes.

One earlier parallel Maven run failed with a transient maven-remote-resources-plugin resource-copy error while another Maven command was running against the same workspace. The same related test command passed when rerun serially.

leaves12138

Thanks for the update. The structured DSL itself looks generally aligned across Java/Python/Spark/Tantivy, and the main Java/Python targeted tests pass locally. I found one correctness issue that I think should be fixed before merging.

Blocker: FullTextScanImpl currently filters global index files only by globalIndex.indexFieldId() and then groups all files for that column/range into the same FullTextSearchSplit. Global index identity elsewhere includes both index_type and indexed fields (for example drop_global_index filters by entry.indexFile().indexType() plus getIndexedFieldIds()), so a table can have another global index type on the same text column. In that case full-text search can pick up non-full-text index files and pass a mixed file list to the full-text reader.

I reproduced this locally by adding a test-only btree global index on the same content column in FullTextSearchBuilderTest; executeLocal() fails with IllegalArgumentException: Expected exactly one index file per shard from the full-text reader because the split contains both the full-text and btree files for the same row range. The Tantivy reader has the same one-file-per-shard assumption, so this is not only a test-index artifact.

Could we filter the scan to full-text-capable index types before grouping (or otherwise carry/select the intended full-text index type), and add a regression test for "same column has full-text + another global index"?
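
A minimal sketch of the requested "filter before grouping" behavior, written in Python for brevity. The `IndexFile` class and `group_full_text_files` function are hypothetical stand-ins, not Paimon classes; only the index type names and file names are taken from this thread. It shows that once non-full-text index types are dropped before grouping, each (field, row range) group contains a single file, which is what the one-file-per-shard readers expect.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class IndexFile:
    """Hypothetical stand-in for a global index manifest entry."""
    name: str
    index_type: str
    field_id: int
    row_range: Tuple[int, int]


GroupKey = Tuple[int, Tuple[int, int]]


def group_full_text_files(
    files: Iterable[IndexFile],
    text_field_ids: Set[int],
    full_text_types: Set[str],
) -> Dict[GroupKey, List[IndexFile]]:
    """Group files per (field, row range), keeping only full-text index types."""
    groups: Dict[GroupKey, List[IndexFile]] = defaultdict(list)
    for f in files:
        if f.field_id not in text_field_ids:
            continue  # index on a column the query does not touch
        if f.index_type not in full_text_types:
            continue  # e.g. a btree index on the same text column
        groups[(f.field_id, f.row_range)].append(f)
    return dict(groups)


files = [
    IndexFile("ft.index", "tantivy-fulltext", 1, (0, 99)),
    IndexFile("btree.index", "btree", 1, (0, 99)),
]
splits = group_full_text_files(files, {1}, {"tantivy-fulltext"})
print(splits)
```

Filtering only by `field_id` would put both files in the same group; the extra index type check leaves just `ft.index`.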

Non-blocking compatibility question: this PR removes the old Java builder shape and Spark TVF signature in favor of JSON DSL only. If those APIs are considered user-facing, it would be safer to keep deprecated wrappers for the old (column, query_text, operator) form and translate them to FullTextQuery.match(...).
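
For illustration only, such a deprecated wrapper could look roughly like the sketch below, written in Python. The JSON keys (`match`, `column`, `query`, `operator`) and both function names are assumptions and are not the schema or API introduced by this PR; the point is only that the old three-argument form can be translated mechanically into a single match query.

```python
import json
import warnings


def match_query(column: str, text: str, operator: str = "OR") -> str:
    """Build a DSL string for a single-column match query.

    The key names are hypothetical; the real DSL schema may differ.
    """
    body = {"match": {"column": column, "query": text, "operator": operator}}
    return json.dumps(body, sort_keys=True)


def full_text_search(column: str, query_text: str, operator: str = "OR") -> str:
    """Hypothetical deprecated wrapper for the old (column, query_text, operator) form."""
    warnings.warn(
        "full_text_search(column, query_text, operator) is deprecated; "
        "pass a JSON DSL query instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return match_query(column, query_text, operator)


print(full_text_search("content", "paimon", "AND"))
```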

Local validation:

mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed.
python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).
mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully, but the Tantivy test class was skipped in this environment (11 skipped).

JingsongLi · 2026-06-22T02:30:02Z

Addressed the blocker in 2e325b67e: full-text scans now only include global index types whose factory declares full-text support, and the Tantivy/test full-text factories opt in. I also added a regression test covering a full-text index and a btree index on the same content column; it failed before the fix with Expected exactly one index file per shard and now passes.

Validation run locally:
- mvn -pl paimon-common,paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test
- mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
- mvn -pl paimon-common,paimon-core,paimon-tantivy/paimon-tantivy-index -DskipTests -DfailIfNoTests=false spotless:check
- git diff --check

leaves12138

Thanks for the update. The Java-side blocker from my previous review is fixed: FullTextScanImpl now filters index files via GlobalIndexerFactory.supportsFullTextSearch(), and the new Java regression test covers the same-column btree/full-text case.

I found the same issue still exists in PyPaimon. pypaimon/table/source/full_text_scan.py still filters only by global_index_meta.index_field_id in text_column_ids, so a btree/bitmap/etc. global index on the same text column is grouped into FullTextSearchSplit.full_text_index_files. I verified this with a small local PyPaimon test using one tantivy-fulltext entry and one btree entry for the same content column/range; the split contained both ft.index and btree.index. This can later trip the Python Tantivy reader's assert len(io_metas) == 1 or otherwise pass a non-full-text file to the full-text reader.

Could we also filter PyPaimon's full-text scan to tantivy-fulltext (or a Python equivalent of the Java full-text-capable index type check) and add the same regression coverage there?

Local validation on the latest head:

mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextSearchBuilderTest now runs 16 tests).
python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).

JingsongLi · 2026-06-22T02:42:52Z

Addressed the PyPaimon blocker in 3ceef0054: FullTextScanImpl now filters scanned index files to tantivy-fulltext, matching the only full-text reader supported by PyPaimon. I added a regression test with tantivy-fulltext and btree entries on the same content column/range; it failed before the fix with both files in the split and now passes.

Validation run locally:
- python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text
- python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q
- git diff --check

leaves12138

Thanks for the fix. I rechecked the latest head and the previous blockers are addressed now:

Java full-text scan filters by full-text-capable index factories and has regression coverage for another index type on the same column.
PyPaimon full-text scan now filters to tantivy-fulltext and has the matching regression test.

Local validation:

mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextQueryTest: 5 tests, FullTextSearchBuilderTest: 16 tests).
python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k 'full_text_scan_ignores_other_index_types_on_same_column or full_text' passed (9 passed).
mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully; the Tantivy test class is still skipped in this environment (11 skipped).

LGTM.

[core] Add structured full-text query DSL

69ef783

JingsongLi marked this pull request as draft June 21, 2026 03:19

JingsongLi added 5 commits June 21, 2026 21:17

[core] Align full-text query DSL with LanceDB

bca0a08

[core] Use full-text DSL for hybrid routes

f5f6901

[docs] Document full-text query DSL

0f5c743

[core] Canonicalize full-text query JSON

060541c

[core] Fix full-text DSL test formatting

8bc75ff

JingsongLi marked this pull request as ready for review June 21, 2026 15:06

leaves12138 requested changes Jun 21, 2026

[core] Filter full-text scan index types

2e325b6

JingsongLi changed the title ~~[core] Add structured full-text query DSL~~ [core] Align full-text search with JSON DSL Jun 22, 2026

leaves12138 requested changes Jun 22, 2026

[python] Filter full-text scan index types

3ceef00

leaves12138 approved these changes Jun 22, 2026

JingsongLi merged commit 84d5acb into apache:master Jun 22, 2026
18 of 19 checks passed

JingsongLi merged 8 commits into apache:master from JingsongLi:codex/full-text-query-dsl
