[core] Align full-text search with JSON DSL #8308

Merged
JingsongLi merged 8 commits into apache:master from JingsongLi:codex/full-text-query-dsl
Jun 22, 2026

@JingsongLi JingsongLi commented Jun 21, 2026

Contributor

Summary

Align Paimon full-text search with the LanceDB-style FTS query DSL and make JSON DSL the public full-text search API surface. This covers Java, PyPaimon, Spark SQL table-valued functions, Tantivy JNI parsing, hybrid full-text routes, docs, and index scan correctness.

Changes

  • Add structured FullTextQuery JSON support for match, match_phrase / phrase, boost, multi_match, and boolean.
  • Replace the old full-text builder / Spark TVF shape with JSON DSL inputs, including hybrid full-text routes using DSL strings.
  • Execute multi_match, cross-column boolean queries, and boost demotion by composing per-column full-text index reads in Paimon's read layer.
  • Read full leaf candidates before applying final top-k for compound queries so final scoring is applied to the full candidate set.
  • Filter Java and PyPaimon full-text scans to full-text index types, so other indexes on the same column are not mixed into full-text splits.
  • Mirror the DSL in PyPaimon and update CLI, examples, docs, and tests.
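
The point about reading full leaf candidates before applying top-k can be illustrated with a minimal Python sketch. It is not Paimon's implementation: the `combine_then_top_k` helper, the `(row_id, score)` candidate shape, and the choice to sum boosted per-column scores are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical stand-in for per-column full-text index reads:
# column name -> list of (row_id, score) candidates.
LeafResults = Dict[str, List[Tuple[int, float]]]


def combine_then_top_k(
    leaf_results: LeafResults, boosts: Dict[str, float], k: int
) -> List[Tuple[int, float]]:
    """Score every leaf candidate first, then keep the best k rows."""
    combined: Dict[int, float] = defaultdict(float)
    for column, candidates in leaf_results.items():
        weight = boosts.get(column, 1.0)  # assumed: a boost multiplies the leaf score
        for row_id, score in candidates:
            combined[row_id] += weight * score  # assumed: scores are summed per row
    # Top-k runs only after all candidates are scored; ties go to the lower row id.
    return heapq.nlargest(k, combined.items(), key=lambda item: (item[1], -item[0]))


leaves = {
    "title": [(1, 3.0), (2, 2.5)],
    "content": [(3, 2.0), (2, 1.0), (1, 0.1)],
}

# Full candidate set: row 2 wins because it matches well in both columns.
print(combine_then_top_k(leaves, {}, k=1))  # [(2, 3.5)]

# Truncating each leaf to k=1 first would drop row 2 from both columns.
truncated = {
    col: sorted(rows, key=lambda r: -r[1])[:1] for col, rows in leaves.items()
}
print(combine_then_top_k(truncated, {}, k=1))  # [(1, 3.0)]
```

In this toy data, cutting each column to its own top-1 before combining returns row 1, while scoring the full candidate set returns row 2, which is the kind of discrepancy the change is meant to avoid.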

Testing

  • mvn -pl paimon-common,paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
  • mvn -pl paimon-common,paimon-core,paimon-tantivy/paimon-tantivy-index -DskipTests -DfailIfNoTests=false spotless:check
  • python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q
  • python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text
  • cargo check in paimon-tantivy/paimon-tantivy-jni/rust
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test
  • mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.catalyst.plans.logical.VectorSearchQueryTest -Dtest=none test
  • mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common,paimon-spark/paimon-spark-ut -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.sql.FullTextSearchTest,org.apache.paimon.spark.sql.HybridSearchTest -Dtest=none test
  • git diff --check

Notes

This is a breaking API change for the unreleased full-text search API. multi_match and cross-column boolean queries are supported by composing per-column full-text index reads in Paimon's read layer; leaf execution still uses existing single-column full-text indexes.

One earlier parallel Maven run failed with a transient maven-remote-resources-plugin resource-copy error while another Maven command was running against the same workspace. The same related test command passed when rerun serially.

@JingsongLi JingsongLi marked this pull request as draft June 21, 2026 03:19
@JingsongLi JingsongLi marked this pull request as ready for review June 21, 2026 15:06

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the update. The structured DSL itself looks generally aligned across Java/Python/Spark/Tantivy, and the main Java/Python targeted tests pass locally. I found one correctness issue that I think should be fixed before merging.

Blocker: FullTextScanImpl currently filters global index files only by globalIndex.indexFieldId() and then groups all files for that column/range into the same FullTextSearchSplit. Global index identity elsewhere includes both index_type and indexed fields (for example drop_global_index filters by entry.indexFile().indexType() plus getIndexedFieldIds()), so a table can have another global index type on the same text column. In that case full-text search can pick up non-full-text index files and pass a mixed file list to the full-text reader.

I reproduced this locally by adding a test-only btree global index on the same content column in FullTextSearchBuilderTest; executeLocal() fails with IllegalArgumentException: Expected exactly one index file per shard from the full-text reader because the split contains both the full-text and btree files for the same row range. The Tantivy reader has the same one-file-per-shard assumption, so this is not only a test-index artifact.

Could we filter the scan to full-text-capable index types before grouping (or otherwise carry/select the intended full-text index type), and add a regression test for "same column has full-text + another global index"?
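
A minimal Python sketch of the requested "filter before grouping" behaviour follows. The `IndexFile` dataclass and `group_full_text_files` function are hypothetical stand-ins, not the actual FullTextScanImpl code; only the index type names `tantivy-fulltext` and `btree` come from this discussion.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class IndexFile:
    """Hypothetical stand-in for one global index manifest entry."""
    name: str
    index_type: str
    field_id: int
    row_range: Tuple[int, int]


def group_full_text_files(
    files: Iterable[IndexFile],
    text_field_ids: Set[int],
    full_text_types: Set[str],
) -> Dict[Tuple[int, Tuple[int, int]], List[IndexFile]]:
    """Group index files per (field, row range), keeping full-text types only."""
    groups: Dict[Tuple[int, Tuple[int, int]], List[IndexFile]] = defaultdict(list)
    for f in files:
        if f.field_id not in text_field_ids:
            continue  # not one of the queried text columns
        if f.index_type not in full_text_types:
            continue  # another index type on the same column, e.g. btree
        groups[(f.field_id, f.row_range)].append(f)
    return dict(groups)


files = [
    IndexFile("ft.index", "tantivy-fulltext", field_id=1, row_range=(0, 99)),
    IndexFile("btree.index", "btree", field_id=1, row_range=(0, 99)),
]
groups = group_full_text_files(files, {1}, {"tantivy-fulltext"})
# Each shard now holds exactly one file, as the full-text reader expects.
assert all(len(shard) == 1 for shard in groups.values())
```

Without the `index_type` check, both files would land in the same group and the one-file-per-shard assumption would fail.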

Non-blocking compatibility question: this PR removes the old Java builder shape and Spark TVF signature in favor of JSON DSL only. If those APIs are considered user-facing, it would be safer to keep deprecated wrappers for the old (column, query_text, operator) form and translate them to FullTextQuery.match(...).
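
As a rough sketch of what such a deprecated wrapper could look like on the Python side: the function name `legacy_full_text_query` and the JSON field names (`column`, `query`, `operator`) are assumptions for illustration and are not taken from the PR's actual DSL schema.

```python
import json
import warnings


def legacy_full_text_query(column: str, query_text: str, operator: str = "OR") -> str:
    """Translate the old (column, query_text, operator) form into a JSON DSL string.

    Hypothetical helper: the JSON layout below is an assumed shape for a
    `match` query, not the schema defined by this PR.
    """
    warnings.warn(
        "(column, query_text, operator) is deprecated; pass a JSON DSL query instead",
        DeprecationWarning,
        stacklevel=2,
    )
    normalized = operator.upper()
    if normalized not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return json.dumps(
        {"match": {"column": column, "query": query_text, "operator": normalized}}
    )
```

A caller using the old form would keep working while seeing a `DeprecationWarning`, and the returned string could then be handed to whatever entry point accepts the JSON DSL.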

Local validation:

  • mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed.
  • python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully, but the Tantivy test class was skipped in this environment (11 skipped).

@JingsongLi JingsongLi changed the title from "[core] Add structured full-text query DSL" to "[core] Align full-text search with JSON DSL" Jun 22, 2026
@JingsongLi

Contributor Author

Addressed the blocker in 2e325b67e: full-text scans now only include global index types whose factory declares full-text support, and the Tantivy/test full-text factories opt in. I also added a regression test covering a full-text index and a btree index on the same content column; it failed before the fix with Expected exactly one index file per shard and now passes.

Validation run locally:

  • mvn -pl paimon-common,paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
  • mvn -pl paimon-common,paimon-core,paimon-tantivy/paimon-tantivy-index -DskipTests -DfailIfNoTests=false spotless:check
  • git diff --check

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the update. The Java-side blocker from my previous review is fixed: FullTextScanImpl now filters index files via GlobalIndexerFactory.supportsFullTextSearch(), and the new Java regression test covers the same-column btree/full-text case.

I found the same issue still exists in PyPaimon. pypaimon/table/source/full_text_scan.py still filters only by global_index_meta.index_field_id in text_column_ids, so a btree/bitmap/etc. global index on the same text column is grouped into FullTextSearchSplit.full_text_index_files. I verified this with a small local PyPaimon test using one tantivy-fulltext entry and one btree entry for the same content column/range; the split contained both ft.index and btree.index. This can later trip the Python Tantivy reader's assert len(io_metas) == 1 or otherwise pass a non-full-text file to the full-text reader.

Could we also filter PyPaimon's full-text scan to tantivy-fulltext (or a Python equivalent of the Java full-text-capable index type check) and add the same regression coverage there?

Local validation on the latest head:

  • mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextSearchBuilderTest now runs 16 tests).
  • python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).

@JingsongLi

Contributor Author

Addressed the PyPaimon blocker in 3ceef0054: FullTextScanImpl now filters scanned index files to tantivy-fulltext, matching the only full-text reader supported by PyPaimon. I added a regression test with tantivy-fulltext and btree entries on the same content column/range; it failed before the fix with both files in the split and now passes.

Validation run locally:

  • python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text
  • python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q
  • git diff --check

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the fix. I rechecked the latest head and the previous blockers are addressed now:

  • Java full-text scan filters by full-text-capable index factories and has regression coverage for another index type on the same column.
  • PyPaimon full-text scan now filters to tantivy-fulltext and has the matching regression test.

Local validation:

  • mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextQueryTest: 5 tests, FullTextSearchBuilderTest: 16 tests).
  • python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k 'full_text_scan_ignores_other_index_types_on_same_column or full_text' passed (9 passed).
  • mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully; the Tantivy test class is still skipped in this environment (11 skipped).

LGTM.

@JingsongLi JingsongLi merged commit 84d5acb into apache:master Jun 22, 2026
18 of 19 checks passed

2 participants