[core] Align full-text search with JSON DSL #8308
leaves12138
left a comment
Thanks for the update. The structured DSL itself looks generally aligned across Java/Python/Spark/Tantivy, and the main Java/Python targeted tests pass locally. I found one correctness issue that I think should be fixed before merging.
Blocker: FullTextScanImpl currently filters global index files only by globalIndex.indexFieldId() and then groups all files for that column/range into the same FullTextSearchSplit. Global index identity elsewhere includes both index_type and indexed fields (for example drop_global_index filters by entry.indexFile().indexType() plus getIndexedFieldIds()), so a table can have another global index type on the same text column. In that case full-text search can pick up non-full-text index files and pass a mixed file list to the full-text reader.
I reproduced this locally by adding a test-only btree global index on the same content column in FullTextSearchBuilderTest; executeLocal() fails with IllegalArgumentException: Expected exactly one index file per shard from the full-text reader because the split contains both the full-text and btree files for the same row range. The Tantivy reader has the same one-file-per-shard assumption, so this is not only a test-index artifact.
Could we filter the scan to full-text-capable index types before grouping (or otherwise carry/select the intended full-text index type), and add a regression test for "same column has full-text + another global index"?
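A minimal sketch of the requested "filter before grouping" step, written in Python for brevity. `IndexFile`, `FULL_TEXT_INDEX_TYPES`, and `group_full_text_files` are hypothetical names for illustration, not Paimon classes; only the `tantivy-fulltext` and `btree` type names come from this thread.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical stand-in for a global index file entry (not Paimon's real class).
@dataclass(frozen=True)
class IndexFile:
    name: str
    index_type: str
    field_id: int
    row_range: tuple


# Assumption: the index types that are able to serve full-text queries.
FULL_TEXT_INDEX_TYPES = {"tantivy-fulltext"}


def group_full_text_files(files, text_field_ids):
    """Group files by (field id, row range), keeping full-text index files only."""
    groups = defaultdict(list)
    for f in files:
        if f.field_id not in text_field_ids:
            continue  # index on a column the query does not touch
        if f.index_type not in FULL_TEXT_INDEX_TYPES:
            continue  # e.g. a btree index on the same text column
        groups[(f.field_id, f.row_range)].append(f)
    return dict(groups)
```

With one `tantivy-fulltext` file and one `btree` file for the same column and row range, only the full-text file reaches the group, so a one-file-per-shard reader would see a single file.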
Non-blocking compatibility question: this PR removes the old Java builder shape and Spark TVF signature in favor of JSON DSL only. If those APIs are considered user-facing, it would be safer to keep deprecated wrappers for the old (column, query_text, operator) form and translate them to FullTextQuery.match(...).
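One possible shape for such a wrapper, sketched in Python rather than Java. Everything here is an assumption for illustration: `match_query` stands in for FullTextQuery.match(...), `full_text_search` stands in for the old entry point, and the dictionary layout is a placeholder, not the PR's actual JSON DSL.

```python
import warnings


def match_query(column, query_text, operator="OR"):
    # Placeholder for FullTextQuery.match(...); the key names are hypothetical.
    return {"match": {"column": column, "query": query_text, "operator": operator}}


def full_text_search(column, query_text, operator="OR"):
    """Deprecated (column, query_text, operator) form, kept as a thin wrapper."""
    warnings.warn(
        "full_text_search(column, query_text, operator) is deprecated; "
        "build a structured query instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # Translate the old arguments into the structured query and delegate.
    return match_query(column, query_text, operator)
```

The wrapper carries no logic of its own, so the old and new forms cannot drift apart while the deprecation period lasts.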
Local validation:
- mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed.
- python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).
- mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully, but the Tantivy test class was skipped in this environment (11 skipped).
Addressed the blocker in |
leaves12138
left a comment
Thanks for the update. The Java-side blocker from my previous review is fixed: FullTextScanImpl now filters index files via GlobalIndexerFactory.supportsFullTextSearch(), and the new Java regression test covers the same-column btree/full-text case.
I found the same issue still exists in PyPaimon. pypaimon/table/source/full_text_scan.py still filters only by global_index_meta.index_field_id in text_column_ids, so a btree/bitmap/etc. global index on the same text column is grouped into FullTextSearchSplit.full_text_index_files. I verified this with a small local PyPaimon test using one tantivy-fulltext entry and one btree entry for the same content column/range; the split contained both ft.index and btree.index. This can later trip the Python Tantivy reader's assert len(io_metas) == 1 or otherwise pass a non-full-text file to the full-text reader.
Could we also filter PyPaimon's full-text scan to tantivy-fulltext (or a Python equivalent of the Java full-text-capable index type check) and add the same regression coverage there?
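A sketch of what the regression coverage could look like on the Python side. `Entry`, `collect_split_files`, and `open_full_text_reader` are hypothetical stand-ins rather than PyPaimon's real classes; the reader only mimics the one-file-per-shard assertion described above.

```python
from collections import namedtuple

# Hypothetical index entry; field names are illustrative only.
Entry = namedtuple("Entry", ["file_name", "index_type", "index_field_id"])


def collect_split_files(entries, text_column_ids, full_text_only):
    """Pick the index files that would end up in one full-text split."""
    selected = []
    for e in entries:
        if e.index_field_id not in text_column_ids:
            continue
        if full_text_only and e.index_type != "tantivy-fulltext":
            continue  # skip btree/bitmap/etc. on the same column
        selected.append(e.file_name)
    return selected


def open_full_text_reader(io_metas):
    # Mimics the reader's one-file-per-shard assumption.
    assert len(io_metas) == 1, "expected exactly one index file per shard"
    return io_metas[0]


def test_full_text_scan_ignores_other_index_types_on_same_column():
    entries = [
        Entry("ft.index", "tantivy-fulltext", 7),
        Entry("btree.index", "btree", 7),
    ]
    # Filtering by field id alone reproduces the mixed split.
    assert collect_split_files(entries, {7}, False) == ["ft.index", "btree.index"]
    # Filtering by index type leaves exactly the full-text file.
    filtered = collect_split_files(entries, {7}, True)
    assert open_full_text_reader(filtered) == "ft.index"
```

The first assertion documents the bug and the second documents the fix, so the test fails if the type filter is ever dropped.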
Local validation on the latest head:
- mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextSearchBuilderTest now runs 16 tests).
- python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text passed (8 passed).
Addressed the PyPaimon blocker in |
leaves12138
left a comment
Thanks for the fix. I rechecked the latest head and the previous blockers are addressed now:
- Java full-text scan filters by full-text-capable index factories and has regression coverage for another index type on the same column.
- PyPaimon full-text scan now filters to tantivy-fulltext and has the matching regression test.
Local validation:
- mvn -pl paimon-common,paimon-core -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test passed (FullTextQueryTest: 5 tests, FullTextSearchBuilderTest: 16 tests).
- python3 -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k 'full_text_scan_ignores_other_index_types_on_same_column or full_text' passed (9 passed).
- mvn -pl paimon-tantivy/paimon-tantivy-index -am -DskipITs -Dcheckstyle.skip -Drat.skip -Dspotless.check.skip=true -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test built successfully; the Tantivy test class is still skipped in this environment (11 skipped).
LGTM.
Summary
Align Paimon full-text search with the LanceDB-style FTS query DSL and make JSON DSL the public full-text search API surface. This covers Java, PyPaimon, Spark SQL table-valued functions, Tantivy JNI parsing, hybrid full-text routes, docs, and index scan correctness.
Changes
- FullTextQuery JSON support for match, match_phrase/phrase, boost, multi_match, and boolean.
- Support for multi_match, cross-column boolean queries, and boost demotion by composing per-column full-text index reads in Paimon's read layer.
Testing
- mvn -pl paimon-common,paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest=FullTextQueryTest,FullTextSearchBuilderTest test
- mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DskipTests -DfailIfNoTests=false compile
- mvn -pl paimon-common,paimon-core,paimon-tantivy/paimon-tantivy-index -DskipTests -DfailIfNoTests=false spotless:check
- python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q
- python -m pytest paimon-python/pypaimon/tests/vector_search_filter_test.py -q -k full_text
- cargo check in paimon-tantivy/paimon-tantivy-jni/rust
- mvn -pl paimon-tantivy/paimon-tantivy-index -am -Pfast-build -DfailIfNoTests=false -Dtest=TantivyFullTextGlobalIndexTest test
- mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.catalyst.plans.logical.VectorSearchQueryTest -Dtest=none test
- mvn -pl paimon-spark/paimon-spark-common,paimon-spark/paimon-spark3-common,paimon-spark/paimon-spark-ut -am -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.sql.FullTextSearchTest,org.apache.paimon.spark.sql.HybridSearchTest -Dtest=none test
- git diff --check
Notes
- This is a breaking API change for the unreleased full-text search API.
- multi_match and cross-column boolean queries are supported by composing per-column full-text index reads in Paimon's read layer; leaf execution still uses existing single-column full-text indexes.
- One earlier parallel Maven run failed with a transient maven-remote-resources-plugin resource-copy error while another Maven command was running against the same workspace. The same related test command passed when rerun serially.
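To illustrate the note on composing per-column reads, here is a rough Python sketch of a multi_match built from single-column results. The function name, the input shape (column name mapped to row id and score), and the choice to keep the highest score per row are all assumptions for illustration; the PR's actual scoring and merge rules may differ.

```python
def multi_match(per_column_hits):
    """Merge per-column hits ({column: {row_id: score}}) into one ranked list.

    Assumption: a row matched in several columns keeps its highest score.
    """
    combined = {}
    for hits in per_column_hits.values():
        for row_id, score in hits.items():
            previous = combined.get(row_id, float("-inf"))
            combined[row_id] = max(score, previous)
    # Rank by descending score, breaking ties by row id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Each inner dict is what a single-column full-text index read might return.
ranked = multi_match({
    "title": {1: 2.0, 2: 0.5},
    "content": {2: 1.5, 3: 1.0},
})
# ranked == [(1, 2.0), (2, 1.5), (3, 1.0)]
```

Each leaf read stays a plain single-column lookup, which is why the existing per-column indexes can be reused unchanged.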