[spark] support lateral inner join for vector search by Stefanietry · Pull Request #8252 · apache/paimon

Stefanietry · 2026-06-16T08:03:55Z

Purpose
Support lateral inner join for vector search in Spark.
Linked issue: #8251

Tests
Added lateral-join vector search tests in org.apache.paimon.spark.SparkMultimodalITCase#testVector and org.apache.spark.sql.test.SQLTestUtils#test("lateral vector search preserves subquery alias qualifiers").

JingsongLi · 2026-06-16T08:20:27Z

+
+  override protected def doExecute(): RDD[InternalRow] = {
+    child.execute().mapPartitions {
+      outerRows =>


Can batch queries be supported? Batch queries are crucial for performance. You can take a look at the benchmark in https://github.com/apache/paimon-vector-index

Stefanietry · 2026-06-16

Thanks for the reminder. I'll rework it to run in batch mode later.
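For readers following the batching request, here is a minimal sketch of the idea, assuming a search entry point that accepts several query vectors per call. `Query`, `Hit`, and `BatchSearch` are hypothetical stand-ins, not Paimon or Spark types; the point is only that outer rows are grouped so the index is consulted once per group rather than once per row.

```scala
// Hypothetical stand-ins for an outer row and a search result.
final case class Query(id: Int, vector: Array[Float])
final case class Hit(queryId: Int, rowId: Long, score: Float)

object BatchedLateralSearch {
  // Assumed batch entry point: one call handles a whole group of queries.
  type BatchSearch = Seq[Query] => Seq[Hit]

  // Groups the outer rows of a partition and issues one search per group.
  def searchInBatches(
      outerRows: Iterator[Query],
      batchSize: Int,
      search: BatchSearch): Iterator[Hit] =
    outerRows.grouped(batchSize).flatMap(batch => search(batch))
}
```

With five outer rows and a batch size of two, the search function is invoked three times instead of five. Choosing the batch size is a trade-off between memory held per partition and the number of index calls.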

JingsongLi · 2026-06-16T15:11:23Z

Please fix test failures.

JingsongLi · 2026-06-23T06:05:53Z

+            _.toPaimonDataField)).asJava)
+      }
+    val sparkRow = SparkInternalRow.create(resultRowType)
+    val vectorSearchBuilder = innerTable


Normal Spark vector_search scans apply pushed partition/data filters before top-k (PaimonBaseScan.evalVectorSearch passes pushedPartitionFilters/pushedDataFilters into the builder). This lateral executor builds the BatchVectorSearchBuilder here without carrying predicates from the search side; PushDownLateralVectorSearchFilter only pushes predicates that reference the left child, so predicates on r.dt or other searched-table columns stay above LateralVectorSearch. A query like ... JOIN LATERAL (...) r WHERE r.dt = '20260420' will pick topK over all partitions and then filter the joined rows, which can return fewer or wrong rows compared with non-lateral vector_search(...) WHERE dt = .... Please preserve search-side filters and apply them via withPartitionFilter/withFilter before newVectorScan()/readBatch(), or reject such predicates explicitly.
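The ordering problem described above can be reproduced with a small in-memory sketch. `SearchRow` and both helper functions are hypothetical and do not touch the Paimon API; they only contrast picking top-k before the partition filter with picking it afterwards.

```scala
// Hypothetical searched-table row: id, partition value, distance to the query.
final case class SearchRow(gid: Int, dt: String, distance: Double)

object TopKFilterOrder {
  private def topK(rows: Seq[SearchRow], k: Int): Seq[SearchRow] =
    rows.sortBy(_.distance).take(k)

  // Behaviour the review warns about: top-k over all partitions, then filter.
  def topKThenFilter(rows: Seq[SearchRow], k: Int, dt: String): Seq[SearchRow] =
    topK(rows, k).filter(_.dt == dt)

  // Behaviour of the non-lateral path: filter first, then top-k.
  def filterThenTopK(rows: Seq[SearchRow], k: Int, dt: String): Seq[SearchRow] =
    topK(rows.filter(_.dt == dt), k)
}
```

If the two nearest rows belong to partition 20260419 and the query filters on 20260420 with k = 2, the first function returns nothing while the second returns the two nearest rows of the requested partition.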

JingsongLi · 2026-06-23T06:06:07Z

+
+    scan.plan().splits().asScala.iterator.flatMap {
+      split =>
+        val reader =


This reader is only closed when this inner iterator is fully exhausted. If a downstream operator short-circuits consumption, for example LIMIT 1/take, or if the task is interrupted, Spark can stop pulling rows before hasNext returns false, leaving the current PaimonRecordReaderIterator and its underlying RecordReader open. Please register a TaskContext completion listener or wrap the returned iterator so the current reader is closed on task completion/cancellation as well as normal exhaustion.
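One way to address this is to wrap the per-split readers in an iterator that can also be closed from outside. The sketch below uses only the Scala standard library; `ClosingIterator` is a hypothetical name and the reader type is any `java.io.Closeable`, not the actual PaimonRecordReaderIterator.

```scala
import java.io.Closeable

// Opens readers lazily one at a time, remembers the open one, and closes it
// on exhaustion or when close() is called early (e.g. from a task callback).
final class ClosingIterator[R <: Closeable, T](
    open: Iterator[() => R],
    rows: R => Iterator[T]) extends Iterator[T] with Closeable {

  private var closed = false
  private var currentReader: Option[R] = None
  private var currentRows: Iterator[T] = Iterator.empty

  private def closeCurrent(): Unit = {
    currentReader.foreach(_.close())
    currentReader = None
    currentRows = Iterator.empty
  }

  override def hasNext: Boolean = {
    // Advance to the next reader only while the current one is drained.
    while (!closed && !currentRows.hasNext && open.hasNext) {
      closeCurrent()
      val reader = open.next()()
      currentReader = Some(reader)
      currentRows = rows(reader)
    }
    val more = currentRows.hasNext
    if (!more) closeCurrent()
    more
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("no more rows")
    currentRows.next()
  }

  // Safe to call more than once; no further readers are opened afterwards.
  override def close(): Unit = {
    closed = true
    closeCurrent()
  }
}
```

In a Spark operator, `close()` could presumably be registered through `TaskContext.get().addTaskCompletionListener`, so that a `LIMIT 1` or a cancelled task still releases the reader that is open at that moment.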

JingsongLi · 2026-06-23T14:11:52Z

+    relation.output.filter(projectReferences.contains)
+  }
+
+  private def extractDynamicVectorSearch(plan: LogicalPlan)


Filters inside the lateral subquery are still not handled here. For example, FROM q, LATERAL (SELECT gid FROM vector_search('t', 'embs', q.embs, 5) WHERE dt = '20260608') r is resolved as a right plan containing Filter(..., DynamicVectorSearchRelation) (usually under Project). This extractor falls through to None, leaving the dynamic relation with an outer reference but no LateralVectorSearch physical path. Please extract Filter nodes here and append their conditions to searchFilters (or reject them explicitly) in addition to the outer-WHERE pushdown case.
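The requested extraction can be sketched on a toy plan tree. The node types below are simplified, hypothetical stand-ins for Catalyst's logical plan nodes (conditions are plain strings, and the projection is ignored), so this shows only the shape of the recursion, not the actual extractor.

```scala
// Toy plan nodes; names mirror the review comment but are not Catalyst classes.
sealed trait Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan
final case class DynamicVectorSearch(table: String) extends Plan
final case class Other(name: String) extends Plan

object ExtractSearch {
  // Returns the search relation together with every filter condition found
  // between the root of the lateral subquery and the relation.
  def extract(plan: Plan): Option[(DynamicVectorSearch, Seq[String])] =
    plan match {
      case s: DynamicVectorSearch => Some((s, Seq.empty))
      case Project(_, child)      => extract(child)
      case Filter(cond, child) =>
        extract(child).map { case (s, filters) => (s, filters :+ cond) }
      case _ => None // unsupported shape: the caller can reject it explicitly
    }
}
```

For a subquery shaped like `Project(Filter(DynamicVectorSearch))`, the extractor now yields the relation plus the collected `dt` condition instead of falling through to `None`.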

JingsongLi · 2026-06-23T14:12:01Z

+      val (pushDownToLeft, otherPredicates) = predicates.partition {
+        predicate => predicate.deterministic && predicate.references.subsetOf(lvs.child.outputSet)
+      }
+      val (pushDownToSearch, stayUp) = otherPredicates.partition {


This removes every deterministic predicate that references only search-side output, but convertSearchFilters() later throws if Spark cannot translate the rewritten expression into a Paimon predicate. A valid query such as WHERE r.gid + 1 > 10, or a filter on an expression alias from the lateral subquery, would now fail during execution instead of being evaluated above the lateral result. Please keep untranslatable predicates in stayUp, or restrict this pushdown to simple field predicates that are known to be convertible before dropping the upper filter.
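A sketch of the first suggestion, keeping untranslatable predicates in `stayUp`, is shown below. The expression types and `translate` are hypothetical; `translate` stands in for whatever conversion convertSearchFilters() performs and here accepts only a plain `field > literal` comparison.

```scala
// Toy expression tree, not Catalyst expressions.
sealed trait Expr
final case class Field(name: String) extends Expr
final case class Literal(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr

object SplitPredicates {
  // Hypothetical translator: only `field > literal` is convertible.
  def translate(e: Expr): Option[String] = e match {
    case GreaterThan(Field(name), Literal(value)) => Some(s"$name > $value")
    case _                                        => None
  }

  // Tries the translation first; only successfully translated predicates
  // are pushed down, everything else stays above the lateral search.
  def split(predicates: Seq[Expr]): (Seq[String], Seq[Expr]) = {
    val (pushed, stayUp) =
      predicates.map(p => p -> translate(p)).partition(_._2.isDefined)
    (pushed.flatMap(_._2), stayUp.map(_._1))
  }
}
```

Because the translation is attempted before the upper filter is dropped, a predicate such as `r.gid + 1 > 10` remains in the second list and is evaluated on the joined rows rather than failing at execution time.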

JingsongLi · 2026-06-24T00:49:07Z

  }
+
+  def hasOuterReference(argsWithoutTable: Seq[Expression]): Boolean = {
+    val queryVector = argsWithoutTable(1)


hasOuterReference accesses argsWithoutTable(1) before verifying the arity. For invalid calls such as vector_search('t', 'embs'), this now throws IndexOutOfBoundsException during resolution instead of the existing helpful vector_search needs three or four parameters... error from createVectorSearch/createDynamicVectorSearch. Please check the size before reading the query-vector argument.
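A minimal sketch of the arity-safe lookup, assuming a simplified argument type: `Arg` is hypothetical and replaces Catalyst's `Expression`, with a boolean flag standing in for the real outer-reference check.

```scala
object VectorSearchArgs {
  // Hypothetical argument: its SQL text and whether it refers to an outer column.
  final case class Arg(sql: String, isOuterReference: Boolean)

  // `lift` returns None when index 1 is absent, so a call with too few
  // arguments yields false and can reach the existing arity error later.
  def hasOuterReference(argsWithoutTable: Seq[Arg]): Boolean =
    argsWithoutTable.lift(1).exists(_.isOuterReference)
}
```

For `vector_search('t', 'embs')` the list without the table has a single element, so the function returns `false` instead of throwing IndexOutOfBoundsException.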

JingsongLi

+1

Stefanietry force-pushed the support_lateral_join_for_vector_search branch from 8aa9c09 to 774c9b6 Compare June 16, 2026 08:05

JingsongLi reviewed Jun 16, 2026

Stefanietry force-pushed the support_lateral_join_for_vector_search branch 2 times, most recently from 4697f65 to c23c76b Compare June 22, 2026 15:04

Stefanietry closed this Jun 22, 2026

Stefanietry reopened this Jun 22, 2026

Stefanietry force-pushed the support_lateral_join_for_vector_search branch 2 times, most recently from a1e3745 to 835b339 Compare June 23, 2026 05:38

JingsongLi reviewed Jun 23, 2026

Stefanietry force-pushed the support_lateral_join_for_vector_search branch 2 times, most recently from 07428ac to c9ee0dd Compare June 23, 2026 11:41

JingsongLi reviewed Jun 23, 2026

[spark] support lateral inner join for vector search

aa7dcc3

Stefanietry force-pushed the support_lateral_join_for_vector_search branch from c9ee0dd to aa7dcc3 Compare June 23, 2026 16:06

JingsongLi reviewed Jun 24, 2026

JingsongLi approved these changes Jun 24, 2026

JingsongLi merged commit ffaebae into apache:master Jun 24, 2026
13 checks passed

Stefanietry deleted the support_lateral_join_for_vector_search branch June 24, 2026 04:18
