release-8.5: add Lake docs #23156

Open
lilin90 wants to merge 45 commits into release-8.5 from feature/preview-cloud-lake

@lilin90 lilin90 commented Jun 26, 2026

Member

What is changed, added or deleted? (Required)

This PR adds a whole set of documentation for TiDB Cloud Lake public preview.

Note: Lake diagrams referenced in docs were already added to release-8.5/media/tidb-cloud-lake in advance.

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions.

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

shhdgit and others added 30 commits March 13, 2026 03:34
* *: add analytics service docs

* Update doc

* Update relative reference links

* Update docs links and variables

Standardize and fix internal documentation links and labels for the TiDB Cloud Lake docs. Updated TOC entries to shorter titles, fixed several relative links to point to /tidb-cloud-lake paths (load-from-local-file, load-from-remote-file, and file format references), cleaned up minor whitespace in the input-output-file-formats doc, and added new variables (lake, lake-short) in variables.json.

* Update file names and links

* Update format and fix typo

* Fix links and update file names

* Update cross reference links
* media: add 151 images and update reference links

* Add alt text for images

* Handle duplicated file names

* Fix internal links
* update tutorials

* tidb-cloud-lake: update guides

* tidb-cloud-lake: update sql

* Update tidb-cloud-lake/sql/input-output-file-formats.md

---------

Co-authored-by: Lilian Lee <lilin@pingcap.com>
* cloud-lake: standardize front matter in 1000+ Markdown files

* Update _index.md

* Normalize frontmatter summaries in guides

Standardize and simplify frontmatter summaries across multiple TiDB Cloud Lake guides: remove inline imports and stray markup, strip markdown emphasis and code ticks, and shorten/clarify several descriptions. Files changed: tidb-cloud-lake/guides/ai-powered-features.md (replace stray import string with concise summary), connect-using-dbeaver.md, connection-overview.md (limit to Databend Cloud), dashboards.md, data-purge-and-recycle.md (remove backticks around SQL commands), editions.md, full-text-index.md, and geo-analytics.md. These changes improve consistency and readability of guide metadata.

* Update how-json-variant-works.md

* Remove special format or punctuation characters

* Clean up frontmatter summaries in guides

Update frontmatter 'summary' fields across several TiDB Cloud Lake guides to be concise and descriptive. Removed embedded SQL/example snippets and stray JS import lines that were accidentally placed in summaries, and shortened redundant text for clarity. Affected files: tidb-cloud-lake/guides/query-parquet-files-in-stage.md, query-tsv-files-in-stage.md, sql-analytics.md, stage-overview.md, warehouse.md, and worksheet.md.

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Lilian Lee <lilin@pingcap.com>

* Update between.md

* Update summary of four files

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* tidb-cloud-lake: update sql

* tidb-cloud-lake: inline external content

Replace ComponentContent transclusions and LanguageFileParse/LanguageDocs components with their inlined English content.

* tidb-cloud-lake: convert structural components to Gatsby-compatible Markdown

* tidb-cloud-lake: convert inline/decorative components

* fix
* docs(tidb-cloud-lake): align first-level headings with front matter titles

* Update filenames

* Update anchor links
* cloud-lake: fix markdownlint violations

* Update recovery-from-operational-errors.md

---------

Co-authored-by: lilin90 <lilin@pingcap.com>
* cloud-lake: fix markdownlint violations

* fix

* fix

* fix
* docs(tidb-cloud-lake): align first-level headings with front matter titles

* Fix Superset guide and quote COPY INTO title

Remove a duplicated '## Superset' heading from tidb-cloud-lake/guides/superset.md to eliminate redundancy. Quote the YAML title in tidb-cloud-lake/sql/copy-into-table.md ("COPY INTO <table>") so the angle brackets are preserved and parsed correctly. Files changed: tidb-cloud-lake/guides/superset.md, tidb-cloud-lake/sql/copy-into-table.md.

* cloud-lake: replace Databend product names with lake variables

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

* cloud-lake: fix links after lake variable replacement

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

* Quote summary frontmatter strings

Wraps the frontmatter summary values in quotes across multiple tidb-cloud-lake guide and sql pages to ensure consistent YAML parsing and to preserve template expressions (e.g. {{{ .lake-short }}}). Standardizes formatting to prevent potential frontmatter parse issues.
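
As a rough illustration of the quoting described in this commit, the following Python sketch wraps an unquoted `summary:` front matter value in double quotes. The helper name `quote_summary` and its escaping rules are assumptions made for illustration; the commit itself does not say which tool or script was used.

```python
def quote_summary(line: str) -> str:
    """Wrap an unquoted YAML `summary:` value in double quotes.

    Hypothetical helper for illustration; not taken from the PR.
    """
    key = "summary:"
    if not line.startswith(key):
        return line  # not a summary line, leave untouched
    value = line[len(key):].strip()
    if not value:
        return line  # empty value, nothing to quote
    # Leave values that are already fully quoted.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return line
    # Escape backslashes first, then double quotes (YAML double-quoted style).
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{key} "{escaped}"'


print(quote_summary("summary: Learn how to connect to {{{ .lake-short }}}."))
```

Quoting keeps the template expression as literal text inside a YAML string, which matters most when a value begins with `{`, because an unquoted leading brace starts a YAML flow mapping.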

---------

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
* cloud-lake: update update wording and remove self-hosted

* Update links and name in connection to variable

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: awxxxxxx <7347183+awxxxxxx@users.noreply.github.com>

* Update tidb-cloud-lake/guides/connect-using-dbeaver.md

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: awxxxxxx <7347183+awxxxxxx@users.noreply.github.com>
* cloud-lake: remove manual line breaks

* Update wording and format

* Update load-with-airbyte.md
* cloud-lake: remove inapplicable diagrams

* Remove inappropriate images

* Update tidb-cloud-lake/tutorials/migrate-from-snowflake.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update connect-with-aws-privatelink.md

* Remove manual line break

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* cloud-lake: enhance lake overview content and structure

* Updated the summary to clarify the service's capabilities and features.
* Improved descriptions of analytics, vector, search, and geo functionalities.
* Added a note about the private beta status.
* Revised the "Get Started" section for better clarity and organization.

* cloud-lake: enhance quick start guide with detailed steps

* Expanded the quick start guide to include a three-step process: signing up, creating a warehouse, and exploring it.
* Added detailed instructions for each step to improve user onboarding experience.
* Introduced new images to visually support the guide.

* Update wording and index

---------

Co-authored-by: lilin90 <lilin@pingcap.com>
lilin90 and others added 11 commits June 11, 2026 02:51
* lake: incremental updates 0614

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Remove service hosting pricing and fix API name

Update Lake pricing docs: change components list to 'warehouses, storage, and cloud service', pluralize 'warehouse(s)', and correct the REST API target from `lake-query` to `databend-query`. Remove the Service Hosting Pricing section (including its table and examples) to reflect that hosting fees are no longer described here.

* Update data-integration-overview.md

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@lilin90 lilin90 self-assigned this Jun 26, 2026
@lilin90 lilin90 added the translation/no-need No need to translate this PR. label Jun 26, 2026
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026

ti-chi-bot Bot commented Jun 26, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot Bot commented Jun 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lilin90. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 26, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces comprehensive documentation for TiDB Cloud Lake, including quick start guides, tutorials, SQL references, and advanced features such as access control, full-text search, and vector search. The reviewer's feedback is highly constructive, pointing out several copy-paste errors, redundant headings, duplicate comments, grammatical issues (such as fragmented sentences and typos), and style guide inconsistencies (like using title case instead of sentence case for headings). All of these suggestions are actionable and help improve the overall quality and professionalism of the documentation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.


# Unloading NDJSON File

## Unloading TSV File

medium

This file is about unloading NDJSON files, but the heading says "## Unloading TSV File". This is a copy-paste error and should be corrected to "## Syntax".

Suggested change
## Unloading TSV File
## Syntax
References
  1. Technical accuracy and terminology consistency (link)

@@ -0,0 +1,138 @@
---
title: JSON & Search
summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system. JSON without copying it out of {{{ .lake }}}.

medium

There is a fragmented sentence at the end of the summary: "...external system. JSON without copying...". It should be joined into a single coherent sentence.

Suggested change
summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system. JSON without copying it out of {{{ .lake }}}.
summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system or copying it out of {{{ .lake }}}.
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)

Click `Create role`, and select `Custom trust policy` in `Trusted entity type`:

![Create Role](/media/tidb-cloud-lake/create-role.png)

medium

There is a duplicate "the" in "Input the the trust policy document:". It should be corrected to "Input the trust policy document:".

Suggested change
Input the trust policy document:
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)

@@ -0,0 +1,238 @@
---
title: Lakehouse ETL
summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline. to refresh the shared tables in {{{ .lake }}}.

medium

There is a fragmented sentence at the end of the summary with a double period: "...single COPY pipeline. to refresh...". It should be joined into a single coherent sentence.

Suggested change
summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline. to refresh the shared tables in {{{ .lake }}}.
summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline to refresh the shared tables in {{{ .lake }}}.
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)

- ✅ **JDBC 4.0 Compatible**: Standard JDBC interface support
- ✅ **Connection Pooling**: Built-in connection management
- ✅ **Prepared Statements**: Efficient parameterized queries
- ✅ **Batch Operations**: Bulk insert and update supportations

medium

The word "supportations" is not a standard English word. It should be corrected to "support".

Suggested change
- ✅ **Batch Operations**: Bulk insert and update supportations
- ✅ **Batch Operations**: Bulk insert and update support
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)

| ------------------ | ------------------------------------------------------------- |
| **{{{ .lake }}}** | `lake://user:pwd@host:443/database?warehouse=wh` |

### Step 2: Setup API Keys and Environment

low

"Setup" is a noun; the verb form "Set up" should be used here. Also, "API Keys" and "Environment" should be in lowercase to adhere to sentence case guidelines.

Suggested change
### Step 2: Setup API Keys and Environment
### Step 2: Set up API keys and environment
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)
  2. Use sentence case for headings (e.g., ## Configure the cluster). (link)

INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:7777 (Press CTRL+C to quit)
```

low

"Setup" is a noun; the verb form "Set up" should be used here. Also, "Web Interface" should be in lowercase "web interface" to adhere to sentence case guidelines.

Suggested change
### Step 6: Set up web interface
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)
  2. Use sentence case for headings (e.g., ## Configure the cluster). (link)


For example: `arn:aws:iam::123456789012:role/xxxxxxx/tnabcdefg/xxxxxxx-tnabcdefg`

2. Goto AWS Console:

low

"Goto" is a common spelling mistake for "Go to". Also, "AWS Console" should be in lowercase "AWS console" to adhere to sentence case guidelines.

Suggested change
2. Goto AWS Console:
2. Go to the AWS console:
References
  1. Correct English grammar, spelling, and punctuation mistakes, if any. (link)
  2. Use sentence case for headings (e.g., ## Configure the cluster). (link)

Comment on lines +46 to +47
# Query: Read data
# Query: Read data

low

The comment "# Query: Read data" is duplicated. Remove the duplicate line.

Suggested change
# Query: Read data
# Query: Read data
# Query: Read data
References
  1. Avoid unnecessary words and repetition. (link)
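
A duplicate like the one flagged in this comment can be caught mechanically. The sketch below is a hypothetical check: the function name `find_repeated_lines` and the sample text are assumptions for illustration, not part of the PR or of any linter configured for this repository. It reports each non-blank line that is identical to the line immediately before it.

```python
def find_repeated_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for each non-blank line that repeats the previous line."""
    repeats = []
    previous = None
    for number, line in enumerate(text.splitlines(), start=1):
        # Blank lines are skipped so that ordinary paragraph spacing is not reported.
        if line.strip() and line == previous:
            repeats.append((number, line))
        previous = line
    return repeats


# Made-up sample resembling the snippet under review.
sample = (
    "cursor = conn.cursor()\n"
    "# Query: Read data\n"
    "# Query: Read data\n"
    "cursor.execute('SELECT 1')\n"
)
print(find_repeated_lines(sample))
```

The check only compares adjacent lines, so it will not flag the same comment repeated further apart; that narrower scope keeps legitimate repeated lines, such as closing braces in separate blocks, out of the report.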

| Features | Personal | Business | Dedicated |
|----------|----------|----------|-----------|
| The next-generation SQL worksheet for advanced query development, data analysis, and visualization. | ✓ | ✓ | ✓ |
| LakeSQL, a command line client for building/testing queries, loading/unloading bulk data, and automating DDL operations. | ✓ | ✓ | ✓ |

low

The list of programmatic interfaces includes both "Node.js" and ".js". The ".js" entry is redundant and incorrect as it is a file extension, not a language or platform name. It should be removed.

Suggested change
| LakeSQL, a command line client for building/testing queries, loading/unloading bulk data, and automating DDL operations. | ✓ | ✓ | ✓ |
| Programmatic interfaces for Rust, Python, Java, Node.js, PHP, and Go. |
References
  1. Avoid unnecessary words and repetition. (link)

@lilin90 lilin90 added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. area/tidb-cloud This PR relates to the area of TiDB Cloud. labels Jun 26, 2026
@lilin90 lilin90 requested review from awxxxxxx and qiancai June 26, 2026 10:16
@lilin90 lilin90 marked this pull request as ready for review June 26, 2026 10:24
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2026
Labels

area/tidb-cloud This PR relates to the area of TiDB Cloud. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. translation/no-need No need to translate this PR.

4 participants