release-8.5: add Lake docs #23156
* *: add analytics service docs
* Update doc
* Update relative reference links
* Update docs links and variables
Standardize and fix internal documentation links and labels for the TiDB Cloud Lake docs. Updated TOC entries to shorter titles, fixed several relative links to point to /tidb-cloud-lake paths (load-from-local-file, load-from-remote-file, and file format references), cleaned up minor whitespace in the input-output-file-formats doc, and added new variables (lake, lake-short) in variables.json.
* Update file names and links
* Update format and fix typo
* Fix links and update file names
* Update cross reference links
* media: add 151 images and update reference links
* Add alt text for images
* Handle duplicated file names
* Fix internal links
* update tutorials
* tidb-cloud-lake: update guides
* tidb-cloud-lake: update sql
* Update tidb-cloud-lake/sql/input-output-file-formats.md
---------
Co-authored-by: Lilian Lee <lilin@pingcap.com>
* cloud-lake: standardize front matter in 1000+ Markdown files
* Update _index.md
* Normalize frontmatter summaries in guides
Standardize and simplify frontmatter summaries across multiple TiDB Cloud Lake guides: remove inline imports and stray markup, strip markdown emphasis and code ticks, and shorten/clarify several descriptions. Files changed: tidb-cloud-lake/guides/ai-powered-features.md (replace stray import string with concise summary), connect-using-dbeaver.md, connection-overview.md (limit to Databend Cloud), dashboards.md, data-purge-and-recycle.md (remove backticks around SQL commands), editions.md, full-text-index.md, and geo-analytics.md. These changes improve consistency and readability of guide metadata.
* Update how-json-variant-works.md
* Remove special format or punctuation characters
* Clean up frontmatter summaries in guides
Update frontmatter 'summary' fields across several TiDB Cloud Lake guides to be concise and descriptive. Removed embedded SQL/example snippets and stray JS import lines that were accidentally placed in summaries, and shortened redundant text for clarity. Affected files: tidb-cloud-lake/guides/query-parquet-files-in-stage.md, query-tsv-files-in-stage.md, sql-analytics.md, stage-overview.md, warehouse.md, and worksheet.md.
* Apply suggestions from code review
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Lilian Lee <lilin@pingcap.com>
* Update between.md
* Update summary of four files
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* tidb-cloud-lake: update sql
* tidb-cloud-lake: inline external content
Replace ComponentContent transclusions and LanguageFileParse/LanguageDocs components with their inlined English content.
* tidb-cloud-lake: convert structural components to Gatsby-compatible Markdown
* tidb-cloud-lake: convert inline/decorative components
* fix
* docs(tidb-cloud-lake): align first-level headings with front matter titles
* Update filenames
* Update anchor links
* cloud-lake: fix markdownlint violations
* Update recovery-from-operational-errors.md
---------
Co-authored-by: lilin90 <lilin@pingcap.com>
* cloud-lake: fix markdownlint violations
* fix
* fix
* fix
* docs(tidb-cloud-lake): align first-level headings with front matter titles
* Fix Superset guide and quote COPY INTO title
Remove a duplicated '## Superset' heading from tidb-cloud-lake/guides/superset.md to eliminate redundancy. Quote the YAML title in tidb-cloud-lake/sql/copy-into-table.md ("COPY INTO <table>") so the angle brackets are preserved and parsed correctly. Files changed: tidb-cloud-lake/guides/superset.md, tidb-cloud-lake/sql/copy-into-table.md.
* cloud-lake: replace Databend product names with lake variables
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
* cloud-lake: fix links after lake variable replacement
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
* Quote summary frontmatter strings
Wraps the frontmatter summary values in quotes across multiple tidb-cloud-lake guide and sql pages to ensure consistent YAML parsing and to preserve template expressions (e.g. {{{ .lake-short }}}). Standardizes formatting to prevent potential frontmatter parse issues.
---------
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
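For readers who want to see what the quoting described in "Quote summary frontmatter strings" amounts to, here is a minimal sketch. It is not taken from the PR, which does not say how the quoting was performed. The helper name `quote_summary` and the regex-based approach are hypothetical, and the sketch assumes a single-line `summary:` value inside a front matter block delimited by `---` lines.

```python
import re

# Hypothetical helper, not part of the PR: wraps an unquoted single-line
# `summary:` value in double quotes inside a Markdown front matter block.
SUMMARY_RE = re.compile(r'^(summary:\s*)(.*\S)\s*$')


def quote_summary(markdown: str) -> str:
    lines = markdown.split("\n")
    # Assumption: front matter starts on the first line with '---'.
    if not lines or lines[0].strip() != "---":
        return markdown
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            break  # end of front matter; the document body is left alone
        match = SUMMARY_RE.match(lines[i])
        if not match:
            continue
        key, value = match.groups()
        # Skip values that already start and end with the same quote mark.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            continue
        # Escape backslashes and double quotes for a double-quoted scalar.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines[i] = f'{key}"{escaped}"'
    return "\n".join(lines)


doc = (
    "---\n"
    "title: Warehouse\n"
    "summary: Learn about {{{ .lake-short }}} warehouses.\n"
    "---\n"
    "\n"
    "# Warehouse\n"
)
print(quote_summary(doc))
```

The sketch only touches lines between the two `---` delimiters, so a line in the page body that happens to start with `summary:` is not rewritten. Running it twice gives the same result, because already-quoted values are skipped.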
* cloud-lake: update update wording and remove self-hosted
* Update links and name in connection to variable
* Apply suggestions from code review
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: awxxxxxx <7347183+awxxxxxx@users.noreply.github.com>
* Update tidb-cloud-lake/guides/connect-using-dbeaver.md
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: awxxxxxx <7347183+awxxxxxx@users.noreply.github.com>
* cloud-lake: remove manual line breaks
* Update wording and format
* Update load-with-airbyte.md
* cloud-lake: remove inapplicable diagrams
* Remove inappropriate images
* Update tidb-cloud-lake/tutorials/migrate-from-snowflake.md
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update connect-with-aws-privatelink.md
* Remove manual line break
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* cloud-lake: enhance lake overview content and structure
  * Updated the summary to clarify the service's capabilities and features.
  * Improved descriptions of analytics, vector, search, and geo functionalities.
  * Added a note about the private beta status.
  * Revised the "Get Started" section for better clarity and organization.
* cloud-lake: enhance quick start guide with detailed steps
  * Expanded the quick start guide to include a three-step process: signing up, creating a warehouse, and exploring it.
  * Added detailed instructions for each step to improve user onboarding experience.
  * Introduced new images to visually support the guide.
* Update wording and index
---------
Co-authored-by: lilin90 <lilin@pingcap.com>
* lake: incremental updates 0614
* Apply suggestions from code review
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Remove service hosting pricing and fix API name
Update Lake pricing docs: change components list to 'warehouses, storage, and cloud service', pluralize 'warehouse(s)', and correct the REST API target from `lake-query` to `databend-query`. Remove the Service Hosting Pricing section (including its table and examples) to reflect that hosting fees are no longer described here.
* Update data-integration-overview.md
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Code Review
This pull request introduces comprehensive documentation for TiDB Cloud Lake, including quick start guides, tutorials, SQL references, and advanced features such as access control, full-text search, and vector search. The reviewer's feedback is highly constructive, pointing out several copy-paste errors, redundant headings, duplicate comments, grammatical issues (such as fragmented sentences and typos), and style guide inconsistencies (like using title case instead of sentence case for headings). All of these suggestions are actionable and help improve the overall quality and professionalism of the documentation.
| # Unloading NDJSON File
|
| ## Unloading TSV File
This file is about unloading NDJSON files, but the heading says "## Unloading TSV File". This is a copy-paste error and should be corrected to "## Syntax".
Suggested change:
| - ## Unloading TSV File
| + ## Syntax
References
- Technical accuracy and terminology consistency (link)
| @@ -0,0 +1,138 @@
| ---
| title: JSON & Search
| summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system. JSON without copying it out of {{{ .lake }}}.
There is a fragmented sentence at the end of the summary: "...external system. JSON without copying...". It should be joined into a single coherent sentence.
Suggested change:
| - summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system. JSON without copying it out of {{{ .lake }}}.
| + summary: Scenario CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in {{{ .lake }}} with Elasticsearch-style syntax without replicating it to an external system or copying it out of {{{ .lake }}}.
References
- Correct English grammar, spelling, and punctuation mistakes, if any. (link)
| Click `Create role`, and select `Custom trust policy` in `Trusted entity type`:
|
| 
There is a duplicate "the" in "Input the the trust policy document:". It should be corrected to "Input the trust policy document:".
Suggested change:
| + Input the trust policy document:
References
- Correct English grammar, spelling, and punctuation mistakes, if any. (link)
| @@ -0,0 +1,238 @@
| ---
| title: Lakehouse ETL
| summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline. to refresh the shared tables in {{{ .lake }}}.
There is a fragmented sentence at the end of the summary with a double period: "...single COPY pipeline. to refresh...". It should be joined into a single coherent sentence.
Suggested change:
| - summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline. to refresh the shared tables in {{{ .lake }}}.
| + summary: Scenario CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to update {{{ .lake }}}'s shared tables via a single COPY pipeline to refresh the shared tables in {{{ .lake }}}.
References
- Correct English grammar, spelling, and punctuation mistakes, if any. (link)
| - ✅ **JDBC 4.0 Compatible**: Standard JDBC interface support
| - ✅ **Connection Pooling**: Built-in connection management
| - ✅ **Prepared Statements**: Efficient parameterized queries
| - ✅ **Batch Operations**: Bulk insert and update supportations
The word "supportations" is not a standard English word. It should be corrected to "support".
Suggested change:
| - - ✅ **Batch Operations**: Bulk insert and update supportations
| + - ✅ **Batch Operations**: Bulk insert and update support
References
- Correct English grammar, spelling, and punctuation mistakes, if any. (link)
| | ------------------ | ------------------------------------------------------------- |
| | **{{{ .lake }}}** | `lake://user:pwd@host:443/database?warehouse=wh` |
|
| ### Step 2: Setup API Keys and Environment
"Setup" is a noun; the verb form "Set up" should be used here. Also, "API Keys" and "Environment" should be in lowercase to adhere to sentence case guidelines.
Suggested change:
| - ### Step 2: Setup API Keys and Environment
| + ### Step 2: Set up API keys and environment
| INFO: Application startup complete.
| INFO: Uvicorn running on http://127.0.0.1:7777 (Press CTRL+C to quit)
| ```
"Setup" is a noun; the verb form "Set up" should be used here. Also, "Web Interface" should be in lowercase "web interface" to adhere to sentence case guidelines.
Suggested change:
| + ### Step 6: Set up web interface
| For example: `arn:aws:iam::123456789012:role/xxxxxxx/tnabcdefg/xxxxxxx-tnabcdefg`
|
| 2. Goto AWS Console:
"Goto" is a common spelling mistake for "Go to". Also, "AWS Console" should be in lowercase "AWS console" to adhere to sentence case guidelines.
Suggested change:
| - 2. Goto AWS Console:
| + 2. Go to the AWS console:
| # Query: Read data
| # Query: Read data
The comment "# Query: Read data" is duplicated. Remove the duplicate line.
Suggested change:
| - # Query: Read data
| - # Query: Read data
| + # Query: Read data
References
- Avoid unnecessary words and repetition. (link)
| | Features | Personal | Business | Dedicated |
| |----------|----------|----------|-----------|
| | The next-generation SQL worksheet for advanced query development, data analysis, and visualization. | ✓ | ✓ | ✓ |
| | LakeSQL, a command line client for building/testing queries, loading/unloading bulk data, and automating DDL operations. | ✓ | ✓ | ✓ |
The list of programmatic interfaces includes both "Node.js" and ".js". The ".js" entry is redundant and incorrect as it is a file extension, not a language or platform name. It should be removed.
Suggested change:
| | LakeSQL, a command line client for building/testing queries, loading/unloading bulk data, and automating DDL operations. | ✓ | ✓ | ✓ |
| | Programmatic interfaces for Rust, Python, Java, Node.js, PHP, and Go. |
References
- Avoid unnecessary words and repetition. (link)
What is changed, added or deleted? (Required)
This PR adds a whole set of documentation for TiDB Cloud Lake public preview.
Note: Lake diagrams referenced in docs were already added to `release-8.5/media/tidb-cloud-lake` in advance.
Which TiDB version(s) do your changes apply to? (Required)
Tips for choosing the affected version(s):
By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.
For details, see tips for choosing the affected versions.
What is the related PR or file link(s)?
Do your changes match any of the following descriptions?