From ebd47856869a0cefac64803a6c54e3584c2ca0e9 Mon Sep 17 00:00:00 2001
From: Greg Orzell
Date: Wed, 17 Jun 2026 09:30:50 +0200
Subject: [PATCH 1/3] docs(bpe): cite Incremental BPE Tokenization paper

Add a reference to Jiang and Gong, "Incremental BPE Tokenization" (ICML
2026), in the tokenizer comparison section of the bpe README, linking to
the paper's full runtime analysis and the authors' implementation at
ModelTC/mtc-inc-bpe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 crates/bpe/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/crates/bpe/README.md b/crates/bpe/README.md
index 743c59a7..6025d6eb 100644
--- a/crates/bpe/README.md
+++ b/crates/bpe/README.md
@@ -300,3 +300,6 @@ This case is particularly challenging for tiktoken, which shows a quadratic grow
 The Huggingface encoder scales better, but becomes slower and slower compared to our implementation as input size increases.
 
 ![worst-case encoding runtime comparison](./images/performance-worstcase.svg)
+
+For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case $\mathcal{O}(n \log^2 t)$ complexity (where $n$ is the input length and $t$ is the maximum token length).
+Their implementation is available at [ModelTC/mtc-inc-bpe](https://github.com/ModelTC/mtc-inc-bpe).

From 35e387b0966b8ef1cf9a97dd7ce3c1e77c13a1ff Mon Sep 17 00:00:00 2001
From: Greg Orzell
Date: Wed, 17 Jun 2026 09:35:06 +0200
Subject: [PATCH 2/3] docs(bpe): use inline-code complexity notation for consistency

Replace LaTeX-delimited complexity notation with inline code to match
the rest of the README and render correctly on crates.io/docs.rs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 crates/bpe/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crates/bpe/README.md b/crates/bpe/README.md
index 6025d6eb..c138b78f 100644
--- a/crates/bpe/README.md
+++ b/crates/bpe/README.md
@@ -301,5 +301,5 @@ The Huggingface encoder scales better, but becomes slower and slower compared to
 
 ![worst-case encoding runtime comparison](./images/performance-worstcase.svg)
 
-For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case $\mathcal{O}(n \log^2 t)$ complexity (where $n$ is the input length and $t$ is the maximum token length).
+For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).
 Their implementation is available at [ModelTC/mtc-inc-bpe](https://github.com/ModelTC/mtc-inc-bpe).

From 75e4040ad9a7fa032d3fb51b365cb43d939a9692 Mon Sep 17 00:00:00 2001
From: Greg Orzell
Date: Wed, 17 Jun 2026 09:42:36 +0200
Subject: [PATCH 3/3] docs(bpe): note paper builds on our incremental algorithm

Per reviewer feedback, clarify that the cited paper extends this crate's
incremental algorithm by combining the aho-corasick search with the
compatibility test into a single automaton.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 crates/bpe/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/crates/bpe/README.md b/crates/bpe/README.md
index c138b78f..5cf6a943 100644
--- a/crates/bpe/README.md
+++ b/crates/bpe/README.md
@@ -302,4 +302,5 @@ The Huggingface encoder scales better, but becomes slower and slower compared to
 ![worst-case encoding runtime comparison](./images/performance-worstcase.svg)
 
 For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).
+Their work builds on our incremental algorithm and takes it one step further by combining the aho-corasick search with the compatibility test into a single automaton.
 Their implementation is available at [ModelTC/mtc-inc-bpe](https://github.com/ModelTC/mtc-inc-bpe).