HermitCrab FST acceleration: grammar FST-readiness advisor (groundwork) by johnml1135 · Pull Request #441 · sillsdev/machine

johnml1135 · 2026-06-26T00:23:15Z

What this is

A static grammar linter (GrammarFstAdvisor.Analyze(Language)) for the HermitCrab parser. It walks every rule in a compiled grammar and reports, per rule, what makes parsing expensive or blocks finite-state (FST) compilation — with an actionable write-up (why it's costly, how to constrain it, what to try instead) and an overall tier verdict. It runs at grammar-authoring time or in CI: a new escape that flips the tier is the "one new rule blew up the grammar" warning.

Scope honesty: this PR delivers the diagnostic front-end, not the speedup. The FST compiler + runtime (the actual 10–100× / near-zero-allocation lever) is planned in HERMITCRAB_FST_PLAN.md and gated on an unbuilt spike (§7). The advisor tells you, per grammar, whether that lever is worth pulling and what blocks it.

This is the sibling of #438 (the GC/single-threaded/Server-GC performance work). The two are fully independent — this branch touches only new files (no shared edits, no GC-only APIs) and was cherry-picked cleanly onto master.

How it classifies (two orthogonal axes — neither masks the other)

Severity = slow in today's engine (the warning): Escape (forces the combinatorial search), Cost (regular but inflates fan-out/state count), Info. Unchanged by regularity.
Regular = does an FST exist in principle (the reclaim path, gated on the unbuilt compiler): by Kaplan & Kay (1994) a directional context-sensitive rewrite rule is a regular relation however long its environment — so harmony/spreading, bounded reduplication, and infixation are FST-reclaimable; only unbounded (whole-stem) copy is genuinely non-regular.
Probeable = is a per-word strip-and-reparse un-application sound (surface-invariant: no later phonological rule rewrites the affixed span).

A harmony rule therefore still warns (escape, not Tier 1), with Regular/Probeable reported only as reclaim notes — never as "you're fine."
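
A minimal sketch of how the axes stay separate, assuming simplified stand-in types (`Severity`, `Advisory`, `TierSketch`, and the verdict strings are illustrative, not the PR's `GrammarAdvisory` or `GrammarFstReport`). The headline verdict reads severity only; `Regular` contributes a separate reclaim note and never upgrades the tier.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration; not the PR's actual types.
public enum Severity { Info, Cost, Escape }

public sealed record Advisory(string Rule, Severity Severity, bool? Regular, bool? Probeable);

public static class TierSketch
{
    // The verdict looks at severity only: any escape means "not Tier 1",
    // regardless of whether the escape is regular or probe-able.
    public static string Verdict(IReadOnlyCollection<Advisory> advisories)
    {
        return advisories.Any(a => a.Severity == Severity.Escape)
            ? "Tier 2 (escapes present)"
            : "Tier 1 candidate";
    }

    // The reclaim note is reported beside the warning, never instead of it.
    public static string ReclaimNote(Advisory a)
    {
        if (a.Severity != Severity.Escape)
            return "";
        return a.Regular == true
            ? "FST-reclaimable once the compiler exists; slow today"
            : "not known to be regular; search required";
    }
}
```

With this shape, a harmony rule modelled as `new Advisory("harmony", Severity.Escape, true, null)` still yields the escape verdict, and its regularity shows up only in the note.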

Validated on the real Sena grammar

Tier 1 candidate — fully FST-able; examined 19 affix + 8 compounding rules, 0 escapes. Matches the grammar census; zero false positives.

What was investigated and discarded (and why FST is the lever)

Before landing on the FST direction, three cheaper approaches were tried against the real Sena corpus and measured to not pay off, so they were rolled back (not in this PR):

Sound memoization / tabling of analysis sub-results — measured 0% hit rate: the combinatorial waste is distinct doomed branches within a word, not repeated states across words, so there's nothing to memoize.
Reachability pruning (feature necessary-condition + early trie lookup) — implemented and verified sound, but only ~4% on Sena.
Grammar census — the payoff: the real Sena grammar is ~100% FST-able (0 rewrite rules, 0 variables, 0 productive reduplication, all-concatenative affixation). That's what pointed to FST composition as the real lever past the ~3× parallel ceiling, and motivated this advisor as its front-end.

Tests

6 advisor tests (concatenative → Tier 1; reduplication → escape + tier downgrade; clean vs opaque; bounded vs unbounded reduplication; infix; harmony stays-escape-but-regular). Full HC suite green (67 tests on this branch).

🤖 Generated with Claude Code


Tech stack: build on SIL.Machine's own Fst (already has Compose/Determinize/Minimize/Intersect + unification arcs; RootAllomorphTrie precedent) rather than external OpenFst/Foma (interop + no native feature-structure support). Graceful degradation via census-chosen tiers: fully-FS grammars -> transducer-only; partial -> FST + per-word search fallback at non-FS escapes; pervasively-non-FS -> existing search (no regression). Soundness contract + verification mode. Phased plan gated on a Sena compile-and-verify spike. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…te FST A grammar evolves; one new rule can quietly push it from the fast finite-state path into the slow combinatorial search. GrammarFstAdvisor.Analyze(Language) walks every rule and emits per-rule advisories with severity (Escape = breaks FST, Cost = inflates search, Info), a one-line issue, and an actionable write-up (how to constrain it / what to try instead), plus an overall tier verdict. This is the "one new rule blew up the grammar" guard: a new Escape that flips the tier names the offending rule and explains the fix. Classifier: reduplication (a part copied >=2x via CopyFromInput) = Escape; stem-split/infixation (>=2 copies of different parts) = Escape; unbounded rewrite environment (Quantifier MaxOccur == Infinite) = Escape; deletion (LHS longer than RHS) = Cost; many allomorphs = Cost; ModifyFromInput, bounded rewrite rule, metathesis, compounding = Info. The report also records how many affix/phonological/compounding rules were examined (clean ones produce no advisory) so "fully FST-able" is backed by inspection counts. Validated on real Sena grammar: examined 19 affix + 8 compounding, 0 phonological -> Tier 1, 0 escapes (matches the grammar census; no false positives). Tests: concatenative grammar -> Tier 1; add a reduplication rule -> flagged Escape with write-up + tier downgrade to Tier 2. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The infixation check flagged any allomorph with >=2 CopyFromInput of different parts as Escape, but a plain suffix/circumfix over a split stem (copy "1", copy "2", insert) has contiguous copies and is fully FST-able. True infixation is signalled by inserted material BETWEEN two copies (copy...insert...copy); HasInfixedCopy now detects exactly that. Added tests: a contiguous split-stem suffix stays Tier 1 (no false escape) and a real copy-insert-copy infix is flagged Escape. Also label each advisory with its stratum (rules can appear in more than one), which clarifies the Sena report: its 8 compounding rules (mrule1-8, 4 names reused in pairs) all live in the 'Morphology' stratum -- genuine distinct rules, not a re-walk. Sena verdict unchanged: Tier 1, 0 escapes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
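
The copy...insert...copy test described above can be sketched as a single pass over an allomorph's output actions. This is an illustration under assumptions: `ActionKind` and `InfixSketch` are hypothetical, and the PR's real `HasInfixedCopy` works on HermitCrab's own action types, which are not shown on this page.

```csharp
using System.Collections.Generic;

// Hypothetical simplification of an allomorph's right-hand side:
// each output action either copies a stem part or inserts fixed segments.
public enum ActionKind { Copy, Insert }

public static class InfixSketch
{
    // True only when inserted material sits BETWEEN two copies
    // (copy ... insert ... copy). Contiguous copies followed or preceded
    // by an insert (suffix, prefix, circumfix over a split stem) return false.
    public static bool HasInfixedCopy(IEnumerable<ActionKind> rhs)
    {
        bool seenCopy = false;         // at least one copy so far
        bool insertAfterCopy = false;  // an insert occurred after some copy
        foreach (ActionKind action in rhs)
        {
            if (action == ActionKind.Copy)
            {
                if (insertAfterCopy)
                    return true;
                seenCopy = true;
            }
            else if (seenCopy)
            {
                insertAfterCopy = true;
            }
        }
        return false;
    }
}
```

Under this sketch the split-stem suffix (copy, copy, insert) stays clean, while copy, insert, copy is flagged.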

…opaque An infix/reduplication escape can be un-applied per word by a cheap strip-and-reparse probe (remove the candidate affix, re-parse the residue with the FST) ONLY if nothing downstream rewrites the affixed span. Add the static soundness test: an escape in stratum i is "probe-able" iff no phonological rule runs at stratum i or later (surface-invariant); otherwise "opaque" and the search backstop is required. Sound-conservative: presence of any later phonological rule => opaque. GrammarAdvisory.Probeable (bool?) records it; the report counts ProbeableEscapeCount / OpaqueEscapeCount and, when every escape is probe-able, reports a "Tier 2+" verdict (a per-word probe recovers the fast path, effectively Tier 1 with no search backstop). Escape advice now spells out the probe and why it is or isn't sound. Tests: reduplication with no later phonology => probe-able (Tier 2+); the same rule with a later-stratum rewrite rule => opaque (plain Tier 2 hybrid). Sena unchanged (Tier 1, 0 escapes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
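
A sketch of the sound-conservative probe test, assuming strata are indexed in application order (0 first) and reduced to a count of phonological rules each. `ProbeSketch` and its parameters are illustrative names, not the PR's API.

```csharp
using System.Collections.Generic;

public static class ProbeSketch
{
    // phonologicalRulesPerStratum[i] = number of phonological rules in stratum i,
    // with strata listed in application order (assumption for this sketch).
    // An escape in stratum i is probe-able iff no phonological rule runs at
    // stratum i or later; any such rule makes it opaque.
    public static bool IsProbeable(IReadOnlyList<int> phonologicalRulesPerStratum, int escapeStratum)
    {
        for (int i = escapeStratum; i < phonologicalRulesPerStratum.Count; i++)
        {
            if (phonologicalRulesPerStratum[i] > 0)
                return false; // something downstream may rewrite the affixed span
        }
        return true;
    }
}
```

The check deliberately ignores what the later rule does: its mere presence yields "opaque", which can over-flag but never reports an unsound probe as safe.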

…arning Add GrammarAdvisory.Regular (bool?): does an FST exist for this construct in principle? By Kaplan & Kay (1994) a directional context-sensitive rewrite rule is a regular relation however long its environment, so harmony/spreading and bounded reduplication and infixation are regular and FST-reclaimable; only whole-stem (unbounded) copy is genuinely non-regular. Crucially this is kept ORTHOGONAL to severity: the FST compiler that turns "regular" into "fast" is not built yet, so severity still means "slow in today's engine" and is UNCHANGED -- every current escape stays an escape. A harmony rule still warns (escape present, not Tier 1); Regular only adds a separate reclaim-path note ("FST-reclaimable once the compiler exists; slow today"). The report prints RegularEscapeCount / NonRegularEscapeCount and a reclaim-path line; the tier verdict is NOT upgraded by regularity. Detection: reduplication regularity from the copied part's Lhs pattern boundedness (unbounded/unresolved -> non-regular, conservative); infix regular (pattern-defined slot); unbounded-environment rewrite regular iff its own Lhs/Rhs are bounded. Also fixed a latent tier bug (Probeable==null phonological escapes were counted as "all probe-able") and removed the present-tense "effectively Tier 1" claim from the Tier 2+ string -- the probe runtime is also unbuilt, so both reclaim axes now read "would recover ... once it exists; slow today". Tests: harmony rewrite stays Escape + Regular (headline still warns); unbounded-copy redup => non-regular; bounded reduplicant + infix => regular. Sena unchanged (Tier 1). 6 advisor + 69 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
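
The latent tier bug mentioned above involves three-valued `bool?` fields. The page does not show the original faulty code, so the following is only one plausible shape of that kind of bug, with hypothetical names: a check phrased as "no escape is known to be opaque" silently accepts `null`, whereas "every escape is positively probe-able" does not.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NullableCountSketch
{
    // Loose form: passes when Probeable is null (unknown), because null != false.
    public static bool AllProbeableLoose(IEnumerable<bool?> escapes) =>
        !escapes.Any(p => p == false);

    // Strict form: every escape must carry an explicit 'true'.
    public static bool AllProbeableStrict(IEnumerable<bool?> escapes) =>
        escapes.All(p => p == true);
}
```

Both forms return true for an empty sequence, so a caller would also need to confirm that at least one escape exists before reporting an "all probe-able" verdict.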

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.

Changes:

Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs	Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity).
tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs	Adds an `[Explicit]` helper test to run and print the advisor report on an external HC XML grammar.
src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs	Implements the advisor, report model, and the core static analyses for affix and phonological rules.
HERMITCRAB_FST_PLAN.md	Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate.
fst.md	Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes.


+            Advisories = advisories;
+            AffixRulesExamined = affixRulesExamined;
+            PhonologicalRulesExamined = phonologicalRulesExamined;
+            CompoundingRulesExamined = compoundingRulesExamined;
+            EscapeCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Escape);
+            CostCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Cost);
+            InfoCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Info);
+            ProbeableEscapeCount = advisories.Count(a =>
+                a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == true
+            );
+            OpaqueEscapeCount = advisories.Count(a =>
+                a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == false
+            );
+            RegularEscapeCount = advisories.Count(a =>
+                a.Severity == GrammarAdvisorySeverity.Escape && a.Regular == true
+            );
+            NonRegularEscapeCount = advisories.Count(a =>
+                a.Severity == GrammarAdvisorySeverity.Escape && a.Regular != true
+            );


+                foreach (IMorphologicalRule mrule in stratum.MorphologicalRules)
+                {
+                    switch (mrule)
+                    {
+                        case AffixProcessRule affix:
+                            affixExamined++;
+                            AnalyzeAffix(affix, stratum.Name, surfaceInvariant, advisories, manyAllomorphsThreshold);
+                            break;


The analyzer transducer must emit the structured derivation (ordered morphemes + root), not just accept/reject, or it is a recognizer not an analyzer. Define the compact output token: high 8 bits = MorphOp (role/operation: Root, Prefix, Suffix, Infix, Reduplication, Circumfix*, Compound, Clitic, Process, Null), low 24 bits = morpheme index into the grammar's morpheme table. An accepting path's output is the uint[] of these tokens, which IS the analysis and is self-describing: Morphemes = indices in array order; RootMorphemeIndex = the Root token's position (no separate field). Verdict on the proposed 8+24 packing: sound and the right compactness choice (4 bytes/morph, hashable, columnar). 24-bit ceiling = 16,777,215 morphemes (ample; compiler asserts). Refinement baked into the schema: keep the 32-bit word as the pure (op, morpheme) derivation and DON'T overload it with surface segmentation or allomorph identity -- those are optional parallel channels. MorphToken codec (Encode/GetOp/GetMorphemeId/RootIndex) + bounds check, plus HERMITCRAB_FST_PLAN.md section 8 documenting the schema. 5 tests (round-trip, out-of-range throw, distinctness, self-describing derivation array, root recovery). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
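
The 8+24 packing can be sketched as below. The method names follow the commit message (Encode/GetOp/GetMorphemeId/RootIndex), but the class name, the numeric values assigned to each `MorphOp`, and the exception type are assumptions; the source's `Circumfix*` entry is left out because its exact meaning is not given here.

```csharp
using System;

// Numeric values are assumed for illustration; the source lists the names only.
public enum MorphOp : byte { Root, Prefix, Suffix, Infix, Reduplication, Compound, Clitic, Process, Null }

public static class MorphTokenSketch
{
    public const uint MaxMorphemeId = 0x00FF_FFFF; // 24-bit ceiling: 16,777,215

    // High 8 bits = operation, low 24 bits = morpheme index.
    public static uint Encode(MorphOp op, int morphemeId)
    {
        if (morphemeId < 0 || (uint)morphemeId > MaxMorphemeId)
            throw new ArgumentOutOfRangeException(nameof(morphemeId));
        return ((uint)op << 24) | (uint)morphemeId;
    }

    public static MorphOp GetOp(uint token) => (MorphOp)(token >> 24);

    public static int GetMorphemeId(uint token) => (int)(token & MaxMorphemeId);

    // The root position is recovered from the array itself; returns -1 if no Root token.
    public static int RootIndex(uint[] tokens) =>
        Array.FindIndex(tokens, t => GetOp(t) == MorphOp.Root);
}
```

For a prefix + root + suffix derivation encoded in that order, `RootIndex` returns 1 without any separate field, which is the "self-describing" property the commit describes.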

…nalyses The packed-token tests alone proved bit-packing, not schema fidelity. Add the reference encoder MorphTokenCodec (Word -> uint[]), which mirrors Morpher.CreateWordAnalysis (same AllomorphsInMorphOrder iteration + RootAllomorph check) and populates the op channel from the actual rule: head root -> Root, other stems -> Compound, affixes classified from their output actions (reduplication / infix / prefix / suffix / process). Round-trip tests on real parsed words now MEASURE soundness rather than assert it:
- suffix word: decoded morphemes reproduce WordAnalysis.Morphemes in order, and RootIndex (recovered purely from the Root op code) == WordAnalysis.RootMorphemeIndex;
- compound (two stems): the flat array keeps both morphemes with exactly one Root + one Compound, matched to WordAnalysis by morpheme sequence with root index at parity -- confirming the flat array is at parity with WordAnalysis's own compound flattening (not lossy);
- ClassifyOp populates reduplication/infix/prefix/suffix from real output actions.
Resolves the two open risks on the schema: the op channel is now populated from a real Word (not asserted), and multi-root/compound handling is verified. 75 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

FstMorpher hand-builds one acceptor (mirroring RootAllomorphTrie -- no Compose/Determinize/Minimize, which are unsafe for HC's underspecified-feature arcs): root segment chains from the start state, with each fixed-segment suffix appended after every root-accepting state; accepting states map to packed MorphToken arrays. Analysis is a single nondeterministic Transduce walk of the surface word -- no Word clones, no generate-and-test. Verified against Morpher.AnalyzeWord on the concatenative fragment:
- bare root and root+suffix ("sags") round-trip to the same morphemes + root;
- COMPLETENESS as analysis-set equality, not "found one": homographs (dat -> entries 8 and 9 both found), and the negative case (no path -> both empty) agree with the search engine;
- an [Explicit] allocation comparison (FST walk vs search engine).
Caveat documented in FstMorpher: arcs match segments only, not the affix's syntactic/MPR/stratum constraints, so on grammars where letters match but constraints exclude an analysis it would over-generate -- closed by feature-unification arcs (HERMITCRAB_FST_PLAN.md section 8). 78 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
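
A toy version of the acceptor idea, assuming plain characters as segments and bare morpheme ids instead of packed tokens. `AcceptorSketch` is hypothetical and is a deterministic trie whose accepting states hold several analyses, which is a simplification of the nondeterministic walk over SIL.Machine's `Fst` that the commit describes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class AcceptorSketch
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        // Morpheme-id sequences of every accepting path that ends here.
        public readonly List<int[]> Analyses = new List<int[]>();
    }

    private readonly Node _start = new Node();

    public AcceptorSketch(IEnumerable<(string Form, int Id)> roots, IEnumerable<(string Form, int Id)> suffixes)
    {
        var suffixList = suffixes.ToList();
        var rootEnds = new List<(Node End, int RootId)>();
        foreach (var (form, id) in roots)
        {
            Node end = Extend(_start, form);   // root segment chain from the start state
            end.Analyses.Add(new[] { id });    // bare root is accepted
            rootEnds.Add((end, id));
        }
        foreach (var (end, rootId) in rootEnds)
        {
            foreach (var (form, id) in suffixList)
                Extend(end, form).Analyses.Add(new[] { rootId, id }); // suffix after each root end
        }
    }

    private static Node Extend(Node from, string segments)
    {
        Node current = from;
        foreach (char c in segments)
        {
            if (!current.Next.TryGetValue(c, out var next))
            {
                next = new Node();
                current.Next[c] = next;
            }
            current = next;
        }
        return current;
    }

    // One pass over the surface word; an empty result means "no path".
    public IReadOnlyList<int[]> Analyze(string word)
    {
        Node current = _start;
        foreach (char c in word)
        {
            if (!current.Next.TryGetValue(c, out var next))
                return Array.Empty<int[]>();
            current = next;
        }
        return current.Analyses;
    }
}
```

Like the slice it imitates, this sketch matches segments only, so it shares the over-generation caveat: nothing here checks category, MPR, or stratum constraints.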

…e contract) FstMorpher now implements IMorphologicalAnalyzer.AnalyzeWord -> IEnumerable<WordAnalysis>, the same interface Morpher implements and that consumers (FieldWorks et al.) depend on. Morphemes and root index come from the token walk; Category is null because this slice does not yet track syntactic features (arrives with the unification arcs). A test drives both engines through the IMorphologicalAnalyzer contract and asserts the WordAnalysis sets match (homographs included) on the concatenative fragment, so the FST analyzer is a drop-in for the search engine at the interface level. Still scoped to the clean concatenative fragment built from explicit root/suffix lists; full-grammar Compose compilation, phonology/allomorphy, the feeding-closure completeness certificate, and the FieldWorks adapter remain. 79 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…grammar FstMorpher.FromLanguage(Language) builds the analyzer by introspecting the grammar: every root allomorph plus every single-allomorph suffix rule (detected via MorphTokenCodec.ClassifyOp == Suffix). It THROWS NotSupportedException on any construct outside the concatenative root+suffix fragment (prefixes, infixes, reduplication, compounding, templates, multi-allomorph affixes) so it never silently under-generates — a caller learns exactly what this slice cannot cover. This closes the "not driven from a compiled Language" gap: the analyzer now consumes a real Language object, not explicit root/suffix lists. Verified at parity with Morpher (through the IMorphologicalAnalyzer contract) on the concatenative fragment, plus a guard test that a prefix rule makes FromLanguage refuse rather than produce a quietly incomplete analyzer. 81 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ffix?) The acceptor now prepends a prefix segment chain before the root (mirroring the appended suffix chains), so it covers an optional fixed-segment prefix + root + optional fixed-segment suffix. FromLanguage auto-detects prefix rules (ClassifyOp == Prefix) alongside suffixes and still throws on anything outside the concatenative fragment (reduplication, infix, compounding, templates, multi-allomorph affixes). Verified at parity with Morpher via the IMorphologicalAnalyzer contract: "disag" = di-(PST) + sag, and the bare root through the no-prefix branch. The throw-guard test now uses reduplication (a genuinely non-concatenative rule). 82 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…tificate) Document the completeness/closure analysis: an FST analyzer is trustworthy only if its silence is a proof. Completeness has two parts -- "no escape applies here" (easy) and "no normal FST step reachable from the input can FEED an escape" (the kicker, Kiparsky feeding). The universal question is undecidable, but per grammar it is usually decidable:
- decidable feeding-closure: for each FST-able rule F and escape E, test range(F) ∩ trigger(E) = ∅ via Fst.Intersect; all empty ⇒ the fragment is closed ⇒ "no path" is a complete certificate; non-empty + regular ⇒ fold in; non-empty + non-regular ⇒ those words fall to the search backstop;
- stratal containment as the practical guarantee (escapes innermost, not fed by the FST fragment);
- homograph completeness = all accepting paths returned, contingent on closure + never unsafely determinizing/minimizing unification arcs;
- the search backstop's "done" rests on a true derivation-depth bound (finite iff no unbounded self-feeding cycle);
- the work: a static feeding-closure pass extending GrammarFstAdvisor + corpus closure verification (set parity) as the gate before the FST may replace the search engine.
Wired into the phased plan (Phase 3 gated on §9), the risks table, and the decision flow. Until closure is confirmed for a grammar, the FST runs in shadow/verification mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…, invariant completeness) Design in from the start a tunable partition that bounds the compiled automaton without sacrificing completeness. Three buckets: A precompiled (eager, fast walk, costs states), B on-the-fly (lazy on-demand composition, bounded memory), C search/probe fallback (non-FS escapes, set by section 9 closure not the knob). The A<->B boundary is the knob, with a safe floor (everything lazy = bounded + complete). Why completeness is INVARIANT under the knob: composition is associative, so precompiling A.B vs applying B lazily after A denotes the same relation -- the split changes when work happens, never which analyses exist; the walk enumerates all paths in either bucket; and closure (section 9) is computed on the full A.B relation. So the knob is a pure space/time dial; the analysis set does not move. The policy is per-language (yes, it differs): rank layers by state-multiplier x corpus hotness, precompile cheap-and-hot, keep expensive-and-cold lazy, auto-demote A->B under a state/memory budget. Same construct can be eager in one project and lazy in another -> pluggable policy + optional auto-tuner. Designed-in requirements: compiler is a pipeline of self-contained composable layers (each with state-multiplier/hotness/closure metadata) behind one eager-or-lazy interface; analyzer walks the eager core and lazily expands B layers, emitting the same MorphToken outputs; state budget is a first-class compile input (auto-demotion logged, never silent truncation); the corpus set-parity gate runs against the chosen partition. Wired into the risks table (state-blowup) and the phased plan (Phase 1-2 architecture). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ty gate FstVerification.Compare runs a candidate analyzer (FstMorpher) beside the sound+complete reference (Morpher) over a corpus and reports, per word, where their analysis SETS differ: MissingFromCandidate (completeness failures) and ExtraInCandidate (soundness/over-generation failures). AnalysisComparison.IsComplete is the gate (HERMITCRAB_FST_PLAN.md §9.5/§10.4) that must pass before the FST may replace the search engine — until then the FST runs in shadow mode. This operationalizes the completeness question: it measures both "did we find them all" and "did we invent any" at once, against the proven engine. Tests: FstMorpher.FromLanguage vs Morpher is IsComplete over a concatenative corpus (inflected, bare root, homograph, non-word), and the harness flags a deliberately-empty candidate as incomplete (proving it is not vacuous). 84 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
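
The per-word set comparison can be sketched with two set differences. Everything here is a stand-in: analyses are plain strings, the analyzers are delegates, and `VerificationSketch`, `WordDiff`, `NoMissing`, and `NoExtra` are hypothetical names; how the PR's `AnalysisComparison.IsComplete` combines the two failure kinds is not shown on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record WordDiff(
    string Word,
    IReadOnlyList<string> MissingFromCandidate, // completeness failures
    IReadOnlyList<string> ExtraInCandidate);    // over-generation failures

public static class VerificationSketch
{
    // Runs both analyzers over the corpus and keeps only the words whose analysis SETS differ.
    public static List<WordDiff> Compare(
        IEnumerable<string> corpus,
        Func<string, IEnumerable<string>> reference,
        Func<string, IEnumerable<string>> candidate)
    {
        var diffs = new List<WordDiff>();
        foreach (string word in corpus)
        {
            var expected = new HashSet<string>(reference(word));
            var actual = new HashSet<string>(candidate(word));
            List<string> missing = expected.Except(actual).ToList();
            List<string> extra = actual.Except(expected).ToList();
            if (missing.Count > 0 || extra.Count > 0)
                diffs.Add(new WordDiff(word, missing, extra));
        }
        return diffs;
    }

    public static bool NoMissing(IEnumerable<WordDiff> diffs) =>
        diffs.All(d => d.MissingFromCandidate.Count == 0);

    public static bool NoExtra(IEnumerable<WordDiff> diffs) =>
        diffs.All(d => d.ExtraInCandidate.Count == 0);
}
```

A deliberately empty candidate fails `NoMissing` on any word the reference can analyze, which is the same non-vacuity check the commit's test performs.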

Extend the acceptor with compound paths: for each compounding subrule, build root×root chains (head segs + non-head segs), tagging the head Root and the non-head Compound per the subrule's surface head position (head-first or head-last detected from the first CopyFromInput in the Rhs). FromLanguage now collects CompoundingRules alongside prefixes/suffixes; single head + single non-head subrules only (throws otherwise). This is root×root in state count — exactly the layer §10 flags as a lazy-bucket candidate at lexicon scale — built eagerly here for the parity check. Verified against Morpher via the IMorphologicalAnalyzer contract: "pʰutdat" = pʰut(5) + dat, returning both homographic non-heads (5+8, 5+9) exactly as the search engine, and the bare root through the non-compound path. 85 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…morph BuildAffixChains now iterates ALL allomorphs of an affix rule, building a segment chain for each, all sharing the rule's morpheme token. Environment-conditioned allomorphy is handled by the surface: only the allomorph whose segments match the input accepts. FromLanguage no longer restricts affixes to a single allomorph (throws only if an allomorph lacks a fixed-segment InsertSegments). This rounds out the concatenative Tier-1 fragment -- roots, prefixes, suffixes, bounded compounding, and multi-allomorph affixes. Verified at parity with Morpher via the IMorphologicalAnalyzer contract: a plural with -s/-t allomorphs analyzes "sags" and "sagt" exactly as the search engine, the surface selecting the right allomorph. 86 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

GrammarFstClosure.Analyze decides, per non-regular escape (reduplication / infixation), whether any FST-able rule could apply before it and FEED it (Kiparsky feeding). Sound stratal pre-filter: an escape is CLOSED only if no FST-able rule (concatenative affix, compounding, or any phonological rule) applies at or before its stratum — same-stratum rules count too, since unordered application could place them first. Never falsely reports closed. ClosureReport.FstClosed is true iff every escape is closed (vacuously, none): then the FST built over the FST-able fragment is closed and its "no path" is a proof for words showing no escape signature — subject to the per-word surface check and the corpus parity gate. This is the static half of "confirming FST closure"; the empirical half is FstVerification. The precise refinement that reclaims over-flagged cases is range(F) ∩ trigger(E) = ∅ via Fst.Intersect. Tests: no escapes -> closed (vacuous); innermost reduplication with nothing before it -> closed; a suffix in the same unordered stratum -> potentially fed. 89 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
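
The stratal pre-filter is the mirror image of the probe test: it looks at or before the escape's stratum rather than at or after it. A sketch under the same simplification (strata in application order, reduced to counts; `ClosureSketch` and its parameters are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ClosureSketch
{
    // fstAbleRulesPerStratum[i] = FST-able rules in stratum i (application order),
    // not counting the escape rule itself. Same-stratum rules count, since
    // unordered application could place them before the escape.
    public static bool IsClosed(IReadOnlyList<int> fstAbleRulesPerStratum, int escapeStratum)
    {
        for (int i = 0; i <= escapeStratum && i < fstAbleRulesPerStratum.Count; i++)
        {
            if (fstAbleRulesPerStratum[i] > 0)
                return false; // potentially fed
        }
        return true;
    }

    // True iff every escape is closed; vacuously true when there are none.
    public static bool FstClosed(IReadOnlyList<int> fstAbleRulesPerStratum, IEnumerable<int> escapeStrata) =>
        escapeStrata.All(s => IsClosed(fstAbleRulesPerStratum, s));
}
```

This can only err toward "potentially fed"; the finer `range(F) ∩ trigger(E) = ∅` test named in the commit is what would reclaim the over-flagged cases.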

…e 3) HybridMorpher (IMorphologicalAnalyzer) wires the three pieces together: the precompiled FST handles the FST-able fragment as a fast, allocation-light walk, and words that could involve a non-regular escape fall back to the sound+complete search engine. FstMorpher.FromLanguage gains ignoreEscapes to build the FST over just the FST-able fragment. Routing is sound by construction (only ever sends MORE words to search): the fast path is taken iff the grammar has no escapes, or every escape is CLOSED (GrammarFstClosure) AND is total reduplication (the surface signature this router detects, XX) AND the word shows no such signature. Otherwise the search runs. Verified: with a closed total-reduplication escape, "sag" takes the FST fast path and "sagsag" falls back to search, and the combined analysis set is verified COMPLETE against the pure search engine via FstVerification. This is the Phase-3 Tier-2 hybrid: the closure pass decides safety, the verification gate proves parity, the runtime routes per word. 90 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
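
The routing rule can be sketched as a pure predicate. The page does not show how the router detects the XX signature, so this sketch uses a deliberately conservative stand-in (any immediately repeated substring anywhere in the word), which only ever sends more words to search. `RouterSketch` and its parameters are hypothetical.

```csharp
using System;

public static class RouterSketch
{
    // Conservative stand-in for the "XX" surface signature: true if any
    // non-empty substring is immediately followed by a copy of itself.
    public static bool HasAdjacentRepeat(string word)
    {
        for (int len = 1; 2 * len <= word.Length; len++)
        {
            for (int start = 0; start + 2 * len <= word.Length; start++)
            {
                if (string.CompareOrdinal(word, start, word, start + len, len) == 0)
                    return true;
            }
        }
        return false;
    }

    // Fast path iff the grammar has no escapes, or every escape is closed AND
    // is total reduplication AND the word shows no repeat signature.
    public static bool UseFastPath(string word, int escapeCount, bool allEscapesClosed, bool allEscapesTotalReduplication)
    {
        if (escapeCount == 0)
            return true;
        return allEscapesClosed && allEscapesTotalReduplication && !HasAdjacentRepeat(word);
    }
}
```

With one closed total-reduplication escape, "sag" takes the fast path and "sagsag" goes to search, matching the behavior the commit reports; words such as "baas" would also go to search here, which costs speed but not soundness.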

FstGenerator implements IMorphologicalGenerator for the concatenative fragment: generation is the inverse of the analyzer's walk — ordered concatenation of each morpheme's surface representation (prefix before root, suffix after, compound stems concatenated), Cartesian over allomorph choices. Mirrors Morpher's morpheme inventory so it is a drop-in IMorphologicalGenerator. Verified against Morpher.GenerateWords on the concatenative fragment, and the analyze→generate round-trip recovers the input word ("sag", "sags"). Scope matches FstMorpher (roots + fixed-segment affixes); phonology/reduplication defer to the search generator. 92 HC tests green. This completes the in-repo Phase-4 surface: both directions (FstMorpher analyzer + FstGenerator generator), the Tier-2 hybrid, closure confirmation, verification gate, and the grammar census all exist and are verified. The remaining Phase-4 item — the FieldWorks adapter — lives in a separate repository. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
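
Generation as "ordered concatenation, Cartesian over allomorph choices" can be sketched as a fold over the morpheme slots. `GeneratorSketch` is hypothetical and works on plain strings; it does not model environment conditions, so it enumerates every allomorph combination.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GeneratorSketch
{
    // slots: the morphemes in surface order (prefix, root, suffix, ...),
    // each given as its list of candidate allomorph strings.
    public static List<string> Generate(IEnumerable<IReadOnlyList<string>> slots)
    {
        var results = new List<string> { "" };
        foreach (IReadOnlyList<string> allomorphs in slots)
        {
            // Extend every partial form with every allomorph of the next slot.
            results = results
                .SelectMany(partial => allomorphs.Select(a => partial + a))
                .ToList();
        }
        return results;
    }
}
```

For a root "sag" followed by a plural with allomorphs "s" and "t", this yields "sags" and "sagt", the two surface forms used in the multi-allomorph test above.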

…iming + parity FstSenaBenchmark loads a real grammar (HC_GRAMMAR/HC_WORDS), runs the census and closure pass, attempts to build FstMorpher/HybridMorpher (reporting the concrete WALL if a construct is out of fragment), and times search vs FST vs hybrid with FstVerification parity. Run on the real Sena grammar this surfaces the concrete remaining wall:
- census: Tier 1, fully FST-able, 0 escapes, FST-CLOSED;
- but FstMorpher.FromLanguage hits "stratum 'Morphology' has affix templates" -- templates (position classes) are FST-able in principle but not yet built by FstMorpher, so the FST/hybrid cannot yet be constructed from Sena;
- search baseline (unlimited unapplications): ~206 ms/word, true parses.
So affix-template support is the next build to get the FST over the wall on a real FLEx grammar. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…umulating walk FstTemplateAnalyzer handles affix templates (position classes) — the real-grammar case. Two design points from the advisor review: (1) build-time CATEGORY GATING — a template attaches only to roots whose category unifies with its RequiredSyntacticFeatureStruct, which kills over-generation AND lets same-category roots share one copy of the template's slot-automaton (states = roots + Σ template automata, not roots × combos); (2) TOKEN ACCUMULATION along the path (a state carries the morpheme token emitted on entry) via a custom DFS walk, since the shared automaton is reached by many roots so an accept-id map won't do. A maxStates budget (§10 knob) aborts before a blowup. New additive class (the 92 existing tests + the accept-id acceptor are untouched). Verified on a toy: a V-only suffix template with two optional slots reproduces the search engine's analyses for sag/sagd/sagdv, AND the category gate correctly blocks the verb template on an A-category root (gab/gabd) — FstVerification parity. Scope this slice: suffixing templates + category gating; prefix-slot templates, cross-stratum gating, and phonology are next. Wired into FstSenaBenchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…k -- Sena PARSES FstTemplateAnalyzer now handles prefix AND suffix template slots (prefixes surface in reverse template order), gated by BOTH category (root features unify with the template's RequiredSyntacticFeatureStruct) and stratum (root at the template's stratum or inner -- the 'datd' lesson). The walk is a proper NFA simulation (active config-set per segment, deduped by (state, tokens)) instead of the exponential recursive DFS, and guards InvalidShapeException (out-of-table phonemes) like the search engine. Result on the real Sena grammar (sena-hc.xml, 24 templates):
- FstTemplateAnalyzer BUILDS in ~0.5 s (gating shares automata → no state blowup);
- parses at ~6.4 ms/word vs the search engine's ~178 ms/word -- about 28x faster;
- 14 of 16 analyses match the search engine across the sample, with one MISSING analysis on 'mafuta' (a two-prefix form) -- a coverage gap, not over-generation.
Toy tests (parity-verified): suffix template + category gate; prefix+suffix template (bare / prefixed / suffixed / both) + gate. 94 HC tests green. Sena now parses through the FST; closing the last divergence to full FstVerification parity is the remaining refinement. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…Sena ~100x Two coverage fixes for real Sena: (1) slot rules of type RealizationalAffixProcessRule (a sibling of AffixProcessRule, same IList<AffixProcessAllomorph>) are now included, not skipped; (2) every affix is entered through a token-bearing state, so a zero/empty-segment morpheme still emits its token (previously the token was placed on the first segment state and a zero morph emitted nothing -> a missing analysis). Result on real Sena: FstTemplateAnalyzer parses at ~3 ms/word vs the search engine's ~337 ms/word (~100x) and matches the search engine's analysis set on the 8-word sample exactly; on 30 words 26 match, with residual divergences in both directions (one under-generation; over-generation where a constraint the FST does not yet enforce -- obligatory affixation / MPR / co-occurrence -- would exclude an analysis). Sena PARSES through the FST; full FstVerification parity at scale is the remaining constraint-enforcement refinement. 94 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

johnml1135 and others added 6 commits June 25, 2026 20:13

HC FST plan: make census reference self-contained (advisor confirms it)

3d46e5f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 26, 2026 00:23

Copilot started reviewing on behalf of johnml1135 June 26, 2026 00:23 View session

johnml1135 mentioned this pull request Jun 26, 2026

HermitCrab performance: single-threaded option, copy-on-write FeatureStruct, and out-of-process Server-GC parser #438

Open

Copilot AI reviewed Jun 26, 2026


johnml1135 and others added 18 commits June 25, 2026 20:51

johnml1135 wants to merge 24 commits into master from fst-advisor

johnml1135 commented Jun 26, 2026 • edited by ddaspit


2 participants
