
HermitCrab FST acceleration: grammar FST-readiness advisor (groundwork)#441

Open
johnml1135 wants to merge 24 commits into master from fst-advisor
johnml1135 commented Jun 26, 2026

Collaborator

What this is

A static grammar linter (GrammarFstAdvisor.Analyze(Language)) for the HermitCrab parser. It walks every rule in a compiled grammar and reports, per rule, what makes parsing expensive or blocks finite-state (FST) compilation — with an actionable write-up (why it's costly, how to constrain it, what to try instead) and an overall tier verdict. It runs at grammar-authoring time or in CI: a new escape that flips the tier is the "one new rule blew up the grammar" warning.

Scope honesty: this PR delivers the diagnostic front-end, not the speedup. The FST compiler + runtime (the actual 10–100× / near-zero-allocation lever) is planned in HERMITCRAB_FST_PLAN.md and gated on an unbuilt spike (§7). The advisor tells you, per grammar, whether that lever is worth pulling and what blocks it.

This is the sibling of #438 (the GC/single-threaded/Server-GC performance work). The two are fully independent — this branch touches only new files (no shared edits, no GC-only APIs) and was cherry-picked cleanly onto master.

How it classifies (severity plus two reclaim axes, each orthogonal to severity; none masks another)

  • Severity = slow in today's engine (the warning): Escape (forces the combinatorial search), Cost (regular but inflates fan-out/state count), Info. Unchanged by regularity.
  • Regular = does an FST exist in principle (the reclaim path, gated on the unbuilt compiler): by Kaplan & Kay (1994) a directional context-sensitive rewrite rule is a regular relation however long its environment — so harmony/spreading, bounded reduplication, and infixation are FST-reclaimable; only unbounded (whole-stem) copy is genuinely non-regular.
  • Probeable = is a per-word strip-and-reparse un-application sound (surface-invariant: no later phonological rule rewrites the affixed span).

A harmony rule therefore still warns (escape, not Tier 1), with Regular/Probeable reported only as reclaim notes — never as "you're fine."
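The separation between severity and the reclaim notes can be sketched as follows. This is a minimal illustration, not the PR's code: `Advisory`, `AdvisorySeverity`, and `TierSketch` are hypothetical stand-ins for the real `GrammarAdvisory` types, and the "Tier 2+" label and the treatment of an unknown (`null`) `Probeable` as not probe-able follow the commit notes further down this page. Any threshold for a pervasively non-finite-state tier is not stated in the source and is left out.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the PR's GrammarAdvisory / GrammarAdvisorySeverity.
public enum AdvisorySeverity { Info, Cost, Escape }

public sealed record Advisory(string Rule, AdvisorySeverity Severity, bool? Regular, bool? Probeable);

public static class TierSketch
{
    // Severity alone decides whether the grammar warns. Regular is never
    // consulted here, so a regular escape (e.g. harmony) still blocks Tier 1.
    public static string Verdict(IReadOnlyList<Advisory> advisories)
    {
        var escapes = advisories.Where(a => a.Severity == AdvisorySeverity.Escape).ToList();
        if (escapes.Count == 0)
            return "Tier 1";

        // Assumption: an unknown (null) Probeable counts as not probe-able,
        // which is the conservative reading.
        return escapes.All(a => a.Probeable == true) ? "Tier 2+" : "Tier 2";
    }
}
```

Because `Regular` never enters `Verdict`, adding or removing regularity information cannot change the headline tier; it only changes what the reclaim notes would say.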

Validated on the real Sena grammar

Tier 1 candidate — fully FST-able; examined 19 affix + 8 compounding rules, 0 escapes. Matches the grammar census; zero false positives.

What was investigated and discarded (and why FST is the lever)

Before landing on the FST direction, three investigations were run against the real Sena corpus. The first two were cheaper alternatives that were measured not to pay off and were rolled back (not in this PR); the third, the grammar census, is what pointed to FST:

  • Sound memoization / tabling of analysis sub-results — measured 0% hit rate: the combinatorial waste is distinct doomed branches within a word, not repeated states across words, so there's nothing to memoize.
  • Reachability pruning (feature necessary-condition + early trie lookup) — implemented and verified sound, but only ~4% on Sena.
  • Grammar census — the payoff: the real Sena grammar is ~100% FST-able (0 rewrite rules, 0 variables, 0 productive reduplication, all-concatenative affixation). That's what pointed to FST composition as the real lever past the ~3× parallel ceiling, and motivated this advisor as its front-end.

Tests

6 advisor tests (concatenative → Tier 1; reduplication → escape + tier downgrade; clean vs opaque; bounded vs unbounded reduplication; infix; harmony stays-escape-but-regular). Full HC suite green (67 tests on this branch).

🤖 Generated with Claude Code



johnml1135 and others added 6 commits June 25, 2026 20:13
Tech stack: build on SIL.Machine's own Fst (already has Compose/Determinize/Minimize/
Intersect + unification arcs; RootAllomorphTrie precedent) rather than external OpenFst/Foma
(interop + no native feature-structure support). Graceful degradation via census-chosen
tiers: fully-FS grammars -> transducer-only; partial -> FST + per-word search fallback at
non-FS escapes; pervasively-non-FS -> existing search (no regression). Soundness contract +
verification mode. Phased plan gated on a Sena compile-and-verify spike.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te FST

A grammar evolves; one new rule can quietly push it from the fast finite-state
path into the slow combinatorial search. GrammarFstAdvisor.Analyze(Language)
walks every rule and emits per-rule advisories with severity (Escape = breaks
FST, Cost = inflates search, Info), a one-line issue, and an actionable
write-up (how to constrain it / what to try instead), plus an overall tier
verdict. This is the "one new rule blew up the grammar" guard: a new Escape
that flips the tier names the offending rule and explains the fix.

Classifier: reduplication (a part copied >=2x via CopyFromInput) = Escape;
stem-split/infixation (>=2 copies of different parts) = Escape; unbounded
rewrite environment (Quantifier MaxOccur == Infinite) = Escape; deletion
(LHS longer than RHS) = Cost; many allomorphs = Cost; ModifyFromInput,
bounded rewrite rule, metathesis, compounding = Info. The report also states how
many affix/phonological/compounding rules were examined (clean ones produce no
advisory) so "fully FST-able" is backed by inspection counts.

Validated on real Sena grammar: examined 19 affix + 8 compounding, 0
phonological -> Tier 1, 0 escapes (matches the grammar census; no false
positives). Tests: concatenative grammar -> Tier 1; add a reduplication rule
-> flagged Escape with write-up + tier downgrade to Tier 2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infixation check flagged any allomorph with >=2 CopyFromInput of different
parts as Escape, but a plain suffix/circumfix over a split stem (copy "1",
copy "2", insert) has contiguous copies and is fully FST-able. True infixation
is signalled by inserted material BETWEEN two copies (copy...insert...copy);
HasInfixedCopy now detects exactly that. Added tests: a contiguous split-stem
suffix stays Tier 1 (no false escape) and a real copy-insert-copy infix is
flagged Escape.
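The copy...insert...copy test described above could look roughly like this. It is a sketch under assumptions: `OutputAction` is a hypothetical simplification of HermitCrab's output actions, and the real `HasInfixedCopy` signature is not shown in this PR page, so the class below only borrows the name for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical simplification of an allomorph's output actions.
public enum OutputAction { Copy, Insert, Modify }

public static class InfixSketch
{
    // True only when inserted material sits BETWEEN two copies
    // (copy ... insert ... copy). Contiguous copies followed or preceded
    // by an insert (suffix, prefix, circumfix over a split stem) return false.
    public static bool HasInfixedCopy(IEnumerable<OutputAction> rhs)
    {
        bool seenCopy = false;
        bool insertAfterCopy = false;
        foreach (OutputAction action in rhs)
        {
            if (action == OutputAction.Copy)
            {
                if (insertAfterCopy)
                    return true; // a copy resumes after inserted material
                seenCopy = true;
            }
            else if (action == OutputAction.Insert && seenCopy)
            {
                insertAfterCopy = true;
            }
        }
        return false;
    }
}
```

With this shape, `Copy, Copy, Insert` (a suffix over a split stem) and `Insert, Copy, Copy, Insert` (a circumfix) are not flagged, while `Copy, Insert, Copy` is.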

Also label each advisory with its stratum (rules can appear in more than one),
which clarifies the Sena report: its 8 compounding rules (mrule1-8, 4 names
reused in pairs) all live in the 'Morphology' stratum -- genuine distinct
rules, not a re-walk. Sena verdict unchanged: Tier 1, 0 escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…opaque

An infix/reduplication escape can be un-applied per word by a cheap
strip-and-reparse probe (remove the candidate affix, re-parse the residue with
the FST) ONLY if nothing downstream rewrites the affixed span. Add the static
soundness test: an escape in stratum i is "probe-able" iff no phonological rule
runs at stratum i or later (surface-invariant); otherwise "opaque" and the
search backstop is required. Sound-conservative: presence of any later
phonological rule => opaque.
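As a rough illustration of that static test, assuming strata are indexed innermost-first and summarized by how many phonological rules each one contains (both the indexing and the `ProbeSketch` helper are assumptions made for this sketch, not the PR's API):

```csharp
using System.Collections.Generic;

public static class ProbeSketch
{
    // phonologicalRuleCounts[i] = number of phonological rules in stratum i
    // (assumed ordering: innermost stratum first).
    // An escape in stratum escapeStratum is probe-able iff no phonological
    // rule runs at that stratum or any later one.
    public static bool IsProbeable(int escapeStratum, IReadOnlyList<int> phonologicalRuleCounts)
    {
        for (int i = escapeStratum; i < phonologicalRuleCounts.Count; i++)
        {
            if (phonologicalRuleCounts[i] > 0)
                return false; // opaque: something may rewrite the affixed span
        }
        return true;
    }
}
```

The check is deliberately coarse: it does not ask whether the later rule could actually touch the affixed span, only whether one exists, which matches the sound-conservative stance described above.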

GrammarAdvisory.Probeable (bool?) records it; the report counts
ProbeableEscapeCount / OpaqueEscapeCount and, when every escape is probe-able,
reports a "Tier 2+" verdict (a per-word probe recovers the fast path,
effectively Tier 1 with no search backstop). Escape advice now spells out the
probe and why it is or isn't sound.

Tests: reduplication with no later phonology => probe-able (Tier 2+); the same
rule with a later-stratum rewrite rule => opaque (plain Tier 2 hybrid). Sena
unchanged (Tier 1, 0 escapes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…arning

Add GrammarAdvisory.Regular (bool?): does an FST exist for this construct in
principle? By Kaplan & Kay (1994) a directional context-sensitive rewrite rule
is a regular relation however long its environment, so harmony/spreading and
bounded reduplication and infixation are regular and FST-reclaimable; only
whole-stem (unbounded) copy is genuinely non-regular.

Crucially this is kept ORTHOGONAL to severity: the FST compiler that turns
"regular" into "fast" is not built yet, so severity still means "slow in
today's engine" and is UNCHANGED -- every current escape stays an escape. A
harmony rule still warns (escape present, not Tier 1); Regular only adds a
separate reclaim-path note ("FST-reclaimable once the compiler exists; slow
today"). The report prints RegularEscapeCount / NonRegularEscapeCount and a
reclaim-path line; the tier verdict is NOT upgraded by regularity.

Detection: reduplication regularity from the copied part's Lhs pattern
boundedness (unbounded/unresolved -> non-regular, conservative); infix regular
(pattern-defined slot); unbounded-environment rewrite regular iff its own
Lhs/Rhs are bounded. Also fixed a latent tier bug (Probeable==null phonological
escapes were counted as "all probe-able") and removed the present-tense
"effectively Tier 1" claim from the Tier 2+ string -- the probe runtime is also
unbuilt, so both reclaim axes now read "would recover ... once it exists; slow
today".

Tests: harmony rewrite stays Escape + Regular (headline still warns);
unbounded-copy redup => non-regular; bounded reduplicant + infix => regular.
Sena unchanged (Tier 1). 6 advisor + 69 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.

Changes:

  • Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
  • Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
  • Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |


Comment on lines +111 to +129
Advisories = advisories;
AffixRulesExamined = affixRulesExamined;
PhonologicalRulesExamined = phonologicalRulesExamined;
CompoundingRulesExamined = compoundingRulesExamined;
EscapeCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Escape);
CostCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Cost);
InfoCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Info);
ProbeableEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == true
);
OpaqueEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == false
);
RegularEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Regular == true
);
NonRegularEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Regular != true
);
Comment on lines +270 to +277
foreach (IMorphologicalRule mrule in stratum.MorphologicalRules)
{
    switch (mrule)
    {
        case AffixProcessRule affix:
            affixExamined++;
            AnalyzeAffix(affix, stratum.Name, surfaceInvariant, advisories, manyAllomorphsThreshold);
            break;
johnml1135 and others added 18 commits June 25, 2026 20:51
The analyzer transducer must emit the structured derivation (ordered morphemes
+ root), not just accept/reject, or it is a recognizer not an analyzer. Define
the compact output token: high 8 bits = MorphOp (role/operation: Root, Prefix,
Suffix, Infix, Reduplication, Circumfix*, Compound, Clitic, Process, Null), low
24 bits = morpheme index into the grammar's morpheme table. An accepting path's
output is the uint[] of these tokens, which IS the analysis and is
self-describing: Morphemes = indices in array order; RootMorphemeIndex = the
Root token's position (no separate field).

Verdict on the proposed 8+24 packing: sound and the right compactness choice
(4 bytes/morph, hashable, columnar). 24-bit ceiling = morpheme indices up to
16,777,215 (ample; compiler asserts). Refinement baked into the schema: keep the 32-bit
word as the pure (op, morpheme) derivation and DON'T overload it with surface
segmentation or allomorph identity -- those are optional parallel channels.

MorphToken codec (Encode/GetOp/GetMorphemeId/RootIndex) + bounds check, plus
HERMITCRAB_FST_PLAN.md section 8 documenting the schema. 5 tests (round-trip,
out-of-range throw, distinctness, self-describing derivation array, root
recovery).
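A minimal sketch of the 8+24 packing, to make the layout concrete. The method names follow the codec listed above (Encode/GetOp/GetMorphemeId/RootIndex), but the class name `MorphTokenSketch`, the numeric values assigned to `MorphOp`, and the omission of the Circumfix variants are assumptions made for illustration only.

```csharp
using System;

// Assumed numeric values; the source lists the operations but not their codes.
public enum MorphOp : byte
{
    Root, Prefix, Suffix, Infix, Reduplication, Compound, Clitic, Process, Null
}

public static class MorphTokenSketch
{
    public const int MaxMorphemeId = 0x00FFFFFF; // 16,777,215, the 24-bit ceiling

    // High 8 bits = operation, low 24 bits = morpheme index.
    public static uint Encode(MorphOp op, int morphemeId)
    {
        if (morphemeId < 0 || morphemeId > MaxMorphemeId)
            throw new ArgumentOutOfRangeException(nameof(morphemeId));
        return ((uint)op << 24) | (uint)morphemeId;
    }

    public static MorphOp GetOp(uint token) => (MorphOp)(token >> 24);

    public static int GetMorphemeId(uint token) => (int)(token & 0x00FFFFFFu);

    // The root position is recovered from the op codes alone:
    // index of the first Root token, or -1 if there is none.
    public static int RootIndex(uint[] tokens) =>
        Array.FindIndex(tokens, t => GetOp(t) == MorphOp.Root);
}
```

For example, an array built from `Encode(MorphOp.Prefix, 1)`, `Encode(MorphOp.Root, 7)`, `Encode(MorphOp.Suffix, 3)` decodes back to morphemes 1, 7, 3 in order, with the root at position 1 and no separate root field.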

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nalyses

The packed-token tests alone proved bit-packing, not schema fidelity. Add the
reference encoder MorphTokenCodec (Word -> uint[]), which mirrors
Morpher.CreateWordAnalysis (same AllomorphsInMorphOrder iteration + RootAllomorph
check) and populates the op channel from the actual rule: head root -> Root,
other stems -> Compound, affixes classified from their output actions
(reduplication / infix / prefix / suffix / process).

Round-trip tests on real parsed words now MEASURE soundness rather than assert
it:
- suffix word: decoded morphemes reproduce WordAnalysis.Morphemes in order, and
  RootIndex (recovered purely from the Root op code) == WordAnalysis.RootMorphemeIndex;
- compound (two stems): the flat array keeps both morphemes with exactly one
  Root + one Compound, matched to WordAnalysis by morpheme sequence with root
  index at parity -- confirming the flat array is at parity with WordAnalysis's
  own compound flattening (not lossy);
- ClassifyOp populates reduplication/infix/prefix/suffix from real output actions.

Resolves the two open risks on the schema: the op channel is now populated from
a real Word (not asserted), and multi-root/compound handling is verified. 75 HC
tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FstMorpher hand-builds one acceptor (mirroring RootAllomorphTrie -- no
Compose/Determinize/Minimize, which are unsafe for HC's underspecified-feature
arcs): root segment chains from the start state, with each fixed-segment suffix
appended after every root-accepting state; accepting states map to packed
MorphToken arrays. Analysis is a single nondeterministic Transduce walk of the
surface word -- no Word clones, no generate-and-test.

Verified against Morpher.AnalyzeWord on the concatenative fragment:
- bare root and root+suffix ("sags") round-trip to the same morphemes + root;
- COMPLETENESS as analysis-set equality, not "found one": homographs (dat ->
  entries 8 and 9 both found), and the negative case (no path -> both empty)
  agree with the search engine;
- an [Explicit] allocation comparison (FST walk vs search engine).

Caveat documented in FstMorpher: arcs match segments only, not the affix's
syntactic/MPR/stratum constraints, so on grammars where letters match but
constraints exclude an analysis it would over-generate -- closed by
feature-unification arcs (HERMITCRAB_FST_PLAN.md section 8). 78 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e contract)

FstMorpher now implements IMorphologicalAnalyzer.AnalyzeWord -> IEnumerable<WordAnalysis>,
the same interface Morpher implements and that consumers (FieldWorks et al.)
depend on. Morphemes and root index come from the token walk; Category is null
because this slice does not yet track syntactic features (arrives with the
unification arcs). A test drives both engines through the IMorphologicalAnalyzer
contract and asserts the WordAnalysis sets match (homographs included) on the
concatenative fragment, so the FST analyzer is a drop-in for the search engine
at the interface level.

Still scoped to the clean concatenative fragment built from explicit root/suffix
lists; full-grammar Compose compilation, phonology/allomorphy, the
feeding-closure completeness certificate, and the FieldWorks adapter remain. 79
HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…grammar

FstMorpher.FromLanguage(Language) builds the analyzer by introspecting the
grammar: every root allomorph plus every single-allomorph suffix rule
(detected via MorphTokenCodec.ClassifyOp == Suffix). It THROWS
NotSupportedException on any construct outside the concatenative root+suffix
fragment (prefixes, infixes, reduplication, compounding, templates,
multi-allomorph affixes) so it never silently under-generates — a caller learns
exactly what this slice cannot cover.

This closes the "not driven from a compiled Language" gap: the analyzer now
consumes a real Language object, not explicit root/suffix lists. Verified at
parity with Morpher (through the IMorphologicalAnalyzer contract) on the
concatenative fragment, plus a guard test that a prefix rule makes FromLanguage
refuse rather than produce a quietly incomplete analyzer. 81 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ffix?)

The acceptor now prepends a prefix segment chain before the root (mirroring the
appended suffix chains), so it covers an optional fixed-segment prefix + root +
optional fixed-segment suffix. FromLanguage auto-detects prefix rules
(ClassifyOp == Prefix) alongside suffixes and still throws on anything outside
the concatenative fragment (reduplication, infix, compounding, templates,
multi-allomorph affixes).

Verified at parity with Morpher via the IMorphologicalAnalyzer contract:
"disag" = di-(PST) + sag, and the bare root through the no-prefix branch. The
throw-guard test now uses reduplication (a genuinely non-concatenative rule).
82 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tificate)

Document the completeness/closure analysis: an FST analyzer is trustworthy only
if its silence is a proof. Completeness has two parts — "no escape applies here"
(easy) and "no normal FST step reachable from the input can FEED an escape" (the
kicker, Kiparsky feeding). The universal question is undecidable, but per grammar
it is usually decidable:

- decidable feeding-closure: for each FST-able rule F and escape E, test
  range(F) ∩ trigger(E) = ∅ via Fst.Intersect; all empty ⇒ the fragment is
  closed ⇒ "no path" is a complete certificate; non-empty + regular ⇒ fold in;
  non-empty + non-regular ⇒ those words fall to the search backstop;
- stratal containment as the practical guarantee (escapes innermost, not fed by
  the FST fragment);
- homograph completeness = all accepting paths returned, contingent on closure +
  never unsafely determinizing/minimizing unification arcs;
- the search backstop's "done" rests on a true derivation-depth bound (finite iff
  no unbounded self-feeding cycle);
- the work: a static feeding-closure pass extending GrammarFstAdvisor + corpus
  closure verification (set parity) as the gate before the FST may replace the
  search engine.

Wired into the phased plan (Phase 3 gated on §9), the risks table, and the
decision flow. Until closure is confirmed for a grammar, the FST runs in
shadow/verification mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, invariant completeness)

Design in from the start a tunable partition that bounds the compiled automaton
without sacrificing completeness. Three buckets: A precompiled (eager, fast
walk, costs states), B on-the-fly (lazy on-demand composition, bounded memory),
C search/probe fallback (non-FS escapes, set by section 9 closure not the knob).
The A<->B boundary is the knob, with a safe floor (everything lazy = bounded +
complete).

Why completeness is INVARIANT under the knob: composition is associative, so
precompiling A.B vs applying B lazily after A denotes the same relation — the
split changes when work happens, never which analyses exist; the walk enumerates
all paths in either bucket; and closure (section 9) is computed on the full
A.B relation. So the knob is a pure space/time dial; the analysis set does not
move.
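The invariance claim can be checked on a toy scale by treating each layer as a finite set of input/output pairs. This is only an illustration of the argument, not the planned compiler: `CompositionSketch`, `Compose`, and `Apply` are hypothetical names, and real layers would be transducers rather than explicit pair sets.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CompositionSketch
{
    // Eager bucket: materialize A.B once as a single relation.
    public static HashSet<(string In, string Out)> Compose(
        IEnumerable<(string In, string Out)> a,
        IEnumerable<(string In, string Out)> b)
    {
        var second = b.ToList();
        var composed = new HashSet<(string In, string Out)>();
        foreach (var (x, y) in a)
            foreach (var (y2, z) in second)
                if (y == y2)
                    composed.Add((x, z));
        return composed;
    }

    // Apply one relation to a set of inputs, returning every reachable output.
    public static HashSet<string> Apply(
        IEnumerable<(string In, string Out)> relation,
        IEnumerable<string> inputs)
    {
        var wanted = new HashSet<string>(inputs);
        return new HashSet<string>(
            relation.Where(p => wanted.Contains(p.In)).Select(p => p.Out));
    }
}
```

Calling `Apply(Compose(a, b), words)` (precompiled) and `Apply(b, Apply(a, words))` (B applied lazily after A) yields the same output set for any finite `a` and `b`; only the moment at which the join is computed differs.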

The policy is per-language (yes, it differs): rank layers by state-multiplier x
corpus hotness, precompile cheap-and-hot, keep expensive-and-cold lazy, auto-
demote A->B under a state/memory budget. Same construct can be eager in one
project and lazy in another -> pluggable policy + optional auto-tuner.

Designed-in requirements: compiler is a pipeline of self-contained composable
layers (each with state-multiplier/hotness/closure metadata) behind one
eager-or-lazy interface; analyzer walks the eager core and lazily expands B
layers, emitting the same MorphToken outputs; state budget is a first-class
compile input (auto-demotion logged, never silent truncation); the corpus
set-parity gate runs against the chosen partition. Wired into the risks table
(state-blowup) and the phased plan (Phase 1-2 architecture).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ty gate

FstVerification.Compare runs a candidate analyzer (FstMorpher) beside the
sound+complete reference (Morpher) over a corpus and reports, per word, where
their analysis SETS differ: MissingFromCandidate (completeness failures) and
ExtraInCandidate (soundness/over-generation failures). AnalysisComparison.IsComplete
is the gate (HERMITCRAB_FST_PLAN.md §9.5/§10.4) that must pass before the FST may
replace the search engine — until then the FST runs in shadow mode.

This operationalizes the completeness question: it measures both "did we find
them all" and "did we invent any" at once, against the proven engine. Tests:
FstMorpher.FromLanguage vs Morpher is IsComplete over a concatenative corpus
(inflected, bare root, homograph, non-word), and the harness flags a
deliberately-empty candidate as incomplete (proving it is not vacuous). 84 HC
tests green.
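The per-word set comparison can be sketched as below. The names `WordDiff`, `VerificationSketch`, and `SetsMatch` are hypothetical, analyses are reduced to plain strings for brevity, and whether the real `AnalysisComparison.IsComplete` also requires an empty `ExtraInCandidate` is not stated here, so the gate in this sketch (no differences in either direction) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record WordDiff(
    string Word,
    IReadOnlyList<string> MissingFromCandidate, // completeness failures
    IReadOnlyList<string> ExtraInCandidate);    // over-generation failures

public static class VerificationSketch
{
    // Runs both analyzers over the corpus and records only the words
    // whose analysis SETS differ.
    public static List<WordDiff> Compare(
        IEnumerable<string> corpus,
        Func<string, IEnumerable<string>> reference,
        Func<string, IEnumerable<string>> candidate)
    {
        var diffs = new List<WordDiff>();
        foreach (string word in corpus)
        {
            var expected = new HashSet<string>(reference(word));
            var actual = new HashSet<string>(candidate(word));
            var missing = expected.Except(actual).ToList();
            var extra = actual.Except(expected).ToList();
            if (missing.Count > 0 || extra.Count > 0)
                diffs.Add(new WordDiff(word, missing, extra));
        }
        return diffs;
    }

    // Assumed gate: parity holds only when no word differs at all.
    public static bool SetsMatch(IEnumerable<WordDiff> diffs) => !diffs.Any();
}
```

A deliberately empty candidate fails this gate on every word the reference can analyze, which is the non-vacuity check the tests above describe; a non-word that neither engine analyzes produces no difference.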

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend the acceptor with compound paths: for each compounding subrule, build
root×root chains (head segs + non-head segs), tagging the head Root and the
non-head Compound per the subrule's surface head position (head-first or
head-last detected from the first CopyFromInput in the Rhs). FromLanguage now
collects CompoundingRules alongside prefixes/suffixes; single head + single
non-head subrules only (throws otherwise).

This is root×root in state count — exactly the layer §10 flags as a lazy-bucket
candidate at lexicon scale — built eagerly here for the parity check. Verified
against Morpher via the IMorphologicalAnalyzer contract: "pʰutdat" = pʰut(5) +
dat, returning both homographic non-heads (5+8, 5+9) exactly as the search
engine, and the bare root through the non-compound path. 85 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…morph

BuildAffixChains now iterates ALL allomorphs of an affix rule, building a
segment chain for each, all sharing the rule's morpheme token. Environment-
conditioned allomorphy is handled by the surface: only the allomorph whose
segments match the input accepts. FromLanguage no longer restricts affixes to a
single allomorph (throws only if an allomorph lacks a fixed-segment
InsertSegments).

This rounds out the concatenative Tier-1 fragment — roots, prefixes, suffixes,
bounded compounding, and multi-allomorph affixes. Verified at parity with
Morpher via the IMorphologicalAnalyzer contract: a plural with -s/-t allomorphs
analyzes "sags" and "sagt" exactly as the search engine, the surface selecting
the right allomorph. 86 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GrammarFstClosure.Analyze decides, per non-regular escape (reduplication /
infixation), whether any FST-able rule could apply before it and FEED it
(Kiparsky feeding). Sound stratal pre-filter: an escape is CLOSED only if no
FST-able rule (concatenative affix, compounding, or any phonological rule)
applies at or before its stratum — same-stratum rules count too, since unordered
application could place them first. Never falsely reports closed.

ClosureReport.FstClosed is true iff every escape is closed (vacuously, none):
then the FST built over the FST-able fragment is closed and its "no path" is a
proof for words showing no escape signature — subject to the per-word surface
check and the corpus parity gate. This is the static half of "confirming FST
closure"; the empirical half is FstVerification. The precise refinement that
reclaims over-flagged cases is range(F) ∩ trigger(E) = ∅ via Fst.Intersect.

Tests: no escapes -> closed (vacuous); innermost reduplication with nothing
before it -> closed; a suffix in the same unordered stratum -> potentially fed.
89 HC tests green.
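The stratal pre-filter amounts to a scan over the strata up to and including the escape's own. The sketch below assumes strata are indexed innermost-first and summarized by a count of FST-able rules per stratum; `ClosureSketch` and its signatures are hypothetical and do not reflect the actual `GrammarFstClosure` API.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ClosureSketch
{
    // fstAbleRuleCounts[i] = number of FST-able rules (concatenative affix,
    // compounding, or phonological) in stratum i, innermost first (assumed).
    // An escape is closed only if nothing FST-able applies at or before
    // its stratum; same-stratum rules count as potential feeders.
    public static bool IsClosed(int escapeStratum, IReadOnlyList<int> fstAbleRuleCounts)
    {
        for (int i = 0; i <= escapeStratum && i < fstAbleRuleCounts.Count; i++)
        {
            if (fstAbleRuleCounts[i] > 0)
                return false; // potentially fed
        }
        return true;
    }

    // True iff every escape is closed; vacuously true with no escapes.
    public static bool FstClosed(IEnumerable<int> escapeStrata, IReadOnlyList<int> fstAbleRuleCounts) =>
        escapeStrata.All(s => IsClosed(s, fstAbleRuleCounts));
}
```

This mirrors the three test cases listed above: no escapes, an innermost escape with nothing before it, and an escape sharing a stratum with a suffix rule.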

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e 3)

HybridMorpher (IMorphologicalAnalyzer) wires the three pieces together: the
precompiled FST handles the FST-able fragment as a fast, allocation-light walk,
and words that could involve a non-regular escape fall back to the
sound+complete search engine. FstMorpher.FromLanguage gains ignoreEscapes to
build the FST over just the FST-able fragment.

Routing is sound by construction (only ever sends MORE words to search): the
fast path is taken iff the grammar has no escapes, or every escape is CLOSED
(GrammarFstClosure) AND is total reduplication (the surface signature this
router detects, XX) AND the word shows no such signature. Otherwise the search
runs.
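The routing condition can be written out as a small predicate. This is a sketch under stated assumptions: `RoutingSketch` is a hypothetical helper, and the signature test used here (any adjacent repeated substring anywhere in the word) is a deliberately over-eager stand-in for whatever XX detection the real router performs. Over-detection only sends more words to search, which keeps the sketch on the safe side.

```csharp
using System;

public static class RoutingSketch
{
    // Conservative XX signature: some substring of at least minLength
    // characters is immediately followed by an identical copy.
    public static bool HasAdjacentRepeat(string word, int minLength = 1)
    {
        for (int len = Math.Max(1, minLength); 2 * len <= word.Length; len++)
        {
            for (int start = 0; start + 2 * len <= word.Length; start++)
            {
                if (string.CompareOrdinal(word, start, word, start + len, len) == 0)
                    return true;
            }
        }
        return false;
    }

    // Fast path iff the grammar has no escapes, or every escape is closed
    // AND is total reduplication AND the word shows no XX signature.
    public static bool UseFastPath(
        bool grammarHasEscapes,
        bool allEscapesClosed,
        bool allEscapesTotalReduplication,
        string word)
    {
        if (!grammarHasEscapes)
            return true;
        return allEscapesClosed
            && allEscapesTotalReduplication
            && !HasAdjacentRepeat(word);
    }
}
```

With a closed total-reduplication escape, "sag" has no adjacent repeat and takes the fast path, while "sagsag" is routed to the search engine. Words with an incidental doubled letter also go to search under this sketch, which costs speed but not correctness.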

Verified: with a closed total-reduplication escape, "sag" takes the FST fast
path and "sagsag" falls back to search, and the combined analysis set is
verified COMPLETE against the pure search engine via FstVerification. This is
the Phase-3 Tier-2 hybrid: the closure pass decides safety, the verification
gate proves parity, the runtime routes per word. 90 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FstGenerator implements IMorphologicalGenerator for the concatenative fragment:
generation is the inverse of the analyzer's walk — ordered concatenation of each
morpheme's surface representation (prefix before root, suffix after, compound
stems concatenated), Cartesian over allomorph choices. Mirrors Morpher's
morpheme inventory so it is a drop-in IMorphologicalGenerator.
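The "ordered concatenation, Cartesian over allomorph choices" step can be illustrated with plain strings. `GeneratorSketch` is a hypothetical helper, morphemes are assumed to arrive already in surface order, and no environment conditions are checked, so this shows only the product itself rather than the PR's `FstGenerator`.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GeneratorSketch
{
    // slots: the morphemes in surface order (prefix, root, suffix, ...),
    // each given as the list of its allomorphs' surface forms.
    // Returns every concatenation that picks one allomorph per slot.
    public static List<string> Generate(IReadOnlyList<IReadOnlyList<string>> slots)
    {
        var results = new List<string> { "" };
        foreach (IReadOnlyList<string> allomorphs in slots)
        {
            results = results
                .SelectMany(prefix => allomorphs.Select(a => prefix + a))
                .ToList();
        }
        return results;
    }
}
```

For a root "sag" followed by a plural with allomorphs "s" and "t", the product is "sags" and "sagt"; a slot with no allomorphs yields no output at all.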

Verified against Morpher.GenerateWords on the concatenative fragment, and the
analyze→generate round-trip recovers the input word ("sag", "sags"). Scope
matches FstMorpher (roots + fixed-segment affixes); phonology/reduplication
defer to the search generator. 92 HC tests green.

This completes the in-repo Phase-4 surface: both directions (FstMorpher analyzer
+ FstGenerator generator), the Tier-2 hybrid, closure confirmation, verification
gate, and the grammar census all exist and are verified. The remaining Phase-4
item — the FieldWorks adapter — lives in a separate repository.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iming + parity

FstSenaBenchmark loads a real grammar (HC_GRAMMAR/HC_WORDS), runs the census and
closure pass, attempts to build FstMorpher/HybridMorpher (reporting the concrete
WALL if a construct is out of fragment), and times search vs FST vs hybrid with
FstVerification parity.

Run on the real Sena grammar this surfaces the concrete remaining wall:
- census: Tier 1, fully FST-able, 0 escapes, FST-CLOSED;
- but FstMorpher.FromLanguage hits "stratum 'Morphology' has affix templates" —
  templates (position classes) are FST-able in principle but not yet built by
  FstMorpher, so the FST/hybrid cannot yet be constructed from Sena;
- search baseline (unlimited unapplications): ~206 ms/word, true parses.

So affix-template support is the next build to get the FST over the wall on a
real FLEx grammar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…umulating walk

FstTemplateAnalyzer handles affix templates (position classes) — the real-grammar
case. Two design points from the advisor review: (1) build-time CATEGORY GATING —
a template attaches only to roots whose category unifies with its
RequiredSyntacticFeatureStruct, which kills over-generation AND lets same-category
roots share one copy of the template's slot-automaton (states = roots +
Σ template automata, not roots × combos); (2) TOKEN ACCUMULATION along the path
(a state carries the morpheme token emitted on entry) via a custom DFS walk,
since the shared automaton is reached by many roots so an accept-id map won't do.
A maxStates budget (§10 knob) aborts before a blowup.

New additive class (the 92 existing tests + the accept-id acceptor are untouched).
Verified on a toy: a V-only suffix template with two optional slots reproduces
the search engine's analyses for sag/sagd/sagdv, AND the category gate correctly
blocks the verb template on an A-category root (gab/gabd) — FstVerification parity.

Scope of this slice: suffixing templates + category gating; prefix-slot templates,
cross-stratum gating, and phonology are next. Wired into FstSenaBenchmark.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k — Sena PARSES

FstTemplateAnalyzer now handles prefix AND suffix template slots (prefixes
surface in reverse template order), gated by BOTH category (root features unify
with the template's RequiredSyntacticFeatureStruct) and stratum (root at the
template's stratum or inner — the 'datd' lesson). The walk is a proper NFA
simulation (active config-set per segment, deduped by (state, tokens)) instead
of the exponential recursive DFS, and guards InvalidShapeException (out-of-table
phonemes) like the search engine.

Result on the real Sena grammar (sena-hc.xml, 24 templates):
- FstTemplateAnalyzer BUILDS in ~0.5 s (gating shares automata → no state blowup);
- parses at ~6.4 ms/word vs the search engine's ~178 ms/word — about 28x faster;
- 14 of 16 analyses match the search engine across the sample, with one MISSING
  analysis on 'mafuta' (a two-prefix form) — a coverage gap, not over-generation.

Toy tests (parity-verified): suffix template + category gate; prefix+suffix
template (bare / prefixed / suffixed / both) + gate. 94 HC tests green.

Sena now parses through the FST; closing the last divergence to full
FstVerification parity is the remaining refinement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Sena ~100x

Two coverage fixes for real Sena: (1) slot rules of type
RealizationalAffixProcessRule (a sibling of AffixProcessRule, same
IList<AffixProcessAllomorph>) are now included, not skipped; (2) every affix is
entered through a token-bearing state, so a zero/empty-segment morpheme still
emits its token (previously the token was placed on the first segment state and
a zero morph emitted nothing -> a missing analysis).

Result on real Sena: FstTemplateAnalyzer parses at ~3 ms/word vs the search
engine's ~337 ms/word (~100x) and matches the search engine's analysis set on
the 8-word sample exactly; on 30 words 26 match, with residual divergences in
both directions (one under-generation; over-generation where a constraint the
FST does not yet enforce -- obligatory affixation / MPR / co-occurrence -- would
exclude an analysis). Sena PARSES through the FST; full FstVerification parity at
scale is the remaining constraint-enforcement refinement. 94 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>