Add eflomal word aligner: Bayesian IBM1→HMM→fertility with N parallel Gibbs chains #433

johnml1135 wants to merge 5 commits into
Pull request overview
Integrates the new Thot-backed Eflomal Bayesian word alignment model into the SIL.Machine.Translation.Thot alignment pipeline, exposing training configuration (iterations, parallel Gibbs samplers) and wiring it through the CLI and model factory.
Changes:
- Adds `Eflomal` as a `ThotWordAlignmentModelType` and wires it into `ThotWordAlignmentModel.Create`.
- Introduces `ThotEflomalWordAlignmentModel`, plus trainer support for `EflomalNumSamplers` and iteration scheduling via `ThotWordAlignmentParameters`.
- Extends Thot interop (`Thot.cs`) and CLI plumbing (`AlignmentModelCommandSpec`, `ToolHelpers`) and adds Eflomal-focused tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/SIL.Machine.Translation.Thot.Tests/ThotEflomalWordAlignmentModelTests.cs | Adds coverage for creating/training/aligning with Eflomal, batch alignment, save/load, and symmetric alignment. |
| src/SIL.Machine.Translation.Thot/ThotWordAlignmentParameters.cs | Adds Eflomal iteration and sampler parameters with defaults. |
| src/SIL.Machine.Translation.Thot/ThotWordAlignmentModelType.cs | Adds Eflomal model type and string mapping. |
| src/SIL.Machine.Translation.Thot/ThotWordAlignmentModelTrainer.cs | Adds an Eflomal training branch, including sampler configuration. |
| src/SIL.Machine.Translation.Thot/ThotWordAlignmentModel.cs | Wires Eflomal into the alignment model factory method. |
| src/SIL.Machine.Translation.Thot/ThotEflomalWordAlignmentModel.cs | Implements the Eflomal-specific ComputeAlignedWordPairScores behavior. |
| src/SIL.Machine.Translation.Thot/Thot.cs | Adds Eflomal alignment-model enum value, P/Invoke for sampler count, and mapping from ThotWordAlignmentModelType. |
| src/SIL.Machine.Tool/ToolHelpers.cs | Adds CLI string constant for eflomal. |
| src/SIL.Machine.Tool/AlignmentModelCommandSpec.cs | Updates help text and parameter mapping to include Eflomal iterations and samplers. |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

Coverage Diff

| Metric | master | #433 | +/- |
|---|---|---|---|
| Coverage | 73.20% | 73.20% | -0.01% |
| Files | 440 | 442 | +2 |
| Lines | 36931 | 37011 | +80 |
| Branches | 5077 | 5089 | +12 |
| Hits | 27037 | 27094 | +57 |
| Misses | 8781 | 8797 | +16 |
| Partials | 1113 | 1120 | +7 |
johnml1135 left a comment:
All Copilot review comments have been addressed — see replies on individual threads for details.
…el Gibbs chains

Wires EflomalAlignmentModel from thot into machine's alignment pipeline. Depends on sillsdev/thot#11.

Changes:

- ThotWordAlignmentModelType - Eflomal enum value + "eflomal" string alias
- ThotEflomalWordAlignmentModel - new class modeled on ThotFastAlignWordAlignmentModel; AlignmentScore is 1.0 (uniform) since eflomal does not expose an alignment probability
- ThotWordAlignmentModel.Create - Eflomal factory case
- ThotWordAlignmentModelTrainer - Eflomal branch: single IBM1->HMM->fertility cascade, graceful NotSupportedException when Thot NuGet predates EflomalAlignmentModel, setEflomalNumSamplers for parallel chains
- ThotWordAlignmentParameters - EflomalIterationCount (default 12) + EflomalNumSamplers (default 1)
- Thot.cs - swAlignModel_setEflomalNumSamplers/getEflomalNumSamplers P/Invoke
- AlignmentModelCommandSpec - --eflomal-iters + --eflomal-samplers CLI flags; ToolHelpers.Eflomal added to ValidateAlignmentModelTypeOption
- ThotEflomalWordAlignmentModelTests - 6 tests; [OneTimeSetUp]+Assume.That skips gracefully when installed Thot NuGet lacks EflomalAlignmentModel

Quality (WPT English-French, 300k pairs, 447 gold - measured in thot):

- HMM: 10.4% intersection AER
- eflomal GPL ref: 6.58% (3 chains)
- This PR 1 chain: 7.52%
- This PR 5 chains: 6.46% (beats GPL reference)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
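For readers unfamiliar with the metric in the quality figures above, here is a minimal sketch of the standard alignment error rate (AER) applied to an intersection-symmetrized alignment. It is an illustration only, not the evaluation script used for these numbers: the function names `alignment_error_rate` and `intersect` and the toy link sets are hypothetical, and the gold standard is assumed to distinguish sure links from possible links.

```python
def alignment_error_rate(predicted, sure, possible):
    """Standard AER: 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as possible
    if not a and not s:
        return 0.0  # nothing predicted and nothing expected: no error
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def intersect(forward, reverse):
    """Keep only links proposed by both directional models.

    `forward` holds (source, target) pairs; `reverse` holds (target, source)
    pairs and is flipped before intersecting.
    """
    return set(forward) & {(src, tgt) for (tgt, src) in reverse}


# Toy example (made-up links, not data from this PR).
forward = {(0, 0), (1, 1), (2, 1)}
reverse = {(0, 0), (1, 1), (2, 2)}
links = intersect(forward, reverse)       # {(0, 0), (1, 1)}
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(2, 1)}
print(round(alignment_error_rate(links, gold_sure, gold_possible), 4))  # 0.2
```

Lower is better: the intersection keeps only high-precision links, so the remaining error in this toy case comes from the missed sure link `(2, 2)`.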
Line 105 was 131 characters (exceeds max_line_length=120). Wrap the three-part HasValue condition and the return expression across lines, and add required braces (IDE0011) around the now-multi-line body.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Wires `EflomalAlignmentModel` from thot into machine's alignment pipeline. eflomal is a Bayesian IBM1→HMM→fertility cascade trained by collapsed Gibbs sampling. Depends on Thot 3.5.0 (released to NuGet; implementation in sillsdev/thot#11).

Changes

- `ThotWordAlignmentModelType` / `ToolHelpers`: `Eflomal` enum value + `"eflomal"` string alias.
- `ThotEflomalWordAlignmentModel`: new class. Because eflomal is an HMM-based aligner, it implements `IHmmWordAlignmentModel` and shares the alignment-scoring logic with `ThotHmmWordAlignmentModel` via a new abstract base, `ThotHmmWordAlignmentModelBase` (extracted from the HMM model so the two are siblings rather than one subclassing the other). Its alignment scores come from `swAlignModel_getEflomalAlignmentProbability`. Added to the `ThotWordAlignmentModel.Create` factory.
- `ThotHmmWordAlignmentModelBase`: new abstract base holding the shared `ComputeAlignedWordPairScores`; the concrete `ThotHmmWordAlignmentModel` and `ThotEflomalWordAlignmentModel` supply only their model type and the underlying C API call.
- `ThotFastAlignWordAlignmentModel`: now uses the FastAlign-specific `swAlignModel_getFastAlignAlignmentProbability` instead of borrowing the IBM2 entry point (the new thot binary dispatches by model type, so the IBM2 path would return 0 for a FastAlign model).
- `ThotWordAlignmentModelTrainer`: the Eflomal branch creates a single model that runs the full IBM1→HMM→fertility cascade internally. Key behavior:
  - `Ibm1IterationCount` → IBM1 stage, `HmmIterationCount` → HMM stage, `Ibm3IterationCount` → fertility stage. The separate `Eflomal*IterationCount` properties are gone.
  - The trainer calls `swAlignModel_getEflomalScheduledIterations` after `startTraining` and drives exactly that many sweeps; this is required because the schedule (and the staged cascade) isn't known up front. Progress is reported as indeterminate until it resolves.
- `ThotWordAlignmentParameters`: reworked eflomal hyperparameters to match the thot 3.5.0 C API:
  - `EflomalNumSamplers`: parallel Gibbs chains; marginals summed across chains before argmax (eflomal's `n_samplers`). Thot default 3 (left untouched unless set).
  - `EflomalSeed`: RNG seed for the samplers.
  - `EflomalDeterministic`: trains chains serially so a fixed seed is reproducible.
  - `EflomalLexNorm`: plain `1/N(e)` vs Dirichlet-smoothed lexical denominator.
  - `EflomalLexAlpha` / `EflomalJumpAlpha` / `EflomalFertilityAlpha`: Dirichlet priors.
  - `EflomalP0`: NULL-alignment mixing weight (renamed from the removed `EflomalNullProb`).
  - `EflomalJumpWindow`: jump-distribution half-window.
  - Removed `EflomalNullAlpha` (no longer in the API).
  - Partial schedules fall back to thot's per-stage defaults (8 / 8 / 32).
- `Thot.cs`: P/Invoke declarations for the eflomal API (`setEflomalSeed/NumSamplers/Deterministic/Iterations/AutoIterations/LexNorm/P0/AlphaLex/AlphaJump/AlphaFertility/JumpWindow`, `getEflomalScheduledIterations`, `getEflomalAlignmentProbability`) and `swAlignModel_getFastAlignAlignmentProbability`.
- `AlignmentModelCommandSpec`: eflomal reuses the `--ibm1-iters` / `--hmm-iters` / `--ibm3-iters` flags for its schedule; new `--eflomal-samplers`, `--eflomal-deterministic`, `--eflomal-seed`, `--eflomal-lex-norm`, `--eflomal-lex-alpha`, `--eflomal-jump-alpha`, `--eflomal-fert-alpha`, `--eflomal-p0`, `--eflomal-jump-window` flags.

Test plan

- `ThotEflomalWordAlignmentModelTests`: CreateTrainer+Align, AlignBatch, translation probability, vocab counts, save/load round-trip, symmetrized model, explicit-schedule training, deterministic reproducibility, seed + lex-norm
- […] `TreatWarningsAsErrors=true`)
- `dotnet csharpier check .` clean

🤖 Generated with Claude Code
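The `EflomalNumSamplers` note above ("marginals summed across chains before argmax") can be illustrated with a small sketch. This is not thot's or eflomal's actual code: the function name `combine_chains`, the nested-list layout, and the numbers are assumptions made purely for illustration. Each chain is assumed to yield, for every target word, a marginal estimate over candidate source positions.

```python
def combine_chains(chain_marginals):
    """Sum per-link marginals over all chains, then pick the best source per target word.

    chain_marginals[c][j][i] is chain c's estimate that target word j
    aligns to source position i (hypothetical layout).
    """
    n_targets = len(chain_marginals[0])
    links = []
    for j in range(n_targets):
        n_sources = len(chain_marginals[0][j])
        # Pool evidence from every chain before committing to a decision.
        summed = [
            sum(chain[j][i] for chain in chain_marginals)
            for i in range(n_sources)
        ]
        # Argmax over the pooled marginals.
        links.append(max(range(n_sources), key=summed.__getitem__))
    return links


# Three chains, two target words, three source positions (made-up values).
chains = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
]
print(combine_chains(chains))  # [0, 2]
```

In this toy case the first chain alone would link the second target word to source position 1, but the pooled marginals favor position 2. Smoothing out single-chain noise in this way is the usual motivation for running several samplers.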