Port 'Adjust placement of paragraph markers' from machine.py #435

Open
Copilot wants to merge 3 commits into master from copilot/port-adjust-placement-of-paragraph-markers

Copilot AI commented Jun 24, 2026

Contributor

Ports machine.py#298: after alignment-based placement of paragraph markers, small boundary adjustments are applied to produce more natural splits (e.g. keeping a trailing comma with its sentence rather than letting it open the next paragraph).

New: SegmentBoundaryAdjuster

Two new classes in SegmentBoundaryAdjuster.cs:

  • TokenRejoiner — reconstructs token lists into strings with correct punctuation spacing (no space before ,/./closing quotes, no space after opening brackets/quotes).
  • SegmentBoundaryAdjuster — adjusts a segment boundary by:
    • Moving prohibited segment-starting characters (, ; . ? ! closing quotes/brackets) from the head of the next segment to the tail of the current one
    • Moving prohibited segment-ending characters (opening brackets/quotes) from the tail of the current segment to the head of the next one
    • Correcting late sentence starts (capitalized words that crossed the boundary too early)
    • Correcting early sentence ends (words + terminal punctuation that crossed the boundary too late)
    • Exposing AdjustTokenizedSegmentPairBoundaries(int boundary, IReadOnlyList<string> tokens), the token-index variant used by the handler
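
For readers unfamiliar with the rejoining rule described for TokenRejoiner, here is a minimal sketch of the idea, not the code from SegmentBoundaryAdjuster.cs. The function name Rejoin and both punctuation sets are assumptions made for illustration; the actual class may cover more characters or handle them differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative punctuation sets (assumed, not taken from the PR).
var noSpaceBefore = new HashSet<string> { ",", ".", ")", "]", "”", "’" };
var noSpaceAfter = new HashSet<string> { "(", "[", "“", "‘" };

// Hypothetical helper: joins tokens with single spaces, except before
// closing punctuation and after opening brackets/quotes.
string Rejoin(IReadOnlyList<string> tokens)
{
    var sb = new StringBuilder();
    for (int i = 0; i < tokens.Count; i++)
    {
        bool needsSpace =
            i > 0 && !noSpaceBefore.Contains(tokens[i]) && !noSpaceAfter.Contains(tokens[i - 1]);
        if (needsSpace)
            sb.Append(' ');
        sb.Append(tokens[i]);
    }
    return sb.ToString();
}

Console.WriteLine(Rejoin(new[] { "Este", "texto", ",", "(", "y", "esta", "prueba", ")", "." }));
// prints: Este texto, (y esta prueba).
```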

Change: PlaceMarkersUsfmUpdateBlockHandler

After PredictMarkerLocation, paragraph markers now go through AdjustTokenizedSegmentPairBoundaries before their string index is resolved:

// If inserting a paragraph marker, make small adjustments to place it in a more natural location
if (element.Type == UsfmUpdateBlockElementType.Paragraph)
{
    adjacentTargetToken = _segmentBoundaryAdjuster.AdjustTokenizedSegmentPairBoundaries(
        adjacentTargetToken,
        targetTokens
    );
}

Before: alignment places \p before , → paragraph opens with , y esta prueba…
After: comma stays in the preceding paragraph → Este texto está en inglés, / \p y esta prueba…
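
The comma case above can be sketched on token indices as follows. This is a simplified illustration covering only the two prohibited-character rules, under the assumption that the boundary is the index of the first token of the next segment. The name AdjustBoundary and the character sets are hypothetical, and the sentence-start and sentence-end corrections are omitted.

```csharp
using System;
using System.Collections.Generic;

// Illustrative character sets (assumed, not copied from SegmentBoundaryAdjuster.cs).
var prohibitedStarts = new HashSet<string> { ",", ";", ".", "?", "!", ")", "]", "”", "’" };
var prohibitedEnds = new HashSet<string> { "(", "[", "“", "‘" };

// Hypothetical helper: boundary is the index of the first token of the next segment.
int AdjustBoundary(int boundary, IReadOnlyList<string> tokens)
{
    int adjusted = boundary;
    // Advance past prohibited segment-starting tokens so they stay in the current segment.
    while (adjusted < tokens.Count && prohibitedStarts.Contains(tokens[adjusted]))
        adjusted++;
    if (adjusted != boundary)
        return adjusted;
    // Otherwise step back over prohibited segment-ending tokens so they open the next segment.
    while (adjusted > 0 && prohibitedEnds.Contains(tokens[adjusted - 1]))
        adjusted--;
    return adjusted;
}

var sample = new[] { "Este", "texto", "está", "en", "inglés", ",", "y", "esta", "prueba" };
Console.WriteLine(AdjustBoundary(5, sample)); // prints 6
```

With the boundary moved from 5 to 6, the comma stays at the end of the preceding paragraph and the new paragraph opens with "y".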



Copilot AI changed the title from "[WIP] Port relevant changes from PR #298 to machine" to "Port 'Adjust placement of paragraph markers' from machine.py" Jun 24, 2026
Copilot AI requested a review from ddaspit June 24, 2026 20:26
@ddaspit ddaspit marked this pull request as ready for review June 25, 2026 13:59

ddaspit commented Jun 25, 2026

Contributor

@copilot The build is failing.

Copilot AI commented Jun 25, 2026

Contributor Author

> @copilot The build is failing.

Fixed in the latest commit. The two new files (SegmentBoundaryAdjuster.cs and SegmentBoundaryAdjusterTests.cs) weren't run through CSharpier before being committed; running dotnet csharpier format on both files resolved the formatting errors.

@codecov-commenter


Codecov Report

❌ Patch coverage is 94.79769% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.28%. Comparing base (c1bd2cc) to head (ad8b076).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/SIL.Machine/Corpora/SegmentBoundaryAdjuster.cs | 94.54% | 3 Missing and 6 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #435      +/-   ##
==========================================
+ Coverage   73.18%   73.28%   +0.10%     
==========================================
  Files         440      441       +1     
  Lines       36882    37055     +173     
  Branches     5075     5094      +19     
==========================================
+ Hits        26991    27155     +164     
- Misses       8778     8781       +3     
- Partials     1113     1119       +6     

