Skip to content

gh-51067: Add remove() and repack() to ZipFile#134627

Merged
gpshead merged 88 commits into
python:mainfrom
danny0838:gh-51067-2
Jun 20, 2026
Merged

gh-51067: Add remove() and repack() to ZipFile#134627
gpshead merged 88 commits into
python:mainfrom
danny0838:gh-51067-2

Conversation

@danny0838

@danny0838 danny0838 commented May 24, 2025

Copy link
Copy Markdown
Contributor

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.

Rationales

Heuristics Used in repack()

Since repack() does not immediately clean up removed entries at the time a remove() is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceeding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.

Check the doc in the source code of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) for more details.

Supported Modes

There has been opinions that a repacking should support mode 'w' and 'x' (e. g. #51067 (comment)).

This is NOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode 'w' and 'x' may be used on an unseekable file and will be broken by such change. OTOH, mode 'a' is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.



📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

@bedevere-app

bedevere-app Bot commented May 24, 2025

Copy link
Copy Markdown

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

sharktide

This comment was marked as off-topic.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:29

@sharktide sharktide left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

@danny0838

Copy link
Copy Markdown
Contributor Author

It probably would be better to raise an attributeError instead of a valueError here since you are trying to access an attribute a closed zipfile doesn’t have

This behavior simply resembles open() and write(), which raises a ValueError in various cases. Furthermore there has been a change from raising RuntimeError since Python 3.6:

Changed in version 3.6: Calling open() on a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

Changed in version 3.6: Calling write() on a ZipFile created with mode 'r' or a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

@danny0838 danny0838 requested a review from sharktide May 24, 2025 17:58
@danny0838

danny0838 commented May 24, 2025

Copy link
Copy Markdown
Contributor Author

Nicely inform @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 added 6 commits May 25, 2025 16:18
- Separate individual validation tests.
- Check underlying repacker not called in validation.
- Use `unlink` to prevent FileNotFoundError.
- Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.
@read-the-docs-community

read-the-docs-community Bot commented Jun 20, 2026

Copy link
Copy Markdown

Documentation build overview

📚 cpython-previews | 🛠️ Build #33230819 | 📁 Comparing b70a80b against main (a9db5cb)

  🔍 Preview build  

4 files changed
± library/tkinter.html
± library/zipfile.html
± whatsnew/3.16.html
± whatsnew/changelog.html

@danny0838

Copy link
Copy Markdown
Contributor Author

@emmatyping The review is still awaiting for nearly a year. Is there still any major issue preventing this PR to be merged in Python 3.15?

gpshead added 4 commits June 20, 2026 16:56
Locating an unsigned data descriptor for a STORED or encrypted entry
requires a byte-by-byte scan that can be ~100-1000x slower and is a
potential denial-of-service vector on untrusted input. Default
strict_descriptor to True so repack() detects only the signed data
descriptors written by Python and other modern tools, leaving the slow
scan as an opt-in.

Also explain data descriptors and the signed/unsigned distinction in
the docs, and add coverage for the new default (unit, encrypted, and an
end-to-end repack test across all compression methods).
@gpshead gpshead self-assigned this Jun 20, 2026
@gpshead gpshead added the 3.16 new features, bugs and security fixes label Jun 20, 2026
@gpshead gpshead requested a review from AA-Turner as a code owner June 20, 2026 17:42
gpshead added 11 commits June 20, 2026 18:00
Clarify that passing removed= reclaims a member's space regardless of
how its local file entry was written, while the scan used when removed
is omitted may leave entries in place (e.g. unsigned data descriptors
under the default strict_descriptor=True). Adjust the remove() note so
it no longer implies a bare repack() always removes the data, and add a
short remove-then-repack(removed) example.
Without this check a stale local header whose compress_size points past
the next referenced entry would make _validate_local_file_entry_sequence
report more strippable bytes than the gap holds.  The move loop would
then over-advance entry_offset, drive a later header_offset negative,
and fail in fp.seek() with the archive partially rewritten.
Avoids a tight loop if a future caller (or a truncated archive) ever
asks to copy past EOF.
Replace **opts passthrough with the documented keyword-only signature
so help()/inspect show the real parameters and the internal debug=
hook is not reachable from public API.
zipfile is a stdlib module, not core/builtins.
The loop bound guarantees each slice is exactly dd_size bytes.
Consistent with the other _ZipRepacker scanners.
The previous spelling overrode __new__ to return a dict, which works but
reads as a class that isn't one.  Rename to comparable_zinfo() per PEP 8.
Single source of truth for the 1 MiB default used by both
ZipFile.repack() and _ZipRepacker.
Covers the new bounds check in _validate_local_file_entry and the
short-read guard in _copy_bytes.

@gpshead gpshead left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied a variety of minor edge case fixes for robustness and some code style cleanups.

@gpshead gpshead enabled auto-merge (squash) June 20, 2026 19:36
@gpshead gpshead disabled auto-merge June 20, 2026 19:40
@gpshead gpshead dismissed emmatyping’s stale review June 20, 2026 19:43

I think the primary concerns were addressed. There's always room for documentation reworking. I realize I added a lot of text myself, maybe that should be rearranged a bit, but the important info is there that we expect people to be saving ZipInfo's from remove() calls and passing those to repack() as the well lit no-footguns path.

@gpshead gpshead merged commit aec0aed into python:main Jun 20, 2026
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.16 new features, bugs and security fixes stale Stale PR or inactive for long period of time.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants