gh-51067: Add remove() and repack() to ZipFile#134627
Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
This behavior simply resembles
|
|
Nicely inform @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although cares must be taken as there might be a potential risk on the algorithm about reclaiming spaces. The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted. |
- Separate individual validation tests. - Check underlying repacker not called in validation. - Use `unlink` to prevent FileNotFoundError. - Fix mode 'x' test.
- Set `_writing` to prevent `open('w').write()` during repacking.
- Move the protection logic to `ZipFile.repack()`.
Documentation build overview
4 files changed± library/tkinter.html± library/zipfile.html± whatsnew/3.16.html± whatsnew/changelog.html |
|
@emmatyping The review is still awaiting for nearly a year. Is there still any major issue preventing this PR to be merged in Python 3.15? |
Locating an unsigned data descriptor for a STORED or encrypted entry requires a byte-by-byte scan that can be ~100-1000x slower and is a potential denial-of-service vector on untrusted input. Default strict_descriptor to True so repack() detects only the signed data descriptors written by Python and other modern tools, leaving the slow scan as an opt-in. Also explain data descriptors and the signed/unsigned distinction in the docs, and add coverage for the new default (unit, encrypted, and an end-to-end repack test across all compression methods).
Clarify that passing removed= reclaims a member's space regardless of how its local file entry was written, while the scan used when removed is omitted may leave entries in place (e.g. unsigned data descriptors under the default strict_descriptor=True). Adjust the remove() note so it no longer implies a bare repack() always removes the data, and add a short remove-then-repack(removed) example.
Without this check a stale local header whose compress_size points past the next referenced entry would make _validate_local_file_entry_sequence report more strippable bytes than the gap holds. The move loop would then over-advance entry_offset, drive a later header_offset negative, and fail in fp.seek() with the archive partially rewritten.
Avoids a tight loop if a future caller (or a truncated archive) ever asks to copy past EOF.
Replace **opts passthrough with the documented keyword-only signature so help()/inspect show the real parameters and the internal debug= hook is not reachable from public API.
zipfile is a stdlib module, not core/builtins.
The loop bound guarantees each slice is exactly dd_size bytes.
Consistent with the other _ZipRepacker scanners.
The previous spelling overrode __new__ to return a dict, which works but reads as a class that isn't one. Rename to comparable_zinfo() per PEP 8.
Single source of truth for the 1 MiB default used by both ZipFile.repack() and _ZipRepacker.
Covers the new bounds check in _validate_local_file_entry and the short-read guard in _copy_bytes.
gpshead
left a comment
There was a problem hiding this comment.
I applied a variety of minor edge case fixes for robustness and some code style cleanups.
I think the primary concerns were addressed. There's always room for documentation reworking. I realize I added a lot of text myself, maybe that should be rearranged a bit, but the important info is there that we expect people to be saving ZipInfo's from remove() calls and passing those to repack() as the well lit no-footguns path.
This is a revised version of PR #103033, implementing two new methods in
zipfile.ZipFile:remove()andrepack(), as suggested in this comment.Features
ZipFile.remove(zinfo_or_arcname)strpath orZipInfo) from the central directory.strpath is provided.ZipInfoinstance.'a','w','x'.ZipFile.repack(removed=None)removedis passed (as a sequence of removedZipInfos), only their corresponding local file entry data are removed.'a'.Rationales
Heuristics Used in
repack()Since
repack()does not immediately clean up removed entries at the time aremove()is called, the header information of removed file entries may be missing, and thus it can be technically difficult to determine whether certain stale bytes are really previously removed files and safe to remove.While local file entries begin with the magic signature
PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.To safely reclaim space,
repack()assumes that in a normal ZIP file, local file entries are stored consecutively:BadZipFileerror is raised and no changes are made.Check the doc in the source code of
_ZipRepacker.repack()(which is internally called byZipFile.repack()) for more details.Supported Modes
There has been opinions that a repacking should support mode
'w'and'x'(e. g. #51067 (comment)).This is NOT introduced since such modes do not truncate the file at the end of writing, and won't really shrink the file size after a removal has been made. Although we do can change the behavior for the existing API, some further care has to be made because mode
'w'and'x'may be used on an unseekable file and will be broken by such change. OTOH, mode'a'is not expected to work with an unseekable file since an initial seek is made immediately when it is opened.📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/