feat: Milestone 2 — Structured Run Manifest System (#64)#67
DhanashreePetare wants to merge 27 commits into
    ```

    ## Manifest

    All three commands support an optional `--manifest` flag that writes a structured JSON-LD record of the operation to disk:

    ```bash
    databusclient download https://databus.dbpedia.org/dbpedia/generic/labels/2023.12.01 \
      --manifest ./manifests/labels-download.jsonld

    databusclient deploy --version-id https://databus.dbpedia.org/myaccount/mygroup/mydata/1.0 \
      --title "My Dataset" --abstract "..." --description "..." \
      --license https://creativecommons.org/licenses/by-sa/3.0/ \
      --apikey YOUR_KEY --manifest ./manifests/deploy-run.jsonld \
      myfile.nt

    databusclient delete https://databus.dbpedia.org/myaccount/mygroup/mydata/1.0 \
      --databus-key YOUR_KEY --manifest ./manifests/delete-run.jsonld
    ```

    The manifest records input parameters, per-file URLs, checksums, byte sizes, timestamps, and success/failure status for each file. It uses the DataID vocabulary and is versioned via `dbus:schemaVersion`.

    - If the target path already exists, the manifest is written to an auto-suffixed path (e.g. `run_1.jsonld`) with a warning.
    - Sensitive fields (API keys, vault tokens) are never written.
    - If manifest writing fails, a warning is printed and the exit code reflects the actual operation result.

    See `examples/reproducible-download.md` for a full walkthrough.
The placement in the README could be better. The README starts with a Table of Contents; I would add the new section there, under CLI Usage:
- [CLI Usage](#cli-usage)
  - [Download](#cli-download)
  - [Deploy](#cli-deploy)
  - [Delete](#cli-delete)
  - [Manifest](#cli-manifest) <-- new
Accordingly, the Manifest documentation should be placed below cli-delete.
    # Explicitly un-ignore the manifest module folder (MANIFEST above is for Python packaging artifacts)
    !databusclient/manifest/
    !databusclient/manifest/**
    databusclient/manifest/__pycache__/
    *.py[cod]
Looks like agent-generated code :D
Unless there is a reason to keep it, remove it. Moreover, `*.py[cod]` is already present on line 8, and `__pycache__/` on line 7.
Actually, these lines were added to fix a Windows-specific gitignore conflict: the `MANIFEST` pattern on line 33 (for Python packaging artifacts) matched `databusclient/manifest/` case-insensitively on Windows, which prevented the manifest module directory from being committed. The negation lines were the workaround at the time. Since the files are now tracked by git, the conflict no longer applies and I've removed them. Thanks for pointing this out.
    All three commands support an optional `--manifest` flag that writes a structured JSON-LD record of the operation to disk:

    ```bash
    databusclient download https://databus.dbpedia.org/dbpedia/generic/labels/2023.12.01 \
Please use an existing example: https://databus.dbpedia.org/dbpedia/generic/labels/2023.12.01 does not exist.
    if queue is not None:
        queue.add_uri(databusURI)
        return
Real delete manifests are empty for successful deletions.
`databusclient/api/delete.py:152` queues every non-dry-run delete and returns before recording. The public `delete()` always uses a queue, but `DeleteQueue.execute()` calls `_delete_list()` without passing `manifest_context` at `databusclient/api/delete.py:73`. Result: `databusclient delete ... --manifest out.jsonld` can delete resources successfully while the manifest reports zero files.
    {
      "@context": {
        "dataid": "http://dataid.dbpedia.org/ns#",
        "dcat": "http://www.w3.org/ns/dcat#",
        "dcterms": "http://purl.org/dc/terms/",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        "dbus": "http://databus.dbpedia.org/manifest/ns#"
      },
      "@type": "dbus:OperationManifest",
      "dbus:schemaVersion": "1.0",
      "dbus:clientVersion": "0.15",
      "dbus:command": "delete",
      "dcterms:issued": {
        "@value": "2026-07-03T09:15:02.095679+00:00",
        "@type": "xsd:dateTime"
      },
      "dbus:replayParams": {
        "databusURIs": [
          "https://databus.dev.dbpedia.link/fhofer/group1/artifact1/2027-07-03"
        ],
        "dry_run": false
      },
      "dataid:distribution": {
        "@type": "dataid:Distribution",
        "dataid:file": []
      },
      "dbus:executionResult": {
        "@type": "dbus:ExecutionSummary",
        "dbus:totalFiles": 0,
        "dbus:succeeded": 0,
        "dbus:failed": 0,
        "dbus:totalBytes": 0
      }
    }
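One possible shape of the fix, as a minimal sketch: the queue keeps a reference to the manifest context and forwards it when it executes. The names `DeleteQueue`, `_delete_list`, `add_uri`, `record_file`, and `manifest_context` come from this PR, but every body below is a hypothetical stand-in, not the code in `databusclient/api/delete.py`, and no HTTP request is made.

```python
from typing import List, Optional


class ManifestContext:
    """Hypothetical stand-in for the PR's manifest recorder."""

    def __init__(self) -> None:
        self.files: List[dict] = []

    def record_file(self, url: str, status: str) -> None:
        self.files.append({"url": url, "status": status})


def _delete_list(uris: List[str], manifest_context: Optional[ManifestContext] = None) -> None:
    """Stand-in for the deletion loop; the real client would send DELETE requests."""
    for uri in uris:
        # (network call omitted in this sketch)
        if manifest_context is not None:
            manifest_context.record_file(url=uri, status="success")


class DeleteQueue:
    """Hypothetical queue that remembers the manifest context it was created with."""

    def __init__(self, manifest_context: Optional[ManifestContext] = None) -> None:
        self.uris: List[str] = []
        self.manifest_context = manifest_context

    def add_uri(self, uri: str) -> None:
        self.uris.append(uri)

    def execute(self) -> None:
        # Forwarding the context here is the step the review comment says is missing.
        _delete_list(self.uris, manifest_context=self.manifest_context)


ctx = ManifestContext()
queue = DeleteQueue(manifest_context=ctx)
queue.add_uri("https://example.org/account/group/artifact/1.0")  # placeholder URI
queue.execute()
```

With the context forwarded, `ctx.files` holds one entry per queued URI instead of staying empty.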
Updated, verified by deleting a deployed test dataset.
    if manifest_context is not None:
        manifest_context.record_file(
            url=url,
            status="success",
            sha256=actual_checksum or expected_checksum,
            size_bytes=total_size_in_bytes if total_size_in_bytes else None,
            downloaded_at=datetime.now(timezone.utc).isoformat(),
        )
Download manifests can record a file as successful before the requested conversion succeeds.
`_download_file()` writes the success entry at `databusclient/api/download.py:526`, but compression/format conversion runs afterward through `databusclient/api/download.py:537`. If `convert_file()` or recompression fails later, the manifest still contains a successful file entry for an operation whose final output was not produced.
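One way to address this, sketched under assumptions: record the manifest entry only after both the download and the conversion have finished, and record a failure if either step raises. `download_and_convert`, `ListRecorder`, the `fetch` and `convert` callables, and the `error` and `recorded_at` fields are hypothetical names for illustration; the real `record_file()` signature in this PR may differ.

```python
from datetime import datetime, timezone
from typing import Callable, List, Optional


class ListRecorder:
    """Hypothetical stand-in for the manifest context; keeps entries in a list."""

    def __init__(self) -> None:
        self.entries: List[dict] = []

    def record_file(self, **fields) -> None:
        self.entries.append(fields)


def download_and_convert(
    url: str,
    fetch: Callable[[str], str],
    convert: Callable[[str], None],
    manifest_context: Optional[ListRecorder] = None,
) -> None:
    """Run fetch then convert, and write exactly one manifest entry at the end."""
    status, error = "success", None
    try:
        local_path = fetch(url)   # download step
        convert(local_path)       # compression/format conversion step
    except Exception as exc:
        status, error = "failed", str(exc)
        raise                     # the caller still sees the original error
    finally:
        # Recorded last, so "success" means the final output was produced.
        if manifest_context is not None:
            manifest_context.record_file(
                url=url,
                status=status,
                error=error,
                recorded_at=datetime.now(timezone.utc).isoformat(),
            )
```

Recording in `finally` keeps a single entry per file and lets a failed conversion show up as `"failed"` in the manifest while the exception still propagates.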
Pull Request
Description
Introduces the Structured Run Manifest System (Milestone 2) for the Databus Python Client. When --manifest is passed to any of the three existing commands (download, deploy, delete), a JSON-LD manifest file is written recording the complete details of the operation — input parameters, per-file URLs, checksums, byte sizes, timestamps, success/failure status, and a structured execution summary.
The manifest uses the DataID vocabulary (the same vocabulary used by the Databus platform itself), is versioned via dbus:schemaVersion, and never writes sensitive credentials. If the manifest file already exists at the given path, it auto-suffixes (run_1.jsonld, run_2.jsonld, ...) with a warning rather than silently overwriting. Passing a directory path instead of a file path raises a clear error. Manifest writing failure warns and continues — the exit code reflects the actual operation result, not the manifest write.
The manifest is written even when the operation itself fails, capturing whatever partial results were recorded before the failure — enabling the debugging use case described in the proposal (Use Case 5).
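The path handling and credential filtering described above could look roughly like the following. This is a sketch, not the PR's implementation: `resolve_manifest_path`, `redact`, and the key names in `SENSITIVE_KEYS` are assumptions made for illustration.

```python
from pathlib import Path

# Assumed parameter names; the actual set of sensitive fields in the PR may differ.
SENSITIVE_KEYS = {"apikey", "databus_key", "vault_token"}


def resolve_manifest_path(target: Path) -> Path:
    """Return a path that does not overwrite an existing manifest."""
    if target.is_dir():
        raise IsADirectoryError(f"--manifest expects a file path, got a directory: {target}")
    if not target.exists():
        return target
    # run.jsonld -> run_1.jsonld, run_2.jsonld, ... until a free name is found
    n = 1
    while True:
        candidate = target.with_name(f"{target.stem}_{n}{target.suffix}")
        if not candidate.exists():
            return candidate
        n += 1


def redact(params: dict) -> dict:
    """Drop sensitive entries before the parameters are written to the manifest."""
    return {k: v for k, v in params.items() if k.lower() not in SENSITIVE_KEYS}
```

A caller would compare the returned path with the requested one and print a warning when they differ.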
Related Issues
#64
Type of change
Checklist:
- `poetry run pytest` - all tests passed
- `poetry run ruff check` - no linting errors