Verification, GC, and History Squashing
This guide explains how to keep a repository healthy after it has accumulated real history, detached caches, and multiple generations of artifacts. The maintenance APIs are public on purpose: they are part of normal operation, not hidden implementation details.
That matters because hubvault is meant to be a low-infrastructure durable
repository for large ML data. Verification, rollback-oriented atomic writes,
and explicit resource-release APIs are part of the product, not optional
operational extras.
When to use this guide
Use the maintenance APIs when one or more of these becomes true:
you have written multiple generations of large files
detached downloads or snapshots have consumed noticeable cache space
you want a health check before archiving or handing off a repository
you need to reclaim bytes without guessing which internal directories are safe to delete
The maintenance flow has four distinct questions:
Is the repository healthy?
Where is the space going?
What is already safe to reclaim?
Is old reachable history the real blocker?
Step 1: start with verification
hubvault exposes two public verification levels:
quick = api.quick_verify()
print(quick.ok)
# True
full = api.full_verify()
print(full.ok)
# True
Use them differently:
quick_verify() is the cheap integrity check after ordinary writes
full_verify() is the deeper pass for maintenance windows, suspicious states, migration checks, or archival handoff
The usual pattern is simple: quick after normal mutation, full before major cleanup or handoff.
Step 2: inspect storage before deleting anything
Before deleting files manually, ask the repository for a structured storage overview:
overview = api.get_storage_overview()
print(overview.total_size > 0)
# True
print(overview.reachable_size >= 0)
# True
print(overview.historical_retained_size >= 0)
# True
print(overview.reclaimable_gc_size >= 0)
# True
print(overview.reclaimable_cache_size >= 0)
# True
These fields answer different questions:
total_size: how large is the repository footprint overall?
reachable_size: how much data is required to preserve current live refs?
historical_retained_size: how much space is still kept by old reachable history?
reclaimable_gc_size: how much can plain GC reclaim right now?
reclaimable_cache_size: how much detached-view cache can be dropped safely?
reclaimable_temporary_size: how much temporary or quarantine residue can be cleaned?
You also get:
sections: per-area storage breakdown
recommendations: ordered maintenance suggestions based on the current state
That is the basis for deciding whether a simple GC is enough or whether history rewriting is required.
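To make that decision concrete, the size fields can be compared directly. The sketch below is illustrative only: `suggest_next_step` is a hypothetical helper, its comparison rule is an assumption rather than hubvault's own logic, and the numbers are made up. In practice the `recommendations` field of the real overview is the authoritative source.

```python
from types import SimpleNamespace


def suggest_next_step(overview):
    """Suggest a coarse next action from the size fields described above.

    Hypothetical helper: the comparison rule is an illustrative assumption,
    not hubvault's own recommendation logic.
    """
    # Bytes the guide describes as reclaimable without rewriting history.
    reclaimable_now = (
        overview.reclaimable_gc_size
        + overview.reclaimable_cache_size
        + overview.reclaimable_temporary_size
    )
    if reclaimable_now == 0 and overview.historical_retained_size == 0:
        return "nothing to reclaim"
    if overview.historical_retained_size > reclaimable_now:
        return "history dominates: consider squash_history()"
    return "plain gc() should help"


# Made-up numbers standing in for a real get_storage_overview() result.
example = SimpleNamespace(
    historical_retained_size=4096,
    reclaimable_gc_size=128,
    reclaimable_cache_size=64,
    reclaimable_temporary_size=0,
)
print(suggest_next_step(example))
```

The helper only reads attributes, so it should accept the object returned by `api.get_storage_overview()` as long as the field names match those listed above.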
Step 3: preview GC first
Run GC in dry-run mode before mutating anything:
dry_gc = api.gc(dry_run=True, prune_cache=True)
print(dry_gc.dry_run)
# True
print(dry_gc.reclaimed_size >= 0)
# True
print(dry_gc.notes[:2])
# ['dry-run: ...', '...'] # exact notes vary by repository state
Dry-run mode tells you what hubvault would reclaim without changing
repository state. That is especially useful when deciding whether cache pruning
alone is enough.
Step 4: run GC for already reclaimable data
If the dry run looks correct, execute the real pass:
gc_report = api.gc(dry_run=False, prune_cache=True)
print(gc_report.reclaimed_size >= 0)
# True
print(gc_report.removed_file_count >= 0)
# True
print(gc_report.reclaimed_cache_size >= 0)
# True
Plain GC only reclaims data that is already safe to remove:
unreachable object data
unreachable chunk / pack data
rebuildable detached cache data
temporary or quarantine residue that no longer needs to be kept
If old history is still reachable from a branch, GC intentionally keeps it.
Step 5: use history squashing when old history is the blocker
Large repositories often retain most of their space in still-reachable branch
history. When that becomes the dominant storage cost, use
squash_history() explicitly:
squash = api.squash_history(
    "main",
    commit_message="squash main history",
    run_gc=True,
    prune_cache=True,
)
print(squash.rewritten_commit_count >= 1)
# True
print(squash.dropped_ancestor_count >= 0)
# True
print(squash.blocking_refs)
# [] # or other refs that still retain old lineage
squash_history() keeps the branch tip’s visible file state while making
older branch lineage unreachable from that branch. When run_gc=True, the
method follows up with GC immediately so newly unreachable data can be reclaimed.
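Because other refs can still retain the old lineage, it helps to check `blocking_refs` before assuming the space is gone. Here is a minimal sketch: `summarize_squash` is a hypothetical helper, it assumes `blocking_refs` is a list whose items can be rendered with `str()`, and the report objects are made-up stand-ins for the return value of `squash_history()`.

```python
from types import SimpleNamespace


def summarize_squash(report):
    """Turn a squash report into one human-readable line.

    Hypothetical helper: uses only the `blocking_refs` and
    `dropped_ancestor_count` fields shown in this guide.
    """
    if report.blocking_refs:
        refs = ", ".join(str(ref) for ref in report.blocking_refs)
        return f"old lineage still retained by: {refs}"
    return (
        f"dropped {report.dropped_ancestor_count} ancestor commit(s); "
        "no blocking refs reported"
    )


# Made-up reports standing in for real squash_history() results.
clean = SimpleNamespace(blocking_refs=[], dropped_ancestor_count=3)
blocked = SimpleNamespace(blocking_refs=["refs/heads/dev"], dropped_ancestor_count=3)

print(summarize_squash(clean))
print(summarize_squash(blocked))
```

If blocking refs are reported, GC will keep the old data until those refs no longer reach it, which matches the rule that GC never removes reachable history.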
How to choose the right action
A practical order is:
run quick_verify() after normal writes
run full_verify() before serious maintenance or archival handoff
inspect get_storage_overview()
preview gc(dry_run=True)
run real gc()
use squash_history() only when old reachable history is the real space consumer
That order prevents both under-cleaning and unsafe manual cleanup.
What not to do
Avoid these habits:
deleting internal directories because they “look temporary”
deleting cache, chunk, or object files by hand without checking overview/GC
assuming GC rewrites reachable history
assuming squashing is just an optimization with no history consequences
Use the public maintenance APIs. They already know how to preserve repository truth while cleaning safe-to-remove state.
Complete maintenance example
from hubvault import HubVaultApi
api = HubVaultApi("maintenance-repo")
api.create_repo(large_file_threshold=32)
api.upload_file(
    path_or_fileobj=b"A" * 64,
    path_in_repo="model.bin",
    commit_message="seed v1",
)
api.upload_file(
    path_or_fileobj=b"B" * 64,
    path_in_repo="model.bin",
    commit_message="seed v2",
)
api.hf_hub_download("model.bin") # populate one detached view
quick = api.quick_verify()
print(quick.ok) # True
full = api.full_verify()
print(full.ok) # True
overview = api.get_storage_overview()
print(overview.total_size > 0) # True
print(overview.reclaimable_cache_size >= 0) # True
print(overview.reclaimable_gc_size >= 0) # True
dry_gc = api.gc(dry_run=True, prune_cache=True)
print(dry_gc.dry_run) # True
print(dry_gc.reclaimed_size >= 0) # True
gc_report = api.gc(dry_run=False, prune_cache=True)
print(gc_report.reclaimed_size >= 0) # True
print(gc_report.removed_file_count >= 0) # True
squash = api.squash_history(
    "main",
    commit_message="squash main history",
    run_gc=True,
    prune_cache=True,
)
print(squash.rewritten_commit_count >= 1) # True
print(squash.dropped_ancestor_count >= 0) # True
Note
Exact byte counts differ across platforms, Python versions, and filesystems. The stable part is the meaning of each field and the ordering of the maintenance actions.