Verification, GC, and History Squashing

This guide explains how to keep a repository healthy after it has accumulated real history, detached caches, and multiple generations of artifacts. The maintenance APIs are public on purpose: they are part of normal operation, not hidden implementation details.

That matters because hubvault is meant to be a low-infrastructure durable repository for large ML data. Verification, rollback-oriented atomic writes, and explicit resource-release APIs are part of the product, not optional operational extras.

When to use this guide 

Use the maintenance APIs when one or more of these becomes true:

you have written multiple generations of large files
detached downloads or snapshots have consumed noticeable cache space
you want a health check before archiving or handing off a repository
you need to reclaim bytes without guessing which internal directories are safe to delete

The maintenance flow answers four distinct questions, in this order:

Is the repository healthy?
Where is the space going?
What is already safe to reclaim?
Is old reachable history the real blocker?

Step 1: start with verification 

hubvault exposes two public verification levels:

# api is a HubVaultApi instance, created as in the complete example at the end of this guide
quick = api.quick_verify()
print(quick.ok)
# True

full = api.full_verify()
print(full.ok)
# True

Use them differently:

quick_verify() is the cheap integrity check after ordinary writes
full_verify() is the deeper pass for maintenance windows, suspicious states, migration checks, or archival handoff

The usual pattern is simple: run quick_verify() after normal mutations and full_verify() before major cleanup or handoff.
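The pattern above can be wrapped in a small guard so that maintenance code stops early on an unhealthy repository. This is a minimal sketch, not part of hubvault: the helper name verify_or_raise and the exception class are hypothetical, and it relies only on quick_verify(), full_verify(), and the .ok attribute shown in this guide.

```python
class RepositoryUnhealthyError(RuntimeError):
    """Raised by this sketch when a verification report is not ok."""


def verify_or_raise(api, deep=False):
    """Run quick_verify(), and additionally full_verify() when deep=True.

    `api` is assumed to be a HubVaultApi instance. Returns the last report
    produced. Only the `.ok` attribute is inspected.
    """
    report = api.quick_verify()
    if not report.ok:
        raise RepositoryUnhealthyError("quick_verify() reported problems")
    if deep:
        # The deeper pass only runs once the cheap check has succeeded.
        report = api.full_verify()
        if not report.ok:
            raise RepositoryUnhealthyError("full_verify() reported problems")
    return report
```

Call verify_or_raise(api) after ordinary writes and verify_or_raise(api, deep=True) before cleanup or handoff.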

Step 2: inspect storage before deleting anything 

Before deleting files manually, ask the repository for a structured storage overview:

overview = api.get_storage_overview()

print(overview.total_size > 0)
# True

print(overview.reachable_size >= 0)
# True

print(overview.historical_retained_size >= 0)
# True

print(overview.reclaimable_gc_size >= 0)
# True

print(overview.reclaimable_cache_size >= 0)
# True

These fields answer different questions:

total_size: how large is the repository footprint overall?
reachable_size: how much data is required to preserve current live refs?
historical_retained_size: how much space is still kept by old reachable history?
reclaimable_gc_size: how much can plain GC reclaim right now?
reclaimable_cache_size: how much detached-view cache can be dropped safely?
reclaimable_temporary_size: how much temporary or quarantine residue can be cleaned?

You also get:

sections: per-area storage breakdown
recommendations: ordered maintenance suggestions based on the current state

Together, these values are the basis for deciding whether a plain GC is enough or whether history rewriting is required.
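One way to turn that decision into code is a small rule over the overview fields. The sketch below is illustrative only: the function name suggest_action and the history_share threshold are assumptions, not hubvault behavior, and the built-in recommendations field remains the authoritative source of suggestions.

```python
def suggest_action(overview, history_share=0.5):
    """Return 'gc', 'squash', or 'none' for a storage overview.

    `overview` is assumed to expose the size fields described in this guide.
    `history_share` is an arbitrary illustrative threshold: the fraction of
    total_size held by old reachable history at which squashing is suggested.
    """
    # Only a "greater than zero" test is needed, so it does not matter
    # whether the three reclaimable categories overlap.
    reclaimable_now = (
        overview.reclaimable_gc_size
        + overview.reclaimable_cache_size
        + overview.reclaimable_temporary_size
    )
    if reclaimable_now > 0:
        return "gc"
    if overview.total_size > 0 and (
        overview.historical_retained_size / overview.total_size >= history_share
    ):
        return "squash"
    return "none"
```

The rule prefers GC whenever anything is already reclaimable, and only suggests squashing once old reachable history dominates the footprint.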

Step 3: preview GC first 

Run GC in dry-run mode before mutating anything:

dry_gc = api.gc(dry_run=True, prune_cache=True)

print(dry_gc.dry_run)
# True

print(dry_gc.reclaimed_size >= 0)
# True

print(dry_gc.notes[:2])
# ['dry-run: ...', '...']  # exact notes vary by repository state

Dry-run mode tells you what hubvault would reclaim without changing repository state. That is especially useful when deciding whether cache pruning alone is enough.
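A dry run can also gate the real pass. The following sketch assumes api is a HubVaultApi instance and uses only gc(dry_run=..., prune_cache=...) and reclaimed_size as shown above. The helper name gc_if_worthwhile and the min_bytes cutoff are hypothetical.

```python
def gc_if_worthwhile(api, min_bytes=1, prune_cache=True):
    """Preview GC and run the real pass only if enough bytes are reclaimable.

    Returns (executed, report): `executed` is False when only the dry run
    happened, and `report` is the last report returned by api.gc().
    """
    preview = api.gc(dry_run=True, prune_cache=prune_cache)
    if preview.reclaimed_size < min_bytes:
        # Nothing worth reclaiming: only the dry run has happened so far.
        return False, preview
    report = api.gc(dry_run=False, prune_cache=prune_cache)
    return True, report
```

Passing prune_cache=False to the same helper previews what GC would reclaim without touching detached caches, which helps when deciding whether cache pruning alone would be enough.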

Step 4: run GC for already reclaimable data 

If the dry run looks correct, execute the real pass:

gc_report = api.gc(dry_run=False, prune_cache=True)

print(gc_report.reclaimed_size >= 0)
# True

print(gc_report.removed_file_count >= 0)
# True

print(gc_report.reclaimed_cache_size >= 0)
# True

Plain GC only reclaims data that is already safe to remove:

unreachable object data
unreachable chunk / pack data
rebuildable detached cache data
temporary or quarantine residue that no longer needs to be kept

If old history is still reachable from a branch, GC intentionally keeps it.
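To see whether GC actually moved the needle, the footprint can be compared before and after the pass. This is a hedged sketch: gc_and_measure is a hypothetical helper, and it makes no claim that the footprint delta equals reclaimed_size, since exact byte counts vary.

```python
def gc_and_measure(api, prune_cache=True):
    """Run a real GC pass and report the footprint before and after.

    `api` is assumed to be a HubVaultApi instance. Only total_size and
    reclaimed_size, both shown in this guide, are read.
    """
    before = api.get_storage_overview().total_size
    report = api.gc(dry_run=False, prune_cache=prune_cache)
    after = api.get_storage_overview().total_size
    return {
        "before": before,
        "after": after,
        "footprint_delta": before - after,
        "reported_reclaimed": report.reclaimed_size,
    }
```

A small footprint_delta alongside a large historical_retained_size is a sign that reachable history, not garbage, is holding the space.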

Step 5: use history squashing when old history is the blocker 

Large repositories often retain most of their space in still-reachable branch history. When that becomes the dominant storage cost, use squash_history() explicitly:

squash = api.squash_history(
    "main",
    commit_message="squash main history",
    run_gc=True,
    prune_cache=True,
)

print(squash.rewritten_commit_count >= 1)
# True

print(squash.dropped_ancestor_count >= 0)
# True

print(squash.blocking_refs)
# []  # or other refs that still retain old lineage

squash_history() keeps the branch tip’s visible file state while making older branch lineage unreachable from that branch. When run_gc=True, the method follows up with GC immediately so newly unreachable data can be reclaimed.
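Because other refs can keep the old lineage alive, it is worth checking blocking_refs after a squash. The sketch below assumes api is a HubVaultApi instance and uses only the squash_history() arguments and result fields shown above. The helper name squash_and_check and the shape of its summary are hypothetical.

```python
def squash_and_check(api, branch="main", message="squash history"):
    """Squash one branch with follow-up GC and summarize the outcome.

    Returns a dict with the counts from the squash result, the list of refs
    that still retain old lineage, and a convenience flag that is True when
    no such refs were reported.
    """
    result = api.squash_history(
        branch,
        commit_message=message,
        run_gc=True,
        prune_cache=True,
    )
    blockers = list(result.blocking_refs)
    return {
        "rewritten_commit_count": result.rewritten_commit_count,
        "dropped_ancestor_count": result.dropped_ancestor_count,
        "blocking_refs": blockers,
        "lineage_released": not blockers,
    }
```

If lineage_released is False, the old data is still reachable through the listed refs, so the follow-up GC is expected to keep it.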

How to choose the right action 

A practical order is:

1. run quick_verify() after normal writes
2. run full_verify() before serious maintenance or archival handoff
3. inspect get_storage_overview()
4. preview gc(dry_run=True)
5. run real gc()
6. use squash_history() only when old reachable history is the real space consumer

That order prevents both under-cleaning and unsafe manual cleanup.

What not to do 

Avoid these habits:

deleting internal directories because they “look temporary”
deleting cache, chunk, or object files by hand without checking overview/GC
assuming GC rewrites reachable history
assuming squashing is just an optimization with no history consequences

Use the public maintenance APIs instead. They already know which state must be preserved and which state is safe to remove.

Complete maintenance example 

from hubvault import HubVaultApi

api = HubVaultApi("maintenance-repo")
api.create_repo(large_file_threshold=32)
api.upload_file(
    path_or_fileobj=b"A" * 64,
    path_in_repo="model.bin",
    commit_message="seed v1",
)
api.upload_file(
    path_or_fileobj=b"B" * 64,
    path_in_repo="model.bin",
    commit_message="seed v2",
)
api.hf_hub_download("model.bin")    # populate one detached view

quick = api.quick_verify()
print(quick.ok)                     # True

full = api.full_verify()
print(full.ok)                      # True

overview = api.get_storage_overview()
print(overview.total_size > 0)      # True
print(overview.reclaimable_cache_size >= 0)     # True
print(overview.reclaimable_gc_size >= 0)        # True

dry_gc = api.gc(dry_run=True, prune_cache=True)
print(dry_gc.dry_run)               # True
print(dry_gc.reclaimed_size >= 0)   # True

gc_report = api.gc(dry_run=False, prune_cache=True)
print(gc_report.reclaimed_size >= 0)        # True
print(gc_report.removed_file_count >= 0)    # True

squash = api.squash_history(
    "main",
    commit_message="squash main history",
    run_gc=True,
    prune_cache=True,
)
print(squash.rewritten_commit_count >= 1)   # True
print(squash.dropped_ancestor_count >= 0)   # True

Note

Exact byte counts differ across platforms, Python versions, and filesystems. The stable part is the meaning of each field and the ordering of the maintenance actions.