Repository Structure and How It Works
This guide explains the current on-disk repository layout and why that layout matters for portability, detached views, large-file storage, maintenance, and crash safety.
Why the layout matters
hubvault is intentionally a self-contained local repository format. The
directory tree is not an implementation accident; it is part of the portability
and safety story.
A valid repository is expected to keep working after it is:
moved to another absolute path
packed into an archive and restored later
reopened by another process on the same or another machine
Current top-level layout
A typical repository root now looks like this:
FORMAT
metadata.sqlite3
cache/
chunks/
locks/
objects/
quarantine/
txn/
The important responsibilities are:
FORMAT: repository format marker
metadata.sqlite3: steady-state metadata and object truth-store
locks/: repository-wide shared / exclusive lock file
objects/blobs/*.data: published blob payload bytes
chunks/packs/*.pack: published packed chunk payload bytes
cache/: detached file and snapshot views
txn/: in-progress staging and residue cleanup area
quarantine/: isolated recovery or maintenance leftovers when needed
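As a quick sanity check of the layout described above, the following sketch reports which of the typical top-level entries are absent from a directory. It uses only the Python standard library. The helper name `missing_layout_entries` is hypothetical and not part of hubvault, and the entry names are copied from the typical layout in this guide rather than read from the library, so an area that is only created on demand may legitimately show up as missing.

```python
from pathlib import Path

# Entry names copied from the typical layout in this guide (an assumption,
# not something queried from hubvault itself).
EXPECTED_FILES = ("FORMAT", "metadata.sqlite3")
EXPECTED_DIRS = ("cache", "chunks", "locks", "objects", "quarantine", "txn")


def missing_layout_entries(repo_root):
    """Return the expected top-level names that are absent under repo_root."""
    root = Path(repo_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means every typical entry is present; a non-empty list is a hint to look closer, not proof of corruption.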
What lives in SQLite versus the filesystem
The current repository model intentionally splits metadata truth from payload bytes.
SQLite stores the repository’s steady-state metadata and object records, including:
repository metadata
refs
reflog
transaction journal state
chunk visibility metadata
commit / tree / file / blob metadata
The filesystem still stores large or immutable payload bytes:
blob data files under objects/blobs/
packed chunk payload under chunks/packs/
detached user views under cache/
This design gives the repository one repo-local metadata truth store while keeping payload storage simple, portable, and easy to move with the repository.
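To see the metadata side of this split without risking a write, a SQLite file can be opened read-only and asked for its table names. The sketch below uses only the standard `sqlite3` module. The helper name `list_sqlite_tables` is hypothetical, and because this guide does not document the schema, the code makes no assumption about which tables exist. It is best run against a repository that no other process is currently writing to.

```python
import sqlite3
from pathlib import Path


def list_sqlite_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    # as_uri() needs an absolute path; mode=ro prevents accidental writes.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

For example, `list_sqlite_tables(repo_dir / "metadata.sqlite3")` would list whatever tables the current format defines.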
What you should and should not treat as truth
The key operational rule is:
metadata.sqlite3 is the steady-state metadata truth source
detached caches are rebuildable views, not truth
txn/ and quarantine/ are maintenance / recovery areas, not user data
Some directories from older layouts can still appear in a repository tree for migration or compatibility reasons, but they should not be treated as the primary truth source in current repositories.
Public file metadata versus private storage
hubvault intentionally separates public file identity from private storage
addressing.
For a public hubvault.models.RepoFile, the important user-facing fields
are:
path: the repo-relative path
oid / blob_id: file identity in Git/HF style
sha256: the bare 64-hex content digest
lfs: extra large-file metadata when the file is stored through chunked mode
These public values are not simply a dump of internal storage records. That separation is deliberate so public callers can reason about files without depending on private engine details.
Small files and large files
hubvault uses two storage modes:
small files stay in ordinary object storage
files at or above large_file_threshold switch to chunked storage
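The boundary in that rule is inclusive, which is easy to get wrong. The following toy function only restates the rule as written here; it is not hubvault's internal selection code, and the function name and the mode labels "small" and "chunked" are made up for illustration.

```python
def storage_mode(size_in_bytes, large_file_threshold):
    """Toy restatement of the rule: sizes at or above the threshold are chunked."""
    return "chunked" if size_in_bytes >= large_file_threshold else "small"


# With a threshold of 32 bytes, as in the complete example later in this guide:
print(storage_mode(10, 32))  # small
print(storage_mode(32, 32))  # chunked (the boundary counts as large)
print(storage_mode(64, 32))  # chunked
```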
From the public caller’s point of view, the repo path stays the same:
small, large = api.get_paths_info(["artifacts/small.bin", "artifacts/large.bin"])
print(small.lfs is None)
# True
print(large.lfs is not None)
# True
print(large.sha256)
# 64-hex digest, value varies
Even when the file is chunked internally, hf_hub_download() still returns a
path ending with the original repo-relative suffix such as
artifacts/large.bin.
Detached views are part of the design
The cache/ area is not accidental clutter. It is the user-view layer that
allows hubvault to return real paths on disk without exposing writable
aliases of committed truth.
That supports an important guarantee:
deleting or editing a downloaded file does not corrupt committed data
the next read can rebuild the detached view from repository truth
This is why download and snapshot paths are safe to hand to other local tools.
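The idea can be illustrated with a toy model in plain Python. This is not hubvault's implementation: the helper name `materialize_view` is hypothetical, and comparing bytes to detect a stale view is an assumption made only to keep the sketch short. The point it shows is that the view is a separate copy, so damaging it leaves the committed bytes untouched and the next call can rebuild it.

```python
import shutil
from pathlib import Path


def materialize_view(truth_file, view_file):
    """Copy committed bytes into a separate view file, rebuilding it if it drifted."""
    truth_file, view_file = Path(truth_file), Path(view_file)
    if not view_file.exists() or view_file.read_bytes() != truth_file.read_bytes():
        view_file.parent.mkdir(parents=True, exist_ok=True)
        # A real copy, never a link back into committed storage.
        shutil.copyfile(truth_file, view_file)
    return view_file
```

If another tool overwrites or deletes the file at `view_file`, calling `materialize_view` again restores it from `truth_file`, which was never exposed for writing.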
How a write works at a high level
A public write operation follows this broad pattern:
acquire the repository writer lock
stage payload and metadata changes under transaction-local state
publish immutable payload bytes
commit metadata truth atomically
clean residue and release the lock
The intended observable rule is simple:
If a write does not complete successfully, the repository should look as if that write never happened.
That rollback-oriented behavior is one of the reasons hubvault keeps
explicit transaction and recovery areas instead of mutating committed truth in
place.
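A minimal sketch of the stage-then-publish part of this pattern, using only the standard library, is shown below. It is a generic illustration, not hubvault's write path: it omits the writer lock and the atomic metadata commit entirely, the helper name `publish_atomically` is hypothetical, and the use of `os.replace` is an assumption about how such a step is commonly done rather than a description of the library.

```python
import os
import uuid
from pathlib import Path


def publish_atomically(root, relative_path, payload):
    """Stage payload under txn/, then publish it with a single atomic rename."""
    root = Path(root)
    staged = root / "txn" / (uuid.uuid4().hex + ".tmp")
    target = root / relative_path
    staged.parent.mkdir(parents=True, exist_ok=True)
    try:
        with open(staged, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # make staged bytes durable before publishing
        target.parent.mkdir(parents=True, exist_ok=True)
        # Atomic on the same filesystem: readers never observe a partial file.
        os.replace(staged, target)
    finally:
        # Clean residue; this is a no-op after a successful rename.
        staged.unlink(missing_ok=True)
    return target
```

If writing the staged file fails, the target path is never created and the staging file is removed, which mirrors the "as if that write never happened" rule on a very small scale.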
Why the structure supports portability
Because the repository keeps its durable state inside one root directory:
there is no repo-external sidecar database to carry around
there is no absolute-path binding in repository truth
archive / restore workflows do not need a rebuild step just to reopen
This is the practical reason hubvault can act like a portable local
artifact repository rather than a cache that depends on host-local state.
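Since everything durable lives under one root, an archive and restore round trip needs nothing beyond ordinary file tools. The sketch below uses `shutil` from the standard library; the helper name `archive_and_restore` is hypothetical, and the archive location is assumed to be outside the repository root.

```python
import shutil
from pathlib import Path


def archive_and_restore(repo_root, archive_base, restore_root):
    """Pack a repository directory into a .tar.gz and unpack it elsewhere."""
    # archive_base is the archive path without its extension and is assumed
    # to live outside repo_root.
    archive_path = shutil.make_archive(
        str(archive_base), "gztar", root_dir=str(repo_root)
    )
    shutil.unpack_archive(archive_path, str(restore_root))
    return Path(restore_root)
```

The restored directory can then be opened the same way as any other repository root, for example with `HubVaultApi(restored_dir)` as in the complete example that follows.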
Complete structure example
from pathlib import Path
from hubvault import HubVaultApi
repo_dir = Path("structure-repo")
api = HubVaultApi(repo_dir)
api.create_repo(large_file_threshold=32)
api.upload_file(
    path_or_fileobj=b"small-file",
    path_in_repo="artifacts/small.bin",
    commit_message="add small file",
)
api.upload_file(
    path_or_fileobj=b"A" * 64,
    path_in_repo="artifacts/large.bin",
    commit_message="add large file",
)
print((repo_dir / "FORMAT").exists()) # True
print((repo_dir / "metadata.sqlite3").exists()) # True
print((repo_dir / "locks" / "repo.lock").exists()) # True
small, large = api.get_paths_info(
    ["artifacts/small.bin", "artifacts/large.bin"]
)
print(small.lfs is None) # True
print(large.lfs is not None) # True
download_path = api.hf_hub_download("artifacts/large.bin")
print(Path(download_path).as_posix().endswith("artifacts/large.bin"))
# True
overview = api.get_storage_overview()
print(overview.total_size > 0) # True
Note
Exact IDs, file counts, and pack counts vary by repository state. The stable part is the role of each area, the split between metadata truth and payload bytes, and the detached-view semantics.