docs: improve ambiguous areas of pack format documentation

bk2204 · gitster · commit 24d46f86337b · 2025-10-09T17:46:14.000-07:00
It is fair to say that our pack and indexing code is quite complex.
Contributors who wish to work on this code, or authors of other
implementations, would benefit from clear, unambiguous documentation
about how our data formats are structured and encoded and what data is
used in the computation of certain values.  Unfortunately, some of this
documentation is missing, which leads to confusion and frustration.

Let's document some of this data to help clarify things.  Specify over
what data CRC32 values are computed and also note which CRC32 algorithm
is used, since Wikipedia mentions at least four 32-bit CRC algorithms
and notes that it's possible to use different bit orderings.
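
As a rough illustration of the rule documented here (the CRC32 covers
the entry header, any base object name or offset, and the compressed
data), the following Python sketch builds an undeltified blob entry and
checksums it.  The helper name encode_entry_header is hypothetical, and
Python's zlib.crc32 is assumed to stand in for the zlib CRC32 mentioned
in the documentation; this is a sketch, not Git's implementation.

```python
import zlib

OBJ_BLOB = 3  # pack object type number for a blob


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Hypothetical helper: encode the n-byte type and length header.

    First byte: continuation bit, 3-bit type, low 4 bits of the size.
    Following bytes: continuation bit plus 7 more size bits each,
    least significant group first.
    """
    current = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(current | 0x80)  # more size bytes follow
        current = size & 0x7F
        size >>= 7
    out.append(current)
    return bytes(out)


raw = b"hello world\n"
header = encode_entry_header(OBJ_BLOB, len(raw))
compressed = zlib.compress(raw)

# An undeltified entry has no base object name or offset, so the
# checksummed bytes are just the header followed by the zlib stream.
entry = header + compressed
crc = zlib.crc32(entry)
print(f"{crc:08x}")
```

For a deltified entry, the base object name or offset would sit between
the header and the compressed data and would be checksummed as well.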

In addition, note how we encode objects in the pack.  One might be led
to believe that packed objects are always stored with the "<type>
<size>\0" prefix of loose objects, but that is not the case, although
for obvious reasons this data is included in the computation of the
object ID.  For the curious reader, explain why this is so.
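
To make the point concrete, here is a minimal Python sketch of how the
prefix can be rebuilt from the type and length before hashing, assuming
a SHA-1 repository.  The function name object_id is hypothetical and
chosen only for this example.

```python
import hashlib


def object_id(obj_type: str, raw: bytes) -> str:
    """Hypothetical helper: hash "<type> <size>\\0" followed by the data.

    The prefix is not stored in the pack; it is reconstructed here from
    the object type and the length of the uncompressed data.
    """
    prefix = f"{obj_type} {len(raw)}".encode("ascii") + b"\0"
    return hashlib.sha1(prefix + raw).hexdigest()


print(object_id("blob", b""))
# prints e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

In a SHA-256 repository the same construction would use SHA-256 in
place of SHA-1.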

Finally, indicate what the size field of the packed object represents.
Otherwise, a reader might think that the size of a delta is the size of
the full object or that it might contain the offset or object ID,
neither of which is the case.  Also state explicitly that the values
represent uncompressed sizes to avoid confusion.
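
As a sketch of what this means for a reader of a pack, the Python
fragment below decodes the header of a hand-built undeltified entry and
checks that the size field matches the uncompressed data rather than
the compressed stream.  The helper name decode_entry_header and the
sample entry are assumptions made for this example only.

```python
import zlib


def decode_entry_header(buf: bytes) -> tuple[int, int, int]:
    """Hypothetical helper: return (type, size, header length in bytes)."""
    byte = buf[0]
    obj_type = (byte >> 4) & 0x07
    size = byte & 0x0F  # low 4 bits of the size
    shift = 4
    pos = 1
    while byte & 0x80:  # continuation bit: another size byte follows
        byte = buf[pos]
        size |= (byte & 0x7F) << shift
        shift += 7
        pos += 1
    return obj_type, size, pos


raw = b"a" * 1000
compressed = zlib.compress(raw)
# 0xb8 0x3e encodes type 3 (blob) with size 1000.
entry = b"\xb8\x3e" + compressed

obj_type, size, header_len = decode_entry_header(entry)
data = zlib.decompress(entry[header_len:])

# The size field describes the uncompressed object, not the zlib stream.
print(obj_type, size, len(data), len(compressed) < size)
```

For a deltified entry the same field would give the length of the
uncompressed delta, and the base object name or offset that follows the
header would not be counted in it.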

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
diff --git a/Documentation/gitformat-pack.adoc b/Documentation/gitformat-pack.adoc
@@ -32,6 +32,10 @@ In a repository using the traditional SHA-1, pack checksums, index checksums,
 and object IDs (object names) mentioned below are all computed using SHA-1.
 Similarly, in SHA-256 repositories, these values are computed using SHA-256.
 
+CRC32 checksums are always computed over the entire packed object, including
+the header (n-byte type and length); the base object name or offset, if any;
+and the entire compressed object.  The CRC32 algorithm used is that of zlib.
+
 == pack-*.pack files have the following format:
 
    - A header appears at the beginning and consists of the following:
@@ -80,6 +84,16 @@ Valid object types are:
 
 Type 5 is reserved for future expansion. Type 0 is invalid.
 
+=== Object encoding
+
+Unlike loose objects, packed objects do not have a prefix containing the type,
+size, and a NUL byte. These are not necessary because they can be determined by
+the n-byte type and length that prefixes the data and so they are omitted from
+the compressed and deltified data.
+
+The computation of the object ID still uses this prefix by reconstructing it
+from the type and length as needed.
+
 === Size encoding
 
 This document uses the following "size encoding" of non-negative
@@ -92,6 +106,11 @@ values are more significant.
 This size encoding should not be confused with the "offset encoding",
 which is also used in this document.
 
+When encoding the size of an undeltified object in a pack, the size is that of
+the uncompressed raw object. For deltified objects, it is the size of the
+uncompressed delta.  The base object name or offset is not included in the size
+computation.
+
 === Deltified representation
 
 Conceptually there are only four object types: commit, tree, tag and