Commit 24d46f8

bk2204 authored and gitster committed
docs: improve ambiguous areas of pack format documentation
It is fair to say that our pack and indexing code is quite complex. Contributors who wish to work on this code or implementors of other implementations would benefit from clear, unambiguous documentation about how our data formats are structured and encoded and what data is used in the computation of certain values.

Unfortunately, some of this data is missing, which leads to confusion and frustration. Let's document some of this data to help clarify things.

Specify over what data CRC32 values are computed and also note which CRC32 algorithm is used, since Wikipedia mentions at least four 32-bit CRC algorithms and notes that it's possible to use different bit orderings.

In addition, note how we encode objects in the pack. One might be led to believe that packed objects are always stored with the "<type> <size>\0" prefix of loose objects, but that is not the case, although for obvious reasons this data is included in the computation of the object ID. Explain why this is for the curious reader.

Finally, indicate what the size field of the packed object represents. Otherwise, a reader might think that the size of a delta is the size of the full object or that it might contain the offset or object ID, neither of which are the case. Explain clearly, however, that the values represent uncompressed sizes to avoid confusion.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
1 parent d477892 commit 24d46f8

1 file changed: +19 -0 lines changed

Documentation/gitformat-pack.adoc

Lines changed: 19 additions & 0 deletions
@@ -32,6 +32,10 @@ In a repository using the traditional SHA-1, pack checksums, index checksums,
 and object IDs (object names) mentioned below are all computed using SHA-1.
 Similarly, in SHA-256 repositories, these values are computed using SHA-256.

+CRC32 checksums are always computed over the entire packed object, including
+the header (n-byte type and length); the base object name or offset, if any;
+and the entire compressed object. The CRC32 algorithm used is that of zlib.
+
 == pack-*.pack files have the following format:

 - A header appears at the beginning and consists of the following:
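
The CRC32 rule added in the hunk above can be illustrated with a minimal Python sketch. It assumes Python's standard `zlib.crc32` (which wraps zlib's CRC32) and uses a hypothetical helper name, `pack_entry_crc32`. The one-byte header `0x35` assumes the type-and-length layout described elsewhere in the pack format documentation (3-bit type, low 4 bits of the size in the first byte); it is an illustrative entry, not one taken from a real pack.

```python
import zlib


def pack_entry_crc32(header: bytes, base_ref: bytes, compressed: bytes) -> int:
    """Return the CRC32 of one packed entry (hypothetical helper).

    The checksum covers, in order: the n-byte type-and-length header,
    the base object name or offset (empty for undeltified objects),
    and the zlib-compressed data.
    """
    crc = zlib.crc32(header)
    crc = zlib.crc32(base_ref, crc)    # no-op when base_ref is empty
    crc = zlib.crc32(compressed, crc)
    return crc


# Illustrative undeltified blob whose content is b"hello".
raw = b"hello"
header = bytes([0x35])            # type 3 (blob), size 5, no continuation bit
compressed = zlib.compress(raw)   # the entry stores compressed data
crc = pack_entry_crc32(header, b"", compressed)
print(f"{crc:08x}")
```

Feeding the three parts incrementally gives the same value as checksumming their concatenation, so an implementation can compute the CRC32 while streaming the entry.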
@@ -80,6 +84,16 @@ Valid object types are:

 Type 5 is reserved for future expansion. Type 0 is invalid.

+=== Object encoding
+
+Unlike loose objects, packed objects do not have a prefix containing the type,
+size, and a NUL byte. These are not necessary because they can be determined by
+the n-byte type and length that prefixes the data and so they are omitted from
+the compressed and deltified data.
+
+The computation of the object ID still uses this prefix by reconstructing it
+from the type and length as needed.
+
 === Size encoding

 This document uses the following "size encoding" of non-negative
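
The "Object encoding" paragraph added in the hunk above says the object ID is computed over a reconstructed "<type> <size>\0" prefix plus the object data. A minimal sketch of that reconstruction, assuming Python's standard `hashlib` and a hypothetical helper name `object_id`:

```python
import hashlib


def object_id(obj_type: str, data: bytes, algo: str = "sha1") -> str:
    """Compute an object ID from the type name and uncompressed data.

    The "<type> <size>\\0" prefix is not stored in the pack; it is rebuilt
    here from the type and the length of the uncompressed object.
    Pass algo="sha256" for SHA-256 repositories.
    """
    prefix = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.new(algo, prefix + data).hexdigest()


# The empty blob has a well-known SHA-1 object ID.
print(object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

For a deltified object, the data passed in would be the fully reconstructed object after applying the delta, not the delta itself.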
@@ -92,6 +106,11 @@ values are more significant.
 This size encoding should not be confused with the "offset encoding",
 which is also used in this document.

+When encoding the size of an undeltified object in a pack, the size is that of
+the uncompressed raw object. For deltified objects, it is the size of the
+uncompressed delta. The base object name or offset is not included in the size
+computation.
+
 === Deltified representation

 Conceptually there are only four object types: commit, tree, tag and
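
The size rule added in the last hunk can be sketched as follows. This is a rough illustration, not git's implementation: the helper names (`encode_entry_header`, `decode_entry_header`, `build_entry`) are hypothetical, and the bit layout (3-bit type and low 4 size bits in the first byte, then 7 bits per continuation byte, least significant first) is assumed from the rest of the pack format documentation. The placeholder delta bytes are made up for the demonstration.

```python
import zlib

OBJ_BLOB = 3       # undeltified blob
OBJ_REF_DELTA = 7  # delta whose base is named by object ID


def encode_entry_header(obj_type: int, size: int) -> bytes:
    """Encode the n-byte type-and-length header of a packed entry."""
    first = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(first | 0x80)   # set the continuation bit
        first = size & 0x7F
        size >>= 7
    out.append(first)
    return bytes(out)


def decode_entry_header(buf: bytes) -> tuple[int, int, int]:
    """Return (type, size, header length in bytes)."""
    byte = buf[0]
    obj_type = (byte >> 4) & 0x07
    size = byte & 0x0F
    shift, pos = 4, 1
    while byte & 0x80:
        byte = buf[pos]
        size |= (byte & 0x7F) << shift
        shift += 7
        pos += 1
    return obj_type, size, pos


def build_entry(obj_type: int, payload: bytes, base_ref: bytes = b"") -> bytes:
    """Assemble header + optional base reference + compressed payload.

    `payload` is the raw object for undeltified entries and the delta for
    deltified ones; in both cases the size field is its uncompressed length.
    The base reference does not contribute to the size.
    """
    header = encode_entry_header(obj_type, len(payload))
    return header + base_ref + zlib.compress(payload)


raw = b"x" * 1000
entry = build_entry(OBJ_BLOB, raw)
print(decode_entry_header(entry)[:2])   # (3, 1000), not the compressed length

delta = b"placeholder delta bytes"      # made-up stand-in for a real delta
base = b"\x00" * 20                     # made-up 20-byte SHA-1 base object name
entry = build_entry(OBJ_REF_DELTA, delta, base)
print(decode_entry_header(entry)[:2])   # (7, 23): size of the delta only
```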
