Skip to content

Implement ID3 variable-byte decoding for indirect pointer specifiers (i / I) #237

@unclesp1d3r

Description

@unclesp1d3r

Summary

magic(5) defines i (little-endian) and I (big-endian) pointer specifiers for indirect offsets. These are NOT plain 32-bit integers -- they are ID3v2 "synchsafe integer" variable-byte encodings used in MP3 frame size headers, where each byte carries 7 bits of data and the high bit is reserved.

Current state (after PR #233)

The parser accepts i and I in indirect pointer specifiers (e.g., (6.I+10) in /usr/share/file/magic/audio line 308), but the evaluator treats them as plain 32-bit Long values with the corresponding endianness. So an ID3 frame header with synchsafe size 0x00 0x01 0x00 0x00 (which decodes to size 16384) is read as the literal 0x00010000 = 65536 -- four times the real size.

This means >(6.I+10) correctly resolves the offset arithmetically, but at the WRONG file position because the ID3 size field was decoded plain instead of synchsafe.

Real-world need

  • /usr/share/file/magic/audio:308: >(6.I+10) indirect x \b, contains: -- ID3v2-tagged audio file detection. The I reads the ID3 frame size, adds 10 for the header overhead, and recurses into the inner audio stream. Without synchsafe decoding, the recursive offset is wrong for any file >127 bytes (which is essentially all MP3 files).

Spec

ID3v2.3 / ID3v2.4 synchsafe integer encoding:

Stored bytes:    0aaaaaaa  0bbbbbbb  0ccccccc  0ddddddd
Decoded value:   aaaaaaa_bbbbbbb_ccccccc_ddddddd  (28-bit value, 4*7 bits)

Each byte's high bit (0x80) is always 0; the remaining 7 bits contribute to the value. So:

fn decode_synchsafe_be(bytes: &[u8; 4]) -> u32 {
    ((bytes[0] as u32 & 0x7f) << 21)
  | ((bytes[1] as u32 & 0x7f) << 14)
  | ((bytes[2] as u32 & 0x7f) << 7)
  |  (bytes[3] as u32 & 0x7f)
}

i and I differ only in byte order (LE swaps the byte input order before decoding).

Implementation outline

  1. AST -- The current parser maps i -> TypeKind::Long { endian: Little } and I -> TypeKind::Long { endian: Big } as a placeholder. Replace with a new pointer-type variant that records "synchsafe" semantics. Either:

    • New TypeKind::Synchsafe { endian } (specific)
    • Or a decode: PointerDecoding field on OffsetSpec::Indirect.pointer_type that's Plain for b/s/l/q and Synchsafe for i/I

    Option B keeps TypeKind from growing; Option A is more discoverable.

  2. Parser -- pointer_specifier_to_type in parser/grammar/mod.rs already accepts i/I; have it produce the new variant/field instead of falling back to Long.

  3. Evaluator -- evaluator/offset/indirect.rs::resolve_indirect_offset_with_anchor reads the pointer via read_pointer. Add a synchsafe decode branch before applying adjustment_op. Validate that the input bytes have high bits clear (real synchsafe encoding); if not, this is malformed ID3 and the read should fail with EvaluationError::InvalidOffset.

  4. Codegen -- Update serialize_offset_spec / serialize_type_kind to round-trip the new variant or field.

  5. Tests -- ID3v2 fixture file with a known frame size; verify the decoded offset matches.

Acceptance criteria

  • (6.I+10) in audio:308 correctly resolves into the inner audio stream of an MP3 with ID3v2 tags
  • Round-trip codegen
  • Fixture-based conformance: an MP3 file with audio magic database produces the same \\b, contains: <inner-format> output as GNU file
  • Malformed synchsafe (high bit set) is rejected as InvalidOffset rather than silently producing the wrong value

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcompatibilitylibmagic compatibility and migrationenhancementNew feature or requestevaluatorRule evaluation engine and logicparserMagic file parsing components and grammartestingTest infrastructure and coveragetype:feature

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions