Lazy by design. Fast by default.
SLOTH (Structural Loader with On-demand Traversal Handling) is a fast, flexible mmCIF parser designed for structural biology workflows. Built on the C++ gemmi backend, SLOTH performs eager parsing and lazy object construction, making it efficient for both large-scale pipelines and interactive exploration.
- High-speed parsing via gemmi
- Lazy construction of row and item objects for memory efficiency
- Pythonic, dot-notation access to mmCIF data
- Pluggable custom validation system
- Export and import in nested JSON format with automatic relationship resolution
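To make the lazy row construction and dot-notation access in the list above concrete, here is a minimal standalone sketch. It does not use SLOTH or gemmi, and it is not SLOTH's actual implementation: `LazyCategory`, `Row`, and `rows_built` are hypothetical names chosen for illustration. It assumes columns are held as plain lists and that row objects are only built (and cached) the first time a row is indexed.

```python
class Row:
    """View of one row; values are looked up in the parent's columns on demand."""

    def __init__(self, columns, index):
        self._columns = columns
        self._index = index

    def __getattr__(self, name):
        # Called only when normal lookup fails, so real attributes are unaffected.
        if name.startswith("_") or name not in self._columns:
            raise AttributeError(name)
        return self._columns[name][self._index]


class LazyCategory:
    """Column store with dot access to items and lazily built, cached rows."""

    def __init__(self, name, columns):
        self.name = name
        self._columns = columns   # item name -> list of string values
        self._row_cache = {}      # row index -> Row, filled on first access

    def __getattr__(self, item):
        if item.startswith("_") or item not in self._columns:
            raise AttributeError(item)
        return self._columns[item]

    def __getitem__(self, key):
        if isinstance(key, str):  # dictionary-style access to a column
            return self._columns[key]
        if key not in self._row_cache:
            self._row_cache[key] = Row(self._columns, key)
        return self._row_cache[key]

    @property
    def rows_built(self):
        """Number of Row objects constructed so far."""
        return len(self._row_cache)


atom_site = LazyCategory("_atom_site", {
    "type_symbol": ["N", "C"],
    "Cartn_x": ["10.123", "11.234"],
})

print(atom_site.Cartn_x[0])               # column access, no Row is created
print(atom_site.rows_built)               # 0
first_atom = atom_site[0]                 # the Row is created here
print(first_atom.type_symbol, first_atom.Cartn_x)
print(atom_site.rows_built)               # 1
```

The point of the design is that column-wise work (the common case for coordinates) never pays for per-row Python objects.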
"Why rush when you can be lazy?"
"Parse eagerly. Construct lazily. Access quickly."
SLOTH is unapologetically Pythonic:
- No runtime flags
- No manual optimizations
- No overengineering
Just smart defaults, clear abstractions, and expressive, maintainable code.
Install from source:
git clone https://github.com/lucas-ebi/sloth.git
cd sloth
pip install -e ".[dev]"

from sloth import MMCIFHandler
handler = MMCIFHandler()
mmcif = handler.read("1abc.cif")
print(mmcif.data_1ABC._struct.title[0])
print(mmcif.data_1ABC._atom_site.Cartn_x[0])

# Dot notation
block = mmcif.data_1ABC
atom_site = block._atom_site
print(atom_site.Cartn_x[0])
# Dictionary notation (dynamic fields)
category_name = "_atom_site"
field_name = "Cartn_x"
x = mmcif.data[0][category_name][field_name]

x_coords = atom_site.Cartn_x
first_atom = atom_site[0]
print(first_atom.type_symbol, first_atom.Cartn_x)

# CA atoms from chain A
ca_atoms = [a for a in atom_site if a.label_atom_id == "CA" and a.label_asym_id == "A"]
# Mean X coordinate
avg_x = sum(float(x) for x in atom_site.Cartn_x) / atom_site.row_count

for cat_name in block.categories:
    category = block[cat_name]
    for item_name in category.items:
        print(f"{cat_name}.{item_name}: {len(category[item_name])} values")

sample = """data_1ABC
_entry.id 1ABC_STRUCTURE
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N 10.123 20.456 30.789
ATOM 2 C 11.234 21.567 31.890
"""
with open("sample.cif", "w") as f:
    f.write(sample)

from sloth.mmcif import MMCIFDataContainer, DataBlock, Category
mmcif = MMCIFDataContainer()
block = DataBlock("1ABC")
cat = Category("_entry")
cat["id"] = ["1ABC_STRUCTURE"]
block["_entry"] = cat
mmcif["1ABC"] = block

mmcif = MMCIFDataContainer()
mmcif.data_1ABC._entry.id = ["1ABC_STRUCTURE"]
mmcif.data_1ABC._atom_site.Cartn_x = ["10.1", "11.2"]

# Export to nested JSON format (with resolved parent-child relationships)
# Returns JSON string
json_str = handler.export(mmcif, indent=2)
# Export to file (pretty-printed)
handler.export(mmcif, file_path="out_nested.json", indent=2)
# Compact JSON (no indentation)
handler.export(mmcif, file_path="out_compact.json")

Nested JSON Structure:
SLOTH automatically resolves mmCIF dictionary relationships when exporting to JSON. Child categories are nested within their parent categories, creating a hierarchical structure:
{
  "data_DEMO": {
    "_entry": [...],
    "_entity": [
      {
        "id": "1",
        "type": "polymer",
        "_entity_poly": [
          {
            "entity_id": "1",
            "_entity_poly_seq": [...]
          }
        ],
        "_struct_asym": [
          {
            "id": "A",
            "_atom_site": [...]
          }
        ]
      }
    ]
  }
}

Note: All category names maintain the _ prefix convention, whether at the top level or nested.
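The nesting and flattening described above can be sketched without SLOTH. This is an illustration only, under stated assumptions: the `LINKS` table is hand-written here (SLOTH is described as resolving relationships from the mmCIF dictionary), the key items `entity_id` and `label_asym_id` are assumed to be the linking items, and `nest` and `flatten` are hypothetical helper names. The sketch relies on the `_` prefix convention to tell nested categories apart from ordinary items, and it drops child rows that have no matching parent.

```python
import copy

# Hypothetical links: (child category, child key, parent category, parent key).
LINKS = [
    ("_struct_asym", "entity_id", "_entity", "id"),
    ("_atom_site", "label_asym_id", "_struct_asym", "id"),
]


def nest(flat, links=LINKS):
    """Return a nested copy of `flat` with child rows placed inside matching parent rows."""
    rows = copy.deepcopy(flat)  # leave the caller's data untouched
    nested_children = set()
    for child, child_key, parent, parent_key in links:
        if child not in rows or parent not in rows:
            continue
        for parent_row in rows[parent]:
            # Row dicts are shared, so deeper levels attached later still show up here.
            parent_row[child] = [
                r for r in rows[child] if r[child_key] == parent_row[parent_key]
            ]
        nested_children.add(child)
    return {name: rs for name, rs in rows.items() if name not in nested_children}


def flatten(nested):
    """Inverse of nest(): move every nested '_category' list back to the top level."""
    flat = {}

    def visit(name, rows):
        for row in rows:
            items = {k: v for k, v in row.items() if not k.startswith("_")}
            flat.setdefault(name, []).append(items)
            for key, value in row.items():
                if key.startswith("_"):
                    visit(key, value)

    for name, rows in nested.items():
        visit(name, rows)
    return flat


flat = {
    "_entity": [{"id": "1", "type": "polymer"}],
    "_struct_asym": [{"id": "A", "entity_id": "1"}],
    "_atom_site": [
        {"id": "1", "label_asym_id": "A", "Cartn_x": "10.123"},
        {"id": "2", "label_asym_id": "A", "Cartn_x": "11.234"},
    ],
}

nested = nest(flat)
print(list(nested))               # ['_entity']
print(flatten(nested) == flat)    # True
```

Because child rows are grouped under their parents, a flattened result lists children in parent order; the example data is already in that order, so the round trip compares equal.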
# Import from JSON file (automatically flattens nested structure back to mmCIF)
mmcif = handler.load("out_nested.json")
# Access data using standard mmCIF notation
print(mmcif.data[0]._entity.id)
print(mmcif.data[0]._atom_site.Cartn_x)

def verify_round_trip(original, imported):
    """Verify data integrity after JSON export/import round-trip."""
    orig_block = original.data[0]
    imp_block = imported.data[0]
    # Check categories preserved
    if set(orig_block.categories) == set(imp_block.categories):
        print("✅ Categories: OK")
    else:
        print("❌ Categories: Mismatch")
    # Check atom data preserved
    if "_atom_site" in orig_block.categories:
        orig_count = len(orig_block._atom_site.Cartn_x)
        imp_count = len(imp_block._atom_site.Cartn_x)
        print("✅ Atoms: OK" if orig_count == imp_count else "❌ Atoms: Mismatch")
# Test round-trip
handler.export(mmcif, file_path="test.json")
imported = handler.load("test.json")
verify_round_trip(mmcif, imported)

from sloth import MMCIFHandler, ValidatorFactory
vf = ValidatorFactory()
vf.register_validator("_atom_site", lambda cat: print("Validating", cat.name))
handler = MMCIFHandler(validator_factory=vf)
mmcif = handler.read("1abc.cif")
mmcif.data_1ABC._atom_site.validate()

# Cross-checker receives two Category objects
vf.register_cross_checker(
    ("_entity", "_atom_site"),
    lambda e, a: print(f"Cross-checking {e.name} ↔ {a.name}") or set(e.id).issuperset(set(a.label_entity_id))
)
# Run cross-validation only
mmcif.data_1ABC._entity.validate.against(mmcif.data_1ABC._atom_site)
# Or validate first, then cross-check
mmcif.data_1ABC._entity.validate().against(mmcif.data_1ABC._atom_site)

SLOTH includes a comprehensive Jupyter notebook cookbook that demonstrates all features interactively:
jupyter notebook SLOTH_Cookbook.ipynb

The cookbook covers:
- Parsing mmCIF files with embedded data
- Exploring data structures with dot notation
- 2D slicing (column-wise and row-wise access)
- Validating mmCIF data
- Modifying data elegantly
- Creating sample data (manual, programmatic, and auto-creation)
- Exporting to nested JSON with automatic relationship resolution
- Importing from JSON with automatic flattening
- Round-trip validation
- Writing modified mmCIF files
- Complete workflow examples
Perfect for learning SLOTH interactively or as a reference guide.
Performance benchmarks on synthetic mmCIF files (measured on macOS with Python 3.10):
| File Size | Full Parse | Selective | Access Speed | Memory (Parse) | Memory (Access) |
|---|---|---|---|---|---|
| 1KB | 12ms | 13ms | 40μs | 198KB | 4KB |
| 10KB | 12ms | 13ms | 97μs | 222KB | 13KB |
| 100KB | 13ms | 14ms | 594μs | 1.0MB | 104KB |
| 1.0MB | 19ms | 25ms | 6ms | 7.7MB | 954KB |
| 50.7MB | 394ms | 693ms | 298ms | 205.4MB | 46.1MB |
| 102.0MB | 817ms | 1.4s | 607ms | 386.8MB | 75.5MB |
Memory Architecture:
- Parse Memory: Gemmi C++ structures (eager loading, ~4-8x file size for typical files)
- SLOTH overhead: ~0KB (lazy wrappers add no measurable memory cost)
- Access Memory: Python string conversion cost (only when data is accessed)
- Multiplier trend: Small files show higher relative overhead due to fixed costs, stabilizing at ~4x for large files (>50MB)
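The multiplier trend can be checked directly against the benchmark table. The short calculation below uses only the MB-sized rows, with values copied from the table; `multiplier` is a helper name introduced here for illustration.

```python
# (file size in MB, parse memory in MB), copied from the benchmark table
rows = [
    (1.0, 7.7),
    (50.7, 205.4),
    (102.0, 386.8),
]


def multiplier(size_mb, mem_mb):
    """Parse memory as a multiple of the file size."""
    return mem_mb / size_mb


for size_mb, mem_mb in rows:
    print(f"{size_mb:>6.1f} MB file -> {multiplier(size_mb, mem_mb):.1f}x parse memory")
# Prints 7.7x, 4.1x and 3.8x respectively
```

The two large files land close to 4x, while the 1.0MB file sits near the top of the 4-8x range, which matches the fixed-cost explanation.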
SLOTH's lazy object creation keeps overhead minimal: LazyGemmiColumn defers Python string conversions until they are needed, adding virtually no memory cost beyond Gemmi's parsing.
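As a rough illustration of deferred conversion, the sketch below wraps a list of raw byte strings (a stand-in for values held by the C++ backend) and creates Python strings only when an element is read. It is not the real `LazyGemmiColumn`: the class name `LazyColumn`, the `conversions` counter, and the use of `bytes` as the backing store are all assumptions made for this example.

```python
from collections.abc import Sequence


class LazyColumn(Sequence):
    """Sequence view over backend values that converts to str only on access."""

    def __init__(self, raw_values):
        self._raw = raw_values    # stand-in for values owned by the parser backend
        self.conversions = 0      # number of Python strings created so far

    def __len__(self):
        return len(self._raw)     # length is known without converting anything

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self._convert(v) for v in self._raw[index]]
        return self._convert(self._raw[index])

    def _convert(self, raw):
        self.conversions += 1
        return raw.decode("ascii")


cartn_x = LazyColumn([b"10.123", b"11.234", b"12.345"])

print(len(cartn_x), cartn_x.conversions)    # 3 0
print(cartn_x[0], cartn_x.conversions)      # 10.123 1

# Iterating converts each value once more, only at this point.
mean_x = sum(float(x) for x in cartn_x) / len(cartn_x)
print(cartn_x.conversions)                  # 4
```

This sketch converts again on every access; whether the real wrapper caches converted values is not stated in this README.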
Note: Medium-sized files (5-20MB) may show measurement anomalies due to C++ allocation timing. Trust measurements for small (<1MB) and large (>50MB) files.
# Dot access for known fields
x = mmcif.data_1ABC._atom_site.Cartn_x
# Dict access for dynamic fields
val = mmcif.data[0]["_atom_site"]["Cartn_x"]
# Partial category loading
mmcif = handler.read("file.cif", categories=["_atom_site"])

- Fork
- Create a branch
- Add tests
- Submit a PR
MIT License — use freely, modify responsibly.
SLOTH is not just a parser — it’s a mindset.
Pythonic. Lazy. Elegant. Fast.
