Specific improvements#33
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements the core infrastructure for the SMP Graph Engine, an ingest-free, memory-mapped graph database for code analysis. The changes include low-level file management with a Write-Ahead Log (WAL), indexing structures (Crit-bit and Radix trees), a deduplicated string pool, and a tree-sitter-based parsing engine for Python. While the architectural foundation is solid, the current implementation contains several critical bugs: serialized node and edge data are packed but never actually written to the memory-mapped file, and the indexing logic uses dummy pointers instead of real offsets. Additionally, there are logic errors regarding tree-sitter sibling traversal for decorators, case-sensitivity issues with NodeType mapping, and unsafe handling of header CRC mismatches.
| graph_nodes.append( | ||
| GraphNode( | ||
| id=pnode.node_id, | ||
| type=NodeType(pnode.type.upper()), |
There was a problem hiding this comment.
The NodeType enum in smp.core.models uses TitleCase values (e.g., "Function", "Class"). Calling .upper() on pnode.type (which is already "Function" or "Class") will result in "FUNCTION" or "CLASS", causing NodeType() to raise a ValueError because it is a StrEnum with case-sensitive matching.
| type=NodeType(pnode.type.upper()), | |
| type=NodeType(pnode.type), |
| def _extract_decorators(self, node: Any, content: bytes) -> list[str]: | ||
| """Extract decorators.""" | ||
| decorators: list[str] = [] | ||
| for child in node.prev_sibling: |
There was a problem hiding this comment.
The node.prev_sibling property in tree-sitter returns a single Node object or None; it is not an iterable. Attempting to iterate over it will raise a TypeError. You should use a while loop to traverse previous siblings if you are looking for multiple decorators.
| for child in node.prev_sibling: | |
| curr = node.prev_sibling | |
| while curr and curr.type == "decorator": | |
| text = content[curr.start_byte : curr.end_byte].decode("utf-8", errors="replace") | |
| decorators.append(text.strip()) | |
| curr = curr.prev_sibling |
| if actual_crc != stored_crc: | ||
| pass |
There was a problem hiding this comment.
Ignoring a CRC mismatch in the file header is dangerous for a database as it indicates potential data corruption. This should raise an exception to prevent further operations on a corrupted file.
| if actual_crc != stored_crc: | |
| pass | |
| if actual_crc != stored_crc: | |
| raise ValueError("Header CRC mismatch: file may be corrupted") |
| struct.pack( | ||
| "<BIII III I", | ||
| 1, | ||
| name_off, | ||
| sig_off, | ||
| file_off, | ||
| node.structural.start_line, | ||
| node.structural.end_line or 0, | ||
| 0, | ||
| 0, | ||
| ) |
| payload = struct.pack("<I", count) | ||
| for target_off, etype in targets: | ||
| payload += struct.pack("<II", target_off, etype) |
| self.index.insert(node.id, 0) | ||
| if self.radix: | ||
| self.radix.insert(node.file_path, 0) |
| if name and name[0].islower(): | ||
| # Likely a module name, not function | ||
| continue |
There was a problem hiding this comment.
This heuristic incorrectly skips module-level imports in Python, as most module names (e.g., os, sys, requests) start with a lowercase letter. This will cause the engine to miss a significant number of valid IMPORTS relationships.
| if name and name[0].islower(): | |
| # Likely a module name, not function | |
| continue | |
| if not name: | |
| continue |
No description provided.