
feat: Support VARCHAR(max_length) in NeuG Type System#25

Open
shirly121 wants to merge 7 commits into main from add_varchar

Conversation

@shirly121
Collaborator

@shirly121 shirly121 commented Mar 10, 2026

Committed-by: Xiaoli Zhou from Dev container

What do these changes do?

Related issue number

Fixes #26

Greptile Summary

This PR adds VARCHAR(max_length) syntax to NeuG's Cypher type system, introducing a StringTypeInfo extra-type-info class, new LogicalType::STRING(size_t) factory, a grammar rule in Cypher.g4, and a broad refactoring of DDL property-handling code from DataTypeId to the richer DataType (which now carries max_length).

Key changes:

  • StringTypeInfo class added to store max_length; LogicalType::STRING() now defaults to 256 instead of the old 65536 constant
  • parseStringType() added to parse VARCHAR(N) from a string; endPtr validation was added (fixing a prior review thread), but non-positive values are not rejected: VARCHAR(0) silently creates an invalid type, and VARCHAR(-1) wraps to a very large value in the long long → size_t conversion and is then silently clamped to VARCHAR(65536)
  • DDL operators (create_vertex_type, create_edge_type, add_vertex_property, add_edge_property) updated from pair<string, Value> to tuple<DataType, string, Value> to propagate max_length through to storage
  • pb_utils.cc refactored from DataTypeId to DataType; temporal_type_to_property_type was partially updated (signature changed, body left using DataTypeId enums)
  • Test helper normalize() introduced to strip max_length values from comparison strings, avoiding brittle test failures
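The negative-length behavior described in the parseStringType() bullet can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `effective_max_length` and `kMaxStringMaxLen` are hypothetical names, and the only facts taken from the summary are the signed-to-unsigned conversion and the clamp to 65536.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the 65536 limit the summary attributes to
// getMaxStringMaxLen(); the name is illustrative, not NeuG's.
constexpr std::size_t kMaxStringMaxLen = 65536;

// Mimics the reported path: a signed length parsed with strtoll is
// converted to size_t and then clamped to the maximum.
std::size_t effective_max_length(long long parsed) {
    // Converting a negative signed value to an unsigned type is modular,
    // so -1 becomes the largest size_t value rather than an error.
    const std::size_t as_unsigned = static_cast<std::size_t>(parsed);
    // The clamp then hides the wrap by reducing it to the maximum.
    return std::min(as_unsigned, kMaxStringMaxLen);
}
```

Under these assumptions, on a typical 64-bit platform `effective_max_length(-1)` yields 65536 and `effective_max_length(0)` yields 0, which matches the two silent outcomes the summary reports.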

Confidence Score: 3/5

  • The core VARCHAR feature is mostly functional, but a missing maxLen > 0 guard allows zero and negative lengths to produce silent, incorrect results that could persist in the schema.
  • The majority of the refactoring is clean and correct. One logic bug remains: parseStringType does not validate that the parsed length is positive, so VARCHAR(0) and VARCHAR(-5) are silently accepted with incorrect behavior rather than returning a clear error. All other previously raised concerns have been addressed or acknowledged.
  • src/compiler/common/types/types.cpp — the parseStringType function needs a maxLen > 0 guard before calling LogicalType::STRING(maxLen).
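A sketch of what the requested guard might look like, assuming a standalone parser: `parse_varchar_length` is a hypothetical function, and `std::invalid_argument` stands in for the project's binder exception. It keeps the endPtr validation mentioned in the summary and adds the positivity check; upper-bound handling is left out on purpose.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: extracts N from "VARCHAR(N)" and rejects anything
// that is not a positive integer. Not the PR's parseStringType().
std::size_t parse_varchar_length(const std::string& text) {
    const std::string prefix = "VARCHAR(";
    if (text.size() <= prefix.size() ||
        text.compare(0, prefix.size(), prefix) != 0 || text.back() != ')') {
        throw std::invalid_argument("expected VARCHAR(N), got: " + text);
    }
    const std::string digits =
        text.substr(prefix.size(), text.size() - prefix.size() - 1);

    errno = 0;
    char* end = nullptr;
    const long long n = std::strtoll(digits.c_str(), &end, 10);
    // endPtr validation: no digits consumed, trailing junk, or overflow.
    if (end == digits.c_str() || *end != '\0' || errno == ERANGE) {
        throw std::invalid_argument("invalid VARCHAR length: " + digits);
    }
    // The guard the review asks for: reject zero and negative lengths
    // before they are converted to an unsigned type.
    if (n <= 0) {
        throw std::invalid_argument("VARCHAR length must be positive: " + digits);
    }
    return static_cast<std::size_t>(n);
}
```

Checking the sign while the value is still a `long long` is the point of the design: once the value has been converted to `size_t`, a negative input is indistinguishable from a very large one.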

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/compiler/common/types/types.cpp | Adds StringTypeInfo, LogicalType::STRING(size_t) factory, and parseStringType. The parseStringType function fails to validate that max_length > 0, allowing VARCHAR(0) to create an invalid type and negative values to silently wrap-cast to the maximum limit (65536). |
| include/neug/compiler/common/types/types.h | Declares StringTypeInfo class and the two STRING() factory overloads. Design looks sound; serializeInternal is a no-op but this was acknowledged as intentional in a previous thread. |
| src/utils/pb_utils.cc | Refactors property-type helpers from DataTypeId to DataType to carry max_length. string_type_to_property_type is correctly updated, but temporal_type_to_property_type still assigns raw DataTypeId enum values to the new DataType& output parameter; this compiles via implicit conversion but is inconsistent with the rest of the PR's refactoring. |
| include/neug/compiler/gopt/g_type_utils.h | YAML-to-type and type-to-YAML conversion updated to read/write max_length for STRING types. Logic correctly falls back to getDefaultStringMaxLen() when no extraTypeInfo is present. |
| src/compiler/gopt/g_type_converter.cpp | Physical-plan type converter updated to read max_length from StringTypeInfo and fall back to the default gracefully. No issues found. |
| src/compiler/antlr4/Cypher.g4 | Adds VARCHAR lexer token and VARCHAR(integer) production rule to nEUG_DataType. Grammar change is straightforward and correct. |
| tests/compiler/gopt_test.h | Introduces normalize() helper that strips max_length values before comparing plan/YAML output in tests, avoiding brittle string-equality failures. Approach is clean. |
| src/execution/execute/ops/ddl/create_vertex_type.cc | Changed property_def_t from pair<string, Value> to tuple<DataType, string, Value> so the explicit type (with max_length) flows through to storage. Change is consistent and correct. |
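To illustrate the idea behind the normalize() test helper, here is a rough sketch. The actual helper and the exact plan/YAML text format are not shown in this PR page, so the `max_length: <digits>` field shape and the name `normalize_sketch` are assumptions made for illustration only.

```cpp
#include <regex>
#include <string>

// Hypothetical version of a test normalizer: replaces every concrete
// max_length value with a fixed placeholder so that two outputs that
// differ only in string length limits compare equal.
// Assumes the field is rendered as "max_length: <digits>".
std::string normalize_sketch(const std::string& text) {
    static const std::regex max_len_field(R"(max_length\s*:\s*\d+)");
    return std::regex_replace(text, max_len_field, "max_length: <N>");
}
```

A test would then compare `normalize_sketch(expected)` with `normalize_sketch(actual)` instead of the raw strings, so a change to the default length does not invalidate every golden file.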

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as Cypher Parser<br/>(Cypher.g4)
    participant Binder as LogicalType::convertFromString
    participant PST as parseStringType()
    participant STR as LogicalType::STRING(size_t)
    participant STI as StringTypeInfo

    User->>Parser: CREATE NODE TABLE n(id INT64 PRIMARY KEY, name VARCHAR(128))
    Parser->>Binder: trimmedStr = "VARCHAR(128)"
    Binder->>PST: parseStringType("VARCHAR(128)")
    PST->>PST: strtoll → maxLen=128, validate endPtr
    Note over PST: ⚠️ No check for maxLen ≤ 0
    PST->>STR: LogicalType::STRING(128)
    STR->>STR: clamp to maxLimit (65536) if needed
    STR->>STI: new StringTypeInfo(128)
    STI-->>STR: StringTypeInfo{max_length=128}
    STR-->>PST: LogicalType(STRING, StringTypeInfo{128})
    PST-->>Binder: LogicalType(STRING, StringTypeInfo{128})
    Binder-->>User: VARCHAR(128) type created

Comments Outside Diff (2)

  1. src/compiler/common/types/types.cpp, line 1567-1576 (link)

    Exceeding max length is silently clamped — consider throwing instead

    When a caller passes a max_length larger than getMaxStringMaxLen() (65536), the value is silently reduced with only a LOG(WARNING). A user who writes VARCHAR(100000) will receive a column with VARCHAR(65536) without any query-level error, which is surprising.

    Consider throwing a binder exception (consistent with how invalid lengths are handled in parseStringType) rather than silently accepting invalid input:

    if (max_length > maxLimit) {
        THROW_BINDER_EXCEPTION(
            "The max length of VARCHAR exceeds the maximum allowed limit of " +
            std::to_string(maxLimit) + ". Given: " + std::to_string(max_length));
    }
  2. src/utils/pb_utils.cc, line 168-191 (link)

    Inconsistent refactoring — DataTypeId still assigned to DataType output

    The function signature was changed from DataTypeId& out_type to DataType& out_type as part of this PR's refactoring, but the assignments inside the switch still use raw DataTypeId enum values (e.g., DataTypeId::kDate, DataTypeId::kTimestampMs, DataTypeId::kInterval) rather than DataType factory methods.

    This compiles only because DataType has an implicit constructor from DataTypeId. In contrast, primitive_type_to_property_type — updated in the same PR — consistently uses DataType::INT32, DataType::UINT64, etc. The inconsistency makes it harder to reason about the code and may cause confusion for future maintainers.

    Consider updating the assignments to use the DataType factory style, for example:

    case common::Temporal::kDate32:
      out_type = DataType(DataTypeId::kDate);
      break;
    case common::Temporal::kDateTime:
      out_type = DataType(DataTypeId::kTimestampMs);
      break;
    // ... etc.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Last reviewed commit: 535e074

Greptile also left 1 inline comment on this PR.

Committed-by: Xiaoli Zhou from Dev container
@shirly121 shirly121 requested a review from zhanglei1949 March 10, 2026 08:37
shirly121 and others added 4 commits March 10, 2026 17:14
Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
Committed-by: xiaolei.zl from Dev container

Committed-by: xiaolei.zl from Dev container
@zhanglei1949
Collaborator

@greptile

zhanglei1949 previously approved these changes Mar 11, 2026
Collaborator

@zhanglei1949 zhanglei1949 left a comment


LGTM

@zhanglei1949 zhanglei1949 requested a review from liulx20 March 11, 2026 09:23
Committed-by: xiaolei.zl from Dev container
Committed-by: xiaolei.zl from Dev container

Development

Successfully merging this pull request may close these issues.

[Feature] Add VARCHAR(max_length) to type system for user-specified variable-length string limit

2 participants