Content-Length header uses character count instead of byte count #604

@meymchen

Component

pygls/protocol/json_rpc.py

Summary

The message-sending code in this module computes the Content-Length header value using len(body). For a Python str, len() returns the number of Unicode code points, not the number of bytes in the UTF-8 encoded payload. When the JSON-RPC payload contains non-ASCII characters (e.g., emoji, CJK text, or localized diagnostic messages), the declared length is smaller than the number of bytes actually transmitted, so the receiving peer misjudges where the message ends.
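The gap between the two counts can be seen in a minimal sketch. The payload below is hypothetical, and the sketch passes ensure_ascii=False so that non-ASCII characters appear literally in the body. With the json.dumps default (ensure_ascii=True), such characters are escaped to ASCII \uXXXX sequences and the two counts coincide, so the mismatch only shows up when non-ASCII text reaches the body unescaped.

```python
import json

# Hypothetical payload, not taken from pygls.
payload = {"label": "测试"}

# ensure_ascii=False keeps "测试" as literal characters in the JSON text.
body = json.dumps(payload, ensure_ascii=False)

char_count = len(body)                  # Unicode code points
byte_count = len(body.encode("utf-8"))  # bytes actually sent on the wire

# Each of the two CJK characters takes 3 bytes in UTF-8 but counts as 1 code point.
print(char_count, byte_count)
```

Here the two CJK characters add four bytes beyond the character count, which is the amount by which a len(body)-based header would under-declare the message.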

Steps to Reproduce

  1. Return a CompletionItem with a label containing non-ASCII characters (e.g., "测试").
  2. Inspect the raw LSP traffic and observe that Content-Length is smaller than the actual UTF-8 payload size.
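  3. The LSP client reads only the declared number of bytes, so the JSON body is truncated and the leftover bytes are treated as the start of the next message. The client then reports a parse error or hangs.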
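The failure on the receiving side can be simulated without a real client. The following is a rough sketch, not pygls or client code: frame and read_body are hypothetical helpers, the header is reduced to Content-Length only, and ensure_ascii=False is assumed so the non-ASCII label is sent unescaped.

```python
import json
from typing import Tuple


def frame(body: str, declared_length: int) -> bytes:
    """Build a message with an explicitly chosen Content-Length value."""
    header = f"Content-Length: {declared_length}\r\n\r\n"
    return header.encode("utf-8") + body.encode("utf-8")


def read_body(stream: bytes) -> Tuple[bytes, bytes]:
    """Read exactly the declared number of bytes, as a conforming peer would."""
    header, _, rest = stream.partition(b"\r\n\r\n")
    declared = int(header.split(b":", 1)[1])
    return rest[:declared], rest[declared:]


body = json.dumps({"label": "测试"}, ensure_ascii=False)

# Declare the character count instead of the byte count.
got, leftover = read_body(frame(body, len(body)))

try:
    json.loads(got)
    parsed_ok = True
except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
    parsed_ok = False

print(parsed_ok, leftover)
```

The reader stops in the middle of the second CJK character, so the body fails to parse, and the remaining bytes stay in the stream where a real peer would try to interpret them as the next header.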

Expected Behavior

Per the LSP Base Protocol specification, which frames each JSON-RPC message with HTTP-style headers, Content-Length must be the exact length in bytes of the content part, which is UTF-8 encoded by default.

Actual Behavior

Content-Length reflects the Unicode character count, violating the LSP specification.

Affected Code (pygls/protocol/json_rpc.py, ~L528-541)

body = json.dumps(data, default=self._serialize_message)
header = (
    f"Content-Length: {len(body)}\r\n"
    ...
)
data = header + body
res = self.writer.write(data.encode(self.CHARSET))

Proposed Fix

body_bytes = body.encode(self.CHARSET)
header = (
    f"Content-Length: {len(body_bytes)}\r\n"
    ...
).encode(self.CHARSET)
self.writer.write(header + body_bytes)
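A self-contained version of the same idea, for checking the framing in isolation, might look like the sketch below. It is not the pygls implementation: encode_message and CHARSET are hypothetical stand-ins, only the Content-Length header is emitted (the elided header lines above are not reproduced), and ensure_ascii=False is assumed so that the non-ASCII case is actually exercised.

```python
import json

CHARSET = "utf-8"  # stand-in for self.CHARSET


def encode_message(data: dict) -> bytes:
    """Frame a JSON payload with a byte-accurate Content-Length header."""
    # Encode first, then measure: the header must count bytes, not characters.
    body_bytes = json.dumps(data, ensure_ascii=False).encode(CHARSET)
    header = f"Content-Length: {len(body_bytes)}\r\n\r\n".encode(CHARSET)
    return header + body_bytes


message = encode_message({"label": "测试"})

# Round-trip check: the declared length matches the bytes after the header.
raw_header, _, raw_body = message.partition(b"\r\n\r\n")
declared = int(raw_header.split(b":", 1)[1])
decoded = json.loads(raw_body[:declared])
```

Encoding the body once and reusing the bytes for both the length and the write also avoids encoding the payload twice.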
