Skip to content

Conversation

@HNygard
Copy link
Owner

@HNygard HNygard commented Jan 31, 2026

Why: Laminas Mail strictly enforces RFC compliance and throws exceptions when parsing emails with common real-world violations. Norwegian government email servers frequently send emails with:

  • Raw UTF-8 bytes in headers (e.g., 'Lødingen Kommune' in Received headers)
  • Charset mismatches (UTF-8 bytes declared as ISO-8859-1)
  • Malformed encoded-words (missing ?= delimiters)

This required ~720 lines of workaround code to sanitize emails before Laminas could parse them. Zbateson handles all these cases natively, allowing us to delete the workarounds and simplify the email parsing architecture significantly.

Changes:

  • Replace laminas/laminas-mail with zbateson/mail-mime-parser
  • Remove sanitization methods: fixCharsetMismatchInEncodedWords(), sanitizeNonAsciiInHeaderValue(), stripProblematicHeaders(), readLaminasMessage_withErrorHandling(), and debugging helpers
  • Simplify extractContentFromEmail() to use Zbateson's getTextContent() and getHtmlContent() which handle encoding automatically
  • Update tests for new parser behavior

Note

High Risk
Swaps the core email parsing library and removes large amounts of defensive header/charset workaround code, which can change extraction behavior across many real-world emails. Also drops laminas/laminas-mail from dependencies, which may break any remaining call sites still instantiating Laminas\Mail\Storage\Message (e.g., api.php).

Overview
Switches email parsing in ThreadEmailExtractorEmailBody and ImapEmail from Laminas to Zbateson, introducing parseEmail() and using getTextContent()/getHtmlContent() for body extraction while deleting the prior header sanitization/error-isolation helpers.

Updates header/subject/X-Forwarded-For extraction to use Zbateson APIs, changes file.php to print raw headers via getAllHeaders(), and adds/updates PHPUnit coverage to reflect the new parser behavior.

Replaces laminas/laminas-mail with zbateson/mail-mime-parser (and related transitive deps) in Composer.

Written by Cursor Bugbot for commit 5c4f401. This will update automatically on new commits. Configure here.

Why: Laminas Mail strictly enforces RFC compliance and throws exceptions
when parsing emails with common real-world violations. Norwegian government
email servers frequently send emails with:
- Raw UTF-8 bytes in headers (e.g., 'Lødingen Kommune' in Received headers)
- Charset mismatches (UTF-8 bytes declared as ISO-8859-1)
- Malformed encoded-words (missing ?= delimiters)

This required ~720 lines of workaround code to sanitize emails before
Laminas could parse them. Zbateson handles all these cases natively,
allowing us to delete the workarounds and simplify the email parsing
architecture significantly.

Changes:
- Replace laminas/laminas-mail with zbateson/mail-mime-parser
- Remove sanitization methods: fixCharsetMismatchInEncodedWords(),
  sanitizeNonAsciiInHeaderValue(), stripProblematicHeaders(),
  readLaminasMessage_withErrorHandling(), and debugging helpers
- Simplify extractContentFromEmail() to use Zbateson's getTextContent()
  and getHtmlContent() which handle encoding automatically
- Update tests for new parser behavior

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 31, 2026 16:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request replaces the Laminas Mail library with Zbateson mail-mime-parser to handle emails with RFC violations that are common in Norwegian government email systems. The change eliminates approximately 720 lines of workaround code by using a more lenient parser that natively handles malformed encoded-words, charset mismatches, and raw UTF-8 bytes in headers.

Changes:

  • Replaced laminas/laminas-mail dependency with zbateson/mail-mime-parser ^3.0 in composer.json
  • Removed sanitization methods (fixCharsetMismatchInEncodedWords, sanitizeNonAsciiInHeaderValue, stripProblematicHeaders, readLaminasMessage_withErrorHandling) and replaced with simple parseEmail method
  • Simplified email content extraction to use Zbateson's built-in getTextContent() and getHtmlContent() methods
  • Updated all tests to reflect new parser behavior (returns null for missing headers, handles malformed content gracefully)
  • Added comprehensive validation test suite (ZbatesonValidationTest.php) to verify handling of problematic email patterns

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
organizer/src/composer.json Updated dependency from laminas/laminas-mail to zbateson/mail-mime-parser ^3.0
organizer/src/composer.lock Replaced Laminas dependencies with Zbateson and its transitive dependencies (guzzlehttp/psr7, php-di, etc.)
organizer/src/class/Extraction/ThreadEmailExtractorEmailBody.php Removed ~720 lines of sanitization code, simplified to use Zbateson parser with new parseEmail() method
organizer/src/class/Imap/ImapEmail.php Updated getEmailAddresses() and getEmailSubject() to use new parseEmail() and getHeaderValue() API
organizer/src/tests/Imap/ImapEmailTest.php Updated test expectations to match Zbateson's graceful handling (returns empty string for missing subjects)
organizer/src/tests/Extraction/ThreadEmailExtractorEmailBodyTest.php Renamed and updated tests from readLaminasMessage_withErrorHandling to parseEmail, updated assertions for Zbateson API
organizer/src/tests/Extraction/ZbatesonValidationTest.php New comprehensive test suite validating Zbateson handles all problematic email patterns from real-world Norwegian government emails

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 136 to 153
// Zbateson handles all encoding/decoding automatically
$plainText = $message->getTextContent();
$html = $message->getHtmlContent();

// If already valid UTF-8, return as is
if (mb_check_encoding($html, 'UTF-8')) {
return $html;
}
// Clean up extracted content
if ($plainText !== null) {
$email_content->plain_text = self::cleanText(self::fixEncoding($plainText));
} else {
$email_content->plain_text = '';
}

// Try multiple encodings, prioritizing those common in Norwegian content
$encodings = ['ISO-8859-1', 'Windows-1252', 'ISO-8859-15', 'UTF-8'];

foreach ($encodings as $encoding) {
$converted = @mb_convert_encoding($html, 'UTF-8', $encoding);
if (mb_check_encoding($converted, 'UTF-8') && strpos($converted, '?') === false) {
return $converted;
}
}
if ($html !== null) {
$email_content->html = self::convertHtmlToText(self::fixEncoding($html));
} else {
$email_content->html = '';
}

// Force ISO-8859-1 as a last resort
return mb_convert_encoding($html, 'UTF-8', 'ISO-8859-1');
};

$email_content = new ExtractedEmailBody();
if ($message->isMultipart()) {
$plainTextPart = false;
$htmlPart = false;
return $email_content;
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states "Zbateson's getTextContent() and getHtmlContent() which handle encoding automatically", yet the fixEncoding method is still being called on the content after extraction. This suggests either:

  1. The fixEncoding method is still needed for edge cases not handled by Zbateson
  2. The method is redundant and should be removed

If Zbateson truly handles all encoding automatically and always returns UTF-8 content, then fixEncoding could be redundant. Consider testing whether this additional encoding fix is still necessary, or remove it to simplify the code further as stated in the PR objectives.

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by Claude in new commit.

Comment on lines 128 to 131
$email_content = new ExtractedEmailBody();
$email_content->plain_text = "ERROR\n\n".$eml;
$email_content->html = '<pre>' . jTraceEx($e) . '</pre>';
return $email_content;
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing fails, the error handling returns the entire raw EML content in the plain_text field with "ERROR\n\n" prefix. This could expose sensitive information in error logs or debug outputs. Consider truncating the EML content or only including a hash/size indicator instead of the full raw email content.

Copilot uses AI. Check for mistakes.
Comment on lines 41 to 62
public function testMalformedEncodedWord_MissingClosingDelimiter(): void {
// This pattern causes issues - encoded word missing ?= delimiter
// before another header starts on the same line
$email = "From: sender@example.com\r\n" .
"To: recipient@example.com\r\n" .
"Subject: =?iso-8859-1?Q?SV:_Klage_p=E5_m=E5lrettet?= =?iso-8859-1?Q?_utestengelse?Thread-Topic: test\r\n" .
"Content-Type: text/plain\r\n" .
"\r\n" .
"Test body";

$zbatesonMessage = $this->parseWithZbateson($email);

// Zbateson should successfully parse this
$this->assertNotNull($zbatesonMessage, "Zbateson should parse the email");

// Check if subject is accessible
$subject = $zbatesonMessage->getHeaderValue('subject');
$this->assertNotNull($subject, "Subject header should be accessible");

// Document what zbateson actually returns for this case
echo "\n[Malformed encoded-word] Zbateson Subject: " . var_export($subject, true) . "\n";
}
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses echo statements to output debugging information during test execution (lines 61, 78, 102, 106, 126-129, etc.). While this can be useful for development and validation, these echo statements should either be removed before merging to production or replaced with assertions that verify the expected behavior. Currently, these tests don't assert specific values for the parsed content, only that parsing succeeds. Consider adding assertions for the expected decoded values (e.g., verifying that "SV: Klage på målrettet utestengelse" is correctly decoded).

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by Claude in new commit.

Comment on lines 141 to 143
$x_forwarded_for = $message->getHeaderValue('x-forwarded-for');
if ($x_forwarded_for !== null) {
$addresses[] = $x_forwarded_for;
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original Laminas implementation checked if X-Forwarded-For returned an ArrayIterator to handle multiple instances of the header. The new implementation using getHeaderValue() returns a single string value. If an email has multiple X-Forwarded-For headers, only one may be returned. Consider checking the Zbateson documentation to determine if getHeaderValue() returns all header values concatenated, or only the first one. If it only returns the first value, and multiple values are needed, you may need to use a different method to access all X-Forwarded-For headers.

Suggested change
$x_forwarded_for = $message->getHeaderValue('x-forwarded-for');
if ($x_forwarded_for !== null) {
$addresses[] = $x_forwarded_for;
$xForwardedForHeaders = $message->getAllHeadersByName('x-forwarded-for');
if (!empty($xForwardedForHeaders)) {
foreach ($xForwardedForHeaders as $header) {
$value = $header->getValue();
if ($value !== null && $value !== '') {
$addresses[] = $value;
}
}

Copilot uses AI. Check for mistakes.
Comment on lines +101 to +110
/**
* Parse raw email content using Zbateson mail-mime-parser
*
* @param string $eml Raw email content
* @return Message Parsed message object
*/
public static function parseEmail(string $eml): Message {
$parser = new MailMimeParser();
return $parser->parse($eml, false);
}
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parseEmail method doesn't have any explicit error handling. If the Zbateson parser throws an exception during parsing, it will propagate up to the caller. While this is handled in extractContentFromEmail (lines 123-132), consider whether parseEmail should document what exceptions it might throw, or add specific error handling if there are known failure modes that should be handled differently.

Copilot uses AI. Check for mistakes.
HNygard and others added 6 commits January 31, 2026 18:10
The tests referenced methods and classes that were removed:
- stripProblematicHeaders() was a workaround method no longer needed
- Laminas\Mail classes are no longer installed

Also removed tests depending on external email files not in CI.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The migration to Zbateson used getHeaderValue() which only returns the
first header value. Changed to getAllHeadersByName() to capture all
X-Forwarded-For headers when multiple are present.

Added tests for single, multiple, and no X-Forwarded-For header cases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Zbateson's getValue() on address headers returns only the email address
without the name and angle brackets. Changed to getRawValue() to preserve
the original header format for proper HTML escaping.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

"require": {
"phpmailer/phpmailer": "^6.10",
"laminas/laminas-mail": "2.25.1",
"zbateson/mail-mime-parser": "^3.0",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete migration leaves api.php broken

High Severity

The laminas/laminas-mail dependency is removed from composer.json, but api.php still imports and uses Laminas\Mail\Storage\Message directly on lines 5 and 39. When this file is executed, PHP will throw a fatal "Class not found" error because the Laminas library is no longer installed. The file needs to be updated to use ThreadEmailExtractorEmailBody::parseEmail() like the other migrated files.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants