-
Notifications
You must be signed in to change notification settings - Fork 0
Replace Laminas Mail with Zbateson mail-mime-parser #161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Why: Laminas Mail strictly enforces RFC compliance and throws exceptions when parsing emails with common real-world violations. Norwegian government email servers frequently send emails with: - Raw UTF-8 bytes in headers (e.g., 'Lødingen Kommune' in Received headers) - Charset mismatches (UTF-8 bytes declared as ISO-8859-1) - Malformed encoded-words (missing ?= delimiters) This required ~720 lines of workaround code to sanitize emails before Laminas could parse them. Zbateson handles all these cases natively, allowing us to delete the workarounds and simplify the email parsing architecture significantly. Changes: - Replace laminas/laminas-mail with zbateson/mail-mime-parser - Remove sanitization methods: fixCharsetMismatchInEncodedWords(), sanitizeNonAsciiInHeaderValue(), stripProblematicHeaders(), readLaminasMessage_withErrorHandling(), and debugging helpers - Simplify extractContentFromEmail() to use Zbateson's getTextContent() and getHtmlContent() which handle encoding automatically - Update tests for new parser behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request replaces the Laminas Mail library with Zbateson mail-mime-parser to handle emails with RFC violations that are common in Norwegian government email systems. The change eliminates approximately 720 lines of workaround code by using a more lenient parser that natively handles malformed encoded-words, charset mismatches, and raw UTF-8 bytes in headers.
Changes:
- Replaced
laminas/laminas-maildependency withzbateson/mail-mime-parser ^3.0in composer.json - Removed sanitization methods (
fixCharsetMismatchInEncodedWords,sanitizeNonAsciiInHeaderValue,stripProblematicHeaders,readLaminasMessage_withErrorHandling) and replaced with simpleparseEmailmethod - Simplified email content extraction to use Zbateson's built-in
getTextContent()andgetHtmlContent()methods - Updated all tests to reflect new parser behavior (returns null for missing headers, handles malformed content gracefully)
- Added comprehensive validation test suite (
ZbatesonValidationTest.php) to verify handling of problematic email patterns
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| organizer/src/composer.json | Updated dependency from laminas/laminas-mail to zbateson/mail-mime-parser ^3.0 |
| organizer/src/composer.lock | Replaced Laminas dependencies with Zbateson and its transitive dependencies (guzzlehttp/psr7, php-di, etc.) |
| organizer/src/class/Extraction/ThreadEmailExtractorEmailBody.php | Removed ~720 lines of sanitization code, simplified to use Zbateson parser with new parseEmail() method |
| organizer/src/class/Imap/ImapEmail.php | Updated getEmailAddresses() and getEmailSubject() to use new parseEmail() and getHeaderValue() API |
| organizer/src/tests/Imap/ImapEmailTest.php | Updated test expectations to match Zbateson's graceful handling (returns empty string for missing subjects) |
| organizer/src/tests/Extraction/ThreadEmailExtractorEmailBodyTest.php | Renamed and updated tests from readLaminasMessage_withErrorHandling to parseEmail, updated assertions for Zbateson API |
| organizer/src/tests/Extraction/ZbatesonValidationTest.php | New comprehensive test suite validating Zbateson handles all problematic email patterns from real-world Norwegian government emails |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Zbateson handles all encoding/decoding automatically | ||
| $plainText = $message->getTextContent(); | ||
| $html = $message->getHtmlContent(); | ||
|
|
||
| // If already valid UTF-8, return as is | ||
| if (mb_check_encoding($html, 'UTF-8')) { | ||
| return $html; | ||
| } | ||
| // Clean up extracted content | ||
| if ($plainText !== null) { | ||
| $email_content->plain_text = self::cleanText(self::fixEncoding($plainText)); | ||
| } else { | ||
| $email_content->plain_text = ''; | ||
| } | ||
|
|
||
| // Try multiple encodings, prioritizing those common in Norwegian content | ||
| $encodings = ['ISO-8859-1', 'Windows-1252', 'ISO-8859-15', 'UTF-8']; | ||
|
|
||
| foreach ($encodings as $encoding) { | ||
| $converted = @mb_convert_encoding($html, 'UTF-8', $encoding); | ||
| if (mb_check_encoding($converted, 'UTF-8') && strpos($converted, '?') === false) { | ||
| return $converted; | ||
| } | ||
| } | ||
| if ($html !== null) { | ||
| $email_content->html = self::convertHtmlToText(self::fixEncoding($html)); | ||
| } else { | ||
| $email_content->html = ''; | ||
| } | ||
|
|
||
| // Force ISO-8859-1 as a last resort | ||
| return mb_convert_encoding($html, 'UTF-8', 'ISO-8859-1'); | ||
| }; | ||
|
|
||
| $email_content = new ExtractedEmailBody(); | ||
| if ($message->isMultipart()) { | ||
| $plainTextPart = false; | ||
| $htmlPart = false; | ||
| return $email_content; |
Copilot
AI
Jan 31, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR description states "Zbateson's getTextContent() and getHtmlContent() which handle encoding automatically", yet the fixEncoding method is still being called on the content after extraction. This suggests either:
- The fixEncoding method is still needed for edge cases not handled by Zbateson
- The method is redundant and should be removed
If Zbateson truly handles all encoding automatically and always returns UTF-8 content, then fixEncoding could be redundant. Consider testing whether this additional encoding fix is still necessary, or remove it to simplify the code further as stated in the PR objectives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by Claude in new commit.
| $email_content = new ExtractedEmailBody(); | ||
| $email_content->plain_text = "ERROR\n\n".$eml; | ||
| $email_content->html = '<pre>' . jTraceEx($e) . '</pre>'; | ||
| return $email_content; |
Copilot
AI
Jan 31, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing fails, the error handling returns the entire raw EML content in the plain_text field with "ERROR\n\n" prefix. This could expose sensitive information in error logs or debug outputs. Consider truncating the EML content or only including a hash/size indicator instead of the full raw email content.
| public function testMalformedEncodedWord_MissingClosingDelimiter(): void { | ||
| // This pattern causes issues - encoded word missing ?= delimiter | ||
| // before another header starts on the same line | ||
| $email = "From: sender@example.com\r\n" . | ||
| "To: recipient@example.com\r\n" . | ||
| "Subject: =?iso-8859-1?Q?SV:_Klage_p=E5_m=E5lrettet?= =?iso-8859-1?Q?_utestengelse?Thread-Topic: test\r\n" . | ||
| "Content-Type: text/plain\r\n" . | ||
| "\r\n" . | ||
| "Test body"; | ||
|
|
||
| $zbatesonMessage = $this->parseWithZbateson($email); | ||
|
|
||
| // Zbateson should successfully parse this | ||
| $this->assertNotNull($zbatesonMessage, "Zbateson should parse the email"); | ||
|
|
||
| // Check if subject is accessible | ||
| $subject = $zbatesonMessage->getHeaderValue('subject'); | ||
| $this->assertNotNull($subject, "Subject header should be accessible"); | ||
|
|
||
| // Document what zbateson actually returns for this case | ||
| echo "\n[Malformed encoded-word] Zbateson Subject: " . var_export($subject, true) . "\n"; | ||
| } |
Copilot
AI
Jan 31, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test uses echo statements to output debugging information during test execution (lines 61, 78, 102, 106, 126-129, etc.). While this can be useful for development and validation, these echo statements should either be removed before merging to production or replaced with assertions that verify the expected behavior. Currently, these tests don't assert specific values for the parsed content, only that parsing succeeds. Consider adding assertions for the expected decoded values (e.g., verifying that "SV: Klage på målrettet utestengelse" is correctly decoded).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by Claude in new commit.
| $x_forwarded_for = $message->getHeaderValue('x-forwarded-for'); | ||
| if ($x_forwarded_for !== null) { | ||
| $addresses[] = $x_forwarded_for; |
Copilot
AI
Jan 31, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original Laminas implementation checked if X-Forwarded-For returned an ArrayIterator to handle multiple instances of the header. The new implementation using getHeaderValue() returns a single string value. If an email has multiple X-Forwarded-For headers, only one may be returned. Consider checking the Zbateson documentation to determine if getHeaderValue() returns all header values concatenated, or only the first one. If it only returns the first value, and multiple values are needed, you may need to use a different method to access all X-Forwarded-For headers.
| $x_forwarded_for = $message->getHeaderValue('x-forwarded-for'); | |
| if ($x_forwarded_for !== null) { | |
| $addresses[] = $x_forwarded_for; | |
| $xForwardedForHeaders = $message->getAllHeadersByName('x-forwarded-for'); | |
| if (!empty($xForwardedForHeaders)) { | |
| foreach ($xForwardedForHeaders as $header) { | |
| $value = $header->getValue(); | |
| if ($value !== null && $value !== '') { | |
| $addresses[] = $value; | |
| } | |
| } |
| /** | ||
| * Parse raw email content using Zbateson mail-mime-parser | ||
| * | ||
| * @param string $eml Raw email content | ||
| * @return Message Parsed message object | ||
| */ | ||
| public static function parseEmail(string $eml): Message { | ||
| $parser = new MailMimeParser(); | ||
| return $parser->parse($eml, false); | ||
| } |
Copilot
AI
Jan 31, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parseEmail method doesn't have any explicit error handling. If the Zbateson parser throws an exception during parsing, it will propagate up to the caller. While this is handled in extractContentFromEmail (lines 123-132), consider whether parseEmail should document what exceptions it might throw, or add specific error handling if there are known failure modes that should be handled differently.
The tests referenced methods and classes that were removed: - stripProblematicHeaders() was a workaround method no longer needed - Laminas\Mail classes are no longer installed Also removed tests depending on external email files not in CI. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The migration to Zbateson used getHeaderValue() which only returns the first header value. Changed to getAllHeadersByName() to capture all X-Forwarded-For headers when multiple are present. Added tests for single, multiple, and no X-Forwarded-For header cases. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Zbateson's getValue() on address headers returns only the email address without the name and angle brackets. Changed to getRawValue() to preserve the original header format for proper HTML escaping. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| "require": { | ||
| "phpmailer/phpmailer": "^6.10", | ||
| "laminas/laminas-mail": "2.25.1", | ||
| "zbateson/mail-mime-parser": "^3.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete migration leaves api.php broken
High Severity
The laminas/laminas-mail dependency is removed from composer.json, but api.php still imports and uses Laminas\Mail\Storage\Message directly on lines 5 and 39. When this file is executed, PHP will throw a fatal "Class not found" error because the Laminas library is no longer installed. The file needs to be updated to use ThreadEmailExtractorEmailBody::parseEmail() like the other migrated files.


Why: Laminas Mail strictly enforces RFC compliance and throws exceptions when parsing emails with common real-world violations. Norwegian government email servers frequently send emails with:
This required ~720 lines of workaround code to sanitize emails before Laminas could parse them. Zbateson handles all these cases natively, allowing us to delete the workarounds and simplify the email parsing architecture significantly.
Changes:
Note
High Risk
Swaps the core email parsing library and removes large amounts of defensive header/charset workaround code, which can change extraction behavior across many real-world emails. Also drops
laminas/laminas-mailfrom dependencies, which may break any remaining call sites still instantiatingLaminas\Mail\Storage\Message(e.g.,api.php).Overview
Switches email parsing in
ThreadEmailExtractorEmailBodyandImapEmailfrom Laminas to Zbateson, introducingparseEmail()and usinggetTextContent()/getHtmlContent()for body extraction while deleting the prior header sanitization/error-isolation helpers.Updates header/subject/X-Forwarded-For extraction to use Zbateson APIs, changes
file.phpto print raw headers viagetAllHeaders(), and adds/updates PHPUnit coverage to reflect the new parser behavior.Replaces
laminas/laminas-mailwithzbateson/mail-mime-parser(and related transitive deps) in Composer.Written by Cursor Bugbot for commit 5c4f401. This will update automatically on new commits. Configure here.