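[Feature] Parse the main content from HTML and pass it to the LLM
The previous version was far too hard to read in CloudWatch, so it has been changed.
The log now prints the original length, the converted length, and the reduction rate instead of the markdown-converted content.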
Pull Request Overview
This PR converts HTML content to markdown format before passing it to the LLM, instead of sending raw HTML. This change aims to reduce the input size while preserving structure for better LLM performance.
- Added `markdownify` package dependency for HTML-to-markdown conversion
- Converted HTML body to markdown using `markdownify` before LLM processing
- Added logging to track original HTML length, markdown length, and reduction percentage in CloudWatch
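The length and reduction logging described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the helper names `size_metrics` and `log_size_metrics` are hypothetical, and the converted string is passed in as an argument so the example does not depend on `markdownify` being installed.

```python
import logging

logger = logging.getLogger(__name__)


def size_metrics(original: str, converted: str) -> dict:
    """Return UTF-8 byte lengths and the percentage reduction (hypothetical helper)."""
    original_length = len(original.encode("utf-8"))
    converted_length = len(converted.encode("utf-8"))
    # Guard against an empty original so the division below cannot fail.
    if original_length == 0:
        reduction = 0.0
    else:
        reduction = (1 - converted_length / original_length) * 100
    return {
        "original_length": original_length,
        "converted_length": converted_length,
        "reduction_percent": round(reduction, 2),
    }


def log_size_metrics(original: str, converted: str) -> dict:
    """Log the metrics on one line instead of dumping the converted content."""
    metrics = size_metrics(original, converted)
    logger.info(
        "original=%d bytes, markdown=%d bytes, reduction=%.2f%%",
        metrics["original_length"],
        metrics["converted_length"],
        metrics["reduction_percent"],
    )
    return metrics
```

Lengths are measured in bytes rather than characters, so multi-byte text such as Korean counts at its encoded size.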
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/lambda_function.py | Implements markdown conversion and adds length/reduction metrics logging |
| src/tos_summarize.py | Updates parameter name and system instruction to reflect markdown input instead of HTML |
| requirements.txt | Adds markdownify dependency |
    tos_content = md(event['body'])

    # Compute the lengths and the reduction rate in bytes
    original_length = len(event['body'].encode('utf-8'))
Potential division by zero error if event['body'] is an empty string (though this is checked earlier at line 21, it still passes if body is whitespace-only). While the current validation prevents empty strings, consider the edge case where original_length could be 0 after encoding.
Suggested change
    - original_length = len(event['body'].encode('utf-8'))
    + original_length = len(event['body'].encode('utf-8'))
    + if original_length == 0:
    +     return {
    +         'statusCode': 400,
    +         'body': json.dumps({
    +             'error': 'No terms of service to analyze. (body is empty or invalid)'
    +         }, ensure_ascii=False)
    +     }
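One way to cover the whitespace-only case mentioned in the comment is to validate the body before any length is computed. The sketch below is an assumption rather than the code that was merged: `validate_body` is a hypothetical helper, the `event['body']` shape is taken from the diff, and the error text is an English rendering of the message in the suggestion.

```python
import json


def validate_body(event: dict):
    """Return a 400 response dict if the body is unusable, else None (hypothetical helper)."""
    body = event.get("body")
    # Treat missing, non-string, empty, and whitespace-only bodies the same way.
    if not isinstance(body, str) or not body.strip():
        return {
            "statusCode": 400,
            "body": json.dumps(
                {"error": "No terms of service to analyze. (body is empty or invalid)"},
                ensure_ascii=False,
            ),
        }
    return None
```

A handler could call this first and return the result whenever it is not `None`, which keeps the later reduction-rate division away from a zero-length body.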
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Four digits is uncommon, so this changes it to two digits.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
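Related issues
Changes
Added the `markdownify` package as a dependency.
Other
We also considered extracting only the text instead of converting to markdown. The resulting lengths did not differ much, so we judged that markdown, which keeps some of the structure and formatting, gives the LLM an advantage, and decided to convert to markdown.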
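For comparison, the text-only alternative mentioned above might be sketched with the standard library's `html.parser`. The PR does not say how text extraction was tried, so the `TextExtractor` class and `extract_text` function below are purely illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the plain text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With this approach a heading such as `<h1>Terms</h1>` becomes the bare line `Terms`, so heading levels, lists, and emphasis are lost. That is the structure the markdown conversion is meant to keep.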