[Feature] Pass extracted text to the LLM instead of the full HTML #23

Merged
hjham0856 merged 8 commits into main from test-deploy
Nov 14, 2025

Conversation

@hjham0856
Collaborator

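Related issue

Work done

  • Added the markdownify package as a dependency.
  • Changed the previous approach of passing the full HTML to the LLM so that the HTML is first converted to markdown and then passed.
  • Added output code so that the lengths of the original and markdown text, and the reduction rate, can be checked in CloudWatch.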
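The length and reduction-rate logging described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `reduction_stats` and `log_reduction`, the dictionary keys, and the two-decimal rounding are assumptions made for the example. Lengths are measured in UTF-8 bytes, as in the snippet under review.

```python
import json


def reduction_stats(original: str, converted: str) -> dict:
    """Return UTF-8 byte lengths and the percentage reduction (hypothetical helper)."""
    original_bytes = len(original.encode("utf-8"))
    converted_bytes = len(converted.encode("utf-8"))
    # Guard against an empty original so the ratio never divides by zero.
    if original_bytes == 0:
        reduction = 0.0
    else:
        reduction = (1 - converted_bytes / original_bytes) * 100
    return {
        "original_bytes": original_bytes,
        "converted_bytes": converted_bytes,
        "reduction_percent": round(reduction, 2),
    }


def log_reduction(original: str, converted: str) -> dict:
    """Print the stats as one JSON line (hypothetical helper)."""
    stats = reduction_stats(original, converted)
    # In AWS Lambda, text written to stdout is collected by CloudWatch Logs.
    print(json.dumps(stats))
    return stats
```

In the handler, `converted` would be the result of the HTML-to-markdown step, for example `md(event['body'])`, assuming `md` is imported with `from markdownify import markdownify as md`.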

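Other

We also considered extracting only the text instead of converting to markdown. However, the lengths of the two results did not differ much, and we judged that markdown, which retains some structure and formatting, gives the LLM a performance advantage, so we decided to convert to markdown.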
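For comparison, the text-only alternative mentioned above could look something like the sketch below, which uses only the standard library's `html.parser`. It is an assumption about what "extracting only the text" might mean, not the approach the author tested; `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the plain text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Output of this kind drops headings, lists, and links entirely, which is the structural information the markdown conversion keeps.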

Contributor

Copilot AI left a comment

Pull Request Overview

This PR converts HTML content to markdown format before passing it to the LLM, instead of sending raw HTML. This change aims to reduce the input size while preserving structure for better LLM performance.

  • Added markdownify package dependency for HTML-to-markdown conversion
  • Converted HTML body to markdown using markdownify before LLM processing
  • Added logging to track original HTML length, markdown length, and reduction percentage in CloudWatch

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File | Description
src/lambda_function.py | Implements markdown conversion and adds length/reduction metrics logging
src/tos_summarize.py | Updates parameter name and system instruction to reflect markdown input instead of HTML
requirements.txt | Adds markdownify dependency

Comment thread src/lambda_function.py
tos_content = md(event['body'])

# Compute the lengths and the reduction rate in bytes
original_length = len(event['body'].encode('utf-8'))

Copilot AI Nov 14, 2025

Potential division by zero error if event['body'] is an empty string (though this is checked earlier at line 21, it still passes if body is whitespace-only). While the current validation prevents empty strings, consider the edge case where original_length could be 0 after encoding.

Suggested change
- original_length = len(event['body'].encode('utf-8'))
+ original_length = len(event['body'].encode('utf-8'))
+ if original_length == 0:
+     return {
+         'statusCode': 400,
+         'body': json.dumps({
+             # Message: "There are no terms to analyze. (body is empty or invalid)"
+             'error': '분석할 약관이 없습니다. (body가 비어있거나 유효하지 않습니다)'
+         }, ensure_ascii=False)
+     }

Comment thread src/lambda_function.py Outdated
Comment thread src/lambda_function.py Outdated
hjham0856 and others added 2 commits November 14, 2025 18:49
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Four digits is unusual, so this changes it to two digits.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@hjham0856 hjham0856 merged commit 8f517ce into main Nov 14, 2025
1 check passed
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

[Feature] Extract and use only the core content from the HTML

2 participants