[Feature] Pass extracted text to the LLM instead of the full HTML by hjham0856 · Pull Request #23 · TermLens/TermLens_BE

hjham0856 · 2025-11-14T09:30:12Z

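Changes

Added the markdownify package as a dependency.
Changed the flow so that the HTML is converted to markdown before being passed to the LLM, instead of passing the full HTML as before.
Added logging so that the lengths of the original and markdown text, and the reduction rate, can be checked in CloudWatch.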
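
The length and reduction figures listed above could be computed roughly as follows. This is a minimal sketch and not the PR's actual code: `reduction_metrics` is a hypothetical helper name, the zero-length fallback is an assumption, and in the Lambda handler the converted text would come from markdownify rather than from a hand-written string.

```python
def reduction_metrics(original: str, converted: str) -> dict:
    """Return UTF-8 byte lengths of both texts and the percentage reduction."""
    original_length = len(original.encode("utf-8"))
    converted_length = len(converted.encode("utf-8"))
    if original_length == 0:
        # Assumption: report 0% instead of dividing by zero on empty input.
        reduction = 0.0
    else:
        reduction = (1 - converted_length / original_length) * 100
    return {
        "original_length": original_length,
        "converted_length": converted_length,
        "reduction_percent": round(reduction, 2),  # two decimal places
    }


html = "<h1>Terms</h1><p>Be nice.</p>"
markdown = "# Terms\n\nBe nice."  # hand-written stand-in for converter output
metrics = reduction_metrics(html, markdown)

# Anything printed inside a Lambda function ends up in CloudWatch Logs.
print(
    f"original={metrics['original_length']}B "
    f"markdown={metrics['converted_length']}B "
    f"reduction={metrics['reduction_percent']}%"
)
```

Measuring bytes rather than characters matters for Korean terms of service, where each Hangul syllable takes three bytes in UTF-8.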

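Notes

Extracting only the text instead of converting to markdown was also considered. However, the resulting lengths did not differ much, and markdown, which retains some structure and formatting, was judged to be advantageous for LLM performance, so markdown conversion was chosen.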
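
To illustrate the trade-off described above, here is a sketch of a text-only extraction next to a markdown rendering of the same fragment. The PR does not say how text-only extraction was tried, so `TextExtractor` and `extract_text` are hypothetical names built on the standard library's `html.parser`, and the markdown string is written by hand rather than produced by markdownify.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<h2>Refunds</h2><ul><li>Within 7 days</li><li>No fees</li></ul>"
    "<script>track()</script>"
)
plain = extract_text(sample)
# Hand-written markdown equivalent, for comparison only.
markdown_like = "## Refunds\n\n* Within 7 days\n* No fees"

print(len(plain.encode("utf-8")), len(markdown_like.encode("utf-8")))
```

The plain-text version drops the heading and list markers, while the markdown version keeps them at the cost of a few extra bytes, which matches the reasoning given for choosing markdown.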

Copilot

Pull Request Overview

This PR converts HTML content to markdown format before passing it to the LLM, instead of sending raw HTML. This change aims to reduce the input size while preserving structure for better LLM performance.

Added markdownify package dependency for HTML-to-markdown conversion
Converted HTML body to markdown using markdownify before LLM processing
Added logging to track original HTML length, markdown length, and reduction percentage in CloudWatch

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File	Description
src/lambda_function.py	Implements markdown conversion and adds length/reduction metrics logging
src/tos_summarize.py	Updates parameter name and system instruction to reflect markdown input instead of HTML
requirements.txt	Adds markdownify dependency

Copilot · 2025-11-14T09:32:28Z

+    tos_content = md(event['body'])
+
+    # Compute the lengths and the reduction rate in bytes
+    original_length = len(event['body'].encode('utf-8'))


Potential division by zero error if event['body'] is an empty string (though this is checked earlier at line 21, it still passes if body is whitespace-only). While the current validation prevents empty strings, consider the edge case where original_length could be 0 after encoding.

Suggested change

-    original_length = len(event['body'].encode('utf-8'))
+    original_length = len(event['body'].encode('utf-8'))
+    if original_length == 0:
+        return {
+            'statusCode': 400,
+            'body': json.dumps({
+                # "No terms to analyze. (body is empty or invalid)"
+                'error': '분석할 약관이 없습니다. (body가 비어있거나 유효하지 않습니다)'
+            }, ensure_ascii=False)
+        }
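
A whitespace-only body has a non-zero byte length, so a guard on `original_length == 0` alone would not catch the case mentioned in the comment. One possible way to cover both cases is to strip the body before checking it. The sketch below is an assumption about how that could look, not code from the PR: `validate_body` is a hypothetical helper, and the error message is reused from the suggestion.

```python
import json


def validate_body(event: dict):
    """Return a 400 response if the body is missing, empty, or whitespace-only.

    Returns None when the body looks usable.
    """
    body = event.get("body")
    if not isinstance(body, str) or not body.strip():
        return {
            "statusCode": 400,
            "body": json.dumps(
                # "No terms to analyze. (body is empty or invalid)"
                {"error": "분석할 약관이 없습니다. (body가 비어있거나 유효하지 않습니다)"},
                ensure_ascii=False,
            ),
        }
    return None


# Usage inside a handler: bail out early if validation fails.
error_response = validate_body({"body": "   \n"})
if error_response is not None:
    print(error_response["statusCode"])
```

Running the check before any conversion means the later reduction-rate division never sees a zero-length original.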

hjham0856 and others added 6 commits November 14, 2025 17:18

chore: add the markdownify dependency for HTML-to-markdown conversion

59c4413

feat: convert the HTML in the incoming request to markdown before summarization

3067412

refactor: rename the parameter from text_html to tos_content

34d248e

Merge pull request #21 from TermLens/feature/20-extract-text-from-html

4ec78af

[Feature] Parse the main content from the HTML before passing it to the LLM

feat: log the lengths of the original HTML and the markdown, plus the reduction rate, instead of the full markdown parsing result

d57dac1

Changed because the previous version was far too hard to read in CloudWatch.

Merge pull request #22 from TermLens/feature/20-extract-text-from-html

355142c

Log the original and converted lengths and the reduction rate instead of the converted markdown content

hjham0856 requested a review from Copilot November 14, 2025 09:30

hjham0856 self-assigned this Nov 14, 2025

hjham0856 added the enhancement New feature or request label Nov 14, 2025

Copilot started reviewing on behalf of hjham0856 November 14, 2025 09:30

Copilot finished reviewing on behalf of hjham0856 November 14, 2025 09:31

Copilot AI reviewed Nov 14, 2025

hjham0856 and others added 2 commits November 14, 2025 18:49

chore: update the old variable name text_html in a comment to the current name tos_content

e41ee31

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat: limit the excessive number of decimal places

3850d97

Four decimal places is unusual, so this changes it to two. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

hjham0856 merged commit 8f517ce into main Nov 14, 2025
1 check passed

hjham0856 merged 8 commits into main from test-deploy

2 participants
