HTML Body Content Cleaner

This Python script fetches an HTML page from a given URL, cleans the content inside the <body> tag, and saves the result to a local file. It is meant for extracting clean, readable HTML by removing scripts, styling, attributes, and layout-only elements.

Features

Only processes content inside the <body> tag
Removes scripts, styles, SVGs, navs, and titles
Strips all attributes (like class, id, style)
Unwraps layout elements (div, section, etc.) to flatten content
Removes HTML comments
Outputs simplified, readable HTML that keeps the original semantic tags
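
The steps listed above could be sketched roughly as follows with BeautifulSoup. This is an illustration, not the code in html_textonly.py: the function name clean_body and the two tag lists are assumptions, and the real script may remove or unwrap a different set of elements.

```python
from bs4 import BeautifulSoup, Comment

# Assumed tag sets for illustration; the actual script may use different lists.
REMOVE_TAGS = ["script", "style", "svg", "nav", "title"]
UNWRAP_TAGS = ["div", "section", "span", "article", "header", "footer", "main"]


def clean_body(html: str) -> str:
    """Return a simplified version of the <body> content of `html`."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup  # fall back to the whole document if there is no <body>

    # Drop unwanted elements together with everything inside them.
    tag = body.find(REMOVE_TAGS)
    while tag is not None:
        tag.decompose()
        tag = body.find(REMOVE_TAGS)

    # Remove HTML comments.
    for comment in body.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Strip every attribute, then unwrap layout-only containers so that
    # their children move up one level.
    for tag in body.find_all(True):
        tag.attrs = {}
        if tag.name in UNWRAP_TAGS:
            tag.unwrap()

    return body.decode_contents().strip()
```

Calling clean_body on a page whose body is a div wrapping a paragraph, a script, and a nav block would, under these assumptions, return only the paragraph markup with its attributes removed.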

Usage

Requirements

Python 3.x
requests
beautifulsoup4

Install dependencies:

pip install requests beautifulsoup4

Run the script

Set the url and output_file variables in the script, or supply them at run time. Then run:

python html_textonly.py
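
One possible way to supply both values at run time is a small argparse wrapper. The sketch below is hypothetical: the parse_args helper, the positional url argument, and the -o/--output-file option are assumptions and are not confirmed to exist in html_textonly.py.

```python
import argparse


def parse_args(argv=None):
    """Parse the page URL and the output path from a list of arguments."""
    parser = argparse.ArgumentParser(
        description="Fetch a page and save its cleaned body content."
    )
    parser.add_argument("url", help="address of the HTML page to fetch")
    parser.add_argument(
        "-o", "--output-file",
        default="cleaned-text.html",  # assumed default, taken from the example
        help="path of the file to write (default: %(default)s)",
    )
    # argv=None makes argparse read sys.argv[1:], as a real CLI would.
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["https://example.com", "--output-file", "cleaned-text.html"])
print(args.url, args.output_file)
```

With a wrapper like this, the values would come from the command line instead of being edited in the source file each time.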

Example

If url = "https://example.com" and output_file = "cleaned-text.html", the resulting file will contain all cleaned content inside the <body> tag, without any layout wrappers or extraneous HTML attributes.

License

MIT

