Open
Labels
🐞 Bug (Something isn't working), 📌 Root caused (identified the root cause of bug)
Description
crawl4ai version
0.8.0 (also confirmed in 0.7.4)
Expected Behavior
Markdown tables should include leading and trailing pipe (`|`) delimiters on every row for maximum compatibility with markdown parsers. Omitting the outer pipes is technically valid GitHub Flavored Markdown (GFM), but it causes parsing issues with many markdown processors and goes against the recommended best practice for compatibility.
Current Behavior
Tables are generated without leading and trailing pipes:
Parameter | Guideline | Common Sources | Health Considerations | Comments
---|---|---|---|---
Arsenic (2006) | 0.010 ALARA | Naturally occurring | Cancer risk | Keep as low as possible
Issues this causes:
- Many markdown parsers fail to render these tables correctly (see showdownjs/showdown#230, "Incorrect parsing of GFM tables with missing leading/trailing pipes")
- Static site generators (Jekyll, Hugo) have inconsistent rendering
- Some IDEs and markdown previewers don't recognize the table structure
- LLM processing may have reduced accuracy with inconsistent formatting
- Potential data loss when parsers misalign columns
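Until the generated markdown includes outer pipes, one possible workaround is to post-process it. The following is a minimal sketch, not crawl4ai's own fix: `SEPARATOR`, `wrap_row`, and `add_outer_pipes` are hypothetical names introduced here, and the code assumes that a table is a header line followed by a delimiter row such as `---|---|---` and then by consecutive lines that contain a pipe.

```python
import re

# Matches a GFM delimiter row with two or more columns,
# e.g. "---|---|---" or "|:--|--:|". A lone "---" (horizontal rule) does not match.
SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def wrap_row(line: str) -> str:
    """Add a leading and a trailing pipe to one table row if either is missing."""
    row = line.strip()
    # Delimiter rows are wrapped without padding: "---|---" becomes "|---|---|".
    pad = "" if SEPARATOR.match(row) else " "
    if not row.startswith("|"):
        row = "|" + pad + row
    # A trailing escaped pipe (\|) is cell content, not a closing delimiter.
    if not row.endswith("|") or row.endswith("\\|"):
        row = row + pad + "|"
    return row


def add_outer_pipes(markdown: str) -> str:
    """Return `markdown` with outer pipes added to every detected pipe table."""
    lines = markdown.split("\n")
    out = list(lines)
    for i, line in enumerate(lines):
        # A table starts where a delimiter row follows a header line with a pipe.
        if i == 0 or not SEPARATOR.match(line) or "|" not in lines[i - 1]:
            continue
        out[i - 1] = wrap_row(lines[i - 1])
        j = i
        # The delimiter row and body rows run until the first line without a pipe.
        while j < len(lines) and "|" in lines[j]:
            out[j] = wrap_row(lines[j])
            j += 1
    return "\n".join(out)


sample = (
    "Parameter | Guideline | Comments\n"
    "---|---|---\n"
    "Arsenic (2006) | 0.010 ALARA | Keep as low as possible"
)
print(add_outer_pipes(sample))
```

Lines outside a detected table are left untouched, so prose that happens to contain a pipe is not rewritten. One known limitation of this sketch: a row whose first or last cell is empty already starts or ends with `|` after stripping, so it is not wrapped and may still misalign.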
Is this reproducible?
Yes
Inputs Causing the Bug
Any webpage containing HTML `<table>` elements will produce markdown tables with missing outer pipes.
**Test URL used:**
https://www.canada.ca/en/health-canada/services/environmental-workplace-health/reports-publications/water-quality/guidelines-canadian-drinking-water-quality-summary-table.html
This page contains multiple tables with 5-7 columns and 100+ rows. All tables exhibit the issue.
Steps to Reproduce
1. Install crawl4ai 0.8.0
2. Crawl any webpage with HTML tables
3. Examine the generated markdown
4. Observe that table rows do not start/end with `|`
**Specific test:**
- Crawl the Canadian drinking water quality page
- Check Table 2 (Chemical Parameters) - 97 rows, all missing outer pipes
- Verify separator row format: `---|---|---|---|---` instead of `|---|---|---|---|---|`
Code snippets
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode


async def test_table_format():
    """Test script demonstrating missing outer pipes in markdown tables"""
    url = "https://www.canada.ca/en/health-canada/services/environmental-workplace-health/reports-publications/water-quality/guidelines-canadian-drinking-water-quality-summary-table.html"
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_until="networkidle",
        delay_before_return_html=2.0,
        excluded_tags=['nav', 'header', 'footer', 'form'],
        word_count_threshold=10
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=run_config)

        # Analyze table format: any non-empty line containing a pipe counts as a table row
        lines = result.markdown.split('\n')
        table_rows_found = 0
        rows_missing_pipes = 0
        for i, line in enumerate(lines):
            if '|' in line and line.strip():
                table_rows_found += 1
                starts_with_pipe = line.strip().startswith('|')
                ends_with_pipe = line.strip().endswith('|')
                if not starts_with_pipe or not ends_with_pipe:
                    rows_missing_pipes += 1
                    # Print first few examples
                    if rows_missing_pipes <= 3:
                        print(f"\nLine {i}: {repr(line[:100])}")
                        print(f"  Starts with |: {starts_with_pipe}")
                        print(f"  Ends with |: {ends_with_pipe}")

        print("\n\nSummary:")
        print(f"  Total table rows: {table_rows_found}")
        print(f"  Rows missing outer pipes: {rows_missing_pipes}")
        # Avoid a division by zero when the page yields no table rows
        if table_rows_found:
            print(f"  Percentage affected: {rows_missing_pipes/table_rows_found*100:.1f}%")


asyncio.run(test_table_format())
OS
Mac OS
Python version
3.13.11
Browser
n/a
Browser version
n/a
Error logs & Screenshots (if applicable)
No response