[Bug]: Markdown tables missing outer pipe delimiters - causes parser compatibility issues #1731

@PatD42

Description

crawl4ai version

0.8.0 (also confirmed in 0.7.4)

Expected Behavior

Markdown tables should include leading and trailing pipe (`|`) delimiters on every row for maximum compatibility with markdown parsers. Omitting the outer pipes is technically valid GitHub Flavored Markdown (GFM), but it causes parsing issues with many markdown processors and goes against the recommended practice for compatibility.

Current Behavior

Parameter | Guideline | Common Sources | Health Considerations | Comments
---|---|---|---|---
Arsenic (2006) | 0.010 ALARA | Naturally occurring | Cancer risk | Keep as low as possible

Issues this causes:

  1. Many markdown parsers fail to render these tables correctly (see showdownjs/showdown#230, "Incorrect parsing of GFM tables with missing leading/trailing pipes")
  2. Static site generators (Jekyll, Hugo) have inconsistent rendering
  3. Some IDEs and markdown previewers don't recognize the table structure
  4. LLM processing may have reduced accuracy with inconsistent formatting
  5. Potential data loss when parsers misalign columns
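Item 5 in the list above can be illustrated with a minimal sketch. `split_row` is a hypothetical, simplified row splitter that treats outer pipes as optional, as GFM does; it is not taken from any real parser. The sample rows are assumptions about how a row with an empty first cell might be serialized, not captured crawl4ai output. Without outer pipes, the pipe that follows the empty cell is read as an optional leading delimiter, so the row comes out one column short and the remaining values shift left.

```python
def split_row(row: str) -> list[str]:
    """Split a table row into cells, treating outer pipes as optional."""
    text = row.strip()
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    return [cell.strip() for cell in text.split("|")]


header = ["Parameter", "Guideline", "Comments"]

# Hypothetical rows whose first cell (Parameter) is empty.
without_outer = " | 0.010 | Keep as low as possible"
with_outer = "|  | 0.010 | Keep as low as possible |"

for label, row in (("without outer pipes", without_outer),
                   ("with outer pipes", with_outer)):
    cells = split_row(row)
    print(f"{label}: {len(cells)} cells for {len(header)} columns -> {cells}")
```

With explicit outer pipes the empty cell survives as an empty string, so the cell count matches the header.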

Is this reproducible?

Yes

Inputs Causing the Bug

Any webpage containing HTML `<table>` elements will produce markdown tables with missing outer pipes.

**Test URL used:**
https://www.canada.ca/en/health-canada/services/environmental-workplace-health/reports-publications/water-quality/guidelines-canadian-drinking-water-quality-summary-table.html

This page contains multiple tables with 5-7 columns and 100+ rows. All tables exhibit the issue.

Steps to Reproduce

1. Install crawl4ai 0.8.0
2. Crawl any webpage with HTML tables
3. Examine the generated markdown
4. Observe that table rows do not start/end with `|`

**Specific test:**
- Crawl the Canadian drinking water quality page
- Check Table 2 (Chemical Parameters) - 97 rows, all missing outer pipes
- Verify separator row format: `---|---|---|---|---` instead of `|---|---|---|---|---|`

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def test_table_format():
    """Test script demonstrating missing outer pipes in markdown tables"""

    url = "https://www.canada.ca/en/health-canada/services/environmental-workplace-health/reports-publications/water-quality/guidelines-canadian-drinking-water-quality-summary-table.html"

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_until="networkidle",
        delay_before_return_html=2.0,
        excluded_tags=['nav', 'header', 'footer', 'form'],
        word_count_threshold=10
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=run_config)

        # Analyze table format
        lines = result.markdown.split('\n')
        table_rows_found = 0
        rows_missing_pipes = 0

        for i, line in enumerate(lines):
            stripped = line.strip()
            # Heuristic: any non-empty line containing a pipe counts as a table row
            if '|' in stripped:
                table_rows_found += 1
                starts_with_pipe = stripped.startswith('|')
                ends_with_pipe = stripped.endswith('|')

                if not starts_with_pipe or not ends_with_pipe:
                    rows_missing_pipes += 1

                    # Print first few examples
                    if rows_missing_pipes <= 3:
                        print(f"\nLine {i}: {repr(line[:100])}")
                        print(f"  Starts with |: {starts_with_pipe}")
                        print(f"  Ends with |: {ends_with_pipe}")

        print("\n\nSummary:")
        print(f"  Total table rows: {table_rows_found}")
        print(f"  Rows missing outer pipes: {rows_missing_pipes}")
        # Guard against division by zero when the page yields no table rows
        if table_rows_found:
            print(f"  Percentage affected: {rows_missing_pipes/table_rows_found*100:.1f}%")

asyncio.run(test_table_format())
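
Until the generator emits outer pipes itself, the markdown can be post-processed. The following is a minimal sketch of such a workaround, not a fix inside crawl4ai. `normalize_tables`, `_wrap`, and `SEPARATOR` are hypothetical names introduced here. The table detection assumes every table has a separator row such as `---|---|---` directly below its header, as described in the steps to reproduce. Escaped pipes, pipes inside code spans, and rows whose first or last cell is empty are not handled.

```python
import re

# A separator row: dash cells (optional alignment colons) joined by pipes,
# with or without outer pipes. At least one inner pipe is required, so a
# plain horizontal rule ("---") does not match.
SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def _wrap(row: str) -> str:
    """Add the leading and trailing pipe to one row if it is missing."""
    row = row.strip()
    if not row.startswith("|"):
        row = "| " + row
    if not row.endswith("|"):
        row = row + " |"
    return row


def normalize_tables(markdown: str) -> str:
    """Add outer pipes to every row of each table found in `markdown`."""
    lines = markdown.split("\n")
    out = []
    in_table = False
    for i, line in enumerate(lines):
        next_is_sep = i + 1 < len(lines) and SEPARATOR.match(lines[i + 1])
        if not in_table and "|" in line and next_is_sep:
            in_table = True   # header row: the line right above a separator
        elif in_table and "|" not in line:
            in_table = False  # first line without a pipe ends the table
        out.append(_wrap(line) if in_table else line)
    return "\n".join(out)


sample = (
    "Intro text\n\n"
    "Parameter | Guideline\n"
    "---|---\n"
    "Arsenic (2006) | 0.010\n\n"
    "Use a | b in prose."
)
print(normalize_tables(sample))
```

In the test script above, the same function could be applied as `normalize_tables(str(result.markdown))`. Prose lines that merely contain a pipe are left untouched because they are not preceded by a header and separator row.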

OS

Mac OS

Python version

3.13.11

Browser

n/a

Browser version

n/a

Error logs & Screenshots (if applicable)

No response

Metadata

Labels

- 🐞 Bug: Something isn't working
- 📌 Root caused: identified the root cause of bug
