Parsing web page requires playwright? #29

@gmgreg

Is Playwright required to parse a web page? The README seems to suggest it is optional and only matters for CSS selector highlighting (e.g. "Note: Without Playwright, CSS selectors still work for text extraction, but no visual highlighting screenshots are generated.")

from attachments import Attachments
ctx = Attachments("https://en.wikipedia.org/wiki/Llama")
print(str(ctx))      # Pretty text view
print(len(ctx.images))  # Number of extracted images

output:

[Attachments] Running primary processor 'webpage_to_llm' for https://en.wikipedia.org/wiki/Llama
[Attachments]   Applying step 'load.url_to_bs4' to https://en.wikipedia.org/wiki/Llama
[Attachments]   Running AdditivePipeline(present.markdown + present.images + present.metadata)
[Attachments]     Applying additive step 'present.markdown' to https://en.wikipedia.org/wiki/Llama
[Attachments]     Applying additive step 'present.images' to https://en.wikipedia.org/wiki/Llama
Processor failed for https://en.wikipedia.org/wiki/Llama: Playwright not available. Install with: pip install playwright && playwright install chromium, falling back to universal pipeline
[Attachments] Applying step 'load.url_to_response' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'modify.morph_to_detected_type' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.url_to_bs4' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.git_repo_to_structure' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.directory_to_structure' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.svg_to_svgdocument' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.eps_to_epsdocument' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.pdf_to_pdfplumber' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.csv_to_pandas' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.image_to_pil' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.html_to_bs4' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.text_to_string' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'load.zip_to_images' to https://en.wikipedia.org/wiki/Llama
[Attachments] Applying step 'modify.pages' to https://en.wikipedia.org/wiki/Llama
[Attachments] Running AdditivePipeline(present.markdown + present.images + present.metadata)
[Attachments]   Applying additive step 'present.markdown' to https://en.wikipedia.org/wiki/Llama
[Attachments]   Applying additive step 'present.images' to https://en.wikipedia.org/wiki/Llama
⚠️ Could not process https://en.wikipedia.org/wiki/Llama: Playwright not available. Install with: pip install playwright && playwright install chromium
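
To confirm that the error above really comes from a missing Playwright install rather than something else, a quick standard-library check can help. This is only a minimal sketch: the helper name `playwright_available` is made up for illustration and is not part of the attachments library, and the check only tells you whether the `playwright` Python package is importable, not whether the Chromium browser was downloaded with `playwright install chromium`.

```python
import importlib.util


def playwright_available() -> bool:
    """Return True if the `playwright` package can be found in this environment.

    Hypothetical helper for illustration; it does not verify that a browser
    binary has been installed via `playwright install chromium`.
    """
    return importlib.util.find_spec("playwright") is not None


if playwright_available():
    print("playwright is importable; the failure may have another cause")
else:
    print("playwright is not installed; this matches the error in the log")
```

If this prints the second message, the log is consistent with the `present.images` step needing Playwright even though the README describes it as optional.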
