Scrawler

A modern, schema-based web scraping library for PHP with powerful transformers and a clean, intuitive syntax. Perfect for both manual use and API integration.

Features

Intuitive Schema Syntax: Easy to write by hand and by AI
Built-in Transformers: 20+ transformers for data manipulation (trim, float, int, upper, lower, etc.)
Flexible Lists: Support for limit and offset
JSON-Friendly: Perfect for API usage
Type-Safe: Full PHPStan max level compliance
Clean Architecture: SOLID principles, no anti-patterns
Well-Tested: 47 tests, 107 assertions

Installation

composer require fozbek/scrawler

Quick Start

use Scrawler\Bootstrap;
use Scrawler\Scrawler;

// Handle PHP 8.4 deprecation warnings from vendor libraries (optional)
Bootstrap::init();

$scrawler = new Scrawler();

$schema = [
    'title' => 'h1',
    'price' => ['span.price', 'trim|float'],
    'items' => [
        'li' => [
            'text' => [null, 'trim|upper']
        ],
        'limit' => 5
    ]
];

$data = $scrawler->scrape('https://example.com', $schema);

PHP 8.4 Compatibility

If you're running PHP 8.4+, you may see deprecation warnings from vendor libraries (DiDom, Guzzle) related to implicitly nullable parameters. These are harmless but can clutter output. Use Bootstrap::init() to suppress these vendor-specific warnings:

use Scrawler\Bootstrap;

Bootstrap::init(); // Call once at the start of your script

This only suppresses deprecation warnings from vendor code, keeping your own code's warnings intact.
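For readers curious how this kind of selective suppression can work, here is a minimal sketch using PHP's native set_error_handler. It is not Bootstrap::init()'s actual implementation; the function name isVendorDeprecation and the "/vendor/" path check are assumptions made for illustration.

```php
<?php

/**
 * Hypothetical helper (not part of Scrawler): returns true when the error is
 * a deprecation notice raised from a file under a vendor/ directory.
 */
function isVendorDeprecation(int $errno, string $errfile): bool
{
    $isDeprecation = ($errno & (E_DEPRECATED | E_USER_DEPRECATED)) !== 0;

    // Normalize Windows separators so the same check works on every platform.
    $inVendor = str_contains(str_replace('\\', '/', $errfile), '/vendor/');

    return $isDeprecation && $inVendor;
}

// Register a handler for deprecations only. Returning true marks the error as
// handled (silenced); returning false lets PHP's default handler report it.
set_error_handler(
    static function (int $errno, string $errstr, string $errfile, int $errline): bool {
        return isVendorDeprecation($errno, $errfile);
    },
    E_DEPRECATED | E_USER_DEPRECATED
);
```

With a handler like this, a deprecation raised from your own src/ directory still reaches PHP's normal error reporting.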

Schema Syntax

Simple Text Extraction

$schema = [
    'title' => 'h1',
    'description' => '.content p'
];

Attribute Extraction

$schema = [
    'image' => 'img@src',
    'link' => 'a@href',
    'dataId' => 'div@data-id'
];

Extracting attributes from the current element (useful in lists):

$schema = [
    'items' => [
        '.product' => [
            'id' => '@id',              // Get id attribute from .product element
            'data' => '@data-value',    // Get data-value attribute
            'name' => '.title'          // Get text from nested .title
        ]
    ]
];
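To make the selector@attribute convention concrete, the sketch below splits such an expression into its two parts. The helper splitSelector is hypothetical and only illustrates the notation used in the schemas above; Scrawler's own parsing may differ.

```php
<?php

/**
 * Hypothetical helper (not part of Scrawler): splits "selector@attribute"
 * into [selector, attribute]. Either part is null when it is absent.
 *
 * @return array{0: ?string, 1: ?string}
 */
function splitSelector(string $expression): array
{
    $pos = strrpos($expression, '@');

    if ($pos === false) {
        // No "@": plain text extraction from the selected element.
        return [$expression, null];
    }

    $selector  = substr($expression, 0, $pos);
    $attribute = substr($expression, $pos + 1);

    // An empty selector ("@id") means "the current element".
    return [$selector === '' ? null : $selector, $attribute];
}

var_dump(splitSelector('img@src')); // selector "img", attribute "src"
var_dump(splitSelector('@id'));     // selector null, attribute "id"
var_dump(splitSelector('.title'));  // selector ".title", attribute null
```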

Transformers

Apply transformations using pipe-separated transformer names:

$schema = [
    'price' => ['span.price', 'trim|float'],
    'name' => ['.product-name', 'trim|upper'],
    'url' => ['a@href', 'urldecode']
];

Available Transformers:

Type Conversions:

int, float, bool, string

String Operations:

trim, ltrim, rtrim
upper, lower, ucfirst, ucwords
strip_tags

URL/Path:

basename, dirname
urlencode, urldecode

Parsing:

json - decode JSON strings
timestamp - convert dates to Unix timestamp

Utility:

abs - absolute value
md5, sha1 - hashing
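The pipe syntax reads left to right: each transformer receives the output of the previous one. The sketch below shows that idea with a handful of transformers mapped to native PHP functions. The function applyTransformerChain and its mapping are assumptions for illustration, not Scrawler's internal implementation, and the library's transformers may behave differently on edge cases.

```php
<?php

/**
 * Hypothetical helper (not part of Scrawler): applies a pipe-separated chain
 * such as "trim|upper" to a value, left to right.
 */
function applyTransformerChain(mixed $value, string $chain): mixed
{
    // A small subset of transformer names mapped to native PHP behavior (assumed).
    $transformers = [
        'trim'      => static fn (mixed $v): string => trim((string) $v),
        'upper'     => static fn (mixed $v): string => strtoupper((string) $v),
        'lower'     => static fn (mixed $v): string => strtolower((string) $v),
        'ucwords'   => static fn (mixed $v): string => ucwords((string) $v),
        'int'       => static fn (mixed $v): int => (int) $v,
        'float'     => static fn (mixed $v): float => (float) $v,
        'urldecode' => static fn (mixed $v): string => urldecode((string) $v),
    ];

    foreach (explode('|', $chain) as $name) {
        if (!isset($transformers[$name])) {
            throw new InvalidArgumentException("Unknown transformer: {$name}");
        }
        // Feed the result of each step into the next one.
        $value = $transformers[$name]($value);
    }

    return $value;
}

var_dump(applyTransformerChain('  wireless headphones  ', 'trim|ucwords'));
var_dump(applyTransformerChain(' 19.90 ', 'trim|float'));
var_dump(applyTransformerChain('/products/item%20123', 'urldecode'));
```

Order matters: 'trim|float' strips whitespace before the numeric cast, which keeps the conversion independent of how the source HTML is indented.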

Lists (New Syntax)

Simple list:

$schema = [
    'items' => [
        'li' => [
            'text' => null  // Current element text
        ]
    ]
];

List with transformers:

$schema = [
    'products' => [
        '.product' => [
            'name' => ['.name', 'trim|ucwords'],
            'price' => ['.price', 'trim|float']
        ]
    ]
];

List with limit and offset:

$schema = [
    'items' => [
        'li' => ['text' => null],
        'limit' => 10,    // Then take at most 10 of the remaining items
        'offset' => 5     // Skip the first 5 matches
    ]
];
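Judging by the "Scraping Lists with Limits" example in this README (offset 1 and limit 3 on five items yields 2, 3, 4), the offset is applied first and the limit second. That is the same behavior as PHP's array_slice, as the sketch below shows. The function sliceMatches is a hypothetical stand-in, not a Scrawler method.

```php
<?php

/**
 * Hypothetical helper (not part of Scrawler): skips $offset matches,
 * then keeps at most $limit of the remaining ones.
 *
 * @param list<string> $matches
 * @return list<string>
 */
function sliceMatches(array $matches, int $offset = 0, ?int $limit = null): array
{
    // A null length makes array_slice return everything after the offset.
    return array_slice($matches, $offset, $limit);
}

$items = ['1', '2', '3', '4', '5'];

print_r(sliceMatches($items, 1, 3)); // elements "2", "3", "4"
print_r(sliceMatches($items, 3));    // elements "4", "5"
```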

Old syntax still supported:

$schema = [
    'items' => [
        'list-selector' => 'li',
        'content' => [
            'text' => null
        ]
    ]
];
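The two syntaxes carry the same information: the old form names the selector and the fields under fixed keys, while the new form uses the selector itself as the key. A small sketch of that mapping follows. The function toNewListSyntax is hypothetical and exists only to illustrate the relationship; Scrawler accepts both forms directly.

```php
<?php

/**
 * Hypothetical helper (not part of Scrawler): rewrites an old-style list
 * definition into the new selector-keyed form.
 *
 * @param array<string, mixed> $list
 * @return array<string, mixed>
 */
function toNewListSyntax(array $list): array
{
    // Anything without both old-style keys is returned unchanged.
    if (!isset($list['list-selector'], $list['content'])) {
        return $list;
    }

    // The selector string becomes the key; the field map becomes its value.
    return [$list['list-selector'] => $list['content']];
}

$old = [
    'list-selector' => 'li',
    'content' => ['text' => null],
];

var_dump(toNewListSyntax($old)); // an array with the single key "li"
```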

Nested Lists

$schema = [
    'categories' => [
        '.category' => [
            'name' => '.category-name',
            'products' => [
                '.product' => [
                    'name' => ['.name', 'trim'],
                    'price' => ['.price', 'trim|float']
                ],
                'limit' => 5
            ]
        ]
    ]
];

Examples

Scraping with Transformers

$html = '
    <div class="product">
        <h2>  wireless headphones  </h2>
        <span class="price">  $59.99  </span>
        <a href="/products/item%20123">Details</a>
    </div>
';

$schema = [
    'name' => ['h2', 'trim|ucwords'],
    'price' => ['.price', 'trim|float'],
    'url' => ['a@href', 'urldecode']
];

// The third argument (true) tells scrape() that $html is markup, not a URL
$result = $scrawler->scrape($html, $schema, true);

// Output:
// [
//     'name' => 'Wireless Headphones',
//     'price' => 59.99,
//     'url' => '/products/item 123'
// ]

Scraping Lists with Limits

$html = '<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li>';

$schema = [
    'items' => [
        'li' => ['text' => null],
        'offset' => 1,
        'limit' => 3
    ]
];

$result = $scrawler->scrape($html, $schema, true);

// Output: ['items' => [['text' => '2'], ['text' => '3'], ['text' => '4']]]

Complex Real-World Example

$schema = [
    'title' => ['h1', 'trim|upper'],
    'author' => '.meta .author',
    'publishedAt' => ['.meta .date', 'timestamp'],
    'content' => ['.content', 'trim|strip_tags'],
    'tags' => [
        '.tag' => [
            'name' => [null, 'trim|lower'],
            'url' => ['a@href', 'urldecode']
        ],
        'limit' => 10
    ]
];

JSON API Usage

The schema syntax is designed to work seamlessly with JSON:

{
  "title": ["h1", "trim|upper"],
  "price": ["span.price", "trim|float"],
  "products": {
    ".product": {
      "name": [".name", "trim"],
      "price": [".price", "trim|float"]
    },
    "limit": 10,
    "offset": 0
  }
}

Note: Callbacks and filtering should be handled by the API consumer after receiving the data.
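On the PHP side, a JSON schema received from an API request can be turned into the array form used throughout this README with json_decode. The sketch below assumes the JSON arrives as a string; the variable names are illustrative, and the final scrape() call is left as a comment because it needs the library and a reachable URL.

```php
<?php

// JSON schema as it might arrive in a request body (illustrative input).
$json = <<<'JSON'
{
  "title": ["h1", "trim|upper"],
  "products": {
    ".product": {
      "name": [".name", "trim"],
      "price": [".price", "trim|float"]
    },
    "limit": 10,
    "offset": 0
  }
}
JSON;

// Decode objects as associative arrays and throw JsonException on bad input.
$schema = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// $schema now has the same shape as a hand-written PHP schema, e.g.:
// $data = $scrawler->scrape('https://example.com', $schema);
var_dump($schema['products']['limit']); // int(10)
```

A JSON null decodes to PHP null, so the "current element" form (for example [null, "trim"]) survives the round trip unchanged.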

Custom HTTP Client

use GuzzleHttp\Client;
use Scrawler\Scrawler;

$client = new Client([
    'timeout' => 30,
    'headers' => ['User-Agent' => 'My Bot/1.0'],
    'proxy' => 'http://proxy.example.com:8080'
]);

$scrawler = new Scrawler($client);

Testing

# Run all tests
composer test

# Run specific test
./vendor/bin/phpunit tests/ScrawlerNewSyntaxTest.php

# With coverage
composer coverage

Static Analysis

composer analyse

PHPStan Level: Max (strictest)

Requirements

PHP 8.1 or higher
ext-dom
Guzzle 6.0 or 7.0+
DiDom 2.0+

License

MIT License - see LICENSE

Contributing

Contributions welcome! Please ensure:

All tests pass
PHPStan analysis passes
Code follows PSR-12

Author

Fatih Özbek - mail@fatih.dev

Credits

Guzzle - HTTP client
DiDom - DOM parsing
