A modern, schema-based web scraping library for PHP with powerful transformers and a clean, intuitive syntax. Perfect for both manual use and API integration.
- Intuitive Schema Syntax: Easy to write by hand and by AI
- Built-in Transformers: 20+ transformers for data manipulation (trim, float, int, upper, lower, etc.)
- Flexible Lists: Support for limit and offset
- JSON-Friendly: Perfect for API usage
- Type-Safe: Full PHPStan max level compliance
- Clean Architecture: SOLID principles, no anti-patterns
- Well-Tested: 47 tests, 107 assertions
composer require fozbek/scrawler

use Scrawler\Bootstrap;
use Scrawler\Scrawler;
// Handle PHP 8.4 deprecation warnings from vendor libraries (optional)
Bootstrap::init();
$scrawler = new Scrawler();
$schema = [
    'title' => 'h1',
    'price' => ['span.price', 'trim|float'],
    'items' => [
        'li' => [
            'text' => [null, 'trim|upper']
        ],
        'limit' => 5
    ]
];
$data = $scrawler->scrape('https://example.com', $schema);

If you're running PHP 8.4+, you may see deprecation warnings from vendor libraries (DiDom, Guzzle) related to implicitly nullable parameters. These are harmless but can clutter output. Use Bootstrap::init() to suppress these vendor-specific warnings:
use Scrawler\Bootstrap;
Bootstrap::init(); // Call once at the start of your script

This only suppresses deprecation warnings from vendor code, keeping your own code's warnings intact.
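How Bootstrap::init() performs this filtering is not described here. As a rough illustration of the idea only, vendor-only suppression could be approximated with a custom error handler that silences deprecations whose source file lies under a vendor/ directory. The helper name isVendorFile is hypothetical and is not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: true when a file path lies inside a Composer vendor directory.
 */
function isVendorFile(string $path): bool
{
    // Normalise Windows separators so one check covers both platforms.
    $normalised = str_replace('\\', '/', $path);

    return str_contains($normalised, '/vendor/');
}

// Intercept only deprecation notices; every other error level bypasses this handler.
set_error_handler(
    static function (int $errno, string $message, string $file, int $line): bool {
        // true = handled here (silenced); false = fall through to PHP's default handler.
        return isVendorFile($file);
    },
    E_DEPRECATED | E_USER_DEPRECATED
);
```

Because the handler returns false for files outside vendor/, deprecations raised by your own code still reach PHP's normal error reporting.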
$schema = [
    'title' => 'h1',
    'description' => '.content p'
];

$schema = [
    'image' => 'img@src',
    'link' => 'a@href',
    'dataId' => 'div@data-id'
];

Extracting attributes from the current element (useful in lists):
$schema = [
    'items' => [
        '.product' => [
            'id' => '@id',          // Get id attribute from .product element
            'data' => '@data-value', // Get data-value attribute
            'name' => '.title'      // Get text from nested .title
        ]
    ]
];

Apply transformations using pipe-separated transformer names:
$schema = [
    'price' => ['span.price', 'trim|float'],
    'name' => ['.product-name', 'trim|upper'],
    'url' => ['a@href', 'urldecode']
];

Available Transformers:
Type Conversions:
- int, float, bool, string

String Operations:
- trim, ltrim, rtrim
- upper, lower, ucfirst, ucwords
- strip_tags

URL/Path:
- basename, dirname
- urlencode, urldecode

Parsing:
- json: decode JSON strings
- timestamp: convert dates to Unix timestamp

Utility:
- abs: absolute value
- md5, sha1: hashing
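To make the pipe syntax concrete, here is a minimal sketch of how a chain such as 'trim|upper' could be evaluated, assuming transformers run left to right and each receives the previous result. The function applyTransformers and its small transformer map are hypothetical illustrations built on PHP built-ins; they are not the library's implementation and cover only a few of the names listed above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: apply a pipe-separated chain such as 'trim|upper' to a value.
 */
function applyTransformers(mixed $value, string $chain): mixed
{
    // Illustrative subset only; each entry wraps a PHP built-in.
    $transformers = [
        'trim'      => static fn ($v) => trim((string) $v),
        'upper'     => static fn ($v) => strtoupper((string) $v),
        'lower'     => static fn ($v) => strtolower((string) $v),
        'int'       => static fn ($v) => (int) $v,
        'urldecode' => static fn ($v) => urldecode((string) $v),
    ];

    // Run the transformers left to right, feeding each result into the next.
    foreach (explode('|', $chain) as $name) {
        if (!isset($transformers[$name])) {
            throw new InvalidArgumentException("Unknown transformer: {$name}");
        }
        $value = $transformers[$name]($value);
    }

    return $value;
}

$name = applyTransformers('  wireless headphones ', 'trim|upper'); // 'WIRELESS HEADPHONES'
$count = applyTransformers(' 42 ', 'trim|int');                     // 42
```

Order matters in a chain like this: 'trim|int' trims the string before casting, so surrounding whitespace never reaches the integer conversion.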
Simple list:
$schema = [
    'items' => [
        'li' => [
            'text' => null // Current element text
        ]
    ]
];

List with transformers:
$schema = [
    'products' => [
        '.product' => [
            'name' => ['.name', 'trim|ucwords'],
            'price' => ['.price', 'trim|float']
        ]
    ]
];

List with limit and offset:
$schema = [
    'items' => [
        'li' => ['text' => null],
        'limit' => 10, // Return at most 10 items, counted after the offset
        'offset' => 5  // Skip the first 5 matches
    ]
];

Old syntax still supported:
$schema = [
    'items' => [
        'list-selector' => 'li',
        'content' => [
            'text' => null
        ]
    ]
];

$schema = [
    'categories' => [
        '.category' => [
            'name' => '.category-name',
            'products' => [
                '.product' => [
                    'name' => ['.name', 'trim'],
                    'price' => ['.price', 'trim|float']
                ],
                'limit' => 5
            ]
        ]
    ]
];

$html = '
<div class="product">
<h2> wireless headphones </h2>
<span class="price"> $59.99 </span>
<a href="/products/item%20123">Details</a>
</div>
';
$schema = [
    'name' => ['h2', 'trim|ucwords'],
    'price' => ['.price', 'trim|float'],
    'url' => ['a@href', 'urldecode']
];
$result = $scrawler->scrape($html, $schema, true);
// Output:
// [
//     'name' => 'Wireless Headphones',
//     'price' => 59.99,
//     'url' => '/products/item 123'
// ]

$html = '<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li>';
$schema = [
    'items' => [
        'li' => ['text' => null],
        'offset' => 1,
        'limit' => 3
    ]
];
$result = $scrawler->scrape($html, $schema, true);
// Output: ['items' => [['text' => '2'], ['text' => '3'], ['text' => '4']]]

$schema = [
    'title' => ['h1', 'trim|upper'],
    'author' => '.meta .author',
    'publishedAt' => ['.meta .date', 'timestamp'],
    'content' => ['.content', 'trim|strip_tags'],
    'tags' => [
        '.tag' => [
            'name' => [null, 'trim|lower'],
            'url' => ['a@href', 'urldecode']
        ],
        'limit' => 10
    ]
];

The schema syntax is designed to work seamlessly with JSON:
{
    "title": ["h1", "trim|upper"],
    "price": ["span.price", "trim|float"],
    "products": {
        ".product": {
            "name": [".name", "trim"],
            "price": [".price", "trim|float"]
        },
        "limit": 10,
        "offset": 0
    }
}

Note: Callbacks and filtering should be handled by the API consumer after receiving the data.
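As a sketch of the API workflow described above, the snippet below decodes a JSON schema into a PHP array and then filters a result on the consumer side. The $data array is a hard-coded, hypothetical scrape result so the example runs without network access; the price threshold is likewise only an illustration.

```php
<?php

declare(strict_types=1);

// A schema as it might arrive in an API request body.
$json = <<<'JSON'
{
    "title": ["h1", "trim|upper"],
    "products": {
        ".product": {
            "name": [".name", "trim"],
            "price": [".price", "trim|float"]
        },
        "limit": 10,
        "offset": 0
    }
}
JSON;

// Decode to associative arrays so the result has the same shape as a
// hand-written PHP schema; malformed JSON throws a JsonException.
$schema = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Hypothetical scrape result (assumption: stands in for the scraper's output).
$data = [
    'title'    => 'SHOP',
    'products' => [
        ['name' => 'Cable', 'price' => 4.5],
        ['name' => 'Headphones', 'price' => 59.99],
    ],
];

// Consumer-side filtering: keep products priced at 10 or more.
$expensive = array_values(array_filter(
    $data['products'],
    static fn (array $product): bool => $product['price'] >= 10
));
```

The decoded $schema can be passed to scrape() in place of a hand-written array, and any filtering or callback logic stays in the consumer's code, as the note above recommends.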
use GuzzleHttp\Client;
use Scrawler\Scrawler;

$client = new Client([
    'timeout' => 30,
    'headers' => ['User-Agent' => 'My Bot/1.0'],
    'proxy' => 'http://proxy.example.com:8080'
]);
$scrawler = new Scrawler($client);

# Run all tests
composer test
# Run specific test
./vendor/bin/phpunit tests/ScrawlerNewSyntaxTest.php
# With coverage
composer coverage

composer analyse

PHPStan Level: Max (strictest)
- PHP 8.1 or higher
- ext-dom
- Guzzle 6.0 or 7.0+
- DiDom 2.0+
MIT License - see LICENSE
Contributions welcome! Please ensure:
- All tests pass
- PHPStan analysis passes
- Code follows PSR-12
Fatih Özbek - mail@fatih.dev