This script:
- Expands each keyword into 2 * `queries_to_generate_per_keyword` queries (the first half in Russian, the second half in Kazakh).
- Uses serper.dev to retrieve search results for these queries (the search and scrape calls are sketched after this list).
- De-duplicates and merges those results.
- Asks GPT to pick the top `top_results_to_get` results for each keyword.
- Scrapes those pages via serper.dev to get the full text.
- Outputs everything to a JSON file.
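
For orientation, here is a minimal sketch of the two serper.dev calls the pipeline depends on. The endpoints are serper.dev's public search and scrape APIs; the exact parameters and the `SERPER_API_KEY` variable name are assumptions — check `scrap_by_keywords.py` for what the script actually sends.

```python
import os

import requests

SERPER_API_KEY = os.environ["SERPER_API_KEY"]  # assumed variable name


def serper_search(query: str, num: int = 10) -> list[dict]:
    """Fetch Google results for one generated query via serper.dev."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_API_KEY},
        json={"q": query, "num": num},
        timeout=30,
    )
    resp.raise_for_status()
    # Each "organic" hit carries the fields kept downstream: link and snippet.
    return resp.json().get("organic", [])


def serper_scrape(url: str) -> str:
    """Fetch the full text of one selected result page via serper.dev's scraper."""
    resp = requests.post(
        "https://scrape.serper.dev",
        headers={"X-API-KEY": SERPER_API_KEY},
        json={"url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")
```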
Setup:
- Python 3.11+
- `poetry install`
- `poetry shell`
- Copy `.env.template` and fill in the OpenAI and Serper keys
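
If you want to poke at the APIs from a REPL, the keys can be loaded from `.env` directly. A minimal sketch, assuming python-dotenv and the conventional variable names (check `.env.template` for the actual ones):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Variable names are assumptions; mirror whatever .env.template defines.
openai_key = os.environ["OPENAI_API_KEY"]
serper_key = os.environ["SERPER_API_KEY"]
```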
- Create a JSON file with a list of keywords, for example:
  ```json
  ["Dimash Qudaibergen", "Apple Inc"]
  ```
- Run:
  ```bash
  python scrap_by_keywords.py keywords.json output.json \
      --google_results_to_get_per_query=10 \
      --top_results_to_get=5 \
      --queries_to_generate_per_keyword=3 \
      --collect_imgs_and_context=True
  ```

Parameters:
- `google_results_to_get_per_query` controls how many results to retrieve for each query.
- `top_results_to_get` is how many of those results to parse further.
- `queries_to_generate_per_keyword` is how many queries to generate in Russian (and then again in Kazakh), so the total number of queries is double that.
- `collect_imgs_and_context` controls whether to collect images and their surrounding context for each result after the main results are collected. This writes `<output_json>_imgs.json` plus an `imgs` folder with the downloaded images; the JSON holds a list of dicts with `url`, `img_url`, `context_text_before`, `context_text_after`, and `file_path` fields. Only images larger than N pixels are kept, to avoid collecting junk: technical images, headers, etc. (a sketch of this filter follows the list).
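
The size filter is the easiest part to misread, so here is the idea in isolation — a minimal sketch assuming Pillow and a per-side threshold; the actual N (and whether it applies per side or to total pixel count) is defined in the script.

```python
from io import BytesIO

import requests
from PIL import Image

MIN_SIDE_PX = 200  # hypothetical threshold; the script defines the real N


def is_worth_keeping(img_url: str) -> bool:
    """Reject icons, tracking pixels, and header/footer decorations by size."""
    resp = requests.get(img_url, timeout=30)
    resp.raise_for_status()
    width, height = Image.open(BytesIO(resp.content)).size
    return width >= MIN_SIDE_PX and height >= MIN_SIDE_PX
```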
- The output will be in `path/to/output.json`, for instance:
  ```json
  {
      "Dimash Qudaibergen": [
          {
              "url": "...",
              "google_snippet": "...",
              "full_text": "..."
          },
          ...
      ],
      ...
  }
  ```
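
The output is a plain dict keyed by keyword, so downstream use is straightforward; a quick inspection snippet (file name as in the example run above):

```python
import json

with open("output.json", encoding="utf-8") as f:
    results = json.load(f)

for keyword, pages in results.items():
    print(f"{keyword}: {len(pages)} pages")
    for page in pages:
        print(" -", page["url"])
```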