horde-research/Kaz-Parser-Bot

Keyword Processor with Serper.dev and GPT

Description

This script:

  1. Expands each keyword into 2 * queries_to_generate_per_keyword queries (the first half in Russian, the second half in Kazakh).
  2. Uses serper.dev to retrieve search results for these queries.
  3. De-duplicates and merges those results.
  4. Asks GPT to pick the best top_results_to_get results for each keyword.
  5. Scrapes those pages via serper.dev to get full text.
  6. Outputs everything to a JSON file.
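Step 3 (de-duplicate and merge) can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the helper names normalize_url and merge_results are hypothetical, and the assumption is that each search result is a dict with at least a "url" field, as in the output format of this README.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Build a comparison key: lower-case host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def merge_results(result_lists):
    """Merge several result lists into one, keeping the first hit per URL."""
    merged = {}
    for results in result_lists:
        for item in results:
            key = normalize_url(item["url"])
            # setdefault keeps the earliest occurrence and preserves order
            merged.setdefault(key, item)
    return list(merged.values())
```

Keeping the first occurrence means that results from earlier queries take priority when the same page is returned for both a Russian and a Kazakh query.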

Setup

  • Python 3.11+
  • poetry install
  • poetry shell
  • Copy .env.template and fill in the OpenAI and Serper keys

Usage

  1. Create a JSON file with a list of keywords, for example:
["Dimash Qudaibergen", "Apple Inc"]
  2. Run:
python scrap_by_keywords.py keywords.json output.json \
    --google_results_to_get_per_query=10 \
    --top_results_to_get=5 \
    --queries_to_generate_per_keyword=3 \
    --collect_imgs_and_context=True
  • google_results_to_get_per_query controls how many results to retrieve for each query.
  • top_results_to_get is how many of those results to parse further.
  • queries_to_generate_per_keyword is how many queries to generate in Russian (and then again in Kazakh), so the total number of queries per keyword is double that.
  • collect_imgs_and_context controls whether images and their surrounding context are collected for each result once the main results are in. When enabled, the script writes <output_json>_imgs.json and an imgs folder containing the images. The JSON is a list of dicts with the fields 'url', 'img_url', 'context_text_before', 'context_text_after', and 'file_path'. Only images larger than N pixels are kept, which filters out junk such as technical images and headers.
  3. The output is written to the path given as the second argument (output.json in the command above), for instance:
{
  "Dimash Qudaibergen": [
    {
      "url": "...",
      "google_snippet": "...",
      "full_text": "..."
    },
    ...
  ],
  ...
}
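
To inspect the result, the output file can be loaded with the standard json module. The sketch below assumes the structure shown above (a dict mapping each keyword to a list of results with "url", "google_snippet", and "full_text"); the function names load_output and summarize are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_output(path):
    """Read the JSON file produced by the script."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(data):
    """Return, per keyword, the number of pages and total scraped characters."""
    summary = {}
    for keyword, results in data.items():
        summary[keyword] = {
            "pages": len(results),
            # full_text may be missing or null if a page could not be scraped
            "total_chars": sum(len(r.get("full_text") or "") for r in results),
        }
    return summary
```

For example, summarize(load_output("output.json")) gives a quick check of how much text was collected for each keyword.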
