This script:
- Expands each keyword into 2 * `queries_to_generate_per_keyword` queries (the first half in Russian, the second half in Kazakh).
- Uses serper.dev to retrieve search results for these queries (the search and scrape calls are sketched after this list).
- De-duplicates and merges those results.
- Asks GPT to pick the top `top_results_to_get` results for each keyword.
- Scrapes those pages via serper.dev to get the full text.
- Outputs everything to a JSON file.
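
For orientation, here is a minimal sketch of the two serper.dev calls the pipeline depends on. The endpoints are serper.dev's public search and scrape APIs; the exact parameters and the `SERPER_API_KEY` variable name are assumptions — check `scrap_by_keywords.py` for what the script actually sends.

```python
import os

import requests

SERPER_API_KEY = os.environ["SERPER_API_KEY"]  # assumed variable name


def serper_search(query: str, num: int = 10) -> list[dict]:
    """Fetch Google results for one generated query via serper.dev."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": SERPER_API_KEY},
        json={"q": query, "num": num},
        timeout=30,
    )
    resp.raise_for_status()
    # Each "organic" hit carries the fields kept downstream: link and snippet.
    return resp.json().get("organic", [])


def serper_scrape(url: str) -> str:
    """Fetch the full text of one selected result page via serper.dev's scraper."""
    resp = requests.post(
        "https://scrape.serper.dev",
        headers={"X-API-KEY": SERPER_API_KEY},
        json={"url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")
```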
Setup:
- Python 3.11+
- `poetry install`
- `poetry shell`
- Copy `.env.template` and fill in the OpenAI and Serper keys
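
If you want to poke at the APIs from a REPL, the keys can be loaded from `.env` directly. A minimal sketch, assuming python-dotenv and the conventional variable names (check `.env.template` for the actual ones):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Variable names are assumptions; mirror whatever .env.template defines.
openai_key = os.environ["OPENAI_API_KEY"]
serper_key = os.environ["SERPER_API_KEY"]
```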
- Create a JSON file with a list of keywords, for example:
  ```json
  ["Dimash Qudaibergen", "Apple Inc"]
  ```
- Run:
  ```bash
  python scrap_by_keywords.py keywords.json output.json \
      --google_results_to_get_per_query=10 \
      --top_results_to_get=5 \
      --queries_to_generate_per_keyword=3 \
      --collect_imgs_and_context=True
  ```

Parameters:
- `google_results_to_get_per_query` controls how many results to retrieve for each query.
- `top_results_to_get` is how many of those results to parse further.
- `queries_to_generate_per_keyword` is how many queries to generate in Russian (and then again in Kazakh), so the total number of queries is double that.
- `collect_imgs_and_context` controls whether to collect images and their surrounding context for each result after the main results are collected. This writes `<output_json>_imgs.json` plus an `imgs` folder with the downloaded images; the JSON holds a list of dicts with `url`, `img_url`, `context_text_before`, `context_text_after`, and `file_path` fields. Only images larger than N pixels are kept, to avoid collecting junk: technical images, headers, etc. (a sketch of this filter follows the list).
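
The size filter is the easiest part to misread, so here is the idea in isolation — a minimal sketch assuming Pillow and a per-side threshold; the actual N (and whether it applies per side or to total pixel count) is defined in the script.

```python
from io import BytesIO

import requests
from PIL import Image

MIN_SIDE_PX = 200  # hypothetical threshold; the script defines the real N


def is_worth_keeping(img_url: str) -> bool:
    """Reject icons, tracking pixels, and header/footer decorations by size."""
    resp = requests.get(img_url, timeout=30)
    resp.raise_for_status()
    width, height = Image.open(BytesIO(resp.content)).size
    return width >= MIN_SIDE_PX and height >= MIN_SIDE_PX
```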
- The output will be in `path/to/output.json`, for instance:
  ```json
  {
      "Dimash Qudaibergen": [
          {
              "url": "...",
              "google_snippet": "...",
              "full_text": "..."
          },
          ...
      ],
      ...
  }
  ```
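
The output is a plain dict keyed by keyword, so downstream use is straightforward; a quick inspection snippet (file name as in the example run above):

```python
import json

with open("output.json", encoding="utf-8") as f:
    results = json.load(f)

for keyword, pages in results.items():
    print(f"{keyword}: {len(pages)} pages")
    for page in pages:
        print(" -", page["url"])
```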