feat: Dubai villa lead scraper + Apify bridge + Sheets sync #1
base: main
agents/apify_dubai_scraper.py (new file)
@@ -0,0 +1,216 @@
| """ | ||||||||||||||||||||||||||||||||||||
| Apify Dubai Real Estate API Bridge | ||||||||||||||||||||||||||||||||||||
| ==================================== | ||||||||||||||||||||||||||||||||||||
| Uses Apify's ready-made Dubai Real Estate Scraper actor to get | ||||||||||||||||||||||||||||||||||||
| owner contacts from PropertyFinder, Bayut & Dubizzle. | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| This is the FAST path - uses Apify's actor which handles anti-bot measures. | ||||||||||||||||||||||||||||||||||||
| Requires APIFY_TOKEN in .env | ||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||
| Usage: | ||||||||||||||||||||||||||||||||||||
| python agents/apify_dubai_scraper.py | ||||||||||||||||||||||||||||||||||||
| python agents/apify_dubai_scraper.py --area "Palm Jumeirah" --max 100 | ||||||||||||||||||||||||||||||||||||
|
Comment on lines +7 to +12
The docstring says … Also applies to: 175-180
"""
import argparse
import json
import time
import urllib.request
import urllib.error
import os
from datetime import datetime, timezone
from pathlib import Path


ROOT_DIR = Path(__file__).resolve().parent.parent
STATE_DIR = ROOT_DIR / "data" / "state"
LOG_DIR = ROOT_DIR / "data" / "logs"
LEADS_FILE = STATE_DIR / "villa_leads.json"
LOG_FILE = LOG_DIR / "apify_scraper.log"

STATE_DIR.mkdir(parents=True, exist_ok=True)
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Apify actor ID for Dubai Real Estate Scraper
ACTOR_ID = "redoubtable_bubble~dubai-real-estate-scraper-propertyfinder-bayut-dubizzle"


def log(msg: str) -> None:
    line = f"[{datetime.now(timezone.utc).isoformat()}] [apify-scraper] {msg}"
    print(line, flush=True)
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def apify_request(method: str, path: str, token: str, body: dict = None) -> dict:
    url = f"https://api.apify.com/v2{path}?token={token}"
    data = json.dumps(body).encode() if body else None
    headers = {"Content-Type": "application/json"}
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        error_body = e.read().decode()
        log(f"Apify API error {e.code}: {error_body}")
        return {"error": str(e.code), "message": error_body}
    except Exception as e:
        log(f"Request error: {e}")
        return {"error": str(e)}


def run_actor(token: str, area: str, max_items: int, property_type: str = "villa") -> str | None:
    """Start the Apify actor run and return run ID."""
    payload = {
        "searchQuery": f"{property_type} {area} Dubai" if area else f"{property_type} Dubai",
        "maxItems": max_items,
        "propertyType": "villa",
        "listingType": "rent",
        "location": area or "Dubai",
        "directOwnerOnly": True
    }
    log(f"Starting Apify actor: {ACTOR_ID}")
    log(f"Payload: {json.dumps(payload)}")

    result = apify_request("POST", f"/acts/{ACTOR_ID}/runs", token, payload)

    if "data" in result:
        run_id = result["data"]["id"]
        log(f"Actor started. Run ID: {run_id}")
        return run_id
    else:
        log(f"Failed to start actor: {result}")
        return None


def wait_for_run(token: str, run_id: str, timeout: int = 300) -> bool:
    """Wait for actor run to finish."""
    log(f"Waiting for run {run_id} to complete...")
    start = time.time()
    while time.time() - start < timeout:
        result = apify_request("GET", f"/actor-runs/{run_id}", token)
        status = result.get("data", {}).get("status", "")
        log(f"  Status: {status}")
        if status in ("SUCCEEDED", "FINISHED"):
            return True
        if status in ("FAILED", "ABORTED", "TIMED-OUT"):
            log(f"Run failed with status: {status}")
            return False
        time.sleep(10)
    log("Timeout waiting for actor run")
    return False


def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
Suggested change:
-    items = result.get("data", {}).get("items", [])
+    # The Apify dataset items endpoint may return either:
+    # - a raw JSON array of items, or
+    # - an object that wraps items under data.items.
+    if isinstance(result, list):
+        items = result
+    elif isinstance(result, dict):
+        data = result.get("data", result)
+        if isinstance(data, dict):
+            items = data.get("items", [])
+        elif isinstance(data, list):
+            items = data
+        else:
+            items = []
+    else:
+        items = []
🌐 Web query:
Apify API dataset-items endpoint response format /actor-runs/{runId}/dataset/items

💡 Result:
GET /v2/actor-runs/{runId}/dataset/items is the Actor-run "default dataset" items endpoint, and it behaves the same as the regular Get dataset items endpoint (same query params and same output formats) (docs.apify.com).

Default response (no format param): format=json (the default) returns a JSON array of item objects, e.g. [{...},{...}] (docs.apify.com).

Other response formats (format=...):
- format=jsonl: newline-delimited JSON (one JSON object per line, ending with \n) (docs.apify.com).
- format=csv: CSV text (header\nrow\n...), with optional UTF-8 BOM behavior controlled by bom=0|1 (docs.apify.com).
- Also supported: html, xlsx, xml, rss (returned as the corresponding textual/binary representation) (docs.apify.com).

Pagination metadata (via response headers): the items response includes X-Apify-Pagination-Offset, X-Apify-Pagination-Limit, X-Apify-Pagination-Count, and X-Apify-Pagination-Total (docs.apify.com).

Common query params that affect the output shape: offset, limit, fields (pick only listed fields), omit, unwind, flatten, clean (shortcut for skipHidden=true + skipEmpty=true) (docs.apify.com).

Citations:
- https://docs.apify.com/api/v2/actor-run-get
- https://docs.apify.com/api/v2/dataset-items-get
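As a quick sanity check of the documented format=json shape, here is a minimal parsing helper. It is illustrative only (the function name and the wrapped-response fallback are not part of the PR); it treats the body as a raw JSON array first, then falls back to the data.items wrapper used by other Apify endpoints in this file:

```python
import json


def parse_dataset_response(raw: bytes) -> list[dict]:
    # Per the Apify docs quoted above, format=json (the default) returns
    # a raw JSON array [{...}, {...}] - not a {"data": {"items": []}} wrapper.
    body = json.loads(raw)
    if isinstance(body, list):
        return body
    # Fallback for wrapped responses seen on other Apify endpoints.
    if isinstance(body, dict):
        data = body.get("data", {})
        if isinstance(data, dict):
            return data.get("items", [])
        if isinstance(data, list):
            return data
    return []
```

Feeding this the three plausible shapes shows why the type check matters: a raw array passes through, a wrapped object is unwrapped, and an error object degrades to an empty list instead of raising.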
Treat the dataset-items response as a raw list.
Apify's /actor-runs/{runId}/dataset/items endpoint returns a JSON array directly (e.g. [{...},{...}]), not a wrapped object. The current code calls .get("data", {}) on the response, which will fail with AttributeError: 'list' object has no attribute 'get'. This differs from other Apify endpoints in this file (e.g. /actor-runs/{run_id}) which return wrapped responses.
Proposed fix

 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
     log(f"Fetched {len(items)} items from Apify")
     return items
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In agents/apify_dubai_scraper.py around lines 107-111: the fetch_results function assumes apify_request returned a dict and calls .get on it, but the /actor-runs/{run_id}/dataset/items endpoint returns a raw list. Update fetch_results to handle both shapes by checking the type of the response from apify_request: set items = result if it is a list, otherwise fall back to result.get("data", {}).get("items", []). Keep the log(f"Fetched {len(items)} items from Apify") call and ensure the function returns the items list.
🔴 Apify dataset items API returns a JSON array, not a {data: {items: []}} object
The Apify API endpoint /v2/actor-runs/{runId}/dataset/items returns a raw JSON array of items, not an object with a data.items wrapper. At agents/apify_dubai_scraper.py:109-110, apify_request parses the response with json.loads() which yields a Python list. Then result.get("data", {}).get("items", []) will raise AttributeError: 'list' object has no attribute 'get' because lists don't have .get(). This means fetch_results() will always crash after a successful actor run.
Suggested change:
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
     result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    if isinstance(result, list):
+        items = result
+    else:
+        items = result.get("data", {}).get("items", [])
     log(f"Fetched {len(items)} items from Apify")
     return items
Serialize writes to the shared leads store.
This code loads data/state/villa_leads.json, appends in memory, and rewrites the whole file. agents/dubai_villa_scraper.py does the same against the same path, so when runs overlap, the last writer wins and the other scraper's leads are silently dropped.
Also applies to: 198-201
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 152-152: Do not catch blind exception: Exception (BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 148 - 158, Concurrent runs
overwrite the shared JSON leads store because load_existing_leads and save_leads
read the whole file, mutate in-memory, and rewrite it; change these functions to
perform concurrency-safe updates (e.g., acquire a file lock around
read-modify-write or switch to an append-only/JSONL writer) so overlapping
scrapers don't lose data. Specifically, update load_existing_leads, save_leads
and any callers that append to LEADS_FILE so they obtain an exclusive lock on
LEADS_FILE (or open it in append mode for JSONL) before reading/writing, merge
new leads into the existing set safely, and use atomic replace/rename when
writing to avoid partial writes; ensure the same locking/format is used by
agents/dubai_villa_scraper.py to keep behavior consistent.
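The locking-plus-atomic-replace approach described above can be sketched as follows. This is a sketch under stated assumptions, not the PR's code: the helper name, the path parameter, the sidecar .lock file, and the use of "url" as the dedupe key are all illustrative choices, and fcntl.flock is POSIX-only (Windows would need msvcrt.locking or a portable locking library):

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def append_leads_locked(path: Path, new_leads: list[dict]) -> int:
    """Read-modify-write the shared JSON leads store under an exclusive lock."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = path.with_suffix(".lock")  # sidecar lock file
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until any other scraper releases
        try:
            existing = json.loads(path.read_text()) if path.exists() else []
            # Dedupe by listing URL ("url" as the key is an assumption
            # about the lead schema in villa_leads.json).
            seen = {lead.get("url") for lead in existing}
            merged = existing + [lead for lead in new_leads if lead.get("url") not in seen]
            # Write to a temp file in the same directory, then atomically
            # replace, so readers never observe a partially written file.
            fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(merged, f, indent=2)
            os.replace(tmp, path)
            return len(merged)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Both scrapers would need to go through the same helper (or the same lock file) for this to prevent lost updates; locking only one writer does not help.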
Specify Python 3.10+ requirement for these commands.
The agents use PEP 604 union syntax (str | None) and parameterized built-ins (list[str], dict, etc.), which require Python 3.10+ and 3.9+ respectively. The README shows unqualified python commands, so users with system Python < 3.10 will fail at runtime. Add a note requiring Python 3.10+.