24 changes: 23 additions & 1 deletion README.md
@@ -53,4 +53,26 @@ python agents/hello.py
- `npm run dev` - runs UI + Core + Orchestrator
- `npm run ui` - starts Next.js UI
- `npm run core` - starts Express API
- `npm run agent` - runs hello agent directly

## Dubai Villa Lead Scraper

Two scraper agents for collecting direct villa owner contacts in Dubai:

### Basic Scraper (no API key needed)
```bash
python agents/dubai_villa_scraper.py --source both --area "Palm Jumeirah" --max 30
```

### Apify Scraper (faster, more reliable)
Requires `APIFY_TOKEN` in `.env`
```bash
python agents/apify_dubai_scraper.py --area "Emirates Hills" --max 100
```
Comment on lines +57 to +70
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's find the agents directory and list its contents
find . -name "agents" -type d | head -20
git ls-files | grep -E "agents/" | head -20

# Check if the agent files exist
ls -la agents/ 2>/dev/null || echo "agents/ directory not found at root"

# Search for these specific files
find . -name "dubai_villa_scraper.py" -o -name "apify_dubai_scraper.py"

# Check for version requirements in common config files
cat pyproject.toml 2>/dev/null | grep -i python || echo "No python requirement in pyproject.toml"
cat setup.py 2>/dev/null | grep -i python || echo "No setup.py found"
cat setup.cfg 2>/dev/null | grep -i python || echo "No setup.cfg found"

# Check the agent files for parameterized built-ins and union syntax
wc -l agents/dubai_villa_scraper.py agents/apify_dubai_scraper.py

# Search for parameterized built-ins (list[...], dict[...], etc.) and union syntax (X | Y)
rg "list\[|dict\[|tuple\[|set\[|\s\|\s" agents/

# Check the first 50 lines of each agent file for imports and type hints
head -50 agents/dubai_villa_scraper.py

# Check the apify agent file as well
head -50 agents/apify_dubai_scraper.py
```

Repository: vishnu-madhavan-git/automation
Specify Python 3.10+ requirement for these commands.

The agents use PEP 604 union syntax (str | None) and parameterized built-ins (list[str], dict, etc.), which require Python 3.10+ and 3.9+ respectively. The README shows unqualified python commands, so users with system Python < 3.10 will fail at runtime. Add a note requiring Python 3.10+.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 57 - 70, Update the README to explicitly require
Python 3.10+ for running the scraper commands: add a short note above the
examples stating "Requires Python 3.10+" (or "Python 3.10 or later") because the
agents (agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP
604 union syntax (e.g., str | None) and newer parameterized built-ins; also keep
the APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
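
The version requirement flagged above could also be enforced at runtime; a minimal guard sketch (the `require_python` helper is illustrative and not part of the repo) turns a cryptic `SyntaxError`/`TypeError` on old interpreters into a readable message:

```python
# Hypothetical guard; call it near the top of each agent script before any
# 3.10-only syntax is exercised at runtime.
import sys

def require_python(minimum=(3, 10)) -> None:
    """Fail fast with a readable message on interpreters older than `minimum`."""
    if sys.version_info < minimum:
        raise SystemExit(
            f"This script requires Python {minimum[0]}.{minimum[1]}+, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
```

Note this only helps for failures that surface at runtime; PEP 604 annotations evaluated at import time would still raise before the guard runs unless it lives in a separate launcher module.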


Leads are saved to `data/state/villa_leads.json` and can be synced to Google Sheets via `core/leads-bridge.js`.

### Sync to Sheets
```js
const { syncLeadsToSheets } = require('./core/leads-bridge');
await syncLeadsToSheets();
```
216 changes: 216 additions & 0 deletions agents/apify_dubai_scraper.py
@@ -0,0 +1,216 @@
"""
Apify Dubai Real Estate API Bridge
====================================
Uses Apify's ready-made Dubai Real Estate Scraper actor to get
owner contacts from PropertyFinder, Bayut & Dubizzle.

This is the FAST path - uses Apify's actor which handles anti-bot measures.
Requires APIFY_TOKEN in .env

Usage:
python agents/apify_dubai_scraper.py
python agents/apify_dubai_scraper.py --area "Palm Jumeirah" --max 100
Comment on lines +7 to +12
⚠️ Potential issue | 🟠 Major

Load .env here or change the documented CLI contract.

The docstring says APIFY_TOKEN can live in .env, but this entry point only checks os.environ. Running python agents/apify_dubai_scraper.py as documented will still hit APIFY_TOKEN not set unless the caller exported the variable first.

Also applies to: 175-180

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 7 - 12, The script documents that
APIFY_TOKEN can live in a .env but only reads os.environ; fix by loading .env at
module start (before any env access) using python-dotenv: add "from dotenv
import load_dotenv" and call "load_dotenv()" near the top of the file before
retrieving APIFY_TOKEN (and similarly before the env checks around the block
referenced at lines ~175-180). Alternatively, update the CLI docs to remove the
.env claim — but the preferred fix is to call load_dotenv() before accessing
APIFY_TOKEN so os.environ sees variables from .env.
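
For reference, a stdlib-only sketch of the suggested fix, useful if adding the python-dotenv dependency is undesirable (`load_env_file` is a hypothetical helper; `dotenv.load_dotenv()` does the same thing more robustly):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Populate os.environ from simple KEY=VALUE lines, without overriding
    variables the caller already exported (mirrors load_dotenv defaults)."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that isn't KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
```

Calling this (or `load_dotenv()`) before the `--token` default is computed makes the documented `python agents/apify_dubai_scraper.py` invocation work without a prior `export`.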

"""

```python
import argparse
import json
import time
import urllib.request
import urllib.error
import os
from datetime import datetime, timezone
from pathlib import Path

ROOT_DIR = Path(__file__).resolve().parent.parent
STATE_DIR = ROOT_DIR / "data" / "state"
LOG_DIR = ROOT_DIR / "data" / "logs"
LEADS_FILE = STATE_DIR / "villa_leads.json"
LOG_FILE = LOG_DIR / "apify_scraper.log"

STATE_DIR.mkdir(parents=True, exist_ok=True)
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Apify actor ID for Dubai Real Estate Scraper
ACTOR_ID = "redoubtable_bubble~dubai-real-estate-scraper-propertyfinder-bayut-dubizzle"


def log(msg: str) -> None:
    line = f"[{datetime.now(timezone.utc).isoformat()}] [apify-scraper] {msg}"
    print(line, flush=True)
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def apify_request(method: str, path: str, token: str, body: dict = None) -> dict:
    url = f"https://api.apify.com/v2{path}?token={token}"
    data = json.dumps(body).encode() if body else None
    headers = {"Content-Type": "application/json"}
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as e:
        error_body = e.read().decode()
        log(f"Apify API error {e.code}: {error_body}")
        return {"error": str(e.code), "message": error_body}
    except Exception as e:
        log(f"Request error: {e}")
        return {"error": str(e)}
```


```python
def run_actor(token: str, area: str, max_items: int, property_type: str = "villa") -> str | None:
    """Start the Apify actor run and return run ID."""
```
Comment on lines +65 to +66
Copilot AI Mar 8, 2026

This file also uses PEP 604 union types in return annotations (e.g. str | None, dict | None), which require Python 3.10+. If agents are intended to run on an unspecified “system python”, consider using Optional[...]/Union[...] or documenting/enforcing Python >= 3.10.

```python
    payload = {
        "searchQuery": f"{property_type} {area} Dubai" if area else f"{property_type} Dubai",
        "maxItems": max_items,
        "propertyType": "villa",
        "listingType": "rent",
        "location": area or "Dubai",
        "directOwnerOnly": True
    }
    log(f"Starting Apify actor: {ACTOR_ID}")
    log(f"Payload: {json.dumps(payload)}")

    result = apify_request("POST", f"/acts/{ACTOR_ID}/runs", token, payload)

    if "data" in result:
        run_id = result["data"]["id"]
        log(f"Actor started. Run ID: {run_id}")
        return run_id
    else:
        log(f"Failed to start actor: {result}")
        return None


def wait_for_run(token: str, run_id: str, timeout: int = 300) -> bool:
    """Wait for actor run to finish."""
    log(f"Waiting for run {run_id} to complete...")
    start = time.time()
    while time.time() - start < timeout:
        result = apify_request("GET", f"/actor-runs/{run_id}", token)
        status = result.get("data", {}).get("status", "")
        log(f"  Status: {status}")
        if status in ("SUCCEEDED", "FINISHED"):
            return True
        if status in ("FAILED", "ABORTED", "TIMED-OUT"):
            log(f"Run failed with status: {status}")
            return False
        time.sleep(10)
    log("Timeout waiting for actor run")
    return False
```


```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
```
Copilot AI Mar 8, 2026

fetch_results assumes apify_request returns an object with data.items, but the dataset items endpoint commonly returns a raw JSON array. If a list is returned, result.get(...) will throw and the scraper will crash. Adjust apify_request/fetch_results to handle a list response (or request a response shape that’s always an object).

Suggested change
```diff
-    items = result.get("data", {}).get("items", [])
+    # The Apify dataset items endpoint may return either:
+    # - a raw JSON array of items, or
+    # - an object that wraps items under data.items.
+    if isinstance(result, list):
+        items = result
+    elif isinstance(result, dict):
+        data = result.get("data", result)
+        if isinstance(data, dict):
+            items = data.get("items", [])
+        elif isinstance(data, list):
+            items = data
+        else:
+            items = []
+    else:
+        items = []
```

```python
    log(f"Fetched {len(items)} items from Apify")
```
Comment on lines +107 to +111

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

```shell
# First, find and inspect the apify_dubai_scraper.py file
find . -name "apify_dubai_scraper.py" -type f
```


🌐 Web query:

Apify API dataset-items endpoint response format /actor-runs/{runId}/dataset/items

💡 Result:

GET /v2/actor-runs/{runId}/dataset/items is the Actor-run “default dataset” items endpoint and it behaves the same as the regular Get dataset items endpoint (same query params + same output formats). (docs.apify.com)

Default response (no format param)

  • format=json (default): returns a JSON array of item objects, e.g. [{...},{...}]. (docs.apify.com)

Other response formats (format=...)

  • format=jsonl: newline-delimited JSON (one JSON object per line, ending with \n). (docs.apify.com)
  • format=csv: CSV text (header\nrow\n...), with optional UTF‑8 BOM behavior controlled by bom=0|1. (docs.apify.com)
  • Also supported: html, xlsx, xml, rss (returned as the corresponding textual/binary representation). (docs.apify.com)

Pagination metadata (via response headers)

The items response includes these headers:

  • X-Apify-Pagination-Offset
  • X-Apify-Pagination-Limit
  • X-Apify-Pagination-Count
  • X-Apify-Pagination-Total (docs.apify.com)

Common query params that affect the output shape

  • offset, limit
  • fields (pick only listed fields), omit
  • unwind, flatten
  • clean (shortcut for skipHidden=true + skipEmpty=true) (docs.apify.com)

🏁 Scripts executed:

```shell
# Search for the apify_request function to understand response handling
rg "def apify_request" -A 10 -t py

# Inspect the actual fetch_results function in the file to confirm code and line numbers
cat -n agents/apify_dubai_scraper.py | sed -n '100,120p'

# Also check the full context around the function to understand usage
cat -n agents/apify_dubai_scraper.py | sed -n '1,30p'

# Check the complete apify_request function to see full implementation
cat -n agents/apify_dubai_scraper.py | sed -n '40,60p'

# Verify if there are any other usages of apify_request to understand the response pattern
rg "apify_request" agents/apify_dubai_scraper.py -B 1 -A 1
```


Treat the dataset-items response as a raw list.

Apify's /actor-runs/{runId}/dataset/items endpoint returns a JSON array directly (e.g. [{...},{...}]), not a wrapped object. The current code calls .get("data", {}) on the response, which will fail with AttributeError: 'list' object has no attribute 'get'. This differs from other Apify endpoints in this file (e.g. /actor-runs/{run_id}) which return wrapped responses.

Proposed fix
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
     log(f"Fetched {len(items)} items from Apify")
     return items
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
-    log(f"Fetched {len(items)} items from Apify")
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
+    log(f"Fetched {len(items)} items from Apify")
+    return items
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 107 - 111, The fetch_results
function assumes apify_request returned a dict and calls .get on it, but the
/actor-runs/{run_id}/dataset/items endpoint returns a raw list; update
fetch_results to handle both shapes by checking the type of the response from
apify_request (called in fetch_results) and set items = result if it's a list,
otherwise fall back to result.get("data", {}).get("items", []); keep the
log(f"Fetched {len(items)} items from Apify") and ensure the function returns
the items list.

return items
Comment on lines +107 to +112

🔴 Apify dataset items API returns a JSON array, not a {data: {items: []}} object

The Apify API endpoint /v2/actor-runs/{runId}/dataset/items returns a raw JSON array of items, not an object with a data.items wrapper. At agents/apify_dubai_scraper.py:109-110, apify_request parses the response with json.loads() which yields a Python list. Then result.get("data", {}).get("items", []) will raise AttributeError: 'list' object has no attribute 'get' because lists don't have .get(). This means fetch_results() will always crash after a successful actor run.

Suggested change
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
     result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    if isinstance(result, list):
+        items = result
+    else:
+        items = result.get("data", {}).get("items", [])
     log(f"Fetched {len(items)} items from Apify")
     return items
```
Open in Devin Review




```python
def normalize_lead(item: dict, area: str) -> dict | None:
    """Convert Apify result to our lead format."""
    # Apify actor returns various fields - normalize them
    phone = (
        item.get("phone") or
        item.get("contactPhone") or
        item.get("agentPhone") or
        item.get("ownerPhone") or ""
    )
    name = (
        item.get("agentName") or
        item.get("ownerName") or
        item.get("contactName") or
        "Unknown"
    )
    if not phone:
        return None

    return {
        "name": name.strip(),
        "phone": phone.strip(),
        "all_phones": [phone.strip()],
        "area": item.get("location") or item.get("area") or area or "Dubai",
        "type": "villa",
        "price": str(item.get("price", "")),
        "url": item.get("url") or item.get("propertyUrl", ""),
        "source": item.get("source") or "Apify/Dubai",
        "direct_owner": item.get("directOwner", False),
        "unit_number": item.get("unitNumber", ""),
        "scraped_at": now_iso()
    }


def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```
Comment on lines +148 to +158

⚠️ Potential issue | 🟠 Major

Serialize writes to the shared leads store.

This code loads data/state/villa_leads.json, appends in memory, and rewrites the whole file. agents/dubai_villa_scraper.py does the same against the same path, so when two runs overlap, the later save overwrites the file and the earlier run's leads are lost.

Also applies to: 198-201

🧰 Tools
🪛 Ruff (0.15.4)

[warning] 152-152: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents/apify_dubai_scraper.py` around lines 148 - 158, Concurrent runs
overwrite the shared JSON leads store because load_existing_leads and save_leads
read the whole file, mutate in-memory, and rewrite it; change these functions to
perform concurrency-safe updates (e.g., acquire a file lock around
read-modify-write or switch to an append-only/JSONL writer) so overlapping
scrapers don't lose data. Specifically, update load_existing_leads, save_leads
and any callers that append to LEADS_FILE so they obtain an exclusive lock on
LEADS_FILE (or open it in append mode for JSONL) before reading/writing, merge
new leads into the existing set safely, and use atomic replace/rename when
writing to avoid partial writes; ensure the same locking/format is used by
agents/dubai_villa_scraper.py to keep behavior consistent.
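
A sketch of one way to serialize the read-modify-write, assuming a POSIX host (`fcntl` advisory locks) and the module's `LEADS_FILE` path; `save_leads_locked` is an illustrative name, not existing code:

```python
import fcntl
import json
import os
from pathlib import Path

def save_leads_locked(leads: list, leads_file: Path) -> None:
    """Write the full leads list under an exclusive advisory lock, via a
    temp file + os.replace so readers never observe a partial file."""
    leads_file.parent.mkdir(parents=True, exist_ok=True)
    lock_path = leads_file.with_suffix(".lock")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other writers release
        tmp = leads_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(leads, indent=2, ensure_ascii=False),
                       encoding="utf-8")
        os.replace(tmp, leads_file)  # atomic within one filesystem
```

For full safety, the load would need to happen under the same lock so the merge sees the other scraper's writes; both agents would have to adopt the shared lock file for this to help.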



```python
def deduplicate(existing: list, new_leads: list) -> tuple[list, int]:
    existing_phones = {lead["phone"] for lead in existing}
    unique_new = []
    for lead in new_leads:
        if lead["phone"] not in existing_phones:
            unique_new.append(lead)
            existing_phones.add(lead["phone"])
    return unique_new, len(new_leads) - len(unique_new)


def main() -> None:
    parser = argparse.ArgumentParser(description="Apify Dubai Villa Scraper")
    parser.add_argument("--area", type=str, default="", help="Area in Dubai (e.g. 'Palm Jumeirah')")
    parser.add_argument("--max", type=int, default=50, help="Max leads to scrape")
    parser.add_argument("--token", type=str, default=os.environ.get("APIFY_TOKEN", ""), help="Apify API token")
    args = parser.parse_args()

    if not args.token:
        log("ERROR: APIFY_TOKEN not set. Add it to .env or pass --token")
        print('__RESULT__:{"status":"error","message":"APIFY_TOKEN not set"}')
        return

    log("=== Apify Dubai Villa Scraper Started ===")

    run_id = run_actor(args.token, args.area, args.max)
    if not run_id:
        print('__RESULT__:{"status":"error","message":"Failed to start actor"}')
        return

    success = wait_for_run(args.token, run_id)
    if not success:
        print('__RESULT__:{"status":"error","message":"Actor run failed"}')
        return

    raw_items = fetch_results(args.token, run_id)
    new_leads = [n for item in raw_items if (n := normalize_lead(item, args.area)) is not None]

    existing = load_existing_leads()
    unique_leads, dupes = deduplicate(existing, new_leads)
    all_leads = existing + unique_leads
    save_leads(all_leads)

    log(f"=== Done. New: {len(unique_leads)}, Skipped: {dupes}, Total: {len(all_leads)} ===")

    summary = {
        "status": "ok",
        "new_leads": len(unique_leads),
        "total_leads": len(all_leads),
        "duplicates_skipped": dupes,
        "leads": unique_leads
    }
    print(f"\n__RESULT__:{json.dumps(summary)}")


if __name__ == "__main__":
    main()
```
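
One further observation on the dedup step: it keys on the raw phone string, so formatting variants of the same number slip past it. A self-contained sketch of a normalizer that could be applied before comparison (`normalize_phone` is illustrative, not in the repo; the UAE country-code handling is an assumption about the data):

```python
import re

def normalize_phone(raw: str) -> str:
    """Collapse spacing/punctuation/prefix variants of the same UAE number
    so a phone-keyed dedup treats them as one lead. Illustrative only."""
    digits = re.sub(r"\D", "", raw)      # strip +, spaces, dashes, parens
    if digits.startswith("00"):          # international 00-prefix -> bare
        digits = digits[2:]
    if digits.startswith("0"):           # local 05x format -> country code
        digits = "971" + digits[1:]
    return "+" + digits
```

With this, "050 123 4567", "+971 50-123-4567", and "00971501234567" all reduce to the same key instead of three "unique" leads.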