2 changes: 1 addition & 1 deletion .claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
"name": "apify-ultimate-scraper",
"source": "./skills/apify-ultimate-scraper",
"skills": "./",
"description": "Universal AI-powered web scraper for 55+ platforms. Scrape data from Instagram, Facebook, TikTok, YouTube, Google Maps, Google Search, Google Trends, Booking.com, TripAdvisor, Amazon, Walmart, eBay, and more for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, e-commerce pricing, and reviews",
"description": "Universal AI-powered web scraper for 15+ platforms with ~100 Actors. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, TripAdvisor, and more for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, SEO intelligence, recruitment, and reviews",
"keywords": [
"scraping",
"web-scraper",
2 changes: 1 addition & 1 deletion .claude-plugin/plugin.json
@@ -1,6 +1,6 @@
{
"name": "apify-agent-skills",
"version": "1.6.1",
"version": "2.0.0",
"description": "Official Apify agent skills for web scraping, data extraction, and automation",
"author": {
"name": "Apify",
2 changes: 1 addition & 1 deletion agents/AGENTS.md
@@ -15,7 +15,7 @@ IMPORTANT: You MUST read the SKILL.md file whenever the description of the skill
apify-actor-development: `Develop, debug, and deploy Apify Actors - serverless cloud programs for web scraping, automation, and data processing. Use when creating new Actors, modifying existing ones, or troubleshooting Actor code.`
apify-actorization: `Convert existing projects into Apify Actors - serverless cloud programs. Actorize JavaScript/TypeScript (SDK with Actor.init/exit), Python (async context manager), or any language (CLI wrapper). Use when migrating code to Apify, wrapping CLI tools as Actors, or adding Actor SDK to existing projects.`
apify-generate-output-schema: `Generate output schemas (dataset_schema.json, output_schema.json, key_value_store_schema.json) for an Apify Actor by analyzing its source code. Use when creating or updating Actor output schemas.`
apify-ultimate-scraper: `Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, Google Maps, Google Search, Google Trends, Booking.com, and TripAdvisor. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, or any data extraction task.`
apify-ultimate-scraper: `Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, and 15+ more platforms. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, review analysis, SEO intelligence, recruitment, or any data extraction task.`
</available_skills>

Paths referenced within SKILL.md files are relative to that SKILL folder. For example `reference/workflows.md` refers to the workflows file inside the skill's reference folder.
13 changes: 4 additions & 9 deletions commands/create-actor.md
@@ -42,17 +42,12 @@ Initial request: $ARGUMENTS

**Actions**:
1. Check if Apify CLI is installed: `apify --help`
2. If not installed, guide user to install:
```bash
curl -fsSL https://apify.com/install-cli.sh | bash
# Or: brew install apify-cli (Mac)
# Or: npm install -g apify-cli
```
2. If not installed, install via a package manager: `npm install -g apify-cli` (or `brew install apify-cli` on macOS). Do NOT install by piping remote scripts to a shell.
3. Verify authentication: `apify info`
4. If not logged in:
- Check for APIFY_TOKEN environment variable
- If missing, ask user to generate token at https://console.apify.com/settings/integrations
- Login with: `apify login -t $APIFY_TOKEN`
- Authenticate using OAuth (opens browser): `apify login`
   - If a browser isn't available, ensure the `APIFY_TOKEN` env var is exported (the CLI reads it automatically)
- If user doesn't have a token, generate one at https://console.apify.com/settings/integrations

---

1,014 changes: 1,014 additions & 0 deletions docs/superpowers/plans/2026-03-28-ultimate-scraper-restructure.md

Large diffs are not rendered by default.

232 changes: 232 additions & 0 deletions docs/superpowers/specs/2026-03-28-ultimate-scraper-skill-redesign.md
@@ -0,0 +1,232 @@
# Ultimate scraper skill redesign

## Context

The `apify-ultimate-scraper` skill was recently migrated from raw REST API scripts to Apify CLI commands. This redesign addresses the next layer: the skill's information architecture. The current design loads a ~400-line monolithic Actor index every time, spends most of its token budget on Actor selection (which agents handle well), and provides almost no help with the two primary failure modes: wrong input configuration and inability to pipe Actor outputs into subsequent steps.

**Problem:** The skill optimizes for the wrong thing. 50% of usage is quick targeted scrapes (where loading 400 lines of index is waste), and 50% is multi-step workflows (where the skill provides no data-piping guidance).

**Goal:** Restructure the skill into a three-layer progressive disclosure architecture that's lean for simple tasks and rich for complex ones, while adding gotchas and cost guardrails to prevent the most common mistakes.

## Architecture: three layers

### Layer 1: Lean Actor index (`references/actor-index.md`, ~100 lines)

A flat Markdown lookup table organized by platform. Three columns only: `Actor ID | Tier | Best for (5 words max)`. Always loaded when the skill triggers.

Purpose: fast Actor selection. Does NOT contain input schemas, output fields, or workflow instructions. Those are in layers 2 and 3.

Example format:

```markdown
## Instagram

| Actor | Tier | Best for |
|-------|------|----------|
| apify/instagram-profile-scraper | apify | profiles, followers, bio |
| apify/instagram-post-scraper | apify | posts, likes, comments |
| apify/instagram-hashtag-scraper | apify | hashtag posts, trends |
```

For Actors not in the index, the agent uses `apify actors search "QUERY" --json` to discover dynamically.
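A minimal sketch of that discovery step follows. The JSON field names (`username`, `name`, `title`) are assumptions about the search output shape, not the CLI's documented schema, so the sample is written against a saved file:

```shell
# Hypothetical output of `apify actors search "glassdoor reviews" --json`,
# saved to a file; the real shape may differ.
cat > search-results.json <<'EOF'
[
  {"username": "someuser", "name": "glassdoor-reviews-scraper", "title": "Glassdoor Reviews Scraper"},
  {"username": "another", "name": "glassdoor-jobs-scraper", "title": "Glassdoor Jobs Scraper"}
]
EOF

# Shortlist candidates as "owner/name<TAB>title" for Actor selection:
jq -r '.[] | "\(.username)/\(.name)\t\(.title)"' search-results.json
```

The agent then picks one candidate and proceeds to the layer 3 schema fetch for it.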

### Layer 2: Use-case workflow guides (`references/workflows/*.md`)

Rich multi-step pipeline guides with explicit Actor chaining and data-piping instructions. Loaded only when the task involves multiple Actors or a recognized use case.

Ten workflow files:

1. `lead-generation.md` - business contacts, email extraction, B2B prospecting
2. `competitive-intel.md` - ad monitoring, pricing, market positioning
3. `influencer-vetting.md` - profile discovery, audience analysis, engagement vetting
4. `brand-monitoring.md` - mentions, sentiment, hashtag tracking
5. `review-analysis.md` - cross-platform review aggregation (Google Maps, Yelp, TripAdvisor, Airbnb)
6. `content-and-seo.md` - SERP analysis, web crawling, content extraction for RAG
7. `social-media-analytics.md` - engagement metrics, content performance across platforms
8. `trend-research.md` - Google Trends, TikTok trends, hashtag analytics, seasonal demand
9. `job-market-and-recruitment.md` - LinkedIn jobs, candidate sourcing, skill-gap analysis
10. `real-estate-and-hospitality.md` - listing pipelines, market analysis, pricing comparison

Each workflow guide follows a consistent structure:

```markdown
# [Use case] workflows

## [Specific scenario name]
**When:** [One-line trigger condition]

### Pipeline
1. **[Step name]** -> `actor/id`
- Key input: `field1`, `field2`, `field3`
2. **[Step name]** -> `actor/id`
- Pipe: `results[].fieldX` -> `inputFieldY`
- Key input: `startUrls`, `maxRequestsPerCrawl`

### Output fields
Step 1: `field1`, `field2`, `field3`
Step 2: `field1`, `field2`, `field3`

### Gotcha
[Workflow-specific pitfall, if any]
```

Key elements:
- **Pipe instructions** - explicit field mappings for chaining Actor outputs to inputs
- **Key input fields** - the 2-3 most important params (NOT the full schema - that's fetched dynamically)
- **Output fields** - what each step returns (enables the agent to know what's available for piping or presenting)
- **Gotcha** - per-workflow pitfall where relevant
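To make the pipe instructions concrete, here is how `results[].fieldX -> inputFieldY` can translate into a `jq` transform between two Actor runs. The field names (`website`, `startUrls`) and the crawler Actor are illustrative examples, not a prescribed schema:

```shell
# Step 1 output, as it would come from `apify datasets get-items DATASET_ID --json`:
cat > step1-items.json <<'EOF'
[
  {"name": "Cafe A", "website": "https://cafe-a.example"},
  {"name": "Cafe B", "website": "https://cafe-b.example"}
]
EOF

# Apply the pipe instruction `results[].website -> startUrls`:
jq '{startUrls: [.[] | {url: .website}], maxRequestsPerCrawl: 10}' \
  step1-items.json > step2-input.json

# Then run step 2 with the generated input, e.g.:
# apify actors call "apify/website-content-crawler" --input-file=step2-input.json --json
```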

### Layer 3: Dynamic schema fetching (runtime, no files)

Input schemas are always fetched at runtime via:

```bash
apify actors info "ACTOR_ID" --input --json
```

This eliminates stale pre-cached schemas and ensures the agent always sees the current parameter set. The workflow guides provide only the 2-3 key input fields as hints - the full schema comes from the CLI.
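Once fetched, the agent can surface the required and available parameters before building an input. The schema fragment below is hypothetical (the exact shape returned by `--input --json` may differ):

```shell
# Hypothetical input schema fragment for illustration:
cat > input-schema.json <<'EOF'
{
  "title": "Instagram Profile Scraper input",
  "properties": {
    "usernames": {"type": "array", "description": "Profiles to scrape"},
    "resultsLimit": {"type": "integer", "description": "Max results per profile"}
  },
  "required": ["usernames"]
}
EOF

# Fields the input MUST include:
jq -r '.required[]' input-schema.json
# All available parameters (jq's `keys` sorts alphabetically):
jq -r '.properties | keys[]' input-schema.json
```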

For Actor documentation:

```bash
apify actors info "ACTOR_ID" --readme
```

## Cross-cutting: gotchas and cost guardrails (`references/gotchas.md`)

A single reference file (~60 lines) covering:

### Pricing models
- FREE: no per-result cost
- PAY_PER_EVENT (PPE): charged per result - MUST check pricing before running
- FLAT_PRICE_PER_MONTH: subscription model

### Cost estimation protocol
Before running PPE Actors:
1. Read `.currentPricingInfo` from `apify actors info "ACTOR_ID" --json`
2. Calculate: `pricePerEvent * requestedResults`
3. Warn user if estimated cost > $5
4. Require explicit confirmation for > $20
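The estimation step (2-4) can be sketched in plain shell. The per-event price here is a hardcoded example standing in for the value read from `.currentPricingInfo` in step 1; the exact subfield holding the price is an assumption:

```shell
price_per_event=0.002   # USD per result - example value from step 1
requested_results=5000

# Step 2: estimate total cost.
estimated=$(awk -v p="$price_per_event" -v n="$requested_results" \
  'BEGIN { printf "%.2f", p * n }')
echo "Estimated cost: \$${estimated}"

# Steps 3-4: apply the guardrails.
if awk -v c="$estimated" 'BEGIN { exit !(c > 20) }'; then
  echo "Cost exceeds \$20 - require explicit user confirmation."
elif awk -v c="$estimated" 'BEGIN { exit !(c > 5) }'; then
  echo "Cost exceeds \$5 - warn the user."
fi
```

With these example values the estimate is $10.00, which triggers the warning but not the hard confirmation gate.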

### Common pitfalls
- Cookie-dependent Actors (social media scrapers needing login)
- Rate limiting on large scrapes (use proxy configuration)
- Empty results from geo-restrictions or narrow queries
- `maxResults` vs `maxCrawledPages` confusion (different Actors use different limit fields)
- Deprecated Actors (check `.isDeprecated` in Actor info)

## File structure

```
skills/apify-ultimate-scraper/
├── SKILL.md # ~150 lines: workflow + routing
├── references/
│ ├── actor-index.md # ~100 lines: flat lookup by platform
│ ├── gotchas.md # ~60 lines: pitfalls + cost guardrails
│ └── workflows/
│ ├── lead-generation.md
│ ├── competitive-intel.md
│ ├── influencer-vetting.md
│ ├── brand-monitoring.md
│ ├── review-analysis.md
│ ├── content-and-seo.md
│ ├── social-media-analytics.md
│ ├── trend-research.md
│ ├── job-market-and-recruitment.md
│ └── real-estate-and-hospitality.md
```

Total files: 13 (SKILL.md + actor-index + gotchas + 10 workflows)

## SKILL.md workflow

The main skill file contains:

1. **Frontmatter** - name, description (trigger conditions)
2. **Prerequisites** - CLI version, authentication (OAuth-first)
3. **Workflow** (5 steps):
- Step 1: Understand goal, identify platform/use-case
- Step 2: Select Actor from `references/actor-index.md`, fetch input schema dynamically via `apify actors info --input --json`
- Step 3: If multi-step task, read matching workflow guide from `references/workflows/`
- Step 4: Review `references/gotchas.md` for pricing/cost traps. Run cost estimation for PPE Actors.
- Step 5: Run Actor(s) via CLI, fetch results, deliver to user
4. **Error handling** - table of common errors and resolutions
5. **`--json` policy** - reminder to always use `--json` flag

Routing logic for workflow guides:

```
If task mentions "lead" or "contact" or "email" -> lead-generation.md
If task mentions "competitor" or "ad" or "pricing" -> competitive-intel.md
If task mentions "influencer" or "creator" -> influencer-vetting.md
If task mentions "brand" or "mention" or "sentiment" -> brand-monitoring.md
If task mentions "review" or "rating" or "reputation" -> review-analysis.md
If task mentions "SEO" or "SERP" or "crawl" or "content" -> content-and-seo.md
If task mentions "analytics" or "engagement" or "performance" -> social-media-analytics.md
If task mentions "trend" or "keyword" or "hashtag" -> trend-research.md
If task mentions "job" or "recruit" or "candidate" or "hiring" -> job-market-and-recruitment.md
If task mentions "real estate" or "listing" or "property" or "hotel" -> real-estate-and-hospitality.md
```

This is high-freedom guidance (text-based), not rigid routing. The agent uses judgment.

## Token budget analysis

| Scenario | Files loaded | Estimated tokens |
|----------|-------------|-----------------|
| Simple scrape ("get Nike's Instagram") | SKILL.md + actor-index | ~250 lines (~2,500 tokens) |
| Targeted with gotchas check | SKILL.md + actor-index + gotchas | ~310 lines (~3,100 tokens) |
| Multi-step workflow | SKILL.md + actor-index + gotchas + 1 workflow | ~370 lines (~3,700 tokens) |
| Complex exploration | SKILL.md + actor-index + gotchas + 2 workflows | ~430 lines (~4,300 tokens) |

Current design loads ~590 lines regardless. The new design ranges from ~250 to ~430 lines depending on complexity. The simple case (50% of usage) cuts token usage by more than half.

## What changes from current design

| Aspect | Current | Redesigned |
|--------|---------|-----------|
| Actor index | ~400 lines monolithic, includes descriptions + workflows | ~100 lines, 3-column lookup only |
| Input schemas | Not provided (just "fetch via CLI") | Still fetched via CLI, but workflow guides provide key input hints |
| Output schemas | Not provided | Explicit per-step output field lists in workflow guides |
| Workflow guidance | None | 10 dedicated files with data-piping instructions |
| Gotchas | None | Dedicated reference file with pricing/cost/pitfall guidance |
| Cost estimation | Brief warning about 1,000+ results | Explicit protocol: check pricing, estimate cost, confirm with user |
| Token usage (simple task) | ~590 lines | ~250 lines |

## What does NOT change

- CLI commands (same as current: `actors search`, `actors info`, `actors call`, `datasets get-items`)
- Authentication flow (OAuth-first, env var fallback)
- `--json` policy (all CLI output via `--json`)
- Error handling table
- Resilience strategy (4 layers from the migration plan)
- Plugin metadata structure (plugin.json, marketplace.json, AGENTS.md)

## Implementation scope

### Must-do (this pass)
- Rewrite SKILL.md with new routing logic and workflow
- Create lean `references/actor-index.md` from existing actor-index data
- Create `references/gotchas.md`
- Create 10 workflow guide files with consistent structure
- Populate workflow guides with at least 1-2 pipelines each (skeleton + key examples)
- Delete old `references/actor-index.md` (the current ~400 line version)

### Deferred (second pass by user)
- Enriching workflow guides with additional pipeline examples
- Adding more Actors to the index as they're discovered/tested
- Building an eval framework for skill testing
- Adding skill memory/run history
- Per-Actor gotchas (currently only cross-cutting gotchas)

## Verification

1. Load the skill in Claude Code and test a simple scrape ("get 10 Instagram profiles for @nike")
- Verify: agent loads SKILL.md + actor-index only, picks right Actor, fetches schema via CLI
2. Test a multi-step workflow ("build me a lead list of restaurants in Prague with emails")
- Verify: agent loads lead-generation.md, follows the pipeline, pipes data correctly
3. Test a PPE Actor ("scrape Amazon product reviews")
- Verify: agent checks gotchas.md, estimates cost, warns before running
4. Test dynamic discovery ("scrape Glassdoor company reviews")
- Verify: agent can't find in index, uses `apify actors search`, fetches schema dynamically
5. Test in Gemini CLI via AGENTS.md to verify cross-agent compatibility
10 changes: 3 additions & 7 deletions skills/apify-actor-development/SKILL.md
@@ -40,18 +40,14 @@ When the apify CLI is installed, check that it is logged in with:
apify info # Should return your username
```

If it is not logged in, check if the `APIFY_TOKEN` environment variable is defined (if not, ask the user to generate one on https://console.apify.com/settings/integrations and then define `APIFY_TOKEN` with it).

Then authenticate using one of these methods:
If not logged in, authenticate using OAuth (opens browser):

```bash
# Option 1 (preferred): The CLI automatically reads APIFY_TOKEN from the environment.
# Just ensure the env var is exported and run any apify command — no explicit login needed.

# Option 2: Interactive login (prompts for token without exposing it in shell history)
apify login
```

If browser login isn't available (headless environment or CI), the CLI automatically reads `APIFY_TOKEN` from the environment. Ensure the env var is exported and run any apify command - no explicit login needed. If the user doesn't have a token, generate one at https://console.apify.com/settings/integrations.

> **Security note:** Avoid passing tokens as command-line arguments (e.g. `apify login -t <token>`).
> Arguments are visible in process listings and may be recorded in shell history.
> Prefer environment variables or interactive login instead.
14 changes: 5 additions & 9 deletions skills/apify-actorization/SKILL.md
@@ -40,29 +40,25 @@ npm install -g apify-cli
```

> **Security note:** Do NOT install the CLI by piping remote scripts to a shell
> (e.g. `curl | bash` or `irm | iex`). Always use a package manager.
> (e.g. `curl ... | bash` or `irm ... | iex`). Always use a package manager.

Verify CLI is logged in:

```bash
apify info # Should return your username
```

If not logged in, check if the `APIFY_TOKEN` environment variable is defined (if not, ask the user to generate one at https://console.apify.com/settings/integrations and then define `APIFY_TOKEN` with it).

Then authenticate using one of these methods:
If not logged in, authenticate using OAuth (opens browser):

```bash
# Option 1 (preferred): The CLI automatically reads APIFY_TOKEN from the environment.
# Just ensure the env var is exported and run any apify command — no explicit login needed.

# Option 2: Interactive login (prompts for token without exposing it in shell history)
apify login
```

If browser login isn't available (headless environment or CI), ensure the `APIFY_TOKEN` environment variable is exported. The CLI reads it automatically - no explicit login needed. If the user doesn't have a token, generate one at https://console.apify.com/settings/integrations.

> **Security note:** Avoid passing tokens as command-line arguments (e.g. `apify login -t <token>`).
> Arguments are visible in process listings and may be recorded in shell history.
> Prefer environment variables or interactive login instead.
> Prefer OAuth login or environment variables instead.
> Never log, print, or embed `APIFY_TOKEN` in source code or configuration files.
> Use a token with the minimum required permissions (scoped token) and rotate it periodically.
