diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index 0c5e9d1..3b68c0f 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -13,7 +13,7 @@ "name": "apify-ultimate-scraper", "source": "./skills/apify-ultimate-scraper", "skills": "./", - "description": "Universal AI-powered web scraper for 55+ platforms. Scrape data from Instagram, Facebook, TikTok, YouTube, Google Maps, Google Search, Google Trends, Booking.com, TripAdvisor, Amazon, Walmart, eBay, and more for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, e-commerce pricing, and reviews", + "description": "Universal AI-powered web scraper for 15+ platforms with ~100 Actors. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, TripAdvisor, and more for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, SEO intelligence, recruitment, and reviews", "keywords": [ "scraping", "web-scraper", diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json index 4f11558..7f3c02f 100644 --- a/.claude-plugin/plugin.json +++ b/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "apify-agent-skills", - "version": "1.6.1", + "version": "2.0.0", "description": "Official Apify agent skills for web scraping, data extraction, and automation", "author": { "name": "Apify", diff --git a/agents/AGENTS.md b/agents/AGENTS.md index 4bd726d..d336d9a 100644 --- a/agents/AGENTS.md +++ b/agents/AGENTS.md @@ -15,7 +15,7 @@ IMPORTANT: You MUST read the SKILL.md file whenever the description of the skill apify-actor-development: `Develop, debug, and deploy Apify Actors - serverless cloud programs for web scraping, automation, and data processing. 
Use when creating new Actors, modifying existing ones, or troubleshooting Actor code.` apify-actorization: `Convert existing projects into Apify Actors - serverless cloud programs. Actorize JavaScript/TypeScript (SDK with Actor.init/exit), Python (async context manager), or any language (CLI wrapper). Use when migrating code to Apify, wrapping CLI tools as Actors, or adding Actor SDK to existing projects.` apify-generate-output-schema: `Generate output schemas (dataset_schema.json, output_schema.json, key_value_store_schema.json) for an Apify Actor by analyzing its source code. Use when creating or updating Actor output schemas.` -apify-ultimate-scraper: `Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, Google Maps, Google Search, Google Trends, Booking.com, and TripAdvisor. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, or any data extraction task.` +apify-ultimate-scraper: `Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, and 15+ more platforms. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, review analysis, SEO intelligence, recruitment, or any data extraction task.` Paths referenced within SKILL.md files are relative to that SKILL folder. For example `reference/workflows.md` refers to the workflows file inside the skill's reference folder. diff --git a/commands/create-actor.md b/commands/create-actor.md index 9f886f8..b97b162 100644 --- a/commands/create-actor.md +++ b/commands/create-actor.md @@ -42,17 +42,12 @@ Initial request: $ARGUMENTS **Actions**: 1. Check if Apify CLI is installed: `apify --help` -2. 
If not installed, guide user to install: - ```bash - curl -fsSL https://apify.com/install-cli.sh | bash - # Or: brew install apify-cli (Mac) - # Or: npm install -g apify-cli - ``` +2. If not installed, install via package manager: `npm install -g apify-cli` (or `brew install apify-cli` on Mac). Do NOT install by piping remote scripts to a shell. 3. Verify authentication: `apify info` 4. If not logged in: - - Check for APIFY_TOKEN environment variable - - If missing, ask user to generate token at https://console.apify.com/settings/integrations - - Login with: `apify login -t $APIFY_TOKEN` + - Authenticate using OAuth (opens browser): `apify login` + - If browser isn't available, ensure `APIFY_TOKEN` env var is exported (the CLI reads it automatically) + - If user doesn't have a token, generate one at https://console.apify.com/settings/integrations --- diff --git a/docs/superpowers/plans/2026-03-28-ultimate-scraper-restructure.md b/docs/superpowers/plans/2026-03-28-ultimate-scraper-restructure.md new file mode 100644 index 0000000..e427b59 --- /dev/null +++ b/docs/superpowers/plans/2026-03-28-ultimate-scraper-restructure.md @@ -0,0 +1,1014 @@ +# Ultimate Scraper Skill Restructure Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Restructure the ultimate-scraper skill from a monolithic 400-line index into a three-layer progressive disclosure architecture: lean Actor lookup + rich use-case workflow guides + dynamic schema fetching. + +**Architecture:** Flat Markdown index (~100 lines) for Actor selection, 10 use-case workflow guides with explicit data-piping instructions, cross-cutting gotchas/cost guardrails file. Input schemas fetched dynamically via CLI at runtime. 
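+The cost-guardrail logic this plan specifies (estimate PPE cost, warn above $5, require confirmation above $20) can be sketched as a small pure function. This is a minimal illustration, not part of the skill itself: the payload shape mirrors the `.currentPricingInfo` fields the plan reads from `apify actors info "ACTOR_ID" --json`, but the sample price below is made up for the example.
+
+```python
+import json
+
+WARN_THRESHOLD = 5.0      # above this, warn explicitly (per the plan's protocol)
+CONFIRM_THRESHOLD = 20.0  # above this, require explicit user confirmation
+
+def estimate_cost(actor_info: dict, result_count: int) -> dict:
+    """Estimate run cost for a PAY_PER_EVENT Actor and pick a guardrail action."""
+    pricing = actor_info.get("currentPricingInfo", {})
+    if pricing.get("pricingModel") != "PAY_PER_EVENT":
+        # FREE / flat-rate Actors have no per-result estimate
+        return {"estimate": 0.0, "action": "none"}
+    estimate = pricing.get("pricePerEvent", 0.0) * result_count
+    if estimate > CONFIRM_THRESHOLD:
+        action = "confirm"
+    elif estimate > WARN_THRESHOLD:
+        action = "warn"
+    else:
+        action = "proceed"
+    return {"estimate": round(estimate, 2), "action": action}
+
+# Illustrative payload, as if parsed from `apify actors info ... --json`
+# (the $0.01/event price is a placeholder, not real Actor pricing).
+info = json.loads(
+    '{"currentPricingInfo": {"pricingModel": "PAY_PER_EVENT", "pricePerEvent": 0.01}}'
+)
+print(estimate_cost(info, 500))   # -> {'estimate': 5.0, 'action': 'proceed'}
+print(estimate_cost(info, 5000))  # -> {'estimate': 50.0, 'action': 'confirm'}
+```
+
+The thresholds match the cost estimation protocol in `gotchas.md`; swap in the real `pricePerEvent` fetched at runtime.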
+ +**Tech Stack:** Markdown (all files), Apify CLI v1.4.0+ (runtime commands) + +--- + +### Task 1: Create lean actor-index.md + +**Files:** +- Create: `skills/apify-ultimate-scraper/references/actor-index.md` + +- [ ] **Step 1: Write the lean Actor index** + +Delete the existing `references/actor-index.md` (~400 lines) and replace it with the lean version below. Three columns only: `Actor | Tier | Best for`. No descriptions longer than 5 words. No workflows (those move to workflow guides). Organized by platform. + +```markdown +# Actor index + +Flat lookup for Actor selection. For input schemas, fetch dynamically: +`apify actors info "ACTOR_ID" --input --json` + +Tiers: `apify` = Apify-maintained (always prefer), `community` = community-maintained (fill gaps). + +## Instagram + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/instagram-scraper | apify | all Instagram data | +| apify/instagram-profile-scraper | apify | profiles, followers, bio | +| apify/instagram-post-scraper | apify | posts, engagement metrics | +| apify/instagram-comment-scraper | apify | post and reel comments | +| apify/instagram-hashtag-scraper | apify | posts by hashtag | +| apify/instagram-hashtag-analytics-scraper | apify | hashtag metrics, trends | +| apify/instagram-reel-scraper | apify | reels, transcripts, engagement | +| apify/instagram-api-scraper | apify | API-based, no login | +| apify/instagram-search-scraper | apify | search users, places | +| apify/instagram-tagged-scraper | apify | tagged/mentioned posts | +| apify/instagram-topic-scraper | apify | posts by topic | +| apify/instagram-followers-count-scraper | apify | follower count tracking | +| apify/export-instagram-comments-posts | apify | bulk posts + comments | + +## Facebook + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/facebook-posts-scraper | apify | posts, videos, engagement | +| apify/facebook-comments-scraper | apify | comment extraction | +| apify/facebook-likes-scraper | 
apify | reactions, liker info | +| apify/facebook-groups-scraper | apify | public group content | +| apify/facebook-events-scraper | apify | events, attendees | +| apify/facebook-reels-scraper | apify | reels, engagement | +| apify/facebook-photos-scraper | apify | photos with OCR | +| apify/facebook-search-scraper | apify | page search | +| apify/facebook-marketplace-scraper | apify | marketplace listings | +| apify/facebook-followers-following-scraper | apify | follower lists | +| apify/facebook-video-search-scraper | apify | video search | +| apify/facebook-ads-scraper | apify | ad library, creatives | +| apify/facebook-page-contact-information | apify | page contact info | +| apify/facebook-reviews-scraper | apify | page reviews | +| apify/facebook-hashtag-scraper | apify | hashtag posts | +| apify/threads-profile-api-scraper | apify | Threads profiles | + +## TikTok + +| Actor | Tier | Best for | +|-------|------|----------| +| clockworks/tiktok-scraper | apify | all TikTok data | +| clockworks/tiktok-profile-scraper | apify | profiles, videos | +| clockworks/tiktok-video-scraper | apify | video details, metrics | +| clockworks/tiktok-comments-scraper | apify | video comments | +| clockworks/tiktok-hashtag-scraper | apify | videos by hashtag | +| clockworks/tiktok-followers-scraper | apify | follower profiles | +| clockworks/tiktok-user-search-scraper | apify | user search | +| clockworks/tiktok-sound-scraper | apify | videos by sound | +| clockworks/free-tiktok-scraper | apify | free tier extraction | +| clockworks/tiktok-ads-scraper | apify | hashtag analytics | +| clockworks/tiktok-trends-scraper | apify | trending content | +| clockworks/tiktok-explore-scraper | apify | explore categories | +| clockworks/tiktok-discover-scraper | apify | discover by hashtag | + +## YouTube + +| Actor | Tier | Best for | +|-------|------|----------| +| streamers/youtube-scraper | apify | videos, metrics | +| streamers/youtube-channel-scraper | apify | channel info | +| 
streamers/youtube-comments-scraper | apify | video comments | +| streamers/youtube-shorts-scraper | apify | shorts data | +| streamers/youtube-video-scraper-by-hashtag | apify | videos by hashtag | +| streamers/youtube-video-downloader | apify | video download | +| curious_coder/youtube-transcript-scraper | community | transcripts, captions | + +## X/Twitter + +| Actor | Tier | Best for | +|-------|------|----------| +| apidojo/tweet-scraper | community | tweet search | +| apidojo/twitter-scraper-lite | community | comprehensive, no limits | +| apidojo/twitter-user-scraper | community | user profiles | +| apidojo/twitter-profile-scraper | community | profiles + recent tweets | +| apidojo/twitter-list-scraper | community | tweets from lists | + +## LinkedIn + +| Actor | Tier | Best for | +|-------|------|----------| +| harvestapi/linkedin-profile-search | community | find profiles | +| harvestapi/linkedin-profile-scraper | community | profile with email | +| harvestapi/linkedin-company | community | company details | +| harvestapi/linkedin-company-employees | community | employee lists | +| harvestapi/linkedin-company-posts | community | company page posts | +| harvestapi/linkedin-profile-posts | community | profile posts | +| harvestapi/linkedin-job-search | community | job listings | +| harvestapi/linkedin-post-search | community | post search | +| harvestapi/linkedin-post-comments | community | post comments | +| harvestapi/linkedin-profile-search-by-name | community | find by name | +| harvestapi/linkedin-profile-search-by-services | community | find by service | +| apimaestro/linkedin-companies-search-scraper | community | company search | +| apimaestro/linkedin-company-detail | community | company deep data | +| apimaestro/linkedin-jobs-scraper-api | community | job search | +| apimaestro/linkedin-job-detail | community | job details | +| apimaestro/linkedin-batch-profile-posts-scraper | community | batch profile posts | +| apimaestro/linkedin-post-reshares | 
community | post reshares | +| apimaestro/linkedin-post-detail | community | post details | +| apimaestro/linkedin-profile-full-sections-scraper | community | full profile data | +| dev_fusion/linkedin-profile-scraper | community | mass scraping + email | + +## Google Maps + +| Actor | Tier | Best for | +|-------|------|----------| +| compass/crawler-google-places | apify | business listings | +| compass/google-maps-extractor | apify | detailed business data | +| compass/Google-Maps-Reviews-Scraper | apify | reviews, ratings | +| compass/enrich-google-maps-dataset-with-contacts | apify | email enrichment | +| compass/contact-details-scraper-standby | apify | quick contact extract | +| lukaskrivka/google-maps-with-contact-details | community | listings + contacts | +| curious_coder/google-maps-reviews-scraper | community | cheap review scraping | + +## Google Search and Trends + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/google-search-scraper | apify | SERP, ads, AI overviews | +| apify/google-trends-scraper | apify | trend data | +| tri_angle/bing-search-scraper | apify | Bing SERP data | + +## Reviews (cross-platform) + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/hotel-review-aggregator | apify | 7-platform hotel reviews | +| tri_angle/restaurant-review-aggregator | apify | 6-platform restaurant reviews | +| tri_angle/yelp-scraper | apify | Yelp business data | +| tri_angle/yelp-review-scraper | apify | Yelp reviews | +| tri_angle/get-tripadvisor-urls | apify | find TripAdvisor URLs | +| tri_angle/get-yelp-urls | apify | find Yelp URLs | +| tri_angle/airbnb-reviews-scraper | apify | Airbnb reviews | +| tri_angle/social-media-sentiment-analysis-tool | apify | sentiment analysis | + +## Real estate and hospitality + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/airbnb-scraper | apify | Airbnb listings | +| tri_angle/new-fast-airbnb-scraper | apify | fast Airbnb search | +| 
tri_angle/airbnb-rooms-urls-scraper | apify | detailed room data | +| tri_angle/redfin-search | apify | Redfin property search | +| tri_angle/redfin-detail | apify | Redfin property details | +| tri_angle/real-estate-aggregator | apify | multi-source listings | +| tri_angle/fast-zoopla-properties-scraper | apify | UK properties | +| tri_angle/doordash-store-details-scraper | apify | DoorDash stores | +| tri_angle/cargurus-zipcode-search-scraper | apify | CarGurus listings | +| tri_angle/carmax-zipcode-search-scraper | apify | Carmax listings | + +## SEO tools + +| Actor | Tier | Best for | +|-------|------|----------| +| radeance/similarweb-scraper | community | traffic, rankings | +| radeance/ahrefs-scraper | community | backlinks, keywords | +| radeance/semrush-scraper | community | domain authority | +| radeance/moz-scraper | community | DA, spam score | +| radeance/ubersuggest-scraper | community | keyword suggestions | +| radeance/se-ranking-scraper | community | keyword CPC | + +## Content and web crawling + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/website-content-crawler | apify | clean text for AI | +| apify/rag-web-browser | apify | RAG pipelines | +| apify/web-scraper | apify | general web scraping | +| apify/cheerio-scraper | apify | fast HTML parsing | +| apify/playwright-scraper | apify | JS-heavy sites | +| apify/camoufox-scraper | apify | anti-bot sites | +| apify/sitemap-extractor | apify | sitemap URLs | +| lukaskrivka/article-extractor-smart | community | article extraction | + +## Other platforms + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/telegram-scraper | apify | Telegram messages | +| tri_angle/snapchat-scraper | apify | Snapchat profiles | +| tri_angle/snapchat-spotlight-scraper | apify | Snapchat Spotlight | +| tri_angle/truth-scraper | apify | Truth Social | +| tri_angle/social-media-finder | apify | cross-platform search | +| tri_angle/website-changes-detector | apify | website 
monitoring |
+| tri_angle/e-commerce-product-matching-tool | apify | product matching |
+| trudax/reddit-scraper-lite | community | Reddit posts |
+| janbuchar/github-contributors-scraper | community | GitHub contributors |
+
+## Enrichment and contacts
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| apify/social-media-leads-analyzer | apify | emails from websites |
+| apify/social-media-hashtag-research | apify | cross-platform hashtags |
+| apify/e-commerce-scraping-tool | apify | product data enrichment |
+| vdrmota/contact-info-scraper | community | contact extraction |
+| code_crafter/leads-finder | community | B2B leads |
+```
+
+- [ ] **Step 2: Verify the index has ~100 Actor entries in total**
+
+Count the entries visually. The index should have ~100 unique Actor IDs total across all platform sections. Some Actors appear in multiple use-case workflow guides but should appear only once in the index, under their primary platform.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add skills/apify-ultimate-scraper/references/actor-index.md
+git commit -m "refactor: replace monolithic actor index with lean platform-organized lookup"
+```
+
+---
+
+### Task 2: Create gotchas.md
+
+**Files:**
+- Create: `skills/apify-ultimate-scraper/references/gotchas.md`
+
+- [ ] **Step 1: Write the gotchas and cost guardrails file**
+
+```markdown
+# Gotchas and cost guardrails
+
+## Pricing models
+
+| Model | How it works | Action before running |
+|-------|-------------|----------------------|
+| FREE | No per-result cost, only platform compute | None needed |
+| PAY_PER_EVENT (PPE) | Charged per result item | MUST estimate cost first |
+| FLAT_PRICE_PER_MONTH | Monthly subscription | Verify user has active subscription |
+
+To check an Actor's pricing:
+
+    apify actors info "ACTOR_ID" --json
+
+Read `.currentPricingInfo.pricingModel` and `.currentPricingInfo.pricePerEvent`.
+
+## Cost estimation protocol
+
+Before running any PPE Actor:
+
+1. 
Get the per-event price from Actor info (`.currentPricingInfo.pricePerEvent`) +2. Multiply by the requested result count +3. Present the estimate to the user: "This will cost approximately $X for Y results" +4. If estimate > $5: warn explicitly +5. If estimate > $20: require explicit user confirmation before proceeding + +## Common pitfalls + +**Cookie-dependent Actors** +Some social media scrapers require cookies or login sessions. If an Actor returns auth errors or empty results unexpectedly, check its README: + + apify actors info "ACTOR_ID" --readme + +Look for mentions of "cookies", "login", "session", or "proxy". + +**Rate limiting on large scrapes** +Platforms throttle or block large-volume scraping. Mitigations: +- Use proxy configuration when available: `"proxyConfiguration": {"useApifyProxy": true}` +- Set reasonable concurrency limits (check the Actor's `maxConcurrency` input) +- For 1,000+ results, suggest splitting into smaller batches + +**Empty results** +Common causes: +- Too-narrow search query or geo-restriction (try broader terms) +- Platform blocking without proxy (enable Apify Proxy) +- Actor requires cookies/login but none provided +- Wrong input field name (always verify with `--input --json`) + +**maxResults vs maxCrawledPages** +Different Actors use different limit field names. Common variants: +- `maxResults`, `resultsLimit`, `maxItems` - limit output items +- `maxCrawledPages`, `maxRequestsPerCrawl` - limit pages visited +Always fetch the input schema to find the correct field for the specific Actor. + +**Deprecated Actors** +Check `.isDeprecated` in `apify actors info --json`. If `true`: +1. Search for alternatives: `apify actors search "SIMILAR_KEYWORDS" --json` +2. 
Prefer `apify` tier replacements over `community` alternatives + +**LinkedIn pricing** +LinkedIn Actors are all PPE and vary significantly: +- `harvestapi/` Actors: generally cheaper ($0.001-0.01/result) +- `apimaestro/` Actors: generally more expensive ($0.005-0.02/result) +- `dev_fusion/` Actors: mid-range, useful for mass scraping with email enrichment +Always compare pricing before selecting a LinkedIn Actor. + +**SEO tool pricing** +`radeance/` SEO scrapers (SimilarWeb, Ahrefs, SEMrush, Moz) have the highest per-result costs ($0.005-0.0275/result). For large-scale SEO analysis, estimate costs carefully and suggest batching. +``` + +- [ ] **Step 2: Commit** + +```bash +git add skills/apify-ultimate-scraper/references/gotchas.md +git commit -m "feat: add gotchas and cost guardrails reference" +``` + +--- + +### Task 3: Create workflow guides (batch 1 of 2) + +**Files:** +- Create: `skills/apify-ultimate-scraper/references/workflows/lead-generation.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/competitive-intel.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/influencer-vetting.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/brand-monitoring.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/review-analysis.md` + +- [ ] **Step 1: Create the workflows directory** + +```bash +mkdir -p skills/apify-ultimate-scraper/references/workflows +``` + +- [ ] **Step 2: Write lead-generation.md** + +```markdown +# Lead generation workflows + +## Local business leads with email enrichment +**When:** User wants business contacts, emails, or phone numbers for businesses in a specific location. + +### Pipeline +1. **Find businesses** -> `compass/crawler-google-places` + - Key input: `searchStringsArray`, `locationQuery`, `maxCrawledPlaces` +2. 
**Enrich with contacts** -> `compass/enrich-google-maps-dataset-with-contacts` + - Pipe: `results[].url` -> `startUrls` (or pass the dataset ID directly) + - Key input: `datasetId` (from step 1), `maxRequestsPerCrawl` + +### Output fields +Step 1: `title`, `address`, `phone`, `website`, `categoryName`, `totalScore`, `reviewsCount`, `url` +Step 2: `emails[]`, `phones[]`, `socialLinks`, `linkedInUrl`, `twitterUrl` + +### Gotcha +Google Maps results vary by language and location. Set `language: "en"` explicitly. Also set `locationQuery` to a specific city/region, not just a country. + +## B2B prospect discovery via LinkedIn +**When:** User wants to find professionals by role, company, or industry. + +### Pipeline +1. **Search profiles** -> `harvestapi/linkedin-profile-search` + - Key input: `keyword`, `location`, `title`, `limit` +2. **Enrich with details** -> `harvestapi/linkedin-profile-scraper` + - Pipe: `results[].profileUrl` -> `urls` + - Key input: `urls`, `includeEmail` (set to `true` for email discovery) + +### Output fields +Step 1: `fullName`, `headline`, `location`, `profileUrl`, `currentCompany` +Step 2: `experience[]`, `education[]`, `skills[]`, `email`, `phone` + +### Gotcha +LinkedIn Actors are all PPE. Step 2 with `includeEmail: true` costs ~$0.01/profile. For 500 profiles, that's ~$5. Estimate and confirm with user. +``` + +- [ ] **Step 3: Write competitive-intel.md** + +```markdown +# Competitive intelligence workflows + +## Competitor ad monitoring +**When:** User wants to see competitor advertising creatives, targeting, or ad spend signals. + +### Pipeline +1. **Scrape ad library** -> `apify/facebook-ads-scraper` + - Key input: `searchQuery` (competitor name), `country`, `adType`, `maxItems` + +### Output fields +Step 1: `adTitle`, `adBody`, `adCreativeUrl`, `startDate`, `pageInfo.name`, `platform` + +### Gotcha +Facebook Ad Library is public data, no auth needed. But results are limited to currently active or recently inactive ads. 
+ +## Competitor web presence analysis +**When:** User wants traffic, rankings, and SEO data for competitor domains. + +### Pipeline +1. **Get traffic data** -> `radeance/similarweb-scraper` + - Key input: `urls` (competitor domains) +2. **Get backlink profile** -> `radeance/ahrefs-scraper` + - Key input: `urls` (same domains) + +### Output fields +Step 1: `globalRank`, `monthlyVisits`, `bounceRate`, `avgVisitDuration`, `trafficSources` +Step 2: `domainRating`, `backlinks`, `referringDomains`, `organicKeywords` + +### Gotcha +SEO scrapers (radeance/) have the highest PPE costs ($0.005-0.0275/result). Estimate cost before running. For a single domain, each step costs ~$0.02-0.03. +``` + +- [ ] **Step 4: Write influencer-vetting.md** + +```markdown +# Influencer vetting workflows + +## Instagram creator vetting +**When:** User wants to evaluate an influencer's profile, audience, and engagement quality. + +### Pipeline +1. **Get profile data** -> `apify/instagram-profile-scraper` + - Key input: `usernames` (list of handles) +2. **Analyze engagement** -> `apify/instagram-comment-scraper` + - Pipe: `results[].latestPosts[].url` -> `directUrls` (pick 3-5 recent posts) + - Key input: `directUrls`, `resultsLimit` + +### Output fields +Step 1: `username`, `fullName`, `followersCount`, `followsCount`, `postsCount`, `biography`, `isVerified`, `latestPosts[]` +Step 2: `text`, `ownerUsername`, `timestamp` (scan for bot patterns: generic praise, emoji-only, irrelevant content) + +### Gotcha +High follower count with low comment quality suggests fake followers. Compare comment sentiment to post content. + +## Cross-platform influencer discovery +**When:** User wants to find an influencer's presence across multiple platforms. + +### Pipeline +1. 
**Search across platforms** -> `tri_angle/social-media-finder` + - Key input: `query` (influencer name or handle), `platforms` + +### Output fields +Step 1: `platform`, `profileUrl`, `username`, `followers`, `isVerified` +``` + +- [ ] **Step 5: Write brand-monitoring.md** + +```markdown +# Brand monitoring workflows + +## Cross-platform brand mention tracking +**When:** User wants to monitor brand mentions, hashtags, or sentiment across social platforms. + +### Pipeline (run each independently, combine results) +1. **Instagram mentions** -> `apify/instagram-tagged-scraper` + - Key input: `username` (brand handle) +2. **Instagram hashtags** -> `apify/instagram-hashtag-scraper` + - Key input: `hashtags` (branded hashtags) +3. **X/Twitter mentions** -> `apidojo/tweet-scraper` + - Key input: `searchTerms` (brand name, handle, hashtags) +4. **Reddit mentions** -> `trudax/reddit-scraper-lite` + - Key input: `searchQuery` (brand name) + +### Output fields +Instagram: `caption`, `likesCount`, `commentsCount`, `timestamp`, `ownerUsername` +X/Twitter: `text`, `retweetCount`, `likeCount`, `replyCount`, `createdAt`, `author` +Reddit: `title`, `body`, `score`, `numComments`, `subreddit`, `createdAt` + +### Gotcha +This is a parallel workflow, not sequential. Run each Actor independently. Combine results by date for a timeline view. + +## Sentiment analysis +**When:** User wants sentiment scoring on collected mentions. + +### Pipeline +1. **Collect mentions** (use any step from above) +2. **Analyze sentiment** -> `tri_angle/social-media-sentiment-analysis-tool` + - Pipe: collected post URLs -> `urls` + - Key input: `urls`, `platforms` + +### Output fields +Step 2: `sentiment` (positive/negative/neutral), `score`, `text`, `platform` +``` + +- [ ] **Step 6: Write review-analysis.md** + +```markdown +# Review analysis workflows + +## Google Maps review extraction +**When:** User wants to collect and analyze business reviews from Google Maps. + +### Pipeline +1. 
**Find businesses** -> `compass/crawler-google-places` + - Key input: `searchStringsArray`, `locationQuery`, `maxCrawledPlaces` +2. **Extract reviews** -> `compass/Google-Maps-Reviews-Scraper` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls`, `maxReviews` + +### Output fields +Step 1: `title`, `totalScore`, `reviewsCount`, `url`, `categoryName` +Step 2: `text`, `stars`, `publishedAtDate`, `reviewerName`, `ownerResponse` + +## Cross-platform hotel/restaurant reviews +**When:** User wants reviews aggregated from multiple platforms for the same business. + +### Pipeline (hotels) +1. **Aggregate reviews** -> `tri_angle/hotel-review-aggregator` + - Key input: `urls` (hotel URLs from TripAdvisor, Yelp, Google Maps, Booking.com, etc.) + +### Pipeline (restaurants) +1. **Aggregate reviews** -> `tri_angle/restaurant-review-aggregator` + - Key input: `urls` (restaurant URLs from Yelp, Google Maps, DoorDash, UberEats, etc.) + +### Output fields +Both: `text`, `rating`, `date`, `platform`, `reviewerName`, `title` + +## Yelp review pipeline +**When:** User wants Yelp reviews for businesses in a specific area. + +### Pipeline +1. **Find businesses** -> `tri_angle/get-yelp-urls` + - Key input: `location`, `category` +2. **Extract reviews** -> `tri_angle/yelp-review-scraper` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls`, `maxReviews` + +### Output fields +Step 1: `name`, `url`, `rating`, `reviewCount`, `address` +Step 2: `text`, `rating`, `date`, `userName` + +### Gotcha +Review aggregators pull from multiple platforms in one run - cheaper than running separate scrapers per platform. Use the aggregators when covering 3+ platforms. 
+``` + +- [ ] **Step 7: Commit** + +```bash +git add skills/apify-ultimate-scraper/references/workflows/ +git commit -m "feat: add workflow guides for lead-gen, competitive-intel, influencer, brand, reviews" +``` + +--- + +### Task 4: Create workflow guides (batch 2 of 2) + +**Files:** +- Create: `skills/apify-ultimate-scraper/references/workflows/content-and-seo.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/social-media-analytics.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/trend-research.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/job-market-and-recruitment.md` +- Create: `skills/apify-ultimate-scraper/references/workflows/real-estate-and-hospitality.md` + +- [ ] **Step 1: Write content-and-seo.md** + +```markdown +# Content and SEO workflows + +## Website content extraction for RAG +**When:** User wants to crawl a website and extract clean text for AI/LLM pipelines or knowledge bases. + +### Pipeline +1. **Crawl website** -> `apify/website-content-crawler` + - Key input: `startUrls`, `maxCrawlPages`, `crawlerType` ("cheerio" for speed, "playwright" for JS sites) + +### Output fields +Step 1: `url`, `title`, `text`, `markdown`, `metadata`, `links[]` + +### Gotcha +For JS-heavy sites (SPAs), set `crawlerType: "playwright"`. For static sites, use `"cheerio"` (10x faster). For anti-bot sites, use `apify/camoufox-scraper` instead. + +## SERP analysis +**When:** User wants to analyze search engine results for specific keywords. + +### Pipeline +1. **Google SERP** -> `apify/google-search-scraper` + - Key input: `queries`, `maxPagesPerQuery`, `countryCode`, `languageCode` + +### Output fields +Step 1: `organicResults[]` (title, url, description, position), `paidResults[]`, `peopleAlsoAsk[]`, `relatedSearches[]` + +## Domain authority and backlink analysis +**When:** User wants SEO metrics for specific domains. + +### Pipeline +1. 
**Traffic overview** -> `radeance/similarweb-scraper` + - Key input: `urls` +2. **Backlink profile** -> `radeance/ahrefs-scraper` + - Key input: `urls` +3. **Domain authority** -> `radeance/semrush-scraper` + - Key input: `urls` + +### Output fields +Step 1: `globalRank`, `monthlyVisits`, `bounceRate`, `trafficSources` +Step 2: `domainRating`, `backlinks`, `referringDomains`, `organicKeywords` +Step 3: `authorityScore`, `organicSearchTraffic`, `paidSearchTraffic` + +### Gotcha +All radeance/ SEO Actors are PPE at $0.005-0.0275/result. Running all 3 for one domain costs ~$0.05-0.08. For 50 domains, estimate $2.50-$4.00. +``` + +- [ ] **Step 2: Write social-media-analytics.md** + +```markdown +# Social media analytics workflows + +## Instagram account performance analysis +**When:** User wants engagement metrics and content performance for an Instagram account. + +### Pipeline +1. **Get profile** -> `apify/instagram-profile-scraper` + - Key input: `usernames` +2. **Get recent posts** -> `apify/instagram-post-scraper` + - Key input: `directUrls` (from profile's `latestPosts[].url`) or `usernames` + +### Output fields +Step 1: `followersCount`, `followsCount`, `postsCount`, `biography`, `isVerified` +Step 2: `caption`, `likesCount`, `commentsCount`, `timestamp`, `type` (photo/video/reel), `url` + +## TikTok creator analytics +**When:** User wants performance data for a TikTok creator. + +### Pipeline +1. **Get profile** -> `clockworks/tiktok-profile-scraper` + - Key input: `profiles` (handles or URLs) + +### Output fields +Step 1: `nickname`, `followers`, `following`, `likes`, `videos`, `verified`, `recentVideos[]` (with views, likes, shares per video) + +## Multi-platform engagement comparison +**When:** User wants to compare an account's performance across platforms. + +### Pipeline (run independently, combine) +1. **Instagram** -> `apify/instagram-profile-scraper` with `usernames` +2. **TikTok** -> `clockworks/tiktok-profile-scraper` with `profiles` +3. 
**YouTube** -> `streamers/youtube-channel-scraper` with `channelUrls` +4. **X/Twitter** -> `apidojo/twitter-user-scraper` with `handles` + +### Output fields +Instagram: `followersCount`, `postsCount`, `biography` +TikTok: `followers`, `likes`, `videos` +YouTube: `subscriberCount`, `videoCount`, `viewCount` +X/Twitter: `followers`, `tweets`, `likes` + +### Gotcha +Parallel workflow - run each Actor independently. Normalize metric names for comparison (followers/subscribers, posts/videos/tweets). +``` + +- [ ] **Step 3: Write trend-research.md** + +```markdown +# Trend and keyword research workflows + +## Google Trends analysis +**When:** User wants to analyze search demand trends for keywords or topics. + +### Pipeline +1. **Get trend data** -> `apify/google-trends-scraper` + - Key input: `searchTerms`, `timeRange`, `geo` (country code) + +### Output fields +Step 1: `term`, `timelineData[]` (date, value), `relatedQueries[]`, `relatedTopics[]` + +## Cross-platform hashtag research +**When:** User wants to evaluate a hashtag's reach and usage across platforms. + +### Pipeline +1. **Cross-platform overview** -> `apify/social-media-hashtag-research` + - Key input: `hashtags`, `platforms` (instagram, youtube, tiktok, facebook) + +### Output fields +Step 1: `hashtag`, `platform`, `postsCount`, `topPosts[]`, `relatedHashtags[]` + +## TikTok trend discovery +**When:** User wants to find trending content, sounds, or hashtags on TikTok. + +### Pipeline +1. **Trending content** -> `clockworks/tiktok-trends-scraper` + - Key input: `channel` (trending category) +2. **Explore categories** -> `clockworks/tiktok-explore-scraper` + - Key input: `exploreCategories` + +### Output fields +Step 1: `videoUrl`, `description`, `likes`, `shares`, `views`, `author`, `music` +Step 2: `category`, `posts[]`, `authors[]`, `music[]` + +## Content topic validation +**When:** User wants to validate whether a topic has demand before creating content. + +### Pipeline +1. 
**Search demand** -> `apify/google-trends-scraper` + - Key input: `searchTerms` (topic keywords) +2. **Social reach** -> `apify/social-media-hashtag-research` + - Key input: `hashtags` (topic hashtags) + +### Output fields +Step 1: `timelineData[]` (trending up/down), `relatedQueries[]` +Step 2: `postsCount` per platform, `topPosts[]` + +### Gotcha +Google Trends shows relative interest (0-100 scale), not absolute volume. Combine with hashtag post counts for a fuller picture. +``` + +- [ ] **Step 4: Write job-market-and-recruitment.md** + +```markdown +# Job market and recruitment workflows + +## Job listing research +**When:** User wants to find and analyze job postings by role, company, or location. + +### Pipeline +1. **Search jobs** -> `harvestapi/linkedin-job-search` + - Key input: `keyword`, `location`, `datePosted`, `limit` +2. **Get job details** -> `apimaestro/linkedin-job-detail` + - Pipe: `results[].jobUrl` -> `urls` + - Key input: `urls` + +### Output fields +Step 1: `title`, `company`, `location`, `jobUrl`, `postedDate`, `applicantsCount` +Step 2: `description`, `requirements`, `seniority`, `employmentType`, `salary` + +### Gotcha +Both Actors are PPE. Step 1: ~$0.001/job. Step 2: ~$0.005/job. For 200 jobs, total ~$1.20. Estimate and confirm with user. + +## Candidate sourcing +**When:** User wants to find potential candidates matching specific criteria. + +### Pipeline +1. **Search profiles** -> `harvestapi/linkedin-profile-search` + - Key input: `keyword`, `title`, `location`, `industry`, `limit` +2. 
**Enrich with details** -> `apimaestro/linkedin-profile-full-sections-scraper` + - Pipe: `results[].profileUrl` -> `urls` + - Key input: `urls` + +### Output fields +Step 1: `fullName`, `headline`, `location`, `profileUrl`, `currentCompany` +Step 2: `experience[]`, `education[]`, `skills[]`, `certifications[]`, `languages[]` + +### Gotcha +Step 2 (`apimaestro/linkedin-profile-full-sections-scraper`) costs ~$0.01/profile - the most expensive LinkedIn scraper. Use sparingly for shortlisted candidates only. + +## GitHub contributor discovery +**When:** User wants to find developers who contribute to specific open-source projects. + +### Pipeline +1. **Get contributors** -> `janbuchar/github-contributors-scraper` + - Key input: `repoUrls` + +### Output fields +Step 1: `username`, `contributions`, `profileUrl`, `avatarUrl` +``` + +- [ ] **Step 5: Write real-estate-and-hospitality.md** + +```markdown +# Real estate and hospitality workflows + +## Property search and analysis +**When:** User wants to find and compare property listings in a specific area. + +### Pipeline +1. **Search properties** -> `tri_angle/redfin-search` + - Key input: `location`, `propertyType`, `minPrice`, `maxPrice` +2. **Get details** -> `tri_angle/redfin-detail` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls` + +### Output fields +Step 1: `address`, `price`, `beds`, `baths`, `sqft`, `url`, `status` +Step 2: `description`, `yearBuilt`, `lotSize`, `priceHistory[]`, `taxHistory[]`, `schools[]` + +## Airbnb market analysis +**When:** User wants to analyze Airbnb listings, pricing, and reviews in a destination. + +### Pipeline +1. **Search listings** -> `tri_angle/new-fast-airbnb-scraper` + - Key input: `location`, `checkIn`, `checkOut`, `maxItems` +2. 
**Get reviews** -> `tri_angle/airbnb-reviews-scraper` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls`, `maxReviews` + +### Output fields +Step 1: `name`, `price`, `rating`, `reviews`, `type`, `amenities[]`, `url`, `images[]` +Step 2: `text`, `rating`, `date`, `reviewerName` + +### Gotcha +Airbnb pricing varies by date. Always set `checkIn` and `checkOut` for accurate pricing. For market analysis, run multiple date ranges to capture seasonal variation. + +## Multi-source property comparison +**When:** User wants to compare listings across Zillow, Realtor, Zumper, and other US/UK sources. + +### Pipeline +1. **Aggregate listings** -> `tri_angle/real-estate-aggregator` + - Key input: `location`, `propertyType`, `sources` (Zillow, Realtor, Zumper, Apartments.com, Rightmove) + +### Output fields +Step 1: `address`, `price`, `beds`, `baths`, `sqft`, `source`, `url`, `listingDate` +``` + +- [ ] **Step 6: Commit** + +```bash +git add skills/apify-ultimate-scraper/references/workflows/ +git commit -m "feat: add workflow guides for content/SEO, social analytics, trends, jobs, real estate" +``` + +--- + +### Task 5: Rewrite SKILL.md with new routing logic + +**Files:** +- Modify: `skills/apify-ultimate-scraper/SKILL.md` + +- [ ] **Step 1: Rewrite the full SKILL.md** + +Replace the entire file with the new version that uses the three-layer routing: + +```markdown +--- +name: apify-ultimate-scraper +description: Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, and 15+ more platforms. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, review analysis, SEO intelligence, recruitment, or any data extraction task. +--- + +# Universal web scraper + +AI-driven data extraction from ~100 Actors across 15+ platforms via the Apify CLI. 
+ +**Rule: Always pass `--json` to CLI commands.** JSON output is stable across CLI versions. Never parse human-readable output. + +## Prerequisites + +- Apify CLI v1.4.0+ (`npm install -g apify-cli`) +- Authenticated session (see below) + +## Authentication + +Check: `apify info` + +If not logged in, authenticate via OAuth (opens browser): + + apify login + +Headless fallback: `export APIFY_TOKEN=your_token_here` +Generate token: https://console.apify.com/settings/integrations + +## Workflow + +### Step 1: Understand goal and select Actor + +Identify the target platform and use case. Read `references/actor-index.md` to find the right Actor. + +If the task involves a multi-step pipeline, also read the matching workflow guide: + +| Task involves... | Read | +|-----------------|------| +| leads, contacts, emails, B2B | `references/workflows/lead-generation.md` | +| competitor, ads, pricing | `references/workflows/competitive-intel.md` | +| influencer, creator | `references/workflows/influencer-vetting.md` | +| brand, mentions, sentiment | `references/workflows/brand-monitoring.md` | +| reviews, ratings, reputation | `references/workflows/review-analysis.md` | +| SEO, SERP, crawl, content, RAG | `references/workflows/content-and-seo.md` | +| analytics, engagement, performance | `references/workflows/social-media-analytics.md` | +| trends, keywords, hashtags | `references/workflows/trend-research.md` | +| jobs, recruiting, candidates | `references/workflows/job-market-and-recruitment.md` | +| real estate, listings, hotels | `references/workflows/real-estate-and-hospitality.md` | + +If no Actor matches in the index, search dynamically: + + apify actors search "KEYWORDS" --json --limit 10 + +From results: `items[].username`/`items[].name` (Actor ID), `items[].title`, `items[].stats.totalUsers30Days`, `items[].currentPricingInfo.pricingModel`. 
+ +### Step 2: Fetch Actor schema and check gotchas + +Fetch the input schema dynamically: + + apify actors info "ACTOR_ID" --input --json + +Also read `references/gotchas.md` to check for pricing traps and common pitfalls for the selected Actor. + +For PPE Actors: estimate cost before running (see gotchas.md cost estimation protocol). + +For Actor documentation: `apify actors info "ACTOR_ID" --readme` + +### Step 3: Configure and run + +**Skip user preferences** for simple lookups (e.g., "Nike's follower count"). Go straight to running with quick answer mode. + +For larger tasks, confirm output format (quick answer / CSV / JSON) and result count. + +**Standard run (blocking):** + + apify actors call "ACTOR_ID" -i 'JSON_INPUT' --json + +From output: `.id` (run ID), `.status`, `.defaultDatasetId`, `.stats.durationMillis` + +**Fetch results:** + + apify datasets get-items DATASET_ID --format json + +For CSV: `apify datasets get-items DATASET_ID --format csv` + +**Quick answer mode:** Fetch results as JSON, pick top 5, present formatted in chat. + +**Save to file:** Fetch results, use Write tool to save as `YYYY-MM-DD_descriptive-name.csv` or `.json`. + +**Large/long-running scrapes:** + + apify actors start "ACTOR_ID" -i 'JSON_INPUT' --json + +Poll: `apify runs info RUN_ID --json` (check `.status` for `SUCCEEDED`). + +### Step 4: Deliver results + +Report: result count, file location (if saved), key data fields, and links: +- Dataset: `https://console.apify.com/storage/datasets/DATASET_ID` +- Run: `https://console.apify.com/actors/runs/RUN_ID` + +For multi-step workflows: suggest the next pipeline step from the workflow guide. 
+ +## Error handling + +| Error | Resolution | +|-------|-----------| +| `apify: command not found` | `npm install -g apify-cli` | +| `Error: Not logged in` | `apify login` or `export APIFY_TOKEN=...` | +| Actor not found | Check ID format (`username/actor-name`); try `apify actors search` | +| `status: FAILED` | Check `.statusMessage`; see run log at Console URL | +| Takes too long | Switch to `apify actors start` + poll with `apify runs info` | +| Empty results | Lower limits; check Actor README with `apify actors info ACTOR_ID --readme` | +| `isDeprecated: true` | Search for alternatives with `apify actors search` | +``` + +- [ ] **Step 2: Verify SKILL.md is under 150 lines** + +Count lines. Target: ~120-140 lines. If over 150, trim the error handling table or move to a reference file. + +- [ ] **Step 3: Commit** + +```bash +git add skills/apify-ultimate-scraper/SKILL.md +git commit -m "refactor: rewrite SKILL.md with three-layer progressive disclosure routing" +``` + +--- + +### Task 6: Clean up and verify + +**Files:** +- Modify: `.claude-plugin/marketplace.json` (update description) +- Regenerate: `agents/AGENTS.md` + +- [ ] **Step 1: Update marketplace.json description** + +Update the `apify-ultimate-scraper` entry's description to reflect the restructured skill. The current description already mentions 15+ platforms; no major change needed. Verify it matches the SKILL.md frontmatter description. 
+ +- [ ] **Step 2: Regenerate AGENTS.md** + +```bash +cd /tmp/agent-skills && python3 scripts/generate_agents.py +``` + +- [ ] **Step 3: Verify file structure** + +```bash +find skills/apify-ultimate-scraper -type f | sort +``` + +Expected output: +``` +skills/apify-ultimate-scraper/SKILL.md +skills/apify-ultimate-scraper/references/actor-index.md +skills/apify-ultimate-scraper/references/gotchas.md +skills/apify-ultimate-scraper/references/workflows/brand-monitoring.md +skills/apify-ultimate-scraper/references/workflows/competitive-intel.md +skills/apify-ultimate-scraper/references/workflows/content-and-seo.md +skills/apify-ultimate-scraper/references/workflows/influencer-vetting.md +skills/apify-ultimate-scraper/references/workflows/job-market-and-recruitment.md +skills/apify-ultimate-scraper/references/workflows/lead-generation.md +skills/apify-ultimate-scraper/references/workflows/real-estate-and-hospitality.md +skills/apify-ultimate-scraper/references/workflows/review-analysis.md +skills/apify-ultimate-scraper/references/workflows/social-media-analytics.md +skills/apify-ultimate-scraper/references/workflows/trend-research.md +``` + +13 files total. No leftover `reference/scripts/` directory. + +- [ ] **Step 4: Commit and tag** + +```bash +git add -A +git commit -m "chore: update marketplace description and regenerate AGENTS.md" +``` + +- [ ] **Step 5: End-to-end verification** + +Test these scenarios mentally (or in Claude Code if available): + +1. **Simple scrape** ("get Nike's Instagram profiles"): Agent loads SKILL.md + actor-index.md. Picks `apify/instagram-profile-scraper`. Fetches schema via `apify actors info --input --json`. Runs. ~250 lines loaded. + +2. **Multi-step workflow** ("build a lead list of restaurants in Prague with emails"): Agent loads SKILL.md + actor-index.md + lead-generation.md + gotchas.md. Follows the "Local business leads" pipeline. ~370 lines loaded. + +3. 
**PPE cost check** ("scrape LinkedIn profiles for 500 marketing managers"): Agent loads gotchas.md, sees LinkedIn pricing section, estimates cost (~$5), warns user before running. + +4. **Dynamic discovery** ("scrape Glassdoor reviews"): Agent can't find in index, uses `apify actors search "glassdoor reviews" --json`, selects best match, fetches schema dynamically. diff --git a/docs/superpowers/specs/2026-03-28-ultimate-scraper-skill-redesign.md b/docs/superpowers/specs/2026-03-28-ultimate-scraper-skill-redesign.md new file mode 100644 index 0000000..1a7e350 --- /dev/null +++ b/docs/superpowers/specs/2026-03-28-ultimate-scraper-skill-redesign.md @@ -0,0 +1,232 @@ +# Ultimate scraper skill redesign + +## Context + +The `apify-ultimate-scraper` skill was recently migrated from raw REST API scripts to Apify CLI commands. This redesign addresses the next layer: the skill's information architecture. The current design loads a ~400-line monolithic Actor index every time, spends most of its token budget on Actor selection (which agents handle well), and provides almost no help with the two primary failure modes: wrong input configuration and inability to pipe Actor outputs into subsequent steps. + +**Problem:** The skill optimizes for the wrong thing. 50% of usage is quick targeted scrapes (where loading 400 lines of index is waste), and 50% is multi-step workflows (where the skill provides no data-piping guidance). + +**Goal:** Restructure the skill into a three-layer progressive disclosure architecture that's lean for simple tasks and rich for complex ones, while adding gotchas and cost guardrails to prevent the most common mistakes. + +## Architecture: three layers + +### Layer 1: Lean Actor index (`references/actor-index.md`, ~100 lines) + +A flat Markdown lookup table organized by platform. Three columns only: `Actor ID | Tier | Best for (5 words max)`. Always loaded when the skill triggers. + +Purpose: fast Actor selection. 
Does NOT contain input schemas, output fields, or workflow instructions. Those are in layers 2 and 3. + +Example format: + +```markdown +## Instagram + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/instagram-profile-scraper | apify | profiles, followers, bio | +| apify/instagram-post-scraper | apify | posts, likes, comments | +| apify/instagram-hashtag-scraper | apify | hashtag posts, trends | +``` + +For Actors not in the index, the agent uses `apify actors search "QUERY" --json` to discover dynamically. + +### Layer 2: Use-case workflow guides (`references/workflows/*.md`) + +Rich multi-step pipeline guides with explicit Actor chaining and data-piping instructions. Loaded only when the task involves multiple Actors or a recognized use case. + +Ten workflow files: + +1. `lead-generation.md` - business contacts, email extraction, B2B prospecting +2. `competitive-intel.md` - ad monitoring, pricing, market positioning +3. `influencer-vetting.md` - profile discovery, audience analysis, engagement vetting +4. `brand-monitoring.md` - mentions, sentiment, hashtag tracking +5. `review-analysis.md` - cross-platform review aggregation (Google Maps, Yelp, TripAdvisor, Airbnb) +6. `content-and-seo.md` - SERP analysis, web crawling, content extraction for RAG +7. `social-media-analytics.md` - engagement metrics, content performance across platforms +8. `trend-research.md` - Google Trends, TikTok trends, hashtag analytics, seasonal demand +9. `job-market-and-recruitment.md` - LinkedIn jobs, candidate sourcing, skill-gap analysis +10. `real-estate-and-hospitality.md` - listing pipelines, market analysis, pricing comparison + +Each workflow guide follows a consistent structure: + +```markdown +# [Use case] workflows + +## [Specific scenario name] +**When:** [One-line trigger condition] + +### Pipeline +1. **[Step name]** -> `actor/id` + - Key input: `field1`, `field2`, `field3` +2. 
**[Step name]** -> `actor/id` + - Pipe: `results[].fieldX` -> `inputFieldY` + - Key input: `startUrls`, `maxRequestsPerCrawl` + +### Output fields +Step 1: `field1`, `field2`, `field3` +Step 2: `field1`, `field2`, `field3` + +### Gotcha +[Workflow-specific pitfall, if any] +``` + +Key elements: +- **Pipe instructions** - explicit field mappings for chaining Actor outputs to inputs +- **Key input fields** - the 2-3 most important params (NOT the full schema - that's fetched dynamically) +- **Output fields** - what each step returns (enables the agent to know what's available for piping or presenting) +- **Gotcha** - per-workflow pitfall where relevant + +### Layer 3: Dynamic schema fetching (runtime, no files) + +Input schemas are always fetched at runtime via: + +```bash +apify actors info "ACTOR_ID" --input --json +``` + +This eliminates stale pre-cached schemas and ensures the agent always sees the current parameter set. The workflow guides provide only the 2-3 key input fields as hints - the full schema comes from the CLI. + +For Actor documentation: + +```bash +apify actors info "ACTOR_ID" --readme +``` + +## Cross-cutting: gotchas and cost guardrails (`references/gotchas.md`) + +A single reference file (~60 lines) covering: + +### Pricing models +- FREE: no per-result cost +- PAY_PER_EVENT (PPE): charged per result - MUST check pricing before running +- FLAT_PRICE_PER_MONTH: subscription model + +### Cost estimation protocol +Before running PPE Actors: +1. Read `.currentPricingInfo` from `apify actors info "ACTOR_ID" --json` +2. Calculate: `pricePerEvent * requestedResults` +3. Warn user if estimated cost > $5 +4. 
Require explicit confirmation for > $20 + +### Common pitfalls +- Cookie-dependent Actors (social media scrapers needing login) +- Rate limiting on large scrapes (use proxy configuration) +- Empty results from geo-restrictions or narrow queries +- `maxResults` vs `maxCrawledPages` confusion (different Actors use different limit fields) +- Deprecated Actors (check `.isDeprecated` in Actor info) + +## File structure + +``` +skills/apify-ultimate-scraper/ +├── SKILL.md # ~150 lines: workflow + routing +├── references/ +│ ├── actor-index.md # ~100 lines: flat lookup by platform +│ ├── gotchas.md # ~60 lines: pitfalls + cost guardrails +│ └── workflows/ +│ ├── lead-generation.md +│ ├── competitive-intel.md +│ ├── influencer-vetting.md +│ ├── brand-monitoring.md +│ ├── review-analysis.md +│ ├── content-and-seo.md +│ ├── social-media-analytics.md +│ ├── trend-research.md +│ ├── job-market-and-recruitment.md +│ └── real-estate-and-hospitality.md +``` + +Total files: 13 (SKILL.md + actor-index + gotchas + 10 workflows) + +## SKILL.md workflow + +The main skill file contains: + +1. **Frontmatter** - name, description (trigger conditions) +2. **Prerequisites** - CLI version, authentication (OAuth-first) +3. **Workflow** (5 steps): + - Step 1: Understand goal, identify platform/use-case + - Step 2: Select Actor from `references/actor-index.md`, fetch input schema dynamically via `apify actors info --input --json` + - Step 3: If multi-step task, read matching workflow guide from `references/workflows/` + - Step 4: Review `references/gotchas.md` for pricing/cost traps. Run cost estimation for PPE Actors. + - Step 5: Run Actor(s) via CLI, fetch results, deliver to user +4. **Error handling** - table of common errors and resolutions +5. 
**`--json` policy** - reminder to always use `--json` flag + +Routing logic for workflow guides: + +``` +If task mentions "lead" or "contact" or "email" -> lead-generation.md +If task mentions "competitor" or "ad" or "pricing" -> competitive-intel.md +If task mentions "influencer" or "creator" -> influencer-vetting.md +If task mentions "brand" or "mention" or "sentiment" -> brand-monitoring.md +If task mentions "review" or "rating" or "reputation" -> review-analysis.md +If task mentions "SEO" or "SERP" or "crawl" or "content" -> content-and-seo.md +If task mentions "analytics" or "engagement" or "performance" -> social-media-analytics.md +If task mentions "trend" or "keyword" or "hashtag" -> trend-research.md +If task mentions "job" or "recruit" or "candidate" or "hiring" -> job-market-and-recruitment.md +If task mentions "real estate" or "listing" or "property" or "hotel" -> real-estate-and-hospitality.md +``` + +This is high-freedom guidance (text-based), not rigid routing. The agent uses judgment. + +## Token budget analysis + +| Scenario | Files loaded | Estimated tokens | +|----------|-------------|-----------------| +| Simple scrape ("get Nike's Instagram") | SKILL.md + actor-index | ~250 lines (~2,500 tokens) | +| Targeted with gotchas check | SKILL.md + actor-index + gotchas | ~310 lines (~3,100 tokens) | +| Multi-step workflow | SKILL.md + actor-index + gotchas + 1 workflow | ~370 lines (~3,700 tokens) | +| Complex exploration | SKILL.md + actor-index + gotchas + 2 workflows | ~430 lines (~4,300 tokens) | + +Current design loads ~590 lines regardless. The new design ranges from 250-430 depending on complexity. The simple case (50% of usage) cuts token usage by more than half. 
+ +## What changes from current design + +| Aspect | Current | Redesigned | +|--------|---------|-----------| +| Actor index | ~400 lines monolithic, includes descriptions + workflows | ~100 lines, 3-column lookup only | +| Input schemas | Not provided (just "fetch via CLI") | Still fetched via CLI, but workflow guides provide key input hints | +| Output schemas | Not provided | Explicit per-step output field lists in workflow guides | +| Workflow guidance | None | 10 dedicated files with data-piping instructions | +| Gotchas | None | Dedicated reference file with pricing/cost/pitfall guidance | +| Cost estimation | Brief warning about 1,000+ results | Explicit protocol: check pricing, estimate cost, confirm with user | +| Token usage (simple task) | ~590 lines | ~250 lines | + +## What does NOT change + +- CLI commands (same as current: `actors search`, `actors info`, `actors call`, `datasets get-items`) +- Authentication flow (OAuth-first, env var fallback) +- `--json` policy (all CLI output via `--json`) +- Error handling table +- Resilience strategy (4 layers from the migration plan) +- Plugin metadata structure (plugin.json, marketplace.json, AGENTS.md) + +## Implementation scope + +### Must-do (this pass) +- Rewrite SKILL.md with new routing logic and workflow +- Create lean `references/actor-index.md` from existing actor-index data +- Create `references/gotchas.md` +- Create 10 workflow guide files with consistent structure +- Populate workflow guides with at least 1-2 pipelines each (skeleton + key examples) +- Delete old `references/actor-index.md` (the current ~400 line version) + +### Deferred (second pass by user) +- Enriching workflow guides with additional pipeline examples +- Adding more Actors to the index as they're discovered/tested +- Building an eval framework for skill testing +- Adding skill memory/run history +- Per-Actor gotchas (currently only cross-cutting gotchas) + +## Verification + +1. 
Load the skill in Claude Code and test a simple scrape ("get 10 Instagram profiles for @nike") + - Verify: agent loads SKILL.md + actor-index only, picks right Actor, fetches schema via CLI +2. Test a multi-step workflow ("build me a lead list of restaurants in Prague with emails") + - Verify: agent loads lead-generation.md, follows the pipeline, pipes data correctly +3. Test a PPE Actor ("scrape Amazon product reviews") + - Verify: agent checks gotchas.md, estimates cost, warns before running +4. Test dynamic discovery ("scrape Glassdoor company reviews") + - Verify: agent can't find in index, uses `apify actors search`, fetches schema dynamically +5. Test in Gemini CLI via AGENTS.md to verify cross-agent compatibility diff --git a/skills/apify-actor-development/SKILL.md b/skills/apify-actor-development/SKILL.md index 1a6a02c..df121eb 100644 --- a/skills/apify-actor-development/SKILL.md +++ b/skills/apify-actor-development/SKILL.md @@ -40,18 +40,14 @@ When the apify CLI is installed, check that it is logged in with: apify info # Should return your username ``` -If it is not logged in, check if the `APIFY_TOKEN` environment variable is defined (if not, ask the user to generate one on https://console.apify.com/settings/integrations and then define `APIFY_TOKEN` with it). - -Then authenticate using one of these methods: +If not logged in, authenticate using OAuth (opens browser): ```bash -# Option 1 (preferred): The CLI automatically reads APIFY_TOKEN from the environment. -# Just ensure the env var is exported and run any apify command — no explicit login needed. - -# Option 2: Interactive login (prompts for token without exposing it in shell history) apify login ``` +If browser login isn't available (headless environment or CI), the CLI automatically reads `APIFY_TOKEN` from the environment. Ensure the env var is exported and run any apify command - no explicit login needed. 
If the user doesn't have a token, generate one at https://console.apify.com/settings/integrations. + > **Security note:** Avoid passing tokens as command-line arguments (e.g. `apify login -t <token>`). > Arguments are visible in process listings and may be recorded in shell history. > Prefer environment variables or interactive login instead. diff --git a/skills/apify-actorization/SKILL.md b/skills/apify-actorization/SKILL.md index 1d73a40..ebfc906 100644 --- a/skills/apify-actorization/SKILL.md +++ b/skills/apify-actorization/SKILL.md @@ -40,7 +40,7 @@ npm install -g apify-cli ``` > **Security note:** Do NOT install the CLI by piping remote scripts to a shell -> (e.g. `curl … | bash` or `irm … | iex`). Always use a package manager. +> (e.g. `curl ... | bash` or `irm ... | iex`). Always use a package manager. Verify CLI is logged in: @@ -48,21 +48,17 @@ apify info # Should return your username ``` -If not logged in, check if the `APIFY_TOKEN` environment variable is defined (if not, ask the user to generate one at https://console.apify.com/settings/integrations and then define `APIFY_TOKEN` with it). - -Then authenticate using one of these methods: +If not logged in, authenticate using OAuth (opens browser): ```bash -# Option 1 (preferred): The CLI automatically reads APIFY_TOKEN from the environment. -# Just ensure the env var is exported and run any apify command — no explicit login needed. - -# Option 2: Interactive login (prompts for token without exposing it in shell history) apify login ``` +If browser login isn't available (headless environment or CI), ensure the `APIFY_TOKEN` environment variable is exported. The CLI reads it automatically - no explicit login needed. If the user doesn't have a token, generate one at https://console.apify.com/settings/integrations. + > **Security note:** Avoid passing tokens as command-line arguments (e.g. `apify login -t <token>`). > Arguments are visible in process listings and may be recorded in shell history.
-> Prefer environment variables or interactive login instead. +> Prefer OAuth login or environment variables instead. > Never log, print, or embed `APIFY_TOKEN` in source code or configuration files. > Use a token with the minimum required permissions (scoped token) and rotate it periodically. diff --git a/skills/apify-ultimate-scraper/SKILL.md b/skills/apify-ultimate-scraper/SKILL.md index 4da4583..22007b8 100644 --- a/skills/apify-ultimate-scraper/SKILL.md +++ b/skills/apify-ultimate-scraper/SKILL.md @@ -1,232 +1,109 @@ --- name: apify-ultimate-scraper -description: Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, Google Maps, Google Search, Google Trends, Booking.com, and TripAdvisor. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, or any data extraction task. +description: Universal AI-powered web scraper for any platform. Scrape data from Instagram, Facebook, TikTok, YouTube, LinkedIn, X/Twitter, Google Maps, Google Search, Google Trends, Reddit, Airbnb, Yelp, and 15+ more platforms. Use for lead generation, brand monitoring, competitor analysis, influencer discovery, trend research, content analytics, audience analysis, review analysis, SEO intelligence, recruitment, or any data extraction task. --- -# Universal Web Scraper +# Universal web scraper -AI-driven data extraction from 55+ Actors across all major platforms. This skill automatically selects the best Actor for your task. +AI-driven data extraction from ~100 Actors across 15+ platforms via the Apify CLI. + +**Rule: Always pass `--json` to CLI commands.** JSON output is stable across CLI versions. Never parse human-readable output. 
 ## Prerequisites
-(No need to check it upfront)
-- `.env` file with `APIFY_TOKEN`
-- Node.js 20.6+ (for native `--env-file` support)
+- Apify CLI v1.4.0+ (`npm install -g apify-cli`)
+- Authenticated session (see below)
+
+## Authentication
+
+Check: `apify info`
+
+If not logged in, authenticate via OAuth (opens browser):
+
+    apify login
+
+Headless fallback: `export APIFY_TOKEN=your_token_here`
+Generate token: https://console.apify.com/settings/integrations
 
 ## Workflow
 
-Copy this checklist and track progress:
-
-```
-Task Progress:
-- [ ] Step 1: Understand user goal and select Actor
-- [ ] Step 2: Fetch Actor schema
-- [ ] Step 3: Ask user preferences (format, filename)
-- [ ] Step 4: Run the scraper script
-- [ ] Step 5: Summarize results and offer follow-ups
-```
-
-### Step 1: Understand User Goal and Select Actor
-
-First, understand what the user wants to achieve. Then select the best Actor from the options below.
-
-#### Instagram Actors (12)
-
-| Actor ID | Best For |
-|----------|----------|
-| `apify/instagram-profile-scraper` | Profile data, follower counts, bio info |
-| `apify/instagram-post-scraper` | Individual post details, engagement metrics |
-| `apify/instagram-comment-scraper` | Comment extraction, sentiment analysis |
-| `apify/instagram-hashtag-scraper` | Hashtag content, trending topics |
-| `apify/instagram-hashtag-stats` | Hashtag performance metrics |
-| `apify/instagram-reel-scraper` | Reels content and metrics |
-| `apify/instagram-search-scraper` | Search users, places, hashtags |
-| `apify/instagram-tagged-scraper` | Posts tagged with specific accounts |
-| `apify/instagram-followers-count-scraper` | Follower count tracking |
-| `apify/instagram-scraper` | Comprehensive Instagram data |
-| `apify/instagram-api-scraper` | API-based Instagram access |
-| `apify/export-instagram-comments-posts` | Bulk comment/post export |
-
-#### Facebook Actors (14)
-
-| Actor ID | Best For |
-|----------|----------|
-| `apify/facebook-pages-scraper` | Page data, metrics, contact info |
-| `apify/facebook-page-contact-information` | Emails, phones, addresses from pages |
-| `apify/facebook-posts-scraper` | Post content and engagement |
-| `apify/facebook-comments-scraper` | Comment extraction |
-| `apify/facebook-likes-scraper` | Reaction analysis |
-| `apify/facebook-reviews-scraper` | Page reviews |
-| `apify/facebook-groups-scraper` | Group content and members |
-| `apify/facebook-events-scraper` | Event data |
-| `apify/facebook-ads-scraper` | Ad creative and targeting |
-| `apify/facebook-search-scraper` | Search results |
-| `apify/facebook-reels-scraper` | Reels content |
-| `apify/facebook-photos-scraper` | Photo extraction |
-| `apify/facebook-marketplace-scraper` | Marketplace listings |
-| `apify/facebook-followers-following-scraper` | Follower/following lists |
-
-#### TikTok Actors (14)
-
-| Actor ID | Best For |
-|----------|----------|
-| `clockworks/tiktok-scraper` | Comprehensive TikTok data |
-| `clockworks/free-tiktok-scraper` | Free TikTok extraction |
-| `clockworks/tiktok-profile-scraper` | Profile data |
-| `clockworks/tiktok-video-scraper` | Video details and metrics |
-| `clockworks/tiktok-comments-scraper` | Comment extraction |
-| `clockworks/tiktok-followers-scraper` | Follower lists |
-| `clockworks/tiktok-user-search-scraper` | Find users by keywords |
-| `clockworks/tiktok-hashtag-scraper` | Hashtag content |
-| `clockworks/tiktok-sound-scraper` | Trending sounds |
-| `clockworks/tiktok-ads-scraper` | Ad content |
-| `clockworks/tiktok-discover-scraper` | Discover page content |
-| `clockworks/tiktok-explore-scraper` | Explore content |
-| `clockworks/tiktok-trends-scraper` | Trending content |
-| `clockworks/tiktok-live-scraper` | Live stream data |
-
-#### YouTube Actors (5)
-
-| Actor ID | Best For |
-|----------|----------|
-| `streamers/youtube-scraper` | Video data and metrics |
-| `streamers/youtube-channel-scraper` | Channel information |
-| `streamers/youtube-comments-scraper` | Comment extraction |
-| `streamers/youtube-shorts-scraper` | Shorts content |
-| `streamers/youtube-video-scraper-by-hashtag` | Videos by hashtag |
-
-#### Google Maps Actors (4)
-
-| Actor ID | Best For |
-|----------|----------|
-| `compass/crawler-google-places` | Business listings, ratings, contact info |
-| `compass/google-maps-extractor` | Detailed business data |
-| `compass/Google-Maps-Reviews-Scraper` | Review extraction |
-| `poidata/google-maps-email-extractor` | Email discovery from listings |
-
-#### Other Actors (6)
-
-| Actor ID | Best For |
-|----------|----------|
-| `apify/google-search-scraper` | Google search results |
-| `apify/google-trends-scraper` | Google Trends data |
-| `voyager/booking-scraper` | Booking.com hotel data |
-| `voyager/booking-reviews-scraper` | Booking.com reviews |
-| `maxcopell/tripadvisor-reviews` | TripAdvisor reviews |
-| `vdrmota/contact-info-scraper` | Contact enrichment from URLs |
+### Step 1: Understand goal and select Actor
----
+Identify the target platform and use case. Read `references/actor-index.md` to find the right Actor.
-#### Actor Selection by Use Case
+If the task involves a multi-step pipeline, also read the matching workflow guide:
-| Use Case | Primary Actors |
-|----------|---------------|
-| **Lead Generation** | `compass/crawler-google-places`, `poidata/google-maps-email-extractor`, `vdrmota/contact-info-scraper` |
-| **Influencer Discovery** | `apify/instagram-profile-scraper`, `clockworks/tiktok-profile-scraper`, `streamers/youtube-channel-scraper` |
-| **Brand Monitoring** | `apify/instagram-tagged-scraper`, `apify/instagram-hashtag-scraper`, `compass/Google-Maps-Reviews-Scraper` |
-| **Competitor Analysis** | `apify/facebook-pages-scraper`, `apify/facebook-ads-scraper`, `apify/instagram-profile-scraper` |
-| **Content Analytics** | `apify/instagram-post-scraper`, `clockworks/tiktok-scraper`, `streamers/youtube-scraper` |
-| **Trend Research** | `apify/google-trends-scraper`, `clockworks/tiktok-trends-scraper`, `apify/instagram-hashtag-stats` |
-| **Review Analysis** | `compass/Google-Maps-Reviews-Scraper`, `voyager/booking-reviews-scraper`, `maxcopell/tripadvisor-reviews` |
-| **Audience Analysis** | `apify/instagram-followers-count-scraper`, `clockworks/tiktok-followers-scraper`, `apify/facebook-followers-following-scraper` |
+| Task involves... | Read |
+|-----------------|------|
+| leads, contacts, emails, B2B | `references/workflows/lead-generation.md` |
+| competitor, ads, pricing | `references/workflows/competitive-intel.md` |
+| influencer, creator | `references/workflows/influencer-vetting.md` |
+| brand, mentions, sentiment | `references/workflows/brand-monitoring.md` |
+| reviews, ratings, reputation | `references/workflows/review-analysis.md` |
+| SEO, SERP, crawl, content, RAG | `references/workflows/content-and-seo.md` |
+| analytics, engagement, performance | `references/workflows/social-media-analytics.md` |
+| trends, keywords, hashtags | `references/workflows/trend-research.md` |
+| jobs, recruiting, candidates | `references/workflows/job-market-and-recruitment.md` |
+| real estate, listings, hotels | `references/workflows/real-estate-and-hospitality.md` |
+| price monitoring, e-commerce, products | `references/workflows/ecommerce-price-monitoring.md` |
+| contact enrichment, email extraction | `references/workflows/contact-enrichment.md` |
+| knowledge base, RAG, LLM data feed | `references/workflows/knowledge-base-and-rag.md` |
+| company research, due diligence | `references/workflows/company-research.md` |
----
+If no Actor matches in the index, search dynamically:
+
+    apify actors search "KEYWORDS" --json --limit 10
-#### Multi-Actor Workflows
+From results: `items[].username`/`items[].name` (Actor ID), `items[].title`, `items[].stats.totalUsers30Days`, `items[].currentPricingInfo.pricingModel`.
-For complex tasks, chain multiple Actors:
+### Step 2: Fetch Actor schema and check gotchas
-| Workflow | Step 1 | Step 2 |
-|----------|--------|--------|
-| **Lead enrichment** | `compass/crawler-google-places` → | `vdrmota/contact-info-scraper` |
-| **Influencer vetting** | `apify/instagram-profile-scraper` → | `apify/instagram-comment-scraper` |
-| **Competitor deep-dive** | `apify/facebook-pages-scraper` → | `apify/facebook-posts-scraper` |
-| **Local business analysis** | `compass/crawler-google-places` → | `compass/Google-Maps-Reviews-Scraper` |
+Fetch the input schema dynamically:
-#### Can't Find a Suitable Actor?
+    apify actors info "ACTOR_ID" --input --json
-If none of the Actors above match the user's request, search the Apify Store directly:
+Also read `references/gotchas.md` to check for pricing traps and common pitfalls for the selected Actor.
-```bash
-node ${CLAUDE_PLUGIN_ROOT}/reference/scripts/search_actors.js --query "SEARCH_KEYWORDS"
-```
+For PPE Actors: estimate cost before running (see gotchas.md cost estimation protocol).
-Replace `SEARCH_KEYWORDS` with 1-3 simple terms (e.g., "LinkedIn profiles", "Amazon products", "Twitter").
+For Actor documentation: `apify actors info "ACTOR_ID" --readme`
-### Step 2: Fetch Actor Schema
+### Step 3: Configure and run
-Fetch the Actor's input schema and details:
+**Skip user preferences** for simple lookups (e.g., "Nike's follower count"). Go straight to running with quick answer mode.
-```bash
-node --env-file=.env ${CLAUDE_PLUGIN_ROOT}/reference/scripts/fetch_actor_details.js --actor "ACTOR_ID"
-```
+For larger tasks, confirm output format (quick answer / CSV / JSON) and result count.
-Replace `ACTOR_ID` with the selected Actor (e.g., `compass/crawler-google-places`).
+**Standard run (blocking):**
-This returns:
-- Actor info (title, description, URL, categories, stats, rating)
-- README summary
-- Input schema (required and optional parameters)
+    apify actors call "ACTOR_ID" -i 'JSON_INPUT' --json
-### Step 3: Ask User Preferences
+From output: `.id` (run ID), `.status`, `.defaultDatasetId`, `.stats.durationMillis`
-**Skip this step** for simple lookups (e.g., "what's Nike's follower count?", "find me 5 coffee shops in Prague") — just use quick answer mode and move to Step 4.
+**Fetch results:**
-For larger scraping tasks, ask:
-1. **Output format**:
-   - **Quick answer** - Display top few results in chat (no file saved)
-   - **CSV** - Full export with all fields
-   - **JSON** - Full export in JSON format
-2. **Number of results**: Based on character of use case
+    apify datasets get-items DATASET_ID --format json
-**Cost safety**: Always set a sensible result limit in the Actor input (e.g., `maxResults`, `resultsLimit`, `maxCrawledPages`, or equivalent field from the input schema). Default to 100 results unless the user explicitly asks for more. Warn the user before running large scrapes (1000+ results) as they consume more Apify credits.
+For CSV: `apify datasets get-items DATASET_ID --format csv`
-### Step 4: Run the Script
+**Quick answer mode:** Fetch results as JSON, pick top 5, present formatted in chat.
-**Quick answer (display in chat, no file):**
-```bash
-node --env-file=.env ${CLAUDE_PLUGIN_ROOT}/reference/scripts/run_actor.js \
-  --actor "ACTOR_ID" \
-  --input 'JSON_INPUT'
-```
+**Save to file:** Fetch results, use Write tool to save as `YYYY-MM-DD_descriptive-name.csv` or `.json`.
-**CSV:**
-```bash
-node --env-file=.env ${CLAUDE_PLUGIN_ROOT}/reference/scripts/run_actor.js \
-  --actor "ACTOR_ID" \
-  --input 'JSON_INPUT' \
-  --output YYYY-MM-DD_OUTPUT_FILE.csv \
-  --format csv
-```
+**Large/long-running scrapes:**
-**JSON:**
-```bash
-node --env-file=.env ${CLAUDE_PLUGIN_ROOT}/reference/scripts/run_actor.js \
-  --actor "ACTOR_ID" \
-  --input 'JSON_INPUT' \
-  --output YYYY-MM-DD_OUTPUT_FILE.json \
-  --format json
-```
+    apify actors start "ACTOR_ID" -i 'JSON_INPUT' --json
-### Step 5: Summarize Results and Offer Follow-ups
+Poll: `apify runs info RUN_ID --json` (check `.status` for `SUCCEEDED`).
-After completion, report:
-- Number of results found
-- File location and name
-- Key fields available
-- **Suggested follow-up workflows** based on results:
+### Step 4: Deliver results
-| If User Got | Suggest Next |
-|-------------|--------------|
-| Business listings | Enrich with `vdrmota/contact-info-scraper` or get reviews |
-| Influencer profiles | Analyze engagement with comment scrapers |
-| Competitor pages | Deep-dive with post/ad scrapers |
-| Trend data | Validate with platform-specific hashtag scrapers |
+Report: result count, file location (if saved), key data fields, and links:
+- Dataset: `https://console.apify.com/storage/datasets/DATASET_ID`
+- Run: `https://console.apify.com/actors/runs/RUN_ID`
+For multi-step workflows: suggest the next pipeline step from the workflow guide.
-## Error Handling
+## Troubleshooting
-`APIFY_TOKEN not found` - Ask user to create `.env` with `APIFY_TOKEN=your_token`
-`Actor not found` - Check Actor ID spelling
-`Run FAILED` - Ask user to check Apify console link in error output
-`Timeout` - Reduce input size or increase `--timeout`
+Common errors and pitfalls are documented in `references/gotchas.md`. Read it before running PPE (pay-per-event) Actors.
diff --git a/skills/apify-ultimate-scraper/reference/scripts/fetch_actor_details.js b/skills/apify-ultimate-scraper/reference/scripts/fetch_actor_details.js deleted file mode 100644 index 3d0a1a0..0000000 --- a/skills/apify-ultimate-scraper/reference/scripts/fetch_actor_details.js +++ /dev/null @@ -1,136 +0,0 @@ -#!/usr/bin/env node -/** - * Fetch Apify Actor details: README, input schema, and description. - * - * Usage: - * node --env-file=.env scripts/fetch_actor_details.js --actor "apify/instagram-profile-scraper" - */ - -import { parseArgs } from 'node:util'; - -const USER_AGENT = 'apify-agent-skills/apify-ultimate-scraper-1.3.0'; - -function parseCliArgs() { - const options = { - actor: { type: 'string', short: 'a' }, - help: { type: 'boolean', short: 'h' }, - }; - - const { values } = parseArgs({ options, allowPositionals: false }); - - if (values.help) { - console.log(` -Fetch Apify Actor details (README, input schema, description) - -Usage: - node --env-file=.env scripts/fetch_actor_details.js --actor "ACTOR_ID" - -Options: - --actor, -a Actor ID (e.g., apify/instagram-profile-scraper) [required] - --help, -h Show this help message -`); - process.exit(0); - } - - if (!values.actor) { - console.error('Error: --actor is required'); - process.exit(1); - } - - return { actor: values.actor }; -} - -async function fetchActorInfo(token, actorId) { - const apiActorId = actorId.replace('/', '~'); - const url = `https://api.apify.com/v2/acts/${apiActorId}?token=${encodeURIComponent(token)}`; - - const response = await fetch(url, { - headers: { 'User-Agent': `${USER_AGENT}/fetch_actor_info` }, - }); - - if (response.status === 404) { - console.error(`Error: Actor '${actorId}' not found`); - process.exit(1); - } - - if (!response.ok) { - const text = await response.text(); - console.error(`Error: Failed to fetch actor info (${response.status}): ${text}`); - process.exit(1); - } - - return (await response.json()).data; -} - -async function fetchBuildDetails(token, 
actorId, buildId) { - const apiActorId = actorId.replace('/', '~'); - const url = `https://api.apify.com/v2/acts/${apiActorId}/builds/${buildId}?token=${encodeURIComponent(token)}`; - - const response = await fetch(url, { - headers: { 'User-Agent': `${USER_AGENT}/fetch_build` }, - }); - - if (!response.ok) { - return null; - } - - return (await response.json()).data; -} - -async function main() { - const args = parseCliArgs(); - - const token = process.env.APIFY_TOKEN; - if (!token) { - console.error('Error: APIFY_TOKEN not found in .env file'); - console.error('Add your token to .env: APIFY_TOKEN=your_token_here'); - console.error('Get your token: https://console.apify.com/account/integrations'); - process.exit(1); - } - - // Step 1: Get actor info (includes readmeSummary, taggedBuilds) - const actorInfo = await fetchActorInfo(token, args.actor); - - // Step 2: Get build details for input schema - const buildId = actorInfo.taggedBuilds?.latest?.buildId; - let inputSchema = null; - - if (buildId) { - const build = await fetchBuildDetails(token, args.actor, buildId); - if (build) { - const schemaRaw = build.inputSchema; - if (schemaRaw) { - inputSchema = typeof schemaRaw === 'string' ? 
JSON.parse(schemaRaw) : schemaRaw; - } - } - } - - // Compose output - const stats = actorInfo.stats || {}; - const output = { - actorId: args.actor, - title: actorInfo.title || null, - url: `https://apify.com/${args.actor}`, - description: actorInfo.description || null, - categories: actorInfo.categories || [], - isDeprecated: actorInfo.isDeprecated || false, - stats: { - totalUsers: stats.totalUsers || 0, - monthlyUsers: stats.totalUsers30Days || 0, - bookmarks: stats.bookmarkCount || 0, - }, - rating: { - average: stats.actorReviewRating || null, - count: stats.actorReviewCount || 0, - }, - readmeSummary: actorInfo.readmeSummary || null, - inputSchema: inputSchema || null, - }; - - console.log(JSON.stringify(output, null, 2)); -} - -main().catch((err) => { - console.error(`Error: ${err.message}`); - process.exit(1); -}); diff --git a/skills/apify-ultimate-scraper/reference/scripts/run_actor.js b/skills/apify-ultimate-scraper/reference/scripts/run_actor.js deleted file mode 100644 index 9a96457..0000000 --- a/skills/apify-ultimate-scraper/reference/scripts/run_actor.js +++ /dev/null @@ -1,363 +0,0 @@ -#!/usr/bin/env node -/** - * Apify Actor Runner - Runs Apify actors and exports results. 
- * - * Usage: - * # Quick answer (display in chat, no file saved) - * node --env-file=.env scripts/run_actor.js --actor ACTOR_ID --input '{}' - * - * # Export to file - * node --env-file=.env scripts/run_actor.js --actor ACTOR_ID --input '{}' --output leads.csv --format csv - */ - -import { parseArgs } from 'node:util'; -import { writeFileSync, statSync } from 'node:fs'; - -// User-Agent for tracking skill usage in Apify analytics -const USER_AGENT = 'apify-agent-skills/apify-ultimate-scraper-1.3.0'; - -// Parse command-line arguments -function parseCliArgs() { - const options = { - actor: { type: 'string', short: 'a' }, - input: { type: 'string', short: 'i' }, - output: { type: 'string', short: 'o' }, - format: { type: 'string', short: 'f', default: 'csv' }, - timeout: { type: 'string', short: 't', default: '600' }, - 'poll-interval': { type: 'string', default: '5' }, - help: { type: 'boolean', short: 'h' }, - }; - - const { values } = parseArgs({ options, allowPositionals: false }); - - if (values.help) { - printHelp(); - process.exit(0); - } - - if (!values.actor) { - console.error('Error: --actor is required'); - printHelp(); - process.exit(1); - } - - if (!values.input) { - console.error('Error: --input is required'); - printHelp(); - process.exit(1); - } - - return { - actor: values.actor, - input: values.input, - output: values.output, - format: values.format || 'csv', - timeout: parseInt(values.timeout, 10), - pollInterval: parseInt(values['poll-interval'], 10), - }; -} - -function printHelp() { - console.log(` -Apify Actor Runner - Run Apify actors and export results - -Usage: - node --env-file=.env scripts/run_actor.js --actor ACTOR_ID --input '{}' - -Options: - --actor, -a Actor ID (e.g., compass/crawler-google-places) [required] - --input, -i Actor input as JSON string [required] - --output, -o Output file path (optional - if not provided, displays quick answer) - --format, -f Output format: csv, json (default: csv) - --timeout, -t Max wait time in 
seconds (default: 600) - --poll-interval Seconds between status checks (default: 5) - --help, -h Show this help message - -Output Formats: - JSON (all data) --output file.json --format json - CSV (all data) --output file.csv --format csv - Quick answer (no --output) - displays top 5 in chat - -Examples: - # Quick answer - display top 5 in chat - node --env-file=.env scripts/run_actor.js \\ - --actor "compass/crawler-google-places" \\ - --input '{"searchStringsArray": ["coffee shops"], "locationQuery": "Seattle, USA"}' - - # Export all data to CSV - node --env-file=.env scripts/run_actor.js \\ - --actor "compass/crawler-google-places" \\ - --input '{"searchStringsArray": ["coffee shops"], "locationQuery": "Seattle, USA"}' \\ - --output leads.csv --format csv -`); -} - -// Start an actor run and return { runId, datasetId } -async function startActor(token, actorId, inputJson) { - // Convert "author/actor" format to "author~actor" for API compatibility - const apiActorId = actorId.replace('/', '~'); - const url = `https://api.apify.com/v2/acts/${apiActorId}/runs?token=${encodeURIComponent(token)}`; - - let data; - try { - data = JSON.parse(inputJson); - } catch (e) { - console.error(`Error: Invalid JSON input: ${e.message}`); - process.exit(1); - } - - const response = await fetch(url, { - method: 'POST', - headers: { - 'Content-Type': 'application/json', - 'User-Agent': `${USER_AGENT}/start_actor`, - }, - body: JSON.stringify(data), - }); - - if (response.status === 404) { - console.error(`Error: Actor '${actorId}' not found`); - process.exit(1); - } - - if (!response.ok) { - const text = await response.text(); - console.error(`Error: API request failed (${response.status}): ${text}`); - process.exit(1); - } - - const result = await response.json(); - return { - runId: result.data.id, - datasetId: result.data.defaultDatasetId, - }; -} - -// Poll run status until complete or timeout -async function pollUntilComplete(token, runId, timeout, interval) { - const url = 
`https://api.apify.com/v2/actor-runs/${runId}?token=${encodeURIComponent(token)}`; - const startTime = Date.now(); - let lastStatus = null; - - while (true) { - const response = await fetch(url); - if (!response.ok) { - const text = await response.text(); - console.error(`Error: Failed to get run status: ${text}`); - process.exit(1); - } - - const result = await response.json(); - const status = result.data.status; - - // Only print when status changes - if (status !== lastStatus) { - console.log(`Status: ${status}`); - lastStatus = status; - } - - if (['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT'].includes(status)) { - return status; - } - - const elapsed = (Date.now() - startTime) / 1000; - if (elapsed > timeout) { - console.error(`Warning: Timeout after ${timeout}s, actor still running`); - return 'TIMED-OUT'; - } - - await sleep(interval * 1000); - } -} - -// Download dataset items -async function downloadResults(token, datasetId, outputPath, format) { - const url = `https://api.apify.com/v2/datasets/${datasetId}/items?token=${encodeURIComponent(token)}&format=json`; - - const response = await fetch(url, { - headers: { - 'User-Agent': `${USER_AGENT}/download_${format}`, - }, - }); - - if (!response.ok) { - const text = await response.text(); - console.error(`Error: Failed to download results: ${text}`); - process.exit(1); - } - - const data = await response.json(); - - if (format === 'json') { - writeFileSync(outputPath, JSON.stringify(data, null, 2)); - } else { - // CSV output - if (data.length > 0) { - const fieldnames = Object.keys(data[0]); - const csvLines = [fieldnames.join(',')]; - - for (const row of data) { - const values = fieldnames.map((key) => { - let value = row[key]; - - // Truncate long text fields - if (typeof value === 'string' && value.length > 200) { - value = value.slice(0, 200) + '...'; - } else if (Array.isArray(value) || (typeof value === 'object' && value !== null)) { - value = JSON.stringify(value) || ''; - } - - // CSV escape: wrap 
in quotes if contains comma, quote, or newline - if (value === null || value === undefined) { - return ''; - } - const strValue = String(value); - if (strValue.includes(',') || strValue.includes('"') || strValue.includes('\n')) { - return `"${strValue.replace(/"/g, '""')}"`; - } - return strValue; - }); - csvLines.push(values.join(',')); - } - - writeFileSync(outputPath, csvLines.join('\n')); - } else { - writeFileSync(outputPath, ''); - } - } - - console.log(`Saved to: ${outputPath}`); -} - -// Display top 5 results in chat format -async function displayQuickAnswer(token, datasetId) { - const url = `https://api.apify.com/v2/datasets/${datasetId}/items?token=${encodeURIComponent(token)}&format=json`; - - const response = await fetch(url, { - headers: { - 'User-Agent': `${USER_AGENT}/quick_answer`, - }, - }); - - if (!response.ok) { - const text = await response.text(); - console.error(`Error: Failed to download results: ${text}`); - process.exit(1); - } - - const data = await response.json(); - const total = data.length; - - if (total === 0) { - console.log('\nNo results found.'); - return; - } - - // Display top 5 - console.log(`\n${'='.repeat(60)}`); - console.log(`TOP 5 RESULTS (of ${total} total)`); - console.log('='.repeat(60)); - - for (let i = 0; i < Math.min(5, data.length); i++) { - const item = data[i]; - console.log(`\n--- Result ${i + 1} ---`); - - for (const [key, value] of Object.entries(item)) { - let displayValue = value; - - // Truncate long values - if (typeof value === 'string' && value.length > 100) { - displayValue = value.slice(0, 100) + '...'; - } else if (Array.isArray(value) || (typeof value === 'object' && value !== null)) { - const jsonStr = JSON.stringify(value); - displayValue = jsonStr.length > 100 ? jsonStr.slice(0, 100) + '...' 
: jsonStr; - } - - console.log(` ${key}: ${displayValue}`); - } - } - - console.log(`\n${'='.repeat(60)}`); - if (total > 5) { - console.log(`Showing 5 of ${total} results.`); - } - console.log(`Full data available at: https://console.apify.com/storage/datasets/${datasetId}`); - console.log('='.repeat(60)); -} - -// Report summary of downloaded data -function reportSummary(outputPath, format) { - const stats = statSync(outputPath); - const size = stats.size; - - let count; - try { - const content = require('fs').readFileSync(outputPath, 'utf-8'); - if (format === 'json') { - const data = JSON.parse(content); - count = Array.isArray(data) ? data.length : 1; - } else { - // CSV - count lines minus header - const lines = content.split('\n').filter((line) => line.trim()); - count = Math.max(0, lines.length - 1); - } - } catch { - count = 'unknown'; - } - - console.log(`Records: ${count}`); - console.log(`Size: ${size.toLocaleString()} bytes`); -} - -// Helper: sleep for ms -function sleep(ms) { - return new Promise((resolve) => setTimeout(resolve, ms)); -} - -// Main function -async function main() { - // Parse args first so --help works without token - const args = parseCliArgs(); - - // Check for APIFY_TOKEN - const token = process.env.APIFY_TOKEN; - if (!token) { - console.error('Error: APIFY_TOKEN not found in .env file'); - console.error(''); - console.error('Add your token to .env file:'); - console.error(' APIFY_TOKEN=your_token_here'); - console.error(''); - console.error('Get your token: https://console.apify.com/account/integrations'); - process.exit(1); - } - - // Start the actor run - console.log(`Starting actor: ${args.actor}`); - const { runId, datasetId } = await startActor(token, args.actor, args.input); - console.log(`Run ID: ${runId}`); - console.log(`Dataset ID: ${datasetId}`); - - // Poll for completion - const status = await pollUntilComplete(token, runId, args.timeout, args.pollInterval); - - if (status !== 'SUCCEEDED') { - console.error(`Error: 
Actor run ${status}`); - console.error(`Details: https://console.apify.com/actors/runs/${runId}`); - process.exit(1); - } - - // Determine output mode - if (args.output) { - // File output mode - await downloadResults(token, datasetId, args.output, args.format); - reportSummary(args.output, args.format); - } else { - // Quick answer mode - display in chat - await displayQuickAnswer(token, datasetId); - } -} - -main().catch((err) => { - console.error(`Error: ${err.message}`); - process.exit(1); -}); diff --git a/skills/apify-ultimate-scraper/reference/scripts/search_actors.js b/skills/apify-ultimate-scraper/reference/scripts/search_actors.js deleted file mode 100644 index e96823b..0000000 --- a/skills/apify-ultimate-scraper/reference/scripts/search_actors.js +++ /dev/null @@ -1,103 +0,0 @@ -#!/usr/bin/env node -/** - * Search Apify Store for Actors matching keywords. - * - * Usage: - * node --env-file=.env scripts/search_actors.js --query "instagram" - * node --env-file=.env scripts/search_actors.js --query "amazon products" --limit 5 - */ - -import { parseArgs } from 'node:util'; - -const USER_AGENT = 'apify-agent-skills/apify-ultimate-scraper-1.3.0'; - -function parseCliArgs() { - const options = { - query: { type: 'string', short: 'q' }, - limit: { type: 'string', short: 'l', default: '10' }, - help: { type: 'boolean', short: 'h' }, - }; - - const { values } = parseArgs({ options, allowPositionals: false }); - - if (values.help) { - console.log(` -Search Apify Store for Actors - -Usage: - node --env-file=.env scripts/search_actors.js --query "KEYWORDS" - -Options: - --query, -q Search keywords (e.g., "instagram", "amazon products") [required] - --limit, -l Max results to return (default: 10) - --help, -h Show this help message -`); - process.exit(0); - } - - if (!values.query) { - console.error('Error: --query is required'); - process.exit(1); - } - - return { - query: values.query, - limit: parseInt(values.limit, 10) || 10, - }; -} - -async function 
searchStore(query, limit) { - const params = new URLSearchParams({ search: query, limit: String(limit) }); - const url = `https://api.apify.com/v2/store?${params}`; - - const response = await fetch(url, { - headers: { 'User-Agent': `${USER_AGENT}/search_actors` }, - }); - - if (!response.ok) { - const text = await response.text(); - console.error(`Error: Store search failed (${response.status}): ${text}`); - process.exit(1); - } - - const result = await response.json(); - return result.data?.items || []; -} - -function formatResults(actors) { - if (actors.length === 0) { - console.log('No actors found.'); - return; - } - - console.log(`Found ${actors.length} actor(s):\n`); - - for (const actor of actors) { - const id = `${actor.username}/${actor.name}`; - const title = actor.title || id; - const desc = actor.description - ? actor.description.length > 120 - ? actor.description.slice(0, 120) + '...' - : actor.description - : 'No description'; - const runs = actor.stats?.totalRuns?.toLocaleString() || '0'; - const users = actor.stats?.totalUsers?.toLocaleString() || '0'; - - console.log(` ${id}`); - console.log(` Title: ${title}`); - console.log(` ${desc}`); - console.log(` Runs: ${runs} | Users: ${users}`); - console.log(); - } -} - -async function main() { - const args = parseCliArgs(); - const actors = await searchStore(args.query, args.limit); - formatResults(actors); -} - -main().catch((err) => { - console.error(`Error: ${err.message}`); - process.exit(1); -}); diff --git a/skills/apify-ultimate-scraper/references/actor-index.md b/skills/apify-ultimate-scraper/references/actor-index.md new file mode 100644 index 0000000..32627c2 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/actor-index.md @@ -0,0 +1,206 @@ +# Actor index + +Flat lookup for Actor selection. For input schemas, fetch dynamically: +`apify actors info "ACTOR_ID" --input --json` + +Tiers: `apify` = Apify-maintained (always prefer), `community` = community-maintained (fill gaps). 
+
+## Instagram
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| apify/instagram-scraper | apify | all Instagram data |
+| apify/instagram-profile-scraper | apify | profiles, followers, bio |
+| apify/instagram-post-scraper | apify | posts, engagement metrics |
+| apify/instagram-comment-scraper | apify | post and reel comments |
+| apify/instagram-hashtag-scraper | apify | posts by hashtag |
+| apify/instagram-hashtag-analytics-scraper | apify | hashtag metrics, trends |
+| apify/instagram-reel-scraper | apify | reels, transcripts, engagement |
+| apify/instagram-api-scraper | apify | API-based, no login |
+| apify/instagram-search-scraper | apify | search users, places |
+| apify/instagram-tagged-scraper | apify | tagged/mentioned posts |
+| apify/instagram-topic-scraper | apify | posts by topic |
+| apify/instagram-followers-count-scraper | apify | follower count tracking |
+| apify/export-instagram-comments-posts | apify | bulk posts + comments |
+
+## Facebook
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| apify/facebook-posts-scraper | apify | posts, videos, engagement |
+| apify/facebook-comments-scraper | apify | comment extraction |
+| apify/facebook-likes-scraper | apify | reactions, liker info |
+| apify/facebook-groups-scraper | apify | public group content |
+| apify/facebook-events-scraper | apify | events, attendees |
+| apify/facebook-reels-scraper | apify | reels, engagement |
+| apify/facebook-photos-scraper | apify | photos with OCR |
+| apify/facebook-search-scraper | apify | page search |
+| apify/facebook-marketplace-scraper | apify | marketplace listings |
+| apify/facebook-followers-following-scraper | apify | follower lists |
+| apify/facebook-video-search-scraper | apify | video search |
+| apify/facebook-ads-scraper | apify | ad library, creatives |
+| apify/facebook-page-contact-information | apify | page contact info |
+| apify/facebook-reviews-scraper | apify | page reviews |
+| apify/facebook-hashtag-scraper | apify | hashtag posts |
+| apify/threads-profile-api-scraper | apify | Threads profiles |
+
+## TikTok
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| clockworks/tiktok-scraper | apify | all TikTok data |
+| clockworks/tiktok-profile-scraper | apify | profiles, videos |
+| clockworks/tiktok-video-scraper | apify | video details, metrics |
+| clockworks/tiktok-comments-scraper | apify | video comments |
+| clockworks/tiktok-hashtag-scraper | apify | videos by hashtag |
+| clockworks/tiktok-followers-scraper | apify | follower profiles |
+| clockworks/tiktok-user-search-scraper | apify | user search |
+| clockworks/tiktok-sound-scraper | apify | videos by sound |
+| clockworks/free-tiktok-scraper | apify | free tier extraction |
+| clockworks/tiktok-ads-scraper | apify | hashtag analytics |
+| clockworks/tiktok-trends-scraper | apify | trending content |
+| clockworks/tiktok-explore-scraper | apify | explore categories |
+| clockworks/tiktok-discover-scraper | apify | discover by hashtag |
+
+## YouTube
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| streamers/youtube-scraper | apify | videos, metrics |
+| streamers/youtube-channel-scraper | apify | channel info |
+| streamers/youtube-comments-scraper | apify | video comments |
+| streamers/youtube-shorts-scraper | apify | shorts data |
+| streamers/youtube-video-scraper-by-hashtag | apify | videos by hashtag |
+| streamers/youtube-video-downloader | apify | video download |
+| curious_coder/youtube-transcript-scraper | community | transcripts, captions |
+
+## X/Twitter
+
+| Actor | Tier | Best for |
+|-------|------|----------|
+| apidojo/tweet-scraper | community | tweet search |
+| apidojo/twitter-scraper-lite | community | comprehensive, no limits |
+| apidojo/twitter-user-scraper | community | user profiles |
+| apidojo/twitter-profile-scraper | community | profiles + recent tweets |
+| apidojo/twitter-list-scraper | community | tweets from lists |
+ +## LinkedIn + +| Actor | Tier | Best for | +|-------|------|----------| +| harvestapi/linkedin-profile-search | community | find profiles | +| harvestapi/linkedin-profile-scraper | community | profile with email | +| harvestapi/linkedin-company | community | company details | +| harvestapi/linkedin-company-employees | community | employee lists | +| harvestapi/linkedin-company-posts | community | company page posts | +| harvestapi/linkedin-profile-posts | community | profile posts | +| harvestapi/linkedin-job-search | community | job listings | +| harvestapi/linkedin-post-search | community | post search | +| harvestapi/linkedin-post-comments | community | post comments | +| harvestapi/linkedin-profile-search-by-name | community | find by name | +| harvestapi/linkedin-profile-search-by-services | community | find by service | +| apimaestro/linkedin-companies-search-scraper | community | company search | +| apimaestro/linkedin-company-detail | community | company deep data | +| apimaestro/linkedin-jobs-scraper-api | community | job search | +| apimaestro/linkedin-job-detail | community | job details | +| apimaestro/linkedin-batch-profile-posts-scraper | community | batch profile posts | +| apimaestro/linkedin-post-reshares | community | post reshares | +| apimaestro/linkedin-post-detail | community | post details | +| apimaestro/linkedin-profile-full-sections-scraper | community | full profile data | +| dev_fusion/linkedin-profile-scraper | community | mass scraping + email | + +## Google Maps + +| Actor | Tier | Best for | +|-------|------|----------| +| compass/crawler-google-places | apify | business listings | +| compass/google-maps-extractor | apify | detailed business data | +| compass/Google-Maps-Reviews-Scraper | apify | reviews, ratings | +| compass/enrich-google-maps-dataset-with-contacts | apify | email enrichment | +| compass/contact-details-scraper-standby | apify | quick contact extract | +| lukaskrivka/google-maps-with-contact-details | community | 
listings + contacts | +| curious_coder/google-maps-reviews-scraper | community | cheap review scraping | + +## Google Search and Trends + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/google-search-scraper | apify | SERP, ads, AI overviews | +| apify/google-trends-scraper | apify | trend data | +| tri_angle/bing-search-scraper | apify | Bing SERP data | + +## Reviews (cross-platform) + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/hotel-review-aggregator | apify | 7-platform hotel reviews | +| tri_angle/restaurant-review-aggregator | apify | 6-platform restaurant reviews | +| tri_angle/yelp-scraper | apify | Yelp business data | +| tri_angle/yelp-review-scraper | apify | Yelp reviews | +| tri_angle/get-tripadvisor-urls | apify | find TripAdvisor URLs | +| tri_angle/get-yelp-urls | apify | find Yelp URLs | +| tri_angle/airbnb-reviews-scraper | apify | Airbnb reviews | +| tri_angle/social-media-sentiment-analysis-tool | apify | sentiment analysis | + +## Real estate and hospitality + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/airbnb-scraper | apify | Airbnb listings | +| tri_angle/new-fast-airbnb-scraper | apify | fast Airbnb search | +| tri_angle/airbnb-rooms-urls-scraper | apify | detailed room data | +| tri_angle/redfin-search | apify | Redfin property search | +| tri_angle/redfin-detail | apify | Redfin property details | +| tri_angle/real-estate-aggregator | apify | multi-source listings | +| tri_angle/fast-zoopla-properties-scraper | apify | UK properties | +| tri_angle/doordash-store-details-scraper | apify | DoorDash stores | +| tri_angle/cargurus-zipcode-search-scraper | apify | CarGurus listings | +| tri_angle/carmax-zipcode-search-scraper | apify | Carmax listings | + +## SEO tools + +| Actor | Tier | Best for | +|-------|------|----------| +| radeance/similarweb-scraper | community | traffic, rankings | +| radeance/ahrefs-scraper | community | backlinks, keywords | +| 
radeance/semrush-scraper | community | domain authority | +| radeance/moz-scraper | community | DA, spam score | +| radeance/ubersuggest-scraper | community | keyword suggestions | +| radeance/se-ranking-scraper | community | keyword CPC | + +## Content and web crawling + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/website-content-crawler | apify | clean text for AI | +| apify/rag-web-browser | apify | RAG pipelines | +| apify/web-scraper | apify | general web scraping | +| apify/cheerio-scraper | apify | fast HTML parsing | +| apify/playwright-scraper | apify | JS-heavy sites | +| apify/camoufox-scraper | apify | anti-bot sites | +| apify/sitemap-extractor | apify | sitemap URLs | +| lukaskrivka/article-extractor-smart | community | article extraction | + +## Other platforms + +| Actor | Tier | Best for | +|-------|------|----------| +| tri_angle/telegram-scraper | apify | Telegram messages | +| tri_angle/snapchat-scraper | apify | Snapchat profiles | +| tri_angle/snapchat-spotlight-scraper | apify | Snapchat Spotlight | +| tri_angle/truth-scraper | apify | Truth Social | +| tri_angle/social-media-finder | apify | cross-platform search | +| tri_angle/website-changes-detector | apify | website monitoring | +| tri_angle/e-commerce-product-matching-tool | apify | product matching | +| trudax/reddit-scraper-lite | community | Reddit posts | +| janbuchar/github-contributors-scraper | community | GitHub contributors | + +## Enrichment and contacts + +| Actor | Tier | Best for | +|-------|------|----------| +| apify/social-media-leads-analyzer | apify | emails from websites | +| apify/social-media-hashtag-research | apify | cross-platform hashtags | +| apify/e-commerce-scraping-tool | apify | product data enrichment | +| vdrmota/contact-info-scraper | community | contact extraction | +| code_crafter/leads-finder | community | B2B leads | diff --git a/skills/apify-ultimate-scraper/references/gotchas.md 
b/skills/apify-ultimate-scraper/references/gotchas.md new file mode 100644 index 0000000..def87c7 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/gotchas.md @@ -0,0 +1,108 @@ +# Gotchas and cost guardrails + +## Pricing models + +| Model | How it works | Action before running | +|-------|-------------|----------------------| +| FREE | No per-result cost, only platform compute | None needed | +| PAY_PER_EVENT (PPE) | Charged per result item | MUST estimate cost first | +| FLAT_PRICE_PER_MONTH | Monthly subscription | Verify user has active subscription | + +To check an Actor's pricing: + + apify actors info "ACTOR_ID" --json + +Read `.currentPricingInfo.pricingModel` and `.currentPricingInfo.pricePerEvent`. + +## Cost estimation protocol + +Before running any PPE Actor: + +1. Get the per-event price from Actor info (`.currentPricingInfo.pricePerEvent`) +2. Multiply by the requested result count +3. Present the estimate to the user: "This will cost approximately $X for Y results" +4. If estimate > $5: warn explicitly +5. If estimate > $20: require explicit user confirmation before proceeding + +## Common pitfalls + +**Cookie-dependent Actors** +Some social media scrapers require cookies or login sessions. If an Actor returns auth errors or empty results unexpectedly, check its README: + + apify actors info "ACTOR_ID" --readme + +Look for mentions of "cookies", "login", "session", or "proxy". + +**Rate limiting on large scrapes** +Platforms throttle or block large-volume scraping. 
Mitigations: +- Use proxy configuration when available: `"proxyConfiguration": {"useApifyProxy": true}` +- Set reasonable concurrency limits (check the Actor's `maxConcurrency` input) +- For 1,000+ results, suggest splitting into smaller batches + +**Empty results** +Common causes: +- Too-narrow search query or geo-restriction (try broader terms) +- Platform blocking without proxy (enable Apify Proxy) +- Actor requires cookies/login but none provided +- Wrong input field name (always verify with `--input --json`) + +**maxResults vs maxCrawledPages** +Different Actors use different limit field names. Common variants: +- `maxResults`, `resultsLimit`, `maxItems` - limit output items +- `maxCrawledPages`, `maxRequestsPerCrawl` - limit pages visited +Always fetch the input schema to find the correct field for the specific Actor. + +**Deprecated Actors** +Check `.isDeprecated` in `apify actors info --json`. If `true`: +1. Search for alternatives: `apify actors search "SIMILAR_KEYWORDS" --json` +2. Prefer `apify` tier replacements over `community` alternatives + +**LinkedIn pricing** +LinkedIn Actors are all PPE and vary significantly: +- `harvestapi/` Actors: generally cheaper ($0.001-0.01/result) +- `apimaestro/` Actors: generally more expensive ($0.005-0.02/result) +- `dev_fusion/` Actors: mid-range, useful for mass scraping with email enrichment +Always compare pricing before selecting a LinkedIn Actor. + +**SEO tool pricing** +`radeance/` SEO scrapers (SimilarWeb, Ahrefs, SEMrush, Moz) have the highest per-result costs ($0.005-0.0275/result). For large-scale SEO analysis, estimate costs carefully and suggest batching. 
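The cost estimation protocol above reduces to a few lines of logic. A minimal sketch, assuming you have already read the per-event price from `.currentPricingInfo.pricePerEvent` via `apify actors info`:

```python
def estimate_ppe_cost(price_per_event: float, result_count: int):
    """Classify a PPE run estimate against the $5 warn / $20 confirm
    thresholds from the cost estimation protocol above."""
    estimate = round(price_per_event * result_count, 2)
    if estimate > 20:
        action = "require explicit user confirmation"
    elif estimate > 5:
        action = "warn explicitly"
    else:
        action = "proceed"
    return estimate, action
```

For example, `estimate_ppe_cost(0.01, 1000)` returns `(10.0, "warn explicitly")` — over the $5 warning line but under the $20 confirmation line.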
+ +## Error recovery + +| Symptom | Likely cause | Fix | +|---------|-------------|-----| +| `status: FAILED` in run output | Actor crashed or input invalid | Read `.statusMessage` in JSON; check run log at `https://console.apify.com/actors/runs/RUN_ID/log` | +| `isDeprecated: true` in Actor info | Actor is end-of-life | Search for replacement: `apify actors search "KEYWORDS" --json` | +| Empty dataset (0 items) | Query too narrow, geo-restriction, or anti-bot block | Broaden search terms; enable Apify Proxy; check Actor README with `apify actors info ACTOR_ID --readme` | +| Run takes >10 minutes | Large scrape or slow target site | Switch to fire-and-forget: `apify actors start --json`, poll with `apify runs info RUN_ID --json` | + +## Why Apify Actors vs raw HTTP scraping + +Many n8n and automation workflows use raw HTTP Request nodes or self-hosted Puppeteer for web scraping. These hit common walls that Apify Actors handle transparently: + +**Cloudflare and WAF bypass** +Raw HTTP requests fail on sites with Cloudflare Turnstile, DataDome, or other WAFs. Apify Actors use residential proxies and browser fingerprint rotation automatically. For the toughest sites, use `apify/camoufox-scraper`. + +**JavaScript-rendered pages (SPAs)** +React, Vue, and Angular sites return empty HTML to plain HTTP requests. Apify's `apify/playwright-scraper` and `apify/camoufox-scraper` fully render JavaScript before extracting data. + +**Anti-bot fingerprinting** +Even headless browsers get detected via TLS fingerprints (JA3 hashes). Apify's browser pool rotates fingerprints across requests automatically. + +**Session and cookie management** +Social media platforms (LinkedIn, Instagram) require persistent sessions. Social media Actors handle cookie management and session rotation internally. + +**Scaling without infrastructure** +Self-hosted Puppeteer at scale requires 4-8 GB RAM per browser instance. 
Apify Actors run on serverless infrastructure - no browser pool management, no RAM provisioning, no Docker orchestration. + +## Platform-specific rate limits + +**Instagram:** Aggressive rate limiting. Keep `maxResults` under 200 per run for profile/post scrapers. Use delays between runs. Instagram API scrapers (`apify/instagram-api-scraper`) have higher limits than browser-based ones. + +**LinkedIn:** All LinkedIn Actors are community-maintained and PPE. LinkedIn actively blocks scraping at scale. Keep batch sizes under 100 profiles. Space runs at least 5 minutes apart. Expect occasional empty results. + +**TikTok:** Anti-bot measures increasing. `clockworks/tiktok-scraper` handles most cases. For blocked regions, enable Apify Proxy with residential IPs. + +**Google Maps:** Generally stable. Set `language: "en"` explicitly for consistent results. Large-area searches may return different results depending on zoom level - use specific location queries over broad city names. + +**Amazon/E-commerce:** Heavy anti-bot. The `apify/e-commerce-scraping-tool` handles this via built-in proxy rotation. Raw HTTP requests will fail. diff --git a/skills/apify-ultimate-scraper/references/workflows/brand-monitoring.md b/skills/apify-ultimate-scraper/references/workflows/brand-monitoring.md new file mode 100644 index 0000000..89e82d0 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/brand-monitoring.md @@ -0,0 +1,84 @@ +# Brand monitoring workflows + +## Cross-platform brand mention tracking +**When:** User wants to monitor brand mentions, hashtags, or sentiment across social platforms. + +### Pipeline (run each independently, combine results) +1. **Instagram mentions** -> `apify/instagram-tagged-scraper` + - Key input: `username` (brand handle) +2. **Instagram hashtags** -> `apify/instagram-hashtag-scraper` + - Key input: `hashtags` (branded hashtags) +3. 
**X/Twitter mentions** -> `apidojo/tweet-scraper` + - Key input: `searchTerms` (brand name, handle, hashtags) +4. **Reddit mentions** -> `trudax/reddit-scraper-lite` + - Key input: `searchQuery` (brand name) + +### Output fields +Instagram: `caption`, `likesCount`, `commentsCount`, `timestamp`, `ownerUsername` +X/Twitter: `text`, `retweetCount`, `likeCount`, `replyCount`, `createdAt`, `author` +Reddit: `title`, `body`, `score`, `numComments`, `subreddit`, `createdAt` + +### Gotcha +This is a parallel workflow, not sequential. Run each Actor independently. Combine results by date for a timeline view. + +## Twitter/X real-time mention routing +**When:** User wants to route brand mentions on X to the right team channel - negative to support, positive to wins - with sentiment scoring. + +### Pipeline +1. **Collect tweets** -> `apidojo/tweet-scraper` + - Key input: `searchTerms` (brand name + variants), `maxItems`, `since` (ISO date for incremental runs) +2. **Score sentiment** -> `tri_angle/social-media-sentiment-analysis-tool` + - Pipe: `results[].url` -> `urls` + - Key input: `urls`, `platforms` + +### Output fields +Step 1: `text`, `author.userName`, `createdAt`, `likeCount`, `retweetCount`, `url` +Step 2: `sentiment` (positive/negative/neutral), `score`, `text`, `platform` + +### Gotcha +Use `since` on each run (store last tweet `createdAt` in Sheets) to avoid reprocessing the same mentions. Without dedup, alerts fire on the same tweet repeatedly. + +## Reddit brand and topic monitoring +**When:** User wants weekly surfacing of brand mentions, product feedback, and competitor comparisons from Reddit. + +### Pipeline +1. 
**Scrape Reddit** -> `trudax/reddit-scraper-lite` + - Key input: `subreddits` (target subreddit array), `searchTerms` (brand + competitor names), `maxItems`, `sort` (hot/new/top) + +### Output fields +Step 1: `title`, `body`, `subreddit`, `url`, `score`, `numberOfComments`, `createdAt` + +### Gotcha +Set `sort: "new"` for monitoring runs; use `sort: "top"` for periodic digest reports. Mixing both in one run returns inconsistent result sets. + +## Multi-platform social listening with sentiment +**When:** User wants a unified brand health view across Instagram, Facebook, TikTok, and Twitter simultaneously. + +### Pipeline (run in parallel) +1. **Instagram** -> `apify/instagram-search-scraper` + - Key input: `searchTerms` (brand variants), `maxItems` +2. **Facebook** -> `apify/facebook-search-scraper` + - Key input: `searchTerms`, `maxItems` +3. **TikTok** -> `clockworks/tiktok-user-search-scraper` + - Key input: `searchTerms`, `maxItems` +4. **Twitter** -> `apidojo/tweet-scraper` + - Key input: `searchTerms`, `maxItems` +5. **Sentiment scoring** -> `tri_angle/social-media-sentiment-analysis-tool` + - Pipe: merged post URLs from steps 1-4 -> `urls` + - Key input: `urls`, `platforms` + +### Output fields +Steps 1-4 (normalized): `text`, `platform`, `author`, `timestamp`, `engagementCount` +Step 5: `sentiment`, `score`, `text`, `platform` + +## Sentiment analysis +**When:** User wants sentiment scoring on collected mentions. + +### Pipeline +1. **Collect mentions** (use any step from above) +2. 
**Analyze sentiment** -> `tri_angle/social-media-sentiment-analysis-tool` + - Pipe: collected post URLs -> `urls` + - Key input: `urls`, `platforms` + +### Output fields +Step 2: `sentiment` (positive/negative/neutral), `score`, `text`, `platform` diff --git a/skills/apify-ultimate-scraper/references/workflows/company-research.md b/skills/apify-ultimate-scraper/references/workflows/company-research.md new file mode 100644 index 0000000..6de344e --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/company-research.md @@ -0,0 +1,76 @@ +# Company research workflows + +## Company intelligence profiling for sales or ABM +**When:** User has a list of target accounts and wants structured firmographic data, ICP signals, and key personnel for outreach or account-based marketing. + +### Pipeline +1. **Crawl company website** -> `apify/website-content-crawler` + - Key input: `startUrls` (company domains), `maxCrawlDepth` (2), `includeUrlGlobs` (about, pricing, team, careers, blog) +2. **Enrich with LinkedIn company data** -> `harvestapi/linkedin-company` + - Pipe: company name or LinkedIn URL extracted from WCC text -> Actor input + - Key input: company identifier, `includeEmployees: false` +3. **AI extract structured signals** (n8n: OpenAI node returns structured JSON with `companySize`, `industry`, `techStack`, `keyPersonnel`, `painSignals`) +4. **Store** in Supabase, Airtable, or HubSpot with AI-extracted fields as custom properties + +### Output fields +WCC: `text` (per page), `url` +LinkedIn: `employeeCount`, `industry`, `headquarters`, `description`, `specialties` +AI-extracted: `companySize`, `industry`, `techStack`, `keyPersonnel`, `painSignals` + +### Gotcha +WCC crawl at depth 2 can return 20-50 pages per company. For large batches, set `maxCrawlPages: 5` focused on the About and Pricing pages via `includeUrlGlobs`. This keeps cost and latency manageable without sacrificing signal quality. 
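The focused-crawl advice in the gotcha above can be expressed as a small input builder. This is a sketch: the glob patterns and the exact `includeUrlGlobs` item format are assumptions — verify them against the Actor's input schema before running.

```python
def focused_wcc_input(domains, globs=None, max_pages=5):
    """Build a run input for apify/website-content-crawler that keeps the
    crawl focused on high-signal pages, per the gotcha above.
    Glob format is illustrative - check the Actor's input schema."""
    globs = globs or ["**/about*", "**/pricing*", "**/team*", "**/careers*"]
    return {
        "startUrls": [{"url": f"https://{d}"} for d in domains],
        "maxCrawlDepth": 2,
        "maxCrawlPages": max_pages,
        "includeUrlGlobs": globs,
    }
```

For a 100-account batch this caps the crawl at 500 pages instead of the 2,000-5,000 an unrestricted depth-2 crawl could produce.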
+ +--- + +## Startup scouting from Product Hunt +**When:** User wants weekly discovery of recently launched or funded startups in a target category for investor outreach, partnership, or competitive tracking. + +### Pipeline +1. **Scrape Product Hunt launches** -> `apify/web-scraper` + - Key input: `startUrls` (Product Hunt today/weekly/topic pages), `maxCrawlPages` (3-5) + - Note: no dedicated Actor exists - search `apify actors search "product hunt"` for community options +2. **Filter by category + upvote threshold** (n8n: Filter node on extracted `upvotes`, `category`) +3. **Crawl company sites** -> `apify/website-content-crawler` + - Pipe: `results[].website` -> `startUrls` + - Key input: `maxCrawlPages` (3), `includeUrlGlobs` (about, team) +4. **LinkedIn founder lookup** -> `harvestapi/linkedin-profile-search` (by name + company) + - Pipe: extracted founder names from step 3 -> search input +5. **AI ICP scoring** (n8n: OpenAI node scores each startup against defined criteria) +6. **Output** to Airtable pipeline + Slack alert for top matches + +### Output fields +Step 1: `title`, `tagline`, `upvotes`, `website`, `makers[].name`, `makers[].profileUrl` +WCC: `text`, `url` +LinkedIn: `fullName`, `headline`, `profileUrl`, `currentCompany` + +### Gotcha +Product Hunt ranking changes throughout the day. Schedule the scrape for end-of-day (11 pm UTC) to capture final vote counts. For AngelList/Wellfound, no maintained public Actor exists - use `apify/website-content-crawler` on search result pages as a fallback. + +--- + +## Sales meeting prep from LinkedIn and news +**When:** A calendar event is detected and the user needs a briefing on meeting attendees - their recent activity, company context, and conversation talking points - delivered before the meeting. + +### Pipeline +1. **Trigger on calendar event** (n8n: Google Calendar Trigger, filter for events starting in 30 min) +2. 
**Extract attendee LinkedIn activity** -> `harvestapi/linkedin-profile-scraper` + `harvestapi/linkedin-profile-posts` + - Key input: `profiles` (attendee LinkedIn URLs), `maxPosts` (5 recent posts) +3. **Extract company context** -> `harvestapi/linkedin-company` + - Pipe: `results[].currentCompany` -> company name +4. **Search recent news** -> `apify/google-search-scraper` + - Pipe: company name + "news" -> `queries` + - Key input: `queries`, `maxResultsPerPage` (5) +5. **AI synthesize brief** (n8n: OpenAI node produces: key topics, recent signals, suggested talking points) +6. **Deliver** via Gmail or WhatsApp node 30 minutes before meeting start + +### Output fields +Profiles: `fullName`, `headline`, `recentPosts[]`, `experience[0]` +Posts: `postText`, `publishedAt`, `likes`, `comments` +Company: `description`, `employeeCount`, `recentNews` +Search: `organicResults[].title`, `organicResults[].snippet`, `organicResults[].url` + +### Cost estimate +All HarvestAPI Actors are PPE. Per meeting with 2 attendees: profile scraper ~$0.02, posts ~$0.01, company ~$0.005, Google search ~$0.01. Total: ~$0.05 per meeting prep. + +### Gotcha +LinkedIn profile URLs must be in the calendar event description or a linked CRM record - they won't auto-resolve from email addresses. Set up a step in your CRM or calendar template to include LinkedIn URLs for attendees. Without a valid `profileUrl`, the HarvestAPI Actors return empty results. diff --git a/skills/apify-ultimate-scraper/references/workflows/competitive-intel.md b/skills/apify-ultimate-scraper/references/workflows/competitive-intel.md new file mode 100644 index 0000000..12e7481 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/competitive-intel.md @@ -0,0 +1,85 @@ +# Competitive intelligence workflows + +## Competitor ad monitoring +**When:** User wants to see competitor advertising creatives, targeting, or ad spend signals. + +### Pipeline +1. 
**Scrape ad library** -> `apify/facebook-ads-scraper` + - Key input: `searchQuery` (competitor name), `country`, `adType`, `maxItems` + +### Output fields +Step 1: `adTitle`, `adBody`, `adCreativeUrl`, `startDate`, `pageInfo.name`, `platform` + +### Gotcha +Facebook Ad Library is public data, no auth needed. But results are limited to currently active or recently inactive ads. + +--- + +## Competitor web presence analysis +**When:** User wants traffic, rankings, and SEO data for competitor domains. + +### Pipeline +1. **Get traffic data** -> `radeance/similarweb-scraper` + - Key input: `urls` (competitor domains) +2. **Get backlink profile** -> `radeance/ahrefs-scraper` + - Key input: `urls` (same domains) + +### Output fields +Step 1: `globalRank`, `monthlyVisits`, `bounceRate`, `avgVisitDuration`, `trafficSources` +Step 2: `domainRating`, `backlinks`, `referringDomains`, `organicKeywords` + +### Cost estimate +radeance/ Actors cost $0.005-0.0275/result. A single domain audit across both steps costs ~$0.04-0.06. + +--- + +## Competitor website change detection +**When:** User wants to monitor competitor pricing pages, feature announcements, or product pages and get alerted when meaningful changes occur. + +### Pipeline +1. **Detect changes** -> `tri_angle/website-changes-detector` + - Key input: `startUrls` (competitor page URLs), `notificationEmail`, `checkIntervalHours` + +### Output fields +Step 1: `url`, `changedAt`, `diff` (text diff), `screenshotUrl` + +### Gotcha +`tri_angle/website-changes-detector` handles baseline storage internally - do not attempt to manage baselines externally or you will lose the diff history between runs. + +--- + +## Competitor SERP position monitoring +**When:** User wants to track where competitor domains rank for target keywords and get alerted on significant position shifts. + +### Pipeline +1. 
**Scrape SERP rankings** -> `apify/google-search-scraper` + - Key input: `queries` (target keywords array), `countryCode`, `maxResultsPerPage` +2. **Track traffic estimates** -> `radeance/similarweb-scraper` + - Key input: `urls` (competitor domains) + - Pipe: run separately per competitor domain after extracting domains from step 1 results + +### Output fields +Step 1: `organicResults[].url`, `organicResults[].position`, `organicResults[].title` +Step 2: `globalRank`, `monthlyVisits`, `trafficSources` + +### Cost estimate +`radeance/similarweb-scraper` costs ~$0.02-0.03 per domain. For 5 competitors, budget ~$0.10-0.15 per weekly run. + +--- + +## Competitor feature and pricing benchmarking +**When:** User wants a structured comparison of competitor pricing tiers, feature lists, and positioning across 5-10 competitor sites. + +### Pipeline +1. **Crawl pricing and feature pages** -> `apify/website-content-crawler` + - Key input: `startUrls` (competitor pricing page URLs), `maxCrawlDepth` (set to 1), `includeUrlGlobs` +2. **Extract structured data** -> AI node (GPT-4o or Claude) + - Pipe: `results[].text` -> extraction prompt per competitor + - Key input: extraction schema (tiers, prices, key features, positioning statement) + +### Output fields +Step 1: `text` (clean markdown with pricing tables), `url`, `metadata.title` +Step 2: AI-extracted structured JSON with tiers, prices, feature flags per competitor + +### Gotcha +Set `maxCrawlDepth: 1` and use `includeUrlGlobs` to restrict crawl to pricing and features paths only. Without this, WCC will crawl the full site and inflate cost significantly. 
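Once step 2 returns per-competitor JSON, the comparison table is a simple pivot. A sketch, where the field names (`competitor`, `tiers`, `name`, `price`) are illustrative and set by your own extraction schema:

```python
def pivot_pricing(extracted):
    """Pivot per-competitor AI extractions (step 2 output) into a single
    comparison table keyed by tier name. Missing tiers simply leave a gap
    for that competitor."""
    table = {}
    for comp in extracted:
        for tier in comp["tiers"]:
            table.setdefault(tier["name"], {})[comp["competitor"]] = tier["price"]
    return table
```

The result maps each tier name to a `{competitor: price}` row, ready to write to a spreadsheet or Airtable.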
diff --git a/skills/apify-ultimate-scraper/references/workflows/contact-enrichment.md b/skills/apify-ultimate-scraper/references/workflows/contact-enrichment.md new file mode 100644 index 0000000..8962a3b --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/contact-enrichment.md @@ -0,0 +1,65 @@ +# Contact enrichment workflows + +## Website contact extraction from a URL list +**When:** User has a list of company websites and wants emails, phone numbers, and social links for outreach. + +### Pipeline +1. **Extract contact info** -> `vdrmota/contact-info-scraper` or `compass/contact-details-scraper-standby` + - Key input: `startUrls` (company website URLs), `maxDepth` (crawl depth, 1-2 is usually enough) +2. **Supplement with social signals** -> `apify/social-media-leads-analyzer` + - Pipe: `results[].domain` -> input domain list + - Key input: `domains` (company domains), `extractLinkedIn`, `extractTwitter` +3. **Dedup and verify** (n8n: dedup by email domain, optional ZeroBounce/Hunter node for verification) + +### Output fields +Step 1: `emails[]`, `phones[]`, `linkedinUrl`, `twitterUrl`, `domain` +Step 2: `socialProfiles`, `linkedinCompanyUrl`, `facebookUrl` + +### Gotcha +`contact-details-scraper-standby` is a Standby Actor - it stays warm and responds in < 1s, making it ideal for real-time enrichment in webhook flows. Use it when latency matters. For batch jobs, `vdrmota/contact-info-scraper` is more cost-effective. + +--- + +## LinkedIn warm lead identification from post comments +**When:** User wants to find engaged prospects who commented on relevant LinkedIn posts (competitor content, thought leader posts, industry discussions). + +### Pipeline +1. **Scrape post comments** -> `harvestapi/linkedin-post-comments` + - Key input: `postUrl` (LinkedIn post URL), `maxComments` +2. 
**Enrich commenter profiles** -> `harvestapi/linkedin-profile-scraper` + - Pipe: `results[].commenter.profileUrl` -> `urls` + - Key input: `urls`, `includeEmail: true` +3. **Filter by ICP criteria** (n8n: Filter node on `headline` or `companyName`) +4. **AI draft outreach** (n8n: OpenAI node generates personalized message using `commentText` + `headline`) + +### Output fields +Step 1: `commenter.name`, `commenter.headline`, `commenter.profileUrl`, `commentText`, `timestamp` +Step 2: `experience[]`, `education[]`, `email`, `phone`, `skills[]` + +### Cost estimate +Both Actors are PPE. Step 1 ~ $0.005/comment. Step 2 with `includeEmail: true` ~ $0.01/profile. For 100 commenters enriched: ~$1.50 total. + +### Gotcha +Not all LinkedIn posts are publicly accessible. Test `postUrl` manually before building a workflow around it. Private or restricted posts return empty results - no error, just zero items. + +--- + +## Real-time lead enrichment on form submission +**When:** A prospect fills a form and their company needs to be enriched automatically for ICP scoring and sales routing. + +### Pipeline +1. **Receive form webhook** (n8n: Webhook trigger from HubSpot/Typeform/custom form) +2. **Extract company domain** (n8n: Function node parses email domain) +3. **Crawl company site** -> `apify/website-content-crawler` + - Key input: `startUrls` (company domain), `maxCrawlPages` (3-5), `includeUrlGlobs` (about, pricing, careers, team) +4. **Enrich with LinkedIn firmographics** -> `harvestapi/linkedin-company` (optional, for headcount + industry) + - Pipe: company name or LinkedIn URL derived from WCC output +5. **AI extract signals** (n8n: OpenAI node extracts companySize, industry, techStack, ICPFit score from crawl text) +6. 
**Route** (n8n: Switch node sends high-ICP leads to Slack sales channel, others to nurture sequence) + +### Output fields +WCC: `text`, `title`, `url`, `metadata.description` +LinkedIn: `employeeCount`, `industry`, `headquarters`, `description`, `website` + +### Gotcha +WCC crawl on a small startup site can take 30-60 seconds. For synchronous form flows, set `maxCrawlPages: 3` and use a timeout. If latency is critical, use `apify/cheerio-scraper` for the About page only and skip LinkedIn enrichment for first response. diff --git a/skills/apify-ultimate-scraper/references/workflows/content-and-seo.md b/skills/apify-ultimate-scraper/references/workflows/content-and-seo.md new file mode 100644 index 0000000..b0cc0d5 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/content-and-seo.md @@ -0,0 +1,113 @@ +# Content and SEO workflows + +## Website content extraction for RAG +**When:** User wants to crawl a website and extract clean text for AI/LLM pipelines or knowledge bases. + +### Pipeline +1. **Crawl website** -> `apify/website-content-crawler` + - Key input: `startUrls`, `maxCrawlPages`, `crawlerType` ("cheerio" for speed, "playwright" for JS sites) + +### Output fields +Step 1: `url`, `title`, `text`, `markdown`, `metadata`, `links[]` + +### Gotcha +For JS-heavy sites (SPAs), set `crawlerType: "playwright"`. For static sites, use `"cheerio"` (10x faster). For anti-bot sites, use `apify/camoufox-scraper` instead. + +## SERP-based SEO content brief generation +**When:** User wants to generate a content brief for a target keyword by analyzing competitor SERP results and heading structures. + +### Pipeline +1. **SERP scrape** -> `apify/google-search-scraper` + - Key input: `queries`, `countryCode`, `maxResultsPerPage` +2. 
**Extract heading structure** -> `apify/cheerio-scraper` + - Pipe: `results[].organicResults[].url` -> `startUrls` (filter non-article URLs first) + - Key input: `startUrls`, `pageFunction` (extract h1/h2/h3 nodes) + +### Output fields +Step 1: `organicResults[].url`, `organicResults[].title`, `organicResults[].snippet` +Step 2: heading structure (h1/h2/h3 text), word count per page + +### Gotcha +Use `apify/cheerio-scraper` for heading extraction - it's 10x faster than `website-content-crawler` for HTML-only pages. Only escalate to `website-content-crawler` when you need full body text for AI synthesis. + +## Sitemap content audit +**When:** User wants to crawl all URLs on a competitor's site from the sitemap and build a keyword and topic inventory. + +### Pipeline +1. **Extract sitemap URLs** -> `apify/sitemap-extractor` + - Key input: `startUrls` (sitemap.xml URL) +2. **Crawl each URL** -> `apify/website-content-crawler` + - Pipe: `results[].urls[]` -> `startUrls` + - Key input: `startUrls`, `maxCrawlPages`, `htmlTransformer` (readableText) + +### Output fields +Step 1: `urls[]` (all discovered page URLs) +Step 2: `text`, `metadata`, `url` + +### Gotcha +Large sitemaps (1,000+ URLs) can be expensive. Filter step 1 results to a specific path prefix (e.g., `/blog/`) before passing to step 2 to avoid crawling low-value pages like tag archives and pagination. + +## Keyword rank tracking with alerts +**When:** User wants weekly tracking of keyword positions for own domain and competitors, with Slack alerts on drops over 5 positions. + +### Pipeline +1. **SERP scrape** -> `apify/google-search-scraper` + - Key input: `queries` (tracked keywords array), `countryCode`, `device` (desktop/mobile), `maxResultsPerPage` (set to 100) + +### Output fields +Step 1: `organicResults[].position`, `organicResults[].url`, `organicResults[].domain` + +### Cost estimate +`apify/google-search-scraper` is a fixed-cost Actor. 100 keywords weekly ≈ $1-3/month depending on query volume. 
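The weekly comparison behind the alerts can be sketched as follows, assuming you persist last week's positions as a `{keyword: position}` map (e.g., in Google Sheets):

```python
def rank_drop_alerts(previous, current, threshold=5, not_found=101):
    """Compare two SERP snapshots ({keyword: position} for your domain) and
    return keywords that dropped by more than `threshold` positions.
    Keywords missing from `current` count as fallen past position 100."""
    alerts = []
    for keyword, prev_pos in previous.items():
        cur_pos = current.get(keyword, not_found)
        if cur_pos - prev_pos > threshold:
            alerts.append((keyword, prev_pos, cur_pos))
    return alerts
```

Each alert tuple carries the keyword plus old and new positions, ready to format into a Slack message.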
+ +### Gotcha +Set `maxResultsPerPage: 100` to capture positions 11-100. Default is 10 results, which misses any keyword ranking outside page 1 - making it impossible to detect rank drops from position 12 to 18. + +## SERP analysis +**When:** User wants to analyze search engine results for specific keywords. + +### Pipeline +1. **Google SERP** -> `apify/google-search-scraper` + - Key input: `queries`, `maxPagesPerQuery`, `countryCode`, `languageCode` + +### Output fields +Step 1: `organicResults[]` (title, url, description, position), `paidResults[]`, `peopleAlsoAsk[]`, `relatedSearches[]` + +## Deep research agent +**When:** User wants an AI agent that takes a research question, generates search queries, crawls top results, and synthesizes findings into a structured report. + +### Pipeline +1. **Generate search queries** (LLM node - 3 queries per research question) +2. **SERP scrape** -> `apify/google-search-scraper` + - Pipe: generated queries -> `queries` + - Key input: `queries`, `maxResultsPerPage` +3. **Extract content** -> `apify/rag-web-browser` + - Pipe: `results[].organicResults[].url` (AI-selected relevant URLs) -> `query` + - Key input: `query`, `maxResults`, `requestTimeoutSecs` +4. **Synthesize** (LLM node - final report generation) + +### Output fields +Step 2: `organicResults[].url`, `organicResults[].snippet` +Step 3: `text`, `url`, `metadata.title` + +### Gotcha +Use `apify/rag-web-browser` at step 3 rather than `website-content-crawler` - it's optimized for agent-based single-URL retrieval and returns clean markdown with lower latency. Reserve `website-content-crawler` for bulk batch crawls. + +## Domain authority and backlink analysis +**When:** User wants SEO metrics for specific domains. + +### Pipeline +1. **Traffic overview** -> `radeance/similarweb-scraper` + - Key input: `urls` +2. **Backlink profile** -> `radeance/ahrefs-scraper` + - Key input: `urls` +3. 
**Domain authority** -> `radeance/semrush-scraper` + - Key input: `urls` + +### Output fields +Step 1: `globalRank`, `monthlyVisits`, `bounceRate`, `trafficSources` +Step 2: `domainRating`, `backlinks`, `referringDomains`, `organicKeywords` +Step 3: `authorityScore`, `organicSearchTraffic`, `paidSearchTraffic` + +### Cost estimate +All radeance/ SEO Actors are PPE at $0.005-0.0275/result. Running all 3 for one domain costs ~$0.05-0.08. For 50 domains, estimate $2.50-$4.00. diff --git a/skills/apify-ultimate-scraper/references/workflows/ecommerce-price-monitoring.md b/skills/apify-ultimate-scraper/references/workflows/ecommerce-price-monitoring.md new file mode 100644 index 0000000..77c5610 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/ecommerce-price-monitoring.md @@ -0,0 +1,88 @@ +# E-commerce price monitoring workflows + +## Competitor product price monitoring with alerts +**When:** User wants to track competitor prices across product pages and get notified when prices change. + +### Pipeline +1. **Scrape product pages** -> `apify/e-commerce-scraping-tool` + - Key input: `startUrls` (competitor product page URLs), `proxyConfiguration` +2. **Match products across sites** -> `tri_angle/e-commerce-product-matching-tool` + - Pipe: `results[].url` + `results[].name` -> matching input + - Key input: source dataset ID from step 1, target product list +3. **Compare vs. baseline** (n8n logic: read Google Sheets last_price, compute % change, filter if changed) +4. **Alert** via Telegram/Slack node with price delta + +### Output fields +Step 1: `price`, `currency`, `name`, `sku`, `availability`, `url` +Step 2: matched pairs with `similarityScore`, `sourceProduct`, `targetProduct` + +### Cost estimate +`apify/e-commerce-scraping-tool` is pay-per-result. For 200 product URLs daily, expect ~$0.50-$1/run depending on site complexity. + +### Gotcha +Many e-commerce sites use bot protection. 
If `e-commerce-scraping-tool` returns empty prices, fall back to `apify/camoufox-scraper` with residential proxy. Set `sessionPoolName` to reuse sessions and reduce blocks. + +--- + +## Amazon product and review tracking +**When:** User wants to monitor own or competitor Amazon listings for price drops or review score changes. + +### Pipeline +1. **Extract Amazon data** -> `apify/e-commerce-scraping-tool` + - Key input: `startUrls` (Amazon product URLs), `extractReviews` (bool) +2. **Compare vs. stored baseline** (n8n: read last values from Sheets or DB) +3. **Alert on new low price or rating drop** (n8n: If node + Telegram/Slack send) + +### Output fields +`price`, `currency`, `rating`, `reviewsCount`, `title`, `asin`, `availability` + +### Cost estimate +Flat per-result pricing. 50 ASINs daily ~ $0.10-$0.25/run. + +### Gotcha +Amazon aggressively rotates prices and sometimes shows regional prices. Always store `currency` alongside `price`. For review text (not just counts), search `apify actors search "amazon reviews"` for a dedicated Actor. + +--- + +## Supplier catalog extraction to draft products +**When:** User wants to pull new products from a supplier portal and create draft listings with AI-enriched descriptions. + +### Pipeline +1. **Crawl supplier catalog** -> `apify/playwright-scraper` (JS-heavy portals) or `apify/cheerio-scraper` (static HTML) + - Key input: `startUrls` (supplier category pages), `pseudoUrls` (product URL patterns), `maxCrawlPages` +2. **Extract product content** -> `apify/website-content-crawler` (optional second pass for detail pages) + - Pipe: `results[].url` -> `startUrls` + - Key input: `maxCrawlPages` (1 per product), `htmlTransformer: "readableText"` +3. **AI rewrite** (n8n: OpenAI node generates SEO title + bullets from raw specs) +4. 
**Create draft product** (n8n: Shopify node `POST /products.json` with `status: "draft"`) + +### Output fields +Step 1/2: `text`, `url`, `metadata.title`, inline image URLs + +### Cost estimate +Depends on catalog size. `playwright-scraper` is PPE; 500 product pages ~ $1-3. + +### Gotcha +Supplier portals often require login. Use `apify/playwright-scraper` with `initialCookies` or a pre-login script in `preNavigationHooks`. Never hardcode credentials - pass via Actor input from n8n credentials store. + +--- + +## Multi-site deal and coupon monitoring +**When:** User wants to detect when competitors run promotions or publish coupon codes so marketing can respond. + +### Pipeline +1. **Scrape deals pages** -> `apify/e-commerce-scraping-tool` + - Key input: `startUrls` (competitor deal/sale page URLs), `proxyConfiguration` +2. **Dynamic JS deal pages** (fallback) -> `apify/camoufox-scraper` + - Pipe: failed URLs from step 1 -> `startUrls` +3. **AI extract promotion details** (n8n: OpenAI node extracts discount %, promo code, expiry from raw text) +4. **Dedup and alert** (n8n: compare against stored deals DB, Slack notify on new deals) + +### Output fields +Raw: `price`, `discountText`, `url`; AI-extracted: `promoCode`, `validUntil`, `discountPercent`, `category` + +### Cost estimate +Light scraping - deals pages are few. Expect < $0.20/run for 20 competitor pages. + +### Gotcha +Promo codes and flash deals may only be visible after login or in geofenced regions. Test each target URL manually first. AI extraction of expiry dates is unreliable - treat as best-effort signal, not exact data. 
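
The dedup-and-alert step in the deal workflow above reduces to set membership against the stored deals DB. A minimal sketch, assuming each scraped deal is a dict with `url` and an optional AI-extracted `promoCode` (illustrative field names matching the outputs listed above):

```python
def find_new_deals(scraped_deals, seen_keys):
    """Return only deals not seen in previous runs.

    Identity = (url, promoCode), so a fresh code on an already-known
    page still surfaces as a new deal. `seen_keys` is the set loaded
    from the stored deals DB and is updated in place, ready to be
    persisted back after alerting.
    """
    new_deals = []
    for deal in scraped_deals:
        key = (deal["url"], deal.get("promoCode"))
        if key not in seen_keys:
            seen_keys.add(key)
            new_deals.append(deal)
    return new_deals
```

Only the returned `new_deals` go to the Slack notify node, which keeps re-runs from re-alerting on stale promotions.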
diff --git a/skills/apify-ultimate-scraper/references/workflows/influencer-vetting.md b/skills/apify-ultimate-scraper/references/workflows/influencer-vetting.md new file mode 100644 index 0000000..b452fc3 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/influencer-vetting.md @@ -0,0 +1,90 @@ +# Influencer vetting workflows + +## Instagram creator vetting +**When:** User wants to evaluate an influencer's profile, audience, and engagement quality. + +### Pipeline +1. **Get profile data** -> `apify/instagram-profile-scraper` + - Key input: `usernames` (list of handles) +2. **Analyze engagement** -> `apify/instagram-comment-scraper` + - Pipe: `results[].latestPosts[].url` -> `directUrls` (pick 3-5 recent posts) + - Key input: `directUrls`, `resultsLimit` + +### Output fields +Step 1: `username`, `fullName`, `followersCount`, `followsCount`, `postsCount`, `biography`, `isVerified`, `latestPosts[]` +Step 2: `text`, `ownerUsername`, `timestamp` (scan for bot patterns: generic praise, emoji-only, irrelevant content) + +### Gotcha +High follower count with low comment quality suggests fake followers. Compare comment sentiment to post content. + +--- + +## Cross-platform influencer discovery +**When:** User wants to find an influencer's presence across multiple platforms. + +### Pipeline +1. **Search across platforms** -> `tri_angle/social-media-finder` + - Key input: `query` (influencer name or handle), `platforms` + +### Output fields +Step 1: `platform`, `profileUrl`, `username`, `followers`, `isVerified` + +--- + +## TikTok creator vetting +**When:** User wants to vet TikTok creators by niche or handle for partnership fit based on engagement rate and content quality. + +### Pipeline +1. **Get profile metrics** -> `clockworks/tiktok-profile-scraper` + - Key input: `profiles` (username array), `resultsPerPage` +2. 
**Pull recent videos** -> `clockworks/tiktok-video-scraper` + - Pipe: `results[].authorMeta.name` -> `profiles` + - Key input: `profiles`, `maxItems` + +### Output fields +Step 1: `authorMeta.name`, `authorMeta.fans`, `authorMeta.heart`, `authorMeta.video` +Step 2: `diggCount`, `playCount`, `commentCount`, `shareCount`, `createTimeISO`, `hashtags`, `text` + +### Gotcha +TikTok engagement rate must be calculated manually: `(diggCount + commentCount + shareCount) / playCount`. The Actor does not return a pre-calculated ER field. + +--- + +## YouTube channel audit +**When:** User wants to audit YouTube channels for subscriber growth, average views, and topic consistency before sponsorship. + +### Pipeline +1. **Get channel overview** -> `streamers/youtube-channel-scraper` + - Key input: `startUrls` (channel URLs), `maxResults` +2. **Pull video metrics** -> `streamers/youtube-scraper` + - Pipe: `results[].channelUrl` -> `startUrls` + - Key input: `startUrls`, `maxResults` +3. **Analyze content themes** -> `curious_coder/youtube-transcript-scraper` + - Pipe: `results[].url` -> video URLs (pick 5-10 recent videos) + - Key input: video URLs + +### Output fields +Step 1: `channelName`, `numberOfSubscribers`, `channelTotalViews`, `channelUrl` +Step 2: `videos[].viewCount`, `videos[].likeCount`, `videos[].title`, `videos[].publishedAt` +Step 3: `transcript` (raw text for AI topic classification) + +--- + +## Cross-platform hashtag discovery +**When:** User wants to discover new influencer candidates across Instagram, TikTok, and YouTube using niche hashtags for a unified shortlist. + +### Pipeline +1. **Instagram hashtag scrape** -> `apify/instagram-hashtag-scraper` + - Key input: `hashtags` (array), `resultsLimit` +2. **TikTok hashtag scrape** -> `clockworks/tiktok-hashtag-scraper` + - Key input: `hashtags` (same array), `maxItems` +3. 
**YouTube hashtag scrape** -> `streamers/youtube-video-scraper-by-hashtag` + - Key input: `hashtags` (same array), `resultsLimit` + +### Output fields +Step 1: `ownerUsername`, `followersCount`, `profileUrl`, `likesCount`, `commentsCount` +Step 2: `authorMeta.name`, `authorMeta.fans`, `playCount`, `diggCount`, `shareCount` +Step 3: `channelName`, `numberOfSubscribers`, `viewCount`, `channelUrl` + +### Gotcha +Each platform returns platform-specific field names. Normalize to a common schema (`username`, `platform`, `followersCount`, `avgEngagement`, `profileUrl`) in a downstream merge step before scoring. diff --git a/skills/apify-ultimate-scraper/references/workflows/job-market-and-recruitment.md b/skills/apify-ultimate-scraper/references/workflows/job-market-and-recruitment.md new file mode 100644 index 0000000..209c5b2 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/job-market-and-recruitment.md @@ -0,0 +1,74 @@ +# Job market and recruitment workflows + +## Job listing research +**When:** User wants to find and analyze job postings by role, company, or location. + +### Pipeline +1. **Search jobs** -> `harvestapi/linkedin-job-search` + - Key input: `keyword`, `location`, `datePosted`, `limit` +2. **Get job details** -> `apimaestro/linkedin-job-detail` + - Pipe: `results[].jobUrl` -> `urls` + - Key input: `urls` + +### Output fields +Step 1: `title`, `company`, `location`, `jobUrl`, `postedDate`, `applicantsCount` +Step 2: `description`, `requirements`, `seniority`, `employmentType`, `salary` + +### Gotcha +Both Actors are PPE. Step 1: ~$0.001/job. Step 2: ~$0.005/job. For 200 jobs, total ~$1.20. Estimate and confirm with user. + +## Candidate sourcing +**When:** User wants to find potential candidates matching specific criteria. + +### Pipeline +1. **Search profiles** -> `harvestapi/linkedin-profile-search` + - Key input: `keyword`, `title`, `location`, `industry`, `limit` +2. 
**Enrich with details** -> `apimaestro/linkedin-profile-full-sections-scraper` + - Pipe: `results[].profileUrl` -> `urls` + - Key input: `urls` + +### Output fields +Step 1: `fullName`, `headline`, `location`, `profileUrl`, `currentCompany` +Step 2: `experience[]`, `education[]`, `skills[]`, `certifications[]`, `languages[]` + +### Gotcha +Step 2 (`apimaestro/linkedin-profile-full-sections-scraper`) costs ~$0.01/profile - the most expensive LinkedIn scraper. Use sparingly for shortlisted candidates only. + +## Sales signal outreach - job posting as buying signal +**When:** User wants to monitor company job postings as a signal to identify sales opportunities - e.g., a "Head of Data Engineering" hire suggests budget for data tooling. + +### Pipeline +1. **Monitor target postings** -> `harvestapi/linkedin-job-search` + - Key input: `searchUrl` (LinkedIn Jobs URL with company or role filters), `keywords` +2. **Get company context** -> `harvestapi/linkedin-company` + - Pipe: `results[].companyUrl` -> `companyUrls` + +### Output fields +Step 1: `title`, `companyName`, `description`, `employmentType`, `seniorityLevel`, `jobUrl` +Step 2: `name`, `industry`, `employeeCount`, `description`, `specialties[]` + +### Gotcha +Job descriptions contain implicit buying signals - tech stack mentions, pain points, and headcount growth. Pass `description` to an LLM to extract inferred tech stack and budget tier before prioritizing outreach. Contact finding (Hunter.io) uses the native n8n node, not an Apify Actor. + +## Upwork job monitoring for freelancers +**When:** User wants to continuously monitor Upwork for new jobs matching their skills. + +### Pipeline +1. 
**Scrape Upwork search** -> `apify/playwright-scraper` + - Key input: `startUrls` (Upwork search URL with skill filters), `pseudoUrls`, `maxCrawledPages` + +### Output fields +Step 1: `title`, `description`, `budget`, `clientJobsPosted`, `clientHireRate`, `postedAt`, `url` + +### Gotcha +No dedicated Upwork Actor exists in Apify Store - verify with `apify actors search "upwork"` for community options before defaulting to `apify/playwright-scraper`. Upwork pages are JS-heavy so Playwright is required over basic HTTP scraping. For high-frequency monitoring (every 15 min), store seen job URLs to avoid re-processing duplicates. + +## GitHub contributor discovery +**When:** User wants to find developers who contribute to specific open-source projects. + +### Pipeline +1. **Get contributors** -> `janbuchar/github-contributors-scraper` + - Key input: `repoUrls` + +### Output fields +Step 1: `username`, `contributions`, `profileUrl`, `avatarUrl` diff --git a/skills/apify-ultimate-scraper/references/workflows/knowledge-base-and-rag.md b/skills/apify-ultimate-scraper/references/workflows/knowledge-base-and-rag.md new file mode 100644 index 0000000..d4b66b5 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/knowledge-base-and-rag.md @@ -0,0 +1,64 @@ +# Knowledge base and RAG pipeline workflows + +## Website to RAG knowledge base via sitemap crawl +**When:** User wants to ingest an entire website or documentation site into a vector database for AI retrieval (chatbots, search, AI agents). + +### Pipeline +1. **Extract sitemap** -> `apify/sitemap-extractor` + - Key input: `sitemapUrl` or `domain` +2. **Crawl and convert to markdown** -> `apify/website-content-crawler` + - Pipe: `results[].url` -> `startUrls` (or pass dataset ID) + - Key input: `startUrls`, `maxCrawlPages`, `htmlTransformer: "readableText"`, `outputFormats: ["markdown"]` +3. **Chunk and embed** (n8n: Recursive Character Text Splitter -> OpenAI Embeddings node) +4. 
**Upsert to vector DB** (n8n: Supabase / Qdrant node with document + metadata) + +### Output fields +`text` (clean markdown), `url`, `metadata.title`, `metadata.description`, `crawledAt` + +### Gotcha +`apify/rag-web-browser` is purpose-built for RAG use cases and returns pre-chunked, clean text without boilerplate - use it when you want simpler output and don't need full site coverage. For comprehensive crawls (full docs sites, 100+ pages), `website-content-crawler` gives more control over depth and URL filtering. + +--- + +## Deep research agent with web crawling +**When:** User or an AI agent submits a research question and wants a synthesized report drawn from live web sources. + +### Pipeline +1. **Generate search queries** (n8n: AI node expands research question into 3-5 distinct queries) +2. **Search** -> `apify/google-search-scraper` + - Pipe: generated queries -> `queries` (array) + - Key input: `queries`, `maxResultsPerPage` (5-10) +3. **Retrieve content** -> `apify/rag-web-browser` + - Pipe: `results[].organicResults[].url` -> `query` (RAG browser takes query + crawls most relevant result) + - Key input: `query`, `maxResults`, `requestTimeoutSecs` +4. **Synthesize** (n8n: OpenAI node assembles final report from per-source summaries) +5. **Output** to n8n Data Table, Notion, or Google Docs + +### Output fields +Search: `organicResults[].url`, `organicResults[].title`, `organicResults[].snippet` +RAG browser: `text`, `url`, `metadata.title` + +### Gotcha +`apify/rag-web-browser` fetches and summarizes a single URL per call. To process multiple search results in parallel, use n8n's Split In Batches node with a concurrency of 3-5 rather than running them sequentially. This cuts total runtime significantly for 10+ URLs. + +--- + +## Scheduled news monitoring to AI knowledge feed +**When:** User wants to track industry news sources daily, filter new articles, summarize them, and store in a searchable knowledge base (Notion, NocoDB, Supabase). 
+ +### Pipeline +1. **Extract articles** -> `lukaskrivka/article-extractor-smart` + - Key input: `startUrls` (news site listing pages), `maxCrawlPages`, `articleSelector` (optional CSS hint) +2. **Filter new articles only** (n8n: compare `publishedAt` or URL against stored records in DB) +3. **Full article content** (optional) -> `apify/website-content-crawler` + - Pipe: new article `url` values -> `startUrls` + - Use when listing-page extract is too short for quality summarization +4. **AI summarize + tag** (n8n: OpenAI node generates 3-sentence summary + keyword tags) +5. **Upsert to knowledge base** (n8n: Notion / NocoDB / Supabase node) + +### Output fields +Step 1: `title`, `text`, `publishedAt`, `author`, `url`, `tags` +Step 3 (WCC): full `text`, `metadata.title`, `metadata.description` + +### Gotcha +`lukaskrivka/article-extractor-smart` handles most news formats well, but paywalled sites return truncated content. Check `text` length - if consistently under 200 characters for a given source, that site is paywalled and should be removed from the list. Deduplicate by URL before summarizing to avoid re-processing old articles on re-runs. diff --git a/skills/apify-ultimate-scraper/references/workflows/lead-generation.md b/skills/apify-ultimate-scraper/references/workflows/lead-generation.md new file mode 100644 index 0000000..96fe025 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/lead-generation.md @@ -0,0 +1,118 @@ +# Lead generation workflows + +## Local business leads with email enrichment +**When:** User wants business contacts, emails, or phone numbers for businesses in a specific location. + +### Pipeline +1. **Find businesses** -> `compass/crawler-google-places` + - Key input: `searchStringsArray`, `locationQuery`, `maxCrawledPlaces` +2. 
**Enrich with contacts** -> `compass/enrich-google-maps-dataset-with-contacts` + - Pipe: `results[].url` -> `startUrls` (or pass the dataset ID directly) + - Key input: `datasetId` (from step 1), `maxRequestsPerCrawl` + +### Output fields +Step 1: `title`, `address`, `phone`, `website`, `categoryName`, `totalScore`, `reviewsCount`, `url` +Step 2: `emails[]`, `phones[]`, `socialLinks`, `linkedInUrl`, `twitterUrl` + +### Gotcha +Google Maps results vary by language and location. Set `language: "en"` explicitly. Also set `locationQuery` to a specific city/region, not just a country. + +--- + +## B2B prospect discovery via LinkedIn +**When:** User wants to find professionals by role, company, or industry. + +### Pipeline +1. **Search profiles** -> `harvestapi/linkedin-profile-search` + - Key input: `keyword`, `location`, `title`, `limit` +2. **Enrich with details** -> `harvestapi/linkedin-profile-scraper` + - Pipe: `results[].profileUrl` -> `urls` + - Key input: `urls`, `includeEmail` (set to `true` for email discovery) + +### Output fields +Step 1: `fullName`, `headline`, `location`, `profileUrl`, `currentCompany` +Step 2: `experience[]`, `education[]`, `skills[]`, `email`, `phone` + +### Cost estimate +Step 2 with `includeEmail: true` costs ~$0.01/profile. For 500 profiles, budget ~$5. + +### Gotcha +LinkedIn Actors are all PPE. Estimate and confirm with user before running at scale. + +--- + +## Sales Navigator bulk lead extraction +**When:** User wants daily 100-1,000 lead extraction from a Sales Navigator search for outbound sequences. + +### Pipeline +1. **Extract leads** -> `harvestapi/linkedin-profile-search` + - Key input: `searchUrl` (Sales Navigator search URL), `maxResults`, `proxy` settings +2. 
**Verify emails** -> native n8n Hunter.io node or HTTP Request to ZeroBounce API + - Pipe: `results[].email` -> email verification input + +### Output fields +Step 1: `fullName`, `email`, `companyName`, `jobTitle`, `connectionDegree`, `profileUrl` +Step 2: `result` (valid/risky/invalid), `score` + +### Cost estimate +`harvestapi/linkedin-profile-search` is PPE. 1,000 leads at typical rates runs ~$5-10. Confirm before scheduling daily runs. + +### Gotcha +Sales Navigator URL must be a saved search URL, not a one-time results URL. The URL changes each session unless saved. + +--- + +## SERP-based B2B prospect discovery +**When:** User wants to find companies matching niche keywords via Google, AI-qualify them against ICP criteria, and push qualified leads to CRM. + +### Pipeline +1. **Find companies** -> `apify/google-search-scraper` + - Key input: `queries` (search terms array), `maxResultsPerPage`, `countryCode` +2. **Crawl company sites** -> `apify/website-content-crawler` + - Pipe: `results[].organicResults[].url` -> `startUrls` + - Key input: `startUrls`, `maxCrawlDepth` (set to 2), `maxCrawlPages` (set to 5) + +### Output fields +Step 1: `organicResults[].url`, `organicResults[].title`, `organicResults[].snippet` +Step 2: `text` (clean markdown), `url`, `metadata.title`, `metadata.description` + +### Gotcha +Pass only company root domains from SERP results into WCC - not individual blog post URLs. Filter `organicResults[].url` for root domains before piping. + +--- + +## Apollo leads + AI website icebreakers +**When:** User has an Apollo lead list with company websites and wants personalized cold email icebreakers generated from each company's web presence. + +### Pipeline +1. **Scrape company sites** -> `apify/website-content-crawler` + - Key input: `startUrls` (homepage URLs from Apollo export), `maxCrawlDepth` (set to 2), `maxCrawlPages` (set to 5) +2. 
**Generate icebreakers** -> AI node (GPT-4o or Claude) + - Pipe: `results[].text` -> prompt context per lead + - Key input: company summary + lead name + role + +### Output fields +Step 1: `text` (clean markdown), `metadata.title`, `metadata.description`, `url` +Step 2: AI-generated icebreaker string per lead + +### Gotcha +Some Apollo exports include LinkedIn URLs instead of company websites. Filter out `linkedin.com` URLs so only actual company-website URLs reach WCC - LinkedIn blocks crawlers. + +--- + +## Reddit community lead mining +**When:** User wants to find prospects actively posting problems that their product or service solves in relevant subreddits. + +### Pipeline +1. **Mine subreddit posts** -> `trudax/reddit-scraper-lite` + - Key input: `startUrls` (subreddit URLs), `searchTerms` (problem keywords), `maxItems`, `sort` (hot/new/top) +2. **Qualify leads** -> AI node + - Pipe: `results[].title`, `results[].body` -> qualification prompt + - Key input: ICP criteria, pain point keywords + +### Output fields +Step 1: `title`, `body`, `subreddit`, `url`, `score`, `numberOfComments`, `createdAt`, `author` +Step 2: AI qualification score, extracted contact intent, suggested outreach angle + +### Gotcha +Reddit usernames are pseudonymous - there is no direct email enrichment path. The output is intent signals and post URLs for manual outreach via Reddit DM or to cross-reference against other platforms. diff --git a/skills/apify-ultimate-scraper/references/workflows/real-estate-and-hospitality.md b/skills/apify-ultimate-scraper/references/workflows/real-estate-and-hospitality.md new file mode 100644 index 0000000..007402d --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/real-estate-and-hospitality.md @@ -0,0 +1,74 @@ +# Real estate and hospitality workflows + +## Property search and analysis +**When:** User wants to find and compare property listings in a specific area. + +### Pipeline +1. 
**Search properties** -> `tri_angle/redfin-search` + - Key input: `location`, `propertyType`, `minPrice`, `maxPrice` +2. **Get details** -> `tri_angle/redfin-detail` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls` + +### Output fields +Step 1: `address`, `price`, `beds`, `baths`, `sqft`, `url`, `status` +Step 2: `description`, `yearBuilt`, `lotSize`, `priceHistory[]`, `taxHistory[]`, `schools[]` + +## Airbnb market analysis +**When:** User wants to analyze Airbnb listings, pricing, and reviews in a destination. + +### Pipeline +1. **Search listings** -> `tri_angle/new-fast-airbnb-scraper` + - Key input: `location`, `checkIn`, `checkOut`, `maxItems` +2. **Get reviews** -> `tri_angle/airbnb-reviews-scraper` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls`, `maxReviews` + +### Output fields +Step 1: `name`, `price`, `rating`, `reviews`, `type`, `amenities[]`, `url`, `images[]` +Step 2: `text`, `rating`, `date`, `reviewerName` + +### Gotcha +Airbnb pricing varies by date. Always set `checkIn` and `checkOut` for accurate pricing. For market analysis, run multiple date ranges to capture seasonal variation. + +## Real estate lead scoring and agent routing +**When:** User wants to qualify inbound leads from listing portals by budget signals and urgency, then route them to the right agent. + +### Pipeline +1. **Search matching properties** -> `tri_angle/redfin-search` + - Key input: `location`, `minPrice`, `maxPrice` (from lead payload) +2. **Enrich lead with LinkedIn signals** -> `harvestapi/linkedin-profile-scraper` + - Key input: `profileUrls` (optional - use only when lead email resolves to a LinkedIn profile) + +### Output fields +Step 1: `address`, `price`, `beds`, `baths`, `status`, `url` +Step 2: `headline`, `currentCompany`, `experience[]` (income/seniority signals) + +### Gotcha +The LinkedIn enrichment step is optional - only run it when the lead's identity is known and a LinkedIn profile URL is available. 
The core routing logic (hot/warm/cold tier + agent assignment) runs on the MLS webhook payload itself, with scraping used as enrichment. Lead scoring and routing output fields are AI-generated: `leadTier`, `assignedAgent`, `routingReason`. + +## Construction and pre-market property discovery +**When:** User wants to find new-construction projects or pre-market inventory before they appear on major listing portals. + +### Pipeline +1. **Scrape construction portals** -> `apify/playwright-scraper` + - Key input: `startUrls` (local MLS or construction project portal URLs), `proxyConfiguration` +2. **Extract clean text** -> `lukaskrivka/article-extractor-smart` + - Pipe: `results[].url` -> `urls` + +### Output fields +Step 1: raw HTML / structured page data +Step 2: `projectName`, `price`, `location`, `possessionDate`, `constructionStatus` + +### Gotcha +No market-specific Actor exists for most construction portals (e.g., 99acres). Run `apify actors search "real estate"` to check for community-built options before using `apify/playwright-scraper`. For JS-heavy portals, `playwright-scraper` is required. Step 2 cleans raw output into structured fields - pipe all Step 1 URLs through it. + +## Multi-source property comparison +**When:** User wants to compare listings across Zillow, Realtor, Zumper, and other US/UK sources. + +### Pipeline +1. 
**Aggregate listings** -> `tri_angle/real-estate-aggregator` + - Key input: `location`, `propertyType`, `sources` (Zillow, Realtor, Zumper, Apartments.com, Rightmove) + +### Output fields +Step 1: `address`, `price`, `beds`, `baths`, `sqft`, `source`, `url`, `listingDate` diff --git a/skills/apify-ultimate-scraper/references/workflows/review-analysis.md b/skills/apify-ultimate-scraper/references/workflows/review-analysis.md new file mode 100644 index 0000000..f4b3f14 --- /dev/null +++ b/skills/apify-ultimate-scraper/references/workflows/review-analysis.md @@ -0,0 +1,90 @@ +# Review analysis workflows + +## Google Maps review extraction +**When:** User wants to collect and analyze business reviews from Google Maps. + +### Pipeline +1. **Find businesses** -> `compass/crawler-google-places` + - Key input: `searchStringsArray`, `locationQuery`, `maxCrawledPlaces` +2. **Extract reviews** -> `compass/Google-Maps-Reviews-Scraper` + - Pipe: `results[].url` -> `startUrls` + - Key input: `startUrls`, `maxReviews` + +### Output fields +Step 1: `title`, `totalScore`, `reviewsCount`, `url`, `categoryName` +Step 2: `text`, `stars`, `publishedAtDate`, `reviewerName`, `ownerResponse` + +## Competitor review intelligence +**When:** User wants to extract competitor reviews to surface customer pain points and compare against own product strengths for positioning. + +### Pipeline +1. **Scrape competitor reviews** -> `compass/Google-Maps-Reviews-Scraper` + - Key input: `startUrls` (competitor Google Maps URLs), `maxReviews`, `sort` (newest or most relevant) +2. **Yelp competitor reviews** -> `tri_angle/yelp-review-scraper` + - Key input: `startUrls` (competitor Yelp URLs), `maxReviews` + +### Output fields +Step 1: `stars`, `text`, `name`, `publishedAtDate`, `reviewId` +Step 2: `text`, `rating`, `date`, `userName` + +### Gotcha +Run steps 1 and 2 in parallel for the same competitor, then merge by date. 
AI analysis works best when you label each review with the competitor name before passing to LLM for theme extraction. + +## Google Play app review monitoring +**When:** User wants daily low-rating alerts from Google Play to route urgent negative feedback to the support team. + +### Pipeline +1. **Scrape app reviews** -> `apify/playwright-scraper` + - Key input: `startUrls` (Google Play app URL), `maxRequestsPerCrawl` +2. **Filter and alert** (n8n native - IF node) + - Pipe: `results[].stars` -> filter where `stars < 4` + +### Output fields +Step 1: `stars`, `text`, `date`, `appVersion`, `thumbsUpCount` + +### Gotcha +Google Play uses heavy client-side rendering. Use `apify/playwright-scraper` rather than cheerio. If results are thin, search Apify Store for a dedicated Google Play reviews Actor - the ecosystem updates frequently. + +## Cross-platform hotel/restaurant reviews +**When:** User wants reviews aggregated from multiple platforms for the same business. + +### Pipeline (hotels) +1. **Aggregate reviews** -> `tri_angle/hotel-review-aggregator` + - Key input: `urls` (hotel URLs from TripAdvisor, Yelp, Google Maps, Booking.com, etc.) + +### Pipeline (restaurants) +1. **Aggregate reviews** -> `tri_angle/restaurant-review-aggregator` + - Key input: `urls` (restaurant URLs from Yelp, Google Maps, DoorDash, UberEats, etc.) + +### Output fields +Both: `text`, `rating`, `date`, `platform`, `reviewerName`, `title` + +## Multi-platform review aggregation for hospitality +**When:** User wants a weekly sentiment digest across TripAdvisor, Booking.com, Google, and Yelp for a property - including theme extraction by category (service, rooms, location, price). + +### Pipeline +1. **Aggregate all platforms** -> `tri_angle/hotel-review-aggregator` + - Key input: `startUrls` (property page URLs per platform), `maxReviews`, `includeReviews` +2. 
+2. **Airbnb reviews** (if applicable) -> `tri_angle/airbnb-reviews-scraper`
+   - Key input: `startUrls` (Airbnb listing URLs), `maxReviews`
+
+### Output fields
+Step 1: `stars`, `text`, `title`, `reviewDate`, `source`, `userProfile.name`
+Step 2: `stars`, `text`, `reviewDate`, `reviewerName`
+
+### Gotcha
+Review aggregators pull from multiple platforms in one run - cheaper than running separate scrapers per platform. Use the aggregators when covering 3+ platforms. For Airbnb specifically, run the dedicated `tri_angle/airbnb-reviews-scraper` separately and merge by date.
+
+## Yelp review pipeline
+**When:** User wants Yelp reviews for businesses in a specific area.
+
+### Pipeline
+1. **Find businesses** -> `tri_angle/get-yelp-urls`
+   - Key input: `location`, `category`
+2. **Extract reviews** -> `tri_angle/yelp-review-scraper`
+   - Pipe: `results[].url` -> `startUrls`
+   - Key input: `startUrls`, `maxReviews`
+
+### Output fields
+Step 1: `name`, `url`, `rating`, `reviewCount`, `address`
+Step 2: `text`, `rating`, `date`, `userName`
diff --git a/skills/apify-ultimate-scraper/references/workflows/social-media-analytics.md b/skills/apify-ultimate-scraper/references/workflows/social-media-analytics.md
new file mode 100644
index 0000000..5543831
--- /dev/null
+++ b/skills/apify-ultimate-scraper/references/workflows/social-media-analytics.md
@@ -0,0 +1,74 @@
+# Social media analytics workflows
+
+## Instagram account performance analysis
+**When:** User wants engagement metrics and content performance for an Instagram account.
+
+### Pipeline
+1. **Get profile** -> `apify/instagram-profile-scraper`
+   - Key input: `usernames`
+2. **Get recent posts** -> `apify/instagram-post-scraper`
+   - Key input: `directUrls` (from profile's `latestPosts[].url`) or `usernames`
+
+### Output fields
+Step 1: `followersCount`, `followsCount`, `postsCount`, `biography`, `isVerified`
+Step 2: `caption`, `likesCount`, `commentsCount`, `timestamp`, `type` (photo/video/reel), `url`
+
+## TikTok creator analytics
+**When:** User wants performance data for a TikTok creator.
+
+### Pipeline
+1. **Get profile** -> `clockworks/tiktok-profile-scraper`
+   - Key input: `profiles` (handles or URLs)
+
+### Output fields
+Step 1: `nickname`, `followers`, `following`, `likes`, `videos`, `verified`, `recentVideos[]` (with views, likes, shares per video)
+
+## Instagram competitor content analysis
+**When:** User wants to identify top-performing content formats and engagement patterns from competitor Instagram accounts.
+
+### Pipeline
+1. **Get competitor posts** -> `apify/instagram-post-scraper`
+   - Key input: `usernames` (competitor handles), `resultsLimit` (100), `scrapePostsUntilDate`
+2. **Get reels separately** -> `apify/instagram-reel-scraper`
+   - Key input: `usernames` (same handles)
+
+### Output fields
+Step 1: `likesCount`, `commentsCount`, `timestamp`, `type` (post/reel/story), `caption`, `displayUrl`, `url`
+Step 2: `likesCount`, `commentsCount`, `playsCount`, `duration`, `caption`, `url`
+
+### Gotcha
+Run both Actors to capture the full content mix - the post scraper may under-count reels. Calculate engagement rate per post ((likes + comments) / follower count) and sort to surface top performers.
+
+## LinkedIn company page analytics
+**When:** User wants to track LinkedIn post performance for a company page or benchmark against competitors.
+
+### Pipeline
+1. **Get company posts** -> `harvestapi/linkedin-company-posts`
+   - Key input: `companyUrl`, `maxPosts`, `publishedAfter`
+2. **Enrich post details** -> `apimaestro/linkedin-post-detail`
+   - Pipe: `results[].url` -> `urls`
+
+### Output fields
+Step 1: `likesCount`, `commentsCount`, `repostsCount`, `text`, `publishedAt`, `url`
+Step 2: `reactions{}` (breakdown by type), `topComments[]`, `impressions`
+
+### Gotcha
+Both Actors are PPE (pay-per-event). Step 1: ~$0.002/post, Step 2: ~$0.005/post. For 100 posts across 3 companies, estimate ~$2.10. Confirm with user before running.
+
+## Multi-platform engagement comparison
+**When:** User wants to compare an account's performance across platforms.
+
+### Pipeline (run independently, combine)
+1. **Instagram** -> `apify/instagram-profile-scraper` with `usernames`
+2. **TikTok** -> `clockworks/tiktok-profile-scraper` with `profiles`
+3. **YouTube** -> `streamers/youtube-channel-scraper` with `channelUrls`
+4. **X/Twitter** -> `apidojo/twitter-user-scraper` with `handles`
+
+### Output fields
+Instagram: `followersCount`, `postsCount`, `biography`
+TikTok: `followers`, `likes`, `videos`
+YouTube: `subscriberCount`, `videoCount`, `viewCount`
+X/Twitter: `followers`, `tweets`, `likes`
+
+### Gotcha
+Parallel workflow - run each Actor independently. Normalize metric names for comparison (followers/subscribers, posts/videos/tweets).
diff --git a/skills/apify-ultimate-scraper/references/workflows/trend-research.md b/skills/apify-ultimate-scraper/references/workflows/trend-research.md
new file mode 100644
index 0000000..1f675ac
--- /dev/null
+++ b/skills/apify-ultimate-scraper/references/workflows/trend-research.md
@@ -0,0 +1,79 @@
+# Trend and keyword research workflows
+
+## Google Trends analysis
+**When:** User wants to analyze search demand trends for keywords or topics.
+
+### Pipeline
+1. **Get trend data** -> `apify/google-trends-scraper`
+   - Key input: `searchTerms`, `timeRange`, `geo` (country code)
+
+### Output fields
+Step 1: `term`, `timelineData[]` (date, value), `relatedQueries[]`, `relatedTopics[]`
+
+## Cross-platform hashtag research
+**When:** User wants to evaluate a hashtag's reach and usage across platforms.
+
+### Pipeline
+1. **Cross-platform overview** -> `apify/social-media-hashtag-research`
+   - Key input: `hashtags`, `platforms` (instagram, youtube, tiktok, facebook)
+
+### Output fields
+Step 1: `hashtag`, `platform`, `postsCount`, `topPosts[]`, `relatedHashtags[]`
+
+## TikTok trend discovery
+**When:** User wants to find trending content, sounds, or hashtags on TikTok.
+
+### Pipeline
+1. **Trending content** -> `clockworks/tiktok-trends-scraper`
+   - Key input: `channel` (trending category)
+2. **Explore categories** -> `clockworks/tiktok-explore-scraper`
+   - Key input: `exploreCategories`
+
+### Output fields
+Step 1: `videoUrl`, `description`, `likes`, `shares`, `views`, `author`, `music`
+Step 2: `category`, `posts[]`, `authors[]`, `music[]`
+
+## Reddit trend and community insight mining
+**When:** User wants to surface emerging trends, product feedback themes, or competitor mentions from Reddit communities.
+
+### Pipeline
+1. **Scrape subreddits** -> `trudax/reddit-scraper-lite`
+   - Key input: `startUrls` (subreddit URLs), `maxItems`, `sort` (hot/rising/new)
+
+### Output fields
+Step 1: `title`, `body`, `subreddit`, `score`, `numberOfComments`, `url`, `createdAt`, `comments[]`
+
+### Gotcha
+Use `sort: rising` for early trend signals and `sort: hot` for confirmed trending topics. Filter by a `score` threshold (e.g., >50) to reduce noise. The comments array provides qualitative context for AI sentiment analysis.
+
+## YouTube outlier video discovery
+**When:** User wants to identify breakout videos in a niche with disproportionate views vs. channel subscriber count - a signal for content strategy pivots.
+
+### Pipeline
+1. **Search niche videos** -> `streamers/youtube-scraper`
+   - Key input: `searchTerms`, `maxResults`, `sort` (viewCount), `uploadDate` (filter range)
+2. **Get channel context** -> `streamers/youtube-channel-scraper`
+   - Pipe: `results[].channelUrl` -> `channelUrls`
+
+### Output fields
+Step 1: `title`, `viewCount`, `likeCount`, `commentCount`, `channelName`, `publishedAt`, `url`
+Step 2: `subscriberCount`, `videoCount`, `viewCount` (channel totals)
+
+### Gotcha
+Outlier score = video `viewCount` / channel `subscriberCount`. Ratios > 10x indicate breakout potential. Run Step 2 on a filtered shortlist only - no need to fetch channel data for every result.
+
+## Content topic validation
+**When:** User wants to validate whether a topic has demand before creating content.
+
+### Pipeline
+1. **Search demand** -> `apify/google-trends-scraper`
+   - Key input: `searchTerms` (topic keywords)
+2. **Social reach** -> `apify/social-media-hashtag-research`
+   - Key input: `hashtags` (topic hashtags)
+
+### Output fields
+Step 1: `timelineData[]` (trending up/down), `relatedQueries[]`
+Step 2: `postsCount` per platform, `topPosts[]`
+
+### Gotcha
+Google Trends shows relative interest (0-100 scale), not absolute volume. Combine with hashtag post counts for a fuller picture.
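The outlier scoring in the YouTube outlier video discovery workflow above can be sketched as follows. Field names mirror the listed output fields and the 10x threshold comes from that workflow's gotcha; the function name and record shapes are illustrative assumptions, not an Actor API.

```python
def outlier_shortlist(videos, channels, min_ratio=10.0):
    """Flag videos whose viewCount is disproportionate to channel size.

    `videos` mimics step 1 results (with `channelUrl` per the pipe),
    `channels` mimics step 2 results; `min_ratio` is the 10x heuristic.
    """
    # Index channel subscriber counts by channel URL for the join.
    subs = {c["channelUrl"]: c["subscriberCount"] for c in channels}
    scored = []
    for v in videos:
        subscribers = subs.get(v["channelUrl"], 0)
        if subscribers <= 0:
            continue  # skip channels with missing or zero subscriber data
        ratio = v["viewCount"] / subscribers
        if ratio >= min_ratio:
            scored.append({**v, "outlierScore": round(ratio, 1)})
    # Highest outlier score first.
    scored.sort(key=lambda v: v["outlierScore"], reverse=True)
    return scored
```

Per the gotcha, call this only after fetching channel data for a pre-filtered shortlist rather than for every search result.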