From 31c5baa9df1090897fbd03b67bdc2d289a71f298 Mon Sep 17 00:00:00 2001 From: Mouad BANI Date: Wed, 1 Apr 2026 15:04:40 +0100 Subject: [PATCH 1/5] feat: claude skill for snowflake connectors implementation Signed-off-by: Mouad BANI --- .../scaffold-snowflake-connector/SKILL.md | 497 ++++++++++++++++++ .gitignore | 10 +- 2 files changed, 504 insertions(+), 3 deletions(-) create mode 100644 .claude/skills/scaffold-snowflake-connector/SKILL.md diff --git a/.claude/skills/scaffold-snowflake-connector/SKILL.md b/.claude/skills/scaffold-snowflake-connector/SKILL.md new file mode 100644 index 0000000000..19b7e39287 --- /dev/null +++ b/.claude/skills/scaffold-snowflake-connector/SKILL.md @@ -0,0 +1,497 @@ +--- +name: scaffold-snowflake-connector +description: Use when adding a new snowflake-connector data source to crowd.dev — a new platform or a new source within an existing platform that needs buildSourceQuery, transformer, activity types, migration, and all associated type registrations scaffolded. +--- + +# Scaffold Snowflake Connector + +## Overview + +This skill scaffolds all code required to add a new snowflake-connector data source to crowd.dev. It covers up to 11 touch points, enforces zero-assumption practices, and requires explicit user validation for every piece of business logic before writing any file to disk. + +**One platform, one source per run.** Multiple sources within a platform are done one run at a time. + +## Process Flow + +```dot +digraph scaffold { + "Phase 1: Initial Questions" [shape=box]; + "Phase 2: Schema Collection" [shape=box]; + "Phase 3: Business Logic Validation" [shape=box]; + "All mappings confirmed?" [shape=diamond]; + "Phase 4: Generate & Review Files" [shape=box]; + "User approves each file?" [shape=diamond]; + "Phase 5: Testing" [shape=box]; + "Done" [shape=doublecircle]; + + "Phase 1: Initial Questions" -> "Phase 2: Schema Collection"; + "Phase 2: Schema Collection" -> "Phase 3: Business Logic Validation"; + "Phase 3: Business Logic Validation" -> "All mappings confirmed?"; + "All mappings confirmed?" -> "Phase 3: Business Logic Validation" [label="no, revise"]; + "All mappings confirmed?" -> "Phase 4: Generate & Review Files" [label="yes"]; + "Phase 4: Generate & Review Files" -> "User approves each file?"; + "User approves each file?" -> "Phase 4: Generate & Review Files" [label="no, revise"]; + "User approves each file?" -> "Phase 5: Testing" [label="yes"]; + "Phase 5: Testing" -> "Done"; +} +``` + +## File Inventory (All Touch Points) + +Read the current state of each file before modifying. Never modify without reading first. + +### Conditional — only if platform is NEW +| # | File | Change | +|---|------|--------| +| 0 | `services/libs/types/src/enums/platforms.ts` | Add `PlatformType.{PLATFORM}` enum value | + +### Always — structural (template-filled) +| # | File | Change | +|---|------|--------| +| 1 | `services/libs/types/src/enums/organizations.ts` | Add to `OrganizationSource` + `OrganizationAttributeSource` enums | +| 2 | `services/libs/integrations/src/integrations/{platform}/types.ts` | NEW: activity type enum + GRID | +| 3 | `services/libs/integrations/src/integrations/index.ts` | Add `export * from './{platform}/types'` | +| 4 | `services/libs/data-access-layer/src/organizations/attributesConfig.ts` | Add to `ORG_DB_ATTRIBUTE_SOURCE_PRIORITY` | +| 5 | `backend/src/database/migrations/V{epoch}__add{Platform}ActivityTypes.sql` | NEW: INSERT into `activityTypes` | +| 6 | `services/apps/snowflake_connectors/src/integrations/types.ts` | Add `DataSourceName.{PLATFORM}_{SOURCE}` | +| 7 | `services/apps/snowflake_connectors/src/integrations/index.ts` | Import + register in `supported` | + +### Per source — business logic (generated from confirmed inputs) +| # | File | Change | +|---|------|--------| +| 8 | `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/buildSourceQuery.ts` | NEW | +| 9 | `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/transformer.ts` | NEW | +| 10 | `services/apps/snowflake_connectors/src/integrations/{platform}/{platform}TransformerBase.ts` | NEW (optional) | + +--- + +## Phase 1: Initial Questions + +Ask one question at a time. Do not bundle questions. + +### Question 1 — Platform name + +Read `services/libs/types/src/enums/platforms.ts` and list all current `PlatformType` values. + +Ask: +> "What is the platform name for this data source? It must match a `PlatformType` enum value. Current values are: [list them]. If your platform isn't listed, provide the name and I'll add it." + +- If the value **is** in the enum: continue. +- If the value **is not** in the enum: warn the user explicitly: + > "⚠️ `{name}` is not in the `PlatformType` enum. I'll need to add it to `services/libs/types/src/enums/platforms.ts`. Please confirm this is a new platform and confirm the exact string value (e.g., `my-platform`)." + Wait for confirmation before proceeding. + +### Question 2 — New or existing platform? + +Read `services/apps/snowflake_connectors/src/integrations/index.ts`. + +- If the platform already has sources registered: list them and confirm which one we're adding. +- If the platform is new to snowflake_connectors: note that touch points 0–4 and 7 will all be new additions. + +### Question 3 — Source name + +Ask: +> "What is the name for this data source? This becomes the directory name and `DataSourceName` enum suffix (e.g., `enrollments`, `event-registrations`)." + +### Question 4 — Snowflake table + +Ask: +> "What is the fully-qualified Snowflake table name for this source? Format: `DB.SCHEMA.TABLE`." + +If there are JOIN tables needed (e.g., users, organizations), ask after the main table is confirmed — one JOIN table at a time. + +--- + +## Phase 2: Schema Collection + +Ask the user to run the following in Snowflake UI and paste the output: + +```sql +DESCRIBE TABLE DB.SCHEMA.TABLE; +``` + +Parse the output to build a column registry: +- Column name (exact casing from schema — this is the reference for all code) +- Data type +- Nullable (Y/N) + +**Store this as the canonical column reference. Every column name used in generated code must appear in this registry. Never assume or invent a column name.** + +For each JOIN table identified in Phase 1: +- Ask the user to `DESCRIBE TABLE` that table too, one at a time +- Ask: "Is this the same table used by an existing implementation (e.g., same users or orgs table as TNC or CVENT)?" + - If yes: read the existing transformer that uses it and follow its column mapping exactly + - If no: treat every column as unknown — validate each one explicitly in Phase 3 + +--- + +## Phase 3: Business Logic Validation + +**Rule:** Every attribute is confirmed by the user individually. Never batch multiple questions into one message. Never proceed to file generation until all sub-sections are complete. + +--- + +### 3a. Identity Mapping + +**Rule:** Every member record must produce at least one identity with `type: MemberIdentityType.USERNAME`. The fallback chain is (try in order): +1. Platform-native username column from schema +2. LFID column value (used as username on `PlatformType.LFID`) +3. Email value used as the platform USERNAME (last resort) + +Ask one at a time: + +1. "Which column contains the member's email address? (Required — rows where this is null will be skipped.)" + - Verify the column is in the schema registry. If nullable=Y, note it. + +2. "Is there a platform-native username column in the schema?" + - If yes: "Which column?" — verify it's in the registry. + - If no: note that email or LFID will be used as username fallback. + +3. "Is there an LFID column in the schema?" + - If yes: "Which column?" — verify it's in the registry. + - If no: note it. + +4. "Are there any other identity-relevant columns (e.g., a separate user ID, a profile URL)?" + +5. For each confirmed identity column: + - "Should identities from this column be `verified: true`?" + - "What is the `verifiedBy` value?" (usually the platform type) + +**Critical check:** If a JOIN table for users is not the same table used by an existing implementation, validate every column explicitly. Do not infer semantics from column names. + +After all identity columns are confirmed, summarize the full identity-building logic and ask: +> "Here is how identities will be built: [summary]. Does this look correct?" + +--- + +### 3b. Organization Mapping + +Ask first: "Does this source contain organization/company data?" + +If **no**: skip to 3c. + +If **yes**: + +Ask: "Is the org data sourced from a JOIN table already used by another platform's implementation?" + +- If **yes**: read that existing implementation's `buildOrganizations` method from the codebase. Follow its column mapping exactly. Show the user what columns will be used and ask for confirmation. +- If **no** (new org table): validate each column explicitly, one at a time: + 1. Org display name column? + 2. Website column? + 3. Domain aliases column? If yes: "What is the format? (comma-separated string, array, etc.)" + 4. Logo URL column? + 5. Industry column? + 6. Employee count/size column? + +**Critical — `isIndividualNoAccount`:** +Read `services/apps/snowflake_connectors/src/core/transformerBase.ts` to find `isIndividualNoAccount`. The generated code MUST call this method rather than reimplementing it. Show the user the method signature and confirm it will be used identically to existing sources. + +After all org columns are confirmed, summarize and ask for confirmation before proceeding. + +--- + +### 3c. Activity Types + +**Rule:** Activity type names and scores come entirely from the user. Do not suggest them. + +Ask: +> "Please list all activity types this source can produce. For each, provide: +> - A short name (e.g., `enrolled-certification`) +> - A score from 1–10 +> +> I'll suggest the enum key, label, description, and flags for each one." + +For each activity type the user provides, suggest the following **one at a time**, waiting for approval before moving to the next: + +| Field | Suggestion rule | +|-------|----------------| +| Enum key | SCREAMING_SNAKE_CASE version of the name (e.g., `ENROLLED_CERTIFICATION`) | +| String value | The name the user provided (kebab-case) | +| Label | Human-readable (e.g., `Enrolled in certification`) | +| Description | One sentence describing the event (look at similar types in `backend/src/database/migrations/` for style) | +| `isCodeContribution` | `false` unless it involves code (check existing platforms — almost always false for non-GitHub sources) | +| `isCollaboration` | `false` unless it is a collaborative activity | + +Ask: "Does this look correct for `{type_name}`? Any changes?" before moving to the next type. + +--- + +### 3d. Timestamp & sourceId + +Before asking, explain: +> "The **timestamp column** is critical — it drives all incremental exports. Records updated after the last export run are re-fetched using this column. It must never be null. If it can be null, we need to find an alternative or add a `WHERE column IS NOT NULL` guard. +> +> The **sourceId** is used as the deduplication key — if two records have the same sourceId, only one activity is kept. It must uniquely identify each logical event." + +Ask: + +1. "Which column is the incremental update timestamp?" + - Check the schema registry nullable flag. If nullable=Y: "⚠️ This column is marked nullable. Can it actually be null in practice? If yes, we need an alternative column or a guard." Wait for resolution. + - Valid examples: `updated_ts`, `enrolled_at`, `created_at` + +2. "Which column uniquely identifies each record and will be used as `sourceId`?" + - Verify it's in the schema registry. + - Confirm: "Is this value guaranteed to be unique per logical event (not per user)?" + +--- + +### 3e. Base Class Check + +- If this is the **first source** for this platform: + - If org mapping is present: propose creating `{platform}TransformerBase.ts` with shared `buildOrganizations` logic. Show the proposed class structure (modeled on `tncTransformerBase.ts`). Ask for confirmation. + - If no org mapping: no base class needed; transformer extends `TransformerBase` directly. + +- If the platform **already has sources**: + - Check if `services/apps/snowflake_connectors/src/integrations/{platform}/{platform}TransformerBase.ts` exists. + - If yes: new transformer should extend it. Confirm with user. + - If no: new transformer extends `TransformerBase` directly. + +--- + +### 3f. Project Slug for Testing + +Ask: +> "Please provide a `project_slug` from CDP_MATCHED_SEGMENTS that has data in this Snowflake table. It should have a moderate number of records (ideally under a few thousand) and ideally cover all the activity types we're implementing. This slug will be used to restrict the test query and for the staging non-prod guard in `buildSourceQuery`." + +--- + +## Phase 4: File Generation & Review + +Generate files in the order listed in the File Inventory. For **each file**: +1. Show the complete file content in a code block +2. Ask: "Does this look correct? Any changes before I write it?" +3. Only write the file after the user explicitly confirms + +Read each target file before modifying it. + +--- + +### Structural Files (touch points 0–7) + +These are template-filled from confirmed inputs. No business logic reasoning required. + +**Touch point 0 — `platforms.ts`** (only if new platform) + +Add `{PLATFORM} = '{platform-string}',` to the `PlatformType` enum in alphabetical order among the non-standard entries. + +--- + +**Touch point 1 — `organizations.ts`** + +File: `services/libs/types/src/enums/organizations.ts` + +Add to both enums: +- `OrganizationSource.{PLATFORM} = '{platform-string}'` +- `OrganizationAttributeSource.{PLATFORM} = '{platform-string}'` + +--- + +**Touch point 2 — `{platform}/types.ts`** (new file) + +File: `services/libs/integrations/src/integrations/{platform}/types.ts` + +Template (fill with confirmed activity types and scores): + +```typescript +import { IActivityScoringGrid } from '@crowd/types' + +export enum {Platform}ActivityType { + {ENUM_KEY} = '{string-value}', + // ... one entry per confirmed activity type +} + +export const {PLATFORM}_GRID: Record<{Platform}ActivityType, IActivityScoringGrid> = { + [{Platform}ActivityType.{ENUM_KEY}]: { score: {score} }, + // ... one entry per confirmed activity type +} +``` + +--- + +**Touch point 3 — `integrations/index.ts`** + +File: `services/libs/integrations/src/integrations/index.ts` + +Add before `export * from './activityDisplayService'`: +```typescript +export * from './{platform}/types' +``` + +--- + +**Touch point 4 — `attributesConfig.ts`** + +File: `services/libs/data-access-layer/src/organizations/attributesConfig.ts` + +Add `OrganizationAttributeSource.{PLATFORM},` to the `ORG_DB_ATTRIBUTE_SOURCE_PRIORITY` array, positioned after the most recently added snowflake platform (currently after `OrganizationAttributeSource.TNC`). + +--- + +**Touch point 5 — Flyway migration** (new file) + +File: `backend/src/database/migrations/V{Date.now()}__add{Platform}ActivityTypes.sql` + +Template (use epoch timestamp for `V` prefix — run `date +%s%3N` in terminal): + +```sql +INSERT INTO "activityTypes" ("activityType", platform, "isCodeContribution", "isCollaboration", description, "label") VALUES +('{string-value}', '{platform-string}', {false|true}, {false|true}, '{description}', '{label}'), +-- ... one row per confirmed activity type, last row without trailing comma +('{string-value}', '{platform-string}', {false|true}, {false|true}, '{description}', '{label}'); +``` + +--- + +**Touch point 6 — `snowflake_connectors/types.ts`** + +File: `services/apps/snowflake_connectors/src/integrations/types.ts` + +Add to `DataSourceName` enum: +```typescript +{PLATFORM}_{SOURCE} = '{source-name}', +``` + +--- + +**Touch point 7 — `snowflake_connectors/index.ts`** + +File: `services/apps/snowflake_connectors/src/integrations/index.ts` + +Add import at top: +```typescript +import { buildSourceQuery as {platform}{Source}BuildQuery } from './{platform}/{source}/buildSourceQuery' +import { {Platform}{Source}Transformer } from './{platform}/{source}/transformer' +``` + +Add to `supported` object under the platform key (create the key if new platform): +```typescript +[PlatformType.{PLATFORM}]: { + sources: [ + { + name: DataSourceName.{PLATFORM}_{SOURCE}, + buildSourceQuery: {platform}{Source}BuildQuery, + transformer: new {Platform}{Source}Transformer(), + }, + ], +}, +``` + +If the platform already exists, append to its `sources` array instead of creating a new key. + +--- + +### Business Logic Files (touch points 8–10) + +These are AI-generated from the confirmed column mappings. Apply all rules strictly. + +--- + +**Touch point 8 — `buildSourceQuery.ts`** (new file) + +File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/buildSourceQuery.ts` + +**Rules (enforced — do not deviate):** +- Use explicit column names only. Never use `table.*` +- If any TIMESTAMP_TZ columns exist in the schema, exclude and re-cast them as TIMESTAMP_NTZ (see CVENT pattern) +- Follow the CTE structure: + 1. `org_accounts` CTE (if org data present) + 2. `CDP_MATCHED_SEGMENTS` CTE (always) + 3. `new_cdp_segments` CTE (inside the `sinceTimestamp` branch only) +- Incremental pattern: when `sinceTimestamp` is provided, UNION two queries: + - Records where the confirmed timestamp column `> sinceTimestamp` + - Records in newly created segments (using `new_cdp_segments`) +- Deduplication: `QUALIFY ROW_NUMBER() OVER (PARTITION BY {sourceId_column} ORDER BY {dedup_key} DESC) = 1` +- Non-prod guard: `if (!IS_PROD_ENV) { select += \` AND {table}.{project_slug_column} = '{confirmed_project_slug}'\` }` +- All column names must match the confirmed schema registry exactly + +Show the full generated file and ask for confirmation before writing. + +--- + +**Touch point 9 — `transformer.ts`** (new file) + +File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/transformer.ts` + +**Rules (enforced — do not deviate):** +- All string comparisons must be case-insensitive: use `.toLowerCase()` on both sides of comparison only; preserve the original value in the output +- No broad `else` statements — every branch must have an explicit condition +- All column names referenced in code must exactly match the schema registry — never assumed +- Identity fallback chain (always produces at least one USERNAME identity): + 1. If platform-native username column present and non-null → push EMAIL + USERNAME identities for platform + 2. Else if LFID present and non-null → push EMAIL for platform + LFID value as USERNAME for platform + 3. Else → push EMAIL value as USERNAME for platform (email-as-username) +- After building platform identities, if LFID column is present and non-null → push separate LFID identity: `{ platform: PlatformType.LFID, value: lfid, type: MemberIdentityType.USERNAME, ... }` +- `isIndividualNoAccount` must call `this.isIndividualNoAccount(displayName)` from `TransformerBase` — never reimplement +- Extends the platform base class if one was confirmed in Phase 3e; otherwise extends `TransformerBase` directly +- If in doubt about any mapping: ask the user before writing + +Show the full generated file and ask for confirmation before writing. + +--- + +**Touch point 10 — `{platform}TransformerBase.ts`** (new file, only if confirmed in Phase 3e) + +File: `services/apps/snowflake_connectors/src/integrations/{platform}/{platform}TransformerBase.ts` + +Model on `tncTransformerBase.ts`. Contains only the shared `buildOrganizations` logic. Abstract class that extends `TransformerBase` and sets `readonly platform = PlatformType.{PLATFORM}`. + +Show the full generated file and ask for confirmation before writing. + +--- + +## Phase 5: Testing + +After all files are written, provide the user with a test query: + +Take the `buildSourceQuery` output (without `sinceTimestamp`, so the full-load variant), substitute the non-prod project_slug guard with the confirmed `project_slug`, and add `LIMIT 100`: + +```sql +-- Example shape (exact SQL comes from the generated buildSourceQuery): +WITH org_accounts AS (...), +cdp_matched_segments AS (...) +SELECT + -- all explicit columns +FROM {table} t +INNER JOIN cdp_matched_segments cms ON ... +WHERE t.{project_slug_column} = '{confirmed_project_slug}' +QUALIFY ROW_NUMBER() OVER (...) = 1 +LIMIT 100; +``` + +Instruct the user: +> "Please run this query in Snowflake and paste the result (JSON or CSV). I'll walk through each row and verify the transformer logic produces the expected `IActivityData` before we consider this done." + +### Dry-Run Validation + +When the user pastes results, for each row: +- Apply transformer logic in-chat (show inputs → outputs) +- Show the resulting `IActivityData` + segment slug +- Flag immediately: null email, missing USERNAME identity, unexpected activity type, null sourceId, null timestamp +- If any issue found: loop back to the relevant Phase 3 sub-section, fix the mapping, regenerate the affected file + +### Completion Checklist + +Before declaring the implementation complete, verify every item: + +- [ ] All applicable files from the File Inventory written to disk +- [ ] No `table.*` anywhere in any generated query +- [ ] No broad `else` in any transformer +- [ ] All string comparisons are case-insensitive (`.toLowerCase()` on both sides of comparison) +- [ ] `sourceId` confirmed unique in test data +- [ ] Timestamp column confirmed never null in test data +- [ ] All activity types present in test data (or user explicitly acknowledged any absent ones) +- [ ] `isIndividualNoAccount` behavior verified matches existing sources + +--- + +## Anti-Hallucination Rules + +These apply throughout the entire skill execution. They are non-negotiable. + +1. **Never assume a column exists.** Every column reference must come from the pasted `DESCRIBE TABLE` output or a confirmed JOIN table schema. +2. **Never assume a JOIN table is the same as an existing one.** If a table has not been verified against an existing implementation, validate every column. +3. **Never suggest activity type names or scores.** These come from the user. +4. **Never write a file without showing it first and receiving explicit user approval.** +5. **When in doubt, ask.** A question costs seconds. A wrong implementation costs hours. +6. **One question per message.** Never bundle multiple validations. +7. **Read before modifying.** Read every file before making changes to it. diff --git a/.gitignore b/.gitignore index 54edc4f658..d7e0998702 100644 --- a/.gitignore +++ b/.gitignore @@ -196,9 +196,13 @@ services/libs/tinybird/.diff_tmp # custom cursor rules .cursor/rules/*.local.mdc -# claude code rules -CLAUDE.md -.claude +# claude code +.claude/cache/ +.claude/tmp/ +.claude/logs/ +.claude/sessions/ +.claude/todos/ +.claude/settings.local.json # git integration test repositories & output services/apps/git_integration/src/test/repos/ From fd86faee5062379ac8cb384aff96871f15dc4641 Mon Sep 17 00:00:00 2001 From: Mouad BANI Date: Thu, 2 Apr 2026 12:06:21 +0100 Subject: [PATCH 2/5] feat: tweak skill based on committees test Signed-off-by: Mouad BANI --- .../scaffold-snowflake-connector/SKILL.md | 252 ++++++++++++------ 1 file changed, 175 insertions(+), 77 deletions(-) diff --git a/.claude/skills/scaffold-snowflake-connector/SKILL.md b/.claude/skills/scaffold-snowflake-connector/SKILL.md index 19b7e39287..c9e1ef0495 100644 --- a/.claude/skills/scaffold-snowflake-connector/SKILL.md +++ b/.claude/skills/scaffold-snowflake-connector/SKILL.md @@ -17,7 +17,8 @@ This skill scaffolds all code required to add a new snowflake-connector data sou digraph scaffold { "Phase 1: Initial Questions" [shape=box]; "Phase 2: Schema Collection" [shape=box]; - "Phase 3: Business Logic Validation" [shape=box]; + "Pre-Analysis: Auto-Derive Mappings" [shape=box]; + "Phase 3: Confirm / Correct Mappings" [shape=box]; "All mappings confirmed?" [shape=diamond]; "Phase 4: Generate & Review Files" [shape=box]; "User approves each file?" [shape=diamond]; @@ -25,9 +26,10 @@ digraph scaffold { "Done" [shape=doublecircle]; "Phase 1: Initial Questions" -> "Phase 2: Schema Collection"; - "Phase 2: Schema Collection" -> "Phase 3: Business Logic Validation"; - "Phase 3: Business Logic Validation" -> "All mappings confirmed?"; - "All mappings confirmed?" -> "Phase 3: Business Logic Validation" [label="no, revise"]; + "Phase 2: Schema Collection" -> "Pre-Analysis: Auto-Derive Mappings"; + "Pre-Analysis: Auto-Derive Mappings" -> "Phase 3: Confirm / Correct Mappings"; + "Phase 3: Confirm / Correct Mappings" -> "All mappings confirmed?"; + "All mappings confirmed?" -> "Phase 3: Confirm / Correct Mappings" [label="no, revise"]; "All mappings confirmed?" -> "Phase 4: Generate & Review Files" [label="yes"]; "Phase 4: Generate & Review Files" -> "User approves each file?"; "User approves each file?" -> "Phase 4: Generate & Review Files" [label="no, revise"]; @@ -93,41 +95,138 @@ Read `services/apps/snowflake_connectors/src/integrations/index.ts`. Ask: > "What is the name for this data source? This becomes the directory name and `DataSourceName` enum suffix (e.g., `enrollments`, `event-registrations`)." -### Question 4 — Snowflake table +### Question 4 — Snowflake tables -Ask: -> "What is the fully-qualified Snowflake table name for this source? Format: `DB.SCHEMA.TABLE`." +Ask for the main table first, then any additional tables: +> "What is the main Snowflake table for this source?" + +Then: +> "Are there any additional tables needed (e.g., for user data, org data, segment matching)? For each one, provide the table name and its purpose — what data it holds and how it relates to the main table." -If there are JOIN tables needed (e.g., users, organizations), ask after the main table is confirmed — one JOIN table at a time. +Do not assume a direct JOIN between tables. The relation type (JOIN, subquery, CTE, lookup) must come from the user's description of each table's purpose. --- ## Phase 2: Schema Collection -Ask the user to run the following in Snowflake UI and paste the output: +Once all table names are known, generate a single query to collect schema information for every table at once. Ask the user to run it in Snowflake UI and either paste the output or provide a path to the exported file (e.g., CSV): ```sql -DESCRIBE TABLE DB.SCHEMA.TABLE; +SELECT DISTINCT + table_catalog, + table_schema, + table_name, + column_name, + data_type, + is_nullable, + ordinal_position +FROM SNOWFLAKE.ACCOUNT_USAGE.COLUMNS +WHERE (table_catalog, table_schema, table_name) IN ( + ('DB1', 'SCHEMA1', 'TABLE1'), + ('DB2', 'SCHEMA2', 'TABLE2') + -- one row per confirmed table +) +AND DELETED IS NULL +ORDER BY table_name, ordinal_position; ``` -Parse the output to build a column registry: +Replace the placeholder values with the actual table names provided by the user. + +Ask the user to either paste the output directly or provide a path to an exported file (CSV, JSON, or TSV). If a file path is given, read the file. + +Parse the output to build a column registry per table: - Column name (exact casing from schema — this is the reference for all code) - Data type -- Nullable (Y/N) +- Nullable (YES/NO) **Store this as the canonical column reference. Every column name used in generated code must appear in this registry. Never assume or invent a column name.** -For each JOIN table identified in Phase 1: -- Ask the user to `DESCRIBE TABLE` that table too, one at a time -- Ask: "Is this the same table used by an existing implementation (e.g., same users or orgs table as TNC or CVENT)?" - - If yes: read the existing transformer that uses it and follow its column mapping exactly - - If no: treat every column as unknown — validate each one explicitly in Phase 3 +For each JOIN table, check whether any existing transformer in `services/apps/snowflake_connectors/src/integrations/` queries from the same table. If yes, inherit its column mappings; if no, treat every column as unknown and derive it from sample data in the Pre-Analysis step below. + +After building the column registry, generate a sample data query using **only explicit column names from the registry** (never `*`) and ask the user to run it: + +```sql +SELECT + main.col1, main.col2, main.col3, -- all columns from main table registry + j1.col1, j1.col2, -- all columns from each JOIN table + ... +FROM DB.SCHEMA.MAIN_TABLE main +LEFT JOIN DB.SCHEMA.JOIN_TABLE1 j1 ON main.join_key = j1.pk +-- ... all joins +LIMIT 20; +``` + +Ask the user: +> "Please run this query in Snowflake and paste the result or provide a path to the exported file (CSV, JSON, or TSV). I'll use the actual data values to auto-derive column mappings before asking for your confirmation." + +--- + +## Pre-Analysis: Auto-Derive Mappings + +Before asking the user any Phase 3 questions, perform this analysis using the schema registry, the pasted sample data, and existing implementations that query the same tables. + +### Step 1 — Inherit from existing implementations + +For each JOIN table, check if any existing transformer imports or queries from it. If found, inherit those column mappings directly — they are already validated. + +Read the relevant transformer files from `services/apps/snowflake_connectors/src/integrations/` and extract: which column maps to email, username, LFID, org name, org website, domain aliases, timestamp, sourceId. + +### Step 2 — Infer remaining roles from names + data values + +For columns not covered by an existing implementation, apply these heuristics: + +| Role | High-confidence signals | +|------|------------------------| +| Email | Name contains `EMAIL`; values match `x@y.z` pattern | +| Platform username | Name is `USER_NAME`, `USERNAME`, `LOGIN`, `HANDLE`; values are non-email strings | +| LFID | Name contains `LF_USERNAME`, `LFID`, `LF_ID` | +| Timestamp (incremental) | Type is TIMESTAMP; name contains `UPDATED`, `MODIFIED`; nullable=NO | +| sourceId | Name ends in `_ID`; values appear unique in sample | +| Org name | Name contains `ACCOUNT_NAME`, `ORGANIZATION_NAME`, `COMPANY` | +| Org website | Name contains `WEBSITE`, `DOMAIN`, `URL`; values start with `http` | +| Domain aliases | Name contains `DOMAIN_ALIASES`, `ALIASES`; values are comma-separated or array | + +Assign each role a confidence level: +- **HIGH** — column name + data values both match unambiguously +- **MEDIUM** — multiple candidates exist, or name matches but data is ambiguous +- **LOW / UNKNOWN** — no clear match + +### Step 3 — Present consolidated proposal + +Present all HIGH-confidence mappings at once (exception to the one-question-per-message rule — batching is intentional here for efficiency). For MEDIUM, offer choices. For LOW/UNKNOWN, ask open-ended: + +> "Based on the schema and sample data, here's what I identified: +> - **Email**: `USER_EMAIL` (contains email values like `user@example.com`) ✅ +> - **Username**: `USER_NAME` ✅ +> - **LFID**: `LF_USERNAME` ✅ +> - **Timestamp**: `UPDATED_TS` (TIMESTAMP_NTZ, not nullable) ✅ +> - **sourceId**: `REGISTRATION_ID` (appears unique in sample) ✅ +> - **Org website**: `ORG_WEBSITE` ✅ +> - **Org name**: `ACCOUNT_NAME` or `COMPANY_NAME`? (both present) +> - **Domain aliases format**: couldn't determine — comma-separated string or array? +> +> Please confirm the ✅ items and resolve the open questions." + +After the user responds, record all confirmed mappings. **Skip any Phase 3 sub-step whose mappings are fully resolved here.** Only enter a sub-step for fields that remain MEDIUM, LOW, or flagged by the user. --- -## Phase 3: Business Logic Validation +## Phase 3: Confirm / Correct Mappings + +**Rule:** Skip sub-sections where Pre-Analysis fully resolved the mapping. For remaining sub-sections, use the propose-then-confirm pattern — never ask open-ended questions when a proposal can be made. Present choices when multiple candidates exist. Ask open-ended only when no inference is possible. -**Rule:** Every attribute is confirmed by the user individually. Never batch multiple questions into one message. Never proceed to file generation until all sub-sections are complete. +**When raising any ambiguity, always include:** +1. **How existing data sources handle it** — read the relevant transformer(s) from `services/apps/snowflake_connectors/src/integrations/` and show the pattern. If an existing source skips or doesn't implement the feature, say so explicitly (e.g., "TNC doesn't use a logo field — it's left undefined"). +2. **A Snowflake query to resolve it** — if the ambiguity can be answered by inspecting actual data (e.g., checking uniqueness, null rate, value distribution), provide the query so the user can run it instead of guessing. + +Example format for an ambiguity: +> "I see two timestamp columns: `CREATED_AT` (nullable) and `UPDATED_AT` (not nullable). +> - Existing sources (TNC, CVENT) all use a non-nullable `updated_ts`-style column for incremental exports. +> - To check null rates, run: +> ```sql +> SELECT COUNT(*) total, COUNT(CREATED_AT) created_not_null, COUNT(UPDATED_AT) updated_not_null FROM DB.SCHEMA.TABLE LIMIT 10000; +> ``` +> Which column should be used as the incremental timestamp?" --- @@ -138,50 +237,38 @@ For each JOIN table identified in Phase 1: 2. LFID column value (used as username on `PlatformType.LFID`) 3. Email value used as the platform USERNAME (last resort) -Ask one at a time: - -1. "Which column contains the member's email address? (Required — rows where this is null will be skipped.)" - - Verify the column is in the schema registry. If nullable=Y, note it. - -2. "Is there a platform-native username column in the schema?" - - If yes: "Which column?" — verify it's in the registry. - - If no: note that email or LFID will be used as username fallback. +If Pre-Analysis resolved email, username, and LFID columns with HIGH confidence and the user confirmed them, skip to the summary step below. -3. "Is there an LFID column in the schema?" - - If yes: "Which column?" — verify it's in the registry. - - If no: note it. +For any unresolved identity field, use this pattern: +- **Multiple candidates found**: "I see columns `A` and `B` that could be the email — which one?" (present choices, not open-ended) +- **One candidate found**: "I believe `USER_EMAIL` is the email column based on its values. Confirm?" (one-tap confirmation) +- **No candidate found**: "I couldn't identify an email column — please specify." -4. "Are there any other identity-relevant columns (e.g., a separate user ID, a profile URL)?" +For each confirmed identity column also confirm: +- `verified: true`? (default yes — ask only if the data suggests otherwise) +- `verifiedBy` value (default: platform type — propose it, don't ask open-ended) -5. For each confirmed identity column: - - "Should identities from this column be `verified: true`?" - - "What is the `verifiedBy` value?" (usually the platform type) +**Critical:** If a JOIN table for users is NOT the same table used by an existing implementation, validate every column explicitly regardless of Pre-Analysis confidence. Column name heuristics alone are not sufficient for unknown tables. -**Critical check:** If a JOIN table for users is not the same table used by an existing implementation, validate every column explicitly. Do not infer semantics from column names. - -After all identity columns are confirmed, summarize the full identity-building logic and ask: +After all identity fields are confirmed, summarize the full identity-building logic and ask: > "Here is how identities will be built: [summary]. Does this look correct?" --- ### 3b. Organization Mapping -Ask first: "Does this source contain organization/company data?" - -If **no**: skip to 3c. +If Pre-Analysis determined there is no org data (no org-related columns found in any table), confirm: "I don't see any organization columns in the schema. Does this source have org/company data?" — if yes, proceed; if no, skip to 3c. -If **yes**: +If Pre-Analysis identified org columns: -Ask: "Is the org data sourced from a JOIN table already used by another platform's implementation?" - -- If **yes**: read that existing implementation's `buildOrganizations` method from the codebase. Follow its column mapping exactly. Show the user what columns will be used and ask for confirmation. -- If **no** (new org table): validate each column explicitly, one at a time: - 1. Org display name column? - 2. Website column? - 3. Domain aliases column? If yes: "What is the format? (comma-separated string, array, etc.)" - 4. Logo URL column? - 5. Industry column? - 6. Employee count/size column? +- If sourced from a JOIN table already used by an existing implementation: read that transformer's `buildOrganizations` method and show the user exactly which columns will be used. Ask: "The org mapping follows the existing [PLATFORM] implementation — does this look correct?" +- If new org table: present the Pre-Analysis proposals for each field, one at a time: + - "Org name: I see `ACCOUNT_NAME` (values like `Acme Corp`). Confirm, or specify another column." + - "Website: I see `ORG_WEBSITE` (values start with `http`). Confirm?" + - "Domain aliases: I see `DOMAIN_ALIASES` — are these comma-separated strings or an array?" + - "Logo URL: I see `LOGO_URL`. Confirm, or is there no logo column?" + - "Industry: I see `ORGANIZATION_INDUSTRY`. Confirm?" + - "Size: I see `ORGANIZATION_SIZE`. Confirm?" **Critical — `isIndividualNoAccount`:** Read `services/apps/snowflake_connectors/src/core/transformerBase.ts` to find `isIndividualNoAccount`. The generated code MUST call this method rather than reimplementing it. Show the user the method signature and confirm it will be used identically to existing sources. @@ -208,7 +295,7 @@ For each activity type the user provides, suggest the following **one at a time* | Enum key | SCREAMING_SNAKE_CASE version of the name (e.g., `ENROLLED_CERTIFICATION`) | | String value | The name the user provided (kebab-case) | | Label | Human-readable (e.g., `Enrolled in certification`) | -| Description | One sentence describing the event (look at similar types in `backend/src/database/migrations/` for style) | +| Description | One sentence describing the event (look at similar types in `backend/src/c/migrations/` for style) | | `isCodeContribution` | `false` unless it involves code (check existing platforms — almost always false for non-GitHub sources) | | `isCollaboration` | `false` unless it is a collaborative activity | @@ -218,33 +305,29 @@ Ask: "Does this look correct for `{type_name}`? Any changes?" before moving to t ### 3d. Timestamp & sourceId -Before asking, explain: -> "The **timestamp column** is critical — it drives all incremental exports. Records updated after the last export run are re-fetched using this column. It must never be null. If it can be null, we need to find an alternative or add a `WHERE column IS NOT NULL` guard. -> -> The **sourceId** is used as the deduplication key — if two records have the same sourceId, only one activity is kept. It must uniquely identify each logical event." +If Pre-Analysis already proposed these with HIGH confidence, present them with this explanation before asking for confirmation (never skip the explanation — it ensures the user understands the criticality): -Ask: +> "The **timestamp column** drives all incremental exports — records updated after the last export run are re-fetched using this column. It must never be null. The **sourceId** is the deduplication key — two records with the same sourceId produce only one activity. It must uniquely identify each logical event." -1. "Which column is the incremental update timestamp?" - - Check the schema registry nullable flag. If nullable=Y: "⚠️ This column is marked nullable. Can it actually be null in practice? If yes, we need an alternative column or a guard." Wait for resolution. - - Valid examples: `updated_ts`, `enrolled_at`, `created_at` +Then confirm the Pre-Analysis proposals: -2. "Which column uniquely identifies each record and will be used as `sourceId`?" - - Verify it's in the schema registry. - - Confirm: "Is this value guaranteed to be unique per logical event (not per user)?" +1. Timestamp: "I identified `UPDATED_TS` (TIMESTAMP_NTZ, not nullable) as the incremental timestamp. Confirm?" + - If the proposed column is nullable=Y: "⚠️ `UPDATED_TS` is marked nullable in the schema. Can it actually be null in practice? If yes, we need an alternative or a `WHERE col IS NOT NULL` guard." Wait for resolution. + - If no candidate was found: "I couldn't identify an update timestamp column — please specify. Valid examples: `updated_ts`, `enrolled_at`, `created_at`." + +2. sourceId: "I identified `REGISTRATION_ID` as the sourceId (appears unique in sample). Confirm, and is this guaranteed unique per logical event (not just per user)?" + - If no candidate was found: "I couldn't identify a unique record ID — please specify which column to use as `sourceId`." --- ### 3e. Base Class Check -- If this is the **first source** for this platform: - - If org mapping is present: propose creating `{platform}TransformerBase.ts` with shared `buildOrganizations` logic. Show the proposed class structure (modeled on `tncTransformerBase.ts`). Ask for confirmation. - - If no org mapping: no base class needed; transformer extends `TransformerBase` directly. +- If this is the **first and only source** for this platform: no base class — `transformer.ts` extends `TransformerBase` directly and contains all logic inline (CVENT pattern). Do not propose a base class. -- If the platform **already has sources**: +- If the platform **will have multiple sources** (user confirmed more sources are planned or already exist): - Check if `services/apps/snowflake_connectors/src/integrations/{platform}/{platform}TransformerBase.ts` exists. - - If yes: new transformer should extend it. Confirm with user. - - If no: new transformer extends `TransformerBase` directly. + - If yes: new transformer extends it. Confirm with user. + - If no and shared org logic exists: propose creating the base class. Show proposed structure (modeled on `tncTransformerBase.ts`). Ask for confirmation before creating. --- @@ -257,12 +340,12 @@ Ask: ## Phase 4: File Generation & Review -Generate files in the order listed in the File Inventory. For **each file**: -1. Show the complete file content in a code block -2. Ask: "Does this look correct? Any changes before I write it?" -3. Only write the file after the user explicitly confirms +Before generating any files, ask the user: +> "How would you like me to proceed with the implementation? +> - **A — Review mode:** I show each file before writing it, you approve or request changes. +> - **B — Auto mode:** I implement all files directly, then show a summary of everything written for a final review." -Read each target file before modifying it. +Proceed based on the user's choice. Read each target file before modifying it in both modes. --- @@ -422,6 +505,7 @@ File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/t 3. Else → push EMAIL value as USERNAME for platform (email-as-username) - After building platform identities, if LFID column is present and non-null → push separate LFID identity: `{ platform: PlatformType.LFID, value: lfid, type: MemberIdentityType.USERNAME, ... }` - `isIndividualNoAccount` must call `this.isIndividualNoAccount(displayName)` from `TransformerBase` — never reimplement +- **Do not set member attributes** (e.g., `MemberAttributeName.JOB_TITLE`, `AVATAR_URL`, `COUNTRY`) unless: (a) the user explicitly requested them, or (b) the same table and column are already used for that attribute in an existing implementation — in which case follow the existing pattern exactly - Extends the platform base class if one was confirmed in Phase 3e; otherwise extends `TransformerBase` directly - If in doubt about any mapping: ask the user before writing @@ -429,12 +513,14 @@ Show the full generated file and ask for confirmation before writing. --- -**Touch point 10 — `{platform}TransformerBase.ts`** (new file, only if confirmed in Phase 3e) +**Touch point 10 — `{platform}TransformerBase.ts`** (new file, only if platform has multiple sources AND shared org logic — confirmed in Phase 3e) File: `services/apps/snowflake_connectors/src/integrations/{platform}/{platform}TransformerBase.ts` Model on `tncTransformerBase.ts`. Contains only the shared `buildOrganizations` logic. Abstract class that extends `TransformerBase` and sets `readonly platform = PlatformType.{PLATFORM}`. +If the platform has a single source, skip this file entirely — keep all logic in `transformer.ts` (CVENT pattern). + Show the full generated file and ask for confirmation before writing. --- @@ -469,6 +555,18 @@ When the user pastes results, for each row: - Flag immediately: null email, missing USERNAME identity, unexpected activity type, null sourceId, null timestamp - If any issue found: loop back to the relevant Phase 3 sub-section, fix the mapping, regenerate the affected file +### Format & Lint + +After all files are written, run format then lint from `services/apps/snowflake_connectors`: + +```bash +cd services/apps/snowflake_connectors +pnpm run format +pnpm run lint +``` + +Fix any errors or warnings and re-run `pnpm run lint` until it reports no complaints. Do not proceed to the completion checklist until lint is clean. + ### Completion Checklist Before declaring the implementation complete, verify every item: @@ -488,8 +586,8 @@ Before declaring the implementation complete, verify every item: These apply throughout the entire skill execution. They are non-negotiable. -1. **Never assume a column exists.** Every column reference must come from the pasted `DESCRIBE TABLE` output or a confirmed JOIN table schema. -2. **Never assume a JOIN table is the same as an existing one.** If a table has not been verified against an existing implementation, validate every column. +1. **Never assume a column exists.** Every column reference must come from the schema registry or confirmed JOIN table schema. Pre-Analysis proposals are evidence-based (column names + sample values) — not assumed — but they still require user confirmation before being used in generated code. +2. **Never assume a JOIN table is the same as an existing one.** If a table has not been verified against an existing implementation, use heuristics from sample data but treat every column mapping as MEDIUM confidence at best — require explicit user confirmation. 3. **Never suggest activity type names or scores.** These come from the user. 4. **Never write a file without showing it first and receiving explicit user approval.** 5. **When in doubt, ask.** A question costs seconds. A wrong implementation costs hours. From 8c286ec211798470cf18781c5264ddd99984eb16 Mon Sep 17 00:00:00 2001 From: Mouad BANI Date: Thu, 2 Apr 2026 12:20:39 +0100 Subject: [PATCH 3/5] fix: incosistencies between skill and existing patterns Signed-off-by: Mouad BANI --- .claude/skills/scaffold-snowflake-connector/SKILL.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/.claude/skills/scaffold-snowflake-connector/SKILL.md b/.claude/skills/scaffold-snowflake-connector/SKILL.md index c9e1ef0495..a864387cd2 100644 --- a/.claude/skills/scaffold-snowflake-connector/SKILL.md +++ b/.claude/skills/scaffold-snowflake-connector/SKILL.md @@ -295,7 +295,7 @@ For each activity type the user provides, suggest the following **one at a time* | Enum key | SCREAMING_SNAKE_CASE version of the name (e.g., `ENROLLED_CERTIFICATION`) | | String value | The name the user provided (kebab-case) | | Label | Human-readable (e.g., `Enrolled in certification`) | -| Description | One sentence describing the event (look at similar types in `backend/src/c/migrations/` for style) | +| Description | One sentence describing the event (follow the style in `backend/src/database/migrations/V1771497876__addCventActivityTypes.sql` and `V1772556158__addTncActivityTypes.sql`) | | `isCodeContribution` | `false` unless it involves code (check existing platforms — almost always false for non-GitHub sources) | | `isCollaboration` | `false` unless it is a collaborative activity | @@ -412,9 +412,9 @@ Add `OrganizationAttributeSource.{PLATFORM},` to the `ORG_DB_ATTRIBUTE_SOURCE_PR **Touch point 5 — Flyway migration** (new file) -File: `backend/src/database/migrations/V{Date.now()}__add{Platform}ActivityTypes.sql` +File: `backend/src/database/migrations/V{epoch_seconds}__add{Platform}ActivityTypes.sql` -Template (use epoch timestamp for `V` prefix — run `date +%s%3N` in terminal): +Template (use a 10-digit second-epoch for the `V` prefix, matching existing migrations — run `date +%s` in terminal): ```sql INSERT INTO "activityTypes" ("activityType", platform, "isCodeContribution", "isCollaboration", description, "label") VALUES @@ -474,7 +474,7 @@ These are AI-generated from the confirmed column mappings. Apply all rules stric File: `services/apps/snowflake_connectors/src/integrations/{platform}/{source}/buildSourceQuery.ts` **Rules (enforced — do not deviate):** -- Use explicit column names only. Never use `table.*` +- Use explicit column names only. Do not use `table.*` or `table.* EXCLUDE (...)` in new implementations — existing sources (TNC, CVENT) use these patterns but new sources should list columns explicitly to avoid parquet encoding/decoding issues - If any TIMESTAMP_TZ columns exist in the schema, exclude and re-cast them as TIMESTAMP_NTZ (see CVENT pattern) - Follow the CTE structure: 1. `org_accounts` CTE (if org data present) @@ -572,7 +572,7 @@ Fix any errors or warnings and re-run `pnpm run lint` until it reports no compla Before declaring the implementation complete, verify every item: - [ ] All applicable files from the File Inventory written to disk -- [ ] No `table.*` anywhere in any generated query +- [ ] No `table.*` or `table.* EXCLUDE (...)` in any generated query - [ ] No broad `else` in any transformer - [ ] All string comparisons are case-insensitive (`.toLowerCase()` on both sides of comparison) - [ ] `sourceId` confirmed unique in test data From 1c45714704bc894b3119742251e6e29b652d803e Mon Sep 17 00:00:00 2001 From: Mouad BANI Date: Thu, 2 Apr 2026 12:40:17 +0100 Subject: [PATCH 4/5] fix: include down migration creation to follow project pattern Signed-off-by: Mouad BANI --- .claude/skills/scaffold-snowflake-connector/SKILL.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/.claude/skills/scaffold-snowflake-connector/SKILL.md b/.claude/skills/scaffold-snowflake-connector/SKILL.md index a864387cd2..3e263b3fe6 100644 --- a/.claude/skills/scaffold-snowflake-connector/SKILL.md +++ b/.claude/skills/scaffold-snowflake-connector/SKILL.md @@ -410,11 +410,11 @@ Add `OrganizationAttributeSource.{PLATFORM},` to the `ORG_DB_ATTRIBUTE_SOURCE_PR --- -**Touch point 5 — Flyway migration** (new file) +**Touch point 5 — Flyway migration** (two new files) -File: `backend/src/database/migrations/V{epoch_seconds}__add{Platform}ActivityTypes.sql` +Use a 10-digit second-epoch for the timestamp — run `date +%s` in terminal. -Template (use a 10-digit second-epoch for the `V` prefix, matching existing migrations — run `date +%s` in terminal): +**File A:** `backend/src/database/migrations/V{epoch_seconds}__add{Platform}ActivityTypes.sql` ```sql INSERT INTO "activityTypes" ("activityType", platform, "isCodeContribution", "isCollaboration", description, "label") VALUES @@ -423,6 +423,10 @@ INSERT INTO "activityTypes" ("activityType", platform, "isCodeContribution", "is ('{string-value}', '{platform-string}', {false|true}, {false|true}, '{description}', '{label}'); ``` +**File B:** `backend/src/database/migrations/U{epoch_seconds}__add{Platform}ActivityTypes.sql` + +Empty file — same timestamp and name as File A, prefix `U` instead of `V`. Matches the pattern used by existing migrations (e.g., `U1771497876__addCventActivityTypes.sql`). + --- **Touch point 6 — `snowflake_connectors/types.ts`** From a0a6908bbdaa4c85631195c5c5cf0c5c985c8d8a Mon Sep 17 00:00:00 2001 From: Mouad BANI Date: Thu, 2 Apr 2026 14:30:53 +0100 Subject: [PATCH 5/5] chore: allow mcp connection & repo lookup/clone Signed-off-by: Mouad BANI --- .../scaffold-snowflake-connector/SKILL.md | 97 +++++++++++++++++-- 1 file changed, 87 insertions(+), 10 deletions(-) diff --git a/.claude/skills/scaffold-snowflake-connector/SKILL.md b/.claude/skills/scaffold-snowflake-connector/SKILL.md index 3e263b3fe6..adc061a9e2 100644 --- a/.claude/skills/scaffold-snowflake-connector/SKILL.md +++ b/.claude/skills/scaffold-snowflake-connector/SKILL.md @@ -67,6 +67,41 @@ Read the current state of each file before modifying. Never modify without readi --- +## Context Detection + +Run this **once at skill start**, before any other step. + +Check whether the skill is running inside the crowd.dev repository by testing for the presence of the snowflake connectors integration index: + +``` +services/apps/snowflake_connectors/src/integrations/index.ts +``` + +Use `Glob` or `Read` to check for this file relative to the current working directory. + +### Case A — File found (running inside crowd.dev) + +Set `CROWD_DEV_ROOT = "."` (current directory). All file paths in this skill are relative to CWD. Proceed to Phase 1 immediately. + +### Case B — File not found (running from cross-team skills repo or elsewhere) + +Ask the user: +> "This skill needs access to the crowd.dev repository to read and modify files. Do you have a local clone? +> - If yes: provide the absolute path (e.g., `/Users/you/work/crowd.dev`) +> - If no: I'll give you the clone command first." + +If the user provides a path: verify it by checking `{path}/services/apps/snowflake_connectors/src/integrations/index.ts` exists. If confirmed, set `CROWD_DEV_ROOT = {path}`. Proceed to Phase 1. + +If the path doesn't exist or they need to clone: +> "Run: `git clone git@github.com:linuxfoundation/crowd.dev.git` +> Then provide the path to the cloned directory." + +Wait for a valid path before continuing. + +**For all subsequent file operations**, prefix every path in this skill with `CROWD_DEV_ROOT`. Example: `services/libs/types/src/enums/platforms.ts` becomes `{CROWD_DEV_ROOT}/services/libs/types/src/enums/platforms.ts`. + +--- + ## Phase 1: Initial Questions Ask one question at a time. Do not bundle questions. @@ -109,7 +144,28 @@ Do not assume a direct JOIN between tables. The relation type (JOIN, subquery, C ## Phase 2: Schema Collection -Once all table names are known, generate a single query to collect schema information for every table at once. Ask the user to run it in Snowflake UI and either paste the output or provide a path to the exported file (e.g., CSV): +Once all table names are known, collect column schemas using the LFX BI Layer MCP tools. Only fall back to the manual query when BI Layer lookup fails for a specific table. + +### Step 0 — Check BI Layer connectivity + +Use `ToolSearch` with query `"LFX BI Layer get_all_sources"` to check whether the BI Layer tools are available in the current session. + +- **Tools found**: proceed directly to Step 1. +- **Tools not found**: prompt the user to authenticate: + > "The LFX BI Layer MCP is not connected. To enable automatic schema lookup, run `/mcp` and select **'claude.ai LFX BI Layer'** to authenticate. Once done, say 'continue' and I'll proceed. Or say 'skip' to use the manual query approach instead." + + Wait for the user's response. If they authenticate: re-check with `ToolSearch` and proceed to Step 1. If they skip or authentication still fails: proceed directly to Step 2 for all tables. + +### Step 1 — BI Layer lookup (always attempt first) + +1. Call `mcp__claude_ai_LFX_BI_Layer__get_all_sources`. This returns all registered dbt sources. +2. For each user-provided table, search the results (case-insensitive match on the `name` field). +3. For each table **found**: call `mcp__claude_ai_LFX_BI_Layer__get_source_details` using the `uniqueId` from Step 1. Extract the column registry from `catalog.columns`: each entry has `name`, `type`, and `index` (ordinal position). +4. For each table **not found** in dbt sources: inform the user and fall back to Step 2 for that table only. + +### Step 2 — Manual fallback (only for tables not in dbt sources) + +Generate the ACCOUNT_USAGE query for the missing tables and ask the user to run it: ```sql SELECT DISTINCT @@ -122,28 +178,45 @@ SELECT DISTINCT ordinal_position FROM SNOWFLAKE.ACCOUNT_USAGE.COLUMNS WHERE (table_catalog, table_schema, table_name) IN ( - ('DB1', 'SCHEMA1', 'TABLE1'), - ('DB2', 'SCHEMA2', 'TABLE2') - -- one row per confirmed table + ('DB1', 'SCHEMA1', 'TABLE1') + -- one row per missing table ) AND DELETED IS NULL ORDER BY table_name, ordinal_position; ``` -Replace the placeholder values with the actual table names provided by the user. +Ask the user to paste the output or provide a path to an exported file (CSV, JSON, or TSV). If a file path is given, read the file. -Ask the user to either paste the output directly or provide a path to an exported file (CSV, JSON, or TSV). If a file path is given, read the file. +### Column registry -Parse the output to build a column registry per table: +After Step 1 and/or Step 2, build a column registry per table: - Column name (exact casing from schema — this is the reference for all code) - Data type -- Nullable (YES/NO) +- Ordinal position **Store this as the canonical column reference. Every column name used in generated code must appear in this registry. Never assume or invent a column name.** For each JOIN table, check whether any existing transformer in `services/apps/snowflake_connectors/src/integrations/` queries from the same table. If yes, inherit its column mappings; if no, treat every column as unknown and derive it from sample data in the Pre-Analysis step below. -After building the column registry, generate a sample data query using **only explicit column names from the registry** (never `*`) and ask the user to run it: +### Step 3 — Sample data + +Attempt to retrieve sample data via the LFX BI Layer before asking the user to run a manual query. + +#### Option A — query_metrics (attempt first) + +1. Call `mcp__claude_ai_LFX_BI_Layer__list_metrics`. Search the results for metrics whose name or description relates to the platform/source (e.g., `total_committees`, `total_committee_members`, `total_enrollments`). +2. For each relevant metric found, call `mcp__claude_ai_LFX_BI_Layer__get_dimensions` to list available dimensions. +3. Select all categorical dimensions that are likely to correspond to identity, org, or activity fields (email, username, name, status, type, date, project_slug, etc.). +4. Call `mcp__claude_ai_LFX_BI_Layer__query_metrics` with those dimensions and `limit: 20`. +5. Use the returned rows as sample data for Pre-Analysis. + +**Important caveat:** `query_metrics` queries semantic/transformed dbt models, not raw source tables. Column names in the result come from the semantic model and may differ from raw schema column names. Use the data to infer field roles and value patterns; map back to raw column names using the schema registry from Step 1–2. + +If no relevant metrics are found, or the returned dimensions don't cover the fields needed for mapping, fall through to Option B. + +#### Option B — manual paste (fallback) + +Generate a sample data query using **only explicit column names from the registry** (never `*`) and ask the user to run it in Snowflake: ```sql SELECT @@ -219,6 +292,8 @@ After the user responds, record all confirmed mappings. **Skip any Phase 3 sub-s 1. **How existing data sources handle it** — read the relevant transformer(s) from `services/apps/snowflake_connectors/src/integrations/` and show the pattern. If an existing source skips or doesn't implement the feature, say so explicitly (e.g., "TNC doesn't use a logo field — it's left undefined"). 2. **A Snowflake query to resolve it** — if the ambiguity can be answered by inspecting actual data (e.g., checking uniqueness, null rate, value distribution), provide the query so the user can run it instead of guessing. +When raising any ambiguity that can be resolved by inspecting data, first try `mcp__claude_ai_LFX_BI_Layer__query_metrics` with the relevant dimensions and a `where` filter to narrow results. Only ask the user to run a manual Snowflake query if the semantic layer doesn't have dimensions covering the ambiguous columns. + Example format for an ambiguity: > "I see two timestamp columns: `CREATED_AT` (nullable) and `UPDATED_AT` (not nullable). > - Existing sources (TNC, CVENT) all use a non-nullable `updated_ts`-style column for incremental exports. @@ -549,7 +624,9 @@ LIMIT 100; ``` Instruct the user: -> "Please run this query in Snowflake and paste the result (JSON or CSV). I'll walk through each row and verify the transformer logic produces the expected `IActivityData` before we consider this done." +> "Please run this query directly in Snowflake and paste the result (JSON or CSV). I'll walk through each row and verify the transformer logic produces the expected `IActivityData` before we consider this done." + +Note: the LFX BI Layer MCP does not support arbitrary SQL execution — the test query must be run manually in the Snowflake UI. ### Dry-Run Validation