colafly/phonebook

Google Drive Cleanup Script

A TypeScript script that uses AI (Claude via the Claude Agent SDK, or Gemini) to analyze Google Drive folders, extract company information from the documents in them, and write the results to a CSV file.

Features

  • Fetches all subfolders from a parent Google Drive folder
  • Reads documents (Google Docs, text files) within each subfolder
  • Uses Claude Agent SDK with enhanced tool access to extract company information:
    • Canonical name
    • Domain name
    • English name
    • Chinese name
    • Japanese name
    • Short summary (1-2 sentences)
  • With Claude, can also search the web for additional company information
  • With Claude, can also fetch company websites for verification
  • Outputs results to CSV with Google Drive links

Setup

1. Install Dependencies

npm install

2. Set up Google Cloud Project

  1. Go to Google Cloud Console
  2. Create a new project or select an existing one
  3. Enable the Google Drive API:
    • Go to "APIs & Services" > "Library"
    • Search for "Google Drive API"
    • Click "Enable"

3. Create OAuth2 Credentials

  1. Go to "APIs & Services" > "Credentials"
  2. Click "Create Credentials" > "OAuth client ID"
  3. Choose "Desktop app" as the application type
  4. Download the credentials JSON file
  5. Copy the client_id and client_secret from the JSON

4. Choose Your AI Provider

The script supports two AI providers for extraction:

Option A: Claude (Anthropic)

  1. Go to Anthropic Console
  2. Create an account or sign in
  3. Generate an API key
  4. Set AI_PROVIDER=claude in your .env file

Option B: Gemini (Google)

  1. Go to Google AI Studio
  2. Create a Google account or sign in with an existing one
  3. Generate an API key
  4. Set AI_PROVIDER=gemini in your .env file

5. Configure Environment Variables

cp .env.example .env

Edit .env and fill in your credentials:

GOOGLE_CLIENT_ID=your_client_id_here
GOOGLE_CLIENT_SECRET=your_client_secret_here
GOOGLE_REDIRECT_URI=http://localhost:3000/oauth2callback
PARENT_FOLDER_URL=https://drive.google.com/drive/folders/YOUR_FOLDER_ID

# AI Provider: "claude" or "gemini"
AI_PROVIDER=claude

# Claude API Key (if using Claude)
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Gemini API Key (if using Gemini)
GEMINI_API_KEY=your_gemini_api_key_here
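
As a rough illustration of how these variables fit together, the sketch below validates them before any work starts. The `loadConfig` helper and the `Config` shape are hypothetical and not taken from the script; the error wording mirrors the messages listed under Troubleshooting.

```typescript
// Hypothetical helper: checks the variables from .env before the run starts.
type AiProvider = "claude" | "gemini";

interface Config {
  googleClientId: string;
  googleClientSecret: string;
  provider: AiProvider;
  apiKey: string;
}

function loadConfig(env: Record<string, string | undefined>): Config {
  // Returns the value of a variable or fails with a "<NAME> is required" error.
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`${name} is required`);
    return value;
  };

  const provider = required("AI_PROVIDER");
  if (provider !== "claude" && provider !== "gemini") {
    throw new Error(`AI_PROVIDER must be "claude" or "gemini", got "${provider}"`);
  }

  // Only the key for the selected provider is mandatory.
  const apiKey = required(provider === "claude" ? "ANTHROPIC_API_KEY" : "GEMINI_API_KEY");

  return {
    googleClientId: required("GOOGLE_CLIENT_ID"),
    googleClientSecret: required("GOOGLE_CLIENT_SECRET"),
    provider,
    apiKey,
  };
}
```

In a Node program this could be called as `loadConfig(process.env)` once the `.env` file has been loaded.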

6. Authenticate

Run the authentication setup:

npm run auth

This will:

  1. Open a URL in your browser
  2. Ask you to authorize the application
  3. Save the access token to token.json

Usage

Basic Usage

npm start

Or specify the folder URL and output path:

ts-node google-drive-cleanup.ts "https://drive.google.com/drive/folders/YOUR_FOLDER_ID" ./output.csv
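
The folder URL has to be reduced to a folder ID before the Drive API can be queried. A minimal sketch of that step follows; `extractFolderId` is a hypothetical name, and accepting a bare ID as input is an assumption rather than documented behavior.

```typescript
// Hypothetical helper: pulls the folder ID out of a Google Drive folder URL.
function extractFolderId(input: string): string {
  // Matches .../folders/<ID>, ignoring any query string such as ?usp=sharing.
  const match = input.match(/\/folders\/([A-Za-z0-9_-]+)/);
  if (match) return match[1];

  // Assumption: a bare folder ID is also accepted.
  if (/^[A-Za-z0-9_-]+$/.test(input)) return input;

  throw new Error(`Could not find a folder ID in "${input}"`);
}
```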

Output

The script generates a CSV file with the following columns:

  • Folder Name: Original folder name in Google Drive
  • Canonical Name: Extracted official company name
  • Domain Name: Company website domain
  • English Name: Company name in English
  • Chinese Name: Company name in Chinese (if found)
  • Japanese Name: Company name in Japanese (if found)
  • Summary: AI-generated 1-2 sentence summary of the company
  • Google Drive Link: Direct link to the folder
  • Folder ID: Google Drive folder ID
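
Summaries and company names can contain commas or quotes, so each field needs CSV escaping before it is written. The sketch below shows one way to build a row in the column order listed above; the `CompanyRow` interface and function names are illustrative assumptions, not the script's actual types.

```typescript
// Hypothetical row shape matching the CSV columns listed above.
interface CompanyRow {
  folderName: string;
  canonicalName: string;
  domainName: string;
  englishName: string;
  chineseName: string;
  japaneseName: string;
  summary: string;
  folderId: string;
}

// Quotes a field only when it contains a comma, quote, or line break.
function csvEscape(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Builds one CSV line; the Drive link is derived from the folder ID.
function toCsvLine(row: CompanyRow): string {
  const link = `https://drive.google.com/drive/folders/${row.folderId}`;
  return [
    row.folderName,
    row.canonicalName,
    row.domainName,
    row.englishName,
    row.chineseName,
    row.japaneseName,
    row.summary,
    link,
    row.folderId,
  ].map(csvEscape).join(",");
}
```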

How It Works

This script processes company folders in Google Drive to extract and organize business information into a CSV file.

What it does:

  • Scans all subfolders within a parent Google Drive folder
  • Filters for files containing "call memo" in the filename
  • Extracts text from various formats (Google Docs, PDFs, Word, Excel)
  • Uses AI (Claude or Gemini) to analyze the documents and extract structured company data:
    • Canonical company name
    • Domain name
    • English, Chinese, and Japanese names
    • Business summary
  • Outputs results to a CSV file with links back to the original folders
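
The "call memo" filename filter described above could be sketched as follows. The `DriveFile` shape is a simplified stand-in for the Drive API's file metadata, and matching case-insensitively is an assumption.

```typescript
// Simplified, hypothetical view of a Drive file's metadata.
interface DriveFile {
  id: string;
  name: string;
  mimeType: string;
}

// True when the filename contains "call memo" (assumed case-insensitive).
function isCallMemo(file: DriveFile): boolean {
  return /call memo/i.test(file.name);
}

// Keeps only the call memo files from one subfolder's listing.
function selectCallMemos(files: DriveFile[]): DriveFile[] {
  return files.filter(isCallMemo);
}
```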

Supported File Types

  • Google Docs (.gdoc)
  • Plain text files (.txt)
  • CSV files (.csv)
  • Limited support for PDFs (requires additional setup)

Enhanced AI Capabilities

The script supports two AI providers with different capabilities:

Claude (via Agent SDK)

When using AI_PROVIDER=claude, the following tools are enabled:

  • Read: Deep analysis of document content (PDFs, DOCX, etc.)
  • WebSearch: Search the web for additional company information
  • WebFetch: Fetch and analyze company websites for verification

Claude can:

  • Read and analyze PDF, Word, Excel files directly
  • Verify company domains by checking their websites
  • Look up missing information (e.g., find a company's English name if only Chinese is in documents)
  • Cross-reference information from multiple sources
  • Provide more accurate and complete company profiles

Gemini (Google AI)

When using AI_PROVIDER=gemini, you get:

  • Local Text Extraction: Extracts text from PDF, DOCX, Excel files locally on your machine
  • Fast processing with Gemini 2.0 Flash
  • Cost-effective API pricing
  • Good multilingual support (English, Chinese, Japanese)
  • Strong performance on structured data extraction
  • Privacy for source files: the original files stay on your machine, and only the extracted text is sent to Gemini

How it works: Files are downloaded from Google Drive, text is extracted locally using specialized libraries (pdf-parse, mammoth, xlsx), and only the extracted text is sent to Gemini for analysis.
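
To make the local extraction step more concrete, the sketch below maps a file's MIME type to the kind of extractor it would need. The `ExtractorKind` names and the `extractorFor` function are hypothetical; only the library names in the comments come from the description above, and how the script actually dispatches is not documented here.

```typescript
// Hypothetical labels for the local extraction paths.
type ExtractorKind = "google-doc" | "pdf" | "docx" | "xlsx" | "plain-text";

// Chooses an extraction path from a Drive file's MIME type.
// Returns undefined for types this sketch does not cover.
function extractorFor(mimeType: string): ExtractorKind | undefined {
  switch (mimeType) {
    case "application/vnd.google-apps.document":
      return "google-doc"; // assumption: exported to text through the Drive API
    case "application/pdf":
      return "pdf"; // handled by pdf-parse, per the description above
    case "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
      return "docx"; // handled by mammoth
    case "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
      return "xlsx"; // handled by xlsx
    case "text/plain":
    case "text/csv":
      return "plain-text"; // already text, no library needed
    default:
      return undefined;
  }
}
```

Files with no matching extractor can then be skipped and logged instead of failing the whole folder.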

Switching AI Providers

To switch between Claude and Gemini:

  1. Open your .env file
  2. Change AI_PROVIDER=claude to AI_PROVIDER=gemini (or vice versa)
  3. Ensure the corresponding API key is set (ANTHROPIC_API_KEY or GEMINI_API_KEY)
  4. Run the script

Which provider should I use?

  • Use Claude if:

    • You want web search and website verification capabilities
    • You need the highest quality extraction from very complex documents
    • You want Claude to intelligently search for missing company information
  • Use Gemini if:

    • You want faster processing (Gemini 2.0 Flash is very fast)
    • You want lower API costs
    • You need good multilingual support
    • You want a simpler, more cost-effective solution

Both providers:

  • Extract text locally from PDFs, DOCX, and Excel files
  • Do not upload the original files to external services; the extracted content is sent to the chosen AI provider for analysis
  • Support multilingual company information extraction

Resume After Failures

The script automatically saves progress after processing each folder:

  • Progress is saved to progress.json
  • Results are appended to CSV immediately after each folder
  • Full logs are written to processing.log

If the script crashes or is interrupted:

  1. Simply run npm start again
  2. It will skip already processed folders automatically
  3. Check processing.log to see what happened
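
The resume behavior can be pictured with the sketch below, which records processed folder IDs and filters them out on the next run. The layout of progress.json shown here (a single `processedFolderIds` array) is an assumption, as are the function names; the script's real file format may differ.

```typescript
import * as fs from "fs";

// Assumed layout of progress.json; the real file may look different.
interface Progress {
  processedFolderIds: string[];
}

// Loads saved progress, or starts empty when no file exists yet.
function loadProgress(path: string): Progress {
  if (!fs.existsSync(path)) return { processedFolderIds: [] };
  return JSON.parse(fs.readFileSync(path, "utf8")) as Progress;
}

// Records one finished folder and writes the file immediately,
// so a crash loses at most the folder currently being processed.
function markProcessed(path: string, progress: Progress, folderId: string): void {
  if (progress.processedFolderIds.indexOf(folderId) === -1) {
    progress.processedFolderIds.push(folderId);
  }
  fs.writeFileSync(path, JSON.stringify(progress, null, 2));
}

// Returns the folders that still need processing, in their original order.
function pendingFolders(allFolderIds: string[], progress: Progress): string[] {
  return allFolderIds.filter((id) => progress.processedFolderIds.indexOf(id) === -1);
}
```

Writing the file after every folder trades a little extra I/O for the ability to restart at any point.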

To start completely fresh:

rm progress.json company_info.csv processing.log
npm start

Troubleshooting

"Invalid credentials" error

  • Make sure your .env file has the correct GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET
  • Delete token.json and run npm run auth again

"Insufficient permissions" error

  • Make sure the Google Drive API is enabled in your Google Cloud project
  • Check that you've authorized the correct Google account

API key errors

  • "ANTHROPIC_API_KEY is required": Make sure you've set ANTHROPIC_API_KEY in your .env file when using AI_PROVIDER=claude
  • "GEMINI_API_KEY is required": Make sure you've set GEMINI_API_KEY in your .env file when using AI_PROVIDER=gemini

No data extracted

  • Check that your documents contain company information
  • Verify that files are readable (Google Docs, text files)
  • Check the console output for specific error messages

Development

Build

npm run build

Run with ts-node

ts-node google-drive-cleanup.ts

License

MIT
