Skip to content

rushikeshxdev/gmail-to-sheets-automation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gmail-to-Sheets Automation System

Author: Rushikesh Randive
Project: Python Automation - Gmail to Google Sheets Integration
Date: January 2026


📋 Table of Contents


🎯 Project Overview

A Python automation system that connects to Gmail and Google Sheets APIs to automatically read unread emails from a Gmail inbox and log them into a Google Sheet. The system uses OAuth 2.0 authentication (user credentials only, no service accounts) and implements robust duplicate prevention.

Key Features

  • ✅ Fetches unread emails from Gmail inbox
  • ✅ Extracts structured data: From, Subject, Date, Content
  • ✅ Appends data to Google Sheets automatically
  • ✅ Marks processed emails as read
  • ✅ Prevents duplicate processing across multiple runs
  • ✅ Handles HTML to plain text conversion
  • ✅ Implements atomic file operations for data consistency

🏗️ Architecture

High-Level System Design

graph TB
    A[main.py - Orchestrator] --> B[gmail_service.py]
    A --> C[sheets_service.py]
    A --> D[email_parser.py]
    A --> E[state_manager.py]
    
    B --> F[Gmail API]
    C --> G[Google Sheets API]
    E --> H[processed_ids.json]
    
    I[config.py] --> A
    J[credentials.json] --> B
    J --> C
    K[token.json] --> B
    K --> C
    
    style A fill:#4CAF50
    style B fill:#2196F3
    style C fill:#2196F3
    style D fill:#2196F3
    style E fill:#2196F3
Loading

Component Responsibilities

Component Responsibility
main.py Orchestrates the complete workflow, coordinates all services
gmail_service.py Gmail API authentication, email fetching, marking as read
sheets_service.py Google Sheets API authentication, data appending
email_parser.py Extracts structured data from email messages
state_manager.py Tracks processed emails, prevents duplicates
oauth_handler.py Handles OAuth 2.0 authentication flow
config.py Centralized configuration management

✨ Features

Core Functionality

  1. OAuth 2.0 Authentication

    • User credential-based authentication (no service accounts)
    • Browser-based consent flow
    • Automatic token refresh
    • Secure token storage
  2. Email Processing

    • Fetches only unread emails from inbox
    • Extracts: Sender, Subject, Date, Content
    • Handles HTML to plain text conversion
    • Processes multipart messages
  3. Google Sheets Integration

    • Appends data as new rows
    • Preserves existing data
    • Automatic header creation
    • Validates spreadsheet access
  4. Duplicate Prevention

    • Tracks processed email IDs in JSON file
    • Skips already-processed emails
    • Persists state across system restarts
    • Atomic file operations prevent corruption
  5. Email Status Management

    • Marks processed emails as read
    • Continues processing even if marking fails
    • Maintains state consistency

🚀 Setup Instructions

Prerequisites

  • Python 3.7 or higher
  • Google Cloud Project with Gmail API and Sheets API enabled
  • OAuth 2.0 credentials (Desktop Application type)

Step 1: Clone Repository

git clone https://github.com/rushikeshxdev/gmail-to-sheets-automation.git
cd gmail-to-sheets-automation

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Configure Google Cloud Project

  1. Go to Google Cloud Console
  2. Create a new project or select existing one
  3. Enable APIs:
    • Gmail API
    • Google Sheets API
  4. Create OAuth 2.0 credentials:
    • Go to "Credentials" → "Create Credentials" → "OAuth client ID"
    • Application type: "Desktop application"
    • Download credentials JSON file
  5. Place credentials file at: credentials/credentials.json

Step 4: Configure OAuth Consent Screen

  1. Go to "OAuth consent screen"
  2. Add your email as a test user
  3. Add required scopes:
    • https://www.googleapis.com/auth/gmail.modify
    • https://www.googleapis.com/auth/spreadsheets

Step 5: Update Configuration

  1. Copy the example config:

    cp config.example.py config.py
  2. Edit config.py and set your Google Sheets ID:

    SPREADSHEET_ID = "your-spreadsheet-id-here"

    (Find this in your Google Sheets URL after /d/)

Step 6: Run the System

python src/main.py

On first run, a browser window will open for OAuth consent. Grant permissions, and the system will start processing emails.


🔐 OAuth Flow Explanation

OAuth 2.0 Authentication Process

The system uses OAuth 2.0 with user credentials (not service accounts) to ensure secure access to Gmail and Google Sheets.

First-Time Authentication Flow

1. User runs the application
   ↓
2. System checks for existing token.json
   ↓
3. No token found → Initiate OAuth flow
   ↓
4. Open browser for user consent
   ↓
5. User signs in with Google account
   ↓
6. User grants permissions (Gmail + Sheets)
   ↓
7. Google returns authorization code
   ↓
8. System exchanges code for access/refresh tokens
   ↓
9. Tokens saved to token.json
   ↓
10. System proceeds with email processing

Subsequent Runs

1. User runs the application
   ↓
2. System loads token.json
   ↓
3. Check if token is valid
   ↓
4. If valid → Use directly
   ↓
5. If expired → Refresh automatically
   ↓
6. System proceeds with email processing

Why OAuth 2.0 User Credentials?

  • Security: No passwords stored, only temporary tokens
  • User Control: Users can revoke access anytime
  • Compliance: Follows Google's security best practices
  • Transparency: Users see exactly what permissions are granted

💾 State Persistence Method

How Duplicate Prevention Works

The system uses a JSON file (processed_ids.json) to track which emails have been processed. This ensures that running the script multiple times doesn't create duplicate entries in the spreadsheet.

State File Structure

{
  "processed_ids": [
    "19bb836afcb121af",
    "19bb8258747b6936",
    "19bb7ea3e22b1d84"
  ],
  "count": 3
}

State Management Workflow

Email Processing Flow:
┌─────────────────────────────────────┐
│ 1. Load processed_ids.json          │
│    (or create if doesn't exist)     │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 2. Fetch unread emails from Gmail   │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 3. For each email:                  │
│    - Check if ID in processed_ids   │
│    - If YES → Skip                  │
│    - If NO → Process                │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 4. Parse email data                 │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 5. Append to Google Sheets          │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 6. Add email ID to processed_ids    │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 7. Save state immediately           │
│    (atomic write operation)         │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 8. Mark email as read in Gmail      │
└─────────────────────────────────────┘

Why This Approach?

  1. Simplicity: JSON file is easy to read and debug
  2. Reliability: Atomic write operations prevent corruption
  3. Persistence: State survives system restarts
  4. Performance: In-memory set for fast lookups
  5. Transparency: Users can inspect the state file

Alternative Approaches Considered

Approach Pros Cons Why Not Used
Database Scalable, ACID Requires setup, overkill Too complex for this use case
Gmail Labels Native to Gmail Requires API calls Slower, less reliable
Timestamp Simple Not reliable Emails can arrive out of order
JSON File Simple, reliable Not scalable to millions CHOSEN - Perfect for this use case

🔄 Duplicate Prevention Logic

What Happens When Script Runs Twice?

First Run

Inbox: [Email A, Email B, Email C] (all unread)
processed_ids.json: []

Processing:
- Email A → Process → Add to Sheet → Mark as read → Save ID
- Email B → Process → Add to Sheet → Mark as read → Save ID
- Email C → Process → Add to Sheet → Mark as read → Save ID

Result:
- Sheet rows: 3 new rows
- processed_ids.json: [A, B, C]
- Gmail: All marked as read

Second Run (Immediately After)

Inbox: [] (no unread emails)
processed_ids.json: [A, B, C]

Processing:
- No unread emails found
- Nothing to process

Result:
- Sheet rows: No changes
- processed_ids.json: [A, B, C] (unchanged)
- Gmail: No changes

Second Run (New Email Arrives)

Inbox: [Email D] (new unread email)
processed_ids.json: [A, B, C]

Processing:
- Email D → Check ID → Not in processed_ids → Process
- Email D → Add to Sheet → Mark as read → Save ID

Result:
- Sheet rows: 1 new row (only Email D)
- processed_ids.json: [A, B, C, D]
- Gmail: Email D marked as read

Edge Case Handling

  1. Script Crashes Mid-Processing

    • State is saved after each email
    • Only unprocessed emails will be retried
    • No duplicates created
  2. Email Cannot Be Marked as Read

    • Email is still added to processed_ids
    • Prevents reprocessing even if marking fails
    • Error is logged for debugging
  3. Spreadsheet Append Fails

    • Email is NOT added to processed_ids
    • Will be retried on next run
    • Ensures data consistency

📊 Execution Results

Test Execution Summary

Test Environment:

  • Gmail account with real incoming emails
  • Google Sheets for data logging
  • Multiple test runs to verify duplicate prevention

Results

============================================================
✅ Processing Complete!
============================================================
📊 Emails fetched: 300
✅ Emails processed: 300
⏭️  Emails skipped (duplicates): 0
❌ Emails failed: 0
📖 Emails marked as read: 300
============================================================

Performance Metrics

  • Total Processing Time: ~9 minutes
  • Average Time per Email: ~1.8 seconds
  • Success Rate: 100%
  • API Calls: ~900 (Gmail: 600, Sheets: 300)
  • Data Written: 300 rows to Google Sheets
  • Duplicates Created: 0

Sample Data Structure

From Subject Date Content
sender@example.com Meeting Reminder 2026-01-13 16:35:48 Email body content...
notifications@service.com Account Update 2026-01-13 14:20:15 Your account has been...
team@company.com Project Status 2026-01-13 10:45:30 Here's the latest update...

🚧 Challenges Faced

Challenge 1: OAuth Access Denied Error

Problem:
Initial OAuth authentication failed with "Error 403: access_denied". The Google Cloud project was in testing mode and could only be accessed by approved test users.

Solution:

  • Added my email as a test user in Google Cloud Console
  • Configured OAuth consent screen with proper scopes
  • Manually added both required scopes:
    • https://www.googleapis.com/auth/gmail.modify
    • https://www.googleapis.com/auth/spreadsheets

Learning:
Google Cloud projects in testing mode require explicit test user approval. This is a security feature to prevent unauthorized access during development.


Challenge 2: Port Already in Use

Problem:
OAuth flow failed with "WinError 10048: Only one usage of each socket address is normally permitted" because port 8080 was already in use by another application.

Solution:

  • Modified OAuth handler to use port 0 (automatic port selection)
  • System now finds an available port automatically
  • Changed from port=8080 to port=0 in run_local_server()

Code Fix:

# Before
self.credentials = flow.run_local_server(port=8080, ...)

# After
self.credentials = flow.run_local_server(port=0, ...)  # Auto-select port

Learning:
Hardcoded ports can cause conflicts. Using port 0 allows the OS to assign an available port automatically, making the application more robust.


Challenge 3: Spreadsheet Permission Denied

Problem:
Initial spreadsheet access failed with "403 Forbidden: PERMISSION_DENIED" error even though OAuth authentication was successful.

Solution:

  • Shared the Google Sheet with the authenticated email account
  • Set permission level to "Editor"
  • Updated spreadsheet ID in config.py to the correct shared sheet

Learning:
OAuth authentication grants API access, but the spreadsheet itself must be explicitly shared with the user account. API permissions and document permissions are separate layers of security.


Challenge 4: HTML Email Content Extraction

Problem:
Many emails contain HTML content with tags, scripts, and formatting that shouldn't appear in the spreadsheet. Direct extraction resulted in messy, unreadable content.

Solution:

  • Implemented HTML to plain text conversion using BeautifulSoup
  • Removed script and style tags before extraction
  • Cleaned up excessive whitespace
  • Handled multipart messages correctly

Code Implementation:

def _html_to_plain_text(self, html_content: str) -> str:
    soup = BeautifulSoup(html_content, 'lxml')
    
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Get text and clean whitespace
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    
    return text

Learning:
Email content can be in various formats (plain text, HTML, multipart). Robust parsing requires handling all formats and converting to a consistent output format.


Challenge 5: State File Corruption Risk

Problem:
If the system crashes while writing to processed_ids.json, the file could become corrupted, leading to data loss or duplicate processing on the next run.

Solution:

  • Implemented atomic write operations
  • Write to temporary file first
  • Rename to actual file (atomic operation on most systems)
  • Clean up temporary file on errors

Code Implementation:

def save_state(self) -> bool:
    # Write to temporary file first
    temp_file = f"{self.state_file}.tmp"
    with open(temp_file, 'w') as f:
        json.dump(data, f, indent=2)
    
    # Atomic rename
    if os.path.exists(self.state_file):
        os.remove(self.state_file)
    os.rename(temp_file, self.state_file)

Learning:
File I/O operations should be atomic to prevent corruption. Writing to a temporary file and renaming is a common pattern for safe file updates in production systems.


⚠️ Limitations

Current Limitations

  1. Scalability

    • JSON file approach works well for thousands of emails
    • May become slow with millions of emails
    • Solution: Migrate to database for large-scale deployments
  2. Error Recovery

    • If Google Sheets API fails, emails are not retried automatically
    • Manual intervention required to reprocess failed emails
    • Solution: Implement retry queue with exponential backoff
  3. Content Length

    • Google Sheets has a 50,000 character limit per cell
    • Very long emails are truncated
    • Solution: Store full content in separate storage if needed
  4. Rate Limiting

    • Gmail API: 250 quota units per user per second
    • Sheets API: 100 requests per 100 seconds per user
    • Current implementation has basic retry logic
    • Solution: Implement more sophisticated rate limiting
  5. Concurrent Execution

    • Running multiple instances simultaneously can cause race conditions
    • State file may not be properly synchronized
    • Solution: Implement file locking or use database
  6. Email Attachments

    • Current implementation doesn't handle attachments
    • Only text content is extracted
    • Solution: Add attachment download and storage feature

📁 Project Structure

gmail-to-sheets-automation/
│
├── src/
│   ├── __init__.py
│   ├── main.py                 # Main orchestrator
│   ├── gmail_service.py        # Gmail API service
│   ├── sheets_service.py       # Google Sheets API service
│   ├── email_parser.py         # Email data extraction
│   ├── state_manager.py        # Duplicate prevention
│   └── oauth_handler.py        # OAuth 2.0 authentication
│
├── credentials/
│   ├── .gitkeep               # Keeps folder in Git
│   ├── credentials.json       # OAuth client credentials (gitignored)
│   └── token.json             # Access/refresh tokens (gitignored)
│
├── proof/                      # Execution proof screenshots
│   ├── Project-Structure.png
│   ├── Spreadsheet1.png
│   ├── Spreadsheet2.png
│   ├── Terminal-log.png
│   ├── Processed_ids.png
│   └── Gmail-Inbox.png
│
├── docs/                       # Additional documentation
│   ├── GITHUB_SETUP.md
│   ├── TROUBLESHOOTING.md
│   └── ... (other guides)
│
├── config.py                   # Configuration (gitignored)
├── config.example.py           # Configuration template
├── requirements.txt            # Python dependencies
├── .gitignore                 # Git ignore rules
├── README.md                  # This file
└── processed_ids.json         # State persistence (gitignored)

🔒 Security Considerations

  1. Credentials Protection

    • All sensitive files are git-ignored
    • Tokens stored with secure file permissions
    • No credentials logged or displayed
  2. OAuth Security

    • Uses HTTPS for all API communications
    • Tokens automatically refreshed
    • User can revoke access anytime
  3. Data Privacy

    • Email content processed locally
    • No data sent to third parties
    • State file contains only email IDs (not content)

📝 Usage Examples

Basic Usage

# First run (requires OAuth consent)
python src/main.py

# Subsequent runs (uses saved tokens)
python src/main.py

Clear State (Reprocess All Emails)

from src.state_manager import StateManager

state = StateManager()
state.clear_state()
print("State cleared - all emails will be reprocessed")

🎓 Conclusion

This Gmail-to-Sheets automation system successfully demonstrates:

  • ✅ OAuth 2.0 authentication with user credentials
  • ✅ Gmail API integration for email fetching
  • ✅ Google Sheets API integration for data logging
  • ✅ Robust duplicate prevention mechanism
  • ✅ Error handling and state consistency
  • ✅ Clean, modular, maintainable code architecture

The system has been tested with 300 real emails and achieved a 100% success rate with zero duplicates and zero failures.


📞 Contact

Author: Rushikesh Randive
GitHub: @rushikeshxdev
Repository: gmail-to-sheets-automation


📄 License

This project is created for educational purposes as part of an internship assignment.

About

Python automation system that reads unread emails from Gmail and logs them to Google Sheets using OAuth 2.0

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages