Gmail-to-Sheets Automation System

Author: Rushikesh Randive
Project: Python Automation - Gmail to Google Sheets Integration
Date: January 2026

📋 Table of Contents

Project Overview
Architecture
Features
Setup Instructions
OAuth Flow Explanation
State Persistence Method
Duplicate Prevention Logic
Execution Results
Challenges Faced
Limitations

🎯 Project Overview

A Python automation system that connects to Gmail and Google Sheets APIs to automatically read unread emails from a Gmail inbox and log them into a Google Sheet. The system uses OAuth 2.0 authentication (user credentials only, no service accounts) and implements robust duplicate prevention.

Key Features

✅ Fetches unread emails from Gmail inbox
✅ Extracts structured data: From, Subject, Date, Content
✅ Appends data to Google Sheets automatically
✅ Marks processed emails as read
✅ Prevents duplicate processing across multiple runs
✅ Handles HTML to plain text conversion
✅ Implements atomic file operations for data consistency

🏗️ Architecture

High-Level System Design

graph TB
    A[main.py - Orchestrator] --> B[gmail_service.py]
    A --> C[sheets_service.py]
    A --> D[email_parser.py]
    A --> E[state_manager.py]
    
    B --> F[Gmail API]
    C --> G[Google Sheets API]
    E --> H[processed_ids.json]
    
    I[config.py] --> A
    J[credentials.json] --> B
    J --> C
    K[token.json] --> B
    K --> C
    
    style A fill:#4CAF50
    style B fill:#2196F3
    style C fill:#2196F3
    style D fill:#2196F3
    style E fill:#2196F3

Component Responsibilities

Component	Responsibility
main.py	Orchestrates the complete workflow, coordinates all services
gmail_service.py	Gmail API authentication, email fetching, marking as read
sheets_service.py	Google Sheets API authentication, data appending
email_parser.py	Extracts structured data from email messages
state_manager.py	Tracks processed emails, prevents duplicates
oauth_handler.py	Handles OAuth 2.0 authentication flow
config.py	Centralized configuration management

✨ Features

Core Functionality

OAuth 2.0 Authentication
- User credential-based authentication (no service accounts)
- Browser-based consent flow
- Automatic token refresh
- Secure token storage
Email Processing
- Fetches only unread emails from inbox
- Extracts: Sender, Subject, Date, Content
- Handles HTML to plain text conversion
- Processes multipart messages
Google Sheets Integration
- Appends data as new rows
- Preserves existing data
- Automatic header creation
- Validates spreadsheet access
Duplicate Prevention
- Tracks processed email IDs in JSON file
- Skips already-processed emails
- Persists state across system restarts
- Atomic file operations prevent corruption
Email Status Management
- Marks processed emails as read
- Continues processing even if marking fails
- Maintains state consistency

🚀 Setup Instructions

Prerequisites

Python 3.7 or higher
Google Cloud Project with Gmail API and Sheets API enabled
OAuth 2.0 credentials (Desktop Application type)

Step 1: Clone Repository

git clone https://github.com/rushikeshxdev/gmail-to-sheets-automation.git
cd gmail-to-sheets-automation

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Configure Google Cloud Project

Go to Google Cloud Console
Create a new project or select existing one
Enable APIs:
- Gmail API
- Google Sheets API
Create OAuth 2.0 credentials:
- Go to "Credentials" → "Create Credentials" → "OAuth client ID"
- Application type: "Desktop application"
- Download credentials JSON file
Place credentials file at: credentials/credentials.json

Step 4: Configure OAuth Consent Screen

Go to "OAuth consent screen"
Add your email as a test user
Add required scopes:
- https://www.googleapis.com/auth/gmail.modify
- https://www.googleapis.com/auth/spreadsheets

Step 5: Update Configuration

Copy the example config:
```
cp config.example.py config.py
```
Edit config.py and set your Google Sheets ID:
```
SPREADSHEET_ID = "your-spreadsheet-id-here"
```
(Find this in your Google Sheets URL after /d/)

Step 6: Run the System

python src/main.py

On first run, a browser window will open for OAuth consent. Grant permissions, and the system will start processing emails.

🔐 OAuth Flow Explanation

OAuth 2.0 Authentication Process

The system uses OAuth 2.0 with user credentials (not service accounts) to ensure secure access to Gmail and Google Sheets.

First-Time Authentication Flow

1. User runs the application
   ↓
2. System checks for existing token.json
   ↓
3. No token found → Initiate OAuth flow
   ↓
4. Open browser for user consent
   ↓
5. User signs in with Google account
   ↓
6. User grants permissions (Gmail + Sheets)
   ↓
7. Google returns authorization code
   ↓
8. System exchanges code for access/refresh tokens
   ↓
9. Tokens saved to token.json
   ↓
10. System proceeds with email processing

Subsequent Runs

1. User runs the application
   ↓
2. System loads token.json
   ↓
3. Check if token is valid
   ↓
4. If valid → Use directly
   ↓
5. If expired → Refresh automatically
   ↓
6. System proceeds with email processing

Why OAuth 2.0 User Credentials?

Security: No passwords stored, only temporary tokens
User Control: Users can revoke access anytime
Compliance: Follows Google's security best practices
Transparency: Users see exactly what permissions are granted

💾 State Persistence Method

How Duplicate Prevention Works

The system uses a JSON file (processed_ids.json) to track which emails have been processed. This ensures that running the script multiple times doesn't create duplicate entries in the spreadsheet.

State File Structure

{
  "processed_ids": [
    "19bb836afcb121af",
    "19bb8258747b6936",
    "19bb7ea3e22b1d84"
  ],
  "count": 3
}

State Management Workflow

Email Processing Flow:
┌─────────────────────────────────────┐
│ 1. Load processed_ids.json          │
│    (or create if doesn't exist)     │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 2. Fetch unread emails from Gmail   │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 3. For each email:                  │
│    - Check if ID in processed_ids   │
│    - If YES → Skip                  │
│    - If NO → Process                │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 4. Parse email data                 │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 5. Append to Google Sheets          │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 6. Add email ID to processed_ids    │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 7. Save state immediately           │
│    (atomic write operation)         │
└──────────────┬──────────────────────┘
               ↓
┌─────────────────────────────────────┐
│ 8. Mark email as read in Gmail      │
└─────────────────────────────────────┘

Why This Approach?

Simplicity: JSON file is easy to read and debug
Reliability: Atomic write operations prevent corruption
Persistence: State survives system restarts
Performance: In-memory set for fast lookups
Transparency: Users can inspect the state file

Alternative Approaches Considered

Approach	Pros	Cons	Why Not Used
Database	Scalable, ACID	Requires setup, overkill	Too complex for this use case
Gmail Labels	Native to Gmail	Requires API calls	Slower, less reliable
Timestamp	Simple	Not reliable	Emails can arrive out of order
JSON File	Simple, reliable	Not scalable to millions	CHOSEN - Perfect for this use case

🔄 Duplicate Prevention Logic

What Happens When Script Runs Twice?

First Run

Inbox: [Email A, Email B, Email C] (all unread)
processed_ids.json: []

Processing:
- Email A → Process → Add to Sheet → Mark as read → Save ID
- Email B → Process → Add to Sheet → Mark as read → Save ID
- Email C → Process → Add to Sheet → Mark as read → Save ID

Result:
- Sheet rows: 3 new rows
- processed_ids.json: [A, B, C]
- Gmail: All marked as read

Second Run (Immediately After)

Inbox: [] (no unread emails)
processed_ids.json: [A, B, C]

Processing:
- No unread emails found
- Nothing to process

Result:
- Sheet rows: No changes
- processed_ids.json: [A, B, C] (unchanged)
- Gmail: No changes

Second Run (New Email Arrives)

Inbox: [Email D] (new unread email)
processed_ids.json: [A, B, C]

Processing:
- Email D → Check ID → Not in processed_ids → Process
- Email D → Add to Sheet → Mark as read → Save ID

Result:
- Sheet rows: 1 new row (only Email D)
- processed_ids.json: [A, B, C, D]
- Gmail: Email D marked as read

Edge Case Handling

Script Crashes Mid-Processing
- State is saved after each email
- Only unprocessed emails will be retried
- No duplicates created
Email Cannot Be Marked as Read
- Email is still added to processed_ids
- Prevents reprocessing even if marking fails
- Error is logged for debugging
Spreadsheet Append Fails
- Email is NOT added to processed_ids
- Will be retried on next run
- Ensures data consistency

📊 Execution Results

Test Execution Summary

Test Environment:

Gmail account with real incoming emails
Google Sheets for data logging
Multiple test runs to verify duplicate prevention

Results

============================================================
✅ Processing Complete!
============================================================
📊 Emails fetched: 300
✅ Emails processed: 300
⏭️  Emails skipped (duplicates): 0
❌ Emails failed: 0
📖 Emails marked as read: 300
============================================================

Performance Metrics

Total Processing Time: ~9 minutes
Average Time per Email: ~1.8 seconds
Success Rate: 100%
API Calls: ~900 (Gmail: 600, Sheets: 300)
Data Written: 300 rows to Google Sheets
Duplicates Created: 0

Sample Data Structure

From	Subject	Date	Content
sender@example.com	Meeting Reminder	2026-01-13 16:35:48	Email body content...
notifications@service.com	Account Update	2026-01-13 14:20:15	Your account has been...
team@company.com	Project Status	2026-01-13 10:45:30	Here's the latest update...

🚧 Challenges Faced

Challenge 1: OAuth Access Denied Error

Problem:
Initial OAuth authentication failed with "Error 403: access_denied". The Google Cloud project was in testing mode and could only be accessed by approved test users.

Solution:

Added my email as a test user in Google Cloud Console
Configured OAuth consent screen with proper scopes
Manually added both required scopes:
- https://www.googleapis.com/auth/gmail.modify
- https://www.googleapis.com/auth/spreadsheets

Learning:
Google Cloud projects in testing mode require explicit test user approval. This is a security feature to prevent unauthorized access during development.

Challenge 2: Port Already in Use

Problem:
OAuth flow failed with "WinError 10048: Only one usage of each socket address is normally permitted" because port 8080 was already in use by another application.

Solution:

Modified OAuth handler to use port 0 (automatic port selection)
System now finds an available port automatically
Changed from port=8080 to port=0 in run_local_server()

Code Fix:

# Before
self.credentials = flow.run_local_server(port=8080, ...)

# After
self.credentials = flow.run_local_server(port=0, ...)  # Auto-select port

Learning:
Hardcoded ports can cause conflicts. Using port 0 allows the OS to assign an available port automatically, making the application more robust.

Challenge 3: Spreadsheet Permission Denied

Problem:
Initial spreadsheet access failed with "403 Forbidden: PERMISSION_DENIED" error even though OAuth authentication was successful.

Solution:

Shared the Google Sheet with the authenticated email account
Set permission level to "Editor"
Updated spreadsheet ID in config.py to the correct shared sheet

Learning:
OAuth authentication grants API access, but the spreadsheet itself must be explicitly shared with the user account. API permissions and document permissions are separate layers of security.

Challenge 4: HTML Email Content Extraction

Problem:
Many emails contain HTML content with tags, scripts, and formatting that shouldn't appear in the spreadsheet. Direct extraction resulted in messy, unreadable content.

Solution:

Implemented HTML to plain text conversion using BeautifulSoup
Removed script and style tags before extraction
Cleaned up excessive whitespace
Handled multipart messages correctly

Code Implementation:

def _html_to_plain_text(self, html_content: str) -> str:
    soup = BeautifulSoup(html_content, 'lxml')
    
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Get text and clean whitespace
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    
    return text

Learning:
Email content can be in various formats (plain text, HTML, multipart). Robust parsing requires handling all formats and converting to a consistent output format.

Challenge 5: State File Corruption Risk

Problem:
If the system crashes while writing to processed_ids.json, the file could become corrupted, leading to data loss or duplicate processing on the next run.

Solution:

Implemented atomic write operations
Write to temporary file first
Rename to actual file (atomic operation on most systems)
Clean up temporary file on errors

Code Implementation:

def save_state(self) -> bool:
    # Write to temporary file first
    temp_file = f"{self.state_file}.tmp"
    with open(temp_file, 'w') as f:
        json.dump(data, f, indent=2)
    
    # Atomic rename
    if os.path.exists(self.state_file):
        os.remove(self.state_file)
    os.rename(temp_file, self.state_file)

Learning:
File I/O operations should be atomic to prevent corruption. Writing to a temporary file and renaming is a common pattern for safe file updates in production systems.

⚠️ Limitations

Current Limitations

Scalability
- JSON file approach works well for thousands of emails
- May become slow with millions of emails
- Solution: Migrate to database for large-scale deployments
Error Recovery
- If Google Sheets API fails, emails are not retried automatically
- Manual intervention required to reprocess failed emails
- Solution: Implement retry queue with exponential backoff
Content Length
- Google Sheets has a 50,000 character limit per cell
- Very long emails are truncated
- Solution: Store full content in separate storage if needed
Rate Limiting
- Gmail API: 250 quota units per user per second
- Sheets API: 100 requests per 100 seconds per user
- Current implementation has basic retry logic
- Solution: Implement more sophisticated rate limiting
Concurrent Execution
- Running multiple instances simultaneously can cause race conditions
- State file may not be properly synchronized
- Solution: Implement file locking or use database
Email Attachments
- Current implementation doesn't handle attachments
- Only text content is extracted
- Solution: Add attachment download and storage feature

📁 Project Structure

gmail-to-sheets-automation/
│
├── src/
│   ├── __init__.py
│   ├── main.py                 # Main orchestrator
│   ├── gmail_service.py        # Gmail API service
│   ├── sheets_service.py       # Google Sheets API service
│   ├── email_parser.py         # Email data extraction
│   ├── state_manager.py        # Duplicate prevention
│   └── oauth_handler.py        # OAuth 2.0 authentication
│
├── credentials/
│   ├── .gitkeep               # Keeps folder in Git
│   ├── credentials.json       # OAuth client credentials (gitignored)
│   └── token.json             # Access/refresh tokens (gitignored)
│
├── proof/                      # Execution proof screenshots
│   ├── Project-Structure.png
│   ├── Spreadsheet1.png
│   ├── Spreadsheet2.png
│   ├── Terminal-log.png
│   ├── Processed_ids.png
│   └── Gmail-Inbox.png
│
├── docs/                       # Additional documentation
│   ├── GITHUB_SETUP.md
│   ├── TROUBLESHOOTING.md
│   └── ... (other guides)
│
├── config.py                   # Configuration (gitignored)
├── config.example.py           # Configuration template
├── requirements.txt            # Python dependencies
├── .gitignore                 # Git ignore rules
├── README.md                  # This file
└── processed_ids.json         # State persistence (gitignored)

🔒 Security Considerations

Credentials Protection
- All sensitive files are git-ignored
- Tokens stored with secure file permissions
- No credentials logged or displayed
OAuth Security
- Uses HTTPS for all API communications
- Tokens automatically refreshed
- User can revoke access anytime
Data Privacy
- Email content processed locally
- No data sent to third parties
- State file contains only email IDs (not content)

📝 Usage Examples

Basic Usage

# First run (requires OAuth consent)
python src/main.py

# Subsequent runs (uses saved tokens)
python src/main.py

Clear State (Reprocess All Emails)

from src.state_manager import StateManager

state = StateManager()
state.clear_state()
print("State cleared - all emails will be reprocessed")

🎓 Conclusion

This Gmail-to-Sheets automation system successfully demonstrates:

✅ OAuth 2.0 authentication with user credentials
✅ Gmail API integration for email fetching
✅ Google Sheets API integration for data logging
✅ Robust duplicate prevention mechanism
✅ Error handling and state consistency
✅ Clean, modular, maintainable code architecture

The system has been tested with 300 real emails and achieved a 100% success rate with zero duplicates and zero failures.

📞 Contact

Author: Rushikesh Randive
GitHub: @rushikeshxdev
Repository: gmail-to-sheets-automation

📄 License

This project is created for educational purposes as part of an internship assignment.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
credentials		credentials
proof		proof
src		src
.gitignore		.gitignore
README.md		README.md
config.example.py		config.example.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Gmail-to-Sheets Automation System

📋 Table of Contents

🎯 Project Overview

Key Features

🏗️ Architecture

High-Level System Design

Component Responsibilities

✨ Features

Core Functionality

🚀 Setup Instructions

Prerequisites

Step 1: Clone Repository

Step 2: Install Dependencies

Step 3: Configure Google Cloud Project

Step 4: Configure OAuth Consent Screen

Step 5: Update Configuration

Step 6: Run the System

🔐 OAuth Flow Explanation

OAuth 2.0 Authentication Process

First-Time Authentication Flow

Subsequent Runs

Why OAuth 2.0 User Credentials?

💾 State Persistence Method

How Duplicate Prevention Works

State File Structure

State Management Workflow

Why This Approach?

Alternative Approaches Considered

🔄 Duplicate Prevention Logic

What Happens When Script Runs Twice?

First Run

Second Run (Immediately After)

Second Run (New Email Arrives)

Edge Case Handling

📊 Execution Results

Test Execution Summary

Results

Performance Metrics

Sample Data Structure

🚧 Challenges Faced

Challenge 1: OAuth Access Denied Error

Challenge 2: Port Already in Use

Challenge 3: Spreadsheet Permission Denied

Challenge 4: HTML Email Content Extraction

Challenge 5: State File Corruption Risk

⚠️ Limitations

Current Limitations

📁 Project Structure

🔒 Security Considerations

📝 Usage Examples

Basic Usage

Clear State (Reprocess All Emails)

🎓 Conclusion

📞 Contact

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages