Author: Rushikesh Randive
Project: Python Automation - Gmail to Google Sheets Integration
Date: January 2026
- Project Overview
- Architecture
- Features
- Setup Instructions
- OAuth Flow Explanation
- State Persistence Method
- Duplicate Prevention Logic
- Execution Results
- Challenges Faced
- Limitations
A Python automation system that connects to Gmail and Google Sheets APIs to automatically read unread emails from a Gmail inbox and log them into a Google Sheet. The system uses OAuth 2.0 authentication (user credentials only, no service accounts) and implements robust duplicate prevention.
- ✅ Fetches unread emails from Gmail inbox
- ✅ Extracts structured data: From, Subject, Date, Content
- ✅ Appends data to Google Sheets automatically
- ✅ Marks processed emails as read
- ✅ Prevents duplicate processing across multiple runs
- ✅ Handles HTML to plain text conversion
- ✅ Implements atomic file operations for data consistency
graph TB
A[main.py - Orchestrator] --> B[gmail_service.py]
A --> C[sheets_service.py]
A --> D[email_parser.py]
A --> E[state_manager.py]
B --> F[Gmail API]
C --> G[Google Sheets API]
E --> H[processed_ids.json]
I[config.py] --> A
J[credentials.json] --> B
J --> C
K[token.json] --> B
K --> C
style A fill:#4CAF50
style B fill:#2196F3
style C fill:#2196F3
style D fill:#2196F3
style E fill:#2196F3
| Component | Responsibility |
|---|---|
| main.py | Orchestrates the complete workflow, coordinates all services |
| gmail_service.py | Gmail API authentication, email fetching, marking as read |
| sheets_service.py | Google Sheets API authentication, data appending |
| email_parser.py | Extracts structured data from email messages |
| state_manager.py | Tracks processed emails, prevents duplicates |
| oauth_handler.py | Handles OAuth 2.0 authentication flow |
| config.py | Centralized configuration management |
-
OAuth 2.0 Authentication
- User credential-based authentication (no service accounts)
- Browser-based consent flow
- Automatic token refresh
- Secure token storage
-
Email Processing
- Fetches only unread emails from inbox
- Extracts: Sender, Subject, Date, Content
- Handles HTML to plain text conversion
- Processes multipart messages
-
Google Sheets Integration
- Appends data as new rows
- Preserves existing data
- Automatic header creation
- Validates spreadsheet access
-
Duplicate Prevention
- Tracks processed email IDs in JSON file
- Skips already-processed emails
- Persists state across system restarts
- Atomic file operations prevent corruption
-
Email Status Management
- Marks processed emails as read
- Continues processing even if marking fails
- Maintains state consistency
- Python 3.7 or higher
- Google Cloud Project with Gmail API and Sheets API enabled
- OAuth 2.0 credentials (Desktop Application type)
git clone https://github.com/rushikeshxdev/gmail-to-sheets-automation.git
cd gmail-to-sheets-automationpip install -r requirements.txt- Go to Google Cloud Console
- Create a new project or select existing one
- Enable APIs:
- Gmail API
- Google Sheets API
- Create OAuth 2.0 credentials:
- Go to "Credentials" → "Create Credentials" → "OAuth client ID"
- Application type: "Desktop application"
- Download credentials JSON file
- Place credentials file at:
credentials/credentials.json
- Go to "OAuth consent screen"
- Add your email as a test user
- Add required scopes:
https://www.googleapis.com/auth/gmail.modifyhttps://www.googleapis.com/auth/spreadsheets
-
Copy the example config:
cp config.example.py config.py
-
Edit
config.pyand set your Google Sheets ID:SPREADSHEET_ID = "your-spreadsheet-id-here"
(Find this in your Google Sheets URL after
/d/)
python src/main.pyOn first run, a browser window will open for OAuth consent. Grant permissions, and the system will start processing emails.
The system uses OAuth 2.0 with user credentials (not service accounts) to ensure secure access to Gmail and Google Sheets.
1. User runs the application
↓
2. System checks for existing token.json
↓
3. No token found → Initiate OAuth flow
↓
4. Open browser for user consent
↓
5. User signs in with Google account
↓
6. User grants permissions (Gmail + Sheets)
↓
7. Google returns authorization code
↓
8. System exchanges code for access/refresh tokens
↓
9. Tokens saved to token.json
↓
10. System proceeds with email processing
1. User runs the application
↓
2. System loads token.json
↓
3. Check if token is valid
↓
4. If valid → Use directly
↓
5. If expired → Refresh automatically
↓
6. System proceeds with email processing
- Security: No passwords stored, only temporary tokens
- User Control: Users can revoke access anytime
- Compliance: Follows Google's security best practices
- Transparency: Users see exactly what permissions are granted
The system uses a JSON file (processed_ids.json) to track which emails have been processed. This ensures that running the script multiple times doesn't create duplicate entries in the spreadsheet.
{
"processed_ids": [
"19bb836afcb121af",
"19bb8258747b6936",
"19bb7ea3e22b1d84"
],
"count": 3
}Email Processing Flow:
┌─────────────────────────────────────┐
│ 1. Load processed_ids.json │
│ (or create if doesn't exist) │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 2. Fetch unread emails from Gmail │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 3. For each email: │
│ - Check if ID in processed_ids │
│ - If YES → Skip │
│ - If NO → Process │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 4. Parse email data │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 5. Append to Google Sheets │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 6. Add email ID to processed_ids │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 7. Save state immediately │
│ (atomic write operation) │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ 8. Mark email as read in Gmail │
└─────────────────────────────────────┘
- Simplicity: JSON file is easy to read and debug
- Reliability: Atomic write operations prevent corruption
- Persistence: State survives system restarts
- Performance: In-memory set for fast lookups
- Transparency: Users can inspect the state file
| Approach | Pros | Cons | Why Not Used |
|---|---|---|---|
| Database | Scalable, ACID | Requires setup, overkill | Too complex for this use case |
| Gmail Labels | Native to Gmail | Requires API calls | Slower, less reliable |
| Timestamp | Simple | Not reliable | Emails can arrive out of order |
| JSON File | Simple, reliable | Not scalable to millions | CHOSEN - Perfect for this use case |
Inbox: [Email A, Email B, Email C] (all unread)
processed_ids.json: []
Processing:
- Email A → Process → Add to Sheet → Mark as read → Save ID
- Email B → Process → Add to Sheet → Mark as read → Save ID
- Email C → Process → Add to Sheet → Mark as read → Save ID
Result:
- Sheet rows: 3 new rows
- processed_ids.json: [A, B, C]
- Gmail: All marked as read
Inbox: [] (no unread emails)
processed_ids.json: [A, B, C]
Processing:
- No unread emails found
- Nothing to process
Result:
- Sheet rows: No changes
- processed_ids.json: [A, B, C] (unchanged)
- Gmail: No changes
Inbox: [Email D] (new unread email)
processed_ids.json: [A, B, C]
Processing:
- Email D → Check ID → Not in processed_ids → Process
- Email D → Add to Sheet → Mark as read → Save ID
Result:
- Sheet rows: 1 new row (only Email D)
- processed_ids.json: [A, B, C, D]
- Gmail: Email D marked as read
-
Script Crashes Mid-Processing
- State is saved after each email
- Only unprocessed emails will be retried
- No duplicates created
-
Email Cannot Be Marked as Read
- Email is still added to processed_ids
- Prevents reprocessing even if marking fails
- Error is logged for debugging
-
Spreadsheet Append Fails
- Email is NOT added to processed_ids
- Will be retried on next run
- Ensures data consistency
Test Environment:
- Gmail account with real incoming emails
- Google Sheets for data logging
- Multiple test runs to verify duplicate prevention
============================================================
✅ Processing Complete!
============================================================
📊 Emails fetched: 300
✅ Emails processed: 300
⏭️ Emails skipped (duplicates): 0
❌ Emails failed: 0
📖 Emails marked as read: 300
============================================================
- Total Processing Time: ~9 minutes
- Average Time per Email: ~1.8 seconds
- Success Rate: 100%
- API Calls: ~900 (Gmail: 600, Sheets: 300)
- Data Written: 300 rows to Google Sheets
- Duplicates Created: 0
| From | Subject | Date | Content |
|---|---|---|---|
| sender@example.com | Meeting Reminder | 2026-01-13 16:35:48 | Email body content... |
| notifications@service.com | Account Update | 2026-01-13 14:20:15 | Your account has been... |
| team@company.com | Project Status | 2026-01-13 10:45:30 | Here's the latest update... |
Problem:
Initial OAuth authentication failed with "Error 403: access_denied". The Google Cloud project was in testing mode and could only be accessed by approved test users.
Solution:
- Added my email as a test user in Google Cloud Console
- Configured OAuth consent screen with proper scopes
- Manually added both required scopes:
https://www.googleapis.com/auth/gmail.modifyhttps://www.googleapis.com/auth/spreadsheets
Learning:
Google Cloud projects in testing mode require explicit test user approval. This is a security feature to prevent unauthorized access during development.
Problem:
OAuth flow failed with "WinError 10048: Only one usage of each socket address is normally permitted" because port 8080 was already in use by another application.
Solution:
- Modified OAuth handler to use port 0 (automatic port selection)
- System now finds an available port automatically
- Changed from
port=8080toport=0inrun_local_server()
Code Fix:
# Before
self.credentials = flow.run_local_server(port=8080, ...)
# After
self.credentials = flow.run_local_server(port=0, ...) # Auto-select portLearning:
Hardcoded ports can cause conflicts. Using port 0 allows the OS to assign an available port automatically, making the application more robust.
Problem:
Initial spreadsheet access failed with "403 Forbidden: PERMISSION_DENIED" error even though OAuth authentication was successful.
Solution:
- Shared the Google Sheet with the authenticated email account
- Set permission level to "Editor"
- Updated spreadsheet ID in config.py to the correct shared sheet
Learning:
OAuth authentication grants API access, but the spreadsheet itself must be explicitly shared with the user account. API permissions and document permissions are separate layers of security.
Problem:
Many emails contain HTML content with tags, scripts, and formatting that shouldn't appear in the spreadsheet. Direct extraction resulted in messy, unreadable content.
Solution:
- Implemented HTML to plain text conversion using BeautifulSoup
- Removed script and style tags before extraction
- Cleaned up excessive whitespace
- Handled multipart messages correctly
Code Implementation:
def _html_to_plain_text(self, html_content: str) -> str:
soup = BeautifulSoup(html_content, 'lxml')
# Remove script and style elements
for script in soup(["script", "style"]):
script.decompose()
# Get text and clean whitespace
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = ' '.join(chunk for chunk in chunks if chunk)
return textLearning:
Email content can be in various formats (plain text, HTML, multipart). Robust parsing requires handling all formats and converting to a consistent output format.
Problem:
If the system crashes while writing to processed_ids.json, the file could become corrupted, leading to data loss or duplicate processing on the next run.
Solution:
- Implemented atomic write operations
- Write to temporary file first
- Rename to actual file (atomic operation on most systems)
- Clean up temporary file on errors
Code Implementation:
def save_state(self) -> bool:
# Write to temporary file first
temp_file = f"{self.state_file}.tmp"
with open(temp_file, 'w') as f:
json.dump(data, f, indent=2)
# Atomic rename
if os.path.exists(self.state_file):
os.remove(self.state_file)
os.rename(temp_file, self.state_file)Learning:
File I/O operations should be atomic to prevent corruption. Writing to a temporary file and renaming is a common pattern for safe file updates in production systems.
-
Scalability
- JSON file approach works well for thousands of emails
- May become slow with millions of emails
- Solution: Migrate to database for large-scale deployments
-
Error Recovery
- If Google Sheets API fails, emails are not retried automatically
- Manual intervention required to reprocess failed emails
- Solution: Implement retry queue with exponential backoff
-
Content Length
- Google Sheets has a 50,000 character limit per cell
- Very long emails are truncated
- Solution: Store full content in separate storage if needed
-
Rate Limiting
- Gmail API: 250 quota units per user per second
- Sheets API: 100 requests per 100 seconds per user
- Current implementation has basic retry logic
- Solution: Implement more sophisticated rate limiting
-
Concurrent Execution
- Running multiple instances simultaneously can cause race conditions
- State file may not be properly synchronized
- Solution: Implement file locking or use database
-
Email Attachments
- Current implementation doesn't handle attachments
- Only text content is extracted
- Solution: Add attachment download and storage feature
gmail-to-sheets-automation/
│
├── src/
│ ├── __init__.py
│ ├── main.py # Main orchestrator
│ ├── gmail_service.py # Gmail API service
│ ├── sheets_service.py # Google Sheets API service
│ ├── email_parser.py # Email data extraction
│ ├── state_manager.py # Duplicate prevention
│ └── oauth_handler.py # OAuth 2.0 authentication
│
├── credentials/
│ ├── .gitkeep # Keeps folder in Git
│ ├── credentials.json # OAuth client credentials (gitignored)
│ └── token.json # Access/refresh tokens (gitignored)
│
├── proof/ # Execution proof screenshots
│ ├── Project-Structure.png
│ ├── Spreadsheet1.png
│ ├── Spreadsheet2.png
│ ├── Terminal-log.png
│ ├── Processed_ids.png
│ └── Gmail-Inbox.png
│
├── docs/ # Additional documentation
│ ├── GITHUB_SETUP.md
│ ├── TROUBLESHOOTING.md
│ └── ... (other guides)
│
├── config.py # Configuration (gitignored)
├── config.example.py # Configuration template
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── README.md # This file
└── processed_ids.json # State persistence (gitignored)
-
Credentials Protection
- All sensitive files are git-ignored
- Tokens stored with secure file permissions
- No credentials logged or displayed
-
OAuth Security
- Uses HTTPS for all API communications
- Tokens automatically refreshed
- User can revoke access anytime
-
Data Privacy
- Email content processed locally
- No data sent to third parties
- State file contains only email IDs (not content)
# First run (requires OAuth consent)
python src/main.py
# Subsequent runs (uses saved tokens)
python src/main.pyfrom src.state_manager import StateManager
state = StateManager()
state.clear_state()
print("State cleared - all emails will be reprocessed")This Gmail-to-Sheets automation system successfully demonstrates:
- ✅ OAuth 2.0 authentication with user credentials
- ✅ Gmail API integration for email fetching
- ✅ Google Sheets API integration for data logging
- ✅ Robust duplicate prevention mechanism
- ✅ Error handling and state consistency
- ✅ Clean, modular, maintainable code architecture
The system has been tested with 300 real emails and achieved a 100% success rate with zero duplicates and zero failures.
Author: Rushikesh Randive
GitHub: @rushikeshxdev
Repository: gmail-to-sheets-automation
This project is created for educational purposes as part of an internship assignment.