Agents that turn nightmare Excel sheets into organized tables
Author: Andre Moreira, November 2025
Version: 0.1.0-beta
How often do you end up with an Excel file that has exactly the data you need, but in a format you cannot really use? The data is scattered across sheets and files, or mixed with charts, formulas, and merged cells. Yet everything you need is still in there.
Mess Miner is an intelligent data cleaning tool that uses Large Language Models (LLMs) to understand and restructure messy data sources. Instead of manually writing complex parsing logic for each data format, Mess Miner analyzes the structure of your data and intelligently extracts it into clean, structured tables.
- Intelligent Structure Detection: Uses LLMs to analyze and understand data organization
- Parallel Processing: Handles multiple sheets/data sources simultaneously using map-reduce patterns
- Context-Aware: Accepts user background information to guide the cleaning process
- Flexible: Works with Excel files without custom parsing code
- Observable: Optional integration with LangSmith for workflow monitoring and debugging
- Validated Output: Uses Pydantic for structured response parsing and data validation
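The map-reduce idea behind the parallel processing can be sketched with the standard library alone. This is only an illustration: it uses a thread pool rather than LangGraph, and `examine_sheet` is a hypothetical stand-in that counts rows instead of calling an LLM.

```python
from concurrent.futures import ThreadPoolExecutor


def examine_sheet(item):
    """Map step (hypothetical stand-in for the LLM call): summarize one sheet."""
    sheet_name, csv_text = item
    # Ignore lines that are blank or contain only separators.
    lines = [ln for ln in csv_text.splitlines() if ln.strip(",").strip()]
    header = lines[0].split(",") if lines else []
    return {"sheet": sheet_name, "columns": header, "rows": max(len(lines) - 1, 0)}


def examine_all(sheets_as_csv):
    """Fan out over the sheets in parallel, then collect results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(examine_sheet, sheets_as_csv.items()))


sheets = {
    "Q1": "crop,yield\nwheat,3.2\n,\nmaize,5.1\n",
    "Q2": "crop,yield\nbarley,2.8\n",
}
results = examine_all(sheets)
for summary in results:
    print(summary)
```

Each sheet is handled independently, so a slow sheet does not block the others; the collected list plays the role of the reduce step.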
Mess Miner uses a three-node graph workflow:
- Node 1: Uploader
  - Loads Excel files
  - Converts each sheet to CSV format (text)
  - Organizes data for processing
- Node 2: Examiner
  - Analyzes CSV data structure
  - Identifies relevant columns and data organization
  - Creates field definitions (name, type, description)
  - Processes sheets in parallel (uses LangGraph's Send pattern)
- Node 3: Extractor
  - Dynamically builds Pydantic models based on examiner results
  - Extracts data according to the identified structure
  - Cleans data (removes empty rows, validates types)
  - Generates clean pandas DataFrames that can be saved as a cleaned Excel file
  - Processes sheets in parallel
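To make the hand-off from Examiner to Extractor concrete, here is a minimal standard-library sketch of how field definitions (name, type, description) could drive extraction and cleaning. It is not the repository's code: the project builds Pydantic models and pandas DataFrames, while this sketch swaps in plain type callables and dictionaries. `FIELDS`, `extract_rows`, and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical examiner output: one definition per relevant column.
FIELDS = [
    {"name": "crop", "type": str, "description": "Crop name"},
    {"name": "yield_t_ha", "type": float, "description": "Yield in tonnes per hectare"},
]


def extract_rows(csv_text, fields):
    """Keep only the defined columns, drop empty rows, and coerce types."""
    cleaned, rejected = [], []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        values = {f["name"]: (raw.get(f["name"]) or "").strip() for f in fields}
        if not any(values.values()):
            continue  # fully empty row
        try:
            cleaned.append({f["name"]: f["type"](values[f["name"]]) for f in fields})
        except ValueError:
            rejected.append(values)  # failed type validation
    return cleaned, rejected


sheet = "crop,yield_t_ha,notes\nwheat,3.2,ok\n,,\nmaize,n/a,check\nbarley, 2.8 ,\n"
rows, bad = extract_rows(sheet, FIELDS)
print(rows)
print(bad)
```

Rows that fail type coercion are collected separately rather than silently dropped, which makes it easier to see what the cleaning step discarded.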
- Clone this repository:
git clone https://github.com/andremoreira73/messminer.git
cd messminer
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
# Edit .env and add your API keys:
# - GOOGLE_API_KEY (for Gemini)
# - LANGCHAIN_API_KEY (for LangSmith)
I left CLAUDE.md out of .gitignore, so if you happen to use Claude Code (or any other AI tool that uses it) you can just jump in, no need to burn tokens with /init.
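Before running the notebook, it can help to confirm that the keys are actually visible to Python. The sketch below assumes the values from .env have already been loaded into the process environment; `missing_keys` is a hypothetical helper, and treating the LangSmith key as optional follows the feature list rather than anything enforced by the code.

```python
import os

# Key names taken from the setup comments; LangSmith is treated as optional here.
REQUIRED_KEYS = ("GOOGLE_API_KEY",)
OPTIONAL_KEYS = ("LANGCHAIN_API_KEY",)


def missing_keys(env, names):
    """Return the names that are unset or blank in the given mapping."""
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_keys(os.environ, REQUIRED_KEYS)
if missing:
    print("Missing required keys:", ", ".join(missing))

absent_optional = missing_keys(os.environ, OPTIONAL_KEYS)
if absent_optional:
    print("Optional keys not set:", ", ".join(absent_optional))
```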
- Open MessMiner.ipynb in Jupyter
- Execute the cells under these sections:
- Imports
- Observability
- Node 1: uploader
- Node 2: examiner
- Node 3: extractor
- Putting it all together: the Graph
(note that there are "testing areas" under the different sections - these are optional, good for learning / testing)
- Go to the section User inputs and configure:
input_file = 'Example_0.xlsx'
output_file = 'Example_0_output.xlsx'
# if you want all sheets into one consolidated, organized table, set it to True;
# otherwise, the workflow will treat each sheet separately
consolidate_sheets = False
# Optional: Provide context to guide the cleaning
user_context = """The file contains agricultural data that is almost
correctly structured, but has empty rows. I need a clean, contiguous table."""
- Run the workflow (cell under User Inputs):
initial_state = {
"user_background": str(user_context),
"file_name": input_file,
"sheets_as_csv": {}
}
graph_result = Mine_this_mess.invoke(initial_state)
- Save the cleaned output:
save_cleaned_data(graph_result, output_file)
Choose between Gemini Pro (more capable) or Flash (faster) models:
# For challenging/complex data
model_config_examiner = {
"model": "google_genai:gemini-2.5-pro",
"temperature": 1,
"thinking_budget": -1
}
# For simpler data or faster processing (use this block instead of the one above)
model_config_examiner = {
"model": "google_genai:gemini-2.5-flash",
"temperature": 1,
"thinking_budget": -1
}
If you prefer to use other models, add their configuration parameters here (see the documentation at https://docs.langchain.com/oss/python/integrations/chat) and add the respective API keys to the .env file.
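Since both configurations are assigned to the same variable, only one should be active at a time. One way to make the choice explicit is a small lookup, sketched below; `MODEL_CONFIGS` and `pick_model_config` are hypothetical names, and the values are copied from the two configurations shown here.

```python
# Both configurations, keyed by a short label.
MODEL_CONFIGS = {
    "pro": {
        "model": "google_genai:gemini-2.5-pro",
        "temperature": 1,
        "thinking_budget": -1,
    },
    "flash": {
        "model": "google_genai:gemini-2.5-flash",
        "temperature": 1,
        "thinking_budget": -1,
    },
}


def pick_model_config(complex_data):
    """Return a copy of the Pro config for hard inputs, Flash otherwise."""
    return dict(MODEL_CONFIGS["pro" if complex_data else "flash"])


model_config_examiner = pick_model_config(complex_data=True)
print(model_config_examiner["model"])
```

Returning a copy means later tweaks to one node's configuration do not leak into the shared defaults.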
.
├── MessMiner.ipynb # Main notebook with workflow and examples
├── classes.py # Pydantic models and state definitions
├── helper_functions.py # Utility functions
├── prompts.py # LLM prompt templates
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── Example_*.xlsx # Sample input files
└── Example_*_output.xlsx # Sample output files
The repository includes three anonymized example files demonstrating different use cases:
- Example_0.xlsx: Agricultural data with empty rows
- Example_1.xlsx: Technical/price data in German
- Example_2.xlsx: Trial management software output
Tracks the entire workflow state:
- user_background: Context information provided by the user
- file_name: Input file path
- sheets_as_csv: Dictionary of sheet names to CSV strings
- csv_structure: List of identified structures
- cleaned_csv: List of cleaned DataFrames
Used for parallel processing of individual sheets:
- user_background: Shared context
- sheet_name: Current sheet identifier
- sheet_value: CSV data as string
- csv_structure: Structure definition for this sheet
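The two state shapes described above could be written as typed dictionaries along the following lines. This is a sketch under assumptions: the class names `OverallState` and `SheetState`, the `Any` element types, and the `fan_out` helper are hypothetical, and the actual definitions live in classes.py.

```python
from typing import Any, Dict, List, TypedDict


class OverallState(TypedDict, total=False):
    """Whole-workflow state; later nodes fill in the last two fields."""
    user_background: str
    file_name: str
    sheets_as_csv: Dict[str, str]
    csv_structure: List[Any]
    cleaned_csv: List[Any]


class SheetState(TypedDict):
    """State passed to one parallel branch handling a single sheet."""
    user_background: str
    sheet_name: str
    sheet_value: str
    csv_structure: Any


def fan_out(state: OverallState) -> List[SheetState]:
    """Build one per-sheet state for each entry in sheets_as_csv."""
    return [
        SheetState(
            user_background=state["user_background"],
            sheet_name=name,
            sheet_value=text,
            csv_structure=None,
        )
        for name, text in state["sheets_as_csv"].items()
    ]


overall: OverallState = {
    "user_background": "Agricultural data with empty rows",
    "file_name": "Example_0.xlsx",
    "sheets_as_csv": {"Sheet1": "a,b\n1,2\n", "Sheet2": "c\n3\n"},
}
per_sheet = fan_out(overall)
print([s["sheet_name"] for s in per_sheet])
```

Each per-sheet state carries the shared user context along with its own CSV text, which is what lets the sheets be processed independently.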
- Python: 3.11
- LangGraph: Agent orchestration and workflow management
- LangSmith: Observability and tracing
- LLMs: Google Gemini (2.5 Pro, 2.5 Flash)
- Pydantic: Structured response parsing and validation
- Development: Claude Code for brainstorming and code assistance
Observability: LangSmith integration is enabled by default. Configure in the notebook:
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true" # set to "false" if you do not want tracing
os.environ["LANGCHAIN_PROJECT"] = "MessMiner_v2"
MIT License © 2025 Andre Moreira - See LICENSE file for details
If you use Mess Miner, please cite:
Mess Miner
Author: Andre Moreira
November 2025
Mess Miner is currently a demo, not production-ready. Future improvement ideas:
Production Hardening
- Robust error handling
- Token usage monitoring and caps
- Sheet processing limits
- User-friendly interface
Schema Reuse
- Consolidate multiple sheets using the same structure
- Save and reuse examined schemas to skip Node 2 in future runs
- Store inferred schemas alongside cleaned tables
Format Extensions
- Support PowerPoint and Word files (yes, people store data there)
Community Tool Vision
- Django web application with containerized deployment
- Users bring their own LLM and API keys
- Self-hosted or cloud-based options
Contributions are welcome! Fork the repository, experiment with improvements, and submit pull requests for features, bug fixes, or documentation enhancements.
Note on Maintenance: This project is maintained on a best-effort basis. Pull requests are reviewed as time permits, so response times may vary. The codebase is open for you to adapt and extend for your own use cases.
Reach out to a.moreira@lyfx.ai
