The Mess Miner

Agents that turn nightmare Excel sheets into organized tables

Author: Andre Moreira, November 2025 Version: 0.1.0-beta

How often do you have that Excel file with exactly the data you need, but in a format you cannot really use? The data is scattered across sheets and files, or mixed with charts, formulas and merged cells. But it still has the data you really need.

The agent workflow that cleans your data

Mess Miner is an intelligent data cleaning tool that uses Large Language Models (LLMs) to understand and restructure messy data sources. Instead of manually writing complex parsing logic for each data format, Mess Miner analyzes the structure of your data and intelligently extracts it into clean, structured tables.

Key Features

Intelligent Structure Detection: Uses LLMs to analyze and understand data organization
Parallel Processing: Handles multiple sheets/data sources simultaneously using map-reduce patterns
Context-Aware: Accepts user background information to guide the cleaning process
Flexible: Works Excel files without custom parsing code
Observable: Optional integration with LangSmith for workflow monitoring and debugging
Validated Output: Uses Pydantic for structured response parsing and data validation

How It Works

Mess Miner uses a three-node graph workflow:

Workflow Nodes

Node 1: Uploader
- Loads Excel files
- Converts each sheet to CSV format (text)
- Organizes data for processing
Node 2: Examiner
- Analyzes CSV data structure
- Identifies relevant columns and data organization
- Creates field definitions (name, type, description)
- Processes sheets in parallel (uses LangGraph's Send pattern)
Node 3: Extractor
- Dynamically builds Pydantic models based on examiner results
- Extracts data according to identified structure
- Cleans data (removes empty rows, validates types)
- Generates clean pandas DataFrames that can be saved as cleaned Excel file
- Processes sheets in parallel

Installation

Clone this repository:

git clone https://github.com/andremoreira73/messminer.git
cd messminer

Install dependencies:

pip install -r requirements.txt

Set up environment variables:

cp .env.example .env
# Edit .env and add your API keys:
# - GOOGLE_API_KEY (for Gemini)
# - LANGCHAIN_API_KEY (for LangSmith)

Claude

I left CLAUDE.md out of .gitignore, so if you happen to use Claude Code (or any other AI tools that use it) you can just jump in, no need to burn tokens with /init

Usage

Basic Usage

Open MessMiner.ipynb in Jupyter
Executing the cells under the sections

Imports
Observability
Node 1: uploader
Node 2: examiner
Node 3: extractor
Putting it all together: the Graph

(note that there are "testing areas" under the different sections - these are optional, good for learning / testing)

Go to the section User inputs and configure:

input_file = 'Example_0.xlsx'
output_file = 'Example_0_output.xlsx'

# if you want all sheets into one consolidated, organized table, set it to True;
# otherwise, the workflow will treat each sheet separately
consolidate_sheets = False

# Optional: Provide context to guide the cleaning
user_context = """The file contains agricultural data that is almost
correctly structured, but has empty rows. I need a clean, contiguous table."""

Run the workflow (cell under User Inputs):

initial_state = {
    "user_background": str(user_context),
    "file_name": input_file,
    "sheets_as_csv": {}
}
graph_result = Mine_this_mess.invoke(initial_state)

Save the cleaned output:

save_cleaned_data(graph_result, output_file)

Model Configuration

Choose between Gemini Pro (more capable) or Flash (faster) models:

# For challenging/complex data
model_config_examiner = {
    "model": "google_genai:gemini-2.5-pro",
    "temperature": 1,
    "thinking_budget": -1
}

# For simpler data or faster processing
model_config_examiner = {
    "model": "google_genai:gemini-2.5-flash",
    "temperature": 1,
    "thinking_budget": -1
}

If you prefer to use other models, add their configuration parameters here (see documentation here https://docs.langchain.com/oss/python/integrations/chat) and add the respective API keys to the .env file

Project Structure

.
├── MessMiner.ipynb          # Main notebook with workflow and examples
├── classes.py               # Pydantic models and state definitions
├── helper_functions.py      # Utility functions
├── prompts.py              # LLM prompt templates
├── requirements.txt        # Python dependencies
├── .env.example           # Environment variables template
├── Example_*.xlsx         # Sample input files
└── Example_*_output.xlsx  # Sample output files

Examples

The repository includes three anonymized example files demonstrating different use cases:

Example_0.xlsx: Agricultural data with empty rows
Example_1.xlsx: Technical/price data in German
Example_2.xlsx: Trial management software output

Agent Workflow State Management

OverallState

Tracks the entire workflow state:

user_background: Context information provided by the user
file_name: Input file path
sheets_as_csv: Dictionary of sheet names to CSV strings
csv_structure: List of identified structures
cleaned_csv: List of cleaned DataFrames

SingleCSVState

Used for parallel processing of individual sheets:

user_background: Shared context
sheet_name: Current sheet identifier
sheet_value: CSV data as string
csv_structure: Structure definition for this sheet

Technology Stack

Python: 3.11
LangGraph: Agent orchestration and workflow management
LangSmith: Observability and tracing
LLMs: Google Gemini (2.5 Pro, 2.5 Flash)
Pydantic: Structured response parsing and validation
Development: Claude Code for brainstorming and code assistance

Observability: LangSmith integration is enabled by default. Configure in the notebook:

os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true"  # set to "false" if you do not want tracing
os.environ["LANGCHAIN_PROJECT"] = "MessMiner_v2"

License

Citation

If you use Mess Miner, please cite:

Mess Miner
Author: Andre Moreira
November 2025

To Dos

Mess Miner is currently a demo, not production-ready. Future improvement ideas:

Production Hardening

Robust error handling
Token usage monitoring and caps
Sheet processing limits
User-friendly interface

Schema Reuse

Consolidate multiple sheets using the same structure
Save and reuse examined schemas to skip Node 2 in future runs
Store inferred schemas alongside cleaned tables

Format Extensions

Support PowerPoint and Word files (yes, people store data there)

Community Tool Vision

Django web application with containerized deployment
Users bring their own LLM and API keys
Self-hosted or cloud-based options

Contributing

Contributions are welcome! Fork the repository, experiment with improvements, and submit pull requests for features, bug fixes, or documentation enhancements.

Note on Maintenance: This project is maintained on a best-effort basis. Pull requests are reviewed as time permits, so response times may vary. The codebase is open for you to adapt and extend for your own use cases.

Support, Feedback

Reach out to a.moreira@lyfx.ai

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The Mess Miner

The agent workflow that cleans your data

Key Features

How It Works

Workflow Nodes

Installation

Claude

Usage

Basic Usage

Model Configuration

Project Structure

Examples

Agent Workflow State Management

OverallState

SingleCSVState

Technology Stack

License

Citation

To Dos

Contributing

Support, Feedback

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.env.example		.env.example
CLAUDE.md		CLAUDE.md
Example_0.xlsx		Example_0.xlsx
Example_0_output.xlsx		Example_0_output.xlsx
Example_0_output_consolidated.xlsx		Example_0_output_consolidated.xlsx
Example_1.xlsx		Example_1.xlsx
Example_1_output.xlsx		Example_1_output.xlsx
Example_2.xlsx		Example_2.xlsx
Example_2_output.xlsx		Example_2_output.xlsx
LICENSE		LICENSE
MessMiner.ipynb		MessMiner.ipynb
MessMiner_workflow.png		MessMiner_workflow.png
MessMiner_workflow_send.png		MessMiner_workflow_send.png
README.md		README.md
classes.py		classes.py
helper_functions.py		helper_functions.py
prompts.py		prompts.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

The Mess Miner

The agent workflow that cleans your data

Key Features

How It Works

Workflow Nodes

Installation

Claude

Usage

Basic Usage

Model Configuration

Project Structure

Examples

Agent Workflow State Management

OverallState

SingleCSVState

Technology Stack

License

Citation

To Dos

Contributing

Support, Feedback

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages