Spec Automation Tool

A GUI application for automatically formatting MEP construction specifications to match architect-provided templates. This tool extracts content from office master specifications, intelligently filters out editing instructions and specifier notes, and rebuilds the documents with the exact formatting DNA from your project's architectural template.

🎯 What It Does

The Spec Automation Tool solves a common problem in construction documentation: taking your office's master specifications and reformatting them to match the specific style requirements of each project's architect.

The Process:

Extract → Reads your office master .docx files and converts content to markdown, automatically filtering out:
- Specifier notes and editing instructions
- Copyright notices and metadata
- Any content that shouldn't appear in the final specification
Reformat → Rebuilds the specification using the architect's template, applying:
- Custom heading styles (CSI Level 0, Level 1, Level 2, etc.)
- List formatting with proper indentation
- Page layout, margins, headers, and footers from the template
Output → Produces clean, properly formatted specifications ready for coordination and printing

🚀 Quick Start

Prerequisites

Python 3.7+ installed on your Windows machine
pip (Python package installer)

Installation

Install Required Libraries:
```
pip install python-docx
```
Download the Tool:
- Save spec_tool_gui.py to your preferred location
- Save project_config.json to the same folder
- Save check_styles.py (optional utility) to the same folder

First Run

Double-click spec_tool_gui.py or run from command line:

python spec_tool_gui.py

📖 How to Use

Step 1: Prepare Your Files

You'll need:

Office Masters Folder → Directory containing your office's master specification .docx files
Architect Template → The .docx file provided by the architect with their formatting
Output Folder → Where you want the formatted specifications saved

Step 2: Configure Style Mappings (First Time Only)

Before running the tool on a new project, you need to identify what style names the architect is using:

Run the check_styles.py utility:
```
python check_styles.py
```
This will print all paragraph styles found in the architect's template.

Open project_config.json and update the style names to match:

{
    "styles": {
        "Title": "CSILevel0",      ← Update these
        "Part": "CSILevel1",        ← with actual
        "Article": "CSILevel2"      ← style names
    },
    "list_levels": [
        "CSILevel3", 
        "CSILevel4", 
        "CSILevel5"
    ]
}

Step 3: Run the Tool

Launch spec_tool_gui.py
Click Browse buttons to select:
- Office Masters Folder
- Architect Template (.docx)
- Project Config (.json)
- Output Folder
Click 🚀 PROCESS ALL SPECS
Watch the log window for progress
Find your formatted specifications in the Output Folder

⚙️ Configuration Guide

Understanding `project_config.json`

{
    "styles": {
        "Title": "CSILevel0",       // Section headers (SECTION 23 21 13)
        "Part": "CSILevel1",        // Part headers (PART 1 - GENERAL)
        "Article": "CSILevel2"      // Article headers (1.01 SUMMARY)
    },
    "list_levels": [
        "CSILevel3",                // First indent (A. Item)
        "CSILevel4",                // Second indent (1. Subitem)
        "CSILevel5",                // Third indent (a. Detail)
        "CSILevel6",                // Fourth indent (1) Further detail)
        "CSILevel7"                 // Fifth indent (a) Deepest level)
    ],
    "options": {
        "strip_heading_numbers": true,    // Remove "1.01" from "1.01 SUMMARY"
        "strip_list_labels": true         // Remove "A." from "A. List item"
    }
}

Finding Style Names

Use check_styles.py to inspect any Word document:

python check_styles.py

Make sure architect_style.docx points to the correct template file in the script, or modify the script to accept command-line arguments.

Customizing Filtering

In spec_tool_gui.py, you can customize what gets filtered out during extraction:

# Line 47-54
IGNORED_STYLES = [
    "Specifier Note", 
    "Note", 
    "Instruction", 
    "Editing Instruction", 
    'CMT'
]

# Line 57-64
IGNORED_STARTS = [
    "See Editing Instruction",
    "Adjust list below",
    "Retain ",
    "Delete ",
    "Edit ",
    "Verify that Section titles"
]

Add or remove patterns based on what appears in your office masters.

🛠️ Advanced Usage

Processing Individual Files

While the GUI processes entire folders, you can modify the script for single-file processing by calling the functions directly:

from spec_tool_gui import extract_master_to_markdown, rebuild_from_markdown, load_config

config = load_config("project_config.json")
extract_master_to_markdown("master.docx", "temp.md", print)
rebuild_from_markdown("temp.md", "template.docx", "output.docx", config, print)

Creating Project-Specific Configs

For projects with multiple architects or phases, create separate config files:

project_configs/
├── architect_a_config.json
├── architect_b_config.json
└── phase_2_config.json

Select the appropriate config when running the tool.

🐛 Troubleshooting

"No .docx files found in Master folder"

Make sure you're pointing to a folder containing .docx files (not .doc)
Check that files aren't hidden or in subfolders
Ensure file names don't start with ~$ (temp files)

"Template not found"

Verify the architect template path is correct
Make sure it's a .docx file (not .dotx template)
Try copying the template to the same folder as the script

Formatting looks wrong

Run check_styles.py to verify style names
Compare project_config.json style names with template styles
Check that strip_heading_numbers and strip_list_labels are set appropriately

Content is missing

Check IGNORED_STYLES and IGNORED_STARTS lists
Make sure legitimate content isn't being filtered
Examine the temporary .md files (uncomment line 237 to keep them)

Program freezes

Large files may take time to process
Check the log window for errors
Try processing one file manually to diagnose issues

📝 File Structure

spec-automation-tool/
├── spec_tool_gui.py          # Main application
├── project_config.json        # Style mapping configuration
├── check_styles.py           # Utility to inspect Word styles
└── README.md                 # This file

🔄 Workflow Integration

Typical Project Workflow:

Project Kickoff → Architect provides template with their style guide
Style Analysis → Run check_styles.py on architect template
Config Setup → Create/update project_config.json with correct style names
Batch Processing → Use GUI to process all relevant specifications
QC Review → Spot-check formatted specs for correctness
Coordination → Proceed with normal specification coordination workflow

🎓 Background

This tool implements the "extract → clean → rebuild" pattern for document formatting automation. It works by:

Converting Word documents to an intermediate markdown representation
Applying intelligent filtering rules during extraction
Using the architect's template as a "formatting DNA" source
Surgically rebuilding documents with the target formatting

The approach preserves content while completely replacing formatting, ensuring consistency across all project specifications.

💡 Tips

Test First: Always test with one specification before batch processing
Keep Temps: Uncomment line 237 to keep markdown files for debugging
Version Control: Keep project configs in version control with project files
Backup: Keep original masters untouched; always work with copies
Style Guide: Create a reference document showing what each style looks like

📄 License

Copyright Notice

This software and associated documentation files (the "Software") are the proprietary property of Abraham Borg.

Unauthorized copying, modification, distribution, or use of this Software, via any medium, is strictly prohibited without express written permission from the copyright holder.

This Software is provided for review and reference purposes only. No license or right to use, copy, modify, or distribute this Software for any purpose, commercial or non-commercial, is granted.

Critical Architectural Limitation

Markdown as the intermediate format is fundamentally lossy. This single design choice cascades into most of the issues below.

When you extract to markdown and rebuild, you permanently lose:

Lost Element	Impact
Tables	Schedules, performance criteria, product tables—gone
Images	Diagrams, details, manufacturer logos—gone
Inline formatting	Bold, italic, underline within paragraphs—gone
Hyperlinks	URL references—gone
Track changes	Revision history—gone
Comments	Review notes—gone
Fields	Auto-numbering, cross-references, TOC entries—gone
Bookmarks	Internal document links—gone

For MEP specs that include equipment schedules or piping diagrams, this is a dealbreaker.

Regex-Based Structure Detection Failures

The current detection logic has several blind spots:

Part Detection Issues

# Current: Requires "PART" AND a dash
elif text.upper().startswith("PART") and "-" in text:

Will fail on:

PART 1 GENERAL (no dash—common in older masters)
PART ONE - GENERAL (spelled out)
Multi-line part titles where the number is separate from the title

Article Detection Issues

# Current: Article detection
elif re.match(r'^\d+\.\d+\s', text):

Will fail on:

1.01 with no space after (tight formatting)
1.1 SUMMARY (single digit after decimal)
Sub-articles like 2.01.A or 2.01.1 (nested numbering)

List Label Detection Issues

# Current: List label detection
if re.match(r'^[A-Z]\.\s', text):       # Level 1
elif re.match(r'^\d+\.\s', text):       # Level 2
elif re.match(r'^[a-z]\.\s', text):     # Level 3
elif re.match(r'^\d+\)\s', text):       # Level 4
elif re.match(r'^[a-z]\)\s', text):     # Level 5

Will fail on:

CSI Level 6: (1), (2), etc.
CSI Level 7: (a), (b), etc.
Roman numerals: i., ii., I., II.
Double letters: AA., BB. (overflow labels)
Continuation paragraphs (text that belongs to the previous list item but has no label)

Style Dependency Issues

Hard-coded Ignored Styles

IGNORED_STYLES = [
    "Specifier Note", 
    "Note", 
    "Instruction", 
    "Editing Instruction", 
    'CMT'
]

Different master file publishers use different style names:

BSD: SpecNote, EditNote
ARCOM: Specifier Note, SpN
Custom masters: Anything goes

This requires per-publisher configuration, not hard-coding.

Silent Style Fallback

def add_safe_paragraph(text, style_name):
    try:
        doc.add_paragraph(text, style=style_name)
    except KeyError:
        doc.add_paragraph(text, style='Normal')  # Silent failure!

If your config references CSILevel6 but the template only has up to CSILevel5, you'll get Normal style with no warning. The output will look wrong and you won't know why.

Config Validation Gaps

Your project_config.json references styles like CSILevel0 through CSILevel7, but:

No validation that these styles exist in the template
No schema validation for the JSON structure
KeyError risk if required keys are missing:

style_map = config.get("styles", DEFAULT_CONFIG["styles"])
# Later...
add_safe_paragraph(clean_text, style_map["Title"])  # KeyError if "Title" missing

Threading & GUI Issues

def log(self, message):
    self.log_area.config(state='normal')
    self.log_area.insert(tk.END, message + "\n")
    # Called from worker thread - not thread-safe!

Tkinter is not thread-safe. Calling widget methods from a worker thread can cause:

Random crashes
Frozen GUI
Garbled text

You need to use root.after() or a queue-based approach.

Indentation Detection is Fragile

dash_index = raw_text.find("-")
indent_level = dash_index // 2

This assumes:

Exactly 2 spaces per indent level
No tabs
The dash position in the intermediate markdown accurately reflects the intended CSI level

If anything shifts the dash position (mixed tabs/spaces, different editor settings), your list hierarchy breaks.

File Handling Edge Cases

Scenario	Current Behavior
File locked by Word	Crash or silent failure
Non-UTF-8 encoding	Mojibake or crash
Path with special characters	Platform-dependent failure
Very large files (100+ pages)	Memory issues, no progress feedback
Corrupted .docx	Unhandled exception

What's Missing Entirely

Table preservation — Would require XML-level extraction and rebuild
Image handling — Need to extract from docx media folder, track positions
Run-level formatting — Bold/italic requires parsing <w:r> elements, not just paragraph text
Continuation paragraphs — Text that belongs to a list item but has no label
Spec note filtering by bracket syntax — Many masters use [specifier note text] inline
Multi-section files — Some masters combine multiple spec sections in one file
Validation/preview — No way to see what will change before committing
Undo/rollback — If something goes wrong, you've overwritten your work

Recommendations for Improvement

Short-term (Keep current architecture)

Add style existence validation before processing
Fix threading with a proper queue
Add more regex patterns for edge cases
Make IGNORED_STYLES configurable in JSON
Add a "dry run" mode that shows what would happen

Medium-term (Better architecture)

Skip markdown entirely—work at the python-docx XML level
Copy paragraph-by-paragraph while transforming styles
Preserve tables and images by copying elements directly

Long-term (Production-ready)

Use your SpecCleanse approach—work at the ZIP/XML level
Build a proper AST representation of CSI structure
Implement surgical style remapping without content extraction
Add a diffing/preview system

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
check_styles.py		check_styles.py
project_config.json		project_config.json
requirements.txt		requirements.txt
spec_tool_gui.py		spec_tool_gui.py

Folders and files

Latest commit

History

Repository files navigation

Spec Automation Tool

🎯 What It Does

🚀 Quick Start

Prerequisites

Installation

First Run

📖 How to Use

Step 1: Prepare Your Files

Step 2: Configure Style Mappings (First Time Only)

Step 3: Run the Tool

⚙️ Configuration Guide

Understanding project_config.json

Finding Style Names

Customizing Filtering

🛠️ Advanced Usage

Processing Individual Files

Creating Project-Specific Configs

🐛 Troubleshooting

"No .docx files found in Master folder"

"Template not found"

Formatting looks wrong

Content is missing

Program freezes

📝 File Structure

🔄 Workflow Integration

🎓 Background

💡 Tips

📄 License

Copyright Notice

Critical Architectural Limitation

Regex-Based Structure Detection Failures

Part Detection Issues

Article Detection Issues

List Label Detection Issues

Style Dependency Issues

Hard-coded Ignored Styles

Silent Style Fallback

Config Validation Gaps

Threading & GUI Issues

Indentation Detection is Fragile

File Handling Edge Cases

What's Missing Entirely

Recommendations for Improvement

Short-term (Keep current architecture)

Medium-term (Better architecture)

Long-term (Production-ready)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Understanding `project_config.json`

Packages