A GUI application for automatically formatting MEP construction specifications to match architect-provided templates. This tool extracts content from office master specifications, intelligently filters out editing instructions and specifier notes, and rebuilds the documents with the exact formatting DNA from your project's architectural template.
The Spec Automation Tool solves a common problem in construction documentation: taking your office's master specifications and reformatting them to match the specific style requirements of each project's architect.
The Process:
- Extract → Reads your office master .docx files and converts content to markdown, automatically filtering out:
  - Specifier notes and editing instructions
  - Copyright notices and metadata
  - Any content that shouldn't appear in the final specification
- Reformat → Rebuilds the specification using the architect's template, applying:
  - Custom heading styles (CSI Level 0, Level 1, Level 2, etc.)
  - List formatting with proper indentation
  - Page layout, margins, headers, and footers from the template
- Output → Produces clean, properly formatted specifications ready for coordination and printing
- Python 3.7+ installed on your Windows machine
- pip (Python package installer)
- Install Required Libraries:
  pip install python-docx
- Download the Tool:
  - Save spec_tool_gui.py to your preferred location
  - Save project_config.json to the same folder
  - Save check_styles.py (optional utility) to the same folder
Double-click spec_tool_gui.py or run from command line:
python spec_tool_gui.py

You'll need:
- Office Masters Folder → Directory containing your office's master specification .docx files
- Architect Template → The .docx file provided by the architect with their formatting
- Output Folder → Where you want the formatted specifications saved
Before running the tool on a new project, you need to identify what style names the architect is using:
- Run the check_styles.py utility:
  python check_styles.py
  This will print all paragraph styles found in the architect's template.
- Open project_config.json and update the style names to match:
  {
    "styles": {
      "Title": "CSILevel0",     ← Update these
      "Part": "CSILevel1",      ← with actual
      "Article": "CSILevel2"    ← style names
    },
    "list_levels": [
      "CSILevel3",
      "CSILevel4",
      "CSILevel5"
    ]
  }
- Launch spec_tool_gui.py
- Click Browse buttons to select:
  - Office Masters Folder
  - Architect Template (.docx)
  - Project Config (.json)
  - Output Folder
- Click 🚀 PROCESS ALL SPECS
- Watch the log window for progress
- Find your formatted specifications in the Output Folder
{
"styles": {
"Title": "CSILevel0", // Section headers (SECTION 23 21 13)
"Part": "CSILevel1", // Part headers (PART 1 - GENERAL)
"Article": "CSILevel2" // Article headers (1.01 SUMMARY)
},
"list_levels": [
"CSILevel3", // First indent (A. Item)
"CSILevel4", // Second indent (1. Subitem)
"CSILevel5", // Third indent (a. Detail)
"CSILevel6", // Fourth indent (1) Further detail)
"CSILevel7" // Fifth indent (a) Deepest level)
],
"options": {
"strip_heading_numbers": true, // Remove "1.01" from "1.01 SUMMARY"
"strip_list_labels": true // Remove "A." from "A. List item"
}
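}

The // comments above are annotations only. Standard JSON does not allow comments, so leave them out of the actual file.

Use check_styles.py to inspect any Word document: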
python check_styles.py

Make sure the architect_style.docx path in the script points to the correct template file, or modify the script to accept command-line arguments.
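A minimal sketch of the command-line variant, assuming check_styles.py does nothing more than list paragraph styles. The function names and the template argument are illustrative and not taken from the actual script; the python-docx calls used (Document, the styles collection, WD_STYLE_TYPE.PARAGRAPH) are part of its public API.

```python
import argparse


def build_parser():
    """Command-line interface: one positional argument, the .docx to inspect."""
    parser = argparse.ArgumentParser(
        description="List paragraph styles in a Word document."
    )
    parser.add_argument("template", help="path to the architect's .docx template")
    return parser


def paragraph_style_names(path):
    """Return the sorted names of all paragraph styles defined in the document."""
    # Imported here so that --help still works if python-docx is not installed.
    from docx import Document
    from docx.enum.style import WD_STYLE_TYPE

    doc = Document(path)
    return sorted(s.name for s in doc.styles if s.type == WD_STYLE_TYPE.PARAGRAPH)


def main(argv=None):
    """Parse arguments (sys.argv by default) and print one style name per line."""
    args = build_parser().parse_args(argv)
    for name in paragraph_style_names(args.template):
        print(name)
```

With main() called under the usual if __name__ == "__main__" guard, the utility would be run as python check_styles.py path/to/template.docx, and the hard-coded file name is no longer needed.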
In spec_tool_gui.py, you can customize what gets filtered out during extraction:
# Line 47-54
IGNORED_STYLES = [
"Specifier Note",
"Note",
"Instruction",
"Editing Instruction",
'CMT'
]
# Line 57-64
IGNORED_STARTS = [
"See Editing Instruction",
"Adjust list below",
"Retain ",
"Delete ",
"Edit ",
"Verify that Section titles"
]

Add or remove patterns based on what appears in your office masters.
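To make the effect of these two lists concrete, here is a sketch of how such a filter could be applied to each paragraph. It assumes exact style-name matching and prefix matching on the paragraph text; should_skip is a hypothetical helper, and the real logic in spec_tool_gui.py may differ.

```python
IGNORED_STYLES = ["Specifier Note", "Note", "Instruction", "Editing Instruction", "CMT"]
IGNORED_STARTS = [
    "See Editing Instruction",
    "Adjust list below",
    "Retain ",
    "Delete ",
    "Edit ",
    "Verify that Section titles",
]


def should_skip(style_name, text):
    """Return True if a paragraph should be dropped during extraction."""
    if style_name in IGNORED_STYLES:
        return True
    # str.startswith accepts a tuple and checks each prefix in turn.
    return text.strip().startswith(tuple(IGNORED_STARTS))
```

Note that short prefixes such as "Delete " or "Edit " would also match genuine specification text that happens to begin with those words, so each new pattern is worth testing against a real master before batch processing.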
While the GUI processes entire folders, you can modify the script for single-file processing by calling the functions directly:
from spec_tool_gui import extract_master_to_markdown, rebuild_from_markdown, load_config
config = load_config("project_config.json")
extract_master_to_markdown("master.docx", "temp.md", print)
rebuild_from_markdown("temp.md", "template.docx", "output.docx", config, print)

For projects with multiple architects or phases, create separate config files:
project_configs/
├── architect_a_config.json
├── architect_b_config.json
└── phase_2_config.json
Select the appropriate config when running the tool.
- Make sure you're pointing to a folder containing .docx files (not .doc)
- Check that files aren't hidden or in subfolders
- Ensure file names don't start with ~$ (temp files)
- Verify the architect template path is correct
- Make sure it's a .docx file (not .dotx template)
- Try copying the template to the same folder as the script
- Run check_styles.py to verify style names
- Compare project_config.json style names with template styles
- Check that strip_heading_numbers and strip_list_labels are set appropriately
- Check IGNORED_STYLES and IGNORED_STARTS lists
- Make sure legitimate content isn't being filtered
- Examine the temporary .md files (uncomment line 237 to keep them)
- Large files may take time to process
- Check the log window for errors
- Try processing one file manually to diagnose issues
spec-automation-tool/
├── spec_tool_gui.py # Main application
├── project_config.json # Style mapping configuration
├── check_styles.py # Utility to inspect Word styles
└── README.md # This file
Typical Project Workflow:
- Project Kickoff → Architect provides template with their style guide
- Style Analysis → Run check_styles.py on architect template
- Config Setup → Create/update project_config.json with correct style names
- Batch Processing → Use GUI to process all relevant specifications
- QC Review → Spot-check formatted specs for correctness
- Coordination → Proceed with normal specification coordination workflow
This tool implements the "extract → clean → rebuild" pattern for document formatting automation. It works by:
- Converting Word documents to an intermediate markdown representation
- Applying intelligent filtering rules during extraction
- Using the architect's template as a "formatting DNA" source
- Surgically rebuilding documents with the target formatting
The approach preserves content while completely replacing formatting, ensuring consistency across all project specifications.
- Test First: Always test with one specification before batch processing
- Keep Temps: Uncomment line 237 to keep markdown files for debugging
- Version Control: Keep project configs in version control with project files
- Backup: Keep original masters untouched; always work with copies
- Style Guide: Create a reference document showing what each style looks like
Copyright © 2025 Abraham Borg. All Rights Reserved.
This software and associated documentation files (the "Software") are the proprietary property of Abraham Borg.
Unauthorized copying, modification, distribution, or use of this Software, via any medium, is strictly prohibited without express written permission from the copyright holder.
This Software is provided for review and reference purposes only. No license or right to use, copy, modify, or distribute this Software for any purpose, commercial or non-commercial, is granted.
Markdown as the intermediate format is fundamentally lossy. This single design choice cascades into most of the issues below.
When you extract to markdown and rebuild, you permanently lose:
| Lost Element | Impact |
|---|---|
| Tables | Schedules, performance criteria, product tables—gone |
| Images | Diagrams, details, manufacturer logos—gone |
| Inline formatting | Bold, italic, underline within paragraphs—gone |
| Hyperlinks | URL references—gone |
| Track changes | Revision history—gone |
| Comments | Review notes—gone |
| Fields | Auto-numbering, cross-references, TOC entries—gone |
| Bookmarks | Internal document links—gone |
For MEP specs that include equipment schedules or piping diagrams, this is a dealbreaker.
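One way to find out ahead of time whether a given master is affected is to look inside the .docx, which is a ZIP archive, before running the tool on it. The sketch below is a rough pre-flight check, not part of the tool: scan_docx_for_lossy_content is a hypothetical name, and counting tags with a regular expression assumes the conventional w: namespace prefix that Word writes.

```python
import re
import zipfile


def scan_docx_for_lossy_content(path):
    """Count elements in a .docx that a text-only markdown round trip would drop."""
    with zipfile.ZipFile(path) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
        media = [n for n in zf.namelist() if n.startswith("word/media/")]
    return {
        # "[ >]" keeps <w:tblPr>, <w:tblGrid>, etc. from being counted as tables.
        "tables": len(re.findall(r"<w:tbl[ >]", xml)),
        "hyperlinks": len(re.findall(r"<w:hyperlink[ >]", xml)),
        "media_files": len(media),
    }
```

Any non-zero count suggests the section needs manual attention after reformatting.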
The current detection logic has several blind spots:
# Current: Requires "PART" AND a dash
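elif text.upper().startswith("PART") and "-" in text:

Will fail on:
- PART 1 GENERAL (no dash, common in older masters)
- Multi-line part titles where the number is separate from the title

Note that PART ONE - GENERAL (spelled out) does satisfy this condition as written. The looser problem is that any paragraph starting with "Part" and containing a hyphen satisfies it too, for example one beginning "Partitions - ...".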
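A stricter pattern could handle the missing-dash and spelled-out cases while rejecting words that merely begin with "Part". This is a sketch only: PART_RE and parse_part are hypothetical names, and limiting the spelled-out numbers to ONE through THREE assumes the usual three-part section format.

```python
import re

# "PART" + a number (digits or a spelled-out word) + optional dash + optional title.
PART_RE = re.compile(
    r"^PART\s+(\d+|ONE|TWO|THREE)\b\s*(?:[-\u2013\u2014]\s*)?(.*)$",
    re.IGNORECASE,
)


def parse_part(text):
    """Return (number, title) for a part heading, or None for anything else."""
    match = PART_RE.match(text.strip())
    if match is None:
        return None
    return match.group(1).upper(), match.group(2).strip()
```

A heading whose title sits on the following paragraph comes back with an empty title, so the caller can decide whether to merge the next line into it.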
# Current: Article detection
elif re.match(r'^\d+\.\d+\s', text):

Will fail on:
- 1.01 with no space after (tight formatting)
- Sub-articles like 2.01.A or 2.01.1 (nested numbering)

1.1 SUMMARY (single digit after the decimal) does match this pattern, because \d+ accepts a single digit.
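A more tolerant article pattern might make the whitespace optional and accept nested suffixes. The sketch below uses hypothetical names (ARTICLE_RE, parse_article) and trades strictness for coverage: with the whitespace requirement relaxed, body text that begins with a decimal number, such as a dimension, would also match, so it is best combined with a check on the paragraph's style.

```python
import re

# Two numeric groups, optional further ".x" groups, then optional whitespace and title.
ARTICLE_RE = re.compile(r"^(\d+\.\d+(?:\.[A-Za-z0-9]+)*)\s*(.*)$")


def parse_article(text):
    """Return (number, title) for an article heading, or None if there is no match."""
    match = ARTICLE_RE.match(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()
```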
# Current: List label detection
if re.match(r'^[A-Z]\.\s', text): # Level 1
elif re.match(r'^\d+\.\s', text): # Level 2
elif re.match(r'^[a-z]\.\s', text): # Level 3
elif re.match(r'^\d+\)\s', text): # Level 4
elif re.match(r'^[a-z]\)\s', text): # Level 5

Will fail on:
- CSI Level 6: (1), (2), etc.
- CSI Level 7: (a), (b), etc.
- Roman numerals: i., ii., I., II. (single-letter forms are misread as letter labels; longer ones match nothing)
- Double letters: AA., BB. (overflow labels)
- Continuation paragraphs (text that belongs to the previous list item but has no label)
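Several of these gaps can be closed with a table of patterns checked in order. The following is a sketch under stated assumptions: LABEL_PATTERNS and detect_list_level are hypothetical names, overflow labels such as AA. are treated as level 1, and an unlabeled paragraph inherits the previous level. Roman numerals remain ambiguous with letter labels and are not resolved here.

```python
import re

# Ordered (pattern, level) pairs; the first match wins.
LABEL_PATTERNS = [
    (re.compile(r"^[A-Z]{1,2}\.\s"), 1),  # A.  or overflow AA.
    (re.compile(r"^\d+\.\s"), 2),         # 1.
    (re.compile(r"^[a-z]\.\s"), 3),       # a.
    (re.compile(r"^\d+\)\s"), 4),         # 1)
    (re.compile(r"^[a-z]\)\s"), 5),       # a)
    (re.compile(r"^\(\d+\)\s"), 6),       # (1)
    (re.compile(r"^\([a-z]\)\s"), 7),     # (a)
]


def detect_list_level(text, previous_level=None):
    """Return the list level for a labeled paragraph, or previous_level if unlabeled."""
    for pattern, level in LABEL_PATTERNS:
        if pattern.match(text):
            return level
    # No label: treat as a continuation of the previous item, if there is one.
    return previous_level
```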
IGNORED_STYLES = [
"Specifier Note",
"Note",
"Instruction",
"Editing Instruction",
'CMT'
]

Different master file publishers use different style names:
- BSD: SpecNote, EditNote
- ARCOM: Specifier Note, SpN
- Custom masters: Anything goes
This requires per-publisher configuration, not hard-coding.
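A sketch of what reading the list from the project config could look like. The "ignored_styles" key and the load_ignored_styles helper are assumptions for illustration; the current project_config.json has no such key.

```python
import json

DEFAULT_IGNORED_STYLES = [
    "Specifier Note",
    "Note",
    "Instruction",
    "Editing Instruction",
    "CMT",
]


def load_ignored_styles(config_path):
    """Read ignored style names from the config, falling back to the built-in list."""
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)
    styles = config.get("ignored_styles", DEFAULT_IGNORED_STYLES)
    if not isinstance(styles, list) or not all(isinstance(s, str) for s in styles):
        raise ValueError('"ignored_styles" must be a list of strings')
    return set(styles)
```

A config for one publisher's masters would then carry its own list, for example "ignored_styles": ["SpecNote", "EditNote"], without any change to the script.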
def add_safe_paragraph(text, style_name):
    try:
        doc.add_paragraph(text, style=style_name)
    except KeyError:
        doc.add_paragraph(text, style='Normal') # Silent failure!

If your config references CSILevel6 but the template only has up to CSILevel5, you'll get Normal style with no warning. The output will look wrong and you won't know why.
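One possible variant keeps the fallback but reports it once per missing style. This is a sketch, not the tool's code: it passes doc, a log callback, and a missing_styles set explicitly (the original helper takes only text and style_name), and it relies on the same KeyError behavior the existing code already catches.

```python
def add_safe_paragraph(doc, text, style_name, log, missing_styles):
    """Add a paragraph, falling back to 'Normal' and logging the first miss per style."""
    try:
        return doc.add_paragraph(text, style=style_name)
    except KeyError:
        if style_name not in missing_styles:
            missing_styles.add(style_name)
            log(f"WARNING: style '{style_name}' not found in template; using 'Normal'")
        return doc.add_paragraph(text, style="Normal")
```

After a run, a non-empty missing_styles set is a direct list of the config entries to fix.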
Your project_config.json references styles like CSILevel0 through CSILevel7, but:
- No validation that these styles exist in the template
- No schema validation for the JSON structure
- KeyError risk if required keys are missing:

style_map = config.get("styles", DEFAULT_CONFIG["styles"])
# Later...
add_safe_paragraph(clean_text, style_map["Title"]) # KeyError if "Title" missing

def log(self, message):
    self.log_area.config(state='normal')
    self.log_area.insert(tk.END, message + "\n")
    # Called from worker thread - not thread-safe!

Tkinter is not thread-safe. Calling widget methods from a worker thread can cause:
- Random crashes
- Frozen GUI
- Garbled text
You need to use root.after() or a queue-based approach.
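A sketch of the queue-based approach: worker threads only put strings on a queue, and a method scheduled with after() on the GUI thread writes them to the widget. QueueLogger is a hypothetical class; it assumes log_area behaves like a tk.Text widget that is kept disabled between writes, and root is the Tk root window.

```python
import queue


class QueueLogger:
    """Collect log messages from any thread; write them to the widget on the GUI thread."""

    def __init__(self, root, log_area, poll_ms=100):
        self.root = root          # tk.Tk, or any widget that provides .after()
        self.log_area = log_area  # tk.Text-like widget
        self.poll_ms = poll_ms
        self.messages = queue.Queue()

    def log(self, message):
        """Safe to call from a worker thread: it only touches the queue."""
        self.messages.put(message)

    def drain(self):
        """Runs on the GUI thread: flush queued messages, then reschedule itself."""
        while True:
            try:
                message = self.messages.get_nowait()
            except queue.Empty:
                break
            self.log_area.config(state="normal")
            self.log_area.insert("end", message + "\n")  # "end" is the value of tk.END
            self.log_area.see("end")
            self.log_area.config(state="disabled")
        self.root.after(self.poll_ms, self.drain)
```

The logger would be created on the main thread, drain() called once before mainloop() starts, and logger.log passed to the worker in place of the direct widget call.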
dash_index = raw_text.find("-")
indent_level = dash_index // 2

This assumes:
- Exactly 2 spaces per indent level
- No tabs
- The dash position in the intermediate markdown accurately reflects the intended CSI level
If anything shifts the dash position (mixed tabs/spaces, different editor settings), your list hierarchy breaks.
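A slightly more defensive version could normalize tabs before counting leading whitespace. This sketch introduces a hypothetical indent_level function with spaces_per_level and tab_width parameters; it still depends on the intermediate markdown being indented consistently, so it narrows the problem rather than removing it.

```python
def indent_level(raw_text, spaces_per_level=2, tab_width=2):
    """Derive the list level from leading whitespace after expanding tabs."""
    expanded = raw_text.expandtabs(tab_width)
    leading = len(expanded) - len(expanded.lstrip(" "))
    return leading // spaces_per_level
```

A sturdier design would have the extraction step write the level explicitly instead of encoding it in whitespace.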
| Scenario | Current Behavior |
|---|---|
| File locked by Word | Crash or silent failure |
| Non-UTF-8 encoding | Mojibake or crash |
| Path with special characters | Platform-dependent failure |
| Very large files (100+ pages) | Memory issues, no progress feedback |
| Corrupted .docx | Unhandled exception |
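Some of these scenarios can be contained by filtering the file list up front and isolating failures per file. The sketch below uses hypothetical names (find_master_files, process_all) and takes the per-file work as a process_one callable, since the tool's own processing function is not shown here.

```python
from pathlib import Path


def find_master_files(folder):
    """Return .docx files in folder, skipping Word's '~$' lock files."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".docx" and not p.name.startswith("~$")
    )


def process_all(folder, process_one, log):
    """Run process_one on each file; log failures and continue. Returns failed paths."""
    failed = []
    files = find_master_files(folder)
    for index, path in enumerate(files, start=1):
        log(f"[{index}/{len(files)}] {path.name}")
        try:
            process_one(path)
        except PermissionError:
            log(f"  Skipped: {path.name} appears to be locked (is it open in Word?)")
            failed.append(path)
        except Exception as exc:  # corrupted file, bad encoding, and so on
            log(f"  Failed: {path.name}: {exc}")
            failed.append(path)
    return failed
```

The per-file counter doubles as basic progress feedback for large batches.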
- Table preservation: Would require XML-level extraction and rebuild
- Image handling: Need to extract from docx media folder, track positions
- Run-level formatting: Bold/italic requires parsing <w:r> elements, not just paragraph text
- Continuation paragraphs: Text that belongs to a list item but has no label
- Spec note filtering by bracket syntax: Many masters use [specifier note text] inline
- Multi-section files: Some masters combine multiple spec sections in one file
- Validation/preview: No way to see what will change before committing
- Undo/rollback: If something goes wrong, you've overwritten your work
- Add style existence validation before processing
- Fix threading with a proper queue
- Add more regex patterns for edge cases
- Make IGNORED_STYLES configurable in JSON
- Add a "dry run" mode that shows what would happen
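The first item in the list above, style existence validation, could be as small as the sketch below. validate_config and REQUIRED_STYLE_KEYS are hypothetical names; template_style_names is assumed to be a collection of style names read from the architect's template beforehand, for example by the same logic check_styles.py uses.

```python
REQUIRED_STYLE_KEYS = ("Title", "Part", "Article")


def validate_config(config, template_style_names):
    """Return a list of human-readable problems; an empty list means the config is usable."""
    problems = []
    styles = config.get("styles", {})
    for key in REQUIRED_STYLE_KEYS:
        if key not in styles:
            problems.append(f'missing key: styles["{key}"]')
    referenced = list(styles.values()) + list(config.get("list_levels", []))
    for name in referenced:
        if name not in template_style_names:
            problems.append(f'style "{name}" not found in template')
    return problems
```

Running this before any file is processed, and refusing to start if the list is non-empty, turns a silent Normal-style fallback into an explicit error.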
- Skip markdown entirely—work at the python-docx XML level
- Copy paragraph-by-paragraph while transforming styles
- Preserve tables and images by copying elements directly
- Use your SpecCleanse approach—work at the ZIP/XML level
- Build a proper AST representation of CSI structure
- Implement surgical style remapping without content extraction
- Add a diffing/preview system