A Python-based tool to parse structured Word documents containing product specifications and convert them to CSV format.
This system is designed to parse Word documents with a specific hierarchical structure of product information and export the data to CSV files. It extracts data based on formatting cues like text color, numbering patterns, and specific keywords.
The system consists of:
- Word Parser Module (word_parser.py): A command-line tool for parsing Word documents
- Converter Application (converter_app.py): A GUI-based application for batch processing multiple files
- Main Script (main.py): A wrapper script to run either the parser or the converter
- Python 3.7 or higher
- Required Python packages:
  - python-docx: For parsing Word documents
  - tkinter: For the GUI application (usually comes with Python)
pip install python-docx
The easiest way to use the tool is through the GUI application:
python main.py
# OR
python main.py --gui
This will open the converter application where you can:
- Add one or more Word files for processing
- Select an output directory for the CSV files
- Enable debug mode for additional information
- Start the conversion process
- View the conversion log
For scripting or automation, you can use the command-line parser directly:
python main.py --cli
# OR
python word_parser.py -i input.docx -o output.csv [--debug]
Command-line options:
- -i, --input: Input Word document file path (default: specifications_catalog.docx)
- -o, --output: Output CSV file path (default: specifications_catalog.csv)
- --debug: Print debug information during parsing
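The options above map naturally onto Python's standard argparse module. The following is a minimal sketch of how such a command line could be declared, not the actual code in word_parser.py; the function name build_parser and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical helper, not from word_parser.py)."""
    parser = argparse.ArgumentParser(
        description="Parse a structured Word document and export it to CSV."
    )
    parser.add_argument(
        "-i", "--input",
        default="specifications_catalog.docx",
        help="Input Word document file path",
    )
    parser.add_argument(
        "-o", "--output",
        default="specifications_catalog.csv",
        help="Output CSV file path",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Print debug information during parsing",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "input.docx", "-o", "output.csv", "--debug"])
print(args.input, args.output, args.debug)
```

Omitting every option falls back to the documented defaults, which is convenient when the script is run from the folder that holds the catalog.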
The parser extracts the following hierarchical data from the Word document:
- Group_title: Top-level category in UPPERCASE (e.g., "MECHANISCHE SLOTEN")
- Subgroup_title: Second level with number format "00.00.00" (e.g., "00.00.00 Mechanische éénpuntsloten")
- Item_title_NL: Specific product category with number and brand (e.g., "00.00.00 Standaard klavierslot... |FH| st Litto")
- Description_NL: Detailed text description of the product category
- LongDescription: Specific product description in purple text
- Item_Number: Reference code (e.g., "A13E1")
- Brand: Brand name extracted from Item_title_NL (e.g., "Litto")
- Measuring_State: Special format text (e.g., "|FH| st")
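To make the output shape concrete, here is a small sketch that writes one record with these fields using the standard csv module. The column order, the helper name write_rows, and the placeholder description strings are assumptions; the real exporter may order or name things differently.

```python
import csv
import io

# Column names taken from the field list above; the order is an assumption.
FIELDNAMES = [
    "Group_title",
    "Subgroup_title",
    "Item_title_NL",
    "Description_NL",
    "LongDescription",
    "Item_Number",
    "Brand",
    "Measuring_State",
]


def write_rows(rows, stream):
    """Write dict records to an open text stream (hypothetical helper)."""
    # restval="" leaves a column empty when a record lacks that field.
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# One record built from the examples in the list above; the two description
# values are placeholders, not real catalog text.
sample_rows = [
    {
        "Group_title": "MECHANISCHE SLOTEN",
        "Subgroup_title": "00.00.00 Mechanische éénpuntsloten",
        "Item_title_NL": "00.00.00 Standaard klavierslot... |FH| st Litto",
        "Description_NL": "<description text>",
        "LongDescription": "<purple text>",
        "Item_Number": "A13E1",
        "Brand": "Litto",
        "Measuring_State": "|FH| st",
    }
]

buffer = io.StringIO()
write_rows(sample_rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with newline="" and an explicit encoding such as utf-8 so that accented characters like "é" survive and no blank lines appear between rows on Windows.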
For proper parsing, the Word document should follow this structure:
- Group_title: All capital letters, no numbering
- Subgroup_title: Starts with "00.00.00" but doesn't contain "|FH|"
- Item_title_NL: Starts with "00.00.00" and contains "|FH| st" followed by the brand name
- Description_NL: Regular text below the Item_title_NL
- LongDescription: Purple text just above an Item_Number
- Item_Number: Contains "Referentie : " followed by a code and "of equivalent"
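The text-based rules above can be sketched with a few regular expressions. This is an illustrative approximation under stated assumptions, not the parser's actual logic: the function names are hypothetical, the number prefix is assumed to be three two-digit groups separated by dots, and the colour-dependent rules (Description_NL and LongDescription) are left out because they need formatting information rather than plain text.

```python
import re

# Assumed shape of the "00.00.00" prefix: three two-digit groups.
NUMBER_PREFIX = re.compile(r"^\d{2}\.\d{2}\.\d{2}\b")
# "|FH| st" followed by the brand name at the end of the line.
ITEM_TITLE = re.compile(r"(?P<state>\|FH\|\s*st)\s+(?P<brand>.+)$")
# "Referentie : <code> of equivalent"
REFERENCE = re.compile(r"Referentie\s*:\s*(?P<code>\S+)\s+of equivalent")


def classify(text: str) -> str:
    """Return a rough label for one paragraph of text (hypothetical helper)."""
    text = text.strip()
    if not text:
        return "empty"
    if NUMBER_PREFIX.match(text):
        if ITEM_TITLE.search(text):
            return "item_title"
        if "|FH|" not in text:
            return "subgroup_title"
        return "other"
    if REFERENCE.search(text):
        return "item_number"
    if text.isupper():
        return "group_title"
    return "other"


def parse_item_title(text: str):
    """Return (brand, measuring_state), or (None, None) if the pattern is absent."""
    match = ITEM_TITLE.search(text.strip())
    if match is None:
        return None, None
    return match.group("brand").strip(), match.group("state")


def extract_item_number(text: str):
    """Return the reference code, or None if the line is not a reference line."""
    match = REFERENCE.search(text)
    return match.group("code") if match else None


title = "00.00.00 Standaard klavierslot... |FH| st Litto"
print(classify(title), parse_item_title(title))
print(extract_item_number("Referentie : A13E1 of equivalent"))
```

Checking for the numbered prefix before the uppercase test keeps numbered headings from being mistaken for group titles if they happen to be written in capitals.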
- No Purple Text Detected:
  - The exact RGB values for purple may need adjustment. Modify the is_purple_text function in word_parser.py
- Parsing Structure Incorrect:
  - If the document structure varies slightly, you may need to adjust the patterns in the parsing functions
- Brand or Measuring State Not Extracted:
  - Check if the format matches the expected pattern "|FH| st Brand"
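For the purple-text issue, one option is to accept a range of purples rather than a single exact RGB value. The sketch below shows that idea on plain (red, green, blue) tuples; it is not the is_purple_text function from word_parser.py, and the name looks_purple and both thresholds are assumptions to tune against your own documents. In python-docx, a run's colour is available as run.font.color.rgb, which is None when no explicit RGB colour is set; the sketch assumes that value is converted to a three-item tuple before being passed in.

```python
from typing import Optional, Tuple


def looks_purple(
    rgb: Optional[Tuple[int, int, int]],
    min_red_blue: int = 96,
    max_green_ratio: float = 0.6,
) -> bool:
    """Heuristic purple check (hypothetical helper; thresholds are assumptions).

    A colour counts as purple when red and blue are both reasonably strong
    and green is clearly weaker than either of them.
    """
    if rgb is None:
        # No explicit colour on the run, so it cannot be purple text.
        return False
    red, green, blue = rgb
    return (
        red >= min_red_blue
        and blue >= min_red_blue
        and green <= max_green_ratio * min(red, blue)
    )


# A few colours to sanity-check the thresholds.
for colour in [(112, 48, 160), (128, 0, 128), (0, 0, 0), (255, 0, 0), None]:
    print(colour, looks_purple(colour))
```

If purple text is still missed, printing the raw colour of each run in debug mode shows which values the document actually uses, and the thresholds can then be adjusted to match.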
This software is distributed under the MIT license.
- Your Name - Initial development