langchain-xparse

LangChain integration with xParse Parse API for intelligent document parsing. Converts unstructured documents (PDF, images, Word, Excel, PPT, etc.) into AI-friendly structured data (JSON, Markdown) with rich metadata.

Installation

From PyPI:

pip install langchain-xparse

Configuration

Set your TextIn credentials (available from the TextIn Workspace) as environment variables:

export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"

Or pass them when creating the loader:

from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)
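
If you prefer to fail early when a credential is missing, the two options above can be combined. The following is a minimal sketch, not part of langchain-xparse: `xparse_credentials` is a hypothetical helper that reads the two environment variables named above and returns them as keyword arguments for the loader.

```python
import os
from typing import Mapping


def xparse_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Build loader keyword arguments from the environment.

    Hypothetical helper for illustration; it is not provided by langchain-xparse.
    """
    required = ("XPARSE_APP_ID", "XPARSE_SECRET_CODE")
    # Treat unset and empty variables the same way.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "app_id": env["XPARSE_APP_ID"],
        "secret_code": env["XPARSE_SECRET_CODE"],
    }
```

The result can then be unpacked into the constructor, for example `XParseLoader(file_path="doc.pdf", **xparse_credentials())`.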

Usage

Basic Usage

from langchain_xparse import XParseLoader

loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number

Lazy Load

for doc in loader.lazy_load():
    # process each document
    print(doc.page_content[:100])
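
Lazy loading pairs well with batching, for example when documents are written to a vector store in chunks. The sketch below is an illustration under that assumption: `in_batches` is a hypothetical helper built on the standard library, and it works with any iterator, including the one returned by `lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole input.

    Hypothetical helper for illustration; it is not provided by langchain-xparse.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items from the shared iterator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With a loader in hand, `for batch in in_batches(loader.lazy_load(), 50): ...` would process documents 50 at a time.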

Async Load

# Run this inside an async function (for example one started with asyncio.run)
async for doc in loader.alazy_load():
    # process each document asynchronously
    print(doc.page_content[:100])
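
Because `async for` is only valid inside a coroutine, a synchronous script needs a small wrapper to drive `alazy_load()`. The sketch below shows one way to do that. `FakeLoader` is a stand-in defined here so the example runs without credentials; with the real package you would pass an `XParseLoader` instance to `collect` instead. Both `FakeLoader` and `collect` are hypothetical names.

```python
import asyncio
from types import SimpleNamespace


class FakeLoader:
    """Stand-in for XParseLoader; it only mimics the alazy_load() method."""

    async def alazy_load(self):
        for text in ("first element", "second element"):
            await asyncio.sleep(0)  # yield control, as a network call would
            yield SimpleNamespace(page_content=text, metadata={})


async def collect(loader, limit=None):
    """Gather documents from loader.alazy_load(), optionally stopping early."""
    docs = []
    async for doc in loader.alazy_load():
        docs.append(doc)
        if limit is not None and len(docs) >= limit:
            break
    return docs


# Entry point for a synchronous script.
docs = asyncio.run(collect(FakeLoader()))
```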

Custom Parse Configuration

Customize parsing behavior with the config parameter. See the Parse Config documentation (listed under References) for details.

loader = XParseLoader(
    file_path="doc.pdf",
    config={
        "document": {
            "password": "pdf-password"  # For encrypted PDFs
        },
        "capabilities": {
            "include_hierarchy": True,         # Include parent-child relationships
            "include_inline_objects": True,    # Extract formulas, handwriting, etc.
            "include_table_structure": True,   # Detailed table structure
            "include_char_details": True,      # Character-level details
            "include_image_data": True,        # Image URLs and data
            "pages": True,                     # Page metadata
            "title_tree": True,                # Document outline/TOC
            "table_view": "html"               # Table format: "html" or "markdown"
        },
        "scope": {
            "page_range": "1-10"               # Process specific pages
        },
        "config": {
            "force_engine": "textin",          # Engine selection (expert mode)
            "engine_params": {
                "formula_level": 0,
                "image_output_type": "url"
            }
        }
    }
)
docs = loader.load()
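
Most calls need only a few of these keys, so it can be convenient to assemble the dictionary programmatically. The following is a minimal sketch that uses only key names shown in the example above. `build_config` is a hypothetical helper, and the validation of `table_view` relies on the two values listed in the comment above ("html" or "markdown").

```python
from typing import Optional

ALLOWED_TABLE_VIEWS = {"html", "markdown"}


def build_config(
    page_range: Optional[str] = None,
    password: Optional[str] = None,
    table_view: str = "html",
    include_hierarchy: bool = False,
) -> dict:
    """Assemble a config dict, leaving out sections that were not requested.

    Hypothetical helper for illustration; it is not provided by langchain-xparse.
    """
    if table_view not in ALLOWED_TABLE_VIEWS:
        raise ValueError(f"table_view must be one of {sorted(ALLOWED_TABLE_VIEWS)}")
    config = {
        "capabilities": {
            "table_view": table_view,
            "include_hierarchy": include_hierarchy,
        }
    }
    if password is not None:
        config["document"] = {"password": password}
    if page_range is not None:
        config["scope"] = {"page_range": page_range}
    return config
```

The result would be passed as `XParseLoader(file_path="doc.pdf", config=build_config(page_range="1-10"))`.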

Multiple Files

loader = XParseLoader(file_path=["a.pdf", "b.pdf", "c.docx"])
for doc in loader.lazy_load():
    print(f"{doc.metadata.get('source')}: {doc.page_content[:50]}")
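
When several files are loaded at once, the resulting documents arrive as one stream, so grouping them by their `source` metadata is a common next step. The sketch below assumes each document exposes a `metadata` dict with a `source` key, as described in this README. `group_by_source` is a hypothetical helper, and the sample objects are stand-ins for real documents.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_source(docs):
    """Map each source value to the list of documents that came from it.

    Hypothetical helper for illustration; it is not provided by langchain-xparse.
    """
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("source", "unknown")].append(doc)
    return dict(grouped)


# Stand-in documents; real ones would come from loader.lazy_load().
sample = [
    SimpleNamespace(page_content="Intro", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="Table", metadata={"source": "b.pdf"}),
    SimpleNamespace(page_content="Summary", metadata={"source": "a.pdf"}),
]
grouped = group_by_source(sample)
```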

File-like Object

When passing a file-like object instead of a path, you must set metadata_filename:

with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    docs = loader.load()
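
The same pattern can apply to content that is already in memory, such as an uploaded file. The sketch below assumes that the loader accepts any binary file-like object, including `io.BytesIO`; this README does not confirm that. `file_kwargs` is a hypothetical helper that wraps the bytes and derives `metadata_filename` from the supplied name.

```python
import io
from pathlib import PurePath


def file_kwargs(data: bytes, name: str) -> dict:
    """Wrap raw bytes as a file-like object plus the filename the loader requires.

    Hypothetical helper for illustration; it is not provided by langchain-xparse.
    """
    filename = PurePath(name).name  # drop any directory components
    if not filename:
        raise ValueError("a non-empty filename is required")
    return {"file": io.BytesIO(data), "metadata_filename": filename}
```

Under that assumption, usage would look like `XParseLoader(**file_kwargs(data, "uploads/report.pdf"))`.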

Document Metadata

Each loaded document includes rich metadata:

source: File path or filename
category: Element type (Title, NarrativeText, Table, Image, Formula, etc.)
element_id: Unique element identifier
filename: Original filename
page_number: Page number (if available)
parent_id: Parent element ID (with include_hierarchy)
children_ids: Child element IDs (with include_hierarchy)
Additional element-specific metadata
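
These fields make it straightforward to select elements after loading. The following sketch filters by `category` and looks up the children of an element through `parent_id`. Both helpers are hypothetical, the sample objects are stand-ins for real documents, and the hierarchy fields are only present when `include_hierarchy` is enabled.

```python
from types import SimpleNamespace


def by_category(docs, category):
    """Return the documents whose metadata category equals `category`."""
    return [d for d in docs if d.metadata.get("category") == category]


def children_of(docs, parent_element_id):
    """Return the documents whose parent_id points at the given element_id."""
    return [d for d in docs if d.metadata.get("parent_id") == parent_element_id]


# Stand-in documents; real ones would come from loader.load().
sample = [
    SimpleNamespace(page_content="Results", metadata={
        "category": "Title", "element_id": "e1", "page_number": 1}),
    SimpleNamespace(page_content="Accuracy improved.", metadata={
        "category": "NarrativeText", "element_id": "e2",
        "parent_id": "e1", "page_number": 1}),
    SimpleNamespace(page_content="<table>...</table>", metadata={
        "category": "Table", "element_id": "e3",
        "parent_id": "e1", "page_number": 2}),
]
tables = by_category(sample, "Table")
section = children_of(sample, "e1")
```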

References

xParse Parse API - API endpoint documentation
Parse Config - Configuration parameters
Parse Response - Response structure and fields
