LangChain integration with xParse Parse API for intelligent document parsing. Converts unstructured documents (PDF, images, Word, Excel, PPT, etc.) into AI-friendly structured data (JSON, Markdown) with rich metadata.
From PyPI:
pip install langchain-xparse

Set your TextIn credentials (from the TextIn Workspace):
export XPARSE_APP_ID="your-app-id"
export XPARSE_SECRET_CODE="your-secret-code"

Or pass them when creating the loader:
from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="doc.pdf",
    app_id="your-app-id",
    secret_code="your-secret-code",
)

Load a document and inspect the result:
from langchain_xparse import XParseLoader
loader = XParseLoader(file_path="example.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata)  # source, category, element_id, filename, page_number

Use lazy_load() to process documents one at a time:
for doc in loader.lazy_load():
    # process each document
    print(doc.page_content[:100])

For asynchronous code, use alazy_load() inside a coroutine:
import asyncio

async def main():
    async for doc in loader.alazy_load():
        # process each document asynchronously
        print(doc.page_content[:100])

asyncio.run(main())

Customize parsing behavior using the config parameter. See Parse Config Documentation for details.
loader = XParseLoader(
    file_path="doc.pdf",
    config={
        "document": {
            "password": "pdf-password"  # For encrypted PDFs
        },
        "capabilities": {
            "include_hierarchy": True,  # Include parent-child relationships
            "include_inline_objects": True,  # Extract formulas, handwriting, etc.
            "include_table_structure": True,  # Detailed table structure
            "include_char_details": True,  # Character-level details
            "include_image_data": True,  # Image URLs and data
            "pages": True,  # Page metadata
            "title_tree": True,  # Document outline/TOC
            "table_view": "html"  # Table format: "html" or "markdown"
        },
        "scope": {
            "page_range": "1-10"  # Process specific pages
        },
        "config": {
            "force_engine": "textin",  # Engine selection (expert mode)
            "engine_params": {
                "formula_level": 0,
                "image_output_type": "url"
            }
        }
    }
)
docs = loader.load()

Pass a list of paths to load several files at once:
loader = XParseLoader(file_path=["a.pdf", "b.pdf", "c.docx"])
for doc in loader.lazy_load():
    print(f"{doc.metadata.get('source')}: {doc.page_content[:50]}")

When passing a file-like object instead of a path, you must set metadata_filename:
with open("doc.pdf", "rb") as f:
    loader = XParseLoader(file=f, metadata_filename="doc.pdf")
    docs = loader.load()

Each loaded document includes rich metadata:
- source: File path or filename
- category: Element type (Title, NarrativeText, Table, Image, Formula, etc.)
- element_id: Unique element identifier
- filename: Original filename
- page_number: Page number (if available)
- parent_id: Parent element ID (with include_hierarchy)
- children_ids: Child element IDs (with include_hierarchy)
- Additional element-specific metadata
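
The metadata fields listed here make it straightforward to filter or regroup elements after loading. The following is a minimal sketch, assuming each loaded document exposes its metadata as a plain dict with the keys listed. SampleDoc, by_category, and children_of are hypothetical names used only for illustration, and the sample values are made up rather than real parser output.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SampleDoc:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def by_category(docs):
    """Group documents by their 'category' metadata value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("category", "Unknown")].append(doc)
    return dict(groups)


def children_of(docs, parent_element_id):
    """Return documents whose 'parent_id' matches the given element_id."""
    return [d for d in docs if d.metadata.get("parent_id") == parent_element_id]


# Made-up sample elements; real values would come from loader.load().
docs = [
    SampleDoc("Quarterly Report",
              {"category": "Title", "element_id": "e1", "page_number": 1}),
    SampleDoc("Revenue grew.",
              {"category": "NarrativeText", "element_id": "e2",
               "page_number": 1, "parent_id": "e1"}),
    SampleDoc("<table>...</table>",
              {"category": "Table", "element_id": "e3",
               "page_number": 2, "parent_id": "e1"}),
]

groups = by_category(docs)
print(sorted(groups))
print([d.metadata["element_id"] for d in children_of(docs, "e1")])
```

With real output, the list returned by loader.load() can be passed in place of the sample list. Note that parent_id is only present when include_hierarchy is enabled, which is why the helper uses .get() rather than direct key access.
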

Related documentation:
- xParse Parse API - API endpoint documentation
- Parse Config - Configuration parameters
- Parse Response - Response structure and fields
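
The nested config shown earlier can also be assembled programmatically, which may help when only a few options vary between runs. The following is a minimal sketch, assuming the config is an ordinary nested dict with the document, capabilities, and scope keys used in the earlier example. build_config is a hypothetical helper, not part of langchain-xparse, and it does not validate option names against the Parse Config documentation.

```python
def build_config(page_range=None, password=None, **capabilities):
    """Assemble a nested config dict, omitting sections that are not set."""
    config = {}
    if password is not None:
        # Password for encrypted PDFs goes under "document".
        config["document"] = {"password": password}
    if capabilities:
        # Any extra keyword arguments are treated as capability flags.
        config["capabilities"] = dict(capabilities)
    if page_range is not None:
        # Restrict processing to specific pages.
        config["scope"] = {"page_range": page_range}
    return config


config = build_config(
    page_range="1-10",
    include_hierarchy=True,
    table_view="markdown",
)
print(config)
```

The resulting dict can then be passed as the config argument, for example XParseLoader(file_path="doc.pdf", config=config).
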