Official LangChain integration for the Ujeebu Extract API. Extract clean, structured content from news articles and blog posts for use with Large Language Models (LLMs) and AI applications.
- Easy Integration: Seamlessly integrate Ujeebu Extract API with LangChain agents and chains
- Document Loaders: Load articles as LangChain Documents for use with vector stores and retrievers
- Agent Tools: Use Ujeebu Extract as a tool in LangChain agents
- Rich Metadata: Extract article text, HTML, author, publication date, images, and more
- Quick Mode: Optional fast extraction mode (30-60% faster)
- Type Safe: Full type hints and Pydantic validation
Ujeebu Extract converts news and blog articles into clean, structured JSON data. It extracts:
- Clean article text and HTML
- Author and publication date
- Title and summary
- Images and media
- RSS feeds
- Site metadata
Perfect for RAG (Retrieval-Augmented Generation) applications, content analysis, and LLM training data.
pip install langchain-ujeebu

- Python 3.8 or higher
- LangChain 0.1.0 or higher
- An Ujeebu API key (Get one here)
export UJEEBU_API_KEY="your-api-key"

Or set it programmatically:
import os
os.environ["UJEEBU_API_KEY"] = "your-api-key"

from langchain_ujeebu import UjeebuExtractTool
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
# Initialize the tool
ujeebu_tool = UjeebuExtractTool()
# Create an agent
llm = ChatOpenAI(temperature=0)
agent = initialize_agent(
    tools=[ujeebu_tool],
    llm=llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True
)
# Use the agent
response = agent.invoke({
    "input": "Extract the article from https://example.com/article and summarize it"
})
print(response)

from langchain_ujeebu import UjeebuLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# Load articles
loader = UjeebuLoader(
    urls=[
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3"
    ]
)
documents = loader.load()
# Create a vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Query the documents
results = vectorstore.similarity_search("What are the main topics?")

from langchain_ujeebu import UjeebuExtractTool
tool = UjeebuExtractTool()
result = tool._run(
    url="https://example.com/article",
    text=True,
    author=True,
    pub_date=True
)
print(result)

from langchain_ujeebu import UjeebuExtractTool
tool = UjeebuExtractTool()
result = tool._run(
    url="https://example.com/article",
    images=True  # Extract article images
)

from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    quick_mode=True  # 30-60% faster, slightly less accurate
)
documents = loader.load()

from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    extract_html=True,  # Include HTML content
    extract_images=True  # Include images
)
documents = loader.load()
# Access metadata
doc = documents[0]
print(f"Title: {doc.metadata['title']}")
print(f"Author: {doc.metadata['author']}")
print(f"Images: {doc.metadata['images']}")

from langchain_ujeebu import UjeebuLoader
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
# Load articles
loader = UjeebuLoader(
    urls=[
        "https://example.com/article1",
        "https://example.com/article2"
    ]
)
documents = loader.load()
# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
# Query
result = qa_chain.invoke({"query": "What are the main points?"})
print(result["result"])

UjeebuExtractTool is a LangChain tool for extracting article content.
Parameters:
- api_key (str, optional): Ujeebu API key. Defaults to the UJEEBU_API_KEY environment variable.
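The fallback from an explicit argument to the environment variable can be pictured roughly as follows. This is a minimal sketch, not the package's actual code: `resolve_api_key` is a hypothetical helper, and raising `ValueError` when no key is found is an assumption inferred from the error-handling example in this README.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit argument, then the environment."""
    key = api_key or os.environ.get("UJEEBU_API_KEY")
    if not key:
        # Assumption: a missing key surfaces as ValueError.
        raise ValueError(
            "No Ujeebu API key found. Pass api_key=... or set UJEEBU_API_KEY."
        )
    return key
```

Calling `resolve_api_key("my-key")` returns the explicit value even when the environment variable is set, which is the usual precedence for this kind of setting.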
Tool Parameters:
- url (str, required): URL of the article to extract
- text (bool): Extract article text (default: True)
- html (bool): Extract article HTML (default: False)
- author (bool): Extract article author (default: True)
- pub_date (bool): Extract publication date (default: True)
- images (bool): Extract images (default: False)
- quick_mode (bool): Use quick mode for faster extraction (default: False)
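To make the documented defaults concrete, here is a small sketch that mirrors the tool parameters in a plain dataclass. `ExtractOptions` and `to_tool_input` are hypothetical names introduced for illustration; the package itself is described as using Pydantic validation, which this standalone version does not reproduce.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractOptions:
    """Hypothetical mirror of the tool parameters and their documented defaults."""

    url: str
    text: bool = True
    html: bool = False
    author: bool = True
    pub_date: bool = True
    images: bool = False
    quick_mode: bool = False

    def to_tool_input(self) -> dict:
        """Return the options as a plain dict of keyword arguments."""
        return asdict(self)


# Only the flags that differ from the defaults need to be spelled out.
options = ExtractOptions(url="https://example.com/article", images=True)
print(options.to_tool_input())
```

A dict of this shape is what a LangChain tool generally accepts as input, so it could plausibly be passed on as `tool.invoke(options.to_tool_input())`; treat that call as an assumption to verify against the installed version.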
UjeebuLoader is a LangChain document loader for articles.
Parameters:
- urls (List[str], required): List of article URLs to load
- api_key (str, optional): Ujeebu API key
- extract_text (bool): Extract article text (default: True)
- extract_html (bool): Extract article HTML (default: False)
- extract_author (bool): Extract author (default: True)
- extract_pub_date (bool): Extract publication date (default: True)
- extract_images (bool): Extract images (default: False)
- quick_mode (bool): Use quick mode (default: False)
Methods:
- load(): Load all documents
- lazy_load(): Load documents lazily (equivalent to load() in this implementation)
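For readers unfamiliar with the two methods, the sketch below shows the general pattern behind them: a lazy loader yields one document per iteration step, and the eager variant simply drains that iterator. Since the two are described as equivalent in this implementation, this illustrates the pattern rather than what UjeebuLoader does internally. `FakeArticleLoader` and `fake_fetch` are hypothetical stand-ins that make no network calls.

```python
from typing import Callable, Dict, Iterator, List


class FakeArticleLoader:
    """Hypothetical stand-in for a URL-based loader; runs fully offline."""

    def __init__(self, urls: List[str], fetch: Callable[[str], Dict[str, str]]):
        self.urls = urls
        self.fetch = fetch  # injected so the sketch needs no network access

    def lazy_load(self) -> Iterator[Dict[str, str]]:
        # Fetch one URL per iteration step.
        for url in self.urls:
            yield self.fetch(url)

    def load(self) -> List[Dict[str, str]]:
        # Eager variant: drain the iterator into a list.
        return list(self.lazy_load())


fetched: List[str] = []  # records which URLs were actually requested


def fake_fetch(url: str) -> Dict[str, str]:
    fetched.append(url)
    return {"source": url, "text": f"content of {url}"}


loader = FakeArticleLoader(
    ["https://example.com/a", "https://example.com/b"], fake_fetch
)
first = next(loader.lazy_load())  # requests only the first URL
everything = loader.load()        # requests both URLs
```

With a large URL list, iterating over `lazy_load()` lets a caller stop early or process documents one at a time instead of holding all of them in memory.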
Document Metadata:
- source: Original URL
- url: Resolved URL
- canonical_url: Canonical URL
- title: Article title
- author: Article author
- pub_date: Publication date
- language: Article language
- site_name: Site name
- summary: Article summary
- image: Main image URL
- images: List of all image URLs (if extract_images=True)
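Because some fields may be absent for a given article (for example, `images` is only listed when `extract_images=True`), downstream code is safer reading metadata defensively. The sketch below works on a plain dict shaped like the fields above. `describe` is a hypothetical helper, the sample values are invented, and whether a missing field appears as an absent key or as `None` is an assumption, so the code tolerates both.

```python
from typing import Any, Dict


def describe(metadata: Dict[str, Any]) -> str:
    """Build a one-line label from metadata, tolerating missing or None fields."""
    title = metadata.get("title") or "(untitled)"
    author = metadata.get("author") or "unknown author"
    # `images` is documented as present only when extract_images=True.
    image_count = len(metadata.get("images") or [])
    # Prefer the canonical URL, then the resolved URL, then the original source.
    link = (
        metadata.get("canonical_url")
        or metadata.get("url")
        or metadata.get("source", "")
    )
    return f"{title} by {author} [{image_count} images] <{link}>"


# Sample values are invented for illustration.
sample = {
    "source": "https://example.com/article",
    "title": "Example Article",
    "author": None,
}
print(describe(sample))
```

The same function can be applied to `doc.metadata` for each loaded document, since LangChain documents expose their metadata as a dict.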
from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(
    urls=["https://example.com/article"],
    base_url="https://custom-api.ujeebu.com/extract"
)

from langchain_ujeebu import UjeebuLoader
loader = UjeebuLoader(urls=["https://example.com/article"])
try:
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")
except ValueError as e:
    print(f"API key error: {e}")
except Exception as e:
    print(f"Error loading documents: {e}")

Run the test suite:
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=langchain_ujeebu --cov-report=html
# Run type checking
mypy langchain_ujeebu
# Run linting
flake8 langchain_ujeebu
black langchain_ujeebu

Check out the examples directory for more usage examples:
- agent_example.py - Using Ujeebu with LangChain agents
- document_loader_example.py - Using the document loader with vector stores
Ujeebu Extract API pricing is based on usage. Check the pricing page for details.
- Documentation: https://ujeebu.com/docs/extract
- API Reference: https://ujeebu.com/docs
- Support: support@ujeebu.com
- GitHub Issues: Report a bug
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain - Build applications with LLMs through composability
- Ujeebu API - Web scraping and content extraction API
- Initial release
- UjeebuExtractTool for LangChain agents
- UjeebuLoader document loader
- Full test coverage
- Comprehensive documentation