Commit e77c29b

chore: merge main
2 parents 5da7878 + 8ba8617 commit e77c29b

12 files changed: +467 -285 lines changed


.devcontainer/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ ARG DEBIAN_FRONTEND=noninteractive
 ARG USER=vscode
 
 RUN DEBIAN_FRONTEND=noninteractive \
-    && apt-get update \
+    && apt-get update \
     && apt-get install -y build-essential --no-install-recommends make \
     ca-certificates \
     git \
@@ -27,7 +27,7 @@ RUN DEBIAN_FRONTEND=noninteractive \
 # Python and poetry installation
 USER $USER
 ARG HOME="/home/$USER"
-ARG PYTHON_VERSION=3.11
+ARG PYTHON_VERSION=3.13
 
 ENV PYENV_ROOT="${HOME}/.pyenv"
 ENV PATH="${PYENV_ROOT}/shims:${PYENV_ROOT}/bin:${HOME}/.local/bin:$PATH"
@@ -40,4 +40,4 @@ RUN echo "done 0" \
     && pyenv global ${PYTHON_VERSION} \
     && echo "done 3" \
     && curl -sSL https://install.python-poetry.org | python3 - \
-    && poetry config virtualenvs.in-project true
+    && poetry config virtualenvs.in-project true

README.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ The template supports multiple LLM (Large Language Model) providers, such as STA
 
 
 ## 1. Getting Started
-A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring).
+A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring.html).
 
 ### 1.1 Components
 

libs/README.md

Lines changed: 11 additions & 6 deletions
@@ -1,7 +1,7 @@
 # RAG Core Libraries
 
 This directory contains the core libraries of the STACKIT RAG template.
-These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML), web sources via sitemaps, and Confluence pages.
+These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML, EPUB), web sources via sitemaps, and Confluence pages.
 It consists of the following python packages:
 
 - [`1. Rag Core API`](#1-rag-core-api)
@@ -228,15 +228,19 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
 | file_service | [`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py) | [`extractor_api_lib.impl.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/impl/file_services/s3_service.py) | Handles operations on the connected storage. |
 | database_converter | [`extractor_api_lib.table_converter.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/table_converter/dataframe_converter.py) | [`extractor_api_lib.impl.table_converter.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/impl/table_converter/dataframe2markdown.py) | Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
 | pdf_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py) | Extractor used for extracting information from PDF documents. |
-| ms_docs_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) |[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
-| xml_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
-| all_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)` | List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
+| ms_docs_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) |[`extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py) | Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
+| xml_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.impl.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py) | Extractor used for extracting content from XML documents. |
+| epub_extractor | [`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py) | [`extractor_api_lib.impl.extractors.file_extractors.epub_extractor.EpubExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/epub_extractor.py) | Extractor used for extracting content from EPUB documents. |
+| file_extractors | `dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]` | `dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)` | List of all available file extractors. If you add a new type of file extractor you would have to add it to this list. |
+| intern2external | [`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py) | [`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py) | Maps internal information pieces to external information pieces, converting between internal and external content types. |
+| confluence_document2information_piece | [`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py) | [`extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece.ConfluenceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py) | Maps LangChain documents from Confluence to information pieces with Confluence-specific metadata handling. |
+| sitemap_document2information_piece | [`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py) | [`extractor_api_lib.impl.mapper.sitemap_document2information_piece.SitemapLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py) | Maps LangChain documents from sitemap sources to information pieces with sitemap-specific metadata handling. |
 | general_file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) |[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) | Combines multiple file extractors and decides which one to use for the given file format. |
-| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
 | confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
 | sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
 | sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
-| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
+| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_metadata_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
+| source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Decides the correct extractor for the source and handles the available extractors for Confluence and sitemap sources. |
 
 ## 4. RAG Core Lib
 
@@ -250,6 +254,7 @@ Examples of included components:
 - ...
 
 ### 4.1 Requirements
+
 All required python libraries can be found in the [pyproject.toml](./extractor-api-lib/pyproject.toml) file.
 In addition to python libraries the following system packages are required:
 

libs/extractor-api-lib/poetry.lock

Lines changed: 17 additions & 0 deletions
Some generated files are not rendered by default.

libs/extractor-api-lib/pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ langchain-core = "0.3.72"
 camelot-py = {extras = ["cv"], version = "^1.0.0"}
 fake-useragent = "^2.2.0"
 pypdfium2 = "4.30.0"
+pypandoc-binary = "^1.15"
 
 [tool.poetry.group.dev.dependencies]
 pytest = "^8.3.5"

libs/extractor-api-lib/src/extractor_api_lib/dependency_container.py

Lines changed: 24 additions & 7 deletions
@@ -3,12 +3,21 @@
 from dependency_injector.containers import DeclarativeContainer
 from dependency_injector.providers import Factory, List, Singleton  # noqa: WOT001
 
-from extractor_api_lib.impl.api_endpoints.general_source_extractor import GeneralSourceExtractor
+from extractor_api_lib.impl.api_endpoints.general_file_extractor import (
+    GeneralFileExtractor,
+)
+from extractor_api_lib.impl.api_endpoints.general_source_extractor import (
+    GeneralSourceExtractor,
+)
 from extractor_api_lib.impl.extractors.confluence_extractor import ConfluenceExtractor
-from extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor import MSDocsExtractor
+from extractor_api_lib.impl.extractors.file_extractors.epub_extractor import (
+    EpubExtractor,
+)
+from extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor import (
+    MSDocsExtractor,
+)
 from extractor_api_lib.impl.extractors.file_extractors.pdf_extractor import PDFExtractor
 from extractor_api_lib.impl.extractors.file_extractors.xml_extractor import XMLExtractor
-from extractor_api_lib.impl.api_endpoints.general_file_extractor import GeneralFileExtractor
 from extractor_api_lib.impl.extractors.sitemap_extractor import SitemapExtractor
 from extractor_api_lib.impl.file_services.s3_service import S3Service
 from extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece import (
@@ -17,7 +26,12 @@
 from extractor_api_lib.impl.mapper.internal2external_information_piece import (
     Internal2ExternalInformationPiece,
 )
-from extractor_api_lib.impl.mapper.sitemap_document2information_piece import SitemapLangchainDocument2InformationPiece
+from extractor_api_lib.impl.mapper.langchain_document2information_piece import (
+    LangchainDocument2InformationPiece,
+)
+from extractor_api_lib.impl.mapper.sitemap_document2information_piece import (
+    SitemapLangchainDocument2InformationPiece,
+)
 from extractor_api_lib.impl.settings.pdf_extractor_settings import PDFExtractorSettings
 from extractor_api_lib.impl.settings.s3_settings import S3Settings
 from extractor_api_lib.impl.table_converter.dataframe2markdown import DataFrame2Markdown
@@ -44,12 +58,15 @@ class DependencyContainer(DeclarativeContainer):
     xml_extractor = Singleton(XMLExtractor, file_service)
 
     intern2external = Singleton(Internal2ExternalInformationPiece)
-    langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
+    confluence_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
+    langchain_document2information_piece = Singleton(LangchainDocument2InformationPiece)
     sitemap_document2information_piece = Singleton(SitemapLangchainDocument2InformationPiece)
-    file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor)
+    epub_extractor = Singleton(EpubExtractor, file_service, langchain_document2information_piece)
+
+    file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor, epub_extractor)
 
     general_file_extractor = Singleton(GeneralFileExtractor, file_service, file_extractors, intern2external)
-    confluence_extractor = Singleton(ConfluenceExtractor, mapper=langchain_document2information_piece)
+    confluence_extractor = Singleton(ConfluenceExtractor, mapper=confluence_document2information_piece)
 
     sitemap_extractor = Singleton(
         SitemapExtractor,

libs/extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/epub_extractor.py

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+"""Module containing the EpubExtractor class."""
+
+import logging
+from pathlib import Path
+
+from langchain_community.document_loaders import UnstructuredEPubLoader
+
+from extractor_api_lib.extractors.information_file_extractor import (
+    InformationFileExtractor,
+)
+from extractor_api_lib.file_services.file_service import FileService
+from extractor_api_lib.impl.mapper.langchain_document2information_piece import (
+    LangchainDocument2InformationPiece,
+)
+from extractor_api_lib.impl.types.file_type import FileType
+from extractor_api_lib.models.dataclasses.internal_information_piece import (
+    InternalInformationPiece,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class EpubExtractor(InformationFileExtractor):
+    """Extractor for EPUB documents using the unstructured library."""
+
+    def __init__(
+        self,
+        file_service: FileService,
+        mapper: LangchainDocument2InformationPiece,
+    ):
+        """Initialize the EpubExtractor.
+
+        Parameters
+        ----------
+        file_service : FileService
+            Handler for downloading the file to extract content from and, if required, uploading results.
+        mapper : LangchainDocument2InformationPiece
+            An instance of LangchainDocument2InformationPiece used for mapping langchain documents
+            to information pieces.
+        """
+        super().__init__(file_service=file_service)
+        self._mapper = mapper
+
+    @property
+    def compatible_file_types(self) -> list[FileType]:
+        """
+        List of compatible file types for the EPUB extractor.
+
+        Returns
+        -------
+        list[FileType]
+            A list containing the compatible file types, which in this case is EPUB.
+        """
+        return [FileType.EPUB]
+
+    async def aextract_content(self, file_path: Path, name: str) -> list[InternalInformationPiece]:
+        """
+        Extract content from an EPUB file and process the elements.
+
+        Parameters
+        ----------
+        file_path : Path
+            The path to the EPUB file to be processed.
+        name : str
+            Name of the document.
+
+        Returns
+        -------
+        list[InternalInformationPiece]
+            A list of processed information pieces extracted from the EPUB file.
+        """
+        elements = UnstructuredEPubLoader(file_path.as_posix()).load()
+        return [self._mapper.map_document2informationpiece(document=x, document_name=name) for x in elements]
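
The flow in `aextract_content` above (load the elements, then map each one to an information piece) can be sketched without the real dependencies. Everything in this sketch is a stand-in: `FakeDocument`, `FakePiece`, `FakeLoader`, `PassThroughMapper`, and `SketchEpubExtractor` are hypothetical names, not part of `extractor_api_lib` or LangChain. The sketch also assumes the loader call is blocking and moves it to a worker thread with `asyncio.to_thread`; the committed code calls `load()` directly.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class FakePiece:
    """Hypothetical stand-in for InternalInformationPiece."""

    content: str
    metadata: dict


class FakeLoader:
    """Hypothetical stand-in for a blocking document loader."""

    def __init__(self, path: str):
        self._path = path

    def load(self) -> list[FakeDocument]:
        # A real loader would parse the file; this returns two fixed elements.
        return [
            FakeDocument("Chapter 1", {"source": self._path}),
            FakeDocument("Chapter 2", {"source": self._path}),
        ]


class PassThroughMapper:
    """Hypothetical mapper that copies metadata and records the document name."""

    def map_document2informationpiece(self, document: FakeDocument, document_name: str) -> FakePiece:
        meta = {**document.metadata, "document": document_name}
        return FakePiece(content=document.page_content, metadata=meta)


class SketchEpubExtractor:
    """Hypothetical extractor mirroring the load-then-map structure."""

    def __init__(self, mapper: PassThroughMapper):
        self._mapper = mapper

    async def aextract_content(self, file_path: Path, name: str) -> list[FakePiece]:
        # Assumption: load() blocks, so run it in a worker thread to keep the event loop free.
        elements = await asyncio.to_thread(FakeLoader(file_path.as_posix()).load)
        return [self._mapper.map_document2informationpiece(document=x, document_name=name) for x in elements]


pieces = asyncio.run(SketchEpubExtractor(PassThroughMapper()).aextract_content(Path("book.epub"), "book"))
```

Whether offloading to a thread is worthwhile in the real extractor depends on how long the loader takes and how the endpoint is served; the sketch only shows where such a change would sit.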

libs/extractor-api-lib/src/extractor_api_lib/impl/mapper/langchain_document2information_piece.py

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+"""Module for the LangchainDocument2InformationPiece class."""
+
+from extractor_api_lib.mapper.source_langchain_document2information_piece import (
+    SourceLangchainDocument2InformationPiece,
+)
+
+
+class LangchainDocument2InformationPiece(SourceLangchainDocument2InformationPiece):
+    """A class to map a LangchainDocument to an InformationPiece."""
+
+    def _map_meta(self, internal: dict, document_name: str) -> dict:
+        return internal
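
The new mapper overrides only `_map_meta` and returns the metadata unchanged, which suggests a template-method arrangement in the base class. A minimal sketch of that arrangement follows. `SketchBaseMapper`, `PassThroughMapper`, and `NamedMapper` are hypothetical, and the dict-based document is an assumption for illustration; the real `SourceLangchainDocument2InformationPiece` is not shown in this commit and may do more.

```python
from abc import ABC, abstractmethod


class SketchBaseMapper(ABC):
    """Hypothetical base: builds a piece and delegates metadata handling to subclasses."""

    def map_document2informationpiece(self, document: dict, document_name: str) -> dict:
        # Copy the metadata so subclasses cannot mutate the source document.
        meta = self._map_meta(dict(document["metadata"]), document_name)
        return {"content": document["page_content"], "metadata": meta}

    @abstractmethod
    def _map_meta(self, internal: dict, document_name: str) -> dict:
        """Adapt source-specific metadata."""


class PassThroughMapper(SketchBaseMapper):
    """Returns the metadata as-is, like the mapper added in this commit."""

    def _map_meta(self, internal: dict, document_name: str) -> dict:
        return internal


class NamedMapper(SketchBaseMapper):
    """Hypothetical variant that records the document name in the metadata."""

    def _map_meta(self, internal: dict, document_name: str) -> dict:
        internal["document"] = document_name
        return internal


doc = {"page_content": "Hello", "metadata": {"page": 1}}
plain = PassThroughMapper().map_document2informationpiece(doc, "book")
named = NamedMapper().map_document2informationpiece(doc, "book")
```

With this split, a source-specific mapper only needs to say how its metadata differs, and a generic file format can reuse the pass-through variant.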

libs/extractor-api-lib/src/extractor_api_lib/impl/types/file_type.py

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ class FileType(StrEnum):
     DOCX = "DOCX"
     PPTX = "PPTX"
     XML = "XML"
+    EPUB = "EPUB"
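
The README describes `general_file_extractor` as combining multiple file extractors and deciding which one to use for the given file format, and each extractor exposes `compatible_file_types`. One way such a lookup could work is sketched below. `SketchFileType` lists only the members visible in this hunk and derives from `str, Enum` rather than `StrEnum` so it runs on older Python versions; `SketchExtractor`, `file_type_of`, and `pick_extractor` are hypothetical helpers, not the library's implementation.

```python
from enum import Enum
from pathlib import Path


class SketchFileType(str, Enum):
    """Stand-in for FileType, limited to the members shown in the hunk."""

    DOCX = "DOCX"
    PPTX = "PPTX"
    XML = "XML"
    EPUB = "EPUB"


class SketchExtractor:
    """Hypothetical extractor that only declares which file types it supports."""

    def __init__(self, label: str, compatible_file_types: list):
        self.label = label
        self.compatible_file_types = compatible_file_types


def file_type_of(path: Path) -> SketchFileType:
    # Map the upper-cased suffix (without the dot) to a member; raises ValueError if unknown.
    return SketchFileType(path.suffix.lstrip(".").upper())


def pick_extractor(path: Path, extractors: list) -> SketchExtractor:
    wanted = file_type_of(path)
    # Return the first extractor that declares support for the file type.
    for extractor in extractors:
        if wanted in extractor.compatible_file_types:
            return extractor
    raise ValueError(f"No extractor registered for {wanted.value}")


extractors = [
    SketchExtractor("xml", [SketchFileType.XML]),
    SketchExtractor("epub", [SketchFileType.EPUB]),
]
chosen = pick_extractor(Path("novel.epub"), extractors)
```

Under this scheme, supporting a new format takes an enum member plus an extractor that lists it, which matches the shape of the changes in this commit.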
