You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -38,7 +38,7 @@ The template supports multiple LLM (Large Language Model) providers, such as STA
38
38
39
39
40
40
## 1. Getting Started
41
-
A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring).
41
+
A [`Tiltfile`](./Tiltfile) is provided to get you started :rocket:. If Tilt is new for you, and you want to learn more about it, please take a look at the [Tilt guides](https://docs.tilt.dev/tiltfile_authoring.html).
Copy file name to clipboardExpand all lines: libs/README.md
+11-6Lines changed: 11 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,7 @@
1
1
# RAG Core Libraries
2
2
3
3
This directory contains the core libraries of the STACKIT RAG template.
4
-
These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML), web sources via sitemaps, and Confluence pages.
4
+
These libraries provide comprehensive document extraction capabilities including support for files (PDF, DOCX, XML, EPUB), web sources via sitemaps, and Confluence pages.
5
5
It consists of the following python packages:
6
6
7
7
-[`1. Rag Core API`](#1-rag-core-api)
@@ -228,15 +228,19 @@ Technically, all parameters of the `SitemapLoader` from LangChain can be provide
228
228
| file_service |[`extractor_api_lib.file_services.file_service.FileService`](./extractor-api-lib/src/extractor_api_lib/file_services/file_service.py)|[`extractor_api_lib.impl.file_services.s3_service.S3Service`](./extractor-api-lib/src/extractor_api_lib/impl/file_services/s3_service.py)| Handles operations on the connected storage. |
229
229
| database_converter |[`extractor_api_lib.table_converter.dataframe_converter.DataframeConverter`](./extractor-api-lib/src/extractor_api_lib/table_converter/dataframe_converter.py)|[`extractor_api_lib.impl.table_converter.dataframe2markdown.DataFrame2Markdown`](./extractor-api-lib/src/extractor_api_lib/impl/table_converter/dataframe2markdown.py)| Converts the extracted table from *pandas.DataFrame* to markdown. If you want the table to have another format, this would need to be adjusted. |
230
230
| pdf_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.impl.extractors.file_extractors.pdf_extractor.PDFExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/pdf_extractor.py)| Extractor used for extracting information from PDF documents. |
231
-
| ms_docs_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py)| Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
232
-
| xml_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py)| Extractor used for extracting content from XML documents. |
233
-
| all_extractors |`dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]`|`dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)`| List of all available extractors. If you add a new type of extractor you would have to add it to this list. |
231
+
| ms_docs_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor.MSDocsExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/ms_docs_extractor.py)| Extractor used for extracting information from Microsoft Documents like *.docx, etc. |
232
+
| xml_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.impl.extractors.file_extractors.xml_extractor.XMLExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/xml_extractor.py)| Extractor used for extracting content from XML documents. |
233
+
| epub_extractor |[`extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_file_extractor.py)|[`extractor_api_lib.impl.extractors.file_extractors.epub_extractor.EPUBExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/file_extractors/epub_extractor.py)| Extractor used for extracting content from EPUB documents. |
234
+
| file_extractors |`dependency_injector.providers.List[extractor_api_lib.extractors.information_file_extractor.InformationFileExtractor]`|`dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor)`| List of all available file extractors. If you add a new type of file extractor you would have to add it to this list. |
235
+
| intern2external |[`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py)|[`extractor_api_lib.impl.mapper.internal2external_information_piece.Internal2ExternalInformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/internal2external_information_piece.py)| Maps internal information pieces to external information pieces, converting between internal and external content types. |
236
+
| confluence_document2information_piece |[`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py)|[`extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece.ConfluenceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/confluence_langchain_document2information_piece.py)| Maps LangChain documents from Confluence to information pieces with Confluence-specific metadata handling. |
237
+
| sitemap_document2information_piece |[`extractor_api_lib.mapper.source_langchain_document2information_piece.SourceLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/mapper/source_langchain_document2information_piece.py)|[`extractor_api_lib.impl.mapper.sitemap_document2information_piece.SitemapLangchainDocument2InformationPiece`](./extractor-api-lib/src/extractor_api_lib/impl/mapper/sitemap_document2information_piece.py)| Maps LangChain documents from sitemap sources to information pieces with sitemap-specific metadata handling. |
234
238
| general_file_extractor |[`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py)|[`extractor_api_lib.impl.api_endpoints.general_file_extractor.GeneralFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py)| Combines multiple file extractors and decides which one to use for the given file format. |
235
-
| general_source_extractor |[`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py)|[`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py)| Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
236
239
| confluence_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/confluence_extractor.py)| Implementation of an extractor for the source `confluence`. |
237
240
| sitemap_extractor |[`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py)|[`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py)| Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
238
241
| sitemap_parsing_function |`dependency_injector.providers.Factory[Callable]`|[`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py)| Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
239
-
| sitemap_meta_function |`dependency_injector.providers.Factory[Callable]`|[`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py)| Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
242
+
| sitemap_meta_function |`dependency_injector.providers.Factory[Callable]`|[`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_metadata_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py)| Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
243
+
| source_extractor |[`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py)|[`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py)| Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source and handles available extractors for confluence and sitemap sources. |
240
244
241
245
## 4. RAG Core Lib
242
246
@@ -250,6 +254,7 @@ Examples of included components:
250
254
- ...
251
255
252
256
### 4.1 Requirements
257
+
253
258
All required python libraries can be found in the [pyproject.toml](./extractor-api-lib/pyproject.toml) file.
254
259
In addition to python libraries the following system packages are required:
0 commit comments