Index and extract metadata with very long documents and VLM #1345
Replies: 4 comments
-
Answer to 1: To clarify, our BAML example doesn't use the pdftext library. It hands the entire PDF over to BAML, which in turn relies on LLMs to process the PDF (usually converting it to images first). So it already uses LLMs.
-
Answer to 2: We have another example, pdf_elements_embedding, that does this (it extracts the images first, then processes each image). You can replace each part with whatever logic you want. The example uses pypdf for the first step, and you can replace that with docling calls.
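As a rough illustration of the "extract first, then process each image" shape, here is a minimal sketch in plain Python. It is not the code of the pdf_elements_embedding example and it does not use the CocoIndex API; `PageImage`, `extract_with_pypdf`, `process_images`, and the `describe` callback are hypothetical names. Only `PdfReader`, `reader.pages`, and `page.images` come from pypdf. The extraction stage is a separate function so that a docling-based extractor yielding the same records could be dropped in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class PageImage:
    """One extracted image (hypothetical record type for this sketch)."""
    page_number: int  # 1-based page index
    name: str         # image name as reported by the extractor
    data: bytes       # raw encoded image bytes


def extract_with_pypdf(path: str) -> Iterator[PageImage]:
    """Stage 1: yield the embedded images of a PDF, page by page."""
    # Imported lazily so the rest of the sketch runs without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    for page_number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            yield PageImage(page_number, image.name, image.data)


def process_images(
    images: Iterable[PageImage],
    describe: Callable[[PageImage], dict],
) -> Iterator[dict]:
    """Stage 2: run `describe` on each image and yield one metadata row per image."""
    for image in images:
        yield {"page": image.page_number, "name": image.name, **describe(image)}
```

Usage would look like `rows = process_images(extract_with_pypdf("doc.pdf"), my_vlm_call)`, where `my_vlm_call` is whatever function sends a single image to a model and returns a dict. Because both stages are generators, images are handed to the model one at a time rather than as one request for the whole document.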
-
Answer to 3: What is your target storage / format for the output? For example, do you want to put the metadata of each image into a separate row of a table, or do you want to consolidate it further after extracting metadata from all the images? The first case is just an "iterate-collect" process, similar to the example above. For the second case, you can add another step that runs after information has been extracted from all images and takes the metadata for the entire document as its input.
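The two options can be sketched in plain Python as follows. This is only an illustration of the data shapes, not CocoIndex code; `collect_rows`, `consolidate`, and the `kind` field are assumptions made up for the example.

```python
from collections import Counter


def collect_rows(doc_id: str, image_metadata: list[dict]) -> list[dict]:
    """Option 1 ("iterate-collect"): one output row per image."""
    return [
        {"doc_id": doc_id, "image_index": index, **meta}
        for index, meta in enumerate(image_metadata)
    ]


def consolidate(doc_id: str, rows: list[dict]) -> dict:
    """Option 2: an extra step that takes all per-image rows of a document
    and produces a single document-level record."""
    kinds = Counter(row["kind"] for row in rows)
    return {"doc_id": doc_id, "image_count": len(rows), "kinds": dict(kinds)}
```

With option 1 the rows are the final output. With option 2 the rows are an intermediate result, and `consolidate` only runs once every image of the document has been processed.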
-
Many thanks. After reading the other examples you mentioned, I see that I can do it the way you described. For my use case, I think I will also use docling to divide the document into sections, if the chunks allow it, and deal with the images individually. I have to check each image independently and verify that the document includes the required number and types of images/diagrams, and that they follow some guidelines.
Regarding the BAML example, I think it won't work in my case: my documents take up a few GB and include very large diagrams, so sending the whole document would exceed the context length of the models. It would be interesting to know how the BAML people deal with very large files.
It would be great if you included some of this in the docling document you are preparing, since there will be many use cases similar to this one.
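The completeness check described in this comment (required number and types of diagrams) could be sketched roughly like this, once each image has been classified on its own. The function name, the kind labels, and the shape of the requirements table are all hypothetical. How each image gets its kind label (for example, via a per-image VLM call) is outside this sketch.

```python
from collections import Counter


def check_required_images(found_kinds: list[str], required: dict[str, int]) -> dict[str, int]:
    """Return how many images of each required kind are still missing.

    found_kinds: one label per image found in the document.
    required:    minimum number of images expected for each kind.
    An empty result means every requirement is met.
    """
    counts = Counter(found_kinds)
    return {
        kind: minimum - counts[kind]
        for kind, minimum in required.items()
        if counts[kind] < minimum
    }
```

Since the check only needs the per-image labels, not the image bytes, it stays cheap even for multi-GB documents.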
-
Great Discussion posted from Discord by ics.
I have seen your tutorial about using BAML to extract metadata and index documents, but I ran into the following challenges when I tried to index and extract metadata, and I would like to know how I can use CocoIndex to deal with them: