Index and extract metadata with very long documents and VLM #1345

badmonster0 · 2025-11-30T18:15:56Z

badmonster0
Nov 30, 2025
Maintainer

Great discussion, reposted from Discord; originally asked by ics.

I have seen your tutorial about using BAML to extract metadata and index documents, but I ran into the following challenges when I tried it, and I would like to know how I can use cocoindex to deal with them:

1. You use the pdftext library, but I have found that docling extracts the text much better, because the way it structures tables helps LLMs understand the table data.
2. I also need to extract metadata from images in the PDF pages. How can I use a VLM and still get BAML structured output for the images?
3. My documents are sometimes very long, about 250 pages, and the BAML data I need is spread across different pages. How do you consolidate the BAML output across all the pages?

badmonster0 · 2025-11-30T18:16:43Z

badmonster0
Nov 30, 2025
Maintainer Author

Answer to 1:

To clarify, our BAML example doesn't use the pdftext library. It hands the entire PDF to BAML, and BAML leverages LLMs to process it (usually converting the PDF to images first). So it already uses LLMs.

If you want to use docling, you can write another custom function in a few lines of code, similar to extract_patient_info but calling docling instead of BAML (see our documentation on defining a custom function for more details).
We can also consider creating an example using docling if that helps.


badmonster0 · 2025-11-30T18:17:12Z

badmonster0
Nov 30, 2025
Maintainer Author

Answer to 2:
This basically takes two steps: (1) extract the images from the PDFs; (2) process each image.

We have another example, pdf_elements_embedding, that does this (it extracts the images first, then processes each one). You can replace each part with the logic you want.

The example uses pypdf for the first step; you can replace it with docling calls.
The second step depends on what metadata you want.
If it is general metadata about the image format (e.g. image width/height), it is already available on the Image object, and you can read it directly when you extract the image from the PDF (example).
If the "metadata" needs to be extracted from the content of the image, BAML is a good fit. You can use a custom function to call BAML for the second step.
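
The two steps could be sketched roughly as follows, assuming pypdf's `page.images` accessor for step 1. The function names and the `analyze` callback are hypothetical; `analyze` stands in for whatever you call per image, for example a wrapper around a BAML function.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

ImageItem = Tuple[int, str, bytes]  # (page number, image name, raw bytes)


def extract_images(pdf_path: str) -> Iterator[ImageItem]:
    """Step 1: yield every embedded image of a PDF using pypdf."""
    from pypdf import PdfReader  # lazy import; pypdf needs Pillow for images

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            yield page_number, image.name, image.data


def process_images(
    images: Iterable[ImageItem], analyze: Callable[[bytes], Any]
) -> List[Dict[str, Any]]:
    """Step 2: run `analyze` on each image and collect one row per image."""
    rows = []
    for page_number, name, data in images:
        rows.append({"page": page_number, "image": name, "metadata": analyze(data)})
    return rows
```

Usage would look like `process_images(extract_images("report.pdf"), analyze=my_vlm_call)`, where `my_vlm_call` is a placeholder for your own function.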


badmonster0 · 2025-11-30T18:17:32Z

badmonster0
Nov 30, 2025
Maintainer Author

Answer to 3:

What is your target storage/format for the output? For example, for the metadata of each image, do you put it into separate rows of a table, or do you want to consolidate it further after extraction?

For the former, it's just an "iterate-collect" process, similar to the example above. For the latter, you can add another step after extracting information from all images, which takes the metadata for the entire document as input.
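
As an illustration of the second option, here is a minimal sketch of a consolidation step in plain Python. The fields (`title`, `keywords`) and the merge policy (first non-empty title wins, keywords are de-duplicated in page order) are assumptions chosen for the example, not something the BAML example prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class PageMetadata:
    """Partial metadata extracted from one page or image (hypothetical schema)."""
    page: int
    title: Optional[str] = None
    keywords: List[str] = field(default_factory=list)


@dataclass
class DocumentMetadata:
    """Consolidated metadata for the whole document."""
    title: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    source_pages: Dict[str, int] = field(default_factory=dict)


def consolidate(pages: Iterable[PageMetadata]) -> DocumentMetadata:
    """Merge per-page results into one record, visiting pages in order."""
    doc = DocumentMetadata()
    for item in sorted(pages, key=lambda p: p.page):
        # Keep the first non-empty title and remember where it came from.
        if doc.title is None and item.title:
            doc.title = item.title
            doc.source_pages["title"] = item.page
        # Union of keywords, preserving first-seen order.
        for keyword in item.keywords:
            if keyword not in doc.keywords:
                doc.keywords.append(keyword)
    return doc
```

If the merge needs judgment rather than fixed rules, the same step could instead pass the collected per-page records to one more LLM call.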


micuentadecasa · 2025-11-30T18:37:50Z

micuentadecasa
Nov 30, 2025

Many thanks,

After reading the other examples you mentioned, I see that I can do it the way you described. For my use case I think I will also use docling to divide the document into sections, if the chunks allow it, and deal with images individually.

Regarding the images, I have to check each image independently and verify that the document includes the required number and types of images/diagrams, and that they follow some guidelines.
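
A rough sketch of the count check, assuming each image has already been labeled with a type (for example by a VLM). The type names and required counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def missing_images(found_types: Iterable[str], required: Dict[str, int]) -> Dict[str, int]:
    """Return how many images of each required type are still missing."""
    counts = Counter(found_types)
    return {
        kind: needed - counts[kind]
        for kind, needed in required.items()
        if counts[kind] < needed
    }
```

An empty result means every required type is present in sufficient number; the guideline checks would be a separate per-image step.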

Regarding the BAML example, I think it won't work in my case because my documents take up a few GB: they include very big diagrams, and sending the whole document would exceed the context length of the models. It would be interesting to know how the BAML people deal with very big files.

It would be great if you included some of this in the docling document that you are preparing; there will be many use cases similar to this one.
