build_index --> get_chunks --> load_text() prematurely fails with 512 character limit before splitting #19

Hello, 

I ran into a situation where I rely on TextChunker to split a long string during build_index(). For example, the call

`build_index(["This is a 512-character string designed for testing purposes, meticulously crafted to meet the precise length 
requirement. It contains repetitive phrases to easily fill the space, ensuring that every byte and character contributes to the 
grand total. The purpose is to demonstrate a fixed-length text block, which is often crucial in various programming contexts, especially when dealing with data schemas, API limits, or text processing for machine learning model1111111111111111111111111111111111111111111111111"])`

gives the error: 

```
ERROR: AssertionError: Each `source` should be less than 512 characters long. Detected: 513 characters. You must provide sources for each text when using `TextChunker`
Stacktrace:
 [1] load_text(chunker::PromptingTools.Experimental.RAGTools.TextChunker, input::String; source::String, kwargs::@Kwargs{})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:166
 [2] load_text
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:164 [inlined]
 [3] get_chunks(chunker::PromptingTools.Experimental.RAGTools.TextChunker, files_or_docs::Vector{…}; sources::Vector{…}, verbose::Bool, separators::Vector{…}, max_length::Int64)
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:205
 [4] get_chunks
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:191 [inlined]
 [5] build_index(indexer::SimpleIndexer, files_or_docs::Vector{…}; verbose::Int64, extras::Nothing, index_id::Symbol, chunker::PromptingTools.Experimental.RAGTools.TextChunker, chunker_kwargs::@NamedTuple{}, embedder::PromptingTools.Experimental.RAGTools.BatchEmbedder, embedder_kwargs::@NamedTuple{}, tagger::PromptingTools.Experimental.RAGTools.NoTagger, tagger_kwargs::@NamedTuple{}, api_kwargs::@NamedTuple{}, cost_tracker::Base.Threads.Atomic{…})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:624
 [6] build_index(indexer::SimpleIndexer, files_or_docs::Vector{String})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:609
 [7] #build_index#137
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:745 [inlined]
 [8] build_index(files_or_docs::Vector{String})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:744
 [9] top-level scope
   @ REPL[38]:1
Some type information was truncated. Use `show(err)` to see complete types.
```

I expected build_index to split the long text into chunks of at most max_length characters (default 256). However, get_chunks() first loads each string via load_text(), which immediately rejects the string because it is longer than 512 characters, so the splitting step is never reached.
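
To illustrate what I think is happening, here is a minimal sketch in plain Julia. It does not use the package's code: `load_text_sketch` is a hypothetical stand-in, and the assumption that `source` falls back to the input text when no sources are given is only my reading of the error message and the stack trace.

```julia
# Hypothetical stand-in for the check seen in the stack trace (NOT the package's code).
# Assumption: `source` defaults to the input text when no source is passed.
function load_text_sketch(input::AbstractString; source::AbstractString = input)
    @assert length(source) <= 512 "source too long: $(length(source)) characters"
    return input, source
end

long_text = repeat("a", 513)

# No explicit source: the 513-character text doubles as its own source,
# so the length check fails before any splitting could happen.
failed = try
    load_text_sketch(long_text)
    false
catch err
    err isa AssertionError
end

# Explicit short source: the same text passes the check unchanged.
text, src = load_text_sketch(long_text; source = "doc_1")
```

If that reading is right, passing short labels through the `sources` keyword that get_chunks shows in the stack trace (for example via chunker_kwargs) might avoid the assertion, but I have not confirmed this.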

Is this by design, or should this check be skipped so that the text can be split recursively (with max_length passed via chunker_kwargs)?

Thanks
