build_index --> get_chunks --> load_text() prematurely fails with 512 character limit before splitting #19

@ametalci

Hello,

I ran into a case where I expected TextChunker to split a long input string during build_index(), but the call fails before any splitting happens. For example,

build_index(["This is a 512-character string designed for testing purposes, meticulously crafted to meet the precise length requirement. It contains repetitive phrases to easily fill the space, ensuring that every byte and character contributes to the grand total. The purpose is to demonstrate a fixed-length text block, which is often crucial in various programming contexts, especially when dealing with data schemas, API limits, or text processing for machine learning model1111111111111111111111111111111111111111111111111"])

gives the error:

ERROR: AssertionError: Each source should be less than 512 characters long. Detected: 513 characters. You must provide sources for each text when using TextChunker
Stacktrace:
 [1] load_text(chunker::PromptingTools.Experimental.RAGTools.TextChunker, input::String; source::String, kwargs::@Kwargs{})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:166
 [2] load_text
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:164 [inlined]
 [3] get_chunks(chunker::PromptingTools.Experimental.RAGTools.TextChunker, files_or_docs::Vector{…}; sources::Vector{…}, verbose::Bool, separators::Vector{…}, max_length::Int64)
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:205
 [4] get_chunks
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:191 [inlined]
 [5] build_index(indexer::SimpleIndexer, files_or_docs::Vector{…}; verbose::Int64, extras::Nothing, index_id::Symbol, chunker::PromptingTools.Experimental.RAGTools.TextChunker, chunker_kwargs::@NamedTuple{}, embedder::PromptingTools.Experimental.RAGTools.BatchEmbedder, embedder_kwargs::@NamedTuple{}, tagger::PromptingTools.Experimental.RAGTools.NoTagger, tagger_kwargs::@NamedTuple{}, api_kwargs::@NamedTuple{}, cost_tracker::Base.Threads.Atomic{…})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:624
 [6] build_index(indexer::SimpleIndexer, files_or_docs::Vector{String})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:609
 [7] #build_index#137
   @ C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:745 [inlined]
 [8] build_index(files_or_docs::Vector{String})
   @ PromptingTools.Experimental.RAGTools C:\Users\kivanc.ulker\.julia\packages\PromptingTools\UPYoB\src\Experimental\RAGTools\preparation.jl:744
 [9] top-level scope
   @ REPL[38]:1
Some type information was truncated. Use show(err) to see complete types.

We expected build_index to split the long text into chunks of at most max_length characters (default = 256). However, get_chunks() first loads each string via load_text(), which immediately rejects any input longer than 512 characters, so the splitting step is never reached.

Is this by design, or should this check be skipped so that the text can be split recursively (using max_length from chunker_kwargs)?
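For reference, here is a sketch of what I think might avoid the assertion. It rests on an assumption I have not confirmed: that the 512-character check applies to the `source` label of each text (which seems to fall back to the text itself when none is given) rather than to the text being chunked. That is my reading of the error message and of the `sources` keyword visible in frame [3] of the stack trace. The document text and the `doc_$i` labels are made up for illustration, and I call get_chunks directly so that no embedding API call is needed.

```julia
using LinearAlgebra, SparseArrays, Unicode  # stdlibs; some PromptingTools versions need them loaded for RAGTools
using PromptingTools
const RT = PromptingTools.Experimental.RAGTools

# Illustrative input: one document longer than 512 characters.
docs = [repeat("lorem ipsum ", 50)]

# Assumption: a short label per document satisfies the 512-character check,
# so the chunker can then split the text itself.
sources = ["doc_$i" for i in eachindex(docs)]

# Keyword names `sources` and `max_length` are taken from the stack trace above.
result = RT.get_chunks(RT.TextChunker(), docs; sources = sources, max_length = 256)
```

If this works, the same keywords could presumably be forwarded through build_index via chunker_kwargs = (; sources, max_length = 256), but I have not verified that.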

Thanks

Labels: question