refactor(go-segmenter): replace custom GoSegmenter with Tree-Sitter implementation #138
vbelouso wants to merge 3 commits into RHEcosystemAppEng:rh-aiq-main
4b877d9 to
a627169
Compare
Hi @vbelouso, can you please rebase and resolve the conflicts before I start reviewing it?
…itter implementation Signed-off-by: Vladimir Belousov <vbelouso@redhat.com>
a627169 to
5a756be
Compare
Done
-    return re.search("[A-Z][a-z0-9-]*", function_name)
+    return bool(re.search("[A-Z][a-z0-9-]*", function_name))

 def get_function_name(self, function: Document) -> str:
@vbelouso This is an example of something that is not working correctly (the example test below is failing): get_function_name should return the name of the variable that holds the anonymous function.
@pytest.mark.asyncio
async def test_transitive_search_golang_generic():
    parser = GoLanguageFunctionsParser()
    doc1 = Document(page_content=("greet := func() { // Assigning anonymous function to a variable 'greet'\n"
                                  " fmt.Println(\"Greetings from a variable-assigned anonymous function!\")\n"
                                  " }"))
    name = parser.get_function_name(doc1)
    print(f"name_of_function={name}")
    assert name == "greet"
Your revised GoSegmenter with Tree-Sitter parses the anonymous function assigned to a variable correctly, but instead of returning the name of the variable, it returns :=, which is incorrect. Please check.
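To make the expected behaviour concrete, here is a minimal sketch of the lookup, not the code in this PR. It assumes the node shapes of the Tree-Sitter Go grammar (short_var_declaration with a left and a right expression_list, and func_literal for the anonymous function). The Node class and the anonymous_function_name helper are hypothetical stand-ins so the example runs without the Tree-Sitter bindings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a Tree-Sitter node (not the py-tree-sitter API)."""
    type: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def _items(expression_list: Node) -> List[Node]:
    # Drop the "," separator tokens so positions line up on both sides.
    return [c for c in expression_list.children if c.type != ","]


def anonymous_function_name(decl: Node) -> Optional[str]:
    """Return the variable bound to a func literal in `name := func() {...}`."""
    if decl.type != "short_var_declaration":
        return None
    # Assumed child layout: [left expression_list, ":=", right expression_list].
    # Selecting by node type (rather than by position) avoids picking up ":=".
    lists = [c for c in decl.children if c.type == "expression_list"]
    if len(lists) != 2:
        return None
    left, right = lists
    # Pair each left-hand identifier with the value at the same position.
    for target, value in zip(_items(left), _items(right)):
        if target.type == "identifier" and value.type == "func_literal":
            return target.text
    return None


# Hand-built tree for: greet := func() { ... }
greet_decl = Node("short_var_declaration", children=[
    Node("expression_list", children=[Node("identifier", "greet")]),
    Node(":=", ":="),
    Node("expression_list", children=[Node("func_literal", "func() { ... }")]),
])
print(anonymous_function_name(greet_decl))  # greet
```

Selecting children by node type instead of by index is the point of the sketch: an index-based lookup can land on the := token, which matches the failure described above.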
@zvigrinberg
Updated.
I also increased the number of test cases.
Signed-off-by: Vladimir Belousov <vbelouso@redhat.com>
Signed-off-by: Vladimir Belousov <vbelouso@redhat.com>
5310848 to
9261a8c
Compare

Summary
This PR replaces the legacy regex-based Go segmenter with a native Tree-Sitter parser (GoSegmenterExtended), enabling syntax-aware extraction of Go functions, methods, anonymous functions, and types with deterministic and reproducible results.
Rationale
Implementation highlights
Architectural impact
The segmentation layer now uses a structured syntax tree (Tree-Sitter) instead of regex parsing.
Downstream modules such as ChainOfCallsRetriever and function analyzers still operate on text chunks, but now those chunks are syntactically well-formed and consistent across runs.
This refactor lays the foundation for future AST-based semantic analysis (e.g. variable/type inference, symbol resolution).
Benchmark
Tested on https://github.com/openshift/origin with 35001 Go files