@advenk advenk commented Jul 17, 2025

This PR contains all changes made during GSoC'25 for the Hindi Chapter.

Contributor: Aditya Venkatesh

Mentors: Dr. Sanju Tiwari, Debarghya Datta, Dr. Ronak Panchal

Description: The aim of this project was to evaluate and enhance various stages of the existing pipeline for information extraction from Hindi text. The project had several goals:

* Streamline the existing pipeline and make it easy to run
* Evaluate the performance of the existing pipeline
* Experiment with and implement new triplet extraction methods using Small Language Models (SLMs)
* Enhance the IndIE method by integrating LLM-based extraction and filtering in its last phase
* Implement and evaluate link prediction to predict missing links in the existing Hindi KG
* Implement predicate linking in the existing framework
* Deploy the SPARQL endpoint and test its performance

## Important links
- Blog: https://advenk.github.io/av-blog/
- Hindi SPARQL temporary endpoint deployed at http://95.217.58.54:8890/sparql (archived as of 15/09/2025)
- Unmerged PR for Hindi mappings and extractors: https://github.com/dbpedia/extraction-framework/pull/776 (can be merged only after updating the mappings via UI)
- SPARQL server performance testing code (not to be merged): https://github.com/advenk/virtuoso-sparql-endpoint-quickstart/tree/gsoc25_hindi_chapter

## What has been done
1. Streamlined the GSoC24_H changes to make the pipeline easy to run:
- Updated model download script
- Updated model paths
- Updated requirements
- Added config.toml for required models
- Added workaround instructions wherever they are required to run the pipeline

2. Created a framework for evaluating the performance of open-source language models via Ollama on Hindi information extraction:
- Support for various prompt templates
- Support for both .generate and .chat interfaces
- Output parser to handle various output formats from the LMs
- Hindi-BenchIE (the gold standard for Hindi IE) integration to evaluate our performance
- Integrated ReAct prompting strategy
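
To illustrate the output-parsing step, here is a minimal sketch of a parser that tolerates two output styles. The function name `parse_triples` and both accepted formats (a JSON list of 3-element lists, and parenthesised comma-separated tuples) are assumptions made for illustration; the parser in this PR may handle different formats.

```python
import json
import re

# Hypothetical helper for illustration; not the parser shipped in this PR.
TUPLE_PATTERN = re.compile(r"\(([^()]+)\)")


def parse_triples(raw: str) -> list[tuple[str, ...]]:
    """Extract (subject, relation, object) triples from raw LM output."""
    text = raw.strip()

    # Format 1 (assumed): a JSON list of 3-element lists.
    try:
        data = json.loads(text)
        if isinstance(data, list):
            return [
                tuple(str(part).strip() for part in item)
                for item in data
                if isinstance(item, (list, tuple)) and len(item) == 3
            ]
    except json.JSONDecodeError:
        pass  # Not JSON, so fall through to the tuple format.

    # Format 2 (assumed): one "(subject, relation, object)" group per triple.
    triples = []
    for match in TUPLE_PATTERN.finditer(text):
        parts = [part.strip() for part in match.group(1).split(",")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples
```

Trying the strict format first and falling back to a regular expression keeps malformed generations from aborting an evaluation run; unparseable output simply yields an empty list.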

3. Link prediction notebook:
- Analysed the Hindi KG generated via the DIEF
- Trained the link prediction models TransE and ConvE
- Trained the same models using initial embeddings derived from MuRIL
- Analysed and documented the performance
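
For readers unfamiliar with TransE, the sketch below shows the scoring idea behind it: a triple (h, r, t) is plausible when the vector h + r lies close to t. This is a pure-Python toy with made-up 2-dimensional embeddings, not the notebook's training code; the function names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """Return the negative L2 distance -||h + r - t||; higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Sort candidate tail names (name -> vector) from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )


# Toy embeddings chosen by hand purely for illustration.
head = [1.0, 0.0]
relation = [0.0, 1.0]
candidates = {"near": [1.0, 1.0], "far": [3.0, 3.0]}
ranking = rank_tails(head, relation, candidates)
```

Predicting a missing link then amounts to ranking all candidate tails for a given head and relation and taking the top entries.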

4. IndIE enhancement:
- Added support for LLM extraction to complement the handwritten rules
- Added support for LLM filtering on the final output
- Updated the README with instructions for running and evaluating the final framework
- Testable on the Hindi-BenchIE dataset
- Scored 65% recall with the latest prompt strategy, which provides the chunks and dependency tree to the LLM
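
The prompt strategy in the last bullet could look roughly like the sketch below, which serialises the chunks and dependency edges into the prompt text. The function `build_prompt`, the prompt wording, and the example sentence with its analysis are all hypothetical; the actual prompt used in this PR is not reproduced here.

```python
def build_prompt(sentence, chunks, dependencies):
    """Build an extraction prompt that exposes chunks and dependency edges.

    `dependencies` is a list of (head, relation, dependent) tuples.
    Hypothetical format for illustration only.
    """
    chunk_lines = "\n".join(f"- {chunk}" for chunk in chunks)
    dep_lines = "\n".join(
        f"- {relation}({head}, {dependent})" for head, relation, dependent in dependencies
    )
    return (
        "Extract (subject, relation, object) triples from the Hindi sentence.\n"
        f"Sentence: {sentence}\n"
        f"Chunks:\n{chunk_lines}\n"
        f"Dependency edges:\n{dep_lines}\n"
        "Triples:"
    )


# Illustrative input; the analysis below was written by hand, not produced by a parser.
prompt = build_prompt(
    "राम ने सीता को पुस्तक दी",
    ["राम ने", "सीता को", "पुस्तक", "दी"],
    [("दी", "nsubj", "राम"), ("दी", "iobj", "सीता"), ("दी", "obj", "पुस्तक")],
)
```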

5. Finetuning:
- Synthetic data generation script that generates data through the Gemini API for finetuning Gemma locally
- Ran LoRA finetuning using MLX, but found that the generated data was not good enough
- Rewrote the synthetic data generation and filtering scripts

6. SPARQL deployment
- SPARQL endpoint deployed at http://95.217.58.54:8890/sparql
- Performance-tested the endpoint using the code at https://github.com/advenk/virtuoso-sparql-endpoint-quickstart/tree/gsoc25_hindi_chapter
- Created and hosted docker image for easy deployment
- Can be deployed on any server with:
```
docker run -d -p 8890:8890 -p 1111:1111 --name hindi-sparql 42bitstogo/hindi-dbpedia-sparql:latest
```
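
Once the container is running, the endpoint can be queried over the standard SPARQL protocol. The sketch below assumes the container is reachable at `http://localhost:8890/sparql` (following the port mapping in the command above) and uses only the Python standard library; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Docker container above is running on this machine.
ENDPOINT = "http://localhost:8890/sparql"


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


def run_query(query: str, endpoint: str = ENDPOINT) -> list[dict]:
    """Send the query and return the result bindings (requires a live endpoint)."""
    with urllib.request.urlopen(build_query_url(query, endpoint), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

For example, `run_query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")` returns up to five bindings, each a dictionary keyed by variable name.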

7. Predicate linking:
- Predicate linking code lives in src/predicate_linking.py
- Added entity linking normalisation (src/el_normalisation.py) to map Wikidata IDs to DBpedia resources
- Added graph-, lexical-, and embedding-based predicate linking
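
As a rough illustration of combining signals for predicate linking, the sketch below blends a lexical similarity with an embedding cosine similarity and picks the best-scoring property. It omits the graph signal entirely, and the function names, the weighting scheme, the Hindi labels, and the toy vectors are assumptions; it is not the code in src/predicate_linking.py.

```python
import math
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def link_predicate(phrase, phrase_vec, candidates, lexical_weight=0.5):
    """Return the candidate property with the highest blended score.

    `candidates` maps a property name to a (label, embedding) pair.
    """
    def score(item):
        _, (label, vec) = item
        return (
            lexical_weight * lexical_similarity(phrase, label)
            + (1 - lexical_weight) * cosine_similarity(phrase_vec, vec)
        )

    best_property, _ = max(candidates.items(), key=score)
    return best_property


# Toy candidates: labels and 2-d vectors are hand-made for illustration.
candidates = {
    "dbo:birthPlace": ("जन्म स्थान", [1.0, 0.0]),
    "dbo:deathPlace": ("मृत्यु स्थान", [0.0, 1.0]),
}
linked = link_predicate("जन्म स्थान", [0.9, 0.1], candidates)
```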

## What's left to do
- Finetune gemma3:12b using the final filtered dataset 
- Enhance predicate linking to handle isA relations (type relations) where objects are classes
- Add mappings via UI (as per PR: https://github.com/dbpedia/extraction-framework/pull/776) 
- Deploy the SPARQL endpoint once a permanent Hindi chapter server is available

## Acknowledgements
I am super grateful for the guidance of all my mentors during this GSoC period. I have learned a lot and hope to continue contributing to DBpedia and open source.

@mommi84 mommi84 self-requested a review January 19, 2026 19:29
@mommi84 mommi84 merged commit 1ebe3e4 into dbpedia:main Jan 24, 2026