This repository contains the project implementation and report for an SDG (Sustainable Development Goals) reporting system that leverages Text Mining, Natural Language Processing (NLP), and Large Language Models (LLMs). The core objective is to automate and enhance the process of identifying and extracting information relevant to SDGs from unstructured textual data, providing insights into an organization's contributions towards these global goals.
The project addresses the challenges of manual SDG reporting by offering a scalable and efficient solution for data collection, preprocessing, feature extraction, and classification using advanced AI techniques.
- Automate SDG Reporting: Move from manual, time-consuming reporting to an automated, data-driven process.
- Leverage NLP & LLMs: Utilize state-of-the-art NLP techniques and Large Language Models for intelligent text analysis and classification.
- Identify SDG Contributions: Accurately classify text segments based on their relevance to specific SDGs.
- Provide Actionable Insights: Offer a comprehensive overview of an organization's impact on sustainable development.
- Data Collection & Preprocessing: Includes methods for handling unstructured text data, converting PDFs to text, and cleaning raw data.
- Text Feature Extraction: Techniques such as TF-IDF are used to convert text into numerical features suitable for machine learning models.
- Machine Learning Models: Implementation of various ML models (e.g., SVM, Random Forest, Naive Bayes) for text classification.
- LLM Integration: Exploration of LLMs for advanced semantic understanding and categorization of text related to SDGs.
- Performance Evaluation: Metrics like Accuracy, Precision, Recall, and F1-score are used to evaluate model performance.
- Visualization: Tools for visualizing results and SDG contributions (the report covers visualization, though the plotting code itself may not be included in this repo).
- Python: Primary programming language for data processing, NLP, and ML.
- Natural Language Processing (NLP) Libraries: (e.g., NLTK, spaCy, Hugging Face Transformers for LLMs)
- Machine Learning Libraries: (e.g., scikit-learn, PyTorch/TensorFlow for LLMs)
- Data Handling: (e.g., Pandas for data manipulation)
- PDF Processing: (e.g., PyPDF2, pdfminer.six for PDF to text conversion)
- Python 3.x
- pip (Python package installer)
- Clone the repository:
git clone https://github.com/Dhanushmh5/YourSDGRepoName.git  # Replace YourSDGRepoName with the actual repository name
cd YourSDGRepoName
- Create and activate a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
- Install dependencies:
(You'll need to create a requirements.txt file listing all Python libraries used. Based on the report, it would include pandas, numpy, scikit-learn, nltk, spacy, transformers, torch or tensorflow, and pypdf2.)
pip install -r requirements.txt
Example requirements.txt content:
pandas
numpy
scikit-learn
nltk
spacy
transformers
torch   # or tensorflow, depending on the LLM framework
pypdf2  # or pdfminer.six
- Download necessary NLP models/data (if applicable):
python -m nltk.downloader punkt stopwords  # Example for NLTK
python -m spacy download en_core_web_sm    # Example for spaCy
(This section will depend on what scripts you have in your repository. For example:)
- Run the data preprocessing script:
python scripts/preprocess_data.py
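Preprocessing typically lowercases text, strips punctuation, and removes stopwords. A minimal pure-Python sketch (the stopword list and `clean_text` name are illustrative; the actual script may rely on NLTK or spaCy):

```python
import re
import string

STOPWORDS = {"the", "a", "an", "and", "of", "in", "is"}  # tiny illustrative list

def clean_text(raw):
    """Lowercase, strip punctuation, collapse whitespace, drop stopwords."""
    text = raw.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in re.split(r"\s+", text) if t and t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("Access to Clean Water, and Sanitation!"))
# access to clean water sanitation
```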
- Train the ML models:
python scripts/train_model.py
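As a hedged sketch of what a training script might do (the data and names here are toy examples, not the repository's actual code), combining TF-IDF features, an SVM classifier, and the evaluation metrics listed earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

# Toy labelled segments; real labels would come from annotated report data.
texts = [
    "New wells supply safe drinking water to three villages.",
    "The factory switched to renewable solar energy.",
    "Clean water access improved across rural districts.",
    "Wind turbines now power the main production plant.",
]
labels = ["SDG6", "SDG7", "SDG6", "SDG7"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

# Evaluated on the training set for brevity; a real run would hold out a test split.
preds = clf.predict(X)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
```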
- Run LLM-based analysis:
python scripts/llm_analysis.py
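One way the LLM step could work is zero-shot classification via Hugging Face Transformers. The model choice, the descriptive label set, and `classify_segment` below are illustrative assumptions, not the repository's actual implementation:

```python
from transformers import pipeline

# Descriptive phrases tend to work better than bare codes as zero-shot labels;
# this mapping is an illustrative assumption.
SDG_LABELS = {
    "clean water and sanitation": "SDG 6",
    "affordable and clean energy": "SDG 7",
    "climate action": "SDG 13",
}

def classify_segment(text, classifier):
    """Map the best-scoring zero-shot label back to an SDG code."""
    result = classifier(text, candidate_labels=list(SDG_LABELS))
    return SDG_LABELS[result["labels"][0]]

# Usage (downloads the model weights on first run):
# clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# classify_segment("We installed rooftop solar on every warehouse.", clf)
```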
- Generate reports/visualizations:
(Adjust script names to match your actual files)
python scripts/generate_report.py
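At its simplest, report generation aggregates per-segment predictions into per-SDG counts; a minimal pandas sketch (the data below is made up for illustration):

```python
import pandas as pd

# Hypothetical classifier output; a real run would load predictions from the pipeline.
df = pd.DataFrame({
    "segment": ["wells built", "solar installed", "water filters donated"],
    "sdg": ["SDG 6", "SDG 7", "SDG 6"],
})

# Count how many classified segments fall under each SDG.
summary = df["sdg"].value_counts()
print(summary.to_dict())  # {'SDG 6': 2, 'SDG 7': 1}
```

Such a summary table is a natural input for the bar charts or dashboards the report describes.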