CITZ-IMB AI Repository

Welcome to the CITZ-IMB AI repository. This repository contains a collection of tools, workflows, and examples for building, deploying, and maintaining AI models and their infrastructure.

Introduction

This project aims to provide a comprehensive framework for developing AI solutions, including data preprocessing, model training, deployment, and monitoring. It leverages modern technologies like FastAPI for the backend, various machine learning libraries for model development, and MLOps practices for managing the lifecycle of AI models.

Getting Started

To get started with the CITZ-IMB AI repository, follow these steps:

Clone the Repository

git clone https://github.com/bcgov/citz-imb-ai.git
cd citz-imb-ai

Prerequisite Knowledge

Before diving into this repository, it is beneficial to have a basic understanding of the following concepts and technologies:

  • Python Programming: Proficiency in Python is essential as the majority of the codebase is written in Python.
  • Machine Learning: Familiarity with machine learning concepts and frameworks such as TensorFlow and PyTorch, as well as LLM libraries like LangChain and LlamaIndex.
  • FastAPI: Understanding of how to build and deploy APIs using FastAPI.
  • JavaScript/React: Understanding of how to build and deploy a frontend using React and JavaScript.
  • Docker/Podman: Experience containerizing applications with Podman or Docker and using Compose for orchestration.
  • OpenShift/Kubernetes: Basic understanding of OpenShift/Kubernetes for managing containerized applications.
  • MLOps (Airflow): Familiarity with Apache Airflow and MLOps concepts for workflow orchestration.
  • Graph DB (Neo4j): Basic knowledge of graph databases, specifically Neo4j, as it is used in this project.

Getting started with the project

The best way to get started is to launch the Jupyter Notebook Docker container. The container starts the Jupyter Notebook server, and you can access the notebooks from your browser.

All the example files should be present when you launch the notebook. Try running some of the examples to get started. You may need to install some dependencies, connect to the VPN, and modify parts of the scripts (for example, the folder where downloaded files are stored) before the examples will run.
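Before opening the browser, you can confirm the notebook server inside the container is actually serving. The sketch below is a stdlib-only check and assumes Jupyter's default port 8888; your compose file may map a different one.

```python
# Stdlib-only check that the Jupyter server is reachable before you open
# the browser. Port 8888 is Jupyter's default and an assumption here;
# adjust to whatever your compose file maps.
from urllib.request import urlopen
from urllib.error import URLError

def jupyter_up(url: str = "http://localhost:8888", timeout: float = 2.0) -> bool:
    """Return True if the URL answers an HTTP request successfully."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    print("Jupyter reachable:", jupyter_up())
```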

Set Up Local ENV

  1. Create and populate a .docker/.env file:
    • Check compose.controller.yaml to identify necessary environment variables.
    • Obtain values from your team members.
  2. Build the Docker images.
  3. Open the Jupyter notebook interface at localhost.
  4. Open a terminal within Jupyter notebook.
  5. Run pip install -r requirements.txt.
  6. Run dvc pull to fetch data from S3 (ensure you are connected to the BC Gov VPN). Refer to the detailed instructions in DVCReadme.md for setup guidance, troubleshooting, and best practices to avoid common data synchronization issues.
  7. (Legacy) Run python s3.py to download Acts data (ensure you are connected to the BC Gov VPN).
  8. Initialize the TruLens database:
    • Create a database named trulens.
    • Run the upgrade script located in trulens_upgrade.ipynb before launching the web application.
    • TruLens captures all evaluation data, and the web application requires this setup to function correctly.

You are now set to use the existing Jupyter notebooks.
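The .docker/.env file from step 1 is a plain KEY=VALUE list. As an illustration, here is a minimal stdlib loader for that format; the variable names in the comment are hypothetical examples, and the authoritative list is whatever compose.controller.yaml references.

```python
# Minimal loader for a KEY=VALUE .env file such as .docker/.env.
# Example entries might look like NEO4J_PASSWORD=... or S3_BUCKET=...
# (hypothetical names; check compose.controller.yaml for the real ones).
import os

def load_env(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

if __name__ == "__main__":
    path = ".docker/.env"
    if os.path.exists(path):
        loaded = load_env(path)
        os.environ.update(loaded)
        print(f"Loaded {len(loaded)} variables from {path}")
```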

Note: Always add future dependencies to requirements.txt to maintain clarity and prevent code breakage.

For example notebooks, visit: Examples

Populating Your Neo4j Data

⚠️ Prerequisite
This assumes you already have Jupyter Notebook, Airflow, and Neo4j set up.
If not, please follow the instructions above to spin up all required containers before proceeding.

📄 Next Step:
Refer to NEO4J data generation steps for a detailed breakdown.
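Before kicking off the data generation, it can save time to confirm the prerequisite services are reachable. The sketch below is a plain TCP check; the ports (7687/7474 for Neo4j, 8080 for Airflow) are the usual defaults and an assumption here, so adjust them to whatever your compose setup maps.

```python
# Plain TCP reachability check for the services the Neo4j population steps
# depend on. Ports below are common defaults (Neo4j Bolt 7687, Neo4j HTTP
# 7474, Airflow web UI 8080) and are assumptions; adjust to your setup.
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("Neo4j Bolt", 7687), ("Neo4j HTTP", 7474), ("Airflow", 8080)]:
        state = "reachable" if port_open("localhost", port) else "NOT reachable"
        print(f"{name} on localhost:{port}: {state}")
```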


Documentation

To navigate all the work done in the project, please refer to the sections below:

- Overview
    - Architecture
    - Detailed Flow
    - Preprocessing Workflow
    - Active Learning Pipeline
- Examples
- Presentations
    - Technical Presentation
- Web Interface (frontend and backend)
    - Frontend
    - Backend
- Feedback infrastructure
- Embedding Adaptors
- MLOps
    - MLOps Airflow Scripts
- Hardware / HPC
    - Intel 3rd Gen Xeon CPU HPC - Embeddings Generation
    - HPC Documentation: WordPiece Pre-tokenizer, Murmur3 Hash Table, Memory Pool, End-to-End Legislative Data Processing

Repository Structure

The repository is organized into several key sections:

- Architecture: Contains the overall architecture and design documents.
- Preprocessing: Includes data preprocessing workflows and scripts.
- Examples: Provides example scripts and notebooks to help you get started.
- Web Interface:
    - Frontend: Source code for the frontend web interface.
    - Backend: FastAPI-based backend service code.
- Feedback Infrastructure: Components for collecting and analyzing user feedback.
- MLOps: Tools and scripts for managing the machine learning lifecycle.
- Hardware: Information and scripts for high-performance computing tasks.

By following the links provided in each section, you can navigate to the respective directories and explore the code, documentation, and resources available. This structure is designed to help you quickly find and understand the different components of the project.
