Merged
61 commits
af0e1ec
refactor: switch from Conda to uv for environment management
Essmaw Jan 26, 2026
8d730ef
refactor: use click and pathlib.Path for CLI
Essmaw Jan 26, 2026
81194b0
fix: update LangChain imports to restore compatibility
Essmaw Jan 26, 2026
c4b528b
refactor: move prompt to txt file in markdown format for clarity
Essmaw Jan 26, 2026
492c57b
feat: add prompt_path option to CLI to query the chatbot
Essmaw Jan 26, 2026
a74c62d
refactor: streamline document processing when creating the database.
Essmaw Jan 26, 2026
ec52352
chore: add pre-commit configuration and dev dependency for prek
Essmaw Jan 27, 2026
1b73bf9
chore: remove unnecessary scripts
Essmaw Jan 27, 2026
1e9590b
feat: extend header renumbering to support level 5 headers in Markdow…
Essmaw Jan 27, 2026
0af9c61
doc(parse_clean_markdown): add comments and improve docstrings
Essmaw Jan 27, 2026
e1e5a12
feat: enhance `create_database` script to support model and provider …
Essmaw Jan 27, 2026
2628212
feat: add logger module with configurable logging to file and stdout
Essmaw Jan 27, 2026
a43861b
feat: improve logging with an example of chunk metadata for each mark…
Essmaw Jan 27, 2026
538a039
refactor: remove useless constants
Essmaw Jan 27, 2026
c43ff00
docs: fix logs
Essmaw Jan 27, 2026
d575452
chore: update pre-commit
Essmaw Jan 27, 2026
bb76583
docs: update log
Essmaw Jan 27, 2026
79ac42a
chore: update dependencies
Essmaw Jan 27, 2026
809c54d
docs: update logs
Essmaw Jan 27, 2026
f1505c6
docs: improve documentation formatting with ruff
Essmaw Jan 27, 2026
7203587
chore: adding dependency
Feb 2, 2026
93b2087
feat: adding pydantic model settings to configure CLI of BioPyAssista…
Feb 2, 2026
27702ce
refactor: fix the add of file_name in chunk metadata
Feb 2, 2026
0a7fa9b
chore: fix some docstrings and misspelling
Feb 2, 2026
eee5c7b
[WIP] - feat: adding the new version of streamlit UI
Feb 2, 2026
5e9c73e
feat: adding prompt path to the setting model.
Essmaw Feb 6, 2026
b1faabd
feat: simplify footer, adding the llm and prompt settings and improve…
Essmaw Feb 6, 2026
83e5eaa
feat: create YAML file listing all Python chapters and levels
Feb 9, 2026
cb0b263
refactor(parse_clean_markdown): load chapters and output paths from Y…
Feb 9, 2026
6a04d81
docs: change log level to success for chapter loading message
Feb 9, 2026
ff1bdf0
refactor(create_database): update argument structure to use YAML for …
Feb 9, 2026
71de1da
feat(create_database): add chapter IDs and separate file_path and fi…
Feb 9, 2026
be07be0
chore: remove unuseful files
Feb 9, 2026
2f1ca08
chore: remove unused gradio dependency
Feb 9, 2026
d25335b
chore(app): remove model config settings (not part of this PR)
Feb 9, 2026
dd4f61a
chore(pre-commit): update ruff version and format configuration
Feb 9, 2026
abdbb0b
feat(create_database): enable embeddings with all OpenRouter models a…
Feb 9, 2026
adc182d
feat(query_chatbot): add level-filtered retrieval, vector DB and LLM…
Feb 9, 2026
01db3e7
chore(prompts): rename prompt directory to prompts
Feb 9, 2026
508e123
fix(parse_clean_markdown): validate processed_file_path before creat…
Feb 9, 2026
87397f9
chore(chapters_and_levels): update chapter IDs to string format
Feb 9, 2026
ebc0872
chore(readme): update setup instructions to use uv for environment ma…
Feb 10, 2026
1b64888
docs: update scripts to use 'uv' for execution
Feb 10, 2026
6c33371
chore: revert Streamlit app to previous version for history
Feb 10, 2026
55ba0fb
chore: remove deprecated config_app.toml file
Feb 10, 2026
a4c7d37
chore: update logging messages and argument names
Feb 10, 2026
1787b52
fixes(query_chatbot): rename ai_api_key to api_key for ChatOpenAI ini…
Feb 11, 2026
7200618
refactor(prompts): rename prompt folder and files for consistency
Feb 11, 2026
1838eb9
refactor(yml file): add quotes to values, replace source_file_path an…
Feb 11, 2026
d8c2ef8
chore(gitignore): ignore raw and processed course data paths
Feb 11, 2026
f19eb97
refactor(parse_and_clean_markdown): build chapter source and destinat…
Feb 11, 2026
2b33578
refactor(yml): update prompt_file to prompt_path for consistency
Feb 11, 2026
b66f18c
feat(query_chatbot): answer to the user question even though it is no…
Feb 11, 2026
8c39644
Refactor(create_database): update retrieval to use processed_file_path
Feb 11, 2026
f838700
refactor(query_chatbot): log answer sentence by sentence for clarity
Feb 11, 2026
3fd7939
chore: added .gitkeep files to ensure data directories are tracked
Feb 11, 2026
418cb3e
chore: update readme
Feb 11, 2026
340ce00
refactor: add checks to avoid AttributeError/UnboundLocalError
Feb 11, 2026
9bd746b
refactor(create_database): remove short-form options (-c, -s...) on CLI
Feb 11, 2026
911c274
refactor(prompt): translate the title in french and remove useless `<<<`
Feb 11, 2026
5f1fe1c
refactor(create_database): rename arguments for consistency
Feb 11, 2026
8 changes: 6 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -4,12 +4,16 @@
# vector database
chroma_*

logs/
tmp/
.DS_Store
.env

# course content
data/markdown_raw/*
data/markdown_processed/*
data/course_raw/*
data/course_processed/*
# Keep the directories themselves
!data/course_raw/.gitkeep
!data/course_processed/.gitkeep

# Byte-compiled / optimized / DLL files
__pycache__/
36 changes: 36 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,36 @@
# Install pre-commit hooks with:
# prek install
exclude: "|tmp/*|"
Copilot AI Feb 10, 2026

The top-level exclude: "|tmp/*|" is a regex that matches the empty string (because of leading/trailing | alternations), which effectively excludes all files from running hooks. Use a proper regex that only matches the tmp directory (e.g. (^|/)tmp/).

Suggested change
exclude: "|tmp/*|"
exclude: '(^|/)tmp/'

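The issue the comment describes can be checked directly in Python: an alternation with empty branches matches the empty string, so the pattern matches (the start of) every path. A minimal sketch, assuming pre-commit tests the pattern against repository-relative file paths:

```python
import re

# The pattern from the pre-commit config: the leading and trailing "|"
# create empty alternatives, so the regex matches any path at position 0.
broken = re.compile(r"|tmp/*|")
assert broken.match("src/create_database.py")  # matches everything

# A pattern anchored to the tmp directory only:
fixed = re.compile(r"(^|/)tmp/")
assert fixed.search("tmp/cache.db")
assert not fixed.search("src/create_database.py")
```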
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: end-of-file-fixer
- id: mixed-line-ending
- id: trailing-whitespace
- id: check-json
- id: check-yaml
- id: check-added-large-files
args: ["--maxkb=5000"]

- repo: https://github.com/asottile/pyupgrade
rev: v3.21.2
hooks:
- id: pyupgrade

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.14.14
hooks:
# Run the linter.
- id: ruff-check
types_or: [python, pyi]
args: [--fix]
# Run the formatter.
- id: ruff-format
types_or: [python, pyi]

- repo: https://github.com/PyCQA/bandit
rev: "1.9.2"
hooks:
- id: bandit
71 changes: 36 additions & 35 deletions README.md
@@ -32,54 +32,54 @@ git clone https://github.com/pierrepo/biopyassistant.git
cd biopyassistant
```

### Install [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).
### Activate the environment

### Create a Conda environment
We use [uv](https://docs.astral.sh/uv/getting-started/installation/)
to manage dependencies and the project environment.

```bash
conda env create -f environment.yml
```

### Activate the Conda environment
Sync dependencies:

```bash
conda activate biopyassistantenv
uv sync
```

### Copy the raw Markdown files of the Python [course](https://github.com/bioinfo-prog/cours-python):

```bash
git clone --depth 1 https://github.com/bioinfo-prog/cours-python.git
rm -f data/markdown_raw/*.md
cp cours-python/cours/*.md data/markdown_raw/
rm -f data/course_raw/*.md
cp cours-python/cours/*.md data/course_raw/
rm -rf cours-python
```

### Process raw Markdown files

```bash
rm -f data/markdown_processed/*.md
python src/parse_clean_markdown.py --in data/markdown_raw --out data/markdown_processed
rm -f data/course_processed/*.md
uv run src/parse_clean_markdown.py --config data/chapters_and_levels.yaml
```

In this step, Python comments (`#`) are slightly changed to avoid confusion with Markdown headers (`#`, `##`...) and headers are numbered (from `## Title` to `## 1.1 Title`). Processed Markdown files are stored in `data/markdown_processed`
In this step, Python comments (`#`) are slightly changed to avoid confusion with Markdown headers (`#`, `##`...) and headers are numbered (from `## Title` to `## 1.1 Title`). Processed Markdown files are stored in `data/course_processed`
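The header-numbering step described above can be sketched as a small regex pass. `number_headers` and the `chapter` argument are illustrative names, not the project's actual API, and the real script would also need to skip fenced code blocks and handle the Python-comment disambiguation separately:

```python
import re

def number_headers(lines, chapter=1):
    """Prefix level 2-5 Markdown headers with hierarchical numbers,
    e.g. '## Title' -> '## 1.1 Title' for chapter 1."""
    counters = [0, 0, 0, 0]  # section counters for header levels 2..5
    out = []
    for line in lines:
        m = re.match(r"^(#{2,5}) (.+)$", line)
        if m:
            level = len(m.group(1)) - 2  # 0 for '##', 3 for '#####'
            counters[level] += 1
            # Reset deeper counters when a shallower header appears
            for i in range(level + 1, len(counters)):
                counters[i] = 0
            numbering = ".".join(str(c) for c in [chapter] + counters[:level + 1])
            line = f"{m.group(1)} {numbering} {m.group(2)}"
        out.append(line)
    return out
```

For example, `number_headers(["## Intro", "### Sub", "## Next"], chapter=1)` yields `["## 1.1 Intro", "### 1.1.1 Sub", "## 1.2 Next"]`.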


### Add OpenAI API key
### Add OpenAI and OpenRouter API keys

Create an `.env` file with a valid OpenAI API key:
Create a `.env` file with valid [OpenAI](https://platform.openai.com/docs/api-reference/authentication) and [OpenRouter](https://openrouter.ai/docs/api/reference/authentication) API keys:

```text
```sh
OPENAI_API_KEY=<your-openai-api-key>
OPENROUTER_API_KEY=<your-openrouter-api-key>
```

> Remark: This `.env` file is ignored by git.
> Remark: This `.env` file is ignored by git.


### Create the vector database

```bash
python src/create_database.py --data-path data/markdown_processed --chroma-path chroma_db
uv run src/create_database.py --course-yaml data/chapters_and_levels.yaml \
--chroma-path chroma_db \
--embedding-model text-embedding-3-large \
--model-provider openai
```

This command will create a Chroma vector database from the processed Markdown files. All files will be split into chunks of 1000 characters with an overlap of 200 characters.
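The chunking described above can be sketched as a plain sliding window. The project presumably uses a LangChain text splitter that also tries to cut on paragraph and sentence boundaries; this standalone version only illustrates the size/overlap arithmetic:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most `chunk_size` characters,
    each sharing `chunk_overlap` characters with the previous chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - chunk_overlap
    return chunks
```

With the defaults, a 2500-character document yields three chunks, and the last 200 characters of each chunk reappear at the start of the next, so a passage near a chunk boundary is still retrievable in one piece.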
@@ -91,21 +91,32 @@ This command will create a Chroma vector database from the processed Markdown fi


```bash
python src/query_chatbot.py --query "Your question here" [--model "model_name"] [--include-metadata]
uv run python src/query_chatbot.py --query "Your question here" \
--level "user_level" \
--model "model_name" \
--provider-llm "provider_name" \
--include-metadata
```

### Options

- 🤖 Model Selection. For instance: `gpt-4o`, `gpt-4-turbo`
- 📝 Include Metadata: Include metadata in the response, such as the sources of the answer. By default, metadata is excluded.
- 📚 **User Level**: Specify the user's Python knowledge level to tailor the chatbot's responses.
Choose between: `beginner`, `intermediate`, `advanced`.
- 🤖 **Model Selection**: Choose the language model for the query. Examples: `gpt-4o`, `deepseek/deepseek-v3.2`, etc.
- 🌐 **LLM Provider**: Specify the provider of the language model. Choose between: `openai`, `openrouter`.
- 📝 **Include Metadata**: Include metadata in the response, such as the sources of the answer. By default, metadata is excluded.

Example:

```bash
python src/query_chatbot.py --query "What is the difference between list and set ?" --model gpt-4-turbo --include-metadata
uv run python src/query_chatbot.py --query "What is the difference between list and set ?" \
--level "advanced" \
--model "gpt-4o" \
--provider-llm "openai" \
--include-metadata
```

This command will query the chatbot with the question "What is the difference between list and set ?" using the `gpt-4-turbo` model and include metadata in the response.
This command will query the chatbot for a response to the question "What is the difference between list and set ?" for an advanced user using the `gpt-4o` model from the `openai` provider. The response will include metadata about the sources of the answer.

Output:

Expand All @@ -124,22 +135,12 @@ For more information, you can refer to the following sources:

## Usage (web interface)

### Streamlit app


```bash
streamlit run src/streamlit_app.py
uv run streamlit run src/streamlit_app.py
```

This will run the Streamlit app in your web browser.


### Gradio App


```bash
python src/gradio_app.py
```

This will run the Gradio app in your web browser. A battle mode is available to compare the responses of different models.
