This repository was archived by the owner on Jan 29, 2024. It is now read-only.
Feature/embedings k8s #632
Open
drsantos89 wants to merge 110 commits into master from feature/embedings-k8s
110 commits
c8467d0
add k8s folder and notebook to create indices
58e353d
update create indices notebook
bcb5cfd
add create indices notebook
f0cc141
update requirements
32cd5c5
add upload es; change to paragraphs; add mypy options
9efe9d5
run test on VSCode
1c8ba39
PR comments Emilie
578c676
PR Emilie
54fd407
PR Emilie order INSTALL_REQUIRES
e5378bd
fail under makes test discover on VSCode fail
fc02b3b
add notebook
d51ab55
add initial embeddings k8s
9e7927d
add embedding model
7cdcfdb
Merge branch 'feature/paragraph-token-length' into feature/embedings-k8s
9fca694
add embedding locally
fb8c372
add normalization check
f73dcaa
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
ed3156d
adjustments after PR review
bf93692
adjustments after PR review
b3da33d
Merge remote-tracking branch 'origin/master' into feature/add-es-uplo…
c5d146d
changes after PR review
cd87236
changes after PR review
7b508cd
remove outdated test
b5e883f
remove .env file from tox
fc7c2e6
flake8 fixes
9988843
add __init__.py
788854b
add docstring to init
1206ee3
add docstring to init
6bc72b8
update docs
370e14b
add elasticsearch to ci
7022102
update ci
e440507
update ci
20bd85b
update ci
a10e5bc
update ci
d31a2a8
Modify test and the design
jankrepl
f73a8a5
linters
17b56d7
linters
f5bdfb6
Fix random pandas issue
jankrepl
fd79772
Temp removal of tox dependencies
jankrepl
d7294a7
Completely get rid of en-core-sci-lg
jankrepl
6969c50
Download spacy model just before unit tests
jankrepl
12de5a0
Use underscores instead
jankrepl
f3d8be0
update docstrings
032aa7c
Merge remote-tracking branch 'origin/fix/fix-unittest-"Feature/start-…
cfbfcb1
update docstrings
b3d8046
Merge remote-tracking branch 'origin/master' into feature/add-es-uplo…
7b9bdb2
revert mypy changes
687acb3
isort
e33ba1b
from __future__ import annotations
e5ab467
separate add es and sql
e8b134d
reverted article
ea51316
Write tests
jankrepl
c592b3e
Undo changes in the add module
jankrepl
c37585c
Fix paragraphs
jankrepl
f4b7536
Make sure articles index works correctly
jankrepl
d2ff32d
Make sure paragraphs correct
jankrepl
73ba4e3
refractor add to es
9bc3abf
add notebook to test list in es
ee4e4dd
add docs
897b3d5
Delete test_lists_es.ipynb
drsantos89
a7772ee
mypy ci
6a1d729
Merge branch 'feature/add-es-uploader' of https://github.com/BlueBrai…
22d3fca
from Emilie PR
da1b64e
Merge remote-tracking branch 'origin/feature/add-es-uploader' into fe…
a7b0b6e
linters
55e7d19
docs
29e4a59
revert mypy
450eb72
linters ci
1dc44b1
mypy local
8c8ddf2
mypy
e4d2e06
mypy
7fb25bb
mypy
c1f452d
mypy
180bcda
linters
dbd8e5c
unittest fix
11e69d9
linters
5f34f29
unittest remove senttrans dim and add checkpoint
b8ba858
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
7349444
add tests model dim and is normalized
0d41141
add tests
1d0ba00
update docs
d0e7264
fix parser str nargs +
74ff6cf
mypy
838255e
black
b2494de
pin elasticsearch
2f0948d
add skip test if not es
5f076d6
update ci ES on unit test
ac4e5b1
fix CI macos
2021d4f
update ci
582fabf
update ci docker
f164a98
update ci macos docker
e92d248
update ci macos docker
b5e8353
update ci no docker macos
b49c323
Emilie PR review
b222786
Emilie PR review
0c0b1d3
add seldon embedding
691d9fa
Merge branch 'master' into feature/embedings-k8s
jankrepl
6771fdb
Fix CLI
jankrepl
5d6f178
Merge branch 'feature/embedings-k8s' of https://github.com/BlueBrain/…
1fc5f73
lint add es
2f5db4b
fix embedding function
c34f32b
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
b9d7b50
isort
0f098a5
add bentoml
405b6bf
update tests
28a0146
linters
97846c1
fix test bentml not available
17f9cb2
add force embeddings
2460923
lint
a5ede03
small fix
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| bluesearch.k8s.embeddings module | ||
| ================================ | ||
| | ||
| .. automodule:: bluesearch.k8s.embeddings | ||
| :members: | ||
| :undoc-members: | ||
| :show-inheritance: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,231 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# connect to ES" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from bluesearch.k8s.connect import connect" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "client = connect()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# tokenize all the paragraphs" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import tqdm\n", | ||
| "from elasticsearch.helpers import scan" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from transformers import AutoTokenizer" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "tokenizer = AutoTokenizer.from_pretrained(\"sentence-transformers/multi-qa-MiniLM-L6-cos-v1\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "lens = []\n", | ||
| "progress = tqdm.tqdm(position=0, unit=\" Docs\", desc=\"Scanning paragraphs\")\n", | ||
| "body = {\"query\":{\"match_all\":{}}}\n", | ||
| "for hit in scan(client, query=body, index=\"paragraphs\"):\n", | ||
| " emb = tokenizer.tokenize(hit['_source']['text'])\n", | ||
| " lens.append(len(emb))\n", | ||
| " progress.update(1)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# plot results" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import matplotlib.pyplot as plt\n", | ||
| "import seaborn as sns\n", | ||
| "sns.set()" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "plt.boxplot(lens)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "plt.boxplot(lens)\n", | ||
| "plt.ylim([0, 512])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "plt.hist(lens)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "plt.hist(lens, bins=100)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "plt.hist(lens, bins=100)\n", | ||
| "plt.xlim([0, 512])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import numpy as np" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "lens=np.array(lens)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "len(lens[np.array(lens)>512]) / len(lens) * 100" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# get biggest paragraphs" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "paragraphs = []\n", | ||
| "progress = tqdm.tqdm(position=0, unit=\" Docs\", desc=\"Scanning paragraphs\")\n", | ||
| "body = {\"query\":{\"match_all\":{}}}\n", | ||
| "for hit in scan(client, query=body, index=\"paragraphs\"):\n", | ||
| " emb = tokenizer.tokenize(hit['_source']['text'])\n", | ||
| " hit['_source']['tokenizer'] = ', '.join(emb)\n", | ||
| " progress.update(1)\n", | ||
| " if len(emb) > 1000:\n", | ||
| " paragraphs.append(hit['_source'])" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "paragraphs" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3.10.5 ('py10')", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.10.5" | ||
| }, | ||
| "orig_nbformat": 4, | ||
| "vscode": { | ||
| "interpreter": { | ||
| "hash": "e14b248c68ef27f7e40aef879e7b97aaa0976632ef81142793ba6d8efee923a4" | ||
| } | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 2 | ||
| } | ||
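The notebook in the diff above tokenizes every paragraph, computes the percentage of paragraphs longer than 512 tokens, and collects the paragraphs longer than 1000 tokens. A minimal sketch of that arithmetic follows, written without Elasticsearch or `transformers` so it runs on its own. The helper names (`token_lengths`, `percent_over_limit`, `longest_texts`), the sample texts, and the whitespace tokenizer are illustrative assumptions, not part of the PR; the notebook itself uses `AutoTokenizer` and `elasticsearch.helpers.scan`.

```python
from typing import Callable, Iterable, List


def token_lengths(
    texts: Iterable[str], tokenize: Callable[[str], List[str]]
) -> List[int]:
    """Return the number of tokens the tokenizer produces for each text."""
    return [len(tokenize(text)) for text in texts]


def percent_over_limit(lengths: List[int], limit: int = 512) -> float:
    """Return the percentage of lengths strictly greater than `limit`."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths) * 100


def longest_texts(
    texts: List[str], lengths: List[int], threshold: int = 1000
) -> List[str]:
    """Return the texts whose token count is strictly above `threshold`."""
    return [text for text, n in zip(texts, lengths) if n > threshold]


# Stand-in data: in the notebook the texts come from the "paragraphs" index.
texts = ["a short paragraph", "word " * 600, "token " * 1200]

# Stand-in tokenizer (assumption): whitespace splitting instead of the
# Hugging Face tokenizer's tokenize() method used in the notebook.
lens = token_lengths(texts, str.split)

share = percent_over_limit(lens)      # same formula as the notebook cell
outliers = longest_texts(texts, lens)  # same > 1000 filter as the notebook

print(f"{share:.1f}% of paragraphs exceed 512 tokens")
print(f"{len(outliers)} paragraph(s) exceed 1000 tokens")
```

Passing the tokenizer in as a callable keeps the counting logic testable without a model download; under that assumption, swapping `str.split` for `tokenizer.tokenize` should reproduce the notebook's numbers.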
I think there is a typo in the name of the notebook: "paragrapha".