This repository was archived by the owner on Jan 29, 2024. It is now read-only.
Open
110 commits
c8467d0
add k8s folder and notebook to create indices
Aug 31, 2022
58e353d
update create indices notebook
Aug 31, 2022
bcb5cfd
add create indices notebook
Sep 6, 2022
f0cc141
update requirements
Sep 6, 2022
32cd5c5
add upload es; change to paragraphs; add mypy options
Sep 14, 2022
9efe9d5
run test on VSCode
Sep 14, 2022
1c8ba39
PR comments Emilie
Sep 14, 2022
578c676
PR Emilie
Sep 14, 2022
54fd407
PR Emilie order INSTALL_REQUIRES
Sep 15, 2022
e5378bd
fail under makes test discover on VSCode fail
Sep 15, 2022
fc02b3b
add notebook
Sep 15, 2022
d51ab55
add initial embeddings k8s
Sep 15, 2022
9e7927d
add embedding model
Sep 16, 2022
7cdcfdb
Merge branch 'feature/paragraph-token-length' into feature/embedings-k8s
Sep 16, 2022
9fca694
add embedding locally
Sep 16, 2022
fb8c372
add normalization check
Sep 19, 2022
f73dcaa
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
Sep 20, 2022
ed3156d
adjustments after PR review
Sep 20, 2022
bf93692
adjustments after PR review
Sep 20, 2022
b3da33d
Merge remote-tracking branch 'origin/master' into feature/add-es-uplo…
Sep 20, 2022
c5d146d
changes after PR review
Sep 20, 2022
cd87236
changes after PR review
Sep 20, 2022
7b508cd
remove outdated test
Sep 21, 2022
b5e883f
remove .env file from tox
Sep 21, 2022
fc7c2e6
flake8 fixes
Sep 21, 2022
9988843
add __init__.py
Sep 21, 2022
788854b
add docstring to init
Sep 21, 2022
1206ee3
add docstring to init
Sep 21, 2022
6bc72b8
update docs
Sep 21, 2022
370e14b
add elasticsearch to ci
Sep 21, 2022
7022102
update ci
Sep 21, 2022
e440507
update ci
Sep 21, 2022
20bd85b
update ci
Sep 21, 2022
a10e5bc
update ci
Sep 21, 2022
d31a2a8
Modify test and the design
jankrepl Sep 22, 2022
f73a8a5
linters
Sep 22, 2022
17b56d7
linters
Sep 22, 2022
f5bdfb6
Fix random pandas issue
jankrepl Sep 22, 2022
fd79772
Temp removal of tox dependencies
jankrepl Sep 22, 2022
d7294a7
Completely get rid of en-core-sci-lg
jankrepl Sep 22, 2022
6969c50
Download spacy model just before unit tests
jankrepl Sep 22, 2022
12de5a0
Use underscores instead
jankrepl Sep 22, 2022
f3d8be0
update docstrings
Sep 23, 2022
032aa7c
Merge remote-tracking branch 'origin/fix/fix-unittest-"Feature/start-…
Sep 23, 2022
cfbfcb1
update docstrings
Sep 23, 2022
b3d8046
Merge remote-tracking branch 'origin/master' into feature/add-es-uplo…
Sep 23, 2022
7b9bdb2
revert mypy changes
Sep 23, 2022
687acb3
isort
Sep 23, 2022
e33ba1b
from __future__ import annotations
Sep 23, 2022
e5ab467
separate add es and sql
Sep 23, 2022
e8b134d
reverted article
Sep 23, 2022
ea51316
Write tests
jankrepl Sep 23, 2022
c592b3e
Undo changes in the add module
jankrepl Sep 23, 2022
c37585c
Fix paragraphs
jankrepl Sep 23, 2022
f4b7536
Make sure articles index works correctly
jankrepl Sep 23, 2022
d2ff32d
Make sure paragraphs correct
jankrepl Sep 23, 2022
73ba4e3
refractor add to es
Sep 26, 2022
9bc3abf
add notebook to test list in es
Sep 26, 2022
ee4e4dd
add docs
Sep 26, 2022
897b3d5
Delete test_lists_es.ipynb
drsantos89 Sep 26, 2022
a7772ee
mypy ci
Sep 26, 2022
6a1d729
Merge branch 'feature/add-es-uploader' of https://github.com/BlueBrai…
Sep 26, 2022
22d3fca
from Emilie PR
Sep 26, 2022
da1b64e
Merge remote-tracking branch 'origin/feature/add-es-uploader' into fe…
Sep 26, 2022
a7b0b6e
linters
Sep 26, 2022
55e7d19
docs
Sep 26, 2022
29e4a59
revert mypy
Sep 26, 2022
450eb72
linters ci
Sep 26, 2022
1dc44b1
mypy local
Sep 26, 2022
8c8ddf2
mypy
Sep 27, 2022
e4d2e06
mypy
Sep 27, 2022
7fb25bb
mypy
Sep 27, 2022
c1f452d
mypy
Sep 27, 2022
180bcda
linters
Sep 27, 2022
dbd8e5c
unittest fix
Sep 27, 2022
11e69d9
linters
Sep 27, 2022
5f34f29
unittest remove senttrans dim and add checkpoint
Sep 27, 2022
b8ba858
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
Sep 27, 2022
7349444
add tests model dim and is normalized
Sep 27, 2022
0d41141
add tests
Sep 27, 2022
1d0ba00
update docs
Sep 27, 2022
d0e7264
fix parser str nargs +
Sep 27, 2022
74ff6cf
mypy
Sep 27, 2022
838255e
black
Sep 27, 2022
b2494de
pin elasticsearch
Sep 28, 2022
2f0948d
add skip test if not es
Sep 28, 2022
5f076d6
update ci ES on unit test
Sep 28, 2022
ac4e5b1
fix CI macos
Sep 28, 2022
2021d4f
update ci
Sep 28, 2022
582fabf
update ci docker
Sep 28, 2022
f164a98
update ci macos docker
Sep 28, 2022
e92d248
update ci macos docker
Sep 28, 2022
b5e8353
update ci no docker macos
Sep 28, 2022
b49c323
Emilie PR review
Sep 28, 2022
b222786
Emilie PR review
Sep 28, 2022
0c0b1d3
add seldon embedding
Sep 30, 2022
691d9fa
Merge branch 'master' into feature/embedings-k8s
jankrepl Oct 4, 2022
6771fdb
Fix CLI
jankrepl Oct 4, 2022
5d6f178
Merge branch 'feature/embedings-k8s' of https://github.com/BlueBrain/…
Oct 4, 2022
1fc5f73
lint add es
Oct 4, 2022
2f5db4b
fix embedding function
Oct 4, 2022
c34f32b
Merge remote-tracking branch 'origin/master' into feature/embedings-k8s
Oct 4, 2022
b9d7b50
isort
Oct 4, 2022
0f098a5
add bentoml
Oct 4, 2022
405b6bf
update tests
Oct 4, 2022
28a0146
linters
Oct 4, 2022
97846c1
fix test bentml not available
Oct 4, 2022
17f9cb2
add force embeddings
Oct 21, 2022
2460923
lint
Oct 21, 2022
a5ede03
small fix
Oct 21, 2022
7 changes: 7 additions & 0 deletions docs/source/api/bluesearch.k8s.embeddings.rst
@@ -0,0 +1,7 @@
bluesearch.k8s.embeddings module
================================

.. automodule:: bluesearch.k8s.embeddings
   :members:
   :undoc-members:
   :show-inheritance:
1 change: 1 addition & 0 deletions docs/source/api/bluesearch.k8s.rst
@@ -9,6 +9,7 @@ Submodules

bluesearch.k8s.connect
bluesearch.k8s.create_indices
bluesearch.k8s.embeddings

Module contents
---------------
231 changes: 231 additions & 0 deletions notebooks/check_paragrapha_size.ipynb
@@ -0,0 +1,231 @@
{
Review comment (Contributor): I think there is a typo in the notebook's name (`paragrapha`).

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# connect to ES"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bluesearch.k8s.connect import connect"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = connect()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# tokenize all the paragraphs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tqdm\n",
"from elasticsearch.helpers import scan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(\"sentence-transformers/multi-qa-MiniLM-L6-cos-v1\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the tokens of every paragraph in the \"paragraphs\" index.\n",
"# tokenize() returns subword tokens without special tokens such as [CLS]/[SEP].\n",
"lens = []\n",
"progress = tqdm.tqdm(position=0, unit=\" Docs\", desc=\"Scanning paragraphs\")\n",
"body = {\"query\": {\"match_all\": {}}}\n",
"for hit in scan(client, query=body, index=\"paragraphs\"):\n",
"    tokens = tokenizer.tokenize(hit[\"_source\"][\"text\"])\n",
"    lens.append(len(tokens))\n",
"    progress.update(1)\n",
"progress.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# plot results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"sns.set()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.boxplot(lens)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.boxplot(lens)\n",
"plt.ylim([0, 512])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(lens)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(lens, bins=100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.hist(lens, bins=100)\n",
"plt.xlim([0, 512])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lens=np.array(lens)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Percentage of paragraphs longer than 512 tokens (lens is already a NumPy array).\n",
"(lens > 512).mean() * 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# get biggest paragraphs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Collect the paragraphs with more than 1000 tokens for manual inspection.\n",
"paragraphs = []\n",
"progress = tqdm.tqdm(position=0, unit=\" Docs\", desc=\"Scanning paragraphs\")\n",
"body = {\"query\": {\"match_all\": {}}}\n",
"for hit in scan(client, query=body, index=\"paragraphs\"):\n",
"    tokens = tokenizer.tokenize(hit[\"_source\"][\"text\"])\n",
"    progress.update(1)\n",
"    if len(tokens) > 1000:\n",
"        # Build the joined token string only for paragraphs that are kept.\n",
"        hit[\"_source\"][\"tokenizer\"] = \", \".join(tokens)\n",
"        paragraphs.append(hit[\"_source\"])\n",
"progress.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"paragraphs"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.5 ('py10')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "e14b248c68ef27f7e40aef879e7b97aaa0976632ef81142793ba6d8efee923a4"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
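
The notebook's central measurement, the share of paragraphs whose token count exceeds 512, can be tried out without an Elasticsearch cluster or a downloaded tokenizer. The following is a minimal sketch, not the notebook's code: `whitespace_tokenize`, `token_lengths`, and `percent_over_limit` are hypothetical helpers, the sample paragraphs are made up, and whitespace splitting stands in for the `AutoTokenizer` used in the notebook, so the counts will differ from real subword counts.

```python
from typing import Callable, Iterable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split on whitespace only."""
    return text.split()


def token_lengths(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[int]:
    """Return the number of tokens in each text."""
    return [len(tokenize(text)) for text in texts]


def percent_over_limit(lengths: List[int], limit: int = 512) -> float:
    """Return the percentage of lengths strictly greater than `limit`."""
    if not lengths:
        return 0.0
    return sum(length > limit for length in lengths) / len(lengths) * 100


# Toy paragraphs (made up for illustration) with 3, 600 and 512 tokens.
paragraphs = ["a short paragraph", "word " * 600, "word " * 512]
lengths = token_lengths(paragraphs)
print(lengths)  # [3, 600, 512]
print(percent_over_limit(lengths))  # one of the three paragraphs exceeds 512
```

Passing `tokenizer.tokenize` as the `tokenize` argument should give subword counts like those in the notebook, assuming the tokenizer has been loaded as shown there.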
2 changes: 1 addition & 1 deletion setup.py
@@ -54,7 +54,7 @@
# Required to encrypt mysql password; >= 3.2 to fix RSA decryption vulnerability
"cryptography>=3.2",
"defusedxml",
-"elasticsearch>=8",
+"elasticsearch==8.3.3",
"google-cloud-storage",
"h5py",
"ipython",