From 2a0d7ce404402f5a12866f965a7a8f705b42f971 Mon Sep 17 00:00:00 2001 From: YanivHyper-Space <124336435+YanivHyper-Space@users.noreply.github.com> Date: Thu, 11 Jul 2024 12:13:31 +0300 Subject: [PATCH] Add files via upload RAG notebook added --- DataSets/RAG/RAG.ipynb | 2390 ++++++++++++++++++++++++++++++++++++++++ DataSets/RAG/Readme.md | 4 + 2 files changed, 2394 insertions(+) create mode 100644 DataSets/RAG/RAG.ipynb create mode 100644 DataSets/RAG/Readme.md diff --git a/DataSets/RAG/RAG.ipynb b/DataSets/RAG/RAG.ipynb new file mode 100644 index 0000000..223a674 --- /dev/null +++ b/DataSets/RAG/RAG.ipynb @@ -0,0 +1,2390 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Building a RAG system using Hyperspace\n", + "Retrieval-Augmented Generation (RAG) is a method for improving the quality and accuracy of generated responses by combining retrieval-based methods with generative models. By feeding the generative model external knowledge sources during generation, RAG can produce more informed, contextually relevant, and factually accurate outputs than a generative model used on its own.\n", + "\n", + "\n", + "This notebook illustrates how Hyperspace hybrid search can be used to build a simple RAG system that improves the output of a chat LLM.\n", + "\n", + "As the LLM, we will use the Microsoft [DialoGPT-small](https://huggingface.co/microsoft/DialoGPT-small)
model from the Hugging Face website.\n", + "The vectors were embedded using the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hyper-space-io/QuickStart/blob/master/DataSets/ImageAndTextSearch/ImageAndTextSearch.ipynb)\n", + "\n", + "# The Dataset - Hyperspace Documentation\n", + "The chat will support searching the Hyperspace documentation. We will first use BeautifulSoup4 to collect information from the Hyperspace website, and then use it to augment the chat requests.\n", + "We will use Hyperspace filtering to prevent the system from returning results for irrelevant queries." + ], + "metadata": { + "id": "GLMYcpGtUyA3" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Installing Relevant Packages\n", + "We start by installing the required packages." + ], + "metadata": { + "id": "Nehv_SQcYqZ3" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install InstructorEmbedding\n", + "!pip install sentence-transformers==2.2.2\n", + "!pip install beautifulsoup4 requests\n", + "!pip install sentence-transformers\n", + "!pip install transformers\n", + "!pip install torch" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HU4yKOf5p1-b", + "outputId": "d04135f2-1169-443e-b33f-bad0e8653495" + }, + "execution_count": 307, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: InstructorEmbedding in /usr/local/lib/python3.10/dist-packages (1.0.1)\n", + "Requirement already satisfied: sentence-transformers==2.2.2 in /usr/local/lib/python3.10/dist-packages (2.2.2)\n", + "Requirement already satisfied: transformers<5.0.0,>=4.6.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (4.41.2)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from 
sentence-transformers==2.2.2) (4.66.4)\n", + "Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (2.3.0+cu121)\n", + "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (0.18.0+cu121)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (1.25.2)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (1.2.2)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (1.11.4)\n", + "Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (3.8.1)\n", + "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (0.1.99)\n", + "Requirement already satisfied: huggingface-hub>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==2.2.2) (0.23.4)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (3.15.4)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (2023.6.0)\n", + "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (24.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (6.0.1)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (2.31.0)\n", + "Requirement already satisfied: 
typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (4.12.2)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (1.12.1)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (3.3)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (3.1.4)\n", + "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.105)\n", + "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (8.9.2.26)\n", + "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.3.1)\n", + "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (11.0.2.54)\n", + "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (10.3.2.106)\n", + "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (11.4.5.107)\n", + "Requirement already 
satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.0.106)\n", + "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (2.20.5)\n", + "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (12.1.105)\n", + "Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers==2.2.2) (2.3.0)\n", + "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.6.0->sentence-transformers==2.2.2) (12.5.82)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers==2.2.2) (2024.5.15)\n", + "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers==2.2.2) (0.19.1)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers==2.2.2) (0.4.3)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers==2.2.2) (8.1.7)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers==2.2.2) (1.4.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers==2.2.2) (3.5.0)\n", + "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision->sentence-transformers==2.2.2) (9.4.0)\n", + "Requirement already satisfied: 
MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.6.0->sentence-transformers==2.2.2) (2.1.5)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (3.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers==2.2.2) (2024.6.2)\n", + "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.6.0->sentence-transformers==2.2.2) (1.3.0)\n", + "Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (4.12.3)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (2.31.0)\n", + "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (2.5)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests) (3.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests) (2024.6.2)\n", + "Requirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.2.2)\n", + "Requirement already satisfied: 
transformers<5.0.0,>=4.6.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.41.2)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.66.4)\n", + "Requirement already satisfied: torch>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.3.0+cu121)\n", + "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.18.0+cu121)\n", + "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.25.2)\n", + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.2.2)\n", + "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (1.11.4)\n", + "Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (3.8.1)\n", + "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.1.99)\n", + "Requirement already satisfied: huggingface-hub>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.23.4)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.15.4)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers) (2023.6.0)\n", + "Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers) (24.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers) (6.0.1)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages 
(from huggingface-hub>=0.4.0->sentence-transformers) (2.31.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.12.2)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (1.12.1)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (3.3)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (3.1.4)\n", + "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (8.9.2.26)\n", + "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.3.1)\n", + "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (11.0.2.54)\n", + "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (10.3.2.106)\n", + "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (11.4.5.107)\n", + 
"Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.0.106)\n", + "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (2.20.5)\n", + "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (12.1.105)\n", + "Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.6.0->sentence-transformers) (2.3.0)\n", + "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch>=1.6.0->sentence-transformers) (12.5.82)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (2024.5.15)\n", + "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.19.1)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.6.0->sentence-transformers) (0.4.3)\n", + "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers) (8.1.7)\n", + "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->sentence-transformers) (1.4.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->sentence-transformers) (3.5.0)\n", + "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision->sentence-transformers) (9.4.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from 
jinja2->torch>=1.6.0->sentence-transformers) (2.1.5)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (3.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub>=0.4.0->sentence-transformers) (2024.6.2)\n", + "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.6.0->sentence-transformers) (1.3.0)\n", + "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.41.2)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.15.4)\n", + "Requirement already satisfied: huggingface-hub<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.23.4)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.25.2)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.5.15)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.31.0)\n", + "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) 
(0.19.1)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.3)\n", + "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.4)\n", + "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.23.0->transformers) (2023.6.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.23.0->transformers) (4.12.2)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.6.2)\n", + "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.3.0+cu121)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.15.4)\n", + "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.12.1)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.3)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2023.6.0)\n", + "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in 
/usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch) (8.9.2.26)\n", + "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.3.1)\n", + "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch) (11.0.2.54)\n", + "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch) (10.3.2.106)\n", + "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch) (11.4.5.107)\n", + "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.0.106)\n", + "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch) (2.20.5)\n", + "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch) (2.3.0)\n", + "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch) (12.5.82)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.5)\n", + "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)\n" + ] + } + ] + }, + { + "cell_type": 
"markdown", + "source": [ + "# Collecting the Data\n", + "We use the BeautifulSoup package to scan the Hyperspace documentation and create a corresponding data collection. We then parse the documentation into individual sections, each stored as a separate document, and use Hyperspace Hybrid search for retrieval." + ], + "metadata": { + "id": "b4U71-sFpmsz" + } + }, + { + "cell_type": "code", + "source": [ + "import requests\n", + "from bs4 import BeautifulSoup\n", + "import os\n", + "import regex as re\n", + "\n", + "folder_name = 'docs' # local folder to store the documentation\n", + "os.makedirs(folder_name, exist_ok=True)\n", + "documents = []\n", + "\n", + "known_words = [\"keyword\",\"return\", \"True\",\"list\",\"fieldname\",\"function\",\n", + "               \"between\", \"vector\",\"parameter\",\"score\",\"should\",\"must\",\"clause\",\n", + "               \"aggregate\",\"configuration\",\"section\",\"boost\",\"commit\",\"string\",\n", + "               \"False\",\"aggregation\",\"candidate\",\"filter\"]\n", + "replacements = {\"themin\": \"the min\", \"typekeywordorlist\": \"type keyword or list\",\n", + "                \"Hyperspaces\": \"Hyperspace\", \"uploading\": \"upload\", \"combining\": \"combine\",\n", + "                \"filtering\": \"filter\", \"commiting\": \"commit\", \"scoring\": \"score\",\n", + "                \"indexing\": \"index\", \"matching\": \"match\", \"debugging\": \"debug\"}\n", + "def normalize_text(text):\n", + "    text = text.lower()\n", + "    for i in range(0):\n", + "        text = text.replace(str(i), \" \")\n", + "    for char in [\")\",\"_\",\"(\",\"-\"]:\n", + "        text = text.replace(char, \" \")\n", + "    for word in known_words:\n", + "        text = text.replace(word + \"s\", \" \" + word + \"s \")\n", + "        text = text.replace(word, \" \" + word + \" \")\n", + "    text = text.replace(\"'\",\"\")\n", + "    text = text.replace(\")\",\" \")\n", + "    text = text.replace(\"(\",\" \")\n", + "    for key in replacements.keys():\n", + "        text = text.replace(key, replacements[key])\n", + "    text = text.replace(\"  \", \" \")\n", + "\n", + "    
return text\n", + "\n", + "def split_text(text):\n", + " parts = re.split(r'(? 0:\n", + " documents.append(doc)\n", + " doc = {\"Sentences\": [], \"url\": base_url}\n", + " doc[\"header\"] = normalize_text(current_header)\n", + " file_path = os.path.join(folder_name, f\"{current_header}.txt\")\n", + " file = open(file_path, 'w', encoding='utf-8')\n", + "\n", + " elif element.name == 'p' and current_header:\n", + " if file:\n", + " text = element.get_text(strip=True)\n", + "\n", + " if len(text) < 5:\n", + " continue\n", + " doc[\"full_text\"] = normalize_text(text)\n", + " for sentence in split_text(doc[\"full_text\"]):\n", + " if len(sentence) < 5:\n", + " continue\n", + " sentence = \" \".join([word for word in sentence.split(\" \") if len(word) > 1])\n", + " if len(sentence.split(\" \")) < 3:\n", + " continue\n", + " doc[\"Sentences\"].append(sentence)\n", + " file.write(sentence + '\\n')\n", + "\n", + " if file:\n", + " file.close()\n", + " else:\n", + " print(\"No main content section found. 
Please check your HTML structure.\")\n", + " else:\n", + " print(f\"Failed to retrieve {base_url}: Status code {response.status_code}\")\n", + "\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wyxlFr77Fi8e", + "outputId": "36b0e8af-ca8f-4bf0-9f3b-84766d8c9e57" + }, + "execution_count": 386, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Error: Failed to retrieve https://docs.hyper-space.iohttps://docs.hyper-space.io/hyperspace-docs/flows/data-collections/supported-data-types\n", + "Error: Failed to retrieve https://docs.hyper-space.iohttps://docs.hyper-space.io/hyperspace-docs/flows/data-collections/supported-data-types\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Let us examine the collected data" + ], + "metadata": { + "id": "BDVeuWDIQSza" + } + }, + { + "cell_type": "code", + "source": [ + "documents" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0CQOYrCaEzmv", + "outputId": "3a3937e2-3140-4dd1-d980-400a39ba3511" + }, + "execution_count": 387, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[{'Sentences': ['hyperspace is cloud search database that leverages cloud hardware to enhance search speed and relevancy.',\n", + " 'hyperspace uses search processing unit spu virtual chip—a domain specific architecture optimized for search tasks, to provide unmatched performance for real time applications at scale, maintaining cost efficiency without compromising over logic complexity.',\n", + " 'hyperspace is managed saas solution, combine hardware level speed with software level flexibility and designed to support wide range of ai applications such as real time recommendations, fraud prevention, ad tech, rag, and threat detection.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'overview',\n", + " 'full_text': 'hyperspace is a managed saas solution, combine 
hardware level speed with software level flexibility and designed to support a wide range of ai applications such as real time recommendations, fraud prevention, ad tech, rag, and threat detection.'},\n", + " {'Sentences': ['hyperspace excels in delivering query results with minimal latency, even when dealing with billions of documents.',\n", + " 'using designated processing units in the cloud, hyperspace provides latencies that are 10 to 100 times faster than industry benchmarks, all while reducing costs.',\n", + " 'furthermore, hyperspaces search function ality is designed to operate at an extremely large scale without compromising on performance or stability.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'fast queries at scale',\n", + " 'full_text': 'hyperspace excels in delivering query results with minimal latency, even when dealing with billions of documents. using designated processing units in the cloud, hyperspace provides latencies that are 10 to 100 times faster than industry benchmarks, all while reducing costs. 
furthermore, hyperspaces search function ality is designed to operate at an extremely large scale without compromising on performance or stability.'},\n", + " {'Sentences': ['hyperspaces hybrid search combines keyword search, term and value match, with vector based search, allowing versatile and efficient approach to information retrieval.',\n", + " 'while vector search tends to excel at capturing semantic relationships, it behaves unexpectedly in many cases.',\n", + " 'keyword search can pinpoint explicit matches and retrieve documents based on specific terms, improving relevancy when vector search falls short.',\n", + " 'hyperspace hybrid search allows to create complicated function combine these two methods, and allowing comprehensive results with high relevancy.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'relevant search results',\n", + " 'full_text': 'hyperspaces hybrid search combines keyword search, term and value match, with vector based search, allowing a versatile and efficient approach to information retrieval. while vector search tends to excel at capturing semantic relationships, it behaves unexpectedly in many cases. a keyword search can pinpoint explicit matches and retrieve documents based on specific terms, improving relevancy when vector search falls short. hyperspace hybrid search allows to create complicated function s , combine these two methods, and allowing comprehensive results with high relevancy.'},\n", + " {'Sentences': ['hyperspace documents include fields with variety of types, including keyword numerical values, list and vector',\n", + " 'the list of supported types is available underdata types.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'hyperspace documents store vector s and metadata',\n", + " 'full_text': 'hyperspace documents include fields with a variety of types, including keyword s , numerical values, list s and vector s . 
the list of supported types is available underdata types.'},\n", + " {'Sentences': ['hyperspace stores data under collections, each with its own index.',\n", + " 'hyperspace allows to easily create and manage data collections, upload and modify data:'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'creating collections and ingesting documents',\n", + " 'full_text': 'hyperspace stores data under collections, each with its own index. hyperspace allows to easily create and manage data collections, upload and modify data:'},\n", + " {'Sentences': ['hyperspace allows you to run search queries in standard dsl'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'creating keyword based queries',\n", + " 'full_text': 'hyperspace allows you to run search queries in standard dsl'},\n", + " {'Sentences': ['we call the search dsl api to submit the query.',\n", + " 'in this call we specify the name of the query, the number of documents to return and the collection name.',\n", + " 'the result of this call goes to the result dictionary, containing the top document ids along with their score',\n", + " 'thats our first query!',\n", + " 'in the following chapters, well discuss these features in greater detail.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'submitting a dsl query',\n", + " 'full_text': 'in the following chapters, well discuss these features in greater detail.'},\n", + " {'Sentences': ['hyperspace score function allows you to specify the keyword search and vector search behavior in python syntax.',\n", + " 'score function allow filter ing and score including tf/idf',\n", + " 'below is an example of simple hybrid score function that filter the results based on two fields and applies score manipulation.',\n", + " 'here is simple keyword search followed by vector search on the resulting documents.',\n", + " 'we specify how both score will be merged via the boost 
parameter in this example, the keyword search is given twice the weight of the vector search.',\n", + " 'document is used for the query documents.',\n", + " 'in python client, you can provide either the function directly or the name of the file that contains the function',\n", + " 'in java client, you can only provide the name of the file.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs',\n", + " 'header': 'creating hybrid queries',\n", + " 'full_text': 'in python client, you can provide either the function directly or the name of the file that contains the function . in java client, you can only provide the name of the file.'},\n", + " {'Sentences': ['hyperspace search allows to filter and assign score for candidate using lexical and vector search.',\n", + " 'only candidate that passed the filter stage will be assigned score',\n", + " 'in addition, hyperspace allows aggregation of candidate fields.',\n", + " 'aggregation can be performed on all candidate and not just those that passed filter ing.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'hyperspace search',\n", + " 'full_text': 'in addition, hyperspace allows aggregation s of candidate fields. 
aggregation s can be performed on all candidate s , and not just those that passed filter ing.'},\n", + " {'Sentences': ['hyperspace allows to build lexical, vector and hybrid search queries in either domain specific syntax .json structure or native python syntax.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'native python syntax and dsl syntax',\n", + " 'full_text': 'hyperspace allows to build lexical, vector and hybrid search queries in either domain specific syntax .json structure or native python syntax.'},\n", + " {'Sentences': ['hyperspace supports variety of score mechanisms.',\n", + " 'these include tf idf and bm25 based score weights and boost for lexical search, and similarity metrics such as euclidean and hamming distance for vector search.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'multiple score methods',\n", + " 'full_text': 'hyperspace supports a variety of score mechanisms. 
these include tf idf and bm25 based score , weights and boost s for lexical search, and similarity metrics such as euclidean and hamming distance for vector search.'},\n", + " {'Sentences': ['hyperspace efficient memory management allows to include an extremely large number of keyword and value fields in each query, allowing practically unlimited number of fields.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'no limitation on number of metadata fields',\n", + " 'full_text': 'hyperspace efficient memory management allows to include an extremely large number of keyword and value fields in each query, allowing practically unlimited number of fields.'},\n", + " {'Sentences': ['hyperspace fully supports multi model vector search, allowing to use multiple vector in each search query, and to use the results of each vector search in order to filter other vector searches.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'full support of multi model search',\n", + " 'full_text': 'hyperspace fully supports multi model vector search, allowing to use multiple vector s in each search query, and to use the results of each vector search in order to filter other vector searches.'},\n", + " {'Sentences': ['hyperspace allows to perform vector search with extremely large vector with thousands of elements per vector'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'extremely large vector s ',\n", + " 'full_text': 'hyperspace allows to perform vector search with extremely large vector s , with thousands of elements per vector .'},\n", + " {'Sentences': ['hyperspace allows to create sophisticated filter ing and score logic based on both vector and lexical search results.'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': ' filter ing and score based on both vector and lexical search',\n", + " 'full_text': 'hyperspace allows to create sophisticated filter ing and score logic based on both vector and lexical search results.'},\n", + " {'Sentences': ['this guide explains how to set up the hyperspace database in minutes.',\n", + " 'to start using hyperspace, follow these steps:'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'quick start',\n", + " 'full_text': 'to start using hyperspace, follow these steps:'},\n", + " {'Sentences': ['run the following shell command in your code or your data terminal',\n", + " 'host address, use the following code to connect to the database through the hyperspace api.',\n", + " 'for more information, seehere.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': '1. install the hyperspace api client',\n", + " 'full_text': 'for more information, seehere.'},\n", + " {'Sentences': ['once you receive credentials and host address, use the following code to connect to the database through the hyperspace api.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': '2. 
create a local instance of the hyperspace client',\n", + " 'full_text': 'once you receive credentials and host address, use the following code to connect to the database through the hyperspace api.'},\n", + " {'Sentences': ['the schema files outline the data structure, index and metric types, and similar configuration',\n", + " 'more info can be found in the configuration file section'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'create a schema file',\n", + " 'full_text': 'the schema files outline the data structure, index and metric types, and similar configuration s . more info can be found in the configuration file section .'},\n", + " {'Sentences': ['copy the following code snippet to create collection',\n", + " 'schema.json specifies the path to the configuration file that you created locally on your machine.',\n", + " 'collection name specifies the name of the collection to be created in the hyperspace database.',\n", + " 'alternatively, you can define the database config schema as local python object',\n", + " 'schema– specifies the python dictionary that outlines the configuration schema.',\n", + " 'collection name specifies the name of the collection to be created in the hyperspace database.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'create a collection',\n", + " 'full_text': 'collection name – specifies the name of the collection to be created in the hyperspace database.'},\n", + " {'Sentences': ['data can be uploaded in batches.',\n", + " 'copy the following code snippet to upload data',\n", + " 'data point– represents the document to upload.',\n", + " 'each document must have dictionary like structure with keys according to the database schema configuration file.',\n", + " 'batch size– specifies the number of documents in batch.',\n", + " 'commit is required for vector search only'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'upload data',\n", + " 'full_text': ' commit is required for vector search only'},\n", + " {'Sentences': ['hyperspace queries can be of one of the following types',\n", + " 'lexical search can be performed in dsl syntax, or as using score function of the following form:'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'build and run a query python only ',\n", + " 'full_text': 'lexical search can be performed in dsl syntax, or as using a score function of the following form:'},\n", + " {'Sentences': ['specify that this score function file is to be used for the search, as follows'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'to set a hybrid or lexical search query –',\n", + " 'full_text': 'specify that this score function file is to be used for the search, as follows –'},\n", + " {'Sentences': ['define the query schema and run',\n", + " 'query bodyis the query in dsl syntax.query body must have similar structure to the database documents, according to the query schema config file.',\n", + " 'if query body includes fields of type'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'to run a hybrid or lexical search query –',\n", + " 'full_text': 'query bodyis the query in dsl syntax.query body must have a similar structure to the database documents, according to the query schema config file. 
if query body includes fields of type'},\n", + " {'Sentences': ['define the query schema and run',\n", + " 'query bodyis the query in dsl syntax.',\n", + " 'resultsis dictionary with two keys {similarity: {}, took ms: ..}',\n", + " 'took ms– is float value that specifies how long the query took to run, such as 8.73ms',\n", + " 'similarity– return list',\n", + " 'each element of the list represents match document.',\n", + " 'for each document, it specifies the score and the vector id that you can use to retrieve the document from the collection.',\n", + " 'here is an example of what results might look like if they were printed on the screen',\n", + " '[{ score 513.7000122070312, vector id: 78254},\\n score 512.5500126784442, vector id: 23091},\\n score 485.5471220787652, vector id: 85432}]',\n", + " 'you can retrieve additional document fields in the query, using the \"fields\" keyword'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/quick-start',\n", + " 'header': 'to run a lexical search query in dsl syntax–',\n", + " 'full_text': 'you can retrieve additional document fields in the query, using the \"fields\" keyword .'},\n", + " {'Sentences': ['hyperspace allows native python syntax and can be seamlessly integrated into an existing python projects or used as stand alone code.',\n", + " 'the hyperspace client requires python 3.4 or newer.',\n", + " 'users with experience in search can use hyperspace to easily create extremely fast search engines from scratch.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up',\n", + " 'header': 'setting up',\n", + " 'full_text': 'users with experience in search can use hyperspace to easily create extremely fast search engines from scratch.'},\n", + " {'Sentences': ['the local hyperspace client uses database configuration schema can be provided as python dictionary or as file, for example, named schema.json to define the collection to be created as described in the next 
step',\n", + " 'similar to other search databases, the hyperspace database leverages this file to outline the data scheme and required customized settings.',\n", + " 'the database configuration schema consists of dictionary containing key value pairs that lay out the structure data schema fields of the data to be uploaded into collection.',\n", + " 'defined in standard .json format, the configuration is provided under the dict key configuration in the form configuration fieldname 1: type}, fieldname 2: type}, ...}.',\n", + " 'each attribute field is described by an attribute name such as city, country, and street key attribute of the value, such as type and low cardinality and value property, such as keyword boolean, and true',\n", + " 'create data schema document as local variable',\n", + " 'the data schema should be provided as .json file, or as python dictionary with the following structure:',\n", + " 'vector fields should be given the type \"dense vector \", while metadata fields can be given any type from thesupported data types list',\n", + " 'hyperspace does not currently support signed integers.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'creating a database schema configuration file',\n", + " 'full_text': 'hyperspace does not currently support signed integers.'},\n", + " {'Sentences': ['the following optional type attributes can be added'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'optional fields',\n", + " 'full_text': 'the following optional type attributes can be added –'},\n", + " {'Sentences': ['the key ‘index’ allows to disable the index of field.',\n", + " 'when set to false, the relevant field will be included in the dataset, but will not be indexed and will not contribute to search results.',\n", + " 'the default value for ‘index’ is true.',\n", 
+ " 'see example of usage under ‘open now’ in the above example.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'index',\n", + " 'full_text': 'the key ‘index’ allows to disable the index of a field. when set to false, the relevant field will be included in the dataset, but will not be indexed and will not contribute to search results. the default value for ‘index’ is true. see example of usage under ‘open now’ in the above example.'},\n", + " {'Sentences': ['values non keyword fields are configured as scalars by default.',\n", + " 'to build list of the same data type, add the key and value struct type: list',\n", + " 'metadata fields of type keyword can describe keyword str or list of keyword list [str] without the need to state type list \".',\n", + " 'the length of each keyword is limited to 256 characters.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'struct type – list ',\n", + " 'full_text': 'the length of each keyword is limited to 256 characters.'},\n", + " {'Sentences': ['hyperspace supports the use of nested objects as part of document.',\n", + " 'to include nested objects, define the relevant field type as nested, and the corresponding sub items under the fields key.',\n", + " 'vector fields should be given the type \"dense vector \", while metadata fields can be given any type from thesupported data types list',\n", + " 'in the above example, \"paragraphs\" is nested object with subfields named \"text\", \"count\" and \"value\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'nested objects',\n", + " 'full_text': 'in the above example, \"paragraphs\" is a nested object with subfields named \"text\", \"count\" and \"value\".'},\n", + " {'Sentences': 
['cardinality refers to the number of unique values an attribute can have.',\n", + " 'hyperspace provides the option to accelerate search performance by setting one of the following cardinality attributes to true.',\n", + " 'to accelerate the search, apply the appropriate cardinality attribute where relevant',\n", + " 'low cardinality indicates that this attribute has up to 10 possible unique values.',\n", + " 'it is suitable for fields with limited set of possible values.',\n", + " 'high cardinality indicates that this attribute has more than 100 possible values.',\n", + " 'it indicates that this attribute has more than 100 possible unique values, meaning that it has broader range of distinct possible values.',\n", + " 'to accelerate the search, apply the appropriate cardinality attribute where relevant.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'cardinality',\n", + " 'full_text': 'for example –'},\n", + " {'Sentences': ['the dense vector value assigned to the type attribute instructs hyperspace to index and map the imported data to be suitable for vector search.',\n", + " 'note– currently, dense vector is the only type attribute value that is supported for vector search.',\n", + " 'in the future, an additional option called sparse vector will be supported.',\n", + " 'type string ]– for vector specify dense vector',\n", + " 'dim [integer]– specifies the dimension of the vector which indicates the number of values that the vector will contain.',\n", + " 'this is essential storage and search optimization in the database in order to enable efficient and accurate handling of vector operations.',\n", + " 'for binary vector this number must be divisible by 8.note– index type and dim must always be provided together or not at all, and they cannot be used in combination with struct type, low cardinality, or high cardinality.',\n", + " 'index type string ]– specifies the 
index method data distribution to be used for this vector which influences both the speed of operations performed on the vector and their accuracy.',\n", + " 'choosing the highest speed may necessitate minor trade off in accuracy.',\n", + " 'this choice also impacts the types of mathematical operations that can be conducted.',\n", + " 'brute force– knn using brute force, which is accurate yet time consuming.',\n", + " 'hnsw– index by hierarchical navigable small world method.',\n", + " 'ivf– index by inverted file index scheme.',\n", + " 'bin ivf– index by inverted file index scheme for binary vector',\n", + " 'metric– specifies the metric to be employed to calculate the similarity or distance between vector as one of the following options',\n", + " 'ip– inner product cosine similarity this option must be specified when thehnswor theivfindex type described above is selected.',\n", + " 'hamming– hamming distance this option must be specified when thebin ivf index type described above is selected.',\n", + " 'see additional infohere.',\n", + " 'list int, default 128 only used for index type ivf or bin ivf.',\n", + " 'this option is used during index creation and represents the number of buckets used during clustering.',\n", + " 'larger list leads to quicker search with lower accuracy.',\n", + " 'int, default 30 used exclusively for index type hnsw.',\n", + " 'it specifies the number of arcs per new element.',\n", + " 'higher value should correspond to datasets with higher intrinsic dimensionality and/or higher recall.',\n", + " 'this means that if the dataset has more complex features or you want more accurate results, consider using higher value.',\n", + " 'ef int, default 16 used exclusively for index type hnsw.',\n", + " 'this represents the dynamic list size for nearest neighbors.',\n", + " 'larger ef value results in better accuracy but slower search times.',\n", + " 'essentially, by setting larger ef, youre allowing the algorithm to consider more potential 
neighbors for better match, but this comes at the cost of longer processing times.',\n", + " 'ef must be larger than the number of queried nearest neighbors nn',\n", + " 'ef construction int, default 360 used exclusively for index type hnsw.',\n", + " 'it is similar to ef, but used for index creation.',\n", + " 'though the upload and index creation may require more time, this option provides more precise search outcome.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file',\n", + " 'header': 'dense vector ',\n", + " 'full_text': 'ef construction int, default 360 – used exclusively for index type = hnsw. it is similar to ef, but used for index creation. though the upload and index creation may require more time, this option provides a more precise search outcome.'},\n", + " {'Sentences': ['the vector search metric defines the distance measure between vector during the vector search.',\n", + " 'from mathematical point of view, the nearest neighbors are those with the smallest distance from the search query vector',\n", + " 'in vector search higher score reflects greater similarity, or smaller distance.',\n", + " 'thus, algebraic conversion from distance to score is required.',\n", + " 'this relation is given in the hyperspace score column.',\n", + " 'x,y =∑i xi−yi 2d x, \\\\sum {i} ^2d x,y =∑i\\u200b xi\\u200b−yi\\u200b',\n", + " 'x,y =−∑ixi⋅yid x,y ⋅y id x,y =−∑i\\u200bxi\\u200b⋅yi\\u200b',\n", + " '{11+difd≥01−dd<0\\\\begin{cases} \\\\frac{1}{1 d} \\\\text{if \\\\geq \\\\\\\\ \\\\text{d<0} \\\\end{cases}{1+d1\\u200b1−d\\u200bifd≥0d<0\\u200b',\n", + " 'x,y =∑i=1nδ xi,yi x,y \\\\sum {i=1}^{n} \\\\delta i, x,y =∑i=1n\\u200bδ xi\\u200b,yi\\u200b'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/vector-similarity-metrics',\n", + " 'header': ' vector similarity metrics',\n", + " 'full_text': '11+d\\\\frac{1}{1 + 
d}1+d1\\u200b'},\n", + " {'Sentences': ['thel2l2l2metric quantifies the similarity as the distance between two points in euclidean space.',\n", + " 'the metric is derived from the pythagorean theorem and represents the length of the shortest path between the two points:',\n", + " 'note that thel2l^2l2metric is special case of thelpl^plpmetric',\n", + " 'hyperspace uses the squaredl2l2l2metric for calculation efficiency.',\n", + " 'this does not affect the order of the candidate'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/vector-similarity-metrics',\n", + " 'header': 'euclidean metric l2 ',\n", + " 'full_text': 'hyperspace uses the squaredl2l2l2metric for calculation efficiency. this does not affect the order of the candidate s .'},\n", + " {'Sentences': ['the inner product metric quantifies the similarity between two vector as times the projection of one vector on the other, which is if the vector are parallel and if they are perpendicular.',\n", + " 'the minus sign ensures that minimal distance corresponds to maximum similarity.',\n", + " 'in euclidean vector space, the inner product is:',\n", + " 'vector must be normalized before using the ip metric.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/vector-similarity-metrics',\n", + " 'header': 'inner product ip ',\n", + " 'full_text': ' vector s must be normalized before using the ip metric.'},\n", + " {'Sentences': ['the index type specifies the strategy by which data will be distributed in collection.',\n", + " 'this influences the data storage volume, speed of search operations, and the search accuracy.',\n", + " 'specifically, there is an inherent trade off between these traits.',\n", + " 'for example, higher speed of operation leads to reduced accuracy or increased storage volume.',\n", + " 'hyperspace supports multiple index methods, as explained 
below.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/index-type-methods',\n", + " 'header': 'index type methods',\n", + " 'full_text': 'the index type specifies the strategy by which data will be distributed in a collection. this influences the data storage volume, speed of search operations, and the search accuracy. specifically, there is an inherent trade off between these traits. for example, higher speed of operation leads to reduced accuracy or increased storage volume. hyperspace supports multiple index methods, as explained below.'},\n", + " {'Sentences': ['the brute force index method corresponds to precise nearest neighbors retrieval knn devoid of approximations.',\n", + " 'this approach ensures optimal accuracy, at the expense of reduced search speed.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/index-type-methods',\n", + " 'header': 'brute force brute force ',\n", + " 'full_text': 'the brute force index method corresponds to precise k nearest neighbors retrieval knn , devoid of approximations. 
this approach ensures optimal accuracy, at the expense of a reduced search speed.'},\n", + " {'Sentences': ['hierarchical navigable small world hnsw is an efficient approach for approximated nearest neighbor search in high dimensional vector spaces.',\n", + " 'hnsw creates hierarchical structure that organizes data points as graph nodes, in manner similar to small world network.',\n", + " 'in this structure, each node has connections to selected nearest neighbors.',\n", + " 'the hierarchy allows for efficient navigation through the dataset, reducing the computational complexity of the search process.',\n", + " 'hnsw offers good balance between accuracy and search speed in large datasets with high dimensionality.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/index-type-methods',\n", + " 'header': 'hnsw hnsw ',\n", + " 'full_text': 'hnsw offers a good balance between accuracy and search speed in large datasets with high dimensionality.'},\n", + " {'Sentences': ['int, default 30 specifies the number of arcs per new data point.',\n", + " 'larger value results in higher recall, but larger storage volume and slower search times.',\n", + " 'thus, larger values of are required for datasets that have higher dimensionality and/or require higher recall.',\n", + " 'ef int, default 16 specifies the size of the dynamic list for nearest neighbors.',\n", + " 'larger ef value results in better accuracy but slower search times.',\n", + " 'essentially, by setting larger ef, youre allowing the algorithm to consider more potential neighbors for better match, but this comes at the cost of longer processing times.',\n", + " 'ef must be larger than the number of queried nearest neighbors nn',\n", + " 'ef construction int, default 360 it is similar toef, but used for index creation.',\n", + " 'though the upload and index creation may require more time and storage volume, this option provides search outcome with 
higher recall.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/index-type-methods',\n", + " 'header': ' parameter s ',\n", + " 'full_text': 'ef construction int, default 360 – it is similar toef, but used for index creation. though the upload and index creation may require more time and storage volume, this option provides search outcome with higher recall.'},\n", + " {'Sentences': ['inverted file index is search optimization method that composed of partitioning the datasets and associate each with centroid.',\n", + " 'each database vector is then associated with partition that corresponds to its nearest centroid.',\n", + " 'the vector components serve as dimensions in the vector space, and the inverted file index is constructed to map each dimension or combination of dimensions to the documents that have vector containing them.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/creating-a-database-schema-configuration-file/index-type-methods',\n", + " 'header': 'inverted file index ivf ',\n", + " 'full_text': 'inverted file index is a search optimization method that composed of partitioning the datasets and associate each with a centroid. each database vector is then associated with a partition that corresponds to its nearest centroid. 
the vector components serve as dimensions in the vector space, and the inverted file index is constructed to map each dimension or a combination of dimensions to the documents that have vector s containing them.'},\n", + " {'Sentences': ['data points of all types are uploaded into hyperspace collection as documents and stored according to the identifier you specify during upload, as described below.',\n", + " 'data upload can be performed in batches or by upload single vector as follows.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/uploading-data-to-a-collection',\n", + " 'header': 'upload data to a collection',\n", + " 'full_text': 'data points of all types are uploaded into hyperspace collection as documents and stored according to the identifier you specify during upload, as described below. data upload can be performed in batches or by upload a single vector , as follows.'},\n", + " {'Sentences': ['use the following command to upload single document',\n", + " 'document– represents the document to upload.',\n", + " 'the structure of each document must be according to the database schema configuration file.',\n", + " 'must be oftype dictionary.',\n", + " 'collection name– specifies the name of the collection into which to load the document.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/uploading-data-to-a-collection',\n", + " 'header': 'upload a single document',\n", + " 'full_text': 'collection name– specifies the name of the collection into which to load the document.'},\n", + " {'Sentences': ['each document must have unique identifier, under the field id\".',\n", + " 'you can manually set an id per document by defining designated field named id\" in the document.',\n", + " 'use the following example',\n", + " 'if noid is assigned in the document, id\"will be assigned automatically.'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/uploading-data-to-a-collection',\n", + " 'header': 'assigning id to a document',\n", + " 'full_text': 'if noid is assigned in the document, \" id\"will be assigned automatically.'},\n", + " {'Sentences': ['describes how to build and run lexical search query.',\n", + " 'lexical search, or classic search, is fundamental approach to retrieve information based on keyword integer, float, date etc.',\n", + " 'this search operates by assessing the similarity of collection documents to query document, using score function that you define, that assigns similarity score to each document, query pair.',\n", + " 'it then selects the documents with the top score and return their ids to the user.',\n", + " 'checkthe query flowfor detailed discussion on score function',\n", + " 'steps to build lexical search query',\n", + " 'step creating the score function optional',\n", + " 'step specifying the document for similarity search',\n", + " 'step defining the lexical query schema',\n", + " 'step running the lexical search query',\n", + " 'step viewing results'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-lexical-search-query',\n", + " 'header': 'building a lexical search query',\n", + " 'full_text': 'step 5 viewing results'},\n", + " {'Sentences': ['create apython syntax query function which describes your own score logic, with higher score indicating stronger match.',\n", + " 'the structure can be similar to the following example.',\n", + " 'the following python score function initially sets the score to 0.',\n", + " 'if there is match of the country or any country in list the score is set to 5.',\n", + " 'if there is match of both the country and the street, the score is set as 10.',\n", + " 'here is another example that illustrates more complex score mechanism that can be used to weight different attributes for recommendations.',\n", + " 'this 
score function provides recommendation which is based on the requirement that at least one element of the fields \"genres and countries match.',\n", + " 'movies with high rating recieve boost',\n", + " 'you can specify the score function file is be used for the lexical search, as follows',\n", + " 'score function recommendation– specifies the name of the function containing the logic to be used in the search query, which is described in step #1 above.',\n", + " 'collection name– specifies the name of the collection that contains the data to be searched.',\n", + " 'function name– assigns the score function local object name to be used later when running the search query.',\n", + " 'you can also run the score function from file, using the command',\n", + " 'score function filename– specifies the name and path of the file containing the logic to be used in the search query, which is described in step #1 above.',\n", + " 'this loads the contents of this file to local object.',\n", + " 'collection name– specifies the name of the collection that contains the data to be searched.',\n", + " 'function name– assigns the score function local object name to be used later when running the search query.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-lexical-search-query',\n", + " 'header': 'creating the score function optional ',\n", + " 'full_text': ' function name– assigns the score function a local object name to be used later when running the search query.'},\n", + " {'Sentences': ['if you want to use database document as for the query, use the function \"get document\".',\n", + " 'specify the collection name and identifier of the document for example, 47 that contains the data to which you want to find similarities by placing it in local object named document.'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-lexical-search-query',\n", + " 'header': 'specifying the document for classic search',\n", + " 'full_text': 'if you want to use a database document as for the query, use the function \"get document\". specify the collection name and identifier of the document for example, 47 that contains the data to which you want to find similarities by placing it in a local object named document.'},\n", + " {'Sentences': ['if you are using python score function copy the following code snippet to run the lexical search query',\n", + " 'document– specifies the document for similarity search and the multiplier of the return score as described in step 3, defining the classic query schema.',\n", + " 'size– specifies the number of results to return',\n", + " 'function name– specifies the score function to be used in the classic search query as described in step 1, creating the score function',\n", + " 'collection name– specifies the collection in which to search.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-lexical-search-query',\n", + " 'header': 'running the lexical classic search query',\n", + " 'full_text': 'collection name– specifies the collection in which to search.'},\n", + " {'Sentences': ['the following describes how to build and run vector search query.',\n", + " 'this type of search matches the similarity of documents in collection to document that you provide by measuring the proximity or similarity between vector rather than relying on traditional keyword match or exact matches.',\n", + " 'this vector search assigns numerical score to each document according to the type of vector score method that you specified in the metric field, such as hamming.',\n", + " 'this provides the identifiers of the documents with the top highest score for retrieval.',\n", + " 'multiplier weight option can be assigned 
to the score values.',\n", + " 'note while upload data into collection, dense vector value assigned to the type attribute signifies that the data is suitable for vector search.',\n", + " 'therefore, only data that has been uploaded in this manner will be matched in vector search, as described increating database schema configuration file.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-vector-search-query',\n", + " 'header': 'building a vector search query',\n", + " 'full_text': 'note – while upload data into a collection, dense vector value assigned to the type attribute signifies that the data is suitable for a vector search. therefore, only data that has been uploaded in this manner will be matched in a vector search, as described increating a database schema configuration file.'},\n", + " {'Sentences': ['define the vector search query schema by specifying the following',\n", + " 'document– specifies the document for which the query is searching for match.',\n", + " 'vector field name 1– the name of the first vector field to be given weight',\n", + " 'vector field name 2– the name of the second vector field to be given weight',\n", + " 'boost the key states the value of the given weight per score type, which is 0.5 in the example.',\n", + " 'default value is 1.0.',\n", + " 'in the above example, the query will perform two knn calculations and return the weighted sum of the score',\n", + " 'score 0.5 score vector field name 0.5 score vector field name',\n", + " 'using multiple vector in the same query requires all of the included vector to be indexed in the same manner, for example hnsw for all fields in thedatabase schema configuration file.',\n", + " 'multi vector query can be integrated with lexical search to create multi hybrid query.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-vector-search-query',\n", + " 
'header': 'defining the vector query schema',\n", + " 'full_text': 'multi vector query can be integrated with lexical search to create a multi hybrid query.'},\n", + " {'Sentences': ['the following describes how to build and run hybrid search query.',\n", + " 'hybrid search performs both classic search and vector search.',\n", + " 'it then assigns multiplier weight to the resulting matches and retrieves the documents with the top highest score for retrieval.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-hybrid-search-query',\n", + " 'header': 'building a hybrid search query',\n", + " 'full_text': 'the following describes how to build and run a hybrid search query. a hybrid search performs both a classic search and a vector search. it then assigns a multiplier weight to the resulting matches and retrieves the documents with the top highest score s for retrieval.'},\n", + " {'Sentences': ['if you are using score function copy the following code snippet to run the hybrid search query',\n", + " 'document– specifies the document for similarity search and the multiplier of the return score as described in step 3, defining the classic query schema.',\n", + " 'size– specifies the number of results to return',\n", + " 'function name– specifies the score function to be used in the classic search query as described in step 1,creating the score function',\n", + " 'collection name– specifies the collection in which to search.',\n", + " 'alternatively, if you use dsl syntax, copy the following code snippet',\n", + " 'query string is your query logic, see example below.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-hybrid-search-query',\n", + " 'header': 'running the hybrid search query',\n", + " 'full_text': 'query string is your query logic, see example below.'},\n", + " {'Sentences': ['you can create hybrid search queries in two 
methods',\n", + " 'linear combination of vector and lexical search default',\n", + " 'hybrid score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-hybrid-search-query',\n", + " 'header': 'creating the hybrid search query',\n", + " 'full_text': 'hybrid score function '},\n", + " {'Sentences': ['the query will be hybrid search query with the score being linear combination of lexical and vector score',\n", + " 'by default, query components are assigned with weight 1.0.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-hybrid-search-query',\n", + " 'header': 'linear combination of score s ',\n", + " 'full_text': 'the query will be a hybrid search query with the score being a linear combination of lexical and vector score s . by default, query components are assigned with weight = 1.0.'},\n", + " {'Sentences': ['to change the weights, add key named \"knn\"that includes key \"query\"for the lexical search and designated keys with vector field name vector field in the example for the vector search.',\n", + " 'in the above example, the vector field score will be multiplied by 0.6 in the overall score and the lexical query score by 0.05.',\n", + " 'all fields of typedense vector under params will be included in the vector search, unless the corresponding boost key is set to 0.',\n", + " 'the weight will be assigned the default value of 1.0.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/building-and-running-queries/building-a-hybrid-search-query',\n", + " 'header': 'assigning weights',\n", + " 'full_text': 'all fields of typedense vector under params will be included in the vector search, unless the corresponding boost key is set to 0. 
the weight will be assigned the default value of 1.0.'},\n", + " {'Sentences': ['the results can be accessed through the python dictionary object return ed byhyperspace client.search which has the following entries',\n", + " 'took ms specifies the query time in [ms].',\n", + " 'similarity provides the list of results.',\n", + " 'each item in results[similarity] contains the following entries',\n", + " 'score float specifies the match score for classic, vector or hybrid search.',\n", + " 'id str the unique identifier of document.',\n", + " 'this is the same id that you assigned to this data when it was uploaded.',\n", + " 'you can use this identifier to retrieve the document, as described inget document.',\n", + " 'here is an example of what results might look like if they were printed on the screen',\n", + " '[{ score 513.7000122070312, id: 78254},\\n score 512.5500126784442, id: 23091},\\n score 485.5471220787652, id: 85432}]',\n", + " 'heres another example of printing results, in which thetook msis also shown',\n", + " 'the following is printed–',\n", + " 'results for collection data 2023 09 06\\n==================================\\nquery run time: 2.82ms\\nquery run time candidate +1 2.82',\n", + " '[{ score 513.7000122070312, id: 78254},\\n score 512.5500126784442, id: 23091},\\n score 485.5471220787652, id: 85432}]'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/retrieving-results',\n", + " 'header': 'retrieving results',\n", + " 'full_text': '[{ score : 513.7000122070312, id: 78254},\\n { score : 512.5500126784442, id: 23091},\\n { score : 485.5471220787652, id: 85432}]'},\n", + " {'Sentences': ['use the following to retrieve and print the results of the search.',\n", + " 'collection name– specifies the collection name from which to retrieve the documents.',\n", + " 'id the id of each result that was return ed.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/retrieving-results',\n", + " 
'header': 'retrieving query results by id',\n", + " 'full_text': 'id the id of each result that was return ed.'},\n", + " {'Sentences': ['data can be uploaded in batches or as single documents.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/data-collections/uploading-data',\n", + " 'header': 'upload data',\n", + " 'full_text': 'data can be uploaded in batches or as single documents.'},\n", + " {'Sentences': ['to upload document into collection–',\n", + " 'hyperspace documents must be of type dictionary.',\n", + " 'use the following to upload batch of documents',\n", + " 'use the following to upload single document',\n", + " 'document– contains the data to be uploaded in the structure specified in the data schema configuration file.',\n", + " 'collection name– specifies the name of the collection into which to load the document.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/data-collections/uploading-data',\n", + " 'header': 'upload a single document',\n", + " 'full_text': 'collection name– specifies the name of the collection into which to load the document.'},\n", + " {'Sentences': ['hyperspace documents can include any of the following types.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/data-collections/supported-data-types',\n", + " 'header': 'supported data types',\n", + " 'full_text': 'hyperspace documents can include any of the following types.'},\n", + " {'Sentences': ['practically unlimited number of fields per document',\n", + " 'full support of nested objects',\n", + " 'full support of all common list types',\n", + " 'main data types',\n", + " 'integers unsigned int',\n", + " 'float vector dense vector',\n", + " 'binary vector dense vector',\n", + " 'geo points geo point',\n", + " 'unix timestamp time stamp'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/data-collections/supported-data-types',\n", + " 'header': ' keyword s and values',\n", + " 'full_text': 'unix 
timestamp time stamp '},\n", + " {'Sentences': ['hyperspace aggregation allow you to perform analytics on your data, gather insights, and summarize information in structured way.',\n", + " 'aggregation operate on set of documents and produce summary statistics or analysis results.',\n", + " 'the aggregation result is stored under the query results objects, as separate key.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/aggregations',\n", + " 'header': ' aggregation s ',\n", + " 'full_text': 'the aggregation result is stored under the query results objects, as a separate key.'},\n", + " {'Sentences': ['metric aggregation allow you to compute and analyze numeric measures on sets of documents.',\n", + " 'metric aggregation operate on numeric field in documents and produce single numeric result for the specified metric.',\n", + " 'to create metric aggregation follow the following template',\n", + " 'agg name aggregation results will be saved under this key',\n", + " 'metric type the type of aggregation',\n", + " 'possible values are:',\n", + " 'sum return the sum of the field over the relevant candidate',\n", + " 'min return the min of the field over the relevant candidate',\n", + " 'max return the max of the field over the relevant candidate',\n", + " 'avg return the average of the field over the relevant candidate',\n", + " 'count return the total number of valid field entries in the relevant candidate',\n", + " 'cardinality return the total number of valid field values in the relevant candidate',\n", + " 'percentiles return the percentiles of the field over the relevant candidate',\n", + " 'fieldname the name of the field to be used in the aggregation',\n", + " 'in the above example, the sum of the field \"sales\" will be calculated and stored under key named \"total sales\"',\n", + " 'in the above example, the number of unique values of the field \"user id\" will be calculated and stored under key named \"unique 
users\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/aggregations',\n", + " 'header': 'metric aggregation s ',\n", + " 'full_text': 'in the above example, the number of unique values of the field \"user id\" will be calculated and stored under a key named \"unique users\".'},\n", + " {'Sentences': ['date histogram aggregation groups documents into time intervals, forming buckets based on date or time values.',\n", + " 'the basic syntax includes the terms',\n", + " '\"field\": specifies the filed over which the aggregation wll be performed',\n", + " '\"interval\": defines the time interval for creating buckets.',\n", + " 'common intervals include \"year,\" \"quarter,\" \"month,\" \"week,\" \"day,\" \"hour,\" \"minute,\" or \"second.\"',\n", + " 'in the above example, the result will include set of buckets, each representing specific day.',\n", + " 'each bucket contains:'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/aggregations',\n", + " 'header': 'date histogram aggregation s ',\n", + " 'full_text': 'in the above example, the result will include a set of buckets, each representing a specific day. 
each bucket contains:'},\n", + " {'Sentences': ['you can combine aggregation and candidate filter ing, such that different aggregation on the same query will be performed on different candidate',\n", + " 'in the above example, the aggregation \"avg rating electronics\" will be performed on documents with \"category\" value that equals \"electronics\", while the aggregation \"avg rating high price\" will be performed on documents with \"price\" value greater than or equal to 50.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/aggregations',\n", + " 'header': 'combine aggregation s and candidate filter ing',\n", + " 'full_text': 'in the above example, the aggregation \"avg rating electronics\" will be performed on documents with \"category\" value that equals \"electronics\", while the aggregation \"avg rating high price\" will be performed on documents with \"price\" value greater than or equal to 50.'},\n", + " {'Sentences': ['terms aggregation is bucket aggregation that allows you to group documents into buckets based on the values of specific field.',\n", + " 'in the above example, the aggregation will be performed over the field \"unique products\" over all documents that pass the query.',\n", + " 'the aggregation results will be stored under buckets, according to the possible value of the field \"prod name\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/aggregations',\n", + " 'header': 'terms aggregation ',\n", + " 'full_text': 'in the above example, the aggregation will be performed over the field \"unique products\" over all documents that pass the query. 
the aggregation results will be stored under buckets, according to the possible value of the field \"prod name\".'},\n", + " {'Sentences': ['the hyperspace bool query allows you to construct complex queries by combine multiple sub queries and conditions.',\n", + " 'the bool query supports sub clause such as must should must not, should notand filter'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': 'bool query',\n", + " 'full_text': 'the hyperspace bool query allows you to construct complex queries by combine multiple sub queries and conditions. the bool query supports sub clause s such as must , should , must not, should notand filter .'},\n", + " {'Sentences': ['the must clause specifies conditions that must be satisfiedfor document to be considered match.',\n", + " 'in terms of logical operators, it corresponds to and operator.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' must clause ',\n", + " 'full_text': 'the must clause specifies conditions that must be satisfiedfor a document to be considered a match. 
in terms of logical operators, it corresponds to and operator.'},\n", + " {'Sentences': ['under the must clause each element is assigned probabi list ic rarity score and the total score for the must clause is calculated by combine these individual score',\n", + " 'unless specified otherwise, the score is based on the tf idf score model.',\n", + " '.the must clause is used when you want all specified conditions to be satisfied for document to be considered match.',\n", + " 'since each score represent different probability, the combined score is product of the individual score',\n", + " 'score score score score 3...',\n", + " 'note: this score method assumes the must clause are independent of one another',\n", + " 'in the above example, all candidate must satisfy all three conditions',\n", + " 'exact match over the bird field, with score equals tf idf score for \"bird\": \"asian koel\"',\n", + " 'the price field must be greater than or equal to 10',\n", + " 'match over \"in stock\" field, with score equals tf idf score for \"in stock\": \"true\"',\n", + " 'the overall score will be product of the individual score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' must clause score ',\n", + " 'full_text': 'the overall score will be a product of the individual score s .'},\n", + " {'Sentences': ['in hyperspace, the must not clause specifies conditions that must notbe satisfiedfor document to be considered match.',\n", + " 'in terms of logical operators, it corresponds to not or and not operators',\n", + " 'the must not clause is used when you want none of the specified conditions to be satisfied for document to be considered match.',\n", + " 'in the above example, all candidate must satisfy the following condition:',\n", + " 'exact match over the color field',\n", + " 'all candidate must also not satisfy any of the three conditions',\n", + " 'exact match over the bird field',\n", + " 'the 
price field must be less than 10',\n", + " 'exact match over the \"in stock\" field'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' must not clause ',\n", + " 'full_text': 'exact match over the \"in stock\" field'},\n", + " {'Sentences': ['in hyperspace, the should clause within aboolquery is used to specify conditions that are optional for document to be considered match.',\n", + " 'unlike the must clause which imposes mandatory conditions, the should clause only modifies the document score and allows for flexibility by indicating that any of the specified conditions can be satisfied for document to contribute to the search results.',\n", + " 'the should clause is often used for expressing optional or desirable conditions.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' should clause ',\n", + " 'full_text': 'in hyperspace, the should clause within aboolquery is used to specify conditions that are optional for a document to be considered a match. unlike the must clause , which imposes mandatory conditions, the should clause only modifies the document score , and allows for flexibility by indicating that any of the specified conditions can be satisfied for a document to contribute to the search results. 
the should clause is often used for expressing optional or desirable conditions.'},\n", + " {'Sentences': ['within the should clause each condition is associated with designated score and the overall score for the should clause is determined by combine these individual score',\n", + " 'if not explicitly specified otherwise, score follows the tf idf score model.',\n", + " 'the should clause is employed when you desire flexibility, as it allows for documents to be considered match if they satisfy any of the specified conditions.',\n", + " 'the combined score is sum of the individual score',\n", + " 'score score score score 3...',\n", + " 'in the above example, all candidate must satisfy the condition',\n", + " 'exact match over the bird field, with score equals tf idf score for \"bird\": \"asian koel\"',\n", + " 'in addition, any documents that satisfy the following conditions',\n", + " 'exact match over the country field, with score equals tf idf score for \"bird\": \"asian koel\"',\n", + " 'exact match over \"color\" field, with score equals tf idf score for \"in stock\": \"true\"',\n", + " 'will receive higher score',\n", + " 'the overall score will be sum of the individual score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' should clause score ',\n", + " 'full_text': 'will receive higher score . 
the overall score will be a sum of the individual score s .'},\n", + " {'Sentences': ['the should not clause within aboolquery specifies conditions that are optional for document to be considered match, in similar manner to should clause',\n", + " 'the should not clause decreases the document score',\n", + " 'the should clause is often used for expressing optional or desirable conditions.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' should not clause ',\n", + " 'full_text': 'the should not clause within aboolquery specifies conditions that are optional for a document to be considered a match, in a similar manner to should clause . the should not clause decreases the document score . the should clause is often used for expressing optional or desirable conditions.'},\n", + " {'Sentences': ['in the should not clause each condition is associated with designated score and the overall score for the clause is determined by substracting these individual score',\n", + " 'if not explicitly specified otherwise, score follows the tf idf score model.',\n", + " 'the should not clause enables documents to be lesser match if they meet any of the specified conditions.',\n", + " 'the combined score is subtraction of the individual score',\n", + " 'score score score score 3...',\n", + " 'in the above example, all candidate must satisfy the condition',\n", + " 'exact match over the bird field, with score equals tf idf score for \"bird\": \"asian koel\"',\n", + " 'in addition, any documents that satisfy the following conditions',\n", + " 'exact match over the country field, with score equals tf idf score for \"bird\": \"asian koel\"',\n", + " 'exact match over \"color\" field, with score equals tf idf score for \"in stock\": \"true\"',\n", + " 'will receive lower score',\n", + " 'the overall score will be sum of the individual score'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/bool-query',\n", + " 'header': ' should not clause score ',\n", + " 'full_text': 'will receive lower score . the overall score will be a sum of the individual score s .'},\n", + " {'Sentences': ['candidate filter ing is the process of narrowing down set of potential documents that might be relevant to search query before the score phase.',\n", + " 'you can filter candidate using the following methods:',\n", + " 'exact match between keyword',\n", + " 'window match between dates',\n", + " 'match between geo coordinates'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/candidate-generation-and-metadata-filtering',\n", + " 'header': ' candidate generation and metadata filter ing',\n", + " 'full_text': 'match between geo coordinates'},\n", + " {'Sentences': ['the term query is used to search for documents that contain specific exact value in particular field.',\n", + " 'it is designed for exact matches and is commonly used for fields that are not analyzed, such as keyword fields.',\n", + " 'you can match either keyword or keyword or list',\n", + " 'for keyword match requires exact match between the keyword and for list of keyword it requires an exact match between any two keyword in the two list',\n", + " 'in the above example, candidate must include the field continent and contain the value \"asia\" under the field continent will be return ed',\n", + " 'in the above example, candidate must include the field continent with any of the following values \"asia\", \"europe\", \"africa\" under the field continent will be return ed'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/candidate-generation-and-metadata-filtering',\n", + " 'header': 'term match',\n", + " 'full_text': 'in the above example, candidate s must include the field continent with any of the following values \"asia\", \"europe\", 
\"africa\" under the field continent will be return ed'},\n", + " {'Sentences': ['therangequery allow you to filter documents based on specified range of values within given field.',\n", + " 'it can be used for numeric and date fields.',\n", + " 'the range query uses the following terms:',\n", + " '\"gte\": the document must be greater than or equals to the provided values',\n", + " '\"gt\": the document must be greater than the provided values',\n", + " '\"lte\": the document must be smaller than or equals to the provided values',\n", + " '\"lt\": the document must be smaller than the provided values',\n", + " 'the above example requires candidate to have field named \"date\" with values that are greater than or equal to \"2023 01 01\" and smaller than or equal to \"2023 12 31\".',\n", + " 'the above example requires candidate to have field named \"datetime\" with values that are greater than or equal to \"2023 01 01t08:00:00\" and smaller than \"2023 01 01t17:30:00\"\".',\n", + " 'the above example requires candidate to have field named \"price\" with values that are greater than \"10\" and smaller or equal to than \"30\"\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/candidate-generation-and-metadata-filtering',\n", + " 'header': 'range match',\n", + " 'full_text': 'the above example requires candidate s to have a field named \"price\" with values that are greater than \"10\" and smaller or equal to than \"30\"\".'},\n", + " {'Sentences': ['hyperspace support various methods of score and arithmetic, based on rarity of keyword in the collection.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': 'score and ranking',\n", + " 'full_text': 'hyperspace support various methods of score and arithmetic, based on rarity of keyword s in the collection.'},\n", + " {'Sentences': ['term frequency inverse document frequency tf idf is 
numerical statistic that reflects the importance of term within document in corpus.',\n", + " 'it is the default score for matched terms.',\n", + " 'in the above example, matched documents will be assigned with score based on the tf idf formula and will be ranked accordingly.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': 'rarity score tf idf ',\n", + " 'full_text': 'in the above example, matched documents will be assigned with a score based on the tf idf formula and will be ranked accordingly.'},\n", + " {'Sentences': ['the dis max query allows you to select the highest score of list of subqueries.',\n", + " 'in the above example, matched documents will be assigned with score based on the tf idf formula and the maximum score will be return ed.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': 'dis max clause ',\n", + " 'full_text': 'in the above example, matched documents will be assigned with a score based on the tf idf formula and the maximum score will be return ed.'},\n", + " {'Sentences': ['the function score query allows you to modify the relevance score of documents return ed by query.',\n", + " 'its particularly useful when you want to introduce custom score logic, boost certain documents, or apply mathematical function to influence the relevance of search results.',\n", + " 'the function score query wraps around an existing query e.g., amatchquery and modifies the score produced by that query.',\n", + " 'score function are defined within the function array.',\n", + " 'each function applies specific logic to modify the relevance score of documents.',\n", + " 'common types of function include:',\n", + " 'weight: assigns static weight to the documents.',\n", + " 'field value factor: scales score based on the values of numeric field.',\n", + " 'script score allows you to define custom 
score logic using script.',\n", + " 'random score introduces randomness to the score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': ' function score ',\n", + " 'full_text': 'random score : introduces randomness to the score s .'},\n", + " {'Sentences': ['multiple score function can be defined within the function array.',\n", + " 'the results of these function are combined to produce the final relevance score',\n", + " 'you can control how the score are combined using parameter like score modeand boost mode.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': 'combine function s :',\n", + " 'full_text': 'multiple score function s can be defined within the function s array. the results of these function s are combined to produce the final relevance score . you can control how the score s are combined using parameter s like score modeand boost mode.'},\n", + " {'Sentences': ['the boost mode parameter specifies how the score from different function are combined.',\n", + " 'common options include:',\n", + " 'multiply: multiply the score from different function',\n", + " 'sum: add the score from different function',\n", + " 'replace: use the score of the first function that produces non zero score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': ' boost mode:',\n", + " 'full_text': 'replace: use the score of the first function that produces a non zero score .'},\n", + " {'Sentences': ['the score mode parameter determines how the score of individual function are combined.',\n", + " 'common options include:',\n", + " 'multiply: multiply the score',\n", + " 'sum: add the score',\n", + " 'avg: calculate the average of the score',\n", + " 'in the above example, the function score query is applied to amatchquery.',\n", 
+ " 'it includes two function one that assigns static weight of 2, and another that scales the score based on the square root of numeric field.',\n", + " 'the first function weight multiplies the score by 2.0 weight base score',\n", + " 'the second function field value factor uses the square root of the numeric field.',\n", + " 'the final score for this document would be the sum of these score basis score first function second function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/dsl-query-interface/scoring-and-ranking',\n", + " 'header': ' score mode:',\n", + " 'full_text': 'the final score for this document would be the sum of these score s basis score + first function + second function '},\n", + " {'Sentences': ['hyperspace score function generate candidate by filter ing the database documents.',\n", + " 'hyperspace supports several filter ing methods, both range match based and keyword match based.',\n", + " 'the filter ing is performed at the external conditions stack, that is, only the external “if” conditions will affect the candidate list',\n", + " 'the final candidate list can then be created using additional filter and score',\n", + " 'in the above example, theifmatch genres condition in will create the candidate list while the ifmatch countries condition allows to modify their score but will not change the overall candidate list',\n", + " 'as second example example',\n", + " 'in the above example, only theifmatch genres and doc[budget] >= 100000000condition will create the candidate list',\n", + " 'you can filter candidate using the following methods:',\n", + " 'exact match between keyword',\n", + " 'window match between dates',\n", + " 'match between geo coordinates'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-filtering',\n", + " 'header': ' candidate filter ing',\n", + " 'full_text': 'match between geo coordinates'},\n", + " {'Sentences': ['you can match 
keyword using the function match str fieldname',\n", + " 'the function operates on either keyword or list of keyword',\n", + " 'for keyword the function return true for an exact match between the keyword and for list of keyword it return true for an exact match between any two keyword in the two list',\n", + " 'hyperspace allows two forms of match',\n", + " 'match between field in the query and the same field in the database documents',\n", + " 'match between field in the query and different field in the database documents',\n", + " 'in the above example:',\n", + " 'the field streetis compared between the query and each document.',\n", + " 'if the field includes match value, the corresponding match function will return true.',\n", + " 'the field cityin the query is compared with the field shipping city in the database documents.',\n", + " 'if there is match value, the corresponding match function will return true.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-filtering',\n", + " 'header': 'exact match between keyword s ',\n", + " 'full_text': 'the field cityin the query is compared with the field shipping city in the database documents. 
if there is a match value, the corresponding match function will return true.'},\n", + " {'Sentences': ['window match between dates can be performed using the function window match str fieldname unsigned int dt0, unsigned int dt1',\n", + " 'the function compares the dates doc[ fieldname dt0 anddoc[ fieldname dt1 to params[ fieldname ].',\n", + " 'in other words, the function operates on date fields and return true',\n", + " 'ifdoc[ fieldname dt0 params[ fieldname doc[ fieldname ] 2dandwindow matchwill return false.'},\n", + " {'Sentences': ['geographical coordinates can be compared using the function geo dist match str fieldname float thresh',\n", + " 'the function return trueif the distance between the coordinates is below the threshold, andfalseotherwise.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-filtering',\n", + " 'header': 'match between geo coordinates',\n", + " 'full_text': 'example:'},\n", + " {'Sentences': ['you can directly access the input query values and database documents using the syntaxparams[ fieldname ]ordoc[ fieldname ], correspondingly.',\n", + " 'you can then use the retrieved values in the score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-filtering',\n", + " 'header': 'comparison of document field values',\n", + " 'full_text': 'example:'},\n", + " {'Sentences': ['you can filter based on the vector search score using the function knn filter str vector fieldname 1, str vector fieldname 2, float min score',\n", + " 'knn filter filter based on the knn score according to the metric defined in thedata configuration schemafile.',\n", + " 'the function return if the score is abovemin score threhold,or otherwise.min score can be dynamic value, defined in the query params.',\n", + " 'the function operates onparams[ vector fieldname 1]anddoc[ vector fieldname 2].',\n", + " 'by default vector fieldname vector 
fieldname and min score threhold'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-filtering',\n", + " 'header': ' filter ing based on knn score ',\n", + " 'full_text': 'by default vector fieldname 2 = vector fieldname 1 and min score threhold = 0'},\n", + " {'Sentences': ['hyperspace support various methods of score and arithmetic, based on rarity of keyword in the collection.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': ' candidate score ',\n", + " 'full_text': 'hyperspace support various methods of score and arithmetic, based on rarity of keyword s in the collection.'},\n", + " {'Sentences': ['you can assign constant score using standard assignment in the score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': 'constant',\n", + " 'full_text': 'example:'},\n", + " {'Sentences': ['you can calculate rarity score for two matched keyword between any matched keyword between two list of keyword',\n", + " 'hyperspace uses the tf idf formula for the score',\n", + " 'two different types of usages are currently allowed',\n", + " 'rarity max str fieldname return the maximum rarity out of all the keyword in the list',\n", + " 'rarity sum str fieldname return the sum of rarities of all the keyword in the list',\n", + " 'for keyword fields non list the two function will return the same result.',\n", + " 'rarity sum and rarity max will only return different score for list keyword ].',\n", + " 'in particular, when used for match fields of type keyword they will always return the same score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': 'rarity score tf idf ',\n", + " 'full_text': 'rarity sum and rarity max will only return different score for list [ 
keyword s ]. in particular, when used for match fields of type keyword , they will always return the same score .'},\n", + " {'Sentences': ['hyperspace allows multiple methods for score arithmetic'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': ' score operations',\n", + " 'full_text': 'arithmetic operations'},\n", + " {'Sentences': ['the function receives score output of function such as rarity sum and return their sum',\n", + " 'sum float score 1, float score 2,...',\n", + " 'score 1, score 2, score 3are the results of score function',\n", + " 'score sumis the sum of score 1, score 2, score 3...'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': 'sum of score s ',\n", + " 'full_text': ' score sumis the sum of score 1, score 2, score 3...'},\n", + " {'Sentences': ['the function receives score output of function such as rarity sum and return the maximum of their values',\n", + " 'score 1, score 2, score 3are the results of score function',\n", + " 'score maxis the maximum between score 1, score 2, score 3...'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': 'max of score s ',\n", + " 'full_text': ' score maxis the maximum between score 1, score 2, score 3...'},\n", + " {'Sentences': ['hyperspace allows arithmetic operations between score using the operators+, *, /.',\n", + " 'these operators can be used in combination with the operator=',\n", + " 'score 0is the result of score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': 'arithmetic operators',\n", + " 'full_text': ' score 0is the result of a score function .'},\n", + " {'Sentences': ['you can include the vector search score in the score function by using the 
function distance str vector fieldname 1, str vector fieldname 2, float min score',\n", + " 'distance return the knn score if the score is abovemin score threhold,or otherwise, according to the metric defined in thedata configuration schemafile..min score can be dynamic value, defined in the query params.',\n", + " 'the function operates onparams[ vector fieldname 1]anddoc[ vector fieldname 2].',\n", + " 'by default vector fieldname vector fieldname and min score threhold'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/candidate-score',\n", + " 'header': ' vector distance',\n", + " 'full_text': 'by default vector fieldname 2 = vector fieldname 1 and min score threhold = 0'},\n", + " {'Sentences': ['you can debug the score function using the debug info str message, float var function',\n", + " 'the function allows to create and store tuples of messages and numeric values, as part of the query results.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/debugging-the-score-function',\n", + " 'header': 'debug the score function ',\n", + " 'full_text': 'you can debug the score function s using the debug info str message, float var function . 
the function allows to create and store tuples of messages and numeric values, as part of the query results.'},\n", + " {'Sentences': ['message str the message to be printed after the run',\n", + " 'var float or int variable to be stored as part of the message'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/debugging-the-score-function',\n", + " 'header': 'input',\n", + " 'full_text': 'var float or int a variable to be stored as part of the message'},\n", + " {'Sentences': ['hyperspace allows aggregation of numerical fields over selected documents.',\n", + " 'the aggregation function is implemented inside the score function where each aggregation is performed over the candidate that passed the filter ing up to this position in the code.',\n", + " 'the aggregation result is stored as key under the query results object.',\n", + " 'in the above example, the query results will include key named aggregation \", with the following sub keys:',\n", + " 'key named “max rating”, with value of the max value of the rating of all candidate that passed the filter overgenres.',\n", + " 'key named \"sum budget\", which includes the sum over the field “budget” of all candidate that passed the filter over genres and languages.',\n", + " 'key named \"percentile budget\", which includes the 10,15,32 and 75 percentile over the field “budget” of all candidate that passed the filter over genres and languages.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/aggregations',\n", + " 'header': ' aggregation s ',\n", + " 'full_text': 'a key named \"percentile budget\", which includes the 10,15,32 and 75 percentile over the field “budget” of all candidate s that passed the filter s over genres and languages.'},\n", + " {'Sentences': ['aggregate sum str agg name, str fieldname return the sum of the field over the relevant candidate',\n", + " 'aggregate min str agg name, str fieldname return 
the min of the field over the relevant candidate',\n", + " 'aggregate max str agg name, str fieldname return the max of the field over the relevant candidate',\n", + " 'aggregate avg str agg name, str fieldname return the average of the field over the relevant candidate',\n", + " 'aggregate count str agg name return the total number of valid field entries in the relevant candidate',\n", + " 'aggregate cardinality str agg name, str fieldname return the total number of valid field values in the relevant candidate',\n", + " 'aggregate percentile str agg name, str fieldname list [float] percentiles return the percentiles of the field over the relevant candidate'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/flows/queries/python-query-interface/aggregations',\n", + " 'header': 'the following aggregation s types are supported',\n", + " 'full_text': ' aggregate percentile str agg name, str fieldname , list [float] percentiles return s the percentiles of the field over the relevant candidate s .'},\n", + " {'Sentences': ['database search is the process of retrieving the documents that best meet the query conditions.',\n", + " 'this flow is comprised of two main steps:',\n", + " 'space reduction and candidate generation:the search space is reduced, and list of candidate documents that passed the query filter ing is generated.',\n", + " 'candidate ranking: candidate are ranked according to score that corresponds to how well that match to the query.',\n", + " 'given query, the naïve approach is to consider all the documents in the collection, while evaluating the expression score user query document and return the top match documents.',\n", + " 'however, this approach does not scale because it is impractical to review all the documents in the collection for each query.',\n", + " 'to overcome this problem, one needs some way to reduce the search space dramatically, from all dataset documents down to thousands or even hundreds of documents, so that user query di 
evaluation is only performed on small fraction of the dataset.',\n", + " 'this is calledspace reductionor filter ing, and the reduced group of documents is called the candidate group.',\n", + " 'once the search space is reduced, it is easy to evaluate the score per document over this space and to return the top match documents.',\n", + " 'the next section describe how filter ing and score are specified in hyperspace python and dsl syntaxes.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-query-flow',\n", + " 'header': ' candidate generation and score',\n", + " 'full_text': 'given a query, the naïve approach is to consider all the documents in the collection, while evaluating the expression score i = user query document i and return the top k match documents. however, this approach does not scale because it is impractical to review all the documents in the collection for each query. to overcome this problem, one needs some way to reduce the search space dramatically, from all dataset documents down to thousands or even hundreds of documents, so that user query di evaluation is only performed on a small fraction of the dataset. this is calledspace reductionor filter ing, and the reduced group of documents is called the candidate group. once the search space is reduced, it is easy to evaluate the score per document over this space and to return the k top match documents. 
the next section s describe how filter ing and score are specified in hyperspace python and dsl syntaxes.'},\n", + " {'Sentences': ['the hyperspace engine uses primary conditional expression to narrow down the search space and generate candidate group.',\n", + " 'thus, only documents meeting this primary condition are considered as candidate',\n", + " 'the following code snippets show basic query in the hyperspace python interface.',\n", + " 'in this example, the match \"email domain\" condition is the primary condition lines 1–4 and lines 6–11 apply to the documents that meet the primary condition.',\n", + " 'the \"email domain\" field is matched with the corresponding value,\"yahoo.com\".',\n", + " 'only documents that satisfy the primary condition are advanced to the candidate group.',\n", + " 'consequently, the system only conducts an in depth evaluation of the logic of documents within this group.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-query-flow',\n", + " 'header': 'query python interface',\n", + " 'full_text': 'only documents that satisfy the primary condition are advanced to the candidate group. consequently, the system only conducts an in depth evaluation of the logic of documents within this group.'},\n", + " {'Sentences': ['the following code snippet shows the same query as in the elastic dsl interface format, shown above.',\n", + " 'even without considering the additional function ality, the python syntax is much simpler and more readable than the dsl syntax.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-query-flow',\n", + " 'header': 'query dsl interface',\n", + " 'full_text': 'the following code snippet shows the same query as in the elastic dsl interface format, shown above. 
even without considering the additional function ality, the python syntax is much simpler and more readable than the dsl syntax.'},\n", + " {'Sentences': ['in many applications, queries are used repeatedly with the same logic but with different parameter',\n", + " 'for instance, different instances of the above query might substitute the values \"yahoo.com\" and \"john\" with other values.',\n", + " 'hyperspace engines allows to distinguish between query logic and query documents.',\n", + " 'while the query documents vary, the query logic remains constant.',\n", + " 'this distinction is crucial as it eliminates the need for recompiling the query logic before execution, thereby significantly reducing latency.',\n", + " 'the following code snippets show the query logic and the query documents derived from the above user query.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-query-flow',\n", + " 'header': 'the query logic and query document concept',\n", + " 'full_text': 'the following code snippets show the query logic and the query documents derived from the above user query.'},\n", + " {'Sentences': ['keyword and metadata hold essential information about data.',\n", + " 'keyword based search also calledclassic search leverages this information by match keyword and values to find related items.',\n", + " 'similarity search translates these keyword and values into queries to quickly identify objects with shared characteristics or patterns.',\n", + " 'this approach minimizes computational effort.',\n", + " 'similarity search is relatively fast method that minimizes the computational investment required to identify similar items, thus enabling fast retrieval.',\n", + " 'just like vector search, this method is commonly used in wide variety of applications.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-hybrid-search',\n", + " 'header': 'lexical search',\n", + " 'full_text': ' keyword s and metadata 
hold essential information about data. keyword based search also calledclassic search leverages this information by match keyword s and values to find related items. similarity search translates these keyword s and values into queries to quickly identify objects with shared characteristics or patterns. this approach minimizes computational effort. similarity search is a relatively fast method that minimizes the computational investment required to identify similar items, thus enabling fast retrieval. just like vector search, this method is commonly used in a wide variety of applications.'},\n", + " {'Sentences': ['vector search is widely used technique for finding similar items using vector representations.',\n", + " 'it involves vector embedding, which turns data into high dimensional vector that embody their essential traits.',\n", + " 'instead of traditional keyword or exact match, vector search identifies similar items by measuring the closeness or similarity between these vector rather than relying on traditional keyword match or exact matches.',\n", + " 'this approach is frequently used in wide variety of applications, such as recommendation systems and content search.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-hybrid-search',\n", + " 'header': ' vector search',\n", + " 'full_text': ' vector search is a widely used technique for finding similar items using vector representations. it involves vector embedding, which turns data into high dimensional vector s that embody their essential traits. instead of traditional keyword or exact match, vector search identifies similar items by measuring the closeness or similarity between these vector s rather than relying on traditional keyword match or exact matches. 
this approach is frequently used in a wide variety of applications, such as recommendation systems and content search.'},\n", + " {'Sentences': ['hybrid search combines vector and keyword /metadata searches, offering versatile solution that caters to various types of applications and delivers the best of both worlds.',\n", + " 'hybrid search allows the use of semantic search for understanding the context and meaning behind the search terms, as well as keyword search for exact match of specific terms,',\n", + " 'hyperspaces engine allows hybrid search that merges fulllexical searchand vector search.',\n", + " 'this combination leverages the strengths of each search type, offering more accurate and comprehensive search experience.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-hybrid-search',\n", + " 'header': 'hybrid search',\n", + " 'full_text': 'hyperspaces engine allows hybrid search that merges fulllexical searchand vector search. this combination leverages the strengths of each search type, offering a more accurate and comprehensive search experience.'},\n", + " {'Sentences': ['hyperspace enables enhanced search performance through smart index.',\n", + " 'hyperspaces similarity search engine employs an inverted index that supports various data field types, ensuring data access.',\n", + " 'the search performance is further improved by providing cardinality hints at the configuration level.',\n", + " 'hyperspace offers similar advantages in the realm of vector search.',\n", + " 'in particular, it is optimized for graph search in many cases where cpu based search suffers from poor performance.',\n", + " 'in many scenarios, vector search uses graph technology.',\n", + " 'to fully employ the cpu cycle, the data should be cached to some extent.',\n", + " 'however, graphs commonly suffer from low predictability, thus reducing the ability for efficient caching.'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-hybrid-search',\n", + " 'header': 'hyperspace hybrid index',\n", + " 'full_text': 'hyperspace offers similar advantages in the realm of vector search. in particular, it is optimized for graph search in many cases where cpu based search suffers from poor performance. in many scenarios, vector search uses graph technology. to fully employ the cpu cycle, the data should be cached to some extent. however, graphs commonly suffer from low predictability, thus reducing the ability for efficient caching.'},\n", + " {'Sentences': ['for hybrid search that combines accurate knn brute force with metadata filter ing, hyperspace uses thepre filter ingapproach.',\n", + " 'in this approach, the documents are first filter ed using lexical search.',\n", + " 'vector similarity is then calculated only for documents that pass this initial filter ing.',\n", + " 'for knn, this approach optimizes the query latency without reducing its recall.\\u200b'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/reference/hyperspace-hybrid-search',\n", + " 'header': 'hybrid k nearest neighbors knn ',\n", + " 'full_text': 'for hybrid search that combines accurate knn brute force with metadata filter ing, hyperspace uses thepre filter ingapproach. in this approach, the documents are first filter ed using lexical search. vector similarity is then calculated only for documents that pass this initial filter ing. 
for knn, this approach optimizes the query latency without reducing its recall.\\u200b'},\n", + " {'Sentences': ['the function add batch batch, collection name uploads batch of documents to collection.',\n", + " 'batch– represents the batch of documents to upload.',\n", + " 'the structure of each document must conform to the database schema configuration file.',\n", + " 'it must be oftype list [dictionary].',\n", + " 'collection name– specifies the name of the collection into which to load the document.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/hyperspace-client/add_batch',\n", + " 'header': 'add batch',\n", + " 'full_text': 'collection name– specifies the name of the collection into which to load the document.'},\n", + " {'Sentences': ['the following code snippet builds list of documents in temporary variable named batch and then uploads each batch using',\n", + " '{code: 200, message: batch successfully added, status: ok}'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/hyperspace-client/add_batch',\n", + " 'header': 'example 1 ',\n", + " 'full_text': '{code: 200, message: batch successfully added, status: ok}'},\n", + " {'Sentences': ['theasync reqkey can be used in order to run commands in an non synchronic manner.',\n", + " 'by default when command is submitted, the system awaits the client response.',\n", + " 'theasync reqkey allows to change this behavior and run multiple processes simultaneously.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/hyperspace-client/async_req',\n", + " 'header': 'async req',\n", + " 'full_text': 'theasync reqkey can be used in order to run commands in an non synchronic manner. by default when a command is submitted, the system awaits the client response. 
theasync reqkey allows to change this behavior and run multiple processes simultaneously.'},\n", + " {'Sentences': ['the function update document body, collection name,partial update, doc as upsert uploads single document to collection.',\n", + " 'body document represents the document to upload.',\n", + " 'the body must include the id of the document to be updated, and the structure must conform to the collection schema configuration file.',\n", + " 'it must be of typedictionary.',\n", + " 'collection name str specifies the name of the collection into which to load the document.',\n", + " 'partial update bool specifies if to perform partial or full update.',\n", + " 'forpartial update true, only the fields included in body will be updated.',\n", + " 'ifpartial update false, the collection document will be replaced.',\n", + " 'default value is false.',\n", + " 'doc as upsert bool \"update or insert\" operation.',\n", + " 'whenupsert true, the upload command will attempt to run your update script.',\n", + " 'if the document exists, it will be updated.',\n", + " 'otherwise, it will be uploaded as new document.',\n", + " 'the following response should be received',\n", + " '{status: ok, code: 200, message: document was successfully updated}',\n", + " 'at the moment, updating field of type \"dense vector is not possible.',\n", + " 'when updating document, the \"body\" document must not include fields of type \"dense vector \".',\n", + " 'if the corresponding database document meaning the document with the same id in the database includes such field, it will remain unchanged.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/hyperspace-client/update_document',\n", + " 'header': 'update document',\n", + " 'full_text': 'at the moment, updating a field of type \"dense vector \" is not possible. \\nwhen updating a document, the \"body\" document must not include fields of type \"dense vector \". 
if the corresponding database document meaning the document with the same id in the database includes such a field, it will remain unchanged.'},\n", + " {'Sentences': ['hyperspace supports the following commands in python score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework',\n", + " 'header': 'python query framework',\n", + " 'full_text': 'hyperspace supports the following commands in a python score function '},\n", + " {'Sentences': ['boolean expression and or not', 'geo coordinates match'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework',\n", + " 'header': ' candidate generation and metadata filter ing',\n", + " 'full_text': 'window match'},\n", + " {'Sentences': ['max \\ufeff\\ufeffrarity score tf idf bm25',\n", + " 'sum rarity score tf idf bm25',\n", + " 'score operations sum, max, arithmetic operations',\n", + " 'score weights boost'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework',\n", + " 'header': 'score and ranking',\n", + " 'full_text': ' score max'},\n", + " {'Sentences': ['free text search'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework',\n", + " 'header': ' candidate generation and metadata filter ing',\n", + " 'full_text': 'free text search'},\n", + " {'Sentences': ['the function aggregate sum str agg name, str fieldname return the sum of the field fieldname over all documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key namedagg name.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_sum',\n", + " 'header': ' aggregate sum',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key namedagg name.'},\n", + " 
{'Sentences': ['the function aggregate avg str agg name, str fieldname return the average of the field fieldname over all documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key named \"agg name\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_avg',\n", + " 'header': ' aggregate avg',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key named \"agg name\".'},\n", + " {'Sentences': ['the aggregate min str agg name, str fieldname return the min of the field fieldname over all documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key namedagg name.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_min',\n", + " 'header': ' aggregate min',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key namedagg name.'},\n", + " {'Sentences': ['the aggregate max str agg name, str fieldname return the max of the field fieldname over all documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key named \"agg name\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_max',\n", + " 'header': ' aggregate max',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key named \"agg name\".'},\n", + " {'Sentences': ['the function aggregate count str agg name, str fieldname return the number documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key named \"agg name\".'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_count',\n", + " 'header': ' aggregate count',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key named \"agg name\".'},\n", + " {'Sentences': ['the function aggregate cardinality str agg name, str fieldname return the number documents that passed the relevant filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key named \"agg name\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_cardinality',\n", + " 'header': ' aggregate cardinality',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key named \"agg name\".'},\n", + " {'Sentences': ['the function aggregate percentile str agg name, str fieldname list [float] percentiles return the percentiles of the field fieldname over all cabdidates that passed the filter ing.',\n", + " 'the aggregation result will be stored under the query results objects, under key named \"agg name\".'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/aggregate_percentile',\n", + " 'header': ' aggregate percentile',\n", + " 'full_text': 'the aggregation result will be stored under the query results objects, under a key named \"agg name\".'},\n", + " {'Sentences': ['the function date histogram str agg name, str fieldname str time interval allows to create histograms of the aggregation results, by date.',\n", + " 'however, results will be segmented as histogram with resolution according totime interval.',\n", + " 'the aggregation result will be stored under key named \"agg name\" in results.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/aggregations/date_histogram',\n", + " 'header': 
'date histogram',\n", + " 'full_text': 'however, results will be segmented as a histogram with resolution according totime interval. the aggregation result will be stored under a key named \"agg name\" in results.'},\n", + " {'Sentences': ['match str fieldname allowsexact match between keyword',\n", + " 'the function operates on fields of type keyword or list of keyword',\n", + " 'for keyword the function return truefor an exact match andfalseotherwise.',\n", + " 'for list of keyword the function acts asmatch any return ing truefor an exact match between any two keyword of the list',\n", + " 'the function can operate on field in the query and the same field in database document, or between field in the query and different field in database document.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/match',\n", + " 'header': 'match',\n", + " 'full_text': 'the function can operate on a field in the query and the same field in a database document, or between a field in the query and a different field in a database document.'},\n", + " {'Sentences': ['fieldname str the name of the field to match in query document.',\n", + " 'params[ fieldname 1] must be of type keyword or list keyword',\n", + " 'fieldname str, default= fieldname the name of the field to match in database document.',\n", + " 'doc[ fieldname 2] must be of type keyword or list keyword'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/match',\n", + " 'header': 'input',\n", + " 'full_text': ' fieldname 2 str, default= fieldname 1 the name of the field to match in database document. 
doc[ fieldname 2] must be of type keyword or list [ keyword ]'},\n", + " {'Sentences': ['the function knn filter str vector fieldname 1, str vector fieldname 2, float min score allows to filter candidate based on vector distance knn between document fields, according to the distance metric defined in the data schema config file.',\n", + " 'if the distance score is belowmin score the function will return 0.',\n", + " 'otherwise, it will return 1.',\n", + " 'any arithmetic combination +, *, *, ofknn filter and variable or constant is allowed.',\n", + " 'min score can be dynamic value, included in the query parameter'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/knn_filter',\n", + " 'header': 'knn filter ',\n", + " 'full_text': 'min score can be a dynamic value, included in the query parameter s .'},\n", + " {'Sentences': ['vector fieldname str the name of the query field to use in the knn calculation.',\n", + " 'params[ fieldname 1] must be of typedense vector',\n", + " 'vector fieldname str, default= fieldname the name of the document field to use in the knn calculation.',\n", + " 'params[ fieldname 1] must be of typedense vector',\n", + " 'by default vector fieldname 2is set to vector fieldname 1.',\n", + " 'min score float, default=0 the score threshold.',\n", + " 'if the distance score is below this value, distance will return 0.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/knn_filter',\n", + " 'header': 'input',\n", + " 'full_text': 'min score float, default=0 the score threshold. 
if the distance score is below this value, distance will return 0.'},\n", + " {'Sentences': ['result float the return ed value is 1.0 if the knn distance between vector fieldname 1to vector fieldname 2is greater the min thershold, and 0.0 otherwise.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/knn_filter',\n", + " 'header': 'output',\n", + " 'full_text': 'result float the return ed value is 1.0 if the knn distance between vector fieldname 1to vector fieldname 2is greater the min thershold, and 0.0 otherwise.'},\n", + " {'Sentences': ['geo dist match str fieldname float thresh allowscomparison between fields that contain geo coordinates.',\n", + " 'the function return trueif the distance between the coordinates is smaller thanthresh, andfalseotherwise.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/geo_dist_match',\n", + " 'header': 'geo dist match',\n", + " 'full_text': 'geo dist match str fieldname , float thresh allowscomparison between fields that contain geo coordinates. 
the function return s trueif the distance between the coordinates is smaller thanthresh, andfalseotherwise.'},\n", + " {'Sentences': ['fieldname str the name of the field to match in query document and database document.',\n", + " 'params[ fieldname and doc[ fieldname must be of typetuple float',\n", + " 'thresh float the distance threshold'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/geo_dist_match',\n", + " 'header': 'input',\n", + " 'full_text': 'thresh float the distance threshold'},\n", + " {'Sentences': ['window match str fieldname unsigned int dt0, unsigned int dt1 allowswindow match between dates.',\n", + " 'the function return trueifparams[ fieldname ]is between doc[ fieldname dt0todoc[ fieldname dt1'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/window_match',\n", + " 'header': 'window match',\n", + " 'full_text': 'the function return s trueifparams[ fieldname ]is between doc[ fieldname ] dt0todoc[ fieldname ] dt1'},\n", + " {'Sentences': ['fieldname str the name of the field to match in query docmuent and database document.',\n", + " 'params[ fieldname and doc[ fieldname must be of typeint',\n", + " 'dt0 unsigned int the left margin of the window.',\n", + " 'must include units s/m/h/d',\n", + " 'dt1 unsigned int the right margin of the window.',\n", + " 'must include units s/m/h/d'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-generation-and-metadata-filtering/window_match',\n", + " 'header': 'input',\n", + " 'full_text': 'dt1 unsigned int the right margin of the window. 
must include units s/m/h/d .'},\n", + " {'Sentences': ['combine vector distance with classic score in score function',\n", + " 'the function distance str vector fieldname 1, str vector fieldname 2, float min score calculates vector distance knn between document fields, according to the distance metric defined in the data schema config file.',\n", + " 'if the distance score is belowmin score the function will return 0.',\n", + " 'otherwise, it will return the knn score',\n", + " 'any arithmetic combination +, *, *, ofdistance and variable or constant is allowed.',\n", + " 'min score can be dynamic value, included in the query parameter'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/distance',\n", + " 'header': 'distance',\n", + " 'full_text': 'min score can be a dynamic value, included in the query parameter s .'},\n", + " {'Sentences': ['vector fieldname str the name of the query field to use in the knn calculation.',\n", + " 'params[ fieldname 1] must be of typedense vector',\n", + " 'vector fieldname str, default= fieldname the name of the document field to use in the knn calculation.',\n", + " 'params[ fieldname 1] must be of typedense vector',\n", + " 'by default vector fieldname 2is set to vector fieldname 1.',\n", + " 'min score float, default=0 the score threshold.',\n", + " 'if the distance score is below this value, distance will return 0.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/distance',\n", + " 'header': 'input',\n", + " 'full_text': 'min score float, default=0 the score threshold. 
if the distance score is below this value, distance will return 0.'},\n", + " {'Sentences': ['int the return ed values will be the knn distance between vector fieldname 1to vector fieldname 2if it is greater the min thershold, and otherwise.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/distance',\n", + " 'header': 'output',\n", + " 'full_text': ' int the return ed values will be the knn distance between vector fieldname 1to vector fieldname 2if it is greater the min thershold, and 0 otherwise.'},\n", + " {'Sentences': ['max float score 1, float score receives score output of score function and return their max'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/max',\n", + " 'header': 'max',\n", + " 'full_text': 'max float score 1, float score 2 receives n score s output of n score function s and return s their max'},\n", + " {'Sentences': ['score 1, score 2,...',\n", + " 'float score results of score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/max',\n", + " 'header': 'input',\n", + " 'full_text': ' score 1, score 2,... float n score s results of score function s '},\n", + " {'Sentences': ['the function rarity max str fieldname return therarity score of matched keyword',\n", + " 'the function can calculate this score over keyword or list of keyword ,using the tf idf formula.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/rarity_max',\n", + " 'header': 'rarity max',\n", + " 'full_text': 'the function rarity max str fieldname return s therarity score of matched keyword s . 
the function s can calculate this score over keyword s or list s of keyword s ,using the tf idf formula.'},\n", + " {'Sentences': ['fieldname str the name of the field to match in query document.',\n", + " 'params[ fieldname 1] must be of type keyword or list keyword',\n", + " 'fieldname str, default= fieldname the name of the field to match in database document.',\n", + " 'v[ fieldname 2] must be of type keyword or list keyword'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/rarity_max',\n", + " 'header': 'input',\n", + " 'full_text': ' fieldname 2 str, default= fieldname 1 the name of the field to match in database document. v[ fieldname 2] must be of type keyword or list [ keyword ]'},\n", + " {'Sentences': ['rarity sum str fieldname calculates the rarity score for matched keyword',\n", + " 'the function calculates the score over keyword or list of keyword using the tf idf formula.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/rarity_sum',\n", + " 'header': 'rarity sum',\n", + " 'full_text': 'rarity sum str fieldname calculates the rarity score for matched keyword s . the function calculates the score over keyword s or list s of keyword s , using the tf idf formula.'},\n", + " {'Sentences': ['fieldname str the name of the field to match in query document.',\n", + " 'params[ fieldname 1] must be of type keyword or list keyword',\n", + " 'fieldname str, default= fieldname the name of the field to match in database document.',\n", + " 'v[ fieldname 2] must be of type keyword or list keyword'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/rarity_sum',\n", + " 'header': 'input',\n", + " 'full_text': ' fieldname 2 str, default= fieldname 1 the name of the field to match in database document. 
v[ fieldname 2] must be of type keyword or list [ keyword ]'},\n", + " {'Sentences': ['sum float score 1, float score 2,..',\n", + " 'receives score output of score function and return their sum'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/sum',\n", + " 'header': 'sum',\n", + " 'full_text': 'sum float score 1, float score 2,.. receives n score s output of n score function s and return s their sum'},\n", + " {'Sentences': ['score float set of score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/candidate-scoring/sum',\n", + " 'header': 'input',\n", + " 'full_text': ' score float a set of score s '},\n", + " {'Sentences': ['debug info str message, float var allowsto print message and variables from the score function',\n", + " 'the output will be saved under the search result.',\n", + " 'this function is useful tool to assist in debug of the score function'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/debug_info',\n", + " 'header': 'debug info',\n", + " 'full_text': 'this function is a useful tool to assist in debug of the score function .'},\n", + " {'Sentences': ['message str the message to be saved in the search result.',\n", + " 'var float or int variable to be stored as part of the message.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/python-query-framework/debug_info',\n", + " 'header': 'input',\n", + " 'full_text': 'var float or int a variable to be stored as part of the message.'},\n", + " {'Sentences': ['hyperspace supports the following commands in dsl queries'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': 'dsl query framework',\n", + " 'full_text': 'hyperspace supports the following commands in a dsl queries'},\n", + " {'Sentences': ['must not clause', 'should 
not clause'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': 'boolean queries',\n", + " 'full_text': ' should not clause '},\n", + " {'Sentences': ['free text search'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': ' candidate generation and metadata filter ing',\n", + " 'full_text': 'free text search'},\n", + " {'Sentences': ['rarity score tf idf'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': 'score and ranking',\n", + " 'full_text': 'rarity score tf idf '},\n", + " {'Sentences': ['sum return the sum of the field over the relevant candidate',\n", + " 'min return the min of the field over the relevant candidate',\n", + " 'max return the max of the field over the relevant candidate',\n", + " 'avg return the average of the field over the relevant candidate',\n", + " 'count return the total number of valid field entries in the relevant candidate',\n", + " 'percentiles return the percentiles of the field over the relevant candidate'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': ' aggregation s ',\n", + " 'full_text': 'terms'},\n", + " {'Sentences': ['update by script', 'update by query'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework',\n", + " 'header': 'data management',\n", + " 'full_text': 'update by query'},\n", + " {'Sentences': ['the must clause specifies conditions that must be satisfiedfor document to be considered match.',\n", + " 'in terms of logical operators, it corresponds to and operator.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/bool-queries/must-clause',\n", + " 'header': ' must clause ',\n", + " 'full_text': 'the must clause specifies conditions that must be 
satisfiedfor a document to be considered a match. in terms of logical operators, it corresponds to and operator.'},\n", + " {'Sentences': ['in hyperspace, the should clause within aboolquery is used to specify conditions that are optional for document to be considered match.',\n", + " 'unlike the must clause which imposes mandatory conditions, the should clause only modifies the document score and allows for flexibility by indicating that any of the specified conditions can be satisfied for document to contribute to the search results.',\n", + " 'the should clause is often used for expressing optional or desirable conditions.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/bool-queries/should-clause',\n", + " 'header': ' should clause ',\n", + " 'full_text': 'in hyperspace, the should clause within aboolquery is used to specify conditions that are optional for a document to be considered a match. unlike the must clause , which imposes mandatory conditions, the should clause only modifies the document score , and allows for flexibility by indicating that any of the specified conditions can be satisfied for a document to contribute to the search results. the should clause is often used for expressing optional or desirable conditions.'},\n", + " {'Sentences': ['the should not clause within aboolquery specifies conditions that are optional for document to be considered match, in similar manner to should clause',\n", + " 'the should not clause decreases the document score',\n", + " 'the should clause is often used for expressing optional or desirable conditions.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/bool-queries/should_not-clause',\n", + " 'header': ' should not clause ',\n", + " 'full_text': 'the should not clause within aboolquery specifies conditions that are optional for a document to be considered a match, in a similar manner to should clause . 
the should not clause decreases the document score . the should clause is often used for expressing optional or desirable conditions.'},\n", + " {'Sentences': ['the function score query allows you to modify the relevance score of documents return ed by query.',\n", + " 'its particularly useful when you want to introduce custom score logic, boost certain documents, or apply mathematical function to influence the relevance of search results.',\n", + " 'the function score query wraps around an existing query e.g., amatchquery and modifies the score produced by that query.',\n", + " 'score function are defined within the function array.',\n", + " 'each function applies specific logic to modify the relevance score of documents.',\n", + " 'common types of function include:',\n", + " 'weight: assigns static weight to the documents.',\n", + " 'field value factor: scales score based on the values of numeric field.',\n", + " 'script score allows you to define custom score logic using script.',\n", + " 'random score introduces randomness to the score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/scoring-and-ranking/function-score',\n", + " 'header': ' function score ',\n", + " 'full_text': 'random score : introduces randomness to the score s .'},\n", + " {'Sentences': ['multiple score function can be defined within the function array.',\n", + " 'the results of these function are combined to produce the final relevance score',\n", + " 'you can control how the score are combined using parameter like score modeand boost mode.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/scoring-and-ranking/function-score',\n", + " 'header': 'combine function s :',\n", + " 'full_text': 'multiple score function s can be defined within the function s array. the results of these function s are combined to produce the final relevance score . 
you can control how the score s are combined using parameter s like score modeand boost mode.'},\n", + " {'Sentences': ['the boost mode parameter specifies how the score from different function are combined.',\n", + " 'common options include:',\n", + " 'multiply: multiply the score from different function',\n", + " 'sum: add the score from different function',\n", + " 'replace: use the score of the first function that produces non zero score'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/api-documentation/dsl-query-framework/scoring-and-ranking/function-score',\n", + " 'header': ' boost mode:',\n", + " 'full_text': 'replace: use the score of the first function that produces a non zero score .'},\n", + " {'Sentences': ['hyperspace search allows to filter and assign score for candidate using lexical and vector search.',\n", + " 'only candidate that passed the filter stage will be assigned score',\n", + " 'in addition, hyperspace allows aggregation of candidate fields.',\n", + " 'aggregation can be performed on all candidate and not just those that passed filter ing.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'hyperspace search',\n", + " 'full_text': 'in addition, hyperspace allows aggregation s of candidate fields. 
aggregation s can be performed on all candidate s , and not just those that passed filter ing.'},\n", + " {'Sentences': ['hyperspace allows to build lexical, vector and hybrid search queries in either domain specific syntax .json structure or native python syntax.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'native python syntax and dsl syntax',\n", + " 'full_text': 'hyperspace allows to build lexical, vector and hybrid search queries in either domain specific syntax .json structure or native python syntax.'},\n", + " {'Sentences': ['hyperspace supports variety of score mechanisms.',\n", + " 'these include tf idf and bm25 based score weights and boost for lexical search, and similarity metrics such as euclidean and hamming distance for vector search.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'multiple score methods',\n", + " 'full_text': 'hyperspace supports a variety of score mechanisms. 
these include tf idf and bm25 based score , weights and boost s for lexical search, and similarity metrics such as euclidean and hamming distance for vector search.'},\n", + " {'Sentences': ['hyperspace efficient memory management allows to include an extremely large number of keyword and value fields in each query, allowing practically unlimited number of fields.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'no limitation on number of metadata fields',\n", + " 'full_text': 'hyperspace efficient memory management allows to include an extremely large number of keyword and value fields in each query, allowing practically unlimited number of fields.'},\n", + " {'Sentences': ['hyperspace fully supports multi model vector search, allowing to use multiple vector in each search query, and to use the results of each vector search in order to filter other vector searches.'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'full support of multi model search',\n", + " 'full_text': 'hyperspace fully supports multi model vector search, allowing to use multiple vector s in each search query, and to use the results of each vector search in order to filter other vector searches.'},\n", + " {'Sentences': ['hyperspace allows to perform vector search with extremely large vector with thousands of elements per vector'],\n", + " 'url': 'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': 'extremely large vector s ',\n", + " 'full_text': 'hyperspace allows to perform vector search with extremely large vector s , with thousands of elements per vector .'},\n", + " {'Sentences': ['hyperspace allows to create sophisticated filter ing and score logic based on both vector and lexical search results.'],\n", + " 'url': 
'https://docs.hyper-space.io/hyperspace-docs/getting-started/overview/hyperspace-search',\n", + " 'header': ' filter ing and score based on both vector and lexical search',\n", + " 'full_text': 'hyperspace allows to create sophisticated filter ing and score logic based on both vector and lexical search results.'}]" + ] + }, + "metadata": {}, + "execution_count": 387 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "In the next step, we use NLTK to tokenize and lemmatize the text. Note that Hyperspace supports free-text search, so this step can also be done within Hyperspace." + ], + "metadata": { + "id": "wII6tDG_qF7R" + } + }, + { + "cell_type": "code", + "source": [ + "import nltk\n", + "from nltk.corpus import wordnet\n", + "from nltk.stem import WordNetLemmatizer\n", + "\n", + "nltk.download('averaged_perceptron_tagger')\n", + "nltk.download('wordnet')\n", + "nltk.download('punkt')\n", + "nltk.download('stopwords')\n", + "from nltk.corpus import stopwords\n", + "stop_words= set(list(stopwords.words('english')) + [\"I\", \"The\",\"a\",\"an\", '..'])\n", + "\n", + "def get_wordnet_pos(word, tag):\n", + "    tag_dict = {\n", + "        \"J\": wordnet.ADJ,\n", + "        \"N\": wordnet.NOUN,\n", + "        \"V\": wordnet.VERB,\n", + "        \"R\": wordnet.ADV,\n", + "    }\n", + "    return tag_dict.get(tag[0].upper(), wordnet.NOUN)\n", + "\n", + "def lemmetize_text(text):\n", + "    tokens = nltk.word_tokenize(text)\n", + "    lemmatizer = WordNetLemmatizer()\n", + "    tagged = nltk.pos_tag(tokens)\n", + "    return [lemmatizer.lemmatize(word, get_wordnet_pos(word, tag)) for word, tag in tagged if word not in stop_words and (len(word) > 1 or word.isdigit()) and word != \"``\"]\n", + "\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N74LU1QaJ54i", + "outputId": "1b9f7566-fd2b-4799-c3ca-5a58b3f61f01" + }, + "execution_count": 329, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "[nltk_data] Downloading package 
averaged_perceptron_tagger to\n", + "[nltk_data] /root/nltk_data...\n", + "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", + "[nltk_data] date!\n", + "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", + "[nltk_data] Package wordnet is already up-to-date!\n", + "[nltk_data] Downloading package punkt to /root/nltk_data...\n", + "[nltk_data] Package punkt is already up-to-date!\n", + "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", + "[nltk_data] Package stopwords is already up-to-date!\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we embed each sentence of every document using the all-MiniLM-L6-v2 model." + ], + "metadata": { + "id": "oO1nxWZjqMgS" + } + }, + { + "cell_type": "code", + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n", + "\n", + "# Create one record per sentence, keeping a link to its parent document\n", + "data = []\n", + "for i, document in enumerate(documents):\n", + "    for sentence in document[\"Sentences\"]:\n", + "        new_doc = {key: document[key] for key in document.keys() if key != \"Sentences\"}\n", + "        new_doc[\"embedded_sentence\"] = [float(x) for x in embedding_model.encode(sentence)]\n", + "        new_doc[\"parent_id\"] = str(i)\n", + "        new_doc[\"full_text\"] = document[\"full_text\"]\n", + "        new_doc[\"keywords\"] = lemmetize_text(sentence)\n", + "        new_doc[\"header keywords\"] = lemmetize_text(new_doc[\"header\"])\n", + "        data.append(new_doc)" + ], + "metadata": { + "id": "yA1I7SgSYqnN" + }, + "execution_count": 388, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "data[0][\"header keywords\"]" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "DaqeAquylDfs", + "outputId": "a17ff647-1260-41ac-abfd-856abadc98c5" + }, + "execution_count": 405, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "['overview']" + ] + }, + "metadata": {}, + 
"execution_count": 405 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K41CEp06-JmN" + }, + "source": [ + "# Setting up the Hyperspace environment\n", + "We now move on to ingesting and querying the data. Working with Hyperspace requires the following steps:\n", + "\n", + "1. Install the client API\n", + "2. Connect to a server\n", + "3. Create a data schema file\n", + "4. Create a collection\n", + "5. Ingest data\n", + "6. Run a query" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7UVt24r6-Mft" + }, + "source": [ + "## 1. Install the client API\n", + "The Hyperspace client API can be installed directly from GitHub using the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 315, + "metadata": { + "id": "edxBW-er-Lvi", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "4fe50baa-9865-4cdc-e4e1-04ecd7256e8a" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting git+https://github.com/hyper-space-io/hyperspace-py\n", + " Cloning https://github.com/hyper-space-io/hyperspace-py to /tmp/pip-req-build-tbcdr0eg\n", + " Running command git clone --filter=blob:none --quiet https://github.com/hyper-space-io/hyperspace-py /tmp/pip-req-build-tbcdr0eg\n", + " Resolved https://github.com/hyper-space-io/hyperspace-py to commit 749fd8016074f537e9bcb2674e3d9aef35e71e5d\n", + " Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + "Requirement already satisfied: certifi>=14.05.14 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2024.6.2)\n", + "Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (1.16.0)\n", + "Requirement already satisfied: python_dateutil>=2.5.3 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2.8.2)\n", + "Requirement already satisfied: setuptools>=21.0.0 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (67.7.2)\n", + "Requirement already satisfied: urllib3>=1.15.1 in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (2.0.7)\n", + "Requirement already satisfied: msgpack in /usr/local/lib/python3.10/dist-packages (from hyperspace==1.0.0) (1.0.8)\n" + ] + } + ], + "source": [ + "pip install git+https://github.com/hyper-space-io/hyperspace-py" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TCZSwM6DVeDm" + }, + "source": [ + "## 2. Connect to a server\n", + "\n", + "Once the Hyperspace API is installed, the database can be accessed by creating a local instance of the Hyperspace client. This step requires a host address, a username and a password." + ] + }, + { + "cell_type": "code", + "execution_count": 389, + "metadata": { + "id": "G22qqqKV51ns" + }, + "outputs": [], + "source": [ + "import hyperspace\n", + "\n", + "# Replace these placeholders with your own host address and credentials\n", + "host = 'host'\n", + "username = \"username\"\n", + "password = \"password\"\n", + "\n", + "hyperspace_client = hyperspace.HyperspaceClientApi(username=username, password=password, host=host)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HXmdh3YGVfQV" + }, + "source": [ + "## 3. Create a Data Schema File\n", + "\n", + "Like other search databases, Hyperspace requires a configuration file that outlines the data schema. 
Here, we create a config file that corresponds to the fields of the given dataset.\n", + "\n", + "For vector fields, we also provide the index type and the metric. Current index options include \"**brute_force**\", \"**hnsw**\", \"**ivf**\", and \"**bin_ivf**\" (for binary vectors). Metric options are \"**IP**\" (inner product) for floating point vectors and \"**Hamming**\" ([hamming distance](https://en.wikipedia.org/wiki/Hamming_distance)) for binary vectors.\n", + "Here, we use \"brute_force\" (exact KNN) with inner product." + ] + }, + { + "cell_type": "code", + "execution_count": 436, + "metadata": { + "id": "5d8267ad-a81b-4278-bd3a-85b9eae81258" + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "config = {\n", + "    \"configuration\": {\n", + "        \"id\": {\n", + "            \"type\": \"keyword\",\n", + "            \"id\": True\n", + "        },\n", + "        \"parent_id\": {\n", + "            \"type\": \"keyword\",\n", + "        },\n", + "        \"url\": {\n", + "            \"type\": \"keyword\"\n", + "        },\n", + "\n", + "        \"header\": {\n", + "            \"type\": \"keyword\"\n", + "        },\n", + "        \"keywords\": {\n", + "            \"type\": \"keyword\"\n", + "        },\n", + "        \"text\": {\n", + "            \"type\": \"keyword\"\n", + "        },\n", + "        \"header keywords\": {\n", + "            \"type\": \"keyword\"\n", + "        },\n", + "\n", + "        \"embedded_sentence\": {\n", + "            \"type\": \"dense_vector\",\n", + "            \"dim\": 384,\n", + "            \"index_type\": \"brute_force\",\n", + "        }\n", + "    }\n", + "}\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7RTDkUsr3ead" + }, + "source": [ + "## 4. Create Collection\n", + "Hyperspace stores data in collections, which are sub-repositories in the cloud. Each search then operates within a specific collection. 
To create a collection, use the command **\"create_collection(schema_filename, collection_name)\"**. In this notebook the schema is passed directly as the `config` dictionary." + ] + }, + { + "cell_type": "code", + "execution_count": 437, + "metadata": { + "id": "6deb4ad5-3622-4557-a6ee-3f85c57bc08a", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "c758f4a2-4ef5-414c-e1af-55505460b09f" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'collections': {'Hyperspace_RAG': {'creation_time': '2024-07-07T14:34:32Z',\n", + " 'size': 0}}}" + ] + }, + "metadata": {}, + "execution_count": 437 + } + ], + "source": [ + "collection_name = 'Hyperspace_RAG'\n", + "# Delete any existing collection with the same name so the notebook can be re-run\n", + "if collection_name in hyperspace_client.collections_info()[\"collections\"]:\n", + "    hyperspace_client.delete_collection(collection_name)\n", + "hyperspace_client.create_collection(config, collection_name)\n", + "hyperspace_client.collections_info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SoAn8EuqqmPW" + }, + "source": [ + "## 5. Ingest data\n", + "\n", + "We ingest the dataset into the Hyperspace database in batches of 500 documents. The batch size is controlled by the user and can be increased to reduce ingestion time. Batches are uploaded using the command **add_batch(batch, collection_name)**.\n", + "If no id is provided for a document, Hyperspace will randomly assign one.\n", + "\n", + "Here, we assign an id to each document by adding an \"id\" field to the data. If the id field is declared in the data schema config file, this step is mandatory." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 438, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "72d7a7c9-00c3-488f-bd13-707d9ee1d601", + "outputId": "26c45e4c-6bc9-4222-b86f-f8974d9c0b1f", + "scrolled": true + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "0 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "500 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n", + "748 {'code': 200, 'message': 'Batch successfully added', 'status': 'OK'}\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'code': 200, 'message': 'Dataset committed successfully', 'status': 'OK'}" + ] + }, + "metadata": {}, + "execution_count": 438 + } + ], + "source": [ + "import math\n", + "\n", + "BATCH_SIZE = 500\n", + "\n", + "batch = []\n", + "for i, doc in enumerate(data):\n", + "    doc['id'] = str(i)\n", + "    batch.append(doc)\n", + "\n", + "    # Flush whenever i is a multiple of BATCH_SIZE (the first flush happens at i == 0)\n", + "    if i % BATCH_SIZE == 0:\n", + "        response = hyperspace_client.add_batch(batch, collection_name)\n", + "        batch.clear()\n", + "        print(i, response)\n", + "# Upload the remaining documents, then commit the collection\n", + "response = hyperspace_client.add_batch(batch, collection_name)\n", + "batch.clear()\n", + "print(i, response)\n", + "hyperspace_client.commit(collection_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z4ULwaqx-_1z" + }, + "source": [ + "Let us check the collection status before we continue" + ] + }, + { + "cell_type": "code", + "execution_count": 334, + "metadata": { + "id": "4d543fe8-8572-4217-9ee3-61066c3cfab2", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "d0d21cf6-d4c6-4460-f29c-f86b8af406bf" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'collections': {'Hyperspace_RAG': {'creation_time': '2024-07-07T13:57:52Z',\n", + " 'size': 749}}}" + ] + }, + "metadata": {}, + "execution_count": 334 + } + ], + "source": [ + 
"hyperspace_client.collections_info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "42F0n4dLs0sq" + }, + "source": [ + "## 6. Running The Query\n", + "We will now use the database we created to build a chatbot over the Hyperspace documentation. The method is:\n", + "\n", + "1. Receive the user query\n", + "2. Match the user query with relevant documents from the documentation\n", + "3. Inject the results into the LLM prompt\n", + "4. Use the LLM chatbot" + ] + }, + { + "cell_type": "code", + "source": [ + "# Score function that Hyperspace evaluates for each candidate document\n", + "def rag_query(params, doc):\n", + "    score = 0.0\n", + "    boost = 0.0\n", + "    if match('keywords') :\n", + "        boost = 1.0\n", + "        score = rarity_sum('keywords')/100\n", + "    score_header = rarity_sum(\"header keywords\")/100\n", + "    return score_header + boost * distance('embedded_sentence', min_score=0.4) + score\n", + "\n", + "hyperspace_client.set_function(rag_query, collection_name = collection_name , function_name='rag_query')" + ], + "metadata": { + "id": "Jz_xw71j93ht", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e13dd5a5-fa73-449a-9f45-5d987c6bb052" + }, + "execution_count": 450, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "{'code': 200, 'message': 'Function was set successfully', 'status': 'OK'}" + ] + }, + "metadata": {}, + "execution_count": 450 + } + ] + }, + { + "cell_type": "code", + "source": [ + "def get_extra_text(text):\n", + "    params = {'keywords': lemmetize_text(text),\n", + "              \"header keywords\": lemmetize_text(text),\n", + "              \"embedded_sentence\" : list([float(x) for x in embedding_model.encode(text)])}\n", + "    results = hyperspace_client.search(\n", + "        {'params': params},\n", + "        function_name='rag_query',\n", + "        size=10,\n", + "        fields = [\"full_text\",\"parent_id\"],\n", + "        collection_name=collection_name)\n", + "    parent_ids = []\n", + "    texts = []\n", + "    urls = []\n", + "    scores = []\n", + "    final_results = []\n", + "    if results['hits'][\"total\"][\"value\"] 
> 0:\n", + "        for res in results['hits']['hits']:\n", + "            if res['_score'] == 0.0:\n", + "                continue\n", + "            result = hyperspace_client.get_document(collection_name, res[\"_id\"])\n", + "\n", + "            # Keep at most one passage per parent document\n", + "            if result[\"parent_id\"] in parent_ids:\n", + "                continue\n", + "            if len(texts) > 2:\n", + "                break\n", + "            scores.append(res[\"_score\"])\n", + "            texts.append(result[\"full_text\"])\n", + "            urls.append(result[\"url\"])\n", + "            parent_ids.append(result[\"parent_id\"])\n", + "            final_results.append(result)\n", + "    return texts, urls, scores, final_results\n" + ], + "metadata": { + "id": "VpQSE6ascmgp" + }, + "execution_count": 457, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Building The Chat" + ], + "metadata": { + "id": "Md72MtryshsS" + } + }, + { + "cell_type": "code", + "source": [ + "pip install transformers torch\n" + ], + "metadata": { + "id": "zfYx-hLcHKbB", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "08f41790-2296-427f-d99a-e2e430355053" + }, + "execution_count": 323, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.41.2)\n", + "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.3.0+cu121)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.15.4)\n", + "Requirement already satisfied: huggingface-hub<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.23.4)\n", + "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.25.2)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)\n", + "Requirement already satisfied: regex!=2019.12.17 in 
/usr/local/lib/python3.10/dist-packages (from transformers) (2024.5.15)\n", + "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.31.0)\n", + "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.19.1)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.3)\n", + "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.4)\n", + "Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.12.1)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.3)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)\n", + "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2023.6.0)\n", + "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch) (8.9.2.26)\n", + "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.3.1)\n", + "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch) (11.0.2.54)\n", + "Requirement already satisfied: 
nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch) (10.3.2.106)\n", + "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch) (11.4.5.107)\n", + "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.0.106)\n", + "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch) (2.20.5)\n", + "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch) (12.1.105)\n", + "Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch) (2.3.0)\n", + "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch) (12.5.82)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.5)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.7)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.6.2)\n", + "Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "\n", + "model_name = \"microsoft/DialoGPT-small\"\n", + "model = AutoModelForCausalLM.from_pretrained(model_name)\n", + "tokenizer = 
AutoTokenizer.from_pretrained(model_name)\n", + "\n", + "def chat(prompt):\n", + "    inputs = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors=\"pt\")\n", + "    outputs = model.generate(inputs, max_length=150, pad_token_id=tokenizer.eos_token_id)\n", + "    text = tokenizer.decode(outputs[0], skip_special_tokens=True)\n", + "\n", + "    return text\n", + "\n", + "\n", + "def rag_chat(query_text):\n", + "    texts, urls, scores, final_results = get_extra_text(query_text)\n", + "    texts = texts[:2]\n", + "\n", + "    if len(texts) == 0:\n", + "        print(\"Sorry, I cannot answer that\")\n", + "        return\n", + "    for i, text in enumerate(texts):\n", + "        if scores[i] == 0:\n", + "            print(\"Sorry, I cannot answer that\")\n", + "            return\n", + "\n", + "        # Prepend the retrieved passage to the user question\n", + "        query = text + \" \" + query_text\n", + "        # if scores[i] > 0.65:\n", + "        #     break\n", + "        response = chat(query)\n", + "        response = response.replace(query_text, \" \")\n", + "        print(response, scores[i])\n", + "        print(\" for more info, see\", urls[i])\n", + "" + ], + "metadata": { + "collapsed": true, + "id": "dfjmCdgr_Hpd" + }, + "execution_count": 463, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Let's first demonstrate how Hyperspace filtering handles a question that is unrelated to the documentation:" + ], + "metadata": { + "id": "CnYTNXA5lJ-3" + } + }, + { + "cell_type": "code", + "source": [ + "query_text = \"How are you?\"\n", + "print(\"Chat response\")\n", + "print(\"=\"*20)\n", + "\n", + "print(chat(query_text))\n", + "\n", + "print(\"\\nRAG Chat response\")\n", + "print(\"=\"*20)\n", + "\n", + "rag_chat(query_text)\n" + ], + "metadata": { + "id": "ysJ7uA_oqQ8R", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "c3915a23-ae87-4f8c-cb39-c93b1ba9288d" + }, + "execution_count": 468, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Chat response\n", + "====================\n", + "How are you?I'm here\n", + "\n", + "RAG Chat response\n", + "====================\n", + "Sorry, I cannot answer 
that\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "As a second example, let's ask a generic question:" + ], + "metadata": { + "id": "yQi38Sz9lpLe" + } + }, + { + "cell_type": "code", + "source": [ + "query_text = \"How do i upload data ?\"\n", + "\n", + "# Basic chat - no external context\n", + "print(\"Chat response\")\n", + "print(\"=\"*20)\n", + "display(chat(query_text))\n", + "print(\"\\n\")\n", + "\n", + "# RAG search - use website context\n", + "print(\"RAG Chat response\")\n", + "print(\"=\"*20)\n", + "\n", + "rag_chat(query_text)" + ], + "metadata": { + "id": "VJZKnY9ffRjx", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 209 + }, + "outputId": "57fec958-112d-499b-a741-1b6686b04b9e" + }, + "execution_count": 469, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Chat response\n", + "====================\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "\"How do i upload data?You can't.\"" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "\n", + "RAG Chat response\n", + "====================\n", + "data points of all types are uploaded into hyperspace collection as documents and stored according to the identifier you specify during upload, as described below. data upload can be performed in batches or by upload a single vector, as follows. How do i upload data?What is the data point of the data? 1.1461771726608276\n", + " for more info, see https://docs.hyper-space.io/hyperspace-docs/flows/setting-up/uploading-data-to-a-collection\n", + "collection name– specifies the name of the collection into which to load the document. How do i upload data?You can't upload data to a collection name. 
1.097333550453186\n", + " for more info, see https://docs.hyper-space.io/hyperspace-docs/flows/data-collections/uploading-data\n" + ] + } + ] + } + ] +} \ No newline at end of file diff --git a/DataSets/RAG/Readme.md b/DataSets/RAG/Readme.md new file mode 100644 index 0000000..520f8c8 --- /dev/null +++ b/DataSets/RAG/Readme.md @@ -0,0 +1,4 @@ +# RAG Using Hyperspace +Retrieval-Augmented Generation (RAG) is a method to improve the quality and accuracy of generated responses by combining retrieval-based methods with generative models. +The data was retrieved from [Hyperspace documentation](https://docs.hyper-space.io/hyperspace-docs). +For more info, see the [Overview Page](https://docs.hyper-space.io/hyperspace-docs/getting-started/overview).
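
The retrieval half of this flow can be sketched without a database connection. The snippet below is a minimal illustration, not the Hyperspace API: it assumes search hits arrive as plain dictionaries with hypothetical `score`, `parent_id` and `full_text` keys, keeps at most one passage per parent document, and prepends the selected passages to the user question before it is handed to a generative model. The helper names `select_context` and `build_prompt` are made up for this example.

```python
def select_context(hits, max_passages=2):
    """Pick the highest-scoring hit per parent document, up to max_passages."""
    selected = []
    seen_parents = set()
    # Visit hits from best to worst score.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        # Skip hits that did not match at all or whose parent is already used.
        if hit["score"] <= 0.0 or hit["parent_id"] in seen_parents:
            continue
        seen_parents.add(hit["parent_id"])
        selected.append(hit)
        if len(selected) == max_passages:
            break
    return selected


def build_prompt(question, passages):
    """Prepend the retrieved passages to the question."""
    context = " ".join(p["full_text"] for p in passages)
    return f"{context} {question}".strip()


# Hypothetical hits, shaped like the assumptions stated above.
example_hits = [
    {"score": 1.1, "parent_id": "7", "full_text": "Data is uploaded in batches."},
    {"score": 0.9, "parent_id": "7", "full_text": "Batches hold many documents."},
    {"score": 0.5, "parent_id": "3", "full_text": "Each collection has a name."},
    {"score": 0.0, "parent_id": "9", "full_text": "Unrelated text."},
]

passages = select_context(example_hits)
prompt = build_prompt("How do I upload data?", passages)
print(prompt)
```

If no hit survives the filter, the caller can decline to answer instead of prompting the model without context.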