diff --git a/docs/articles/Vector-Indexes.md b/docs/articles/Vector-Indexes.md index 64c1fd7d..1985384e 100644 --- a/docs/articles/Vector-Indexes.md +++ b/docs/articles/Vector-Indexes.md @@ -8,7 +8,7 @@ Running AI applications depends on vectors, often called [embeddings](https://su ![What is a vector index](../assets/use_cases/vector_indexes/vector_index1.png) -Vector indexing, by creating groups of matching elements, speeds up similarity search - which calculate vector closeness using metrics like Euclidean or Jacobian distance. (In small datasets where accuracy is more important than efficiency, you can use K-Nearest Neighbors to pinpoint your query's closest near neighbors. As datasets get bigger and efficiency becomes an issue, an [Approximate Nearest Neighbor](https://superlinked.com/vectorhub/building-blocks/vector-search/nearest-neighbor-algorithms) (ANN) approach will *very quickly* return accurate-enough results.) +Vector indexing, by creating groups of matching elements, speeds up similarity search - which calculates vector closeness using metrics like Euclidean or Jaccard distance. (In small datasets where accuracy is more important than efficiency, you can use K-Nearest Neighbors to pinpoint your query's closest near neighbors. As datasets get bigger and efficiency becomes an issue, an [Approximate Nearest Neighbor](https://superlinked.com/vectorhub/building-blocks/vector-search/nearest-neighbor-algorithms) (ANN) approach will *very quickly* return accurate-enough results.) Vector indexes are crucial to efficient, relevant, and accurate search in various common applications, including Retrieval Augmented Generation ([RAG](https://superlinked.com/vectorhub/articles/advanced-retrieval-augmented-generation)), [semantic search in image databases](https://superlinked.com/vectorhub/articles/retrieval-from-image-text-modalities) (e.g., in smartphones), large text documents, advanced e-commerce websites, and so on. @@ -77,9 +77,9 @@ IVF_SQ makes sense when dealing with medium to large datasets where memory effic ### DiskANN -Most ANN algorithms - including those above - are designed for in-memory computation. But when you're dealing with *big data*, in-memory computation can be a bottleneck. Disk-based ANN ([DiskANN](https://suhasjs.github.io/files/diskann_neurips19.pdf)) is built to leverage Solid-State Drives' (SSDs') large memory and high-speed capabilities. DiskANN indexes vectors using the Vamana algorithm, a graph-based indexing structure that minimizes the number of sequential disk reads required during, by creating a graph with a smaller search "diameter" - the max distance between any two nodes (representing vectors), measured as the least number of hops (edges) to get from one to the other. This makes the search process more efficient, especially for the kind of large-scale datasets that are stored on SSDs. +Most ANN algorithms - including those above - are designed for in-memory computation. But when you're dealing with *big data*, in-memory computation can be a bottleneck. Disk-based ANN ([DiskANN](https://suhasjs.github.io/files/diskann_neurips19.pdf)) is built to leverage Solid-State Drives' (SSDs') large memory and high-speed capabilities. 
DiskANN indexes vectors using the Vamana algorithm, a graph-based indexing structure that minimizes the number of sequential disk reads required, by creating a graph with a smaller search "diameter" - the max distance between any two nodes (representing vectors), measured as the least number of hops (edges) to get from one to the other. This makes the search process more efficient, especially for the kind of large-scale datasets that are stored on SSDs. -By using a SSD to store and search its graph index, DiskANN can be cost-effective, scalable, and efficient. +By using an SSD to store and search its graph index, DiskANN can be cost-effective, scalable, and efficient. ### SPTAG-based Approximate Nearest Neighbor Search (SPANN) diff --git a/docs/articles/advanced_retrieval_augmented_generation.md b/docs/articles/advanced_retrieval_augmented_generation.md index e489e0bd..53fb93a5 100644 --- a/docs/articles/advanced_retrieval_augmented_generation.md +++ b/docs/articles/advanced_retrieval_augmented_generation.md @@ -109,7 +109,7 @@ embed_model = HuggingFaceEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v Settings.embed_model = embed_model ``` -Specifically, we selected "mixedbread-ai/mxbai-embed-large-v1", a model that strikes a balance between retrieval accuracy and computational efficiency, according to recent performance evaluations in the Huggingface [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). +Specifically, we selected "mixedbread-ai/mxbai-embed-large-v1", a model that strikes a balance between retrieval accuracy and computational efficiency, according to recent performance evaluations in the Hugging Face [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). ### Indexing @@ -164,7 +164,7 @@ Another way to enhance retrieval accuracy is through [hybrid search](https://sup This hybrid approach captures both the semantic richness of embeddings and the direct match precision of keyword search, leading to improved relevance in retrieved documents. -So far we've seen how careful preretrieval (data preparation, chunking, embedding, indexing) and retrieval (hybrid search) can help improve RAG retrieval results. What about _after_ we've done our retrieval? +So far we've seen how careful pre-retrieval (data preparation, chunking, embedding, indexing) and retrieval (hybrid search) can help improve RAG retrieval results. What about _after_ we've done our retrieval? ## Post-retrieval @@ -259,7 +259,7 @@ display(Markdown(f"{response}")) "Based on the context provided, the dangers of hallucinations in the context of machine learning and natural language processing are that they can lead to inaccurate or incorrect results, particularly in customer support and content creation. These hallucinations, which are false pieces of information generated by a generative model, can have disastrous consequences in use cases where there's more at stake than simple internet searches. In short, machine hallucinations can be dangerous because they can lead to false information being presented as fact, which can have serious consequences in real-world applications." -Our advanced RAG pipeline result appears to be relatively precise, avoid hallucinations, and effectively integrate retrieved context into generated output. Note: generation is not a fully deterministic process, so if you run this code yourself, you may receive slightly different output. 
+Our advanced RAG pipeline result appears to be relatively precise, avoids hallucinations, and effectively integrates retrieved context into generated output. Note: generation is not a fully deterministic process, so if you run this code yourself, you may receive slightly different output. ## Conclusion diff --git a/docs/articles/airbnb-search-benchmarking.md b/docs/articles/airbnb-search-benchmarking.md index ec3aa078..e1ddd861 100644 --- a/docs/articles/airbnb-search-benchmarking.md +++ b/docs/articles/airbnb-search-benchmarking.md @@ -2,7 +2,7 @@ ## Introduction & Motivation -Imagine you are searching for the ideal Airbnb for a weekend getaway. You open the website and adjust sliders and checkboxes but still encounter lists of options that nearly match your need but never are never truly what you are looking for. Although it is straightforward to specify a filter such as: "price less than two hundred dollars", rigid tags and thresholds for more complex search queries, make it a much more difficult task to figure out what the user is looking for. +Imagine you are searching for the ideal Airbnb for a weekend getaway. You open the website and adjust sliders and checkboxes but still encounter lists of options that nearly match your need but are never truly what you are looking for. Although it is straightforward to specify a filter such as "price less than two hundred dollars", rigid tags and thresholds make it much more difficult to figure out what the user is looking for in more complex search queries. Converting a mental image of a luxury apartment near the city's finest cafés or an affordable business-ready suite with good reviews into numerical filters often proves frustrating. Natural language is inherently unstructured and must be transformed into numerical representations to uncover user intent. At the same time, the rich structured data associated with each listing must also be encoded numerically to reveal relationships between location, comfort, price, and reviews. @@ -49,7 +49,7 @@ def create_text_description(row): """Create a unified text description from listing attributes.""" text = f"{row['listing_name']} is a {row['accommodation_type']} " text += f"For {row['max_guests']} guests. " - text += f"It costs ${row['price']} per night with a rating of {row['rating']} with {row['review_count']} nymber of reviews. " + text += f"It costs ${row['price']} per night with a rating of {row['rating']} with {row['review_count']} number of reviews. " text += f"Description: {row['description']} " text += f"Amenities include: {', '.join(row['amenities_list'])}" return text @@ -259,7 +259,7 @@ If neither of the two approaches produces satisfactory results on structured dat
Figure 10: Hybrid search results for "luxury places with good reviews"
-The results indicate that hybrid search effectively balances semantic understanding with keyword precision. By combining vector search's ability to grasp concepts like "luxury" with BM25's strength in finding exact term matches, the hybrid approach delivers more comprehensive results. However, the fundamental limitations remain: the system still cannot reliably interpret numerical constraints (Figure 11) or make sophisticated judgments about what constitutes "good reviews" in terms of both rating quality and quantity. Additionaly, finding the optimal alpha value for the weighted combination requires careful tuning and may need adjustment based on specific use cases or datasets. Implementing hybrid search also requires maintaining two separate index structures and ensuring proper score normalization and fusion. This suggests that while hybrid search improves upon its component approaches, we need a more advanced solution to truly understand structured data attributes and their relationships. +The results indicate that hybrid search effectively balances semantic understanding with keyword precision. By combining vector search's ability to grasp concepts like "luxury" with BM25's strength in finding exact term matches, the hybrid approach delivers more comprehensive results. However, the fundamental limitations remain: the system still cannot reliably interpret numerical constraints (Figure 11) or make sophisticated judgments about what constitutes "good reviews" in terms of both rating quality and quantity. Additionally, finding the optimal alpha value for the weighted combination requires careful tuning and may need adjustment based on specific use cases or datasets. Implementing hybrid search also requires maintaining two separate index structures and ensuring proper score normalization and fusion. This suggests that while hybrid search improves upon its component approaches, we need a more advanced solution to truly understand structured data attributes and their relationships.
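To make the alpha-weighted fusion concrete, here is a minimal sketch of the kind of score normalization and combination described above. It is not the benchmark's actual code: the `hybrid_scores` helper, the min-max normalization step, and the default alpha value are illustrative assumptions.

```python
def hybrid_scores(vector_scores: dict, bm25_scores: dict, alpha: float = 0.5) -> dict:
    """Fuse per-listing vector and BM25 scores with an alpha-weighted sum."""

    def min_max(scores: dict) -> dict:
        # Normalize each retriever's raw scores to [0, 1] so they mix on the same scale.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

    vec, kw = min_max(vector_scores), min_max(bm25_scores)
    # alpha = 1.0 -> pure vector search; alpha = 0.0 -> pure BM25.
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
```

Ranking is then just a sort over the fused scores, e.g. `sorted(fused.items(), key=lambda kv: kv[1], reverse=True)`, and sweeping alpha over a set of validation queries is one way to do the careful tuning the paragraph above calls for.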
Hybrid Search Results @@ -324,7 +324,7 @@ The cross-encoder reranking results demonstrate a notable improvement in result Most impressively, for the numerical constraints query, the cross-encoder makes progress in understanding specific requirements. Despite the first result exceeding the price constraint (2632 > 2000), the reranking correctly identifies more listings matching the "5 guests" requirement and prioritizes them appropriately. This shows the effectiveness of using cross-encoders, since they re-calculate the similarity between the query and the documents after the initial retrieval based on vector search. In other words, the model can make finer distinctions when examining query-document pairs together rather than separately. However, the cross-encoder still does not perfectly understand all numerical constraints. Additionally, despite the improvements, cross-encoder reranking has significant computational drawbacks. It requires evaluating each query-document pair individually through a transformer-based model, which increases latency and resource requirements. Especially as the candidate pool grows, making the search challenging to scale for large datasets or real-time applications with strict performance requirements. These takeaways suggest that while this approach represents a significant improvement, a more structured approach to handling multi-attribute data could yield better results.
- Cross-Encoder Results for Numerical Query + Cross-Encoder Results for Numerical Query
Figure 15: Cross-encoder results for numerical constraints query
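As a rough sketch of the reranking step discussed above - assuming the `sentence-transformers` `CrossEncoder` API and an MS MARCO reranker chosen for illustration; the article's actual model and code may differ:

```python
from sentence_transformers import CrossEncoder

# Illustrative reranker choice; any cross-encoder checkpoint could be substituted.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Re-score vector-search candidates by reading each (query, document) pair jointly."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per query-document pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

Because every candidate requires a full transformer forward pass, this is typically applied only to a small candidate pool (say, the top 50-100 vector-search hits), which is exactly the latency trade-off described above.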
@@ -349,7 +349,7 @@ During offline indexing, each listing is passed through a BERT model to produce
Figure 17: Colbert Multi-Vector Retrieval
-Here is how we implment the multi-vecotr search by ColBERT: +Here is how we implement multi-vector search with ColBERT: ```python class ColBERTSearch: @@ -462,7 +462,7 @@ At query time, Superlinked uses a large language model to interpret the user’s To ensure that non-negotiable constraints are respected, Superlinked first applies hard filters to eliminate listings that do not meet specific criteria, such as guest capacity or maximum price. Only the listings that pass these filters are considered in the final ranking stage. The system then performs a weighted nearest neighbors search, comparing the multi-attribute embeddings of these candidates against the weighted query representation to rank them by overall relevance. This combination of modality-aware encoding, constraint filtering, and weighted ranking allows Superlinked to produce accurate, context-aware results that reflect both the structure of the underlying data and the nuanced preferences of the user. -Here is how we implment the Superlinked for our Airbnb search: +Here is how we implement Superlinked for our Airbnb search: We first need to define a schema that captures the structure of our dataset. The schema outlines both the fields we'll use for embedding and those we'll use for filtering: diff --git a/docs/articles/custom_retriever_with_llamaindex.md b/docs/articles/custom_retriever_with_llamaindex.md index c87a2a94..9efd601b 100644 --- a/docs/articles/custom_retriever_with_llamaindex.md +++ b/docs/articles/custom_retriever_with_llamaindex.md @@ -6,13 +6,13 @@ The goal was simple: take Superlinked's core strengths in handling complex, mult Superlinked excels at creating sophisticated vector spaces through its mixture of encoders approach, allowing you to combine multiple embedding models, apply custom weighting schemes, and handle complex multi-modal data with ease. LlamaIndex, on the other hand, provides the robust infrastructure for RAG applications, from document processing and node management to query engines and response synthesis. -As retrieval-augmented generation (RAG) systems continue to evolve, the need for **custom, domain-specific retrievers** is becoming more and more obvious. Sure, traditional vector databases are great for basic similarity search but the moment you throw in more complex, context-heavy queries, they start to fall short. Especially when you're working with real-world data that needs richer filtering or semantic understanding. If you are not sure why you would need mixture-of-encoders as part of you RAG pipeline, feel free to [talk to us](https://links.superlinked.com/get_demo_langchain). +As retrieval-augmented generation (RAG) systems continue to evolve, the need for **custom, domain-specific retrievers** is becoming more and more obvious. Sure, traditional vector databases are great for basic similarity search but the moment you throw in more complex, context-heavy queries, they start to fall short. Especially when you're working with real-world data that needs richer filtering or semantic understanding. If you are not sure why you would need mixture-of-encoders as part of your RAG pipeline, feel free to [talk to us](https://links.superlinked.com/get_demo_langchain). 
-You can follow allong this guide in a colab notebook: +You can follow along this guide in a colab notebook: - [Google Colab of this guide](https://colab.research.google.com/github/superlinked/VectorHub/blob/main/docs/assets/use_cases/custom_retriever_with_llamaindex/superlinked_custom_retriever_with_llamaindex.ipynb) If you prefer to start using Superlinked's retriever right away you can have a look at the full implementation with Llamaindex: -- [Link to full offical integration on Llamahub](https://links.superlinked.com/llama_hub_in_article) +- [Link to full official integration on Llamahub](https://links.superlinked.com/llama_hub_in_article) In this guide, we'll show you our approach for building a custom LlamaIndex retriever that leverages Superlinked's mixture of encoders architecture. We've refined this approach through numerous production deployments, and now we're making it available for the broader developer community. @@ -66,7 +66,7 @@ class BaseRetriever: pass ``` -The moat here is the presence of the Retrieval Protocol from the LlamaIndex. As this "retrieval protocol" makes it easy to plug in different backends or strategies without having to touch the rest of your system. Let’s break it down on what’s exactly is going on : +Any custom retriever only needs to implement one core method. The "retrieval protocol" from LlamaIndex makes it easy to plug in different backends or strategies without having to touch the rest of your system. Let’s break down exactly what is going on: 1. **Input: `QueryBundle`** @@ -74,7 +74,7 @@ The moat here is the presence of the Retrieval Protocol from the LlamaIndex. As 2. **Output: `List[NodeWithScore]`** - The retriever returns a list of nodes—these are your chunks of content, documents, or data entries—each paired with a relevance score. The higher the score, the more relevant the node is to the query. This list is what gets passed downstream to the LLM or other post-processing steps. As in our case, we are plugging on the + The retriever returns a list of nodes—these are your chunks of content, documents, or data entries—each paired with a relevance score. The higher the score, the more relevant the node is to the query. This list is what gets passed downstream to the LLM or other post-processing steps. 3. **Processing: Backend-Agnostic** @@ -301,7 +301,7 @@ print("✅ SuperlinkedSteamGamesRetriever class defined successfully!") ### Part 3: Superlinked Schema Definition and Setup -Now is the time when we go a bit deep dive on certain thing. Starting with schema desgin, Now in Superlinked, the schema isn’t just about defining data types— it’s more like a formal definition between our data and the underlying vector compute engine. This schema determines how our data gets parsed, indexed, and queried — so getting it right is crucial. +Now is the time when we go a bit deep dive on certain things. Starting with schema design, now in Superlinked, the schema isn’t just about defining data types— it’s more like a formal definition between our data and the underlying vector compute engine. This schema determines how our data gets parsed, indexed, and queried — so getting it right is crucial. In our `SuperlinkedSteamGamesRetriever`, the schema is defined like this: @@ -321,10 +321,10 @@ class GameSchema(sl.Schema): self.game = GameSchema() ``` -Let’s break down what some of these elements actually _does_: +Let’s break down what some of these elements actually _do_: - **`sl.IdField` (→ `game_number`)** - Think of this as our primary key. 
It gives each game a unique identity and allows Superlinked to index and retrieve items efficiently, I mean basically it’s about how we are telling the Superlinked to segregate the unique identify of the games, and btw it’s especially important when you're dealing with thousands of records. + Think of this as our primary key. It gives each game a unique identity and allows Superlinked to index and retrieve items efficiently. I mean basically it’s about how we are telling Superlinked to segregate the unique identity of the games, and btw it’s especially important when you're dealing with thousands of records. - **`sl.String` and `sl.Float`** Now these aren't just type hints—they enable Superlinked to optimize operations differently depending on the field. For instance, `sl.String` fields can be embedded and compared semantically, while `sl.Float` fields can support numeric filtering or sorting. - **`combined_text`** @@ -353,7 +353,7 @@ Why do this? Because users don’t just search by genre or name—they describe To power the semantic search over our Steam games dataset, I made two intentional design choices that balance performance, simplicity, and flexibility. -First, for the embedding model, I selected `all-mpnet-base-v2` from the Sentence Transformers library. This model produces 768-dimensional embeddings that strike a solid middle ground: they're expressive enough to capture rich semantic meaning, yet lightweight enough to be fast in production. I mean it’s a reliable general-purpose model, known to perform well across diverse text types — which matters a lot when your data ranges from short genre tags to long-form game descriptions. In our case, i needed a model that wouldn’t choke on either end of that spectrum, and `all-mpnet-base-v2` handled it cleanly. +First, for the embedding model, I selected `all-mpnet-base-v2` from the Sentence Transformers library. This model produces 768-dimensional embeddings that strike a solid middle ground: they're expressive enough to capture rich semantic meaning, yet lightweight enough to be fast in production. I mean it’s a reliable general-purpose model, known to perform well across diverse text types — which matters a lot when your data ranges from short genre tags to long-form game descriptions. In our case, I needed a model that wouldn’t choke on either end of that spectrum, and `all-mpnet-base-v2` handled it cleanly. Next, although Superlinked supports multi-space indexing — where you can combine multiple fields or even modalities (like text + images) — I deliberately kept things simple with a single `TextSimilaritySpace`. I would have included the `RecencySpace` in here too but I don’t have the information on the release date for the games. But just to put this out here, if we have the release date information, I could plug in the RecencySpace here, and I can even sort the games with the `TextSimilaritySpace` along with the Recency of the games. Cool.. @@ -389,7 +389,7 @@ Next, although Superlinked supports multi-space indexing — where you can combi At the heart of our retrieval system is a streamlined pipeline built for both clarity and speed. I start with the `DataFrameParser`, which serves as our ETL layer. It ensures that each field in the dataset is correctly typed and consistently mapped to our schema — essentially acting as the contract between our raw CSV data and the Superlinked indexing layer. -Once the data is structured, I feed it into an `InMemorySource`, which is ideal for datasets that comfortably fit in memory . 
This approach keeps everything lightning-fast without introducing storage overhead or network latency. Finally, the queries are handled by an `InMemoryExecutor`, which is optimised for sub-millisecond latency. This is what makes Superlinked suitable for real-time applications like interactive recommendation systems, where speed directly impacts user experience. +Once the data is structured, I feed it into an `InMemorySource`, which is ideal for datasets that comfortably fit in memory. This approach keeps everything lightning-fast without introducing storage overhead or network latency. Finally, the queries are handled by an `InMemoryExecutor`, which is optimized for sub-millisecond latency. This is what makes Superlinked suitable for real-time applications like interactive recommendation systems, where speed directly impacts user experience. ### Part 6: The Retrieval Engine @@ -466,7 +466,7 @@ Now once we receive the results from Superlinked, I transformed them into a form Next, I make sure that **all original fields** from the dataset, including things like genre, pricing, and game details - are retained in the metadata. This is crucial because downstream processes might want to filter, display, or rank results based on this information. I don’t want to lose any useful context once we start working with the retrieved nodes. -Finally, I apply a lightweight **score normalisation** strategy. Instead of relying on raw similarity scores, we assign scores based on the position of the result in the ranked list. This keeps things simple and consistent. The top result always has the highest score, and the rest follow in descending order. It's not fancy, but it gives us a stable and interpretable scoring system that works well across different queries. +Finally, I apply a lightweight **score normalization** strategy. Instead of relying on raw similarity scores, we assign scores based on the position of the result in the ranked list. This keeps things simple and consistent. The top result always has the highest score, and the rest follow in descending order. It's not fancy, but it gives us a stable and interpretable scoring system that works well across different queries. ## Show Time: Executing the pipeline diff --git a/docs/articles/ecomm-recys.md b/docs/articles/ecomm-recys.md index 0aab808b..882e9c6d 100644 --- a/docs/articles/ecomm-recys.md +++ b/docs/articles/ecomm-recys.md @@ -2,7 +2,7 @@ ### - a [notebook](https://github.com/superlinked/superlinked/blob/main/notebook/recommendations_e_commerce.ipynb) article -Pioneered by the likes of Google and AirBnB, vector embedding has revolutionized recommendation systems by enabling more accuracy and personalization than traditional methods. By representing users and items as high-dimensional vectors in a latent space, embeddings capture similarities and relationships between users and items, and can therefore be used to provide more relevant recommendations. Their compact and dense nature facilitates efficient computation and scalability, which is vital for real-time and large-scale scenarios. +Pioneered by the likes of Google and Airbnb, vector embedding has revolutionized recommendation systems by enabling more accuracy and personalization than traditional methods. By representing users and items as high-dimensional vectors in a latent space, embeddings capture similarities and relationships between users and items, and can therefore be used to provide more relevant recommendations. 
Their compact and dense nature facilitates efficient computation and scalability, which is vital for real-time and large-scale scenarios. In this article, we'll walk you through how to use the Superlinked library to create an effective RecSys - specifically an e-commerce site selling mainly clothing, that can be updated in real-time employing feedback loops defined by user interactions. @@ -424,7 +424,7 @@ With very little weight placed on Spaces affected by events, we observe a change But if we weight the event-affected Spaces more heavily, we surface completely novel items in our recommendations list. ```python -# with larger weight on the the event-affected spaces, more totally new items appear in the TOP10 +# with larger weight on the event-affected spaces, more totally new items appear in the TOP10 event_weighted_result = app_with_events.query( personalised_query, user_id="user_1", diff --git a/docs/articles/hybrid_search_&_rerank_rag.md b/docs/articles/hybrid_search_&_rerank_rag.md index 64e81038..d0d3be6e 100644 --- a/docs/articles/hybrid_search_&_rerank_rag.md +++ b/docs/articles/hybrid_search_&_rerank_rag.md @@ -168,7 +168,7 @@ vectorstore = Chroma.from_documents(chunks, embeddings) Now, we build the keyword and semantic retrievers separately. For keyword matching, we use the [BM25 retriever](https://python.langchain.com/docs/integrations/retrievers/bm25) from Langchain. By setting k to 3, we’re asking the retriever to return the 3 most relevant documents or vectors from the vector store. ```python -vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 3}) +vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) keyword_retriever = BM25Retriever.from_documents(chunks) keyword_retriever.k = 3 ``` @@ -176,7 +176,7 @@ keyword_retriever.k = 3 Now, we create the ensemble retriever, which is a weighted combination of the keyword and semantic retrievers above. ```python -ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver, +ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retriever, keyword_retriever], weights=[0.3, 0.7]) ``` diff --git a/docs/articles/kg_ontologies.md b/docs/articles/kg_ontologies.md index 948c9701..c70bb37b 100644 --- a/docs/articles/kg_ontologies.md +++ b/docs/articles/kg_ontologies.md @@ -46,7 +46,7 @@ One of the keys to a knowledge graph’s power is its ontology. **The ontology m A KG ontology is a formal and abstract representation of the graph's data. It typically contains the rules, axioms, and constraints governing the entities, attributes, and relationships (including complex ones like inheritance and polymorphism) within the graph. A good ontology has a clear, semantic framework that makes the data's logical structure clear and easier to understand. It models concepts as they exist in the real world, using familiar business language. This helps both humans and machines in making their searches more precise. It also permits the inference of new knowledge from existing facts through logical reasoning. -The ontological capabilities of knowledge graphs enable them to understand your organization's fundamental concepts. KGs, by treating metadata as data, permit you to seamlessly connect those concepts to real data. As organizations' futures increasingly come to depend on providing AI with a clear understanding of organizational semantics and data, KGs are becoming more and more indispensible. 
+The ontological capabilities of knowledge graphs enable them to understand your organization's fundamental concepts. KGs, by treating metadata as data, permit you to seamlessly connect those concepts to real data. As organizations' futures increasingly come to depend on providing AI with a clear understanding of organizational semantics and data, KGs are becoming more and more indispensable. ## Using KGs for organization survival @@ -162,7 +162,7 @@ Take for example, "trade". Each department in your organization takes on the tas Embed the definitions salient to your own particular business into your own version of schema.org, sticking as close as possible to the actual, working semantics of the real people in your business. In a large organization, there’ll be between 5000-100000 separate apps and databases, each with 1000s of different tables, each table with 100s of different columns - in sum, a vast complex of data. -Each department's data can be made available as a webAPI to your central data department, and represented in JSON-LD. Your central data department can then publish the data in each application or database in JSON-LD. As long as each deparment has referenced the well-defined semantics in your schema.org, you will have a good, queryable organizational KG. +Each department's data can be made available as a webAPI to your central data department, and represented in JSON-LD. Your central data department can then publish the data in each application or database in JSON-LD. As long as each department has referenced the well-defined semantics in your schema.org, you will have a good, queryable organizational KG. Your organizational KG should not be treated as a new database. Rather, for each use case, you need only download the chunk of the graph (pre-connected, pre-integrated) that you want. In other words, “within your organization, let a thousand knowledge graphs bloom.” diff --git a/docs/articles/readme.md b/docs/articles/readme.md index 69ee32c6..f605abc4 100644 --- a/docs/articles/readme.md +++ b/docs/articles/readme.md @@ -19,12 +19,12 @@ In the blog section, we collate examples of these use cases and case studies fro - [Representation Learning on Graph Structured Data](https://hub.superlinked.com/representation-learning-on-graph-structured-data) - [Improving RAG performance with Knowledge Graphs](use_cases/knowledge_graphs.md) - [Retrieval from Image and Text Modalities](use_cases/retrieval_from_image_and_text.md) -- [Real-time Socal Media Retrieval System](use_cases/social_media_retrieval.md) +- [Real-time Social Media Retrieval System](use_cases/social_media_retrieval.md) - [Evaluating Retrieval Augmented Generation - part 1](use_cases/retrieval_augmented_generation_eval.md) - [A Real-time Retrieval System for Social Media Data](https://superlinked.com/vectorhub/a-real-time-retrieval-system-for-social-media-data) -- [Optimising RAG with Hybrid Search & Rerank](https://superlinked.com/vectorhub/optimizing-rag-with-hybrid-search-and-reranking) +- [Optimizing RAG with Hybrid Search & Rerank](https://superlinked.com/vectorhub/optimizing-rag-with-hybrid-search-and-reranking) - [RecSys for Beginners](https://superlinked.com/vectorhub/recsys-for-beginners) - [An evaluation of RAG Retrieval Chunking Methods](https://superlinked.com/vectorhub/an-evaluation-of-rag-retrieval-chunking-methods) -We are always looking to expand our Use Cases and share the latest thinking. 
So if you've been working on something and would like to share your experiences with the community, you [get in touch and contribute](https://github.com/superlinked/VectorHub). +We are always looking to expand our Use Cases and share the latest thinking. So if you've been working on something and would like to share your experiences with the community, you can [get in touch and contribute](https://github.com/superlinked/VectorHub). diff --git a/docs/articles/retrieval_augmented_generation.md b/docs/articles/retrieval_augmented_generation.md index e09294c7..08fd7699 100644 --- a/docs/articles/retrieval_augmented_generation.md +++ b/docs/articles/retrieval_augmented_generation.md @@ -119,7 +119,7 @@ Answer: Tesla's revenue for Q2 2023 was $1.2 billion. ``` -Despite our model's confident assertion, it turns out that Telsa's February earnings were _not_ the $1.2 billion it claims. In fact, this result is way off. Without external data, we might have believed phi-1.5's result, and made a poorly informed investment decision. +Despite our model's confident assertion, it turns out that Tesla's Q2 2023 revenue was _not_ the $1.2 billion it claims. In fact, this result is way off. Without external data, we might have believed phi-1.5's result, and made a poorly informed investment decision. So how can we fix this? You already know the answer: RAG to the rescue. In order to retrieve relevant context, we need a document to retrieve from in the first place. We will download Tesla's financial report for Q2 2023 from their website. diff --git a/docs/articles/social_media_retrieval.md b/docs/articles/social_media_retrieval.md index 2dac22a6..a354f095 100644 --- a/docs/articles/social_media_retrieval.md +++ b/docs/articles/social_media_retrieval.md @@ -47,7 +47,7 @@ Because LinkedIn posts (or any other social media data) evolve frequently, your ### 1.2. The retrieval client -Our retrieval client is a standard Python module that preprocesses user queries and searches the vector DB for most similar results. Qdrant vector DB lets us decouple the retrieval client from the streaming ingestion pipeline. +Our retrieval client is a standard Python module that preprocesses user queries and searches the vector DB for the most similar results. Qdrant vector DB lets us decouple the retrieval client from the streaming ingestion pipeline. To avoid training-serving skew, it's essential to preprocess the ingested posts and queries in the same way. @@ -60,7 +60,7 @@ Lastly, to better understand and explain the retrieval process for particular qu ## 2. Data -We will ingest 215 LinkedIn posts from [my Linked profile - Paul Iusztin](https://www.linkedin.com/in/pauliusztin/). Though we simulate the post ingestion step using JSON files, the posts themselves are authentic. +We will ingest 215 LinkedIn posts from [my LinkedIn profile - Paul Iusztin](https://www.linkedin.com/in/pauliusztin/). Though we simulate the post ingestion step using JSON files, the posts themselves are authentic. Before diving into the code, let's take a look at an example LinkedIn post to familiarize ourselves with the challenges it introduces ↓ @@ -114,7 +114,7 @@ These constants are used across all components of the retrieval system, ensuring Let's dive into the streaming pipeline, beginning at the top, and working our way to the bottom ↓ -### 4.1. Bytemax flow - starting with ingestion +### 4.1. Bytewax flow - starting with ingestion **The Bytewax flow** transparently conveys all the steps of the streaming pipeline. 
@@ -150,7 +150,7 @@ def build_flow(): ) op.inspect("inspect", stream, print) op.output( - "output", stream, QdrantVectorOutput(vector_size=model.embedding_size) + "output", stream, QdrantVectorOutput(vector_size=embedding_model.embedding_size) ) return flow @@ -220,7 +220,7 @@ class CleanedPost(BaseModel): return cleaned_text ``` -The `from_raw_post` factory method takes an instance of the `RawPost` as input and uses the `clean()` method to clean the text, so that it's compatible with the embedding model. Our cleaning method addresses all embedding-incompatible features highlighted in our `2. Data` section above - e.g., bolded text, emojis, non-ascii characters, etc. +The `from_raw_post` factory method takes an instance of the `RawPost` as input and uses the `clean()` method to clean the text, so that it's compatible with the embedding model. Our cleaning method addresses all embedding-incompatible features highlighted in our `2. Data` section above - e.g., bolded text, emojis, non-ASCII characters, etc. Here's what the cleaned post looks like: @@ -419,11 +419,11 @@ class QdrantVectorDBRetriever: def embed_query(self, query: str) -> list[list[float]]: cleaned_query = CleanedPost.clean(query) chunks = ChunkedPost.chunk(cleaned_query, self._embedding_model) - embdedded_queries = [ + embedded_queries = [ self._embedding_model(chunk, to_list=True) for chunk in chunks ] - return embdedded_queries + return embedded_queries ``` In cases where the query is too large, we divide it into multiple smaller query chunks. We can query Qdrant with each chunk and merge the results. Moreover, chunking can even enhance our search, by broadening it to include posts in more areas of the embedded posts vector space. In essence, it permits more comprehensive coverage of the vector space, potentially leading to more relevant and diverse results.
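To illustrate the chunk-and-merge idea in that last paragraph, here is a minimal sketch built on the `qdrant-client` search API. The `search_with_query_chunks` helper, the collection name argument, and the merge-by-best-score policy are assumptions for illustration, not the article's implementation:

```python
from qdrant_client import QdrantClient

def search_with_query_chunks(client: QdrantClient, collection: str,
                             embedded_queries: list[list[float]], limit: int = 5):
    """Run one Qdrant search per query chunk and merge hits by their best score."""
    best = {}  # point id -> (score, payload)
    for query_vector in embedded_queries:
        hits = client.search(collection_name=collection,
                             query_vector=query_vector, limit=limit)
        for hit in hits:
            # Keep each post once, with the highest score any chunk achieved.
            if hit.id not in best or hit.score > best[hit.id][0]:
                best[hit.id] = (hit.score, hit.payload)
    return sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)[:limit]
```

Keeping the maximum per-post score is one simple merge policy; averaging across chunks or reciprocal-rank fusion are alternatives, depending on how broad you want the chunked query's coverage of the vector space to be.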