
How to use your own data within a GPT-4 prompt hosted on Azure Web Services

1 February 2025, JULITH GmbH, Thomas RIEHN

Overview and Motivation

Embedding self-managed content into Large Language Models (LLMs) like GPT-4 offers numerous advantages and compelling reasons for its implementation. One key reason is that it allows for the integration of domain-specific knowledge that may not be universally available in the model's original training data. This is particularly beneficial in industries like healthcare, law, or engineering, where specialized information is crucial. By embedding such content, the model can provide more accurate, context-specific, and detailed responses, significantly improving its value in niche applications.

Another advantage is the ability to ensure improved accuracy and relevance. While LLMs like GPT-4 have a broad understanding of language and concepts, they may lack up-to-date or highly specialized information. Embedding self-managed content, such as proprietary datasets or specialized knowledge, ensures that the model delivers accurate, relevant, and timely information, which is particularly valuable when the model’s training data might not cover recent developments or unique subject matter.

Embedding self-managed content also enables personalization, allowing the model to adapt to user preferences, historical interactions, or specific contextual needs. This makes it possible for organizations to create tailored experiences, ensuring that the LLM aligns with individual user preferences or organizational standards. Additionally, by embedding carefully curated and up-to-date information, it becomes easier to reduce the likelihood of "hallucinations," which occur when an LLM generates plausible-sounding but inaccurate information. Self-managed content offers a controlled, verified source of information, enhancing the reliability of the model's outputs.

The ability to continuously update and adapt the model is another significant benefit. With embedded self-managed content, organizations can update the model’s knowledge base with new information as it becomes available. This means that the LLM can stay relevant and accurate over time without requiring complete retraining or relying on infrequent updates from the model's developers. Furthermore, organizations have greater control over the knowledge embedded in the LLM, allowing them to ensure that the model’s outputs are aligned with company policies, ethical standards, and legal requirements.

Embedding self-managed content also reduces dependence on external data sources, improving the speed and consistency of the model’s responses. Rather than relying on third-party systems or services, the LLM can directly access proprietary or frequently-used information, making it more efficient. This leads to operational efficiency as the model can handle specialized tasks, such as customer support or decision-making, without requiring human intervention for basic queries.

From a data security perspective, embedding self-managed content provides a higher level of control over sensitive or confidential information. Organizations can apply their own data governance policies, ensuring that the model’s outputs are secure and comply with privacy regulations. Moreover, integrating self-managed content with existing organizational systems, like databases or CRM platforms, allows for a seamless flow of information, enhancing the model’s capability to support complex tasks like analytics and reporting.

Finally, embedding self-managed content can be cost-effective, as it reduces the need for relying on external APIs or continuously updating training data, which can be expensive. This approach allows organizations to decrease their reliance on paid data sources and API calls, ultimately lowering operational costs. In summary, embedding self-managed content into LLMs like GPT-4 enables organizations to harness more accurate, secure, and tailored outputs, improving both the model's effectiveness and its operational efficiency.

Basic Configuration of Azure-Webservices

Advantages of Azure OpenAI Services

Azure OpenAI Services offer several advantages when it comes to embedding self-managed content into Large Language Models (LLMs) like GPT-4, making it a powerful platform for organizations seeking to integrate their own proprietary data while maintaining control, security, and scalability. One of the key benefits is customization. Azure allows organizations to fine-tune models on their own datasets, enabling them to embed domain-specific, self-managed content directly into the model. This ensures that the model can generate more accurate and contextually relevant responses based on the unique data it has been provided.

Another significant advantage is scalability. Azure's infrastructure is built to handle large-scale applications, so when self-managed content is embedded into models, it can efficiently scale across millions of queries or users without compromising performance. This is particularly beneficial for enterprises that require robust, high-performance AI services capable of managing large volumes of data while ensuring that embedded content remains effective and accessible.

Security and compliance are also notable advantages of Azure OpenAI Services. Azure provides enterprise-grade security and compliance frameworks, ensuring that organizations can embed sensitive or proprietary self-managed content securely. Data is protected with advanced encryption methods, and compliance with various regulations, such as GDPR and HIPAA, is facilitated by Azure’s comprehensive tools and certifications. This is especially crucial for businesses in industries that handle sensitive data and need to ensure that their AI solutions meet stringent legal and ethical requirements.

Additionally, integration with existing Azure services is another key benefit. Organizations that are already using other Azure services, such as Azure Cognitive Services, Azure Databases, or Azure Machine Learning, can seamlessly integrate their self-managed content into the OpenAI models. This interoperability allows for a more streamlined workflow, where data can be pulled from other internal systems and used to train or fine-tune models. This makes the embedding of self-managed content into LLMs much easier and more effective, enhancing the overall user experience.

Azure also enables continuous updates to embedded content. This flexibility allows organizations to frequently update their self-managed content, whether to reflect new knowledge, changing regulations, or evolving business needs. With Azure’s AI infrastructure, updates can be made quickly and efficiently without the need for re-training from scratch, ensuring that the model remains current and aligned with the organization’s requirements.

Furthermore, Azure offers advanced monitoring and management tools that help organizations oversee the performance of their embedded content in real time. Azure’s tools allow for detailed insights into how the model is performing, how the embedded content is influencing the responses, and where adjustments might be needed. This data-driven approach enhances decision-making and ensures that the integration of self-managed content is continuously optimized for the best results.

In summary, Azure OpenAI Services provide a comprehensive and flexible platform for embedding self-managed content into models like GPT-4. The advantages include customization for domain-specific knowledge, scalability to handle large data volumes, strong security and compliance features, seamless integration with existing Azure services, the ability for continuous updates, and advanced monitoring tools—all of which contribute to making the process of embedding self-managed content highly effective and secure for organizations.

Getting started

In the following example we want to embed text documents, which requires several Azure services. But first of all: log in to your Azure account ;)


Creating a new subscription

We start our project using a new, dedicated subscription. You can skip this chapter if you don't need a separate subscription.

Click on "Subscriptions" and the subscription overview opens up. After clicking "Add" you can enter the details for the new subscription. After clicking "Review + Create" you see the details of the subscription to be created, and after clicking "Create" the subscription will be created.

Creating a new resource group

We continue by setting up a new resource group within the newly created subscription.

We create the resource group by clicking "Create" within the overview of the existing resource groups ("Settings" -> "Resource Groups").

Creating a new storage account

By clicking "+ Create Resources" we create a storage account.

As the primary service we select "Azure Blob Storage or Azure Data Lake Storage Gen 2".

After clicking "Review + Create" we see the summary of the storage account.

After clicking "Create" the storage account will be created.

Creating a new storage container

We go to the storage account by clicking "Go to resource".

In the following step we need to create a container ("Data storage" -> "Containers"). After clicking "+ Container" we can define the name of the container within the pane on the right-hand side.

Configure metadata of storage container

This is an optional step if you want to add metadata to the data within the storage container. To do so, click on the three dots next to the created storage container and select "Edit metadata" in the context menu.

We add two fields for testing purposes (the document name and a URL) to demonstrate the possibilities.

Import Data into Storage Container

Entering the container, you have the possibility to upload content. For our test purposes we decided to upload text files; other data formats can be imported as well, and Azure is even capable of identifying content within pictures. Regarding data size, we are limited to 64 KB per document due to the pricing tier we will select later during the configuration of AI Search.


After clicking "Upload" you can select the files to be uploaded. For demonstration purposes we selected "Cool" as the access tier.

After uploading, the container overview lists the uploaded files.
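The portal upload can also be scripted. The following Python sketch builds the REST request for uploading a text file with custom metadata to the blob container; the account name, container name, SAS token, and metadata values are illustrative placeholders, not values from this tutorial.

```python
# Sketch: upload a text file with custom metadata via the Blob Storage
# "Put Blob" REST API, assuming a SAS token with write permission.
import urllib.request

def build_upload_request(account: str, container: str, blob_name: str,
                         sas_token: str, data: bytes,
                         metadata: dict) -> urllib.request.Request:
    """Build a PUT request that creates a block blob and sets metadata headers."""
    url = f"https://{account}.blob.core.windows.net/{container}/{blob_name}?{sas_token}"
    headers = {
        "x-ms-blob-type": "BlockBlob",   # required for Put Blob
        "x-ms-access-tier": "Cool",      # matches the tier chosen above
        "Content-Type": "text/plain",
    }
    # Custom metadata (e.g. document name, URL) travels as x-ms-meta-<name> headers.
    for key, value in metadata.items():
        headers[f"x-ms-meta-{key}"] = value
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")

req = build_upload_request(
    "mystorageaccount", "mycontainer", "doc1.txt", "sv=...&sig=...",
    b"example content",
    {"documentname": "doc1", "url": "https://example.com/doc1"},
)
# urllib.request.urlopen(req)  # uncomment to actually upload
```

The same headers work with `az storage blob upload` or the Azure SDKs; the point here is that the metadata configured in the portal is nothing more than `x-ms-meta-*` headers on the blob.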

Configuring AI Search Component

Next we want to set up AI Search. First of all we need to create a corresponding resource.

Creating an AI Search

After clicking "+ Create" we are asked for the details.

Within the "Pricing Tier" selection you can choose between the available tiers; see the comparison table at https://learn.microsoft.com/en-us/azure/search/search-sku-tier.

After clicking "Review + Create" we see the summary of the search service.

After clicking "Create" the AI Search service will be created.

Setting-up the necessary identities

For the seamless integration of the services we highly recommend setting up an identity-based configuration. It is necessary to change the identity to system-assigned.

After confirming the change with the "Save" button, an Azure object will be created.

Next it is necessary to change the "Azure role assignments" by clicking on the button of the same name.

Two role assignments are necessary: one for the resource group and one for the storage account. We continue with the "+ Add role assignment" button.

Resource Group

We define "Resource Group" as the scope and select the resource group created above. The role which needs to be selected is the "Cognitive Services User" role.

Storage Account

We define "Storage" as the scope and select the storage account created above. The role which needs to be selected is the "Storage Blob Data Reader" role.

Change Authentication to RBAC

Within the keys section we have to switch from API keys to role-based access control.

After the successful switch, role-based access control is shown as the active authentication method.

Create an Azure OpenAI Service

Add Azure OpenAI to your subscription

To add Azure OpenAI to your subscription, search for "OpenAI" in the Azure portal.

Once you have found the service, you can create the service instance. During this process we need to select a region, a name, and the pricing tier.

After clicking "Next" we are asked to select the type of network security we want for the AI Services resource.

After clicking "Next" we are asked to enter additional tags, which we leave blank.

After clicking "Next" and "Review + create" the service will be created.

Resource Group overview

Our resource group now contains the storage account, the AI Search service, and the Azure OpenAI service.

Role and role assignments

Step 1: Set-up Role-Identity

Within "Resource Management" we find the option to switch to Azure role-based access control (Azure RBAC). The procedure is the same as for the AI Search service.

After saving the status change from "Off" to "On" we can see the created Azure Object ID.

Step 2: Set-up Azure role assignments

We need to click "Azure role assignments" first.

Step 3: Search Service Contributor

Within the scope of the resource group we add the role "Search Service Contributor".

Step 4: Search Index Data Reader

Within the scope of the resource group we add the role "Search Index Data Reader".

Azure OpenAI Overview

Clicking on the OpenAI resource opens its overview page.

Configure Service

By clicking on "Explore Azure AI Foundry portal" a new tab opens in the browser and you enter the Chat Playground.

Append Language Models

We now want to add two different language models.

GPT-4 as language model for the chat prompt

Referring to the first steps, we open the model catalog.

Once we have found it, we can deploy this language model.

ada-002 as language model for the embedding

We search within the model catalog for "ada".

Once we have found it, we can deploy this language model.
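The deployed embedding model can be called directly over the Azure OpenAI REST API. A minimal Python sketch, assuming a hypothetical endpoint, deployment name, and API version (check your resource's "Keys and Endpoint" page and the current API reference for real values):

```python
# Sketch: request an embedding from the deployed ada-002 model.
# Endpoint, deployment name, API key, and API version are placeholders.
import json
import urllib.request

def build_embedding_request(endpoint: str, deployment: str, api_key: str,
                            texts: list[str],
                            api_version: str = "2024-02-01") -> urllib.request.Request:
    """Build the POST request for the embeddings endpoint of a deployment."""
    url = (f"{endpoint}/openai/deployments/{deployment}"
           f"/embeddings?api-version={api_version}")
    body = json.dumps({"input": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_embedding_request(
    "https://my-openai-resource.openai.azure.com", "ada-002-deployment",
    "<api-key>", ["What does the tutorial cover?"],
)
# response = urllib.request.urlopen(req)  # JSON with one embedding vector per input
```

The "Import and vectorize data" wizard configured later calls this same deployment under the hood to embed every document chunk.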

Import Data into the Search Index

Vectorize Content for Azure Search

We switch back to the created search service within our resource group and click on "Import and vectorize data".

Step 1: Data connection

In the first step we need to set up our data connection (Azure Blob Storage in our case).

Within this step it is necessary to select authentication via the system-assigned managed identity.

Step 2: Vectorization model

In the next step it is necessary to select the OpenAI service and the deployed text-embedding model.

Step 3: Vectorization of images

In our test case we leave this blank.

Step 4: Advanced Settings

We highly recommend enabling the semantic ranker and leaving the indexing schedule at "Once".

Step 5: Review and create

We review the settings and create the configuration.

After clicking "Start searching" you reach the overview of the created vector index.

Within that final process the index, the indexer, and the skillset have been generated.

During the indexing process you can monitor the progress.

If everything worked, the indexing completes without errors and all documents are searchable.
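Once indexing has finished, the index can also be queried outside the portal. Below is a minimal sketch of a semantic query against the Azure AI Search REST API; the service name, index name, and semantic configuration name are placeholders, and with RBAC enabled (as configured above) a client would send an Entra ID bearer token instead of the api-key header shown here:

```python
# Sketch: semantic search query against the generated index.
# Service, index, key, and semantic configuration names are placeholders.
import json
import urllib.request

def build_search_request(service: str, index: str, api_key: str, query: str,
                         semantic_config: str,
                         api_version: str = "2024-07-01") -> urllib.request.Request:
    """Build the POST request for the 'Search Documents' REST operation."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version={api_version}")
    body = json.dumps({
        "search": query,
        "queryType": "semantic",                  # uses the semantic ranker enabled above
        "semanticConfiguration": semantic_config,
        "top": 5,                                 # return the five best-ranked chunks
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_search_request("my-search-service", "my-index", "<api-key>",
                           "What is in the uploaded documents?",
                           "my-semantic-config")
```

This is roughly the query that the chat playground issues on your behalf when the data source is attached in the next section.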

Configure Azure OpenAI Prompt

Add data source

Before we can finally set up the chat prompt, we need to add our data source within the Chat Playground.

Step 1: Selection of data source


We select "Azure AI Search" and our search index.

We select the option to add vector search to this search resource and choose the embedding model.

Step 2: Data management

We keep the hybrid and semantic search type and select the existing semantic search configuration.

Step 3: Data connection

We select that we want to use the system-assigned managed identity.

If you get an error at this point, step back and set up the roles correctly.

Step 4: Review & Create

We review and create the settings.

Conclusion

Now you are able to use the prompt on the right-hand side.
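The playground prompt can also be reproduced programmatically with the "On Your Data" extension of the chat completions API. The following is a sketch under assumed names (endpoint, deployments, search service, index); the exact field names depend on the API version, so verify them against the current Azure OpenAI reference:

```python
# Sketch: chat completion that grounds GPT-4 in the AI Search index
# ("On Your Data"). All resource names and the API key are placeholders.
import json
import urllib.request

def build_chat_request(endpoint: str, gpt_deployment: str, api_key: str,
                       search_endpoint: str, index: str, question: str,
                       api_version: str = "2024-02-01") -> urllib.request.Request:
    """Build the POST request attaching the search index as a data source."""
    url = (f"{endpoint}/openai/deployments/{gpt_deployment}"
           f"/chat/completions?api-version={api_version}")
    body = json.dumps({
        "messages": [{"role": "user", "content": question}],
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": search_endpoint,
                "index_name": index,
                # system-assigned managed identity, as configured above
                "authentication": {"type": "system_assigned_managed_identity"},
            },
        }],
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_chat_request(
    "https://my-openai-resource.openai.azure.com", "gpt-4-deployment",
    "<api-key>", "https://my-search-service.search.windows.net", "my-index",
    "Summarize the uploaded documents.",
)
# urllib.request.urlopen(req)  # answer grounded in the embedded documents
```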

Remark

It could be necessary to add the user to the roles of the "Azure OpenAI" object as well.

Assigning a role to the querying user


Select the role.

Add the necessary user to the role.

And "Review + assign" the role membership.
