In recent years, the field of artificial intelligence has seen significant advancements, particularly with the advent of generative AI (GenAI) models. These models have the capability to create content, understand context, and perform tasks that were previously thought to be exclusive to human intelligence. One of the most exciting applications of GenAI is the development of action-oriented AI agents that can interact with users, perform tasks, and make decisions based on their understanding of the environment.
Recently, I formed a team to participate in a hackathon focused on integrating GenAI into daily operations. The full project also called for voice capabilities, but here I will focus only on the action layer of the architecture. During the event, we developed an AI agent that responds to user queries and produces actions and parameters that the backend can execute. Effectively, the agent becomes a layer between the user and the backend, turning conversational input into actions that can be executed programmatically.
This document outlines the initial design, implementation, challenges, and potential enhancements to address the limitations of the current prototype.
To demonstrate the capabilities of this architecture, we focused on a specific use case: collecting product information through conversational interaction between the user and the GenAI agent. The agent asks relevant questions to gather the necessary details about a product and then generates an action that the backend can execute to store this information in a database.
There are several limitations when working with GenAI models: a limited context window that long prompts can exceed, per-token costs that grow with prompt length, and API rate limits that cap how many calls can be made in a short period. We will encounter all three in this prototype.
The general components of this architecture are the user layer, the GenAI layer, and the backend layer.
The traditional architecture would have the user layer interact directly with the backend layer (think REST APIs, CRUD operations, user forms, etc.). However, in this design we introduce a GenAI layer that serves as an intermediary: it processes user inputs, generates appropriate actions, and communicates with the backend layer to execute them.
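As a rough sketch of that intermediary role (the function and helper names here are illustrative placeholders, not our actual codebase), the application layer might look like this:

// A minimal sketch of the application layer sitting between user and backend.
type Action = { action: string } & Record<string, unknown>;

async function callGenAI(prompt: string): Promise<Action[]> {
  // In the real flow this calls the model and parses its structured output.
  return [];
}

async function executeAction(action: Action): Promise<void> {
  // In the real flow this performs DB writes, memory updates, TTS, etc.
  console.log("executing:", action.action);
}

async function handleUserMessage(transcript: string): Promise<void> {
  const actions = await callGenAI(`<prompt instructions>\nUser: ${transcript}`);
  for (const action of actions) {
    await executeAction(action);
  }
}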
The following is a sequence diagram illustrating the interaction between the user, the application, and the GenAI agent when collecting product information such as product group, brand, etc:
sequenceDiagram
User->>App: "I am at shelf 1. I am looking at oat milk of brand Oatside, 1L for RM10"
App->>GenAI: Prompt instructions + Transcribed message(id=1)
GenAI-->>App: action: set memory <br> --- <br> current shelf: 1 <br>
GenAI-->>App: action: upsert <br> --- <br> productGroup: oat milk <br> brand: Oatside <br> price: 10 <br> currency: MYR
GenAI-->>App: action: exclude messages <br> --- <br> transcriptionId: 1
App-->>App: Execute action: set memory, upsert, exclude messages
App-->>User: Success
User->>App: "I saw a dairy milk here of 1L for RM7"
App->>GenAI: Prompt instructions + Transcribed message(id=2)
GenAI-->>App: action: follow up <br> message: "Which brand?" <br> followUpId: 1
App-->>User: TTS: "Which brand?"
User->>App: "My bad. Dutch Lady"
App->>GenAI: Prompt instructions + Transcribed messages(id=2, 3) + followUpId=1
GenAI-->>App: action: upsert <br> --- <br> productGroup: dairy milk <br> brand: Dutch Lady <br> price: 7 <br> currency: MYR
GenAI-->>App: action: exclude messages <br> --- <br> transcriptionId: 2, 3 <br> followUpId: 1
App-->>App: Execute action: upsert, exclude messages
App-->>User: Success
The user interacts with the GenAI through the application layer, which handles the communication between the user and the GenAI model. To ensure the application can effectively interpret and execute the actions generated by the GenAI, we must define a clear schema for the available actions and their required parameters. With the help of Structured Outputs from OpenAI, we can keep the responses well-structured and easy to parse.
To address the problem statement, we defined the following action schema for the GenAI agent:
interface UpsertProductAction {
  action: "upsert";
  refId: string; // stable reference generated by the GenAI; reused to update an existing product
  productGroup: string;
  brand: string;
  price: number;
  currency: string; // currency code, e.g. "MYR"
}
We allow the GenAI to generate a refId for new products. When there is a follow-up interaction, the GenAI can reference that refId to update the existing product information. This also means the GenAI needs to be aware of the existing products in the database to avoid creating duplicates; to facilitate this, we provide a summary of existing products in the prompt context.
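For illustration, a minimal sketch of the app-side upsert and the product summary, using an in-memory Map in place of a real database:

// Sketch: a Map keyed by refId stands in for the real product store.
const products = new Map<string, UpsertProductAction>();

function executeUpsert(action: UpsertProductAction): void {
  // A known refId updates the existing record; a new refId inserts one.
  const existing = products.get(action.refId);
  products.set(action.refId, { ...existing, ...action });
}

function summarizeProducts(): string {
  // One compact line per product keeps the prompt context short.
  return [...products.values()]
    .map((p) => `${p.refId}: ${p.productGroup} | ${p.brand} | ${p.price} ${p.currency}`)
    .join("\n");
}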
However, this results in longer prompts, which may exceed the context window of the GenAI model. To mitigate this, we can summarize or truncate the existing product information based on its relevance to the current interaction. We also introduce an excludeMessages action that lets the GenAI mark which messages have already been processed so they can be excluded from future prompts, which helps manage the context length effectively.
interface ExcludeMessagesAction {
  action: "excludeMessages";
  transcriptionIds: string[]; // messages fully processed and safe to drop from future prompts
  followUpIds?: string[]; // any resolved follow-ups tied to those messages (see the sequence diagram)
}
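On the app side, this can be as simple as a set of processed IDs that filters the transcript list before each prompt. A sketch, with a minimal Transcription shape assumed here:

interface Transcription {
  id: string;
  text: string;
}

const excludedIds = new Set<string>();

function executeExcludeMessages(action: ExcludeMessagesAction): void {
  for (const id of action.transcriptionIds) {
    excludedIds.add(id);
  }
}

function pendingMessages(all: Transcription[]): Transcription[] {
  // Only messages not yet excluded are included in the next prompt.
  return all.filter((t) => !excludedIds.has(t.id));
}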
Additionally, we define a FollowUpAction to handle scenarios where the GenAI needs to ask clarifying questions to gather more information from the user.
interface FollowUpAction {
  action: "followUp";
  message: string; // clarifying question to read back to the user
  followUpId: string; // links the user's answer back to this question (see the sequence diagram)
}
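A sketch of how the app might handle this action, where speak is a hypothetical hook into the TTS layer:

let openFollowUpId: string | undefined;

function speak(message: string): void {
  console.log(`TTS: ${message}`); // placeholder for the real TTS integration
}

function executeFollowUp(action: FollowUpAction): void {
  // Remember the open question so the user's next answer can be linked to it.
  openFollowUpId = action.followUpId;
  speak(action.message); // e.g. "Which brand?"
}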
To maintain context about the current state of the interaction, we introduce a SetMemoryAction that allows the GenAI to store relevant state, which is then included in subsequent prompt contexts.
interface SetMemoryAction {
  action: "setMemory";
  memory: string; // free-form state, e.g. "current shelf: 1", echoed back in later prompts
}
The initial prototype was configured to collect basic product information such as product group, brand, price, and product features. In demonstrations, it successfully collected product information through conversational interactions, with appropriate follow-ups and upserts to the database. Additionally, because the underlying LLMs are trained on multilingual data, the agent could understand and respond in different languages, making it versatile for diverse user bases.
The OpenAI API responses include usage metrics that let us monitor token consumption for both the prompt and the completion, which is crucial for controlling costs and keeping the application efficient. In our initial tests, after collecting about 10 different products, the average usage was around 3000 prompt tokens per interaction. As a result, we hit the API rate limit within a very short period, which blocks the intended user experience of enhancing daily operations through GenAI.
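For example, each chat completion response carries a usage object whose token counts can be logged per interaction (the field names below are the ones returned by the chat completions API):

interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

function logUsage(usage: Usage | undefined): void {
  if (!usage) return;
  console.log(
    `prompt=${usage.prompt_tokens} completion=${usage.completion_tokens} total=${usage.total_tokens}`
  );
}

// e.g. logUsage(completion.usage) after the parse call in the earlier sketch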
To further improve the architecture and address its limitations, we can restructure the user prompts into several smaller pipelines. For example, we might first ask the GenAI which action should be performed based on the user input, and once we have the action, prompt it again to fill in the parameters for that specific action. This approach may reduce the overall context length and improve the accuracy of the generated actions, but it will introduce additional latency due to the multiple API calls.
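A sketch of that two-stage pipeline, using the same structured-output helpers as before; the prompts and the upsert-only second stage are illustrative:

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const client = new OpenAI();
const MODEL = "gpt-4o-2024-08-06"; // placeholder model

async function twoStagePipeline(transcript: string) {
  // Stage 1: a short prompt and tiny schema that only classify the action.
  const stage1 = await client.beta.chat.completions.parse({
    model: MODEL,
    messages: [{ role: "user", content: `Which action fits this message?\n${transcript}` }],
    response_format: zodResponseFormat(
      z.object({ action: z.enum(["upsert", "excludeMessages", "followUp", "setMemory"]) }),
      "action_choice"
    ),
  });
  const choice = stage1.choices[0].message.parsed?.action;

  // Stage 2: fill in only that action's parameters (upsert shown here).
  if (choice !== "upsert") return null;
  const stage2 = await client.beta.chat.completions.parse({
    model: MODEL,
    messages: [{ role: "user", content: `Extract the product details:\n${transcript}` }],
    response_format: zodResponseFormat(
      z.object({
        refId: z.string(),
        productGroup: z.string(),
        brand: z.string(),
        price: z.number(),
        currency: z.string(),
      }),
      "upsert_parameters"
    ),
  });
  return stage2.choices[0].message.parsed;
}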