Raflie Zainuddin
Welcome to my page.
I am Raflie Zainuddin, and I like experimenting with JavaScript and TypeScript.
On my website, I enjoy sharing how I leverage the features of these languages to create useful hacks and tricks. Hopefully, these insights can help you enhance and streamline your own projects.
2025 Dec 08 • 5 min read

In recent years, the field of artificial intelligence has seen significant advancements, particularly with the advent of generative AI (GenAI) models. These models have the capability to create content, understand context, and perform tasks that were previously thought to be exclusive to human intelligence. One of the most exciting applications of GenAI is the development of action-oriented AI agents that can interact with users, perform tasks, and make decisions based on their understanding of the environment.

Recently, I formed a team to participate in a hackathon focused on integrating GenAI into daily operations. We were also supposed to integrate voice capabilities into our project, but here I will focus only on the action layer of the architecture. During this event, we had the opportunity to develop an AI agent that responds to user queries and produces actions and parameters that the backend can execute. Effectively, this turns the agent into a layer between the user and the backend, translating conversational input into actions that can be executed programmatically.

This document outlines the initial design, implementation, challenges, and potential enhancements to address the limitations of the current prototype.

Problem statement

To demonstrate the capabilities of this architecture, we focused on a specific use case: collecting product information. This is done through a conversational interaction between the user and the GenAI agent. The agent asks relevant questions to gather the necessary details about a product, then generates an action that the backend can execute to store this information in a database.

Constraints

There are several limitations when working with GenAI models:

  • Context length: GenAI models have a limited context window, which means they can only consider a certain amount of text at a time. This can be a challenge considering that LLMs are stateless and do not retain memory of past interactions unless explicitly provided in the prompt.
  • Accuracy: While GenAI models are impressive, they are not perfect. They may generate incorrect or nonsensical responses, which can lead to errors in the actions they produce.
  • Latency: Generating responses from GenAI models can take time, which may not be suitable for real-time applications.
  • Cost: Using GenAI models, especially large ones, can be expensive. This is an important consideration when designing a system that relies heavily on these models.

Architecture overview

The general components of this architecture include:

  • User layer
  • GenAI layer
  • Backend layer
  • Database layer

The traditional architecture would have the user layer interact directly with the backend layer (think REST APIs, CRUD operations, user forms, etc.). However, in this design, we introduce a GenAI layer that serves as an intermediary. The GenAI layer processes user inputs, generates appropriate actions, and communicates with the backend layer to execute these actions.

Program flow

The following is a sequence diagram illustrating the interaction between the user, the application, and the GenAI agent when collecting product information such as product group, brand, etc:

sequenceDiagram
    User->>App: "I am at shelf 1. I am looking at oat milk of brand Oatside, 1L for RM10"
    App->>GenAI: Prompt instructions + Transcribed message(id=1)
    GenAI-->>App: action: set memory <br> --- <br> current shelf: 1 <br>
    GenAI-->>App: action: upsert <br> --- <br> productGroup: oat milk <br> brand: Oatside <br> price: 10 <br> currency: MYR
    GenAI-->>App: action: exclude messages <br> --- <br> transcriptionId: 1
    App-->>App: Execute action: set memory, upsert, exclude messages
    App-->>User: Success
    User->>App: "I saw a dairy milk here of 1L for RM7"
    App->>GenAI: Prompt instructions + Transcribed message(id=2)
    GenAI-->>App: action: follow up <br> message: "Which brand?" <br> followUpId: 1
    App-->>User: TTS: "Which brand?"
    User->>App: "My bad. Dutch Lady"
    App->>GenAI: Prompt instructions + Transcribed messages(id=2, 3, followUpId=1)
    GenAI-->>App: action: upsert <br> --- <br> productGroup: dairy milk <br> brand: Dutch Lady <br> price: 7 <br> currency: MYR
    GenAI-->>App: action: exclude messages <br> --- <br> transcriptionId: 2, 3 <br> followUpId: 1
    App-->>App: Execute action: upsert, exclude messages
    App-->>User: Success

The user interacts with the GenAI through the application layer, which handles the communication between the user and the GenAI model. To ensure that the application can effectively interpret and execute the actions generated by the GenAI, we must define a clear schema for the available actions and their required parameters. With the help of Structured Outputs from OpenAI, we can keep the responses well-structured and easy to parse.
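For illustration, a Structured Outputs request with the OpenAI Node SDK could look roughly like the sketch below. The model name, the promptInstructions and transcribedMessage variables, and the abbreviated schema are placeholders, not our exact configuration:

import OpenAI from "openai";

declare const promptInstructions: string; // system prompt, assembled elsewhere
declare const transcribedMessage: string; // the user's transcribed utterance

const openai = new OpenAI();

// Ask the model for JSON conforming to an action schema. The schema here is
// abbreviated to the discriminating "action" field; the full action schema
// is defined in the next section.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini", // placeholder model name
  messages: [
    { role: "system", content: promptInstructions },
    { role: "user", content: transcribedMessage },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "agent_action",
      strict: true,
      schema: {
        type: "object",
        properties: {
          action: {
            type: "string",
            enum: ["upsert", "excludeMessages", "followUp", "setMemory"],
          },
        },
        required: ["action"],
        additionalProperties: false,
      },
    },
  },
});

const parsed = JSON.parse(completion.choices[0].message.content ?? "{}");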

Action schema definition

To address the problem statement, we defined the following action schema for the GenAI agent:

interface UpsertProductAction {
  action: "upsert";
  refId: string;        // reference id for the product, generated by the GenAI
  productGroup: string; // e.g. "oat milk"
  brand: string;        // e.g. "Oatside"
  price: number;
  currency: string;     // ISO 4217 code, e.g. "MYR"
}

We allow the GenAI to generate a refId for new products. When there is a follow-up interaction, the GenAI can reference the refId to update the existing product information. This also means the GenAI needs to be aware of the existing products in the database to avoid creating duplicates. To facilitate this, we can provide a summary of existing products in the prompt context.
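As a rough illustration, the summary can be as compact as one line per known product keyed by refId; the helper below is a hypothetical sketch, not the exact prompt format we used:

// Compress known products into a short prompt section so the GenAI can
// reuse existing refIds instead of creating duplicates.
function summarizeProducts(products: UpsertProductAction[]): string {
  return products
    .map((p) => `${p.refId}: ${p.productGroup} | ${p.brand} | ${p.price} ${p.currency}`)
    .join("\n");
}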

However, including this summary results in longer prompts, which may eventually exceed the context window of the GenAI model. To mitigate this, we can summarize or truncate the existing product information based on its relevance to the current interaction. We also introduce an excludeMessages action that lets the GenAI mark which messages have already been processed and can be excluded from future prompts, which helps manage the context length effectively.

interface ExcludeMessagesAction {
  action: "excludeMessages";
  transcriptionIds: string[]; // messages already processed and safe to drop
  followUpIds?: string[];     // resolved follow-up threads (see the diagram)
}
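On the application side, honoring this action can be a simple filter applied before the next prompt is assembled; a minimal sketch, assuming transcriptions are kept in memory with string ids:

const excludedIds = new Set<string>();

// Record which transcriptions the GenAI has marked as processed.
function applyExclusion(action: ExcludeMessagesAction): void {
  action.transcriptionIds.forEach((id) => excludedIds.add(id));
}

// Drop excluded transcriptions when building the next prompt context.
function buildPromptContext(all: { id: string; text: string }[]): string {
  return all
    .filter((m) => !excludedIds.has(m.id))
    .map((m) => m.text)
    .join("\n");
}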

Additionally, we define a FollowUpAction to handle scenarios where the GenAI needs to ask clarifying questions to gather more information from the user.

interface FollowUpAction {
  action: "followUp";
  message: string;    // clarifying question to play back to the user
  followUpId: string; // links the user's eventual reply to this question
}
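On the application side, a follow-up might be handled roughly as below; tts is a hypothetical text-to-speech helper, and pendingFollowUpId stands in for however the app tags the user's next transcription:

declare const tts: { speak(text: string): Promise<void> };

let pendingFollowUpId: string | undefined;

// Speak the clarifying question, then remember the followUpId so the user's
// next transcription can be linked back to this question.
async function handleFollowUp(action: FollowUpAction): Promise<void> {
  await tts.speak(action.message);
  pendingFollowUpId = action.followUpId;
}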

To maintain context about the current state of the interaction, we introduce a SetMemoryAction that allows the GenAI to store relevant information, which is then included in the prompt context of subsequent requests.

interface SetMemoryAction {
  action: "setMemory";
  memory: string; // free-form state, e.g. "current shelf: 1"
}
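Putting the schema together, the application can model the response as a discriminated union and dispatch on the action field. The dispatcher below is a sketch: db.upsertProduct is a hypothetical database helper, while applyExclusion and handleFollowUp refer to the earlier sketches:

type AgentAction =
  | UpsertProductAction
  | ExcludeMessagesAction
  | FollowUpAction
  | SetMemoryAction;

declare const db: { upsertProduct(p: UpsertProductAction): Promise<void> };

let agentMemory = ""; // included in subsequent prompt contexts

// TypeScript narrows the union on the "action" tag, so each branch sees the
// full shape of its action.
async function executeAction(action: AgentAction): Promise<void> {
  switch (action.action) {
    case "upsert":
      await db.upsertProduct(action);
      break;
    case "excludeMessages":
      applyExclusion(action);
      break;
    case "followUp":
      await handleFollowUp(action);
      break;
    case "setMemory":
      agentMemory = action.memory;
      break;
  }
}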

Outcomes

The initial prototype was configured to collect basic product information such as product group name, brand, price, and product features. During the demonstration, it successfully collected product information through conversational interactions, with appropriate follow-ups and upserts to the database. Additionally, because LLMs are trained on multiple languages, the agent was able to understand and respond in different languages, making it versatile for diverse user bases.

The OpenAI API responses return usage metrics that help us monitor token consumption for both the prompt and the completion. This is crucial for optimizing costs and ensuring that the application remains efficient. In our initial tests, after collecting about 10 different products, the average token usage was around 3000 tokens per prompt. As a result, we hit the API rate limit within a very short period of time, which can block the intended experience of enhancing daily operations through GenAI.
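For reference, the usage numbers come directly off the completion response; a minimal logging sketch, reusing the completion object from the earlier request sketch:

// Chat Completions responses report token usage, which we log to monitor cost.
if (completion.usage) {
  const { prompt_tokens, completion_tokens, total_tokens } = completion.usage;
  console.log(`prompt: ${prompt_tokens}, completion: ${completion_tokens}, total: ${total_tokens}`);
}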

Possible enhancements

To further improve the architecture and address its limitations, we can consider restructuring the user prompts into several smaller pipelines. For example, we may first ask the GenAI which action should be performed based on the user input. Once we have the action, we can prompt the GenAI again to fill in the parameters for that specific action. This approach may reduce the overall context length and improve the accuracy of the generated actions, but it will introduce additional latency due to the multiple API calls.
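A rough sketch of that two-step pipeline, where classifyAction and fillParameters stand in for two separate, narrower GenAI prompts:

declare function classifyAction(input: string): Promise<AgentAction["action"]>;
declare function fillParameters(
  action: AgentAction["action"],
  input: string,
): Promise<AgentAction>;

// Step 1 picks the action with a short classification prompt; step 2 fills
// in its parameters with a schema-specific prompt. Two calls means extra
// latency, but each prompt carries far less context.
async function runPipeline(userInput: string): Promise<AgentAction> {
  const actionName = await classifyAction(userInput);
  return fillParameters(actionName, userInput);
}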