Have you ever wondered what it takes to build a truly helpful AI assistant? It’s more than just plugging into a large language model (LLM). It’s about building a smart, efficient, and reliable system that understands your specific needs.
Today, we’re pulling back the curtain on the architecture of the AI Assistant in our landlord dashboard. Let’s dive into how we make the magic happen.
The Goal: A Librarian, Not Just a Chatterbox
Our goal wasn’t to create an AI that could just chat; we wanted one that could give accurate, relevant answers based on a specific set of knowledge. Think of it like a super-powered librarian. Instead of reading the entire internet, it consults a curated library of documents to find the perfect answer to your question.
This approach is called Retrieval-Augmented Generation (RAG), and it’s the core of our system. It grounds the AI in facts, dramatically reducing the risk of made-up answers (hallucinations) and ensuring the information is relevant to you.
A User's Query: A Journey Through Our System
When you type a message into the AI Assistant, it kicks off a rapid, multi-step process.
Step 1: The Friendly Frontend
Your journey starts in our Next.js web application. The chat interface is clean and simple, designed to get you answers without fuss. When you hit send, your message, along with your secure user token, is packaged and sent to our API.
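Under the hood, that send button boils down to a single authenticated GraphQL call. Here's a minimal sketch, assuming a hypothetical askAssistant field and a Cognito ID token already obtained at sign-in (the field name and environment variable are illustrative, not our exact schema):

```typescript
// Minimal sketch of the frontend call. The `askAssistant` field and the
// NEXT_PUBLIC_APPSYNC_URL variable are illustrative placeholders.
const ASK_ASSISTANT = /* GraphQL */ `
  query AskAssistant($question: String!) {
    askAssistant(question: $question) {
      answer
    }
  }
`;

export async function askAssistant(question: string, idToken: string): Promise<string> {
  const res = await fetch(process.env.NEXT_PUBLIC_APPSYNC_URL!, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // AppSync validates this Cognito token before invoking any resolver.
      Authorization: idToken,
    },
    body: JSON.stringify({ query: ASK_ASSISTANT, variables: { question } }),
  });

  const { data, errors } = await res.json();
  if (errors?.length) throw new Error(errors[0].message);
  return data.askAssistant.answer;
}
```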
Step 2: The Secure Gateway (AppSync API)
Your request first hits our AppSync GraphQL API. This acts as the secure front door, validating your identity using AWS Cognito to ensure only authorized users can access the assistant. Once you’re cleared, AppSync forwards the request to the brains of the operation: our central Lambda function.
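For the infrastructure-curious, here's one way that wiring can look with the AWS CDK in TypeScript. Treat it as a hedged sketch rather than our exact stack: construct names, the schema file path, and the askAssistant field are assumptions.

```typescript
import * as cdk from "aws-cdk-lib";
import * as appsync from "aws-cdk-lib/aws-appsync";
import * as cognito from "aws-cdk-lib/aws-cognito";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

export class AssistantApiStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const userPool = new cognito.UserPool(this, "LandlordUserPool");

    const ragFn = new lambda.Function(this, "RagFn", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("dist/rag"),
    });

    // The API only accepts requests carrying a valid Cognito user pool token.
    const api = new appsync.GraphqlApi(this, "AssistantApi", {
      name: "assistant-api",
      schema: appsync.SchemaFile.fromAsset("schema.graphql"),
      authorizationConfig: {
        defaultAuthorization: {
          authorizationType: appsync.AuthorizationType.USER_POOL,
          userPoolConfig: { userPool },
        },
      },
    });

    // Authenticated queries are forwarded straight to the RAG Lambda.
    api
      .addLambdaDataSource("RagDataSource", ragFn)
      .createResolver("AskAssistantResolver", {
        typeName: "Query",
        fieldName: "askAssistant",
      });
  }
}
```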
Step 3: The Brains of the Operation (The RAG Lambda Function)
This is where the real work happens. Our AWS Lambda function, written in TypeScript, orchestrates the entire RAG workflow, from guardrails to retrieval to answer generation.
Common Sense First: Does the AI really need to fire up for a simple “hello”? Nope. We have lightweight heuristic guardrails that catch greetings and very vague questions. For those, the system returns a helpful, pre-written response, saving both time and money.
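Here's a simplified sketch of that kind of check; the patterns, word-count threshold, and canned replies are illustrative, not our production rules:

```typescript
// Illustrative guardrail: short-circuit greetings and overly vague questions
// before spending any tokens on embeddings or generation.
const GREETING = /^(hi|hello|hey|good (morning|afternoon|evening))[\s!.,]*$/i;

export function checkGuardrails(query: string): string | null {
  const trimmed = query.trim();

  if (GREETING.test(trimmed)) {
    return "Hi there! Ask me anything about your properties, tenants, or payments.";
  }

  // Very short queries rarely contain enough signal to retrieve good context.
  if (trimmed.split(/\s+/).length < 3) {
    return "Could you give me a bit more detail? For example: \"How do I record a rent payment?\"";
  }

  return null; // No guardrail triggered; continue with the full RAG pipeline.
}
```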
Understanding the Question: To find relevant documents, the system first needs to understand the meaning behind your words. It sends your query to Amazon Bedrock’s Titan embedding model, which converts your text into a numerical vector—a sort of “meaning fingerprint.”
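A minimal sketch of that embedding step using the Bedrock Runtime SDK; the specific Titan model ID shown here is an assumption:

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

// Convert the user's query into its "meaning fingerprint" (embedding vector).
export async function embedQuery(text: string): Promise<number[]> {
  const response = await bedrock.send(
    new InvokeModelCommand({
      modelId: "amazon.titan-embed-text-v2:0", // illustrative model id
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({ inputText: text }),
    }),
  );

  // The response body is raw bytes containing a JSON payload with the vector.
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.embedding as number[];
}
```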
Finding the Right Documents: Now, the system takes that fingerprint and compares it against a precomputed index of document embeddings stored in an S3 bucket. Using an incredibly fast math technique called cosine similarity, it finds the document snippets from our knowledge base that are most contextually related to your question. If no good matches are found (i.e., the similarity score is too low), it wisely decides not to guess and instead guides you to ask a more specific question.
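The heart of retrieval is just a few lines of vector math. A simplified sketch, with an illustrative top-k and similarity threshold:

```typescript
interface IndexedChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: how closely two "meaning fingerprints" point the same way.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every chunk in the precomputed index and keep only confident matches.
export function retrieve(
  queryEmbedding: number[],
  index: IndexedChunk[],
  topK = 4,
  minScore = 0.5, // illustrative threshold; the real one is tuned empirically
): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .filter(({ score }) => score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ chunk }) => chunk);
}
```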
Building the Perfect Prompt: With the best context snippets in hand, the system assembles a carefully crafted prompt for our language model. It essentially says: “Hey AI, answer this user’s question, but—and this is important—base your answer only on the following information I’ve provided.”
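A stripped-down version of that prompt assembly; the exact wording of our production prompt differs:

```typescript
// Illustrative prompt template: the model is told to answer only from the
// retrieved snippets, and to say so when they don't cover the question.
export function buildPrompt(question: string, snippets: string[]): string {
  const context = snippets
    .map((snippet, i) => `[Source ${i + 1}]\n${snippet}`)
    .join("\n\n");

  return [
    "You are an assistant for a landlord dashboard.",
    "Answer the user's question using ONLY the context below.",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```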
Generating the Answer (with a Backup Plan!): Finally, the prompt is sent to an Anthropic Claude model via Bedrock. For speed and cost-efficiency, we start with the nimble Claude Haiku. But what if Haiku is unavailable or overloaded? No problem. We’ve built a resilient fallback system that automatically retries with more powerful models (Sonnet, then Opus) to ensure you always get an answer.
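Conceptually, the fallback is a loop over a cheapest-first list of models. A sketch of that idea, with illustrative Bedrock model identifiers:

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

// Cheapest and fastest first; escalate only if a model is unavailable or throttled.
const MODEL_FALLBACK_ORDER = [
  "anthropic.claude-3-haiku-20240307-v1:0",
  "anthropic.claude-3-sonnet-20240229-v1:0",
  "anthropic.claude-3-opus-20240229-v1:0",
];

export async function generateAnswer(prompt: string): Promise<string> {
  let lastError: unknown;

  for (const modelId of MODEL_FALLBACK_ORDER) {
    try {
      const response = await bedrock.send(
        new InvokeModelCommand({
          modelId,
          contentType: "application/json",
          accept: "application/json",
          body: JSON.stringify({
            anthropic_version: "bedrock-2023-05-31",
            max_tokens: 1024,
            messages: [{ role: "user", content: [{ type: "text", text: prompt }] }],
          }),
        }),
      );
      const payload = JSON.parse(new TextDecoder().decode(response.body));
      return payload.content[0].text;
    } catch (error) {
      // Throttling or model unavailability: try the next model in line.
      lastError = error;
    }
  }

  throw lastError;
}
```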
The final, context-aware answer is then sent back through the API to your screen.
Key Design Choices (And Why We Made Them)
Building a system like this involves making smart tradeoffs. Here are a few that define our approach:
Precomputed Index vs. On-the-fly: We embed our knowledge base offline and store it in a simple JSON file on S3. This makes the real-time retrieval process lightning-fast and cost-effective, as we only need to embed the user’s query during the chat (see the loading sketch after these design notes).
Simple Storage vs. A Vector Database: For our current scale, using a file in S3 is brilliantly simple and avoids the operational overhead of a dedicated vector database. We can always scale up when needed.
Hiding Sources by Default: We decided to hide the raw source snippets in the UI to reduce cognitive noise and provide a cleaner, more direct answer. The data is still there if we ever want to add a “show sources” feature.
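To make the first two choices concrete, here's a minimal sketch of how the Lambda might load and cache that precomputed index from S3. The bucket name, key, and chunk shape are placeholders:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

interface IndexedChunk {
  id: string;
  text: string;
  embedding: number[];
}

const s3 = new S3Client({});
let cachedIndex: IndexedChunk[] | undefined;

// Load the precomputed embedding index once per Lambda container and reuse it
// across invocations, so retrieval stays a pure in-memory operation.
export async function loadIndex(): Promise<IndexedChunk[]> {
  if (cachedIndex) return cachedIndex;

  const response = await s3.send(
    new GetObjectCommand({
      Bucket: process.env.KNOWLEDGE_BASE_BUCKET!, // placeholder bucket
      Key: "embeddings/index.json",               // placeholder key
    }),
  );

  cachedIndex = JSON.parse(await response.Body!.transformToString()) as IndexedChunk[];
  return cachedIndex;
}
```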
What's Next on the Horizon?
We’re just getting started! Our architecture is designed for growth. Here are a few enhancements we’re excited about:
Streaming Responses: To make the assistant feel even more responsive, we’ll stream the answer back to you word-by-word.
Conversation History: Allowing the assistant to remember the last few things you talked about for more natural, contextual conversations.
Automated Index Updates: A CI/CD pipeline that automatically updates our knowledge base whenever our source documents change.
By combining a smart RAG architecture, resilient cloud components, and a focus on user experience, we’ve built an AI Assistant that is not only powerful but also reliable and efficient. We hope this peek behind the curtain was insightful!