LLM Agents, Part 1 - The “9” Commandments: How to Build LLM Products Successfully
"It’s easy to demo a car self-driving around a block, but making it into a product takes a decade." - Karpathy
In this write-up we will go over the most important principles you should follow as you ideate, validate, design, and build your LLM product. One thing you will realize by the end is that the principles for building the most sophisticated multi-agent LLM products are the same as those for any LLM product, and ultimately the same as those for any data-powered software product.
1. Data is most probably your only moat
We are living in a world where powerful open-source models are just a few clicks away, and your proprietary data is likely your only sustainable competitive advantage. While anyone can access state-of-the-art models like GPT-4o, Llama 3, and Claude, the data you use to fine-tune and augment these models is what will truly set your product apart. Your data is the secret sauce that enables you to build AI systems that can perform tasks and provide insights your competitors can only dream of. Even if becoming a unicorn is not necessarily your thing, being able to interface LLMs with different types of data (e.g., multi-modality) blows up the space of possibilities you can explore in terms of use cases.
It is crucial to focus on building products and features that allow you to collect unique and valuable data that others can't easily replicate. This might mean targeting niche domains where you have deep expertise, or creating AI-powered tools that incentivize users to contribute their own data. Another strategy is to form data partnerships with organizations that have complementary datasets, allowing you to enhance your models' capabilities without starting from scratch.
Admittedly, collecting high-quality data can be challenging and resource-intensive. One approach to mitigate this is to create synthetic data that mimics real-world scenarios. Synthetic data can help augment your existing datasets and improve model performance, especially in cases where real data is scarce or expensive to obtain.
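As one hedged illustration, here is a minimal sketch of LLM-driven synthetic data generation, assuming an OpenAI-style chat-completions client; the model name, prompt, and seed examples are placeholders you would adapt to your own domain.

```python
# Minimal synthetic-data sketch, assuming an OpenAI-style client.
# The model name, prompt, and seed examples are illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEED_EXAMPLES = [
    {"question": "How do I reset my password?", "intent": "account_recovery"},
    {"question": "Why was I charged twice?", "intent": "billing_dispute"},
]

def generate_synthetic_examples(n: int) -> list[dict]:
    """Ask the model for n new examples mimicking the seed distribution."""
    prompt = (
        "You generate labeled training data for a support chatbot.\n"
        f"Seed examples:\n{json.dumps(SEED_EXAMPLES, indent=2)}\n"
        f"Produce {n} NEW examples in the same format, as a raw JSON array."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # In production you would validate, deduplicate, and repair the JSON
    # before adding anything to your training set.
    return json.loads(response.choices[0].message.content)
```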
When it comes to preparing your data for training models, it's important to weigh the benefits of annotating data versus relying solely on unsupervised methods. While unsupervised learning can be appealing due to its potential to reduce manual labor, annotated data often leads to better model performance and faster convergence. Investing in data annotation can pay off in the long run by improving the accuracy and reliability of your AI systems.
Adopting a data-centric approach to machine learning is key to building a strong competitive moat. By focusing on collecting, curating, and leveraging high-quality data, you can create AI products that are more accurate, insightful, and valuable to your users. Always be on the lookout for opportunities to expand and enrich your data moat, as it will be the foundation upon which your AI business is built.
The flip side of this advice is that blindly following this principle, without thinking deeply about what earns customers' long-term loyalty, could simply result in disappointment. Ultimately, the big question is how to use the data (or any other unique resources you have) to deliver value to your customers in a way that compounds: you win new customers and expand your relationship with the ones you have.
2. Follow evaluation-driven development
To build successful AI products, you need a rigorous approach for measuring and optimizing performance. This is where evaluation-driven development comes in, and it is particularly important for agentic workflows and multi-agent systems. A very common problem in naively built agentic systems is compounding error, which quickly leads to systems falling into endless loops or producing nonsensical results. The only way to avoid these problems is to have reliable and granular metrics throughout the system that act as feedback or reward mechanisms, keeping the components and the overall system in check.
Start by defining clear, quantitative metrics that capture what "good" looks like for your product - whether that's accuracy, user engagement, task completion rate, or some combination of these. This has to be done for the individual components of your architecture as well as for the overall performance of the system.
With your key metrics in place, orient your development process around continuously evaluating your pipelines against these benchmarks and iterating to improve performance. This could involve experimenting with different system designs, model architectures, model combinations, fine-tuning techniques, prompt engineering approaches, and UX designs. The key is to have a solid experimental setup where you're constantly shipping new arrangements of components, measuring their impact on your core metrics, and doubling down on the most promising ideas. This is particularly important for LLM agent systems, since the landscape of potential improvements is so vast that a thorough investigation of all possibilities with limited resources is simply impractical.
Don't get caught up in chasing the latest shiny model or technique without a clear sense of how it actually moves the needle on your core evaluation criteria. If you can't measure what "better" means, you're at high risk of going in circles or fixating on the wrong things. By grounding your development in rigorous evaluation, you can efficiently zoom in on system architecture designs that actually deliver value to your users.
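To make this concrete, here is a minimal sketch of an evaluation harness. Everything in it is an illustrative assumption: the metric functions, the EvalCase shape, and the pipeline signature are stand-ins for whatever "good" means for your product.

```python
# Minimal evaluation-harness sketch; metrics and cases are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str

# Each metric maps (output, case) -> a score in [0, 1].
METRICS: dict[str, Callable[[str, EvalCase], float]] = {
    "exact_match": lambda out, case: float(out.strip() == case.expected.strip()),
    "non_empty": lambda out, case: float(bool(out.strip())),
}

def evaluate(pipeline: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Run the pipeline over all cases and average each metric."""
    totals = {name: 0.0 for name in METRICS}
    for case in cases:
        output = pipeline(case.input)
        for name, metric in METRICS.items():
            totals[name] += metric(output, case)
    return {name: total / len(cases) for name, total in totals.items()}

# Usage: scores = evaluate(my_pipeline, my_cases). Track these scores per
# component and for the end-to-end system across every experiment you ship.
```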
3. Get your product in the hands of your (ideally paying) users asap
One of the biggest pitfalls in AI development is getting bogged down in endless technical tweaks before getting any feedback from real users. This is especially tempting with LLMs, where there's always another parameter to tune or dataset to incorporate. But the reality is, you'll never know if you're building something people actually want until you put it in their hands. The feedback gets even more real if they are paying you (or at least anticipate having to pay to use the product).
The antidote is simple (but not always easy): Build the simplest viable version of your product and get it in front of users as quickly as humanly possible. This might mean starting with a bare-bones MVP that only does one thing, or even launching a "fake" version powered by human labor behind the scenes. The point is to start collecting real feedback and data from day one, so you can validate your core assumptions and start iterating in the right direction. Doing this can also help calibrate your understanding of the right metrics to track, as per the previous commandment. It is easy to lose sight of what really matters to the user by hiding behind technical metrics like accuracy.
Ideally, get this initial version in the hands of paying customers, even if it's just a small pilot group. Seeing real people actually fork over their hard-earned cash for your product is the ultimate validation that you're onto something. Plus, having revenue coming in from the get-go will help extend your runway and give you more breathing room to iterate.
Another important aspect of this is deployment. It is great that your product works on your laptop, but if users can't interact with it, you have significant friction in getting the feedback you need.
4. Separate the data and interface layers, and be prepared to invest in data engineering
Using LLMs doesn't give you a free pass to ignore established software and data engineering best practices. In fact, as LLM-based systems grow in complexity and capability, it becomes even more critical to architect your systems in a modular, maintainable way.
A key principle here is maintaining a clear separation between your data and interface layers.
Designing your system in a way that the LLM itself becomes the source of knowledge is a risky and ill-advised approach. Instead, strive to architect your system, craft your prompts, and provide the relevant context to the LLM to ensure it relies solely on the information you supply to it when generating a response. While this may evolve in the future, cleanly decoupling your data from your interfaces gives you greater control, allows you to layer on additional security and privacy measures, and makes your system more robust to changes in the underlying models. Retrieval-augmented generation (RAG) techniques provide a powerful way to achieve this decoupling while still harnessing the full power of LLMs.
It is tempting to think that you can just fine-tune one model on your data and it will work as expected with all the controls you need. The reality is that LLMs are not well-behaved enough to achieve the granular level of control necessary for real-world applications and use cases. It is best to keep the data layer (knowledge base documents, structured data, etc.) in already well-established structures (aka databases) with all the necessary controls (e.g., identity and access management) that come with them. This also makes it easier for you to pre- and post-process that data before feeding it to the model in retrieval-augmented generation (or equivalent) setups.
Separating the data layer also gives you the ability to build all the logic necessary for processing, storing, and retrieving the data used to train and run smaller models you use in your control flows or to fine-tune your larger models. This includes data ingestion pipelines, data cleaning and transformation steps, feature engineering, and data versioning. Your interface layer, on the other hand, should focus solely on exposing the capabilities of your models to end-users, whether that's via APIs, chatbots, or interactive GUIs. Of course, the LLM itself can act as a linguistic interface by providing conversational interactions with the user.
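To make the decoupling concrete, here is a minimal sketch under stated assumptions: the keyword retriever stands in for a real vector database, the llm callable is a placeholder for your model client, and the prompt instructs the model to answer only from the supplied context.

```python
# Sketch of data/interface decoupling via retrieval-augmented generation.
# The keyword retriever stands in for a real vector store, and the llm
# callable is a placeholder; access control belongs in the data layer.
from typing import Callable

DOCUMENTS = {  # data layer: in practice, a database behind access controls
    "refunds": "Refunds are processed within 5 business days.",
    "shipping": "Standard shipping takes 3-7 business days.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Naive keyword-overlap scoring; swap in a vector store in practice."""
    words = set(query.lower().split())
    scored = sorted(
        DOCUMENTS.values(),
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(query: str, llm: Callable[[str], str]) -> str:
    """Interface layer: ground the model strictly in retrieved context."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer ONLY from the context below. If the answer is not there, "
        f"say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```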
Further Reading:
Your LLM Needs a Data Journey: A Comprehensive Generative AI Guide for Data Engineers
Unlocking the Power of Retrieval Augmented Generation with Added Privacy: A Comprehensive Guide
5. Do not count on LLMs beyond linguistic interfaces
LLMs are incredibly powerful for natural language tasks - they can engage in human-like dialogue, answer questions, summarize long passages, and even write creative fiction. But it's critical not to get swept away by the hype and expect them to be a magic bullet for every use case. As the name suggests, LLMs are language models - they excel at generating statistically plausible sequences of words, but struggle with many other desirable capabilities like reasoning, analysis, and grounding in real-world facts.
Many people fall into the trap of hoping LLMs will handle complex reasoning, read their minds to infer intent, write flawless code on the first try, or magically handle scheduling and workflow automation. But today's models simply aren't reliable for these types of tasks. Outside of linguistic interfaces, LLMs have significant limitations that constrain their usefulness. They are notoriously prone to "hallucinations" - confidently generating false or nonsensical information that can be hard to detect. They struggle with maintaining coherence over long time horizons or complex multi-step tasks.
So when architecting LLM-powered products, it's crucial to be ruthlessly realistic about what the models can and can't do. Focus on leveraging LLMs for what they excel at - engaging with users through natural language - and thoughtfully architect supporting systems to handle any downstream tasks. Be prepared to break down complex workflows into atomic steps, provide extensive context and guidance, and double-check outputs for factual and logical consistency. By playing to the strengths of LLMs while proactively addressing their limitations, you can design products that harness their power while mitigating their shortcomings.
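As a hedged sketch of this decomposition, the example below breaks a support workflow into atomic steps with an explicit consistency check; the step prompts and the llm callable are illustrative placeholders, not a prescribed pipeline.

```python
# Sketch: decompose a workflow into atomic, individually checked steps.
# The step prompts and the llm callable are illustrative placeholders.
from typing import Callable

def run_workflow(email: str, llm: Callable[[str], str]) -> str:
    # Step 1: extract the one piece of information the next step needs.
    request = llm(f"Extract the customer's request from this email:\n{email}")
    # Step 2: generate the draft from that narrow, explicit context.
    draft = llm(f"Draft a polite reply addressing this request:\n{request}")
    # Step 3: double-check the output for factual and logical consistency.
    verdict = llm(
        "Does the reply below address the request factually and logically? "
        f"Answer YES or NO first.\n\nRequest:\n{request}\n\nReply:\n{draft}"
    )
    if not verdict.strip().upper().startswith("YES"):
        # Route failures to a human or a retry loop rather than shipping
        # an unchecked LLM output downstream.
        raise ValueError(f"Consistency check failed: {verdict}")
    return draft
```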
Further Reading:
Large Language Models: Reasoning Capabilities and Limitations
LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks
6. Create Robust Feedback Modules
In academia, scientific papers undergo peer review, where different experts independently critique the work before publication. Borrowing from this process, a powerful paradigm for building self-improving AI systems is to train multiple models that play distinct roles akin to authors and reviewers.
In this setup, you might use a generative model to produce some output, like a dialogue response, a document summary, or a piece of code. You then use a separate "critic" model to evaluate the quality of that output along various dimensions like factual accuracy, logical coherence, style, and tone. Crucially, these models are trained independently, so the critic acts as an objective assessor, not just a rubber stamp.
This is particularly important in agentic systems, where the goal is for the system to continuously monitor its performance, reflect on the outcome, and try again with an improved likelihood of success. This is a crucial ingredient for the level of autonomy we seek in agents. Therefore, implementing highly reliable, accurate, and trustworthy feedback sub-systems (aka "reward mechanisms" in the context of RL) is a big part of success in building an agentic product. You can even equip the agents with ensembles of critic tools (including, but not limited to, occasionally asking for human input) to cover different facets of evaluation, like long-term coherence vs. individual response quality.
The key benefit of this architecture is that it provides a scalable mechanism for quality control and continuous improvement that doesn't rely solely on human judgement. That said, it's not a total replacement for human evaluation - you'll still want to spot check the system's outputs, especially in the early stages. And there's an art to designing the right training setup and reward functions to get useful feedback while avoiding degenerate equilibria between the generator and critic. But when done right, this approach can imbue your AI systems with the benefits of peer review, making them more robust and self-correcting over time and thereby achieving a higher level of autonomy.
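For illustration, a generate-critique-revise loop might look like the following sketch; the generator and critic callables stand in for independently trained (or at least independently prompted) models, and the "SCORE: x/10" convention is an assumption, not a standard protocol.

```python
# Sketch of a generator/critic loop; both callables are placeholders for
# independently trained (or prompted) models. The "SCORE: x/10" format is
# an assumed convention, not a standard API.
import re
from typing import Callable

def generate_with_review(
    task: str,
    generator: Callable[[str], str],
    critic: Callable[[str], str],
    threshold: int = 8,
    max_rounds: int = 3,
) -> str:
    output = generator(task)
    for _ in range(max_rounds):
        critique = critic(
            f"Task: {task}\nOutput: {output}\n"
            "Critique for accuracy, coherence, and tone. End with SCORE: x/10."
        )
        match = re.search(r"SCORE:\s*(\d+)", critique)
        if match and int(match.group(1)) >= threshold:
            return output  # good enough according to the critic
        output = generator(
            f"Task: {task}\nPrevious attempt: {output}\n"
            f"Reviewer feedback: {critique}\nRevise accordingly."
        )
    return output  # last attempt; consider routing to human review here
```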
Further Reading:
Large Language Models Cannot Self-Correct Reasoning Yet (DeepMind)
LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks
7. Actually Improve Users' Productivity
It's easy to get caught up in building flashy AI demos that showcase the latest and greatest model capabilities. But at the end of the day, the true measure of success for your LLM products is how well they improve users' lives in tangible ways. In particular, since the most promised benefit of LLM agent tools is productivity, it's critical to honestly assess whether you're making people more efficient at important tasks, or just giving them one more thing to babysit.
To deliver real productivity gains, you need to deeply understand your target users' existing workflows and ruthlessly prioritize AI features that will save them time and effort. Productivity does not equal saving time alone; rather, it implies saving unwanted effort. Therefore, your product has to address workflows that your users:
Spend significant time on, AND
Do not want to spend that time doing.
Approach every new capability through the lens of "how does this concretely make my user's job more efficient and effective?" If you can't quantify the impact, chances are it's not worth building.
Another important nuance here is that builders sometimes get excited about automating away the parts of the job that people actually enjoy, rather than the parts they hate doing. While such a product might theoretically improve productivity, the psychological barrier to using it will make it backfire.
8. Think deeply about integrations into users' tools and workflows
To drive successful adoption, your AI product needs to fit seamlessly into users' existing workflows and tool chains. No matter how impressive your models are under the hood, if using your product feels like a clunky, disjointed experience, people simply won't bother. On the flip side, if your product slots nicely into the tools and processes users are already using day-to-day, you'll dramatically lower the barriers to adoption and make your AI feel like a natural extension of users' workflows. The last thing people want is yet another siloed app to switch back and forth from. Instead, look for opportunities to embed your AI capabilities right within the apps users already live in, whether that's their email client, messaging platform, note-taking tool, or code editor.
By meeting users where they already work and focusing relentlessly on concrete effort savings, you can ensure your AI product isn't just a novelty, but an essential part of people's daily flow. And those efficiency gains add up fast - saving someone a few minutes or clicks on a task they do 10 times a day is a game changer. Keep humans at the center, measure what matters, and optimize for their productivity above all else.
To get this right, you need to invest significant time upfront to deeply understand how your target users currently work and what their key pain points are. This means going beyond surface-level interviews and surveys to really immerse yourself in their world. Shadow them as they go about their tasks, paying close attention to all the tools, systems, and collaborators they interact with along the way. Map out their end-to-end workflows to identify bottlenecks, inefficiencies, and opportunities for AI to streamline the process.
Armed with this deep understanding, architect your AI product to integrate with the specific tools your users depend on, with seamless bridges for importing and exporting data, triggering actions, and collaborating with teammates. In many cases, this means delivering your AI capabilities as plugins or add-ons right within users' primary tools, instead of forcing them to switch to a separate app.
When well executed, this deep integration approach makes your AI product feel less like a tool and more like an intelligent assistant that's always there in the flow of work, ready to lend a hand. Users don't have to disrupt their normal processes or learn new interfaces - they can simply tap into the power of AI whenever and wherever they need it. And that frictionless experience is the key to making AI an indispensable part of people's daily lives.
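As a small illustration of meeting users inside an existing tool, here is a hedged sketch that surfaces an LLM pipeline as a Slack slash command; the Flask webhook and Slack's form-encoded slash-command payload are real mechanisms, while run_pipeline is a placeholder for your own backend (production code would also verify Slack's signing secret).

```python
# Hedged sketch: surfacing an LLM pipeline as a Slack slash command.
# Flask and Slack's slash-command mechanics are real; run_pipeline is a
# placeholder, and production code must verify Slack's signing secret.
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_pipeline(query: str) -> str:
    """Placeholder for your actual LLM pipeline."""
    return f"(answer for: {query})"

@app.route("/slack/command", methods=["POST"])
def slash_command():
    # Slack sends slash-command payloads as form-encoded POST data.
    query = request.form.get("text", "")
    # "ephemeral" shows the reply only to the user who invoked the command.
    return jsonify({"response_type": "ephemeral", "text": run_pipeline(query)})
```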
Further Reading:
Change Management for LLM-based Products
9. Design for humans!
Amid the excitement around LLMs and other AI breakthroughs, it can be tempting to get carried away imagining a world where machines handle every task and decision. But the reality is, humans are going to remain an essential part of the equation for the vast majority of use cases for the foreseeable future. Even the most sophisticated AI systems today are narrow in scope and brittle in the face of edge cases. They are powerful tools to be wielded by humans, not wholesale replacements for human judgement.
As such, it's critical that we keep real human needs, behaviors, and constraints at the center of our AI product development process. At every step along the way, we need to be testing our products with actual users, seeing how they integrate (or don't) into their real-world contexts, and shaping the user experience accordingly. Pretty model performance numbers in a lab setting are meaningless if they don't translate into tangible benefits for humans in the messy real world.
Prioritizing the human element means investing deeply in thoughtful UX design, extensive user testing, and rapid iteration based on feedback. It means providing robust, accessible user education to help people understand both the capabilities and limitations of the AI systems you're putting in their hands. And it means proactively considering and mitigating the potential risks and unintended negative consequences your product could have in people's lives.
Ultimately, our north star as AI product builders should be empowering humans to do their best work. We have an incredible opportunity to usher in a new era of productivity and creativity, but it will require the hard, patient work of aligning powerful AI capabilities with real human needs. If we keep humans at the center and measure success by the positive impact we have in their lives, not just our model metrics, we can build an AI-powered future that brings out the best in both machines and people. The road ahead won't be easy, but the destination will be more than worth it. So stay focused on those human needs, keep iterating, and let's build the future together!
Conclusion
Building successful multi-agent LLM products requires a multidisciplinary approach that blends technical chops, product sensibilities, and proactive ethical responsibility. The 9 principles we've explored provide a comprehensive roadmap for navigating this complex landscape.
By focusing on building proprietary data moats, maintaining modularity in your architecture, relentlessly evaluating and iterating on product performance, and deeply integrating with users' workflows, you'll be well on your way to creating AI systems that deliver real value. And by keeping humans at the center of the process and proactively addressing potential negative impacts, you can ensure that value is achieved responsibly and sustainably.
But while these principles provide a solid foundation, the reality is that building game-changing AI products is hard. It requires grappling with cutting-edge research, wrangling messy real-world data, and constantly iterating in the face of shifting user needs and expectations. There will be setbacks, dead-ends, and pivots along the way. The key is to stay focused on your north star of empowering users, stay humble in the face of complexity, and keep pushing forward one experiment at a time.
The potential for LLMs and multi-agent systems to transform how we live and work is immense - but it won't be realized without diligent, human-centric innovators translating the raw capabilities into meaningful products. By internalizing these 9 principles and tenaciously applying them in practice, you'll be at the vanguard of this exciting frontier. So go forth and build the future!