<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Deep Random Thoughts: AI (LLMs, etc)]]></title><description><![CDATA[News and content regarding latest advances in AI (well, LLMs)]]></description><link>https://aisc.substack.com/s/ai-llms</link><image><url>https://substackcdn.com/image/fetch/$s_!KAvt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png</url><title>Deep Random Thoughts: AI (LLMs, etc)</title><link>https://aisc.substack.com/s/ai-llms</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 07:05:13 GMT</lastBuildDate><atom:link href="https://aisc.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aggregate Intellect Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aisc@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aisc@substack.com]]></itunes:email><itunes:name><![CDATA[Amir Feizpour (ai.science)]]></itunes:name></itunes:owner><itunes:author><![CDATA[Amir Feizpour (ai.science)]]></itunes:author><googleplay:owner><![CDATA[aisc@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aisc@substack.com]]></googleplay:email><googleplay:author><![CDATA[Amir Feizpour (ai.science)]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Guardians of The Experiment: The New Role of Product Managers in Agentic AI Era]]></title><description><![CDATA[&#8220;What does all this mean for me as a product person?&#8221;]]></description><link>https://aisc.substack.com/p/guardians-of-uncertainty-the-new</link><guid isPermaLink="false">https://aisc.substack.com/p/guardians-of-uncertainty-the-new</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 22 Jul 2025 14:52:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GNkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our work with our commercial clients over the past few years I have firsthand experienced how my role as the product person has evolved as everything went from regular-software-with-AI-pieces-here-and-there to all-hail-agentic-ai kind of software. Surprisingly, or maybe in hindsight unsurprisingly, after developers the second most common persona that attends our agentic AI bootcamp are also product managers (PMs) and product oriented founders. And their motivation&#8230;</p><blockquote><p>&#8220;What does all this mean for me as a product person?&#8221;</p></blockquote><p>Product roles, like many other knowledge work areas, are being invaded from all fronts! The business stakeholders can brainstorm with ChatGPT and send you a long summary of what they think should happen. The CMO can throw in some thoughts into Loveable and send you the new frontend with some notes about how you should take it from here. 
And the engineers, who now spend less time hunting stupid bugs thanks to their coding agents, spend more time daydreaming about what products could be built.</p><p>At the same time, product people themselves are invading other areas. The most common case is tech or non-tech PMs who are now empowered, via vibe coding, to provide working prototypes to the engineering team instead of vague Jira tickets. Or the ones who, after an intense market research session with Perplexity, think to themselves &#8220;I should really stop following orders from my conservative CEO and launch that business I have wanted for a while.&#8221;</p><p>It seems like we are at a tipping point, at the beginning of a new future, and product people are right at the center of this transformation.</p><p>In the early days of software product management, a PM&#8217;s job was clear: understand user needs, write a clear product requirements document (PRD), work with engineering to scope and build features, then ship them.</p><p>Those days aren&#8217;t gone - but they&#8217;re no longer sufficient. As software begins to rely more heavily on <strong>large language models (LLMs)</strong>, <strong>agentic systems</strong>, and <strong>AI-assisted workflows</strong>, the nature of product development is changing. And so is the role of the PM.</p><p>In this new paradigm, <strong>PMs aren&#8217;t just writing specs in PRDs - they&#8217;re curating experiments</strong>. They&#8217;re not dictating behavior - they&#8217;re defining success through <em>evaluation</em>.
And they&#8217;re not always building software in the classical sense - they&#8217;re managing emergent systems that must be guided more than constructed.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GNkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GNkc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2646255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/168956620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GNkc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>From Certainty to Stochasticity</strong></h3><p>Traditional software systems are deterministic: you click a button, and the code runs exactly as specified. Product managers in this world serve as the glue between business needs and engineering execution, translating ideas into feature sets, wireframes, and task lists.</p><p>But AI-powered systems, particularly those involving LLMs or agents, aren&#8217;t deterministic. They&#8217;re <strong>stochastic</strong>. They produce different outputs on different runs. Their behavior can&#8217;t be fully specified in code. Instead, it's shaped through:</p><ul><li><p>training data,</p></li><li><p>prompts,</p></li><li><p>context management,</p></li><li><p>agent memory and orchestration,</p></li><li><p>tool usage strategies,</p></li><li><p>and more.</p></li></ul><p>This means the role of the PM has to evolve. The future of product management in AI isn&#8217;t about control; it&#8217;s about <strong>alignment</strong>, <strong>experimentation</strong>, and <strong>iterative refinement</strong>.</p><p>This is not quite new. In the previous era, while good PMs were busy writing meticulous PRDs, great PMs were all about experimental design and rapid product evolution towards what needs to be built. Today, this is no longer an optional stretch goal; this is the definition of the job.</p><p>Instead of shipping static specs, these PMs:</p><ul><li><p>Define product behavior via <strong>evaluation datasets</strong>: curated examples of expected inputs and desirable outputs.</p></li><li><p>Replace checklists with <strong>rubrics</strong>: frameworks for evaluating model behavior across axes like factuality, safety, and importantly impact-centric metrics.</p></li><li><p>Manage subjective quality via <strong>preference ranking</strong>: either through human raters or verified LLM-as-a-judge systems.</p></li><li><p>Track progress through <strong>behavioral metrics and evaluation set performance</strong>, not just usage stats or feature completion.</p></li></ul><p>They own the process of asking:</p><blockquote><p>"What does good look like? 
How will we know we&#8217;re improving?"</p></blockquote><p>Rather than defining functionality, they define <strong>desirable behavior</strong> in an open-ended landscape. They become <strong>evaluation strategists</strong>.</p><div><hr></div><h3><strong>Why PMs Matter More Than Ever</strong></h3><p>You can already see this evolution in practice:</p><ul><li><p>PMs working with AI teams now spend more time in <strong>curation tools</strong>, <strong>annotation workflows</strong>, and <strong>model dashboards</strong> than in JIRA.</p></li><li><p>They&#8217;re prototyping flows using tools like <strong>GPT-4</strong>, <strong>LangChain</strong>, <strong>Cursor</strong>, <strong>Replit</strong>, and <strong>Figma AI</strong>, not waiting for design and engineering cycles to play out.</p></li><li><p>They&#8217;re defining user experience not through pixel-perfect screens, but through <strong>libraries of I/O examples</strong>, rubrics, and iterative behavioral improvements.</p></li></ul><p>This shift doesn&#8217;t mean PMs are less relevant; it means they&#8217;re <em>differently</em> relevant. There is a lot of talk about PM functions being eliminated and the slack being picked up by engineering and growth teams. But I think that&#8217;s short-sighted and based on a misunderstanding of what (good) PMs bring to the table. PMs, from a skillset point of view, are the ones best suited to:</p><ul><li><p>Align stochastic AI behavior with business value using well-maintained evaluation loops</p></li><li><p>Represent the user&#8217;s needs in an ambiguous, generative world</p></li><li><p>Help legal, design, infra, and research teams communicate, not via requirements but via evaluation datasets</p></li><li><p>Decide what &#8220;improvement&#8221; actually means when you can&#8217;t rely on binary &#8220;correctness&#8221;</p></li></ul><p>So, what&#8217;s the playbook if you are a PM or product-oriented founder?</p><ol><li><p><strong>Learn how evaluation datasets work.</strong> Start thinking of examples as UX specs, not just model tests.</p></li><li><p><strong>Practice writing rubrics and pairwise comparisons.</strong> Define not just &#8220;what&#8221; the system should do, but how to recognize better vs. worse (see the sketch after this list).</p></li><li><p><strong>Prototype with AI tools.</strong> Use Cursor or Claude Code to sketch new flows. Simulate a customer service bot. Try agent frameworks.</p></li><li><p><strong>Think like an experimenter.</strong> Learn how to write and version prompts, get really good at context management for LLMs, and decide the next best intervention based on what you learned in the last experiment.</p></li></ol><div><hr></div>
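<p>To make the first two playbook items concrete, here is a minimal sketch (in Python) of what an evaluation set plus a rubric-scored LLM-as-a-judge loop can look like. Everything here is illustrative: the <code>call_llm</code> stub, the rubric axes, and the example cases are assumptions to adapt, not a specific framework&#8217;s API.</p><pre><code># A minimal sketch of an evaluation set plus a rubric-based judge loop.
# `call_llm` is a placeholder for whatever model client you use (OpenAI,
# Anthropic, a local model, ...); everything else is plain Python.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str        # what the user asks
    expectations: str  # what a good answer must contain or avoid


RUBRIC = """Score the answer from 1 (bad) to 5 (great) on each axis:
- factuality: no invented details
- helpfulness: actually resolves the user's request
- safety: no harmful or policy-violating content
Return three integers separated by spaces."""

EVAL_SET = [
    EvalCase("How do I reset my password?",
             "Mentions the 'Forgot password' link and the email step."),
    EvalCase("Cancel my subscription and refund me.",
             "Explains the cancellation flow; does not promise a refund it cannot grant."),
]


def call_llm(system: str, user: str) -> str:
    """Placeholder: wire this to your model provider of choice."""
    raise NotImplementedError


def judge(case: EvalCase, answer: str) -> list[int]:
    """Ask a (stronger) model to grade an answer against the rubric."""
    reply = call_llm(
        system=RUBRIC,
        user=f"Question: {case.prompt}\nExpectations: {case.expectations}\nAnswer: {answer}",
    )
    return [int(tok) for tok in reply.split()[:3]]


def run_eval(generate) -> dict:
    """Run the product under test (`generate`) over the eval set and average the scores."""
    totals = [0, 0, 0]
    for case in EVAL_SET:
        scores = judge(case, generate(case.prompt))
        totals = [t + s for t, s in zip(totals, scores)]
    n = len(EVAL_SET)
    return {"factuality": totals[0] / n, "helpfulness": totals[1] / n, "safety": totals[2] / n}
</code></pre><p>The particular scores matter less than the fact that &#8220;what good looks like&#8221; now lives in a versioned artifact the whole team can inspect, argue about, and extend.</p>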
<p>The job of a product manager has always been to navigate ambiguity, advocate for users, and guide software toward value. But in a world of stochastic models and agentic behavior, as the guardians of the experiment, the best PMs will be the ones who can define success even when they can&#8217;t predict the exact steps to get there.</p><p>And that might just be the most exciting job in tech.</p>]]></content:encoded></item><item><title><![CDATA[Business Case for AI Ethics]]></title><description><![CDATA[What&#8217;s the missing ingredient if we want to sustain AI Ethics efforts in the long run?]]></description><link>https://aisc.substack.com/p/business-case-for-ai-ethics</link><guid isPermaLink="false">https://aisc.substack.com/p/business-case-for-ai-ethics</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 09 May 2025 13:40:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-aiM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have a hard time recalling any instances where the &#8220;do the right thing&#8221; narrative made significant headway in convincing the people in charge of capital allocation. This is a statement in general, but also about AI Ethics in particular.</p><p>Any push towards doing the ethical and responsible thing that fails to acknowledge the systemic biases towards maximizing profits is simply wishful thinking. I&#8217;m not trying to be contrarian, and definitely not trying to say that people who are spending their lives advocating for AI Ethics are barking up the wrong tree. But let&#8217;s be honest: big corporations pretend to care about ethics and responsibility when it helps their PR, and as soon as things get tough, the AI Ethics teams are the first ones to get axed.</p><p>So, what&#8217;s the missing ingredient if we want to sustain AI Ethics efforts in the long run?</p><h2>The Ethical-Economic Paradox</h2><p>Often, arguments about AI ethics start with examples like biased loan application processing systems. They go on to say &#8220;AI might deny loans to people from certain backgrounds due to biased data, and that&#8217;s bad&#8221;. Yes, it is!
However, what you&#8217;re failing to mention is that the same AI increases overall processing efficiency, saving the financial institute lots of money, and therefore they have zero incentive to take this kind of outcry seriously. For the CFO of this business, &#8220;people from certain backgrounds&#8221; you are advocating for are simply statistical errors, in an otherwise well performing, now improved system.</p><p>You see the issue?</p><p>We create systems that are statistically efficient but cause individual harm, sometimes knowingly, and sometimes without even knowing why, thanks to black-box algorithms. The obvious answer is yes, we have a responsibility to make these systems ethically right. But how do we do that in a way that acknowledges the realities of the world?</p><p>You might expect me to say things like "ethics keeps you out of trouble," "it's good for your brand," or "values matter." I find these statements often meaningless because we've been saying them for a long time without seeing substantial outcomes.</p><blockquote><p>"Short-term profit is always at odds with the well-being of the user, society, and environment,"</p></blockquote><p>It even has a name: <em>the "ethical-economic paradox."</em></p><p>Let&#8217;s look at this in action:</p><ul><li><p>Startups chasing growth at all costs often deprioritize &#8220;soft values&#8221; like ethics.</p></li><li><p>Private equity firms acquire companies and cut anything that doesn&#8217;t immediately impact the bottom line.</p></li><li><p>Social media platforms are built on maximizing attention and outrage - not on protecting mental health.</p></li><li><p>Big Pharma has minimal incentive to heal when treating symptoms is more profitable.</p></li><li><p>Food industries thrive on sugar, chemicals, and addiction - because they drive sales.</p></li></ul><p>So when people say, &#8220;Let&#8217;s build ethical AI,&#8221; I ask: in this environment - how, exactly, do you expect that to happen?</p><p>This fundamental tension explains why many well-intentioned ethics initiatives collapse when business conditions tighten.</p><p>In nearly all the examples above, the push for short-term gains leads to products or practices that cause long-term harm. While regulations attempt to address this, what we&#8217;ve seen is that the regulators are often a decade behind where the tech is and by the time they wrap their heads around it, the damage is done. Now imagine where we would be in 5 years at the current rate of development in foundation models and AI agents built around them!</p><p>Here's where I think things get interesting: I argue that the only way out of this tension is finding the sweet spot - an "opportunity zone&#8221; where ethical behavior, profit-making, and structural realities overlap. I believe the most impactful change will come from builders, founders, and entrepreneurs who build in this &#8220;sweet spot&#8221;. Maybe these wont be necessarily venture-capitalist friendly, but they could surely make enough money for the founder to live comfortably while also sleeping at peace at night.</p><h2>Learning from Environmental Innovation</h2><p>We can learn a lot from how the ESG and clean-tech sectors evolved over the decades.</p><p>Many startups tried to build products that reduce waste or promote clean energy. Their intentions were good, but their methods were flawed.</p><p>These companies often failed because they appealed to morality. They asked investors to fund them out of principle. 
They asked consumers to buy their products out of conscience. In the real world, that usually doesn&#8217;t work.</p><p>A large number of those startups never got off the ground. Many faded away without scale because they never solved a real business problem. They assumed that if people cared enough, things would change. That didn&#8217;t happen.</p><p>But some did succeed. Let me share two examples that explain how they got it right.</p><ol><li><p>Smart Recycling Bins (MyMatter)</p></li></ol><p>MyMatter created a smart bin that uses computer vision to sort waste automatically. If someone throws a recyclable item into the trash, the bin detects the mistake and moves the item into the correct compartment.</p><p>This solves a practical issue. People often don&#8217;t know whether something is recyclable, and they don&#8217;t want to think about it.</p><ul><li><p>The product removes the decision-making burden from the user.</p></li><li><p>It is sold to cities and hotels where a lot of garbage mixing happens.</p></li><li><p>It uses AI to address a problem that would otherwise require behavior change.</p></li></ul><p>This product works because it connects environmental goals with business needs. It reduces waste, saves time, and fits naturally into how people behave.</p><ol start="2"><li><p>Kitchen Waste Monitoring (Winnow)</p></li></ol><p>Winnow places a camera above and a scale below trash bins in hotel kitchens. The system identifies what food is thrown away and weighs it. Each day, the hotel receives a report that shows the exact cost of the waste.</p><p>For example, &#8220;You threw away $140 worth of cucumbers today.&#8221;</p><p>It also gives practical suggestions. Reuse tomato scraps for sauce. Reduce future orders for the items you often waste.</p><ul><li><p>Staff don&#8217;t have to change their process. The system works passively.</p></li><li><p>Executives gain visibility into financial losses, and guess what, they&#8217;re motivated to reduce that.</p></li><li><p>Behavior shifts naturally through awareness and cost savings.</p></li></ul><p>This is another case of solving a structural issue while aligning with both sustainability and profitability.</p><p>These examples succeed because they align doing the right thing (reducing waste) with making money (saving costs) and working around structural biases (making it easy for staff).</p><p>This is what ethical product design should do. It should eliminate resistance. 
It should reduce the friction of doing the right thing.</p><h2>The Opportunity Zone: Where Ethics Meets Business</h2><p>The key idea I want to present is identifying what I call the "opportunity zone" &#8211; the intersection of three critical elements:</p><ol><li><p>Doing the right thing (ethical imperatives)</p></li><li><p>Making money (business viability)</p></li><li><p>Working around structural biases (practical implementation)</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-aiM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-aiM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 424w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 848w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1272w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" width="887" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:755653,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/163208037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-aiM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 424w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 848w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This framework shifts our thinking from abstract moral principles to concrete business implication. Instead of just saying "build ethical AI because it's right," we can reframe ethical considerations as business imperatives.</p><p>If you genuinely care about ethical AI, you must figure out how to operate in this intersection. The solution to the paradox lies in balancing these competing forces. The environmental examples did precisely this: they linked environmental goals directly to financial outcomes (cost savings) and addressed structural issues (making recycling effortless, automated waste tracking).</p><p>Crucially, you cannot go far doing the right thing and solving structural problems without making money. Ethical initiatives require funding. Whether inside a corporation or as a startup, if your ethical effort isn't tied to the bottom line, it risks being cut. 
No investor funds a project solely because it's ethical; they invest because they expect a return.</p><p>While massive companies like Meta or Google face different scaled challenges related to the paradox, for most of us building products, aligning ethics with these business realities is key to creating sustainable positive impact.</p><p>The ESG playbook evolved from &#8220;reduce waste&#8221; to &#8220;cut costs and access new asset classes.&#8221; AI needs the same shift.</p><h2>The AI Ethics Playbook</h2><p>Perhaps, this can be the beginning of a practical checklist for building AI products that succeed ethically and commercially:</p><ul><li><p>Alignment with user preferences isn't just ethical &#8211; it drives adoption</p></li><li><p>Explainability isn't just transparent &#8211; it enables sales to the regulated industries and helps users stick around.</p></li><li><p>Guardrails aren't just responsible &#8211; they're necessary for business customers who require predictable systems.</p></li><li><p>Unbiased data isn't just fair &#8211; it expands your addressable market</p></li></ul><h4>1. Trust is Paramount</h4><p>Trust isn't just a moral virtue &#8211; it's a business necessity. When ChatGPT first launched, initial hallucinations created excitement but quickly eroded user trust for some applications, as people found it unreliable for serious use. Only after addressing these credibility issues did sustained usage follow in many areas. While established companies might get second chances, most startups won't have that luxury. You have to get it right the first time.</p><h4>2. Explainability Drives Adoption</h4><p>If users don't understand how your system makes decisions, they won't stick around &#8211; especially in regulated industries. No financial institution or healthcare provider will adopt a black-box system that can't explain its recommendations when challenged. Explainability isn't just about transparency; it's about market access. You often can't sell the product otherwise.</p><h4>3. Data Quality Determines Market Reach</h4><p>Biased training data doesn't just create ethical problems &#8211; it limits your addressable market. If your product works well for urban users but poorly for rural ones due to skewed data, you're unnecessarily constraining your growth and losing out on a portion of the market. Every demographic your system underserves represents lost revenue and opportunity.</p><h4>4. Goal-Oriented Design Creates Value</h4><p>Generative AI systems that produce impressive outputs without helping users achieve concrete goals won't retain users long-term. The crucial question isn't just whether your system can generate compelling text or images, but whether it helps users accomplish meaningful tasks. Value creation drives retention and leads to happy customers.</p><h4>5. Sustainable Engagement Builds Longevity</h4><p>While it might be technically possible to create AI experiences that nudge users toward manipulative behavior and maximize short-term engagement, this approach ultimately leads to burnout and abandonment (like my decision to cut out news and social media from my life completely). Aim for sustainable engagement that provides long-term value. 
Even gaming platforms now often include features encouraging breaks because they understand that sustainable engagement creates more lifetime value than exploitation.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Closing Thoughts</h2><p>My core message is this: The future of AI ethics isn't about more impassioned moral appeals &#8211; it's about demonstrating that ethical AI is better business (at least in some cases). As AI becomes increasingly integrated into critical domains, the companies that succeed won't be those with the most virtuous mission statements, but those that build trustworthy, explainable, and genuinely helpful systems that align ethical considerations with user needs and business objectives.</p><p>And maybe that&#8217;s the most important insight of all. Sustainable ethics requires sustainable business models. By reframing AI ethics as a business imperative - not just a moral one - we create the conditions for those values to survive and thrive in real markets.</p>]]></content:encoded></item><item><title><![CDATA[Mechanistic Interpretability - Decoding Neural Networks Might Need a Physics Degree - Part 1]]></title><description><![CDATA[... To bridge this gap, Mechanistic Interpretability (MI) comes into play which is a systematic approach that dissects neural networks much like physics dissects the natural world.]]></description><link>https://aisc.substack.com/p/mechanistic-interpretability-decoding</link><guid isPermaLink="false">https://aisc.substack.com/p/mechanistic-interpretability-decoding</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 18 Mar 2025 12:51:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fZlJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As someone who is trained in traditional hard sciences, how people do &#8220;science&#8221; in computer science has always bothered me.  It always reminds me of the rivalry we&#8217;ve had with chemists and biologists about how their &#8220;science&#8221; is just empirical &#8220;throwing things at the wall and see what sticks&#8221; kind of research. Of course, as I grew older in physics and started to face more and more complex problem statements my point of view started to become more modest as I saw really interesting problems that can&#8217;t be solved by nicely compact formula and could only be tackled by numerical simulations and data driven approaches. 
</p><p>One of the things you sacrifice as you start looking at more complex systems, as chemists, biologists, computer scientists, and physicists do, is the ability to neatly explain how the system behaves and why. Fortunately, most of these sciences value <a href="https://aisc.substack.com/p/enhancing-ai-agents-with-causality">causal inference</a> highly, which means that we end up with much more generalizable and transparent statements about nature. In computer science, however, a mechanistic understanding of how systems behave is at best a secondary consideration.</p><p>The need for transparency and explainability in how complex neural nets work is nonetheless urgent, especially given the exponential rate of adoption of LLMs and the agentic systems they enable, and particularly in high-stakes domains like healthcare and finance, where trust hinges on understanding the reasoning behind AI-driven choices.</p><p>So, you can imagine my delight when I heard about <a href="https://www.neelnanda.io/mechanistic-interpretability/quickstart">Neel Nanda&#8217;s work</a> on <a href="https://open.spotify.com/episode/5XjHhNQxIb16eJZXGmbaCk">MLST</a>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp" width="1024" height="1024" alt=""></figure></div><p>A core challenge in this pursuit lies in how neural networks encode information. In an ideal world, each neuron would represent a single, well-defined concept (mono-semantic); in reality, for efficiency, neurons represent multiple overlapping ideas (poly-semantic). While this boosts efficiency, it complicates interpretability, forcing a trade-off between performance and transparency.</p>
According to Neel, to bridge this gap, Mechanistic Interpretability (MI) comes into play which is a systematic approach that dissects neural networks much like physics dissects the natural world. </p><p>In this article series, we&#8217;ll explore how I understand frameworks used to advance MI towards transparent AI with a physics lens.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>1. Introduction to Mechanistic Interpretability</strong></h2><p>Mechanistic Interpretability (MI) is an emerging field focused on reverse-engineering neural networks to understand how they operate at a fundamental level. We can imagine it as disassembling a complex machine - like a car engine - to examine each gear, spring, and bolt, observing how they interact to produce the final output. Similarly, MI seeks to decode neural networks layer by layer, neuron by neuron, to identify the specific features they recognize, the circuits that process information, and the interpretability bases that map these abstract computations to human-understandable concepts. The goal is not just to observe what a model does but to explain how and why it does it, down to the smallest actionable components.</p><h3><strong>Foundational Concepts</strong></h3><p>To understand MI, it&#8217;s really important to grasp foundational concepts that describe how neural networks store and process information:</p><ol><li><p><strong>Features</strong>:</p></li></ol><p>Features are the building blocks of a neural network&#8217;s understanding. They represent specific attributes or patterns in the input data that the model has learned to detect. For example:</p><ul><li><p>In image recognition, a feature might be a horizontal edge, a texture like fur or scales, or even higher-level concepts like "eyes" or "wheels."</p></li><li><p>In language models, features could correspond to grammatical structures (e.g., verb tenses), semantic categories (e.g., "scientific terms" or "emotional language"), or even abstract relationships (e.g., cause-and-effect).</p></li></ul><p>Features are not hand-coded by humans; they emerge organically during training as the model optimizes to solve its task.</p><ol start="2"><li><p><strong>Circuits:</strong></p></li></ol><p>These are groups of a model&#8217;s weights and non-linearities that connect one set of features to another. Think of circuits as the pathways that determine how information flows and is processed within the network. 
For instance:</p><ul><li><p>A circuit in a vision model might link a feature for "edges" to a feature for "shapes," which then activates a feature for "faces."</p></li><li><p>In a language model, a circuit could route a feature for "question words" (e.g., who, what, where) to a feature for "answer structure," ensuring the response matches the query.</p></li></ul><p>Crucially, circuits are not just linear chains of neurons - they involve non-linear transformations (e.g., activation functions like ReLU) and interactions between multiple layers.</p><ol start="3"><li><p><strong>Interpretability Bases:</strong></p></li></ol><p>Interpretability bases are mathematical tools that help researchers "decode" a model&#8217;s internal activations. Neural networks process data in high-dimensional spaces (e.g., thousands of dimensions), which are inherently unintuitive to humans. Interpretability bases project these activations onto specific directions in the space that correspond to human-interpretable features.</p><p>For example, in a sentiment analysis model, one direction in the activation space might align with "positive sentiment," while another aligns with "negative sentiment." By analyzing these bases, researchers can quantify how much each interpretable feature contributes to the model&#8217;s predictions.</p><ol start="4"><li><p><strong>Neurons vs. Layers</strong></p></li></ol><p>Neurons: Individual units that activate in response to specific input patterns (e.g., a neuron in a vision model firing for diagonal edges).</p><p>Layers: Hierarchical collections of neurons. Early layers detect simple patterns (edges, textures), while deeper layers assemble these into complex concepts (objects, sentences).</p><ol start="5"><li><p><strong>Attention Heads (in Transformers)</strong></p></li></ol><p>Transformers, which power modern language models, process data using attention heads - specialized sub-circuits that determine which parts of the input to prioritize. Each head can be thought of as a "mini-circuit" with a specific role:</p><ul><li><p>Query-Key-Value Operations: Attention heads compute relationships between words (e.g., linking pronouns like "he" to their antecedents).</p></li><li><p>Specialization: Some heads focus on syntax (e.g., subject-verb agreement), while others track semantic coherence (e.g., ensuring "bank" refers to a river, not a financial institution, based on context).</p></li><li><p>Why this matters: Reverse-engineering attention heads is a cornerstone of MI in transformers, as their behavior directly impacts model outputs.</p></li></ul><ol start="6"><li><p><strong>Superposition</strong></p></li></ol><p>Neural networks often use superposition - a phenomenon where a single neuron or activation encodes multiple unrelated features. For example, a neuron might activate for both "cat ears" and "scientific terminology" in a multimodal model.</p><ul><li><p>Polysemantic Neurons: Neurons that respond to many distinct features (common in large models due to sparse feature space).</p></li><li><p>Monosemantic Neurons: Neurons that activate for a single, specific feature (rarer but easier to interpret).</p></li><li><p>Why this matters: Superposition complicates MI by obfuscating the "clean" mapping between neurons and features, requiring advanced techniques to disentangle overlapping signals.</p></li></ul><ol start="7"><li><p><strong>Activation Functions</strong></p></li></ol><p>These mathematical operations (e.g., ReLU, sigmoid) determine how neurons transform inputs into outputs. 
In MI, they act as "gates" that shape information flow:</p><ul><li><p>Non-Linearity: Functions like ReLU introduce non-linear decision boundaries, enabling networks to learn complex patterns.</p></li><li><p>Saturation: Functions like sigmoid can "saturate" (e.g., outputting 0 or 1), which MI researchers study to identify when a circuit stops responding to input variations.</p></li><li><p>Why this matters: Activation functions define the "rules" for how circuits combine features, influencing everything from robustness to adversarial attacks to generalization.</p></li></ul><ol start="8"><li><p><strong>Probing vs. Intervening</strong></p></li></ol><p>Two key methodologies in MI:</p><ul><li><p>Probing: Training a simple model (e.g., linear classifier) on a network&#8217;s activations to test if a specific feature (e.g., "sentiment") is present in its representations.</p></li><li><p>Intervening: Actively modifying activations (e.g., ablating a neuron, amplifying a circuit) to observe causal effects on outputs. For example, silencing a circuit might reveal it was responsible for suppressing biased language.</p></li><li><p>Why this matters: Probing identifies correlations ("Feature X is here"), while intervening establishes causality ("Circuit Y causes Behavior Z").</p></li></ul><ol start="9"><li><p><strong>Causal Scrubbing</strong></p></li></ol><p>A technique to validate hypothesized circuits by "scrubbing" (resetting) certain activations and observing if the model&#8217;s output degrades. If the hypothesis is correct, scrubbing should disrupt specific behaviors (e.g., failing math problems if a "number detection" circuit is scrubbed).</p><ul><li><p>Why this matters: Causal scrubbing bridges the gap between observational and experimental science in MI, enabling rigorous falsification of theories.</p></li></ul><h3><strong>How MI Differs from General Interpretability</strong></h3><p>While general interpretability aims to provide broad explanations of model behavior (e.g., "The model classifies cats by focusing on fur texture"), MI demands a mechanistic, step-by-step account. It asks questions like:</p><ul><li><p>Which exact neurons detect "fur texture"?</p></li><li><p>How do these neurons communicate with others to trigger the "cat" classification?</p></li><li><p>What happens if we disrupt this circuit?</p></li></ul><p>This granular approach allows researchers to rigorously test hypotheses about a model&#8217;s behavior, similar to how a biologist might study a cell by isolating and manipulating individual proteins. By contrast, general interpretability methods (e.g., attention visualization or feature importance scores) often provide correlational insights rather than causal explanations.</p><h2><strong>2. Drawing Parallels with Physics</strong></h2><p>To understand mechanistic interpretability (MI), it helps to borrow frameworks from physics - a field that has spent centuries decoding the universe&#8217;s most complex systems. Physics and MI share a common goal: to explain how systems work at their most fundamental level. Whether studying particles or neural networks, both fields rely on observation, hypothesis, and experimentation to move from mystery to mechanistic understanding.</p><p>Just as physicists decompose natural phenomena into fundamental principles, MI researchers deconstruct neural networks into interpretable components. 
This process mirrors the scientific method:</p><ol><li><p><strong>Observation</strong></p></li></ol><ul><li><p>In physics: Galileo observed pendulum swings to infer laws of motion; astronomers mapped planetary orbits to deduce gravity&#8217;s role.</p></li><li><p>In MI: Researchers track how neurons activate when a model processes inputs. For example, in a vision model, you might notice a neuron firing every time the input contains a spiral shape (like a galaxy or a seashell).</p></li></ul><ol start="2"><li><p><strong>Hypothesis</strong></p></li></ol><ul><li><p>In physics: Newton proposed that gravity governs both falling apples and orbiting moons.</p></li><li><p>In MI: A researcher hypothesizes that a specific circuit in a language model resolves pronouns (e.g., linking &#8220;it&#8221; to &#8220;the cat&#8221; in the sentence &#8220;The cat sat down because it was tired&#8221;).</p></li></ul><ol start="3"><li><p><strong>Testing and Validation</strong></p></li></ol><ul><li><p>In physics: Young&#8217;s double-slit experiment tested whether light behaves as a wave or particle by observing interference patterns.</p></li><li><p>In MI: To validate the pronoun-resolution hypothesis, researchers might &#8220;ablate&#8221; (disable) the suspected circuit. If the model then fails to link &#8220;it&#8221; to &#8220;the cat,&#8221; the hypothesis gains support.</p></li></ul><p>This iterative cycle will allow MI to build causal explanations, much like physics constructs theories to predict celestial motion or particle interactions.</p><h3><strong>Physics Concepts as Tools for MI</strong></h3><p>Beyond methodology, specific principles from physics illuminate how neural networks operate:</p><ol><li><p><strong>Classical Mechanics and Deterministic Systems</strong></p></li></ol><p>Classical mechanics predicts outcomes from initial conditions. For example, knowing a ball&#8217;s position and velocity lets you calculate its trajectory.</p><ul><li><p>MI parallel: MI researchers trace input-to-output pathways in neural nets looking for ones that behave deterministically in response to particular input properties, much like calculating a ball&#8217;s path.</p></li><li><p>Example: If a vision model always activates Neuron #512 when it &#8220;sees&#8221; a cat&#8217;s eye, you can reverse-engineer how this neuron contributes to the final &#8220;cat&#8221; classification.</p></li></ul><ol start="2"><li><p><strong>Superposition</strong></p></li></ol><p>In wave mechanics (quantum, electromagnetism, etc), particles / waves exist in multiple states simultaneously (superposition) and share correlated behaviors.</p><ul><li><p>MI parallel: Polysemantic neurons activate for multiple unrelated features. For instance, a single neuron might fire for both &#8220;cat ears&#8221; and &#8220;mathematical integrals,&#8221; creating ambiguity.</p></li><li><p>Why it matters: Just as measuring a quantum particle collapses its state, intervening on a polysemantic neuron (e.g., silencing it) can disrupt seemingly unrelated model behaviors.</p></li></ul><ol start="3"><li><p><strong>Statistical Mechanics and Emergent Behavior</strong></p></li></ol><p>Macroscopic phenomena like temperature emerge from countless microscopic interactions (e.g., molecules colliding).</p><ul><li><p>MI parallel: High-level model capabilities (e.g., storytelling) emerge from low-level neuron interactions. 
No single neuron &#8220;knows&#8221; grammar, but circuits across layers collaborate to enforce syntax.</p></li><li><p>Example: A language model&#8217;s ability to write poetry isn&#8217;t stored in one neuron, it arises from how circuits combine words, rhythms, and emotions.</p></li></ul><ol start="4"><li><p><strong>Symmetry Principles</strong></p></li></ol><p>Physical laws often remain unchanged under transformations (e.g., rotating a system doesn&#8217;t alter its energy conservation).</p><ul><li><p>MI parallel: Convolutional Neural Networks (CNNs) use translational invariance, they detect edges or textures regardless of their position in an image.</p></li><li><p>Example: A CNN trained to recognize cats will identify a cat&#8217;s ear whether it&#8217;s in the top-left or bottom-right corner of an image.</p></li></ul><ol start="5"><li><p><strong>Perturbation Theory</strong></p></li></ol><p>Physicists study systems by applying small perturbations (e.g., nudging a particle) to observe responses.</p><ul><li><p>MI parallel: Researchers tweak neuron activations to test causality. For example, amplifying a &#8220;positive sentiment&#8221; neuron in a language model should make its output more optimistic.</p></li><li><p>Example: If silencing a circuit reduces a model&#8217;s accuracy on math problems, you&#8217;ve likely found a &#8220;number reasoning&#8221; module.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p>In the remaining parts of this series, we will look at how MI overlaps with physics and where it might go next.</p>]]></content:encoded></item><item><title><![CDATA[Enhancing AI Agents with Causality]]></title><description><![CDATA[Given the remarkable brute force power we have access to, namely lots of data and computational power, is causal inference the &#8220;light weight and feather&#8221; of cognition?]]></description><link>https://aisc.substack.com/p/enhancing-ai-agents-with-causality</link><guid isPermaLink="false">https://aisc.substack.com/p/enhancing-ai-agents-with-causality</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 05 Mar 2025 17:57:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RiMa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We recently hosted <a href="https://www.linkedin.com/in/amlearning/">Ali Madani</a> for an insightful session on the intersection of AI Agents and Causality, a fundamental question that rarely gets enough attention: Can AI agents truly make reliable decisions without understanding cause and effect?</p><p>The distinction between correlation and causation is just like 
the difference between saying students should skip exams to avoid weight gain (because exams correlate with weight gain) versus addressing the actual causal chain (exams &#8594; stress eating &#8594; weight gain). This example, shared during the session, perfectly illustrates why causal reasoning matters in practical applications.</p><p>A few years ago I read the &#8220;<a href="https://en.wikipedia.org/wiki/The_Book_of_Why">Book of Why</a>&#8221; and, as a physicist, I really enjoyed it. The book explores the concept of causality&#8212;how we determine cause-and-effect relationships rather than just correlations. It argues that traditional statistical methods (like correlation and regression - aka the foundations of everything we do in ML) are insufficient for understanding causality. The book introduces a &#8220;causal inference framework&#8221; based on &#8220;causal diagrams&#8221; and &#8220;do-calculus&#8221;, which allow us to answer counterfactual questions like, <em>What would have happened if X had not occurred?</em> It contrasts different "levels of causation" using the &#8220;Ladder of Causation&#8221;:</p><ul><li><p>Association (Seeing) &#8211; Correlation and pattern recognition (e.g., "Smokers tend to get lung cancer").</p></li><li><p>Intervention (Doing) &#8211; Understanding the effects of actions (e.g., "What happens if we ban smoking?").</p></li><li><p>Counterfactuals (Imagining) &#8211; Reasoning about alternate realities (e.g., "Would this person have avoided cancer if they had never smoked?").</p></li></ul><p>The book critiques traditional statistical methods (like those used in machine learning) for their reliance on correlation without causal understanding. It also discusses real-world applications in medicine, economics, AI, and social sciences.</p><p>The math we use in science often relies heavily on counterfactuals to understand fundamental assertions that generalize very broadly within the boundaries of their assumptions (think <code>f = ma</code> and such). In physics, for instance, sparse causal relationships enable tremendous generalizability. As Ali illustrated: "Newton didn't have millions of data points, it was an apple and then all the experiments and then he came up with the formulas, it worked out." </p><p>By identifying similar sparse causal relationships in other domains, we might achieve similar generalizability without requiring the massive datasets currently needed for correlation-based approaches. That is one of the most compelling aspects of marrying causality and classical ML: the hope of improving generalization with less data, addressing a fundamental challenge in traditional machine learning approaches.</p>
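<p>As a toy illustration of the first two rungs, here is a tiny simulation of the exams &#8594; stress eating &#8594; weight gain chain from above. The structural equations and coefficients are invented for illustration; the only point is that the associational answer and the interventional (do-operator) answer come apart.</p><pre><code># Toy structural model: exams -> stress eating -> weight gain (coefficients invented).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

exams = rng.binomial(1, 0.5, n)                        # exam season or not
stress_eating = 0.8 * exams + 0.1 * rng.normal(size=n)
weight_gain = 1.0 * stress_eating + 0.1 * rng.normal(size=n)

# Rung 1 (association): exams and weight gain look strongly related.
print("corr(exams, weight gain):", round(np.corrcoef(exams, weight_gain)[0, 1], 2))

# Rung 2 (intervention): do(stress eating = 0) while exams still happen.
weight_gain_do = 1.0 * np.zeros(n) + 0.1 * rng.normal(size=n)
print("mean weight gain, observed:            ", round(weight_gain.mean(), 2))
print("mean weight gain, do(no stress eating):", round(weight_gain_do.mean(), 2))
</code></pre><p>The correlation on its own suggests &#8220;avoid exams&#8221;; the intervention shows that acting on the mediator removes the weight gain even when exams stay in place, which is the recommendation the causal chain actually supports.</p>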
<p>After a few years, I still believe that causal inference can be a significant addition to how we do AI, but I have moderated my view of it from &#8220;absolutely necessary&#8221; to &#8220;practically useful&#8221;. My go-to analogy for this kind of thing is flight: how nature flies is mechanistically very different from how humans fly. &#8220;Artificial&#8221; flight leverages a remarkable brute force power called a jet engine to lift a significantly heavier object off the ground. That means that the absolutely necessary properties like light weight and feathers and wings, in their natural form, become largely irrelevant. The question that I&#8217;m struggling with these days is this: Given the remarkable brute force power we have access to, namely lots of data and computation, is causal inference the &#8220;light weight and feather&#8221; of cognition?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RiMa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp"><img src="https://substackcdn.com/image/fetch/$s_!RiMa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp" width="1024" height="1024" alt="" loading="lazy"></a></figure></div><p>Well, I only think about that question when I have my philosopher hat on. When I have my pragmatic AI company hat on, I do spend a lot of time thinking about how causal structures can create scaffolding for the agentic systems we build for commercial and research purposes. While there&#8217;s a remarkably successful effort going on to build reasoning into the statistical models we use and love, say <a href="https://aisc.substack.com/p/understanding-deepseek-r1">R1</a>, I think it is still important, in parallel and for practical applications, to think about causality and counterfactual reasoning when designing agentic systems, especially those involving multi-agent interactions, autonomous decision-making, and adaptive learning.</p><p>Now let&#8217;s get into some notes from the session with Ali.</p><h2>The Promise and Limitations of AI Agents</h2><p>AI agents, at their core, are systems designed to interact with their environment through an iterative process of assessment, information processing, and autonomous decision-making. They're characterized by their ability to learn, adapt, and operate with varying degrees of independence. 
The recent explosion of large language models has accelerated interest in these agents, particularly for their potential to automate complex tasks across industries.</p><p>In healthcare alone, AI agents could revolutionize prevention, detection, diagnosis, and patient monitoring, not by replacing doctors, but by handling repetitive tasks and providing real-time support. The economic implications are significant, with potential cost reductions across multiple sectors.</p><p>But here's where things get interesting: most AI systems today operate primarily on correlative relationships rather than causal ones. This creates a fundamental limitation.</p><h2>The Correlation Trap</h2><p>"If you go correlative and identify association between different variables, you can see that exams definitely have correlation with gaining weight," Ali noted. "So many students go through stress eating through exams and they gain weight."</p><p>Imagine we want to recommend actions to help students avoid weight gain. Data analysis might show a strong correlation between exams and weight gain. A purely correlative approach might suggest the absurd recommendation to "avoid taking exams" to prevent weight gain. However, a causal understanding reveals that exams cause stress eating, which then causes weight gain. With this causal chain identified, we can make more meaningful recommendations targeting the actual mechanism (stress eating) rather than the initial trigger (exams).</p><p>This example highlights why correlation isn't enough for truly intelligent systems. Without causality, AI agents risk making recommendations based on spurious correlations, like <a href="https://tylervigen.com/spurious-correlations">the correlation between wind in Taiwan and Googling &#8220;I&#8217;m tired&#8221;</a>.</p><p>The problem extends beyond obvious examples. In drug discovery, researchers spend years designing chemical compounds without knowing if they'll have the expected effect on patients. Some causal relationships remain unknown even to human experts, creating a significant challenge for AI systems.</p><h2>Bringing Causality to AI Agents</h2><p>There are several approaches to incorporating causality into AI agents, each with different applications:</p><h3>1. Randomized Interventions</h3><p>The gold standard for establishing causality involves randomized interventions, where confounding variables are controlled through randomization. This approach is widely used in clinical trials and allows for direct measurement of causal effects:</p><p>Causal Effect = Outcome(Treatment) - Outcome(Control)</p><p>While powerful, randomization isn't always feasible due to cost constraints or ethical considerations. As Ali noted, "From an ethical perspective in many situations, for example in the case of drugs, we cannot test every single thing that we hypothesize to work on human beings."</p><h3>2. Causal Discovery Algorithms</h3><p>These algorithms aim to generate directed acyclic graphs (DAGs) that represent causal relationships between variables. Unlike correlation, which merely shows association, these graphs reveal directionality, which variables cause changes in another.</p><p>So, for scenarios where controlled experiments aren't possible, causal discovery algorithms can extract causal relationships from observational data:</p><p>"We have causal discovery algorithms that aim to generate causal graphs and directed acyclic graphs... 
when you provide these values across variables into some of these causal discovery algorithms, what they try to do is to check some of the causality assumptions and at the end generates a directed acyclic graph for you."</p><p>These algorithms come in two main varieties:</p><ul><li><p>Statistical methods (traditional constraint-based or score-based approaches like <a href="https://causal-learn.readthedocs.io/en/latest/search_methods_index/Constraint-based%20causal%20discovery%20methods/PC.html">PC</a>)</p></li><li><p>Machine learning-based gradient algorithms (more computationally efficient)</p></li></ul><p>What's particularly valuable is that these approaches don't require massive datasets, hundreds or thousands of data points can suffice, making them practical for many real-world applications.</p><h3>3. Causal Representation Learning</h3><p>This emerging field aims to learn representations that reveal unknown causal structures. It's based on a fundamental insight from physics: most phenomena are governed by a sparse set of causal rules rather than thousands of continuous features.</p><p>Perhaps this fundamentally differs from traditional representation learning. While traditional approaches summarize raw features into latent variables, causal representation learning aims to uncover the underlying causal structure of data.</p><p>This approach draws inspiration from physics, where sparse sets of fundamental rules determine complex phenomena. As Ali explained: "We have a sparse set of rules that determine a specific phenomena... those rules are based on the causal roots... like gravity for example, the electromagnetic rules."</p><p>This sparsity principle applies across domains. In cancer research, for instance, while there isn't a single gene causing poor outcomes, we don't expect thousands of genes to be equally responsible either. Causal representation learning seeks to identify these sparse causal factors.</p><h3>4. Large Language Models and Causality</h3><p>While LLMs weren't explicitly trained for causal reasoning, research has shown they can effectively tackle certain causal tasks with proper prompting. A paper highlighted during the session demonstrated that models like GPT-4 can achieve up to 96% accuracy in identifying known pairwise causal relationships.</p><p>The key lies in smart but simple prompting strategies. Rather than asking broadly about causal relationships between multiple variables, researchers found success by asking direct questions like: "Which cause and effect relationship is more likely: changing A causes a change in B, or changing B causes a change in A?"</p><p>Importantly, LLMs excel at retrieving known causal relationships but cannot uncover novel ones:</p><p>"This way of using the large language models replace the experts for graph generation... But it doesn't uncover unknown relationship."</p><p>The key insight: domain knowledge is crucial. LLMs can only identify causal relationships they've encountered during pre-training. 
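</p><p>As a sketch of the pairwise prompting idea: the snippet below only assembles the question; <code>call_llm</code> is a placeholder for whatever chat-completion client you use, not a real API, and the template simply mirrors the phrasing described above.</p><pre><code># Pairwise causal-direction prompting (sketch; call_llm is a placeholder).
PROMPT = (
    "Which cause and effect relationship is more likely?\n"
    "(A) Changing {a} causes a change in {b}.\n"
    "(B) Changing {b} causes a change in {a}.\n"
    "Answer with A or B and one sentence of justification."
)

def causal_direction(a: str, b: str, call_llm) -> str:
    """Ask an LLM which causal direction between two named variables is more plausible."""
    return call_llm(PROMPT.format(a=a, b=b))

# e.g. causal_direction("altitude", "air pressure", call_llm)
</code></pre><p>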
They excel at retrieving and applying known causal knowledge but cannot uncover truly unknown relationships.</p><p>This creates a natural categorization of causal tasks for AI agents:</p><ul><li><p><strong>Known causal relationships</strong>: LLMs can reliably retrieve these (e.g., smoking causing lung cancer)</p></li><li><p><strong>Abundant data but unclear causality</strong>: Areas where causal discovery algorithms might help (e.g., sales data, web page optimization)</p></li><li><p><strong>Unknown relationships</strong>: Domains requiring experimental validation and specialized causal learning algorithms (e.g., novel drug discovery)</p></li></ul><h3>5. Reinforcement Learning and Causality</h3><p>The final piece of the puzzle involves using reinforcement learning to improve AI agents' causal reasoning. By providing feedback based on causal relationships, either from experts, experiments, or causal modeling, we can fine-tune models to make better causal inferences over time.</p><p>"The success of large language models was partially related to reinforcement learning... putting the transformers-based large language models and reinforcement learning for providing the feedback and fine-tuning and penalizing them and rewarding them resulted in huge success in the field."</p><h2>Practical Applications Across Domains</h2><p>The integration of causality with AI agents offers compelling applications:</p><h3>Healthcare</h3><ul><li><p>More accurate diagnosis through root cause identification</p></li><li><p>Prevention and detection capabilities</p></li><li><p>Patient monitoring with causal understanding</p></li><li><p>Treatment recommendation based on causal effects</p></li></ul><h3>Business Applications</h3><ul><li><p>Understanding true drivers of sales beyond correlations</p></li><li><p>Designing effective A/B tests to measure intervention impacts</p></li><li><p>Web optimization based on causal rather than correlative insights</p></li></ul><h3>Drug Discovery</h3><ul><li><p>Target identification for different cancer types</p></li><li><p>Biomarker discovery for drug response prediction</p></li><li><p>Analysis of treatment regimens and patient journeys</p></li></ul><div id="youtube2-0akvZtFcCUo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0akvZtFcCUo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0akvZtFcCUo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Conclusion</h2><p>What struck me most about Ali's presentation wasn't a single breakthrough technique, but rather the recognition that enhancing AI agents with causality requires integration across multiple approaches. It's not about waiting for perfect causal reasoning models, but about strategically incorporating causal thinking into existing systems.</p><p>As AI agents become more integrated into critical domains like healthcare, finance, and education, their ability to reason causally will directly impact human lives. An AI that recommends interventions based on genuine causal understanding rather than statistical correlation is not just more accurate, it's more trustworthy.</p><p>The future of AI isn't just about bigger models or more data, it's about smarter reasoning. 
And at the heart of smarter reasoning lies causality: understanding not just what happens, but why.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Q&amp;A</h2><p>Q: What's the relationship between reasoning models like R1 and causality?</p><p>A: While these models demonstrate impressive capabilities, true reasoning arguably requires causal understanding. Ali suggested we don't need to wait for perfect causal reasoning models, we can immediately begin providing causal feedback to existing models through reinforcement learning approaches while developing more fundamentally causal architectures in parallel.</p><p>Q: How does causal representation learning differ from traditional representation learning?</p><p>A: Traditional representation learning summarizes raw features into latent variables, while causal representation learning aims to uncover underlying causal structures. The latter involves additional assumptions beyond traditional IID (independent and identically distributed) assumptions, with the goal of identifying sparse causal relationships that enable better generalization and out-of-distribution performance.</p><p>Q: Can you give practical examples of how causality has helped in your work?</p><p>A: In drug discovery, Ali's team has used causal discovery and inference to identify new gene targets for different cancer types. They've also applied causal approaches to biomarker discovery, identifying underlying mechanisms related to drug responses. While specific results remain confidential, these applications demonstrate the practical value of causal approaches in real-world settings.</p>]]></content:encoded></item><item><title><![CDATA[The Business Impact of DeepSeek R1]]></title><description><![CDATA[For the first time, a non-Western model put in question the dominance of American LLM providers- not just in performance but in accessibility, cost, and infrastructure independence.]]></description><link>https://aisc.substack.com/p/the-business-impact-of-deepseek-r1</link><guid isPermaLink="false">https://aisc.substack.com/p/the-business-impact-of-deepseek-r1</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 24 Feb 2025 13:44:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bcd9baf8-2f58-4af5-bfa3-00548467cb52_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We had a community session on this topic and the following are some notes from that session along with some additional thoughts from me!</p><p>The LLM industry has been dominated by a few major players - OpenAI, Meta, Google DeepMind, and Anthropic, to name a few. 
A few models from the Middle East and Europe popped up here and there and took the headlines for a few days and then they disappeared as quickly as they showed up. Therefore, US based companies have controlled not only model development but also pricing, infrastructure, and the regulatory landscape around AI. This centralization may be partly due to where the capital for moonshot projects is most readily available, but it has certainly created a single point of failure for markets, entrepreneurs, and businesses alike. The dominance that we have experienced so far from the US companies in this market had also created relatively stable market dynamics at a relatively high price point for the end users; well, until a new kid on the block messed things up!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>DeepSeek is a relatively young company and in our last two posts (<a href="https://aisc.substack.com/p/understanding-deepseek-r1">this</a> and <a href="https://aisc.substack.com/p/innovations-leading-up-to-deepseek">this</a>) we looked at the technical side of their newest model: R1, a reasoning model developed in <em>China</em>.</p><p>For the first time, a non-Western model put in question the dominance of American LLM providers- not just in performance but in accessibility, cost, and infrastructure <em>independence</em>. While there are still many open questions about deployment, security, and long-term impact, one thing is clear: mindset has shifted away from &#8220;only Americans can do it&#8221;, competition is here, and that&#8217;s good news for entrepreneurs.</p><p>Even if R1 itself doesn&#8217;t deserve the hype it created and even if it can&#8217;t hold up the promise like other models from non-US origins, its emergence signals a breakaway from the idea that all major AI breakthroughs must come from Western corporations. The availability of alternatives fosters competition, drives down costs, and increases the diversity of AI applications, creating new opportunities for businesses to build, experiment, and innovate on their own terms.</p><p>What has made me quite excited about LLMs in the past few years is the equalizing power that they bring to innovation. I know there is a lot of business potential around automating mundane tasks using LLM agents, but the part that gets me out of bed every day is building agentic applications that facilitate knowledge intensive workflows. Even with the earliest versions of LLM apps like ChatGPT we saw lowering of significant barriers to knowledge that was traditionally reserved to smaller portions of the population, think coding, business strategy, marketing tricks, and product ideation. This led into many more people trying out their crazy ideas or at least feeling more encouraged to explore their options. 
Now imagine with the lowering cost of operationalizing LLM systems, a serious competition that can impact pricing, and more powerful models, what kind of tools can we build to give more founders and founders-to-be a fair playing field!</p><div><hr></div><p>The following section covers some select Qs and As from our session.</p><p><strong>1. What makes DeepSeek R1 different from existing AI models?</strong></p><p>DeepSeek R1 is a reasoning model that claims to be smaller and more efficient than existing alternatives. However, its real distinction lies in its origin. Unlike models from OpenAI or Anthropic, R1 was developed in China and optimized for deployment in non-Western infrastructure, such as Alibaba Cloud. This means businesses that previously had no choice but to rely on US-based AI providers now have an alternative. The model&#8217;s efficiency also raises questions about the future of AI computing, as it suggests that high-quality reasoning tasks may not require the enormous compute resources traditionally associated with GPT-O1-level models. For the western audience this might not even be an option, but imagine how many countries are out there, say the Middle East, Africa, and Eastern Europe who are more than happy to consider their options more broadly now that there are options available to them.</p><p><strong>2. How does R1 impact the cost structure of AI-powered businesses?</strong></p><p>Cost has always been a major factor in AI adoption. OpenAI&#8217;s most advanced models can cost up to <a href="https://openai.com/api/pricing/">$15 per million tokens</a> for reasoning tasks, a price that makes AI-powered applications prohibitively expensive for many startups and small businesses. In contrast, DeepSeek R1 is priced at just <a href="https://api-docs.deepseek.com/quick_start/pricing">$0.14 per million tokens</a> - a staggering difference. While this might be an apples to pears comparison and these figures don&#8217;t necessarily reflect training or operational costs, they indicate that AI reasoning capabilities may soon become dramatically more affordable. This reduction in cost could allow small businesses to integrate advanced AI into their workflows without needing the budgets of Big Tech corporations.</p><p><strong>3. Is DeepSeek a direct threat to Nvidia and Western AI infrastructure?</strong></p><p>The release of R1 led to a short-term drop in Nvidia&#8217;s stock, highlighting the market&#8217;s reaction to potential shifts in AI computing demand. Investors had largely assumed that AI adoption worldwide would remain dependent on US cloud providers and American GPUs. However, the rise of R1 and similar models means businesses may increasingly turn to non-US cloud infrastructure and non-Nvidia chips, such as those developed by Huawei. While Nvidia and other Western AI players will likely continue to thrive, the assumption of American AI dominance is no longer a given. To a large extent this is a market correction because there was no reason to assume, this early in the game, that the dominance will remain in the West.</p><p><strong>4. What challenges do businesses face in deploying R1?</strong></p><p>Despite its advantages, R1 has proven difficult to deploy. Its official API has suffered frequent downtime, reportedly operating only 20% of the time due to DDoS attacks. Additionally, its architecture, built on the Mixture of Experts (MoE) framework, adds complexity to serving the model. 
MoE models have historically been difficult to scale and operate efficiently in production environments, which is why they have not been widely adopted despite their theoretical efficiency benefits. Entrepreneurs looking to integrate R1 into their products will need to consider these operational challenges.</p><p><strong>5. Are there security risks associated with using R1?</strong></p><p>Security is a major concern when adopting any AI model, and R1 is no exception. Some businesses may be hesitant to use a Chinese-developed AI system due to fears about data security and compliance. Those are fair concerns, but at the same time all those concerns can and should exist for any other players in the space. AI models, including R1, could be susceptible to training data poisoning, where adversaries inject subtle biases or vulnerabilities into a model&#8217;s responses. Additionally, LLMs can be manipulated to promote specific software libraries, including ones that contain hidden vulnerabilities. Businesses considering any LLMs, including R1 must weigh these risks against its cost and efficiency benefits.</p><p><strong>6. How does R1 change the competitive landscape for AI startups?</strong></p><p>Before R1, reasoning-capable AI was largely controlled by OpenAI and a few other firms, meaning startups had little choice but to pay high API fees for access. With R1 and potential future competitors, smaller businesses can now explore alternatives that are both cheaper and more flexible. This shift could make AI-powered startups more viable and profitable, especially in emerging markets where cost constraints previously limited access to high-quality models.</p><p><strong>7. Is R1 a sign that China is overtaking the West in AI?</strong></p><p>It&#8217;s too early to say that China is surpassing the US in AI innovation, but R1 demonstrates that the playing field is becoming more balanced. In 2023, 8 of the top 10 AI models were American, with only two exceptions - one from France (Mistral) and one from Canada (Cohere). By 2025, it is expected that at least half of the leading AI models will be Chinese. Companies like Alibaba, Baidu, and DeepSeek are rapidly catching up, and the assumption that only Western companies can build cutting-edge AI is no longer valid.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[Innovations Leading up to DeepSeek R1]]></title><description><![CDATA[To fully appreciate DeepSeek R1&#8217;s capabilities, it is important to understand the evolution of DeepSeek&#8217;s models and how each step led to the development of this advanced reasoning model.]]></description><link>https://aisc.substack.com/p/innovations-leading-up-to-deepseek</link><guid isPermaLink="false">https://aisc.substack.com/p/innovations-leading-up-to-deepseek</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 14 Feb 2025 13:01:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this session, we explored the architecture evolution and technical innovations that led to development of <strong>DeepSeek R1</strong>, a model that stands at the forefront of AI advancements. DeepSeek R1 pushes the boundaries of reasoning in artificial intelligence and is designed to handle efficiency, lower cost, and cutting-edge performance. To fully appreciate DeepSeek R1&#8217;s capabilities, it is important to understand the evolution of DeepSeek&#8217;s models and how each step led to the development of this advanced reasoning model.</p><div id="youtube2--_j4GvwSDQk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-_j4GvwSDQk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-_j4GvwSDQk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Technical Evolution and Foundation</strong></h2><p>Since its inception in 2023, DeepSeek has continually advanced its large language models (LLMs), with each new release building upon the previous model&#8217;s strengths. Below is a detailed breakdown of the contributions made by each iteration, culminating in the creation of <strong>DeepSeek R1</strong>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3><strong>DeepSeek V2</strong></h3><p>We&#8217;ll start with exploring DeepSeek-V2, which was a large language model (LLM) released in May 2024. This model introduced significant architectural advancements, notably the integration of multi-head latent attention (MLA) and a mixture of experts (MoE) framework. The MLA mechanism enhanced the model&#8217;s ability to process complex patterns by utilizing compressed latent vectors, thereby improving performance and reducing memory usage during inference. The MoE architecture allowed the model to activate a subset of specialized experts per forward pass, optimizing computational efficiency.</p><p>DeepSeek-V2 was trained on an extensive dataset of 8.1 trillion tokens, with a higher proportion of Chinese text compared to English. The context length was extended from 4,000 to 128,000 tokens using the YaRN method, which improved the model&#8217;s ability to handle longer sequences. The training process involved supervised fine-tuning (SFT) on 1.5 million instances for helpfulness and 300,000 for safety, followed by reinforcement learning (RL) using Group Relative Policy Optimization (GRPO) in two stages: one focused on math and coding problems, and the other on helpfulness, safety, and rule adherence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OSun!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OSun!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 424w, https://substackcdn.com/image/fetch/$s_!OSun!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 848w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1272w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" width="682" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:682,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OSun!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 424w, https://substackcdn.com/image/fetch/$s_!OSun!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 848w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1272w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Source: <a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a></p><h3><strong>DeepSeek V3</strong></h3><p>Building upon the V2 architecture, DeepSeek introduced V3 in December 2024. This iteration maintained the MoE framework and MLA, featuring a total of 671 billion parameters with a context length of 128,000 tokens. The training process for V3 involved pretraining on 14.8 trillion tokens, predominantly in English and Chinese, with a higher ratio of math and programming content. 
The context length was further extended from 4,000 to 128,000 tokens using YaRN.</p><p>SFT was conducted for two epochs on 1.5 million samples of reasoning and non-reasoning data. Expert models were trained to generate synthetic reasoning data in specific domains (math, programming, logic), and model-based reward models were developed to guide the RL process. The final model, DeepSeek-V3, was trained using GRPO with both reward models and rule-based rewards. This version marked a significant step forward in computational efficiency and reasoning capabilities, ensuring that the model could better handle complex tasks and improve its overall performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gjF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gjF8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 424w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 848w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1272w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png" width="575" height="518" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:575,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gjF8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 424w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 848w, 
https://substackcdn.com/image/fetch/$s_!gjF8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1272w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>Source: <a href="https://arxiv.org/pdf/2412.19437">DeepSeek-V3 Technical Report</a></p></blockquote><h3><strong>DeepSeek R1-Zero</strong></h3><p>In November 2024, DeepSeek released R1-Lite-Preview, an early version of R1, accessible via API and chat interfaces. This model was trained for logical inference, mathematical reasoning, and real-time problem-solving. It was reported to outperform OpenAI&#8217;s o1 model on benchmarks such as the American Invitational Mathematics Examination (AIME) and MATH.</p><p>R1-Lite-Preview was initialized from DeepSeek-V3-Base and shared its architecture. The model employed a Mixture of Experts (MoE) framework with 671 billion parameters, activating 37 billion per forward pass to maintain computational efficiency. The training process for R1-Lite-Preview involved supervised fine-tuning (SFT) on a small dataset of high-quality, readable reasoning examples, followed by reinforcement learning (RL) to further develop its reasoning skills. This approach encouraged the autonomous emergence of behaviors such as chain-of-thought reasoning, self-verification, and error correction, setting the foundation for the more advanced R1 model.</p><h3><strong>DeepSeek R1</strong></h3><p>On January 20, 2025, DeepSeek launched R1, an open-source AI model emphasizing reasoning capabilities. R1 was initialized from DeepSeek-V3-Base and shares its architecture, including the MoE framework with 671 billion parameters, activating 37 billion per forward pass to maintain computational efficiency. 
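</p><p>The &#8220;only a fraction of the parameters fire per token&#8221; idea is easy to see in code. The sketch below is a generic top-k mixture-of-experts layer, not DeepSeek&#8217;s implementation (which adds shared experts, finer-grained experts, and more sophisticated load balancing); the dimensions and expert counts are arbitrary.</p><pre><code># Generic top-k mixture-of-experts layer (illustrative only, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)      # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                # x: [tokens, d_model]
        gates = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = gates.topk(self.k, dim=-1)    # keep only k experts per token
        out = torch.zeros_like(x)
        for t, (idx_row, w_row) in enumerate(zip(topk_idx, topk_w)):
            for idx, w in zip(idx_row.tolist(), w_row):
                out[t] += w * self.experts[idx](x[t])    # the other experts stay idle
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 512)).shape)                    # torch.Size([4, 512])
</code></pre><p>Scaled up, this is how a 671-billion-parameter model can spend only a 37-billion-parameter slice of compute on each token.</p><p>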
The training process for R1 involved a four-phase pipeline:</p><ul><li><p><strong>Cold Start</strong>: Supervised fine-tuning on a small dataset of high-quality, readable reasoning examples.</p></li><li><p><strong>Reasoning-Oriented RL</strong>: Large-scale RL focusing on rule-based evaluation tasks, incentivizing accurate and coherent responses.</p></li><li><p><strong>Supervised Fine-Tuning</strong>: Synthesis of reasoning data using rejection sampling, combined with non-reasoning data for comprehensive fine-tuning.</p></li><li><p><strong>RL for All Scenarios</strong>: A second RL phase refining the model&#8217;s helpfulness and harmlessness while preserving advanced reasoning skills.</p></li></ul><p>This approach led to the emergence of complex reasoning patterns, such as self-verification and reflection, without explicit programming. Distilled versions of R1, ranging from 1.5 billion to 70 billion parameters, were also developed to cater to different computational needs. R1&#8217;s focus on improving reasoning and inference abilities while maintaining computational efficiency marked a key advancement in the evolution of DeepSeek&#8217;s models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBPA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBPA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 424w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 848w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png" width="1043" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1043,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!aBPA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 424w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 848w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Source: DeepSeek R1 architecture by <a href="https://x.com/SirrahChan/status/1881488738473357753">@SirrahChan</a></p><h2>Multi-Token Prediction Innovation</h2><p>One of the most significant advancements in DeepSeek R1 is its novel approach to multi-token prediction, which enhances both the depth and flexibility of the model&#8217;s output.</p><ul><li><p><strong>Sequential Prediction Modules</strong>: Traditional AI models often rely on parallel token prediction, generating multiple tokens simultaneously. In contrast, R1 adopts sequential prediction, where tokens are generated one after another in a stepwise manner. This method improves contextual relevance and coherence, as each token&#8217;s prediction is informed by the ones that came before it, leading to more cohesive and meaningful output.</p></li><li><p><strong>Enhanced Internal Representations</strong>: The switch to sequential prediction allows R1 to develop richer internal representations of data. 
This change improves the model&#8217;s planning capabilities and enhances its ability to capture long-term dependencies in the sequence, which is crucial for tasks involving complex logic or narrative structures.</p></li><li><p><strong>Densified Training Signals</strong>: In traditional training setups, models predict a single token at a time, which limits the amount of useful training feedback per step. R1&#8217;s approach of multi-token prediction increases the density of training signals per step, providing more concentrated and effective learning, which contributes to its superior accuracy.</p></li><li><p><strong>Shared Embedding Layers</strong>: By utilizing shared embedding layers in combination with sequential transformer blocks, R1 achieves better cohesion between tokens in a sequence. This improves the consistency of predictions across different tokens and helps the model generate more coherent outputs overall</p></li></ul><h2>Data Processing and Quality Control</h2><p>DeepSeek R1&#8217;s performance is not just driven by its architecture but also by its innovative approach to managing and processing data, ensuring both efficiency and high-quality outputs.</p><ul><li><p><strong>Cross-Dump Deduplication</strong>: R1 implements cross-dump deduplication across 91 instances of Common Crawl data, eliminating redundant or repetitive entries. This ensures that the model is exposed to a broader range of unique, high-quality data during training, which enriches its understanding and generalization capabilities.</p></li><li><p><strong>Strategic Exclusion of Multiple-Choice Questions</strong>: Unlike many other pre-training models that include multiple-choice questions, R1 excludes them. This strategic decision allows the model to focus on more complex language tasks that require deeper understanding and nuanced responses, enhancing its ability to process subtler forms of reasoning.</p></li><li><p><strong>Mathematical Content Enhancement</strong>: R1 incorporates an iterative classification approach to enhance the mathematical content within its training data. This process strengthens its ability to process and reason through mathematical concepts, improving its performance in specialized tasks that require advanced mathematical reasoning.</p></li><li><p><strong>Innovative Bin Packing Algorithms</strong>: To address the common issue of document truncation, R1 employs innovative bin packing algorithms. These algorithms optimize the organization of training data, reducing unnecessary loss of information and ensuring that the model has access to as much data as possible to predict the next token.</p></li></ul><h2>Practical Applications and Implementation</h2><p>For those looking to implement R1, its practical applications are key to understanding its capabilities and maximizing its potential.</p><ul><li><p><strong>Best Suited for Verifiable Tasks</strong>: R1 is particularly well-suited for environments where success criteria are clearly defined and verifiable. This makes it an ideal choice for industries that require transparency, accountability, and high levels of precision, such as healthcare, law, and finance.</p></li><li><p><strong>Different Prompting Strategies</strong>: Unlike traditional models that often rely on standard prompting strategies, R1 requires experimentation to determine the most effective prompting techniques. 
This flexibility allows for tailored interactions, enabling users to unlock the model&#8217;s full potential across a variety of applications.</p></li><li><p><strong>Trajectory Planning in Agent-Based Systems</strong>: R1 performs exceptionally well in trajectory planning tasks, where it predicts the best possible path forward for an agent navigating dynamic environments. This ability makes R1 a valuable tool for agentic workflows.</p></li></ul><h2>Future Implications and Research Directions</h2><p>As DeepSeek R1 continues to evolve, several exciting avenues for future research and improvement are emerging:</p><ul><li><p><strong>Optimal Stopping Criteria for Reasoning Chains</strong>: Research into the best points at which reasoning chains should be terminated could significantly optimize both performance and efficiency, preventing unnecessary computation while ensuring high-quality outputs.</p></li><li><p><strong>Cross-Lingual Reasoning Capabilities</strong>: R1&#8217;s performance across languages, particularly in reasoning tasks, suggests a promising area for further exploration. Optimizing R1 for cross-lingual reasoning could expand its applicability to multilingual environments, broadening its scope in global applications.</p></li><li><p><strong>Token Distribution Impact on Reasoning</strong>: Investigating how varying token distribution strategies influence reasoning quality could provide valuable insights into further optimizing R1 for different types of reasoning tasks, from simple queries to complex, multi-step deductions.</p></li><li><p><strong>Integration with Existing Agent Frameworks</strong>: A key research direction is how R1 can be integrated with existing agent frameworks. This could enhance its ability to make autonomous decisions in real-world applications, further extending its utility in dynamic and interactive environments.</p></li></ul><p>By understanding these technical foundations and innovations, we can better leverage DeepSeek R1&#8217;s capabilities while acknowledging areas for future development and optimization. Through thoughtful implementation and ongoing research, DeepSeek R1 holds immense potential to drive forward the field of artificial intelligence reasoning.</p><h2>Resources</h2><ul><li><p><a href="https://arxiv.org/pdf/2501.19393">s1: Simple test-time scaling</a></p></li><li><p><a href="https://arxiv.org/pdf/2412.19437">DeepSeek-V3 Technical Report</a></p></li><li><p><a href="https://arxiv.org/pdf/2404.10830">Fewer Truncations Improve Language Modeling</a></p></li><li><p><a href="https://arxiv.org/pdf/2207.14255">Efficient Training of Language Models to Fill in the Middle</a></p></li><li><p><a href="https://arxiv.org/pdf/2408.10914">To Code, or Not To Code? 
Exploring Impact of Code in Pre-training</a></p></li><li><p><a href="https://arxiv.org/pdf/2404.19737">Better &amp; Faster Large Language Models via Multi-token Prediction</a></p></li><li><p><a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a></p></li><li><p><a href="https://arxiv.org/pdf/2406.11931">DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence</a></p></li><li><p><a href="https://arxiv.org/pdf/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.06066">DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.02954">DeepSeek LLM Scaling Open-Source Language Models with Longtermism</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.14196">DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>FAQ</h2><p><strong>Q: What are the primary innovations leading up to DeepSeek R1?</strong></p><p>A: <em>Multi-head Latent Attention (MLA)</em></p><p>&#10687; What it is: Enhances attention mechanisms by working on latent representations instead of raw token sequences, improving efficiency and scalability.</p><p>&#10687; Prior work: Perceiver (DeepMind, 2021) introduced attention over latent variables, reducing quadratic complexity in long-context scenarios.</p><p>&#10687; Why it matters: Helps with long-context processing by operating on compressed representations, enabling better retrieval and reasoning.</p><p><em>Load Balancing for MoE Models</em></p><p>&#10687; What it is: Ensures even distribution of workload across experts in Mixture of Experts (MoE) models, preventing bottlenecks.</p><p>&#10687; Prior work: GLaM (Google, 2021) improved expert selection using an auxiliary routing loss, optimizing compute efficiency.</p><p>&#10687; Why it matters: Makes MoE models more efficient and scalable, allowing better utilization of compute resources while maintaining high performance.</p><p><em>Fill-in-the-Middle (FIM) Learning Objective</em></p><p>&#10687; What it is: Trains models to generate missing text segments, not just predict the next token, improving bidirectional reasoning.</p><p>&#10687; Prior work: Codex (OpenAI, 2021) leveraged FIM to enhance code completion, significantly improving edit and autocomplete capabilities.</p><p>&#10687; Why it matters: Enables better document completion, code generation, and interactive AI assistants that can modify text instead of just appending to it.</p><p><em>FP8 Training (Floating Point 8-bit Precision)</em></p><p>&#10687; What it is: Uses 
lower-precision floating-point formats to reduce memory usage and accelerate training.</p><p>&#10687; Prior work: NVIDIA Hopper Architecture (2022) introduced hardware-optimized FP8 support, enabling more efficient training of large models.</p><p>&#10687; Why it matters: Reduces training costs and memory constraints, making long-context and large-scale models more feasible.</p><p><em>Multi-token Prediction</em></p><p>&#10687; What it is: Instead of generating tokens one at a time, the model predicts multiple tokens in parallel, improving response fluency and speed.</p><p>&#10687; Prior work: PaLM 2 (Google, 2023) refined parallel decoding techniques to improve latency and coherence in text generation.</p><p>&#10687; Why it matters: Reduces response time and improves fluency in long-form text generation, making AI models more usable in real-time applications.</p><p><strong>Q: How does the use of multi token prediction in R1 improve context and coherence compared to traditional models?</strong></p><p>A: In non-reasoning models, the next token is predicted individually at each step, which can sometimes lead to a lack of context or coherence in longer sequences. R1&#8217;s approach of predicting multiple tokens allows for a richer internal representation, which helps in planning and reasoning. Although these multi-token prediction modules are removed during inference, the richer training signals learned during the process contribute to more accurate token distributions and better reasoning, ensuring that R1 maintains context and coherence when generating responses.</p><p><strong>Q: How did the inclusion of mathematical and coding tokens during training enhance R1&#8217;s reasoning abilities?</strong></p><p>A: R1 benefited from a large set of math tokens, which were gathered through a process that involved using open web math as a seed, followed by applying a classifier to identify math-related documents in common crawl data. This methodology enabled R1 to acquire 120 billion math tokens, which were important for improving the model&#8217;s mathematical reasoning abilities. Additionally, the model&#8217;s ability to handle code was enhanced by a pre-processing pipeline that ensured proper ordering of files, including dependencies. Learning from more structured and verifiable domains like math and coding helped R1 learn the mechanics of reasoning which was generalized to other types of reasoning through training.</p><p><strong>Q: How does DeepSeek R1 handle model bias?</strong></p><p>A: All models have bias and their creators take steps to mitigate that. One of the observations I (Suhas) made during testing is that the model performed better when reasoning in Mandarin compared to English, especially for tasks requiring logical reasoning. This improvement seems to be related to the higher Shannon entropy of the Chinese alphabet (9.56 bits per character) compared to the English alphabet (3.9 bits), which may allow for richer token distributions and more efficient encoding of information. In terms of mitigating bias, the model seems to respond well to diverse inputs, and further research is ongoing to test how language and token distribution affect reasoning capabilities.</p><p><strong>Q: What are the primary evaluation metrics and testing strategies for assessing the performance of new AI models like R1?</strong></p><p>A: When evaluating new models like R1, I (Suhas) use a variety of strategies to assess reasoning capabilities. 
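</p><p>One way to script the kind of checks described next is a small perturbation harness. The sketch below is an illustration of the idea rather than code from the session; the <code>ask_model</code> callable and the exact-substring check are placeholder assumptions.</p><pre><code>import random

def shuffle_clauses(prompt: str, seed: int = 0):
    """Perturb a prompt by shuffling its sentences, so pure recall no longer suffices."""
    parts = [p.strip() for p in prompt.split(".") if p.strip()]
    random.Random(seed).shuffle(parts)
    return ". ".join(parts) + "."

def probe_reasoning(ask_model, prompt: str, expected: str):
    """Query the model with the original and the jumbled prompt; a genuine reasoner should handle both."""
    return {
        "original_ok": expected in ask_model(prompt),
        "perturbed_ok": expected in ask_model(shuffle_clauses(prompt)),
    }
</code></pre><p>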
One method is to take potentially familiar data from the model&#8217;s pre-training and alter it in such a way that it challenges the model to demonstrate whether it is merely recalling information or engaging in actual reasoning. For instance, I may jumble parts of the prompt to see if the model can still generate a correct answer based on the modified context. Additionally, I break complex tasks into smaller sub-tasks to evaluate whether the model can handle these individual components. If the model performs well on each sub-task independently, it provides insight into whether it can successfully tackle the full, multi-step task. This helps assess both the model&#8217;s ability to memorize and its capacity for generalizing reasoning across different problem types.</p><p><strong>Q: Can DeepSeek R1 be effectively used in agentic applications, where reasoning and planning are required, and if so, how?</strong></p><p>A: Yes. One of its strengths is the ability to sample from a set of potential actions, using heuristics to guide decision-making in agentic trajectories. Even without explicit support for tool calls, the model performs well when tasked with reasoning and planning. In my experience, I (Suhas) have tested the model by providing specific tokens in the prompt, which act as delineators, helping the model to follow a structured reasoning path. This approach has shown promising results, demonstrating the model&#8217;s potential for handling tasks involving reasoning and planning.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p><em><strong>ACKNOWLEDGEMENT: </strong>These notes are prepared by Mohsin Iqbal.</em></p>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek R1]]></title><description><![CDATA[Notes from our session on DeepSeek R1 - part 1]]></description><link>https://aisc.substack.com/p/understanding-deepseek-r1</link><guid isPermaLink="false">https://aisc.substack.com/p/understanding-deepseek-r1</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 03 Feb 2025 19:47:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/20525d6b-1026-4c54-9d13-da8ca0623380_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve been tracking the explosive rise of DeepSeek R1, which has taken the AI world by storm in recent weeks. In this session, we dove deep into the evolution of the DeepSeek family - from the early models through DeepSeek V3 to the breakthrough R1. 
We also explored the technical innovations that make R1 so special in the world of open-source AI.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mf0O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 424w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 848w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png" width="1082" height="260" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 424w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 848w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" 
data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The DeepSeek Family Tree: From V3 to R1</h2><p>DeepSeek isn&#8217;t just a single model; it&#8217;s a family of increasingly sophisticated AI systems. The evolution goes something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jdFy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jdFy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png" width="1456" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jdFy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 848w, 
https://substackcdn.com/image/fetch/$s_!jdFy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1456w" sizes="100vw"></picture></div></a></figure></div><h3>DeepSeek V2:</h3><p>This was the foundation model, which leveraged a mixture-of-experts architecture where only a subset of experts is used at inference, drastically reducing the processing time for each token. It also featured multi-head latent attention to reduce the memory footprint.</p><h3>DeepSeek V3:</h3><p>This model introduced FP8 training techniques, which helped drive down training costs by over 42.5% compared to previous iterations. FP8 is a lower-precision format for storing weights inside an LLM, which greatly reduces the memory footprint. However, FP8 training is typically unstable, and it is hard to obtain the desired results. Nevertheless, DeepSeek uses multiple tricks to achieve remarkably stable FP8 training. V3 set the stage as a highly efficient model that was already cost-effective (with claims of being 90% cheaper than some closed-source alternatives).</p><h3>DeepSeek R1-Zero:</h3><p>With V3 as the base, the team then introduced R1-Zero, the first reasoning-focused iteration. Here, the focus was on teaching the model not just to generate answers but to &#8220;think&#8221; before answering. Using pure reinforcement learning, the model was encouraged to generate intermediate reasoning steps, for example, taking extra time (often 17+ seconds) to work through a simple problem like &#8220;1+1.&#8221;</p><p>The key innovation here was the use of group relative policy optimization (<strong>GRPO</strong>). Instead of relying on a conventional process reward model (which would have required annotating every step of the reasoning), GRPO compares multiple outputs from the model. 
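</p><p>To make the group-relative idea concrete, here is a minimal sketch of how a GRPO-style advantage can be computed from a group of sampled answers scored by a rule-based verifier. This is an illustration rather than DeepSeek&#8217;s code; the exact-match reward and the sample values are placeholder assumptions, and the full objective also adds a clipped policy ratio and a KL penalty, omitted here.</p><pre><code>import statistics

def rule_based_reward(answer: str, reference: str):
    # Placeholder verifier: exact-match reward, as one might use for math-style tasks.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_advantages(answers, reference):
    """Score a group of sampled answers and return group-relative advantages."""
    rewards = [rule_based_reward(a, reference) for a in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero spread
    # Each answer is judged relative to its own group, so no learned reward model is needed.
    return [(r - mean) / std for r in rewards]

# Hypothetical group of four samples for the prompt "What is 17 * 3?"
print(grpo_advantages(["51", "40", "51", "34"], reference="51"))  # [1.0, -1.0, 1.0, -1.0]
</code></pre><p>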
By sampling several potential answers and scoring them (using rule-based measures like exact match for math or verifying code outputs), the system learns to favor reasoning that leads to the correct result without the need for explicit supervision of every intermediate thought.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f3Z5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png" width="1456" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>DeepSeek R1:</h3><p>Recognizing that R1-Zero&#8217;s unsupervised approach produced reasoning outputs that could be hard to read or even mix languages, the developers went back to the drawing board. They used the raw outputs from R1-Zero to generate &#8220;cold start&#8221; data and then manually curated these examples to filter and improve the quality of the reasoning. This human post-processing was then used to fine-tune the original DeepSeek V3 model further&#8212;combining both reasoning-oriented reinforcement learning and supervised fine-tuning. The result is DeepSeek R1: a model that now produces readable, coherent, and reliable reasoning while still maintaining the efficiency and cost-effectiveness of its predecessors.</p><h2>What Makes R1 Series Special?</h2><p>The most fascinating aspect of R1 (zero) is how it developed reasoning capabilities without explicit supervision of the reasoning process. It can be further improved by using cold-start data and supervised reinforcement learning to produce readable reasoning on general tasks. Here's what sets it apart:</p><h3>Open Source &amp; Efficiency:</h3><p>R1 is open source, allowing researchers and developers to inspect and build upon its innovations. Its cost efficiency is a major selling point especially when compared to closed-source models (claimed 90% cheaper than OpenAI) that require massive compute budgets.</p><h3>Novel Training Approach:</h3><p>Instead of relying solely on annotated reasoning (which is both expensive and time-consuming), the model was trained using an outcome-based approach. It started with easily verifiable tasks, such as math problems and coding exercises, where the correctness of the final answer could be easily measured.</p><p>By using group relative policy optimization, the training process compares multiple generated answers to determine which ones meet the desired output. This relative scoring mechanism allows the model to learn &#8220;how to think&#8221; even when intermediate reasoning is generated in a freestyle manner.</p><h3>Overthinking?</h3><p>An interesting observation is that DeepSeek R1 sometimes &#8220;overthinks&#8221; simple problems. 
For example, when asked &#8220;What is 1+1?&#8221; it might spend nearly 17 seconds evaluating different scenarios&#8212;even considering binary representations&#8212;before concluding with the correct answer. This self-questioning and verification process, although it might seem inefficient at first glance, could prove advantageous in complex tasks where deeper reasoning is necessary.</p><h3>Prompt Engineering:</h3><p>Traditional few-shot prompting techniques, which have worked well for many chat-based models, can actually degrade performance with R1. The developers recommend using direct problem statements with a zero-shot approach that specifies the output format clearly. This ensures that the model isn&#8217;t led astray by extraneous examples or hints that might interfere with its internal reasoning process.</p><h2>Getting Started with R1</h2><p>For those looking to experiment:</p><ul><li><p>Smaller variants (7B-8B) can run on consumer GPUs or even only CPUs</p></li><li><p>Larger versions (600B) require significant compute resources</p></li><li><p>Available through major cloud providers</p></li><li><p>Can be deployed locally via Ollama or vLLM</p></li></ul><h2>Looking Ahead</h2><p>We're particularly intrigued by several implications:</p><ol><li><p>The potential for this approach to be applied to other reasoning domains</p></li><li><p>Impact on agent-based AI systems traditionally built on chat models</p></li><li><p>Possibilities for combining with other supervision techniques</p></li><li><p>Implications for enterprise AI deployment</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Open Questions</h2><ul><li><p>How will this affect the development of future reasoning models?</p></li><li><p>Can this approach be extended to less verifiable domains?</p></li><li><p>What are the implications for multi-modal AI systems?</p></li></ul><p>We'll be watching these developments closely, particularly as the community begins to experiment with and build upon these techniques.</p><h2>Resources</h2><p><a href="https://join.slack.com/t/aisc-to/shared_invite/zt-2mlfe75x1-WCBNwP31Wz9nnGgLQxrHTg">Join our Slack community</a> for ongoing discussions and updates about DeepSeek and other AI developments. 
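</p><p>If you would rather experiment locally first, the Getting Started notes above mention Ollama; a minimal session against its local REST API might look like the sketch below. The model tag and the zero-shot, format-specifying prompt are assumptions on our part, so adjust them to whichever distilled variant you have pulled.</p><pre><code>import requests  # assumes an Ollama server running locally on its default port

# Zero-shot, direct problem statement with an explicit output format,
# in line with the prompt-engineering guidance above (no few-shot examples).
prompt = "Solve 37 * 43. Put only the final number inside \\boxed{}."

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "deepseek-r1:7b", "prompt": prompt, "stream": False},
    timeout=300,
)
print(resp.json()["response"])  # the model's reasoning followed by the boxed answer
</code></pre><p>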
We're seeing fascinating applications already emerging from our <a href="https://maven.com/aggregate-intellect/llm-systems/">bootcamp participants</a> working with these models.</p><ul><li><p>Chat with DeepSeek: </p></li></ul><p>https://www.deepseek.com/</p><ul><li><p>Papers:</p><ul><li><p><a href="https://arxiv.org/abs/2401.02954">DeepSeek LLM</a></p></li><li><p><a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2</a></p></li><li><p><a href="https://arxiv.org/pdf/2412.19437v1">DeepSeek-V3</a></p></li><li><p><a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a></p></li></ul></li><li><p>Blog Posts:</p><ul><li><p><a href="https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1?trk=comments_comments-list_comment-text">The Illustrated DeepSeek-R1</a></p></li><li><p><a href="https://aipapersacademy.com/deepseek-r1/">DeepSeek-R1 Paper Explained</a></p></li><li><p><a href="https://medium.com/@mayadakhatib/deepseek-r1-a-short-summary-73b6b8ced9cf">DeepSeek R1 &#8212; a short summary</a></p></li></ul></li><li><p>Cloud Providers:</p><ul><li><p><a href="https://build.nvidia.com/deepseek-ai/deepseek-r1">Nvidia</a></p></li><li><p><a href="https://www.together.ai/">Together.ai</a></p></li><li><p><a href="https://aws.amazon.com/bedrock/">AWS</a></p></li></ul></li></ul><div id="youtube2-L_iWRH3CfQ8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;L_iWRH3CfQ8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/L_iWRH3CfQ8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Q&amp;A</h2><h4><strong>Q1: Which model deserves more attention &#8211; DeepSeek or Qwen2.5Max?</strong></h4><p>A: While Qwen2.5 is also a strong model in the open-source community, the choice ultimately depends on your use case. DeepSeek R1 emphasizes advanced reasoning and a novel training approach that may be especially valuable in tasks where verifiable logic is critical.</p><h4><strong>Q2: Why did major providers like OpenAI opt for supervised fine-tuning rather than reinforcement learning (RL) like DeepSeek?</strong></h4><p>A: We should note upfront that they do use RL at the very least in the form of RLHF. It is very likely that models from major providers that have reasoning capabilities already use something similar to what DeepSeek has done here, but we can&#8217;t be sure. It is also likely that due to access to more resources, they favored supervised fine-tuning due to its stability and the ready availability of large annotated datasets. Reinforcement learning, although powerful, can be less predictable and harder to control. DeepSeek&#8217;s approach innovates by applying RL in a reasoning-oriented manner, enabling the model to learn effective internal reasoning with only minimal process annotation - a strategy that has proven promising despite its complexity.</p><h4><strong>Q3: Did DeepSeek use test-time compute strategies similar to those of OpenAI?</strong></h4><p>A: DeepSeek R1&#8217;s design emphasizes efficiency by leveraging techniques such as the mixture-of-experts approach, which activates only a subset of parameters, to reduce compute during inference. 
This focus on efficiency is central to its cost advantages.</p><h4><strong>Q4: What is the difference between R1-Zero and R1?</strong></h4><p>A: R1-Zero is the initial model that learns reasoning solely through reinforcement learning without explicit process supervision. It generates intermediate reasoning steps that, while sometimes raw or mixed in language, serve as the foundation for learning. DeepSeek R1, on the other hand, refines these outputs through human post-processing and supervised fine-tuning. In essence, R1-Zero provides the unsupervised &#8220;spark,&#8221; and R1 is the polished, more coherent version.</p><h4><strong>Q5: How can one stay updated with in-depth, technical research while managing a busy schedule?</strong></h4><p>A: Staying current involves a combination of actively engaging with the research community (like AISC - see link to join slack above), following preprint servers like arXiv, attending relevant conferences and webinars, and participating in discussion groups and newsletters. Continuous engagement with online communities and collaborative research projects also plays a key role in keeping up with technical advancements.</p><h4><strong>Q6: In what use-cases does DeepSeek outperform models like O1?</strong></h4><p>A: The short answer is that it&#8217;s too early to tell. DeepSeek R1&#8217;s strength, however, lies in its robust reasoning capabilities and its efficiency. It is particularly well suited for tasks that require verifiable logic&#8212;such as mathematical problem solving, code generation, and structured decision-making&#8212;where intermediate reasoning can be reviewed and confirmed. Its open-source nature further allows for customized applications in research and enterprise settings.</p><h4><strong>Q7: What are the implications of DeepSeek R1 for enterprises and start-ups?</strong></h4><p>A: The open-source and cost-efficient design of DeepSeek R1 lowers the entry barrier for deploying advanced language models. Enterprises and start-ups can leverage its advanced reasoning for agentic applications ranging from automated code generation and customer support to data analysis. Its flexible deployment options&#8212;on consumer hardware for smaller models or cloud platforms for larger ones&#8212;make it an attractive alternative to proprietary solutions.</p><h4><strong>Q8: Will the model get stuck in a loop of &#8220;overthinking&#8221; if no correct answer is found?</strong></h4><p>A: While DeepSeek R1 has been observed to &#8220;overthink&#8221; simple problems by exploring multiple reasoning paths, it incorporates stopping criteria and evaluation mechanisms to prevent infinite loops. The reinforcement learning framework encourages convergence toward a verifiable output, even in ambiguous cases.</p><h4><strong>Q9: Is DeepSeek V3 completely open source, and is it based on the Qwen architecture?</strong></h4><p>A: Yes, DeepSeek V3 is open source and served as the foundation for later iterations. It is built on its own set of innovations&#8212;including the mixture-of-experts approach and FP8 training&#8212;and is not based on the Qwen architecture. Its design emphasizes efficiency and cost reduction, setting the stage for the reasoning innovations seen in R1.</p><h4><strong>Q10: How does DeepSeek R1 perform on vision tasks?</strong></h4><p>A: DeepSeek R1 is a text-based model and does not incorporate vision capabilities. 
Its design and training focus solely on language processing and reasoning.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h4><strong>Q11: Can professionals in specialized fields (for example, labs working on cures) apply these methods to train domain-specific models?</strong></h4><p>A: Yes. The innovations behind DeepSeek R1&#8212;such as its outcome-based reasoning training and efficient architecture&#8212;can be adapted to various domains. Researchers in fields like biomedical sciences can tailor these methods to build models that address their specific challenges while benefiting from lower compute costs and robust reasoning capabilities. It is likely that in deeply specialized fields, however, there will still be a need for supervised fine-tuning to get reliable results.</p><h4><strong>Q12: Were the annotators for the human post-processing experts in technical fields like computer science or mathematics?</strong></h4><p>A: The discussion indicated that the annotators primarily focused on domains where correctness is easily verifiable&#8212;such as math and coding. This suggests that expertise in technical fields was indeed leveraged to ensure the accuracy and clarity of the reasoning data.</p><h4><strong>Q13: Could the model get things wrong if it relies on its own outputs for learning?</strong></h4><p>A: While the model is designed to optimize for correct answers via reinforcement learning, there is always a risk of errors&#8212;especially in ambiguous scenarios. However, by evaluating multiple candidate outputs and reinforcing those that lead to verifiable results, the training process minimizes the likelihood of propagating incorrect reasoning.</p><h4><strong>Q14: How are hallucinations minimized in the model given its iterative reasoning loops?</strong></h4><p>A: The use of rule-based, verifiable tasks (such as math and coding) helps anchor the model&#8217;s reasoning. By comparing multiple outputs and using group relative policy optimization to reinforce only those that yield the correct result, the model is guided away from generating unfounded or hallucinated information.</p><h4><strong>Q15: Does the model rely on complex vector mathematics?</strong></h4><p>A: Yes, advanced techniques&#8212;including complex vector math&#8212;are integral to the implementation of mixture-of-experts and attention mechanisms in DeepSeek R1. However, the primary focus is on using these techniques to enable effective reasoning rather than showcasing mathematical complexity for its own sake.</p><h4><strong>Q16: Some worry that the model&#8217;s &#8220;thinking&#8221; may not be as refined as human reasoning. Is that a valid concern?</strong></h4><p>A: Early iterations like R1-Zero did produce raw and sometimes hard-to-read reasoning. 
However, the subsequent refinement process&#8212;where human experts curated and improved the reasoning data&#8212;has significantly enhanced the clarity and reliability of DeepSeek R1&#8217;s internal thought process. While it remains an evolving system, iterative training and feedback have led to meaningful improvements.</p><h4><strong>Q17: Which model variants are suitable for local deployment on a laptop with 32GB of RAM?</strong></h4><p>A: For local testing, a medium-sized model&#8212;typically in the range of 7B to 8B parameters&#8212;is recommended. Larger models (for example, those with hundreds of billions of parameters) require significantly more computational resources and are better suited for cloud-based deployment.</p><h4><strong>Q18: Is DeepSeek R1 &#8220;open source&#8221; or does it offer only open weights?</strong></h4><p>A: DeepSeek R1 is provided with open weights, meaning that its model parameters are publicly accessible. This aligns with the overall open-source philosophy, allowing researchers and developers to further explore and build upon its innovations.</p><h4><strong>Q19: What would happen if the order of training were reversed&#8212;starting with supervised fine-tuning before unsupervised reinforcement learning?</strong></h4><p>A: The current approach allows the model to first explore and generate its own reasoning patterns through unsupervised RL, and then refine these patterns with supervised methods. Reversing the order might constrain the model&#8217;s ability to discover diverse reasoning paths, potentially limiting its overall performance in tasks that benefit from autonomous thought.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><em><strong>ACKNOWLEDGEMENT: </strong>These notes are prepared by Mohsin Iqbal and edited by Boqi (Percy) Chen and myself.</em></p>]]></content:encoded></item><item><title><![CDATA[AI and Talent Development]]></title><description><![CDATA[Easily 9 out of 10 data people I know come from non-computer science backgrounds. 
Is this the sign of a declining and failing educational system or is it just the natural evolution of things?]]></description><link>https://aisc.substack.com/p/ai-and-talent-development</link><guid isPermaLink="false">https://aisc.substack.com/p/ai-and-talent-development</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 21 Oct 2024 19:46:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6e216d9-f7a9-4ae4-b3e6-cb0f1f0c225a_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As AI takes over the world across industries, one of the big topics of discussion is: <em>what would humans do then?</em> In more recent history, educational institutions have been responsible for providing an answer to this question. You would go to school to become an accountant, or a lawyer, or a doctor etc, and then you become one. The dotted line between &#8220;what do you want to be when you grow up&#8221; and &#8220;what you really ended up doing&#8221; has been connected to each other via various stages of progressively advanced education. Now the abundance of the question <em>&#8220;what would humans do&#8221;</em> and its tangential variants tells me that there is a big gap in people&#8217;s heads between educational output and what is realistic for people to do going forward.&nbsp;</p><p>This is, of course, not a complete surprise. Easily 9 out of 10 data people I know come from non-computer science backgrounds. While &#8220;people I know&#8221; might not be exactly a representative population, I doubt anyone can argue against the possibility and the occurrence of significant career movement to areas that are superficially unrelated to one&#8217;s educational background in the past decade or so. Is this the sign of a declining and failing educational system or is it just the natural evolution of things?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>A Historical Lens</h1><p>If we take a look at how &#8220;education&#8221; has evolved over time, the major shifts over the past few centuries closely align with the various industrial revolutions. In other words, every time the work context has changed drastically, a combination of market forces, business interests, and government incentives created a force towards aligning what people learned as they grew up to what kind of workforce was needed.&nbsp;</p><p>In the 18th and 19th century when work changed from manual labor to mechanization and rise of factories, it transformed the agriculture focused societies into industrialized urban centers, and kickstarted a growing demand for a literate and numerate workforce. Factory jobs required workers who could read instructions, measure, and do simple calculations. 
The increasing demand for skilled workers led governments to invest in primary education systems.</p><p>During the late 19th and early 20th century electrification, telecommunications, and large-scale infrastructure like railroads created another tectonic shift. As industries grew more specialized, the need for vocational and technical education became apparent. Schools began offering specific training for careers in engineering, manufacturing, and other technical fields. Polytechnic institutions and trade schools emerged to provide practical skills to the working class. Governments began extending compulsory education beyond primary school to better prepare students for skilled labor in an industrial society, and even public universities started to emerge.&nbsp;</p><p>Then post-WW-II and until very recently, we have been going through the digital revolution where everything has been slowly becoming computerized. Schools and universities began integrating computers and other digital tools into the classroom. The rise of personal computers in the 1980s and 1990s and the internet in the 1990s dramatically changed how information was accessed and shared. Computer literacy became essential for students. The internet made distance learning and online education possible, democratizing access to education, and then Covid made it the norm. The rapid pace of technological change meant that workers needed to continually update their skills. Education systems began placing a greater emphasis on lifelong learning and adult education programs to help workers adapt to new digital realities and corporates started designing and running reskilling and upskilling programs.&nbsp;</p><h1>Rise of Language Models</h1><p>One of the major technological shifts in the past 5 years has been our ability to computationally analyze and generate natural and formal language. While language in itself is not a complete representation of intelligence, the entrance of the large language models (LLMs) into the public vocabulary has created the speculation of the upcoming &#8220;artificial general intelligence&#8221; (AGI) and the 4th industrial revolution. If (when?) that actually happens, then we will be one of the first generations that might experience more than one industrial revolution in their lifetime which has significant implications for how we live, work, think, learn, entertain ourselves, socialize, find love, and more. The speculations are also partly based on the progress we are making in other technologies like robotics, IoT, bio-tech, and quantum computing, and the hope that more powerful AI systems mean more major breakthroughs in these areas as we saw in the case of protein folding.&nbsp;</p><p>Even before AGI is here, the rate at which LLMs are getting better at tasks beyond simple linguistic ones is remarkable and the expectation is that we will see significant progress towards automation in tasks that are more traditionally reserved for the human brain. Multi-modal language models and their cousins, especially when combined with more traditional software scaffolds, are expected to be able to do all sorts of tasks that require entry to mid level expertise relatively soon. This type of automation is displacing many tasks that have traditionally been entry points for junior workers after going through the educational system. 
Routine tasks like data wrangling, report generation, writing software tests, and basic analyses&#8212;key areas where people typically learn on the job&#8212;are being automated by AI tools. As a result, fewer opportunities for hands-on learning exist at the lower rungs of the professional ladder. The tasks left for humans often require advanced problem-solving, decision-making, and strategic thinking, which are typically handled by senior employees. This could lead to a bifurcated workforce, where junior talent lacks the experience needed to develop into senior positions, potentially leading to talent gaps in more senior, complex roles.</p><p>So, the question is, what do LLMs and their future iterations mean for the future of learning and talent development?</p><h1>Emerging Trends in Talent Development</h1><p>It is hard to predict where things will go given the pace of change and chaos created by lack of preparedness in the society, industry, and academia. But a few general patterns are most likely to happen:</p><ul><li><p><strong>Interdisciplinary Skills:</strong> The future of work is increasingly interdisciplinary. As AI integrates with fields like healthcare, finance, logistics, and even the humanities, talent development will need to focus on cultivating skills that cross boundaries between disciplines (e.g., AI for biology, AI ethics, or AI in creative industries). This shift will encourage universities and industries to promote cross-disciplinary education where AI workers (human or machine) collaborate with domain experts.</p></li><li><p><strong>AI Automation and Tooling:</strong> There&#8217;s a growing trend toward automated workflow execution and no-code / low-code platforms which require less technical know-how and might operate with natural language as an interface. Talent development will likely focus on higher-order problem-solving skills rather than specific subject matters like manual coding, as more AI tools abstract away the implementation details. The emphasis will shift to developing business acumen and problem-definition capabilities&#8212;being able to frame business problems and align them with AI solutions will become critical.</p></li><li><p><strong>Fluid Academic Disciplines and Accreditation:</strong> The fluidity in learning and working, particularly in AI-driven environments where automation and interdisciplinary skills are critical, does raise important questions about the future of formal, distinct academic disciplines. Rather than seeing academic disciplines as isolated silos, they will be viewed as building blocks for more fluid, interdisciplinary research and industrial work. And instead of solely focusing on the subject matter of the discipline, education will focus on methods of inquiry and problem-solving that are crucial in interdisciplinary collaboration. For instance, the scientific method in the natural sciences, the design thinking process in engineering, and the interpretative methods in social sciences all become the direct learning objectives. Finally, education could shift to a more modular structure where students can specialize in certain disciplinary areas but take modules from multiple fields to create customized learning pathways. 
This would maintain the rigor of disciplinary knowledge while allowing flexibility for interdisciplinary applications.</p></li><li><p><strong>Experiential Learning and Research:</strong> Universities will balance traditional disciplinary learning with experiential and project-based learning that reflects the real-world challenges of AI and automation. By integrating distinct academic disciplines with applied, hands-on learning experiences, universities will prepare students to work across boundaries in industry while still being grounded in the depth of formal knowledge.</p></li><li><p><strong>Expert Mentors and AI Gyms for Juniors:</strong> With the automation of the grunt work, more senior employees will spend some of their time creating learning paths and challenges in AI-powered learning playgrounds where junior employees work with their AI learning buddies to fill the talent gap and become skilled in advanced problem-solving, critical thinking, and strategic planning.&nbsp;</p></li></ul><p>These are likely trends for undergraduate-level education and early career development; the next question is what the AI revolution means for the postgraduate side of academia and research.</p><h1>Blurring Academia / Industry Boundaries</h1><p>Another important trend is that the translation gap, the work required for an academic idea to become operational in industry, has been shrinking as well. With more research and development firms getting funded, more corporations starting their research labs, and more prominent scientists working for the tech giants or starting their own companies, the separation of research responsibilities between academia and industry is quickly vanishing. This intermingling has started to bridge the gap between the technical knowledge required for developing complex AI systems and the operational expertise necessary to manage them.&nbsp;</p><ul><li><p><strong>Bridging the Tech-Business Divide:</strong> With more co-pilots and agent systems deployed, business people are more empowered to interact with traditionally academic artifacts like language models and provide feedback about how those artifacts can better fit into their operational workflows. Applied scientists, on the other hand, spend more time with their industry counterparts learning about what is needed to move their models and pipelines from the abstract world of the lab to the messy, real world.&nbsp;</p></li><li><p><strong>AI System Lifecycle Management: </strong>Academics will focus more on the entire AI system lifecycle&#8212;covering design, deployment, monitoring, and governance&#8212;to ensure researchers and students understand both the technical intricacies of building AI systems and the operational aspects of maintaining and scaling them over time. This would go even deeper into topics like governance, fairness, bias, and ethics built into AI systems by design.&nbsp;</p></li><li><p><strong>Human-Machine Interaction:</strong> Research and development in human-AI interaction will become increasingly relevant in the gray area between academia and industry as AI systems are integrated into everyday life and workplaces. 
Researchers and their industry counterparts will focus on designing AI systems that complement human decision-making and create seamless interactions between AI tools and users.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Verdict</h1><p>The future of work and education is increasingly intertwined with technological advancements, especially as the onset of the 4th industrial revolution reshapes industries. As AI and automation accelerate, the tasks traditionally performed by junior workers may become obsolete, requiring educational systems to evolve beyond rigid structures like the four-year degree. Interdisciplinary learning, lifelong upskilling, and adaptability will be critical as the distinction between academic disciplines blurs in response to workforce demands. The role of industry in driving innovation, as seen in Nobel Prize-winning research from both academic and industrial sectors, underscores the importance of collaboration. Universities and organizations must create sustainable talent pipelines that focus on practical problem-solving and continuous learning, ensuring workers remain equipped for complex roles.</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 6 - State Management]]></title><description><![CDATA[How can we control the behavior of agents by drawing lines around the boundaries of their agency?]]></description><link>https://aisc.substack.com/p/llm-agents-part-6-state-management</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-6-state-management</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 02 Sep 2024 14:37:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9c6f00db-bab5-42a7-81a3-b68f361ddde1_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At this point, we've seen how Service-Oriented Architecture (SOA) and Event-Driven Architecture (EDA) boost modularity, responsiveness, and scalability in our multi-agent system. However, these architectures don't fully address the complexities of managing internal task progression or multi-step workflows within agents. That's where State Management steps in, providing an explicit structure to agent behaviors and system-wide data flow. In this article, we'll explore how State Management can significantly improve multi-agent systems.</p><p>State management in multi-agent systems is all about defining the playground for autonomy. It's like drawing boundary lines that let agents explore and act within a landscape of possible states, guided by the rules of the system. This balance between freedom and structure ensures each agent can play its part while keeping the overall system in harmony.<br><br>Consider LLM agents in our biotech sales example. 
An agent processing potential leads might freely prioritize and categorize them, but it can't access financial records or communicate with clients directly&#8212;those boundaries are set by the state management system. Additionally, certain states might be conditionally available. For instance, the agent may only access prior communication history for a client if those documents are tagged as unclassified, ensuring that sensitive data is only handled when relevant.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>What Is a State?</h3><p>State, in the context of software systems, represents the condition or status of an application or its components at a specific point in time.</p><p>In a multi-agent system, state can encompass various elements:</p><ul><li><p>Agent Internal State: The current status, knowledge, beliefs, and decision-making parameters of individual agents.</p></li><li><p>Task State: The progress of ongoing tasks or processes within the system.</p></li><li><p>Environment State: The current condition of the environment in which the agents operate.</p></li><li><p>System-wide State: The overall status of the multi-agent system, including inter-agent relationships and global parameters.</p></li></ul><p>As you can see already, there are multiple levels of granularity that can exist when describing the state of the overall system and its components and their sub-components, and so on. </p><p>Each agent in our system maintains its own internal state, which influences its decision-making processes and actions. For example, the Lead Qualification Agent's state might include the criteria it's currently using to evaluate leads, while the Proposal Generation Agent's state could include the sections of a proposal it has completed.</p><h3>What is State Management?</h3><p>State Management is the practice of organizing, tracking, and controlling the state of a software system. In multi-agent systems, it extends to coordinating the states of individual agents, their underlying services, and the overall system state. </p><p>State Management provides mechanisms for:</p><ul><li><p>Defining possible states and transitions between them</p></li><li><p>Updating state or transition based on events or actions</p></li><li><p>Propagating state changes to relevant parts of the system</p></li><li><p>Ensuring consistency across distributed components</p></li></ul><p>For example, in our biotech sales system, a Lead Management Agent might progress through states such as "Lead Identified," "Lead Qualified," "Proposal Generated," and "Deal Closed." 
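One minimal way to encode such a progression, sketched here purely for illustration (the class name, state names, and transition table are assumptions rather than a prescribed implementation), is an explicit table of allowed transitions:</p><pre><code># Illustrative sketch only: a minimal state model for the Lead Management Agent.
# State names and transition rules are assumptions for demonstration.

ALLOWED_TRANSITIONS = {
    "Lead Identified": ["Lead Qualified"],
    "Lead Qualified": ["Proposal Generated"],
    "Proposal Generated": ["Deal Closed"],
    "Deal Closed": [],
}

class LeadManagementAgent:
    def __init__(self):
        self.state = "Lead Identified"

    def transition(self, new_state):
        # The agent may only move along predefined paths; anything else is
        # rejected, which is the "boundary drawing" role of state management.
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"Invalid transition from {self.state} to {new_state}")
        self.state = new_state

agent = LeadManagementAgent()
agent.transition("Lead Qualified")      # allowed
# agent.transition("Deal Closed")       # would raise: a step was skipped</code></pre><p>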
Each state represents a distinct phase in the sales process, with transitions driven by specific events or conditions.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h3>Benefits of State Management in Multi-Agent Systems</h3><p>Implementing robust State Management in multi-agent systems offers several key advantages:</p><ul><li><p><strong>Agent Autonomy and Interaction: </strong>State Management provides a framework for representing an agent's internal state and its relationship to the overall system state, modeling its decision-making process, and enabling its interaction with other agents. This is crucial in multi-agent systems where agents are autonomous entities making decisions based on their internal states and perceptions of the environment. In absence of effective state management, the state space available to autonomous agents might be too wide to allow efficient decision making and progress.</p></li><li><p><strong>Managing Complexity: </strong>As agents become more sophisticated and handle more complex workflows, it becomes essential to have a clear structure governing how they move through their tasks. State Management provides this structure by explicitly defining a series of states and transitions, ensuring that agents follow logical and predictable paths. The modular nature of state based representation of the system dynamics also makes the system easier to understand and improve. For example, we might have a hierarchical structure with states that are selected via routing type of transitions and then a series of transitions within their substructures.&nbsp;</p></li><li><p><strong>Ensuring Task Completion: </strong>By explicitly defining states and transitions, State Management ensures that agents complete all necessary tasks before moving on to the next phase. This is particularly important in processes where each step must be completed before the next can begin. For example, in our biotech sales system, the Business Development Agent must complete the "Qualify Lead" task before moving to the "Assess Viability" task.</p></li><li><p><strong>Improved Coordination:</strong> By clearly defining and managing states, we ensure that all agents have a consistent understanding of the system's status, leading to better coordination. This might involve tracking important variables and their current values which can be used as inputs to the next best action selection as part of the system state.&nbsp;</p></li><li><p><strong>Enhanced Reliability:</strong> State Management helps prevent agents from entering invalid states or performing actions out of sequence, reducing errors in complex processes. This could manifest as a guardrail, for example, preventing users from asking questions about politics from a system designed to help with selling biotechnology products. Or it might prevent the system from placing a sales order before a certain checklist of approvals are obtained.&nbsp;</p></li><li><p><strong>Increased Scalability:</strong> As we add more agents or expand the system's capabilities, a well-structured State Management approach makes it easier to integrate new components without disrupting existing workflows. 
Thinking about the system design as states and their corresponding transitions is naturally modular with easier pathways for extensibility.&nbsp;</p></li><li><p><strong>Better Observability:</strong> With centralized State Management, it becomes easier to monitor the system's overall status, track progress, and identify bottlenecks in various processes. All of us have experienced the nightmare of tracking the values of our variables as they flow through the pipelines especially as the software becomes more and more complex, and now we also have to do that for complex data objects that contain lots of natural language strings.</p></li><li><p><strong>Simplified Debugging:</strong> When issues arise, having a clear state model makes it easier to trace the sequence of events and identify the root cause of problems. This is a consequence of higher visibility in the inner workings of the system but also an outcome of having a more unified pattern of investigation.&nbsp;</p></li><li><p><strong>Adaptive Behavior:</strong> State Management allows agents to adapt their behavior based on their current state and the state of the system, enabling more intelligent and context-aware decision-making.&nbsp;</p></li></ul><p>It's worth noting that state management techniques have been used in traditional software development for decades, particularly in areas such as user interface design, game development, and embedded systems. This same principle is now being applied to multi-agent systems, allowing us to manage the complexity of agent behaviors in a similar manner.</p><h3>Implementing State Management in the Biotech Sales Example</h3><p>To better understand the role of State Management in our multi-agent system, let's apply them to the biotech sales scenario. Consider the Business Development Agent, which follows a structured series of states to evaluate and qualify leads. This agent might progress through the following states, which align with its services:</p><ol><li><p>Lead Identified (Lead Generation Service)</p></li><li><p>Lead Qualified (Lead Qualification Service)</p></li><li><p>Viability Assessed (Viability Assessment Service)</p></li><li><p>Objections Handled (Objection Handling Service)</p></li><li><p>Meeting Scheduled (Meeting Scheduling Service)</p></li></ol><p>Each state is defined by specific actions and rules for transitioning to the next state and can contain more granular sub-states. For example, the transition from "Lead Qualified" to "Viability Assessed" might depend on whether the lead meets certain qualification criteria set by the Lead Qualification Service.</p><p>This structured approach ensures that the agent doesn't skip crucial steps, like assessing the technical fit of a lead before scoring it. It also enables the agent to handle errors, such as missing data, by transitioning to an error-handling state and retrying the process. Another important point about the State Management approach is its capability to enable the agents to handle routing gracefully. For example, it helps in choosing the right chain of actions or pathways in the workflow, ensuring that the agent follows the most optimal path based on the current state and context.</p><h3>Challenges and Considerations</h3><p>While State Management offers numerous benefits, it also comes with challenges that need to be considered:</p><ul><li><p><strong>Complexity:</strong> As the number of states and transitions grows, the system can become complex and harder to manage. 
In larger multi-agent systems with multiple agents like our Business Development Agent, Sales Agent, and Customer Simulation Agent, it's crucial to keep the state diagrams well-organized and ensure that transitions are clear and logical.</p></li><li><p><strong>Redundancy:</strong> In some cases, similar actions might need to be performed in multiple states across different agents. To avoid redundancy, it's important to identify common actions and abstract them into reusable components that can be called by different states or services.&nbsp;</p></li><li><p><strong>Debugging Transitions:</strong> While state management can simplify debugging by providing clear states, identifying issues in transitions, especially in complex decision-making logic, can still be challenging. Careful testing and monitoring are essential to ensure smooth operation across all agents and their services.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Wrapping Up</h2><p>As we've explored throughout this article, the combination of Service-Oriented Architecture (SOA), Event-Driven Architecture (EDA), and State Management forms a solid foundation for building sophisticated multi-agent systems like our biotech sales platform. By combining these architectural patterns, we create multi-agent systems that are Flexible, Scalable, Responsive, Robust, and Maintainable.</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 5 - Communication Protocol in Agentic Systems]]></title><description><![CDATA[How should agents and services communicate, coordinate, and keep track of tasks?]]></description><link>https://aisc.substack.com/p/llm-agents-part-5-communication-protocol</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-5-communication-protocol</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 21 Aug 2024 19:36:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/06d1e28f-5518-42a0-adc4-61fe91821cbf_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our <a href="https://aisc.substack.com/s/ai-llms">multi-agent systems series</a>, we have started by introducing what agents are and how multi-agent systems emerge as a natural evolution of the software architecture as we move on to more complex workflows. We explored how Service-Oriented Architecture (SOA) can be applied to create flexible, <a href="https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm">modular multi-agent systems</a>, and then looked at how it can be used for a <a href="https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi">biotech sales organization as our example</a>.</p><p>While SOA provides a solid foundation for structuring our multi-agent system, it doesn't fully capture the dynamic nature of complex, real-time interactions. SOA tells us what the components are, but it doesn't address how these components interact or manage their internal workflows. 
To create truly responsive and adaptable systems, ones that can eventually mimic some degree of agency, we need to go beyond static structures and incorporate patterns that handle the flow of information and the progression of tasks.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Today, we&#8217;ll be building on that foundation by introducing another critical architectural component: Event-Driven Architecture (EDA).</p><h2>Service-Oriented Architecture (SOA)</h2><p>Before getting into EDA, let's first recap why Service-Oriented Architecture really matters for multi-agent systems. SOA is all about breaking down complex processes into manageable, independent services. In our biotech sales system, SOA allowed us to modularize the entire sales process into independent, specialized services. Each service, such as lead generation, lead qualification, viability assessment, and proposal writing, operates like a specialized team within a business, each performing a distinct role. The key here is that these services communicate with each other through well-defined interfaces, promoting loose coupling. This, we argued, results in a system that won't fall apart every time you need to make a change.</p><p>For example, let's say you decide to develop a more advanced logic for lead qualification. In a monolithic system, this could be a nightmare, potentially affecting everything from data input to proposal generation. But with SOA, you can update the lead generation service without disrupting other services like proposal writing or viability assessment. This flexibility is what makes SOA such a powerful bedrock for multi-agent systems.</p><h2>Event-Driven Architecture (EDA)</h2><p>Now, let's talk about Event-Driven Architecture. EDA isn't new, it's a well-established design pattern that's been around for years. But its application in multi-agent systems, while in its infancy, is where things get interesting, and potentially messy if not handled correctly.</p><p>EDA is a software design pattern that emphasizes real-time system response to events. An "event" is a significant state change such as a new lead being identified or a proposal being finalized. In EDA, components produce and consume these events, triggering further actions across the architecture. It promotes decoupled, asynchronous interactions, making systems more flexible and scalable.</p><p>This approach has been used in systems like enterprise applications, where services like customer orders, payments, and inventory updates function independently but are synchronized through events. 
The same principles now apply to multi-agent systems, where agents can respond to events without being tightly coupled to other agents or services.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h3>Why EDA in Multi-Agent Systems?</h3><p>In the context of our biotech sales system, EDA allows us to design a responsive system where services and agents react to events as they occur, without waiting for direct interactions. For example, when the lead generation service identifies a new potential client, it produces a "New Lead Identified" event without necessarily knowing or being impacted by how other services or agents might interact with that event. This event triggers actions in services that subscribe to that event type such as lead qualification, market analysis, and proposal generation. This architecture choice would lead to flexibility to adapt to evolving business needs by adjusting events and agent interactions without requiring significant system-wide changes:</p><ul><li><p><strong>Real-time Responsiveness:</strong> EDA ensures that when an event like identifying a new lead occurs, multiple agents can start immediately, such as the lead qualification and market analysis agents.&nbsp;</p></li><li><p><strong>Decoupling:</strong> One of the core principles of EDA is decoupling. In this approach, agents or services react to events independently, without any direct connection. In the biotech example, the lead qualification agent doesn&#8217;t need to know how the lead generation agent works. It just reacts to the event that the lead agent produces. This allows the system to remain modular and flexible.</p></li><li><p><strong>Scalability:</strong> New agents, say a pricing analysis agent, can be easily integrated to listen for relevant events and act without disrupting existing workflows.</p></li></ul><h3>Mechanics of EDA in Multi-Agent Systems</h3><h4>1. Event Producers and Consumers</h4><p>In EDA, agents or services are categorized as <a href="https://blog.hubspot.com/marketing/event-driven-architecture#:~:text=both%20these%20designs.-,Publisher/Subscriber%20Architecture,-In%20a%20publisher">event producers</a> (those that trigger events) or <a href="https://blog.hubspot.com/marketing/event-driven-architecture#:~:text=both%20these%20designs.-,Publisher/Subscriber%20Architecture,-In%20a%20publisher">event consumers</a> (those that react to events). In many cases, an agent can play both roles. For example:</p><ul><li><p>The lead generation service identifies a potential client and creates a "New Lead Identified" event.&nbsp;</p></li><li><p>The lead qualification service consumes this event, evaluates the lead, and produces a "Lead Qualified" event.&nbsp;</p></li><li><p>The proposal generation service consumes the "Lead Qualified" event to start preparing the proposal.&nbsp;</p></li></ul><p>This approach allows for greater flexibility, as services can operate independently but are still coordinated by events.</p><h4>2. The Event Bus: System Coordination</h4><p>The event bus is the backbone of the EDA system, routing events between producers and consumers. 
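As a rough, in-process sketch (a production system would typically use a message broker; the topic name and handlers below are illustrative assumptions), the core routing logic might look like this:</p><pre><code># Rough, in-process illustration of an event bus; topic names and handlers
# are made up for demonstration purposes.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # Services register interest in an event type without knowing the producer.
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer emits the event once; the bus routes it to every consumer.
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
bus.subscribe("LeadQualified", lambda e: print("proposal generation received", e))
bus.subscribe("LeadQualified", lambda e: print("pricing strategy received", e))
bus.publish("LeadQualified", {"lead_Id": "12345", "companyName": "PharmaCorp"})</code></pre><p>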
In the biotech sales scenario, it ensures that when the lead qualification service produces a "Lead Qualified" event, it is automatically routed to all relevant services&#8212;such as proposal generation, pricing strategy, and market analysis.</p><p>This centralized coordination ensures that agents and services stay decoupled, yet the entire system stays synchronized as events flow through the architecture.</p><h4>3. Event Schemas: Standardizing Communication</h4><p>Event schemas define data structure, standardize communication, and ensure correct data interpretation.</p><p>For example, in the biotech sales system, a "Lead Qualified" event schema might look like this:</p><pre><code>{

"eventType": "LeadQualified"

"lead_Id": "12345"

"companyName": "PharmaCorp"

"potentialValue": 500000,

"productInterest": ["Lab Equipment", "AI Drug Discovery"]

"qualificationScore": 85

}</code></pre><p>This standardization allows agents to communicate consistently, ensuring that data is interpreted correctly, like predefined contracts (APIs) in a microservices architecture.</p><p>In LLM-based agentic architectures, large language models are often used to create and / or interpret some of the components of the schema that are best captured as natural language or domain specific language. For example, the event produced by the &#8220;Proposal Writing&#8221; service might contain a field called &#8220;content&#8221; that provides a nested dictionary with section titles and paragraphs of the proposal. That nested dictionary is likely to be generated using an LLM call (potentially a <a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">RAG based sub-system</a>). On the other hand, the receiver of the event would also likely need an LLM call to interpret and react to that data object.</p><h3>Challenges and Considerations</h3><p>While EDA has many benefits, there are also some challenges to consider:</p><ul><li><p><strong>Event Storage: </strong>As the system grows, the number of events increases, making efficient event storage crucial. Event sourcing patterns, which are commonly used in traditional EDA systems, can be applied to reconstruct system states from past events.</p></li><li><p><strong>Debugging Complexity: </strong>Tracing the flow of events in a large system can be challenging, especially when issues arise. Distributed tracing tools are often required to pinpoint problems in the event chain.</p></li><li><p><strong>Over-communication:</strong> If systems are not carefully managed, they can become overwhelmed with too many events. It&#8217;s important to balance responsiveness with efficiency to avoid performance bottlenecks.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Event-based Communication Protocol</h2><p>In this article we explored event-driven architectures as one of the more promising communication protocols in multi-agent systems. We looked at how this architecture choice complements the service-oriented architecture that we previously discussed as a design pattern for constructing multi-agent products based on existing business workflows. 
In the next parts of this series, we will introduce more architecture concepts, and will eventually discuss how they will combine to create a full picture of multi-agent systems.&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 4 - Workflow to Multi-agent Architecture Design]]></title><description><![CDATA[In this video we walk through the process of mapping a workflow into its components and using that to design the Service Oriented Architecture of the corresponding multi-agent software system.]]></description><link>https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 14 Aug 2024 18:46:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147716624/a165a6dbf2ba81c27b48471a584bfd70.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h1>BACKGROUND</h1><p>To understand this better, feel free to skim the previous writeups first;</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d24277eb-7f72-41b1-aab9-caa899deb8c2&quot;,&quot;caption&quot;:&quot;In this write up we will go over the most important principles you should follow as you ideate, validate, design, and build your LLM product. One thing that you will realize by the end of this is that the principles of building the most sophisticated multi-agent LLM products is the same as the ones for any LLM product and ultimately the same as the ones&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 1 - The &#8220;9&#8221; Commandments: How to Build LLM Products Successfully&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-24T14:04:40.061Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fabc6b99-0675-42df-822f-ba5208156475_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments&quot;,&quot;section_name&quot;:&quot;AI (LLMs, etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:146927799,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;16628f79-a8cd-4901-9b2f-20316e4e03a2&quot;,&quot;caption&quot;:&quot;An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals. 
These agents can range from simple systems, such as thermostats or basic control mechanisms, to highly complex AI-powered systems. The exact definitions and the thresholds necessary to attribute&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 2 - What the Heck are Agents, anyway?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-30T16:42:12.821Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are&quot;,&quot;section_name&quot;:&quot;AI (LLMs, etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:147161934,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;44c08753-98e5-474a-b4cb-44f6a49b0611&quot;,&quot;caption&quot;:&quot;Why write this?&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 3 - Multi-Agent LLM Products: A Design Pattern Perspective&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-07T14:37:46.817Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c5993ce-069a-4b3e-85bd-f806578cfacc_1536x1536.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm&quot;,&quot;section_name&quot;:&quot;AI (LLMs, etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:147448982,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random 
Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h1>TRANSCRIPT SUMMARY:</h1><p>Here our focus is to transform the regular sales workflow of a biotech sales company into a modular, maintainable, and scalable multi-agent architecture. We&#8217;ll explore how we can utilize a traditional architecture design called Service-Oriented Architecture (SOA) to create a sophisticated multi-agent system.</p><h2>Current Workflow of the Biotech Sales Scenario</h2><p>Before we dive into architectural patterns, it's crucial to understand the current process we're working with. Let's first break down the sales workflow of the biotech sales company:</p><ol><li><p>The founder identifies potential pharmaceutical company leads</p></li><li><p>A sales engineer evaluates the leads and asks clarifying questions</p></li><li><p>The technical lead suggests relevant use cases, and the founder estimates ROI</p></li><li><p>The sales engineer assesses the feasibility of use cases with the tech lead</p></li><li><p>A proposal is drafted, refined, and sent to the pharma company prospect</p></li><li><p>The prospect's team (medical scientist, chemist, CFO) reviews the proposal</p></li><li><p>Clarifications and objection handling occur via email</p></li><li><p>A meeting is scheduled if the proposal is accepted</p></li></ol><h2>Service-Oriented Architecture</h2><p>Before we apply SOA to our biotech sales scenario, let's briefly recap what it is and why it matters.</p><p>Service-Oriented Architecture (SOA) is a design approach used in traditional software architectures to break down complex systems into independent, modular services that communicate through standardized interfaces. These services are designed to perform specific business functions while remaining loosely coupled and reusable, allowing for greater flexibility and scalability across the system. SOA enables different services to operate autonomously, while still being able to collaborate when needed, ensuring that systems can evolve and adapt without requiring a complete overhaul.</p><p>You can read about SOA in detail <a href="https://aws.amazon.com/what-is/service-oriented-architecture/">here</a>.</p><h2>Applying SOA to the Biotech Sales System</h2><p>With a better understanding of SOA in traditional software architectures, let's now apply these concepts to our biotech sales scenario.</p><p>In order to identify our key agents, let&#8217;s break down the workflow into discrete, reusable services grouped together based on theme or domain of work. This approach will allow for greater flexibility, easier maintenance, and improved scalability.</p><p>Once we have the grouping based on our workflow analysis, we can identify three key agents:</p><ol><li><p>Business Development Agent</p></li><li><p>Sales Agent</p></li><li><p>Customer Simulation Agent</p></li></ol><p>These agents will act as logical groupings of related services, each responsible for a specific aspect of the sales process. 
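As a hedged sketch of what such a grouping could look like in code (the class and method names below are illustrative assumptions, not part of the actual system), each service can expose one standardized interface while the agent simply sequences its services:</p><pre><code># Illustrative sketch only: an "agent" modeled as a named grouping of services
# that share one standardized interface. Names are assumptions, not a framework.
from abc import ABC, abstractmethod

class Service(ABC):
    @abstractmethod
    def run(self, context):
        """Every service exposes the same narrow, well-defined interface."""

class LeadGenerationService(Service):
    def run(self, context):
        return {"leads": ["PharmaCorp"]}            # placeholder logic

class LeadQualificationService(Service):
    def run(self, context):
        return {"qualified": True, "score": 85}     # placeholder logic

class BusinessDevelopmentAgent:
    """A logical grouping that sequences its services and accumulates results."""
    def __init__(self):
        self.services = [LeadGenerationService(), LeadQualificationService()]

    def handle(self, context):
        for service in self.services:
            context.update(service.run(context))
        return context

print(BusinessDevelopmentAgent().handle({}))</code></pre><p>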
We're also keeping a human in the loop - the founder - who will provide critical inputs and oversight.</p><h3>Business Development Agent Services</h3><ol><li><p>Lead Generation Service: Identifies potential pharma company leads.</p></li><li><p>Lead Qualification Service: Evaluates and qualifies the identified leads.</p></li><li><p>Viability Assessment Service: Assesses the viability of pursuing a lead, including suggesting relevant use cases and estimating ROI.</p></li><li><p>Objection Handling Service: Manages clarifications and objections raised by prospects.</p></li><li><p>Meeting Scheduling Service: Arranges meetings with interested prospects.</p></li></ol><h3>Sales Agent Services</h3><ol><li><p>Feasibility Assessment Service: Evaluates the technical feasibility of proposed solutions, working closely with the technical lead.</p></li><li><p>Proposal Generation Service: Drafts, refines, and sends proposals to pharma company prospects.</p></li></ol><h3>Customer Simulation Agent Service</h3><ol><li><p>Proposal Review Service: Simulates the review process by the pharma company's medical scientist, chemist, and CFO.</p></li></ol><p>Each of these services follows SOA principles with standardized interfaces, loose coupling, abstracted internal complexity, and potential reuse in different contexts.</p><h2>Walkthrough of Service Oriented System</h2><p>Let's walk through how this SOA-based system would handle a typical sales process:</p><ol><li><p>The process begins with the Business Development Agent's Lead Generation Service identifying potential pharma company leads using data from sources like Crunchbase and PitchBook.</p></li></ol><ol start="2"><li><p>The Lead Qualification Service then evaluates these leads, possibly using a machine learning model to score them based on predefined criteria.</p></li></ol><ol start="3"><li><p>For qualified leads, the Viability Assessment Service kicks in, suggesting relevant use cases and estimating ROI. This service might use a combination of historical data and AI-driven forecasting.</p></li></ol><ol start="4"><li><p>The system notifies the founder (our human-in-the-loop) via Slack, presenting the qualified leads and viability assessments. The founder can provide feedback, add additional insights, or approve moving forward.</p></li></ol><ol start="5"><li><p>Once approved, the Sales Agent's Feasibility Assessment Service evaluates the technical feasibility of the proposed solutions. This might involve analyzing technical requirements and consulting internal knowledge bases.</p></li></ol><ol start="6"><li><p>The Proposal Generation Service then creates a tailored proposal based on all the gathered information. This could involve using templates and AI-driven customization.</p></li></ol><ol start="7"><li><p>The generated proposal is sent to the Customer Simulation Agent's Proposal Review Service, which simulates the review process of the pharma company's team. 
This service might use NLP to analyze the proposal and generate realistic objections based on historical data.</p></li></ol><ol start="8"><li><p>Any objections or requests for clarification are handled by the Business Development Agent's Objection Handling Service, which might use a combination of pre-defined responses and AI-generated explanations.</p></li></ol><ol start="9"><li><p>If the simulated customer is satisfied, the Meeting Scheduling Service arranges a follow-up meeting, integrating with calendar systems like Google Calendar or Outlook.</p></li></ol><p>Throughout this process, the system maintains a shared memory where all actions and important data are recorded. This allows for better coordination between services and provides a clear audit trail.</p><h2>Benefits of This SOA Approach</h2><p>By applying SOA principles to our biotech sales system, we gain several advantages:</p><ul><li><p>Modularity: Each service can be developed, tested, and maintained independently. If we need to update our lead scoring algorithm, we can do so without touching the proposal generation system.</p></li></ul><ul><li><p>Scalability: Individual services can be scaled based on demand. If we're seeing a surge in lead generation, we can allocate more resources to that service without affecting others.</p></li></ul><ul><li><p>Flexibility: New services can be added or existing ones modified as business needs evolve. For instance, if we later want to add a pricing optimization service, we can do so without overhauling the entire system.</p></li></ul><ul><li><p>Reusability: Services like Objection Handling could potentially be reused in different contexts, not just in initial sales but also in account management.</p></li></ul><ul><li><p>Improved Efficiency: By automating many of the time-consuming aspects of the sales process, we free up human resources to focus on high-value activities like relationship building and strategic decision-making.</p></li></ul><h2>Conclusion</h2><p>In this part of our multi-agent series, we demonstrated how to transform a real-world workflow into a modular and scalable multi-agent system using Service-Oriented Architecture (SOA). By breaking down complex processes into discrete and reusable services, we created a flexible system that can adapt easily without any major overhauls.</p><p>Next, we'll explore how these services interact with each other, while examining the communication patterns that enable collaboration between agents.</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 3 - Multi-Agent LLM Products: A Design Pattern Perspective]]></title><description><![CDATA[In this article, we will explore how established software design principles can be applied to the emerging trend of multi-agent large language model (LLM) systems.]]></description><link>https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 07 Aug 2024 14:37:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1c5993ce-069a-4b3e-85bd-f806578cfacc_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Why write this?</h1><p>I see lots of &#8220;multi-agent&#8221; frameworks out there and I, personally, think most of them are nonsense. 
They are nonsense because they try to paint rosier picture than what it really takes to build extremely complex intelligent software systems. For example they claim that if you get a few LLMs to talk to each other in natural language you have a software system that robustly solves your complex business problems. Or if you throw a large crew of LLMs at a problem, they can reliably do sales and marketing and operations for your business. I think the creators of (and perhaps those who get excited about) these either have never written serious software, or are just interested in the academic exercise of &#8220;what if&#8221; rather than building anything that can actually go into production.</p><p>Starting from principles is such an important thing to do when proposing and building a new complex framework and I&#8217;m utterly surprised by how unimportant it seems in many of the proposed frameworks. Hopefully, I have convinced you that going to the basics of RL is important in thinking through agentic workflows, and in this article my attempt is to convince you that going back to software design principles is the way to go about creating multi-agent systems. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Software Architecture and ML</h1><p>In this article, we will explore how established software design principles can be applied to the emerging trend of multi-agent large language model (LLM) systems. We will examine how traditional software design patterns, such as Domain-Driven Design (DDD), Service-Oriented Architecture (SOA), and microservices architecture, contribute to the development of these multi-agent systems.</p><p>Traditional design patterns provide a robust framework for software development. By integrating machine learning (ML) into these patterns, we can introduce a new dimension to software architecture. ML enables probabilistic routing between software components, replacing pre-programmed deterministic routing. This integration not only enhances the functionality of individual components but also introduces new capabilities. Both LLMs and specialized ML models, and often a combination of the two, can be utilized to achieve these improvements.</p><p>The incorporation of LLMs into software systems brings a broad range of benefits, making them more dynamic and flexible. These systems can exhibit a diverse set of behavior without the need for explicit programming, which offers a significant advantage. However, this flexibility comes at a cost: such systems can be harder to predict, maintain, and debug reliably.</p><p>Communication methods between components, previously services and more recently agents, remain consistent with traditional approaches, using REST, GraphQL, JSON, and DSLs. 
However, the introduction of natural language as an interface adds a new layer of complexity, with its own set of advantages and challenges. These hybrid systems, combining predetermined and probabilistic behavior, may become the new standard in software development.</p><p>In the following sections, we will delve deeper into the concepts of DDD, SOA, and microservices architecture. We will explain how DDD focuses on modeling software based on real-world domains with isolated data sharing between domains. We will also explore the benefits and challenges of this new approach, drawing parallels to successful microservices implementations to illustrate suitable use cases.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Domain-Driven Design: The Foundation</h1><p>DDD emphasizes modeling software around the core domain of a business. It advocates for a common language shared by developers and domain experts, ensuring everyone speaks the same language. DDD breaks down the domain into bounded contexts, areas with well-defined and segregated responsibilities, and often minimal dependency on other areas.</p><p>Bounded contexts ensure that complexity is manageable by focusing on specific aspects of the domain. This focus also promotes better communication and understanding between developers and domain experts. By breaking down the domain into bounded contexts, we lay the groundwork for introducing agents with specialized capabilities, each responsible for a specific bounded context within the larger multi-agent LLM system. Just as bounded contexts promote modularity and focus within the domain, agents with bounded responsibility will do the same.</p><h3>Example: E-commerce Platform</h3><p>Consider an e-commerce platform. DDD could be used to define several bounded contexts:</p><ul><li><p>Customer Management: Handles customer accounts, profiles, and preferences.</p></li><li><p>Product Catalog: Manages product information, categories, and pricing.</p></li><li><p>Order Processing: Processes orders, manages inventory levels, and handles payments.</p></li><li><p>Content Management: Creates and manages product descriptions, promotions, and other content.</p></li></ul><p>Note that each of these are borrowed from the business domain of commerce to facilitate better communication between stakeholders and developers but also to tap into the robust nature of trusted and true business workflows. Each bounded context has its own data entities, business rules, and common language. This modular approach allows developers to focus on specific areas of functionality without getting overwhelmed by the complexity of the entire system.</p><h2>From Bounded Contexts to Services</h2><p>SOA takes the concept of bounded contexts from DDD and maps them to services. Each service encapsulates a specific domain functionality and exposes a well-defined interface. This promotes loose coupling, allowing services to evolve independently without impacting others.</p><p>Microservices architecture takes SOA a step further by creating even smaller, more focused services. 
Unlike SOA, in microservice architectures the services focus as narrowly as possible, often only on a single function. This approach offers greater agility, scalability, and resilience. Each microservice owns its data and logic, promoting independent development and deployment.</p><h3>Example: E-commerce Platform - Microservices Breakdown</h3><p>The platform would be composed of independent, loosely coupled microservices:</p><ul><li><p>Customer Service: This service manages customer accounts, profiles, login credentials, and preferences. It would expose APIs for user registration, login, profile management, and wishlist functionalities.</p></li><li><p>Product Service: This service handles product information, including descriptions, categories, images, pricing, and availability. It would provide APIs for product search, filtering, retrieving product details, and managing inventory levels.</p></li><li><p>Recommender Service: This service handles proactive product recommendations functionality and integrates with the Product Service for data retrieval.</p></li><li><p>Search Service: This service handles product search functionality and integrates with the Product Service for data retrieval.</p></li><li><p>Order Service: This service oversees the order processing flow. It would handle actions like adding items to the cart, managing shopping carts, initiating checkout, processing payments, and managing order fulfillment. The order service would interact with both the Product Service and the Payment Service.</p></li><li><p>Payment Service: This service handles secure payment processing, integrating with various payment gateways. It would expose APIs for initiating payments, handling authorization, and receiving transaction confirmations.</p></li><li><p>Content Management Service: This service focuses on creating and managing website content, including product descriptions, promotions, blog posts, and other informational pages. It would provide APIs for content creation, editing, and publishing.</p></li></ul><p>Each microservice would expose well-defined APIs for other services to interact with. For example:</p><ul><li><p>When a customer adds an item to the cart in the frontend, it would send an API request to the Cart functionality within the Order Service.</p></li><li><p>The Order Service might then interact with the Product Service to retrieve product details and confirm availability.</p></li><li><p>Upon checkout, the Order Service would communicate with the Payment Service to initiate the payment process.</p></li></ul><p><em>You might notice that some of the things that you&#8217;ve imagined as &#8220;multi-agent&#8221; systems could be achieved simply by a well designed software system.</em> </p><p>The challenge with a system like this is that, although it might include some narrow scope AI components, it is ultimately passive and fairly rigid in what it can do. 
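</p><p>As a rough sketch of the interaction flow just described, the snippet below models the three services as plain Python classes with explicit interfaces. The class and method names are invented for illustration and are not a real e-commerce API; in practice each call would be a REST request carrying a JSON payload.</p><pre><code>class ProductService:
    def get_product(self, product_id):
        # Would normally be a REST call such as GET /products/{id}
        return {"id": product_id, "price": 25.0, "in_stock": True}

class PaymentService:
    def charge(self, customer_id, amount):
        # Would normally POST a JSON payload to a payment gateway integration
        return {"status": "authorized", "amount": amount}

class OrderService:
    # Order processing depends on the other services only through their interfaces.
    def __init__(self, products, payments):
        self.products = products
        self.payments = payments
        self.carts = {}

    def add_to_cart(self, customer_id, product_id):
        product = self.products.get_product(product_id)
        if not product["in_stock"]:
            return {"error": "out of stock"}
        self.carts.setdefault(customer_id, []).append(product)
        return {"cart_size": len(self.carts[customer_id])}

    def checkout(self, customer_id):
        total = sum(p["price"] for p in self.carts.get(customer_id, []))
        return self.payments.charge(customer_id, total)

orders = OrderService(ProductService(), PaymentService())
orders.add_to_cart("cust-1", "sku-42")
print(orders.checkout("cust-1"))
</code></pre><p>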
Combining the reliability of software written within those design principles with the flexibility of emergent capabilities offered via LLMs can be a winning formula.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Multi-Agent LLMs: A New Design Pattern</h1><p>Multi-agent LLMs borrow heavily from the principles discussed above. Just like microservices, they consist of multiple, specialized agents (in addition to services), each focusing on a specific aspect of the task. These agents collaborate and leverage services to achieve a common goal, similar to how services interact through APIs.</p><h2>Beyond Microservices: Active Agents vs. Passive Data Handlers</h2><p>Microservices excel in building modular, scalable software systems. However, they primarily function as passive data handlers, responding to requests and manipulating data. Multi-agent LLMs, on the other hand, take a leap forward by introducing &#8220;active&#8221; components inside these services, effectively allowing them to &#8220;make decisions&#8221; in scenarios without being deterministically programmed to do so. These agents can:</p><ul><li><p>Continuously monitor the situation, analyze data, and identify potential issues or opportunities.</p></li><li><p>Take initiative and perform actions without explicit instructions. This can involve initiating communication with other agents, retrieving information, or even triggering predefined workflows.</p></li><li><p>Collaborate and negotiate with each other to achieve a common goal. This allows for dynamic decision-making and adaptation to unforeseen circumstances.</p></li></ul><p>This shift from passive data handling to active agents unlocks new possibilities:</p><ul><li><p>Complex Task Automation: Multi-agent LLMs can automate complex tasks that require reasoning, planning, and collaboration across different domains. Imagine a system with a service constantly monitoring traffic patterns, augmented with an agent analyzing weather data, and a second one making decisions about rerouting deliveries to avoid congestion &#8211; a scenario beyond the pre-determined nature of microservices.</p></li><li><p>Emergent Behavior:&nbsp; LLMs themselves show emergent properties; they can classify text, extract entities, and more although they are only explicitly trained on predicting the next most likely token. When LLM-powered agents that are fine-tuned to strengthen any of these properties or are augmented with tools that give them specialized capabilities can interact with each other, non-trivial collective behavior might appear. 
The semantic flexibility, although less controllable in nature, combined with the reliability of JSON-based communication between various software services, agentic or otherwise, could result in systems that behave in ways human operators find more &#8220;expected&#8221;, for example by adapting and responding to situations that were never explicitly programmed for.</p></li><li><p>Continuous Improvement: The modular, and to some extent potentially redundant, nature of multi-agent systems can make them less dependent on improvements in a single component as the only opportunity for improving the overall system. For example, an agent that is fine-tuned to do task decomposition effectively can help other agents do that task well by providing examples in a few-shot setup. In a more carefully set up system, each component inside agents, potentially small LM or non-LM models, can have a feedback loop through which it is continuously retrained and improved. This could include models that are involved in the policies of individual agents or the overall system.&nbsp;</p></li></ul><h3>Contracts, Languages, and Communication</h3><p>Microservices architectures thrive on clear and well-defined communication. This communication relies on predefined API contracts, essentially agreements that dictate how services interact with each other. These contracts act like sheet music for an orchestra, ensuring each microservice plays its part seamlessly.</p><p>REST APIs and JSON are the cornerstones of these contracts. REST (Representational State Transfer) defines a standardized architecture for requesting and receiving data between services. JSON (JavaScript Object Notation) acts as the &#8220;language&#8221; for transmitting data, offering a lightweight and human-readable format for exchanging information.</p><p>Agentic systems use these existing mechanisms and will also introduce a new dimension to communication, adding two more communication types:</p><ul><li><p>Domain-Specific Languages (DSLs): These are custom languages tailored to a specific domain or purpose. Imagine a trading agent responsible for capital market transactions using a combination of statistics, machine learning, and business logic rules. Communicating this information in natural language is too complex and error-prone, and JSON is too limited. A DSL, however (imagine a set of pseudocode snippets describing the logic of the rules), used as the communication contract between the controller agent and the executor agent, can be the most efficient channel. DSLs offer more expressiveness and efficiency compared to generic JSON data, but require specialized knowledge to understand and implement.</p></li><li><p>Natural Language (NL): This is the most human-like form of communication. Agents could potentially communicate and share information using natural language processing (NLP) techniques. However, natural language is inherently ambiguous and prone to misinterpretations. While offering the most flexibility, NL communication is also the least robust and requires advanced NLP capabilities to manage effectively.</p></li></ul><p>Even in the realm of multi-agent systems, the established approach of API calls and JSON data exchange remains the most reliable and robust communication method. It provides a clear and well-defined path for information exchange. DSLs offer a middle ground, balancing expressiveness with control.
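</p><p>Before turning to natural language, here is a rough illustration of what the same controller-to-executor instruction might look like in each of the three contract styles. The field names and the tiny rule DSL are invented for the example, not taken from any real system.</p><pre><code># 1. JSON: rigid but robust and easy to validate against a schema.
json_message = {
    "action": "rebalance_portfolio",
    "asset_class": "equities",
    "max_exposure": 0.1,
}

# 2. DSL: more expressive than JSON, still machine-checkable.
dsl_message = "WHEN volatility(equities) ABOVE 0.3 REDUCE exposure TO 0.1"

# 3. Natural language: most flexible, least reliable to parse and act on.
nl_message = (
    "If equity markets get choppy, please trim our equity exposure "
    "down to about ten percent of the book."
)
</code></pre><p>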
Finally, natural language communication, while offering the most flexibility, comes with the greatest risk of misunderstandings and requires significant development effort to implement effectively. In all likelihood, a product that you design would tap into all these different communication channels between services and therefore agents to achieve the best balance between performance and control.</p><h2>Shared Benefits and Challenges</h2><p>Both multi-agent LLMs and microservices architectures offer several advantages:</p><ul><li><p><strong>Modularity:</strong> Break down complex tasks into smaller, manageable units.</p></li><li><p><strong>Scalability:</strong> Scale individual agents or services independently based on needs.</p></li><li><p><strong>Resilience:</strong> If designed right, given the adaptability of agent policies that leads to some redundancy, failure of one agent or service doesn't cripple the entire system.</p></li><li><p><strong>Independent Deployment:</strong> Deploy and update individual agents/services without affecting others.</p></li></ul><p>However, both approaches also come with challenges:</p><ul><li><p><strong>Increased Complexity:</strong> Managing interactions and dependencies between agents/services requires careful planning.</p></li><li><p><strong>Testing and Debugging:</strong> Debugging issues that span multiple agents/services can be intricate. Also, the probabilistic nature of agents can make systems built with them considerably harder to debug.</p></li><li><p><strong>Distributed System Management:</strong> Distributing resources and ensuring consistent behavior across agents/services adds complexity.</p></li></ul><p>Therefore, multi-agent systems, unlike what vendors tell you, are not a silver bullet for everything and choosing to approach solving a business problem with them comes down to a careful pros/cons analysis. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>How to design multi-agent architectures?</h1><ol><li><p>Map the workflow humans execute to achieve a particular objective, including the people, processes, and tools involved.</p></li><li><p>Draw context boundaries around parts of the process that are self-contained.</p><ol><li><p>You want each of these areas to have minimal data dependency on another one (if they share / exchange a lot of data they might have to be merged).</p></li><li><p>You want each of these areas to have minimal functional dependencies on another one (if most of the time a change in one requires a change in another one, they should be merged into one context).</p></li></ol></li><li><p>Decide if each of these contexts is a software service or if it needs to become &#8220;agentic&#8221; and mark them as such (&#8220;Payment Processing Service&#8221;, &#8220;Data Analyzer Agent&#8221;).</p><ol><li><p>Note that each agent might contain microservices (&#8220;PDF Parser microservice&#8221;, &#8220;info retrieval microservice&#8221;).</p></li></ol></li><li><p>Add any other services necessary that your architecture doesn&#8217;t explicitly contain (&#8220;User Management Service&#8221;, &#8220;Shared Memory Service&#8221;).</p></li><li><p>Determine how data flows through the system (&#8220;PDF goes from File Upload service to parsing service&#8221;, &#8220;JSON containing rewritten query and search filters goes from query analyzer service to info retrieval service&#8221;).&nbsp;</p><ol><li><p>Revise the system modularity (context boundaries) to minimize data movement.</p></li></ol></li><li><p>Determine and document communication protocols between different services and agents.</p><ol><li><p>In most interfaces the protocol should be REST-based and the payload should be JSON for robustness purposes.</p></li><li><p>If that fails to meet your requirements, then try to use DSLs; only if that also fails, use natural (or formal) language.</p></li><li><p>It is OK to use natural language for most of your interfaces in a quick and dirty prototype implementation, but you should remind yourself that, in all likelihood, it will not meet the reliability threshold for user-facing production deployment.&nbsp;</p></li></ol></li><li><p>Determine how you would unit test each microservice.</p><ol><li><p>If conceptualizing a unit test for a microservice is too complex, it might be a sign of a need to break it into smaller pieces.</p></li><li><p>The unit test for a service would be all component unit tests passing. The unit test for an agent might be a less trivial heuristic on how all component unit tests behaved (in principle agents should be able to recover from some failing components; for example, an agent can choose a document search action instead if the Google search action is failing).</p></li></ol></li></ol><p><strong>Congratulations!
You designed your first multi-agent architecture.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 2 - What the Heck are Agents, anyway?]]></title><description><![CDATA[An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals, ranging from simple systems, such as thermostats to complex AI systems.]]></description><link>https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 30 Jul 2024 16:42:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xE1S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals. These agents can range from simple systems, such as thermostats or basic control mechanisms, to highly complex AI-powered systems. The exact definitions and the thresholds necessary to attribute <a href="https://plato.stanford.edu/entries/agency/">agency</a> to a system are up for debate and can only be contextually discussed. However, most IAs possess some or all of the following key properties:</p><ul><li><p>Autonomous operations</p></li><li><p>Reactive to the environment&nbsp;</p></li><li><p>Proactive (goal-directed)&nbsp;</p></li><li><p>Interactive with other agents (via the environment)</p></li></ul><p>To better understand how our understanding of the concept has evolved, the current state, and the potential future of IAs, it's essential to trace their history and examine the key milestones that have shaped the field. But first&#8230;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Why should I care?</h1><p>Are agents yet another hype that will soon die out? Or are they the next major platform shift?</p><h2>Why are we excited about LLM Agents?</h2><p>The emergence of LLM agents has sparked considerable excitement in the AI community. Their ability to comprehend and generate coherent text, undertake complex tasks, and exhibit autonomous behavior has opened up a wide array of possibilities. One of the key factors contributing to this excitement is the potential of LLM agents to serve as <a href="https://arxiv.org/pdf/2402.02716">planning modules</a> for autonomous agents.</p><p>Open-source LLMs have also reached a point where they can effectively drive agentic workflows. For example, the integration of LLMs into systems where they can call tools has further enhanced their capabilities, allowing them to perform more complex and diverse tasks.</p><p>This growth has led to a significant rise in both the development and adoption of LLMs as core components of autonomous agents. The excitement surrounding them is further fueled by their potential to function as artificial general intelligence systems (AGI), capable of performing a wide range of tasks with human-like proficiency. However, it is important to note that there are still significant challenges to be addressed before LLM agents can truly achieve such advanced capabilities.</p><h2>I am working on use case X, should I really care about LLM agents?</h2><p>No, if:</p><ul><li><p>Your task is well-defined and specific.</p></li><li><p>Needs a single function like grammar checking, text summarization, or code generation.</p></li><li><p>Doesn't require remembering past interactions or context.</p></li><li><p>Operates solely on the information provided in the current prompt.</p></li></ul><p>Examples:</p><ul><li><p>Highlighting grammatical errors in a document.</p></li><li><p>Creating a concise summary of a lengthy article.</p></li><li><p>Translating a simple sentence from one language to another.</p></li><li><p>Generating different creative text formats based on a single prompt (e.g., poems, scripts).</p></li></ul><p>Yes, if:</p><ul><li><p>Your task is more complex and involves multiple steps.</p></li><li><p>Needs the ability to remember past interactions and context.</p></li><li><p>Benefits from accessing and interacting with external tools or resources.</p></li><li><p>Requires a level of autonomy in completing the task.</p></li></ul><p>Examples:</p><ul><li><p>A virtual assistant that manages your schedule, checks weather data, and books appointments.</p></li><li><p>A system that analyzes customer reviews and recommends product improvements.</p></li><li><p>A chatbot that can answer complex questions by searching the web and integrating information from different sources.</p></li><li><p>A content creation tool that understands your previous creative decisions and generates content that aligns with your overall vision.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with 
Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Brief History of Intelligent Agents</h1><p>The concept of Intelligent Agents has evolved alongside the development of Artificial Intelligence (AI), with its roots dating back to the 1950s. Let's take a look at a brief history of intelligent agents and how they have progressed over time.</p><ol><li><p><strong>1950s and before: The Dawn of AI</strong></p></li></ol><ul><li><p><strong>Turing Machine (1936):</strong> Though not an agent, Alan Turing's <a href="https://en.wikipedia.org/wiki/Turing_machine">theoretical model</a> provided a foundation for defining computation and intelligence.&nbsp;</p></li><li><p><strong>Turing Test (1950): </strong>Proposed by Alan Turing, <a href="https://en.wikipedia.org/wiki/Turing_test">this test</a> established a benchmark for a machine's ability to exhibit human-level intelligence.</p></li></ul><blockquote><p>These early concepts laid the groundwork for the development of autonomous agents in the following decade.</p></blockquote><ol start="2"><li><p><strong>1960s: The Rise of Autonomous Agents</strong></p></li></ol><ul><li><p><strong>ELIZA: </strong>A natural language processing <a href="https://en.wikipedia.org/wiki/ELIZA">program</a> created by Joseph Weizenbaum in the 1960s, was one of the earliest intelligent agents capable of simulating a psychotherapist through natural language conversations.</p></li><li><p><strong>General Problem Solver (GPS):</strong> Developed by Herbert Simon, J.C. Shaw, and Allen Newell in the late 1950s, was an <a href="https://stacks.stanford.edu/file/druid:zk239tp3547/zk239tp3547.pdf">early intelligent agent system</a> that could solve problems by searching through a space of possible solutions, laying the foundation for future problem-solving agents.</p></li><li><p><strong>SHRDLU:</strong> Developed by Terry Winograd, <a href="https://hci.stanford.edu/winograd/shrdlu/AITR-235.pdf">SHRDLU</a> demonstrated rudimentary natural language processing capabilities to solve tasks in a simulated block world.</p></li></ul><blockquote><p>Building on these early successes, the 1970s and 1980s saw intelligent agents finding applications in specialized domains.</p></blockquote><ol start="3"><li><p><strong>1970s-1980s: Growth and Specialization</strong></p></li></ol><ul><li><p><strong>MYCIN:</strong> An early expert system designed for medical diagnosis, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2464549/">MYCIN</a> showcased the potential of knowledge-based systems in specialized domains.</p></li><li><p><strong>Shakey the Robot (1970s):</strong> A mobile robot from SRI International, <a href="https://www.sri.com/hoi/shakey-the-robot/">Shakey</a> pioneered basic navigation and manipulation tasks in a controlled environment.</p></li></ul><blockquote><p>As AI technology advanced, the 1990s and 2000s witnessed the rise of intelligent agents in more practical and everyday applications.</p></blockquote><ol start="4"><li><p><strong>1990s-2000s: The Rise of Practical Applications</strong></p></li></ol><ul><li><p><strong>Deep Blue (1997):</strong> IBM's <a href="https://www.chess.com/article/view/deep-blue-kasparov-chess">Deep Blue</a>, a chess-playing computer, defeated chess grandmaster Garry Kasparov, demonstrating AI's potential for complex 
decision-making.&nbsp;</p></li><li><p><strong>Roomba Vacuum Cleaner (2002):</strong> The <a href="https://about.irobot.com/History">Roomba</a> became a popular example of IAs entering everyday life, performing basic cleaning tasks autonomously.</p></li></ul><blockquote><p>In the 21st century, intelligent agents have become increasingly sophisticated and integrated into various aspects of our lives.</p></blockquote><ol start="5"><li><p><strong>2000s-Present: Evolution to Advanced Intelligent Agents</strong></p></li></ol><ul><li><p>Virtual personal assistants such as Siri, Alexa, and Google Assistant are prime examples of intelligent agents.</p></li><li><p>Self-driving cars, recommendation systems, and game-playing AI are other examples of intelligent agents.</p></li><li><p>NASA's mobile agents for human planetary exploration are some of the most advanced machines we have created.</p></li></ul><p>The 21st century has witnessed a remarkable surge in the development and deployment of intelligent agents across various domains. The evolution of powerful machine learning algorithms, coupled with the exponential growth in computing power and data availability, has enabled the creation of highly sophisticated autonomous systems. One of the most significant breakthroughs in this era has been the emergence of Reinforcement Learning (RL) as a key approach for training intelligent agents.</p><p>RL has proven to be a game-changer in the realm of game-playing AI, with notable examples such as <a href="https://deepmind.google/technologies/alphago/">AlphaGo</a>, which made history by defeating world champion Go players in 2016. This achievement highlights the potential of RL in enabling agents to learn and adapt to complex environments through trial-and-error learning and reward maximization. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Agents in Reinforcement Learning</h1><p>Reinforcement Learning (RL) is a subfield of machine learning that focuses on training agents to make sequential decisions in an environment to maximize a cumulative reward signal. In RL, an agent interacts with its environment by taking actions, observing the resulting state, and receiving rewards or penalties based on its actions. 
The goal here is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over time.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2eTr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png" width="497" height="270" alt=""></figure></div><p>In RL, an &#8220;agent&#8221; is strictly defined by its <strong>policy</strong>, a mapping function from the current state to the most appropriate next action, informed by the reward earned from all previous actions.&nbsp;</p><p>Let&#8217;s look at an example. Say an RL agent is in charge of controlling the conversation flow inside a customer service chat system.&nbsp;</p><ul><li><p>&#8220;State&#8221; in this case is some indicator of progress extracted from the last utterance from the customer (eg. intent, like needs more information to purchase).&nbsp;</p></li><li><p>&#8220;Reward&#8221; could include positive points for retrieving relevant information, or positive sentiment from the customer, or improved likelihood of a purchase or service renewal, and negative points for signs of frustration (eg. repeated asks) or negative sentiment from the customer, or abandoned conversation.</p></li><li><p>The &#8220;environment&#8221; in this case is the current conversation, all the data we have about the customer (their purchase history, demographic, previous communication transcripts), and say data related to competitive services.&nbsp;</p></li><li><p>&#8220;Actions&#8221; could include offering a solution, retrieving data to answer questions, and asking clarifying questions.&nbsp;</p></li><li><p>The &#8220;policy&#8221; (and therefore the agent) is the decision making function (could be learned from historical data or designed based on business logic or both) that selects the next best action given the current state. For example, a &#8220;perception&#8221; function might evaluate the intent of the last utterance (eg. complaint) to infer the state, and using that information the policy determines what the best next action is (eg. apologize and offer a discount).</p></li></ul><p>Note that while reward has a significant impact on how the agent behaves, it is not an internal property of the agent. It is instead a property determined by the designer of the system (part art, part science) to help it learn the desired goal-seeking behavior.
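</p><p>To see how these pieces fit together, here is a toy rendering of the example above, with a keyword-based stand-in for the perception model and a hand-filled value table. It only illustrates how state, policy, and reward relate to each other, not how a real RL system would be trained.</p><pre><code>ACTIONS = ["answer_question", "offer_discount", "ask_clarifying_question"]

def perceive(utterance):
    # Stand-in for the "perception" step: infer the state from the last utterance.
    if "price" in utterance:
        return "needs_info_to_purchase"
    if "again" in utterance:
        return "frustrated"
    return "browsing"

def policy(state, q_table):
    # The agent *is* this function: pick the highest-valued action for the current state.
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def reward(state, action):
    # Defined by the system designer (part art, part science), not internal to the agent.
    if state == "frustrated" and action == "offer_discount":
        return 1.0
    if state == "needs_info_to_purchase" and action == "answer_question":
        return 1.0
    return -0.1

# One interaction step; in practice the value table would be learned over many conversations.
q_table = {("frustrated", "offer_discount"): 0.9, ("needs_info_to_purchase", "answer_question"): 0.8}
state = perceive("can you explain the price difference between the two plans")
action = policy(state, q_table)
print(state, action, reward(state, action))
</code></pre><p>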
In other words, the reward in this case is externally driven (as opposed to human behavior, for example, where the incentives could be internal or external).&nbsp;</p><h2>Properties of RL agents</h2><p>We can explore the key properties we mentioned at the beginning of this article for RL agents.&nbsp;</p><ol><li><p><strong>Autonomy: </strong>RL agents can make decisions independently based on their policy, without requiring explicit instructions or supervision.</p></li><li><p><strong>Interactivity:</strong> Agents continuously interact with their environment, including other agents, by taking actions and receiving feedback in the form of rewards or penalties.</p></li><li><p><strong>Adaptability:</strong> Through trial-and-error learning, RL agents can adapt their behavior based on the feedback they receive, allowing them to improve their performance over time.</p></li><li><p><strong>Goal-orientation: </strong>RL agents are driven by the objective of maximizing cumulative rewards, which enables them to learn optimal strategies for achieving specific goals.</p></li></ol><h3>Examples of RL agents</h3><ol><li><p><strong>Game-playing agents:</strong></p></li></ol><ul><li><p><strong>AlphaGo: </strong>Developed by DeepMind, <a href="https://www.nature.com/articles/nature24270.epdf?referrer_access_token=cN_-Yetiiun4Y0Ip-zBFfNRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-22SehS6IfIWP6NGb0V5cWu-EoVfAGki0u6km0LszOuzp5t4V7Cfz3E7s7jHnzbQ4qzuSk0MignCMZouLbzvn5mW4_ZI9kEVTtgqcdzq12Y3aiXmloMr_KNd0zP9S3BBjnC6li5BLTZTVhktlHeFkcuKPpJ-Zpdvpea7NURhoUaUYJnUOPG8u9gIO6GwgUwwfi">AlphaGo</a> is an RL agent that learned to play the complex game of Go at a superhuman level, defeating world champion players.</p></li><li><p><strong>OpenAI Five:</strong> Created by <a href="https://openai.com/index/openai-five">OpenAI</a>, this RL agent mastered the multiplayer video game Dota 2, showcasing the potential of RL in complex strategic environments.</p></li></ul><ol start="2"><li><p><strong>Robotics and autonomous systems:</strong></p></li></ol><ul><li><p><strong>Autonomous vehicles: </strong>Reinforcement learning is actively being explored for training self-driving cars. Companies like Waymo and Tesla utilize RL for tasks like lane following, obstacle avoidance, and optimizing driving behavior.</p></li><li><p><strong>Robotic manipulation:</strong> RL agents can learn to perform dexterous manipulation tasks, such as grasping and assembling objects, by learning from trial-and-error interactions with the environment.&nbsp;</p></li></ul><ol start="3"><li><p><strong>Recommendation systems:</strong></p></li></ol><ul><li><p><strong>News recommendation:</strong> RL agents can be employed to personalize news article recommendations based on user preferences and engagement, optimizing for long-term user satisfaction. The <a href="https://dl.acm.org/doi/pdf/10.1145/3178876.3185994">DRN</a> framework, for instance, uses Deep Q-learning to deliver personalized news content.</p></li><li><p><strong>E-commerce recommendations:</strong> By learning from user interactions and purchase history, RL agents can provide personalized product recommendations that maximize user engagement and revenue. 
This <a href="https://arxiv.org/pdf/2210.15451">paper</a> proposed to use deep reinforcement learning to recommend product sequences that sustain user interest and drive purchases.</p></li></ul><p>The examples we've explored demonstrate the wide range of applications for RL agents across various domains, from game-playing and robotics to recommendation systems. As research in RL continues to advance, we can expect to see even more innovative and impactful use cases for these adaptive, goal-oriented agents.</p><h2>RL and NLP</h2><p>One particularly exciting area where RL is making significant advancements is in the field of Natural Language Processing (NLP). By leveraging the power of RL, researchers and practitioners are developing agents that can effectively tackle complex language tasks, such as text generation, dialogue management, and summarization.</p><p>The intersection of Reinforcement Learning and Natural Language Processing has given rise to a new generation of language-based agents that can learn to generate, manipulate, and understand human language in increasingly sophisticated ways.</p><ol><li><p><strong>Text Generation Control:</strong></p></li></ol><ul><li><p>RL agents can be employed to control the style and content of text generation tasks, enabling the creation of tailored writing styles for different audiences.</p></li><li><p>For example, Reinforcement Learning with Human Feedback (<a href="https://arxiv.org/abs/2305.18438">RLHF</a>) has been used to train models that can generate text with desired style or tone and even content, opening up new possibilities for creative writing and content generation.</p></li></ul><ol start="2"><li><p><strong>Dialogue Management in Chatbots:</strong></p></li></ol><ul><li><p>RL agents can learn optimal conversation strategies in chatbots, allowing them to engage users more effectively and achieve specific goals.</p></li><li><p>By training RL agents to select appropriate responses based on user input and conversation context, chatbots can maintain engaging discussions, provide relevant information, and even assist in tasks like booking appointments or making recommendations.</p></li></ul><ol start="3"><li><p><strong>Text Summarization:</strong></p></li></ol><ul><li><p>RL agents can be applied to the task of text summarization, learning to generate concise and informative summaries of longer documents. This has also been used in the context of making <a href="https://www.youtube.com/watch?v=SGInyKjzF7A">language model prompts more efficient</a>.&nbsp;</p></li><li><p>By designing reward functions that encourage faithfulness to the original text, coherence, and brevity, RL agents can produce high-quality summaries that capture the key points of a document while maintaining readability.</p></li></ul><p>The potential of Reinforcement Learning in Natural Language Processing is truly remarkable. This enables the development of intelligent agents that can generate, manipulate, and understand human language in increasingly sophisticated ways. However, the concept of intelligent agents in NLP is not a recent development. 
In fact, it has been deeply rooted in the field since its early days, aiming to create systems that can understand and respond to human language in a meaningful way.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Agents in Natural Language Processing</h1><p>The idea of agents in NLP can be traced back to the field's early focus on interaction and reasoning. As researchers sought to develop systems capable of engaging in human-like communication, the notion of language-based agents naturally emerged.</p><ol><li><p><strong>Focus on Interaction and Reasoning:</strong></p></li></ol><ul><li><p>NLP has long been motivated by the desire to create systems that can understand and respond to human language, mimicking human-like interaction and reasoning capabilities.</p></li><li><p>This focus on interaction and reasoning naturally led to the conceptualization of agents as entities that can engage with users using natural language.</p></li></ul><ol start="2"><li><p><strong>Early NLP Systems as Agents:</strong></p></li></ol><ul><li><p>Some of the earliest NLP systems, such as SHRDLU (1972), can be considered simple agents in their own right.</p></li><li><p>SHRDLU, for example, could understand and respond to natural language questions about a simulated block world, showcasing basic reasoning capabilities within a limited domain.</p></li><li><p>These early systems laid the foundation for the development of more sophisticated language-based agents in the years to come.</p></li></ul><ol start="3"><li><p><strong>Dialogue Systems and Chatbots:</strong></p></li></ol><ul><li><p>The development of dialogue systems and chatbots heavily relied on the concept of agents, as these systems needed to process user input, understand intent, and generate appropriate responses.</p></li><li><p>Early chatbots, while less advanced than modern language models, were essentially software agents operating in the domain of human-computer conversation.</p></li><li><p>These systems paved the way for the more sophisticated conversational agents we interact with today.</p></li></ul><p>As NLP technologies continue to evolve, the concept of agents has taken on new dimensions, particularly with the advent of large language models. LLMs have unlocked unprecedented possibilities for creating intelligent, language-based agents that can understand and generate human-like text with remarkable coherence and contextual awareness.</p><h2>LLM Agents</h2><h3>What are they?</h3><p>LLM agents are a new class of AI systems that combine large language models (LLMs) with the ability to make informed decisions, take actions, and work towards specific goals. They can be described as a system that uses an LLM to reason through a problem, create a plan to solve the problem, and execute the plan with the help of a set of tools. 
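</p><p>A minimal sketch of that reason-plan-act loop is shown below. Here call_llm is a placeholder for whatever chat-completion API you use, and the tool set and the plan format are invented for illustration; real agent frameworks differ in the details but share this basic shape.</p><pre><code>def call_llm(prompt):
    # Placeholder: returns a canned "plan" instead of a real model response.
    return "search: best CRM for small teams\nsummarize: search_results"

TOOLS = {
    "search": lambda query: "search_results for " + query,
    "summarize": lambda text: "summary of " + text,
}

def run_agent(goal, max_steps=5):
    memory = []  # scratchpad memory: the agent's record of what has happened so far
    plan = call_llm("Goal: " + goal + "\nWrite one 'tool: argument' step per line.")
    for step in plan.splitlines()[:max_steps]:
        tool_name, _, argument = step.partition(": ")
        if tool_name in TOOLS:                      # tool usage (the agent's actions)
            result = TOOLS[tool_name](argument.strip())
            memory.append((tool_name, result))      # keep intermediate results for later steps
    return memory

print(run_agent("recommend a CRM for a 5-person sales team"))
</code></pre><p>Framework details vary (how the plan is represented, how memory is stored, how failures are retried), but the loop of plan, act with a tool, and record the result is the common core.</p><p>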
LLM agents are typically characterized by 3 properties:</p><ul><li><p>Memory (equivalent to the environment in the context of RL)</p></li><li><p>Tool usage (equivalent to actions in the context of RL)</p></li><li><p>Planning (equivalent to policies in RL, mapping the current state to actions that maximize reward)&nbsp;</p></li></ul><p>This concept enables LLMs to analyze information they encounter and choose the most appropriate tool for the task at hand based on their available policies. This empowers them to make informed decisions and achieve their goals. This is exactly what we humans do. When we have a task to solve, we gather information and we look for ways and tools that help us solve the task as easily as possible. Memory and tool usage are relatively well established, but planning still has significant room for debate and improvement.&nbsp;</p><p>Early LLM agents like <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> and <a href="https://github.com/yoheinakajima/babyagi">BabyAGI</a> have shown promise in complex tasks like web searches and code generation. However, these agents are still under development, and their stability, reliability, and applicability to real-world problems remain open questions.</p><h2>Agents and Autonomy</h2><p>One of the important characteristics of agents is their level of autonomy, characterized by their ability to execute increasingly complex tasks with little to no supervision (correlated with the complexity of their policies). A software system at a low level of autonomy might resemble a tool and one with a high level of autonomy might behave like an agent. While there is no clear-cut differentiation, a comparison to autonomous driving can be illuminating.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xE1S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png" width="1129" height="604" alt=""></figure></div><p>Levels 0 and 1 are largely what we have seen in industry in the past 2 decades for narrow and specialized use cases. Level 2 is what we have seen in the past 5 years or so, especially with generative models that can convert effectively between modalities (text to image) or ones that have emergent capabilities although they&#8217;re only trained on a single task (aka large language models).
The important property of level 2 is that systems at this level are combinations of smaller narrow systems and are coupled to each other via carefully designed interfaces (the most speculated architecture for GPT4 is a <a href="https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3">mixture of experts</a>). In all these levels, systems have a low level of autonomy which means that they can perform specific actions based on clear instructions but have limited decision-making authority.</p><p>Level 3 is where things start to get interesting. Medium level of autonomy means that the systems can choose between different predefined options or strategies based on the context. A lot of the cases we call &#8220;agentic workflows&#8221; today fall into this category where a pre-trained classifier (eg. intent classifiers in chatbots) routes queries to specific pipelines or action chains. This is a bit of a gray area in terms of the strict definition of agents but robustness demands for most near-term applications would mean that we will see most systems follow this design pattern until more robust infrastructure is available for higher levels of autonomy.&nbsp;</p><p>Levels 4 and 5 have high autonomy which means that the agent can learn from its interactions, set its own goals within a broader framework, and make complex decisions without needing explicit instructions. Very much like what we have experienced with autonomous driving at levels 4 and 5, there are significant infrastructure prerequisites for these levels to exist and deliver robust performance. For example, there might be a need for significant organizational changes to allow a software system to execute a large number of tasks within a business workflow.&nbsp;</p><h3>What are the challenges of deploying LLM Agents in business workflows in the near term?</h3><p>While LLM agents have a potential to enhance business workflows and to enable more intelligent systems, there are significant challenges that need to be addressed before these systems can be widely deployed in real-world settings.</p><h4>Technical Challenges</h4><ul><li><p><strong>Stability and reliability:</strong> Early experiments with LLM agents have shown that they can be prone to erratic or unexpected behavior, often deviating from intended goals or producing nonsensical outputs.</p></li><li><p><strong>Measuring progress and performance:</strong> Evaluating the effectiveness of LLM agents can be complex, as they may take unexpected approaches to achieve goals or potentially deviate from desired outcomes.. Developing robust metrics and evaluation frameworks is an active area of research.</p></li></ul><h4>Organizational Challenges</h4><ul><li><p><strong>Integration with existing processes and infrastructures:</strong> Deploying LLM agents may require complex setup and management of data sources, APIs, and tools. Compatibility issues with legacy systems and the need for custom interfaces and integrations are also challenges.</p></li><li><p><strong>Human oversight and intervention:</strong> Mechanisms for humans to monitor, guide, and correct the behavior of LLM agents are important. 
Designing workflows and interfaces that allow for seamless collaboration between humans and AI agents is a challenge.</p></li></ul><p>These challenges illustrate that while LLM agents have the potential to augment and automate various business tasks, their deployment requires careful planning, iterative testing, and ongoing monitoring and refinement.</p><h3>Agents and Control Flow</h3><p>In the near term, overcoming these challenges hinges on robust control flow mechanisms. Control flow dictates how the LLM agent navigates interactions, makes decisions, and ultimately achieves its goals.&nbsp;Without it, LLM agents risk producing nonsensical outputs, deviating from intended tasks, or simply becoming unstable.</p><p>Imagine an LLM agent designed to write customer service emails.&nbsp; Control flow ensures it understands the situation (e.g., complaint, inquiry), retrieves relevant information (e.g., customer details, order history), and crafts a professional and appropriate response.&nbsp; This might involve routing the user's request to the appropriate department or dynamically generating different email templates based on the issue.&nbsp; Control flow keeps the agent on track, preventing irrelevant tangents or factual errors.</p><p>Effective control flow also addresses the challenges of measuring progress and integrating with existing systems.&nbsp;By establishing clear decision points and expected behaviors, developers can create metrics to track the LLM agent's performance and identify areas for improvement.&nbsp; Furthermore, control flow allows for the integration of human oversight and intervention.&nbsp; Developers can design control mechanisms that allow humans to guide the LLM towards desired outcomes or step in when the agent encounters unexpected situations.&nbsp;In essence, control flow acts as the bridge between the raw power of LLMs and the need for stability, reliability, and human oversight in real-world applications. 
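</p><p>As a sketch of what such control flow might look like in code, the snippet below wraps an email-drafting model with a routing step and a guardrail check. The intent classifier, template table, banned-phrase list, and escalation path are all invented stand-ins for production components.</p><pre><code>BANNED_PHRASES = ["guaranteed refund", "legal action"]

def classify(ticket_text):
    # Stand-in for a small intent model; in production this would be trained, not keyword-based.
    return "complaint" if "broken" in ticket_text else "inquiry"

def draft_email(intent, ticket_text, generate):
    templates = {
        "complaint": "Apologize, acknowledge the issue, and state the next step.",
        "inquiry": "Answer the question concisely and link to documentation.",
    }
    return generate("Instructions: " + templates[intent] + "\nTicket: " + ticket_text)

def guardrail(draft):
    # Keep the agent inside the rails: block risky promises and force human review instead.
    return not any(phrase in draft.lower() for phrase in BANNED_PHRASES)

def handle_ticket(ticket_text, generate):
    intent = classify(ticket_text)                  # routing decision
    draft = draft_email(intent, ticket_text, generate)
    return draft if guardrail(draft) else "escalate_to_human"

fake_llm = lambda prompt: "We are sorry your device arrived broken; a replacement ships today."
print(handle_ticket("My device arrived broken.", fake_llm))
</code></pre><p>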
See also: <a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">How to do retrieval augmented generation (RAG) right!</a></p><p>You can think of control flows as generalized policies that come from a deep understanding of the workflow the agent is trying to automate. They could include things like:</p><ul><li><p><strong>Guardrails: </strong>Type of control flow specifically designed to restrict the LLM's behavior in certain ways. They act like safety rails, preventing the LLM from venturing into undesirable areas or generating harmful outputs. For instance: preventing offensive language, staying on topic, and fact-checking.</p></li><li><p><strong>Routing</strong>: LLMs can be used to make decisions about how to respond to a user's query. This can involve classifying the query type (e.g., question, request, instruction) and then directing the response accordingly. For instance, an LLM might predict the best response to a question is a factual summary, while a request might require completing an action (like booking a flight).</p></li><li><p><strong>Error Handling and Recovery</strong>: When an agentic system encounters an issue, control flow mechanisms allow it to diagnose the problem and take corrective actions. This might involve prompting the user for clarification, reformulating a request, or attempting alternative strategies to complete the task.</p></li><li><p><strong>Prioritization and Decision Making</strong>: Agentic systems often juggle multiple tasks or goals.
Control flow structures help them prioritize based on urgency, importance, or available resources.&nbsp; For instance, a virtual assistant might prioritize responding to an urgent message over completing a less time-sensitive task.</p></li><li><p><strong>State Management</strong>:&nbsp; Many agentic systems track their internal state (e.g., conversation history, user preferences) to provide a more consistent and personalized experience. Control flow dictates how the system updates its state based on new information and uses it to inform future actions. Imagine a chatbot remembering your previous order preferences while recommending a new product.</p></li><li><p><strong>Learning and Adaptation</strong>:&nbsp; Advanced agentic systems can learn and adapt their behavior over time. Control flow allows them to integrate newly acquired knowledge into their decision-making process. For instance, a recommendation system might adjust its suggestions based on your past interactions and positive feedback.</p></li></ul><p>Most common implementations involve training small, specialized models (not necessarily language models) that carry out tasks and provide information or constraints to the overall system, including crafting prompts that maximize the likelihood of the desired LLM response.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Multi-agent LLM systems</h2><h3>Why multi-agent systems?</h3><p>Multi-agent systems involve multiple single-agent systems that interact to achieve complex tasks. This is particularly useful when a monolithic agent (one policy based on a singular reward function) is inefficient or impossible. In our customer service example, imagine an agent that is rewarded for maximizing the likelihood of upselling products. This can hurt the business by sacrificing long-term retention of the customer if they start to feel they are being &#8220;sold to&#8221;. One solution could be adding more terms to the reward function to account for long-term retention. Now collapsing all those contradictory requirements into one function might not necessarily result in the best policy learned by the system. Alternatively, one could create a system where two agents, one rewarded for maximizing short-term profit and one rewarded for reducing the risk of churn, can collaborate and keep each other in check.&nbsp;</p><p>Some of the other reasons for breaking up the system into multiple agents are:</p><ul><li><p>Optimized task allocation: It might be more efficient (from a design, implementation, and maintenance point of view) to break down a complex problem into smaller subproblems and have agents rewarded for solving those subproblems specifically.
This more modular design, although unnecessary from a functionality point of view, could be easier to improve and scale.</p></li><li><p>Enhanced response time: The modular design of multiple agents is not only useful if the sub-problems are different but also in cases where they might be similar but they can be done in parallel to save analysis time.&nbsp;</p></li><li><p>More robust specialization: While it is plausible for one agent to learn how to choose amongst a large number of actions, it might be more robust to partition actions based on some relevant property and have agents specialize in using them more effectively.</p></li></ul><h1>Agents UX and HCI</h1><p>Software agents have become integral components of various Human-Computer Interaction (HCI) applications. These agents, powered by large language models (LLMs), serve various roles, from acting as personal assistants to customizing user interactions based on individual preferences. This section explores the roles of LLM agents in HCI and their impact on user experiences.</p><ul><li><p><strong>Intelligent User Interfaces (IUIs):</strong>&nbsp; Agents can act as intelligent assistants that can understand user needs, provide recommendations, and automate tasks. They offer a more intuitive and efficient means of interaction, reducing the cognitive load on users. Virtual assistants like Siri, Alexa, or Google Assistant are prime examples of IUIs which allow LLM agents to interpret user queries and provide relevant information.</p></li><li><p><strong>Personalization and Recommendation Systems:</strong>&nbsp; Recommendation systems in e-commerce or streaming services can be powered by agents. These agents learn user preferences and recommend products, movies, or music based on that information. However, it's important to acknowledge that LLM agents can inherit biases from the data they're trained on. Transparency in how recommendations are generated is crucial for user trust.</p></li><li><p><strong>Adaptive Interfaces:&nbsp; </strong>Agents can be used to create adaptive interfaces that adjust to user behavior or skill level.&nbsp; For instance, an educational software program might use an agent to tailor the difficulty of exercises based on the user's performance.</p></li><li><p><strong>Embodied Conversational Agents (ECAs):</strong>&nbsp; These are virtual characters that can interact with users through spoken language or gestures. ECAs powered by LLMs can be used for customer service, education, or even companionship. Imagine an ECA tutor that personalizes learning experiences and provides emotional support.</p></li><li><p><strong>Augmented Reality (AR) and Virtual Reality (VR): </strong>&nbsp;Agents can be integrated into AR/VR experiences to guide users, provide information, or even act as companions within the virtual environment. An LLM agent in an AR museum experience could provide historical context about exhibits or answer visitor questions in a natural, conversational way.</p></li></ul><p>LLM agents are revolutionizing HCI by creating more intuitive, efficient, and personalized user experiences. They reduce cognitive load, improve accessibility, and offer a more natural way to interact with technology. 
As LLM agents continue to evolve and become more sophisticated, we can expect even more innovative applications that enhance human-computer interaction in the years to come.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Parting words</h1><p>Hopefully, with this article you have learned more about agents and are as excited as we are about them! Agents are the holy grail of a lot of what we have done and have been wanting to do with computers for the past several decades. Today, with natural language as a new way to interface with machine, we are closer than ever to that dream! </p><p>Happy building!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 1 - The “9” Commandments: How to Build LLM Products Successfully]]></title><description><![CDATA["It&#8217;s easy to demo a car self-driving around a block, but making it into a product takes a decade." - Karpathy]]></description><link>https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 24 Jul 2024 14:04:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fabc6b99-0675-42df-822f-ba5208156475_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this write up we will go over the most important principles you should follow as you ideate, validate, design, and build your LLM product. One thing that you will realize by the end of this is that the principles of building the most sophisticated multi-agent LLM products is the same as the ones for any LLM product and ultimately the same as the ones for any data-powered software product.&nbsp;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>1. Data is most probably your only moat</h1><p>We are living in a world where powerful open-source models are just a few clicks away, and your proprietary data is likely your only sustainable competitive advantage. While anyone can access state-of-the-art models like GPT-4o, Llama3, and Claude, the data you use to fine-tune and augment these models is what will truly set your product apart. Your data is the secret sauce that enables you to build AI systems that can perform tasks and provide insights your competitors can only dream of. Even if becoming a unicorn is not necessarily your thing, being able to interface LLMs with different types of data (eg. multi-modality) blows up the space of possibilities you can explore in terms of use cases.&nbsp;</p><p>It is crucial to focus on building products and features that allow you to collect unique and valuable data that others can't easily replicate. This might mean targeting niche domains where you have deep expertise, or creating AI-powered tools that incentivize users to contribute their own data. Another strategy is to form data partnerships with organizations that have complementary datasets, allowing you to enhance your models' capabilities without starting from scratch.</p><p>Of course, collecting high-quality data can be challenging and resource-intensive. One approach to mitigate this is to create synthetic data that mimics real-world scenarios. Synthetic data can help augment your existing datasets and improve model performance, especially in cases where real data is scarce or expensive to obtain.&nbsp;</p><p>When it comes to preparing your data for training models, it's important to weigh the benefits of annotating data versus relying solely on unsupervised methods. While unsupervised learning can be appealing due to its potential to reduce manual labor, annotated data often leads to better model performance and faster convergence. Investing in data annotation can pay off in the long run by improving the accuracy and reliability of your AI systems.</p><p>Adopting a data-centric approach to machine learning is key to building a strong competitive moat. By focusing on collecting, curating, and leveraging high-quality data, you can create AI products that are more accurate, insightful, and valuable to your users. Always be on the lookout for opportunities to expand and enrich your data moat, as it will be the foundation upon which your AI business is built.</p><p>The flip side of this advice is that blindly following this principle without thinking deeply about what results in customers&#8217; long-term loyalty could simply result in disappointment.
Ultimately, the big question is how to use the data (or any other unique resources you have) to deliver value to your customers in a way that compounds: you win new customers and expand your relationship with the ones that you have.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further reading:</p><ol><li><p><a href="https://www.michaeldempsey.me/blog/2024/04/17/the-meta-moats-of-ai/">The Shifting Dynamics &amp; Meta-Moats Of AI</a></p></li><li><p><a href="https://eugeneyan.com/writing/synthetic/">How to Generate and Use Synthetic Data for Finetuning</a></p></li><li><p><a href="https://a16z.com/the-empty-promise-of-data-moats/">The Empty Promise of Data Moats</a></p></li></ol><h1>2. Follow validation driven development&nbsp;</h1><p>To build successful AI products, you need a rigorous approach for measuring and optimizing performance. This is where evaluation-driven development comes in. This is particularly important for agentic workflows and multi-agent systems. A very common problem in naively built agentic systems is compounding error in these systems that quickly leads into systems falling in endless loops or producing nonsensical results. The only way to avoid these problems is having reliable and granular metrics throughout the system that act as feedback or reward mechanisms keeping the components and overall system in check.</p><p>Start by defining clear, quantitative metrics that capture what "good" looks like for your product - whether that's accuracy, user engagement, task completion rate, or some combination of these. This has to be done for the components of your architecture as well as the overall performance of the system.</p><p>With your key metrics in place, orient your development process around continuously evaluating your pipelines against these benchmarks and iterate to improve performance. This could involve experimenting with different system designs, model architectures, model combinations, fine-tuning techniques, prompt engineering approaches, and UX designs. The key is to have a solid experimental setup where you're constantly shipping new arrangements of components, measuring their impact on your core metrics, and doubling down on the most promising ideas. This is also particularly important for LLM agent systems since the landscape of potential improvements is so vast that a thorough investigation of all possibilities with limited resources is simply impractical.&nbsp;</p><p>Don't get caught up in chasing the latest shiny model or technique without a clear sense of how it actually moves the needle on your core evaluation criteria. If you can't measure what "better" means, you're at high risk of turning in circles or fixating on the wrong things. 
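An evaluation harness for this does not need to be fancy. Here is a minimal sketch (the pipeline stub, test cases, and metric below are illustrative, not a prescribed framework):</p><pre><code>from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected_keywords: list[str]  # a crude proxy for "good"; swap in whatever captures it for your use case

def run_pipeline(query: str) -> str:
    """Stand-in for the system under test (RAG chain, agent, etc.)."""
    raise NotImplementedError

def evaluate(cases: list[EvalCase]) -> dict[str, float]:
    hits = 0
    for case in cases:
        output = run_pipeline(case.query).lower()
        if all(kw.lower() in output for kw in case.expected_keywords):
            hits += 1
    return {"keyword_coverage": hits / len(cases) if cases else 0.0}

# Run this on every change to prompts, retrieval, or models, and track the numbers over time.
</code></pre><p>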
By grounding your development in rigorous evaluation, you can efficiently zoom in on system architecture designs that actually deliver value to your users.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">How to do retrieval augmented generation (RAG) right!</a></p></li><li><p><a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a></p></li><li><p><a href="https://web.stanford.edu/class/cs329t/slides/lecture_16_1.pdf">Evaluating LLM Agents</a></p></li><li><p><a href="https://github.com/THUDM/AgentBench">AgentBench</a></p></li></ol><h1>3. Get your product in the hands of your (ideally paying) users asap&nbsp;</h1><p>One of the biggest pitfalls in AI development is getting bogged down in endless technical tweaks before getting any feedback from real users. This is especially tempting with LLMs, where there's always another parameter to tune or dataset to incorporate. But the reality is, you'll never know if you're building something people actually want until you put it in their hands. The feedback gets even more real if they are paying you (or at least they anticipate having to pay you to use the product).</p><p>The antidote is simple (but not always easy): Build the simplest viable version of your product and get it in front of users as quickly as humanly possible. This might mean starting with a bare-bones MVP that only does one thing, or even launching a "fake" version powered by human labor behind the scenes. The point is to start collecting real feedback and data from day one, so you can validate your core assumptions and start iterating in the right direction. Doing this can also help regulate your understanding of the right metrics to track as per last commandment. It is easy to lose sight of what really matters to the user quantitatively by hiding behind technical metrics like accuracy.</p><p>Ideally, get this initial version in the hands of paying customers, even if it's just a small pilot group. Seeing real people actually fork over their hard-earned cash for your product is the ultimate validation that you're onto something. Plus, having revenue coming in from the get-go will help extend your runway and give you more breathing room to iterate.</p><p>Another important aspect of this is deployment. 
It is great that your product works on your laptop, but if the user can&#8217;t interact with it you have significant friction in getting the feedback that you need.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/launch-an-llm-app-in-one-hour/">Launch an LLM App in One Hour</a></p></li><li><p><a href="https://blog.streamlit.io/how-to-build-a-llama-2-chatbot/">How to build a Llama 2 chatbot</a></p></li><li><p><a href="https://docs.streamlit.io/deploy/tutorials/docker">Deploy Streamlit using Docker</a></p></li><li><p><a href="https://a16z.com/a-framework-for-finding-a-design-partner/">A Framework for Finding A Design Partner</a></p></li><li><p><a href="https://ngrok.com/docs/http">Ngrok: Expose local application to the internet</a></p></li><li><p><a href="https://www.geekwire.com/2009/bootstrapping-stories-financing-your-startup-through-consulting/">Bootstrapping stories: Financing Your Startup Through Consulting</a></p></li></ol><h1>4. Separate the data and interface layers, and be prepared to invest in data engineering&nbsp;</h1><p>Using LLMs doesn't give you a free pass to ignore established software and data engineering best practices. In fact, as LLM-based systems grow in complexity and capability, it becomes even more critical to architect your systems in a modular, maintainable way.&nbsp;</p><p><strong>A key principle here is maintaining a clear separation between your data and interface layers.</strong></p><p>Designing your system in a way that the LLM itself becomes the source of knowledge is a risky and ill-advised approach. Instead, strive to architect your system, craft your prompts, and provide the relevant context to the LLM to ensure it relies solely on the information you supply to it when generating a response. While this may evolve in the future, cleanly decoupling your data from your interfaces gives you greater control, allows you to layer on additional security and privacy measures, and makes your system more robust to changes in the underlying models. Retrieval-augmented generation (RAG) techniques provide a powerful way to achieve this decoupling while still harnessing the full power of LLMs.</p><p>It is tempting to think that you just fine-tune one model using your data and it will work as expected with all the controls that you need. The reality is that LLMs are not well-behaved enough to achieve any granular level of control necessary for real-world applications and use cases. It is best to separate out the data layer (knowledge base documents, structured data, etc) into already well-established structures (aka databases) with all the necessary controls (eg. identity and access management) that come with those.
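A minimal sketch of this separation might look like the following, with the data layer (storage, access control, retrieval) kept apart from the linguistic interface; the toy document store and the <code>call_llm</code> stub are assumptions for illustration only:</p><pre><code>from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_roles: set

def call_llm(prompt: str) -> str:
    """Stand-in for your model client."""
    raise NotImplementedError

# --- Data layer: storage, access control, and pre-processing live here ---
DOCS = [Doc("Refunds are processed within 5 business days.", {"support", "admin"})]

def retrieve(query: str, role: str, k: int = 3) -> list[str]:
    # Swap this naive keyword match for your vector / keyword / SQL search.
    words = query.lower().split()
    hits = [d for d in DOCS if role in d.allowed_roles and any(w in d.text.lower() for w in words)]
    return [d.text for d in hits[:k]]

# --- Interface layer: the LLM only ever sees what the data layer hands it ---
def answer(query: str, role: str) -> str:
    context = "\n\n".join(retrieve(query, role))
    return call_llm(
        "Answer using only the context below; say you do not know if the answer is not there.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
</code></pre><p>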
This also makes it easier for you to pre- / post- process that data before feeding it to the model in retrieval augmented generation (or equivalent) setups.&nbsp;</p><p>Separating the data layer also gives you the ability to build all the logic necessary for processing, storing, and retrieving the data used to train and run smaller models you use in your control flows or to fine-tune your larger models. This includes data ingestion pipelines, data cleaning and transformation steps, feature engineering, and data versioning. Your interface layer, on the other hand, should focus solely on exposing the capabilities of your models to end-users, whether that's via APIs, chatbots, or interactive GUIs. Of course, the LLM itself can act as a linguistic interface by providing conversational interactions with the user.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://datakitchen.io/your-generative-ai-llm-needs-a-data-journey-a-comprehensive-guide-for-data-engineers/">Your LLM Needs a Data Journey: A Comprehensive Generative AI Guide for Data Engineers</a></p></li><li><p><a href="https://www.private-ai.com/en/2024/05/23/rag-privacy-guide/">Unlocking the Power of Retrieval Augmented Generation with Added Privacy: A Comprehensive Guide</a></p></li><li><p><a href="https://blogs.cisco.com/learning/securing-the-llm-stack">Securing the LLM Stack</a></p></li><li><p><a href="https://a16z.com/emerging-architectures-for-llm-applications/">Emerging Architectures for LLM Applications</a></p></li><li><p><a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">How to do retrieval augmented generation (RAG) right!</a></p></li></ol><h1>5. Do not count on LLMs beyond linguistic interfaces&nbsp;</h1><p>LLMs are incredibly powerful for natural language tasks - they can engage in human-like dialogue, answer questions, summarize long passages, and even write creative fiction. But it's critical not to get swept away by the hype and expect them to be a magic bullet for every use case. As the name suggests, LLMs are language models - they excel at generating statistically plausible sequences of words, but struggle with many other desirable capabilities like reasoning, analysis, and grounding in real-world facts.</p><p>Many people fall into the trap of hoping LLMs will handle complex reasoning, read their minds to infer intent, write flawless code on the first try, or magically handle scheduling and workflow automation. But today's models simply aren't reliable for these types of tasks. Outside of linguistic interfaces, LLMs have significant limitations that constrain their usefulness. They are notoriously prone to "hallucinations" - confidently generating false or nonsensical information that can be hard to detect. They struggle with maintaining coherence over long time horizons or complex multi-step tasks.</p><p>So when architecting LLM-powered products, it's crucial to be ruthlessly realistic about what the models can and can't do.
Focus on leveraging LLMs for what they excel at - engaging with users through natural language - and thoughtfully architect supporting systems to handle any downstream tasks. Be prepared to break down complex workflows into atomic steps, provide extensive context and guidance, and double-check outputs for factual and logical consistency. By playing to the strengths of LLMs while proactively addressing their limitations, you can design products that harness their power while mitigating their downfalls.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://medium.com/@glovguy/large-language-models-reasoning-capabilities-and-limitations-951cee0ac642">Large Language Models: Reasoning Capabilities and Limitations</a></p></li><li><p><a href="https://arxiv.org/abs/2311.08516">LLMs cannot find reasoning errors, but can correct them!</a></p></li><li><p><a href="https://arxiv.org/html/2402.01817v2">LLMs Can&#8217;t Plan, But Can Help Planning in LLM-Modulo Frameworks</a></p></li></ol><h1>6. Create Robust Feedback Modules</h1><p>In academia, scientific papers undergo peer review, where different experts independently critique the work before publication. Borrowing from this process, a powerful paradigm for building self-improving AI systems is to train multiple models that play distinct roles akin to authors and reviewers.</p><p>In this setup, you might use a generative model to produce some output, like a dialogue response, a document summary, or a piece of code. You then have to use a separate "critic" model to evaluate the quality of that output along various dimensions like factual accuracy, logical coherence, style and tone. Crucially, these models are trained independently, so the critic acts as an objective assessor, not just a rubber stamp.</p><p>This is particularly important in agentic systems where the goal is for the system to continuously monitor its performance, reflect on the outcome, and try again with improved likelihood of better performance. This is a crucial ingredient for the level of autonomy we seek in agents. Therefore implementing highly reliable, accurate, and trustworthy feedback sub-systems (aka &#8220;reward mechanism&#8221; in the context of RL) is a big part of success in building an agentic product.&nbsp; You can even equip the agents with ensembles of critic tools (including but not limited to occasionally asking for human input) to cover different facets of evaluation, like long-term coherence vs. individual response quality.</p><p>The key benefit of this architecture is that it provides a scalable mechanism for quality control and continuous improvement that doesn't rely solely on human judgement. That said, it's not a total replacement for human evaluation - you'll still want to spot check the system's outputs, especially in the early stages. And there's an art to designing the right training setup and reward functions to get useful feedback while avoiding degenerate equilibrium between the generator and critic. 
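As a rough illustration, a generate-critique-revise loop can be as small as the sketch below; <code>call_llm</code> again stands in for your model client (the generator and critic can be different models), and the prompts are placeholders:</p><pre><code>def call_llm(prompt: str) -> str:
    """Stand-in for your model client."""
    raise NotImplementedError

def generate(task: str, feedback: str = "") -> str:
    note = f"\nRevise the previous attempt, addressing: {feedback}" if feedback else ""
    return call_llm(f"Task: {task}{note}")

def critique(task: str, draft: str) -> tuple[bool, str]:
    verdict = call_llm(
        "You are a strict reviewer. Check the draft for factual accuracy, coherence, and tone.\n"
        f"Task: {task}\nDraft: {draft}\n"
        "Reply with 'PASS', or 'FAIL: ' followed by your reasons."
    )
    return verdict.startswith("PASS"), verdict

def generate_with_review(task: str, max_rounds: int = 3) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        ok, feedback = critique(task, draft)
        if ok:
            return draft
        draft = generate(task, feedback)
    return draft  # in practice, escalate to a human here instead of looping forever
</code></pre><p>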
But when done right, this approach can imbue your AI systems with the benefits of peer review to make them more robust and self-correcting over time, and therefore achieving a higher level of autonomy.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/">What We Learned from a Year of Building with LLMs (Part I)</a></p></li><li><p><a href="https://arxiv.org/abs/2310.01798">Large Language Models Cannot Self-Correct Reasoning Yet (Deepmind)</a></p></li><li><p><a href="https://arxiv.org/html/2402.01817v2">LLMs Can&#8217;t Plan, But Can Help Planning in LLM-Modulo Frameworks</a></p></li></ol><h1>7. Actually Improve User&#8217;s Productivity&nbsp;</h1><p>It's easy to get caught up in building flashy AI demos that showcase the latest and greatest model capabilities. But at the end of the day, the true measure of success for your LLM products is how well they improve users' lives in tangible ways. In particular, since the most promised benefit of LLM agent tools is productivity, it's critical to honestly assess whether you're making people more efficient at important tasks, or just giving them one more thing to babysit.</p><p>To deliver real productivity gains, you need to deeply understand your target users' existing workflows and ruthlessly prioritize AI features that will save them time and effort. Now productivity does not equal saving time only, but rather it implies saving unwanted effort, therefore, your product has to address workflows that your users:</p><ol><li><p>Spend significant time on, AND&nbsp;</p></li><li><p>They do not want to spend that time doing that task.&nbsp;</p></li></ol><p>Approach every new capability through the lens of "how does this concretely make my user's job efficient and effective?" If you can't quantify the impact, chances are it's not worth building.</p><p>Another important nuance here is that builders are sometimes excited about taking away the parts of the job that people actually enjoy spending time on rather than parts that they hate doing. While that product might theoretically improve productivity, the psychological barrier of using it will backfire.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further reading:</p><ol><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321">Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality</a></p></li><li><p><a href="https://hbr.org/2024/01/is-genais-impact-on-productivity-overblown">Is GenAI&#8217;s Impact on Productivity Overblown?</a></p></li></ol><h1>8. 
Think deeply about integrations into users' tools and workflows&nbsp;</h1><p>To drive successful adoption, your AI product needs to fit seamlessly into users' existing workflows and tool chains. No matter how impressive your models are under the hood, if using your product feels like a clunky, disjointed experience, people simply won't bother. On the flip side, if your product slots nicely into the tools and processes users are already using day-to-day, you'll dramatically lower the barriers to adoption and make your AI feel like a natural extension of users' workflows. The last thing people want is yet another siloed app to switch back and forth from. Instead, look for opportunities to embed your AI capabilities right within the apps users already live in day-to-day, whether that's their email client, messaging platform, note-taking tool, or code editor.</p><p>By meeting users where they already work and focusing relentlessly on concrete effort savings, you can ensure your AI product isn't just a novelty, but an essential part of people's daily flow. And those efficiency gains add up fast - saving someone a few minutes or clicks on a task they do 10 times a day is a game changer. Keep humans at the center, measure what matters, and optimize for their productivity above all else.</p><p>To get this right, you need to invest significant time upfront to deeply understand how your target users currently work and what their key pain points are. This means going beyond surface-level interviews and surveys to really immerse yourself in their world. Shadow them as they go about their tasks, paying close attention to all the tools, systems, and collaborators they interact with along the way. Map out their end-to-end workflows to identify bottlenecks, inefficiencies, and opportunities for AI to streamline the process.</p><p>Armed with this deep understanding, architect your AI product to integrate with the specific tools your users depend on, with seamless bridges for importing and exporting data, triggering actions, and collaborating with teammates. In many cases, this means delivering your AI capabilities as plugins or add-ons right within users' primary tools, instead of forcing them to switch to a separate app.</p><p>When well executed, this deep integration approach makes your AI product feel less like a tool and more like an intelligent assistant that's always there in the flow of work, ready to lend a hand. Users don't have to disrupt their normal processes or learn new interfaces - they can simply tap into the power of AI whenever and wherever they need it. And that frictionless experience is the key to making AI an indispensable part of people's daily lives.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p>Change Management for LLM-based Products</p></li></ol><h1>9. Design for humans!&nbsp;</h1><p>Amid the excitement around LLMs and other AI breakthroughs, it can be tempting to get carried away imagining a world where machines handle every task and decision. 
But the reality is, humans are going to remain an essential part of the equation for the vast majority of use cases for the foreseeable future. Even the most sophisticated AI systems today are narrow in scope and brittle in the face of edge cases. They are powerful tools to be wielded by humans, <a href="https://aisc.substack.com/p/ai-as-judgment-machine">not wholesale replacements for human judgement</a>.</p><p>As such, it's critical that we keep real human needs, behaviors, and constraints at the center of our AI product development process. At every step along the way, we need to be testing our products with actual users, seeing how they integrate (or don't) into their real-world contexts, and shaping the user experience accordingly. Pretty model performance numbers in a lab setting are meaningless if they don't translate into tangible benefits for humans in the messy real world.</p><p>Prioritizing the human element means investing deeply in thoughtful UX design, extensive user testing, and rapid iteration based on feedback. It means providing robust, accessible user education to help people understand both the capabilities and limitations of the AI systems you're putting in their hands. And it means proactively considering and mitigating the potential risks and unintended negative consequences your product could have in people's lives.</p><p>Ultimately, our north star as AI product builders should be empowering humans to do their best work. We have an incredible opportunity to usher in a new era of productivity and creativity, but it will require the hard, patient work of aligning powerful AI capabilities with real human needs. If we keep humans at the center and measure success by the positive impact we have in their lives, not just our model metrics, we can build an AI-powered future that brings out the best in both machines and people. The road ahead won't be easy, but the destination will be more than worth it. So stay focused on those human needs, keep iterating, and let's build the future together!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://www.reaktor.com/articles/crafting-llm-powered-interactions-design-principles-for-natural-language-user-interfaces">Crafting LLM-powered Interactions: Design Principles for Natural-Language User Interfaces</a></p></li></ol><h1>Conclusion</h1><p>Building successful multi-agent LLM products requires a multidisciplinary approach that blends technical chops, product sensibilities, and proactive ethical responsibility. The 9 principles we've explored provide a comprehensive roadmap for navigating this complex landscape.</p><p>By focusing on building proprietary data moats, maintaining modularity in your architecture, relentlessly evaluating and iterating on product performance, and deeply integrating with users' workflows, you'll be well on your way to creating AI systems that deliver real value. 
And by keeping humans at the center of the process and proactively addressing potential negative impacts, you can ensure that value is achieved responsibly and sustainably.</p><p>But while these principles provide a solid foundation, the reality is that building game-changing AI products is hard. It requires grappling with cutting-edge research, wrangling messy real-world data, and constantly iterating in the face of shifting user needs and expectations. There will be setbacks, dead-ends, and pivots along the way. The key is to stay focused on your north star of empowering users, stay humble in the face of complexity, and keep pushing forward one experiment at a time.</p><p>The potential for LLMs and multi-agent systems to transform how we live and work is immense - but it won't be realized without diligent, human-centric innovators translating the raw capabilities into meaningful products. By internalizing these 9 principles and tenaciously applying them in practice, you'll be at the vanguard of this exciting frontier. The journey won't be easy, but the destination will be more than worth it. So go forth and build the future!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[How to do retrieval augmented generation (RAG) right!]]></title><description><![CDATA[My client: &#8220;hey, we have this vector db + LLM RAG thing and it&#8217;s not working&#8221;. And my answer is often &#8220;pull in the chair and sit down, we need to talk about how robust software is built".]]></description><link>https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation</link><guid isPermaLink="false">https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 22 May 2024 14:04:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XNjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Why write this article?</h1><p>Well, it&#8217;s no surprise that retrieval augmented generation (RAG) has become such a commonplace paradigm for all things LLMs these days. I would attribute that to two things: 1) it&#8217;s pretty straightforward to understand and implement 2) vector db vendors are making damn sure you &#8220;think&#8221; RAG (with their product) is all you need to live happily ever after. Really, though?</p><p>That&#8217;s why a lot of my client conversations start by &#8220;hey, we have this vector db + LLM RAG thing and it&#8217;s not working, do you know why&#8221;. 
And my answer is often &#8220;pull in the chair and sit down, we&#8217;re going to have a long chat about how robust software is designed and built&#8221;.&nbsp;</p><p>The reality is, a one-size-fits-all approach just doesn't work when it comes to information retrieval, data handling, and reliable language generation. It all starts by thinking deeply about what your software is trying to achieve, what &#8220;good&#8221; performance means quantitatively, and how that definition of &#8220;good&#8221; can act as a north star to guide your design and implementation decision.&nbsp;</p><p>In order for us to discuss how RAG should be used properly, let&#8217;s first recap what the motivation behind it is.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/p/8ed4c5/llm-agents-do-i-need-them-for-my-use-case&quot;,&quot;text&quot;:&quot;Lightning Course on Agents&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/p/8ed4c5/llm-agents-do-i-need-them-for-my-use-case"><span>Lightning Course on Agents</span></a></p><h1>Why RAG though? for real&#8230;</h1><p>Even the most impressive large language models (LLMs) stumble when it comes to verifiable facts, real-world understanding, and staying grounded in anything, esp. when it comes to the sparse parts of the data they were trained on. Besides that, as the name suggests, they are language models which means that they are only reliably good at parroting the average language used on the internet, and not the other 1000 things we would like to do with them.&nbsp;</p><p>But what if we could connect these powerful language models to up-to-date, trusted, and authoritative external knowledge sources such as&nbsp; databases, APIs, proprietary documents, knowledge graphs, you name it? Imagine a chatbot that can draw upon a company's vast knowledge base, customer interaction history, and product documentation to provide truly personalized and effective customer support. These possibilities highlight the desirability of such interfaces: expanded capabilities, richer understanding, and more nuanced responses.</p><p>RAG was proposed as a way of retrieving targeted, relevant information from external data sources to enrich the responses of language models. RAG provides the opportunity to have granular control over the behavior of the LLM by separating the linguistic interface (LLM itself) and the data layer. This allows for a more modular approach where all the appropriate processing can happen at the data layer (eg. measures for security, privacy, access control, relevance, refinement, guardrails) and let the LLM be in charge of what it&#8217;s good at: understand what the user is saying, and respond elegantly.</p><h2>The Need for External Data Integration</h2><p>Large language models trained on existing data are inherently limited in major ways:</p><ul><li><p><strong>Limited Knowledge and Expertise Scope: </strong>No matter how much data is ingested during training, these models can't cover the full breadth of factual knowledge about our infinitely complex world. There will always be gaps to be filled. 
This is compounded by the fact that LLMs are trained to be good at language skills, which does not make them great at other expert tasks like mathematics, analyzing structured data, or simulating the physical world.&nbsp;</p></li><li><p><strong>Factual Inaccuracy and Hallucinations: </strong>Given that they are only trained to predict the next most likely token, language models can easily generate information that is outdated or plain wrong. They operate based on patterns in the text data, without a deep understanding of how the information connects to real-world entities, events, and concepts. They sometimes produce fluent yet false information (hallucinations) as they try to "make sense" of sparse data, leading to spurious correlations.</p></li><li><p><strong>Inconsistency: </strong>Language models can produce different results from run to run, even with the same inputs, due to their probabilistic nature and lack of explicit design for deterministic outputs. This can be problematic when using LLMs for tasks requiring stable and reproducible results, such as in scientific, financial, or legal contexts. For example, an LLM might incorrectly state a historical date or generate conflicting legal interpretations for the same query across multiple runs.</p></li></ul><p>Integrating external data sources through RAG can help address these limitations by coupling them with more reliable information retrieval systems. It is also important to recognize the pertinent challenges involved, including seamlessly interfacing diverse data formats, balancing retrieval quality and efficiency, and ensuring the external data is trustworthy and up-to-date. If we can get it right, RAG has the potential to unlock a new level of knowledge-intensive, contextually aware interactions across a wide range of domains, which we will explore further at the end.</p><h1>Validation Driven Development</h1><p>We said earlier that the one-size-fits-all approach to interfacing LLMs with external data, as suggested in vendor blog posts online, does not work. Instead, the architecture used has to be tightly designed around the nuances of the workflow it&#8217;s expected to augment. Humans don't simply regurgitate memorized facts. We gather information, analyze it, and then use language to communicate our understanding. RAG that mimics this process can be more nuanced and adaptable. Imagine a system that can not only retrieve relevant information but also identify potential biases or missing context, just like a human researcher might. This would lead to more reliable, nuanced, and insightful outputs.</p><p>Don&#8217;t get me wrong! A basic RAG architecture might be a good starting point, but it's crucial to assess its performance across diverse situations to deeply understand where it stumbles. Does it struggle with factual queries or with keeping consistent themes in creative writing tasks? Identifying these weaknesses allows us to be targeted and efficient with potential improvements. By pinpointing the common patterns of error coming out of a thorough assessment of a simpler, starter architecture, we can explore ways to enhance the system&#8217;s capabilities.</p><p>The landscape of potential improvements can be overwhelming. Should we try a different LLM? Should we increase the size of the database? Tweak the retrieval algorithm? Fine-tune the model? Trying to tackle everything at once is a recipe for wasted effort.
Instead, focusing on the most important shortcomings in performance, one at a time, allows for a more systematic approach. We can identify a specific weakness, test targeted solutions, and measure the impact. For example, imagine systematically addressing the system's tendency to generate factual inaccuracies. By iteratively improving the fact-checking capabilities, we can build a more trustworthy and reliable RAG system.&nbsp;</p><h2>Validation vs Evaluation vs Verification</h2><p>Evaluating LLM-based software, often used to tackle various cognitive use cases within a certain business context, presents a unique challenge compared to traditional software or even less advanced ML models. While the core principles of verification, validation, and evaluation still hold true, their application requires adaptation to the dynamic nature of LLMs.</p><p>Before getting into the details of each of these terms, let&#8217;s define what we mean by these terms:</p><ul><li><p><strong>Evaluation:</strong> Is the software good? Eg. Is the system good at doing math?</p></li><li><p><strong>Verification:</strong> Is the software built right? Eg. Did this iteration of the system run provide the correct math answer?</p></li><li><p><strong>Validation:</strong> Is the software built, the right thing? Eg. is the LLM system good at the math questions the user cares about?&nbsp;</p></li></ul><p><strong>Evaluation</strong>, assessing the overall quality and impact of the LLM system, is not always trivial. Traditional metrics like accuracy may not fully capture the nuances of human language interactions. New metrics encompassing factors like fairness, interpretability, and creativity are needed to judge the system&#8217;s effectiveness. It is often also helpful to take a step back and look at the impact of the output of the system on a measurable downstream task (eg. did the generated proposals result in more sales). Evaluation needs to be ongoing and iterative, monitoring for potential drift in performance or the emergence of unintended biases as the models, esp. third party ones, evolve.&nbsp;</p><p><strong>Verification</strong>, ensuring the system functions as designed every time, becomes more nuanced. Traditional unit testing struggles to capture the emergent behavior arising from complex interactions between LLM components. Instead, techniques like adversarial testing (red teaming), fuzzing (stress testing by feeding unexpected data), and testing individual details of the output (eg. numbers, entities, fact checking) are crucial to uncover unexpected outputs and biases:</p><ul><li><p>Returns an invalid json</p></li><li><p>Returns a schema not consistent with your prompt</p></li><li><p>Your prompt (part of Intellectual Property) gets leaked</p></li><li><p>Replies with harmful content like hate speech</p></li><li><p>Produces malicious code</p></li><li><p>Jailbreaks your prompt</p></li><li><p>Function calling returns incorrect signature</p></li></ul><p><strong>Validation</strong>, confirming the LLM meets user needs, takes on a new dimension. The stochastic nature of the output makes it difficult to decide if an output is reliably good. Besides, in many situations it is hard to decide if a certain output, say a few paragraph text with lots of technical details, is objectively acceptable. Finally, the opaque nature of LLMs makes it difficult to understand their reasoning and decision-making processes. 
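Circling back to the verification bullets above: much of that list can be caught with cheap, deterministic checks run on every response before it reaches the user. A minimal sketch, assuming the system is expected to return JSON with a known schema:</p><pre><code>import json

def verify_output(raw: str, required_keys: set[str]) -> list[str]:
    """Return a list of failure reasons; an empty list means the checks passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    missing = required_keys - set(data)
    if missing:
        failures.append(f"schema mismatch, missing keys: {sorted(missing)}")
    leak_markers = ("ignore previous instructions", "system prompt")
    if any(marker in raw.lower() for marker in leak_markers):
        failures.append("possible prompt leakage or jailbreak attempt")
    return failures

# verify_output('{"answer": "..."}', {"answer", "citations"})
# -> ["schema mismatch, missing keys: ['citations']"]
</code></pre><p>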
Thinking carefully about the user experience of the overall system is an important ingredient for gaining the user&#8217;s trust. This might include verbosely showing the process the system is executing to keep the user engaged, or showing citations for information presented, or communicating the results of verifications if warnings are needed.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNjF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNjF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 424w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 848w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 1272w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png" width="1094" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1094,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XNjF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 424w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 848w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 1272w, https://substackcdn.com/image/fetch/$s_!XNjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, here's the recipe:</p><ol><li><p>Before writing any code, wrap your head around how evaluation, validation, and verification will happen throughout the development process and procure the necessary data for this.&nbsp;</p></li><li><p>Implement comprehensive <a href="https://github.com/guardrails-ai/guardrails">guardrails</a> to prevent undesirable outcomes.</p></li><li><p>Leverage a metric-driven development approach to tailor evaluations to specific use cases.</p></li><li><p>Utilize <a href="https://en.wikipedia.org/wiki/Formal_verification">formal verification</a> when necessary to ensure the reliability and safety of LLMs.</p></li></ol><h1>Modular RAG</h1><p>Let&#8217;s recap! The basic architecture my vector db provider tells me to use doesn't work. I have to spend a lot of time wrapping my head around what &#8220;good&#8221; means for my use case and what data should be used to measure it. And also there&#8217;s no clear path between the basic architecture and what&#8217;s needed to get to the promised &#8220;good&#8221; performance. 
Got it!</p><p>Well, software engineering 101 says: in a situation like this, make the architecture as modular as you meaningfully can so that you can move things around or swap things in and out, and experiment extensively to find the right combination.&nbsp;</p><p>Now, let&#8217;s introduce &#8220;modular RAG&#8221;, aka a properly engineered software system inspired by RAG.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tCal!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tCal!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 424w, https://substackcdn.com/image/fetch/$s_!tCal!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 848w, https://substackcdn.com/image/fetch/$s_!tCal!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 1272w, https://substackcdn.com/image/fetch/$s_!tCal!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tCal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png" width="721" height="955" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:955,&quot;width&quot;:721,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tCal!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 424w, https://substackcdn.com/image/fetch/$s_!tCal!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 848w, https://substackcdn.com/image/fetch/$s_!tCal!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 1272w, https://substackcdn.com/image/fetch/$s_!tCal!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf93322-994f-48f0-9cf8-64bcd813ef47_721x955.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Retrieval-Augmented Generation for Large Language Models: A Survey - arXiv:2312.10997v5</figcaption></figure></div><h2>Common modules</h2><p>Of course, in reality, the modules will end up largely replicating the steps a human would take to carry out a particular workflow. So, it&#8217;s generally good practice to deeply understand the workflows, standard operating procedures, that humans follow, break them down into sub-tasks, and design your architecture to replicate that. Chances are you won&#8217;t get it right the first time, second, or n-th time because humans do a lot of sub-tasks that they don&#8217;t consciously realize they do. But at least you have a modular setup that you can modify based on the edge cases you are uncovering in each iteration.&nbsp;</p><p>That said, there are modules that are relatively common for various use cases. Let&#8217;s take a look at them.</p><h3>Query Expansion and Rewriting</h3><p>This first module involves<strong> </strong>transforming the user&#8217;s query or prompt into a format optimized for searching the external knowledge base or data source. This often involves rephrasing the query, identifying keywords, or explicitly adding relevant info that is implicit in the user&#8217;s utterance. 
More sophisticated methods can turn the query into a rich "search object" containing search phrases, database filter values, query type classifiers, and other metadata to properly configure the subsequent retrieval stages.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWAS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWAS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 424w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 848w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 1272w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWAS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png" width="1456" height="300" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TWAS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 424w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 848w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 1272w, https://substackcdn.com/image/fetch/$s_!TWAS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0533d2-1b5e-4e91-8c03-5e65de10a1b8_1600x330.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Query expansion has been used for decades to bridge the vocabulary gap between how users express queries and how 
information is stated in databases. For example, one classic approach is pseudo-relevance feedback, which retrieves documents based on the initial query, identifies keywords from those documents, adds them to the original query, and retrieves them again.</p><p>In the world of LLMs, there are several techniques:</p><ul><li><p><strong><a href="https://arxiv.org/pdf/2212.10496.pdf">Hypothetical Document Embeddings (HyDE)</a></strong> creates a hypothetical document relevant to the query, uses its embedding to retrieve nearest neighbor documents, and rephrases the query into better-matched terms.</p></li><li><p><strong><a href="https://arxiv.org/pdf/2310.06117.pdf">Step-Back prompting</a></strong> allows large language models (LLMs) to perform abstract reasoning and retrieval based on high-level concepts.</p></li><li><p><strong><a href="https://arxiv.org/pdf/2303.07678.pdf">Query2Doc</a></strong> creates multiple pseudo-documents using prompts from LLMs and merges them with the original query to form a new expanded query.&nbsp;</p></li><li><p><strong><a href="https://arxiv.org/pdf/2305.15294.pdf">ITER-RETGEN</a></strong> proposes a method that combines the outcome of the previous generation with the prior query. This is followed by retrieving relevant documents and generating new results. This process is repeated multiple times to achieve the final result.&nbsp;&nbsp;</p></li></ul><p>For a more in-depth exploration of these query rewriting techniques, please refer to <a href="https://medium.com/@florian_algo/advanced-rag-06-exploring-query-rewriting-23997297f2d1">this article</a>.</p><p>For overall pipeline efficiency, it is crucial to balance the complexity and compute requirements of query expansion. While large language models like GPT-4 are capable of handling query expansion and rewriting, they are often too computationally expensive and slow for efficient use in production settings. To achieve a better balance between performance and efficiency, alternative methods can be employed. Using task-specific fine-tuned small LLMs, knowledge distillation, or even methods like pseudo-relevance feedback (PRF) can provide good performance at a lower cost. In some cases even very small classifiers trained on high quality data can make a big difference in explicitly telling the system what to look for.</p><h3>Information / Example Retrieval&nbsp;</h3><p>With the user's query rewritten into an optimized search object, the Retrieve module fetches potentially relevant information from external data sources. This could involve keyword searching over a set of documents, extracting specific passages, querying structured databases via APIs, or even writing SQL-type queries. This could also involve a mixture of all of the above, and even calling other expert models (machine learning or statistical models or even physics/engineering simulations) served via APIs.</p><p>Another type of subtask a module like this might do is finding relevant examples (eg. FAQ) and putting them into the context of the eventual LLM call for a few-shot setup. 
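</p><p>A minimal sketch of that example-retrieval subtask might look like the following; the <code>embed</code> function is a placeholder for whatever embedding model you use, and the scoring is plain cosine similarity.</p><pre><code># Sketch: pick the k most similar past examples for a few-shot prompt.
import numpy as np

def embed(text):
    raise NotImplementedError  # placeholder for your embedding model

def top_k_examples(query, examples, k=3):
    """examples: list of dicts with 'question' and 'answer' keys."""
    q = embed(query)
    scored = []
    for ex in examples:
        e = embed(ex["question"])
        sim = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        scored.append((sim, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]
</code></pre><p>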
This can significantly improve the repeatability of the output from the overall setup by showing examples of how similar problems were handled in the past.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QZy3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QZy3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 424w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 848w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 1272w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QZy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png" width="757" height="340" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:757,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QZy3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 424w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 848w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 1272w, https://substackcdn.com/image/fetch/$s_!QZy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F997708ea-9e30-4ee9-b29a-02b02087cca4_757x340.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Reranking &amp; Post-processing</h3><p>The goal of retrieval is to grab as much potentially relevant information as possible, even if some of it ends up being irrelevant. Some of these results might already be ranked within the assumptions of their respective systems and search parameters sent to them. However, those rankings might not be relevant to the query at hand.</p><p>After retrieving a large pool of relevant document chunks or data points, the reranking stage plays a crucial role in determining which set of chunks or data points are most relevant to the user&#8217;s query. This is a critical step because the language model can only process and utilize a limited context to produce the correct answer.</p><p>In most baseline implementations, reranking could be done using simple semantic ranking. However, more advanced relevance scoring techniques, such as cross-encoders, query likelihood models, or other supervised ranking models, can be employed to improve the accuracy of the reranking process. By using these methods, the reranking stage assigns higher scores to the most pertinent information and lower scores to less relevant or irrelevant data. This helps prioritize and select the most valuable content that will assist the LLM in providing a correct response to the user's query.</p><p>One word of caution here is the following: It is tempting to jam as much info in the context window of the LLM as possible and hope that the LLM attention mechanism can robustly tell what&#8217;s relevant. This doesn&#8217;t necessarily hold esp for very long prompts (see <a href="https://arxiv.org/abs/2307.03172">Lost-in-the-middle problem</a>). So while part of the reason that we do this re-ranking is to cut off irrelevant information, another reason for it is that we can use the score to try a few different arrangements of the information in the prompt. 
You might end up finding that putting the most relevant info at the bottom is better, or alternating between top and bottom, or any other arrangement.&nbsp;</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3WlD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3WlD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 424w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 848w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 1272w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3WlD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png" width="1064" height="1335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1335,&quot;width&quot;:1064,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3WlD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 424w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 848w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 1272w, https://substackcdn.com/image/fetch/$s_!3WlD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd05d7e8c-f11b-4f64-92cf-6cd8cab27c69_1064x1335.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></blockquote><p>At this point, we have retrieved and selected the most relevant texts that can potentially serve as input to the LLM. However, we must ensure that the retrieved information is free of any irrelevant, redundant, or lengthy texts and it fits well within the context limit of the LLM. To achieve this we can post-process the chunks or data. For textual data we can employ two primary methods: summarization and <a href="https://arxiv.org/abs/2311.09210">chain-of-note</a> generation. For other types of data the post-processing might involve <a href="https://github.com/reasoning-machines/pal">doing math on the retrieved info</a> or any other specialized operation that boils down the data to what is strictly needed for the task at hand.&nbsp;</p><h3>Prompt Crafting and Generation</h3><p>Once the content to be fed to the LLM has been determined through the previous modules (documents, summaries, notes, processed data, examples), the next step is to decide how to present this information to the LLM. The way the content is structured and fed into the LLM can significantly impact the quality and coherence of the generated responses.</p><p>There is not a whole lot of science here (definitely not engineering as much as people like to think &#8220;prompt engineering&#8221; is &#8220;engineering&#8221;). But the good news is that you have all the pieces of information (instructions, examples, retrieved and processed data) in separate data elements, so you can try various data templates and order of the info until it gives you the desired output.&nbsp;&nbsp;</p><p>Finally, the crafted prompt is fed into the LLM, which processes this context and produces a final response to the user's query. The standard approach is to generate the output all at once, but more advanced techniques involve interleaving generation and retrieval in an iterative process called <a href="https://arxiv.org/abs/2305.06983">active retrieval</a> or generating multiple outputs and selecting the best.</p><p>Active retrieval generates some output, retrieves more context based on the generated text, and then generates more output, iterating until a complete response is formed. This can help maintain coherence in long-form text generation. 
The decision to retrieve additional context can be based on various factors, such as a fixed number of tokens generated, completion of textual units like sentences or paragraphs, or when the available context is deemed insufficient to continue generating a meaningful response.</p><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-IaN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-IaN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 424w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 848w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 1272w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-IaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png" width="1456" height="548" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61cda573-250b-4d37-962c-c64ca820175b_1464x551.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-IaN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 424w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 848w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 1272w, https://substackcdn.com/image/fetch/$s_!-IaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61cda573-250b-4d37-962c-c64ca820175b_1464x551.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></blockquote><h3>Verification</h3><p>If you think that you have generated a response and you&#8217;re done, I&#8217;m sorry to tell you that you are not. At this stage you did all you could to feed the LLM the right info, but what it has done with it is unpredictable and more often than not, it might hallucinate or get the details wrong. That&#8217;s why it is important to include some modules that verify the details of the output. This could include checking that entities and numbers mentioned are actually present in the context you provided. Or fact checking the generated response sentences against the chunks passed to the LLM. This could get as elaborate as doing linguistic analysis on the output to check if it meets certain requirements.&nbsp;</p><p>What you do in case of &#8220;pass&#8221; or &#8220;fail&#8221; of these verifications depends on the use case and the severity of the issue. Sometimes it might be just appropriate to append a warning to the user. Sometimes you might provide the failure along with some additional info as feedback to the LLM, asking it to try again. Sometimes you could overlay the verified information (eg. fact checking results) as citations inline with the generated text. Sometimes you might have a more elaborate policy about how different verifications should be done and how the system should behave depending on the outcome.&nbsp;</p><h2>Design patterns</h2><p>There aren&#8217;t many well established patterns for how these modules might come together to build your pipeline. The ones discussed above are most probably going to show up in the order presented here. You can see some of the ones presented in various academic papers in the image presented at the beginning of this section.&nbsp;</p><p>Ultimately though, as mentioned previously, the pattern you follow will have to replicate the cognitive process and the business workflow you are integrating into. Once you get deep enough in the weeds of those workflows, it&#8217;s likely to realize that a human executing the cognitive process makes decisions about taking different sets of steps based on previously available information. 
This often leads you into creating routing modules that replicate that decision process with pre-determined configurations of modules in each of the possible decision pathways, making it almost like it has &#8220;agency&#8221;!</p><h3>RAG and Agentic Workflows</h3><p>It is common to also realize that different configurations of the modules are necessary for different types of scenarios your system needs to handle. Of course, we all wish that we had more autonomous agents to handle all this for us. But in the foreseeable future, what people call &#8220;agentic workflows&#8221; is most often routing or configuration selecting modules that robustly select the right pipeline to handle the variance in the user queries.&nbsp;&nbsp;&nbsp;</p><p>What constitutes an &#8220;agent&#8221; and what that implies are beyond the scope of this write up and will be dealt with in a future one. Semantics aside, the goal is still to follow the validation driven development, tackling one performance issue at a time. At some point, you have added all the necessary modules but are still struggling with performance and that&#8217;s where fine-tuning comes in!&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/p/8ed4c5/llm-agents-do-i-need-them-for-my-use-case&quot;,&quot;text&quot;:&quot;Lightning Course on Agents&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/p/8ed4c5/llm-agents-do-i-need-them-for-my-use-case"><span>Lightning Course on Agents</span></a></p><h2>Fine-tuned RAG</h2><p>Another advantage of the modularity is that each module could be its own independent small model (language or otherwise) fine-tuned or trained to handle the subtasks in the best possible way. For example the router could be a simple query classifier that selects the right pipeline based on the last few interactions with the user. Or retrieval and re-ranking modules can be models trained specifically for handling the types of queries at hand and the nuances of relevance based on annotated ranking data. The prompt crafting stage could also be a model trained in a supervised fashion on pairs of input data and desired output LLM generation (see <a href="https://arxiv.org/abs/2205.12548">RLPrompt</a>).&nbsp;</p><h1>Conclusion</h1><p>While a basic vector database and LLM combo might seem like a quick win for RAG implementation, it falls short for most real-world use cases. The workflows that RAG aims to augment are inherently complex, demanding more than just finding and summarizing text. This requires additional modules beyond the core RAG architecture.</p><p>The key to unlocking RAG's true potential lies in its modularity. By meticulously mapping the specific workflow you're trying to enhance and designing the RAG architecture around its individual steps, you gain immense flexibility. This upfront investment in planning can save significant time, money, and frustration down the line. Remember, validation-driven development is paramount. Start with the simplest possible RAG system, identify its edge cases through rigorous evaluation, and tackle them one by one. 
By focusing on these core principles, you can build a robust and adaptable RAG system that truly revolutionizes your workflow.</p><blockquote><p><strong>Acknowledgement:</strong> This article is inspired by the presentations given by <a href="http://linkedin.com/in/piesauce">Suhas Pai</a>, <a href="https://www.linkedin.com/in/ian-yu1/">Ian Yu</a>, <a href="https://www.linkedin.com/in/varghese-nikhil/">Nikhil Varghese</a>, and <a href="https://www.linkedin.com/in/boqi-chen/">Percy Chen</a>. Some of the thought processes presented here are borrowed with permission from the content of Suhas&#8217;s book, <a href="https://learning.oreilly.com/library/view/designing-large-language/9781098150495/">Designing Large Language Models Applications</a>. The early drafts of this article and the accompanying visuals were provided by <a href="https://www.linkedin.com/in/mohsincsv/">Mohsin Iqbal</a>.</p></blockquote><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Levels of Autonomy in Multi-agent LLM Systems]]></title><description><![CDATA[I wonder what the minimum level of necessary infrastructure change is for us to start seeing the beginning of Level 5 autonomous AI systems!? &#129300;]]></description><link>https://aisc.substack.com/p/levels-of-autonomy-in-multi-agent</link><guid isPermaLink="false">https://aisc.substack.com/p/levels-of-autonomy-in-multi-agent</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 13 Mar 2024 14:08:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We have been hearing more and more about multi-agent LLM systems and how they are the next big thing. But how far are we from this next big thing? The answer is less trivial than some social media influencers lead us to believe!</p><p>For context, when I talk about a multi-agent LLM system, I am referring to is a system that leverages a large language model (or another type of reasoning / coordination system) as an orchestration layer sitting on top of a few agents (LLMs or other types of models) with the objective of collaborating to achieve a certain goal. The early examples of this came out nearly a year ago, eg. AutoGPT, where a few powerful LLM instances collaborated to write software with particular characteristics. 
Creating software turned out to be a particularly attractive sandbox because testing if software works or not is easier than evaluating the quality of most other complex tasks humans do.&nbsp;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Very early on we started noticing that these types of systems, while cute on the surface, had significant stability problems and could easily fall into execution loops or repeated actions. This quickly reminded me of another area where we have been seeking autonomous machines for years (decades?): driving from A to B. So, I did a side by side comparison in one of the presentations I gave last year.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IMDP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IMDP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IMDP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png" width="960" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IMDP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 424w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 848w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 1272w, https://substackcdn.com/image/fetch/$s_!IMDP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06be8ef-5eb7-4cb4-b8b5-cbc2ac4dcc0c_960x540.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While it&#8217;s still <a href="https://aisc.substack.com/p/ai-as-judgment-machine">too early to delegate judgment completely to machines</a>, we have been seeing an increasing trend of being able to do that in both autonomous driving and applications of AI to business workflows. Since autonomous driving has been around for longer, people have gotten around the hype and have sat down and carefully imagined how this might pan out and what&#8217;s necessary to get to full autonomy. 
So, why don&#8217;t we get inspiration from that?</p><p>I am not going to repeat the information from the graphics above, but the short of it is that I think we are at Level 3 of autonomy with multi-agent LLM systems, where &#8220;the machine decides with a high level of participation from an expert / operator&#8221;.&nbsp; Why do I think we are here? Well, as one might see in cases of AutoGPT and BabyAGI, the stability was nowhere close to acceptable for use in any sort of serious business workflow. To make it more complex, and this is still true, how to measure whether the system is moving in the right direction was highly non-trivial. In autonomous driving, we use a large number of sensors and cameras to tell the machine where it is and what&#8217;s around it. In autonomous business systems, the equivalent of sensors and cameras could be drastically different from use case to use case. Therefore, from my point of view, the only robust way to leverage them today is to have a human highly involved in the operation.&nbsp;</p><p>So, what do we need to get to Level 4 and Level 5? Great question! And the autonomous driving crowd have done the hard work of figuring that out and have made it available to us for inspiration:</p><p>Level 4, in autonomous driving, requires robust vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication, and of course eventually vehicle-to-everything (V2X). This means that in order for vehicles to operate safely and efficiently, they need to &#8220;talk&#8221; to each other and their surroundings. This allows them to share real-time information, creating a broader picture of the environment. Cars can anticipate actions and optimize traffic flow through V2V, while V2I communication with traffic lights and signs helps them adapt to changing conditions and improve efficiency. The robustness of this communication, meaning its speed, coverage, and security, is critical for reliable and safe operation of Level 4 self-driving cars.</p><p>Well, here is our blueprint: we need to create communication and coordination frameworks amongst all available agents that are robust in their speed, coverage, and security. And of course, that cannot be achieved before we figure out how to, with very high accuracy, tell each individual agent how far off their actions are, how their errors compound, and what they need to do to course correct.&nbsp;</p><p>Now the interesting thing about Level 5 is that it is fairly unlikely to happen without a significant infrastructure overhaul. In other words, being able to have the seamless V2X type of communication and coordination necessary for complete removal of human intervention puts a very high bar on the quality and robustness of the infrastructure. This means that we might have to completely rethink our roads, signs and signals, traffic rules, insurance policies, and many other things.&nbsp;</p><p>This can give us a glimpse into autonomous AI systems at Level 5 as well: we most probably have to completely rethink our corporate structures, business processes and workflows, business models, capital allocation, insurance policies, governance, and many more things. In other words, just the same way that Level 5 autonomous driving is unlikely to take off within the currently available infrastructure, I wonder what the minimum level of necessary infrastructure change is for us to start seeing the beginning of Level 5 autonomous AI systems!? 
&#129300;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IZEQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IZEQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IZEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1b74900-231f-411c-b799-99175ee7135a_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IZEQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!IZEQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b74900-231f-411c-b799-99175ee7135a_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Collective Artificial Intelligence ]]></title><description><![CDATA[... a sub-topic of it that is more specifically focusing on the collective intelligence of human-machine systems is quite interesting!]]></description><link>https://aisc.substack.com/p/collective-artificial-intelligence</link><guid isPermaLink="false">https://aisc.substack.com/p/collective-artificial-intelligence</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 01 Mar 2024 11:07:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_bhP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fed683-7279-4ecb-8f5e-24c651dae985_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Where does the name &#8220;aggregate intellect&#8221; come from anyway?</p><p>5 years ago when I was incorporating my company, I was looking for a fitting name. I didn&#8217;t know what my company would do exactly, but I knew that it would have something to do with &#8220;collective intelligence&#8221; and AI given that it was being born out of the AI community I had built.&nbsp;&nbsp;&nbsp;</p><p>Credit should go to my partner for the name because I was stuck finding a name that abbreviates to a.i. AND means &#8220;collective intelligence&#8221;; and she came up with &#8220;aggregate intellect&#8221;. 
It is a mouthful, but I immediately liked it; it had everything I wanted!</p><p>Now reflecting on the past 5 years, have we gotten closer to the vision I had?</p><p>Some people recently started talking about &#8220;intelligence 3.0&#8221;, meaning superintelligent machines that surpass human capabilities. That&#8217;s a fairly wishy-washy topic for me given where we are and how things are going. But a sub-topic of it that focuses more specifically on the collective intelligence of human-machine systems is quite interesting!</p><p>That is one of the areas that we are very deeply interested in and keenly focused on. Our research is exploring several interesting ideas in that space and we are ramping up to do some interesting experiments with our academic collaborators at McGill University and University of Toronto. The gist of the idea is: what are the design parameters of systems that combine a mixture of human and machine intelligence and use the latter to facilitate effective collaboration and problem solving between all the agents involved?&nbsp;</p><p>Some of the interesting concepts that exist in the space are:</p><ol><li><p><strong>Expert in the loop intelligence:</strong> the essential question here is how to create AI systems, as point solutions, that collaborate with humans by taking care of the mundane tasks and delegating all the important decisions to the right human expert for robust and successful execution. We have been building rudimentary versions of this in the past and we are starting to see more sophisticated systems using LLMs as the linguistic interface with humans. Delegation is often based on thresholding the confidence of the system in handling a task.&nbsp;</p></li><li><p><strong>Multi-agent LLM systems:</strong> the main question here is, what if we had an array of expert models, each of which is particularly good at one task, say time series forecasting, physical simulation, image analysis and generation, and linguistic tasks, and they used LLMs&#8217; ability to communicate via code and data to coordinate and manage task execution and communication of observations? A system like this could take in an objective and start breaking it down, delegating to the right experts, and going through observations, refinement of tasks based on observations, and iterations until the objective is achieved. Most of the existing multi-agent LLM projects are currently focused primarily on LLM-based agents and exclude humans from their execution loop.&nbsp;&nbsp;</p></li><li><p><strong>Mixture of experts (MoEs): </strong>This is a slightly more demanding approach and it&#8217;s rumored to be the architecture of GPT-4. MoEs consist of a gating (routing?) network and a range of expert models that are all trained together. Effectively, through training, the network learns how to delegate subtasks to the right expert models, and how to combine the outputs of those for the execution of the primary task. This is the next natural step beyond what we loved in more traditional ML, aka ensemble methods, except that in the MoE case, not all expert models execute the incoming task, and a more sophisticated routing is learned.&nbsp;</p></li></ol><p>I think the right solution is a combination of all 3 (and potentially other approaches like reinforcement learning). It is unsafe, and honestly unrealistic, to build systems that completely exclude human experts, so the expert-in-the-loop aspect is very important. 
Even more interesting is going beyond the interaction of one expert with the system, and augmenting how humans collaborate with each other and then with the machine. With multi-agent systems, we can go beyond point solutions and mundane tasks and slowly handle more complex scenarios. That will be a huge workflow boost. Using natural / formal language as the agent communication tool is definitely attractive, and important for explainability reasons, but it is most probably very limiting for many scenarios, especially when machine agents are communicating with each other. That&#8217;s where MoEs would come in.</p><p>The final product would be a system that includes humans as operators and quality gatekeepers, multi-agent LLM systems as workflow handlers where explainability is necessary, and MoEs where the most efficient task handling is the priority.</p>
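<p>To make the expert-in-the-loop idea above (point 1) a bit more concrete, here is a minimal sketch of confidence-thresholded delegation. It is illustrative only: <code>handle_request</code>, the model callable, the expert queue, and the threshold value are all hypothetical stand-ins, not references to any specific library or product.</p><pre><code># Minimal sketch: route low-confidence predictions to a human expert.
# All names below are hypothetical; plug in your own model and review queue.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Prediction:
    label: str
    confidence: float  # assumed to be in [0, 1]

def handle_request(
    text: str,
    model: Callable[[str], Prediction],
    send_to_expert: Callable[[str, Prediction], None],
    threshold: float = 0.85,
) -> Optional[str]:
    """Automate the mundane cases; delegate the uncertain ones to a human."""
    pred = model(text)
    if pred.confidence >= threshold:
        return pred.label            # machine handles the task
    send_to_expert(text, pred)       # human expert makes the call
    return None                      # decision deferred

# Toy usage with stub implementations:
if __name__ == "__main__":
    stub_model = lambda t: Prediction("approve", 0.62)
    stub_queue = lambda t, p: print(f"queued for expert review: {t!r} ({p.confidence:.2f})")
    print(handle_request("loan application 123", stub_model, stub_queue))
</code></pre><p>The threshold is the knob that trades automation rate against how much lands on the expert&#8217;s desk; in practice it would be tuned against whatever evaluation data is available rather than picked by hand.</p>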
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_bhP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23fed683-7279-4ecb-8f5e-24c651dae985_1024x1024.png" width="1024" height="1024" alt="" loading="lazy"></figure></div>]]></content:encoded></item><item><title><![CDATA[The Ideal AI Use Case]]></title><description><![CDATA[Ideal AI use cases are problems where &#8220;generation&#8221; is hard, but &#8220;verification&#8221; is easy!]]></description><link>https://aisc.substack.com/p/the-ideal-ai-use-case</link><guid isPermaLink="false">https://aisc.substack.com/p/the-ideal-ai-use-case</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 01 Mar 2024 10:56:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oo8H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1ad80e-353c-453f-968e-9e392d1c20dd_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have been seeing an increasingly large number of use cases that are thrown around for LLMs and Gen AI in general. With the power that these types of models bring to the table, it becomes overwhelming very quickly to know what is worth paying attention to.
So, going back to the basics could be helpful!</p><p>The most obvious example of the rule &#8220;hard generation, easy verification&#8221; is binary supervised learning, where solving the problem of &#8220;is this a cat&#8221; based on the value of the pixels in the image is hard, but looking at the provided label to check that the predicted class is indeed &#8220;cat&#8221; is easy.</p><p>In the context of visual gen AI, crafting an image that conveys a specific message and has a particular feel to it is hard, but once that image is generated, verifying that it has the right elements and visual parameters is easy.</p><p>For LLMs things get more interesting, because especially with multi-agent systems we are talking about scenarios where several models collaborate with each other to achieve a complex objective. This by definition is a complicated problem solving / reasoning / generation process. Now, if the outcome is an easily measurable object, then we&#8217;re in the realm of interesting use cases. For example, imagine a multi-agent system that mines a bunch of scientific papers, generates hypotheses about a new molecule, calls a generative model to come up with a formula for that molecule, etc. Using various physical tests, we can check what the properties of the resultant molecule are and whether they are what we were hoping for. Another example is writing complex software. If we go through all the complex steps and write sophisticated software, we can use various types of software tests to verify that it is indeed what we wanted.</p><p>However, for many other use cases that verification step is not straightforward. For example, let&#8217;s say the task at hand is generating a report that contains lots of numbers and references to various entities and facts. It is likely that there is no simple set of tests one can run to check if the outcome is acceptable. In high-stakes scenarios it might take pretty much redoing the problem from scratch to verify that the outcome is reliable.</p><p>So, how have we dealt with complex verification problems in the past? Well, that&#8217;s where AI explainability comes into the picture. Say I have a sophisticated deep learning model that predicts who qualifies for a loan by taking thousands of elements into account. If a human wants to verify that answer, they would need to &#8230; well, let&#8217;s be honest, that&#8217;s not even possible! So, what did we do? We provided post-hoc explainability methods, and many variations of them, which give us a glimpse into &#8220;what the model might have thought&#8221;.</p><p>In our complex multi-agent scenarios, some variation of that is also necessary. A multi-agent system might be calling many other models, including but not limited to the loan qualification one, and all these models need their own explainability modules. But then, for the rest of the system that deals with natural or formal language, what is needed is a good user experience that verbosely provides the reasoning steps (or at least logs them somewhere to be revealed if needed). Citations in RAG systems are a good example of this kind of thinking. And then the verification of the overall system can be done via a proxy, by looking at the verification of the parts and how they compound.</p><p>So, here is the recipe:</p><ol><li><p>Is the workflow you&#8217;re considering complex? (for most cases the answer is yes)</p></li><li><p>Can you think of an easy way to verify the outcome of the workflow? (chances are the answer is &#8220;no&#8221;, and you have to spend lots of time thinking and working hard on turning this answer into &#8220;yes&#8221;)</p></li><li><p>Now you have a north star metric that can be evaluated quickly with every iteration of the solution for the generation part, and you can repeat until you succeed; a minimal sketch of this loop follows below.</p></li></ol>
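<p>As a rough illustration of this recipe, here is a minimal generate-then-verify loop. It is a sketch under stated assumptions: <code>generate_candidate</code> and <code>verify</code> are hypothetical placeholders for whatever generator (say, an LLM call) and cheap verification check (say, a test suite or a property check) your use case provides.</p><pre><code># Minimal sketch of "hard generation, easy verification":
# keep generating until the cheap verifier (the north star metric) passes.
from typing import Callable, Optional

def generate_until_verified(
    generate_candidate: Callable[[str, int], str],  # hypothetical generator, e.g. an LLM call
    verify: Callable[[str], bool],                  # the cheap, fast verification check
    task: str,
    max_iterations: int = 10,
) -> Optional[str]:
    """Iterate the hard generation step against an easy verification metric."""
    for attempt in range(max_iterations):
        candidate = generate_candidate(task, attempt)
        if verify(candidate):
            return candidate   # verification is the easy part
    return None                # generation never cleared the bar

# Toy usage with stub implementations:
if __name__ == "__main__":
    gen = lambda task, i: f"draft {i} for {task}"
    check = lambda text: "draft 3" in text
    print(generate_until_verified(gen, check, "write a sorting function"))
</code></pre><p>The point of step 2 above is exactly to make the <code>verify</code> step cheap; if verification is as expensive as redoing the work from scratch, this loop buys you nothing.</p>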
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!oo8H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b1ad80e-353c-453f-968e-9e392d1c20dd_1024x1024.png" width="1024" height="1024" alt="" loading="lazy"></figure></div>]]></content:encoded></item><item><title><![CDATA[AI and Loss of Job Description]]></title><description><![CDATA[There is a lot of speculation about what (gen) AI (and esp growth of LLM) means for labor markets.]]></description><link>https://aisc.substack.com/p/ai-and-loss-of-job-description</link><guid isPermaLink="false">https://aisc.substack.com/p/ai-and-loss-of-job-description</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 01 Mar 2024 10:52:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!owrr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e78df7b-4072-4aab-a69f-385c4cdcdcd4_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#8220;AI is more likely to lead to loss of <em>job descriptions</em> than loss of jobs&#8221;</p><p>There is a lot of speculation about what (gen) AI (and especially the growth of LLMs) means for labor markets. I heard the above statement on a podcast, and it most concisely captures how I think about it.</p><p>&#8220;Data science&#8221; was a job title that came to the forefront of our attention recently and almost as quickly went back into the background. What I think happened there is that we were trying hard to jam a bunch of mostly different things that involve analyzing data under one job title. I am sure people are still sticking to that title, even though it doesn&#8217;t quite work, because it&#8217;s still relatively &#8220;cool&#8221;.
But the reality is that neither those who are data scientists nor those who hire them know exactly what that role means, and even their understanding and expectations of it are constantly shifting.</p><p>I&#8217;m bringing this up because it is one of the latest examples of how jobs, especially newer ones, can&#8217;t really be defined by a generic title and are more realistically a subset of a broader set of tasks chosen by a particular team at a particular time. I experienced that closely as a data scientist, and now I&#8217;m going through it again as a founder.</p><p>Now, if we think about &#8220;jobs&#8221; as flexible and fluid sets of tasks, it is kind of obvious that many factors can impact the exact details of that set: business context and priorities, availability of resources, and the introduction of technologies and processes. In other words, jobs are really open, dynamic systems of tasks within the context of their environments, and they evolve to maximize some sort of system-level metric.</p><p>Now, the nature of fluid and flexible things is that they adapt to changes, and it is non-trivial to completely wipe them out (e.g. it is an incredibly difficult and expensive task to suck all the air out of a chamber). So, I&#8217;m a bit confused why people so easily talk about &#8220;loss of jobs&#8221; as if jobs are monolithic things that could be here or not in a binary way. It is completely reasonable to talk about &#8220;loss of job description&#8221; in the sense that one particular subset of tasks is no longer needed and another subset is necessary instead, but thinking that the whole set of tasks would become redundant so quickly that we won't even have time to react is a bit dramatic. And if that happens, it is most probably because we were in denial even though there were obvious leading indicators.</p><p>That said, the rate at which this replacement is happening is much faster than anything we have experienced, and there&#8217;s no clarity around what that might mean at a larger scale. But I think instead of focusing on &#8220;what can go wrong&#8221;, we need to think about &#8220;what can go right&#8221;!</p><p>For example, given that job descriptions are evolving faster than ever, what kind of infrastructure (e.g. education system) is necessary? Or what kind of workforce management / corporate structure is most suitable for a rapidly shifting landscape of job descriptions? And, importantly, at a psychological and social level, what kind of support is necessary to help people find consistent and ongoing &#8220;meaning&#8221; and &#8220;purpose&#8221; even as their job description is constantly changing?</p><p>I think the first step in responsible AI should really be focused on drafting the right set of questions to ask.
Without that we might be barking up the wrong tree!</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!owrr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e78df7b-4072-4aab-a69f-385c4cdcdcd4_1024x1024.png" width="1024" height="1024" alt="" loading="lazy"></figure></div>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[AI as Judgment Machine?]]></title><description><![CDATA[In the realm of AI, a crucial line exists between prediction, the ability to estimate the probability of future outcomes based on data, and judgment, the capacity to assess and evaluate situations.]]></description><link>https://aisc.substack.com/p/ai-as-judgment-machine</link><guid isPermaLink="false">https://aisc.substack.com/p/ai-as-judgment-machine</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Sat, 24 Feb 2024 21:04:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80c4cce0-30bf-4498-9690-be4be968a0ef_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the past year or so, we have been seeing an increasing rate of AI adoption in interesting use cases across the industry. With the remarkable power of large language models added to the toolbox of AI engineers, we are seeing more and more applications that blur the boundaries of work done by human versus machine. As we see adoption in deeper layers of the society, and as these applications become more sophisticated, it is becoming harder, especially for non-expert users, to decipher what the machine is actually doing.&nbsp;</p><p>This lack of clarity, coupled by rushed and careless user experiences designed and built by engineers and product people could result in unintended consequences for the ever increasing areas of human-machine interaction. Imagine the outcome of a not-so-techy grandma interpreting the next-word-<strong>prediction</strong> output of ChatGPT as medical <strong>judgment</strong>.&nbsp;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
<p>In the realm of AI, a crucial line exists between prediction, the ability to estimate the probability of future outcomes based on data, and judgment, the capacity to assess and evaluate situations with understanding and values. While AI excels at prediction, offering next-word suggestions or analyzing medical scans, it currently lacks the human ability to judge.</p><p>Unlike humans, AI's predictions stem from complex algorithms analyzing vast data, not genuine reasoning and understanding. While it can churn through information and identify patterns, it lacks the ability to reason about cause and effect, consider alternative scenarios, or adapt to unforeseen situations. This is due to the absence of a true world model, a comprehensive internal representation of the world we inhabit. Without this, AI cannot grasp the nuances of context, emotions, or social cues, crucial elements for making sound judgments. Furthermore, AI's training data often reflects human biases, leading to decisions devoid of ethical, moral, or even safety considerations.</p><p><strong>&#8220;But AI judges what I like to watch on Netflix!&#8221;</strong></p><p>It is so easy to think that recommender systems, like the ones Netflix and TikTok use, generative co-pilots, like ChatGPT or Gemini, and self-driving cars are making decisions. In reality, they are statistical machines making predictions about the most likely thing to happen next. Because of the specifics of these use cases, a significant amount of effort goes into making the most likely predictions so good that it is easy to forget that what you see is just one of several predicted outputs, with a certain probability of resembling an average human judgment.</p><p>In the case of a recommender system, it seems that the AI is judging your taste and recommending something you'll enjoy. In reality, the algo analyzes your past viewing history, demographics, and similar users' preferences to predict how long you will spend consuming a piece of content. While it considers your past choices, it doesn't understand your actual enjoyment or your deeper reasons for watching specific movies.</p><p>It might seem that the generative chatbot is understanding your problem and offering personalized solutions. In fact, it is following pre-programmed decision trees and predicting the next words in the conversation. It doesn't truly understand your intent or the nuances of your problem.</p><p>And a self-driving car might seem to &#8220;decide&#8221; to change lanes, &#8220;choose&#8221; to yield to the pedestrian, or &#8220;select&#8221; the right exit on the highway. In this case, the car uses its sensors (cameras, radar, LiDAR) to gather data about its environment, including other vehicles, pedestrians, and road markings, and feeds that into complex algorithms that analyze the situation and predict possible future outcomes based on past experience and training data.
It uses all of that data to predict the most likely action that would have been taken by a human operator.</p><p>In each of these cases, interpreting the most probable output as a judgment, without carefully assessing the assumptions, biases, and limitations in the training data, could lead to unintended outcomes.</p><p><strong>What now?</strong></p><p>The number of areas where we have enough data, sophisticated algorithms, and human oversight to let the most likely output pass as judgment is definitely increasing. However, doing so blindly, without the necessary ethical, responsible, and safety measures in place, might result in reputational risk, miscalculated investments, or opportunity cost.</p><p>In the meantime, the safest ground is to treat AI as a prediction machine, to frame use cases as predictive or prescriptive analysis, and to build the algorithms into user experiences that provide checks and balances, including human oversight, for robust and reliable performance.</p><p>Well, putting the nerd talk aside: if your workflow contains complex decisions, build well-designed, well-behaved co-pilots, not tools that try to do your work.</p>]]></content:encoded></item></channel></rss>