How to Properly Craft an AI Product, or What Remains When the Model Catches Up

On building something the baseline cannot replicate.

Most AI products being built right now are built on borrowed ground.

Not because the teams are careless, or the technology is insufficient, but because of a structural reality that is easy to overlook when the demos are working and the metrics are moving in the right direction. The foundation models that power these products, the large language models that generate text, answer questions, write code, summarize documents, are not proprietary to the companies using them. They are shared infrastructure. The same GPT, the same Claude, the same open-weight Llama that your product depends on is equally available to every competitor who woke up this morning with a similar idea. The baseline keeps rising, and it rises for everyone simultaneously.

This is the condition that defines the current moment in AI product development, and it makes the question of defensibility more urgent and more specific than most teams realize before it is too late.

There are exceptions worth naming, because they clarify the frame rather than contradict it. If your problem is narrow enough to be solved with a gradient boosted tree, or a fine-tuned convolutional network, or a time-series model, you do not need a language model and you should not use one. Classical ML remains the right tool for fraud detection on tabular data, demand forecasting, recommendation ranking, anomaly detection on sensor streams. These are real problems with real, well-understood solutions that do not require the overhead of a billion-parameter transformer. But for anything that involves language, reasoning, generation, or what the industry has taken to calling "intelligence," you are almost certainly building on top of a foundation model, and you are almost certainly building on top of the same one your competitors are.

The reason you cannot simply train your own is not a matter of ambition. It is a matter of arithmetic. Training a large language model from scratch requires clusters of specialized hardware running for months, research teams that took years to assemble, and costs that can reach into the hundreds of millions before the model is capable of anything commercially useful. The organizations that can do this at the frontier are countable: OpenAI, Google, Anthropic, Meta, and a handful of others with comparable resources and institutional patience, or state backing. For everyone else, the foundation model is given. You are not choosing whether to use someone else's infrastructure. You are choosing whose, and how.

What this means is that the technology itself cannot be your moat. The model is shared. The capability is shared. The improvements are shared. Whatever separates a product that survives from one that quietly becomes unnecessary will not be found at the model layer.

It will be found in the data.

I. The Failure That Looks Like Progress

There is a moment in the life of most AI products that does not announce itself as failure, which is part of why it is so effective.

Nothing crashes. The metrics, at least the ones you thought to measure, remain within acceptable ranges. The system continues to produce outputs that are locally convincing. If you demo it, it demos well. If you try it yourself, it works often enough to sustain belief.

And yet the product, in a way that is difficult to quantify and therefore easy to ignore, has begun to lose its reason for existing.

What was once an edge now feels like a slightly improved default. What required explanation before now requires justification. The difference is still there, but it has become rhetorical, which is to say it lives in how you talk about the product rather than in what the product does.

This is not a technical failure. It is a structural one. And it usually originates at a point that felt, at the time, like progress.

The structure that failed is not the architecture, or the prompts, or the choice of model. What failed is the absence of any data advantage that is specifically the product’s own. The system was built on top of a shared baseline and improved only as fast as that baseline improved. When the baseline caught up, there was nothing underneath.

II. The Pipeline That Runs Backwards

The traditional response to this problem is to invest more in the model. Better prompts. More fine-tuning. A newer API. But this misidentifies where the advantage needs to come from. You cannot outpace a baseline you are standing on by optimizing your relationship to it.

The answer is to build something the baseline cannot absorb. And that requires a different understanding of what you are building and in what order.

For most of machine learning’s history, the workflow ran in one direction. You gathered data. You trained a model. Then you built a product. This made sense when models were scarce and specialized, when training required institutional resources, and when the only way to get a capable system was to construct one from scratch.

That order is now backwards.

The modern approach inverts it: you build the product first. You take the best available foundation model, push it as far as prompting allows, and ship something. This is not a shortcut. It is the correct sequence given the current state of the technology, because the most valuable thing you can accumulate, once the models are commodities, is not a better model. It is data that reflects how real users interact with a real system solving a real problem.

And you cannot accumulate that data before the product exists.

This inversion has consequences that are easy to state and surprisingly difficult to internalize. The product is not the end of the process. It is the beginning of the data collection process. The moment users interact with your system, they begin producing something more durable than revenue: signal. A user edits an output. A user regenerates an answer three times before accepting one. A user abandons a workflow halfway through and rephrases the request. Each of these is a labeled example of failure or preference, produced by a real person solving a real problem in your specific domain. At scale, these signals become a dataset that no foundation model provider can sell you and no competitor can scrape. This is the data flywheel: the product improves because people use it, and people use it because it improves, and each turn of the wheel deepens a gap that the shared baseline cannot close.

The entire technical sequence that follows is in service of one thing: getting the product live quickly enough that the flywheel starts spinning, and then ensuring the system is designed to capture and use what it generates.

There is, however, an important qualification. The above assumes you are starting from nothing. Many organizations are not.

If you have been operating in a domain for years, you may already possess the data that constitutes the moat. Ten years of customer interactions in a niche accounting software. A decade of medical records from a specific patient population. Proprietary financial contracts, support transcripts, internal case histories. Data that is specific enough, and deep enough, that no foundation model was ever trained on anything like it.

In that case the calculus shifts. You are not building first in order to generate data. You are building first because you already have the data and need a product through which to activate it. The flywheel, in this version, starts spinning from day one because the data already exists. The question is not how to accumulate it but how to use it effectively: through retrieval, through fine-tuning, through a system that surfaces what your data knows in response to what your users ask.

And consider what this means for the improving baseline. Foundation models will continue to get better. Each new release is more capable than the last. The general knowledge these models carry will expand. But the chance that any future general-purpose model will understand your specific customer base, the edge cases your industry has accumulated over a decade, the implicit patterns buried in your operational history, is essentially zero. General capability and domain depth are not the same thing. The baseline rises, but it rises evenly across all domains. Your ten years of domain-specific data is a cliff the baseline cannot scale by getting taller.

Whether you are building the flywheel from scratch or activating one that already exists, the question is the same: is your product generating advantage that is specifically yours, or is it simply benefiting from improvements that everyone else receives at the same time? There is a test for this, and it is both simple and slightly cruel.

Imagine the underlying model improves significantly overnight. Not incrementally. Meaningfully.

Does your product become disproportionately better?

Or does it merely remain competitive, or simply stop making commercial sense?

If the answer is the latter, your system is not generating its own advantage. It is co-evolving with a baseline that improves for everyone simultaneously. This is the structural failure, and it does not require a trivial product to happen. It only requires that your improvements are not your own.

III. Start With the Prompt

So the goal is to get something in front of users as quickly as possible, and the way you do that is by starting with the simplest possible thing. A base model and a prompt explicit enough to remove ambiguity without attempting to be clever. Zero-shot, meaning no examples, just instructions. The goal is not elegance. It is to establish a baseline that fails in ways you can observe and name, and to reach the point where real users are generating real signal.

If zero-shot is insufficient, you add examples directly into the prompt. Anywhere from one to fifty, depending on what the task demands. This is cheap, and the information it gives you about where the system is failing is disproportionately valuable. You push this until it fails in a new way, or until fitting enough examples into the context window becomes prohibitively expensive in latency or cost.
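The progression from zero-shot to few-shot is, mechanically, just prompt assembly. A minimal sketch, using a hypothetical ticket-summarization task; the function and the example strings are illustrative, not a prescribed format:

```python
def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt: instruction first, then optional in-context examples,
    then the actual input. With an empty example list this is zero-shot."""
    parts = [instruction.strip()]
    for inp, out in examples:  # few-shot: each example is an (input, output) pair
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Zero-shot: instructions only, no examples.
zero_shot = build_prompt("Summarize the ticket in one sentence.", [], "Printer jams on tray 2.")

# Few-shot: the same instruction plus worked examples drawn from real usage.
few_shot = build_prompt(
    "Summarize the ticket in one sentence.",
    [("VPN drops every 10 min.", "User reports intermittent VPN disconnects.")],
    "Printer jams on tray 2.",
)
```

The point of keeping this stage trivial is that every failure you observe is attributable to the model and the instruction, not to your scaffolding.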

Only then do you ask a more fundamental question: what kind of failure is this?

There are failures of information. The model does not know something. It lacks access to data required for correctness, and compensates by generating plausible alternatives that are often wrong. The knowledge is stale. The domain is too specialized. The facts belong to your organization and were never in the training set.

There are failures of behavior. The model has access to the necessary information but does not use it correctly. It formats output incorrectly. It ignores constraints. It misinterprets the structure of what is being asked. The problem is not what the model knows. It is how it responds.

This distinction is small. The consequences of ignoring it are large.

IV. Facts and Form

Information failures have one remedy: change what the model sees.

You introduce retrieval. You connect the model to external data sources so that when a question arrives, the system first retrieves relevant context and generates against it. This is retrieval-augmented generation (RAG), and it is the correct tool for facts the model cannot be expected to carry internally, for knowledge that changes, for information that belongs specifically to your organization.

You begin with simple retrieval, term-based search, before moving to semantic retrieval using dense embeddings. The reason is that complexity at the retrieval layer introduces new failure modes that are distinct from and sometimes worse than the ones you were trying to fix.

Context must be chunked to fit into a fixed window, which means meaning is segmented in ways that may not align with how the information was originally structured. Retrieval must balance semantic similarity with exact matching, because embeddings are good at capturing intent and bad at preserving specific tokens that matter disproportionately, like identifiers, error codes, or product names. Latency increases because generation is now preceded by search. Systems that felt immediate begin to feel sequential.

Each mitigation, overlapping chunks to preserve boundary context, hybrid retrieval combining dense and sparse signals, reranking to surface only the highest-quality documents, reduces one class of error while introducing another constraint. You are not solving a problem. You are trading one set of problems for a better one.
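The two mitigations that matter most, overlapping chunks and hybrid scoring, can be sketched in a few lines. The dense scorer below is a stand-in for a real embedding model, and the documents and weights are illustrative:

```python
def chunk_with_overlap(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows, so meaning cut at one
    chunk boundary survives intact in the neighbouring chunk."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def sparse_score(query: str, doc: str) -> float:
    """Term overlap: crude, but it preserves exact tokens like IDs and error codes."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_score(query: str, doc: str, dense_fn, alpha: float = 0.5) -> float:
    """Blend dense similarity (captures intent) with sparse overlap (exact tokens)."""
    return alpha * dense_fn(query, doc) + (1 - alpha) * sparse_score(query, doc)

# Stand-in for embedding similarity; any callable (query, doc) -> [0, 1] fits here.
fake_dense = lambda q, d: 0.9 if "refund" in d else 0.1

docs = ["error E-1042 refund failed for order 77", "shipping policy overview"]
ranked = sorted(docs, key=lambda d: hybrid_score("error E-1042", d, fake_dense), reverse=True)
```

Notice that the identifier `E-1042` contributes nothing to the dense score and everything to the sparse one, which is exactly the asymmetry hybrid retrieval exists to balance.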

Behavioral failures have a different remedy: change the model itself.

Fine-tuning becomes relevant when prompting and retrieval have been exhausted and the model still fails to conform to required structure or tone. The process looks simple from the outside. You provide examples. You adjust weights. The model improves on the task.

What is less obvious is that you are also narrowing the model’s distribution. You are biasing it toward your domain, which improves performance locally and can degrade it globally. A model fine-tuned for a highly specific task can become worse at everything adjacent to it. Fine-tuned models age. New base models appear. The relative advantage shifts and you are now maintaining a system that must be continuously re-evaluated against the improving frontier.

The rule of thumb that survives this complexity: fine-tuning is for form, RAG is for facts. If the model has both information and behavioral failures, start with retrieval. It is faster to implement, easier to maintain, and it often resolves more than expected. Combine the two only after pushing each to its limit individually.
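When fine-tuning for form does become necessary, the training set is usually assembled from corrections users already produced. A sketch of one captured correction becoming one JSONL training line; the chat-style "messages" layout shown here is an assumption modeled on common provider formats, so check your provider's documentation before relying on it:

```python
import json

def to_finetune_record(user_input: str, bad_output: str, corrected_output: str) -> str:
    """Turn a captured correction into one JSONL training line.
    The rejected output is deliberately dropped: supervised fine-tuning
    trains only on the corrected target."""
    record = {
        "messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": corrected_output},
        ]
    }
    return json.dumps(record)

line = to_finetune_record(
    "Summarize ticket #88",
    "The ticket is about a printer, and also...",        # model's verbose draft
    "Printer jams on tray 2; hardware check requested.",  # user's edit, the target
)
```

One such line is noise; ten thousand of them are a description of the form your domain requires.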

And notice what each of these steps generates as a side effect. Every retrieval miss is a signal about what your data pipeline is missing. Every fine-tuning correction is a labeled example of the gap between what the model does and what your domain requires. The process of closing failures is also the process of collecting the training data that closes the next round of failures. This is the flywheel made concrete: not a strategy you adopt later, but a property you build into the system from the start.

V. When a Single Step Is Not Enough

There is a third class of problem that surfaces when single-step generation is no longer adequate for the complexity of the task.

Some problems require planning, sequential execution, interaction with external systems, the ability to check and correct prior outputs.

For these you introduce agents: systems where the model can call tools, branch conditionally, loop until a condition is met, or delegate work to specialized sub-models. A router that classifies intent before dispatching. A validator that checks a generator’s output before it reaches the user. A multi-step pipeline that can write to a database, execute code, or send a message.

You move to this layer only when single-step generation demonstrably fails on the task, when the problem requires write actions against external systems, or when one model has become too slow, too expensive, or too error-prone to handle the full workflow. Agents introduce non-determinism, latency, and failure modes that are harder to trace than simple generation errors. The complexity is justified by the task, not by architectural preference.
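The router-validator pattern can be sketched without any real model behind it. The `route`, `generate`, and `validate` callables below are stand-ins for model and tool calls; the loop structure, with its bounded retries and feedback injection, is the point:

```python
def run_agent(task: str, route, generate, validate, max_steps: int = 3):
    """Dispatch a task to a tool, then loop: generate, validate, retry.
    Stops when validation passes or the step budget is exhausted, so a
    misbehaving model cannot loop forever."""
    tool = route(task)  # classify intent -> tool name
    for attempt in range(max_steps):
        output = generate(tool, task)
        ok, feedback = validate(output)
        if ok:
            return {"tool": tool, "output": output, "attempts": attempt + 1}
        task = f"{task}\n[validator feedback: {feedback}]"  # feed the failure back
    return {"tool": tool, "output": None, "attempts": max_steps}

# Stub components for illustration: first generation fails validation, second passes.
route = lambda t: "sql" if "report" in t else "chat"
calls = []
def generate(tool, task):
    calls.append(tool)
    return "SELECT *" if len(calls) > 1 else "DROP TABLE"
validate = lambda out: (out.startswith("SELECT"), "read-only queries only")

result = run_agent("monthly report", route, generate, validate)
```

The step budget and the validator are not decoration. They are the difference between an agent and an unbounded process with write access.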

It is also worth noting that agentic interactions generate richer signal than single-turn exchanges. A user who abandons a multi-step workflow mid-execution is telling you something more specific than a user who regenerates a single response. Tool call failures, replanning events, steps that get corrected and retried, these are high-resolution failure data. The more complex your system, the more granular the feedback it can produce, provided you are capturing it.

VI. The Only Thing That Compounds

The signals described above only matter if you are actually capturing them. Most products are not. Most products treat user interactions as ephemeral, something that happens and then disappears into logs no one reads. This is the gap between a system that is riding the shared baseline and one that is building above it.

So what does capture actually mean in practice.

A user edits a generated response. You store both versions: the original output and the corrected one. This is a preference pair, a direct label of what the model should have produced.

A user regenerates an answer multiple times before accepting one. You record the sequence. The rejected outputs are the losing examples. The accepted output is the winner. This is ranking data, and it trains the model to align with your users’ actual preferences rather than the generic preferences encoded in pretraining.

A user abandons a workflow midway. You log the point of abandonment and the context that preceded it. This is a failure case with temporal and structural information, which is the most useful kind because it tells you not just that something went wrong but where in the chain it went wrong.

A user rephrases a query after an unsatisfying response. The original query, the failed response, and the rephrased query together form a correction signal: here is what the user wanted, here is what the model produced, here is how far apart they were.

Individually these signals are small and noisy. Aggregated across thousands of sessions, they describe the actual shape of the problem as your users experience it. That description cannot be purchased from a foundation model provider. It cannot be approximated by scaling pretraining data. It exists only inside your system, generated only because your product exists and people use it.

This is what compounds. Not the model. Not the prompts. The accumulated record of real usage, fed back into a system that improves because of it. Each iteration makes the product slightly more aligned with your domain. Each improvement attracts more usage. Each round of usage generates better data for the next iteration.

If your product is not designed to capture and use these signals, it is not building a moat. It is waiting for the tide to rise.

VII. Which Constraints You Want First

Somewhere along this path, the system becomes entangled with infrastructure decisions that were initially framed as implementation details but turn out to be something more consequential: decisions about how much control you have over the data strategy you just committed to.

APIs provide immediate access to high-performance models. They abstract away scaling, batching, hardware. They also introduce dependencies that are not always visible at the outset.

Costs scale linearly with usage. Behavior can change without explicit versioning. A model update can alter outputs in ways that are subtle enough to pass initial checks and significant enough to affect downstream tasks. The pricing is manageable. The unpredictability is the actual problem.

Self-hosting open-weight models inverts this.

You gain control over the model and its behavior. You can freeze versions, inspect internals, and optimize inference through quantization, which reduces numerical precision to cut memory footprint and increase throughput, through continuous batching, which processes requests without waiting for the longest one to finish, through speculative decoding, which uses a smaller draft model to propose tokens that the larger model verifies in parallel.

You also inherit the full complexity of making this work.

Inference is not a single operation. It is a pipeline with distinct bottlenecks. The prefill step, processing input tokens, is compute-bound. The decode step, generating output tokens, is memory bandwidth-bound. The KV cache grows with sequence length and batch size. GPU utilization depends on scheduling decisions that are not obvious and not forgiving.

At small scale, APIs are the rational choice. At larger scale, the economics and control of self-hosting become difficult to ignore. The decision is not about correctness. It is about when you choose to encounter which constraints.

It is also, less obviously, a decision about whether you can act on your own data. Fine-tuning on proprietary interaction signals, freezing a model version to run controlled experiments, modifying the inference pipeline to reduce latency on your specific workload: all of these require a degree of control that a commercial API does not provide. The infrastructure decision and the data strategy are not separate concerns. They are the same concern, approached from different directions.

VIII. No Stack Trace

All of this would be manageable if systems failed in ways that forced attention.

They do not.

They drift.

A prompt is modified. A model is updated upstream without announcement. Users adapt their inputs based on prior interactions, which shifts the distribution of what the system receives. Over time, the gap between what you built and what the system is doing widens without triggering any alert.

The system does not break. It becomes slightly worse in ways that are difficult to attribute.

This is one of the more insidious properties of systems built on probabilistic components. There is no stack trace for drift. There is no exception to catch. There is only slow erosion.

The only reliable defense is to treat evaluation as a first-class component of the system, built alongside the product rather than added afterward.

You define what constitutes a good output before you build. You construct datasets that reflect real usage and known failure modes. You tie AI metrics to actual business outcomes so that a score on an evaluation dataset maps to something observable in production. You slice your evaluation data to expose failures that aggregate metrics hide. You keep the evaluation pipeline stable even as everything else changes, because a moving baseline makes degradation invisible.

And you monitor in production. Not just latency and cost. Conversational signals: how often users rephrase after a failure, how often they abandon mid-session, how often the model refuses a valid request. These are the metrics that tell you whether the system is working in the way that matters, which is not in the way your benchmark measures but in the way users experience it.

Evaluation is also how you verify that the flywheel is actually turning. The conversational signals you monitor in production are not just system health metrics. They are evidence of whether the data you are capturing is making the product meaningfully better over time. If users are still abandoning at the same rate after three iterations, the flywheel is not spinning. It is stalling, and you need to know that before you spend another quarter on training runs.

Without this, you are navigating by intuition. Which scales poorly.

IX. What Actually Stays

So what does it mean to build something that persists in this environment.

It means accepting that you are not competing on access to models. That access is already commoditized.

It means recognizing that improvements in base models are shared by default, and that a product whose value rises and falls with the model version has no defensibility that matters.

And it means focusing on the only layer where differentiation accumulates rather than dissipates.

Systems that improve because they are used. Systems that capture interaction, structure it, and feed it back into themselves. Systems whose performance reflects not just the capabilities of a model, but the history of their own usage, the specific shape of the problem as real users have revealed it over time.

Everything else is transient.

And the difficulty is that transient systems can look, for a while, exactly like the real thing.