Playbook · 24pp

The SD Operator's AI Playbook

Practical frameworks for introducing AI into a non-AI company — written for San Diego operators.

AI San Diego

Most San Diego companies don’t need an AI strategy. They need an AI implementation — the boring middle work between “we should probably be doing something with AI” and “the new feature is shipping on Tuesday.” This playbook is about that middle.

It is written for the operator: the VP of product, the director of engineering, the founder of a 30-person company, the COO who’s been told by the board to “figure out AI.” You don’t need a research team. You need a way to turn vendor demos into shipped features without lighting a year of budget on fire.

Part 1 — Before you build anything

Find a problem, not a technology

The single biggest mistake we see is a team that starts with “where should we use LLMs?” and works backwards. That question has no answer. The better question is: where in our business do we currently pay a human to read unstructured text and produce a structured decision?

Write down every such workflow in your company. Sort by volume × latency tolerance. Your highest-leverage AI project is at the top of that list — usually triage, routing, summarization, or first-draft generation. It is almost never customer-facing chat, and it is almost never “replace this team.”
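A minimal sketch of that ranking, assuming you can estimate monthly volume and how many hours a result can wait for each workflow. The `Workflow` class, the example workflows, and their numbers are hypothetical placeholders, not data from real companies.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    monthly_volume: int             # items a human reads per month (estimate)
    latency_tolerance_hours: float  # how long a result can wait (estimate)

    @property
    def score(self) -> float:
        # Volume x latency tolerance, as described in the text above.
        return self.monthly_volume * self.latency_tolerance_hours


def rank_workflows(workflows):
    """Return workflows sorted from highest to lowest score."""
    return sorted(workflows, key=lambda w: w.score, reverse=True)


# Hypothetical inventory for illustration only.
workflows = [
    Workflow("support ticket triage", 4000, 4.0),
    Workflow("contract clause review", 120, 48.0),
    Workflow("live sales chat", 9000, 0.01),
]

for w in rank_workflows(workflows):
    print(f"{w.name}: {w.score:.0f}")
```

With these made-up numbers, triage lands at the top and live chat at the bottom, because chat has almost no latency tolerance even though its volume is high.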

Make the ROI case in one paragraph

If you cannot write the ROI case in a paragraph, the project is not ready. The paragraph should cover:

Who currently does this work and how much of their time it takes.
What it costs the business per month.
What fraction an AI system could plausibly handle end-to-end.
What the quality floor has to be for the handoff not to create more work than it saves.

If the math only works at 95%+ automation and your realistic target is 60%, kill the project before you start.
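The arithmetic behind that paragraph can be sketched in a few lines. This is a simplified model, not a full business case: the function names, the `review_overhead_fraction` parameter (human time spent checking AI output), and the dollar figures are all assumptions for illustration.

```python
def monthly_net_savings(human_cost_per_month, automation_fraction,
                        ai_cost_per_month, review_overhead_fraction=0.0):
    """Net monthly savings in dollars under a simple linear model.

    automation_fraction: share of the work the AI handles end-to-end (0..1).
    review_overhead_fraction: share of the original human cost that is now
    spent reviewing or fixing AI output (an assumption of this sketch).
    """
    gross_savings = human_cost_per_month * automation_fraction
    review_cost = human_cost_per_month * review_overhead_fraction
    return gross_savings - review_cost - ai_cost_per_month


def break_even_fraction(human_cost_per_month, ai_cost_per_month,
                        review_overhead_fraction=0.0):
    """Automation fraction at which net savings are exactly zero."""
    return ai_cost_per_month / human_cost_per_month + review_overhead_fraction


# Hypothetical numbers: $20k/month of human time, $6k/month to run the AI
# system, 10% of the original cost spent on review.
needed = break_even_fraction(20_000, 6_000, 0.10)
realistic = 0.60
print(f"break-even at {needed:.0%}, realistic target {realistic:.0%}")
print(f"net savings: ${monthly_net_savings(20_000, realistic, 6_000, 0.10):,.0f}")
```

If `break_even_fraction` comes out above your realistic automation target, that is the signal to stop before you start.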

Decide early: buy, build, or wrap

Three roads:

Buy. Use a vertical SaaS product that has already solved your problem with AI behind the scenes. Cheapest, fastest, least flexible. Good when the problem is generic.
Wrap. Build a thin layer on top of OpenAI / Anthropic / Gemini APIs. Most SD companies should do this for 80% of their AI work. Fast, cheap, and your data stays yours.
Build. Train or fine-tune your own model. Almost never the right answer for a non-AI company in 2026. Revisit only when evals on a frontier model have been hitting a ceiling for 6+ months.

The decision is mostly obvious once you write down the question honestly. The mistake is teams that pick “build” because it sounds more serious, then spend 9 months discovering the frontier-model wrap would have been 85% as good for 2% of the cost.

Part 2 — Vendor selection

The SD AI vendor market in 2026 is a hall of mirrors. Every SaaS company has bolted an “AI” module onto their roadmap. Some are real; most are demos with a pricing page.

The 5-question vendor screen

Before you take a second meeting with any AI vendor, ask these five questions. Answers should be specific, not aspirational.

“Show me a production customer using this feature on data similar to ours.” Not a logo. A customer. If they can’t connect you, the feature is not production-ready.
“How do you evaluate quality? Can I see your eval set?” Vendors without evals have no idea whether their product works. Avoid.
“What model do you use, and what happens when it is deprecated?” Any vendor who can’t answer this will be broken in 12 months.
“What’s your per-request latency at p50 and p95, on our expected payload size?” Demo latency is not production latency.
“If we churn, do we get our data back and is anything trained on it?” Non-negotiable. Get it in writing.

Most vendors fail two or more of these questions. That’s fine — you’ve just saved yourself 6 weeks of procurement.

The “POC in a week” rule

If a vendor cannot spin up a proof-of-concept against your actual data in a week, they are not ready for production. AI vendors that require months of onboarding are selling you the onboarding, not the product.

Part 3 — Evals are the product

The single biggest operational lesson in the last 3 years of applied AI: if you don’t have evals, you don’t have a product. You have a demo that works on the cases you happened to try.

The minimum viable eval set

For any AI feature going to production, build an eval set of at least 100 examples with:

The input (real customer data, scrubbed of PII).
The “golden” output, written by a domain expert in your company.
A grading rubric (exact match, structured fields, or LLM-as-judge with a specific prompt).

You will update this set every month for the life of the feature. It is the single most valuable artifact your AI team owns.
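One way to represent such a set, sketched under assumptions: the `EvalExample` fields, the two graders, and the stub model are hypothetical, and an LLM-as-judge grader is left out because it depends on your provider. The golden outputs here are structured as dictionaries, which suits triage and routing tasks.

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    example_id: str
    input_text: str  # real customer data with PII already scrubbed
    golden: dict     # expected output written by a domain expert


def grade_exact(output: dict, golden: dict) -> float:
    """1.0 only when the whole output matches the golden output."""
    return 1.0 if output == golden else 0.0


def grade_fields(output: dict, golden: dict) -> float:
    """Fraction of golden fields the output reproduced exactly."""
    if not golden:
        return 1.0
    hits = sum(1 for key, value in golden.items() if output.get(key) == value)
    return hits / len(golden)


def run_evals(examples, predict, grade) -> float:
    """Mean score of `predict` over the eval set using the given grader."""
    scores = [grade(predict(ex.input_text), ex.golden) for ex in examples]
    return sum(scores) / len(scores)


# Two hypothetical examples; a real set would hold at least 100.
examples = [
    EvalExample("t1", "Refund request for a duplicate charge",
                {"queue": "billing", "priority": "high"}),
    EvalExample("t2", "App crashes on login",
                {"queue": "engineering", "priority": "high"}),
]


def stub_predict(text: str) -> dict:
    # Stand-in for a real model call; always answers the same way.
    return {"queue": "billing", "priority": "high"}


print(run_evals(examples, stub_predict, grade_fields))
print(run_evals(examples, stub_predict, grade_exact))
```

Keeping the grader as a plain function makes it easy to swap rubrics per feature while the eval set itself stays in version control next to the prompts.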

What to measure

Task accuracy — does it do the thing right? Measured on your eval set, tracked per release.
Grounding / hallucination rate — how often does it invent facts? Especially critical for summarization and RAG.
Latency — p50 and p95 per request, tracked over time.
Cost per task — dollars per successful task completion. Not dollars per API call.
Escalation rate — if the AI is a triage layer, what fraction of cases does it hand off to a human instead of routing on its own? Lower is better, as long as accuracy holds.
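A rough sketch of how these numbers could be computed from per-request records. The record fields (`latency_s`, `success`, `needed_human`, `cost_usd`) are assumed names for whatever your logging produces, and the percentile uses the simple nearest-rank method. Grounding is omitted because it needs a task-specific grader.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Summarize a non-empty list of per-request records."""
    latencies = [r["latency_s"] for r in records]
    successes = sum(1 for r in records if r["success"])
    handed_off = sum(1 for r in records if r["needed_human"])
    total_cost = sum(r["cost_usd"] for r in records)
    return {
        "task_accuracy": successes / len(records),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        # Dollars per successful task, not per API call.
        "cost_per_successful_task_usd": (
            total_cost / successes if successes else float("inf")
        ),
        # Fraction handed to a human, i.e. 1 minus the fraction
        # routed without human intervention.
        "escalation_rate": handed_off / len(records),
    }


# Hypothetical records for illustration.
records = [
    {"latency_s": 1.0, "success": True, "needed_human": False, "cost_usd": 0.01},
    {"latency_s": 2.0, "success": True, "needed_human": False, "cost_usd": 0.01},
    {"latency_s": 3.0, "success": True, "needed_human": False, "cost_usd": 0.01},
    {"latency_s": 10.0, "success": False, "needed_human": True, "cost_usd": 0.01},
]
print(summarize(records))
```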

The evals-before-features rule

Do not ship a new AI feature that lacks evals. Do not ship a prompt change that has not been run against the eval set. Do not accept a vendor’s claim of improvement unless it was measured on your evals rather than theirs. This rule, more than any other, separates teams that ship reliable AI from teams that don’t.
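In practice this rule can be enforced as a small gate in CI. The sketch below assumes you already have a baseline eval score for the current release and a candidate score for the change; the function name and both thresholds are illustrative choices, not recommended values.

```python
def eval_gate(baseline_score, candidate_score,
              min_score=0.90, max_regression=0.01):
    """Decide whether a prompt or model change may ship.

    Returns (ok, reason). Both thresholds are assumptions of this sketch
    and should be set per feature.
    """
    if candidate_score < min_score:
        return False, (
            f"score {candidate_score:.3f} is below the floor {min_score:.3f}"
        )
    if baseline_score - candidate_score > max_regression:
        return False, (
            f"score regressed from {baseline_score:.3f} "
            f"to {candidate_score:.3f}"
        )
    return True, "ok"


ok, reason = eval_gate(baseline_score=0.93, candidate_score=0.94)
print(ok, reason)
```

A CI job would run the eval set, call a check like this, and fail the build when `ok` is false, so an unmeasured prompt change cannot merge quietly.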

Part 4 — Budgeting and cost

The three cost curves

AI project costs have three stages, and each stage breaks its own budget:

Exploration. Dev tools, API credits, engineer time. ~$5–25k per project. Cheap and fast.
Production. Inference costs start scaling with usage. Infra, logging, evals, monitoring. ~$50–250k/year per shipped feature at mid-size company volume.
Scale. At high enough volume, inference dominates. This is where fine-tuning, caching, distillation, and smaller models become real budget levers.

Most teams underbudget stage 2 by 3x and overbudget stage 3 (because they worry about it before they’ve shipped anything). Correct this early.

The per-request math

For every AI feature, compute:

Cost per request (tokens in × input price + tokens out × output price).
Expected requests per day.
Expected cost at 3x growth.

If the 3x-growth number is scary, you have a design problem. Solve it before scale, not during.
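The three numbers above, worked as a short sketch. Prices are expressed per million tokens, and the token counts, prices, and request volume are hypothetical; substitute your provider's current rates.

```python
def cost_per_request(tokens_in, tokens_out,
                     input_price_per_mtok, output_price_per_mtok):
    """Dollars per request: tokens in x input price + tokens out x output price."""
    return (tokens_in * input_price_per_mtok
            + tokens_out * output_price_per_mtok) / 1_000_000


def monthly_cost(per_request_usd, requests_per_day, growth=1.0, days=30):
    """Projected monthly spend at a given growth multiple."""
    return per_request_usd * requests_per_day * growth * days


# Hypothetical feature: 3,000 input tokens and 500 output tokens per request,
# priced at $3 and $15 per million tokens, 10,000 requests per day.
per_request = cost_per_request(3_000, 500, 3.0, 15.0)
print(f"per request: ${per_request:.4f}")
print(f"monthly today: ${monthly_cost(per_request, 10_000):,.0f}")
print(f"monthly at 3x: ${monthly_cost(per_request, 10_000, growth=3.0):,.0f}")
```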

Caching is free money

The single biggest lever for reducing inference cost on any RAG or agent workflow is prompt caching. Most teams leave it off for 6 months out of “we’ll optimize later” inertia. Turn it on in week one. The savings compound.
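To see why the savings add up, here is a back-of-the-envelope estimate. It assumes a shared prompt prefix (system prompt, retrieved context, tool definitions) that can be cached, and that cached reads are billed at some fraction of the normal input price. The `cached_read_multiplier`, the hit rate, and all numbers are assumptions, and any cache-write surcharge is ignored; check your provider's pricing for real figures.

```python
def input_cost_with_cache(prefix_tokens, suffix_tokens, input_price_per_mtok,
                          hit_rate, cached_read_multiplier):
    """Expected input cost per request in dollars with a cached prefix.

    hit_rate: fraction of requests that find the prefix in cache (0..1).
    cached_read_multiplier: price of a cached token relative to a normal
    input token (hypothetical; varies by provider).
    """
    price = input_price_per_mtok / 1_000_000
    full = (prefix_tokens + suffix_tokens) * price
    cached = (prefix_tokens * cached_read_multiplier + suffix_tokens) * price
    return hit_rate * cached + (1 - hit_rate) * full


# Hypothetical RAG request: 8,000-token shared prefix, 500-token question.
without_cache = input_cost_with_cache(8_000, 500, 3.0, 0.0, 0.1)
with_cache = input_cost_with_cache(8_000, 500, 3.0, 0.9, 0.1)
print(f"no cache:   ${without_cache:.5f} per request")
print(f"with cache: ${with_cache:.5f} per request")
```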

Part 5 — The 10 most common failure modes

After watching ~100 SD AI projects over the last two years, these are the failure modes that show up most often. If your project has any of them, you’re already in trouble.

No eval set. Covered above. This is the #1 cause of AI projects that “work in demo, fail in production.”
No owner. An AI project needs a single owner who wakes up thinking about it. Committees ship nothing.
Scope creep into “agent.” You do not need an agent. You need a good function call. 90% of 2025’s “agent” projects should have been scripts.
Prompt-engineering in a Google Doc. If your prompts live in a shared Doc, they are not code. Move them into version control on day one.
Ignoring latency until launch. Latency is a design constraint. Pick your model and architecture knowing your latency budget.
Picking the most expensive model by default. Frontier models are overkill for 70% of tasks. Start with the cheaper model; move up only when evals demand it.
Shipping without a human-in-the-loop fallback. Every production AI system needs a graceful degradation path. Design it on day one.
No logging. If you can’t replay a failed request with its full prompt and response, you cannot debug anything. Log everything from day one.
Treating AI like traditional software for release management. Model behavior changes. Prompts drift. You need continuous eval in CI, not a one-time sign-off at launch.
“We’ll add guardrails later.” Later is too late. Safety, prompt injection, and PII handling are architectural decisions, not decorations.
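For the logging and version-control items above, a minimal sketch of a replayable request log written as JSON lines. The field names, the `prompt_version` tag, and the model name are hypothetical, and the in-memory buffer stands in for a real log file or log pipeline. Scrub PII before anything is written.

```python
import io
import json
import time
import uuid


def log_request(log_file, prompt_version, model, prompt, response, latency_s):
    """Append one replayable record and return its request id."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt_version": prompt_version,  # e.g. a git tag for the prompt file
        "model": model,
        "prompt": prompt,                  # full prompt, PII already scrubbed
        "response": response,
        "latency_s": latency_s,
    }
    log_file.write(json.dumps(record) + "\n")
    return record["request_id"]


def find_request(log_file, request_id):
    """Scan the log from the start and return the matching record, or None."""
    log_file.seek(0)
    for line in log_file:
        record = json.loads(line)
        if record["request_id"] == request_id:
            return record
    return None


buffer = io.StringIO()  # stand-in for a real log file
request_id = log_request(
    buffer, "triage-v3", "example-model",
    "Classify this ticket: refund for a duplicate charge",
    '{"queue": "billing"}', 0.84,
)
replayable = find_request(buffer, request_id)
print(replayable["prompt_version"], replayable["prompt"])
```

Recording the prompt version alongside the full prompt is what lets you tell a model regression apart from a prompt change when a request fails.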

Part 6 — Shipping your first feature

A staged rollout that works for non-AI companies getting their first AI feature to production:

Week 1–2: scope and eval

Define the one workflow you’re augmenting.
Write the ROI paragraph.
Build the eval set (100 real examples with golden outputs).
Pick buy/wrap/build.

Week 3–4: prototype

Get a working prototype against the eval set.
Hit target accuracy on evals before writing any UI.
Review failure modes with a domain expert.

Week 5–6: integration + UX

Build the thinnest possible UI around the working core.
Add logging, caching, and error handling.
Add a human-in-the-loop fallback path for cases where confidence is low.
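The fallback path can be as simple as a threshold check wrapped around the model call. In this sketch, `classify` is a hypothetical callable that returns a label and a confidence between 0 and 1, and the 0.8 threshold is an arbitrary starting point to tune against your evals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    label: Optional[str]
    handled_by: str  # "ai" or "human"


def route(text, classify, threshold=0.8):
    """Use the AI label when confident; otherwise send the case to a human.

    Errors from the model call also fall back to a human, so an outage
    degrades the feature instead of breaking the workflow.
    """
    try:
        label, confidence = classify(text)
    except Exception:
        return Decision(None, "human")
    if confidence < threshold:
        return Decision(None, "human")
    return Decision(label, "ai")


def confident_stub(text):
    # Stand-in for a real model call.
    return "billing", 0.95


def unsure_stub(text):
    return "billing", 0.40


print(route("refund please", confident_stub))
print(route("refund please", unsure_stub))
```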

Week 7: internal beta

Ship to a small group of internal users. Instrument heavily.
Track task accuracy, latency, and user satisfaction on a daily dashboard.
Iterate the prompt and eval set based on real usage.

Week 8: limited external

Ship to ~5% of users.
Watch the dashboard obsessively.
Roll back at the first sign of regression.
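One common way to implement a percentage rollout is deterministic hashing of the user id, sketched below. The function name and salt are hypothetical. Because each user always lands in the same bucket, raising the percentage later only adds users, and setting it to 0 acts as the rollback switch.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "ai-feature-v1") -> bool:
    """Return True if this user falls inside the rollout percentage (0..100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


# Hypothetical user ids for illustration.
users = [f"user-{i}" for i in range(10_000)]
enabled = sum(in_rollout(u, 5) for u in users)
print(f"{enabled / len(users):.1%} of users see the feature")
```

Changing the salt reshuffles the buckets, which is useful when a second feature should not reuse the same early cohort.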

Week 9+: expand

Grow to 100% over 2–4 weeks, by cohort.
Keep the dashboard live forever. This feature will need monitoring as long as it exists.

How to use this playbook

Pick one workflow this week. Write the ROI paragraph. Build the eval set. Hand this playbook to your team and tell them Part 5 is non-negotiable.

Then in 6 weeks, when your first feature is in internal beta and you want to know whether you’re doing this right, send us what you’ve shipped. We’ll tell you honestly.
