Solutions/AI & Intelligent Automation

AI in production
not on a slide.

We build judgement-class automation for the work that actually moves the business. No demo theatre. No autonomy where it doesn't belong. Always behind a confidence threshold a human signed off on.

First model live: 4–6 wks · behind threshold
Human-in-loop: always · where it matters
Vendor lock-in: 0 · model-portable
classifier · support_intake
READING TICKET #48,221
Inbound · email body
Hi team, I tried to renew my subscription last night but the payment was declined — card is fine elsewhere. Tried a second card, same result. Frustrated — need this resolved today.
→ classification
87 ms
Category
Billing · Failed Payment
Billing · Failed Payment · 94 %
Account · Card Issue · 5 %
Other · 1 %
SENTIMENT · frustrated
URGENCY · high · "today"
THRESHOLD · ≥ 90 % auto
→ routed · billing-tier-2 · drafted reply staged
↳ Live classifier from a client deployment. Tickets below the threshold land in a human queue, not the void.
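The seam in that demo is simple enough to sketch. A minimal illustration of threshold-gated routing; the names (`Prediction`, `route`) are ours, not from a client deployment:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float  # calibrated score, 0.0–1.0

def route(pred: Prediction, threshold: float = 0.90) -> str:
    """Auto-act only at or above the per-decision threshold;
    everything below lands in a human queue, never the void."""
    if pred.confidence >= threshold:
        return f"queue:{pred.label}"
    return "queue:human-review"

route(Prediction("billing-tier-2", 0.94))  # → "queue:billing-tier-2"
route(Prediction("other", 0.41))           # → "queue:human-review"
```

The threshold is a per-decision dial, not a model property: the same classifier can auto-route billing tickets at 90 % and still send every legal-sounding ticket to a human.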
§ 01 / Thesis

Stop benchmarking
models.
Start benchmarking
decisions.

Most AI projects fail not because the model was wrong, but because it was plugged into the wrong decision. The right test isn't accuracy — it's whether the cost-of-being-wrong is acceptable against the time-to-being-right.

Some decisions tolerate a 5% error rate at 200ms. Some decisions don't tolerate any error rate at all. The work isn't picking the model — it's mapping the decision before the model ever runs.
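The arithmetic behind that mapping is not complicated. A toy break-even check, with every number illustrative:

```python
def automation_makes_sense(error_rate: float,
                           cost_per_error: float,
                           volume_per_day: int,
                           human_cost_per_item: float) -> bool:
    """Naive break-even: expected daily cost of model errors
    vs. the human labour the model would replace."""
    expected_error_cost = error_rate * cost_per_error * volume_per_day
    human_cost = human_cost_per_item * volume_per_day
    return expected_error_cost < human_cost

# 5 % errors at €2 each over 1,000 items/day, vs €0.50 of human time per item
automation_makes_sense(0.05, 2.0, 1000, 0.50)  # → True (€100 < €500)
```

The real version weighs tail risk, not just the average, which is why high-cost decisions stay with humans regardless of what this inequality says.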

What we will build
  • Classifiers with confidence thresholds
  • RAG over your own structured + unstructured data
  • Extraction from messy documents at scale
  • Drafting that a human reviews before sending
What we won't
  • Autonomous agents acting on irreversible decisions
  • Chatbots replacing humans on regulated work
  • "AI features" added for the press release
  • Pipelines without an offline evaluation harness
§ 02 / Decision frame

Where AI belongs —
and where it doesn't.

Plot the decision on two axes: cost-of-being-wrong and volume of decisions per day. The diagonal tells you the answer.

Q · 01 / Don't · high cost · low volume

Pure human judgement

Hiring decisions, M&A, legal exposure calls, medical diagnosis without specialist oversight.

Don't put a model in front of a decision a senior person makes ten times a year.
Q · 02 / Assist · high cost · high volume

AI-assisted human

Underwriting, fraud review, claims triage, medical pre-screening, content moderation appeals.

Model drafts, scores, flags. Human signs. Always.
Q · 03 / Skip · low cost · low volume

Probably not worth it

Occasional admin tasks, one-off internal queries, tasks done less than weekly.

The build cost will exceed the saved time. Use a checklist.
Q · 04 / Automate · low cost · high volume

The sweet spot

Ticket classification, document extraction, lead enrichment, draft generation, smart routing.

↪ Where most of our AI work happens.
Axes: cost of being wrong (↑) × decision volume (→). Quadrants: Don't · Assist a human · Skip · Automate behind threshold.

↪ The first thing we do in any AI engagement is plot your candidate decisions on this grid. Roughly one in three lands in Q4.
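The grid fits in a lookup table. A sketch, with labels taken from the quadrants above:

```python
def quadrant(cost_of_wrong: str, volume: str) -> str:
    """Map a candidate decision onto the 2×2 grid.
    Both arguments are 'high' or 'low'."""
    return {
        ("high", "low"):  "Q1 · Don't — pure human judgement",
        ("high", "high"): "Q2 · Assist — model drafts, human signs",
        ("low",  "low"):  "Q3 · Skip — probably not worth it",
        ("low",  "high"): "Q4 · Automate — behind a threshold",
    }[(cost_of_wrong, volume)]
```

The hard part is not the lookup; it is being honest about which axis a decision really sits on.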

§ 03 / Patterns

Six AI patterns,
shipped to production.

Not a model menu. The shapes of system we actually build, with the seam to a human always in the spec.
A · Classify

Routing & triage

Inbound items — tickets, emails, leads, claims — sorted into the right queue with confidence. Below threshold drops to a human in seconds.

INPUT → unstructured text · OUTPUT → labeled · confidence · routed
B · Extract

Document extraction

Invoices, contracts, ID, KYC — pulled into structured fields, validated, and pushed into your systems. Failed extractions queue for review.

INPUT → PDF · scan · email · OUTPUT → JSON · cross-validated
C · Retrieve

Grounded answer systems

RAG over your own knowledge — runbooks, policies, contracts, product data. Citations mandatory. No citation, no answer.

INPUT → question + corpus · OUTPUT → answer + cited sources
D · Draft

First-draft generation

Replies, reports, summaries, RFP responses — drafted in your voice, sitting in a human's queue: one click to approve, one to edit.

INPUT → context + intent · OUTPUT → editable draft (never sent)
E · Forecast

Predictive scoring

Churn, demand, default, lead conversion — ranked probabilities that drive ops. Often the right answer is gradient boosting, not a transformer.

INPUT → historical + features · OUTPUT → ranked probabilities
F · Detect

Anomaly & drift detection

The system notices when something stops looking like itself — fraud patterns, ops drift, data-quality decay — before any dashboard surfaces it.

INPUT → continuous telemetry · OUTPUT → flagged + explained
§ 04 / Model selection

The right model
is rarely the
biggest one.

We're model-agnostic by design. The architecture decides which model runs each decision; you get the cheapest, fastest, and most portable thing that meets the threshold.
Classify (short)
Fine-tuned BERT · DistilBERT
~ 25 ms
€ 0.01
Specific labels, in your data, fast and self-hosted. An LLM is overkill.
Classify (nuanced)
Claude Haiku · GPT-4o-mini
~ 800 ms
€ 0.30
When labels need reading-between-the-lines. Small frontier models, not big ones.
Extract
Claude Sonnet · GPT-4o
~ 2 s
€ 1.50
Strong JSON-mode + schema validation. Cross-checked against rules before commit.
RAG · retrieve
pgvector · BM25 hybrid
~ 80 ms
€ 0.00
Postgres extension. Hybrid retrieval beats pure vector. No standalone vector DB needed.
RAG · generate
Claude Sonnet · with citations
~ 1.5 s
€ 1.20
Citation-by-default. If the model can't cite, it returns "unknown" — never invents.
Forecast
XGBoost · LightGBM · Prophet
~ 5 ms
€ 0.00
Tabular data with strong features beats LLMs at prediction. Almost always.
Sensitive · PII
Self-hosted · Llama · Mistral
~ 1 s
infra
Data sovereignty. Runs in your VPC or our EU region. Nothing leaves the perimeter.

↪ The architecture is portable: every model is behind an adapter, swappable in a day. Your business logic doesn't care which one ran.
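What "behind an adapter" means in practice, as a minimal sketch. `KeywordStub` is a stand-in for illustration; a real adapter would wrap Anthropic, OpenAI, or a self-hosted model behind the same signature:

```python
from typing import Protocol

class ModelAdapter(Protocol):
    """The only surface business logic ever sees.
    Swapping vendors means swapping the class behind it."""
    def classify(self, text: str) -> tuple[str, float]: ...

class KeywordStub:
    """Illustrative stand-in — not a real model."""
    def classify(self, text: str) -> tuple[str, float]:
        if "payment" in text.lower():
            return ("billing", 0.93)
        return ("other", 0.40)

def triage(model: ModelAdapter, text: str) -> str:
    """Business logic: depends on the Protocol, not the vendor."""
    label, conf = model.classify(text)
    return label if conf >= 0.90 else "human-review"
```

Because `triage` only knows the `Protocol`, switching the model behind it touches one class, not the pipeline.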

§ 05 / Guardrails

The boring scaffolding
that makes AI
production-grade.

None of this is glamorous. All of it is non-negotiable. It's the difference between a demo that wows and a system that won't make the news for the wrong reason.
G · 01 · non-negotiable

Confidence thresholds

Every model output carries a confidence score. Below threshold, it lands in a human queue. Calibrated per decision, not per model.

G · 02 · non-negotiable

Offline eval harness

A frozen test set, scored on every change. No model goes to production without beating the previous one on metrics you signed off on.
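The harness itself can be small. A minimal sketch with stand-in models (any callable from text to label):

```python
def evaluate(model, frozen_set) -> float:
    """Accuracy of a candidate on the frozen test set.
    The set never changes; only the models do."""
    correct = sum(1 for text, label in frozen_set if model(text) == label)
    return correct / len(frozen_set)

def promote(candidate, incumbent, frozen_set) -> bool:
    """Promotion gate: the candidate must beat the
    incumbent on the agreed metric, no exceptions."""
    return evaluate(candidate, frozen_set) > evaluate(incumbent, frozen_set)
```

In production the metric is rarely plain accuracy (per-class recall and calibration matter more), but the gate stays this shape: frozen data, scored on every change.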

G · 03 · non-negotiable

Shadow mode

Models run silently against real traffic before they ever act. Predictions logged, compared to humans, reviewed weekly until trusted.
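In code, shadow mode is roughly this shape (an illustrative sketch; the human's decision is the only one that acts):

```python
def shadow(ticket: str, model, human_decision: str, log: list) -> str:
    """Shadow mode: the model predicts on live traffic but never
    acts. Agreement is logged for the weekly review."""
    label, conf = model(ticket)
    log.append({
        "pred": label,
        "conf": conf,
        "human": human_decision,
        "agree": label == human_decision,
    })
    return human_decision  # only the human's call takes effect
```

Once the agreement rate holds above the target for long enough, the threshold goes live and the same log becomes the audit trail.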

G · 04 · non-negotiable

Drift monitoring

Models decay quietly when the world changes. We monitor input distribution and output calibration — alerts fire before performance does.
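One standard drift signal is the Population Stability Index over the input distribution. A minimal sketch, assuming the histogram bucket proportions are already computed:

```python
import math

def psi(expected: list[float], observed: list[float]) -> float:
    """Population Stability Index across matched buckets.
    Common rule of thumb: > 0.2 means meaningful drift."""
    total = 0.0
    for e, o in zip(expected, observed):
        e, o = max(e, 1e-6), max(o, 1e-6)  # guard against log(0)
        total += (o - e) * math.log(o / e)
    return total

psi([0.25] * 4, [0.25] * 4)              # identical → 0.0
psi([0.25] * 4, [0.7, 0.1, 0.1, 0.1])    # skewed → well above 0.2
```

The same check runs on output confidence buckets, which is how calibration decay gets caught before accuracy visibly drops.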

G · 05 · non-negotiable

Audit trail

Every prediction recorded with input, model version, confidence, and the eventual outcome. Replayable on demand, defensible to a regulator.

G · 06 · non-negotiable

Kill switch

Any model can be disabled in seconds, falling back to the pre-AI flow. Tested quarterly. Hopefully never used. Always there.
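The switch is one flag in the pipeline, not a redeploy. An illustrative sketch (function names are ours):

```python
def handle(ticket: str,
           model_enabled: bool,
           classify,        # callable: ticket -> (label, confidence)
           legacy_route):   # callable: ticket -> queue, the pre-AI flow
    """Kill switch: one flag drops the pipeline back to the
    pre-AI flow. Low confidence falls back the same way."""
    if not model_enabled:
        return legacy_route(ticket)
    label, conf = classify(ticket)
    return label if conf >= 0.90 else legacy_route(ticket)
```

Because the fallback path is the same code the low-confidence path already exercises daily, the quarterly test rarely finds surprises.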

§ 06 / Readiness

Two weeks to know
if AI is the answer.

Before any model gets built, we run a structured readiness audit. Output: a written brief that names the right candidates — and the wrong ones — by name.

R · 01 · days 1–2

Inventory the decisions

Every recurring decision in your operation, plotted on the spectrum. The honest list, not the demo list.

R · 02 · days 3–5

Audit the data

Ground-truth labels, distribution, drift, edge cases. Most projects fail here, before any model is touched.

R · 03 · days 6–8

Run a baseline

A small frontier model, off the shelf, prompted carefully — measured against your data. Sets the floor before any custom work.

R · 04 · days 9–10

Score & rank

Each candidate decision weighted on volume, savings, risk, and feasibility. You leave with a numbered roadmap.

R · 05 · handover

The written brief

A 14–20 page document with named candidates, baseline numbers, and a build plan. Yours to keep, regardless of what comes next.

FIXED FEE — Roughly 30% of audits end with us recommending against AI for the candidate workload. We're fine with that.

§ 07 / Engagement

Three ways
to start.

Tier 01 · Fixed fee

Readiness Audit

Two weeks to know what's worth building, what isn't, and where to start. Written brief at handover.

DURATION · 2 weeks
TEAM · 2 senior
PRICING · Fixed fee
  • Decision spectrum mapped
  • Off-the-shelf baseline measured
  • Ranked roadmap, written brief
Brief an Audit →
Tier 02 · most common

Build

End-to-end build of one or more AI capabilities into your operation. Shadow mode → threshold → live.

DURATION · 3–6 months
TEAM · 2–3 senior
PRICING · T&M with cap
  • Eval harness on day one
  • All six guardrails baked in
  • Knowledge transfer to your team
Scope a Build →
Tier 03 · Long-term

Steward

We own the lifecycle of your AI fleet — drift, retraining, evals, threshold tuning — until your team takes the keys.

DURATION · 6+ months
TEAM · Embedded
PRICING · Monthly retainer
  • Drift & calibration monitoring
  • Quarterly model refresh
  • Hand-off plan from day one
Discuss Steward →

↪ Indicative. Every engagement is scoped from a written brief — no per-model surprises, no usage roulette.

§ 08 / Proof

Support intake
went from 11min
to 22 seconds —
and accuracy went up.

Routing accuracy · 96.2 % · ▲ from 89 % (human)
Auto-route rate · 81 % · at ≥ 90 % confidence
First-response time · 22 s · ▼ from 11 min
"The model is right more often than our best agent — and we know exactly when it isn't."
— Head of Customer Ops · SaaS · NDA
§ 09 / Objections

The questions
we hear on
every first call.

Mostly variations of "is this safe", "will this hallucinate", and "what about our data". Fair questions.

Q · 01

"What about hallucinations?"

+
We architect against them, not around them. RAG systems must cite. Extractions must validate against schemas. Classifiers must confess uncertainty via threshold. And nothing irreversible runs without a human seam. The model is allowed to be wrong; the system is not allowed to be silent about it.
Q · 02

"Where does our data go?"

+
Defaults: EU regions, no training on your data, encryption in transit and at rest. For regulated workloads, fully self-hosted models in your VPC — nothing leaves the perimeter. The architecture decides the model; data sovereignty is one of the inputs.
Q · 03

"What if the model gets worse over time?"

+
We monitor drift on input distribution and output calibration continuously. Performance degrades a few percentage points before anyone notices in the wild — alerts fire long before that. On a Steward engagement, retraining or threshold tuning is part of the retainer.
Q · 04

"Are AI agents the future of automation?"

+
For some narrow domains, eventually. Today, "agentic" systems are typically pipelines of classifiers, extractors, and drafters wired into deterministic glue — which is exactly how we build. The interesting question isn't "is it an agent"; it's "where in the pipeline does the human sign?"
Q · 05

"Will this lock us into one model vendor?"

+
No. Every model sits behind an adapter — your business logic doesn't know whether it called Anthropic, OpenAI, or a self-hosted Llama. Switching takes a day, not a quarter. Vendor risk is treated the same way we treat every other dependency: minimised on the way in.
Currently accepting Q3 engagements

Where in your
operation does
judgement scale
worse than volume?

That's where AI earns its place. Bring it to a 30-minute call — we'll tell you, honestly, whether it's worth building and what it would take.