Jason Derr | AI Maestro — Orchestrating AI From Vision to Outcome

Reply rate is climbing. Pipeline isn't moving. That gap isn't a coincidence.

Every GTM metric the industry agreed on was built for a world where one rep meant one human input. CAC, LTV, win rate, pipeline coverage, NRR. Each one assumed a human-only motion where pipeline grew through volume and signal quality could be inferred from reply rate.

That world is gone.

In the AI-augmented motion, one operator supervises a stack of agents. Pipeline grows when systems catch deviations the team would have missed by eyeballing a Monday report. Reply rate is now actively misleading because AI-generated outreach gets more replies and a higher proportion of them are noise. The metrics didn't get worse. They got disconnected from what's actually producing or destroying revenue.

Here are three metrics for the new motion. Each one solves a problem the canon doesn't see. Each one comes with the build that makes it work in production.

Three layers, three metrics, one dashed arrow

The motion has three layers that determine whether a quarter ships.

Top of funnel is signal quality. Middle of funnel is pipeline maturity. Operator level is throughput per human-plus-agent unit. Each layer has its own diagnostic metric. Each metric has a build that makes it real-time and actionable, not just visible in a weekly review.

Two of the arrows between these layers are mechanical. Better top-of-funnel signal quality directly enriches downstream pipeline. Clearer pipeline maturity directly improves operator prioritization. Instrument the layer, the cascade follows.

The third arrow is contingent. Operator productivity feeds back into signal quality only if freed capacity gets reinvested into experimentation rather than absorbed into more of the same volume. Most teams default to volume. Closing this loop deliberately is what separates a GTM motion that compounds from one that plateaus.

The three metrics teach the layers. The closer teaches how to push against funnel gravity.

PPO

Pipeline per operator.

Most teams measure productivity the way they did when one rep meant one human input. Meetings per SDR. ARR per AE. Dials per day. Each one assumed one human did one human's work.

In 2026 that's the wrong unit. An operator running three AI outbound sequences and a deal-coaching agent looks identical to a non-augmented rep on the dashboard. Same titles, same comp plan, same activity reports. Their leverage scales with the agent stack they supervise. The dashboard cannot see it.

Worse, the legacy metrics actively punish the operator who shifts work to agents. Their human-input volume drops. Their output rises. Comp, ranking, and headcount decisions get made off numbers that no longer reflect what the unit is producing.

PPO without an agent denominator is a 2022 metric pretending to be a 2026 one.

What PPO measures

Qualified pipeline per operator per week, where the operator denominator is the human plus the AI workflows they supervise. Two cuts against the same opportunity table.

PPO-creation captures stage 2+ opportunities created per operator-week.
PPO-maturation captures stage 4+ pipeline value moved per operator-week.

Same join, different filters, two diagnostic stories.

An AI-augmented SDR might triple PPO-creation while PPO-maturation stays flat. That's the right shape for that role. An AI-augmented AE supervising deal-coaching agents shows the inverse. The two cuts make the operator visible in the unit they actually work in.

The build

A weekly batch query that joins your opportunity table to ownership and aggregates by operator and week, where operator is the unit you're actually compensating. Two views off the same join.

sql

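-- creation: stage 2+ opps created per operator-week
-- note: filters on the opportunity's current stage, not its stage at creation
with creation as (
  select OwnerId as operator,
         date_trunc('week', CreatedDate) as week,
         count(*) as opps_created
  from Opportunity
  where StageName not in ('Prospecting','Qualification')
  group by 1, 2
),
-- maturation: value moved into stage 4+ per operator-week
-- counts only the crossing into stage 4+, so later moves within 4+ don't double-count
maturation as (
  select o.OwnerId as operator,
         date_trunc('week', h.CreatedDate) as week,
         sum(o.Amount) as value_moved
  from OpportunityFieldHistory h
  join Opportunity o on o.Id = h.OpportunityId
  where h.Field = 'StageName'
    and h.NewValue in ('Proposal','Negotiation','Closed Won')
    and coalesce(h.OldValue, '') not in ('Proposal','Negotiation','Closed Won')
  group by 1, 2
)
-- operator-weeks present in only one cut show 0 for the other
select operator, week,
       coalesce(opps_created, 0) as opps_created,
       coalesce(value_moved, 0) as value_moved
from creation
full outer join maturation using (operator, week);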

The artifact that proves the metric is the before/after curve. Instrument PPO across a cohort. Introduce one AI workflow. Measure the lift over 6-8 weeks. Supervision moves from anecdote to measured curve.
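The before/after comparison can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline. The function name `ppo_lift`, the week-index input shape, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean

def ppo_lift(weekly_ppo, adoption_week, window=8):
    """Compare mean PPO in the `window` weeks before and after adoption.

    weekly_ppo: list of (week_index, ppo_value), one entry per week.
    adoption_week: week index in which the AI workflow was introduced.
    Returns (before_mean, after_mean, lift), where lift is after / before - 1.
    """
    before = [v for w, v in weekly_ppo if adoption_week - window <= w < adoption_week]
    after = [v for w, v in weekly_ppo if adoption_week <= w < adoption_week + window]
    if not before or not after:
        raise ValueError("need at least one week on each side of adoption")
    b, a = mean(before), mean(after)
    if b == 0:
        raise ValueError("baseline PPO is zero; lift is undefined")
    return b, a, a / b - 1

# Hypothetical operator: 4 opps/week before, 6 opps/week after adopting a workflow in week 8.
series = [(w, 4) for w in range(8)] + [(w, 6) for w in range(8, 16)]
before, after, lift = ppo_lift(series, adoption_week=8)
```

Run the same function per operator across the cohort and plot the weekly values around the adoption week. A single mean hides the shape, so keep the weekly series alongside the lift number.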

What PPO unlocks

Workflow ROI you can show to a CFO
Compensation tied to actual output rather than activity
Capacity planning that asks how far PPO can rise before asking for more headcount
A teachable skill called supervision

The diagnostic question for your team. Can you point to one operator whose PPO went up after they adopted an AI workflow? By how much, over what window?

PQR

Pipeline-to-quota ratio.

Most teams measure pipeline coverage as a single moment. Open pipeline at each stage, divided by quota. 4x feels safe. 5x feels comfortable. The team forecasts off that number until two weeks before quarter-end, when reality catches up.

The problem is not the math. It's the snapshot. Coverage at a single moment tells you nothing about whether what is in the pipe will mature in time to close. You can have 5x coverage with most of it in stage 1 and still miss because stage 1 does not convert to closed-won in 8 weeks.

PQR fixes this by adding two things to coverage.

First, it calculates against cumulative pipeline by stage. Stage 4+ PQR sums everything from stage 4 forward (stage 4 + stage 5 + stage 6 + closed won) and divides by quota. That captures pipeline maturity rather than pipeline volume.

Second, it gets calculated daily and snapshotted against history. That turns the metric from a snapshot into a forecast.

PQR coverage without snapshots is forecasting theater.

Nikko Lubrano built the original PQR concept at Hightouch, citing Tomasz Tunguz's Sales Sandwich framework as the inspiration. What follows is the early-warning layer that turns the metric from a measurement into an alarm system.

The build

Three steps.

Step one. The live calculation. Aggregate cumulative pipeline by stage from your opportunity table, divided by quarterly quota. Build the same query for every stage cut you care about (stage 1+, stage 2+, etc.). Now you have a real-time view of maturity coverage by stage.
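A minimal sketch of the cumulative calculation, assuming open and closed-won opportunities have been pulled into plain records. The stage names in `STAGES`, the record fields, and the sample amounts are hypothetical. Map them to your own stage ladder.

```python
# Hypothetical ordered stage ladder; closed won counts toward every cut.
STAGES = ["stage1", "stage2", "stage3", "stage4", "stage5", "stage6", "closed_won"]

def pqr_by_stage_cut(opportunities, quota):
    """Return {'stage1+': ratio, ...}: pipeline from each stage forward, over quota.

    Closed-lost opportunities must be filtered out before calling this.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    totals = {stage: 0.0 for stage in STAGES}
    for opp in opportunities:
        totals[opp["stage"]] += opp["amount"]
    result = {}
    running = 0.0
    # Walk from the latest stage backwards so each cut includes everything after it.
    for stage in reversed(STAGES):
        running += totals[stage]
        if stage != "closed_won":
            result[f"{stage}+"] = running / quota
    return result

opps = [
    {"stage": "stage1", "amount": 300_000},
    {"stage": "stage2", "amount": 200_000},
    {"stage": "stage4", "amount": 100_000},
    {"stage": "stage5", "amount": 50_000},
    {"stage": "closed_won", "amount": 50_000},
]
pqr = pqr_by_stage_cut(opps, quota=200_000)
```

In this made-up example the stage 1+ cut shows 3.5x coverage while the stage 4+ cut shows 1.0x. Same pipeline, two very different stories about maturity.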

Step two. Daily snapshots. The live query gives you coverage at a moment. The forecasting power comes from watching the shape over time. dbt snapshot models are the cleanest path. Run daily, they record every change to your opportunity table as a new row version with validity timestamps (dbt_valid_from and dbt_valid_to by default). Query the snapshot like you would query the live table, just filter to the versions that were valid on the date you care about.
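By default dbt snapshots store each row version with `dbt_valid_from` and `dbt_valid_to`, so "filter by date" means picking the version whose validity window contains that date. A minimal sketch of that as-of filter, assuming the snapshot rows are already loaded as dictionaries. The helper name `as_of` and the sample rows are hypothetical, and dates stand in for the timestamps dbt actually writes.

```python
from datetime import date

def as_of(snapshot_rows, day):
    """Return the version of each opportunity that was current on `day`.

    Assumes one row per record version, with dbt_valid_to set to None
    while that version is still the current one.
    """
    return [
        row for row in snapshot_rows
        if row["dbt_valid_from"] <= day
        and (row["dbt_valid_to"] is None or day < row["dbt_valid_to"])
    ]

rows = [
    {"id": "opp1", "stage": "stage2", "amount": 100_000,
     "dbt_valid_from": date(2026, 1, 5), "dbt_valid_to": date(2026, 2, 2)},
    {"id": "opp1", "stage": "stage4", "amount": 100_000,
     "dbt_valid_from": date(2026, 2, 2), "dbt_valid_to": None},
]
jan = as_of(rows, date(2026, 1, 20))
feb = as_of(rows, date(2026, 2, 10))
```

Feed each day's as-of view into the same PQR calculation you run live and you get the historical shape, one point per day-in-quarter.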

Step three. The early-warning layer. Compare today's PQR shape to the same day-in-quarter from the last 6-8 quarters. Significant deviations (greater than 1.5σ from historical mean) post to a Slack channel with historical analog context. The analog matters. "Last time S2+ PQR deviated this much on day 35, we missed quota by 12%" is an alert someone acts on. A bare deviation number is decoration.

python

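# detect PQR deviations on the daily snapshot stream
import pandas as pd

Z_THRESHOLD = 1.5  # alert when today's PQR is more than 1.5 sd from baseline

def check_pqr(history, current_pqr, current_day, stage_cut='stage4+'):
    """Return an alert payload if today's PQR deviates from history, else None.

    Assumes `history` is a DataFrame limited to the last 6-8 quarters, with
    columns quarter, day_in_quarter, stage_cut, pqr, eoq_miss_pct.
    """
    # baseline: same day-in-quarter and stage cut across prior quarters
    same_day = history[
        (history['day_in_quarter'] == current_day)
        & (history['stage_cut'] == stage_cut)
    ]
    mean, sd = same_day['pqr'].mean(), same_day['pqr'].std()
    if pd.isna(sd) or sd == 0:
        return None  # not enough history to score a deviation
    z = (current_pqr - mean) / sd
    if abs(z) <= Z_THRESHOLD:
        return None
    # historical analog: the past quarter whose PQR on this day was closest
    analog = same_day.loc[(same_day['pqr'] - current_pqr).abs().idxmin()]
    # hand this payload to your Slack client to post the alert
    return {
        'channel': '#pqr-alerts',
        'deviation': round(float(z), 2),
        'analog_quarter': analog['quarter'],
        'analog_outcome': analog['eoq_miss_pct'],
    }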

What PQR unlocks

End-of-month-1 forecast accuracy within ±5%. Same accuracy used to take a full quarter to reach.
Lead time of 2+ weeks ahead of standard pipeline reviews
Conversation shifts from "are we behind" to "which intervention works"
Quarterly retros become evidence-based (which deviation patterns predicted which misses)

The diagnostic question for your team. Can your team spot a PQR deviation 2 weeks before quarter-end, and does anyone get pinged when current-quarter shape diverges from historical baseline?

QRR

Qualified reply rate.

Most teams measure reply rate. Sends in the denominator, replies in the numerator. Sequence A gets 8% reply rate. Sequence B gets 12%. B wins. The team optimizes against that number.

In 2026 the number misleads. AI-generated outreach gets more replies than human-written outreach. A higher proportion of those replies are noise. "Wrong person." "Take me off this list." "Out of office until next month." Auto-responders. Reply rate goes up. Pipeline does not.

The team is optimizing a metric that's no longer connected to outcome. Worse, every iteration makes the disconnect deeper. Better-replying templates get prioritized regardless of what kind of replies they generate.

QRR is the new MQL.

Inbound figured this out a decade ago. Lead volume isn't the goal. Marketing-qualified leads are. The same evolution is overdue on the outbound side.

QRR is qualified replies divided by sends. The numerator depends on a classifier that tags every reply across eight categories. Three of them count as qualified: genuine interest, soft interest, and referral. Everything else is routing, cleanup, or signal that the targeting is off.

This metric isn't standardized yet. Reply rate is canonical. Qualified reply rate is what most teams need but few measure. Naming it formally is a position. So is shipping the classifier that makes it computable.

The build

Three steps.

Step one. The classifier. Take 200-300 sample replies from your existing outbound and label them by hand across eight categories (genuine interest, soft interest, wrong person, referral, unsubscribe, hostile, auto-responder, noise). Use those labels to build a few-shot prompt for Claude or GPT, or fine-tune a smaller model if your volume justifies it. Either approach works. The classifier reads each new reply and outputs a category in under a second.
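A minimal sketch of the few-shot path, with the model call left out on purpose. `complete` is a hypothetical callable (prompt in, text out) that wraps whichever model client you use. The prompt wording, the sample replies, and the fallback to `noise` on an unrecognized answer are assumptions, not a tested recipe.

```python
CATEGORIES = [
    "genuine_interest", "soft_interest", "wrong_person", "referral",
    "unsubscribe", "hostile", "auto_responder", "noise",
]

def build_prompt(labeled_examples, reply_text):
    """Assemble a few-shot classification prompt from hand-labeled replies."""
    lines = [
        "Classify the reply to a sales email into exactly one category.",
        "Categories: " + ", ".join(CATEGORIES),
        "Answer with the category name only.",
        "",
    ]
    for text, label in labeled_examples:
        lines += [f"Reply: {text}", f"Category: {label}", ""]
    lines += [f"Reply: {reply_text}", "Category:"]
    return "\n".join(lines)

def parse_category(model_output):
    """Normalize the model's answer; fall back to 'noise' on anything unexpected."""
    answer = model_output.strip().lower().replace(" ", "_").replace("-", "_")
    return answer if answer in CATEGORIES else "noise"

def classify(reply_text, labeled_examples, complete):
    """Classify one reply. `complete` is a hypothetical prompt -> text callable."""
    return parse_category(complete(build_prompt(labeled_examples, reply_text)))

examples = [
    ("Sure, send over some times next week.", "genuine_interest"),
    ("I'm not the right contact, try our VP of Sales.", "referral"),
    ("I am out of office until the 14th.", "auto_responder"),
]

def fake_model(prompt):
    # Stub standing in for a real model call, so the sketch runs offline.
    return " Soft interest\n"

label = classify("Maybe later this quarter.", examples, fake_model)
```

Keeping the parser strict matters more than the prompt wording. A model that answers with a sentence instead of a category should never silently land in the qualified bucket.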

Step two. The metric. Once the classifier is running, QRR is one division. Qualified replies (genuine interest + soft interest + referral) over total sends, calculated weekly per sequence, per segment, per rep.

Step three. The feedback loops. The classifier is more valuable than the metric. Every classified reply is a labeled data point. Pipe positive classifications back into the signal layer to reinforce what's working. Pipe negative classifications into ICP refinement. Pipe auto-responder rates into deliverability monitoring. The classifier is the artifact. The metric is one of several things you do with it.

python

from typing import Callable, Iterable

CATEGORIES = [
    'genuine_interest', 'soft_interest',
    'wrong_person', 'referral',
    'unsubscribe', 'hostile',
    'auto_responder', 'noise',
]

QUALIFIED = {'genuine_interest', 'soft_interest', 'referral'}

def qrr(sends: int, replies: Iterable[str],
        classify: Callable[[str], str]) -> float:
    """Qualified replies over total sends for one slice and one week.

    sends: outbound messages sent for the slice (sequence, segment, or rep).
    replies: reply texts received for the same slice and week.
    classify: your wrapper around the few-shot prompt or fine-tuned model;
        it must return one of CATEGORIES.
    """
    if sends == 0:
        return 0.0  # nothing sent, so there is no rate to report
    qualified = sum(
        1 for text in replies
        if classify(text) in QUALIFIED
    )
    return qualified / sends
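The feedback loops in step three amount to a routing table over the classifier's output. A minimal sketch, assuming replies arrive as `(reply_id, category)` pairs. The queue names in `ROUTES` are hypothetical placeholders for whatever your signal layer, ICP review, and deliverability monitoring actually consume. `noise` is left unmapped here because the steps above don't assign it a destination.

```python
from collections import Counter

# Hypothetical destinations for each category.
ROUTES = {
    "genuine_interest": "signal_layer",
    "soft_interest": "signal_layer",
    "referral": "signal_layer",
    "wrong_person": "icp_refinement",
    "unsubscribe": "icp_refinement",
    "hostile": "icp_refinement",
    "auto_responder": "deliverability_monitoring",
}

def route(classified_replies):
    """Group reply ids by destination; unmapped categories go to 'unrouted'."""
    queues = {}
    for reply_id, category in classified_replies:
        queues.setdefault(ROUTES.get(category, "unrouted"), []).append(reply_id)
    return queues

def auto_responder_rate(classified_replies):
    """Share of replies that are auto-responders, for deliverability monitoring."""
    counts = Counter(category for _, category in classified_replies)
    total = sum(counts.values())
    return counts["auto_responder"] / total if total else 0.0

batch = [
    ("r1", "genuine_interest"),
    ("r2", "wrong_person"),
    ("r3", "auto_responder"),
    ("r4", "noise"),
]
queues = route(batch)
rate = auto_responder_rate(batch)
```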

The signal-based outbound rebuild I documented at /projects/outbound-rebuild put the QRR classifier in production. Reply rate moved from 1.2% to 3.8%. Meeting-booked rate from 0.4% to 1.6%. The classifier was the upstream change. The downstream metrics followed.

What QRR unlocks

Sequence quality scoring beyond reply rate
Empirical ICP refinement (segments with the highest QRR are your real ICP)
A/B testing copy with a signal that correlates to pipeline
AI drafting agent supervision (is QRR climbing or decaying as the agent runs?)
A reusable classifier you can plug into routing, scoring, and reporting

The diagnostic question for your team. What percentage of your outbound replies are genuinely interested versus routing or noise? If you don't know, you're optimizing the wrong number.

Closing the dashed arrow

Two of the three arrows in the diagnostic stack are mechanical. Better top-of-funnel signal quality enriches downstream pipeline. Clearer pipeline maturity improves operator prioritization. Instrument the layer, the cascade follows.

The third arrow is contingent. Operator productivity feeds back into signal quality only if freed capacity gets reinvested into experiments that test new signal hypotheses. Most teams default to volume. Capacity gets absorbed into more sequences, more touches, more meetings booked. The system stays at its baseline signal quality and the compounding never happens.

The default has gravity. Reps optimize for activity numbers because that's what comp plans reward. Managers optimize for pipeline targets because that's what boards measure. Nobody on the team has a job description that says "make the signal better." The dashed arrow stays dashed.

Three practices close it. Each one is small, concrete, and runnable in the next two weeks. None of them require headcount or budget.

Practice 01. Protect 10-15% of operator capacity for signal experiments.

Calendar block. Recurring. Priority of a customer meeting, not "if there's time left over." A 40-hour operator protects 4-6 hours weekly for testing signal hypotheses (a new signal source, a new ICP slice, a new enrichment approach, a new opener). One experiment per cycle is the minimum bar. Without protected time, the work falls back to the default.

Practice 02. Tag every experiment and measure outcomes against QRR.

Reply rate will lie here for the same reason QRR exists. An experiment that drives reply rate up but leaves QRR flat is not a win. Tag each experiment in your outbound system, run the classifier against the replies, and after 4-6 weeks of data the experiment either improved QRR or it did not. Bury the ones that did not.
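A minimal sketch of that comparison, assuming each experiment arm is a send count plus the list of classifier labels for its replies. The names `rates` and `verdict` and the sample numbers are made up for illustration. The rule is deliberately blunt and does no significance testing, so treat small differences on small samples with suspicion.

```python
QUALIFIED = {"genuine_interest", "soft_interest", "referral"}

def rates(sends, categories):
    """Return (reply_rate, qrr) for one experiment arm."""
    if sends <= 0:
        raise ValueError("sends must be positive")
    qualified = sum(1 for category in categories if category in QUALIFIED)
    return len(categories) / sends, qualified / sends

def verdict(control, variant):
    """Scale a variant only if QRR improved, whatever reply rate did."""
    _, control_qrr = rates(*control)
    _, variant_qrr = rates(*variant)
    return "scale" if variant_qrr > control_qrr else "retire"

# Hypothetical arms: the variant doubles reply rate, but only with auto-responders.
control = (1000, ["genuine_interest"] * 10 + ["wrong_person"] * 30)
variant = (1000, ["genuine_interest"] * 10 + ["auto_responder"] * 70)

control_reply_rate, control_qrr = rates(*control)
variant_reply_rate, variant_qrr = rates(*variant)
decision = verdict(control, variant)
```

Here the variant's reply rate doubles while QRR does not move, so the decision is to retire it.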

Practice 03. Quarterly retro on signal quality.

90-minute meeting, every quarter. Looks at every experiment that ran, how QRR shifted, and what to retire versus scale. The output is a one-page update to the signal library and the ICP definition. Without the retro, experiments accumulate but the system never gets smarter.

The principle that compounds

The diagnostic stack measures three things. PPO catches operator productivity at the human-plus-agent unit. PQR catches pipeline maturity before the quarter is gone. QRR catches reply quality where reply rate fails.

The two mechanical arrows work because the metrics make the layers visible. The dashed arrow works only when the operators build the loop deliberately.

Most teams will instrument PPO and PQR and stop there. The teams that compound will close the dashed arrow.

The AI GTM Engineer's Diagnostic Stack
