The Science of Call Grading

Not vibes. A pipeline.

Phone Genius turns a front-desk phone call into a structured, scored, coachable artifact. That requires more than a transcript. It requires speech recognition, speaker separation, intent tagging, a specialty-tuned rubric, and an inter-rater framework that makes the score reproducible — not the opinion of whoever happened to listen.


ASR + diarization · Rubric-based grading · Specialty-tuned scenarios · Coaching-ready output

Mystery call placed: Specialty-specific scenario · voice model

ASR transcription: Whisper-class model · word-level timestamps

Speaker diarization: Caller vs. staff · turn segmentation

Intent & moment tagging: Greeting · discovery · objection · ask

Rubric scoring: 7 dimensions · 0–100 composite

Coaching extraction: Moment · quote · suggested line

Research foundations

Four disciplines converge in one score.

Phone Genius isn't one model behind a chat interface. It's a pipeline that borrows its best ideas from four mature research fields — then translates them into something a specialty practice owner can actually use on a Monday morning.

Mystery shopping methodology

Fifty years of service-quality measurement — banking, retail, hospitality, healthcare. The simulated-patient variant (CRiSP reporting checklist) is an accepted research methodology. Phone Genius uses the same core: a structured, repeatable scenario under controlled conditions so the score reflects the practice's behavior, not the caller's mood.

Structured scenario · Controlled conditions · Standardized report · Aggregate-first analysis

Speech analytics & conversational AI

ASR, speaker diarization, and intent classification form the standard contact-center pipeline. Whisper-class models target a word error rate (WER) under 5% on clean English. Diarization separates caller from staff to measure turn-taking, interruption, talk-time ratio, and silence. None of this is a marketing claim; it is standard tooling from the research literature.

Word-level ASR · Speaker diarization · Intent tagging · Prosody cues
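
For reference, word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below shows that calculation in Python; the function name `word_error_rate` and the sample sentences are illustrative assumptions, not Phone Genius code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, computed over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 0.2 (20%).
wer = word_error_rate("i have tuesday at ten", "i have thursday at ten")
```

A WER under 5% means fewer than one word in twenty is wrong, missing, or spurious relative to the reference.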

Rubric-based human evaluation

Same text, two graders, same score — that's inter-rater reliability (Cohen's kappa, Krippendorff's alpha). Our rubric uses concrete, observable anchors: "did staff ask for the appointment within 90 seconds of the pain question?" — not "did it feel warm?" Observable anchors make the grade reproducible between AI and human auditor.

Observable anchors · Banded scores · Human-calibrated · Agreement stats

Dental specialty field research

A perio call has different stakes, objections, and conversion ceilings than a pizza order. Phone Genius is built from 1,000+ hours of specialty coaching, 350+ practices on Referral Lab, and a million-plus referral records. The benchmarks drawn from that data include a ~85% target schedule rate, time-to-conversion as the forward indicator, and the 45–90 second appointment-ask window.

Specialty scenarios · Insurance realism · Schedule-rate benchmarks · Perio/OS/ortho/endo/prosth

The rubric

Seven dimensions. One score. Zero guesswork.

Every call is scored on the same seven dimensions. Each dimension has concrete behavioral anchors — not mood words — so two graders looking at the same transcript reach the same number. The composite is a weighted 0–100 "Call Score," but the dimension breakdown is what drives the coaching.

Greeting & Warmth

Did the call open inside three rings with a clean brand greeting, a named staff member, and a warmth signal the caller's ear actually hears?

Observable anchors

Picked up within target ring count
Brand + staff name stated in opener
Tone: no dead air or flat monotone
No audible hold before the human

Control of the Call

Did the staff lead, or did the caller lead? A controlled call moves through a discovery arc on purpose — not a Q&A volley dictated by the caller's anxiety.

Observable anchors

Staff asks more open questions than caller
Talk-time ratio inside healthy band
Interruptions trend low
Silence tolerated after a question

Discovery & Listening

Did the staff uncover the reason for the call — pain, referral source, timing, insurance, prior experience — before jumping to logistics?

Observable anchors

Open-ended "what brings you in" moment
Referral source captured verbatim
Pain / urgency acknowledged explicitly
Active-listening markers ("got it," "mm-hm")

Clinical & Administrative Accuracy

Were the factual answers correct? Insurance networks, specialty scope, referral requirements, visit length, what happens at the consult — the boring stuff patients fact-check after they hang up.

Observable anchors

Insurance answer matches practice payer list
Specialty scope stated correctly
Visit length expectation set accurately
No fabricated clinical promises

Objection Handling

When the caller pushed back on price, timing, insurance, or distance, did the staff have a move — or did the call collapse into "let me take your number"?

Observable anchors

Objection named, not deflected
Reframe offered before alternative
Value anchored, not just price defended
No premature "we'll call you back"

Urgency & Timing

Specialty new-patient calls have a window — usually 45 to 90 seconds — where the appointment ask lands. Blow past the window and conversion falls off a cliff.

Observable anchors

Appointment offered inside the window
Specific dates/times offered, not "we have openings"
Same-week option surfaced when scope allows
Pain-first callers fast-tracked
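
One way the window anchor could be checked against word-level timestamps is sketched below. This is an illustration under stated assumptions: the function name `classify_ask_timing`, the label strings, and the choice of reference point (call start by default, or a moment such as the pain question) are hypothetical. Only the 45 to 90 second window comes from the text above.

```python
from typing import Optional


def classify_ask_timing(
    ask_time_s: Optional[float],
    reference_time_s: float = 0.0,                  # assumption: measured from call start
    window_s: tuple[float, float] = (45.0, 90.0),   # window stated in the rubric text
) -> str:
    """Label where the appointment ask landed relative to the window."""
    if ask_time_s is None:
        return "missing"  # the call never reached an appointment ask
    elapsed = ask_time_s - reference_time_s
    earliest, latest = window_s
    if elapsed < earliest:
        return "early"
    if elapsed <= latest:
        return "in_window"
    return "late"


# An ask at 1:42 (102 s) is late when measured from call start,
# but in the window when measured from a pain question asked at 0:30.
from_call_start = classify_ask_timing(102.0)
from_pain_question = classify_ask_timing(102.0, reference_time_s=30.0)
```

Keeping the reference point as a parameter makes the choice explicit instead of burying it in the scoring logic.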

Appointment Conversion

Did the call end with a scheduled appointment — not a "we'll call you back," not a "think about it," not a verbal maybe? This is the dimension every other dimension is trying to serve. It's weighted accordingly.

Observable anchors

Firm appointment booked
Confirmation details read back
Next step communicated (forms, insurance, directions)
Warm close, not transactional goodbye

Composite score is a weighted sum with Appointment Conversion (07) as the dominant weight, Urgency (06) and Control (02) as secondary weights, and the remaining dimensions as texture.

90+ Books the appointment and does it well.

60 Books it poorly.

40 Didn't book it.

Sample scorecard: Mystery call #047 · Apr 18 · 2m 14s · Graded · Call Score 72

Dimensions shown: 01 Greeting · 02 Control · 03 Discovery · 04 Accuracy · 05 Objections · 06 Urgency · 07 Conversion

Coaching moment · 1:42

“I see you have some openings” — try: “I have Tuesday at 10 or Thursday at 2 — which works better for you?”

Methodology

How a phone call becomes a score.

The pipeline runs the same way every time so the output is comparable across calls, across staff, across practices, and across months. Nothing is "judged." Every score maps back to a specific observable event in the transcript with a timestamp.

Scenario & voice selection

Each mystery call gets a specialty-specific scenario: perio consult, OS extraction with pain, ortho teen, endo emergency, prosth implant. The AI caller uses a realistic persona (age, insurance posture, referral source) with scripted flexibility — not a hard read.

Call placement & recording

Placed at a realistic time and day. Ring count, hold time, and transfers are all captured. Recording follows applicable consent rules — two-party disclosure is built into the opener when required.

ASR with word-level timestamps

Whisper-class ASR model with word-level timestamps, punctuation, and per-word confidence. The target is a WER under 5% on clean English telephony. Low-confidence regions are left unscored, not guessed.
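
To illustrate the "unscored, not guessed" rule, the sketch below merges runs of low-confidence words into time spans that a scorer could skip. The `Word` structure, the function name, and the 0.6 threshold are all assumptions made for the example; the text above does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_s: float
    end_s: float
    confidence: float  # 0.0 to 1.0, as reported by the ASR model


def low_confidence_regions(
    words: list[Word], min_confidence: float = 0.6  # hypothetical threshold
) -> list[tuple[float, float]]:
    """Merge consecutive low-confidence words into (start, end) spans to leave unscored."""
    regions: list[tuple[float, float]] = []
    run_start = None
    run_end = None
    for word in words:
        if word.confidence < min_confidence:
            if run_start is None:
                run_start = word.start_s  # a new low-confidence run begins here
            run_end = word.end_s
        elif run_start is not None:
            regions.append((run_start, run_end))  # a confident word closes the run
            run_start = None
    if run_start is not None:
        regions.append((run_start, run_end))  # close a run that reaches the end
    return regions


words = [
    Word("we", 0.0, 0.2, 0.95),
    Word("take", 0.2, 0.5, 0.40),
    Word("delta", 0.5, 0.9, 0.30),
    Word("dental", 0.9, 1.3, 0.90),
    Word("ppo", 1.3, 1.6, 0.50),
]
unscored = low_confidence_regions(words)
```

Any rubric anchor whose evidence falls inside one of these spans would be marked unscored instead of being graded on a doubtful transcript.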

Speaker diarization

Caller turns vs. staff turns. Unlocks talk-time ratio, interruption counts, silence durations, and turn-taking patterns — all feeding Control (02) and Discovery (03).
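
As a rough sketch of how diarized turns yield these measures, the Python below computes a staff talk-time ratio, an interruption count, and silence durations. The `Turn` structure, the overlap-based definition of an interruption, and the one-second silence threshold are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "staff" or "caller"
    start_s: float
    end_s: float


def turn_metrics(turns: list[Turn], min_silence_s: float = 1.0) -> dict:
    """Derive talk-time ratio, interruptions, and silences from diarized turns."""
    ordered = sorted(turns, key=lambda t: t.start_s)

    talk = {"staff": 0.0, "caller": 0.0}
    for turn in ordered:
        talk[turn.speaker] += turn.end_s - turn.start_s
    total = talk["staff"] + talk["caller"]

    interruptions = 0
    silences: list[float] = []
    for prev, curr in zip(ordered, ordered[1:]):
        gap = curr.start_s - prev.end_s
        if gap < 0 and curr.speaker != prev.speaker:
            interruptions += 1    # the other speaker started before this one finished
        elif gap >= min_silence_s:
            silences.append(gap)  # a pause long enough to count as silence

    return {
        "staff_talk_ratio": talk["staff"] / total if total else 0.0,
        "interruptions": interruptions,
        "silences_s": silences,
    }


turns = [
    Turn("staff", 0.0, 4.0),
    Turn("caller", 4.5, 10.5),
    Turn("staff", 10.0, 14.0),   # starts 0.5 s before the caller finishes
    Turn("caller", 16.0, 18.0),  # follows a 2.0 s pause
]
metrics = turn_metrics(turns)
```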

Moment detection & intent tagging

Each staff turn is labeled: greeting, discovery, clinical explanation, objection handling, appointment ask, confirmation, close. Missing moments are scored — a call that never reaches "appointment ask" is a detectable event.
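
Detecting a missing moment can be as simple as comparing the labels that appear against the labels a complete call should contain. In the sketch below the label strings come from the list above, but the choice of which moments count as required is an assumption. For instance, objection handling is left out because it only applies when the caller actually objects.

```python
# Assumed set of moments every new-patient call should reach, in expected order.
REQUIRED_MOMENTS = ("greeting", "discovery", "appointment ask", "confirmation", "close")


def missing_moments(
    turn_labels: list[str], required: tuple[str, ...] = REQUIRED_MOMENTS
) -> list[str]:
    """Return the required moments that never appear among the staff-turn labels."""
    seen = set(turn_labels)
    return [moment for moment in required if moment not in seen]


# A call that explains the procedure and says goodbye without ever asking for the appointment.
labels = ["greeting", "discovery", "clinical explanation", "close"]
gaps = missing_moments(labels)
```

Here `gaps` lists "appointment ask" and "confirmation", which is the detectable event the rubric scores.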

Rubric scoring

Seven dimensions scored against observable anchors. Banded (0/40/70/100), not free-text opinions. Composite Call Score is a weighted sum with Appointment Conversion dominant.
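
A minimal sketch of the weighted sum follows. The band values (0/40/70/100) and the ordering of the weights (Conversion dominant, Urgency and Control secondary) come from the text; the specific numbers in `WEIGHTS` are placeholders chosen for illustration and are not the published weights.

```python
BANDS = (0, 40, 70, 100)

# Hypothetical weights (sum to 100). Only their relative ordering is taken from the text.
WEIGHTS = {
    "greeting": 7,
    "control": 16,
    "discovery": 7,
    "accuracy": 7,
    "objections": 7,
    "urgency": 16,
    "conversion": 40,
}


def call_score(dimension_scores: dict[str, int]) -> float:
    """Weighted 0-100 composite of seven banded dimension scores."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the seven rubric dimensions")
    for name, score in dimension_scores.items():
        if score not in BANDS:
            raise ValueError(f"{name}: {score} is not one of the bands {BANDS}")
    weighted = sum(WEIGHTS[name] * score for name, score in dimension_scores.items())
    return weighted / sum(WEIGHTS.values())


example = call_score({
    "greeting": 100, "control": 70, "discovery": 70, "accuracy": 100,
    "objections": 40, "urgency": 40, "conversion": 100,
})
```

With these placeholder weights the example call scores 79.3: a booked appointment carries the composite even though objection handling and urgency landed in a low band.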

Coaching extraction

For each missed anchor: the quote, the timestamp, the staff member — plus a suggested replacement line grounded in specialty best practice. That 30-second clip is what the manager gets.

Calibration & audit

A rolling sample is re-graded by a human auditor. Agreement statistics (Cohen's kappa per dimension) are tracked. When agreement drifts, the rubric or model is adjusted — not the practice's score.
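
Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below computes it for one dimension's banded scores; the sample grades are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of calls")
    n = len(rater_a)

    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's own label distribution.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every call
    return (observed - expected) / (1 - expected)


# Invented banded scores for one dimension across eight audited calls.
ai_scores = [100, 70, 40, 100, 0, 70, 100, 40]
human_scores = [100, 70, 40, 70, 0, 70, 100, 40]
kappa = cohens_kappa(ai_scores, human_scores)
```

In this made-up sample the graders agree on seven of eight calls, which works out to a kappa of about 0.83 after correcting for chance.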

Why specialty-specific

A perio call isn't a pizza order.

Specialty new-patient calls carry more information, more friction, and more dollars than almost any other inbound conversation in healthcare. Generic call scoring misses the nuances that decide whether the appointment gets booked.

Generic call scoring

Surface-level

Warmth and professionalism in the abstract
"Did they say hello" style checklists
Flat 1–5 star manager scores
No model of the referral path
No benchmark for schedule rate or LTV
No concept of the appointment-ask window

Phone Genius

Specialty-tuned

Rubric anchors drawn from real specialty calls
Referral-source capture scored explicitly
Insurance posture: PPO, fee-for-service, medical billing
Urgency window (45–90 sec) measured, not guessed
Schedule-rate benchmarks: ~85% target in perio
Appointment conversion as the dominant weight

$8,500+ average lifetime value of one specialty new patient

~85% healthy target schedule rate for a perio practice

45–90 sec window to land the appointment ask on most new-patient calls

< 5% target ASR Word Error Rate on clean English telephony

What Phone Genius is and is not

Strong claims, clean boundaries.

A grading & coaching system

It measures how your front desk handles specialty new-patient calls against a consistent rubric, and surfaces the specific moments a manager can coach on this week.

Not a call tracking replacement

It does not replace CallRail, Dialpad, or your PBX. It grades the quality of the conversation. You keep your call tracking for routing and attribution.

Not a clinical decision tool

It does not diagnose, recommend treatment, or advise on clinical scope. Clinical accuracy means “did staff state the practice's scope correctly” — not “was the medicine right.”

Not a lie detector

It grades calls it places itself, as an AI mystery-shopper, using specialty-tuned scenarios. The sample is controlled on purpose.

Not a legal recording service

We comply with applicable consent laws for the calls we place. We don't record your real inbound traffic. Your own call-recording policy is yours.

Not a manager replacement

The output makes a manager twice as effective — not obsolete. A real person decides whether Monday's coaching moment is a one-to-one, a stand-up, or a hiring signal.

Field-tested, research-informed

Built inside specialty, not outside it.

Phone Genius was built by Cameron Full, DBA — founder of Referral Lab, fractional COO for specialty practices through Dentek, and author of a doctoral dissertation on how entrepreneurial operators recombine existing expertise to solve novel problems.

The rubric was pressure-tested against a decade of specialty operational data: treatment coordinator performance, front-desk variation, referral-source behavior, scheduling deficiencies, and the operational patterns that separate a 60% schedule rate from an 85% schedule rate.

Every dimension in the rubric exists because a specialty practice has paid a real cost for missing it — and because a trained TC coach has already coached against it.

350+ specialty practices in the operational dataset

1M+ referral records informing the benchmarks

1,000+ hours of specialty coaching feeding the rubric

10+ yrs inside perio, oral surgery, ortho, endo, prosth ops

Founding pilot

See your practice on the rubric.

We're onboarding a small cohort of specialty practices. Founding pilots lock in a discounted long-term rate, get direct input on the rubric, and receive their first graded calls inside two weeks.
