AI company building the invisible layer that handles clinical administrative work so providers don't have to

Evals Lead

New York | On-site | $170K - $220K | 5 - 10 years

We're building agents that handle the phone calls, faxes, prior auths, and scheduling loops that quietly eat half a clinic's day. As Evals Lead, you own the frameworks that tell us whether those agents are actually good enough to deploy into live patient workflows—where a hallucinated prior auth status isn't a benchmark metric, it's a real person waiting for care. You'll design evaluation pipelines that blend deterministic correctness checks with LLM-as-judge rubrics, surface failure patterns before they reach production, and build the data flywheel that makes our models measurably safer and more reliable every week.

What we're looking for

5+ years in QA, test engineering, or ML evaluation, with at least 2 years focused on evaluating LLM-based systems or conversational AI in production
Strong SQL and Python skills; comfortable working across MongoDB, PostgreSQL, and GCP to instrument evaluation infrastructure
Experience designing both automated regression suites and qualitative eval rubrics that capture task completion, tone, and clinical safety
Ability to define metrics that matter for healthcare workflows: for example, high recall on prior auth submission errors matters more than abstract accuracy
Clear communication instincts; you can explain to engineers and product stakeholders why a model failed, without hand-waving
Must be based in the United States.

Tech stack

GCP, MongoDB, PostgreSQL
