AI company building the invisible layer that handles clinical administrative work so providers don't have to
Evals Lead
We're building agents that handle the phone calls, faxes, prior auths, and scheduling loops that quietly eat half a clinic's day. As Evals Lead, you own the frameworks that tell us whether those agents are actually good enough to deploy into live patient workflows, where a hallucinated prior auth status is not just a missed benchmark metric but a real person waiting for care. You'll design evaluation pipelines that blend deterministic correctness checks with LLM-as-judge rubrics, surface failure patterns before they reach production, and build the data flywheel that makes our models measurably safer and more reliable every week.
What we're looking for
- 5+ years in QA, test engineering, or ML evaluation, with at least 2 years focused on evaluating LLM-based systems or conversational AI in production
- Strong SQL and Python skills; comfortable working across MongoDB, PostgreSQL, and GCP to instrument evaluation infrastructure
- Experience designing both automated regression suites and qualitative eval rubrics that capture task completion, tone, and clinical safety
- Ability to define metrics that matter for healthcare workflows; for example, high recall on prior auth submission errors matters more than aggregate accuracy
- Clear communication instincts; you can explain to engineers and product stakeholders why a model failed, without hand-waving
- Must be based in the United States