AI company building the invisible layer that handles clinical administrative work so providers don't have to

Evals Lead

New York | On-site | $170K - $220K | 5 - 10 years

We're building agents that handle the phone calls, faxes, prior auths, and scheduling loops that quietly eat half a clinic's day. As Evals Lead, you own the frameworks that tell us whether those agents are actually good enough to deploy into live patient workflows—where a hallucinated prior auth status isn't a benchmark metric, it's a real person waiting for care. You'll design evaluation pipelines that blend deterministic correctness checks with LLM-as-judge rubrics, surface failure patterns before they reach production, and build the data flywheel that makes our models measurably safer and more reliable every week.

What we're looking for

  • 5+ years in QA, test engineering, or ML evaluation, with at least 2 years focused on evaluating LLM-based systems or conversational AI in production
  • Strong SQL and Python skills; comfortable working across MongoDB, PostgreSQL, and GCP to instrument evaluation infrastructure
  • Experience designing both automated regression suites and qualitative eval rubrics that capture task completion, tone, and clinical safety
  • Ability to define metrics that matter for healthcare workflows—high recall on prior auth submission errors matters more than abstract accuracy
  • Clear communication instincts; you can explain why a model failed to engineers and product stakeholders without hand-waving
  • Must be based in the United States

Tech stack

GCP, MongoDB, PostgreSQL

Apply for this role

This role is one we're recruiting for on behalf of a client company; the client's identity is kept confidential at this stage. A Fluency recruiter will follow up with details.