Early-stage AI company building LLM evaluation infrastructure

Evaluations Engineer

San Francisco · On-site · $145K - $190K · 1 - 4 years
Join a 15-person team focused on evaluating and benchmarking large language models. You'll build the backend systems and frontend interfaces that help organizations assess model performance and quality. The role is full-stack, spanning Python services, cloud infrastructure, and web applications.

What we're looking for

  • 1-4 years of software development experience with Python
  • Experience with Django backend development or equivalent web frameworks
  • Familiarity with AWS, Git, and modern software development workflows
  • Experience with React for frontend development
  • Understanding of machine learning concepts and LLM evaluation methodologies
  • Ability to work on-site in San Francisco

Tech stack

Python, Django, React, AWS, Git, OpenAI

Apply for this role

We're recruiting for this role on behalf of a client company whose identity is confidential at this stage. A Fluency recruiter will follow up with details.