Early-stage AI company building LLM evaluation infrastructure

Evaluations Engineer

San Francisco | On-site | $145K - $190K | 1 - 4 years

Join a 15-person team focused on evaluating and benchmarking large language models. You'll work on both the backend systems and the frontend interfaces that help organizations assess model performance and quality. The role is full-stack, spanning Python services, cloud infrastructure, and web applications.

What we're looking for

1-4 years of software development experience with Python
Experience with Django backend development or equivalent web frameworks
Familiarity with AWS, Git, and modern software development workflows
Experience with React for frontend development
Understanding of machine learning concepts and LLM evaluation methodologies
Ability to work on-site in San Francisco

Tech stack

Python, Django, React, AWS, Git, OpenAI
