Early-stage AI company building LLM evaluation infrastructure
Evaluations Engineer
Join a 15-person team focused on evaluating and benchmarking large language models. You'll build the backend systems and frontend interfaces that help organizations assess model performance and quality. The role is full-stack, spanning Python services, cloud infrastructure, and web applications.
What we're looking for
- 1-4 years of software development experience with Python
- Experience with Django backend development or equivalent web frameworks
- Familiarity with AWS, Git, and modern software development workflows
- Experience with React for frontend development
- Understanding of machine learning concepts and LLM evaluation methodologies
- Ability to work on-site in San Francisco