THE PROBLEM

36 million businesses in America need insurance. 77% are underinsured. 40% have no coverage at all.

We're building 90%+ AI-led commercial insurance distribution. ~1,000 new customers/month, 100x growth in a year. Our agents handle intake, sales, service, voice, and submission packaging. They get better every week - but "better" is only true if we can prove it.

Today, AI engineers ship a prompt change, a tool change, or a new model and judge it by vibe: "feels worse," "feels better," "the demo passed." Vibes don't survive Series B.

Build the evals that turn agent quality from a vibe into a number. Catch every regression before it ships.

THE THESIS

AI only compounds when the company can tell whether it is getting better. Demos do not count. Vibes do not count. The bar is a real customer case, a real transcript, a real failure mode, and a regression suite that catches the same mistake forever.

You will build the evals that make Harper's agents trustworthy. When the agent improves, we know. When it regresses, we know before the customer does. That is how we scale judgment without scaling headcount.

THE ROLE

Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.

Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.

WHAT YOU'LL DO

- Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice
- Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.
- Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.
- Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.
- Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.
- Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.

YOU MIGHT BE A FIT IF…

- You've built or operated eval frameworks for production LLM systems
- You can describe a specific regression an eval suite you built caught - and how it would have leaked otherwise
- You've designed an LLM-as-judge rubric that survived human calibration
- You can debug a hallucination by reading transcripts, not aggregate dashboards
- You write code with AI daily and have strong opinions on which agent behaviors matter
- You're 3–6 years into your career

REQUIREMENTS

- 3–6 years software engineering experience
- Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets
- Familiarity with at least one major eval framework
- Strong written communication - eval rubric docs, failure-mode taxonomies
- Based in San Francisco or willing to relocate

NICE TO HAVE

- Open-source contribution to eval frameworks
- Red-team / adversarial-testing experience for LLM systems
- Voice AI eval experience (latency, interruption handling, transcription accuracy)
- ML eval / observability background

COMPENSATION

- OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)
- Equity: competitive equity, so you share in the company you are helping build
- Location: San Francisco, in-office

BENEFITS

- Health, dental, and vision insurance
- Commuter benefits
- Team meals and snacks

THE PROCESS

1. Founder call (15 min) - Mission, pace, scope
2. Tech Lead deep-dive (60 min) - Eval architecture, grader design, real failure modes
3. Super Day on-site - full-day simulation of working at Harper: live eval-suite design, code review, team context, and founder/CTO time
4. Founder + Tech Lead offer conversation - No committee. Best offer, first.

TO APPLY

If you've turned vibes into a number - built an eval suite that caught a regression a model upgrade silently introduced - send your resume, the framework, and a transcript of a failure you found that nobody else did.

Senior Member of Technical Staff, AI Quality
