Hiring an AI developer in 2026 is harder than hiring a regular backend engineer. The field moves faster than any interview process can keep up. Most question banks you find online are either stale (asking about RNNs in a transformer-native world) or pointlessly abstract (derive the attention equation by hand). Neither tells you whether the candidate can actually build something that works in production.
At Workforce Next we screen AI developers every week, both for our own AI developer engagements and for dedicated RAG and LangChain roles. Here is the question framework we actually use, and why each layer matters.
Skip these first
Before getting into what to ask, here is what to stop asking:
"Explain how a transformer works." Every AI developer has read the Illustrated Transformer. Memorized answers tell you nothing about judgment.
"Implement backpropagation from scratch." Unless they are building a training framework, they will never do this on your product.
"What is the difference between GPT-4 and Claude?" This changes every quarter. A better signal is how they think about choosing models, not which one they used last.
Layer 1: Can they reason about problem shape?
The single highest-signal question we ask: "Here is a business problem. Walk me through whether it needs an LLM, a classical ML model, or just plain software."
Give them something like: "Our support team categorizes incoming tickets into 12 tags. They process 500 per day. Would you use an LLM?" A weak candidate jumps straight to "I would use GPT-4 with few-shot prompting." A strong candidate asks about accuracy requirements, cost per ticket, latency, and whether a fine-tuned classifier would beat an LLM on both cost and accuracy at that volume.
This is the same instinct behind context-first matching. Tech stack is easy. Judgment is what actually ships.
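The strong candidate's instinct above is a back-of-envelope calculation. A minimal sketch, with illustrative per-call and hosting prices that are assumptions, not real vendor pricing:

```python
# Back-of-envelope cost comparison for the ticket-tagging example.
# All dollar figures are illustrative assumptions.

def monthly_llm_cost(tickets_per_day: int, cost_per_call: float) -> float:
    """LLM API cost scales linearly with ticket volume."""
    return tickets_per_day * 30 * cost_per_call

def monthly_classifier_cost(hosting: float) -> float:
    """A small fine-tuned classifier is roughly a flat hosting cost."""
    return hosting

llm = monthly_llm_cost(500, 0.01)    # 500 tickets/day at an assumed $0.01/call
clf = monthly_classifier_cost(40.0)  # assumed $40/month for a small instance
print(f"LLM: ${llm:.0f}/mo, classifier: ${clf:.0f}/mo")
```

At 500 tickets a day the LLM is still cheap in absolute terms, which is exactly why the candidate has to ask about accuracy and latency too: volume alone does not settle the question.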
Layer 2: Have they shipped something that survived real users?
Ask: "Tell me about an AI feature you shipped to real users. What broke first?"
The answer reveals whether they have operated an AI system in production, or just built demos. Real answers sound like: "Our RAG system worked great in eval but users started asking questions outside the indexed corpus and the model hallucinated confidently. We added a retrieval confidence threshold and a fallback." Demo answers sound like: "It worked on the test set."
Follow up with: "How did you know it was broken?" You want to hear about eval sets, user feedback loops, or observability. If they only noticed when a user complained, they have not built production AI.
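The "confidence threshold and a fallback" fix from the real answer above can be sketched in a few lines. `Chunk` and the threshold value here are hypothetical stand-ins for whatever your vector store returns:

```python
# Minimal sketch of a retrieval confidence threshold with a fallback,
# assuming the vector store returns chunks with a similarity score.
from typing import NamedTuple

class Chunk(NamedTuple):
    text: str
    score: float  # e.g. cosine similarity from the vector store

FALLBACK = "I couldn't find that in our documentation. Routing you to a human."

def answer(query: str, chunks: list[Chunk], threshold: float = 0.75) -> str:
    """Refuse to answer when the best retrieved chunk is weak."""
    if not chunks or max(c.score for c in chunks) < threshold:
        return FALLBACK  # don't let the model hallucinate off-corpus
    context = "\n".join(c.text for c in chunks if c.score >= threshold)
    return f"[LLM call with context: {context[:60]}]"  # placeholder for the call
```

The point of the sketch is the shape of the fix, not the numbers: the threshold itself should come from an eval set, not a guess.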
Layer 3: Can they debug an AI system?
Present a failure scenario: "Your RAG chatbot is giving wrong answers 20% of the time in production. Walk me through how you would diagnose it."
Listen for a structured debugging process: is it the retrieval (wrong chunks pulled), the chunking strategy (context split mid-concept), the embedding model (semantically similar but topically wrong), the prompt (ambiguous instructions), or the model itself (weak reasoning on the domain)? A strong AI developer has a mental model for each failure mode and knows which logs or evals to pull to isolate the layer.
Bonus signal: they mention they would run an eval set before changing anything, rather than guessing at fixes and re-deploying.
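The "run an eval before changing anything" instinct looks like scoring each pipeline stage separately on a labeled set. A sketch, where the retriever is a toy stand-in for your own pipeline hook:

```python
# Layer-by-layer diagnosis: measure retrieval quality on a labeled eval
# set before touching prompts or models. The retriever here is a toy
# stand-in; field names ("question", "gold_chunk") are assumptions.

def retrieval_hit_rate(evals: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of eval questions whose gold chunk appears in the top k."""
    hits = sum(e["gold_chunk"] in retrieve(e["question"])[:k] for e in evals)
    return hits / len(evals)

evals = [
    {"question": "How do refunds work?", "gold_chunk": "refund-policy"},
    {"question": "Reset my password",    "gold_chunk": "password-reset"},
]
toy_retrieve = lambda q: ["refund-policy", "shipping", "faq"]

rate = retrieval_hit_rate(evals, toy_retrieve)
# A low hit rate points at retrieval or chunking; a high one shifts
# suspicion to the prompt or the model.
```

Analogous per-stage checks (chunk boundaries, prompt variants, model swaps) isolate the failing layer before any redeploy.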
Layer 4: How do they think about cost?
Ask: "This feature costs us $0.12 per query. We have 1 million queries a month. How would you cut the cost in half without hurting quality?"
Good answers include: route simpler queries to a smaller model, cache embeddings and semantically similar queries, shorten prompts by trimming retrieved context, batch requests where possible, move metadata filtering out of the LLM into retrieval. If they only say "use a cheaper model," they have not operated a real AI product.
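The first of those answers, routing simpler queries to a smaller model, can be sketched directly. Routing on query length is a deliberately crude stand-in for a real complexity classifier, and the per-query prices are assumptions:

```python
# Sketch of model routing for cost. Prices and the length-based routing
# rule are illustrative assumptions, not real vendor pricing.

PRICES = {"small": 0.01, "large": 0.12}  # assumed $/query

def route(query: str) -> str:
    """Send short, simple-looking queries to the cheap model."""
    return "small" if len(query.split()) <= 12 else "large"

def blended_cost(share_small: float) -> float:
    """Average cost per query when a fraction routes to the small model."""
    return share_small * PRICES["small"] + (1 - share_small) * PRICES["large"]

# If 60% of queries route small, blended cost drops from $0.12 to
# roughly 0.6 * 0.01 + 0.4 * 0.12 = $0.054 per query.
```

Even this crude router beats the 50% target from the question, which is why "use a cheaper model" alone is an incomplete answer: the interesting work is deciding which queries can afford the cheaper one.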
Layer 5: Do they have taste?
Taste is the hardest thing to screen for, but the most important. Ask: "Show me a prompt you are proud of and walk me through why you wrote it that way."
A good prompt engineer can explain tradeoffs: why they used XML tags vs markdown, why they put examples before or after the instructions, why they structured output one way vs another. A weak one will say "I just iterated until it worked." Both can ship, but the first one will ship faster and debug faster.
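For concreteness, here is the kind of deliberately structured prompt the question probes for, with the rationale a strong candidate can articulate. The tags, ordering, and category names are one reasonable choice for the ticket-tagging example, not the only correct one:

```python
# Example of a deliberately structured classification prompt.
# Tag names and categories are illustrative assumptions.

PROMPT = """\
<instructions>
Tag the support ticket with exactly one category from <categories>.
Respond with only the category name.
</instructions>

<categories>
billing, shipping, returns, technical, other
</categories>

<examples>
Ticket: "My card was charged twice" -> billing
Ticket: "Package never arrived" -> shipping
</examples>

Ticket: "{ticket}"
"""
# Rationale: XML tags make each section unambiguous, examples follow the
# instructions so they read as demonstrations, and the output is pinned
# to a single category name so parsing downstream is trivial.

filled = PROMPT.format(ticket="I want to return my order")
```

A candidate with taste can defend each of those choices, or argue for a different structure, which is exactly the signal the question is after.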
What this looks like end-to-end
A full AI developer interview at our scale takes about 90 minutes: 20 minutes on problem-shape reasoning, 20 minutes on a real shipped feature, 30 minutes on a live debugging exercise, and 20 minutes on cost and taste. We skip the whiteboard algorithm round entirely for AI roles. It tests nothing the job requires.
If you are hiring your first AI developer, the highest-leverage thing you can do is design the interview around judgment and production experience, not model trivia. That is the same approach we take when matching dedicated AI developers into client teams. If you want help, reach out and we will walk you through our screening loop.
