Quick take: The single best question is “How do you handle AI hallucinations in production?” Experienced developers will describe specific mitigation strategies like confidence thresholds, human review workflows, or retrieval-augmented generation (RAG) grounding. Inexperienced ones will downplay the issue or promise perfect accuracy. These 8 questions expose real expertise versus sales pitches.
Overview: Critical Questions at a Glance
| Question | What It Reveals | Red Flag Answer |
|---|---|---|
| How do you handle hallucinations? | Production experience and risk awareness | “Our prompts prevent that” or “Not an issue” |
| What’s your approach to fine-tuning vs RAG? | Architecture understanding | Always recommends one without discussing tradeoffs |
| How do you estimate AI project costs? | Pricing transparency and experience | Vague answers or reluctance to break down costs |
| What’s your process for evaluating model performance? | Quality assurance rigor | “We test it manually” with no metrics |
| How do you manage prompt versioning? | Production engineering maturity | “We keep prompts in a document” |
| What’s your approach to data labeling? | Data quality understanding | ”We can use synthetic data for everything” |
| How do you handle rate limits and API failures? | Infrastructure reliability | ”That hasn’t been a problem” |
| What AI projects have failed and why? | Honesty and learning from mistakes | ”All our projects succeed” |
1. How do you handle AI hallucinations in production?
This question reveals whether candidates have shipped AI to real users. Hallucinations, confident but false outputs, are an unavoidable property of current models. Experienced developers have battle-tested strategies for managing this risk.
What good answers sound like: “We use RAG to ground responses in source documents and include citations. For high-stakes outputs, we add confidence scoring and flag low-confidence responses for human review. We also log all outputs for spot-checking and maintain a blocklist of known problematic patterns.”
Red flags: Claims they’ve eliminated hallucinations through prompt engineering alone. Dismissing hallucinations as rare or not applying to their use case. Promising 100% accuracy. These indicate inexperience or dishonesty.
Why it matters: Hallucinations can destroy customer trust, create legal liability, or generate dangerous misinformation. Developers who haven’t grappled with this in production will underestimate the effort required to ship safely.
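A minimal sketch of the routing logic a good answer describes: responses without citations from the retrieval layer, or below a confidence threshold, get flagged for human review. The `ModelResponse` shape and the 0.7 threshold are illustrative assumptions, not a standard API; real thresholds should be tuned on a labeled sample.

```python
from dataclasses import dataclass, field

@dataclass
class ModelResponse:
    text: str
    confidence: float                       # hypothetical score in [0, 1], e.g. from a verifier pass
    citations: list[str] = field(default_factory=list)  # source-document IDs from the RAG layer

REVIEW_THRESHOLD = 0.7  # assumption: tuned per use case, not a universal constant

def route_response(resp: ModelResponse) -> str:
    """Decide whether a response ships directly or goes to human review."""
    if not resp.citations:
        return "human_review"   # ungrounded answers are never auto-shipped
    if resp.confidence < REVIEW_THRESHOLD:
        return "human_review"   # low-confidence answers get flagged
    return "auto_send"
```

Logging every routed response alongside its decision gives you the audit trail for the spot-checking mentioned above.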
2. What’s your approach to choosing between fine-tuning and RAG?
This question tests architectural judgment. Both fine-tuning and Retrieval-Augmented Generation solve the problem of customizing AI for your domain, but they have different tradeoffs in cost, update speed, and performance.
What good answers sound like: “It depends on your use case. RAG is better when your knowledge base updates frequently or you need to cite sources. Fine-tuning makes sense for specialized language, tone matching, or when you need consistent formatting. For most startups, I recommend starting with RAG because it’s cheaper and faster to iterate.”
Red flags: Always recommending fine-tuning regardless of use case, usually because it justifies higher fees. Having no opinion or framework for making this decision. Confusing fine-tuning with prompt engineering.
Why it matters: This decision affects your budget, iteration speed, and maintenance burden. Developers who default to the most expensive option without justification are optimizing for their revenue, not your success.
3. How do you estimate AI project costs and timelines?
This question exposes whether developers understand the operational economics of AI. Costs include API calls, compute for embeddings, vector database storage, and human labeling—not just development hours.
What good answers sound like: “I break costs into development and ongoing operations. For operations, I estimate tokens per request, requests per month, and model pricing. For example, if you’re processing 1,000 support tickets daily at 2,000 tokens each, that’s 2 million tokens per day, or roughly 60 million per month. At $1 per million tokens, that works out to about $60/month; plug in current rates for whichever model you choose, since pricing changes frequently. Development includes prompt engineering, testing, and integration, typically 4-6 weeks for an MVP.”
Red flags: Refusing to provide estimates until after a paid discovery phase. Giving only development costs without discussing operational expenses. Wildly optimistic timelines that don’t account for testing, edge cases, or integration.
Why it matters: Surprises in cost or timeline destroy trust and cash flow. Developers who can’t estimate accurately either lack experience or are deliberately vague to win the contract.
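The token math in a good answer is simple enough to sanity-check yourself. A small helper, using an illustrative $1-per-million-token rate (real pricing varies by model and provider and changes often):

```python
def monthly_token_cost(requests_per_day: int, tokens_per_request: int,
                       price_per_million_tokens: float, days: int = 30) -> float:
    """Estimate monthly API spend from daily volume and per-request token usage."""
    tokens_per_month = requests_per_day * tokens_per_request * days
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# 1,000 tickets/day at 2,000 tokens each = 60M tokens/month;
# at an illustrative $1 per million tokens, that is $60/month.
estimate = monthly_token_cost(1_000, 2_000, price_per_million_tokens=1.0)
```

Note this lumps input and output tokens together; most providers price them separately, so a more careful estimate splits the two.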
4. What’s your process for evaluating model performance?
This question reveals their QA rigor. Good AI development includes systematic testing with metrics, not just manual spot-checking. You want developers who measure performance quantitatively.
What good answers sound like: “We start with a labeled test set of 200-500 examples that cover edge cases. We measure accuracy, precision, recall, and latency. For generative tasks, we use a combination of automated metrics like BLEU or ROUGE and human evaluation on a sample. We track these metrics over time to catch regressions when prompts or models change.”
Red flags: “We test it manually and it looks good.” No mention of test sets, metrics, or systematic evaluation. Claiming AI is too subjective to measure quantitatively. Over-relying on synthetic test data instead of real user scenarios.
Why it matters: Without systematic evaluation, you have no idea if changes improve or degrade performance. You’ll ship broken updates and discover problems only when users complain.
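Precision and recall on a labeled test set are straightforward to compute. A minimal sketch for a binary task (1 = positive), with no external dependencies:

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Precision and recall for a binary labeled test set (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged items, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were caught
    return precision, recall
```

Running this on every prompt or model change, and tracking the numbers over time, is what catches the regressions described above.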
5. How do you manage prompt versioning and changes?
This question tests production engineering maturity. Prompts are code—they need version control, testing, and rollback capabilities. Treating prompts as throwaway text leads to chaos.
What good answers sound like: “We store prompts in version control with Git, just like code. Each prompt has unit tests with expected outputs. When we change a prompt, we run the test suite to catch regressions. We deploy prompts through a staging environment before production. If a prompt update degrades performance, we can roll back to the previous version.”
Red flags: Keeping prompts in Google Docs, Notion, or hardcoded in files without version control. No testing process for prompt changes. Making prompt edits directly in production. Treating prompts as configuration rather than critical logic.
Why it matters: Prompt changes can have massive, unexpected effects on output quality. Without version control and testing, you’ll break things constantly and won’t know which change caused the problem.
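A prompt regression test can be as simple as asserting that critical instructions survive edits. A hypothetical example, with `SUPPORT_PROMPT_V3` standing in for a template stored in Git alongside the code:

```python
# Hypothetical prompt template, versioned in Git like any other source file.
SUPPORT_PROMPT_V3 = (
    "You are a support assistant. Answer only from the provided context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "If the context does not contain the answer, say 'I don't know.'"
)

def render_prompt(template: str, **fields: str) -> str:
    """Fill a prompt template's placeholders with runtime values."""
    return template.format(**fields)

def test_prompt_keeps_grounding_instruction() -> None:
    # Regression test: a prompt edit must never drop the grounding rule
    # or the explicit uncertainty escape hatch.
    rendered = render_prompt(SUPPORT_PROMPT_V3, context="(docs)", question="(q)")
    assert "only from the provided context" in rendered
    assert "I don't know" in rendered
```

Tests like this run in CI on every prompt change, the same way unit tests guard code changes.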
6. What’s your approach to data labeling and quality?
This question assesses their understanding that AI quality depends on data quality. Garbage data produces garbage AI, no matter how sophisticated the model. Good developers have disciplined labeling processes.
What good answers sound like: “We start by defining clear labeling guidelines with examples. We use multiple annotators per item to measure inter-annotator agreement—if annotators disagree, the guidelines aren’t clear enough. We prefer a smaller set of high-quality labels over a large set of noisy labels. For ongoing projects, we sample and review labels regularly to catch quality drift.”
Red flags: Planning to use only synthetic or AI-generated training data without human review. Claiming you don’t need labeled data because they’ll use unsupervised learning. Outsourcing labeling without quality checks. Assuming existing data is clean enough without auditing it.
Why it matters: Poor data quality is the #1 reason AI projects fail to meet expectations. Developers who cut corners on data will deliver models that look good in demos but fail in production.
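Inter-annotator agreement is commonly reported as Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. A minimal sketch:

```python
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n  # raw agreement rate
    labels = set(a) | set(b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum((a.count(lbl) / n) * (b.count(lbl) / n) for lbl in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0
```

A kappa near 1.0 means the guidelines are clear; values much below roughly 0.6 are a common sign that the labeling instructions need rework (the exact cutoff is a judgment call, not a standard).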
7. How do you handle rate limits, API failures, and downtime?
This question tests infrastructure reliability thinking. Production AI systems depend on external APIs that have rate limits, occasional failures, and maintenance windows. Robust systems handle these gracefully.
What good answers sound like: “We implement exponential backoff and retry logic for transient failures. For rate limits, we use request queuing and batching. We have fallback strategies—if the primary model is down, we fall back to a simpler model or cached responses. We monitor error rates and latency in production and alert when they exceed thresholds.”
Red flags: “That hasn’t been a problem for us.” No mention of error handling, retries, or fallbacks. Assuming API providers have 100% uptime. Planning to call APIs synchronously in user-facing requests without timeouts.
Why it matters: When (not if) APIs fail or slow down, your users shouldn’t suffer. Systems without error handling create terrible user experiences and emergency fire drills.
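The retry logic described above fits in a few lines. A sketch with exponential backoff and jitter, assuming for simplicity that any exception from `call` is transient (production code should catch specific error types and respect the provider's rate-limit headers):

```python
import random
import time

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the failure to the caller
            # Sleep base, 2x base, 4x base, ... plus jitter to avoid
            # synchronized retry storms across many clients.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A fallback model or cached response would slot in where the final `raise` is, so users see a degraded answer instead of an error.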
8. Tell me about an AI project that failed and what you learned.
This question tests honesty, self-awareness, and ability to learn from mistakes. Everyone in AI has had projects that didn’t work out. Developers who claim perfect success are either lying or lack experience.
What good answers sound like: “We built a document classification system that worked great in testing but degraded in production because real documents were messier than our test set. We learned to test on production data earlier and build confidence scoring so the system could flag uncertain cases. Now we always include a ‘not sure’ option rather than forcing the model to choose.”
Red flags: Claiming all projects have succeeded. Blaming failures entirely on clients or external factors without acknowledging their own missteps. Being unable to articulate specific lessons learned. Getting defensive about the question.
Why it matters: AI is still an immature field with many unknowns. Developers who don’t acknowledge failure either lack experience or won’t be honest when problems arise in your project.
How We Chose These Questions
We interviewed 30+ non-technical founders who hired AI developers and asked what they wished they’d known to ask. We then validated these questions with experienced AI engineers to ensure they effectively separate skilled practitioners from oversellers.
These questions prioritize:
- Production experience over theoretical knowledge
- Risk awareness over optimistic promises
- Systematic processes over ad-hoc approaches
- Honesty about limitations over sales pitches
We excluded questions about specific models or techniques (which change rapidly) in favor of questions about judgment, process, and experience that remain relevant.
FAQ
Should I ask technical questions about architectures or algorithms? No, unless you have the expertise to evaluate the answers. These 8 questions reveal expertise through judgment and process, which non-technical founders can assess. Focus on how they think, not what they know.
What if the developer gives great answers but has a thin portfolio? Great answers with a thin portfolio beat weak answers with a long one. AI development for businesses is new—many excellent developers are early in their AI careers. Look for strong engineering fundamentals and honest communication about their experience level.
How many of these questions should I ask? Ask all 8. These cover the critical areas where inexperience causes project failure: production risks, architecture decisions, cost management, quality assurance, engineering practices, data quality, reliability, and learning mindset. Missing any one can sink a project.
What if a developer refuses to answer without an NDA? These questions don’t require sharing confidential client information. A developer can discuss their approach to hallucinations or cost estimation without naming clients or revealing proprietary techniques. Refusing to answer suggests lack of experience or unreasonable secrecy.
Should I ask these questions to agencies or just individual developers? Ask agencies and individuals. For agencies, note whether the person answering (likely a salesperson) can answer technically or needs to defer to technical staff. If they defer, insist on speaking to the actual developers who will work on your project.
Key Takeaways
- The hallucinations question is your best single filter for production experience
- Good developers discuss tradeoffs and limitations openly rather than overselling
- Process questions (versioning, testing, labeling) reveal engineering maturity
- Developers who can’t estimate costs lack experience or are deliberately vague
- Everyone in AI has had projects fail—learn what they learned from failure
- Technical knowledge matters less than judgment, honesty, and systematic thinking
- Ask all 8 questions—each covers a critical failure mode for AI projects
- Use these questions with both individual developers and agencies
Ready to Hire AI Development Talent?
SFAI Labs provides vetted AI developers who can answer every question on this list with production-tested strategies. We’ve shipped 50+ AI features and learned from every failure. Book a free 30-minute consultation to discuss your hiring needs.