All Cases
RAG System Status
Retrieval — Latest Run
Generation — Latest Run
Behavioral Tests — Latest Run
Audio Parity — Latest Run
Golden Dataset
⬇ Download JSONL
Add Test Case
1 · Query › 2 · Chunks › 3 · Answer › 4 · Save
What is a test case?
A question a resident would ask the condo bot. The system will automatically verify the bot answers it correctly. This case will be evaluated on every future run.
Run Evals
Retrieval: measures how well the system finds the correct chunks. No LLM is involved, so it is fast and cheap. Only needs Voyage + Supabase.
idle
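As an illustration of what "finding the correct chunks" can mean in practice, here is a minimal sketch of two common retrieval metrics, recall@k and mean reciprocal rank. This page does not say which metrics the retrieval eval actually computes, so the metric choice, the function names, and the "retrieved"/"expected" case keys are all assumptions made for the example.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the expected chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first expected chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(cases: List[Dict], k: int = 5) -> Dict[str, float]:
    """Average recall@k and reciprocal rank over a list of test cases.

    Each case is a dict with hypothetical keys (not taken from the real dataset):
      "retrieved": ranked chunk ids returned by the retriever
      "expected":  chunk ids a correct retrieval should include
    """
    if not cases:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(c["retrieved"], set(c["expected"]), k) for c in cases]
    rranks = [reciprocal_rank(c["retrieved"], set(c["expected"])) for c in cases]
    return {
        "recall_at_k": sum(recalls) / len(cases),
        "mrr": sum(rranks) / len(cases),
    }
```

Both metrics need only the ranked chunk ids and the expected ids from each golden case, which is consistent with a run that involves no LLM call.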
Generation: calls the real bot and evaluates its responses with an AI judge. Uses Claude + n8n, so it is slower and incurs API cost.
idle
Behavioral Tests: runs isolation, multi-turn, error-handling, and format tests. Uses Claude + n8n, but runs only the behavioral cases, so it is faster than a full generation run.
idle
Audio Parity: checks that audio and text inputs give semantically equivalent answers. Uses Claude + n8n + Groq. Requires audio fixtures in tests/audio/; generate them once with: GROQ_API_KEY=xxx node tests/scripts/generate-audio-fixtures.js
idle