My first instinct was to keep adding to the prompt.
If the format was unstable, I added more formatting rules. If the answer was sometimes wrong, I added a stronger reminder to check it. If the image description was vague, I added another section about image requirements.
It worked for a while. Then the prompt slowly turned into a pile of rules that nobody wanted to touch.
Zairoo’s question generator is not just writing text. It generates isomorphic questions for Singapore Primary Mathematics. A usable question needs the right answer, a unique solution, the same learning outcome, meaningful variation, and image semantics that actually match the question.
One large prompt cannot hold all of that responsibility.
I eventually split the work into layers:
Generate creates candidate questions.
Verify checks whether the question is solvable, has the right answer, and has a unique solution.
Rubric checks learning outcome, variation quality, batch diversity, and image compatibility.
Benchmark compares models and prompt changes against fixed real question samples.
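To make the split concrete, here is a minimal sketch of how layers like these could be wired together. It is not Zairoo's actual code: `Question`, `LayerResult`, `verify`, `rubric`, `run_layers`, and the `toy_solve` stub are all hypothetical names invented for illustration, and the rubric check is reduced to a single learning-outcome comparison.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Question:
    # Hypothetical shape of a generated candidate question.
    text: str
    answer: int
    learning_outcome: str


@dataclass
class LayerResult:
    layer: str
    passed: bool
    reason: Optional[str] = None  # named failure, e.g. "wrong_answer"


def toy_solve(text: str) -> List[int]:
    # Stand-in solver (assumption): only handles "a + b" style questions.
    left, right = text.split("+")
    return [int(left) + int(right)]


def verify(q: Question, solve: Callable[[str], List[int]]) -> LayerResult:
    """Verify layer: solvable, unique solution, stated answer matches."""
    solutions = solve(q.text)
    if not solutions:
        return LayerResult("verify", False, "unsolvable")
    if len(set(solutions)) > 1:
        return LayerResult("verify", False, "non_unique_solution")
    if solutions[0] != q.answer:
        return LayerResult("verify", False, "wrong_answer")
    return LayerResult("verify", True)


def rubric(q: Question, expected_outcome: str) -> LayerResult:
    """Rubric layer, reduced here to a learning-outcome check."""
    if q.learning_outcome != expected_outcome:
        return LayerResult("rubric", False, "learning_outcome_drift")
    return LayerResult("rubric", True)


def run_layers(
    q: Question, checks: List[Callable[[Question], LayerResult]]
) -> List[LayerResult]:
    """Run checks in order and stop at the first failing layer."""
    results: List[LayerResult] = []
    for check in checks:
        result = check(q)
        results.append(result)
        if not result.passed:
            break
    return results


checks = [
    lambda q: verify(q, toy_solve),
    lambda q: rubric(q, "addition-within-10"),
]
good = Question("3 + 4", 7, "addition-within-10")
bad = Question("3 + 4", 8, "addition-within-10")
good_results = run_layers(good, checks)
bad_results = run_layers(bad, checks)
```

The point of the shape is that each layer returns a named reason instead of a bare pass or fail, so a rejected question always says which layer rejected it and why.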
Once those layers were separate, failures became easier to talk about.
Before that, it was tempting to say, “this model is not good enough.” Now the failure can be more specific: the answer is wrong, the question drifted from the learning outcome, the image is not suitable, or the batch has too little variation.
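One way to keep those failure names honest is to give them a fixed vocabulary and count them per batch. The sketch below assumes the four categories named above; the `Failure` enum and the `summarize` helper are illustrative names, not part of any existing codebase.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Failure(Enum):
    # Hypothetical labels, one per failure mode described above.
    WRONG_ANSWER = "wrong_answer"
    OUTCOME_DRIFT = "learning_outcome_drift"
    IMAGE_UNSUITABLE = "image_unsuitable"
    LOW_VARIATION = "low_batch_variation"


def summarize(failures: Iterable[Failure]) -> Dict[str, int]:
    """Count failures by category, omitting categories that never occurred."""
    counts = Counter(failures)
    return {f.value: counts[f] for f in Failure if counts[f]}


batch = [Failure.WRONG_ANSWER, Failure.IMAGE_UNSUITABLE, Failure.WRONG_ANSWER]
summary = summarize(batch)
```

A summary like this turns "the model is not good enough" into a statement about which category dominates a batch.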
I trust this shape more now. The prompt still matters, but it should express intent, not carry the whole quality system.
Quality has to be recorded, compared, replayed, and rejected when needed.
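As a rough sketch of what "recorded, compared, and rejected" might look like in code, assume each benchmark run stores a pass/fail outcome per fixed sample id. `RunRecord`, `pass_rate`, `regressions`, and `accept` are hypothetical names, and the acceptance rule (no regressions, pass rate not lower) is only one possible policy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    label: str                 # e.g. a prompt version or model name
    outcomes: Dict[str, bool]  # fixed sample id -> passed every layer


def pass_rate(run: RunRecord) -> float:
    """Fraction of fixed samples that passed in this run."""
    return sum(run.outcomes.values()) / len(run.outcomes)


def regressions(baseline: RunRecord, candidate: RunRecord) -> List[str]:
    """Sample ids that passed in the baseline but not in the candidate."""
    return sorted(
        sample_id
        for sample_id, ok in baseline.outcomes.items()
        if ok and not candidate.outcomes.get(sample_id, False)
    )


def accept(baseline: RunRecord, candidate: RunRecord) -> bool:
    """Reject a change that regresses any sample or lowers the pass rate."""
    return (
        not regressions(baseline, candidate)
        and pass_rate(candidate) >= pass_rate(baseline)
    )


baseline = RunRecord("prompt-v1", {"s1": True, "s2": True, "s3": False})
candidate = RunRecord("prompt-v2", {"s1": True, "s2": False, "s3": True})
lost = regressions(baseline, candidate)
decision = accept(baseline, candidate)
```

Here the candidate has the same overall pass rate as the baseline but is still rejected, because it breaks a sample that used to pass. Keeping outcomes keyed by sample id is what makes that kind of comparison and later replay possible.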
That is slower than writing a longer prompt, but it feels much closer to building a product.