Thirteenth
We started with a pretty simple agent built around one workflow, and honestly it did its job well. Over time, though, it turned into something else entirely. New instructions got added, then more tools, then fallback rules, then edge-case handling from three different teams. Now the thing technically covers way more ground but feels completely unreliable. It'll follow the wrong workflow, miss context that's sitting right there, or go through a whole chain of tool calls when a single sentence would've done it. The behavior has clearly shifted, but pinning down exactly which change broke what is nearly impossible. Every modification seemed reasonable at the time.

How are other teams actually keeping track of what's working in their agents once the instructions and capabilities start stacking up?