A quality associate on a manufacturing site asks the team’s Copilot agent about the gowning procedure for a controlled area. The agent responds in under five seconds. The answer is fluent, specific, and formatted cleanly. It references the right process type. It names the correct area. It looks exactly like what a correct answer should look like.
The associate follows it.
At no point in that exchange did anyone stop to ask whether the answer was actually right. The agent was confident. The interface was clean. The whole thing took less time than walking to the quality office to ask someone. There was no signal that anything might be wrong, because from the outside, a confident wrong answer looks identical to a confident right answer.
That is the problem this post is about.
Speed is not a proxy for accuracy
Copilot agents are fast by design. That speed is part of why teams adopt them: fewer interruptions, faster answers, less time hunting through SharePoint folders that were never well organised to begin with. The operational value is real.
But speed tells you nothing about whether the answer is correct.
In pharma and manufacturing, the content an agent draws from is typically SOPs, work instructions, batch record guidance, and controlled procedures: documents where precision is the point and the margin for error is narrow. When an agent retrieves a response from that content and presents it confidently, the person on the receiving end has no way to distinguish an accurate answer from a merely plausible one. The fluency is the same. The tone of certainty is the same. The formatting is the same.
The result is that a well-designed interface actively removes the friction that might otherwise prompt someone to verify an answer. That is not a design flaw; it is what makes the tool useful. But it means the burden of reliability has to be carried somewhere else. And in most deployments, it is not being carried anywhere in particular.
A wrong answer to a question about a controlled procedure, delivered with apparent authority, is not a neutral event. It is worse than no answer. No answer prompts a lookup. A wrong answer prompts action.
Most agents get validated informally. That is not a reliability benchmark.
Ask most knowledge management teams how they validated their Copilot agent before deploying it, and the answer usually involves some version of: we tested a set of questions, the agent performed well, and we were satisfied.
That is a reasonable starting point. It is not a reliability benchmark.
The problem with testing an agent using questions you already know the answers to is structural. Your test set reflects what you anticipated the agent would be asked. But the agent’s users will ask things you did not anticipate: questions that sit between two SOPs, queries that reference a process by its informal floor name rather than its document title, questions formed in the language of the production team rather than the documentation team. The agent was grounded in specific SharePoint content. The coverage of that content (what it includes, what it is missing, where it is ambiguous) was probably never mapped systematically.
Informal testing works well when the questions match the grounding. When they do not, the agent produces a response anyway. Those responses, generated at the edges of the agent’s knowledge, are the ones that carry the most risk. And they are exactly the responses that informal testing tends to miss.
There is a name for what most teams are doing when they test a handful of questions, get good results, and proceed: confirmation bias. The test confirms what you hoped was true. It does not probe where the agent might fail.
This is not a criticism of the teams involved. Building a Copilot agent takes real effort, the informal approach is intuitive, and there has not been a widely used alternative to it. But “we asked it some questions and it seemed fine” is not a standard that would pass scrutiny in any other part of a regulated operation. The question is whether it should be the standard for this one.
An agent is not a general tool. The accountability sits differently.
There is a version of this conversation that applies to Microsoft Copilot in its general form: the assistant that lives in Teams, drafts emails, and summarises documents. When that tool produces a wrong or incomplete answer, there is always a ready defence: it is a general assistant, not purpose-built for any specific task. The expectation of reliability is calibrated accordingly.
A Copilot agent does not have that defence.
A Copilot agent is purpose-built. It was grounded in specific content: your SOPs, your procedures, your SharePoint libraries. It was configured for a specific job. It was deployed to a specific team with the explicit intent that they use it to answer questions about your processes. When it gets something wrong, “it’s a general tool” is not available as an explanation, because it is not one.
The accountability for a Copilot agent sits with the team that built and sanctioned it. In most organisations, that is the knowledge management function. The agent was built on your content, deployed under your authority, and trusted because your team put it in front of users. When a worker acts on the agent’s answer, they are, in a meaningful sense, acting on guidance your team provided.
Microsoft’s documentation draws this distinction directly: Copilot agents are purpose-built, domain-grounded extensions, not the general assistant. The architecture is different. The use case is specific. And the accountability that comes with specificity is different in kind from the accountability that comes with a general tool.
This is not an argument against building agents. The operational case for them in pharma and manufacturing is sound, and the teams who have deployed them well have seen genuine value. It is an argument that the assurance practices around a purpose-built agent need to reflect what it actually is (a specific, accountable piece of operational infrastructure) rather than treating it as an interesting add-on to a general platform.
So how do you actually know?
The teams doing this thoughtfully are asking a question that most have not yet formalised: what does reliable enough actually mean for this agent, in this environment, for these users?
In a regulated operation, that is not a vague standard. It has to mean something specific: a tested scope, a documented understanding of where the agent performs well and where its grounding is thin, and a basis for saying “this agent is fit for this purpose” that goes beyond a set of test questions with known answers.
The informal approach most teams use is a start. It does not yet meet that standard. And the gap between the two is worth understanding before it shows up somewhere inconvenient: an audit, a deviation, a moment where someone on the floor acted on an answer that turned out to be wrong.
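One way to make "where the agent performs well and where its grounding is thin" concrete is to record a reviewer's verdict for every test question, label each question by the kind of query it represents, and then look at accuracy per label instead of one overall impression. The sketch below is a minimal illustration of that idea in Python, not a description of any product or of Copilot itself. The names ReviewedAnswer, accuracy_by_category, and thin_categories are hypothetical, as are the category labels and the threshold; your own categories and acceptance level would come from your quality function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedAnswer:
    """One test question, after a human has checked the agent's answer."""
    question: str
    category: str  # hypothetical label, e.g. "anticipated" or "informal floor name"
    correct: bool  # reviewer's verdict against the controlled document


def accuracy_by_category(results):
    """Return {category: share of answers judged correct} for the reviewed set."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, asked]
    for r in results:
        totals[r.category][1] += 1
        if r.correct:
            totals[r.category][0] += 1
    return {cat: correct / asked for cat, (correct, asked) in totals.items()}


def thin_categories(results, threshold):
    """List categories whose accuracy falls below the chosen threshold."""
    scores = accuracy_by_category(results)
    return sorted(cat for cat, acc in scores.items() if acc < threshold)
```

The design choice that matters here is the category label. A single accuracy figure over questions you anticipated will tend to look good; splitting results by query type is what exposes the edges, and a category with no reviewed questions at all is itself a finding about untested scope.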
If you’re not sure whether your agent is reliable enough, that uncertainty is worth taking seriously. The Altuent team works specifically on Copilot Agent Reliability Score assessments for pharma and manufacturing deployments. Start a conversation with the Altuent team to find out how we can support you.