Preparing your content for pre-trained GenAI: A practical guide for organisations

How to avoid costly missteps, bridge the knowledge gap, and make AI work for you

Many organisations want to make effective use of a generative, pre-trained large language model (LLM), a GenAI, such as Microsoft’s Copilot or Anthropic’s Claude. However, they struggle when confronted with one fundamental question: What can they usefully ask the GenAI about?

One way to answer this question is to understand what the GenAI “knows”. If the GenAI knows something, then surely you can ask it about that. But, more importantly for any organisation using it for professional work, you want it to know specific things, especially about your organisation or industry. Many such models help you do this by letting you share your own content with them. Once you share your content, you might assume the AI knows it, much as a diligent professional would: you share your content, the AI reads it, and then it can talk to you about it. (And unlike a typical person, it won’t be distracted by something else.)

So, once you share something with a GenAI, does it know it? And for how long? Can you assume it knows what you tell it the way a human does?

The answer is more complex than you might think. And understanding why is key to any successful knowledge management strategy with AI.

Training GenAI

GenAI talks to us like it’s human. Maybe a bit obsequious, but generally very human-like. So, it’s natural to assume that it can do this because, like a human, it has things like understanding and memory. Surely a GenAI retains the details you discuss with it (say, your company’s sick leave policy) because you’ve described those details in conversation with it. People do that all the time (if they’re paying attention).

However, GenAIs like Copilot and Claude don’t do this at all. They don’t “memorise” your input the way a human does. In fact, after you close a conversation with them, the models lose everything you’ve said. Even if they seem like a person to you in every other way, they won’t remember your conversation. As Mollick puts it, AI is “just like an infinitely patient new coworker who forgets everything you tell them each new conversation.” (Mollick 2024[1])

But they do remember everything they’ve been trained on. Their core knowledge is encoded in model parameters established during a pre-training phase: an enormous, once-off ingestion of public text and documents. This pre-training sets the weights, the numerical parameters used to calculate every response.

Weights and answers

A GenAI answers questions about the capital of France or the names Scott gave his husky dogs on his Antarctic expeditions because suitable answers to such questions are encoded in the weights set during its training. Being generative, it can do even more: it can produce information outside its training data. For example, based on its separate weights for capitals and dogs’ names, it might produce new names for an imagined Antarctic capital or, had Scott gone to France instead, the dogs he might have brought there.

One more thing: these weights are not locked to specific response text. They encode probabilities over possible texts. The exact same question, put to the same version of a model, can return different answers. Typically, the answers are not very different from one another (they don’t usually contradict each other), but the phrasing, choice of terms, or specific details can vary.
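To illustrate, here is a minimal sketch, in Python, of how a model picks its next word from weighted probabilities. The candidate tokens and scores are invented for the example; real models work over vocabularies of tens of thousands of tokens:

```python
import math, random

# Invented scores (logits) for candidate next tokens after the
# question "What is the capital of France?". In a real model, the
# trained weights produce these scores.
logits = {"Paris": 9.1, "paris": 7.3, "The": 6.8, "France's": 5.2}

def sample_next_token(logits, temperature=0.8):
    # Softmax: convert scores into a probability distribution.
    scaled = {tok: s / temperature for tok, s in logits.items()}
    top = max(scaled.values())
    exp = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    # Sample: likelier tokens are favoured but not guaranteed, which
    # is why identical questions can yield different phrasings.
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Run it twice: the output may differ even though nothing else changed.
print(sample_next_token(logits), sample_next_token(logits))
```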

Weights and conversation

The weights a GenAI uses to answer your questions are not updated during your everyday use.

When you interact with a pre-trained GenAI, your queries and the model’s responses are processed on the fly. Unless you explicitly provide the relevant context each time, the GenAI has no persistent awareness of any context you want it to assume. It won’t be aware of your previous conversations or your organisation’s unique, internal content.

Your previous conversations happen after the GenAI was trained. Your organisation’s internal content is not part of that training phase (which is likely, and just as well, given that such GenAI tools are publicly accessible).

Your conversations are temporary. If you don’t supply company-specific information to the GenAI each time you communicate with it, the GenAI answers based on its generalised, pre-trained baseline.

So, after you end a conversation, the GenAI remembers the training data it had when you began, but nothing else communicated during the conversation.
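A minimal sketch of what this statelessness means in practice: each request to a GenAI stands alone, so the caller, not the model, must carry the memory. The call_model function and request format below are hypothetical stand-ins for whatever API you use:

```python
# The model sees only what is inside one request, so company context
# and earlier turns must be sent again every time.

COMPANY_CONTEXT = "Sick leave policy: staff accrue 1.5 days per month."

def build_request(history, new_question):
    # Everything the model will "know" for this answer goes here.
    return {
        "system": "Answer questions using the company context below.\n"
                  + COMPANY_CONTEXT,
        "messages": history + [{"role": "user", "content": new_question}],
    }

history = []
request = build_request(history, "How many sick days do I accrue per year?")
# reply = call_model(request)   # hypothetical API call
# history.append(...)           # the caller keeps the memory, not the model
```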

Yet you can still use it consistently with your own content. You do this by providing AI-ready content: content prepared so that it can be supplied each time you interact with the GenAI.

Optimising GenAI interactions for your organisation

GenAI models are trained to communicate naturally. They are not trained to communicate accurately, and especially not about your content. Getting there requires something more.

Preparing content for training GenAI

The training phase of GenAI models involves content preparation. It focuses on massive scale, copyright, and general quality control, and is mostly done by the model providers. Although the providers’ specific methods are not publicly available, FineWeb is a similar, publicly documented methodology.

Figure 1: FineWeb’s content preparation process

Pre-training steps, as shown in FineWeb’s process, are typically optimised for natural language.[2] They involve removing data unsuitable for natural language, as well as content that is undesirable for other reasons. For example: code (such as HTML) and SEO-optimised text; multiple copies of the same material (‘deduplication’); content from specific URLs; and anything pirated, not publicly accessible, or containing personal information (‘PII removal’).
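As an illustration, here is a minimal sketch of two of these steps, deduplication and PII removal. Real pipelines such as FineWeb use far more sophisticated methods (fuzzy deduplication, trained classifiers); the patterns below are assumptions for the example only:

```python
import re

# Illustrative patterns for two kinds of PII.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def prepare(documents):
    seen = set()
    for doc in documents:
        key = doc.strip().lower()
        if key in seen:          # deduplication: drop exact repeats
            continue
        seen.add(key)
        doc = EMAIL.sub("[EMAIL]", doc)   # PII removal: mask emails
        doc = PHONE.sub("[PHONE]", doc)   # ...and phone numbers
        yield doc

docs = ["Contact ann@example.com.", "Contact ann@example.com.", "Hello."]
print(list(prepare(docs)))  # ['Contact [EMAIL].', 'Hello.']
```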

One large and as-yet-unresolved problem in practice is transparency: knowing exactly what content went into the training data. A vast amount of information is ingested, and it is not always clear how specific outputs relate to that content.

And, as noted earlier, another feature (not a problem) of pre-training data is permanence: for as long as that version of the model exists, the data is “known” by the model, information it does not forget.

Preparing content for use with GenAI

Content preparation for GenAI use is different. It is about making your specific knowledge easily and accurately accessible during live queries.

This content differs from the pre-training content in many ways. The most significant was already mentioned: the model remembers the pre-training content; it does not remember your organisation’s content. As such, you need a different approach to working with your content in a GenAI model.

However, your content also has advantages over the pre-training content. Because preparation happens within a specific organisation, which typically owns the content, issues around undesirable content are rarely a problem. Notably, a serious issue for GenAI trainers is data transparency: what went into the model, and is it legal or otherwise suitable? For organisations, transparency is much simpler. You know your source content, and you control both access and compliance. The challenge, then, isn’t suitability but practical accuracy.

Return, then, to the central fact: your GenAI won’t recall what you tell it. There are two broad ways to overcome this:

  • Work on some general aspects of how you tell it what to do. This is prompt engineering.
  • Work on the specific content that you feed the GenAI along with your prompt.

Prompt engineering concerns crafting the most effective prompt for a GenAI, so that it returns the best response.

Several methods have developed around it. There are specific techniques that can be included in each individual prompt, such as few-shot prompting (giving examples of what you want), chain-of-thought (telling it to reason step by step towards its answer), and persona assignment (telling it to act in a particular role, such as a writer or subject matter expert). Many models also let you set all of these as system prompts: instructions automatically added to every prompt. Further approaches enhance the kind of information you share each time you interact with a GenAI, for example retrieval-augmented generation (RAG) and the Model Context Protocol (MCP).
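To make these techniques concrete, here is a minimal sketch of how a single prompt might combine them. The wording, structure, and helper function are illustrative assumptions, not a required format for any particular model:

```python
def build_prompt(question, context):
    persona = "You are a technical writer at our company."        # persona assignment
    examples = (
        "Q: What is our VPN policy?\n"
        "A: Per the IT handbook, all remote access requires the VPN.\n"  # few-shot
    )
    instruction = "Think step by step, then give a short answer."  # chain-of-thought
    return (
        f"{persona}\n\n{examples}\n"
        f"Context:\n{context}\n\n"          # supplied content (e.g. retrieved via RAG)
        f"{instruction}\nQ: {question}\nA:"
    )

print(build_prompt("How much sick leave do I accrue?",
                   "Staff accrue 1.5 days of sick leave per month."))
```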

Yet, the above methods can still fall short of getting what you want. You don’t just prompt – you prompt about something, and that something is the content in your organisation. And no matter how good your prompt design might be, the content might be unsuitable for GenAI ingestion.

If the above methods are like the code for generating responses to your queries, the content is like the data. You must also prepare the content (the data) itself for GenAI use in your organisation.

There are several methods for improving content. However, they are more specific than the prompt engineering methods, often varying by GenAI model and by the kind of content involved. This requires knowledge management expertise: a deep understanding of the GenAI models, of the content’s use and intended audience, and of how both relate to other tools (such as content repositories like SharePoint).
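As one illustration of such content preparation, here is a minimal sketch of splitting a document into small, self-contained chunks with metadata, so that a retrieval layer (for example, RAG over a SharePoint library) can supply the right passage alongside each prompt. The chunk size and metadata fields are assumptions; in practice they vary by model and content type:

```python
def chunk_document(title, text, max_chars=500):
    # Split on paragraph breaks, packing paragraphs into chunks of
    # roughly max_chars so each chunk stays small enough to retrieve
    # and feed to a model alongside a prompt.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if len(current) + len(paragraph) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    # Attach metadata so each chunk remains traceable to its source.
    return [{"source": title, "chunk_id": i, "text": c}
            for i, c in enumerate(chunks)]

policy = ("Staff accrue 1.5 days of sick leave per month.\n\n"
          "Unused days carry over.")
for chunk in chunk_document("HR policy: sick leave", policy):
    print(chunk)
```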

Conclusion

Assume that the underlying GenAI model forgets your conversations by default. To successfully use GenAI in your organisation, and help it remember, focus on two things:

  • To guide the model’s responses, engineer your prompts.
  • To ensure your content is accurate, relevant, and structured for AI, prepare your content.

Prompt engineering helps unlock better responses. However, the specific value of GenAI emerges only from reliable, well-prepared underlying data. So, if you want to integrate GenAI with your organisation’s content, my recommendations are these:

  • Invest as much effort in preparing and managing your internal content as you do in refining your prompts.
  • Collaborate with knowledge management experts who understand both your chosen GenAI models and your business’s unique information landscape.

By doing so, you enhance the suitability and accuracy of AI-generated outputs, with lasting value for your organisation.
