
How to Train AI on Your Company’s Data

By Michelle Hawley
Companies that train large language models (LLMs) on their proprietary data can expect two immediate benefits: security and specificity. Here’s how.

The bedrock of most generative AI models today is data gathered from the internet. But within the confines of the organization lies an untapped goldmine of proprietary knowledge.

Company knowledge is often scattered across individual minds, processes, policies, reports, online chats, meetings and more. It goes unaccounted for, is hard to recognize and is difficult to deploy where needed. 

Is generative AI the tool companies need to get a grasp on and unleash this knowledge?

Why Train Your Own AI Model? 

Companies that train large language models (LLMs) on their proprietary data typically see two immediate benefits. The first is increased security: proprietary data remains within the organization and is not exposed to external AI web crawlers. This type of setup helps ensure sensitive customer data stays protected.

The second benefit comes in the form of specificity of responses. LLMs trained on company data can provide hyper-personalized experiences, as they access specific, non-public data to tailor responses, making interactions more relevant and engaging.

Other, more gradual benefits include greater efficiency among workers, better decision-support tools and greater long-term knowledge retention. 

Related Article: 5 Ways Leaders Can Prepare Their Workforce for AI Disruption

3 Ways to Train Generative AI on Company Data

There are multiple ways to train AI on proprietary or company data. Let’s explore three methods. 

Method 1: Training AI From Scratch 

Training AI from scratch is the most resource-intensive method. It requires a massive amount of high-quality data, which most companies don’t have. It also requires a significant amount of computing power and an arsenal of talented data scientists. 

There are only a small number of cases where building a model from scratch makes sense, according to Dr. Jules White, professor of computer science and director of Vanderbilt University’s initiative on the future of learning and generative AI.

“You're going to want the best possible reasoning, and it's going to be very hard to stay at the cutting edge of reasoning if you're trying to just train on your own data,” he said, “which is probably orders of magnitude smaller than what is available to these big players who are going to be doing it.”

An example of BloombergGPT utilizing its knowledge about stock tickers and financial terms to compose valid queries to retrieve data.

Bloomberg pulled this off with BloombergGPT, a tool to assist with financial tasks — something it was able to do with its 40+ years’ worth of existing data, which it combined with other financial resources. BloombergGPT was trained on a dataset of more than 700 billion tokens (chunks of text) and has 50 billion parameters. 

White also pointed to a collaborator of his at Vanderbilt, Jesse Spencer-Smith, who’s working with a group to train a model using gravitational waves. Stay tuned on that one.

Method 2: Fine-tune an Existing LLM

The next method is to take an existing large language model — like GPT-3 or GPT-4 — that is already trained on general knowledge and fine-tune (or further train) it with company-specific content. This approach requires less data and computing time but still requires substantial investment and data science expertise. 
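To make the fine-tuning workflow concrete, hosted fine-tuning services typically start from a file of example conversations. The sketch below prepares training data in the chat-style JSONL format OpenAI’s fine-tuning API expects; the company Q&A pairs and the “Acme Corp” persona are hypothetical stand-ins for real proprietary content.

```python
import json

# Hypothetical company Q&A pairs to fine-tune on.
examples = [
    ("What is our refund window?",
     "Customers may request a refund within 30 days of purchase."),
    ("Who approves travel expenses?",
     "Travel expenses are approved by the employee's direct manager."),
]

# The fine-tuning API expects one JSON object per line, each containing
# a "messages" list in chat format (system / user / assistant roles).
with open("training_data.jsonl", "w") as f:
    for question, answer in examples:
        record = {
            "messages": [
                {"role": "system",
                 "content": "You are a helpful assistant for Acme Corp employees."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

From there, the file is uploaded and a fine-tuning job is launched against a base model; the heavy lifting is usually in curating enough high-quality examples, not in the API calls themselves.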

OpenAI’s Codex, for example, is an LLM fine-tuned on Python code from GitHub. It’s a general-purpose programming model that can be applied to nearly any programming task, including writing new code and debugging existing code. Codex contains 12 billion parameters and was trained on a dataset of 159 GB.

An example of how Med-PaLM 2 answers medical questions.

Another popular fine-tuned LLM is Google’s Med-PaLM 2, which was designed for the medical industry. PaLM 2, the basis for Med-PaLM 2, was trained on 3.6 trillion tokens — nearly five times more than its predecessor, PaLM — and has 340 billion parameters. When tested against US Medical Licensing Exam-style questions, the AI system achieved 86.5% accuracy.

Method 3: Prompt-tune an Existing LLM

Method 3 may be the obvious choice for most organizations. White said that’s because before embarking on either of the two methods mentioned above, companies should ask themselves whether they even need to fine-tune or build their own models. “My perspective is you probably don’t,” he said. 

Instead, this third approach entails prompt-tuning an existing LLM, which doesn’t require any further training of the model. Rather, it focuses on designing input prompts that guide the model to generate desired responses — a process that can cut computing and energy use by at least 1,000 times and save thousands of dollars, according to IBM’s David Cox.


People assume they can’t get good performance from a model on tasks it wasn’t trained on, said White. But that’s not true. “There's prompt engineering techniques where you pull in the data you need into the prompt and give it to one of these models,” he explained. “It was never trained with your data, and it can perform and reason on your data.”
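The technique White describes — pulling the data you need into the prompt itself — can be sketched as a small retrieval-plus-templating step. The in-memory “knowledge base,” the keyword-overlap scoring, and the prompt wording below are all simplified assumptions; a production system would retrieve from a real document store, typically with embeddings.

```python
# A toy in-memory knowledge base of company documents (hypothetical).
documents = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Travel policy: expenses over $500 require VP approval.",
    "Security policy: laptops must use full-disk encryption.",
]

def retrieve(query, docs, top_k=1):
    """Naive keyword-overlap retrieval; real systems use embedding similarity."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, docs):
    """Stuff the most relevant document(s) into the prompt as context."""
    context = "\n".join(retrieve(query, docs))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("What is the refund policy?", documents)
print(prompt)  # This assembled string is what gets sent to the LLM's API.
```

Because the model only ever sees the prompt, no training run is involved — the company data stays in its own systems and is injected per request.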

Morningstar’s Mo showing off its investment research knowledge.

Morningstar, for example, used prompt-tuning on OpenAI’s GPT-3.5 to create Mo, an investment research assistant. Mo became available for use after only a month or so of development, with the average cost per question answered sitting at $0.002. The total expense Morningstar has devoted to Mo as of June 2023 — not including the compensation of its creators — is $3,000. 

Related Article: 5 Generative AI Trends in the Digital Workplace

Navigating Generative AI for Knowledge Management 

Generative AI offers a way to manage and leverage the inherent and often-scattered knowledge within an organization. Companies looking to adopt these AI advancements should carefully choose the method that aligns with their needs and resources. By doing so, they not only elevate employee performance but also position themselves strategically in a landscape where AI is quickly becoming a must-have.

About the Author
Michelle Hawley

Michelle Hawley is an experienced journalist who specializes in reporting on the impact of technology on society. As editorial director at Simpler Media Group, she oversees the day-to-day operations of VKTR, covering the world of enterprise AI and managing a network of contributing writers. She's also the host of CMSWire's CMO Circle and co-host of CMSWire's CX Decoded. With an MFA in creative writing and background in both news and marketing, she offers unique insights on the topics of tech disruption, corporate responsibility, changing AI legislation and more. She currently resides in Pennsylvania with her husband and two dogs.

Main image: Hannah Lim | unsplash