
How to Train AI on Your Company’s Data

By Michelle Hawley
Companies that train large language models (LLMs) on their proprietary data can expect two immediate benefits: security and specificity. Here’s how.

The bedrock of most generative AI models today is data gathered from the internet. But within the confines of the organization lies an untapped goldmine of proprietary knowledge.

Company knowledge is often scattered across individual minds, processes, policies, reports, online chats, meetings and more. It goes unaccounted for, is hard to recognize and is difficult to deploy where needed. 

Is generative AI the tool companies need to get a grasp on and unleash this knowledge?

Why Train Your Own AI Model? 

Companies that train large language models (LLMs) on their proprietary data typically see two immediate benefits. The first is increased security: proprietary data remains within the organization and is not exposed to external AI web crawlers. This type of setup helps ensure sensitive customer data stays protected.

The second benefit comes in the form of specificity of responses. LLMs trained on company data can provide hyper-personalized experiences, as they access specific, non-public data to tailor responses, making interactions more relevant and engaging.

Other, more gradual benefits include greater efficiency among workers, better decision-support tools and greater long-term knowledge retention. 

Related Article: 5 Ways Leaders Can Prepare Their Workforce for AI Disruption

3 Ways to Train Generative AI on Company Data

There are multiple ways to train AI on proprietary or company data. Let’s explore three methods. 

Method 1: Training AI From Scratch 

Training AI from scratch is the most resource-intensive method. It requires a massive amount of high-quality data, which most companies don’t have. It also requires a significant amount of computing power and an arsenal of talented data scientists. 

There are only a small number of cases where building a model from scratch makes sense, according to Dr. Jules White, professor of computer science and director of Vanderbilt University’s initiative on the future of learning and generative AI.

“You're going to want the best possible reasoning, and it's going to be very hard to stay at the cutting edge of reasoning if you're trying to just train on your own data,” he said, “which is probably orders of magnitude smaller than what is available to these big players who are going to be doing it.”

An example of BloombergGPT utilizing its knowledge about stock tickers and financial terms to compose valid queries to retrieve data.

Bloomberg pulled this off with BloombergGPT, a tool to assist with financial tasks — something it was able to do with its 40+ years’ worth of existing data, which it combined with other financial resources. BloombergGPT was trained on a dataset of more than 700 billion tokens (chunks of text) and has 50 billion parameters. 

White also pointed to a collaborator of his at Vanderbilt, Jesse Spencer-Smith, who’s working with a group to train a model using gravitational waves. Stay tuned on that one.

Method 2: Fine-tune an Existing LLM

The next method is to take an existing large language model — like GPT-3 or GPT-4 — that is already trained on general knowledge and fine-tune (or further train) it with company-specific content. This approach requires less data and computing time but still requires substantial investment and data science expertise. 
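To make the fine-tuning workflow concrete, hosted fine-tuning services typically start from a file of example conversations. The sketch below prepares training data in the chat-style JSONL format OpenAI’s fine-tuning API expects; the company Q&A pairs and the “Acme Corp” persona are hypothetical stand-ins for real proprietary content.

```python
import json

# Hypothetical company Q&A pairs to fine-tune on.
examples = [
    ("What is our refund window?",
     "Customers may request a refund within 30 days of purchase."),
    ("Who approves travel expenses?",
     "Travel expenses are approved by the employee's direct manager."),
]

# The fine-tuning API expects one JSON object per line, each containing
# a "messages" list in chat format (system / user / assistant roles).
with open("training_data.jsonl", "w") as f:
    for question, answer in examples:
        record = {
            "messages": [
                {"role": "system",
                 "content": "You are a helpful assistant for Acme Corp employees."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

From there, the file is uploaded and a fine-tuning job is launched against a base model; the heavy lifting is usually in curating enough high-quality examples, not in the API calls themselves.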

OpenAI’s Codex, for example, is an LLM fine-tuned on Python code from GitHub. It’s a general-purpose programming model that can be applied to nearly any programming task, including writing new code and debugging existing code. Codex contains 12 billion parameters and was trained on a dataset of 159 GB.

An example of how Med-PaLM 2 answers medical questions.

Another popular fine-tuned LLM is Google’s Med-PaLM 2, which was designed for the medical industry. PaLM 2, the basis for Med-PaLM 2, was trained on 3.6 trillion tokens — nearly five times more than its predecessor, PaLM — and has 340 billion parameters. When tested against US Medical Licensing Exam-style questions, the AI system achieved 86.5% accuracy.

Method 3: Prompt-tune an Existing LLM

Method 3 may be the obvious choice for most organizations. White said that’s because before embarking on either of the two methods mentioned above, companies should ask themselves whether they even need to fine-tune or build their own models. “My perspective is you probably don’t,” he said. 

Instead, this third approach entails prompt-tuning an existing LLM, which doesn’t require any further training of the model. Rather, it focuses on designing input prompts that guide the model to generate desired responses — a process that can cut computing and energy use by at least 1,000 times and save thousands of dollars, according to IBM’s David Cox.


People assume they can’t get good performance from a model on tasks it wasn’t trained on, said White. But that’s not true. “There's prompt engineering techniques where you pull in the data you need into the prompt and give it to one of these models,” he explained. “It was never trained with your data, and it can perform and reason on your data.”
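The technique White describes — pulling the data you need into the prompt itself — can be sketched as a small retrieval-plus-templating step. The in-memory “knowledge base,” the keyword-overlap scoring, and the prompt wording below are all simplified assumptions; a production system would retrieve from a real document store, typically with embeddings.

```python
# A toy in-memory knowledge base of company documents (hypothetical).
documents = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Travel policy: expenses over $500 require VP approval.",
    "Security policy: laptops must use full-disk encryption.",
]

def retrieve(query, docs, top_k=1):
    """Naive keyword-overlap retrieval; real systems use embedding similarity."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, docs):
    """Stuff the most relevant document(s) into the prompt as context."""
    context = "\n".join(retrieve(query, docs))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("What is the refund policy?", documents)
print(prompt)  # This assembled string is what gets sent to the LLM's API.
```

Because the model only ever sees the prompt, no training run is involved — the company data stays in its own systems and is injected per request.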

Morningstar’s Mo showing off its investment research knowledge.

Morningstar, for example, used prompt-tuning on OpenAI’s GPT-3.5 to create Mo, an investment research assistant. Mo became available for use after only a month or so of development, with the average cost per question answered sitting at $0.002. The total expense Morningstar has devoted to Mo as of June 2023 — not including the compensation of its creators — is $3,000. 

Related Article: 5 Generative AI Trends in the Digital Workplace

Navigating Generative AI for Knowledge Management 

Generative AI offers a way to manage and leverage the inherent and often-scattered knowledge within an organization. Companies looking to adopt these AI advancements should carefully choose the method that aligns with their needs and resources. By doing so, they not only elevate employee performance but also position themselves strategically in a landscape where AI is quickly becoming a must-have.

About the Author
Michelle Hawley

Michelle Hawley is an experienced journalist who specializes in reporting on the impact of technology on society. As editorial director at Simpler Media Group, she oversees the day-to-day operations of VKTR, covering the world of enterprise AI and managing a network of contributing writers. She's also the host of CMSWire's CMO Circle and co-host of CMSWire's CX Decoded. With an MFA in creative writing and background in both news and marketing, she offers unique insights on the topics of tech disruption, corporate responsibility, changing AI legislation and more. She currently resides in Pennsylvania with her husband and two dogs.

Main image: Hannah Lim | unsplash