BloombergGPT: How We Built a 50 Billion Parameter Financial Language Model

What is a Language Model? (00:01:38)

  • Auto-regressive language models predict the next word in a sequence based on a probability distribution derived from prior words.
  • Text is generated by sampling a word from the predicted distribution, appending it to the sequence, and repeating the process (see the sketch after this list).
  • Language models date back to at least the 1940s, with modern models having the ability to consider large contexts for prediction.
  • Modern models operate on tokens, which are sequences of characters or bytes, instead of whole words to handle a variety of inputs including typos and numerical expressions.
  • Language models have been applied to AI challenges like reading comprehension; they can answer questions by generating text that contains the answer.
  • Building a large language model involves decisions about code, data, and compute infrastructure.
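
To make the sample-and-append loop above concrete, here is a minimal sketch in Python. The `next_token_distribution` function and its tiny fixed vocabulary are made-up placeholders standing in for a real model's predicted probabilities; none of this is BloombergGPT code.

```python
import random

# Hypothetical next-token model over a tiny vocabulary. A real language
# model would compute this distribution from the full context.
def next_token_distribution(tokens):
    vocab = ["the", "market", "rose", "fell", "today", "."]
    probs = [0.25, 0.20, 0.20, 0.15, 0.10, 0.10]
    return vocab, probs

def generate(prompt_tokens, max_new_tokens=8):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        vocab, probs = next_token_distribution(tokens)
        # Sample the next token from the predicted distribution...
        next_token = random.choices(vocab, weights=probs, k=1)[0]
        # ...append it, and repeat with the extended sequence as context.
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["the", "market"]))
```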

Design objective and considerations (00:10:06)

  • The goal was to match the general-purpose task performance of comparable models while excelling at finance-specific tasks.
  • A risk-mitigation strategy was to closely follow a successful existing model, specifically the BLOOM model from the BigScience project.
  • BLOOM is a decoder-only Transformer model, and BloombergGPT took a similar approach with a few deviations.
  • One of those deviations: the BloombergGPT tokenizer handles numbers by breaking them into individual digits (illustrated after this list).
  • The training data consisted of roughly half public data (such as C4, The Pile, and Wikipedia) and half private financial data from Bloomberg, termed FinPile, spanning 2007 through the summer of 2022.
  • In total, the corpus comprised about 710 billion tokens, roughly 200 times the text of the English Wikipedia, and training on it required significant computing power.
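
As a rough illustration of digit-level number handling, the pre-tokenization step below splits every run of digits into single-digit tokens. The `split_digits` helper is a simplified, hypothetical sketch, not the actual BloombergGPT tokenizer.

```python
import re

def split_digits(text):
    """Pre-tokenization step: surround every digit with spaces so that each
    digit becomes its own token, e.g. "2023" -> "2", "0", "2", "3".
    (Illustrative only; a real tokenizer applies many more rules.)"""
    return re.sub(r"\d", lambda m: f" {m.group(0)} ", text).split()

print(split_digits("Revenue rose 12.5% to $3400"))
# ['Revenue', 'rose', '1', '2', '.', '5', '%', 'to', '$', '3', '4', '0', '0']
```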

Model size and data size (00:15:33)

  • Identified a trade-off between model size and training data quantity within a fixed compute budget.
  • An influential paper in early 2022 (the Chinchilla scaling-law work) changed perspectives, showing that a smaller model trained on more data can outperform a larger one at the same compute budget (see the sketch after this list).
  • BloombergGPT is a 50 billion parameter model, smaller than peers such as GPT-3 and OPT-175B but trained on more data relative to its size.
  • The financial dataset was initially kept in chronological order and interleaved at random with the public data during training.
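
A back-of-the-envelope sketch of this trade-off, assuming the commonly cited approximation that training cost is about 6 × parameters × tokens in FLOPs; the GPT-3-scale figures are illustrative, not the actual BloombergGPT compute budget.

```python
# Approximate training cost: C ~ 6 * N * D FLOPs,
# where N = parameter count and D = number of training tokens.

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def tokens_for_budget(budget_flops, n_params):
    """How many tokens can a model of size n_params see under this budget?"""
    return budget_flops / (6 * n_params)

budget = training_flops(175e9, 300e9)   # roughly a GPT-3-scale training run
print(f"compute budget: {budget:.2e} FLOPs")
print(f"tokens affordable for a 50B model: {tokens_for_budget(budget, 50e9):.2e}")
# Holding compute fixed, a 50B model can be trained on 3.5x the tokens of a
# 175B model, which is the trade-off the scaling-law paper emphasized.
```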

BloombergGPT 50B V0 and BloombergGPT 50B V1: debug mode (00:18:51) (00:22:45)

  • Observed unexpected slow learning during BloombergGPT's initial training phase.
  • Hypothesized that chronological data sorting may have contributed to slow learning and out-of-sample testing issues.
  • Shifted to random data sorting and initiated V1 model training; encountered gradient norm spikes indicating training instability.
  • Discovered and addressed a bug related to weight decay on certain LayerNorm parameters (see the sketch after this list).
  • Implemented multiple stability fixes, leading to the V2 model, which trained well for 42 days before encountering performance issues.
  • Despite those late issues, the model reached its target performance on downstream tasks, and training was ended due to budget and time constraints.
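
Weight decay on normalization parameters is a common pitfall; a typical PyTorch pattern is to place LayerNorm parameters and biases in a separate, no-decay optimizer group. The sketch below illustrates that pattern on a toy model and is not BloombergGPT's training code.

```python
import torch

def build_param_groups(model, weight_decay=0.1):
    """Split parameters so that LayerNorm weights and all biases get no weight decay."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and LayerNorm parameters are usually excluded from decay.
        if name.endswith(".bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Toy Transformer stack used only to demonstrate the grouping.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2
)
optimizer = torch.optim.AdamW(build_param_groups(model), lr=1e-4)
```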

Benchmark results summary (00:26:08)

  • Benchmarked BloombergGPT against peer models on various tasks.
  • Performed on par with or better than peers on general-purpose and reasoning tasks.
  • Achieved superior results on financial tasks, likely due to training on financial data and the tokenizer's adjusted handling of numbers.
  • Outperformed peers in sentiment analysis and in entity recognition when linked to stock tickers.
  • Showed promising results in translating natural language to BQL (Bloomberg Query Language) using few-shot learning, as sketched below.
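
Few-shot prompting of this kind amounts to prepending a handful of example (question, query) pairs and letting the model continue the pattern. The sketch below shows only the prompt structure; the BQL outputs are placeholders, since the actual queries are not given in the talk summary.

```python
# Schematic few-shot prompt for natural language -> BQL translation.
# The example pairs are placeholders; real prompts would contain
# genuine (question, BQL) pairs curated by domain experts.
EXAMPLES = [
    ("Get the last price of IBM", "<BQL query 1>"),
    ("Show the market cap of the S&P 500 members", "<BQL query 2>"),
    ("List Apple's revenue for the past five years", "<BQL query 3>"),
]

def build_prompt(question):
    parts = [f"Question: {nl}\nBQL: {bql}" for nl, bql in EXAMPLES]
    # The model is asked to continue the pattern, emitting the BQL
    # translation for the new question as its completion.
    parts.append(f"Question: {question}\nBQL:")
    return "\n\n".join(parts)

print(build_prompt("Plot the market capitalization of the FTSE 100"))
```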

Language Interface for the Terminal (00:29:03)

  • Exploring the development of a natural language interface for the Bloomberg Terminal.
  • Aiming to simplify complex queries like plotting market capitalization and adding various data points, moving beyond traditional interfaces to a more intuitive system.

Advice (00:30:01)

  • Start building large language models by working from smaller models and scaling up.
  • Many problems can be identified and resolved at a smaller scale, which is more cost-effective.
  • Conduct experiments at a smaller scale before implementing changes in larger models to understand the effects of those changes.
  • Progressing gradually allows for a deeper understanding and better use of resources.

Research Project Outcomes

  • A small team can build a large language model comparable to GPT-3 with sufficient computational resources and patience.
  • It is possible to combine domain-specific data with general data to create a model proficient in both areas.
