Design Philosophy of Falcon LLM
The designers of the Falcon models focused on scalability across the following three axes, which became their design philosophy.
1. Performance
The Falcon team utilized the EleutherAI Language Model Evaluation Harness, a framework designed for evaluating NLP models across various tasks. They chose to center their evaluations on measuring zero-/few-shot generalization. Zero-shot and few-shot generalization refer to the ability of a model to perform well on tasks it has not been explicitly trained on, either without any examples (zero-shot) or with only a small number of examples (few-shot).
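For illustration, here is a minimal sketch of how such a zero-shot evaluation could be run with the EleutherAI LM Evaluation Harness (the `lm-eval` package). The checkpoint and task names are illustrative placeholders, not the exact benchmark suite used by the Falcon team.

```python
# pip install lm-eval  (EleutherAI LM Evaluation Harness; v0.4+ API assumed)
import lm_eval

# Run a zero-shot evaluation (num_fewshot=0) of a Hugging Face checkpoint.
# The model id and task are illustrative, not the Falcon team's exact setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/falcon-7b",
    tasks=["hellaswag"],
    num_fewshot=0,
)

print(results["results"])  # per-task metrics such as accuracy
```

Setting `num_fewshot` to a small positive value (e.g., 5) turns the same run into a few-shot evaluation.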
2. Data – The RefinedWeb Dataset for Falcon LLM:
There are three interacting constraints when training a model: compute budget, model size, and dataset size. During the initial wave of large language models, the philosophy was to increase the model size to increase model performance. The Chinchilla paper then provided a general framework and showed that not only the model size but also the training data size matters. It gave a ballpark relationship that the number of training tokens should be at least 20 times the number of model parameters to train a model compute-optimally.
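As a quick sanity check of this rule of thumb, the short sketch below (an illustrative calculation, not taken from the Falcon paper) shows the token counts it implies for a few model sizes.

```python
# Ballpark Chinchilla-style rule of thumb: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

def optimal_tokens(num_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * num_params

for name, num_params in [("7B", 7e9), ("40B", 40e9), ("180B", 180e9)]:
    print(f"{name}: ~{optimal_tokens(num_params) / 1e12:.2f} trillion tokens")
# 7B   -> ~0.14 trillion tokens
# 40B  -> ~0.80 trillion tokens
# 180B -> ~3.60 trillion tokens
```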
The Falcon team laid special emphasis on the quality of data. Traditional models are commonly trained on a mixture of filtered web data and curated “high-quality” corpora. However, the Falcon team argued that web data alone, if properly filtered and deduplicated, can yield powerful models. The team built a high-quality dataset from Common Crawl, RefinedWeb, consisting of roughly 5 trillion tokens, and has released a 600-billion-token extract from this dataset for open community research.
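The released extract is hosted on the Hugging Face Hub; a minimal sketch for streaming a few documents with the `datasets` library is shown below. The dataset id `tiiuae/falcon-refinedweb` and the `content` text field are assumptions based on the public release.

```python
from datasets import load_dataset

# Stream the publicly released RefinedWeb extract rather than downloading it in full.
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(refinedweb):
    # Each record is assumed to hold one filtered, deduplicated web document.
    print(record["content"][:200])
    if i == 2:
        break
```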
3. Hardware:
The team focused on designing the model in a way that not only improved task performance but also considered hardware scalability and throughput. They utilized a 3D parallelism strategy and optimizer sharding to run the training on AWS infrastructure (4,096 40 GB A100 GPUs for the 180B-parameter model).
3D parallelism is a strategy that scales training across multiple dimensions (a configuration sketch follows the list):
- Data parallelism: Replicates the model and training data across multiple devices, each responsible for updating parameters based on its local data shard.
- Model parallelism: Splits the model (layers, weights, activations) across multiple devices, with each device computing activations or gradients for its assigned parts.
- Pipeline parallelism: Splits the model into sequential stages placed on different devices and streams micro-batches through them, so that the forward and backward computation of different stages overlaps instead of leaving devices idle.
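As a rough illustration (not the Falcon team's published configuration), the sketch below shows how the three parallelism degrees multiply to the total GPU count and how the global batch is split across data-parallel replicas.

```python
# Hypothetical 3D-parallel layout; all degrees below are illustrative values.
tensor_parallel = 8     # model/tensor parallelism: weights within a layer split across 8 GPUs
pipeline_parallel = 4   # pipeline parallelism: layers grouped into 4 sequential stages
data_parallel = 128     # data parallelism: 128 full replicas, each on its own data shard

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(f"GPUs required: {total_gpus}")  # 8 * 4 * 128 = 4096

global_batch_size = 2048                                # illustrative global batch
per_replica_batch = global_batch_size // data_parallel  # samples per replica per step
print(f"Samples per data-parallel replica: {per_replica_batch}")  # 16
```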
In large-scale deep learning, the optimizer state (e.g., the momentum and variance terms in Adam) can become a memory bottleneck. Optimizer sharding (see the sketch after this list) addresses this by:
- Partitioning optimizer state: Splits the optimizer state across multiple devices, reducing per-device memory requirements.
- Synchronizing gradients: Employs communication-efficient algorithms to aggregate gradients across devices without transferring the entire state.
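A minimal sketch of optimizer-state sharding using PyTorch's built-in `ZeroRedundancyOptimizer` is shown below. This is one illustrative way to shard optimizer state on top of data parallelism, not necessarily the exact mechanism the Falcon team combined with their 3D-parallel setup.

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the script is launched with torchrun and a process group is initialized:
#   dist.init_process_group(backend="nccl")
model = DDP(torch.nn.Linear(4096, 4096).cuda())

# Each rank stores only its shard of the Adam state (momentum and variance),
# cutting per-device optimizer memory roughly by the data-parallel degree.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

loss = model(torch.randn(8, 4096, device="cuda")).sum()
loss.backward()    # DDP all-reduces gradients across ranks
optimizer.step()   # each rank updates its own shard, then parameters are synchronized
```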
Falcon LLM: Comprehensive Guide
Falcon LLM is a large language model engineered to comprehend and generate human-like text, showcasing remarkable improvements in natural language understanding and generation capabilities. This article covers the fundamentals of Falcon LLM and demonstrates how we can perform text generation using Falcon LLM.
Table of Contents
- What is Falcon LLM?
- Key Features of Falcon LLM
- Design Philosophy of Falcon LLM
- Key Model components of Falcon LLM
- Limitation
- Text Generation using Falcon 7B
Falcon LLM aims to set new benchmarks in AI’s ability to interact, reason, and assist in a variety of complex tasks, promising transformative impacts across industries and research domains.
A Large Language Model (LLM) is a very large model (in terms of parameters), generally based on the transformer architecture (a special type of neural network capable of parallel processing through the self-attention mechanism), that is trained on massive amounts of text data, which helps it understand and generate text the way humans do. Some examples of famous LLMs are GPT-3, Google Bard, and PaLM. Though LLMs like GPT-3, Google Bard, and PaLM are available to the public for inference, how they have been trained is not documented in detail. Traditionally, open-source LLMs have lagged behind these private/commercial models in terms of performance and size. The lack of detailed documentation about the training process of successful large-scale models limits the research and progress of open-source models.
Let us get an understanding of the key components of the Falcon Model.