Large language model

From Wikipedia, the free encyclopedia

This article discusses the technical architecture and historical development of large language models. For the ethical and societal implications, see Ethics of artificial intelligence.

A large language model (LLM) is a type of artificial intelligence (AI) trained on vast amounts of text data to understand, generate, and manipulate human language. These models are built upon deep learning architectures, most notably the Transformer, and represent the state-of-the-art in natural language processing (NLP). Since the release of OpenAI's GPT-3 in 2020 and ChatGPT in 2022, LLMs have become central to modern computing, powering applications ranging from automated coding assistants to creative writing and complex data analysis.

Large Language Model
Part of: Generative AI
Subfield of: Artificial intelligence, natural language processing
Core architecture: Transformer
Key components: Self-attention, neural networks
Notable examples: GPT-4, Llama 3, Claude, Gemini
Training methods: Self-supervised learning, RLHF

Architecture: The Transformer

The fundamental breakthrough that enabled modern LLMs was the Transformer architecture, introduced by researchers at Google in the 2017 paper "Attention Is All You Need." Before Transformers, NLP relied heavily on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which processed text sequentially, one token at a time. This sequential approach made it difficult to capture long-range dependencies in text and prevented efficient parallelization during training.

Self-Attention Mechanism

The core component of the Transformer is the self-attention mechanism. This allows the model to weigh the importance of different words in a sentence relative to a specific target word, regardless of their distance from each other. For example, in the sentence "The bank of the river was muddy," the model uses attention to link "bank" to "river" rather than "financial institution."
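
The following is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices Wq, Wk, and Wv and the toy dimensions are illustrative placeholders; production models use many attention heads with learned weights.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        """Single-head scaled dot-product self-attention.

        X: (seq_len, d_model) token embeddings
        Wq, Wk, Wv: (d_model, d_head) learned projections
        """
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise token similarities
        weights = softmax(scores, axis=-1)       # each row sums to 1
        return weights @ V                       # attention-weighted mix of values

    # Toy usage: 5 tokens, model width 16, head width 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))
    Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)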

Tokens and Embeddings

LLMs do not process raw text. Instead, text is broken down into tokens—sub-word units like "ing" or "cat." These tokens are converted into high-dimensional vectors called embeddings. These vectors represent the semantic meaning of the token in a mathematical space where similar concepts are positioned closer together.
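
A minimal sketch of the embedding lookup, assuming a hypothetical five-token vocabulary. Real tokenizers (e.g., byte-pair encoding) learn tens of thousands of sub-word units, and the embedding table is learned during training rather than random.

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2, "run": 3, "##ing": 4}  # toy vocabulary
    d_model = 8                                                   # embedding width

    rng = np.random.default_rng(1)
    embedding_table = rng.normal(size=(len(vocab), d_model))      # learned in practice

    def embed(tokens):
        """Map token strings to integer ids, then to embedding vectors."""
        ids = [vocab[t] for t in tokens]
        return embedding_table[ids]

    print(embed(["the", "cat", "sat"]).shape)  # (3, 8)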

Training Process

The development of an LLM typically involves three primary stages:

  1. Pre-training: The model is exposed to massive datasets (such as Common Crawl, Wikipedia, and GitHub) and tasked with predicting the next token in a sequence (see the sketch after this list). This is self-supervised learning: the data itself provides the "labels." During this phase, the model acquires grammar, factual knowledge about the world, and rudimentary reasoning abilities.
  2. Supervised Fine-Tuning (SFT): The pre-trained model is further trained on a smaller, curated dataset of instruction-following examples (e.g., "Write a poem about a robot"). This teaches the model how to act as an assistant.
  3. Reinforcement Learning from Human Feedback (RLHF): Human evaluators rank multiple outputs from the model by quality, safety, and helpfulness. A reward model is trained on these rankings and then used to fine-tune the LLM, typically with a policy-gradient method such as Proximal Policy Optimization (PPO).
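
The pre-training objective in stage 1 reduces to cross-entropy between the model's predicted distribution and the actual next token. A minimal NumPy sketch, with random toy logits standing in for a real model's output:

    import numpy as np

    def next_token_loss(logits, targets):
        """Average cross-entropy for next-token prediction.

        logits:  (seq_len, vocab_size) model scores at each position
        targets: (seq_len,) id of the true next token at each position
        """
        shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(len(targets)), targets].mean()

    rng = np.random.default_rng(2)
    logits = rng.normal(size=(3, 5))   # 3 positions, vocabulary of 5
    targets = np.array([1, 2, 0])      # true next-token ids
    print(next_token_loss(logits, targets))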

The GPT Lineage

The "Generative Pre-trained Transformer" (GPT) series by OpenAI illustrates the rapid scaling of LLM technology:

Model | Release | Parameters | Key contribution
GPT-1 | 2018 | 117 million | Demonstrated that unsupervised pre-training improves downstream NLP tasks.
GPT-2 | 2019 | 1.5 billion | Showed zero-shot capabilities; famously described as "too dangerous to release" and initially withheld.
GPT-3 | 2020 | 175 billion | Few-shot learning; able to write working code and long-form essays.
GPT-4 | 2023 | Undisclosed (estimated ~1.7 trillion) | Multimodal (processes text and images); significantly improved reasoning.

Modern Architectures and Variants

While the original Transformer paired an encoder with a decoder, most modern LLMs (like GPT) are decoder-only architectures, optimized for generating text. Other variations remain in use:

  * Encoder-only models (e.g., BERT) produce contextual representations of input text and are suited to classification and retrieval rather than free-form generation.
  * Encoder-decoder models (e.g., T5) map an input sequence to an output sequence and are suited to tasks such as translation and summarization.
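
Decoder-only models generate text left to right by masking out future positions in the attention scores. A minimal sketch of such a causal mask (the softmax helper and the toy scores are illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def causal_mask(seq_len):
        """Lower-triangular mask: position i may attend only to positions <= i."""
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))

    scores = np.random.default_rng(3).normal(size=(4, 4))  # raw attention scores
    masked = np.where(causal_mask(4), scores, -np.inf)     # hide future tokens
    weights = softmax(masked)                              # masked entries become 0
    print(np.triu(weights, k=1).sum())                     # 0.0: no attention to the future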

Capabilities and Limitations

LLMs have demonstrated emergent behaviors: abilities that appear only after a model reaches a certain scale. These include logical reasoning, basic mathematics, and, some studies suggest, a rudimentary theory of mind. However, they face significant hurdles, most notably hallucination, the generation of fluent but factually incorrect text:

"The fundamental problem with large language models is that they are designed to be plausible, not necessarily truthful." — Common observation in AI safety research

Recent research focuses on Retrieval-Augmented Generation (RAG), which lets an LLM look up information in external databases before generating an answer, reducing hallucinations and grounding responses in up-to-date information.
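
A minimal sketch of the RAG pattern, assuming a toy in-memory document store and a stand-in embedding function; in practice the embeddings come from a trained encoder and retrieval uses a vector index such as FAISS.

    import numpy as np

    rng = np.random.default_rng(4)
    docs = ["LLMs are trained on text.",
            "Transformers use attention.",
            "RAG retrieves documents before answering."]
    doc_vecs = rng.normal(size=(len(docs), 8))  # placeholder document embeddings

    def embed_query(query):
        return rng.normal(size=8)  # stand-in for a real embedding model

    def retrieve(query, k=2):
        """Return the k documents most similar to the query (cosine similarity)."""
        q = embed_query(query)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return [docs[i] for i in np.argsort(-sims)[:k]]

    def rag_prompt(query):
        """Prepend retrieved context so the model answers from evidence."""
        context = "\n".join(retrieve(query))
        return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    print(rag_prompt("How are LLMs trained?"))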
