Large language model
This article discusses the technical architecture and historical development of large language models. For the ethical and societal implications, see Ethics of artificial intelligence.
A large language model (LLM) is a type of artificial intelligence (AI) trained on vast amounts of text data to understand, generate, and manipulate human language. These models are built upon deep learning architectures, most notably the Transformer, and represent the state-of-the-art in natural language processing (NLP). Since the release of OpenAI's GPT-3 in 2020 and ChatGPT in 2022, LLMs have become central to modern computing, powering applications ranging from automated coding assistants to creative writing and complex data analysis.
| Large Language Model | |
|---|---|
| Type | Generative AI |
| Subfield | Artificial Intelligence, NLP |
| Core Architecture | Transformer |
| Key Components | Self-attention, Neural networks |
| Notable Examples | GPT-4, Llama 3, Claude, Gemini |
| Training Method | Self-supervised learning, RLHF |
Architecture: The Transformer
The fundamental breakthrough that enabled modern LLMs was the Transformer architecture, introduced by researchers at Google in the 2017 paper "Attention Is All You Need." Before Transformers, NLP relied heavily on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which processed text one token at a time. This sequential approach made it difficult to capture long-range dependencies in text and prevented the computation from being parallelized efficiently during training.
Self-Attention Mechanism
The core component of the Transformer is the self-attention mechanism, which allows the model to weigh the importance of every word in a sentence relative to a specific target word, regardless of the distance between them. For example, in the sentence "The bank of the river was muddy," the model can attend from "bank" to "river," resolving the word to its riverbank sense rather than the financial one.
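The computation behind this mechanism is compact enough to sketch directly. The following is a minimal, single-head scaled dot-product attention in NumPy; the projection matrices and dimensions are illustrative stand-ins for a real model's learned parameters, not taken from any particular implementation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a token sequence.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q                      # queries: what each token is looking for
    K = X @ W_k                      # keys: what each token offers to others
    V = X @ W_v                      # values: the content that gets mixed
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise relevance, scaled for stability
    # Row-wise softmax: each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # each output is a weighted mix of values

# Toy example: 5 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```

In a full Transformer, many such heads run in parallel and their outputs are concatenated, letting different heads specialize in different kinds of relationships.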
Tokens and Embeddings
LLMs do not process raw text. Instead, text is broken down into tokens—sub-word units like "ing" or "cat." These tokens are converted into high-dimensional vectors called embeddings. These vectors represent the semantic meaning of the token in a mathematical space where similar concepts are positioned closer together.
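A minimal sketch of this token-to-embedding pipeline, using a made-up four-entry vocabulary and a randomly initialized embedding table (in a real model both the tokenizer's vocabulary and the embedding table are far larger, and the table is learned during training):

```python
import numpy as np

# Hypothetical toy vocabulary mapping sub-word tokens to integer IDs.
vocab = {"the": 0, "cat": 1, "walk": 2, "ing": 3}

d_model = 8                          # embedding width (real models use thousands)
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), d_model))

# "the cat walking" -> sub-word tokens -> integer IDs -> embedding vectors
tokens = ["the", "cat", "walk", "ing"]
ids = [vocab[t] for t in tokens]
embeddings = embedding_table[ids]    # shape: (4, 8), one row per token
print(ids, embeddings.shape)
```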
Training Process
The development of an LLM typically involves three primary stages:
- Pre-training: The model is exposed to massive datasets (such as Common Crawl, Wikipedia, and GitHub) and tasked with predicting the next token in a sequence. This is self-supervised learning: the data itself provides the "labels." During this phase, the model learns grammar, facts about the world, and some reasoning ability. (The objective is sketched in code after this list.)
- Supervised Fine-Tuning (SFT): The pre-trained model is further trained on a smaller, curated dataset of instruction-following examples (e.g., "Write a poem about a robot"). This teaches the model how to act as an assistant.
- Reinforcement Learning from Human Feedback (RLHF): Human evaluators rank multiple outputs from the model based on quality, safety, and helpfulness. A reward model is trained on these rankings, which is then used to fine-tune the LLM using Proximal Policy Optimization (PPO).
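The pre-training objective in the first stage is simple to state: maximize the probability of each actual next token. The sketch below, assuming a toy model that has already produced logits over a small vocabulary, computes the corresponding average cross-entropy loss; real training minimizes this same quantity over trillions of tokens.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for next-token prediction.

    logits:    (seq_len, vocab_size) model scores at each position
    token_ids: (seq_len,) the actual token sequence
    The logits at position t are scored against the token at t + 1.
    """
    # Softmax over the vocabulary at each position.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    targets = token_ids[1:]                  # each position predicts the NEXT token
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked).mean()

# Toy example: random "model" scores over a 10-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
token_ids = rng.integers(0, 10, size=6)
print(next_token_loss(logits, token_ids))
```

SFT reuses the same loss on curated instruction-response pairs, while RLHF replaces it with a reward signal derived from human preference rankings.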
The GPT Lineage
The "Generative Pre-trained Transformer" (GPT) series by OpenAI illustrates the rapid scaling of LLM technology:
| Model | Release Year | Parameters | Key Contribution |
|---|---|---|---|
| GPT-1 | 2018 | 117 million | Demonstrated that unsupervised pre-training improves NLP tasks. |
| GPT-2 | 2019 | 1.5 billion | Showed "zero-shot" capabilities; famously deemed "too dangerous to release" initially. |
| GPT-3 | 2020 | 175 billion | Few-shot learning; ability to write code and complex essays. |
| GPT-4 | 2023 | Undisclosed (estimates around 1.7 trillion) | Multimodal (processes text and images); significantly improved reasoning. |
Modern Architectures and Variants
While the original Transformer had an encoder and a decoder, most modern LLMs (like GPT) are decoder-only architectures, optimized for generating text. However, other variations exist:
- Encoder-only: Models like BERT (Bidirectional Encoder Representations from Transformers) are designed for understanding tasks like sentiment analysis or classification, rather than generation.
- Mixture of Experts (MoE): Instead of activating the entire neural network for every query, MoE models (such as Mixtral 8x7B, and reportedly GPT-4) use a "router" to send each input to specific sub-networks ("experts"). This allows much larger total parameter counts with lower computational cost during inference; a routing sketch follows this list.
- Open-Weights Models: Starting with Meta's Llama series, many powerful models have been released with weights available to the public, fostering a massive ecosystem of fine-tuned variants like Vicuna and Alpaca.
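A minimal sketch of top-k expert routing, assuming a single token embedding, a learned router matrix, and plain linear maps standing in for the expert feed-forward networks (real MoE layers add load-balancing losses and batched dispatch):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    x:        (d_model,) a single token's embedding
    router_w: (d_model, n_experts) learned routing matrix
    experts:  list of (d_model, d_model) matrices (stand-ins for expert FFNs)
    """
    scores = x @ router_w             # one routing score per expert
    top = np.argsort(scores)[-k:]     # indices of the k highest-scoring experts
    # Softmax over only the selected experts' scores.
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    # Only k experts run; the remaining sub-networks stay inactive.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

# Toy sizes: model width 8, 4 experts, top-2 routing.
rng = np.random.default_rng(1)
x = rng.normal(size=8)
router_w = rng.normal(size=(8, 4))
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
print(moe_layer(x, router_w, experts).shape)  # (8,)
```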
Capabilities and Limitations
LLMs have demonstrated emergent behaviors—abilities that appear only after a model reaches a certain size. These include logical reasoning, basic mathematics, and theory of mind. However, they face significant hurdles:
"The fundamental problem with large language models is that they are designed to be plausible, not necessarily truthful." — Common observation in AI safety research
- Hallucination: The model may confidently state false information because its goal is to predict the most likely next word, not to check facts against a database.
- Context Window: Models have a limit on how much text they can "remember" at once. Modern models like Google's Gemini 1.5 have expanded this to millions of tokens.
- Bias: Since they are trained on human-generated internet data, LLMs can inherit and amplify societal biases regarding race, gender, and religion.
- Computational Cost: Training a state-of-the-art LLM requires thousands of GPUs (like the NVIDIA H100) and millions of dollars in electricity and hardware.
Recent research focuses on Retrieval-Augmented Generation (RAG), which lets an LLM look up information in an external document store before generating an answer. Grounding responses in retrieved text can substantially reduce hallucinations and supplies up-to-date information the model never saw during training.
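A minimal sketch of the retrieve-then-prompt pattern, with a stand-in embed function and an in-memory document list; production systems use learned embedding models and vector databases, but the structure is the same.

```python
import numpy as np

def embed(text, dim=16):
    """Hypothetical stand-in for a learned text-embedding model."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

documents = [
    "The Transformer architecture was introduced in 2017.",
    "RLHF fine-tunes models using human preference rankings.",
    "Tokens are sub-word units mapped to embedding vectors.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    """Return the k documents most similar to the query (cosine similarity)."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "What is RLHF?"
context = "\n".join(retrieve(query))
# The retrieved passages are prepended to the prompt before generation,
# grounding the model's answer in external text.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```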