What Is Multimodal AI?

A Plain-English Guide to Text, Image, Audio & Video — All in One Model
April 2, 2026 by Vishal

Quick Answer: What is multimodal AI?

Multimodal AI is an artificial intelligence system that can understand, and in some cases generate, more than one type of information (such as text, images, audio, and video) within a single unified model. Unlike older systems that handled one data type at a time, multimodal AI processes these inputs together, allowing it to reason across them simultaneously. Examples include GPT-4o (OpenAI), Gemini 1.5 (Google), and Claude 3.5 (Anthropic), although the exact mix of input and output modalities differs from model to model.


You snap a photo of a broken appliance, speak your question aloud, and an AI responds — having read the product manual, studied the image of the damage, and heard the concern in your voice, all at once. No switching apps. No copy-pasting. Just one fluid, intelligent conversation across every form of information you chose to share.

A few years ago, that scenario would have required three separate specialized tools — a speech-to-text engine, an image classifier, and a text chatbot — stitched together with fragile code and significant data loss at every handoff. Today, it describes a single model. That shift has a name: multimodal AI.

This guide explains multimodal AI in plain English — what it means, how it actually works under the hood, what you can do with it today, where it still falls short, and where it is heading next. No machine learning degree required.


What Does 'Multimodal' Actually Mean?

A mode (or modality) is a type or channel of information. Text is a mode. An image is a mode. Audio is a mode. Video is a mode. The word 'multimodal' simply means 'more than one mode.'

Unimodal vs. multimodal AI

Most early AI systems were unimodal — they worked in exactly one lane:

  • Text-only models (e.g., early GPT versions): read and write text, nothing else.

  • Image classifiers (e.g., AlexNet, ResNet): identify objects in photos, but cannot answer a question about them in plain language.

  • Speech recognizers (e.g., early Siri): convert spoken words to text, but do not understand the content.

A multimodal AI breaks those walls down. It can take an image AND a question as input and return a spoken answer. Or it can read a PDF, watch a video, and produce a written summary — all inside one model, all in one context.

The crucial distinction: joint models vs. chained tools

There is an important difference between a true multimodal model and a pipeline of separate unimodal tools:

Approach | How it works
--- | ---
Chained tools (old) | Text model calls a separate image model, passes the output back, then calls a speech model. Lossy, slow, rigid.
True multimodal model (new) | All modalities processed inside one neural network. Context and meaning flow freely across types without translation loss.


The analogy: chained tools are like three specialists reading each other's notes in separate rooms. A true multimodal model is like a single person who can see, hear, and read simultaneously — drawing on all three at once when forming a thought.

What modalities exist?

The four core modalities in today's AI systems are:

  • Text — written language in any form: articles, code, conversations, documents.

  • Image — photographs, diagrams, screenshots, medical scans, illustrations.

  • Audio — speech, music, ambient sound, tone of voice.

  • Video — moving images, combining visual frames with an audio track over time.

Emerging modalities include structured data (tables, spreadsheets), 3D point clouds, and even biological signals like brain activity — but text, image, audio, and video are where today's most capable systems operate.


A Brief History: From Single-Sense to All-Senses AI

Understanding where multimodal AI came from helps explain why it feels like such a leap. The progression happened in roughly five eras:

Era 1 — Rule-based systems (pre-2012)

AI systems were hand-crafted, domain-specific, and strictly unimodal. A chess-playing program knew nothing about language. A spell-checker knew nothing about images. Each system lived in its own silo, built by specialists for one narrow task.

Era 2 — Deep learning takes over (2012–2017)

AlexNet's victory in the 2012 ImageNet competition ignited the deep learning era. Convolutional neural networks (CNNs) transformed computer vision; recurrent neural networks (RNNs) made progress on language and speech. These were still separate systems — they just learned from data rather than hand-coded rules.

Era 3 — The transformer changes everything (2017–2020)

Google Brain's 2017 paper Attention Is All You Need introduced the transformer architecture. Its self-attention mechanism proved remarkable: it could learn relationships between any two elements in a sequence, regardless of distance. Crucially, 'sequence' was a flexible concept — words, image patches, and audio frames could all be represented as sequences of tokens. The architecture was modality-agnostic from the start.

Era 4 — Vision and language collide (2021–2023)

OpenAI's CLIP (2021) demonstrated that images and text could be embedded into the same vector space — meaning a model trained on image-caption pairs could match photos to sentences without ever being told what objects look like. DALL-E brought text-to-image generation. GPT-4V brought serious image understanding to a mainstream language model. OpenAI's Whisper brought robust, multilingual speech recognition.

Era 5 — Natively multimodal models (2024–present)

The frontier shifted to models trained jointly across modalities from the beginning: not a language model with vision bolted on, but a single architecture that aims to treat text, image, audio, and video as first-class citizens. Google's Gemini 1.5, OpenAI's GPT-4o, and Anthropic's Claude 3 family represent this generation, though they differ in which modalities each accepts and produces. Real-time voice and video interaction arrived in consumer products. It was a genuine paradigm shift.

Key milestone models at a glance

Year | Milestone
--- | ---
2017 | Transformer architecture (Google Brain)
2020 | GPT-3: large language model at scale (OpenAI)
2021 | CLIP: joint vision-language embedding (OpenAI)
2022 | DALL-E 2, Stable Diffusion: text-to-image generation
2022 | Whisper: robust multilingual speech recognition (OpenAI)
2023 | GPT-4V: vision understanding in a large language model
2024 | Gemini 1.5 Pro: 1M token context, natively multimodal
2024 | GPT-4o: real-time audio + vision in one model



How Multimodal AI Works Under the Hood

You do not need to understand this section to use multimodal AI. But understanding it will help you use it more effectively, anticipate its limits, and evaluate claims made about it. Here is how it actually works — with no math.

The token: AI's universal unit of meaning

Before a neural network can process anything, the input must become numbers. To get there, the input is first broken into small units called tokens, and each token is then mapped to a list of numbers.

  • Text tokens: words or word fragments. 'Multimodal' might become ['Multi', '##modal'].

  • Image tokens: patches of pixels. A 224×224 image divided into 16×16 patches gives 196 tokens, each representing a small tile of the image.

  • Audio tokens: small time slices of a spectrogram (a 2D visual representation of sound frequency over time).

  • Video tokens: sequences of image patches interleaved with audio tokens, maintaining temporal order.

Analogy: think of tokens as Lego bricks. Text bricks are rectangular, image bricks are square patches, audio bricks are wavy slices — but they all snap into the same baseplate. The model does not see 'a photo' or 'a sentence'; it sees a long sequence of numbered bricks.
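To make the image-token arithmetic above concrete, here is a minimal sketch in plain Python. The function names (count_patches, split_into_patches) and the toy 4×4 "image" are invented for illustration; real vision encoders work on tensors and also apply a learned projection to each patch, which is left out here.

```python
def count_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles in a height x width image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def split_into_patches(image, patch):
    """Cut a 2D grid (a list of rows) into patch x patch tiles, in row-major order.

    Each tile is flattened into one list of pixel values: one 'token'.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tokens.append(tile)
    return tokens


# The example from the text: a 224x224 image cut into 16x16 patches.
print(count_patches(224, 224, 16))  # 196

# A toy 4x4 grayscale "image" (hypothetical values) cut into 2x2 patches.
toy = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
print(split_into_patches(toy, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

The 196 comes from 224 / 16 = 14 patches per side, and 14 × 14 = 196.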

Encoders: translating the world into numbers

Before all those tokens can be fed into the main model, each modality goes through a specialized encoder that converts it into a vector — a long list of numbers that represents the content's meaning in a shared mathematical space called an embedding space.

  • Image encoder: a Vision Transformer (ViT) chops an image into patches and encodes their content and spatial relationships.

  • Audio encoder: converts a raw waveform into a spectrogram, then processes it like an image through a similar patch-based encoder.

  • Text encoder: a standard tokenizer and embedding layer (this is the familiar part of any language model).

  • Video encoder: processes frame patches and audio simultaneously, adding temporal position information so the model knows which frame came first.

The magic is that all these encoders output embeddings that live in the same mathematical space. A vector representing the word 'sunset' can be compared mathematically to a vector representing an image of an orange sky. This shared space is what makes cross-modal reasoning possible.
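One common way to compare two vectors in a shared space is cosine similarity. The sketch below uses made-up 4-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come out of trained encoders, not hand-typed lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example.
text_sunset = [0.9, 0.1, 0.8, 0.0]       # the word 'sunset'
image_orange_sky = [0.8, 0.2, 0.9, 0.1]  # a photo of an orange sky
image_keyboard = [0.0, 0.9, 0.1, 0.8]    # a photo of a keyboard

similar = cosine_similarity(text_sunset, image_orange_sky)
different = cosine_similarity(text_sunset, image_keyboard)
print(round(similar, 3), round(different, 3))
```

With these toy numbers, the 'sunset' text vector scores much higher against the orange-sky image than against the keyboard image, which is the behavior a well-trained shared embedding space is meant to produce.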

The shared transformer: where modalities talk to each other

Once encoded, all tokens — regardless of origin — are fed into the same transformer backbone. This is where the real reasoning happens.

The transformer's attention mechanism allows every token to attend to every other token. An image patch can look at a text token. An audio slice can reference an earlier frame patch. There is no barrier between modalities inside the transformer — it is one continuous flow of mutual reference.

Analogy: imagine a multilingual summit where every delegate wears an earpiece with perfect real-time translation. German, Mandarin, and Portuguese delegates can all reference each other's exact statements. The transformer is that earpiece — everything everyone said is available to everyone else, regardless of the original channel.
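For readers who want to see the mechanics, here is a stripped-down sketch of self-attention over a mixed sequence. It is an assumption-heavy toy: the three token vectors are invented, and the learned query, key, and value projections that real transformers apply are replaced by the raw vectors themselves.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with no learned projections (a simplification).

    Returns (outputs, weights), where weights[i][j] is how much token i attends to token j.
    """
    dim = len(tokens[0])
    weights, outputs = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted blend of every token in the sequence.
        outputs.append([sum(w * value[i] for w, value in zip(row, tokens))
                        for i in range(dim)])
    return outputs, weights


# Hypothetical tokens from three different modalities in one sequence.
labels = ["text: 'sunset'", "image patch: orange sky", "audio slice: waves"]
vectors = [[1.0, 0.0, 1.0], [0.9, 0.1, 0.8], [0.0, 1.0, 0.2]]

outputs, weights = self_attention(vectors)
for label, row in zip(labels, weights):
    print(label, [round(w, 2) for w in row])
```

Nothing in the function knows which token came from which modality. The text token simply ends up attending more to the image patch than to the audio slice because their vectors are more alike.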

Decoders: generating the output

The output side mirrors the input side. After the transformer has built up a rich, cross-modal understanding, decoders translate the internal representation back into a specific modality:

  • Text decoder: autoregressive generation — predict the next token, append it, predict the next, and repeat until done. This is the familiar mechanism behind chatbots.

  • Image decoder: a diffusion model or vector-quantized variational autoencoder (VQ-VAE) reconstructs pixel space from the internal representation.

  • Audio decoder: a vocoder or mel-spectrogram inversion converts the model's output back into a waveform.
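The predict-append-repeat loop of the text decoder can be sketched in a few lines. The lookup table below is a hypothetical stand-in for a trained model: a real model scores every token in its vocabulary based on the whole sequence so far, not just the last token, and often samples rather than always taking the top choice.

```python
# Hypothetical next-token scores, standing in for a trained model.
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Greedy autoregressive generation: predict, append, repeat."""
    sequence = ["<start>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_SCORES[sequence[-1]]
        next_token = max(candidates, key=candidates.get)  # pick the top-scoring token
        if next_token == "<end>":
            break
        sequence.append(next_token)
    return sequence[1:]  # drop the <start> marker


print(generate())  # ['the', 'cat', 'sat']
```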

Important: not every multimodal model has every decoder. Many models can understand images but not generate them. Understanding (input) and generation (output) are different capabilities that require separate training objectives.

Training: learning from the whole internet at once

A multimodal model is trained on a vast dataset containing all modality types simultaneously — image-caption pairs, audio transcripts, video subtitles, interleaved text and images, and pure text. The model learns to predict missing information across any of these channels.

Key training techniques include:

  • Contrastive learning: push embeddings of matching pairs (an image and its caption) closer together in vector space; push non-matching pairs apart. CLIP used this to great effect.

  • Next-token prediction: the classic language modelling objective, extended to multimodal token sequences.

  • Masked modality prediction: randomly mask out a chunk of one modality and train the model to reconstruct it from the others.

  • Instruction tuning and RLHF: after pre-training, fine-tune on human-labelled multimodal conversations so the model responds helpfully across all modality combinations.
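The contrastive objective in the first bullet can be sketched as a small loss function. This is a simplified, one-directional version (image to text) with invented 2-dimensional embeddings and an assumed temperature of 0.1; CLIP-style training also includes the text-to-image direction and works on large batches.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to length 1 so the dot product becomes cosine similarity."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average cross-entropy of picking the matching caption for each image.

    Pair i (image_embs[i], text_embs[i]) is the true match; every other
    caption in the batch acts as a negative example.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, txt) / temperature for txt in texts]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]  # equals -log(softmax(logits)[i])
    return total / len(images)


# Hypothetical embeddings: captions[i] describes aligned_images[i].
captions = [[0.9, 0.2], [0.2, 0.9]]
aligned_images = [[1.0, 0.1], [0.1, 1.0]]
misaligned_images = [[0.1, 1.0], [1.0, 0.1]]

low = contrastive_loss(aligned_images, captions)
high = contrastive_loss(misaligned_images, captions)
print(round(low, 4), round(high, 4))
```

The loss is small when each image sits closest to its own caption and large when it sits closer to someone else's. Training nudges the encoders in whichever direction lowers it.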

The key insight in one sentence

Once you represent every type of information as a sequence of tokens in a shared vector space, a single transformer can learn the relationships between all of them — and reason across them just as naturally as it reasons within any single one.


The Four Modalities — A Deep Dive on Each

Text: the anchor modality

Text was the first modality that large transformer-based models mastered, and it remains the modality against which all others are aligned. Nearly every multimodal model still uses text as its primary output channel, even when processing images or audio.

What text enables: question-answering, summarization, translation, reasoning, code generation, document drafting — the full range of language tasks.

What multimodality adds: the ability to ground text in visual or audio reality. A text-only model answering 'what is wrong with this wiring diagram?' is guessing from descriptions. A multimodal model is looking at the diagram itself — a fundamentally different epistemic position.

Key challenge: long context. Handling a 200-page legal contract, a full-day transcript, or a book-length document requires context windows that were unthinkable just three years ago. Gemini 1.5 Pro's 1 million token context window is a meaningful advance.


Image: seeing and showing

Image capability divides into two directions: understanding (vision) and generation.

Image understanding capabilities include:

  • Object recognition and scene description

  • Optical character recognition (OCR) — reading text within images

  • Spatial reasoning (what is to the left of the lamp?)

  • Chart, graph, and diagram comprehension

  • Medical imaging analysis (X-rays, MRIs, dermatology photos)

  • Document parsing — reading scanned forms, invoices, handwritten notes

Image generation capabilities include:

  • Text-to-image synthesis (Stable Diffusion, DALL-E 3, Midjourney, Imagen 3)

  • Image editing based on natural language instructions

  • Inpainting (filling in or replacing parts of an image)

  • Style transfer and artistic rendering

Key technique: the Vision Transformer (ViT) treats an image as a grid of patches rather than processing pixel by pixel — the same attention mechanism used for text now attends across spatial regions of an image.

Key challenges: fine-grained spatial reasoning ('how many objects are between the red cube and the blue sphere?'); reading very small text in dense images; hallucinating visual details that are not actually present.

Audio: hearing and speaking

Audio is broader than speech. Multimodal AI systems that handle audio typically deal with three distinct sub-types:

  • Speech: spoken human language — the modality that enables voice assistants, real-time translation, meeting transcription.

  • Non-speech audio: music, ambient sounds, animal calls, machinery — content without a linguistic encoding.

  • Paralinguistic features: tone, emotion, pace, stress — the 'how' of speech rather than the 'what'.

Audio understanding: automatic speech recognition (ASR), speaker diarization (who said what), emotion detection, music transcription, sound event detection.

Audio generation: text-to-speech synthesis, voice cloning from short samples, music generation (Suno, Udio), sound effects synthesis, dubbing and lip sync.

Key technique: converting a waveform into a mel-spectrogram (a 2D image of frequency versus time), then processing it with the same kind of patch-based transformer encoder used for images. In effect, audio is treated as a visual task.
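To get a feel for how many time slices a spectrogram contains, here is a small sketch. The 16 kHz sample rate, 25 ms window, and 10 ms hop are assumed, commonly used speech-processing values rather than figures from any specific model discussed here, and the function ignores padding, which many real pipelines add.

```python
def spectrogram_frames(num_samples: int, window: int, hop: int) -> int:
    """Number of full analysis windows that fit when sliding forward by `hop` samples."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


sample_rate = 16_000  # samples per second (assumed)
window = 400          # 25 ms at 16 kHz
hop = 160             # 10 ms at 16 kHz

one_second = spectrogram_frames(sample_rate * 1, window, hop)
print(one_second)  # 98
```

Each of those frames becomes one column of the spectrogram "image", which is then cut into patches in the same way as a photo.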

Key challenges: real-time latency (sub-200ms is needed for natural conversation); robustness to background noise; accent, dialect, and language coverage; the ethics of voice cloning.

Video: sight, sound, and time

Video is the most computationally demanding modality because it combines images and audio over time. A single minute of standard video at 24 frames per second is 1,440 image frames plus accompanying audio — before any analysis begins.
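The frame count above, and what it could mean in tokens, can be checked with a few lines of arithmetic. The 196 patches per frame is borrowed from the 224×224 image example earlier in this guide and is only an assumption here; real systems typically sample fewer frames and compress tokens, so treat the result as a naive upper bound rather than what any particular model does.

```python
def video_frames(seconds: int, fps: int) -> int:
    """Total frames in a clip of the given length."""
    return seconds * fps


def naive_visual_tokens(frames: int, patches_per_frame: int) -> int:
    """Token count if every frame were split into patches with no sampling or compression."""
    return frames * patches_per_frame


frames = video_frames(60, 24)
tokens = naive_visual_tokens(frames, 196)  # 196 patches per frame is an assumption
print(frames)  # 1440
print(tokens)  # 282240
```

Even this back-of-the-envelope figure shows why one minute of video can cost far more to process than a long text document.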

Video understanding: action recognition ('is this person running or dancing?'), temporal reasoning ('what happened before the car turned?'), highlight detection, video captioning, surveillance analytics, sports analysis.

Video generation: text-to-video synthesis (OpenAI's Sora, Runway Gen-3, Kling), video editing, scene continuation, slow-motion generation, video upscaling.

Key challenge: temporal coherence — keeping the same character looking the same across hundreds of frames; physical plausibility (water that flows correctly, objects that maintain mass and continuity). Current generation models still struggle with complex multi-object interactions over more than a few seconds.

Long-context video breakthrough: Gemini 1.5 Pro demonstrated processing an entire one-hour video within a single context window, something that would have required custom engineering just a year earlier. That makes it possible to ask 'at what point in this documentary does the speaker contradict themselves?' and receive a time-coded answer.


Real-World Applications: Where Multimodal AI Shows Up Today

Healthcare and medicine

Healthcare may be the domain where multimodal AI's ability to integrate multiple information sources creates the most consequential value.

  • Radiology: AI systems that read X-rays and CT scans alongside patient history notes, flagging findings that align or conflict with the clinical narrative.

  • Dermatology: a photograph of a skin lesion combined with a text description of duration and symptoms — fed together into a model trained on clinical outcomes.

  • Pathology: digital microscope slides analyzed at scale, with the model simultaneously reading the pathologist's notes.

  • Mental health monitoring: tone of voice, facial expression, and transcript analyzed together to track symptom changes between clinical appointments.

  • Surgical guidance: real-time video of an operating field combined with the patient's electronic health record and surgical literature.

Accessibility

Multimodal AI is already transforming accessibility tools in ways that were impossible with unimodal systems:

  • Rich image description: rather than a flat alt-text label, a blind user receives a detailed description that includes layout, context, and implied content, conveying roughly what a sighted reader would take in at a glance.

  • Sign language recognition and translation: video of a signer interpreted in real time, with the text or audio response generated simultaneously.

  • Audio description for video: automatic narration of visual events for visually impaired viewers, timed to gaps in the audio track.

  • Captions that include sound context: '[upbeat music begins]' or '[crowd cheering]' alongside spoken word transcription, giving deaf viewers full sonic context.

Education

  • Photo-to-explanation: a student photographs a textbook problem — handwritten or printed — and receives a spoken, step-by-step walkthrough.

  • Language learning: listen to a native speaker clip, read the transcript, record yourself, and receive pronunciation feedback that draws on all three.

  • Adaptive tutoring: a model that watches a student work through a problem on a whiteboard (via camera), reads their written answers, and identifies misconceptions in real time.

  • Historical archive access: scanned handwritten documents from centuries ago, automatically transcribed and made searchable, with the original image and the transcription cross-linked.

Creative industries

  • Music video production: a finished song combined with a director's mood board of images generates a video concept, storyboard, and initial rough cut.

  • Advertising: a creative brief (text) plus brand asset images generate copy, visual concepts, and voiceover scripts in a single workflow.

  • Game development: describe a non-player character in text; receive concept art, a dialogue script, a voice line recording, and an animation brief together.

  • Film post-production: AI dubbing that analyzes lip movements in the original language and generates dubbed audio timed to match — without the uncanny mismatch of traditional dubbing.

Business and enterprise productivity

  • Meeting intelligence: video, audio, and calendar context combined to produce summaries, action items, decision logs, and speaker-attributed transcripts.

  • Customer support: a customer sends a photo of a broken product plus a voice note describing the problem; the AI diagnoses the issue, checks the warranty database, and routes to the right team — without human triage.

  • Document processing: scanned invoices, contracts, and forms automatically parsed, validated against business rules, and entered into the relevant system.

  • Retail: visual search ('find a sofa like this one') and virtual try-on (overlay a garment on a customer's photo) both require image understanding at their core.

Scientific research

  • Materials science: electron microscope images analyzed alongside experimental notes and prior literature, predicting material properties without requiring additional synthesis.

  • Ecology: acoustic monitoring (audio of a rainforest) combined with satellite imagery to track species distribution and deforestation simultaneously.

  • Archaeology: aerial photography, ground-penetrating radar data, and historical text records fused together to locate and interpret buried sites.


Why Multimodal AI Is More Than the Sum of Its Parts

A reasonable question: couldn't you just chain together separate unimodal tools and get the same result? The answer is no — and understanding why is essential to understanding what makes multimodal AI genuinely new.

Emergent cross-modal reasoning

Certain capabilities appear only when modalities are processed together. They are not present in any single-modality model, and they cannot be replicated by passing data between unimodal tools. Researchers call these emergent capabilities.

  • Understanding a meme requires reading the text AND interpreting the image's emotional register simultaneously. A language model sees words. A vision model sees pixels. Neither sees the joke.

  • Detecting inconsistency between speech and expression — when someone's words say yes and their face says no — requires audio and video at the same moment, with the same contextual understanding.

  • Answering 'what part of this diagram does the narrator seem most uncertain about?' requires correlating audio tone with visual content in real time.

Grounding reduces hallucination

Hallucination — an AI model confidently stating something false — is substantially reduced when the model is looking at the source material rather than reconstructing it from memory. A model given the actual image of a contract clause is far less likely to invent a clause that does not exist than a model working from a text description of the contract.

This is not a minor improvement. In high-stakes fields like medicine, law, and finance, the difference between 'hallucinated' and 'grounded in the actual document' is the difference between a useful tool and a liability.

Human communication is inherently multimodal

Human beings have never communicated in a single channel. We speak with our hands, write with diagrams, and reinforce words with tone. Single-modality AI systems required humans to translate their full, rich, multimodal experience into text — an unnatural constraint that limited both what could be communicated and who could communicate it.

Multimodal AI removes that constraint. Point at something and ask a question. Show and tell. Speak naturally. This is not merely a user experience improvement; it is a fundamental change in who can access and effectively use AI systems.

The autonomous agent unlock

AI agents — systems that take actions in the world, not just generate text — require multimodal perception. An agent that can book a flight needs to see the airline's website, not just read HTML source code. A surgical robot needs to see the operating field. A warehouse robot needs to recognize objects in arbitrary positions and orientations.

Multimodal AI is not optional for physical-world AI. It is the prerequisite.


Limitations, Risks, and What Multimodal AI Still Cannot Do

Multimodal AI is genuinely impressive. It is also genuinely limited. Honest assessment of both dimensions is what distinguishes useful AI literacy from hype.

Technical limitations

  • Spatial reasoning: current models still struggle with precise geometry. 'Which of these five objects is closest to the camera?' often produces errors that a five-year-old child would not make.

  • Temporal reasoning in video: tracking cause and effect across long video sequences — especially when action B at minute 45 was caused by action A at minute 3 — remains unreliable.

  • Fine-grained detail: small text in complex images, whispered audio in noisy environments, and very fast motion in video all cause significant accuracy degradation.

  • Compute and cost: processing video especially is expensive. Real-time multimodal inference at scale remains a substantial engineering challenge.

  • Context window limits: even the largest context windows cannot hold a full-length feature film, a multi-year document archive, or a 24-hour surveillance feed.

  • Modality imbalance: most frontier models are still predominantly language models. Vision and audio are often better described as 'well-integrated add-ons' than as truly equal modalities in terms of reasoning depth.

Bias and fairness

Multimodal models trained on internet-scale data inherit the biases of that data — and multimodality can compound those biases:

  • Vision models trained predominantly on Western faces and contexts perform measurably worse on non-Western faces, traditional dress, local writing systems, and cultural objects.

  • Audio models trained on American and British English show degraded accuracy for speakers of global English varieties, tonal languages, and non-standard accents.

  • When a biased vision model and a biased language model are combined, errors from each can reinforce each other — a compounding effect absent from either system alone.

Safety and misuse risks

Greater capability brings greater potential for misuse. The multimodal dimension introduces risks that text-only models do not share:

  • Synthetic media and deepfakes: video and audio generation lowers the cost of fabricating realistic footage of people saying or doing things they never said or did.

  • Voice cloning: generating a convincing imitation of a person's voice from a few seconds of audio sample, enabling fraud and impersonation at scale.

  • Disinformation pipelines: combining persuasive text, photorealistic images, and cloned voices makes disinformation campaigns cheaper to produce and harder to detect.

  • Privacy: a model that can process images could, in poorly governed deployments, identify people, read private documents captured in the background, or analyze sensitive medical imagery without appropriate consent.

What multimodal AI fundamentally cannot do

Even the most capable multimodal systems today have hard limits that are not engineering problems but architectural and epistemological ones:

  • It does not 'understand' in a human sense. It is an extraordinarily sophisticated pattern-matcher. The appearance of understanding can break down on novel inputs that fall outside the distribution of its training data.

  • It has no persistent memory. Without external tooling, each conversation starts fresh. The model does not remember you from last week's session.

  • It cannot verify physical reality. It can model what is likely true based on patterns, but it cannot step outside the model and check. Its knowledge of the physical world is always mediated by training data.

  • It lacks lived experience. Common sense derived from inhabiting a physical body, experiencing time passing, and facing genuine consequences is simply absent — and that absence creates characteristic blind spots.


The Road Ahead: Where Multimodal AI Is Going

Any-to-any models

Current models are mostly asymmetric: strong at taking images as input, weaker at generating them; fluent in text output, just beginning to produce audio. The trajectory is toward fully symmetric any-to-any models — where any modality can serve as input and any modality can be produced as output.

Practical implication: speak a question and receive a hand-drawn diagram as the answer. Describe a concept in text and receive back a short explanatory video. These are not science fiction — they are engineering roadmap items for models already in development.

Real-time and on-device processing

The current frontier is cloud-dependent: multimodal models require significant compute infrastructure that does not fit on a consumer device. The next wave is edge inference — running capable multimodal models on smartphones, tablets, glasses, and wearables.

When a capable multimodal model runs on the device you are wearing, the applications change entirely: real-time translation of everything you see, always-on accessibility tools, private medical monitoring, and AI assistance that works without an internet connection.

Physical world integration

Robotics and embodied AI represent the most significant long-term application of multimodal AI. Systems like Google DeepMind's RT-2 and robotics platforms from companies including Figure AI and Physical Intelligence (Pi) use multimodal transformer models as the reasoning engine for physical manipulation — teaching robots to understand instructions in natural language, perceive their environment visually, and take appropriate actions.

This is qualitatively different from traditional robotics, which required explicit programming for every scenario. A multimodal foundation model can generalize to novel situations because it has learned the structure of the world from its training data — not from hand-coded rules.

New modalities on the horizon

  • 3D and spatial: point cloud data and neural radiance fields (NeRFs) as first-class modalities — essential for augmented reality, robotics, and autonomous vehicles.

  • Structured data: tables, databases, and knowledge graphs treated as full modalities rather than converted to text, preserving their relational structure.

  • Tactile and haptic: early research is exploring touch sensing for robotic manipulation — completing the set of human senses as modalities.

  • Biological signals: EEG and fMRI data for brain-computer interfaces; physiological monitoring (heart rate, skin conductance) for healthcare and human-computer interaction.

Governance and regulatory response

The capabilities described in this guide have not gone unnoticed by lawmakers and standards bodies:

  • EU AI Act: classifies certain multimodal AI systems as high-risk, imposing transparency, documentation, and human oversight requirements.

  • C2PA standard: the Coalition for Content Provenance and Authenticity publishes an open technical standard for attaching cryptographically signed provenance metadata (Content Credentials) to media, including AI-generated media.

  • National AI strategies: the US, UK, China, Canada, Japan, and others have published frameworks specifically addressing generative AI — most of which is multimodal.

The regulatory environment will shape which multimodal capabilities are deployed and under what conditions — making these governance developments as important to track as the technical ones.


Conclusion: The Convergence Moment

The history of AI is, in large part, a history of fragmentation — specialists building brilliant narrow systems that could do one thing extraordinarily well and nothing else at all. Multimodal AI is the convergence of those streams.

Within a single model, the same architecture that learned to write poetry also learned to read an X-ray, transcribe a conversation, and watch a video. Not because these tasks are secretly the same — they are not — but because the transformer's attention mechanism proved powerful enough to find the structure in all of them, and the shared embedding space proved large enough to hold them all together.

The thesis in one sentence

Multimodal AI is the step that closes the gap between how machines process information and how human beings actually experience the world.

You do not need to be a machine learning engineer to benefit from this shift. But understanding it — knowing what multimodal AI actually does, why it is more than a collection of chained tools, and where it still fails — makes you a more effective user, a more critical evaluator, and a better-equipped participant in the conversations that will determine how these systems are built and governed.

The next time you point your phone at something unfamiliar and ask an AI what it is, you are touching the edge of something genuinely new. The question now is not whether AI will be multimodal — it already is. The question is what we will build with it, and whether we will build it wisely.


Glossary of Key Terms

Term | Plain-language definition
--- | ---
Attention mechanism | The core operation in a transformer that allows any token to reference any other token when building its representation.
Contrastive learning | A training technique that teaches a model to pull similar things together and push dissimilar things apart in vector space.
Embedding / vector space | A mathematical space where meanings are represented as long lists of numbers. Similar concepts cluster near each other.
Encoder | A component that converts raw input (pixels, waveforms, text) into a vector representation.
Hallucination | When an AI model confidently generates content that is false, invented, or not supported by its source material.
Modality | A type or channel of information: text, image, audio, or video.
Spectrogram | A visual representation of audio showing frequency (pitch) over time. It is the format most audio AI models work with.
Token | The basic unit of input for a transformer model. Could be a word fragment, an image patch, or a slice of audio.
Transformer | The neural network architecture, introduced in 2017, that underlies virtually all modern large AI models.
Vision Transformer (ViT) | A transformer model applied to images by treating image patches as tokens, the same way words are treated in language models.