The Modalities of Agentic Systems: How AI Agents Interface With Users
A breakdown of the four modalities agentic systems use to interface with users: text, graphical, speech, and video. Includes real-world tool examples like ChatGPT, Cursor, GitHub Copilot, and Wispr Flow.
I’ve been reading Building Applications with AI Agents by Michael Albada, and Chapter 3, on user experience design for agentic systems, stood out to me. One concept in particular caught my attention: modalities, or the different ways AI agents interface with users.
There are four primary modalities:
- Text-based interfaces
- Graphical interfaces
- Speech and voice interfaces
- Video-based interfaces
Each modality shapes the user experience in distinct ways. Here’s how they work, along with some tools that bring them to life.
1. Text-Based Interfaces
Text-based interfaces are the most common modality for agentic systems. ChatGPT, Google Gemini, Anthropic’s Claude, and DeepSeek all rely on text as their primary communication channel.
But text-based agents extend beyond chatbots. The content we scroll through on Twitter, Reddit, and Facebook is increasingly generated or influenced by AI. Text is the foundational modality because it’s how large language models natively process and generate information.
Text is universal, asynchronous, and easy to reference. You can copy, paste, edit, and save it. It’s also the most mature modality in terms of tooling and user familiarity.
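To make the pattern concrete, here’s a minimal sketch of a text-based agent loop using the openai Python package. Treat it as an illustration rather than how any particular product works: the model name and system prompt are placeholders, and it assumes an OPENAI_API_KEY is set in your environment.

```python
# Minimal text-based agent loop: read user text, send the running
# conversation to the model, print the reply. Assumes the `openai`
# package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    print(f"Agent: {reply}")
```

Everything else in this post builds on some variant of this loop: the other modalities mostly change how the input arrives and how the output is presented.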
2. Graphical Interfaces
Graphical interfaces embed agentic systems directly into the tools we already use. This is where AI becomes a collaborator rather than just a conversationalist.
As a software engineer, I use Cursor and VS Code with GitHub Copilot daily. These tools don’t replace my workflow. They enhance it. I can choose when to lean on the AI and when to work independently. That flexibility is what makes graphical interfaces powerful.
The same principle applies outside of coding. Microsoft Copilot in Excel helps analysts run calculations and generate insights without leaving their spreadsheet. The AI meets users where they already are.
Graphical interfaces reduce context-switching. Instead of copying code into a chatbot and pasting responses back, the AI is embedded in the environment where work happens.
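Here’s a rough sketch of what “embedded” means in practice: the tool gathers context from the environment (the open file, the selection) and sends it to the model on the user’s behalf. This is a conceptual example, not how Cursor or Copilot are actually implemented; the function name and model are hypothetical.

```python
# Conceptual sketch: an editor-embedded assistant passes file context
# to the model automatically, so the user never copies code by hand.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def suggest_improvement(file_path: str, selection: str) -> str:
    # The editor supplies both the full file and the selected snippet,
    # context a chatbot user would otherwise paste in manually.
    file_contents = Path(file_path).read_text()
    prompt = (
        "You are a coding assistant embedded in an editor.\n"
        f"Full file for context:\n{file_contents}\n\n"
        f"Suggest an improvement to this selected snippet:\n{selection}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

The design point is that context gathering moves from the user to the tool, which is exactly what makes the interaction feel like collaboration rather than consultation.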
3. Speech and Voice Interfaces
Speech interfaces leverage the speed of spoken language and near-instantaneous AI responses to create fluid, conversational experiences.
I use this modality regularly. Wispr Flow lets me dictate and have AI clean up my speech into polished text. It’s how I drafted the first version of this article. ChatGPT’s voice mode is another example: you speak, it responds almost instantly, and the interaction feels natural.
Voice is faster than typing for many people and enables hands-free interaction. It’s particularly useful for brainstorming, quick capture, and accessibility.
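Under the hood, this is typically a two-step pipeline: transcribe speech to text, then have a language model polish the transcript. Here’s a sketch of that pattern using OpenAI’s Whisper transcription endpoint; it shows the general idea, not Wispr Flow’s actual implementation, and the model names are illustrative.

```python
# Two-step speech pipeline: raw audio -> transcript -> polished text.
# Assumes the `openai` package and OPENAI_API_KEY are available.
from openai import OpenAI

client = OpenAI()

def dictation_to_polished_text(audio_path: str) -> str:
    # Step 1: transcribe the recorded speech.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    # Step 2: ask a language model to clean up filler words and
    # false starts from the raw transcript.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": "Rewrite this dictated text as polished prose, "
                       "removing filler words and false starts:\n"
                       + transcript.text,
        }],
    )
    return response.choices[0].message.content
```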
4. Video-Based Interfaces
Video-based interfaces are the newest frontier, and the one I find most fascinating (and a little concerning).
On the creative side, AI-generated video can produce educational content with realistic presenters, personalized to different audiences. Imagine a brand creating dozens of ad variations tailored to different demographics, all generated at scale. A soap company could produce distinct spots featuring diverse faces and messaging without the traditional production overhead.
The potential is huge, but so are the risks. Video is visceral and persuasive in ways that text isn’t. The same technology that personalizes ads can also manipulate, deceive, or spread misinformation. This modality demands careful consideration of ethics and trust.
Video combines visual, auditory, and emotional cues. It’s highly engaging and can convey complex information quickly, but that power cuts both ways.
What I’m Building
I’m currently working on CareerBrief, an interview preparation platform that combines text and graphical interfaces. Users interact with a web app to build personalized “briefcases” for job interviews, while AI agents handle the research and synthesis behind the scenes. It’s been a great exercise in thinking through how modality choices affect user experience.
Key Takeaways
- Text-based interfaces are the foundation: familiar, flexible, and universally accessible.
- Graphical interfaces embed AI into existing workflows, reducing friction and context-switching.
- Speech and voice interfaces enable faster, hands-free interaction and open up new accessibility possibilities.
- Video-based interfaces are powerful but carry significant ethical considerations around manipulation and trust.
Five years ago, many of these tools would have felt like science fiction. Today, they’re part of my daily workflow. I’m excited to see where agentic systems go next, and to keep sharing what I learn along the way.
If you’re interested in diving deeper into how agentic systems are designed, I recommend picking up Building Applications with AI Agents by Michael Albada.