
I’m Ben Morss, Developer Evangelist at DeepL. At Apidays Munich 2025 I walked through how AI agents and translation APIs can reshape internationalization (i18n) workflows — and whether they really make i18n “easy”. In this article I’ll share the practical realities, technical patterns, and recommended guardrails that I discussed on stage. You’ll get a clear sense of three concrete scenarios where agents can help — improving raw translations, internationalizing a single‑language website, and automating an end‑to‑end translation workflow — plus when to prefer classic engineering over agentic shortcuts.
Before we dive in: yes, I’ve been to Munich before — once in 2017, eating what locals promise is the best Döner Kebab — and it’s always a pleasure to return and talk APIs, translation, and automation.
Why agents now? A short history of automation for translators
Automation has been evolving in stages. First machines simplified manual labor; later, computation changed office work. For translators, neural machine translation (NMT) first provided faster and more reliable raw translations than rule‑based approaches. More recently, large language models (LLMs) have pushed fluency and contextual understanding even further.
But history shows us a pattern: automation rarely eliminates whole professions. Instead it changes the tasks people do. With machine translation, translators stopped starting from blank pages. They began post‑editing — taking a machine output, correcting mistakes, ensuring terminology matches, and ensuring cultural fit. Computer assisted translation (CAT) tools like memoQ and others made those workflows efficient by combining translation memories, glossaries, and API calls to translation engines.
The question I explored at Apidays is: will AI agents be the next step in that evolution? Can they give translators better first drafts, or even automate parts of the workflow entirely while keeping quality high?
What do I mean by an AI agent?
Definitions vary, but in practical terms I use “AI agent” to mean an automated system built around an AI model (usually an LLM) that performs specific tasks. An agent is more than just a single prompt; it’s code + model + tools. Typical components:
- LLM core: the reasoning and language engine (e.g., GPT‑4, Gemini, or similar models).
- Tools/APIs: external functionality the agent can call (translation APIs, GitHub, file systems, spreadsheets, etc.).
- Memory/DB: a place to store state, identifiers, or mapping tables generated during work.
- Orchestration code: logic that coordinates prompts, tool calls, retries, and guardrails.
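Concretely, those components can be sketched as a small orchestration loop in which the model chooses a tool and the code executes it. Everything below is illustrative: the AgentStep shape and the injected model function are assumptions for the sketch, not any particular framework's API.

```typescript
// A minimal agent loop: the model picks a tool, the code runs it,
// and the result is fed back until the model declares it is done.
type Tool = (input: string) => string;

interface AgentStep {
  tool: string;   // which tool the model wants to call ("done" to stop)
  input: string;  // the argument to pass to that tool
}

// The model is injected as a plain function so the loop stays testable;
// in a real agent this would wrap an LLM API call.
type Model = (transcript: string) => AgentStep;

function runAgent(model: Model, tools: Record<string, Tool>, task: string): string {
  let transcript = `Task: ${task}`;
  for (let i = 0; i < 10; i++) {  // guardrail: bounded number of steps
    const step = model(transcript);
    if (step.tool === "done") return step.input;
    const tool = tools[step.tool];
    if (!tool) throw new Error(`Unknown tool: ${step.tool}`);
    const result = tool(step.input);
    transcript += `\n${step.tool}(${step.input}) -> ${result}`;
  }
  throw new Error("Agent did not finish within the step budget");
}
```

The point of the sketch is the shape: the model never touches the outside world directly, and the orchestration code is where retries, budgets, and guardrails live.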
Agents enable more flexible automation than a single request/response model because they can examine results, call other services, and iterate toward an outcome.
Three real-world translation scenarios for agents
We tested three scenarios to understand where agents truly help:
- Improve raw LLM translations by using agentic reflection and multi‑agent editing.
- Internationalize (i18n) a single‑language website by converting hardcoded strings into translatable keys and JSON resources.
- Automate an entire translation workflow: generate machine translations, collect them in a spreadsheet for human review, ingest human edits, and output final resource files.
Scenario 1 — Can agents make translations better?
This is the simplest question: given an LLM translation, can an agent make it measurably better by reflecting, editing, or orchestrating multiple specialized subagents (fluency checker, adequacy checker, style editor)? The short answer: sometimes, but not always — and cost and latency matter.
There are a few concrete patterns people are trying:
- Iterative reflection: Ask an LLM to translate, then ask it to critique its own output for accuracy, fluency, style, and cultural fit, then ask it to revise the translation using those critiques. This is close to a “chain‑of‑thought” or “self‑debugging” pattern.
- Specialized subagents: Run the translation through several agents specialized in particular evaluations (e.g., fluency, adequacy, domain terminology) and then combine their recommendations through an editor agent.
- Reasoning models: Use models optimized for deep, multi‑step reasoning to produce more careful translations than a generic model.
Practical observations from experiments:
- Quality can improve: Reflection and multi‑agent editing sometimes yield translations that read more naturally and catch mistakes a single pass misses.
- Not always worth the cost: In one quick test I ran with Gemini 2.5 Pro, a single short line took 34 seconds and consumed over 3,000 tokens to produce a revised translation. That’s hardly scalable for large volumes or low latency needs.
- Mixed research outcomes: A recent academic paper built a four‑agent pipeline (translator, fluency checker, adequacy checker, and style agent feeding an editor) and reported mixed results. In their configuration, GPT‑4 underperformed traditional engines like DeepL and Google Translate against some baselines — partly because older versions of the neural models were used in the baseline comparisons. In short: it’s a work in progress.
One elegant idea attributed to Andrew Ng is to explicitly prompt the LLM to list improvement categories (accuracy, fluency, style, cultural context, terminology) and then feed those suggestions back into an edit prompt. That approach is intuitive and easy to implement, but like the others, it may be token‑heavy and slow.
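That translate, critique, revise pattern can be sketched in a few lines. The prompts, the critique axes, and the injected model function are illustrative assumptions, not a real API; a production version would also need retries, cost tracking, and a stop condition.

```typescript
// Sketch of the translate -> critique -> revise reflection pattern.
// The model is injected as a plain function; prompts are illustrative.
type LLM = (prompt: string) => string;

// Improvement categories fed to the critique step, per the idea above.
const CRITIQUE_AXES = ["accuracy", "fluency", "style", "cultural context", "terminology"];

function reflectiveTranslate(model: LLM, text: string, targetLang: string): string {
  const draft = model(`Translate into ${targetLang}: ${text}`);
  const critique = model(
    `Critique this translation of "${text}" along ${CRITIQUE_AXES.join(", ")}: ${draft}`
  );
  // Feed the critique back into an edit prompt to produce the revision.
  return model(
    `Revise the translation "${draft}" of "${text}" using this critique: ${critique}`
  );
}
```

Note that this makes three model calls per segment, which is exactly where the token and latency costs described above come from.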
Recommendation
- Use agentic edit loops where ultra‑high fidelity is needed for short texts (e.g., marketing copy or legal sentences), but be careful with costs for large batches.
- Compare against state‑of‑the‑art NMT/LLM translation engines before adopting a complex agent flow — sometimes a single pass with a tuned translation API equals or beats an expensive multi‑agent pipeline.
Scenario 2 — Internationalize a single‑language website with an agent
This is the kind of practical task many startups face: you built a web app in your native language, hardcoded strings all over your React codebase, and suddenly you have users in several countries. How do you turn that mess into a maintainable, localized application?
The canonical engineering approach is straightforward:
- Introduce an i18n library (i18next, react-intl, formatjs, etc.).
- Replace each hardcoded string with a call to a translation function (commonly named t()) and a key, e.g., t('header.welcome').
- Collect keys and create per‑language JSON resource files.
- Call a translation API to populate values for each target language.
- Test, review, and adjust.
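To make the target shape concrete, here is a hand‑rolled stand‑in for t() with per‑language resource maps. This is not the i18next or react‑intl API, just the pattern those libraries implement; the keys and strings are made up for the example.

```typescript
// A minimal stand-in for the t() pattern, to show the shape that
// libraries like i18next provide (this is NOT the i18next API).
type Resources = Record<string, Record<string, string>>; // lang -> key -> value

const resources: Resources = {
  en: { "header.welcome": "Welcome back!" },
  fr: { "header.welcome": "Bon retour !" },
};

// Returns a t() for one language, falling back to English, then the key.
function makeT(lang: string) {
  return (key: string): string => resources[lang]?.[key] ?? resources.en[key] ?? key;
}

// Before: <h1>Welcome back!</h1>
// After:  <h1>{t("header.welcome")}</h1>  with const t = makeT(currentLang)
```

With this shape in place, populating the fr (or de, or ja) map is a pure data change, which is exactly what the translation API call in the last step produces.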
All of the above is reproducible with code, but agents can help here because:
- They can search the codebase for strings across file types (JSX, JSON, config files) more flexibly than brittle regular expressions.
- They can propose human‑readable key names and avoid collisions using contextual heuristics.
- They can call translation APIs and generate resource files automatically.
- They can branch your repo, create new files, and submit pull requests for human review.
What an agent needs to do this well:
- Filesystem or repo access: Either direct filesystem access for local projects or API access for hosted repos (GitHub/GitLab).
- Translation API access: DeepL, Google Translate, or other APIs for bulk translation.
- State memory: A tool or store that accumulates the mapping between original strings and assigned keys so the agent doesn’t lose track.
- Optional MCP servers: Model Context Protocol (MCP) servers or other connector services that expose higher‑level operations for things like Google Sheets, GitHub, or file manipulation can simplify tool design.
Example agent flow for automatic i18n of a React app (MVP):
- Agent scans the repository for potential translatable strings (strings in JSX, string concatenations that look like UI text, contents of template literals).
- For each candidate string, the agent proposes a key (e.g., header.welcome). It records this in a mapping tool or DB.
- It replaces the string in the source code with t('header.welcome') and stages the change.
- The agent calls the translation API to generate target language values and constructs JSON files (en.json, fr.json, de.json, etc.).
- The agent opens a PR with the code changes and the new resource files. Humans review, correct key names, adjust translations, and merge.
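The first two steps of that flow can be sketched with a deliberately naive extractor and key proposer. A real migration would use an AST parser or an agent rather than this regex; the pattern and the naming scheme are illustrative assumptions.

```typescript
// Illustrative extraction pass: find text between JSX tags and propose
// keys from a component name plus a slugified string. This regex is
// deliberately naive; real codebases need an AST (or an agent).
function extractJsxText(source: string): string[] {
  const matches = source.matchAll(/>([^<>{}]+)</g);
  return [...matches]
    .map((m) => m[1].trim())
    .filter((s) => s.length > 0); // drop whitespace-only nodes
}

// e.g. proposeKey("Header", "Welcome back!") -> "header.welcome_back"
function proposeKey(component: string, text: string): string {
  const slug = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_|_$/g, "");
  return `${component.toLowerCase()}.${slug}`;
}
```

Recording each (string, key) pair in the mapping store from step 2 is what lets the agent stay consistent across files and across runs.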
Why not just write scripts instead?
Scripting works, and for many teams it’s the most reliable path. But scripts tend to be fragile: regexes miss edge cases, and new file formats or odd build setups break the parser. Agents bring adaptiveness: they can ask clarifying questions, spot strings in unusual places, and infer context. That makes them useful for fast, exploratory migrations or when you can’t feasibly predict every code pattern.
Human‑in‑the‑loop still required
Don’t mistake agentic convenience for correctness. Humans must:
- Choose and configure the i18n package to suit the project’s architecture and maintainability needs.
- Validate the selected strings — agents can’t always know if a quoted literal is user‑visible copy or a debug string.
- Review and adjust generated key names to match your project’s naming conventions.
- Audit translations — ambiguous words like “close” (close a dialog vs. nearby) need UI context and often human judgment.
Scenario 3 — Automating a full translation workflow
This is where agentic systems can shine if designed carefully: automating the drudge work of moving text between systems and people while keeping humans in the loop for final checks.
A practical workflow I outlined:
- Agent extracts strings (as in Scenario 2) and calls a translation API to generate machine translations for each target language.
- Agent writes those translations into a collaborative spreadsheet (Google Sheets, Excel Online) for human review.
- Human translators/editors review, suggest edits, and mark rows as approved.
- Agent monitors the sheet, ingests approved rows, and writes final JSON resource files back to the repository or deployment pipeline.
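The ingestion step can be sketched as a pure function from reviewed rows to a resource object. Note the column lookup by header label rather than position, so the flow survives someone reordering the sheet; the column names themselves are assumptions for the example.

```typescript
// Sketch: turn reviewed spreadsheet rows into a per-language resource
// map. Columns are located by header label, not position, so the flow
// survives a reordered sheet. Column names here are assumptions.
type Row = string[];

function approvedResources(rows: Row[]): Record<string, string> {
  const [header, ...body] = rows;
  const col = (name: string) => header.indexOf(name);
  const key = col("key");
  const value = col("translation");
  const status = col("status");
  const out: Record<string, string> = {};
  for (const row of body) {
    if (row[status] === "approved") out[row[key]] = row[value]; // skip unreviewed rows
  }
  return out;
}
```

Serializing the returned object to fr.json (or whichever target language the sheet covers) is then a one-liner in the agent's finalization step.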
Why the spreadsheet? Many localization teams trust spreadsheets as the canonical review and sign‑off artifact. They allow translators to view source context, add comments, and iterate — and they’re UI‑friendly for non‑technical stakeholders.
MCPs and connectors make spreadsheet automation much easier. A popular approach is to expose a high‑level MCP server for Google Sheets (rather than using the raw API) that provides functions such as addSheet, deleteSheet, getRange, and appendRows. Composio and other teams provide MCP servers that agents can call to manipulate sheets in human‑friendly ways.
Design patterns for a robust workflow
- Idempotency: Ensure the agent can re‑run safely without duplicating rows or overwriting reviewed content.
- Audit trails: Keep records of machine output, human edits, and who approved what.
- Guardrails: Limit the agent’s destructive actions (e.g., never commit to main; commit to a branch or open PR only).
- Validation steps: Run lightweight QA checks (character limits, placeholders intact, ICU formatting preserved) before inserting translations into resource files.
- Human escalation: Let the agent detect uncertainty (low confidence or ambiguous strings) and flag them for manual review rather than guessing.
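As one example of such a validation step, here is a minimal check that ICU‑style placeholders like {name} survive translation unchanged. The placeholder syntax is an assumption; real ICU messages (plurals, selects) need a proper parser.

```typescript
// Lightweight QA check: verify that placeholders like {name} in the
// source appear unchanged in the translation. Assumes simple
// {identifier} placeholders, not full ICU MessageFormat.
function placeholders(s: string): Set<string> {
  return new Set([...s.matchAll(/\{[a-zA-Z0-9_]+\}/g)].map((m) => m[0]));
}

function placeholdersIntact(source: string, translation: string): boolean {
  const a = placeholders(source);
  const b = placeholders(translation);
  if (a.size !== b.size) return false;
  for (const p of a) if (!b.has(p)) return false;
  return true;
}
```

A failed check is a natural trigger for the human-escalation path above: flag the row instead of writing it into the resource file.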
Agents are excellent at the repetitive pieces: extracting, populating sheets, monitoring approval columns, and generating resource files. But the human remains essential where nuance and context matter.
Common pitfalls and how to avoid them
From running experiments and building flows, here are the traps I see teams fall into and practical advice to avoid them:
1. Token and latency costs
Iterative agent flows can be token‑hungry and slow. If you try reflection loops or multi‑agent editing for thousands of segments, costs stack quickly. Always benchmark token use per segment and estimate your monthly volume before committing to an agentic approach.
2. Over‑automation without guardrails
Giving an agent full commit rights in a monorepo is tempting but risky. Use branches, PRs, and limited permissions for initial runs. Keep humans in a “review and approve” loop.
3. Losing context
Translations depend on UI context, domain terminology, and style guides. If the agent can’t access screenshot context or component metadata, it will guess and sometimes guess wrong. Where possible, enrich inputs with brief context strings (component name, UI role, or a screenshot link).
4. Fragile criteria for string detection
Regex‑based extraction scripts will miss edge cases or include non‑UI strings. Agents are better at understanding context, but they also make mistakes. Make sure you have a verification step.
5. Spreadsheet brittleness
If your automation expects data in fixed columns and a person moves a column, the script breaks. Agents that can “see” the spreadsheet and locate columns by header label, not position, are more resilient. Using an MCP that abstracts spreadsheet actions helps.
When to use agents vs. when to write code
Both have places:
- Write code when: You need speed, reproducibility, and low overhead at scale — predictable patterns work well with scripts and CI jobs.
- Use agents when: Your codebase is messy or heterogeneous, you need flexible heuristics, or you want an interactive tool that asks clarifying questions and adapts to surprises.
In practice a hybrid approach is often best: use agents to discover and propose changes, then generate deterministic scripts from approved proposals for large, repeatable runs.
Practical checklist before you deploy agentic i18n
- Inventory: identify file types and where UI strings live.
- Select i18n strategy: choose library and key naming conventions.
- Permissions: create a safe Git branch and set PR review rules.
- Tooling: provision translation API credentials and spreadsheet access tokens (or MCP credentials).
- Agent test run: run the agent on a small surface area, produce a PR, and review.
- Human review: have translators and engineers verify keys, translations, and code changes.
- Automate finalization: once approved, have the agent produce final resource files and open a release PR or push to a localization pipeline.
- QA: verify in the app — check for UI truncation, placeholders, and language switching behavior.
Security and compliance considerations
When agents access code and text, be mindful of sensitive data:
- Do not send secrets or PII to third‑party LLMs unless you have proper contracts and data handling agreements.
- Prefer on‑prem or private model deployments if your data is sensitive.
- Use scoped API keys for translation and MCP services, and rotate them regularly.
- Track where data flows: agent logs, spreadsheets, and translation APIs should be auditable.
Where the technology is headed
Agent architectures and tooling are moving quickly. Expect improvements in:
- Cost efficiency: lighter‑weight models and smarter orchestration will lower token bills for iterative editing.
- Connectivity: richer MCP ecosystems that provide secure, higher‑level connectors to sheets, repos, and storage.
- Context awareness: agents that can ingest screenshots, UI metadata, and component trees for better disambiguation.
- Domain tuning: fine‑tuned agents for specific industries that better respect terminology and style guides.
For translators and localization leads, the arrival of intelligent agents is not a threat but an opportunity: agents can handle repetitive, mechanical work while translators and linguistic leads focus on higher‑value tasks like creative adaptation, nuance, and strategy.
Final thoughts — agents as powerful junior teammates, not replacement experts
In my talk I likened agents to “a very eager junior who never tires.” They can find strings, propose keys, generate machine translations, and build draft resource files. They can also make surprising choices, misunderstand context, or do more than you expected. That’s why human oversight matters at every step.
Use agents to accelerate discovery and automation, but keep review loops and guardrails. Combine agent flexibility with script reliability: let agents explore and propose changes, then lock in repeatable processes for scale. For teams that adopt this hybrid model, internationalization becomes much easier in practice — faster, less error‑prone, and more maintainable — while quality and ownership remain where they should: with human experts.
If you want a practical starting point, try this simple experiment: pick a small React component with a few hardcoded strings, write a lightweight agent (or even a guided script) to extract them, propose keys, and generate a French JSON file. Review results with a human translator, iterate, and you’ll quickly see how much time this approach saves — and where it still needs your expert judgment.
And if you’re at Apidays or building i18n workflows, I hope these patterns and warnings help you balance automation with human care.
