The first open benchmark for AI editing of Word documents
Most AI tools that touch a Word file quietly break its formatting, its tracked changes, or the file itself. docx-benchmark measures which tools actually get the edit right and keep the document intact, across realistic contract tasks. It is open source, and anyone can reproduce it or add their own tool. Adeu is one of the tools it scores.
Why this matters
AI tools are good at writing new text. Editing an existing Word document without wrecking it is much harder, and that is exactly what legal work demands.
Formatting gets mangled
Headings, numbering, styles, headers, and footers drift or vanish when a tool rewrites the file.
Tracked changes break
Redlines and comments are lost or corrupted, so the other side cannot see what actually changed.
Files quietly corrupt
The output looks fine until someone tries to open it in Word and it fails to load. For a contract headed to signature, that is unacceptable.
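To make the tracked-changes failure concrete: in a .docx file, tracked insertions and deletions are stored as w:ins and w:del elements in word/document.xml, and comment anchors as w:commentReference. The sketch below counts those markers before and after an edit and reports any that went missing. It is an illustration only, not the benchmark's scoring code. The function names and the make_docx fixture are hypothetical, and the fixture builds just enough of an archive to count markers, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_revision_marks(docx_bytes: bytes) -> dict:
    """Count tracked insertions, deletions, and comment anchors in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return {
        "insertions": len(root.findall(f".//{{{W_NS}}}ins")),
        "deletions": len(root.findall(f".//{{{W_NS}}}del")),
        "comment_refs": len(root.findall(f".//{{{W_NS}}}commentReference")),
    }


def lost_marks(before: bytes, after: bytes) -> dict:
    """Return how many markers of each kind disappeared between two versions."""
    b, a = count_revision_marks(before), count_revision_marks(after)
    return {kind: b[kind] - a[kind] for kind in b if a[kind] < b[kind]}


def make_docx(body_xml: str) -> bytes:
    """Hypothetical test fixture: a ZIP holding only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W_NS}"><w:body>{body_xml}</w:body></w:document>',
        )
    return buf.getvalue()


if __name__ == "__main__":
    original = make_docx(
        '<w:p><w:ins w:id="1"><w:r><w:t>new</w:t></w:r></w:ins>'
        '<w:del w:id="2"><w:r><w:delText>old</w:delText></w:r></w:del></w:p>'
    )
    # Simulates a tool that rewrote the paragraph and flattened the redlines.
    rewritten = make_docx("<w:p><w:r><w:t>new</w:t></w:r></w:p>")
    print(lost_marks(original, rewritten))
```

A raw count is a crude signal: a task that asks the tool to accept or reject changes will legitimately lower it, so a real check would compare against what the task expected rather than against the input alone.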
What it measures
Every tool is scored on four measures. All scoring uses multi-turn agentic workflows, never one-shot prompts.
Task success
Did the tool actually make the requested edit, correctly and completely?
Formatting fidelity
Did the untouched parts survive: paragraphs, styles, headers and footers, comments, and existing tracked changes?
File integrity
Does the edited file still open as a valid Word document, every single time?
Token cost
How many tokens did the task take? This maps directly to what each edit costs to run.
We are holding the scores off this page until the published results are final. You can reproduce them today by running the suite yourself.
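As a rough illustration of what a file integrity check can look at: a .docx file is a ZIP package of XML parts, so a first-pass check can confirm that the archive opens, that the conventional core parts are present, and that every XML part parses. The sketch below assumes the main document part sits at its usual location, word/document.xml. The function name and the list of required parts are assumptions made for this example and do not describe how docx-benchmark itself scores integrity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Parts found in a typical .docx package. The main document part is located
# via relationships and could in principle live elsewhere; this is an assumption.
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def integrity_problems(docx_bytes: bytes) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(docx_bytes))
    except zipfile.BadZipFile:
        return ["not a ZIP archive"]

    problems = []
    with archive:
        names = sorted(archive.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
        for name in names:
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary parts
            try:
                ET.fromstring(archive.read(name))
            except zipfile.BadZipFile:
                problems.append(f"corrupt entry: {name}")
            except ET.ParseError:
                problems.append(f"malformed XML in {name}")
    return problems
```

An empty result is necessary but not sufficient: Word can still reject a package whose XML is well-formed but violates the schema or has broken relationships, so a check like this only catches the most basic breakage.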
Realistic legal tasks, not toy problems
The benchmark runs five jobs drawn from real contract work. Each is a multi-step task an AI agent has to carry out end to end.
- 01
Form fill
Complete a SAFE template from a deal data sheet, the way a founder or counsel fills in the blanks.
- 02
Template reuse and party swap
Take an executed agreement and swap in a new party throughout, without breaking the rest of the document.
- 03
Policy checklist review
Redline a contract in place against a policy checklist, fixing what does not meet the standard.
- 04
Playbook review of counterparty redlines
Review the other side's tracked changes against your playbook and comment where they cross a line.
- 05
Multi-file deal assembly
Assemble a deal from an intake sheet across several files, the kind of work that spans a whole matter.
Every task is multi-turn and agentic. The tool has to plan, call its own functions, and check its work, just as it would in production.
Read the full methodology
Open by design
It is our benchmark, and Adeu is one of the tools in it. So we made the whole thing open and reproducible. Do not take our word for it. Run it.
Open source, AGPL-3.0
The full benchmark, scenarios, and scoring code are public on GitHub.
Reproducible
Clone the repo, run one command, and generate the results on your own machine.
Bring your own tool
Add any MCP server to a single config file and score it against the rest. No code changes required.
An open invitation
If you build a document-editing tool, we want you to run it and challenge the results.
Clone the repo and run the suite:
git clone https://github.com/dealfluence/docx-benchmark
cd docx-benchmark
npm install
npm run benchmark
Run the benchmark yourself
It is open source and reproducible. Clone it, run it, and add your own tool to the comparison.