Open benchmark

The first open benchmark for AI editing of Word documents

Most AI tools that touch a Word file quietly break its formatting, its tracked changes, or the file itself. docx-benchmark measures which tools actually get the edit right and keep the document intact, across realistic contract tasks. It is open source, and anyone can reproduce it or add their own tool. Adeu is one of the tools it scores.


Why this matters

AI tools are good at writing new text. Editing an existing Word document without wrecking it is much harder, and that is exactly what legal work demands.

Formatting gets mangled

Headings, numbering, styles, headers, and footers drift or vanish when a tool rewrites the file.

Tracked changes break

Redlines and comments are lost or corrupted, so the other side cannot see what actually changed.

Files quietly corrupt

The document looks fine until someone opens it in Word and it will not load. For a signed contract, that is unacceptable.

What it measures

Every tool is scored on four dimensions. All scoring happens in multi-turn agentic workflows, never in one-shot prompts.

Task success

Did the tool actually make the requested edit, correctly and completely?

Formatting fidelity

Did the untouched parts survive: paragraphs, styles, headers and footers, comments, and existing tracked changes?
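To make the idea concrete, here is a minimal Python sketch of one way to spot fidelity loss. It is an illustration under our own assumptions, not the benchmark's scoring code: the function names `fingerprint` and `changed_fields` are hypothetical, and it only looks at paragraph styles and the `w:ins` / `w:del` tracked-change elements in `word/document.xml`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def fingerprint(document_xml: str) -> dict:
    """Summarize the structure of a document.xml string (hypothetical helper)."""
    root = ET.fromstring(document_xml)
    styles = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        # None means the paragraph carries no explicit style reference.
        styles.append(style.get(f"{{{W}}}val") if style is not None else None)
    return {
        "paragraph_styles": styles,
        "insertions": sum(1 for _ in root.iter(f"{{{W}}}ins")),
        "deletions": sum(1 for _ in root.iter(f"{{{W}}}del")),
    }


def changed_fields(before_xml: str, after_xml: str) -> list[str]:
    """Return the fingerprint fields that differ between two versions."""
    before, after = fingerprint(before_xml), fingerprint(after_xml)
    return [key for key in before if before[key] != after[key]]
```

A requested edit will legitimately change some of these fields, so a real scorer would have to restrict the comparison to regions the task did not ask the tool to touch. Headers, footers, and comments live in other parts of the package and are not covered here.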

File integrity

Does the edited file still open as a valid Word document, every single time?
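A .docx file is a ZIP package of XML parts, so the most basic integrity failures can be caught without opening Word. The Python sketch below is an illustration under our own assumptions, not the benchmark's actual check: `basic_docx_check` is a hypothetical name, and passing it does not guarantee that Word will load the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Parts a Word package is expected to contain at minimum.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def basic_docx_check(data: bytes) -> list[str]:
    """Return the problems found; an empty list means the basic checks passed."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return ["not a ZIP archive"]

    problems = []
    with archive:
        bad_member = archive.testzip()  # first member failing its CRC check, or None
        if bad_member is not None:
            problems.append(f"corrupt member: {bad_member}")
        names = set(archive.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
                continue
            try:
                ET.fromstring(archive.read(part))
            except (ET.ParseError, zipfile.BadZipFile):
                problems.append(f"unreadable or malformed XML: {part}")
    return problems
```

Calling `basic_docx_check(open("edited.docx", "rb").read())` on an edited file (the filename is a placeholder) returns an empty list when the package is readable and its core parts are well-formed XML. Schema violations and broken relationships between parts would need a stricter validator.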

Token cost

How many tokens did the task take? This maps directly to what each edit costs to run.
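The mapping from tokens to money is simple arithmetic. The Python sketch below shows it under stated assumptions: the `Pricing` class, the `run_cost` function, and the prices in the example are hypothetical placeholders, not figures from the benchmark or from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices, in US dollars."""
    input_per_million: float
    output_per_million: float


def run_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one task run, given its token counts and a price sheet."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Example with made-up prices: $2 per million input tokens, $8 per million output tokens.
example = run_cost(500_000, 250_000, Pricing(2.0, 8.0))  # 3.0
```

Multi-turn tasks resend context on every turn, so input tokens usually dominate the total.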

We are holding the scores off this page until the published results are final. You can reproduce them today by running the suite yourself.

Realistic legal tasks, not toy problems

The benchmark runs five jobs drawn from real contract work. Each is a multi-step task an AI agent has to carry out end to end.

01. Form fill
Complete a SAFE template from a deal data sheet, the way a founder or counsel fills in the blanks.

02. Template reuse and party swap
Take an executed agreement and swap in a new party throughout, without breaking the rest of the document.

03. Policy checklist review
Redline a contract in place against a policy checklist, fixing what does not meet the standard.

04. Playbook review of counterparty redlines
Review the other side's tracked changes against your playbook and comment where they cross a line.

05. Multi-file deal assembly
Assemble a deal from an intake sheet across several files, the kind of work that spans a whole matter.

Every task is multi-turn and agentic. The tool has to plan, call its own functions, and check its work, just as it would in production.

Read the full methodology

Open by design

It is our benchmark, and Adeu is one of the tools in it. So we made the whole thing open and reproducible. Do not take our word for it. Run it.

Open source, AGPL-3.0

The full benchmark, scenarios, and scoring code are public on GitHub.

Reproducible

Clone the repo, install the dependencies, and run the suite to generate the results on your own machine.

Bring your own tool

Add any MCP server to a single config file and score it against the rest. No code changes required.

An open invitation

If you build a document-editing tool, we want you to run it and challenge the results.

Clone the repo and run the suite:

git clone https://github.com/dealfluence/docx-benchmark
cd docx-benchmark
npm install
npm run benchmark


Run the benchmark yourself

It is open source and reproducible. Clone it, run it, and add your own tool to the comparison.
