Article written by Elgar Weijtmans and Anna Guo.
How the international legal tech community is working together to streamline AI procurement
Everyone is reinventing the wheel
Two years ago, we at HVG Law started evaluating legal AI tools. At the time, there was no framework or playbook to follow, so we built everything from scratch (I described that process in a previous post). It worked, but it took an enormous amount of time and was far from perfect. One of the most humbling takeaways was discovering just how much you don’t know when you first try to seriously assess an AI tool.
In conversations with colleagues from other firms and companies, I noticed that virtually every legal team was wrestling with the same questions. How do you assess the output of an AI? What do you ask a supplier about data processing? How do you test security? Everyone was reinventing the wheel.
Roel Schrijvers was one of the people who decided to do something about it. He built a test matrix for evaluating legal AI tools and shared it with Anna Guo at Legal Benchmarks, planting the seed for what would become a much larger initiative.
That picture was further confirmed when I came into contact with Legal Benchmarks, an international organisation that develops independent benchmarks, evaluation frameworks, and open-access resources for legal teams looking to evaluate and procure AI. Their experience with hundreds of legal teams worldwide pointed to something I had already suspected: the biggest challenge isn’t even assessing the tools themselves, but the internal coordination that precedes it. Lawyers want to know whether the tool delivers good output, IT looks at architecture and data flows, and procurement wants control over the contracting process. Bringing those perspectives together is a project in itself, and most teams simply don’t have the capacity for it; more often than not, it gets done alongside day-to-day work.
Earlier this year, Mark Zijlstra wrote on Mr. Online that the open-source philosophy could be of enormous help to the legal sector, and that sharing knowledge, experiences, and above all mistakes is essential for the development of legal tech. I fully endorse that view, and as far as I’m concerned, this project is a great example of it in practice.
From a LinkedIn post to global collaboration
In October 2025, Anna (founder of Legal Benchmarks) posted a call on LinkedIn to jointly build a shared evaluation framework. I recognised the idea immediately: it closely resembled what we had developed at HVG Law. We got talking and decided to join forces.
Within three weeks, hundreds of (legal) professionals from all over the world had come forward: general counsel, legal ops specialists, IT security experts, and privacy lawyers. All of them shared the same frustrations, and all were willing to contribute their insights and experience. In a sector not exactly known for giving away knowledge freely, that was remarkable to see.
Together with the community, we developed the first version of an evaluation framework to help legal professionals choose and procure the right (legal) AI. Community members refined the criteria, identified blind spots, and tested the framework against their own evaluation experiences. That diversity of perspectives (legal, technical, operational, spanning different jurisdictions) made the result better than what any single team could have produced on its own.
A three-phase evaluation framework
The result is the Legal AI Evaluation Framework: an open standard with a hundred evaluation criteria spread across eight dimensions, ranging from strategic fit and output quality to security, privacy, and supplier risks.
The approach is deliberately phased, much like the recruitment process for a new colleague. You start with the CV: does the tool meet your minimum requirements on paper, or does it fail the initial screening? Only if the profile looks promising do you invite the candidate for an initial interview: a structured demo where you test whether the tool actually does what the supplier promises. And just as you wouldn’t hire a strong candidate based solely on a good interview, you let the tool run through a pilot phase in practice: with your own workflows, your own documents, and your own team. Only then can you tell if it really works. Each phase comes with a scorecard you can use straight away and adapt to your own situation.
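For readers who like to see the mechanics, here is a minimal sketch of what a phase scorecard can look like once you put it in a spreadsheet or a few lines of code. The dimension names, questions, weights, 0–5 rating scale, and the 60-point threshold below are my own illustrative assumptions, not the framework’s actual criteria; the point is only to show how weighted scoring lets you compare tools phase by phase.

```python
# Hypothetical sketch of a per-phase scorecard; the criteria, weights, and
# threshold are illustrative assumptions, not the framework's real content.
from dataclasses import dataclass


@dataclass
class Criterion:
    dimension: str   # e.g. "Output quality", "Security"
    question: str    # what the evaluators are asked to rate
    weight: float    # relative importance within this phase
    score: int       # evaluator rating on a 0-5 scale


def phase_score(criteria: list[Criterion]) -> float:
    """Return a weighted score between 0 and 100 for one evaluation phase."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * (c.score / 5) for c in criteria)
    return round(100 * weighted / total_weight, 1)


# Example: scoring one tool after the structured demo ("interview") phase.
demo_scorecard = [
    Criterion("Output quality", "Summaries are accurate on our sample contracts", 3.0, 4),
    Criterion("Security", "Supplier explains where prompts and documents are stored", 2.0, 3),
    Criterion("Workflow fit", "Works inside the document management system we use", 1.5, 2),
]

if __name__ == "__main__":
    score = phase_score(demo_scorecard)
    print(f"Demo phase score: {score}/100")
    # A team might only advance tools above an agreed threshold to the pilot phase.
    print("Advance to pilot" if score >= 60 else "Do not advance")
```

Whether you keep this in code or in a spreadsheet matters less than agreeing on the weights and the pass threshold before the demos start, so the comparison between suppliers stays honest.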
That pilot phase deserves special attention. Experience from Legal Benchmarks shows that the issues teams worry about most after rollout are precisely the things you don’t see in a polished demo: difficult integration with existing workflows, inconsistent output quality, and the ongoing effort required to validate AI output. After all, a strong CV and a smooth interview are no guarantee that someone will actually perform well on the job. The vast majority of teams that have already rolled out an AI tool are considering switching suppliers. That alone makes clear how many purchasing decisions would have turned out differently in hindsight, and how valuable a structured evaluation process can be upfront.
What also stands out is that, for most teams, cost is not the deciding factor at all. Security and output quality often carry far greater weight. Legal teams want to do things well, not necessarily cheaply. This challenges the perception that AI procurement in the legal sector is primarily a budget discussion.
Open source, open standard
The framework is open-source, available free of charge, and developed without the involvement of suppliers. It is published under a Creative Commons licence, with no paywall or mandatory registration. In my view, anyone looking to improve the procurement of AI tools should not face any barriers to doing so.
This is, of course, only the first version; the framework will undoubtedly evolve as the market and the tools change. I believe many (legal) professionals will benefit from it. But perhaps what matters even more is that it has become tangible proof that, as a sector, we make real progress by working together.