Oct 23, 2024

Legal AI Benchmarking: CoCounsel

From code to courtroom: The meticulous testing of CoCounsel’s professional-grade AI

We’re excited to share a detailed look at our testing program for CoCounsel, including specific methodologies for evaluating its skills. We aim not only to showcase the steps we take to ensure CoCounsel’s reliability, but also to contribute to broader benchmarking efforts in the legal AI industry. Though it’s challenging to establish universal benchmarks in such a diverse field, we’re engaging with industry stakeholders to work toward the shared goal of elevating the reliability and transparency of AI tools for all legal professionals.

Why evaluating legal skills is complicated 

Traditional legal benchmarks usually rely on multiple-choice, true/false, or short-answer formats for easy evaluation. But these methods aren’t enough to assess the complex, open-ended tasks lawyers encounter daily and that large language model (LLM)-powered solutions like CoCounsel are built to perform. 

CoCounsel’s skills produce nuanced outputs that must meet multiple criteria, including factual accuracy, adherence to source documents, and logical consistency. These are difficult outputs to evaluate using true/false tests. On top of that, assessing the “correctness” of legal outputs can be subjective. For instance, some users prefer detailed summaries, others prefer concise ones. Neither is “wrong”; it comes down to preference, which makes it difficult to automate evaluations consistently.

To make it even more complicated, each CoCounsel skill often involves multiple components, with the LLM handling only the final stage of answer generation. For example, the Search a Database skill first uses various non-LLM-based search systems to retrieve relevant documents before the LLM synthesizes an answer. If the initial retrieval process is substandard, the LLM’s performance will be compromised. So, our evaluation must consider both LLM-based and non-LLM-based aspects, to make sure our assessment of the whole is accurate. 
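As a rough illustration of why the stages need to be scored separately, here is a minimal sketch, with hypothetical function and field names rather than CoCounsel’s actual internals, of evaluating a retrieval-plus-generation skill in two parts:

```python
# Minimal sketch of evaluating a retrieval-plus-generation skill in two stages.
# All names and metrics here are illustrative assumptions, not CoCounsel's code.

from dataclasses import dataclass

@dataclass
class SkillResult:
    retrieved_doc_ids: list[str]   # documents returned by the non-LLM search stage
    answer: str                    # final answer synthesized by the LLM

def retrieval_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of known-relevant documents the search stage actually returned."""
    if not relevant:
        return 1.0
    return len(relevant.intersection(retrieved)) / len(relevant)

def evaluate_skill(result: SkillResult, relevant_docs: set[str], ideal_answer: str) -> dict:
    # Stage 1: score retrieval on its own, so a weak search result is not
    # mistaken for an LLM failure (or vice versa).
    recall = retrieval_recall(result.retrieved_doc_ids, relevant_docs)
    # Stage 2: score the generated answer against the attorney-written ideal.
    # A trivial substring check stands in for the real, rubric-based review.
    answer_score = float(ideal_answer.lower() in result.answer.lower())
    return {"retrieval_recall": recall, "answer_score": answer_score}
```

Scoring the two stages separately makes it clear whether a poor final answer traces back to the retrieval step or to the LLM’s synthesis.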

How we benchmark 

Our benchmarking process begins long before putting CoCounsel through its paces. Whenever a significant new LLM is released, we test it across a wide suite of public and private legal tests, such as the LegalBench dataset created by our Stanford collaborators, to assess its aptitude for legal review and analysis. We then integrate the LLMs that perform well in these initial tests with the CoCounsel platform, in a staging environment, to evaluate how they perform under real-world conditions.

Then we use an automated platform to run a battery of test cases created by our Trust Team (more on this below) to evaluate the output of this experimental integration. If the results are promising, we conduct additional manual reviews with a skilled team of attorneys. When we see an improvement in performance over previous benchmarks, we start discussing as a team how it might improve the CoCounsel experience for our users.
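As a simplified illustration (not our platform’s actual API), a run of this kind amounts to executing each Trust Team test case against the candidate integration and routing failures to attorney review:

```python
# Illustrative sketch of an automated test run; function and field names are
# hypothetical stand-ins for the real platform.

from typing import Callable

def run_test_battery(
    test_cases: list[dict],
    run_skill: Callable[[str, str], str],   # (skill_name, prompt) -> model response
    grade: Callable[[str, dict], bool],     # (model response, test case) -> pass/fail
) -> tuple[list[dict], list[dict]]:
    results = []
    for case in test_cases:
        response = run_skill(case["skill"], case["prompt"])
        results.append({
            "test_id": case["id"],
            "passed": grade(response, case),
            "response": response,
        })
    # Failing cases are queued for manual attorney review before any
    # conclusions are drawn about the candidate integration.
    failures = [r for r in results if not r["passed"]]
    return results, failures
```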

How we test 

Our Trust Team has been around as long as CoCounsel has. This group of experienced attorneys from diverse backgrounds – in-house counsel, large and small law firms, government, public policy – is dedicated to continuous, rigorous testing of CoCounsel’s performance.

We continue to follow a process that’s been integral to all our performance evaluation since CoCounsel’s inception: Our Trust Team creates tests representative of the real work attorneys use CoCounsel for and runs these tests against CoCounsel skills. When creating a test, they first consider what the skill’s for and how it might be used, based on their own insights, customer feedback, and secondary sources. Once the test is created, the attorney tester manually completes the test task, just as a lawyer would, to create an answer key – what we refer to as an “ideal response.” These tests and their corresponding ideal responses then undergo peer review. Being this meticulous is crucial, because the quality of our ideal responses determines the benchmark for a passing score.   
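For illustration only, a single test case of this kind might be represented roughly as follows; every field name here is a hypothetical stand-in, not our actual schema:

```python
# Rough sketch of what one Trust Team test case might capture.

from dataclasses import dataclass, field

@dataclass
class TestCase:
    test_id: str
    skill: str                      # e.g., "Extract Contract Data"
    prompt: str                     # the task a lawyer would give CoCounsel
    source_documents: list[str]     # inputs the skill must ground its answer in
    ideal_response: str             # answer key written manually by an attorney tester
    peer_reviewed: bool = False     # tests are used only after peer review
    critical_details: list[str] = field(default_factory=list)  # facts a passing answer must contain
```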

Once the ideal response has been created, a member of the Trust Team runs the test, using the applicable CoCounsel skill to complete the task just as a user would. An attorney tester reviews the output, referred to as our “model response.” Then they compare CoCounsel’s response to the ideal response point by point, identifying differences and assessing whether those differences deviate from the ideal response in a way, or to a degree, that would make the skill’s output incomplete, incorrect, or misleading. It is important to note that a test can be failed for many reasons, not just because an answer is outright wrong. For instance, some skills (e.g., Summarize) are more inherently subjective than others and do not lend themselves to a right or wrong answer. Accordingly, even though attorneys’ opinions of the “correct” answer can reasonably vary, we try to maintain the highest grading standards and “fail” the test if it is missing a single detail the attorney-tester considers to be Critical.
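The sketch below captures the spirit of that pass/fail rule in simplified form; exact string matching stands in for the far more nuanced comparison an attorney (or the automated evaluation) actually performs:

```python
# Simplified illustration of the grading rule: a single missing critical
# detail fails the test. This is a toy stand-in, not our grading code.

def grade_response(model_response: str, critical_details: list[str]) -> dict:
    text = model_response.lower()
    missing = [d for d in critical_details if d.lower() not in text]
    return {
        "passed": not missing,       # one missing critical detail is enough to fail
        "missing_details": missing,  # surfaced for attorney review
    }
```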

Based on this manual evaluation, we create a set of evaluation instructions tailored to each skill’s ideal output, capturing that skill’s requirements so an LLM can mirror our human reviewers’ process. To illustrate this, let’s look at a real example using CoCounsel’s Extract Contract Data skill. After running the test in our automated evaluation platform, here’s what an attorney reviewer would see when reconciling the model response with the LLM’s evaluation of it:

  • Question: If the customer is acquired through a merger, can the agreement be assigned to the acquiring entity without obtaining the provider’s prior written consent?
  • Trust Team ideal response:
    • Short answer: Yes, with 30 days’ prior written notice.
    • Detailed answer: Yes, if the customer is acquired through a merger, the agreement can be assigned to the acquiring entity without obtaining the provider’s prior written consent; the customer is only required to provide the provider with 30 days’ prior written notice. The clause specifies that either party may assign its rights or delegate its obligations without such consent, provided that 30 days’ prior written notice is given to the other party, to an entity that acquires all or substantially all of the business or assets of such party, including by merger.
  • Model response:
    • Short answer: Yes, with notice.
    • Detailed answer: The agreement can be assigned to the acquiring entity without obtaining the provider’s prior written consent if the customer is acquired through a merger. The clause provides an exception to the general rule requiring consent for assignment. Specifically, it states that either party may assign its rights or delegate its obligations without such consent to an entity that acquires all or substantially all of the business or assets of the party, including through a merger, provided that 30 days’ prior written notice is given to the other party.
  • Automated evaluation analysis: The model response captures the correct information that the agreement can be assigned to the acquiring entity without obtaining the provider’s prior written consent but requires 30 days’ prior written notice. However, it incorrectly states the short answer as “Yes, with notice” instead of “Yes, with 30 days’ prior written notice,” as per the ideal answer. This is a minor difference and does not change the meaning significantly.

In this instance, the model response included a minor discrepancy from the attorney-authored ideal response. But the LLM’s evaluation of the response accurately determined that the answer remained sufficient, because it captured the complete notice requirement elsewhere in the response.  
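To make the automated step above more concrete, here is a minimal sketch of how an LLM “judge” might be given the skill-specific evaluation instructions, the attorney’s ideal response, and the model response. The prompt wording and the `call_llm` helper are illustrative assumptions, not CoCounsel’s actual implementation:

```python
# Hedged sketch of an LLM-as-judge evaluation step. `call_llm` is a placeholder
# for whatever model API is used; it is not a real CoCounsel function.

from typing import Callable

def build_judge_prompt(instructions: str, ideal: str, model_response: str) -> str:
    return (
        "You are evaluating a legal AI skill's output.\n\n"
        f"Evaluation instructions:\n{instructions}\n\n"
        f"Ideal response (attorney answer key):\n{ideal}\n\n"
        f"Model response to evaluate:\n{model_response}\n\n"
        "Compare the two point by point. Note every difference, state whether it "
        "makes the model response incomplete, incorrect, or misleading, and end "
        "with PASS or FAIL."
    )

def auto_evaluate(instructions: str, ideal: str, model_response: str,
                  call_llm: Callable[[str], str]) -> str:
    # The judge's analysis is what the attorney reviewer reconciles, as in the
    # Extract Contract Data example above.
    return call_llm(build_judge_prompt(instructions, ideal, model_response))
```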

Our ideal-response approach provides two key advantages over assertion-based evaluations. It excels at identifying deviations from attorney expectations, including hallucinations. And it pinpoints extraneous or inconsistent information that, while not technically a hallucination, introduces logical inconsistencies that can render an otherwise complete response incorrect and result in a failing score.

We rely on our Trust Team to create well-defined ideal responses and auto-evaluation instructions and to determine if a test case passes or fails. A skill’s output definitively fails if it falls short of this ideal because of material omissions, factual incorrectness, or hallucinations. However, we recognize that many legal issues aren’t black-and-white, and the “correct” answer could be open to reasonable disagreement. To address this, we peer review ideal responses in cases when the answer might require a second opinion. And we might eliminate tests when we find insufficient agreement among the attorney testers. This is how we both ensure that our passing criteria remain rigorous and account for the nuanced nature of legal analysis. 
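As a rough illustration of the agreement check, the sketch below keeps only tests whose reviewer verdicts mostly agree. The agreement metric and the 0.8 threshold are assumptions chosen for illustration, not a published CoCounsel policy:

```python
# Illustrative inter-reviewer agreement check used to set aside tests where
# attorney testers reasonably disagree about the "correct" answer.

def agreement_rate(verdicts: list[bool]) -> float:
    """Share of reviewers who agree with the majority pass/fail verdict."""
    majority = verdicts.count(True) >= len(verdicts) / 2
    return verdicts.count(majority) / len(verdicts)

def filter_reliable_tests(tests: dict[str, list[bool]], threshold: float = 0.8) -> list[str]:
    """Keep only test IDs whose reviewer verdicts meet the agreement threshold."""
    return [test_id for test_id, verdicts in tests.items()
            if agreement_rate(verdicts) >= threshold]
```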

Maintenance and improvement 

Creating a skill test set is only the beginning. Once we begin using it, the Trust Team continually monitors and refines it by manually reviewing failure cases from the automated tests and spot-checking passing samples to make sure the automated evaluation is in line with human judgments. We also regularly add tests to cover more use cases and capture user-reported issues, which could lead to further iterations of the tests submitted for automated evaluation and their success criteria.   
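Conceptually, the spot-checking step amounts to sampling a handful of automated “pass” results for manual attorney review so that drift between the automated evaluation and human judgment gets caught. A minimal sketch, with an arbitrary sample size that is purely illustrative:

```python
# Hypothetical sketch of spot-checking passing results; sample size and field
# names are assumptions for illustration.

import random

def sample_for_spot_check(results: list[dict], k: int = 25, seed: int = 0) -> list[dict]:
    passing = [r for r in results if r["passed"]]
    rng = random.Random(seed)  # seeded for a reproducible review queue
    return rng.sample(passing, min(k, len(passing)))
```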

By following this process, we can execute more than 1,500 tests every night on our automated platform, across all CoCounsel skills and under attorney oversight; combined with manual testing, that means we’ve run more than 1,000,000 tests since CoCounsel’s launch. The process also empowers us to quickly identify areas for improvement, which is vital to ensuring CoCounsel remains the most trustworthy AI legal assistant available.

Conclusion 

In our previous blog post, we explored what it means for an AI tool to be “professional-grade” and why that standard is crucial for professionals in high-stakes fields like law. This post takes that concept further by diving into how we benchmark CoCounsel to ensure it meets those rigorous standards. By understanding the extensive testing that goes into evaluating its performance, you can see how CoCounsel consistently delivers the reliability and accuracy expected of a true professional-grade GenAI solution. 

To promote the transparency my team and I believe is necessary in the legal AI field, we’ve decided, for the first time, to release some of our performance statistics, along with a sample of the tests used to arrive at those figures, applying the criteria described in this article. Check out our results here.

This is a guest post from Jake Heller, head of CoCounsel at Thomson Reuters.
