Depending on who you ask, AI will either free us from the drudgery of our everyday lives, take our jobs, or wipe out humanity. It’s nearly impossible to glance at legal news without reading something about AI. There is, however, a lot more theorizing than actual data-driven research on how AI is working for (or against) the legal profession. Enter Professor Lee Peoples, who reports the results of his important study evaluating and comparing the performance of various specialized and non-specialized large language models (LLMs) in legal reasoning. Spoiler alert: performance varies, and not necessarily in the ways you might assume.
Before getting to the results, let’s examine how Prof. Peoples designed the study. Many first-year law students are taught to think like a lawyer using the IRAC method. As a refresher, this is a system using distinct steps to spot the Issue, identify the Rule, Apply the rule to the facts, and draw a Conclusion about the legal outcome. Prof. Peoples selected seven fact situations from a legal research and writing exercise book and anonymized them to test beginning rule analysis, skilled rule analysis, beginning analogical reasoning, skilled analogical reasoning, beginning statutory analysis, intermediate statutory analysis, and skilled statutory analysis. Very importantly, Prof. Peoples instructed the LLMs not to train on the prompts used in the testing.
Prof. Peoples’ study is thoughtfully and intentionally designed. For example, he explains, “LLMs’ statutory reasoning abilities were explored in more detail because previous studies have demonstrated LLMs’ tendency to hallucinate when analyzing statutes.” (Pp. 56-57.) In response, he tested three skill levels of statutory analysis to tease out more specificity about LLMs’ capabilities in this area. Other important features of the study include temperature setting (to limit randomness), nucleus sampling (to set a probability threshold for which tokens are considered), zero-shot prompting (posing the question without additional examples), and iterative prompting (such as instructing the LLM to reason step by step or to use “chain-of-thought” reasoning). The study tested “Lexis+ AI, Anthropic’s Claude 3 Sonnet, OpenAI’s GPT 3.5, Microsoft’s Copilot 365, and Google’s Gemini lightweight LaMDA” in April and May of 2024. (P. 57.)
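For readers curious what those settings look like in practice, here is a minimal sketch, assuming the OpenAI Python client as a stand-in for any of the tested systems. The model name, parameter values, and prompt wording are my own illustrative assumptions, not the prompts or code used in the study.

```python
# Illustrative sketch only: how temperature, nucleus sampling (top_p), and
# zero-shot vs. chain-of-thought prompting are typically set when calling an
# LLM API. Model name, values, and wording are assumptions, not the study's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

fact_pattern = "..."  # an anonymized fact situation would go here

# Zero-shot prompt: the question alone, with no worked examples.
zero_shot = f"Apply the IRAC method to the following facts:\n{fact_pattern}"

# Iterative / chain-of-thought prompt: the same question plus an instruction
# to reason step by step.
chain_of_thought = (
    zero_shot
    + "\n\nWork through the issue, rule, application, and conclusion step by step."
)

for label, prompt in [("zero-shot", zero_shot), ("chain-of-thought", chain_of_thought)]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # low temperature limits randomness
        top_p=0.9,        # nucleus sampling: probability-mass threshold
    )
    print(label, "->", response.choices[0].message.content[:200])
```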
Prof. Peoples assessed the results of the prompts using eight scoring categories. Six measured different aspects of legal analysis directly; one measured response to iterative prompting; and one measured whether the model hallucinated. Refer to the article for a full explanation of the scoring, but the categories are: “relied on sources as instructed,” “issue identification,” “stating the rule,” “applying the rule,” “reaching the correct conclusion,” “conclusion stated with certainty,” “correctly responded to the prompt to use chain of thought reasoning,” and “hallucination.” Prof. Peoples explains these categories and the scoring rubric in greater detail, but I think the important takeaway is the variability in performance, not just between models but also within the same model across different tests. This result may feel familiar. My very unscientific survey of more advanced AI users finds they almost uniformly prefer different LLMs for different types of tasks.
Based on the total scores across all tests, Claude won the day, with Lexis+ AI trailing the pack. However, as you would expect, the results are more nuanced than the total scores. For example, Copilot outperformed the other models on the Beginning Rule Analysis test, with Claude performing second best but lacking some detail. You might expect a similar result in the Skilled Rule Analysis, but in fact, Claude performed best and Copilot performed worst. Lexis+ AI came in second on that test, although Prof. Peoples noted that its response was stated with less certainty than the other models’ responses.
The statutory analysis tests proved more difficult for the models. All of the models performed reasonably well on the Beginning Statutory Analysis test. Interestingly, only Lexis+ AI referenced an important state rule critical to the response that wasn’t mentioned in the fact situation. At the same time, Lexis+ AI also cited an irrelevant rule that didn’t apply to the facts. This moderate success across all models progressively degraded in the Intermediate and Skilled Statutory Analysis tests.
I’ve frequently heard the AI lore that prompting an LLM to think in steps, or to use chain-of-thought reasoning, should improve the results. Prof. Peoples’ study indicates that this may not always be true. His results showed that Claude, Copilot, and Gemini improved with a chain-of-thought prompt, with the improvement most pronounced in more complicated scenarios. Meanwhile, GPT 3.5 and Lexis+ AI did not see the same kind of improvement.
Hallucinations may be the most publicized of the AI-related legal malpractice disasters. Surprisingly, in Prof. Peoples’ study, all of the models produced zero or only one hallucination, except Lexis+ AI, which had a hallucination rate of 57%. Prof. Peoples notes, however, that a study of mostly specialized legal LLMs conducted in May 2024 found that Lexis+ AI had the lowest hallucination rate among the models tested. (Varun Magesh et al., Hallucination Free? Assessing the Reliability of Leading AI Legal Research Tools.) Lexis+ AI’s performance in this study relative to the non-specialized LLMs is surprising, but Prof. Peoples suggests this may be because the data universe used by Lexis+ AI is much smaller than that of the non-specialized models. It would be reasonable to assume that a model limited to legal materials might outperform the non-specialized LLMs because it is focused on the most relevant sources. However, Lexis+ AI was released only six to seven months before the tests in this study, and results will likely change over time.
The most important takeaway of this study is not how the various models performed on these specific tasks but what to consider when using AI for legal analysis. In part, this is because the results of the exact same AI prompt can change from one run to the next. As Prof. Peoples highlights, one of the issues for AI in legal work is that the results aren’t reproducible. The rule of law relies on results being consistent across similar situations. Precedent is a critical feature of American law. However, “the instability of answers created by LLMs complicates their usefulness for legal work and ability to think like a lawyer. Researchers who repeatedly input identical prompts to generative AI will never receive the same responses.” (P. 75.) (A recent article on inconsistency that may be of interest is Or Cohen-Sasson, Stochastic Justice: Legal Inconsistency by Human and AI (2025).)
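To make the reproducibility point concrete, here is a minimal sketch, again assuming the OpenAI Python client, of how a researcher might check whether an identical prompt yields identical answers. The model name and prompt are illustrative assumptions; even at a temperature of zero, providers do not guarantee bit-for-bit identical outputs across runs.

```python
# Illustrative sketch only: send the same prompt several times and count how
# many distinct answers come back. Model name and prompt are assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()
prompt = "Apply the IRAC method to the following facts: ..."  # identical every run

answers = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # even at 0, identical outputs are not guaranteed
    )
    answers.append(response.choices[0].message.content)

distinct = Counter(answers)
print(f"{len(distinct)} distinct responses out of {len(answers)} runs")
```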
Another issue for AI, and one that seems intractable, is the lack of transparency about the algorithms underlying a system and about the information on which its models were trained or which they can access. That information is protected as intellectual property and trade secrets, but the protection leaves users mostly in the dark about how the model works, what it prioritizes, and what information it draws on to generate responses. This lack of understanding would be challenged, or at least queried, in most other legal processes, but it is often accepted when using AI tools.
Prof. Peoples’ study on the efficacy of LLMs for legal analysis should be required reading for law students and lawyers using AI tools, and can serve as a guideline for examining the performance of those tools.






