Predictive Coding: Recall Calculations to the Rescue

Featured Guest

Ralph Losey

Ralph Losey practices general employment litigation and e-discovery in the Orlando office of Jackson Lewis P.C. He serves...

Your Host

Michele Lange

Michele Lange is Kroll Ontrack’s Director of Thought Leadership and a nationally recognized e-discovery expert. Kroll...

This Episode

Published:	February 27, 2015
Podcast:	ESI Report
Category:	e-Discovery

Episode Notes

In 2015, astute e-discovery practitioners are looking for ways to take predictive coding to new heights using recall calculations. Ralph Losey, a leader in the e-discovery and predictive coding field, has devised a new approach to recall and precision accuracy measurement. What is this formula, and how is it more effective than previous systems?

In this episode of The ESI Report, Michele Lange interviews Ralph Losey about ei-Recall, a calculation he formulated for measuring the recall of electronic discovery processes. In a very accessible interview, Losey explains what recall and precision are in the e-discovery field and describes some of the limitations of current discovery reviews. He then discusses how ei-Recall can be a simpler, more realistic, and more accurate system for review, because it calculates recall as a range, requires only one sample, and is an easily understood process.

Ralph Losey serves as the firm’s National E-Discovery Counsel and head of litigation support in the Orlando office of Jackson Lewis P.C. He is also a writer, researcher, and educator in the field of electronic discovery and legal search. Ralph is the founder of the Electronic Discovery Best Practices group, the developer of an online legal training course in e-discovery, and principal author and publisher of the e-DiscoveryTeam.com blog.

Special thanks to our sponsor, Kroll Ontrack.

Transcript

The ESI Report: Predictive Coding: Recall Calculations to the Rescue – 2/27/2015

Advertiser: What is the most important e-discovery you’ve made? Cost control. Technology-assisted review. Information management. Preservation. From the latest cases to the newest technologies. Our goal is to expose legal professionals to the most important e-discovery trends with the brightest e-discovery experts. Rule-making. Predictive coding. Litigation holds. Proportionality. Document review. Social media. Efficiency. Ethics. Cooperation. Welcome to the ESI Report, sponsored by Kroll Ontrack, an e-discovery software and service provider. You’re listening to the Legal Talk Network.

Michele Lange: Hi, I’m Michele Lange and welcome to the ESI Report. In 2015, astute e-discovery practitioners are looking for ways to take predictive coding to new heights; predictive coding 2.0, so to speak. Our guest today is a leader in e-discovery and predictive coding, and he is always looking for new and innovative ways to hone and perfect the predictive coding process. Specifically, he recently went on a quest for the holy grail of recall calculations, and he has devised a new approach to recall methods. Who is this predictive coding mastermind? He is Ralph Losey, shareholder at Jackson Lewis in Orlando. Ralph serves as the firm’s National E-Discovery Counsel and head of litigation support. He is also a writer, researcher, and educator in the field of electronic discovery and legal search. Ralph is the founder of the Electronic Discovery Best Practices group, EDBP.com, and the developer of an online legal training course in e-discovery, E-DiscoveryTeamTraining.com, and he is the principal author and publisher of the e-DiscoveryTeam.com blog. Ralph, it’s always good to have you on the ESI Report; welcome back.

Ralph Losey: Thank you, I’m glad to report back on the “holy grail.”

Michele Lange: Before we dig in on ei-Recall specifically, perhaps you could lay a foundation for our listeners. What we plan to cover today is very in-depth predictive coding, so when it comes to evaluating predictive coding measures, perhaps you could give us some of the go-to foundations before we get too deep. Give us a high-level view.

Ralph Losey: Traditionally, the measures considered are recall, precision, and the F1 measure, and I think I can explain them easily by referring you to an oath which is given thousands of times a day in courtrooms across the country; you may be familiar with it. A person swears to tell the truth, the whole truth, and nothing but the truth. Everybody knows this, and actually, it’s the key to understanding what recall and precision are. The whole truth is getting all of the relevant information; that’s recall. Nothing but the truth; that’s precision. So if you retrieve 100 documents, and 90 of them are actually relevant, that’s a precision of 90%. If the 90 relevant documents you retrieved come from a total of 180 relevant documents in the collection, then you’ve got a recall of 50%. So recall is a matter of getting the whole truth, which, honestly, and hopefully we’ll get into this briefly, is not even possible anymore in today’s world. But those are basically the three traditional measures that people have looked at: how much of the truth you’ve got, that’s recall; how precise you are in only retrieving relevant information, that’s precision. And then F1 is sort of a mathematical blend of the two that emphasizes recall over precision by a little bit, so it’s not a simple average. Traditionally, those have been the measures that we’ve looked to in order to evaluate the effectiveness of a search project.
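The figures in the answer above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from any e-discovery tool. Note that in the standard definition F1 is the harmonic mean of precision and recall, which weights the two equally and is pulled toward the lower value (here about 0.64 rather than the simple average of 0.70); the general F-beta score with beta greater than 1 is the variant that weights recall more heavily.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """Share of retrieved documents that are relevant ("nothing but the truth")."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Share of all relevant documents that were retrieved ("the whole truth")."""
    return relevant_retrieved / relevant_total


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


# Figures from the interview: 100 documents retrieved, 90 of them relevant,
# out of 180 relevant documents in the whole collection.
p = precision(90, 100)
r = recall(90, 180)
print(f"precision={p:.0%} recall={r:.0%} F1={f_beta(p, r):.3f} F2={f_beta(p, r, 2):.3f}")
```

This prints `precision=90% recall=50% F1=0.643 F2=0.549`; the F2 score, which favors recall, lands closer to the lower recall figure.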

Michele Lange: What are some of the limitations of calculating these measures when it comes to evaluating how successful a predictive coding exercise was within a discovery review?

Ralph Losey: Actually, I don’t think the measures of precision and F1 matter much anymore under today’s modern review procedures. Let me explain why. Today, most large review projects that I am familiar with (and I talk to everybody in the community) use what’s called a two-pass review. The first pass is done by subject matter experts and professionals in predictive coding together with the computer itself, where the computer will basically identify what it believes the most relevant documents are. That’s the first pass, which is done with the computer and human experts. Then, with that batch of documents that have been predicted relevant, the second pass occurs, where you bring in contract lawyers and other human reviewers to do only two things. Number one, to verify that the computer was right, that the documents are relevant. And number two, to look for confidentiality, to appropriately label confidential documents, remove the privileged documents, and that sort of thing. So by having this double pass of relevancy, we’re always going to have a very high precision rate. I think I’m near 100% now that I’ve had it checked by the computer and then verified it again; so that’s not really the issue. The issue now is recall; that’s what everybody’s focused on. So that’s a change as far as our traditional measures go. Another limitation is the fact that we really have too much data to ever know for sure what recall is. There’s just so much data; so much for the whole truth. I’ve been saying since 2006 that we can’t afford the whole truth, because it would take millions of dollars even to verify 50,000 or 100,000 documents. Even then, it would be wrong because of the inability of humans to agree with each other. I’ve written about this and I don’t want to spend a lot of time on it, but just take my word for it: you can never know for sure what your recall is; you can only make estimates.
And that’s okay because of Rule 26(g), which is what lawyers sign to certify: we’re not required to produce all relevant documents. We’re only required to certify that we made a reasonable effort to find and produce all relevant documents. So the rule itself is compatible with the technological reality that there is so much information that it’s common to have to search through a million emails to find the 10,000 or 100,000 that may be relevant. The other limitation is that it is impossible to know recall as a point, as a specific number. I’ve written about this and talked to top statisticians and scientists in the world; they all agree about this. When you use random sampling to determine recall, you can never know exactly what recall is. You can only know a range. So you can know that you’ve attained a recall of 70-90%, for example, but you can never know for sure that you’ve attained a recall of exactly 80%, in the middle of that range. So there is an inherent limitation in recall that a lot of people gloss over, simply focusing on a point or one number. The reality is there is never one particular number; it’s always a range. Another limitation is that there is no set minimum recall percentage that you must attain in any review project in order to meet the reasonable-efforts requirement of Rule 26(g). So even though it may be popular to say you’ve got to attain 80% recall, there is no rule to that effect, and it varies according to every project you have; some are more difficult than others. So another limitation is that there is no easy answer of “here’s a number you need to reach, and if you reach it you’re good, and if you don’t reach it you’re not good.” It’s not quite that simple.

Michele Lange: So enter ei-Recall, a new method that you put a lot of work into developing. How would you describe ei-Recall for our listeners?

Ralph Losey: So ei-Recall takes all of these limitations into consideration and comes up with a simple way to calculate a recall range. And by the way, it’s probably the hardest thing I ever wrote, or blogged, explaining all of this. It was hard because I went to great efforts to make it simple, gave maybe ten different examples, and showed you how to do the math. The math is easy; it’s ninth-grade math. It’s basically multiplication and division plus looking up values in tables; that’s all that’s required. All you have to do is Google “ei-Recall,” go to where my article is, and you can see all of this in detail. The bottom line is I urge you to take advantage of the fact that I’ve literally spent hundreds of hours researching all possible methods, talking to the scientists, talking to the experts, and trying a number of alternatives to simplify this. So I’m going to give you the high-level overview. Basically, what you do at the end of a review project, or near the end of a project, is take a random sample of all of the documents that you’re not going to produce. I call these the negatives. These are documents that you have determined are irrelevant, based on the predictive coding ranking or based upon your looking at them and determining them irrelevant. You take a random sample to look for documents that you’re not going to produce that, actually, you should have produced. That’s called a false negative. You’re looking for documents that are not going to be produced that are, in fact, relevant. And the reality is, there are always going to be mistakes made. I always recommend a sample of at least 1,500 documents. That gives you what’s called a confidence interval of plus or minus 2.5% at a confidence level of 95%. Again, this is basic sampling, and I spell it all out in the article. But if you take a sample of that size, you’re almost always going to find some mistakes.
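The 2.5% figure for a 1,500-document sample at a 95% confidence level matches the standard normal-approximation margin of error for a proportion in the worst case (an observed rate of 50%). A minimal sketch follows; it assumes simple random sampling from a large collection and ignores the finite population correction, which would narrow the interval slightly. The function names are illustrative.

```python
import math

Z_95 = 1.96  # two-sided z-value for a 95% confidence level


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Half-width of a normal-approximation interval for a sample proportion.

    p = 0.5 gives the widest (most conservative) interval.
    """
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(e: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose worst-case margin of error is at most e."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


print(f"n=1,500 -> +/-{margin_of_error(1500):.2%}")
print(f"+/-2.5% -> n={required_sample_size(0.025)}")
```

The first line reports about ±2.53%; solving the same formula for n gives 1,537 documents for exactly ±2.5%, which is close to the 1,500 figure mentioned.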
And based upon the number of mistakes that you find, you determine the range, because remember, sampling always creates a range: a high number of total documents that you likely missed in the whole project, and a low number of total documents that you likely missed in the whole project. To calculate the low end of the recall range, you take the highest number of relevant documents that you could have missed according to the sample, add that to the relevant documents that you did find, and divide that sum into the relevant documents you did find. That gives you the low end of your recall range. Do the opposite to calculate the high end of your recall range: use the lowest projected number of relevant documents that you missed, the lowest number of errors in the range, add that to what you found, and divide that sum into what you found. That’s how you calculate a recall range by looking at the documents that you missed, which is called an elusion test. That’s where the “e” comes from, and the “i” in “ei” stands for interval, because it incorporates the interval that the mathematics of sampling gives you for a projected estimate like this. And it gives you, again, something like a range between 70 and 90, so you’ll know that your recall is somewhere in there. In some projects I’ve seen very high recall, where the range is between 90 and 96; that is incredibly high. In others, people are getting recall ranges anywhere from 40% to 70%. So anything over that, when you get into the 50s and 60s, and then between 70 and 80, is very good; between 80 and 100 is extremely good. But you’re always calculating a range by calculating the documents you missed. So that’s basically how the approach works, and for the details, I urge you to read the article where I spell it out in tremendous detail.
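The recall-range calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Losey’s published worksheet: the transcript mentions looking values up in tables, whereas this sketch computes a 95% Wilson score interval for the elusion rate as a stand-in, and the article may use a different binomial interval. All input figures are hypothetical. The interval on the sampled false-negative rate is projected onto the whole null set (the negatives), and each end of that projection is plugged into found / (found + missed).

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (stand-in method)."""
    p_hat = k / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def ei_recall(true_positives: int, null_set_size: int,
              sample_size: int, false_negatives_in_sample: int,
              z: float = 1.96) -> tuple[float, float]:
    """Recall range from a single elusion sample of the null set.

    true_positives: documents confirmed relevant (both passes) that will be produced.
    null_set_size: number of documents that will NOT be produced (the negatives).
    """
    lo_rate, hi_rate = wilson_interval(false_negatives_in_sample, sample_size, z)
    fn_high = hi_rate * null_set_size  # most relevant documents plausibly missed
    fn_low = lo_rate * null_set_size   # fewest relevant documents plausibly missed
    recall_low = true_positives / (true_positives + fn_high)
    recall_high = true_positives / (true_positives + fn_low)
    return recall_low, recall_high


# Hypothetical project: 10,000 confirmed relevant documents found,
# 900,000 negatives, 3 false negatives in a 1,500-document elusion sample.
low, high = ei_recall(true_positives=10_000, null_set_size=900_000,
                      sample_size=1_500, false_negatives_in_sample=3)
print(f"recall between {low:.1%} and {high:.1%}")
```

With these hypothetical inputs the sketch reports a recall range of roughly 65% to 94%, where a single point estimate (about 85%) would hide that spread.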

Michele Lange: So in that article, Ralph, you talk about five basic advantages to ei-Recall. Can you give us a brief overview of those five advantages?

Ralph Losey: Yeah, happy to. By the way, after I came up with ei-Recall and published it, I had some scientists say, yeah, this is what we use all the time anyway. Which was great; only they hadn’t written it up, and I had shared it. But the number one advantage is that it’s mathematically correct: it calculates recall as a range. If you use a formula that doesn’t calculate recall as a range, you’re wrong; it just won’t stand up to scientific scrutiny. It’s just not the way random sampling works. So the number one advantage is that you’re not just calculating one number, you’re calculating two numbers: the highest possible recall that the sampling shows, and the lowest possible recall that the sampling shows. You’ll never know for sure if you’re at the low end or the high end. You’ll only know that 95 times out of 100, you’re somewhere in that range. That uncertainty is inherent in all sampling; it is one of the weaknesses of sampling. But at least it’s honest, and it’s true in that it calculates both ends of the range. That’s the number one advantage: it’s correct because it has a range. Number two, the advantage of this method is that it only requires one random sample. Only one time do you have to take a sample, typically of 1,500 documents. With other methods, you have to take multiple samples, and any time you take multiple samples, you compound the uncertainties inherent in each sample. So one sample is really the most reliable way to go. The third advantage is that this sample is taken at the end of the project. Other methods take a sample at the beginning of a project, or take one at the beginning and one at the end. The problem with taking a sample at the beginning is that you’re not really sure what relevance is at the beginning of a project. That’s been proven time and time again.
I’ve never been involved in a big review project where we haven’t improved our understanding of relevance over the life of the review. It’s a thing called concept drift. So your understanding of relevance at the beginning of a project is at its weakest point; conversely, at the end of the project, your understanding of relevance is very good, totally refined; you’ve got it down pat. So that is the time to take your sample: at the end of the project, after you have fully developed your understanding of what’s relevant and what isn’t. The fourth advantage is that my ei-Recall method uses, in its formula, documents that have already been confirmed relevant by two passes, which I mentioned before: the double pass. They are confirmed relevant by the artificial intelligence in your software, and confirmed again by actual contract lawyers reviewing the documents to say yeah, the computer got it right, or no, the computer made a mistake and it’s not really relevant. So it’s a double pass, and that improves the accuracy. By the way, in most projects that I’m working on now, after we train the computer, it is getting it right 80-90% of the time, which under the previous methods was already considered extremely high precision. So coming into the contract lawyers’ review, we’ve already got 80-90% of the documents that the computer got right. The contract lawyers are just weeding out maybe the 10-20% of errors as part of their second-pass review. So that makes it more reliable. And the fifth thing, even though it may not seem like it when I’m trying to explain this verbally, is that this is a really simple method to employ. It’s simple math and a simple lookup; it doesn’t require you to hire an expert, because the experts have already vetted this stuff. You can do it yourself by looking at the article, and you can explain it by referring to the article, so it’s easy to use. Those are the five advantages.

Michele Lange: So how do you address some of the criticisms, or potential criticisms, that you’ve received with regard to this recall model, Ralph? I assume you’ve come across folks who have found some areas to comment on.

Ralph Losey: Well, I criticize it as well. As is inherent in any random sampling, there are weaknesses to it. The first criticism I think I’ve been able to overcome, and I don’t see anybody disagreeing with my response to it yet. The first criticism is, “Wait a minute. You’re finding mistakes, you’re finding these false negatives: documents predicted by the computer to be irrelevant where that prediction was false, so they’re false negatives. If computers make mistakes, shouldn’t you just keep going until you don’t find any mistakes?” So one criticism of the method is that it’s built on the fact that there are mistakes. The answer to that is there are always going to be mistakes in a large project, and perfection is impossible, as science has proven over and over again. That has a lot to do with the inherent subjectivity in making relevance determinations to begin with, much less in being able to measure them with multiple people having different opinions on what’s relevant and what isn’t. But the primary point is that I couple the ei-Recall test, which is a test of effectiveness, with another test: what’s called an accept-on-zero-error quality assurance test. Accept on zero error means that if I find the wrong kind of error on this particular test, I go back and start over. The error that I will never accept is when I take a random sample of the negative documents, which are not going to be reviewed, and I find any highly relevant documents. If I find a hot document, I will not accept that error. It means that the computer and the experts programming the computer have made a mistake. Then you have to go back, do further training, and look for more documents like that one. In addition, suppose I find an error that is not a hot document but is a strongly relevant document in itself, a kind of document or kind of relevance proof that I’ve not seen before.
That again is an error that I will not accept, and I will go back to training. On the other hand, if the error I find is just what everybody calls more of the same, another document that you’ve already seen 30 times in a slightly different form (not an exact duplicate, but similar), that’s cumulative evidence, and it doesn’t matter. Nobody cares about merely relevant documents. If it’s a document like that, then the test passes, and it doesn’t matter that it is a false negative. It doesn’t matter that you didn’t produce it. But if it’s a document that matters, then we go back to the drawing board. That’s the accept-on-zero-error part. The second criticism I get, which is inherent in sampling, and there’s just nothing you can do about it, is what you do in what’s called a low-prevalence situation. By that, I mean, for example, what if less than half of one percent of the documents in a collection are relevant? In a situation like that, there are so few relevant documents that it becomes much harder to calculate a range based upon a random sample. All sampling breaks down when you’re looking for something that is extremely rare. You can’t take a sample of 30,000 documents in order to do this test. That doesn’t work because of cost, and because with 30,000 documents, the contract lawyers making decisions on them are going to be inconsistent, so it isn’t going to work. So there is that inherent limitation in any low-prevalence situation. I think ei-Recall is the best approach because it at least doesn’t try to camouflage this; it does show the range. So you may have a very wide range, between 30% and 70%, a 40-point range. That kind of thing can happen in low prevalence, but at least you are aware of it, whereas other methods kind of hide that defect.
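The accept-on-zero-error rule described above is a simple decision rule layered on top of the elusion sample. A minimal sketch, assuming each document in the sample has already been graded by a human reviewer; the grade names, `SampledDoc`, and `accept_on_zero_error` are hypothetical labels for the distinctions drawn in the interview (hot, strongly relevant of a new kind, merely relevant or cumulative), not features of any particular review platform.

```python
from dataclasses import dataclass
from enum import Enum


class Grade(Enum):
    """Hypothetical reviewer grades for a document found in the elusion sample."""
    IRRELEVANT = "irrelevant"
    CUMULATIVE = "merely relevant, more of the same"
    NOVEL_STRONG = "strongly relevant, a kind not seen before"
    HOT = "highly relevant"


# Errors that fail the test and send the team back to training.
UNACCEPTABLE = {Grade.HOT, Grade.NOVEL_STRONG}


@dataclass(frozen=True)
class SampledDoc:
    doc_id: str
    grade: Grade


def accept_on_zero_error(sample: list[SampledDoc]) -> tuple[bool, list[str]]:
    """Return (passed, ids of documents that require more training)."""
    failures = [d.doc_id for d in sample if d.grade in UNACCEPTABLE]
    return (len(failures) == 0, failures)


sample = [
    SampledDoc("N-0001", Grade.IRRELEVANT),
    SampledDoc("N-0417", Grade.CUMULATIVE),  # a false negative, but tolerated
]
print(accept_on_zero_error(sample))  # (True, [])
print(accept_on_zero_error(sample + [SampledDoc("N-0932", Grade.HOT)]))  # (False, ['N-0932'])
```

The judgment of which grade a document deserves stays with the reviewers; the code only encodes the rule that cumulative false negatives are tolerated while hot or novel ones are not.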
The other reason I think it’s okay to still use recall as a measure of quality assurance, even in these very low-prevalence situations (which are not that uncommon when very few responsive or relevant documents exist), is that it’s better than nothing. And the reason that it’s acceptable is that it’s not the only thing you’re doing to measure the effectiveness of your search. You’re doing a number of other things, or you should be, in order to measure your compliance with reasonable efforts. There are other factors that you should look to and try to include in your review in order to make up for the limitations of any recall measure, particularly recall in low prevalence. Number one is the qualifications of the attorneys who are directing the review, the qualifications of your subject matter experts who are training the computer, the qualifications of the actual document reviewers themselves, and of the document review managers. In other words, the experience and quality of the people involved are a very good indication of your quality control overall. Not by themselves, because even smart, experienced people can make mistakes, but it’s a good indication. Another good indication to look for is limiting the number of contract reviewers that are used. Every time you add another reviewer to a project, you are adding more error. Look at what Verizon did about ten years ago, the old-fashioned way, in the famous study written up by EDI, where they had a hundred different contract lawyers. They had a consistency rate of only 20%, which means 80% of the time, people weren’t consistent. You might as well be using a dartboard if that’s what you’re going to do. Every time you add additional reviewers, due to human limitations, you are going to add more mistakes. To me, the ideal review team is 3 people, maybe 4 or 5 people. When you get into 30 or 40, forget about it.
Then you had better have a very high recall, because you’re going to have mistakes; that’s just how it works. The other thing to look at for quality assurance, and this is critical, is the software. Is this proven software? Has it been around, has it been verified, has it been tested? And even more important than the software itself: how good are the actual training methods used in your predictive coding? We’re all about methods nowadays. This recall test is just a method of quality assurance, but what about the method of training? Are you using multimodal search, are you using uncertainty sampling? All of the kinds of things that people are writing about and doing experiments on now: that is key. In fact, in one experiment from the Electronic Discovery Institute, the Oracle experiment, they found that the methods were more important than the software itself, and so too were experienced, knowledgeable lawyers. So that brings us full circle again. You really need to have people who know what they’re doing, ultimately veterans, who are also doing recall calculations. In fact, the math and the recall, the sampling, really is the icing on the cake. The cake itself is sound methodology by experienced personnel.

Michele Lange: So, Ralph, that really wraps up the time we have today for the ESI Report. But before we go, I know that you mentioned a couple of resources that listeners could go to in order to learn more about ei-Recall and dig deeper into this aspect of predictive coding. Could you give us a little bit of direction, again, on where those resources are available?

Ralph Losey: Yes, thank you for that. The main place where I publish my research and studies is my blog, E-DiscoveryTeam.com. I’ve actually been writing that blog for 8 years now, Michele, if you can believe it, and I recently switched from weekly posts to monthly posts, because I tend to write longer, very analytic articles on this blog. There, if you search for ei-Recall, you’ll find my writings on this. Another website that I think is a good resource for your listeners who want to learn more about predictive coding is one called LegalSearchScience.com. There I put together a basic overview of the methods that scientists are testing in their experiments, and that practitioners who specialize in this area of the law are finding to be the best methods for predictive coding, for training a computer. So those would be the two primary resources that I would ask people to check out.

Michele Lange: Well, congratulations on 8 years on the blog; that’s an outstanding milestone. And thanks again for joining me on the ESI Report.

Ralph Losey: Thank you, Michele; good talking to you.

Michele Lange: Well that wraps up another edition of the ESI Report. Thank you to everyone who’s listening and thanks to Kroll Ontrack for sponsoring the show. If you want to make sure you are in the e-discovery know, make sure you check out the ediscovery.com website or follow Kroll Ontrack on TheEDiscoveryblog.com. Until next time, I’m Michele Lange, signing off for the ESI Report.

Advertiser: This was another edition of the ESI Report, sponsored by Kroll Ontrack. Be sure to follow Kroll Ontrack on Facebook and Twitter, or learn more about Kroll Ontrack software and services at www.KrollOntrack.com.

Thanks for listening to the ESI Report, produced by the broadcast professionals at Legal Talk Network. Join Michele Lange for her next podcast on the latest e-discovery trends. Subscribe to the RSS feed on LegalTalkNetwork.com or in iTunes.
The views expressed by the participants of the program are their own, and do not represent the views of, nor are they endorsed by, Legal Talk Network, its officers, directors, employees, agents, representatives, shareholders, and subsidiaries. None of the content should be considered legal advice. As always, consult a lawyer.