It has come to my attention that some people haven't kept track of what the various similarity numbers from the text comparisons mean.
There are a few points to make.
First, in order to make a precise statement of the number of words copied between two texts, one needs to specify, at a minimum, the smallest run of words that will count toward that measure. In the trivial case, with a minimum run length of one, every text is copied 100% verbatim from an unabridged dictionary (assuming correct spelling and no neologisms). We are not interested in the trivial case, though, and that makes things more interesting. If we search for a run length of, say, ten words (which is my default), we can be pretty confident that when any two texts share one or more such runs of words, it is because of a shared source: either one was taken from the other, or both were taken from a third text. When one looks for such exact matches, one can generate a number with precision and have only one parameter to qualify it with. Any two long but unrelated texts in the same language are likely to share a preponderance of runs of 1 word (it is like the dictionary comparison mentioned earlier), a substantial number of 2-word runs, and decreasing numbers of matches at 3-, 4-, 5-word runs and so on. By the time one gets to considering 5 words or more in a row, the absolute number of matches between unrelated texts should be close to zero. So even when considering completely verbatim copying, analyses can report differing percentages of matches in a subject text simply because of the length of the run being considered to constitute a match.
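To make the effect of the run-length parameter concrete, here is a minimal sketch in Python. It is not the program I use; the function names, the tokenizing rule (lowercase, punctuation dropped), and the choice to report copied words as a share of all words in the subject text are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase words, dropping punctuation (an assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def percent_copied(subject, source, min_run=10):
    """Percentage of words in `subject` lying inside a run of at least
    `min_run` consecutive words that also appears verbatim in `source`."""
    subj = tokenize(subject)
    src = tokenize(source)
    if not subj:
        return 0.0
    # Every window of min_run consecutive words found in the source text.
    src_runs = {tuple(src[i:i + min_run]) for i in range(len(src) - min_run + 1)}
    covered = [False] * len(subj)
    for i in range(len(subj) - min_run + 1):
        if tuple(subj[i:i + min_run]) in src_runs:
            # Mark each word in the matching window as copied.
            for j in range(i, i + min_run):
                covered[j] = True
    return 100.0 * sum(covered) / len(subj)


a = "the quick brown fox jumps over the lazy dog"
b = "a quick brown fox sat on the lazy dog today"

print(round(percent_copied(a, b, min_run=1), 1))  # 77.8
print(round(percent_copied(a, b, min_run=3), 1))  # 66.7
print(round(percent_copied(a, b, min_run=4), 1))  # 0.0
```

The same pair of toy sentences yields three different percentages, and the only thing that changed was the minimum run length.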
Second, when one wants to find more than just simple verbatim copying, one will have to make more choices, and that means more potential variation in results. Words may be changed within a run, words may be inserted, or words may be deleted. "Fourscore and seven long years ago" should be recognized as having been derived from the Gettysburg Address despite having an inserted word not present in the original. One can choose how many words at a time may be changed, inserted, or deleted while still allowing a determination that a match has occurred. This number has to be strictly less than the minimum run length being considered. When I analyse for 10-word runs, I have settled upon up to 4 words as potentially being changed, skipped, or deleted.
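A rough sketch of how a tolerance for changed, inserted, or deleted words can be layered on top of exact matching follows. This is a stand-in technique, not my actual matcher: it uses Python's standard `difflib.SequenceMatcher` to find exact matching blocks of words and then chains neighbouring blocks whose separation is at most `max_gap` words in both texts. The names `fuzzy_runs`, `min_run`, and `max_gap` are hypothetical, and because `SequenceMatcher` produces a single in-order alignment, this sketch will miss passages that were copied out of order.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into lowercase words, dropping punctuation (an assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fuzzy_runs(subject, source, min_run=10, max_gap=4):
    """Return (start, end) word spans of `subject` judged to derive from `source`.

    Exact matching blocks are chained together when no more than `max_gap`
    words separate them in either text; a chain counts only if it spans at
    least `min_run` words of the subject (gap words included).
    """
    if max_gap >= min_run:
        raise ValueError("max_gap must be strictly less than min_run")
    subj, src = tokenize(subject), tokenize(source)
    matcher = SequenceMatcher(None, subj, src, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    runs = []
    chain = None  # [subject_start, subject_end, source_end]
    for b in blocks:
        if chain and b.a - chain[1] <= max_gap and b.b - chain[2] <= max_gap:
            # Close enough to the previous block: extend the current chain.
            chain[1], chain[2] = b.a + b.size, b.b + b.size
        else:
            if chain and chain[1] - chain[0] >= min_run:
                runs.append((chain[0], chain[1]))
            chain = [b.a, b.a + b.size, b.b + b.size]
    if chain and chain[1] - chain[0] >= min_run:
        runs.append((chain[0], chain[1]))
    return runs


original = ("Four score and seven years ago our fathers brought forth "
            "on this continent a new nation")
altered = ("Four score and seven long years ago our fathers brought forth "
           "on this continent a new nation")

print(fuzzy_runs(altered, original, min_run=10, max_gap=4))  # [(0, 17)]
print(fuzzy_runs(altered, original, min_run=10, max_gap=0))  # [(5, 17)]
```

With a tolerance of 4 words, the inserted "long" is absorbed and the whole 17-word passage is treated as one match. With no tolerance, only the 12 words after the insertion qualify, and the first four words fall out because they are shorter than the minimum run.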
Third, making this a matter of algorithm rather than eyeballing removes the subjectivity from the analysis. I can choose my parameters, but once I've done that, that's it. The number that pops out for one text being considered as the source of another is fixed given that choice of parameters. This is a Good Thing.
Fourth, when a percentage is reported, it depends critically upon which way the analysis was run. The number one obtains to answer the question, "How much of reference text A was copied from subject text B?" is by no means the same number as one gets by asking, "How much of reference text B was copied from subject text A?" At the extreme end, consider a chapter from Moby Dick and the whole book as texts. The chapter is 100% copied from the book, but the book is only, say, 4% copied from the chapter.
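The asymmetry is easy to demonstrate with a toy stand-in for the chapter-and-book case. The sketch below is an illustration only: the function name is hypothetical, it does exact matching with a short run length, it reports copied words as a share of all words in the first text, and the "book" is the opening words of Moby Dick followed by synthetic filler words.

```python
import re


def percent_of_a_found_in_b(a, b, min_run=3):
    """Percentage of the words in `a` that fall inside a run of at least
    `min_run` consecutive words also present verbatim in `b`."""
    wa = re.findall(r"[a-z0-9']+", a.lower())
    wb = re.findall(r"[a-z0-9']+", b.lower())
    if not wa:
        return 0.0
    runs_b = {tuple(wb[i:i + min_run]) for i in range(len(wb) - min_run + 1)}
    covered = [False] * len(wa)
    for i in range(len(wa) - min_run + 1):
        if tuple(wa[i:i + min_run]) in runs_b:
            for j in range(i, i + min_run):
                covered[j] = True
    return 100.0 * sum(covered) / len(wa)


# An 11-word "chapter" and a 100-word "book" that contains it.
chapter = "Call me Ishmael. Some years ago, never mind how long precisely"
book = chapter + " " + " ".join(f"filler{i}" for i in range(89))

print(percent_of_a_found_in_b(chapter, book))  # 100.0
print(percent_of_a_found_in_b(book, chapter))  # 11.0
```

Same two texts, same parameters, two very different numbers. Which one is reported depends entirely on which text is in the denominator.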
Fifth, I like seeing a side-by-side comparison of texts as well as getting the numbers. That is why I have always provided such views as supplements to the summary numbers when I've done this sort of text comparison. The present instance is no exception, notwithstanding the apparent inability of some naysayers to notice and follow the provided links.
OK, so here is a list of numbers and what they mean for stuff being discussed here.
90.9% - This is the number the Discovery Institute has settled on as representing how much of the KvD decision's section on whether ID is science (let's call it "KvD-IDsci" for short) was copied from the section of the plaintiffs' proposed findings of fact dealing with the same topic (ppfof-IDsci). How did they get that? Somebody at the DI eyeballed it and said that's close enough. If they tasked someone else to do it again, there is no guarantee that the number would remain the same.
70% - the proportion of copied text to uncopied text * 100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 5 words in a run and up to 2 words being changed, skipped, or deleted. This is a liberal matching criterion.
66% - the proportion of copied text to uncopied text * 100 in KvD-IDsci taken from ppfof-IDsci when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This is a conservative matching criterion, and the standard one I use for text matching.
48% - the proportion of text copied by KvD-IDsci to text not copied there * 100 in ppfof-IDsci, using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This one has confused at least one person, who seems to have thought that this was another number applied to the analysis of KvD-IDsci. Instead, this number indicates how much of ppfof-IDsci was used by Judge Jones, not how much of KvD-IDsci came from there.
38% - the proportion of copied text to uncopied text * 100 in the KvD decision taken from the plaintiffs' proposed findings of fact when using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. The whole ruling has quite a bit of text that did not come from the ppfof.
35% - the proportion of text copied from the complete KvD decision to text not copied there * 100 in the full ppfof, using parameters of a minimum of 10 words in a run and up to 4 words being changed, skipped, or deleted. This number indicates how much of the full ppfof was used by Judge Jones in his complete decision, not how much of the decision came from the ppfof.