Statistics and Hapax Legomena in the Mar Saba Letter
Following my recent reading of Carlson’s The Gospel Hoax: Morton Smith’s Invention of Secret Mark (2005), I got my hands on A. H. Criddle’s article, “On the Mar Saba Letter Attributed to Clement of Alexandria,” JECS 3.2 (1995) 215-220. He argues that the ratio of new words introduced into the Clementine corpus to old hapax legomena reused by the Mar Saba letter does not fit, in a statistically significant way, what one would expect from the patterns found in the Clementine corpus. On this basis, he argues against it being by Clement of Alexandria.
For those of you who have not read the article I recommend you do; it is short and offers an interesting argument. Unfortunately, I think it fails because of two critical assumptions it makes, assumptions which I consider completely incorrect. Before I continue, I would like to thank Andrew Criddle for the time he spent via email to help me understand the details of the argument in his paper.
The first assumption seems to come from Criddle himself, which is that the ratio of hapax legomena to total vocabulary is to some degree distinctive for each writer. He notes that on simple models of vocabulary statistics this ratio remains constant, while on more sophisticated models the ratio slowly falls as the total vocabulary rises. Now, I found this claim interesting since this ratio would naturally be useful in analyzing texts of disputed authorship. Criddle does, in fact, cite three sources: G. Herdan, Calculus of Linguistic Observations (1962), idem, Quantitative Linguistics (1964), and H. A. Simon, “On a Class of Skew Distribution Functions,” Biometrika 42 (1955) 425-40. I went through these references but was unable to find anything that supports the claim that this ratio is in any way unique to an author or related to authorial style. Rather, these references treat the hapax/vocabulary ratio as partially dependent on vocabulary size in the given text (as stated by Criddle for the more sophisticated models). This is a serious problem because, whatever the results of hapax analysis on the Mar Saba letter, they will never tell you anything about Clementine authorship.
Nonetheless, I didn’t stop there since I was also curious about the validity of such computational linguistic techniques in the first place, whether they say something about Clementine authorship or not. This is the second critical assumption in the paper, that theories of vocabulary statistics from computational linguistics are meaningful for our letter. Certainly, the numbers that Criddle provides do indicate that the ratio of new words to reused hapaxes in the letter is unusual in an apparently statistically meaningful way (the discrepancy based on a χ² test is significant).
All of these vocabulary analyses ultimately rest on Zipf’s law, which, however, is not really a law in a hard-science sense, but a statistical approximation that seems to hold in a general way for many different types of data sets. In fact, calling it a “law” is probably too strong. Most studies in computational linguistics deal with more modern languages and texts, so I was eager to put all of this to the test against some ancient Greek texts that would more closely match the purported time period and cultural background of the Mar Saba letter.
Doing hapax counts by hand would take a horribly long time, so I decided to do it by computer. I had BibleWorks give me an uninflected vocabulary list of both Paul’s letters and all of Philo. I made a separate vocabulary list for each work and also had a cumulative text for each author, and then calculated how many new words and reused hapaxes are created by adding each individual work to the corpus as a whole (not counting the added work in this corpus). I won’t go into the details of how the script works here, but if anyone is interested you can contact me and I’ll give you all of the ugly details.
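For readers who want a feel for the bookkeeping, here is a minimal Python sketch of the leave-one-out count described above. It is not the script used for this post: the function names and the toy lemma lists are made up for illustration, and it assumes each work has already been reduced to a list of uninflected words.

```python
from collections import Counter


def new_and_reused(work, rest):
    """Count new words and reused hapaxes for one work.

    work: list of lemmas in the work being added.
    rest: list of lemma lists, one per remaining work in the corpus.
    """
    # Frequency of every lemma in the corpus WITHOUT the added work.
    corpus_counts = Counter(lemma for other in rest for lemma in other)
    vocab = set(work)  # each lemma counts once, however often it recurs
    new_words = sum(1 for w in vocab if corpus_counts[w] == 0)
    reused_hapaxes = sum(1 for w in vocab if corpus_counts[w] == 1)
    return new_words, reused_hapaxes


def leave_one_out(works):
    """Apply new_and_reused to every work against all the others."""
    results = {}
    for title, lemmas in works.items():
        rest = [other for t, other in works.items() if t != title]
        results[title] = new_and_reused(lemmas, rest)
    return results


# Toy data (hypothetical, not real vocabulary lists).
toy_works = {
    "A": ["logos", "theos", "logos", "pistis"],
    "B": ["theos", "agape"],
    "C": ["pistis", "charis", "charis", "logos", "elpis"],
}
counts = leave_one_out(toy_works)
print(counts)
```

The ratio discussed in this post is then the first number divided by the second for each work. A real run would need to guard against a work that reuses no hapaxes at all, since the ratio is undefined in that case.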
I made two Pauline groups, one for his “authentic” letters and the other for the pseudepigraphical ones.


For the Pauline corpus as a whole the expected ratio of new words to reused hapaxes is about 1.8/1, which the authentic letters, taken on average, overshoot, while the pseudepigraphical letters match reasonably well. This only made me more suspicious of the ratio’s utility for sorting out issues of authorship.
However, for both groups the dispersion is fairly high and this bothered me. This led me to Philo whose corpus is much larger. The results were somewhat surprising.

With this larger corpus, a certain pattern is clear. The shortest works display the greatest dispersion in ratios. The shortest is the fragmentary Hypothetica with a vocabulary of 753 words, whose ratio of new words to reused hapaxes is 2.4/1. The second shortest work is his de Gigantibus with a vocabulary of 952 and a ratio of 1/1. The third shortest work is de Sobrietate with a vocabulary of 1001 and a ratio of 0.7/1.
I then decided to do some χ² tests, and it turns out many of Philo’s works fail the test by a significant margin. What this suggested to me is that for very short works the dispersion is so high that the results of χ² tests are rather meaningless. This is significant for the Mar Saba letter since it has a vocabulary of only 258 words, and in light of this, the letter’s ratio of 4/9 (0.44) appears far less problematic.
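As a rough illustration of the kind of test involved, the sketch below computes a two-cell Pearson chi-square for observed counts of new words and reused hapaxes against an expected ratio. The function name is hypothetical, and the expected ratio of 1.6 (8 new to 5 reused) is an assumption chosen only for illustration; it is not a figure given in this post.

```python
def chi_square_two_cells(new_words, reused_hapaxes, expected_ratio):
    """Pearson chi-square for observed counts against an expected new/reused ratio."""
    total = new_words + reused_hapaxes
    # Split the observed total according to the expected ratio.
    expected_new = total * expected_ratio / (1 + expected_ratio)
    expected_reused = total - expected_new
    return ((new_words - expected_new) ** 2 / expected_new
            + (reused_hapaxes - expected_reused) ** 2 / expected_reused)


# The letter's counts as given above: 4 new words, 9 reused hapaxes.
# ASSUMPTION: expected ratio of 1.6, used for illustration only.
statistic = chi_square_two_cells(4, 9, expected_ratio=1.6)
print(round(statistic, 2))
```

Under that assumed ratio the statistic comes out at about 5.2. Note how small the expected counts are (8 and 5 out of 13 words), which is exactly the situation in which a chi-square approximation is at its roughest.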
To put some numbers to it, I took the standard deviation of the ratio for the ten shortest of Philo’s works. The result is σ = 0.449, so the mean ± 3σ is 1.52 ± 1.35, or roughly 0.17 to 2.87, which is a massive range and easily includes the letter’s ratio of 0.44. Of course, comparing the letter’s ratio to Philo’s works is not completely fair, but since I do not have a database of Clement’s works I can’t do that.
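The arithmetic behind that range is simple enough to check in a few lines. This sketch uses only the summary figures quoted above (a mean of 1.52 and σ = 0.449); the variable names are mine, and the individual ratios for the ten works are not reproduced here.

```python
mean_ratio = 1.52      # mean ratio for the ten shortest works, as quoted above
sigma = 0.449          # standard deviation quoted above
letter_ratio = 4 / 9   # the Mar Saba letter's ratio

low = mean_ratio - 3 * sigma
high = mean_ratio + 3 * sigma
inside = low <= letter_ratio <= high

# Distance of the letter from the mean, in standard deviations (linear scale).
z = (letter_ratio - mean_ratio) / sigma

print(round(low, 2), round(high, 2), inside, round(z, 2))
```

On this linear scale the letter sits about 2.4 standard deviations below the mean of the ten shortest Philo works, inside the 3σ band.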
To take things one step further, I decided to run my script against each chapter of Acts to see if in fact the dispersion is very high for very small units of text. The average vocabulary size of Acts’ chapters is 240 words, which provides a good basis of comparison against the Mar Saba letter. The results were as follows:

As I expected, the dispersion was very high, σ = 0.663. Chapter 27 has the highest ratio at 3.22/1 and chapter 11 has the lowest at 0.38/1, even smaller than that found for the Mar Saba letter, although this is largely due to repetition of words from chapter 10. Such high dispersion is no doubt due to samples of this size not being representative of the vocabulary usage of the whole.
Two conclusions can be garnered from this. Firstly, the ratio of new words introduced vs. old hapax legomena reused offers no information about authorship, being rather a function of both language usage in general and the specific size of the text’s vocabulary. Secondly, texts with very small vocabularies will show very large variation in their ratios. This reflects the inherent instability of Zipf’s “law” for small vocabularies. It is a statistical “law” after all, and if the sample is not representative of the whole then the “law” will not hold either. Under such circumstances what the χ² test seems to be measuring is simply how good or bad the sample is.
The final part of Criddle’s paper extends his results to preposition use and scriptural quotations in the Mar Saba letter, arguing that they too are too Clementine. But, it goes without saying that if the foundations of the hapax argument have fallen away then these two other arguments dissolve as well. It should be noted that Criddle’s paper is the entire support behind Carlson’s argument that the letter is too Clementine (50-54).
I would like to thank Walter for providing as far as I know the first detailed critique of my paper. I’m going to make several points in reply which are more or less independent.
A/ Walter is, I’m afraid, correct that my section on simple and more complicated vocabulary models (pp. 217-218 of my paper) is so brief as to be misleading.
I should probably have said something like this: “In the simplest models of vocabulary statistics, such as those of Herdan and Simon, the fraction of vocabulary used once and once only is constant at one half. This is, however, clearly inaccurate. The models can easily be adjusted so that although the fraction may differ from a half, it is constant for a given writer, but this is still too inaccurate and more complex but more accurate models should be used.”
My confusing way of putting things unfortunately caused Walter to regard me as claiming that close agreement of the ratio measured for a new text with that found for the supposed author’s previous work could serve to establish authenticity. Walter is almost certainly right to reject this but I never meant to suggest otherwise. What I was interested in was the use of anomalously high and low values of the ratio as evidence of inauthenticity. And my discussion of models is intended to enable one to determine what in a specific case is a significantly too high or too low value.
B/ Walter is correct that the ratio is variable particularly with short texts. However there are two main problems with his detailed analysis. (I’ll concentrate on Philo’s works.)
Firstly, he is using a range of three standard deviations, which is I think too high. Certainly I never claimed that my results were significant at that level. My paper implies a chi-square value of 5.2, which is roughly 2.3 standard deviations, i.e. I was suggesting that values of the ratio as low as or lower than that found would occur by sheer chance in between 1 in 80 and 1 in 100 cases.
Secondly, Walter is using a linear scale to display the ratio and measure the variation thereof. This is I think mistaken, although it may be difficult to explain the issue clearly and simply. The way the data is presented, a difference between a ratio of 2 and a ratio of 3 is much larger than the difference between a ratio of 1/2 and a ratio of 1/3. However, the real difference is the same. (You can see this by inverting numerator and denominator.) With an average of roughly 1.5, although ratios of 2.5 or higher will occur not all that infrequently, ratios of 0.5 or lower will be much rarer. Thus the wide spread of values as presented gives a misleading impression of the likelihood of very low values of the ratio, which is what we are concerned with for the Mar Saba letter. Inverting the ratio, i.e. with an average of 2/3 and the value for the Mar Saba letter from my paper put at 2.25, would distort things in the opposite direction. A logarithmic scale for the ratio would be best.
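Both points can be checked with a short Python sketch. The first part converts a one-degree-of-freedom chi-square of 5.2 into standard deviations and a one-tailed chance under a normal approximation; the second shows why a log scale treats high and low ratios evenly. The function names are hypothetical, the use of the sample standard deviation is an assumption, and the list of ratios at the end is invented purely to show the shape of the calculation.

```python
import math
import statistics


def one_tailed_p(chi_square):
    """Standard deviations and one-tailed chance for a 1-d.f. chi-square."""
    z = math.sqrt(chi_square)  # with 1 d.f., chi-square is z squared
    return z, 0.5 * math.erfc(z / math.sqrt(2))


z, p = one_tailed_p(5.2)
odds = 1 / p  # "1 in odds" chance of a value this low or lower


def log_scale_range(ratios, k=3):
    """Mean +/- k sigma computed on log(ratio), converted back to plain ratios."""
    logs = [math.log(r) for r in ratios]
    centre = statistics.mean(logs)
    spread = statistics.stdev(logs)  # sample standard deviation (an assumption)
    return math.exp(centre - k * spread), math.exp(centre + k * spread)


# On a log scale, going from 2 to 3 and from 1/3 to 1/2 are equal steps.
step_up = math.log(3) - math.log(2)
step_down = math.log(1 / 2) - math.log(1 / 3)

# Hypothetical ratios, only to show the shape of the calculation.
low, high = log_scale_range([0.7, 1.0, 1.5, 2.4, 1.8])

print(round(z, 2), round(odds), math.isclose(step_up, step_down))
```

Because the range is built on the logarithm and then converted back, its lower end can never fall to zero or below, and it stretches further above the centre than below it, which is the asymmetry a linear ± 3σ band hides.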
C/ The validity of my other analyses of the Mar Saba letter statistics does not depend on the validity of my claims about the ratio of new words to reused hapaxes.
Andrew Criddle
August 11th, 2008 at 2:52 pm
I think Andrew is probably correct that a logarithmic scale would be better, but the results as they are given I think demonstrate that the ratio’s variability for very short texts still renders it much less useful for arguing authenticity/inauthenticity, assuming that the ratio says anything about authenticity at all.
I think this latter assumption, that the ratio can be used for arguments of authenticity, is unwarranted. I have not yet come across any literature from computational linguistics that uses this ratio for that purpose. It may, in fact, be useful for this task (issues of fragment size aside), but I think it remains to be empirically established.
Therefore, even if one should find the Mar Saba letter’s ratio radically too low, I am not sure what meaning this has for the authenticity issue.
As for using three standard deviations, this was mostly to show that the Mar Saba letter is not outside the range of possibility in a variable ratio, but also because Philo’s texts are significantly larger than the letter. The shortest Philo text has a vocabulary three times that of the letter. I did not think comparing the two would be definitive, but would help establish that the letter’s ratio is not completely off the charts and that the actual statistical situation is far messier than one might be led to believe.
I have a strong suspicion that if I went through every chapter of the New Testament I would find one or more examples of ratios as low as, if not lower than, the Mar Saba letter, ones that do not have the disqualifying repetition of Acts 11 (which I must thank Andrew for spotting). Perhaps I will do this for a follow-up post after I get some other work out of the way first.
August 11th, 2008 at 4:57 pm
A very rough estimate of the 6 standard deviations (+/- 3) range for Philo, using a log scale for the ratio converted back to a linear scale, is 0.58 - 3.6. Although these precise figures should not be taken too seriously, this probably gives a better idea of the range of variation of the Philo data than Walter’s range of 0.17 - 2.87 (1.52 +/- 1.35).
Andrew Criddle
August 12th, 2008 at 5:45 am
Some further criticisms of statistical analysis techniques have appeared in a recent issue of BAR, and I consider another in “A Letter to Theodore” at magicinthenewtestament.com.
December 6th, 2009 at 2:18 pm