At the time of writing I am sitting in the back row of a lecture room in Oxford at a conference of the Early English Books Online Text Creation Partnership (EEBO TCP). Follow on Twitter: #eebotcp.
The Text Creation Partnership is a text-searchable database developed from the Early English Books Online database, making it possible for the first time to get a fairly accurate idea of the frequency and distribution of lexical items and expressions. I make extensive use of this database in my own work. For example, my book includes an entire section on the satirical use of the expression "pleasant spectacle" to describe a scene of suffering or atrocity; before TCP, putting together a set of collocations like this would have taken years, and I would not have been able to write the book in its present form without access to this database.
I came hoping to learn a bit more about the technical side of the interpretation of statistics. For example, if there is an increase in the occurrence of a particular usage during the last twenty years of the seventeenth century, how should we allow for the increase in the number of publications during this period? To make it more complicated, suppose the occurrences are concentrated in a particular genre (such as devotional literature). To evaluate the significance of these occurrences we would need to know whether publications within that particular genre have increased or not.
EEBO TCP makes possible the analysis of patterns across a range of texts. The challenge is how to draw valid inferences from the range of information the TCP makes available.
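One simple way to allow for growth in output is to divide the number of texts containing the usage by the number of texts published in the same decade, and compare the resulting rates rather than the raw counts. The sketch below does this in Python. The figures and the function name `rate_per_thousand` are invented purely for illustration; they are not EEBO TCP data.

```python
# Hypothetical counts per decade: NOT real EEBO TCP figures.
texts_with_usage = {"1660s": 40, "1670s": 55, "1680s": 90, "1690s": 120}
texts_published = {"1660s": 8000, "1670s": 10000, "1680s": 15000, "1690s": 16000}


def rate_per_thousand(hits, totals):
    """Return texts containing the usage per 1,000 texts published, by decade."""
    return {decade: 1000 * hits[decade] / totals[decade] for decade in hits}


rates = rate_per_thousand(texts_with_usage, texts_published)
for decade, rate in rates.items():
    print(f"{decade}: {rate:.1f} per 1,000 texts")
```

With these made-up numbers the raw count triples between the 1660s and the 1690s, but the rate per 1,000 texts rises only from 5.0 to 7.5, because the number of publications has doubled over the same period.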
I didn't actually get a completely satisfactory answer to the question of how to balance the frequency of occurrence of a particular word or phrase against the output within a particular genre. However, I did learn about something that comes pretty close - a resource developed at Lancaster University that can distinguish occurrences within topic fields.
For example, it would allow one to search for occurrences of a word like "liberty" (that's l[i/y]bert[ee/y/ie/ey], allowing for early modern spelling variants), and distinguish between occurrences in a broadly religious context and those in a basically political one.
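The spelling pattern above can be written as an ordinary regular expression. This is only a sketch of the idea in Python; it says nothing about how the TCP or Lancaster search interfaces handle variants, and the sample sentence is invented for the purpose.

```python
import re

# l[i/y]bert[ee/y/ie/ey] expressed as a regular expression.
# \b marks a word boundary, so plurals such as "liberties" are not matched.
LIBERTY = re.compile(r"\bl[iy]bert(?:ee|ey|ie|y)\b", re.IGNORECASE)

# Invented sample sentence, not a quotation from any early modern text.
sample = "Christian libertie is not the lybertey of the flesh, nor is Liberty licence."

print(LIBERTY.findall(sample))
```

A pattern like this only covers the variants it lists; other spellings or inflected forms would need their own alternatives.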
Unfortunately, since my university in Japan does not subscribe to EEBO TCP I cannot access either the TCP database or the Lancaster database from Japan (though I can at least access TCP during the few weeks of the year that I can make it to do research in the University Library at Cambridge). There were three other delegates at the conference from universities in Japan, and they all bemoaned the fact that Japan is more or less the only country in Eastern Asia that does not subscribe to EEBO TCP. I hope we can get together and try to change that!
I came away from the conference knowing a lot more than I did about developments in digital humanities, and was struck by how much of a backwater Japan still is in this respect. Even comparatively basic things, such as WiFi enabling participants to be online during the course of a lecture, are unavailable at my university, which is very surprising, given that Japan is so technologically advanced in other ways. Again, I very much hope that this will change in the near future.
Older comments:
jules
http://julesandjames.blogspot.jp
2013-09-20 04:56:21
"For example, if there is an increase in the occurrence of a particular usage during the last twenty years of the seventeenth century, how should we allow for the increase in the number of publications during this period? To make it more complicated, suppose the occurrences mainly occur in a particular genre (such as devotional literature). To evaluate the significance of these occurrences we would need to know whether publications within that particular genre have increased or not."
What maths/statistics you do depends on exactly what your question is. I am not sure if you are asking how to do the stats, or more asking how to query the database to get the information you require. Can't help on the latter of course, but the former could be tackled in a number of ways. I'm not sure what your wider question is so do not know what you mean by "significant" in this context. One thing, however, is that, particularly if you use Bayesian statistics, then you can explicitly include your own expert opinion in the calculation, and you produce not just one answer, but a range of possible answers (that range of uncertainty being due to the fact that you have incomplete information).
jules
John R. Yamamoto-Wilson
2013-09-20 07:37:13
Thanks for this, Jules. I'm primarily a literature person, who's trying at the same time to get a sense of how certain topics of discourse were distributed during the seventeenth century. For example, 'cruelty' collocates with 'injustice' in 'some 1,600 texts on the EEBO TCP database, over 1,500 of which were published in the seventeenth century, more than a third being from the final two decades of the century' (Pain, Pleasure and Perversity, p. 118, footnote).
On the surface of it, it seems pretty clear that something is going on here, and that people were increasingly seeing cruelty as a manifestation of injustice. But there are plenty of question marks. How much should one allow for the increase in the number of printed books as the seventeenth century progressed? How much should that be offset by the consideration that EEBO can only record those books which have survived, and seventeenth-century books have a greater survival rate than sixteenth-century ones? And how much does it affect things that the database mainly consists of first editions, and so gives equal weight to books which went through only one edition and books which went through dozens?
Rather than getting too deeply into statistical analysis, my approach has been to assume that these different factors will tend to cancel each other out, at least where there is a fairly large number of occurrences of a given search term. I am not really concerned with wanting to be able to give an exact percentage for the amount of increased discourse on the subject per decade, or anything like that; I merely present the figures, with all their uncertainties, as indicative.
To substantiate the idea that the figures indicate something significant I chose rather to turn to the primary material; I am, after all, basically a student of literature! Here I find, for example, that there is quite a lot of seventeenth-century discourse on the idea that failure to be cruel when circumstances required it was itself seen as a form of injustice; if rebels and traitors were not duly dealt with, such misplaced compassion would only lead, in the end, to increased suffering for loyal subjects.
This adds weight (for example) to Granucci's thesis that, in the expression "cruel and unusual" in the 1689 Bill of Rights, 'cruel' did not mean quite what it means today, but 'seems to have meant a severe punishment unauthorized by statute and not within the jurisdiction of the court to impose' (Granucci, "'Nor Cruel and Unusual Punishments Inflicted': The Original Meaning", California Law Review, 57.4, 1969: 855–9; p. 859).
That's the kind of thing I'm interested in - relating the statistics to textual analysis.
However, suppose I want to take the statistical side of things a little bit further. Suppose I want to know about the contexts in which this discourse took place. How many of the texts were religious, how many political, how many historical? My expectation before the conference was that there might be some work on listing early modern publications by genre (I have come across some work in this area, but nothing really systematic), but I left the conference pretty much as vague about this as I started.
However, I found that the University of Lancaster has developed a resource which provides an ingenious workaround to the problem, enabling one to place occurrences roughly within a genre by evaluating their collocations in each given text.
I'm not sure exactly how this is done (I'd need to have a bit of a chance to play around with it to get to grips properly with how it works), but as I understand it, some occurrences would be in texts that talked about things like "God" and "Jesus" and "heaven" or whatever, others would be in texts that talked about "King Charles" or "the Commonwealth", and so on, and one could infer the number of occurrences in religious, political and other contexts in this way.
This is pretty good stuff, and I'm hoping to find some way of gaining access to this database, but even if I find that the number of references in a religious context goes down, while that in a political context goes up, I'm still left wondering to what extent that is because the subject is becoming more relevant to political discourse and less relevant to religious discourse, and to what extent it is a result of an overall decline in religious writing and an increase in political writing. Perhaps the Lancaster database will be able to answer this question too; I hope so, but as I say, so far I haven't had a chance to work with it, so I'm not sure of its capabilities!
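To make the distinction concrete: if the total number of texts in each genre were known for each period, one could compare the share of a genre's texts that contain the term, rather than the raw number of hits. The Python sketch below uses invented figures and a hypothetical helper called `share`; it is not based on the Lancaster resource or on real EEBO TCP counts.

```python
# Hypothetical figures for two periods: NOT real EEBO TCP or Lancaster data.
# Each entry is (texts containing the term, all texts in that genre).
counts = {
    "1640-59": {"religious": (60, 3000), "political": (30, 1500)},
    "1680-99": {"religious": (40, 2000), "political": (80, 2000)},
}


def share(hits, total):
    """Proportion of a genre's texts that contain the term."""
    return hits / total


for genre in ("religious", "political"):
    early_hits, early_total = counts["1640-59"][genre]
    late_hits, late_total = counts["1680-99"][genre]
    print(
        f"{genre}: raw change {late_hits - early_hits:+d}, "
        f"share {share(early_hits, early_total):.1%} -> {share(late_hits, late_total):.1%}"
    )
```

In this made-up case the religious hits fall by 20 but the share stays at 2.0%, so the fall reflects nothing more than a smaller body of religious writing. The political share doubles from 2.0% to 4.0%, which would point to a real shift in where the term is being used.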