SciELO - Scientific Electronic Library Online

 


DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada

Print version ISSN 0102-4450
On-line version ISSN 1678-460X

DELTA vol.18 no.2 São Paulo  2002

http://dx.doi.org/10.1590/S0102-44502002000200003 

Do we need statistics when we have linguistics?

 

Precisamos de estatística quando temos a lingüística?

 

 

Pascual Cantos Gómez

Universidad de Murcia

 

 


ABSTRACT

Statistics is known to be a quantitative approach to research. However, most of the research done in the fields of language and linguistics is of a different kind, namely qualitative. Succinctly, qualitative analysis differs from quantitative analysis in that in the former no attempt is made to assign frequencies, percentages and the like to the linguistic features found or identified in the data. In quantitative research, linguistic features are classified and counted, and even more complex statistical models are constructed in order to explain these observed facts. In qualitative research, however, we use the data only for identifying and describing features of language usage and for providing real occurrences/examples of particular phenomena. In this paper, we shall try to show how quantitative methods and statistical techniques can supplement qualitative analyses of language. We shall attempt to present some mathematical and statistical properties of natural languages, and introduce some of the quantitative methods which are of the most value in working empirically with texts and corpora, illustrating the various issues with numerous examples and moving from the most basic descriptive techniques (frequency counts and percentages) to decision-taking techniques (chi-square and z-score) and to more sophisticated statistical language models (Type-Token/Lemma-Token/Lemma-Type formulae, cluster analysis and discriminant function analysis).

Key-words: Quantitative analysis; Statistics; Language modelling; Linguistic corpora.


RESUMO

A estatística é conhecida por ser uma abordagem quantitative de pesquisa. No entanto, a maioria da pesquisa feita nos campos da linguagem e da lingüística é de natureza diferente, qual seja, qualitativa. De modo sucinto, a análise qualitativa difere da quantitativa pelo fato de a primeira não é feita tentativa de atribuir freqüências, porcentagens e outros atributos semelhantes, às características lingüísticas encontradas ou identificadas nos dados. Na pesquisa quantitativa, as características lingüísticas são classificadas e contadas, e modelos estatísticos mais complexos ainda são construídos a fim de explicar os fatos observados. Na pesquisa qualitativa, contudo, usamos os dados apenas para identificar e descrever características da linguagem em uso e para fornecer exemplos / ocorrências reais de um fenômeno particular. Neste trabalho, tentaremos mostrar como métodos quantitativos e técnicas estatísticas podem suplementar análises qualitativas da linguagem. Nós tentaremos apresentar alguns métodos quantitativos que são de grande valor para trabalhar empiricamente com textos e com corpora, ilustrando diversas questões com vários exemplos, passando das técnicas mais básicas de descrição (contagem de freqüência e porcentagens) para técnicas de tomada de decisão (qui-quadrado e z-score) e para modelos lingüístico-estatísticos mais sofisticados (fórmulas de Forma-Ocorrência / Lema-Ocorrência / e Lema-Forma, análise de cluster e discriminant function analysis.)

Palavras-chave: Análise quantitativa; Estatística; Modelagem lingüística; Corpora lingüísticos.


 

 

1. Introduction

The title itself is the reverse of Hatzivassiloglou's1. As a statistician, he discussed whether linguistic knowledge could be of any help and contribute to a statistical word grouping system. Our aim here is the opposite: to try to illustrate with numerous examples how quantitative methods can most fruitfully contribute to linguistic analysis and research. We do not intend here to offer an exhaustive presentation of all statistical techniques available to linguistics, but to demonstrate the contribution that statistics can and should make to linguistic studies.

Among the linguistic community, statistical methods, or more generally quantitative techniques, are mostly ignored or avoided, owing to lack of training, fear and dislike. The reasons given are: (1) these techniques are just not related to linguistics, philology or the humanities; statistics falls into the province of the sciences, mathematics and the like; and/or (2) there is a feeling that these methods may destroy the "magic" in literary text.

George Zipf (1935) was one of the first linguists to prove the existence of statistical regularities in language. His best known law proposes a constant relationship between the rank of a word in a frequency list and the frequency with which it is used in a text. To illustrate this, consider the 30th, 40th, 50th, 60th and 70th most-frequently occurring words taken from a sample of the Corpus Collection B (published by Oxford University Press): all the values (constants) come out at around 20,000 (see Table 1). This is because the relationship between rank and frequency is inversely proportional. In addition, Zipf thought that the constants are obtained regardless of subject matter, author or any other linguistic variable.
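The rank-frequency relationship described above can be checked on any token list. The following is a minimal sketch, not the procedure used to build Table 1: the function name `rank_frequency_products` and the toy token list are hypothetical, and the toy data are constructed so that the product comes out exactly constant, which real corpora only approximate.

```python
from collections import Counter

def rank_frequency_products(tokens, ranks):
    """Return (rank, word, frequency, rank * frequency) for the requested ranks."""
    counts = Counter(tokens)
    # most_common() sorts types by descending frequency; rank 1 is the most frequent.
    ranked = counts.most_common()
    rows = []
    for rank in ranks:
        if rank <= len(ranked):
            word, freq = ranked[rank - 1]
            rows.append((rank, word, freq, rank * freq))
    return rows

# Hypothetical toy data, built so that rank * frequency is constant.
toy_tokens = ["the"] * 6 + ["of"] * 3 + ["in"] * 2
rows = rank_frequency_products(toy_tokens, ranks=[1, 2, 3])
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```

On a real corpus one would pass ranks such as 30, 40, 50, 60 and 70 and inspect how close the products stay to one another.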

 

 

Similarly, another Zipf law showed the inverse relationship between word length and its frequency. In some languages, such as English, the most commonly used words are monosyllabic ones. This effect seems to account for our tendency to abbreviate words whenever their frequency of use increases, e.g. the reduction of 'television' to 'TV' or 'telly'. It would also seem to be an efficient communication principle to have the popular words short and the rare words long.

These examples show how some linguistic patterns are regular and independent of speaker, writer, or subject matter, and how linguistic behaviour conforms closely to some expectations: quantitative or statistical patterns. In the next section, we shall try to exploit the most basic descriptive data: frequencies, and illustrate some potential applications to linguistic research.

 

2. Frequency counts

A preliminary step in surveying a text or linguistic corpus is to produce a frequency list of its items (tokens, types or lemmas2). At its simplest, the frequency list shows the types that make up the text(s) or corpus, together with their number of occurrences. It can be produced in several different sequences3. Despite its simplicity, it is a very powerful tool. For example, frequency lists of huge corpora enable lexicographers to take important decisions on which words, and which particular meanings, a dictionary should include. Similarly, authors of L2 materials might use this data to decide which words, phrases, expressions or idioms are most relevant in teaching an L2. This evidence of usage is without any doubt a unique and most important source for any enterprise in language description.
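A frequency list of this kind takes only a few lines to produce. The sketch below is illustrative only: the tokenisation rule (lower-cased runs of letters and apostrophes) and the sample sentence are assumptions, not the settings of any particular lexicon extraction program. It shows two of the possible sequences, by descending frequency and alphabetical.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count word types in a text using a deliberately simple tokeniser."""
    # Assumption: a token is any run of letters or apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample text.
sample_text = "The cat sat on the mat. The mat was flat."
freq = frequency_list(sample_text)

by_frequency = freq.most_common()   # descending frequency order
alphabetical = sorted(freq.items()) # alphabetical order

print(by_frequency[:3])
print(alphabetical[:3])
```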

It is not just general language, but also sublanguages, that is, specific varieties of language used in certain communicative domains, such as business, medicine, sports, etc., or the study of genres or specific authors that can profit from frequency list analysis. This analysis can, for instance, shed some new light on stylistic variation issues regarding such diverse writers as Henry James and George Orwell, to name but two. To illustrate this, let us compare two sublanguages: arts (literary criticism) and science (biology). The table below (Table 2) summarises the output of a lexicon extraction program, showing the size of the lexicons produced for each corpus and giving the type-token ratio4 for each.

 

 

The type-token ratio is an extremely valuable indicator here5. It shows that, although the two language samples are different in size (the science sample has 35,861 tokens more; it is therefore 16.75% larger), we find, on average, almost ten new items for every one hundred words (tokens) in the arts sublanguage (type-token ratio = 0.0989 * 100 = 9.89 ≈ 10), whereas its counterpart sublanguage offers fewer than seven (0.0676 * 100 = 6.76). Furthermore, on the basis of this evidence, it seems that the sublanguage of science reaches a higher degree of lexical closure than its arts counterpart.
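The type-token ratio is simply the number of distinct word forms divided by the number of running words. A minimal sketch follows; the function name and the toy token list are hypothetical and unrelated to the arts and science corpora.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return len(set(tokens)) / len(tokens)

# Hypothetical toy sample: 8 tokens, 6 types.
toy_tokens = "the cat sat on the mat the end".split()
ratio = type_token_ratio(toy_tokens)
print(f"{ratio * 100:.1f} types per 100 tokens")
```

Multiplying the ratio by 100, as in the text, gives the average number of new types per hundred tokens.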

This preliminary approach shows that the sublanguage of arts is, on the whole, richer and more varied regarding the use of different vocabulary items. It resorts to more different word forms (types) and its lexical closure is also more difficult to establish. In contrast, the science sublanguage seems less varied lexically speaking and so, in lexical terms, it would appear that, on the basis of the evidence presented, it tends very strongly towards premature closure, whereas the other sublanguage does not.

Another interesting finding of frequency lists relates to lexical selection, or determining, by means of the evidence of usage, which are the most frequent or relevant items of a particular sublanguage, author, etc. Regarding our two sublanguages (science and arts) we obtain the following top 10 items (Table 3).

 

 

The main problem with this information is that the use of raw frequencies highlights the very common words such as the, of, in, etc., despite the fact that their comparatively high frequencies of occurrence are unlikely to provide conclusive evidence of any specifically used vocabulary in any sublanguage. These are words that, on the basis of frequency of occurrence alone, would be found to occur within most sublanguages, and the list can perhaps be read more usefully if the purely grammatical words (closed-class items) are discarded. This leaves us with (Table 4):
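Discarding grammatical words amounts to filtering the frequency list against a stoplist. The sketch below assumes a tiny, hand-made stoplist (`CLOSED_CLASS`) and invented counts; a real study would need a much fuller inventory of closed-class items.

```python
from collections import Counter

# Hypothetical and deliberately tiny stoplist of closed-class items.
CLOSED_CLASS = {"the", "of", "in", "and", "a", "to", "is"}

def content_word_frequencies(counts, stoplist=CLOSED_CLASS):
    """Return a copy of the frequency list without the stoplisted words."""
    return Counter({word: n for word, n in counts.items() if word not in stoplist})

# Invented raw frequencies, for illustration only.
raw = Counter({"the": 120, "of": 80, "time": 12, "cell": 9, "in": 60})
filtered = content_word_frequencies(raw)
top = filtered.most_common(2)
print(top)
```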

 

 

The specific vocabularies of both sublanguages become immediately apparent, and we note striking differences. Interesting, however, is the coincidence on the highly frequent use of time in both sublanguages (360 in arts and 526 in science).

A further type of study, using raw frequency lists, could be establishing lexical items exclusively used in each sublanguage and those used in both sublanguages (Tables 5 and 6).
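Once each sublanguage has been reduced to its set of types, exclusive and shared vocabulary follow from plain set operations. In this sketch the two type sets are invented stand-ins, not the lexicons behind Tables 5 and 6.

```python
def compare_lexicons(types_a, types_b):
    """Split two vocabularies into items exclusive to each and items shared."""
    a, b = set(types_a), set(types_b)
    return {
        "only_a": a - b,  # types found only in the first lexicon
        "only_b": b - a,  # types found only in the second lexicon
        "shared": a & b,  # types found in both
    }

# Hypothetical miniature lexicons.
arts_types = {"time", "poem", "novel", "accident"}
science_types = {"time", "cell", "species", "accident"}
result = compare_lexicons(arts_types, science_types)
print(sorted(result["shared"]))
```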

 

 

 

 

From a lexicographical and semantic point of view it could be interesting to investigate the shared items (see the case of time above). An inevitable starting hypothesis is that the same words used in different contexts are likely to bear different meanings. As an example we can examine the use of the singular noun accident. This type occurs four times in each corpus. Here are the full sentences of their occurrences and their distribution according to sublanguage and meaning (Table 7).

 

 

This very brief but revealing lexical analysis confirms our initial hypothesis that the same words used in different contexts are likely to carry different meanings. We see how the use of accident in scientific communication is restricted to a single meaning (3. If something happens by accident, it happens completely by chance), compared with the other two meanings of the noun occurring in the arts corpus (1. An accident happens when a vehicle hits a person…causing injury or damage; 2. If someone has an accident, something unpleasant happens …causing injury or death).

The merits of this apparently simple technique, frequency listing, do not end here. It is, in our opinion, a potentially inexhaustible resource and an excellent starting point for descriptive linguistic research, as it sometimes turns the invisible into the visible. The close observation of a frequency list may be the first step towards the formulation of a hypothesis.

 

3. Significance testing

As mentioned earlier, frequency lists of word forms are never more than a set of hints or clues to the nature of a text. By examining a list, one can get an idea of what further information would be worth acquiring before starting an investigation.

Returning to our arts and science sample corpora, let us suppose now that we are interested in examining how two modal verbs, can and could, are distributed in both sublanguages and compare their usage. The first thing to do is to make a simple count of each of these modals in the two corpora. Having done this, we arrive at the following frequencies (Table 8):

 

 

A quick look at these figures reveals that can and could are more frequently used in scientific communication than in arts. But with what degree of certainty can we infer that this is a genuine finding about the two corpora rather than a result of chance? We cannot decide just by looking at these figures; we need to perform a further calculation, a test of statistical significance, to determine how high or low the probability is that the difference between the two corpora on these features is due to chance.

3.1. Chi-squared test

Among the various significance tests available to linguists, we find: the chi-squared test, the t-test, Wilcoxon's rank sum test, etc. The chi-squared test is probably the most commonly used one in corpus linguistics, as it has numerous advantages for linguistic purposes (McEnery and Wilson 1996: 70): (a) it is more accurate than, for example, the t-test; (b) it does not assume that the data is normally distributed7 (quite frequent with linguistic data); (c) it is easy to calculate, even without a computer statistics package; and (d) disparities in corpus size are unimportant.

Probably, the main disadvantage of chi-square is that it is unreliable with very small frequencies (less than 5). Succinctly, chi-square compares the difference between the actual observed frequencies in the texts or corpora, and those frequencies that we would expect (if the only factor operating had been chance). The closer the expected frequencies are to the observed frequencies, the more likely it is that the observed frequencies are a result of chance. However, if the difference between the observed frequencies and the expected ones is greater, then it is more likely that the observed frequencies are being influenced by something other than chance. For instance, if we take our example, a significant difference between the observed frequencies and the expected ones of can and could would mean a true difference in the grammar or style of the two domain languages: arts and science.

The first step is to determine the significance level or threshold of tolerance for error8. In linguistics, it is common practice to fix the error threshold at a probability of 1 in 20, or p < 0.05. Remember that chi-square compares what actually happened to what hypothetically would have happened if all other things were equal. The first thing to do is to calculate the column and row totals, giving (Table 9):

 

 

Next, the expected frequencies are calculated. This is done by multiplying the cell's row total by the cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the modal verb can in the arts corpus, we multiply its cell row total (1043) by its cell column total (561) and divide that product by the sum total (1646):

$$E = \frac{1043 \times 561}{1646} \approx 355.48$$

All the calculations of the expected frequencies of each cell are shown below (Table 10):

 

 

Now, we need to measure the size of the difference between the pair of observed and expected frequencies in each cell. This is done with the formula9:

$$\frac{(O - E)^2}{E}$$

Where O = observed frequency and E = expected frequency. So, for instance, the difference measure for can (in the arts corpus) is:

Next we calculate the difference measure for all cases (can in the arts corpus, can in the science corpus, could in the arts corpus and could in the science corpus), and add all these measures up. The value of chi-square is the sum of all these calculated values. Thus, the formula for chi-square is as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

For our data above, this results in a total chi-square value of: 95.35. Having done this, it is necessary to look at a set of statistical tables to see how significant our chi-square value is. To do this one first requires a further value: the number of degrees of freedom (df)10. This is very simple to work out:

df = (number of columns in the table - 1) * (number of rows in the table - 1).

Which is for our case:

df = (2 - 1) * (2 - 1) = 1

We now look in the table of chi-square values11 in the row for the relevant number of degrees of freedom (1 df.) and the appropriate column of significance level (0.05 in linguistics). Returning to our example, we have a chi-value of 95.35 with df = 1, so according to the distribution table, we would need our chi-value to be equal to or greater than 3.84 (see Table 11), which is true. This means that the difference found between the two sublanguages regarding the use of can and could is statistically significant at p < 0.05, and we can therefore, with quite a high degree of certainty, say that this difference is not due to chance, but due to a true reflection of variation in the two sublanguages.
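The whole procedure (margins, expected frequencies, per-cell difference measures, their sum, and degrees of freedom) can be sketched in a few lines. Note the assumption: the four cell counts below are hypothetical placeholders, chosen only so that their margins agree with the totals quoted in the text (1043, 561 and 1646); they are not taken from Table 8. The critical value 3.84 is the one cited above for 1 df at p < 0.05.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            # Expected frequency: row total * column total / grand total.
            e = row_totals[i] * col_totals[j] / grand_total
            expected_row.append(e)
            # Difference measure for this cell: (O - E)^2 / E.
            chi2 += (observed - e) ** 2 / e
        expected.append(expected_row)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected

# HYPOTHETICAL counts (rows: can, could; columns: arts, science).
illustrative = [[265, 778],
                [296, 307]]
chi2, df, expected = chi_square(illustrative)

CRITICAL_05_DF1 = 3.84  # critical value for 1 df at p < 0.05, as cited in the text
significant = chi2 >= CRITICAL_05_DF1
print(round(chi2, 2), df, significant)
```

With counts of this size the statistic lies far above the critical value, which is the situation the text describes.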

 

 

3.2. Z-score

Up to this point we have been dealing just with single items without paying attention to the co-text, that is, the words on either side of the item under investigation. These co-textual words are known as collocates. Collocation has been defined as "the occurrence of two or more words within a short space of each other in a text" (Sinclair 1991: 170).

Concordance lines12 hold the primary information needed for collocation analysis. Approaches for the extraction of collocations may range from simple frequency information for words that occur near to the keyword, to the application of sophisticated statistical techniques, which calculate the figures needed for the comparison, and use them to produce measures of significance.

The most basic and naïve form of collocation analysis provided by concordance packages is to produce a frequency list of all the words occurring within predetermined proximity limits (span). As an example, we extracted the occurrences of the lemma KEEP (keep, keep'em, keeping, keeps, kept) in the arts corpus, by means of the concordance program Monoconc13, using a fixed span of six words on either side of the keyword (see Appendix 1). However, as already mentioned, the major drawback with the data obtained is that the use of raw frequencies highlights the very common words, despite the fact that they are unlikely to provide conclusive evidence of significant collocation patterns. So, we discarded the purely grammatical words (closed-class items) and also deleted low-frequency words, leaving just those that co-occur at least three times. This would leave us with just:

going 5
hands 4
feet 3
house 3
pony 3
well 3

All of these could form fairly strong patterns with KEEP and would be worth investigating further.
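Counting the words that fall within a fixed span of the keyword can be sketched as follows. This is not how Monoconc works internally; it is a minimal illustration that assumes whitespace tokenisation, and the function name `collocates` is hypothetical. The example sentence is one of the concordance lines for KEEP.

```python
from collections import Counter

def collocates(tokens, node_forms, span=6):
    """Count every word within `span` positions on either side of a node form."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in node_forms:
            left = tokens[max(0, i - span):i]    # up to `span` words before the node
            right = tokens[i + 1:i + 1 + span]   # up to `span` words after the node
            counts.update(left + right)
    return counts

# Word forms of the lemma KEEP, as listed in the text.
KEEP_FORMS = {"keep", "keep'em", "keeping", "keeps", "kept"}

line = ("answering their questions and trying to keep a conversation "
        "going while cooking a").split()
counts = collocates(line, KEEP_FORMS, span=6)
print(counts.most_common(3))
```

Summing such counts over all concordance lines, and then removing closed-class and low-frequency items, yields a list like the one above.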

This approach, though useful, is very simple. It just offers some quantitative data and is only the starting point for calculating their significance. The calculation of the data needed for collocation analysis is not very complicated, although several alternative methods are available. The starting point for any collocation analysis is a set of concordance lines for the words under investigation, long enough to contain the required span of words. The first decision to make is to choose an appropriate length of the span. Let us consider the following concordance line:

answering their questions and trying to keep a conversation going while cooking a
-6 -5 -4 -3 -2 -1 node +1 +2 +3 +4 +5 +6

The word under investigation (KEEP) is referred to as the node and is used to generate the concordance lines. The words around the node are numbered according to their position relative to the node. Words to the left of the node are numbered negatively, those to the right positively. So the span here is of twelve words, six words on either side. This set of concordance lines offers the basis for any further significance technique. All three major collocation significance tests, namely z-score, t-score and mutual information14, rely on actual or observed frequency and expected frequency. So once the concordance lines have been obtained, we need to establish the actual frequency of the words within the span. In other words, we produce a frequency list of the concordance lines, similar to the one already discussed, and may optionally also eliminate closed-class items. Suppose this is what we get:

going 5
hands 4
feet 3
house 3
pony 3
well 3

At this point we need to calculate the expected frequency figures that will be compared to the actual frequencies to assess their significance. The calculation of the expected frequency for words occurring in the span is straightforward. First, we need a theoretical language model (i.e. a representative language sample or corpus of the language or domain that we want to investigate). This model will help us to predict how these words would be distributed if there were no particular pattern of collocation between them and the node. In other words, if we want to check whether the node (KEEP) is exercising some influence over the distribution of going, hands, feet, house, pony and well, we need to know how we would expect these words to behave in the absence of that influence.

Let us return to our frequency list of occurring words with KEEP within the span limit of 12, where the verb form going occurs 5 times (observed frequency). This means that going appears 5 times in the proximity of KEEP in a text sample of overall:

12 * 5 = 60 tokens (words)

On the other hand, its overall frequency in the entire corpus (the arts subcorpus of Corpus Collection B), which consists of 178,143 tokens altogether, is 89. If going is randomly distributed throughout the text, then its expected frequency in any 60-token text sample should be:

$$E = \frac{89}{178{,}143} \times 60 \approx 0.03$$

That is, the expected frequency of going in any randomly selected sample of 60 tokens should be just 0.03, compared with its 5 actually observed occurrences in the set of lines of KEEP. Though the difference is huge, we cannot decide anything yet just from these figures; we need to perform a further calculation, a test of statistical significance, to determine how high or low the probability is that the difference between the observed and expected frequency is due to chance.

The z-score is probably the most familiar of the statistical significance measures used for collocation analysis. The calculation is reasonably easy:

$$z = \frac{O - E}{sd}$$

Where O = observed frequency, E = expected frequency and sd = standard deviation15. O is straightforward, E needs to be calculated, as explained previously, and sd uses the formula:

$$sd = \sqrt{N \, p \, (1 - p)}$$

And where p = probability of occurrence of the co-occurring word in the whole text, and N = number of tokens in the set of concordance lines. Thus, the probability of going occurring in the whole text is:

$$p = \frac{89}{178{,}143} \approx 0.0005$$

And N (number of tokens in the truncated concordance lines) is:

N = number of concordance lines * span

N = 5 * 12 = 60

So, the sd can now be calculated:

and its z-score gives:

A useful cut-off measure for significance in this type of test is around 3 (Barnbrook 1996: 96). This leads us to conclude that the occurrence of going within the co-text of the lemma KEEP is not due to chance, but due to some kind of lexical 'attraction' (see Appendix 2).
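The calculation for going can be reproduced from the figures quoted in the text (5 observed occurrences, 89 occurrences in a corpus of 178,143 tokens, 5 concordance lines with a span of 12). The sketch assumes that sd is the binomial standard deviation, the square root of N * p * (1 - p), which is consistent with the definitions of p and N given above; the function name is hypothetical, and the resulting value is computed here rather than copied from Appendix 2.

```python
import math

def collocation_z_score(observed, corpus_freq, corpus_size, n_lines, span):
    """z-score of a collocate, assuming a binomial standard deviation."""
    p = corpus_freq / corpus_size    # probability of the collocate in the whole text
    n = n_lines * span               # N: tokens in the truncated concordance lines
    expected = p * n                 # E: expected frequency within those N tokens
    sd = math.sqrt(n * p * (1 - p))  # assumed formula for sd
    return (observed - expected) / sd, expected, sd

z, expected, sd = collocation_z_score(
    observed=5, corpus_freq=89, corpus_size=178_143, n_lines=5, span=12
)
print(f"E = {expected:.3f}, sd = {sd:.3f}, z = {z:.2f}")
```

Under these assumptions the z-score for going is well above the cut-off of about 3 mentioned above.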

 

4. Putting together quantitative and qualitative approaches

An important advantage of collocation analysis is that it allows us to focus our attention on specific aspects of the contexts of words already selected for investigation through concordance lines. Collocations can help organize the context into major patterns, and we can use our knowledge of those patterns to assess the general behaviour patterns of the language or the uses being made of particular words within the text16. This can make it much easier to identify phrases and idioms, to differentiate among different meanings of a single word form or to determine the range of syntactic features.

In the examination of the results above the most significant collocations for KEEP were going, hands, feet, house, pony and well. However, for practical reasons, we deliberately discarded all grammatical words and low-frequency items. For the following analysis, we included closed-class items, focusing particularly on prepositions, since they are likely to be relevant when dealing with verbs (think of prepositional or phrasal verbs, idioms, and the like). Collocation frequency data can be very useful in this respect. The following table (Table 12) shows the data for the verb lemma KEEP (with a frequency threshold of 3; see Appendix 3 for concordance lines):

 

 

Given this data, a first approach could be to group the words in column 1-Right (words that immediately follow KEEP) according to their parts-of-speech; we get:

• Determiners: the, a and his (there is no instance where his is a possessive pronoun, see Appendix 3)

• Prepositions: up, in and to

• Pronouns: him

The right hand side association power of KEEP can now be typified as:

• KEEP + Preposition (up, in, to)

• KEEP + Pronoun (him)

• KEEP + Determiner (the, a, his)

A quick glance at the z-scores for KEEP (see Appendix 2) reveals that the probability for in to co-occur with KEEP is quite low, compared with to and up. The latter two are statistically very significant, particularly up. It is difficult to make assumptions here, due to the small sample analysed, but the z-scores point to one hypothesis: KEEP + up and KEEP + to may form lexical units (prepositional verbs or phrasal verbs), such as in:

The jargon they kept up was delicious for me to hear.

Commentary on form is kept to a minimum and is almost entirely superficial.

However, in seems very unlikely to be part of the verb and is probably part of the prepositional phrase (PP) that follows the verb (see low z-scores; Appendix 2), as in:

Hoomey knew it was a suggestion for sugarlumps, which he kept in his pocket.

Regarding determiners, these do not occur alone: they precede or determine a noun or noun head. We can go further and say that KEEP can take noun phrases (NPs) on its right hand side, of the type:

• NP → Pr; (You keep him here, and say your prayers, and all will be well)

• NP → Det (Adj) N; (My uncle will go on keeping the horses if we want them.- He keeps a sleeping bag up there, stuffed behind the old ventilator pipes, and he sleeps in with her.- He was keeping his feet well out of the way, following where his horse ate)

The above is true, as KEEP is a transitive verb. Consequently, we could with some degree of certainty say that the high co-occurrence frequency of pronouns and determiners with the verb KEEP is not due to the configuration of any particular phrase but due to the transitivity of KEEP.

Regarding its association with prepositions, we have three prepositions which are directly attached to the verb (up, in, to), and three others which occur within a word distance of two (2-Right: with, of, to). A first hypothesis could be that the co-occurrence of KEEP + Preposition (up, in or to) attracts other prepositions. If we look at the concordance list, this is only true for up, which attracts with on four out of six occasions, as in:

To keep up with this pace, that is, just to carry out the work that...

This results in the following syntactic frames for KEEP + Preposition:

• KEEP + up

• KEEP + up + with

• KEEP + to

• KEEP + in

Whereby the first three are very likely to form phrasal verbs or prepositional verbs, as already discussed, but not in, which is part of what follows, a PP in this case. In addition, KEEP + up + with, might be a phrasal prepositional verb.

The three non-directly attached prepositions (with, of, to) have different syntactic patterns with respect to those directly attached ones (up, to, in). With, of and to allow another constituent to be placed in between the verb KEEP and the preposition itself; see for instance:

... but where one keeps faith with it by negation and suffering...
One Jew with a pencil stuck behind his ear kept gesticulating with his hands and…
... that for so long helped to keep the Jews of Eastern Europe in medieval ghettoes

The allowed 'intruder' is either a present participle or an NP:

• KEEP + NP / Present Participle + with

• KEEP + NP + of

• KEEP + NP + to

An interesting syntactic difference among these non-directly attached prepositions is that in the first two instances (KEEP + NP/Present Participle + with and KEEP + NP + of), the prepositions are part of a PP, namely the PP that complements the preceding NP or present participle, whereas in KEEP + NP + to, the verb seems to form a kind of discontinuous phrase, and the preposition might, therefore, be thought to be part of the verb:

…it was insisted that they keep themselves to themselves...

We also find the determiner a in position 2-Right, which indicates that whenever KEEP has the pattern:

• KEEP + ... + a...

The determiner introduces an NP, and this might be some hint of transitivity. The transitive instances with this pattern we have detected are the two phrasal or prepositional verbs:

• KEEP + up + a: … keeping up a house

• KEEP + to + a: Commentary on form is kept to a minimum...

However, in the case of KEEP + in, the NP that follows is not a direct object but part of an adverbial-PP headed by the preposition in:

... which he kept in his pocket

Finally, the exploration of the right hand side association of KEEP takes us to hands. Its high z-score (40.51) gives some evidence of an idiomatic expression:

• KEEP + Det(Poss) + hands: ... and keeping his hands away from the ever-questing...

Let us now analyse the left hand side association. This side can be interpreted straightforwardly and indicates clearly that KEEP can only be a verb. The word forms preceding KEEP are:

• The infinitive particle to: You'd be able to keep order, sir.

• The pronouns I and he (subjects of the verb): I kept the past alive out of a desire for revenge.

• The modal verbs would and could, both of which require a non-finite verb form to their right: Some would keep me trotting round the parish all day.

• The preposition of which requires a non-finite verb form, in instances such as: … the habit of keeping themselves to themselves was not easily lost.

Note that this has been by no means an exhaustive analysis of KEEP collocation patterns and phrases, just a mere illustration of the potential of using quantitative data in combination with qualitative linguistic data. In addition, for the sake of simplicity and clarity, we have deliberately reduced: (a) the collocation data (freq ≥ 3) and (b) the concordance sets. As already noted, the identification of phrases goes far beyond simple syntactic frame identification, and it may take us to important syntactic, lexical and semantic discoveries about the behaviour of words, collocates and phrases: the sociology of words.

In the following final section, we shall try to introduce the reader to a more challenging side of statistics: language modelling17.

 

5. Statistical language modelling

The term model as used here is understood to represent a simplified description of a natural language property.

5.1. Modelling the lexicon

Sánchez and Cantos (1997; 1998) use the term model to represent mathematically the transitive relationship18 between types, tokens and lemmas. This relationship has led to interesting findings: while tokens grow linearly, types and lemmas do so in a curvilinear fashion19 (Figure 1). After a thorough analysis, Sánchez and Cantos (1997) came up with three statements of prediction regarding tokens, types and lemmas. That is, the number of types and lemmas a text or corpus is made of is related to the total number of tokens. The first of these statements is the so-called Type-Token formula:

 

 

This formula states that the total number of types can be modelled and predicted given the total tokens of a text. K is a text-dependent constant and needs to be calculated beforehand using a small sample of the text or corpus under research20. In other words, the formula above models the relationship between types and tokens. This mathematical model enables researchers to estimate reliably the hypothetical number of types a corpus may entail, even before compiling the corpus. In much the same way, we can also calculate the lemmas:

And also infer the lemmas from a type sample:

The table below shows the predictions of types and lemmas for a hypothetical corpus of contemporary Spanish21 (Table 13).

 

 

The analytic technique for predicting types proposed and applied by Sánchez and Cantos (1997) is simple and straightforward, and the resulting formulae are easy to use, flexible and can be applied quickly to any corpus or language sample. After thorough testing on various text samples of different sizes, the formulae have proved to be very reliable, with a more than acceptable error margin of ±5%, and this speaks eloquently of their validity. Their most positive contributions can be summarised in the following points: (a) they are stable indicators of lexical diversity and lexical density22; (b) they overcome the reliability flaw of both the token-type ratio and the type-token ratio23, as they are not constrained by or dependent on text length; and (c) they can be used as predictive tools to account for the total number of types and lemmas that any hypothetical corpus might contain (see Sánchez and Cantos 1998).
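The formulae themselves are not reproduced in this version of the text. As a minimal sketch, and assuming a square-root form, Types = K × √Tokens (an assumption made here for illustration, not something the passage above confirms), the following Python snippet shows how K could be estimated from a small sample and then used to project the number of types in a larger corpus. The function names and the sample counts are hypothetical.

```python
import math


def estimate_k(sample_types: int, sample_tokens: int) -> float:
    """Estimate the constant K from a sample.

    Assumes the relationship types = K * sqrt(tokens); this functional
    form is an assumption of this sketch.
    """
    if sample_tokens <= 0:
        raise ValueError("sample_tokens must be positive")
    return sample_types / math.sqrt(sample_tokens)


def predict_types(k: float, tokens: int) -> float:
    """Project the number of types for a corpus of the given token size."""
    return k * math.sqrt(tokens)


# Hypothetical sample: 10,000 tokens containing 2,000 different types.
k = estimate_k(2000, 10_000)
# Projection for a hypothetical corpus of one million tokens.
projected_types = predict_types(k, 1_000_000)
print(k, projected_types)
```

Under the same assumption, a lemma-token constant (the KL-value of footnote 20) could be estimated in exactly the same way by passing lemma counts instead of type counts.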

A further revealing issue is that applying the formulae above to different text samples yields idiosyncratic, unique and distinctive slopes. The contrastive graph below (Figure 2) clearly shows that, for example, Conrad's lexical density is higher than Doyle's and Shakespeare's. This evidence suggests that these formulae might also be valid for text, author and language classifications, among others.

 

 

5.2. Modelling text-classification

For the following experiment, we (a) extracted (from the CUMBRE Corpus) 11 different text samples from textbooks and manuals for secondary education and university level education, relative to various subjects or linguistic domains, (b) obtained their total amounts of tokens and types, and (c) calculated their K-values (see above). The results are shown below in Table 14.

 

 

The mean K-value for the 11 samples is 27.29 and its standard deviation 9.43. Comparing these figures with the individual K-values from the table above (Table 14) reveals a great deal of variability or dispersion among the various text samples. The sample on physics compared with sociology indicates huge differences in lexical density, not to mention mathematics versus architecture. However, geography and history seem to have a very similar lexical density. The histogram above (Figure 3) graphically displays the various text types ordered according to their lexical densities (K-values). Interesting here is the fact that the lexical density ordered scale moves smoothly from pure science subjects (mathematics, computing, chemistry, etc.) to more arts and humanistic content texts. Additionally, neighbourhood on the histogram might suggest subject relatedness: the more dissimilar the lexical density indices (K-values) the less the subjects relate to each other.

The K-values suggest that discrimination between chemistry (18.46) and sociology (42.03) texts might indeed be possible, as the two figures diverge significantly. However, a K-value based distinction between chemistry (18.46) and physics (19.27) seems less reliable, because the two values are so close.

To construct a purely statistical discrimination model, we started experimenting with a statistical technique known as cluster analysis. Succinctly, cluster analysis classifies a set of observations into two or more mutually exclusive groups based on the combination of interval variables24. The purpose of cluster analysis is to discover a system of organizing observations into groups, where members of the group share properties. Cluster analysis classifies unknown groups while discriminant function analysis (see below) classifies known groups. A common approach to performing a cluster analysis is to first create a table or matrix of relative similarities or differences between all objects and second to use this information to combine objects into groups. The table of relative similarities is called a proximity or dissimilarity matrix. Table 15 displays this dissimilarity matrix25.
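As a rough illustration of how such a matrix can be built, the sketch below assumes squared Euclidean distance on a single variable, the K-value. The text does not name the distance measure, so this is an assumption; it is, however, consistent with the chemistry and physics K-values quoted above (18.46 and 19.27), whose squared difference is about 0.66, the dissimilarity reported for that pair. Only the three K-values quoted in the running text are used, and the function name is hypothetical.

```python
from itertools import combinations


def dissimilarity_matrix(k_values):
    """Pairwise squared differences between K-values.

    Squared Euclidean distance is an assumed measure in this sketch.
    """
    return {
        (a, b): (k_values[a] - k_values[b]) ** 2
        for a, b in combinations(k_values, 2)
    }


# K-values quoted in the running text (the full Table 14 is not reproduced).
k_values = {"Chem": 18.46, "Phys": 19.27, "Soc": 42.03}
matrix = dissimilarity_matrix(k_values)

for pair, distance in matrix.items():
    print(pair, round(distance, 2))
```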

 

 

Looking at the matrix, we find that the least dissimilarity, or closest similarity, of all is 0.18, between the history text sample and the geography one. We could say that these seem to form the pair that is most alike. Physics and chemistry have a very low dissimilarity index (0.66) and could be grouped, too. Since history is related to geography, we could say that these form a cluster. At the opposite end of the scale, we find the largest difference between mathematics and architecture (916.27). After the distances between the text types have been found, the next step in the cluster analysis procedure is to divide the text types into groups based on the distances. The results of the application of the clustering technique are best described using a dendrogram or binary tree. The interpretation of the dendrogram is fairly straightforward (Figure 4). For example, Geo/Hist/Nat form a group, Chem/Phys/Comp/Math form a second group and Arch/Soc make up a "runt", as they do not enter any group until near the end of the procedure. Our dendrogram outputs 6 possible solutions:

Solution 1: 1 group: (1) Geo/Hist/Nat/Med/Phil/Chem/Phys/Comp/Math/Arch/Soc.

Solution 2: 2 groups: (1) Geo/Hist/Nat/Med/Phil/Chem/Phys/Comp/Math and (2) Arch/Soc

Solution 3: 3 groups: (1) Geo/Hist/Nat/Med/Phil, (2) Chem/Phys/Comp/Math, and (3) Arch/Soc

Solution 4: 4 groups: (1) Geo/Hist/Nat, (2) Med/Phil, (3) Chem/Phys/Comp/Math, and (4) Arch/Soc

Solution 5: 5 groups: (1) Geo/Hist/Nat, (2) Med/Phil, (3) Chem/Phys/Comp, (4) Math and (5) Arch/Soc

Solution 6: 11 groups: (1) Geo, (2) Hist, (3) Nat, (4) Med, (5) Phil, (6) Chem, (7) Phys, (8) Comp, (9) Math, (10) Arch and (11) Soc

 

 

In terms of granularity, the most desirable solution is (6), as it models the classification of all 11 text types, whereas solution (1) is clearly the worst one, as it is unable to discriminate between any texts at all.
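The way several grouping solutions can be read off a single hierarchy may be sketched as follows. This is a minimal agglomerative procedure that assumes single linkage and squared differences between K-values; the linkage method actually used is not stated in the text, so both choices are assumptions. Each merge step yields one candidate solution, from every text type on its own down to one all-inclusive group. Only the three K-values quoted in the running text are used, and the function name is hypothetical.

```python
def agglomerate(k_values):
    """Return the partitions produced by successive single-linkage merges.

    Single linkage and squared differences are assumptions of this sketch.
    """
    clusters = [[name] for name in k_values]
    solutions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are least dissimilar.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    (k_values[a] - k_values[b]) ** 2
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        solutions.append([list(c) for c in clusters])
    return solutions


k_values = {"Chem": 18.46, "Phys": 19.27, "Soc": 42.03}
solutions = agglomerate(k_values)
for solution in solutions:
    print(len(solution), "group(s):", solution)
```

With these three values, chemistry and physics merge first and sociology joins only at the last step, which mirrors the behaviour described for late-joining groups in the dendrogram.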

Cluster analysis methods always produce a grouping. The grouping produced by this analysis may or may not prove useful for classifying objects. To validate the results of a cluster analysis, it has to be used in conjunction with discriminant function analysis on the resulting groups (solutions). Cluster analysis is a useful exploratory tool for elucidating possible grouping solutions and for constructing, at a later stage, a predictive model of group membership by means of discriminant function analysis. This latter technique is based on the interval variables (K-values). It begins with a set of observations where both group membership and the values of the interval variables are known. The end result of the procedure is a model that allows prediction of membership when only the interval variables, here the K-values, are known. A second purpose of discriminant function analysis is an understanding of the data set, as a careful examination of the prediction model that results from the procedure can give insight into the relationship between group membership and the variables used to predict group membership.

In order to construct a model using discriminant function analysis, we added 11 more text samples, one for each text type. Next, using the cluster analysis data, we constructed six models, one for each solution or grouping (see above).

The table below (Table 16) displays the discriminant model for solution 6 (11 text types). It shows the case number (each text sample), actual group26, group assignment27 (Highest Group and 2nd Highest Group) and discriminant scores. Note that erroneous group assignments are marked with "**". The success rate (correct group assignment) is 81.81%; the model failed to assign cases 10, 13, 15 and 16 correctly, although all four were correctly classified in the second choice (2nd Highest Group).

 

 

The next discriminant model based on solution 5 (Table 17) output a very promising 95.5% success rate (it only failed in classifying case 19).
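The two success rates can be checked with simple arithmetic. The sketch below assumes a total of 22 cases, inferred from the 11 original samples plus the 11 added ones; the function name is hypothetical.

```python
def success_rate(total_cases: int, misclassified: int) -> float:
    """Percentage of cases assigned to their actual group."""
    return 100 * (total_cases - misclassified) / total_cases


TOTAL_CASES = 22  # assumed: 11 original samples + 11 added samples

# Solution 6: four misassigned cases (10, 13, 15 and 16).
print(round(success_rate(TOTAL_CASES, 4), 1))  # 81.8
# Solution 5: one misassigned case (19).
print(round(success_rate(TOTAL_CASES, 1), 1))  # 95.5
```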

 

 

These analyses are very revealing and it is now up to the reader to decide which solution is best. We think that the best model is solution 5, because of its reasonable discrimination power (it is able to discriminate 5 different text types: (1) Geo/Hist/Nat, (2) Med/Phil, (3) Chem/Phys/Comp, (4) Math and (5) Arch/Soc) and its accuracy (95.5%).

Another positive contribution of discriminant function analysis is that once the groups are known, we can construct a model that allows prediction of membership. This is done by means of the resulting discriminant function coefficients. The coefficients for solution 5 are (Table 18):

 

 

Thus, the discriminant equation would be:

TEXTTYPE = Constant + (K_VALUE * x)

Where x stands for any given K-value. It is important not to confuse K_VALUE with the given K-value. The former refers to a text-specific coefficient that has been calculated by the discriminant function analysis, whereas the latter is the lexical density index discussed earlier in the text (see Section 5.1).

To illustrate the discriminant and predictive power of the equation, take, for example, a hypothetical text with a K-value = 14.01, that is x=14.01. We instantiate TEXTTYPE and the coefficients Constant and K_VALUE accordingly and get the following results:

 

 

Next, we simply choose the maximum of the five results. So, a text with a K-value = 14.01 would most likely be classified in first choice as a mathematics text, as Math yields the highest resulting score (34.56); in second choice, it would be classified as Chem/Phys/Comp (30.52). Similarly, the least likely group membership would be Arch/Soc (-116.48).
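The scoring and selection steps can be sketched in a few lines of Python. Since the coefficients of Table 18 are not reproduced here, the coefficient pairs below are hypothetical placeholders attached to made-up group names; only the three scores quoted in the text for x = 14.01 are taken from the article. The function names are likewise hypothetical.

```python
def discriminant_scores(k_value, coefficients):
    """Score each group as Constant + (K_VALUE * x), where x is the K-value."""
    return {
        group: constant + slope * k_value
        for group, (constant, slope) in coefficients.items()
    }


def rank_groups(scores):
    """Order groups from most to least likely (highest score first)."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical (Constant, K_VALUE) pairs; NOT the values from Table 18.
placeholder_coefficients = {"group_a": (-10.0, 2.0), "group_b": (-40.0, 3.0)}
placeholder_scores = discriminant_scores(14.01, placeholder_coefficients)
print(rank_groups(placeholder_scores))

# Scores reported in the text for x = 14.01 (three of the five are quoted).
reported_scores = {"Math": 34.56, "Chem/Phys/Comp": 30.52, "Arch/Soc": -116.48}
print(rank_groups(reported_scores))
```

Ranking all groups, rather than returning only the maximum, also gives the second choice that the tables report as the 2nd Highest Group.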

Interesting in this sense are Figures 3 and 4, and the discriminant function analysis. Figure 3 represents visually the linguistic domains ordered by K-value, where we can appreciate a logical and smooth text type transition that goes from pure science (mathematics) to clearly humanities content (sociology/architecture). This stratification is based on a single lexical density feature: the K-value. In addition, Figure 4 offers an exploratory grouped hierarchical structure of the text types, highlighting the major flaw of the K-value: its inability to distinguish between close K-values, as these are grouped into single clusters. Clearly, the K-value fails to distinguish between (a) geography, history and natural sciences; (b) medicine and philosophy; (c) chemistry, physics and computing; and (d) sociology and architecture. However, the final modelling of the data by means of the discriminant function analysis reveals that the K-value is valid and reliable for successful differentiation of (a) geography/history/natural sciences, (b) medicine/philosophy, (c) chemistry/physics/computing, (d) sociology/architecture and (e) mathematics from each other. Though a potential text discriminator using the K-value does not, in principle, output a very fine-grained classification, this does not invalidate the use of lexical density for text differentiation. The resulting text classification from the experiment is far from being erroneous or exaggeratedly generic. On the contrary, it discriminates clearly distinctive text type clusters with a promising accuracy rate.

 

6. Some final remarks

In this paper, we have tried to scrutinise studies from a number of diverse linguistic areas (lexicography, grammar, stylistics and also computational applications: probabilistic language modelling) and attempted to show the usefulness of statistics in each. In addition, we have also highlighted the issue of texts and corpora as sources of quantitative data. The important role of quantitative analysis and its interaction with qualitative analysis has been described and exemplified. We have tried to call the reader's attention to the singularity of statistical regularities and patterns in natural languages, that is, that while we are engaged in communication, we do not consciously monitor our language to ensure that these statistical properties hold. It would be impossible to do so. Yet, without any deliberate effort on our part, we shall find the same underlying regularities in any large sample of our speech or writing.

Statistics allows us to summarise complex numerical linguistic data in order to draw inferences from them: intuitive comparisons cannot always be used to tell us something significant about linguistic variation and so we considered the role of significance tests, such as chi-square and z-score in telling us how far differences between samples may be due to chance. The need to summarise and infer stems from the fact that there is variation in linguistic data; else there would be no place for statistics in linguistics.

Finally, we also envisage two underlying purposes in this article. First, by applying and discussing the statistical techniques used, we enable the reader to evaluate them critically, in particular their appropriateness and the assumptions they make for linguistic analysis. No attempt has been made to present a thorough survey of the statistical methods available; rather, we have tried to make them more accessible to non-specialists (linguists and postgraduate students). Second, as some readers might be interested in planning their linguistic research using statistics, several of the techniques introduced here might, in part, assist them in similar areas and topics.

The increasing accessibility of linguistic corpora and the belief that theory must be based on language as it is have placed empirical linguistics once again at the foreground of linguistics. The immediate implication of these assumptions is that linguists will increasingly demand the use of statistics in their research. The answer to the title of this paper is, hence, definitely yes.

 

Acknowledgements

Thanks go to Dr. Javier Valenzuela and Richard Barbosa for their comments, advice and useful suggestions on an earlier draft of this paper.

 

References

ALLWOOD, J., L.-G. ANDERSSON & Ö. DAHL 1977. Logic in Linguistics. Cambridge: CUP.

BARNBROOK, G. 1996. Language and Computers. Edinburgh: Edinburgh University Press.

BAAYEN, R. H. 2001. Word Frequency Distribution. Dordrecht: Kluwer Academic Publishers.

BROWN, D.J. 1988. Understanding Research in Second Language Learning. A Teacher's Guide to Statistics and Research Design. Cambridge: CUP.

CANTOS, P. 1995. Tratamiento informático y obtención de resultados. CUMBRE corpus lingüístico del español contemporáneo. Fundamentos, metodología y aplicaciones. Ed. A. Sánchez, R. Sarmiento, P. Cantos and J. Simón. Madrid: SGEL: 39-70.

______ 2000. Investigating Type-token Regression and its Potential for Automated Text Discrimination. Cuadernos de Filología Inglesa. Número monográfico: Corpus-based Research in English Language and Linguistics. Ed. P. Cantos and A. Sánchez. Murcia: Servicio de Publicaciones de la Universidad de Murcia: 71-91.

______ 2001. An Attempt to Improve Current Collocation Analysis. Technical Papers. Volume 13. Special Issue. Proceedings of the Corpus Linguistics 2001 Conference. Ed. P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja: 100-8.

CANTOS, P. & A. SÁNCHEZ (forthcoming) Lexical Constellations: What Collocates Fail to Tell. International Journal of Corpus Linguistics.

CHIPERE, N., D. MALVERN, B. RICHARDS & P. DURAN 2001. Using a Corpus of School Children's Writing to Investigate the Development of Vocabulary Diversity. Technical Papers. Volume 13. Special Issue. Proceedings of the Corpus Linguistics 2001 Conference. Ed. P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja: 126-133.

CHURCH, K.W., W. GALE, P. HANKS & D. HINDLE 1991. Using Statistics in Lexical Analysis. Ed. U. Zernik. Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon. Hillsdale, NJ: Lawrence Erlbaum Associates: 115-164.

CLEAR, J. 1993. From Firth Principles: Computational Tools for the Study of Collocation. Ed. M. Baker et al. Text and Technology. Amsterdam: Benjamins: 271-292.

HATZIVASSILOGLOU, V. 1994. Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to Statistical Word Grouping System. The Balancing Act. Combining Symbolic and Statistical Approaches to Language. Ed. J.L. Klavans and P. Resnik. Cambridge: The MIT Press: 67-94.

HUNSTON, S. & G. FRANCIS 1999. Pattern Grammar. A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.

MCENERY, T. & A. WILSON 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.

MCKEE, G., D. MALVERN & B.J. RICHARDS 2000. Measuring Vocabulary Diversity Using Dedicated Software. Literary and Linguistic Computing 15 (3): 323-38.

OAKES, M. 1998. Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.

SÁNCHEZ, A. & P. CANTOS 1997. Predictability of Word Forms (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish. International Journal of Corpus Linguistics 2(2): 259-280.

______ 1998. El ritmo incremental de palabras nuevas en los repertorios de textos. Estudio experimental y comparativo basado en dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las lenguas inglesa y española y en cinco autores de ambas lenguas. ATLANTIS XIX(2): 205-223.

SÁNCHEZ, A., R. SARMIENTO, P. CANTOS & J. SIMÓN 1995. CUMBRE corpus lingüístico del español contemporáneo. Fundamentos, metodología y aplicaciones. Madrid: SGEL.

SCOTT, M. 1996. Wordsmith Tools. Oxford: Oxford University Press.

SINCLAIR, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

SINCLAIR, J. et al. 1995. Collins Cobuild English Dictionary. London: Harper Collins Publishers.

STUBBS, M. 1995. Collocations and Semantic Profiles: On the Cause of the Trouble with Quantitative Methods. Functions of Language 2(1): 1-33.

WOODS, A., P. FLETCHER & A. HUGHES 1986. Statistics in Language Studies. Cambridge: CUP.

YANG, D.H., P. CANTOS & M. SONG 2000. An Algorithm for Predicting the Relationship between Lemmas and Corpus Size. International Journal of the Electronics and Telecommunications Research Institute (ETRI) 22: 20-31.

YANG, D.H., M. SONG, P. CANTOS & S.J. LIM (forthcoming) On the Corpus Size Needed for Compiling a Comprehensive Computational Lexicon by Automatic Lexical Acquisition. Computers in the Humanities.

ZIPF, G. 1935. The Psycho-Biology of Language. Boston: Houghton Mifflin.

 

 

Received in December 2001.

 

 

1 Hatzivassiloglou, V. (1994) "Do we need Linguistics when we have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to Statistical Word Grouping System". The Balancing Act. Combining Symbolic and Statistical Approaches to Language. Ed. J.L. Klavans and P. Resnik. Cambridge: The MIT Press. 67-94.
2 For a better understanding of the terms tokens, types and lemmas, consider the following word sequence: sings, singing, sang, sing, sings, sung, singing, sung and sang, where we have nine words or tokens, five different word forms or types (sing, sings, singing, sang and sung) and a single base form or lemma, namely sing.
3 For an exhaustive typology of frequency lists see, among others, Cantos (1995).
4 The type-token ratio is the quotient obtained when dividing the total amount of types by the token total. See also footnote 23.
5 See Cantos (2000) for a detailed discussion on the limitations and drawbacks of the type-token ratio; see also McKee, Richards and Malvern (2000), Chipere et al. (2001) and Baayen (2001:4-5). Also of interest is Scott's notion of the "standardised type-token ratio" (1996).
6 Taken from the Collins Cobuild English Dictionary (Sinclair et al. 1995).
7 In a theoretical normal distribution, the mean (the sum of all scores divided by the total number of scores), the median (the middle point or central score of the distribution) and the mode (the value that occurs most frequently in a given set of scores) all fall at the same point: the centre or middle (mean = median = mode). Additionally, if we plot the data, we get a symmetric bell-shaped graph.
8 The probability that rejecting the null hypothesis (whenever the difference is not significant) will be an error.
9 Note that squaring the difference ensures positive numbers.
10 This is a technical term from mathematics which we shall not attempt to explain here. For some non-technical and easily accessible explanations see Woods, Fletcher and Hughes (1986: 138-9) and/or Brown (1988: 118-9).
11 The chi-square distribution tables can be found in the appendices of most statistics books or manuals; see, for instance, Oakes (1998: 266).
12 See Appendix 3.
13 MonoConc v. 1.5. (Athelstan Publications; http://www.athel.com/mono.html#mono).
14 There are important differences between the information provided by these three measures: more, perhaps, between t-score and the other two than between z-score and mutual information themselves. It is difficult, if not impossible, to select one measure that provides the best possible assessment of collocations, although there has been ample discussion of their relative merits (see, for example, Church et al. 1991; Clear 1993; or Stubbs 1995).
15 The standard deviation provides a sort of average of the differences of all scores from the mean.
16 See, for instance, Hunston and Francis's book on a corpus-driven approach to a lexical grammar of English (1999), Cantos (2001) and Cantos and Sánchez's forthcoming article on lexical hierarchical constructions of collocations.
17 Readers unfamiliar with statistics might find some parts of this next section difficult to grasp.
18 If a relation R, whenever it holds between both x and y and between y and z, also holds between x and z, the relation is said to be transitive: $\forall x\, \forall y\, \forall z\, ((R(x,y) \land R(y,z)) \rightarrow R(x,z))$ (Allwood et al. 1977: 89).
19 For an ample discussion on the predictive power of the Type-Token formula, see Yang, Cantos and Song (2000), and Yang, Song, Cantos and Lim (forthcoming).
20 The K-value is a type-token constant, whereas the KL-value is a lemma-token constant.
21 The calculations are based on a single small 250,000 token sample taken randomly from the CUMBRE Corpus, a corpus of contemporary Spanish (for more details see Sánchez et al. 1995).
22 Succinctly, what is meant by lexical diversity or lexical density is vocabulary richness.
23 Both the token-type ratio and the type-token ratio provide information on the distribution of tokens between the types in a text. The token-type ratio is the mean frequency of each type in a text, whereas the type-token ratio reveals the mean distribution of types in a text or corpus (if we multiply this quotient by 100, we get the mean percentage of different types per one hundred words).
24 The property of intervals is concerned with the relationship of differences between objects. If a measurement system possesses the property of intervals, it means that the unit of measurement is the same throughout the scale of numbers. That is, a centimetre is a centimetre, no matter where it is measured.
25 The dissimilarity matrix, the dendrogram and the discriminant function analysis have been calculated and produced using SPSS v. 10, a statistics package for the social sciences.
26 1 = Architecture; 2 = Chemistry; 3 = Computing; 4 = Geography; 5 = History; 6 = Mathematics; 7 = Medicine; 8 = Natural Sciences; 9 = Philosophy; 10 = Physics; and 11 = Sociology.
27 It refers to the automatic text-classification performance of the model, based on a single lexical density measure: K-value.

 

 

 

 

 

All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.