
## DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada

*Print version* ISSN 0102-4450 · *On-line version* ISSN 1678-460X

### DELTA vol.18 no.2 São Paulo 2002

#### http://dx.doi.org/10.1590/S0102-44502002000200003

**Do we need statistics when we have linguistics?**


**Pascual Cantos Gómez**

Universidad de Murcia

**ABSTRACT**

Statistics is known to be a quantitative approach to research. However, most of the research done in the fields of language and linguistics is of a different kind, namely qualitative. Succinctly, qualitative analysis differs from quantitative analysis in that in the former no attempt is made to assign frequencies, percentages and the like to the linguistic features found or identified in the data. In quantitative research, linguistic features are classified and counted, and even more complex statistical models are constructed in order to explain these observed facts. In qualitative research, however, we use the data only for identifying and describing features of language usage and for providing real occurrences/examples of particular phenomena. In this paper, we shall try to show how quantitative methods and statistical techniques can supplement qualitative analyses of language. We shall attempt to present some mathematical and statistical properties of natural languages, and introduce some of the quantitative methods which are of the most value in working empirically with texts and corpora, illustrating the various issues with numerous examples and moving from the most basic descriptive techniques (frequency counts and percentages) to decision-making techniques (chi-square and z-score) and to more sophisticated statistical language models (Type-Token/Lemma-Token/Lemma-Type formulae, cluster analysis and discriminant function analysis).

**Keywords:** Quantitative analysis; Statistics; Language modelling; Linguistic corpora.


**1. Introduction**

The title itself is the reverse of Hatzivassiloglou's^{1}. As a statistician, he discussed whether linguistic knowledge could be of any help and contribute to a statistical word-grouping system. Our aim here is the opposite: to try to illustrate with numerous examples how quantitative methods can most fruitfully contribute to linguistic analysis and research. In addition, we do not intend here to offer an exhaustive presentation of all statistical techniques available to linguistics, but to demonstrate the contribution that statistics can and should make to linguistic studies.

Among the linguistic community, statistical methods, or more generally quantitative techniques, are mostly ignored or avoided because of lack of training, and out of fear and dislike too. The reasons are usually that (1) these techniques are simply not related to linguistics, philology or the humanities, statistics falling within the province of the sciences, mathematics and the like; and/or (2) there is a feeling that these methods may destroy the "magic" of literary text.

George Zipf (1935) was one of the first linguists to prove the existence of statistical regularities in language. His best-known law proposes a constant relationship between the rank of a word in a frequency list and the frequency with which it is used in a text. To illustrate this, consider the 30th, 40th, 50th, 60th and 70th most frequently occurring words taken from a sample of the *Corpus Collection B* (published by Oxford University Press): all the values (*constants*) come out at around 20,000 (see *Table 1*). This is because the relationship between *rank* and *frequency* is inversely proportional. In addition, Zipf thought that these *constants* are obtained regardless of subject matter, author or any other linguistic variable.
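Zipf's claim is easy to check on any ranked frequency list: multiply each word's rank by its frequency and see whether the product stays roughly constant. A minimal sketch, with invented counts (the actual figures of *Table 1* are not reproduced here):

```python
# Hypothetical frequency list, sorted by descending frequency; the
# counts are invented so that frequency is inversely proportional to rank.
freqs = [("the", 60000), ("of", 30000), ("and", 20000),
         ("to", 15000), ("a", 12000)]

# Zipf: rank * frequency should come out at roughly the same constant.
products = [rank * freq for rank, (word, freq) in enumerate(freqs, start=1)]
print(products)
```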

Similarly, another of Zipf's laws showed the inverse relationship between the length of a word and its frequency. In some languages, such as English, the most commonly used words are monosyllabic. This effect seems to account for our tendency to abbreviate words whenever their frequency of use increases, e.g. the reduction of 'television' to 'TV' or 'telly'. It would also seem to be an efficient communication principle to keep the popular words short and the rare words long.

These examples show how some linguistic patterns are regular and independent of speaker, writer, or subject matter, and how linguistic behaviour conforms closely to some expectations: quantitative or statistical patterns. In the next section, we shall try to exploit the most basic descriptive data: *frequencies*, and illustrate some potential applications to linguistic research.

**2. Frequency counts**

A preliminary survey of a text or linguistic corpus can be made by producing a frequency list of its items (tokens, types or lemmas^{2}). At its simplest, the frequency list shows the types that make up the text(s) or corpus, together with their number of occurrences. It can be produced in several different sequences^{3}. Despite its simplicity, it is a very powerful tool. For example, frequency lists of huge corpora enable lexicographers to make important decisions on which words, and which particular meanings, a dictionary should include. Similarly, authors of L2 materials might use these data to decide which words, phrases, expressions or idioms are most relevant in teaching an L2. This evidence of usage is without any doubt a unique and most important source for any enterprise in language description.
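Producing such a list is a few lines of code in most scripting languages; a minimal sketch in Python (the sample sentence is invented):

```python
from collections import Counter
import re

# Naive tokenization: lower-case the text and pull out word characters.
text = "The cat sat on the mat and the cat sat on the dog."
tokens = re.findall(r"[a-z']+", text.lower())

# The frequency list: each type together with its number of occurrences.
freq_list = Counter(tokens).most_common()
print(len(tokens), "tokens;", len(freq_list), "types")
print(freq_list[0])  # the most frequent type
```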

It is not just general language, but also sublanguages, that is, specific varieties of language used in certain communicative domains, such as business, medicine, sports, etc., or the study of genres or specific authors, that can profit from frequency-list analysis. Such analysis can, for instance, shed new light on questions of stylistic variation in writers as diverse as Henry James and George Orwell, to name but two. To illustrate this, let us compare two sublanguages: arts (literary criticism) and science (biology). The table below (*Table 2*) summarises the output of a lexicon extraction program, showing the size of the lexicons produced for each corpus and giving the *type-token ratio*^{4} for each.

The *type-token ratio* is an extremely valuable indicator here^{5}. It shows that, although the two language samples differ in size (the science sample has 35,861 tokens more; it is therefore 16.75% larger), we find, on average, almost ten new items for every one hundred words (tokens) in the arts sublanguage (*type-token ratio* = 0.0989 * 100 = 9.89 ≈ 10), whereas its counterpart offers fewer than seven (0.0676 * 100 = 6.76). Furthermore, on the basis of this evidence, it seems that the sublanguage of science reaches a higher degree of lexical closure than its arts counterpart.

This preliminary approach shows that the sublanguage of arts is, on the whole, richer and more varied in its use of vocabulary. It resorts to a wider range of word forms (types), and its lexical closure is also more difficult to establish. In contrast, the science sublanguage seems less varied lexically and so, in lexical terms, it would appear, on the basis of the evidence presented, to tend very strongly towards premature closure, whereas the arts sublanguage does not.
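The ratio itself is simply types divided by tokens. In the sketch below the arts token count (178,143, given later in the paper) is combined with a type count back-calculated from the reported ratio, so the type figure is an approximation, not a value from *Table 2*:

```python
def type_token_ratio(n_types: int, n_tokens: int) -> float:
    """Different word forms (types) per running word (token)."""
    return n_types / n_tokens

# 178,143 tokens is the arts-corpus size given in the paper; the type
# count 17,618 is back-calculated from the reported ratio of 0.0989.
ttr = type_token_ratio(17_618, 178_143)
print(round(ttr * 100, 2), "new types per 100 tokens")
```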

Another interesting finding of frequency lists relates to lexical selection, or determining, by means of the evidence of usage, which are the most frequent or relevant items of a particular sublanguage, author, etc. Regarding our two sublanguages (science and arts) we obtain the following top 10 items (*Table 3*).

The main problem with this information is that the use of raw frequencies highlights the very common words, such as *the*, *of*, *in*, etc., despite the fact that their comparatively high frequencies of occurrence are unlikely to provide conclusive evidence of vocabulary specific to either sublanguage. These are words that, on the basis of frequency of occurrence alone, would be found within most sublanguages, and the list can perhaps be read more usefully if the purely grammatical words (closed-class items) are discarded. This leaves us with (*Table 4*):

The specific vocabularies of both sublanguages become immediately apparent, and we note striking differences. Interestingly, however, the two sublanguages coincide in the highly frequent use of *time* (360 in arts and 526 in science).

A further type of study, using raw frequency lists, could be establishing lexical items exclusively used in each sublanguage and those used in both sublanguages (*Tables 5 *and* 6*).

From a lexicographical and semantic point of view it could be interesting to investigate the shared items (see the case of *time* above). An inevitable starting hypothesis is that the same words used in different contexts are likely to bear different meanings. As an example we can examine the use of the singular noun *accident*. This type occurs four times in each corpus. Here are the full sentences of their occurrences and their distribution according to sublanguage and meaning (*Table 7*).

This very brief but revealing lexical analysis confirms our initial hypothesis that the same words used in different contexts are likely to carry different meanings. We see how the use of *accident* in scientific communication is restricted to a single meaning (*3. If something happens by accident, it happens completely by chance*), compared with the other two meanings of the noun occurring in the arts corpus (*1. An accident happens when a vehicle hits a person…causing injury or damage*; *2. If someone has an accident, something unpleasant happens …causing injury or death*).

The merits of this apparently simple technique, frequency listing, do not end here. It is, in our opinion, a potentially inexhaustible resource and an excellent starting point for descriptive linguistic research, as it sometimes turns the invisible into the visible. The close observation of a frequency list may be the first step towards the formulation of a hypothesis.

**3. Significance testing**

As mentioned earlier, frequency lists of word forms are never more than a set of hints or clues to the nature of a text. By examining a list, one can get an idea of what further information would be worth acquiring before starting an investigation.

Returning to our arts and science sample corpora, let us suppose now that we are interested in examining how two modal verbs, *can* and *could*, are distributed in both sublanguages, and in comparing their usage. The first thing to do is to make a simple count of each of these modals in the two corpora. Having done this, we arrive at the following frequencies (*Table 8*):

A quick look at these figures reveals that *can* and *could* are more frequently used in scientific communication than in arts. But with what degree of certainty can we infer that this is a genuine finding about the two corpora rather than a result of chance? We cannot decide just by looking at these figures; we need to perform a further calculation: a test of *statistical significance*, to determine how high or low the probability is that the difference between the two corpora on these features is due to chance.

*3.1. Chi-squared test*

Among the various significance tests available to linguists we find the *chi-squared test*, the *t-test*, *Wilcoxon's rank sum test*, etc. The *chi-squared test* is probably the most commonly used one in corpus linguistics, as it has numerous advantages for linguistic purposes (McEnery and Wilson 1996: 70): (a) it is more accurate than, for example, the *t-test*; (b) it does not assume that the data are *normally distributed*^{7} (non-normal distributions are quite frequent with linguistic data); (c) it is easy to calculate, even without a computer statistics package; and (d) disparities in corpus size are unimportant.

The main disadvantage of *chi-square* is probably that it is unreliable with very small frequencies (less than 5). Succinctly, *chi-square* compares the actual observed frequencies in the texts or corpora with the frequencies that we would expect if the only factor operating had been chance. The closer the expected frequencies are to the observed frequencies, the more likely it is that the observed frequencies are a result of chance. However, if the difference between the observed and expected frequencies is greater, then it is more likely that the observed frequencies are being influenced by something other than chance. In our example, a significant difference between the observed and expected frequencies of *can* and *could* would mean a true difference in the grammar or style of the two domain languages: arts and science.

The first step is to determine the significance level or threshold of tolerance for error^{8}. In linguistics, it is common practice to set the probability-of-error threshold at 1 in 20, or *p < 0.05*. Remember that *chi-square* compares what actually happened to what hypothetically would have happened if all other things were equal. The first thing to do is to calculate the column and row totals, giving (*Table 9*):

Next, the expected frequencies are calculated. This is done by multiplying the cell's row total by the cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the modal verb *can* in the arts corpus, we multiply its cell row total (1043) by its cell column total (561) and divide that product by the sum total (1646):
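The equation itself is not reproduced in this version; reconstructed from the description, it reads:

```latex
E_{can,\,arts} = \frac{1043 \times 561}{1646} \approx 355.48
```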

All the calculations of the expected frequencies of each cell are shown below (*Table 10*):

Now, we need to measure the size of the difference between the pair of observed and expected frequencies in each cell. This is done with the formula^{9}:
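The formula, missing from this version, is presumably the standard per-cell chi-square term:

```latex
\frac{(O - E)^2}{E}
```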

Where *O* = observed frequency and *E* = expected frequency. So, for instance, the difference measure for *can* (in the arts corpus) is:

Next we calculate the difference measure for all cases (*can* in the art corpus, *can* in the science corpus, *could* in the art corpus and *could* in the science corpus), and add all these measures up. The value of *chi-square* is the sum of all these calculated values. Thus, the formula for *chi-square* is as follows:
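Reconstructed, the formula is the familiar sum over all cells:

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
```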

For our data above, this results in a total *chi-square* value of 95.35. Having done this, it is necessary to consult a set of statistical tables to see how significant our *chi-square* value is. To do this one first requires a further value: the number of *degrees of freedom* (*df*)^{10}. This is very simple to work out:

*df = (number of columns in the table - 1) * (number of rows in the table - 1).*

Which is for our case:

*df* = (2 – 1) * (2 – 1) = 1

We now look in the table of *chi-square* values^{11} in the row for the relevant number of degrees of freedom (1 *df*.) and the appropriate column of significance level (*0.05* in linguistics). Returning to our example, we have a *chi*-value of 95.35 with *df* = 1, so according to the distribution table, we would need our *chi*-value to be equal to or greater than 3.84 (see *Table 11*), which is true. This means that the difference found between the two sublanguages regarding the use of *can* and *could* is statistically significant at *p < 0.05*, and we can therefore, with quite a high degree of certainty, say that this difference is not due to chance, but due to a true reflection of variation in the two sublanguages.
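The whole procedure fits in a few lines of code. The cell counts below are hypothetical (this excerpt gives only the totals of *Table 9*), chosen to be consistent with the row and column totals and with the reported chi-square value of roughly 95.35:

```python
# 2x2 contingency table: rows are can/could, columns are arts/science.
# Hypothetical cell counts consistent with the totals given in the text.
observed = [[265, 778],    # can
            [296, 307]]    # could

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected frequency: row total * column total / grand total.
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), "with", df, "degree(s) of freedom")
# chi2 far exceeds the critical value of 3.84 at p < 0.05, df = 1
```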

*3.2. Z-score*

Up to this point we have been dealing with single items without paying attention to the co-text, that is, the words on either side of the item under investigation. These co-occurring words are known as collocates, and the phenomenon as collocation, which has been defined as "the occurrence of two or more words within a short space of each other in a text" (Sinclair 1991: 170).

Concordance lines^{12} hold the primary information needed for collocation analysis. Approaches for the extraction of collocations may range from simple frequency information for words that occur near to the keyword, to the application of sophisticated statistical techniques, which calculate the figures needed for the comparison, and use them to produce measures of significance.

The most basic and naïve form of collocation analysis provided by concordance packages is to produce a frequency list of all the words occurring within predetermined proximity limits (*span*). As an example, we extracted the occurrences of the lemma KEEP (*keep, keep'em, keeping, keeps, kept*) in the arts corpus by means of the concordance program *Monoconc*^{13}, using a fixed span of six words on either side of the keyword (see *Appendix 1*). However, as already mentioned, the major drawback of the data obtained is that the use of raw frequencies highlights the very common words, despite the fact that they are unlikely to provide conclusive evidence of significant collocation patterns. So we discarded the purely grammatical words (closed-class items) and also deleted low-frequency words, leaving just those that co-occur at least three times. This leaves us with just:

going | 5 |

hands | 4 |

feet | 3 |

house | 3 |

pony | 3 |

well | 3 |

All of these could form fairly strong patterns with KEEP and would be worth investigating further.
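The windowing behind such a list is straightforward to implement; a minimal sketch (the sample text and the function name are our own):

```python
from collections import Counter

# Inflected forms of the lemma KEEP, as listed in the text.
KEEP_FORMS = {"keep", "keep'em", "keeping", "keeps", "kept"}

def collocates(tokens, node_forms, span=6):
    """Count words occurring within `span` tokens of any node form."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

sample = "trying to keep a conversation going while cooking a meal".split()
print(collocates(sample, KEEP_FORMS).most_common(3))
```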

This approach, though useful, is very simple. It just offers some quantitative data and is only the starting point for calculating their significance. The calculation of the data needed for collocation analysis is not very complicated, although several alternative methods are available. The starting point for any collocation analysis is a set of concordance lines for the words under investigation, long enough to contain the required span of words. The first decision to make is to choose an appropriate length of the span. Let us consider the following concordance line:

answering | their | questions | and | trying | to | keep | a | conversation | going | while | cooking | a |

-6 | -5 | -4 | -3 | -2 | -1 | node | +1 | +2 | +3 | +4 | +5 | +6 |

The word under investigation (KEEP) is referred to as the *node* and is used to generate the concordance lines. The words around the node are numbered according to their position relative to the node: words to the left of the node are numbered negatively, those to the right positively. So the span here is of twelve words, six on either side. This set of concordance lines offers the basis for any further significance technique. All three major collocation significance tests, namely *z-score*, *t-score* and *mutual information*^{14}, rely on actual or observed frequency and expected frequency. So, once the concordance lines have been obtained, we need to establish the actual frequency of the words within the span. In other words, we produce a frequency list of the concordance lines, similar to the one already discussed, and may, as before, eliminate closed-class items. Suppose this is what we get:

going | 5 |

hands | 4 |

feet | 3 |

house | 3 |

pony | 3 |

well | 3 |

At this point we need to calculate the expected frequency figures that will be compared to the actual frequencies to assess their significance. The calculation of the expected frequency for words occurring in the span is straightforward. First, we need a theoretical language model (i.e. a representative language sample or corpus of the language or domain that we want to investigate). This model will help us to predict how these words would be distributed if there were no particular pattern of collocation between them and the node. In other words, if we want to check whether the node (KEEP) is exercising some influence over the distribution of *going*, *hands*, *feet*, *house*, *pony* and *well*, we need to know how we would expect these words to behave in the absence of that influence.

Let us return to our frequency list of words occurring with KEEP within the span limit of 12, where the verb form *going* occurs 5 times (observed frequency). This means that *going* appears 5 times in the proximity of KEEP, in a text sample totalling:

12 * 5 = 60 tokens (words)

On the other hand, its overall frequency in the entire corpus (the arts subcorpus of *Corpus Collection B*, which consists of 178,143 tokens altogether) is 89. If *going* were randomly distributed throughout the text, then its expected frequency in any 60-token text sample would be:
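The omitted calculation can be reconstructed as:

```latex
E = \frac{89}{178{,}143} \times 60 \approx 0.03
```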

That is, the expected frequency of *going* in any randomly selected sample of 60 tokens should be just 0.03, compared with its 5 observed occurrences in the set of lines of KEEP. Though the difference is huge, we cannot decide anything yet just from these figures; we need to perform a further calculation: a test of *statistical significance*, to determine how high or low the probability is that the difference between the observed and expected frequency is due to chance.

The *z-score* is probably the most familiar of the statistical significance measures used for collocation analysis. The calculation is reasonably easy:
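The formula itself (a reconstruction; the original is not reproduced in this version) is:

```latex
z = \frac{O - E}{sd}
```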

Where *O* = observed frequency, *E* = expected frequency and *sd* = standard deviation^{15}. *O* is straightforward, *E* needs to be calculated, as explained previously, and *sd* uses the formula:
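Reconstructed from the definitions that follow, the standard-deviation formula is:

```latex
sd = \sqrt{N \times p \times (1 - p)}
```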

And where *p* = probability of occurrence of the co-occurring word in the whole text, and *N* = number of tokens in the set of concordance lines. Thus, the probability of *going* occurring in the whole text is:
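Reconstructed from the corpus figures above:

```latex
p = \frac{89}{178{,}143} \approx 0.0005
```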

And *N* (number of tokens in the truncated concordance lines) is:

*N* = *number of concordance lines* × *span*

*N = *5 * 12 = 60

So, the *sd* can now be calculated:
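Substituting the values above (a reconstruction of the omitted calculation):

```latex
sd = \sqrt{60 \times 0.0005 \times (1 - 0.0005)} \approx 0.173
```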

and its *z-score* gives:
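which, reconstructed, is:

```latex
z = \frac{5 - 0.03}{0.173} \approx 28.7
```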

A useful cut-off measure for significance in this type of test is around 3 (Barnbrook 1996: 96). This leads us to conclude that the occurrence of *going* within the co-text of the lemma KEEP is not due to chance, but due to some kind of lexical 'attraction' (see *Appendix 2*).
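The whole z-score calculation for *going*, using the figures given in the text, takes only a few lines:

```python
import math

O = 5                    # observed occurrences of "going" near KEEP
N = 60                   # tokens in the truncated concordance lines
corpus_freq = 89         # occurrences of "going" in the arts corpus
corpus_size = 178_143    # tokens in the arts corpus

p = corpus_freq / corpus_size    # probability of "going" in the corpus
E = N * p                        # expected occurrences in 60 tokens
sd = math.sqrt(N * p * (1 - p))  # standard deviation
z = (O - E) / sd

print(round(z, 1))  # far above the cut-off of about 3
```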

**4. Putting together quantitative and qualitative approaches**

An important advantage of collocation analysis is that it allows us to focus our attention on specific aspects of the contexts of words already selected for investigation through concordance lines. Collocations can help organize the context into major patterns, and we can use our knowledge of those patterns to assess the general behaviour patterns of the language or the uses being made of particular words within the text^{16}. This can make it much easier to identify phrases and idioms, to differentiate among different meanings of a single word form or to determine the range of syntactic features.

In the examination of the results above, the most significant collocations for KEEP were *going, hands, feet, house, pony* and *well*. However, for practical reasons, we deliberately discarded all grammatical words and low-frequency items. For the following analysis, we included closed-class items, focusing particularly on prepositions, since these are likely to be relevant when dealing with verbs (think of prepositional or phrasal verbs, idioms, and the like). Collocation frequency data can be very useful in this respect. The following table (*Table 12*) shows the data for the verb lemma KEEP (with a frequency threshold of 3; see *Appendix 3* for concordance lines):

Given these data, a first approach could be to group the words in column *1-Right* (words that immediately follow KEEP) according to their part of speech; we get:

• Determiners: *the, a* and *his* (there is no instance where *his* is a possessive pronoun, see *Appendix 3*)

• Prepositions: *up, in* and *to*

• Pronouns: *him*

The right-hand-side association power of KEEP can now be typified as:

• KEEP + *Preposition (up, in, to)*

• KEEP + *Pronoun (him)*

• KEEP + *Determiner (the, a, his)*

A quick glance at the *z-scores* for KEEP (see *Appendix 2*) reveals that the probability for *in* to co-occur with KEEP is quite low, compared with *to* and *up*. The latter two are statistically very significant, particularly *up*. It is difficult to make assumptions here, due to the small sample analysed, but the *z-scores* point to one hypothesis: KEEP + *up* and KEEP + *to* may form lexical units (prepositional verbs or phrasal verbs), such as in:

The jargon they **kept** up was delicious for me to hear.

Commentary on form is **kept** to a minimum and is almost entirely superficial.

However, *in* seems very unlikely to be part of the verb and is probably part of the prepositional phrase (PP) that follows the verb (see low *z-scores*; *Appendix 2*), as in:

Hoomey knew it was a suggestion for sugarlumps, which he **kept** in his pocket.

Regarding determiners, these do not occur alone; they precede or *determine* a noun or noun head. We can go further and say that KEEP can take on its right-hand side noun phrases (NPs) of the type:

• NP → *Pr*; (*You* **keep** *him here, and say your prayers, and all will be well*)

• NP → *Det (Adj) N*; (*My uncle will go on* **keeping** *the horses if we want them. - He* **keeps** *a sleeping bag up there, stuffed behind the old ventilator pipes, and he sleeps in with her. - He was* **keeping** *his feet well out of the way, following where his horse ate*)

The above is true, as KEEP is a transitive verb. Consequently, we could with some degree of certainty say that the high co-occurrence frequency of pronouns and determiners with the verb KEEP is not due to the configuration of any particular phrase but due to the transitivity of KEEP.

Regarding its association with prepositions, we have three prepositions which are directly attached to the verb (*up, in, to*), and three others which occur within a word distance of two (*2-Right*: *with, of, to*). A first hypothesis could be that the co-occurrence of KEEP + *Preposition* (*up, in* or *to*) attracts other prepositions. If we look at the concordance list, this is only true for *up*, which attracts *with* on four out of six occasions, as in:

To **keep** up with this pace, that is, just to carry out the work that...

This results in the following syntactic frames for KEEP + *Preposition*:

• KEEP + *up*

• KEEP + *up* + *with*

• KEEP + *to*

• KEEP + *in*

The first three are very likely to form phrasal or prepositional verbs, as already discussed, but not *in*, which is part of what follows, a PP in this case. In addition, KEEP + *up* + *with* might be a phrasal-prepositional verb.

The three non-directly attached prepositions (*with, of, to*) have different syntactic patterns from the directly attached ones (*up, to, in*). *With*, *of* and *to* allow another constituent to be placed between the verb KEEP and the preposition itself; see for instance:

... but where one **keeps** faith with it by negation and suffering...

One Jew with a pencil stuck behind his ear **kept** gesticulating with his hands and…

... that for so long helped to **keep** the Jews of Eastern Europe in medieval ghettoes

The allowed 'intruder' is either a present participle or an NP:

• KEEP + *NP / Present Participle + with*

• KEEP + *NP + of*

• KEEP + *NP + to*

An interesting syntactic difference among these non-directly attached prepositions is that in the first two instances (KEEP + *NP/Present Participle + with* and KEEP + *NP + of*), the preposition is part of a PP, namely the PP that complements the preceding NP or present participle, whereas in KEEP + *NP + to* the verb seems to form a kind of discontinuous phrase, and the preposition might therefore be thought to be part of the verb:

…it was insisted that they **keep** themselves to themselves...

We also find the determiner *a* in position *2-Right*, which indicates that whenever KEEP has the pattern:

• KEEP + ... + a...

The determiner introduces an NP, and this might be some hint of transitivity. The transitive instances with this pattern we have detected are the two phrasal or prepositional verbs:

• KEEP + *up* + *a*: *…* **keeping** *up a house*

• KEEP + *to* + *a*: *Commentary on form is* **kept** *to a minimum...*

However, in the case of KEEP + *in*, the NP that follows is not a direct object but part of an adverbial-PP headed by the preposition *in*:

... which he **kept** in his pocket

Finally, the exploration of the right-hand-side association of KEEP takes us to *hands*. Its high *z-score* (40.51) gives some evidence of a kind of idiomatic expression:

• KEEP + *Det(Poss)* + *hands*: *... and* **keeping** *his hands away from the ever-questing...*

Let us now analyse the left-hand-side association. This side can be interpreted straightforwardly and indicates clearly that KEEP can only be a verb. The word forms preceding KEEP are:

• The infinitive particle *to*: *You'd be able to* **keep** *order, sir.*

• The pronouns *I* and *he* (subjects of the verb): *I* **kept** *the past alive out of a desire for revenge.*

• The modal verbs *would* and *could*, both of which require a non-finite verb form to their right: *Some would* **keep** *me trotting round the parish all day.*

• The preposition *of*, which requires a non-finite verb form, in instances such as: *… the habit of* **keeping** *themselves to themselves was not easily lost.*

Note that this has by no means been an exhaustive analysis of KEEP's collocation patterns and phrases, just an illustration of the potential of using quantitative data in combination with qualitative linguistic data. In addition, for the sake of simplicity and clarity, we have deliberately reduced: (a) the collocation data (*freq* ≥ 3) and (b) the concordance sets. As already noted, the identification of phrases goes far beyond simple syntactic-frame identification, and it may lead us to important syntactic, lexical and semantic discoveries about the behaviour of words, collocates and phrases: *the sociology of words*.

In the following final section, we shall try to introduce the reader to a more challenging side of statistics: language modelling^{17}.

**5. Statistical language modelling**

The term *model* as used here is understood to represent a simplified description of a natural language property.

*5.1. Modelling the lexicon*

Sánchez and Cantos (1997; 1998) use the term *model* to represent mathematically the transitive relationship^{18} between types, tokens and lemmas. This relationship has led to interesting findings. While tokens grow linearly, types and lemmas do so in a curvilinear fashion^{19} (*Figure 1*). After a thorough analysis, Sánchez and Cantos (1997) came up with three statements of prediction regarding tokens, types and lemmas. That is, the number of types and lemmas a text or corpus is made of is related to the total number of tokens. The first of these statements is the so-called *Type-Token formula*:

This formula states that the total number of types can be modelled and predicted given the total tokens of a text. *K* is a text-dependent constant and needs to be calculated beforehand from a small sample of the text or corpus under research^{20}. In other words, the formula above models the relationship between types and tokens. This mathematical model enables researchers to estimate reliably the hypothetical number of types a corpus may contain, even before compiling the corpus. In much the same way, we can also calculate the lemmas:

And also infer the lemmas from a type sample:

The table below shows the predictions of types and lemmas for a hypothetical corpus of contemporary Spanish^{21} (*Table 13*)*.*

The analytic technique for predicting types proposed and applied by Sánchez and Cantos (1997) is simple and straightforward, and the resulting formulae are easy to use, flexible and quickly applicable to any corpora or language samples. After thorough testing on various text samples of different sizes, the formulae have proved to be very reliable, with a more than acceptable error margin of ±5%, which speaks eloquently of their validity. Their most positive contributions can be summarised in the following points: (a) they are stable indicators of lexical diversity and lexical density^{22}; (b) they overcome the reliability flaw of both the *token-type ratio* and the *type-token ratio*^{23}, as they are not constrained by or dependent on text length; and (c) they can be used as predictive tools to account for the total number of types and lemmas that any hypothetical corpus might contain (see Sánchez and Cantos 1998).
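As a rough sketch of how such a formula can be put to work, suppose a square-root (Heaps-type) relationship of the form Types ≈ K·√Tokens; this functional form is our assumption for illustration only (the exact formula is the one given above), and the sample figures are hypothetical. *K* is first estimated from a small sample, then used to predict the type count of a larger corpus:

```python
import math

def estimate_k(sample_types: int, sample_tokens: int) -> float:
    """Estimate the text-dependent constant K from a small sample,
    assuming (for illustration) types = K * sqrt(tokens)."""
    return sample_types / math.sqrt(sample_tokens)

def predict_types(k: float, tokens: int) -> float:
    """Predict the number of types a corpus of the given size may contain."""
    return k * math.sqrt(tokens)

# Hypothetical sample: 9,000 types observed in a 250,000-token sample.
k = estimate_k(9_000, 250_000)                 # K = 18.0
print(round(predict_types(k, 1_000_000)))      # → 18000 types for a 1M-token corpus
```

Once *K* has been fixed from the pilot sample, the prediction needs nothing but the target token count, which is what makes the technique usable before a corpus is compiled.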

A further revealing finding is that applying the formulae above to different text samples outputs idiosyncratic, unique and distinctive slopes. The contrastive graph below (*Figure 2*) clearly shows, for example, that Conrad's lexical density is superior to Doyle's and Shakespeare's. This evidence suggests that these formulae might also be valid for text, author and language classification, among other tasks.

*5.2. Modelling text-classification*

For the following experiment, we (a) extracted (from the *CUMBRE Corpus*) 11 different text samples from textbooks and manuals for secondary education and university level education, relative to various subjects or linguistic domains, (b) obtained their total amounts of tokens and types, and (c) calculated their *K-values* (see above). The results are shown below in *Table 14.*

The mean *K-value* for the 11 samples is 27.29 and its standard deviation 9.43. Comparing these figures with the individual *K-values* from the table above (*Table 14*) reveals a great deal of variability or dispersion among the various text samples. The sample on *physics* compared with *sociology* indicates huge differences in lexical density, not to mention *mathematics* versus *architecture*. However, *geography* and *history* seem to have a very similar lexical density. The histogram above (*Figure 3*) graphically displays the various text types ordered according to their lexical densities (*K-values*). It is interesting that this lexical-density-ordered scale moves smoothly from pure science subjects (*mathematics, computing, chemistry*, etc.) to texts with arts and humanities content. Additionally, neighbourhood on the histogram might suggest subject relatedness: the more dissimilar the lexical density indices (*K-values*), the less the subjects relate to each other.

The *K-values* suggest that discrimination between *chemistry* (18.46) and *sociology* (42.03) texts might indeed be possible, as both figures diverge significantly. However, a *K-value*-based distinction between *chemistry* (18.46) and *physics* (19.27) seems less reliable, due to their closeness.

To construct a purely statistical discrimination model, we started experimenting with a statistical technique known as *cluster analysis*. Succinctly, *cluster analysis* classifies a set of observations into two or more mutually exclusive groups based on the combination of interval variables^{24}. The purpose of *cluster analysis* is to discover a system of organizing observations into groups, where members of the group share properties. *Cluster analysis* classifies unknown groups while *discriminant function analysis* (see below) classifies known groups. A common approach to performing a *cluster analysis* is to first create a table or matrix of relative similarities or differences between all objects and second to use this information to combine objects into groups. The table of relative similarities is called a proximity or *dissimilarity matrix*. *Table 15* displays this *dissimilarity matrix*^{25}.

Looking at the matrix, we find that the least dissimilarity or closest similarity of all is 0.18, between the *history* text sample and the *geography* one. We could say that these seem to form the pair that is most alike. *Physics* and *chemistry* have a very low dissimilarity index (0.66) and could be grouped, too. Since *history* is related to *geography*, we could say that these form a cluster. At the opposite end of the scale, we find the greatest difference between *mathematics* and *architecture* (916.27). After the distances between the text types have been found, the next step in the *cluster analysis* procedure is to divide the text types into groups based on those distances. The results of the application of the clustering technique are best described using a *dendrogram* or binary tree. The interpretation of the *dendrogram* is fairly straightforward (*Figure 4*). For example, *Geo/His/Nat* form a group, *Chem/Phys/Comp/Math* form a second group and *Arch/Soc* make up a "runt", as it does not enter any group until near the end of the procedure. Our *dendrogram* outputs 6 possible solutions:
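The squared-difference dissimilarities quoted above (e.g. 0.66 for *chemistry* versus *physics*) can be reproduced with a short sketch. The *K-values* for *Chem*, *Phys* and *Soc* are taken from the text; the *Arch* value is a hypothetical placeholder:

```python
from itertools import combinations

# K-values: Chem, Phys and Soc are quoted in the text; Arch is hypothetical.
k_values = {"Chem": 18.46, "Phys": 19.27, "Soc": 42.03, "Arch": 49.0}

# Squared-difference dissimilarity, the measure behind Table 15:
# d(a, b) = (K_a - K_b)^2
dissimilarity = {
    (a, b): round((k_values[a] - k_values[b]) ** 2, 2)
    for a, b in combinations(k_values, 2)
}

# The closest (least dissimilar) pair is the first candidate for merging
# in an agglomerative cluster analysis.
closest = min(dissimilarity, key=dissimilarity.get)
print(closest, dissimilarity[closest])  # → ('Chem', 'Phys') 0.66
```

Repeatedly merging the closest pair (or cluster) and recomputing distances is what produces the dendrogram described below.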

*Solution 1*: 1 group: (1) *Geo/Hist/Nat/Med/Phil/Chem/Phys/Comp/Math/Arch/Soc*.

*Solution 2*: 2 groups: (1) *Geo/Hist/Nat/Med/Phil/Chem/Phys/Comp/Math* and (2) *Arch/Soc *

*Solution 3*: 3 groups: (1) *Geo/Hist/Nat/Med/Phil*, (2) *Chem/Phys/Comp/Math, *and (3) *Arch/Soc *

*Solution 4*: 4 groups: (1) *Geo/Hist/Nat*, (2) *Med/Phil*, (3) *Chem/Phys/Comp/Math, *and (4) *Arch/Soc *

*Solution 5*: 5 groups: (1) *Geo/Hist/Nat*, (2) *Med/Phil*, (3) *Chem/Phys/Comp*, (4) *Math *and (5) *Arch/Soc *

*Solution 6*: 11 groups: (1) *Geo*, (2) *Hist*, (3) *Nat*, (4) *Med*, (5) *Phil*, (6) *Chem*, (7) *Phys*, (8) *Comp*, (9) *Math*, (10) *Arch *and (11)* Soc*

In principle, the most informative solution is (6), as it models the classification of all 11 text types, whereas solution (1) is clearly the worst one, as it is unable to classify any text at all.

*Cluster analysis* methods always produce a grouping. The grouping produced by this analysis may or may not prove useful for classifying objects. To validate the results of a *cluster analysis*, it has to be used in conjunction with *discriminant function analysis* on the resulting groups (solutions). *Cluster analysis* is a useful exploratory tool for elucidating possible grouping solutions and for constructing, at a later stage, a group membership predictive model by means of *discriminant function analysis*. This latter technique is based on the interval variables (*K-values*). It begins with a set of observations where both group membership and the values of the interval variables are known. The end result of the procedure is a model that allows prediction of membership when only the interval variables, the *K-values*, are known. A second purpose of *discriminant function analysis* is an understanding of the data set, as a careful examination of the prediction model that results from the procedure can give insight into the relationship between group membership and the variables used to predict it.

In order to construct a model using *discriminant function analysis*, we added 11 more text samples, one for each text type. Next, using the *cluster analysis* data, we constructed six models, one for each solution or grouping (see above).

The table below (*Table 16*) displays the discriminant model for solution 6 (11 text types). It shows the case number (each text sample), actual group^{26}, group assignment^{27} (*Highest Group* and *2nd Highest Group*) and discriminant scores. Note that erroneous group assignment is marked with "**". The success rate (correct group assignment) is 81.81%; it failed on correctly assigning cases 10, 13, 15 and 16, which were, however, correctly classified in the second choice.


The next discriminant model, based on solution 5 (*Table 17*), outputs a very promising 95.5% success rate (it only failed in classifying case 19).

These analyses are very revealing, and it is now up to the reader to choose or decide which solution is best. We think that the best model is solution 5, because of its reasonable discrimination power (it is able to discriminate five different text types: (1) *Geo/Hist/Nat*, (2) *Med/Phil*, (3) *Chem/Phys/Comp*, (4) *Math* and (5) *Arch/Soc*) and its accuracy (95.5%).

Another positive contribution of *discriminant function analysis* is that once the groups are known, we can construct a model that allows prediction of membership. This is done by means of the resulting *discriminant function coefficients*. The coefficients for solution 5 are (*Table 18*):

Thus, the *discriminant equation* would be:

*TEXTTYPE = Constant + (K_VALUE × x)*

Where *x* stands for any given *K-value*. It is important not to confuse *K_VALUE* with a given *K-value*: the former refers to a text-specific coefficient calculated by the *discriminant function analysis*, whereas the latter is the lexical density index discussed earlier in the text (see *Section 5.1*).

To illustrate the discriminant and predictive power of the equation, take, for example, a hypothetical text with a *K-value* = 14.01, that is *x* = 14.01. We instantiate *TEXTTYPE* and the coefficients *Constant* and *K_VALUE* accordingly and get the following results:

Next, we just need to maximise over the five results, that is, choose the maximum. So, a text with a *K-value* = 14.01 would most likely be classified in first choice as a *mathematics* text, as *Math* yields the highest result (34.56); in second choice, it would be classified as *Chem/Phys/Comp* (30.52). Similarly, the least likely group membership would be *Arch/Soc* (-116.48).
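A minimal sketch of this maximisation step follows. Since *Table 18* is not reproduced here, the *Constant* and *K_VALUE* coefficients below are hypothetical placeholders, chosen only so that a text with *K-value* = 14.01 ranks *Math* first and *Arch/Soc* last, as in the worked example:

```python
# Hypothetical (Constant, K_VALUE) coefficient pairs for the five groups
# of solution 5 -- placeholders, not the values of Table 18.
coefficients = {
    "Geo/Hist/Nat":   (-30.0, 3.0),
    "Med/Phil":       (-40.0, 3.5),
    "Chem/Phys/Comp": (-12.0, 3.0),
    "Math":           (-5.0, 2.9),
    "Arch/Soc":       (-150.0, 2.4),
}

def classify(x: float) -> list[tuple[str, float]]:
    """Score each group with TEXTTYPE = Constant + K_VALUE * x and rank
    the groups by descending score (first choice first)."""
    scores = {group: const + coef * x for group, (const, coef) in coefficients.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = classify(14.01)
print(ranking[0][0])   # first-choice group for a text with K-value 14.01
```

With real coefficients from the discriminant analysis, the same maximisation yields the first- and second-choice group assignments reported in the tables.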

Interesting in this sense are *Figures 3* and *4*, and the *discriminant function analysis*. *Figure 3* visually represents the *K-value*-ordered linguistic domains, where we can appreciate a logical and smooth text type transition from pure science (*mathematics*) to clearly humanities content (*sociology/architecture*). This stratification is based on a single lexical density feature: the *K-value*. In addition, *Figure 4* offers an exploratory grouped hierarchical structure of the text types, highlighting the major flaw of the *K-value*: its incapacity to distinguish between closely neighbouring *K-values*, as these are grouped into single clusters. Clearly, the *K-value* fails to distinguish between (a) *geography, history* and *natural sciences*; (b) *medicine* and *philosophy*; (c) *chemistry, physics* and *computing*; and (d) *sociology* and *architecture*. However, the final modelling of the data by means of the *discriminant function analysis* reveals that the *K-value* is valid and reliable for successfully differentiating (a) *geography/history/natural sciences*, (b) *medicine/philosophy*, (c) *chemistry/physics/computing*, (d) *sociology/architecture* and (e) *mathematics* from each other. Though a text discriminator using the *K-value* does not, in principle, output a very fine-grained classification, this does not invalidate the use of lexical density for text differentiation. The resulting text classification from the experiment is far from being erroneous or exaggeratedly generic. On the contrary, it discriminates clearly distinctive text type clusters with a promising accuracy rate.

**6. Some final remarks**

In this paper, we have tried to scrutinise studies from a number of diverse linguistic areas (lexicography, grammar, stylistics and also computational applications: probabilistic language modelling) and attempted to show the usefulness of statistics in each. In addition, we have also highlighted the role of texts and corpora as sources of quantitative data. The important role of quantitative analysis and its interaction with qualitative analysis has been described and exemplified. We have tried to call the reader's attention to the singularity of statistical regularities and patterns in natural languages: while we are engaged in communication, we do not consciously monitor our language to ensure that these statistical properties hold. It would be impossible to do so. Yet, without any deliberate effort on our part, we shall find the same underlying regularities in any large sample of our speech or writing.

Statistics allows us to summarise complex numerical linguistic data in order to draw inferences from them: intuitive comparisons cannot always tell us something significant about linguistic variation, and so we considered the role of significance tests, such as the *chi-square* and *z-score*, in telling us how far differences between samples may be due to chance. The need to summarise and infer stems from the fact that there is variation in linguistic data; otherwise there would be no place for statistics in linguistics.

Finally, we also envisage two underlying purposes in this article. First, by applying and discussing the statistical techniques used, the reader can evaluate them critically: their appropriateness for linguistic analysis and the assumptions they make. No attempt has been made to present a thorough survey of the statistical methods available; rather, we have tried to make them more accessible to non-specialists (linguists and postgraduate students). Second, as some readers might be interested in planning their linguistic research using statistics, several of the techniques introduced here might partly assist them in similar areas and topics.

The increasing accessibility of linguistic corpora and the belief that theory must be based on language *as it is* have placed empirical linguistics once again at the foreground of linguistics. The immediate implication of these assumptions is that linguists will increasingly demand the use of statistics in their research. The answer, hence, to the question in the title of this paper is definitely *yes*.

**Acknowledgements**

Thanks go to Dr. Javier Valenzuela and Richard Barbosa for their comments, advice and useful suggestions on an earlier draft of this paper.

**References**

ALLWOOD, J., L.-G. ANDERSSON & Ö. DAHL 1977. *Logic in Linguistics*. Cambridge: CUP.

BARNBROOK, G. 1996. *Language and Computers*. Edinburgh: Edinburgh University Press.

BAAYEN, R. H. 2001. *Word Frequency Distributions*. Dordrecht: Kluwer Academic Publishers.

BROWN, D.J. 1988. *Understanding Research in Second Language Learning. A Teacher's Guide to Statistics and Research Design*. Cambridge: CUP.

CANTOS, P. 1995. Tratamiento informático y obtención de resultados. *CUMBRE corpus lingüístico del español contemporáneo. Fundamentos, metodología y aplicaciones*. Ed. A. Sánchez, R. Sarmiento, P. Cantos and J. Simón. Madrid: SGEL: 39-70.

______ 2000. Investigating Type-token Regression and its Potential for Automated Text Discrimination. *Cuadernos de Filología Inglesa. Número monográfico: Corpus-based Research in English Language and Linguistics*. Ed. P. Cantos and A. Sánchez. Murcia: Servicio de Publicaciones de la Universidad de Murcia: 71-91.

______ 2001. An Attempt to Improve Current Collocation Analysis. *Technical Papers. Volume 13. Special Issue. Proceedings of the Corpus Linguistics 2001 Conference*. Ed. P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja: 100-8.

CANTOS, P. & A. SÁNCHEZ (*forthcoming*) Lexical Constellations: What Collocates Fail to Tell. *International Journal of Corpus Linguistics*.

CHIPERE, N., D. MALVERN, B. RICHARDS & P. DURAN 2001. Using a Corpus of School Children's Writing to Investigate the Development of Vocabulary Diversity. *Technical Papers. Volume 13. Special Issue. Proceedings of the Corpus Linguistics 2001 Conference*. Ed. P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja: 126-133.

CHURCH, K.W., W. GALE, P. HANKS & D. HINDLE 1991. Using Statistics in Lexical Analysis. *Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon*. Ed. U. Zernik. Hillsdale, NJ: Lawrence Erlbaum Associates: 115-164.

CLEAR, J. 1993. From Firth Principles: Computational Tools for the Study of Collocation. *Text and Technology*. Ed. M. Baker et al. Amsterdam: Benjamins: 271-292.

HATZIVASSILOGLOU, V. 1994. Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System. *The Balancing Act. Combining Symbolic and Statistical Approaches to Language*. Ed. J.L. Klavans and P. Resnik. Cambridge: The MIT Press: 67-94.

HUNSTON, S. & G. FRANCIS 1999. *Pattern Grammar. A Corpus-driven Approach to the Lexical Grammar of English*. Amsterdam: John Benjamins.

MCENERY, T. & A. WILSON 1996. *Corpus Linguistics*. Edinburgh: Edinburgh University Press.

MCKEE, G., D. MALVERN & B.J. RICHARDS 2000. Measuring Vocabulary Diversity Using Dedicated Software. *Literary and Linguistic Computing* 15 (3): 323-38.

OAKES, M. 1998. *Statistics for Corpus Linguistics*. Edinburgh: Edinburgh University Press.

SÁNCHEZ, A. & P. CANTOS 1997. Predictability of Word Forms (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish. *International Journal of Corpus Linguistics* 2(2): 259-280.

______ 1998. El ritmo incremental de palabras nuevas en los repertorios de textos. Estudio experimental y comparativo basado en dos corpus lingüísticos equivalentes de cuatro millones de palabras, de las lenguas inglesa y española y en cinco autores de ambas lenguas. *ATLANTIS* XIX(2): 205-223.

SÁNCHEZ, A., R. SARMIENTO, P. CANTOS & J. SIMÓN 1995. *CUMBRE corpus lingüístico del español contemporáneo. Fundamentos, metodología y aplicaciones*. Madrid: SGEL.

SCOTT, M. 1996. *Wordsmith Tools*. Oxford: Oxford University Press.

SINCLAIR, J. 1991. *Corpus, Concordance, Collocation*. Oxford: Oxford University Press.

SINCLAIR, J. et al. 1995. *Collins Cobuild English Dictionary*. London: Harper Collins Publishers.

STUBBS, M. 1995. Collocations and Semantic Profiles: On the Cause of the Trouble with Quantitative Methods. *Functions of Language* 2(1): 1-33.

WOODS, A., P. FLETCHER & A. HUGHES 1986. *Statistics in Language Studies*. Cambridge: CUP.

YANG, D.H., P. CANTOS & M. SONG 2000. An Algorithm for Predicting the Relationship between Lemmas and Corpus Size. *International Journal of the Electronics and Telecommunications Research Institute (ETRI)* 22: 20-31.

YANG, D.H., M. SONG, P. CANTOS & S.J. LIM (*forthcoming*) On the Corpus Size Needed for Compiling a Comprehensive Computational Lexicon by Automatic Lexical Acquisition. *Computers and the Humanities*.

ZIPF, G. 1935. *The Psycho-Biology of Language*. Boston: Houghton Mifflin.

Received in December 2001.

1 Hatzivassiloglou, V. (1994) "Do We Need Linguistics When We Have Statistics? A Comparative Analysis of the Contributions of Linguistic Cues to a Statistical Word Grouping System". *The Balancing Act. Combining Symbolic and Statistical Approaches to Language.* Ed. J.L. Klavans and P. Resnik. Cambridge: The MIT Press. 67-94.

2 For a better understanding of the terms *tokens, types* and *lemmas*, consider the following word sequence: *sings, singing, sang, sing, sings, sung, singing, sung* and *sang*, where we have nine words or **tokens**, five different word forms or **types** (*sing, sings, singing, sang* and *sung*) and a single base form or **lemma**, namely *sing*.

3 For an exhaustive typology of frequency lists see, among others, Cantos (1995).

4 The *type-token ratio* is the quotient obtained when dividing the total number of *types* by the total number of *tokens*. See also footnote *23*.

5 See Cantos (2000) for a detailed discussion on the limitations and drawbacks of the *type-token ratio*; see also McKee, Richards and Malvern (2000), Chipere et al. (2001) and Baayen (2001: 4-5). Also interesting is Scott's notion of a "standardised *type-token ratio*" (1996).

6 Taken from the *Collins Cobuild English Dictionary* (Sinclair et al. 1995).

7 In a theoretical normal distribution, the mean (the sum of all scores divided by the total number of scores), the median (the middle point or central score of the distribution) and the mode (the value that occurs most frequently in a given set of scores) all three fall at the same point: the centre or middle (mean = median = mode). Additionally, if we plot the data graphically, we get a symmetric bell-shaped curve.

8 The probability that rejecting the null hypothesis (whenever the difference is not significant) will be an error.

9 Note that squaring the difference ensures positive numbers.

10 This is a technical term from mathematics which we shall not attempt to explain here. For some non-technical and easy accessible explanations see Woods, Fletcher and Hughes (1986: 138-9) and/or Brown (1988: 118-9).

11 The *chi-square* distribution tables can be found in the appendices of most statistics books/manuals; see for instance Oakes (1998: 266).

12 See *Appendix 3*.

13 *MonoConc* v. 1.5. (Athelstan Publications; http://www.athel.com/mono.html#mono).

14 There are important differences between the information provided by these three measures: more, perhaps, between the *t-score* and the other two than between the *z-score* and *mutual information* themselves. It is difficult, if not impossible, to select one measure that provides the best possible assessment of collocations, although there has been ample discussion of their relative merits (see, for example, Church et al. 1991; Clear 1993; or Stubbs 1995).

15 The standard deviation provides a sort of average of the differences of all scores from the mean.

16 See for instance Hunston and Francis' book on a corpus-driven approach to a lexical grammar of English (1999), Cantos (2001) and Cantos and Sánchez's forthcoming article on lexical hierarchical constructions of collocations.

17 Readers unfamiliar with statistics might find some parts of this next section difficult to grasp.

18 If a relation *R*, whenever it holds between both *x* and *y* and between *y* and *z*, also holds between *x* and *z*, the relation is said to be transitive: ∀*x* ∀*y* ∀*z* ((*R*(*x*,*y*) ∧ *R*(*y*,*z*)) → *R*(*x*,*z*)) (Allwood et al. 1977: 89).

19 For an ample discussion on the predictive power of the *Type-Token formula*, see Yang, Cantos and Song (2000), and Yang, Song, Cantos and Lim (*forthcoming*).

20 The *K-value* is a *type-token* constant, whereas the *K_{L}-value* is a *lemma-token* constant.

21 The calculations are based on a single small 250,000-token sample taken randomly from the *CUMBRE Corpus*, a corpus of contemporary Spanish (for more details see Sánchez et al. 1995).

22 Succinctly, what is meant by *lexical diversity* or *lexical density* is *vocabulary richness*.

23 Both the *token-type ratio* and the *type-token ratio* provide information on the distribution of tokens between the types in a text. The *token-type ratio* is the mean frequency of each type in a text, whereas the *type-token ratio* reveals the mean distribution of types in a text or corpus (if we eventually multiply this quotient by 100, we get the mean percentage of different types per one hundred words).

24 The property of intervals is concerned with the relationship of differences between objects. If a measurement system possesses the property of intervals, it means that the unit of measurement is the same thing throughout the scale of numbers. That is, a centimetre is a centimetre, no matter where it is measured.

25 The *dissimilarity matrix*, the *dendrogram* and the *discriminant function analysis* have been calculated and produced using *SPSS v. 10*, a statistics package for the social sciences.

26 1 = *Architecture*; 2 = *Chemistry*; 3 = *Computing*; 4 = *Geography*; 5 = *History*; 6 = *Mathematics*; 7 = *Medicine*; 8 = *Natural Sciences*; 9 = *Philosophy*; 10 = *Physics*; and 11 = *Sociology*.

27 It refers to the automatic text-classification performance of the model, based on a single lexical density measure: the *K-value*.