E-LEITURA II (E-READING II): word database for reading by students of Basic Education II

Purpose: To develop a database of high-, medium- and low-frequency words for reading in Basic Education II. Methods: The words were taken from the Portuguese Language teaching material used by the state school network of São Paulo from the 6th to the 9th year of Basic Education. Only nouns were selected. The frequency with which each word occurred was recorded and a single database was created. To classify the words as of high, medium or low frequency, the terciles of the distribution, the mean frequency and the cutoff points of the terciles were used. To ascertain whether the words of high, medium and low frequency corresponded to this classification, 224 students were assessed: G1 (6th year, n=61); G2 (7th year, n=44); G3 (8th year, n=65); and G4 (9th year, n=54). The word lists were presented to the students for reading aloud in two sessions: (1) words of high and medium frequency and (2) words of low frequency. Results: Words which met the exclusion criteria, or which caused discomfort or joking among the students, were excluded. The final database comprised 1659 words and was titled 'E-LEITURA II' ('E-READING II' in English). Conclusion: The E-LEITURA II database is a useful resource for professionals, as it can be used for research, educational and clinical purposes with students of Basic Education II. The professional can choose the words according to her objectives and criteria when developing evaluation or intervention procedures involving reading.


INTRODUCTION
Reading is presented in the Brazilian and international scientific literature as one of the skills most valued and required by society. Its importance is emphasized in the individual's school, social and cultural life. It is understood as the students' principal tool for learning new concepts and is one of schools' biggest challenges (1-4).
In the early years of Basic Education I, the main objective is to teach the student to read. In later years, reading becomes necessary for learning the proposed content, and is important in all areas of the individual's life.
Difficulties in reading hinder the development of basic skills for mastery of the language, such as increasing vocabulary and gaining knowledge of words and writing, which has repercussions for later learning (5-12).
Word reading may be explained by the Dual Route model (13,14), according to which reading results from a process involving either phonological mediation (phonological route) or direct visual processing (lexical route).
Reading by the phonological route begins with the identification of the letters in the visual analysis system, where a letter code is formed and then translated, through grapheme-phoneme conversion, into chains of phonemes.
In Portuguese, which lacks an unambiguous correspondence between letters and phonemes, the conversion of the letters into a sequence of graphemes is a prerequisite for grapheme-phoneme correspondence and for learning to read (15).
In reading through the lexical route, the reader, faced with a written word, identifies the letters which make it up (visual analysis system), and the information received is then transformed into a letter code. This code is sent to the visual input lexicon, where the corresponding visual recognition unit is activated, resulting in the identification of a word which, in turn, activates its meaning, stored in the semantic system. This forms a semantic code which is responsible for activating the speech production unit, stored in the phonemic output lexicon (15).
The only requirement for reading by the visual (lexical) route is to have seen the word often enough to form an internal representation of it. This process is considered similar to what happens when we identify a picture, a number, or a signature. In the phonological route, the main requirement is to learn to use the grapheme-phoneme conversion rules (16).
The rapid and accurate identification of words (automatic recognition) is essential for reading comprehension. Decoding is the first step towards automatic reading and has been shown to be associated with text comprehension performance. Thus, poor reading comprehension may result from a general comprehension problem or from insufficient skill in identifying written words (1,2,9,11,17-19).
The use of the phonological and lexical routes is assessed through the task of reading isolated words and pseudowords aloud, which makes it possible to determine which route the reader uses most (6,7,13,14). This task is recognized in various alphabetic languages as an effective method for assessing reading, and has been widely studied due to its importance in the early stages of learning (20-25).
In Brazil, lists of real words and pseudowords have been published for students in Basic Education. One example is the list from the study by Pinheiro (15,26), widely used by researchers and clinicians for assessing reading and writing, which contains words of high and low frequency, divided into regular, irregular and rule-based words of varying length, for students in the first years of Basic Education. Brazilian researchers (27) have also developed a list of words and pseudowords, entitled "Assessment of reading of words in isolation", which evaluates the oral reading of words and pseudowords varying in regularity, lexicality, length and frequency, for students of the former 2nd and 3rd grades.
Reading evaluation procedures such as the PROLEC (Evaluation of Reading Processes) (16) are used to evaluate the lexical and phonological routes in students of Basic Education I. They use lists of real words of differing syllabic complexity, frequency (high and low) and length, derived from the list compiled by Pinheiro (26), and pseudowords of differing syllabic complexity, respecting the syllabic patterns of regularity and length. In the evaluation of writing, the Pró-Ortografia (Spelling Evaluation Protocol) (28) uses, for dictation, real words with regular, rule-based and irregular syllabic patterns, varying in length, and pseudowords with regular and rule-based syllabic patterns, also for students in Basic Education cycle I.
It is necessary to assess students in the second cycle of Basic Education in order to ascertain whether word recognition has become automatic, which is a requirement for understanding a text. This study is justified because, although various professionals in Brazil use lists of words and pseudowords for assessing the phonological and lexical routes, there is as yet no scientific publication of word databases from which the professional may build her own list according to her criteria, whether for assessing the reading aloud of isolated words or for developing speech therapy and educational intervention procedures.
In the light of the above, this study aimed to develop a database of words of high, medium and low frequency, termed E-LEITURA II ('E-READING II' in English), to serve as linguistic stimuli for evaluation and intervention procedures in reading among students of Basic Education II.

METHODS
This is applied research, aiming to develop a database of words for reading by students of Basic Education II, termed E-LEITURA II. Applied research aims to generate knowledge for practical application, with a view to solving already-identified problems. The development of this material is part of a doctoral thesis, currently in its final stage, entitled "Translation and cultural adaptation of the evaluation of the processes of reading (PROLEC-SE R) for students of Basic Education cycle II and of Senior High School Education", of the Postgraduate Program in Education of the Faculdade de Filosofia e Ciências, Universidade Estadual Paulista "Júlio de Mesquita Filho" (FFC/UNESP/Marília, SP), approved by the institution's Ethics Committee under Opinion No. 1,125,746. For the development of E-LEITURA II, use was made of the teaching material of the São Paulo state school network, from the 6th to the 9th year of Basic Education (Cycle II), for the four bimesters of 2013.
The student's notebook is part of the actions called for in the "São Paulo Faz Escola" program. The content was developed by specialists in Education, based on the Official Curriculum of the State of São Paulo. This material serves as support for the curriculum proposed by the Education Department of the State of São Paulo.
Each school bimester, a kit of books is distributed by school year, containing the notebooks for the respective subjects (mathematics, Portuguese language, history, English language, geography, sciences, art and physical education). The material selected for this work was the notebook for Portuguese Language - Languages (Table 1), made up of 16 books (four per school year).
All the words from the texts in the teaching materials were typed into a single column of an Excel spreadsheet. After typing, only the nouns were selected, because nouns are a frequent class in any text and perform important syntactic functions in the sentence. As the noun is the head of the noun phrase, the decision was made to keep only this word class in the database.
All homophone words, and those which might be understood as ambiguous depending on the context and which could therefore be classified differently, were removed. For example, the Portuguese word "andar" (walk, gait) can be placed in the class of nouns, as in 'the boy's gait', but also in the class of verbs, as in 'the boy walks to school'.
Nouns which can take on the role of adjectives, albeit metaphorically, were kept. An example is the word "cat", which can be a noun, as in the sentence "the cat jumped over the wall", and a metaphorical adjective, as in the Portuguese phrase "my girlfriend is a cat" (equivalent to calling a person 'a fox' in English).
Written words taken from other languages, such as 'games' and 'show', were removed, as were abbreviations (for example, 'CD' for 'compact disc'). Also excluded were adverbs, adverbial phrases, prepositional phrases, adjectives, months of the year, numerals, augmentative and diminutive words, slang, and words formed through juxtaposition. Among compound and homonymous words, only those formed through agglutination and homonyms of the homophone type (written differently but pronounced the same) were kept.
Words in the synthetic augmentative or diminutive degree, formed with suffixes, were excluded when they took the regular form, as in boné → bonezinho (cap → little cap) or carro → carrão (car → big car). Irregular diminutives and augmentatives, which are constructed with other suffixes, were kept, as in palacete (mansion) and ribeirão (creek).
As the dominant gender in Brazilian Portuguese is masculine, when a word appeared in both the feminine and the masculine, the feminine form was excluded and its occurrences were added to those of the masculine form. If the same word appeared in both the plural and the singular, the plural occurrences were counted under the singular and the plural forms were removed from the database. The same occurred with words which appeared only in the feminine; these were converted into the masculine. Feminine words were kept only if there would be a change in meaning, that is, if different words were used to represent gender, such as cow → bull or prince → princess. Words which appeared only in the plural, whether masculine or feminine, were converted into the masculine singular. If a word, upon being changed from the plural into the singular, became a homograph or a perfect homonym, or otherwise gave rise to any type of ambiguity, it was excluded from the database.
After this selection process, all the words which appeared in the material were counted, so as to survey their frequency of occurrence in each school year. The spreadsheets were organized by school year and sent to a statistician to analyze which words were common to all years, thus creating a single database of words for Basic Education II.
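The counting step described above can be illustrated with a short sketch. It is not the authors' spreadsheet workflow: the normalisation table `BASE_FORM`, the function `count_nouns` and the sample tokens are hypothetical, chosen only to show how plural and feminine occurrences might be credited to the masculine singular form before the per-year frequencies are combined.

```python
from collections import Counter

# Hypothetical normalisation table: inflected form -> masculine singular form.
# In the study this step was carried out by hand in a spreadsheet.
BASE_FORM = {
    "meninos": "menino",
    "menina": "menino",
    "meninas": "menino",
    "livros": "livro",
}

def count_nouns(tokens, base_form=BASE_FORM):
    """Count occurrences per noun, crediting inflected forms to their base form."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        counts[base_form.get(word, word)] += 1
    return counts

# Hypothetical noun tokens for two school years.
year_6 = count_nouns(["menino", "meninas", "livro", "livros", "escola"])
year_7 = count_nouns(["Menina", "livro", "texto"])

# Adding Counters sums the per-year frequencies into a single database.
database = year_6 + year_7
print(database.most_common())
```

Keeping one counter per school year and summing them afterwards mirrors the description of year-level spreadsheets being merged into a single database.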
To classify the words as of high, medium or low frequency, the decision was made to work with the terciles of the distribution, together with the mean frequency and the cutoff points of the terciles, because of the frequencies found close to the center. If only high and low frequency were used, a frequency of 48% would, for example, be classified as low and one of 52% as high, although the two are very close, which would not achieve the proposed objective.

Table 2 presents the values based on the cutoff point of the terciles for the number of times that each word can appear in order to be considered to be of high, medium or low frequency.
Based on this cutoff point, the number of words for each type of frequency for Basic Education II is presented below:
• High frequency: 72 words;
• Medium frequency: 265 words;
• Low frequency: 1330 words.
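The idea of classifying words by tercile cut-off points can be sketched as follows. This is one possible reading only: the source does not spell out how the cut-offs were computed (the actual values are those of Table 2), and the group sizes reported above (72, 265 and 1330) suggest that the authors' rule differs in detail from a simple split of the words into equal thirds. The function names and the occurrence counts below are hypothetical.

```python
from statistics import quantiles

def tercile_cutoffs(counts):
    """Return the two cut-off points that split the counts into three parts."""
    lower, upper = quantiles(counts, n=3, method="inclusive")
    return lower, upper

def classify(count, lower, upper):
    """Label a word by where its count falls relative to the tercile cut-offs."""
    if count > upper:
        return "high"
    if count > lower:
        return "medium"
    return "low"

# Hypothetical occurrence counts, for illustration only.
word_counts = {"escola": 40, "texto": 35, "palavra": 12,
               "livro": 9, "colibri": 2, "palacete": 1}

lower, upper = tercile_cutoffs(list(word_counts.values()))
labels = {word: classify(n, lower, upper) for word, n in word_counts.items()}
print(labels)
```

Using three bands rather than two avoids the situation described in the text, in which two words with nearly identical frequencies on either side of a single midpoint would receive opposite labels.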

Participants
A total of 224 students from the 6th to the 9th years of Basic Education II were assessed, from three state public schools in a town in the nonmetropolitan region of São Paulo: G1) 6th year (n=61); G2) 7th year (n=44); G3) 8th year (n=65); and G4) 9th year (n=54).
The statistics for this study are descriptive (percentages of correct readings of each word) and analytical (this percentage can be compared between the low-frequency words and those of medium and high frequency). A minimum of 40 students per school year was therefore specified for ascertaining whether the words of high, medium and low frequency genuinely correspond to this classification.

Procedures
• Signing of the Terms of Informed Consent by those responsible for the students;
• Signing of the Terms of Assent by the students assessed;
• Presentation of the list of words in the E-LEITURA II database for reading out loud.
The lists of words of high, medium and low frequency from the E-LEITURA II database were presented to the students on white A4 paper, in Times New Roman, size 14, in lowercase letters. Each page presented an average of 72 words, which the student read aloud, one at a time. This procedure was undertaken individually in two sessions, on separate days: (1) reading of words of high and medium frequency, lasting an average of 20 minutes, and (2) reading of low-frequency words, lasting an average of 30 minutes. The mean total duration of the two sessions was 50 minutes. Prior to reading each list, the student was told whether the words were of high, medium or low frequency. For the low-frequency words in particular, it was explained to the student that she might encounter words which she had rarely or never seen before and that she should therefore not stop her reading for the researcher to tell her whether she had read the word correctly.

Analysis of the results
The statistical analysis was undertaken using the STATA/SE program (version 12.1), based on the number of correct readings for each word evaluated. The 95% confidence interval (CI 95%) was calculated, indicating the accuracy of the results.

RESULTS
After the evaluation of the reading of the words of high, medium and low frequency, observation of the students' behavior showed that some words fell within the exclusion criteria, while others caused problems in understanding or discomfort on the part of the students. Therefore, the following words were removed from the database:
• High frequency: sexo (sex);
• Low frequency: face (the students pronounced the Portuguese word 'face' as they would the English word 'face', deriving from 'Facebook'); poste (which is also the present subjunctive of the verb 'postar', to post); 'descontrução' (meaning 'deconstruction'; this word had been typed with a letter 's' missing); colher (which in Portuguese is both a verb, meaning 'to gather', and a noun, meaning 'spoon'); expectativa (the word had been typed twice); and varão (male), which is both noun and adjective.
After the exclusion of these words, the lists of high, medium and low frequency comprised:
• High frequency: 71 words (Appendix A);
• Medium frequency: 265 words (Appendix B);
• Low frequency: 1323 words (Appendix C).
The word lists are presented in the Appendices, in alphabetical order, with their respective means, standard deviations and confidence intervals (CI 95%).
Table 3 presents the percentage of correct readings of the high-frequency words, and some examples. The word with the fewest correct readings on the high-frequency list was "concurso" (competition), with 94.6% (CI 95% 91.7-97.6).
The confidence interval indicates the accuracy of the results. For the word "concurso", on the list of high-frequency words, the interval from 91.7 to 97.6 covers, with 95% confidence, the true proportion of correct readings; that is, the population value for Basic Education II students is expected to lie within this interval.
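The interval reported for "concurso" can be approximated with a normal-approximation (Wald) interval for a proportion. This is a sketch under assumptions: the source does not state which interval method was requested in STATA, and the count of 212 correct readings out of 224 is a hypothetical value, chosen because it is consistent with the reported 94.6%.

```python
from math import sqrt

def wald_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 for ~95%)."""
    p = correct / total
    margin = z * sqrt(p * (1 - p) / total)
    return p - margin, p + margin

# Hypothetical count: 212 of 224 readings correct (about 94.6%).
low, high = wald_interval(212, 224)
print(f"{212 / 224:.1%} (CI 95% {low:.1%}-{high:.1%})")
```

With these assumed inputs the bounds round to 91.7% and 97.6%, in line with the interval reported in the text. For proportions close to 0% or 100%, other interval methods (for example, Wilson) may behave better than this approximation.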
Table 4 presents the percentage of correct readings of the medium-frequency words for Basic Education, and some examples of words. The word with the fewest correct readings on the list of medium-frequency words was "condômino" ('co-owner'), with 33.8% (CI 95% 27.5-40.1), followed by the word "colibri" ('hummingbird') with 67.6% (CI 95% 61.4-73.8).
Based on these results, it may be observed that the words of high, medium and low frequency genuinely correspond to this classification, although each word presents its own level of difficulty, as shown in the appendices through the mean, standard deviation and confidence interval. From these, the researcher/professional can choose the words which best fit her criteria and objectives.

DISCUSSION
The creation of the E-LEITURA II word database was based on the need to develop lists of words of high and low frequency for the assessment of students of Basic Education II, considering that, at the time of writing, the authors are not aware of the scientific publishing of a database of words such that Brazilian professionals can elaborate and use their own lists.
In Brazil, the cognitive evaluation of reading has mainly been undertaken using ready-made lists which vary in contrasting psycholinguistic characteristics such as regularity, length and frequency, as observed in studies undertaken in Brazil (15,16,26-28).
The reading of words from these lists, generally undertaken out loud, provides the following information: (1) effects of the variation in the number of letters (length); (2) effects of variation of levels of familiarity with words on the reading (frequency); (3) involvement of the semantic process in the reading; and (4) involvement of the grapheme-phoneme conversion process in the retrieval of the pronunciation (21,22) .
According to one Brazilian study (22), a list of words for assessing the use of the phonological and lexical routes over the course of the child's development must, firstly, be matched for frequency: it must contain the same number of frequent and non-frequent words and an equal number of regular and irregular words, and, within each level of frequency and regularity, the same number of short and long words.
During the elaboration of the E-LEITURA II database, the following criteria were considered: 1) classification of the words into high, medium and low frequency; 2) exclusion of words which might be considered ambiguous, depending on the context; and 3) verification of the sensitivity of the classification of the words. It was possible to address these three issues through the partnership between professionals from the areas of speech therapy, education (Arts) and the exact sciences (statistics).
It is emphasized that all the words of the E-LEITURA II database were read in order to ascertain the sensitivity of the classification, that is, to check whether the words of high, medium and low frequency genuinely correspond to this classification. The reading also made it possible to observe which words caused discomfort or fell within the exclusion criteria, making their use in educational and clinical practice unviable.
The idea behind creating the E-LEITURA II database is to provide researchers and clinicians with a database of words for students of Basic Education II which can be used as linguistic stimuli for assessment and intervention procedures. These procedures are important because, as observed in one Brazilian study (29), students in the later years of this cycle who are identified as having difficulty in reading comprehension present reading results below those of younger students (from the final years of Basic Education I). The correlations found show that these students, although less competent, use resources involving cognitive skills in their reading so as to achieve comprehension; it follows that the cognitive skills of reading must be evaluated and encouraged.
In contrast, one Brazilian study (30), undertaken with students from the 3rd to 7th years with good academic performance, identified a reduction in text-reading time as educational level advances, as well as shorter reading times for texts with short words, evidencing the influence of word length and of the text's syntactic complexity on reading time. The simpler the syntactic structure, the less time is taken to read the text.
It is necessary to develop instruments which make it possible to assess reading skills in students of Basic Education II, so that professionals and researchers have the necessary validated instruments for undertaking the evaluation and for therapeutic reasoning based on the assessment findings, thus allowing effective interventions.
As observed in the results of the present study, each word, regardless of the list to which it belongs, presents its own level of difficulty, and the researcher/professional can choose those which best meet her objectives and criteria (for example, syllabic complexity or word length).
It is, however, necessary to stress this study's limitations. Because it was undertaken in a city in the nonmetropolitan region of the State of São Paulo, and given the regional variation found both within the state and in other states and cities of Brazil, the words considered of low frequency in this study may not necessarily be of low frequency in other regions of Brazil; the same is true for the words of high and medium frequency. Furthermore, the teaching material used for building the database was provided by the government of the State of São Paulo and is not used in other states. It is therefore necessary to undertake a broader study covering all the regions of Brazil.

FINAL CONSIDERATIONS
The E-LEITURA II database is a useful resource for professionals, as it provides, free of charge and for the first time in the Brazilian context, a database with a wide range of words (classified as high, medium and low frequency) which can be used for educational and clinical research with students of Basic Education II. Based on the E-LEITURA II database, the professional can choose the words according to her objectives and criteria. As a result, it is anticipated that the E-LEITURA II database may serve as linguistic stimuli for assessment and intervention procedures in reading with students of Basic Education II.

Table 1. Presentation of the material used for extracting the words for the Words Database

Table 2. Distribution based on the cutoff point of the terciles for the frequency of occurrence of the words in the E-LEITURA II database

Table 3. Presentation of the percentage of correct readings of the high frequency words in the E-LEITURA II database

Table 5. Presentation of the percentage of correct readings of the low frequency words of the E-LEITURA II database