Russian Learner Corpora Research: State of the Art and Call for Action

With the increase in the availability and user-friendliness of Russian language corpora and corpus-analytic tools, the field of Russian language education has recently begun to employ corpus linguistics as an approach to understanding the dynamics of language development in users of Russian as a second and heritage language. The paper provides a brief overview of the current state of learner corpus research as a field and explores the benefits of applying corpus-linguistic methods and instruments to the study of Russian. The paper reviews pertinent issues in corpus design, compilation, and annotation; offers an overview of the existing Russian language corpora; and reports on the currently available corpus-based studies of Russian as a second/heritage language. The paper concludes with a call to the field to explore the benefits of corpus-based approaches to the study of Russian.


Introduction
The widespread advancement of computer technology that gathered speed in the 1990s has resulted in significant changes in many social disciplines, including linguistics and applied language studies, which saw the increased prominence of the new discipline of corpus linguistics, focused primarily on data-driven (rather than theory-driven) explorations of large and systematically organized language databases known as language corpora. Described as a methodology and a method (Gries, 2009; Hardy, 2012), a practice and a "philosophical approach" (Leech, 1992), corpus linguistics utilizes the methods and instruments of computer-assisted analyses of language that allow researchers to analyze large quantities of authentic linguistic data in search of patterns, regularities, and idiosyncrasies of language structure and language use across language modalities, varieties, registers, genres, and groups of speakers. The impact of corpus linguistics on the field of language studies has been significant, and is described by many linguists as nothing short of revolutionary (Hunston, 2002; Kopotev; Mustajoki, 2008; Gries, 2011, inter alia), contributing to every linguistic subfield.
The area of language pedagogy has, arguably, been one of the greatest beneficiaries of corpus-linguistic approaches. Briefly, the convergence between the fields of language education and corpus linguistics has followed two major directions (Leech, 2014). One focuses on applying the knowledge culled from investigations of standard corpora to better serve the pedagogical needs of language teachers and learners. This approach, for instance, has produced an array of modern-day evidence-based reference grammars, frequency dictionaries, phrasal lists, textbooks, and other teaching/learning materials based on corpus data (Biber, 2009; Conrad, 2010; Kopotev; Mustajoki, 2008; Lu et al., 2018; Lebedeva, 2020). In addition, language educators have been developing pedagogical methods and techniques for data-driven learning, an approach that allows for independent and semi-independent exploration of corpus data by language learners (Boulton, 2017).
The other locus of the convergence is the application of corpus-linguistic methods and tools to the study of learner language, that is, the language produced by learners at different levels of linguistic proficiency, with an eye toward a better understanding of the developmental trajectories of the linguistic behaviors, lacunas, and abilities of those learning a language as a second (L2), foreign (FL), or heritage language (HL) (Granger, 2009; Leech, 2014).

All content of Bakhtiniana. Revista de Estudos do Discurso is licensed under a Creative Commons attribution-type CC-BY 4.0.
Both directions have developed robustly over the course of the past three decades.
Admittedly, the most progress has been made in the area of English as a second/foreign language (ESL/EFL), where the availability of well-developed standard and learner corpora and the embrace of corpus-linguistic methods came early and were supported through various institutionalized practices. Recent years, however, have seen some encouraging developments in Russian corpus linguistics, both with regard to standard corpora and learner corpus linguistics (Furniss, 2020; Lebedeva, 2020).
In the current paper, I provide a review of some of these developments, specifically in the area of corpus-based approaches to the study of Russian learner language,1 and advocate for further advancement toward a true convergence between Russian corpus linguistics and Russian second language acquisition (SLA) studies.

Advances in the Corpus-Based Study of Russian Learner Language
Since the early 1990s, corpus linguists have argued for the value of learner corpora in language education. Learner corpora represent language produced by speakers whose command of the language has not yet reached maturity (Leech, 2014); these include first language/child language (L1) developmental corpora, second language learner corpora culled from L2 or FL speakers of the language at different levels of proficiency, and, lately, heritage language corpora, comprising language data from HL speakers and/or HL learners of a language. The major purpose of learner corpora is to "contribute to a better understanding of the universal, as well as language- and group-specific, patterns of second/foreign language acquisition" (Kisselev, 2021, p. 525).

Each text in the RULEC corpus, for example, is accompanied by metadata recording the learner's language proficiency (on the ACTFL proficiency scale), the name of the course for which the paper was written, the text type (e.g., paragraph, essay, research paper), the function targeted by the task (e.g., definition, narration, argumentation), and the time restriction (timed or untimed writing). These metadata help researchers create subcorpora based on learner and text characteristics and compare these subcorpora along various linguistic parameters, with the goal of understanding the relative effects of proficiency level, genre, topic, and other characteristics of the learners and their texts on the linguistic features of those texts.

1 I gladly refer the reader interested in the direct and indirect applications of corpora to many papers and volumes on the topic, including but not limited to: Dobrušina and Levinzon (2006); Mustajoki, Kopotev, Birjulin, and Protasova (2009); Alsufieva, Kisselev, and Freels (2012); Furniss (2013); Kisselev and Furniss (2020); Novikov and Vinogradova (2022); and the special issue on Corpus Linguistics in Teaching Russian as a Second Language of Russkij Yazyk za Rubežom (Ed. Lebedeva, 2020).
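Grouping texts into subcorpora by such metadata is mechanically simple. The sketch below is my own toy illustration (the field names and values are invented, not the actual RULEC header format): it filters a list of text records by arbitrary metadata criteria.

```python
# Illustrative sketch: building a subcorpus by filtering on metadata.
# Field names and values are invented, not the actual RULEC header format.

texts = [
    {"id": "t1", "proficiency": "Advanced", "text_type": "essay"},
    {"id": "t2", "proficiency": "Intermediate", "text_type": "paragraph"},
    {"id": "t3", "proficiency": "Advanced", "text_type": "research paper"},
]

def subcorpus(records, **criteria):
    """Return the records whose metadata match all given criteria."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

advanced = subcorpus(texts, proficiency="Advanced")
print([r["id"] for r in advanced])  # ['t1', 't3']
```

The same filter can be stacked (e.g., proficiency level plus text type) to isolate exactly the subcorpus a comparison requires.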
The original RULEC data is raw, i.e., the language is not lemmatized, tagged for parts of speech, or syntactically annotated. Although all of these procedures have since become easily available (Kisselev, 2021), the first studies based on RULEC utilized the raw data.
In fact, certain research questions can be successfully investigated using only unparsed data with the help of appropriate corpus-analytic procedures. Such was the approach in Kisselev and Alsufieva (2017), which analyzed functional and structural types of sentences with conjunctions and found that the less frequent and structurally more complex types were better represented at the Advanced levels in both groups, with the HL learners exhibiting an advantage over the FL learners with regard to structures that require structural manipulation of the constituents of the subordinate clause (e.g., to, chto 'that,' chtoby 'in order to,' kotoryj 'which').
A subsequent study (Kisselev; Kopotev; Klimov, forthcoming) addressed largely the same question of the development of complex sentence structure but employed a more advanced computational analysis. First, the authors grammatically parsed the raw RULEC data using the trainable NLP tool UDPipe (Straka; Straková, 2017), which provides tokenization, lemmatization, and morphological and syntactic parsing of language data.
Then, using in-house Python scripts, the researchers analyzed and compared data produced by four learner groups (HL Intermediate, HL Advanced, FL Intermediate, and FL Advanced) along twelve general syntactic complexity indices. These indices included mean sentence length, the proportions of coordinate and subordinate clauses per overall number of clauses, the proportions of specific types of subordination (infinitive clauses, adverbial clause modifiers, and relative, gerund, and participle clauses), and measures of phrasal "depth" (i.e., the maximum and mean nesting depth of a syntactic phrase, as well as the number of phrases with "shallow" nesting depth). The results of the study supported most of the observations of the previous study by Kisselev and Alsufieva (2017). Another study examined learner errors using task condition (timed or untimed) and language learning background (HL or FL) as independent variables.
Comparing the rate and types of errors by group and by time constraint allowed the author to discuss the results of the study in light of the central role that early/late exposure to language plays in language attainment, both in possible representations of nominal functional features in two groups of learners and in processing constraints that the two groups of learners may be subject to in timed task conditions.
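Indices like those used in these studies can be computed directly from dependency-parsed data. The following sketch is my own toy illustration, not the authors' in-house scripts: it derives mean sentence length and the proportion of subordinate clauses from sentences represented as (form, deprel) pairs, approximating clauses by Universal Dependencies clausal relations.

```python
# Illustrative sketch (not the authors' in-house scripts): computing two
# syntactic complexity indices over dependency-parsed sentences.
# Each sentence is a list of (form, deprel) pairs; clauses are
# approximated by Universal Dependencies clausal relations.

SUBORDINATE_DEPRELS = {"acl", "acl:relcl", "advcl", "ccomp", "xcomp", "csubj"}
CLAUSE_DEPRELS = SUBORDINATE_DEPRELS | {"root", "conj", "parataxis"}

def mean_sentence_length(sentences):
    """Mean number of tokens per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)

def subordination_ratio(sentences):
    """Proportion of subordinate clauses per overall number of clauses."""
    clauses = sub = 0
    for sent in sentences:
        for _form, deprel in sent:
            if deprel in CLAUSE_DEPRELS:
                clauses += 1
                if deprel in SUBORDINATE_DEPRELS:
                    sub += 1
    return sub / clauses if clauses else 0.0

# Toy parsed data: two sentences, the second with one subordinate clause.
corpus = [
    [("Ona", "nsubj"), ("chitaet", "root"), ("knigu", "obj")],
    [("Ja", "nsubj"), ("znaju", "root"), ("chto", "mark"),
     ("ona", "nsubj"), ("chitaet", "ccomp")],
]
print(mean_sentence_length(corpus))  # 4.0
print(subordination_ratio(corpus))   # 1 subordinate clause of 3 clauses
```

Measures such as nesting depth would additionally require the head index of each token, which UDPipe's full output supplies.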
As the studies reviewed above demonstrate, a corpus study may be more or less computationally sophisticated, depending on the research question and the state of the data. The Russian Learner Corpus (RLC) is readily available in raw and POS-tagged forms, and at least a significant part of the corpus is set to be error-tagged.
In a recent study, Eremina (2020) utilized the tagged parts of the RLC (regardless of the learners' L1 background), drawing on the error tag "Idiom," which marks an infelicitous multiword expression. The researcher categorized the extracted infelicitous expressions into two main types, structural and semantic, and then analyzed the sub-types further, hypothesizing about the nature of each error. Although the study does not venture to implement any statistical procedures, it lays the foundation for subsequent statistical analyses of various types of phraseological expressions in the language of L2 learners of Russian.
Given the increased attention that the fields of SLA and language pedagogy are paying to L2 learners' ability to successfully use formulaic expressions in their target language, studies that address the development of phraseological complexity in L2 Russian are much needed.
While the work conducted by the RLC team requires manual tagging, the field of computational linguistics is grappling with issues of automatic error detection and correction.
A number of research projects have been devoted to the methodological issue of automatic error detection in morphologically rich languages, including Russian (Rosen et al., 2014; Rozovskaya; Roth, 2018). The more learner corpora are available to these researchers, the better they can train computational models to recognize specific developmental patterns in language data. These corpora are becoming an important tool for Russian language researchers and teachers, as investigations of these corpora have the potential to significantly enrich our understanding of the developmental paths of Russian language learners, aid in assessment practices, and help evaluate instructional practices. To bring this promise to full fruition, however, the field needs more, and larger, learner corpora and many more corpus-based research studies. In the following section, I describe some practical considerations and specific steps in creating custom-built learner corpora and conducting corpus-based studies.

Data Collection and Corpus Compilation
Not every language dataset can be called a corpus; in fact, corpus compilation requires careful consideration and considerable planning on the part of the researcher. The specific principles for collecting and processing the data that are entered into a corpus bear as much importance in a corpus-based study as the computational methods used in the analysis.
These principles include the authenticity and size of the language data, as well as the systematicity of data selection and the representativeness and balance of the data.
Authenticity of corpus data. One of the most important principles of corpus linguistics is its focus on authentic language, i.e., language as it is used by its speakers in authentic communicative contexts; investigating authentic language, rather than language samples created for linguistic experiments, is thought to overcome the potential biases that creep into data collected in experimental settings. Many contemporary corpora that are collected for specific purposes, especially learner corpora, effectively represent elicited data.
Nonetheless, these elicited data come in the form of elicited narratives, interviews, and other types of situationally grounded discourses. Authenticity also allows the inclusion of contextual and situational aspects into the analysis by recording them as meta-information.
To ensure that the results of a corpus analysis are generalizable, corpora are normally large. At the same time, corpus size is a relative standard: on the one hand, corpora need to be large enough to allow for the application of statistical analyses that yield generalizable results; on the other, they can be smaller if the aim of the corpus is narrowed to a specific research question or a local context. Thus, a set of classroom essays and/or oral presentations collected at regular intervals during an academic year, or even a semester, from the same group of students may become a corpus used to assess the students' progress or the effectiveness of instructional approaches in that specific instructional context (Biber; Conrad; Reppen, 2004).
Systematic data selection ensures that data sampling is not random and is clearly relevant to the research questions. Note that even in the case of a small-scale classroom-based corpus, a teacher-researcher must pay attention to the systematicity of the data collection, considering the intervals at which the data are collected, the mode of data collection (e.g., at home or in class, hand-written or typed), and the type of data (e.g., the genre and modality of language production). The systematicity principle is inherently connected to the principle of data representativeness, which ensures that the data found in the corpus adequately represent the language or language variety under investigation.

And, finally, the data entered into the corpus have to be balanced across individual authors, text types, registers, modes, etc. For example, in the case of classroom-based corpora, the researcher must ensure that the number of texts authored by the participating students is reasonably equal, that these texts are somewhat similar in length, and/or that no one textual genre or mode is over-represented in the corpus; this is done to avoid the potential effect of one (or a small subset) of the text parameters. Compilers of large-scale corpora often must go to great lengths to ensure that the corpus data are balanced, so as to guarantee the effectiveness and meaningfulness of subsequent analytical procedures.
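A simple balance check of this kind can be automated. The sketch below is an illustration under invented metadata fields (not a real corpus schema): it tallies texts per author and per genre so that over- or under-represented categories stand out.

```python
# Illustrative sketch: checking the balance of a small classroom corpus.
# Each text is described by a metadata dict; the field names are invented
# for this example.
from collections import Counter

texts = [
    {"author": "S01", "genre": "essay", "words": 310},
    {"author": "S01", "genre": "narrative", "words": 280},
    {"author": "S02", "genre": "essay", "words": 295},
    {"author": "S02", "genre": "narrative", "words": 320},
    {"author": "S03", "genre": "essay", "words": 150},
]

per_author = Counter(t["author"] for t in texts)
per_genre = Counter(t["genre"] for t in texts)
mean_len = sum(t["words"] for t in texts) / len(texts)

print(per_author)  # S03 has contributed fewer texts than the others
print(per_genre)   # essays are slightly over-represented
print(mean_len)    # individual text lengths can be compared to this mean
```

Running such tallies before analysis flags imbalances (here, student S03's single short essay) that would otherwise skew group-level comparisons.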
As one can surmise based on the principles described above, a corpus is not just any (large) set of linguistic data; a language corpus is a sizable and machine-readable, systematically compiled, balanced collection of authentic texts that are representative of a language or a specific language variety. The following subsection reviews how a researcher can further process and analyze the corpus data.

Corpus Annotation
Once the corpus is well designed and the data are collected, they must then be systematically described. As mentioned in the previous section, text description is provided in the form of meta-tags, which typically accompany each text or file in the corpus. Such meta-tags can include the name of the text author (or any unique text ID such as a pseudonym or a number), biodata (age, gender, first language), date of creation/occurrence, genre of the text, and any other metadata that may be useful for the purposes of the corpus. Metadata descriptors can then be used as variables in analyses of the corpus data. The RULEC corpus, for instance, records various text and learner characteristics in the "Header Identification Box," as illustrated below (see Illustration 1). Using such information can help the researcher group data along some of these parameters and/or, more generally, account for these learner and text parameters as variables potentially affecting the linguistic features of the production.
Illustration 1. RULEC corpus text header ID. Reprinted with permission from Alsufieva et al. (2012).

While metadata are a sine qua non of corpus design and compilation, additional information may also be added to label, or annotate, words, sentences, and any longer or shorter meaningful units of text. Annotation (or mark-up) can provide different kinds of information about text-level units, including morpho-syntactic information (e.g., part-of-speech tags, as well as person, number, gender, case, voice, tense, and aspect), syntactic information (e.g., sentence parsing), semantic information (e.g., word-sense disambiguation, animacy, count/non-count), discoursal information (e.g., speech acts), error tags, and/or any other information needed for a research-specific corpus.
This additional information is "attached" to relevant linguistic units in the form of tags, which is why annotation is often referred to as tagging. See Table 1 for an example of a learner sentence parsed with the UDPipe parser. As discussed in the previous section, the level of annotation should first and foremost be determined by the research focus of a corpus project; various annotation schemas, whether commercially or publicly available or custom-built, may be applied to the data.
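UDPipe's output follows the CoNLL-U convention, in which each token occupies one line of tab-separated fields. As a minimal illustration (the example analysis is invented for this sketch, not drawn from RULEC), the following Python snippet splits one such token line into named fields:

```python
# Minimal sketch: reading one token line of CoNLL-U output, the format
# produced by UDPipe. The example analysis is invented for illustration;
# the field names follow the CoNLL-U specification.

CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

def parse_token_line(line: str) -> dict:
    """Split one tab-separated CoNLL-U token line into named fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(CONLLU_FIELDS, values))

# An invented analysis of the word form "chitaet" ('reads'):
token = parse_token_line(
    "2\tchitaet\tchitat'\tVERB\t_\t"
    "Aspect=Imp|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\t_"
)
print(token["lemma"], token["upos"], token["feats"])
```

Once tokens are represented this way, any of the tags (lemma, part of speech, morphological features, dependency relation) can serve as a search key or a grouping variable.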


Corpus Analytic Tools and Procedures
A well-compiled and well-described corpus can be subjected to an array of possible statistical procedures; the majority of these fall under some type of data retrieval, frequency counting, and statistical analysis. These analyses are conducted with the help of corpus-analytic software programs, which can be stand-alone (downloadable onto one's personal computer) or web-based. The most commonly used stand-alone programs are the license-based WordSmith Tools (Scott, 2016) and the freely downloadable AntConc (Anthony, 2019). A host of web-based tools provide similar analytic procedures (see, for example, the suite of tools The NLP Tools for Social Sciences, Kyle; Crossley, 2015, or LancsBox, Brezina et al., 2020). The functionality of these programs varies, but effectively they are all designed to provide language researchers with quick, automatic, and meaningful ways of sorting, extracting, and analyzing corpus data.
Utilizing such computational tools, a researcher can conduct various analytical procedures.
Some of the common procedures that help analyze corpus data include the following. Descriptive statistics of length-based measures, such as mean sentence and clause length, are widely used as indices of linguistic complexity (Norris; Ortega, 2009; Bulté; Housen, 2012; Polat et al., 2019; Kisselev et al., forthcoming, inter alia). Even word length has been shown to grow with proficiency level in Russian (Kisselev et al., forthcoming).
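Measures like these are straightforward to compute. The sketch below is a toy illustration over an invented transliterated sample (not a real learner text): it builds a word-frequency list and computes mean word length.

```python
# Illustrative sketch: a frequency list and mean word length over a toy
# sample. The transliterated sentence is invented for this example; real
# studies would operate over whole (often lemmatized) corpora.
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Frequency list of lower-cased word forms."""
    return Counter(text.lower().split())

def mean_word_length(text: str) -> float:
    """Mean length, in characters, of the word forms in the text."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

sample = "ona chitaet knigu i ona pishet pis'mo"
print(word_frequencies(sample).most_common(1))  # [('ona', 2)]
print(round(mean_word_length(sample), 2))       # 4.43
```

The same few lines scale to whole corpora once the texts are read from files, which is essentially what tools such as AntConc automate behind their interfaces.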
Descriptive statistics also often include information on the type/token ratio (TTR), that is, the proportion of unique words (lemmas) or word forms relative to all words in the corpus; TTR serves as a basic measure of lexical diversity.

Word lists are also a useful starting point for many more qualitative inquiries, providing a first glance at learner language and at some patterns in lexical knowledge. One can quickly assess the over-use/under-use of lexical items, see errors and error patterns, or simply scout the lexical data for further analysis of patterns of lexical use. For instance, one of the first Russian learner corpus studies, conducted by Pavlenko and Driagina (2007), focused on the acquisition of emotion vocabulary by L2 speakers of Russian (with English L1). The researchers collected three small corpora of oral narratives produced by American Russian language students (in Russian), Russian monolinguals (in Russian), and American monolinguals (in English). The authors compared the frequencies and appropriateness of emotion words (e.g., rasstraivaetsja 'gets upset,' grustnoe 'sad') and their stems (e.g., rasstra-/rasstro- 'upset,' grust- 'sad') across the three groups and found that, unlike Russian monolinguals, who showed a strong preference for verbs when describing emotion states, the learners preferred adjectival constructions in Russian (similar to monolingual Americans speaking in L1 English); the learners also used a smaller range of emotion words and often confused or violated conceptual restrictions on the use of emotion vocabulary (e.g., by employing razozlilas' 'got mad' instead of rasstroilas' 'got upset'). An array of research has followed this line of inquiry.

Relatedly, colligation refers to the phenomenon of formulaicity in grammatical, rather than lexical, constructions. For example, the constructions "igrat' v + accusative" and "igrat' na + prepositional" are colligations. Apresjan (2017) is a fitting illustration of colligation research on L2 and HL Russian.
The study investigates Russian possessive constructions with and without the overtly expressed existential verb est', using data from the RLC. The corpus search was formulated as "u + gen (noun, pronouns) + est'" and "u + gen (noun, pronouns) + nom (noun)" (Apresjan, 2017, p. 86). The author analyzed the extracted concordance lines with an eye toward understanding whether specific semantic and pragmatic rules govern the usage of these constructions by HL and L2 learners. Specifically, the L2 learners made twice as many errors with the constructions with unexpressed est', suggesting that L2 learners of Russian may require additional instruction with regard to this structure.
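A query of this kind can be approximated over POS- and case-tagged tokens with a simple pattern match. The sketch below is my own toy illustration with invented tags, not the RLC query engine: it flags sequences of the preposition u followed by a genitive nominal and the existential est'.

```python
# Toy sketch of a construction search like "u + gen + est'": tokens are
# (form, tag) pairs with invented tags; this is not the RLC query engine,
# only an illustration of pattern matching over tagged text.

def find_u_gen_est(tokens):
    """Return start indices of 'u' + genitive nominal + est' sequences."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, _t1), (w2, t2), (w3, _t3) = tokens[i:i + 3]
        if (w1.lower() == "u"
                and t2.startswith(("NOUN", "PRON"))
                and "Gen" in t2
                and w3.lower() == "est'"):
            hits.append(i)
    return hits

# "U menja est' kniga" -- 'I have a book' (transliterated, toy tags)
sentence = [
    ("U", "ADP"),
    ("menja", "PRON|Case=Gen"),
    ("est'", "VERB"),
    ("kniga", "NOUN|Case=Nom"),
]
print(find_u_gen_est(sentence))  # [0]
```

In practice, corpus query interfaces express the same pattern declaratively (e.g., CQL-style queries), but the underlying matching logic is of this shape.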
The development of L2 learners' phraseological abilities is an area of increased interest in the field of SLA (Paquot; Granger, 2012). Potential research in this area is now made easier by the development of phrasal dictionaries (e.g., Slovar' russkoj idiomatiki, Kustova, n.d.; Slovar' glagol'noj sočetaemosti nepredmetnyx imjon russkogo jazyka, Biriuk; Gusev; Kalinina, n.d.) and platforms for investigating collocations and colligations in large standard corpora (e.g., CoCoCo, Kopotev, 2020), which provide specific information on lexical and grammatical patterns in standard Russian and can serve as a baseline in the analysis of learner data.
The procedures described above are only a few of many. The number and sophistication of corpus-based procedures available today is continuously expanding; however, the main purpose of these tools is to allow a researcher to engage with large quantities of authentic data, and to extract and examine multiple samples of linguistic units produced by speakers and writers of the language varieties in focus. By extracting, sorting, and analyzing (statistically or manually) the linguistic structures chosen for analysis, the researcher can uncover regularities and patterns of language use that otherwise escape the intuitions of language researchers and language teachers.

Conclusion and Desiderata
By and large, the task of second/heritage language researchers is to understand the mental processes that underlie language production and development in L2 and HL users.
Language corpora composed of linguistic data produced in authentic settings for communicative purposes have become instrumental in providing language researchers with evidence for the interpretation of these mental processes and the mental representations of knowledge. Coupled with sophisticated computational tools that allow for fast and reliable