
Automatic classification of written descriptions by healthy adults: An overview of the application of natural language processing and machine learning techniques to clinical discourse analysis



Discourse production is an important aspect of the evaluation of brain-injured individuals. We believe that studies comparing the performance of brain-injured subjects with that of healthy controls must use groups with comparable levels of education. We describe a pioneering application of machine learning methods to Brazilian Portuguese for clinical purposes, highlighting education as an important variable in the Brazilian scenario.


The aims were to describe how to: (i) develop machine learning classifiers that use features generated by natural language processing tools to assign descriptions produced by healthy individuals to classes based on years of education; and (ii) automatically identify the features that best distinguish the groups.


The proposed approach automatically extracts linguistic features from the written descriptions with the aid of two natural language processing tools: Coh-Metrix-Port and AIC. It also includes nine task-specific features: three new ones, two extracted manually, and four others (description time; type of scene described, simple or complex; presentation order, i.e., which type of picture was described first; and age). This study used the descriptions produced by 144 of the 200 healthy Brazilians of both genders studied in Toledo (reference 18).


A Support Vector Machine (SVM) with a radial basis function (RBF) kernel is the most recommended approach for the binary classification of our data, classifying three of the four initial classes. The CfsSubsetEval (CFS) feature selection method is a strong candidate to replace manual feature selection methods.
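To make the two techniques named above concrete, the following is a minimal sketch, not the authors' pipeline and not WEKA's implementation. It shows (a) the RBF kernel that an SVM of this kind evaluates between two feature vectors and (b) the correlation-based merit score that underlies CFS-style feature selection, combined with a simple greedy forward search. Several things here are assumptions made only for illustration: the use of Pearson correlation as the correlation measure, the greedy search strategy, the `gamma` value, the feature names and the toy data. WEKA's CfsSubsetEval may use a different correlation measure and search method.

```python
import math
from itertools import combinations


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance). gamma is illustrative."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean |feature-class correlation| and r_ff the mean
    |feature-feature correlation| within the subset.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels, tol=1e-12):
    """Forward search: add the feature that most improves merit; stop when none does."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        scored = [(cfs_merit(columns, labels, selected + [f]), f)
                  for f in sorted(remaining)]
        merit, feature = max(scored)
        if merit <= best + tol:
            break  # no candidate improves the subset
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best


# Hypothetical toy data: six descriptions, two education classes (0 = lower, 1 = higher).
columns = {
    "words_per_sentence": [5, 6, 7, 12, 13, 14],   # informative
    "redundant_copy": [5, 6, 7, 12, 13, 14],       # duplicates the feature above
    "description_time_s": [40, 35, 42, 38, 41, 36],  # weakly related to the class
}
labels = [0, 0, 0, 1, 1, 1]

selected, best = greedy_cfs(columns, labels)
print(selected, round(best, 3))
```

The merit score rewards features that correlate with the class and penalizes features that correlate with each other, which is why the exact duplicate and the weakly related timing feature are not added once one informative feature has been chosen.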

Key words: natural language processing; language tests; narratives; adults; educational status; age groups


Full text available only in PDF format.


  • 1
    Togher L. Discourse sampling in the 21st century. J Commun Disord 2001;34:131-150.
  • 2
    Andreetta S, Cantagallo A, Marini A. Narrative discourse in anomic aphasia. Neuropsychologia 2012;50:1787-1793.
  • 3
    Wills C, Capilouto GJ, Wright HH. Attention and off-topic speech in the recounts of middle-aged and elderly adults: a pilot investigation. Contemp Issues Commun Sci Disord 2012;39:105-113.
  • 4
    Cannizzaro MS, Coelho CA. Analysis of narrative discourse structure as an ecologically relevant measure of executive function in adults. J Psycholinguist Res 2013;42:527-549.
  • 5
    Cooper P. Discourse Production and Normal Aging: Performance on Oral Picture Description Tasks. J Gerontol 1990;45:210-214.
  • 6
    Ash S, Moore P, Antani S, McCawley G, Work M, Grossman M. Trying to tell a tale: Discourse impairments in progressive aphasia and frontotemporal dementia. Neurology 2006;66:1405-1413.
  • 7
    Smith E, Ivnik RJ. Normative neuropsychology. In: Petersen RC, editor. Mild cognitive impairment. New York: Oxford; 2003:63-88.
  • 8
    Marini A, Boewe A, Caltagirone C, Carlomagno S. Age-related Differences in the Production of Textual Descriptions. J Psycholinguist Res 2005;34:439-463.
  • 9
    Wright HH, Capilouto GJ, Koutsoftas A. Evaluating measures of global coherence ability in stories in adults. Int J Lang Commun Disord 2013;48:249-256.
  • 10
    Le Dorze G, Bédard C. Effects of Age and Education on the lexico-semantic content of connected speech in adults. J Commun Disord 1998;31:53-71.
  • 11
    Mackenzie C. Adult spoken discourse: the influences of age and education. Int J Lang Commun Disord 2000;35:269-285.
  • 12
    Neils J, Baris JM, Carter C, et al. Effects of age, education, and living environment on Boston Naming Test performance. J Speech Hear Res 1995;38:329-223.
  • 13
    Ardila A, Bertolucci PH, Braga LW, et al. Illiteracy: the neuropsychology of cognition without reading. Arch Clin Neuropsychol 2010;25:689-712.
  • 14
    Duong A, Ska B. Production of Narratives: Picture Sequence Facilitates Organization but not Conceptual Processing in Less Educated Subjects. Brain Cogn 2001;46:121-124.
  • 15
    Forbes-McKay KE, Venneri A. Detecting subtle spontaneous language decline in early Alzheimer's disease with a picture description task. Neurol Sci 2005;26:243-254.
  • 16
    Alves DC, Souza LAP. Performance de moradores da grande São Paulo na descrição da Prancha do Roubo dos Biscoitos. Rev Cefac 2005;7:13-20.
  • 17
    Parente MA, Capuano A, Nespoulous J. Ativação de modelos mentais no recontar de histórias por idosos. Psicol Reflex Crit [online] 1999;12:157-172.
  • 18
    Toledo CM. Variáveis sociodemográficas na produção do discurso em adultos sadios. Master's thesis. School of Medicine of the University of São Paulo; 2011.
  • 19
    Fraser K, Meltzer JA, Graham NL, et al. Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex 2014;55:43-60.
  • 20
    Roark B, Mitchell M, Hosom JP, Hollingshead K, Kaye J. Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans Audio Speech Lang Processing 2011;19:2081-2090.
  • 21
    MacWhinney B, Fromm D, Forbes M, Holland A. AphasiaBank: Methods for Studying Discourse. Aphasiology 2011;25:1286-1307.
  • 22
    Price LH, Hendricks S, Cook C. Incorporating Computer-Aided Language Sample Analysis into Clinical Practice. Lang Speech Hear Serv Sch 2010;41:206-222.
  • 23
    Graesser AC, McNamara DS, Louwerse MM, Cai Z. Coh-Metrix: Analysis of text on cohesion and language. Behav Res Methods Instrum Comput 2004;36:193-202.
  • 24
    Aluísio SM, Specia L, Gasperin C, Scarton CE. Readability Assessment for Text Simplification. In: NAACL 5th Workshop on Innovative Use of NLP for Building Educational Applications (BEA-2010), 2010, Los Angeles. Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications. New York: ACL 2010:1:1-9.
  • 25
    Scarton CE, Aluísio SM. Análise da Inteligibilidade de textos via ferramentas de Processamento de Língua Natural: adaptando as métricas do Coh-Metrix para o Português. Linguamática 2010;2(1):45-61.
  • 26
    Cunha A, Toledo CM, Scarton CE, Mansur L, Aluísio SM. Classificação Automática de Discurso Descritivo Escrito de Adultos Sadios: Referência para a Avaliação da Linguagem de Lesados Cerebrais. In: Encontro Nacional de Inteligência Artificial e Computacional, ENIAC 2013, 2013, Fortaleza. Anais do X Encontro Nacional de Inteligência Artificial e Computacional. Porto Alegre: SBC; 2013;1:1-12.
  • 27
    Semenza C, Cipolotti L. Neuropsicologia con carta e matita. Padova: Cleup Editrice Padova; 1989.
  • 28
    Biderman MTC. Dicionário Ilustrado de Português. Ática; 2005;1:344.
  • 29
    Bick E. The Parsing System Palavras: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. PhD thesis, Aarhus University; 2000.
  • 30
    Muniz MC, Laporte E, Nunes MGV. UNITEX-PB, a set of flexible language resources for Brazilian Portuguese. In: Anais do III Workshop em Tecnologia da Informação e da Linguagem Humana 2005;1:1-10.
  • 31
    Balage Filho P, Pardo T, Aluísio SM. An Evaluation of the Brazilian Portuguese LIWC Dictionary, 5 p. To be published in the Proceedings of STIL; 2013.
  • 32
    Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009;2:10-18.
  • 33
    Maziero EG, Pardo TAS. Automatic Identification of Multi-document Relations. In the (on-line) Proceedings of the PROPOR 2012 PhD and MSc/MA Dissertation Contest, Coimbra, Portugal 2012;17:1-8.
  • 34
    Armstrong E. Aphasic discourse analysis: the story so far. Aphasiology 2000;14:875-892.
  • 35
    Chand V, Baynes K, Bonnici L, Farias ST. Analysis of Idea Density (AID): A Manual. University of California, Davis; 2010. 44 p.
  • 36
    Chand V, Baynes K, Bonnici LM, Farias ST. A Rubric for Extracting Idea Density from Oral Language Samples. Curr Protoc Neurosci 2012; doi: 10.1002/0471142301.ns1005s58.

Publication Dates

  • Publication in this collection
    Sept 2014


  • Received
    05 Feb 2014
  • Accepted
    20 May 2014
Academia Brasileira de Neurologia, Departamento de Neurologia Cognitiva e Envelhecimento R. Vergueiro, 1353 sl.1404 - Ed. Top Towers Offices, Torre Norte, São Paulo, SP, Brazil, CEP 04101-000, Tel.: +55 11 5084-9463 | +55 11 5083-3876 - São Paulo - SP - Brazil