
RASTROS: A large eye-tracking corpus of reading data of Higher Education students in Brazil including norms of predictability

Grant number: 19/09807-0
Support Opportunities: Regular Research Grants
Duration: August 01, 2019 - July 31, 2021
Field of knowledge: Interdisciplinary Subjects
Principal Investigator: Sandra Maria Aluísio
Grantee: Sandra Maria Aluísio
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated researchers: Elisângela Nogueira Teixeira; Erica dos Santos Rodrigues; Gustavo Henrique Paetzold; Katerina Lukasova; Maria da Graca Campos Pimentel; Maria Teresa Carthery Goulart; Renê Alberto Moritz da Silva e Forster

Abstract

Eye-tracking corpora are now frequently used to study the processing costs of linguistic structures, for example to (i) evaluate models and metrics of syntactic difficulty, (ii) improve or evaluate computational models of simplification via sentence compression, and (iii) evaluate the quality of machine translation with objective metrics. However, only a few such resources exist, and only for a small number of languages: English (Luke and Christianson, 2018; Cop et al., 2017), English and French (Kennedy et al., 2013), German (Kliegl et al., 2004), Russian (Laurinavichyute et al., 2018), Hindi (Husain et al., 2015) and Chinese (Yan et al., 2010). For Portuguese, there is no large eye-tracking corpus with predictability norms comparable to those cited above. This gap holds back research in Cognitive Psychology, Psycholinguistics and Natural Language Processing (NLP). This project has two objectives: (i) to create and make publicly available a large corpus of eye movements recorded during silent reading of short paragraphs in Portuguese by higher education students in Brazil, with predictability norms that estimate, for each word in a paragraph, the predictability of its full orthographic form (traditional Cloze scores) and of its morphosyntactic and semantic information; and (ii) to help disseminate eye-tracking research in both Psycholinguistics and NLP. The methodology for developing the RASTROS corpus will follow the same steps as the Provo project (Luke and Christianson, 2018), which used short paragraphs of various genres, with participants reading 55 paragraphs in the eye-tracking test and 5 paragraphs in the Cloze test, and with each word of the corpus being read by at least 40 students. For RASTROS, the 50 paragraphs of the corpus were taken from various sources in the journalistic and scientific dissemination genres, at a rate of 35% for newspaper articles and 15% for scientific news.
The 50 paragraphs were selected from a larger corpus of 100 paragraphs to allow the greatest diversity of linguistic factors that are relevant to evaluating processing costs and that reflect the reading process: (i) structural complexity of the sentence (simple vs. compound sentences); (ii) verb transitivity; (iii) subject and object animacy; (iv) sentence types (active / passive / relative); and (v) mechanisms for building coreference relations, among others. RASTROS will use a high-accuracy eye tracker, the EyeLink 1000 Desktop Mount. Stimuli will be presented with the Experiment Builder software, and data will initially be processed with Data Viewer or with other software that integrates with Psychtoolbox-3 (Matlab) and PyGaze. We will also evaluate and compare eye-movement capture with the FOVE headset, which costs 2% of the price of the EyeLink 1000 device, in order to broaden the use of eye tracking in Psycholinguistics and NLP research. We will use 4 semantic similarity methods: Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997), Latent Dirichlet Allocation (LDA) (Blei et al., 2003), Random Projections (RP) (Sahlgren, 2005), and embeddings trained on the 1.3-billion-word corpus of Hartmann et al. (2017). The words will be annotated with the morphosyntactic categories of the nlpnet tagger, which is based on neural networks (Fonseca et al., 2015). (AU)
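As an illustration of the predictability norms described in the abstract, the following is a minimal sketch of how a Cloze score could be computed for one word position: the proportion of participants whose response matches the target word. This is not the project's own code. The function name `cloze_score`, the `key` parameter, the toy tag dictionary and the sample responses are all hypothetical. Passing a different `key` (here, a toy part-of-speech lookup) gives a rough analogue of morphosyntactic predictability, where a response counts as a match if it shares the target's category.

```python
def cloze_score(target, responses, key=str.lower):
    """Proportion of non-empty responses that match the target under `key`.

    `key` maps a word to the feature being compared: the lowercased word
    itself for orthographic Cloze, or a category label for a
    morphosyntactic variant. All names here are illustrative assumptions.
    """
    cleaned = [r.strip() for r in responses if r.strip()]
    if not cleaned:
        return 0.0
    target_key = key(target.strip())
    matches = sum(1 for r in cleaned if key(r) == target_key)
    return matches / len(cleaned)


# Hypothetical responses from 40 participants for one word position.
responses = ["casa"] * 30 + ["rua"] * 6 + ["correr"] * 4

# Toy part-of-speech lookup (an assumption, standing in for a real tagger).
toy_tags = {"casa": "N", "rua": "N", "correr": "V"}


def pos_key(word):
    return toy_tags.get(word.lower(), "UNK")


orthographic = cloze_score("casa", responses)
morphosyntactic = cloze_score("casa", responses, key=pos_key)
print(orthographic, morphosyntactic)  # 0.75 0.9
```

In this toy case, 30 of 40 responses reproduce the target word exactly, while 36 of 40 are nouns like the target, so the category-level score is higher than the orthographic one.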

