Research Projects

ROBOT-TALK Corpus (English)

The ROBOT-TALK corpus was created to serve as a language sample for the quantitative and qualitative contrastive linguistic analyses that address the project's main question: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?

This is a comparable monitor corpus in Spanish. It consists of texts that are comparable by author (human, Bard, Claude, GPT-3.5-Turbo, GPT-4, Mixtral) and cover three main genres: news, film reviews and scientific articles specialised in linguistics.

 

  Sample of the corpus

 

Characteristics of the corpus

  • Texts written in Spanish
  • Comparable by author
    • Human
    • Bard
    • Claude
    • GPT-3.5-Turbo
    • GPT-4
    • Mixtral
  • Sources:
    • Scientific journals in linguistics
    • Online news
    • Film review websites
  • Genres
    • Scientific articles
    • News
    • Film reviews

Sources of the corpus

  • Scientific articles on linguistics
    • RSEL, Revista de Investigación Lingüística, Revista electrónica de lingüística aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …
  • News
    • RTVE, EFE
  • Film reviews
    • Filmaffinity

 

Description of the corpus

Comparable corpus: number of texts by genre (rows) and author (columns)

Genre                Human  Bard  Claude  GPT-3.5-Turbo  GPT-4  Mixtral  Total
Scientific articles    144    90       0             90     95       90    509
News                   171   171      60            111    171      111    795
Film reviews           160   160      65             95    160       95    735
Total no. of texts     475   421     125            296    426      296   2039
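
The row and column totals in the table above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied by hand from the table, and the names `AUTHORS`, `COUNTS`, `totals_by_genre` and `totals_by_author` are hypothetical helpers, not part of any tooling distributed with the corpus.

```python
# Counts copied from the table above: one row per genre,
# one value per author in the order given by AUTHORS.
AUTHORS = ["Human", "Bard", "Claude", "GPT-3.5-Turbo", "GPT-4", "Mixtral"]
COUNTS = {
    "Scientific articles": [144, 90, 0, 90, 95, 90],
    "News": [171, 171, 60, 111, 171, 111],
    "Film reviews": [160, 160, 65, 95, 160, 95],
}


def totals_by_genre(counts):
    """Sum each row: number of texts per genre across all authors."""
    return {genre: sum(row) for genre, row in counts.items()}


def totals_by_author(counts, authors):
    """Sum each column: number of texts per author across all genres."""
    return {
        author: sum(row[i] for row in counts.values())
        for i, author in enumerate(authors)
    }


genre_totals = totals_by_genre(COUNTS)
author_totals = totals_by_author(COUNTS, AUTHORS)
grand_total = sum(genre_totals.values())

print(genre_totals)
print(author_totals)
print(grand_total)  # 2039
```

Keeping the counts in a genre-by-author mapping also makes it simple to see how unevenly the authors are represented, for example that Claude contributes no scientific articles, which may matter when comparing authors within a single genre.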