ROBOT-TALK Corpus (English)
The ROBOT-TALK corpus was created as a language sample for quantitative and qualitative contrastive linguistic analyses. These analyses address the main question of the project: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?
This is a comparable monitor corpus in Spanish. It comprises texts that are comparable by author (human, Bard, Claude, GPT-3.5-Turbo, GPT-4, Mixtral) across three main genres (news, film reviews, and scientific articles specialised in linguistics).
Sample of the corpus
Characteristics of the corpus
- Texts written in Spanish
- Comparable by author
  - Human
  - Bard
  - Claude
  - GPT-3.5-Turbo
  - GPT-4
  - Mixtral
- Sources
  - Scientific journals in linguistics
  - Online news
  - Film review websites
- Genres
  - Scientific articles
  - News
  - Film reviews
Sources of the corpus
- Scientific articles on linguistics
  - RSEL, Revista de Investigación Lingüística, Revista electrónica de lingüística aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …
- News
  - RTVE, EFE
- Film reviews
  - Filmaffinity
Description of the corpus
Comparable corpus: number of texts by author (columns) and genre of the text (rows)

| Genre of the text | Human | Bard | Claude | GPT-3.5-Turbo | GPT-4 | Mixtral | No. of texts by genre |
|---|---|---|---|---|---|---|---|
| Scientific articles | 144 | 90 | 0 | 90 | 95 | 90 | 509 |
| News | 171 | 171 | 60 | 111 | 171 | 111 | 795 |
| Film reviews | 160 | 160 | 65 | 95 | 160 | 95 | 735 |
| Total no. of texts | 475 | 421 | 125 | 296 | 426 | 296 | 2039 |
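
The row and column totals in the table can be checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the table, while the names `AUTHORS`, `COUNTS`, `totals_by_genre` and `totals_by_author` are illustrative and are not part of the corpus distribution.

```python
# Number of texts per genre, in author order, copied from the table above.
AUTHORS = ["Human", "Bard", "Claude", "GPT-3.5-Turbo", "GPT-4", "Mixtral"]
COUNTS = {
    "Scientific articles": [144, 90, 0, 90, 95, 90],
    "News": [171, 171, 60, 111, 171, 111],
    "Film reviews": [160, 160, 65, 95, 160, 95],
}


def totals_by_genre(counts):
    """Sum each row: total number of texts per genre."""
    return {genre: sum(row) for genre, row in counts.items()}


def totals_by_author(counts, authors):
    """Sum each column: total number of texts per author."""
    columns = zip(*counts.values())
    return {author: sum(column) for author, column in zip(authors, columns)}


genre_totals = totals_by_genre(COUNTS)
author_totals = totals_by_author(COUNTS, AUTHORS)
grand_total = sum(genre_totals.values())

print(genre_totals)
print(author_totals)
print(grand_total)
```

Both sets of totals add up to the same overall figure, so the row and column totals in the table are mutually consistent.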