Corpus ROBOT-TALK (English)

The ROBOT-TALK corpus was created as a language sample for the quantitative and qualitative contrastive linguistic analyses needed to answer the project's main question: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?

This is a comparable monitor corpus in Spanish. It consists of author-comparable texts (human, Bard, Claude, GPT-3.5-Turbo, GPT-4, Mixtral) in three main genres (news, film reviews, and scientific articles specialised in linguistics).

Sample of the corpus

Characteristics of the corpus

Texts written in Spanish
Comparable by author
- Human
- Bard
- Claude
- GPT-3.5-Turbo
- GPT-4
- Mixtral
Sources
- Scientific journals in linguistics
- Online news
- Film review websites
Genres
- Scientific articles
- News
- Film reviews

Sources of the corpus

Scientific articles on linguistics
- RSEL, Revista de Investigación Lingüística, Revista electrónica de lingüística aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …
News
- RTVE, EFE
Film reviews
- Filmaffinity

Description of the corpus

Genre of the text	Human	Bard	Claude	GPT-3.5-Turbo	GPT-4	Mixtral	No. of texts by genre
Scientific articles	144	90	0	90	95	90	509
News	171	171	60	111	171	111	795
Film reviews	160	160	65	95	160	95	735
Total no. of texts by author	475	421	125	296	426	296	2039
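
As a quick consistency check, the row and column totals in the table can be recomputed from the per-cell counts. The following is a minimal Python sketch, not part of the corpus distribution: the names AUTHORS, COUNTS, totals_by_genre and totals_by_author are hypothetical and chosen only for illustration, and the numbers are copied from the table.

```python
# Text counts per genre, in the author order given by AUTHORS.
# Values are copied from the corpus description table; the variable
# and function names are illustrative assumptions, not a corpus API.
AUTHORS = ["Human", "Bard", "Claude", "GPT-3.5-Turbo", "GPT-4", "Mixtral"]
COUNTS = {
    "Scientific articles": [144, 90, 0, 90, 95, 90],
    "News": [171, 171, 60, 111, 171, 111],
    "Film reviews": [160, 160, 65, 95, 160, 95],
}


def totals_by_genre(counts):
    """Sum each row: number of texts per genre across all authors."""
    return {genre: sum(row) for genre, row in counts.items()}


def totals_by_author(counts, authors):
    """Sum each column: number of texts per author across all genres."""
    return {
        author: sum(row[i] for row in counts.values())
        for i, author in enumerate(authors)
    }


if __name__ == "__main__":
    by_genre = totals_by_genre(COUNTS)
    by_author = totals_by_author(COUNTS, AUTHORS)
    for genre, total in by_genre.items():
        print(f"{genre}: {total}")
    for author, total in by_author.items():
        print(f"{author}: {total}")
    print(f"Corpus total: {sum(by_genre.values())}")
```

Both marginal sums should add up to the same corpus total, which is a simple way to catch transcription errors when the counts are updated.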
