
Master's Thesis Berkay Ugur Senocak

Last modified Jun 3

Abstract

Machine-generated texts are ubiquitous. With the rapid rise of LLMs, the risks posed by machine-produced text have become more evident. Prior work has introduced many machine-generated text detectors, focusing on domains such as education, science, and medicine. Distinguishing machine-generated text in different domains may require different datasets. Our work primarily investigates the effect of the domain on the machine-generated text detection task. To run our experiments, we first curated a large dataset covering diverse texts from different domains, languages, source models, and time frames. We evaluated four detection methods, namely Log-likelihood, DistilBERT, ConDA, and DetectGPT, in four major settings (in-domain, cross-domain, complete, and the challenge set). Our findings show that ConDA and DistilBERT are exceptionally capable of detecting machine-generated text across domains. Beyond the evaluation phase, we analyzed our dataset both with and without predictions to gain further insights. Our analyses revealed structural and linguistic disparities between human-written and machine-generated texts. This thesis aims to provide valuable insights for researchers and practitioners.
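The Log-likelihood baseline mentioned above scores a text by how predictable its tokens are under a language model: machine-generated text tends to receive higher average log-probability than human-written text. A minimal sketch of this idea, with the token probabilities and the decision threshold as illustrative assumptions (a real detector would obtain the probabilities from a causal language model such as GPT-2):

```python
import math

def avg_log_likelihood(token_probs):
    """Mean log-probability of a token sequence under some language model.

    token_probs: per-token probabilities assigned by the model (0 < p <= 1).
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def classify(token_probs, threshold=-2.5):
    """Flag text as machine-generated when the model finds it unusually
    predictable. The threshold here is illustrative, not from the thesis;
    in practice it would be tuned on a validation set."""
    return "machine" if avg_log_likelihood(token_probs) > threshold else "human"

# Highly predictable tokens -> likely machine-generated.
print(classify([0.9] * 10))   # "machine"
# Surprising, low-probability tokens -> likely human-written.
print(classify([0.01] * 10))  # "human"
```

The same scoring idea underlies DetectGPT, which additionally compares the text's log-likelihood against that of perturbed rewrites rather than using a fixed threshold.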

 

 

Research Questions

  • RQ 1: How can we curate a dataset for machine-generated text detection from existing publicly available datasets?

  • RQ 2: How can we create a large-scale evaluation of detection methods?

    – Which metrics should be employed to compare various detectors?

  • RQ 3: What knowledge can we acquire from the evaluation?

  • RQ 4: What insights can we gain from the analyses?

    – Can we spot differences in machine-generated text at the data level, without any evaluation?

    – Are there linguistic patterns associated with machine-generated text?
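One metric commonly used to compare detectors that output continuous scores (as the sub-question of RQ 2 asks) is AUROC, since it is threshold-free and robust to class imbalance. A self-contained sketch of the rank-based formulation; the example scores and labels are invented for illustration:

```python
def auroc(scores, labels):
    """Area under the ROC curve via its rank interpretation: the probability
    that a randomly chosen positive example scores higher than a randomly
    chosen negative one (ties count half).

    labels: 1 = machine-generated, 0 = human-written.
    scores: detector outputs, higher = more likely machine-generated.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A detector that ranks all machine texts above all human texts is perfect:
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
# One misranked pair out of four drops the score to 0.75:
print(auroc([0.9, 0.1, 0.8, 0.2], [1, 0, 0, 1]))  # 0.75
```

Production code would typically use `sklearn.metrics.roc_auc_score`, which computes the same quantity efficiently.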

 
