Let’s talk about language - and its role for replicability

This blog post summarizes the ReproducibiliTea in the HumaniTeas session on 16/06/2025 with guest speakers Xenia Schmalz, Anna Yi Leung & Johannes Breuer
Replicability
Terminology
Transparency
English
Author

Tetiana Hnatiak

Published

July 28, 2025

The topic of the session

The talk on 16/06/2025 was delivered by three speakers (Xenia Schmalz, Anna Yi Leung & Johannes Breuer), members of the Self-Learning Systems Lab (University of Cologne) and the GESIS (Leibniz Institute for the Social Sciences, Cologne).

Preparatory reading

The talk was based on a recently published paper: Schmalz et al. (2025).

The paper is published in open access and can be read here.

Schmalz et al. (2025) argue that language is an underestimated (and often overlooked) criterion of transparency and replicability in scientific research (see also the discussion and further thoughts sections).

The authors name the main aspects that are to be improved in science through language for research to be more reliably replicable:

  • statistical methods
  • transparency
  • preregistration
  • data sharing

Having said that, language persists to be the main aspect of transparency in research, including when it comes issues such as as vague terminology, linguistic ambiguities, translation-cultural barriers, which are going to be further discussed in this blog.

Main points

Within the scope of the session, language was considered as:

  • a medium or research,

  • language as a tool for research,

  • language as an object of study.

Natural language is inherently ambiguous in terms of being imprecise and context dependent (Sterner 2022). Xenia gave the following example: “Class affects political participation”. The sentence is grammatically and lexically unambiguous, but, in an isolated context, a hearer/reader would not understand what truly is meant by each sentence constituent: which class? How does it affect? And what or whom? Which political participation?

That is where language already starts to play its role in research: how do we formulate hypotheses and interpret data? How do we operationalize variables? For example, if we have the sentence in a questionnaire “I feel blue.” - where do we locate it on the scale from 1 to 10 in order to measure it later?

The same example (“I feel blue”) was cleverly used to reflect the issue of translation: how to use the same survey across different language populations? How do we measure the same constructs?

Figure from the presentation of the session (Schmalz, X., Breuer, J., and Leung, A. Y. 2025).

Language as data

Language as data is mainly used in linguistics and communication science but also in social and behavioral sciences, where it would be transformed into video -> audio -> text

Johannes names as a reason for such usage (language as data in research) the rise of computational methods that allow for the (semi-)automated collection, processing, and analysis of large amounts of text (e.g. text mining, natural language processing, machine learning, and LLMs).

Recommendations & Ways Forward

As possible ways towards linguistic transparency Johannes names the following (Schmalz, X., Breuer, J., and Leung, A. Y. 2025):

  • Community driven refinement of term definitions (for clearer conceptualization): so the terms like ‘stress’ and ‘well-being’ have shared and precise meaning across disciplines

  • Formalization of research questions and hypotheses (for effective communication): setting and using controlled vocabulary for questionnaires

  • Material and data sharing for comparable replications (across scientific communities): making original study materials and data openly accessible, to contribute to successful replication

Discussion

The discussion was mainly about the role of language (as a concept) in the research. One interesting point that I remembered was about selective generalization of conclusions from different studies depending which language they contemplate, so-called ‘linguistic double-standard’: when a study is written in English about the English language, it is common practice to accept the results as something universal or fundamental for all languages. In contrast, when a study contemplates, for example, the Cantonese language, the scientific community tends to pass it as something rather specific for this particular language.

Apart from that, the stigma around replication was brought up, as some journals still rarely accept replication studies. Instead, they are interested in ‘innovative topics’. As we can imagine, such approach does not push the science forward.

Personally interesting points

What I found particularly interesting, was the comment from one of the participants: as we already established, publishers are often reluctant when it comes to replication studies. With this in mind, we can assume that it is a bias on their end: what if one does not mention explicitly in the title that the methods are replicated using different data. Then the chances of publishing are unbelievably and unfairly higher. Here comes the linguistic bias: is it all about how you name your paper?

Further personal thoughts

Attending one conference, I have heard and interesting note from a linguistics graduate who works in data science: “Linguists are truly demanded in the aforementioned field, because, being aware of possible linguistic and pragmatic ambiguities, they know exactly how to convey a message”. The fields of Large Language Models (LLMs), data analysis, and AI are full of STEM graduates, although these fields primarily have relation to theoretical linguistics as they derive in language: to understand mathematics one has to understand patterns. With this in mind, one of the first patterns that our brain developed was language.

In the end of the day, to communicate abstract complex ideas, one has to adapt to the listener’s perspective. To sell one’s idea in a successful consulting company, the one has to sell a pen, desirably in an eloquent way. It is all about pragmatics. How many misunderstood big minds, like Kafka, Van Gogh, Emily Dickinson, were lost to time, to only later be recognized by the further generations, when their descendants were finally ready for their revelations?

Poster with bold yellow text on a blue gradient background. The main text reads 'Language and Replicability' in large font, with smaller text in between that reads 'and its role in'.

Illustration created on canva.com with licence-free items. Created by the author on 2025/07/07.
Slides available

The presentation slides of the ReproducibiliTea in the HumaniTeas talk are available on the OSF repository of ReproducibiliTea in the HumaniTeas.

References

Schmalz, X., Breuer, J., and Leung, A. Y. 2025. “Let’s Talk about Language—and Its Role for Replicability.” Invited talk. Cologne: Invited talk. https://osf.io/k9wr6.
Schmalz, Xenia, Johannes Breuer, Mario Haim, Andrea Hildebrandt, Philipp Knöpfle, Anna Yi Leung, and Timo Roettger. 2025. “Let’s Talk about Language—and Its Role for Replicability.” Humanities and Social Sciences Communications 12 (1): 84. https://doi.org/10.1057/s41599-025-04381-2.
Sterner, Beckett. 2022. “Explaining Ambiguity in Scientific Language.” Synthese 200 (5). https://doi.org/10.1007/s11229-022-03792-x.