Beyond the Gold Standard: Transparency in qualitative corpus analysis (Nathan Dykes)

This blog entry summarises the ReproducibiliTea session on qualitative corpus analysis by Nathan Dykes on 12 May 2025.
Transparency
Corpus Linguistics
Methodology
English
Author

Chiara Zoe Eder, Wayne Lee

Published

May 12, 2025

Abstract

On 12 May 2025, Nathan Dykes joined ReproducibiliTea in the HumaniTeas for a talk and discussion about transparent qualitative corpus analysis and the importance of subjectivity. Using researcher triangulation and keyword/concordance analysis, he highlighted how differing interpretations emerge despite shared data.

Prior reading

The recommended prior reading, “[If on a winter’s night two researchers…: A challenge to assumptions of soundness of interpretation.](https://cris.unibo.it/bitstream/11585/792146/8/Volume-3_Marchi-Taylor.pdf)” by Anna Marchi and Charlotte Taylor, presented a quasi-experiment testing reliability and objectivity when combining corpus linguistics and critical discourse analysis. While corpus linguistics provides data-driven and replicable findings through the analysis of large bodies of text, critical discourse analysis offers a theory-driven and contextualized approach. To put this methodological mix into practice, the authors explored triangulation and its four types (methodological, data, investigator and theoretical). For their experiment, the two researchers independently analyzed a shared media corpus. They produced both convergent and dissonant findings, which emphasizes the influence of the researchers’ interpretation. They conclude that triangulation is valuable but does not, in fact, guarantee objectivity; the researcher’s interpretation of data and results remains significant.

Main points

In Dykes’ presentation, transparency in qualitative corpus analysis was one of the most important keywords. Transparency should be enhanced not by eliminating subjectivity but by navigating and accounting for it. For this, two analytical settings were presented: a top-down approach (keyword analysis) and a bottom-up approach (concordance analysis). In top-down keyword analysis, researchers start by comparing word frequencies between corpora to identify broad patterns, but interpretations and category labels often differ due to subjectivity. In bottom-up concordance analysis, patterns emerge through sorting and examining individual text lines, requiring flexible, intuitive decisions without a fixed workflow. Tools for transparency and for working with subjectivity are thus researcher triangulation and visualization. Nathan Dykes also presented the Python library FlexiConc, which supports concordance analysis and stores analytical processes in a shareable analysis tree. Visualization is used to highlight correspondence, differences in granularity and differing interpretations. The key takeaway was that subjectivity is not a flaw but a feature of qualitative corpus analysis, and that the research process must be transparent and replicable.
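As an illustration (not part of the talk), the first step of keyword analysis — comparing word frequencies between a target and a reference corpus — can be sketched with Dunning’s log-likelihood (G²), a standard keyness statistic in corpus linguistics. The function name and the toy corpora below are our own:

```python
from collections import Counter
import math

def keywords(target_tokens, reference_tokens, top_n=10):
    """Rank words by log-likelihood keyness (Dunning's G2):
    how much more often a word occurs in the target corpus
    than expected given the reference corpus."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = sum(t.values()), sum(r.values())
    scores = {}
    for word in t:
        a, b = t[word], r.get(word, 0)
        # expected frequencies under the null of equal relative frequency
        e1 = nt * (a + b) / (nt + nr)
        e2 = nr * (a + b) / (nt + nr)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[word] = g2
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

target = "climate crisis urgent action climate emergency climate".split()
reference = "weather report sunny weather mild weather forecast".split()
print(keywords(target, reference, top_n=3))  # 'climate' ranks first
```

The statistical ranking is only the starting point; as the talk emphasised, grouping and labelling the resulting keywords is where subjective interpretation enters.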

Discussion

The discussion was mostly about the validity of this framework and alternative approaches. I (Wayne) specifically asked Nathan whether topic models could be an alternative to manual keyword identification and grouping, as they provide more statistical and objective grounding. A topic model is a type of algorithm that clusters words into a pre-defined number of groups based on the probability of word co-occurrence. While Nathan agreed that it could be a valid approach, he mentioned that topic models are generally frowned upon in the corpus linguistics community, as they sacrifice the context of texts and draw conclusions before examining evidence. Nevertheless, Nathan suggested that such limitations could sometimes apply to other corpus linguistics methods as well.

What we personally found particularly interesting

It is fascinating how the triangulation framework embraces subjectivity and leverages it to create new questions or refine existing ones. My previous training often treated subjectivity in annotation as inevitable noise that should be reduced to ensure the validity and coherence of the narrative; standardized procedures and coding schemes are often imposed. It puzzled me at first that CADS (corpus-assisted discourse studies) requires neither a set procedure nor a common expected outcome, such as the number of categories or the number of words in each category, since this makes it difficult to compare the distribution of categories and words with statistical tests. However, the simple visualization helped me see the full picture: the dissonance was complementary and necessary for a multi-faceted perspective. It is also interesting how FlexiConc can improve the transparency of subjectivity in analyses, since subjectivity is often reported but rarely with a thorough walk-through.

Open question

Marchi and Taylor (2009) mention that triangulation does not guarantee validity or objectivity, and that in the case of convergence both researchers may be equally wrong. We were left wondering whether this could also apply to dissonance: might two analyses be equally wrong in treating the same results as distinct findings?

Image created on canva.com with license-free items, 12 May 2025.

Further reading on topic modelling

Bednarek, Monika. 2024. Topic modelling in corpus-based discourse analysis: Uses and critiques. Discourse Studies. https://doi.org/10.1177/14614456241293075.

Brookes, Gavin & Tony McEnery. 2018. The utility of topic modelling for discourse studies: A critical evaluation. Discourse Studies. https://doi.org/10.1177/1461445618814032.

Gillings, Mathew & Andrew Hardie. 2022. The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice. Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqac075.

References

Marchi, Anna & Charlotte Taylor. 2009. If on a winter’s night two researchers…: A challenge to assumptions of soundness of interpretation. CADAAD Journal 3(1).