Masculine Defaults via Gendered Discourse in Podcasts and LLMs
Maria Teleki, Xiangjue Dong, Haoran Liu, James Caverlee
{mariateleki, xj.dong, liuhr99, caverlee}@tamu.edu
We first get features for each podcast using a Topic Modeler (LDA, BERTopic), a Gender Segmenter (inaSpeechSegmenter), and other modules. We then test for significant correlations between these podcast features.
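As a rough illustration of testing whether two podcast features are significantly correlated, here is a minimal sketch. The text above does not say which correlation statistic or significance test is used, so Pearson's r with a permutation test is an assumption made for this example. The feature names and all numbers are hypothetical.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: how often a shuffled pairing gives |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical per-podcast features (made-up numbers, for illustration only).
topic_weight = [0.02, 0.05, 0.08, 0.11, 0.15, 0.18, 0.22, 0.25]
women_speech_fraction = [0.10, 0.18, 0.22, 0.35, 0.41, 0.55, 0.60, 0.72]

r = pearson_r(topic_weight, women_speech_fraction)
p = permutation_p_value(topic_weight, women_speech_fraction)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

When many topic-feature pairs are tested at once, a multiple-comparison correction would normally be applied to the resulting p-values. That step is omitted here.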
We find Discourse Topics (Topics 54, 60, and 62) that have significant, opposite correlations with Women and Men! Unlike Content Topics, they are not about what a podcast discusses; they are the "words between the words."
What does this mean? Women and men differ stylistically in their speech patterns through their use of these discourse words.
We conduct an experiment to see how LLMs represent these discourse words:
We take the masculine Topic 60 words from above and the feminine Topic 62 words, and we flip them when they naturally occur in the podcast transcripts!
That is:
When we see "going" in a sentence s, that's a masculine Topic 60 word.
So we randomly sample a feminine Topic 62 word and substitute it for "going".
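The flip described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and tokenization are assumptions, and the word lists are placeholders. Only "going" comes from the example above. Every other entry is a hypothetical stand-in for the real topic vocabulary.

```python
import random
import re

# Placeholder word lists. Only "going" is taken from the example in the text;
# all other entries are hypothetical stand-ins, not the actual topic words.
MASCULINE_TOPIC_60 = {"going", "masc_word_a", "masc_word_b"}
FEMININE_TOPIC_62 = ["fem_word_a", "fem_word_b", "fem_word_c"]

def flip_masculine_to_feminine(sentence, rng):
    """Replace each masculine discourse word with a randomly sampled feminine one."""
    def swap(match):
        word = match.group(0)
        if word.lower() in MASCULINE_TOPIC_60:
            return rng.choice(FEMININE_TOPIC_62)
        return word  # leave every other token untouched
    return re.sub(r"[A-Za-z_']+", swap, sentence)

rng = random.Random(0)  # seeded so the sampling is reproducible
s = "we are going to talk about the season"
print(flip_masculine_to_feminine(s, rng))
```

The opposite direction (feminine to masculine) would mirror this function with the two word lists swapped.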
We then see how that flip impacts the embedding representation, as shown in the bottom image.
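One simple way to quantify the impact of a flip is the cosine distance between the sentence embedding before and after the swap. The sketch below assumes that metric, which the text here does not specify. It also uses a toy hashed bag-of-words function, `toy_embed`, as a stand-in so the example runs without a model. In practice the `embed` argument would wrap a real embedding model.

```python
import hashlib
import math

def toy_embed(sentence, dim=16):
    """Stand-in embedding: hashed bag-of-words. NOT a real LLM embedding."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embedding_shift(original, flipped, embed=toy_embed):
    """Cosine distance between the embeddings of a sentence before and after a flip."""
    return 1.0 - cosine_similarity(embed(original), embed(flipped))

original = "we are going to talk about the season"
flipped = "we are fem_word_a to talk about the season"  # hypothetical flipped sentence
print(f"shift = {embedding_shift(original, flipped):.4f}")
```

Under this assumed metric, a smaller average shift for one group of discourse words would correspond to a more stable representation.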
We do a few hyperparameter studies and see that:
(i) these discourse words carry gender information: sentences "flip" from one gender to the other when only the discourse words are flipped!
(ii) the masculine discourse words have a more stable representation than the feminine discourse words in the LLM embedding space. This more stable representation is a masculine default.
This experiment was done using the text-embedding-3-large model from OpenAI, a popular and widely deployed model. We find similar results for Llama-3.1-8B-Instruct embeddings. This indicates that many LLMs may be susceptible to unintentionally encoding these masculine defaults via discourse words, which could result in worse performance for women users.
@inproceedings{teleki25_icwsm,
title = {Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models},
author = {Maria Teleki and Xiangjue Dong and Haoran Liu and James Caverlee},
year = {2025},
booktitle = {ICWSM 2025}
}