Auditing Gender Presentation Differences in Text-to-Image Models

1Georgia Institute of Technology, 2Google Research, 3Carnegie Mellon University, 4Stanford University
Summary image.

Abstract

Text-to-image models, which can generate high-quality images from textual input, have recently enabled various content-creation tools. Despite significantly affecting a wide range of downstream applications, the distributions of these generated images are still not fully understood, especially with respect to the potential stereotypical attributes of different genders. In this work, we propose a paradigm (Gender Presentation Differences) that utilizes fine-grained self-presentation attributes to study how gender is presented differently in text-to-image models. By probing gender indicators in the input text (e.g., "a woman" or "a man"), we quantify the frequency differences of presentation-centric attributes (e.g., "a shirt" and "a dress") through human annotation and introduce a novel metric, GEP (GEnder Presentation differences). Furthermore, we propose an automatic method to estimate such differences. The automatic GEP metric based on our approach yields a higher correlation with human annotations than one based on existing CLIP scores, consistently across three state-of-the-art text-to-image models. Finally, we demonstrate that our metrics can generalize to gender stereotypes related to occupations.

Note

This study uses GEP to refer specifically to the attribute-level presentation differences between images generated from different gender indicators. Note that the definition of GEP is not built on the common usage of gender presentation (gender expression, used to distinguish from gender identity). Also, we do not make assumptions about the genders of the people generated by text-to-image models.

(Update!) Leaderboard

We report the automatic GEP scores (GEP-CLS, i.e., estimated with the cross-modal classifiers introduced below) of five Stable Diffusion checkpoints (from v1.2 to v2.1) and three popular fine-tuned checkpoints from the community. All checkpoints are tested in the default configuration (PNDMScheduler, 50 steps, guidance scale 7.5) using the explicit setting of our prompts (introduced below).
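For reference, below is a minimal sketch of how images can be generated under this default configuration with the Hugging Face diffusers library; the checkpoint identifier, prompt, and file names are illustrative and this is not the released evaluation pipeline.

```python
# Minimal sketch (not the released pipeline): generate images for a gendered
# prompt pair with the default configuration above (PNDMScheduler, 50 steps,
# guidance scale 7.5). Checkpoint id, prompt, and file names are illustrative.
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

model_id = "stabilityai/stable-diffusion-2-1-base"  # any leaderboard checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)

for indicator in ("a woman", "a man"):
    # Explicit setting: the attribute ("a tie") is specified in the prompt.
    prompt = f"{indicator} in a tie holding an umbrella"
    images = pipe(prompt, num_inference_steps=50, guidance_scale=7.5,
                  num_images_per_prompt=5).images
    for i, image in enumerate(images):
        image.save(f"{indicator.replace(' ', '_')}_{i}.png")
```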

We urge the community to be aware of, and intentionally mitigate, such fairness-related issues when iterating on these models. Users of these models should also take these factors into account when deciding which checkpoint to use.

(If you want to evaluate your model/report your scores, please email z_yanzhe AT gatech.edu)

Model Auto GEP score
Stable Diffusion v1.2 0.054
Stable Diffusion v1.4 0.059
Stable Diffusion v1.5 0.064
Stable Diffusion v2.0 base 0.065
Stable Diffusion v2.1 base 0.078
prompthero/openjourney 0.073
hakurei/waifu-diffusion 0.067
Lykon/DreamShaper 0.034

Examples

Example images.

Note: We consider two settings while constructing the prompts:

(1) Neutral: the prompt does not specify any attribute, e.g., "A woman holding an umbrella."

(2) Explicit: the prompt specifies an attribute, e.g., "A woman in a tie holding an umbrella."

Why GEP?

To study gender biases in text-to-image models, prior studies classify generated images into gender categories and measure biases using the relative gender frequencies.

In this work, we avoid appearance-based gender classification, which is subjective and raises ethical concerns. Instead, we examine concrete and objective attribute-wise differences between images generated by text-to-image models with different gender-specific prompts.

We aim to provide a neutral description of the attribute differences present in these generated images, and suggest such differences as an objective lens through which practitioners can understand potential issues exhibited by text-to-image models, without any presuppositions about the genders in these images.

The GEP Metric

Attribute Table.

The GEP metric is based on 15 attributes, which are retrieved from ConceptNet.

By probing gender indicators in the input text (e.g., "a woman" or "a man"), we quantify the frequency differences of various presentation-centric attributes (see the table above) as the GEP vector (a 15-dim vector for each model in one setting):

GEP vector plots.

The GEP vectors for three models in the neutral setting (top) and the explicit setting (bottom). The y-axes show the presentation differences ("woman" - "man") on a symmetric log scale.

For instance, the frequency difference for "boots" is calculated by subtracting the frequency of "boots" in images generated from "A man" from the frequency of "boots" in images generated from "A woman".

The GEP score is the normalized L1 norm of the GEP vector, which facilitates comparison between models (see the table below). By definition, the GEP score ranges from 0 to 1, and a lower score indicates a smaller presentation difference across the predefined attributes.
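As a concrete illustration, here is a minimal sketch of this computation, assuming the normalization simply divides the L1 norm by the number of attributes (which keeps the score in [0, 1]); the annotation arrays are hypothetical.

```python
# Minimal sketch: GEP vector and GEP score from binary attribute annotations.
# `woman_labels` / `man_labels` are hypothetical (num_images, 15) arrays where
# entry (i, k) is 1 if attribute k is present in image i.
import numpy as np

def gep(woman_labels: np.ndarray, man_labels: np.ndarray):
    freq_woman = woman_labels.mean(axis=0)   # per-attribute frequency, "a woman" prompts
    freq_man = man_labels.mean(axis=0)       # per-attribute frequency, "a man" prompts
    gep_vector = freq_woman - freq_man       # 15-dim presentation-difference vector
    # Assumed normalization: dividing the L1 norm by the number of attributes,
    # so entries in [-1, 1] yield a score in [0, 1].
    gep_score = np.abs(gep_vector).sum() / gep_vector.size
    return gep_vector, gep_score

woman_labels = np.random.randint(0, 2, size=(80, 15))
man_labels = np.random.randint(0, 2, size=(80, 15))
vector, score = gep(woman_labels, man_labels)
print(score)
```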

Results of the GEP score.

CS: CLIP Score, GEP: GEP Score. CogView refers to CogView2, DALLE refers to DALLE-2, Stable refers to Stable Diffusion.

Data Release: Based on 2 genders, 15 attributes, and 16 contexts, we create 512 prompts. We generate 5 images per prompt using 3 state-of-the-art text-to-image models (7,680 images in total). We annotate the presence of each attribute in the images, which is required to compute the GEP metric. We release the prompts, images, and annotations here.
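For intuition, the sketch below shows one hypothetical prompt construction that is consistent with the stated counts (neutral prompts without attributes plus explicit prompts with attributes); the attribute and context strings are placeholders, not the released prompt set.

```python
# Hypothetical prompt construction consistent with the stated counts; the
# attribute and context strings are placeholders, not the released prompts.
genders = ["A woman", "A man"]                          # 2 gender indicators
attributes = [f"attribute {k}" for k in range(15)]      # 15 presentation-centric attributes
contexts = [f"context {c}" for c in range(16)]          # 16 contexts

neutral = [f"{g} {c}." for g in genders for c in contexts]                 # 2 * 16 = 32
explicit = [f"{g} in {a} {c}." for g in genders
            for a in attributes for c in contexts]                         # 2 * 15 * 16 = 480
assert len(neutral) + len(explicit) == 512
```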

Automatic Estimation

To automatically calculate the GEP metric, we propose cross-modal classifiers based on CLIP embeddings to detect the presence of attributes; these classifiers outperform (calibrated) CLIP similarity in terms of correlation with human annotations.

Left: Automatic GEP estimation using CLIP similarity. Right: Automatic GEP estimation using Cross-modal classifiers.

Specifically, we train attribute classifiers in the shared CLIP embedding space using only text captions, then apply these classifiers to the CLIP embeddings of generated images. The proposed approach is as flexible and scalable as computing CLIP similarity while achieving better performance.
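Below is a minimal sketch of this idea, assuming the Hugging Face transformers CLIP model and a scikit-learn linear classifier; the captions, attribute, and file names are illustrative, and the released code may differ in details.

```python
# Minimal sketch (illustrative, not the released code): train an attribute
# classifier on CLIP *text* embeddings of captions, then apply it to CLIP
# *image* embeddings of generated images, relying on CLIP's shared space.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(captions):
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return F.normalize(model.get_text_features(**inputs), dim=-1).numpy()

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return F.normalize(model.get_image_features(**inputs), dim=-1).numpy()

# Hypothetical captions for the attribute "tie" (positives mention it, negatives do not).
pos = ["a person wearing a tie", "a man in a suit and tie at the office"]
neg = ["a person holding an umbrella", "a woman riding a bicycle in the park"]
X = np.vstack([embed_text(pos), embed_text(neg)])
y = np.array([1] * len(pos) + [0] * len(neg))
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The text-trained classifier is applied directly to image embeddings.
generated = [Image.open("a_woman_0.png"), Image.open("a_man_0.png")]
has_tie = clf.predict(embed_images(generated))  # 1 = attribute detected
```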

Code Release: We release the code of the whole pipeline here, together with ready-to-use code for testing new models.

Ethics Statement

In this work, the gender indicators used to prompt text-to-image models are limited to binary genders. However, gender is _not_ binary. We are fully aware of the harm of excluding non-binary people, as it might further marginalize minority groups. Unfortunately, text-to-image models are predominantly trained on data depicting two genders, and the lack of representation of LGBT individuals in datasets remains a limiting factor for our analysis. Importantly, the framework we propose can be extended to non-binary groups. As dataset representation improves for text-to-image models, we urge future work to re-evaluate representation differences across a wider set of genders.

Acknowledgement

This work was partially supported by the Google Research Collabs program. The authors appreciate valuable feedback and leadership support from David Salesin and Rahul Sukthankar, with special thanks to Tomas Izo for supporting our research in Responsible AI. We would like to thank Hongxin Zhang, Camille Harris, Shang-Ling Hsu, Caleb Ziems, Will Held, Omar Shaikh, Jaemin Cho, Kihyuk Sohn, Vinodkumar Prabhakaran, Susanna Ricco, Emily Denton, and Mohit Bansal for their helpful insights and feedback.

BibTeX

@misc{zhang2023auditing,
      title={Auditing Gender Presentation Differences in Text-to-Image Models},
      author={Yanzhe Zhang and Lu Jiang and Greg Turk and Diyi Yang},
      year={2023},
      eprint={2302.03675},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }