Creativity in generative AI is best understood as a balance of novelty, value, and surprise, evaluated through both automatic measurement and human judgment. Margaret A. Boden (University of Sussex) framed creativity in computational systems around novelty, surprise, and value, and Teresa M. Amabile (Harvard Business School) emphasized the importance of domain-relevant skills and motivation when assessing creative output. These frameworks steer metric design toward multidimensional evaluation rather than single-number scores.
Quantitative metrics and their limits
Automatic metrics capture parts of creativity but miss context. Novelty can be approximated by distance in embedding or feature space, comparing generated outputs to training data; diversity and coverage measure variation across samples. Image-generation work often uses the Inception Score, introduced by Tim Salimans and colleagues at OpenAI, and the Fréchet Inception Distance to assess realism and distributional similarity, but neither measures meaningful innovation. Surprise corresponds to low probability, or high self-information, under a model and flags unexpected outputs, yet high surprise can reflect noise rather than value. Text-overlap metrics such as BLEU and ROUGE primarily measure fidelity to references and tend to penalize creative paraphrase, making them inadequate for open-ended creativity evaluation. Combining embedding-based distance, probabilistic surprisal, and task-specific utility gives a more complete quantitative picture, with the caveat that numerical proxies require careful interpretation.
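To make these proxies concrete, the sketch below computes two of them for a text sample: novelty as cosine distance to the nearest embedding in a reference corpus, and surprisal as mean negative log-likelihood per token under a language model. The sentence-transformers encoder, GPT-2 checkpoint, and example inputs are illustrative assumptions, not part of any standard protocol.

    # Minimal sketch: embedding-based novelty and model surprisal.
    # Model choices and thresholds are assumptions for illustration only.
    import numpy as np
    import torch
    from sentence_transformers import SentenceTransformer
    from transformers import AutoTokenizer, AutoModelForCausalLM

    def novelty_score(candidate: str, reference_corpus: list[str], encoder) -> float:
        """Novelty ~ cosine distance from the nearest reference embedding."""
        cand_vec = encoder.encode([candidate])[0]
        ref_vecs = encoder.encode(reference_corpus)
        sims = ref_vecs @ cand_vec / (
            np.linalg.norm(ref_vecs, axis=1) * np.linalg.norm(cand_vec) + 1e-8
        )
        return float(1.0 - sims.max())  # 0 = identical to a reference, 1 = far from all

    def surprisal_per_token(text: str, tokenizer, lm) -> float:
        """Surprisal ~ mean negative log-likelihood per token (in nats)."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss  # mean cross-entropy over tokens
        return float(loss)  # higher = more surprising under the model

    # Illustrative usage (checkpoint names are assumptions):
    # encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # lm = AutoModelForCausalLM.from_pretrained("gpt2")
    # n = novelty_score(sample_text, corpus_samples, encoder)
    # s = surprisal_per_token(sample_text, tokenizer, lm)

High novelty together with high surprisal does not by itself imply value, which is why such scores are reported alongside task-specific utility and human judgment rather than in isolation.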
Human-centered evaluation and cultural context
Human evaluation remains essential for value and contextual relevance. Expert judges assess novelty relative to domain conventions, while crowd raters provide broader judgments of appeal and usability. Amabile’s work underscores that perceived creativity depends on domain norms and evaluator expertise, so panels must be diverse to avoid cultural or regional bias. Models trained on Western-centric data may produce outputs judged creative in one culture but insensitive or mundane in another, highlighting the need for cross-cultural raters and culturally aware reference sets. Environmental and operational costs also matter: extensive human studies and large-sample automatic evaluations carry compute and energy footprints that influence where and how evaluation is deployed.
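One way to structure such human assessment is to collect ordinal ratings per criterion and check inter-rater agreement before aggregating. The sketch below assumes two raters scoring the same outputs on 1-5 scales; the criteria names, sample ratings, and use of quadratic-weighted Cohen's kappa are illustrative choices rather than a fixed protocol.

    # Sketch: aggregate human ratings per criterion with a simple reliability check.
    # Scales, sample values, and the kappa weighting are assumptions.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Likert ratings (1-5) from two raters for the same batch of outputs.
    ratings = {
        "novelty":  {"rater_a": [4, 3, 5, 2, 4], "rater_b": [4, 2, 5, 3, 4]},
        "value":    {"rater_a": [3, 4, 4, 2, 5], "rater_b": [3, 4, 3, 2, 5]},
        "surprise": {"rater_a": [5, 2, 4, 1, 3], "rater_b": [4, 2, 4, 2, 3]},
    }

    for criterion, r in ratings.items():
        mean_score = np.mean(r["rater_a"] + r["rater_b"])
        # Quadratic weighting treats near-misses on the ordinal scale leniently.
        kappa = cohen_kappa_score(r["rater_a"], r["rater_b"], weights="quadratic")
        print(f"{criterion:9s} mean={mean_score:.2f} agreement(kappa)={kappa:.2f}")

Low agreement on a criterion signals that the rating rubric, or the rater pool, needs revision before the scores are trusted, which matters most when panels span cultural contexts.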
A robust evaluation strategy therefore integrates embedding-based novelty measures, surprisal and utility scores, diversity metrics, and structured human assessment, calibrated for cultural context and task goals. No single metric suffices; multi-criteria protocols aligned with scholarly frameworks provide the most trustworthy, actionable assessment of creativity in generative AI.
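As a closing illustration, one hedged way to operationalize such a multi-criteria protocol is to normalize each signal over the evaluated batch and report the axes side by side rather than collapsing them into a single score; the function below is an assumed sketch, not a standard.

    # Sketch: per-axis min-max normalization for a multi-criteria creativity report.
    # Axis names and inputs are illustrative assumptions.
    import numpy as np

    def normalize(x):
        """Min-max normalize a batch of scores to [0, 1]."""
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    def creativity_report(novelty, surprisal, diversity, human_value):
        """Return per-sample normalized scores, keeping each axis separate."""
        return {
            "novelty": normalize(novelty),
            "surprise": normalize(surprisal),
            "diversity": normalize(diversity),
            "human_value": normalize(human_value),
        }

    # report = creativity_report(novelty=[0.3, 0.7, 0.5],
    #                            surprisal=[2.1, 4.8, 3.0],
    #                            diversity=[0.6, 0.9, 0.4],
    #                            human_value=[3.5, 4.2, 2.8])

Keeping the axes separate preserves the multidimensional character of the evaluation and leaves any weighting or trade-off explicit to the task owner rather than hidden in a composite number.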