Evaluating Generative AI in the Lab: Methodological Challenges and GuidelinesGenerative AI (GenAI) systems are inherently non-deterministic, producing varied outputs even for identical inputs. While this variability is central to their appeal, it challenges established HCI evaluation practices that typically assume consistent and predictable system behavior. Designing controlled lab studies under such conditions therefore remains a key methodological challenge. We present a reflective multi-case analysis of four lab-based user studies with GenAI-integrated prototypes, spanning conversational in-car assistant systems and image generation tools for design workflows. Through cross-case reflection and thematic analysis across all study phases, we identify five methodological challenges and propose eighteen practice-oriented recommendations, organized into five guidelines. These challenges represent methodological constructs that are either amplified, redefined, or newly introduced by GenAI's stochastic nature: (C1) reliance on familiar interaction patterns, (C2) fidelity-control trade-offs, (C3) feedback and trust, (C4) gaps in usability evaluation, and (C5) interpretive ambiguity between interface and system issues. Our guidelines address these challenges through strategies such as reframing onboarding to help participants manage unpredictability, extending evaluation with constructs such as trust and intent alignment, and logging system events, including hallucinations and latency, to support transparent analysis. This work contributes (1) a methodological reflection on how GenAI's stochastic nature unsettles lab-based HCI evaluation and (2) eighteen recommendations that help researchers design more transparent, robust, and comparable studies of GenAI systems in controlled settings.2026HPHyerim Park et al.BMW GroupGenerative AI (Text, Image, Music, Video)Human-LLM CollaborationUser Research Methods (Interviews, Surveys, Observation)IUI
Mental Models in Human-AI Interaction: Systematic Review of Empirical Methodologies and GuidelinesThe notion of mental model has long been used in HCI to capture people's understanding and reasoning about computing systems. Eliciting users' mental models can explain their behaviors and attitudes toward a system—why and how they use, rely on, trust, or reject it. However, its use remains conceptually fragmented and methodologically diverse and has not been revisited in light of modern AI systems, whose opacity and newfound abilities may challenge human understanding. To address this gap, we systematically review 88 empirical studies that elicit humans’ mental models of AI systems. We extracted and analyzed how studies define and elicit mental models, the type of mental model their method presupposes, and how these vary across AI system types. Drawing from the mental model's framing in cognitive psychology and HCI, and based on descriptive and relational analysis between the variables extracted, we find that (1) mental model elicitations' goal bifurcates between system-specific evaluation and class-level probes surfacing lay theories; (2) epistemic assumptions exceed the classic functional-structural lens (how the system behaves / how it works internally) with analogical and anthropomorphic framings of AI systems; (3) elicitation methods are shaped more by system characteristics and community-specific practices than theoretical commitments, particularly for predictive and explainable AI systems and autonomous or driver-assist vehicles. We derive 9 practical guidelines to support more deliberate and reflective methods for eliciting mental models of AI systems. In doing so, we aim to reestablish continuity between the cognitive theory of mental models and their empirical use in HCI, improving the transparency and comparability of research surrounding the concept.2026TSTéo Sanchez et al.Ludwig Maximilian University of MunichExplainable AI (XAI)AI-Assisted Decision-Making & AutomationAutomated Driving Interface & Takeover DesignIUI
Can AI Route You to Happiness? A Technical Study on Affective Automotive Navigation InterfacesConventional navigation systems, fixated on metrics such as time and distance, neglect the driver's emotional well-being, despite driving routes being inherent emotional triggers. This raises a critical question for the Intelligent User Interface community: How can intelligent systems successfully route information based on emotion? To address this gap, we introduce HappyRouting, an empathic car interface designed as an initial attempt to guide drivers through real-world traffic while actively optimizing for positive emotional states. Our core technical contribution is a machine learning-based emotion map layer that predicts the affective valence along various routes using both static and dynamic contextual data. HappyRouting enables the generation of "happy routes" integrated into a functional vehicular interface prototype. We explored the efficacy of this approach in a preliminary, small-scale driving study (N=13). Our initial findings provide provocative evidence: Emotion-optimized routes successfully increased the subjectively perceived valence by 11% (p=.007) compared to standard routes. Furthermore, despite taking 1.25 times longer on average, participants consistently perceived the travel duration as shorter. This result suggests that integrating emotional optimization could fundamentally challenge the speed-first paradigm. However, recognizing the constraints of our initial, limited sample, we conclude by discussing ethical and computational challenges that must be resolved before emotion-based routing can be safely and scalably integrated into next-generation intelligent navigation apps.2026DBDavid Bethge et al.LMU MunichAutomated Driving Interface & Takeover DesignIn-Vehicle Haptic, Audio & Multimodal FeedbackEmotion Recognition & DetectionIUI
Too Many Zombies: Exploring Challenges and Motivations for (Not) Deleting Unused Online AccountsUnused online accounts (“zombie accounts”) pose avoidable privacy and security risks by retaining personal data that may be exposed in breaches. Yet, little is known about when and how to effectively prompt users to delete them. This work investigates the challenges users encounter when attempting to delete zombie accounts. We conducted two online studies with U.S. participants via Prolific: the accounts study (N = 120) to identify common zombie account categories, and the challenges study (N = 100) to examine users’ motivations, perceived abilities, and preferred moments for deletion. Participants reported high self-efficacy but underestimated the number of zombie accounts they had. We identify promising opportune moments — such as when updating account information or setting up a new device — and evaluate potential triggers, including breach notifications and data sensitivity. This work contributes an empirical characterization of end-users' diverse challenges related to zombie accounts and design recommendations for future deletion-support tools.2026FBFranziska Bumiller et al.University of Erlangen-NurembergPrivacy by Design & User ControlPrivacy Perception & Decision-MakingDark Patterns RecognitionCHI
Designing Effective Digital Literacy Interventions for Boosting Deepfake DiscernmentDeepfake images can erode trust in institutions and compromise election outcomes, as people often struggle to discern real images from deepfake images. Improving digital literacy can help address these challenges. Here, we compare the efficacy of five digital literacy interventions to boost people's ability to discern deepfakes: (1) textual guidance on common indicators of deepfakes; (2) visual demonstrations of these indicators; (3) a gamified exercise for identifying deepfakes; (4) implicit learning through repeated exposure and feedback; and (5) explanations of how deepfakes are generated with the help of AI. We conducted an experiment with N=1,200 participants from the United States to test the immediate and long-term effectiveness of our interventions. Our results show that our lightweight, easy-to-understand interventions can boost deepfake images discernment by up to 13 percentage points while maintaining trust in real images.2026DGDominique Geissler et al.LMU MunichDeepfake & Synthetic Media DetectionPrivacy by Design & User ControlDark Patterns RecognitionCHI
Do It Fast, Forget It Fast: How Timing and Limb Visualizations Affect First-Person Augmented Reality InstructionsAcquiring tacit knowledge and practical skills often depends on direct observation and in situ training. AR offers an alternative by overlaying first-person step-by-step instructions that guide users through tasks such as assembly and repair. Previous work demonstrates the effectiveness of AR instruction for specific applications. In our experimental work, we systematically explore aspects of the broader design space. We conducted a controlled experiment (n = 40) to investigate three key factors identified in learning theory and XR embodiment research: imitation timing (parallel vs. sequential), limb visualization (hand vs. full arm), and limb visibility (opaque vs. semi-transparent). Across all conditions, participants followed AR instructions and afterward repeated the tasks from memory. We assessed performance, user experience, and retention. Our results show that parallel imitation is faster and increases embodiment, whereas sequential imitation enhances memory retention and comfort. Our findings provide guidance for the temporal and visual design of first-person AR tutorials.2026CSClara Sayffaerth et al.LMU MunichAR Navigation & Context AwarenessImmersion & Presence ResearchPrototyping & User TestingCHI
Beyond Words: Measuring User Experience through Speech Analysis in Voice User InterfacesVoice assistants (VAs) are typically evaluated through task performance metrics and self-report questionnaires, but people’s voices themselves carry rich paralinguistic cues that reveal affect, effort, and interaction breakdowns. We present a within-subjects study (N=49) that systematically compared three VA personas across three usage scenarios to investigate whether speech-derived audio features can serve as a proxy for user experience (UX). Participants’ speech was analyzed for temporal, spectral, and linguistic markers, alongside standardized UX measures, brief mood and stress ratings, and a post-study questionnaire. We found correlations between specific speech features and self-reported satisfaction and experience. Furthermore, a machine learning model trained on speech features achieved promising accuracy in classifying UX levels, indicating that this might be a reasonable alternative to self-report instruments. Our findings establish speech as a viable, real-time signal for implicitly measuring UX and point toward adaptive VUIs that respond dynamically to emotional and usability-related vocal cues.2026YMYong Ma et al.University of BergenVoice User Interface (VUI) DesignIntelligent Voice Assistants (Alexa, Siri, etc.)Affective Feedback & Emotion Regulation InterfacesCHI
A Tree’s Perspective: Enhancing Nature Connectedness Through Transitional and Multisensory Virtual Reality ExperiencesEmbodying natural entities in Virtual Reality (VR) shows potential to enhance nature connectedness, but design factors that support such embodiment remain underexplored. This study examined whether transitional elements in the physical setting before and after VR and multisensory stimuli during VR can strengthen nature connectedness in a transformative tree-embodiment experience. Through a mixed-methods approach (N=20), where we varied the pre- and post-VR experience (Neutral vs. Transitional) and sensory modalities (Audiovisual vs. Multisensory), we found that both transitional and multisensory experiences significantly enhanced presence, embodiment, and nature connectedness, with increases in emotional connectedness sustained one week later. Drawing on interview insights and impact ratings of specific design features, we derive design recommendations for integrating transitional and multisensory elements. Our findings demonstrate the value of holistic design for enhancing the emotional and transformative potential of VR nature embodiment for fostering environmental awareness.2026LTJulian Rasch et al.TU Dortmund UniversityImmersion & Presence ResearchMultisensory Fusion ExperienceHuman-Nature Relationships (More-than-Human Design)CHI
Supporting Effective Goal Setting with LLM-Based ChatbotsEach day, individuals set behavioral goals such as eating healthier, exercising regularly, or increasing productivity. While psychological frameworks (i.e., goal setting and implementation intentions) can be helpful, they often need structured external support, which interactive technologies can provide. We thus explored how large language model (LLM)-based chatbots can apply these frameworks to guide users in setting more effective goals. We conducted a preregistered randomized controlled experiment (N = 543) comparing chatbots with different combinations of three design features: guidance, suggestions, and feedback. We evaluated goal quality using subjective and objective measures. We found that, while guidance is already helpful, it is the addition of feedback that makes LLM-based chatbots effective in supporting participants’ goal setting. In contrast, adaptive suggestions were less effective. Altogether, our study shows how to design chatbots by operationalizing psychological frameworks to provide effective support for reaching behavioral goals.2026MSMichel Schimpf et al.Cambridge UniversityHuman-LLM CollaborationBehavior Change & Reflection TechnologyAffective Human-Computer DialogueCHI
The Challenge to Design for Relatedness Experiences: An Explorative Investigation of Five Relatedness Technologies from a Psychological Needs PerspectiveSo-called relatedness technologies aim to create relatedness experiences between people over distance. Typically, such technologies focus on implicit or expressive interaction, as opposed to the explicit, information-focused interaction of conventional communication technologies. Based on psychological theory, previous research has identified different design strategies for relatedness technologies such as awareness, expressivity, or gift giving. However, despite this profound theoretical understanding, designing for a fulfilling relatedness experience remains a challenging task and often conflicts with other psychological needs, such as autonomy or security. This research explores the specific potentials and barriers to the use and acceptance of relatedness technologies. Based on a comparative evaluation of five different relatedness concepts in an online study (N = 221) combining quantitative and qualitative data, we identified overarching patterns of promising design strategies for particular user groups and revealed overall need fulfillment as a central predictor of the intention to use the technology.2026AKAngelina Krupp et al.LMU MunichDigital Emotional Expression & TransmissionEmpathy & Emotional DesignAffective Human-Computer DialogueCHI
From Throw-Away to Takeaway: How GenAI and Vibe Coding Accelerate Prototyping Across Technical Skill LevelsA new generation of GenAI tools fueled by vibe coding practices promises to democratize software development, explicitly targeting users without programming backgrounds. Yet, we lack understanding of how technical and non-technical users actually engage with these tools across the product development lifecycle. We conducted a mixed-methods study combining an online survey (N=85) with interviews of hackathon participants (N=31) and practitioners (N=8), examining how different user groups employ chatbots, local development assistants, and cloud development environments from ideation through deployment. Our findings reveal that cloud development environments accelerate prototyping, enabling non-technical users to generate high-fidelity "throw-away" prototypes valuable for experiential exploration. However, deployment and long-term maintainability remain dependent on technical expertise, with non-technical users consistently encountering barriers when transitioning beyond prototyping. We contribute a comparative analysis of how technical and non-technical users appropriate GenAI tools across the full product development cycle in contexts approximating real-world product building, highlighting implications for tool design and educational practices.2026CKCharlotte Kobiella et al.Center for Digital Technology and ManagementGenerative AI (Text, Image, Music, Video)Human-LLM CollaborationPrototyping & User TestingCHI
Small Talk, Big Impact: The Role of Everyday Conversations in Cybersecurity PracticesEveryday talk is often treated as casual chatter, yet it plays a crucial role in how people acquire and share knowledge. Typically, cybersecurity practices are informed by formal training, but they often overlook the impact of social exchanges. This paper investigates how informal conversations can act as a socio-technical mechanism for shaping cybersecurity awareness and practices. We conducted an online survey (N=215) where participants described recent discussions about cybersecurity, including who was involved, where they took place, and what triggered them. Quantitative and thematic analysis revealed common contexts, social settings, and topics. Most conversations occurred spontaneously in private environments, with personal experiences being the most frequent trigger. We contribute empirical insights on informal security conversations to inform the design of human-centered technologies that surface and mediate security-related discussions in everyday contexts, to ensure implicit and continuous security awareness.2026DMDoruntina Murtezaj et al.University of the Bundeswehr MunichPrivacy Perception & Decision-MakingCybersecurity Training & AwarenessCHI
"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step ProcessingAgentic AI assistants that autonomously perform multi-step tasks raise open questions for user experience: how should such systems communicate progress and reasoning during extended operations, especially in attention-critical contexts such as driving? We investigate feedback timing and verbosity from agentic LLM-based in-car assistants through a controlled, mixed-methods study (N=45) comparing planned steps and intermediate results feedback against silent operation with final-only response. Using a dual-task paradigm with an in-car voice assistant, we found that intermediate feedback significantly improved perceived speed, trust, and user experience while reducing task load - effects that held across varying task complexities and interaction contexts. Interviews further revealed user preferences for an adaptive approach: high initial transparency to establish trust, followed by progressively reducing verbosity as systems prove reliable, with adjustments based on task stakes and situational context. These findings inform design principles for feedback systems in agentic AI assistants, balancing transparency and efficiency across domains.2026JKJohannes Kirmayr et al.BMW GroupVoice User Interface (VUI) DesignIntelligent Voice Assistants (Alexa, Siri, etc.)Head-Up Display (HUD) & Advanced Driver Assistance Systems (ADAS)CHI
Balancing Accuracy and Embodiment: A Hybrid Perspective for Complex Visuomotor Tasks in VRVisual perspective is a crucial design factor in Virtual Reality (VR). Especially when complex motor tasks are involved, it can affect both objective performance and subjective experience. We compared four visual perspectives (First-Person view, translucent Ghost view, Third-Person view, and Hybrid view) in a user study (N=20) involving different difficulties in a balancing game. Our findings reveal complex tradeoffs between the sense of embodiment, performance, and preference: The preferred Hybrid perspective offered a significant stability advantage for low task difficulty. However, this benefit vanished with increasing physical demand, revealing a speed-accuracy trade-off where external views required longer completion times. Ego-centric perspectives (First and Ghost) induced a stronger sense of embodiment and presence, but were less preferred. Participants' choice was not determined by representational fidelity but by pragmatic considerations of perceived utility. As perceived effectiveness can overrule objective performance and subjective experience, the choice of perspective is an important factor for future training and rehabilitation applications in VR.2026DDDennis Dietz et al.LMU MunichSocial & Collaborative VRImmersion & Presence ResearchVR Medical Training & RehabilitationCHI
Partnering with Generative AI: Experimental Evaluation of Model-Led and Human-Led Interaction in Human-AI Co-CreationLarge language models (LLMs) show strong potential to support creative tasks, but the role of the interface design is poorly understood. In particular, the effect of different modes of collaboration between humans and LLMs on co-creation outcomes is unclear. To test this, we conducted a randomized controlled experiment (N = 486) comparing: (a) two variants of reflective, human-led modes in which the LLM elicits elaboration through suggestions or questions, against (b) a proactive, model-led mode in which the LLM independently rewrites ideas. By assessing the effects on idea quality, diversity, and perceived ownership, we found that the model-led mode substantially improved idea quality but reduced idea diversity and users’ perceived idea ownership. The reflective, human-led mode also improved idea quality while preserving diversity and ownership. We independently validated the findings in a different context (N = 640). Our findings highlight the importance of designing interactions with generative AI systems as reflective thought partners that complement human strengths and augment creative processes.2026SMSebastian Maier et al.Institute of Artificial Intelligence (AI) in ManagementGenerative AI (Text, Image, Music, Video)Human-LLM CollaborationCreative Collaboration & Feedback SystemsCHI
Mind in the Machine? Cross-Disciplinary Perceptions of Consciousness in Artificial IntelligenceHuman-like behavior in Artificial Intelligence (AI) increasingly affects human–AI interaction, leading users to attribute consciousness to these systems. Yet, the factors shaping how such attributions arise remain largely unexplored. We report findings from an online survey (N=553) with participants primarily consisting of academics from formal sciences, natural sciences, and humanities, whose educational backgrounds provide more accurate mental models within their field of study, alongside participants from diverse backgrounds. Respondents evaluated their perceptions of consciousness (self-defined) in Large Language Models (LLMs) they previously interacted with, consciousness in future AI, and related ethical considerations. The results show that, across groups, around half of the participants attributed some degree of consciousness to LLMs. Individual traits such as gender, as well as participants’ conceptual positions regarding consciousness and its link to intelligence, influence consciousness perceptions, outweighing the effects of technical knowledge or system transparency. Beyond shaping academic discussions, these perspectives inform how AI is designed, governed, and integrated into everyday interactions.2026HMHamid Moradi et al.FAU Erlangen-NürnbergHuman-LLM CollaborationAI Ethics, Fairness & AccountabilityAlgorithmic Fairness & BiasCHI
Artists on a Decade of AI Evolution: An Interview Study of Affordances, Culture, and Artistic Practice with Machine LearningIn the mid-2010s, media artists began developing practices using machine learning (ML) as an artistic medium. Since 2022, the rise of large generative models, the mainstreaming of AI as consumer products, and intensifying ethical disputes have reconfigured the conditions of their artistic practice. This paper aims to understand how artists working with ML over the past decade respond to these shifts, shedding light on how practices, tools, and culture co-evolve. We address this question through thematic analysis of semi-structured interviews with 30 artists active before 2020. Our findings show how artists experience narrowing aesthetics and reduced malleability of post-2020 ML systems, have diverging views on where to locate moral responsibility with large AI models, and face shifting cultural reception that challenges the legibility of their work. We map how artists envision their practice going forward and discuss those orientations with respect to HCI conversations on design and creativity.2026TSTéo Sanchez et al.Ludwig Maximilian University of MunichGenerative AI (Text, Image, Music, Video)AI-Assisted Creative WritingInclusive DesignCHI
Sensing What Surveys Miss: Understanding and Personalizing Proactive LLM Support by User ModelingDifficulty spillover and suboptimal help-seeking challenge the sequential, knowledge-intensive nature of digital tasks. In online surveys, tough questions can drain mental energy and hurt performance on later questions, while users often fail to recognize when they need assistance or may satisfice, lacking motivation to seek help. We developed a proactive, adaptive system using electrodermal activity and mouse movement to predict when respondents need support. Personalized classifiers with a rule-based threshold adaptation trigger timely LLM-based clarifications and explanations. In a within-subjects study (N=32), aligned-adaptive timing was compared to misaligned-adaptive and random-adaptive controls. Aligned-adaptive assistance improved response accuracy by 21%, reduced false negative rates from 50.9% to 22.9%, and improved perceived efficiency, dependability, and benevolence. Properly timed interventions prevent cascades of degraded responses, showing that aligning support with cognitive states improves both the outcomes and the user experience. This enables more effective, personalized large language model (LLM) support in survey-based research.2026ALAilin Liu et al.LMU MunichHuman-LLM CollaborationBehavior Change & Reflection TechnologyExplainable AI (XAI)CHI
Co-Constructed or Constrained? How AI Collaboration Tools Reshape UI Design Practice in a Time-Boxed Design ChallengeIn what ways do generative AI (GenAI) tools embedded in mainstream design software affect UI design practices and outcomes? In this study, we examine the use of FigmaAI, a GenAI feature within the industry-standard platform Figma. We conducted a within-subject study with 16 professional UX designers, each of whom completed two high-fidelity UI design tasks in a think-aloud session: one conventionally, without AI, and one with FigmaAI. We analyse the design process itself alongside experiences and reflections on using and not using AI, gathered from adjacent semi-structured interviews and post-task questionnaires. Our findings suggest that GenAI reshapes UI workflows, shifting them from additive to subtractive processes: designers refine AI drafts rather than building interfaces from scratch. While AI reduces workload and accelerates initial setup, it also constrains exploration, limits perceived ownership, and produces designs that are more visually and structurally similar.2026CKCharlotte Kobiella et al.Center for Digital Technology and ManagementGenerative AI (Text, Image, Music, Video)Human-LLM Collaboration360° Video & Panoramic ContentCHI
Anticipation Before Action: EEG-Based Implicit Intent Detection for Adaptive Gaze Interaction in Mixed RealityMixed Reality (MR) interfaces increasingly rely on gaze for interaction, yet distinguishing visual attention from intentional action remains difficult, leading to the Midas Touch problem. Existing solutions require explicit confirmations, while brain–computer interfaces may provide an implicit marker of intention using Stimulus-Preceding Negativity (SPN). We investigated how Intention (Select vs. Observe) and Feedback (With vs. Without) modulate SPN during gaze-based MR interactions. During realistic selection tasks, we acquired EEG and eye-tracking data from 28 participants. SPN was robustly elicited and sensitive to both factors: observation without feedback produced the strongest amplitudes, while intention to select and expectation of feedback reduced activity, suggesting SPN reflects anticipatory uncertainty rather than motor preparation. Complementary decoding with deep learning models achieved reliable person-dependent classification of user intention, with accuracies ranging from 75% to 97% across participants. These findings identify SPN as an implicit marker for building intention-aware MR interfaces that mitigate the Midas Touch.2026FCFrancesco Chiossi et al.LMU MunichEye Tracking & Gaze InteractionBrain-Computer Interface (BCI) & NeurofeedbackImmersion & Presence ResearchCHI