Prototyping Multimodal GenAI Real-Time Agents with Counterfactual Replays and Hybrid Wizard-of-OzRecent advancements in multimodal generative AI (GenAI) enable the creation of personal context-aware real-time agents that, for example, can augment user workflows by following their on-screen activities and providing contextual assistance. However, prototyping such experiences is challenging, especially when supporting people with domain-specific tasks using real-time inputs such as speech and screen recordings. While prototyping an LLM-based proactive support agent system, we found that existing prototyping and evaluation methods were insufficient to anticipate the nuanced situational complexity and contextual immediacy required. To overcome these challenges, we explored a novel user-centered prototyping approach that combines counterfactual video replay prompting and hybrid Wizard of Oz methods to iteratively design and refine agent behaviors. This paper discusses our prototyping experiences, highlighting successes and limitations, and offers a practical guide and an open-source toolkit for UX designers, HCI researchers, and AI toolmakers to build more user-centered and context-aware multimodal agents.2026FGFrederic Gmeiner et al.Carnegie Mellon UniversityGenerative AI (Text, Image, Music, Video)Human-LLM CollaborationPrototyping & User TestingCHI
Funding AI for Good: A Call for Meaningful EngagementArtificial Intelligence for Social Good (AI4SG) is a growing area that explores AI's potential to address social issues, such as public health. Yet prior work has shown limited evidence of its tangible benefits for intended communities, and projects frequently face real-world deployment and sustainability challenges. While existing HCI literature on AI4SG initiatives primarily focuses on the mechanisms of funded projects and their outcomes, much less attention has been given to the upstream funding agendas that influence project approaches. In this work, we conducted a reflexive thematic analysis of 35 funding documents, representing about $410 million USD in total investments. We uncovered a spectrum of conceptual framings of AI4SG and the approaches that funding rhetoric promoted: from biasing towards technology capacities (more techno-centric) to emphasizing contextual understanding of the social problems at hand alongside technology capacities (more balanced). Drawing on our findings on how funding documents construct AI4SG, we offer recommendations for funders to embed more balanced approaches in future funding call designs. We further discuss implications for how the HCI community can positively shape AI4SG funding design processes.2026HLHongjin Lin et al.Harvard UniversityAI Ethics, Fairness & AccountabilityAlgorithmic Fairness & BiasTechnology Ethics & Critical HCICHI
“I Don’t Think RAI Applies to My Model” – Engaging Non-champions with Sticky Stories for Responsible AI WorkResponsible AI (RAI) tools—checklists, templates, and governance processes—often engage RAI champions, individuals intrinsically motivated to advocate ethical practices, but fail to reach non-champions, who frequently dismiss them as bureaucratic tasks. To explore this gap, we shadowed meetings and interviewed data scientists at an organization, finding that practitioners perceived RAI as irrelevant to their work. Building on these insights and theoretical foundations, we derived design principles for engaging non-champions, and introduced sticky stories—narratives of unexpected ML harms designed to be concrete, severe, surprising, diverse, and relevant, unlike widely circulated media to which practitioners are desensitized. Using a compound AI system, we generated and evaluated sticky stories through human and LLM assessments at scale, confirming they embodied the intended qualities. In a study with 29 practitioners, we found that, compared to regular stories, sticky stories significantly increased time spent on harm identification, broadened the range of harms recognized, and fostered deeper reflection.2026NNNadia Nahar et al.Carnegie Mellon UniversityAI Ethics, Fairness & AccountabilityExplainable AI (XAI)AI-Assisted Decision-Making & AutomationCHI
PolicyPad: Collaborative Prototyping of LLM PoliciesAs LLMs gain adoption in high-stakes domains like mental health, domain experts are increasingly consulted to provide input into policies governing their behavior. From an observation of 19 policymaking workshops with 9 experts over 15 weeks, we identified opportunities to better support rapid experimentation, feedback, and iteration for collaborative policy design processes. We present PolicyPad, an interactive system that facilitates the emerging practice of LLM policy prototyping by drawing from established UX prototyping practices, including heuristic evaluation and storyboarding. Using PolicyPad, policy designers can collaborate on drafting a policy in real time while independently testing policy-informed model behavior with usage scenarios. We evaluate PolicyPad through workshops with 8 groups of 22 domain experts in mental health and law, finding that PolicyPad enhanced collaborative dynamics during policy design, enabled tight feedback loops, and led to novel policy contributions. Overall, our work paves expert-informed paths for advancing AI alignment and safety.2026KFK. J. Kevin Feng et al.University of WashingtonHuman-LLM CollaborationExplainable AI (XAI)AI-Assisted Decision-Making & AutomationCHI
Botender: Supporting Communities in Collaboratively Designing AI Agents through Case-Based ProvocationsAI agents, or bots, serve important roles in online communities. However, they are often designed by outsiders or a few tech-savvy members, leading to bots that may not align with the broader community's needs. How might communities collectively shape the behavior of community bots? We present Botender, a system that enables communities to collaboratively design LLM-powered bots without coding. With Botender, community members can directly propose, iterate on, and deploy custom bot behaviors tailored to community needs. Botender facilitates testing and iteration on bot behavior through case-based provocations: interaction scenarios generated to spark user reflection and discussion around desirable bot behavior. A validation study found these provocations more useful than standard test cases for revealing improvement opportunities and surfacing disagreements. During a five-day deployment across six Discord servers, Botender supported communities in tailoring bot behavior to their specific needs, showcasing the usefulness of case-based provocations in facilitating collaborative bot design.2026TKTzu-Sheng Kuo et al.Carnegie Mellon UniversityHuman-LLM CollaborationParticipatory DesignPrototyping & User TestingCHI
WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AIThere has been growing interest from both practitioners and researchers in engaging end users in AI auditing, to draw upon users’ unique knowledge and lived experiences. However, we know little about how to effectively scaffold end users in auditing in ways that can generate actionable insights for AI practitioners. Through formative studies with both users and AI practitioners, we first identified a set of design goals to support user-engaged AI auditing. We then developed WeAudit, a workflow and system that supports end users in auditing AI both individually and collectively. We evaluated WeAudit through a three-week user study with user auditors and interviews with industry Generative AI practitioners. Our findings offer insights into how WeAudit supports users in noticing and reflecting upon potential AI harms and in articulating their findings in ways that industry practitioners can act upon. Based on our observations and feedback from both users and practitioners, we identify several opportunities to better support user engagement in AI auditing processes. We discuss implications for future research to support effective and responsible user engagement in AI auditing.2025WDWesley Hanwen Deng et al.Online & AI HarmsCSCW
Measurement as Bricolage: Examining How Data Scientists Construct Target Variables for Predictive Modeling TasksData scientists often formulate predictive modeling tasks involving fuzzy, hard-to-define concepts, such as the "authenticity'' of student writing or the "healthcare need'' of a patient. Yet the process by which data scientists translate fuzzy concepts into a concrete, proxy target variable remains poorly understood. We interview fifteen data scientists in education (N=8) and healthcare (N=7) to understand how they construct target variables for predictive modeling tasks. Our findings suggest that data scientists construct target variables through a bricolage process, involving iterative negotiation between high-level measurement objectives and low-level practical constraints. Data scientists attempt to satisfy five major criteria for a target variable through bricolage: validity, simplicity, predictability, portability, and resource requirements. To achieve this, data scientists adaptively use problem (re)formulation strategies, such as swapping out one candidate target variable for another when the first fails to meet certain criteria (e.g., predictability), or composing multiple outcomes into a single target variable to capture a more holistic set of modeling objectives. Based on our findings, we present opportunities for future HCI, CSCW, and ML research to better support the art and science of target variable construction.2025LGLuke Guerdan et al.Generating and Sharing KnowledgeCSCW
Policy Maps: Tools for Guiding the Unbounded Space of LLM BehaviorsAI policy sets boundaries on acceptable behavior for AI models, but this is challenging in the context of large language models (LLMs): how do you ensure coverage over a vast behavior space? We introduce policy maps, an approach to AI policy design inspired by the practice of physical mapmaking. Instead of aiming for full coverage, policy maps aid effective navigation through intentional design choices about which aspects to capture and which to abstract away. With Policy Projector, an interactive tool for designing LLM policy maps, an AI practitioner can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with if-then policy rules that can act on LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the AI practitioner’s work. In an evaluation with 12 AI safety experts, our system helps policy designers craft policies around problematic model behaviors such as incorrect gender assumptions and handling of immediate physical safety threats.2025MLMichelle S. Lam et al.Explainable AI (XAI)Algorithmic Transparency & AuditabilityAlgorithmic Fairness & BiasUIST
Exploring the Potential of Metacognitive Support Agents for Human-AI Co-CreationDespite the potential of generative AI (GenAI) design tools to enhance design processes, professionals often struggle to integrate AI into their workflows. Fundamental cognitive challenges include the need to specify all design criteria as distinct parameters upfront (intent formulation) and designers’ reduced cognitive involvement in the design process due to cognitive offloading, which can lead to insufficient problem exploration, underspecification, and limited ability to evaluate outcomes. Motivated by these challenges, we envision novel metacognitive support agents that assist designers in working more reflectively with GenAI. To explore this vision, we conducted exploratory prototyping through a Wizard of Oz elicitation study with 20 mechanical designers probing multiple metacognitive support strategies. We found that agent-supported users created more feasible designs than non-supported users, with differing impacts between support strategies. Based on these findings, we discuss opportunities and tradeoffs of metacognitive support agents and considerations for future AI-based design tools.2025FGFrederic Gmeiner et al.Generative AI (Text, Image, Music, Video)Human-LLM CollaborationCreative Collaboration & Feedback SystemsDIS
Making the Right Thing: Bridging HCI and Responsible AI in Early-Stage AI Concept SelectionAI projects often fail due to financial, technical, ethical, or user acceptance challenges—failures frequently rooted in early-stage decisions. While HCI and Responsible AI (RAI) research emphasize this, practical approaches for identifying promising concepts early remain limited. Drawing on Research through Design, this paper investigates how early-stage AI concept sorting in commercial settings can reflect RAI principles. Through three design experiments—including a probe study with industry practitioners—we explored methods for evaluating risks and benefits using multidisciplinary collaboration. Participants demonstrated strong receptivity to addressing RAI concerns early in the process and effectively identified low-risk, high-benefit AI concepts. Our findings highlight the potential of a design-led approach to embed ethical and service design thinking at the front end of AI innovation. By examining how practitioners reason about AI concepts, our study invites HCI and RAI communities to see early-stage innovation as a critical space for engaging ethical and commercial considerations together.2025JJJi-Youn Jung et al.AI Ethics, Fairness & AccountabilityParticipatory DesignSustainable HCIDIS
Intent Tagging: Exploring Micro-Prompting Interactions for Supporting Granular Human-GenAI Co-Creation WorkflowsDespite Generative AI (GenAI) systems' potential for enhancing content creation, users often struggle to effectively integrate GenAI into their creative workflows. Core challenges include misalignment of AI-generated content with user intentions (intent elicitation and alignment), user uncertainty around how to best communicate their intents to the AI system (prompt formulation), and insufficient flexibility of AI systems to support diverse creative workflows (workflow flexibility). Motivated by these challenges, we created IntentTagger: a system for slide creation based on the notion of Intent Tags—small, atomic conceptual units that encapsulate user intent—for exploring granular and non-linear micro-prompting interactions for Human-GenAI co-creation workflows. Our user study with 12 participants provides insights into the value of flexibly expressing intent across varying levels of ambiguity, meta-intent elicitation, and the benefits and challenges of intent tag-driven workflows. We conclude by discussing the broader implications of our findings and design considerations for GenAI-supported content creation workflows.2025FGFrederic Gmeiner et al.Carnegie Mellon University, Human-Computer Interaction Institute; Microsoft ResearchGenerative AI (Text, Image, Music, Video)AI-Assisted Creative WritingCreative Collaboration & Feedback SystemsCHI
AI Mismatches: Identifying Potential Algorithmic Harms Before AI DevelopmentAI systems are often introduced with high expectations, yet many fail to deliver, resulting in unintended harm and missed opportunities for benefit. We frequently observe significant "AI Mismatches", where the system’s actual performance falls short of what is needed to ensure safety and co-create value. These mismatches are particularly difficult to address once development is underway, highlighting the need for early-stage intervention. Navigating complex, multi-dimensional risk factors that contribute to AI Mismatches is a persistent challenge. To address it, we propose an AI Mismatch approach to anticipate and mitigate risks early on, focusing on the gap between realistic model performance and required task performance. Through an analysis of 774 AI cases, we extracted a set of critical factors, which informed the development of seven matrices that map the relationships between these factors and highlight high-risk areas. Through case studies, we demonstrate how our approach can help reduce risks in AI development.2025DSDevansh Saxena et al.University of Wisconsin-Madison, The Information SchoolAI Ethics, Fairness & AccountabilityAlgorithmic Fairness & BiasCHI
PolicyCraft: Supporting Collaborative and Participatory Policy Design through Case-Grounded DeliberationCommunity and organizational policies are typically designed in a top-down, centralized fashion, with limited input from impacted stakeholders. This can result in policies that are misaligned with community needs or perceived as illegitimate. How can we support more collaborative, participatory approaches to policy design? In this paper, we present PolicyCraft, a system that structures collaborative policy design through case-grounded deliberation. Building on past research that highlights the value of concrete cases in establishing common ground, PolicyCraft supports users in collaboratively proposing, critiquing, and revising policies through discussion and voting on cases. A field study across two university courses showed that students using PolicyCraft reached greater consensus and developed better-supported course policies, compared with those using a baseline system that did not scaffold their use of concrete cases. Reflecting on our findings, we discuss opportunities for future HCI systems to help groups more effectively bridge between abstract policies and concrete cases.2025TKTzu-Sheng Kuo et al.Carnegie Mellon University, Human-Computer Interaction InstituteSocial Platform Design & User BehaviorContent Moderation & Platform GovernanceParticipatory DesignCHI
Studying Up Public Sector AI: How Networks of Power Relations Shape Agency Decisions Around AI Design and UseAs public sector agencies rapidly introduce new AI tools in high-stakes domains like social services, it becomes critical to understand how decisions to adopt these tools are made in practice. We borrow from the anthropological practice to ``study up'' those in positions of power, and reorient our study of public sector AI around those who have the power and responsibility to make decisions about the role that AI tools will play in their agency. Through semi-structured interviews and design activities with 16 agency decision-makers, we examine how decisions about AI design and adoption are influenced by their interactions with and assumptions about other actors within these agencies (e.g., frontline workers and agency leaders), as well as those above (legal systems and contracted companies), and below (impacted communities). By centering these networks of power relations, our findings shed light on how infrastructural, legal, and social factors create barriers and disincentives to the involvement of a broader range of stakeholders in decisions about AI design and adoption. Agency decision-makers desired more practical support for stakeholder involvement around public sector AI to help overcome the knowledge and power differentials they perceived between them and other stakeholders (e.g., frontline workers and impacted community members). Building on these findings, we discuss implications for future research and policy around actualizing participatory AI approaches in public sector contexts.2024AKAnna Kawakami et al.Session 2e: Data, Power, and JusticeCSCW
The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder, Early-stage Deliberations Around Public Sector AI ProposalsPublic sector agencies are rapidly deploying AI systems to augment or automate critical decisions in real-world contexts like child welfare, criminal justice, and public health. A growing body of work documents how these AI systems often fail to improve services in practice. These failures can often be traced to decisions made during the early stages of AI ideation and design, such as problem formulation. However, today, we lack systematic processes to support effective, early-stage decision-making about whether and under what conditions to move forward with a proposed AI project. To understand how to scaffold such processes in real-world settings, we worked with public sector agency leaders, AI developers, frontline workers, and community advocates across four public sector agencies and three community advocacy groups in the United States. Through an iterative co-design process, we created the Situate AI Guidebook: a structured process centered around a set of deliberation questions to scaffold conversations around (1) goals and intended use or a proposed AI system, (2) societal and legal considerations, (3) data and modeling constraints, and (4) organizational governance factors. We discuss how the guidebook's design is informed by participants’ challenges, needs, and desires for improved deliberation processes. We further elaborate on implications for designing responsible AI toolkits in collaboration with public sector agency stakeholders and opportunities for future work to expand upon the guidebook. This design approach can be more broadly adopted to support the co-creation of responsible AI toolkits that scaffold key decision-making processes surrounding the use of AI in the public sector and beyond.2024AKAnna Kawakami et al.Carnegie Mellon UniversityAI Ethics, Fairness & AccountabilityPrivacy by Design & User ControlParticipatory DesignCHI
Wikibench: Community-Driven Data Curation for AI Evaluation on WikipediaAI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.2024TKTzu-Sheng Kuo et al.Carnegie Mellon UniversityCommunity Collaboration & WikipediaCrowdsourcing Task Design & Quality ControlCHI
Evaluating the Impact of Human Explanation Strategies on Human-AI Visual Decision-MakingArtificial intelligence (AI) is increasingly being deployed in high-stakes domains, such as disaster relief and radiology, to aid practitioners during the decision-making process for interpreting images. Explainable AI techniques have been developed and deployed to provide users insights into why the AI made a certain prediction. However, recent research suggests that these techniques may confuse or mislead users. We conducted a series of two studies to uncover strategies human use to explain decisions and then understand how those explanation strategies impact visual decision-making. In our first study, we elicit explanations from humans when assessing and localizing damaged buildings after natural disasters from satellite imagery and identify four core explanation strategies that humans employed. We then follow-up by studying the impact of these explanation strategies by framing explanations from Study 1 as if they were generated by AI and showing them to a different set of decision-makers performing the same task. We provide initial insights on how causal explanation strategies improve humans' accuracy and calibrate humans' reliance on AI when the AI is incorrect. However, we also find that causal explanation strategies may lead to incorrect rationalizations when the AI presents a correct assessment with incorrect localization. We explore the implications of our findings for the design of human-centered explainable AI and address directions for future work.2023KMKatelyn Morrison et al.XAI 2CSCW
Zeno: An Interactive Framework for Behavioral Evaluation of Machine LearningMachine learning models with high accuracy on test data can still produce systematic failures, such as harmful biases and safety issues, when deployed in the real world. To detect and mitigate such failures, practitioners run behavioral evaluation of their models, checking model outputs for specific types of inputs. Behavioral evaluation is important but challenging, requiring that practitioners discover real-world patterns and validate systematic failures. We conducted 18 semi-structured interviews with ML practitioners to better understand the challenges of behavioral evaluation and found that it is a collaborative, use-case-first process that is not adequately supported by existing task- and domain-specific tools. Using these findings, we designed Zeno, a general-purpose framework for visualizing and testing AI systems across diverse use cases. In four case studies with participants using Zeno on real-world models, we found that practitioners were able to reproduce previous manual analyses and discover new systematic failures.2023ÁCÁngel Alexander Cabrera et al.Carnegie Mellon UniversityExplainable AI (XAI)AI-Assisted Decision-Making & AutomationCHI
Understanding Frontline Workers’ and Unhoused Individuals’ Perspectives on AI Used in Homeless ServicesRecent years have seen growing adoption of AI-based decision-support systems (ADS) in homeless services, yet we know little about stakeholder desires and concerns surrounding their use. In this work, we aim to understand impacted stakeholders’ perspectives on a deployed ADS that prioritizes scarce housing resources. We employed AI lifecycle comicboarding, an adapted version of the comicboarding method, to elicit stakeholder feedback and design ideas across various components of an AI system’s design. We elicited feedback from county workers who operate the ADS daily, service providers whose work is directly impacted by the ADS, and unhoused individuals in the region. Our participants shared concerns and design suggestions around the AI system’s overall objective, specific model design choices, dataset selection, and use in deployment. Our findings demonstrate that stakeholders, even without AI knowledge, can provide specific and critical feedback on an AI system’s design and deployment, if empowered to do so.2023TKTzu-Sheng Kuo et al.Carnegie Mellon UniversityAI Ethics, Fairness & AccountabilityAlgorithmic Fairness & BiasCHI
Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry PracticeRecent years have seen growing interest among both researchers and practitioners in user-engaged approaches to algorithm auditing, which directly engage users in detecting problematic behaviors in algorithmic systems. However, we know little about industry practitioners' current practices and challenges around user-engaged auditing, nor what opportunities exist for them to better leverage such approaches in practice. To investigate, we conducted a series of interviews and iterative co-design activities with practitioners who employ user-engaged auditing approaches in their work. Our findings reveal several challenges practitioners face in appropriately recruiting and incentivizing user auditors, scaffolding user audits, and deriving actionable insights from user-engaged audit reports. Furthermore, practitioners shared organizational obstacles to user-engaged auditing, surfacing a complex relationship between practitioners and user auditors. Based on these findings, we discuss opportunities for future HCI research to help realize the potential (and mitigate risks) of user-engaged auditing in industry practice.2023WDWesley Hanwen Deng et al.Carnegie Mellon UniversityAlgorithmic Transparency & AuditabilityAlgorithmic Fairness & BiasParticipatory DesignCHI