Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study
Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher observed design and evaluation activities and conducted interviews with both developers and domain experts. Analysis revealed four key practices: creating workarounds for data collection, turning to augmentation when expert input was limited, co-developing evaluation criteria with experts, and adopting hybrid expert–developer–LLM evaluation strategies. These practices show how teams made strategic decisions under constraints and demonstrate the central role of domain expertise in shaping the system. Challenges included expert motivation and trust, difficulties structuring participatory design, and questions around ownership and integration of expert knowledge. We propose design opportunities for future LLM development workflows that emphasize AI literacy, transparent consent, and frameworks recognizing evolving expert roles.
2026 | Annalisa Szymanski et al. | University of Notre Dame | Human-LLM Collaboration; Participatory Design; User Research Methods (Interviews, Surveys, Observation) | IUI
ReUseIt: Synthesizing Reusable AI Agent Workflows for Web Automation
AI-powered web agents have the potential to automate repetitive tasks, such as form filling, information retrieval, and scheduling, but they struggle to reliably execute these tasks without human intervention, requiring users to provide detailed guidance during every run. We address this limitation by automatically synthesizing reusable workflows from an agent's successful and failed attempts. These workflows incorporate execution guards that help agents detect and fix errors while keeping users informed of progress and issues. Our approach enables agents to successfully complete repetitive tasks of the same type with minimal user intervention, increasing the success rate from 24.2% to 70.1% across fifteen tasks. To evaluate this approach, we invited nine users and found that our agent helped them complete web tasks with a higher success rate and less guidance than two baseline methods, and that it allowed them to easily monitor agent behavior and understand its failures.
2026 | Yimeng Liu et al. | University of California, Santa Barbara | AI-Assisted Decision-Making & Automation; AutoML Interfaces; Generative AI (Text, Image, Music, Video) | IUI
DuoDrama: Supporting Screenplay Refinement Through LLM-Assisted Human Reflection
AI has been increasingly integrated into screenwriting practice. In refinement, screenwriters expect AI to provide feedback that supports reflection across the internal perspective of characters and the external perspective of the overall story. However, existing AI tools cannot sufficiently coordinate the two perspectives to meet screenwriters' needs. To address this gap, we present DuoDrama, an AI system that generates feedback to assist screenwriters' reflection in refinement. To enable DuoDrama, based on performance theories and a formative study with nine professional screenwriters, we design the Experience-Grounded Feedback Generation Workflow for Human Reflection (ExReflect). In ExReflect, an AI agent adopts an experience role to generate experience and then shifts to an evaluation role to generate feedback based on the experience. A study with fourteen professional screenwriters shows that DuoDrama improves feedback quality and alignment and enhances the effectiveness, depth, and richness of reflection. We conclude by discussing broader implications and future directions.
2026 | Yuying Tang et al. | The Hong Kong University of Science and Technology | Human-LLM Collaboration; AI-Assisted Creative Writing; Creative Collaboration & Feedback Systems | CHI
Exploring the Future of AI in Clinical Collaboration: A Study on Tumor Board Case Preparation
Multidisciplinary tumor boards (MTBs) bring specialists together to identify therapies for complex cancer cases, but preparing for them is time-intensive. Clinicians must extract key details from extensive records and evaluate treatment options. While large language models (LLMs) show promise in medicine for basic tasks like summarizing notes, little is known about their role in high-stakes tasks like MTB preparation. We conducted a mixed-methods study with 16 oncologists using two AI systems to prepare patient cases for MTB: an off-the-shelf assistant (Copilot) and a task-specific multi-agent system (Healthcare Agent Orchestrator, HAO). We analyzed oncologist prompts, AI responses, and oncologists' perception of AI. Participants showed greater willingness to adopt HAO but were often overconfident in AI summaries and skeptical of AI-recommended therapies. Trust calibration strategies, such as source links and agent-trajectories, failed to align trust with system capabilities. We conclude with how AI systems should be built to support clinicians in high-stakes tasks.
2026 | Jiachen Li et al. | Northeastern University | Human-LLM Collaboration; Explainable AI (XAI); AI-Assisted Decision-Making & Automation | CHI
The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows
Users of feature-rich tools like Excel often miss more efficient workflows, repeating tedious steps and making avoidable errors. Current approaches to helping them require either manual prompting, which is effortful for users, or automated logging, which is limiting for developers. We present InvisibleMentor, a system inspired by over-the-shoulder learning: it observes what users do, then shows them how to do it better. To do this, InvisibleMentor analyzes screen recordings with a vision-language model to reconstruct actions and context, then uses a large language model to generate vision-grounded task reflection, structured suggestions grounded in observed behavior. In a user study, participants found InvisibleMentor's suggestions clearer, more relevant, and more useful than those from a prompt-based assistant, demonstrating that AI can do more than automate away work: it can help users master it.
2026 | Litao Yan et al. | University of Pennsylvania | Human-LLM Collaboration; AI-Assisted Decision-Making & Automation; User Research Methods (Interviews, Surveys, Observation) | CHI
Programmers Who Use Screen Readers in the Vibe Coding Era: Adaptation, Empowerment, and New Accessibility Landscape
Generative AI agents are reshaping human-computer interaction, shifting users from direct task execution to supervising machine-driven actions, especially with the rise of "vibe coding" in programming. Yet little is known about how programmers who use screen readers interact with AI code assistants in practice. We conducted a longitudinal study with 16 blind and low-vision programmers. Participants completed a GitHub Copilot tutorial, engaged with a programming task, and provided initial feedback. After two weeks of AI-assisted programming, follow-ups examined how their practices and perceptions evolved. Our findings show that code assistants enhanced programming efficiency and bridged accessibility gaps. However, participants struggled to convey intent, interpret AI outputs, and manage multiple views while maintaining situational awareness. They showed diverse preferences for accessibility features, expressed a need to balance automation with control, and encountered barriers when learning to use these tools. Furthermore, we propose design principles and recommendations for more accessible and inclusive human-AI collaborations.
2026 | Nan Chen et al. | Microsoft Research | Generative AI (Text, Image, Music, Video); Explainable AI (XAI); Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille) | CHI
From Use to Oversight: How Mental Models Influence User Behavior and Output in AI Writing Assistants
AI-based writing assistants are ubiquitous, yet little is known about how users' mental models shape their use. We examine two types of mental models (functional, related to what the system does, and structural, related to how the system works) and how they affect control behavior (how users request, accept, or edit AI suggestions as they write) and writing outcomes. We primed participants (N = 48) with different system descriptions to induce these mental models before asking them to complete a cover letter writing task using a writing assistant that occasionally offered preconfigured ungrammatical suggestions to test whether the mental models affected participants' critical oversight. We find that while participants in the structural mental model condition demonstrate a better understanding of the system, this can backfire: although these participants judged the system as more usable, they also produced letters with more grammatical errors, highlighting a complex relationship between system understanding, trust, and control in contexts that require user oversight of error-prone AI outputs.
2026 | Shalaleh Rismani et al. | McGill University | Human-LLM Collaboration; Explainable AI (XAI); AI-Assisted Writing & Text Generation | CHI
From Struggle to Success: Context-Aware Guidance for Screen Reader Users in Computer Use
Equal access to digital technologies is critical for education, employment, and social participation. However, mainstream interfaces are visually oriented, creating steep learning curves and frequent obstacles for screen reader users, and limiting their independence and opportunities. Existing support is inadequate: tutorials mainly target sighted users, while human assistance lacks real-time availability. We introduce AskEase, an on-demand AI assistant that provides step-by-step, screen reader user-friendly guidance for computer use. AskEase manages multiple sources of context to infer user intent and deliver precise, situation-specific guidance. Its seamless interaction design minimizes disruption and reduces the effort of seeking help. We demonstrated its effectiveness through representative usage scenarios and robustness tests. In a within-subjects study with 12 screen reader users, AskEase significantly improved task success while reducing perceived workload, including physical demand, effort, and frustration. These results demonstrate the potential of LLM-powered assistants to promote accessible computing and expand opportunities for users with visual impairments.
2026 | Nan Chen et al. | Microsoft Research | Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille); Human-LLM Collaboration; Explainable AI (XAI) | CHI
Designing Staged Evaluation Workflows for LLMs: Integrating Domain Experts, Lay Users, and Model-Generated Evaluation Criteria
Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet evaluating their outputs remains challenging. A common strategy is to apply evaluation criteria to assess alignment with domain-specific standards, yet little is understood about how criteria differ across sources or where each type is most useful in the evaluation process. This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow. Results show that experts produce fact-based criteria with long-term value, lay users emphasize usability with a shorter-term focus, and LLMs target procedural checks for immediate task requirements. We also examine how criteria evolve between a priori and a posteriori phases, noting drift across stages as well as convergence in the a posteriori phase. Based on our observations, we propose design guidelines for a staged evaluation workflow combining the complementary strengths of these sources to balance quality, cost, and scalability.
2026 | Annalisa Szymanski et al. | University of Notre Dame | Human-LLM Collaboration; AI-Assisted Decision-Making & Automation; User Research Methods (Interviews, Surveys, Observation) | CHI
Engaging Communities Meaningfully in Defining Disability Representation for AI Image Generation
Media representations of people with disabilities profoundly influence societal perceptions, yet have historically been absent, stereotyped, or inaccurate. As AI-generated visual media becomes increasingly prevalent, there is a critical opportunity to address these misrepresentations. Responding to the lack of collectively negotiated representation standards, this paper presents our human-centric approach to engaging disability communities meaningfully in AI data practices. Over three months, we worked closely with three disability organizations across the Global North and South to develop the Community Library Creator, which introduces design scaffolds to support communities in defining 'good' representation and curating community-centric AI datasets, laying the foundations for community-specific evaluation metrics and future model adaptations. We contribute qualitative insights into the complexities of community-led data curation; discuss the value and practical challenges of intersecting human insights with AI requirements; and reflect on human-centered AI approaches that empower communities to share their perspectives and actively shape AI data practices.
2026 | Anja Thieme et al. | Microsoft Research | AI Ethics, Fairness & Accountability; Algorithmic Fairness & Bias; Developing Countries & HCI for Development (HCI4D) | CHI
Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling
Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlook cross-temporal modeling of agents' social behaviors and internal emotions: generated behaviors rarely influence an agent's emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study showing that CTEM improves perceived naturalness, coherence, and emotional harmony.
2026 | Yi Zheng et al. | Communication University of China | Agent Personality & Anthropomorphism; Affective Human-Computer Dialogue; Human-LLM Collaboration | CHI
InfoAlign: A Human–AI Co-Creation System for Storytelling with Infographics
Storytelling infographics are a powerful medium for communicating data-driven stories through visual presentation. However, existing authoring tools lack support for maintaining story consistency and aligning with users' story goals throughout the design process. To address this gap, we conducted formative interviews and a quantitative analysis to identify design needs and common story-informed layout patterns in infographics. Based on these insights, we propose a narrative-centric workflow for infographic creation consisting of three phases: story construction, visual encoding, and spatial composition. Building on this workflow, we developed InfoAlign, a human–AI co-creation system that transforms long or unstructured text into stories, recommends semantically aligned visual designs, and generates layout blueprints. Users can intervene and refine the design at any stage, ensuring their intent is preserved and the infographic creation process remains transparent. Evaluations show that InfoAlign preserves story coherence across authoring stages and effectively supports human–AI co-creation for storytelling infographic design.
2026 | Jielin Feng et al. | Fudan University | Data Storytelling; Generative AI (Text, Image, Music, Video); Creative Collaboration & Feedback Systems | CHI
Identifying, Explaining, and Correcting Ableist Language with AI
Ableist language perpetuates harmful stereotypes and exclusion, yet its nuanced nature makes it difficult to recognize and address. Artificial intelligence could serve as a powerful ally in the fight against ableist language, offering tools that detect and suggest alternatives to biased terms. This two-part study investigates the potential of large language models (LLMs), specifically ChatGPT, to rectify ableist language and educate users about inclusive communication. We compared GPT-4o generations with crowdsourced annotations from trained disability community members, then invited disabled participants to evaluate both. Participants reported equal agreement with human and AI annotations but significantly preferred the AI, citing its narrative consistency and accessible style. At the same time, they valued the emotional depth and cultural grounding of human annotations. These findings highlight the promise and limits of LLMs in handling culturally sensitive content. Our contributions include a dataset of nuanced ableism annotations and design considerations for inclusive writing tools.
2026 | Kynnedy Simone Smith et al. | Carnegie Mellon University | Human-LLM Collaboration; AI Ethics, Fairness & Accountability; Cognitive Impairment & Neurodiversity (Autism, ADHD, Dyslexia) | CHI
From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms to Foster Dignified Human-AI Interaction
In the future of work discourse, AI is touted as the ultimate productivity amplifier. Yet, beneath the efficiency gains lie subtle erosions of human expertise and agency. This paper shifts focus from the future of work to the future of workers by navigating the AI-as-Amplifier Paradox: AI's dual role as enhancer and eroder, simultaneously strengthening performance while eroding underlying expertise. We present a year-long study on longitudinal use of AI in a high-stakes workplace among cancer specialists. Initial operational gains hid "intuition rust": the gradual dulling of expert judgment. These asymptomatic effects evolved into chronic harms, such as skill atrophy and identity commoditization. Building on these findings, we offer a framework for dignified Human-AI interaction co-constructed with professional knowledge workers facing AI-induced skill erosion without traditional labor protections. The framework operationalizes sociotechnical immunity through dual-purpose mechanisms that serve institutional quality goals while building worker power to detect, contain, and recover from skill erosion, and preserve human identity. Evaluated across healthcare and software engineering, our work takes a foundational step toward dignified human-AI interaction futures by balancing productivity with preservation of human expertise.
2026 | Upol Ehsan et al. | Northeastern University | AI-Assisted Decision-Making & Automation; AI Ethics, Fairness & Accountability; Algorithmic Fairness & Bias | CHI
The ORBIT India Dataset: Understanding the Challenges of Collecting a Disability-First AI Dataset in Low-Resource Environments
Computer vision systems are increasingly used by blind individuals to navigate their lives, helping them, for example, locate objects such as doors or chairs. Yet these recognition systems do not work for many personal objects a blind user might want to find, such as keys or a special notebook. In response, prior efforts have created personalized recognition systems, where individuals train their phones to identify and locate things, like a coffee mug or white cane, using example images/videos. However, these tools are trained on data from high-resource contexts, not necessarily reflecting India's material culture. This paper discusses the contribution of the ORBIT-India dataset, which extends these tools to the Indian context, home of the world's largest blind population. The ORBIT-India dataset comprises 105,243 images from 587 videos, representing 76 unique objects. We use this experience to examine dataset collection practices translated from high- to low-resource settings, providing recommendations to support cross-geography dataset collection.
2026 | Gesu India et al. | Swansea University | Visual Impairment Technologies (Screen Readers, Tactile Graphics, Braille); Cognitive Impairment & Neurodiversity (Autism, ADHD, Dyslexia); Generative AI (Text, Image, Music, Video) | CHI
Interaction-Augmented Instruction: Modeling the Synergy of Prompts and Interactions in Human-GenAI Collaboration
Text prompts are the most common means of human-generative AI (GenAI) communication. Though convenient, they make it challenging to convey fine-grained and referential intent. One promising solution is to combine text prompts with precise GUI interactions, like brushing and clicking. However, there is no formal model to capture synergistic designs between prompts and interactions, which hinders their comparison and innovation. To fill this gap, via an iterative and deductive process, we develop the Interaction-Augmented Instruction (IAI) model, a compact entity–relation graph formalizing how the combination of interactions and text prompts enhances human-GenAI communication. With the model, we distill twelve recurring and composable atomic interaction paradigms from prior tools, verifying our model's capability to facilitate systematic design characterization and comparison. Four usage scenarios further demonstrate the model's utility in applying, refining, and innovating these paradigms. These results illustrate the IAI model's descriptive, discriminative, and generative power for shaping future GenAI systems.
2026 | Leixian Shen et al. | The Hong Kong University of Science and Technology | Generative AI (Text, Image, Music, Video); Human-LLM Collaboration; Prototyping & User Testing | CHI
A Framework to Characterize Reporting on Generative AI Use
Unlike traditional predictive AI models, today's generative AI models are increasingly designed to be general-purpose, able to perform a wide range of tasks. This makes it challenging to develop a reliable and useful understanding of the ways in which this technology is and could be used. As a result, academic and policy researchers and generative AI providers have started to publish the results of their own investigations about the use of generative AI. This information is, however, fragmented, potentially incomplete, sometimes ambiguous, and often lacking in methodological specificity. In this paper, we conducted an integrative review to build a multi-dimensional framework that specifies what kind of information about generative AI use could be reported and how, and illustrated its analytical utility by applying the framework to a collection of over 110 industry documents. Our analysis reveals systematic patterns and omissions in current industry reporting and reflects on the narratives this reporting collectively advances about generative AI use.
2026 | Agathe Balayn et al. | Microsoft Research | Generative AI (Text, Image, Music, Video); Explainable AI (XAI); AI Ethics, Fairness & Accountability | CHI
The Promise and Peril of On-Device AI for Conservation Work
At the heart of conservation are the field staff who study and monitor ecosystems in challenging environments. Recent advances in AI models raise the question of whether LLM assistants could improve the experience of data collection for these staff. However, on-device AI deployment for conservation field work poses significant challenges and is understudied. To address this gap, we conducted semi-structured interviews, surveys, and participant observation with partner conservancies in the Pacific Northwest and Namibia to better understand the field work context. We employ speculative methods through the lens of technology acceptance theory to critically analyze how on-device AI would affect field work, by developing an on-device transcription-language model pipeline, which we built atop EarthRanger, a widely used, open-source conservation platform. Our findings suggest that although on-device LLMs hold some promise for field work, the infrastructure required by current on-device models clashes with the reality of resource-limited conservation settings.
2026 | Cynthia Dong et al. | University of Washington | Human-LLM Collaboration; Field Studies; Computational Methods in HCI | CHI
Designing Culturally Aligned AI Systems For Social Good in Non-Western Contexts
AI technologies are increasingly deployed in high-stakes domains such as education, healthcare, law, and agriculture to address complex challenges in non-Western contexts. This paper examines eight real-world deployments spanning seven countries and 18 languages, combining 17 interviews with AI developers and domain experts with secondary research. Our findings identify six cross-cutting factors (Language, Institution, Safety, Task, End-User Demography, and Domain) that structured how systems were designed and deployed. These factors were shaped by Sociocultural (diversity, practices), Institutional (resources, policies), and Technological (capabilities, limits) influences. We find that building effective AI systems required extensive collaboration between AI developers and domain experts, with human resources proving more critical to achieving safe and effective outcomes in high-stakes domains than technological expertise alone. Additionally, we present 12 guidelines synthesizing these dynamics for designing AI-for-social-good systems that are culturally grounded, equitable, and responsive to the needs of non-Western contexts.
2026 | Deepak Varuvel Dennison et al. | Cornell University | AI Ethics, Fairness & Accountability; Developing Countries & HCI for Development (HCI4D); Human-LLM Collaboration | CHI
Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.
2026 | Hamna Hamna et al. | Microsoft Corporation | Human-LLM Collaboration; Multilingual & Cross-Cultural Voice Interaction; Mental Health Apps & Online Support Communities | CHI