Mapping the Design Space of User Experience for Computer Use AgentsLarge language model (LLM)-based computer use agents execute user commands by interacting with available UI elements, but little is known about how users want to interact with these agents or what design factors matter for their user experience (UX). We conducted a two-phase study to map the UX design space for computer use agents. In Phase 1, we reviewed existing systems to develop a taxonomy of UX considerations, then refined it through interviews with eight UX and AI practitioners. The resulting taxonomy included categories such as user prompts, explainability, user control, and users’ mental models, with corresponding subcategories and example design features. In Phase 2, we ran a Wizard-of-Oz study with 20 participants, where a researcher acted as a web-based computer use agent and probed user reactions during normal, error-prone and risky execution. We used the findings to validate the taxonomy from Phase 1 and deepen our understanding of the design space by identifying the connections between design areas and divergence in user needs and scenarios. Our taxonomy and empirical insights provide a map for developers to consider different aspects of user experience in computer use agent design and to situate their designs within users' diverse needs and scenarios.2026RCRuijia Cheng et al.AppleHuman-LLM CollaborationExplainable AI (XAI)AI-Assisted Decision-Making & AutomationIUI
Athena: Intermediate Representations for Iterative Scaffolded App Generation with an LLMIt is challenging to generate the code for a complete user interface using a Large Language Model (LLM). User interfaces are complex and their implementations often consist of multiple, inter-related files that together specify the contents of each screen, the navigation flows between the screens, and the data model used throughout the application. It is challenging to craft a single prompt for an LLM that contains enough detail to generate a complete user interface, and even then the result is frequently a single large and intricate file that contains all of the generated screens. In this paper, we introduce Athena, a prototype application generation environment that demonstrates how the use of shared intermediate representations, including an app storyboard, data model, and GUI skeletons, can help a developer work with an LLM in an iterative fashion to craft a complete user interface. These intermediate representations also scaffold the LLM’s code generation process, producing organized and structured code in multiple files while limiting errors. We evaluated Athena with a user study with 12 developers. Participants appreciated Athena’s support for prototyping multi-screen iOS apps, acknowledged that the intermediate representations improved their control and understanding of generated code, and discussed the limitations of the system and potential directions for improvement.2026JBJon-Tait Beason et al.AppleHuman-LLM CollaborationPrototyping & User TestingComputational Methods in HCIIUI
SceneScout: Towards AI-Driven Access to Street Level Imagery for Blind UsersPeople who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation assistance, those supporting pre-travel assistance typically provide information about only landmarks and turn-by-turn instructions, lacking detailed visual context. Street level imagery, which contains rich visual information and has the potential to reveal environmental details, remains inaccessible to BLV people. In this work, we present SceneScout, a multimodal large language model (MLLM)-driven prototype that enables accessible interactions with street level imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling user-driven movement within street level imagery. Our user study demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. An initial analysis of AI-generated descriptions suggests that the majority are accurate and describe stable visual elements even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of street level imagery-based navigation experiences.2026GJGaurav Jain et al.Columbia UniversityVisual Impairment Technologies (Screen Readers, Tactile Graphics, Braille)Privacy by Design & User ControlGenerative AI (Text, Image, Music, Video)CHI
The Way We Notice, That’s What Really Matters: Instantiating UI Components with Distinguishing VariationsFront-end developers author UI components to be broadly reusable by parameterizing visual and behavioral properties. While flexible, this makes instantiation harder, as developers must reason about numerous property values and interactions. In practice, they must explore the component’s large design space and provide realistic and natural values to properties. To address this, we introduce distinguishing variations: variations that are both mimetic and distinct. We frame distinguishing variation generation as design-space sampling, combining symbolic inference to identify visually important properties with an LLM-driven mimetic sampler to produce realistic instantiations from its world knowledge. We instantiate distinguishing variations in Celestial, a tool that helps developers explore and visualize distinguishing variations. In a study with front-end developers (n=12), participants found these variations useful for comparing and mapping component design spaces, reported that mimetic instantiations were domain-relevant, and validated that Celestial transformed component instantiation from a manual process into a structured, exploratory activity.2026PVPriyan Vaithilingam et al.AppleHuman-LLM CollaborationPrototyping & User TestingComputational Methods in HCICHI
Improving User Interface Generation Models from Designer FeedbackDespite being trained on vast amounts of data, most LLMs are unable to reliably generate well-designed UIs. Designer feedback is essential to improving performance on UI generation; however, we find that existing RLHF methods based on ratings or rankings are not well-aligned with designers' workflows and ignore the rich rationale used to critique and improve UI designs. In this paper, we investigate several approaches for designers to give feedback to UI generation models, using familiar interactions such as commenting, sketching and direct manipulation. We first perform an evaluation with 21 designers where they gave feedback using these interactions, which resulted in ~1500 design annotations. We then use this data to finetune a series of LLMs to generate higher quality UIs. Finally, we evaluate these models with human judges, and we find that our designer-aligned approaches outperform models trained with traditional ranking feedback and all tested baselines, including GPT-5.2026JWJason Wu et al.AppleHuman-LLM CollaborationPrototyping & User Testing360° Video & Panoramic ContentCHI
Presenting Large Language Models as Companions Affects What Mental Capacities People Attribute to ThemHow might messages about large language models (LLMs) found in public discourse influence the way people think about and interact with these models? To explore this question, we randomly assigned participants (N = 470) to watch short informational videos presenting LLMs as either machines, tools, or companions---or to watch no video. We then assessed how strongly they believed LLMs to possess various mental capacities, such as the ability to have intentions or remember things. We found that participants who watched video messages presenting LLMs as companions reported believing that LLMs more fully possessed these capacities than did participants in other groups. In a follow-up study (N = 604), we replicated these findings and found nuanced effects on how these videos also impact people's reliance on LLM-generated responses when seeking out factual information. Together, these studies suggest that messages about LLMs---beyond technical advances---may shape what people believe about these systems and how they rely on LLM-generated responses.2026ACAllison Chen et al.Princeton UniversityHuman-LLM CollaborationAgent Personality & AnthropomorphismExplainable AI (XAI)CHI
Policy Maps: Tools for Guiding the Unbounded Space of LLM BehaviorsAI policy sets boundaries on acceptable behavior for AI models, but this is challenging in the context of large language models (LLMs): how do you ensure coverage over a vast behavior space? We introduce policy maps, an approach to AI policy design inspired by the practice of physical mapmaking. Instead of aiming for full coverage, policy maps aid effective navigation through intentional design choices about which aspects to capture and which to abstract away. With Policy Projector, an interactive tool for designing LLM policy maps, an AI practitioner can survey the landscape of model input-output pairs, define custom regions (e.g., “violence”), and navigate these regions with if-then policy rules that can act on LLM outputs (e.g., if output contains “violence” and “graphic details,” then rewrite without “graphic details”). Policy Projector supports interactive policy authoring using LLM classification and steering and a map visualization reflecting the AI practitioner’s work. In an evaluation with 12 AI safety experts, our system helps policy designers craft policies around problematic model behaviors such as incorrect gender assumptions and handling of immediate physical safety threats.2025MLMichelle S. Lam et al.Explainable AI (XAI)Algorithmic Transparency & AuditabilityAlgorithmic Fairness & BiasUIST
SQUIRE: Interactive UI Authoring via Slot QUery Intermediate REpresentationsFrontend developers create UI prototypes to evaluate alternatives, which is a time-consuming process of repeated iteration and refinement. Generative AI code assistants enable rapid prototyping simply by prompting through a chat interface rather than writing code. However, while this interaction gives developers flexibility since they can write any prompt they wish, it makes it challenging to control what is generated. First, natural language on its own can be ambiguous, making it difficult for developers to precisely communicate their intentions. Second, the model may respond unpredictably, requiring the developer to re-prompt through trial-and-error to repair any undesired changes. To address these weaknesses, we introduce SQUIRE, a system designed for guided prototype exploration and refinement. In SQUIRE, the developer incrementally builds a UI component tree by pointing and clicking on different alternatives suggested by the system. Additional affordances let the developer refine the appearance of the targeted UI. All interactions are explicitly scoped, with guarantees on what portions of the UI will and will not be mutated. The system is supported by a novel intermediate representation called SQUIREIR with language support for controlled exploration and refinement. Through a user study where 11 frontend developers used SQUIRE to implement mobile web prototypes, we find that developers effectively explore and iterate on different UI alternatives with high levels of perceived control. Developers additionally scored SQUIRE positively for usability and general satisfaction. Our findings suggest the strong potential for code generation to be controlled in rapid UI prototyping tools by combining chat with explicitly scoped affordances.2025ALAlan Leung et al.Human-LLM CollaborationKnowledge Worker Tools & WorkflowsUIST
From Interaction to Impact: Towards Safer AI Agent Through Understanding and Evaluating Mobile UI Operation ImpactsWith advances in generative AI, there is increasing work towards creating autonomous agents that can manage daily tasks by operating user interfaces (UIs). While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions—particularly those that may be risky or irreversible—remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents. We began by developing a taxonomy of the impacts of mobile UI actions through a series of workshops with domain experts. Following this, we conducted a data synthesis study to gather realistic mobile UI screen traces and action data that users perceive as impactful. We then used our impact categories to annotate our collected data and data repurposed from existing mobile UI navigation datasets. Our quantitative evaluations of different large language models (LLMs) and variants demonstrate how well different LLMs can understand the impacts of mobile UI actions that might be taken by an agent. We show that our taxonomy enhances the reasoning capabilities of these LLMs for understanding the impacts of mobile UI actions, but our findings also reveal significant gaps in their ability to reliably classify more nuanced or complex categories of impact.2025ZZZhuohao (Jerry) Zhang et al.Generative AI (Text, Image, Music, Video)AI-Assisted Decision-Making & AutomationIUI
ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine ConversationsMultimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 353K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to UI verification.2025YJYue Jiang et al.Voice User Interface (VUI) DesignHuman-LLM CollaborationIUI
Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data AnnotationText-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We observe that the performance of existing text-to-SQL models drops dramatically when applied to a new schema, primarily due to the lack of domain-specific data for fine-tuning. Furthermore, this lack of data for the new schema also hinders our ability to effectively evaluate the model's performance in the new domain. Nevertheless, it is expensive to continuously obtain text-to-SQL data for an evolving schema in most real-world applications. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through collaboration between humans and a large language model in a structured workflow. A within-subject user study comparing SQLsynth to manual annotation and ChatGPT reveals that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.2025YTYuan Tian et al.Human-LLM CollaborationAutoML InterfacesIUI
VibE: A Visual Analytics Workflow for Semantic Error Analysis of CVML Models at Subgroup LevelEffective error analysis is critical for the successful development and deployment of CVML models. One approach to understanding model errors is to summarize the common characteristics of error samples. This can be particularly challenging in tasks that utilize unstructured, complex data such as images, where patterns are not always obvious. Another method is to analyze error distributions across pre-defined categories, which requires analysts to hypothesize about potential error causes in advance. Forming such hypotheses without access to explicit labels or annotations makes it difficult to isolate meaningful subgroups or patterns, however, as analysts must rely on manual inspection, prior expertise, or intuition. This lack of structured guidance can hinder a comprehensive understanding of where models fail. To address these challenges, we introduce VibE, a semantic error analysis workflow designed to identify where and why computer vision and machine learning (CVML) models fail at the subgroup level, even when labels or annotations are unavailable. VibE incorporates several core features to enhance error analysis: semantic subgroup generation, semantic summarization, candidate issue proposals, semantic concept search, and interactive subgroup analysis. By leveraging large foundation models (such as CLIP and GPT-4) alongside visual analytics, VibE enables developers to semantically interpret and analyze CVML model errors. This interactive workflow helps identify errors through subgroup discovery, supports hypothesis generation with auto-generated subgroup summaries and suggested issues, and allows hypothesis validation through semantic concept search and comparative analysis. Through three diverse CVML tasks and in-depth expert interviews, we demonstrate how VibE can assist error understanding and analysis.2025JYJun Yuan et al.Human-LLM CollaborationInteractive Data VisualizationIUI
Towards AI-driven Sign Language Generation with Non-manual MarkersSign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating from written languages, such as English, into signed videos. However, current systems often fail to meet user needs due to poor translation of grammatical structures, the absence of facial cues and body language, and insufficient visual and motion fidelity. We address these challenges by building on recent advances in LLMs and video generation models to translate English sentences into natural-looking AI ASL signers. The text component of our model extracts information for manual and non-manual components of ASL, which are used to synthesize skeletal pose sequences and corresponding video frames. Our findings from a user study with 30 DHH participants and thorough technical evaluations demonstrate significant progress and identify critical areas necessary to meet user needs.2025HZHan Zhang et al.University of Washington, Paul G. Allen School of Computer Science and EngineeringVoice AccessibilityDeaf & Hard-of-Hearing Support (Captions, Sign Language, Vibration)CHI
Exploring Empty Spaces: Human-in-the-Loop Data AugmentationData augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.2025CYCatherine Yeh et al.Harvard UniversityGenerative AI (Text, Image, Music, Video)Human-LLM CollaborationComputational Methods in HCICHI
InterLink: Linking Text with Code and Output in Computational NotebooksComputational notebooks, widely used for ad-hoc analysis and often shared with others, can be difficult to understand because the standard linear layout is not optimized for reading. In particular, related text, code, and outputs may be spread across the UI making it difficult to draw connections. In response, we introduce InterLink, a plugin designed to present the relationships between text, code, and outputs, thereby making notebooks easier to understand. In a formative study, we identify pain points and derive design requirements for identifying and navigating relationships among various pieces of information within notebooks. Based on these requirements, InterLink features a new layout that separates text from code and outputs into two columns. It uses visual links to signal relationships between text and associated code and outputs and offers interactions for navigating related pieces of information. In a user study with 12 participants, those using InterLink were 13.6% more accurate at finding and integrating information from complex analyses in computational notebooks. These results show the potential of notebook layouts that make them easier to understand.2025YLYanna Lin et al.The Hong Kong University of Science and Technology, Department of Computer Science and Engineering; Human-Computer Interaction Institute, Carnegie Mellon UniversityInteractive Data VisualizationKnowledge Worker Tools & WorkflowsCHI
Misty: UI Prototyping Through Interactive Conceptual BlendingUI prototyping often involves iterating and blending elements from examples such as screenshots and sketches, but current tools offer limited support for incorporating these examples. Inspired by the cognitive process of conceptual blending, we introduce a novel UI workflow that allows developers to rapidly incorporate diverse aspects from design examples into work-in-progress UIs. We prototyped this workflow as Misty. Through an exploratory first-use study with 14 frontend developers, we assessed Misty's effectiveness and gathered feedback on this workflow. Our findings suggest that Misty's conceptual blending workflow helps developers kickstart creative explorations and flexibly specify intent in different stages of prototyping, and inspires them through serendipitous UI blends. Misty demonstrates the potential for tools that blur the boundaries between developers and designers.2025YLYuwen Lu et al.University of Notre Dame, Computer Science and EngineeringKnowledge Worker Tools & WorkflowsPrototyping & User TestingCHI
ProxiCycle: Passively Mapping Cyclist Safety Using Smart Handlebars for Near-Miss DetectionActive transportation is a valuable tool to prevent some of the most common causes of mortality worldwide, but is severely underutilized. The primary factors preventing cyclist adoption are safety concerns, specifically, the fear of collision from automobiles. One solution to address this concern is to direct cyclists to known safe routes to minimize risk and stress, thus making cycling more approachable. However, few localized safety priors are available, hindering safety-based routing. Specifically, road user behavior is unknown. To address this issue, we develop a novel handlebar attachment to passively monitor the proximity of passing cars as an indicator of cycling safety along historically traveled routes. We deploy this sensor with 15 experienced cyclists in a 2-month longitudinal study to source a citywide map of car passing distance. We then compare this signal to both historic collisions and perceived safety reported by experienced and inexperienced cyclists.2025JBJoseph Breda et al.University of Washington, Paul G. Allen School of Computer Science & EngineeringMotion Sickness & Passenger ExperiencePedestrian & Cyclist SafetyCHI
eaSEL: Promoting Social-Emotional Learning and Parent-Child Interaction through AI-Mediated Content ConsumptionAs children increasingly consume media on devices, parents look for ways this usage can support learning and growth, especially in domains like social-emotional learning. We introduce eaSEL, a system that (a) integrates social-emotional learning (SEL) curricula into children’s video consumption by generating reflection activities and (b) facilitates parent-child discussions around digital media without requiring co-consumption of videos. We present a technical evaluation of our system’s ability to detect social-emotional moments within a transcript and to generate high-quality SEL-based activities for both children and parents. Through a user study with N = 20 parent-child dyads, we find that after completing an eaSEL activity, children reflect more on the emotional content of videos. Furthermore, parents find that the tool promotes meaningful active engagement and could scaffold deeper conversations around content. Our work paves directions in how AI can support children’s social-emotional reflection of media and family connections in the digital age.2025JSJocelyn J Shen et al.Massachusetts Institute of Technology, MIT Media LabEarly Childhood Education TechnologyCollaborative Learning & Peer TeachingMental Health Apps & Online Support CommunitiesCHI
Perceptions of the Fairness Impacts of Multiplicity in Machine LearningMachine learning (ML) is increasingly used in high-stakes settings, yet multiplicity – the existence of multiple good models – means that some predictions are essentially arbitrary. ML researchers and philosophers posit that multiplicity poses a fairness risk, but no studies have investigated whether stakeholders agree. In this work, we conduct a survey to see how multiplicity impacts lay stakeholders’ – i.e., decision subjects’ – perceptions of ML fairness, and which approaches to address multiplicity they prefer. We investigate how these perceptions are modulated by task characteristics (e.g., stakes and uncertainty). Survey respondents think that multiplicity threatens the fairness of model outcomes, but not the appropriateness of using the model, even though existing work suggests the opposite. Participants are strongly against resolving multiplicity by using a single model (effectively ignoring multiplicity) or by randomizing the outcomes. Our results indicate that model developers should be intentional about dealing with multiplicity in order to maintain fairness.2025AMAnna P. Meyer et al.University of Wisconsin - MadisonExplainable AI (XAI)Algorithmic Fairness & BiasCHI
Towards Automated Accessibility Report Generation for Mobile AppsACM DL: https://dl.acm.org/doi/full/10.1145/3674967 Many apps have basic accessibility issues, like missing labels or low contrast. To supplement manual testing, automated tools can help developers and QA testers find basic accessibility issues, but they can be laborious to use or require writing dedicated tests. To motivate our work, we interviewed eight accessibility QA professionals at a large technology company. From these interviews, we synthesized three design goals for accessibility report generation systems. Motivated by these goals, we developed a system to generate whole app accessibility reports by combining varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility reporting is to effectively de-duplicate and summarize issues collected across an app. To this end, we developed a screen grouping model with 96.9% accuracy (88.8% F1-score) and UI element matching heuristics with 97% accuracy (98.2% F1-score). We combine these technologies in a system to report and summarize unique issues across an app, and enable a unique pixel-based ignore feature to help engineers and testers better manage reported issues across their app’s lifetime. We conducted a user study where 19 accessibility engineers and testers used multiple tools to create lists of prioritized issues in the context of an accessibility audit. Our system helped them create lists they were more satisfied with while addressing key limitations of current accessibility scanning tools.2024ASAmanda Swearngin et al.Voice AccessibilityUniversal & Inclusive DesignPrivacy Perception & Decision-MakingUIST