Criticality: Scaffolding Decision-Making with Interactive Critical Thinking and Evidence-Based Reasoning Traces
Decision-making requires examining underlying assumptions and concepts, considering diverse perspectives, and weighing potential consequences with clear, accurate reasoning. Recent large language models (LLMs) show promise for assisting decision-makers by combining reasoning capabilities with the ability to retrieve relevant information from large documents. However, our formative study with five professional decision-makers revealed key limitations of using LLMs in their workflows: time-consuming alignment of user goals, lack of evidence-based grounding, overwhelmingly long outputs, and unsurfaced assumptions undermined user trust in the LLM output and the validity of the final decision. We introduce Criticality, a system that operationalizes the Paul-Elder Critical Thinking framework to structure reasoning into interactive Elements of Thought (e.g., purpose, assumptions, perspectives, implications), and evaluates and guides reasoning using Intellectual Standards (e.g., clarity, fairness, logic). It also retrieves evidence for each claim, classifies it as supporting, neutral, or contradictory, and explains the claim-evidence link. A within-subjects study (n = 13) comparing Criticality to ChatGPT 5 Pro, a state-of-the-art reasoning model in a conversational interface, found that Criticality improved user interaction for steering and repairing throughout the decision-making process, producing better decision rationales than the baseline.
2026 | Minsuk Chang et al. | Georgia Institute of Technology | Human-LLM Collaboration; Explainable AI (XAI); User Research Methods (Interviews, Surveys, Observation) | IUI
"I Need to Find That One Chart": How Data Workers Navigate, Summarize and Communicate Analytical ConversationsConversational interfaces are increasingly used for data analysis, enabling data workers to express complex analytical intents in natural language. Yet, these interactions unfold as long, linear transcripts that are misaligned with the iterative, nonlinear nature of real-world analyses. Revisiting and summarizing conversations for different contexts is therefore challenging. This paper investigates how data workers navigate, make sense of, and communicate prior analytical conversations. To study behaviors beyond those supported by standard interfaces (i.e., scrolling and keyword search), we develop a design probe that supplements analytical conversations with structured elements and affordances (e.g., filtering, multi-level navigation and detail-on-demand). In a user study ($n = 10$), participants used the probe to navigate and communicate past analyses, fulfilling information needs (recall, reorient, prioritize) through navigation strategies (visual recall, sequential and abstractive) and summarization practices (adding process details and context). Based on these findings, we discuss design implications to support re-visitation and communication of analytical conversations.2026KGKen Gu et al.University of WashingtonUser Research Methods (Interviews, Surveys, Observation)Prototyping & User TestingInteractive Data VisualizationCHI
Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics
Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains challenging: existing approaches require programming expertise, overlook real-world complexity, and lack interpretable metrics for multi-format (visualization and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use-cases, evaluation criteria and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test-cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence) using rule-based and LLM-as-a-judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format, multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers, drawn from our initial cohort of 22. Their feedback demonstrated Lexara's effectiveness for guiding appropriate model and prompt selection.
2026 | Srishti Palani et al. | Tableau Research | Human-LLM Collaboration; Interactive Data Visualization; Explainable AI (XAI) | CHI
DesignWeaver: Dimensional Scaffolding for Text-to-Image Product Design
Generative AI has enabled novice designers to quickly create professional-looking visual representations for product concepts. However, novices have limited domain knowledge, which could constrain their ability to write prompts that effectively explore a product design space. To understand how experts explore and communicate about design spaces, we conducted a formative study with 12 experienced product designers and found that experts (and their less-versed clients) often use visual references to guide co-design discussions rather than written descriptions. These insights inspired DesignWeaver, an interface that helps novices generate prompts for a text-to-image model by surfacing key product design dimensions from generated images into a palette for quick selection. In a study with 52 novices, DesignWeaver enabled participants to craft longer prompts with more domain-specific vocabularies, resulting in more diverse, innovative product designs. However, the nuanced prompts heightened participants' expectations beyond what current text-to-image models could deliver. We discuss the implications of AI-based product design support tools.
2025 | Sirui Tao et al. | University of California San Diego, Department of Computer Science and Engineering / University of California San Diego / ProtoLab | Generative AI (Text, Image, Music, Video); Motor Impairment Assistive Input Technologies; Customizable & Personalized Objects | CHI
Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. We first found that these writers agree on undesirable idiosyncrasies in LLM-generated text, which we formalized into a seven-category taxonomy (e.g., clichés, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperforms the others in terms of writing quality, revealing common limitations across model families. Third, building on existing work in automatic editing, we evaluated methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.
2025 | Tuhin Chakrabarty et al. | Salesforce Research; Columbia University, Computer Science | Human-LLM Collaboration; AI-Assisted Creative Writing | CHI
Proactive Conversational Agents with Inner Thoughts
One of the long-standing aspirations in conversational AI is to allow agents to autonomously take initiative in conversations, i.e., to be proactive. This is especially challenging for multi-party conversations. Prior NLP research focused mainly on predicting the next speaker from contexts like preceding conversations. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that, just like humans, rather than merely reacting to turn-taking cues, a proactive AI formulates its own inner thoughts during a conversation and seeks the right moment to contribute. Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thought in parallel to the overt communication process, which enables it to proactively engage by modeling its intrinsic motivation to express these thoughts. We instantiated this framework into two real-time systems: an AI playground web app and a chatbot. Through a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects like anthropomorphism, coherence, intelligence, and turn-taking appropriateness.
2025 | Xingyu "Bruce" Liu et al. | UCLA, HCI Research | Conversational Chatbots; Agent Personality & Anthropomorphism; Human-LLM Collaboration | CHI
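SonicVista: Towards Creating Awareness of Distant Scenes through Sonification
Gupta et al. developed the SonicVista system, which uses sonification to convert information about remote scenes into sound, enhancing users' awareness of distant environments.
2024 | Chitralekha Gupta et al. | Context-Aware Computing | UbiComp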
Beyond the Chat: Executable and Verifiable Text-Editing with LLMs
Conversational interfaces powered by Large Language Models (LLMs) have recently become a popular way to obtain feedback during document editing. However, standard chat-based conversational interfaces cannot explicitly surface the editing changes that they suggest. To give the author more control when editing with an LLM, we present InkSync, an editing interface that suggests executable edits directly within the document being edited. Because LLMs are known to introduce factual errors, InkSync also supports a 3-stage approach to mitigate this risk: Warn authors when a suggested edit introduces new information, help authors Verify the new information's accuracy through external search, and allow a third party to Audit with a-posteriori verification via a trace of all auto-generated content. Two usability studies confirm the effectiveness of InkSync's components when compared to standard LLM-based chat interfaces, leading to more accurate and more efficient editing, and improved user experience.
2024 | Philippe Laban et al. | Human-LLM Collaboration | UIST
Art or Artifice? Large Language Models and the False Promise of Creativity
Researchers have argued that large language models (LLMs) exhibit high-quality writing capabilities from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
2024 | Tuhin Chakrabarty et al. | Columbia University | Human-LLM Collaboration; AI-Assisted Creative Writing | CHI
Designing and Evaluating Interfaces that Highlight News Coverage Diversity Using Discord Questions
Modern news aggregators do the hard work of organizing a large news stream, creating collections for a given news story with tens of source options. This paper shows that navigating large source collections for a news story can be challenging without further guidance. In this work, we design three interfaces (the Annotated Article, the Recomposed Article, and the Question Grid) aimed at accompanying news readers in discovering coverage diversity while they read. A first usability study with 10 journalism experts confirms that the designed interfaces all reveal coverage diversity and determines each interface's potential use cases and audiences. In a second usability study, we developed and implemented a reading exercise with 95 novice news readers to measure exposure to coverage diversity. Results show that Annotated Article users are able to answer questions 34% more completely than with two existing interfaces while finding the interface equally easy to use.
2023 | Philippe Laban et al. | Salesforce Research | Misinformation & Fact-Checking; User Research Methods (Interviews, Surveys, Observation) | CHI
iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models
Error analysis in NLP models is essential to successful model development and deployment. One common approach for diagnosing errors is to identify subpopulations in the dataset where the model produces the most errors. However, existing approaches typically define subpopulations based on pre-defined features, which requires users to form hypotheses of errors in advance. To complement these approaches, we propose iSEA, an Interactive Pipeline for Semantic Error Analysis in NLP Models, which automatically discovers semantically-grounded subpopulations with high error rates in the context of a human-in-the-loop interactive system. iSEA enables model developers to learn more about their model errors through discovered subpopulations, validate the sources of errors through interactive analysis on the discovered subpopulations, and test hypotheses about model errors by defining custom subpopulations. The tool supports semantic descriptions of error-prone subpopulations at the token and concept level, as well as pre-defined higher-level features. Through use cases and expert interviews, we demonstrate how iSEA can assist error understanding and analysis.
2022 | Jun Yuan et al. | Explainable AI (XAI); AI-Assisted Decision-Making & Automation; Interactive Data Visualization | IUI