WatchHand: Enabling Continuous Hand Pose Tracking On Off-the-Shelf Smartwatches
Tracking hand poses on wrist-wearables enables rich, expressive interactions, yet remains unavailable on commercial smartwatches, as prior implementations rely on external sensors or custom hardware, limiting their real-world applicability. To address this, we present WatchHand, the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. WatchHand emits inaudible frequency-modulated continuous waves and captures their reflections from the hand. These acoustic signals are processed by a deep-learning model that estimates 3D hand poses for 20 finger joints. We evaluate WatchHand across diverse real-world conditions (multiple smartwatch models, wearing hands, body postures, noise conditions, and pose-variation protocols) and achieve a mean per-joint position error of 7.87 mm in cross-session tests with device remounting. Although performance drops for unseen users or gestures, the model adapts effectively with lightweight fine-tuning on small amounts of data. Overall, WatchHand lowers the barrier to smartwatch-based hand tracking by eliminating additional hardware while enabling robust, always-available interactions on millions of existing devices.
2026 | Jiwan Kim et al. | KAIST | Hand Gesture Recognition | Smartwatches & Fitness Bands | Context-Aware Computing | CHI

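The mean per-joint position error (MPJPE) reported above is the average Euclidean distance between each predicted joint and its ground-truth position. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `mpjpe` and the two-joint toy data are assumptions made for illustration.

```python
import math


def mpjpe(pred, truth):
    """Average Euclidean distance between matched predicted and true joints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


# Toy example with two joints (coordinates in mm, hypothetical values).
truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
print(mpjpe(pred, truth))  # first joint is 5 mm off, second is exact: 2.5
```

In a real evaluation the same average would be taken over all 20 joints and all frames.
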
Exploring the Impact of Emotional Voice Integration in Sign-to-Speech Translators for Deaf-to-Hearing Communication
Emotional voice communication plays a crucial role in effective daily interactions. Deaf and Hard of Hearing (DHH) individuals, who often have limited use of voice, rely on facial expressions to supplement sign language and convey emotions. However, in American Sign Language (ASL), facial expressions serve not only emotional purposes but also function as linguistic markers that can alter the meaning of signs. This dual role can often confuse non-signers when interpreting a signer’s emotional state. In this paper, we present studies that: (1) confirm the challenges non-signers face when interpreting emotions from facial expressions in ASL communication, and (2) demonstrate how integrating emotional voice into translation systems can enhance hearing individuals’ understanding of a signer’s emotional intent. An online survey with 45 hearing participants (non-ASL signers) revealed frequent misinterpretations of signers’ emotions when emotional and linguistic facial expressions were used simultaneously. The findings show that incorporating emotional voice into translation systems significantly improves emotion recognition by 32%. Additionally, a follow-up survey with 48 DHH participants highlights design considerations for implementing emotional voice features, emphasizing the importance of emotional voice integration to bridge communication gaps between DHH and hearing communities.
2025 | Hyunchul Lim et al. | Deaf and Hard-of-Hearing Research | CSCW

SpellRing: Recognizing Continuous Fingerspelling in American Sign Language using a Ring
Fingerspelling is a critical part of American Sign Language (ASL) recognition and has become an accessible optional text entry method for Deaf and Hard of Hearing (DHH) individuals. In this paper, we introduce SpellRing, a single smart ring worn on the thumb that recognizes words continuously fingerspelled in ASL. SpellRing uses active acoustic sensing (via a microphone and speaker) and an inertial measurement unit (IMU) to track handshape and movement, which are processed through a deep learning algorithm using Connectionist Temporal Classification (CTC) loss. We evaluated the system with 20 ASL signers (13 fluent and 7 learners), using the MacKenzie-Soukoreff Phrase Set of 1,164 words and 100 phrases. Offline evaluation yielded top-1 and top-5 word recognition accuracies of 82.45% (±9.67%) and 92.42% (±5.70%), respectively. In real-time, the system achieved a word error rate (WER) of 0.099 (±0.039) on the phrases. Based on these results, we discuss key lessons and design implications for future minimally obtrusive ASL recognition wearables.
2025 | Hyunchul Lim et al. | Cornell, Computing and Information Science | Foot & Wrist Interaction | Voice Accessibility | Motor Impairment Assistive Input Technologies | CHI

EchoGuide: Active Acoustic Guidance for LLM-Based Eating Event Analysis from Egocentric Videos
Self-recording eating behaviors is a step towards a healthy lifestyle recommended by many health professionals. However, the current practice of manually recording eating activities using paper records or smartphone apps is often unsustainable and inaccurate. Smart glasses have emerged as a promising wearable form factor for tracking eating behaviors, but existing systems primarily identify when eating occurs without capturing details of the eating activities (e.g., what is being eaten). In this paper, we present EchoGuide, an application and system pipeline that leverages low-power active acoustic sensing to guide head-mounted cameras to capture egocentric videos, enabling efficient and detailed analysis of eating activities. By combining active acoustic sensing for eating detection with video captioning models and large-scale language models for retrieval augmentation, EchoGuide intelligently clips and analyzes videos to create concise, relevant activity records on eating. We evaluated EchoGuide with 9 participants in naturalistic settings involving eating activities, demonstrating high-quality summarization and significant reductions in video data needed, paving the way for practical, scalable eating activity tracking.
2024 | Vineet Parikh et al. | Diet Tracking & Nutrition Management | Sleep & Stress Monitoring | Biosensors & Physiological Monitoring | UbiComp

MunchSonic: Tracking Fine-grained Dietary Actions through Active Acoustic Sensing on Eyeglasses
We introduce MunchSonic, an AI-powered active acoustic sensing system integrated into eyeglasses to track fine-grained dietary actions. MunchSonic emits inaudible ultrasonic waves from the eyeglass frame, with the reflected signals capturing detailed positions and movements of body parts, including the mouth, jaw, arms, and hands involved in eating. These signals are processed by a deep learning pipeline to classify six actions: hand-to-mouth movements for food intake, chewing, drinking, talking, face-hand touching, and other activities (null). In an unconstrained study with 12 participants, MunchSonic achieved a 93.5% macro F1-score in a user-independent evaluation with a 2-second resolution in tracking these actions, also demonstrating its effectiveness in tracking eating episodes and food intake frequency within those episodes.
2024 | Saif Mahmud et al. | Diet Tracking & Nutrition Management | Biosensors & Physiological Monitoring | UbiComp

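The macro F1-score cited above averages the per-class F1 scores with equal weight, so rare actions count as much as frequent ones. The sketch below illustrates the metric only; the function name `macro_f1` and the toy labels are assumptions for illustration, not data or code from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical window-level labels for three of the six action classes.
y_true = ["chew", "chew", "drink", "null"]
y_pred = ["chew", "drink", "drink", "null"]
print(round(macro_f1(y_true, y_pred), 4))  # (2/3 + 2/3 + 1) / 3 = 0.7778
```
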
SeamPose: Repurposing Seams as Capacitive Sensors in a Shirt for Upper-Body Pose Tracking
Seams are areas of overlapping fabric formed by stitching two or more pieces of fabric together in the cut-and-sew apparel manufacturing process. In SeamPose, we repurposed seams as capacitive sensors in a shirt for continuous upper-body pose estimation. Compared to previous all-textile motion-capturing garments that place the electrodes on the clothing surface, our solution leverages existing seams inside of a shirt by machine-sewing insulated conductive threads over the seams. The invisibility and placement of the seams allow the sensing shirt to look and wear like a conventional shirt while providing exciting pose-tracking capabilities. To validate this approach, we implemented a proof-of-concept untethered shirt with 8 capacitive sensing seams. With a 12-participant user study, our customized deep-learning pipeline accurately estimates the relative (to the pelvis) upper-body 3D joint positions with a mean per joint position error (MPJPE) of 6.0 cm. SeamPose represents a step towards unobtrusive integration of smart clothing for everyday pose estimation.
2024 | Tianhong Catherine Yu et al. | Haptic Wearables | Human Pose & Activity Recognition | Biosensors & Physiological Monitoring | UIST

EchoWrist: Continuous Hand Pose Tracking and Hand-Object Interaction Recognition Using Low-Power Active Acoustic Sensing On a Wristband
Our hands serve as a fundamental means of interaction with the world around us. Therefore, understanding hand poses and interaction contexts is critical for human-computer interaction (HCI). We present EchoWrist, a low-power wristband that continuously estimates 3D hand poses and recognizes hand-object interactions using active acoustic sensing. EchoWrist is equipped with two speakers emitting inaudible sound waves toward the hand. These sound waves interact with the hand and its surroundings through reflections and diffractions, carrying rich information about the hand's shape and the objects it interacts with. The information captured by the two microphones goes through a deep learning inference system that recovers hand poses and identifies various everyday hand activities. Results from two 12-participant user studies show that EchoWrist is effective and efficient at tracking 3D hand poses and recognizing hand-object interactions. Operating at 57.9 mW, EchoWrist can continuously reconstruct 20 3D hand joints with MJEDE of 4.81 mm and recognize 12 naturalistic hand-object interactions with 97.6% accuracy.
2024 | Chi-Jung Lee et al. | Cornell University | Hand Gesture Recognition | Foot & Wrist Interaction | CHI

EyeEcho: Continuous and Low-power Facial Expression Tracking on Glasses
In this paper, we introduce EyeEcho, a minimally-obtrusive acoustic sensing system designed to enable glasses to continuously monitor facial expressions. It utilizes two pairs of speakers and microphones mounted on glasses to emit encoded inaudible acoustic signals directed towards the face, capturing subtle skin deformations associated with facial expressions. The reflected signals are processed through a customized machine-learning pipeline to estimate full facial movements. EyeEcho samples at 83.3 Hz with a relatively low power consumption of 167 mW. Our user study involving 12 participants demonstrates that, with just four minutes of training data, EyeEcho achieves highly accurate tracking performance across different real-world scenarios, including sitting, walking, and after remounting the devices. Additionally, a semi-in-the-wild study involving 10 participants further validates EyeEcho's performance in naturalistic scenarios while participants engage in various daily activities. Finally, we showcase EyeEcho's potential to be deployed on a commercial-off-the-shelf (COTS) smartphone, offering real-time facial expression tracking.
2024 | Ke Li et al. | Cornell University | Hand Gesture Recognition | Eye Tracking & Gaze Interaction | Human Pose & Activity Recognition | CHI

PoseSonic: 3D Upper Body Pose Estimation Through Egocentric Acoustic Sensing on Smartglasses
In this paper, we introduce PoseSonic, an intelligent acoustic sensing solution for smartglasses that estimates upper body poses. Our system only requires two pairs of microphones and speakers on the hinges of the eyeglasses to emit FMCW-encoded inaudible acoustic signals and receive reflected signals for body pose estimation. Using a customized deep learning model, PoseSonic estimates the 3D positions of 9 body joints including the shoulders, elbows, wrists, hips, and nose. We adopt a cross-modal supervision strategy to train our model using synchronized RGB video frames as ground truth. We conducted in-lab and semi-in-the-wild user studies with 22 participants to evaluate PoseSonic, and our user-independent model achieved a mean per joint position error of 6.17 cm in the lab setting and 14.12 cm in the semi-in-the-wild setting when predicting the 9 body joint positions in 3D. Our further studies show that the performance was not significantly impacted by different surroundings, by remounting the devices, or by real-world environmental noise. Finally, we discuss the opportunities, challenges, and limitations of deploying PoseSonic in real-world applications.
https://doi.org/10.1145/3610895
2023 | Saif Mahmud et al. | Human Pose & Activity Recognition | Context-Aware Computing | UbiComp

C-Auth: Exploring the Feasibility of Using Egocentric View of Face Contour for User Authentication on Glasses
In this paper, we present C-Auth, a novel authentication method for smart glasses that explores the feasibility of authenticating users using spatial facial information. Our system uses a down-facing camera in the middle of the glasses to capture facial contour lines from the nose and cheeks. The images captured by the camera are then processed and learned by a customized algorithm for authentication. To evaluate the system, we conducted a user study with 20 participants in three sessions on different days. Our system correctly identified the 20 users with a true positive rate of 98.0% (SD: 2.96%) and a false positive rate of 4.97% (SD: 2.88%) across all three days. We conclude by discussing current limitations and challenges as well as the potential future applications for C-Auth.
2023 | Hyunchul Lim et al. | Passwords & Authentication | On-Skin Display & On-Skin Input | UbiComp

HPSpeech: Silent Speech Interface for Commodity Headphones
We present HPSpeech, a silent speech interface for commodity headphones. HPSpeech utilizes the existing speakers of the headphones to emit inaudible acoustic signals. The movements of the temporomandibular joint (TMJ) during speech modify the reflection pattern of these signals, which are captured by a microphone positioned inside the headphones. To evaluate the performance of HPSpeech, we tested it on two headphones with a total of 18 participants. The results demonstrated that HPSpeech successfully recognized 8 popular silent speech commands for controlling the music player with an accuracy over 90%. While our tests use modified commodity hardware (both with and without active noise cancellation), our results show that sensing the movement of the TMJ could be as simple as a firmware update for ANC headsets which already include a microphone inside the ear cup. This leads us to believe that this technique has great potential for rapid deployment in the near future. We further discuss the challenges that need to be addressed before deploying HPSpeech at scale.
2023 | Ruidong Zhang et al. | Voice User Interface (VUI) Design | UbiComp

EchoNose: Sensing Mouth, Breathing and Tongue Gestures inside Oral Cavity using a Non-contact Nose Interface
Sensing movements and gestures inside the oral cavity has been a long-standing challenge for the wearable research community. This paper introduces EchoNose, a novel nose interface that explores a unique sensing approach to recognize gestures related to mouth, breathing, and tongue by analyzing the acoustic signal reflections inside the nasal and oral cavities. The interface incorporates a speaker and a microphone placed at the nostrils, emitting inaudible acoustic signals and capturing the corresponding reflections. These received signals were processed using a customized data processing and machine learning pipeline, enabling the distinction of 16 gestures involving speech, tongue, and breathing. A user study with 10 participants demonstrates that EchoNose achieves an average accuracy of 93.7% in recognizing these 16 gestures. Based on these promising results, we discuss the potential opportunities and challenges associated with applying this innovative nose interface in various future applications.
2023 | Rujia Sun et al. | Electrical Muscle Stimulation (EMS) | Hand Gesture Recognition | Brain-Computer Interface (BCI) & Neurofeedback | UbiComp

EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing
We present EchoSpeech, a minimally-obtrusive silent speech interface (SSI) powered by low-power active acoustic sensing. EchoSpeech uses speakers and microphones mounted on a glass-frame and emits inaudible sound waves towards the skin. By analyzing echoes from multiple paths, EchoSpeech captures subtle skin deformations caused by silent utterances and uses them to infer silent speech. With a user study of 12 participants, we demonstrate that EchoSpeech can recognize 31 isolated commands and 3-6 figure connected digits with 4.5% (std 3.5%) and 6.1% (std 4.2%) Word Error Rate (WER), respectively. We further evaluated EchoSpeech under scenarios including walking and noise injection to test its robustness. We then demonstrated EchoSpeech in real-time demo applications operating at 73.3 mW, with the real-time pipeline implemented on a smartphone and requiring only 1-6 minutes of training data. We believe that EchoSpeech takes a solid step towards minimally-obtrusive wearable SSI for real-life deployment.
2023 | Ruidong Zhang et al. | Cornell University | Vibrotactile Feedback & Skin Stimulation | Voice User Interface (VUI) Design | Biosensors & Physiological Monitoring | CHI

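Word Error Rate, as reported above, is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognized output and the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring script; the function name `word_error_rate` and the example commands are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical command: one word dropped and one extra word recognized.
print(word_error_rate("play the next song", "play next song please"))  # 0.5
```
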
ReMotion: Supporting Remote Collaboration in Open Space with Automatic Robotic Embodiment
Design activities, such as brainstorming or critique, often take place in open spaces combining whiteboards and tables to present artefacts. In co-located settings, peripheral awareness enables participants to understand each other’s locus of attention with ease. However, these spatial cues are mostly lost while using videoconferencing tools. Telepresence robots could bring back a sense of presence, but controlling them is distracting. To address this problem, we present ReMotion, a fully automatic robotic proxy designed to explore a new way of supporting non-collocated open-space design activities. ReMotion combines a commodity body tracker (Kinect) to capture a user’s location and orientation over a wide area with a minimally invasive wearable system (NeckFace) to capture facial expressions. Due to its omnidirectional platform, ReMotion embodiment can render a wide range of body movements. A formative evaluation indicated that our system enhances the sharing of attention and the sense of co-presence enabling seamless movement-in-space during a design review task.
2023 | Mose Sakashita et al. | Cornell University | Human-Robot Collaboration (HRC) | Teleoperation & Telepresence | CHI

HandyTrak: Recognizing the Holding Hand on a Commodity Smartphone from Body Silhouette Images
Understanding which hand a user holds a smartphone with can help improve the mobile interaction experience. For instance, the layout of the user interface (UI) can be adapted to the holding hand. In this paper, we present HandyTrak, an AI-powered software system that recognizes the holding hand on a commodity smartphone using body silhouette images captured by the front-facing camera. The silhouette images are processed and sent to a customized user-dependent deep learning model (CNN) to infer how the user holds the smartphone (left, right, or both hands). We evaluated our system on each participant's smartphone at five possible front camera positions in a user study with ten participants under two hand positions (in the middle and skewed) and three common usage cases (standing, sitting, and resting against a desk). The results showed that HandyTrak was able to continuously recognize the holding hand with an average accuracy of 89.03% (SD: 8.98%) at a 2 Hz sampling rate. We also discuss the challenges and opportunities to deploy HandyTrak on different commodity smartphones and potential applications in real-world scenarios.
2021 | Hyunchul Lim et al. | Hand Gesture Recognition | Eye Tracking & Gaze Interaction | UIST

TeethTap: Recognizing Discrete Teeth Gestures using Motion and Acoustic Sensing on an Earpiece
Teeth gestures have become an alternative input modality for different situations and accessibility purposes. In this paper, we present TeethTap, a novel eyes-free and hands-free input technique, which can recognize up to 13 discrete teeth tapping gestures. TeethTap adopts a wearable 3D printed earpiece with an IMU sensor and a contact microphone behind both ears, which work in tandem to detect jaw movement and sound data, respectively. TeethTap uses a support vector machine to classify gestures from noise by fusing acoustic and motion data, and implements K-Nearest-Neighbor (KNN) with a Dynamic Time Warping (DTW) distance measurement using motion data for gesture classification. A user study with 11 participants demonstrated that TeethTap could recognize 13 gestures with a real-time classification accuracy of 90.9% in a laboratory environment. We further uncovered the accuracy differences on different teeth gestures when having sensors on single vs. both sides. Moreover, we explored the activation gesture under real-world environments, including eating, speaking, walking and jumping. Based on our findings, we further discussed potential applications and practical challenges of integrating TeethTap into future devices.
2021 | Wei Sun et al. | Haptic Wearables | Hand Gesture Recognition | Full-Body Interaction & Embodied Input | IUI

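The KNN-with-DTW step described above can be illustrated with a minimal sketch: Dynamic Time Warping measures how well two sequences align when one is stretched or compressed in time, and the nearest labeled templates vote on the gesture. This is an illustration under simplifying assumptions, not TeethTap's implementation: signals are one-dimensional rather than multi-axis IMU data, and the function names, gesture labels, and template values are all hypothetical.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_classify(query, templates, k=1):
    """Majority label among the k templates closest to the query under DTW."""
    ranked = sorted(templates, key=lambda item: dtw_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical single-axis motion templates for two gestures.
templates = [
    ([0, 1, 2, 1, 0], "left_tap"),
    ([0, -1, -2, -1, 0], "right_tap"),
]
query = [0, 0, 1, 2, 2, 1, 0]  # a slower version of the first template
print(knn_classify(query, templates))  # left_tap
```

DTW suits this kind of input because the same gesture performed faster or slower still aligns closely with its template.
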
C-Face: Continuously reconstructing facial expressions by deep learning contours of the face with ear-mounted miniature cameras
C-Face (Contour-Face) is an ear-mounted wearable sensing technology that uses two miniature cameras to continuously reconstruct facial expressions by deep learning contours of the face. When facial muscles move, the contours of the face change from the point of view of the ear-mounted cameras. These subtle changes are fed into a deep learning model which continuously outputs 42 facial feature points representing the shapes and positions of the mouth, eyes and eyebrows. To evaluate C-Face, we embedded our technology into headphones and earphones. We conducted a user study with nine participants. In this study, we compared the output of our system to the feature points outputted by a state-of-the-art computer vision library (Dlib) from a front-facing camera. We found that the mean error of all 42 feature points was 0.77 mm for earphones and 0.74 mm for headphones. The mean error for 20 major feature points capturing the most active areas of the face was 1.43 mm for earphones and 1.39 mm for headphones. The ability to continuously reconstruct facial expressions introduces new opportunities in a variety of applications. As a demonstration, we implemented and evaluated C-Face for two applications: facial expression detection (outputting emojis) and silent speech recognition. We further discuss the opportunities and challenges of deploying C-Face in real-world applications.
2020 | Tuochao Chen et al. | Haptic Wearables | Human Pose & Activity Recognition | UIST