Dissertation Project Brief

An Interaction Design Based on LLM Association for User Attention and Gesture Operations

Origin of the Idea

New Interactions in AR/VR

History of Interaction

Traditional Human-Computer Interaction (focusing only on the era when computers became mainstream, setting aside ENIAC)

Whether called AR/VR (the traditional terms), the metaverse (popularized by Meta), or spatial computing (the term Apple uses for the recently released Vision Pro), the essence is very similar.

Existing AR/VR Device Input and Interaction Methods:

Apple Vision Pro offers a more diverse range of interactions:

My Idea — An Interaction Design Based on LLM Association for User Attention and Gesture Operations

Examples

  1. Notice a document in a work setting, association: "Scan to PDF" operation.
  2. See an empty tea cup in the bedroom, association:
  3. See a tea cup in the kitchen, association:

From the overall product flow perspective

It can be divided into three parts:

  1. Detecting user attention or points of interest. Current eye-tracking technology effectively solves this problem.
  2. Semantic segmentation of the scene, matched to the user's gaze point. Recognize the current environment, use the semantic-segmentation and scene-classification results as prompts, and combine them with an LLM to associate the gazed-at object with potential related operations (a hoped-for research direction during my Ph.D.).
  3. Displaying the associated information via simple gesture operations (the main content of this dissertation project). A sketch of this three-part flow follows the list.
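
The intended data flow of these three parts can be summarized in a minimal Python sketch. Every name below is a hypothetical placeholder for a component of the design, not an existing API.

```python
# A minimal sketch of the three-part flow described above. All names are
# hypothetical placeholders for components of the design, not an existing API.

def detect_gaze_point() -> tuple[int, int]:
    """Part 1: eye tracking reports the (x, y) pixel the user is looking at."""
    raise NotImplementedError  # supplied by the headset's eye tracker

def associate_operations(gaze: tuple[int, int], frame) -> list[str]:
    """Part 2: segment the scene, find the object under the gaze point,
    and ask an LLM which operations are associated with that object."""
    raise NotImplementedError  # semantic segmentation + LLM association

def show_gesture_menu(operations: list[str]) -> None:
    """Part 3: display the associations as a pop-up menu driven by gestures."""
    raise NotImplementedError  # the interface designed in this dissertation

def interaction_loop(camera):
    """Wire the three parts together for each camera frame."""
    for frame in camera:
        operations = associate_operations(detect_gaze_point(), frame)
        if operations:
            show_gesture_menu(operations)
```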

Dissertation Plan: Only design the final interface.

Main Research Components of the Dissertation

  1. Send text input directly to GPT-4 and return relevant information, optimizing the prompts, or use basic NLP algorithms to extract words related to specific object operations for display (a sketch of the GPT-4 call follows this list).
  2. Design a pop-up menu that accounts for the matching between the gaze point, objects, and potential gesture operations.
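
A minimal sketch of this GPT-4 call, assuming the OpenAI Python SDK (`pip install openai`, v1.x) and an `OPENAI_API_KEY` in the environment; the prompt wording and the `associate` helper are illustrative assumptions rather than a finalized prompt design.

```python
# Minimal sketch: send an object label plus scene context to GPT-4 and
# return short operation names for display. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def associate(object_label: str, scene: str) -> str:
    """Ask GPT-4 for short operations associated with an object in a scene."""
    prompt = (
        f"The user is looking at: {object_label}. Scene: {scene}. "
        "List up to three short operations a headset could offer, "
        "one per line, two or three words each."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g. associate("document", "office desk") might return "Scan to PDF", etc.
```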

If time permits, consider testing an existing pre-trained semantic segmentation model (e.g., YOLOv8) in a specific scenario: input an image, obtain its segmentation, use a cursor as a substitute for the gaze point, capture the semantic label under the cursor, and format it as input to the GPT-4 interface to obtain GPT-4's text output.
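
A minimal sketch of this test, assuming the `ultralytics` package (`pip install ultralytics`); the image path and cursor coordinates are placeholders, and a bounding-box hit test stands in for a true per-pixel mask lookup.

```python
# Minimal sketch: run a pre-trained YOLOv8 segmentation model on one image
# and look up the detected object under a cursor (standing in for the gaze point).
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")   # pre-trained YOLOv8 segmentation weights
result = model("scene.jpg")[0]   # one image in, one Results object out

cursor = (420, 310)              # placeholder cursor position in pixels

# Find the first detection whose bounding box contains the cursor.
label = None
for box, cls in zip(result.boxes.xyxy, result.boxes.cls):
    x1, y1, x2, y2 = box.tolist()
    if x1 <= cursor[0] <= x2 and y1 <= cursor[1] <= y2:
        label = result.names[int(cls)]
        break

if label is not None:
    # The label can then be formatted into the GPT-4 prompt from the
    # previous sketch, e.g. associate(label, "office desk").
    print(f"Gazed object: {label}")
```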

Given time constraints, the above program will not be extended into a combined interface that displays pop-up menus and other content (that would require separate front-end and back-end development, and the sophisticated animation effects demand strong front-end development skills).