Alina Lozovskaya committed
Commit b7e99d6 · Parent: 41d40f0

Clarify vision system defaults to gpt-realtime

Files changed (1): README.md (+14 -7)
README.md CHANGED
@@ -6,7 +6,7 @@ Conversational demo for the Reachy Mini robot combining OpenAI's realtime APIs,
 
  ## Overview
  - Real-time audio conversation loop powered by the OpenAI realtime API and `fastrtc` for low-latency streaming.
- - Local vision processing using SmolVLM2 model running on-device (CPU/GPU/MPS).
+ - Vision processing uses gpt-realtime by default (when the camera tool is used), with optional local vision via the SmolVLM2 model running on-device (CPU/GPU/MPS), enabled by the `--local-vision` flag.
  - Layered motion system queues primary moves (dances, emotions, goto poses, breathing) while blending speech-reactive wobble and face-tracking.
  - Async tool dispatch integrates robot motion, camera capture, and optional face-tracking capabilities through a Gradio web UI with live transcripts.
 
@@ -75,10 +75,10 @@ Some wheels (e.g. PyTorch) are large and require compatible CUDA or CPU builds
  | Variable | Description |
  |----------|-------------|
  | `OPENAI_API_KEY` | Required. Grants access to the OpenAI realtime endpoint.
- | `MODEL_NAME` | Override the realtime model (defaults to `gpt-realtime`).
- | `HF_HOME` | Cache directory for local Hugging Face downloads (defaults to `./cache`).
- | `HF_TOKEN` | Optional token for Hugging Face models (falls back to `huggingface-cli login`).
- | `LOCAL_VISION_MODEL` | Hugging Face model path for local vision processing (defaults to `HuggingFaceTB/SmolVLM2-2.2B-Instruct`).
+ | `MODEL_NAME` | Override the realtime model (defaults to `gpt-realtime`). Used for both conversation and vision unless the `--local-vision` flag is set.
+ | `HF_HOME` | Cache directory for local Hugging Face downloads (only used with the `--local-vision` flag; defaults to `./cache`).
+ | `HF_TOKEN` | Optional token for Hugging Face models (only used with the `--local-vision` flag; falls back to `huggingface-cli login`).
+ | `LOCAL_VISION_MODEL` | Hugging Face model path for local vision processing (only used with the `--local-vision` flag; defaults to `HuggingFaceTB/SmolVLM2-2.2B-Instruct`).
 
  ## Running the demo
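To make the variable table concrete, a typical shell setup before launching might look like the sketch below; the API key value is a placeholder, and the names and defaults are taken straight from the table:

```bash
# Required: grants access to the OpenAI realtime endpoint
export OPENAI_API_KEY="sk-..."  # placeholder, not a real key

# Optional override (this is already the default)
export MODEL_NAME="gpt-realtime"

# Only consulted when the demo runs with --local-vision
export HF_HOME="./cache"
export LOCAL_VISION_MODEL="HuggingFaceTB/SmolVLM2-2.2B-Instruct"
```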
 
@@ -88,7 +88,7 @@ Activate your virtual environment, ensure the Reachy Mini robot (or simulator) i
  reachy-mini-conversation-demo
  ```
 
- By default, the app runs in console mode for direct audio interaction. Use the `--gradio` flag to launch a web UI served locally at http://127.0.0.1:7860/ (required when running in simulation mode). With a camera attached, captured frames are analyzed locally using the SmolVLM2 vision model. Additionally, you can enable face tracking via YOLO or MediaPipe pipelines depending on the extras you installed.
+ By default, the app runs in console mode for direct audio interaction. Use the `--gradio` flag to launch a web UI served locally at http://127.0.0.1:7860/ (required when running in simulation mode). With a camera attached, vision is handled by the gpt-realtime model whenever the camera tool is used; pass the `--local-vision` flag to instead process frames periodically with the local SmolVLM2 model. Additionally, you can enable face tracking via YOLO or MediaPipe pipelines, depending on the extras you installed.
 
  ### CLI options
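For instance, the web-UI launch described in the revised paragraph (and required for simulation mode) is a single command:

```bash
# Serve the Gradio UI locally, then open http://127.0.0.1:7860/ in a browser
reachy-mini-conversation-demo --gradio
```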
 
@@ -96,6 +96,7 @@ By default, the app runs in console mode for direct audio interaction. Use the `
  |--------|---------|-------------|
  | `--head-tracker {yolo,mediapipe}` | `None` | Select a face-tracking backend when a camera is available. YOLO is implemented locally, MediaPipe comes from the `reachy_mini_toolbox` package. Requires the matching optional extra. |
  | `--no-camera` | `False` | Run without camera capture or face tracking. |
+ | `--local-vision` | `False` | Use the local vision model (SmolVLM2) for periodic image processing instead of gpt-realtime vision. Requires the `local_vision` extra to be installed. |
  | `--gradio` | `False` | Launch the Gradio web UI. Without this flag, runs in console mode. Required when running in simulation mode. |
  | `--debug` | `False` | Enable verbose logging for troubleshooting. |
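These flags compose; for example, a verbose web-UI session with local vision and YOLO face tracking (assuming both matching extras are installed) could be launched as:

```bash
# Web UI + local SmolVLM2 vision + YOLO face tracking + verbose logging
reachy-mini-conversation-demo --gradio --local-vision --head-tracker yolo --debug
```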
 
@@ -107,6 +108,12 @@ By default, the app runs in console mode for direct audio interaction. Use the `
  reachy-mini-conversation-demo --head-tracker mediapipe
  ```
 
+ - Run with local vision processing (requires the `local_vision` extra):
+
+ ```bash
+ reachy-mini-conversation-demo --local-vision
+ ```
+
  - Disable the camera pipeline (audio-only conversation):
 
  ```bash
@@ -118,7 +125,7 @@ By default, the app runs in console mode for direct audio interaction. Use the `
  | Tool | Action | Dependencies |
  |------|--------|--------------|
  | `move_head` | Queue a head pose change (left/right/up/down/front). | Core install only. |
- | `camera` | Capture the latest camera frame and optionally query a vision backend. | Requires camera worker; vision analysis depends on selected extras. |
+ | `camera` | Capture the latest camera frame and send it to gpt-realtime for vision analysis. | Requires camera worker; uses gpt-realtime vision by default. |
  | `head_tracking` | Enable or disable face-tracking offsets (not facial recognition - only detects and tracks face position). | Camera worker with configured head tracker. |
  | `dance` | Queue a dance from `reachy_mini_dances_library`. | Core install only. |
  | `stop_dance` | Clear queued dances. | Core install only. |
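The Dependencies column references optional extras such as `local_vision`. Assuming the project declares standard setuptools extras and is installed from a source checkout (an assumption; the published package name and layout may differ), enabling one would look roughly like:

```bash
# Hypothetical: editable install from a source checkout with the local_vision extra
pip install -e ".[local_vision]"
```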