Conversation

@mirai-gpro

No description provided.

claude added 30 commits on February 2, 2026 at 04:13
- Add pytest configuration and test infrastructure
- Add unit tests for utils (Registry, CosineWarmupScheduler)
- Add unit tests for losses (PixelLoss, TVLoss)
- Add unit tests for models (BasicBlock, ConditionBlock, TransformerDecoder)
- Add unit tests for datasets (camera utilities)
- Support graceful test skipping when PyTorch is not installed

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add tmp/ directory for HuggingFace downloads
- Add *.tar for asset archives
- Add .pytest_cache/ for pytest

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Copy src/ and public/ from production gourmet-sp repository
- Prepare for LAM_WebRender 3D avatar integration
- Keep separate from production environment for safe testing

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Create LAMAvatar.astro component for WebGL Gaussian Splatting
- Integrate LAMAvatar into Concierge.astro with toggle option
- Add setup documentation in README.md
- Support graceful fallback to 2D image when WebGL unavailable

Features:
- Dynamic import of gaussian-splat-renderer-for-lam NPM package
- Expression data API for Audio2Expression integration
- Chat state management (Idle/Listening/Thinking/Responding)

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add lam-websocket-manager.ts with JBIN binary parser
- Update LAMAvatar.astro with WebSocket connection management
- Add audio playback support for TTS output
- Update README with WebSocket integration documentation

The WebSocket manager parses the JBIN format (4-byte magic + JSON header + binary data)
containing ARKit 52-channel expression data from the OpenAvatarChat backend.
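For illustration, a minimal parsing sketch of the JBIN layout described above, assuming a 4-byte magic string, a 4-byte little-endian header-length field, a UTF-8 JSON header, and the rest as binary payload; the field layout and names are assumptions, not the actual lam-websocket-manager.ts code:

```typescript
interface JbinMessage {
  header: Record<string, unknown>; // parsed JSON header (frame metadata)
  payload: Uint8Array;             // raw binary section (audio / expression data)
}

function parseJbin(buffer: ArrayBuffer): JbinMessage {
  const view = new DataView(buffer);
  const magic = new TextDecoder().decode(new Uint8Array(buffer, 0, 4));
  if (magic !== "JBIN") {
    throw new Error(`Unexpected magic bytes: ${magic}`);
  }
  // Assumed: 4-byte little-endian header length immediately after the magic.
  const headerLength = view.getUint32(4, true);
  const headerBytes = new Uint8Array(buffer, 8, headerLength);
  const header = JSON.parse(new TextDecoder().decode(headerBytes));
  const payload = new Uint8Array(buffer, 8 + headerLength);
  return { header, payload };
}
```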

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Standalone service that receives TTS audio from gourmet-sp and generates
ARKit expression data (52 channels) for LAMAvatar lip sync.

Features:
- REST API: POST /api/audio2expression
- WebSocket: /ws/{session_id} for real-time streaming
- Mock mode when Audio2Expression model not available
- Minimal changes required to gourmet-sp

Architecture:
  gourmet-sp (TTS) → REST API → Audio2Expression → WebSocket → Browser
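As a rough sketch of the client side of this flow, here is how a browser might call the two endpoints listed above; the request and response shapes (session_id, audio fields) are assumptions for illustration only:

```typescript
// POST audio to the REST endpoint and receive expression data back.
async function requestExpressions(baseUrl: string, sessionId: string, audioBase64: string) {
  const res = await fetch(`${baseUrl}/api/audio2expression`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ session_id: sessionId, audio: audioBase64 }),
  });
  return res.json(); // expression frames for lip sync (shape assumed)
}

// Open the real-time streaming path: /ws/{session_id}.
function openExpressionStream(wsBaseUrl: string, sessionId: string): WebSocket {
  const ws = new WebSocket(`${wsBaseUrl}/ws/${sessionId}`);
  ws.binaryType = "arraybuffer";
  ws.onmessage = (event) => {
    console.log("expression message", event.data);
  };
  return ws;
}
```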

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Dockerfile for containerization
- cloudbuild.yaml for CI/CD deployment
- gourmet_support_integration.py - sample code for backend integration
- Updated README with deployment instructions

Deploy with: gcloud builds submit --config cloudbuild.yaml

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Detailed patch showing how to modify app_customer_support.py:
- Add session_id parameter to TTS endpoint
- Generate LINEAR16 audio for Audio2Expression (in addition to MP3)
- Send audio to audio2exp-service asynchronously via threading

Also includes frontend change for core-controller.ts to pass session_id.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add deploy.ps1 matching gourmet-support pattern
- Update cloudbuild.yaml region to us-central1
- Update README with PowerShell deployment instructions

Deploy with: ./deploy.ps1

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Full modified version of gourmet-support/app_customer_support.py with:
- AUDIO2EXP_SERVICE_URL environment variable
- send_to_audio2exp() function for async audio sending
- Modified synthesize_speech() to accept session_id and send PCM audio
- Health check includes audio2exp status

Changes from original:
- Line 11: Added 'import requests'
- Lines 51-56: Added AUDIO2EXP_SERVICE_URL configuration
- Lines 118-137: Added send_to_audio2exp() function
- Lines 445-496: Modified synthesize_speech() for Audio2Expression
- Line 664: Added audio2exp to health check

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Use uvicorn directly in Dockerfile CMD
- Disable model initialization on startup (use mock mode)
- Fix PORT environment variable handling

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add session_id parameter to all 5 TTS fetch calls
- Enables Audio2Expression service to receive session context
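Illustrative shape of the change, with a hypothetical /api/tts path and body fields standing in for the real gourmet-sp request:

```typescript
// Hypothetical TTS call in core-controller.ts, now carrying the session_id so
// the Audio2Expression service can associate audio with the right session.
async function synthesizeSpeech(text: string, sessionId: string): Promise<Blob> {
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, session_id: sessionId }),
  });
  return res.blob(); // MP3 audio for playback
}
```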

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Update lam-websocket-manager.ts to handle JSON format from audio2exp-service
- Add audio2expWsUrl property to ConciergeController
- Add connectLAMAvatarWebSocket method to connect after session init
- Enable real-time expression data streaming from backend to frontend

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Handle case where shops is None from JSON parsing failure
- Use 'or []' pattern instead of default value to handle both None and missing key

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add pingInterval and currentWsUrl properties
- Send ping every 5 seconds to keep Cloud Run WebSocket alive
- Stop ping on disconnect and reconnect
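A minimal keep-alive sketch, assuming the server accepts a simple JSON ping message; the class and message shapes are illustrative rather than the exact lam-websocket-manager.ts code:

```typescript
class KeepAliveSocket {
  private pingInterval: ReturnType<typeof setInterval> | null = null;
  private currentWsUrl = "";
  private ws: WebSocket | null = null;

  connect(url: string): void {
    this.currentWsUrl = url;
    this.ws = new WebSocket(url);
    this.ws.onopen = () => {
      // Cloud Run closes idle WebSockets, so ping every 5 seconds.
      this.pingInterval = setInterval(() => {
        this.ws?.send(JSON.stringify({ type: "ping" }));
      }, 5000);
    };
    this.ws.onclose = () => this.stopPing();
  }

  reconnect(): void {
    // Stop pinging before tearing down, then reconnect to the saved URL.
    this.stopPing();
    this.ws?.close();
    this.connect(this.currentWsUrl);
  }

  disconnect(): void {
    this.stopPing();
    this.ws?.close();
  }

  private stopPing(): void {
    if (this.pingInterval !== null) {
      clearInterval(this.pingInterval);
      this.pingInterval = null;
    }
  }
}
```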

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
…r disappearing

- Create createDefaultExpressionData() function with all channels set to 0
- Initialize expressionData with default values instead of empty object
- Re-enable expression updates in lam-websocket-manager
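A sketch of such a default map with every channel at 0; only a few of the 52 ARKit blendshape names are listed here for brevity:

```typescript
const ARKIT_CHANNELS = [
  "jawOpen", "mouthFunnel", "mouthPucker",
  "mouthLowerDownLeft", "mouthLowerDownRight",
  // ...remaining ARKit blendshape channels
] as const;

// Build a complete expression object with all channels zeroed, so the renderer
// never receives an empty object.
function createDefaultExpressionData(): Record<string, number> {
  const data: Record<string, number> = {};
  for (const channel of ARKIT_CHANNELS) {
    data[channel] = 0;
  }
  return data;
}
```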

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
This preserves the original object reference, which may help with renderer stability.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Log expression data receipt with key count and jawOpen value
- Add health check timer every 2 seconds to monitor renderer state
- Add error handling to expression update callback

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
…reference

The renderer was modifying the expressionData object directly, causing values
to explode exponentially (jawOpen went from 1.0 to 10^75). Now getExpressionData()
returns a shallow copy to prevent the renderer's internal modifications from
affecting the source data.
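A minimal illustration of the fix, with simplified class and field names:

```typescript
class ExpressionSource {
  private expressionData: Record<string, number> = { jawOpen: 0 };

  getExpressionData(): Record<string, number> {
    // Spread creates a new object each call, so the renderer mutates its copy
    // only and the source data stays intact.
    return { ...this.expressionData };
  }
}
```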

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Added 'or []' protection to handle the case where enrich_shops_with_photos
returns None, preventing the 'object of type NoneType has no len()' error.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Reduce jaw scaling from rms*10 to rms*2.0 with max 0.7
- Add mouthFunnel and mouthPucker for more natural mouth shapes
- Previous formula always returned 1.0 (max), causing no visible variation

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The gaussian-splat-renderer-for-lam only applies lip sync when
getChatState() returns 'Responding'. Added setChatState calls to
speakTextGCP() to enable lip sync during TTS playback.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Log state transitions in setChatState()
- Add current state to Health check log
- Helps diagnose if Responding state is being set during TTS

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The previous approach set the Responding state before the TTS fetch completed,
causing it to immediately switch back to Idle. We now use the ttsPlayer's
play/ended/pause events to accurately track audio playback state.

- Add event listeners in init() for play, ended, pause
- Remove setChatState calls from speakTextGCP/stopAvatarAnimation
- State changes now tied to actual audio playback, not API calls
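A sketch of the event wiring, assuming an HTMLAudioElement TTS player and a setChatState() helper; names follow the description above but are not the verbatim code:

```typescript
type ChatState = "Idle" | "Listening" | "Thinking" | "Responding";

function bindTtsPlayer(
  ttsPlayer: HTMLAudioElement,
  setChatState: (state: ChatState) => void,
): void {
  // State follows the audio element, not the TTS API call.
  ttsPlayer.addEventListener("play", () => setChatState("Responding"));
  ttsPlayer.addEventListener("ended", () => setChatState("Idle"));
  ttsPlayer.addEventListener("pause", () => setChatState("Idle"));
}
```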

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Instead of relying on TTS player events (which may not fire correctly
due to browser autoplay restrictions), the state is now set to Responding
directly when expression data with jawOpen > 0.01 is received.

Uses a 500ms timeout to reset to Idle when expression data stops arriving.
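A sketch of this expression-driven state change; the 0.01 threshold and 500ms timeout come from the description, while the surrounding structure is illustrative:

```typescript
let idleTimeout: ReturnType<typeof setTimeout> | null = null;

function onExpressionData(
  data: Record<string, number>,
  setChatState: (state: string) => void,
): void {
  if ((data.jawOpen ?? 0) > 0.01) {
    setChatState("Responding");
    if (idleTimeout !== null) clearTimeout(idleTimeout);
    // Fall back to Idle once expression data stops arriving.
    idleTimeout = setTimeout(() => setChatState("Idle"), 500);
  }
}
```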

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The LAM renderer uses mouthLowerDownLeft/Right as the PRIMARY drivers
of mouth movement, not jawOpen. This is based on analysis of the official
demo expression data, where mouthLowerDown values are ~0.46 while jawOpen
is only ~0.06.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Log mouthLowerDownLeft/Right values in expression updates
- Log morphTargetDictionary keys to verify avatar support

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- audio2exp-service: Generate multiple frames based on audio duration
  (30fps = 1 frame per ~533 audio samples at 16kHz)
- lam-websocket-manager: Add ExpressionFrameData type and
  onExpressionFrames callback for multi-frame support
- LAMAvatar: Buffer frames and play back at correct frame rate
  with startFramePlayback/stopFramePlayback methods

This fixes the timing issue where expression data was received all
at once before TTS playback started, causing no visible lip sync.
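A sketch of the frontend side under these assumptions: frames are buffered as they arrive and stepped through on a timer at the service's frame rate. The applyFrame() callback and field names are simplified stand-ins for the LAMAvatar implementation:

```typescript
type ExpressionFrame = Record<string, number>;

class FramePlayer {
  private frames: ExpressionFrame[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;

  bufferFrames(frames: ExpressionFrame[]): void {
    this.frames.push(...frames);
  }

  startFramePlayback(frameRate: number, applyFrame: (f: ExpressionFrame) => void): void {
    let index = 0;
    this.timer = setInterval(() => {
      if (index >= this.frames.length) return this.stopFramePlayback();
      applyFrame(this.frames[index++]); // e.g. 30 fps => one frame every ~33 ms
    }, 1000 / frameRate);
  }

  stopFramePlayback(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
    this.frames = [];
  }
}
```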

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Queue expression frames instead of playing immediately
- Add startFramePlaybackFromQueue() public method
- Call frame playback from TTS play event handler
- This ensures lip sync starts when audio actually plays,
  not when expression data is received

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
claude added 30 commits on February 4, 2026 at 23:47
When the TTS play event fires but no frames are in the queue yet,
retry up to 3 times with a 200ms delay between retries.
This handles the latency difference between direct TTS audio
and expression data going through the audio2exp WebSocket.
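A sketch of the retry loop; the limits (3 retries, 200ms) come from this message, and the queue accessor is a hypothetical stand-in:

```typescript
async function startPlaybackWithRetry(
  hasQueuedFrames: () => boolean,
  startFramePlayback: () => void,
): Promise<void> {
  for (let attempt = 0; attempt <= 3; attempt++) {
    if (hasQueuedFrames()) {
      startFramePlayback();
      return;
    }
    // Expression data lags the direct TTS audio, so wait 200 ms and re-check.
    await new Promise((resolve) => setTimeout(resolve, 200));
  }
  console.warn("No expression frames arrived; playing audio without lip sync");
}
```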

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Make stopFramePlayback() public in LAMAvatarController
- Call stopFramePlayback() on TTS ended/pause events in concierge-controller
- Fixes timing mismatch where frame playback continued after audio ended

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Update path to use cloned LAM_Audio2Expression directory
- Enable model initialization on startup (CPU mode)
- Add proper weight and config path configuration
- Add detailed logging for initialization process

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
This patch modifies engines/infer.py to:
- Detect available device (CUDA or CPU)
- Use .to(device) instead of .cuda()
- Add map_location to torch.load()

Apply with: cd LAM_Audio2Expression && git apply ../audio2exp-service/cpu_support.patch

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
This implements the official synchronization approach from OpenAvatarChat:

1. audio2exp-service (app.py):
   - Add JBIN format serializer to bundle audio + expression data
   - Add batch tracking per session for speech segment handling
   - Support MP3 audio input (Google Cloud TTS format)
   - Send bundled JBIN data via WebSocket

2. AudioSyncPlayer (new file):
   - Web Audio API player with precise timing tracking
   - Tracks playback position for expression frame sync
   - Supports batch-based speech segments

3. LAMWebSocketManager:
   - Parse JBIN bundled data from WebSocket
   - Store expression frames indexed by batch
   - Integrate with AudioSyncPlayer for playback
   - Add getCurrentExpressionFrame() for audio-based sync

4. LAMAvatar.astro:
   - Support both 'bundled' and 'external' sync modes
   - External mode: sync with gourmet-sp TTS player
   - Expression sync timer at 30fps
   - setExternalTtsPlayer() to link with external audio

5. concierge-controller.ts:
   - Link external TTS player with LAMAvatar
   - Send TTS audio to audio2exp-service for expression
   - Parallel TTS + expression generation

The key improvement is that audio and expression are now properly
synchronized at the protocol level, following the official approach.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Use a separate `import type { AudioSample }` to fix a module resolution
error in the Vite dev server. Interfaces should be imported as types
when they are only used for type checking.
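For example (the module path and the sampleRate field are assumptions for illustration):

```typescript
// Vite's dev server transpiles files in isolation, so a plain value import of
// an interface can be kept in the output and fail at runtime; a type-only
// import is erased by the compiler and never reaches the bundler.
import type { AudioSample } from "./audio-sync-player"; // path is illustrative

function describeSample(sample: AudioSample): string {
  return `sample rate: ${sample.sampleRate}`; // sampleRate assumed for the example
}
```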

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Add queueExpressionFrames() public method to LAMAvatar for external
  sources to push expression frames
- Update sendAudioToExpression() in concierge-controller to pass
  expression data from REST API to LAMAvatar's frame queue
- This connects the audio2exp-service response to the avatar lip sync

The flow is now:
1. TTS audio sent to audio2exp-service REST API
2. API returns expression frames
3. Frames queued to LAMAvatar
4. TTS play event triggers frame playback

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The gaussian-splat-renderer doesn't call setExpression() in FLAME mode,
so expression data was being ignored. Added applyExpressionToMesh()
to directly set splatMesh.bsWeight, bypassing the FLAME animation
system.

Called from:
- applyFrame(): Apply expression during frame playback
- resetExpression(): Reset expression when playback ends

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The renderer's FLAME mode uses pre-recorded animation from flame_params,
which overwrites our live expression data. Now we:

1. Disable FLAME mode when frame playback starts (setFlameMode(false))
2. Re-enable FLAME mode when playback stops (setFlameMode(true))
3. Force texture update via updateBoneMatrixTexture() after setting bsWeight

This allows our live expression data to be applied to the mesh.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The previous approach of disabling FLAME mode during lip sync caused
renderer crashes because updateFlameBones() still runs and needs
flame_params['expr']. Instead, we now inject our expression data
directly into flame_params['expr'] at the current frame, so the
renderer reads our lip sync data during its normal update loop.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Based on official OpenAvatarChat-WebUI implementation research:
- The official API uses getExpressionData callback for expression control
- Add useFlame: false option to force callback usage instead of FLAME data
- Add comprehensive debug logging to understand renderer state:
  - FLAME mode status (viewer.useFlame, renderer.useFlame)
  - flame_params structure and format (array vs object)
  - bsWeight format and current values
  - morphTargetDictionary channel names
- Update applyExpressionToMesh to detect array vs object format

Reference: OpenAvatarChat-WebUI/src/utils/gaussianAvatar.ts

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
This will help determine whether the renderer is actually calling our
getExpressionData callback during its render loop. A log line is emitted
every 30 calls (~1 second at 30fps), showing the current expression values.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
When responses are long, they get split into firstSentence and
remainingSentences with separate TTS synthesis. These parallel TTS
calls were not calling sendAudioToExpression(), so lip sync frames
were generated only for the first TTS response.

Added sendAudioToExpression() calls for both firstSentence and
remainingSentences TTS audio.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Check if renderer has setExpression method
- Log morphTargetInfluences on splatMesh
- Log all methods on renderer and viewer objects
- Check jawOpen and mouthLowerDownLeft indices

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Try multiple methods to apply expression data to the mesh:
1. renderer.setExpression() - if renderer has this method
2. viewer.setExpression() - if viewer has this method
3. splatMesh.bsWeight - direct property setting
4. splatMesh.morphTargetInfluences - Three.js standard approach
5. renderer.expressionData - for callback consistency

Removed complex FLAME mode injection logic that wasn't working.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Simplified to match official gaussianAvatar.ts implementation:
- Removed useFlame: false option (let renderer decide)
- Removed applyExpressionToMesh() method (not in official)
- Removed setFlameMode() method (not in official)
- Removed debug logging in getExpressionData()
- applyFrame() now just updates expressionData (like official)
- getExpressionData() just returns expressionData (like official)

Official approach: renderer calls getExpressionData() callback
during its render loop and handles expression application internally.
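A rough sketch of the simplified wiring, following the official pattern described above; the class and option names are illustrative, not the exact gaussian-splat-renderer-for-lam API:

```typescript
class AvatarController {
  private expressionData: Record<string, number> = {};

  getExpressionData(): Record<string, number> {
    // The renderer polls this during its render loop and applies the values.
    return this.expressionData;
  }

  applyFrame(frame: Record<string, number>): void {
    // Just update the data; no direct mesh manipulation.
    this.expressionData = frame;
  }

  attachTo(renderer: { getExpressionData?: () => Record<string, number> }): void {
    // Bound so `this` still points at the controller when the renderer calls it.
    renderer.getExpressionData = this.getExpressionData.bind(this);
  }
}
```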

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Match official processor behavior:
- Return undefined when not playing (like processor?.getArkitFaceFrame()?.arkitFace)
- Return actual expression data only during playback

The renderer may differentiate between undefined and empty data.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Changed sendAudioToExpression calls from fire-and-forget to awaited:
- Main speakTextGCP: await expression before setting audio src
- Split sentence TTS: await expression within each promise

This ensures expression data is queued before audio starts playing,
fixing the synchronization issue between mouth movement and speech.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Follow official OpenAvatarChat approach:
- Remove setInterval timer for frame playback
- Calculate frame index from elapsed time using performance.now()
- Record playbackStartTime when playback starts
- getExpressionData() returns frame based on elapsed time
- Fix undefined useBundledSync -> syncMode check
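A sketch of the time-based frame selection under these assumptions (30fps default, simplified names):

```typescript
class TimeBasedFramePlayer {
  private frames: Record<string, number>[] = [];
  private playbackStartTime = 0;
  private playing = false;

  start(frames: Record<string, number>[]): void {
    this.frames = frames;
    this.playbackStartTime = performance.now();
    this.playing = true;
  }

  getExpressionData(frameRate = 30): Record<string, number> | undefined {
    if (!this.playing || this.frames.length === 0) return undefined;
    // Derive the frame index from elapsed wall-clock time, not a timer.
    const elapsedMs = performance.now() - this.playbackStartTime;
    const frameIndex = Math.floor(elapsedMs / (1000 / frameRate));
    if (frameIndex >= this.frames.length) {
      this.playing = false;
      return undefined;
    }
    return this.frames[frameIndex];
  }
}
```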

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
The renderer expects all 52 ARKit channels, but audio2exp only returns
non-zero channels. Merge frame data with this.expressionData to ensure
the complete expression is returned.
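A minimal sketch of the merge, where later keys overwrite the zeroed defaults so the renderer always sees all channels:

```typescript
function mergeFrame(
  base: Record<string, number>,
  frame: Record<string, number>,
): Record<string, number> {
  // Spread order matters: frame values overwrite the defaults in base.
  return { ...base, ...frame };
}

// Usage (illustrative): this.expressionData = mergeFrame(this.expressionData, incomingFrame);
```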

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Copy official getArkitFaceFrame() logic:
- Return expressionData always (not undefined)
- Create new object each frame: this.expressionData = {}
- Calculate frameIndex: Math.floor(calcDelta / frameInfoInternal)

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Backend:
- Change response format: channels→names, weights→frames[{weights}]
- Add frame_rate field to response

Frontend:
- Update conversion to match official gaussianAvatar.ts logic

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Use .bind(this) like official code
- Add call count logging to getExpressionData()
- Log when renderer calls the callback

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Download official test_expression_1s.json
- Add testMode flag (enabled by default)
- Implement exact same loop playback logic as official gaussianAvatar.ts
- This will verify if renderer works with official data

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
Test with the official sample avatar to determine whether the issue
is with concierge.zip or the renderer/expression logic.

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Normalize RMS values to use full dynamic range
- Add speech-like variation using sine waves at syllable frequencies
- Make jaw and mouth move independently with different timing
- Add more expression channels (pucker, smile)

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY
- Track if external player already linked
- Remove old listeners when switching players
- Prevent duplicate play/pause/ended events

https://claude.ai/code/session_01BNt1bYnL3hbBtuKwaoSypY