
Conversation

@mirai-gpro

No description provided.

claude added 30 commits February 5, 2026 11:26
…n testing

Cloned official repositories to analyze expression data format and WebGL renderer implementation for TTS lip-sync integration.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Key changes:
1. Disable expressionUpdateInterval timer to avoid race condition with renderer's 60fps getExpressionData() calls
2. Move all sync logic to getExpressionData() - now uses TTS player's currentTime for frame selection
3. Add frames directly to frameBuffer in queueExpressionFrames() instead of using frameQueue
4. Clear frameBuffer when starting new speech to prevent stale frame accumulation
5. Add clearFrameBuffer() public method for external control

Root cause identified:
- Two mechanisms (30fps timer + 60fps renderer calls) were competing to update expressionData
- frameQueue wasn't populated before TTS play event fired due to async REST API timing

Also added OpenAvatarChat reference for official WebGL SDK investigation.
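
For reference, the frame-selection arithmetic is simple. A minimal Python sketch of the idea (the real logic is TypeScript inside getExpressionData(); the 30fps rate and the frame_buffer list are assumptions taken from this PR's description):

```python
# Illustrative sketch only: the actual implementation is TypeScript in getExpressionData().
# Assumes expression frames are generated at 30 fps.
FPS = 30

def select_frame(frame_buffer, tts_current_time):
    """Pick the expression frame matching the TTS player's currentTime."""
    if not frame_buffer:
        return None
    index = int(tts_current_time * FPS)          # e.g. t = 1.00s -> frame 30
    index = min(index, len(frame_buffer) - 1)    # hold the last frame if TTS outruns the buffer
    return frame_buffer[index]
```

Clamping to the last frame matches the buffer-exhaustion fix described in a later commit.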

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
The browser's audio.paused property was unreliable because pause events
can fire immediately after play events (buffering/browser quirks). This
caused the lip sync to fall back to legacy timing mode instead of using
TTS currentTime for frame selection.

Changes:
- Add ttsActive flag to track play/ended state ourselves
- Don't reset ttsActive on pause events (only on ended)
- Use ttsActive flag instead of paused property in getExpressionData()
- Allow sync to start even when currentTime is 0 (just started)
- Add stopTtsSync() method for manual interruption
- Reset ttsActive in clearFrameBuffer()

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Two issues fixed:
1. Legacy mode was running simultaneously with TTS-Sync mode, causing
   lip sync to stop prematurely when legacy timer completed (~1.5s)
   even though TTS audio was still playing (~2.4s).
   Fix: Skip legacy mode when ttsActive is true in external sync mode.

2. When frame buffer exhausted before TTS ended, expression would
   reset because code fell through without returning.
   Fix: Return last frame expression while TTS is still playing.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
In external TTS sync mode, the play/ended handlers no longer call
startFramePlaybackFromQueue() or stopFramePlayback(). These legacy
methods were causing issues with multiple sequential TTS playbacks:

1. startFramePlaybackFromQueue() clears frameQueue after flattening,
   so subsequent TTS plays would find an empty queue even though
   frameBuffer had frames added by queueExpressionFrames().

2. stopFramePlayback() resets state that could interfere with the
   next TTS play starting immediately after.

Now TTS-Sync mode relies purely on:
- ttsActive flag for tracking play/ended state
- frameBuffer populated by queueExpressionFrames()
- getExpressionData() using currentTime to select frames

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Update app.py to detect LAM_Audio2Expression at multiple locations
  (env var, sibling dir, OpenAvatarChat submodule)
- Add model path auto-detection for wav2vec2 and LAM weights
- Update requirements.txt with PyTorch and transformers dependencies
- Add run_local.sh for easy local testing with proper env setup

The service now runs with the real Audio2Expression model in inference
mode (not mock mode) for proper lip sync based on phoneme analysis.
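
A hedged sketch of the multi-location detection; the candidate paths below are illustrative and the real list in app.py may differ (only LAM_A2E_PATH is named elsewhere in this PR):

```python
import os
from pathlib import Path

def find_lam_a2e_root():
    """Return the first existing LAM_Audio2Expression checkout, or None (falls back to mock mode)."""
    here = Path(__file__).resolve().parent
    candidates = [
        os.environ.get("LAM_A2E_PATH"),                     # explicit env override
        here.parent / "LAM_Audio2Expression",               # sibling directory
        here / "OpenAvatarChat" / "LAM_Audio2Expression",   # hypothetical submodule layout
    ]
    for cand in candidates:
        if cand and Path(cand).is_dir():
            return Path(cand)
    return None
```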

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Issues fixed:
1. Frame buffer was being cleared in getExpressionData() when TTS ended,
   causing subsequent segments to have no frames
2. Expression API was called in parallel for all segments, causing frame
   buffer to contain mixed frames from multiple segments

Changes:
- LAMAvatar.astro: Remove buffer clearing on TTS ended (let controller
  manage buffer lifecycle)
- concierge-controller.ts:
  - Only clear buffer on isStart=true (new speech start)
  - For remaining sentences, call expression API right before playback
    (not in parallel) to ensure each segment has its own frames
  - Clear buffer before each subsequent segment plays

This ensures each audio segment has the correct frames in the buffer
when it plays, regardless of how many segments are in the response.
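
An asyncio analogue of the sequencing described above (the actual controller is TypeScript in concierge-controller.ts; the avatar object and the two awaited helpers are hypothetical stand-ins):

```python
async def play_segments(segments, avatar, fetch_expression_frames, play_audio):
    """Fetch expression frames for each TTS segment right before it plays, never in parallel."""
    for segment in segments:
        avatar.clear_frame_buffer()                      # fresh buffer per segment
        frames = await fetch_expression_frames(segment)  # expression API call just-in-time
        avatar.queue_expression_frames(frames)
        await play_audio(segment)                        # resolves on the 'ended' event
```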

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
The LAM Audio2Expression model can only process ~64 frames (~2 seconds)
per inference call. For longer audio, we need to use streaming mode
properly:

1. Added process_full_audio() method that:
   - Splits audio into 1-second chunks
   - Calls infer_streaming_audio() for each chunk with context
   - Maintains streaming context between calls
   - Concatenates all expression outputs

2. Updated API endpoint to use process_full_audio() instead of
   process_audio() for single-chunk processing

This ensures that a 19.7s audio clip generates ~591 frames (at 30fps)
instead of only 73 frames from a single inference call.
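
A sketch of the chunk-and-carry-context pattern; the infer_streaming_audio argument order and return shape are assumptions, only the pattern itself comes from this commit:

```python
import numpy as np

def process_full_audio(model, audio, sample_rate, chunk_seconds=1.0):
    """Run streaming inference over long audio by feeding 1-second chunks with carried context."""
    chunk_len = int(sample_rate * chunk_seconds)
    context = None
    outputs = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        # Assumed signature: returns (expression frames, updated streaming context).
        expressions, context = model.infer_streaming_audio(chunk, sample_rate, context)
        outputs.append(expressions)
    return np.concatenate(outputs, axis=0)   # ~30 expression frames per second of audio
```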

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Update sample rate handling: 24000Hz API input (official AvatarLAMConfig default);
  the model internally resamples to 16000Hz via infer_streaming_audio
- Pass ssr (source sample rate) properly through process_audio -> infer_streaming_audio
- Auto-detect MP3 native sample rate instead of forcing resampling
- Update process_full_audio to accept and propagate sample_rate parameter
- Update WebSocket handler to support sample_rate in messages
- Update mock expression to use correct sample rate for frame calculation

Reference: OpenAvatarChat/src/handlers/avatar/lam/avatar_handler_lam_audio2expression.py
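
pydub exposes the decoded MP3's native rate directly; a sketch of how the detection might look (the surrounding request handling is assumed):

```python
import io
import numpy as np
from pydub import AudioSegment

def decode_mp3(mp3_bytes):
    """Decode MP3 and report its native sample rate instead of forcing a resample."""
    seg = AudioSegment.from_file(io.BytesIO(mp3_bytes), format="mp3")
    samples = np.array(seg.get_array_of_samples(), dtype=np.float32)
    samples /= float(1 << (8 * seg.sample_width - 1))              # normalize PCM to [-1, 1]
    if seg.channels > 1:
        samples = samples.reshape(-1, seg.channels).mean(axis=1)   # downmix to mono
    return samples, seg.frame_rate   # frame_rate is the MP3's native sample rate (ssr)
```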

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Includes environment variables for model paths:
- LAM_A2E_PATH: LAM_Audio2Expression code
- LAM_WEIGHT_PATH: lam_audio2exp_streaming.tar weights
- WAV2VEC_PATH: wav2vec2-base-960h encoder

Requires 4Gi memory and 2 CPU for PyTorch inference.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Following official OpenAvatarChat documentation:
- "只有vad和asr运行在本地gpu,对机器性能依赖很轻"
- Audio2Expression runs on CPU (light resource usage)

Includes:
- LAM_Audio2Expression code
- Model files (wav2vec2-base-960h, lam_audio2exp_streaming.tar)
- pydub for MP3 decoding
- All required dependencies for CPU inference

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Add /app/LAM_Audio2Expression to path candidates
- Add /app/models/ paths for model weights and wav2vec2
- Update .gitignore to exclude large model files

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Add start.sh to download models from GCS on startup
- Update Dockerfile to install google-cloud-cli
- Remove COPY models/ from Dockerfile (models loaded at runtime)

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Cloud Build needs LAM_Audio2Expression for the Docker image, but it's
in .gitignore for development. The .gcloudignore allows Cloud Build to
upload the directory while keeping it out of git.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Include LAM_Audio2Expression directory in git (removed from .gitignore)
- CPU modifications in engines/infer.py:
  - Added get_device() function for CPU/GPU detection
  - Changed model.cuda() to model.to(self.device)
  - Changed torch.load() to use map_location=self.device
  - Changed all .cuda(non_blocking=True) to .to(self.device)

This allows Cloud Build to include the modified code directly.
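
The modifications follow the standard torch device pattern; a minimal sketch (method and attribute names paraphrased from the list above):

```python
import torch

def get_device():
    """Prefer CUDA when available, otherwise fall back to CPU (the Cloud Run case)."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = get_device()
# model.cuda()                      ->  model.to(device)
# torch.load(path)                  ->  torch.load(path, map_location=device)
# tensor.cuda(non_blocking=True)    ->  tensor.to(device, non_blocking=True)
```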

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
PyTorch 2.6 changed the default value of torch.load's weights_only argument from False to True.
This fix allows loading model checkpoints without the weights_only error.
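
Presumably the fix passes weights_only=False explicitly; a one-line sketch with a placeholder path:

```python
import torch

# PyTorch 2.6 defaults weights_only=True, which rejects pickled objects in older
# checkpoints; pass weights_only=False explicitly for the trusted LAM checkpoint.
checkpoint = torch.load("lam_audio2exp_streaming.tar", map_location="cpu", weights_only=False)
```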

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
The lam_audio2exp_streaming.tar in GCS is a regular tar file, not gzipped.
The previous code incorrectly:
1. Saved it as .tar.gz
2. Tried to gunzip it (which fails)
3. Resulted in missing model file -> mock mode

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Root cause analysis:
- GCS model file (356MB) differs from correct local file (390MB)
- PyTorch fails with "filename 'storages' not found" when loading corrupted checkpoint
- save_path directory may not exist in Docker container

Fixes:
- Add model file size verification in start.sh (warns if <350MB)
- Ensure save_path directory exists before model initialization
- Add fix_gcs_model.sh script to re-upload correct model to GCS

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
The LAM_audio2exp_streaming.tar from HuggingFace/OSS is gzip compressed
(~356MB) and needs decompression to ~390MB for torch.load to work.

Changes:
- Auto-detect gzip compression using 'file' command
- Decompress in-place if needed
- Verify final size is ~390MB after decompression
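
start.sh relies on the `file` command; an equivalent check in Python would look at the gzip magic bytes (the path handling here is illustrative):

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"

def ensure_decompressed(path):
    """Decompress the checkpoint in place if it is gzip-compressed (~356MB -> ~390MB)."""
    with open(path, "rb") as f:
        if f.read(2) != GZIP_MAGIC:
            return  # already uncompressed; torch.load can read it as-is
    tmp = path + ".decompressed"
    with gzip.open(path, "rb") as src, open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    shutil.move(tmp, path)
```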

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Increase timeout from 1200s to 3600s (60 minutes)
- Add E2_HIGHCPU_8 machine type for faster builds and network
- Add 100GB disk size for large Docker image
- Consolidate push steps using --all-tags

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Modified start.sh to start uvicorn immediately while downloading models in background
- Modified app.py to initialize model asynchronously after models are ready
- Health endpoint always returns 'ok' for Cloud Run liveness probe
- Model downloads and initialization happen in parallel with server startup
- Server starts in mock mode and switches to inference mode when model is ready
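
A minimal FastAPI sketch of the pattern, with a stand-in for the real download-and-initialize work:

```python
import threading
import time
from fastapi import FastAPI

app = FastAPI()
state = {"mode": "mock"}                   # served while models are still initializing

def init_model():
    # Placeholder for the real work: download weights, build the Audio2Expression model.
    time.sleep(1)
    state["mode"] = "inference"

@app.on_event("startup")
def start_background_init():
    # Kick off initialization without blocking port binding, so Cloud Run's probe passes immediately.
    threading.Thread(target=init_model, daemon=True).start()

@app.get("/health")
def health():
    # Always report ok for the liveness probe; the mode field shows readiness.
    return {"status": "ok", "mode": state["mode"]}
```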

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Start uvicorn in background first for immediate port binding
- Run download_models in foreground so Cloud Run captures all logs
- Add error handling for gsutil failures
- Wait for uvicorn process at the end

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Replaced gsutil bash commands with google-cloud-storage Python library
for more reliable model downloading in Cloud Run environment:
- Added google-cloud-storage to requirements.txt
- Added download_models_from_gcs() function in app.py
- Simplified start.sh to just start uvicorn
- Python GCS client uses Application Default Credentials automatically
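
The google-cloud-storage calls involved are few; a sketch of what download_models_from_gcs() might look like, with a placeholder bucket and prefix:

```python
from pathlib import Path
from google.cloud import storage

def download_models_from_gcs(bucket_name, prefix, dest_dir):
    """Download every object under prefix into dest_dir using Application Default Credentials."""
    client = storage.Client()                        # picks up ADC on Cloud Run automatically
    dest = Path(dest_dir)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue                                 # skip folder placeholder objects
        target = dest / Path(blob.name).relative_to(prefix)
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))

# Example (bucket and prefix are placeholders, not values from this PR):
# download_models_from_gcs("my-models-bucket", "audio2exp/", "/app/models")
```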

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Major refactoring to use Cloud Run Gen 2 with GCS FUSE mount:
- app.py: Use lifespan context, read models from /mnt/models/audio2exp
- cloudbuild.yaml: Add --execution-environment gen2, --cpu-boost,
  --add-volume for GCS bucket mount, increase memory to 8Gi
- requirements.txt: Remove google-cloud-storage (not needed with FUSE)

Models are mounted as filesystem at /mnt/models, eliminating
download time and authentication issues.
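
A minimal sketch of the lifespan pattern with the FUSE-mounted model directory (the model construction itself is omitted):

```python
from contextlib import asynccontextmanager
from pathlib import Path
from fastapi import FastAPI

MODEL_DIR = Path("/mnt/models/audio2exp")   # GCS bucket mounted by Cloud Run gen2

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: the mount already exposes the weights as ordinary files.
    app.state.model_available = MODEL_DIR.exists()
    # A real implementation would construct the Audio2Expression model here.
    yield
    # Shutdown: nothing to release in this sketch.

app = FastAPI(lifespan=lifespan)
```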

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
Cloud Run memory quota is 40GB per region. With 8Gi memory per instance,
10 instances would require 80GB, exceeding the quota.
Reduced to 5 instances (40GB) to comply.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
8Gi × 4 = 32GB, well under the 40GB quota limit.
Following Gemini's recommendation to avoid edge cases.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- Dockerfile: Download models during Cloud Build, COPY into image
- cloudbuild.yaml: Add GCS download step, remove FUSE configuration
- app.py: Simplify with step-by-step initialization error tracking

This eliminates FUSE complexity and runtime model loading issues.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z
- LAM_Audio2Expression_HANDOFF.md: Complete project overview and technical architecture
- ANALYSIS_REQUEST.md: Specific analysis requests with reflection on previous AI's mistakes

These documents are for handing off the task to another model for deeper analysis.

https://claude.ai/code/session_01JWNLRvnwsRuDRVFzGGC37z