Inaccurate Coordinate Outputs for MLX-Quantized UI-TARS-1.5 (4bit/6bit) #330

Open
francedot opened this issue Apr 28, 2025 · 6 comments

francedot commented Apr 28, 2025

We've tested the quantized version of the UI-TARS-1.5 model (4-bit and 6-bit quantization) implemented with MLX. The work-in-progress implementation can be found here:
https://github.com/trycua/cua/tree/feature/agent/uitars-mlx

Problems Observed:

  • The model struggles to output accurate (x, y) click coordinates when running under MLX quantization (both 4-bit and 6-bit).
  • Specifically, the predicted coordinates often land on the wrong region, most commonly near the center of the screen.
  • It’s possible that the quantized models are particularly sensitive to slight implementation differences compared to the full precision models (tested on cloud inference).

Testing Setup:

  • Models tested locally on Mac with MLX (Apple Silicon).
  • Observed a significant performance hit when running the model and the c/ua VM simultaneously.
  • Full precision model hosted on AWS endpoint produced correct behavior, suggesting the issue is specific to the MLX quantization.

Artifacts:

  • Trajectories comparison archive (full precision vs 6-bit quantized outputs) attached:
    cua_uitars_trajectories.zip
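
A comparison like the one in the archive can be quantified step by step. The sketch below is a minimal, hypothetical helper, not code from the Cua repository. It assumes each recorded action is a string containing a point such as `(512,384)`; the real trajectory files may be laid out differently. For every step it reports the pixel error between the full precision and quantized click points, plus the quantized point's distance from the screen center, which makes a center bias easy to spot.

```python
import math
import re

# Assumption: a click action contains a point written as "(x,y)".
# The actual trajectory format in the archive may differ.
POINT_RE = re.compile(r"\((\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\)")


def extract_point(action):
    """Return the first (x, y) pair found in an action string, or None."""
    match = POINT_RE.search(action)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def compare_steps(reference_actions, quantized_actions, screen_size):
    """Compare click points from two runs, step by step.

    screen_size is (width, height) in pixels. Returns one dict per step
    that has a point in both runs.
    """
    width, height = screen_size
    center = (width / 2, height / 2)
    rows = []
    for step, (ref, quant) in enumerate(zip(reference_actions, quantized_actions)):
        ref_pt = extract_point(ref)
        quant_pt = extract_point(quant)
        if ref_pt is None or quant_pt is None:
            continue  # skip steps without coordinates, e.g. hotkeys
        rows.append({
            "step": step,
            "error_px": math.dist(ref_pt, quant_pt),
            "center_dist_px": math.dist(quant_pt, center),
        })
    return rows
```

A run in which `error_px` is large while `center_dist_px` stays small across many steps would match the center-of-screen behavior described above.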

Environment Details:

  • Mac M3 Pro (local testing)
  • AWS EC2 for cloud model hosting (full precision comparison)
  • Cua framework for running agents (https://github.com/trycua/cua)

prncvrm commented May 1, 2025

@francedot this looks similar to #319.

prncvrm commented May 2, 2025

@francedot can you take a pull from my fork and validate if it works as expected?

ddupont808 commented May 5, 2025

Quoting @prncvrm: "@francedot can you take a pull from my fork and validate if it works as expected?"

I pulled from your fork (commit b355cf1 from today), but I'm still seeing a bug with the coordinate outputs:

mlx-vlm ui-tars:

Clipboard-20250505-145258-678.mp4

torch ui-tars:

Clipboard-20250505-145856-783.mp4

misc environment details:

  • using this code for running the agent: trycua/cua@6a6fe48
  • and the following prompt: "please drag a line from the red circle to the green circle, then open a new tab and go to reddit\n\n(You are operating on macOS, use 'cmd' instead of 'ctrl' for most shortcuts e.g., hotkey(key='cmd c') for copy, hotkey(key='cmd v') for paste, hotkey(key='cmd t') for new tab).)"

@GaleiqTesting

hey @prncvrm any fix for the above? keen on trying to get the mlx model to work

prncvrm commented May 6, 2025

Yes, the new UI-TARS-1.5 uses Qwen2.5-VL, while the fix I raised was for Qwen2-VL.
The two issues are similar; I'll try to fix this one and raise a PR by this weekend, hopefully.
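
One way a backbone change can shift click positions, offered here as an illustration and not as the confirmed cause of this bug, is a change in coordinate convention. The sketch below assumes (this is not stated in the thread) that a Qwen2-VL based model emits points on a 0-1000 relative grid, while a Qwen2.5-VL based model emits absolute pixel positions in the resized input image. The `smart_resize` function is modeled on the helper of the same name in the Qwen2-VL image processor; the default pixel budgets and the two `to_screen_*` helpers are hypothetical.

```python
import math


def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=16384 * 28 * 28):
    """Round an image size to multiples of `factor` within a pixel budget.

    Modeled on the Qwen2-VL image processor helper; the default budgets
    here are assumptions and should match the ones used at inference time.
    """
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: shrink both sides by the same ratio.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: grow both sides by the same ratio.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def to_screen_relative(point, screen_size, grid=1000):
    """Map a point on a 0..grid relative scale to screen pixels."""
    x, y = point
    width, height = screen_size
    return x / grid * width, y / grid * height


def to_screen_absolute(point, resized_size, screen_size):
    """Map a point in resized-image pixels to screen pixels."""
    x, y = point
    resized_w, resized_h = resized_size
    width, height = screen_size
    return x / resized_w * width, y / resized_h * height
```

Under these assumptions, decoding absolute-pixel output with the relative-grid formula (or the reverse) would scale every point by the wrong factor, so it is worth confirming which convention each code path expects.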

Blaizzy (Owner) commented May 6, 2025

Awesome, feel free to ping me when it's ready!

I also left a comment on your current PR.
