
🔄 daily merge: master → main 2026-01-22 #754

Open
antfin-oss wants to merge 449 commits into main from
create-pull-request/patch-87a674d466

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

📅 Created: 2026-01-22
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

eicherseiji and others added 30 commits January 6, 2026 15:40
…prising (ray-project#59390)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This PR adds support in the `JaxTrainer` to schedule across multiple TPU
slices using the `ray.util.tpu` public utilities.

To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling
config, which consolidate the accelerator related fields for TPU and
GPU. When `TPUAcceleratorConfig` is specified, the JaxTrainer utilizes a
`SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of
the desired topology, auto-detecting the required values for
`num_workers` and `resources_per_worker` when unspecified.

TODO: I'll add some manual testing and usage examples in the comments.
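A rough usage sketch of what this enables; the import paths and the `TPUAcceleratorConfig` fields below are assumptions based on this description, not a confirmed public API:

```python
# Hedged sketch only: import paths and field names are inferred from this
# PR's description and may not match the merged code.
from ray.train import ScalingConfig                        # assumed V2 scaling config
from ray.train.v2.jax import JaxTrainer                    # assumed import path
from ray.train.v2.api.config import TPUAcceleratorConfig   # assumed import path


def train_fn():
    import jax
    print("devices per worker:", jax.device_count())


trainer = JaxTrainer(
    train_loop_per_worker=train_fn,
    scaling_config=ScalingConfig(
        # Atomically reserves num_slices TPU slices of the given topology via
        # a SlicePlacementGroup; num_workers and resources_per_worker are
        # auto-detected when left unset.
        accelerator_config=TPUAcceleratorConfig(num_slices=2, topology="4x4"),
    ),
)
trainer.fit()
```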

## Related issues
ray-project#55162


---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e policy (ray-project#59803)

## Description
Given a typical scenario of a fast-producing operator followed by a
slow-producing operator, how do the backpressure policy and resource
allocator behave? This change just adds tests to cement the expected
behavior.

## Related issues
DATA-1712


---------

Signed-off-by: Goutam <goutam@anyscale.com>
This PR adds documentation for several Ray Serve environment variables
that were defined in `constants.py` but missing from the documentation,
and also cleans up deprecated legacy environment variable names.

### Changes Made

#### Documentation additions

**`doc/source/serve/production-guide/config.md`** (Proxy config
section):
- `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` - Control whether to always
run a proxy on the head node
- `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` - Proxy health check timeout
- `RAY_SERVE_PROXY_HEALTH_CHECK_PERIOD_S` - Proxy health check period
- `RAY_SERVE_PROXY_READY_CHECK_TIMEOUT_S` - Proxy ready check timeout
- `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` - Minimum proxy draining
period

**`doc/source/serve/production-guide/fault-tolerance.md`** (New "Replica
constructor retries" section):
- `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT` - Max constructor retries per
replica
- `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` - Max constructor
retries per deployment

**`doc/source/serve/advanced-guides/performance.md`**:
- `RAY_SERVE_PROXY_PREFER_LOCAL_NODE_ROUTING` - Proxy node locality
routing preference
- `RAY_SERVE_PROXY_PREFER_LOCAL_AZ_ROUTING` - Proxy AZ locality routing
preference
- `RAY_SERVE_MAX_CACHED_HANDLES` - Max cached deployment handles
(controller debugging section)

**`doc/source/serve/monitoring.md`**:
- `RAY_SERVE_HTTP_PROXY_CALLBACK_IMPORT_PATH` - HTTP proxy
initialization callback
- `SERVE_SLOW_STARTUP_WARNING_S` - Slow startup warning threshold
- `SERVE_SLOW_STARTUP_WARNING_PERIOD_S` - Slow startup warning interval

#### Code cleanup

**`python/ray/serve/_private/constants.py`**:
- Removed legacy fallback for `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`
(now only `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`)
- Removed legacy fallback for `MAX_PER_REPLICA_RETRY_COUNT` (now only
`RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT`)
- Removed legacy fallback for `MAX_CACHED_HANDLES` (now only
`RAY_SERVE_MAX_CACHED_HANDLES`)

**`python/ray/serve/_private/constants_utils.py`**:
- Removed `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` and
`MAX_PER_REPLICA_RETRY_COUNT` from the deprecated names whitelist
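For context, these are plain environment variables read when Serve's constants module is imported, so they must be set before Serve starts; a minimal sketch (values illustrative):

```python
# Illustrative only: set in the process environment before Serve starts.
import os

os.environ["RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S"] = "30"
os.environ["RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT"] = "5"  # RAY_SERVE_-prefixed name only, post-cleanup

from ray import serve

serve.start()
```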

---------

Signed-off-by: harshit <harshit@anyscale.com>
…reating (ray-project#59610)

Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description
Allow
`RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES`
to accept `ALL` so that all events are exported. This will be used by the
history server. (Without this config, KubeRay needs to explicitly list each
event type, which is tedious as the list may grow in the future.)
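A before/after sketch of the configuration; the explicit event-type names below are hypothetical placeholders:

```python
import os

VAR = "RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES"

# Before: every exposable event type enumerated explicitly
# (names here are hypothetical placeholders).
os.environ[VAR] = "TASK_DEFINITION_EVENT,TASK_EXECUTION_EVENT"

# After this change: a single wildcard exports all event types.
os.environ[VAR] = "ALL"
```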


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…project#59784)

## Description
Run the state API and task event unit tests with both the default flow
(task_event -> gcs) and the aggregator flow (task_event -> aggregator ->
gcs) to smooth the transition from the default to the aggregator flow.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
AnyscaleJobRunner is the only implementation/child class of
CommandRunner right now. There is no need to use inheritance.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
)

Add BuildContext TypedDict to capture post_build_script, python_depset,
their SHA256 digests, and environment variables for custom BYOD image
builds.

Changes:
- Add build_context.py with BuildContext TypedDict and helper functions:
  - make_build_context: constructs BuildContext with computed file digests
  - encode_build_context: deterministic minified JSON serialization
  - decode_build_context: JSON deserialization
  - build_context_digest: SHA256 digest of encoded context
- Refactor build_anyscale_custom_byod_image to accept BuildContext
instead of individual post_build_script and python_depset arguments
- Update callers: custom_byod_build.py, ray_bisect.py
- Add comprehensive unit tests
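A minimal sketch of the helpers described above; exact field names beyond `post_build_script` and `python_depset` are assumptions:

```python
# Sketch only: the real build_context.py may differ; digest/env field names
# below are assumptions.
import hashlib
import json
from typing import Dict, TypedDict


class BuildContext(TypedDict):
    post_build_script: str
    post_build_script_digest: str   # SHA256 of the script file (assumed field name)
    python_depset: str
    python_depset_digest: str       # SHA256 of the depset file (assumed field name)
    env: Dict[str, str]


def encode_build_context(ctx: BuildContext) -> str:
    # Deterministic minified JSON: sorted keys, no whitespace.
    return json.dumps(ctx, sort_keys=True, separators=(",", ":"))


def decode_build_context(data: str) -> BuildContext:
    return json.loads(data)


def build_context_digest(ctx: BuildContext) -> str:
    # SHA256 digest of the encoded context.
    return hashlib.sha256(encode_build_context(ctx).encode("utf-8")).hexdigest()
```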

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…project#59839)

# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow
chunks to NumPy arrays

## Issue

Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow
chunks to NumPy arrays with `zero_copy_only=True`:

```
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

This error occurs when checkpoint data is loaded from Ray's object
store, where PyArrow buffers may reside in shared memory and cannot be
zero-copied to NumPy.

## Reproduction

```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != 'id_0').write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

## Solution

Change `to_numpy(zero_copy_only=True)` to
`to_numpy(zero_copy_only=False)` in
`BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows
PyArrow to copy data when necessary.

### Changes

**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`

- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to
`ckpt_chunk.to_numpy(zero_copy_only=False)`

### Performance Impact

No performance regression expected. PyArrow will only perform a copy
when zero-copy is not possible.
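The failure mode can be reproduced with PyArrow alone, since string buffers can never be viewed zero-copy by NumPy:

```python
import pyarrow as pa

ids = pa.array(["id_1", "id_2"])           # string arrays are never zero-copy to NumPy
# ids.to_numpy(zero_copy_only=True)        # raises pyarrow.lib.ArrowInvalid
print(ids.to_numpy(zero_copy_only=False))  # copies into an object-dtype array
```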

Signed-off-by: dragongu <andrewgu@vip.qq.com>
## Description
Adds repr_name field to actor_lifecycle_event schema and populates it
when available.

## Related issues
Closes ray-project#59813


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
…y-project#59893)

## Description

Fix inconsistent task name in metrics between RUNNING and FINISHED
states.

When a Ray task is defined with a custom name via
`.options(name="custom_name")`, the `ray_tasks` metrics show
inconsistent names:
- **RUNNING** state: shows the original function name (e.g., `RemoteFn`)
- **FINISHED/FAILED** state: shows the custom name (e.g., `test`)

**Root cause:** The RUNNING task counter in `CoreWorker` uses
`FunctionDescriptor()->CallString()` to get the task name, while
finished task events correctly use `TaskSpecification::GetName()`.

**Fix:** Changed both `HandlePushTask` and `ExecuteTask` in
`core_worker.cc` to use `task_spec.GetName()` consistently, which
properly returns the custom name when set.
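A minimal reproducer using the standard Ray API (nothing here is new to this PR):

```python
import ray


@ray.remote
def remote_fn():
    return 1


ray.init()
# Before the fix, ray_tasks metrics reported "remote_fn" while RUNNING but
# "custom_name" once FINISHED; with the fix both states report "custom_name".
ray.get(remote_fn.options(name="custom_name").remote())
```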

## Related issues

None - this PR addresses a newly discovered bug.

## Additional information

**Files changed:**
- `src/ray/core_worker/core_worker.cc` - Use `GetName()` instead of
`FunctionDescriptor()->CallString()` for metrics
- `python/ray/tests/test_task_metrics.py` - Added test
`test_task_custom_name_metrics` to verify custom names appear correctly
in metrics

Signed-off-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
Co-authored-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
## Description
Update metrics export docs based on changes in
ray-project#59337


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…ray-project#59808)

Adds a new RLlib algorithm TQC, which extends SAC with distributional
critics using quantile regression to control Q-function overestimation
bias.

Key components:
- TQC algorithm configuration and implementation
- Default TQC RLModule with multiple quantile critics
- TQC catalog for building network components
- Comprehensive test suite covering compilation, simple environments,
and parameter validation
- Documentation including
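A configuration sketch patterned after RLlib's existing `SACConfig` API; `TQCConfig` and the option names below are assumptions rather than confirmed API:

```python
# Assumed API: TQCConfig and these training options are guesses patterned
# after SACConfig; consult the merged code for the real names.
from ray.rllib.algorithms.tqc import TQCConfig  # assumed import path

config = (
    TQCConfig()
    .environment("Pendulum-v1")
    .training(
        num_critics=2,             # assumed: ensemble of quantile critics
        num_quantiles=25,          # assumed: quantiles per critic
        num_dropped_quantiles=2,   # assumed: top quantiles truncated to curb overestimation
    )
)
algo = config.build()
print(algo.train())
```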


---------

Signed-off-by: tk42 <nsplat@gmail.com>
Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
Related to ray-project#58876

```bash
❯ python -c "
import ray
# Temporarily patch to test the warning shows
from ray._common import pydantic_compat
original = pydantic_compat.IS_PYDANTIC_2
pydantic_compat.IS_PYDANTIC_2 = False  # Simulate Pydantic v1

ray.init()
ray.shutdown()

pydantic_compat.IS_PYDANTIC_2 = original
"
2025-12-26 22:33:01,387 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:6379...
2025-12-26 22:33:01,407 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
/home/ubuntu/ray/python/ray/_private/worker.py:2050: FutureWarning: Pydantic v1 is deprecated and will no longer be supported in Ray 2.56. Please upgrade to Pydantic v2 by running `pip install -U pydantic`. See ray-project#58876 for more details.
  warnings.warn(
```

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Create a resource bundle for each learner; do not pack all learners into a
single bundle.
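A sketch using Ray's placement-group API; the resource shapes are illustrative:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()
num_learners = 4

# Before: one combined bundle, e.g. [{"CPU": 4, "GPU": 4}], which must fit on
# a single node. After: one bundle per learner, so learners can spread out.
bundles = [{"CPU": 1, "GPU": 1} for _ in range(num_learners)]
pg = placement_group(bundles, strategy="PACK")  # pg.ready() resolves once reserved
```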

Related to ray-project#51017

---------

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…ay-project#59921)

Migrate remaining std::unordered_map to absl::flat_hash_map

---------

Signed-off-by: yang <yanghang233@126.com>
- Adding Anyscale template configs for the async inference template

Signed-off-by: harshit <harshit@anyscale.com>
As-is, this script installs for the arm architecture, regardless of the
actual machine type. Also bumping the version to unblock an issue with
running on a newer OpenSSL version:
```
[ERROR 2026-01-07 03:46:50,067] crane_lib.py: 70  Crane command `/home/forge/.cache/bazel/_bazel_forge/5fe90af4e7d1ed9fcf52f59e39e126f5/external/crane_linux_x86_64/crane copy 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu` failed with stderr:
--
2026/01/07 03:46:49 Copying from 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu to us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu
ERROR: gcloud failed to load: module 'lib' has no attribute 'X509_V_FLAG_NOTIFY_POLICY'
gcloud_main = _import_gcloud_main()
import googlecloudsdk.gcloud_main
from googlecloudsdk.calliope import cli
from googlecloudsdk.calliope import backend
from googlecloudsdk.calliope import parser_extensions
from googlecloudsdk.core.updater import update_manager
from googlecloudsdk.core.updater import installers
from googlecloudsdk.core.credentials import store
from googlecloudsdk.api_lib.auth import util as auth_util
from googlecloudsdk.core.credentials import google_auth_credentials as c_google_auth
from oauth2client import client as oauth2client_client
from oauth2client import crypt
from oauth2client import _openssl_crypt
from OpenSSL import crypto
from OpenSSL import SSL, crypto
from OpenSSL.crypto import (
class X509StoreFlags:
NOTIFY_POLICY: int = _lib.X509_V_FLAG_NOTIFY_POLICY
This usually indicates corruption in your gcloud installation or problems with your Python interpreter.

```

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com>
…ect#58435)

- Fix memory safety for core_worker in the shutdown executor -- use
`weak_ptr` instead of raw pointer.
- Ensure shutdown completes before core worker destructs.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
No longer relevant.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Add documentation to 20 functions in ci/raydepsets/cli.py that were
missing docstrings, improving code readability and maintainability.

🤖 Generated with [Claude Code]

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…9745)

## Description
Fixed a broken link in the read_unity_catalog doc string. Previous URL
was outdated.

## Related issues
None 

## Additional information
N/A

---------

Signed-off-by: Jess <jessica.jy.kong@gmail.com>
Signed-off-by: Jessica Kong <jessica.jy.kong@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description

`CountDistinct` allows users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.
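A usage sketch, assuming the import path matches Ray Data's other aggregations in `ray.data.aggregate`:

```python
import ray
from ray.data.aggregate import CountDistinct  # assumed export location

ds = ray.data.from_items([{"x": 1}, {"x": 1}, {"x": 2}])
# Analogous to SQL's COUNT(DISTINCT x); the output key name is illustrative.
print(ds.aggregate(CountDistinct("x")))  # e.g. {'count_distinct(x)': 2}
```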

## Related issues

close ray-project#58252


---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
…ct#59942)

Updating to reflect an issue that I debugged recently.

Recommendation is to use `overlayfs` instead of the default `vfs` for
faster container startup.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ialization overhead (ray-project#59919)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Fix typos in docs and docstrings. If any are too trivial, just let me know.
Agent assisted

---------

Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This was used early in the development of the Ray Dashboard and is not
used anymore, so we should remove it (I recently came across this).

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…7735)

There have been asks for enabling the --temp-dir flag on a per-node
basis, in contrast to the current implementation, which only allows
all nodes' temp dirs to be configured to the head node's temp dir
configuration.

This PR introduces the capability for the Ray temp directory
to be specified on a per-node basis, eliminating the restriction
that the --temp-dir flag can only be used in conjunction with the --head
flag. get_user_temp_dir and get_ray_temp_dir have been marked as
deprecated and replaced with the resolve_user_ray_temp_dir function
to ensure that the temp dir is consistent across the system.

## New Behaviors
**Temp dir**

|  | head node temp_dir NOT specified | head node temp_dir specified |
|---|---|---|
| worker node temp_dir NOT specified | Worker & head node use `/tmp/ray` | Worker uses head node's temp_dir |
| worker node temp_dir specified | Worker uses its own specified temp_dir. Head node uses default | Each node uses its own specified temp_dir |

**Object spilling directory**

| | head node spilling dir NOT specified | head node spilling dir specified |
|---|---|---|
| worker node spilling dir NOT specified | Each node uses its own temp_dir as spilling dir | Worker uses head node's spilling dir |
| worker node spilling dir specified | Worker uses its own specified spilling dir. Head node uses its temp_dir | Each node uses its own specified spilling dir |
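Both tables reduce to the same fallback chain. A minimal sketch of the resolution logic (the real `resolve_user_ray_temp_dir` signature may differ):

```python
def resolve_temp_dir(node_temp_dir, head_temp_dir, default="/tmp/ray"):
    # A node's own --temp-dir wins; otherwise inherit the head node's
    # setting; otherwise fall back to the default.
    return node_temp_dir or head_temp_dir or default


assert resolve_temp_dir(None, None) == "/tmp/ray"
assert resolve_temp_dir(None, "/data/ray") == "/data/ray"
assert resolve_temp_dir("/scratch/ray", "/data/ray") == "/scratch/ray"
```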

## Testing
We tested the expected behaviors on a local multi-node kuberay cluster
by verifying that:
1. nodes default to `/tmp/ray` when no node temp_dir is specified
2. non-head nodes picked up head node's temp_dir specifications when
only head node temp_dir was specified
3. non-head nodes can take independent temp_dir regardless of head node
temp_dir when specified
4. nodes default to their own temp dir as spilling directory for all
three cases above
5. nodes default to head node's spilling directory when only head node
spilling directory is specified
6. nodes can have their spilling directory specified independent of the
head node's spilling directory

Behaviors were verified by checking that the directories were created
and that the right information is fetched from the head node.

## Related issues

ray-project#47262
ray-project#51218
ray-project#40628
ray-project#32962

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [x] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
This PR should not introduce any breaking changes just yet. However,
this PR deprecates `get_user_temp_dir` and `get_ray_temp_dir`. The two
functions will be marked as errors in the next version update.

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Aydin-ab and others added 22 commits January 20, 2026 12:07
ray-project#59897)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…netes token authentication (ray-project#59621)

## Description

Per discussion from REP PR
(ray-project/enhancements#63), this PR adds a
server-side config `RAY_ENABLE_K8S_TOKEN_RBAC=true` to enable
Kubernetes-based token authentication. This must be set in addition to
`RAY_AUTH_MODE=token`. The main benefit of this change is that the
server-side authentication flow becomes opaque to clients, and all
clients only need to set `RAY_AUTH_MODE=token` along with their token.
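A sketch of the resulting configuration split (the environment variables are from this PR; how a client obtains and supplies its token is out of scope here):

```python
import os

# Server side (head node): both flags enable Kubernetes-token-based auth.
os.environ["RAY_AUTH_MODE"] = "token"
os.environ["RAY_ENABLE_K8S_TOKEN_RBAC"] = "true"

# Client side: only the auth mode plus the token itself; the Kubernetes
# verification flow stays opaque to the client.
os.environ["RAY_AUTH_MODE"] = "token"
```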

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…-project#60283)

## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect
user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and
GPU limits
- Update `get_total_resources()` to return the minimum of cluster
resources and user limits

## Why are these changes needed?

Previously, Ray Data's cluster autoscalers did not respect
user-configured resource limits. When a user set explicit limits like:

```python
import ray
# ExecutionResources lives under ray.data._internal in current Ray releases.
from ray.data._internal.execution.interfaces import ExecutionResources

ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

The autoscaler would ignore these limits and continue to request more
cluster resources from Ray's autoscaler, causing unnecessary node
upscaling even when the executor couldn't use the additional resources.

This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay
within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. The `ResourceManager.get_global_limits()` already respects user
limits, but the autoscaler bypassed this by requesting resources
directly
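The fix boils down to taking a `min()` in two places; a hypothetical illustration, not the actual autoscaler code:

```python
def capped_resource_request(requested_cpu: float, user_limit_cpu: float) -> float:
    # Never ask Ray's autoscaler for more than the user-configured limit.
    return min(requested_cpu, user_limit_cpu)


def get_total_resources(cluster_cpu: float, user_limit_cpu: float) -> float:
    # Report the effective budget as min(cluster resources, user limits).
    return min(cluster_cpu, user_limit_cpu)


assert capped_resource_request(requested_cpu=32, user_limit_cpu=8) == 8
assert get_total_resources(cluster_cpu=64, user_limit_cpu=8) == 8
```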

## Test Plan

Added comprehensive unit tests for both autoscaler implementations

## Related issue number

Fixes ray-project#60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed


---------

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…t#60267)

It is always an instance of AnyscaleJobRunner.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…0278)

and saves the job ID in `_job_id`. This makes the information flow
clearer and simpler.

this is preparation for refactoring the job sdk usage.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
per anyscale#727

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… from Ray Data (ray-project#60292)

## Description
Remove all top-level imports of `ray.data` from the `ray.train` module.
Imports needed only for type annotations should be guarded behind
`if TYPE_CHECKING:`. Imports needed at runtime should be moved inline (lazy
imports within functions/methods).
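A generic illustration of the two patterns (standard Python idioms):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers; no runtime import of ray.data.
    from ray.data import Dataset


def build_dataset(n: int) -> "Dataset":
    # Runtime use: import lazily, inside the function that needs it.
    import ray.data

    return ray.data.range(n)
```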

## Related issues
Fixes ray-project#60152.

---------

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…project#60351)

ray-project#59631 changed the way the
`Dataset` representations look, but CI didn't test
`writing-code-snippet` in that PR's premerge CI. This PR fixes the
incorrect output.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ns' parameters for the 'serve' API" (ray-project#60355)

Reverts ray-project#56507

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Upgrade jaxlib and jax from 0.4.13 -> 0.4.22 due to the version missing from
the PyPI [index](https://pypi.org/project/jaxlib/#history); the oldest
version still available is 0.4.17.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…project#60347)

To improve readability, this PR separates the `DefaultAutoscalerV2` into
distinct sections for input validation, getting default values, and
setting attributes.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…models (Issue ray-project#60100) (ray-project#60102)

## Description

See ray-project#60100

## Related issues

Fixes ray-project#60100 

## Additional information

None

---------

Signed-off-by: antoine_galataud <antoine@foobot.io>
…urce Cleanup and Process Termination (ray-project#60172)

## Description
### Summary

This PR improves the `destroy_module()` method in
`SubprocessModuleHandle` to fix race conditions, implement graceful
process termination, and ensure complete resource cleanup. The changes
prevent resource leaks and improve the reliability of module restart and
shutdown.

### Key Changes

1. **Cancel health check task first** to prevent race conditions
- Why Cancel Health Check Task First: See
ray-project#60214 (comment)
- Uses smart detection to avoid canceling the current task when called
internally
    - Prevents health check task from interfering with cleanup
2. **Ordered resource cleanup** with clear dependencies:
    - Cancel health check task first
    - Close parent connection
    - Terminate process gracefully, then forcefully if needed
    - Close HTTP client session
3. **Graceful process termination**:
- First attempts graceful termination with `terminate()` and 5-second
timeout
    - Falls back to force kill (`kill()`) only if necessary
    - All `join()` calls have timeouts to prevent infinite blocking
4. **Error handling**: Try-except blocks ensure cleanup continues even
if one step fails

### Modified Files

1. `python/ray/dashboard/subprocesses/handle.py`
    - Refactored `destroy_module()` with ordered resource cleanup
    - Implemented graceful process termination with timeout protection
    - Added smart health check task cancellation logic
      - **Smart Detection Logic:**
        ```python
        current_task = asyncio.current_task(loop=self.loop)
        if current_task is None or self.health_check_task is not current_task:
            self.health_check_task.cancel()  # Only cancel if not the current task
        ```
        This ensures:
- **External calls**: Immediately cancel health check task to prevent
interference
- **Internal calls**: Don't cancel current task, allowing cleanup and
restart to complete
    - Added comprehensive error handling and logging

2. `python/ray/dashboard/subprocesses/tests/test_e2e.py`    
- Added `test_destroy_module_cleans_up_resources()` to verify complete
resource cleanup
- Added mock classes (`_DummyConn`, `_DummyProcess`, `_DummySession`)
for isolated testing
- Added cleanup logic in `start_http_server_app()` to prevent resource
leaks between tests
## Related issues
Closes ray-project#60214

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
…t#59768)

## Description

Fix `uv_runtime_env_hook.py` to pin worker Python version to driver
version.

If the system python version is different from the driver python
version, you can end up with a mismatch between python versions (e.g.
driver 3.11 vs worker 3.12), which causes Ray to deliberately crash
elsewhere.

This change ensures compatibility between the Ray driver and the worker
by specifying the Python version, preventing version mismatches.
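A sketch of the core idea; the hook's real internals are not shown here, but `--python` is a real uv flag:

```python
import sys

# Pin workers to the driver's interpreter version so that, e.g., a 3.11
# driver never gets 3.12 workers from the system uv default.
driver_python = "{}.{}.{}".format(*sys.version_info[:3])
uv_cmd = ["uv", "run", "--python", driver_python]  # illustrative command shape
```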

## Related issues

Fixes ray-project#59639.

## Additional information

---------

Signed-off-by: David Hall <david.hall@openathena.ai>
…project#59922)

## Description
Make the Union operator non-blocking when `preserve_order` is enabled and
`_add_input_inner` is called with the input at the front.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
## Why are these changes needed?

Fixes anyscale#758

Skip `test_no_spammy_errors_in_composed_app` and
`test_no_spammy_errors_in_grpc_proxy` on Windows due to:

- **Signal handling**: `test_no_spammy_errors_in_grpc_proxy` uses
`p.send_signal(signal.SIGINT)` which Windows doesn't support for
subprocesses (raises `ValueError: Unsupported signal: 2`)
- **Temp directory cleanup**: `test_no_spammy_errors_in_composed_app`
writes replica logs to a temp directory, and Windows doesn't release
file handles immediately after process termination, causing cleanup
failures (`NotADirectoryError: [WinError 267]`)

This follows the existing pattern from `test_logging.py:1216` which
skips similar tests on Windows:
```python
@pytest.mark.skipif(sys.platform == "win32", reason="Fail to look for temp dir.")
```

## Related issue number

anyscale#758

## Checks

- [x] I've signed all my commits
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router
queue length gauge has a 100ms throttle
(`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates
when they happen too quickly.

When replica initialization sets the gauge to 0 and a request
immediately updates it to 1, the second update may be throttled, causing
the test to see 0 instead of 1.
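The usual deflake for a throttled gauge is to poll instead of asserting once; a sketch using Ray's test helper, with a hypothetical metric getter:

```python
from ray._private.test_utils import wait_for_condition

# get_router_queue_len() is a hypothetical stand-in for reading the gauge.
wait_for_condition(
    lambda: get_router_queue_len() == 1,
    timeout=10,
    retry_interval_ms=100,
)
```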

## Related issue number

Fixes flaky test introduced in ray-project#59233 after ray-project#60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
After ray-project#60017 got merged, I forgot to update the `test_bundle_queue` test
suite. This PR adds more tests for `num_blocks`, `num_rows`,
`estimate_size_bytes`, and `len(queue)`.

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ject#60338)

## Description
This PR adds support for Google Cloud's 7th generation TPU (Ironwood).

The TPU 7x generation introduces a change in the accelerator type naming
convention reported by the environment. Unlike previous generations
(v6e-16, v5p-8, etc.), 7x instances report types starting with tpu (e.g.
tpu7x-16).

This PR accounts for the new format and enables Ray to detect the v7x
hardware automatically (users don't have to manually configure env
vars). This is critical for libraries like Ray Train and for vLLM
support, where the automatic device discovery is utilized during JAX
initialization.
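An illustration of the naming change; the helper and its logic are hypothetical:

```python
def tpu_generation(accel_type: str) -> str:
    # Hypothetical helper: previous generations look like "v6e-16" or
    # "v5p-8"; 7th-generation instances report e.g. "tpu7x-16".
    return accel_type.split("-")[0]


assert tpu_generation("v6e-16") == "v6e"
assert tpu_generation("tpu7x-16") == "tpu7x"
```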

## Related issues
Fixes ray-project#59964

## Additional information
For more info about TPU v7x:
https://docs.cloud.google.com/tpu/docs/tpu7x.

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
## Description

1. The flakiness in `test_flush_worker_result_queue`: when
`queue_backlog_length` is 0, after `wg._start()` we immediately call
`wg.poll_status()` and assert it has finished, but sometimes rank 0's
training thread is still running at that instant, leading to the error below:
```
where False = WorkerGroupPollStatus(worker_statuses={0: WorkerStatus(running=True, error=None, training_report=None), 1: WorkerStatus(running=False, error=None, training_report=None), 2: WorkerStatus(running=False, error=None, training_report=None), 3: WorkerStatus(running=False, error=None, training_report=None)}).finished
```
2. Use the same pattern as in `test_poll_status_finished` in the same file
to address this flakiness.
3. Increase the `test_placement_group_handle` test size to medium to avoid timeouts:
```

python/ray/train/v2/tests/test_placement_group_handle.py::test_slice_handle_shutdown -- Test timed out at 2026-01-20 18:12:46 UTC --
--
[2026-01-20T18:15:17Z] ERROR [100%]
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] ==================================== ERRORS ====================================
[2026-01-20T18:15:17Z] _________________ ERROR at setup of test_slice_handle_shutdown _________________
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     @pytest.fixture(autouse=True)
[2026-01-20T18:15:17Z]     def ray_start():
[2026-01-20T18:15:17Z] >       ray.init(num_cpus=4)
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] python/ray/train/v2/tests/test_placement_group_handle.py:16:
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/client_mode_hook.py:104: in wrapper
[2026-01-20T18:15:17Z]     return func(*args, **kwargs)
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1910: in init
[2026-01-20T18:15:17Z]     _global_node = ray._private.node.Node(
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/node.py:402: in __init__
[2026-01-20T18:15:17Z]     time.sleep(0.1)
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] signum = 15
[2026-01-20T18:15:17Z] frame = <frame at 0x55cf6cb749f0, file '/rayci/python/ray/_private/node.py', line 402, code __init__>
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     def sigterm_handler(signum, frame):
[2026-01-20T18:15:17Z] >       sys.exit(signum)
[2026-01-20T18:15:17Z] E       SystemExit: 15
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1670: SystemExit


```

4. Add a `manual` tag to the `test_jax_gpu` bazel target to temporarily
disable CI for this unit test, given that the PyPI jax version now requires
at least CUDA 12.2 while our CI runs on CUDA 12.1.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…d of the highest available version. (ray-project#60378)

Signed-off-by: irabbani <israbbani@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a large daily merge from master to main, encompassing a wide range of changes across the repository. The key themes include a major refactoring of the CI/CD system towards a more modular, wanda-based architecture, the deprecation of Python 3.9 in favor of Python 3.10 as the default, and extensive documentation updates. The CI changes are significant, introducing new build steps, rules, and automation scripts for better dependency management and build caching. Documentation has been improved with new examples, API reference updates, and internal design documents for new features like token authentication and port discovery. Several APIs have been updated for clarity and consistency. Overall, these changes represent a significant step forward in improving the project's build system, testing infrastructure, and documentation. My review found one minor issue related to an unused environment variable in a CI configuration file.

RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"


medium

The ARCH_SUFFIX environment variable is defined here but appears to be unused in the corresponding wanda file (ci/docker/manylinux-cibase.wanda.yaml) or the underlying Dockerfile. This can be confusing for future maintenance. Please consider removing this line if it's not needed.

@github-actions

github-actions bot commented Feb 5, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 5, 2026
