
🔄 daily merge: master → main 2026-01-22 #754

Open
antfin-oss wants to merge 449 commits into main from
create-pull-request/patch-87a674d466

Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into the main branch.

📅 Created: 2026-01-22
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

eicherseiji and others added 30 commits January 6, 2026 15:40
…prising (ray-project#59390)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This PR adds support in the `JaxTrainer` to schedule across multiple TPU
slices using the `ray.util.tpu` public utilities.

To support this, this PR adds new `AcceleratorConfig`s to the V2 scaling
config, which consolidate the accelerator related fields for TPU and
GPU. When `TPUAcceleratorConfig` is specified, the JaxTrainer utilizes a
`SlicePlacementGroup` to atomically reserve `num_slices` TPU slices of
the desired topology, auto-detecting the required values for
`num_workers` and `resources_per_worker` when unspecified.

TODO: I'll add some manual testing and usage examples in the comments.
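A rough usage sketch of what this enables; the import paths and the `TPUAcceleratorConfig` fields below are assumptions based on this description, not a confirmed public API:

```python
# Hedged sketch only: import paths and field names are inferred from this
# PR's description and may not match the merged code.
from ray.train import ScalingConfig                        # assumed V2 scaling config
from ray.train.v2.jax import JaxTrainer                    # assumed import path
from ray.train.v2.api.config import TPUAcceleratorConfig   # assumed import path


def train_fn():
    import jax
    print("devices per worker:", jax.device_count())


trainer = JaxTrainer(
    train_loop_per_worker=train_fn,
    scaling_config=ScalingConfig(
        # Atomically reserves num_slices TPU slices of the given topology via
        # a SlicePlacementGroup; num_workers and resources_per_worker are
        # auto-detected when left unset.
        accelerator_config=TPUAcceleratorConfig(num_slices=2, topology="4x4"),
    ),
)
trainer.fit()
```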

## Related issues
ray-project#55162


---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…e policy (ray-project#59803)

## Description
Given a typical scenario of a fast-producing operator followed by a
slow-producing operator, how do the backpressure policy and resource
allocator behave? This change just adds tests to cement the expected
behavior.

## Related issues
DATA-1712


---------

Signed-off-by: Goutam <goutam@anyscale.com>
This PR adds documentation for several Ray Serve environment variables
that were defined in `constants.py` but missing from the documentation,
and also cleans up deprecated legacy environment variable names.

### Changes Made

#### Documentation additions

**`doc/source/serve/production-guide/config.md`** (Proxy config
section):
- `RAY_SERVE_ALWAYS_RUN_PROXY_ON_HEAD_NODE` - Control whether to always
run a proxy on the head node
- `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` - Proxy health check timeout
- `RAY_SERVE_PROXY_HEALTH_CHECK_PERIOD_S` - Proxy health check period
- `RAY_SERVE_PROXY_READY_CHECK_TIMEOUT_S` - Proxy ready check timeout
- `RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S` - Minimum proxy draining
period

**`doc/source/serve/production-guide/fault-tolerance.md`** (New "Replica
constructor retries" section):
- `RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT` - Max constructor retries per
replica
- `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` - Max constructor
retries per deployment

**`doc/source/serve/advanced-guides/performance.md`**:
- `RAY_SERVE_PROXY_PREFER_LOCAL_NODE_ROUTING` - Proxy node locality
routing preference
- `RAY_SERVE_PROXY_PREFER_LOCAL_AZ_ROUTING` - Proxy AZ locality routing
preference
- `RAY_SERVE_MAX_CACHED_HANDLES` - Max cached deployment handles
(controller debugging section)

**`doc/source/serve/monitoring.md`**:
- `RAY_SERVE_HTTP_PROXY_CALLBACK_IMPORT_PATH` - HTTP proxy
initialization callback
- `SERVE_SLOW_STARTUP_WARNING_S` - Slow startup warning threshold
- `SERVE_SLOW_STARTUP_WARNING_PERIOD_S` - Slow startup warning interval

#### Code cleanup

**`python/ray/serve/_private/constants.py`**:
- Removed legacy fallback for `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`
(now only `RAY_SERVE_MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT`)
- Removed legacy fallback for `MAX_PER_REPLICA_RETRY_COUNT` (now only
`RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT`)
- Removed legacy fallback for `MAX_CACHED_HANDLES` (now only
`RAY_SERVE_MAX_CACHED_HANDLES`)

**`python/ray/serve/_private/constants_utils.py`**:
- Removed `MAX_DEPLOYMENT_CONSTRUCTOR_RETRY_COUNT` and
`MAX_PER_REPLICA_RETRY_COUNT` from the deprecated names whitelist
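For context, these are plain environment variables read when Serve's constants module is imported, so they must be set before Serve starts; a minimal sketch (values illustrative):

```python
# Illustrative only: set in the process environment before Serve starts.
import os

os.environ["RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S"] = "30"
os.environ["RAY_SERVE_MAX_PER_REPLICA_RETRY_COUNT"] = "5"  # RAY_SERVE_-prefixed name only, post-cleanup

from ray import serve

serve.start()
```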

---------

Signed-off-by: harshit <harshit@anyscale.com>
…reating (ray-project#59610)

Signed-off-by: dayshah <dhyey2019@gmail.com>
## Description
Allow
`RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES`
to accept `ALL` so that all events are exported. This will be used by the
history server. (Without this config, KubeRay needs to explicitly list each
event type, which is tedious as the list may grow in the future.)
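A before/after sketch of the configuration; the explicit event-type names below are hypothetical placeholders:

```python
import os

VAR = "RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES"

# Before: every exposable event type enumerated explicitly
# (names here are hypothetical placeholders).
os.environ[VAR] = "TASK_DEFINITION_EVENT,TASK_EXECUTION_EVENT"

# After this change: a single wildcard exports all event types.
os.environ[VAR] = "ALL"
```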


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…project#59784)

## Description
Run the state API and task event unit tests with both the default flow
(task_event -> gcs) and the aggregator flow (task_event -> aggregator ->
gcs) to smooth the transition from the default to the aggregator flow.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
AnyscaleJobRunner is the only implementation/child class of
CommandRunner right now. There is no need to use inheritance.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
)

Add BuildContext TypedDict to capture post_build_script, python_depset,
their SHA256 digests, and environment variables for custom BYOD image
builds.

Changes:
- Add build_context.py with BuildContext TypedDict and helper functions:
  - make_build_context: constructs BuildContext with computed file digests
  - encode_build_context: deterministic minified JSON serialization
  - decode_build_context: JSON deserialization
  - build_context_digest: SHA256 digest of encoded context
- Refactor build_anyscale_custom_byod_image to accept BuildContext
instead of individual post_build_script and python_depset arguments
- Update callers: custom_byod_build.py, ray_bisect.py
- Add comprehensive unit tests
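A minimal sketch of the helpers described above; exact field names beyond `post_build_script` and `python_depset` are assumptions:

```python
# Sketch only: the real build_context.py may differ; digest/env field names
# below are assumptions.
import hashlib
import json
from typing import Dict, TypedDict


class BuildContext(TypedDict):
    post_build_script: str
    post_build_script_digest: str   # SHA256 of the script file (assumed field name)
    python_depset: str
    python_depset_digest: str       # SHA256 of the depset file (assumed field name)
    env: Dict[str, str]


def encode_build_context(ctx: BuildContext) -> str:
    # Deterministic minified JSON: sorted keys, no whitespace.
    return json.dumps(ctx, sort_keys=True, separators=(",", ":"))


def decode_build_context(data: str) -> BuildContext:
    return json.loads(data)


def build_context_digest(ctx: BuildContext) -> str:
    # SHA256 digest of the encoded context.
    return hashlib.sha256(encode_build_context(ctx).encode("utf-8")).hexdigest()
```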

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…project#59839)

# Fix `ArrowInvalid` error in checkpoint filter when converting PyArrow
chunks to NumPy arrays

## Issue

Fixes `ArrowInvalid` error when checkpoint filtering converts PyArrow
chunks to NumPy arrays with `zero_copy_only=True`:

```
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 249, in filter_rows_for_block
    masks = list(executor.map(filter_with_ckpt_chunk, ckpt_chunks))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/data/checkpoint/checkpoint_filter.py", line 229, in filter_with_ckpt_chunk
    ckpt_ids = ckpt_chunk.to_numpy(zero_copy_only=True)
  File "pyarrow/array.pxi", line 1789, in pyarrow.lib.Array.to_numpy
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True
```

This error occurs when checkpoint data is loaded from Ray's object
store, where PyArrow buffers may reside in shared memory and cannot be
zero-copied to NumPy.

## Reproduction

```python
#!/usr/bin/env python3
import ray
from ray.data import DataContext
from ray.data.checkpoint import CheckpointConfig
import tempfile

ray.init()

with tempfile.TemporaryDirectory() as ckpt_dir, \
     tempfile.TemporaryDirectory() as data_dir, \
     tempfile.TemporaryDirectory() as output_dir:
    # Step 1: Create data
    ray.data.range(10).map(lambda x: {"id": f"id_{x['id']}"}).write_parquet(data_dir)

    # Step 2: Enable checkpoint and write
    ctx = DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        checkpoint_path=ckpt_dir,
        id_column="id",
        delete_checkpoint_on_success=False
    )
    ray.data.read_parquet(data_dir).filter(lambda x: x["id"] != 'id_0').write_parquet(output_dir)

    # Step 3: Second write triggers checkpoint filtering
    ray.data.read_parquet(data_dir).write_parquet(output_dir)

ray.shutdown()
```

## Solution

Change `to_numpy(zero_copy_only=True)` to
`to_numpy(zero_copy_only=False)` in
`BatchBasedCheckpointFilter.filter_rows_for_block()`. This allows
PyArrow to copy data when necessary.

### Changes

**File**: `ray/python/ray/data/checkpoint/checkpoint_filter.py`

- Line 229: Changed `ckpt_chunk.to_numpy(zero_copy_only=True)` to
`ckpt_chunk.to_numpy(zero_copy_only=False)`

### Performance Impact

No performance regression expected. PyArrow will only perform a copy
when zero-copy is not possible.
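The failure mode can be reproduced with PyArrow alone, since string buffers can never be viewed zero-copy by NumPy:

```python
import pyarrow as pa

ids = pa.array(["id_1", "id_2"])           # string arrays are never zero-copy to NumPy
# ids.to_numpy(zero_copy_only=True)        # raises pyarrow.lib.ArrowInvalid
print(ids.to_numpy(zero_copy_only=False))  # copies into an object-dtype array
```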

Signed-off-by: dragongu <andrewgu@vip.qq.com>
## Description
Adds repr_name field to actor_lifecycle_event schema and populates it
when available.

## Related issues
Closes ray-project#59813


---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
…y-project#59893)

## Description

Fix inconsistent task name in metrics between RUNNING and FINISHED
states.

When a Ray task is defined with a custom name via
`.options(name="custom_name")`, the `ray_tasks` metrics show
inconsistent names:
- **RUNNING** state: shows the original function name (e.g., `RemoteFn`)
- **FINISHED/FAILED** state: shows the custom name (e.g., `test`)

**Root cause:** The RUNNING task counter in `CoreWorker` uses
`FunctionDescriptor()->CallString()` to get the task name, while
finished task events correctly use `TaskSpecification::GetName()`.

**Fix:** Changed both `HandlePushTask` and `ExecuteTask` in
`core_worker.cc` to use `task_spec.GetName()` consistently, which
properly returns the custom name when set.
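A minimal reproducer using the standard Ray API (nothing here is new to this PR):

```python
import ray


@ray.remote
def remote_fn():
    return 1


ray.init()
# Before the fix, ray_tasks metrics reported "remote_fn" while RUNNING but
# "custom_name" once FINISHED; with the fix both states report "custom_name".
ray.get(remote_fn.options(name="custom_name").remote())
```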

## Related issues

None - this PR addresses a newly discovered bug.

## Additional information

**Files changed:**
- `src/ray/core_worker/core_worker.cc` - Use `GetName()` instead of
`FunctionDescriptor()->CallString()` for metrics
- `python/ray/tests/test_task_metrics.py` - Added test
`test_task_custom_name_metrics` to verify custom names appear correctly
in metrics

Signed-off-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
Co-authored-by: Yuan Jiewei <jieweihh.yuan@gmail.com>
## Description
Update metrics export docs based on changes in
ray-project#59337


---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
…ray-project#59808)

Adds a new RLlib algorithm TQC, which extends SAC with distributional
critics using quantile regression to control Q-function overestimation
bias.

Key components:
- TQC algorithm configuration and implementation
- Default TQC RLModule with multiple quantile critics
- TQC catalog for building network components
- Comprehensive test suite covering compilation, simple environments,
and parameter validation
- Documentation including
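A configuration sketch patterned after RLlib's existing `SACConfig` API; `TQCConfig` and the option names below are assumptions rather than confirmed API:

```python
# Assumed API: TQCConfig and these training options are guesses patterned
# after SACConfig; consult the merged code for the real names.
from ray.rllib.algorithms.tqc import TQCConfig  # assumed import path

config = (
    TQCConfig()
    .environment("Pendulum-v1")
    .training(
        num_critics=2,             # assumed: ensemble of quantile critics
        num_quantiles=25,          # assumed: quantiles per critic
        num_dropped_quantiles=2,   # assumed: top quantiles truncated to curb overestimation
    )
)
algo = config.build()
print(algo.train())
```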


---------

Signed-off-by: tk42 <nsplat@gmail.com>
Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
Related to ray-project#58876

```bash
❯ python -c "
import ray
# Temporarily patch to test the warning shows
from ray._common import pydantic_compat
original = pydantic_compat.IS_PYDANTIC_2
pydantic_compat.IS_PYDANTIC_2 = False  # Simulate Pydantic v1

ray.init()
ray.shutdown()

pydantic_compat.IS_PYDANTIC_2 = original
"
2025-12-26 22:33:01,387 INFO worker.py:1811 -- Connecting to existing Ray cluster at address: 172.31.7.228:6379...
2025-12-26 22:33:01,407 INFO worker.py:1991 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2039: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
/home/ubuntu/ray/python/ray/_private/worker.py:2050: FutureWarning: Pydantic v1 is deprecated and will no longer be supported in Ray 2.56. Please upgrade to Pydantic v2 by running `pip install -U pydantic`. See ray-project#58876 for more details.
  warnings.warn(
```

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description
Create a resource bundle for each learner; do not pack all learners into a
single bundle.
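A sketch using Ray's placement-group API; the resource shapes are illustrative:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()
num_learners = 4

# Before: one combined bundle, e.g. [{"CPU": 4, "GPU": 4}], which must fit on
# a single node. After: one bundle per learner, so learners can spread out.
bundles = [{"CPU": 1, "GPU": 1} for _ in range(num_learners)]
pg = placement_group(bundles, strategy="PACK")  # pg.ready() resolves once reserved
```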

Related to ray-project#51017

---------

Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…ay-project#59921)

Migrate remaining std::unordered_map to absl::flat_hash_map

---------

Signed-off-by: yang <yanghang233@126.com>
- Adding Anyscale template configs for the async inference template

Signed-off-by: harshit <harshit@anyscale.com>
As-is, this script installs for the arm architecture, regardless of the
actual machine type. Also bumping the version to unblock an issue with
running on a newer OpenSSL version:
```
[ERROR 2026-01-07 03:46:50,067] crane_lib.py: 70  Crane command `/home/forge/.cache/bazel/_bazel_forge/5fe90af4e7d1ed9fcf52f59e39e126f5/external/crane_linux_x86_64/crane copy 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu` failed with stderr:
--
2026/01/07 03:46:49 Copying from 029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:d656a31a-ray-anyscale-py3.10-cpu to us-west1-docker.pkg.dev/anyscale-oss-ci/anyscale/ray:pr-59902.3702b2-py310-cpu
ERROR: gcloud failed to load: module 'lib' has no attribute 'X509_V_FLAG_NOTIFY_POLICY'
gcloud_main = _import_gcloud_main()
import googlecloudsdk.gcloud_main
from googlecloudsdk.calliope import cli
from googlecloudsdk.calliope import backend
from googlecloudsdk.calliope import parser_extensions
from googlecloudsdk.core.updater import update_manager
from googlecloudsdk.core.updater import installers
from googlecloudsdk.core.credentials import store
from googlecloudsdk.api_lib.auth import util as auth_util
from googlecloudsdk.core.credentials import google_auth_credentials as c_google_auth
from oauth2client import client as oauth2client_client
from oauth2client import crypt
from oauth2client import _openssl_crypt
from OpenSSL import crypto
from OpenSSL import SSL, crypto
from OpenSSL.crypto import (
class X509StoreFlags:
NOTIFY_POLICY: int = _lib.X509_V_FLAG_NOTIFY_POLICY
This usually indicates corruption in your gcloud installation or problems with your Python interpreter.

```

---------

Signed-off-by: andrew <andrew@anyscale.com>
Signed-off-by: Andrew Pollack-Gray <andrew@anyscale.com>
…ect#58435)

- Fix memory safety for core_worker in the shutdown executor -- use
`weak_ptr` instead of raw pointer.
- Ensure shutdown completes before core worker destructs.

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
No longer relevant.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Add documentation to 20 functions in ci/raydepsets/cli.py that were
missing docstrings, improving code readability and maintainability.

🤖 Generated with [Claude Code]

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…9745)

## Description
Fixed a broken link in the read_unity_catalog doc string. Previous URL
was outdated.

## Related issues
None 

## Additional information
N/A

---------

Signed-off-by: Jess <jessica.jy.kong@gmail.com>
Signed-off-by: Jessica Kong <jessica.jy.kong@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description

`CountDistinct` allows users to compute the number of distinct values in
a column, similar to SQL's `COUNT(DISTINCT ...)`.
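A usage sketch, assuming the import path matches Ray Data's other aggregations in `ray.data.aggregate`:

```python
import ray
from ray.data.aggregate import CountDistinct  # assumed export location

ds = ray.data.from_items([{"x": 1}, {"x": 1}, {"x": 2}])
# Analogous to SQL's COUNT(DISTINCT x); the output key name is illustrative.
print(ds.aggregate(CountDistinct("x")))  # e.g. {'count_distinct(x)': 2}
```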

## Related issues

close ray-project#58252


---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
…ct#59942)

Updating to reflect an issue that I debugged recently.

Recommendation is to use `overlayfs` instead of the default `vfs` for
faster container startup.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ialization overhead (ray-project#59919)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Fix typos in docs and docstrings. If any are too trivial, just let me know.
Agent assisted

---------

Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
This was used early in the development of the Ray Dashboard and is not
used anymore, so we should remove it (I recently came across this).

---------

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
…7735)

There have been asks for enabling the --temp-dir flag on a per-node
basis, in contrast to the current implementation, which only allows
all nodes' temp dirs to be configured to the head node's temp dir
configuration.

This PR introduces the capability for the Ray temp directory
to be specified on a per-node basis, eliminating the restriction
that the --temp-dir flag can only be used in conjunction with the --head
flag. get_user_temp_dir and get_ray_temp_dir have been marked as
deprecated and replaced with the resolve_user_ray_temp_dir function
to ensure that the temp dir is consistent across the system.

## New Behaviors
**Temp dir**

|  | head node temp_dir NOT specified | head node temp_dir specified |
|---|---|---|
| worker node temp_dir NOT specified | Worker & head node use `/tmp/ray` | Worker uses head node's temp_dir |
| worker node temp_dir specified | Worker uses its own specified temp_dir. Head node uses default | Each node uses its own specified temp_dir |

**Object spilling directory**

| | head node spilling dir NOT specified | head node spilling dir specified |
|---|---|---|
| worker node spilling dir NOT specified | Each node uses its own temp_dir as spilling dir | Worker uses head node's spilling dir |
| worker node spilling dir specified | Worker uses its own specified spilling dir. Head node uses its temp_dir | Each node uses its own specified spilling dir |
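Both tables reduce to the same fallback chain. A minimal sketch of the resolution logic (the real `resolve_user_ray_temp_dir` signature may differ):

```python
def resolve_temp_dir(node_temp_dir, head_temp_dir, default="/tmp/ray"):
    # A node's own --temp-dir wins; otherwise inherit the head node's
    # setting; otherwise fall back to the default.
    return node_temp_dir or head_temp_dir or default


assert resolve_temp_dir(None, None) == "/tmp/ray"
assert resolve_temp_dir(None, "/data/ray") == "/data/ray"
assert resolve_temp_dir("/scratch/ray", "/data/ray") == "/scratch/ray"
```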

## Testing
We tested the expected behaviors on a local multi-node kuberay cluster
by verifying that:
1. nodes default to `/tmp/ray` when no node temp_dir is specified
2. non-head nodes picked up head node's temp_dir specifications when
only head node temp_dir was specified
3. non-head nodes can take independent temp_dir regardless of head node
temp_dir when specified
4. nodes default to their own temp dir as spilling directory for all
three cases above
5. nodes default to head node's spilling directory when only head node
spilling directory is specified
6. nodes can have their spilling directory specified independent of the
head node's spilling directory

Behaviors were verified by checking that the directories were created
and that the right information is fetched from the head node.

## Related issues

ray-project#47262
ray-project#51218
ray-project#40628
ray-project#32962

## Types of change

- [ ] Bug fix πŸ›
- [ ] New feature ✨
- [x] Enhancement πŸš€
- [ ] Code refactoring πŸ”§
- [ ] Documentation update πŸ“–
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
This PR should not introduce any breaking changes just yet. However,
this PR deprecates `get_user_temp_dir` and `get_ray_temp_dir`. The two
functions will be marked as errors in the next version update.

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Aydin-ab and others added 22 commits January 20, 2026 12:07
ray-project#59897)

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
…netes token authentication (ray-project#59621)

## Description

Per discussion from REP PR
(ray-project/enhancements#63), this PR adds a
server-side config `RAY_ENABLE_K8S_TOKEN_RBAC=true` to enable
Kubernetes-based token authentication. This must be set in addition to
`RAY_AUTH_MODE=token`. The main benefit of this change is that the
server-side authentication flow becomes opaque to clients, and all
clients only need to set `RAY_AUTH_MODE=token` along with their token.
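A sketch of the resulting configuration split (the environment variables are from this PR; how a client obtains and supplies its token is out of scope here):

```python
import os

# Server side (head node): both flags enable Kubernetes-token-based auth.
os.environ["RAY_AUTH_MODE"] = "token"
os.environ["RAY_ENABLE_K8S_TOKEN_RBAC"] = "true"

# Client side: only the auth mode plus the token itself; the Kubernetes
# verification flow stays opaque to the client.
os.environ["RAY_AUTH_MODE"] = "token"
```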

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…-project#60283)

## Summary
- Fix Ray Data's cluster autoscalers (V1 and V2) to respect
user-configured `resource_limits` set via `ExecutionOptions`
- Cap autoscaling resource requests to not exceed user-specified CPU and
GPU limits
- Update `get_total_resources()` to return the minimum of cluster
resources and user limits

## Why are these changes needed?

Previously, Ray Data's cluster autoscalers did not respect
user-configured resource limits. When a user set explicit limits like:

```python
import ray
# ExecutionResources lives under ray.data._internal in current Ray releases.
from ray.data._internal.execution.interfaces import ExecutionResources

ctx = ray.data.DataContext.get_current()
ctx.execution_options.resource_limits = ExecutionResources(cpu=8)
```

The autoscaler would ignore these limits and continue to request more
cluster resources from Ray's autoscaler, causing unnecessary node
upscaling even when the executor couldn't use the additional resources.

This was problematic because:
1. Users explicitly setting resource limits expect Ray Data to stay
within those bounds
2. Unnecessary cluster scaling wastes cloud resources and money
3. The `ResourceManager.get_global_limits()` already respects user
limits, but the autoscaler bypassed this by requesting resources
directly
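The fix boils down to taking a `min()` in two places; a hypothetical illustration, not the actual autoscaler code:

```python
def capped_resource_request(requested_cpu: float, user_limit_cpu: float) -> float:
    # Never ask Ray's autoscaler for more than the user-configured limit.
    return min(requested_cpu, user_limit_cpu)


def get_total_resources(cluster_cpu: float, user_limit_cpu: float) -> float:
    # Report the effective budget as min(cluster resources, user limits).
    return min(cluster_cpu, user_limit_cpu)


assert capped_resource_request(requested_cpu=32, user_limit_cpu=8) == 8
assert get_total_resources(cluster_cpu=64, user_limit_cpu=8) == 8
```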

## Test Plan

Added comprehensive unit tests for both autoscaler implementations

## Related issue number

Fixes ray-project#60085

## Checks
- [x] I've signed off every commit
- [x] I've run `scripts/format.sh` to lint the changes in this PR
- [x] I've included any doc changes needed
- [x] I've added any new tests if needed


---------

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
…t#60267)

It is always an instance of AnyscaleJobRunner.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…0278)

and saves the job ID in `_job_id`. This makes the information flow
clearer and simpler.

this is preparation for refactoring the job sdk usage.

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
per anyscale#727

Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
… from Ray Data (ray-project#60292)

## Description
Remove all top-level imports of `ray.data` from the `ray.train` module.
Imports needed only for type annotations should be guarded behind
`if TYPE_CHECKING:`. Imports needed at runtime should be moved inline (lazy
imports within functions/methods).
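A generic illustration of the two patterns (standard Python idioms):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers; no runtime import of ray.data.
    from ray.data import Dataset


def build_dataset(n: int) -> "Dataset":
    # Runtime use: import lazily, inside the function that needs it.
    import ray.data

    return ray.data.range(n)
```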

## Related issues
Fixes ray-project#60152.

---------

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
…project#60351)

ray-project#59631 changed the way the
`Dataset` representations look, but CI didn't test
`writing-code-snippet` in that PR's premerge CI. This PR fixes the
incorrect output.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ns' parameters for the 'serve' API" (ray-project#60355)

Reverts ray-project#56507

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: abrar <abrar@anyscale.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Upgrade jaxlib and jax from 0.4.13 -> 0.4.22 due to the version missing from
the PyPI [index](https://pypi.org/project/jaxlib/#history); the oldest
version still available is 0.4.17.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…project#60347)

To improve readability, this PR separates the `DefaultAutoscalerV2` into
distinct sections for input validation, getting default values, and
setting attributes.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…models (Issue ray-project#60100) (ray-project#60102)

## Description

See ray-project#60100

## Related issues

Fixes ray-project#60100 

## Additional information

None

---------

Signed-off-by: antoine_galataud <antoine@foobot.io>
…urce Cleanup and Process Termination (ray-project#60172)

## Description
### Summary

This PR improves the `destroy_module()` method in
`SubprocessModuleHandle` to fix race conditions, implement graceful
process termination, and ensure complete resource cleanup. The changes
prevent resource leaks and improve the reliability of module restart and
shutdown.

### Key Changes

1. **Cancel health check task first** to prevent race conditions
- Why Cancel Health Check Task First: See
ray-project#60214 (comment)
- Uses smart detection to avoid canceling the current task when called
internally
    - Prevents health check task from interfering with cleanup
2. **Ordered resource cleanup** with clear dependencies:
    - Cancel health check task first
    - Close parent connection
    - Terminate process gracefully, then forcefully if needed
    - Close HTTP client session
3. **Graceful process termination**:
- First attempts graceful termination with `terminate()` and 5-second
timeout
    - Falls back to force kill (`kill()`) only if necessary
    - All `join()` calls have timeouts to prevent infinite blocking
4. **Error handling**: Try-except blocks ensure cleanup continues even
if one step fails

### Modified Files

1. `python/ray/dashboard/subprocesses/handle.py`
    - Refactored `destroy_module()` with ordered resource cleanup
    - Implemented graceful process termination with timeout protection
    - Added smart health check task cancellation logic
      - **Smart Detection Logic:**
        ```python
        current_task = asyncio.current_task(loop=self.loop)
        if current_task is None or self.health_check_task is not current_task:
            self.health_check_task.cancel()  # Only cancel if not the current task
        ```
        This ensures:
- **External calls**: Immediately cancel health check task to prevent
interference
- **Internal calls**: Don't cancel current task, allowing cleanup and
restart to complete
    - Added comprehensive error handling and logging

2. `python/ray/dashboard/subprocesses/tests/test_e2e.py`    
- Added `test_destroy_module_cleans_up_resources()` to verify complete
resource cleanup
- Added mock classes (`_DummyConn`, `_DummyProcess`, `_DummySession`)
for isolated testing
- Added cleanup logic in `start_http_server_app()` to prevent resource
leaks between tests
## Related issues
Closes ray-project#60214

---------

Signed-off-by: daiping8 <dai.ping88@zte.com.cn>
Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
…t#59768)

## Description

Fix `uv_runtime_env_hook.py` to pin worker Python version to driver
version.

If the system python version is different from the driver python
version, you can end up with a mismatch between python versions (e.g.
driver 3.11 vs worker 3.12), which causes Ray to deliberately crash
elsewhere.

This change ensures compatibility between the Ray driver and the worker
by specifying the Python version, preventing version mismatches.
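A sketch of the core idea; the hook's real internals are not shown here, but `--python` is a real uv flag:

```python
import sys

# Pin workers to the driver's interpreter version so that, e.g., a 3.11
# driver never gets 3.12 workers from the system uv default.
driver_python = "{}.{}.{}".format(*sys.version_info[:3])
uv_cmd = ["uv", "run", "--python", driver_python]  # illustrative command shape
```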

## Related issues

Fixes ray-project#59639.

## Additional information

---------

Signed-off-by: David Hall <david.hall@openathena.ai>
…project#59922)

## Description
Make the Union operator non-blocking when `preserve_order` is enabled and
`_add_input_inner` is called with the input at the front.

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
## Why are these changes needed?

Fixes anyscale#758

Skip `test_no_spammy_errors_in_composed_app` and
`test_no_spammy_errors_in_grpc_proxy` on Windows due to:

- **Signal handling**: `test_no_spammy_errors_in_grpc_proxy` uses
`p.send_signal(signal.SIGINT)` which Windows doesn't support for
subprocesses (raises `ValueError: Unsupported signal: 2`)
- **Temp directory cleanup**: `test_no_spammy_errors_in_composed_app`
writes replica logs to a temp directory, and Windows doesn't release
file handles immediately after process termination, causing cleanup
failures (`NotADirectoryError: [WinError 267]`)

This follows the existing pattern from `test_logging.py:1216` which
skips similar tests on Windows:
```python
@pytest.mark.skipif(sys.platform == "win32", reason="Fail to look for temp dir.")
```

## Related issue number

anyscale#758

## Checks

- [x] I've signed all my commits
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Why are these changes needed?

The `test_router_queue_len_metric` test was flaky because the router
queue length gauge has a 100ms throttle
(`RAY_SERVE_ROUTER_QUEUE_LEN_GAUGE_THROTTLE_S`) that can skip updates
when they happen too quickly.

When replica initialization sets the gauge to 0 and a request
immediately updates it to 1, the second update may be throttled, causing
the test to see 0 instead of 1.
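The usual deflake for a throttled gauge is to poll instead of asserting once; a sketch using Ray's test helper, with a hypothetical metric getter:

```python
from ray._private.test_utils import wait_for_condition

# get_router_queue_len() is a hypothetical stand-in for reading the gauge.
wait_for_condition(
    lambda: get_router_queue_len() == 1,
    timeout=10,
    retry_interval_ms=100,
)
```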

## Related issue number

Fixes flaky test introduced in ray-project#59233 after ray-project#60139 added throttling.

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
## Description
After ray-project#60017 got merged, I forgot to update the `test_bundle_queue` test
suite. This PR adds more tests for `num_blocks`, `num_rows`,
`estimate_size_bytes`, and `len(queue)`.

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ject#60338)

## Description
This PR adds support for Google Cloud's 7th generation TPU (Ironwood).

The TPU 7x generation introduces a change in the accelerator type naming
convention reported by the environment. Unlike previous generations
(v6e-16, v5p-8, etc.), 7x instances report types starting with tpu (e.g.
tpu7x-16).

This PR accounts for the new format and enables Ray to detect the v7x
hardware automatically (users don't have to manually configure env
vars). This is critical for libraries like Ray Train and for vLLM
support, where the automatic device discovery is utilized during JAX
initialization.
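An illustration of the naming change; the helper and its logic are hypothetical:

```python
def tpu_generation(accel_type: str) -> str:
    # Hypothetical helper: previous generations look like "v6e-16" or
    # "v5p-8"; 7th-generation instances report e.g. "tpu7x-16".
    return accel_type.split("-")[0]


assert tpu_generation("v6e-16") == "v6e"
assert tpu_generation("tpu7x-16") == "tpu7x"
```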

## Related issues
Fixes ray-project#59964

## Additional information
For more info about TPU v7x:
https://docs.cloud.google.com/tpu/docs/tpu7x.

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
## Description

1. The flakiness in `test_flush_worker_result_queue`: when
`queue_backlog_length` is 0, after `wg._start()` we immediately call
`wg.poll_status()` and assert it has finished, but sometimes rank 0's
training thread is still running at that instant, leading to the error below:
```
where False = WorkerGroupPollStatus(worker_statuses={0: WorkerStatus(running=True, error=None, training_report=None), 1: WorkerStatus(running=False, error=None, training_report=None), 2: WorkerStatus(running=False, error=None, training_report=None), 3: WorkerStatus(running=False, error=None, training_report=None)}).finished
```
2. Use the same pattern as in `test_poll_status_finished` in the same file
to address this flakiness.
3. Increase the `test_placement_group_handle` test size to medium to avoid timeouts:
```

python/ray/train/v2/tests/test_placement_group_handle.py::test_slice_handle_shutdown -- Test timed out at 2026-01-20 18:12:46 UTC --
--
[2026-01-20T18:15:17Z] ERROR [100%]
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] ==================================== ERRORS ====================================
[2026-01-20T18:15:17Z] _________________ ERROR at setup of test_slice_handle_shutdown _________________
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     @pytest.fixture(autouse=True)
[2026-01-20T18:15:17Z]     def ray_start():
[2026-01-20T18:15:17Z] >       ray.init(num_cpus=4)
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] python/ray/train/v2/tests/test_placement_group_handle.py:16:
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/client_mode_hook.py:104: in wrapper
[2026-01-20T18:15:17Z]     return func(*args, **kwargs)
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1910: in init
[2026-01-20T18:15:17Z]     _global_node = ray._private.node.Node(
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/node.py:402: in __init__
[2026-01-20T18:15:17Z]     time.sleep(0.1)
[2026-01-20T18:15:17Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] signum = 15
[2026-01-20T18:15:17Z] frame = <frame at 0x55cf6cb749f0, file '/rayci/python/ray/_private/node.py', line 402, code __init__>
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z]     def sigterm_handler(signum, frame):
[2026-01-20T18:15:17Z] >       sys.exit(signum)
[2026-01-20T18:15:17Z] E       SystemExit: 15
[2026-01-20T18:15:17Z]
[2026-01-20T18:15:17Z] /rayci/python/ray/_private/worker.py:1670: SystemExit


```

4. Add a `manual` tag to the `test_jax_gpu` bazel target to temporarily
disable CI for this unit test, given that the PyPI jax version now requires
at least CUDA 12.2 while our CI runs on CUDA 12.1.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…d of the highest available version. (ray-project#60378)

Signed-off-by: irabbani <israbbani@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a large daily merge from master to main, encompassing a wide range of changes across the repository. The key themes include a major refactoring of the CI/CD system towards a more modular, wanda-based architecture, the deprecation of Python 3.9 in favor of Python 3.10 as the default, and extensive documentation updates. The CI changes are significant, introducing new build steps, rules, and automation scripts for better dependency management and build caching. Documentation has been improved with new examples, API reference updates, and internal design documents for new features like token authentication and port discovery. Several APIs have been updated for clarity and consistency. Overall, these changes represent a significant step forward in improving the project's build system, testing infrastructure, and documentation. My review found one minor issue related to an unused environment variable in a CI configuration file.

RAYCI_DISABLE_JAVA: "false"
RAYCI_WANDA_ALWAYS_REBUILD: "true"
JDK_SUFFIX: "-jdk"
ARCH_SUFFIX: "aarch64"


medium

The ARCH_SUFFIX environment variable is defined here but appears to be unused in the corresponding wanda file (ci/docker/manylinux-cibase.wanda.yaml) or the underlying Dockerfile. This can be confusing for future maintenance. Please consider removing this line if it's not needed.

@github-actions

github-actions bot commented Feb 5, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Feb 5, 2026
