Conversation

reneburghardt (Member) commented Jul 4, 2025

Still needs testing

gitnotebooks bot commented Jul 4, 2025

pollytur requested review from Copilot and pollytur July 4, 2025 10:11

Copilot AI left a comment

Pull Request Overview

Adds precision-scaled interpolation and higher-level data retrieval methods to Experiment, and refactors Dataset to use these methods.

  • Introduce interpolate_precision / scale_precision in Experiment for consistent time truncation.
  • Add get_data_for_interval and get_data_for_chunks to Experiment (see the usage sketch below).
  • Refactor datasets.py to delegate chunked interpolation to Experiment and remove duplicate time calculations.
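
A rough usage sketch of the two retrieval paths listed above. The device names, times, and the keyword arguments of get_data_for_interval are illustrative assumptions; get_data_for_chunks follows the signature shown later in this review and, per the comments below, returns a (data, timestamps) pair:

import numpy as np

from experanto.experiment import Experiment

# Sketch only: root folder, device names, and rates are placeholders.
# (modality_config omitted here; pass one as in the tests further down.)
experiment = Experiment(
    root_folder="path/to/experiment",
    interpolate_precision=5,  # times are scaled/truncated to this precision internally
)

# Interval-based retrieval (argument names assumed): data between two time points.
interval_data = experiment.get_data_for_interval(
    start_time=10.0,
    end_time=12.0,
    devices=["screen", "device_0"],
)

# Chunk-based retrieval: a fixed number of timesteps per device from a start time.
out, timestamps = experiment.get_data_for_chunks(
    start_time=10.0,
    chunk_sizes={"screen": 32, "device_0": 32},
    target_sampling_rates={"screen": 30.0, "device_0": 8.0},
    devices=["screen", "device_0"],
)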

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File / Description
experanto/experiment.py: Added precision scaling; new get_data_for_interval and get_data_for_chunks methods; updated signature of interpolate.
experanto/datasets.py: Updated __getitem__ to use Experiment.get_data_for_chunks and cleaned up redundant time-generation logic.

Comments suppressed due to low confidence (4)

experanto/experiment.py:83

  • The return type annotation -> tuple[np.ndarray, np.ndarray] does not match the method’s implementation or docstring, which returns a dict[str, np.ndarray]. Update the signature accordingly.
    def get_data_for_interval(

experanto/experiment.py:145

  • The return type -> tuple[np.ndarray, np.ndarray] conflicts with the actual return of two dictionaries (out, timestamps). Consider annotating -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]].
    def get_data_for_chunks(

experanto/experiment.py:105

  • Docstring for get_data_for_interval states it returns a dict but the signature says a tuple. Synchronize the documentation with the code behavior.
        dict[str, np.ndarray]

experanto/experiment.py:115

  • New methods get_data_for_interval and get_data_for_chunks should have unit tests covering edge cases such as missing sampling rates, single vs. multiple devices, and empty intervals.
        devices = self._resolve_devices(devices)
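
One of the edge cases mentioned in the last point could be covered roughly like this. This is a sketch that reuses the create_experiment and get_default_config test helpers visible in the snippets further down; the assumption that removing a device's "sampling_rate" from the config (with no target rate passed) triggers the documented ValueError is unverified:

import pytest

from experanto.experiment import Experiment


def test_get_data_for_chunks_missing_sampling_rate():
    # Sketch only: assumes the default config stores a per-device "sampling_rate" key.
    with create_experiment() as experiment_path:
        modality_config = get_default_config()
        modality_config["device_0"].pop("sampling_rate", None)
        experiment = Experiment(
            root_folder=experiment_path, modality_config=modality_config
        )
        # The docstring promises a ValueError when no sampling rate is available.
        with pytest.raises(ValueError):
            experiment.get_data_for_chunks(
                start_time=0.0, chunk_sizes=10, devices=["device_0"]
            )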

reneburghardt marked this pull request as ready for review October 15, 2025 10:55
@reneburghardt (Member Author)

@pollytur @schewskone ready for review from my side

@pollytur (Contributor)

@reneburghardt why have you decided to move chunks to the Experiment level? What is the high level motivation behind this decision? (I think I missed this discussion)

reneburghardt (Member Author) commented Oct 15, 2025

> @reneburghardt why have you decided to move chunks to the Experiment level? What is the high level motivation behind this decision? (I think I missed this discussion)

Getting data by chunk size and getting data by interval seemed to me to be operations on the same level; in other words, having the two options live in different places felt odd. They also share a fair bit of code, such as scaling for precision and resolving the sampling rate and offset.

How do you see that, @pollytur?

pollytur (Contributor) commented Dec 11, 2025

[follow-up from the meeting with Fabee] Let's keep interval querying at the Experiment level, but chunking and discretization should live only at the dataset level.

reneburghardt marked this pull request as draft December 31, 2025 08:49
Copilot AI review requested due to automatic review settings January 1, 2026 18:43

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.

Comment on lines +440 to +450
def test_resolve_chunk_sizes_with_dict():
    with create_experiment() as experiment_path:
        experiment = Experiment(
            root_folder=experiment_path, modality_config=get_default_config()
        )

        custom = {"device_0": 10.0, "device_1": 20.0}
        result = experiment._resolve_chunk_sizes(
            list(custom.keys()), chunk_sizes=custom
        )
        assert result == custom

Copilot AI Jan 1, 2026

The test for resolving chunk sizes with a dictionary (test_resolve_chunk_sizes_with_dict) passes float values (10.0, 20.0) in the dictionary, but the _resolve_chunk_sizes method doesn't convert dictionary values to integers. This means chunk_sizes could contain floats when integers are expected. The test should either verify that floats are converted to ints, or the test should use integer values to match the expected behavior.

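For comparison, the integer-valued variant the comment alludes to could look like this (a sketch; whether the dict branch should instead coerce floats is what a suggestion further down addresses):

def test_resolve_chunk_sizes_with_dict():
    with create_experiment() as experiment_path:
        experiment = Experiment(
            root_folder=experiment_path, modality_config=get_default_config()
        )

        # Integer values match the expectation that chunk sizes are numbers of timesteps.
        custom = {"device_0": 10, "device_1": 20}
        result = experiment._resolve_chunk_sizes(
            list(custom.keys()), chunk_sizes=custom
        )
        assert result == custom
        assert all(isinstance(v, int) for v in result.values())
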
Comment on lines +62 to +64
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)

Copilot AI Jan 1, 2026

The variable 'shifts' is only defined inside the 'if shifts_per_signal' block, but it's referenced in the return statement on line 69 outside of this block. When 'shifts_per_signal' is False, this will cause a NameError because 'shifts' will be undefined. Initialize 'shifts = None' before the conditional block to ensure it's always defined.

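The fix described above is a single initialization before the conditional; a minimal sketch in context (the elided middle of the function is unchanged):

shifts = None  # defined up front so the return expression cannot raise a NameError
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)

# ... memmap writing and meta.yml dump unchanged ...

return timestamps, data, shifts if shifts_per_signal else None

With shifts initialized to None, the trailing conditional could also be simplified to "return timestamps, data, shifts".
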
Comment on lines +632 to +637
# TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
if "screen" in out:
    if out["screen"].shape[-1] == 3:
        out["screen"] = out["screen"].permute(0, 3, 1, 2)
    if out["screen"].shape[0] == self.chunk_sizes["screen"]:
        out["screen"] = out["screen"].transpose(0, 1)

Copilot AI Jan 1, 2026

The screen-specific transformation logic (permute and transpose operations) has been moved outside the device loop. The condition on line 636 checks 'if out["screen"].shape[0] == self.chunk_sizes["screen"]' which determines whether to transpose. This logic assumes the chunk_size dimension is in position 0 after transforms, but this might not always be the case depending on the transforms applied. The previous code (removed) applied this logic immediately after transforms within the loop, which may have been more reliable. Verify that this reordering doesn't break the logic for different transform configurations.

Suggested change
-    # TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
-    if "screen" in out:
-        if out["screen"].shape[-1] == 3:
-            out["screen"] = out["screen"].permute(0, 3, 1, 2)
-        if out["screen"].shape[0] == self.chunk_sizes["screen"]:
-            out["screen"] = out["screen"].transpose(0, 1)
+    # TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
+    if device_name == "screen":
+        if out["screen"].shape[-1] == 3:
+            out["screen"] = out["screen"].permute(0, 3, 1, 2)
+        if out["screen"].shape[0] == self.chunk_sizes["screen"]:
+            out["screen"] = out["screen"].transpose(0, 1)

elif isinstance(chunk_sizes, (int, float)):
    return {d: int(chunk_sizes) for d in devices}
else:
    return chunk_sizes

Copilot AI Jan 1, 2026

The _resolve_chunk_sizes method converts float values to int on line 227, but chunk_sizes should be integers representing the number of samples. If a dictionary is passed with float values (as demonstrated in test_resolve_chunk_sizes_with_dict on line 446), those float values won't be converted to integers. For consistency, the method should also convert dictionary values to int, similar to how it handles the single int/float case.

Suggested change
-    return chunk_sizes
+    # Ensure all chunk sizes are integers, even when provided as a dict
+    return {device: int(size) for device, size in chunk_sizes.items()}

Comment on lines +148 to +162
"""
Retrieve interpolated data for a fixed number of timesteps (chunk size) per device.

Parameters
----------
chunk_sizes : int or dict[str, int], optional
Number of time steps to retrieve per device. If a single int is provided, it is used for all devices.
If a dictionary is provided, it should map device names to their respective number of timesteps.
If None, default chunk sizes defined in the configuration are used.
target_sampling_rates : float or dict[str, float], optional
Target sampling rate(s) in Hz. If a single float is provided, it is applied to all devices.
If a dictionary is provided, it should map device names to their respective sampling rates.
If None or a device is not specified in the dictionary, the default sampling rate from the modality config is used.
devices : str or list of str, optional
Devices to retrieve data for. If None, all available devices (`self.devices`) are used.

Copilot AI Jan 1, 2026

The docstring has incomplete or missing documentation for the 'start_time' parameter. The Parameters section jumps directly to documenting 'chunk_sizes' without first documenting 'start_time', which is the first parameter in the function signature. Add proper documentation for the 'start_time' parameter explaining that it represents the starting time point from which to retrieve data chunks.

Comment on lines +141 to +147
def get_data_for_chunks(
    self,
    start_time: float,
    chunk_sizes: Optional[Union[int, dict[str, int]]] = None,
    target_sampling_rates: Optional[Union[float, dict[str, float]]] = None,
    devices: Optional[Union[str, List[str]]] = None,
) -> dict[str, np.ndarray]:

Copilot AI Jan 1, 2026

The return type annotation for 'get_data_for_chunks' indicates it returns 'dict[str, np.ndarray]', but the function actually returns a tuple of two dictionaries: '(out, timestamps)' as seen on line 204. The correct return type should be 'tuple[dict[str, np.ndarray], dict[str, np.ndarray]]'.

Comment on lines +464 to +475
def test_get_device_offset_applies_scale_precision():
    with create_experiment() as experiment_path:
        modality_config = get_default_config()
        modality_config["device_0"]["offset"] = 0.1234
        experiment = Experiment(
            root_folder=experiment_path,
            modality_config=modality_config,
            interpolate_precision=3,
        )

        offset = experiment._get_device_offset("device_0")
        assert offset == 123

Copilot AI Jan 1, 2026

The test expects the offset to be scaled by the precision (0.1234 * 10^3 = 123.4, rounded to 123), but there's a potential issue: the test doesn't verify that this is the correct behavior, and the implementation in _get_device_offset (line 232-234) uses round() which could produce unexpected results for edge cases. Consider adding assertions that verify the rounding behavior matches expectations, especially for negative offsets and values near rounding boundaries.

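A sketch of the extra assertions the comment asks for, assuming the offset is computed as round(offset * 10 ** interpolate_precision), which is what the existing test implies; the expected values follow Python's round() semantics:

def test_get_device_offset_rounding_edge_cases():
    with create_experiment() as experiment_path:
        for offset, expected in [
            (-0.1234, -123),  # negative offsets scale and round the same way
            (0.0009996, 1),   # just below the next integer after scaling (0.9996 -> 1)
            (0.0001004, 0),   # close to zero after scaling (0.1004 -> 0)
        ]:
            modality_config = get_default_config()
            modality_config["device_0"]["offset"] = offset
            experiment = Experiment(
                root_folder=experiment_path,
                modality_config=modality_config,
                interpolate_precision=3,
            )
            assert experiment._get_device_offset("device_0") == expected
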
@@ -85,11 +85,8 @@ def __init__(
modality_config: dict = DEFAULT_MODALITY_CONFIG,
seed: Optional[int] = None,
safe_interval_threshold: float = 0.5,

Copilot AI Jan 1, 2026

The 'interpolate_precision' parameter was removed from ChunkDataset's init signature (previously on removed lines around line 87). However, Experiment now has this parameter with a default value of 5. If ChunkDataset users previously relied on setting a custom interpolate_precision, they can no longer do so through ChunkDataset. This is a breaking API change. Consider whether ChunkDataset should still accept and forward this parameter to Experiment for backward compatibility.

Suggested change
-    safe_interval_threshold: float = 0.5,
+    safe_interval_threshold: float = 0.5,
+    interpolate_precision: int = 5,

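If backward compatibility is wanted, the restored parameter would also need to be forwarded to Experiment. A sketch under the assumption that ChunkDataset constructs its Experiment inside __init__ (other existing arguments elided; DEFAULT_MODALITY_CONFIG is the constant already used in the current signature):

from typing import Optional

from experanto.experiment import Experiment


class ChunkDataset:
    def __init__(
        self,
        root_folder: str,
        modality_config: dict = DEFAULT_MODALITY_CONFIG,
        seed: Optional[int] = None,
        safe_interval_threshold: float = 0.5,
        interpolate_precision: int = 5,  # restored for backward compatibility
    ):
        # Forward the precision so existing callers keep their previous behavior.
        self._experiment = Experiment(
            root_folder=root_folder,
            modality_config=modality_config,
            interpolate_precision=interpolate_precision,
        )
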
Comment on lines 617 to +620
out = {}
timestamps = {}
s = self._valid_screen_times[idx]
out, timestamps = self._experiment.get_data_for_chunks(
    s,
    chunk_sizes=self.chunk_sizes,
    target_sampling_rates=self.sampling_rates,
    devices=self.device_names,
)

Copilot AI Jan 1, 2026

The variable name 'out' is used to store both data and timestamps initially (line 615), but then 'out' is reassigned to only contain data while 'timestamps' is a separate variable. This creates confusion about what 'out' represents at different points in the code. Consider using more descriptive names like 'data' instead of 'out' from the beginning, or avoiding the initial assignment pattern altogether by directly unpacking: 'data, timestamps = self._experiment.get_data_for_chunks(...)'.

Comment on lines +6 to +7
import yaml


Copilot AI Jan 1, 2026

Import of 'yaml' is not used.

Suggested change
-import yaml

Copilot AI review requested due to automatic review settings January 9, 2026 15:50

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment on lines +62 to +69
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)

fp = np.memmap(filename, dtype=data.dtype, mode="w+", shape=data.shape)
fp[:] = data[:]
fp.flush()  # Ensure data is written to disk
del fp
meta["dtype"] = str(data.dtype)
with open(sequence_root / "meta.yml", "w") as f:
    yaml.dump(meta, f)

if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(SEQUENCE_ROOT / "meta" / "phase_shifts.npy", shifts)
return timestamps, data, shifts if shifts_per_signal else None

Copilot AI Jan 9, 2026

The variable shifts is only defined when shifts_per_signal is True (line 63), but it's referenced in the ternary expression on line 69 (shifts if shifts_per_signal else None). In Python, even though the condition is evaluated first, all names in the expression must be defined. If shifts_per_signal is False, this will raise a NameError. Initialize shifts = None before line 62 to fix this issue.

-    def interpolate(self, times: slice, device=None) -> tuple[np.ndarray, np.ndarray]:
+    def interpolate(
+        self, times: slice, device: str = None
+    ) -> tuple[np.ndarray, np.ndarray]:

Copilot AI Jan 9, 2026

The return type annotation is incorrect. The method now returns only the interpolated values (either a dict or np.ndarray), not a tuple of (values, valid). The return type should be Union[dict[str, np.ndarray], np.ndarray] to match the actual return behavior.

Suggested change
-    ) -> tuple[np.ndarray, np.ndarray]:
+    ) -> Union[dict[str, np.ndarray], np.ndarray]:

Comment on lines +140 to +175
def get_data_for_chunks(
    self,
    start_time: float,
    chunk_sizes: Optional[Union[int, dict[str, int]]] = None,
    target_sampling_rates: Optional[Union[float, dict[str, float]]] = None,
    devices: Optional[Union[str, List[str]]] = None,
) -> dict[str, np.ndarray]:
    """
    Retrieve interpolated data for a fixed number of timesteps (chunk size) per device.

    Parameters
    ----------
    chunk_sizes : int or dict[str, int], optional
        Number of time steps to retrieve per device. If a single int is provided, it is used for all devices.
        If a dictionary is provided, it should map device names to their respective number of timesteps.
        If None, default chunk sizes defined in the configuration are used.
    target_sampling_rates : float or dict[str, float], optional
        Target sampling rate(s) in Hz. If a single float is provided, it is applied to all devices.
        If a dictionary is provided, it should map device names to their respective sampling rates.
        If None or a device is not specified in the dictionary, the default sampling rate from the modality config is used.
    devices : str or list of str, optional
        Devices to retrieve data for. If None, all available devices (`self.devices`) are used.

    Returns
    -------
    dict[str, np.ndarray]
        A dictionary mapping each device name to its corresponding interpolated data array
        of shape `(chunk_size, ...)`.

    Raises
    ------
    AssertionError
        If a specified device is not found in `self.devices`.
    ValueError
        If no sampling rate is specified or available for a device.
    """

Copilot AI Jan 9, 2026

The start_time parameter is missing from the documentation. While other parameters are documented, start_time (the first required parameter) is not described in the Parameters section.

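A candidate numpydoc entry for the missing parameter; the time unit is an assumption and should be stated to match whatever the interpolators actually use:

start_time : float
    Time point at which chunk retrieval starts, expressed in the experiment's
    time base (assumed to be seconds). `chunk_size` samples per device are
    returned starting from this point, at each device's target sampling rate.
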
if chunk_sizes is None:
    return {d: self.modality_config[d].get("chunk_size") for d in devices}
elif isinstance(chunk_sizes, (int, float)):
    return {d: int(chunk_sizes) for d in devices}

Copilot AI Jan 9, 2026

The _resolve_chunk_sizes method converts floats to ints, but chunk sizes should always be integers. Accepting floats as a valid type (in line 225) could lead to confusion. Consider removing float from the type check or add validation to ensure only integer values are provided.

Suggested change
-    return {d: int(chunk_sizes) for d in devices}
+    if isinstance(chunk_sizes, float) and not chunk_sizes.is_integer():
+        raise ValueError(
+            f"chunk_sizes must be an integer number of timesteps, got {chunk_sizes!r}"
+        )
+    resolved_size = int(chunk_sizes)
+    return {d: resolved_size for d in devices}

Comment on lines 612 to 613
out = {}
timestamps = {}

Copilot AI Jan 9, 2026

The variables out and timestamps are initialized as empty dictionaries on lines 612-613, but immediately overwritten on line 615 when unpacking the return value from get_data_for_chunks. The initialization is unnecessary and should be removed.

Suggested change
-out = {}
-timestamps = {}
