Querying by time and move data interpolating into experiment #85
Conversation
Pull Request Overview
Adds precision-scaled interpolation and higher-level data retrieval methods to Experiment, and refactors Dataset to use these methods.
- Introduce interpolate_precision → scale_precision in Experiment for consistent time truncation.
- Add get_data_for_interval and get_data_for_chunks to Experiment.
- Refactor datasets.py to delegate chunked interpolation to Experiment and remove duplicate time calculations.
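As context for the points below, here is a minimal usage sketch of the new chunk-based retrieval path, pieced together from the signatures and the ChunkDataset.__getitem__ call quoted later in this conversation; the import path, device names, and numeric values are illustrative assumptions rather than code from the PR:

# Hypothetical usage sketch (not from the PR): retrieve a fixed number of
# interpolated timesteps per device, starting at a given time.
from experanto.experiment import Experiment

experiment = Experiment(
    root_folder="path/to/experiment",    # assumed folder layout
    modality_config=my_modality_config,  # your modality configuration dict
)

# Returns one dict of data arrays and one dict of timestamps, keyed by device.
data, timestamps = experiment.get_data_for_chunks(
    start_time=12.5,
    chunk_sizes={"screen": 32, "responses": 32},
    target_sampling_rates={"screen": 30.0, "responses": 8.0},
    devices=["screen", "responses"],
)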
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| experanto/experiment.py | Added precision scaling; new get_data_for_interval and get_data_for_chunks methods; updated signature of interpolate. |
| experanto/datasets.py | Updated __getitem__ to use Experiment.get_data_for_chunks and cleaned up redundant time‐generation logic. |
Comments suppressed due to low confidence (4)
experanto/experiment.py:83
- The return type annotation -> tuple[np.ndarray, np.ndarray] does not match the method's implementation or docstring, which returns a dict[str, np.ndarray]. Update the signature accordingly.
def get_data_for_interval(
experanto/experiment.py:145
- The return type -> tuple[np.ndarray, np.ndarray] conflicts with the actual return of two dictionaries (out, timestamps). Consider annotating -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]].
def get_data_for_chunks(
experanto/experiment.py:105
- Docstring for get_data_for_interval states it returns a dict but the signature says a tuple. Synchronize the documentation with the code behavior.
dict[str, np.ndarray]
experanto/experiment.py:115
- New methods get_data_for_interval and get_data_for_chunks should have unit tests covering edge cases such as missing sampling rates, single vs. multiple devices, and empty intervals.
devices = self._resolve_devices(devices)
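To make this concrete, a sketch of two such edge-case tests, reusing the create_experiment and get_default_config helpers that appear in the test changes quoted below; the sampling-rate config key and the exact exception type are assumptions to be checked against the implementation:

import pytest


def test_get_data_for_chunks_single_device():
    # Passing a single device name as a string; assumes arrays come back with
    # shape (chunk_size, ...) as the docstring promises.
    with create_experiment() as experiment_path:
        experiment = Experiment(
            root_folder=experiment_path, modality_config=get_default_config()
        )
        data, timestamps = experiment.get_data_for_chunks(
            0.0, chunk_sizes=10, devices="device_0"
        )
        assert set(data) == {"device_0"}
        assert data["device_0"].shape[0] == 10


def test_get_data_for_chunks_missing_sampling_rate_raises():
    # Assumes the per-device rate lives under a "sampling_rate" key and that a
    # missing rate raises ValueError, as the docstring states.
    with create_experiment() as experiment_path:
        modality_config = get_default_config()
        modality_config["device_0"].pop("sampling_rate", None)
        experiment = Experiment(
            root_folder=experiment_path, modality_config=modality_config
        )
        with pytest.raises(ValueError):
            experiment.get_data_for_chunks(0.0, chunk_sizes=10, devices="device_0")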
@pollytur @schewskone ready for review from my side
@reneburghardt why have you decided to move chunks to the Experiment level? What is the high-level motivation behind this decision? (I think I missed this discussion)
Retrieving data by chunk size and retrieving it by interval seemed to me to be on the same level; in other words, having the two options in different places felt odd. They also share a fair bit of code, such as precision scaling and resolving the sampling rate and offset. How do you see it, @pollytur?
[Follow-up from the meeting with Fabee] Let's keep interval querying at the experiment level, but chunking and discretization should live only at the dataset level.
…anto into querying_by_time
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.
def test_resolve_chunk_sizes_with_dict():
    with create_experiment() as experiment_path:
        experiment = Experiment(
            root_folder=experiment_path, modality_config=get_default_config()
        )

        custom = {"device_0": 10.0, "device_1": 20.0}
        result = experiment._resolve_chunk_sizes(
            list(custom.keys()), chunk_sizes=custom
        )
        assert result == custom
Copilot AI (Jan 1, 2026)
The test for resolving chunk sizes with a dictionary (test_resolve_chunk_sizes_with_dict) passes float values (10.0, 20.0) in the dictionary, but the _resolve_chunk_sizes method doesn't convert dictionary values to integers. This means chunk_sizes could contain floats when integers are expected. The test should either verify that floats are converted to ints, or the test should use integer values to match the expected behavior.
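If the conversion-to-int behavior suggested further down in this review is adopted, a variant of the test could pin it down explicitly; this is a sketch only and assumes that fix is in place:

def test_resolve_chunk_sizes_with_dict_converts_floats():
    # Sketch: assumes _resolve_chunk_sizes casts dictionary values to int.
    with create_experiment() as experiment_path:
        experiment = Experiment(
            root_folder=experiment_path, modality_config=get_default_config()
        )
        custom = {"device_0": 10.0, "device_1": 20.0}
        result = experiment._resolve_chunk_sizes(
            list(custom.keys()), chunk_sizes=custom
        )
        assert result == {"device_0": 10, "device_1": 20}
        assert all(isinstance(v, int) for v in result.values())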
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)
Copilot AI (Jan 1, 2026)
The variable 'shifts' is only defined inside the 'if shifts_per_signal' block, but it's referenced in the return statement on line 69 outside of this block. The conditional expression there only evaluates 'shifts' when 'shifts_per_signal' is True, so it won't actually raise at runtime, but the name's existence silently depends on the flag and is easy to break in a refactor. Initialize 'shifts = None' before the conditional block to ensure it's always defined.
# TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
if "screen" in out:
    if out["screen"].shape[-1] == 3:
        out["screen"] = out["screen"].permute(0, 3, 1, 2)
    if out["screen"].shape[0] == self.chunk_sizes["screen"]:
        out["screen"] = out["screen"].transpose(0, 1)
Copilot AI (Jan 1, 2026)
The screen-specific transformation logic (permute and transpose operations) has been moved outside the device loop. The condition on line 636 checks 'if out["screen"].shape[0] == self.chunk_sizes["screen"]' which determines whether to transpose. This logic assumes the chunk_size dimension is in position 0 after transforms, but this might not always be the case depending on the transforms applied. The previous code (removed) applied this logic immediately after transforms within the loop, which may have been more reliable. Verify that this reordering doesn't break the logic for different transform configurations.
Suggested change:
-    # TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
-    if "screen" in out:
-        if out["screen"].shape[-1] == 3:
-            out["screen"] = out["screen"].permute(0, 3, 1, 2)
-        if out["screen"].shape[0] == self.chunk_sizes["screen"]:
-            out["screen"] = out["screen"].transpose(0, 1)
+    # TODO: find better convention for image, video, color, gray channels. This makes the monkey data same as mouse.
+    if device_name == "screen":
+        if out["screen"].shape[-1] == 3:
+            out["screen"] = out["screen"].permute(0, 3, 1, 2)
+        if out["screen"].shape[0] == self.chunk_sizes["screen"]:
+            out["screen"] = out["screen"].transpose(0, 1)
elif isinstance(chunk_sizes, (int, float)):
    return {d: int(chunk_sizes) for d in devices}
else:
    return chunk_sizes
Copilot AI (Jan 1, 2026)
The _resolve_chunk_sizes method converts float values to int on line 227, but chunk_sizes should be integers representing the number of samples. If a dictionary is passed with float values (as demonstrated in test_resolve_chunk_sizes_with_dict on line 446), those float values won't be converted to integers. For consistency, the method should also convert dictionary values to int, similar to how it handles the single int/float case.
Suggested change:
-    return chunk_sizes
+    # Ensure all chunk sizes are integers, even when provided as a dict
+    return {device: int(size) for device, size in chunk_sizes.items()}
"""
Retrieve interpolated data for a fixed number of timesteps (chunk size) per device.

Parameters
----------
chunk_sizes : int or dict[str, int], optional
    Number of time steps to retrieve per device. If a single int is provided, it is used for all devices.
    If a dictionary is provided, it should map device names to their respective number of timesteps.
    If None, default chunk sizes defined in the configuration are used.
target_sampling_rates : float or dict[str, float], optional
    Target sampling rate(s) in Hz. If a single float is provided, it is applied to all devices.
    If a dictionary is provided, it should map device names to their respective sampling rates.
    If None or a device is not specified in the dictionary, the default sampling rate from the modality config is used.
devices : str or list of str, optional
    Devices to retrieve data for. If None, all available devices (`self.devices`) are used.
Copilot AI (Jan 1, 2026)
The docstring has incomplete or missing documentation for the 'start_time' parameter. The Parameters section jumps directly to documenting 'chunk_sizes' without first documenting 'start_time', which is the first parameter in the function signature. Add proper documentation for the 'start_time' parameter explaining that it represents the starting time point from which to retrieve data chunks.
def get_data_for_chunks(
    self,
    start_time: float,
    chunk_sizes: Optional[Union[int, dict[str, int]]] = None,
    target_sampling_rates: Optional[Union[float, dict[str, float]]] = None,
    devices: Optional[Union[str, List[str]]] = None,
) -> dict[str, np.ndarray]:
Copilot AI (Jan 1, 2026)
The return type annotation for 'get_data_for_chunks' indicates it returns 'dict[str, np.ndarray]', but the function actually returns a tuple of two dictionaries: '(out, timestamps)' as seen on line 204. The correct return type should be 'tuple[dict[str, np.ndarray], dict[str, np.ndarray]]'.
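Concretely, the annotation fix the comment asks for would look like this (sketch; body elided):

def get_data_for_chunks(
    self,
    start_time: float,
    chunk_sizes: Optional[Union[int, dict[str, int]]] = None,
    target_sampling_rates: Optional[Union[float, dict[str, float]]] = None,
    devices: Optional[Union[str, List[str]]] = None,
) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
    ...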
def test_get_device_offset_applies_scale_precision():
    with create_experiment() as experiment_path:
        modality_config = get_default_config()
        modality_config["device_0"]["offset"] = 0.1234
        experiment = Experiment(
            root_folder=experiment_path,
            modality_config=modality_config,
            interpolate_precision=3,
        )

        offset = experiment._get_device_offset("device_0")
        assert offset == 123
Copilot AI (Jan 1, 2026)
The test expects the offset to be scaled by the precision (0.1234 * 10^3 = 123.4, rounded to 123), but there's a potential issue: the test doesn't verify that this is the correct behavior, and the implementation in _get_device_offset (line 232-234) uses round() which could produce unexpected results for edge cases. Consider adding assertions that verify the rounding behavior matches expectations, especially for negative offsets and values near rounding boundaries.
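A sketch of the kind of additional assertion meant here; whether scale_precision should follow Python's round() semantics for negative and tie values is an assumption that would need confirming against the implementation:

def test_get_device_offset_scale_precision_negative_offset():
    # Sketch: expected value assumes the offset is scaled with round(), so
    # -0.1234 * 10**3 = -123.4 rounds to -123.
    with create_experiment() as experiment_path:
        modality_config = get_default_config()
        modality_config["device_0"]["offset"] = -0.1234
        experiment = Experiment(
            root_folder=experiment_path,
            modality_config=modality_config,
            interpolate_precision=3,
        )
        assert experiment._get_device_offset("device_0") == -123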
@@ -85,11 +85,8 @@ def __init__(
        modality_config: dict = DEFAULT_MODALITY_CONFIG,
        seed: Optional[int] = None,
        safe_interval_threshold: float = 0.5,
Copilot AI (Jan 1, 2026)
The 'interpolate_precision' parameter was removed from ChunkDataset's init signature (previously on removed lines around line 87). However, Experiment now has this parameter with a default value of 5. If ChunkDataset users previously relied on setting a custom interpolate_precision, they can no longer do so through ChunkDataset. This is a breaking API change. Consider whether ChunkDataset should still accept and forward this parameter to Experiment for backward compatibility.
Suggested change:
-    safe_interval_threshold: float = 0.5,
+    safe_interval_threshold: float = 0.5,
+    interpolate_precision: int = 5,
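If backward compatibility is wanted, the forwarding could look roughly like this; it assumes ChunkDataset constructs its Experiment internally and stores it on self._experiment (as the __getitem__ snippet below suggests), so everything except interpolate_precision itself is illustrative:

# Sketch of forwarding the parameter from ChunkDataset to Experiment.
def __init__(
    self,
    root_folder: str,
    modality_config: dict = DEFAULT_MODALITY_CONFIG,
    seed: Optional[int] = None,
    safe_interval_threshold: float = 0.5,
    interpolate_precision: int = 5,
) -> None:
    ...
    self._experiment = Experiment(
        root_folder=root_folder,
        modality_config=modality_config,
        interpolate_precision=interpolate_precision,  # forwarded explicitly
    )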
out = {}
timestamps = {}
s = self._valid_screen_times[idx]
out, timestamps = self._experiment.get_data_for_chunks(
    s,
    chunk_sizes=self.chunk_sizes,
    target_sampling_rates=self.sampling_rates,
    devices=self.device_names,
)
Copilot AI (Jan 1, 2026)
The variable name 'out' is used to store both data and timestamps initially (line 615), but then 'out' is reassigned to only contain data while 'timestamps' is a separate variable. This creates confusion about what 'out' represents at different points in the code. Consider using more descriptive names like 'data' instead of 'out' from the beginning, or avoiding the initial assignment pattern altogether by directly unpacking: 'data, timestamps = self._experiment.get_data_for_chunks(...)'.
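A minimal version of the direct-unpacking pattern suggested above; the name data is the reviewer's suggestion, not a name used in the PR:

# Sketch: unpack directly, with no placeholder dicts and no reuse of "out".
s = self._valid_screen_times[idx]
data, timestamps = self._experiment.get_data_for_chunks(
    s,
    chunk_sizes=self.chunk_sizes,
    target_sampling_rates=self.sampling_rates,
    devices=self.device_names,
)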
import yaml
Copilot AI (Jan 1, 2026)
Import of 'yaml' is not used.
Suggested change:
-import yaml
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)

fp = np.memmap(filename, dtype=data.dtype, mode="w+", shape=data.shape)
fp[:] = data[:]
fp.flush()  # Ensure data is written to disk
del fp
meta["dtype"] = str(data.dtype)
with open(sequence_root / "meta.yml", "w") as f:
    yaml.dump(meta, f)

if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(SEQUENCE_ROOT / "meta" / "phase_shifts.npy", shifts)
return timestamps, data, shifts if shifts_per_signal else None
Copilot AI (Jan 9, 2026)
The variable shifts is only defined when shifts_per_signal is True (line 63), but it's referenced in the ternary expression on line 69 (shifts if shifts_per_signal else None). Because a conditional expression only evaluates the selected branch, this does not actually raise a NameError when shifts_per_signal is False, but the name's existence then silently depends on the flag, which is fragile. Initialize shifts = None before line 62 so the returned value is always explicit.
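A minimal rewrite along those lines (sketch; only the lines quoted above change, the memmap writing and meta.yml dump stay as they are):

# Sketch: define shifts unconditionally so the returned value is explicit and
# the trailing ternary can be dropped.
shifts = None
if shifts_per_signal:
    shifts = np.random.rand(n_signals) / meta["sampling_rate"] * 0.9
    np.save(sequence_root / "meta" / "phase_shifts.npy", shifts)
# ... memmap writing and meta.yml dump unchanged ...
return timestamps, data, shifts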
-def interpolate(self, times: slice, device=None) -> tuple[np.ndarray, np.ndarray]:
+def interpolate(
+    self, times: slice, device: str = None
+) -> tuple[np.ndarray, np.ndarray]:
Copilot AI (Jan 9, 2026)
The return type annotation is incorrect. The method now returns only the interpolated values (either a dict or np.ndarray), not a tuple of (values, valid). The return type should be Union[dict[str, np.ndarray], np.ndarray] to match the actual return behavior.
Suggested change:
-    ) -> tuple[np.ndarray, np.ndarray]:
+    ) -> Union[dict[str, np.ndarray], np.ndarray]:
def get_data_for_chunks(
    self,
    start_time: float,
    chunk_sizes: Optional[Union[int, dict[str, int]]] = None,
    target_sampling_rates: Optional[Union[float, dict[str, float]]] = None,
    devices: Optional[Union[str, List[str]]] = None,
) -> dict[str, np.ndarray]:
    """
    Retrieve interpolated data for a fixed number of timesteps (chunk size) per device.

    Parameters
    ----------
    chunk_sizes : int or dict[str, int], optional
        Number of time steps to retrieve per device. If a single int is provided, it is used for all devices.
        If a dictionary is provided, it should map device names to their respective number of timesteps.
        If None, default chunk sizes defined in the configuration are used.
    target_sampling_rates : float or dict[str, float], optional
        Target sampling rate(s) in Hz. If a single float is provided, it is applied to all devices.
        If a dictionary is provided, it should map device names to their respective sampling rates.
        If None or a device is not specified in the dictionary, the default sampling rate from the modality config is used.
    devices : str or list of str, optional
        Devices to retrieve data for. If None, all available devices (`self.devices`) are used.

    Returns
    -------
    dict[str, np.ndarray]
        A dictionary mapping each device name to its corresponding interpolated data array
        of shape `(chunk_size, ...)`.

    Raises
    ------
    AssertionError
        If a specified device is not found in `self.devices`.
    ValueError
        If no sampling rate is specified or available for a device.
    """
Copilot AI (Jan 9, 2026)
The start_time parameter is missing from the documentation. While other parameters are documented, start_time (the first required parameter) is not described in the Parameters section.
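A possible wording for the missing entry; it assumes start_time is expressed in the same units as the experiment's timestamps (seconds), which should be confirmed against the implementation:

Parameters
----------
start_time : float
    Time point (in the experiment's time base, assumed to be seconds) at which
    retrieval starts; each device's chunk is interpolated forward from this
    point for its resolved chunk_size timesteps.
chunk_sizes : int or dict[str, int], optional
    ...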
if chunk_sizes is None:
    return {d: self.modality_config[d].get("chunk_size") for d in devices}
elif isinstance(chunk_sizes, (int, float)):
    return {d: int(chunk_sizes) for d in devices}
Copilot AI (Jan 9, 2026)
The _resolve_chunk_sizes method converts floats to ints, but chunk sizes should always be integers. Accepting floats as a valid type (in line 225) could lead to confusion. Consider removing float from the type check or add validation to ensure only integer values are provided.
Suggested change:
-    return {d: int(chunk_sizes) for d in devices}
+    if isinstance(chunk_sizes, float) and not chunk_sizes.is_integer():
+        raise ValueError(
+            f"chunk_sizes must be an integer number of timesteps, got {chunk_sizes!r}"
+        )
+    resolved_size = int(chunk_sizes)
+    return {d: resolved_size for d in devices}
out = {}
timestamps = {}
Copilot AI (Jan 9, 2026)
The variables out and timestamps are initialized as empty dictionaries on lines 612-613, but immediately overwritten on line 615 when unpacking the return value from get_data_for_chunks. The initialization is unnecessary and should be removed.
Suggested change:
-    out = {}
-    timestamps = {}
Still needs testing