Newdata agent#192

Merged
koldunovn merged 16 commits into main from newdata_agent on Feb 3, 2026
@kuivi kuivi commented Feb 2, 2026

Adds a new data-analysis pipeline, ERA5 tooling, and a new DestinE climate data source, plus sandboxed analysis outputs and UI/CLI wiring.

Summary
This PR introduces a major refactoring of the agent architecture, separating the smart agent into focused information-gathering and data-analysis components, while adding comprehensive ERA5 observational data integration and a new high-resolution DestinE climate data provider.

Changes
New Data Analysis Agent & Workflow Integration

  • Add data_analysis_agent.py - a new agent that performs quantitative climate analysis with visualizations, using ERA5 observations as ground truth baseline
  • Integrate data analysis agent into the LangGraph workflow, running after parallel info-gathering agents and before the combine step
  • Add agent_helpers.py with standardized agent executor creation
  • Extend AgentState in climsight_classes.py with new fields for sandbox paths, ERA5 responses, and analysis outputs
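
The ordering described above (parallel info-gathering agents, then data analysis, then combine) can be sketched with the standard library alone. This is only an illustration of the fan-out/fan-in shape, not the PR's LangGraph wiring: `other_info_agent`, `run_workflow`, the `final_answer` key, and the agent bodies are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: each agent takes the state and returns a partial update.
def smart_agent(state):
    return {"smart_agent_response": f"background for: {state['user']}"}

def other_info_agent(state):
    return {"other_info_response": f"extra context for: {state['user']}"}

def data_analysis_agent(state):
    # Runs only after every info-gathering agent has finished.
    gathered = sorted(k for k in state if k.endswith("_response"))
    return {"data_analysis_response": f"analysis using {', '.join(gathered)}"}

def combine_agent(state):
    return {"final_answer": f"{state['smart_agent_response']} | {state['data_analysis_response']}"}

def run_workflow(state):
    info_agents = [smart_agent, other_info_agent]
    # Fan out: info-gathering agents run concurrently on the same input state.
    with ThreadPoolExecutor(max_workers=len(info_agents)) as pool:
        updates = list(pool.map(lambda agent: agent(state), info_agents))
    # Fan in: merge the partial updates before the sequential steps.
    for update in updates:
        state = {**state, **update}
    state = {**state, **data_analysis_agent(state)}
    state = {**state, **combine_agent(state)}
    return state
```

Because each agent returns a partial update rather than mutating shared state, the merge step is the only place where results meet, which keeps the parallel stage free of locking.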

Sandbox & Session Management

  • Add sandbox_utils.py with per-session sandbox directories for isolated analysis outputs
  • Implement thread-safe session ID management across Streamlit, CLI, and background workers
  • Surface generated plots in Streamlit UI with proper path handling
  • Harden stream_handler.py for parallel agent execution (NoSessionContext handling)
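
A minimal sketch of per-session sandbox directories with a lock-protected session ID registry. The directory layout (`sandbox/<session_id>/...`) and the function names are assumptions for illustration; only the `uuid_main_dir`, `results_dir`, `climate_data_dir`, and `era5_data_dir` keys mirror field names mentioned in this PR.

```python
import threading
import uuid
from pathlib import Path

_session_lock = threading.Lock()
_session_ids = {}  # maps a caller-chosen key (e.g. a UI session) to a session ID

def get_session_id(key):
    """Return a stable session ID for `key`, creating one on first use."""
    with _session_lock:
        if key not in _session_ids:
            _session_ids[key] = uuid.uuid4().hex
        return _session_ids[key]

def ensure_sandbox(base_dir, session_id):
    """Create the sandbox tree for one session and return its paths."""
    root = Path(base_dir) / "sandbox" / session_id  # assumed layout
    paths = {"uuid_main_dir": root}
    for name in ("results", "climate_data", "era5_data"):
        paths[f"{name}_dir"] = root / name
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_sandbox("tmp", get_session_id("cli"))` twice is harmless, since `mkdir(..., exist_ok=True)` makes the setup idempotent.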

ERA5 Tooling

  • Add era5_climatology_tool.py - extract pre-computed ERA5 2015-2025 monthly climatology as observational baseline
  • Add era5_retrieval_tool.py - retrieve ERA5 time series from Earthmover/Arraylake with secure API key passing (no environment variable overwrite)
  • Support both factory-function pattern (bound API key) and environment variable fallback
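
The factory-function pattern with an environment fallback might look roughly like this. It is a sketch, not the code in era5_retrieval_tool.py: the function names, the `ARRAYLAKE_API_KEY` variable name, and the returned dictionary are assumptions, and no Arraylake call is made.

```python
import os

def create_era5_retrieval_tool(arraylake_api_key=None):
    """Return a retrieval callable with the API key captured in a closure."""
    def retrieve(variable, lat, lon):
        # Prefer the bound key; only read the environment, never write to it.
        # "ARRAYLAKE_API_KEY" is an assumed variable name.
        key = arraylake_api_key or os.environ.get("ARRAYLAKE_API_KEY")
        if not key:
            return {"error": "No Arraylake API key available"}
        # A real tool would open the dataset here; this sketch echoes the request.
        return {"variable": variable, "lat": lat, "lon": lon, "authenticated": True}
    return retrieve
```

Binding the key in a closure keeps one user's key out of process-wide state, which matters when several Streamlit sessions share a single Python process.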

New Analysis Tools

  • Add get_data_components.py - extract climate model variables for non-REPL workflows
  • Add visualization_tools.py - file listing and wise-agent guidance tools
  • Add reflection_tools.py - image reflection for plot quality verification
  • Add utils.py with JSON serialization helpers

Python REPL Upgrade

  • Migrate from subprocess-based REPL to persistent Jupyter kernel executor
  • Implement proper kernel lifecycle management with session isolation
  • Add automatic package installation and matplotlib backend configuration
  • Support relative sandbox paths for consistent file access
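
One way relative sandbox paths could be resolved is shown below. The helper name is hypothetical, and the check that rejects paths escaping the sandbox is an added assumption, not something this PR states.

```python
from pathlib import Path

def resolve_in_sandbox(sandbox_path, user_path):
    """Resolve `user_path` against the sandbox root; reject paths outside it."""
    sandbox = Path(sandbox_path).resolve()
    candidate = Path(user_path)
    if not candidate.is_absolute():
        # Relative paths are interpreted relative to the sandbox, not the CWD.
        candidate = sandbox / candidate
    candidate = candidate.resolve()
    if candidate != sandbox and sandbox not in candidate.parents:
        raise ValueError(f"{user_path!r} is outside the sandbox")
    return candidate
```

With this shape, generated code can refer to `results/plot.png` regardless of the working directory of the kernel or the UI process.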

DestinE Climate Data Provider

  • Add DestinEProvider in climate_data_providers.py for IFS-FESOM high-resolution simulations
  • Handle unstructured grid (12.5M points) with cKDTree spatial indexing and IDW interpolation
  • Support 4 time periods: 1990-2014 (historical), 2015-2019, 2020-2029, 2040-2049 (SSP3-7.0)
  • Unit conversions: K→°C for temperature, kg m⁻² s⁻¹→mm/month for precipitation
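
The unit conversions and the inverse-distance weighting listed above can be sketched in plain Python. The neighbor search (cKDTree in the PR) is left out here; `idw` takes already-found `(distance, value)` pairs, and the exponent `power=2.0` is an assumption, as are all function names.

```python
SECONDS_PER_DAY = 86_400

def kelvin_to_celsius(t_kelvin):
    return t_kelvin - 273.15

def precip_flux_to_mm_per_month(flux, days_in_month):
    # 1 kg of water over 1 m² forms a 1 mm layer, so kg m⁻² s⁻¹ equals mm/s.
    return flux * SECONDS_PER_DAY * days_in_month

def idw(neighbors, power=2.0):
    """Inverse-distance-weighted mean of (distance, value) pairs."""
    weighted = []
    for distance, value in neighbors:
        if distance == 0.0:
            return value  # query point coincides with a grid point
        weighted.append((1.0 / distance ** power, value))
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total
```

For example, two neighbors at distances 1 and 2 with values 10 and 20 get weights 1 and 0.25, so the nearer point dominates the estimate.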

Smart Agent Refactoring

  • Slim down smart_agent.py to focus on information gathering only (Wikipedia, RAG, ECOCROP)
  • Remove data extraction and visualization responsibilities (moved to data_analysis_agent)
  • Clean up stale prompt sections and tool definitions

Configuration & Dependencies

  • Add config toggles: use_era5_data, use_powerful_data_analysis, era5_climatology.enabled
  • Add llm_dataanalysis section with optional filter step
  • Add DestinE provider configuration with time periods and variable mappings
  • Fix longitude normalization in NextGEMS HEALPix provider and climate selectors
  • Update dependencies: add arraylake, python-dotenv, numpy; organize into sections
  • Add tmp/ to .gitignore for sandbox directories

UI/CLI Updates

  • Add Arraylake API key input field in Streamlit (secure variable passing, not environment)
  • Add climate data source selector with DestinE option
  • Update terminal_interface.py with sandbox path integration

Diff Summary

28 files changed, 3601 insertions(+), 655 deletions(-)
New files (11):

  • src/climsight/agent_helpers.py
  • src/climsight/config.py
  • src/climsight/data_analysis_agent.py
  • src/climsight/sandbox_utils.py
  • src/climsight/utils.py
  • src/climsight/tools/era5_climatology_tool.py
  • src/climsight/tools/era5_retrieval_tool.py
  • src/climsight/tools/get_data_components.py
  • src/climsight/tools/package_tools.py
  • src/climsight/tools/reflection_tools.py
  • src/climsight/tools/visualization_tools.py

Modified files (17):

  • Core: climsight_engine.py, climsight_classes.py, smart_agent.py
  • Providers: climate_data_providers.py, climate_functions.py
  • Tools: python_repl.py, image_viewer.py, tools/__init__.py
  • UI: streamlit_interface.py, terminal_interface.py, stream_handler.py
  • Config: config.yml, geo_functions.py
  • Build: requirements.txt, pyproject.toml, environment.yml, .gitignore

Authors: @dmpantiu, @kuivi
Co-authored-by: @dmpantiu
Co-authored-by: @kuivi

p.s.

Optional: download DestinE data (large ~12 GB, not downloaded by default)

python download_data.py DestinE

kuivi and others added 14 commits January 16, 2026 18:10
…d data analysis

This commit restructures the multi-agent workflow to improve modularity and enable
true parallel execution of information gathering agents.

Key Changes:
- Split smart_agent responsibilities:
  * smart_agent: Info gathering only (Wikipedia, RAG, ECOCROP)
  * data_analysis_agent: Data extraction & visualization (stub for now)

- Updated routing logic in climsight_engine.py:
  * smart_agent now runs in true parallel with other agents (not triggered from data_agent)
  * Removed route_fromdata() function
  * All parallel agents → data_analysis_agent → combine_agent

- Fixed NoSessionContext errors:
  * Added exception handling in stream_handler.py
  * Added exception handling in streamlit_interface.py
  * Parallel agents can now safely call update_progress() from worker threads

- Added data_analysis_response field to AgentState

Files modified: 5
Files added: 1
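
The NoSessionContext handling described in this commit can be illustrated as follows. `NoSessionContext` here is a local stand-in class so the sketch runs without Streamlit; `make_safe_progress`, `ui_update`, and `run_in_worker` are hypothetical names, and the real handlers in stream_handler.py may differ.

```python
import logging
import threading

logger = logging.getLogger(__name__)

class NoSessionContext(Exception):
    """Local stand-in for the Streamlit error of the same name."""

def make_safe_progress(update_fn):
    """Wrap `update_fn` so a missing session context is logged, not raised."""
    def safe_update(message):
        try:
            update_fn(message)
            return True
        except NoSessionContext:
            logger.debug("Progress update skipped (no session context): %s", message)
            return False
    return safe_update

def ui_update(message):
    # Simulates a UI callback that only works on the main thread.
    if threading.current_thread() is not threading.main_thread():
        raise NoSessionContext(message)

def run_in_worker(fn, *args):
    """Run `fn` in a worker thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn(*args)))
    worker.start()
    worker.join()
    return result[0]
```

Here `run_in_worker(make_safe_progress(ui_update), "step 1")` returns `False` instead of crashing the worker, so a parallel agent keeps running even when its progress message cannot be displayed.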
- Removed prompt sections referencing get_data_components and python_repl
- Removed get_data_components tool definition (244 lines)
- Removed tool output processing for get_data_components and python_repl
- Prompt now correctly matches available tools (wikipedia_search, RAG_search, ECOCROP_search only)

This fixes the mismatch where the prompt instructed the model to use tools
that were no longer exposed in the agent's tool list.
…l improvements

Major changes:
- Implement full data_analysis_agent with dynamic tool prompt based on config
- Make ERA5 mandatory when enabled - use as ground truth for climate model validation
- Remove redundant get_data_components when Python_REPL is enabled
- Preserve user query in analysis_brief with "USER QUESTION:" header
- Add configurable filter step via use_filter_step in config
- Accumulate multiple Wikipedia/RAG results in smart_agent (was overwriting)
- Add Wikipedia call limit (10 max) to prevent excessive API calls
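
Accumulating results and capping Wikipedia calls, as listed above, could be combined in one wrapper. This is a sketch under assumptions: `make_wikipedia_tool` and `search_fn` are hypothetical, and only the limit of 10 and the `wikipedia_tool_response` list come from this commit message.

```python
MAX_WIKIPEDIA_CALLS = 10  # limit stated in the commit message

def make_wikipedia_tool(search_fn, state):
    """Wrap `search_fn` so results accumulate in `state` and calls are capped."""
    calls = 0

    def wikipedia_search(query):
        nonlocal calls
        if calls >= MAX_WIKIPEDIA_CALLS:
            # Tell the model to stop searching rather than raising an error.
            return "Wikipedia call limit reached; use the results already collected."
        calls += 1
        result = search_fn(query)
        # Append instead of overwrite, so earlier results are preserved.
        state.setdefault("wikipedia_tool_response", []).append(result)
        return result

    return wikipedia_search
```

Returning a plain message once the cap is hit lets the agent loop finish normally with whatever it has gathered.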

New files:
- agent_helpers.py: Helper utilities for tool-based agents
- sandbox_utils.py: Sandbox directory management
- config.py: Configuration utilities
- utils.py: Logging and history utilities
- tools/era5_retrieval_tool.py: ERA5 data download tool
- tools/get_data_components.py: Climate data extraction tool
- tools/visualization_tools.py: File listing and helper tools
- tools/reflection_tools.py: Agent reflection tool
- tools/package_tools.py: Package installation tool

Type changes in AgentState:
- wikipedia_tool_response: str -> list (accumulate multiple results)
- rag_search_response: str -> list (accumulate multiple results)
- Added: data_analysis_prompt_text, data_analysis_images, thread_id,
  uuid_main_dir, results_dir, climate_data_dir, era5_data_dir
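
For orientation, the state changes listed above might be typed roughly as below. This uses a `TypedDict` purely for illustration; the actual `AgentState` class in climsight_classes.py may be defined differently, the types of the added fields are guesses, and `migrate_response` is a hypothetical helper showing how a legacy `str` value could be coerced to the new list form.

```python
from typing import List, TypedDict

class AgentStateSketch(TypedDict, total=False):
    # Changed from str to list so multiple results accumulate.
    wikipedia_tool_response: List[str]
    rag_search_response: List[str]
    # Added fields (types are assumptions).
    data_analysis_response: str
    data_analysis_prompt_text: str
    data_analysis_images: List[str]
    thread_id: str
    uuid_main_dir: str
    results_dir: str
    climate_data_dir: str
    era5_data_dir: str

def migrate_response(value):
    """Coerce a legacy str response into the new list form."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```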

Python REPL improvements:
- Rewrite to use JupyterKernelExecutor (PangaeaGPT pattern)
- Auto-load datasets from sandbox paths
- Better plot detection and results directory handling
- Fix NextGEMSProvider: normalize negative longitudes to 0-360 range
  before KDTree query (fixes wrong temperatures for western hemisphere)
- Remove "Provide additional information" toggle, always show extra info
- Minor config and gitignore updates
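
The longitude fix in this commit comes down to mapping negative longitudes into the 0-360 range before the nearest-neighbor lookup. A minimal sketch, with a hypothetical function name:

```python
def normalize_longitude_0_360(lon):
    """Map a longitude in degrees to the half-open range [0, 360)."""
    # Python's % returns a non-negative result for a positive divisor,
    # so -75.5 becomes 284.5 without any branching.
    return lon % 360.0
```

If the grid stores longitudes as 0-360, querying it with -75.5 would match a point near the prime meridian rather than the intended one at 284.5, which is consistent with the wrong western-hemisphere temperatures mentioned above.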
add DestinE config (time periods, variable mapping/suffixes) and provider implementation with unstructured grid interpolation and unit conversions
wire DestinE into provider factory and availability list
gate ERA5 retrieval tool on Arraylake API key, pass via config from Streamlit UI
allow era5 retrieval tool creation with bound API key while keeping env-based fallback
@kuivi kuivi requested review from dmpantiu and koldunovn February 2, 2026 10:26
kuivi and others added 2 commits February 2, 2026 12:21
- Fix regex to handle leading whitespace before code blocks (python_repl.py)
- Reset is_initialized flag on kernel restart to re-run initialization (python_repl.py)
- Add sandbox_path parameter for relative path resolution (image_viewer.py)
- Pass sandbox_path in data_analysis_agent.py call
- Implement atomic writes for ERA5 cache to prevent corruption (era5_retrieval_tool.py)
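
Atomic cache writes are commonly done by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that pattern with a JSON payload; the JSON format and the function name are assumptions, since the PR does not say how the ERA5 cache is stored.

```python
import json
import os
import tempfile

def atomic_write_json(path, payload):
    """Write `payload` to `path` so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in a single step, overwriting any old cache.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies mid-write, the previous cache file is still intact and only a stray `.tmp` file is left behind.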
@koldunovn koldunovn merged commit 1b31301 into main Feb 3, 2026
4 checks passed