Skip to content

Conversation

@paco-barreras
Copy link
Collaborator

Relatively large overhaul to how we rasterize a city and build street graphs connecting rasterized buildings.

  • New persistence methods
  • hub-based proxy graph distance (for scalability)
  • New indexing for buildings and "door" logic assignment to avoid index overlaps
  • New tests for city generation and trajectory generation (all pass)

- Remove chunking logic for city-scale downloads (single query ~4x faster)
- Fix CRS handling in download_osm_buildings/streets for polygon queries
- Convert all chunk_miles parameters to chunk_km (metric units)
- Add progress reporting for chunked downloads
- Add gpkg files to gitignore
- Move synthetic Philadelphia notebook to virtual_philadelphia directory
- Remove redundant column existence checks (we control the pipeline)
- Remove empty list checks (data always exists after download/load)
- Remove broad exception catching that masks real errors
- Simplify plotting to only show streets
- Add automatic data loading from persisted GeoPackage
- Integrate rasterization utilities into city_gen.py (RasterCityGenerator class)
- Implement lazy street graph building with shortcut (hub) network for O(n) memory
- Add manual_streets mode to City class for rasterized cities
- Update building classification: fix residential/home mapping, add residential signal requirement for inference
- Standardize chunking to kilometers (chunk_km) throughout map_utils
- Add utility scripts: ab_category_audit.py, streets_benchmark.py, create_sandbox.py
- Add comprehensive tests: test_rasterization_city_gen.py
- Remove obsolete rasterization.py and RASTERIZATION_PLAN.md
- Update .gitignore to exclude large GeoPackage files from research directory
…ibility

- Rewrite create_sandbox.py to download OSM data directly for Old City bbox
- Remove dependency on synthetic_philadelphia.py output
- Update all scripts to use relative paths instead of __file__ for Jupyter notebook compatibility
- Scripts now work when run as notebooks or from command line
OSMnx expects geographic coordinates (EPSG:4326) for polygon queries.
The download functions handle CRS conversion internally.
…ppers, add GeoJSON/Parquet/Shapefile I/O; enable street consolidation/length filter; speed up rasterization; fix tests; update report
…th lazy shortcut routing and boundary scaling/cropping
…ching; expand sandbox boundary by 2.5x; remove City.id_to_door_cell and use direct columns; simplify get_building()
…recomputed gravity in traj_gen; report gravity build in test_rasterization_report
…stance table and exponent in gravity; improve docstrings; normalize building_type in from_geopackage; tests pass
…e canonicalization; update traj_gen tests to normalize legacy fixture labels; docs clarified
…rkplace/retail/park) with rng.choice; enforce no fallbacks; City.from_geopackage: poi_cols naming
…_sample_horizontal_noise; return true burst_info; add tests incl. bursts tz/window check
…ct schema and errors for missing building ids
…erCityGenerator; enforce 'building_type' in City.from_geodataframes; fix KeyError in rasterization report use-cases
- Replaced _select_hubs with quantile-based evenly-spaced hub selection
- Pivoted hub_df from edge list to dense adjacency matrix for O(1) lookup
- Vectorized gravity computation using numpy broadcasting and fancy indexing
- Simplified traj_gen.py gravity query from 17 lines to 3 lines
- Fixed diagonal not being zeroed in distance matrix
- Added test_compute_gravity() with real data from sandbox fixture
- Streamlined rasterization_report.py to focus on timing and object inspection
…cessary defensive checks, derive timestamps when extending diaries, early gravity check, and streamline initialization paths.
- Add test_resolve_overlaps to verify feature increases building count
- Test confirms resolve_overlaps=True adds more buildings (34 → 53 in test)
- Add resolve_overlaps=True to rasterization_report.py
- Benchmark results show 40-54% more buildings across all box sizes:
  - Small: 347 → 535 (+54%)
  - Medium: 3,935 → 5,504 (+40%)
  - Large: 20,252 → 28,419 (+40%)
- Gravity computation scales with N² as expected (more buildings = longer time)
- Memory footprint remains tiny with lean structures (<3 MB for 28k buildings)
- All tests pass
- Refactor compute_gravity Part 1 to use 10 chunks (default)
- Process buildings in chunks to avoid dense N×N matrix allocation
- Memory reduction: 3.2 GB → 323 MB peak (10x improvement)
- Runtime improvement: 444s → 17s for 28k buildings (26x faster!)
- Eliminates memory spikes and thrashing
- All tests pass, backward compatible
- Update synthetic_philadelphia.py with streamlined EPR pipeline:
  - Use resolve_overlaps=True and callable_only=True
  - Generate 10 agents with destination diaries for 1 week
  - Persist config and diaries to disk
- Production-ready for 100k+ building simulations
…cture

- Remove bloat and useless print statements
- Clean FULL_CITY vs LARGE_BOX branching logic
- Match exact timing structure from rasterization_report.py
- Professional formatting with clear sections
- Separate timing for each pipeline step
- Clean config persistence
- No unnecessary variables or defensive checks
- Production-ready notebook
- Fixed to_timestamp() to handle scalar inputs (NumPy 2.0 compatibility)
  - Wraps scalars in pd.Series for processing, unwraps on return
  - Prevents ValueError from pd.api.types.is_datetime64_any_dtype()

- Added test_to_timestamp_scalar() to verify scalar string handling

- Improved consistency in traj_gen.py by using to_timestamp() instead of .timestamp()
  - Lines 620, 643, 771 now handle string datetime inputs robustly

- Added warning in generate_dest_diary when last_ping is beyond end_time
  - Prevents silent generation of empty destination diaries
  - Provides actionable error message to user

- Removed debug pdb.set_trace() from generate_dest_diary

All tests pass (13 traj_gen + 26 filter tests)
…t_diary

- Convert string datetime to pd.Timestamp in Agent.__init__
  - Ensures last_ping['datetime'] is always a proper datetime object
  - Fixes AttributeError: 'str' object has no attribute 'hour'

- Convert string datetime in generate_dest_diary when reading from destination_diary
  - Handles case where destination_diary contains string datetimes

All tests pass (13 traj_gen tests)
- Tests 9 different initialization scenarios:
  1. Default (no kwargs) - uses home centroid and current time
  2. x, y coordinates
  3. location (building ID)
  4. datetime as string
  5. datetime as pd.Timestamp
  6. timestamp (Unix seconds)
  7. trajectory (takes precedence)
  8. trajectory + kwargs (warns, trajectory wins)
  9. Combined x, y, and datetime string with timezone

- Verifies datetime objects always have .hour attribute
- Ensures trajectory takes precedence over kwargs with warning
- Confirms all datetime strings are converted to pd.Timestamp

All 14 traj_gen tests pass
Bug: RasterCity._rasterize was using door coordinates for building IDs,
causing duplicates when multiple buildings shared the same door.

Root Cause: Line 1942 used door directly instead of an internal building
block adjacent to the door (as City.add_building does at lines 358-366).

Fix: Apply same logic as City.add_building - use check_adjacent() to find
an internal building block adjacent to the door, ensuring unique IDs since
buildings cannot overlap.

Testing: Added test_unique_building_ids() regression test. All 17 tests pass.
…t path

Changed from O(n² × partial_BFS) to O(n × full_BFS) where n is number of hubs.

Before: Called nx.shortest_path_length(G, origin, dest) for each hub pair
After: Call nx.single_source_shortest_path_length(G, origin) once per origin,
       then query hub destinations via direct dictionary lookup

For n=100 hubs: 10,000 BFS runs -> 100 BFS runs

Trade-off: Loses early-stop benefit but eliminates redundant BFS traversals.
Expected to be faster since hubs are scattered evenly across large graphs.

All non-sandbox tests pass (4 tests fail due to deleted sandbox data).
Tests should not depend on data in examples/research/ folders.

Changed:
- Renamed _load_sandbox() to _load_fixture()
- Updated to load from nomad/data/city_fixture.gpkg (proper test fixture)
- Added helpful error message pointing to generate_fixture.py if missing
- Removed dependency on experimental sandbox data

The fixture is generated by nomad/data/generate_fixture.py which downloads
a small OSM area, rotates it, and saves buildings/streets/boundary.

All 17 tests pass.
Added pdb import at top of city_gen.py (following coding principles).

Debug statements in compute_gravity_row (city_gen.py):
- Check for unexpected zero distances before nearby door override
- Print building IDs with zero distances and hub info
- Set breakpoint if more than 1 zero distance found

Debug statements in generate_dest_diary (traj_gen.py):
- Print curr, probs shape, sum, inf/nan/zero counts
- Set breakpoint if inf, nan, or zero sum detected
- Print first 10 y.index and probs values for inspection

These will help diagnose why gravity contains NaN values.
- Add rotation_deg, offset_x, offset_y attributes to base City class
- Update RasterCity to pass rotation and document coordinate flow
- Fix streets_gdf geometry to use garden city units (consistent with buildings)
- Set CRS to None for internal gdfs (garden city units are abstract)
- Implement to_file method with reverse_affine_transformation option
- Add comprehensive docstrings explaining coordinate system transformations
- Test verifies to_file exports with and without reverse transformation
- Checks garden city columns are dropped when reverse_affine_transformation=True
- Verifies CRS is set correctly to Web Mercator
- Validates geometry scaling by comparing areas
- Add blocks_to_mercator_gdf() and mercator_to_blocks_gdf() in map_utils.py
- Delete vestigial to_geodataframes() method
- Delete OLD to_file() method (lines 978-986)
- Refactor NEW to_file() to use map_utils transforms (reduced from ~100 to ~40 lines)
- Refactor save_geopackage() to use direct gdf access and add reverse_affine parameter
- Rename reverse_affine_transformation to reverse_affine for consistency
- Update all tests to use city.buildings_gdf/streets_gdf directly
- No bloat: clean separation of concerns (transforms in map_utils, persistence in City)
- All 19 tests pass
@paco-barreras
Copy link
Collaborator Author

Also squashed a bug related to newer versions of Numpy breaking some filters.py functions

- Cache path_ml and bound_poly to avoid redundant unary_union and MultiLineString computations
- Check cache validity by verifying cached path ends at current destination building
- Clear cache when agent arrives at destination
- Use previous destination building geometry in bound_poly when departing from it
- Vectorize street_geom and path construction using blocks_gdf directly
- Remove unnecessary start_info variable and clamping logic
- Simplify cache validation to only check destination match

Performance improvements:
- test_complete_workflow: 7.09s -> 4.43s (37% faster)
- prescribed_trajectory_exp_1.py: 10.74s -> 6.29s (41% faster)
- Fixed _connect_building_door_to_streets() to use building_type instead of redundant 'kind' column
- Added 5 door connection tests from routing branch (updated for building_type)
- Added compute_shortest_paths() calls to tests to reflect proper usage
- Added auto-initialization to get_shortest_path() for convenience
- Removed bloat: eliminated unnecessary hasattr check
…tests

- Remove type checking and defensive hasattr checks
- Remove redundant ValueError check for empty buildings
- Remove DEBUG VERIFY code
- Initialize street_graph = None in __init__
- Add warning for auto-initialization of shortest_paths
- Fix pd.concat to preserve GeoDataFrame geometry (FutureWarning fix)
- Add tests for geometry containment and building_type/building_id consistency
- Add verification prints to rasterization_report.py
…version

- Replace notebook with full pipeline version from ROUTING branch
- Includes trajectory generation, sparse sampling, reprojection, and visualization
- Keep ROUTING branch configuration (medium box, smaller parameters)
- Remove temporary report .md files (CACHING_PLAN, CODING_PRINCIPLES, FINAL_REVISED_PLAN)
- Delete ROUTING branch notebook file
…tadata, remove analysis/report/plan markdown files
@paco-barreras paco-barreras force-pushed the rasterization-and-street-graph branch from 9fbb49d to 5991272 Compare November 14, 2025 06:09
- Remove type hints from 5 functions (Principle #4)
- Remove dead code: unused imports (pdb, cm), unused functions, unused variables
- Fix broken reset_trajectory() cache nulling
- Optimize O(n) → O(1) building lookups using indexed .loc[] (7 locations)
- Remove defensive bloat: unnecessary .empty checks, hasattr checks
- Remove bounds clamping (agents always kept in bounds by geometry checks)
- Extract duplicate destination selection logic to helper function
- Update test expectations: KeyError instead of IndexError for missing buildings
- Add testing/environment notes to CODING_PRINCIPLES.md

Results: 1491 → 1477 lines (-14 lines), all 14 tests pass
Significant performance improvement from O(1) building lookups
@paco-barreras paco-barreras deleted the rasterization-and-street-graph branch November 18, 2025 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants