Removed dataset creation from pipeline #45
Conversation
Pull request overview
This pull request modernizes the geospatial data ingestion pipeline by replacing file-based workflows with direct reads of pre-processed Parquet files from blob storage for both the FKB and OSM datasets. The changes simplify data loading and improve performance by leveraging pre-processed data.
Changes:
- Replaced raw file processing with direct Parquet file loading from blob storage for FKB and OSM datasets
- Added a new utility method `convert_parquet_bytes_to_gdf` for converting Parquet bytes with WKB geometry to GeoDataFrames (a sketch follows this list)
- Updated SQL queries to use standardized field names (`external_id`, `building_id`) and added optional timestamp fields for better data tracking
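For orientation, a minimal sketch of such a conversion using pandas, GeoPandas, and Shapely is shown below. This is not the repository's implementation: the real interface takes an EPSGCode enum, whereas the sketch accepts a plain EPSG integer, and the WKB column is assumed to be named `geometry`.

```python
from io import BytesIO

import geopandas as gpd
import pandas as pd
from shapely import from_wkb


def convert_parquet_bytes_to_gdf(data: bytes, epsg_code: int) -> gpd.GeoDataFrame:
    """Sketch: convert Parquet bytes with a WKB 'geometry' column to a GeoDataFrame."""
    # Read the Parquet payload from memory, e.g. bytes downloaded from blob storage.
    df = pd.read_parquet(BytesIO(data))
    # Decode the WKB column into shapely geometries and attach the CRS.
    geometry = from_wkb(df["geometry"])
    return gpd.GeoDataFrame(df.drop(columns=["geometry"]), geometry=geometry, crs=f"EPSG:{epsg_code}")
```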
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| src/presentation/entrypoints/release_pipeline.py | Commented out obsolete download and polygon creation steps in the pipeline |
| src/infra/persistence/context/duckdb.py | Added platform-specific Azure transport configuration for Linux compatibility |
| src/infra/persistence/context/azure_blob_storage.py | Removed BOM character from import statement |
| src/infra/infrastructure/services/vector_service.py | Enhanced clipping logic to dynamically handle optional timestamp columns |
| src/infra/infrastructure/services/open_street_map_service.py | Replaced PBF file processing with Parquet file reading from blob storage |
| src/infra/infrastructure/services/fkb_service.py | Replaced Hugging Face download workflow with Parquet file reading from blob storage |
| src/infra/infrastructure/services/conflation_service.py | Standardized SQL queries with external_id and building_id fields and added timestamp columns |
| src/infra/infrastructure/services/bytes_service.py | Implemented new convert_parquet_bytes_to_gdf method for Parquet to GeoDataFrame conversion |
| src/infra/infrastructure/containers.py | Updated dependency injection to provide blob storage and bytes services |
| src/domain/enums/storage_container.py | Added CONTRIBUTION container enum value |
| src/config.py | Improved environment variable loading with find_dotenv() |
| src/application/contracts/bytes_service_interface.py | Added interface definition for convert_parquet_bytes_to_gdf method |
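The `config.py` row above (environment variable loading with `find_dotenv()`) usually amounts to a one-liner with python-dotenv; a minimal illustration, assuming the project keeps its settings in a `.env` file (the actual file may structure this differently):

```python
from dotenv import find_dotenv, load_dotenv

# find_dotenv() walks up the directory tree to locate the nearest .env,
# so loading works regardless of the current working directory.
load_dotenv(find_dotenv())
```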
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
```python
import geopandas as gpd
from duckdb import DuckDBPyConnection
```
Copilot AI · Jan 20, 2026
The deprecated import at line 1 is not used anywhere in this file. The commented-out method at lines 39-50 doesn't use the @deprecated decorator. Consider removing this unused import.
Suggested change:

```python
import geopandas as gpd
from duckdb import DuckDBPyConnection
```
```python
# building_polygons = fkb_service.create_building_polygons(gdf=fkb_dataset, crs=EPSGCode.WGS84)
return [fkb_dataset]
```
Copilot AI · Jan 20, 2026
Commented-out code should be removed rather than left in the codebase. If this functionality needs to be preserved for reference, it should be documented in version control history. Keeping commented-out code reduces readability and maintainability.
```python
select_parts = [
    "external_id",
    "geometry",
    "building_id",
    "* EXCLUDE(external_id, geometry, building_id)"
```
Copilot AI · Jan 20, 2026
The SELECT clause includes both explicit column references (external_id, geometry, building_id) and then uses EXCLUDE to select all other columns except these same three. This creates redundancy in the query. Consider restructuring to either: 1) Use SELECT * if all columns are needed in their original positions, or 2) Only select the specific columns that need repositioning, followed by * EXCLUDE(...) for the remainder.
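For context, listing columns explicitly and then adding `* EXCLUDE(...)` is a DuckDB idiom for moving specific columns to the front of the result while keeping everything else; the self-contained example below (table name and extra column invented for illustration) shows the effect.

```python
import duckdb

con = duckdb.connect()
# Toy schema: one extra column besides the three that get repositioned.
con.execute("CREATE TABLE buildings (area DOUBLE, external_id TEXT, geometry BLOB, building_id INTEGER)")

result = con.execute(
    "SELECT external_id, geometry, building_id, "
    "* EXCLUDE (external_id, geometry, building_id) FROM buildings"
)
print([col[0] for col in result.description])
# ['external_id', 'geometry', 'building_id', 'area']
```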
```python
@staticmethod
@abstractmethod
def convert_parquet_bytes_to_gdf(data: bytes, epsg_code: EPSGCode) -> gpd.GeoDataFrame:
    """
    Converts a byte array to a GeoPandas GeoDataFrame. This assumes that the file is a parquet file and that
    there is a geometry column with geometries represented as WKB.

    :param data: Byte array of the parquet file. Often downloaded from blob storage.
    :param epsg_code: EPSG code for the coordinate reference system (CRS).
    :return: GeoDataFrame representation of the byte array.
    :rtype: gpd.GeoDataFrame
    :raises NotImplementedError: This method must be implemented by subclasses.
```
Copilot AI · Jan 20, 2026
The method body is missing a raise NotImplementedError statement, which is inconsistent with the other abstract methods in this interface (e.g., line 18). Abstract methods should raise NotImplementedError to ensure proper interface contract enforcement.
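The convention referred to here would look roughly like the sketch below (illustrative only, not the repository's code; the real signature takes the EPSGCode enum rather than an int):

```python
from abc import ABC, abstractmethod

import geopandas as gpd


class IBytesService(ABC):
    @staticmethod
    @abstractmethod
    def convert_parquet_bytes_to_gdf(data: bytes, epsg_code: int) -> gpd.GeoDataFrame:
        """Convert Parquet bytes with WKB geometry to a GeoDataFrame."""
        # Explicitly raising keeps the contract enforced even if the
        # abstract method is called directly on the base class.
        raise NotImplementedError
```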
```
@@ -1,4 +1,6 @@
import duckdb
import platform
```
Copilot AI · Jan 20, 2026
The os module is imported but not used in this file. Only the platform module is needed for the platform check. Consider removing the unused import.
Suggested change:

```python
import platform
```
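For reference, the kind of platform-specific transport configuration described for duckdb.py could look like the sketch below. The `azure_transport_option_type` setting belongs to DuckDB's Azure extension; whether the repository uses exactly this option is an assumption.

```python
import platform

import duckdb

con = duckdb.connect()
con.execute("INSTALL azure; LOAD azure;")
if platform.system() == "Linux":
    # Assumption: switch to the curl-based transport on Linux for compatibility.
    con.execute("SET azure_transport_option_type = 'curl';")
```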
```python
)
```
Copilot AI · Jan 20, 2026
The logger and Config imports are no longer used in the active code - they only appear in the commented-out implementation. If the commented code is removed (as recommended), these imports should also be removed to keep the imports clean.
```python
from shapely import from_wkb

from src import Config
from src.application.common import logger
from application.contracts import IBlobStorageService
from domain.enums import StorageContainer
from src.application.contracts import (
    IFKBService, IFKBFileService, IZipService, IBytesService
)
```
Copilot AI · Jan 20, 2026
The logger and Config imports (lines 10-11) are no longer used in the active code - they only appear in the commented-out implementation. If the commented code is removed (as recommended), these imports should also be removed to keep the imports clean.
Suggested change:

```python
from shapely import from_wkb
from application.contracts import IBlobStorageService
from domain.enums import StorageContainer
```
```sql
partition_key,
TRY_CAST(bygningstype AS VARCHAR) AS type,
TRY_CAST(bygningsnummer AS INTEGER) AS building_id,
TRY_CAST(building_type AS VARCHAR) AS building_type,
```
Copilot AI · Jan 20, 2026
There is a spelling error in the column alias: "buildling_type" should be "building_type".
Suggested change:

```sql
TRY_CAST(building_type AS VARCHAR) AS building_type,
```
This pull request introduces significant changes to how geospatial datasets (FKB and OSM) are loaded and processed, moving from file-based workflows to reading pre-processed Parquet files from blob storage. It also introduces new utility methods for handling Parquet bytes as GeoDataFrames, updates dependency injection wiring, and refines the SQL queries and data handling for improved consistency and extensibility.
Major workflow changes:
- The FKB and OSM services now read pre-processed Parquet files (`fkb.parquet`, `osm.parquet`) directly from blob storage, replacing previous workflows that relied on downloading and processing raw files. This simplifies and speeds up data ingestion.

New and updated utility methods:
- Added a `convert_parquet_bytes_to_gdf` method to `IBytesService` and its implementation, enabling direct conversion of Parquet byte arrays (with WKB geometry) to GeoPandas GeoDataFrames.
- Updated dependency injection in `containers.py` to provide `blob_storage_service` and `bytes_service` to the services that now require them.

SQL and schema improvements:
- Updated SQL queries in `conflation_service.py` to use `external_id` and `building_id` fields, and added new fields (`feature_update_time`, `feature_capture_time`, `building_type`) for better data tracking and compatibility.
- Enhanced `clip_dataframes_to_wkb` to dynamically include optional columns if present, improving robustness and extensibility (see the sketch at the end of this overview).

Configuration and enum updates:
- Added `CONTRIBUTION` to the `StorageContainer` enum to support the new blob storage workflow.
- Improved environment variable loading with `find_dotenv()` in `config.py` for more reliable configuration.

Platform-specific and minor improvements:
- Configured DuckDB's Azure transport to use `curl` on Linux for compatibility.

These changes collectively modernize the data ingestion pipeline, making it more maintainable and performant by leveraging pre-processed data and improving code modularity.
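As a hypothetical illustration of the "optional columns" approach in the clipping logic (the sketch below is not the actual implementation; column names are taken from the fields mentioned above):

```python
import geopandas as gpd

# Columns included only when the source data actually provides them.
OPTIONAL_COLUMNS = ["feature_update_time", "feature_capture_time", "building_type"]


def select_columns(gdf: gpd.GeoDataFrame) -> list[str]:
    base = ["external_id", "geometry", "building_id"]
    return base + [column for column in OPTIONAL_COLUMNS if column in gdf.columns]
```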