diff --git a/docs/Configuration/basins_metadata.md b/docs/Configuration/basins_metadata.md
index 4c1d24b6..48d3c0c4 100644
--- a/docs/Configuration/basins_metadata.md
+++ b/docs/Configuration/basins_metadata.md
@@ -6,9 +6,9 @@ To implement this feature RAT uses `basins_metadata` which is a csv file that is
 
 ![Screenshot of Basins_Metadata.csv](../images/configure/basins_metadata_sample.jpg)
 
-It should definitely have the index `basin_name` for section `BASIN` as it needs to be different for all basins and same is true for `basin_id`. Rest all those parameters which do not vary between basins can be provided in the configuration file and those which will vary must be provided in the `basins_metadata` csv file.
+It must have the index `basin_name` for section `BASIN`, as its value needs to be different for all basins; the same is true for `basin_id`. Parameters which do not vary between basins can be provided in the configuration file, and those which vary must be provided in the `basins_metadata` csv file. A special index `run` for section `BASIN` must be used in `basins_metadata`, and its value should be set to 1 for every river basin that RAT should run, otherwise 0.
 
 !!! tip_note "Tip"
     1. To make use of `basins_metadata`, `multiple_basin_run` in `GLOBAL` section should be `true`.
-    2. The values of `basin_name` from `basins_metadata` should be provided in the list `basins_to_process` for all the river basins for which you want to run RAT for.
+    2. The value of `run` in `BASIN` must be provided as either 1 or 0 for each basin in `basins_metadata`, where 1 marks the river basins for which you want to run RAT (for example, setting `run` to 1 for Sabine and 0 for Nueces makes RAT run only for Sabine).
     3. A sample copy of `basins_metadata` is provided in 'params' folder inside `project_dir` with the name of 'basins_metadata_sample.csv'.
\ No newline at end of file
diff --git a/docs/Configuration/rat_config.md b/docs/Configuration/rat_config.md
index c57b370f..774181c3 100644
--- a/docs/Configuration/rat_config.md
+++ b/docs/Configuration/rat_config.md
@@ -37,7 +37,22 @@ RAT config file has 12 major sections that defines several parameters which are
     -1
     -2
     ```
-
+
+*
*`cleaning`* :
+    Optional parameter
+
+    Description : Boolean flag indicating whether to perform cleaning during the RAT run. If `True`, RAT cleans the data specified by the [`CLEAN UP`](#clean-up) section in the configuration file.
+
+    Default : False
+
+    Syntax : If you want to delete data produced by RAT using the `CLEAN UP` section, then
+    ```
+    GLOBAL:
+        cleaning: True
+    ```
+    !!! note
+        If you want to use the [`CLEAN UP` section](#clean-up), `cleaning` must be `True`.
+
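+    For instance, a minimal sketch that enables cleaning and deletes intermediate MetSim data after the run (the `clean_metsim` flag is documented in the [`CLEAN UP` section](#clean-up)):
+    ```
+    GLOBAL:
+        cleaning: True
+    CLEAN_UP:
+        clean_metsim: True
+    ```
+
*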
*`project_dir`* :
    Required parameter
@@ -143,19 +158,6 @@ RAT config file has 12 major sections that defines several parameters which are
        basins_metadata: /Cheetah/rat_project/params/basins_metadata_sample.csv
    ```
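+    For reference, a minimal sketch of a multiple-basin setup combining this parameter with `multiple_basin_run` (the path is illustrative):
+    ```
+    GLOBAL:
+        multiple_basin_run: True
+        basins_metadata: /Cheetah/rat_project/params/basins_metadata_sample.csv
+    ```
+
-*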
*`basins_to_process`* :
- Required parameter - - Description : List of basins to run RAT for within the `basins_metadata`. The list values must match with the values of `basin_name` in `BASIN` section in `basins_metadata`. For more information, please look [multiple basin run](../basins_metadata). - - Default : It is blank by default and can be filled by the user. - - Syntax : If you want to run RAT for basins Sabine and Nueces, then - ``` - GLOBAL: - basins_to_process: ['Sabine','Nueces'] - ``` - ### Basin *
*`region_name`* :
@@ -712,38 +714,74 @@ This section of the configuration file describes the parameters defined by `rout 2. If `station_global_data` is `False`, AEC file names should be <'dam_name_column' value where spaces are replaced by '_'>. For example, the file name for a reservoir with 'dam_name' as 'Tehri Dam' will be 'Tehri_Dam.csv'. 3. Each AEC file should have two columns with headers as 'Elevation' and 'CumArea'. 'Elevation' should be in meters and 'CumArea' should be in square Kilometers. +*
*`catchment_vector_file`* :
+    Optional parameter
+
+    Description : Absolute path of the catchment vector file, where each geometry is a reservoir's catchment polygon. The file can be global (i.e., cover more than the basin); it will be automatically filtered to the basin. It should have a unique ID column and a dam name column (see `catchment_vector_file_columns_dict`).
+
+    Default : It is blank by default and can be filled by the user.
+
+    Syntax : If `catchment_vector_file` has the path *'/Cheetah/rat_project/custom_files/catchments.geojson'*, then
+    ```
+    POST_PROCESSING:
+        catchment_vector_file : /Cheetah/rat_project/custom_files/catchments.geojson
+    ```
+
+*
*`catchment_vector_file_columns_dict`* :
+    Optional parameter
+
+    Description : Dictionary of column names for `catchment_vector_file`. The dictionary must have the keys 'id_column' and 'dam_name_column', and their values should be the actual names of the corresponding columns.
+
+    Default : `{id_column : 'GRAND_ID', dam_name_column : 'DAM_NAME'}`
+
+    Syntax : If `catchment_vector_file` has column names 'GRAND_ID' and 'DAM_NAME', then
+    ```
+    POST_PROCESSING:
+        catchment_vector_file_columns_dict: {id_column : 'GRAND_ID', dam_name_column : 'DAM_NAME'}
+    ```
+    or
+    ```
+    POST_PROCESSING:
+        catchment_vector_file_columns_dict:
+            id_column: GRAND_ID
+            dam_name_column: DAM_NAME
+    ```
+
 ### Clean Up
-*
*`clean_preprocessing`* :
-    Required parameter
+!!!note
+    To use this section, `cleaning` should be set to `True` in the [Global Section](#global) of the configuration file. This section is optional and is ignored if `cleaning` is `False` or not provided in the Global section.
+
+*
*`clean_processing`* :
+ Optional parameter - Description : `True` if you want to delete intermediate pre-processed data for a river basin except global raw data downloaded from servers after the RAT run. Otherwise, `False`. + Description : `True` if you want to delete intermediate pre-processed and post-processed data for a river basin except global raw data downloaded from servers after the RAT run. Otherwise, `False`. Default : `False` - Syntax : If you want to delete intermediate pre-processed data for a river basin, + Syntax : If you want to delete intermediate pre-processed and post-processed data for a river basin, ``` CLEAN_UP: - clean_preprocessing: True + clean_processing: True ``` *
*`clean_metsim`* :
- Required parameter + Optional parameter - Description : `True` if you want to delete intermediate metsim outputs for a river basin after the RAT run. Otherwise, `False`. + Description : `True` if you want to delete intermediate metsim inputs and outputs for a river basin after the RAT run. Otherwise, `False`. Default : `False` - Syntax : If you want to delete intermediate metsim outputs for a river basin, + Syntax : If you want to delete intermediate metsim inputs and outputs for a river basin, ``` CLEAN_UP: clean_metsim: True ``` *
*`clean_vic`* :
- Required parameter + Optional parameter - Description : `True` if you want to delete intermediate vic inputs and outputs, and any vic initial soil state file that is older than 15 days, for a river basin after the RAT run. Otherwise, `False`. + Description : `True` if you want to delete intermediate vic inputs and outputs, and any vic initial soil state file that is older than 20 days, for a river basin after the RAT run. Otherwise, `False`. Default : `False` @@ -754,9 +792,9 @@ This section of the configuration file describes the parameters defined by `rout ``` *
*`clean_routing`* :
- Required parameter + Optional parameter - Description : `True` if you want to delete intermediate routing inputs and outputs, and any routing initial state file that is older than 15 days, for a river basin after the RAT run. Otherwise, `False`. + Description : `True` if you want to delete intermediate routing inputs and outputs, and any routing initial state file that is older than 20 days, for a river basin after the RAT run. Otherwise, `False`. Default : `False` @@ -767,7 +805,7 @@ This section of the configuration file describes the parameters defined by `rout ``` *
*`clean_gee`* :
- Required parameter + Optional parameter Description : `True` if you want to delete gee produced small chunk files of surface area time series for a river basin after the RAT run. Otherwise, `False`. @@ -782,7 +820,7 @@ This section of the configuration file describes the parameters defined by `rout If `clean_gee` is `True`, it will not delete the final gee outputs that will be appended with new data in next RAT run. To delete that, use `clean_previous_outputs`. *
*`clean_altimetry`* :
- Required parameter + Optional parameter Description : `True` if you want to delete raw altimetry data that takes a lot of time to download for a river basin after the RAT run. Otherwise, `False`. @@ -796,8 +834,36 @@ This section of the configuration file describes the parameters defined by `rout !!!note If `clean_altimetry` is `True`, it will not delete the extracted altimetry data that will be appended with new data in next RAT run. To delete that, use `clean_previous_outputs`. +*
*`clean_basin_parameter_files`* :
+    Optional parameter
+
+    Description : `True` if you want to delete parameter files related to the basin. Otherwise, `False`.
+
+    Default : `False`
+
+    Syntax : If you want to delete parameter files related to the basin,
+    ```
+    CLEAN_UP:
+        clean_basin_parameter_files: True
+    ```
+
+*
*`clean_basin_meteorological_data`* :
+    Optional parameter
+
+    Description : `True` if you want to delete the combined meteorological data of the basin, which is required to hot-start RAT for the next RAT run. Otherwise, `False`.
+
+    Default : `False`
+
+    Syntax : If you want to delete the combined meteorological data of the basin,
+    ```
+    CLEAN_UP:
+        clean_basin_meteorological_data: True
+    ```
+    !!! note
+        Combined meteorological data is what allows VIC (the hydrological model) to run without any spin-up in the next run. Without this data, RAT needs to run the hydrological model for an extra 2-3 years. Use `clean_basin_meteorological_data` if you want to recreate the combined meteorological data file for a river basin in the next RAT run.
+
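+    For example, a sketch of a `CLEAN_UP` section that forces a basin to be fully re-initialized in the next run (both flags are documented above):
+    ```
+    CLEAN_UP:
+        clean_basin_parameter_files: True
+        clean_basin_meteorological_data: True
+    ```
+
*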
*`clean_previous_outputs`* :
-    Required parameter
+    Optional parameter
 
     Description : `True` if you want to delete previous outputs, gee extracted surface area time series and altimetry extracted height data produced by last RAT run. Otherwise, `False`.
@@ -828,4 +894,8 @@ This section of the configuration file describes the parameters defined by `rout
         secrets: /Cheetah/rat_project/secrets/secrets.ini
     ```
     !!! note
-        It will be left blank if '-s' argument is not provided in `rat init` command.
\ No newline at end of file
+        It will be left blank if '-s' argument is not provided in `rat init` command.
+
+### Plugins
+
+This section of the configuration file is optional and can be used to run different plugins, as described [here](../../Plugins/). It should be used to provide the parameters required by any plugin that you want to use.
\ No newline at end of file
diff --git a/docs/images/tutorials/KarnatakaFloods/test_output.png b/docs/images/tutorials/KarnatakaFloods/test_output.png
index 06ee0a06..b941f917 100644
Binary files a/docs/images/tutorials/KarnatakaFloods/test_output.png and b/docs/images/tutorials/KarnatakaFloods/test_output.png differ
diff --git a/environment.yml b/environment.yml
index 39729b6a..e95296f2 100644
--- a/environment.yml
+++ b/environment.yml
@@ -16,8 +16,14 @@ dependencies:
   - ruamel_yaml
   - yaml
   - gdown
+  - plotly
   - cfgrib
   - conda-build
+  - gfortran
+  - make
+  - gdal
+  - libnetcdf
+  - libgdal-netcdf
   - pip
   - pip:
     - geonetworkx
\ No newline at end of file
diff --git a/src/rat/cli/rat_init_config.py b/src/rat/cli/rat_init_config.py
index f966220a..49132742 100644
--- a/src/rat/cli/rat_init_config.py
+++ b/src/rat/cli/rat_init_config.py
@@ -1,13 +1,14 @@
 DOWNLOAD_LINKS_DROPBOX = {
     'route_model': "https://www.dropbox.com/scl/fi/1kjivr13kyf6gn7wlhbzt/routing.zip?rlkey=zq8j501amqyirfprcgb9dquec&dl=1",
-    'params': "https://www.dropbox.com/scl/fi/9qk9bwrbawryx79o7cclg/params.zip?rlkey=j5dqxipkvwuo3nso4dl1cxaux&dl=1",
+    'params': "https://www.dropbox.com/scl/fi/zvluibsnhz36k4smzcavv/params.zip?rlkey=uqbxdi4ulr9dv4u4imktf605m&st=r0qrt2gd&dl=1",
     'global_data': "https://www.dropbox.com/scl/fi/dhy3y3e9dw6tg89vt6x66/global_data.zip?rlkey=uvy6772cj5bpdfvtvqa4u8enj&st=4sti44e7&dl=1",
     'global_vic_params': "https://www.dropbox.com/s/jsg2wu62qi2ltwz/global_vic_params.zip?dl=1",
 }
+## For a google drive link of the form https://drive.google.com/file/d/<file_id>/view?usp=sharing, use https://drive.google.com/uc?id=<file_id>
 DOWNLOAD_LINKS_GOOGLE = {
     'route_model': "https://drive.google.com/uc?id=1zr3VH0wy-XN-yF2_n0xic89_PiT_V_Mb",
-    'params': "https://drive.google.com/uc?id=1LAGivWvgBdtJvDWzkGKorzfjPEKvO69k",
+    'params': "https://drive.google.com/uc?id=1o5TH8GBmP4rjvFFROQmbid-ILMJw7xiS",
     'global_data': "https://drive.google.com/uc?id=1Obm_cKPFoumJKhcNTvFxIZNkSU3OmgT_",
     'global_vic_params': "https://drive.google.com/uc?id=16P95eu2yG0i77ac_NrmTSVutNtVXWNeA",
 }
diff --git a/src/rat/cli/rat_test_config.py b/src/rat/cli/rat_test_config.py
index fa1ace16..79241ed4 100644
--- a/src/rat/cli/rat_test_config.py
+++ b/src/rat/cli/rat_test_config.py
@@ -1,10 +1,12 @@
 from datetime import date
 
 DOWNLOAD_LINK_DROPBOX = {
-    'test_data': "https://www.dropbox.com/scl/fi/f1pnyz9mo178kweh3agtf/test_data.zip?dl=1&rlkey=qxvsfrhc2li55dh4yj4a03bq2"
+    'test_data': "https://www.dropbox.com/scl/fi/q88v9qpg20rhzhj8v4ktq/test_data.zip?rlkey=j6y1mlhc8vgzaicdofj4hn8g4&st=l7bcqmx1&dl=1"
 }
+
+## For a google drive link of the form https://drive.google.com/file/d/<file_id>/view?usp=sharing, use https://drive.google.com/uc?id=<file_id>
 DOWNLOAD_LINK_GOOGLE = {
-    'test_data': 
"https://drive.google.com/uc?id=1F5Z6f4rX7nBSXXo7XYGK_XC0i5-G6zBF" + 'test_data': "https://drive.google.com/uc?id=1yV2zfqovPjAyWI4yWL5lhN7_GWaXL2Va" } PATHS = { diff --git a/src/rat/cli/rat_test_verify.py b/src/rat/cli/rat_test_verify.py index e4856759..51eacf62 100644 --- a/src/rat/cli/rat_test_verify.py +++ b/src/rat/cli/rat_test_verify.py @@ -27,6 +27,9 @@ def verify_test_results(self): print("############################## Test-6 (Surface Area) ##############################",end="\n") self._verify_test_results_for_var('sarea_tmsos','Surface Area') + + print("############################## Test-7 (NSSC) ##############################",end="\n") + self._verify_test_results_for_var('nssc','Normalized Suspended Sediment Concentration') # Comparing true and estimated files for a variable def _verify_test_results_for_var(self, var_to_observe, var_to_display): diff --git a/src/rat/core/run_postprocessing.py b/src/rat/core/run_postprocessing.py index 48b3883a..f594e013 100644 --- a/src/rat/core/run_postprocessing.py +++ b/src/rat/core/run_postprocessing.py @@ -1,7 +1,9 @@ import os +import glob import pandas as pd import numpy as np import xarray as xr +import rioxarray as rxr import geopandas as gpd import warnings import datetime @@ -17,7 +19,7 @@ def calc_dels(aecpath, sareapath, savepath): - aec = pd.read_csv(aecpath) + aec = pd.read_csv(aecpath, comment='#') df = pd.read_csv(sareapath, parse_dates=['date']) df = df.drop_duplicates('date') @@ -50,28 +52,91 @@ def calc_dels(aecpath, sareapath, savepath): df.to_csv(savepath, index=False) def calc_E(res_data, start_date, end_date, forcings_path, vic_res_path, sarea, savepath, forecast_mode=False): - ds = xr.open_dataset(vic_res_path) - forcings_ds = xr.open_mfdataset(forcings_path, engine='netcdf4') - ## Slicing the latest run time period - ds = ds.sel(time=slice(start_date, end_date)) - forcings_ds = forcings_ds.sel(time=slice(start_date, end_date)) - - # create buffer to get required bounds - res_geom = res_data.geometry - res_buf = res_geom.buffer(0.1) - - minx, miny, maxx, maxy = res_buf.bounds - - log.debug(f"Bounds: {res_buf.bounds}") - log.debug("Clipping forcings") - forcings_ds_clipped = forcings_ds['air_pressure'].sel(lon=slice(minx, maxx), lat=slice(maxy, miny)) - forcings = forcings_ds_clipped.load() - - log.debug("Clipping VIC results") - ds_clipped = ds.sel(lon=slice(minx, maxx), lat=slice(maxy, miny)) - reqvars_clipped = ds_clipped[['OUT_EVAP', 'OUT_R_NET', 'OUT_VP', 'OUT_WIND', 'OUT_AIR_TEMP']] - reqvars = reqvars_clipped.load() + # Initialize existing_data to None + existing_data = None + # If vic result file exists, then use that to calculate evaporation + if os.path.isfile(vic_res_path) and glob.glob(forcings_path): + ds = xr.open_dataset(vic_res_path) + forcings_ds = xr.open_mfdataset(forcings_path, engine='netcdf4') + ## Slicing the latest run time period + ds = ds.sel(time=slice(start_date, end_date)) + forcings_ds = forcings_ds.sel(time=slice(start_date, end_date)) + + # create buffer to get required bounds + res_geom = res_data.geometry + res_buf = res_geom.buffer(0.1) + minx, miny, maxx, maxy = res_buf.bounds + + log.debug(f"Bounds: {res_buf.bounds}") + log.debug("Clipping forcings") + forcings_ds_clipped = forcings_ds['air_pressure'].sel(lon=slice(minx, maxx), lat=slice(maxy, miny)) + forcings = forcings_ds_clipped.load() + + log.debug("Clipping VIC results") + ds_clipped = ds.sel(lon=slice(minx, maxx), lat=slice(maxy, miny)) + reqvars_clipped = ds_clipped[['OUT_R_NET', 'OUT_VP', 'OUT_WIND', 'OUT_AIR_TEMP']] + 
reqvars = reqvars_clipped.load()
+
+        log.debug("Checking if grid cells lie inside reservoir")
+        last_layer = reqvars.isel(time=-1).to_dataframe().reset_index()
+        temp_gdf = gpd.GeoDataFrame(last_layer, geometry=gpd.points_from_xy(last_layer.lon, last_layer.lat))
+        log.debug('Calculating points within')
+        points_within = temp_gdf[temp_gdf.within(res_geom)]['geometry']
+
+        if len(points_within) == 0:
+            log.debug("No points inside reservoir, using nearest point to centroid")
+            centroid = res_geom.centroid
+
+            data = reqvars.sel(lat=centroid.y, lon=centroid.x, method='nearest')
+            data = data.to_dataframe().reset_index()[1:].set_index('time')
+
+            P = forcings.sel(lat=centroid.y, lon=centroid.x, method='nearest')
+            P = P.to_dataframe().reset_index().set_index('time')['air_pressure'].resample('1D').mean()[1:]
+
+        else:
+            log.debug(f"{len(points_within)} Grid cells found inside reservoir, averaging their values")
+            # Make the dataset spatially aware
+            reqvars = reqvars.rio.write_crs("EPSG:4326")  # Ensure correct CRS
+            # Clip using geometry
+            reqvars_roi_clip = reqvars.rio.clip([res_geom], reqvars.rio.crs, drop=True)
+            # Average the data over the reservoir
+            data = reqvars_roi_clip.mean(dim=['lat','lon']).to_dataframe().reset_index()[1:].set_index('time')
+            forcings = forcings.rio.write_crs("EPSG:4326")
+            # Ensure dataset has proper spatial dimensions
+            forcings_xy = forcings.rename({'lon': 'x', 'lat': 'y'})
+            forcings_roi_clip = forcings_xy.rio.clip([res_geom], reqvars.rio.crs, drop=True)
+            P = (
+                forcings_roi_clip.mean(dim=['x', 'y'])   # Average over spatial dimensions
+                .resample(time='1D').mean()              # Resample to daily means before converting
+                .to_dataframe()                          # Convert to DataFrame
+                .reset_index()[1:]                       # Drop the first (partial) row
+                .set_index('time')
+            )
+
+            P = P['air_pressure']
+
+        data['P'] = P
+    # If no VIC result file exists, check whether the existing output file already has data between the start and end dates
+    elif os.path.isfile(savepath):
+        existing_data = pd.read_csv(savepath, parse_dates=['time'])
+        # Filter data between start_date and end_date (inclusive)
+        data = existing_data[
+            (existing_data['time'] >= pd.Timestamp(start_date)) &
+            (existing_data['time'] <= pd.Timestamp(end_date))
+        ]
+        data = data.set_index('time').sort_values(by='time')
+
+    else:
+        raise ValueError("VIC result file or any existing evaporation rat_outputs file not found.")
+
     # get sarea - if string, read from file, else use same surface area value for all time steps
     log.debug(f"Getting surface areas - {sarea}")
     if isinstance(sarea, str):
@@ -85,54 +150,44 @@ def calc_E(res_data, start_date, end_date, forcings_path, vic_res_path, sarea, s
         if forecast_mode: # forecast mode. extrapolate using forward fill.
ix = pd.date_range(start=first_obs, end=end_date, freq='D')
             sarea_interpolated = sarea_interpolated.reindex(ix).fillna(method='ffill')
-    sarea_interpolated = sarea_interpolated[start_date:end_date]
-
-    log.debug("Checking if grid cells lie inside reservoir")
-    last_layer = reqvars.isel(time=-1).to_dataframe().reset_index()
-    temp_gdf = gpd.GeoDataFrame(last_layer, geometry=gpd.points_from_xy(last_layer.lon, last_layer.lat))
-    points_within = temp_gdf[temp_gdf.within(res_geom)]['geometry']
-
-    if len(points_within) == 0:
-        log.debug("No points inside reservoir, using nearest point to centroid")
-        centroid = res_geom.centroid
-
-        data = reqvars.sel(lat=centroid.y, lon=centroid.x, method='nearest')
-        data = data.to_dataframe().reset_index()[1:].set_index('time')
-
-        P = forcings.sel(lat=centroid.y, lon=centroid.x, method='nearest')
-        P = P.to_dataframe().reset_index().set_index('time')['air_pressure'].resample('1D').mean()[1:]
-        P.head()
-
-    else:
-        # print(f"[!] {len(points_within)} Grid cells inside reservoir found, averaging their values")
-        data = reqvars.sel(lat=np.array(points_within.y), lon=np.array(points_within.x), method='nearest').to_dataframe().reset_index().groupby('time').mean()[1:]
-
-        P = forcings.sel(lat=np.array(points_within.y), lon=np.array(points_within.x), method='nearest').resample({'time':'1D'}).mean().to_dataframe().groupby('time').mean()[1:]
-
+    # Reindex sarea to the dates present in data; plain slicing would be exclusive of the end date, but both start and end dates should be inclusive.
+    sarea_interpolated = sarea_interpolated.reindex(data.index)
+
     if isinstance(sarea, pd.DataFrame):
         data['area'] = sarea_interpolated['area']
     else:
         data['area'] = sarea
-    data['P'] = P
+
     data = data.dropna()
     if (data.empty):
         print('After removal of NAN values, no data left to calculate evaporation.')
         return None
     else:
         data['penman_E'] = data.apply(lambda row: penman(row['OUT_R_NET'], row['OUT_AIR_TEMP'], row['OUT_WIND'], row['OUT_VP'], row['P'], row['area']), axis=1)
+
+    data = data.rename({'penman_E': 'OUT_EVAP'}, axis=1)
     data = data.reset_index()
     # Save (Writing new file if not exist otherwise append)
-    if os.path.isfile(savepath):
-        existing_data = pd.read_csv(savepath, parse_dates=['time'])
-        new_data = data[['time', 'penman_E']].rename({'penman_E': 'OUT_EVAP'}, axis=1)
+    if existing_data is not None:
+        new_data = data.copy().reset_index(drop=True)
+        existing_data = existing_data.reset_index(drop=True)
+        # Remove duplicate columns for concatenation
+        existing_data = existing_data.loc[:, ~existing_data.columns.duplicated()]
+        new_data = new_data.loc[:, ~new_data.columns.duplicated()]
         # Concat the two dataframes into a new dataframe holding all the data (memory intensive):
         complement = pd.concat([existing_data, new_data], ignore_index=True)
         # Remove all duplicates:
         complement.drop_duplicates(subset=['time'],inplace=True, keep='last')
         complement.to_csv(savepath, index=False)
     else:
-        data[['time', 'penman_E']].rename({'penman_E': 'OUT_EVAP'}, axis=1).to_csv(savepath, index=False)
+        data.to_csv(savepath, index=False)
 
 
 def calc_outflow(inflowpath, dspath, epath, area, savepath):
@@ -180,10 +235,13 @@ def 
run_postprocessing(basin_name, basin_data_dir, reservoir_shpfile, reservoir_shpfile_column_dict, aec_dir_path,
                        start_date, end_date, rout_init_state_save_file, use_rout_state,
-                       evap_datadir, dels_savedir, outflow_savedir, vic_status, routing_status, gee_status, forecast_mode=False):
+                       evap_datadir, dels_savedir, nssc_savedir, outflow_savedir, vic_status, routing_status, gee_status, forecast_mode=False):
     # read file defining mapped resrvoirs
     # reservoirs_fn = os.path.join(project_dir, 'backend/data/ancillary/RAT-Reservoirs.geojson')
-    reservoirs = gpd.read_file(reservoir_shpfile)
+    if os.path.isfile(reservoir_shpfile):
+        reservoirs = gpd.read_file(reservoir_shpfile)
+    else:
+        raise Exception("Reservoir Shapefile is not available. Evaporation, Storage change and Outflow cannot be calculated.")
     start_date_str = start_date.strftime("%Y-%m-%d")
     if(use_rout_state):
         start_date_str_evap = (start_date-datetime.timedelta(days=100)).strftime("%Y-%m-%d")
@@ -199,7 +257,7 @@ def run_postprocessing(basin_name, basin_data_dir, reservoir_shpfile, reservoir_
     sarea_raw_dir = os.path.join(basin_data_dir,'gee', "gee_sarea_tmsos")
 
     ## No of failed files (no_failed_files) is tracked and used to print a warning message in log level 1 file.
-    # DelS
+    # DelS calculation
     if(gee_status):
         log.debug("Calculating ∆S")
         no_failed_files = 0
@@ -210,14 +268,15 @@ def run_postprocessing(basin_name, basin_data_dir, reservoir_shpfile, reservoir_
                 # Reading reservoir information
                 reservoir_name = str(reservoir[reservoir_shpfile_column_dict['unique_identifier']])
                 sarea_path = os.path.join(sarea_raw_dir, reservoir_name + ".csv")
-                savepath = os.path.join(dels_savedir, reservoir_name + ".csv")
+                dels_savepath = os.path.join(dels_savedir, reservoir_name + ".csv")
                 aecpath = os.path.join(aec_dir, reservoir_name + ".csv")
 
                 if os.path.isfile(sarea_path):
-                    log.debug(f"Calculating ∆S for {reservoir_name}, saving at: {savepath}")
-                    calc_dels(aecpath, sarea_path, savepath)
+                    log.debug(f"Calculating ∆S for {reservoir_name}, saving at: {dels_savepath}")
+                    calc_dels(aecpath, sarea_path, dels_savepath)
                 else:
                     raise Exception("Surface area file not found; skipping ∆S calculation")
+
             except:
                 log.exception(f"∆S for {reservoir_name} could not be calculated.")
                 no_failed_files += 1
@@ -226,7 +285,35 @@ def run_postprocessing(basin_name, basin_data_dir, reservoir_shpfile, reservoir_
             log_level1.warning(f"∆S was not calculated for {no_failed_files} reservoir(s). Please check Level-2 log file for more details.")
     else:
         log.debug("Cannot Calculate ∆S because GEE Run Failed.")
-
+
+    # NSSC
+    nssc_raw_dir = os.path.join(basin_data_dir,'gee', "gee_nssc")
+    # Copying of NSSC files
+    if(gee_status):
+        log.debug("Copying NSSC data to RAT outputs.")
+        no_failed_files = 0
+
+        for reservoir_no,reservoir in reservoirs.iterrows():
+            try:
+                # Reading reservoir information
+                reservoir_name = str(reservoir[reservoir_shpfile_column_dict['unique_identifier']])
+                nssc_path = os.path.join(nssc_raw_dir, reservoir_name + ".csv")
+                nssc_savepath = os.path.join(nssc_savedir, reservoir_name + ".csv")
+
+                if os.path.isfile(nssc_path):
+                    nssc_df = pd.read_csv(nssc_path)
+                    nssc_df.to_csv(nssc_savepath, index=False)
+                else:
+                    raise Exception(f"NSSC file not found for {reservoir_name}; skipping copy to RAT Outputs")
+
+            except:
+                log.exception(f"NSSC for {reservoir_name} could not be copied.")
+                no_failed_files += 1
+                DELS_STATUS=1
+        if no_failed_files:
+            log_level1.warning(f"NSSC was not copied for {no_failed_files} reservoir(s). 
Please check Level-2 log file for more details.")
+    else:
+        log.debug("Cannot Calculate NSSC because GEE Run Failed.")
 
     # Evaporation
     if(vic_status and gee_status):
@@ -259,13 +346,13 @@ def run_postprocessing(basin_name, basin_data_dir, reservoir_shpfile, reservoir_
                     log.debug(f"Calculating Evaporation for {reservoir_name}")
                     calc_E(reservoir, start_date_str_evap, end_date_str, forcings_path, vic_results_path, sarea, e_path, forecast_mode=forecast_mode)
                 else:
-                    raise Exception("Surface area file not found; skipping evaporation calculation")
+                    raise Exception("Surface area or VIC result file not found; skipping evaporation calculation")
             except:
                 log.exception(f"Evaporation for {reservoir_name} could not be calculated.")
                 no_failed_files +=1
                 EVAP_STATUS = 1
         if no_failed_files:
-            log_level1.warning(f"Evapotration was not calculated for {no_failed_files} reservoir(s). Please check Level-2 log file for more details.")
+            log_level1.warning(f"Evaporation was not calculated for {no_failed_files} reservoir(s). Please check Level-2 log file for more details.")
     elif((not vic_status) and (not gee_status)):
         log.debug("Cannot Retrieve Evaporation because both VIC and GEE Run Failed.")
     elif(vic_status):
diff --git a/src/rat/core/run_routing.py b/src/rat/core/run_routing.py
index a86af263..8890d88a 100644
--- a/src/rat/core/run_routing.py
+++ b/src/rat/core/run_routing.py
@@ -136,7 +136,8 @@ def generate_inflow(src_dir, dst_dir):
 
     files = [src_dir / f for f in src_dir.glob('*.day')]
     if len(files)==0:
-        raise Exception("No routing output was found to generate inflow.")
+        log_level1.warning("No routing output was found to generate inflow. Skipping inflow calculation.")
+        return
 
     if not dst_dir.exists():
         log.error("Directory does not exist: %s", dst_dir)
diff --git a/src/rat/core/run_sarea.py b/src/rat/core/run_sarea.py
index 000b79d6..673d7ed7 100644
--- a/src/rat/core/run_sarea.py
+++ b/src/rat/core/run_sarea.py
@@ -3,19 +3,24 @@
 from logging import getLogger
 
 from rat.utils.logging import LOG_NAME, NOTIFICATION, LOG_LEVEL1_NAME
+from rat.ee_utils.ee_utils import simplify_geometry
 from rat.core.sarea.sarea_cli_s2 import sarea_s2
+from rat.core.sarea.sarea_cli_l5 import sarea_l5
+from rat.core.sarea.sarea_cli_l7 import sarea_l7
 from rat.core.sarea.sarea_cli_l8 import sarea_l8
 from rat.core.sarea.sarea_cli_l9 import sarea_l9
 from rat.core.sarea.sarea_cli_sar import sarea_s1
 from rat.core.sarea.bot_filter import bot_filter
 from rat.core.sarea.TMS import TMS
 
+from rat.core.sarea.multisensor_ssc_integrator import multi_sensor_ssc_integration, normalize_ssc
+
 log = getLogger(f"{LOG_NAME}.{__name__}")
 log_level1 = getLogger(f"{LOG_LEVEL1_NAME}.{__name__}")
 
-def run_sarea(start_date, end_date, datadir, reservoirs_shpfile, shpfile_column_dict, filt_options = None):
+def run_sarea(start_date, end_date, sarea_save_dir, reservoirs_shpfile, shpfile_column_dict, filt_options = None, nssc_save_dir = None):
 
     if isinstance(reservoirs_shpfile, gpd.GeoDataFrame):
         reservoirs_polygon = reservoirs_shpfile
     else:
@@ -26,7 +31,12 @@ def run_sarea(start_date, end_date, datadir, reservoirs_shpfile, shpfile_column_
     Partial_optical_tmsos_files = 0
     i = 1
     for reservoir_no,reservoir in reservoirs_polygon.iterrows():
-        print(f"\n\n +++ PROCESSING RESERVOIR: {reservoir[shpfile_column_dict['id_column']]} - {reservoir[shpfile_column_dict['dam_name_column']]} ({i}/{len(reservoirs_polygon)}) +++\n\n")
+        # Print reservoir id & name (whichever is available)
+        if shpfile_column_dict.get('id_column'):
+            print(f"\n\n +++ PROCESSING RESERVOIR: 
{reservoir[shpfile_column_dict['id_column']]} - {reservoir[shpfile_column_dict['dam_name_column']]} ({i}/{len(reservoirs_polygon)}) +++\n\n") + else: + print(f"\n\n +++ PROCESSING RESERVOIR: {reservoir[shpfile_column_dict['dam_name_column']]} ({i}/{len(reservoirs_polygon)}) +++\n\n") + i += 1 try: # Reading reservoir information @@ -34,7 +44,7 @@ def run_sarea(start_date, end_date, datadir, reservoirs_shpfile, shpfile_column_ reservoir_area = float(reservoir[shpfile_column_dict['area_column']]) reservoir_polygon = reservoir.geometry log.info(f"Calculating surface area for {reservoir_name}.") - method = run_sarea_for_res(reservoir_name, reservoir_area, reservoir_polygon, start_date, end_date, datadir) + method = run_sarea_for_res(reservoir_name, reservoir_area, reservoir_polygon, start_date, end_date, sarea_save_dir, nssc_save_dir) log.info(f"Calculated surface area for {reservoir_name} successfully using {method} method.") if method == 'Optical': Optical_files += 1 @@ -60,35 +70,60 @@ def run_sarea(start_date, end_date, datadir, reservoirs_shpfile, shpfile_column_ log_level1.error(f"BOT Filter run failed for all reservoirs.") log_level1.error("Filter values out of bounds. Please ensure that a value in between 0 and 9 is selected") else: - bot_filter(datadir,shpfile_column_dict,reservoirs_shpfile,**filt_options) - -def run_sarea_for_res(reservoir_name, reservoir_area, reservoir_polygon, start_date, end_date, datadir): + bot_filter(sarea_save_dir,shpfile_column_dict,reservoirs_shpfile,**filt_options) +def run_sarea_for_res(reservoir_name, reservoir_area, reservoir_polygon, start_date, end_date, sarea_save_dir, nssc_save_dir, simplification=True): + + if simplification: + # Below function simplifies geometry with shape index (complexity) higher than a threshold, otherwise original geometry is retained + reservoir_polygon = simplify_geometry(reservoir_polygon) + # Obtain surface areas # Sentinel-2 log.debug(f"Reservoir: {reservoir_name}; Downloading Sentinel-2 data from {start_date} to {end_date}") - sarea_s2(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(datadir, 's2')) - s2_dfpath = os.path.join(datadir, 's2', reservoir_name+'.csv') + sarea_s2(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 's2')) + s2_dfpath = os.path.join(sarea_save_dir, 's2', reservoir_name+'.csv') + + # Landsat-5 + log.debug(f"Reservoir: {reservoir_name}; Downloading Landsat-5 data from {start_date} to {end_date}") + sarea_l5(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 'l5')) + l5_dfpath = os.path.join(sarea_save_dir, 'l5', reservoir_name+'.csv') + + # Landsat-7 + log.debug(f"Reservoir: {reservoir_name}; Downloading Landsat-7 data from {start_date} to {end_date}") + sarea_l7(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 'l7')) + l7_dfpath = os.path.join(sarea_save_dir, 'l7', reservoir_name+'.csv') # Landsat-8 log.debug(f"Reservoir: {reservoir_name}; Downloading Landsat-8 data from {start_date} to {end_date}") - sarea_l8(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(datadir, 'l8')) - l8_dfpath = os.path.join(datadir, 'l8', reservoir_name+'.csv') + sarea_l8(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 'l8')) + l8_dfpath = os.path.join(sarea_save_dir, 'l8', reservoir_name+'.csv') # Landsat-9 log.debug(f"Reservoir: {reservoir_name}; Downloading Landsat-9 data from {start_date} to {end_date}") - sarea_l9(reservoir_name, 
reservoir_polygon, start_date, end_date, os.path.join(datadir, 'l9')) - l9_dfpath = os.path.join(datadir, 'l9', reservoir_name+'.csv') + sarea_l9(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 'l9')) + l9_dfpath = os.path.join(sarea_save_dir, 'l9', reservoir_name+'.csv') # Sentinel-1 log.debug(f"Reservoir: {reservoir_name}; Downloading Sentinel-1 data from {start_date} to {end_date}") - s1_dfpath = sarea_s1(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(datadir, 'sar')) - s1_dfpath = os.path.join(datadir, 'sar', reservoir_name+'_12d_sar.csv') + s1_dfpath = sarea_s1(reservoir_name, reservoir_polygon, start_date, end_date, os.path.join(sarea_save_dir, 'sar')) + s1_dfpath = os.path.join(sarea_save_dir, 'sar', reservoir_name+'_12d_sar.csv') + # Using TMSOS to ensemble surface area data tmsos = TMS(reservoir_name, reservoir_area) - result,method = tmsos.tms_os(l9_dfpath=l9_dfpath, l8_dfpath=l8_dfpath, s2_dfpath=s2_dfpath, s1_dfpath=s1_dfpath) - - tmsos_savepath = os.path.join(datadir, reservoir_name+'.csv') + result,method = tmsos.tms_os(l5_dfpath=l5_dfpath, l7_dfpath=l7_dfpath, l9_dfpath=l9_dfpath, l8_dfpath=l8_dfpath, + s2_dfpath=s2_dfpath, s1_dfpath=s1_dfpath) + tmsos_savepath = os.path.join(sarea_save_dir, reservoir_name+'.csv') log.debug(f"Saving surface area of {reservoir_name} at {tmsos_savepath}") result.reset_index().rename({'index': 'date', 'filled_area': 'area'}, axis=1).to_csv(tmsos_savepath, index=False) + + # NSSC calculations + if nssc_save_dir: + ssc_components_df = multi_sensor_ssc_integration(l5_dfpath=l5_dfpath, l7_dfpath=l7_dfpath, l9_dfpath=l9_dfpath, l8_dfpath=l8_dfpath, + s2_dfpath=s2_dfpath) + nssc_df = normalize_ssc(ssc_components_df) + nssc_savepath = os.path.join(nssc_save_dir, reservoir_name+'.csv') + log.debug(f"Saving NSSC data of {reservoir_name} at {nssc_savepath}") + nssc_df.to_csv(nssc_savepath, index=False) + return method diff --git a/src/rat/core/sarea/TMS.py b/src/rat/core/sarea/TMS.py index 0fc98a0a..96c187fc 100644 --- a/src/rat/core/sarea/TMS.py +++ b/src/rat/core/sarea/TMS.py @@ -32,9 +32,11 @@ def __init__(self, reservoir_name, area=None, AREA_DEVIATION_THRESHOLD_PCNT=5): self.AREA_DEVIATION_THRESHOLD = self.area * AREA_DEVIATION_THRESHOLD_PCNT/100 def tms_os(self, + l5_dfpath: str = "", + l7_dfpath: str = "", l8_dfpath: str = "", + l9_dfpath: str = "", s2_dfpath: str = "", - l9_dfpath: str = "", s1_dfpath: str = "", CLOUD_THRESHOLD: float = 90.0, MIN_DATE: str = '2019-01-01' @@ -42,7 +44,10 @@ def tms_os(self, ## TODO: add conditional, S1 required, any one of optical datasets required """Implements the TMS-OS methodology Args: - l8_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_l8.py` - Landsat derived surface areas + l5_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_l5.py` - Landsat-5 derived surface areas + l7_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_l7.py` - Landsat-7 derived surface areas + l8_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_l8.py` - Landsat-8 derived surface areas + l9_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_l9.py` - Landsat-9 derived surface areas s2_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_s2.py` - Sentinel-2 derived surface areas s1_dfpath (string): Path of the surface area dataframe obtained using `sarea_cli_sar.py` - Sentinel-1 derived surface areas CLOUD_THRESHOLD 
(float): Threshold to use for cloud-masking in % (default: 90.0)
@@ -51,11 +56,91 @@ def tms_os(self,
         MIN_DATE = pd.to_datetime(MIN_DATE, format='%Y-%m-%d')
         S2_TEMPORAL_RESOLUTION = 5
         S1_TEMPORAL_RESOLUTION = 12
+        L5_TEMPORAL_RESOLUTION = 16
+        L7_TEMPORAL_RESOLUTION = 16
         L8_TEMPORAL_RESOLUTION = 16
         L9_TEMPORAL_RESOLUTION = 16
 
         TO_MERGE = []
 
+        if os.path.isfile(l5_dfpath):
+            # Read in Landsat-5
+            l5df = pd.read_csv(l5_dfpath, parse_dates=['mosaic_enddate']).rename({
+                'mosaic_enddate': 'date',
+                'water_area_cordeiro': 'water_area_uncorrected',
+                'non_water_area_cordeiro': 'non_water_area',
+                'corrected_area_cordeiro': 'water_area_corrected'
+            }, axis=1).set_index('date')
+            l5df = l5df[['water_area_uncorrected', 'non_water_area', 'cloud_area', 'water_area_corrected']]
+            l5df.replace(-1, np.nan, inplace=True)
+            l5df['cloud_percent'] = l5df['cloud_area'] * 100 / l5df[['water_area_uncorrected', 'non_water_area', 'cloud_area']].sum(axis=1, skipna=True)
+
+            # QUALITY_DESCRIPTION
+            # 0: Good, not interpolated (no missing data or high clouds)
+            # 1: Poor, interpolated due to high clouds
+            # 2: Poor, interpolated due to missing data
+            l5df.loc[:, "QUALITY_DESCRIPTION"] = 0
+            l5df.loc[l5df['cloud_percent']>=CLOUD_THRESHOLD, ("water_area_uncorrected", "non_water_area", "water_area_corrected")] = np.nan
+            l5df.loc[l5df['cloud_percent']>=CLOUD_THRESHOLD, "QUALITY_DESCRIPTION"] = 1
+
+            # in some cases l5df may have duplicated rows (with same values) that have to be removed
+            if l5df.index.duplicated().sum() > 0:
+                print("Duplicated labels, deleting")
+                l5df = l5df[~l5df.index.duplicated(keep='last')]
+
+            # Fill in the gaps in l5df created due to high cloud cover with np.nan values
+            l5df_interpolated = l5df.reindex(pd.date_range(l5df.index[0], l5df.index[-1], freq=f'{L5_TEMPORAL_RESOLUTION}D'))
+            l5df_interpolated.loc[np.isnan(l5df_interpolated["QUALITY_DESCRIPTION"]), "QUALITY_DESCRIPTION"] = 2
+            l5df_interpolated.loc[np.isnan(l5df_interpolated['cloud_area']), 'cloud_area'] = max(l5df['cloud_area'])
+            l5df_interpolated.loc[np.isnan(l5df_interpolated['cloud_percent']), 'cloud_percent'] = 100
+            l5df_interpolated.loc[np.isnan(l5df_interpolated['non_water_area']), 'non_water_area'] = 0
+            l5df_interpolated.loc[np.isnan(l5df_interpolated['water_area_uncorrected']), 'water_area_uncorrected'] = 0
+
+            # Interpolate bad data for l5 (max up to 6-7 months, i.e. seasonal: 13 gaps x 16 days ≈ 7 months)
+            l5df_interpolated.loc[:, "water_area_corrected"] = l5df_interpolated.loc[:, "water_area_corrected"].interpolate(method="linear", limit_direction="forward", limit=13)
+            l5df_interpolated['sat'] = 'l5'
+
+            TO_MERGE.append(l5df_interpolated)
+
+        if os.path.isfile(l7_dfpath):
+            # Read in Landsat-7
+            l7df = pd.read_csv(l7_dfpath, parse_dates=['mosaic_enddate']).rename({
+                'mosaic_enddate': 'date',
+                'water_area_cordeiro': 'water_area_uncorrected',
+                'non_water_area_cordeiro': 'non_water_area',
+                'corrected_area_cordeiro': 'water_area_corrected'
+            }, axis=1).set_index('date')
+            l7df = l7df[['water_area_uncorrected', 'non_water_area', 'cloud_area', 'water_area_corrected']]
+            l7df.replace(-1, np.nan, inplace=True)
+            l7df['cloud_percent'] = l7df['cloud_area'] * 100 / l7df[['water_area_uncorrected', 'non_water_area', 'cloud_area']].sum(axis=1, skipna=True)
+
+            # QUALITY_DESCRIPTION
+            # 0: Good, not interpolated (no missing data or high clouds)
+            # 1: Poor, interpolated due to high clouds
+            # 2: Poor, interpolated due to missing data
+            l7df.loc[:, "QUALITY_DESCRIPTION"] = 0
+            
l7df.loc[l7df['cloud_percent']>=CLOUD_THRESHOLD, ("water_area_uncorrected", "non_water_area", "water_area_corrected")] = np.nan
+            l7df.loc[l7df['cloud_percent']>=CLOUD_THRESHOLD, "QUALITY_DESCRIPTION"] = 1
+
+            # in some cases l7df may have duplicated rows (with same values) that have to be removed
+            if l7df.index.duplicated().sum() > 0:
+                print("Duplicated labels, deleting")
+                l7df = l7df[~l7df.index.duplicated(keep='last')]
+
+            # Fill in the gaps in l7df created due to high cloud cover with np.nan values
+            l7df_interpolated = l7df.reindex(pd.date_range(l7df.index[0], l7df.index[-1], freq=f'{L7_TEMPORAL_RESOLUTION}D'))
+            l7df_interpolated.loc[np.isnan(l7df_interpolated["QUALITY_DESCRIPTION"]), "QUALITY_DESCRIPTION"] = 2
+            l7df_interpolated.loc[np.isnan(l7df_interpolated['cloud_area']), 'cloud_area'] = max(l7df['cloud_area'])
+            l7df_interpolated.loc[np.isnan(l7df_interpolated['cloud_percent']), 'cloud_percent'] = 100
+            l7df_interpolated.loc[np.isnan(l7df_interpolated['non_water_area']), 'non_water_area'] = 0
+            l7df_interpolated.loc[np.isnan(l7df_interpolated['water_area_uncorrected']), 'water_area_uncorrected'] = 0
+
+            # Interpolate bad data for l7 (max up to 6-7 months, i.e. seasonal: 13 gaps x 16 days ≈ 7 months)
+            l7df_interpolated.loc[:, "water_area_corrected"] = l7df_interpolated.loc[:, "water_area_corrected"].interpolate(method="linear", limit_direction="forward", limit=13)
+            l7df_interpolated['sat'] = 'l7'
+
+            TO_MERGE.append(l7df_interpolated)
+
         if os.path.isfile(l8_dfpath):
             # Read in Landsat-8
             l8df = pd.read_csv(l8_dfpath, parse_dates=['mosaic_enddate']).rename({
@@ -65,8 +150,8 @@ def tms_os(self,
                 'corrected_area_cordeiro': 'water_area_corrected'
             }, axis=1).set_index('date')
             l8df = l8df[['water_area_uncorrected', 'non_water_area', 'cloud_area', 'water_area_corrected']]
-            l8df['cloud_percent'] = l8df['cloud_area']*100/(l8df['water_area_uncorrected']+l8df['non_water_area']+l8df['cloud_area'])
             l8df.replace(-1, np.nan, inplace=True)
+            l8df['cloud_percent'] = l8df['cloud_area'] * 100 / l8df[['water_area_uncorrected', 'non_water_area', 'cloud_area']].sum(axis=1, skipna=True)
 
             # QUALITY_DESCRIPTION
@@ -105,8 +190,8 @@ def tms_os(self,
                 'corrected_area_cordeiro': 'water_area_corrected'
             }, axis=1).set_index('date')
             l9df = l9df[['water_area_uncorrected', 'non_water_area', 'cloud_area', 'water_area_corrected']]
-            l9df['cloud_percent'] = l9df['cloud_area']*100/(l9df['water_area_uncorrected']+l9df['non_water_area']+l9df['cloud_area'])
             l9df.replace(-1, np.nan, inplace=True)
+            l9df['cloud_percent'] = l9df['cloud_area'] * 100 / l9df[['water_area_uncorrected', 'non_water_area', 'cloud_area']].sum(axis=1, skipna=True)
 
             # QUALITY_DESCRIPTION
@@ -205,13 +290,14 @@ def tms_os(self,
                 sar.loc[extrapolated_date, "sarea"] = extrapolated_value
 
             sar = sar.rename({'sarea': 'area'}, axis=1)
+            sar_correction = 'Possible'
         # If SAR has less than 3 points
         else:
-            sar = None
+            sar_correction = 'Not Possible'
             print("Sentinel-1 SAR has less than 3 data points.")
     # If SAR file does not exist
     else:
-        sar = None
+        sar_correction = 'Not Possible'
         print("Sentinel-1 SAR file does not exist.")
 
     # combine opticals into one dataframes
@@ -219,40 +305,91 @@ def tms_os(self,
     optical = optical.loc[~optical.index.duplicated(keep='last')] # when both s2 and l8 are present, keep s2
     optical.rename({'water_area_corrected': 'area'}, axis=1, inplace=True)
-
+    error_message = None
     # Apply the trend based corrections
-    if(sar is not None):
- # If Optical begins before SAR and has a difference of more than 15 days - if(sar.index[0]-optical.index[0]>pd.Timedelta(days=15)): - # Optical without SAR - optical_with_no_sar = optical[optical.index[0]:sar.index[0]].copy() - optical_with_no_sar['non-smoothened optical area'] = optical_with_no_sar['area'] - optical_with_no_sar.loc[:, 'days_passed'] = optical.index.to_series().diff().dt.days.fillna(0) - # Calculate smoothed values with moving weighted average method if more than 7 values; weights are calculated using cloud percent. - if len(optical_with_no_sar)>7: - optical_with_no_sar['filled_area'] = weighted_moving_average(optical_with_no_sar['non-smoothened optical area'], weights = (101-optical_with_no_sar['cloud_percent']),window_size=3) - # Drop 'area' column from optical_with_no_sar - optical_with_no_sar = optical_with_no_sar.drop('area',axis=1) - # Optical with SAR - optical_with_sar = trend_based_correction(optical.copy(), sar.copy(), self.AREA_DEVIATION_THRESHOLD) - # Merge both - result = pd.concat([optical_with_no_sar,optical_with_sar],axis=0) - # Smoothen the combined surface area estimates to avoid noise or peaks using savgol_filter if more than 9 values (to increase smoothness and include more points as we have both TMS-OS and Optical) - if len(result)>9: - result['filled_area'] = savgol_filter(result['filled_area'], window_length=7, polyorder=3) - method = 'Combine' - # If SAR begins before Optical - else: - result = trend_based_correction(optical.copy(), sar.copy(), self.AREA_DEVIATION_THRESHOLD) - method = 'TMS-OS' - else: + if(sar_correction == 'Possible'): + try: + # If Optical begins before SAR and has a difference of more than 15 days + if(sar.index[0]-optical.index[0]>pd.Timedelta(days=15)): + # Optical without SAR + optical_with_no_sar = optical[optical.index[0]:sar.index[0]].copy() + optical_with_no_sar['non-smoothened optical area'] = optical_with_no_sar['area'] + optical_with_no_sar.loc[:, 'days_passed'] = optical.index.to_series().diff().dt.days.fillna(0) + # Calculate smoothed values with moving weighted average method if more than 7 values; weights are calculated using cloud percent. 
+ if len(optical_with_no_sar)>7: + # Temporarily interpolate missing values + optical_with_no_sar['interpolated_area'] = optical_with_no_sar['non-smoothened optical area'].interpolate(method='linear') + + # Apply weighted moving average using interpolated values + optical_with_no_sar['smoothed_area'] = weighted_moving_average( + optical_with_no_sar['interpolated_area'], + weights=(101 - optical_with_no_sar['cloud_percent']), + window_size=3 + ) + + # Restore NaN values to preserve the original structure + optical_with_no_sar['filled_area'] = optical_with_no_sar['smoothed_area'].where( + ~optical_with_no_sar['non-smoothened optical area'].isna() + ) + + # Drop temporary columns + optical_with_no_sar = optical_with_no_sar.drop(['interpolated_area', 'smoothed_area'], axis=1) + + # Drop 'area' column from optical_with_no_sar + optical_with_no_sar = optical_with_no_sar.drop('area',axis=1) + # Optical with SAR + optical_with_sar = trend_based_correction(optical.copy(), sar.copy(), self.AREA_DEVIATION_THRESHOLD) + # Merge both + result = pd.concat([optical_with_no_sar,optical_with_sar],axis=0) + # Smoothen the combined surface area estimates to avoid noise or peaks using savgol_filter if more than 9 values (to increase smoothness and include more points as we have both TMS-OS and Optical) + if len(result)>9: + # Temporarily interpolate missing values for smoothing + result['interpolated_area'] = result['filled_area'].interpolate(method='linear') + + # Apply Savitzky-Golay filter + result['smoothed_filled_area'] = savgol_filter( + result['interpolated_area'], window_length=7, polyorder=3 + ) + + # Restore NaN values + filled_area_with_nans = result['smoothed_filled_area'].where( + ~result['filled_area'].isna() + ) + result['filled_area'] = filled_area_with_nans + + # Drop temporary columns + result = result.drop(['interpolated_area', 'smoothed_filled_area'], axis=1) + + method = 'Combine' + # If SAR begins before Optical + + else: + result = trend_based_correction(optical.copy(), sar.copy(), self.AREA_DEVIATION_THRESHOLD) + method = 'TMS-OS' + except Exception as e: + sar_correction = 'Failed' + error_message = e + + if sar_correction=='Failed' or sar_correction=='Not Possible': + if sar_correction=='Failed': + print(f'SAR trend based correction (TMS-OS) failed due to {error_message}. 
Using only optical data.')
+
         result = optical.copy()
         result['non-smoothened optical area'] = result['area']
         result.loc[:, 'days_passed'] = optical.index.to_series().diff().dt.days.fillna(0)
         # Calculate smoothed values with Savitzky-Golay method if more than 7 values
         if len(result)>7:
-            result['filled_area'] = weighted_moving_average(result['non-smoothened optical area'], weights = (101-result['cloud_percent']),window_size=3)
-            result['filled_area'] = savgol_filter(result['filled_area'], window_length=7, polyorder=3)
+            filled_area_temp = weighted_moving_average(
+                result['non-smoothened optical area'].interpolate(method='linear'),  # Temporarily interpolate for calculation
+                weights=(101 - result['cloud_percent']),
+                window_size=3
+            )
+
+            # Apply Savitzky-Golay filter
+            smoothed_area_temp = savgol_filter(filled_area_temp, window_length=7, polyorder=3)
+
+            # Reapply NaN mask to maintain sparse points (build the Series on result's index so values align)
+            result['filled_area'] = pd.Series(smoothed_area_temp, index=result.index).where(~result['non-smoothened optical area'].isna())
         method = 'Optical'
     # Returning method used for surface area estimation
     return result,method
diff --git a/src/rat/core/sarea/multisensor_ssc_integrator.py b/src/rat/core/sarea/multisensor_ssc_integrator.py
new file mode 100644
index 00000000..a2143170
--- /dev/null
+++ b/src/rat/core/sarea/multisensor_ssc_integrator.py
@@ -0,0 +1,146 @@
+import numpy as np
+import pandas as pd
+import os
+from sklearn.preprocessing import MinMaxScaler
+
+def multi_sensor_ssc_integration(l5_dfpath='', l7_dfpath='', l8_dfpath='', l9_dfpath='', s2_dfpath=''):
+    """
+    Integrates Suspended Sediment Concentration (SSC) data from multiple satellite sources into a single DataFrame.
+
+    This function reads SSC data from CSV files generated by different satellite sensors (Landsat 5, 7, 8, 9, and Sentinel-2),
+    merges the data into a unified DataFrame, and sorts it by date. Each row in the resulting DataFrame contains the SSC
+    values from one of the sensors, and an additional column indicates the satellite source. Only unique dates are retained,
+    with the first occurrence kept in cases of duplicates.
+
+    Parameters:
+    ----------
+    l5_dfpath : str, optional
+        Path to the CSV file for Landsat 5 data. Default is an empty string.
+    l7_dfpath : str, optional
+        Path to the CSV file for Landsat 7 data. Default is an empty string.
+    l8_dfpath : str, optional
+        Path to the CSV file for Landsat 8 data. Default is an empty string.
+    l9_dfpath : str, optional
+        Path to the CSV file for Landsat 9 data. Default is an empty string.
+    s2_dfpath : str, optional
+        Path to the CSV file for Sentinel-2 data. Default is an empty string.
+
+    Returns:
+    -------
+    pd.DataFrame
+        A DataFrame containing the combined SSC proxy ratios from the specified satellite sources.
+        Columns:
+        - date: The date of the SSC measurement.
+        - water_red_sum: Sum of red band reflectance values over water pixels.
+        - water_green_sum: Sum of green band reflectance values over water pixels.
+        - water_nir_sum: Sum of NIR band reflectance values over water pixels.
+        - water_red_green_mean: Mean of red/green ratio over water pixels.
+        - water_nir_red_mean: Mean of NIR/red ratio over water pixels.
+        - sat: The satellite source for each row (l5, l7, l8, l9, or s2).
+
+    Raises:
+    ------
+    ValueError
+        If none of the specified file paths exist or are provided.
+
+    Notes:
+    ------
+    - At least one valid file path must be provided for the function to execute successfully.
+ + Example Usage: + -------------- + >>> l5_path = 'path/to/l5_data.csv' + >>> l7_path = 'path/to/l7_data.csv' + >>> s2_path = 'path/to/s2_data.csv' + >>> merged_ssc_df = multi_sensor_ssc_integration(l5_dfpath=l5_path, l7_dfpath=l7_path, s2_dfpath=s2_path) + >>> print(merged_ssc_df.head()) + """ + + # Check that at least one valid file path is provided + if not any([os.path.isfile(p) for p in [l5_dfpath, l7_dfpath, l8_dfpath, l9_dfpath, s2_dfpath]]): + raise ValueError("At least one valid file path must be provided.") + + # Define a list to store DataFrames + dfs = [] + + # Load and process each file if it exists + for path, satellite in zip( + [l5_dfpath, l7_dfpath, l8_dfpath, l9_dfpath, s2_dfpath], + ['l5', 'l7', 'l8', 'l9', 's2'] + ): + if os.path.isfile(path): # Check if file exists + df = pd.read_csv(path) + # Rename date column for consistency + if satellite == 's2': + df = df.rename(columns={'date': 'mosaic_enddate'}) + df['date'] = pd.to_datetime(df['mosaic_enddate']) + df['sat'] = satellite # Add satellite column + # Select only relevant columns + df = df[['date', 'water_red_sum', 'water_green_sum', 'water_nir_sum', 'water_red_green_mean', 'water_nir_red_mean', 'sat']] + dfs.append(df) + + # Concatenate all DataFrames + merged_df = pd.concat(dfs, ignore_index=True) + + # Sort by date and drop duplicates keeping the first occurrence + merged_df = merged_df.sort_values(by='date').drop_duplicates(subset='date', keep='first').reset_index(drop=True) + + return merged_df + +def normalize_ssc(df): + """ + Creates Normalized Suspended Sediment Concentration (NSSC) values in a DataFrame using ratio of reflectance values. + + This function adds four new columns to the DataFrame by normalizing pixel-level and reservoir-level + SSC estimates. Pixel-level SSC ratios are normalized directly from `water_red_green_mean` and + `water_nir_red_mean`. Reservoir-level SSC ratios are calculated by dividing sum columns + (`water_red_sum`, `water_green_sum`, `water_nir_sum`) and then normalized. + + Parameters: + ---------- + df : pd.DataFrame + The DataFrame containing SSC data, with columns: + - water_red_sum + - water_green_sum + - water_nir_sum + - water_red_green_mean + - water_nir_red_mean + + Returns: + ------- + pd.DataFrame + The original DataFrame with the following new columns added: + - nssc_rd_gn_px: Normalized pixel-level SSC ratio (red/green). + - nssc_nr_rd_px: Normalized pixel-level SSC ratio (nir/red). + - nssc_rd_gn_res: Normalized reservoir-level SSC ratio (red/green). + - nssc_nr_rd_res: Normalized reservoir-level SSC ratio (nir/red). 
+ """ + + # Initialize scaler + scaler = MinMaxScaler() + + ## Normalize pixel-level SSC ratios, ignoring NaNs + if df['water_red_green_mean'].notna().any(): + non_na_data = df['water_red_green_mean'].dropna().values.reshape(-1, 1) + df.loc[df['water_red_green_mean'].notna(), 'nssc_rd_gn_px'] = scaler.fit_transform(non_na_data).flatten() + + if df['water_nir_red_mean'].notna().any(): + non_na_data = df['water_nir_red_mean'].dropna().values.reshape(-1, 1) + df.loc[df['water_nir_red_mean'].notna(), 'nssc_nr_rd_px'] = scaler.fit_transform(non_na_data).flatten() + + ## Calculate reservoir-level SSC ratios and normalize, ignoring NaNs + valid_ratio_rd_gn = (df['water_red_sum'] / df['water_green_sum']) + # Replace infinities with NaN and drop NaNs + valid_ratio_rd_gn = valid_ratio_rd_gn.replace([np.inf, -np.inf], np.nan).dropna() + # Ensure there are valid values before scaling + if not valid_ratio_rd_gn.empty: + scaled_data = scaler.fit_transform(valid_ratio_rd_gn.to_numpy().reshape(-1, 1)).flatten() + df.loc[valid_ratio_rd_gn.index, 'nssc_rd_gn_res'] = scaled_data + + valid_ratio_nr_rd = (df['water_nir_sum'] / df['water_red_sum']) + valid_ratio_nr_rd = valid_ratio_nr_rd.replace([np.inf, -np.inf], np.nan).dropna() + if not valid_ratio_nr_rd.empty: + scaled_data = scaler.fit_transform(valid_ratio_nr_rd.to_numpy().reshape(-1, 1)).flatten() + df.loc[valid_ratio_nr_rd.index, 'nssc_nr_rd_res'] = scaled_data + + return df \ No newline at end of file diff --git a/src/rat/core/sarea/sarea_cli_l5.py b/src/rat/core/sarea/sarea_cli_l5.py new file mode 100644 index 00000000..cff41afc --- /dev/null +++ b/src/rat/core/sarea/sarea_cli_l5.py @@ -0,0 +1,579 @@ +import ee +from datetime import date, datetime, timedelta, timezone +import pandas as pd +import time +import os +from random import randint +from itertools import zip_longest + +from rat.ee_utils.ee_utils import poly2feature +from rat.utils.logging import LOG_NAME, NOTIFICATION +from rat.utils.utils import days_between +from logging import getLogger + +# Defining global constants +log = getLogger(f"{LOG_NAME}.{__name__}") + +l5 = ee.ImageCollection("LANDSAT/LT05/C02/T1_L2") +gswd = ee.Image("JRC/GSW1_4/GlobalSurfaceWater") + +NDWI_THRESHOLD = 0.3 +SMALL_SCALE = 30 +MEDIUM_SCALE = 120 +LARGE_SCALE = 500 +BUFFER_DIST = 500 +CLOUD_COVER_LIMIT = 90 +TEMPORAL_RESOLUTION = 16 +RESULTS_PER_ITER = 5 +MIN_RESULTS_PER_ITER = 1 +QUALITY_PIXEL_BAND_NAME = 'QA_PIXEL' +BLUE_BAND_NAME = 'SR_B1' +GREEN_BAND_NAME = 'SR_B2' +RED_BAND_NAME = 'SR_B3' +NIR_BAND_NAME = 'SR_B4' +SWIR1_BAND_NAME = 'SR_B5' +SWIR2_BAND_NAME = 'SR_B7' +MISSION_END_DATE = date(2012,5,6) # last date of landsat 5 data on GEE + +def preprocess(im): + clipped = im # clipping adds processing overhead, setting clipped = im + + # clipped = ee.Image(ee.Algorithms.If('BQA' in clipped.bandNames(), preprocess_image_mask(im), clipped.updateMask(ee.Image.constant(1)))) + # Mask appropriate QA bits + QA = im.select([QUALITY_PIXEL_BAND_NAME]) + + cloudShadowBitMask = 1 << 3 + cloudsBitMask = 1 << 4 + + cloud = (QA.bitwiseAnd(cloudsBitMask).neq(0)).Or(QA.bitwiseAnd(cloudShadowBitMask).neq(0)).rename("cloud") + + clipped = clipped.updateMask(cloud.eq(0).select("cloud")) + + # SR scaling + clipped = clipped.select('SR_B.').multiply(0.0000275).add(-0.2) + clipped = clipped.addBands(cloud) + + clipped = clipped.set('system:time_start', im.get('system:time_start')) + clipped = clipped.set('system:time_end', im.get('system:time_end')) + # clipped = clipped.set('cloud_area', cloud_area) + + return clipped + + 
+##########################################################/ +## Processing individual images - water classification ## +##########################################################/ + +def identify_water_cluster(im, max_cluster_value): + im = ee.Image(im) + mbwi = im.select('MBWI') + + clusters = ee.List.sequence(0, max_cluster_value) + + def calc_avg_mbwi(cluster_val): + cluster_val = ee.Number(cluster_val) + avg_mbwi = mbwi.updateMask(im.select('cluster').eq(ee.Image(cluster_val))).reduceRegion( + reducer = ee.Reducer.mean(), + scale = MEDIUM_SCALE, + geometry = aoi, + maxPixels = 1e10 + ).get('MBWI') + avg_mbwi_not_null = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual(ee.Number(avg_mbwi)), + ee.Number(-99), + ee.Number(avg_mbwi))) + return avg_mbwi_not_null + # print(calc_avg_mbwi(clusters.get(0)).getInfo()) + avg_mbwis = ee.Array(clusters.map(calc_avg_mbwi)) + + max_mbwi_index = avg_mbwis.argmax().get(0) + + water_cluster = clusters.get(max_mbwi_index) + + return water_cluster + + +def cordeiro(im): + ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on + ## calinski harabasz's work + ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans + band_subset = ee.List(['NDWI', 'MNDWI', SWIR2_BAND_NAME]) # using NDWI, MNDWI and B7 (SWIR2) + sampled_pts = im.select(band_subset).sample( + region = aoi, + scale = SMALL_SCALE, + numPixels = 5e3-1 ## limit of 5k points + ) + + no_sampled_pts = sampled_pts.size() + + def if_enough_sample_pts(im): + clusterer = ee.Clusterer.wekaCascadeKMeans( + minClusters = 2, + maxClusters = 7, + init = True + ).train(sampled_pts) + + classified = im.select(band_subset).cluster(clusterer) + im = im.addBands(classified) + max_cluster_value = ee.Number(im.select('cluster').reduceRegion( + reducer = ee.Reducer.max(), + geometry = aoi, + scale = LARGE_SCALE, + maxPixels = 1e10 + ).get('cluster')) + return ee.Dictionary({'max_cluster_value': max_cluster_value, 'classified': classified}) + + def if_not_enough_sample_pts(): + return ee.Dictionary({'max_cluster_value': ee.Number(0), 'classified': ee.Image(0).rename('cluster')}) + + # If clustering is possible do clustering + def if_clustering_possible(max_cluster_value,classified,im): + im = im.addBands(classified) + + water_cluster = identify_water_cluster(im, max_cluster_value) + water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + return water_map + + # If no clustering is possible, use NDWI water map + def if_clustering_not_possible(im): + water_map = im.select(['water_map_NDWI'],['water_map_cordeiro']) + return water_map + + + after_training_dict = ee.Dictionary(ee.Algorithms.If(ee.Number(no_sampled_pts), + if_enough_sample_pts(im), + if_not_enough_sample_pts())) + + max_cluster_value = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual( + ee.Number(after_training_dict.get('max_cluster_value'))), + ee.Number(0), + ee.Number(after_training_dict.get('max_cluster_value')))) + + classified = ee.Image(after_training_dict.get('classified')) + water_map = ee.Image(ee.Algorithms.If(ee.Algorithms.IsEqual(max_cluster_value,ee.Number(0)), + if_clustering_not_possible(im), + if_clustering_possible(max_cluster_value, + classified, + im) + )) + im = im.addBands(water_map) + + return im + + +def process_image(im): + ndwi = im.normalizedDifference([NIR_BAND_NAME, SWIR1_BAND_NAME]).rename('NDWI') + im = im.addBands(ndwi) + mndwi = im.normalizedDifference([GREEN_BAND_NAME, 
SWIR1_BAND_NAME]).rename('MNDWI')
+    im = im.addBands(mndwi)
+    mbwi = im.expression("MBWI = 3*green-red-nir-swir1-swir2", {
+        'green': im.select(GREEN_BAND_NAME),
+        'red': im.select(RED_BAND_NAME),
+        'nir': im.select(NIR_BAND_NAME),
+        'swir1': im.select(SWIR1_BAND_NAME),
+        'swir2': im.select(SWIR2_BAND_NAME)
+    })
+    im = im.addBands(mbwi)
+
+    # cloud_area = AOI area - area of pixels in the cloud band that have no data because we masked them due to clouds
+    cloud_area = aoi.area().subtract(im.select('cloud').Not().multiply(ee.Image.pixelArea()).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('cloud'))
+    cloud_percent = cloud_area.multiply(100).divide(aoi.area())
+
+    cordeiro_will_run_when = cloud_percent.lt(CLOUD_COVER_LIMIT)
+    # NDWI based water map: Classify water wherever NDWI is greater than NDWI_THRESHOLD and add water_map_NDWI band.
+    im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI']))
+    # Clustering based classification of water pixels for cloud pixels when cloud cover is less than CLOUD_COVER_LIMIT
+    im = im.addBands(ee.Image(ee.Algorithms.If(cordeiro_will_run_when, cordeiro(im), ee.Image.constant(-1e6)))) # run cordeiro only if cloud percent is < 90%
+
+    # Calculate water area in the water_map_cordeiro map
+    water_area_cordeiro = ee.Number(ee.Algorithms.If(cordeiro_will_run_when,
+        ee.Number(im.select('water_map_cordeiro').eq(1).multiply(ee.Image.pixelArea()).reduceRegion(
+            reducer = ee.Reducer.sum(),
+            geometry = aoi,
+            scale = SMALL_SCALE,
+            maxPixels = 1e10
+        ).get('water_map_cordeiro')),
+        ee.Number(-1e6)
+    ))
+    # Calculate non-water area in the water_map_cordeiro map.
+    non_water_area_cordeiro = ee.Number(ee.Algorithms.If(cordeiro_will_run_when,
+        ee.Number(im.select('water_map_cordeiro').neq(1).multiply(ee.Image.pixelArea()).reduceRegion(
+            reducer = ee.Reducer.sum(),
+            geometry = aoi,
+            scale = SMALL_SCALE,
+            maxPixels = 1e10
+        ).get('water_map_cordeiro')),
+        ee.Number(-1e6)
+    ))
+
+    # Calculate water area in water_map_NDWI.
+    water_area_NDWI = ee.Number(im.select('water_map_NDWI').eq(1).multiply(ee.Image.pixelArea()).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate non-water area in water_map_NDWI.
+    non_water_area_NDWI = ee.Number(im.select('water_map_NDWI').neq(1).multiply(ee.Image.pixelArea()).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate red band sum for water area in water_map_NDWI.
+    water_red_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate green band sum for water area in water_map_NDWI.
+    water_green_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(GREEN_BAND_NAME)).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate nir band sum for water area in water_map_NDWI.
+    water_nir_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).reduceRegion(
+        reducer = ee.Reducer.sum(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate red band/green band mean for water area in water_map_NDWI.
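+    # (Reflectance-band ratios such as red/green and NIR/red are standard
+    # empirical proxies for suspended sediment concentration; these means feed
+    # normalize_ssc() in the SSC module earlier in this changeset.)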
+    water_red_green_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).divide(im.select(
+        GREEN_BAND_NAME)).reduceRegion(
+        reducer = ee.Reducer.mean(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+    # Calculate nir band/red band mean for water area in water_map_NDWI.
+    water_nir_red_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).divide(im.select(
+        RED_BAND_NAME)).reduceRegion(
+        reducer = ee.Reducer.mean(),
+        geometry = aoi,
+        scale = SMALL_SCALE,
+        maxPixels = 1e10
+    ).get('water_map_NDWI'))
+
+    # Set attributes to retrieve later
+    im = im.set('cloud_area', cloud_area.multiply(1e-6))
+    im = im.set('cloud_percent', cloud_percent)
+    im = im.set('water_area_cordeiro', water_area_cordeiro.multiply(1e-6))
+    im = im.set('non_water_area_cordeiro', non_water_area_cordeiro.multiply(1e-6))
+    im = im.set('water_area_NDWI', water_area_NDWI.multiply(1e-6))
+    im = im.set('non_water_area_NDWI', non_water_area_NDWI.multiply(1e-6))
+    im = im.set('water_red_sum', water_red_sum)
+    im = im.set('water_green_sum', water_green_sum)
+    im = im.set('water_nir_sum', water_nir_sum)
+    im = im.set('water_red_green_mean', water_red_green_mean)
+    im = im.set('water_nir_red_mean', water_nir_red_mean)
+
+    return im
+
+
+def postprocess_wrapper(im, bandName, raw_area):
+    im = ee.Image(im)
+    bandName = ee.String(bandName)
+    date = im.get('to_date')
+
+    def postprocess():
+        gswd_masked = gswd.updateMask(im.select(bandName).eq(1))
+
+        hist = ee.List(gswd_masked.reduceRegion(
+            reducer = ee.Reducer.autoHistogram(minBucketWidth = 1),
+            geometry = aoi,
+            scale = SMALL_SCALE,
+            maxPixels = 1e10
+        ).get('occurrence'))
+
+        def if_hist_not_null(im,hist):
+
+            counts = ee.Array(hist).transpose().toList()
+
+            omega = ee.Number(0.17)
+            count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega)
+
+            count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1)
+            occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index))
+
+            water_map = im.select([bandName], ['water_map'])
+            gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"])
+
+            improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao'])
+
+            corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion(
+                reducer = ee.Reducer.sum(),
+                geometry = aoi,
+                scale = SMALL_SCALE,
+                maxPixels = 1e10
+            ).get('water_map_zhao_gao'))
+
+            improved = improved.set("corrected_area", corrected_area.multiply(1e-6))
+            improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1)
+
+            return improved
+
+        def if_hist_null(im):
+            # Preserve the uncorrected area in the corrected area column & set POSTPROCESSING_SUCCESSFUL=1
+            # because otherwise it will be substituted with nan.
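+            # NOTE: process_image() sets 'water_area_cordeiro' rather than
+            # 'water_area_clustering'; if the latter property is never set
+            # upstream, im.get() resolves to null here, so the property name
+            # may be worth verifying.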
+            uncorrected_area = ee.Number(im.get('water_area_clustering'))
+            improved = im.set('corrected_area', uncorrected_area)
+            improved = improved.set('POSTPROCESSING_SUCCESSFUL', 1)
+            return improved
+
+        improved = ee.Image(ee.Algorithms.If(
+            hist, if_hist_not_null(im,hist), if_hist_null(im)
+        ))
+        return improved
+
+    def dont_post_process():
+        improved = ee.Image.constant(-1)
+        improved = improved.set("corrected_area", -1)
+        return improved
+
+    condition = ee.Number(im.get('cloud_percent')).lt(CLOUD_COVER_LIMIT).And(ee.Number(raw_area).gt(0))
+    improved = ee.Image(ee.Algorithms.If(condition, postprocess(), dont_post_process()))
+
+    improved = improved.set("to_date", date)
+
+    return improved
+
+
+############################################################/
+## Code from here takes care of the time-series generation ##
+############################################################/
+def calc_ndwi(im):
+    return im.addBands(im.normalizedDifference([NIR_BAND_NAME, SWIR1_BAND_NAME]).rename('NDWI'))
+
+def process_date(date):
+    # Given date, calculate end date by adding TEMPORAL_RESOLUTION - 1 days
+    date = ee.Date(date)
+    from_date = date
+    to_date = date.advance(TEMPORAL_RESOLUTION - 1, 'day')
+    # Filter the image collection for these dates and AOI and run preprocess function (cloud calculations, scaling, adding & setting start and end) on them
+    l5_subset = l5.filterDate(from_date, to_date).filterBounds(aoi).map(preprocess)
+
+    # Get mosaic of images if there is at least one image for this time duration, with NDWI as the quality factor (keep high NDWI)
+    im = ee.Image(ee.Algorithms.If(l5_subset.size().neq(0), l5_subset.map(calc_ndwi).qualityMosaic('NDWI'), ee.Image.constant(0)))
+    # Process NDWI Image if there is at least one image for this time duration
+    im = ee.Image(ee.Algorithms.If(l5_subset.size().neq(0), process_image(im), ee.Image.constant(0)))
+
+    # Set attributes of from and to date along with number of images during the time duration
+    im = im.set('from_date', from_date.format("YYYY-MM-dd"))
+    im = im.set('to_date', to_date.format("YYYY-MM-dd"))
+    im = im.set('l5_images', l5_subset.size())
+
+    # im = ee.Algorithms.If(im.bandNames().size().eq(1), ee.Number(0), im)
+
+    return ee.Image(im)
+
+
+def generate_timeseries(dates):
+    raw_ts = dates.map(process_date)
+    # raw_ts = raw_ts.removeAll([0])
+    imcoll = ee.ImageCollection.fromImages(raw_ts)
+
+    return imcoll
+
+def get_first_obs(start_date, end_date):
+    # Filter collection, sort by date, and get the first image's timestamp
+    first_im = (
+        l5.filterBounds(aoi)
+        .filterDate(start_date, end_date)
+        .limit(1, 'system:time_start')  # Explicitly limit by earliest date
+        .reduceColumns(ee.Reducer.first(), ['system:time_start'])
+    )
+
+    # Convert timestamp to formatted date
+    first_date = ee.Date(first_im.get('first')).format('YYYY-MM-dd')
+
+    return first_date
+
+def run_process_long(res_name, res_polygon, start, end, datadir, results_per_iter=RESULTS_PER_ITER):
+    fo = start  # fo: first observation
+    enddate = end
+
+    # Extracting reservoir geometry
+    global aoi
+    aoi = poly2feature(res_polygon,BUFFER_DIST).geometry()
+
+    ## Checking the number of images in the interval as Landsat 5 might have missing data for a lot of places for longer durations.
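+    ## (Note: .size().getInfo() below forces a blocking, client-side request
+    ## to the Earth Engine servers.)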
+    number_of_images = l5.filterBounds(aoi).filterDate(start, end).size().getInfo()
+
+    if(number_of_images):
+        # getting first observation in the filtered collection
+        print('Checking first observation date in the given time interval.')
+        fo = get_first_obs(start, end).getInfo()
+        first_obs = datetime.strptime(fo, '%Y-%m-%d')
+        print(f"First Observation: {first_obs}")
+
+        scratchdir = os.path.join(datadir, "_scratch")
+
+        # If data already exists, only get new data starting from the last one
+        savepath = os.path.join(datadir, f"{res_name}.csv")
+
+        # If an output file already exists,
+        if os.path.isfile(savepath):
+            # Read the existing file
+            temp_df = pd.read_csv(savepath, parse_dates=['mosaic_enddate']).set_index('mosaic_enddate')
+
+            # Get the last date in the existing file and adjust the first observation to before the last date (the last date might not be from this satellite; it may be the TMS-OS data's last date.)
+            last_date = temp_df.index[-1].to_pydatetime()
+            fo = (last_date - timedelta(days=TEMPORAL_RESOLUTION*2)).strftime("%Y-%m-%d")
+            # Create a list with the file path
+            to_combine = [savepath]
+            print(f"Existing file found - Last observation ({TEMPORAL_RESOLUTION*2} day lag): {last_date}")
+
+            # If {TEMPORAL_RESOLUTION} days have not passed since the last observation, skip the processing
+            days_passed = (datetime.strptime(end, "%Y-%m-%d") - last_date).days
+            print(f"No. of days passed since: {days_passed}")
+            if days_passed < TEMPORAL_RESOLUTION:
+                print("No new observation expected. Quitting early.")
+                return None
+        # If no file exists already, create an empty list
+        else:
+            to_combine = []
+
+        # Extracting data in scratch directory
+        savedir = os.path.join(scratchdir, f"{res_name}_l5_cordeiro_zhao_gao_{fo}_{enddate}")
+        if not os.path.isdir(savedir):
+            os.makedirs(savedir)
+
+        print(f"Extracting SA for the period {fo} -> {enddate}")
+
+        # Creating list of dates from fo to enddate with frequency of TEMPORAL_RESOLUTION
+        dates = pd.date_range(fo, enddate, freq=f'{TEMPORAL_RESOLUTION}D')
+        # Grouping dates into smaller arrays to process in GEE
+        grouped_dates = grouper(dates, results_per_iter)
+
+        # Run while results per iteration is at least the minimum results per iteration
+        while results_per_iter >= MIN_RESULTS_PER_ITER:
+            # try to run for each subset of dates
+            try:
+                # For each smaller array of dates
+                for subset_dates in grouped_dates:
+                    try:
+                        print(subset_dates)
+                        # Check if the start of subset_dates is after the end date of the mission. If so, quit.
+                        if subset_dates[0].date() > MISSION_END_DATE:
+                            print(f"Reached mission end date. No further data available from Landsat-5 satellite mission in GEE - {MISSION_END_DATE}")
+                            break
+                        # Convert dates list to an Earth Engine object
+                        dates = ee.List([ee.Date(d) for d in subset_dates if d is not None])
+                        # Generate a timeseries of one image corresponding to each date, with water area in its attributes
+                        res = generate_timeseries(dates).filterMetadata('l5_images', 'greater_than', 0)
+                        # Extracting uncorrected water area and other information from attributes
+                        uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 'l5_images',
+                                                          'water_red_sum', 'water_green_sum', 'water_nir_sum','water_red_green_mean','water_nir_red_mean']
+                        uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list')
+                        uncorrected_final_data = uncorrected_final_data_ee.getInfo()
+                        print("Uncorrected", uncorrected_final_data)
+                        # Extracting corrected area after correcting for cloud-covered pixels using the Zhao-Gao correction.
+                        res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro')))
+                        corrected_columns_to_extract = ['to_date', 'corrected_area']
+                        corrected_final_data_cordeiro_ee = res_corrected_cordeiro \
+                            .filterMetadata('corrected_area', 'not_equals', None) \
+                            .reduceColumns(
+                                ee.Reducer.toList(
+                                    len(corrected_columns_to_extract)),
+                                corrected_columns_to_extract
+                            ).get('list')
+                        corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo()
+                        print("Corrected - Cordeiro", corrected_final_data_cordeiro)
+                        # If there is no data point for this duration, then skip
+                        if len(uncorrected_final_data) == 0:
+                            continue
+                        # Create pandas dataframes with the extracted information and merge them
+                        uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract)
+                        corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1)
+                        df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date')
+
+                        df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d")
+                        df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d")
+                        df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day')
+                        df = df.set_index('mosaic_enddate')
+                        print(df.head(2))
+                        # Save the dataframe on the disk
+                        fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv")
+                        df.to_csv(fname)
+                        # Create a random sleep time
+                        s_time = randint(20, 30)
+                        print(f"Sleeping for {s_time} seconds")
+                        time.sleep(s_time)
+
+                        if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION:
+                            print(f"Quitting: Reached enddate {enddate}")
+                            break
+                        elif df.index[-1].strftime('%Y-%m-%d') == fo:
+                            print(f"Reached last available observation - {fo}")
+                            break
+                    except Exception as e:
+                        log.error(e)
+                        # Adjust results_per_iter only if the error includes "Too many concurrent aggregations"
+                        if "Too many concurrent aggregations" in str(e):
+                            results_per_iter -= 1
+                            print(f"Reducing Results per iteration to {results_per_iter} due to error.")
+                            if results_per_iter < MIN_RESULTS_PER_ITER:
+                                print("Minimum Results per iteration reached. 
Continuing to next group of dates.") + results_per_iter = MIN_RESULTS_PER_ITER + continue + else: + raise Exception(f'Reducing Results per iteration to {results_per_iter}.') + else: + continue + # This exception will be only raised if the error is "Too many concurrent aggregations". + # and Results per iteration will be reduced but still be greater than or equal to minimum results per iteration. + # We will continue while loop and for loop within while loop from the left over grouped dates. + except Exception as e: + dates = pd.date_range(subset_dates[0], enddate, freq=f'{TEMPORAL_RESOLUTION}D') + grouped_dates = grouper(dates, results_per_iter) + continue + # In case no exception is raised and the complete for loop ran succesfully, break the while loop + # because we need to run the for loop only once. + else: + break + + # Combine the files into one database + to_combine.extend([os.path.join(savedir, f) for f in os.listdir(savedir) if f.endswith(".csv")]) + if len(to_combine): + files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] + + data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") + data.to_csv(savepath) + + return savepath + else: + print(f"Observed data between {start} and {end} could not be processed to get surface area. It may be due to cloud cover or other issues, Quitting!") + return None + else: + print(f"No observation observed between {start} and {end}. Quitting!") + return None + +# User-facing wrapper function +def sarea_l5(res_name,res_polygon, start, end, datadir): + return run_process_long(res_name,res_polygon, start, end, datadir) + + + +def grouper(iterable, n, *, incomplete='fill', fillvalue=None): + "Collect data into non-overlapping fixed-length chunks or blocks" + # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx + # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError + # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF + args = [iter(iterable)] * n + if incomplete == 'fill': + return zip_longest(*args, fillvalue=fillvalue) + if incomplete == 'strict': + return zip(*args, strict=True) + if incomplete == 'ignore': + return zip(*args) + else: + raise ValueError('Expected fill, strict, or ignore') + diff --git a/src/rat/core/sarea/sarea_cli_l7.py b/src/rat/core/sarea/sarea_cli_l7.py new file mode 100644 index 00000000..8a466ef0 --- /dev/null +++ b/src/rat/core/sarea/sarea_cli_l7.py @@ -0,0 +1,606 @@ +import ee +from datetime import date, datetime, timedelta, timezone +import pandas as pd +import time +import os +from random import randint +from itertools import zip_longest + +from rat.ee_utils.ee_utils import poly2feature +from rat.utils.logging import LOG_NAME, NOTIFICATION +from rat.utils.utils import days_between +from logging import getLogger + +# Defining global constants +log = getLogger(f"{LOG_NAME}.{__name__}") + +l7= ee.ImageCollection("LANDSAT/LE07/C02/T1_L2") +gswd = ee.Image("JRC/GSW1_4/GlobalSurfaceWater") + +NDWI_THRESHOLD = 0.3 +SMALL_SCALE = 30 +MEDIUM_SCALE = 120 +LARGE_SCALE = 500 +BUFFER_DIST = 500 +CLOUD_COVER_LIMIT = 90 +TEMPORAL_RESOLUTION = 16 +RESULTS_PER_ITER = 5 +MIN_RESULTS_PER_ITER = 1 +QUALITY_PIXEL_BAND_NAME = 'QA_PIXEL' +BLUE_BAND_NAME = 'SR_B1' +GREEN_BAND_NAME = 'SR_B2' +RED_BAND_NAME = 'SR_B3' +NIR_BAND_NAME = 'SR_B4' +SWIR1_BAND_NAME = 'SR_B5' +SWIR2_BAND_NAME = 'SR_B7' + +def preprocess(im): + clipped = im # clipping adds processing overhead, setting clipped = im + + # clipped = ee.Image(ee.Algorithms.If('BQA' in clipped.bandNames(), 
preprocess_image_mask(im), clipped.updateMask(ee.Image.constant(1)))) + # Mask appropriate QA bits + QA = im.select([QUALITY_PIXEL_BAND_NAME]) + + cloudShadowBitMask = 1 << 3 + cloudsBitMask = 1 << 4 + + cloud = (QA.bitwiseAnd(cloudsBitMask).neq(0)).Or(QA.bitwiseAnd(cloudShadowBitMask).neq(0)).rename("cloud") + clipped = clipped.updateMask(cloud.eq(0).select("cloud")) + + # SR scaling + clipped = clipped.select('SR_B.').multiply(0.0000275).add(-0.2) + clipped = clipped.addBands(cloud) + + clipped = clipped.set('system:time_start', im.get('system:time_start')) + clipped = clipped.set('system:time_end', im.get('system:time_end')) + # clipped = clipped.set('cloud_area', cloud_area) + + return clipped + + +##########################################################/ +## Processing individual images - water classification ## +##########################################################/ + +def identify_water_cluster(im, max_cluster_value): + im = ee.Image(im) + mbwi = im.select('MBWI') + + clusters = ee.List.sequence(0, max_cluster_value) + + def calc_avg_mbwi(cluster_val): + cluster_val = ee.Number(cluster_val) + avg_mbwi = mbwi.updateMask(im.select('cluster').eq(ee.Image(cluster_val))).reduceRegion( + reducer = ee.Reducer.mean(), + scale = MEDIUM_SCALE, + geometry = aoi, + maxPixels = 1e10 + ).get('MBWI') + avg_mbwi_not_null = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual(ee.Number(avg_mbwi)), + ee.Number(-99), + ee.Number(avg_mbwi))) + return avg_mbwi_not_null + # print(calc_avg_mbwi(clusters.get(0)).getInfo()) + avg_mbwis = ee.Array(clusters.map(calc_avg_mbwi)) + + max_mbwi_index = avg_mbwis.argmax().get(0) + + water_cluster = clusters.get(max_mbwi_index) + + return water_cluster + + +def cordeiro(im): + ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on + ## calinski harabasz's work + ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans + band_subset = ee.List(['NDWI', 'MNDWI', SWIR2_BAND_NAME]) # using NDWI, MNDWI and B7 (SWIR2) + sampled_pts = im.select(band_subset).sample( + region = aoi, + scale = SMALL_SCALE, + numPixels = 5e3-1 ## limit of 5k points + ) + + no_sampled_pts = sampled_pts.size() + + def if_enough_sample_pts(im): + clusterer = ee.Clusterer.wekaCascadeKMeans( + minClusters = 2, + maxClusters = 7, + init = True + ).train(sampled_pts) + + classified = im.select(band_subset).cluster(clusterer) + im = im.addBands(classified) + max_cluster_value = ee.Number(im.select('cluster').reduceRegion( + reducer = ee.Reducer.max(), + geometry = aoi, + scale = LARGE_SCALE, + maxPixels = 1e10 + ).get('cluster')) + return ee.Dictionary({'max_cluster_value': max_cluster_value, 'classified': classified}) + + def if_not_enough_sample_pts(): + return ee.Dictionary({'max_cluster_value': ee.Number(0), 'classified': ee.Image(0).rename('cluster')}) + + # If clustering is possible do clustering + def if_clustering_possible(max_cluster_value,classified,im): + im = im.addBands(classified) + + water_cluster = identify_water_cluster(im, max_cluster_value) + water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + return water_map + + # If no clustering is possible, use NDWI water map + def if_clustering_not_possible(im): + water_map = im.select(['water_map_NDWI'],['water_map_cordeiro']) + return water_map + + + after_training_dict = ee.Dictionary(ee.Algorithms.If(ee.Number(no_sampled_pts), + if_enough_sample_pts(im), + if_not_enough_sample_pts())) + + max_cluster_value = 
ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual( + ee.Number(after_training_dict.get('max_cluster_value'))), + ee.Number(0), + ee.Number(after_training_dict.get('max_cluster_value')))) + + classified = ee.Image(after_training_dict.get('classified')) + water_map = ee.Image(ee.Algorithms.If(ee.Algorithms.IsEqual(max_cluster_value,ee.Number(0)), + if_clustering_not_possible(im), + if_clustering_possible(max_cluster_value, + classified, + im) + )) + im = im.addBands(water_map) + + return im + + +def process_image(im): + # Landsat 7 process Scan Line Corrector (SLC) + + ndwi = im.normalizedDifference([NIR_BAND_NAME, SWIR1_BAND_NAME]).rename('NDWI') + im = im.addBands(ndwi) + mndwi = im.normalizedDifference([GREEN_BAND_NAME, SWIR1_BAND_NAME]).rename('MNDWI') + im = im.addBands(mndwi) + mbwi = im.expression("MBWI = 3*green-red-nir-swir1-swir2", { + 'green': im.select(GREEN_BAND_NAME), + 'red': im.select(RED_BAND_NAME), + 'nir': im.select(NIR_BAND_NAME), + 'swir1': im.select(SWIR1_BAND_NAME), + 'swir2': im.select(SWIR2_BAND_NAME) + }) + im = im.addBands(mbwi) + + #cloud_area = AOI area - area od pixels in cloud band where there is no data because we masked it due to presence of clouds + cloud_area = aoi.area().subtract(im.select('cloud').Not().multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('cloud')) + cloud_percent = cloud_area.multiply(100).divide(aoi.area()) + + cordeiro_will_run_when = cloud_percent.lt(CLOUD_COVER_LIMIT) + # NDWI based water map: Classify water wherever NDWI is greater than NDWI_THRESHOLD and add water_map_NDWI band. + im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) + # Clustering based classification of water pixels for cloud pixels when cloud cover is less than CLOUD_COVER_LIMIT + im = im.addBands(ee.Image(ee.Algorithms.If(cordeiro_will_run_when, cordeiro(im), ee.Image.constant(-1e6)))) # run cordeiro only if cloud percent is < 90% + + # Calculate water area in water_area_cordeiro map + water_area_cordeiro = ee.Number(ee.Algorithms.If(cordeiro_will_run_when, + ee.Number(im.select('water_map_cordeiro').eq(1).multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_cordeiro')), + ee.Number(-1e6) + )) + # Calculate non-water area in water_area_cordeiro map. + non_water_area_cordeiro = ee.Number(ee.Algorithms.If(cordeiro_will_run_when, + ee.Number(im.select('water_map_cordeiro').neq(1).multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_cordeiro')), + ee.Number(-1e6) + )) + + # Calculate water area in water_map_NDWI. + water_area_NDWI = ee.Number(im.select('water_map_NDWI').eq(1).multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate non-water area in water_map_NDWI. + non_water_area_NDWI = ee.Number(im.select('water_map_NDWI').neq(1).multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate red band sum for water area in water_map_NDWI. 
+ water_red_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate green band sum for water area in water_map_NDWI. + water_green_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band sum for water area in water_map_NDWI. + water_nir_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate red band/green band mean for water area in water_map_NDWI. + water_red_green_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).divide(im.select( + GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band/red band mean for water area in water_map_NDWI. + water_nir_red_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).divide(im.select( + RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + + # Set attributes to retrive later + im = im.set('cloud_area', cloud_area.multiply(1e-6)) + im = im.set('cloud_percent', cloud_percent) + im = im.set('water_area_cordeiro', water_area_cordeiro.multiply(1e-6)) + im = im.set('non_water_area_cordeiro', non_water_area_cordeiro.multiply(1e-6)) + im = im.set('water_area_NDWI', water_area_NDWI.multiply(1e-6)) + im = im.set('non_water_area_NDWI', non_water_area_NDWI.multiply(1e-6)) + im = im.set('water_red_sum', water_red_sum) + im = im.set('water_green_sum', water_green_sum) + im = im.set('water_nir_sum', water_nir_sum) + im = im.set('water_red_green_mean', water_red_green_mean) + im = im.set('water_nir_red_mean', water_nir_red_mean) + + return im + + +def postprocess_wrapper(im, bandName, raw_area): + im = ee.Image(im) + bandName = ee.String(bandName) + date = im.get('to_date') + + def postprocess(): + gswd_masked = gswd.updateMask(im.select(bandName).eq(1)) + + hist = ee.List(gswd_masked.reduceRegion( + reducer = ee.Reducer.autoHistogram(minBucketWidth = 1), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('occurrence')) + + def if_hist_not_null(im,hist): + + counts = ee.Array(hist).transpose().toList() + + omega = ee.Number(0.17) + count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + + count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) + occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) + + water_map = im.select([bandName], ['water_map']) + gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + + improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + + corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_zhao_gao')) + + improved = 
improved.set("corrected_area", corrected_area.multiply(1e-6)); + improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1); + + return improved + + def if_hist_null(im): + # Preserve the uncorrected area in the corrected area column & POSTPROCESSING=1 + # because otherwise it will be substituted with nan. + uncorrected_area = ee.Number(im.get('water_area_clustering')) + improved = im.set('corrected_area', uncorrected_area) + improved = improved.set('POSTPROCESSING_SUCCESSFUL', 1) + return improved + + improved = ee.Image(ee.Algorithms.If( + hist, if_hist_not_null(im,hist), if_hist_null(im) + )) + return improved + + def dont_post_process(): + improved = ee.Image.constant(-1) + improved = improved.set("corrected_area", -1) + return improved + + condition = ee.Number(im.get('cloud_percent')).lt(CLOUD_COVER_LIMIT).And(ee.Number(raw_area).gt(0)) + improved = ee.Image(ee.Algorithms.If(condition, postprocess(), dont_post_process())) + + improved = improved.set("to_date", date) + + return improved + + +############################################################/ +## Code from here takes care of the time-series generation ## +############################################################/ +def calc_ndwi(im): + return im.addBands(im.normalizedDifference([NIR_BAND_NAME, SWIR1_BAND_NAME]).rename('NDWI')) + +def slc_failure_correction(im): + filled_image = im.select('SR_B.') + + # Apply focal mean with the radius of 2 + + focal_mean_image = filled_image.focal_mean(2, 'square', 'pixels', 2) + + # Create a mask for NaN areas in the filled image + nan_mask = filled_image.mask().Not() + + # Only update the NaN areas with focal mean result + filled_image = filled_image.blend(focal_mean_image.updateMask(nan_mask)) + + # Preserve the time metadata + filled_image = filled_image.set('system:time_start', im.get('system:time_start')) + filled_image = filled_image.set('system:time_end', im.get('system:time_end')) + + #Add the Quality pixel band + filled_image = filled_image.addBands(im.select(QUALITY_PIXEL_BAND_NAME)) + + return filled_image + + +def process_date(date): + # Given date, calculate end date by adding TEMPORAL_RESOLUTION - 1 days + date = ee.Date(date) + from_date = date + to_date = date.advance(TEMPORAL_RESOLUTION - 1, 'day') + + # from_date_client_obj = from_date.format('YYYY-MM-dd') + # from_date_client_obj = pd.to_datetime(from_date_client_obj) + + # Only do SLC failure correction if date is greater than 31st may 2003. 
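+    # (The Landsat 7 Scan Line Corrector failed on 2003-05-31; scenes acquired
+    # after that date contain wedge-shaped data gaps, which
+    # slc_failure_correction() above fills with a focal mean.)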
+    # if(from_date_client_obj>=pd.to_datetime('2003-05-31')):
+    # Filter the image collection for these dates and AOI, do SLC failure correction for Landsat-7, and run the preprocess function (cloud calculations, scaling, adding & setting start and end) on them
+    l7_subset = ee.ImageCollection(ee.Algorithms.If(ee.Date('2003-05-31').millis().lt(to_date.millis()),
+                    l7.filterDate(from_date, to_date).filterBounds(aoi).map(slc_failure_correction).map(preprocess),
+                    l7.filterDate(from_date, to_date).filterBounds(aoi).map(preprocess)))
+    # else:
+    #     # Filter the image collection for these dates and AOI and run preprocess function (cloud calculations, scaling, adding & setting start and end) on them
+    #     l7_subset = l7.filterDate(from_date, to_date).filterBounds(aoi).map(preprocess)
+
+    # Get mosaic of images if there is at least one image for this time duration, with NDWI as the quality factor (keep high NDWI)
+    im = ee.Image(ee.Algorithms.If(l7_subset.size().neq(0), l7_subset.map(calc_ndwi).qualityMosaic('NDWI'), ee.Image.constant(0)))
+    # Process NDWI Image if there is at least one image for this time duration
+    im = ee.Image(ee.Algorithms.If(l7_subset.size().neq(0), process_image(im), ee.Image.constant(0)))
+
+    # Set attributes of from and to date along with number of images during the time duration
+    im = im.set('from_date', from_date.format("YYYY-MM-dd"))
+    im = im.set('to_date', to_date.format("YYYY-MM-dd"))
+    im = im.set('l7_images', l7_subset.size())
+
+    # im = ee.Algorithms.If(im.bandNames().size().eq(1), ee.Number(0), im)
+
+    return ee.Image(im)
+
+
+def generate_timeseries(dates):
+    raw_ts = dates.map(process_date)
+    # raw_ts = raw_ts.removeAll([0])
+    imcoll = ee.ImageCollection.fromImages(raw_ts)
+
+    return imcoll
+
+def get_first_obs(start_date, end_date):
+    # Filter collection, sort by date, and get the first image's timestamp
+    first_im = (
+        l7.filterBounds(aoi)
+        .filterDate(start_date, end_date)
+        .limit(1, 'system:time_start')  # Explicitly limit by earliest date
+        .reduceColumns(ee.Reducer.first(), ['system:time_start'])
+    )
+
+    # Convert timestamp to formatted date
+    first_date = ee.Date(first_im.get('first')).format('YYYY-MM-dd')
+
+    return first_date
+
+def run_process_long(res_name, res_polygon, start, end, datadir, results_per_iter=RESULTS_PER_ITER):
+    fo = start  # fo: first observation
+    enddate = end
+
+    # Extracting reservoir geometry
+    global aoi
+    aoi = poly2feature(res_polygon,BUFFER_DIST).geometry()
+
+    ## Checking the number of images in the interval as Landsat 7 might have missing data for a lot of places for longer durations.
+    number_of_images = l7.filterBounds(aoi).filterDate(start, end).size().getInfo()
+
+    if(number_of_images):
+        # getting first observation in the filtered collection
+        print('Checking first observation date in the given time interval.')
+        fo = get_first_obs(start, end).getInfo()
+        first_obs = datetime.strptime(fo, '%Y-%m-%d')
+        print(f"First Observation: {first_obs}")
+
+        scratchdir = os.path.join(datadir, "_scratch")
+
+        # If data already exists, only get new data starting from the last one
+        savepath = os.path.join(datadir, f"{res_name}.csv")
+
+        # If an output file already exists,
+        if os.path.isfile(savepath):
+            # Read the existing file
+            temp_df = pd.read_csv(savepath, parse_dates=['mosaic_enddate']).set_index('mosaic_enddate')
+
+            # Get the last date in the existing file and adjust the first observation to before the last date (the last date might not be from this satellite; it may be the TMS-OS data's last date.)
+            last_date = temp_df.index[-1].to_pydatetime()
+            fo = (last_date - timedelta(days=TEMPORAL_RESOLUTION*2)).strftime("%Y-%m-%d")
+            # Create a list with the file path
+            to_combine = [savepath]
+            print(f"Existing file found - Last observation ({TEMPORAL_RESOLUTION*2} day lag): {last_date}")
+
+            # If {TEMPORAL_RESOLUTION} days have not passed since the last observation, skip the processing
+            days_passed = (datetime.strptime(end, "%Y-%m-%d") - last_date).days
+            print(f"No. of days passed since: {days_passed}")
+            if days_passed < TEMPORAL_RESOLUTION:
+                print("No new observation expected. Quitting early.")
+                return None
+        # If no file exists already, create an empty list
+        else:
+            to_combine = []
+
+        # Extracting data in scratch directory
+        savedir = os.path.join(scratchdir, f"{res_name}_l7_cordeiro_zhao_gao_{fo}_{enddate}")
+        if not os.path.isdir(savedir):
+            os.makedirs(savedir)
+
+        print(f"Extracting SA for the period {fo} -> {enddate}")
+
+        # Creating list of dates from fo to enddate with frequency of TEMPORAL_RESOLUTION
+        dates = pd.date_range(fo, enddate, freq=f'{TEMPORAL_RESOLUTION}D')
+        # Grouping dates into smaller arrays to process in GEE
+        grouped_dates = grouper(dates, results_per_iter)
+
+        # Run while results per iteration is at least the minimum results per iteration
+        while results_per_iter >= MIN_RESULTS_PER_ITER:
+            # try to run for each subset of dates
+            try:
+                # For each smaller array of dates
+                for subset_dates in grouped_dates:
+                    try:
+                        print(subset_dates)
+                        # Convert dates list to an Earth Engine object
+                        dates = ee.List([ee.Date(d) for d in subset_dates if d is not None])
+                        # Generate a timeseries of one image corresponding to each date, with water area in its attributes
+                        res = generate_timeseries(dates).filterMetadata('l7_images', 'greater_than', 0)
+                        # Extracting uncorrected water area and other information from attributes
+                        uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 'l7_images',
+                                                          'water_red_sum', 'water_green_sum', 'water_nir_sum','water_red_green_mean','water_nir_red_mean']
+                        uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list')
+                        uncorrected_final_data = uncorrected_final_data_ee.getInfo()
+                        print("Uncorrected", uncorrected_final_data)
+                        # Extracting corrected area after correcting for cloud-covered pixels using the Zhao-Gao correction.
+                        res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro')))
+                        corrected_columns_to_extract = ['to_date', 'corrected_area']
+                        corrected_final_data_cordeiro_ee = res_corrected_cordeiro \
+                            .filterMetadata('corrected_area', 'not_equals', None) \
+                            .reduceColumns(
+                                ee.Reducer.toList(
+                                    len(corrected_columns_to_extract)),
+                                corrected_columns_to_extract
+                            ).get('list')
+                        corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo()
+                        print("Corrected - Cordeiro", corrected_final_data_cordeiro)
+                        # If there is no data point for this duration, then skip
+                        if len(uncorrected_final_data) == 0:
+                            continue
+                        # Create pandas dataframes with the extracted information and merge them
+                        uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract)
+                        corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1)
+                        df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date')
+
+                        df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d")
+                        df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d")
+                        df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day')
+                        df = df.set_index('mosaic_enddate')
+                        print(df.head(2))
+                        # Save the dataframe on the disk
+                        fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv")
+                        df.to_csv(fname)
+                        # Create a random sleep time
+                        s_time = randint(20, 30)
+                        print(f"Sleeping for {s_time} seconds")
+                        time.sleep(s_time)
+
+                        if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION:
+                            print(f"Quitting: Reached enddate {enddate}")
+                            break
+                        elif df.index[-1].strftime('%Y-%m-%d') == fo:
+                            print(f"Reached last available observation - {fo}")
+                            break
+                    except Exception as e:
+                        log.error(e)
+                        # Adjust results_per_iter only if the error includes "Too many concurrent aggregations"
+                        if "Too many concurrent aggregations" in str(e):
+                            results_per_iter -= 1
+                            print(f"Reducing Results per iteration to {results_per_iter} due to error.")
+                            if results_per_iter < MIN_RESULTS_PER_ITER:
+                                print("Minimum Results per iteration reached. Continuing to next group of dates.")
+                                results_per_iter = MIN_RESULTS_PER_ITER
+                                continue
+                            else:
+                                raise Exception(f'Reducing Results per iteration to {results_per_iter}.')
+                        else:
+                            continue
+            # This exception will only be raised if the error is "Too many concurrent aggregations",
+            # and results_per_iter will be reduced but will still be greater than or equal to MIN_RESULTS_PER_ITER.
+            # We will continue the while loop, and the for loop within it, from the leftover grouped dates.
+            except Exception as e:
+                dates = pd.date_range(subset_dates[0], enddate, freq=f'{TEMPORAL_RESOLUTION}D')
+                grouped_dates = grouper(dates, results_per_iter)
+                continue
+            # In case no exception is raised and the complete for loop ran successfully, break the while loop
+            # because we need to run the for loop only once.
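+            # (Python try/else: this else clause runs only when the try block
+            # above raised no exception.)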
+ else: + break + + # Combine the files into one database + to_combine.extend([os.path.join(savedir, f) for f in os.listdir(savedir) if f.endswith(".csv")]) + if len(to_combine): + files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] + + data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") + data.to_csv(savepath) + + return savepath + else: + print(f"Observed data between {start} and {end} could not be processed to get surface area. It may be due to cloud cover or other issues, Quitting!") + return None + else: + print(f"No observation observed between {start} and {end}. Quitting!") + return None + +# User-facing wrapper function +def sarea_l7(res_name,res_polygon, start, end, datadir): + return run_process_long(res_name,res_polygon, start, end, datadir) + +def grouper(iterable, n, *, incomplete='fill', fillvalue=None): + "Collect data into non-overlapping fixed-length chunks or blocks" + # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx + # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError + # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF + args = [iter(iterable)] * n + if incomplete == 'fill': + return zip_longest(*args, fillvalue=fillvalue) + if incomplete == 'strict': + return zip(*args, strict=True) + if incomplete == 'ignore': + return zip(*args) + else: + raise ValueError('Expected fill, strict, or ignore') \ No newline at end of file diff --git a/src/rat/core/sarea/sarea_cli_l8.py b/src/rat/core/sarea/sarea_cli_l8.py index 7e301d9e..a4539cd4 100644 --- a/src/rat/core/sarea/sarea_cli_l8.py +++ b/src/rat/core/sarea/sarea_cli_l8.py @@ -6,7 +6,6 @@ import os from random import randint from itertools import zip_longest -from rat.ee_utils.ee_utils import poly2feature from rat.ee_utils.ee_utils import poly2feature from rat.utils.logging import LOG_NAME, NOTIFICATION @@ -37,7 +36,7 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): # NEW STUFF l8 = ee.ImageCollection("LANDSAT/LC08/C02/T1_L2") -gswd = ee.Image("JRC/GSW1_3/GlobalSurfaceWater") +gswd = ee.Image("JRC/GSW1_4/GlobalSurfaceWater") rgb_vis_params = {"bands":["B4","B3","B2"],"min":0,"max":0.4} NDWI_THRESHOLD = 0.3 @@ -50,6 +49,14 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): end_date = ee.Date('2019-02-01') TEMPORAL_RESOLUTION = 16 RESULTS_PER_ITER = 5 +MIN_RESULTS_PER_ITER = 1 +QUALITY_PIXEL_BAND_NAME = 'QA_PIXEL' +BLUE_BAND_NAME = 'SR_B2' +GREEN_BAND_NAME = 'SR_B3' +RED_BAND_NAME = 'SR_B4' +NIR_BAND_NAME = 'SR_B5' +SWIR1_BAND_NAME = 'SR_B6' +SWIR2_BAND_NAME = 'SR_B7' # s2_subset = s2.filterBounds(aoi).filterDate(start_date, end_date) @@ -92,17 +99,10 @@ def preprocess(im): ## Processing individual images - water classification ## ##########################################################/ -def identify_water_cluster(im): +def identify_water_cluster(im, max_cluster_value): im = ee.Image(im) mbwi = im.select('MBWI') - - max_cluster_value = ee.Number(im.select('cluster').reduceRegion( - reducer = ee.Reducer.max(), - geometry = aoi, - scale = LARGE_SCALE, - maxPixels = 1e10 - ).get('cluster')) - + clusters = ee.List.sequence(0, max_cluster_value) def calc_avg_mbwi(cluster_val): @@ -113,8 +113,11 @@ def calc_avg_mbwi(cluster_val): geometry = aoi, maxPixels = 1e10 ).get('MBWI') - return avg_mbwi - + avg_mbwi_not_null = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual(ee.Number(avg_mbwi)), + ee.Number(-99), + ee.Number(avg_mbwi))) + return avg_mbwi_not_null + # 
print(calc_avg_mbwi(clusters.get(0)).getInfo()) avg_mbwis = ee.Array(clusters.map(calc_avg_mbwi)) max_mbwi_index = avg_mbwis.argmax().get(0) @@ -123,29 +126,69 @@ def calc_avg_mbwi(cluster_val): return water_cluster - def cordeiro(im): - band_subset = ee.List(['NDWI', 'MNDWI', 'SR_B7']) # using NDWI, MNDWI and B7 (SWIR) + ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on + ## calinski harabasz's work + ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans + band_subset = ee.List(['NDWI', 'MNDWI', SWIR2_BAND_NAME]) # using NDWI, MNDWI and B7 (SWIR2) sampled_pts = im.select(band_subset).sample( region = aoi, scale = SMALL_SCALE, numPixels = 5e3-1 ## limit of 5k points ) - ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on - ## calinski harabasz's work - ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans - clusterer = ee.Clusterer.wekaCascadeKMeans( - minClusters = 2, - maxClusters = 7, - init = True - ).train(sampled_pts) + no_sampled_pts = sampled_pts.size() + + def if_enough_sample_pts(im): + clusterer = ee.Clusterer.wekaCascadeKMeans( + minClusters = 2, + maxClusters = 7, + init = True + ).train(sampled_pts) + + classified = im.select(band_subset).cluster(clusterer) + im = im.addBands(classified) + max_cluster_value = ee.Number(im.select('cluster').reduceRegion( + reducer = ee.Reducer.max(), + geometry = aoi, + scale = LARGE_SCALE, + maxPixels = 1e10 + ).get('cluster')) + return ee.Dictionary({'max_cluster_value': max_cluster_value, 'classified': classified}) + + def if_not_enough_sample_pts(): + return ee.Dictionary({'max_cluster_value': ee.Number(0), 'classified': ee.Image(0).rename('cluster')}) - classified = im.select(band_subset).cluster(clusterer) - im = im.addBands(classified) + # If clustering is possible do clustering + def if_clustering_possible(max_cluster_value,classified,im): + im = im.addBands(classified) + + water_cluster = identify_water_cluster(im, max_cluster_value) + water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + return water_map + + # If no clustering is possible, use NDWI water map + def if_clustering_not_possible(im): + water_map = im.select(['water_map_NDWI'],['water_map_cordeiro']) + return water_map + + + after_training_dict = ee.Dictionary(ee.Algorithms.If(ee.Number(no_sampled_pts), + if_enough_sample_pts(im), + if_not_enough_sample_pts())) + + max_cluster_value = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual( + ee.Number(after_training_dict.get('max_cluster_value'))), + ee.Number(0), + ee.Number(after_training_dict.get('max_cluster_value')))) - water_cluster = identify_water_cluster(im) - water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + classified = ee.Image(after_training_dict.get('classified')) + water_map = ee.Image(ee.Algorithms.If(ee.Algorithms.IsEqual(max_cluster_value,ee.Number(0)), + if_clustering_not_possible(im), + if_clustering_possible(max_cluster_value, + classified, + im) + )) im = im.addBands(water_map) return im @@ -176,7 +219,8 @@ def process_image(im): cloud_percent = cloud_area.multiply(100).divide(aoi.area()) cordeiro_will_run_when = cloud_percent.lt(CLOUD_COVER_LIMIT) - + # NDWI based water map: Classify water wherever NDWI is greater than NDWI_THRESHOLD and add water_map_NDWI band. 
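+    # (water_map_NDWI must be added before cordeiro() runs, because cordeiro()
+    # falls back to renaming this band when clustering is not possible.)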
+ im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) # Clusting based im = im.addBands(ee.Image(ee.Algorithms.If(cordeiro_will_run_when, cordeiro(im), ee.Image.constant(-1e6)))) # run cordeiro only if cloud percent is < 90% @@ -199,8 +243,7 @@ def process_image(im): ee.Number(-1e6) )) - # NDWI based - im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) + water_area_NDWI = ee.Number(im.select('water_map_NDWI').eq(1).multiply(ee.Image.pixelArea()).reduceRegion( reducer = ee.Reducer.sum(), geometry = aoi, @@ -213,6 +256,43 @@ def process_image(im): scale = SMALL_SCALE, maxPixels = 1e10 ).get('water_map_NDWI')) + # Calculate red band sum for water area in water_map_NDWI. + water_red_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate green band sum for water area in water_map_NDWI. + water_green_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band sum for water area in water_map_NDWI. + water_nir_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate red band/green band mean for water area in water_map_NDWI. + water_red_green_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).divide(im.select( + GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band/red band mean for water area in water_map_NDWI. 
+ water_nir_red_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).divide(im.select( + RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) im = im.set('cloud_area', cloud_area.multiply(1e-6)) im = im.set('cloud_percent', cloud_percent) @@ -220,6 +300,11 @@ def process_image(im): im = im.set('non_water_area_cordeiro', non_water_area_cordeiro.multiply(1e-6)) im = im.set('water_area_NDWI', water_area_NDWI.multiply(1e-6)) im = im.set('non_water_area_NDWI', non_water_area_NDWI.multiply(1e-6)) + im = im.set('water_red_sum', water_red_sum) + im = im.set('water_green_sum', water_green_sum) + im = im.set('water_nir_sum', water_nir_sum) + im = im.set('water_red_green_mean', water_red_green_mean) + im = im.set('water_nir_red_mean', water_nir_red_mean) return im @@ -239,28 +324,46 @@ def postprocess(): maxPixels = 1e10 ).get('occurrence')) - counts = ee.Array(hist).transpose().toList() - - omega = ee.Number(0.17) - count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) - - count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) - occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) - - water_map = im.select([bandName], ['water_map']) - gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + def if_hist_not_null(im,hist): + + counts = ee.Array(hist).transpose().toList() + + omega = ee.Number(0.17) + count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + + count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) + occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) + + water_map = im.select([bandName], ['water_map']) + gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + + improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + + corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_zhao_gao')) + + improved = improved.set("corrected_area", corrected_area.multiply(1e-6)); + improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1); - improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + return improved - corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( - reducer = ee.Reducer.sum(), - geometry = aoi, - scale = SMALL_SCALE, - maxPixels = 1e10 - ).get('water_map_zhao_gao')) + def if_hist_null(im): + # Preserve the uncorrected area in the corrected area column & POSTPROCESSING=1 + # because otherwise it will be substituted with nan. 
+ uncorrected_area = ee.Number(im.get('water_area_clustering')) + improved = im.set('corrected_area', uncorrected_area) + improved = improved.set('POSTPROCESSING_SUCCESSFUL', 1) + return improved - improved = improved.set("corrected_area", corrected_area.multiply(1e-6)) + improved = ee.Image(ee.Algorithms.If( + hist, if_hist_not_null(im,hist), if_hist_null(im) + )) return improved + def dont_post_process(): improved = ee.Image.constant(-1) @@ -310,11 +413,20 @@ def generate_timeseries(dates): return imcoll def get_first_obs(start_date, end_date): - first_im = l8.filterBounds(aoi).filterDate(start_date, end_date).first() - str_fmt = 'YYYY-MM-dd' - return ee.Date.parse(str_fmt, ee.Date(first_im.get('system:time_start')).format(str_fmt)) + # Filter collection, sort by date, and get the first image's timestamp + first_im = ( + l8.filterBounds(aoi) + .filterDate(start_date, end_date) + .limit(1, 'system:time_start') # Explicitly limit by earliest date + .reduceColumns(ee.Reducer.first(), ['system:time_start']) + ) + + # Convert timestamp to formatted date + first_date = ee.Date(first_im.get('first')).format('YYYY-MM-dd') + + return first_date -def run_process_long(res_name, res_polygon, start, end, datadir): +def run_process_long(res_name, res_polygon, start, end, datadir, results_per_iter=RESULTS_PER_ITER): fo = start enddate = end @@ -328,14 +440,11 @@ def run_process_long(res_name, res_polygon, start, end, datadir): number_of_images = 1 # more than a month difference simply run, so no need to calculate number_of_images if(number_of_images): - fo = get_first_obs(start, end).format('YYYY-MM-dd').getInfo() + fo = get_first_obs(start, end).getInfo() first_obs = datetime.strptime(fo, '%Y-%m-%d') scratchdir = os.path.join(datadir, "_scratch") - # flag = True - num_runs = 0 - # If data already exists, only get new data starting from the last one savepath = os.path.join(datadir, f"{res_name}.csv") @@ -363,103 +472,118 @@ def run_process_long(res_name, res_polygon, start, end, datadir): print(f"Extracting SA for the period {fo} -> {enddate}") dates = pd.date_range(fo, enddate, freq=f'{TEMPORAL_RESOLUTION}D') - grouped_dates = grouper(dates, RESULTS_PER_ITER) - - # # redo the calculations part and see where it is complaining about too many aggregations - # subset_dates = next(grouped_dates) - # dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) + grouped_dates = grouper(dates, results_per_iter) - # print(subset_dates) - # res = generate_timeseries(dates).filterMetadata('s2_images', 'greater_than', 0) - # pprint.pprint(res.aggregate_array('s2_images').getInfo()) - - # uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'water_area_NDWI', 'non_water_area_NDWI', 'cloud_area', 's2_images'] - # uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') - # uncorrected_final_data = uncorrected_final_data_ee.getInfo() - - for subset_dates in grouped_dates: + # Until results per iteration is less than min results per iteration + while results_per_iter >= MIN_RESULTS_PER_ITER: + # try to run for each subset of dates try: - print(subset_dates) - dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) - - res = generate_timeseries(dates).filterMetadata('l8_images', 'greater_than', 0) - # pprint.pprint(res.getInfo()) - - uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 
'l8_images'] - uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') - uncorrected_final_data = uncorrected_final_data_ee.getInfo() - print("Uncorrected", uncorrected_final_data) - - res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro'))) - corrected_columns_to_extract = ['to_date', 'corrected_area'] - corrected_final_data_cordeiro_ee = res_corrected_cordeiro \ - .filterMetadata('corrected_area', 'not_equals', None) \ - .reduceColumns( - ee.Reducer.toList( - len(corrected_columns_to_extract)), - corrected_columns_to_extract - ).get('list') - corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo() - print("Corrected - Cordeiro", corrected_final_data_cordeiro) - - # res_corrected_NDWI = res.map(lambda im: postprocess_wrapper(im, 'water_map_NDWI', im.get('water_area_NDWI'))) - # corrected_final_data_NDWI_ee = res_corrected_NDWI \ - # .filterMetadata('corrected_area', 'not_equals', None) \ - # .reduceColumns( - # ee.Reducer.toList( - # len(corrected_columns_to_extract)), - # corrected_columns_to_extract - # ).get('list') - - # corrected_final_data_NDWI = corrected_final_data_NDWI_ee.getInfo() - - # print(uncorrected_final_data, corrected_final_data_cordeiro) - if len(uncorrected_final_data) == 0: - continue - uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract) - corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1) - # corrected_NDWI_df = pd.DataFrame(corrected_final_data_NDWI, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_NDWI'}, axis=1) - # corrected_df = pd.merge(corrected_cordeiro_df, corrected_NDWI_df, 'left', 'to_date') - df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date') - - df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d") - df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d") - df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day') - df = df.set_index('mosaic_enddate') - print(df.head(2)) - - fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") - df.to_csv(fname) - - s_time = randint(20, 30) - print(f"Sleeping for {s_time} seconds") - time.sleep(randint(20, 30)) - - if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: - print(f"Quitting: Reached enddate {enddate}") - break - elif df.index[-1].strftime('%Y-%m-%d') == fo: - print(f"Reached last available observation - {fo}") - break - elif num_runs > 1000: - print("Quitting: Reached 1000 iterations") - break + for subset_dates in grouped_dates: + try: + print(subset_dates) + dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) + + res = generate_timeseries(dates).filterMetadata('l8_images', 'greater_than', 0) + # pprint.pprint(res.getInfo()) + + uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 'l8_images', + 'water_red_sum', 'water_green_sum', 'water_nir_sum','water_red_green_mean','water_nir_red_mean'] + uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') + uncorrected_final_data = uncorrected_final_data_ee.getInfo() + print("Uncorrected", 
uncorrected_final_data) + + res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro'))) + corrected_columns_to_extract = ['to_date', 'corrected_area'] + corrected_final_data_cordeiro_ee = res_corrected_cordeiro \ + .filterMetadata('corrected_area', 'not_equals', None) \ + .reduceColumns( + ee.Reducer.toList( + len(corrected_columns_to_extract)), + corrected_columns_to_extract + ).get('list') + corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo() + print("Corrected - Cordeiro", corrected_final_data_cordeiro) + + # res_corrected_NDWI = res.map(lambda im: postprocess_wrapper(im, 'water_map_NDWI', im.get('water_area_NDWI'))) + # corrected_final_data_NDWI_ee = res_corrected_NDWI \ + # .filterMetadata('corrected_area', 'not_equals', None) \ + # .reduceColumns( + # ee.Reducer.toList( + # len(corrected_columns_to_extract)), + # corrected_columns_to_extract + # ).get('list') + + # corrected_final_data_NDWI = corrected_final_data_NDWI_ee.getInfo() + + # print(uncorrected_final_data, corrected_final_data_cordeiro) + if len(uncorrected_final_data) == 0: + continue + uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract) + corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1) + # corrected_NDWI_df = pd.DataFrame(corrected_final_data_NDWI, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_NDWI'}, axis=1) + # corrected_df = pd.merge(corrected_cordeiro_df, corrected_NDWI_df, 'left', 'to_date') + df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date') + + df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d") + df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d") + df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day') + df = df.set_index('mosaic_enddate') + print(df.head(2)) + + fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") + df.to_csv(fname) + + s_time = randint(20, 30) + print(f"Sleeping for {s_time} seconds") + time.sleep(randint(20, 30)) + + if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: + print(f"Quitting: Reached enddate {enddate}") + break + elif df.index[-1].strftime('%Y-%m-%d') == fo: + print(f"Reached last available observation - {fo}") + break + except Exception as e: + log.error(e) + # Adjust results_per_iter only if error includes "Too many concurrent aggregations" + if "Too many concurrent aggregations" in str(e): + results_per_iter -= 1 + print(f"Reducing Results per iteration to {results_per_iter} due to error.") + if results_per_iter < MIN_RESULTS_PER_ITER: + print("Minimum Results per iteration reached. Continuing to next group of dates.") + results_per_iter = MIN_RESULTS_PER_ITER + continue + else: + raise Exception(f'Reducing Results per iteration to {results_per_iter}.') + else: + continue + # This exception will be only raised if the error is "Too many concurrent aggregations". + # and Results per iteration will be reduced but still be greater than or equal to minimum results per iteration. + # We will continue while loop and for loop within while loop from the left over grouped dates. 
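+            # When the deliberate re-raise above reaches the handler below, the
+            # remaining date groups are rebuilt from the first date of the failed
+            # subset and the while loop retries with the smaller results_per_iter.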
except Exception as e: -                log.error(e) +                dates = pd.date_range(subset_dates[0], enddate, freq=f'{TEMPORAL_RESOLUTION}D') +                grouped_dates = grouper(dates, results_per_iter) continue +            # In case no exception is raised and the complete for loop ran successfully, break the while loop +            # because we need to run the for loop only once. +            else: +                break # Combine the files into one database to_combine.extend([os.path.join(savedir, f) for f in os.listdir(savedir) if f.endswith(".csv")]) +        if len(to_combine): +            files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] +            data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") -        files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] -        data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") +            data.to_csv(savepath) -        data.to_csv(savepath) - -        return savepath - +            return savepath +        else: +            print(f"Observed data between {start} and {end} could not be processed to get surface area. It may be due to cloud cover or other issues. Quitting!") +            return None +     else: print(f"No observation observed between {start} and {end}. Quitting!") return None diff --git a/src/rat/core/sarea/sarea_cli_l9.py index 58962e5d..e55aa77f 100644 --- a/src/rat/core/sarea/sarea_cli_l9.py +++ b/src/rat/core/sarea/sarea_cli_l9.py @@ -8,7 +8,6 @@ from random import randint import argparse from itertools import zip_longest -from rat.ee_utils.ee_utils import poly2feature from rat.ee_utils.ee_utils import poly2feature from rat.utils.logging import LOG_NAME, NOTIFICATION @@ -39,7 +38,7 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): # NEW STUFF l9 = ee.ImageCollection("LANDSAT/LC09/C02/T1_L2") -gswd = ee.Image("JRC/GSW1_3/GlobalSurfaceWater") +gswd = ee.Image("JRC/GSW1_4/GlobalSurfaceWater") rgb_vis_params = {"bands":["B4","B3","B2"],"min":0,"max":0.4} NDWI_THRESHOLD = 0.3 @@ -52,7 +51,15 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): end_date = ee.Date('2019-02-01') TEMPORAL_RESOLUTION = 16 RESULTS_PER_ITER = 5 +MIN_RESULTS_PER_ITER = 1 MISSION_START_DATE = (2022,1,1) # Rough start date for mission/satellite data +QUALITY_PIXEL_BAND_NAME = 'QA_PIXEL' +BLUE_BAND_NAME = 'SR_B2' +GREEN_BAND_NAME = 'SR_B3' +RED_BAND_NAME = 'SR_B4' +NIR_BAND_NAME = 'SR_B5' +SWIR1_BAND_NAME = 'SR_B6' +SWIR2_BAND_NAME = 'SR_B7' # s2_subset = s2.filterBounds(aoi).filterDate(start_date, end_date) @@ -95,17 +102,10 @@ ## Processing individual images - water classification  ## ##########################################################/ -def identify_water_cluster(im): +def identify_water_cluster(im, max_cluster_value): im = ee.Image(im) mbwi = im.select('MBWI') - - max_cluster_value = ee.Number(im.select('cluster').reduceRegion( - reducer = ee.Reducer.max(), - geometry = aoi, - scale = LARGE_SCALE, - maxPixels = 1e10 - ).get('cluster')) - + clusters = ee.List.sequence(0, max_cluster_value) def calc_avg_mbwi(cluster_val): @@ -116,8 +116,11 @@ def calc_avg_mbwi(cluster_val): geometry = aoi, maxPixels = 1e10 ).get('MBWI') -        return avg_mbwi - +        avg_mbwi_not_null = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual(ee.Number(avg_mbwi)), +                                                        ee.Number(-99), +                                                        ee.Number(avg_mbwi))) +        return avg_mbwi_not_null + # print(calc_avg_mbwi(clusters.get(0)).getInfo()) avg_mbwis = ee.Array(clusters.map(calc_avg_mbwi)) max_mbwi_index = avg_mbwis.argmax().get(0) @@ -128,27 +131,68 @@ def 
cordeiro(im): - band_subset = ee.List(['NDWI', 'MNDWI', 'SR_B7']) # using NDWI, MNDWI and B7 (SWIR) + ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on + ## calinski harabasz's work + ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans + band_subset = ee.List(['NDWI', 'MNDWI', SWIR2_BAND_NAME]) # using NDWI, MNDWI and B7 (SWIR2) sampled_pts = im.select(band_subset).sample( region = aoi, scale = SMALL_SCALE, numPixels = 5e3-1 ## limit of 5k points ) - ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on - ## calinski harabasz's work - ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans - clusterer = ee.Clusterer.wekaCascadeKMeans( - minClusters = 2, - maxClusters = 7, - init = True - ).train(sampled_pts) + no_sampled_pts = sampled_pts.size() + + def if_enough_sample_pts(im): + clusterer = ee.Clusterer.wekaCascadeKMeans( + minClusters = 2, + maxClusters = 7, + init = True + ).train(sampled_pts) + + classified = im.select(band_subset).cluster(clusterer) + im = im.addBands(classified) + max_cluster_value = ee.Number(im.select('cluster').reduceRegion( + reducer = ee.Reducer.max(), + geometry = aoi, + scale = LARGE_SCALE, + maxPixels = 1e10 + ).get('cluster')) + return ee.Dictionary({'max_cluster_value': max_cluster_value, 'classified': classified}) + + def if_not_enough_sample_pts(): + return ee.Dictionary({'max_cluster_value': ee.Number(0), 'classified': ee.Image(0).rename('cluster')}) + + # If clustering is possible do clustering + def if_clustering_possible(max_cluster_value,classified,im): + im = im.addBands(classified) + + water_cluster = identify_water_cluster(im, max_cluster_value) + water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + return water_map + + # If no clustering is possible, use NDWI water map + def if_clustering_not_possible(im): + water_map = im.select(['water_map_NDWI'],['water_map_cordeiro']) + return water_map + - classified = im.select(band_subset).cluster(clusterer) - im = im.addBands(classified) + after_training_dict = ee.Dictionary(ee.Algorithms.If(ee.Number(no_sampled_pts), + if_enough_sample_pts(im), + if_not_enough_sample_pts())) - water_cluster = identify_water_cluster(im) - water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_cordeiro']) + max_cluster_value = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual( + ee.Number(after_training_dict.get('max_cluster_value'))), + ee.Number(0), + ee.Number(after_training_dict.get('max_cluster_value')))) + + classified = ee.Image(after_training_dict.get('classified')) + water_map = ee.Image(ee.Algorithms.If(ee.Algorithms.IsEqual(max_cluster_value,ee.Number(0)), + if_clustering_not_possible(im), + if_clustering_possible(max_cluster_value, + classified, + im) + )) im = im.addBands(water_map) return im @@ -179,7 +223,8 @@ def process_image(im): cloud_percent = cloud_area.multiply(100).divide(aoi.area()) cordeiro_will_run_when = cloud_percent.lt(CLOUD_COVER_LIMIT) - + # NDWI based water map: Classify water wherever NDWI is greater than NDWI_THRESHOLD and add water_map_NDWI band. 
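+    # (Adding the NDWI map first also gives cordeiro() its fallback: when sampling
+    # returns no points or clustering yields a single cluster, it simply renames
+    # water_map_NDWI to water_map_cordeiro.)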
+ im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) # Clusting based im = im.addBands(ee.Image(ee.Algorithms.If(cordeiro_will_run_when, cordeiro(im), ee.Image.constant(-1e6)))) # run cordeiro only if cloud percent is < 90% @@ -202,8 +247,6 @@ def process_image(im): ee.Number(-1e6) )) - # NDWI based - im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) water_area_NDWI = ee.Number(im.select('water_map_NDWI').eq(1).multiply(ee.Image.pixelArea()).reduceRegion( reducer = ee.Reducer.sum(), geometry = aoi, @@ -216,6 +259,43 @@ def process_image(im): scale = SMALL_SCALE, maxPixels = 1e10 ).get('water_map_NDWI')) + # Calculate red band sum for water area in water_map_NDWI. + water_red_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate green band sum for water area in water_map_NDWI. + water_green_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band sum for water area in water_map_NDWI. + water_nir_sum = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate red band/green band mean for water area in water_map_NDWI. + water_red_green_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(RED_BAND_NAME)).divide(im.select( + GREEN_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) + # Calculate nir band/red band mean for water area in water_map_NDWI. 
+ water_nir_red_mean = ee.Number(im.select('water_map_NDWI').eq(1).multiply(im.select(NIR_BAND_NAME)).divide(im.select( + RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_NDWI')) im = im.set('cloud_area', cloud_area.multiply(1e-6)) im = im.set('cloud_percent', cloud_percent) @@ -223,6 +303,11 @@ def process_image(im): im = im.set('non_water_area_cordeiro', non_water_area_cordeiro.multiply(1e-6)) im = im.set('water_area_NDWI', water_area_NDWI.multiply(1e-6)) im = im.set('non_water_area_NDWI', non_water_area_NDWI.multiply(1e-6)) + im = im.set('water_red_sum', water_red_sum) + im = im.set('water_green_sum', water_green_sum) + im = im.set('water_nir_sum', water_nir_sum) + im = im.set('water_red_green_mean', water_red_green_mean) + im = im.set('water_nir_red_mean', water_nir_red_mean) return im @@ -242,27 +327,44 @@ def postprocess(): maxPixels = 1e10 ).get('occurrence')) - counts = ee.Array(hist).transpose().toList() - - omega = ee.Number(0.17) - count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + def if_hist_not_null(im,hist): + + counts = ee.Array(hist).transpose().toList() + + omega = ee.Number(0.17) + count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + + count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) + occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) + + water_map = im.select([bandName], ['water_map']) + gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + + improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + + corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = SMALL_SCALE, + maxPixels = 1e10 + ).get('water_map_zhao_gao')) + + improved = improved.set("corrected_area", corrected_area.multiply(1e-6)); + improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1); - count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) - occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) - - water_map = im.select([bandName], ['water_map']) - gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + return improved - improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + def if_hist_null(im): + # Preserve the uncorrected area in the corrected area column & POSTPROCESSING=1 + # because otherwise it will be substituted with nan. 
+ uncorrected_area = ee.Number(im.get('water_area_clustering')) + improved = im.set('corrected_area', uncorrected_area) + improved = improved.set('POSTPROCESSING_SUCCESSFUL', 1) + return improved - corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( - reducer = ee.Reducer.sum(), - geometry = aoi, - scale = SMALL_SCALE, - maxPixels = 1e10 - ).get('water_map_zhao_gao')) - - improved = improved.set("corrected_area", corrected_area.multiply(1e-6)) + improved = ee.Image(ee.Algorithms.If( + hist, if_hist_not_null(im,hist), if_hist_null(im) + )) return improved def dont_post_process(): @@ -313,11 +415,20 @@ def generate_timeseries(dates): return imcoll def get_first_obs(start_date, end_date): - first_im = l9.filterBounds(aoi).filterDate(start_date, end_date).first() - str_fmt = 'YYYY-MM-dd' - return ee.Date.parse(str_fmt, ee.Date(first_im.get('system:time_start')).format(str_fmt)) + # Filter collection, sort by date, and get the first image's timestamp + first_im = ( + l9.filterBounds(aoi) + .filterDate(start_date, end_date) + .limit(1, 'system:time_start') # Explicitly limit by earliest date + .reduceColumns(ee.Reducer.first(), ['system:time_start']) + ) + + # Convert timestamp to formatted date + first_date = ee.Date(first_im.get('first')).format('YYYY-MM-dd') + + return first_date -def run_process_long(res_name, res_polygon, start, end, datadir): +def run_process_long(res_name, res_polygon, start, end, datadir, results_per_iter=RESULTS_PER_ITER): fo = start enddate = end @@ -334,13 +445,11 @@ def run_process_long(res_name, res_polygon, start, end, datadir): number_of_images = 1 # more than a month difference simply run, so no need to calculate number_of_images if(number_of_images): - fo = get_first_obs(start, end).format('YYYY-MM-dd').getInfo() + fo = get_first_obs(start, end).getInfo() first_obs = datetime.strptime(fo, '%Y-%m-%d') scratchdir = os.path.join(datadir, "_scratch") - # flag = True - num_runs = 0 # If data already exists, only get new data starting from the last one savepath = os.path.join(datadir, f"{res_name}.csv") @@ -369,102 +478,116 @@ def run_process_long(res_name, res_polygon, start, end, datadir): print(f"Extracting SA for the period {fo} -> {enddate}") dates = pd.date_range(fo, enddate, freq=f'{TEMPORAL_RESOLUTION}D') - grouped_dates = grouper(dates, RESULTS_PER_ITER) - - # # redo the calculations part and see where it is complaining about too many aggregations - # subset_dates = next(grouped_dates) - # dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) - - # print(subset_dates) - # res = generate_timeseries(dates).filterMetadata('s2_images', 'greater_than', 0) - # pprint.pprint(res.aggregate_array('s2_images').getInfo()) - - # uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'water_area_NDWI', 'non_water_area_NDWI', 'cloud_area', 's2_images'] - # uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') - # uncorrected_final_data = uncorrected_final_data_ee.getInfo() + grouped_dates = grouper(dates, results_per_iter) - for subset_dates in grouped_dates: + # Until results per iteration is less than min results per iteration + while results_per_iter >= MIN_RESULTS_PER_ITER: + # try to run for each subset of dates try: - print(subset_dates) - dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) - - res = 
generate_timeseries(dates).filterMetadata('l9_images', 'greater_than', 0) - # pprint.pprint(res.getInfo()) - - uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 'l9_images'] - uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') - uncorrected_final_data = uncorrected_final_data_ee.getInfo() - print("Uncorrected", uncorrected_final_data) - - res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro'))) - corrected_columns_to_extract = ['to_date', 'corrected_area'] - corrected_final_data_cordeiro_ee = res_corrected_cordeiro \ - .filterMetadata('corrected_area', 'not_equals', None) \ - .reduceColumns( - ee.Reducer.toList( - len(corrected_columns_to_extract)), - corrected_columns_to_extract - ).get('list') - corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo() - print("Corrected - Cordeiro", corrected_final_data_cordeiro) - - # res_corrected_NDWI = res.map(lambda im: postprocess_wrapper(im, 'water_map_NDWI', im.get('water_area_NDWI'))) - # corrected_final_data_NDWI_ee = res_corrected_NDWI \ - # .filterMetadata('corrected_area', 'not_equals', None) \ - # .reduceColumns( - # ee.Reducer.toList( - # len(corrected_columns_to_extract)), - # corrected_columns_to_extract - # ).get('list') - - # corrected_final_data_NDWI = corrected_final_data_NDWI_ee.getInfo() - - # print(uncorrected_final_data, corrected_final_data_cordeiro) - if len(uncorrected_final_data) == 0: - continue - uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract) - corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1) - # corrected_NDWI_df = pd.DataFrame(corrected_final_data_NDWI, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_NDWI'}, axis=1) - # corrected_df = pd.merge(corrected_cordeiro_df, corrected_NDWI_df, 'left', 'to_date') - df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date') - - df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d") - df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d") - df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day') - df = df.set_index('mosaic_enddate') - print(df.head(2)) - - fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") - df.to_csv(fname) - - s_time = randint(20, 30) - print(f"Sleeping for {s_time} seconds") - time.sleep(randint(20, 30)) - - if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: - print(f"Quitting: Reached enddate {enddate}") - break - elif df.index[-1].strftime('%Y-%m-%d') == fo: - print(f"Reached last available observation - {fo}") - break - elif num_runs > 1000: - print("Quitting: Reached 1000 iterations") - break + for subset_dates in grouped_dates: + try: + print(subset_dates) + dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) + + res = generate_timeseries(dates).filterMetadata('l9_images', 'greater_than', 0) + # pprint.pprint(res.getInfo()) + + uncorrected_columns_to_extract = ['from_date', 'to_date', 'water_area_cordeiro', 'non_water_area_cordeiro', 'cloud_area', 'l9_images', + 'water_red_sum', 'water_green_sum', 'water_nir_sum','water_red_green_mean','water_nir_red_mean'] + 
uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') + uncorrected_final_data = uncorrected_final_data_ee.getInfo() + print("Uncorrected", uncorrected_final_data) + + res_corrected_cordeiro = res.map(lambda im: postprocess_wrapper(im, 'water_map_cordeiro', im.get('water_area_cordeiro'))) + corrected_columns_to_extract = ['to_date', 'corrected_area'] + corrected_final_data_cordeiro_ee = res_corrected_cordeiro \ + .filterMetadata('corrected_area', 'not_equals', None) \ + .reduceColumns( + ee.Reducer.toList( + len(corrected_columns_to_extract)), + corrected_columns_to_extract + ).get('list') + corrected_final_data_cordeiro = corrected_final_data_cordeiro_ee.getInfo() + print("Corrected - Cordeiro", corrected_final_data_cordeiro) + + # res_corrected_NDWI = res.map(lambda im: postprocess_wrapper(im, 'water_map_NDWI', im.get('water_area_NDWI'))) + # corrected_final_data_NDWI_ee = res_corrected_NDWI \ + # .filterMetadata('corrected_area', 'not_equals', None) \ + # .reduceColumns( + # ee.Reducer.toList( + # len(corrected_columns_to_extract)), + # corrected_columns_to_extract + # ).get('list') + + # corrected_final_data_NDWI = corrected_final_data_NDWI_ee.getInfo() + + # print(uncorrected_final_data, corrected_final_data_cordeiro) + if len(uncorrected_final_data) == 0: + continue + uncorrected_df = pd.DataFrame(uncorrected_final_data, columns=uncorrected_columns_to_extract) + corrected_cordeiro_df = pd.DataFrame(corrected_final_data_cordeiro, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_cordeiro'}, axis=1) + # corrected_NDWI_df = pd.DataFrame(corrected_final_data_NDWI, columns=corrected_columns_to_extract).rename({'corrected_area': 'corrected_area_NDWI'}, axis=1) + # corrected_df = pd.merge(corrected_cordeiro_df, corrected_NDWI_df, 'left', 'to_date') + df = pd.merge(uncorrected_df, corrected_cordeiro_df, 'left', 'to_date') + + df['from_date'] = pd.to_datetime(df['from_date'], format="%Y-%m-%d") + df['to_date'] = pd.to_datetime(df['to_date'], format="%Y-%m-%d") + df['mosaic_enddate'] = df['to_date'] - pd.Timedelta(1, unit='day') + df = df.set_index('mosaic_enddate') + print(df.head(2)) + + fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") + df.to_csv(fname) + + s_time = randint(20, 30) + print(f"Sleeping for {s_time} seconds") + time.sleep(randint(20, 30)) + + if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: + print(f"Quitting: Reached enddate {enddate}") + break + elif df.index[-1].strftime('%Y-%m-%d') == fo: + print(f"Reached last available observation - {fo}") + break + except Exception as e: + # Adjust results_per_iter only if error includes "Too many concurrent aggregations" + if "Too many concurrent aggregations" in str(e): + results_per_iter -= 1 + print(f"Reducing Results per iteration to {results_per_iter} due to error.") + if results_per_iter < MIN_RESULTS_PER_ITER: + print("Minimum Results per iteration reached. Continuing to next group of dates.") + results_per_iter = MIN_RESULTS_PER_ITER + continue + else: + raise Exception(f'Reducing Results per iteration to {results_per_iter}.') + else: + continue + # This exception will be only raised if the error is "Too many concurrent aggregations". + # and Results per iteration will be reduced but still be greater than or equal to minimum results per iteration. 
+ # We will continue while loop and for loop within while loop from the left over grouped dates. except Exception as e: -                log.error(e) +                dates = pd.date_range(subset_dates[0], enddate, freq=f'{TEMPORAL_RESOLUTION}D') +                grouped_dates = grouper(dates, results_per_iter) continue - +            # In case no exception is raised and the complete for loop ran successfully, break the while loop +            # because we need to run the for loop only once. +            else: +                break +         # Combine the files into one database to_combine.extend([os.path.join(savedir, f) for f in os.listdir(savedir) if f.endswith(".csv")]) +        if len(to_combine): +            files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] +            data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") -        files = [pd.read_csv(f, parse_dates=["mosaic_enddate"]).set_index("mosaic_enddate") for f in to_combine] -        data = pd.concat(files).drop_duplicates().sort_values("mosaic_enddate") +            data.to_csv(savepath) -        data.to_csv(savepath) - -        return savepath +            return savepath +        else: +            print(f"Observed data between {start} and {end} could not be processed to get surface area. It may be due to cloud cover or other issues. Quitting!") +            return None else: print(f"No observation observed between {start} and {end}. Quitting!") diff --git a/src/rat/core/sarea/sarea_cli_s2.py index 45dec1df..e2d06041 100644 --- a/src/rat/core/sarea/sarea_cli_s2.py +++ b/src/rat/core/sarea/sarea_cli_s2.py @@ -7,7 +7,6 @@ import os from random import randint from itertools import zip_longest -from rat.ee_utils.ee_utils import poly2feature from rat.ee_utils.ee_utils import poly2feature from rat.utils.logging import LOG_NAME, NOTIFICATION @@ -32,12 +31,14 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): raise ValueError('Expected fill, strict, or ignore') # NEW STUFF -s2 = ee.ImageCollection("COPERNICUS/S2_SR") -gswd = ee.Image("JRC/GSW1_3/GlobalSurfaceWater") +s2 = ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED") +gswd = ee.Image("JRC/GSW1_4/GlobalSurfaceWater") rgb_vis_params = {"bands":["B4","B3","B2"],"min":0,"max":0.4} NDWI_THRESHOLD = 0.3; -SMALL_SCALE = 20; +SPATIAL_SCALE_SMALL = 20; +SPATIAL_SCALE_MEDIUM = 50; +SPATIAL_SCALE_LARGE = 200; MEDIUM_SCALE = 120; LARGE_SCALE = 500; BUFFER_DIST = 500 @@ -46,6 +47,15 @@ def grouper(iterable, n, *, incomplete='fill', fillvalue=None): end_date = ee.Date('2019-02-01') TEMPORAL_RESOLUTION = 5 RESULTS_PER_ITER = 5 +MIN_RESULTS_PER_ITER = 1 +MISSION_START_DATE = (2022,1,1) # Rough start date for mission/satellite data +QUALITY_PIXEL_BAND_NAME = 'QA_PIXEL' +BLUE_BAND_NAME = 'B2' +GREEN_BAND_NAME = 'B3' +RED_BAND_NAME = 'B4' +NIR_BAND_NAME = 'B8' +SWIR1_BAND_NAME = 'B11' +SWIR2_BAND_NAME = 'B12' # aoi = reservoir.geometry().simplify(100).buffer(500); @@ -65,7 +75,7 @@ def scl_cloud_mask(im): # cloud_area = cloudmask.reduceRegion( #     reducer = ee.Reducer.sum(), #     geometry = aoi, -    #     scale = SMALL_SCALE, +    #     scale = SPATIAL_SCALE_SMALL, #     maxPixels = 1e10 # ).get('') im = im.addBands(cloudmask) @@ -86,17 +96,10 @@ def preprocess(im): return im -def identify_water_cluster(im): +def identify_water_cluster(im, max_cluster_value): im = ee.Image(im) mbwi = im.select('MBWI') - - max_cluster_value = ee.Number(im.select('cluster').reduceRegion( - reducer = ee.Reducer.max(), - geometry = aoi, - scale = LARGE_SCALE, - maxPixels = 1e10 - ).get('cluster')) - + clusters = ee.List.sequence(0, max_cluster_value) def calc_avg_mbwi(cluster_val): @@ -107,8 +110,11 @@ def 
calc_avg_mbwi(cluster_val): geometry = aoi, maxPixels = 1e10 ).get('MBWI') - return avg_mbwi - + avg_mbwi_not_null = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual(ee.Number(avg_mbwi)), + ee.Number(-99), + ee.Number(avg_mbwi))) + return avg_mbwi_not_null + # print(calc_avg_mbwi(clusters.get(0)).getInfo()) avg_mbwis = ee.Array(clusters.map(calc_avg_mbwi)) max_mbwi_index = avg_mbwis.argmax().get(0) @@ -118,34 +124,74 @@ def calc_avg_mbwi(cluster_val): return water_cluster -def clustering(im): +def clustering(im, spatial_scale): + ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on + ## calinski harabasz's work + ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans band_subset = ee.List(['NDWI', 'B12']) sampled_pts = im.select(band_subset).sample( region = aoi, - scale = SMALL_SCALE, + scale = SPATIAL_SCALE_SMALL, numPixels = 4999 ## limit of 5k points, staying at 4k ) + no_sampled_pts = sampled_pts.size() - ## Agglomerative Clustering isn't available, using Cascade K-Means Clustering based on - ## calinski harabasz's work - ## https:##developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans - clusterer = ee.Clusterer.wekaCascadeKMeans( - minClusters = 2, - maxClusters = 7, - init = True - ).train(sampled_pts) + def if_enough_sample_pts(im): + clusterer = ee.Clusterer.wekaCascadeKMeans( + minClusters = 2, + maxClusters = 7, + init = True + ).train(sampled_pts) + + classified = im.select(band_subset).cluster(clusterer) + im = im.addBands(classified) + max_cluster_value = ee.Number(im.select('cluster').reduceRegion( + reducer = ee.Reducer.max(), + geometry = aoi, + scale = LARGE_SCALE, + maxPixels = 1e10 + ).get('cluster')) + return ee.Dictionary({'max_cluster_value': max_cluster_value, 'classified': classified}) + + def if_not_enough_sample_pts(): + return ee.Dictionary({'max_cluster_value': ee.Number(0), 'classified': ee.Image(0).rename('cluster')}) + + # If clustering is possible do clustering + def if_clustering_possible(max_cluster_value,classified,im): + im = im.addBands(classified) + + water_cluster = identify_water_cluster(im, max_cluster_value) + water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_clustering']) + return water_map + + # If no clustering is possible, use NDWI water map + def if_clustering_not_possible(im): + water_map = im.select(['water_map_NDWI'],['water_map_clustering']) + return water_map - classified = im.select(band_subset).cluster(clusterer) - im = im.addBands(classified) - water_cluster = identify_water_cluster(im) - water_map = classified.select('cluster').eq(ee.Image.constant(water_cluster)).select(['cluster'], ['water_map_clustering']) + after_training_dict = ee.Dictionary(ee.Algorithms.If(ee.Number(no_sampled_pts), + if_enough_sample_pts(im), + if_not_enough_sample_pts())) + + max_cluster_value = ee.Number(ee.Algorithms.If(ee.Algorithms.IsEqual( + ee.Number(after_training_dict.get('max_cluster_value'))), + ee.Number(0), + ee.Number(after_training_dict.get('max_cluster_value')))) + + classified = ee.Image(after_training_dict.get('classified')) + water_map = ee.Image(ee.Algorithms.If(ee.Algorithms.IsEqual(max_cluster_value,ee.Number(0)), + if_clustering_not_possible(im), + if_clustering_possible(max_cluster_value, + classified, + im) + )) im = im.addBands(water_map) return im -def process_image(im): +def process_image(im, spatial_scale): # Process Image ndwi = im.normalizedDifference(['B3', 'B8']).rename('NDWI'); @@ 
-161,34 +207,40 @@ def process_image(im): }) im = im.addBands(mbwi); +    # NDWI based water map: Classify water wherever NDWI is greater than NDWI_THRESHOLD and add water_map_NDWI band. +    im = im.addBands(ndwi.gte(NDWI_THRESHOLD).select(['NDWI'], ['water_map_NDWI'])) +     cloud_area = aoi.area().subtract(im.select('cloud').Not().multiply(ee.Image.pixelArea()).reduceRegion( reducer = ee.Reducer.sum(), geometry = aoi, -        scale = SMALL_SCALE, +        scale = spatial_scale, maxPixels = 1e10 ).get('cloud')) cloud_percent = cloud_area.multiply(100).divide(aoi.area()) CLOUD_LIMIT_SATISFIED = cloud_percent.lt(CLOUD_COVER_LIMIT) - +    # Clustering based +    # print('starting clustering') im = im.addBands( ee.Image( ee.Algorithms.If( CLOUD_LIMIT_SATISFIED, -                clustering(im), +                clustering(im, spatial_scale), ee.Image.constant(-1e6) ) ) )  # run clustering only if cloud percent is < 90% - +    # except: +    #     print('Clustering could not be done. Using NDWI water map instead.') +    #     water_map = 'water_map_NDWI' water_area_clustering = ee.Number( ee.Algorithms.If( CLOUD_LIMIT_SATISFIED, ee.Number(im.select('water_map_clustering').eq(1).multiply(ee.Image.pixelArea()).reduceRegion( reducer = ee.Reducer.sum(), geometry = aoi, -                scale = SMALL_SCALE, +                scale = spatial_scale, maxPixels = 1e10 ).get('water_map_clustering')), ee.Number(-1e6) ) ) @@ -200,7 +252,74 @@ def process_image(im): ee.Number(im.select('water_map_clustering').neq(1).multiply(ee.Image.pixelArea()).reduceRegion( reducer = ee.Reducer.sum(), geometry = aoi, -                scale = SMALL_SCALE, +                scale = spatial_scale, +                maxPixels = 1e10 +            ).get('water_map_clustering')), +            ee.Number(-1e6) +        ) +    ) +    # Calculate red band sum for water area in water_map_clustering. +    water_red_sum = ee.Number( +        ee.Algorithms.If( +            CLOUD_LIMIT_SATISFIED, +            ee.Number(im.select('water_map_clustering').eq(1).multiply(im.select(RED_BAND_NAME)).reduceRegion( +                reducer = ee.Reducer.sum(), +                geometry = aoi, +                scale = spatial_scale, +                maxPixels = 1e10 +            ).get('water_map_clustering')), +            ee.Number(-1e6) +        ) +    ) +    # Calculate green band sum for water area in water_map_clustering. +    water_green_sum = ee.Number( +        ee.Algorithms.If( +            CLOUD_LIMIT_SATISFIED, +            ee.Number(im.select('water_map_clustering').eq(1).multiply(im.select(GREEN_BAND_NAME)).reduceRegion( +                reducer = ee.Reducer.sum(), +                geometry = aoi, +                scale = spatial_scale, +                maxPixels = 1e10 +            ).get('water_map_clustering')), +            ee.Number(-1e6) +        ) +    ) +    # Calculate nir band sum for water area in water_map_clustering. +    water_nir_sum = ee.Number( +        ee.Algorithms.If( +            CLOUD_LIMIT_SATISFIED, +            ee.Number(im.select('water_map_clustering').eq(1).multiply(im.select(NIR_BAND_NAME)).reduceRegion( +                reducer = ee.Reducer.sum(), +                geometry = aoi, +                scale = spatial_scale, +                maxPixels = 1e10 +            ).get('water_map_clustering')), +            ee.Number(-1e6) +        ) +    ) +    # Calculate red band/green band mean for water area in water_map_clustering. +    water_red_green_mean = ee.Number( +        ee.Algorithms.If( +            CLOUD_LIMIT_SATISFIED, +            ee.Number(im.select('water_map_clustering').eq(1).multiply(im.select(RED_BAND_NAME)).divide(im.select( +                GREEN_BAND_NAME)).reduceRegion( +                reducer = ee.Reducer.mean(), +                geometry = aoi, +                scale = spatial_scale, +                maxPixels = 1e10 +            ).get('water_map_clustering')), +            ee.Number(-1e6) +        ) +    ) +    # Calculate nir band/red band mean for water area in water_map_clustering. 
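+    # (Each of these statistics is guarded by CLOUD_LIMIT_SATISFIED; -1e6 is the
+    # sentinel recorded when the scene is too cloudy to classify.)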
+ water_nir_red_mean = ee.Number( + ee.Algorithms.If( + CLOUD_LIMIT_SATISFIED, + ee.Number(im.select('water_map_clustering').eq(1).multiply(im.select(NIR_BAND_NAME)).divide(im.select( + RED_BAND_NAME)).reduceRegion( + reducer = ee.Reducer.mean(), + geometry = aoi, + scale = spatial_scale, maxPixels = 1e10 ).get('water_map_clustering')), ee.Number(-1e6) @@ -212,59 +331,85 @@ def process_image(im): im = im.set('water_area_clustering', water_area_clustering.multiply(1e-6)) im = im.set('non_water_area_clustering', non_water_area_clustering.multiply(1e-6)) im = im.set('PROCESSING_SUCCESSFUL', CLOUD_LIMIT_SATISFIED) + im = im.set('water_red_sum', water_red_sum) + im = im.set('water_green_sum', water_green_sum) + im = im.set('water_nir_sum', water_nir_sum) + im = im.set('water_red_green_mean', water_red_green_mean) + im = im.set('water_nir_red_mean', water_nir_red_mean) return im -def postprocess(im, bandName='water_map_clustering'): +def postprocess(im, spatial_scale, bandName='water_map_clustering'): gswd_masked = gswd.updateMask(im.select(bandName).eq(1)) hist = ee.List(gswd_masked.reduceRegion( reducer = ee.Reducer.autoHistogram(minBucketWidth = 1), geometry = aoi, - scale = SMALL_SCALE, + scale = spatial_scale, maxPixels = 1e10 ).get('occurrence')) - counts = ee.Array(hist).transpose().toList() - - omega = ee.Number(0.17) - count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + def if_hist_not_null(im,hist): - count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) - occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) + counts = ee.Array(hist).transpose().toList() + + omega = ee.Number(0.17) + count_thresh = ee.Number(counts.map(lambda lis: ee.List(lis).reduce(ee.Reducer.mean())).get(1)).multiply(omega) + + count_thresh_index = ee.Array(counts.get(1)).gt(count_thresh).toList().indexOf(1) + occurrence_thresh = ee.Number(ee.List(counts.get(0)).get(count_thresh_index)) - water_map = im.select([bandName], ['water_map']) - gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) - - improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + water_map = im.select([bandName], ['water_map']) + gswd_improvement = gswd.clip(aoi).gte(occurrence_thresh).updateMask(water_map.mask().Not()).select(["occurrence"], ["water_map"]) + + improved = ee.ImageCollection([water_map, gswd_improvement]).mosaic().select(['water_map'], ['water_map_zhao_gao']) + + corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( + reducer = ee.Reducer.sum(), + geometry = aoi, + scale = spatial_scale, + maxPixels = 1e10 + ).get('water_map_zhao_gao')) + + improved = improved.set("corrected_area", corrected_area.multiply(1e-6)); + improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1); - corrected_area = ee.Number(improved.select('water_map_zhao_gao').multiply(ee.Image.pixelArea()).reduceRegion( - reducer = ee.Reducer.sum(), - geometry = aoi, - scale = SMALL_SCALE, - maxPixels = 1e10 - ).get('water_map_zhao_gao')) + return improved - improved = improved.set("corrected_area", corrected_area.multiply(1e-6)); - improved = improved.set("POSTPROCESSING_SUCCESSFUL", 1); + def if_hist_null(im): + # Preserve the uncorrected area in the corrected area column & POSTPROCESSING=1 + # because otherwise it will be substituted with nan. 
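+        # (water_area_clustering is the uncorrected area set in process_image, so
+        # the client-side parser later writes a real value rather than NaN into
+        # water_area_corrected.)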
+ uncorrected_area = ee.Number(im.get('water_area_clustering')) + improved = im.set('corrected_area', uncorrected_area) + improved = improved.set('POSTPROCESSING_SUCCESSFUL', 1) + return improved + improved = ee.Image(ee.Algorithms.If( + hist, if_hist_not_null(im,hist), if_hist_null(im) + )) return improved - -def postprocess_wrapper(im, bandName='water_map_clustering'): - def do_not_postprocess(): - im = ee.Image.constant(-1) - im = im.set('corrected_area', -1) - im = im.set('POSTPROCESSING_SUCCESSFUL', 0) +def postprocess_wrapper(im, spatial_scale, bandName='water_map_clustering'): - return im + def do_not_postprocess(): + default_im = ee.Image.constant(-1).rename(bandName) + default_im = default_im.set('corrected_area', -1) + default_im = default_im.set('POSTPROCESSING_SUCCESSFUL', 0) + return default_im - improved = ee.Algorithms.If( + # Ensure PROCESSING_SUCCESSFUL is boolean and defaults to False + processing_successful = ee.Algorithms.If( im.get('PROCESSING_SUCCESSFUL'), - postprocess(im, bandName), - do_not_postprocess() + im.get('PROCESSING_SUCCESSFUL'), + False ) + improved = ee.Image(ee.Algorithms.If( + processing_successful, + postprocess(im, spatial_scale, bandName), + do_not_postprocess() + )) + return improved @@ -272,9 +417,12 @@ def do_not_postprocess(): ## Code from here takes care of the time-series generation ## ############################################################/ def calc_ndwi(im): - return im.addBands(im.normalizedDifference(['B3', 'B8']).rename('NDWI')) + im = im.addBands(im.normalizedDifference(['B3', 'B8']).rename('NDWI')) + # Sort the bands in ascending order of their name for consistency + im = im.select(im.bandNames().sort()) + return im -def process_date(date): +def process_date(date, spatial_scale): date = ee.Date(date) to_date = date.advance(1, 'day') from_date = date.advance(-(TEMPORAL_RESOLUTION-1), 'day') @@ -298,11 +446,11 @@ def not_enough_images(): im = ee.Image( ee.Algorithms.If( ENOUGH_IMAGES, - process_image(im), + process_image(im, spatial_scale), not_enough_images() ) ) - + # print('Processed image') im = im.set('from_date', from_date.format("YYYY-MM-dd")) im = im.set('to_date', to_date.format("YYYY-MM-dd")) im = im.set('system:time_start', date.format("YYYY-MM-dd")) @@ -312,19 +460,29 @@ def not_enough_images(): return ee.Image(im) -def generate_timeseries(dates): - raw_ts = dates.map(process_date) +def generate_timeseries(dates, spatial_scale): + # raw_ts = process_date(dates.get(4)) + raw_ts = dates.map(lambda date: process_date(date,spatial_scale)) # raw_ts = raw_ts.removeAll([0]); imcoll = ee.ImageCollection.fromImages(raw_ts) return imcoll def get_first_obs(start_date, end_date): - first_im = s2.filterBounds(aoi).filterDate(start_date, end_date).first() - str_fmt = 'YYYY-MM-dd' - return ee.Date.parse(str_fmt, ee.Date(first_im.get('system:time_start')).format(str_fmt)) + # Filter collection, sort by date, and get the first image's timestamp + first_im = ( + s2.filterBounds(aoi) + .filterDate(start_date, end_date) + .limit(1, 'system:time_start') # Explicitly limit by earliest date + .reduceColumns(ee.Reducer.first(), ['system:time_start']) + ) + + # Convert timestamp to formatted date + first_date = ee.Date(first_im.get('first')).format('YYYY-MM-dd') + + return first_date -def run_process_long(res_name,res_polygon, start, end, datadir): +def run_process_long(res_name,res_polygon, start, end, datadir, results_per_iter=RESULTS_PER_ITER): fo = start enddate = end @@ -338,13 +496,11 @@ def run_process_long(res_name,res_polygon, start, 
end, datadir): number_of_images = 1 # more than a month difference simply run, so no need to calculate number_of_images if(number_of_images): - fo = get_first_obs(start, end).format('YYYY-MM-dd').getInfo() + fo = get_first_obs(start, end).getInfo() first_obs = datetime.strptime(fo, '%Y-%m-%d') scratchdir = os.path.join(datadir, "_scratch") - # flag = True - num_runs = 0 # If data already exists, only get new data starting from the last one savepath = os.path.join(datadir, f"{res_name}.csv") @@ -373,7 +529,7 @@ def run_process_long(res_name,res_polygon, start, end, datadir): print(f"Extracting SA for the period {fo} -> {enddate}") dates = pd.date_range(fo, enddate, freq=f'{TEMPORAL_RESOLUTION}D') - grouped_dates = grouper(dates, RESULTS_PER_ITER) + grouped_dates = grouper(dates, results_per_iter) # # redo the calculations part and see where it is complaining about too many aggregations # subset_dates = next(grouped_dates) @@ -387,95 +543,173 @@ def run_process_long(res_name,res_polygon, start, end, datadir): # uncorrected_final_data_ee = res.reduceColumns(ee.Reducer.toList(len(uncorrected_columns_to_extract)), uncorrected_columns_to_extract).get('list') # uncorrected_final_data = uncorrected_final_data_ee.getInfo() - - - for subset_dates in grouped_dates: + # Until results per iteration is less than min results per iteration + while results_per_iter >= MIN_RESULTS_PER_ITER: + # try to run for each subset of dates try: - print(subset_dates) - dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) - - ts_imcoll = generate_timeseries(dates) - postprocessed_ts_imcoll = ts_imcoll.map(postprocess_wrapper) - - # Download the data locally - ts_imcoll_L = ts_imcoll.getInfo() - postprocessed_ts_imcoll_L = postprocessed_ts_imcoll.getInfo() - - # Parse the data to create dataframe - PROCESSING_STATUSES = [] - POSTPROCESSING_STATUSES = [] - cloud_areas = [] - cloud_percents = [] - from_dates = [] - to_dates = [] - obs_dates = [] - non_water_areas = [] - water_areas = [] - water_areas_zhaogao = [] - for f, f_postprocessed in zip(ts_imcoll_L['features'], postprocessed_ts_imcoll_L['features']): - PROCESSING_STATUS = f['properties']['PROCESSING_SUCCESSFUL'] - PROCESSING_STATUSES.append(PROCESSING_STATUS) - POSTPROCESSING_STATUS = f_postprocessed['properties']['POSTPROCESSING_SUCCESSFUL'] - POSTPROCESSING_STATUSES.append(POSTPROCESSING_STATUS) - obs_dates.append(pd.to_datetime(f['properties']['system:time_start'])) - from_dates.append(pd.to_datetime(f['properties']['from_date'])) - to_dates.append(pd.to_datetime(f['properties']['to_date'])) - if PROCESSING_STATUS: - water_areas.append(f['properties']['water_area_clustering']) - non_water_areas.append(f['properties']['non_water_area_clustering']) - cloud_areas.append(f['properties']['cloud_area']) - cloud_percents.append(f['properties']['cloud_percent']) - else: - water_areas.append(np.nan) - non_water_areas.append(np.nan) - cloud_areas.append(np.nan) - cloud_percents.append(np.nan) - if POSTPROCESSING_STATUS: - water_areas_zhaogao.append(f_postprocessed['properties']['corrected_area']) - else: - water_areas_zhaogao.append(np.nan) - - df = pd.DataFrame({ - 'date': obs_dates, - 'PROCESSING_STATUS': PROCESSING_STATUSES, - 'POSTPROCESSING_STATUS': POSTPROCESSING_STATUSES, - 'from_date': from_dates, - 'to_date': to_dates, - 'cloud_area': cloud_areas, - 'cloud_percent': cloud_percents, - 'water_area_uncorrected': water_areas, - 'non_water_area': non_water_areas, - 'water_area_corrected': water_areas_zhaogao - }).set_index('date') - - fname = 
os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") -                df.to_csv(fname) -                print(df.tail()) - -                s_time = randint(5, 10) -                print(f"Sleeping for {s_time} seconds") -                time.sleep(s_time) - -                if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: -                    print(f"Quitting: Reached enddate {enddate}") -                    break -                elif df.index[-1].strftime('%Y-%m-%d') == fo: -                    print(f"Reached last available observation - {fo}") -                    break -                elif num_runs > 1000: -                    print("Quitting: Reached 1000 iterations") -                    break +                scale_to_use = SPATIAL_SCALE_SMALL +                for subset_dates in grouped_dates: +                    # try to run for subset_dates with results_per_iter +                    try: +                        print(subset_dates) +                        dates = ee.List([ee.Date(d) for d in subset_dates if d is not None]) + +                        success_status = 0 # initialize success_status: stays 0 on a failed attempt (retry), becomes 1 on success and -1 on permanent failure +                        while (success_status == 0): +                            try: +                                ts_imcoll = generate_timeseries(dates, spatial_scale=scale_to_use) +                                # Postprocess the image collection +                                postprocessed_ts_imcoll = ts_imcoll.map(lambda img: postprocess_wrapper(img, spatial_scale=scale_to_use)) +                                # Evaluate the generated timeseries +                                ts_imcoll_L = ts_imcoll.getInfo() +                                # Evaluate the postprocessed image collection +                                postprocessed_ts_imcoll_L = postprocessed_ts_imcoll.getInfo() +                            except Exception as e: +                                log.error(e) +                                # Adjust scale_to_use only if error includes "Computation timed out" +                                if "Computation timed out" in str(e): +                                    if scale_to_use == SPATIAL_SCALE_SMALL: +                                        scale_to_use = SPATIAL_SCALE_MEDIUM +                                        log.warning(f"Trying with larger spatial resolution: {scale_to_use} m.") +                                        success_status = 0 +                                        continue +                                    elif scale_to_use == SPATIAL_SCALE_MEDIUM: +                                        scale_to_use = SPATIAL_SCALE_LARGE +                                        log.warning(f"Trying with larger spatial resolution: {scale_to_use} m.") +                                        success_status = 0 +                                        continue +                                    else: +                                        log.error("Trying with larger spatial resolution failed. 
Moving to next iteration.") + scale_to_use = SPATIAL_SCALE_MEDIUM + success_status = -1 + break + else: + success_status = -1 + else: + success_status = 1 + if success_status==1: + # Parse the data to create dataframe + PROCESSING_STATUSES = [] + POSTPROCESSING_STATUSES = [] + cloud_areas = [] + cloud_percents = [] + from_dates = [] + to_dates = [] + obs_dates = [] + non_water_areas = [] + water_areas = [] + water_areas_zhaogao = [] + water_red_sums = [] + water_green_sums = [] + water_nir_sums = [] + water_red_green_means = [] + water_nir_red_means = [] + for f, f_postprocessed in zip(ts_imcoll_L['features'], postprocessed_ts_imcoll_L['features']): + PROCESSING_STATUS = f['properties']['PROCESSING_SUCCESSFUL'] + PROCESSING_STATUSES.append(PROCESSING_STATUS) + POSTPROCESSING_STATUS = f_postprocessed['properties']['POSTPROCESSING_SUCCESSFUL'] + POSTPROCESSING_STATUSES.append(POSTPROCESSING_STATUS) + obs_dates.append(pd.to_datetime(f['properties']['system:time_start'])) + from_dates.append(pd.to_datetime(f['properties']['from_date'])) + to_dates.append(pd.to_datetime(f['properties']['to_date'])) + if PROCESSING_STATUS: + water_areas.append(f['properties']['water_area_clustering']) + non_water_areas.append(f['properties']['non_water_area_clustering']) + cloud_areas.append(f['properties']['cloud_area']) + cloud_percents.append(f['properties']['cloud_percent']) + water_red_sums.append(f['properties']['water_red_sum']) + water_green_sums.append(f['properties']['water_green_sum']) + water_nir_sums.append(f['properties']['water_nir_sum']) + water_red_green_means.append(f['properties']['water_red_green_mean']) + water_nir_red_means.append(f['properties']['water_nir_red_mean']) + else: + water_areas.append(np.nan) + non_water_areas.append(np.nan) + cloud_areas.append(np.nan) + cloud_percents.append(np.nan) + water_red_sums.append(np.nan) + water_green_sums.append(np.nan) + water_nir_sums.append(np.nan) + water_red_green_means.append(np.nan) + water_nir_red_means.append(np.nan) + if POSTPROCESSING_STATUS: + water_areas_zhaogao.append(f_postprocessed['properties']['corrected_area']) + else: + water_areas_zhaogao.append(np.nan) + + df = pd.DataFrame({ + 'date': obs_dates, + 'PROCESSING_STATUS': PROCESSING_STATUSES, + 'POSTPROCESSING_STATUS': POSTPROCESSING_STATUSES, + 'from_date': from_dates, + 'to_date': to_dates, + 'cloud_area': cloud_areas, + 'cloud_percent': cloud_percents, + 'water_area_uncorrected': water_areas, + 'non_water_area': non_water_areas, + 'water_area_corrected': water_areas_zhaogao, + 'water_red_sum': water_red_sums, + 'water_green_sum': water_green_sums, + 'water_nir_sum': water_nir_sums, + 'water_red_green_mean': water_red_green_means, + 'water_nir_red_mean': water_nir_red_means + }).set_index('date') + + fname = os.path.join(savedir, f"{df.index[0].strftime('%Y%m%d')}_{df.index[-1].strftime('%Y%m%d')}_{res_name}.csv") + df.to_csv(fname) + print(df.tail()) + + s_time = randint(5, 10) + print(f"Sleeping for {s_time} seconds") + time.sleep(s_time) + + if (datetime.strptime(enddate, "%Y-%m-%d")-df.index[-1]).days < TEMPORAL_RESOLUTION: + print(f"Quitting: Reached enddate {enddate}") + break + elif df.index[-1].strftime('%Y-%m-%d') == fo: + print(f"Reached last available observation - {fo}") + break + else: + raise Exception("Skipping this iteration of dates due to failed attempt(s).") + # If exception is "Too many concurrent aggregations", reduce results_per_iter + # and rerun for loop for leftover dates by raising Exception. + # Else just print the exception and continue. 
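+                    # (The "Skipping this iteration" exception raised above for a
+                    # permanently failed scale attempt is also caught below and, not
+                    # matching the aggregation error, just advances to the next
+                    # group of dates.)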
+                    except Exception as e: +                        log.error(e) +                        # Adjust results_per_iter only if error includes "Too many concurrent aggregations" +                        if "Too many concurrent aggregations" in str(e): +                            results_per_iter -= 1 +                            print(f"Reducing Results per iteration to {results_per_iter} due to error.") +                            if results_per_iter < MIN_RESULTS_PER_ITER: +                                print("Minimum Results per iteration reached. Continuing to next group of dates.") +                                results_per_iter = MIN_RESULTS_PER_ITER +                                continue +                            else: +                                raise Exception(f'Reducing Results per iteration to {results_per_iter}.') +                        else: +                            continue +            # This exception will be only raised if the error is "Too many concurrent aggregations". +            # and Results per iteration will be reduced but still be greater than or equal to minimum results per iteration. +            # We will continue while loop and for loop within while loop from the left over grouped dates. except Exception as e: -                log.error(e) +                dates = pd.date_range(subset_dates[0], enddate, freq=f'{TEMPORAL_RESOLUTION}D') +                grouped_dates = grouper(dates, results_per_iter) continue +            # In case no exception is raised and the complete for loop ran successfully, break the while loop +            # because we need to run the for loop only once. +            else: +                break # Combine the files into one database to_combine.extend([os.path.join(savedir, f) for f in os.listdir(savedir) if f.endswith(".csv")]) +        if len(to_combine): +            files = [pd.read_csv(f, parse_dates=["date"]).set_index("date") for f in to_combine] +            data = pd.concat(files).drop_duplicates().sort_values("date") -        files = [pd.read_csv(f, parse_dates=["date"]).set_index("date") for f in to_combine] -        data = pd.concat(files).drop_duplicates().sort_values("date") - -        data.to_csv(savepath) +            data.to_csv(savepath) +        else: +            print(f"Observed data between {start} and {end} could not be processed to get surface area. It may be due to cloud cover or other issues. Quitting!") +            return None else: print(f"No observation observed between {start} and {end}. 
Quitting!") return None diff --git a/src/rat/core/sarea/sarea_cli_sar.py b/src/rat/core/sarea/sarea_cli_sar.py index c1f29934..bf70ec8a 100644 --- a/src/rat/core/sarea/sarea_cli_sar.py +++ b/src/rat/core/sarea/sarea_cli_sar.py @@ -2,11 +2,16 @@ import numpy as np import pandas as pd import os -from datetime import datetime, timedelta +from datetime import datetime, timedelta, date +from logging import getLogger from rat.core.sarea.sarea_cli_s2 import TEMPORAL_RESOLUTION from rat.ee_utils.ee_utils import poly2feature from rat.utils.utils import days_between +from rat.utils.logging import LOG_NAME, NOTIFICATION + + +log = getLogger(f"{LOG_NAME}.{__name__}") s1 = ee.ImageCollection("COPERNICUS/S1_GRD") @@ -15,7 +20,10 @@ ANGLE_THRESHOLD_2 = ee.Number(31.66) REVISIT_TIME = ee.Number(12) BUFFER_DIST = 500 - +SPATIAL_SCALE_SMALL = 10 +SPATIAL_SCALE_MEDIUM = 30 +SPATIAL_SCALE_LARGE = 50 +MISSION_START_DATE = date(2014,1,1) # Rough start date for mission/satellite data # functions def getfirstobs(imcoll): @@ -42,10 +50,10 @@ def mask_by_angle(img): return vv.updateMask(mask2) # Calculating the water pixels -def calcWaterPix(img): +def calcWaterPix(img, scale): water_pixels = ee.Algorithms.If( img.bandNames().contains('Class'), - img.reduceRegion(reducer = ee.Reducer.sum(), geometry = ROI, scale = 10, maxPixels = 10e9).get('Class'), + img.reduceRegion(reducer = ee.Reducer.sum(), geometry = ROI, scale = scale, maxPixels = 10e9).get('Class'), None) return img.set("water_pixels", water_pixels) @@ -76,7 +84,7 @@ def in_case_we_have_obs(): return res # client side code -def ee_get_data(ee_Date_Start, ee_Date_End): +def ee_get_data(ee_Date_Start, ee_Date_End, spatial_scale): date_start_str = ee_Date_Start date_end_str = ee_Date_End ee_Date_Start, ee_Date_End = ee.Date(ee_Date_Start), ee.Date(ee_Date_End) @@ -85,8 +93,6 @@ def ee_get_data(ee_Date_Start, ee_Date_End): .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV')) \ .filter(ee.Filter.eq('instrumentMode', 'IW')) - print(S1.size().getInfo()) - ## Checking that the image collection should not be empty in GEE ## Not checking for only cases where time interval is small ## because SAR images in GEE are irregularly present. So there can be large such intervals as well. @@ -101,12 +107,12 @@ def ee_get_data(ee_Date_Start, ee_Date_End): dates = dates.map(lambda n: first_date.advance(n, 'day')) classified_water_sar = ee.ImageCollection(dates.map(lambda d: detectWaterSAR(d, ref_image))) - classified_water_sar = classified_water_sar.map(calcWaterPix) + classified_water_sar = classified_water_sar.map(lambda img: calcWaterPix(img, scale=spatial_scale)) # print('size line 103:',classified_water_sar.size().getInfo()) # print('size line 104:',ee.Array(classified_water_sar.aggregate_array('water_pixels')).multiply(0.0001).size().getInfo()) - wc = ee.Array(classified_water_sar.aggregate_array('water_pixels')).multiply(0.0001).getInfo() # area in sq. km + wc = ee.Array(classified_water_sar.aggregate_array('water_pixels')).multiply(spatial_scale).multiply(spatial_scale).divide(1000000).getInfo() # area in sq. 
        d = classified_water_sar.aggregate_array('system:time_start').getInfo() # convert from milliseconds to seconds from epoch

         df = pd.DataFrame({
@@ -125,16 +131,49 @@
     return df

 def retrieve_sar(start_date, end_date, res='ys'): #ys-year-start frequency
-    date_ranges = list((pd.date_range(start_date, end_date, freq=res).union([pd.to_datetime(start_date), pd.to_datetime(end_date)])).strftime("%Y-%m-%d").tolist()) #ys-year-start frequency
+    # Ensure start_date is not before MISSION_START_DATE
+    print(f'Sentinel-1 mission start date is roughly {MISSION_START_DATE}. Clamping the start date for surface area extraction so that it is not earlier than this.')
+    start_date = max(pd.to_datetime(start_date), pd.to_datetime(MISSION_START_DATE))
+
+    # Generate date range, ensuring both start and end dates are included
+    date_ranges = pd.date_range(start_date, end_date, freq=res).union(
+        [pd.to_datetime(start_date), pd.to_datetime(end_date)]
+    ).strftime("%Y-%m-%d").tolist()
+
     print(date_ranges)

     dfs = []
+    scale_to_use = SPATIAL_SCALE_SMALL
     # for begin, end in zip(date_ranges[:-1], date_ranges.shift(1)[:-1]):
     for begin, end in zip(date_ranges[:-1], date_ranges[1:]):
-        # begin_str, end_str = begin.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d')
-        print(f"Processing: {begin} - {end} ")
-        dfs.append(ee_get_data(begin, end))
-        print(dfs[-1].head())
-        print(f"Processed: {begin} - {end} ")
+        success_status = 0  # initialized to 0; stays 0 on a failed attempt and becomes 1 on a successful one
+        while (success_status == 0):
+            try:
+                # begin_str, end_str = begin.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d')
+                print(f"Processing: {begin} - {end} ")
+                dfs.append(ee_get_data(begin, end, spatial_scale=scale_to_use))
+            except Exception as e:
+                log.error(e)
+                # Adjust scale_to_use only if the error includes "Computation timed out"
+                if "Computation timed out" in str(e):
+                    if scale_to_use == SPATIAL_SCALE_SMALL:
+                        scale_to_use = SPATIAL_SCALE_MEDIUM
+                        log.warning(f"Trying with larger spatial resolution: {scale_to_use} m.")
+                        success_status = 0
+                        continue
+                    elif scale_to_use == SPATIAL_SCALE_MEDIUM:
+                        scale_to_use = SPATIAL_SCALE_LARGE
+                        log.warning(f"Trying with larger spatial resolution: {scale_to_use} m.")
+                        success_status = 0
+                        continue
+                    else:
+                        log.error("Trying with larger spatial resolution failed. Moving to next iteration.")
+                        scale_to_use = SPATIAL_SCALE_MEDIUM
+                        success_status = 1
+                        break
+                else:
+                    success_status = 1  # a non-timeout error: do not retry this window
+            else:
+                success_status = 1
+                print(dfs[-1].head())
+                print(f"Processed: {begin} - {end} ")

     return pd.concat(dfs)
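The scale-escalation retry in `retrieve_sar` boils down to the pattern below; `fetch` is a hypothetical callable standing in for `ee_get_data`, and the scales mirror the constants defined above:

```python
# Distilled sketch of the retry-with-coarser-scale pattern (fetch is hypothetical).
SPATIAL_SCALES = [10, 30, 50]   # metres: SMALL -> MEDIUM -> LARGE

def fetch_with_scale_fallback(fetch, begin, end):
    for scale in SPATIAL_SCALES:
        try:
            return fetch(begin, end, spatial_scale=scale)
        except Exception as e:
            if "Computation timed out" not in str(e):
                raise                      # unrelated failure: surface it to the caller
            print(f"Timed out at {scale} m; retrying at a coarser scale.")
    print("All scales timed out; skipping this date window.")
    return None
```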
diff --git a/src/rat/ee_utils/ee_aec_file_creator.py b/src/rat/ee_utils/ee_aec_file_creator.py
index 435f3ca9..8fbcfd3b 100644
--- a/src/rat/ee_utils/ee_aec_file_creator.py
+++ b/src/rat/ee_utils/ee_aec_file_creator.py
@@ -4,12 +4,14 @@
 import pandas as pd
 import numpy as np
 from itertools import zip_longest,chain
-from rat.ee_utils.ee_utils import poly2feature
+from rat.ee_utils.ee_utils import poly2feature, simplify_geometry
 from pathlib import Path
 from scipy.optimize import minimize
 from scipy.integrate import cumulative_trapezoid
 from shapely.geometry import Point

+
+WATER_SAREA_DIFF_Z_THRESHOLD = 3.0
 BUFFER_DIST = 500
 DEM = ee.Image('USGS/SRTMGL1_003')

@@ -35,7 +37,9 @@ def _aec(n,elev_dem,roi, scale=30):
     DEM141Count = DEM141.reduceRegion(
         geometry= roi,
         scale= scale,
-        reducer= ee.Reducer.sum()
+        reducer= ee.Reducer.sum(),
+        maxPixels = 1e16,
+        bestEffort=True
     )
     area=ee.Number(DEM141Count.get('elevation')).multiply(30*30).divide(1e6)
     return area
@@ -49,15 +53,25 @@

     Returns:
         pd.DataFrame: A DataFrame containing the filtered and processed AEC data with 'Elevation' and 'CumArea' columns.
+        Boolean: Boolean indicating whether the AEC data has a water surface or not.
     """
     obs_aec_above_water = aec.sort_values('Elevation')
     obs_aec_above_water['CumArea_diff'] = obs_aec_above_water['CumArea'].diff()
-    obs_aec_above_water['z_score'] = (obs_aec_above_water['CumArea_diff'] - obs_aec_above_water['CumArea'].mean()) / obs_aec_above_water['CumArea'].std()
+    obs_aec_above_water['z_score'] = (obs_aec_above_water['CumArea_diff'] - obs_aec_above_water['CumArea_diff'].mean()) / obs_aec_above_water['CumArea_diff'].std()
+    max_z_score = obs_aec_above_water['z_score'].max()
     max_z_core_idx = obs_aec_above_water['z_score'].idxmax()
-    obs_aec_above_water = obs_aec_above_water.loc[max_z_core_idx:, :]
-    obs_aec_above_water = obs_aec_above_water[['Elevation', 'CumArea']]
-
-    return obs_aec_above_water
+    if max_z_score > WATER_SAREA_DIFF_Z_THRESHOLD:
+        obs_aec_above_water = obs_aec_above_water.loc[max_z_core_idx:, :]
+        obs_aec_above_water = obs_aec_above_water[['Elevation', 'CumArea']]
+        water_surface_exists = True
+        print(f"Clipped to elevations above water surface. Max surface-area difference z-score is {max_z_score}.")
+    else:
+        obs_aec_above_water = obs_aec_above_water[['Elevation', 'CumArea']]
+        water_surface_exists = False
+        print(f"Skipped clipping as no water surface was found using the AEC: the max surface-area difference z-score is {max_z_score}. Either the reservoir was created after the DEM data was acquired, or the AEC is already extrapolated.")
+
+    return obs_aec_above_water, water_surface_exists

 def calculate_storage(aec_df):
     """
@@ -269,7 +283,7 @@
         # Load GRWL data
         grwl = gpd.read_file(grwl_fp)
         # Check if GRWL intersects with reservoir geometry
-        intersection = grwl.intersects(reservoir.union_all())
+        intersection = grwl.intersects(reservoir.unary_union)
     else:
         intersection = np.array([False])
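A toy, self-contained illustration of the CumArea-difference z-score test introduced in `get_obs_aec_above_water_surface` above; the numbers are synthetic and the 3.0 cutoff matches `WATER_SAREA_DIFF_Z_THRESHOLD`:

```python
# Synthetic illustration of the CumArea-diff z-score test (numbers are made up).
import numpy as np
import pandas as pd

elev = np.arange(100, 121)             # 21 elevation steps, 1 m apart
diffs = np.full(20, 0.1)
diffs[9] = 5.0                         # one large area jump: the flat water surface in the DEM
aec = pd.DataFrame({"Elevation": elev,
                    "CumArea": np.concatenate([[0.0], diffs]).cumsum()})

d = aec["CumArea"].diff()
z = (d - d.mean()) / d.std()
print(round(z.max(), 2))                          # ~4.25, above the 3.0 threshold -> clip
print(aec.loc[z.idxmax():, "Elevation"].iloc[0])  # clipping keeps elevations from 110 m upward
```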
""" + aec_original = aec.copy() + dam_bottom_elevation, method = get_dam_bottom(reservoir, buffer_distance=buffer_distance, dam_location=dam_location, grwl_fp=grwl_fp) # from GRWL downstream point - dam_top_elevation = dam_bottom_elevation + dam_height - print(f"Dam bottom elevation for {reservoir_name} is {dam_bottom_elevation}") - print(f"Dam top elevation for {reservoir_name} is {dam_top_elevation}") - - aec = aec[(aec['Elevation'] > dam_bottom_elevation)&(aec['Elevation'] <= dam_top_elevation)] + + if dam_height>0: + dam_top_elevation = dam_bottom_elevation + dam_height + print(f"Dam height for {reservoir_name} is {dam_height}") + print(f"Dam top elevation for {reservoir_name} is {dam_top_elevation}") + + # Remove elevations below and above dam's bottom and top elevation. If less than 5 observations are left, then just remove elevations below dam bottom. + aec = aec_original[(aec_original['Elevation'] > dam_bottom_elevation)&(aec_original['Elevation'] <= dam_top_elevation)] + if len(aec) > 5: + pass + else: + aec = aec_original[(aec_original['Elevation'] > dam_bottom_elevation)] + else: + print(f"Dam height is not available for {reservoir_name} : {dam_height}") + print("Using all the elevation data above dam bottom elevation.") + aec = aec_original[(aec_original['Elevation'] > dam_bottom_elevation)] + x = aec['CumArea'] y = aec['Elevation'] @@ -387,7 +415,7 @@ def extrapolate_reservoir( return predicted_storage_df -def get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_water_surf=False): +def get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_water_surf=False, simplification=True): """ Generates an observed Area-Elevation Curve (AEC) file for a given reservoir using SRTM data. The function checks if an AEC file already exists for the given reservoir. If it does, it reads the file and returns the data. @@ -399,9 +427,11 @@ def get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_wat reservoir (object): The reservoir object containing geometry information. reservoir_name (str): The name of the reservoir. clip_to_water_surf (bool): If True, clips the AEC data to elevations above the water surface. Default is False. + simplification (bool): If true, reservoir geometry will be simplified before use (only if shape index is extremely high). Default is True. Returns: pd.DataFrame: A DataFrame containing the elevation and cumulative area data. + Boolean : Boolean indicating whether the AEC data has water surface or not. """ print(f"Generating observed AEC file for {reservoir_name} from SRTM") aec_dst_fp = os.path.join(aec_dir_path,reservoir_name+'.csv') @@ -413,11 +443,14 @@ def get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_wat # done already. In that case, the 'CumArea' and 'Elevation_Observed' represent the SRTM # observed AEC. if 'Elevation_Observed' in aec_df.columns: - aec_df = aec_df[['CumArea', 'Elevation_Observed']].rename({'Elevation_Observed': 'Elevation'}, axis=1) - aec_df = aec_df.dropna(subset='Elevation') + # aec_df = aec_df[['CumArea', 'Elevation_Observed']].rename({'Elevation_Observed': 'Elevation'}, axis=1) + # aec_df = aec_df.dropna(subset='Elevation') clip_to_water_surf = False # in this case, clipping is not required. 
        else:
             reservoir_polygon = reservoir.geometry
+            if simplification:
+                # The function below simplifies geometries whose shape index (complexity) is higher than a threshold; otherwise the original geometry is retained
+                reservoir_polygon = simplify_geometry(reservoir_polygon)
             aoi = poly2feature(reservoir_polygon,BUFFER_DIST).geometry()
             min_elev = DEM.reduceRegion( reducer = ee.Reducer.min(),
                                          geometry = aoi,
@@ -458,14 +491,12 @@
     print(f"Observed AEC obtained from SRTM for {reservoir_name}")

     if clip_to_water_surf:
-        aec_df = get_obs_aec_above_water_surface(
-            aec_df
-        )
-        print(f"Clipped to elevations above water surface")
-
+        aec_df,water_surface_exists = get_obs_aec_above_water_surface(aec_df)
+    else:
+        water_surface_exists = False
     aec_df.to_csv(aec_dst_fp,index=False)

-    return aec_df
+    return aec_df, water_surface_exists

 def aec_file_creator(
         reservoir_shpfile, shpfile_column_dict, aec_dir_path,
@@ -511,19 +542,33 @@
             reservoir_gpd = reservoir_gpd.set_crs(reservoirs_polygon.crs)

         reservoir_name = str(reservoir[shpfile_column_dict['unique_identifier']])
-        dam_height = float(reservoir[shpfile_column_dict['dam_height']])
-        dam_lat = float(reservoir[shpfile_column_dict['dam_lat']])
-        dam_lon = float(reservoir[shpfile_column_dict['dam_lon']])
-        dam_location = Point(dam_lon, dam_lat)
-
-        aec = get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_water_surf=True)
-
-        if dam_height > 0:
-            extrapolate_reservoir(
-                reservoir_gpd, dam_location, reservoir_name, dam_height, aec,
-                aec_dir_path, grwl_fp=grwl_fp
-            )
+        if shpfile_column_dict.get('dam_height'):
+            if reservoir[shpfile_column_dict['dam_height']] is not None:
+                dam_height = float(reservoir[shpfile_column_dict['dam_height']])
+            else:
+                dam_height = np.nan
+        else:
+            dam_height = np.nan
+
+        dam_lat = float(reservoir[shpfile_column_dict.get('dam_lat')]) if shpfile_column_dict.get('dam_lat') else None
+        dam_lon = float(reservoir[shpfile_column_dict.get('dam_lon')]) if shpfile_column_dict.get('dam_lon') else None
+        if dam_lat and dam_lon:
+            dam_location = Point(dam_lon, dam_lat)
+        else:
+            dam_location = None
+
+        aec, water_surface_exists = get_obs_aec_srtm(aec_dir_path, scale, reservoir, reservoir_name, clip_to_water_surf=True)
+
+        if water_surface_exists:
+            if dam_location:
+                extrapolate_reservoir(
+                    reservoir_gpd, dam_location, reservoir_name, dam_height, aec,
+                    aec_dir_path, grwl_fp=grwl_fp
+                )
+            else:
+                print(f"No extrapolation was done in the AEC for reservoir {reservoir_name} because of the absence of a dam location. The reservoir's bottom elevation cannot be determined.")
         else:
-            print(f"Dam height can't be used to extrapolate aev: {dam_height}")
+            print(f"No extrapolation was done in the AEC for reservoir {reservoir_name} because of the absence of a water surface in the AEC.")

     return 1
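As a quick reference, a hypothetical `shpfile_column_dict` showing which keys the reworked `aec_file_creator` now treats as optional; the column names here are placeholders, not RAT defaults:

```python
# Hypothetical column mapping: missing dam_height / dam_lat / dam_lon keys now
# degrade gracefully (NaN height, no dam_location) instead of raising KeyError.
shpfile_column_dict = {
    "unique_identifier": "DAM_NAME",  # set to 'uniq_id' upstream when station_global_data is used
    "dam_height": "DAM_HGT_M",        # optional: NaN when absent -> no top-elevation clipping
    "dam_lat": "LAT_DD",              # optional: both lat and lon are needed to build dam_location
    "dam_lon": "LONG_DD",             # optional
}
```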
diff --git a/src/rat/ee_utils/ee_utils.py b/src/rat/ee_utils/ee_utils.py
index 3eeaaf89..cf23c6f7 100644
--- a/src/rat/ee_utils/ee_utils.py
+++ b/src/rat/ee_utils/ee_utils.py
@@ -1,12 +1,13 @@
 import numpy as np
 import ee
+import math

 # Coverts a polygon geometry object to earth engine feature
 def poly2feature(polygon,buffer_distance):
     '''
     Returns an earth engine feature with the same geometry as the polygon object with buffer_distance added.
     buffer_distance is in meters.
     '''
-    if(polygon.type=='MultiPolygon'):
+    if(polygon.geom_type=='MultiPolygon'):
         all_cords=[]
         for poly in polygon.geoms:
             x,y = poly.exterior.coords.xy
@@ -20,4 +21,77 @@
     g=ee.Geometry.Polygon(cords).buffer(buffer_distance) #buffer for polygon_object in meters
     feature = ee.Feature(g)

-    return feature
\ No newline at end of file
+    return feature
+
+def shape_index(polygon):
+    """
+    Calculate the shape index of a given polygon.
+
+    The shape index is defined as the square of the perimeter divided by the area:
+        SI = (perimeter^2) / area
+
+    Parameters:
+        polygon (shapely.geometry.Polygon): The input polygon.
+
+    Returns:
+        float: The shape index value.
+    """
+    return (polygon.length ** 2) / polygon.area
+
+def compute_initial_tolerance(si):
+    """
+    Compute the initial simplification tolerance based on the shape index magnitude.
+
+    The tolerance is set as 10^(-order_of_magnitude), where order_of_magnitude is
+    the exponent of the shape index rounded up.
+
+    Parameters:
+        si (float): The shape index value.
+
+    Returns:
+        float: The initial simplification tolerance.
+    """
+    order_of_magnitude = math.ceil(math.log10(si))  # Get the exponent of SI
+    return 10 ** (-order_of_magnitude)  # Set tolerance as the inverse of that order
+
+def simplify_geometry(polygon, threshold=800, initial_tolerance=None):
+    """
+    Simplify a polygon iteratively until its shape index falls below a given threshold.
+
+    If no initial tolerance is provided, it is determined dynamically using the
+    compute_initial_tolerance function. The simplification process gradually
+    increases the tolerance by a factor of 3.5 in each iteration.
+
+    Parameters:
+        polygon (shapely.geometry.Polygon): The input polygon to simplify.
+        threshold (float, optional): The shape index threshold for stopping simplification. Default is 800.
+        initial_tolerance (float, optional): The starting tolerance for simplification. If None,
+            it is computed based on the shape index.
+
+    Returns:
+        shapely.geometry.Polygon: The simplified polygon.
+    """
+    simplified = False
+    si_original = shape_index(polygon)
+    si = si_original
+    if initial_tolerance is None:
+        initial_tolerance = compute_initial_tolerance(si)
+
+    tolerance = initial_tolerance  # Start with an initial tolerance
+    while si > threshold:  # Keep simplifying until the shape index is below the threshold
+        simplified = True
+        simplified_polygon = polygon.simplify(tolerance, preserve_topology=True)
+
+        # Stop if no significant simplification occurs
+        if simplified_polygon == polygon:
+            break
+
+        polygon = simplified_polygon
+        si = shape_index(polygon)
+        tolerance *= 3.5  # Gradually increase tolerance for more simplification
+    if simplified:
+        print(f"Using simplified geometry to extract surface area because of its complex shape.")
+        print(f"Shape index of the original geometry is {si_original}, above the threshold of {threshold}. Shape index of the simplified geometry is {si}.")
+    return polygon
\ No newline at end of file
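A small usage sketch, not part of the patch, exercising `shape_index` and `simplify_geometry` on a deliberately jagged polygon; the explicit `initial_tolerance` is chosen to be comparable to the sawtooth amplitude, since the loop above breaks early when a simplify pass changes nothing:

```python
# Sketch: a sawtooth shoreline inflates perimeter^2 / area far beyond the 800 default.
import numpy as np
from shapely.geometry import Polygon
from rat.ee_utils.ee_utils import shape_index, simplify_geometry

xs = np.linspace(0, 10, 2001)
ys = np.where(np.arange(2001) % 2 == 0, 0.0, 0.2)
jagged = Polygon(list(zip(xs, ys)) + [(10, 5), (0, 5)])

print(shape_index(jagged))                # on the order of a few thousand
smooth = simplify_geometry(jagged, threshold=800, initial_tolerance=0.25)
print(shape_index(smooth))                # roughly a rectangle's value, well below 800
```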
diff --git a/src/rat/rat_basin.py b/src/rat/rat_basin.py
index d5c09a27..a2f38834 100644
--- a/src/rat/rat_basin.py
+++ b/src/rat/rat_basin.py
@@ -32,7 +32,7 @@

 from rat.core.run_postprocessing import run_postprocessing

-from rat.utils.convert_to_final_outputs import convert_sarea, convert_inflow, convert_dels, convert_evaporation, convert_outflow, convert_altimeter, copy_aec_files
+from rat.utils.convert_to_final_outputs import convert_sarea, convert_inflow, convert_dels, convert_evaporation, convert_outflow, convert_altimeter, copy_aec_files, convert_nssc, convert_meteorological_ts

 # Step-(-1): Reading Configuration settings to run RAT
 # Step-0: Creating required directory structure for RAT
@@ -50,6 +50,7 @@
 # Step-12: Generating Area Elevation Curves for reservoirs
 # Step-13: Calculation of Inflow, Outflow, Evaporation and Storage change
 # Step-14: Conversion of output data to final format as time series
+# Step-Z: Cleaning up of memory space by removal of unwanted data

 #TODO: Converting steps to separate modules to make RAT more robust and generalized
 #module-1 step-1,2 data_preparation_vic
@@ -96,7 +97,8 @@
     if config['GLOBAL'].get('steps'):
         steps = config['GLOBAL']['steps']
     else:
-        steps = [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
+        steps = []
+        rat_logger.warning("RAT is being executed without any defined steps to process. Please use the steps parameter in the GLOBAL section of the configuration file.")

     # Defining resolution to run RAT
     xres = 0.0625
@@ -185,7 +187,7 @@

         # Defining logger
-        log = init_logger(
+        log, log_file_path = init_logger(
             log_dir= log_dir,
             verbose= False,
             # notify= True,
@@ -198,7 +200,7 @@

         # Clearing out previous rat outputs so that the new data does not gets appended.
        if(config['CLEAN_UP']['clean_previous_outputs']):
-            rat_logger.info("Clearing up memory space: Removal of previous rat outputs, routing inflow, extracted altimetry data and gee extracted surface area time series")
+            rat_logger.info("Clearing up memory space: Removal of previous rat outputs, routing outputs (streamflow), extracted altimetry data, and gee extracted surface area and NSSC time series")
             cleaner.clean_previous_outputs()

         # Initializing Status for different models & tasks (1 for successful run & 0 for failed run)
@@ -208,12 +210,12 @@
         ROUTING_STATUS = 1
         GEE_STATUS = 1
         ALTIMETER_STATUS = 1
-        DELS_STATUS = 0
-        EVAP_STATUS = 0
-        OUTFLOW_STATUS = 0
-        AEC_STATUS = 0
+        DELS_STATUS = 1
+        EVAP_STATUS = 1
+        OUTFLOW_STATUS = 1
+        AEC_STATUS = 1
     except:
-        no_errors = -1
+        no_errors = no_errors + 1
         rat_logger.exception("Error in Configuration parameters defined to run RAT.")
         return (no_errors, latest_altimetry_cycle)
     else:
@@ -223,6 +225,7 @@
             rat_logger.info(f"Running RAT from {config['BASIN']['start'] } to {config['BASIN']['end']} which includes spin-up.")
         else:
             rat_logger.info(f"Running RAT from {config['BASIN']['start'] } to {config['BASIN']['end']}.")
+        rat_logger.info(f"Level-2 (DETAILED) log for this River Basin is at {log_file_path}")
         if gfs_days:
             rat_logger.info(f"Note 1: Due to low latency availability, GFS daily forecasted data will be used for {gfs_days} most recent days.")
             rat_logger.info(f"Note 2: The GFS data will be removed and replaced in next RAT run by observed data, if available.")
@@ -335,6 +338,7 @@
             reservoirs_gdf_column_dict['unique_identifier'] = reservoirs_gdf_column_dict['dam_name_column']
         # Defining paths to save surface area from gee and heights from altimetry
         sarea_savepath = create_directory(os.path.join(basin_data_dir,'gee','gee_sarea_tmsos',''), True)
+        nssc_savepath = create_directory(os.path.join(basin_data_dir,'gee','gee_nssc',''), True)
         altimetry_savepath = os.path.join(basin_data_dir,'altimetry','altimetry_timeseries')

         #----------- Paths Necessary for running of Surface Area Calculation and Altimetry-----------#
@@ -344,12 +348,27 @@
             aec_dir_path = config['POST_PROCESSING'].get('aec_dir')
         else:
             aec_dir_path = create_directory(os.path.join(basin_data_dir,'post_processing','post_processing_gee_aec',''), True)
+        # Defining paths for creating a meteorological timeseries for each reservoir catchment
+        if (config['POST_PROCESSING'].get('catchment_vector_file')):
+            catchment_vector_file_path = config['POST_PROCESSING'].get('catchment_vector_file')
+            # Reading the catchment vector file's column dictionary
+            catchments_gdf_column_dict = config['POST_PROCESSING'].get('catchment_vector_file_columns_dict')
+            # Adding key-value pair to Basin Reservoir Shapefile's column dictionary ###
+            if (config['ROUTING']['station_global_data']):
+                catchments_gdf_column_dict['unique_identifier'] = 'uniq_id'
+            else:
+                catchments_gdf_column_dict['unique_identifier'] = catchments_gdf_column_dict['dam_name_column']
+        else:
+            catchment_vector_file_path = None
+            catchments_gdf_column_dict = None

         ## Paths for storing post-processed data and in webformat data
         if forecast_mode:
             evap_savedir = create_directory(os.path.join(basin_data_dir,'rat_outputs', 'forecast_evaporation', 
f"{forecast_basedate:%Y%m%d}"), True) else: evap_savedir = create_directory(os.path.join(basin_data_dir,'rat_outputs', 'Evaporation'), True) dels_savedir = create_directory(os.path.join(basin_data_dir,'rat_outputs', "dels"), True) + nssc_savedir = create_directory(os.path.join(basin_data_dir,'rat_outputs', "nssc"), True) outflow_savedir = create_directory(os.path.join(basin_data_dir,'rat_outputs', "rat_outflow"),True) aec_savedir = Path(create_directory(os.path.join(basin_data_dir,'rat_outputs', "aec"),True)) final_output_path = create_directory(os.path.join(basin_data_dir,'final_outputs',''),True) @@ -366,7 +385,7 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base ## End of defining paths for storing post-processed data and webformat data except: - no_errors = -1 + no_errors = no_errors+1 rat_logger.exception("Error in creating required directory structure for RAT") return (no_errors, latest_altimetry_cycle) else: @@ -673,9 +692,8 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base rat_logger.info("Starting Step-9: Preparation of parameter files for Surface Area Calculation") #------------- Selection of Reservoirs within the basin begins--------------# ###### Preparing basin's reservoir shapefile and it's associated column dictionary for calculating surface area ##### - ### Creating Basin Reservoir Shapefile, if not exists ### - if not os.path.exists(basin_reservoir_shpfile_path): - create_basin_reservoir_shpfile(config['GEE']['reservoir_vector_file'], reservoirs_gdf_column_dict, basin_data, + ### Creating Basin Reservoir Shapefile, overwritten if exists ### + create_basin_reservoir_shpfile(config['GEE']['reservoir_vector_file'], reservoirs_gdf_column_dict, basin_data, config['ROUTING']['station_global_data'], basin_reservoir_shpfile_path) ###### Prepared basin's reservoir shapefile and it's associated column dictionary ##### except: @@ -699,7 +717,7 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base # Get Sarea filt_options = config['GEE'].get('bot_filter') run_sarea(gee_start_date.strftime("%Y-%m-%d"), config['BASIN']['end'].strftime("%Y-%m-%d"), sarea_savepath, - basin_reservoir_shpfile_path, reservoirs_gdf_column_dict,filt_options) + basin_reservoir_shpfile_path, reservoirs_gdf_column_dict,filt_options,nssc_savepath) GEE_STATUS = 1 except: no_errors = no_errors+1 @@ -761,7 +779,8 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base rat_logger.warning("AEC files could not be copied to rat_outputs directory.", exc_info=True) #Generating evaporation, storage change and outflow. 
DELS_STATUS, EVAP_STATUS, OUTFLOW_STATUS = run_postprocessing(basin_name, basin_data_dir, basin_reservoir_shpfile_path, reservoirs_gdf_column_dict, - aec_dir_path, config['BASIN']['start'], config['BASIN']['end'], rout_init_state_save_file, use_state, evap_savedir, dels_savedir, outflow_savedir, VIC_STATUS, ROUTING_STATUS, GEE_STATUS, forecast_mode=forecast_mode) + aec_savedir, config['BASIN']['start'], config['BASIN']['end'], rout_init_state_save_file, use_state, evap_savedir, dels_savedir, + nssc_savedir, outflow_savedir, VIC_STATUS, ROUTING_STATUS, GEE_STATUS, forecast_mode=forecast_mode) except: no_errors = no_errors+1 rat_logger.exception("Error Executing Step-13: Calculation of Outflow, Evaporation, Storage change and Inflow") @@ -774,21 +793,35 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base try: rat_logger.info("Starting Step-14: Creating final outputs in a timeseries format and cleaning up.") - ##---------- Convert all time-series to final output csv format and clean up----------## + ##---------- Convert all time-series to final output csv format --------------------## ## Surface Area if(GEE_STATUS): convert_sarea(sarea_savepath,final_output_path) - rat_logger.info("Converted Surface Area to the Output Format.") + convert_nssc(nssc_savepath,final_output_path) + rat_logger.info("Converted Surface Area and NSSC to the Output Format.") else: - rat_logger.info("Could not convert Surface Area to the Output Format as GEE run failed.") + rat_logger.info("Could not convert Surface Area and NSSC to the Output Format as GEE run failed.") ## Inflow if(ROUTING_STATUS): - convert_inflow(inflow_dst_dir, basin_reservoir_shpfile_path, reservoirs_gdf_column_dict, final_output_path) + convert_inflow(inflow_dst_dir, final_output_path) rat_logger.info("Converted Inflow to the Output Format.") else: rat_logger.info("Could not convert Inflow to the Output Format as Routing run failed.") + ## Climatological TS + if (ROUTING_STATUS): + try: + no_failed_reservoirs = convert_meteorological_ts(catchment_vector_file_path, catchments_gdf_column_dict, basin_data, combined_datapath, final_output_path) + if no_failed_reservoirs: + rat_logger.warning(f"Could not extract Catchment's Climatological TS for {no_failed_reservoirs} reservoir(s).") + else: + rat_logger.info("Converted Catchment's Climatological TS to the Output Format (from NetCDF).") + except: + rat_logger.exception("Failed to convert Catchment's Climatological TS to the Output Format (from NetCDF).") + else: + rat_logger.info("Could not convert Catchment's Climatological TS to the Output Format (from NetCDF) as Routing run failed.") + ## Dels if(DELS_STATUS): convert_dels(dels_savedir, final_output_path) @@ -848,16 +881,31 @@ def rat_basin(config, rat_logger, forecast_mode=False, gfs_days=0, forecast_base runResorr(basin_data_dir,basin_station_latlon_file,resorr_startDate,resorr_endDate) else: rat_logger.warning("No station latlon file found to run RESORR. 
Try running Step-8 or provide station_latlon_path in routing section of config file.")
-
+
+        except:
+            no_errors = no_errors+1
+            rat_logger.exception("Error Executing Step-14: Creating final outputs in a timeseries format and cleaning up.")
+        else:
+            rat_logger.info("Finished Step-14: Creating final outputs in a timeseries format and cleaning up.")
+            ##----------Converted all time-series to final output csv format -----------------##
+
+    ######### Step-Z Mandatory Last Step
+    if (config['GLOBAL'].get('cleaning')):
+        try:
+            ##------------------ Cleaning up the memory --------------------##
+            rat_logger.info("Starting to clean up memory space by removing unwanted data.")
             # Clearing out memory space as per user input
+            if(config['CLEAN_UP'].get('clean_processing')):
+                rat_logger.info("Clearing up memory space: Removal of preprocessed meteorological data (not the combined NetCDF) and post-processed AEC files.")
+                cleaner.clean_processing()
             if(config['CLEAN_UP'].get('clean_metsim')):
-                rat_logger.info("Clearing up memory space: Removal of metsim output files")
+                rat_logger.info("Clearing up memory space: Removal of metsim input and output files")
                 cleaner.clean_metsim()
             if(config['CLEAN_UP'].get('clean_vic')):
-                rat_logger.info("Clearing up memory space: Removal of vic input, output files and previous init_state_files")
+                rat_logger.info("Clearing up memory space: Removal of vic input and output files, and previous init_state_files older than 20 days.")
                 cleaner.clean_vic()
             if(config['CLEAN_UP'].get('clean_routing')):
-                rat_logger.info("Clearing up memory space: Removal of routing input and output files")
+                rat_logger.info("Clearing up memory space: Removal of routing input files and previous rout_state_files older than 20 days.")
                 cleaner.clean_routing()
             if(config['CLEAN_UP'].get('clean_gee')):
                 rat_logger.info("Clearing up memory space: Removal of unwanted gee extracted small chunk files")
@@ -865,12 +913,20 @@
             if(config['CLEAN_UP'].get('clean_altimetry')):
                 rat_logger.info("Clearing up memory space: Removal of raw altimetry downloaded data files.")
                 cleaner.clean_altimetry()
+            if(config['CLEAN_UP'].get('clean_basin_parameter_files')):
+                rat_logger.info("Clearing up memory space: Removal of basin's parameter files (for VIC, routing, gee and others) created by RAT.")
+                cleaner.clean_basin_parameter_files()
+            if(config['CLEAN_UP'].get('clean_basin_meteorological_data')):
+                rat_logger.info("Clearing up memory space: Removal of basin's meteorological data in the combined NetCDF format.")
+                cleaner.clean_basin_meteorological_data()
+
         except:
             no_errors = no_errors+1
-            rat_logger.exception("Error Executing Step-14: Creating final outputs in a timeseries format and cleaning up.")
+            rat_logger.exception("Error in cleaning up memory space by removal of unwanted data.")
         else:
-            rat_logger.info("Finished Step-14: Creating final outputs in a timeseries format and cleaning up.")
-            ##----------Converted all time-series to final output csv format and cleaned up----------##
-
+            rat_logger.info("Finished cleaning up memory space and removed unwanted data.")
+            ##------------------ Cleaned up the memory --------------------##
+
+    close_logger()
     return (no_errors, latest_altimetry_cycle)
\ No newline at end of file
diff --git a/src/rat/run_rat.py b/src/rat/run_rat.py
index 42c435b2..d82a349a 100644
--- a/src/rat/run_rat.py
+++ b/src/rat/run_rat.py
@@ -3,6 +3,7 @@
 import datetime
 import copy
 import os
+import sys
 import pandas as pd
 import numpy as 
np @@ -38,8 +39,7 @@ def run_rat(config_fn, operational_latency=None ): # Logging this run log_dir = os.path.join(config['GLOBAL']['data_dir'],'runs','logs','') - print(f"Logging this run at {log_dir}") - log = init_logger( + log, log_file_path = init_logger( log_dir, verbose=False, # notify=True, @@ -48,6 +48,7 @@ def run_rat(config_fn, operational_latency=None ): logger_name=LOG_LEVEL1_NAME, for_basin=False ) + print(f"Logging this run at {log_file_path}") log.debug("Initiating Dask Client ... ") cluster = LocalCluster(name="RAT", n_workers=config['GLOBAL']['multiprocessing'], threads_per_worker=1) @@ -66,12 +67,15 @@ def run_rat(config_fn, operational_latency=None ): with StringIO() as fake_stderr, redirect_stderr(fake_stderr): ee_credentials = ee.ServiceAccountCredentials(ee_configuration.service_account,ee_configuration.key_file) ee.Initialize(ee_credentials) - log.info("Connected to earth engine succesfully.") + log.info("Connected to earth engine successfully.\n") except Exception as e: log.error(f"Failed to connect to Earth Engine. RAT will not be able to use Surface Area Estimations. Error: {e}") + finally: + # Ensure sys.stderr is restored + sys.stderr = sys.__stderr__ ############################ ----------- Single basin run ---------------- ###################################### - if(not config['GLOBAL']['multiple_basin_run']): + if(not config['GLOBAL'].get('multiple_basin_run')): log.info('############## Starting RAT for '+config['BASIN']['basin_name']+' #################') # Checking if Rat is running operationally with some latency. If yes, update start, end and vic_init_state dates. @@ -143,7 +147,7 @@ def run_rat(config_fn, operational_latency=None ): if(forecast_no_errors>0): log.info('############## RAT-Forecasting run finished for '+config_copy['BASIN']['basin_name']+ ' with '+str(forecast_no_errors)+' error(s). #################') elif(forecast_no_errors==0): - log.info('############## Succesfully run RAT-Forecasting for '+config_copy['BASIN']['basin_name']+' #################') + log.info('############## Successfully run RAT-Forecasting for '+config_copy['BASIN']['basin_name']+' #################') else: log.error('############## RAT-Forecasting run failed for '+config_copy['BASIN']['basin_name']+' #################') # Displaying and storing RAT function outputs in the copy (non-mutabled as it was not passes to function) @@ -153,7 +157,7 @@ def run_rat(config_fn, operational_latency=None ): if(no_errors>0): log.info('############## RAT run finished for '+config['BASIN']['basin_name']+ ' with '+str(no_errors)+' error(s). 
#################') elif(no_errors==0): - log.info('############## Succesfully run RAT for '+config['BASIN']['basin_name']+' #################') + log.info('############## Successfully run RAT for '+config['BASIN']['basin_name']+' #################') else: log.error('############## RAT run failed for '+config['BASIN']['basin_name']+' #################') @@ -163,7 +167,7 @@ def run_rat(config_fn, operational_latency=None ): try: basins_metadata = pd.read_csv(config['GLOBAL']['basins_metadata'],header=[0,1]) except: - raise("Please provide the proper path of a csv file in basins_metadata in the Global section of RAT's config file") + raise Exception("Please provide the proper path of a csv file in basins_metadata in the Global section of RAT's config file") if ('BASIN','run') in basins_metadata.columns: basins_metadata_filtered = basins_metadata[basins_metadata['BASIN','run']==1] ####### Remove in future version : Deprecation (start)######## @@ -172,13 +176,13 @@ def run_rat(config_fn, operational_latency=None ): if ('BASIN','basin_name') in basins_metadata.columns: basins_metadata_filtered = basins_metadata[basins_metadata['BASIN','basin_name'].isin(config['GLOBAL']['basins_to_process'])] else: - raise("No column in 'basins_metadata' file corresponding to 'basin_name' in 'BASIN' section of RAT's config file.") + raise Exception("No column in 'basins_metadata' file corresponding to 'basin_name' in 'BASIN' section of RAT's config file.") ####### Remove in future version : Deprecation (end) ######## if ('BASIN','basin_name') in basins_metadata.columns: basins_to_process = basins_metadata_filtered['BASIN','basin_name'].tolist() else: - raise("No column in 'basins_metadata' file corresponding to 'basin_name' in 'BASIN' section of RAT's config file.") + raise Exception("No column in 'basins_metadata' file corresponding to 'basin_name' in 'BASIN' section of RAT's config file.") # For each basin for basin in basins_to_process: @@ -294,7 +298,7 @@ def run_rat(config_fn, operational_latency=None ): if config.get('PLUGINS', {}).get('forecasting'): # Importing the forecast module try: - from plugins.forecasting.forecast_basin import forecast + from rat.plugins.forecasting.forecast_basin import forecast except: log.exception("Failed to import Forecast plugin due to missing package(s).") log.info('############## Starting RAT forecast for '+config['BASIN']['basin_name']+' #################') @@ -302,7 +306,7 @@ def run_rat(config_fn, operational_latency=None ): if(forecast_no_errors>0): log.info('############## RAT-Forecasting run finished for '+config_copy['BASIN']['basin_name']+ ' with '+str(forecast_no_errors)+' error(s). #################') elif(forecast_no_errors==0): - log.info('############## Succesfully run RAT-Forecasting for '+config_copy['BASIN']['basin_name']+' #################') + log.info('############## Successfully run RAT-Forecasting for '+config_copy['BASIN']['basin_name']+' #################') else: log.error('############## RAT-Forecasting run failed for '+config_copy['BASIN']['basin_name']+' #################') # Displaying and storing RAT function outputs @@ -313,11 +317,11 @@ def run_rat(config_fn, operational_latency=None ): basins_metadata['ALTIMETER','last_cycle_number'].where(basins_metadata['BASIN','basin_name']!= basin, latest_altimetry_cycle, inplace=True) basins_metadata.to_csv(config_copy['GLOBAL']['basins_metadata'], index=False) if(no_errors>0): - log.info('############## RAT run finished for '+config_copy['BASIN']['basin_name']+ ' with '+str(no_errors)+' error(s). 
#################\n')
         elif(no_errors==0):
-            log.info('############## Succesfully run RAT for '+config_copy['BASIN']['basin_name']+' #################')
+            log.info('############## Successfully run RAT for '+config_copy['BASIN']['basin_name']+' #################\n')
         else:
-            log.error('############## RAT run failed for '+config_copy['BASIN']['basin_name']+' #################')
+            log.error('############## RAT run failed for '+config_copy['BASIN']['basin_name']+' #################\n')

     # Closing logger
     close_logger('rat_run')
diff --git a/src/rat/toolbox/data_transform.py b/src/rat/toolbox/data_transform.py
new file mode 100644
index 00000000..9912ccc1
--- /dev/null
+++ b/src/rat/toolbox/data_transform.py
@@ -0,0 +1,150 @@
+import geopandas as gpd
+import os
+import xarray as xr
+import rioxarray as rxr
+import pandas as pd
+from shapely.geometry import mapping, Polygon, MultiPolygon
+
+def create_meterological_ts(roi, nc_file_path, output_csv_path):
+    """
+    Create a meteorological timeseries for a given region of interest (ROI) using the combined meteorological
+    data stored in a NetCDF file produced by RAT, and save the output as a CSV file.
+
+    This function extracts time-series data for the variables 'precip', 'tmin', 'tmax', and 'wind' for the specified
+    geometry (ROI), calculates the spatial average for each time step, and stores the results in a CSV file.
+
+    Parameters:
+    ----------
+    roi : shapely.geometry.Polygon
+        A shapely geometry representing the region of interest (ROI) for which the meteorological
+        timeseries will be extracted. The geometry should be in the WGS84 coordinate reference system (CRS),
+        the same as the NetCDF dataset.
+
+    nc_file_path : str
+        Path to the NetCDF file containing combined meteorological data produced by RAT. The NetCDF file should include
+        the variables 'precip', 'tmin', 'tmax', and 'wind', and have appropriate spatial dimensions such as 'lat' and 'lon',
+        or 'latitude' and 'longitude'. If a CRS is not specified, it is assumed to be WGS84.
+
+    output_csv_path : str
+        Path where the output CSV file will be saved (or appended to, in case the file exists). The file will contain
+        the time series of the spatially averaged values of the meteorological variables ('precip', 'tmin', 'tmax', 'wind').
+
+    Returns:
+    -------
+    None
+        The function does not return any value but saves the meteorological time series to the specified CSV file.
+
+    Raises:
+    ------
+    ValueError :
+        If the spatial dimensions ('lon', 'lat', 'longitude', or 'latitude') are not found in the NetCDF dataset.
+
+    Warning :
+        If the CRS of the dataset is not set, appropriate warnings will be raised and the file will not be created.
+
+    Notes:
+    ------
+    - The function clips the dataset based on the ROI geometry and calculates the spatial average for each time step.
+    - The output CSV will contain columns for 'time', 'precip', 'tmin', 'tmax', and 'wind' with their spatially averaged values.
+    - If the output CSV exists, data will be appended; in case of duplicate dates, the latest data will be kept for each date.
+
+    Example:
+    --------
+    # Create a shapely Polygon representing the ROI
+    roi = shapely.geometry.Polygon(...)
+
+    # Specify the path to the NetCDF file
+    nc_file_path = 'path_to_netcdf_file.nc'
+
+    # Specify the path to save the output CSV
+    output_csv_path = 'output_timeseries.csv'
+
+    # Create the meteorological time series and save it to CSV
+    create_meterological_ts(roi, nc_file_path, output_csv_path)
+    """
+
+    print("Creating a meteorological timeseries for the given geometry using the combined meteorological NetCDF produced by RAT.")
+    if os.path.isfile(nc_file_path):
+        # Load the NetCDF file as an xarray Dataset
+        ds = xr.open_dataset(nc_file_path, chunks={"time": 100, "x": 100, "y": 100})
+
+        # Ensure spatial dimensions are set correctly as x and y for rioxarray use
+        if 'lon' in ds.dims and 'lat' in ds.dims:
+            ds = ds.rename({'lon': 'x', 'lat': 'y'})
+        elif 'longitude' in ds.dims and 'latitude' in ds.dims:
+            ds = ds.rename({'longitude': 'x', 'latitude': 'y'})
+        else:
+            raise ValueError("Spatial dimensions not found. Expected 'lon', 'lat', 'longitude', or 'latitude'.")
+
+        # Set spatial dimensions
+        ds = ds.rio.set_spatial_dims(x_dim='x', y_dim='y')
+
+        # Get the dataset resolution in degrees
+        res_x, res_y = ds.rio.resolution()  # Dataset's resolution in x and y directions
+
+        # Calculate half of the resolution size to add a buffer if needed
+        half_pixel_size = max(abs(res_x), abs(res_y)) / 2
+
+        # Check if the ROI needs to be expanded with a buffer
+        if isinstance(roi, (Polygon, MultiPolygon)):
+            roi_bounds = roi.bounds  # (minx, miny, maxx, maxy)
+            roi_width = roi_bounds[2] - roi_bounds[0]
+            roi_height = roi_bounds[3] - roi_bounds[1]
+
+            # Apply buffer conditionally: apply a buffer if the ROI is smaller than the resolution
+            if roi_width < abs(res_x) or roi_height < abs(res_y):
+                print("Catchment ROI is too small compared to resolution, applying buffer.")
+                roi_expanded = roi.buffer(half_pixel_size)  # Buffer using half of the dataset's resolution
+            else:
+                print("Catchment ROI is sufficiently large, no buffer applied.")
+                roi_expanded = roi  # No buffer needed, keep original ROI
+        else:
+            raise ValueError("Catchment ROI must be a polygon or a multipolygon.")
+
+        # Set a default CRS for the dataset if missing
+        if ds.rio.crs is None:
+            print("CRS not found for dataset. Setting CRS to EPSG:4326.")
+            ds.rio.write_crs("EPSG:4326", inplace=True)
+
+        try:
+            # Convert the combined geometry to a format that rioxarray can work with
+            geometries = [mapping(roi_expanded)]
+
+            # Clip the xarray Dataset using the ROI geometry
+            ds_clipped = ds.rio.clip(geometries, from_disk=True)
+        except Exception:
+            print("Catchment ROI is still not large enough. Clipping with all touching pixels for the original catchment ROI.")
+            # Convert the combined geometry to a format that rioxarray can work with
+            geometries = [mapping(roi)]
+
+            # Clip the xarray Dataset using the ROI geometry
+            ds_clipped = ds.rio.clip(geometries, all_touched = True, from_disk=True)
+
+        # Calculate the spatial average for each time step
+        spatial_mean = ds_clipped.mean(dim=['x', 'y'])
+
+        # Convert the spatial mean to a pandas DataFrame
+        df = pd.DataFrame({
+            'time': spatial_mean.time.values,
+            'precip': spatial_mean.precip.values,
+            'tmin': spatial_mean.tmin.values,
+            'tmax': spatial_mean.tmax.values,
+            'wind': spatial_mean.wind.values
+        })
+
+        # Append the data if the file exists
+        if os.path.exists(output_csv_path):
+            print(f"File {output_csv_path} exists. Appending new data and removing duplicates.")
+            existing_df = pd.read_csv(output_csv_path, parse_dates=['time'])
+            combined_df = pd.concat([existing_df, df], ignore_index=True)
+            # Remove duplicates, keeping the latest entry for each date
+            combined_df = combined_df.sort_values(by='time').drop_duplicates(subset='time', keep='last')
+            combined_df.to_csv(output_csv_path, index=False)
+        else:
+            # Save the new data
+            df.to_csv(output_csv_path, index=False)
+
+        print(f"CSV file has been updated and saved to {output_csv_path}.")
+    else:
+        raise ValueError(f"File {nc_file_path} does not exist.")
\ No newline at end of file
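A hedged usage sketch for `create_meterological_ts`; the paths are placeholders, and the ROI is any WGS84 shapely polygon, matching the docstring's assumptions:

```python
# Hypothetical usage (placeholder paths; variable names follow the docstring above).
from shapely.geometry import box
from rat.toolbox.data_transform import create_meterological_ts

catchment_roi = box(100.25, 13.50, 100.75, 14.00)    # lon/lat bounding box in WGS84

create_meterological_ts(
    roi=catchment_roi,
    nc_file_path="/path/to/combined_data.nc",         # RAT's combined NetCDF: precip/tmin/tmax/wind
    output_csv_path="/path/to/catchment_climate.csv",
)
```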
diff --git a/src/rat/utils/clean.py b/src/rat/utils/clean.py
index 0f3df9b8..ae1c7637 100644
--- a/src/rat/utils/clean.py
+++ b/src/rat/utils/clean.py
@@ -6,19 +6,32 @@
 class Clean:
     def __init__(self, basin_data_dir):
         self.basin_data_dir = basin_data_dir
+        self.days_old_to_delete = 20
         pass

-    def clean_pre_processing(self):
+    def clean_processing(self):
         try:
-            pre_processing_path = os.path.join(self.basin_data_dir,'pre_processing','')
-            shutil.rmtree(pre_processing_path)
+            pre_processing_processed_path = os.path.join(self.basin_data_dir,'pre_processing','processed','')
+            shutil.rmtree(pre_processing_processed_path)
         except:
-            print("No pre_processing folder to delete")
+            print("No processed folder in pre_processing to delete")
+
+        try:
+            post_processing_path = os.path.join(self.basin_data_dir,'post_processing','')
+            shutil.rmtree(post_processing_path)
+        except:
+            print("No post_processing folder to delete")

     def clean_metsim(self):
         try:
-            metsim_path = os.path.join(self.basin_data_dir,'metsim','metsim_outputs','')
-            shutil.rmtree(metsim_path)
+            metsim_inputs_path = os.path.join(self.basin_data_dir,'metsim','metsim_inputs','')
+            shutil.rmtree(metsim_inputs_path)
+        except:
+            print("No metsim_inputs folder to delete")
+
+        try:
+            metsim_outputs_path = os.path.join(self.basin_data_dir,'metsim','metsim_outputs','')
+            shutil.rmtree(metsim_outputs_path)
         except:
             print("No metsim_outputs folder to delete")

@@ -37,7 +50,7 @@
         try:
             vic_init_states_dir_path = os.path.join(self.basin_data_dir,'vic','vic_init_states','')
-            days_old = 15 #n max of days
+            days_old = self.days_old_to_delete  # max age in days
             time_interval = datetime.now() - timedelta(days_old)

             file_namelist = os.listdir(vic_init_states_dir_path)
@@ -58,16 +71,16 @@
             shutil.rmtree(rout_inputs_path)
         except:
             print("No rout_inputs folder to delete")
-
+
         try:
-            rout_outputs_path = os.path.join(self.basin_data_dir,'ro','ou','')
-            shutil.rmtree(rout_outputs_path)
+            rout_workspace_path = os.path.join(self.basin_data_dir,'ro','wkspc','')
+            shutil.rmtree(rout_workspace_path)
         except:
-            print("No rout_outputs folder to delete")
-
+            print("No routing wkspc folder to delete")
+
         try:
             rout_init_states_dir_path = os.path.join(self.basin_data_dir,'ro','rout_state_file','')
-            days_old = 15 #n max of days
+            days_old = self.days_old_to_delete  # max age in days
             time_interval = datetime.now() - timedelta(days_old)

             file_namelist = os.listdir(rout_init_states_dir_path)
@@ -83,12 +96,30 @@
             print("No rout init_state file to delete")

     def clean_gee(self):
+        try:
+            l5_scratch_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos','l5','_scratch')
+            shutil.rmtree(l5_scratch_path)
+        except:
+            print("No _scratch folder to delete for landsat-5 based reservoir area extraction")
+
+        try:
+            l7_scratch_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos','l7','_scratch')
+            shutil.rmtree(l7_scratch_path)
+        except:
+            print("No _scratch folder to delete for landsat-7 based reservoir area extraction")
+
         try:
             l8_scratch_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos','l8','_scratch')
             shutil.rmtree(l8_scratch_path)
         except:
             print("No _scratch folder to delete for landsat-8 based reserevoir area extraction")

+        try:
+            l9_scratch_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos','l9','_scratch')
+            shutil.rmtree(l9_scratch_path)
+        except:
+            print("No _scratch folder to delete for landsat-9 based reservoir area extraction")
+
         try:
             s2_scratch_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos','s2','_scratch')
             shutil.rmtree(s2_scratch_path)
@@ -108,6 +139,12 @@
             shutil.rmtree(rat_outputs_path)
         except:
             print("No previous rat_outputs folder to delete")
+
+        try:
+            rout_outputs_path = os.path.join(self.basin_data_dir,'ro','ou','')
+            shutil.rmtree(rout_outputs_path)
+        except:
+            print("No rout_outputs folder to delete")

         try:
             gee_sarea_path = os.path.join(self.basin_data_dir,'gee','gee_sarea_tmsos')
@@ -115,6 +152,12 @@
         except:
             print("No previous gee extracted surface area data folder to delete")

+        try:
+            gee_nssc_path = os.path.join(self.basin_data_dir,'gee','gee_nssc')
+            shutil.rmtree(gee_nssc_path)
+        except:
+            print("No previous gee extracted NSSC data folder to delete")
+
         try:
             altimetry_timeseries_path = os.path.join(self.basin_data_dir,'altimetry','altimetry_timeseries')
             shutil.rmtree(altimetry_timeseries_path)
@@ -132,3 +175,39 @@
             shutil.rmtree(final_outputs_path)
         except:
             print("No final_outputs folder to delete with previous outputs")
+
+    def clean_basin_parameter_files(self):
+        try:
+            basin_grid_data_path = os.path.join(self.basin_data_dir,'basin_grid_data')
+            shutil.rmtree(basin_grid_data_path)
+        except:
+            print("No basin_grid_data folder to delete")
+
+        try:
+            vic_basin_params_path = os.path.join(self.basin_data_dir,'vic','vic_basin_params')
+            shutil.rmtree(vic_basin_params_path)
+        except:
+            print("No vic_basin_params folder to delete inside vic")
+
+        try:
+            ro_basin_params_path = os.path.join(self.basin_data_dir,'ro','pars')
+            shutil.rmtree(ro_basin_params_path)
+        except:
+            print("No pars folder to delete inside ro")
+
+        try:
+            gee_basin_params_path = os.path.join(self.basin_data_dir,'gee','gee_basin_params')
+            shutil.rmtree(gee_basin_params_path)
+        except:
+            print("No gee_basin_params folder to delete inside gee")
+
+    def clean_basin_meteorological_data(self):
+        try:
+            basin_meteorological_combined_path = os.path.join(self.basin_data_dir,'pre_processing','nc')
+            shutil.rmtree(basin_meteorological_combined_path)
+        except:
+            print("No nc folder (containing the basin's combined meteorological file) to delete inside pre_processing")
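A minimal sketch, assumed rather than taken from the patch, of how the `CLEAN UP` config flags map one-to-one onto the `Clean` methods above; the flag names are the ones this diff introduces:

```python
# Sketch: dispatching CLEAN_UP config flags to the matching Clean methods.
from rat.utils.clean import Clean

cleaner = Clean("/path/to/basin_data_dir")   # placeholder basin data directory
clean_up_flags = {                           # as read from the config's CLEAN UP section
    "clean_metsim": True,
    "clean_vic": False,
    "clean_gee": True,
    "clean_basin_parameter_files": False,
}
for flag, enabled in clean_up_flags.items():
    if enabled:
        getattr(cleaner, flag)()             # e.g. cleaner.clean_metsim()
```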
diff --git a/src/rat/utils/convert_to_final_outputs.py b/src/rat/utils/convert_to_final_outputs.py
index 327364ba..2b1614d6 100644
--- a/src/rat/utils/convert_to_final_outputs.py
+++ b/src/rat/utils/convert_to_final_outputs.py
@@ -4,6 +4,7 @@
 import numpy as np
 from pathlib import Path
 from rat.utils.utils import create_directory
+from rat.toolbox.data_transform import create_meterological_ts

 def convert_sarea(sarea_dir, website_v_dir):
     # Surface Area
@@ -23,10 +24,8 @@

         df.to_csv(savepath, index=False)

-def convert_inflow(inflow_dir, reservoir_shpfile, reservoir_shpfile_column_dict, final_out_dir):
+def convert_inflow(inflow_dir, final_out_dir):
     # Inflow
-    reservoirs = gpd.read_file(reservoir_shpfile)
-    reservoirs['Inflow_filename'] = reservoirs[reservoir_shpfile_column_dict['unique_identifier']].astype(str)

     inflow_paths = list(Path(inflow_dir).glob('*.csv'))
     final_out_inflow_dir = Path(final_out_dir) / 'inflow'
@@ -135,7 +134,54 @@
         'Elevation_Observed': 'elevation_srtm'
     }, axis=1, inplace=True)
     aec.to_csv(dst_dir / src_path.name, index=False)
+
+def convert_nssc(nssc_dir, final_out_dir):
+    src_dir = Path(nssc_dir)
+    dst_dir = Path(create_directory(os.path.join(final_out_dir,'nssc'),True))
+
+    for src_path in src_dir.glob('*.csv'):
+        nssc_df = pd.read_csv(src_path)
+        df_to_save = nssc_df[['date','nssc_rd_gn_px','nssc_nr_rd_px', 'nssc_rd_gn_res', 'nssc_nr_rd_res']]
+        df_to_save.rename({
+            'nssc_rd_gn_px': 'NSSC (red/green per pixel)',
+            'nssc_nr_rd_px': 'NSSC (nir/red per pixel)',
+            'nssc_rd_gn_res': 'NSSC (total_red/total_green)',
+            'nssc_nr_rd_res': 'NSSC (total_nir/total_red)',
+        }, axis=1, inplace=True)
+        df_to_save.to_csv(dst_dir / src_path.name, index=False)

+def convert_meteorological_ts(catchment_shpfile, catchments_gdf_column_dict, basin_gpd_df, meteorological_nc_file_path, final_out_dir):
+    if catchment_shpfile:
+        print('Reading Catchment Shapefile')
+        catchments = gpd.read_file(catchment_shpfile)
+        dst_dir = Path(create_directory(os.path.join(final_out_dir,'catchment_climate'),True))
+
+        if catchments_gdf_column_dict['unique_identifier'] == 'uniq_id':
+            print('Finding a spatial join of catchments because of the Global Station Vector File')
+            basin_data_crs_changed = basin_gpd_df.to_crs(catchments.crs)
+            catchments_spatial_filtered = gpd.sjoin(catchments, basin_data_crs_changed, "inner")[catchments.columns]
+            catchments_spatial_filtered['uniq_id'] = catchments_spatial_filtered[catchments_gdf_column_dict['id_column']].astype(str)+'_'+ \
+                                    catchments_spatial_filtered[catchments_gdf_column_dict['dam_name_column']].astype(str).str.replace(' ','_')
+        else:
+            catchments_spatial_filtered = catchments.copy()
+
+        failed_res_no = 0
+        for index, row in catchments_spatial_filtered.iterrows():
+            try:
+                res_name = row[catchments_gdf_column_dict['unique_identifier']]
+                save_file_path = dst_dir / (res_name+'.csv')
+                catchment_roi = row['geometry']
+                print(f"Creating the Catchment's Climatological time series for reservoir : {res_name}")
+                create_meterological_ts(catchment_roi, meteorological_nc_file_path, save_file_path)
+            except Exception as e:
+                failed_res_no = failed_res_no + 1
+                print(f"Error while creating the meteorological TS for {res_name}: {e}")
+                continue
+
+        return failed_res_no
+    else:
+        return 'all'

 def convert_v2_frontend(basin_data_dir, res_name, inflow_src, sarea_src, dels_src, outflow_src):
     """Converts the files according to the newer version of the frontend (v2). 
diff --git a/src/rat/utils/files_creator.py b/src/rat/utils/files_creator.py
index 9cf04fe0..5d376221 100644
--- a/src/rat/utils/files_creator.py
+++ b/src/rat/utils/files_creator.py
@@ -61,28 +61,39 @@
     #Inserting run_cell as mask of basin_grid in vic_param.nc
     basin_vic_param=gl_param.interp(lon=np.array([round_up(x,5) for x in basin_grid.lon.data ]),lat=np.array([round_up(x,5) for x in basin_grid.lat.data ]),method='nearest')
+    # Identify cells where 'run_cell' is NaN in the original vic soil params (because of oceanic regions)
+    mask = basin_vic_param['run_cell'].isnull()
+
+    # Change run_cell to basin grid cells
     basin_vic_param['run_cell']=(('lat','lon'),basin_grid.data.astype('int32'))
+    if mask.sum()!=0:
+        # Make run_cell 0 over the null cells of the original vic soil params (oceanic regions)
+        basin_vic_param['run_cell'] = basin_vic_param['run_cell'].where(~mask, 0)
+
     #Saving vic_param.nc
     basin_vic_param.to_netcdf(os.path.join(output_dir_path,'vic_soil_param.nc'))

-    #Inserting run_cell as mask of basin_grid in vic_param.nc
+    #Inserting mask as mask of basin_grid in vic_param.nc
     basin_vic_domain=gl_domain.interp(lon=np.array([round_up(x,5) for x in basin_grid.lon.data ]),lat=np.array([round_up(x,5) for x in basin_grid.lat.data ]),method='nearest')
     basin_vic_domain['mask']=(('lat','lon'),basin_grid.data.astype('int32'))
+    if mask.sum()!=0:
+        # Make mask 0 over the null cells of the original vic soil params (oceanic regions)
+        basin_vic_domain['mask'] = basin_vic_domain['mask'].where(~mask, 0)

     #Saving vic_domain.nc
     basin_vic_domain.to_netcdf(os.path.join(output_dir_path,'vic_domain.nc'))

 def create_basin_grid_flow_asc(global_flow_grid_dir_tif, basingridfile_path, savepath, flow_direction_replace_dict = None):
-    global_flow_grid_dir=rxr.open_rasterio(global_flow_grid_dir_tif)
-    basin_grid=rxr.open_rasterio(basingridfile_path)
-    basin_flow_grid_dir = global_flow_grid_dir.interp(x=np.array([round_up(i,5) for i in basin_grid.x.data ]),
-                            y=np.array([round_up(i,5) for i in basin_grid.y.data ]),method='nearest')
-    if (flow_direction_replace_dict):
-        for i in flow_direction_replace_dict:
-            basin_flow_grid_dir = basin_flow_grid_dir.where(basin_flow_grid_dir!=i, flow_direction_replace_dict[i])
-
-    basin_flow_grid_dir = basin_flow_grid_dir.rio.write_nodata(0)
-    basin_flow_grid_dir = basin_flow_grid_dir.where(basin_grid.data==1,0)
-    basin_flow_grid_dir.rio.to_raster(savepath+'.tif', dtype='int16')
+    if not os.path.exists(savepath+'.tif'):
+        global_flow_grid_dir=rxr.open_rasterio(global_flow_grid_dir_tif)
+        basin_grid=rxr.open_rasterio(basingridfile_path)
+        basin_flow_grid_dir = global_flow_grid_dir.interp(x=np.array([round_up(i,5) for i in basin_grid.x.data ]),
+                                y=np.array([round_up(i,5) for i in basin_grid.y.data ]),method='nearest')
+        if (flow_direction_replace_dict):
+            for i in flow_direction_replace_dict:
+                basin_flow_grid_dir = basin_flow_grid_dir.where(basin_flow_grid_dir!=i, flow_direction_replace_dict[i])
+
+        basin_flow_grid_dir = basin_flow_grid_dir.rio.write_nodata(0)
+        basin_flow_grid_dir = basin_flow_grid_dir.where(basin_grid.data==1,0)
+        basin_flow_grid_dir.rio.to_raster(savepath+'.tif', dtype='int16')

     # Change format, and save as asc file
     cmd = [
@@ -158,11 +169,11 @@
     reservoirs_gdf_column_dict = reservoir_shpfile_column_dict

     if routing_station_global_data:
-        reservoirs_spatialjoin = gpd.sjoin(reservoirs, basin_data_crs_changed, 
"inner")[reservoirs.columns] + reservoirs_spatialjoin = gpd.sjoin(reservoirs, basin_data_crs_changed, predicate="within", how="inner")[reservoirs.columns] reservoirs_spatialjoin['uniq_id'] = reservoirs_spatialjoin[reservoirs_gdf_column_dict['id_column']].astype(str)+'_'+ \ reservoirs_spatialjoin[reservoirs_gdf_column_dict['dam_name_column']].astype(str).str.replace(' ','_') else: - reservoirs_spatialjoin = gpd.sjoin(reservoirs, basin_data_crs_changed, "inner")[reservoirs.columns] + reservoirs_spatialjoin = gpd.sjoin(reservoirs, basin_data_crs_changed, predicate="within", how="inner")[reservoirs.columns] if(reservoirs_spatialjoin.empty): raise Exception('Reservoir names in reservoir shapefile are not matching with the station names in the station file used for routing.') diff --git a/src/rat/utils/logging.py b/src/rat/utils/logging.py index 5146a319..7b858f3b 100644 --- a/src/rat/utils/logging.py +++ b/src/rat/utils/logging.py @@ -83,9 +83,13 @@ def init_logger(log_dir='./', log_level='DEBUG', verbose=False, notify=False, lo if(for_basin): log_file = os.path.join(log_dir, 'RAT-'+ basin_name + strftime('%Y%m%d-%H%M%S', gmtime()) + '.log') + log_detail = 'LEVEL 2' + log_mode = 'DETAILED' else: log_file = os.path.join(log_dir, 'RAT_run-'+ strftime('%Y%m%d-%H%M%S', gmtime()) + '.log') + log_detail = 'LEVEL 1' + log_mode = 'BRIEF' fh = logging.FileHandler(log_file) fh.setLevel(log_level) fh.setFormatter(FORMATTER) @@ -126,9 +130,12 @@ def init_logger(log_dir='./', log_level='DEBUG', verbose=False, notify=False, lo logger.info('Logging To Console: %s', verbose) logger.info('LOG FILE: %s', log_file) logger.info('NOTIFY: %s', notify) - logger.info('----------------------------------------------------------\n') + logger.info('----------------------------------------------------------------') + logger.info('LOG DETAIL: %s', log_detail) + logger.info('LOG MODE: %s', log_mode) + logger.info('----------------------------------------------------------------\n') - return logger + return logger, log_file # -------------------------------------------------------------------- # diff --git a/src/rat/utils/utils.py b/src/rat/utils/utils.py index d787ce08..ecf5df01 100644 --- a/src/rat/utils/utils.py +++ b/src/rat/utils/utils.py @@ -30,7 +30,7 @@ def round_up(n, decimals=0): if n>0: return math.ceil(n * multiplier) / multiplier else: - return -math.ceil(abs(n) * multiplier) / multiplier + return -math.floor(abs(n) * multiplier) / multiplier # https://gist.github.com/pritamd47/e7ddc49f25ae7f1b06c201f0a8b98348 # Clip time-series