From b1d9c977dfec299c8b02cf0bd523ac2324ca41be Mon Sep 17 00:00:00 2001 From: rodalynbarce <121169437+rodalynbarce@users.noreply.github.com> Date: Tue, 15 Aug 2023 12:11:47 +0100 Subject: [PATCH 01/42] Added dagster and query examples (#4) * Added Dagster examples Signed-off-by: rodalynbarce * Updated dagster examples Signed-off-by: rodalynbarce * Added query examples Signed-off-by: rodalynbarce * Updated example links Signed-off-by: rodalynbarce * Update links Signed-off-by: rodalynbarce * Added links Signed-off-by: rodalynbarce * Changed links and removed token. Signed-off-by: rodalynbarce * Added tag_name to description Signed-off-by: rodalynbarce --------- Signed-off-by: rodalynbarce --- .../README.md | 66 ++++++++++++++++++ .../pipeline.py | 36 ++++++++++ .../Fledge-Dagster-Pipeline-Local/README.md | 38 ++++++++++ .../Fledge-Dagster-Pipeline-Local/pipeline.py | 69 +++++++++++++++++++ queries/Interpolate/README.md | 34 +++++++++ queries/Interpolate/interpolate.py | 25 +++++++ queries/Interpolation-at-Time/README.md | 28 ++++++++ .../interpolation_at_time.py | 20 ++++++ queries/Metadata/README.md | 22 ++++++ queries/Metadata/metadata.py | 17 +++++ queries/Raw/README.md | 26 +++++++ queries/Raw/raw.py | 21 ++++++ queries/Resample/README.md | 37 ++++++++++ queries/Resample/resample.py | 24 +++++++ queries/Time-Weighted-Average/README.md | 37 ++++++++++ .../time_weighted_average.py | 25 +++++++ 16 files changed, 525 insertions(+) create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py create mode 100644 queries/Interpolate/README.md create mode 100644 queries/Interpolate/interpolate.py create mode 100644 queries/Interpolation-at-Time/README.md create mode 100644 queries/Interpolation-at-Time/interpolation_at_time.py create mode 100644 queries/Metadata/README.md create mode 100644 queries/Metadata/metadata.py create mode 100644 queries/Raw/README.md create mode 100644 queries/Raw/raw.py create mode 100644 queries/Resample/README.md create mode 100644 queries/Resample/resample.py create mode 100644 queries/Time-Weighted-Average/README.md create mode 100644 queries/Time-Weighted-Average/time_weighted_average.py diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md new file mode 100644 index 0000000..caff39c --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md @@ -0,0 +1,66 @@ +# Fledge Pipeline using Dagster and Databricks Connect + +This article provides a guide on how to deploy a pipeline in dagster using the RTDIP SDK and Databricks Connect. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. + +!!! note "Note" + Reading from Eventhubs is currently not supported on Databricks Connect. + +## Prerequisites +Deployment using Databricks Connect requires: + +* a Databricks workspace + +* a cluster in the same workspace + +* a personal access token + +Further information on Databricks requirements can be found [here](https://docs.databricks.com/en/dev-tools/databricks-connect-ref.html#requirements). 
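Once the packages below are installed, it can be worth confirming that Databricks Connect can actually reach your cluster before wiring it into a pipeline. A minimal sketch, assuming a workspace URL, personal access token and cluster ID are available (all values are placeholders, mirroring the pipeline example further down):

```python
from databricks.connect import DatabricksSession

# Placeholder values - replace with your workspace details, or rely on a
# .databrickscfg profile instead (see the Authentication section below).
spark = DatabricksSession.builder.remote(
    host="https://{workspace_instance}",
    token="{personal_access_token}",
    cluster_id="{cluster_id}",
).getOrCreate()

# A trivial query that runs on the remote cluster; prints 5 if the
# connection is working.
print(spark.range(5).count())
```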
+ + +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [databricks-connect](https://pypi.org/project/databricks-connect/) + +* [dagster](https://docs.dagster.io/getting-started/install) + + +!!! note "Dagster Installation" + For Mac users with an M1 or M2 chip, installation of dagster should be done as follows: + ``` + pip install dagster dagster-webserver --find-links=https://github.com/dagster-io/build-grpcio/wiki/Wheels + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[SparkDeltaSource](../../../../code-reference/pipelines/sources/spark/delta.md)|Read data from a Delta table.| +|[BinaryToStringTransformer](../../../../code-reference/pipelines/transformers/spark/binary_to_string.md)|Converts a Spark DataFrame column from binary to string.| +|[FledgeOPCUAJsonToPCDMTransformer](../../../../code-reference/pipelines/transformers/spark/fledge_opcua_json_to_pcdm.md)|Converts a Spark DataFrame column containing a json string to the Process Control Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Authentication +For Databricks authentication, the following fields should be added to a configuration profile in your [`.databrickscfg`](https://docs.databricks.com/en/dev-tools/auth.html#config-profiles) file: + +``` +[PROFILE] +host = https://{workspace_instance} +token = dapi... +cluster_id = {cluster_id} +``` + +This profile should match the configurations in your `DatabricksSession` in the example below as it will be used by the [Databricks extension](https://docs.databricks.com/en/dev-tools/vscode-ext-ref.html#configure-the-extension) in VS Code for authenticating your Databricks cluster. + +## Example +Below is an example of how to set up a pipeline to read Fledge data from a Delta table, transform it to RTDIP's [PCDM model](../../../../../domains/process_control/data_model.md) and write it to a Delta table. + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py" +``` + +## Deploy +The following command deploys the pipeline to dagster: +`dagster dev -f ` + +Using the link provided from the command above, click on Launchpad and hit run to run the pipeline. 
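If you prefer to trigger the job without the Dagster web UI (for example in a quick local test), Dagster jobs can also be executed directly from Python. A sketch, assuming the definitions above are saved in a file named pipeline.py in the current working directory:

```python
# Execute the Dagster job in-process, without the Dagster webserver.
from pipeline import fledge_pipeline_job  # pipeline.py from the example above

result = fledge_pipeline_job.execute_in_process()
print(result.success)  # True if every op completed successfully
```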
\ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py new file mode 100644 index 0000000..d03332f --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py @@ -0,0 +1,36 @@ +from dagster import Definitions, ResourceDefinition, graph, op +from databricks.connect import DatabricksSession +from rtdip_sdk.pipelines.sources.spark.delta import SparkDeltaSource +from rtdip_sdk.pipelines.transformers.spark.binary_to_string import BinaryToStringTransformer +from rtdip_sdk.pipelines.transformers.spark.fledge_opcua_json_to_pcdm import FledgeOPCUAJsonToPCDMTransformer +from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination + +# Databricks cluster configuration +databricks_resource = ResourceDefinition.hardcoded_resource( + DatabricksSession.builder.remote( + host = "https://{workspace_instance_name}", + token = "{token}", + cluster_id = "{cluster_id}" + ).getOrCreate() +) + +# Pipeline +@op(required_resource_keys={"databricks"}) +def pipeline(context): + spark = context.resources.databricks + source = SparkDeltaSource(spark, {}, "{path_to_table}").read_batch() + transformer = BinaryToStringTransformer(source, "{source_column_name}", "{target_column_name}").transform() + transformer = FledgeOPCUAJsonToPCDMTransformer(transformer, "{source_column_name}").transform() + SparkDeltaDestination(transformer, {}, "{path_to_table}").write_batch() + +@graph +def fledge_pipeline(): + pipeline() + +fledge_pipeline_job = fledge_pipeline.to_job( + resource_defs={ + "databricks": databricks_resource + } +) + +defs = Definitions(jobs=[fledge_pipeline_job]) \ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md new file mode 100644 index 0000000..1cbb5a1 --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md @@ -0,0 +1,38 @@ +# Fledge Pipeline using Dagster + +This article provides a guide on how to deploy a pipeline in dagster using the RTDIP SDK. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. + +## Prerequisites +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [dagster](https://docs.dagster.io/getting-started/install) + + +!!! 
note "Dagster Installation" + For Mac users with an M1 or M2 chip, installation of dagster should be done as follows: + ``` + pip install dagster dagster-webserver --find-links=https://github.com/dagster-io/build-grpcio/wiki/Wheels + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[SparkEventhubSource](../../../../code-reference/pipelines/sources/spark/eventhub.md)|Read data from an Eventhub.| +|[BinaryToStringTransformer](../../../../code-reference/pipelines/transformers/spark/binary_to_string.md)|Converts a Spark DataFrame column from binary to string.| +|[FledgeOPCUAJsonToPCDMTransformer](../../../../code-reference/pipelines/transformers/spark/fledge_opcua_json_to_pcdm.md)|Converts a Spark DataFrame column containing a json string to the Process Control Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Example +Below is an example of how to set up a pipeline to read Fledge data from an Eventhub, transform it to RTDIP's [PCDM model](../../../../../domains/process_control/data_model.md) and write it to a Delta table on your machine. + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py" +``` + +## Deploy +The following command deploys the pipeline to dagster: +`dagster dev -f ` + +Using the link provided from the command above, click on Launchpad and hit run to run the pipeline. \ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py new file mode 100644 index 0000000..ba0cefd --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py @@ -0,0 +1,69 @@ +import json +from datetime import datetime as dt +from dagster import Definitions, graph, op +from dagster_pyspark.resources import pyspark_resource +from rtdip_sdk.pipelines.sources.spark.eventhub import SparkEventhubSource +from rtdip_sdk.pipelines.transformers.spark.binary_to_string import BinaryToStringTransformer +from rtdip_sdk.pipelines.transformers.spark.fledge_opcua_json_to_pcdm import FledgeOPCUAJsonToPCDMTransformer +from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination + +# PySpark cluster configuration +packages = "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22,io.delta:delta-core_2.12:2.4.0" +my_pyspark_resource = pyspark_resource.configured( + {"spark_conf": {"spark.default.parallelism": 1, + "spark.jars.packages": packages, + "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension", + "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog" + } + } +) + +# EventHub configuration +eventhub_connection_string = "{eventhub_connection_string}" +eventhub_consumer_group = "{eventhub_consumer_group}" + +startOffset = "-1" +endTime = dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ") + +startingEventPosition = { + "offset": startOffset, + "seqNo": -1, + "enqueuedTime": None, + "isInclusive": True +} + +endingEventPosition = { + "offset": None, + "seqNo": -1, + "enqueuedTime": endTime, + "isInclusive": True +} + +ehConf = { +'eventhubs.connectionString' : eventhub_connection_string, +'eventhubs.consumerGroup': eventhub_consumer_group, +'eventhubs.startingPosition' : json.dumps(startingEventPosition), +'eventhubs.endingPosition' : json.dumps(endingEventPosition), +'maxEventsPerTrigger': 1000 +} + +# Pipeline 
+@op(required_resource_keys={"spark"}) +def pipeline(context): + spark = context.resources.pyspark.spark_session + source = SparkEventhubSource(spark, ehConf).read_batch() + transformer = BinaryToStringTransformer(source, "{source_column_name}", "{target_column_name}").transform() + transformer = FledgeOPCUAJsonToPCDMTransformer(transformer, "{source_column_name}").transform() + SparkDeltaDestination(transformer, {}, "{path_to_table}").write_batch() + +@graph +def fledge_pipeline(): + pipeline() + +fledge_pipeline_job = fledge_pipeline.to_job( + resource_defs={ + "spark": my_pyspark_resource + } +) + +defs = Definitions(jobs=[fledge_pipeline_job]) \ No newline at end of file diff --git a/queries/Interpolate/README.md b/queries/Interpolate/README.md new file mode 100644 index 0000000..057acb2 --- /dev/null +++ b/queries/Interpolate/README.md @@ -0,0 +1,34 @@ +# Interpolate + +[Interpolate](../../code-reference/query/interpolate.md) - takes resampling one step further to estimate the values of unknown data points that fall between existing, known data points. In addition to the resampling parameters, interpolation also requires: + +Interpolation Method - Forward Fill, Backward Fill or Linear + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +region|str|Region| +asset|str|Asset| +data_security_level|str|Level of data security| +data_type|str|Type of the data (float, integer, double, string) +tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +sample_rate|int|(deprecated) Please use time_interval_rate instead. See below.| +sample_unit|str|(deprecated) Please use time_interval_unit instead. 
See below.| +time_interval_rate|str|The time interval rate (numeric input)| +time_interval_unit|str|The time interval unit (second, minute, day, hour)| +agg_method|str|Aggregation Method (first, last, avg, min, max)| +interpolation_method|str|Interpolation method (forward_fill, backward_fill, linear)| +include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Interpolate/interpolate.py" +``` \ No newline at end of file diff --git a/queries/Interpolate/interpolate.py b/queries/Interpolate/interpolate.py new file mode 100644 index 0000000..bad75fb --- /dev/null +++ b/queries/Interpolate/interpolate.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import interpolate + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "agg_method": "first", + "interpolation_method": "forward_fill", + "include_bad_data": True, +} +x = interpolate.get(connection, parameters) +print(x) diff --git a/queries/Interpolation-at-Time/README.md b/queries/Interpolation-at-Time/README.md new file mode 100644 index 0000000..e1a2ded --- /dev/null +++ b/queries/Interpolation-at-Time/README.md @@ -0,0 +1,28 @@ +# Interpolation at Time + +[Interpolation at Time](../../code-reference/query/interpolation_at_time.md) - works out the linear interpolation at a specific time based on the points before and after. This is achieved by providing the following parameter: + +Timestamps - A list of timestamp or timestamps + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|str|List of tagname or tagnames ["tag_1", "tag_2"]| +|timestamps|list|List of timestamp or timestamps in the format YYY-MM-DDTHH:MM:SS or YYY-MM-DDTHH:MM:SS+zz:zz where %z is the timezone. 
(Example +00:00 is the UTC timezone)| +|window_length|int|Add longer window time in days for the start or end of specified date to cater for edge cases.| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Interpolation-at-Time/interpolation_at_time.py" +``` \ No newline at end of file diff --git a/queries/Interpolation-at-Time/interpolation_at_time.py b/queries/Interpolation-at-Time/interpolation_at_time.py new file mode 100644 index 0000000..dd8ec51 --- /dev/null +++ b/queries/Interpolation-at-Time/interpolation_at_time.py @@ -0,0 +1,20 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import interpolation_at_time + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "timestamps": ["2023-01-01", "2023-01-02"], + "window_length": 1, +} +x = interpolation_at_time.get(connection, parameters) +print(x) diff --git a/queries/Metadata/README.md b/queries/Metadata/README.md new file mode 100644 index 0000000..d32ea96 --- /dev/null +++ b/queries/Metadata/README.md @@ -0,0 +1,22 @@ +# Metadata + +[Metadata](../../code-reference/query/metadata.md) queries provide contextual information for time series measurements and include information such as names, descriptions and units of measure. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
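The tag_names parameter is optional for metadata queries, as the parameter table below notes. A short sketch of the three ways to scope the request (all values are placeholders):

```python
# Placeholder values throughout - adjust for your deployment.
base = {
    "business_unit": "{business_unit}",
    "region": "{region}",
    "asset": "{asset_name}",
    "data_security_level": "{security_level}",
}

# Metadata for specific tags only
params_specific = {**base, "tag_names": ["{tag_name_1}", "{tag_name_2}"]}

# An empty list applies no tag filter...
params_unfiltered = {**base, "tag_names": []}

# ...and leaving tag_names out entirely is also accepted
params_omitted = dict(base)
```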
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|tag_names|(optional, list)|Either pass a list of tagname/tagnames ["tag_1", "tag_2"] or leave the list blank [] or leave the parameter out completely| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Metadata/metadata.py" +``` \ No newline at end of file diff --git a/queries/Metadata/metadata.py b/queries/Metadata/metadata.py new file mode 100644 index 0000000..01bceaa --- /dev/null +++ b/queries/Metadata/metadata.py @@ -0,0 +1,17 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import metadata + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], +} +x = metadata.get(connection, parameters) +print(x) diff --git a/queries/Raw/README.md b/queries/Raw/README.md new file mode 100644 index 0000000..0df6883 --- /dev/null +++ b/queries/Raw/README.md @@ -0,0 +1,26 @@ +# Raw + +[Raw](../../code-reference/query/raw.md) facilitates performing raw extracts of time series data, typically filtered by a Tag Name or Device Name and an event time. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
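As the parameter table below describes, start_date and end_date accept a plain date, a datetime, or a datetime with an explicit timezone offset (four-digit years throughout). Any of the following are valid values (illustrative only):

```python
# Each of these is a valid start_date / end_date value:
start_date = "2023-01-01"                 # date only
start_date = "2023-01-01T15:30:00"        # datetime
start_date = "2023-01-01T15:30:00+00:00"  # datetime with timezone offset
```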
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Raw/raw.py" +``` \ No newline at end of file diff --git a/queries/Raw/raw.py b/queries/Raw/raw.py new file mode 100644 index 0000000..2fdc7fe --- /dev/null +++ b/queries/Raw/raw.py @@ -0,0 +1,21 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import raw + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "include_bad_data": True, +} +x = raw.get(connection, parameters) +print(x) diff --git a/queries/Resample/README.md b/queries/Resample/README.md new file mode 100644 index 0000000..3d1754b --- /dev/null +++ b/queries/Resample/README.md @@ -0,0 +1,37 @@ +# Resample + +[Resample](../../code-reference/query/resample.md) enables changing the frequency of time series observations. This is achieved by providing the following parameters: + +Sample Rate - (deprecated) +Sample Unit - (deprecated) +Time Interval Rate - The time interval rate +Time Interval Unit - The time interval unit (second, minute, day, hour) +Aggregation Method - Aggregations including first, last, avg, min, max + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
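The query functions in these examples return a pandas DataFrame. A common next step after resampling is to pivot the long-format result so that each tag becomes its own column - a sketch, assuming the frame `x` returned in the example below contains `EventTime`, `TagName` and `Value` columns (check `x.columns` in your environment):

```python
# Pivot the resampled result from long to wide format, one column per tag.
# Column names here are assumptions - inspect x.columns for your deployment.
pivoted = x.pivot_table(index="EventTime", columns="TagName", values="Value", aggfunc="first")
print(pivoted.head())
```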
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|sample_rate|int|(deprecated) Please use time_interval_rate instead. See below.| +|sample_unit|str|(deprecated) Please use time_interval_unit instead. See below.| +|time_interval_rate|str|The time interval rate (numeric input)| +|time_interval_unit|str|The time interval unit (second, minute, day, hour)| +|agg_method|str|Aggregation Method (first, last, avg, min, max)| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Resample/resample.py" +``` \ No newline at end of file diff --git a/queries/Resample/resample.py b/queries/Resample/resample.py new file mode 100644 index 0000000..b326a0f --- /dev/null +++ b/queries/Resample/resample.py @@ -0,0 +1,24 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import resample + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "agg_method": "first", + "include_bad_data": True, +} +x = resample.get(connection, parameters) +print(x) diff --git a/queries/Time-Weighted-Average/README.md b/queries/Time-Weighted-Average/README.md new file mode 100644 index 0000000..a6f5caa --- /dev/null +++ b/queries/Time-Weighted-Average/README.md @@ -0,0 +1,37 @@ +# Time Weighted Average + +[Time Weighted Averages](../../code-reference/query/time-weighted-average.md) provide an unbiased average when working with irregularly sampled data. The RTDIP SDK requires the following parameters to perform time weighted average queries: + +Window Size Mins - (deprecated) +Time Interval Rate - The time interval rate +Time Interval Unit - The time interval unit (second, minute, day, hour) +Window Length - Adds a longer window time for the start or end of specified date to cater for edge cases +Step - Data points with step "enabled" or "disabled". The options for step are "true", "false" or "metadata" as string types. For "metadata", the query requires that the TagName has a step column configured correctly in the meta data table + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. 
+ +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. Other ways to authenticate are described in the RTDIP documentation. The alternative built-in connection methods are [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a utc date in the format YYYY-MM-DD or a utc datetime in the format YYYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a utc date in the format YYYY-MM-DD or a utc datetime in the format YYYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|window_size_mins|int|(deprecated) Window size in minutes. Please use time_interval_rate and time_interval_unit below instead| +|time_interval_rate|str|The time interval rate (numeric input)| +|time_interval_unit|str|The time interval unit (second, minute, day, hour)| +|window_length|int|Add longer window time in days for the start or end of specified date to cater for edge cases| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| +|step|str|Data points with step "enabled" or "disabled". The options for step are "true", "false" or "metadata". 
"metadata" will retrieve the step value from the metadata table| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Time-Weighted-Average/time_weighted_average.py" +``` \ No newline at end of file diff --git a/queries/Time-Weighted-Average/time_weighted_average.py b/queries/Time-Weighted-Average/time_weighted_average.py new file mode 100644 index 0000000..10cce1d --- /dev/null +++ b/queries/Time-Weighted-Average/time_weighted_average.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import time_weighted_average + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "window_length": 1, + "include_bad_data": True, + "step": "true" +} +x = time_weighted_average.get(connection, parameters) +print(x) From 3584996b41aa339894f782aebfaacd572bc51a04 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Wed, 16 Aug 2023 10:23:02 +0100 Subject: [PATCH 02/42] Added Circular Average/Standard Deviation examples Signed-off-by: rodalynbarce --- queries/Circular-Average/README.md | 30 +++++++++++++++++++ queries/Circular-Average/circular_average.py | 25 ++++++++++++++++ queries/Circular-Standard-Deviation/README.md | 30 +++++++++++++++++++ .../circular_standard_deviation.py | 25 ++++++++++++++++ 4 files changed, 110 insertions(+) create mode 100644 queries/Circular-Average/README.md create mode 100644 queries/Circular-Average/circular_average.py create mode 100644 queries/Circular-Standard-Deviation/README.md create mode 100644 queries/Circular-Standard-Deviation/circular_standard_deviation.py diff --git a/queries/Circular-Average/README.md b/queries/Circular-Average/README.md new file mode 100644 index 0000000..b268e22 --- /dev/null +++ b/queries/Circular-Average/README.md @@ -0,0 +1,30 @@ +# Circular Average + +[Circular Average](../../code-reference/query/circular-average.md) - A function that receives a dataframe of raw tag data and computes the circular mean for samples in a range, returning the results. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +region|str|Region| +asset|str|Asset| +data_security_level|str|Level of data security| +data_type|str|Type of the data (float, integer, double, string) +tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +time_interval_rate|str|The time interval rate (numeric input)| +time_interval_unit|str|The time interval unit (second, minute, day, hour)| +lower_bound|int|Lower boundary for the sample range| +upper_bound|int|Upper boundary for the sample range| +include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Circular-Average/circular_average.py" +``` \ No newline at end of file diff --git a/queries/Circular-Average/circular_average.py b/queries/Circular-Average/circular_average.py new file mode 100644 index 0000000..ab9b2e3 --- /dev/null +++ b/queries/Circular-Average/circular_average.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import circular_average + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "lower_bound": 0, + "upper_bound": 360, + "include_bad_data": True, +} +x = circular_average.get(connection, parameters) +print(x) diff --git a/queries/Circular-Standard-Deviation/README.md b/queries/Circular-Standard-Deviation/README.md new file mode 100644 index 0000000..4bf244d --- /dev/null +++ b/queries/Circular-Standard-Deviation/README.md @@ -0,0 +1,30 @@ +# Circular Standard Deviation + +[Circular Standard Deviation](../../code-reference/query/circular_standard_deviation.md) - A function that receives a dataframe of raw tag data and computes the circular standard deviation for samples assumed to be in the range, returning the results. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +region|str|Region| +asset|str|Asset| +data_security_level|str|Level of data security| +data_type|str|Type of the data (float, integer, double, string) +tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +time_interval_rate|str|The time interval rate (numeric input)| +time_interval_unit|str|The time interval unit (second, minute, day, hour)| +lower_bound|int|Lower boundary for the sample range| +upper_bound|int|Upper boundary for the sample range| +include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Circular-Standard-Deviation/circular_standard_deviation.py" +``` \ No newline at end of file diff --git a/queries/Circular-Standard-Deviation/circular_standard_deviation.py b/queries/Circular-Standard-Deviation/circular_standard_deviation.py new file mode 100644 index 0000000..2789131 --- /dev/null +++ b/queries/Circular-Standard-Deviation/circular_standard_deviation.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import circular_standard_deviation + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "lower_bound": 0, + "upper_bound": 360, + "include_bad_data": True, +} +x = circular_standard_deviation.get(connection, parameters) +print(x) From 3389ebc38981f4242e93f0601f8135627112ca42 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Wed, 16 Aug 2023 10:23:31 +0100 Subject: [PATCH 03/42] Updated link Signed-off-by: rodalynbarce --- queries/Circular-Standard-Deviation/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/queries/Circular-Standard-Deviation/README.md b/queries/Circular-Standard-Deviation/README.md index 4bf244d..8d7bff9 100644 --- a/queries/Circular-Standard-Deviation/README.md +++ b/queries/Circular-Standard-Deviation/README.md @@ -1,6 +1,6 @@ # Circular Standard Deviation -[Circular Standard Deviation](../../code-reference/query/circular_standard_deviation.md) - A function that receives a dataframe of raw tag data and computes the circular standard deviation for samples assumed to be in the range, returning the results. +[Circular Standard Deviation](../../code-reference/query/circular-standard-deviation.md) - A function that receives a dataframe of raw tag data and computes the circular standard deviation for samples assumed to be in the range, returning the results. 
## Prerequisites Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. From 7aa2009674aae8f0c1d850c3d1044e831c7f4117 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Wed, 16 Aug 2023 10:31:38 +0100 Subject: [PATCH 04/42] Added token Signed-off-by: rodalynbarce --- queries/Circular-Average/circular_average.py | 2 +- .../Circular-Standard-Deviation/circular_standard_deviation.py | 2 +- queries/Interpolate/interpolate.py | 2 +- queries/Interpolation-at-Time/interpolation_at_time.py | 2 +- queries/Metadata/metadata.py | 2 +- queries/Raw/raw.py | 2 +- queries/Resample/resample.py | 2 +- queries/Time-Weighted-Average/time_weighted_average.py | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) diff --git a/queries/Circular-Average/circular_average.py b/queries/Circular-Average/circular_average.py index ab9b2e3..556e2dc 100644 --- a/queries/Circular-Average/circular_average.py +++ b/queries/Circular-Average/circular_average.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import circular_average auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Circular-Standard-Deviation/circular_standard_deviation.py b/queries/Circular-Standard-Deviation/circular_standard_deviation.py index 2789131..34b3601 100644 --- a/queries/Circular-Standard-Deviation/circular_standard_deviation.py +++ b/queries/Circular-Standard-Deviation/circular_standard_deviation.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import circular_standard_deviation auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Interpolate/interpolate.py b/queries/Interpolate/interpolate.py index bad75fb..3698e95 100644 --- a/queries/Interpolate/interpolate.py +++ b/queries/Interpolate/interpolate.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import interpolate auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Interpolation-at-Time/interpolation_at_time.py b/queries/Interpolation-at-Time/interpolation_at_time.py index dd8ec51..441ab07 100644 --- a/queries/Interpolation-at-Time/interpolation_at_time.py +++ b/queries/Interpolation-at-Time/interpolation_at_time.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import interpolation_at_time auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Metadata/metadata.py b/queries/Metadata/metadata.py index 01bceaa..05009d7 100644 --- a/queries/Metadata/metadata.py +++ b/queries/Metadata/metadata.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import metadata auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", 
"{http_path}", token) parameters = { diff --git a/queries/Raw/raw.py b/queries/Raw/raw.py index 2fdc7fe..9de0b9d 100644 --- a/queries/Raw/raw.py +++ b/queries/Raw/raw.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import raw auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Resample/resample.py b/queries/Resample/resample.py index b326a0f..1a23e9e 100644 --- a/queries/Resample/resample.py +++ b/queries/Resample/resample.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import resample auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { diff --git a/queries/Time-Weighted-Average/time_weighted_average.py b/queries/Time-Weighted-Average/time_weighted_average.py index 10cce1d..6ba9afd 100644 --- a/queries/Time-Weighted-Average/time_weighted_average.py +++ b/queries/Time-Weighted-Average/time_weighted_average.py @@ -3,7 +3,7 @@ from rtdip_sdk.queries import time_weighted_average auth = DefaultAuth().authenticate() -token = auth.get_token("{token}").token +token = auth.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) parameters = { From cff20b6c93a14e40bb292678b34e002147aaa7c9 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Wed, 16 Aug 2023 10:41:52 +0100 Subject: [PATCH 05/42] Updated link Signed-off-by: rodalynbarce --- queries/Interpolation-at-Time/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/queries/Interpolation-at-Time/README.md b/queries/Interpolation-at-Time/README.md index e1a2ded..fe1ba21 100644 --- a/queries/Interpolation-at-Time/README.md +++ b/queries/Interpolation-at-Time/README.md @@ -1,6 +1,6 @@ # Interpolation at Time -[Interpolation at Time](../../code-reference/query/interpolation_at_time.md) - works out the linear interpolation at a specific time based on the points before and after. This is achieved by providing the following parameter: +[Interpolation at Time](../../code-reference/query/interpolation-at-time.md) - works out the linear interpolation at a specific time based on the points before and after. 
This is achieved by providing the following parameter: Timestamps - A list of timestamp or timestamps From c0f0bb0da74296ae312dabd92eb3952dc3d0b8d4 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Tue, 22 Aug 2023 15:56:42 +0100 Subject: [PATCH 06/42] Initial Commit --- .../Spark-Single-Node-Notebook-AWS/README.md | 22 ++ .../run_conda_installer.sh | 211 ++++++++++++++++++ 2 files changed, 233 insertions(+) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md new file mode 100644 index 0000000..96ce447 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -0,0 +1,22 @@ +# Spark Single Node Notebook AWS + +This article provides a guide how to create a conda based self contained environment to run RTDIP that integrates the following components: +* Java and Spark (Single node configuration). Currently v3.3.2 Spark has been configured +* AWS Libraries for Spark/Hadoop v3.3.2 +* Jupyter Notebook +* RTDIP (v0.6.1) + + +The components of this environment are pinned to specific versions. + +## Prerequisites +The prerequisites for running the environment are: + +* run_conda_installer.sh: An x86 Linux environment with enough free space (Tested on Linux Ubuntu 22.04. A clean environment is preferred) +* the installer will run Jupyter notebook on port 8080. Check that this port is free or change the configuration in the installer. + +# Deploy and Running +Run *run_conda_installer.sh*. After the installer completes: +* A new file *conda_environment_rtdip-sdk.sh* is created. Please use this file (e.g. *source ./conda_environment_rtdip-sdk.sh*) to activate the conda environment. +* On http://host:8080/ where host is the machine where the installer was run, a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines. + diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh new file mode 100644 index 0000000..08a5f00 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -0,0 +1,211 @@ +#!/usr/bin/env bash +# Dependencies: x86_64 Architecture, Linux (Tested on Ubuntu >= 20.04 LTS) curl, unzip +start_time=`date +%s` +echo "PATH: $PATH" +CONDA_CHECK_CMD="conda" +CONDA_CMD_DESCRIPTION=$CONDA_CHECK_CMD +CONDA_INSTALLER_NAME="Miniconda3-latest-Linux-x86_64.sh" +CONDA_INSTALLER_URL="https://repo.anaconda.com/miniconda/$CONDA_INSTALLER_NAME" +DEPLOYER_TMP_DIR=$(echo ${TMPDIR:-/tmp}"/DEPLOYER") +MINICONDA_NAME=miniconda +MINICONDA_PATH=$HOME/$MINICONDA_NAME/ +PATH=$MINICONDA_PATH/bin:$PATH +CONDA_ENV="rtdip-sdk" +CONDA_ENV_HOME=$(pwd)/apps/$CONDA_ENV +mkdir -p $CONDA_ENV_HOME +CWD=$(pwd) +echo "Current Working Directory: $CWD" +echo "CONDA ENV HOME: $CONDA_ENV_HOME" +echo "DEPLOYER TMP Dir: $DEPLOYER_TMP_DIR" +if ! command -v $CONDA_CHECK_CMD &> /dev/null +then + echo "Current dir:" + echo "$CONDA_CMD_DESCRIPTION could not be found. 
Going to Install it" + mkdir -p $DEPLOYER_TMP_DIR + echo "Working Dir to download conda:" + cd $DEPLOYER_TMP_DIR + pwd + curl -O --url $CONDA_INSTALLER_URL + chmod +x $DEPLOYER_TMP_DIR/*.sh + bash $CONDA_INSTALLER_NAME -b -p $HOME/miniconda +fi +cd $CONDA_ENV_HOME +echo "Current Dir:" +pwd + +echo "Updating Conda" +conda update -n base conda -y +echo "Installing Mamba Solver" +conda install -n base conda-libmamba-solver -y +echo "Setting Solver to libmama" +conda config --set solver libmamba + + + +echo "Creating Conda Environment" +conda env create -f environment.yml -y + +# +# JDK +echo "Installing JDK jdk-17.0.2 ***********************************" +export JAVA_VERSION="jdk-17.0.2" +export JDK_FILE_NAME="openjdk-17.0.2_linux-x64_bin.tar.gz" +export JDK_DOWNLOAD_URL="https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/$JDK_FILE_NAME" + +if [ -f "$CONDA_ENV/$JDK_FILE_NAME" ]; then + echo "$CONDA_ENV/$JDK_FILE_NAME Exists" + echo "Removing JDK: $JDK_FILE_NAME" + rm -rf $CONDA_ENV/$JAVA_VERSION + # rm $CONDA_ENV/$JDK_FILE_NAME + unlink $HOME/JDK +fi + +if test -f "$JDK_FILE_NAME"; +then + echo "$JDK_FILE_NAME exists" +else + echo "$JDK_FILE_NAME does not exists. Downloading it from $JDK_DOWNLOAD_URL" + curl -o $JDK_FILE_NAME $JDK_DOWNLOAD_URL +fi + +tar xvfz $JDK_FILE_NAME > /dev/null +ln -s $CONDA_ENV_HOME/$JAVA_VERSION $HOME/JDK +export JAVA_HOME=$HOME/JDK +export PATH=$HOME/JDK/bin:$PATH + +# SPARK 3.3.2 +echo "Installing SPARK 3.3.2 ***********************************" +export SPARK_VERSION="spark-3.3.2-bin-hadoop3" +export SPARK_FILE_NAME="spark-3.3.2-bin-hadoop3.tgz" +export SPARK_DOWNLOAD_URL="https://archive.apache.org/dist/spark/spark-3.3.2/$SPARK_FILE_NAME" +export PYSPARK_VERSION="3.3.2" + + +if [ -f "$CONDA_ENV/$SPARK_VERSION" ]; then + echo "$CONDA_ENV/$SPARK_FILE_NAME Exists" + echo "Removing Spark: $SPARK_FILE_NAME" + rm -rf $CONDA_ENV/$SPARK_VERSION + # rm $CONDA_ENV/$SPARK_FILE_NAME + unlink $HOME/SPARK +fi + +if test -f "$SPARK_FILE_NAME"; +then + echo "$SPARK_FILE_NAME exists" +else + echo "$SPARK_FILE_NAME does not exists. 
Downloading it from $SPARK_DOWNLOAD_URL" + curl -o $SPARK_FILE_NAME $SPARK_DOWNLOAD_URL +fi + + +tar xvfz $SPARK_FILE_NAME > /dev/null +ln -s $CONDA_ENV_HOME/$SPARK_VERSION $HOME/SPARK +export SPARK_HOME=$HOME/SPARK + + +# Extra libraries +echo "Installing Extra Libs ***********************************" +export AWS_JAVA_SDK_BUNDLE_JAR_FILE_NAME="aws-java-sdk-bundle-1.11.1026.jar" +export AWS_JAVA_SDK_BUNDLE_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/$AWS_JAVA_SDK_BUNDLE_JAR_FILE_NAME" + +export HADOOP_AWS_JAR_FILE_NAME="hadoop-aws-3.3.2.jar" +export HADOOP_AWS_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/$HADOOP_AWS_JAR_FILE_NAME" + +export HADOOP_COMMON_JAR_FILE_NAME="hadoop-common-3.3.2.jar" +export HADOOP_COMMON_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.3.2/$HADOOP_COMMON_JAR_FILE_NAME" + +export HADOOP_HDFS_JAR_FILE_NAME="hadoop-hdfs-3.3.2.jar" +export HADOOP_HDFS_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/3.3.2/$HADOOP_HDFS_JAR_FILE_NAME" + +export WOODSTOCK_CORE_JAR_FILE_NAME="woodstox-core-6.5.1.jar" +export WOODSTOCK_CORE_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/com/fasterxml/woodstox/woodstox-core/6.5.1/$WOODSTOCK_CORE_JAR_FILE_NAME" + +export STAX2_API_JAR_FILE_NAME="stax2-api-4.2.1.jar" +export STAX2_API__JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/codehaus/woodstox/stax2-api/4.2.1/$STAX2_API_JAR_FILE_NAME" + +export COMMONS_CONFIGURATION_JAR_FILE_NAME="commons-configuration2-2.9.0.jar" +export COMMONS_CONFIGURATION_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/apache/commons/commons-configuration2/2.9.0/$COMMONS_CONFIGURATION_JAR_FILE_NAME" + +export RE2J_JAR_FILE_NAME="re2j-1.7.jar" +export RE2J_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/com/google/re2j/re2j/1.7/$RE2J_JAR_FILE_NAME" + +export AZURE_EVENTHUBS_SPARK_JAR_FILE_NAME="azure-eventhubs-spark_2.12-2.3.22.jar" +export AZURE_EVENTHUBS_SPARK_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/com/microsoft/azure/azure-eventhubs-spark_2.12/2.3.22/$AZURE_EVENTHUBS_SPARK_JAR_FILE_NAME" + +export AZURE_EVENTHUBS_JAR_FILE_NAME="azure-eventhubs-3.3.0.jar" +export AZURE_EVENTHUBS_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/com/microsoft/azure/azure-eventhubs/3.3.0/$AZURE_EVENTHUBS_JAR_FILE_NAME" + +export SCALA_JAVA8_COMPAT_JAR_FILE_NAME="scala-java8-compat_2.12-1.0.2.jar" +export SCALA_JAVA8_COMPAT_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/scala-lang/modules/scala-java8-compat_2.12/1.0.2/$SCALA_JAVA8_COMPAT_JAR_FILE_NAME" + +export PROTON_J_JAR_FILE_NAME="proton-j-0.34.1.jar" +export PROTON_J_JAR_DOWNLOAD_URL="https://repo1.maven.org/maven2/org/apache/qpid/proton-j/0.34.1/$PROTON_J_JAR_FILE_NAME" + +curl -o $AWS_JAVA_SDK_BUNDLE_JAR_FILE_NAME $AWS_JAVA_SDK_BUNDLE_JAR_DOWNLOAD_URL +mv $AWS_JAVA_SDK_BUNDLE_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $HADOOP_AWS_JAR_FILE_NAME $HADOOP_AWS_JAR_DOWNLOAD_URL +mv $HADOOP_AWS_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $HADOOP_COMMON_JAR_FILE_NAME $HADOOP_COMMON_JAR_DOWNLOAD_URL +mv $HADOOP_COMMON_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $HADOOP_HDFS_JAR_FILE_NAME $HADOOP_HDFS_JAR_DOWNLOAD_URL +mv $HADOOP_HDFS_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $WOODSTOCK_CORE_JAR_FILE_NAME $WOODSTOCK_CORE_JAR_DOWNLOAD_URL +mv $WOODSTOCK_CORE_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $STAX2_API_JAR_FILE_NAME $STAX2_API__JAR_DOWNLOAD_URL +mv $STAX2_API_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o 
$COMMONS_CONFIGURATION_JAR_FILE_NAME $COMMONS_CONFIGURATION_JAR_DOWNLOAD_URL +mv $COMMONS_CONFIGURATION_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $RE2J_JAR_FILE_NAME $RE2J_JAR_DOWNLOAD_URL +mv $RE2J_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $AZURE_EVENTHUBS_SPARK_JAR_FILE_NAME $AZURE_EVENTHUBS_SPARK_JAR_DOWNLOAD_URL +mv $AZURE_EVENTHUBS_SPARK_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $AZURE_EVENTHUBS_JAR_FILE_NAME $AZURE_EVENTHUBS_JAR_DOWNLOAD_URL +mv $AZURE_EVENTHUBS_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $SCALA_JAVA8_COMPAT_JAR_FILE_NAME $SCALA_JAVA8_COMPAT_JAR_DOWNLOAD_URL +mv $SCALA_JAVA8_COMPAT_JAR_FILE_NAME $SPARK_HOME/jars + +curl -o $PROTON_J_JAR_FILE_NAME $PROTON_J_JAR_DOWNLOAD_URL +mv $PROTON_J_JAR_FILE_NAME $SPARK_HOME/jars + +# Cleaning up +rm $SPARK_FILE_NAME +rm $JDK_FILE_NAME + +# +echo "Finished INSTALLING $JAVA_VERSION and $SPARK_VERSION and Extra Libraries" + +eval "$(conda shell.bash hook)" +conda config --set default_threads 4 +conda env list +# Uncoment the line below to avoid error: CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. +# source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh + +conda activate $CONDA_ENV + +conda info +end_time=`date +%s` +runtime=$((end_time-start_time)) +echo "Total Installation Runtime: $runtime [seconds]" +# Creating env file +CONDA_ENVIRONMENT_FILE_NAME="conda_environment_$CONDA_ENV.sh" +echo "#!/usr/bin/env bash" > $CONDA_ENVIRONMENT_FILE_NAME +echo "export PATH=$PATH" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export JAVA_HOME=$JAVA_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME +chmod +x $CONDA_ENVIRONMENT_FILE_NAME +echo "export SPARK_HOME=$SPARK_HOME" +if [ -z ${NOTEBOOK_PORT+x} ]; then NOTEBOOK_PORT="8080"; else echo "NOTEBOOK_PORT: $NOTEBOOK_PORT"; fi +echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" +source ./$CONDA_ENVIRONMENT_FILE_NAME +jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root From 3befe54d7c52c0b58b59611d144e7b5e77b7bacd Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Wed, 23 Aug 2023 14:46:25 +0100 Subject: [PATCH 07/42] Added Envinronment --- .../environment.yml | 30 +++++++++++++++++++ 1 file changed, 30 insertions(+) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml new file mode 100644 index 0000000..d2b9acc --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml @@ -0,0 +1,30 @@ +name: rtdip-sdk +channels: + - conda-forge + - defaults + +dependencies: + - notebook=6.4.12 + - python==3.10.9 + - azure-storage-file-datalake==12.10.1 + - azure-keyvault-secrets==4.7.0 + - azure-identity==1.12.0 + - pyodbc==4.0.39 + - pandas==1.5.2 + - jinja2==3.0.3 + - jinjasql==0.1.8 + - pyspark==3.3.2 + - delta-spark==2.3.0 + - dependency_injector==4.41.0 + - pydantic==1.10.7 + - boto3==1.26.123 + - semver==3.0.0 + - xlrd==2.0.1 + - pip==23.1.2 + - awscli==1.27.142 + - filelock==3.12.2 + - web3==6.5.0 + - pip: + - hvac==1.1.0 + - rtdip-sdk==0.6.1 + From 89f44ecb7f42bf13560757af67cfc93ac2596a90 Mon Sep 17 00:00:00 2001 From: JamesKnBr Date: Fri, 1 Sep 2023 10:21:23 +0100 Subject: [PATCH 08/42] add python delta sample --- .../deploy/Python-Delta-to-Delta/README.md | 22 
+++++++++++++++++++ .../deploy/Python-Delta-to-Delta/pipeline.py | 6 +++++ 2 files changed, 28 insertions(+) create mode 100644 pipelines/deploy/Python-Delta-to-Delta/README.md create mode 100644 pipelines/deploy/Python-Delta-to-Delta/pipeline.py diff --git a/pipelines/deploy/Python-Delta-to-Delta/README.md b/pipelines/deploy/Python-Delta-to-Delta/README.md new file mode 100644 index 0000000..36734b6 --- /dev/null +++ b/pipelines/deploy/Python-Delta-to-Delta/README.md @@ -0,0 +1,22 @@ +# Fledge Pipeline using Dagster + +This article provides a guide on how to execute a simple Delta Table copy locally without Spark using the RTDIP SDK. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. + +## Prerequisites +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + + +## Components +|Name|Description| +|---------------------------|----------------------| +|[PythonDeltaSource](../../../../code-reference/pipelines/sources/python/delta.md)|Reads data from a Delta Table.| +|[PythonDeltaDestination](../../../../code-reference/pipelines/destinations/python/delta.md)|Writes to a Delta table.| + +## Example +Below is an example of how to read from and write to Delta Tables locally without the need for Spark + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/Python-Delta-to-Delta/pipeline.py" +``` \ No newline at end of file diff --git a/pipelines/deploy/Python-Delta-to-Delta/pipeline.py b/pipelines/deploy/Python-Delta-to-Delta/pipeline.py new file mode 100644 index 0000000..a3fb585 --- /dev/null +++ b/pipelines/deploy/Python-Delta-to-Delta/pipeline.py @@ -0,0 +1,6 @@ +from rtdip_sdk.pipelines.sources.python.delta import PythonDeltaSource +from rtdip_sdk.pipelines.destinations.python.delta import PythonDeltaDestination + +source = PythonDeltaSource("{/path/to/source/table}").read_batch() + +destination = PythonDeltaDestination(source, "{/path/to/destination/table}", mode="append").write_batch() From 751610b7b9d9083f61da6d872428cef24b21f808 Mon Sep 17 00:00:00 2001 From: JamesKnBr Date: Fri, 1 Sep 2023 16:42:07 +0100 Subject: [PATCH 09/42] change relative links Signed-off-by: JamesKnBr --- pipelines/deploy/Python-Delta-to-Delta/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pipelines/deploy/Python-Delta-to-Delta/README.md b/pipelines/deploy/Python-Delta-to-Delta/README.md index 36734b6..ddc20dd 100644 --- a/pipelines/deploy/Python-Delta-to-Delta/README.md +++ b/pipelines/deploy/Python-Delta-to-Delta/README.md @@ -5,14 +5,14 @@ This article provides a guide on how to execute a simple Delta Table copy locall ## Prerequisites This pipeline job requires the packages: -* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) +* [rtdip-sdk](../../../../getting-started/installation.md#installing-the-rtdip-sdk) ## Components |Name|Description| |---------------------------|----------------------| -|[PythonDeltaSource](../../../../code-reference/pipelines/sources/python/delta.md)|Reads data from a Delta Table.| -|[PythonDeltaDestination](../../../../code-reference/pipelines/destinations/python/delta.md)|Writes to a Delta table.| +|[PythonDeltaSource](../../../code-reference/pipelines/sources/python/delta.md)|Reads data from a Delta Table.| +|[PythonDeltaDestination](../../../code-reference/pipelines/destinations/python/delta.md)|Writes to a Delta table.| ## Example Below is an example of how to 
read from and write to Delta Tables locally without the need for Spark From 6b8af8e46ab37ae3d51a49c20446fdbf52b14c90 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Thu, 10 Aug 2023 14:37:47 +0100 Subject: [PATCH 10/42] Added MISO example Signed-off-by: rodalynbarce --- .../MISO-Daily-Load-Pipeline-Local/README.md | 34 +++++++ .../pipeline.py | 98 +++++++++++++++++++ 2 files changed, 132 insertions(+) create mode 100644 pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md create mode 100644 pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py diff --git a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md new file mode 100644 index 0000000..e02a373 --- /dev/null +++ b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md @@ -0,0 +1,34 @@ +# MISO Pipeline using RTDIP +This article provides a guide on how to execute a pipeline using the RTDIP SDK's MISO components. This pipeline was tested on an M2 Macbook Pro using VS Code in a Conda (3.11) environment. + +## Prerequisites +This pipeline assumes you have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: + +* RTDIP SDK + +* Java + +!!! note "RTDIP SDK Installation" + Ensure you have installed the RTDIP SDK as follows: + ``` + pip install "rtdip-sdk[pipelines,pyspark]" + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[MISODailyLoadISOSource](../../../../code-reference/pipelines/sources/spark/iso/miso_daily_load_iso.md)|Read daily load data from MISO API.| +|[MISOToMDMTransformer](../../../../code-reference/pipelines/transformers/spark/iso/miso_to_mdm.md)|Converts MISO Raw data into Meters Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Example +Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table on your machine. +```python +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/0434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py:7:" +``` + +!!! 
note "Using environments" + If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: + ```python + --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/0434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py::5" + ``` \ No newline at end of file diff --git a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py new file mode 100644 index 0000000..81c1984 --- /dev/null +++ b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py @@ -0,0 +1,98 @@ +import sys +import os + +os.environ['PYSPARK_PYTHON'] = sys.executable +os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable + +from rtdip_sdk.pipelines.execute import PipelineJob, PipelineStep, PipelineTask, PipelineJobExecute +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination + +step_list = [] + +# Read step +step_list.append(PipelineStep( + name="step_1", + description="Read Forecast data from MISO API.", + component=MISODailyLoadISOSource, + component_parameters={"options": { + "load_type": "actual", + "date": "20230520", + }, + }, + provide_output_to_step=["step_2", "step_4"] +)) + +# Transform step - Values +step_list.append(PipelineStep( + name="step_2", + description="Get measurement data.", + component=MISOToMDMTransformer, + component_parameters={ + "output_type": "usage", + }, + depends_on_step=["step_1"], + provide_output_to_step=["step_3"] +)) + +# Write step +step_list.append(PipelineStep( + name="step_3", + description="Write measurement data to Delta table.", + component=SparkDeltaDestination, + component_parameters={ + "destination": "MISO_ISO_Usage_Data", + "options": { + "partitionBy":"timestamp" + }, + "mode": "overwrite" + }, + depends_on_step=["step_2"] +)) + +# Transformer step - Meta information +step_list.append(PipelineStep( + name="step_4", + description="Get meta information.", + component=MISOToMDMTransformer, + component_parameters={ + "output_type": "meta", + }, + depends_on_step=["step_1"], + provide_output_to_step=["step_5"] +)) + +# Write step +step_list.append(PipelineStep( + name="step_5", + description="step_5", + component=SparkDeltaDestination, + component_parameters={ + "destination": "MISO_ISO_Meta_Data", + "options": {}, + "mode": "overwrite" + }, + depends_on_step=["step_4"] +)) + +# Tasks contain a list of steps +task = PipelineTask( + name="test_task", + description="test_task", + step_list=step_list, + batch_task=True +) + +# Job containing a list of tasks +pipeline_job = PipelineJob( + name="test_job", + description="test_job", + version="0.0.1", + task_list=[task] +) + +# Execute +pipeline = PipelineJobExecute(pipeline_job) + +result = pipeline.run() \ No newline at end of file From f38b80975a789ffc7b3701379ca3ab7eb432956c Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Thu, 10 Aug 2023 15:07:03 +0100 Subject: [PATCH 11/42] Added RTDIP and Java links Signed-off-by: rodalynbarce --- pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md index e02a373..a71d13a 100644 --- a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md +++ 
b/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md @@ -4,9 +4,9 @@ This article provides a guide on how to execute a pipeline using the RTDIP SDK's ## Prerequisites This pipeline assumes you have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: -* RTDIP SDK +* [RTDIP SDK](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) -* Java +* [Java](../../../../../getting-started/installation.md#java) !!! note "RTDIP SDK Installation" Ensure you have installed the RTDIP SDK as follows: From ef2d13251e80e00fd5c8ace06895cd69aa87a7ff Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Tue, 15 Aug 2023 09:44:46 +0100 Subject: [PATCH 12/42] Update folder name Signed-off-by: rodalynbarce --- .../README.md | 0 .../pipeline.py | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename pipelines/deploy/{MISO-Daily-Load-Pipeline-Local => MISO-RTDIP-Pipeline-Local}/README.md (100%) rename pipelines/deploy/{MISO-Daily-Load-Pipeline-Local => MISO-RTDIP-Pipeline-Local}/pipeline.py (100%) diff --git a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md similarity index 100% rename from pipelines/deploy/MISO-Daily-Load-Pipeline-Local/README.md rename to pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md diff --git a/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py similarity index 100% rename from pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py rename to pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py From 2e766cfe6698254f22c56ab031785d6cb65794aa Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Tue, 15 Aug 2023 09:50:21 +0100 Subject: [PATCH 13/42] Update link Signed-off-by: rodalynbarce --- pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md index a71d13a..3f4cb00 100644 --- a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md +++ b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md @@ -24,7 +24,7 @@ This pipeline assumes you have followed the installation instructions as specifi ## Example Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table on your machine. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/0434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py:7:" +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py:7:" ``` !!! 
note "Using environments" From dc000c82a92533edc9ad90af726d72b2deabf509 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Tue, 15 Aug 2023 09:52:29 +0100 Subject: [PATCH 14/42] Update link Signed-off-by: rodalynbarce --- pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md index 3f4cb00..71741ba 100644 --- a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md +++ b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md @@ -24,7 +24,7 @@ This pipeline assumes you have followed the installation instructions as specifi ## Example Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table on your machine. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py:7:" +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py:7:" ``` !!! note "Using environments" From 2239ad75664637e2d5aa7d38728be09188a60977 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Tue, 15 Aug 2023 09:55:13 +0100 Subject: [PATCH 15/42] Update link Signed-off-by: rodalynbarce --- pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md index 71741ba..77dd24d 100644 --- a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md +++ b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md @@ -30,5 +30,5 @@ Below is an example of how to set up a pipeline to read daily load data from the !!! note "Using environments" If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: ```python - --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/0434/pipelines/deploy/MISO-Daily-Load-Pipeline-Local/pipeline.py::5" + --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py::5" ``` \ No newline at end of file From eb0c45303f7d582ed72d6a358b114266f2b45753 Mon Sep 17 00:00:00 2001 From: rodalynbarce <121169437+rodalynbarce@users.noreply.github.com> Date: Tue, 15 Aug 2023 12:11:47 +0100 Subject: [PATCH 16/42] Added dagster and query examples (#4) * Added Dagster examples Signed-off-by: rodalynbarce * Updated dagster examples Signed-off-by: rodalynbarce * Added query examples Signed-off-by: rodalynbarce * Updated example links Signed-off-by: rodalynbarce * Update links Signed-off-by: rodalynbarce * Added links Signed-off-by: rodalynbarce * Changed links and removed token. 
Signed-off-by: rodalynbarce * Added tag_name to description Signed-off-by: rodalynbarce --------- Signed-off-by: rodalynbarce --- .../README.md | 66 ++++++++++++++++++ .../pipeline.py | 36 ++++++++++ .../Fledge-Dagster-Pipeline-Local/README.md | 38 ++++++++++ .../Fledge-Dagster-Pipeline-Local/pipeline.py | 69 +++++++++++++++++++ queries/Interpolate/README.md | 34 +++++++++ queries/Interpolate/interpolate.py | 25 +++++++ queries/Interpolation-at-Time/README.md | 28 ++++++++ .../interpolation_at_time.py | 20 ++++++ queries/Metadata/README.md | 22 ++++++ queries/Metadata/metadata.py | 17 +++++ queries/Raw/README.md | 26 +++++++ queries/Raw/raw.py | 21 ++++++ queries/Resample/README.md | 37 ++++++++++ queries/Resample/resample.py | 24 +++++++ queries/Time-Weighted-Average/README.md | 37 ++++++++++ .../time_weighted_average.py | 25 +++++++ 16 files changed, 525 insertions(+) create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md create mode 100644 pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py create mode 100644 queries/Interpolate/README.md create mode 100644 queries/Interpolate/interpolate.py create mode 100644 queries/Interpolation-at-Time/README.md create mode 100644 queries/Interpolation-at-Time/interpolation_at_time.py create mode 100644 queries/Metadata/README.md create mode 100644 queries/Metadata/metadata.py create mode 100644 queries/Raw/README.md create mode 100644 queries/Raw/raw.py create mode 100644 queries/Resample/README.md create mode 100644 queries/Resample/resample.py create mode 100644 queries/Time-Weighted-Average/README.md create mode 100644 queries/Time-Weighted-Average/time_weighted_average.py diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md new file mode 100644 index 0000000..caff39c --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/README.md @@ -0,0 +1,66 @@ +# Fledge Pipeline using Dagster and Databricks Connect + +This article provides a guide on how to deploy a pipeline in dagster using the RTDIP SDK and Databricks Connect. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. + +!!! note "Note" + Reading from Eventhubs is currently not supported on Databricks Connect. + +## Prerequisites +Deployment using Databricks Connect requires: + +* a Databricks workspace + +* a cluster in the same workspace + +* a personal access token + +Further information on Databricks requirements can be found [here](https://docs.databricks.com/en/dev-tools/databricks-connect-ref.html#requirements). + + +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [databricks-connect](https://pypi.org/project/databricks-connect/) + +* [dagster](https://docs.dagster.io/getting-started/install) + + +!!! 
note "Dagster Installation" + For Mac users with an M1 or M2 chip, installation of dagster should be done as follows: + ``` + pip install dagster dagster-webserver --find-links=https://github.com/dagster-io/build-grpcio/wiki/Wheels + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[SparkDeltaSource](../../../../code-reference/pipelines/sources/spark/delta.md)|Read data from a Delta table.| +|[BinaryToStringTransformer](../../../../code-reference/pipelines/transformers/spark/binary_to_string.md)|Converts a Spark DataFrame column from binary to string.| +|[FledgeOPCUAJsonToPCDMTransformer](../../../../code-reference/pipelines/transformers/spark/fledge_opcua_json_to_pcdm.md)|Converts a Spark DataFrame column containing a json string to the Process Control Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Authentication +For Databricks authentication, the following fields should be added to a configuration profile in your [`.databrickscfg`](https://docs.databricks.com/en/dev-tools/auth.html#config-profiles) file: + +``` +[PROFILE] +host = https://{workspace_instance} +token = dapi... +cluster_id = {cluster_id} +``` + +This profile should match the configurations in your `DatabricksSession` in the example below as it will be used by the [Databricks extension](https://docs.databricks.com/en/dev-tools/vscode-ext-ref.html#configure-the-extension) in VS Code for authenticating your Databricks cluster. + +## Example +Below is an example of how to set up a pipeline to read Fledge data from a Delta table, transform it to RTDIP's [PCDM model](../../../../../domains/process_control/data_model.md) and write it to a Delta table. + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py" +``` + +## Deploy +The following command deploys the pipeline to dagster: +`dagster dev -f ` + +Using the link provided from the command above, click on Launchpad and hit run to run the pipeline. 
\ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py new file mode 100644 index 0000000..d03332f --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Databricks/pipeline.py @@ -0,0 +1,36 @@ +from dagster import Definitions, ResourceDefinition, graph, op +from databricks.connect import DatabricksSession +from rtdip_sdk.pipelines.sources.spark.delta import SparkDeltaSource +from rtdip_sdk.pipelines.transformers.spark.binary_to_string import BinaryToStringTransformer +from rtdip_sdk.pipelines.transformers.spark.fledge_opcua_json_to_pcdm import FledgeOPCUAJsonToPCDMTransformer +from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination + +# Databricks cluster configuration +databricks_resource = ResourceDefinition.hardcoded_resource( + DatabricksSession.builder.remote( + host = "https://{workspace_instance_name}", + token = "{token}", + cluster_id = "{cluster_id}" + ).getOrCreate() +) + +# Pipeline +@op(required_resource_keys={"databricks"}) +def pipeline(context): + spark = context.resources.databricks + source = SparkDeltaSource(spark, {}, "{path_to_table}").read_batch() + transformer = BinaryToStringTransformer(source, "{source_column_name}", "{target_column_name}").transform() + transformer = FledgeOPCUAJsonToPCDMTransformer(transformer, "{source_column_name}").transform() + SparkDeltaDestination(transformer, {}, "{path_to_table}").write_batch() + +@graph +def fledge_pipeline(): + pipeline() + +fledge_pipeline_job = fledge_pipeline.to_job( + resource_defs={ + "databricks": databricks_resource + } +) + +defs = Definitions(jobs=[fledge_pipeline_job]) \ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md new file mode 100644 index 0000000..1cbb5a1 --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/README.md @@ -0,0 +1,38 @@ +# Fledge Pipeline using Dagster + +This article provides a guide on how to deploy a pipeline in dagster using the RTDIP SDK. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. + +## Prerequisites +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [dagster](https://docs.dagster.io/getting-started/install) + + +!!! 
note "Dagster Installation" + For Mac users with an M1 or M2 chip, installation of dagster should be done as follows: + ``` + pip install dagster dagster-webserver --find-links=https://github.com/dagster-io/build-grpcio/wiki/Wheels + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[SparkEventhubSource](../../../../code-reference/pipelines/sources/spark/eventhub.md)|Read data from an Eventhub.| +|[BinaryToStringTransformer](../../../../code-reference/pipelines/transformers/spark/binary_to_string.md)|Converts a Spark DataFrame column from binary to string.| +|[FledgeOPCUAJsonToPCDMTransformer](../../../../code-reference/pipelines/transformers/spark/fledge_opcua_json_to_pcdm.md)|Converts a Spark DataFrame column containing a json string to the Process Control Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Example +Below is an example of how to set up a pipeline to read Fledge data from an Eventhub, transform it to RTDIP's [PCDM model](../../../../../domains/process_control/data_model.md) and write it to a Delta table on your machine. + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py" +``` + +## Deploy +The following command deploys the pipeline to dagster: +`dagster dev -f ` + +Using the link provided from the command above, click on Launchpad and hit run to run the pipeline. \ No newline at end of file diff --git a/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py new file mode 100644 index 0000000..ba0cefd --- /dev/null +++ b/pipelines/deploy/Fledge-Dagster-Pipeline-Local/pipeline.py @@ -0,0 +1,69 @@ +import json +from datetime import datetime as dt +from dagster import Definitions, graph, op +from dagster_pyspark.resources import pyspark_resource +from rtdip_sdk.pipelines.sources.spark.eventhub import SparkEventhubSource +from rtdip_sdk.pipelines.transformers.spark.binary_to_string import BinaryToStringTransformer +from rtdip_sdk.pipelines.transformers.spark.fledge_opcua_json_to_pcdm import FledgeOPCUAJsonToPCDMTransformer +from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination + +# PySpark cluster configuration +packages = "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22,io.delta:delta-core_2.12:2.4.0" +my_pyspark_resource = pyspark_resource.configured( + {"spark_conf": {"spark.default.parallelism": 1, + "spark.jars.packages": packages, + "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension", + "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog" + } + } +) + +# EventHub configuration +eventhub_connection_string = "{eventhub_connection_string}" +eventhub_consumer_group = "{eventhub_consumer_group}" + +startOffset = "-1" +endTime = dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ") + +startingEventPosition = { + "offset": startOffset, + "seqNo": -1, + "enqueuedTime": None, + "isInclusive": True +} + +endingEventPosition = { + "offset": None, + "seqNo": -1, + "enqueuedTime": endTime, + "isInclusive": True +} + +ehConf = { +'eventhubs.connectionString' : eventhub_connection_string, +'eventhubs.consumerGroup': eventhub_consumer_group, +'eventhubs.startingPosition' : json.dumps(startingEventPosition), +'eventhubs.endingPosition' : json.dumps(endingEventPosition), +'maxEventsPerTrigger': 1000 +} + +# Pipeline 
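# The op below wires the RTDIP components into a single bounded run: the Event Hub
# read uses the configuration above, whose ending position is pinned to the current
# time and whose "-1" starting offset begins at the earliest retained event, so
# read_batch() returns a finite DataFrame. The binary payload is then converted to a
# JSON string, the Fledge OPC UA JSON is mapped to the Process Control Data Model,
# and the rows are written to the Delta table at {path_to_table}. Values in braces
# are placeholders to replace before running.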
+@op(required_resource_keys={"spark"}) +def pipeline(context): + spark = context.resources.pyspark.spark_session + source = SparkEventhubSource(spark, ehConf).read_batch() + transformer = BinaryToStringTransformer(source, "{source_column_name}", "{target_column_name}").transform() + transformer = FledgeOPCUAJsonToPCDMTransformer(transformer, "{source_column_name}").transform() + SparkDeltaDestination(transformer, {}, "{path_to_table}").write_batch() + +@graph +def fledge_pipeline(): + pipeline() + +fledge_pipeline_job = fledge_pipeline.to_job( + resource_defs={ + "spark": my_pyspark_resource + } +) + +defs = Definitions(jobs=[fledge_pipeline_job]) \ No newline at end of file diff --git a/queries/Interpolate/README.md b/queries/Interpolate/README.md new file mode 100644 index 0000000..057acb2 --- /dev/null +++ b/queries/Interpolate/README.md @@ -0,0 +1,34 @@ +# Interpolate + +[Interpolate](../../code-reference/query/interpolate.md) - takes resampling one step further to estimate the values of unknown data points that fall between existing, known data points. In addition to the resampling parameters, interpolation also requires: + +Interpolation Method - Forward Fill, Backward Fill or Linear + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +region|str|Region| +asset|str|Asset| +data_security_level|str|Level of data security| +data_type|str|Type of the data (float, integer, double, string) +tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +sample_rate|int|(deprecated) Please use time_interval_rate instead. See below.| +sample_unit|str|(deprecated) Please use time_interval_unit instead. 
See below.| +time_interval_rate|str|The time interval rate (numeric input)| +time_interval_unit|str|The time interval unit (second, minute, day, hour)| +agg_method|str|Aggregation Method (first, last, avg, min, max)| +interpolation_method|str|Interpolation method (forward_fill, backward_fill, linear)| +include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Interpolate/interpolate.py" +``` \ No newline at end of file diff --git a/queries/Interpolate/interpolate.py b/queries/Interpolate/interpolate.py new file mode 100644 index 0000000..bad75fb --- /dev/null +++ b/queries/Interpolate/interpolate.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import interpolate + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "agg_method": "first", + "interpolation_method": "forward_fill", + "include_bad_data": True, +} +x = interpolate.get(connection, parameters) +print(x) diff --git a/queries/Interpolation-at-Time/README.md b/queries/Interpolation-at-Time/README.md new file mode 100644 index 0000000..e1a2ded --- /dev/null +++ b/queries/Interpolation-at-Time/README.md @@ -0,0 +1,28 @@ +# Interpolation at Time + +[Interpolation at Time](../../code-reference/query/interpolation_at_time.md) - works out the linear interpolation at a specific time based on the points before and after. This is achieved by providing the following parameter: + +Timestamps - A list of timestamp or timestamps + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|str|List of tagname or tagnames ["tag_1", "tag_2"]| +|timestamps|list|List of timestamp or timestamps in the format YYY-MM-DDTHH:MM:SS or YYY-MM-DDTHH:MM:SS+zz:zz where %z is the timezone. 
(Example +00:00 is the UTC timezone)| +|window_length|int|Add longer window time in days for the start or end of specified date to cater for edge cases.| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Interpolation-at-Time/interpolation_at_time.py" +``` \ No newline at end of file diff --git a/queries/Interpolation-at-Time/interpolation_at_time.py b/queries/Interpolation-at-Time/interpolation_at_time.py new file mode 100644 index 0000000..dd8ec51 --- /dev/null +++ b/queries/Interpolation-at-Time/interpolation_at_time.py @@ -0,0 +1,20 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import interpolation_at_time + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "timestamps": ["2023-01-01", "2023-01-02"], + "window_length": 1, +} +x = interpolation_at_time.get(connection, parameters) +print(x) diff --git a/queries/Metadata/README.md b/queries/Metadata/README.md new file mode 100644 index 0000000..d32ea96 --- /dev/null +++ b/queries/Metadata/README.md @@ -0,0 +1,22 @@ +# Metadata + +[Metadata](../../code-reference/query/metadata.md) queries provide contextual information for time series measurements and include information such as names, descriptions and units of measure. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
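As the parameters below note, `tag_names` is optional for metadata queries. A minimal sketch, reusing the same placeholder connection details as the full example further down, that omits `tag_names` to return metadata for every tag under the asset:

```python
from rtdip_sdk.authentication.azure import DefaultAuth
from rtdip_sdk.connectors import DatabricksSQLConnection
from rtdip_sdk.queries import metadata

# Authenticate and connect, replacing the placeholders with your workspace details
auth = DefaultAuth().authenticate()
token = auth.get_token("{token}").token
connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token)

# Omitting tag_names returns metadata for all tags under the asset
parameters = {
    "business_unit": "{business_unit}",
    "region": "{region}",
    "asset": "{asset_name}",
    "data_security_level": "{security_level}",
}

all_tag_metadata = metadata.get(connection, parameters)
print(all_tag_metadata)
```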
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|tag_names|(optional, list)|Either pass a list of tagname/tagnames ["tag_1", "tag_2"] or leave the list blank [] or leave the parameter out completely| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Metadata/metadata.py" +``` \ No newline at end of file diff --git a/queries/Metadata/metadata.py b/queries/Metadata/metadata.py new file mode 100644 index 0000000..01bceaa --- /dev/null +++ b/queries/Metadata/metadata.py @@ -0,0 +1,17 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import metadata + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], +} +x = metadata.get(connection, parameters) +print(x) diff --git a/queries/Raw/README.md b/queries/Raw/README.md new file mode 100644 index 0000000..0df6883 --- /dev/null +++ b/queries/Raw/README.md @@ -0,0 +1,26 @@ +# Raw + +[Raw](../../code-reference/query/raw.md) facilitates performing raw extracts of time series data, typically filtered by a Tag Name or Device Name and an event time. + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
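The date parameters accept either plain dates or datetimes with an explicit timezone offset, as described below. A minimal sketch, assuming the same placeholder connection details as the full example further down, that narrows a raw extract to a twelve-hour UTC window:

```python
from rtdip_sdk.authentication.azure import DefaultAuth
from rtdip_sdk.connectors import DatabricksSQLConnection
from rtdip_sdk.queries import raw

# Authenticate and connect, replacing the placeholders with your workspace details
auth = DefaultAuth().authenticate()
token = auth.get_token("{token}").token
connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token)

parameters = {
    "business_unit": "{business_unit}",
    "region": "{region}",
    "asset": "{asset_name}",
    "data_security_level": "{security_level}",
    "data_type": "float",
    "tag_names": ["{tag_name_1}"],
    # Datetimes with an offset limit the extract to a twelve-hour UTC window
    "start_date": "2023-01-01T00:00:00+00:00",
    "end_date": "2023-01-01T12:00:00+00:00",
    "include_bad_data": False,
}

raw_df = raw.get(connection, parameters)
print(raw_df)
```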
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Raw/raw.py" +``` \ No newline at end of file diff --git a/queries/Raw/raw.py b/queries/Raw/raw.py new file mode 100644 index 0000000..2fdc7fe --- /dev/null +++ b/queries/Raw/raw.py @@ -0,0 +1,21 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import raw + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "include_bad_data": True, +} +x = raw.get(connection, parameters) +print(x) diff --git a/queries/Resample/README.md b/queries/Resample/README.md new file mode 100644 index 0000000..3d1754b --- /dev/null +++ b/queries/Resample/README.md @@ -0,0 +1,37 @@ +# Resample + +[Resample](../../code-reference/query/resample.md) enables changing the frequency of time series observations. This is achieved by providing the following parameters: + +Sample Rate - (deprecated) +Sample Unit - (deprecated) +Time Interval Rate - The time interval rate +Time Interval Unit - The time interval unit (second, minute, day, hour) +Aggregation Method - Aggregations including first, last, avg, min, max + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. + +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). 
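The time interval and aggregation parameters control the output frequency. As a variation on the full example further down, which takes the first value of every fifteen minutes, this sketch assumes the same placeholder connection details and resamples to hourly averages:

```python
from rtdip_sdk.authentication.azure import DefaultAuth
from rtdip_sdk.connectors import DatabricksSQLConnection
from rtdip_sdk.queries import resample

# Authenticate and connect, replacing the placeholders with your workspace details
auth = DefaultAuth().authenticate()
token = auth.get_token("{token}").token
connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token)

parameters = {
    "business_unit": "{business_unit}",
    "region": "{region}",
    "asset": "{asset_name}",
    "data_security_level": "{security_level}",
    "data_type": "float",
    "tag_names": ["{tag_name_1}", "{tag_name_2}"],
    "start_date": "2023-01-01",
    "end_date": "2023-01-31",
    # One row per tag per hour, averaging the raw values within each hour
    "time_interval_rate": "1",
    "time_interval_unit": "hour",
    "agg_method": "avg",
    "include_bad_data": False,
}

hourly_avg_df = resample.get(connection, parameters)
print(hourly_avg_df)
```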
+ +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit of the data| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a date in the format YY-MM-DD or a datetime in the format YYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|sample_rate|int|(deprecated) Please use time_interval_rate instead. See below.| +|sample_unit|str|(deprecated) Please use time_interval_unit instead. See below.| +|time_interval_rate|str|The time interval rate (numeric input)| +|time_interval_unit|str|The time interval unit (second, minute, day, hour)| +|agg_method|str|Aggregation Method (first, last, avg, min, max)| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Resample/resample.py" +``` \ No newline at end of file diff --git a/queries/Resample/resample.py b/queries/Resample/resample.py new file mode 100644 index 0000000..b326a0f --- /dev/null +++ b/queries/Resample/resample.py @@ -0,0 +1,24 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import resample + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "agg_method": "first", + "include_bad_data": True, +} +x = resample.get(connection, parameters) +print(x) diff --git a/queries/Time-Weighted-Average/README.md b/queries/Time-Weighted-Average/README.md new file mode 100644 index 0000000..a6f5caa --- /dev/null +++ b/queries/Time-Weighted-Average/README.md @@ -0,0 +1,37 @@ +# Time Weighted Average + +[Time Weighted Averages](../../code-reference/query/time-weighted-average.md) provide an unbiased average when working with irregularly sampled data. The RTDIP SDK requires the following parameters to perform time weighted average queries: + +Window Size Mins - (deprecated) +Time Interval Rate - The time interval rate +Time Interval Unit - The time interval unit (second, minute, day, hour) +Window Length - Adds a longer window time for the start or end of specified date to cater for edge cases +Step - Data points with step "enabled" or "disabled". The options for step are "true", "false" or "metadata" as string types. For "metadata", the query requires that the TagName has a step column configured correctly in the meta data table + +## Prerequisites +Ensure you have installed the RTDIP SDK as specified in the [Getting Started](../../../getting-started/installation.md#installing-the-rtdip-sdk) section. 
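Before the connection details, a short plain Python illustration of what time weighting does on irregularly sampled data: each value is weighted by how long it was held, so a value that persisted for most of the window dominates the average in a way a simple mean of the samples would not. The timestamps and values are made up purely for illustration and no RTDIP calls are involved.

```python
from datetime import datetime

# Irregularly sampled points: (timestamp, value); each value is assumed to hold
# until the next sample arrives
samples = [
    (datetime(2023, 1, 1, 0, 0), 10.0),   # held for 50 minutes
    (datetime(2023, 1, 1, 0, 50), 40.0),  # held for 10 minutes
]
window_end = datetime(2023, 1, 1, 1, 0)

# A simple mean ignores how long each value was held
simple_mean = sum(value for _, value in samples) / len(samples)  # 25.0

# A time weighted average weights each value by its holding duration
total_seconds = (window_end - samples[0][0]).total_seconds()
weighted_sum = 0.0
for (time, value), (next_time, _) in zip(samples, samples[1:] + [(window_end, None)]):
    weighted_sum += value * (next_time - time).total_seconds()
time_weighted_average = weighted_sum / total_seconds  # 15.0

print(simple_mean, time_weighted_average)
```

The holding assumption in this sketch relates to the `step` parameter described below.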
+ +This example is using [DefaultAuth()](../../code-reference/authentication/azure.md) and [DatabricksSQLConnection()](../../code-reference/query/db-sql-connector.md) to authenticate and connect. You can find other ways to authenticate here. The alternative built in connection methods are either by [PYODBCSQLConnection()](../../code-reference/query/pyodbc-sql-connector.md), [TURBODBCSQLConnection()](../../code-reference/query/turbodbc-sql-connector.md) or [SparkConnection()](../../code-reference/query/spark-connector.md). + +## Parameters +|Name|Type|Description| +|---|---|---| +|business_unit|str|Business unit| +|region|str|Region| +|asset|str|Asset| +|data_security_level|str|Level of data security| +|data_type|str|Type of the data (float, integer, double, string)| +|tag_names|list|List of tagname or tagnames ["tag_1", "tag_2"]| +|start_date|str|Start date (Either a utc date in the format YYYY-MM-DD or a utc datetime in the format YYYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|end_date|str|End date (Either a utc date in the format YYYY-MM-DD or a utc datetime in the format YYYY-MM-DDTHH:MM:SS or specify the timezone offset in the format YYYY-MM-DDTHH:MM:SS+zz:zz)| +|window_size_mins|int|(deprecated) Window size in minutes. Please use time_interval_rate and time_interval_unit below instead| +|time_interval_rate|str|The time interval rate (numeric input)| +|time_interval_unit|str|The time interval unit (second, minute, day, hour)| +|window_length|int|Add longer window time in days for the start or end of specified date to cater for edge cases| +|include_bad_data|bool|Include "Bad" data points with True or remove "Bad" data points with False| +|step|str|Data points with step "enabled" or "disabled". The options for step are "true", "false" or "metadata". 
"metadata" will retrieve the step value from the metadata table| + +## Example +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/queries/Time-Weighted-Average/time_weighted_average.py" +``` \ No newline at end of file diff --git a/queries/Time-Weighted-Average/time_weighted_average.py b/queries/Time-Weighted-Average/time_weighted_average.py new file mode 100644 index 0000000..10cce1d --- /dev/null +++ b/queries/Time-Weighted-Average/time_weighted_average.py @@ -0,0 +1,25 @@ +from rtdip_sdk.authentication.azure import DefaultAuth +from rtdip_sdk.connectors import DatabricksSQLConnection +from rtdip_sdk.queries import time_weighted_average + +auth = DefaultAuth().authenticate() +token = auth.get_token("{token}").token +connection = DatabricksSQLConnection("{server_hostname}", "{http_path}", token) + +parameters = { + "business_unit": "{business_unit}", + "region": "{region}", + "asset": "{asset_name}", + "data_security_level": "{security_level}", + "data_type": "float", + "tag_names": ["{tag_name_1}", "{tag_name_2}"], + "start_date": "2023-01-01", + "end_date": "2023-01-31", + "time_interval_rate": "15", + "time_interval_unit": "minute", + "window_length": 1, + "include_bad_data": True, + "step": "true" +} +x = time_weighted_average.get(connection, parameters) +print(x) From d35e3a177a48ce4f2401453bf80c2e3621af49a0 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Mon, 21 Aug 2023 15:01:54 +0100 Subject: [PATCH 17/42] Added MISO sample pipeline Signed-off-by: rodalynbarce --- .../MISO-RTDIP-Pipeline-Local/pipeline.py | 98 ------------------- .../README.md | 43 ++++++++ .../deploy.py | 83 ++++++++++++++++ .../maintenance.py | 22 +++++ .../pipeline.py | 44 +++++++++ .../README.md | 6 +- .../pipeline.py | 53 ++++++++++ 7 files changed, 248 insertions(+), 101 deletions(-) delete mode 100644 pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py create mode 100644 pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md create mode 100644 pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py create mode 100644 pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py create mode 100644 pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py rename pipelines/deploy/{MISO-RTDIP-Pipeline-Local => MISODailyLoad-Batch-Pipeline-Local}/README.md (83%) create mode 100644 pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py diff --git a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py b/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py deleted file mode 100644 index 81c1984..0000000 --- a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py +++ /dev/null @@ -1,98 +0,0 @@ -import sys -import os - -os.environ['PYSPARK_PYTHON'] = sys.executable -os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable - -from rtdip_sdk.pipelines.execute import PipelineJob, PipelineStep, PipelineTask, PipelineJobExecute -from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource -from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer -from rtdip_sdk.pipelines.destinations import SparkDeltaDestination - -step_list = [] - -# Read step -step_list.append(PipelineStep( - name="step_1", - description="Read Forecast data from MISO API.", - component=MISODailyLoadISOSource, - component_parameters={"options": { - "load_type": "actual", - "date": "20230520", - }, - }, - provide_output_to_step=["step_2", "step_4"] -)) - -# Transform step - Values -step_list.append(PipelineStep( - name="step_2", - description="Get measurement 
data.", - component=MISOToMDMTransformer, - component_parameters={ - "output_type": "usage", - }, - depends_on_step=["step_1"], - provide_output_to_step=["step_3"] -)) - -# Write step -step_list.append(PipelineStep( - name="step_3", - description="Write measurement data to Delta table.", - component=SparkDeltaDestination, - component_parameters={ - "destination": "MISO_ISO_Usage_Data", - "options": { - "partitionBy":"timestamp" - }, - "mode": "overwrite" - }, - depends_on_step=["step_2"] -)) - -# Transformer step - Meta information -step_list.append(PipelineStep( - name="step_4", - description="Get meta information.", - component=MISOToMDMTransformer, - component_parameters={ - "output_type": "meta", - }, - depends_on_step=["step_1"], - provide_output_to_step=["step_5"] -)) - -# Write step -step_list.append(PipelineStep( - name="step_5", - description="step_5", - component=SparkDeltaDestination, - component_parameters={ - "destination": "MISO_ISO_Meta_Data", - "options": {}, - "mode": "overwrite" - }, - depends_on_step=["step_4"] -)) - -# Tasks contain a list of steps -task = PipelineTask( - name="test_task", - description="test_task", - step_list=step_list, - batch_task=True -) - -# Job containing a list of tasks -pipeline_job = PipelineJob( - name="test_job", - description="test_job", - version="0.0.1", - task_list=[task] -) - -# Execute -pipeline = PipelineJobExecute(pipeline_job) - -result = pipeline.run() \ No newline at end of file diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md new file mode 100644 index 0000000..4f11664 --- /dev/null +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md @@ -0,0 +1,43 @@ +# MISO Pipeline using RTDIP and Databricks +This article provides a guide on how to deploy a MISO pipeline from a local file to a Databricks workflow using the RTDIP SDK and was tested on an M2 Macbook Pro using VS Code in a Conda (3.11) environment. RTDIP Pipeline Components provide Databricks with all the required Python packages and JARs to execute each component, this will automatically be set up during workflow creation. + +## Prerequisites +This pipeline assumes you have a Databricks workspace and have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: + +* [RTDIP SDK](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [Java](../../../../../getting-started/installation.md#java) + +!!! 
note "RTDIP SDK Installation" + Ensure you have installed the RTDIP SDK as follows: + ``` + pip install "rtdip-sdk[pipelines]" + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[MISODailyLoadISOSource](../../../../code-reference/pipelines/sources/spark/iso/miso_daily_load_iso.md)|Read daily load data from MISO API.| +|[MISOToMDMTransformer](../../../../code-reference/pipelines/transformers/spark/iso/miso_to_mdm.md)|Converts MISO Raw data into Meters Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| +|[DatabricksSDKDeploy](../../../../code-reference/pipelines/deploy/databricks.md)|Deploys an RTDIP Pipeline to Databricks Workflows leveraging the Databricks [SDK.](https://docs.databricks.com/dev-tools/sdk-python.html)| +|[DeltaTableOptimizeUtility](../../../../code-reference/pipelines/utilities/spark/delta_table_optimize.md)|[Optimizes](https://docs.delta.io/latest/optimizations-oss.html) a Delta Table| +|[DeltaTableVacuumUtility](../../../../code-reference/pipelines/utilities/spark/delta_table_vacuum.md)|[Vacuums](https://docs.delta.io/latest/delta-utility.html#-delta-vacuum) a Delta Table| + +## Example +Below is an example of how to set up a pipeline job to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table. +```python +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py" +``` + +## Maintenance +The RTDIP SDK can be used to maintain Delta tables in Databricks, an example of how to set up a maintenance job to optimize and vacuum the MISO tables written from the previous example is provided below. +```python +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py" +``` + +## Deploy +Deployment to Databricks uses the Databricks [SDK](https://docs.databricks.com/en/dev-tools/sdk-python.html). Users have the option to control the job's configurations including the cluster and schedule. 
+```python +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py" +``` \ No newline at end of file diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py new file mode 100644 index 0000000..9ecd09b --- /dev/null +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py @@ -0,0 +1,83 @@ +from rtdip_sdk.pipelines.deploy import DatabricksSDKDeploy, CreateJob, JobCluster, ClusterSpec, Task, NotebookTask, AutoScale, RuntimeEngine, DataSecurityMode, CronSchedule, Continuous, PauseStatus +from rtdip_sdk.authentication.azure import DefaultAuth + +def deploy(): + credential = DefaultAuth().authenticate() + access_token = credential.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default").token + + DATABRICKS_WORKSPACE = "{databricks-workspace-url}" + + # Create clusters + cluster_list = [] + cluster_list.append(JobCluster( + job_cluster_key="pipeline-cluster", + new_cluster=ClusterSpec( + node_type_id="Standard_E4ds_v5", + autoscale=AutoScale(min_workers=1, max_workers=8), + spark_version="13.3.x-scala2.12", + data_security_mode=DataSecurityMode.SINGLE_USER, + runtime_engine=RuntimeEngine.STANDARD + ) + )) + + # Create tasks + task_list = [] + task_list.append(Task( + task_key="pipeline", + job_cluster_key="pipeline-cluster", + notebook_task=NotebookTask( + notebook_path="{path/to/pipeline.py}" + ) + )) + + # Create a Databricks Job for the Task + job = CreateJob( + name="rtdip-miso-batch-pipeline-job", + job_clusters=cluster_list, + tasks=task_list, + continuous=Continuous(pause_status=PauseStatus.UNPAUSED) + ) + + # Deploy to Databricks + databricks_pipeline_job = DatabricksSDKDeploy(databricks_job=job, host=DATABRICKS_WORKSPACE, token=access_token, workspace_directory="{path/to/databricks/workspace/directory}") + databricks_pipeline_job.deploy() + + cluster_list = [] + cluster_list.append(JobCluster( + job_cluster_key="maintenance-cluster", + new_cluster=ClusterSpec( + node_type_id="Standard_E4ds_v5", + autoscale=AutoScale(min_workers=1, max_workers=3), + spark_version="13.3.x-scala2.12", + data_security_mode=DataSecurityMode.SINGLE_USER, + runtime_engine=RuntimeEngine.PHOTON + ) + )) + + task_list = [] + task_list.append(Task( + task_key="rtdip-miso-maintenance-task", + job_cluster_key="maintenance-cluster", + notebook_task=NotebookTask( + notebook_path="{path/to/maintenance.py}" + ) + )) + + # Create a Databricks Job for the Task + job = CreateJob( + name="rtdip-miso-maintenance-job", + job_clusters=cluster_list, + tasks=task_list, + schedule=CronSchedule( + quartz_cron_expression="4 * * * * ?", + timezone_id="UTC", + pause_status=PauseStatus.UNPAUSED + ) + ) + + # Deploy to Databricks + databricks_pipeline_job = DatabricksSDKDeploy(databricks_job=job, host=DATABRICKS_WORKSPACE, token=access_token, workspace_directory="{path/to/databricks/workspace/directory}") + databricks_pipeline_job.deploy() + +if __name__ == "__main__": + deploy() \ No newline at end of file diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py new file mode 100644 index 0000000..cfe17aa --- /dev/null +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py @@ -0,0 +1,22 @@ +from rtdip_sdk.pipelines.utilities import DeltaTableOptimizeUtility, DeltaTableVacuumUtility + +def maintenance(): + 
TABLE_NAMES = [ + "{path.to.table.miso_usage_data}", + "{path.to.table.miso_meta_data}" + ] + + for table in TABLE_NAMES: + + DeltaTableOptimizeUtility( + spark=spark, + table_name=table + ).execute() + + DeltaTableVacuumUtility( + spark=spark, + table_name=table + ).execute() + +if __name__ == "__main__": + maintenance() \ No newline at end of file diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py new file mode 100644 index 0000000..75b28d9 --- /dev/null +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py @@ -0,0 +1,44 @@ +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination + +def pipeline(): + source_df = MISODailyLoadISOSource( + spark = spark, + options = { + "load_type": "actual", + "date": "20230520", + } + ).read_batch() + + transform_value_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "usage" + ).transform() + + transform_meta_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "meta" + ).transform() + + SparkDeltaDestination( + data=transform_value_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_usage_data" + ).write_batch() + + SparkDeltaDestination( + data=transform_meta_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_meta_data", + mode="overwrite" + ).write_batch() + +if __name__ == "__main__": + pipeline() \ No newline at end of file diff --git a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md similarity index 83% rename from pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md rename to pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md index 77dd24d..5ad5ad3 100644 --- a/pipelines/deploy/MISO-RTDIP-Pipeline-Local/README.md +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md @@ -1,5 +1,5 @@ # MISO Pipeline using RTDIP -This article provides a guide on how to execute a pipeline using the RTDIP SDK's MISO components. This pipeline was tested on an M2 Macbook Pro using VS Code in a Conda (3.11) environment. +This article provides a guide on how to execute a MISO pipeline using RTDIP. This pipeline was tested on an M2 Macbook Pro using VS Code in a Conda (3.11) environment. ## Prerequisites This pipeline assumes you have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: @@ -22,9 +22,9 @@ This pipeline assumes you have followed the installation instructions as specifi |[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| ## Example -Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table on your machine. +Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py:7:" +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py:6:" ``` !!! 
note "Using environments" diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py new file mode 100644 index 0000000..7ba1ce3 --- /dev/null +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py @@ -0,0 +1,53 @@ +import sys, os + +os.environ['PYSPARK_PYTHON'] = sys.executable +os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable + +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination +from pyspark.sql import SparkSession + +def pipeline(): + spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ + .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ + .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() + + source_df = MISODailyLoadISOSource( + spark = spark, + options = { + "load_type": "actual", + "date": "20230520", + } + ).read_batch() + + transform_value_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "usage" + ).transform() + + transform_meta_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "meta" + ).transform() + + SparkDeltaDestination( + data=transform_value_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_usage_data" + ).write_batch() + + SparkDeltaDestination( + data=transform_meta_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_meta_data" + ).write_batch() + +if __name__ == "__main__": + pipeline() \ No newline at end of file From 72c6908b58bd4cdd2f64826c556b0b37ba934439 Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Mon, 21 Aug 2023 15:13:17 +0100 Subject: [PATCH 18/42] Update link Signed-off-by: rodalynbarce --- pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md index 5ad5ad3..03e42e0 100644 --- a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md @@ -30,5 +30,5 @@ Below is an example of how to set up a pipeline to read daily load data from the !!! 
note "Using environments" If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: ```python - --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISO-RTDIP-Pipeline-Local/pipeline.py::5" + --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py::5" ``` \ No newline at end of file From a3ee41f0bb0e3656eb8becddc592bfe9ad0e702c Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Mon, 4 Sep 2023 13:35:31 +0100 Subject: [PATCH 19/42] Added PJM Signed-off-by: rodalynbarce --- .../README.md | 34 ++++++++++++ .../pipeline.py | 53 +++++++++++++++++++ 2 files changed, 87 insertions(+) create mode 100644 pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md create mode 100644 pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py diff --git a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md new file mode 100644 index 0000000..278a926 --- /dev/null +++ b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md @@ -0,0 +1,34 @@ +# MISO Pipeline using RTDIP +This article provides a guide on how to execute a MISO pipeline using RTDIP. This pipeline was tested on an M2 Macbook Pro using VS Code in a Conda (3.11) environment. + +## Prerequisites +This pipeline assumes you have a valid API key from [PJM](https://apiportal.pjm.com/) and have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: + +* [RTDIP SDK](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) + +* [Java](../../../../../getting-started/installation.md#java) + +!!! note "RTDIP SDK Installation" + Ensure you have installed the RTDIP SDK as follows: + ``` + pip install "rtdip-sdk[pipelines,pyspark]" + ``` + +## Components +|Name|Description| +|---------------------------|----------------------| +|[PJMDailyLoadISOSource](../../../../code-reference/pipelines/sources/spark/iso/pjm_daily_load_iso.md)|Read daily load data from MISO API.| +|[PJMToMDMTransformer](../../../../code-reference/pipelines/transformers/spark/iso/pjm_to_mdm.md)|Converts PJM Raw data into Meters Data Model.| +|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| + +## Example +Below is an example of how to set up a pipeline to read daily load data from the PJM API, transform it into the Meters Data Model and write it to a Delta table. +```python +--8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py:6:" +``` + +!!! 
note "Using environments" + If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: + ```python + --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py::5" + ``` \ No newline at end of file diff --git a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py new file mode 100644 index 0000000..8fb7d91 --- /dev/null +++ b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py @@ -0,0 +1,53 @@ +import sys, os + +os.environ['PYSPARK_PYTHON'] = sys.executable +os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable + +from rtdip_sdk.pipelines.sources import PJMDailyLoadISOSource +from rtdip_sdk.pipelines.transformers import PJMToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination +from pyspark.sql import SparkSession + +def pipeline(): + spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ + .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ + .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() + + source_df = PJMDailyLoadISOSource( + spark = spark, + options = { + "api_key": "{api_key}", + "load_type": "actual" + } + ).read_batch() + + transform_value_df = PJMToMDMTransformer( + spark=spark, + data=source_df, + output_type= "usage" + ).transform() + + transform_meta_df = PJMToMDMTransformer( + spark=spark, + data=source_df, + output_type= "meta" + ).transform() + + SparkDeltaDestination( + data=transform_value_df, + options={ + "partitionBy":"timestamp" + }, + destination="pjm_usage_data" + ).write_batch() + + SparkDeltaDestination( + data=transform_meta_df, + options={ + "partitionBy":"timestamp" + }, + destination="pjm_meta_data" + ).write_batch() + +if __name__ == "__main__": + pipeline() \ No newline at end of file From 8dff1c37b26c35dec80e735b37cd119e321c53bb Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Mon, 4 Sep 2023 13:45:49 +0100 Subject: [PATCH 20/42] Updated doc links Signed-off-by: rodalynbarce --- .../MISODailyLoad-Batch-Pipeline-Local/README.md | 10 +++++----- .../deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md | 10 +++++----- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md index 03e42e0..4464693 100644 --- a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md @@ -4,9 +4,9 @@ This article provides a guide on how to execute a MISO pipeline using RTDIP. Thi ## Prerequisites This pipeline assumes you have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: -* [RTDIP SDK](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) +* [RTDIP SDK](../../../../getting-started/installation.md#installing-the-rtdip-sdk) -* [Java](../../../../../getting-started/installation.md#java) +* [Java](../../../../getting-started/installation.md#java) !!! 
note "RTDIP SDK Installation" Ensure you have installed the RTDIP SDK as follows: @@ -17,9 +17,9 @@ This pipeline assumes you have followed the installation instructions as specifi ## Components |Name|Description| |---------------------------|----------------------| -|[MISODailyLoadISOSource](../../../../code-reference/pipelines/sources/spark/iso/miso_daily_load_iso.md)|Read daily load data from MISO API.| -|[MISOToMDMTransformer](../../../../code-reference/pipelines/transformers/spark/iso/miso_to_mdm.md)|Converts MISO Raw data into Meters Data Model.| -|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| +|[MISODailyLoadISOSource](../../../code-reference/pipelines/sources/spark/iso/miso_daily_load_iso.md)|Read daily load data from MISO API.| +|[MISOToMDMTransformer](../../../code-reference/pipelines/transformers/spark/iso/miso_to_mdm.md)|Converts MISO Raw data into Meters Data Model.| +|[SparkDeltaDestination](../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| ## Example Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table. diff --git a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md index 278a926..c3709be 100644 --- a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md +++ b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md @@ -4,9 +4,9 @@ This article provides a guide on how to execute a MISO pipeline using RTDIP. Thi ## Prerequisites This pipeline assumes you have a valid API key from [PJM](https://apiportal.pjm.com/) and have followed the installation instructions as specified in the Getting Started section. In particular ensure you have installed the following: -* [RTDIP SDK](../../../../../getting-started/installation.md#installing-the-rtdip-sdk) +* [RTDIP SDK](../../../../getting-started/installation.md#installing-the-rtdip-sdk) -* [Java](../../../../../getting-started/installation.md#java) +* [Java](../../../../getting-started/installation.md#java) !!! note "RTDIP SDK Installation" Ensure you have installed the RTDIP SDK as follows: @@ -17,9 +17,9 @@ This pipeline assumes you have a valid API key from [PJM](https://apiportal.pjm. ## Components |Name|Description| |---------------------------|----------------------| -|[PJMDailyLoadISOSource](../../../../code-reference/pipelines/sources/spark/iso/pjm_daily_load_iso.md)|Read daily load data from MISO API.| -|[PJMToMDMTransformer](../../../../code-reference/pipelines/transformers/spark/iso/pjm_to_mdm.md)|Converts PJM Raw data into Meters Data Model.| -|[SparkDeltaDestination](../../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| +|[PJMDailyLoadISOSource](../../../code-reference/pipelines/sources/spark/iso/pjm_daily_load_iso.md)|Read daily load data from MISO API.| +|[PJMToMDMTransformer](../../../code-reference/pipelines/transformers/spark/iso/pjm_to_mdm.md)|Converts PJM Raw data into Meters Data Model.| +|[SparkDeltaDestination](../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to a Delta table.| ## Example Below is an example of how to set up a pipeline to read daily load data from the PJM API, transform it into the Meters Data Model and write it to a Delta table. 
From d2b641be72cffc7685a80ac0b0fff5b56298027f Mon Sep 17 00:00:00 2001 From: rodalynbarce Date: Mon, 4 Sep 2023 13:54:09 +0100 Subject: [PATCH 21/42] Updated pipeline links Signed-off-by: rodalynbarce --- .../MISODailyLoad-Batch-Pipeline-Databricks/README.md | 6 +++--- .../deploy/MISODailyLoad-Batch-Pipeline-Local/README.md | 4 ++-- .../deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md index 4f11664..78ae943 100644 --- a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/README.md @@ -27,17 +27,17 @@ This pipeline assumes you have a Databricks workspace and have followed the inst ## Example Below is an example of how to set up a pipeline job to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py" +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/pipeline.py" ``` ## Maintenance The RTDIP SDK can be used to maintain Delta tables in Databricks, an example of how to set up a maintenance job to optimize and vacuum the MISO tables written from the previous example is provided below. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py" +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/maintenance.py" ``` ## Deploy Deployment to Databricks uses the Databricks [SDK](https://docs.databricks.com/en/dev-tools/sdk-python.html). Users have the option to control the job's configurations including the cluster and schedule. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py" +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Databricks/deploy.py" ``` \ No newline at end of file diff --git a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md index 4464693..bef58ce 100644 --- a/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md +++ b/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/README.md @@ -24,11 +24,11 @@ This pipeline assumes you have followed the installation instructions as specifi ## Example Below is an example of how to set up a pipeline to read daily load data from the MISO API, transform it into the Meters Data Model and write it to a Delta table. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py:6:" +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py:6:" ``` !!! 
note "Using environments" If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: ```python - --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py::5" + --8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/MISODailyLoad-Batch-Pipeline-Local/pipeline.py::5" ``` \ No newline at end of file diff --git a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md index c3709be..d2f500b 100644 --- a/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md +++ b/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/README.md @@ -24,11 +24,11 @@ This pipeline assumes you have a valid API key from [PJM](https://apiportal.pjm. ## Example Below is an example of how to set up a pipeline to read daily load data from the PJM API, transform it into the Meters Data Model and write it to a Delta table. ```python ---8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py:6:" +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py:6:" ``` !!! note "Using environments" If using an environment, include the following lines at the top of your script to prevent a difference in Python versions in worker and driver: ```python - --8<-- "https://raw.githubusercontent.com/rodalynbarce/samples/feature/00434/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py::5" + --8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/PJMDailyLoad-Batch-Pipeline-Local/pipeline.py::5" ``` \ No newline at end of file From 16eea665ee9c530eec422600f1c475e530902bef Mon Sep 17 00:00:00 2001 From: JamesKnBr Date: Tue, 5 Sep 2023 14:25:46 +0100 Subject: [PATCH 22/42] Add EdgeX eventhub to delta Signed-off-by: JamesKnBr --- .../deploy/EdgeX-Eventhub-to-Delta/README.md | 31 +++++++++++++++++++ .../EdgeX-Eventhub-to-Delta/pipeline.py | 31 +++++++++++++++++++ .../deploy/Python-Delta-to-Delta/README.md | 2 +- 3 files changed, 63 insertions(+), 1 deletion(-) create mode 100644 pipelines/deploy/EdgeX-Eventhub-to-Delta/README.md create mode 100644 pipelines/deploy/EdgeX-Eventhub-to-Delta/pipeline.py diff --git a/pipelines/deploy/EdgeX-Eventhub-to-Delta/README.md b/pipelines/deploy/EdgeX-Eventhub-to-Delta/README.md new file mode 100644 index 0000000..7c8b78b --- /dev/null +++ b/pipelines/deploy/EdgeX-Eventhub-to-Delta/README.md @@ -0,0 +1,31 @@ +# EdgeX Eventhub to Delta Pipeline + +This article provides a guide on how to execute a pipeline that batch reads EdgeX data from an Eventhub and writes to a Delta Table locally using the RTDIP SDK. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. 
+ +## Prerequisites +This pipeline job requires the packages: + +* [rtdip-sdk](../../../../getting-started/installation.md#installing-the-rtdip-sdk) + + +## Components +|Name|Description| +|---------------------------|----------------------| +|[SparkEventhubSource](../../../code-reference/pipelines/sources/spark/eventhub.md)|Reads data from an Eventhub.| +|[BinaryToStringTransformer](../../../code-reference/pipelines/transformers/spark/binary_to_string.md)|Transforms Spark DataFrame column to string.| +|[EdgeXOPCUAJsonToPCDMTransformer](../../../code-reference/pipelines/transformers/spark/edgex_opcua_json_to_pcdm.md)|Transforms EdgeX to PCDM.| +|[SparkDeltaDestination](../../../code-reference/pipelines/destinations/spark/delta.md)|Writes to Delta.| + +## Common Errors +|Error|Solution| +|---------------------------|----------------------| +|[com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/apache/spark/ErrorClassesJsonReader]|The Delta version in the Spark Session must be compatible with your local Pyspark version. See [here](https://docs.delta.io/latest/releases.html){ target="_blank" } for version compatibility| + + + +## Example +Below is an example of how to read from and write to Delta Tables locally without the need for Spark + +```python +--8<-- "https://raw.githubusercontent.com/rtdip/samples/main/pipelines/deploy/EdgeX-Eventhub-to-Delta/pipeline.py" +``` \ No newline at end of file diff --git a/pipelines/deploy/EdgeX-Eventhub-to-Delta/pipeline.py b/pipelines/deploy/EdgeX-Eventhub-to-Delta/pipeline.py new file mode 100644 index 0000000..a32c802 --- /dev/null +++ b/pipelines/deploy/EdgeX-Eventhub-to-Delta/pipeline.py @@ -0,0 +1,31 @@ +from rtdip_sdk.pipelines.sources.spark.eventhub import SparkEventhubSource +from rtdip_sdk.pipelines.transformers.spark.binary_to_string import BinaryToStringTransformer +from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination +from rtdip_sdk.pipelines.transformers.spark.edgex_opcua_json_to_pcdm import EdgeXOPCUAJsonToPCDMTransformer +from pyspark.sql import SparkSession +from pyspark.sql.types import * +from pyspark.sql.functions import * +import json + + +def edgeX_eventhub_to_delta(): + + # Spark session setup not required if running in Databricks + spark = (SparkSession.builder.appName("MySparkSession") + .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0,com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22") + .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \ + .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") + .getOrCreate()) + + ehConf = { + "eventhubs.connectionString": "{EventhubConnectionString}", + "eventhubs.consumerGroup": "{EventhubConsumerGroup}", + "eventhubs.startingPosition": json.dumps({"offset": "0", "seqNo": -1, "enqueuedTime": None, "isInclusive": True})} + + source = SparkEventhubSource(spark, ehConf).read_batch() + string_data = BinaryToStringTransformer(source,"body", "body").transform() + PCDM_data = EdgeXOPCUAJsonToPCDMTransformer(string_data,"body").transform() + SparkDeltaDestination(data= PCDM_data, options= {}, destination="{/path/to/destination}").write_batch() + +if __name__ == "__main__": + edgeX_eventhub_to_delta() \ No newline at end of file diff --git a/pipelines/deploy/Python-Delta-to-Delta/README.md b/pipelines/deploy/Python-Delta-to-Delta/README.md index ddc20dd..9dfc253 100644 --- a/pipelines/deploy/Python-Delta-to-Delta/README.md +++ 
b/pipelines/deploy/Python-Delta-to-Delta/README.md @@ -1,4 +1,4 @@ -# Fledge Pipeline using Dagster +# Python Delta Local Pipeline This article provides a guide on how to execute a simple Delta Table copy locally without Spark using the RTDIP SDK. This pipeline was tested on an M2 Macbook Pro using VS Code in a Python (3.10) environment. From 56c15232f44065d9dae391e4541204b80133e3d7 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Wed, 1 Nov 2023 12:14:08 +0000 Subject: [PATCH 23/42] Updated samples and documentation --- .../Spark-Single-Node-Notebook-AWS/Dockerfile | 13 +++ .../MISO_pipeline_sample.py | 51 ++++++++++ .../Spark-Single-Node-Notebook-AWS/README.md | 24 ++--- .../environment.yml | 93 +++++++++++++------ .../run_conda_installer.sh | 62 ++++++++----- .../run_in_docker.sh | 7 ++ 6 files changed, 186 insertions(+), 64 deletions(-) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile new file mode 100644 index 0000000..ba06998 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile @@ -0,0 +1,13 @@ +FROM ubuntu:22.04 +RUN apt-get update && apt-get upgrade -y +RUN apt-get install curl -y +RUN apt-get install unzip -y +RUN apt-get autoclean -y +RUN apt-get autoremove -y +RUN useradd -rm -d /home/rtdip -s /bin/bash -g root -G sudo -u 1001 rtdip +COPY run_conda_installer.sh /home/rtdip/ +RUN mkdir -p /home/rtdip/apps/lfenergy +COPY environment.yml /home/rtdip/apps/lfenergy/ +RUN chmod +x /home/rtdip/run_conda_installer.sh +WORKDIR /home/rtdip +ENTRYPOINT ["/home/rtdip/run_conda_installer.sh"] diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py new file mode 100644 index 0000000..d62f638 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python +# coding: utf-8 + +# In[1]: + + +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination +from pyspark.sql import SparkSession + +spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ + .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ + .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() + +source_df = MISODailyLoadISOSource( + spark = spark, + options = { + "load_type": "actual", + "date": "20230520", + } +).read_batch() + +transform_value_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "usage" +).transform() + +transform_meta_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "meta" +).transform() + +SparkDeltaDestination( + data=transform_value_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_usage_data" +).write_batch() + +SparkDeltaDestination( + data=transform_meta_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_meta_data" +).write_batch() + diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md 
b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index 96ce447..b625315 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -1,22 +1,22 @@ # Spark Single Node Notebook AWS -This article provides a guide how to create a conda based self contained environment to run RTDIP that integrates the following components: -* Java and Spark (Single node configuration). Currently v3.3.2 Spark has been configured -* AWS Libraries for Spark/Hadoop v3.3.2 +This article provides a guide how to create a conda based selfcontained environment to run RTDIP that integrates the following components: +* Java and Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. +* AWS Libraries for Spark/Hadoop * Jupyter Notebook -* RTDIP (v0.6.1) - -The components of this environment are pinned to specific versions. +The components of this environment are all pinned to a specific source distribution of RTDIP. ## Prerequisites -The prerequisites for running the environment are: -* run_conda_installer.sh: An x86 Linux environment with enough free space (Tested on Linux Ubuntu 22.04. A clean environment is preferred) -* the installer will run Jupyter notebook on port 8080. Check that this port is free or change the configuration in the installer. +* run_in_docker: Docker desktop or another local Docker environment (e.g. Ubuntu Docker) +* run_conda_installer.sh: Tested on an x86 Environment. +* For AWS access (e.g. S3) the required permissions available in the environment are required at runtime. + +After the installer completes, Jupyter notebook will be running on port 8080. Please check that this port is free or change the configuration in the installer if required. # Deploy and Running -Run *run_conda_installer.sh*. After the installer completes: -* A new file *conda_environment_rtdip-sdk.sh* is created. Please use this file (e.g. *source ./conda_environment_rtdip-sdk.sh*) to activate the conda environment. -* On http://host:8080/ where host is the machine where the installer was run, a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines. +Run *run_in_docker.sh*. After the installer completes: +* Inside the container a new file *conda_environment_rtdip-sdk.sh* will be created. If required please use this file (e.g. *source ./conda_environment_rtdip-sdk.sh*) to activate the conda environment within the container. +* On http://localhost:8080/ where host is the machine where the installer was run, a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines. 
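The README above notes that AWS access (for example to S3) relies on permissions being available in the environment at runtime. A minimal notebook cell to exercise that path is sketched below, assuming the Hadoop/AWS jars set up by `run_conda_installer.sh` are on the Spark classpath and credentials can be resolved (for instance via an instance profile or `AWS_*` environment variables); the bucket and object key are placeholders:

```python
from pyspark.sql import SparkSession

# Single-node session; the AWS/Hadoop jars installed by run_conda_installer.sh
# are assumed to already be on the classpath.
spark = SparkSession.builder.appName("s3-access-check").getOrCreate()

# Placeholder path - replace with an object your credentials can read.
df = spark.read.option("header", "true").csv("s3a://{bucket}/{path/to/file.csv}")
df.show(5)
```

If this read fails with a class-not-found error for the `s3a` filesystem, the AWS jars are not on the classpath and the installer output is the first place to check.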
diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml index d2b9acc..82f7625 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/environment.yml @@ -1,30 +1,63 @@ -name: rtdip-sdk -channels: - - conda-forge - - defaults - -dependencies: - - notebook=6.4.12 - - python==3.10.9 - - azure-storage-file-datalake==12.10.1 - - azure-keyvault-secrets==4.7.0 - - azure-identity==1.12.0 - - pyodbc==4.0.39 - - pandas==1.5.2 - - jinja2==3.0.3 - - jinjasql==0.1.8 - - pyspark==3.3.2 - - delta-spark==2.3.0 - - dependency_injector==4.41.0 - - pydantic==1.10.7 - - boto3==1.26.123 - - semver==3.0.0 - - xlrd==2.0.1 - - pip==23.1.2 - - awscli==1.27.142 - - filelock==3.12.2 - - web3==6.5.0 - - pip: - - hvac==1.1.0 - - rtdip-sdk==0.6.1 - +# Pinned 09/20/2023 +name: lfenergy +channels: + - conda-forge + - defaults +dependencies: + - python>=3.8,<3.12 + # - mkdocs-material==9.1.21 + - mkdocs-material-extensions==1.1.1 + - jinja2==3.1.2 + - pytest==7.4.0 + - pytest-mock==3.11.1 + - pytest-cov==4.1.0 + - pylint==2.17.4 + - pip==23.1.2 + - turbodbc==4.5.10 + - numpy>=1.23.4 + - pandas>=1.5.2,<3.0.0 + - oauthlib>=3.2.2 + - cryptography>=38.0.3 + - azure-identity==1.12.0 + - azure-storage-file-datalake==12.12.0 + - azure-keyvault-secrets==4.7.0 + - boto3==1.28.2 + - pyodbc==4.0.39 + - fastapi==0.100.1 + - httpx==0.24.1 + - trio==0.22.1 + - pyspark==3.4.1 + - delta-spark>=2.2.0,<3.1.0 + - grpcio>=1.48.1 + - grpcio-status>=1.48.1 + - googleapis-common-protos>=1.56.4 + - openai==0.27.8 + - mkdocstrings==0.22.0 + - mkdocstrings-python==1.4.0 + - mkdocs-macros-plugin==1.0.1 + - pygments==2.16.1 + - pymdown-extensions==10.1.0 + - databricks-sql-connector==2.9.3 + - databricks-sdk==0.6.0 + - semver==3.0.0 + - xlrd==2.0.1 + - pygithub==1.59.0 + - strawberry-graphql[fastapi,pydantic]==0.194.4 + - web3==6.5.0 + - twine==4.0.2 + - delta-sharing-python==0.7.4 + - polars==0.18.8 + - moto[s3]==4.1.14 + - xarray>=2023.1.0,<2023.8.0 + - ecmwf-api-client==1.6.3 + - netCDF4==1.6.4 + - black==23.7.0 + - pip: + - dependency-injector==4.41.0 + - azure-functions==1.15.0 + - nest_asyncio==1.5.6 + - hvac==1.1.1 + - langchain==0.0.291 + - build==0.10.0 + - deltalake==0.10.1 + - mkdocs-material==9.2.0b3 diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 08a5f00..f2abe4f 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -10,7 +10,7 @@ DEPLOYER_TMP_DIR=$(echo ${TMPDIR:-/tmp}"/DEPLOYER") MINICONDA_NAME=miniconda MINICONDA_PATH=$HOME/$MINICONDA_NAME/ PATH=$MINICONDA_PATH/bin:$PATH -CONDA_ENV="rtdip-sdk" +CONDA_ENV="lfenergy" CONDA_ENV_HOME=$(pwd)/apps/$CONDA_ENV mkdir -p $CONDA_ENV_HOME CWD=$(pwd) @@ -40,14 +40,31 @@ conda install -n base conda-libmamba-solver -y echo "Setting Solver to libmama" conda config --set solver libmamba +# RTDIP +export RTDIP_FILE_NAME="InnowattsRelease.zip" +export RTDIP_DOWNLOAD_URL="https://github.com/vbayon/core/archive/refs/heads/$RTDIP_FILE_NAME" +export RTDIP_DIR="core-InnowattsRelease" +echo "Installing RTDIP ***********************************" +rm -rf ./$RTDIP_DIR +rm -rf ./api +rm -rf ./sdk +curl -L -o $RTDIP_FILE_NAME $RTDIP_DOWNLOAD_URL +unzip -o ./$RTDIP_FILE_NAME > /dev/null +cp -r ./$RTDIP_DIR/src/sdk/python/* . 
+rm ./$RTDIP_FILE_NAME -echo "Creating Conda Environment" -conda env create -f environment.yml -y + +echo "Creating the environment with [CONDA]: CONDA LIBMAMBA SOLVER" +## Copying the env file +rm ./environment.yml +cp ./$RTDIP_DIR/environment.yml ./ +find ./environment.yml -type f -exec sed -i 's/rtdip-sdk/lfenergy/g' {} \; +conda env create -f environment.yml # # JDK -echo "Installing JDK jdk-17.0.2 ***********************************" +echo "JDK jdk-17.0.2 ***********************************" export JAVA_VERSION="jdk-17.0.2" export JDK_FILE_NAME="openjdk-17.0.2_linux-x64_bin.tar.gz" export JDK_DOWNLOAD_URL="https://download.java.net/java/GA/jdk17.0.2/dfd4a8d0985749f896bed50d7138ee7f/8/GPL/$JDK_FILE_NAME" @@ -56,7 +73,6 @@ if [ -f "$CONDA_ENV/$JDK_FILE_NAME" ]; then echo "$CONDA_ENV/$JDK_FILE_NAME Exists" echo "Removing JDK: $JDK_FILE_NAME" rm -rf $CONDA_ENV/$JAVA_VERSION - # rm $CONDA_ENV/$JDK_FILE_NAME unlink $HOME/JDK fi @@ -73,19 +89,17 @@ ln -s $CONDA_ENV_HOME/$JAVA_VERSION $HOME/JDK export JAVA_HOME=$HOME/JDK export PATH=$HOME/JDK/bin:$PATH -# SPARK 3.3.2 -echo "Installing SPARK 3.3.2 ***********************************" -export SPARK_VERSION="spark-3.3.2-bin-hadoop3" -export SPARK_FILE_NAME="spark-3.3.2-bin-hadoop3.tgz" -export SPARK_DOWNLOAD_URL="https://archive.apache.org/dist/spark/spark-3.3.2/$SPARK_FILE_NAME" -export PYSPARK_VERSION="3.3.2" +# SPARK 3.4.1 +echo "Installing SPARK 3.4.1***********************************" +export SPARK_VERSION="spark-3.4.1-bin-hadoop3" +export SPARK_FILE_NAME="spark-3.4.1-bin-hadoop3.tgz" +export SPARK_DOWNLOAD_URL="https://dlcdn.apache.org/spark/spark-3.4.1/$SPARK_FILE_NAME" if [ -f "$CONDA_ENV/$SPARK_VERSION" ]; then echo "$CONDA_ENV/$SPARK_FILE_NAME Exists" echo "Removing Spark: $SPARK_FILE_NAME" rm -rf $CONDA_ENV/$SPARK_VERSION - # rm $CONDA_ENV/$SPARK_FILE_NAME unlink $HOME/SPARK fi @@ -177,26 +191,29 @@ mv $SCALA_JAVA8_COMPAT_JAR_FILE_NAME $SPARK_HOME/jars curl -o $PROTON_J_JAR_FILE_NAME $PROTON_J_JAR_DOWNLOAD_URL mv $PROTON_J_JAR_FILE_NAME $SPARK_HOME/jars -# Cleaning up -rm $SPARK_FILE_NAME -rm $JDK_FILE_NAME - -# echo "Finished INSTALLING $JAVA_VERSION and $SPARK_VERSION and Extra Libraries" + eval "$(conda shell.bash hook)" conda config --set default_threads 4 conda env list -# Uncoment the line below to avoid error: CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. +# Load by default conda environment vars when running in container +# To avoid error: CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. # source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh - +## +conda install -y conda-build conda activate $CONDA_ENV +# Adding source code to the lib path +conda develop $CONDA_ENV_HOME + conda info +echo "Finished Installing Conda [Mamba] Env $CONDA_ENV" end_time=`date +%s` runtime=$((end_time-start_time)) echo "Total Installation Runtime: $runtime [seconds]" -# Creating env file +echo "Test environment not intended for using in production. 
Backup any changes made to this environment" +# CONDA_ENVIRONMENT_FILE_NAME="conda_environment_$CONDA_ENV.sh" echo "#!/usr/bin/env bash" > $CONDA_ENVIRONMENT_FILE_NAME echo "export PATH=$PATH" >> $CONDA_ENVIRONMENT_FILE_NAME @@ -205,7 +222,8 @@ echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" -if [ -z ${NOTEBOOK_PORT+x} ]; then NOTEBOOK_PORT="8080"; else echo "NOTEBOOK_PORT: $NOTEBOOK_PORT"; fi echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" -source ./$CONDA_ENVIRONMENT_FILE_NAME +# Install and Run Notebook +conda install -y notebook=6.5.4 +export NOTEBOOK_PORT="8080" jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh new file mode 100644 index 0000000..4073989 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +docker container stop rtdip +docker container rm rtdip +docker system prune -a -f +docker image rm "rtdip:Dockerfile" +docker build -t "rtdip:Dockerfile" . +docker run --name rtdip --publish 8080:8080 "rtdip:Dockerfile" From a6bb8bc66d7905f710436d6ef58a211821df791b Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Wed, 1 Nov 2023 13:42:45 +0000 Subject: [PATCH 24/42] Edited documentation --- .../deploy/Spark-Single-Node-Notebook-AWS/README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index b625315..16f652c 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -1,9 +1,9 @@ # Spark Single Node Notebook AWS -This article provides a guide how to create a conda based selfcontained environment to run RTDIP that integrates the following components: -* Java and Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. -* AWS Libraries for Spark/Hadoop -* Jupyter Notebook +This article provides a guide on how to create a conda based self-contained environment to run RTDIP that integrates the following components: +* Java JDK and Apache Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. +* AWS Libraries. +* Jupyter Notebook. The components of this environment are all pinned to a specific source distribution of RTDIP. @@ -13,7 +13,8 @@ The components of this environment are all pinned to a specific source distribut * run_conda_installer.sh: Tested on an x86 Environment. * For AWS access (e.g. S3) the required permissions available in the environment are required at runtime. -After the installer completes, Jupyter notebook will be running on port 8080. Please check that this port is free or change the configuration in the installer if required. +When the installation completes, a Jupyter notebook will be running on port 8080. +Please check that this port is available or change the configuration in the installer if required. # Deploy and Running Run *run_in_docker.sh*. 
After the installer completes: From 5be9a8ace64e72477f47504832f7d882a4dbd18a Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 10:29:03 +0000 Subject: [PATCH 25/42] Updated --- pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile | 1 + pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile index ba06998..b2b3263 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile @@ -8,6 +8,7 @@ RUN useradd -rm -d /home/rtdip -s /bin/bash -g root -G sudo -u 1001 rtdip COPY run_conda_installer.sh /home/rtdip/ RUN mkdir -p /home/rtdip/apps/lfenergy COPY environment.yml /home/rtdip/apps/lfenergy/ +COPY MISO_pipeline_sample.py /home/rtdip/apps/lfenergy/ RUN chmod +x /home/rtdip/run_conda_installer.sh WORKDIR /home/rtdip ENTRYPOINT ["/home/rtdip/run_conda_installer.sh"] diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index 16f652c..fc74682 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -18,6 +18,7 @@ Please check that this port is available or change the configuration in the inst # Deploy and Running Run *run_in_docker.sh*. After the installer completes: -* Inside the container a new file *conda_environment_rtdip-sdk.sh* will be created. If required please use this file (e.g. *source ./conda_environment_rtdip-sdk.sh*) to activate the conda environment within the container. +* Inside the container a new file *conda_environment_rtdip-sdk.sh* will be created. If required please use this file (e.g. *source ./conda_environment_lfenergy.sh*) to activate the conda environment within the container. * On http://localhost:8080/ where host is the machine where the installer was run, a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines. +* To test the envinroment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run it. From ea8bd2969fc1c7a27a2e90fdc559365f8b43a851 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 10:52:16 +0000 Subject: [PATCH 26/42] Updated --- pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index fc74682..9544884 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -19,6 +19,6 @@ Please check that this port is available or change the configuration in the inst # Deploy and Running Run *run_in_docker.sh*. After the installer completes: * Inside the container a new file *conda_environment_rtdip-sdk.sh* will be created. If required please use this file (e.g. *source ./conda_environment_lfenergy.sh*) to activate the conda environment within the container. -* On http://localhost:8080/ where host is the machine where the installer was run, a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines. -* To test the envinroment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run it. 
+* At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines (see below). +* To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run it. This pipeline queries MISO and saves the results locally under a newly created directory called spark-warehouse. From a55f7b5addb8ae1d357baa5abe5da1b0f474b471 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 11:14:47 +0000 Subject: [PATCH 27/42] Documentation update --- .../Spark-Single-Node-Notebook-AWS/README.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index 9544884..a3e1856 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -1,24 +1,23 @@ # Spark Single Node Notebook AWS -This article provides a guide on how to create a conda based self-contained environment to run RTDIP that integrates the following components: +This article provides a guide on how to create a conda based self-contained environment to run LFEnergy RTDIP that integrates the following components: * Java JDK and Apache Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. -* AWS Libraries. -* Jupyter Notebook. +* AWS Libraries (e.g for accessing files in S3). +* Jupyter Notebook server. -The components of this environment are all pinned to a specific source distribution of RTDIP. +The components of this environment are all pinned to a specific source distribution of RTDIP and have been tested in x86 Windows and Linux environments. ## Prerequisites - -* run_in_docker: Docker desktop or another local Docker environment (e.g. Ubuntu Docker) -* run_conda_installer.sh: Tested on an x86 Environment. -* For AWS access (e.g. S3) the required permissions available in the environment are required at runtime. +* Docker desktop or another local Docker environment (e.g. Ubuntu Docker). +* gitbash environment for Windows environments. When the installation completes, a Jupyter notebook will be running on port 8080. Please check that this port is available or change the configuration in the installer if required. # Deploy and Running Run *run_in_docker.sh*. After the installer completes: -* Inside the container a new file *conda_environment_rtdip-sdk.sh* will be created. If required please use this file (e.g. *source ./conda_environment_lfenergy.sh*) to activate the conda environment within the container. * At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines (see below). * To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run it. This pipeline queries MISO and saves the results locally under a newly created directory called spark-warehouse. +* For debugging and running from inside the container new RTDIP pipeplines, a new file *conda_environment_rtdip-sdk.sh* is created. Please use this environment to activate +the LFEnergy RTDIP environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container. 
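Once MISO_pipeline_sample.py has completed, the Delta tables it writes can be read back from a second notebook cell to confirm the environment end to end. A minimal sketch, assuming the sample ran in the same working directory so that the managed tables sit under the default `spark-warehouse` folder:

```python
from pyspark.sql import SparkSession

# Same Delta-enabled session configuration as MISO_pipeline_sample.py.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The sample writes managed Delta tables, which land under ./spark-warehouse;
# reading them back by path avoids any dependency on the session catalog.
usage_df = spark.read.format("delta").load("spark-warehouse/miso_usage_data")
meta_df = spark.read.format("delta").load("spark-warehouse/miso_meta_data")

usage_df.show(5)
print("usage rows:", usage_df.count(), "| meta rows:", meta_df.count())
```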
From d7a51ca3d41a51bdf23d810954200bb5f21f21f2 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 11:26:31 +0000 Subject: [PATCH 28/42] Documentation updated --- .../Spark-Single-Node-Notebook-AWS/README.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index a3e1856..50dab38 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -2,22 +2,21 @@ This article provides a guide on how to create a conda based self-contained environment to run LFEnergy RTDIP that integrates the following components: * Java JDK and Apache Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. -* AWS Libraries (e.g for accessing files in S3). +* AWS Libraries (e.g for accessing files in AWS S3 if required). * Jupyter Notebook server. -The components of this environment are all pinned to a specific source distribution of RTDIP and have been tested in x86 Windows and Linux environments. +The components of this environment are all pinned to a specific source distribution of RTDIP and have been tested in x86 Windows (using gitbash) and Linux environments. ## Prerequisites * Docker desktop or another local Docker environment (e.g. Ubuntu Docker). * gitbash environment for Windows environments. -When the installation completes, a Jupyter notebook will be running on port 8080. +When the installation completes, a Jupyter notebook will be running locally on port 8080. Please check that this port is available or change the configuration in the installer if required. -# Deploy and Running +# Deploy Run *run_in_docker.sh*. After the installer completes: -* At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example RTDIP pipelines (see below). -* To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run it. This pipeline queries MISO and saves the results locally under a newly created directory called spark-warehouse. -* For debugging and running from inside the container new RTDIP pipeplines, a new file *conda_environment_rtdip-sdk.sh* is created. Please use this environment to activate -the LFEnergy RTDIP environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container. +* At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example new RTDIP pipelines. +* To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run the notebook. This pipeline queries MISO (Midcontinent Independent System Operator) and saves the results of the query locally under a newly created directory called spark-warehouse. +* For debugging purposes and running from inside the container other RTDIP pipeplines, a new file *conda_environment_rtdip-sdk.sh* is created. Please use this file to activate the conda environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container. 
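The pinned environment.yml also includes the `deltalake` package, so the same output can be inspected without starting a Spark session at all, which complements the Spark-based check sketched earlier. A small sketch, again assuming the MISO sample has already been run from the notebook's working directory:

```python
from deltalake import DeltaTable

# deltalake (delta-rs) reads the Delta transaction log directly, so no JVM or
# Spark session is needed for a quick look at the sample's output.
usage = DeltaTable("spark-warehouse/miso_usage_data").to_pandas()
meta = DeltaTable("spark-warehouse/miso_meta_data").to_pandas()

print(usage.head())
print(f"{len(usage)} usage rows, {len(meta)} meta rows")
```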
From a594389a5a383fbd745889fbef22a62717e58d3d Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 11:40:35 +0000 Subject: [PATCH 29/42] Documentation updated --- pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index 50dab38..bb744f5 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -18,5 +18,5 @@ Please check that this port is available or change the configuration in the inst Run *run_in_docker.sh*. After the installer completes: * At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example new RTDIP pipelines. * To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run the notebook. This pipeline queries MISO (Midcontinent Independent System Operator) and saves the results of the query locally under a newly created directory called spark-warehouse. -* For debugging purposes and running from inside the container other RTDIP pipeplines, a new file *conda_environment_rtdip-sdk.sh* is created. Please use this file to activate the conda environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container. +* For debugging purposes and running from inside the container other RTDIP pipeplines, a new file *conda_environment_lfenergy.sh* is created. Please use this file to activate the conda environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container. From ad5a82afd2172d1512fa47322007402ae3314b93 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Fri, 3 Nov 2023 16:20:57 +0000 Subject: [PATCH 30/42] Added Dagster Installers --- .../dagster.yaml | 63 +++++++++++++++++++ .../run_conda_installer.sh | 12 +++- .../run_in_docker.sh | 2 +- 3 files changed, 75 insertions(+), 2 deletions(-) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml new file mode 100644 index 0000000..197745a --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml @@ -0,0 +1,63 @@ +scheduler: + module: dagster.core.scheduler + class: DagsterDaemonScheduler + +run_coordinator: + module: dagster.core.run_coordinator + class: QueuedRunCoordinator + +run_launcher: + module: dagster_docker + class: DockerRunLauncher + config: + env_vars: + - DAGSTER_POSTGRES_USER + - DAGSTER_POSTGRES_PASSWORD + - DAGSTER_POSTGRES_DB + network: dagster_network + container_kwargs: + volumes: # Make docker client accessible to any launched containers as well + - /var/run/docker.sock:/var/run/docker.sock + - /tmp/io_manager_storage:/tmp/io_manager_storage + +run_storage: + module: dagster_postgres.run_storage + class: PostgresRunStorage + config: + postgres_db: + hostname: docker_example_postgresql + username: + env: DAGSTER_POSTGRES_USER + password: + env: DAGSTER_POSTGRES_PASSWORD + db_name: + env: DAGSTER_POSTGRES_DB + port: 5432 + +schedule_storage: + module: dagster_postgres.schedule_storage + class: PostgresScheduleStorage + config: + postgres_db: + hostname: docker_example_postgresql + username: + env: DAGSTER_POSTGRES_USER + password: + env: DAGSTER_POSTGRES_PASSWORD + db_name: + env: DAGSTER_POSTGRES_DB + port: 5432 + +event_log_storage: + 
module: dagster_postgres.event_log + class: PostgresEventLogStorage + config: + postgres_db: + hostname: docker_example_postgresql + username: + env: DAGSTER_POSTGRES_USER + password: + env: DAGSTER_POSTGRES_PASSWORD + db_name: + env: DAGSTER_POSTGRES_DB + port: 5432 \ No newline at end of file diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index f2abe4f..07572c7 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -226,4 +226,14 @@ echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" # Install and Run Notebook conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" -jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root +echo "Going to install dagster" +conda install -y dagster=1.5.6 +echo "Going to install dagster-webserver" +yes | pip install dagster-webserver==1.5.6 +echo "Going to run Jupyter" +jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root & +echo "Going to run dagster dev" +dagster dev & +date +echo "Running...." +sleep infinity diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh index 4073989..6fbf2d8 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_in_docker.sh @@ -4,4 +4,4 @@ docker container rm rtdip docker system prune -a -f docker image rm "rtdip:Dockerfile" docker build -t "rtdip:Dockerfile" . -docker run --name rtdip --publish 8080:8080 "rtdip:Dockerfile" +docker run --name rtdip --publish 8080:8080 --publish 3000:3000 "rtdip:Dockerfile" From 8ae1bc33a410acc9401508f1ef09b09aa4dae307 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 4 Nov 2023 08:08:37 +0000 Subject: [PATCH 31/42] Updated --- .../Spark-Single-Node-Notebook-AWS/Dockerfile | 2 +- .../MISO_pipeline_sample_dagster.py | 52 +++++++++++++++++++ .../run_conda_installer.sh | 9 ++-- 3 files changed, 58 insertions(+), 5 deletions(-) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile index b2b3263..8748115 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile @@ -8,7 +8,7 @@ RUN useradd -rm -d /home/rtdip -s /bin/bash -g root -G sudo -u 1001 rtdip COPY run_conda_installer.sh /home/rtdip/ RUN mkdir -p /home/rtdip/apps/lfenergy COPY environment.yml /home/rtdip/apps/lfenergy/ -COPY MISO_pipeline_sample.py /home/rtdip/apps/lfenergy/ +COPY MISO_pipeline_sample*.* /home/rtdip/apps/lfenergy/ RUN chmod +x /home/rtdip/run_conda_installer.sh WORKDIR /home/rtdip ENTRYPOINT ["/home/rtdip/run_conda_installer.sh"] diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py new file mode 100644 index 0000000..26916c6 --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py @@ -0,0 +1,52 @@ + + +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from 
rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.destinations import SparkDeltaDestination +from pyspark.sql import SparkSession + + +from dagster import asset + +@asset # add the asset decorator to tell Dagster this is an asset +def run_miso_ingest(): + spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ + .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ + .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() + + source_df = MISODailyLoadISOSource( + spark = spark, + options = { + "load_type": "actual", + "date": "20230520", + } + ).read_batch() + + transform_value_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "usage" + ).transform() + + transform_meta_df = MISOToMDMTransformer( + spark=spark, + data=source_df, + output_type= "meta" + ).transform() + + SparkDeltaDestination( + data=transform_value_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_usage_data" + ).write_batch() + + SparkDeltaDestination( + data=transform_meta_df, + options={ + "partitionBy":"timestamp" + }, + destination="miso_meta_data" + ).write_batch() + diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 07572c7..179068a 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -226,14 +226,15 @@ echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" # Install and Run Notebook conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" +export DAGSTER_PORT="3000" +export DAGSTER_HOST="0.0.0.0" +# Install and run Dagster echo "Going to install dagster" conda install -y dagster=1.5.6 echo "Going to install dagster-webserver" yes | pip install dagster-webserver==1.5.6 echo "Going to run Jupyter" jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root & -echo "Going to run dagster dev" -dagster dev & -date -echo "Running...." 
+echo "Going to run Dagster dev" +dagster dev -h $DAGSTER_HOST -p $DAGSTER_PORT-f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py sleep infinity From a452b742fa27eb658acff45a7d3d42eae9aa7d60 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 4 Nov 2023 08:52:02 +0000 Subject: [PATCH 32/42] Updated --- .../run_conda_installer.sh | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 179068a..1e665c2 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -227,14 +227,14 @@ echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" export DAGSTER_PORT="3000" -export DAGSTER_HOST="0.0.0.0" +export HOST="0.0.0.0" # Install and run Dagster echo "Going to install dagster" conda install -y dagster=1.5.6 echo "Going to install dagster-webserver" yes | pip install dagster-webserver==1.5.6 -echo "Going to run Jupyter" -jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root & -echo "Going to run Dagster dev" -dagster dev -h $DAGSTER_HOST -p $DAGSTER_PORT-f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py +echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" +jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & +echo "Going to run Dagster dev on host:$HOST/port:$DAGSTER_PORT " +dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py sleep infinity From e4189f63973a32865356625f9b582ce2349dc6c2 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sun, 5 Nov 2023 06:09:28 +0000 Subject: [PATCH 33/42] Updated --- .../MISO_pipeline_sample.py | 52 ++++++++------- .../MISO_pipeline_sample_dagster.py | 63 ++++++++++--------- 2 files changed, 62 insertions(+), 53 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py index d62f638..66d0835 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py @@ -9,43 +9,47 @@ from rtdip_sdk.pipelines.destinations import SparkDeltaDestination from pyspark.sql import SparkSession -spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ - .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ - .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() +import shutil + +# First Clear local files +shutil.rmtree("spark-warehouse") + +spark = ( + SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") + .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") + .config( + "spark.sql.catalog.spark_catalog", + "org.apache.spark.sql.delta.catalog.DeltaCatalog", + ) + .getOrCreate() +) source_df = MISODailyLoadISOSource( - spark = spark, - options = { - "load_type": "actual", - "date": "20230520", - } + spark=spark, + options={ + "load_type": "actual", + "date": "20230520", + }, ).read_batch() transform_value_df = MISOToMDMTransformer( - spark=spark, - data=source_df, - output_type= "usage" + spark=spark, 
data=source_df, output_type="usage" ).transform() transform_meta_df = MISOToMDMTransformer( - spark=spark, - data=source_df, - output_type= "meta" + spark=spark, data=source_df, output_type="meta" ).transform() SparkDeltaDestination( data=transform_value_df, - options={ - "partitionBy":"timestamp" - }, - destination="miso_usage_data" -).write_batch() + options={"partitionBy": "timestamp"}, + destination="miso_usage_data", +).write_batch() SparkDeltaDestination( data=transform_meta_df, - options={ - "partitionBy":"timestamp" - }, - destination="miso_meta_data" -).write_batch() + options={"partitionBy": "timestamp"}, + destination="miso_meta_data", +).write_batch() +spark.stop() diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py index 26916c6..9c18711 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py @@ -1,52 +1,57 @@ - - -from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource -from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer from rtdip_sdk.pipelines.destinations import SparkDeltaDestination -from pyspark.sql import SparkSession +from rtdip_sdk.pipelines.transformers import MISOToMDMTransformer +from rtdip_sdk.pipelines.sources import MISODailyLoadISOSource +from pyspark.sql import SparkSession from dagster import asset +import shutil -@asset # add the asset decorator to tell Dagster this is an asset -def run_miso_ingest(): - spark = SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")\ - .config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension")\ - .config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() +@asset # add the asset decorator to tell Dagster this is an asset +def run_miso_ingest(): + # First Clear local files + shutil.rmtree("spark-warehouse") + + spark = ( + SparkSession.builder.config( + "spark.jars.packages", "io.delta:delta-core_2.12:2.4.0" + ) + .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") + .config( + "spark.sql.catalog.spark_catalog", + "org.apache.spark.sql.delta.catalog.DeltaCatalog", + ) + .getOrCreate() + ) + + # Build the query source_df = MISODailyLoadISOSource( - spark = spark, - options = { - "load_type": "actual", - "date": "20230520", - } + spark=spark, + options={ + "load_type": "actual", + "date": "20230520", + }, ).read_batch() transform_value_df = MISOToMDMTransformer( - spark=spark, - data=source_df, - output_type= "usage" + spark=spark, data=source_df, output_type="usage" ).transform() transform_meta_df = MISOToMDMTransformer( - spark=spark, - data=source_df, - output_type= "meta" + spark=spark, data=source_df, output_type="meta" ).transform() SparkDeltaDestination( data=transform_value_df, - options={ - "partitionBy":"timestamp" - }, - destination="miso_usage_data" + options={"partitionBy": "timestamp"}, + destination="miso_usage_data", ).write_batch() SparkDeltaDestination( data=transform_meta_df, - options={ - "partitionBy":"timestamp" - }, - destination="miso_meta_data" + options={"partitionBy": "timestamp"}, + destination="miso_meta_data", ).write_batch() + spark.stop() From f7030ec760adc9f29526ea0cda08e7272970aa88 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 11 Nov 2023 16:11:01 +0000 Subject: [PATCH 34/42] Updated --- 
.../Spark-Single-Node-Notebook-AWS/Dockerfile | 1 + .../dagster.yaml | 86 ++++++++----------- .../run_conda_installer.sh | 58 ++++++++++++- .../workspace.yaml | 4 + 4 files changed, 94 insertions(+), 55 deletions(-) create mode 100644 pipelines/deploy/Spark-Single-Node-Notebook-AWS/workspace.yaml diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile index 8748115..4cad4c2 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile @@ -9,6 +9,7 @@ COPY run_conda_installer.sh /home/rtdip/ RUN mkdir -p /home/rtdip/apps/lfenergy COPY environment.yml /home/rtdip/apps/lfenergy/ COPY MISO_pipeline_sample*.* /home/rtdip/apps/lfenergy/ +COPY dagster.yaml /home/rtdip/apps/lfenergy/ RUN chmod +x /home/rtdip/run_conda_installer.sh WORKDIR /home/rtdip ENTRYPOINT ["/home/rtdip/run_conda_installer.sh"] diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml index 197745a..d8ea6e3 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/dagster.yaml @@ -1,3 +1,4 @@ +# MYSQL Configuration scheduler: module: dagster.core.scheduler class: DagsterDaemonScheduler @@ -6,58 +7,41 @@ run_coordinator: module: dagster.core.run_coordinator class: QueuedRunCoordinator -run_launcher: - module: dagster_docker - class: DockerRunLauncher - config: - env_vars: - - DAGSTER_POSTGRES_USER - - DAGSTER_POSTGRES_PASSWORD - - DAGSTER_POSTGRES_DB - network: dagster_network - container_kwargs: - volumes: # Make docker client accessible to any launched containers as well - - /var/run/docker.sock:/var/run/docker.sock - - /tmp/io_manager_storage:/tmp/io_manager_storage - run_storage: - module: dagster_postgres.run_storage - class: PostgresRunStorage - config: - postgres_db: - hostname: docker_example_postgresql - username: - env: DAGSTER_POSTGRES_USER - password: - env: DAGSTER_POSTGRES_PASSWORD - db_name: - env: DAGSTER_POSTGRES_DB - port: 5432 + module: dagster_mysql.run_storage + class: MySQLRunStorage + config: + mysql_db: + hostname: DAGSTER_MYSQL_HOST + username: DAGSTER_MYSQL_USER + password: DAGSTER_MYSQL_PASSWORD + db_name: DAGSTER_MYSQL_DB + port: DAGSTER_MYSQL_PORT -schedule_storage: - module: dagster_postgres.schedule_storage - class: PostgresScheduleStorage - config: - postgres_db: - hostname: docker_example_postgresql - username: - env: DAGSTER_POSTGRES_USER - password: - env: DAGSTER_POSTGRES_PASSWORD - db_name: - env: DAGSTER_POSTGRES_DB - port: 5432 event_log_storage: - module: dagster_postgres.event_log - class: PostgresEventLogStorage - config: - postgres_db: - hostname: docker_example_postgresql - username: - env: DAGSTER_POSTGRES_USER - password: - env: DAGSTER_POSTGRES_PASSWORD - db_name: - env: DAGSTER_POSTGRES_DB - port: 5432 \ No newline at end of file + module: dagster_mysql.event_log + class: MySQLEventLogStorage + config: + mysql_db: + hostname: DAGSTER_MYSQL_HOST + username: DAGSTER_MYSQL_USER + password: DAGSTER_MYSQL_PASSWORD + db_name: DAGSTER_MYSQL_DB + port: DAGSTER_MYSQL_PORT + +schedule_storage: + module: dagster_mysql.schedule_storage + class: MySQLScheduleStorage + config: + mysql_db: + hostname: DAGSTER_MYSQL_HOST + username: DAGSTER_MYSQL_USER + password: DAGSTER_MYSQL_PASSWORD + db_name: DAGSTER_MYSQL_DB + port: DAGSTER_MYSQL_PORT + + +run_launcher: + module: dagster.core.launcher + class: 
DefaultRunLauncher \ No newline at end of file diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 1e665c2..1e565f1 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -193,6 +193,47 @@ mv $PROTON_J_JAR_FILE_NAME $SPARK_HOME/jars echo "Finished INSTALLING $JAVA_VERSION and $SPARK_VERSION and Extra Libraries" + Password Generator +apt-get install pwgen +## +echo "Installing MySQL" +# Generate rnd passwords +MYSQL_ROOT_PASSWORD=$(pwgen -s -c -n 10) +MYSQL_DAGSTER_PASSWORD=$(pwgen -s -c -n 10) + + +export MYSQL_HOSTNAME="localhost" +export MYSQL_PORT="3306" +export MYSQL_DAGSTER_DATABASE_NAME="dagster" +export MYSQL_VOLUME="/tmp" +export MYSQL_DAGSTER_USERNAME="dagster" + +apt-get install debconf -y +debconf-set-selections <<< "mysql-server mysql-server/root_password password $MYSQL_ROOT_PASSWORD" +debconf-set-selections <<< "mysql-server mysql-server/root_password_again password $MYSQL_ROOT_PASSWORD" + + +apt-get install -y mysql-server +apt-get install -y mysql-client +service mysql --full-restart +mysql -uroot -p$MYSQL_ROOT_PASSWORD -e "CREATE DATABASE ${MYSQL_DAGSTER_DATABASE_NAME} CHARACTER SET utf8 COLLATE utf8_unicode_ci;;" +mysql -uroot -p$MYSQL_ROOT_PASSWORD -e "CREATE USER ${MYSQL_DAGSTER_USERNAME}@localhost IDENTIFIED BY '${MYSQL_DAGSTER_PASSWORD}';" +mysql -uroot -p$MYSQL_ROOT_PASSWORD -e "GRANT ALL PRIVILEGES ON ${MYSQL_DAGSTER_DATABASE_NAME}.* TO '${MYSQL_DAGSTER_USERNAME}'@'localhost';" +mysql -uroot -p$MYSQL_ROOT_PASSWORD -e "FLUSH PRIVILEGES;" + +export DAGSTER_HOME=$CONDA_ENV_HOME + +sed -i "s/DAGSTER_MYSQL_HOST/$MYSQL_HOSTNAME/" $DAGSTER_HOME/dagster.yaml +sed -i "s/DAGSTER_MYSQL_USER/$MYSQL_DAGSTER_USERNAME/" $DAGSTER_HOME/dagster.yaml +sed -i "s/DAGSTER_MYSQL_PASSWORD/$MYSQL_DAGSTER_PASSWORD/" $DAGSTER_HOME/dagster.yaml +sed -i "s/DAGSTER_MYSQL_DB/$MYSQL_DAGSTER_DATABASE_NAME/" $DAGSTER_HOME/dagster.yaml +sed -i "s/DAGSTER_MYSQL_PORT/$MYSQL_PORT/" $DAGSTER_HOME/dagster.yaml + + hostname: DAGSTER_MYSQL_HOST + username: DAGSTER_MYSQL_USER + password: DAGSTER_MYSQL_PASSWORD + db_name: DAGSTER_MYSQL_DB + port: DAGSTER_MYSQL_PORT eval "$(conda shell.bash hook)" conda config --set default_threads 4 @@ -219,22 +260,31 @@ echo "#!/usr/bin/env bash" > $CONDA_ENVIRONMENT_FILE_NAME echo "export PATH=$PATH" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export JAVA_HOME=$JAVA_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export DAGSTER_HOME=$DAGSTER_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" # Install and Run Notebook -conda install -y notebook=6.5.4 +## conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" export DAGSTER_PORT="3000" export HOST="0.0.0.0" # Install and run Dagster echo "Going to install dagster" conda install -y dagster=1.5.6 +conda install -y dagster-mysql=1.5.6 echo "Going to install dagster-webserver" yes | pip install dagster-webserver==1.5.6 echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" -jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & -echo "Going to run Dagster dev on 
host:$HOST/port:$DAGSTER_PORT " -dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py +## jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & +echo "Checking Dagster Config files: dagster.yaml and workspace.yaml" +ls -la dagster.yaml +ls -la workspace.yaml +echo "Going to run Dagster Webserver on host:$HOST/port:$DAGSTER_PORT " +dagster-webserver -h $HOST -p $DAGSTER_PORT & +echo "Going to run Dagster Daemon" +dagster-daemon run & +# dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py sleep infinity + diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/workspace.yaml b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/workspace.yaml new file mode 100644 index 0000000..41b38ec --- /dev/null +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/workspace.yaml @@ -0,0 +1,4 @@ +# workspace.yaml + +load_from: + - python_file: MISO_pipeline_sample_dagster.py \ No newline at end of file From c0cce10b3e4b1e7f7b560efeb99ac610ed5177a1 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 11 Nov 2023 16:12:11 +0000 Subject: [PATCH 35/42] Updated --- .../MISO_pipeline_sample_dagster.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py index 9c18711..6f4c3d1 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py @@ -8,9 +8,9 @@ import shutil -@asset # add the asset decorator to tell Dagster this is an asset +@asset def run_miso_ingest(): - # First Clear local files + # First: Clear local files shutil.rmtree("spark-warehouse") spark = ( From 7f90338974db07b0f3f92e1564ca564e032267ca Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 11 Nov 2023 16:48:14 +0000 Subject: [PATCH 36/42] Updated --- .../Spark-Single-Node-Notebook-AWS/run_conda_installer.sh | 7 ------- 1 file changed, 7 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 1e565f1..1d42ee0 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -229,12 +229,6 @@ sed -i "s/DAGSTER_MYSQL_PASSWORD/$MYSQL_DAGSTER_PASSWORD/" $DAGSTER_HOME/dagster sed -i "s/DAGSTER_MYSQL_DB/$MYSQL_DAGSTER_DATABASE_NAME/" $DAGSTER_HOME/dagster.yaml sed -i "s/DAGSTER_MYSQL_PORT/$MYSQL_PORT/" $DAGSTER_HOME/dagster.yaml - hostname: DAGSTER_MYSQL_HOST - username: DAGSTER_MYSQL_USER - password: DAGSTER_MYSQL_PASSWORD - db_name: DAGSTER_MYSQL_DB - port: DAGSTER_MYSQL_PORT - eval "$(conda shell.bash hook)" conda config --set default_threads 4 conda env list @@ -285,6 +279,5 @@ echo "Going to run Dagster Webserver on host:$HOST/port:$DAGSTER_PORT " dagster-webserver -h $HOST -p $DAGSTER_PORT & echo "Going to run Dagster Daemon" dagster-daemon run & -# dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py sleep infinity From 2e0d2dc8e48d18769d09ea1347b9efcd137b4013 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 11 Nov 2023 17:29:03 +0000 Subject: [PATCH 37/42] Updated --- .../deploy/Spark-Single-Node-Notebook-AWS/Dockerfile | 1 + 
.../run_conda_installer.sh | 11 +++++++---- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile index 4cad4c2..22468cf 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/Dockerfile @@ -10,6 +10,7 @@ RUN mkdir -p /home/rtdip/apps/lfenergy COPY environment.yml /home/rtdip/apps/lfenergy/ COPY MISO_pipeline_sample*.* /home/rtdip/apps/lfenergy/ COPY dagster.yaml /home/rtdip/apps/lfenergy/ +COPY workspace.yaml /home/rtdip/apps/lfenergy/ RUN chmod +x /home/rtdip/run_conda_installer.sh WORKDIR /home/rtdip ENTRYPOINT ["/home/rtdip/run_conda_installer.sh"] diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 1d42ee0..6c84e51 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -255,6 +255,8 @@ echo "export PATH=$PATH" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export JAVA_HOME=$JAVA_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export DAGSTER_HOME=$DAGSTER_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export HOST=$HOST" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export DAGSTER_PORT=$DAGSTER_PORT" >> $CONDA_ENVIRONMENT_FILE_NAME echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" @@ -275,9 +277,10 @@ echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" echo "Checking Dagster Config files: dagster.yaml and workspace.yaml" ls -la dagster.yaml ls -la workspace.yaml -echo "Going to run Dagster Webserver on host:$HOST/port:$DAGSTER_PORT " -dagster-webserver -h $HOST -p $DAGSTER_PORT & -echo "Going to run Dagster Daemon" -dagster-daemon run & +echo "Going to run Dagster DEV on host:$HOST/port:$DAGSTER_PORT " +## dagster-webserver -h $HOST -p $DAGSTER_PORT & +## echo "Going to run Dagster Daemon" +## dagster-daemon run & +dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py sleep infinity From bd0e66f618e33d92215cf2e9e471f6f8a7490be9 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sat, 11 Nov 2023 17:49:03 +0000 Subject: [PATCH 38/42] Updated --- .../MISO_pipeline_sample_dagster.py | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py index 6f4c3d1..8687b0b 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py @@ -5,13 +5,21 @@ from pyspark.sql import SparkSession from dagster import asset + import shutil +import os @asset def run_miso_ingest(): + # First: Clear local files - shutil.rmtree("spark-warehouse") + spark_warehouse_local_path: str = "spark-warehouse" + if os.path.exists(spark_warehouse_local_path) and os.path.isdir(spark_warehouse_local_path): + try: + shutil.rmtree("spark-warehouse") + except Exception as ex: + print(str(ex)) spark = ( SparkSession.builder.config( From 824d13efe3693c0a77b2302498f2c959e651789a Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: 
Sat, 11 Nov 2023 21:28:47 +0000 Subject: [PATCH 39/42] Updated --- .../MISO_pipeline_sample_dagster.py | 2 +- .../Spark-Single-Node-Notebook-AWS/run_conda_installer.sh | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py index 8687b0b..90b0b0d 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample_dagster.py @@ -12,7 +12,7 @@ @asset def run_miso_ingest(): - + # First: Clear local files spark_warehouse_local_path: str = "spark-warehouse" if os.path.exists(spark_warehouse_local_path) and os.path.isdir(spark_warehouse_local_path): diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 6c84e51..107298b 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -257,6 +257,8 @@ echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export DAGSTER_HOME=$DAGSTER_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export HOST=$HOST" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export DAGSTER_PORT=$DAGSTER_PORT" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export DAGSTER_MYSQL_USER=$DAGSTER_MYSQL_USER" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export DAGSTER_MYSQL_PASSWORD=$DAGSTER_MYSQL_PASSWORD" >> $CONDA_ENVIRONMENT_FILE_NAME echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" From aa2c7e10729f100dc527944d973ce8221bb7e548 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Sun, 12 Nov 2023 06:51:33 +0000 Subject: [PATCH 40/42] Updated --- .../run_conda_installer.sh | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index 107298b..bc26e8f 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -197,11 +197,11 @@ echo "Finished INSTALLING $JAVA_VERSION and $SPARK_VERSION and Extra Libraries" apt-get install pwgen ## echo "Installing MySQL" -# Generate rnd passwords +# Generate random passwords MYSQL_ROOT_PASSWORD=$(pwgen -s -c -n 10) MYSQL_DAGSTER_PASSWORD=$(pwgen -s -c -n 10) - +# MySQL Config for MySQL Clients connecting to MySQL Server export MYSQL_HOSTNAME="localhost" export MYSQL_PORT="3306" export MYSQL_DAGSTER_DATABASE_NAME="dagster" @@ -249,6 +249,8 @@ runtime=$((end_time-start_time)) echo "Total Installation Runtime: $runtime [seconds]" echo "Test environment not intended for using in production. 
Backup any changes made to this environment" # +export DAGSTER_PORT="3000" +export HOST="0.0.0.0" CONDA_ENVIRONMENT_FILE_NAME="conda_environment_$CONDA_ENV.sh" echo "#!/usr/bin/env bash" > $CONDA_ENVIRONMENT_FILE_NAME echo "export PATH=$PATH" >> $CONDA_ENVIRONMENT_FILE_NAME @@ -257,8 +259,8 @@ echo "export SPARK_HOME=$SPARK_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export DAGSTER_HOME=$DAGSTER_HOME" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export HOST=$HOST" >> $CONDA_ENVIRONMENT_FILE_NAME echo "export DAGSTER_PORT=$DAGSTER_PORT" >> $CONDA_ENVIRONMENT_FILE_NAME -echo "export DAGSTER_MYSQL_USER=$DAGSTER_MYSQL_USER" >> $CONDA_ENVIRONMENT_FILE_NAME -echo "export DAGSTER_MYSQL_PASSWORD=$DAGSTER_MYSQL_PASSWORD" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export MYSQL_DAGSTER_USERNAME=$MYSQL_DAGSTER_USERNAME" >> $CONDA_ENVIRONMENT_FILE_NAME +echo "export MYSQL_DAGSTER_PASSWORD=$MYSQL_DAGSTER_PASSWORD" >> $CONDA_ENVIRONMENT_FILE_NAME echo "source $HOME/$MINICONDA_NAME/etc/profile.d/conda.sh" >> $CONDA_ENVIRONMENT_FILE_NAME chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" @@ -266,23 +268,21 @@ echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" # Install and Run Notebook ## conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" -export DAGSTER_PORT="3000" -export HOST="0.0.0.0" # Install and run Dagster echo "Going to install dagster" conda install -y dagster=1.5.6 conda install -y dagster-mysql=1.5.6 echo "Going to install dagster-webserver" yes | pip install dagster-webserver==1.5.6 -echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" +## echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" ## jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & -echo "Checking Dagster Config files: dagster.yaml and workspace.yaml" -ls -la dagster.yaml -ls -la workspace.yaml -echo "Going to run Dagster DEV on host:$HOST/port:$DAGSTER_PORT " -## dagster-webserver -h $HOST -p $DAGSTER_PORT & -## echo "Going to run Dagster Daemon" -## dagster-daemon run & -dagster dev -h $HOST -p $DAGSTER_PORT -f $CONDA_ENV_HOME/MISO_pipeline_sample_dagster.py +echo "Going to run Dagster on host:$HOST/port:$DAGSTER_PORT " +echo "Running Dagster daemon" +dagster-daemon run > dagster_daemon_logs.txt 2>&1 & +echo "Allowing for dagster-daemon to start running" +sleep 60 +echo "Running Webserver" +dagster-webserver -h $HOST -p $DAGSTER_PORT > dagster_webserver_logs.txt 2>&1 & +tail -f *.txt sleep infinity From 5d0a93571a1c84e321f6358ba7f714be4043b949 Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Mon, 13 Nov 2023 10:14:32 +0000 Subject: [PATCH 41/42] Notebook reenabled --- .../MISO_pipeline_sample.py | 11 ++++++++--- .../run_conda_installer.sh | 6 +++--- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py index 66d0835..a54870a 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/MISO_pipeline_sample.py @@ -10,9 +10,14 @@ from pyspark.sql import SparkSession import shutil - -# First Clear local files -shutil.rmtree("spark-warehouse") +import os + +spark_warehouse_local_path: str = "spark-warehouse" +if os.path.exists(spark_warehouse_local_path) and os.path.isdir(spark_warehouse_local_path): + try: + shutil.rmtree("spark-warehouse") + except Exception as ex: + print(str(ex)) spark = ( 
SparkSession.builder.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh index bc26e8f..d2e414b 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/run_conda_installer.sh @@ -266,7 +266,7 @@ chmod +x $CONDA_ENVIRONMENT_FILE_NAME echo "export SPARK_HOME=$SPARK_HOME" echo "NOTEBOOK_PORT: $NOTEBOOK_PORT" # Install and Run Notebook -## conda install -y notebook=6.5.4 +conda install -y notebook=6.5.4 export NOTEBOOK_PORT="8080" # Install and run Dagster echo "Going to install dagster" @@ -274,8 +274,8 @@ conda install -y dagster=1.5.6 conda install -y dagster-mysql=1.5.6 echo "Going to install dagster-webserver" yes | pip install dagster-webserver==1.5.6 -## echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" -## jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & +echo "Going to run Jupyter on host:$HOST/port:$NOTEBOOK_PORT" +jupyter notebook --no-browser --port=$NOTEBOOK_PORT --ip=$HOST --NotebookApp.token='' --NotebookApp.password='' --allow-root & echo "Going to run Dagster on host:$HOST/port:$DAGSTER_PORT " echo "Running Dagster daemon" dagster-daemon run > dagster_daemon_logs.txt 2>&1 & From a92e9dd558d5a9bce26664c3f9c75168c42fee1e Mon Sep 17 00:00:00 2001 From: Victor Bayon Date: Mon, 13 Nov 2023 10:52:01 +0000 Subject: [PATCH 42/42] Documentation Update --- .../Spark-Single-Node-Notebook-AWS/README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md index bb744f5..77af9ef 100644 --- a/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md +++ b/pipelines/deploy/Spark-Single-Node-Notebook-AWS/README.md @@ -1,9 +1,10 @@ -# Spark Single Node Notebook AWS +# Spark Single Node Dagster MySql Notebook AWS Integration This article provides a guide on how to create a conda based self-contained environment to run LFEnergy RTDIP that integrates the following components: * Java JDK and Apache Spark (Single node configuration). Currently, v3.4.1 Spark (PySpark) has been configured and tested. * AWS Libraries (e.g for accessing files in AWS S3 if required). * Jupyter Notebook server. +* Dagster (MySQL backend). The components of this environment are all pinned to a specific source distribution of RTDIP and have been tested in x86 Windows (using gitbash) and Linux environments. @@ -11,12 +12,14 @@ The components of this environment are all pinned to a specific source distribut * Docker desktop or another local Docker environment (e.g. Ubuntu Docker). * gitbash environment for Windows environments. -When the installation completes, a Jupyter notebook will be running locally on port 8080. +When the installation completes, a Jupyter notebook will be running locally on port 8080 and Dagster Webserver will be running on port 3000 Please check that this port is available or change the configuration in the installer if required. -# Deploy +# Deployment Run *run_in_docker.sh*. After the installer completes: -* At http://localhost:8080/ a jupyter notebook server will be running. Notebooks can be created to run for example new RTDIP pipelines. 
* To test the environment, create a new notebook and copy the contents of MISO_pipeline_sample.py and run the notebook. This pipeline queries MISO (Midcontinent Independent System Operator) and saves the results of the query locally under a newly created directory called spark-warehouse.
-* For debugging purposes and running from inside the container other RTDIP pipeplines, a new file *conda_environment_lfenergy.sh* is created. Please use this file to activate the conda environment (e.g. *source ./conda_environment_lfenergy.sh*) within the container.
+* At http://localhost:8080/ a Jupyter notebook server will be running. Notebooks can be created to run, for example, new RTDIP pipelines. A sample pipeline, MISO_pipeline_sample.py, is provided and can be run in a notebook.
+* At http://localhost:3000/ the Dagster webserver will be running, with the sample MISO_pipeline_sample_dagster.py configured as a Dagster asset.
+* To test the notebook environment, create a new notebook, copy the contents of MISO_pipeline_sample.py into it and run it. This pipeline queries MISO (Midcontinent Independent System Operator) and saves the results of the query locally under a newly created directory called spark-warehouse.
+* To test the Dagster environment, materialize the asset from the Dagster web UI.
+* For debugging purposes, and for running other RTDIP pipelines from inside the container, a file called *conda_environment_lfenergy.sh* is created under /home/rtdip/apps/lfenergy. Please use this file to activate the conda environment (e.g. *source ./conda_environment_lfenergy.sh; conda activate lfenergy*) within the container.