33 changes: 17 additions & 16 deletions CHANGELOG.md
@@ -5,7 +5,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in

### unreleased

#### Fixed
* Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
* Refactored `DataAnalyzer` and `BasicStockTickerProvider` to comply with ANSI SQL standards
* Removed internal modification of `SparkSession`
@@ -23,6 +23,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
#### Added
* Added support for serialization to/from JSON format
* Added Ruff and mypy tooling
* Pydantic-based specification API (Experimental)


### Version 0.4.0 Hotfix 2
@@ -59,7 +60,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
* Updated docs for complex data types / JSON to correct code examples
* Updated license file in public docs

#### Fixed
* Fixed scenario where `DataAnalyzer` is used on dataframe containing a column named `summary`

### Version 0.3.6
@@ -90,14 +91,14 @@ All notable changes to the Databricks Labs Data Generator will be documented in
### Version 0.3.4 Post 2

### Fixed
* Fix for use of values in columns of type array, map and struct
* Fix for generation of arrays via `numFeatures` and `structType` attributes when `numFeatures` has a value of 1


### Version 0.3.4 Post 1

### Fixed
* Fix for use and configuration of root logger

### Acknowledgements
Thanks to Marvin Schenkel for the contribution
@@ -120,7 +121,7 @@ Thanks to Marvin Schenkel for the contribution

#### Changed
* Fixed use of logger in _version.py and in spark_singleton.py
* Fixed template issues
* Document reformatting and updates, related code comment changes

### Fixed
@@ -133,19 +134,19 @@ Thanks to Marvin Schenkel for the contribution
### Version 0.3.2

#### Changed
* Adjusted column build phase separation (i.e. which select statement is used to build columns) so that a
column with a SQL expression can refer to previously created columns without use of a `baseColumn` attribute
* Changed build labelling to comply with PEP440

#### Fixed
* Fixed compatibility of build with older versions of runtime that rely on `pyparsing` version 2.4.7

#### Added
* Parsing of SQL expressions to determine column dependencies

#### Notes
* The enhancements to build ordering do not change the actual order of column building,
but adjust the phase in which columns are built


### Version 0.3.1
@@ -154,11 +155,11 @@ Thanks to Marvin Schenkel for the contribution
* Refactoring of template text generation for better performance via vectorized implementation
* Additional migration of tests to use of `pytest`

#### Fixed
* Added type parsing support for binary types and constructs such as `nvarchar(10)`
* Fixed error occurring when schema contains map, array or struct.

#### Added
* Ability to change name of seed column to custom name (defaults to `id`)
* Added type parsing support for structs, maps and arrays and combinations of the above

@@ -207,14 +208,14 @@ See the contents of the file `python/require.txt` to see the Python package depe
The code for the Databricks Data Generator has the following dependencies

* Requires Databricks runtime 9.1 LTS or later
* Requires Spark 3.1.2 or later
* Requires Python 3.8.10 or later

While the data generator framework does not require all libraries used by the runtimes, where a library from
the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.

The recommended method to install the package is to use `pip install` in your notebook to install it from
PyPI.

For example:
@@ -227,7 +228,7 @@ To use an older DB runtime version in your notebook, you can use the following c
%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
```

See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
for the full list of dependencies used by the Databricks runtime.

This can be found at: https://docs.databricks.com/release-notes/runtime/releases.html
91 changes: 46 additions & 45 deletions README.md
@@ -1,11 +1,11 @@
# Databricks Labs Data Generator (`dbldatagen`)

<!-- Top bar will be removed from PyPi packaged versions -->
<!-- Dont remove: exclude package -->
[Documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) |
[Release Notes](CHANGELOG.md) |
[Examples](examples) |
[Tutorial](tutorial)
<!-- Dont remove: end exclude package -->

[![build](https://github.com/databrickslabs/dbldatagen/workflows/build/badge.svg?branch=master)](https://github.com/databrickslabs/dbldatagen/actions?query=workflow%3Abuild+branch%3Amaster)
@@ -14,53 +14,54 @@
[![PyPi downloads](https://img.shields.io/pypi/dm/dbldatagen?label=PyPi%20Downloads)](https://pypistats.org/packages/dbldatagen)
[![lines of code](https://tokei.rs/b1/github/databrickslabs/dbldatagen)](https://github.com/databrickslabs/dbldatagen)

<!--
[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/databrickslabs/dbldatagen.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/databrickslabs/dbldatagen/context:python)
[![downloads](https://img.shields.io/github/downloads/databrickslabs/dbldatagen/total.svg)](https://hanadigital.github.io/grev/?user=databrickslabs&repo=dbldatagen)
-->

## Project Description
The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
other purposes.

It operates by defining a data generation specification in code that controls
how the synthetic data is generated.
The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

It has no dependencies on any libraries that are not already installed in the Databricks
runtime, and you can use it from Scala, R or other languages by defining
a view over the generated data.
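
For instance, a minimal sketch of cross-language use (view name hypothetical), assuming `df` is a dataframe produced
by one of the builds shown later in this README:

```python
# Register the generated dataframe as a temporary view so that SQL, Scala or R
# cells in the same notebook can query the synthetic data.
df.createOrReplaceTempView("synthetic_data")

spark.sql("SELECT COUNT(*) FROM synthetic_data").show()
```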

### Feature Summary
It supports:
* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
merge and join scenarios with consistency between primary and foreign keys (see the short sketch after this list)
* Generating synthetic data for all of the Spark SQL supported primitive types as a Spark data frame which may be
persisted, saved to external storage or used in other computations
* Generating ranges of dates, timestamps, and numeric values
* Generation of discrete values - both numeric and text
* Generation of values at random and based on the values of other fields
(either based on the `hash` of the underlying values or the values themselves)
* Ability to specify a distribution for random data generation
* Generating arrays of values for ML-style feature arrays
* Applying weights to the occurrence of values
* Generating values to conform to a schema or independent of an existing schema
* Use of SQL expressions in synthetic data generation
* Plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
* Generate synthetic data generation code from existing schema or data (experimental)
* Pydantic-based specification API for type-safe data generation (experimental)
* Use of standard datasets for quick generation of synthetic data
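
As a small sketch of the repeatability point above (dataset name taken from the quick-start example later in this
README), building the same specification twice should produce identical data:

```python
import dbldatagen as dg

# Build the same standard dataset twice with identical parameters.
df1 = dg.Datasets(spark, "basic/user").get(rows=100_000).build()
df2 = dg.Datasets(spark, "basic/user").get(rows=100_000).build()

# For repeatable specifications, the two builds contain exactly the same rows.
diff_count = df1.exceptAll(df2).count()
print(f"Rows differing between builds: {diff_count}")  # expected: 0
```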

Details of these features can be found in the
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).

## Documentation

Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
details of use and many examples.

Release notes and details of the latest changes for this specific release
@@ -76,40 +77,40 @@ Within a Databricks notebook, invoke the following in a notebook cell
%pip install dbldatagen
```

The `pip install` command can be invoked within a Databricks notebook, a Delta Live Tables pipeline
and even works on the Databricks community edition.

The documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
contains details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs Data Generator framework can be used with PySpark 3.4.1 and Python 3.10.12 or later. These are
compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
compatibility.

For full library compatibility for a specific Databricks runtime release, see the Databricks
release notes:

- https://docs.databricks.com/release-notes/runtime/releases.html

When using the Databricks Labs Data Generator on Unity Catalog enabled Databricks environments with runtimes prior
to release 13.2, the Data Generator requires the `Single User` or `No Isolation Shared` access modes. This is because
some needed features (for example, use of 3rd party libraries and Python UDFs) are not available in `Shared` mode in
those releases.
Depending on settings, the `Custom` access mode may be supported for those releases.

The use of Unity Catalog `Shared` access mode is supported in Databricks runtimes from release 13.2
onwards.

*This version of the data generator uses the Databricks runtime 13.3 LTS as the minimum supported
version and alleviates these issues.*

See the following documentation for more information:

- https://docs.databricks.com/data-governance/unity-catalog/compute.html

## Using the Data Generator
To use the data generator, install the library using the `%pip install` method or install the Python wheel directly
in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.
Expand All @@ -120,7 +121,7 @@ for your use case.
```python
import dbldatagen as dg
df = dg.Datasets(spark, "basic/user").get(rows=1000_000).build()
num_rows = df.count()
```

You can also define fully custom data sets using the `DataGenerator` class.
@@ -135,48 +136,48 @@ data_rows = 1000 * 1000
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                            partitions=4)
           .withIdOutput()
           .withColumn("r", FloatType(),
                       expr="floor(rand() * 350) * (86400 + 3600)",
                       numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                       random=True)
           .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                       random=True, weights=[9, 1, 1])
           )

df = df_spec.build()
num_rows = df.count()
```
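
As a follow-on sketch (table name and path are hypothetical), the generated dataframe behaves like any other Spark
dataframe, so it can be persisted or written to external storage:

```python
# Save the generated data as a table for use in downstream tests or demos.
df.write.format("delta").mode("overwrite").saveAsTable("synthetic_test_data")

# Alternatively, write it to external storage, e.g. as Parquet files.
df.write.mode("overwrite").parquet("/tmp/dbldatagen/synthetic_test_data")
```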
Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
examples.

The GitHub repository also contains further examples in the examples directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
older LTS versions from 13.3 LTS onwards. It also aims to be compatible with Delta Live Tables runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in PySpark APIs or
APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older
runtimes.

By design, installing `dbldatagen` does not install releases of dependent packages in order
to preserve the curated set of packages pre-installed in any Databricks runtime environment.

When building on local environments, run `make dev` to install required dependencies.

## Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
They will be reviewed as time permits, but there are no formal SLAs for support.


49 changes: 49 additions & 0 deletions dbldatagen/spec/__init__.py
@@ -0,0 +1,49 @@
"""Pydantic-based specification API for dbldatagen (Experimental).

This module provides Pydantic models and specifications for defining data generation
in a type-safe, declarative way.

.. warning::
    Experimental - This API is experimental and both APIs and generated code
    are liable to change in future versions.
"""

from typing import Any

# Import only the compat layer by default to avoid triggering Spark/heavy dependencies
from .compat import BaseModel, Field, constr, root_validator, validator


# Lazy imports for heavy modules - import these explicitly when needed
# from .column_spec import ColumnSpec
# from .generator_spec import GeneratorSpec
# from .generator_spec_impl import GeneratorSpecImpl

__all__ = [
    "BaseModel",
    "ColumnDefinition",
    "DatagenSpec",
    "Field",
    "Generator",
    "constr",
    "root_validator",
    "validator",
]


def __getattr__(name: str) -> Any: # noqa: ANN401
"""Lazy import heavy modules to avoid triggering Spark initialization.

Note: Imports are intentionally inside this function to enable lazy loading
and avoid importing heavy dependencies (pandas, IPython, Spark) until needed.
"""
if name == "ColumnSpec":
from .column_spec import ColumnDefinition # noqa: PLC0415
return ColumnDefinition
elif name == "GeneratorSpec":
from .generator_spec import DatagenSpec # noqa: PLC0415
return DatagenSpec
elif name == "GeneratorSpecImpl":
from .generator_spec_impl import Generator # noqa: PLC0415
return Generator
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
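
A brief illustrative sketch of how the lazy loading above behaves for a caller (names taken from this module; the
spec models themselves live in the sibling modules):

```python
# This __init__ eagerly imports only the lightweight compat layer; the heavy
# spec modules are not imported until one of their names is accessed.
import dbldatagen.spec as spec

# Attribute access triggers the module-level __getattr__, which lazily imports
# generator_spec and returns the DatagenSpec model under its GeneratorSpec alias.
SpecModel = spec.GeneratorSpec

# Unknown attributes raise AttributeError, as for any other module.
# spec.DoesNotExist  # -> AttributeError
```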