Generate synthetic dataset with JSON strings

It would be useful to have a synthetic dataset with a large number of JSON strings to run benchmarks on.

This would let us compare, for example, Spark + JSON strings vs. Spark + the new variant data type.
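
As a starting point, here is a minimal sketch of what the generator could look like, using only the Python standard library. Everything in it is an assumption for illustration: the record schema, the function names (`random_record`, `generate_json_strings`, `write_jsonl`), and the choice of JSON Lines as the output format are all hypothetical and not something this project has decided on.

```python
import json
import random
import string


def random_record(rng, record_id):
    """Build one nested record. The schema is purely illustrative."""
    return {
        "id": record_id,
        "name": "".join(rng.choices(string.ascii_lowercase, k=8)),
        "score": round(rng.uniform(0, 100), 2),
        "active": rng.random() < 0.5,
        # Variable-length array, so rows differ in shape and size.
        "tags": [rng.choice(["a", "b", "c", "d"]) for _ in range(rng.randint(0, 3))],
        # Nested object, to exercise path lookups in benchmarks.
        "address": {
            "zip": f"{rng.randint(0, 99999):05d}",
            "country": rng.choice(["US", "DE", "BR"]),
        },
    }


def generate_json_strings(num_rows, seed=0):
    """Yield num_rows JSON strings; a fixed seed makes runs reproducible."""
    rng = random.Random(seed)
    for i in range(num_rows):
        yield json.dumps(random_record(rng, i))


def write_jsonl(path, num_rows, seed=0):
    """Write one JSON string per line (JSON Lines) to path."""
    with open(path, "w", encoding="utf-8") as f:
        for line in generate_json_strings(num_rows, seed):
            f.write(line + "\n")
```

Calling something like `write_jsonl("synthetic.jsonl", 1_000_000)` would produce a file that Spark can load as plain strings (for example with `spark.read.text`), which gives the JSON-string side of the comparison. For the variant side, the same strings would need to be converted in a Spark version that supports the variant type. I believe `parse_json` is the function for that, but please check it against the Spark docs.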