How to source our test data? #66
thorwhalen started this conversation in Ideas
Context
You're writing some code to test or demo something.
You need some data to do this.
Where do you put it?
Problem
If the data is small (e.g. a small array or dict), we often see the data being "made" within the code itself, like this:
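Something along these lines, say (the names and data here are made up, just to illustrate):

```python
from collections import Counter

def test_word_counts():
    # small, hard-coded test data, defined right inside the test
    docs = ["the cat sat", "the cat and the dog"]
    expected = {"the": 3, "cat": 2, "sat": 1, "and": 1, "dog": 1}
    assert Counter(word for doc in docs for word in doc.split()) == expected
```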
Now, there are pros and cons to this, but we won't go there for now.
Let's focus on the biggest problem first: what happens when the data is a bit bigger, or more complex.
Then what do folks do?
They often do whatever is easiest, without thinking of the consequences beyond their immediate need.
That's not necessarily bad, if we provide the tools so that the path of least resistance just so happens to be the one with the more scalable design. Let's look at the various ways people take care of this problem, and discuss the pros and cons.
Solutions
Local files
They put things in files, and tell the test to look there for its data dependencies.
What files?
Well, too often, it's a local file, which means that the test, as is, can only be run by the original author.
Shared files
They can make that a bit better by sharing that file (e.g. via dropbox), and using the shared link in the code.
With `graze`, for example, they can now have code that will automatically go fetch the data if it's missing locally, and then use the local copy.
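For instance (if I have `graze`'s basic interface right; the URL here is just a placeholder for your shared link):

```python
from graze import graze  # pip install graze

# a shared link to the data file (dropbox, github raw url, etc.) -- placeholder here
url = 'https://www.dropbox.com/s/abc123/test_data.csv?dl=1'

# first call downloads and caches the file locally; later calls just read the local copy
data_bytes = graze(url)
```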
Project files
Another common way to do this is to put the files in the same repository as the code, often under a `data` folder in the `tests` folder. (By the way, the proper robust way to then refer to these files is described here.)
This is a good solution, but it still has a big con as the required data gets bigger. That data is there for tests, yet when I download the repo, or do any kind of git syncing operation on it, I find myself spending more than 99% of the computing resources (bandwidth, storage, sync computations, ...) on something that is only there for tests.
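(The gist of that robust pattern -- at least the common one -- is to resolve the paths relative to the test module itself, not the current working directory. A sketch, with placeholder file names:)

```python
from pathlib import Path

# resolve the data folder relative to this test file, so the tests work
# no matter what directory pytest (or the user) launches from
DATA_DIR = Path(__file__).parent / "data"

def test_parse_config():
    config_text = (DATA_DIR / "sample_config.json").read_text()
    assert config_text  # ...parse it and assert on the result
```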
So...
Separate data
Putting the files elsewhere could go a long way here. Put them in a remote storage system -- could be a github repo, some S3 bucket, some dropbox folder, etc. -- just separate. Then in your test code, use a function that will mediate the access to that data, and take care of aspects like "do I keep a local copy?" etc.
Having the test data separate also enables reuse of the data in other contexts than just one test.
Data accessor
Here, some kind of "lazy" approach would be warranted: one where the data is "acquired" only if, and when, it's needed.
More generally, if you source your data through a function that gives you the data (from wherever), it keeps the test-data consumers open-closed with respect to that aspect: you just change the function (how it gets the data), not the code that uses it.
(_This is what I advise especially when developing GUIs, which usually start with some hard-coded data, then move on to getting that data from other GUI components, or from APIs to external resources. If, from the start, the data dependencies go through functions or "callbacks", the design is SOLID from the start._)
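To make that concrete, here's a minimal sketch of such an accessor (the URL, paths, and names are made up): the tests only ever call `get_test_data()`, so where and how the data is actually stored can change without touching any of its consumers.

```python
from pathlib import Path
from urllib.request import urlopen

# placeholders: in real code these would point to wherever the team keeps its test data
TEST_DATA_URL = "https://example.com/my-project/test-data/sensor_readings.csv"
LOCAL_CACHE = Path.home() / ".my_project" / "test_data" / "sensor_readings.csv"

def get_test_data() -> bytes:
    """Return the test data, fetching it (once) only if and when it's actually needed."""
    if not LOCAL_CACHE.exists():
        LOCAL_CACHE.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(TEST_DATA_URL) as resp:
            LOCAL_CACHE.write_bytes(resp.read())
    return LOCAL_CACHE.read_bytes()
```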
Generate the data
If and when the test data can actually be generated by computation, then we can get unbounded amounts of data with a finite storage footprint (just the function's code). This is what `hum` (for vibration) and `forged` (for other types of data) are for.
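To sketch the idea (this is not `hum`'s or `forged`'s actual API, just an illustration): a small generator function can hand out as much synthetic "vibration" data as a test wants, parametrized and reproducible, with zero stored data.

```python
import math
import random

def synthetic_vibration(n_samples=1000, freq_hz=5.0, sample_rate=100.0, noise=0.1, seed=0):
    """Noisy sine wave standing in for a vibration signal: unbounded data, finite code."""
    rng = random.Random(seed)
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate) + rng.gauss(0, noise)
        for i in range(n_samples)
    ]

# tests can ask for exactly the data they need, when they need it
waveform = synthetic_vibration(n_samples=200, freq_hz=10)
```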