HarmonizationScripts_CBSmicrodata: a data harmonization initiative for researchers working with CBS microdata
By: Bastian Ravesteijn and Mirthe Hendriks, Erasmus University Rotterdam: Erasmus School of Economics
This is a community-based initiative for researchers working with CBS microdata in the Remote Access (RA) environment. The motivation behind this initiative is to make data harmonization easier and less time-consuming, and in particular to simplify combining (or joining) different CBS microdata files. How? By making data harmonization scripts openly accessible through GitHub. The intention is that other CBS microdata users can run these ‘harmonization scripts’ in the RA environment and easily extract the harmonized data of interest.
Do you work with CBS microdata? Did you write your own scripts to harmonize the data? Please publish these harmonization scripts in this GitHub repository!
How? Log in to your own GitHub account and go to this repository. Click on Fork. By doing this you create your own copy of this repository, one in which you can make changes without affecting the original. In the forked repository, you can create a new folder and add files - your own harmonization scripts - to this folder. Also add a project description in which you describe your project and your scripts. When you are finished, submit a pull request to the original repository.
Not familiar with GitHub? Please send your harmonization scripts to Mirthe Hendriks (mail: m.m.j.hendriks@ese.eur.nl), who will publish them for you.
For users of CBS microdata, data harmonization is an unavoidable but time-consuming part of the data analysis process. Data harmonization refers to the effort of combining data from different sources with varying file locations, file formats, and naming conventions, and transforming them into a single cohesive data set. Our aim is to provide communal services for CBS microdata users by making data harmonization scripts openly and easily accessible.
While the CBS microdata infrastructure facilitates ground-breaking research, it remains a challenge for researchers to manage the vast number of data files and their documentation, and to link the data. Why is data harmonization challenging? The CBS microdata come in different file formats (i.e. -bus and -tab files with observations by year, month or period), file paths (which change when files are moved or new versions are published), subject areas with data spread over multiple (sub)folders, and naming conventions. Currently, most CBS microdata users spend a considerable amount of time harmonizing the same data, reinventing the wheel.
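To make the idea concrete, here is a minimal sketch of what harmonizing two yearly files with different naming conventions can look like. The scripts in this repository are written in R; the sketch uses Python only for illustration. The file contents, column names and the `harmonize` helper are all hypothetical and do not correspond to real CBS files or variables.

```python
import csv
import io

# Hypothetical yearly extracts with the same content but different column names.
# These are made-up stand-ins, not real CBS files or variable names.
RAW_FILES = {
    2019: "person_id;municipality\nP1;M01\nP2;M02\n",
    2020: "PERSON_ID;mun_code\nP1;M01\nP3;M03\n",
}

# Per-year mapping from the source column name to the harmonized column name.
COLUMN_MAPS = {
    2019: {"person_id": "person_id", "municipality": "municipality_code"},
    2020: {"PERSON_ID": "person_id", "mun_code": "municipality_code"},
}


def harmonize(raw_files, column_maps):
    """Stack yearly extracts into one list of rows with uniform column names."""
    rows = []
    for year in sorted(raw_files):
        reader = csv.DictReader(io.StringIO(raw_files[year]), delimiter=";")
        mapping = column_maps[year]
        for record in reader:
            # Keep track of which yearly file each row came from.
            row = {"year": year}
            for source_name, target_name in mapping.items():
                row[target_name] = record[source_name]
            rows.append(row)
    return rows


harmonized = harmonize(RAW_FILES, COLUMN_MAPS)
for row in harmonized:
    print(row)
```

Keeping the per-year column mapping in one place means that a renamed variable or a new yearly file only requires a change to the mapping, not to the code that stacks the files.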
In the project Children and (future) Parents, supported by Prediction and Professionals in Prevention, to improve Opportunity, researchers work with CBS microdata on child health and development, demographic variables and parental characteristics. We have harmonized microdata on eleven data topics and have made the R scripts openly accessible via GitHub. We also encourage other users of CBS microdata to publish their harmonization scripts in this GitHub repository.
This communal data harmonization initiative for users of CBS microdata provides a range of benefits: it increases the visibility of harmonization efforts, improves reproducibility, and reduces time-consuming duplicate work. Not only does this initiative promote the implementation of Open Science and FAIR principles, it may also stimulate users of CBS microdata to cooperate.
For questions and/or comments: m.m.j.hendriks@ese.eur.nl or www.linkedin.com/in/mirthe-hendriks. Would you like to be involved in further developing this initiative? Please reach out!