OpenSDP is an open community for education data analysts. In the spring of 2017, I worked with the Strategic Data Project at the Center for Education Policy Research at Harvard University to develop and launch this community site for analysts across the country. I was responsible for building the R versions of the tutorials published on the site.
I also developed a method for generating realistic synthetic education data. The goal of this project was to let users collaborate across agencies without a data sharing agreement, and then apply the code they develop together to their own original data:
OpenSDPsynthR is not actually a dataset; it is a data simulation package written in R. There are advantages to using simulation to generate synthetic data. The data can become richer and more complex over time as the simulation code is tuned and extended. Eventually, it’s possible that simulation could leverage publicly available data to generate synthetic versions of any school district on demand.
The system is available on GitHub. It takes advantage of Markov chains and hierarchical models to generate realistic synthetic data that is responsive to parameters specified by the user. Future work will allow users to give high-level template descriptions of the data they want to simulate and get realistic education data back.
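To make the combination of Markov chains and hierarchical models concrete, here is a minimal sketch of the general idea. It is written in Python for illustration only and is not the package's R code or its API. Every name, state label, and parameter value below is a hypothetical assumption. A two-level model gives each school its own random effect on student scores, and a first-order Markov chain generates a yearly status flag for each student.

```python
import random

# Hypothetical transition probabilities for a yearly yes/no status flag.
# The values are illustrative and not taken from the package.
TRANSITIONS = {
    "no": {"no": 0.9, "yes": 0.1},
    "yes": {"no": 0.3, "yes": 0.7},
}


def simulate_chain(rng, start, n_years, transitions=TRANSITIONS):
    """Walk a first-order Markov chain and return one state per year."""
    states = [start]
    for _ in range(n_years - 1):
        probs = transitions[states[-1]]
        # Draw the next state using only the current state's probabilities.
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        states.append(nxt)
    return states


def simulate_scores(rng, n_schools, n_students,
                    grand_mean=50.0, school_sd=5.0, student_sd=10.0):
    """Two-level model: score = grand mean + school effect + student noise."""
    rows = []
    for school in range(n_schools):
        # One random effect per school, shared by all of its students.
        school_effect = rng.gauss(0.0, school_sd)
        for student in range(n_students):
            score = grand_mean + school_effect + rng.gauss(0.0, student_sd)
            rows.append({"school": school, "student": student, "score": score})
    return rows


def simulate_district(seed=1, n_schools=3, n_students=4, n_years=5):
    """Combine both pieces; the parameters stand in for user-supplied settings."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    rows = simulate_scores(rng, n_schools, n_students)
    for row in rows:
        row["status_by_year"] = simulate_chain(rng, "no", n_years)
    return rows


if __name__ == "__main__":
    for row in simulate_district()[:3]:
        print(row)
```

Changing the standard deviations or the transition probabilities changes the character of the simulated district, which is the sense in which synthetic data can respond to parameters specified by the user.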