Cutting down on database maintenance with automated tools
In a world reliant on giant databases, few recurring software maintenance tasks can cause as much headache for a developer as a new database schema. A database’s schema defines the name, type, format, and meaning of the data stored there. When the data needs for an organization or project change, their database needs to change to match them — and with it, every application that interacts with their database. This ripple effect can lead to days or weeks of downstream labor in rewriting and repairing existing software.
And schema changes are by no means rare. In four and a half years, Wikipedia logged over 170 schema versions. Each evolution caused up to 70% of software tests to fail, requiring major effort to revise their backend.
“In general, evolving database applications is extremely labor-intensive,” says Xinyu Wang, assistant professor of computer science and engineering at the University of Michigan. “This often accounts for 40-75% of the overall development cost.”
Wang is working to alleviate this headache with a new NSF-funded project, “Formal Methods for Evolving Database Applications.” His goal is to devise automated tools that can not only help reformat data to meet a new schema’s requirements, but even update the code that interacts with it.
The process of a schema change begins with the data itself. Schema changes can involve anything from new relationships between tables, additional columns, new conditional logic, all the way up to discarding the database’s structure for something completely new. Before they can do any transferring, developers need to understand the new schema and come up with a way to map existing data to the new format. They can then code up the data transformation and test it.
“But that’s not the end of the story,” Wang says.
The surrounding application code, which relies on the database’s old syntax, also has to be migrated. This includes any code meant to query the database or update it with new entries — which often means nearly everything. There may also be performance implications. It’s possible that naively converting the previous query to a new one under the new schema would make the query slower than ideal.
“You have to be careful about both the functional correctness as well as the performance aspect during the application migration process,” Wang explains.
Wang’s project aims to automate this entire pipeline using program synthesis and formal verification, which are methods of constructing a program that provably satisfies some formal specification. This technique has been used previously to prove software applications are safe to use without bugs, but database applications introduce new challenges.
“Data migration must respect the schema as well as other integrity constraints that the database must follow, which makes it hard.” says Wang. ” Database-backed applications consist of code written in both imperative languages (such as PHP) and declarative languages (like SQL), making them very difficult to migrate.”
To automate this process, Wang needs to be able to demonstrate that two programs operate equivalently, one on the old schema and one with the new. He can then verify when programs generated automatically match the specifications. He’s using a paradigm called counterexample-guided inductive synthesis (CEGIS), which involves searching through candidate programs and generating counterexamples to prove which don’t match the expected behavior. This gives a generator additional information it can use to construct better programs, dodging the errors of its predecessors.
“This significantly helps speed up the synthesis process, making it faster to find a target application that is equivalent,” Wang says. Along the way, he adds, using formal techniques can make the process less error-prone and help eliminate bugs.
Ultimately, this automation can turn a routine time and money sink into a much smoother process that impacts nearly every industry. The project is co-led by collaborators at the University of Texas at Austin, and was awarded a $725,000 grant as part of NSF’s Software and Hardware Foundations program.