MIT researchers introduce generative AI for databases | MIT News

The new tool makes it easier for database users to perform complex statistical analyses on tabular data without having to know what’s going on behind the scenes.

GenSQL, a generative artificial intelligence system for databases, could help users make predictions, detect anomalies, guess missing values, correct errors, or generate synthetic data with a few keystrokes.

For example, if the system were used to analyze medical data from a patient who had always had high blood pressure, it might pick up a blood pressure reading that is low for that particular patient but would otherwise be in the normal range.
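
The blood pressure example can be illustrated with a much simpler stand-in for a generative model. The sketch below is not GenSQL: it is a plain z-score check against one patient's own history, and the readings, the "normal" range, the threshold, and the function names are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical population-level "normal" systolic range in mmHg (an assumption).
NORMAL_RANGE = (90, 120)

def in_normal_range(reading, normal_range=NORMAL_RANGE):
    """Return True if the reading falls inside the population-level range."""
    low, high = normal_range
    return low <= reading <= high

def is_unusual_for_patient(history, reading, threshold=3.0):
    """Flag a reading that lies more than `threshold` sample standard
    deviations away from the mean of this patient's own history."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > threshold

# Invented readings for a patient whose blood pressure has always been high.
history = [150, 155, 148, 152, 151, 149]
new_reading = 118

print(in_normal_range(new_reading))                  # normal for the population
print(is_unusual_for_patient(history, new_reading))  # unusual for this patient
```

A check against the population range alone would accept the new reading, while the per-patient check flags it, which is the kind of individual-level anomaly the article describes.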

GenSQL automatically integrates a tabular dataset and a generative probabilistic AI model that can account for uncertainty and adjust its decision-making based on new data.

Additionally, GenSQL can be used to produce and analyze synthetic data that mimics the real data in the database. This could be particularly useful in situations where sensitive data cannot be shared, such as patient health records, or where the actual data is scarce.

This new tool is built on SQL, a programming language for creating and manipulating databases that was introduced in the late 1970s and is used by millions of developers worldwide.

“Historically, SQL taught the business world what a computer could do. They didn’t have to write their own programs, they just had to query the database in a high-level language. We think that as we move from just querying data to asking questions of models and data, we’re going to need an analogous language that teaches people the coherent questions you can ask a computer that has a probabilistic model of the data,” says Vikash Mansinghka ’05, MEng ’09, PhD ’09, senior author of the paper introducing GenSQL and a principal research scientist and leader of the Probabilistic Computing Project in MIT’s Department of Brain and Cognitive Sciences.

When the researchers compared GenSQL to popular AI-based approaches to data analysis, they found that it was not only faster, but also produced more accurate results. Importantly, the probabilistic models used by GenSQL are interpretable so that users can read and edit them.

“If you look at the data and try to find some meaningful patterns just by using simple statistical rules, you can miss important interactions. You really want to capture the correlations and dependencies of the variables, which can be quite complicated, in the model. With GenSQL, we want to enable a large set of users to query their data and their model without having to know all the details,” adds lead author Mathieu Huot, a researcher in the Department of Brain and Cognitive Sciences and a member of the Probabilistic Computing Project.

They are joined on the paper by Matin Ghavami and Alexander Lew, MIT graduate students; Cameron Freer, a research associate; Ulrich Schaechtle and Zane Shelby of Digital Garage; Martin Rinard, an MIT professor in the Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and Feras Saad ’15, MEng ’16, PhD ’22, an assistant professor at Carnegie Mellon University. The research was recently presented at the ACM Conference on Programming Language Design and Implementation.

Combining models and databases

SQL, which stands for Structured Query Language, is a programming language for storing and manipulating information in a database. In SQL, people can ask questions about data using keywords, for example to sum, filter, or group database records.
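
As a minimal sketch of what summing, filtering, and grouping look like in ordinary SQL, the example below runs one query through Python's built-in sqlite3 module. The table, its columns, and the rows are hypothetical and exist only for illustration; nothing here is specific to GenSQL.

```python
import sqlite3

# Hypothetical toy rows: (city, language, salary in thousands). Invented values.
ROWS = [
    ("Seattle", "Rust", 150),
    ("Seattle", "Python", 140),
    ("Boston", "Rust", 130),
    ("Boston", "Python", 120),
    ("Boston", "Python", 110),
]

def salary_by_city(min_salary):
    """Filter rows by salary (WHERE), group them by city (GROUP BY),
    and count and sum each group (COUNT, SUM)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE developers (city TEXT, language TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO developers VALUES (?, ?, ?)", ROWS)
    query = """
        SELECT city, COUNT(*), SUM(salary)
        FROM developers
        WHERE salary >= ?
        GROUP BY city
        ORDER BY city
    """
    result = conn.execute(query, (min_salary,)).fetchall()
    conn.close()
    return result

print(salary_by_city(120))
```

A query like this summarizes the records that are already stored; it does not say anything about rows that are missing or about what a new individual might look like.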

However, querying a model can provide deeper insight, because models can capture what the data imply for an individual. For example, a developer who wonders whether she is underpaid is probably more interested in what the salary data mean for her individually than in trends from database records.

The researchers noted that SQL does not provide an efficient way to incorporate probabilistic AI models, but at the same time, approaches that use probabilistic models to make inferences do not support complex database queries.

They created GenSQL to fill this gap, allowing someone to query both a dataset and a probabilistic model using a straightforward but powerful formal programming language.

A GenSQL user uploads their data and a probabilistic model, which the system automatically integrates. They can then run queries on the data that also get input from the probabilistic model running behind the scenes. This not only allows for more complex queries, but can also provide more accurate answers.

For example, a query in GenSQL might be something like, “What is the probability that a developer from Seattle knows the Rust programming language?” If you look only at correlations between columns in a database, you can miss subtle dependencies. Incorporating a probabilistic model can capture more complex interactions.
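
To make the Seattle and Rust question concrete, here is a deliberately simple sketch of a conditional probability estimate with add-one smoothing. It is not GenSQL syntax and not the kind of generative model the researchers use; the rows, the function name, and the smoothing choice are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical survey rows: (city, knows_rust). Invented values.
ROWS = [
    ("Seattle", True), ("Seattle", True), ("Seattle", False),
    ("Boston", False), ("Boston", True), ("Boston", False), ("Boston", False),
]

def prob_knows_rust(city, rows, pseudo=1):
    """Estimate P(knows_rust | city) by counting matching rows,
    adding `pseudo` imaginary 'yes' and 'no' observations so that
    cities with little or no data do not yield 0, 1, or a division by zero."""
    in_city = [knows for c, knows in rows if c == city]
    yes = sum(in_city)
    return Fraction(yes + pseudo, len(in_city) + 2 * pseudo)

print(prob_knows_rust("Seattle", ROWS))
print(prob_knows_rust("Austin", ROWS))  # no rows for this city
```

This estimate conditions on a single column. A probabilistic model of the whole table could also account for other columns, such as employer or years of experience, which is where the more complex interactions mentioned above come in.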

The probabilistic models that GenSQL uses are also auditable, so people can see which data the model uses to make decisions. In addition, these models provide a measure of calibrated uncertainty along with each answer.

For example, with this calibrated uncertainty, if someone asks the model about the predicted outcomes of various cancer treatments for a patient from a minority group that is underrepresented in the dataset, GenSQL would tell the user that it is uncertain, and how uncertain it is, rather than overconfidently advocating for the wrong treatment.
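
One way to see why an underrepresented group should come with a wider uncertainty is a Beta-Binomial sketch. This is a textbook Bayesian estimate under a uniform prior, used here only as an illustration; it is not GenSQL's model, and the counts and function name are invented.

```python
from math import sqrt

def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return (mean, standard deviation) of the Beta posterior over a
    success rate, given observed successes out of trials and a
    Beta(prior_a, prior_b) prior (uniform by default)."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    total = a + b
    mean = a / total
    variance = (a * b) / (total ** 2 * (total + 1))
    return mean, sqrt(variance)

# Hypothetical treatment outcomes: same observed rate, very different sample sizes.
small_mean, small_sd = beta_posterior(2, 3)      # underrepresented group
large_mean, large_sd = beta_posterior(200, 300)  # well-represented group

print(f"small group: {small_mean:.2f} +/- {small_sd:.2f}")
print(f"large group: {large_mean:.2f} +/- {large_sd:.2f}")
```

Both groups show roughly two successes in three, but the posterior for the small group is far more spread out, so an honest answer for that group has to carry a much larger margin.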

Faster and more accurate results

To evaluate GenSQL, the researchers compared their system to popular baseline methods that use neural networks. GenSQL was between 1.7 and 6.8 times faster than these approaches, executing most queries in a few milliseconds while providing more accurate results.

They also used GenSQL in two case studies: in one, the system identified mislabeled data from clinical trials, and in the other, it generated accurate synthetic data that captured complex relationships in genomics.

Next, researchers want to use GenSQL more broadly to conduct large-scale modeling of human populations. With GenSQL, they can generate synthetic data to make inferences about things like health and salary, while controlling what information is used in the analysis.

They also want to make GenSQL easier to use and more powerful by adding new optimizations and automation to the system. In the long term, the researchers want to enable users to make natural language queries in GenSQL. Ultimately, their goal is to develop a ChatGPT-like AI expert that one could talk to about any database and that grounds its answers in GenSQL queries.

This research is funded in part by the Defense Advanced Research Projects Agency (DARPA), Google, and the Siegel Family Foundation.
