Google launches VaultGemma, a new AI designed to protect your data

Google has released a new artificial intelligence model named VaultGemma. This experimental project focuses on a major concern in AI: preventing models from leaking the private information they were trained on.

The challenge for AI developers is that large language models, or LLMs, are trained on massive amounts of internet data. Sometimes, these models can accidentally “memorize” and later reproduce sensitive details from that data. This could include personal user information or even copyrighted material, leading to privacy violations and legal issues.

To address this, the Google Research team used a technique called “differential privacy.” In simple terms, this method adds carefully calibrated noise, or randomness, to the training process while the AI is learning. This makes it much harder for the final model to memorize and reproduce any specific piece of its training data.
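For readers who want to see the idea in code, here is a minimal sketch of one noisy training step. It assumes the common clip-then-add-noise recipe (often called DP-SGD), which this article does not spell out. The function names and toy numbers are hypothetical and are not taken from Google's code.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Assumption: noise is scaled to the clipping bound, as in DP-SGD.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

# Toy batch: the second "example" has an outsized gradient that gets clipped.
grads = [[0.3, -0.4], [30.0, 40.0], [0.0, 0.5]]
rng = random.Random(0)
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping limits how much any single training example can move the model, and the added noise hides whatever influence remains. A larger noise multiplier gives stronger privacy but a less accurate update.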

However, adding this privacy protection has traditionally made AI models less accurate and more expensive to train. Google says its new research is the first to thoroughly work out how differential privacy changes the way AI models scale. The researchers found that to maintain good performance while adding privacy, developers need to balance three budgets: data, computing power, and privacy.

VaultGemma is the first real-world model to come from this research. It is relatively small, with 1 billion parameters, and is based on the Gemma 2 architecture. Google states that, despite its privacy-focused training, VaultGemma performs as well as similar-sized models that do not have these privacy features.

This development is likely most important for smaller, specialized AI tools rather than giant general-purpose models, as the research shows differential privacy works best at a smaller scale. Google hopes its findings will help other developers efficiently build more private AI systems.

VaultGemma is available for researchers and developers to download now from the Hugging Face and Kaggle platforms. It is offered with “open weights,” meaning the trained model parameters can be downloaded, used, and modified. However, it is not fully open source: users must agree to Google’s terms of use, which prohibit misuse and require that the license be passed along with any modified versions.

Source: Google
