Today I will be talking about deduplication and its compatibility - or lack of compatibility - with modern encryption.

Let's start by discussing deduplication in general. Deduplication is a technique aimed at eliminating duplicate copies of repeating data. Its primary purpose is to reduce storage space and improve data management efficiency. If your computer has a deduplicated filesystem and you have many copies of a large file, the filesystem will only store one copy of the file and reference it multiple times. This can save a lot of space. The benefits of deduplication are: 1. Storage Savings: By removing duplicate data, the overall storage requirement is significantly reduced. 2. Cost Reduction: With less data to store, the costs associated with storage are lower. 3. Improved Efficiency: Backup and recovery processes are faster and more efficient since there is less data to manage.
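To make the "store one copy and reference it multiple times" idea concrete, here is a minimal sketch of a content-addressed store. The `DedupStore` class and its method names are hypothetical and only serve as an illustration; real deduplicating filesystems usually work on blocks or chunks rather than whole files.

```python
import hashlib


class DedupStore:
    """Hypothetical content-addressed store: one physical copy per distinct content."""

    def __init__(self):
        self.blobs = {}     # digest -> content (stored once)
        self.refcount = {}  # digest -> number of logical copies

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = content  # first copy: actually store it
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest  # the caller keeps this reference

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

    def physical_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
big_file = b"x" * 1_000_000
refs = [store.put(big_file) for _ in range(10)]  # ten logical copies
print(store.physical_bytes())  # 1000000, only one copy is physically stored
```

Ten logical copies cost the space of one, plus a small reference per copy.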

Deduplication has numerous real-world applications across various industries. Let's explore a few key use cases: 1. Backup and Disaster Recovery: Deduplication is crucial in backup systems. Imagine you created weekly backups of all your files. From one week to another, most of your data wouldn't change, and saving dozens of copies of the same file would waste a lot of space. 2. Virtualization: In virtual machine environments, deduplication reduces storage requirements by eliminating redundant virtual machine images. This is particularly useful in data centers where multiple similar VMs are deployed. 3. Email Systems: Email platforms can significantly reduce storage needs by removing duplicate attachments and identical emails across user accounts. This optimizes storage and improves system performance. 4. Cloud Storage Services: Providers such as Dropbox, Google Drive, and Amazon S3 utilize deduplication to reduce storage costs by eliminating redundant data. This ensures efficient use of storage resources. We will be focusing on the last use case.

You may have noticed that there is a slight conflict of interest between cloud platforms and their users. Users will always want the greatest possible privacy and security, while cloud platforms will want to save as much space as possible. As you recall from the lecture, modern encryption schemes are generally not deterministic. If two different users encrypt the same message, they will get different ciphertexts. This isn't good for the deduplication system, which relies on identical ciphertexts to identify and remove duplicates. To address this incompatibility, cloud providers use Message-Locked Encryption (MLE).
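The conflict can be shown in a few lines. This is only a sketch: `keystream_xor` is a toy stream cipher built from SHA-256 that stands in for a real randomized scheme such as AES-GCM, and all function names are made up for this illustration. It must not be used for real data.

```python
import hashlib
import os


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (illustration only): XOR data with SHA-256(key || counter)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def randomized_encrypt(key: bytes, message: bytes) -> bytes:
    nonce = os.urandom(16)  # fresh randomness for every encryption
    return nonce + keystream_xor(key + nonce, message)


def randomized_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    return keystream_xor(key + nonce, body)


message = b"the same holiday photo"
alice_key, bob_key = os.urandom(32), os.urandom(32)
alice_ct = randomized_encrypt(alice_key, message)
bob_ct = randomized_encrypt(bob_key, message)

# The server only sees the ciphertexts, and they do not match,
# so it has no way of telling that both users stored the same file.
print(alice_ct == bob_ct)
```

Even if both users shared a key, the fresh nonce alone would make the two ciphertexts differ.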

MLE derives the encryption key from the message itself, ensuring that identical plaintexts produce identical ciphertexts. This way, we can achieve the storage efficiency of deduplication while maintaining data security.

Let's look at how MLE works. Just as a disclaimer: There are many different variants of MLE, and this is an explanation of the basic variant, called Convergent Encryption (CE). First, the key is derived from the message itself using a cryptographic hash function. For example, the file is fed into SHA256, and the output is used as the key. Next, this derived key is used to encrypt the message. This ciphertext is uploaded to the cloud. The key is either retained by the user or encrypted in some other way and stored on the cloud platform. The cloud platform then hashes the ciphertext to create a tag. This tag is used to identify duplicate ciphertexts. Finally, when the user wants to retrieve the file, the key is retrieved, and the ciphertext is decrypted. This process ensures that the same message will always result in the same encryption key and therefore the same ciphertext.
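The steps above can be sketched as follows. The function names are hypothetical, and `keystream_xor` is a toy deterministic cipher built from SHA-256 that stands in for a real block cipher in a deterministic mode; it is for illustration only.

```python
import hashlib


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy deterministic cipher (illustration only): XOR with SHA-256(key || counter)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def ce_encrypt(message: bytes):
    key = hashlib.sha256(message).digest()    # pass 1: derive the key from the message
    ciphertext = keystream_xor(key, message)  # pass 2: encrypt under that key
    return key, ciphertext


def ce_tag(ciphertext: bytes) -> str:
    return hashlib.sha256(ciphertext).hexdigest()  # pass 3: done by the server


def ce_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return keystream_xor(key, ciphertext)


# Two users who never talk to each other encrypt the same file.
alice_key, alice_ct = ce_encrypt(b"lecture-recording.mp4 contents")
bob_key, bob_ct = ce_encrypt(b"lecture-recording.mp4 contents")
print(ce_tag(alice_ct) == ce_tag(bob_ct))  # True, so the server can deduplicate
```

The three commented passes over the data are what the faster variants try to cut down.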

The next variant is HCE1. The main difference to regular CE is that the tag is calculated by the client and appended to the ciphertext. The rationale for HCE1 is to offer better performance for the server, which can simply read the tag as the second part of the ciphertext rather than computing it by hashing the possibly long ciphertext. There is one major downside to HCE1: a malicious user could simply attach a different tag to their upload, which could be used to replace legitimate data with malicious data on the server. There is a variant called HCE2, where the user recalculates the tag after decryption to at least be able to recognize that the data was tampered with.
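A minimal sketch of the difference between HCE1 and HCE2, under two assumptions that are not stated above: the tag is taken to be the hash of the key, and `keystream_xor` is a toy cipher used purely for illustration. All function names are hypothetical.

```python
import hashlib

TAG_LEN = 32  # length of a SHA-256 digest


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy deterministic cipher (illustration only): XOR with SHA-256(key || counter)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = H(key + (i // 32).to_bytes(8, "big"))
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def hce_encrypt(message: bytes):
    key = H(message)
    tag = H(key)  # assumption: tag = hash of the key, computed by the client
    return key, keystream_xor(key, message) + tag


def server_read_tag(ciphertext: bytes) -> bytes:
    return ciphertext[-TAG_LEN:]  # no hashing of the long ciphertext needed


def hce1_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return keystream_xor(key, ciphertext[:-TAG_LEN])  # the tag is not checked


def hce2_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    message = hce1_decrypt(key, ciphertext)
    if H(H(message)) != server_read_tag(ciphertext):  # recompute the tag
        raise ValueError("tag mismatch: the ciphertext was tampered with")
    return message


key, ciphertext = hce_encrypt(b"design document v3")
print(hce2_decrypt(key, ciphertext))
```

HCE2 pays for its check with an extra hash at decryption time, while the server-side cost stays the same as in HCE1.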

In response to the downsides of existing schemes, researchers set out to develop a new scheme, Randomized Convergent Encryption (RCE), which offers better performance than HCE1 and better security. [Explain it.] The advantage of this scheme is that it only makes one pass over the data, compared to two passes for HCE and three passes for CE.

There are downsides to MLE. One of the most significant limitations is that no MLE scheme can provide privacy for predictable messages. Predictable messages are those drawn from a space of polynomial size. This limitation rules out classical semantic security. An attacker with access to the ciphertext could potentially guess the message and derive the key, compromising the security of the system. This likely won't be an issue for storing things like private keys, since they are not predictable, but it is something to keep in mind. For example, if a company's HR department sent out a standard contract to all employees, an attacker could potentially guess things like a co-worker's salary or social security number, derive the key, and decrypt the message for confirmation.
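The HR example can be sketched as a simple guessing loop. The contract template, the salary range, and the toy cipher `keystream_xor` are all made up for this illustration; the point is only that the attacker needs nothing but the ciphertext and a small set of candidate messages.

```python
import hashlib


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy deterministic cipher (illustration only): XOR with SHA-256(key || counter)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def ce_encrypt(message: bytes) -> bytes:
    key = hashlib.sha256(message).digest()  # the key depends only on the message
    return keystream_xor(key, message)


# Hypothetical standard contract; only the salary differs per employee.
TEMPLATE = "Employment contract. Employee: Bob. Annual salary: {} EUR."

observed_ct = ce_encrypt(TEMPLATE.format(57500).encode())  # seen by the attacker


def guess_salary(ciphertext: bytes):
    """Try every plausible salary and compare the resulting ciphertexts."""
    for salary in range(30000, 150000, 500):
        if ce_encrypt(TEMPLATE.format(salary).encode()) == ciphertext:
            return salary
    return None


print(guess_salary(observed_ct))  # 57500
```

The loop needs at most 240 encryptions, which shows why a small message space leaves no room for privacy.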

[Explain example.] Even if one ignores the possibility of a cyberattack, there is still the issue of privacy. The *main goal* of encryption is to keep your data secret. As this example shows, however, cloud providers can and do scan your data and compare it to a set of known files.

Another issue is that MLE can be vulnerable to side-channel attacks. In an attempt to reduce bandwidth, some MLE schemes let the user compute the tag and check if it is already present on the server. If it is, the user doesn't need to upload the file. This is a problem because an attacker could observe which tags are already present on the server and infer which files are stored there. This is a form of data leakage and could potentially compromise the privacy of the system.
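Here is a rough sketch of that leak. The `CloudServer` class and its `has_tag` check are hypothetical, and the deterministic ciphertexts are represented by opaque byte strings to keep the example short.

```python
import hashlib


def tag_of(ciphertext: bytes) -> str:
    return hashlib.sha256(ciphertext).hexdigest()


class CloudServer:
    """Hypothetical server that supports client-side deduplication."""

    def __init__(self):
        self.stored = {}  # tag -> ciphertext

    def has_tag(self, tag: str) -> bool:
        return tag in self.stored  # this answer is the side channel

    def upload(self, ciphertext: bytes) -> None:
        self.stored.setdefault(tag_of(ciphertext), ciphertext)


def client_upload(server: CloudServer, ciphertext: bytes) -> int:
    """Return the number of bytes actually sent to the server."""
    if server.has_tag(tag_of(ciphertext)):
        return 0  # bandwidth saved, but the client learned the file exists
    server.upload(ciphertext)
    return len(ciphertext)


server = CloudServer()
client_upload(server, b"<deterministic ciphertext of leaked-memo.pdf>")  # a victim

# The attacker probes with files they suspect are stored on the server.
candidates = {
    "leaked-memo.pdf": b"<deterministic ciphertext of leaked-memo.pdf>",
    "unrelated.pdf": b"<deterministic ciphertext of unrelated.pdf>",
}
present = [name for name, ct in candidates.items() if server.has_tag(tag_of(ct))]
print(present)  # ['leaked-memo.pdf']
```

The attacker never sees the victim's data; the yes-or-no answer from the server is enough.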

MLE schemes are also vulnerable to duplicate faking attacks. In this scenario, a user wishes to upload a file at a later date. An attacker who knows, or at least can guess, this file uploads a different file but uses the tag of the user's file. When the user wants to upload their file, the server discards it as a duplicate of the attacker's file. If the user then wants to retrieve their file, they will only get the attacker's file. HCE2 and RCE allow a user to notice that the data was tampered with, but the data is still lost.
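The attack can be walked through against a server that trusts client-supplied tags. As before, this is only a sketch: the tag is assumed to be the hash of the key, `keystream_xor` is a toy cipher, and `TrustingServer` is a hypothetical name.

```python
import hashlib

TAG_LEN = 32


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy deterministic cipher (illustration only): XOR with SHA-256(key || counter)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = H(key + (i // 32).to_bytes(8, "big"))
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


class TrustingServer:
    """Hypothetical server that reads the tag from the end of the ciphertext."""

    def __init__(self):
        self.by_tag = {}

    def upload(self, ciphertext: bytes) -> bool:
        tag = ciphertext[-TAG_LEN:]
        if tag in self.by_tag:
            return False  # treated as a duplicate and discarded
        self.by_tag[tag] = ciphertext
        return True


server = TrustingServer()
target = b"quarterly report"  # the file the victim will upload later

# Attacker: guesses the file, derives its key and tag, attaches them to other data.
fake_ct = keystream_xor(H(target), b"attacker's content") + H(H(target))
server.upload(fake_ct)

# Victim: honest upload of the real file is silently dropped.
key = H(target)
accepted = server.upload(keystream_xor(key, target) + H(key))

# Victim: download and decrypt. An HCE2-style check notices the swap.
received = server.by_tag[H(key)]
recovered = keystream_xor(key, received[:-TAG_LEN])
tampering_detected = H(H(recovered)) != received[-TAG_LEN:]
print(accepted, recovered, tampering_detected)
```

The check tells the victim that something is wrong, but the server never stored the real file, so there is nothing left to recover.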

Since MLE does not satisfy standard security notions, such as IND-CPA, new security notions have been introduced. One of these is PRV-CDA, which states that the encryptions of two unpredictable messages should be indistinguishable. Another is PRV$-CDA, which states that the encryption of an unpredictable message must be indistinguishable from a random string of the same length. There are also adaptive versions of these security notions, which allow the adversary to adaptively choose messages and receive encryptions of them. These are denoted by the -A suffix, as in PRV-CDA-A. The other two security notions are tag consistency (TC) and strong tag consistency (STC). TC states that an adversary cannot make a client recover an incorrect message, while STC states that an adversary cannot erase a client's message.

As an example of an improvement upon standard MLE, I'd like to briefly talk about the DupLESS scheme. DupLESS was proposed by three researchers and seems to be aimed at large corporations that wish to outsource their data storage. The main idea behind DupLESS is to use a key server to help derive encryption keys, ensuring secure deduplication. The DupLESS scheme involves three main components: 1. The Client, which uploads data to the cloud. 2. The Key Server, which provides encryption keys derived from the file content. This key server is crucial for ensuring that identical files get the same encryption key. The idea is that a large corporation could run their own key server. The key server would also only respond to requests from authenticated clients. 3. The Cloud Storage Provider, which stores the deduplicated encrypted data.

The DupLESS scheme offers increased security as long as the key server remains intact. Since the key server is responsible for generating encryption keys, external adversaries cannot try guessing messages. If a client becomes compromised, an attacker would still be unable to perform a brute-force attack on the key server, as it employs rate-limiting. In the event of a key server failure, the system falls back to the security level of regular MLE.
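The role of the key server can be sketched like this. It is a heavily simplified model and not the actual DupLESS protocol: the `KeyServer` class is hypothetical, the rate limit is a fixed request budget instead of a time window, and the client sends the file hash in the clear. As far as I know, the real scheme uses an oblivious protocol so that the key server learns nothing about the file, which is left out here.

```python
import hashlib
import hmac
import os


class KeyServer:
    """Hypothetical key server: derives keys from file hashes under a secret."""

    def __init__(self, authorized_clients, max_requests_per_client: int):
        self._secret = os.urandom(32)  # never leaves the key server
        self._authorized = set(authorized_clients)
        self._max_requests = max_requests_per_client
        self._used = {}  # client id -> requests made so far

    def derive_key(self, client_id: str, file_hash: bytes) -> bytes:
        if client_id not in self._authorized:
            raise PermissionError("client is not authenticated")
        used = self._used.get(client_id, 0)
        if used >= self._max_requests:
            raise RuntimeError("rate limit exceeded")  # slows down guessing
        self._used[client_id] = used + 1
        # Same file hash -> same key, but only with the help of the secret.
        return hmac.new(self._secret, file_hash, hashlib.sha256).digest()


server = KeyServer(authorized_clients={"alice", "bob"}, max_requests_per_client=3)
file_hash = hashlib.sha256(b"shared company handbook").digest()

alice_key = server.derive_key("alice", file_hash)
bob_key = server.derive_key("bob", file_hash)
print(alice_key == bob_key)  # True, so deduplication still works
```

An outsider cannot derive keys at all without the secret, and a compromised client only gets as many guesses as the rate limit allows.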

To summarize, deduplication is crucial for efficient data management, especially in large-scale storage systems like cloud services. Message-Locked Encryption (MLE) enables deduplication of encrypted data by deriving keys from the message itself. However, it is limited by privacy concerns and security vulnerabilities, such as exposure to predictable message attacks and potential side-channel attacks. Thank you for your attention, and I'm happy to take any questions you might have.

Now, I would like to open the floor for any questions you might have about MLE and secure deduplication. Feel free to ask anything that needs further clarification or any specific aspects you're curious about.

Deduplication – security and schemes

Overview

Deduplication

Concept and Benefits

Deduplication

Use Cases

Cloud Deduplication vs. Conventional Encryption

What is Message-Locked Encryption (MLE)?

How does it work?

Convergent Encryption (CE)

How does it work?

HCE1

How does it work?

RCE

Limitations

Predictable files

Limited Privacy

Bandwidth benefits Side channel attack

Duplicate faking attacks

Security notions

DupLESS Scheme

Overview

DupLESS Scheme

Advantages

Conclusion

Key Takeaways

Q&A

Questions?

Thank You!