Deduplication – security and schemes

Sebastian Hietsch

Symmetric Cryptography
Sebastian Hietsch

Overview

  • What is deduplication?
  • Message-Locked Encryption (MLE)
  • Limitations and privacy concerns
  • Security notions
  • DupLESS scheme

Deduplication

Concept and Benefits

  • Definition: eliminating duplicate copies of data
  • Purpose: reduce storage space
  • Benefits:
    • Storage savings
    • Cost reduction
    • Improved efficiency

Deduplication

Use Cases

  • Backup and Disaster Recovery: only changed files take up space
  • Virtualization: redundant virtual machine images
  • Email Systems: duplicate email attachments and emails across users
  • Cloud Storage Services: reduce storage costs

Cloud Deduplication vs. Conventional Encryption

  • Incompatibility:

    • Encryption Variability: different keys produce different ciphertexts
    • Deduplication Requirement: needs identical ciphertexts
  • Solution:

    • Message-Locked Encryption (MLE): derive keys from the message itself

What is Message-Locked Encryption (MLE)?

  • Definition: A cryptographic technique where the encryption key is derived from the message itself.
  • Purpose: identical plaintexts yield identical ciphertexts.
  • Benefit: deduplication of encrypted data, storage efficiency.

How does it work?

Convergent Encryption (CE)

  1. Key Derivation:
    • Key is derived from message using cryptographic hash function
  2. Encryption:
    • The derived key encrypts the message.
  3. Tagging:
    • Hash of ciphertext is used for deduplication.
  4. Decryption:
    • The key (retained) decrypts the message.

How does it work?

HCE1

  1. Key Derivation:
    • Key is derived from message using cryptographic hash function
  2. Encryption:
    • The derived key encrypts the message; the tag is appended to the ciphertext.
  3. Tagging:
    • Hash of the key is used for deduplication; HCE2 recomputes the tag after decryption.
  4. Decryption:
    • The key (retained) decrypts the message.

How does it work?

RCE

  1. Key Derivation:
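    • A random key is generated and encrypted with the key derived from the message.
  2. Encryption:
    • The random key encrypts the message; the encrypted key and the tag are appended to the ciphertext.
  3. Tagging:
    • Hash of the derived key is used for deduplication.
  4. Decryption:
    • The retained key decrypts the message key, which decrypts the message; the user then checks the tag.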

Limitations

Predictable files

No MLE scheme can provide privacy for predictable messages,
meaning ones drawn from a space of polynomial size, in particular ruling out classical semantic security.


Limited Privacy

Justin Roiland, co-creator of Rick and Morty, discovers that Dropbox uses content scanners through the deletion of all his data stored on their servers.


Bandwidth benefits: side channel attack

  • "Improvement":
    • let the user compute the tag
    • if tag already present, upload not necessary
  • Attack:
    • Side channel: observe which tags are already present
    • Data leakage: infer which files are stored

Duplicate faking attacks

  • User: wishes to upload a file at a later date
  • Attacker: uploads a different file, but uses the tag of the user's file
  • Server: discards the user's file as a duplicate of the attacker's file
  • (Partial) solution: HCE2/RCE recompute the tag after decryption, but the data is still lost

Security notions

  • PRV-CDA(-A): encryptions of two unpredictable messages should be indistinguishable
    • CE, HCE1, RCE
  • PRV$-CDA(-A): the encryption of an unpredictable message must be indistinguishable from a random string of the same length
    • CE, HCE1, RCE
  • TC: an adversary cannot make a client recover an incorrect message
    • CE, RCE
  • STC: an adversary cannot erase/overwrite a client’s message
    • CE

DupLESS Scheme

Overview

  • DupLESS transparently increases security of existing MLE schemes.
  • Main Idea: Uses a key server to manage encryption keys
  • Components:
    • Client: Uploads data to the cloud.
    • Key Server: Provides encryption keys derived from file content.
    • Cloud Storage Provider: Stores the deduplicated encrypted data.

DupLESS Scheme

Advantages

  • Better security with intact key server
  • Graceful degradation in case of key server failure

Conclusion

Key Takeaways

  • Deduplication:

    • Reduces storage space and costs.
    • Crucial for efficient data management.
  • Message-Locked Encryption (MLE):

    • Enables deduplication of encrypted data.
    • Limited by inherent security vulnerabilities.

Q&A

Questions?

Feel free to ask any questions about MLE and secure deduplication.


Thank You!



Today I will be talking about deduplication and its compatibility, or lack of compatibility, with modern encryption.

Let's start by discussing deduplication in general. Deduplication is a technique aimed at eliminating duplicate copies of data. Its primary purpose is to reduce storage space and improve data management efficiency. If your computer has a deduplicated filesystem and you have many copies of a large file, the filesystem will only store one copy of the file and reference it multiple times. This can save a lot of space. The benefits of deduplication are: 1. Storage Savings: By removing duplicate data, the overall storage requirement is significantly reduced. 2. Cost Reduction: With less data to store, the costs associated with storage are lower. 3. Improved Efficiency: Backup and recovery processes are faster and more efficient since there is less data to manage.
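
To make the "store one copy and reference it multiple times" idea concrete, here is a minimal sketch of a content-addressed store. The class `DedupStore` and its method names are made up for this illustration and do not describe any particular filesystem.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each distinct blob is kept exactly once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # content digest -> data
        self.files: dict[str, str] = {}    # file name -> content digest

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True only if new bytes had to be stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.files[name] = digest  # duplicates become a reference only
        return is_new

    def get(self, name: str) -> bytes:
        return self.blobs[self.files[name]]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
payload = b"x" * 1000
first = store.put("a.bin", payload)
second = store.put("copy_of_a.bin", payload)
```

After both calls, the store holds two file names but only one 1000-byte blob.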

Deduplication has numerous real-world applications across various industries. Let's explore a few key use cases: 1. Backup and Disaster Recovery: Deduplication is crucial in backup systems. Imagine you created weekly backups of all your files. From one week to another, most of your data wouldn't change and saving dozens of copies of the same file would waste a lot of space. 2. Virtualization: In virtual machine environments, deduplication reduces storage requirements by eliminating redundant virtual machine images. This is particularly useful in data centers where multiple similar VMs are deployed. 3. Email Systems: Email platforms can significantly reduce storage needs by removing duplicate attachments and identical emails across user accounts. This optimizes storage and improves system performance. 4. Cloud Storage Services: Providers such as Dropbox, Google Drive, and Amazon S3 utilize deduplication to reduce storage costs by eliminating redundant data. This ensures efficient use of storage resources. We will be focusing on the last use case.

You may have noticed that there is a slight conflict of interest between cloud platforms and their users. Users will always want the greatest possible privacy and security, while cloud platforms will want to save as much space as possible. As you recall from the lecture, modern encryption schemes are generally not deterministic. If two different users encrypt the same message, they will get different ciphertexts. This isn't good for the deduplication system, which relies on identical ciphertexts to identify and remove duplicates. To address this incompatibility, cloud providers use Message-Locked Encryption (MLE).
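
The incompatibility can be shown in a few lines. This is only a sketch: the `keystream` and `encrypt` functions are a toy SHA-256 counter-mode construction that I am assuming as a stand-in for a real cipher, and each user simply picks an independent random key.

```python
import hashlib
import os


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode); a stand-in, not a vetted cipher."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def encrypt(key: bytes, message: bytes) -> bytes:
    pad = keystream(key, len(message))
    return bytes(a ^ b for a, b in zip(message, pad))


message = b"identical file stored by two users"

# Each user encrypts under their own random key.
alice_ct = encrypt(os.urandom(32), message)
bob_ct = encrypt(os.urandom(32), message)

# The server can only compare what it sees: the ciphertexts (or their hashes).
alice_tag = hashlib.sha256(alice_ct).hexdigest()
bob_tag = hashlib.sha256(bob_ct).hexdigest()
```

The two tags differ even though the plaintexts are identical, so the server has nothing to deduplicate on.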

MLE derives the encryption key from the message itself, ensuring that identical plaintexts produce identical ciphertexts. This way, we can achieve the storage efficiency of deduplication while maintaining data security.

Let's look at how MLE works. Just as a disclaimer: There are many different variants of MLE and this is an explanation of the basic variant, called Convergent Encryption (CE). First, the key is derived from the message itself using a cryptographic hash function. For example, the file is fed into SHA-256, and the output is used as the key. Next, this derived key is used to encrypt the message. This ciphertext is uploaded to the cloud. The key is either retained by the user or encrypted in some other way and stored on the cloud platform. The cloud platform then hashes the ciphertext to create a tag. This tag is used to identify duplicate ciphertexts. Finally, when the user wants to retrieve the file, the key is retrieved, and the ciphertext is decrypted. This process ensures that the same message will always result in the same encryption key and therefore the same ciphertext.
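
The four CE steps can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are mine, and the cipher is a toy SHA-256 counter-mode keystream standing in for a real deterministic symmetric cipher.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode); a stand-in, not a vetted cipher."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, pad))


def ce_encrypt(message: bytes) -> tuple[bytes, bytes]:
    """Client: derive the key from the message, then encrypt with it."""
    key = hashlib.sha256(message).digest()
    ciphertext = xor_bytes(message, keystream(key, len(message)))
    return key, ciphertext


def ce_tag(ciphertext: bytes) -> str:
    """Server: the deduplication tag is the hash of the ciphertext."""
    return hashlib.sha256(ciphertext).hexdigest()


def ce_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Client: the retained key decrypts the ciphertext."""
    return xor_bytes(ciphertext, keystream(key, len(ciphertext)))


message = b"the same file, uploaded by two different users"
alice_key, alice_ct = ce_encrypt(message)
bob_key, bob_ct = ce_encrypt(message)
```

Both users end up with the same key, the same ciphertext, and therefore the same tag, which is exactly what the server needs in order to deduplicate.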

The next variant is HCE1. The main difference from regular CE is that the tag is the hash of the key rather than of the ciphertext, and that it is calculated by the client and appended to the ciphertext. The rationale for HCE1 is to offer better performance for the server, which can simply read the tag as the second part of the ciphertext rather than needing to compute it by hashing the possibly long ciphertext. There is one major downside to HCE1, which is that a malicious user could simply replace the tag with a different one, which could be used to replace legitimate data with malicious data on the server. There is a variant of this called HCE2, where the user recalculates the tag after decryption to at least be able to recognize that the data was tampered with.

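In response to the downsides of existing schemes, researchers set out to develop a new scheme, Randomized Convergent Encryption (RCE), which offers better performance than HCE1 and better security. The client picks a random key and encrypts the message with it. The random key is then itself encrypted with the key derived from the message, and this encrypted key is appended to the ciphertext together with the tag, which is the hash of the derived key. To decrypt, the user uses the retained derived key to recover the random key, decrypts the message, and checks the tag. The advantage of this scheme is that it only makes one pass over the data, compared to two passes for HCE and three passes for CE.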

There are downsides to MLE. One of the most significant limitations is that no MLE scheme can provide privacy for predictable messages. Predictable messages are those drawn from a space of polynomial size. This limitation rules out classical semantic security. An attacker with access to the ciphertext can guess the message, derive the key from the guess, and check whether encrypting the guess reproduces the ciphertext. This likely won't be an issue for storing things like private keys, since they are not predictable, but it is something to keep in mind. For example, if a company's HR department sent out a standard contract to all employees, an attacker could potentially guess things like a co-worker's salary or social security number, derive the key, and decrypt the message for confirmation.

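The example on the slide is Justin Roiland, co-creator of Rick and Morty, who found out that Dropbox uses content scanners when all of the data he had stored on their servers was deleted. Even if one ignores the possibility of a cyberattack, there is still the issue of privacy. The *main goal* of encryption is to keep your data secret. As this example shows, however, cloud providers can and do scan your data and compare it to a set of known files.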

Another issue is that MLE can be vulnerable to side channel attacks. In an attempt to reduce bandwidth, some MLE schemes let the user compute the tag and check if it is already present on the server. If it is, the user doesn't need to upload the file. This is a problem because an attacker could observe which tags are already present on the server and infer which files are stored there. This is a form of data leakage and could potentially compromise the privacy of the system.
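
A small sketch of this side channel. The `DedupServer` class, its `has_tag` check, and `client_upload` are hypothetical names for the "check before upload" flow, and the blobs are treated as already convergent-encrypted bytes.

```python
import hashlib


class DedupServer:
    """Toy server that lets clients ask whether a tag is already stored."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def has_tag(self, tag: str) -> bool:
        return tag in self._blobs

    def upload(self, tag: str, blob: bytes) -> None:
        self._blobs.setdefault(tag, blob)


def client_upload(server: DedupServer, ciphertext: bytes) -> bool:
    """Upload only if needed; return True if bytes were actually sent."""
    tag = hashlib.sha256(ciphertext).hexdigest()
    if server.has_tag(tag):
        return False  # bandwidth saved
    server.upload(tag, ciphertext)
    return True


server = DedupServer()
sent_first = client_upload(server, b"ciphertext of leaked-memo.pdf")
sent_again = client_upload(server, b"ciphertext of leaked-memo.pdf")

# The attacker never uploads anything and only asks the same question a client would.
probe_hit = server.has_tag(hashlib.sha256(b"ciphertext of leaked-memo.pdf").hexdigest())
probe_miss = server.has_tag(hashlib.sha256(b"ciphertext of other.pdf").hexdigest())
```

The same yes/no answer that saves the honest client bandwidth tells the attacker that somebody has stored the memo.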

MLE schemes are also vulnerable to duplicate faking attacks. In this scenario, a user wishes to upload a file at a later date. An attacker who knows, or at least can guess, this file uploads a different file but uses the tag of the user's file. When the user wants to upload their file, the server discards it as a duplicate of the attacker's file. If the user then wants to retrieve their file, they will only get the attacker's file. HCE2 and RCE allow a user to notice that the data was tampered with, but the data is still lost.
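
The attack can be simulated with two toy servers. Both classes are invented for this sketch: `TrustingServer` indexes blobs by whatever tag the client supplies (as with HCE1), while `VerifyingServer` ignores the claimed tag and hashes the blob itself (as with CE). The blobs are opaque placeholder bytes.

```python
import hashlib


class TrustingServer:
    """Indexes blobs by the client-supplied tag (HCE1-style)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def upload(self, tag: str, blob: bytes) -> bool:
        if tag in self.blobs:
            return False  # treated as a duplicate and discarded
        self.blobs[tag] = blob
        return True

    def download(self, tag: str) -> bytes:
        return self.blobs[tag]


class VerifyingServer(TrustingServer):
    """Recomputes the tag from the blob itself (CE-style)."""

    def upload(self, tag: str, blob: bytes) -> bool:
        return super().upload(hashlib.sha256(blob).hexdigest(), blob)


victim_blob = b"victim ciphertext"
# HCE1-style tag: derived from the key, which anyone who knows the file can compute.
victim_tag = hashlib.sha256(b"key derived from victim file").hexdigest()

trusting = TrustingServer()
trusting.upload(victim_tag, b"attacker garbage")        # attacker goes first
victim_stored = trusting.upload(victim_tag, victim_blob)
victim_gets = trusting.download(victim_tag)

verifying = VerifyingServer()
verifying.upload(victim_tag, b"attacker garbage")       # claimed tag is ignored
safe_stored = verifying.upload(victim_tag, victim_blob)
safe_gets = verifying.download(hashlib.sha256(victim_blob).hexdigest())
```

On the trusting server the victim's upload is dropped and the download returns the attacker's bytes; on the verifying server the victim's data is stored under its own tag.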

Since MLE does not satisfy standard security notions, such as IND-CPA, new security notions have been introduced. One of these is PRV-CDA, which states that the encryptions of two unpredictable messages should be indistinguishable. Another is PRV$-CDA, which states that the encryption of an unpredictable message must be indistinguishable from a random string of the same length. There are also adaptive versions of these security notions, which allow the adversary to adaptively choose messages and receive encryptions of them. These are denoted by the -A suffix. The other two security notions are TC and STC. TC states that an adversary cannot make a client recover an incorrect message, while STC states that an adversary cannot erase or overwrite a client’s message.

As an example of an improvement upon standard MLE, I'd like to briefly talk about the DupLESS scheme. DupLESS was proposed by three researchers and seems to be aimed at large corporations that wish to outsource their data storage. The main idea behind DupLESS is to use a key server to help derive encryption keys, ensuring secure deduplication. The DupLESS scheme involves three main components: 1. The Client, which uploads data to the cloud. 2. The Key Server, which provides encryption keys derived from the file content. This key server is crucial for ensuring that identical files get the same encryption key. The idea is that a large corporation could run their own key server. The key server would also only respond to requests from authenticated clients. 3. The Cloud Storage Provider, which stores the deduplicated encrypted data.

The DupLESS scheme offers increased security as long as the key server remains intact. Since the key server is responsible for generating encryption keys, external adversaries cannot try guessing messages. If a client becomes compromised, an attacker would still be unable to perform a brute-force attack on the key server, as it employs rate-limiting. In the event of a key server failure, the system falls back to the security level of regular MLE.
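
The role of the key server can be sketched as follows. This is a simplification and not the actual DupLESS protocol: the `KeyServer` class, the use of HMAC-SHA-256 under a server-held secret, and the fixed per-client request budget are all assumptions for illustration. In particular, the sketch sends the file hash to the key server in the clear, which a real deployment would want to avoid.

```python
import hashlib
import hmac


class RateLimitExceeded(Exception):
    pass


class KeyServer:
    """Toy key server: keys depend on the file hash AND a secret only it holds."""

    def __init__(self, secret: bytes, budget: int) -> None:
        self._secret = secret
        self._budget = budget              # max requests per client (assumed policy)
        self._used: dict[str, int] = {}

    def derive_key(self, client_id: str, file_hash: bytes) -> bytes:
        used = self._used.get(client_id, 0)
        if used >= self._budget:
            raise RateLimitExceeded(client_id)
        self._used[client_id] = used + 1
        return hmac.new(self._secret, file_hash, hashlib.sha256).digest()


server = KeyServer(secret=b"held only by the key server", budget=3)
file_hash = hashlib.sha256(b"shared report").digest()

alice_key = server.derive_key("alice", file_hash)
bob_key = server.derive_key("bob", file_hash)


def guesses_until_blocked(key_server: KeyServer, client_id: str) -> int:
    """A compromised client trying candidate files runs into the rate limit."""
    answered = 0
    try:
        for candidate in range(1000):
            guess = hashlib.sha256(str(candidate).encode()).digest()
            key_server.derive_key(client_id, guess)
            answered += 1
    except RateLimitExceeded:
        pass
    return answered


mallory_answers = guesses_until_blocked(server, "mallory")
```

Identical files still get identical keys, so deduplication keeps working, but nobody without access to the key server can test a guess offline, and a compromised client only gets a few guesses.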

To summarize, deduplication is crucial for efficient data management, especially in large-scale storage systems like cloud services. Message-Locked Encryption (MLE) enables deduplication of encrypted data by deriving keys from the message itself. However, it is limited by privacy concerns and security vulnerabilities, such as exposure to predictable message attacks and potential side-channel attacks. Thank you for your attention, and I'm happy to take any questions you might have.

Now, I would like to open the floor for any questions you might have about MLE and secure deduplication. Feel free to ask anything that needs further clarification or any specific aspects you're curious about.

Thank you for your attention, and have a great day!