Basic idea of the feature request

I was trying to read an encrypted file with pandas. As far as I know, there is no way to tell read_csv (or any other read_* function) to decrypt a file while reading it, as opposed to decrypting the values after the fact with applymap, as in this Stack Overflow thread.

The solution proposed in that thread seems quite slow for data larger than a few MB.

My solution has been to decrypt the file with the cryptography package and write the result to a temporary location (I am aware there is room for improvement in the functions I propose below). This works, but I was hoping pandas could offer an option to decrypt the input stream while reading. This would probably lead to:

  • speed improvements, since it reduces I/O
  • improved security, since decrypted (and thus potentially sensitive) data is never written to disk, even temporarily

Here is an example that reproduces the situation:

  1. encrypt_data is only there to reproduce the setting of having an encrypted file.
  2. It would be great to skip the decrypt_data step and call read_csv directly with an extra argument.
import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})
df.to_csv("toto.csv")

def encrypt_data(path, key, outpath = None):

    if outpath is None:
        outpath = '{}_encrypted'.format(path)

    f = Fernet(key)
    # opening the original file to encrypt
    with open(path, 'rb') as file:
        original = file.read()
    # encrypting the file
    encrypted = f.encrypt(original)  
    # opening the file in write mode and 
    # writing the encrypted data
    with open(outpath, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

    print("file {} encrypted ; written at {} location".format(path, outpath))



def decrypt_data(path, key, outpath=None):
    if outpath is None:
        outpath = '{}_decrypted'.format(path)
    f = Fernet(key)
    # opening the encrypted file to decrypt
    with open(path, 'rb') as file:
        encrypted = file.read()
    # decrypting the file content
    decrypted = f.decrypt(encrypted)
    # opening the output file in write mode and
    # writing the decrypted data
    with open(outpath, 'wb') as dfile:
        dfile.write(decrypted)
    print("file {} decrypted ; written at {} location".format(path, outpath))


dummykey = Fernet.generate_key()
encrypt_data("toto.csv", dummykey, outpath = "toto_crypt.csv")
decrypt_data("toto_crypt.csv", dummykey, outpath = "toto_decrypt.csv")


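# the encrypted file does not give back the original table
pd.read_csv("toto_crypt.csv")
# the decrypted copy written to disk reads normally
pd.read_csv("toto_decrypt.csv")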

A possible approach

Let's say we call this argument encryption. We could pass an object from cryptography to decrypt the data stream directly in the pd.read_csv call. For instance:

pd.read_csv("toto_crypt.csv", encryption = Fernet(dummykey))

The same approach could be applied to to_csv (or other writer functions) to write encrypted data directly to disk.

However, this solution might require the Python engine. Passing the key and the encryption method (e.g. Fernet) directly might work better with the C engine (I am not familiar with C, but there is probably an equivalent to the method I used in Python).
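
For the writing side, something close to this seems possible today without a new argument. The following is only a minimal sketch: to_encrypted_csv is a hypothetical helper (not a pandas function), and the temporary directory is an arbitrary choice for the example. It relies on DataFrame.to_csv returning the CSV text as a string when no path is given.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def to_encrypted_csv(df, path, fernet, **to_csv_kwargs):
    """Hypothetical helper: build the CSV in memory, encrypt it, write the token."""
    # With no path argument, DataFrame.to_csv returns the CSV text as a str.
    csv_text = df.to_csv(**to_csv_kwargs)
    token = fernet.encrypt(csv_text.encode("utf-8"))
    with open(path, "wb") as fh:
        fh.write(token)


key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})

# Only the encrypted token is written to disk.
crypt_path = os.path.join(tempfile.mkdtemp(), "toto_crypt.csv")
to_encrypted_csv(df, crypt_path, Fernet(key), index=False)

# Round trip: decrypt in memory and parse the result.
with open(crypt_path, "rb") as fh:
    roundtrip = pd.read_csv(io.BytesIO(Fernet(key).decrypt(fh.read())))
```

The plaintext CSV only ever exists in memory here, which addresses the security point, at the cost of holding the whole file in memory.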

API breaking implications

As far as I understand how the I/O code works, this extra argument would not break any existing code as long as it defaults to None.

Comment From: jreback

not really in favor of this, as it is out of scope here: it adds complexity w/o much value. That said, I wouldn't object to a fully formed PR that works generally.

could be a doc recipe instead.

Comment From: twoertwein

I'm not familiar with the 3rd-party cryptography library. If it provides you with a file handle, you can simply pass that to pd.read_csv.

Comment From: twoertwein

  • improved security, since decrypted (and thus potentially sensitive) data is never written to disk, even temporarily

Instead of writing the content (str/bytes) to a file, you can simply wrap it inside io.StringIO or io.BytesIO and then give that to read_csv.
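
A minimal sketch of that suggestion, assuming the file was encrypted with Fernet as in the original example (the set-up part and the temporary directory are only there to make the snippet self-contained):

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)

# Set-up only: create an encrypted CSV to read back.
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
crypt_path = os.path.join(tempfile.mkdtemp(), "toto_crypt.csv")
with open(crypt_path, "wb") as fh:
    fh.write(fernet.encrypt(df.to_csv(index=False).encode("utf-8")))

# Read the token, decrypt it to bytes in memory, and give pandas
# a file-like object instead of a path.
with open(crypt_path, "rb") as fh:
    decrypted = fernet.decrypt(fh.read())
df_read = pd.read_csv(io.BytesIO(decrypted))
```

Nothing decrypted is written to disk. Note that Fernet works on the whole message at once, so both the encrypted and the decrypted content are held in memory.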

Comment From: linogaliana

Thanks for the quick reply.

I understand the maintainers' point of view that extra complexity should not be added unless it is needed. I agree with @jreback that this would probably make more sense as a doc recipe.

I will have a look at io.StringIO and io.BytesIO; maybe they would avoid my overcomplicated solution. If I'm happy with the result, I will make a PR to add it to the documentation.

Comment From: slremy

Hello all, is the suggestion that this should be implemented as a method which returns a file-like object?

à la

pd.read_csv(decrypt_data("toto_crypt.csv", dummykey)) or pd.read_csv(decrypt_data("https://server/toto_crypt.csv", dummykey))
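
One way such a function could look, as a rough sketch only: this decrypt_data is a hypothetical variant of the one in the original post that returns an in-memory file-like object instead of writing to disk, and the URL branch (using urllib from the standard library) is an assumption about how remote files might be handled.

```python
import io
import os
import tempfile
from urllib.request import urlopen

import pandas as pd
from cryptography.fernet import Fernet


def decrypt_data(path_or_url, key):
    """Hypothetical variant: return the decrypted content as a BytesIO."""
    if path_or_url.startswith(("http://", "https://")):
        # Remote file: download the encrypted token first.
        with urlopen(path_or_url) as resp:
            token = resp.read()
    else:
        with open(path_or_url, "rb") as fh:
            token = fh.read()
    return io.BytesIO(Fernet(key).decrypt(token))


# Set-up only: an encrypted file to read back.
dummykey = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
crypt_path = os.path.join(tempfile.mkdtemp(), "toto_crypt.csv")
with open(crypt_path, "wb") as fh:
    fh.write(Fernet(dummykey).encrypt(df.to_csv(index=False).encode("utf-8")))

# The returned object can be passed wherever pandas accepts a file handle.
df_local = pd.read_csv(decrypt_data(crypt_path, dummykey))
```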

Comment From: twoertwein

pandas has convenient methods for compression, but I think adding support for a particular non-stdlib en/decryption package would be a very niche feature that might not justify the added complexity.

I think the best solution would be for cryptography.fernet to implement a function that returns a decrypted file handle. That should work with read_csv/to_csv/...

Comment From: jbrockmendel

Agree with Jeff this doesn’t belong in pandas.