Python Pickle versus HDF5 (shocksolution.com)
23 points by tomrod on Oct 12, 2013 | 16 comments


PyTables is really useful, but there is also h5py. I sometimes find it handy to create an HDF data structure in memory, and I think the latter library has better support in that case. I would like to know more about HDF read and write performance though, perhaps relative to protobuf and msgpack (I suspect both could be much faster, but I don't know).


"Warning The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source."

Storing pickled data on disk paves a path for all sorts of nasty exploits... Case in point:

File contents: "cos\nsystem\n(S'ls ~'\ntR."
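For the curious, you can inspect what that payload would do without actually running it. A sketch using the stdlib's pickletools disassembler:

```python
import io
import pickletools

# The malicious pickle above: calling pickle.loads() on it would run
# `ls ~` via os.system. Disassembling shows the GLOBAL/REDUCE opcodes
# (import os.system, then call it) without executing anything.
payload = b"cos\nsystem\n(S'ls ~'\ntR."

buf = io.StringIO()
pickletools.dis(payload, out=buf)
print(buf.getvalue())
```

The disassembly shows a GLOBAL opcode importing `os system` followed by a REDUCE opcode that calls it, which is exactly what makes blind `pickle.loads()` dangerous.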

How is HDF5, security-wise?


If your local filesystem is literally an untrusted source, then you have big problems. All your own Python code is coming from that same untrusted source, along with all the .pyc files in your code and on PYTHONPATH. Is a .py or a .pyc paving a path for all sorts of nasty exploits, just because it can be run?

This isn't even limited to Python: every executable you run on that machine is coming from the same untrusted source, and every binary you build on that machine is also tainted by extension.

Does loading a kernel from disk pave a path for all sorts of nasty exploits?


I think there's a useful distinction to be made between data and code here. Your code typically writes to your data store. Your code does not typically write to your code.

It's a lot more reasonable that an attacker manages to convince your code to write some malicious data, than it is that an attacker has full write access to your filesystem.

As an analogy: your SQL database probably writes to a filesystem somewhere. If it's running on the same machine as your app server, it may be on the same filesystem. But, say, SQL injection attacks are still infinitely more common than "can write to the source file currently being executed".

Or, to rephrase: being writable (by the user executing app code) is a default state for your data store. It isn't a default state for your app, or your kernel.


Yes, assuming that you are loading the kernel from a directory that can be written to without privileges.

Generally the Python distribution is stored in a directory that can't be written to without elevated privileges - behind UAC on Windows, for example.

Input files are different: they are generally stored somewhere under the user's folder and as such can be written to easily.

Someone could craft an input file that when opened did pretty much anything. If this is a problem for formats like PDF (and it is), why would one not consider this a problem in this case?


> If your local filesystem is literally an untrusted source, then you have big problems.

When has an easy-to-use, general-purpose data serialization format ever stayed put on local filesystems?


I'd have to agree with pekk's reply in general about trusting what you've written to the filesystem.

That being said, it's easy to securely unpickle builtin or whitelisted types:

1. Use a cPickle.Unpickler instance, and set its "find_global" attribute to None to disable importing any modules (thus restricting loading to builtin types such as dict, int, list, string, etc).

2. Use a cPickle.Unpickler instance, and set its "find_global" attribute to a function that only allows importing of modules and names from a whitelist.

3. Use something like the itsdangerous package to authenticate the data before unpickling it if you're loading it from an untrusted source.
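A rough sketch of option 1 (names here are my own; in Python 3 the equivalent hook is overriding find_class on pickle.Unpickler rather than setting cPickle's find_global attribute):

```python
import io
import pickle

class SafeUnpickler(pickle.Unpickler):
    """Refuse to import anything, restricting loads to builtin types
    (dict, list, int, str, ...) that pickle encodes with dedicated opcodes."""
    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "import of %s.%s is forbidden" % (module, name))

def safe_loads(data):
    return SafeUnpickler(io.BytesIO(data)).load()

# Plain builtin containers round-trip fine...
print(safe_loads(pickle.dumps({"a": [1, 2, 3]})))

# ...while the `ls ~` payload from upthread is rejected instead of executed.
try:
    safe_loads(b"cos\nsystem\n(S'ls ~'\ntR.")
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Option 2 is the same idea, except find_class consults a whitelist of (module, name) pairs and defers to the parent implementation for allowed entries.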

Anyway, this whole issue is largely tangential to what the OP was discussing.


HDF5 is a hierarchical data storage format; it doesn't store or 'execute' code or retrieve language objects. In fact, it has nothing more to do with Python or Python data structures than NetCDF, geotiff or .mkv.

I guess this comparison is just saying "don't use pickle to efficiently store data"... Which is a bit of a 'dur'...


What's a better way to store, say, a dict in python?

I've not used python for repeated input/output before, so this is an open question for me.


I would use JSON.

I'm using python-memcached in a project and I was hitting the memcache object size limit (python-memcached uses pickle by default). I wrote two simple functions to serialize/unserialize to/from JSON and compress the serialized data, and the result was way below the memcache limit.

Sure, pickle is slightly faster, but in my experience JSON + compression is a good compromise between simplicity, size and speed.
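A minimal sketch of that approach (the function names are my own, not python-memcached's API):

```python
import json
import zlib

def json_z_dumps(obj):
    """Serialize to compact JSON, then compress with zlib."""
    return zlib.compress(json.dumps(obj, separators=(",", ":")).encode("utf-8"))

def json_z_loads(blob):
    """Decompress, then parse the JSON back into Python objects."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

record = {"user": "alice", "scores": list(range(1000))}
blob = json_z_dumps(record)
assert json_z_loads(blob) == record
print(len(json.dumps(record)), "->", len(blob), "bytes")
```

Note that JSON only round-trips basic types (dict, list, str, numbers, bools, None), and dict keys come back as strings - acceptable trade-offs for cache data like this.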


If you use python-memcached, make sure to set pickle protocol to 2 (current highest protocol). The default is 0. You can see an improvement[1] in performance by using proto 2.

[1]: https://read.twobitfifo.net/article/the-venerable-pickle/
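The size difference alone is easy to see - protocol 0 is an ASCII format, while protocol 2 uses a compact binary encoding. (Protocol 2 was the highest at the time; Python 3 has since added higher ones.)

```python
import pickle

data = list(range(1000))
p0 = pickle.dumps(data, protocol=0)  # ASCII, python-memcached's old default
p2 = pickle.dumps(data, protocol=2)  # compact binary

print("protocol 0:", len(p0), "bytes")
print("protocol 2:", len(p2), "bytes")
```

Protocol 0 writes each small int as several ASCII characters plus a newline, while protocol 2 packs most of them into two or three bytes each.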


Let's not forget portability to any number of languages.


My goodness. The author should be using cPickle.


With protocol=2 as well.


The code example was already using HIGHEST_PROTOCOL, which is currently 2.


Tell me more!



