How to avoid computation every time a python module is reloaded

I have a Python module that uses a huge dictionary as a global variable. Currently the computation code sits at the top of the module, so every first import or reload of the module takes more than one minute, which is totally unacceptable. How can I save the computation result somewhere so that the next import/reload doesn't have to compute it? I tried cPickle, but loading the dictionary from a file (1.3M) takes approximately the same time as the computation.

To give more information about my problem,

FD = FreqDist(word for word in brown.words()) # this line of code takes 1 min

-------------Problems Reply------------

Just to clarify: the code in the body of a module is not executed every time the module is imported - it is run only once, after which future imports find the already created module, rather than recreating it. Take a look at sys.modules to see the list of cached modules.
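To see that behaviour for yourself, here is a small sketch (written for Python 3, where reload lives in importlib). The module name body_runs_demo and the log file are invented for the demonstration; the module body appends a line to the log each time it executes.

```python
import importlib
import os
import sys
import tempfile

tmp_dir = tempfile.mkdtemp()
log_path = os.path.join(tmp_dir, "runs.log")

# Hypothetical module whose body appends one line each time it executes.
with open(os.path.join(tmp_dir, "body_runs_demo.py"), "w") as f:
    f.write("with open(%r, 'a') as log:\n    log.write('run\\n')\n" % log_path)

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()


def run_count():
    """Number of times the module body has executed so far."""
    with open(log_path) as log:
        return len(log.readlines())


import body_runs_demo               # first import: the body executes
import body_runs_demo               # found in sys.modules: the body is skipped
count_after_imports = run_count()

importlib.reload(body_runs_demo)    # an explicit reload executes the body again
count_after_reload = run_count()
```

So repeated imports are free, but an explicit reload does pay the full cost again.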

However, if your problem is the time it takes for the first import after the program is run, you'll probably need to use some other method than a python dict. Probably best would be to use an on-disk form, for instance a sqlite database or one of the dbm modules.
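As a rough sketch of the sqlite route (Python 3; the table name freq, the file location and the sample counts are made up for illustration), the counts are written once, and later processes look up single keys without loading everything:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "freq.db")  # hypothetical location

# One-off build step: pay the computation cost once and persist the counts.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE freq (word TEXT PRIMARY KEY, count INTEGER)")
conn.executemany(
    "INSERT INTO freq VALUES (?, ?)",
    [("the", 3), ("fox", 1), ("dog", 2)],  # stand-in for the real word counts
)
conn.commit()
conn.close()


def lookup(word, path=db_path):
    """Return the stored count for word, or 0 if the word is absent."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT count FROM freq WHERE word = ?", (word,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else 0
```

In a real program you would probably keep one connection open rather than reconnecting on every lookup.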

For a minimal change in your interface, the shelve module may be your best option - this puts a pretty transparent interface over the dbm modules that makes them act like an arbitrary python dict, allowing any picklable value to be stored. Here's an example:

# Create dict with a million items:
import shelve
d = shelve.open('path/to/my_persistant_dict')
d.update(('key%d' % x, x) for x in xrange(1000000))
d.close()

Then in the next process, use it. There should be no large delay, as lookups are only performed for the key requested on the on-disk form, so everything doesn't have to get loaded into memory:

>>> d = shelve.open('path/to/my_persistant_dict')
>>> print d['key99999']
99999

It's a bit slower than a real dict, and it will still take a long time to load if you do something that requires all the keys (eg. try to print it), but may solve your problem.

Calculate your global var on the first use.

class Proxy:
    @property
    def global_name(self):
        # calculate your global var here, enable cache if needed
        ...

_proxy_object = Proxy()
# note: this assignment evaluates the property right away, at import time
GLOBAL_NAME = _proxy_object.global_name

Or better yet, access the necessary data via a special data object.

class Data:
    GLOBAL_NAME = property(...)

data = Data()

Example:

from some_module import data

print(data.GLOBAL_NAME)

See Django settings.
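A filled-in sketch of that data object might look like the following. The build_table function and the calls list are hypothetical stand-ins for the slow computation; the point is only that nothing is computed until GLOBAL_NAME is first read, and the result is reused afterwards.

```python
class Data(object):
    """Holds an expensive value and computes it only on first access."""

    def __init__(self, compute):
        self._compute = compute   # hypothetical callable that builds the value
        self._cache = None
        self._loaded = False

    @property
    def GLOBAL_NAME(self):
        if not self._loaded:
            self._cache = self._compute()
            self._loaded = True
        return self._cache


calls = []


def build_table():
    # Stand-in for the slow computation; records each invocation.
    calls.append(1)
    return {"the": 3, "fox": 1}


data = Data(build_table)     # cheap: nothing has been computed yet
first = data.GLOBAL_NAME     # the computation happens here
second = data.GLOBAL_NAME    # served from the cache
```

Importing the module that defines data stays fast; only the first attribute access is slow.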

I assume you've pasted the dict literal into the source, and that's what's taking a minute? I don't know how to get around that, but you could probably avoid instantiating this dict upon import... You could lazily-instantiate it the first time it's actually used.

You could try using the marshal module instead of the c?Pickle one; it could be faster. This module is used by python to store values in a binary format. Note especially the following paragraph, to see if marshal fits your needs:

Not all Python object types are supported; in general, only objects whose value is independent from a particular invocation of Python can be written and read by this module. The following types are supported: None, integers, long integers, floating point numbers, strings, Unicode objects, tuples, lists, sets, dictionaries, and code objects, where it should be understood that tuples, lists and dictionaries are only supported as long as the values contained therein are themselves supported; and recursive lists and dictionaries should not be written (they will cause infinite loops).

Just to be on the safe side, before unmarshalling the dict, make sure that the Python version that unmarshals the dict is the same as the one that did the marshal, since there are no guarantees for backwards compatibility.
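One way to do that check, as a sketch only: write the interpreter's major.minor version as a header line in front of the marshalled bytes and refuse to load on a mismatch. The header format and the save/load helper names are my own invention, not part of the marshal module.

```python
import marshal
import os
import sys
import tempfile


def py_tag():
    """Header line identifying the running interpreter, e.g. b'3.11\\n'."""
    return ("%d.%d\n" % sys.version_info[:2]).encode("ascii")


def save(obj, path):
    with open(path, "wb") as f:
        f.write(py_tag())              # version header first
        f.write(marshal.dumps(obj))    # then the marshalled payload


def load(path):
    with open(path, "rb") as f:
        if f.readline() != py_tag():
            raise ValueError("cache was written by a different Python version")
        return marshal.loads(f.read())


marshal_path = os.path.join(tempfile.mkdtemp(), "table.marshal")
save({"the": 3, "fox": 1}, marshal_path)
restored = load(marshal_path)
```

On a mismatch the caller can simply recompute the dict and call save again.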

If the 'shelve' solution turns out to be too slow or fiddly, there are other possibilities:

  • shove
  • Durus
  • ZopeDB
  • pyTables

shelve gets really slow with large data sets. I've been using redis quite successfully, and wrote a FreqDist wrapper around it. It's very fast, and can be accessed concurrently.

You can use a shelve to store your data on disc instead of loading the whole data into memory. So startup time will be very fast, but the trade-off will be slower access time.

Shelve will pickle the dict values too, but will do the (un)pickle not at startup for all the items, but only at access time for each item itself.

A couple of things that will help speed up imports:

  1. You might try running Python with the -OO flag. This will do some optimizations that may reduce the import time of modules.
  2. Is there any reason why you couldn't break the dictionary up into smaller dictionaries in separate modules that can be loaded more quickly?
  3. As a last resort, you could do the calculations asynchronously so that they won't delay your program until it needs the results. Or maybe even put the dictionary in a separate process and pass data back and forth using IPC if you want to take advantage of multi-core architectures.
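The asynchronous idea from point 3 could be sketched like this with concurrent.futures (Python 3). The build_table function is a stand-in for the real computation. Note that for CPU-bound work a thread still shares the interpreter lock with the main thread, so a ProcessPoolExecutor would be closer to the separate-process suggestion.

```python
from concurrent.futures import ThreadPoolExecutor


def build_table():
    # Stand-in for the slow computation.
    return {"the": 3, "fox": 1}


_executor = ThreadPoolExecutor(max_workers=1)
_future = _executor.submit(build_table)   # starts in the background, returns at once


def get_table():
    """Block only if the background computation has not finished yet."""
    return _future.result()
```

Callers use get_table() wherever they previously used the global; the import itself no longer waits.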

With that said, I agree that you shouldn't be experiencing any delay in importing modules after the first time you import it. Here are a couple of other general thoughts:

  1. Are you importing the module within a function? If so, this can lead to performance problems since it has to check and see if the module is loaded every time it hits the import statement.
  2. Is your program multi-threaded? I have seen occasions where executing code upon module import in a multi-threaded app can cause some wonkiness and application instability (most notably with the cgitb module).
  3. If this is a global variable, be aware that global variable lookup times can be significantly longer than local variable lookup times. In this case, you can achieve a significant performance improvement by binding the dictionary to a local variable if you're using it multiple times in the same context.
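To illustrate the last point, here is a small sketch with a made-up TABLE dict. Both functions return the same result; the second one resolves the global name once and then works with a local name inside the loop. How much this saves depends on the Python version and the workload, so measure before relying on it.

```python
TABLE = {"the": 3, "fox": 1, "dog": 2}   # stand-in for the big global dict


def total_global(words):
    total = 0
    for w in words:
        total += TABLE.get(w, 0)   # TABLE is looked up as a global on every pass
    return total


def total_local(words):
    get = TABLE.get                # one global lookup, bound to a local name
    total = 0
    for w in words:
        total += get(w, 0)
    return total
```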

With that said, it's a tad bit difficult to give you any specific advice without a little bit more context. More specifically, where are you importing it? And what are the computations?

  1. Factor the computationally intensive part into a separate module. Then at least on reload, you won't have to wait.
  2. Try dumping the data structure using protocol 2. The command to try would be cPickle.dump(FD, f, protocol=2), where f is a file opened in binary mode. From the docstring for cPickle.Pickler:

    Protocol 0 is the
    only protocol that can be written to a file opened in text
    mode and read back successfully. When using a protocol higher
    than 0, make sure the file is opened in binary mode, both when
    pickling and unpickling.
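A minimal round trip with protocol 2 and binary-mode files, written for Python 3 (where cPickle is simply pickle); the sample dict and the temporary path are placeholders:

```python
import os
import pickle   # Python 3 name; this is cPickle in Python 2
import tempfile

table = {"the": 3, "fox": 1}   # stand-in for FD
pickle_path = os.path.join(tempfile.mkdtemp(), "table.pkl")

# Binary mode is required for any protocol above 0.
with open(pickle_path, "wb") as f:
    pickle.dump(table, f, protocol=2)

with open(pickle_path, "rb") as f:
    restored = pickle.load(f)
```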

I'm going through this same issue... shelve, databases, etc. are all too slow for this type of problem. You'll need to take the hit once and insert the data into an in-memory key/value store like Redis. It will just live there in memory (warning: it could use up a good amount of memory, so you may want a dedicated box). You'll never have to reload it, and lookups are just in-memory key fetches:

from redis import Redis  # redis-py client; needs a running Redis server

r = Redis()
r.set(key, word)

word = r.get(key)

Expanding on the delayed-calculation idea, why not turn the dict into a class that supplies (and caches) elements as necessary?
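A sketch of such a class, using the dict __missing__ hook so that each element is computed the first time it is requested and then stored. The compute callable and the seen list are hypothetical; this only helps if individual elements can be computed independently.

```python
class LazyDict(dict):
    """A dict that computes missing values on demand and caches them."""

    def __init__(self, compute):
        super(LazyDict, self).__init__()
        self._compute = compute   # hypothetical per-key computation

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        value = self._compute(key)
        self[key] = value
        return value


seen = []


def count_letters(word):
    # Stand-in for an expensive per-key computation; records each call.
    seen.append(word)
    return len(word)


lengths = LazyDict(count_letters)
a = lengths["python"]   # computed now
b = lengths["python"]   # read from the cache
```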

You might also use psyco to speed up overall execution...

OR you could just use a database for storing the values in? Check out SQLObject, which makes it very easy to store stuff to a database.

There's another pretty obvious solution for this problem. When code is reloaded the original scope is still available.

So... doing something like this will make sure this code is executed only once.

try:
    FD
except NameError:
    FD = FreqDist(word for word in brown.words())
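Here is a quick way to check that this guard survives a reload (Python 3, importlib.reload). The module guarded_demo is a throwaway written to a temporary directory, and a small dict stands in for the slow FreqDist call.

```python
import importlib
import os
import sys
import tempfile

tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "guarded_demo.py"), "w") as f:
    f.write(
        "try:\n"
        "    FD\n"
        "except NameError:\n"
        "    FD = {'the': 3}   # stand-in for the slow FreqDist call\n"
    )

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

import guarded_demo
fd_before = guarded_demo.FD

importlib.reload(guarded_demo)   # module namespace is retained, so FD exists
fd_after = guarded_demo.FD
```

After the reload, FD is still the very same object, so the expensive line was not run a second time.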

Category:python Views:2 Time:2008-10-12
Tags: python nltk
