Python: Optimizing, or at least getting fresh ideas for a tree generator

I have written a program that generates random expressions and then uses genetic techniques to select for fitness.

The following part of the program generates the random expression and stores it in a tree structure.

As this can get called billions of times during a run, I thought it should be optimized for time.

I'm new to programming and I work (play) by myself, so however much I search the internet for ideas, I'd like some input, as I feel like I'm doing this in isolation.

The bottlenecks seem to be Node.__init__() (22% of total time) and random.choice() (14% of total time).

import random

def printTreeIndented(data, level=0):
    '''utility to view the tree
    '''
    if data == None:
        return
    printTreeIndented(data.right, level+1)
    print ' '*level + ' '+ str(data.cargo)#+ ' '+ str(data.seq)+ ' '+ str(data.branch)
    printTreeIndented(data.left, level+1)

#These are the global constants used in the Tree.build_nodes() method.
Depth = 5
Ratio = .6 #probability of terminating the current branch.
Atoms = ['1.0','2.0','3.0','4.0','5.0','6.0','7.0','8.0','9.0','x','x','x','x']
#dict of operators. the structure is: operator: number of arguements
Operators = {'+': 2, '-': 2, '*': 2, '/': 2, '**': 2}

class KeySeq:
    '''Iterator to produce sequential integers for keys in Tree.thedict
    '''
    def __init__(self, data = 0):
        self.data = data
    def __iter__(self):
        return self
    def next(self):
        self.data = self.data + 1
        return self.data

KS = KeySeq()

class Node(object):
    '''
    '''
    def __init__(self, cargo, left=None, right=None):
        object.__init__(self)
        self.isRoot = False
        self.cargo = cargo
        self.left = left
        self.right = right
        self.parent = None
        self.branch = None
        self.seq = 0

class Tree(object):
    def __init__(self):
        self.thedict = {} #provides access to the nodes for further mutation and
        # crossbreeding.
        #When the Tree is instantiated, it comes filled with data.
        self.data = self.build_nodes()

        # Uncomment the following lines to see the data and a crude graphic of the tree.
        # print 'data: '
        # for v in self.thedict.itervalues():
        #     print v.cargo,
        # print
        # print
        # printTreeIndented(self.data)

    def build_nodes (self, depth = Depth, entry = 1, pparent = None,
                     bbranch = None):
        '''
        '''
        r = float()
        r = random.random()

        #If r > Ratio, it forces a terminal node regardless of
        #the value of depth.
        #If entry = 1, then it's the root node and we don't want
        # a tree with just a value in the root node.
        if (depth <= 0) or ((r > Ratio) and (not (entry))):
            '''
            Add a terminal node.
            '''
            this_atom = (random.choice(Atoms))
            this_atom = str(this_atom)
            this_node = Node(this_atom)
            this_node.parent = pparent
            this_node.branch = bbranch
            this_node.seq = KS.next()
            self.thedict[this_node.seq] = this_node
            return this_node

        else:
            '''
            Add a node that has branches.
            '''
            this_operator = (random.choice(Operators.keys()))
            this_node = Node(this_operator)
            if entry:
                this_node.isRoot = True
            this_node.parent = pparent
            this_node.branch = bbranch
            this_node.seq = KS.next()
            self.thedict[this_node.seq] = this_node
            #branch as many times as 'number of arguements'
            # it's only set up for 2 arguements now.
            for i in range(Operators[this_operator]):
                depth =(depth - 1)
                if i == 0:
                    this_node.left = (self.build_nodes(entry = 0, depth =(depth),
                                      pparent = this_node, bbranch = 'left'))
                else:
                    this_node.right = (self.build_nodes(entry = 0, depth =(depth),
                                       pparent = this_node, bbranch = 'right'))
            return this_node

def Main():
    for i in range(100000):
        t = Tree()
    return t

if __name__ == '__main__':
    rresult = Main()

-------------Problems Reply------------

Below, I've summarized some of the more obvious optimization efforts, without really touching the algorithm much. All timings are done with Python 2.6.4 on a Linux x86-64 system.

Initial time: 8.3s

Low-Hanging Fruit

jellybean already pointed some of these out. Just fixing those improves the runtime a little. Replacing the repeated calls to Operators.keys() with a list that is built once and reused also saves some time.

Time: 6.6s
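As a minimal sketch of the Operators.keys() point: the key list can be built once at module level and reused. The name OpKeys matches the one used in the later snippets; pick_operator is a hypothetical helper added only for illustration.

```python
import random

Operators = {'+': 2, '-': 2, '*': 2, '/': 2, '**': 2}

# Build the key list once, instead of on every call to build_nodes().
# list() also keeps this working on Python 3, where keys() returns a
# view that random.choice() cannot index.
OpKeys = list(Operators)

def pick_operator():
    """Hypothetical helper: choose an operator from the prebuilt list."""
    return random.choice(OpKeys)
```

Inside build_nodes() the call would then read random.choice(OpKeys), as in the snippets further down.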

Using itertools.count

Pointed out by Dave Kirby, simply using itertools.count also saves you some time:

from itertools import count
KS = count()

Time: 6.2s
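One detail to watch: KeySeq hands out 1 as its first key, while count() starts at 0. If the keys should keep starting at 1, something along these lines should do. The built-in next() is used here instead of KS.next() only so that the sketch also runs on Python 3.

```python
from itertools import count

# count(1) yields 1, 2, 3, ... like the original KeySeq.
KS = count(1)

first = next(KS)    # equivalent to KS.next() on Python 2
second = next(KS)
```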

Improving the Constructor

Since you're not setting all attributes of Node in the ctor, you can just move the attribute declarations into the class body:

class Node(object):
    isRoot = False
    left = None
    right = None
    parent = None
    branch = None
    seq = 0

    def __init__(self, cargo):
        self.cargo = cargo

This does not change the semantics of the class as far as you're concerned, since all values used in the class body are immutable (False, None, 0). If you need other values, read this answer on class attributes first.

Time: 5.2s
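To illustrate why immutability matters here, a small sketch: assigning to an attribute on an instance only shadows the class-level default, whereas a mutable default is shared by every instance. BadNode is a hypothetical counter-example and not part of your code.

```python
class Node(object):
    # Class-level defaults: looked up on the class until an instance
    # assigns its own value.
    isRoot = False
    left = None
    right = None
    parent = None
    branch = None
    seq = 0

    def __init__(self, cargo):
        self.cargo = cargo


class BadNode(object):
    # Hypothetical counter-example: one list object shared by all instances.
    children = []


a, b = Node('+'), Node('x')
a.seq = 7                      # creates an instance attribute on a only

p, q = BadNode(), BadNode()
p.children.append('oops')      # mutates the single class-level list
```

After this runs, b.seq is still 0, but q.children contains 'oops' even though only p was touched.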

Using namedtuple

In your code, you're not changing the expression tree any more, so you might as well use an object that is immutable. Node also does not have any behavior, so using a namedtuple is a good option. This does have an implication though, since the parent member had to be dropped for now. Judging from the fact that you might introduce operators with more than two arguments, you would have to replace left/right with a list of children anyway, which is mutable again and would allow creating the parent node before all the children.

from collections import namedtuple
Node = namedtuple("Node", ["cargo", "left", "right", "branch", "seq", "isRoot"])
# ...
def build_nodes (self, depth = Depth, entry = 1, pparent = None,
                 bbranch = None):
    r = random.random()

    if (depth <= 0) or ((r > Ratio) and (not (entry))):
        this_node = Node(
            random.choice(Atoms), None, None, bbranch, KS.next(), False)
        self.thedict[this_node.seq] = this_node
        return this_node

    else:
        this_operator = random.choice(OpKeys)

        this_node = Node(
            this_operator,
            self.build_nodes(entry = 0, depth = depth - 1,
                             pparent = None, bbranch = 'left'),
            self.build_nodes(entry = 0, depth = depth - 2,
                             pparent = None, bbranch = 'right'),
            bbranch,
            KS.next(),
            bool(entry))

        self.thedict[this_node.seq] = this_node
        return this_node

I've kept the original behavior of the operand loop, which decrements the depth at each iteration. I'm not sure this is the intended behavior, but changing it increases the runtime and would make the timings impossible to compare.

Final time: 4.1s
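A short sketch of what the immutability means in practice, using the same field list as above: assigning to a field raises AttributeError, and _replace() returns a modified copy instead. The sample values are made up for illustration.

```python
from collections import namedtuple

Node = namedtuple("Node", ["cargo", "left", "right", "branch", "seq", "isRoot"])

leaf = Node('x', None, None, 'left', 2, False)

# Fields cannot be reassigned on a namedtuple instance.
try:
    leaf.cargo = 'y'
except AttributeError:
    mutated = False
else:
    mutated = True

# _replace() builds a new tuple and leaves the original untouched.
changed = leaf._replace(cargo='y')
```

For the later mutation and crossbreeding steps this means a changed node has to be rebuilt and stored again in thedict, rather than edited in place.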

Where to go from here

If you want support for operators with more than two arguments and/or for the parent attribute, use something along the lines of the following code:

from collections import namedtuple
Node = namedtuple("Node", ["cargo", "args", "parent", "branch", "seq", "isRoot"])

def build_nodes (self, depth = Depth, entry = 1, pparent = None,
                 bbranch = None):
    r = random.random()

    if (depth <= 0) or ((r > Ratio) and (not (entry))):
        this_node = Node(
            random.choice(Atoms), None, pparent, bbranch, KS.next(), False)
        self.thedict[this_node.seq] = this_node
        return this_node

    else:
        this_operator = random.choice(OpKeys)

        this_node = Node(
            this_operator, [], pparent, bbranch,
            KS.next(), bool(entry))
        this_node.args.extend(
            self.build_nodes(entry = 0, depth = depth - (i + 1),
                             pparent = this_node, bbranch = i)
            for i in range(Operators[this_operator]))

        self.thedict[this_node.seq] = this_node
        return this_node

This code also decreases the depth with the operand position.
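With an args list, code that walks the tree no longer needs to know about left and right. As a rough sketch, a hypothetical helper to_prefix (not part of your program) could render a node like this; the small tree is built by hand only to keep the example self-contained.

```python
from collections import namedtuple

Node = namedtuple("Node", ["cargo", "args", "parent", "branch", "seq", "isRoot"])

def to_prefix(node):
    """Hypothetical helper: render a node and its children in prefix form."""
    if not node.args:          # terminals carry None instead of a list
        return node.cargo
    parts = [node.cargo] + [to_prefix(child) for child in node.args]
    return '(' + ' '.join(parts) + ')'

# Create the parent first, then attach children that point back to it.
root = Node('+', [], None, None, 1, True)
root.args.append(Node('x', None, root, 0, 2, False))
root.args.append(Node('2.0', None, root, 1, 3, False))

expression = to_prefix(root)
```

The same loop over node.args works unchanged for operators with one, two or more arguments.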

You can omit a lot of the parentheses in your code; that's one of Python's benefits. E.g. instead of wrapping every part of a condition in parentheses, like

if (depth <= 0) or ((r > Ratio) and (not (entry))):

just write

if depth <= 0 or (r > Ratio and not entry):
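The shorter form relies on operator precedence: not binds tighter than and, which binds tighter than or. If in doubt, a quick check such as the following sketch compares both spellings over a handful of made-up inputs (verbose and terse are hypothetical names).

```python
from itertools import product

Ratio = .6

def verbose(depth, r, entry):
    return (depth <= 0) or ((r > Ratio) and (not (entry)))

def terse(depth, r, entry):
    return depth <= 0 or (r > Ratio and not entry)

# Try values on both sides of each comparison, with entry as 0 and 1.
cases = list(product([-1, 0, 1, 5], [0.0, 0.6, 0.61, 0.99], [0, 1]))
same = all(verbose(*case) == terse(*case) for case in cases)
```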

And I think there are a couple of redundant calls, e.g.

this_atom = str(this_atom)

(this_atom is already a string, so the call is pure overhead; just omit this line)

or the call to the object constructor

object.__init__(self)

which isn't necessary, either.

As for the Node.__init__ method being the "bottleneck": I guess spending most of your time there cannot be avoided, since when constructing trees like this there's not much else you'll be doing but creating new Nodes.
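If you want to see how much the constructor itself costs, timeit gives a rough number without profiling the whole program. This is only a sketch: the Node class is your original constructor minus the object.__init__ call, and the absolute figures depend on the machine.

```python
import timeit

class Node(object):
    def __init__(self, cargo, left=None, right=None):
        self.isRoot = False
        self.cargo = cargo
        self.left = left
        self.right = right
        self.parent = None
        self.branch = None
        self.seq = 0

# Time 100,000 constructions, three times over, and keep the best run
# to reduce noise from other processes.
timings = timeit.repeat(lambda: Node('x'), number=100000, repeat=3)
best = min(timings)
print('best of 3: %.3f s per 100000 nodes' % best)
```

Running the same measurement on a variant of the class shows whether a change to the constructor is worth keeping.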

You can replace the KeySeq iterator with itertools.count, which does the same job but is implemented in C. Note that count() starts at 0 whereas KeySeq starts at 1, so use count(1) if the starting key matters.

I don't see any way of speeding up the Node constructor. You could optimise the call to random.choice by inlining the code: cut and paste it from the source of the random module. This eliminates a function call, and function calls are relatively expensive in Python.
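As far as I can tell from the Python 2 source, random.choice(seq) boils down to seq[int(random() * len(seq))], so an inlined version might look like the sketch below. The names _random, _n_atoms and pick_atom are made up for illustration; in the real code the expression would sit directly in build_nodes().

```python
import random

Atoms = ['1.0', '2.0', '3.0', '4.0', '5.0', '6.0', '7.0', '8.0', '9.0',
         'x', 'x', 'x', 'x']

# Look these up once instead of on every call.
_random = random.random
_n_atoms = len(Atoms)

def pick_atom():
    # random() returns a float in [0.0, 1.0), so the index is always
    # in range(len(Atoms)).
    return Atoms[int(_random() * _n_atoms)]

sample = [pick_atom() for _ in range(1000)]
```

Measure before and after; the saving is one function call per node, which may or may not matter next to the cost of creating the node.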

You could speed it up by running under psyco, which is a kind of JIT optimiser. However, this only works for 32-bit Intel builds of Python. Alternatively, you could use cython, which converts python(ish) code into C that can be compiled into a Python C module. I say pythonish since there are some things that cannot be converted, and you can add C data type annotations to make the generated code more efficient.

Category:python Views:0 Time:2010-01-24

