I have a software project that creates a series of fingerprint (hash) values from objects of varying size. The larger the object size, of course, the more expensive the computation of the hash. The hashes are used for comparative purposes.
I now wish to cache hash values in order to improve performance of subsequent comparisons. For any given entry in the cache, I have the following metrics available:
- hit count
- last modification date/time
- size of object hashed
So on to my question. Given the need to constrain the size of the cache (limit it to a specific number of entries), what is a well-balanced approach to replacing cache items?
Clearly, larger objects are more expensive to hash so they need to be kept around for as long as possible. However, I want to avoid a situation where populating the cache with a large quantity of large objects will prevent future (smaller) items from being cached.
So, based upon the metrics available to me (see above), I'm looking for a good general-purpose "formula" for expiring (removing) cache entries when the cache becomes full.
All thoughts, comments are appreciated.
You need to think about the nature of the objects. Think about the probability of the objects to be called again soon. And remove the least likely object.
This is very specific to the software you're using and the nature of the objects.
If they are used continuously in the program they will probably abide to the Locality of reference principle. So you should use LRU (Least recently used) algorithm.
If objects with higher hit count are more likely to be called again, then use that (and remove the lowest).
Take a look at Cache Algorithms
In principle, you need to calculate:
p = probability to be called again.
cost = The cost of caching that object again.
Assuming the ability to record when an entry was last accessed, I'd go with a "Cost" for each entry, where you at any time remove the least expensive entry.
Cost = Size * N - TimeSinceLastUse * M
Presuming you completely remove entries from the cache (and not keep old hitcount data around) I'd avoid using hitcount as a metric, you'd end up with an entry that has a high hitcount because it's been there for a long time, and it'll be there even longer because it has a high hitcount.
I typically use a strict least recently used (LRU) scheme for discarding things from the cache, unless it's hugely more expensive to reconstruct some items. LRU has the benefit of being trivially simple to implement, and it works really well for a wide range of applications. It also keeps the most frequently used items in the cache.
In essence, I create a linked list that's also indexed by a dictionary. When a client wants an item, I look it up in the dictionary. If it's found, I unlink the node from the linked list and move it to the head of the list. If the item isn't in the cache, I construct it (load it from disk, or whatever), put it at the head of the list, insert it into the dictionary, and then remove the item that's at the tail of the list.
Might want to try a multilevel style of cache. Dedicate a certain percentage of the cache to Expensive to create items and a portion to easy to create but more heavily accessed items. You can then use different strategies for maintaining the expensive cache than you would the less expensive one.
The algorithm could consider the cost of reproducing a missing element. That way you would keep the most valuable items in the cache.