Python-Cachey: Caching based on computation time and storage space

Caching for Analytic Computations

Humans repeat stuff. Caching helps.

Normal caching policies like LRU aren't well suited to analytic computations, where both the cost of recomputation and the cost of storage routinely vary by a factor of a million or more. Consider the following computations:

# Want this
np.std(x)        # tiny result, costly to recompute

# Don't want this
np.transpose(x)  # huge result, cheap to recompute

Cachey tries to hold on to values with the following characteristics:

  1. Expensive to recompute (in seconds)
  2. Cheap to store (in bytes)
  3. Frequently used
  4. Recently used

It accomplishes this by adding the following quantity to each item's score on each access:

score += compute_time / num_bytes * (1 + eps) ** tick_time

for some small value of epsilon, which determines the memory half-life. The score has units of inverse bandwidth (seconds per byte), decays exponentially for old results, and grows roughly linearly for repeated results.
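
To make this concrete, here is a minimal sketch of the scoring rule in plain Python (illustrative only; the names follow the formula above rather than cachey's internals):

import time

eps = 0.001                            # small epsilon; sets the memory half-life
start = time.time()

def update_score(score, compute_time, num_bytes):
    tick_time = time.time() - start    # ever-growing clock boosts recent accesses
    return score + compute_time / num_bytes * (1 + eps) ** tick_time

# a small, expensive result gains score quickly on repeated access
score = update_score(0.0, compute_time=10.0, num_bytes=8)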

Example

>>> from cachey import Cache
>>> c = Cache(1e9, 1)  # 1 GB, cut off anything with cost 1 or less

>>> c.put('x', 'some value', cost=3)
>>> c.put('y', 'other value', cost=2)

>>> c.get('x')
'some value'

The cache also has a memoize method:

>>> memo_f = c.memoize(f)
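
A hypothetical usage sketch (`slow_square` is a made-up function, not part of cachey):

>>> import time
>>> def slow_square(x):
...     time.sleep(1)              # stand-in for an expensive computation
...     return x ** 2
>>> memo_square = c.memoize(slow_square)
>>> memo_square(4)                 # computed; cached only if its score clears the cutoff
16
>>> memo_square(4)                 # may now be served from the cache
16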

Status

Cachey is new and not robust.

Comments

  • notebook mode

    Aug 19, 2015

    cross-posting from ipycache:

    https://github.com/rossant/ipycache/issues/32

  • Distributed cache strategy

    Mar 15, 2016

    Hello fellows, I am looking at using dask for distributed computation, and I was wondering: what is your strategy for making cachey into a horizontally scalable cache and common calculation space?

    Regards, Alex

  • Persistent cache with Zarr

    Jan 17, 2017

    This is somewhat related to both #5 and #3 but slightly different. Basically, I would like a persistent cache for use when working in a Jupyter notebook. The motivation is very similar to https://github.com/rossant/ipycache, i.e., if I restart a notebook I don't want to have to repeat any computations that previously finished. However, I would like to use Zarr rather than pickle to store cached results, because compression will save disk space. I would also like to use a memoize function decorator rather than a cell magic, i.e., something more like the cachey memoize decorator and the joblib Memory.cache decorator.

    No problem if this is beyond scope for cachey, but I thought I'd mention it in case there are any synergies with other requirements. On the technical side there are two main points to consider: the first is how to generate a key from function arguments that is stable across Python sessions (i.e., doesn't rely on Python's built-in hash function); the second is how to integrate with Zarr (or similar) for storage.
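
    For what it's worth, the stable-key half might look something like this (a sketch only: it hashes the pickled call instead of using hash(), and stable_key is a made-up helper name):

    import hashlib
    import pickle

    def stable_key(func, *args, **kwargs):
        # Hash the pickled call so the key survives across Python sessions;
        # the built-in hash() is salted per process and is not stable
        payload = pickle.dumps((func.__module__, func.__qualname__,
                                args, sorted(kwargs.items())))
        return hashlib.sha256(payload).hexdigest()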

  • recursive for list and tuple seems like a safe default

    Sep 30, 2017

  • unexpected behavior in cached function when keyword arguments are used

    Feb 21, 2019

    Hello,

    I have found this project very useful, and while testing it I came across behavior that was unexpected to me.

    I leave below a minimal reproducible example:

    import pandas as pd
    import cachey
    
    df = pd.DataFrame([[1, 2]], columns=list("ab"))
    c = cachey.Cache(1e9)
    
    
    @c.memoize
    def boo(df):
        print("not cached!")
        return df, df.to_dict()
    
    
    boo(df=df)  # does not work
    boo(df)     # works
    boo(df)     # works
    
    

    I don't know if this is the right place for this comment, but I thought it was interesting; maybe you can give me a clue as to what could cause this problem.

  • Badly-defined behavior when calling `put` with an existing key

    Mar 4, 2019

    The current behavior seems to be to replace the value for the key, but total_bytes increases without the existing item being retired.

    I do not think this behavior is well defined. It causes total_bytes and available_bytes to diverge, and eventually triggers #15.

    How unique do we expect the keys to be?

    Shall we raise an exception if the same key is used twice, or shall we consistently retire the existing key/value when a key is reused?
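
    For what it's worth, a client-side workaround could retire the old entry before putting the new one. A hedged sketch, assuming cache.data is the backing dict and that a retire(key) removal helper exists (both assumptions about cachey's internals):

    def put_replace(cache, key, value, cost):
        # Drop any existing entry first so total_bytes is not double-counted
        # (cache.data and cache.retire are assumptions, as noted above)
        if key in cache.data:
            cache.retire(key)
        cache.put(key, value, cost=cost)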

  • Can't groupby two or more keys

    Apr 18, 2019

    When attempting to groupby two or more keys, I get an AttributeError. This does not occur when grouping by a single key.

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-11-c4a4f7e033ab> in <module>
    ----> 1 df.groupby(['path','time']).alt.mean().compute()
    
    ~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
        154         dask.base.compute
        155         """
    --> 156         (result,) = compute(self, traverse=False, **kwargs)
        157         return result
        158 
    
    ~/anaconda3/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
        396     keys = [x.__dask_keys__() for x in collections]
        397     postcomputes = [x.__dask_postcompute__() for x in collections]
    --> 398     results = schedule(dsk, keys, **kwargs)
        399     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
        400 
    
    ~/anaconda3/lib/python3.6/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
         74     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
         75                         cache=cache, get_id=_thread_get_id,
    ---> 76                         pack_exception=pack_exception, **kwargs)
         77 
         78     # Cleanup pools associated to dead threads
    
    ~/anaconda3/lib/python3.6/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
        465                 finish_task(dsk, key, state, results, keyorder.get)
        466                 for f in posttask_cbs:
    --> 467                     f(key, res, dsk, state, worker_id)
        468 
        469                 while state['ready'] and len(state['running']) < num_workers:
    
    ~/anaconda3/lib/python3.6/site-packages/dask/cache.py in _posttask(self, key, value, dsk, state, id)
         59             duration += max(self.durations.get(k, 0) for k in deps)
         60         self.durations[key] = duration
    ---> 61         nb = self._nbytes(value) + overhead + sys.getsizeof(key) * 4
         62         self.cache.put(key, value, cost=duration / nb / 1e9, nbytes=nb)
         63 
    
    ~/anaconda3/lib/python3.6/site-packages/cachey/nbytes.py in nbytes(o)
         27 
         28     if name == 'pandas.core.series.Series':
    ---> 29         return _array(o._data.blocks[0].values) + _array(o.index._data)
         30     elif name == 'pandas.core.frame.DataFrame':
         31         return _array(o.index) + sum([_array(blk.values)
    
    ~/anaconda3/lib/python3.6/site-packages/cachey/nbytes.py in _array(x)
          3 
          4 def _array(x):
    ----> 5     if x.dtype == 'O':
          6         return sys.getsizeof('0'*100) * x.size
          7     elif str(x.dtype) == 'category':
    
    AttributeError: 'NoneType' object has no attribute 'dtype'
    

    Is this a known problem? Or a bug?
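
    For reference, a defensive guard in cachey/nbytes.py along these lines might sidestep the crash (a sketch only; treating a missing array as zero bytes may under-count):

    import sys

    def _array(x):
        if x is None:          # new guard: the traceback shows x can be None here
            return 0
        if x.dtype == 'O':
            return sys.getsizeof('0' * 100) * x.size
        # ... remaining dtype branches unchanged ...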

  • BUG: ZeroDivisionError when returning zero-length series/arrays

    Oct 31, 2019

    Cachey produces an unhelpful div/zero error when a cached function returns a zero-length array.

    Examples:

    from cachey import Cache
    import numpy as np
    import pandas as pd
    
    cache = Cache(1e5)
    
    @cache.memoize
    def myfunc(x):
        return x
    
    try:
        myfunc(pd.Series())
    except ZeroDivisionError as m:
        print(m)
    
    try:
        myfunc(np.array([]))
    except ZeroDivisionError as m:
        print(m)
        
    myfunc([])  # lists work fine
    
    
    float division by zero
    float division by zero
    
    Out[39]:
    []
    
  • MNT Migrate to GitHub actions

    Nov 7, 2020

    Resolves https://github.com/dask/cachey/issues/25

    CI on fork: https://github.com/thomasjpfan/cachey/actions?query=workflow%3ACI

  • Error during pip install

    Aug 2, 2015

    Hello,

    I'm having troubles during the installation,

    kepler@...:~$ sudo pip install cachey
    [sudo] password for kepler:
    Downloading/unpacking cachey
      Downloading cachey-0.0.1.tar.gz
      Running setup.py (path:/tmp/pip-build-Kfuq02/cachey/setup.py) egg_info for package cachey
        Traceback (most recent call last):
          File "<string>", line 17, in <module>
          File "/tmp/pip-build-Kfuq02/cachey/setup.py", line 15, in <module>
            install_requires=list(open('requirements.txt').read().strip().split('\n')),
        IOError: [Errno 2] No such file or directory: 'requirements.txt'
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):

      File "<string>", line 17, in <module>

      File "/tmp/pip-build-Kfuq02/cachey/setup.py", line 15, in <module>

        install_requires=list(open('requirements.txt').read().strip().split('\n')),

    IOError: [Errno 2] No such file or directory: 'requirements.txt'
    

    Thank you.

  • Move default branch from "master" -> "main"

    Feb 9, 2021

    @jrbourbeau and I are in the process of moving the default branch for this repo from master to main.

    • [x] Changed in GitHub
    • [x] ~~Merged PR to change branch name in code~~

    What you'll see

    Once the name on GitHub is changed (the first box above is checked, or this issue is closed), then when you try to git pull you'll get:

    Your configuration specifies to merge with the ref 'refs/heads/master'
    from the remote, but no such ref was fetched.
    

    What you need to do

    First: head to your fork and rename the default branch there. Then:

    git branch -m master main
    git fetch origin
    git branch -u origin/main main
    
  • Add a resize method.

    Nov 26, 2018

    Currently, to resize a cache one has to meddle with the available_bytes member in client code.
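
    For example, a resize helper along these lines might do it. A sketch only; it leans on the available_bytes, total_bytes, and _shrink_one internals that appear elsewhere on this page, so treat those as assumptions:

    def resize(cache, new_available_bytes):
        # Adjust the byte budget, then evict low-scoring entries until we fit
        cache.available_bytes = new_available_bytes
        while cache.total_bytes > cache.available_bytes:
            cache._shrink_one()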

  • Cost zero division

    Nov 1, 2019

    Fix for bug #19.

    Updated cost() to include a kwarg called zvalue, which defaults to 0.5. The change should have little effect on existing calls.

    It effectively prevents zero values of nbytes from causing a ZeroDivisionError by substituting a nonzero value less than 1. The function still raises ZeroDivisionError if zvalue == 0.

    New return value:

    time / (nbytes if nbytes else zvalue) / 1e9
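
    Put together, the patched helper might look like this (a sketch based on the description above; the exact code in the PR may differ):

    def cost(time, nbytes, zvalue=0.5):
        # Substitute zvalue when nbytes is 0 so zero-length results no longer
        # divide by zero; zvalue=0 still raises, as noted above
        return time / (nbytes if nbytes else zvalue) / 1e9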
    
  • Fix Cache.clear()

    Jan 16, 2017

    Fix for the following error when calling clear:

    >>> import cachey
    >>> c = cachey.Cache(2e9)
    >>> c.clear()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/col540/dev/bhpod/rosetta/venv/lib/python3.5/site-packages/cachey/cache.py", line 146, in clear
        self._shrink_one()
      File "/home/col540/dev/bhpod/rosetta/venv/lib/python3.5/site-packages/cachey/cache.py", line 126, in _shrink_one
        key, score = self.heap.popitem()
      File "/home/col540/dev/bhpod/rosetta/venv/lib/python3.5/site-packages/heapdict.py", line 91, in popitem
        wrapper = self.heap[0]
    IndexError: list index out of range
    
  • Beyond in-memory caching

    Mar 9, 2020

    I've just started to use cachey (it's quite cool, so thanks for making it). I'm curious whether it would be possible to extend Cache beyond in-memory caching, perhaps using Zict or similar mutable mappings.

    Currently Cache.data is just a vanilla dict:

    https://github.com/dask/cachey/blob/382e55ee335ece4ad1295f6350b058ea3bac27fb/cachey/cache.py#L71

    and I'm wondering if it could be any dict-like thing. A little monkey-patch exploration makes this seem possible. Here I replace Cache.data with Zict's LRU mapping:

    In [1]: >>> from cachey import Cache
       ...:
       ...: >>> c = Cache(1e9, 1)  # 1 GB, cut off anything with cost 1 or less
    
    In [2]: c.data
    Out[2]: {}
    
    In [3]: import pickle
       ...: import zlib
       ...:
       ...: from zict import File, Func, LRU
       ...:
       ...: a = File('myfile/', mode='a')
       ...: b = Func(zlib.compress, zlib.decompress, a)
       ...: c = Func(pickle.dumps, pickle.loads, b)
       ...: d = LRU(100, c)
    
    In [4]: d
    Out[4]: <LRU: 0/100 on <Func: dumps<->loads <Func: compress<->decompress <File: myfile/, mode="a", 0 elements>>>>
    
    In [5]: c = Cache(1e9)
    
    # monkey patch here!
    In [6]: c.data = d
    
    In [7]: c
    Out[7]: <cachey.cache.Cache at 0x7fc7c05ed6d8>
    
    In [8]: c.data
    Out[8]: <LRU: 0/100 on <Func: dumps<->loads <Func: compress<->decompress <File: myfile/, mode="a", 0 elements>>>>
    
    In [9]: c.put('x', 'some value', cost=3)
    
    In [10]: c.get('x')
    Out[10]: 'some value'
    

    Thoughts on what I've done here? What I'm really going for is a Zict.Buffer with a fast dict and a slower datastore (database or object store).
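
    If it helps, the Buffer version of the same monkey-patch might look like this (a sketch: 'spill/' is a made-up path, and the zict.Buffer(fast, slow, n) signature with a default per-item weight is my assumption):

    import pickle
    from zict import Buffer, File, Func
    from cachey import Cache

    fast = dict()                          # hot, in-memory store
    slow = Func(pickle.dumps, pickle.loads,
                File('spill/', mode='a'))  # cold, on-disk store

    c = Cache(1e9)
    c.data = Buffer(fast, slow, n=100)     # spill past ~100 items; same monkey-patch as above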

  • Thoughts of Questionable Value

    Aug 2, 2015

    Food for thought. Most of these are really "Pandas things" and not "Dask things", but I'm kind of riffing here.

    For a DataFrame, what if we cached

    • Factorized columns
    • Sorted columns or sorted Indices
    • The group index of a GroupBy, as well as the index arrays for any given set of labels/columns in a given GroupBy

    For a dask bag, what if we cached

    • All intermediates in any given series of folds, mappings, filters, etc. For example, if I perform bag.filter(lambda x: x > 2).map(lambda x: x + 1), dask can cache the filter and the map in case either is ever called again with that particular lambda. The following could be a terrible idea (still riffing):
    import inspect
    import hashlib
    
    m = hashlib.md5()
    foo = lambda x: x > 2
    stringified = inspect.getsource(foo)
    m.update(stringified.encode())  # md5 needs bytes, so encode the source text
    m.digest()
    

    Now you have a checksum on that particular lambda function that you can use as a key for that calc on that bag.

    For a Dask DataFrame, what if we cached

    • The same as the dask bag example, but for apply or map on the DataFrame (which tend to be terribly expensive). Same hashing of the function used in the apply or map, then caching of the applied or mapped columns or the entire Frame.
    • Aggregations

    In general, I think there could be huge wins in essentially taking snapshots of a given series of computations on any given sequence or DataFrame, and being able to reconstruct those from caches of both intermediates and final results.

    This could actually provide some compelling data flows on the distributed side of dask. If any given computation node fails, you could dramatically decrease the time it takes to re-run a given sequence of computations if you had all of the intermediate steps cached along the way. So for any given branch of the graph, if you've finished 5 of the 6 pieces of computation and the node fails, either you can wait for the node to stand back up and just finish step 6, or you can shuffle the intermediate cached result from step 5 to another node and finish step 6. Moving that intermediate is going to be way less expensive than moving the whole data block. This does, of course, assume you get the data from the failed node into storage where a new node can access it.

    End of riff.

  • Deploy tests

    Nov 26, 2018

    Without this, it appears pytest was importing cachey from the installed location, which gives an error:

    
    $ pytest cachey
    
    ....
    
    cachey/tests/test_cache.py:21: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    ../../anaconda3/install/lib/python3.6/site-packages/cachey/cache.py:146: in clear
        self._shrink_one()
    ../../anaconda3/install/lib/python3.6/site-packages/cachey/cache.py:126: in _shrink_one
        key, score = self.heap.popitem()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <heapdict.heapdict object at 0x7f015467c630>
    
        def popitem(self):
            """D.popitem() -> (k, v), remove and return the (key, value) pair with lowest\nvalue; but raise KeyError if D is empty."""
    >       wrapper = self.heap[0]
    E       IndexError: list index out of range
    

    With this change, it appears to use the correct source code from the tree.

    setup.py test still fails with a mysterious error:

    ....
        tests = self.loadTestsFromModule(package, pattern=pattern)
      File "/home/yfeng1/anaconda3/install/lib/python3.6/site-packages/setuptools/command/test.py", line 54, in loadTestsFromModule
        tests.append(self.loadTestsFromName(submodule))
      File "/home/yfeng1/anaconda3/install/lib/python3.6/unittest/loader.py", line 204, in loadTestsFromName
        test = obj()
    TypeError: nbytes() missing 1 required positional argument: 'o'
    

    But we don't have to use it.

  • test against 3.7 and 3.8, drop 2.7

    Mar 10, 2020