Eat

eat takes a PyObject and repeats a pointer to it a specified number of times. Use it simply to generate a list, or extensibly through custom methods which expect an iterable. Args and kwargs are passed through directly, for use when constructing the data with the method passed. Recommended for use with @decay from pointers.py.

eat (function): Create an iterable set of pointers of a specified length. Useful for parallelizing large datasets. Use it simply to generate a list, or extensibly through custom methods which expect an iterable. Args and kwargs are passed through directly, for use when constructing the data with the method passed. If compute is False, dereference the pointer with ~ at computation time to reaccess its value.

Args:
- obj : (Any) -> PyObject to repeat over
- its : (int) -> number of iterations
- method : (Callable) -> method / function which accepts the repeated object and structures it to the passed args & kwargs
- compute : (bool) -> if True, access items through the pointer while constructing the data and return a structured repeated object (e.g. a DataFrame); if False, delay computation and return pointers

Returns: the repeated object as structured by method (list, array, DataFrame, etc.)
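cereal's internals are not shown here, but the behaviour described above can be sketched in a few lines. The `_Ptr` class and `eat_sketch` function below are hypothetical stand-ins (a pointers.py-style pointer supports `~` via `__invert__`); this is an illustration of the semantics, not the real implementation:

```python
# Hypothetical sketch of eat's semantics -- NOT the real cereal implementation.
class _Ptr:
    """Minimal stand-in for a pointers.py-style pointer."""
    def __init__(self, obj):
        self._obj = obj
    def __invert__(self):  # ~ptr dereferences, as in pointers.py
        return self._obj

def eat_sketch(obj, its, method=list, compute=False, **kwargs):
    if compute:
        # Build the structure from the object itself (e.g. a DataFrame).
        return method([obj] * its, **kwargs)
    # Delay computation: hand back repeated pointers to the same object.
    return method([_Ptr(obj)] * its, **kwargs)

x = [1, 2, 3, 4]
yum = eat_sketch(x, its=4)
```

With this sketch, `(~yum[0])[0]` reads through the pointer, and a mutation through any pointer is visible in the original `x`, since every pointer refers to the same object.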

import numpy as np
from cereal import eat

x = [1, 2, 3, 4]

yum = eat(x, its = 4, method = np.array, copy = False)

>>> yum
array([<pointer to list object at 0x113b3edc0>,
       <pointer to list object at 0x113b3edc0>,
       <pointer to list object at 0x113b3edc0>,
       <pointer to list object at 0x113b3edc0>], dtype=object) # 4 pointers to the original object

>>> (~yum[0])[0]
1

yum[0][0] = 4

>>> x # the original object
[4, 2, 3, 4] # it changed!
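The same sharing behaviour exists with plain Python references: `[x] * 4` stores four references to one list, so a mutation through any of them is visible everywhere. eat wraps this idea in pointer objects. A minimal demonstration with stdlib Python only:

```python
x = [1, 2, 3, 4]
aliases = [x] * 4      # four references to the same list object

aliases[0][0] = 4      # mutate through one reference
print(x)               # the original list reflects the change: [4, 2, 3, 4]
```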

Why use this?

Since eat does not create an actual copy of the original object, we can make modifications to the original object at computation time, which is especially useful when parallelizing large datasets.

Starting simple

First, lets build a basic DataFrame with a repeated list [1, 2, 3, 4]:

import pandas as pd
from cereal import eat

x = [1, 2, 3, 4]

yum = eat(x, its = 4, method = pd.DataFrame, compute=True, columns = ['a', 'b', 'c', 'd'])

>>> yum # outputs a dataframe with columns ['a', 'b', 'c', 'd'] with [1, 2, 3, 4] repeated 4x

Easy! Furthermore, since it all references the same object, there's no additional overhead at computation from initializing new objects: the pointers refer directly to the original object's memory location. Let's try an example with compute = False.

import numpy as np
from cereal import eat

x = [1, 2, 3, 4]

yum = eat(x, its = 4, method = np.array, copy = False)

Next, we will define the function addone, which takes a pointer and reassigns the item at the given position to itself + 1.

def addone(x, pos):
    x[pos] += 1

Note that the function returns nothing - that's because it only interacts with the pointer, so no collection is required. Next, if we use an arbitrary mapping function to apply it, we will see that the original PyObject is changed without the memory overhead of collection.

>>> list(map(addone, yum, range(len(yum))))
[None, None, None, None]

>>> x
[2, 3, 4, 5]
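The mechanics of this map-over-pointers pattern can be reproduced without cereal, using plain repeated references as stand-ins for the pointers eat would return; this only illustrates the in-place mutation, not the pointer machinery:

```python
x = [1, 2, 3, 4]
refs = [x] * 4          # stand-ins for the pointers eat would return

def addone(obj, pos):
    obj[pos] += 1       # mutate in place; nothing to collect

results = list(map(addone, refs, range(len(refs))))
print(results)          # [None, None, None, None] -- no values returned
print(x)                # [2, 3, 4, 5] -- the original mutated in place
```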

Orchestration

To see the beauty in using eat, view what happens when you segment operations over an eat object:

import numpy as np
from cereal import eat

x = np.arange(100).reshape(4, 25) # create a basic array, but reshape into segments

>>> x
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
        41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
        66, 67, 68, 69, 70, 71, 72, 73, 74],
       [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
        91, 92, 93, 94, 95, 96, 97, 98, 99]]) # four segments, len 25 each

yum = eat(x, 4, np.array, copy = False) # create an iterable of len 4, stored in an array (copy=False is a numpy arg: no copy to memory)

>>> yum

array([<pointer to ndarray object at 0x120069f50>,
       <pointer to ndarray object at 0x120069f50>,
       <pointer to ndarray object at 0x120069f50>,
       <pointer to ndarray object at 0x120069f50>], dtype=object)

Next, we define addone again, except addone now accepts a position offset (dimension 1), traverses dimension 2, and reassigns each value to itself + 1:

def addone(x, pos):
    for n in range(25):
        x[pos][n] += 1


>>> list(map(addone, yum, range(4)))
[None, None, None, None]

>>> x
array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25],
       [ 26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
         39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50],
       [ 51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
         64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75],
       [ 76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,
         89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100]])
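NumPy slicing behaves analogously on its own: indexing a row of x yields a view, not a copy, so in-place edits on the view propagate to the parent array. A sketch of the same segmented traversal without eat:

```python
import numpy as np

x = np.arange(100).reshape(4, 25)
rows = [x[i] for i in range(4)]   # views into x; no data is copied

def addone(row):
    row += 1                      # in-place increment propagates to x

for row in rows:
    addone(row)

print(x[0, 0], x[-1, -1])         # 1 100
```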

Parallelization

import numpy as np
from cereal import eat
from concurrent.futures import ThreadPoolExecutor

x = np.arange(10000).reshape(16, 625)

yum = eat(x, 16, np.array, copy = False)

def addone(slice, n):
    slice[n] += 1

def traverse(x, pos):
    mem = [addone(x[pos], n) for n in range(625)]
    return mem

%%timeit
with ThreadPoolExecutor() as executor:
    executor.map(traverse, yum, range(16))

(111 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each))
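For reference, here is a self-contained version of the threaded traversal that uses repeated references to a NumPy array in place of eat's pointers; the array is scaled down, and the shapes here are illustrative only:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

x = np.arange(64).reshape(4, 16)

def traverse(arr, pos):
    # increment every element of one segment, in place
    for n in range(arr.shape[1]):
        arr[pos, n] += 1

with ThreadPoolExecutor() as executor:
    # four repeated references, one segment per worker
    list(executor.map(traverse, [x] * 4, range(4)))

print(x[0, :3])   # [1 2 3]
```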

Of course, it's much faster to just broadcast arithmetic functions with numpy:

%%timeit
x + 1

(5.01 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each))

However, when used asynchronously it provides a useful tool for handling and interacting with structured data:

import random
import asyncio
import numpy as np
from cereal import eat
from time import process_time

x = np.arange(10000).reshape(16, 625)

>>> x
array([[   0,    1,    2, ...,  622,  623,  624],
       [ 625,  626,  627, ..., 1247, 1248, 1249],
       [1250, 1251, 1252, ..., 1872, 1873, 1874],
         ...,
       [8125, 8126, 8127, ..., 8747, 8748, 8749],
       [8750, 8751, 8752, ..., 9372, 9373, 9374],
       [9375, 9376, 9377, ..., 9997, 9998, 9999]])

yum = eat(x, 16, np.array, copy = False) # 16 iterations, 1 for each iteration on dimension 1

async def addone(slice, n):
    await asyncio.sleep(random.random())
    slice[n] += 1

async def traverse(x, pos):
    mem = [asyncio.create_task(addone(x[pos], n)) for n in range(625)] # one computation point for each iteration on dimension 2
    return asyncio.gather(*mem)

async def traverse_gather(x, n):
    fut = await traverse(x, n)

start = process_time()
fut = [await traverse_gather(x, n) for n, x in enumerate(yum)] # top-level await works in Jupyter; in pure Python, wrap this in asyncio.run()
end = process_time()

>>> print(end - start)
0.11029499999995096 # submission time only - this lets you decouple procedures from the data you are interacting with and retrieve values from the store as coroutines complete

>>> x # could be complete, or incomplete...
array([[    1,     2,     3, ...,   623,   624,   625],
       [  626,   627,   628, ...,  1248,  1249,  1250],
       [ 1251,  1252,  1253, ...,  1873,  1874,  1875],
          ...,
       [ 8126,  8127,  8128, ...,  8748,  8749,  8750],
       [ 8751,  8752,  8753, ...,  9373,  9374,  9375],
       [ 9376,  9377,  9378, ...,  9998,  9999, 10000]])
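The create_task / gather pattern above can be run as a self-contained script without cereal, substituting nested lists for the array and repeated references for the pointers; sizes and sleeps are scaled down for illustration:

```python
import asyncio
import random

x = [list(range(8)) for _ in range(4)]   # small stand-in for the reshaped array

async def addone(row, n):
    await asyncio.sleep(random.random() * 0.01)
    row[n] += 1                          # mutate the shared row in place

async def traverse(row):
    # one task per element on dimension 2
    tasks = [asyncio.create_task(addone(row, n)) for n in range(len(row))]
    await asyncio.gather(*tasks)

async def main():
    # one traversal per segment on dimension 1
    await asyncio.gather(*(traverse(row) for row in x))

asyncio.run(main())
print(x[0])   # [1, 2, 3, 4, 5, 6, 7, 8]
```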

This is, obviously, much faster than:

import time
import random
import numpy as np

x = np.arange(10000)

for n in range(10000):
    x[n] += 1
    time.sleep(random.random()) # could take up to ~166 minutes to complete

Note that, while submitting the tasks in a single flat batch (without eat) may be slightly faster:

import random
import asyncio
import numpy as np
from time import process_time

x = np.arange(10000)

async def addone(slice, n):
    await asyncio.sleep(random.random())
    slice[n] += 1

async def traverse(x): # runs all at once, no traversal
    mem = [asyncio.create_task(addone(x, n)) for n in range(len(x))]
    return asyncio.gather(*mem)

start = process_time()
fut = await traverse(x)
end = process_time()

>>> print(end - start)
0.0850950000000239

At computation time, eat supports controlling dataflows that act on a store, permitting batched sampling without creating deep-memory copies. This allows insertions, traversals, and replacements over n-dimensional data while maintaining strict memory allocation. Beneath the hood, eat creates shallow memory copies of an object using pointers, so that asynchronous traversal and access methods can be easily implemented without initializing a new object at compute time. This is particularly useful for distributed systems with shared memory spaces, where computation distribution can be controlled without the requirement of managing the distribution and re-collection of data (items are interacted with directly, across the traversal, per system).