Have you ever run out of RAM when doing some Deep Learning computation on your laptop instead of the server? One approach to solving this is to write the data out to disk; an alternative is to write a custom generator that reads it in pieces. Recently I was attempting to implement Lesson 7 of the fast.ai course and it ran incredibly slowly. My fix was to write the data into an on-disk array using bcolz and then read from that array as needed. Below is a function you can use to write the output of a Keras generator to a bcolz array:
import bcolz
from tqdm import tqdm
import os.path
def save_generator(gen, data_dir, labels_dir):
    """
    Save the output of a Keras generator to disk without loading all images into memory.
    Does not return anything; instead writes the data to disk as bcolz arrays.
    :gen: A Keras generator, e.g. the iterator returned by ImageDataGenerator().flow_from_directory().
    :data_dir: The folder in which to store the bcolz array holding the features.
    :labels_dir: The folder in which to store the bcolz array holding the labels.
    """
    for directory in [data_dir, labels_dir]:
        if not os.path.exists(directory):
            os.makedirs(directory)
    # One pass over the data: the generator yields batches, so work out how many batches cover all samples.
    num_batches = (gen.samples + gen.batch_size - 1) // gen.batch_size
    # Create the on-disk arrays from the first batch ('w' overwrites anything already in the folders).
    d, l = next(gen)
    data = bcolz.carray(d, rootdir=data_dir, mode='w')
    labels = bcolz.carray(l, rootdir=labels_dir, mode='w')
    # Append the remaining batches.
    for i in tqdm(range(num_batches - 1)):
        d, l = next(gen)
        data.append(d)
        labels.append(l)
    data.flush()
    labels.flush()
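As an example of how you might call this, here is a minimal sketch assuming an images/train directory laid out for flow_from_directory; the paths, image size and batch size are placeholders, not anything from a real project:

from keras.preprocessing.image import ImageDataGenerator

# Hypothetical paths and parameters; adjust to your own data.
train_gen = ImageDataGenerator().flow_from_directory(
    'images/train',           # one sub-folder per class
    target_size=(224, 224),
    batch_size=64,
    shuffle=False)            # keep features and labels in a stable order on disk

save_generator(train_gen, 'bcolz/train_data', 'bcolz/train_labels')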
Now, if you want to load the data back, all you have to do is the following (assuming data_dir is the folder you saved the features to):
data = bcolz.open(data_dir)
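The labels come back the same way, and slicing the carray with [:] pulls it into memory as a regular NumPy array when you do need it all at once. A short sketch, assuming the placeholder folder names used above:

data = bcolz.open('bcolz/train_data')         # lazy, on-disk carray
labels = bcolz.open('bcolz/train_labels')[:]  # [:] materialises it as a NumPy array
x_batch = data[0:64]                          # only this slice is read from disk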
Another problem I encountered was that I wanted to get the predictions from a pre-trained model and build on top of them, but computing the predictions for the whole dataset at once required a large amount of RAM. The following function writes the predictions to disk instead.
def save_predictions(model, data, rootdir, batch_size):
    """
    Use bcolz to save the predictions from a model to disk. This is useful when you want to take the
    features from a pretrained net and build something on top of them without re-evaluating the
    network every time.
    This function does not return anything; it writes the predictions to disk.
    :model: A Keras model.
    :data: A NumPy array; the first axis is assumed to be the sample (batch) axis.
    :rootdir: The directory in which to store the bcolz array.
    :batch_size: The number of samples to predict at a time. The right value will depend on your hardware.
    """
    # Create the on-disk array from the first batch of predictions ('w' overwrites existing data).
    output = bcolz.carray(model.predict(data[0:batch_size]), rootdir=rootdir, mode='w')
    # Predict the remaining batches and append them, taking care not to run past the end of the data.
    for i in tqdm(range(batch_size, data.shape[0], batch_size)):
        end = min(i + batch_size, data.shape[0])
        output.append(model.predict(data[i:end]))
    output.flush()
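For instance, here is a sketch that pre-computes convolutional features with VGG16 for the array saved earlier; the model choice, folder names and batch size are purely illustrative:

from keras.applications.vgg16 import VGG16

# Illustrative only: any Keras model and any on-disk folders will do.
base_model = VGG16(include_top=False, input_shape=(224, 224, 3))
data = bcolz.open('bcolz/train_data')[:]    # load the saved features back into memory

save_predictions(base_model, data, 'bcolz/train_vgg_features', batch_size=32)

# Later, build on top of the saved features without touching VGG16 again.
features = bcolz.open('bcolz/train_vgg_features')

Since a bcolz carray supports the same .shape attribute and slicing as a NumPy array, you could also pass the on-disk array straight into save_predictions and avoid loading it fully into memory at all.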