Thursday, 14 August 2014

Sparse matrix: from categorical matrix to binary matrix

On this post I will show a mini project I have been working on for the last few days. The goal is generate boolean or binary data from categorical data. For doing so, I will primarly use Pandas and Numpy libraries. This class was created for testing purposes, so performance has not been taken into account and therefore is not relevant at this moment, however I am convinced that the code can be much more optimized.

Firstly, a short explanation about what this code is doing:

The next table represents the input data, as shown, fields p1, p2, p3 and p4 store categorical data, in this case a product code. Each module consists of four products (p) at most. Note: field "modules" is the index (row names).

modules p1 p2 p3 p4 price
m_1 A1 A2 9.90
m_2 A1 A3 10.50
m_3 B1 B2 A4 C1 20.30
m_4 B1 A1 C1 22.10

Now, let's imagine we want to solve a linear equation system (Ax = b) composed by field p1, p2, p3 and p4 as matrix "A" , unknows (A1, A2, A3, A4, B1, B2, C1) and price as vector "b":

$A=\begin{bmatrix}a_{11} & a_{12} & \cdots & a_{1n} \\a_{21} & a_{22} & \cdots & a_{2n} \\\vdots & \vdots & \ddots & \vdots \\a_{m1} & a_{m2} & \cdots & a_{mn}\end{bmatrix}, \quad x=\begin{bmatrix}x_1 \\x_2 \\\vdots \\x_n\end{bmatrix}, \quad b=\begin{bmatrix}b_1 \\b_2 \\\vdots \\b_m\end{bmatrix}$

The matrix equations shows that size(A) = m x n, size(x) = n and size(b) = m. On the above example columns(A) is not equal to size(x). Hence, we need a new matrix A with the correct shape, see below.

$A=\begin{bmatrix}1 & 1 & 0 & 0 & 0 & 0 & 0 \\1 & 0 & 1 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 1 & 1 & 1 & 1 \\1 & 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}, \quad x=\begin{bmatrix}A1 \\A2 \\A3 \\A4 \\B1 \\B2 \\C1 \end{bmatrix}, \quad b=\begin{bmatrix}9.90 \\10.50 \\20.30 \\22.10 \end{bmatrix}$

If we get a matrix "A" like the previous one, then we can solve the system easier.
Using vectorial notation, m_1 = A(1, :).x (e.g. Matlab indexing)

Before we continue, I would like to remark that the above system can be easily solvead by isolating x from the equation or by using LU decomposition, especially for square matrices. When dealing with non square matrices ($m \neq n$) the next methods are normaly used:

  • Normal equation: $x = (A^{T} A)^{-1} A^{T} b $
  • QR factorization
  • SVD decomposition

Code:

    import numpy as np
    import pandas as pd
    import warnings
    import matplotlib.pyplot as plt


    class CatBin(object):
        def __init__(self, i_file, o_file=False, cat_cols="all", other_cols=None, op_nan=True):
            """
            Class to convert a matrix with categorical data into a matrix with binary data.

            Example: Convert the below matrix into a binary matrix

                initial_data =
                modules   p1    p2      p3      p4     price
                m_1       A1    A2      nan     nan    9.90
                m_2       A1    nan     A3      nan    10.50
                m_3       B1    B2      A4      C1     20.30
                m_4       B1    nan     A1      C1     22.10

                binary_data =
                modules   A1  A2  A3  A4  B1  B2  C1    price
                m_1       1   1   0   0   0   0   0     9.90
                m_2       1   0   1   0   0   0   0     10.50
                m_3       0   0   0   1   1   1   1     20.30
                m_4       1   0   0   0   1   0   1     22.10

            Uses:
            Solve a linear equation system: Ax = b
                A = matrix[0,1]
                x = [A1, A2, A3, A4, B1, B2, C1]
                b = price

                Note:
                if non-square matrix, we can solve the system by using some 
                of the well-known methods:

                    Normal equation
                    QR factorization
                    SVD decomposition

            Graphs - adjacency matrix - connectivity
            ...

            :param i_file: input_file in csv format
            :param o_file: if True a csv is generated. The path must be provided.
            :param cat_cols: slicing - e.g [1, 10] -> from column 1 to 10(not included - pandas)
            :param other_cols:

            if the new data frame must preserve others columns besides those
            with categorical data, position and name should be provided in a dictionary.
                e.g
                others = {'i' : 'modules', 'e', 'prices'}
                if 'i', that column will be inserted on the first position in the new data frame
                    data.insert(0, 'name', column_data)
                if 'e', that column will be inserted after the rest of columns
                    data['name'] = column_data

            :param op_nan: if true all NaN values will be removed.
            """

            self.o_file = o_file
            self.cat_cols = cat_cols
            self.other_cols = other_cols
            self.nan = op_nan
            self.list = []

            self.rows = None
            self.columns = None
            self.p_sparsity = -1

            self.input_data = pd.read_csv(i_file)
            self.boolean_m = None
            self.boolean_a = None

            if self.cat_cols is "all":
                if self.nan is True:
                    # if all data is categorical data to be converted, automatically 
                    # others_cols must be false to avoid inconsistencies.
                    
                    self.data = self.input_data.fillna(value='nan')
                else:
                    self.data = self.input_data
                if other_cols is not None:
                    other_cols = None
                    warnings.warn("other_cols has been set to False to avoid inconsistencies")
            else:
                # cat_cols is an array [start, end]
                if isinstance(self.cat_cols, list):
                    if len(self.cat_cols) == 2:
                        _start = self.cat_cols[0]
                        _end = self.cat_cols[1]
                        if self.nan is True:
                            self.data = self.input_data.fillna(value='nan').iloc[:, _start: _end]
                        else:
                            self.data = self.input_data.iloc[:, _start, _end]
                    else:
                        raise ValueError("'start'and 'end' position must be provided")
                else:
                    raise TypeError("cat_cols is not a list")

            if not self.data.empty:
                # extract unique values:
                self.list = self.unique_values()
            else:
                raise ImportError("errors or empty csv file. Check input file")

        def unique_values(self):
            """ return sorted list of unique values """

            u_values = pd.Series(self.data.values.ravel()).unique()
            if self.nan is True:
                return np.sort(u_values[u_values != 'nan'])
            else:
                return np.sort(u_values)

        def boolean_matrix(self):
            matrix = []
            for i in range(len(self.data)):
                v = self.data.iloc[i, :].values
                # avoid worrying about the argument op_nan
                intersect = set(v).intersection(self.list)

                ea = np.zeros(len(self.list))
                for n in intersect:
                    for index, v in enumerate(self.list):
                        if n == v:
                            ea[index] = 1
                matrix.append(ea)

            self.rows = len(matrix)
            self.columns = len(matrix[0])
            self.boolean_m = pd.DataFrame(matrix, columns=self.list)

            return self.boolean_m

        def boolean_augmented_matrix(self):
            """
            The goal of this function is create the augmented matrix:
            Ax = b --> augmented_matrix = [A|b]

            However, this function allows the option of inserting at least 
            one first column and many others columns after the boolean data, 
            in case that was required to suit your necessities.
            """
            if self.cat_cols is not 'all':
                if self.boolean_m is None:
                    self.boolean_a = self.boolean_matrix()
                else:
                    self.boolean_a = self.boolean_m.copy()

                if self.other_cols is not None and isinstance(self.other_cols, dict):
                    for key, value in self.other_cols.items():
                        if key[0] is 'i':
                            self.boolean_a.insert(0, value, self.input_data[value])
                        elif key[0] is 'e':
                            self.boolean_a[value] = self.input_data[value]
                        else:
                            raise ValueError("Incorrect key format")
                else:
                    warnings.warn("insufficient arguments")
                return self.boolean_a
            else:
                raise ValueError("cat_cols = 'all' the aug. matrix is not possible")

        def sparsity(self):
            """
            The sparsity or density is the fraction of non-zero elements in a matrix.
            :return: ratio non-zero/total
            """
            if self.rows and self.columns and self.boolean_m is not None:
                total = self.rows * self.columns
                non_zeros = sum(self.boolean_m[self.boolean_m == 1].count())
                self.p_sparsity = non_zeros / total
                return "sparsity = {0:.4}%".format(self.p_sparsity * 100)
            else:
                raise ValueError("incorrect data shape. A boolean transformation is required")

        def generate_csv(self, matrix_type='normal'):
            """
            generate a csv file with boolean matrix or augmented matrix
            note: removes first column (index)
            """
            if matrix_type in ['normal', 'augmented']:
                if matrix_type is 'normal':
                    if self.boolean_m is not None:
                        self.boolean_m.to_csv(self.o_file, index=False)
                else:
                    if self.boolean_a is not None:
                        self.boolean_a.to_csv(self.o_file, index=False)
            else:
                raise ValueError("select: normal or augmented")

        @property
        def clean_data(self):
            return self.data

        @property
        def raw_data(self):
            return self.input_data

        @property
        def initial_size(self):
            """ Only includes those fields with categorical data"""
            rows = self.data.shape[0]
            columns = self.data.shape[1]
            return "rows={0}, columns={1}, total={2} elements".\
                format(rows, columns, rows * columns)

        @property
        def final_size(self):
            rows = self.boolean_m.shape[0]
            columns = self.boolean_m.shape[1]
            return "rows={0}, columns={1}, total={2} elements".\
                format(rows, columns, rows * columns)
    

The code tries to be self-explanatory, however I consider that three functions need more details:

  • boolean_matrix: This function contains the basic code to convert the input data. Basically, it compares each row from the initial data "A" with its unique values and inserts one on those positions where matching. As example, this task can be done in Matlab much easier, Matlab can compare two arrays with different length by using the function "ismember()":
            
              p = {'A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'C1'}; % character arrays or cell arrays
              p_test = {'A1', 'B1','C1'};
              ismember(p, p_test)
    
              ans =
    
                 1     0     0     0     1     0     1
            
  • boolean_augmented_matrix: The augmented matrix is that obtained by appending the solution vector "b" to matrix "A", usually denoted as [A b] or (A|b). Having a solvable linear equation system, then the augmented matrix [A b] has the same rank as A. Besides of creating the augmented matrix, this function allows the possibility to append other columns if needed.

  • Sparsity: The sparsity or density is the fraction of non-zero elements in matrix. A typical example of sparse matrix is the adjacency matrix of an undirected graph.

Below a short example using the class CatBin:
    from others.catbin import CatBin
    import time

    file = "test data/i_simple.csv"
    output_file = "test data/clean_matrix.csv"

    others = {'i': 'modules', 'e': 'price'}

    t1 = time.time()
    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[1, 5], other_cols=others, op_nan=True)
    
    unique_v = ct.unique_values()
    boolean_m = ct.boolean_matrix()
    boolean_a = ct.boolean_augmented_matrix()
    sparsity = ct.sparsity()
    t = time.time() - t1
    
    print(t)
    print(ct.initial_size)
    print(ct.final_size)

    ct.generate_csv()
    
The file "i_simple.csv" is exactly the first table on this post. This code uses all features of the class, producing both boolean matrix and boolean augmented matrix and calculates the sparsity ratio. The most simple example needs 5 lines (without verbose).
    from others.catbin import CatBin
    
    file = "test data/i_simple.csv"
    output_file = "test data/clean_matrix.csv"

    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[1, 5], 
                other_cols={'i': 'modules', 'e': 'price'}, op_nan=True)

    boolean_a = ct.boolean_augmented_matrix()
    
Results:
    
      modules  A1  A2  A3  A4  B1  B2  C1  price
    0     m_1   1   1   0   0   0   0   0    9.9
    1     m_2   1   0   1   0   0   0   0   10.5
    2     m_3   0   0   0   1   1   1   1   20.3
    3     m_4   1   0   0   0   1   0   1   22.1

    0.010005950927734375
    rows=4, columns=4, total=16 elements
    rows=4, columns=7, total=28 elements
    
In my laptop(4GB, i5) the process took 0.01 seconds. The next step will be to test something bigger. The dimensiones of the new data is 2194 x 11:
    1.1577680110931396
    rows=2194, columns=9, total=19746 elements
    rows=2194, columns=294, total=645036 elements
    
Due to the number of unique values is much higher, the number of elements is almost 650.000 and the process now took 1.15 seconds, surely this can be improved, but for now it works for my purposes. As shown, the new dimensiones will be multiplied by a factor X(unique values / initial columns).

Hence, having large amount of binary data in a database can also have an impact on performance, so I will not recommend to store the new dataset in a DB better create a system file. There are other options of storing a sparse matrix much more efficient that save the complete data in a DB (CSR or CSC, for instance).

One more example: This dataset includes extra columns before and after the fields with categorical data (2194 x 13). If the order of the extra columns matters is convenient to use ordered dictionaries.
    from others.catbin import CatBin
    from collections import OrderedDict
    import time

    file = "test data/i_data_2.csv"
    output_file = "test data/clean_matrix.csv"

    others = {'i1': 'modules', 'i2': 'modules2', 'e1': 'price', 'e2': 'price2'}
    ordered_others = OrderedDict(sorted(others.items(), key=lambda t: t[0]))

    t1 = time.time()
    
    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[2, 11],
                other_cols=others, op_nan=True)
    
    boolean_m = ct.boolean_matrix()
    boolean_a = ct.boolean_augmented_matrix()
    
    t = time.time() - t1
    
    sparsity = ct.sparsity()
    print(t)
    print(ct.initial_size)
    print(ct.final_size)
    print(boolean_a)
    
Finally, a function for plotting:
    def plot_sparse(self, ticks=False, x_ticks=False, y_ticks=False, y_label=None,
                    sparsity=False):
        """
        Notes: 
            if sparsity was previously calculated - this figure can be included.
            if data is extraordinarily large might be convenient to hide ticks.
        """
        if sparsity and self.p_sparsity >= 0:
            plt.title("sparsity ratio = {0:.4}%".format(self.p_sparsity * 100), fontsize=10)
            plt.suptitle("boolean matrix", fontsize=14, fontweight='bold')
        else:
            plt.title("boolean matrix", fontsize=14, fontweight='bold')
        plt.imshow(self.boolean_m, interpolation='nearest', cmap='gist_yarg')

        if ticks is True:
            plt.yticks(range(self.boolean_m.shape[0]))
            if x_ticks is True:
                plt.xticks(np.arange(len(self.list)), self.list)
            if y_label is not None and y_ticks is True:
                plt.yticks(np.arange(len(self.input_data[y_label])), self.input_data[y_label])
        else:
            plt.axis('off')

        plt.show()
    
Let's plot both matrices used on the previous examples:
    ct.plot_sparse(ticks=True, x_ticks=True, sparsity=True)
    
fig.1: small matrix
    ct.plot_sparse(ticks=False, sparsity=True)
    
fig.2: medium matrix
    ct.plot_sparse(ticks=True, x_ticks=True, sparsity=True)
    
fig.3: medium matrix - detail
That's all for this post, I hope you find it useful.