Friday 31 October 2014

creating shared library AMPL

On this post I will try to explain in detail how create a shared library for AMPL. At this moment the Extended Function Library extends AMPL with over 300 functions, all of them available on the GNU Scientific library. Hence, before creating your own library you should check if your function is already included in that library.

I started my custom library because I needed the sigmoid function for one project and that function was not available in the Extended Function Library.

This process has the next steps:

  • Download headers: funcadd.h and stdio.1.h
  • Visual Studio: create a new Win32 project -> DLL.
  • Build DLL and testing.

The headers can be downloaded from the pages below:

Once this is done, create a new project -> Win32 project in Visual Studio and select application type DLL. Add both headers to Header Files and create a new file for your functions (e.g mathFuncs.c). Now your solution explorer should look like this:

Figure 1. Solution Explorer - Visual Studio

As said, we will create the sigmoid function and an approximation, the C code below shows one possible implementation, surely can be improved to gain efficiency. But before getting into the code, let's explain some general concepts about solvers and sigmoid functions. In general, the solvers need the first and second order derivatives of the function, depending on the method implemented by the solver (quasi-newton, newton's method, etc.), they may require the first derivative or both, as example MINOS requires just the first order derivative since it approximates the second one. In this .dll we will provide both derivatives of the functions.

Sigmoid function: $\frac{1}{(1 +\exp(-x))}$

First order derivative: $\frac{1}{(1 +\exp(-x))} (1 - \frac{1}{(1 +\exp(-x))}) = sigmoid(x)(1-sigmoid(x))$

Second order derivative: $\frac{-(\exp(x)(\exp(x) - 1))}{(\exp(x) + 1)^3}$

A possible sigmoid function approximation ,computationally faster, is the "fast sigmoid" function. The evaluation of the exponential function has a relevant cost and in many situations the exact sigmoid function may not be needed. This post is an interesting resource about the topic with a good benchmarking between methods link.

Fast sigmoid function: $\frac{x}{(1 + |x|)}$

First order derivative: $\frac{(-\frac{x^2}{|x|} + |x| + 1)}{|x|(1 + |x|)^2}$

Second order derivative: $\frac{3x^2|x| - 3|x|^3 - 3|x|^2 + x^2}{|x|^3(|x| + 1)^3}$

#include "stdlib.h"
#include "math.h"
#include "funcadd.h"    /* includes "stdio1.h" */

#define MAX_VAL 1.0e+300

#ifdef __cplusplus
/* The #ifdef __cplusplus lines allow this to be compiled
 * either by an ANSI C compiler or by a C++ compiler.
 */
extern "C" {
#endif

 static real
sigmoid(arglist *al)
 {
     /*  Sigmoid function: sigmoid(x) = 1/(1 + exp(-x)) */
     
     real x = al->ra[0];
     AmplExports *ae = al->AE;

     if (x > MAX_VAL || x < -MAX_VAL) 
     {
         fprintf(Stderr, "x = %d exceeded the maximum allowed value.", x);
         return -1;
     }
     else
     {
        if (al->derivs)
        {
            //first derivative: sigmoid gradient
            //  domain: R (all real numbers)
            //  range: {y c R: 0 <= y <= 0.25}

            *al->derivs = (1.0f/(1.0f + exp(-x)))*(1.0f -1.0f/(1.0f + exp(-x)));
            if (al -> hes)
            {
                //second derivative (hessian)
                //  domain: R (all real numbers)
                //  range: {y c R: -1/(6*sqrt(3)) <= 1/(6*sqrt(3))}

                *al->hes = -(exp(x) * (exp(x) - 1.0f)/ (pow(exp(x) + 1.0f, 3)));
            }
        }
        return 1.0f/(1.0f + exp(-x));
     }   
 }
 
 static real
fast_sigmoid(arglist * al)
 {
    /* fast sigmoid: sigmoid function approximation.
       fast_sigmoid(x) = x/(1 + abs(x))

       Domain {x c R: x != 0} -> derivative discontinuity at x = 0
     */
     
     real x = al->ra[0];
     if (al->derivs)
     {
         // first derivative (gradient)
         *al->derivs = (-pow(x,2)/fabs(x) + fabs(x) + 1) /(fabs(x) * pow(1 + fabs(x), 2));
         
         if (al -> hes)
         {
             // second derivative (hessian)
             real d = 3 * pow(x,2) * fabs(x) - 3 * pow(fabs(x), 3);
             real d1 = 3 * pow(fabs(x),2) + pow(x,2);
             *al->hes = (d  - d1) / (pow(fabs(x), 3) * pow(fabs(x) + 1, 3));
         }
     }
     return x / (1 + fabs(x));  
 }

 void
funcadd(AmplExports *ae){
/* Insert calls on addfunc here... */

/* Arg 3, called type, must satisfy 0 <= type <= 6:
 * type&1 == 0: 0,2,4,6 ==> force all arguments to be numeric.
 * type&1 == 1: 1,3,5   ==> pass both symbolic and numeric arguments.
 * type&6 == 0: 0,1 ==> the function is real valued.
 * type&6 == 2: 2,3 ==> the function is char * valued; static storage
                suffices: AMPL copies the return value.
 * type&6 == 4: 4,5 ==> the function is random (real valued).
 * type&6 == 6: 6   ==> random, real valued, pass nargs real args,
 *              0 <= nargs <= 2.
 *
 * Arg 4, called nargs, is interpretted as follows:
 *  >=  0 ==> the function has exactly nargs arguments
 *  <= -1 ==> the function has >= -(nargs+1) arguments.
 */

    addfunc("sigmoid", (rfunc) sigmoid, 0, 1, 0);
    addfunc("f_sigmoid", (rfunc) fast_sigmoid, 0, 1, 0);
}

#ifdef __cplusplus
    }
#endif

When creating a custom functions there are two important points, write the first order derivative (*al ->derivs) and if there is second derivative (if the solver's method requires the second order derivative for computing the Hessian ; al -> hes) then (*al->hes).

A last comment regarding funcadd(); It is important to define appropiate names to call your function (first argument in addfunc()), in this case I chose "sigmoid" and "f_sigmoid", keep in mind this, because otherwise AMPL will not recognize your functions.

Now it is time to build the .dll and do some testing with AMPL. Paste your .dll into your AMPL directory:

Figure 2. AMPL directory (student-edition)
Open ampl.exe or amplide.exe and type the instructions below to test your functions; if you followed these steps your output should be identical.
Figure 3. AMPL console
That's all from my side, I hope you find it useful.

Thursday 14 August 2014

Sparse matrix: from categorical matrix to binary matrix

On this post I will show a mini project I have been working on for the last few days. The goal is generate boolean or binary data from categorical data. For doing so, I will primarly use Pandas and Numpy libraries. This class was created for testing purposes, so performance has not been taken into account and therefore is not relevant at this moment, however I am convinced that the code can be much more optimized.

Firstly, a short explanation about what this code is doing:

The next table represents the input data, as shown, fields p1, p2, p3 and p4 store categorical data, in this case a product code. Each module consists of four products (p) at most. Note: field "modules" is the index (row names).

modules p1 p2 p3 p4 price
m_1 A1 A2 9.90
m_2 A1 A3 10.50
m_3 B1 B2 A4 C1 20.30
m_4 B1 A1 C1 22.10

Now, let's imagine we want to solve a linear equation system (Ax = b) composed by field p1, p2, p3 and p4 as matrix "A" , unknows (A1, A2, A3, A4, B1, B2, C1) and price as vector "b":

$A=\begin{bmatrix}a_{11} & a_{12} & \cdots & a_{1n} \\a_{21} & a_{22} & \cdots & a_{2n} \\\vdots & \vdots & \ddots & \vdots \\a_{m1} & a_{m2} & \cdots & a_{mn}\end{bmatrix}, \quad x=\begin{bmatrix}x_1 \\x_2 \\\vdots \\x_n\end{bmatrix}, \quad b=\begin{bmatrix}b_1 \\b_2 \\\vdots \\b_m\end{bmatrix}$

The matrix equations shows that size(A) = m x n, size(x) = n and size(b) = m. On the above example columns(A) is not equal to size(x). Hence, we need a new matrix A with the correct shape, see below.

$A=\begin{bmatrix}1 & 1 & 0 & 0 & 0 & 0 & 0 \\1 & 0 & 1 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 1 & 1 & 1 & 1 \\1 & 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}, \quad x=\begin{bmatrix}A1 \\A2 \\A3 \\A4 \\B1 \\B2 \\C1 \end{bmatrix}, \quad b=\begin{bmatrix}9.90 \\10.50 \\20.30 \\22.10 \end{bmatrix}$

If we get a matrix "A" like the previous one, then we can solve the system easier.
Using vectorial notation, m_1 = A(1, :).x (e.g. Matlab indexing)

Before we continue, I would like to remark that the above system can be easily solvead by isolating x from the equation or by using LU decomposition, especially for square matrices. When dealing with non square matrices ($m \neq n$) the next methods are normaly used:

  • Normal equation: $x = (A^{T} A)^{-1} A^{T} b $
  • QR factorization
  • SVD decomposition

Code:

    import numpy as np
    import pandas as pd
    import warnings
    import matplotlib.pyplot as plt


    class CatBin(object):
        def __init__(self, i_file, o_file=False, cat_cols="all", other_cols=None, op_nan=True):
            """
            Class to convert a matrix with categorical data into a matrix with binary data.

            Example: Convert the below matrix into a binary matrix

                initial_data =
                modules   p1    p2      p3      p4     price
                m_1       A1    A2      nan     nan    9.90
                m_2       A1    nan     A3      nan    10.50
                m_3       B1    B2      A4      C1     20.30
                m_4       B1    nan     A1      C1     22.10

                binary_data =
                modules   A1  A2  A3  A4  B1  B2  C1    price
                m_1       1   1   0   0   0   0   0     9.90
                m_2       1   0   1   0   0   0   0     10.50
                m_3       0   0   0   1   1   1   1     20.30
                m_4       1   0   0   0   1   0   1     22.10

            Uses:
            Solve a linear equation system: Ax = b
                A = matrix[0,1]
                x = [A1, A2, A3, A4, B1, B2, C1]
                b = price

                Note:
                if non-square matrix, we can solve the system by using some 
                of the well-known methods:

                    Normal equation
                    QR factorization
                    SVD decomposition

            Graphs - adjacency matrix - connectivity
            ...

            :param i_file: input_file in csv format
            :param o_file: if True a csv is generated. The path must be provided.
            :param cat_cols: slicing - e.g [1, 10] -> from column 1 to 10(not included - pandas)
            :param other_cols:

            if the new data frame must preserve others columns besides those
            with categorical data, position and name should be provided in a dictionary.
                e.g
                others = {'i' : 'modules', 'e', 'prices'}
                if 'i', that column will be inserted on the first position in the new data frame
                    data.insert(0, 'name', column_data)
                if 'e', that column will be inserted after the rest of columns
                    data['name'] = column_data

            :param op_nan: if true all NaN values will be removed.
            """

            self.o_file = o_file
            self.cat_cols = cat_cols
            self.other_cols = other_cols
            self.nan = op_nan
            self.list = []

            self.rows = None
            self.columns = None
            self.p_sparsity = -1

            self.input_data = pd.read_csv(i_file)
            self.boolean_m = None
            self.boolean_a = None

            if self.cat_cols is "all":
                if self.nan is True:
                    # if all data is categorical data to be converted, automatically 
                    # others_cols must be false to avoid inconsistencies.
                    
                    self.data = self.input_data.fillna(value='nan')
                else:
                    self.data = self.input_data
                if other_cols is not None:
                    other_cols = None
                    warnings.warn("other_cols has been set to False to avoid inconsistencies")
            else:
                # cat_cols is an array [start, end]
                if isinstance(self.cat_cols, list):
                    if len(self.cat_cols) == 2:
                        _start = self.cat_cols[0]
                        _end = self.cat_cols[1]
                        if self.nan is True:
                            self.data = self.input_data.fillna(value='nan').iloc[:, _start: _end]
                        else:
                            self.data = self.input_data.iloc[:, _start, _end]
                    else:
                        raise ValueError("'start'and 'end' position must be provided")
                else:
                    raise TypeError("cat_cols is not a list")

            if not self.data.empty:
                # extract unique values:
                self.list = self.unique_values()
            else:
                raise ImportError("errors or empty csv file. Check input file")

        def unique_values(self):
            """ return sorted list of unique values """

            u_values = pd.Series(self.data.values.ravel()).unique()
            if self.nan is True:
                return np.sort(u_values[u_values != 'nan'])
            else:
                return np.sort(u_values)

        def boolean_matrix(self):
            matrix = []
            for i in range(len(self.data)):
                v = self.data.iloc[i, :].values
                # avoid worrying about the argument op_nan
                intersect = set(v).intersection(self.list)

                ea = np.zeros(len(self.list))
                for n in intersect:
                    for index, v in enumerate(self.list):
                        if n == v:
                            ea[index] = 1
                matrix.append(ea)

            self.rows = len(matrix)
            self.columns = len(matrix[0])
            self.boolean_m = pd.DataFrame(matrix, columns=self.list)

            return self.boolean_m

        def boolean_augmented_matrix(self):
            """
            The goal of this function is create the augmented matrix:
            Ax = b --> augmented_matrix = [A|b]

            However, this function allows the option of inserting at least 
            one first column and many others columns after the boolean data, 
            in case that was required to suit your necessities.
            """
            if self.cat_cols is not 'all':
                if self.boolean_m is None:
                    self.boolean_a = self.boolean_matrix()
                else:
                    self.boolean_a = self.boolean_m.copy()

                if self.other_cols is not None and isinstance(self.other_cols, dict):
                    for key, value in self.other_cols.items():
                        if key[0] is 'i':
                            self.boolean_a.insert(0, value, self.input_data[value])
                        elif key[0] is 'e':
                            self.boolean_a[value] = self.input_data[value]
                        else:
                            raise ValueError("Incorrect key format")
                else:
                    warnings.warn("insufficient arguments")
                return self.boolean_a
            else:
                raise ValueError("cat_cols = 'all' the aug. matrix is not possible")

        def sparsity(self):
            """
            The sparsity or density is the fraction of non-zero elements in a matrix.
            :return: ratio non-zero/total
            """
            if self.rows and self.columns and self.boolean_m is not None:
                total = self.rows * self.columns
                non_zeros = sum(self.boolean_m[self.boolean_m == 1].count())
                self.p_sparsity = non_zeros / total
                return "sparsity = {0:.4}%".format(self.p_sparsity * 100)
            else:
                raise ValueError("incorrect data shape. A boolean transformation is required")

        def generate_csv(self, matrix_type='normal'):
            """
            generate a csv file with boolean matrix or augmented matrix
            note: removes first column (index)
            """
            if matrix_type in ['normal', 'augmented']:
                if matrix_type is 'normal':
                    if self.boolean_m is not None:
                        self.boolean_m.to_csv(self.o_file, index=False)
                else:
                    if self.boolean_a is not None:
                        self.boolean_a.to_csv(self.o_file, index=False)
            else:
                raise ValueError("select: normal or augmented")

        @property
        def clean_data(self):
            return self.data

        @property
        def raw_data(self):
            return self.input_data

        @property
        def initial_size(self):
            """ Only includes those fields with categorical data"""
            rows = self.data.shape[0]
            columns = self.data.shape[1]
            return "rows={0}, columns={1}, total={2} elements".\
                format(rows, columns, rows * columns)

        @property
        def final_size(self):
            rows = self.boolean_m.shape[0]
            columns = self.boolean_m.shape[1]
            return "rows={0}, columns={1}, total={2} elements".\
                format(rows, columns, rows * columns)
    

The code tries to be self-explanatory, however I consider that three functions need more details:

  • boolean_matrix: This function contains the basic code to convert the input data. Basically, it compares each row from the initial data "A" with its unique values and inserts one on those positions where matching. As example, this task can be done in Matlab much easier, Matlab can compare two arrays with different length by using the function "ismember()":
            
              p = {'A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'C1'}; % character arrays or cell arrays
              p_test = {'A1', 'B1','C1'};
              ismember(p, p_test)
    
              ans =
    
                 1     0     0     0     1     0     1
            
  • boolean_augmented_matrix: The augmented matrix is that obtained by appending the solution vector "b" to matrix "A", usually denoted as [A b] or (A|b). Having a solvable linear equation system, then the augmented matrix [A b] has the same rank as A. Besides of creating the augmented matrix, this function allows the possibility to append other columns if needed.

  • Sparsity: The sparsity or density is the fraction of non-zero elements in matrix. A typical example of sparse matrix is the adjacency matrix of an undirected graph.

Below a short example using the class CatBin:
    from others.catbin import CatBin
    import time

    file = "test data/i_simple.csv"
    output_file = "test data/clean_matrix.csv"

    others = {'i': 'modules', 'e': 'price'}

    t1 = time.time()
    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[1, 5], other_cols=others, op_nan=True)
    
    unique_v = ct.unique_values()
    boolean_m = ct.boolean_matrix()
    boolean_a = ct.boolean_augmented_matrix()
    sparsity = ct.sparsity()
    t = time.time() - t1
    
    print(t)
    print(ct.initial_size)
    print(ct.final_size)

    ct.generate_csv()
    
The file "i_simple.csv" is exactly the first table on this post. This code uses all features of the class, producing both boolean matrix and boolean augmented matrix and calculates the sparsity ratio. The most simple example needs 5 lines (without verbose).
    from others.catbin import CatBin
    
    file = "test data/i_simple.csv"
    output_file = "test data/clean_matrix.csv"

    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[1, 5], 
                other_cols={'i': 'modules', 'e': 'price'}, op_nan=True)

    boolean_a = ct.boolean_augmented_matrix()
    
Results:
    
      modules  A1  A2  A3  A4  B1  B2  C1  price
    0     m_1   1   1   0   0   0   0   0    9.9
    1     m_2   1   0   1   0   0   0   0   10.5
    2     m_3   0   0   0   1   1   1   1   20.3
    3     m_4   1   0   0   0   1   0   1   22.1

    0.010005950927734375
    rows=4, columns=4, total=16 elements
    rows=4, columns=7, total=28 elements
    
In my laptop(4GB, i5) the process took 0.01 seconds. The next step will be to test something bigger. The dimensiones of the new data is 2194 x 11:
    1.1577680110931396
    rows=2194, columns=9, total=19746 elements
    rows=2194, columns=294, total=645036 elements
    
Due to the number of unique values is much higher, the number of elements is almost 650.000 and the process now took 1.15 seconds, surely this can be improved, but for now it works for my purposes. As shown, the new dimensiones will be multiplied by a factor X(unique values / initial columns).

Hence, having large amount of binary data in a database can also have an impact on performance, so I will not recommend to store the new dataset in a DB better create a system file. There are other options of storing a sparse matrix much more efficient that save the complete data in a DB (CSR or CSC, for instance).

One more example: This dataset includes extra columns before and after the fields with categorical data (2194 x 13). If the order of the extra columns matters is convenient to use ordered dictionaries.
    from others.catbin import CatBin
    from collections import OrderedDict
    import time

    file = "test data/i_data_2.csv"
    output_file = "test data/clean_matrix.csv"

    others = {'i1': 'modules', 'i2': 'modules2', 'e1': 'price', 'e2': 'price2'}
    ordered_others = OrderedDict(sorted(others.items(), key=lambda t: t[0]))

    t1 = time.time()
    
    ct = CatBin(i_file=file, o_file=output_file, cat_cols=[2, 11],
                other_cols=others, op_nan=True)
    
    boolean_m = ct.boolean_matrix()
    boolean_a = ct.boolean_augmented_matrix()
    
    t = time.time() - t1
    
    sparsity = ct.sparsity()
    print(t)
    print(ct.initial_size)
    print(ct.final_size)
    print(boolean_a)
    
Finally, a function for plotting:
    def plot_sparse(self, ticks=False, x_ticks=False, y_ticks=False, y_label=None,
                    sparsity=False):
        """
        Notes: 
            if sparsity was previously calculated - this figure can be included.
            if data is extraordinarily large might be convenient to hide ticks.
        """
        if sparsity and self.p_sparsity >= 0:
            plt.title("sparsity ratio = {0:.4}%".format(self.p_sparsity * 100), fontsize=10)
            plt.suptitle("boolean matrix", fontsize=14, fontweight='bold')
        else:
            plt.title("boolean matrix", fontsize=14, fontweight='bold')
        plt.imshow(self.boolean_m, interpolation='nearest', cmap='gist_yarg')

        if ticks is True:
            plt.yticks(range(self.boolean_m.shape[0]))
            if x_ticks is True:
                plt.xticks(np.arange(len(self.list)), self.list)
            if y_label is not None and y_ticks is True:
                plt.yticks(np.arange(len(self.input_data[y_label])), self.input_data[y_label])
        else:
            plt.axis('off')

        plt.show()
    
Let's plot both matrices used on the previous examples:
    ct.plot_sparse(ticks=True, x_ticks=True, sparsity=True)
    
fig.1: small matrix
    ct.plot_sparse(ticks=False, sparsity=True)
    
fig.2: medium matrix
    ct.plot_sparse(ticks=True, x_ticks=True, sparsity=True)
    
fig.3: medium matrix - detail
That's all for this post, I hope you find it useful.

Saturday 29 March 2014

TSP / ATSP algorithm.

This post will be dedicated to the Travelling salesman problem (TSP), one of the most known combinatorial optimization problem. The goal of this problem is simple: A salesman has to visit a number of places (nodes). Which route would visits them all in the shortest total distance (time, cost, etc...)
The simplest way to solve the problem is list and try all possible routes and them select the shortest one. The problem here is the computational time which will grow faster than exponential (Stirling's formula: $n! \approx \sqrt{2\pi n}(\frac{n}{e})^n$)

There are other solutions which can reduce this computational time, for example dynamic programming which solves the problem in exponential time E. The first implemented method using that more efficient method was the Held-Karp algorithm which finds the shortest route in $2^n n^2$
Other methods are: k-opt, Greedy algorithm, Greedy k-opt, Genetic algorithm, Simulating annealing, neuronal network approach among others.

On this post I will show an implementation for the simplest and naive method, an implementation for dealing with asymmetric input matrices and finally we will compare these methods with the genetic algorithm(this implementation has too many lines of code for one post)

import itertools
import numpy as np

INF = 1000000


class TSPExact:
    def __init__(self, matrix, redundant=False):
        """
        input: Adjacency matrix N x N
        len(nodes) = shape(input)
        @param matrix: N x N matrix
        @param redundant: false by default (n-1)!
        """
        # 1: Check if the adjacency matrix is symmetric
        self.cost_matrix = np.array(matrix)
        self.symmetry = None
        if self.symmetric() is False:
            self.symmetry = False
            self.as_symmetric()

        self.size = len(self.cost_matrix)
        self.nodes = range(self.size)
        self.all_tours = itertools.permutations

        if redundant is False:
            self.tours = self.non_redundant()
        else:
            self.tours = list(self.all_tours(self.nodes))

    def as_symmetric(self):
        """
        Reformulate an asymmetric TSP as a symmetric TSP: 
        "Jonker and Volgenant 1983"
        This is possible by doubling the number of nodes. For each city a dummy 
        node is added: (a, b, c) => (a, a', b, b', c, c')

        distance = "value"
        distance (for each pair of dummy nodes and pair of nodes is INF)
        distance (for each pair node and its dummy node is -INF)
        ------------------------------------------------------------------------
          |A'   |B'   |C'   |A    |B    |C    |
        A'|0    |INF  |INF  |-INF |dBA  |dCA  |
        B'|INF  |0    |INF  |dAB  |-INF |dCB  |
        C'|INF  |INF  |0    |dAC  |dBC  |-INF |
        A |-INF |dAB  |dAC  |0    |INF  |INF  |
        B |dBA  |-INF |dBC  |INF  |0    |INF  |
        C |dCA  |dCB  |-INF |INF  |INF  |0    |

        For large matrix an exact solution is infeasible
        if N > 5 (N = N*2 > 10) then use other techniques: 
        Heuristics and relaxation methods
        @return: new symmetric matrix

        [INF][A.T]
        [A  ][INF]
        """

        shape = len(self.cost_matrix)
        mat = np.identity(shape) * - INF + self.cost_matrix

        new_shape = shape * 2
        new_matrix = np.ones((new_shape, new_shape)) * INF
        np.fill_diagonal(new_matrix, 0)

        # insert new matrices
        new_matrix[shape:new_shape, :shape] = mat
        new_matrix[:shape, shape:new_shape] = mat.T
        # new cost matrix after transformation
        self.cost_matrix = new_matrix

    def total_tour(self, tour):
        total = sum(self.cost_matrix[tour[node], tour[(node + 1) % self.size]] for node in self.nodes)
        return total, tour

    def shortest(self):
        min_tour = min(self.total_tour(tour) for tour in self.tours)
        if self.symmetry is False:
            min_tour_real = min_tour[0] + self.size / 2 * INF
            min_cycle_real = min_tour[1][0::2]
            return min_tour_real, min_cycle_real
        else:
            return min_tour

    def non_redundant(self):
        start_node = self.nodes[0]
        new_set = list(self.nodes)[1:]
        return [list(np.append(start_node, tour)) for tour in self.all_tours(new_set)]

    def symmetric(self):
        #@return: boolean symmetric | asymmetric
        mat = [tuple(row) for row in self.cost_matrix]
        return mat == zip(*mat)

    @property
    def tour_iterations(self):
        return list(self.tours)

    @property
    def tour_size(self):
        return self.size
The code above tried to be self explanatory, but it might be some points unclear. The trickiest part is the transformation from asymmetric matrix to symmetric matrix. As shown in the code, we are using the implementation suggested by Jonker R, Volgenant T (1983) (“Transforming asymmetric into symmetric traveling salesman problems.” Operations Research Letters, 2, 161–163). The process behind is simple, doubling the number of nodes and transposing the original matrix it is possible get a symmetric matrix which can be solvable applying the current known methods. Finally this algorithm requires the removal of all the dummy nodes and a subtle operation in order to get the real shortest tour value.

$shortestRoute (real) = shortestRoute + \frac{Size(InputMatrix)}{2} \times INF.Value $
We can use slicing in Numpy for removing dummy nodes.

min_tour_real = min_tour[0] + self.size / 2 * INF
min_cycle_real = min_tour[1][0::2]

Unit testing

Now it's time to compare some algorithms. I have hard coded three different matrices for testing purpose. We must keep in mind that the matrix size will be multiplied by two if input matrix is asymmetric. In addition, one must point out that the basic algorithm becomes infeasible when the number of nodes is higher than 12 (6 - asymmetric matrix)

from Algorithms.scripts.TSP_Algorithm import *
import unittest
import time

MATRIX = [[0, 750, 43, 402], [750, 0, 736, 671], [43, 736, 0, 117], [402, 671, 117, 0]]
MATRIX5 = [[0, 750, 43, 402, 133], [750, 0, 736, 671, 85], [43, 736, 0, 117, 934], 
     [402, 671, 117, 0, 982], [133, 85, 934, 982, 0]]

MATRIX8 = [[0, 750, 431, 402, 133, 489, 310, 566], [750, 0, 736, 671, 85, 174, 870, 927],
           [431, 736, 0, 117, 934, 65, 939, 305], [402, 671, 117, 0, 982, 727, 911, 870],
           [133, 85, 934, 982, 0, 956, 834, 118], [489, 174, 65, 727, 956, 0, 322, 397],
           [310, 870, 939, 911, 834, 322, 0, 127], [566, 927, 305, 870, 118, 397, 127, 0]]


class TestAlgorithms(unittest.TestCase):
    @staticmethod
    def test_exact():
        start_time = time.time()
        tsp = TSPExact(MATRIX5)
        total_value, tour = tsp.shortest()
        print "__exact__"
        print total_value
        print tour
        print "Time (s)", time.time() - start_time
  
    @staticmethod
    def test_genetic():
        start_time = time.time()
        tsp = TSPGenetic(MATRIX5)
        total, route = tsp.calculate()
        print "__genetic__"
        print total
        print route
        print "Time (s)", time.time() - start_time

Some results

MATRIX(Asymmetric)
__exact__
1580.0
[0, 2, 3, 1]
Time (s) 0.130000114441
__genetic__
1580
[0, 2, 3, 1]
Time (s) 0.00100016593933

MATRIX5(Asymmetric)
__exact__
1049.0
[0, 2, 3, 1, 4]
Time (s) 10.1160001755
__genetic__
1049
[0, 2, 3, 1, 4]
Time (s) 0.000999927520752

MATRIX8(Asymmetric)
__exact__
16 nodes - time...
__genetic__
1398
[0, 3, 2, 5, 1, 4, 7, 6]
Time (s) 0.00399994850159


Comments

As shown on the previous tests, the problem becomes intractable for the Exact algorithm as the number of nodes increases. Using the Genetic algorithm we got the same results, however this algorithm might not obtain the exact solution in large graphs.

Sunday 23 February 2014

Python and Excel/VBA

Today we are going to talk about how easy it could be run Python's scripts from Excel, which might be considered pretty interesting. Basically we will create a VBA code which launches a console running the selected script. This script creates an output which must be printed on the same excel file from where it was launched.

Continuing with the previous post, we will use the scraping script to print the extracted data into excel. First we will change a bit the previous code and we will create a class called "collect". UPDATE: The website has changed and the previous code doesn't work any more!

import numpy as np
import urllib2
import urlparse
import re
from bs4 import BeautifulSoup


class collect:

    def __init__(self, mainSites):
        self.sites = mainSites

    def connect(self, site):
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
                'Accept': 'text/html,application/xhtml+xml,application/xml'
                ';q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        req = urllib2.Request(site, headers=hdr)
        page = urllib2.urlopen(req).read()
        soup = BeautifulSoup(page)
        return soup

    def extractUrls(self):
        FxUrls = []
        C1 = []
        C2 = []
        for mainSite in self.sites:
            soup = self.connect(mainSite)
            for link in soup.find_all('td', {'class': 'bold left noWrap'}):
                path = 'http://www.investing.com/currencies/'
                url = link.find(href=re.compile(path)).get('href')
                FxUrls.append(url)
                curr = urlparse.urlparse(url).path.split("/")[2].split("-")
                C1.append(curr[0])
                C2.append(curr[1])
        return np.vstack((FxUrls, C1, C2))

    def extractData(self):
        Value = []
        Data = self.extractUrls()
        for link in Data[0, :]:
            soup = self.connect(link)
            for span in soup.find_all('span', {'id': 'last_last'}):
                value = (span.text).replace(',', '')
                Value.append(value)
        return np.vstack((Data, Value))
Once we have the script is time to create the interaction with Excel. In order to do this operation we will use an excellent module: pyvot. I have previously prepared some special ranges for printing the data (I recommend Name ranges as a good Excel/VBA practice). The script below will create a connection with Excel, just selecting the desired ranges and sending the values. This script will also need some arguments, the workbook name from where the script is launched and our links.

import sys
import scraping
import xl

#arguments:
excelWb = sys.argv[1]
print "Excel workbook: ", excelWb

mainSites = []
for i in range(len(sys.argv) - 2):
    mainSites.append(str(sys.argv[i + 2]))
print "urls: ", mainSites

scr = scraping.collect(mainSites)
data = scr.extractData()

rngCurr1 = data[1, :]
rngCurr2 = data[2, :]
rngValue = data[3, :]

"""Excel connection"""
wb = xl.Workbook(excelWb)
wb.range("rngCurr1").set(rngCurr1)
wb.range("rngCurr2").set(rngCurr2)
wb.range("rngRate").set(rngValue)

Excel template

Ranges

Finally, we need a VBA code which can run a script through the Windows console. First we will be calling a function to clear ranges contents and the main function afterwards. The main function, called runScript, creates an object Shell and runs the script with its arguments, which are: workbook name and array of urls.

Option Explicit

Sub clean()

Dim rngForex As Range
Set rngForex = Sheet1.Range("rngForex")
rngForex.ClearContents

End Sub

Sub runScript()

Dim oShell As Object
Dim sScript As String, sScriptName As String, sWbName As String
Dim sArguments As String
Dim iRegions As Integer, i As Integer
Dim aRegion() As String, sLink As String

Call clean

Set oShell = CreateObject("WScript.Shell")

sWbName = ThisWorkbook.Name
sScriptName = "scrapingtest.py"
sLink = "http://www.investing.com/currencies/"

iRegions = Application.WorksheetFunction.CountA(Sheet1.Range("rngRegions"))
ReDim aRegion(iRegions - 1)
For i = 1 To iRegions
    aRegion(i - 1) = sLink & CStr(Sheet1.Range("rngRegions").Cells(i, 1).Value)
    sArguments = sArguments & aRegion(i - 1) & " "
Next i

sScript = sScriptName & " " & sWbName & " " & sArguments
oShell.Run ("cmd.exe /S /C " & sScript)

End Sub
Now, if you click "RUN" a console will appear showing a message similar to this one:


Conclusion:
My personal opinion is that at this moment there are not real alternatives to Excel. Is true that there are open source tools like LibreOffice or Apache OpenOffice but these are not even comparable to Excel. It is quite common create scripts in C# to automate processes which are not so fast in VBA. VS languages (VB /C#) are chosen because they are much more integrated with Excel and generally speaking with most of the Microsoft applications.

The aim of this post was show that we can use Python to accelerate the code and simply use Excel as a template where print our results, creating dashboards, reports and so on. Thus, for me Python and Excel/VBA is a good association and I will use it going forward.

Saturday 25 January 2014

Networks 1: Scraping + Data visualization + Graph stats

These last weeks I have been reading about networks and optimization algorithms, I think is an interesting field with many applications, so my idea was write a new article (or series of articles) showing roughly how use some interesting python libraries like Networkx, for instance. In this article I also like to make a brief introduction to Scraping (a very good way to get data for your graphs).

And only to whet your appetite:

Figure 1. Undirected graph.


Web Scraping:

The first step to construct graphs like the previous one is get the data. As example, I wanted to visualize the whole foreign exchange market and the connections between each currency (Foreign exchange cross rates), but there are another good examples like user connections in social networks (Twitter, Facebook etc).
For this case study(Forex), there are several available websites where you can find financial data for free, nonetheless this data will be restricted to certain time periods, so do not expect 10 or 30 seconds candles for free.

List of Forex websites:

  1. Google Finance
  2. Yahoo Finance
  3. Dukascopy
  4. Investing

Once we have selected our source, we can think about how extract the info. Some websites like Yahoo Finance allow downloading end-of-day equities or forex data, normally in csv format. But let's image that our intention is to obtain the most possible current data. Here Scraping is the appropriate technique. Web scraping is the process of automatically collecting information from websites, usually simulating human interaction through a full-fledged web browser. Note: web scraping may be against the terms of use of several websites.

For this purpose, I used the next python modules:

  1. urllib2
  2. urlparse
  3. re
  4. BeautifulSoup

The goal then is extract all the foreign exchange cross rates. This data will be collected in three numpy arrays and finally we will create a dataframe with pandas to export a csv file. The tricky part of all this is parse the html file, however BeautifulSoup allows using regular expressions, which combined with urlparse creates a powerful and simple tool. So let's take a look at the script.

Script:

import numpy as np
import pandas as pd
import urllib2
import urlparse
import re
from bs4 import BeautifulSoup

#------------------------------------------------------------------------------
#Functions:
#------------------------------------------------------------------------------


def add_exchange(Data):
    Value = []
    for link in Data[0, :]:
        soup = connect(link)
        for span in soup.find_all('span', {'id': 'last_last'}):
            #remove thousands (",")
            value = (span.text).replace(',', '')
            Value.append(value)
    return np.vstack((Data, Value))


def connect(site):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    req = urllib2.Request(site, headers=hdr)
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    return soup

#------------------------------------------------------------------------------
#Extract list of currencies + Matrix
#------------------------------------------------------------------------------
mainSites = {'http://www.investing.com/currencies/middle-east',
    'http://www.investing.com/currencies/north-america',
    'http://www.investing.com/currencies/central-america',
    'http://www.investing.com/currencies/south-america',
    'http://www.investing.com/currencies/asia',
    'http://www.investing.com/currencies/pacific',
    'http://www.investing.com/currencies/europe',
    'http://www.investing.com/currencies/middle-east',
    'http://www.investing.com/currencies/africa',
    'http://www.investing.com/currencies/caribbean'}

FxUrls = []
C1 = []
C2 = []
for mainSite in mainSites:
    soup = connect(mainSite)
    for link in soup.find_all(href=re.compile("www.investing.com/currencies/")):
        #extract currencies:
        url = link.get('href')
        FxUrls.append(url)
        currencies = urlparse.urlparse(url).path.split("/")[2].split("-")
        C1.append(currencies[0])
        C2.append(currencies[1])

#Create Matrix
Forex = np.vstack((FxUrls, C1, C2))
#------------------------------------------------------------------------------

#Extract all foreign exchange cross rates:
Import = add_exchange(Forex)

#Pandas dataframe
data = {'currency_1': Import[1, :],
    'currency_2': Import[2, :],
    'value': Import[3, :]}

df = pd.DataFrame(data)
df.to_csv('Forex.csv', index=False, cols=['currency_1', 'currency_2', 'value'])

Summarizing what we have done in this script:

I wrote two functions, one in order to connect with website and capture the html page, and the other one to parse this html and select the required information. There are different ways to extract the data, select the one that best suits the structure of your html file. Run the script, just take into account that this script may take a while depending on your internet connection.
With this tool, you might consider a dynamic system to obtain frequently data. Well, that would be complicated and unlikely, your IP address would be blocked if the number of requests is considerable.


Networkx - Graphs:

Now we have collected the data, it's time to show what we got. For the creation and study of the graphs we will use the python package called Networkx. This is a pretty helpful package which provides many functions and algorithm that facilitate the study of complex networks. However, this may be not the best package to draw your graph, although we can use it as an interface to use Graphviz.

The first thing I did was write a new module, called "graphs", with the Networkx's functionalities I needed. This includes two classes, "create_graph"and "graphStats". The first one reads the produced csv file and creates the two elements that form a graph, nodes and edges. In this case we also have weights, the exchange rates (third column), so we need a list in order to create the labels, which looks like this:

example: {('D', 'E'):9.0, ('F', 'N'):6.0}...

The class "graphStats" includes two functions in order to calculate the degree centrality and betweenness centrality, we will talk about this in more detail later on.

Graph class:
import networkx as nx
from operator import itemgetters
import pandas as pd
import numpy as np


class create_graph:

    def __init__(self):
        data = 'csv/Forex.csv'
        headers = ['start', 'end', 'weights']
        self.Network = pd.read_csv(data, skiprows=1, names=headers)
        self.graph = np.asarray(self.Network[['start', 'end']])
        self.nodes = np.unique(self.graph)
        self.weights = list(map(float, self.Network['weights']))

    def networkList(self):
        G = nx.DiGraph()
        for node in self.nodes:
            G.add_node(node)

        self.graph_l = []
        for edge in self.graph:
            G.add_edge(edge[0], edge[1])
            self.graph_l.append((edge[0], edge[1]))

        labels = dict(list(zip(self.graph_l, self.weights)))
        return G, labels


class graphStats:

    def calculate_degree_centrality(self, graph):
        dgc_key = []
        dgc_value = []
        g = graph
        dc = nx.degree_centrality(g)
        nx.set_node_attributes(g, 'degree_cent', dc)
        degcent_sorted = sorted(dc.items(), key=itemgetter(1), reverse=True)
        for key, value in degcent_sorted:
            dgc_key.append(str(key))
            dgc_value.append(value)
        return dgc_key, dgc_value

    def calculate_betweenness_centrality(self, graph):
        btc_key = []
        btc_value = []
        g = graph
        bc = nx.betweenness_centrality(g)
        betcent_sorted = sorted(bc.items(), key=itemgetter(1), reverse=True)
        for key, value in betcent_sorted:
            btc_key.append(str(key))
            btc_value.append(value)
        return btc_key, btc_value


Now we have prepared the module, we will use matplotlib to draw the graph. Actually, this is not a huge network, it has only 133 nodes and 1918 edges, but it is difficult get a good resolution if your output file is .jpg or png, so I recommend a pdf file and some adjustments in the figure size.

Graph - plot:
import networkx as nx
import matplotlib.pyplot as plt
import graphs

#------------------------------------------------------------------------------
# Read data + create graph
#------------------------------------------------------------------------------
gr = graphs.create_graph()
graph = gr.graph
G, labels = gr.networkList()

#------------------------------------------------------------------------------
# Graph plot:
#------------------------------------------------------------------------------
fig = plt.figure()
#fig = plt.figure(figsize=(8,8), dpi=1000) - pdf file
pos = nx.graphviz_layout(G, prog='neato')
node_color = [float(G.degree(v)) for v in G]
nx.draw_networkx_nodes(G, pos, node_size=[float(G.degree(v)) * 2.5 for v in G],
             alpha=0.6, node_color=node_color)
nx.draw_networkx_edges(G, pos, alpha=0.25, edge_color='blue', width=0.5,
arrows=False)
nx.draw_networkx_labels(G, pos, font_size=2.5)

plt.axis('off')
plt.show()
#plt.savefig("directedGraph.pdf")

Figure 1.1. Undirected graph (detail).
Figure 2. Directed graph.

The second graph is which shows the real Forex network. The Forex network is a directed graph, a graph where its edges have a direction associated with them. This digraph is strongly connected due to it contains several pairs of vertices with both paths, (euro -> dolar and dolar -> euro, for instance).

Graph - stats:

Networkx is a great python package to study graphs, its main purpose. In this point we are going to look into some statistics which will help us to extract real information from our graphs with the aim of interpret the results properly.

1) Betweenness centrality:

Betweenness centrality quantifies the number of times a node acts as bridge along the shortest path between two other nodes. It is basically a measure of connectedness between components of the graph.

2) Degree centrality:

The centrality for a node measures its relative importance within the graph. The degree centrality for a node (n) is the number of nodes it is connected. The degree centrality values are normalized by the maximum degree in the graph. For a graph with self loops (a digraph) the maximum degree might be higher than 1 and the degree centrality as well.

In this article I will not expand the theory behind these measures.
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import graphs


def topTable(field1, field2, n_top):
    topM = max(field2) * 0.9
    right = len(field1) * 0.75
    plt.text(right, topM * 1.08, 'Top %s' % n_top, fontsize=12)
    for i in range(n_top):
        curr = field1[i]
        val = field2[i]
        plt.text(right, topM - i * topM / 20, '{}) {} = {}'.format(i + 1,
        curr.upper(), round(val, 3)), fontsize=8)

#------------------------------------------------------------------------------
# Read data + create graph
#------------------------------------------------------------------------------
gr = graphs.create_graph()
G, labels = gr.networkList()

#------------------------------------------------------------------------------
# Stats + plots:
#------------------------------------------------------------------------------
stats = graphs.graphStats()
betc_key, betc_value = stats.calculate_betweenness_centrality(G)
degc_key, degc_value = stats.calculate_degree_centrality(G)

#Average degree
N = G.order()
K = G.size()
avg_d = float(N) / K
avg_degree = 'Average degree: %.4f ' % (avg_d)


# Plot: Degree_centrality
plt.figure()

ax1 = plt.subplot(211)
plt.title('Degree centrality for nodes', fontsize=12)
a_lenght = np.arange(len(degc_value))
plt.bar(a_lenght, degc_value, color=cm.jet(degc_value), align='center')
plt.xticks(a_lenght, degc_key, size='small', rotation='vertical')
plt.tick_params(axis='x', labelsize=4)
plt.tick_params(axis='y', labelsize=8)
plt.autoscale(enable=True, axis='both', tight=None)

#Top degree centrality:
topTable(degc_key, degc_value, 10)
plt.text(len(degc_value) * 0.75, max(degc_value) * 0.4, avg_degree,
bbox={'facecolor': 'blue', 'alpha': 0.25, 'pad': 10}, fontsize=7)

# Plot: Betweenness_centrality
plt.subplot(212)
plt.title('Betweenness centrality for nodes', fontsize=12)
a_lenght = np.arange(len(betc_value))
plt.bar(a_lenght, betc_value, color=cm.jet(betc_value), align='center')
plt.xticks(a_lenght, betc_key, size='small', rotation='vertical')
plt.tick_params(axis='x', labelsize=4)
plt.tick_params(axis='y', labelsize=8)
plt.autoscale(enable=True, axis='both', tight=None)
plt.ylim(0, max(betc_value) * 1.1)
plt.plot(betc_value, '--b')

#Top degree betweenness:
topTable(betc_key, betc_value, 10)

plt.show()


The above plot shows the currencies which have more relevance into the network, and as we suspected these are: USD dolar, Euro and the pound sterling. We can make at least two significant considerations; The average degree centrality is quite low, what shows that there are currencies with a notable grade of popularity(Degree centrality : in-degree) and these same currencies play a broker role in the network due to cumulate most of trajectories for the shortest paths. For this particular example,this might mean that we can create a more simplistic network, which would represent pretty faithfully the Forex market, excluding those currencies with fewer number of connections.


Adjacency Matrix:

The adjacency matrix is a N x N matrix where the cells constitute edges. In this example the fields are coloured according the degree centrality and as the graph is directed, the matrix is not symmetric.
import numpy as np
import matplotlib.pyplot as plt
import graphs


gr = graphs.create_graph()
G = gr.networkList()[0]

n = len(gr.nodes)
oMapp = dict(list(zip(gr.nodes, np.arange(n))))

Mtx_w = (np.ones((n, n))) * np.inf
for i in range(len(gr.graph)):
    start = oMapp.get(gr.Network['start'][i])
    end = oMapp.get(gr.Network['end'][i])
    w = gr.Network['weights'][i]
    Mtx_w[start, end] = w

Mtx_nodes = []
Mtx_nodes = np.array([np.concatenate([Mtx_nodes, gr.nodes]) for _ in range(n)])
#-----------------------------------------------------------------------------

# Show matrix by colour:
Mtx_Image = Mtx_w.copy()
for i in range(n):
    count = ((Mtx_Image[i, :] != np.inf) & (Mtx_Image[i, :] != np.nan)).sum()
    Mtx_Image[i, :][Mtx_Image[i, :] != np.inf] = count

# Gradient based on degrees
plt.figure()
plt.title('Foreign Exchange cross rates', fontsize=16)
plt.imshow(Mtx_Image, interpolation='nearest')
plt.xticks(np.arange(len(gr.nodes)), gr.nodes, rotation=90)
plt.yticks(np.arange(len(gr.nodes)), gr.nodes)
plt.tick_params(axis='both', labelsize=4)

plt.tight_layout()
plt.show()
#plt.savefig("Forex_Matrix.pdf")




This first article related to networks has been just a brief introduction on this field. We will continue exploring graphs and graph algorithms in next posts. Finally, I would like to make some notes regarding the network images, these images have been modified reversing the RGB output (I preferred a black background), I used GIMP for that task.

Sunday 5 January 2014

Numerical methods: ODE and Finite difference method

The goal of this post is show how solve an ordinary differential equation, numerically and using the finite difference method and compare the result with the analytic solution. The problem statement is as follow:

Numerically solve the ODE: $m\frac{\partial^{2}y}{\partial t^{2}} = -b\frac{\partial y}{\partial t} $ that represents an object whose velocity at $t=0$ is equal to $v_0$ and it is slowing due to a retardant force, which is proportional to its velocity.

The integration interval is $0 \leq t \leq 5$ seconds. For simplicity, we take $m = b = 1$.
Boundary conditions:

  • $y(0) = 0 m$
  • $y(5) = 1 m$

Analytic solution:
$Y(t) = \frac{m v_0}{b} (1 - e^{-\frac{b}{m}t})$ where $v_0 = 1$

Ordinary differential equation:
We will start solving the ordinary differential equation of second grade . I will explain the resolution step by step:

  • Step 1: Characteristic equation
  • Step 2: Finding constants of integration
$m\frac{\partial^{2}y}{\partial t^{2}} = -b\frac{\partial y}{\partial t} \Rightarrow mD^{2} = -bD \Rightarrow D(mD + b) = 0 $
Thus, we have two possible solutions: $s1 = 0; s2 = -\frac{b}{m}$. As these two roots are real and distinct ($s1 \ne s2$) we can use the general solution for this specific case:

$Y(x) = C1 e^{s_1 x} + C2 e^{s_2 x} \Rightarrow Y(t) = C1 e^{0 t} + C2 e^{-\frac{b}{m} t} = C1+C2 e^{-\frac{b}{m} t}$

Using the boundary conditions we can obtain the constants of integration C1 and C2:
$Y(t=0) \Rightarrow C1 = -C2$
$Y(t=5) \Rightarrow -C2 + C2e^{-\frac{b}{m} 5 } = 1 \Rightarrow (-1 + e^{-\frac{b}{m} 5 } ) C2 = 1 \Rightarrow C2 = \frac{1}{(-1 + e^{-\frac{b}{m} 5 } ) } $
Solution:
$Y(t) = -\frac{1}{(-1 + e^{-\frac{b}{m} 5 } ) } + \frac{1}{(-1 + e^{-\frac{b}{m} 5 } ) } e^{-\frac{b}{m} t}$

$m = b = 1 \Rightarrow Y(t) = 1.00678 - 1.00678 e^{-t}$

Finite difference method:
We use the centered difference at time t to obtain the central difference approximation in both equation sides.
\[y" = \frac{\partial^{2}y}{\partial t^{2}} = \frac{U_{i+1} - 2U_i + U_ {i-1}}{h^{2}} \enspace y' = \frac{\partial y}{\partial t} = \frac{U_{i+1}-U_{i-1}}{2h} \]
Now we can sustitude these above approximations in the initial expression:
\[\frac{m}{h^{2}}(U_{i+1} - 2U_i + U_ {i-1}) = -\frac{b}{2h}(U_{i+1}-U_{i-1})\] Isolate $U_i$:
$U_{i+1} - 2U_i + U_ {i-1} = -\frac{b}{2h}(U_{i+1}-U_{i-1})\frac{h^{2}}{m} \Rightarrow [-\frac{bh}{2m}(U_{i+1}-U_{i-1})-(U_{i+1}+U_{i-1})] \frac{1}{-2} = U_i$

Implementation:
For this simple example, we create a Python class called "ODE", which includes the analytic solution and the basic methodology based on the iterative scheme.

Iterations:

def solve(self, n, steps, m, b, h):
    v = np.zeros(n)
    #Boundary conditions
    #y(t = 0) = 0
    v[0] = 0
    #y(t = 5) = 1
    v[n - 1] = 1

    #Iterations
    for i in range(steps):
        for i in range(1, n - 1):
            a1 = (v[i - 1] - v[i + 1]) * (b * h / (2 * m))
            a2 = v[i + 1] + v[i - 1]
            v[i] = (a1 - a2) / (-2)
    return v
As shown, the first value in the array $= 0$, first boundary condition (we start the loop in the position 1) and the last value = 1 (position n - 1). Finally, the function returns the complete array.

The attached chart shows how the result is approximated by finite difference as the number of iterations increases.



Code:
import numpy as np
import matplotlib.pyplot as plt


class ODE(object):

    def __init__(self, m, b, v0, t_min, t_max, n):
        self.t = np.linspace(t_min, t_max, n)
        #Analytic solution
        self.s = m * v0 / b * (1 - np.exp(- b / m * self.t))
        #ODE
        C2 = 1 / (- 1 + np.exp(- b / m * t_max))
        C1 = -C2
        self.f = C1 + C2 * np.exp(- b / m * self.t)

    def solve(self, n, steps, m, b, h):
        v = np.zeros(n)
        #Boundary conditions
        #y(t = 0) = 0
        v[0] = 0
        #y(t = 5) = 1
        v[n - 1] = 1

        #Iterations
        for i in range(steps):
            for i in range(1, n - 1):
                a1 = (v[i - 1] - v[i + 1]) * (b * h / (2 * m))
                a2 = v[i + 1] + v[i - 1]
                v[i] = (a1 - a2) / (-2)
        return v

#inputs and solution:
m = 1
b = 1
v0 = b / m
t_min = 0.0
t_max = 5.0
n = 100
h = t_max / n

vel = ODE(m, b, v0, t_min, t_max, n)
plt.plot(vel.t, vel.f, 'b.-', label='ODE')
plt.plot(vel.t, vel.s, 'g.-', label='Analytic solution')

for i in range(250, 3000, 500):
    sol_i = vel.solve(n, i, m, b, h)
    plt.plot(vel.t, sol_i, '-.', label=r'Iteration = %i' % i)

plt.ylabel('Distance (m)')
plt.xlabel('Time (s)')

plt.legend(loc=2, prop={'size': 9})
plt.show()