ComputSimu: Networks 1: Scraping + Data visualization + Graph stats

These last weeks I have been reading about networks and optimization algorithms, I think is an interesting field with many applications, so my idea was write a new article (or series of articles) showing roughly how use some interesting python libraries like Networkx, for instance. In this article I also like to make a brief introduction to Scraping (a very good way to get data for your graphs).

And only to whet your appetite:

Figure 1. Undirected graph.

Web Scraping:

The first step to construct graphs like the previous one is get the data. As example, I wanted to visualize the whole foreign exchange market and the connections between each currency (Foreign exchange cross rates), but there are another good examples like user connections in social networks (Twitter, Facebook etc).
For this case study(Forex), there are several available websites where you can find financial data for free, nonetheless this data will be restricted to certain time periods, so do not expect 10 or 30 seconds candles for free.

List of Forex websites:

Once we have selected our source, we can think about how extract the info. Some websites like Yahoo Finance allow downloading end-of-day equities or forex data, normally in csv format. But let's image that our intention is to obtain the most possible current data. Here Scraping is the appropriate technique. Web scraping is the process of automatically collecting information from websites, usually simulating human interaction through a full-fledged web browser. Note: web scraping may be against the terms of use of several websites.

For this purpose, I used the next python modules:

urllib2

urlparse

BeautifulSoup

The goal then is extract all the foreign exchange cross rates. This data will be collected in three numpy arrays and finally we will create a dataframe with pandas to export a csv file. The tricky part of all this is parse the html file, however BeautifulSoup allows using regular expressions, which combined with urlparse creates a powerful and simple tool. So let's take a look at the script.

Script:

import numpy as np
import pandas as pd
import urllib2
import urlparse
import re
from bs4 import BeautifulSoup

#------------------------------------------------------------------------------
#Functions:
#------------------------------------------------------------------------------


def add_exchange(Data):
    Value = []
    for link in Data[0, :]:
        soup = connect(link)
        for span in soup.find_all('span', {'id': 'last_last'}):
            #remove thousands (",")
            value = (span.text).replace(',', '')
            Value.append(value)
    return np.vstack((Data, Value))


def connect(site):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    req = urllib2.Request(site, headers=hdr)
    page = urllib2.urlopen(req).read()
    soup = BeautifulSoup(page)
    return soup

#------------------------------------------------------------------------------
#Extract list of currencies + Matrix
#------------------------------------------------------------------------------
mainSites = {'http://www.investing.com/currencies/middle-east',
    'http://www.investing.com/currencies/north-america',
    'http://www.investing.com/currencies/central-america',
    'http://www.investing.com/currencies/south-america',
    'http://www.investing.com/currencies/asia',
    'http://www.investing.com/currencies/pacific',
    'http://www.investing.com/currencies/europe',
    'http://www.investing.com/currencies/middle-east',
    'http://www.investing.com/currencies/africa',
    'http://www.investing.com/currencies/caribbean'}

FxUrls = []
C1 = []
C2 = []
for mainSite in mainSites:
    soup = connect(mainSite)
    for link in soup.find_all(href=re.compile("www.investing.com/currencies/")):
        #extract currencies:
        url = link.get('href')
        FxUrls.append(url)
        currencies = urlparse.urlparse(url).path.split("/")[2].split("-")
        C1.append(currencies[0])
        C2.append(currencies[1])

#Create Matrix
Forex = np.vstack((FxUrls, C1, C2))
#------------------------------------------------------------------------------

#Extract all foreign exchange cross rates:
Import = add_exchange(Forex)

#Pandas dataframe
data = {'currency_1': Import[1, :],
    'currency_2': Import[2, :],
    'value': Import[3, :]}

df = pd.DataFrame(data)
df.to_csv('Forex.csv', index=False, cols=['currency_1', 'currency_2', 'value'])

Summarizing what we have done in this script:

I wrote two functions, one in order to connect with website and capture the html page, and the other one to parse this html and select the required information. There are different ways to extract the data, select the one that best suits the structure of your html file. Run the script, just take into account that this script may take a while depending on your internet connection.
With this tool, you might consider a dynamic system to obtain frequently data. Well, that would be complicated and unlikely, your IP address would be blocked if the number of requests is considerable.

Networkx - Graphs:

Now we have collected the data, it's time to show what we got. For the creation and study of the graphs we will use the python package called Networkx. This is a pretty helpful package which provides many functions and algorithm that facilitate the study of complex networks. However, this may be not the best package to draw your graph, although we can use it as an interface to use Graphviz.

The first thing I did was write a new module, called "graphs", with the Networkx's functionalities I needed. This includes two classes, "create_graph"and "graphStats". The first one reads the produced csv file and creates the two elements that form a graph, nodes and edges. In this case we also have weights, the exchange rates (third column), so we need a list in order to create the labels, which looks like this:

example: {('D', 'E'):9.0, ('F', 'N'):6.0}...

The class "graphStats" includes two functions in order to calculate the degree centrality and betweenness centrality, we will talk about this in more detail later on.

Graph class:

import networkx as nx
from operator import itemgetters
import pandas as pd
import numpy as np


class create_graph:

    def __init__(self):
        data = 'csv/Forex.csv'
        headers = ['start', 'end', 'weights']
        self.Network = pd.read_csv(data, skiprows=1, names=headers)
        self.graph = np.asarray(self.Network[['start', 'end']])
        self.nodes = np.unique(self.graph)
        self.weights = list(map(float, self.Network['weights']))

    def networkList(self):
        G = nx.DiGraph()
        for node in self.nodes:
            G.add_node(node)

        self.graph_l = []
        for edge in self.graph:
            G.add_edge(edge[0], edge[1])
            self.graph_l.append((edge[0], edge[1]))

        labels = dict(list(zip(self.graph_l, self.weights)))
        return G, labels


class graphStats:

    def calculate_degree_centrality(self, graph):
        dgc_key = []
        dgc_value = []
        g = graph
        dc = nx.degree_centrality(g)
        nx.set_node_attributes(g, 'degree_cent', dc)
        degcent_sorted = sorted(dc.items(), key=itemgetter(1), reverse=True)
        for key, value in degcent_sorted:
            dgc_key.append(str(key))
            dgc_value.append(value)
        return dgc_key, dgc_value

    def calculate_betweenness_centrality(self, graph):
        btc_key = []
        btc_value = []
        g = graph
        bc = nx.betweenness_centrality(g)
        betcent_sorted = sorted(bc.items(), key=itemgetter(1), reverse=True)
        for key, value in betcent_sorted:
            btc_key.append(str(key))
            btc_value.append(value)
        return btc_key, btc_value

Now we have prepared the module, we will use matplotlib to draw the graph. Actually, this is not a huge network, it has only 133 nodes and 1918 edges, but it is difficult get a good resolution if your output file is .jpg or png, so I recommend a pdf file and some adjustments in the figure size.

Graph - plot:

import networkx as nx
import matplotlib.pyplot as plt
import graphs

#------------------------------------------------------------------------------
# Read data + create graph
#------------------------------------------------------------------------------
gr = graphs.create_graph()
graph = gr.graph
G, labels = gr.networkList()

#------------------------------------------------------------------------------
# Graph plot:
#------------------------------------------------------------------------------
fig = plt.figure()
#fig = plt.figure(figsize=(8,8), dpi=1000) - pdf file
pos = nx.graphviz_layout(G, prog='neato')
node_color = [float(G.degree(v)) for v in G]
nx.draw_networkx_nodes(G, pos, node_size=[float(G.degree(v)) * 2.5 for v in G],
             alpha=0.6, node_color=node_color)
nx.draw_networkx_edges(G, pos, alpha=0.25, edge_color='blue', width=0.5,
arrows=False)
nx.draw_networkx_labels(G, pos, font_size=2.5)

plt.axis('off')
plt.show()
#plt.savefig("directedGraph.pdf")

Figure 1.1. Undirected graph (detail).

Figure 2. Directed graph.

The second graph is which shows the real Forex network. The Forex network is a directed graph, a graph where its edges have a direction associated with them. This digraph is strongly connected due to it contains several pairs of vertices with both paths, (euro -> dolar and dolar -> euro, for instance).

Graph - stats:

Networkx is a great python package to study graphs, its main purpose. In this point we are going to look into some statistics which will help us to extract real information from our graphs with the aim of interpret the results properly.

1) Betweenness centrality:

Betweenness centrality quantifies the number of times a node acts as bridge along the shortest path between two other nodes. It is basically a measure of connectedness between components of the graph.

2) Degree centrality:

The centrality for a node measures its relative importance within the graph. The degree centrality for a node (n) is the number of nodes it is connected. The degree centrality values are normalized by the maximum degree in the graph. For a graph with self loops (a digraph) the maximum degree might be higher than 1 and the degree centrality as well.

In this article I will not expand the theory behind these measures.

import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import graphs


def topTable(field1, field2, n_top):
    topM = max(field2) * 0.9
    right = len(field1) * 0.75
    plt.text(right, topM * 1.08, 'Top %s' % n_top, fontsize=12)
    for i in range(n_top):
        curr = field1[i]
        val = field2[i]
        plt.text(right, topM - i * topM / 20, '{}) {} = {}'.format(i + 1,
        curr.upper(), round(val, 3)), fontsize=8)

#------------------------------------------------------------------------------
# Read data + create graph
#------------------------------------------------------------------------------
gr = graphs.create_graph()
G, labels = gr.networkList()

#------------------------------------------------------------------------------
# Stats + plots:
#------------------------------------------------------------------------------
stats = graphs.graphStats()
betc_key, betc_value = stats.calculate_betweenness_centrality(G)
degc_key, degc_value = stats.calculate_degree_centrality(G)

#Average degree
N = G.order()
K = G.size()
avg_d = float(N) / K
avg_degree = 'Average degree: %.4f ' % (avg_d)


# Plot: Degree_centrality
plt.figure()

ax1 = plt.subplot(211)
plt.title('Degree centrality for nodes', fontsize=12)
a_lenght = np.arange(len(degc_value))
plt.bar(a_lenght, degc_value, color=cm.jet(degc_value), align='center')
plt.xticks(a_lenght, degc_key, size='small', rotation='vertical')
plt.tick_params(axis='x', labelsize=4)
plt.tick_params(axis='y', labelsize=8)
plt.autoscale(enable=True, axis='both', tight=None)

#Top degree centrality:
topTable(degc_key, degc_value, 10)
plt.text(len(degc_value) * 0.75, max(degc_value) * 0.4, avg_degree,
bbox={'facecolor': 'blue', 'alpha': 0.25, 'pad': 10}, fontsize=7)

# Plot: Betweenness_centrality
plt.subplot(212)
plt.title('Betweenness centrality for nodes', fontsize=12)
a_lenght = np.arange(len(betc_value))
plt.bar(a_lenght, betc_value, color=cm.jet(betc_value), align='center')
plt.xticks(a_lenght, betc_key, size='small', rotation='vertical')
plt.tick_params(axis='x', labelsize=4)
plt.tick_params(axis='y', labelsize=8)
plt.autoscale(enable=True, axis='both', tight=None)
plt.ylim(0, max(betc_value) * 1.1)
plt.plot(betc_value, '--b')

#Top degree betweenness:
topTable(betc_key, betc_value, 10)

plt.show()

The above plot shows the currencies which have more relevance into the network, and as we suspected these are: USD dolar, Euro and the pound sterling. We can make at least two significant considerations; The average degree centrality is quite low, what shows that there are currencies with a notable grade of popularity(Degree centrality : in-degree) and these same currencies play a broker role in the network due to cumulate most of trajectories for the shortest paths. For this particular example,this might mean that we can create a more simplistic network, which would represent pretty faithfully the Forex market, excluding those currencies with fewer number of connections.

Adjacency Matrix:

The adjacency matrix is a N x N matrix where the cells constitute edges. In this example the fields are coloured according the degree centrality and as the graph is directed, the matrix is not symmetric.

import numpy as np
import matplotlib.pyplot as plt
import graphs


gr = graphs.create_graph()
G = gr.networkList()[0]

n = len(gr.nodes)
oMapp = dict(list(zip(gr.nodes, np.arange(n))))

Mtx_w = (np.ones((n, n))) * np.inf
for i in range(len(gr.graph)):
    start = oMapp.get(gr.Network['start'][i])
    end = oMapp.get(gr.Network['end'][i])
    w = gr.Network['weights'][i]
    Mtx_w[start, end] = w

Mtx_nodes = []
Mtx_nodes = np.array([np.concatenate([Mtx_nodes, gr.nodes]) for _ in range(n)])
#-----------------------------------------------------------------------------

# Show matrix by colour:
Mtx_Image = Mtx_w.copy()
for i in range(n):
    count = ((Mtx_Image[i, :] != np.inf) & (Mtx_Image[i, :] != np.nan)).sum()
    Mtx_Image[i, :][Mtx_Image[i, :] != np.inf] = count

# Gradient based on degrees
plt.figure()
plt.title('Foreign Exchange cross rates', fontsize=16)
plt.imshow(Mtx_Image, interpolation='nearest')
plt.xticks(np.arange(len(gr.nodes)), gr.nodes, rotation=90)
plt.yticks(np.arange(len(gr.nodes)), gr.nodes)
plt.tick_params(axis='both', labelsize=4)

plt.tight_layout()
plt.show()
#plt.savefig("Forex_Matrix.pdf")

This first article related to networks has been just a brief introduction on this field. We will continue exploring graphs and graph algorithms in next posts. Finally, I would like to make some notes regarding the network images, these images have been modified reversing the RGB output (I preferred a black background), I used GIMP for that task.

ComputSimu

Pages

Saturday, 25 January 2014

Networks 1: Scraping + Data visualization + Graph stats

No comments:

Post a Comment