Author Archive

A D4D Data Visualization

Orange Data for Development is an open data challenge, encouraging research teams around the world to use four datasets of anonymous call patterns of Orange’s Ivory Coast subsidiary, to help address society development questions in novel ways. The data sets are based on anonymized Call Detail Records extracted from Orange’s customer base, covering the months of December 2011 to April 2012.

D4D Viz Approach

Our team used the geolocation data in the Call Detail Records extracted from Orange’s customer base to find out which areas the customers moved through. This helped us discover the morning and evening rush hours: the times when users were commuting between their place of residence and their place of work.

Visualization

We used Python for crunching the numbers and D3.js for creating the visualization.

Bar Chart

The bar chart shows the total population density at a fixed time slot. Rush hours can be identified by the two peaks that emerge every day, one in the morning and one in the afternoon.
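As a rough illustration of the number crunching behind such a chart, the sketch below counts distinct active users per hour and picks the busiest hour before and after noon. The record layout (user, timestamp, antenna), the sample rows and the noon split are assumptions made for this example, not the actual D4D schema or our production code.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (user_id, timestamp, antenna_id). Made-up rows,
# not taken from the D4D datasets.
records = [
    ("u1", "2012-01-09 07:55:00", 12),
    ("u1", "2012-01-09 08:10:00", 14),
    ("u2", "2012-01-09 08:20:00", 14),
    ("u3", "2012-01-09 08:45:00", 15),
    ("u2", "2012-01-09 13:05:00", 20),
    ("u1", "2012-01-09 18:15:00", 14),
    ("u3", "2012-01-09 18:30:00", 12),
    ("u4", "2012-01-09 18:40:00", 12),
    ("u2", "2012-01-09 19:10:00", 11),
]


def active_users_per_hour(rows):
    """Count the distinct users seen in each hour of the day."""
    users = defaultdict(set)
    for user_id, timestamp, _antenna in rows:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        users[hour].add(user_id)
    return {hour: len(seen) for hour, seen in users.items()}


def rush_hours(counts, split_hour=12):
    """Return the busiest hour before and after split_hour (assumed noon)."""
    morning = {h: c for h, c in counts.items() if h < split_hour}
    afternoon = {h: c for h, c in counts.items() if h >= split_hour}
    return max(morning, key=morning.get), max(afternoon, key=afternoon.get)


counts = active_users_per_hour(records)
morning_peak, afternoon_peak = rush_hours(counts)
print(morning_peak, afternoon_peak)
```

The per-hour counts are the kind of values a bar chart like this one could plot, one bar per time slot.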

Choropleth

The choropleth shows how the population density flows over time, as people move from one region to another. Notice how the density increases (areas get darker) as the time gets closer to the rush hours.

Take a look!

If you want to see it running, you can either visit this link for a demo with simulated data, or clone the repo and start a local web server.

D4D Challenge accepted!

We are glad to announce that a multidisciplinary team of engineers and scientists from Paradigma Labs and the Spanish National Research Council will take part in the Orange “Data for Development” (D4D) Challenge.

Orange challenge

Orange “Data for Development” – D4D – is an open data challenge, encouraging research teams around the world to use four datasets of anonymous call patterns of Orange’s Ivory Coast subsidiary, to help address society development questions in novel ways. The data sets are based on anonymized Call Detail Records extracted from Orange’s customer base, covering the months of December 2011 to April 2012.

Paradigma labs team

Short abstract

Our idea is to use the geolocation data from the antennas that process the mobile phone calls to find out which sub-prefectures the customers have been moving through. The main goal of our project is to develop spatio-temporal models that detect patterns for the different sub-prefectures, taking into account other factors related to the region and/or the time: wealth, development, infrastructure, investment, grants…
By means of GIS technology, we will be able to apply the generated models to the gathered data and analyze their correlations across the territory of Côte d’Ivoire, working with geographical layers: land cover, road maps, railway lines, water sources… The conclusions of our study will then be visualized properly, allowing a better explanation of the facts.
In the near future, other measures could be included, for instance the locations of hospitals and police stations and their call rates. This would let us estimate how these services are actually used and help improve the service they give to citizens: dangerous areas, crowded hospitals…

People involved

At Paradigma tecnologico:
At Spanish National Research Council:
At Complutense University:

Throughput analysis with Continuous-time Markov Chains simulations and design of a reliable cloud service system based on Gunicorn, Tornado and Iptables

At this moment a lot of companies offer end-point services (data providers, semantic analysis, …) that we can integrate with our applications. However, when designing our own service, it can be tough to find the ideal parameters to configure it and the best software to make it scalable and highly available.

Continuous-Time Markov Chains (CTMC) (Yin, G. et al., 1998) provide an ideal framework to estimate the most important of these parameters, and we can find them by means of simulations. A special CTMC model that belongs to Queuing Theory (Breuer, L. et al., 2005) is the M/M/c/K model. Modeling our service as an M/M/c/K queuing system implies that it has:

  • c: the number of parallel processes
  • K: the maximum number of clients waiting in the queue
  • Input: Poisson
  • Service: Exponential

E.g.: The next CTMC can represent a simple M/M/3/4 queuing system (Download .dot):

As seen in the picture above, a grey node n means that n-3 clients are waiting in the queue. The last state is the red node (#7), which means that any client arriving at that moment will be rejected by our system.
Since it is a CTMC, we can derive the equilibrium equations, or we can directly use the formulae of the M/M/c/K model. With the software developed at ParadigmaLabs we are able to simulate several configurations of this model and obtain other features too, e.g.:
M/M/c/K model simulation
------------------------

+ MODEL PARAMETERS
	Lambda: 40.0000
	Mu: 30.0000
	c: 3.0000
	K: 7.0000
	Stability: True (rho = 0.4444)

+ QUEUE
	Average number of clients (l) = 1.4562
	Average length (lq) = 0.1268
	Average waiting time for a client into the queue (w) = 0.0365

+ SYSTEM
	Average waiting time into the system (wq) = 0.0032

+ PROBABILITY DISTRIBUTION
	P_0 = 0.2550368777
	P_1 = 0.340049170234300
	P_2 = 0.226699446822867
	P_3 = 0.100755309699052
	P_4 = 0.044780137644023
	P_5 = 0.019902283397344
	P_6 = 0.008845459287708
	P_7 = 0.003931315238981
	[Total Probability: 1.0]

Elapsed time: 0.00025105
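For readers who want to reproduce the probability distribution, here is a minimal sketch of the standard closed-form M/M/c/K steady-state formulae. It is not the ParadigmaLabs simulator. It assumes that K is the total capacity of the system (clients in service plus clients waiting), which is what the P_0 … P_7 listing above suggests; the function name `mmck` and the shape of its return value are choices made for this example.

```python
from math import factorial


def mmck(lam, mu, c, k):
    """Steady state of an M/M/c/K queue, with k the total system capacity (assumption)."""
    a = lam / mu   # offered load
    rho = a / c    # utilisation per server
    # Unnormalised weight of each state n = 0..k
    weights = [
        a**n / factorial(n) if n <= c
        else a**n / (factorial(c) * c**(n - c))
        for n in range(k + 1)
    ]
    total = sum(weights)
    p = [w / total for w in weights]
    # Average number of clients waiting (not being served)
    lq = sum((n - c) * p[n] for n in range(c + 1, k + 1))
    lam_eff = lam * (1 - p[k])   # rate of clients that are not rejected
    wq = lq / lam_eff            # Little's law applied to the queue
    w = wq + 1 / mu              # time in queue plus one service time
    return {"rho": rho, "p": p, "lq": lq, "wq": wq, "w": w}


result = mmck(lam=40.0, mu=30.0, c=3, k=7)
for n, prob in enumerate(result["p"]):
    print(f"P_{n} = {prob:.10f}")
```

With the parameters of the listing above (Lambda 40, Mu 30, c 3, K 7), the probabilities from this sketch agree with the listed P_0 … P_7 values.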

Once we have calculated the best-fit values for our system, it is time to present our service, which is based on a Wikipedia Semantic Graph. The next picture shows the main structure, which creates relations between articles and categories:

In the first instance, our service will perform lookup queries in order to identify Entities in a text. We can see the result of a query to our service:

Up to this point, we have calculated several parameters for our system: the incoming rate Lambda (λ), the service rate Mu (μ), c (parallel servers) and K (queue length). To ensure the system respects these constraints, we should implement a two-layer throttle system.

  1. IPTABLES filter: Several clients will try to access our system, but only a portion of them will succeed.
  2. LOGIC filter: A software-based filter that performs the throttling by means of user tokens. It applies time restrictions that control the incoming rate of each user.
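The logic filter can be pictured with a small sliding-window limiter keyed by user token. This is only a sketch of the idea, not our token manager script: the class name `UserThrottle`, its parameters and the injectable clock are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class UserThrottle:
    """Allow at most max_requests per user token within a window of seconds."""

    def __init__(self, max_requests, window, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        # token -> timestamps of the requests accepted for that token
        self.history = defaultdict(deque)

    def allow(self, token):
        now = self.clock()
        stamps = self.history[token]
        # Forget accepted requests that have left the time window
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) < self.max_requests:
            stamps.append(now)
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic
fake_now = [0.0]
throttle = UserThrottle(max_requests=2, window=1.0, clock=lambda: fake_now[0])
first = throttle.allow("alice")
second = throttle.allow("alice")
third = throttle.allow("alice")   # third request inside the same window
other = throttle.allow("bob")     # a different token has its own budget
fake_now[0] = 1.5
later = throttle.allow("alice")   # the window has moved on
print(first, second, third, other, later)
```

A request handler would call `allow()` with the token of the caller and reject the request when it returns `False`.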

The following software helps us implement these restrictions:

  • Iptables filter: Using Iptables (debian-administration.org) we can restrict the incoming connections, avoiding denial-of-service (DoS) attacks.
  • Logic filter: Using a time control and token manager script we can limit the incoming rate of each user.
  • Several parallel servers and queue system: We set up Gunicorn to run several Tornado servers that implement the queue restrictions.
nohup gunicorn --workers 3 --backlog 7 \
               --limit-request-line 4094 --limit-request-fields 4 \
               -b 0.0.0.0:8000 -k egg:gunicorn#tornado server:app &

A sample tornado server scaffold for our service could be:

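# -*- coding: utf-8 -*-
import tornado.ioloop
import tornado.web


# Main class
class NerService(tornado.web.RequestHandler):
    def initialize(self, **parameters):
        # Receives the keyword arguments given in the route definition below
        self.parameters = parameters

    def get(self):
        # Placeholder response (assumption): the entity lookup goes here
        self.write({"entities": []})


# Application object loaded by Gunicorn as server:app
app = tornado.web.Application([
    # The handler parameters are elided in the original: dict(...parameters...)
    (r"/", NerService, dict()),
])

# To test the single server file
if __name__ == "__main__":
    app.listen(8000)
    tornado.ioloop.IOLoop.current().start()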

Finally, after applying this configuration, we simulated several incoming rates (also testing different numbers of clients) and obtained the service performance statistics shown in the picture below:

Summing up:

  • Using Wikipedia categories and articles, we are able to detect a huge range of Entities.
  • Wikipedia is updated continuously, so we have an up-to-date NER (Named Entity Recognition).
  • We can use Gunicorn to run and manage several service instances.
  • We have implemented a throttle system to restrict the maximum number of requests per second. A way to restrict the general incoming rate by means of iptables is also provided.
  • It has proven necessary to simulate different invocations of our services using Queuing Theory formulae to find best-fit parameters such as λ, μ, ρ, L, Lq, W, Wq.

Unstructured information extraction: A sample case with Unitex-Manager

1. The problem of unstructured information

A lot of information flows from one computer to another in today’s companies: e-mails, documents, many kinds of files and, of course, the websites the employees browse. These electronic documents probably contain part of the core knowledge of the company or, at least, very useful information which, although easily readable by humans, is unstructured and impossible to process automatically using computers. The amount of unstructured information in enterprises is around 80% [1] to 85% [2] nowadays, and such a situation is a disadvantage for business, since processes are difficult to automate and data is hard to find (well… unless a very well defined storage schema is set up, but even then the success of that system relies on every employee following it). Unstructured data is data without a formally defined structure, or with a structure inherent to human communication but not prepared to be used by computers. As said before, examples are text, web pages, images, emails and so on. To simplify, from now on we will say that everything that does not come from a database or an API is unstructured data.

Scraping is the technique used for extracting data from these sources, and maybe the most common type is the so-called web scraping, used to get relevant information from sites on the Internet. Scraping is very useful to extract information from documents or sources that are always organized in a certain manner. However, when the layout may change quickly over time or may differ to a large extent among different sources (as usually happens on the web), scraping is an endless task. Once the desired data is extracted in a form that computers can process, a second problem is faced. Since documents are created by humans for humans, the information is written in what is called “Natural Language”, the way we talk or write: human language. Hence, the information is still raw and requires a processing step before machines can manipulate it and do any kind of computation with it. There are many Natural Language Processing (NLP) approaches, but at this point it’s enough to know that this technique is aimed at extracting the meaning of texts (or even speech).

2. Unitex Corpus Processor

The Unitex software was developed at the Linguistics group (Prof. Eric Laporte) of the Institut Gaspard Monge, Université de Marne-La-Vallée. It is a corpus processing system based on automata-oriented technology. Unitex is able to perform several operations, such as:

  • Apply electronic dictionaries, which you can create ad hoc for a particular domain.
  • Match patterns with recursive transition networks.
  • Resolve ambiguity by means of the text automaton.

Unitex can also apply advanced operations such as ELAG (Elimination of Lexical Ambiguities by Grammars), for disambiguation between lexical symbols in text automata, or cascades of transducers (the prototype of the CasSys system was created in 2002 at the LI labs at the University of Tours), which are applied one after the other to a text in order to modify it.

A very simple example of Unitex grammar is shown in the following figure:

Unitex Grammar for stock value extraction
Unitex has been applied in several research papers [3], e.g.:

  • Portuguese Large-scale Language Resources for NLP Applications
  • Syntactic variation of support verb constructions
  • XML-Based Representation Formats of Local Grammars for the NL
  • Spanish adverbial frozen expressions

Unitex provides a great user interface to manage our grammars and dictionaries, and Paradigma Labs also provides a fast binding to perform specific operations on a text.

3. UnitexManager

Unitex-manager is a Python module that provides a high-level layer to work easily with the Unitex Corpus Processor described above. Unitex-manager is based on pyUnitex, a minimalist Python wrapper used to interact with the C interface of Unitex.

Unitex-manager architecture. (Unitex-manager, PyUnitex, Unitex)

Natural Language Processing requires a first stage of language recognition and then a transformation of the whole text into simpler units, usually sentences. Hence the text is tokenized first, and then each sentence is POS-tagged, labeling each word with its grammatical and/or semantic function. For that purpose, different dictionaries are used; some of them are included with Unitex (basic language), but others (entity recognition, for example) should be prepared by a documentalist in advance. Finally, the tagged sentences run through a grammar (Unitex Graph), generating the desired output.

The Unitex-manager interface contains three methods representing these three actions:

tokenizer(input_str, lang)
Given a text and its language, returns an array containing its sentences, split on the dot (".") character.
postagger(tokens, lang)
Given an array of sentences (and their language), returns the same sentences tagged with Part-of-Speech labels.
grammar(tokens, pos, lang)
Evaluates the given POS-tagged sentences with the grammar set up in the configuration file.
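To make the three stages concrete, the sketch below chains toy stand-ins that only mimic the three signatures. They do not call Unitex or Unitex-manager: the tiny dictionary, the tag names and the noun-followed-by-verb "grammar" are all assumptions made for illustration.

```python
# Toy dictionary standing in for the Unitex dictionaries (hypothetical entries)
TOY_DICTIONARY = {"apple": "N", "rises": "V", "falls": "V", "today": "ADV"}


def tokenizer(input_str, lang):
    """Stand-in: split the text into sentences on the dot character."""
    return [s.strip() for s in input_str.split(".") if s.strip()]


def postagger(tokens, lang):
    """Stand-in: tag every word of every sentence; unknown words get 'UNK'."""
    return [
        [(word, TOY_DICTIONARY.get(word.lower(), "UNK")) for word in sentence.split()]
        for sentence in tokens
    ]


def grammar(tokens, pos, lang):
    """Stand-in grammar: keep each noun that is directly followed by a verb."""
    matches = []
    for tagged in pos:
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1 == "N" and t2 == "V":
                matches.append((w1, w2))
    return matches


text = "Apple rises today. Nothing else happened."
sentences = tokenizer(text, "en")
tagged = postagger(sentences, "en")
result = grammar(sentences, tagged, "en")
print(result)
```

The point is only the data flow: sentences go in, tagged sentences come out of the second stage, and the grammar turns them into the extracted output.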

An example of the execution flow can be seen in the next figure:

Text processing stages

4. A practical case

To give an example of the use of Unitex-Manager, we have prepared a practical case of unstructured information retrieval and processing. In this case, we will follow the evolution of the most active stocks on the NASDAQ stock exchange during the day.

First of all, it’s necessary to find a reliable source of information. Financial information is scattered across a real mess of websites; however, we have found that Yahoo! Finance provides just the required information (here), already compiled, and updates it very often. Once the information is found, it is necessary to analyze its structure and prepare a web scraper. In our case, we wrote our own scraper in Ruby, which is launched periodically to extract the symbol and the name of the company as well as the last change.

This text is passed to Unitex-Manager and processed with the workflow described above to extract the following entities:

  • Company name
  • Symbol
  • Change
  • Trend

Each hour we extract this information to calculate the top five most active companies on NASDAQ, based on the absolute value of their growth, and we tweet this Top-5 from the Financial Unitex account so you can easily follow how the stock exchange evolves.
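The ranking step can be sketched as follows. The quotes are invented, and the field names, the percent unit and the "up"/"down"/"flat" trend labels are assumptions for this example rather than the output format of our grammar.

```python
# Hypothetical (symbol, company name, change in percent) tuples
quotes = [
    ("AAA", "Alpha Corp", 1.2),
    ("BBB", "Beta Inc", -3.4),
    ("CCC", "Gamma Ltd", 0.5),
    ("DDD", "Delta SA", -0.1),
    ("EEE", "Epsilon plc", 2.8),
    ("FFF", "Zeta AG", -1.9),
]


def top_movers(rows, n=5):
    """Rank companies by the absolute value of their change, largest first."""
    ranked = sorted(rows, key=lambda row: abs(row[2]), reverse=True)
    return [
        {
            "symbol": symbol,
            "company": name,
            "change": change,
            "trend": "up" if change > 0 else "down" if change < 0 else "flat",
        }
        for symbol, name, change in ranked[:n]
    ]


top5 = top_movers(quotes)
for entry in top5:
    print(entry["symbol"], entry["change"], entry["trend"])
```

Sorting on the absolute value means that a strong fall ranks as high as a strong rise, which matches the idea of "most active" used here.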

Financial Unitex Twitter Account with example values

Openinfluence

Openinfluence is an open metric developed at Paradigmalabs that tries to define the relevance of each user on Twitter. It is open because you can see the formula and contribute to improving it. You can see the formula in the picture below:

As you can see, the formula has two main components: “Popularity” and “Influence”. Popularity is related to static properties of your social network. It’s a kind of “potential influence”, the prior capability of getting your tweets spread. Influence is related to the propagation and repercussion of each of your tweets, the effective reach of your messages.

We have successfully applied this metric in several analyses, e.g. during the Andalusian election campaign or the UX Spain Conference.

Currently we can represent this formula with the next plot:

We are working on improving this metric, because the two main parts currently have the same weight in the formula. However, should the metric be more closely related to Influence? Is the formula below better?


We have tested Openinfluence with the next dataset. In the picture below, you can see the follower count (degree) of each user in the sample (in logarithmic scale):

The correlation between Popularity and Influence (dataset) shows that the bulk of users have more or less the same Popularity and Influence. Because of the structure of this formula, some users have zero Influence and Popularity n>0, yet their relevance is not null.

Give us your point of view! We hope to improve it!

15th October on Twitter: Global Revolution ‘Mapped’

#15oct and #ows

15th October 2011 was a worldwide milestone: millions of people around the globe took to the streets to protest against the global financial crisis, mobilized to a great extent through social networks, essentially Twitter. The protest movement, tagged as #15o and #15oct, was heavily based upon #15m (Spain) and #ows (“Occupy Wall Street”), social movements built around the notion that 99% of the people are NOT responsible for the ‘financial games’ played by a 1% minority that gets rich by draining wealth from the remaining 99% (#weare99).

The Process

We present the evolution over time of the related Twitter activity around 15th October 2011. Taking a dataset of 1.2 million tweets (ranging from 13th October to 18th October), we worked to offer some global (geolocated) visualizations, local visualizations (centered on New York, San Francisco, Barcelona and Madrid) and, lastly, a visualization of how the associated hashtags evolved in that time frame.

15 October 2011: Maps of the Global Revolution on Twitter

#15oct and #ows

15 October 2011 was a historic day worldwide: millions of people around the globe took to the streets to protest against the financial crisis, mobilized largely through social networks and, specifically, Twitter. The movement, marked with hashtags such as #15o and #15oct, was strongly based on the #15m and #ows (“Occupy Wall Street”) demands, stressing that 99% of the people are NOT responsible for the financial games that let the remaining 1% get rich at their expense (#weare99).

Process

We present the evolution over time of the Twitter activity related to these movements around 15 October. Starting from a set of 1.2 million tweets, captured from 13 October to 18 October 2011, we worked to offer geolocated global visualizations, local ones (where the progress of the march can be observed in four cities: New York, San Francisco, Barcelona and Madrid) and, lastly, how the hashtags evolved (in volume and composition) in that time interval.