Orange Data for Development is an open data challenge encouraging research teams around the world to use four datasets of anonymized call patterns from Orange’s Ivory Coast subsidiary to help address societal development questions in novel ways. The datasets are based on anonymized Call Detail Records (CDRs) extracted from Orange’s customer base, covering December 2011 to April 2012.
Our team used the geolocation data from these call detail records to see which areas customers were moving through, helping us identify the morning and evening rush hours: the times when users commute between their place of residence and their place of work.
The bar chart shows the total population density for each time slot. Rush hours can be identified by the two peaks that emerge every day, one in the morning and one in the afternoon.
The choropleth shows how the population density flows over time as people move from one region to another. Notice how the density increases (areas get darker) as the time approaches the rush hours.
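To give a rough flavor of how rush hours emerge from call activity, the sketch below bins call timestamps by hour and picks the two busiest hours. The record layout and the values are invented for illustration and do not reflect the actual D4D schema.

```python
# Hypothetical sketch: finding rush hours by binning CDR timestamps by hour.
from collections import Counter
from datetime import datetime

# Assumed record format: (user_id, antenna_id, timestamp) -- illustrative only.
records = [
    ("u1", "a3", "2011-12-05 08:10:00"),
    ("u1", "a7", "2011-12-05 08:40:00"),
    ("u2", "a3", "2011-12-05 18:05:00"),
    ("u3", "a9", "2011-12-05 18:30:00"),
    ("u3", "a9", "2011-12-05 13:00:00"),
]

# Count calls per hour of the day.
calls_per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for _, _, ts in records
)

# The two busiest hours approximate the morning and evening rush hours.
rush_hours = sorted(h for h, _ in calls_per_hour.most_common(2))
print(rush_hours)  # [8, 18] for this toy data
```

On the real dataset, the same aggregation over millions of records produces the two daily peaks visible in the bar chart.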
Take a look!
As we explained in a previous post, over the last few months we have been working on a project for the Orange D4D Challenge. Our main task has been analyzing and visualizing the provided mobile communication datasets (collected in Ivory Coast from December 2011 to April 2012), looking for relevant and original findings for the society of this West African country, and presenting our deductions in a clear, accessible way that helps governments and NGOs make more accurate decisions.
The project can therefore be said to have two dimensions:
a) Scientific side: gathering information on similar research projects (behavioural data, mobile commuting data, people dynamics); using different tools and strategies to manipulate such large amounts of data efficiently (Big Data, Hadoop, Pig); evaluating diverse visualization options (Excel and R charts, Gephi, GIS tools like qGIS, uDig, ArcGIS, Leaflet, Polymaps, D3.js…); and reflecting on the kinds of conclusions extracted and their possible interpretations.
b) Cooperative side: mobile communication data are plentiful and their structure is very simple, yet there is a great number of applications where this sort of data can play an important role. Moreover, because their nature is so closely tied to all of us (communications), many of the inferred ideas map directly onto people’s daily lives. Leaving aside solutions of primary interest to companies (improving business based on potential customers’ behavior, habits and trends, or elaborating more sophisticated and customized marketing campaigns…), we have focused on those that can make people’s day-to-day fairer and more comfortable, especially in developing countries: detecting commuting patterns to improve public transport policy, supporting more adequate urban planning, or identifying heavy usage of hospitals, police stations, and other services.
Let’s describe how we approached and developed the project:
1) Studying related research projects, both from private companies and from universities.
2) Storing and processing the datasets with technologies such as Hadoop/Pig, MongoDB, Python, and Git.
3) Statistics: normalizations, means, dispersions, medians…
4) Charts: Excel, R, Python
5) Visualizations: network diagrams (Gephi), Kernel Density Estimation maps (qGIS, ArcGIS)…
6) Web: customizable and interactive animations, making it easier to display and spread the conclusions reached (Leaflet, D3.js, CartoCSS, TileMill, Mapnik, Polymaps).
7) Paper: collecting all our discoveries in a final report (LaTeX).
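As a tiny example of the statistics step (3), a z-score normalization of hourly call counts might look like the sketch below; the counts are made up for illustration.

```python
# Z-score normalization of hourly call counts (illustrative values).
from statistics import mean, pstdev

hourly_calls = [120, 340, 560, 980, 400, 210]  # hypothetical counts for one antenna

mu = mean(hourly_calls)
sigma = pstdev(hourly_calls)
normalized = [(x - mu) / sigma for x in hourly_calls]
# After normalization the values have mean 0 and (population) standard deviation 1,
# which makes antennas with very different traffic volumes comparable.
```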
The whole process yielded many interesting findings and ideas:
a) A mathematical model, designed and implemented, to detect geospatial-temporal commuting patterns.
b) A distinction between commuters and non-commuters, together with their evolution over each day and for each city.
c) Identification of time periods (hours, weekdays) according to the volume of phone calls, along with the regions or cities originating them.
d) A set of charts and maps illustrating the model, making it easier to deduce interesting findings.
e) Discovery of the daily commuting pattern for this specific dataset (morning peak, central valley, evening peak).
f) An online application to display all this information in a friendly and customizable way.
g) Drafts of new R&D lines of work with high potential (clustering, replicating the algorithms on other datasets, tessellations, use of DTW & LCS operators…).
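To give a flavor of the commuter/non-commuter distinction in (a) and (b), here is a deliberately simplified sketch: a user whose most frequent daytime antenna differs from their most frequent nighttime antenna is labeled a commuter. The function name, time windows, and data are illustrative assumptions, not the exact model from our report.

```python
# Simplified commuter detection from per-user (hour, antenna) call events.
from collections import Counter

def is_commuter(events):
    """events: list of (hour, antenna_id) pairs for one user."""
    day = Counter(a for h, a in events if 9 <= h < 17)    # working hours
    night = Counter(a for h, a in events if h >= 21 or h < 6)  # home hours
    if not day or not night:
        return False  # not enough evidence
    # Commuter: dominant daytime antenna differs from dominant nighttime antenna.
    return day.most_common(1)[0][0] != night.most_common(1)[0][0]

print(is_commuter([(10, "work"), (11, "work"), (22, "home"), (23, "home")]))  # True
print(is_commuter([(10, "home"), (22, "home")]))  # False
```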
Summing up, we are really glad and satisfied with the work we carried out. It has been a fantastic opportunity that has allowed us to learn a lot across different knowledge areas. The key to all this was, doubtless, motivation: from the very beginning and throughout the whole challenge we have been thrilled to collaborate and, at the same time, eager to learn from each other.
We are glad to announce that a multidisciplinary team of engineers and scientists from Paradigma Labs and the Spanish National Research Council will take part in the Orange “Data for Development” (D4D) Challenge.
Nowadays many companies offer end-point services (data providers, semantic analysis, …) that we can integrate with our applications. However, when designing our own service, it can be tough to find the ideal parameters to configure it, as well as the right software to make it scalable and highly available.
Continuous-Time Markov Chains (CTMCs) (Yin, G. et al., 1998) provide an ideal framework for estimating these key parameters, which we can find by means of simulations. A particular CTMC model from Queuing Theory (Breuer, L. et al., 2005) is the M/M/c/K model; modeling our service as an M/M/c/K queuing system implies that:
- c: the number of parallel servers (processes)
- K: the maximum number of clients in the system (capacity)
- Input: Poisson arrival process
- Service: exponentially distributed service times
E.g.: The next CTMC can represent a simple M/M/3/4 queuing system (Download .dot):
M/M/c/K model simulation
------------------------
+ MODEL PARAMETERS
    Lambda: 40.0000
    Mu: 30.0000
    c: 3.0000
    K: 7.0000
    Stability: True (rho = 0.4444)
+ QUEUE
    Average queue length (Lq) = 0.1268
    Average waiting time in the queue (Wq) = 0.0032
+ SYSTEM
    Average number of clients in the system (L) = 1.4562
    Average time spent in the system (W) = 0.0365
+ PROBABILITY DISTRIBUTION
    P_0 = 0.2550368777
    P_1 = 0.340049170234300
    P_2 = 0.226699446822867
    P_3 = 0.100755309699052
    P_4 = 0.044780137644023
    P_5 = 0.019902283397344
    P_6 = 0.008845459287708
    P_7 = 0.003931315238981
    [Total Probability: 1.0]
Elapsed time: 0.00025105
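The probability distribution above can also be obtained in closed form. The sketch below computes the steady-state probabilities of an M/M/c/K queue from the standard birth-death balance equations; the function and variable names are our own.

```python
# Closed-form steady-state probabilities for an M/M/c/K queue.
from math import factorial

def mmck_probs(lam, mu, c, K):
    a = lam / mu  # offered load
    # Unnormalized weights: a^n / n! while a free server exists (n <= c),
    # then a^n / (c! * c^(n-c)) once all c servers are busy.
    weights = [a**n / factorial(n) if n <= c
               else a**n / (factorial(c) * c**(n - c))
               for n in range(K + 1)]
    p0 = 1.0 / sum(weights)
    return [p0 * w for w in weights]

probs = mmck_probs(40.0, 30.0, 3, 7)
# probs[0] ~ 0.2550 and probs[1] ~ 0.3400, matching P_0 and P_1 above.
```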
Once we have calculated the best-fit values for our system, it is time to present our service, which is based on a Wikipedia Semantic Graph. The next picture shows the main structure, creating relations between articles and categories:
Up to this point, we have calculated several parameters for our system: incoming rate lambda (λ), service rate mu (μ), c (parallel servers) and K (capacity). To ensure the system satisfies these constraints we should implement a two-layer throttle system.
- IPTABLES filter: several clients will try to access our system, but only a portion of them will get through.
- LOGIC filter: a software-based filter that performs throttling by means of user tokens, applying temporal restrictions to control each user's incoming rate.
The following software helps us implement these restrictions:
- Iptables filter: using iptables (debian-administration.org) we can restrict incoming connections, mitigating denial-of-service (DoS) attacks.
- Logic filter: a time-control and token-manager script lets us deal with this problem.
- Several parallel servers and a queue system: we set up Gunicorn to run several Tornado servers, implementing the queue restrictions.
nohup gunicorn --workers 3 --backlog 7 --limit-request-line 4094 --limit-request-fields 4 -b 0.0.0.0:8000 -k egg:gunicorn#tornado server:app &
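The logic filter described above can be sketched as a per-user token bucket; the class, method names, and rate values here are illustrative assumptions, not our production script.

```python
# Illustrative token-bucket rate limiter for the per-user logic filter.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per user token; requests beyond the allowed rate are rejected.
buckets = {}
def allowed(user_token, rate=5.0, capacity=5):
    bucket = buckets.setdefault(user_token, TokenBucket(rate, capacity))
    return bucket.allow()
```

A request handler would call `allowed(token)` before doing any work and return an HTTP 429-style error when it yields False.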
A sample Tornado server scaffold for our service could be:
# -*- coding: utf-8 -*-
import tornado.ioloop
import tornado.web

# Main handler class
class NerService(tornado.web.RequestHandler):
    def get(self):
        pass  # service logic goes here

# Build the application, routing "/" to the handler
app = tornado.web.Application([
    (r"/", NerService, dict(...parameters...)),
])

if __name__ == "__main__":
    # To test a single server file
    app.listen(8000)
    tornado.ioloop.IOLoop.instance().start()
Finally, after applying this configuration we simulated several incoming rates (also testing different numbers of clients), obtaining the service performance statistics shown in the picture below:
- Using Wikipedia categories and articles, we are able to detect a huge range of entities.
- Wikipedia is continuously updated, so our NER (Named Entity Recognition) service stays up to date.
- We can use Gunicorn to run and manage several service instances.
- We have implemented a throttle system to restrict the maximum number of requests per second; a way to restrict the overall incoming rate by means of iptables is also provided.
- Simulating different invocation patterns of our services using Queuing Theory formulae has proven necessary to find best-fit parameters such as λ, μ, ρ, L, Lq, W, Wq.