As we explained in a previous post, we have spent the last few months working on a project for the Orange D4D Challenge. Our main task has been analyzing and visualizing the provided mobile communication datasets (collected in Ivory Coast from December 2011 to April 2012), looking for relevant and original findings for the society of this West African country; in other words, presenting our deductions in a clear and accessible way that helps governments and NGOs make better-informed decisions.
The project can therefore be said to have two dimensions:
a) Scientific side: gathering information on similar research projects (behavioural data, mobile commuting data, people dynamics), using different tools and strategies to manipulate such large amounts of data efficiently (Big Data, Hadoop, Pig), evaluating diverse visualization options (Excel and R charts, Gephi, GIS tools like QGIS, uDig, ArcGIS, Leaflet, Polymaps, D3.js…), and reflecting on the conclusions we extracted and their possible interpretations.
b) Cooperative side: mobile communication data are plentiful and their structure is really simple, and there is a great number of applications where this sort of data can play an important role. Moreover, since their nature touches all of us (communications), many of the inferred ideas can quickly be tied to people's daily lives. Leaving aside solutions of interest mainly to companies (improving business based on potential customers' behavior, habits and trends, or elaborating more sophisticated and customized marketing campaigns…), we have focused on those that can make people's day-to-day lives fairer and more comfortable, especially in developing countries (detecting commuting patterns that allow public transport policy improvements, supporting more adequate urban planning, determining heavy usage of hospitals, police stations…).
Let’s describe how the project was approached and developed:
1) Studying related research projects, both from private companies and from universities.
2) Storing and processing the datasets with cutting-edge technologies: Hadoop/Pig, MongoDB, Python, Git…
3) Statistics: normalizations, means, dispersions, medians…
4) Charts: Excel, R, Python
5) Visualizations: network diagrams (Gephi), Kernel Density Estimation maps (QGIS, ArcGIS)…
6) Web: customizable and interactive animations, making it easier to display and spread the conclusions reached (Leaflet, D3.js, CartoCSS, TileMill, Mapnik, Polymaps).
7) Paper: collecting all our discoveries in a final report (LaTeX).
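As a minimal sketch of the statistics step (step 3), and assuming hourly call counts per antenna (the figures below are made up for illustration), a min-max normalization makes antennas with very different traffic volumes comparable:

```python
from statistics import mean, median, pstdev

# Hypothetical hourly call counts for one antenna (illustrative data only).
calls = [120, 340, 560, 980, 410, 300, 870, 1020]

# Min-max normalization rescales each value into [0, 1].
lo, hi = min(calls), max(calls)
normalized = [(c - lo) / (hi - lo) for c in calls]

print(mean(calls), median(calls), round(pstdev(calls), 2))
print([round(v, 2) for v in normalized])
```

In the real pipeline the same aggregations were computed at scale with Pig; this stdlib version only illustrates the measures involved.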
The whole process yielded many interesting findings and ideas:
a) A mathematical model, designed and implemented by us, to detect geospatial-temporal commuting patterns.
b) A distinction between commuters and non-commuters, together with their evolution throughout each day and for each city.
c) Identification of time periods (hours, weekdays) according to the volume of phone calls; moreover, the regions or cities originating them are also located.
d) A set of charts and maps that illustrate the previous model, making it easier to deduce interesting findings.
e) Discovery of the daily commuting pattern for this specific dataset (morning peak, central valley, evening peak).
f) An online application to display all this information in a friendly and customizable way.
g) A draft of new open R&D work lines with high potential (clustering, replicating the algorithms with other datasets, tessellations, use of DTW & LCS operators…).
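The commuter/non-commuter distinction in (a) and (b) can be sketched with a toy heuristic; to be clear, the hour windows and the home/work logic below are illustrative assumptions, not the actual model from our report:

```python
from collections import Counter

def home_and_work_cells(events):
    """events: list of (hour, cell_id) call records for one user.
    A user is flagged as a commuter when the most frequent cell at
    night (presumed home) differs from the most frequent cell during
    working hours (presumed workplace)."""
    night = Counter(c for h, c in events if h < 7 or h >= 20)
    day = Counter(c for h, c in events if 9 <= h < 17)
    if not night or not day:
        return None  # not enough activity to decide
    home = night.most_common(1)[0][0]
    work = day.most_common(1)[0][0]
    return home, work, home != work

# Illustrative user: calls from cell "A" at night, cell "B" by day.
events = [(6, "A"), (22, "A"), (23, "A"), (10, "B"), (15, "B"), (16, "B")]
print(home_and_work_cells(events))  # ('A', 'B', True)
```

Aggregating the flagged users per antenna and per hour is what produces the morning peak / central valley / evening peak shape mentioned in (e).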
Summing up, we are really pleased with the work we carried out. It has been a fantastic opportunity that has allowed us to learn a lot in different knowledge areas. The key to all this has doubtless been motivation: from the very beginning and throughout the whole challenge we have been thrilled to collaborate and, at the same time, eager to learn from each other.
We are glad to announce that a multidisciplinary team of engineers and scientists from Paradigma Labs and the Spanish National Research Council will take part in the Orange “Data for Development” D4D Challenge.
Orange “Data for Development” – D4D – is an open data challenge encouraging research teams around the world to use four datasets of anonymous call patterns from Orange’s Ivory Coast subsidiary to help address societal development questions in novel ways. The datasets are based on anonymized Call Detail Records extracted from Orange’s customer base, covering December 2011 to April 2012.
Openinfluence is an open metric developed at Paradigma Labs that tries to define the relevance of each Twitter user. It is open because you can see the formula and contribute to improving it. You can see the formula in the picture below:
As you can see, the formula has two main components, “Popularity” and “Influence”. Popularity is related to static properties of your social network; it is a kind of “potential influence”, the beforehand capability of getting your tweets spread. Influence is related to the propagation and repercussion of each of your tweets: the effective reach of your messages.
Currently we can represent this formula with the following plot:
We have tested Openinfluence with the following dataset. In the picture below, you can see the follower count (degree) of each user in the sample, on a logarithmic scale:
The correlation between Popularity and Influence in this dataset shows that most users have roughly the same Popularity and Influence. Owing to the structure of the formula, some users have zero Influence and non-zero Popularity, yet their relevance is not null.
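Since the actual formula lives in the image above, here is only a purely illustrative sketch of how two such components might be combined; the log transform, the weights and the function name are assumptions of ours, not the published metric:

```python
from math import log1p

def openinfluence_sketch(followers, avg_retweets, alpha=0.5):
    """Hypothetical mix of the two components described above;
    NOT the actual Openinfluence formula."""
    popularity = log1p(followers)    # static 'potential influence'
    influence = log1p(avg_retweets)  # effective reach of the tweets
    return alpha * popularity + (1 - alpha) * influence

# A user whose tweets are never retweeted (zero Influence) still has
# non-null relevance thanks to the Popularity term:
print(openinfluence_sketch(10_000, 0) > 0)  # True
```

This reproduces the edge case noted above: zero Influence with non-zero Popularity still yields a non-null relevance score.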
With this plugin for Gephi, Paradigma Labs wants to provide the community with a useful tool to analyze Twitter information. We have encapsulated all the complexity behind a simple button. A retweet is one of the main actions for information propagation, and now you can run your own analysis in real time by means of Gephi and the Retweet Monitor plugin.
Its internal mechanics are fairly simple. The software connects to the Twitter stream, then applies a content filter (if desired). All the gathered information is displayed by Gephi, and you can then apply the standard algorithms and layouts to create a representative visualization.
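The graph construction the plugin performs inside Gephi can be sketched outside it. In this simplified sketch the stream is simulated and the field names are our own shorthand, not the real Twitter API payload:

```python
# Each retweet adds a directed edge original_author -> retweeter.
stream = [  # simulated stream items standing in for live Twitter data
    {"user": "bob",   "retweeted_user": "alice", "text": "RT @alice: big news"},
    {"user": "carol", "retweeted_user": "alice", "text": "RT @alice: big news"},
    {"user": "dave",  "retweeted_user": None,    "text": "unrelated tweet"},
]

keyword = "news"  # optional content filter, as in the plugin
edges = [
    (t["retweeted_user"], t["user"])
    for t in stream
    if t["retweeted_user"] and keyword in t["text"]
]
print(edges)  # [('alice', 'bob'), ('alice', 'carol')]
```

Feeding such edges into Gephi is what lets the standard layouts reveal the retweet cascades.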
#15oct and #ows
15th October 2011 was a milestone day at world level: millions of people around the globe occupied the streets to protest against the global financial crisis, influenced in great measure by the power of social networks, essentially Twitter. The protest movement, tagged as #15o and #15oct, was heavily based upon #15m (Spain) and #ows (“Occupy Wall Street”), social movements built around the notion that 99% of the people are NOT responsible for the ‘financial games’ played by the remaining 1%, who get rich by sucking wealth from the other 99% (#weare99).
We present the evolution over time of the related Twitter activity around 15th October 2011. Taking a dataset of 1.2 million tweets (ranging from 13th October to 18th October), we worked to offer some global (geolocated) visualizations, local visualizations (where the progress of the marches can be observed in four cities: New York, San Francisco, Barcelona and Madrid) and, lastly, a visualization of how the associated hashtags evolved (in volume and composition) in that time frame.
The goal of the per-area candidate visualizations on Twitter is to detect which areas of a candidate’s electoral program (or areas that may be among citizens’ concerns) are discussed to a greater or lesser extent on Twitter when a tweet refers to that candidate. Each area (Terrorism, Immigration, Environment, Health, Economy, Education, Work/Unemployment, Housing) is in turn composed of small sub-areas, and visualizing both lets us form an idea of the ‘ideological landscape’ perceived by Twitter users with respect to each of the candidates.
20N General Elections: Concepts and Evolutions
In this graphical series we present the concepts most frequently tweeted by users in connection with the elections or with the candidates Mariano Rajoy and Alfredo Pérez Rubalcaba. The concept clouds were generated with Wordle from data gathered on Twitter from the first days of October (pre-campaign) up to the day before the reflection day (campaign). The selection of terms was performed with a semantic engine for concept and entity extraction developed jointly by Paradigma Tecnológico and HAVAS Media.
We also include some “StreamGraph”-type charts.
It’s well known that Twitter’s most powerful use is as a tool for real-time journalism. Trying to understand its social connections and outstanding capacity to propagate information, we have developed a mathematical model to identify the evolution of a single tweet.
The way a tweet spreads through the network is closely related to Twitter’s retweet functionality, but retweet information is fairly incomplete, due to the fight for earning credit/users by appearing as the original source/author. We have taken this behavior into consideration, and our approach uses text similarity measures as a complement to retweet information. In addition, #hashtags and URLs are included in the process, since they play an important role in Twitter’s information propagation.
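As a rough illustration of the text-similarity component (the tokenization and the 0.7 threshold below are our assumptions for this sketch, not the model’s actual parameters), near-copies of a tweet can be linked to the original even when the retweet metadata is missing:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two tweets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

original = "breaking: protests spread across the city #15oct"
copied   = "breaking: protests spread across the city #15oct via @someone"

# A high score links the copy to the original even without an explicit RT.
print(jaccard(original, copied) > 0.7)  # True
```

Shared #hashtags and URLs naturally boost this overlap, which is one reason they are useful signals for tracking propagation.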