Many companies now offer endpoint services (data providers, semantic analysis, …) that we can integrate into our applications. However, when designing our own service, it can be tough to find the ideal parameters to configure it and the best software to make it scalable and highly available.
Continuous-Time Markov Chains (CTMC) (Yin, G. et al., 1998) provide an ideal framework for estimating these key parameters, and we can find them by means of simulation. A special CTMC model from Queuing Theory (Breuer, L. et al., 2005) is the M/M/c/K model. Modelling our service as such a queuing system implies that it has:
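- c: the number of parallel processes (servers)
- K: the maximum number of clients in the system (in service plus waiting in the queue)
- Input: Poisson arrivals with rate λ
- Service: exponentially distributed service times with rate μ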
E.g., the following CTMC represents a simple M/M/3/4 queuing system:
    M/M/c/K model simulation
    ------------------------
    + MODEL PARAMETERS
        Lambda: 40.0000
        Mu: 30.0000
        c: 3.0000
        K: 7.0000
        Stability: True (rho = 0.4444)
    + QUEUE
        Average number of clients (l) = 1.4562
        Average length (lq) = 0.1268
        Average waiting time for a client into the queue (w) = 0.0365
    + SYSTEM
        Average waiting time into the system (wq) = 0.0032
    + PROBABILITY DISTRIBUTION
        P_0 = 0.2550368777
        P_1 = 0.340049170234300
        P_2 = 0.226699446822867
        P_3 = 0.100755309699052
        P_4 = 0.044780137644023
        P_5 = 0.019902283397344
        P_6 = 0.008845459287708
        P_7 = 0.003931315238981
        [Total Probability: 1.0]
    Elapsed time: 0.00025105
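As a rough illustration of where such figures come from, here is a minimal Python sketch of the standard textbook steady-state formulas for an M/M/c/K queue, with K taken as the total system capacity. The function name `mmck_metrics` and its return keys are assumptions made for this example; this is not the simulator that produced the output above, so small differences in some figures are possible.

```python
from math import factorial


def mmck_metrics(lam, mu, c, k):
    """Steady-state metrics of an M/M/c/K queue (k = total system capacity)."""
    a = lam / mu  # offered load (lambda / mu)
    # Unnormalised state weights: a^n / n! while servers are free,
    # then a^n / (c! * c^(n - c)) once all c servers are busy.
    weights = [
        a**n / factorial(n) if n <= c else a**n / (factorial(c) * c ** (n - c))
        for n in range(k + 1)
    ]
    total = sum(weights)
    p = [w / total for w in weights]  # p[n] = P(n clients in the system)

    lam_eff = lam * (1 - p[k])  # rate of arrivals that are not rejected
    lq = sum((n - c) * p[n] for n in range(c + 1, k + 1))
    l = sum(n * p[n] for n in range(k + 1))
    return {
        "rho": lam / (c * mu),
        "p": p,
        "L": l,
        "Lq": lq,
        "W": l / lam_eff,    # Little's law applied to accepted clients
        "Wq": lq / lam_eff,
    }


metrics = mmck_metrics(lam=40.0, mu=30.0, c=3, k=7)
print(f"rho = {metrics['rho']:.4f}")
print(f"P_0 = {metrics['p'][0]:.6f}, Lq = {metrics['Lq']:.4f}")
```

Trying different values of c and k with the same rates is a quick way to see how the rejection probability `p[k]` and the waiting times react before touching the real service.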
Once we have calculated the best-fit values for our system, it is time to present our service, which is based on a Wikipedia semantic graph. The following picture shows its main structure, built from relations between articles and categories:
First of all, our service performs lookup queries to identify entities in a text. Here is the result of a query to our service:
Up to this point, we have calculated several parameters for our system: the arrival rate λ, the service rate μ, c (parallel servers) and K (maximum number of clients). To ensure the system respects these constraints, we should implement a two-layer throttling system:
- IPTABLES filter: many clients will try to access our system, but only a portion of them will get through.
- LOGIC filter: a software-based filter that throttles by means of user tokens. It applies time-based restrictions to control the incoming rate of each user.
The following software helps us implement these restrictions:
- Iptables filter: using Iptables (debian-administration.org) we can restrict incoming connections and mitigate denial-of-service (DoS) attacks.
- Logic filter: a time-control and token-manager script handles the per-user throttling.
- Several parallel servers and a queue system: we set up Gunicorn to run several Tornado servers and enforce the queue restrictions.
    nohup gunicorn --workers 3 --backlog 7 --limit-request-line 4094 --limit-request-fields 4 -b 0.0.0.0:8000 -k egg:gunicorn#tornado server:app &
A sample tornado server scaffold for our service could be:
    # -*- coding: utf-8 -*-
    from tornado.ioloop import IOLoop
    from tornado.web import Application, RequestHandler


    # Main class
    class NerService(RequestHandler):
        def get(self):
            # The entity lookup logic goes here.
            pass


    # Application object that Gunicorn loads as server:app
    app = Application([
        # dict(...parameters...) can be added as a third tuple element;
        # Tornado passes it to NerService.initialize().
        (r"/", NerService),
    ])

    if __name__ == "__main__":
        # To test a single server without Gunicorn
        app.listen(8000)
        IOLoop.current().start()
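The logic filter described earlier is not shown in the post. As a minimal sketch of the idea only, a per-token sliding-window limiter might look like the following. The class name `TokenThrottle`, its parameters and the limits used in the demo are all hypothetical, and the clock is injectable purely so the example is deterministic.

```python
import time
from collections import defaultdict, deque


class TokenThrottle:
    """Allow at most `max_requests` per `window` seconds for each user token."""

    def __init__(self, max_requests, window, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        # user token -> timestamps of recently accepted requests
        self.history = defaultdict(deque)

    def allow(self, user_token):
        now = self.clock()
        stamps = self.history[user_token]
        # Forget accepted requests that have left the time window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_requests:
            return False  # over the limit: reject this request
        stamps.append(now)
        return True


# Demo with a fake clock and hypothetical limits (2 requests per second).
fake_now = [0.0]
throttle = TokenThrottle(max_requests=2, window=1.0, clock=lambda: fake_now[0])
first_burst = [throttle.allow("user-a") for _ in range(3)]
other_user = throttle.allow("user-b")
fake_now[0] = 1.5
after_wait = throttle.allow("user-a")
print(first_burst, other_user, after_wait)
```

In a request handler, `allow()` would typically be called with the caller's token at the start of `get()`, answering with an error status such as HTTP 429 when it returns False.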
Finally, after applying this configuration, we simulated several incoming rates (and several numbers of clients) and obtained the service performance statistics shown in the picture below:
Summing up:
- Using Wikipedia categories and articles, we are able to detect a wide range of entities.
- Wikipedia is updated continuously, so our NER (Named Entity Recognition) stays up to date.
- We can use Gunicorn to run and manage several service instances.
- We have implemented a throttling system to restrict the maximum number of requests per second, and we have shown how to restrict the overall incoming rate by means of iptables.
- It proved necessary to simulate different loads on our service using Queuing Theory formulae to find the best-fit parameters, such as λ, μ, ρ, L, Lq, W and Wq.
Openinfluence is an open metric developed at Paradigmalabs that tries to define the relevance of each user on Twitter. It is open because you can see the formula and contribute to improving it. You can see the formula in the picture below:
As you can see, the formula has two main components: “Popularity” and “Influence”. Popularity is related to the static properties of your social network. It is a kind of “potential influence”: the a priori capability of getting your tweets spread. Influence is related to the propagation and repercussion of each of your tweets: the effective reach of your messages.
We have successfully applied this metric in several analyses, e.g. during the Andalusian election campaign and the UX Spain Conference.
Currently we can represent this formula with the following plot:
We are working on improving this metric, because the two main parts currently have the same weight in the formula. However, should the metric be more closely tied to Influence? Is the formula below better?
We have tested Openinfluence with the following dataset. In the picture below, you can see the follower count (degree) of each user in the sample, on a logarithmic scale:
The correlation between Popularity and Influence (dataset) shows that most users have roughly the same Popularity and Influence. Because of the structure of this formula, some users have zero Influence and Popularity n > 0, yet their relevance is not null.
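For readers who want to run this kind of check on their own sample, the following is a minimal sketch of a Pearson correlation between two score lists. The `pearson` helper and the toy scores are hypothetical and only illustrate the calculation; they are not the Openinfluence dataset, and the post does not say which correlation measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-user scores, for illustration only.
popularity = [10.0, 20.0, 30.0, 40.0, 50.0]
influence = [12.0, 18.0, 33.0, 41.0, 48.0]
r = pearson(popularity, influence)
print(f"correlation = {r:.3f}")
```

A value close to 1 would support the observation that most users have similar Popularity and Influence, while users with zero Influence but positive Popularity pull the coefficient down.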
Send us your point of view! We hope to improve it.
#15oct and #ows
15th October 2011 was a worldwide milestone: millions of people around the globe took to the streets to protest against the global financial crisis, mobilised to a great extent by the power of social networks, essentially Twitter. The protest movement, tagged as #15o and #15oct, was heavily based upon #15m (Spain) and #ows (“Occupy Wall Street”), social movements built around the notion that 99% of the people are NOT responsible for the ‘financial games’ played by a minority 1% that gets rich by draining wealth from the remaining 99% (#weare99).
The Process
We present the evolution over time of the related Twitter activity around 15th October 2011. Taking a dataset of 1.2 million tweets (ranging from 13th October to 18th October), we produced some global (geolocated) visualizations, local visualizations (centered on New York, San Francisco, Barcelona and Madrid) and, lastly, a visualization of how the associated hashtags evolved in that time frame.
#15oct and #ows
15th October 2011 was a historic day worldwide: millions of people around the globe took to the streets to protest against the financial crisis, mobilised largely through social networks and, in particular, Twitter. The movement, marked with hashtags such as #15o and #15oct, was strongly based on the #15m and #ows (“Occupy Wall Street”) demands, stressing that 99% of the people are NOT responsible for the financial games that make the remaining 1% rich at their expense (#weare99).
The Process
We present the evolution over time of the Twitter activity related to these movements around 15th October. Starting from a set of 1.2 million tweets, captured from 13th October to 18th October 2011, we worked to offer global geolocated visualizations, local ones (where the progress of the march can be followed in four cities: New York, San Francisco, Barcelona and Madrid) and, lastly, how the hashtags evolved (in volume and composition) in that time frame.
Emotional Climate Analysis
We have long been working on what is known as ‘sentiment analysis’: the analysis of the attitude (positive or negative) of the ‘author’ of a text or opinion, either in general or towards an entity (company, person, product, etc.). Emotional mining, or ‘mood analysis’, aims to go one step further in the emotional analysis of a user, trying to find the emotions that particular situations provoke in them.
The included videos show the evolution over time of the ‘emotional climate’ surrounding the candidates Mariano Rajoy and Alfredo Pérez Rubalcaba on Twitter, that is, which emotions underlie users' tweets about each of the two candidates. The emotional states included in the analysis engine are the following: Surprise, Indignation, Disappointment, Anger, Fear, Joy and Hope.
Topic Areas
The goal of the per-area candidate visualizations on Twitter is to detect which areas of the candidates' electoral programme (or areas that may be among citizens' concerns) are discussed more or less on Twitter when a tweet refers to a candidate. Each area (Terrorism, Immigration, Environment, Health, Economy, Education, Employment/Unemployment, Housing) is in turn made up of small sub-areas, and visualizing both gives us an idea of the ‘ideological landscape’ that Twitter users perceive for each of the candidates.
In this video, we present the evolution of the conversation network around the iPad 2 launch in March.
Data was collected using the Twitter real-time API on March 2nd, 2011, totalling around 50k tweets and retweets.
After that, we used Gephi's Streaming feature in tandem with its Force Atlas layout and, et voilà, Gephi instant gratification!








