Thursday, August 23, 2018

Understanding Twitter as a platform

Since Twitter was launched in 2006, it has grown to be the second largest social media tool in many countries. Its transformation, in just a few years, from a simple “what I had for lunch” messenger into a medium for following events, tracking topics, and reporting news is phenomenal. Twitter is an example of how a simple idea can evolve, mainly through usage, into something far more capable that makes its mark over time. It is even more astounding, and paradoxical, that it turned into something it was never intended to be: an archived data set adopted by the U.S. Library of Congress.

Functions and Preliminary Architecture:

Tracking and following are fundamental to the “dispatch and courier protocol” that Twitter is built on. It is a one-to-many communication system that combines the functions of instant messaging and SMS texting with a status function of “what you are doing”. This makes it a microblogging service. Following friends and other users is the second function of Twitter. The third function, enabled by search, is finding information that is valuable to the user. This came about with a change of focus to “what is happening”.
A Twitter use case is centered on what it takes to post a tweet, including its pre- and post-conditions. An actor in a Twitter use case logs in, posts a tweet, searches for people, searches for words, follows trending topics, and logs out.
With only the preliminary services (even search was added later), it is easy to imagine that Twitter had front-end and back-end service layers, with a user interface to send and receive feedback. A simple database would allow for storage of and access to tweet texts. As Twitter adoption grew among users, a middle layer, a search engine, and a robust data management layer with in-memory processing capabilities evolved from the basic architecture.
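The three functions described above (posting, following, and search) can be sketched with a toy in-memory store. This is only an illustration of the one-to-many model, not Twitter's actual design; the class name MiniTwitter and all of its methods are hypothetical.

```python
from collections import defaultdict


class MiniTwitter:
    """Toy in-memory model of posting, following, and searching (hypothetical)."""

    def __init__(self):
        self.tweets = []                   # (author, text) pairs, oldest first
        self.following = defaultdict(set)  # user -> accounts that user follows

    def post(self, author, text):
        # One write reaches every follower: the one-to-many "status" function.
        self.tweets.append((author, text))

    def follow(self, follower, followee):
        # Following is one-directional; the followee does not have to follow back.
        self.following[follower].add(followee)

    def timeline(self, user):
        # Tweets from followed accounts, newest first.
        followed = self.following.get(user, set())
        return [(a, t) for a, t in reversed(self.tweets) if a in followed]

    def search(self, word):
        # Naive whole-word match over every stored tweet, oldest first.
        needle = word.lower()
        return [(a, t) for a, t in self.tweets if needle in t.lower().split()]
```

A real deployment would replace the linear scans with an index and a queue, which is roughly the role the middle layer and search engine mentioned above came to play.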
Quality of Services:
The architecture evolved to address the quality of services: performance, usability, and maintainability. Mapping it to the ISO/IEC 25010 quality model is an exercise that explains how the architecture and technologies resolved the quality of services, considering the tradeoffs between maintainability and efficiency and between maintainability and reliability. Adding more servers for reliability and a variety of technologies for efficiency is the normal route, and both come at a cost to maintainability.
Contribution of Early Adopters / User Innovation:
The famous # hashtag for reporting news and the @ mention marker for sharing information about URLs are examples of user innovations that reflect Internet Relay Chat culture.
Design changes – Twittering by Cuckoo:
Twitter is designed as a centralized microblogging service that handles SMS-style text messages on a messaging platform. Its centralized architecture is prone to issues that arise with scale: availability, reliability, and performance bottlenecks. Twitter’s broadcasters (for example, news media and celebrities) can have millions of followers, and they are responsible for many of the scaling issues that arise. Broadcasting is not the only service Twitter supports: the list also includes individual profiles, social relations, and users looking for connections.
Cuckoo is designed with a decentralized architecture to reduce bandwidth costs, balance load, remove single points of failure, and ensure availability. Organizing users according to how they use the system is the key idea behind its structured and unstructured overlays of users. Dedicated servers that store static information about users also serve as backup servers to ensure availability. Organizing users according to their relationships allows for load balancing. Normal users, friends, and broadcasters form the social awareness framework in Cuckoo.

Twitter’s ecosystem – publicly available APIs, access points, services:
Twitter’s primary offerings for access to data are three APIs: the Streaming API, the REST (Representational State Transfer) API, and the Search API. An assortment of tools for connecting to, collecting, and analyzing Twitter data has grown around them, and it deserves to be called the ecosystem responsible for Twitter’s growth. Its wide range of services in varied contexts makes Twitter an attractive research tool.
The Streaming API connects the receiving host and Twitter for large volumes of data.
The REST API follows a request-response communication pattern. Connections are established between Twitter and the host on a per-request basis.
The Search API relies on the REST API to provide a real-time index search of tweets.
User Search:
With the help of a search engine layer, Twitter provides robust text search mechanisms for its users. This layer was added not only to speed up searches, but also to make the queuing system more reliable.
Recommendation services – Twitter WTF:
The goal of the “who to follow” (WTF) recommendation service is to help users discover connections. At its core is Cassovary, an open source, in-memory graph processing engine. The recommendation algorithm is based on SALSA. What makes connections on Twitter unique is that a user does not need to establish a mutual connection in order to receive another user’s messages. The WTF architecture is a continuous effort. It is based on a single server, which makes it prone to scalability issues, and memory limitations also affect the graph processing engine. The Twitter team is considering a Real Graph that resides on the Hadoop distributed file system as the next step for the WTF architecture.
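To make the SALSA idea concrete, here is a minimal sketch of a SALSA-style random walk on a small bipartite follow graph. It is an illustration only: it is not Twitter’s production algorithm and does not use the Cassovary API. The function names, the input format, and the choice of which users act as “hubs” are all assumptions.

```python
def salsa_scores(follows, iterations=10):
    """SALSA-style scoring on a bipartite graph (simplified sketch).

    follows: dict mapping each hub (a follower) to the set of
             authorities (accounts) that the hub follows.
    Returns a dict mapping each authority to its score.
    """
    hubs = [h for h, out in follows.items() if out]
    followers = {}  # authority -> set of hubs that follow it
    for h in hubs:
        for a in follows[h]:
            followers.setdefault(a, set()).add(h)

    hub_score = {h: 1.0 / len(hubs) for h in hubs}  # start uniform over hubs
    auth_score = {}
    for _ in range(iterations):
        # Forward step: each hub splits its score evenly among accounts it follows.
        auth_score = {a: 0.0 for a in followers}
        for h in hubs:
            share = hub_score[h] / len(follows[h])
            for a in follows[h]:
                auth_score[a] += share
        # Backward step: each authority splits its score evenly among its followers.
        hub_score = {h: 0.0 for h in hubs}
        for a, hs in followers.items():
            share = auth_score[a] / len(hs)
            for h in hs:
                hub_score[h] += share
    return auth_score


def recommend(scores, already_followed, k=2):
    """Top-k authorities the user does not follow yet (ties broken by name)."""
    candidates = [a for a in scores if a not in already_followed]
    return sorted(candidates, key=lambda a: (-scores[a], a))[:k]
```

For example, with follows = {"h1": {"a", "b"}, "h2": {"a"}, "h3": {"a", "c"}}, account "a" receives the highest score because all three hubs follow it. The whole graph is held in plain dictionaries here, which mirrors, on a very small scale, the memory pressure of an in-memory engine described above.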
TwitterEcho – A distributed focused crawler:
TwitterEcho is an open source, modular, distributed crawler that crawls specific communities to support new insights in social community research. The alternative is to license the Gnip social media API for unlimited access to data, which can be expensive and beyond the reach of academic research.
Storm @Twitter:
Storm is a Twitter-developed, real-time, distributed stream data processing engine behind the data management tasks that are important for Twitter services. The Twitter team lays out the essential features of a service used by millions of users: it must be scalable, resilient, extensible, efficient, and easy to administer. Several data partitioning strategies come into play in Storm. Many data-driven decisions at Twitter are powered by Storm.
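Two of the partitioning strategies Storm offers are shuffle grouping (tuples are spread evenly across worker tasks) and fields grouping (tuples with the same key always reach the same task). The sketch below illustrates the idea in plain Python; it does not use Storm’s API, the function names are hypothetical, and round-robin stands in for Storm’s randomized shuffle.

```python
import itertools
import zlib


def fields_grouping(key, num_tasks):
    """Route by key: equal keys always map to the same task index."""
    # crc32 is used because it is stable across runs, unlike Python's hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def make_shuffle_grouping(num_tasks):
    """Return a router that spreads tuples evenly, ignoring their content."""
    counter = itertools.count()

    def route(_item=None):
        # Round-robin approximation of an even, content-independent spread.
        return next(counter) % num_tasks

    return route
```

Fields grouping suits per-key state, such as counting tweets per hashtag, because every tuple for a given hashtag lands on the same task; shuffle grouping suits stateless work where only balance matters.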
Twitter Zombie:
Twitter Zombie is a technical tool for data collection and analysis. It is designed to run scheduled, programmable searches between the host and the Twitter data using the Search API. The rate at which queries are funneled into search and the complexity of each query are some of the variables that determine how Zombie is used.
TwitterMonitor:
TwitterMonitor offers trend detection over the Twitter stream. The goal of this system is to detect trends in real time and provide meaningful analytics. It identifies bursty keywords in the stream and analyzes them further to detect trends.
Several systems are used with Twitter’s platform on a daily basis, making it a lively and growing ecosystem. The ones above are just a few, selected to demonstrate how fast a simple idea can grow if it is of interest to its users.
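As a rough illustration of what “bursty” means, the sketch below flags words that appear in far more tweets in the current time window than in an average past window. This is an assumed, simplified heuristic, not TwitterMonitor’s actual algorithm; the function name, the thresholds, and the window format are hypothetical.

```python
from collections import Counter


def bursty_keywords(history_windows, current_window, ratio=2.0, min_count=3):
    """Return words whose tweet count in the current window exceeds `ratio`
    times their average count per past window (hypothetical heuristic).

    history_windows: list of past windows, each a list of tweet strings.
    current_window:  list of tweet strings for the window being checked.
    """
    def tweet_counts(tweets):
        counts = Counter()
        for text in tweets:
            counts.update(set(text.lower().split()))  # count each word once per tweet
        return counts

    current = tweet_counts(current_window)
    past = Counter()
    for window in history_windows:
        past.update(tweet_counts(window))
    num_windows = max(len(history_windows), 1)

    bursty = []
    for word, count in current.items():
        baseline = past[word] / num_windows
        # Floor the baseline at 1 so brand-new words still need several mentions.
        if count >= min_count and count > ratio * max(baseline, 1.0):
            bursty.append(word)
    return sorted(bursty)
```

A trend detector would then group related bursty keywords and attach context to them, which corresponds to the further analysis step mentioned above.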

Conclusion:
Addressing the scalability, reliability, availability, maintainability, and performance bottlenecks that arise requires an ongoing effort: adapting the architecture, experimenting, throwing away the old, and replacing it with the new.
The effort to keep the platform relevant to the vast user community is continuous. Tracking and documenting the architecture and its usage is a process of reverse engineering as well as a “next to impossible” task. In the web 2.0 world, this also requires agility and lightning-speed (almost real-time) change management.
The services offered, adaptable architectures, and software evaluation methodologies for choosing the right technologies (build, buy or acquire, or try open source), packaged into useful products and platforms, are the recipe for success in attracting users in multitudes.
In many situations, services built on modular, distributed architectures are easily adapted by changing their focus. For example, TwitterEcho, used to crawl one community, can be adapted to crawl another community while adhering to Twitter’s limitations. TwitterEcho is also a reminder of the limits on obtaining complete access to data, which is an act of governance by the platform owner.
Algorithms produced for one intended use are often re-used and become meaningful resources for the organization. For example, the work on the WTF recommendation system with the Real Graph has also found use in search personalization and content discovery at Twitter.

References:
Published papers by Twitter engineers and researchers on the architecture and services.