HOW TWITTER DUMPED HADOOP TO ADOPT GOOGLE CLOUD
Back in May 2018, Twitter announced that they were collaborating with Google Cloud to migrate their services to the cloud. Today, after two years, they have successfully migrated and have also started reaping the benefits of this move.
For the past 14 years, Twitter has been developing its data transformation pipelines to handle the load of its massive user base. The first deployments for those pipelines were initially running in Twitter’s data centers. For example, Twitter’s Hadoop file systems hosted more than 300PB of data across tens of thousands of servers.
This system had some limitations despite having consistently sustained massive scale. With increasing user base, it became challenging to configure and extend new features to some parts of the system, which led to failures.
In the next section, we take a look at how Twitter’s engineering team, in collaboration with Google Cloud, have successfully migrated their platform.
How The Migration Took Place
For the first step, Twitter’s team left a few pipelines such as the data aggregation legacy Scalding pipelines, unchanged. These pipelines were made to run at their own data centers. But the batch layer’s output was switched to two separate storage locations in Google Cloud.
The output aggregations from the Scalding pipelines were first transcoded from Hadoop sequence files to Avro on-prem, staged in four-hour batches to Cloud Storage, and then loaded into BigQuery, which is Google’s serverless and highly scalable data warehouse, to support ad-hoc and batch queries.
This data from BigQuery is then read by a simple pipeline deployed on Dataflow, and some light transformations are applied. Finally, results from the Dataflow pipeline are written into Bigtable. This is a Cloud Bigtable for low-latency, fully managed NoSQL database that serves as a backend for online dashboards and consumer APIs.
Also Read COVID-19 Misinformation Tweets Being Removed By Twitter
With the successful installation of the first iteration, the team began to redesign the rest of the data analytics pipeline using Google Cloud technologies.
After evaluating all possible options, the team chose Apache Beam because of its deep integration with other Google Cloud products, such as Bigtable, BigQuery, and Pub/Sub, Google Cloud’s fully managed, real-time messaging service.
A BigQuery slot can be defined as a unit of computational capacity required to execute SQL queries.
The Twitter team re-implemented the batch layer as follows:
Data is first staged from on-prem HDFS to Cloud Storage
A batch Dataflow job then regularly loads the data from Cloud Storage and processes the aggregations, and
The results are then written to BigQuery for ad-hoc analysis and Bigtable for the serving system.
For instance, the results showed that processing 800+ queries(~1 TB of data each) took a median execution time of 30 seconds.
Migration final picture via GCP
The above picture illustrates the final architecture after the second step of migration.
For job orchestration, the Twitter team built a custom command line tool that processes the configuration files to call the Dataflow API and submit jobs.
What Do The Numbers Say
The migration for modernization of advertising data platforms started back in 2017, and today, Twitter’s strategies have come to fruition, as can be seen in their annual earnings report.
The revenue for Twitter can be divided mainly into two categories:
Data licensing and other services.
According to the quarterly earnings report for the year 2019, Twitter has declared decent profits with steady progress.
“We reached a new milestone in Q4 with quarterly revenue in excess of $1 billion, reflecting steady progress on revenue product and solid performance across most major geographies, with particular strength in US advertising,” said Ned Segal, Twitter’s CFO.
Also Read How Google’s BigQuery ML Is Empowering Data Analysts
The 2019 revenue was $3.46 billion, which is an increase of 14% year-over-year.
Advertising revenue totalled $885 million, an increase of 12% year-over-year
Total ad engagements increased by 29% year-over-year
The motivation behind Twitter’s migration to GCP also involves other factors like the democratization of data analysis. For Twitter’s engineering team, visualization, and machine learning in a secure way is a top priority, and this is where Google’s tools such as BigQuery and Data Studio came in handy.
Although Google’s tools were used for simple pipelines, Twitter, however, had to build their infrastructure called Airflow. In the area of data governance, BigQuery services for authentication, authorization, and auditing did well but, for metadata management and privacy compliance, in house systems had to be designed.