Greenfield as starting point
Building a data platform on a greenfield, as we found here, offers many possibilities but also challenges. In terms of tools there were no restrictions, apart from being asked to consider Google Cloud components first. However, a process for deploying products was already in place and had to be used.
Choosing the components
This initial situation led to some deliberation about tools and processes. Fixed in this process was Jenkins CI for deploying code to the staging and production environments, along with some other constraints, such as an Infrastructure-as-Code approach to deployments. On the tooling side, we first looked at Google-provided services like BigQuery, Dataflow and Composer, which then became the backbone of our data platform.
BigQuery is a data warehouse service provided by Google. Its serverless, scalable architecture, combined with a SQL dialect compliant with the SQL:2011 standard, made it an easy choice for providing the API to our data. It also integrates seamlessly with Google Cloud Storage, where a large part of our data was already stored.
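As a minimal sketch of what that API looks like from the consumer side, assuming the google-cloud-bigquery client library is installed and default credentials are configured (dataset and table names here are placeholders):

```python
def build_sql(dataset, table, limit):
    """Compose a standard-SQL statement (kept separate so it is easy to test)."""
    return f"SELECT * FROM `{dataset}.{table}` LIMIT {int(limit)}"

def query_rows(dataset, table, limit=10):
    """Run a standard-SQL query against BigQuery and return rows as dicts.

    Assumes google-cloud-bigquery and default credentials; names are
    placeholders for illustration.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row) for row in client.query(build_sql(dataset, table, limit)).result()]
```

Because the dialect is standard SQL, clients can reuse existing query tooling instead of learning a proprietary interface.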
Dataflow, our data processing engine, can process both streaming and batch data, which lets us use a single framework for both types of processing. Since it is fully managed, it also removes the overhead of reserving resources and servers.
Beam is a unified programming model for batch and real-time processing that is portable across different engines, such as Dataflow, Spark and Flink. Using this framework keeps us independent of the underlying system, so we can migrate from one engine to another with little overhead.
Composer is a hosted version of Apache Airflow. It lets us schedule and manage our workflows programmatically as DAGs. Airflow is also used and developed outside of Google, so choosing it keeps us more independent.
Geode is an in-memory data grid that can be used complementary to a data warehouse like BigQuery to provide real-time data access. It offers an out-of-the-box REST API for programmatically fetching results into data services.
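As a sketch of how a data service might read from that REST API, using only the standard library; the base URL assumes Geode's default REST endpoint (`/geode/v1` on port 7070), and the region and key are placeholders:

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed default Geode developer REST endpoint; adjust host/port to your setup.
GEODE_BASE = "http://localhost:7070/geode/v1"

def region_url(region, key=None):
    """Build the REST URL for a whole region or a single entry in it."""
    url = f"{GEODE_BASE}/{quote(region)}"
    if key is not None:
        url += f"/{quote(str(key))}"
    return url

def fetch_entry(region, key):
    """GET one entry from a Geode region and decode the JSON body."""
    with urllib.request.urlopen(region_url(region, key)) as resp:
        return json.load(resp)
```

Because the API is plain HTTP plus JSON, consuming services need no Geode client library at all.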
Another way to deliver data on time is to push it to a message queue; the one provided by Google is Pub/Sub. By sending transformation results, e.g. predictions, to a queue, other services can consume these results in real time and use them for decision making.
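A small sketch of the publishing side, assuming the google-cloud-pubsub client library and configured credentials; project, topic and the payload fields are placeholders:

```python
import json

def encode_prediction(prediction):
    """Serialize a prediction dict to the bytes payload Pub/Sub expects."""
    return json.dumps(prediction, sort_keys=True).encode("utf-8")

def publish_prediction(project, topic, prediction):
    """Publish one prediction to a Pub/Sub topic.

    Assumes google-cloud-pubsub is installed and credentials are set;
    project and topic names are placeholders.
    """
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    future = publisher.publish(topic_path, encode_prediction(prediction))
    return future.result()  # blocks until Pub/Sub acknowledges the message
```

Downstream services simply attach a subscription to the topic and receive each result moments after the transformation produced it.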
Making all of this come together as our data platform was the main challenge in designing and implementing the system. As programming language we decided on Python: all of the tools except Geode can be programmed with it. As infrastructure we use the Kubernetes cluster started by the Composer environment. Composer, or rather Airflow, can start both Docker containers and Dataflow jobs from a DAG.
So all our processes either run in Dataflow / Apache Beam or are defined in a Docker container. This has the advantage that each task gets its own, separate set of dependencies, making the tasks independent of each other. For example, although Google deploys Composer with Python 2.7, we can still deploy tasks that use Python 3 this way. Composer starts each task as a Pod in the Composer cluster, while Dataflow runs completely independently of this cluster, which lets us run more jobs in parallel.
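The pattern above can be sketched as a DAG definition; this is a configuration sketch assuming the Airflow 1.10-era operators that shipped with Composer at the time, and the image, bucket and module names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG("data_platform_example",
         start_date=datetime(2019, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    # Runs as its own Pod in the Composer cluster, with its own
    # dependencies -- e.g. a Python 3 image, even though Composer
    # itself runs on Python 2.7.
    ingest = KubernetesPodOperator(
        task_id="ingest",
        name="ingest",
        namespace="default",
        image="eu.gcr.io/my-project/ingest:latest",
    )

    # Runs outside the cluster, on the managed Dataflow service.
    transform = DataFlowPythonOperator(
        task_id="transform",
        py_file="gs://my-bucket/pipelines/transform.py",
    )

    ingest >> transform
```

Each task carries its environment with it (a container image or a Dataflow job), so the scheduler's own runtime never constrains the tasks.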
Data management setup
Having set up the process structure like this, we needed to provide a SQL-compatible API to expose the data to our clients. To handle data structures efficiently across several systems, we decided on Apache Avro as our data format. Avro provides a schema for each file, and the schema is stored with the data. In addition, the data is splittable and compressible by MapReduce. We exploit these attributes by creating external tables in BigQuery on top of the generated Avro files: table definitions are read directly from the files, and schema evolution is provided by Avro itself and supported by BigQuery. Data definitions are handled in our schema registry, and only there, to keep them in one place and programmatically changeable everywhere in the system simply by deploying a new schema. The analytical setup based on BigQuery is complemented by Geode and Pub/Sub for providing data in real time.
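As a sketch of the two artifacts involved, an Avro schema and the external-table definition on top of the resulting files; the record fields, dataset and bucket names are illustrative, and the DDL form assumes BigQuery's `CREATE EXTERNAL TABLE` statement (the same definition can also be supplied through the table-definition API):

```python
# An Avro schema as it might live in the schema registry; field names
# are illustrative placeholders.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "payload", "type": ["null", "string"], "default": None},
    ],
}

def external_table_ddl(dataset, table, gcs_uri):
    """Build the BigQuery DDL for an external table over Avro files.

    BigQuery reads the column definitions from the Avro files themselves,
    so no explicit column list is needed here.
    """
    return (
        f"CREATE EXTERNAL TABLE `{dataset}.{table}`\n"
        f"OPTIONS (format = 'AVRO', uris = ['{gcs_uri}'])"
    )

print(external_table_ddl("analytics", "events", "gs://my-bucket/events/*.avro"))
```

Deploying a new schema version thus only means writing new Avro files; the external table picks up the evolved definition without a separate migration.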