A Paved Road for Data Pipelines: Part 2

Seshu Adunuthula
Intuit Engineering
Sep 15, 2021 · 8 min read

As a global technology platform company, data is a key bet for Intuit.

We help ~100M customers overcome their most important financial challenges with TurboTax, QuickBooks, Mint and Credit Karma. We invest heavily in creating awesome customer experiences that bring together experts anywhere in the world with consumers, small business owners, and institutions, and aggregate their financial information into simplified user workflows and customer care interactions, all powered by data and AI.

Data pipelines, which capture data from source systems, transform it, and make it available to machine learning (ML) and analytics platforms, are critical for enabling these experiences.

In our first installment of this Medium blog, we discussed the need for A Paved Road for Data Pipelines at Intuit, covering the various types of pipelines (ingestion, curation, and ETL, i.e., extract, transform, load), how to manage these pipelines through a developer portal, and the variety of tools available for monitoring, managing lineage, and more.

Over the past year, we’ve made substantial progress. In this second blog installment, we’ll explore data processing capabilities and take a deep dive into our batch processing platform.

Data Processing at Intuit

Data processing encompasses ingesting the data from data sources such as databases and third party applications into the data lake, transforming the data, and storing the results in data marts/data warehouses for analytics and machine learning/AI.

Stream and batch processing platforms form the basis for all data processing in the lake. Streaming, built on Apache Flink, and batch, built on Apache Spark, enable data engineers to code their business logic and deploy it to a secure, scalable, and reliable data processing platform.

Low-code/no-code platforms, layered on top of the stream and batch processing, open up data processing to a larger community of data scientists and analysts. Ingestion enables owners of the data sources to classify, encrypt and ingest data into the data lake. Curation enables data stewards to define cleansing rules and transform the raw data into clean and secure data. Superglue provides a low-code/no-code environment for data processing with advanced anomaly detection and lineage tools.

Batch Processing Platform

While there has been substantial emphasis on real-time and stream processing, batch processing continues to be a mainstay for processing data: model training, feature extraction, and reporting (daily, month-end or quarterly) are done in batch to save costs and address computational complexity. Batch pipelines need a secure, reliable, cost-effective and efficiently managed platform. Intuit’s batch processing platform is a paved road for Apache Spark and non-Spark workloads.

Pipelines, Projects and Teams

Pipelines allow data engineers to schedule one or more tasks that can be executed in a series or in parallel. A cluster is provisioned for a pipeline dynamically on a schedule, and terminated on the completion of the pipeline.
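
To make this concrete, here is a minimal sketch of a pipeline that groups two tasks run in sequence on a daily schedule; the names and fields are hypothetical illustrations, not the platform’s actual schema.

```python
# Hypothetical pipeline definition: names and fields are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    entry_point: str                      # e.g. a PySpark job or Scala class
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Pipeline:
    name: str
    schedule: str                         # e.g. cron "0 15 * * *" (3 p.m. daily)
    tasks: List[Task] = field(default_factory=list)

daily_revenue = Pipeline(
    name="daily-revenue",
    schedule="0 15 * * *",
    tasks=[
        Task("extract", "jobs.extract_orders"),
        Task("transform", "jobs.build_revenue_mart", depends_on=["extract"]),
    ],
)
```

When a run is due, a cluster is provisioned for the pipeline, the tasks execute in dependency order (or in parallel where independent), and the cluster is terminated when the run completes.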

Different types of assets, including pipeline assets, are managed in projects in Intuit’s developer portal. Projects allow multiple pipelines managed by a single team to be grouped together into a single entity.

Projects are owned and managed by teams. Individuals on a team have different roles, and the asset RBAC (role-based access control) is governed by these roles, as sketched after the list below.

  • Admin: Administrators of the pipeline have the ability to create, update, execute and delete the pipelines. They also have the ability to grant pipeline permissions to others.
  • DevOps Engineer: A DevOps engineer has the ability to start, stop and monitor pipelines. They do not have access to the underlying databases and tables to which the pipelines have access.
  • Data Engineer: In addition to permissions of a DevOps engineer, Data Engineers have access to the data accessed by the pipelines.
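
Here is a minimal sketch of a role-to-permission mapping that captures the model above; the role and permission names are hypothetical and do not reflect the platform’s actual policy representation.

```python
# Hypothetical role-to-permission mapping illustrating the RBAC model above.
ROLE_PERMISSIONS = {
    "admin":           {"create", "update", "execute", "delete", "grant"},
    "devops_engineer": {"start", "stop", "monitor"},
    # Data engineers get the DevOps permissions plus access to pipeline data.
    "data_engineer":   {"start", "stop", "monitor", "read_data"},
}

def is_allowed(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("data_engineer", "read_data")
assert not is_allowed("devops_engineer", "read_data")
```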

Data from a variety of data sources, such as Oracle, RDS (Relational Database Service), third-party sources and flat files, lands in the central data lake account. The pipelines are hosted in processing accounts administered by the business units (BUs); they read the data from the central data lake and store the processed data in their specific processing accounts.

Processors

The tasks within a pipeline are configured in the batch processing platform (BPP) using processors. Processors by default execute Spark logic, with support for PySpark, Scala and other templates. Processors are tuned based on the volume of the data they process and the complexity of the business logic.

The Spark driver program, which manages the execution of the pipeline, and the executors, where the logic runs, are both tuned in the processor configuration. The cores, memory and instances can be tuned based on the capacity requirements of the pipeline.
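
As a rough illustration of the kind of driver/executor tuning a processor configuration controls, the sketch below sets these values on a PySpark session; the application name and numbers are placeholders, not recommendations.

```python
# Illustrative driver/executor tuning for a Spark processor; values are
# placeholders, not recommended settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("order-curation-processor")       # hypothetical processor name
    .config("spark.driver.cores", "2")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.instances", "10")  # scale with data volume
    .getOrCreate()
)
```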

Processors are tightly integrated into Intuit’s CI/CD (continuous integration/continuous delivery) platform that is used for development of all services at Intuit. A git repository is created for the processor or users can link to existing repositories. Intuit’s build platform sets up an automated build infrastructure to build, test and verify changes to the processor code.

Processor binaries are automatically versioned and stored in Intuit’s Artifactory. When the processor logic changes and a new version is built into Artifactory, pipelines are redeployed to use the latest version of the processor binaries.

Scheduling

Scheduling is the core capability of our batch processing platform. The most basic approach for scheduling is calendar-driven (e.g., run this pipeline every day at 3 p.m. PST, run the pipeline every hour from 9 a.m. to 5 p.m.).

A more common use case is scheduling upon completion of an upstream pipeline or an event, such as a “done” file placed in an S3 bucket or a trigger from a legacy scheduler (Tidal). To schedule based on completion events such as done files, developers use Lambda functions to trigger the pipelines through APIs.
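
As an example of the event-driven path, here is a minimal sketch of an S3-triggered Lambda that starts a pipeline run when a done file lands; the pipeline API endpoint, payload shape and file naming convention are hypothetical.

```python
# Sketch of an S3-triggered Lambda that kicks off a pipeline run when a
# "done" file lands. The endpoint and payload are hypothetical.
import json
import urllib.request

PIPELINE_API = "https://bpp.example.com/v1/pipelines/daily-revenue/runs"  # hypothetical

def handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if key.endswith("_DONE"):                       # upstream signals completion
            req = urllib.request.Request(
                PIPELINE_API,
                data=json.dumps({"trigger": key}).encode(),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            urllib.request.urlopen(req)                 # start the pipeline run
```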

Support for Multiple Runtimes

The data pipelines read raw data ingested into the central data lake and process the data for downstream consumption (e.g., reporting, machine learning, AI). Intuit’s data lake is federated across multiple AWS accounts: data is ingested into the central data lake account by the ingestion platform and processed in specific accounts (tax, work, finance, etc.).

Runtimes that execute the pipelines are hosted inside the processing accounts for administration. Data engineers have the option of selecting from three different runtimes for executing their pipelines:

Spark on Kubernetes

For this option, users have access to the open source version of Spark, hosted on Intuit’s Kubernetes service (IKS) using Google’s Spark Operator. Open source Spark (vs. vendor-optimized Spark) is ideal for simple pipelines and is typically the most cost-effective alternative.

With this deployment option, we have multiple shared IKS clusters, each of which can scale up to 1,000 nodes. Using Kubernetes namespaces, we can support multiple virtual clusters backed by the same physical cluster in an environment with many users spread across multiple teams or projects. A default instance group that spans all namespaces helps keep costs under control by scaling the cluster down to the minimum number of nodes when idle. The IKS instance groups support heterogeneous EC2 (Elastic Compute Cloud) instance types, which also helps keep costs under control by matching instance types to CPU- or memory-intensive data pipelines.
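
For readers unfamiliar with the Spark Operator pattern, the sketch below submits a SparkApplication custom resource with the Kubernetes Python client; the namespace, image and file paths are made up, and this illustrates the open source operator rather than the platform’s internal deployment path.

```python
# Illustrative SparkApplication submission via Google's Spark Operator CRD;
# namespace, image and paths are hypothetical.
from kubernetes import client, config

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "daily-revenue", "namespace": "team-analytics"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "spark:3.3.0",
        "mainApplicationFile": "s3a://bucket/jobs/build_revenue_mart.py",
        "sparkVersion": "3.3.0",
        "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
        "executor": {"cores": 2, "instances": 5, "memory": "4g"},
    },
}

config.load_kube_config()                       # or load_incluster_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="team-analytics",
    plural="sparkapplications",
    body=spark_app,
)
```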

AWS EMR

AWS EMR (Elastic MapReduce) is the most common deployment model for pipelines. For each pipeline execution, an EMR cluster is provisioned dynamically and terminated upon completion of the pipeline. EMR clusters are provisioned in specific processing accounts and have access to the databases and tables owned by a specific product group. Spark in EMR is optimized by AWS for fast performance, and may be more cost effective than the open source version, depending on the workload.

BPP enables advanced configuration of EMR instances for optimal cost/performance. The EMR execution role, master, core and task node settings, autoscaling policies, and bootstrap scripts can all be configured in the advanced settings page for the EMR cluster.
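
The sketch below shows what a dynamically provisioned, transient EMR cluster of this kind might look like when requested through boto3; the release label, instance types, paths and IAM role names are illustrative placeholders.

```python
# Sketch of provisioning a transient EMR cluster for one pipeline run;
# values are illustrative, not Intuit's actual configuration.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

emr.run_job_flow(
    Name="daily-revenue-pipeline",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "r5.2xlarge", "InstanceCount": 3},
            # Spot pricing on task nodes is one common cost lever.
            {"InstanceRole": "TASK", "InstanceType": "r5.2xlarge",
             "InstanceCount": 5, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,      # terminate when the steps finish
    },
    Steps=[{
        "Name": "build-revenue-mart",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://bucket/jobs/build_revenue_mart.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```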

Databricks

Databricks offers an advanced and performant Spark engine that can excel for pipelines working on large volumes of data with complex business logic. Databricks Workspaces are provisioned and centrally managed in each of the processing accounts. Databricks clusters within these workspaces are dynamically provisioned when pipelines are scheduled to run and terminated upon completion of the pipeline.
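
For context on how a transient Databricks job cluster of this kind can be created, here is a minimal sketch against the public Databricks Jobs REST API; the workspace URL, token handling, cluster settings and file path are hypothetical and not Intuit’s setup.

```python
# Sketch of creating a Databricks job that runs on a transient job cluster,
# created per run and terminated afterwards. All values are hypothetical.
import requests

WORKSPACE = "https://example.cloud.databricks.com"   # hypothetical workspace URL
TOKEN = "..."                                        # supplied via a secret store

job_spec = {
    "name": "daily-revenue",
    "tasks": [{
        "task_key": "build_revenue_mart",
        "new_cluster": {                             # job cluster: created per run
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "r5.2xlarge",
            "num_workers": 8,
        },
        "spark_python_task": {"python_file": "dbfs:/jobs/build_revenue_mart.py"},
    }],
}

resp = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
    timeout=30,
)
job_id = resp.json()["job_id"]
```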

Monitoring Pipelines

A key capability of the batch processing platform is its ability to monitor pipelines and allow pipeline owners to diagnose issues with pipeline execution, control cost, and meet pipeline service-level agreements.

A variety of dashboards are available to pipeline administrators, as follows:

  • Spark History Server: This server is available with standard Apache Spark, enabling users to look at the Spark configuration for the pipeline, including driver and executor metrics, application logs, and much more, to detect performance and cost overrun issues.
  • AWS Cost Dashboard: This dashboard provides actual cost details for executing the pipeline based on Intuit’s EC2 pricing, and is used for cost optimization strategies such as spot pricing for task nodes.
  • Splunk Dashboard: This dashboard is customized to analyze all relevant logs pushed to Splunk from the pipeline (e.g., runtime, Spark engine, application logs) and to debug pipeline execution issues.
  • Wavefront Dashboard: This dashboard is customized to capture pipeline execution metrics (e.g., resource utilization, execution times) that have been pushed to Wavefront.
  • Intuit Kubernetes Service Dashboard: This dashboard monitors the IKS clusters that host the batch processing platform and is useful for diagnosing platform stability and performance issues.

Cost Monitoring

Computing the actual dollar cost of a pipeline execution based on Intuit’s AWS pricing is an important capability of the batch processing platform. Data engineers test a pipeline’s execution in a pre-production environment with all three runtimes (Spark on Kubernetes, EMR and Databricks) and choose the most cost-effective option. For simple pipelines, open source Spark is typically the best solution. Databricks is one of the most performant Spark engines in the industry, but also one of the most expensive, so it’s typically selected when workload demands require it.
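
The comparison itself is back-of-the-envelope arithmetic: multiply node count, hourly rate and runtime for each option and pick the cheapest. The sketch below uses made-up rates and runtimes, not Intuit’s negotiated EC2 pricing.

```python
# Back-of-the-envelope cost comparison across runtimes; rates and runtimes
# below are made-up numbers, not Intuit's actual pricing.
def pipeline_cost(nodes: int, hourly_rate_usd: float, runtime_hours: float) -> float:
    return nodes * hourly_rate_usd * runtime_hours

candidates = {
    "spark-on-kubernetes": pipeline_cost(nodes=8, hourly_rate_usd=0.50, runtime_hours=2.0),
    "emr":                 pipeline_cost(nodes=8, hourly_rate_usd=0.55, runtime_hours=1.5),
    "databricks":          pipeline_cost(nodes=8, hourly_rate_usd=0.85, runtime_hours=1.0),
}

print(min(candidates, key=candidates.get), candidates)
```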

Conclusion & Future Work

We’re incredibly proud of the progress we’ve made with Intuit’s paved road for data pipelines since our first installment of this Medium Engineering Blog in August 2020.

To date, we have nearly 100 pipelines deployed in production, with “hockey stick” adoption of our batch processing platform across the company. The platform has proven more than capable of supporting the full gamut of data processing needs, with advanced capabilities for pipeline security, multiple runtimes, tight integration with our build platform, and advanced monitoring.

Stay tuned for more as we continue our data pipeline journey in the coming months.

Seshu Adunuthula is the Head of Data Lake Infrastructure at Intuit. He is passionate about the Next Generation Data Platform built on public cloud.