Databricks Introduces LakeFlow: Simplifying Data Engineering
Databricks, the Data and AI company, yesterday announced Databricks LakeFlow, a new solution that unifies and simplifies all aspects of data engineering, from data ingestion to transformation and orchestration. LakeFlow lets data teams efficiently ingest data at scale from databases such as MySQL, Postgres, and Oracle, as well as from enterprise applications such as Salesforce, Dynamics, SharePoint, Workday, NetSuite, and Google Analytics. Databricks is also introducing Real Time Mode for Apache Spark, which enables ultra-low-latency stream processing.
Simplified Data Engineering with LakeFlow
LakeFlow automates the deployment, operation, and monitoring of data pipelines at scale, with built-in support for CI/CD and advanced workflows that include triggering, branching, and conditional execution. It integrates data quality checks and health monitoring with alerting systems such as PagerDuty, simplifying the process of building and operating production-grade data pipelines. This efficiency enables data teams to meet the growing demand for reliable data and AI.
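For a concrete sense of what declarative data quality checks look like on Databricks today, here is a minimal sketch using Delta Live Tables expectations, the technology LakeFlow Pipelines builds on; the table and column names are hypothetical:

```python
import dlt  # Delta Live Tables Python API, available inside a Databricks pipeline

@dlt.table(comment="Cleaned orders; rows failing either check are dropped.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def orders_clean():
    # 'raw_orders' is a hypothetical upstream table in the same pipeline.
    return dlt.read("raw_orders").select("order_id", "amount", "order_ts")
```

Expectation results surface in the pipeline's event log, which is where health monitoring and alerting integrations can pick them up.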
Tackling Data Pipeline Challenges
Data engineering is crucial for democratizing data and AI within businesses, but it remains complex and challenging. Data teams often struggle to ingest data from siloed, proprietary systems and to manage the intricate logic of data preparation. Failures and latency spikes can disrupt operations and disappoint customers. Deploying pipelines and monitoring data quality typically involve disparate tools, complicating the process further. Fragmented solutions lead to low data quality, reliability issues, high costs, and growing backlogs.
LakeFlow addresses these challenges by providing a unified experience on the Databricks Data Intelligence Platform, with deep integrations with Unity Catalog for end-to-end governance and serverless compute for efficient and scalable execution.
Key Features of LakeFlow
- LakeFlow Connect: Enables simple and scalable data ingestion from various sources. It offers native, scalable connectors for databases such as MySQL, Postgres, SQL Server, and Oracle, as well as enterprise applications like Salesforce, Dynamics, SharePoint, Workday, and NetSuite. Integrated with Unity Catalog for robust data governance, LakeFlow Connect leverages the capabilities of Arcion, acquired by Databricks in November 2023, to provide efficient and low-latency data availability for batch and real-time analysis.
- LakeFlow Pipelines: Simplifies and automates real-time data pipelines. Built on Databricks’ Delta Live Tables technology, it allows data teams to implement data transformation and ETL in SQL or Python (a minimal pipeline sketch follows this list). Real Time Mode can be enabled for low-latency streaming without code changes, unifying batch and stream processing and offering incremental data processing for optimal price/performance.
- LakeFlow Jobs: Provides automated orchestration, data health, and delivery management, from scheduling notebooks and SQL queries to ML training and automatic dashboard updates (an orchestration sketch also follows this list). It enhances control flow capabilities and observability to detect, diagnose, and mitigate data issues, automating the deployment, orchestration, and monitoring of data pipelines.
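To illustrate the pipeline model described above, here is a minimal sketch of a streaming Delta Live Tables pipeline in Python; the storage path and column names are hypothetical, and Real Time Mode is not shown since, per the announcement, enabling it requires no code changes:

```python
import dlt
from pyspark.sql.functions import sum as sum_  # avoid shadowing the built-in sum

@dlt.table(comment="Raw events incrementally ingested from cloud storage (path is a placeholder).")
def events_raw():
    # Auto Loader ('cloudFiles') picks up new files as they arrive.
    return (
        spark.readStream.format("cloudFiles")  # 'spark' is provided by the pipeline runtime
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/events/")
    )

@dlt.table(comment="Revenue per customer, maintained incrementally as events stream in.")
def revenue_by_customer():
    return (
        dlt.read_stream("events_raw")
        .groupBy("customer_id")
        .agg(sum_("amount").alias("total_amount"))
    )
```

The same declarative definitions serve batch and streaming execution; the engine handles the incremental processing behind the price/performance point made above.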
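And for orchestration, a minimal sketch of scheduling a notebook with the Databricks SDK for Python, against the existing Jobs service that LakeFlow Jobs extends; the job name, notebook path, cron expression, and cluster ID are placeholders:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up credentials from the environment or a config profile

job = w.jobs.create(
    name="nightly-orders-refresh",  # hypothetical job name
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # run daily at 02:00
        timezone_id="UTC",
    ),
    tasks=[
        jobs.Task(
            task_key="refresh_orders",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/orders"),
            existing_cluster_id="<cluster-id>",  # placeholder cluster
        )
    ],
)
print(f"Created job {job.job_id}")
```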
Availability
LakeFlow represents the future of unified and intelligent data engineering. It will enter preview soon, starting with LakeFlow Connect.