Apache Hudi: The Definitive Guide: Building Robust, Open, and High-Performing Data Lakehouses (Final Release)
- Added by: literator
- Date: 25-10-2025, 15:53
Author: Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Publisher: O'Reilly Media, Inc.
Year: 2026
Pages: 354
Language: English
Format: epub
Size: 10.1 MB
Overcome challenges in building transactional guarantees on rapidly changing data by using Apache Hudi. With this practical guide, data engineers, data architects, and software architects will discover how to seamlessly build an interoperable lakehouse from disparate data sources and deliver faster insights using their query engine of choice.
Authors Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, and Rebecca Bilbro provide practical examples and insights to help you unlock the full potential of data lakehouses for different levels of analytics, from batch to interactive to streaming. You'll also learn how to evaluate storage choices and leverage built-in automated table optimizations to build, maintain, and operate production data applications.
Hudi introduced several foundational concepts that have since become synonymous with the modern lakehouse architecture: incremental change capture, write-optimized storage formats like Merge-on-Read, record-level upserts, and background table services for compaction, clustering, and cleaning. Systems like Delta Lake and Apache Iceberg, which followed Hudi, adopted many of these principles and extended the conversation around openness and interoperability. At the time, these ideas were radical. Today, they're foundational.
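To make "record-level upserts" concrete: an upsert merges incoming change records into existing data by a record key, updating rows that already exist and inserting the rest. The following is a toy Python sketch of the idea only; it is not the Hudi API, and the function and field names (`upsert`, `id`, `fare`) are illustrative.

```python
# Toy illustration of record-level upserts (NOT the Hudi API):
# merge incoming change records into an existing table by record key,
# with a change record replacing any existing row that has the same key.

def upsert(table, changes, key="id"):
    """Return a new 'table' with `changes` merged in by record key."""
    merged = {row[key]: row for row in table}  # index existing rows by key
    for row in changes:
        merged[row[key]] = row                 # update if present, else insert
    return sorted(merged.values(), key=lambda r: r[key])

table = [{"id": 1, "fare": 10.0}, {"id": 2, "fare": 20.0}]
changes = [{"id": 2, "fare": 25.0}, {"id": 3, "fare": 30.0}]  # one update, one insert
print(upsert(table, changes))
# → [{'id': 1, 'fare': 10.0}, {'id': 2, 'fare': 25.0}, {'id': 3, 'fare': 30.0}]
```

In a real lakehouse table this key-based merge happens against files on cloud storage, which is why the indexing, Merge-on-Read, and compaction machinery described above exists.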
In many ways, Hudi sparked one of the most significant shifts in database technology over the past decade. The vibrant Hudi community has been instrumental in this journey. What began as an internal project has grown into a thriving open source ecosystem with contributors from across the world—engineers, architects, researchers—each helping to evolve the system to meet new use cases and challenges. This vision, powered by Hudi and its successors, has redefined what it means to build data platforms at scale.
But Hudi has always charted its own course. It was the first to enable incremental pipelines natively, allowing downstream systems to consume only what changed. It was the first to unify streaming and batch ingestion within the same table abstraction. Today, it continues to lead with innovations like secondary indexing, non-blocking concurrency control, and metadata-driven optimizations, and to evolve toward AI-ready storage formats that support vector searches, feature engineering, and model training at scale.
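The "consume only what changed" idea above can be pictured as a commit-time filter: each write is tagged with a commit time, and a downstream consumer asks only for records committed after the point it last read. A toy Python sketch of the concept follows; it is not the Hudi incremental-query API, and `incremental_read` is a hypothetical name.

```python
# Toy sketch of incremental consumption (NOT the Hudi API): writes are
# tagged with a commit time, and a downstream pipeline pulls only the
# records committed strictly after its last checkpoint.

def incremental_read(log, since):
    """Return records whose commit time is strictly greater than `since`."""
    return [rec for commit_time, rec in log if commit_time > since]

log = [
    (100, {"id": 1, "event": "created"}),
    (200, {"id": 2, "event": "created"}),
    (300, {"id": 1, "event": "updated"}),
]
print(incremental_read(log, since=0))    # first pull: everything
print(incremental_read(log, since=200))  # later pull: only the new update
```

This is what lets incremental pipelines avoid rescanning whole tables: each run processes only the delta since its previous checkpoint.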
This book helps you:
• Understand the need for transactional data lakehouses and the challenges associated with building them
• Explore data ecosystem support provided by Apache Hudi for popular data sources and query engines
• Perform different write and read operations on Apache Hudi tables and effectively use them for various use cases, including batch and stream applications
• Apply different storage techniques and considerations such as indexing and clustering to maximize your lakehouse performance
• Build end-to-end incremental data pipelines using Apache Hudi for faster ingestion and fresher analytics
Who This Book Is For:
This book is written for practitioners: the engineers, architects, and technical leaders who design, build, and operate large-scale data platforms. You’ll find it useful if you are one of the following:
• A data engineer or platform engineer responsible for building ingestion pipelines or managing high-velocity data streams
• A data architect evaluating ways to unify data lakes and warehouses
• A developer or analyst who needs consistent, incremental access to large and evolving datasets
• A technical manager or leader making strategic decisions about adopting lakehouse technologies
This is not a beginner’s introduction to databases or distributed systems. Readers should already be comfortable writing SQL, be familiar with distributed processing engines such as Apache Spark or Apache Flink, and have a basic understanding of data pipelines. While deep expertise is not required, the book moves quickly from foundational principles to advanced operational guidance.
