Apache Iceberg: The Definitive Guide

Apache Iceberg: The Definitive Guide
Author: Tomer Shiran
Publisher: "O'Reilly Media, Inc."
Total Pages: 344
Release: 2024-05-02
Genre: Computers
ISBN: 1098148592

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool—a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of priority tools and formats, which creates data silos and data drift. This practical book shows you a better way. Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg. With this book, you'll learn: The architecture of Apache Iceberg tables What happens under the hood when you perform operations on Iceberg tables How to further optimize Iceberg tables for maximum performance How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.

Apache Iceberg: The Definitive Guide

Apache Iceberg: The Definitive Guide
Author: Tomer Shiran
Publisher: O'Reilly Media
Total Pages: 0
Release: 2024-02-29
Genre:
ISBN: 9781098148621

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool--a cost-prohibitive process for making warehouse features available to all of your data. This lack of flexibility forces you to adjust your workflow to the tool your data is locked in, which creates data silos and data drift. This book shows you a better way. Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this lakehouse. Authors Tomer Shiran, Jason Hughes, Alex Merced, and Dipankar Mazumdar from Dremio guide you through the process. With this book, you'll learn: The architecture of Apache Iceberg tables What happens under the hood when you perform operations on Iceberg tables How to further optimize Apache Iceberg tables for maximum performance How to use Apache Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio Sonar How Apache Iceberg can be used in streaming and batch ingestion Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.

The Definitive Guide to Data Integration

The Definitive Guide to Data Integration
Author: Pierre-Yves BONNEFOY
Publisher: Packt Publishing Ltd
Total Pages: 490
Release: 2024-03-29
Genre: Computers
ISBN: 1837634777

Learn the essentials of data integration with this comprehensive guide, covering everything from sources to solutions, and discover the key to making the most of your data stack Key Features Learn how to leverage modern data stack tools and technologies for effective data integration Design and implement data integration solutions with practical advice and best practices Focus on modern technologies such as cloud-based architectures, real-time data processing, and open-source tools and technologies Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionThe Definitive Guide to Data Integration is an indispensable resource for navigating the complexities of modern data integration. Focusing on the latest tools, techniques, and best practices, this guide helps you master data integration and unleash the full potential of your data. This comprehensive guide begins by examining the challenges and key concepts of data integration, such as managing huge volumes of data and dealing with the different data types. You’ll gain a deep understanding of the modern data stack and its architecture, as well as the pivotal role of open-source technologies in shaping the data landscape. Delving into the layers of the modern data stack, you’ll cover data sources, types, storage, integration techniques, transformation, and processing. The book also offers insights into data exposition and APIs, ingestion and storage strategies, data preparation and analysis, workflow management, monitoring, data quality, and governance. Packed with practical use cases, real-world examples, and a glimpse into the future of data integration, The Definitive Guide to Data Integration is an essential resource for data eclectics. By the end of this book, you’ll have the gained the knowledge and skills needed to optimize your data usage and excel in the ever-evolving world of data.What you will learn Discover the evolving architecture and technologies shaping data integration Process large data volumes efficiently with data warehousing Tackle the complexities of integrating large datasets from diverse sources Harness the power of data warehousing for efficient data storage and processing Design and optimize effective data integration solutions Explore data governance principles and compliance requirements Who this book is for This book is perfect for data engineers, data architects, data analysts, and IT professionals looking to gain a comprehensive understanding of data integration in the modern era. Whether you’re a beginner or an experienced professional enhancing your knowledge of the modern data stack, this definitive guide will help you navigate the data integration landscape.

Snowflake: The Definitive Guide

Snowflake: The Definitive Guide
Author: Joyce Kay Avila
Publisher: "O'Reilly Media, Inc."
Total Pages: 468
Release: 2022-08-11
Genre: Computers
ISBN: 1098103793

Snowflake's ability to eliminate data silos and run workloads from a single platform creates opportunities to democratize data analytics, allowing users at all levels within an organization to make data-driven decisions. Whether you're an IT professional working in data warehousing or data science, a business analyst or technical manager, or an aspiring data professional wanting to get more hands-on experience with the Snowflake platform, this book is for you. You'll learn how Snowflake users can build modern integrated data applications and develop new revenue streams based on data. Using hands-on SQL examples, you'll also discover how the Snowflake Data Cloud helps you accelerate data science by avoiding replatforming or migrating data unnecessarily. You'll be able to: Efficiently capture, store, and process large amounts of data at an amazing speed Ingest and transform real-time data feeds in both structured and semistructured formats and deliver meaningful data insights within minutes Use Snowflake Time Travel and zero-copy cloning to produce a sensible data recovery strategy that balances system resilience with ongoing storage costs Securely share data and reduce or eliminate data integration costs by accessing ready-to-query datasets available in the Snowflake Marketplace

Trino: The Definitive Guide

Trino: The Definitive Guide
Author: Matt Fuller
Publisher: "O'Reilly Media, Inc."
Total Pages: 310
Release: 2021-04-14
Genre: Computers
ISBN: 1098107683

Perform fast interactive analytics against different data sources using the Trino high-performance distributed SQL query engine. With this practical guide, you'll learn how to conduct analytics on data where it lives, whether it's Hive, Cassandra, a relational database, or a proprietary data store. Analysts, software engineers, and production engineers will learn how to manage, use, and even develop with Trino. Initially developed by Facebook, open source Trino is now used by Netflix, Airbnb, LinkedIn, Twitter, Uber, and many other companies. Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Trino query can combine data from multiple sources to allow for analytics across your entire organization. Get started: Explore Trino's use cases and learn about tools that will help you connect to Trino and query data Go deeper: Learn Trino's internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more Put Trino in production: Secure Trino, monitor workloads, tune queries, and connect more applications; learn how other organizations apply Trino

Delta Lake: The Definitive Guide

Delta Lake: The Definitive Guide
Author: Denny Lee
Publisher: "O'Reilly Media, Inc."
Total Pages: 391
Release: 2024-10-30
Genre: Computers
ISBN: 1098151909

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake--including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you: Understand key data reliability challenges and how Delta Lake solves them Explain the critical role of Delta transaction logs as a single source of truth Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino Architect data lakehouses with the medallion architecture Optimize Delta Lake performance with features like deletion vectors and liquid clustering

Trino: The Definitive Guide

Trino: The Definitive Guide
Author: Matt Fuller
Publisher: "O'Reilly Media, Inc."
Total Pages: 333
Release: 2022-10-03
Genre: Computers
ISBN: 1098137191

Perform fast interactive analytics against different data sources using the Trino high-performance distributed SQL query engine. In the second edition of this practical guide, you'll learn how to conduct analytics on data where it lives, whether it's a data lake using Hive, a modern lakehouse with Iceberg or Delta Lake, a different system like Cassandra, Kafka, or SingleStore, or a relational database like PostgreSQL or Oracle. Analysts, software engineers, and production engineers learn how to manage, use, and even develop with Trino and make it a critical part of their data platform. Authors Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Trino query can combine data from multiple sources to allow for analytics across your entire organization. Explore Trino's use cases, and learn about tools that help you connect to Trino for querying and processing huge amounts of data Learn Trino's internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more Deploy and secure Trino at scale, monitor workloads, tune queries, and connect more applications Learn how other organizations apply Trino successfully

Spark: The Definitive Guide

Spark: The Definitive Guide
Author: Bill Chambers
Publisher: "O'Reilly Media, Inc."
Total Pages: 594
Release: 2018-02-08
Genre: Computers
ISBN: 1491912294

Learn how to use, deploy, and maintain Apache Spark with this comprehensive guide, written by the creators of the open-source cluster-computing framework. With an emphasis on improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break down Spark topics into distinct sections, each with unique goals. Youâ??ll explore the basic operations and common functions of Sparkâ??s structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and scenarios for employing MLlib, Sparkâ??s scalable machine-learning library. Get a gentle overview of big data and Spark Learn about DataFrames, SQL, and Datasetsâ??Sparkâ??s core APIsâ??through worked examples Dive into Sparkâ??s low-level APIs, RDDs, and execution of SQL and DataFrames Understand how Spark runs on a cluster Debug, monitor, and tune Spark clusters and applications Learn the power of Structured Streaming, Sparkâ??s stream-processing engine Learn how you can apply MLlib to a variety of problems, including classification or recommendation

The Cloud Data Lake

The Cloud Data Lake
Author: Rukmani Gopalan
Publisher: "O'Reilly Media, Inc."
Total Pages: 270
Release: 2022-12-12
Genre: Computers
ISBN: 1098116542

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights. This book provides a concise yet comprehensive overview on the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance. Learn the benefits of a cloud-based big data strategy for your organization Get guidance and best practices for designing performant and scalable data lakes Examine architecture and design choices, and data governance principles and strategies Build a data strategy that scales as your organizational and business needs increase Implement a scalable data lake in the cloud Use cloud-based advanced analytics to gain more value from your data