Title: Programming Hive: Data Warehouse and Query Language for Hadoop
Authors: Dean Wampler, Jason Rutherglen, Edward Capriolo
Publisher: O'Reilly Media
Year: 2012
ISBN: 9781449319335
Format: pdf
Pages: 350
Size: 30.6 MB
Language: English
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You’ll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
– Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
– Customize data formats and storage options, from files to external databases
– Load and extract data from tables, and use queries, grouping, filtering, joining, and other conventional query methods
– Gain best practices for creating user defined functions (UDFs)
– Learn Hive patterns you should use and anti-patterns you should avoid
– Integrate Hive with other data processing programs
– Use storage handlers for NoSQL databases and other datastores
– Learn the pros and cons of running Hive on Amazon’s Elastic MapReduce
Chapter 1 Introduction
– An Overview of Hadoop and MapReduce
– Hive in the Hadoop Ecosystem
– Java Versus Hive: The Word Count Algorithm
– What’s Next
Chapter 2 Getting Started
– Installing a Preconfigured Virtual Machine
– Detailed Installation
– What Is Inside Hive?
– Starting Hive
– Configuring Your Hadoop Environment
– The Hive Command
– The Command-Line Interface
Chapter 3 Data Types and File Formats
– Primitive Data Types
– Collection Data Types
– Text File Encoding of Data Values
– Schema on Read
Chapter 4 HiveQL: Data Definition
– Databases in Hive
– Alter Database
– Creating Tables
– Partitioned, Managed Tables
– Dropping Tables
– Alter Table
Chapter 5 HiveQL: Data Manipulation
– Loading Data into Managed Tables
– Inserting Data into Tables from Queries
– Creating Tables and Loading Them in One Query
– Exporting Data
Chapter 6 HiveQL: Queries
– SELECT … FROM Clauses
– WHERE Clauses
– GROUP BY Clauses
– JOIN Statements
– ORDER BY and SORT BY
– DISTRIBUTE BY with SORT BY
– CLUSTER BY
– Casting
– Queries that Sample Data
– UNION ALL
Chapter 7 HiveQL: Views
– Views to Reduce Query Complexity
– Views that Restrict Data Based on Conditions
– Views and Map Type for Dynamic Tables
– View Odds and Ends
Chapter 8 HiveQL: Indexes
– Creating an Index
– Rebuilding the Index
– Showing an Index
– Dropping an Index
– Implementing a Custom Index Handler
Chapter 9 Schema Design
– Table-by-Day
– Over Partitioning
– Unique Keys and Normalization
– Making Multiple Passes over the Same Data
– The Case for Partitioning Every Table
– Bucketing Table Data Storage
– Adding Columns to a Table
– Using Columnar Tables
– (Almost) Always Use Compression!
Chapter 10 Tuning
– Using EXPLAIN
– EXPLAIN EXTENDED
– Limit Tuning
– Optimized Joins
– Local Mode
– Parallel Execution
– Strict Mode
– Tuning the Number of Mappers and Reducers
– JVM Reuse
– Indexes
– Dynamic Partition Tuning
– Speculative Execution
– Single MapReduce MultiGROUP BY
– Virtual Columns
Chapter 11 Other File Formats and Compression
– Determining Installed Codecs
– Choosing a Compression Codec
– Enabling Intermediate Compression
– Final Output Compression
– Sequence Files
– Compression in Action
– Archive Partition
– Compression: Wrapping Up
Chapter 12 Developing
– Changing Log4J Properties
– Connecting a Java Debugger to Hive
– Building Hive from Source
– Setting Up Hive and Eclipse
– Hive in a Maven Project
– Unit Testing in Hive with hive_test
– The New Plugin Developer Kit
Chapter 13 Functions
– Discovering and Describing Functions
– Calling Functions
– Standard Functions
– Aggregate Functions
– Table Generating Functions
– A UDF for Finding a Zodiac Sign from a Day
– UDF Versus GenericUDF
– Permanent Functions
– User-Defined Aggregate Functions
– User-Defined Table Generating Functions
– Accessing the Distributed Cache from a UDF
– Annotations for Use with Functions
– Macros
Chapter 14 Streaming
– Identity Transformation
– Changing Types
– Projecting Transformation
– Manipulative Transformations
– Using the Distributed Cache
– Producing Multiple Rows from a Single Row
– Calculating Aggregates with Streaming
– CLUSTER BY, DISTRIBUTE BY, SORT BY
– GenericMR Tools for Streaming to Java
– Calculating Cogroups
Chapter 15 Customizing Hive File and Record Formats
– File Versus Record Formats
– Demystifying CREATE TABLE Statements
– File Formats
– Record Formats: SerDes
– CSV and TSV SerDes
– ObjectInspector
– Think Big Hive Reflection ObjectInspector
– XML UDF
– XPath-Related Functions
– JSON SerDe
– Avro Hive SerDe
– Binary Output
Chapter 16 Hive Thrift Service
– Starting the Thrift Server
– Setting Up Groovy to Connect to HiveService
– Connecting to HiveServer
– Getting Cluster Status
– Result Set Schema
– Fetching Results
– Retrieving Query Plan
– Metastore Methods
– Administrating HiveServer
– Hive ThriftMetastore
Chapter 18 Security
– Integration with Hadoop Security
– Authentication with Hive
– Authorization in Hive
Chapter 19 Locking
– Locking Support in Hive with Zookeeper
– Explicit, Exclusive Locks
Chapter 20 Hive Integration with Oozie
– Oozie Actions
– A Two-Query Workflow
– Oozie Web Console
– Variables in Workflows
– Capturing Output
– Capturing Output to Variables
Chapter 21 Hive and Amazon Web Services (AWS)
– Why Elastic MapReduce?
– Instances
– Before You Start
– Managing Your EMR Hive Cluster
– Thrift Server on EMR Hive
– Instance Groups on EMR
– Configuring Your EMR Cluster
– Persistence and the Metastore on EMR
– HDFS and S3 on EMR Cluster
– Putting Resources, Configs, and Bootstrap Scripts on S3
– Logs on S3
– Spot Instances
– Security Groups
– EMR Versus EC2 and Apache Hive
– Wrapping Up
Chapter 22 HCatalog
– Introduction
– MapReduce
– Command Line
– Security Model
– Architecture
Chapter 23 Case Studies
– m6d.com (Media6Degrees)
– Outbrain
– NASA’s Jet Propulsion Laboratory
– Photobucket
– SimpleReach
– Experiences and Needs from the Customer Trenches