Hive, Impala and Spark SQL all fit into the SQL-on-Hadoop category. Apache Hive and Spark are both top level Apache projects. Impala is developed by Cloudera and shipped by Cloudera, MapR, Oracle and Amazon. Spark SQL is part of the Spark project and is mainly supported by the company Databricks. Since July 1st 2014, it was announced that development on Shark (also known as Hive on Spark) were ending and focus would be put on Spark SQL. For those familiar with Shark, Spark SQL gives the similar features as Shark, and more. Hive, Impala and Spark SQL are all available in YARN .
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.
Built on top of Apache Hadoop, it provides:
- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- Query execution via MapReduce
- Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
- QL can also be extended with custom scalar functions (UDF’s), aggregations (UDAF’s), and table functions (UDTF’s).
- Indexing to provide acceleration, index type including compaction and Bitmap index as of 0.10.
- Different storage types such as plain text, RCFile, HBase, ORC, and others.
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
- Operating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE, BWT, snappy, etc.
- Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce, or Spark jobs.
mpala was the first to bring SQL querying to the public in April 2013. Impala comes with a bunch of interesting features:
- it can query many file format such as Parquet, Avro, Text, RCFile, SequenceFile
- it supports data stored in HDFS, Apache HBase and Amazon S3
- it supports multiple compression codecs: Snappy (Recommended for its effective balance between compression ratio and decompression speed), Gzip (Recommended when achieving the highest level of compression), Deflate (not supported for text files), Bzip2, LZO (for text files only);
- it provides security through authorization based on Sentry (OS user ID), defining which users are allowed to access which resources, and what operations are they allowed to perform authentication based on Kerberos + ability to specify Active Directory username/password, how does Impala verify the identity of the users to confirm that they are allowed exercise their privileges assigned to that user auditing, what operations were attempted, and did they succeed or not, allowing to track down suspicious activity; the audit data are collected by Cloudera Manager;
- it supports SSL network encryption between Impala and client programs, and between the Impala-related daemons running on different nodes in the cluster;
- it allows to use UDFs and UDAFs;
- it orders the joins automatically to be the most efficient;
- it allows admission control – prioritization and queueing of queries within impala;
- it allows multi-user concurrent queries;
- it caches frequently accessed data in memory;
- it computes statistics (with COMPUTE STATS);
- it provides window functions (aggregation OVER PARTITION, RANK, LEAD, LAG, NTILE, and so on) – to provide more advanced SQL analytic capabilities (since version 2.0);
- it allows external joins and aggregation using disk (since version 2.0) – enables operations to spill to disk if their internal state exceeds the aggregate memory size;
- it allows subqueries inside WHERE clauses;
- it allows incremental statistics – only run statistics on the new or changed data for even faster statistics computations;
- it enables queries on complex nested structures including maps, structs and arrays;
- it enables merging (MERGE) in updates into existing tables;
- it enables some OLAP functions (ROLLUP, CUBE, GROUPING SET);
- it allows use of impala for inserts and updates into HBase.
Spark SQL has been announced in March 2014. It officially replaces Shark, which has limited integration with Spark programs. “Spark SQL conveniently blurs the lines between RDDs and relational tables.” In addition to be part of the Spark platform allowing compatibility with the other Spark libraries (MLlib, GraphX, Spark streaming), Spark SQL shows multiple interesting features:
- it supports multiple file formats such as Parquet, Avro, Text, JSON, ORC;
- it supports data stored in HDFS, Apache HBase (see here, showing better performance than Phoenix) and Amazon S3;
- it supports classical Hadoop codecs such as snappy, lzo, gzip;
- it provides security through authentification via the use of a “shared secret” (spark.authenticate=true on YARN, or spark.authenticate.secret on all nodes if not YARN);
- encryption, Spark supports SSL for Akka and HTTP protocols;
- keeping event logs;
- it supports UDFs.
- it supports concurrent queries and manages the allocation of memory to the jobs (it is possible to specify the storage of RDD like in-memory only, disk only or memory and disk;
- it supports caching data in memory using a SchemaRDD columnar format (cacheTable(““))exposing ByteBuffer, it can also use memory-only caching exposing User object;
- it supports nested structures;
Impala is your best choice for interactive BI-like workloads, because Impala queries have proven to have the lowest latency across all other options — especially under concurrent
Hive is still a great choice when low latency/multiuser support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI
Spark SQL, lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi hour queries using the Spark engine, which provides full mid-query fault tolerance. The performance is biggest advantage of Spark SQL.