Would Hive or Impala be a better choice for a nightly ETL process on large amounts of data?
Hive is generally the better choice here. It is slower than Impala, but it is a strong option for heavy ETL tasks where reliability is vital, for instance hourly log aggregation at an advertising organization. Impala is an open source SQL engine that is better suited to fast, interactive queries over large volumes of data.
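As a rough illustration of what such a nightly Hive job might look like, the Python sketch below builds a HiveQL statement that rebuilds one day's partition of an aggregate table. The table and column names (`raw_logs`, `daily_clicks`, `campaign_id`, `dt`) and the function name are hypothetical, chosen only for this example; the source does not describe any particular schema.

```python
from datetime import date, timedelta


def nightly_aggregation_hql(run_date: date) -> str:
    """Build a HiveQL statement that rebuilds the previous day's partition.

    All table and column names here are hypothetical placeholders.
    """
    # A nightly job usually processes the day that has just ended.
    dt = (run_date - timedelta(days=1)).isoformat()
    return (
        "INSERT OVERWRITE TABLE daily_clicks "
        f"PARTITION (dt='{dt}') "
        "SELECT campaign_id, COUNT(*) AS clicks "
        "FROM raw_logs "
        f"WHERE dt='{dt}' "
        "GROUP BY campaign_id"
    )


if __name__ == "__main__":
    # Print the statement for a job that runs on 2 January 2024.
    print(nightly_aggregation_hql(date(2024, 1, 2)))
```

Because `INSERT OVERWRITE` replaces the contents of the target partition, rerunning the job after a failure should not duplicate rows, which suits the reliability requirement described above. The generated statement could then be submitted through whatever Hive client the cluster uses, for example Beeline.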
What are two benefits that Hive, Impala, and Hadoop have over typical data warehousing systems?
Hive is better able to handle longer-running, more complex queries on much larger datasets. Because Impala is not built on MapReduce, its latency is lower, which lets it run queries faster than Hive.