Hive vs HBase - Different Technologies That Work Better Together [Backup of Google Translation]



Source: https://www.dezyre.com/article/hive-vs-hbase-different-technologies-that-work-better-together/322


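HBase and Hive are two Hadoop-based big data technologies that serve different purposes. For example, when you log in to Facebook you see many things at once: your friend list, your news feed, friend suggestions, the people who liked your status, and so on. With 1.79 billion monthly active users on Facebook and profile pages loading at lightning speed, can you imagine a single big data technology such as Hadoop, Hive, or HBase doing all of this at the back end? These technologies work together to give every Facebook user a great experience. Big data systems are complex enough that each technology has to be used in combination with others.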


Apache Hive

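    Hive is a SQL engine on top of Hadoop, designed so that people who know SQL can run MapReduce jobs through SQL-like queries. Hive lets developers impose a logical relational schema on a variety of file formats and physical storage mechanisms inside or outside the Hadoop cluster. SQL queries are run against those schemas as Hadoop MapReduce jobs. Because its write capabilities and interactivity are limited, Hive is used for batch transformations and large analytical queries.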
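The idea of imposing a schema on raw files and running a SQL-like query as map and reduce phases can be sketched in a few lines. This is a minimal in-memory illustration, not Hive itself: the sample lines, the `SCHEMA` list, and every function name are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical raw lines as they might sit in a delimited text file on HDFS.
RAW_LINES = [
    "1,alice,books,12.50",
    "2,bob,games,30.00",
    "3,alice,games,7.25",
    "4,carol,books,4.00",
]

# Logical schema imposed at read time: (column name, type converter).
SCHEMA = [("order_id", int), ("customer", str), ("category", str), ("amount", float)]


def apply_schema(line):
    """Parse one raw line into a typed record using SCHEMA."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def map_phase(records, key_col, value_col):
    """Emit (key, value) pairs, like the map side of a GROUP BY."""
    for record in records:
        yield record[key_col], record[value_col]


def shuffle_phase(pairs):
    """Group values by key, like the shuffle between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group, like SUM(...) on the reduce side."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_category(lines):
    """Roughly: SELECT category, SUM(amount) FROM orders GROUP BY category."""
    records = (apply_schema(line) for line in lines)
    return reduce_phase(shuffle_phase(map_phase(records, "category", "amount")))


totals = total_by_category(RAW_LINES)
print(totals)
```

The raw lines never change; only the schema applied at read time gives them column names and types, which is why the same files can be queried without being loaded into a database first.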


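When to Use Hive

    RDBMS professionals like Apache Hive because they can simply map HDFS files to Hive tables and query the data. HBase tables can be mapped as well, so Hive can operate on that data too. Apache Hive should be used for data warehousing needs and when programmers do not want to write complex MapReduce code. However, not every problem can be solved with Apache Hive. For big data applications that need complex, fine-grained processing, Hadoop MapReduce is the better choice.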



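HBase – The NoSQL Hadoop Database
    Apache Hadoop does not provide random access, and this is where HBase, the Hadoop database, comes to the rescue. HBase is a highly scalable (it scales horizontally by adding off-the-shelf region servers), highly available, consistent, low-latency NoSQL database. With a flexible data model, cost effectiveness, and no manual sharding (sharding is automatic), HBase works well with sparse data. Before choosing HBase for your application, ask these questions –

    ▲Do you have sufficient hardware?
    ▲Does your application need additional features that an RDBMS does not provide?
    ▲Do you have enough data?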


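When to Use HBase

    Apache Hadoop is not a perfect big data framework for real-time analytics, and this is where HBase can be used, namely for querying data in real time. HBase is an ideal big data solution if the application needs random reads, random writes, or both. If an application needs real-time access to some of its data, that data can be stored in a NoSQL database. HBase has its own set of APIs for pulling and pushing data. HBase also integrates well with Hadoop MapReduce for bulk operations such as analytics and indexing. The best way to use HBase is to make Hadoop the repository for static data and HBase the store for data that will change in real time after some processing.

HBase should be used when –

    ▲There is a large amount of data.
    ▲ACID properties are not mandatory, only nice to have.
    ▲The data model schema is sparse.
    ▲The application needs to scale gracefully.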
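To make "sparse data" and "random read or write" concrete, here is a toy in-memory model of a wide-column table. It is only a sketch under assumptions: `SparseTable`, its methods, and the sample row keys are hypothetical and are not the HBase client API. The one HBase property it mirrors is that rows are addressed by row key and kept in row-key order.

```python
class SparseTable:
    """Toy in-memory stand-in for a wide-column table; not the HBase API."""

    def __init__(self):
        # row key -> {"family:qualifier" -> value}; absent cells use no space.
        self._rows = {}

    def put(self, row_key, column, value):
        """Random write: set one cell of one row."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Random read: fetch a whole row, or a single cell, by row key."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key in sorted key order."""
        return [
            (key, dict(self._rows[key]))
            for key in sorted(self._rows)
            if start_key <= key < stop_key
        ]


table = SparseTable()
table.put("user#001", "info:name", "Alice")
table.put("user#001", "info:city", "Taipei")
table.put("user#002", "info:name", "Bob")  # no city: the cell simply does not exist
table.put("user#002", "stats:likes", 42)

print(table.get("user#001"))
print(table.get("user#002", "info:city"))
```

Each row carries only the cells that were written, so two rows in the same table can have entirely different columns without any schema change.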


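Hive vs. HBase – Differences Between Hive and HBase

    ▲Hive is a query engine, whereas HBase is a data store, particularly for unstructured data.
    ▲Apache Hive is mainly used for batch processing (OLAP), whereas HBase is widely used for transactional processing where queries need highly interactive response times (OLTP).
    ▲Unlike Hive, operations in HBase run in real time against the database instead of being translated into MapReduce jobs.
    ▲HBase is for real-time querying; Hive is for analytical queries.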


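Hive and HBase – Better Together

    Hive suffers from high latency and HBase has no analytical capabilities, so integrating the two technologies is the best solution. People who work with big data often ask: "How do I use HBase from Hive? How well do Hive and HBase work together, and what is the best way to use them?"
    HBase and Hive are commonly used together on the same Hadoop cluster. Hive can serve as an ETL tool for batch inserts into HBase, or it can run queries that join data in HBase tables with data in HDFS files or external data stores.
    HiveQL queries can be written over HBase tables, so that HBase benefits from Hive's grammar and parser, query execution engine, query planner, and so on. Apache Hive ships an additional library for interacting with HBase, which implements the middle layer between Hive and HBase. To access HBase from Hive queries, the table is declared with a storage handler class called HBaseStorageHandler. An application can also interact with HBase tables directly through input and output formats, but the handler is easy to use and works well for most use cases. The interface between Hive and HBase is still maturing but has great potential. The main issue in integrating Hive with HBase is the impedance mismatch between HBase's sparse, untyped schema and Hive's dense, typed schema.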
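The impedance mismatch can be illustrated with a small sketch that projects sparse, byte-valued rows onto a dense, typed row shape. The mapping string follows the style of Hive's `hbase.columns.mapping` table property; everything else (the `HIVE_SCHEMA` list, `to_hive_row`, the sample rows) is a hypothetical illustration and not how the storage handler is implemented.

```python
# Column mapping in the style of Hive's "hbase.columns.mapping" table property.
COLUMNS_MAPPING = ":key,info:name,info:age"

# Hypothetical Hive-side schema: one (column name, type converter) per mapping entry.
HIVE_SCHEMA = [("user_id", str), ("name", str), ("age", int)]


def to_hive_row(row_key, cells):
    """Project one sparse HBase-style row (bytes -> bytes) onto the dense typed schema."""
    row = {}
    for source, (column, cast) in zip(COLUMNS_MAPPING.split(","), HIVE_SCHEMA):
        raw = row_key if source == ":key" else cells.get(source.encode())
        # A missing cell becomes NULL (None); present bytes are decoded and typed.
        row[column] = None if raw is None else cast(raw.decode("utf-8"))
    return row


hbase_rows = {
    b"u1": {b"info:name": b"Alice", b"info:age": b"31"},
    b"u2": {b"info:name": b"Bob"},  # sparse: no age cell stored
}

hive_rows = [to_hive_row(key, cells) for key, cells in sorted(hbase_rows.items())]
for row in hive_rows:
    print(row)
```

The dense side must produce a value for every declared column, so absent cells surface as NULL, and every value has to be decoded from raw bytes into the declared type. Those two conversions are where the mismatch shows up in practice.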

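2 thoughts on “Hive vs HBase - Different Technologies That Work Better Together [Backup of Google Translation]”

  1. I used to think that knowing only SQL syntax meant I could not keep hiding comfortably in the big data world.

    This article finally cleared things up for me: all I have to do is ask the boss to hire a real systems engineer to set Hadoop up properly,

    and I can keep using SQL for my day-to-day work. Hehe.

  2. Hive vs. HBase – Different Technologies That Work Better Together (backup of the original text)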

    HBase and Hive are two Hadoop-based big data technologies that serve different purposes. For instance, when you log in to Facebook, you see multiple things like your friend list, your news feed, friend suggestions, people who liked your statuses, etc. With 1.79 billion monthly active users on Facebook and the profile page loading at lightning-fast speed, can you think of a single big data technology like Hadoop or Hive or HBase doing all this at the back end? All these technologies work together to render an awesome experience for all Facebook users. The complexity of big data systems requires that every technology be used in conjunction with others.

    Consider the friend recommendations feature on Facebook: it does not change every second or minute, so recommendations can be pre-computed for all Facebook users. Pre-computing them requires high throughput, but high latency is acceptable. This is where Hadoop MapReduce or Hive is helpful. Your Facebook profile data or news feed, on the other hand, keeps changing, and there is a need for a NoSQL database faster than a traditional RDBMS. HBase plays that role. In this case, the analytical use case can be accomplished using Apache Hive, and the results of the analytics are stored in HBase for random access.
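The split between a batch pre-computation and a low-latency lookup can be sketched as follows. This is a toy illustration under assumptions: the friendship graph, the mutual-friend ranking rule, and all names are invented for the example, and a plain dictionary stands in for the random-access store.

```python
from collections import Counter

# Hypothetical friendship graph: user -> set of friends (symmetric).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
}


def precompute_recommendations(friends, top_n=3):
    """Batch step: rank friends-of-friends by number of mutual friends."""
    store = {}
    for user, direct in friends.items():
        mutual = Counter()
        for friend in direct:
            for candidate in friends[friend]:
                if candidate != user and candidate not in direct:
                    mutual[candidate] += 1
        # Sort by mutual-friend count (descending), then name, for a stable order.
        ranked = sorted(mutual, key=lambda c: (-mutual[c], c))
        store[user] = ranked[:top_n]
    return store


# The dict plays the role of the random-access store that serves lookups.
recommendation_store = precompute_recommendations(FRIENDS)


def lookup(user):
    """Serving step: a single keyed read, no recomputation."""
    return recommendation_store.get(user, [])


print(lookup("ann"))
```

The expensive pass over the whole graph runs once, offline, while each page load only performs one keyed read. That is the division of labour the paragraph describes between the batch engine and the random-access store.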

    Hive and HBase are both Hadoop-based technologies for working with unstructured data, but they are different kinds of systems. HBase is a NoSQL database used for real-time data access, whereas Hive is not really a database but a MapReduce-based SQL engine that runs on top of Hadoop. Strictly speaking, comparing Hive vs. HBase might not be right, because HBase is a database and Hive is a SQL engine for batch processing of big data. Instead of asking what the difference between Hive and HBase is, let's try to understand what Hive and HBase do, and when and how to use them together to build fault-tolerant big data applications.

    Apache Hive
    Hive is a SQL engine on top of Hadoop designed for SQL-savvy people to run MapReduce jobs through SQL-like queries. Hive allows developers to impose a logical relational schema on various file formats and physical storage mechanisms within or outside the Hadoop cluster. SQL-like queries are run against those schemas as Hadoop MapReduce jobs. With limited write capabilities and interactivity, Hive is meant for the execution of batch transformations and large analytical queries.

    When to use Hive
    RDBMS professionals love Apache Hive as they can simply map HDFS files to Hive tables and query the data. Even HBase tables can be mapped, and Hive can be used to operate on that data. Apache Hive should be used for data warehousing requirements and when programmers do not want to write complex MapReduce code. However, not all problems can be solved using Apache Hive. For big data applications that require complex and fine-grained processing, Hadoop MapReduce is the best choice.

    Companies Using Apache Hive – Hive Use Cases
    Apache Hive has approximately 0.3% of the market share, i.e. 1902 companies are already using Apache Hive in production.

    ▲Scribd uses Hive for ad-hoc querying, data mining, and user-facing analytics.
    ▲Hive is an integral part of the Hadoop pipeline at Hubspot for near real-time web analytics.
    ▲Chitika, the popular online advertising network, uses Hive for data mining and analysis of its 435 million global user base.

    HBase – The NoSQL Hadoop Database
    Apache Hadoop does not provide random access capabilities, and this is when the Hadoop database HBase comes to the rescue. HBase is a highly scalable (it scales horizontally using off-the-shelf region servers), highly available, consistent, and low-latency NoSQL database. With flexible data models, cost effectiveness, and no manual sharding (sharding is automatic), HBase works well with sparse data. Before choosing HBase for your applications, do ask these questions –

    ▲Do you have sufficient hardware?
    ▲Does your application require additional features that an RDBMS does not provide?
    ▲Do you have enough data?

    When to use HBase
    Apache Hadoop is not a perfect big data framework for real-time analytics, and this is when HBase can be used, i.e. for real-time querying of data. HBase is an ideal big data solution if the application requires random read or random write operations or both. If the application needs to access some data in real time, that data can be stored in a NoSQL database. HBase has its own set of APIs that can be used to pull or push data. HBase can also be integrated with Hadoop MapReduce for bulk operations like analytics, indexing, etc. The best way to use HBase is to make Hadoop the repository for static data and HBase the data store for data that is going to change in real time after some processing.

    HBase should be used when –

    ▲There is a large amount of data.
    ▲ACID properties are not mandatory, only nice to have.
    ▲The data model schema is sparse.
    ▲Your application needs to scale gracefully.

    Companies Using HBase – HBase Use Cases
    In the big data category, HBase has a market share of about 9.1%, i.e. approximately 6190 companies use HBase. Companies use HBase for time series analysis or for click stream data storage and analysis.

    ▲HBase is modeled on Google's Bigtable, which Google built to store massive datasets for the web and its users.
    ▲Facebook uses HBase for real-time analytics, counting Facebook likes, and messaging.
    ▲FINRA (Financial Industry Regulatory Authority) uses HBase to store all the trading graphs.
    ▲Pinterest uses HBase to store the graph data.
    ▲Flipboard uses HBase to personalize the content feed for its users.

    Hive vs. HBase – Difference between Hive and HBase
    ▲Hive is a query engine, whereas HBase is a data store, particularly for unstructured data.
    ▲Apache Hive is mainly used for batch processing, i.e. OLAP, but HBase is extensively used for transactional processing wherein the response time of the query must be highly interactive, i.e. OLTP.
    ▲Unlike Hive, operations in HBase run in real time on the database instead of being transformed into MapReduce jobs.
    ▲HBase is for real-time querying and Hive is for analytical queries.

    Hive and HBase – Better Together

    Hive has the limitation of high latency and HBase does not have analytical capabilities, so integrating the two technologies is the best solution. Often, people working with big data have these questions in mind: "How to use HBase from Hive? How well does using Hive and HBase together work, and what is the best way to use them?"

    Commonly HBase and Hive are used together on the same Hadoop cluster. Hive can be used as an ETL tool for batch inserts into HBase or to execute queries that join data present in HBase tables with the data present in HDFS files or in external data stores.

    It is possible to write HiveQL queries over HBase tables so that HBase can make the best use of Hive's grammar and parser, query execution engine, query planner, etc. Apache Hive has an additional library for interacting with HBase, in which the middle layer between Hive and HBase is implemented. When accessing HBase from Hive queries, the table is declared with a storage handler class called HBaseStorageHandler. The application can also interact with HBase tables directly through input and output formats, but the handler is easy to use and works well for most use cases. The interface between Hive and HBase is still in its maturing phase but has great potential. The main issue in integrating Hive with HBase is the impedance mismatch between HBase's sparse, untyped schema and Hive's dense, typed schema.
