ITM 4273
Essay by Baljit Kaur • February 21, 2016 • Coursework
Baljit Kaur
Anmol Kharbanda
Khou Xiong
Parham Dehnadfar
Richard Gillman
ITM 4273-01
Balu Rajagopal
Group Article Review #1
1.
Differentiators | Hadoop | Data Warehouse | Remarks (Impact on Extraction of Real-time Business Insights) |
Data Repository | Raw Data | Aggregate and Refined Data | Because Hadoop stores raw data, insights are extracted later in response to on-demand questions, whereas a data warehouse is designed to answer predefined questions. |
Query Format | NoSQL | SQL | In a data warehouse, the optimizer examines incoming SQL and considers various plans for executing each query as fast as possible. More specifically, it compares the SQL request against extensive data statistics and the database design to identify the best combination of execution steps. Hadoop, on the other hand, does not natively use SQL. |
Database Technology | HBase | RDBMS | HBase is column-family oriented, unlike an RDBMS, which is row oriented. An RDBMS handles thousands of queries per second, whereas HBase could handle millions of queries per second. A data warehouse's maximum data size is in terabytes, while Hadoop scales to hundreds of petabytes. |
File-System | HDFS | | Hadoop scales out to large clusters of servers and storage, using HDFS to manage huge data sets and spread them across the servers. |
Tool | File Copy (Extract, Transform only) | ETL (Extract, Transform, Load) | Hadoop is not an ETL tool; it is a platform that supports running ETL processes in parallel. In data warehousing, by contrast, a dedicated ETL server becomes infeasible at big data volumes because all of the data must be moved to one storage area. |
Setup | Multiple machines | Single relational database (serves as the central store) | The Hadoop file system is designed to span multiple machines and can handle volumes of data that surpass the capacity of any single machine. |
Data | Raw data | Structured relational database | Hadoop uses HDFS, which is often deployed on cloud storage (the cloud is cheap and flexible). One can still do ETL and create a data warehouse using Hive. With Hadoop the raw data remains available, so you can also define new questions and run complex analyses over all the raw historical data. |
Managing and analyzing data | Uses Hive | ETL | The Hadoop toolset allows great flexibility and power of analysis: it performs big computations by splitting a task across large numbers of cheap commodity machines, letting you run much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse. |
Running Workloads | Fluctuating | Constant and Predictable | Hadoop in the cloud can spin virtual servers up or down on demand within minutes, providing the flexible scalability needed to handle fluctuating workloads. |
Finance / Cost | Inexpensive | Costly | Hadoop can be a very inexpensive alternative to a data warehouse. You can store your structured data across cheap computing and storage nodes instead of adding large servers. In Hadoop you can break up the data and let HDFS, the Hadoop Distributed File System, handle the 3x copies of each chunk of data. You can use Pig or Hive to run queries against the data much as you would against a data warehouse, while tools such as Ambari and Flume handle cluster management and data ingestion. |
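The "Data Repository" contrast in the table above (raw data answering on-demand questions versus refined data answering predefined ones) can be illustrated with a small sketch. This is plain Python, not Hadoop or warehouse code; the event records and the function names `build_warehouse_summary` and `ask_raw` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical raw clickstream-style events, as they might sit in Hadoop storage.
RAW_EVENTS = [
    {"user": "a", "page": "/home", "seconds": 12},
    {"user": "a", "page": "/loans", "seconds": 40},
    {"user": "b", "page": "/home", "seconds": 5},
    {"user": "b", "page": "/loans", "seconds": 65},
    {"user": "c", "page": "/home", "seconds": 9},
]


def build_warehouse_summary(events):
    """Warehouse style: aggregate up front to answer one predefined
    question (visits per page). Detail outside that question is dropped."""
    visits = defaultdict(int)
    for event in events:
        visits[event["page"]] += 1
    return dict(visits)


def ask_raw(events, predicate):
    """Hadoop style: keep every raw record and filter at question time."""
    return [event for event in events if predicate(event)]


# The predefined question is answered from the aggregate.
summary = build_warehouse_summary(RAW_EVENTS)

# A question defined later: which users spent over 30 seconds on /loans?
# The aggregate cannot answer this, but the raw records can.
long_loan_views = ask_raw(
    RAW_EVENTS, lambda e: e["page"] == "/loans" and e["seconds"] > 30
)
engaged_users = sorted({e["user"] for e in long_loan_views})
```

The summary is cheap to query but only knows page counts; the raw events cost more to store and scan, yet they can still answer a question nobody planned for when the data was collected.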
2.
Shades of Grey Areas | Hadoop | Data Warehouse | Remarks |
Provisional Data such as Clickstream Data | 9 - Hadoop has the advantage of flexibility and time to value, and it is not limited by governance committees or administrators. | 7 - Is limited to just the regional bank, but it does provide quick identification of overlapping consumers and comparisons of account quality to existing accounts. | Hadoop stores and refines the data, then loads some of the refined data into the data warehouse for further analysis. |
Sandbox Analysis (small samples versus all data) | 8 - Raw data in quantity, with no limitation on data set size. | 7 - Has clean, integrated data. | For both Hadoop and the data warehouse, the approach to data mining depends on how much data is used. |
In-Memory Data Processing | 8 - Enables fast data processing, avoids duplication of data, and eliminates unnecessary data movement. | 9 - Provides self-service analytics. | Because it provides self-service analytics, the data warehouse is the better choice. |
Complex Batch Analysis | 8 - Can process massive amounts of data. | 8 - Can process massive amounts of data; jobs run in minutes and are not invoked by business users. | Both Hadoop and the data warehouse can run complex batch jobs to process massive amounts of data. |
Interactive Analysis | 7 - Used when the program is highly complex and must run in parallel to achieve scalability. | 9 - SQL programming combined with a parallel data warehouse is probably the best choice. | If there is a requirement to run any language and any level of program complexity in parallel, Hadoop is favored. |
Prediction Analysis | 8 - Runs predictive analytics in parallel against enormous quantities of data. | 8 - Uses small samples of data compared to Hadoop, but the data is clean and integrated. | Hadoop is favored if the data is too big for a data warehouse to process. |
Recommendation Analysis | 8 - Ideal for pulling apart clickstreams from websites to find consumer preferences. | 8 - Includes components that detect customer activity and use a recommendation engine to persuade the consumer to stay. | Scrubbed data from Hadoop can be imported into the data warehouse to ensure data governance. |
Text Analysis | 10 - Hadoop is good at finding keywords and performing analysis. | 3 - Relational databases are not good at parsing text. | Hadoop maps the text and, after refinement, stores it in the data warehouse. |
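The "Text Analysis" row describes Hadoop mapping text to find keywords. A minimal single-machine sketch of that map-then-reduce pattern is shown here in plain Python; it imitates the shape of a MapReduce job but does not use Hadoop itself, and the sample documents and the names `map_phase` and `reduce_phase` are hypothetical.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical documents; in Hadoop these would be blocks of text spread across nodes.
DOCUMENTS = [
    "Hadoop stores raw data",
    "The data warehouse stores refined data",
    "Hadoop refines raw text",
]


def map_phase(text):
    """Map step: emit a (keyword, 1) pair for every word in one document."""
    return [(word, 1) for word in re.findall(r"[a-z]+", text.lower())]


def reduce_phase(totals, pair):
    """Reduce step: fold one (keyword, count) pair into the running totals."""
    word, count = pair
    totals[word] += count
    return totals


# Map every document independently, then reduce all pairs into keyword totals.
mapped = [pair for doc in DOCUMENTS for pair in map_phase(doc)]
keyword_counts = reduce(reduce_phase, mapped, Counter())
top_keywords = keyword_counts.most_common(2)
```

Because each document is mapped independently, the map step is the part a cluster could spread across many machines; the refined keyword totals are the kind of small, structured result that could then be loaded into a warehouse table.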
...
...