
ITM 4273

Essay by   •  February 21, 2016  •  Coursework  •  1,953 Words (8 Pages)  •  927 Views


Baljit Kaur

Anmol Kharbanda

Khou Xiong

Parham Dehnadfar

Richard Gillman                                                   

ITM 4273-01

Balu Rajagopal

Group Article Review #1

                                        

1.

Columns: Differentiators | Hadoop | Data Warehouse | Remarks (Impact on Extraction of Real-time Business Insights)

Data Repository
Hadoop: Raw data
Data Warehouse: Aggregated and refined data
Remarks: Because Hadoop stores raw data, insights can be extracted later, on demand, as new questions arise. A data warehouse, by contrast, is designed to answer predefined questions.

Query Format
Hadoop: NoSQL
Data Warehouse: SQL
Remarks: In a data warehouse, the query optimizer examines incoming SQL and considers various plans for executing each query as fast as possible. It does this by comparing the SQL request against extensive data statistics and the database design to identify the best combination of execution steps. Core Hadoop, on the other hand, has no built-in SQL engine; SQL-like access comes from add-on tools such as Hive.

Database Technology
Hadoop: HBase
Data Warehouse: RDBMS
Remarks: HBase is column-family oriented, unlike an RDBMS, which is row oriented. An RDBMS handles thousands of queries per second, whereas HBase can handle millions of queries per second. The maximum data size for a data warehouse is measured in terabytes, while Hadoop scales to hundreds of petabytes.

File System
Hadoop: HDFS
Remarks: Hadoop scales out to large clusters of servers and storage, using HDFS to manage huge data sets and spread them across the servers.

Tool
Hadoop: File Copy (Extract, Transform only)
Data Warehouse: ETL (Extract, Transform, Load)
Remarks: Hadoop is not an ETL tool. It is a platform that supports running ETL processes in parallel. In data warehousing, by contrast, a single ETL server becomes infeasible at big-data volumes, because all of the data must be moved to one storage area.

                        

Setup
Hadoop: Multiple machines
Data Warehouse: Single relational database (serves as the central store)
Remarks: Hadoop file systems are designed to span multiple machines and can handle volumes of data that exceed the capacity of any single machine.

Data
Hadoop: Raw data
Data Warehouse: Structured relational database
Remarks: Hadoop uses HDFS, which is often deployed on cloud storage (cloud is cheap and flexible). One can still do ETL and create a data warehouse using Hive. With Hadoop, the raw data remains available, so you can also define new questions and run complex analyses over all the raw historical data.

Managing and Analyzing Data
Hadoop: Uses Hive
Data Warehouse: ETL
Remarks: The Hadoop toolset allows great flexibility and power of analysis. It performs large computations by splitting a task across large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.

Running Workloads
Hadoop: Fluctuating
Data Warehouse: Constant and predictable
Remarks: Hadoop in the cloud can spin virtual servers up or down on demand within minutes, which provides the flexible scalability needed to handle fluctuating workloads.

Finance / Cost
Hadoop: Inexpensive
Data Warehouse: Costly
Remarks: Hadoop can be a very inexpensive alternative to a data warehouse. You can store your structured data across cheap computing/storage nodes as opposed to adding large servers. In Hadoop, you can break up the data and let HDFS, the Hadoop Distributed File System, handle the 3x copies of each chunk of data. You can use tools such as Pig or Hive to run queries against the data much as you would against a data warehouse. Ambari and Flume, also part of the Hadoop ecosystem, handle cluster management and data ingestion rather than querying.
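
The storage arithmetic behind the "3x copies" remark can be sketched as follows. This is only a rough illustration, not a description of any particular cluster: the 128 MB block size is an assumed value (a common HDFS default), the replication factor of 3 is taken from the remark above, and the helper name `hdfs_footprint` is hypothetical.

```python
import math

# Assumed values for illustration only.
BLOCK_SIZE_MB = 128      # assumption: a common HDFS default block size
REPLICATION_FACTOR = 3   # the "3x copies" mentioned in the remarks


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper: HDFS splits a file into fixed-size blocks and
    keeps `replication` copies of each block on different nodes.
    """
    if file_size_mb <= 0:
        return 0, 0, 0
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds, so raw
    # usage is the logical size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000)
    print(f"1000 MB file: {blocks} blocks, {replicas} block replicas, {raw_mb} MB raw")
```

The point of the sketch is the cost trade-off: every logical megabyte consumes three megabytes of raw disk, which is acceptable only because the disks sit in cheap commodity nodes.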

                        

2.

Columns: Shades of Grey Areas | Hadoop | Data Warehouse | Remarks

Provisional Data such as Clickstream Data
Hadoop: 9 - Has the advantage of flexibility and time to value, and it is not limited by governance committees or administrators.
Data Warehouse: 7 - Limited to just a regional bank, but it does provide quick identification of overlapping consumers and comparisons of account quality to existing accounts.
Remarks: Hadoop stores and refines the data, then loads some of the refined data into the data warehouse for further analysis.

Sandbox Analysis (small samples versus all data)
Hadoop: 8 - Raw data in quantity, with no limitation on data set size.
Data Warehouse: 7 - Has clean, integrated data.
Remarks: For both Hadoop and the data warehouse, suitability for data mining depends on how much data is used.

In-Memory Data Processing
Hadoop: 8 - Enables fast data processing, avoids duplication of data, and eliminates unnecessary data movement.
Data Warehouse: 9 - Provides self-service analytics.
Remarks: Because it provides self-service analytics, the data warehouse is the better choice.

Complex Batch Analysis
Hadoop: 8 - Can process massive amounts of data.
Data Warehouse: 8 - Can process massive amounts of data; jobs run in minutes and are not invoked by business users.
Remarks: Both Hadoop and the data warehouse can rely on complex batch jobs to process massive amounts of data.

Interactive Analysis
Hadoop: 7 - Used when the program is highly complex and must run in parallel to achieve scalability.
Data Warehouse: 9 - SQL programming combined with a parallel data warehouse is probably the best choice.
Remarks: If there is a requirement to run any language and any level of program complexity in parallel, Hadoop is favored.

Prediction Analysis
Hadoop: 8 - Runs predictive analytics in parallel against enormous quantities of data.
Data Warehouse: 8 - Works on small samples of data compared to Hadoop, but the data is clean and integrated.
Remarks: Hadoop is favored if the data is too big for a data warehouse to process.

Recommendation Analysis
Hadoop: 8 - Ideal for pulling apart clickstreams from websites to find consumer preferences.
Data Warehouse: 8 - Includes components that detect customer activity and use a recommendation engine to persuade the consumer to stay.
Remarks: Scrubbed data from Hadoop can be imported into the data warehouse to ensure data governance.

Text Analysis
Hadoop: 10 - Good at finding keywords and performing analysis.
Data Warehouse: 3 - Relational databases are not good at parsing text.
Remarks: Hadoop maps the text and, after refinement, stores it in the data warehouse.
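
The keyword-finding pattern described in the Text Analysis row can be illustrated with a small map, shuffle, and reduce pipeline. This is a plain-Python stand-in for the idea, not Hadoop's actual API: it runs on one machine, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
import re


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def keyword_counts(documents, keywords):
    """Count how often each keyword appears across all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    totals = reduce_phase(shuffle(pairs))
    return {kw: totals.get(kw, 0) for kw in keywords}


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = ["Hadoop stores raw data", "The warehouse stores refined data"]
    print(keyword_counts(docs, ["data", "stores", "hadoop"]))
```

On a real cluster the map step would run in parallel on many nodes, each over its own chunk of text, and only the small refined result (the keyword totals) would need to be loaded into the data warehouse.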

                                

...

...
