FACING THE DATABASE CONUNDRUM
SQL, HBASE, HIVE, OR SPARK: WITH THE SATURATION OF THE ENVIRONMENT, WHICH DO YOU CHOOSE?

SAMANTHA MOHR
UNIVERSITY OF MARYLAND UNIVERSITY COLLEGE
SPRING 2015

ABSTRACT

There is currently a conundrum facing experts in the field of Big Data: the need to perform large-scale data analysis set against the impracticality of using relational database processing languages to handle the information that is collected and processed. Specifically, the growth of data, and the sheer volume that must be stored in databases, processed by cloud analytics, and queried by applications, has led to growth in the data capacity that needs to be handled. Unfortunately, this exponential growth has exceeded the hardware and …
If you require support for dynamic columns, you should choose Cassandra. If you do batch analytic modeling on your data, Hadoop may be the choice for you. For live streaming analytic modeling, Apache Spark is a much better choice. If you want to work with your data as if it were in SQL, you should try Hive. This paper provides detailed knowledge of how choosing the correct database processing and query language lets you mitigate the processing capacity problems that come with the recent vast growth of data. It shows that while there may be no one-size-fits-all answer, there is a fit for the problem at hand based on the storage, processing, and query needs that are to be met.

INTRODUCTION

BACKGROUND

As a result of the appearance of big data in our world, conventional data warehousing and data analysis methods no longer have the processing power needed. What is Big Data, you may ask, and why is it such a big deal? NIST defines big data as any case where “[…] data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches […]” (Mell & Cooper, n.d.). 1 (Gong, 2012, p.15) Today’s analyst is inundated by an ever-growing amount of data created by social media, mobile phones, climate sensors, digital pictures, etc. The volume being generated is staggering (2.7 Zettabytes of data in the digital universe). While
“Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses.” (SAS, 1)
Since the 1960s, the need for efficient data management and retrieval has been a persistent issue, driven by growing demand in business and academia. To resolve it, a number of database models have been created. Relational databases allow data storage, retrieval, and manipulation using a standard Structured Query Language (SQL). Until now, relational databases were an optimal enterprise storage choice. However, as the amount of stored and analyzed data has grown, relational databases have displayed a variety of limitations, notably in scalability, storage, and query efficiency under large volumes of data [1] [2].
The invention of relational databases has brought a number of changes to the business world, especially for businesses whose prime focus is on their customers and on understanding customers' likes and dislikes to win more market share. There is no “one size fits all” in using this technology; it varies from industry to industry. What works for some businesses may not work for others, so it is advisable to shop around before investing in any of these technologies, because it is vital to find an industry-specific solution. One technique for narrowing the search is to find out what competitors are using to grow their customer base.
Abstract - Considered another disruptive technology revolution in IT alongside the Internet of Things and cloud computing, Big Data has the most valuable property: its value lies hidden in the vast stored data that needs to be analyzed, while cloud computing merely serves as a method or a step to save and store that data. Having drawn rapidly increasing attention in recent years, Big Data has an ever wider influence in many fields. With the explosive demand for Big Data analytics, Big Data has broadened its definition from an IT term for extremely large data sets to a set of new technologies.
This paper discusses and compares the market's top Database Management Systems (DBMS) currently available. It includes a table for side-by-side comparison of feature sets and other factors to weigh when deciding which DBMS to purchase and implement in a business. While this may not be a complete list of all available DBMSs, it includes important discussion of the aspects to consider when evaluating any major application or system choice.
Five years ago, few people had heard the phrase ‘Big Data.’ Today, it’s hard to go an hour without seeing it implemented practically in our daily life. The promise of a highly accurate data-driven decision-making tool is an attractive lure for any organization in any industry. However, big data is not without its own problems.
SQL is the best fit where we need 100% consistency of data, as in most financial problems, a level of consistency that cannot be achieved by a NoSQL database.
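To make the consistency point concrete, here is a minimal sketch of an all-or-nothing funds transfer in a relational database. It uses Python's built-in `sqlite3` module purely for illustration; the `accounts` table, the `transfer` function, and the balances are hypothetical and not drawn from any of the systems discussed.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction.

    Returns True if both updates were committed, False if the
    transaction was rolled back because a constraint was violated.
    """
    try:
        # `with conn` commits if the block succeeds and rolls back on error.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects an overdraft here, which undoes
            # the credit above as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

# Hypothetical in-memory ledger with a rule that balances never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()
```

Calling `transfer(conn, 1, 2, 30)` moves the money, while a transfer larger than the source balance leaves both accounts exactly as they were. No reader ever sees the credit without the matching debit, which is the guarantee financial workloads depend on.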
Big data allows companies to leverage high-volume data effectively rather than in isolation. It needs to be quickly accessible and ready for analysis. Data stores, or warehouses, are one way of managing data so that it is persistent, protected, and available for as long as it is needed. The forerunner of the data store is the relational database, and relational databases put in place decades ago are still in use today.
According to a report from International Business Machines Corporation (IBM), 90% of the data in the world has been generated in the last two years. Frank J. Ohlhorst (2013) explains that the concept of collecting data for use in business is not new, but the scale of data collected recently is so large that it has been termed Big Data (p. 1). Company executives who choose to ignore Big Data are denying their companies an advantage over their competitors. Big Data analysis is fundamental for all fields of work; it provides insight into large amounts of data that will answer questions and lead to discoveries that improve efficiency in all areas of the world.
The emergence of big data, generated by an increasing number of data sources, led to the evolution of many data-handling tools. Storing and analyzing vast amounts of structured and unstructured data is a big challenge. Traditional relational databases such as Oracle, DB2, HANA, MySQL, and SQL Server still handle structured data for enterprise applications like ERP, CRM, and financial systems. Most of these databases have added some level of in-memory features, with the exception of SAP HANA, which runs the entire database in memory so that users can gain insights into data faster.
Data has always been analyzed within companies and used to benefit the future of businesses. However, the way data is stored, combined, analyzed, and used to predict the patterns and tendencies of consumers has evolved as technology has seen numerous advancements throughout the past century. In the 1900s databases began as “computer hard disks,” and in 1965, after many other discoveries including voice recognition, “the US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.” The evolution of data into large databases continued in 1991, when the internet began to pop up and “digital storage became more cost effective than paper.” With the constant increase in data supplied digitally, Hadoop was created in 2005, and from that point forward “14.7 Exabytes of new information are produced this year,” a number that is rapidly increasing given the many mobile devices people in our society have today (Marr). The evolution of the internet, followed by the expansion in the number of mobile devices society has access to, led data to evolve, and companies now need large central database management systems in order to run an efficient and successful business.
We also studied and compared newly emerging NoSQL databases such as Cassandra, Accumulo, CouchDB, HBase, and MongoDB to find the best solution for organizations in accordance with their requirements.
Big data and big data analytics are terms used to describe data sets and analytical techniques in applications that are so large and complex that they require special technologies to analyze and visualize them.
The ever-widening realm of big data has created an expanding frontier of exploration for new methods of data analysis that produce actionable knowledge for the benefit of organizations everywhere. Companies amass enormous troves of data every day. Keeping this data housed in a fashion that maximizes storage efficiency, and in a format optimized for query and analysis, is paramount for effective data warehousing. Many database structures exist for the storage, arrangement, and accessing of data, but large databases and online analytical processing (OLAP) benefit from specific qualities. In these databases, compression and rapid querying are the main enabling qualities sought for analytical data stores and data warehouses. Columnar (or column-oriented) relational databases (RDBMS) offer these and other benefits, which is why they are a popular database scheme for analytical systems. Specifically, the vertical arrangement of records is optimal for selecting the sum, average, or a count of total record attributes because one horizontal read yields all values of an attribute. Otherwise, a physical disk must seek over and past unwanted attributes of the records to provide the same
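The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not how any particular columnar RDBMS is implemented: the sales records, the `to_columns` pivot, and the run-length encoder are all hypothetical, chosen to show why scanning one attribute and compressing repeated values are natural in a column layout.

```python
from array import array
from itertools import groupby

# Hypothetical sales records, stored record by record as a row store would.
rows = [
    {"id": 1, "region": "east", "amount": 120},
    {"id": 2, "region": "east", "amount": 80},
    {"id": 3, "region": "west", "amount": 200},
    {"id": 4, "region": "west", "amount": 50},
]

def to_columns(records):
    """Pivot row-oriented records into one sequence per attribute."""
    return {
        "id": array("q", (r["id"] for r in records)),
        "region": [r["region"] for r in records],
        "amount": array("q", (r["amount"] for r in records)),
    }

def sum_row_store(records, attr):
    """Aggregate by visiting every full record, wanted attributes or not."""
    return sum(r[attr] for r in records)

def sum_column_store(columns, attr):
    """Aggregate by scanning only the single column holding the attribute."""
    return sum(columns[attr])

def run_length_encode(values):
    """Compress consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)
print(sum_row_store(rows, "amount"))
print(sum_column_store(columns, "amount"))
print(run_length_encode(columns["region"]))
```

Both sums agree, but the column version reads one contiguous sequence of integers instead of stepping through every field of every record. The run-length encoding of the `region` column hints at why values of the same attribute stored side by side tend to compress well.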
Database research and associated standardization activities have successfully guided the development of database technology over the last four decades, and SQL relational databases remain the dominant database technology today. This effort to innovate relational databases to address the needs of new applications is continuing today. Recent examples of database innovation include the development of streaming SQL technology that is used to process rapidly flowing data (“data in flight”), minimizing latency in Web 2.0 applications, and database appliances that simplify DBMS deployment on cloud computing platforms. It is also evident from the above discussion that the relational