Big Data

BD – 118

• IBM Definition – Volume, Variety and Velocity
• Oracle Definition – Volume, Variety, Velocity and Value
• Rule of thumb: Small Data is data that fits into the RAM of a single machine. Big Data is data that does not.
• Units in increasing size (each 1,000 times the previous): Byte < Kilobyte < Megabyte < Gigabyte < Terabyte < Petabyte < Exabyte < Zettabyte < Yottabyte
• Big Data is a concept…not a technology or software or tool…
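
The unit ladder and the RAM rule of thumb above can be sketched in a few lines. This is only an illustration: the function names `human_readable` and `fits_in_ram` are made up for these notes, and decimal units (factor 1,000) are assumed.

```python
# Decimal units, smallest to largest (assumption: factor 1,000 per step).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

def fits_in_ram(dataset_bytes: int, ram_bytes: int) -> bool:
    """Rule of thumb from the notes: 'small' data fits into RAM, 'big' data does not."""
    return dataset_bytes <= ram_bytes

if __name__ == "__main__":
    print(human_readable(10**15))               # one petabyte
    print(fits_in_ram(2 * 10**12, 16 * 10**9))  # 2 TB data set vs 16 GB of RAM
```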

Map Reduce
• Map: The Map function is applied to each piece of the input in parallel across a large number of systems and emits intermediate key/value pairs. The framework handles the placement of the tasks in a way that balances the load, and manages recovery from failures.
• Reduce: Once the distributed computation is completed, another function called "Reduce" aggregates the intermediate values that share a key to provide the final result.
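
A minimal single-machine sketch of the Map and Reduce steps, using the classic word-count example. It is not a distributed implementation: the `shuffle` step stands in for the grouping a real framework performs between the two phases, and all function names here are illustrative.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return (key, sum(values))

def word_count(documents):
    intermediate = []
    for doc in documents:  # on a cluster, each document would be a separate map task
        intermediate.extend(map_phase(doc))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data big", "data tools"]))
```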

Big Table
• Developed by Google.
• A distributed storage system intended to manage highly scalable structured data.
• Data is organized into tables with rows and columns. A table can keep growing in both rows and columns as data is added.
• A sparse, distributed, persistent, multidimensional sorted map, indexed by row key, column key and timestamp.
• Intended to store huge volumes of data across commodity servers.
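
The "sorted map" idea can be illustrated with a toy in-memory model. This sketch is neither distributed nor persistent and is not Google's API. The class name `TinySortedTable` and its methods are hypothetical. It assumes cells are keyed by (row, column, timestamp) and kept in sorted order, and that a missing cell simply takes no space (sparse).

```python
import bisect

class TinySortedTable:
    """Toy model of a sparse, sorted, multidimensional map (illustration only)."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp): newest version first
        self._cells = {}  # key -> value; absent cells are not stored at all

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the most recent value stored for (row, column), or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1

if __name__ == "__main__":
    table = TinySortedTable()
    table.put("com.example/a", "contents:html", 1, "v1")
    table.put("com.example/a", "contents:html", 2, "v2")
    print(table.get("com.example/a", "contents:html"))
```

Because the keys are kept sorted, a range of neighbouring rows can be read with one sequential scan, which is the main reason for using a sorted map rather than a hash map.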

Hadoop
• Hadoop is an Apache-managed software framework inspired by Google's MapReduce and Google File System (GFS) papers. Its companion project HBase is modeled on Big Table.
• Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware.
• Two major components:
i) Hadoop Distributed File System (HDFS), which can support petabytes of data.
ii) Map Reduce Engine (MRE), which computes results in batch.
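
A rough back-of-the-envelope sketch of how a distributed file system spreads one large file over commodity machines. The block size (128 MiB) and replication factor (3) below are assumptions chosen for illustration; real clusters configure both. The function name `plan_storage` is made up for these notes.

```python
BLOCK_SIZE = 128 * 1024**2  # assumed block size in bytes (128 MiB)
REPLICATION = 3             # assumed number of copies kept of each block

def plan_storage(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Estimate how many blocks and how much raw disk one file needs."""
    blocks = -(-file_bytes // block_size)  # ceiling division without floats
    return {
        "blocks": blocks,                         # pieces the file is split into
        "block_replicas": blocks * replication,   # block copies spread over the cluster
        "raw_bytes": file_bytes * replication,    # total disk consumed
    }

if __name__ == "__main__":
    print(plan_storage(1024**3))  # a 1 GiB file
```

Each block can be processed by a separate map task, which is how the file system and the batch engine work together.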

Stages in Big Data
1) Acquisition
2) Marshalling
3) Analysis
4) Action

1) Acquisition: The process of sampling signals, measuring real-world physical conditions and converting the resulting samples into digital numeric values that a computer can manipulate.
2) Marshalling: The process of gathering data and transforming it into a standard format before it is transmitted over a network. Data pieces are collected in a message buffer before they are marshalled. Data marshalling is required, for example, when passing the output parameters of a program written in one language as input to a program written in another language.
3) Analysis: The process of breaking a complex whole into smaller parts in order to gain a better understanding of it.
4) Action: The final phase in Big Data is implementation, which finalizes the presentation of the data to the end user.
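
The four stages can be sketched end to end on a toy signal. Everything here is an assumption made for illustration: a sine wave stands in for a real sensor, JSON is used as the "standard format", and the function names, the `"sensor": "demo"` label and the alert threshold are hypothetical.

```python
import json
import math
import statistics

def acquire(num_samples=8, levels=256):
    """Acquisition: sample a sine wave and quantize each sample to an integer."""
    samples = []
    for n in range(num_samples):
        analog = math.sin(2 * math.pi * n / num_samples)  # value in [-1, 1]
        digital = round((analog + 1) / 2 * (levels - 1))  # integer in [0, levels - 1]
        samples.append(digital)
    return samples

def marshal(samples):
    """Marshalling: pack the data into a standard, language-neutral format."""
    return json.dumps({"sensor": "demo", "samples": samples}).encode("utf-8")

def unmarshal(payload):
    """The receiving program (possibly in another language) reverses the step."""
    return json.loads(payload.decode("utf-8"))

def analyze(record):
    """Analysis: reduce the raw samples to a few summary figures."""
    s = record["samples"]
    return {"min": min(s), "max": max(s), "mean": statistics.mean(s)}

def act(summary, threshold=200):
    """Action: turn the analysis into something presented to the end user."""
    return "ALERT" if summary["max"] > threshold else "OK"

if __name__ == "__main__":
    payload = marshal(acquire())
    print(act(analyze(unmarshal(payload))))
```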