The keyword in Data Science is Science, not the data.
Digital data is of three types:
1) Structured data: Data which is organized and structured in the form of rows and columns and adheres to a pre-defined schema.
Ex: RDBMS, MS Excel, Data in data warehouse
2) Semi-structured data: Data that does not conform to a rigid data model but still has its own structure.
Ex: Email, XML, HTML
3) Unstructured data: Data that does not follow any data model. Around 80% of today's data is estimated to be unstructured.
Ex: Satellite images, Scientific data, Social Media data
Some interesting facts about today's data:
→ Most of the data in existence was created in the last two years alone
→ Google handles around 40,000 searches per second
→ At present, less than 0.5% of data is analyzed and used
→ Methodical alignment of data plays a crucial role
Components of Data Science:
→ Probability & Statistics
→ Linear Algebra
→ Machine Learning
→ Computer Science
→ Probability: (Possibility or likelihood) Probability is the study of random events. An event may or may not occur, and its probability is expressed as a number between 0 and 1. An event with probability 1 is certain, whereas an event with probability 0 is impossible. The higher the probability of an event, the more likely it is to occur. Ex: The chance of heads coming up when a coin is tossed 10 times, or of picking a black ball out of a bag containing 5 black balls and 5 white balls.
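A minimal sketch of the two examples above, assuming a fair coin and a single random draw from the bag. The function names prob_black and prob_heads are illustrative only and do not come from any library.

```python
from fractions import Fraction
from math import comb

def prob_black(black: int, white: int) -> Fraction:
    """Probability of drawing a black ball in one random draw."""
    return Fraction(black, black + white)

def prob_heads(k: int, tosses: int) -> Fraction:
    """Probability of exactly k heads in `tosses` flips of a fair coin."""
    return Fraction(comb(tosses, k), 2 ** tosses)

p_black = prob_black(5, 5)          # 5 black balls, 5 white balls
p_five_heads = prob_heads(5, 10)    # exactly 5 heads in 10 tosses
total = sum(prob_heads(k, 10) for k in range(11))

print(p_black, p_five_heads, total)  # 1/2 63/256 1
```

Every result lies between 0 and 1, and the probabilities of all possible head counts add up to 1, which matches the definition given above.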
→ Statistics: Statistics is the branch of mathematics concerned with the collection, classification, analysis and interpretation of numerical facts, in order to draw inferences on the basis of their quantifiable likelihood. Statistical data tends to behave in a regular and predictable manner. Ex: Finding the average height of class 10 students, or the mean daily temperature for the month of January.
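The two examples above can be sketched with Python's built-in statistics module. The heights and temperatures below are made-up values used only for illustration, not real measurements.

```python
import statistics

# Hypothetical sample data (assumed values, not real measurements)
heights_cm = [150, 152, 148, 155, 160]        # five class 10 students
january_temps_c = [12.0, 14.0, 11.0, 13.0]    # four days only, for brevity

mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mean_temp = statistics.mean(january_temps_c)

print(mean_height, median_height, mean_temp)  # 153 152 12.5
```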
→ Linear Algebra: The branch of mathematics that deals with the theory of systems of linear equations, matrices, vector spaces and linear transformations. We use this component to solve problems related to linear models. Many complex scientific problems are converted into problems on vectors and matrices and then solved with linear models. For such problems, a direct linear algebra solution can perform better than an iterative method.
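As a small illustration of solving a linear model directly rather than iteratively, the sketch below fits y = slope * x + intercept with the closed-form least-squares formulas. The data points and the function name fit_line are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Solution of the 2x2 normal equations for a straight line
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The answer comes out in a single pass over the data. An iterative method such as gradient descent would approach the same values over many update steps.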
→ Machine Learning: Machine learning is used for tasks that are hard to describe with explicit rules, such as image recognition, speech processing and biometrics. It gives computers the ability to learn without being explicitly programmed.
Types of learning:
1. Supervised learning: This is commonly used in applications where historical data is used to predict likely future events. First, we train the machine with known (labeled) data so that it learns from it. Second, the machine is exposed to unknown data and is required to classify it based on the knowledge acquired in the first step. Finally, the model is evaluated on how accurately it has classified the unknown data. Ex: Handwriting recognition, pattern recognition, house price prediction
Supervised learning problems are of two types:
a. Classification problem: We try to predict results as discrete outputs, that is, we map input variables to separate categories.
b. Regression problem: We try to predict results within a continuous output, that is, we map input variables to some continuous function.
2. Unsupervised learning: This is commonly used in applications where the data has no labels. Clustering is its most common form. Ex: Grouping fruits by color. First, we fix color as the parameter by which the machine will arrange the given data. Second, the machine groups the unknown data based on that parameter (color). Google News and web page grouping are well-known examples.
3. Semi-supervised learning: This falls between supervised learning and unsupervised learning. It is a technique that learns from a combination of labeled and unlabeled data. Ex: Text processing, video indexing, bioinformatics
4. Reinforcement learning: This is an approach in which the learner discovers, through trial and error, which actions yield the greatest reward. Mostly used in gaming, navigation and robotics.
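A rough sketch of the first two learning types on toy data. The supervised part follows the three steps described above (train on labeled data, predict on unseen data, evaluate accuracy) using a 1-nearest-neighbour rule, and the unsupervised part groups unlabeled values into two clusters. The fruit weights, the labels and the function names are all made up for illustration, and these simple methods stand in for whatever algorithm a real project would choose.

```python
# Hypothetical labeled data: (weight in grams, label). Values are assumed.
train = [(110, "apple"), (120, "apple"), (130, "apple"),
         (300, "melon"), (320, "melon"), (340, "melon")]
test = [(115, "apple"), (310, "melon"), (140, "apple"), (290, "melon")]

def predict(weight, training_data):
    """Supervised: return the label of the closest training example."""
    nearest = min(training_data, key=lambda item: abs(item[0] - weight))
    return nearest[1]

def accuracy(test_data, training_data):
    """Fraction of test examples whose predicted label is correct."""
    correct = sum(predict(w, training_data) == label
                  for w, label in test_data)
    return correct / len(test_data)

def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters (k-means, k=2).

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)   # initial cluster centres
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

print(accuracy(test, train))                # 1.0
print(two_means([w for w, _ in train]))     # ([110, 120, 130], [300, 320, 340])
```

Note that the clustering step never sees the labels: it recovers the two groups from the weights alone, while the classifier needs the labeled examples to name them.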
→ Data Science life cycle:
Define the goal (identify the problem) >> Collect and manage data (identify the information needed) >> Build a model (find patterns in the data that lead to a solution) >> Evaluate the model (validate whether the model solves the problem) >> Present results and document (establish how the problem can be solved) >> Deploy the model
→ Famous Machine Learning Algorithms: