What really is big data?
Most of us understand big data in terms of volume. The intuition is simple: if the data is too large to hold in memory, a different level of thinking is required to process it. For example, most of the sorting algorithms we are familiar with assume the data fits in a computer's main memory; if the data is larger than that, we need a different approach. Similarly, if the incoming data cannot be stored on the persistent storage attached to a single computer, we need to rethink how we will process it. This simplistic view, however, is not sufficient to classify data as big data. We need to consider the following three dimensions to decide whether a problem should be treated as a big data problem or not.
- Size -- Size is the most obvious property of data that forces us to think of it as big data.
- Type -- The type of data also matters. When web-based systems became commonplace, the logs generated by web servers became the first kind of data that could not be processed effectively with conventional algorithms, and this log data drove many of the techniques we see in big data systems today.
- Speed -- The rate at which data is generated is another aspect that helps us decide whether a problem needs to be treated as a big data problem. For example, if a system generates data faster than a single machine can even collect it, we need special ways of handling that data.
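The main-memory limitation mentioned above can be made concrete with an external merge sort: sort the data one memory-sized chunk at a time, spill each sorted chunk to disk, then stream-merge the chunks. The sketch below is illustrative, not a production implementation; it assumes the input is a text file with one integer per line, and `chunk_size` stands in for the amount of available memory.

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, chunk_size=1000):
    """Sort a file of integers (one per line) that may not fit in memory."""
    # Phase 1: read fixed-size chunks, sort each in memory,
    # and spill each sorted chunk to a temporary "run" file.
    runs = []
    with open(input_path) as src:
        while True:
            chunk = sorted(int(x) for x in itertools.islice(src, chunk_size))
            if not chunk:
                break
            run = tempfile.TemporaryFile("w+")
            run.write("\n".join(map(str, chunk)) + "\n")
            run.seek(0)
            runs.append(run)
    # Phase 2: k-way merge the sorted runs. heapq.merge streams
    # lazily, so only one line per run is held in memory at a time.
    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs, key=int))
    for run in runs:
        run.close()
```

With a real dataset the chunk size would be chosen from available RAM, and the merge phase itself may need to run in multiple passes when the number of runs is large; the two-phase structure is the essential idea.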
In today's world, sensor networks, social media platforms, video surveillance systems, and other imaging systems, such as medical imaging, generate huge amounts of data that fit the categories defined above and thus require special systems, architectures, and algorithms to process. The data generated today lies on a continuum from structured to unstructured, and handling it requires a multitude of methods.
Enterprises typically run two different types of systems to process the data they generate. Business Intelligence systems provide reports, dashboards, and information on demand, which is typically historical in nature. Data Science or Data Analytics systems are more predictive: they support process optimization, predictive modeling, forecasting, and so on.
Data scientists engaged in the craft perform a set of activities that go beyond the technological systems. They need to do one or more of the following.
- Understand Business Challenges and Questions
- Reframe these business challenges as analytical challenges
- Design Models and Techniques required
- Implement Models and Techniques
- Deploy the systems
- Develop insights
- Generate actionable recommendations
The skill set a data scientist needs is a combination of the following.
- A flair for data -- hard to define objectively, but if looking at numbers doesn't excite you, maybe you are not cut out to be a data scientist.
- Quantitative Skills -- Mathematics and Statistics
- Technical Skills -- These may range from using software such as SPSS, to programming in R, to writing complicated data pipelines.
- Critical Thinking -- It helps to have a skeptical mind if you want to be a data scientist.
- Communication -- Since a data scientist brings together a broad set of skills to solve problems that may not fall within one person's domain, it helps to get along with others and communicate clearly.