Wednesday 23 January 2019

Big Data and Analytics

What really is big data?

Most of us understand big data in terms of the volume of the data. This is intuitive: if the data to be processed is so large that we cannot hold it in memory, a different level of thinking is required to process it. For example, most of the sorting algorithms that we are familiar with expect the data to reside in the main memory of the computer, and if the data is larger than that, we need to think about how to handle it. Similarly, if the data received can't be stored on persistent storage connected to a single computer, we need to think about how we will process it. This simplistic, volume-only understanding is not sufficient to classify data as big data. We need to consider the following three dimensions of data to decide whether we need to think about it in terms of big data or not.
  • Size -- Size is the key ingredient in the properties of data that would force us to think about it as big data. 
  • Type -- The type of data is also a key ingredient to start thinking about it in terms of big data. When web-based systems became commonplace, the logs generated by web servers became the first such data that could not be effectively processed using regular algorithms. This type of data was the first to use some of the artifacts that we see in play today in big data systems.
  • Speed -- The speed at which the data is being generated is another important aspect that helps us decide whether a problem needs to be looked at as a big data problem. For example, if a system is generating data at a rate that a single system can't even collect, we need to find special ways of handling that data.
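To make the earlier point about sorting concrete, here is a minimal sketch of an external merge sort, one common way to sort data that does not fit in main memory. It is an illustration under stated assumptions, not a production implementation: the function name `external_sort` and the `chunk_size` parameter are made up for this example, the input is assumed to be integers, and the sorted runs are spilled to temporary text files.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(numbers, chunk_size):
    """Yield the integers from `numbers` in sorted order.

    Hypothetical helper for illustration: at most `chunk_size` values are
    held in a list at once; everything else lives in temporary files.
    """
    run_paths = []
    iterator = iter(numbers)
    try:
        # Phase 1: cut the input into chunks that fit in memory,
        # sort each chunk, and write it to disk as a sorted "run".
        while True:
            chunk = list(itertools.islice(iterator, chunk_size))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)

        # Phase 2: k-way merge. Each run is read lazily, line by line,
        # so the merge only keeps one pending value per run in memory.
        run_files = [open(path) for path in run_paths]
        try:
            streams = [(int(line) for line in f) for f in run_files]
            yield from heapq.merge(*streams)
        finally:
            for f in run_files:
                f.close()
    finally:
        # Clean up the temporary run files.
        for path in run_paths:
            os.remove(path)


data = [9, 4, 7, 1, 8, 2, 6, 3, 5]
print(list(external_sort(data, chunk_size=4)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The same two-phase idea (sort what fits, then merge the sorted pieces) extends to data spread across several machines, which is where dedicated big data systems come in.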
In today's world, we have systems such as sensor networks, social media platforms, video surveillance systems, and other imaging systems such as medical imaging, which generate huge amounts of data that fit into the categories defined above and thus require special systems, architectures, and algorithms to process them. Data generated today lies on a continuum from structured to unstructured, and it requires a multitude of methods to handle it.
We hear of two different types of processing systems in enterprises that process generated data. Business Intelligence systems provide functionality to generate reports, dashboards, and information on demand, which is typically historical in nature. Data Science or Data Analytics systems are more predictive in nature. They support process optimization, predictive modeling, forecasting, etc.
Data scientists engaged in the craft have to perform a set of activities that go beyond the technological systems. They need to do one or more of the following.
  • Understand Business Challenges and Questions
  • Reframe these business challenges as analytical challenges
  • Design Models and Techniques required
  • Implement Models and Techniques
  • Deploy the systems
  • Develop insights 
  • Generate actionable recommendations
The skills that a data scientist needs are a combination of the following.
  • A flair for data -- hard to objectively define what it is, but if looking at numbers doesn't excite you, maybe you are not cut out to be a data scientist.
  • Quantitative Skills -- Mathematics and Statistics
  • Technical Skills -- These skills may vary from using software like SPSS and programming in R to writing complicated data pipelines.
  • Critical Thinking -- It helps to have a skeptical mind if you want to be a data scientist.
  • Communicative -- Since one of the functions of a data scientist is to bring together a broad set of skills to solve problems that may not necessarily lie in one person's domain, it helps if you get along with others and communicate clearly.



