Showing posts with label Big Data. Show all posts

Thursday, 24 January 2019

Building an analytical organization

Given the noise around data analytics, many companies have woken up to the perceived benefits of being an analytical organization, and many of them want to become one. Unfortunately, a big-bang approach is rarely desirable or even possible when building analytical capability. We should look at analytical capability building as a continuum. It is extremely hard to leapfrog.

Companies at the bottom of the analytical capability scale might have some data, but they do not actively use it for decision making. The first building block of an analytical organization is the quality of its data. If the quality of the data is suspect, the first effort should be to improve it. In many situations, the organization might have to build systems that collect data. For example, an organization that holds all its customer data in non-digital form might start digitizing it and then realize that digitization has produced inaccurate data. The best approach for such an organization might be to first build systems for that function and wait for new data to be collected through ongoing operations before embarking on any type of analytics.
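As a rough illustration of what checking data quality can mean in practice, here is a minimal sketch that counts missing values and duplicate keys in a set of records. The function name `audit_records`, the field names, and the sample customers are all hypothetical, chosen only for this example; a real audit would also check formats, ranges, and consistency across systems.

```python
def audit_records(records, required_fields, key_field):
    """Count missing required values and duplicate keys in a list of dict records."""
    missing = {field: 0 for field in required_fields}
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
        key = record.get(key_field)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"total": len(records), "missing": missing, "duplicate_keys": duplicates}


# Hypothetical digitized customer records, invented for this example.
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "", "email": "ravi@example.com"},
    {"id": 2, "name": "Ravi", "email": None},
]
report = audit_records(customers, ["name", "email"], "id")
```

A report like this gives a first, crude signal of whether the digitized data can be trusted or whether collection systems need to be built first.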
If the organization has quality data available in specific functions and local management in these functions wants to leverage analytics, they can embark on the path. It is extremely important to understand that even in this scenario, the support of local management is critical. If managers believe that, having been in place for years, they know everything and can make decisions on their own, the organization will never embark on the path to becoming an analytical organization.
If the organization is using data-driven decision making in some functions and the leadership team becomes convinced that it needs to become an analytical organization, it needs to assess the state of its organization, skills, and technology to evaluate which path to take. At this point the organization needs to ask the following questions.

  • What existing capabilities of the organization might help the journey towards an analytical organization?
  • Which key processes and decisions will benefit most from data-driven decision making?
  • What is the differentiating factor of the organization?
Once an organization decides to increase its maturity on the analytics continuum, it needs to choose a path. If there is extremely high commitment from the leadership of the company, the organization can invest in a big-bang approach to becoming an analytical organization. It may start one or more of the following activities.
  • Find opportunities to collect new data and improve the quality of data
  • Build relationships across different datasets
  • Build data pipelines
  • Build processes and governance organizations
The second path that an organization may take is a proof-of-concept (PoC) path, where specific problems are picked and a proof of concept is performed before the solution is rolled out to the larger organization. This is a low-risk, low-reward, high-cycle-time option.
The point to understand is that a big-bang approach will only work if the top leadership of the organization is fully committed to the initiative. The PoC approach can be undertaken with just a functional manager as a sponsor, but it may result in multiple systems that don't work together.

Wednesday, 23 January 2019

Big Data and Analytics

What really is big data?


Most of us understand big data in terms of the volume of the data. At an elementary level, if the data to be processed is too large to hold in memory, a different level of thinking is required to process it. For example, most of the sorting algorithms that we are familiar with expect the data to fit in the main memory of the computer; if the data is larger than that, we need to think about how to handle it. Similarly, if the data received can't be stored on persistent storage connected to a single computer, we need to think about how we will process it. But volume alone is not sufficient to classify data as big data. We need to consider the following three dimensions to decide whether to treat it as big data or not.
  • Size -- Size is the property of data that most often forces us to think of it as big data.
  • Type -- The type of data is also a key ingredient to start thinking about it in terms of big data. When web-based systems became commonplace, the logs generated by web servers became the first such data that could not be effectively processed using regular algorithms. This type of data was the first to be handled with some of the tools that we see in big data systems today.
  • Speed -- The speed at which the data is being generated is another important aspect that helps us decide whether a problem needs to be looked at as a big data problem. For example, if a system is generating data at a rate that a single machine can't even collect, we need to find special ways of handling that data.
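To make the sorting example above concrete, here is a minimal sketch of an external merge sort, one common way to sort data that does not fit in main memory. It assumes the input is a stream of integers and that `chunk_size` items fit in memory at once; the function name `external_sort` and the one-integer-per-line run files are choices made for this illustration, not something taken from a particular big data system.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(values, chunk_size):
    """Yield integers in sorted order, sorting at most chunk_size items in memory at a time."""
    run_paths = []
    iterator = iter(values)
    try:
        # Phase 1: cut the input into chunks that fit in memory,
        # sort each chunk, and spill it to disk as a sorted "run".
        while True:
            chunk = sorted(islice(iterator, chunk_size))
            if not chunk:
                break
            with tempfile.NamedTemporaryFile("w", suffix=".run", delete=False) as run:
                run.writelines(f"{value}\n" for value in chunk)
                run_paths.append(run.name)

        # Phase 2: k-way merge of the sorted runs. heapq.merge reads
        # its inputs lazily, so only one item per run is held at a time.
        files = [open(path) for path in run_paths]
        try:
            yield from heapq.merge(*(map(int, f) for f in files))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files once the merge is done.
        for path in run_paths:
            os.remove(path)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))` sorts the seven values while sorting no more than three of them in memory at once. The same split, sort, and merge idea carries over when the runs live on different machines rather than in different files.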
In today's world, we have sensor networks, social media platforms, video surveillance systems, and other imaging systems such as medical ones that generate huge amounts of data. This data fits into the categories defined above and thus requires special systems, architectures, and algorithms to process it. Data that is generated today lies on a continuum from structured to unstructured, and it requires a multitude of methods to handle it.
We hear of two different types of systems in enterprises that process generated data. Business Intelligence systems provide functionality to generate reports, dashboards, and information on demand, which is typically historical in nature. Data Science or Data Analytics systems are more predictive in nature. They provide information on how to optimize processes, along with predictive modeling, forecasting, etc.
Data scientists engaged in the craft have to perform a set of activities that go beyond the technological systems. They need to do one or more of the following.
  • Understand Business Challenges and Questions
  • Reframe these business challenges as analytical challenges
  • Design Models and Techniques required
  • Implement Models and Techniques
  • Deploy the systems
  • Develop insights 
  • Generate actionable recommendations
The skills that a data scientist needs are a combination of the following.
  • A flair for data -- hard to objectively define what it is, but if looking at numbers doesn't excite you, maybe you are not cut out to be a data scientist.
  • Quantitative Skills -- Mathematics and Statistics
  • Technical Skills -- These skills may range from using software like SPSS or programming in R to writing complicated data pipelines
  • Critical Thinking -- It helps to have a skeptical mind if you want to be a data scientist.
  • Communication -- Since one of the functions of a data scientist is to bring together a broad set of skills to solve problems that may not necessarily lie in one person's domain, it helps if you get along with others and communicate clearly.
