Wednesday, 13 March 2024

How GenAI models like ChatGPT will end up polluting the knowledge base of the world.

Around nine years ago, around Pi Day, I wrote a blog post about an ancient method of remembering the digits of pi. Here is the link to that blog post.

Here is the operative part of the blog post.

The code is as follows.

1 2 3 4 5 6 7 8 9
क ख ग घ ङ च छ ज झ
ट ठ ड ढ ण त थ द ध
प फ ब भ म
य र ल व श ष स ह क्ष

With the above key in place, Sri Bharathi Krishna Tirtha, in his Vedic Mathematics, gives the following verse.
गोपी भाग्य मधुव्रात  श्रुङ्गिशो दधिसन्धिग  |
खलजीवित खाताव गलहालारसंधार |
If we replace each consonant in the verse with its digit from the table above, here is what we get.
31 41 5926 535 89793
23846 264 33832792
Strung together, that gives us 31415926535897932384626433832792 -- the digits of pi that the verse encodes.
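
To make the decoding concrete, here is a minimal Python sketch of the table lookup described above. The consonant sequence is hand-extracted from the verse (conjuncts and vowels resolved by hand), so this only illustrates the substitution step; it is not a general decoder.

# Minimal sketch: map the hand-extracted consonants of the verse to digits
# using the table above.  The tokenization is done by hand, not by the code.
ROWS = [
    "क ख ग घ ङ च छ ज झ",
    "ट ठ ड ढ ण त थ द ध",
    "प फ ब भ म",
    "य र ल व श ष स ह क्ष",
]
CODE = {c: str(i + 1) for row in ROWS for i, c in enumerate(row.split())}

# Effective consonants of the verse, in order.
VERSE_CONSONANTS = [
    "ग", "प", "भ", "य", "म", "ध", "र", "त",      # गोपी भाग्य मधुव्रात
    "श", "ग", "श", "द", "ध", "स", "ध", "ग",      # श्रुङ्गिशो दधिसन्धिग
    "ख", "ल", "ज", "व", "त", "ख", "त", "व",      # खलजीवित खाताव
    "ग", "ल", "ह", "ल", "र", "स", "ध", "र",      # गलहालारसंधार
]

print("".join(CODE[c] for c in VERSE_CONSONANTS))
# 31415926535897932384626433832792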

Today, I asked ChatGPT this question.



If you look at the detailed explanation provided by ChatGPT, it is completely wrong. What I picked up is from a book on the topic. Once more and more such content floods the internet, it will become extremely hard to decipher what is right and what is wrong. Furthermore, even academic texts written in the future will consist of ChatGPT-generated material.

Unless GenAI can provide traceability to the source material, I don't think it is worth the attention it is being given.

Wednesday, 20 February 2019

The morality of a machine

It is 2019, and everybody is gung-ho about Artificial Intelligence and/or Machine Learning. If we look around, there are many systems today that incorporate AI/ML in some shape or form. Most systems in production consist of typical classifiers or regression models that provide information to help make a particular decision. The question of the morality of a machine has not found top consideration because, until now, AI/ML has been producing output that a human uses to make the actual decision.
When the Google Photos image-tagging algorithm misclassifies an image and tags it with somebody else's name, it is not a serious problem because I can personally go ahead and reclassify the photo with the correct name. The algorithm is still classifying more than 90% of images accurately and reducing my workload. When Google Mail, or Gmail, misclassifies an important email containing a notice from the income tax department, it is a significant hassle, but still something that I can easily fix. The primary point here is that all the input created by AI is brought to the notice of a human, who looks at all the available data before making any decision.
But the day is not far off when we will reach a situation where machines directly handle responses and humans are either not in the loop or have become so unmindful of what the machines are doing for them that they simply ignore those actions. For example, it is not inconceivable to expect Gmail to automatically respond to a notice received from the income tax department by looking at your financial data stored on some Google Drive. It does not have that capability today, but the technology exists with which this capability can easily be built.
The problem becomes more compounded when the actions taken by machines may result in serious hazards to humans. Look at all the excitement around self-driving cars. When humans drive cars, they constantly make decisions that are driven by morality rather than plain logic. Take an example where you are driving your car and on one side there is a pedestrian or a cyclist, while on the other side there is another car. Most humans would err on the side of the car, because that is merely material damage; our morality expects us to value human life more highly than material things. If we extend this problem, let's say the self-driving car's AI realizes that it can't avoid an accident and has to choose between a child and an aged person. Suddenly the decision becomes extremely difficult. Humans make these decisions instinctively, and many are disturbed by those decisions for years down the line.
Should you value the life of a child higher than that of a senior citizen? What about when you have to choose between parent and child? What about choosing between a head of state and an ordinary person? These are complex decisions. We can write code that makes these decisions, but the rules have to be defined and agreed upon by society and governments. If every car manufacturer starts making its own moral decisions, the result may be anarchy. Governments, societies, and courts need to define a set of morality rules that every autonomous AI has to follow. I think the time has come when Asimov's three laws need to be expanded into something like Asimov's three laws plus other morality principles. These principles need to define how a machine can evaluate different outcomes in a given situation and choose an outcome that is acceptable to courts, societies, and governments. Unless that happens, AI should be relegated to a decision support system and should not become something that has any control.

Thursday, 24 January 2019

Building an analytical organization

Given the noise around data analytics, many companies have woken up to the perceived benefits of being an analytical organization, and many of them want to transition into one. Unfortunately, a big-bang approach is neither desirable nor possible in analytical capability building. We should look at the exercise of building analytical capability as a continuum. It is extremely hard to leapfrog.

Companies that are at the bottom of the analytical-capability continuum might have some data, but they do not actively use this data for decision making. The first building block of an analytical organization is the quality of the data. If the quality of the data is suspect, the first effort should be to improve it. In many situations, the organization might have to build systems that collect data. For example, an organization which has all its customer data in non-digital form might start digitizing all the data and then realize that digitization has resulted in inaccurate data. The best approach for them might be to first build systems for that function and wait for new data to be collected through continuous operations before embarking on any type of analytics.
If the organization has quality data available in specific functions and local management in these functions wants to leverage analytics, they can embark on the path. It is extremely important to understand that even in this scenario, the support of local management is critical. If the management believes that, having been in place for years, they know everything and can take decisions themselves, the organization will never embark on the path to becoming an analytical organization.
If the organization is using data-driven decision making in some of its functions and the leadership team is convinced that it needs to move along the path to becoming an analytical organization, it needs to assess the state of organization, skills, and technology to evaluate the path it might take to proceed further. At this point the organization needs to ask the following questions.

  • What existing capabilities of the organization might help the journey towards an analytical organization?
  • Which key processes and decisions will benefit most from data-driven decision making?
  • What is the differentiating factor of the organization?
Once an organization decides to embark on increasing its maturity on the analytics continuum, it needs to choose a path. If there is an extremely high commitment from the leadership of the company, the organization can invest in and proceed with a big-bang approach to becoming an analytical organization. Such an organization may start one or many of the following activities.
  • Find opportunities to collect new data and improve the quality of data
  • Build relationships across different datasets
  • Build data pipelines
  • Build processes and governance organizations
The second path that an organization may take is a proof-of-concept path, where specific problems are picked and a proof of concept is performed before the solution is rolled out to the larger organization. This is a low-risk, low-reward, high-cycle-time option.
The point to understand is that a big-bang approach will only work if the top leadership of the organization is fully committed to the initiative. The PoC approach can be undertaken by finding a functional manager as a sponsor, but it may result in multiple systems that don't work together.

Wednesday, 23 January 2019

Big Data and Analytics

What really is big data?


Most of us understand big data in the context of the volume of data. It is elementary: if the unit of data to be processed is large enough that we cannot hold it in memory, a different level of thinking is required to process it. For example, most of the sorting algorithms that we are aware of expect the data to fit in the main memory of the computer, and if the size of the data is larger than that, we need to think about how to handle it (a rough sketch of that idea follows the list below). Similarly, if the data received can't be stored on the persistent storage connected to a single computer, we need to think about how we will process it. This simplistic understanding, however, is not sufficient to classify data as big data. We need to think about the following three dimensions of data to say whether we need to think about it in terms of big data or not.
  • Size -- Size is the key ingredient in the properties of data that would force us to think about it as big data. 
  • Type -- The type of data is also a key ingredient in starting to think about it in terms of big data. When web-based systems became commonplace, the logs generated by web servers became the first such data that could not be effectively processed using regular algorithms. This type of data was the first to start using some of the artifacts that we see in play in big data systems today.
  • Speed -- The speed at which the data is being generated is another important aspect that helps us decide whether a problem needs to be looked at as a big data problem. For example, if there is a system that generates data at a rate at which a single machine can't even collect it, we need to find special ways of handling that data.
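As a concrete illustration of the size dimension, here is a rough Python sketch of the classic external-sort idea: sort chunks that fit in memory, spill each sorted run to disk, and then merge the runs. The chunk size and file handling are simplified placeholders.

# Rough sketch of an external merge sort for newline-terminated, line-oriented
# data that does not fit in main memory.  Chunk size is an arbitrary placeholder.
import heapq
import os
import tempfile

def external_sort(input_path, output_path, lines_per_chunk=1_000_000):
    run_paths = []
    # Phase 1: read manageable chunks, sort each in memory, spill to disk.
    with open(input_path) as src:
        while True:
            chunk = [line for _, line in zip(range(lines_per_chunk), src)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Phase 2: k-way merge of the sorted runs using a heap.
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for f in runs:
        f.close()
    for p in run_paths:
        os.remove(p)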
In today's world, we have sensor networks, social media platforms, video surveillance systems, and other imaging systems (such as medical imaging) which generate huge amounts of data that fit into the categories defined above and thus require special systems, architectures, and algorithms to process them. The data generated today lies on a continuum from structured to unstructured, and it requires a multitude of methods to handle it.
We hear of two different types of processing systems in enterprises that work on this generated data. Business Intelligence systems provide functionality to generate reports, dashboards, and information on demand, which is typically historical in nature. Data Science or Data Analytics systems are more predictive in nature: they provide information on how to optimize processes, predictive modeling, forecasting, etc.
Data scientists engaged in the craft have to perform a set of activities that go beyond just the technological systems. They need to do one or more of the following.
  • Understand Business Challenges and Questions
  • Reframe these business challenges as analytical challenges
  • Design Models and Techniques required
  • Implement Models and Techniques
  • Deploy the systems
  • Develop insights 
  • Generate actionable recommendations
The skill set that a data scientist needs is a combination of the following.
  • A flair for data -- hard to objectively define what it is, but if looking at numbers doesn't excite you, maybe you are not cut out to be a data scientist.
  • Quantitative Skills -- Mathematics and Statistics
  • Technical Skills -- These skills may vary from using software like SPSS, Programming in R to writing complicated data pipelines
  • Critical Thinking -- It helps to be a skeptical mind if you want to be a data scientist.
  • Communication -- Since one of the functions of a data scientist is to bring together a broad set of skills, which may not all reside in one person's domain, to solve problems, it helps if you get along with others and communicate clearly.

Saturday, 24 November 2018

A pothole detector

I like to drive a lot, and on highways at high speeds, encountering a pothole can be tremendously dangerous. I always wondered whether AI could really detect potholes in the road, so I decided to give it a try. I always have a dashcam in my car that constantly records videos of my drives, so I decided to take a few of those videos, along with a pothole dataset available on the internet, to train a pothole detector.

I visualized this problem as an object detection problem. As I approached it, it became very clear that, in the form it was in, the dataset was not ready for training. Look at one of the pictures from the dataset.

The image contains a bunch of potholes. We could use this type of image if we were building a classifier that sorts images into two groups, one having potholes and the other without.
Our intention is very different. What we want is to locate the pothole within the frame. So we have to label our images, and that is a time-consuming activity.
There are many tools available that help with tagging and labeling. I looked at quite a few of them and finally found VoTT from Microsoft to be a good tool. The fact that it can export tagged data into multiple formats was icing on the cake.
So I got around to tagging a bunch of these images. At this point, the only object we are interested in is the pothole. After tagging, the image above looked like the image below.
We also did the same exercise with a bunch of dashcam videos. It is really a time-consuming exercise and for that reason, we did not really get a substantial dataset.
Once we have our tagged and labeled dataset ready, we need to export it to the tfrecord format. This is the format that tensorflow is most familiar with, and it also makes it easy to merge multiple datasets into one.
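For reference, here is a rough sketch of what writing one labeled frame into a tfrecord file looks like, following the standard feature keys used by the TensorFlow Object Detection API. The file name, image size, and box coordinates below are made-up placeholders; in practice VoTT's export (or the dataset tools shipped with the API) produces these records for you.

# Sketch: serialize one image and its bounding boxes into a TFRecord
# (TensorFlow 1.x style APIs).  Paths, dimensions, and boxes are placeholders.
import tensorflow as tf

def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
def _floats(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))
def _ints(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(image_path, width, height, boxes, class_names, class_ids):
    with tf.gfile.GFile(image_path, 'rb') as f:
        encoded_jpg = f.read()
    return tf.train.Example(features=tf.train.Features(feature={
        'image/height': _ints([height]),
        'image/width': _ints([width]),
        'image/filename': _bytes([image_path.encode()]),
        'image/source_id': _bytes([image_path.encode()]),
        'image/encoded': _bytes([encoded_jpg]),
        'image/format': _bytes([b'jpeg']),
        # Box coordinates are normalized to [0, 1].
        'image/object/bbox/xmin': _floats([b[0] / float(width) for b in boxes]),
        'image/object/bbox/ymin': _floats([b[1] / float(height) for b in boxes]),
        'image/object/bbox/xmax': _floats([b[2] / float(width) for b in boxes]),
        'image/object/bbox/ymax': _floats([b[3] / float(height) for b in boxes]),
        'image/object/class/text': _bytes([n.encode() for n in class_names]),
        'image/object/class/label': _ints(class_ids),
    }))

writer = tf.python_io.TFRecordWriter('train.tfrecord')
example = make_example('frame_0001.jpg', 1280, 720,
                       boxes=[(412, 530, 610, 660)],        # (xmin, ymin, xmax, ymax) in pixels
                       class_names=['Pothole'], class_ids=[2])  # id 2 matches the label map below
writer.write(example.SerializeToString())
writer.close()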
Once we are done with this, we have a number of tfrecord files and a pbtxt file.
item {
 id: 1
 name: 'RoadBump'
}
item {
 id: 2
 name: 'Pothole'
}
item {
 id: 3
 name: 'People'
}
item {
 id: 4
 name: 'Truck'
}
item {
 id: 5
 name: 'Bus'
}
item {
 id: 6
 name: 'Car'
}
item {
 id: 7
 name: 'TwoWheeler'
}
item {
 id: 8
 name: 'AutoRickshaw'
}
item {
 id: 9
 name: 'BadRoad'
}

Now we create two subdirectories, one to hold the training data and another to hold the evaluation data. We then split the tfrecord files randomly into two sets, one for training and another for evaluation. Both subdirectories can have the same pbtxt file.
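A quick way to do the random split from the shell looks roughly like this (an illustrative 80/20 split; the file and directory names match the layout described above but are otherwise placeholders):
$ mkdir -p train eval
$ ls *.tfrecord | shuf > all_records.txt
$ n_train=$(( $(wc -l < all_records.txt) * 8 / 10 ))
$ head -n ${n_train} all_records.txt | xargs -I{} mv {} train/
$ tail -n +$(( n_train + 1 )) all_records.txt | xargs -I{} mv {} eval/
$ cp tf_label_map.pbtxt train/ && cp tf_label_map.pbtxt eval/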
The next step is to create a Google Cloud project. I am doing this work on an Ubuntu Linux machine, and downloading the Google Cloud SDK makes life easier. The Google Cloud developer tools website has good information on what needs to be done, but in summary, you need to authorize a user account.
The next step is to install tensorflow. Even though I plan on running the training on Google Cloud, I still need tensorflow locally because some of its utilities are required to package the job.
In my experience, tensorflow still works better with Python 2.7. It works with Python 3.5 and 3.6 as well, but if you have a choice, stay with Python 2.7.
You have a choice of installing tensorflow with or without GPU support. If you have a machine with a CUDA-capable GPU, you need to install the CUDA 9.0 drivers. If you are using Ubuntu 18.04, by default Nvidia will install CUDA 10.x; you have to make sure you remove those and install the CUDA 9.0 drivers. Also, be careful when you uninstall the CUDA 10.x drivers. In my case, it removed some important packages and I had to manually reinstall them later.
$ sudo apt update
$ sudo apt install python-dev python-pip
$ sudo pip install -U virtualenv  # system-wide install
$ pip install --upgrade tensorflow  # CPU-only install
$ pip install --upgrade tensorflow-gpu # GPU install

It may be prudent to configure a virtual environment to make sure you don't have to worry about broken dependencies. The tensorflow website has good information on how to go about the installation and is the best resource for it.
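If you go the virtual environment route, the setup looks roughly like this (the environment name is just an example):
$ virtualenv --python=python2.7 ~/tf-pothole
$ source ~/tf-pothole/bin/activate
$ pip install --upgrade tensorflow-gpu   # or tensorflow for the CPU-only variant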
The next step is to figure out what should be done about the model. Since I am modeling the problem as an object detection problem, I will use one of the existing model checkpoints and rely on transfer learning.
$ git clone https://github.com/tensorflow/models

We go into the models/research directory; the code for object detection is in the object_detection directory. The Tensorflow model zoo contains pre-trained models, and we decided to use faster_rcnn_inception_resnet_v2_atrous_coco as the base model for training. At this point, download the model and extract the archive into a directory. It is always suggested to pick up the pipeline config file from the git repository. The config files are contained in the models/research/object_detection/samples/configs directory, and each model has a corresponding config file. I have used faster_rcnn_inception_resnet_v2_atrous_coco.config for my model.
Now we need to set up our cloud environment. The first step is to create a bucket in cloud storage. The Google tutorial uses the following trick to name buckets.
$ export PROJECT=$(gcloud config list project --format "value(core.project)")
$ export YOUR_GCS_BUCKET="gs://${PROJECT}-ml"
Within the cloud bucket, we create a directory named data. Within data, we create two subdirectories, train and eval, and we copy the corresponding .tfrecord and pbtxt files into those directories. So ${YOUR_GCS_BUCKET}/data/train contains the training dataset and ${YOUR_GCS_BUCKET}/data/eval contains the evaluation dataset.
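Creating the bucket and copying the split datasets can be done with gsutil along these lines (the local train and eval directories are the ones created earlier):
$ gsutil mb ${YOUR_GCS_BUCKET}
$ gsutil -m cp train/*.tfrecord train/tf_label_map.pbtxt ${YOUR_GCS_BUCKET}/data/train/
$ gsutil -m cp eval/*.tfrecord eval/tf_label_map.pbtxt ${YOUR_GCS_BUCKET}/data/eval/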
We need to package pycocotools to be submitted along with the model.
$ cd models/research
$ bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
We also need to compile the protobuf files from the models/research directory, and we need to add the location of the models/research folder and the models/research/slim folder to the PYTHONPATH variable.
$ protoc object_detection/protos/*.proto --python_out=.
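The PYTHONPATH addition can be done like this, assuming the current directory is models/research:
$ export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim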

Now we generate all the archives that are needed to run the training. From the models/research directory, we run the following commands.
$ python setup.py sdist
$ (cd slim && python setup.py sdist)
Now we need to configure the pipeline config file. We use the faster_rcnn_inception_resnet_v2_atrous_coco.config file as the base. Basically, we need to replace all instances of PATH_TO_BE_CONFIGURED in the file with appropriate values. We will need to set the following values.

  1. fine_tune_checkpoint -- the cloud location of the model checkpoint.
  2. input_path for training data. This should be the cloud location of the training files; you can use wildcards here, for example gs://myproject-ml/data/train/*.tfrecord.
  3. label_map_path for training data, for example gs://myproject-ml/data/train/tf_label_map.pbtxt.
  4. input_path for eval data. This should be the cloud location of the eval files; wildcards work here too, for example gs://myproject-ml/data/eval/*.tfrecord.
  5. label_map_path for eval data, for example gs://myproject-ml/data/eval/tf_label_map.pbtxt.
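After the edits, the relevant fragments of the config end up looking roughly like the following (the bucket name and paths are the example values from above; everything else in the config stays as shipped):
train_config {
  # other training settings unchanged
  fine_tune_checkpoint: "gs://myproject-ml/data/model.ckpt"
}
train_input_reader {
  tf_record_input_reader {
    input_path: "gs://myproject-ml/data/train/*.tfrecord"
  }
  label_map_path: "gs://myproject-ml/data/train/tf_label_map.pbtxt"
}
eval_input_reader {
  tf_record_input_reader {
    input_path: "gs://myproject-ml/data/eval/*.tfrecord"
  }
  label_map_path: "gs://myproject-ml/data/eval/tf_label_map.pbtxt"
}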

For convenience's sake, I also renamed the config file to pipeline.config. Now we upload the config file to the data directory in cloud storage.
Now we are ready to submit our job for training.
$ gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
      --job-dir=${YOUR_GCS_BUCKET}/train \
      --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
      --module-name object_detection.model_main \
      --config train.yaml \
      --runtime-version 1.10 \
      -- \
      --pipeline_config_path=${YOUR_GCS_BUCKET}/data/pipeline.config \
      --model_dir=${YOUR_GCS_BUCKET}/data/model_dir
Please pay attention to the fact that the last two command-line arguments come after the bare -- and are passed on to the model itself.
In my case, I ran the training for approximately 24 hours, and even with insufficient data the results were remarkable. Here are some of the images with detection in action.





As we can see, even a limited amount of training data and training compute cycles can produce remarkable results. As a next step, we need to tag and label a larger amount of data and then train for longer.

Friday, 8 December 2017

Finding representative phrases

As part of my doctoral research, I was faced with an interesting classification problem. I was working with a dataset extracted from the Usenet archive. In Usenet, content is automatically classified within newsgroups. My challenge was to find representative phrases for a given piece of content based on the primary classification and the content of the Usenet post.
Identifying topics
Because we are dealing with technical text, we created our own list of stop words that we ignored while processing for representative phrases. The code looked for ngrams of length up to 2 but could easily be changed to handle longer ngrams.
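A minimal sketch of that kind of extraction is shown below; the stop-word list, tokenization, and plain-frequency scoring here are simplified placeholders rather than the exact ones used in the research.

# Sketch: representative unigrams/bigrams by frequency with a custom
# stop-word list.  Stop words and tokenization are illustrative only.
import re
from collections import Counter

CUSTOM_STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in",
                     "re", "wrote", "writes"}

def representative_phrases(text, max_ngram=2, top_k=10):
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in CUSTOM_STOP_WORDS]
    counts = Counter()
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_k)

print(representative_phrases("how does gcc handle inline assembly on x86? "
                             "gcc inline assembly uses the asm keyword"))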



Saturday, 1 March 2014

Building a natural language classifier

The idea behind this post is to build a classifier that would work with any content. For the purpose of this experiment, we chose Twitter data.
In any machine learning experiment, we need to have a training set and a test set. I collected a number of tweets during the period October 2013 to February 2014 using the Twitter streaming API. A total of 322,382 tweets were collected during this time period, and a total of 235,100 users participated in this information exchange. Based on specific events on those days, I used the following four tokens to filter the tweets on the streaming API: 1) India 2) FamilyGuy 3) FastAndFurious 4) Thanksgiving. The data I collected contains all the original tweets initiated during this time and all the retweets. Since Twitter does not maintain the flow of retweets across users, all retweets point to the original tweet rather than to intermediate tweets. Because of this, even if a particular tweet is missed but its retweets are seen, we can always recover the original tweet from a retweet, since the original tweet is always fully embedded in the retweet.
The Twitter schema allows users to add metadata to their tweets. This metadata manifests itself in the form of hashtags, which provide a primary means of categorization. Since tweets are limited in the size of content they can carry, users can attach URLs that point to larger content. We want to evaluate the impact of the presence or absence of hashtags and URLs in a tweet.
We parse the tweet data and create a table with the following columns for each tweet.

  1. Retweets in less than 10 seconds 
  2. RTs in greater than 10 and less than 30 seconds 
  3. RTs in greater than 30 and less than 1 minute 
  4. RTs in greater than 1 min and less than 5 mins
  5. RTs in greater than 5 mins and less than 10 mins
  6. RTs in greater than 10 mins and less than 30 mins 
  7. RTs in greater than 30 mins and less than 1 hour 
  8. RTs in greater than 1 hour 
  9. Number of hashtags 
  10. Number of URLs 
  11. Total retweets 
  12. Category of popularity of retweets, denoted by an interval variable ranging from 1 to 7, with 1 being the least popular and 7 the most popular

We have already seen that a large percentage of Twitter users primarily just retweet and don't add any content of their own, and Twitter does not really preserve the path traveled by a tweet. For all further analysis, therefore, we only take into account tweets that were created by a user, and we ignore retweets by other users since they do not add any information.
The next step is to find appropriate categories for the tweet. We use the following set of rules to find appropriate categories for the tweets being examined.
  1. Look at all the hashtags in a tweet. All of these are potential categories
  2. For tweets having more than one hashtag, we only categorize it under the hashtag with the highest frequency
  3. If a tweet does not have any hashtag, we classify it under the URL which is attached to it.
Once we have collected all the hashtags and URLs, we need to eliminate similarities. While deciding on categories, it is important to note that people often make mistakes while typing hashtags, use different spellings, and also use related words to categorize a tweet.
Considering all these issues, we propose the following algorithm to come up with relevant categories of tweets.
  1. We measure the distance between two hashtags to identify similar hashtags. We use Jaro-Winkler distance as the measure of distance between two hashtags; one of the reasons for choosing it is that it is best suited for short strings, which is the case with hashtags. At the end of step 1, we have similar hashtags clubbed into a single bucket. For example, we can see that the algorithm identifies the hashtags "familyguyxxx familyg familyguyproblems familyguyfans" as similar ones (a small sketch of this grouping step appears after this list).
  2. In the second step, we look at groups of hashtags that occur together very frequently and merge them. For example, a group containing "familyguy" and a group containing "briangriffin" occur together frequently, so we merge them into a single group.
  3. We eliminate all the hashtags that seem to occur together but whose tweet text has a very high distance. It is a common practice among spammers to ride popular and trending hashtags and start tweeting by including those hashtags.
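A small sketch of the distance-based grouping in step 1 might look like the following; the 0.9 threshold is an assumption, and the Jaro-Winkler call is shown via the jellyfish library (the exact function name differs between jellyfish versions, and any implementation of the measure will do).

# Sketch: greedily bucket hashtags whose Jaro-Winkler similarity to the
# first member of a group exceeds a threshold.  Threshold is illustrative.
import jellyfish

def group_hashtags(hashtags, threshold=0.9):
    groups = []
    for tag in hashtags:
        for group in groups:
            if jellyfish.jaro_winkler_similarity(tag, group[0]) >= threshold:
                group.append(tag)
                break
        else:
            groups.append([tag])
    return groups

print(group_hashtags(["familyguy", "familyguyproblems", "familyguyfans",
                      "briangriffin", "thanksgiving"]))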
Thus we end up with the final list of categories that we can use in the final analysis; these are our significant categories. The next step is to build a classifier. We have used a dynamic language model based classifier. Dynamic language model based classifiers use multivariate estimators for the category distribution and dynamic language models for the per-category character sequence estimators. The whole dataset is segregated into a training set (90%) and a test set (10%).
The total accuracy of the model thus built is 83.64%. This means that, across all the test cases, the model classifies correctly 83.64% of the time.


Important results:
  Total cases:     57240
  True Positives:   1064
  False Negatives:   208
  False Positives:   208
  True Negatives:  55760

You can find the complete IEEE paper by clicking on the hyperlink.
