Avasthi AI

Friday, 8 December 2017

Finding representative phrases

As part of my doctoral research, I faced an interesting classification problem. I was working with a dataset extracted from the Usenet archive. In Usenet, content is automatically classified into newsgroups. My challenge was to find representative phrases for a given post based on its primary classification and its content.

Identifying topics

Because we are dealing with technical text, we created our own list of stop words, which we ignored while searching for representative phrases. The code appended below looks for n-grams of length at most 2, but this can easily be changed to a larger length.
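The original listing is not included in this text. As a stand-in, here is a minimal Python sketch of the idea, not the code used in the research: it counts unigrams and bigrams that contain no stop word. The stop-word list, the tokenizer, and the function name `candidate_phrases` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the custom list used in the research is not shown.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "on"}


def candidate_phrases(text, max_n=2):
    """Count n-grams (n = 1..max_n) that contain no stop word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if any(tok in STOP_WORDS for tok in gram):
                continue  # skip any n-gram that touches a stop word
            counts[" ".join(gram)] += 1
    return counts


post = ("The kernel module fails to load. "
        "Reloading the kernel module fixes the kernel panic.")
counts = candidate_phrases(post)
print(counts.most_common(3))
```

Raising `max_n` to 3 or more extends the same loop to longer phrases.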

Saturday, 1 March 2014

Building a natural language classifier

The idea behind this post is to build a classifier that would work on any content. For the purpose of this experiment, we chose Twitter data.

In any machine learning experiment, we need a training set and a test set. I collected tweets from October 2013 to February 2014 using the Twitter streaming API. A total of 322,382 tweets from 235,100 users were collected during this period. Based on specific events on particular days, I used the following four tokens to filter the tweets on the streaming API: 1) India, 2) FamilyGuy, 3) FastAndFurious, and 4) Thanksgiving. The collected data contains all the original tweets initiated during this time and all the retweets. Since Twitter does not maintain the flow of retweets across users, every retweet points to the original tweet rather than to intermediate retweets. Because of this, even if a particular tweet is missed but its retweets are seen, we can always recover the original tweet, since it is fully embedded in each retweet.

The Twitter schema allows users to add metadata to their tweets. This metadata takes the form of hashtags, which provide a primary means of categorization. Since tweets are limited in the amount of content they can carry, users can attach URLs that point to longer content. We want to evaluate the impact of the presence or absence of hashtags and URLs in a tweet.

We parse the tweet data and create a table with the following columns for each tweet.

Retweets in less than 10 seconds
Retweets in more than 10 and less than 30 seconds
Retweets in more than 30 seconds and less than 1 minute
Retweets in more than 1 minute and less than 5 minutes
Retweets in more than 5 minutes and less than 10 minutes
Retweets in more than 10 minutes and less than 30 minutes
Retweets in more than 30 minutes and less than 1 hour
Retweets in more than 1 hour
Number of hashtags
Number of URLs
Total retweets
Popularity category based on retweets, denoted by an interval variable ranging from 1 to 7, with 1 the least popular and 7 the most popular
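The eight retweet-delay columns above amount to a histogram over fixed time buckets. A small sketch of how such a row could be computed from retweet delays in seconds follows. The function name and the treatment of delays that fall exactly on a boundary (they go to the later bucket) are assumptions, since the post does not specify them.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds of the first seven buckets, in seconds.
BOUNDS = [10, 30, 60, 300, 600, 1800, 3600]
LABELS = ["<10s", "10-30s", "30s-1m", "1-5m", "5-10m", "10-30m", "30m-1h", ">1h"]


def retweet_histogram(delays_seconds):
    """Return retweet counts per delay bucket, in the order of LABELS."""
    counts = Counter(LABELS[bisect_right(BOUNDS, d)] for d in delays_seconds)
    return [counts[label] for label in LABELS]


# Seconds between the original tweet and each retweet (made-up values).
row = retweet_histogram([3, 12, 45, 61, 299, 4000, 7200])
print(dict(zip(LABELS, row)))
```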

We have already seen that a large percentage of Twitter users primarily just retweet and do not add any content of their own. Also, Twitter does not preserve the path traveled by a tweet. For all further analysis, we therefore only take into account tweets that were created by a user. We ignore retweets by other users since they do not add any information.

The next step is to find appropriate categories for the tweet. We use the following set of rules to find appropriate categories for the tweets being examined.

Look at all the hashtags in a tweet. All of these are potential categories
For a tweet with more than one hashtag, we categorize it only under the hashtag with the highest frequency
If a tweet does not have any hashtag, we classify it under the URL which is attached to it.
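These rules can be sketched in a few lines of Python. The tweet representation (a dict with `hashtags` and `urls` lists), the lowercasing of hashtags, and the choice of the first URL are assumptions made for this illustration.

```python
from collections import Counter


def assign_categories(tweets):
    """Pick one category per tweet: most frequent hashtag, else attached URL."""
    # Frequency of every hashtag across the whole collection.
    freq = Counter(tag.lower() for t in tweets for tag in t["hashtags"])
    categories = []
    for t in tweets:
        tags = [tag.lower() for tag in t["hashtags"]]
        if tags:
            categories.append(max(tags, key=lambda tag: freq[tag]))
        elif t["urls"]:
            categories.append(t["urls"][0])
        else:
            categories.append(None)  # neither hashtag nor URL
    return categories


tweets = [
    {"hashtags": ["FamilyGuy", "BrianGriffin"], "urls": []},
    {"hashtags": ["familyguy"], "urls": []},
    {"hashtags": [], "urls": ["http://example.com/a"]},
    {"hashtags": [], "urls": []},
]
result = assign_categories(tweets)
print(result)
```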

Once we have collected all the hashtags and URLs, we need to eliminate similarities. When deciding on the categories, it is important to note that people often make mistakes when typing hashtags, use different spellings, and use related words to categorize a tweet.

Considering all these issues, we propose the following algorithm to come up with relevant categories of tweets.

First, we measure the distance between two hashtags to identify similar hashtags. We use the Jaro-Winkler distance as the measure. One reason for choosing it is that it is well suited to short strings, which is the case with hashtags. At the end of this first step, similar hashtags are grouped into a single bucket. For example, the algorithm identifies the hashtags "familyguyxxx", "familyg", "familyguyproblems", and "familyguyfans" as similar.
In the second step, we look at the groups of hashtags, find the groups whose hashtags occur together very frequently, and merge them. For example, a group containing “familyguy” and a group containing “briangriffin” occur together frequently, so we merge them into a single group.
Finally, we eliminate hashtags that occur together but whose tweets contain text that is very far apart. It is common practice among spammers to ride popular and trending hashtags by including them in their own tweets.
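The first step can be illustrated with a small, self-contained Jaro-Winkler implementation and a greedy bucketing pass. This is a sketch only: the similarity threshold of 0.85, the comparison against the first member of each bucket, and the function names are assumptions, not the settings used in the study. The code computes similarity, and the distance is one minus that value.

```python
def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_chars = [c for c, hit in zip(s, s_hit) if hit]
    t_chars = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def bucket_hashtags(tags, threshold=0.85):
    """Greedy grouping: join the first bucket whose first tag is similar enough."""
    buckets = []
    for tag in tags:
        for bucket in buckets:
            if jaro_winkler(bucket[0], tag) >= threshold:
                bucket.append(tag)
                break
        else:
            buckets.append([tag])
    return buckets


groups = bucket_hashtags(["familyguy", "familyguyxxx", "familyg",
                          "familyguyfans", "thanksgiving", "thanksgivng"])
print(groups)
```

A greedy pass like this depends on the order of the input. A more careful version might compare against every member of a bucket or use proper clustering.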

Thus we end up with the final list of categories for the final analysis. These are our significant categories. The next step is to build a classifier. We used a dynamic language model based classifier, which uses a multivariate estimator for the category distribution and dynamic language models for the per-category character sequence estimators. The data is split into a training set (90%) and a test set (10%).
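To make the idea concrete, here is a much simplified character n-gram classifier in Python. It is not the library or the model used for the experiment: it uses a fixed trigram order with add-one smoothing, whereas a real dynamic language model is considerably more refined. The class name, the smoothing, the alphabet size, and the toy training texts are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def split_90_10(items, seed=0):
    """Shuffle and split into 90% training and 10% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


class CharNgramClassifier:
    def __init__(self, n=3, alphabet_size=256):
        self.n = n
        self.alphabet_size = alphabet_size
        self.category_counts = Counter()            # category distribution
        self.ngram_counts = defaultdict(Counter)    # per-category n-gram counts
        self.context_counts = defaultdict(Counter)  # per-category context counts

    def _ngrams(self, text):
        padded = " " * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n]

    def train(self, category, text):
        self.category_counts[category] += 1
        for gram in self._ngrams(text):
            self.ngram_counts[category][gram] += 1
            self.context_counts[category][gram[:-1]] += 1

    def log_score(self, category, text):
        """log P(category) + sum of log P(char | previous n-1 chars, category)."""
        total = sum(self.category_counts.values())
        score = math.log(self.category_counts[category] / total)
        for gram in self._ngrams(text):
            num = self.ngram_counts[category][gram] + 1
            den = self.context_counts[category][gram[:-1]] + self.alphabet_size
            score += math.log(num / den)
        return score

    def classify(self, text):
        return max(self.category_counts, key=lambda c: self.log_score(c, text))


clf = CharNgramClassifier()
clf.train("familyguy", "brian griffin is the best")
clf.train("familyguy", "stewie and brian on family guy tonight")
clf.train("thanksgiving", "turkey and pie for thanksgiving dinner")
clf.train("thanksgiving", "happy thanksgiving to the whole family")
print(clf.classify("brian and stewie"), clf.classify("turkey dinner"))
```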

The total accuracy of the resulting model is 83.64%. This means that the model correctly classified 83.64% of the test cases.

Important results
Total cases	57240
True Positives	1064
False Negatives	208
False Positives	208
True Negatives	55760
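One way to relate the 83.64% figure to this table is to assume that the counts are one-versus-all totals summed over all categories. The post does not say so, so treat this as an assumption. Under that reading, every test tweet contributes exactly one true positive or one false negative, which gives TP + FN = 1,272 test tweets, and the share of correct classifications is 1064 / 1272, or roughly 83.6%.

```python
# Counts taken from the table above.
tp, fn, fp, tn = 1064, 208, 208, 55760
total = tp + fn + fp + tn

# Assumption: one-versus-all counts, so tp + fn is the number of test tweets.
test_tweets = tp + fn
share_correct = tp / test_tweets           # matches the reported total accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
one_vs_all_accuracy = (tp + tn) / total    # dominated by the true negatives

# Under the same assumption, total / test_tweets would be the number of categories.
implied_categories = total / test_tweets

print(f"share correct: {share_correct:.4f}")
print(f"precision: {precision:.4f}, recall: {recall:.4f}")
print(f"one-vs-all accuracy: {one_vs_all_accuracy:.4f}")
```

Because the true negatives dominate, the one-versus-all accuracy comes out above 99%, which is why the share of correctly classified tweets is the more informative number.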

You can find the complete IEEE paper by clicking on the hyperlink.

Friday, 7 October 2005

Using Neural Networks to Explain Behavior of Indian Markets

It is believed that stock prices are impacted by three factors.

Company/Industry performance
Macroeconomic outlook
Sentiment driven by news

The first two of the above are defined by numerical data. I decided to build an artificial engine that can predict the value of a particular stock based on movements in these variables. We looked at the following variables.

assetsWithBankingSystem – Total assets with the banking system
bankCredit – Bank credit in India
cash – Cash in hand
investmentAtBookValue – Total bank investments at book value
liabilitiesToBankingSystem – Total liabilities of banks to the banking system
liabilitiesToOthers – Total liability of banks other than the banking system
curcredit – Current account credit in INR
curdebit – Current account debit in INR
capcredit – Capital account credit in INR
capdebit – Capital account debit in INR
errcredit – Errors credit
errdebit – Errors debit
balcredit – Balance credit
baldebit – Balance debit
monmovcredit – Monetary movements credit
monmovdebit – Monetary movements debit
callMoneyHigh – Call money rate, High
callMoneyLow – Call money rate, Low
eps – Earnings per share of the company
ceps – Cash earnings per share of the company
bookValue – Book value of the company
div – Dividend paid per share of the company
opProfitPerShare – Operating profit per share of the company
netOperatingIncomePerShare – Net operating income per share of the company
freeReserves – Free reserves with the company
opm – Operating profit margin of the company
gpm – Gross profit margin of the company
npm – Net profit margin of the company
ronw – Return on net worth of the company
debtToEquity – Debt to equity ratio of the company
currentRatio – Current ratio of the company
quickRatio – Quick ratio of the company
interestCover – Interest cover of the company
salesByTotalAssets – Sales by total assets of the company
salesByFixedAssets – Sales by fixed assets of the company
salesByCurrentAssets – Sales by current assets of the company
noOfDaysOfWorkingCapital – Number of days of working capital with the company
cpi – Consumer price index
br – Bank Rate
idbiRate – IDBI minimum term lending rate
maxCMR – Maximum Call Money Rate
maxPLR – Maximum prime lending rate
minPLR – Minimum Prime lending rate
price – Crude price
totalINRdebt – Total debt in Indian Rupees
concessionalDebtAsPercOfTotal – Concessional debt as a percentage of total
shortTermDebtAsPercOfTotal – Short-term debt as a percentage of total
affConstant – Agriculture, Forestry and Fishing, GDP factor cost, Constant prices
affCurrent – Agriculture, Forestry and Fishing, GDP factor cost, Current prices
cspsConstant – Community social and personal services, GDP factor cost, Constant prices
cspsCurrent – Community social and personal services, GDP factor cost, Current prices
consConstant – Construction, GDP factor cost, Constant prices
consCurrent – Construction, GDP factor cost, Current prices
egwsConstant – Electricity, Gas and Water Services, GDP factor cost, Constant prices
egwsCurrent – Electricity, Gas and Water Services, GDP factor cost, Current prices
firebsConstant – Finance, Insurance, Real Estate, and Business services, GDP factor cost, Constant prices
firebsCurrent – Finance, Insurance, Real Estate, and Business services, GDP factor cost, Current prices
manuConstant – Manufacturing, GDP factor cost, Constant prices
manuCurrent – Manufacturing, GDP factor cost, Current prices
maqConstant – Mining and quarrying, GDP factor cost, Constant prices
maqCurrent – Mining and quarrying, GDP factor cost, Current prices
tdpConstant – Total domestic product, GDP factor cost, Constant prices
tdpCurrent – Total domestic product, GDP factor cost, Current prices
thrConstant – Trade, Hotel and Restaurant, GDP factor cost, Constant prices
thrCurrent – Trade, Hotel and Restaurant, GDP factor cost, Current prices
aff – Agriculture, Forestry and Fishing, GDP factor cost
csps – Community social and personal services, GDP factor cost
cons – Construction, GDP factor cost
egws – Electricity, Gas and Water Services, GDP factor cost
firb – Finance, Insurance, Real Estate, and Business services, GDP factor cost
manuf – Manufacturing, GDP factor cost
min – Mining, GDP factor cost
tdp – Total domestic product, GDP factor cost
thr – Trade, Hotel and Restaurant, GDP factor cost
currencyWithPublic – Total currency with Public
m3 – Money supply, also referred to as stock of legal currency in the economy
timeDepositsWithBank – Total time deposits with the bank
totalIncome – Total income of RBI
totalExpenditure – Total expenditure of RBI
netAvailableBalance – Net available balance in RBI
surplusToCentralGovernment – Surplus payable to central government from RBI
totalIssuesLiabilities – Total liabilities, Issues
totalIssuesAssets – Total assets, Issues
totalBankingLiabilities – Total liabilities, Banking
totalBankingAssets – Total assets, Banking
reserveMoneyLiabilities – Reserve Money, Liabilities
reserveMoneyAssets – Reserve Money, Assets
forwardCashSpot – Forward Cash Spot, USD forward premia
forwardCashOneMonth – Forward Cash one month, USD forward premia
forwardCashThreeMonth – Forward Cash three months, USD forward premia
forwardCashSixMonth – Forward Cash six months, USD forward premia
forwardCash12Month – Forward cash twelve months, USD forward premia
referenceRate – RBI reference rate for USD
rate – US interest rate
quantitiy – Quantity of particular stock traded
turnover – Total turnover of stock traded

We collected the data for the above metrics and established their relationship with the following stock-specific data.

Previous day close
Day open
Day high
Day low
Day close

Since there are a very large number of input variables related to economic indicators, which may be heavily correlated with each other, factor analysis was used to reduce them to a manageable set of features that later served as inputs to the neural network used to build the prediction model. For each company, four models were constructed as follows.

1D model, which would predict the prices for the next day given the stock price, turnover, and quantity for the previous day
7D model, which would make predictions given the stock price, turnover, and quantity from a week earlier
15D model, which would make predictions 15 days down the line
180D model, which would make predictions six months down the line given the stock price for a day
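The four models differ only in how far ahead the target lies. A sketch of how the training pairs for a given horizon could be built from a daily price series follows. It assumes that the horizon is counted in rows of the series and that the target is the closing price. The field names and the function name are illustrative.

```python
def lagged_pairs(rows, horizon):
    """Pair each day's stock inputs with the closing price `horizon` rows later.

    rows: list of dicts ordered by date with 'close', 'turnover', 'quantity'.
    """
    pairs = []
    for t in range(len(rows) - horizon):
        past = rows[t]
        features = (past["close"], past["turnover"], past["quantity"])
        target = rows[t + horizon]["close"]
        pairs.append((features, target))
    return pairs


# Made-up series of ten days, for illustration only.
series = [{"close": 100.0 + d, "turnover": 1000.0 * d, "quantity": 10 * d}
          for d in range(10)]

one_day = lagged_pairs(series, horizon=1)    # 1D model
one_week = lagged_pairs(series, horizon=7)   # 7D model
print(len(one_day), len(one_week), one_week[0])
```

Longer horizons leave fewer training pairs, so the 180D model needs a correspondingly long price history.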

After factor analysis, the 96 inputs are reduced to 20 factors that explain 95% of the variance. The factors are listed below. The later factors mostly cover residual values left over from the earlier ones.

Factor 1 – RBI influence and Core sector
Factor 2 – Foreign Exchange and Crude
Factor 3 – Agriculture, Total Domestic Product
Factor 4 – Company Financials
Factor 5 – Company Ratios
Factor 6 – Agriculture, Community services, debt structure with RBI
Factor 7 – Company Capital structure, profitability ratios, and other indicators
Factor 8 – Banking system residuals
Factor 9 – Company Liquidity Ratios
Factor 10 – Company stock performance
Factor 11 – RBI balance sheet debt structure and errors
Factor 12 – RBI balance sheet errors
Factor 13 – Company indicators (residuals)
Factor 14 – Banking system residuals
Factor 15 – Company financial ratios, Residuals
Factor 16 – Foreign Exchange, Crude and interest rate, Residuals
Factor 17 – Company Financial Ratios, Residuals part 1
Factor 18 – Company Financial Ratios, Residuals part 2
Factor 19 – USD Forward Spot rate
Factor 20 – IDBI lending rate and crude prices
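The post does not describe how the cut-off of 20 factors was chosen. One common criterion, shown here only as an illustration, is to keep the smallest number of factors whose eigenvalues account for the target share of the total variance. The eigenvalues below are made up, and the function name is hypothetical.

```python
from itertools import accumulate


def factors_needed(eigenvalues, target=0.95):
    """Smallest k such that the k largest eigenvalues explain `target` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a correlation matrix with six variables.
eigs = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
k = factors_needed(eigs)
print(k)
```

With these made-up values, the first three eigenvalues explain 90% of the variance and the first four explain 96%, so four factors are kept.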

The companies in the NSE-50 index were considered.

Design of neural networks

Inputs and outputs

The economic indicators for each company's model have been reduced to 20 factors that explain most of these numbers. Three additional inputs are company-specific and relate to the company's past stock price data.

Previous Close
Previous Turn Over
Previous Quantity

These make up the 23 variables used as inputs to the neural network. Three different neural networks are used, one for each of the following three output variables.

High
Low
Close

Hidden Layers

Given the richness of the data, it is assumed that at least 2 hidden layers are required to form a meaningful neural network. Each network has 23 inputs and 1 output. Different networks are created, and each is trained for 1500 cycles over the data set. At the end of this sample run, the best network is chosen for further training.

The neural networks that were evaluated had

1 input layer with 23 inputs
a first hidden layer with 31 to 351 nodes
a second hidden layer with 8 to 31 nodes
1 output layer
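To get a feel for the size of this search space, the sketch below enumerates candidate topologies and counts the trainable weights and biases of a fully connected network. It assumes that every integer layer size in the stated ranges is a candidate and that each node has one bias. The post states neither, and the function names are illustrative.

```python
def parameter_count(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


def candidate_topologies(h1_sizes=range(31, 352), h2_sizes=range(8, 32),
                         n_inputs=23, n_outputs=1):
    """Yield (inputs, hidden 1, hidden 2, outputs) for every size combination."""
    for h1 in h1_sizes:
        for h2 in h2_sizes:
            yield (n_inputs, h1, h2, n_outputs)


n_candidates = sum(1 for _ in candidate_topologies())
example = (23, 130, 17, 1)
print(n_candidates, parameter_count(example))
```

Under these assumptions there are 7,704 candidate topologies, and a 23-130-17-1 network has 5,365 trainable parameters.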

The network with 130 nodes in the first hidden layer and 17 nodes in the second gave the best error values and was selected for further use. I used the libraries provided by Joone. The following are the main fragments of code for this exercise.

Results

Prediction Six Months

Conclusion

This exercise concludes that there is merit in using neural networks to understand and predict the behavior of markets, but with some caution. The following points are important to keep in mind if this model is used for investment decisions.

The model does not return profitable results for very short-duration trades; the investor should have an investment horizon of more than 6 months for the model to work properly
The model does not guarantee that all the trades would be profitable but overall there is a better chance of profits
Stocks with less volatility perform better in model-based prediction

For more details please refer to the report attached below.

Using Neural Nework to Explain Behavior of Indian Markets from vavasthi