A novel fuzzy document-based information retrieval scheme (FDIRS)
Applied Informatics volume 3, Article number: 2 (2016)
Abstract
Information retrieval systems are generally used to find the documents that best match a query posed dynamically by a user. In this paper, a novel fuzzy document-based information retrieval scheme (FDIRS) is proposed for the purpose of stock market index forecasting. The novelty of the proposed approach is the use of a modified tf-idf scoring scheme to predict the future trend of the stock market index. The contribution of this paper has two dimensions: (1) in the proposed system, the simple daily time series data are converted to an enriched fuzzy linguistic time series with a unique approach of incorporating information about the manner in which the OHLC (open, high, low, and close) price formation took place at every instance of the time series, and (2) a unique approach is followed while modeling the information retrieval (IR) system, which converts a simple IR system into a forecasting system. The modified IR system provides a trend forecast, after which a crisp value is generated that serves as the forecast value expected to be reached in the next few trading sessions. A performance comparison of FDIRS with standard benchmark models suggests that the proposed model has the potential to become a good forecasting model. Transaction data of the CNX NIFTY50 index of the National Stock Exchange of India are used to experiment with and validate the proposed model.
Background
Prediction or forecasting is both an art and a science. The process and outcome of forecasting have long been a matter of research, and the field is still in its infancy. We can devise numerous ways of modeling a phenomenon and predicting its outcome, but there is no universal method with which every phenomenon can be modeled. Modeling linear systems is comparatively simpler than modeling dynamical systems. Stock markets are chaotic, dynamic systems that are both time and sentiment driven. The time series generated from stock market data can only represent a financial time series of prices; it cannot represent the overall sentiment of the market players who trade and invest in the stock markets. Hence, modeling stock market data is one of the toughest tasks, as it should incorporate not only the data but also market sentiment. Stock market data are a series of prices observed at certain time intervals (minutes, hours, days, weeks, etc.). Data mining is an effective tool with which the past behavior of the price movement can be modeled to predict the future. Fuzzy logic is an effective tool with which the market sentiment can be captured and modeled. By adopting a hybrid approach that combines time series, data mining, and fuzzy logic, an effective system can be built to model stock market price data that gives information not only about price but also about the market sentiment, or the mood of the market participants.
The stock market offers opportunities to gain both from rising prices and from falling prices. In stock markets, there are only two forces, namely the bulls and the bears. Bulls are those traders who always want the market prices to go higher and gain profit from rising prices. Bears are those traders who always want the market prices to go lower and gain profit from falling prices. If bulls outnumber the bears, then the market sentiment becomes bullish and we can see the market prices rising. Similarly, if bears outnumber the bulls, then the market sentiment becomes bearish and we can see the market prices falling. When bulls and bears are unable to overpower each other, the market becomes neutral and we can see market prices move in a range-bound fashion, without a specific trend.
Stock market prediction is one of the most researched and discussed fields because of its importance in commercial applications and its attractive benefits. Forecasting in itself is intriguing, and when money is involved the interest in it increases manyfold. Financial time series are the toughest to forecast, as the way such a time series is modeled governs the quality of the results achieved. The same financial time series yields better results when it is modeled appropriately than when it is taken as it is.
Soft computing presents us with a wide variety of options to model any dynamic system as it is adapted from physical science. Problem solving through appropriate modeling of the observed system using soft computing and artificial intelligence is very effective. These systems are intelligent and tolerant of imprecision and uncertainty, making them well suited to noisy realms. Soft computing encompasses three key areas: probabilistic reasoning, neural networks, and fuzzy logic. The fuzzy logic area of soft computing is adopted in the proposed model. The ability of a fuzzy logic system to capture the market sentiment from the price helped to build a linguistic time series that represents the actual time series while exposing and extracting a lot of hidden information from the same crisp time series.
Information retrieval (IR) systems try to find the most appropriate and relevant documents for a given query. This quality of IR systems helped to build a model that suggests the most appropriate future trend. A novel fuzzy document-based information retrieval scheme (FDIRS) is proposed for the purpose of stock market index forecasting. In the proposed system, the entire document corpus is generated using a fuzzification process, and queries containing fuzzy terms are processed by the proposed system to fetch the most appropriate document from the document corpus. The novelty of the approach followed here is that the trend is represented as a document and the query consists of the fuzzy linguistic terms that represent the current state of the financial time series. This approach gives an entirely new dimension of looking at how traditional IR systems are used. The tf-idf value of the terms is used to complete the task of forecasting.
The contribution of this paper has two dimensions: (1) In the proposed system, the simple daily time series is converted to an enriched fuzzy linguistic time series with a unique approach of incorporating information about the manner in which the OHLC price formation took place at every instance of the time series, and (2) A unique approach is followed while modeling the information retrieval (IR) system which converts a simple IR system into a forecasting system. Transaction data of CNX NIFTY50 index of National Stock Exchange of India are used to validate the proposed model.
About Japanese candlestick theory
Japanese candlestick charts are a combination of a line chart and a bar chart. According to Japanese candlestick theory, the area between the trading session’s open and close values represents the body of the candle. The low and high values represent the extreme ends emerging from the body of the candle, called the wicks or shadows of the candle. Figure 1 illustrates a typical candlestick formation: when the trading session’s close value is lower than the open value, the candle is filled with a dark color, and when the close value is higher than the open value, the candle is filled with white.
Japanese candlesticks present us with more than one dimension to understand the current market condition. The first dimension is the price; when close is higher than the open then the market is moving upwards i.e., the buyers are outnumbering the sellers and the trend is bullish causing the market values to go up. Similarly, when close is lower than open then the market is moving downwards i.e., the sellers are outnumbering the buyers and the trend is bearish causing the market values to go down. The second dimension is the length of the body of the candlestick formation; if the body length is very big then the market sentiment is very strong whether bullish or bearish and if the body length is small then there is some kind of uncertainty or indecision in the market and the market would try to attain some direction in the coming trading sessions.
About fuzzy logic theory
According to Zadeh (1965), a fuzzy set A over a universe of discourse X is a set of pairs:
\(A = \{(x, \mu_{A}(x)) \mid x \in X\}\)
where \(\mu_{A}(x)\) is called the membership degree of the element x to the fuzzy set A. This degree ranges between the extremes 0 and 1:

μA(x) = 0 indicates that x in no way belongs to the fuzzy set A.

μA(x) = 1 indicates that x completely belongs to the fuzzy set A.
In the proposed model, the concept of fuzzy logic is implemented to capture the approximate nature of the fuzzy candlestick time series.
About tf-idf scheme
Manning et al. (2008) explain in their book how the tf-idf technique is useful in information retrieval. In information retrieval systems, the main intention is to retrieve the document that is most relevant to the query posed. The query and the documents both consist of terms. Terms are the words that we use in our day-to-day speaking and writing. A scheme known as tf-idf (term frequency and inverse document frequency) is used to assign weights to the documents according to the query. The terms present in the query and the terms present in the documents are used as the basis of the calculations done in this scheme. The document corpus, or simply corpus, is the collection of all the documents available for evaluation.
A document consists of lines of text, and every line consists of words, which are known as terms. Similarly, every query consists of a line of text that also contains words or terms that need to be searched for in the document corpus. In the proposed system, the query and the document consist of fuzzy linguistic values.
Term frequency \(\text{tf}_{t,d}\) is the number of times a term t appears in the document d. The log frequency weight \(\omega_{t,d}\) of the terms is simply the log of the term frequencies calculated for each term in the document. The normalized value of the log frequency weight, \(\omega_{t,d}\)(norm), is used in further calculations. The inverse document frequency \(\text{idf}_{t}\) is calculated by taking the log of the total number of documents N divided by \(\text{df}_{t}\), which is the document frequency of the term t in the specific document corpus. The normalized value of the inverse document frequency, \(\text{idf}_{t}\)(norm), is used in further calculations. The normalized values are used for the purpose of length normalization of the column vectors. Using the normalized vectors, the cosine similarity between the query vector and the document vector is calculated. In the proposed model, the tf-idf scores of the terms in the query and the document are used for the purpose of forecasting.
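The standard computation described above can be sketched as follows. This is a minimal illustration, not the proposed model: it assumes the common 1 + log10(tf) form of the log frequency weight from Manning et al. (2008), and the function names and the treatment of unseen terms are assumptions made for the example.

```python
import math
from collections import Counter

def log_tf_weights(terms):
    """Log frequency weight per term: 1 + log10(tf) (assumed variant)."""
    counts = Counter(terms)
    return {t: 1.0 + math.log10(c) for t, c in counts.items()}

def idf(term, corpus):
    """Inverse document frequency log10(N / df_t); 0.0 for an unseen term."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def normalize(vec):
    """Length-normalize a sparse vector (dict) to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def cosine_score(query_terms, doc_terms, corpus):
    """Cosine similarity between an idf-weighted query and a document."""
    q = normalize({t: w * idf(t, corpus)
                   for t, w in log_tf_weights(query_terms).items()})
    d = normalize(log_tf_weights(doc_terms))
    return sum(w * d.get(t, 0.0) for t, w in q.items())
```

A term that occurs in every document has an idf of zero, so it contributes nothing to the score regardless of how often it appears in the query.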
Background and literature review
Fama (1970) introduced the Efficient Market Hypothesis, according to which stock markets are random walks and previous prices cannot be used to predict future prices; however, there is ample evidence that stock markets are predictable to a certain extent.
According to Bagheri et al. (2014), the investors and traders in the stock markets use two types of tools for forecasting; one is the fundamental analysis and second is technical analysis. Fundamental analysis uses information gathered from business and economic structure of the company and its related markets, to predict the future stock prices of the company. Technical analysis uses the information present in the stock prices from the past to predict the future. In the proposed model, our approach is purely based on technical analysis.
Zhang and Wu (2009) proposed a novel approach of combining a back-propagation neural network with an improved Bacterial Chemotaxis Optimization (IBCO) for stock market data forecasting. Hu et al. (2015) proposed a hybrid approach that combines short-term and long-term trend-following systems with an extended classifier system for the extraction of rules that select stocks by different indicators. Wang et al. (2013) proposed fuzzy time series for stock market prediction where the data are fuzzified to the cluster centers. Yu et al. (2014) suggested that the selection of the representative features in the creation of the rules is the governing factor for better forecasting results. Korol (2014) designed a fuzzy logic system that creates a knowledge base containing fuzzy rules. The fuzzy rules are created by gathering the experiences of various traders and investors. The author used 10 years of gathered experience to form the fuzzy rule base. The rules are formed on the basis of fundamental analysis done by actual traders and investors.
The authors mentioned above have used the raw time series, but in our proposed system we utilize the fuzzy attributes of every day's observations and convert the simple time series into a fuzzy linguistic time series. The idea of converting a simple numeric time series into a fuzzy linguistic time series is adapted from the system proposed by Song and Chissom (1993, 1994).
Paulevé et al. (2010) suggested that the existing information retrieval hashing schemes rely on structured quantizers which poorly fit the real data sets. The authors put forth a comparison of various space hashing functions. The authors concluded that for very large data sets query adaptive KLSH gives the highest recall for a fixed selectivity.
Salakhutdinov and Hinton (2009) proposed a model that describes a process of finding binary codes that can be used for fast document retrieval. The document is divided into layers; the lowest layer represents the word-count vector and the highest layer constitutes the binary code learnt by the proposed system. The authors used back-propagation neural networks for this purpose. Zhang et al. (2011) presented some experimental evaluations of indexing methods on text classification and observed that there is presently no standard measure to assess the semantic and statistical qualities of text.
Attia et al. (2014) proposed a linguistic-based multi-view fuzzy ontology information retrieval model that allows users to define all their linguistic terms according to their subjective view, which helps in retrieving documents according to the users' own term definitions rather than the system designers' definitions. The resulting documents are ranked according to user-defined criteria. Gupta et al. (2015) proposed a new ranking function for information retrieval using fuzzy logic. The use of fuzzy logic increases the performance of the system. The fuzzy system incorporates term frequency, inverse document frequency, and normalization.
The motivation for the proposed model came from the above literature survey and further literature studies, in which no paper was found that suggested the use of information retrieval schemes such as tf-idf for stock market trend forecasting. This gap became the motivating factor for using an information retrieval scheme such as tf-idf as a stock market forecasting system. In our approach, the log frequency weight of the terms, which is the log of the term frequencies calculated for each term in the document, is used as the forecasting element.
From the literature review, the following conclusions were drawn:

(i)
It was found that forecasting is a complex process especially for financial time series.

(ii)
A forecasting algorithm can generate more accurate results only if the information that a time series contains is fully extracted.

(iii)
The purpose of information retrieval schemes used at present is limited to assigning scores to the documents and identifying the most appropriate document from the corpus according to the query. Presently, they are not used for any kind of forecasting purposes.
Research design
From the conclusions drawn through the literature review process, the following research design steps emerged:

(i)
The time series needs to be modified so that maximum possible information could be incorporated in it. Hence, maximum possible information represented using Japanese candlestick charts of the financial time series is to be fuzzified, because by using fuzzy logic the hidden information present in the candlestick charts, related to the market sentiments can be deciphered. Hence the proposed representation of financial time series is more information rich than any other way of representation.

(ii)
The information retrieval schemes have a latent property of predicting the most appropriate document based on the query posed; this latent property can be extracted out by modifying the information retrieval scheme so that it can be used as a forecasting tool.
The methodology consists of three phases. In phase-1, the fuzzification of the stock market index time series data is done. The raw data are the open, high, low, and close values of every day, together abbreviated as OHLC values. The OHLC values are in turn represented in the form of Japanese candlestick charts. The time series data are converted to a fuzzy linguistic time series containing information-enriched fuzzy time series elements. In phase-2, the IR model is prepared using a modified tf-idf approach. The fuzzy information-enriched time series is used to develop the fuzzy document corpus; simultaneously, fuzzy queries are developed for use in the information retrieval process. In phase-3, the modified tf-idf scheme is used to pose queries to the proposed IR model for forecasting. The fuzzy query processing is done using the tf-idf information retrieval scheme. Modifications are made to the generation of documents and to the implementation of the traditional tf-idf scheme, resulting in a fuzzy document-based information retrieval system. The results achieved through these processes give the forecasted output. Figure 2 represents the proposed methodology that is used while implementing the research process.
Methods
Figure 3 represents the candlestick chart in which the points ‘p’ and ‘p1’ are indicated. The point ‘p’ is representative of an observation in the time series whose previous and later values are known. The point ‘p1’ in the time series is representative of a point from where we have access to information only prior to ‘p1’ not after that and desire to forecast the time series after ‘p1.’ The information about these two points is mentioned in the phases described below.
The details of the phases are as follows:
Phase-1: Fuzzification process:

(i)
Fuzzify the OHLC values of daily observations in the time series by fuzzification of the following attributes: upper shadow (US), body (BD), lower shadow (LS), and candle color (CC) for each day of observation (Fig. 1). The information contained in US, BD, LS, and CC are necessary to enrich the time series because the size of the ‘Upper Shadow’ represents the sentiment of buyers (also known as bulls) in the market who are trying to pull the values in the upward direction, the size of the ‘Lower Shadow’ represents the sentiment of sellers (also known as bears) in the market who are trying to pull the values in the downward direction, the size of the ‘Body’ represents the intensity of the market sentiment and ‘Candle Color’ represents whether the sentiment is getting bullish or bearish, so if the candle color is black then sellers are gaining on buyers (bearish sentiment is increasing) and if candle color is white then buyers are gaining on sellers (bullish sentiment is increasing).

(ii)
Fuzzify the trend of closing values before and after a particular point ‘p’ (Fig. 3) in the time series into three fuzzy categories of trend namely BR (bearish—values going down i.e., the sellers are gaining on buyers causing the values to go down), NT (neutral—values remaining range bound i.e., the sellers and buyers are in a tie and no one is able to take the market into a particular direction) and BL (bullish—values going up i.e., the buyers are gaining on sellers causing the values to go up).
Phase-2: Information retrieval system using modified tf-idf scoring scheme:

(i)
The trend formed after the point ‘p’ will be any one of BR, NT, or BL. These form the three categories of documents BR, NT, and BL. Every entry in the documents BR, NT, and BL is in turn considered an individual document. This method is a unique approach and is different from the traditional tf-idf scheme.

(ii)
The terms in the IR system consist of two fuzzy linguistic elements: first, the trend (BR, NT, or BL) formed up to the point ‘p’ and second, the fuzzy attributes (US, BD, LS, CC) of the candle formed on the day ‘p’ in the time series.
Phase-3: Forecasting process using modified tf-idf scoring scheme:

(i)
The query consists of two fuzzy terms: first, the trend formed up to any point ‘p1’ (Fig. 3) in the time series and second, the fuzzy attributes (US, BD, LS, CC) of the candle (price bar) formed on the day ‘p1.’

(ii)
The tf-idf weight of the documents (BR, NT, BL) with respect to the terms in the query is calculated.

(iii)
The document with the highest tf-idf weight represents the most probable trend in the future that we can expect after the point ‘p1’ in the time series.

(iv)
According to the obtained trend information, a value is generated from the last closing value. When the forecasted trend is BR (bearish) or NT (neutral), the forecasted value is calculated as Close − (0.005*Close). When the forecasted trend is BL (bullish), the forecasted value is calculated as Close + (0.005*Close). Through many experiments with different multipliers, 0.005 was chosen as the most appropriate. Other values of the multiplier can be explored experimentally.
In the following sections, the details of the phase-wise implementation are described.
Phase-1: Fuzzification process
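The crisp-value step in (iv) can be sketched as below. The function name and the error raised for an unrecognized label are illustrative assumptions; the arithmetic follows the rule stated above.

```python
def forecast_value(last_close, trend, multiplier=0.005):
    """Crisp forecast from the last close and the forecasted trend label.

    BL (bullish) adds multiplier * Close; BR (bearish) and NT (neutral)
    both subtract it, as stated in step (iv).
    """
    if trend == "BL":
        return last_close + multiplier * last_close
    if trend in ("BR", "NT"):
        return last_close - multiplier * last_close
    raise ValueError(f"unknown trend label: {trend!r}")
```

For a last close of 8000, a bullish forecast gives 8000 + 40 = 8040, while a bearish or neutral forecast gives 8000 − 40 = 7960.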
Fuzzification of the candlestick formations in the time series
Four attributes of every day's candlestick are used: the lengths of the upper shadow, lower shadow, and real body, and the color of the candlestick. Collectively these are converted into a fuzzy linguistic representation. These attributes are selected because they represent the market sentiment closely (as mentioned above in the phase-1 description). Representing every candlestick in the time series using fuzzy values converts a simple time series into an information-rich fuzzy linguistic time series.
The candlestick bars formed can be broadly categorized as Indecisive, Bearish, and Bullish types. An indecisive type of candlestick bar looks similar to the one presented in Fig. 4a, where the attributes of the candlestick bar show very small upper shadow indicating that the buyers are trying to push the market up so the sentiment is bullish; very large lower shadow indicates that buyers are pushing markets up so the sentiment is bullish; tiny real body indicates that the intensity of the sentiment is weak; the color of the real body is black that represents bearish sentiment. In overall perspective, this type of bar formation indicates an indecision in the market and the market can go in any direction (indecisive) from here.
A bearish type of candlestick bar looks similar to the one depicted in Fig. 4b, where the length of the upper shadow is small indicating the sentiment is bullish; the length of the lower shadow is small indicating the sentiment is bearish; the length of the real body is very big indicating that intensity of the market sentiment is strong and finally the color of the real body is black indicating bearish sentiment. In overall perspective this type of bar formation represents a market condition where the sellers are gaining on buyers and may cause the market to go down (bearish).
A bullish type of candlestick bar looks similar to the one drawn in Fig. 4c, where the length of the upper shadow is small indicating that the sentiment is bullish; the length of the lower shadow is small indicating that the sentiment is bearish; the length of the real body is very big indicating that intensity of the market sentiment is strong and in addition the color of the real body is white indicating bullish sentiment. In totality this type of bar formation represents a market where the buyers are gaining on sellers and may cause the market to go up (bullish).
To qualify three attributes of every candlestick, i.e., the lengths of the upper shadow, lower shadow, and real body, five fuzzy linguistic values are used, namely: (1) Tiny, (2) VerySmall, (3) Small, (4) Big, and (5) VeryBig. A binary representation, B or W, is used for the color of the candlestick (CC), denoting black and white, respectively. For example, a candlestick whose fuzzy representation is TNYTNYBGW would be interpreted as follows: the length of the upper shadow is tiny (TNY), the length of the lower shadow is tiny (TNY), the length of the real body is big (BG), and the color of the candlestick is white (W). The fuzzy arithmetic for the above representations is as follows:
Let \(X_{i}^{j}\) represent the value j on the ith day, where j is one of OP, HI, LO, or CL (the open, high, low, and close values, respectively).
\(D_{i}^{jk}\) represents the non-negative distance between the j and k values (each one of the open, high, low, or close values) on the ith day.
The color of the candlestick is determined by the difference between close and open, represented by Eq. (2), where \(C_{i}\) is the color of the ith candlestick.
The crisp value for the Real Body attribute of the ith candlestick is represented as \(D_{i}^{\text{OPCL}}\), which is determined by Eq. (3); it is the non-negative difference between the open and close values of each day.
The universe of discourse U is chosen as the collective average of the distance between the open and close values of every day in the considered range of consecutive observations. The Universe of discourse will be determined by AD^{OPCL} in Eq. (4) representing the average of the difference between the open and close values for n consecutive observations. The difference of open and close values is taken because they represent crucial sentimental strength of the market direction. The value of n should be taken as required; for our experimentation the value of n is 7.
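The crisp candlestick quantities can be sketched in code as follows. The color, body, and averaging functions follow the verbal definitions above. The two shadow functions use the usual candlestick convention (distance from the body to the high or low), which is an assumption about how the shadow lengths are computed, as are the class and function names, the treatment of close equal to open as black, and the choice of the last n candles as the averaging window.

```python
from dataclasses import dataclass

@dataclass
class Candle:
    open: float
    high: float
    low: float
    close: float

def color(c):
    """'W' when close is above open, otherwise 'B' (tie handling assumed)."""
    return "W" if c.close > c.open else "B"

def body(c):
    """Real body: non-negative distance between open and close."""
    return abs(c.open - c.close)

def upper_shadow(c):
    """Assumed convention: high minus the top of the body."""
    return c.high - max(c.open, c.close)

def lower_shadow(c):
    """Assumed convention: bottom of the body minus low."""
    return min(c.open, c.close) - c.low

def average_body(candles, n=7):
    """Average body length over the last n candles (universe of discourse)."""
    window = candles[-n:]
    return sum(body(c) for c in window) / len(window)
```

With n = 7, as used in the experiments, `average_body` needs at least one candle and simply averages over fewer when the series is shorter than the window.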
The crisp value for the upper shadow attribute of the ith candlestick, represented as \({\text{US}}_{i}\), is determined by Eq. (5).
where
The crisp value for the lower shadow attribute of the ith candlestick, represented as \({\text{LS}}_{i}\), is determined by Eq. (8).
where
To convert crisp values to fuzzy linguistic terms, we use the following membership functions:
A graphical representation of the combination of Z function, Triangular function, and Inverse Z function used as membership functions in our proposed system is presented through Fig. 5. Membership functions other than proposed ones such as trapezoidal or sigmoidal functions can also be used to serve the purpose.
The x-axis represents the crisp value of whichever candlestick attribute, \(D_{i}^{\text{OPCL}}\), \({\text{US}}_{i}\), or \({\text{LS}}_{i}\), is being converted to a fuzzy linguistic representation, and the y-axis represents the equivalent membership grades in the fuzzy linguistic categories, namely tiny (TNY), very small (VS), small (SM), big (BG), and very big (VB), which are realized by Eqs. (11) to (15) (mentioned later in this section). The values of a, b, c, d, and e are taken as 15, 30, 45, 60, and 75% of \(\text{AD}^{\text{OPCL}}\), respectively; these values were found to be most appropriate after performing a series of experiments. The percentage values are set for experimental purposes and can be changed, but the changed values should be kept constant throughout the experiment.
The mathematical representation of Fig. 5 is as follows:
For generating the fuzzy linguistic values from crisp values, we have devised a function FUZZY(x). The output of the FUZZY(x) function is the fuzzy linguistic equivalent of the crisp input x given to the function; it uses the output generated from Eqs. (11) to (15). The fuzzy linguistic values generated from FUZZY(x) are used in the fuzzy rules R1 to R10 in the following section. The FUZZY(x) function is depicted through the following Eq. (16).
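One way to realize a FUZZY(x)-style function is sketched below. It is a hedged illustration only: the piecewise-linear shoulder shapes used for the Z and inverse Z functions, the placement of the triangle peaks at b, c, and d, the selection of the label with the highest membership grade, and all function names are assumptions. Only the five labels and the 15 to 75% breakpoints come from the text.

```python
LABELS = ("TNY", "VS", "SM", "BG", "VB")

def breakpoints(avg_body):
    """a..e at 15, 30, 45, 60 and 75% of the average body length."""
    return tuple(p * avg_body for p in (0.15, 0.30, 0.45, 0.60, 0.75))

def triangle(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def memberships(x, avg_body):
    """Membership grade of crisp x in each category (avg_body must be > 0)."""
    a, b, c, d, e = breakpoints(avg_body)
    # Linear left shoulder (Z-like): 1 up to a, falling to 0 at b.
    tny = 1.0 if x <= a else max(0.0, (b - x) / (b - a))
    # Linear right shoulder (inverse-Z-like): 0 up to d, rising to 1 at e.
    vb = 1.0 if x >= e else max(0.0, (x - d) / (e - d))
    return {
        "TNY": tny,
        "VS": triangle(x, a, b, c),
        "SM": triangle(x, b, c, d),
        "BG": triangle(x, c, d, e),
        "VB": vb,
    }

def fuzzy(x, avg_body):
    """Label with the highest grade; ties go to the smaller category."""
    grades = memberships(x, avg_body)
    return max(LABELS, key=lambda label: grades[label])
```

Applying `fuzzy` to the upper shadow, lower shadow, and body of a candle and appending the color letter would yield a code of the TNYTNYBGW form described earlier.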
Fuzzification of the trend of closing values before and after a particular point ‘p’ in the time series
In the proposed model, the difference between the closing price at the observation day ‘p’ (Fig. 3) and closing price of third day after ‘p’ as the measure for the market direction is considered. This difference is fuzzified in the following manner.
where \(DX_{i+3}^{\text{CL}}\) is the closing price on the third day after day ‘p,’ and \(DX_{i}^{\text{CL}}\) is the closing price on the ith day, i.e., the point ‘p’ in the time series. \(B_{i}\) represents the market bias: if the difference between the closing price 3 days after ‘p’ and the closing price on day ‘p’ is a positive number, the market bias is considered positive because prices are climbing; otherwise it is considered negative. \(M_{i}\) represents the magnitude of the market bias present after point ‘p’ in the time series. This magnitude is converted to fuzzy linguistic market momentum categories \(\text{FM}_{i}\) by using the fuzzy rules R1 to R7 (mentioned later in this section), where \(\text{FM}_{i}\) is the fuzzy value of the momentum recognized by the fuzzy rule and FUZZY(x) is the function that converts the crisp value x in the input argument into the equivalent fuzzy linguistic term using Eqs. (11) to (16).
The trend that the market has assumed or the sentiment of the market is represented using fuzzy linguistic terms namely, (1) Extremely Bearish, (2) Very Bearish, (3) Bearish Neutral, (4) Neutral, (5) Bullish Neutral, (6) Very Bullish, and (7) Extremely Bullish. Here, ‘Bearish’ word represents the situation where the market sentiment is in selling mood and prices are going down; ‘Bullish’ word represents the situation where the market sentiment is in buying mood and prices are going up and ‘Neutral’ word represents the situation where the market sentiment is indecisive and prices are not moving in any particular direction. The adjectives ‘very’ and ‘extremely’ help represent the market sentiment to a higher degree of accuracy.
 R1::

IF (B _{ i } = Positive OR B _{ i } = Negative) AND FUZZY(M _{ i }) IS TNY THEN FM_{ i } IS Neutral
 R2::

IF B _{ i } = Positive AND FUZZY(M _{ i }) IS VS THEN FM_{ i } IS Bullish Neutral
 R3::

IF B _{ i } = Negative AND FUZZY(M _{ i }) IS VS THEN FM_{ i } IS Bearish Neutral
 R4::

IF B _{ i } = Positive AND FUZZY(M _{ i }) IS BG THEN FM_{ i } IS Very Bullish
 R5::

IF B _{ i } = Negative AND FUZZY(M _{ i }) IS BG THEN FM_{ i } IS Very Bearish
 R6::

IF B _{ i } = Positive AND FUZZY(M _{ i }) IS VB THEN FM_{ i } IS Extremely Bullish
 R7::

IF B _{ i } = Negative AND FUZZY(M _{ i }) IS VB THEN FM_{ i } IS Extremely Bearish
Now the final market direction MD_{ i } is set using the fuzzy rules R8 to R10.
 R8::

IF FM_{ i } IS Bearish Neutral OR FM_{ i } IS Very Bearish OR FM_{ i } IS Extremely Bearish THEN MD_{ i } IS DN
 R9::

IF FM_{ i } IS Bullish Neutral OR FM_{ i } IS Very Bullish OR FM_{ i } IS Extremely Bullish THEN MD_{ i } IS UP
 R10::

IF FM_{ i } IS Neutral THEN MD_{ i } IS NT.
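The rule base R1 to R10 can be sketched as a lookup table. This is an illustration under stated assumptions: the function names and the "+"/"-" encoding of the bias are hypothetical, the strongest momentum category is written as VB (very big), and the SM (small) category is left unmapped because none of the listed rules covers it.

```python
# (bias, fuzzy magnitude label) -> fuzzy momentum FM, following R1 to R7.
FM_RULES = {
    ("+", "TNY"): "Neutral",            # R1
    ("-", "TNY"): "Neutral",            # R1
    ("+", "VS"): "Bullish Neutral",     # R2
    ("-", "VS"): "Bearish Neutral",     # R3
    ("+", "BG"): "Very Bullish",        # R4
    ("-", "BG"): "Very Bearish",        # R5
    ("+", "VB"): "Extremely Bullish",   # R6
    ("-", "VB"): "Extremely Bearish",   # R7
}

def bias_and_magnitude(close_p, close_p3):
    """Bias sign and magnitude from the close at 'p' and 3 days after 'p'."""
    diff = close_p3 - close_p
    # A non-positive difference counts as negative bias, as in the text.
    return ("+" if diff > 0 else "-"), abs(diff)

def fuzzy_momentum(bias, label):
    """FM for a bias and magnitude label; None when no listed rule applies."""
    return FM_RULES.get((bias, label))

def market_direction(fm):
    """Final direction MD following R8 to R10: DN, UP, or NT."""
    if fm is None:
        return None
    if fm == "Neutral":
        return "NT"
    return "UP" if "Bullish" in fm else "DN"
```

In use, the magnitude returned by `bias_and_magnitude` would first be passed through a FUZZY(x)-style function to obtain the label expected by `fuzzy_momentum`.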
Using the abovementioned approach, the trend formed 3 days after the point ‘p’ will also be fuzzified and be represented as either BR, NT, or BL linguistic terms. The fuzzy rulebase will be populated with the information regarding previous trend, candlestick attributes of the ‘pth’ day and trend after ‘pth’ day. The information regarding the trend after ‘pth’ day would help us build the document model in the IR (information retrieval) system. The contents of the documents created with this model will be fuzzy rules. Table 2 in the “The entire document corpus” section presents a snapshot of the fuzzy information generated by phase1.
Phase 2: Information retrieval using modified tf-idf scoring scheme
The fuzzification process has now generated fuzzy rules, and these rules are stored in documents. The modified information retrieval (IR) system finds the most appropriate and relevant document for a given query. This property of IR systems makes it possible to build a model that suggests the most appropriate future trend. The approach is novel in two respects: first, the trend is represented as a document (containing fuzzy observations of the time series), and second, the query consists of the fuzzy linguistic terms that represent the current state of the financial time series. This usage is not present in the traditional tf-idf scheme and offers a new way of looking at how IR systems are used.
The tf-idf scoring scheme is used to complete the task of forecasting. The documents created in the modified IR system are BR, NT, and BL; each contains fuzzy observations of the time series, and each fuzzy observation is in turn considered a document. The trend formed after the point ‘p’ (Fig. 3) is one of BR, NT, or BL. The BR document consists of all fuzzy observations that have BR as the trend after the point ‘p’ in the time series, that is, the instances when the market became Bearish after point ‘p.’ Likewise, the NT document consists of all fuzzy observations that have NT as the trend after the point ‘p,’ the instances when the market became Neutral, and the BL document consists of all fuzzy observations that have BL as the trend after the point ‘p,’ the instances when the market became Bullish.
The point ‘p1’ in the time series is the point from which we wish to forecast future values. The query has only two fuzzy terms. The first term is the trend (BR, NT, or BL) that prevailed up to the point ‘p1’ (Fig. 3), and the second term is the set of fuzzy attributes (US, BD, LS, CC) of the candlestick formed on day ‘p1.’ The first term accounts for the prevailing trend up to ‘p1,’ and the second describes the type of candlestick formation at the point of observation; both pieces of information are needed to forecast the trend that may form after the observation point ‘p1.’ This treatment of the query posed to the IR system is not present in the traditional tf-idf scheme. When a query is received, the scores are calculated using the tf-idf technique, and the document with the highest score gives the forecasted trend.
Phase 3: Forecasting using modified tf-idf scoring scheme
The data
For the experiments, daily data of the CNX NIFTY50 index of the National Stock Exchange of India are used. Table 1 gives a snapshot of the data used to generate the knowledge base, or document corpus. The data range from Jan-01-1997 to Mar-25-2015.
Every row in Table 1 represents the daily open, high, low, and close values of the NIFTY index. The first column gives the date in YYYYMMDD format, and the second to fifth columns give the open, high, low, and close values of the day, respectively. Fuzzy information is extracted from every row of Table 1 using the methods presented in the previous sections. A snapshot of the fuzzy information generated by the proposed model is shown in Table 2.
The entire document corpus
The first column ‘PrevTrnd’ in Table 2 represents the previous trend that prevailed 3 days before the observation point ‘p’ (Fig. 3) in the time series; the second column ‘Candle’ represents the attributes (US, BD, LS, CC) of the candlestick formation that took place at point ‘p’; and the third column ‘FutTrnd’ represents the trend that formed 3 days after the observation point ‘p.’ The fuzzy observations formed from the time series data in Table 1 make up the knowledge base generated by the proposed model, shown in Table 2, which becomes the entire document corpus for the proposed IR system.
Every row in Table 2 is in turn considered an individual document. Once the entire document corpus is generated, it is divided into three categories of documents, namely BR, NT, and BL. BR documents contain only those entries of the corpus whose ‘FutTrnd’ is Bearish, Bearish Neutral, or Extremely Bearish, that is, the instances whose future trend after point ‘p’ was found to be Bearish in nature. Similarly, BL documents contain only those entries whose ‘FutTrnd’ is Bullish, Bullish Neutral, or Extremely Bullish, that is, the instances whose future trend after point ‘p’ was found to be Bullish in nature. NT documents contain only those entries whose ‘FutTrnd’ is Neutral. This treatment yields three categories of documents that represent the sentiment of the market and are used to forecast it.
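The division of the corpus into BR, NT, and BL documents can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the rows are invented for the example and are not taken from Table 2, ‘C1’ and ‘C2’ are placeholder candle codes, and the ‘FutTrnd’ column is assumed to hold the linguistic labels listed above.

```python
def category(fut_trend):
    """Map a FutTrnd linguistic label to its document category."""
    if fut_trend == "Neutral":
        return "NT"  # checked first: 'Bearish Neutral' also contains 'Neutral'
    if "Bearish" in fut_trend:
        return "BR"
    if "Bullish" in fut_trend:
        return "BL"
    raise ValueError(f"unknown trend label: {fut_trend!r}")


def partition(corpus):
    """Split (PrevTrnd, Candle, FutTrnd) rows into BR, NT, and BL documents."""
    docs = {"BR": [], "NT": [], "BL": []}
    for prev_trend, candle, fut_trend in corpus:
        docs[category(fut_trend)].append((prev_trend, candle))
    return docs


def term_frequencies(rows, prev_trend, candle):
    """Count how often each of the two query terms occurs in one category."""
    return {
        prev_trend: sum(1 for p, _ in rows if p == prev_trend),
        candle: sum(1 for _, c in rows if c == candle),
    }


# Illustrative rows only (assumption): not the contents of Table 2.
corpus = [
    ("BL", "TNYTNYTNYW", "Neutral"),
    ("BL", "C1", "Very Bullish"),
    ("BR", "TNYTNYTNYW", "Neutral"),
    ("BR", "C2", "Extremely Bearish"),
    ("NT", "C1", "Bearish Neutral"),
]
docs = partition(corpus)
print(term_frequencies(docs["NT"], "BL", "TNYTNYTNYW"))
```

The counts returned by `term_frequencies` correspond to the raw term frequencies that feed the scoring step for each category.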
The query and forecasting
Now that all the documents are in place, the query to be given to the proposed system can be designed. The query has only two fuzzy terms: the first is the trend that prevailed before point ‘p1,’ and the second is the set of attributes of the candlestick formed at point ‘p1’ in the time series. When a query is received, the scores are generated using the tf-idf-based calculations, and the document with the highest score gives the forecasted trend.
For example, if we pose a query to the system with two fuzzy terms, term1 = “BL” and term2 = “TNYTNYTNYW”, the scores are calculated by the proposed system as shown in Tables 3, 4, and 5 for the documents BR, NT, and BL, respectively. The columns of these tables are described below.
The ‘TERM’ column represents the query terms, which are fuzzy linguistic terms and they collectively represent the query vector. In the example, there are two query terms, the first one is ‘BL’ which represents the previous trend that prevailed prior to the point of observation ‘p1’ and the second one is ‘TNYTNYTNYW’ which represents the attributes of the candlestick formed at the point of observation ‘p1’ from where the future is to be predicted.
The ‘TF’ column contains the term frequency \(tf_{t,d}\), i.e., the number of times each query term appears in the respective document.
The ‘TFlog’ column contains the log frequency weight \(\omega_{t,d}\) of the terms using the following Eq. (19):
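In its standard textbook form, assuming base-10 logarithms (the base is an assumption), the log frequency weight is:

\[
\omega_{t,d} =
\begin{cases}
1 + \log_{10} tf_{t,d}, & \text{if } tf_{t,d} > 0 \\
0, & \text{otherwise}
\end{cases}
\]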
The ‘TFSQUARE’ column holds the squared value of ‘TFlog,’ \((\omega_{t,d})^{2}\), for the purpose of normalization. The ‘NORMTF’ column holds the normalized value of ‘TFlog,’ \(\omega_{t,d}(\mathrm{norm})\), for the purpose of length normalization of the column vector, obtained from the squared values in the ‘TFSQUARE’ column using the following Eq. (20):
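Taken as ordinary Euclidean length normalization over the query terms \(t'\), which is an assumption about the exact form, the normalized weight can be written as:

\[
\omega_{t,d}(\mathrm{norm}) = \frac{\omega_{t,d}}{\sqrt{\sum_{t'} (\omega_{t',d})^{2}}}
\]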
The ‘IDF’ column represents the inverse document frequency \(idf_{t}\), which is calculated by the following Eq. (21):
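In its standard form, again assuming a base-10 logarithm, the inverse document frequency is:

\[
idf_{t} = \log_{10} \frac{N}{df_{t}}
\]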
where \(df_{t}\) is the document frequency of the term ‘t’ in the specific document corpus and N is the total number of documents in the entire document corpus.
The ‘IDFSQUARE’ column holds the squared value of ‘IDF,’ \((idf_{t})^{2}\), for the purpose of normalization. The ‘NORMIDF’ column holds the normalized value of ‘IDF,’ \(idf_{t_{(norm)}}\), for the purpose of length normalization of the column vector, obtained from the squared values in the ‘IDFSQUARE’ column using the following Eq. (22):
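Assuming the same Euclidean length normalization over the query terms \(t'\) as for the term weights, this can be written as:

\[
idf_{t_{(norm)}} = \frac{idf_{t}}{\sqrt{\sum_{t'} (idf_{t'})^{2}}}
\]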
The ‘TFIDF’ column represents the product of the normalized weights of \(tf_{t,d}\) and \(idf_{t}\). The final ‘TFIDFscore’ is the sum of the values in the ‘TFIDF’ column and represents the cosine similarity between the query vector and the document vector. The document (BR, NT, or BL) with the highest ‘TFIDFscore’ is therefore the most relevant document returned by the IR system, and it gives the forecasted trend. In the above example, the highest score of 1.00 was achieved by the document NT, whose details are given in Table 4, so the forecasted trend for the example is NT, i.e., neutral.
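The column-by-column calculation described above can be sketched in Python. This is a minimal sketch under stated assumptions: base-10 logarithms, zero weight for a term that does not occur, and term and document counts supplied by the caller. The function names and the sample counts are hypothetical and do not reproduce the values of Tables 3, 4, and 5.

```python
import math


def log_tf(tf):
    """'TFlog' column: log frequency weight (base 10 assumed)."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def idf(n_docs, df):
    """'IDF' column: inverse document frequency (base 10 assumed)."""
    return math.log10(n_docs / df) if df > 0 else 0.0


def unit(values):
    """Length-normalize a column vector; an all-zero vector stays zero."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm > 0 else 0.0 for v in values]


def tfidf_score(tf, df, n_docs):
    """'TFIDFscore': sum over query terms of NORMTF * NORMIDF."""
    terms = list(tf)
    norm_tf = unit([log_tf(tf[t]) for t in terms])
    norm_idf = unit([idf(n_docs, df[t]) for t in terms])
    return sum(a * b for a, b in zip(norm_tf, norm_idf))


def forecast_trend(scores):
    """Return the document label (BR, NT, or BL) with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical counts for one document category and a two-term query.
tf = {"BL": 1, "TNYTNYTNYW": 100}
df = {"BL": 10, "TNYTNYTNYW": 10}
print(round(tfidf_score(tf, df, n_docs=1000), 4))
```

Passing the counts in as dictionaries leaves open how the document frequency is tallied, since that choice depends on how the corpus is organized.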
The forecasted trend is converted into a crisp value as follows:

1. When the forecasted trend is BR (bearish) or NT (neutral), the forecasted value is calculated as Close − (0.005*Close).

2. When the forecasted trend is BL (bullish), the forecasted value is calculated as Close + (0.005*Close).

The multiplier 0.005 was found to be the most appropriate after experimenting with different values; other values of the multiplier can also be tried.
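The two cases above amount to a one-line adjustment of the closing value. A minimal sketch follows; the function name `forecast_value` and the sample closing value are hypothetical, while the default multiplier of 0.005 is the one reported in the text.

```python
def forecast_value(close, trend, multiplier=0.005):
    """Turn a forecasted trend (BR, NT, or BL) into a crisp forecast value."""
    if trend in ("BR", "NT"):
        return close - multiplier * close
    if trend == "BL":
        return close + multiplier * close
    raise ValueError(f"unknown trend: {trend!r}")


# Hypothetical closing value, for illustration only.
print(forecast_value(8000.0, "BL"))
```

Exposing the multiplier as a parameter makes it easy to repeat the experiment with other values.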
Results and discussion
Table 6 presents a snapshot of the output generated by the proposed model.
The ‘DATE’ column gives the date in YYYYMMDD format, and every row represents one trading day. The ‘ACTUALVALUE’ column gives the daily observed values of the NIFTY50 index from March-27-2015 to May-15-2015, and the ‘FORECASTEDVALUE’ column gives the values generated by the proposed model. The error is calculated for each row, and the resulting RMSE is 81.4774.
The performance of the proposed model is analyzed by calculating the root mean squared error (RMSE). The RMSE (also called the root mean square deviation, RMSD) is a measure frequently used to quantify the difference between the values predicted by a model and the values actually observed in the environment being modeled. The individual differences are also called residuals, and the RMSE aggregates these residuals into a single measure of predictive power. Lower values of RMSE relative to the number of observations suggest better predictability of the model.
The RMSE of a model prediction with respect to the estimated variable \(X_{model}\) is defined as the square root of the mean squared error, where \(X_{obs}\) denotes the observed values and \(X_{model}\) the modeled values at time/place i:
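Written out, with n (a symbol introduced here) denoting the number of observations, this definition reads:

\[
RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( X_{obs,i} - X_{model,i} \right)^{2}}{n}}
\]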
To verify the efficiency of the proposed model, a number of experiments were performed on the same set of crisp values with other well-known algorithms using the data-mining software WEKA 3.7.12. Table 7 compares the performance of the established benchmark models with that of the proposed model.
The performance of FDIRS (the newly proposed model) is compared with three other benchmark models, namely Holt–Winters with triple exponential smoothing, RBF Network (Normalized Gaussian radial basis function network), and Random Forest, on the basis of RMSE. The comparison is made with different categories of models so that performance can be judged more critically. The RMSE comparison is tabulated in Table 7, and Fig. 6 depicts a graphical representation of the actual values versus the predicted values.
Conclusion
In the proposed system, the simple daily time series is converted to an enriched fuzzy linguistic time series with a unique approach of incorporating information about the manner in which the OHLC price formation took place at every instance of the time series. Another unique approach is followed while modeling the information retrieval (IR) system, which converts a simple IR system into a forecasting system.
The fuzzy document-based information retrieval scheme (FDIRS) is a novel approach to Stock Market Index forecasting. The entire document corpus is generated by the fuzzification process, and queries containing fuzzy terms are processed by the modeled system to fetch the most appropriate document from the corpus. The proposed model is dynamically adaptable and uses a hybrid approach, combining fuzzy logic, time series, and an information retrieval system, to identify the direction in which the market is likely to move and then to predict a crisp value that the market may reach. The novelty of the proposed approach is the use of a modified fuzzy linguistic time series combined with an IR system to predict the future trend of the stock market index.
The proposed model generates a knowledge base that can extract and model the trend and market sentiment-related information from any stock market time series. The novel representation of the financial time series, combined with a unique information retrieval approach, has produced promising results.
For the conducted experiments, the CNX NIFTY50 index daily data obtained from the National Stock Exchange of India were used. The data range from Jan-01-1997 to May-15-2015. The CNX NIFTY50 index stocks represent about 60 % of the total market capitalization of the National Stock Exchange (NSE) of India. We used approximately 18 years of data to build the knowledge base through the proposed model, and it took less than a second to do so.
A number of experiments performed using the proposed model on CNX NIFTY50 index values show that the proposed FDIRS method performs on par with other benchmark models and has high potential of becoming a good forecasting model. The same model has been tested on individual stocks of the National Stock Exchange of India, and the forecasting performance has been found promising.
However, work is under way to improve the forecasting accuracy of the model by experimenting further with its fuzzy elements. To increase the accuracy of forecasting, elements of fundamental analysis could also be included in the proposed model.
References
Attia ZE, Gadallah AM, Hefny HM (2014) An enhanced multiview fuzzy information retrieval model based on linguistics. IERI Procedia, Elsevier 7:90–95
Bagheri A, Peyhani HM, Akbari M (2014) Financial forecasting using ANFIS networks with quantumbehaved particle swarm optimization. Expert Syst Appl 41:6235–6250
Fama EF (1970) Efficient capital markets: a review of theory and empirical work. J Finance 25:383–417
Gupta Y, Saini A, Saxena AK (2015) A new fuzzy logic based ranking function for efficient information retrieval system. Expert Systems with Applications, Elsevier 42(3):1223–1234
Hu Y, Feng B, Zhang X, Ngai E, Liu M (2015) Stock trading rule discovery with an evolutionary trend following model. Expert Syst Appl 42:212–222
Korol T (2014) A fuzzy logic model for forecasting exchange rates. KnowledgeBased Systems, Elsevier 67:49–60
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge
Paulevé L, Jégou H, Amsaleg L (2010) Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recognition Letters, Elsevier 31(11):1348–1358
Salakhutdinov R, Hinton G (2009) Semantic hashing. Int J Approx Reason, Elsevier 50(7):969–978
Song Q, Chissom BS (1993) Forecasting enrollments with fuzzy time series—Part 1. Fuzzy Sets Syst 54:1–9
Song Q, Chissom BS (1994) Forecasting enrollments with fuzzy time series—Part 2. Fuzzy Sets Syst 62:1–8
Wang L, Liu X, Pedrycz W (2013) Effective intervals determined by information granules to improve forecasting in fuzzy time series. Expert Syst Appl 40:5673–5679
Yu H, Chen R, Zhang G (2014) A SVM stock selection model within PCA. Procedia Computer Sci 31:406–412
Zadeh LA (1965) Fuzzy Sets. Inf Control 8:338–353
Zhang Y, Wu L (2009) Stock market prediction of s&p 500 via combination of improved BCO approach and BP neural network. Expert Syst Appl 36:8849–8854
Zhang W, Yoshida T, Tang X (2011) A comparative study of TF* IDF, LSI and multiwords for text classification. Expert Syst Appl 38(3):2758–2765
Acknowledgements
The author would like to thank the anonymous referees for their constructive and useful comments.
Competing interests
The proposed methodology is a part of an ongoing research and not related to any financial organization. It is purely a part of an academic research initiative by the author. It is further assured that none of the authors have any competing interests in the manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Roy, P. A novel fuzzy documentbased information retrieval scheme (FDIRS). Appl Inform 3, 2 (2016). https://doi.org/10.1186/s405350160017y
DOI: https://doi.org/10.1186/s405350160017y
Keywords
 Candlestick chart
 Data mining
 Fuzzy logic
 Information retrieval
 Pattern recognition
 Prediction
 Time series
 tf-idf