Document Clustering, Classification and Data Mining

Conclusion

Normal Searching (syntax matching)

Heavy research has already been done in this area.

Data Storage

Heavy research has already been done in this area.

Indexing, Sorting and Quick Access

Heavy research has already been done in this area.

Semantics-based Question Answering in Multiple Languages

The system should try to understand the user's query and answer based on semantics and grammatical context. If an article is related to the query word but contains no word that literally matches it, plain keyword search will miss it, and semantics-based search can greatly help. (Needs to be done. Complex)
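The difference between syntax matching and semantics-based matching can be illustrated with a minimal Python sketch. It is not the proposed system: the concept map CONCEPTS, the sample documents and all function names are hypothetical, and a real system would derive related words from a thesaurus or a learned model rather than a hand-written table.

```python
import re

# Hypothetical concept map: each entry groups words that express one idea.
CONCEPTS = {
    "vehicle": {"car", "automobile", "vehicle"},
    "doctor": {"doctor", "physician"},
}

# Hypothetical sample documents.
DOCS = {
    "d1": "The automobile was parked outside the house.",
    "d2": "Merchants travelled by sea in the summer.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expand(query):
    """Return the query words plus every word that shares a concept with them."""
    words = set(tokenize(query))
    expanded = set(words)
    for members in CONCEPTS.values():
        if words & members:
            expanded |= members
    return expanded


def syntactic_search(query, docs):
    """Keyword matching: a document must contain a literal query word."""
    words = set(tokenize(query))
    return [d for d, text in docs.items() if words & set(tokenize(text))]


def semantic_search(query, docs):
    """Match documents against the concept-expanded query instead."""
    terms = expand(query)
    return [d for d, text in docs.items() if terms & set(tokenize(text))]


print(syntactic_search("car", DOCS))  # no literal match
print(semantic_search("car", DOCS))   # finds d1 through "automobile"
```

Here the query "car" matches no document literally, but the expanded query reaches the document that mentions "automobile". Grammatical context, which the text also asks for, is outside the scope of this sketch.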

This should be done not only for English but also for Urdu and Arabic. If the user enters a query in Arabic, the system should parse it, understand it, query the Arabic documents in the corpora, and show the results in Arabic. It should also show the corresponding English and Urdu documents if they exist. (Needs to be done. Very Complex)
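The query flow described above (detect the language, search the corpus of that language, then attach the corresponding documents in the other languages) might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the CORPUS layout with shared document ids, the sample texts, and the heuristic that tells Urdu from Arabic by letters that standard Arabic does not use. Real parsing and understanding of the query is not attempted here.

```python
# Letters common in Urdu but absent from standard Arabic (heuristic, assumed).
URDU_ONLY = set("\u0679\u0688\u0691\u06ba\u06d2\u06be\u06af\u067e\u0686\u0698")

# Hypothetical corpora: the same document id links parallel versions.
CORPUS = {
    "ar": {"doc1": "كتاب العلم"},
    "en": {"doc1": "The book of knowledge"},
    "ur": {},
}


def detect_language(query):
    """Guess 'ar', 'ur' or 'en' from the characters of the query."""
    arabic_script = sum(1 for ch in query if "\u0600" <= ch <= "\u06ff")
    latin = sum(1 for ch in query if ch.isascii() and ch.isalpha())
    if arabic_script > latin:
        return "ur" if any(ch in URDU_ONLY for ch in query) else "ar"
    return "en"


def search(query):
    """Search the corpus of the query language and attach parallel versions."""
    lang = detect_language(query)
    words = query.split()
    results = []
    for doc_id, text in CORPUS[lang].items():
        if any(word in text.split() for word in words):
            # Collect the same document in the other languages, if present.
            parallel = {
                other: docs[doc_id]
                for other, docs in CORPUS.items()
                if other != lang and doc_id in docs
            }
            results.append((doc_id, text, parallel))
    return lang, results


lang, results = search("العلم")
print(lang, results)
```

Keeping one document id across languages is what makes the last step cheap: once the Arabic hit is known, the English and Urdu versions are simple lookups.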

Google already offers translation: it translates whole pages from certain languages into English.

Document Clustering and Data Mining

We do not want to do indexing. Instead, we only want to mine and cluster the documents and then present the results. Because we have an almost fixed number of classes, we use classification as a top-down technique. As on the web, where hyperlinks connect pages, our texts also link to other texts, though not as densely as the web, and such links also occur in the later texts.
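Classification into a fixed set of classes can be sketched in Python as follows. This is only one possible technique, chosen for brevity: each class is represented by the summed word counts of a few labelled examples, and a new text is assigned to the class with the highest cosine similarity. The class names and the training texts are hypothetical, and the links between texts mentioned above are not used.

```python
import math
import re
from collections import Counter

# Hypothetical labelled examples for two fixed classes.
TRAINING = [
    ("medicine", "treatment of fever and remedies for illness"),
    ("trade", "rules for trade, sale and contracts"),
]


def bag_of_words(text):
    """Count the lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled):
    """Sum the word counts of each class's examples into one profile."""
    profiles = {}
    for label, text in labelled:
        profiles.setdefault(label, Counter()).update(bag_of_words(text))
    return profiles


def classify(text, profiles):
    """Return the class whose profile is most similar to the text."""
    vec = bag_of_words(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


profiles = train(TRAINING)
print(classify("a sale between two merchants", profiles))
print(classify("remedies for a fever", profiles))
```

Because the set of classes is fixed in advance, no clustering step is needed to discover them; clustering would only be applied within a class or to texts that fit none of the profiles well.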

Platform :-

Linux

Intel Architecture

Open Source Tools

Literature Survey :-

To the best of our knowledge, no such system exists yet.
