Large Scale Multi-label Text Classification of a Hierarchical Dataset using Rocchio algorithm

Abstract:

Hierarchical data is becoming increasingly prominent, especially on the web. Wikipedia is one example: its millions of documents are classified into multiple classes arranged in a hierarchy. This gives rise to the interesting problem of automating the classification of new documents. As the dataset grows, so does the number of classes. Further, the data appears to remain sparse even as the dataset grows. Classifying data at this scale is therefore challenging. We present two text categorization algorithms: the Rocchio algorithm and kNN. We implement and compare both methods to better understand which approach to take when classifying hierarchical data.

Existing System:

k-Nearest Neighbour (kNN) is a form of lazy learning that requires no explicit training phase. It does not attempt to generalize from the training set and instead delays all computation until a new document arrives. Although its computational cost at classification time is proportional to the size of the training set, it is more expressive than a centroid-based classifier and can handle complex classes with relative ease.
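The lazy behaviour described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term-frequency vectors, cosine similarity, and a simple vote threshold for multi-label output; the names `vectorize`, `cosine`, `knn_predict`, `k`, and `min_votes`, as well as the toy data, are hypothetical and are not taken from the project itself.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector from whitespace tokens (assumed scheme)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, text, k=3, min_votes=2):
    """Return every label carried by at least `min_votes` of the k nearest
    training documents. No model is built in advance: all work happens
    here, at query time, and scans the whole training set."""
    query = vectorize(text)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter()
    for _, labels in ranked[:k]:
        votes.update(labels)
    return {label for label, n in votes.items() if n >= min_votes}


# Toy multi-label training set: (document text, set of labels).
train = [
    ("cats purr and meow", {"animals", "pets"}),
    ("dogs bark loudly", {"animals", "pets"}),
    ("stocks fell on monday", {"finance"}),
    ("bond markets rally", {"finance"}),
    ("cats and dogs are pets", {"animals", "pets"}),
]

print(sorted(knn_predict(train, "cats and dogs")))
```

Because every query is compared against every training document, the cost per prediction grows with the training set, which is the trade-off noted above.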

Disadvantage:

Among text classifiers, the centroid-based classifier (CC) is noteworthy for its high efficiency and robustness. During training, its computational complexity is roughly proportional to the total number of documents and terms in the training set, which makes it interesting for large-scale text classification tasks. During classification, it compares a new document against the centroids of the different classes, which allows it to dynamically adjust for classes with different densities. The idea is to use all the training records belonging to a category to build that category's centroid vector, and then to assign a new document to the category with the most similar centroid.
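The centroid idea can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes raw term-frequency vectors and cosine similarity, builds each centroid from positive examples only (the full Rocchio formulation also subtracts a weighted negative prototype, which is omitted here), and uses a hypothetical `ratio` threshold to return more than one label. All function names and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Sparse term-frequency vector from whitespace tokens (assumed scheme)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(train):
    """One pass over the training set: average the vectors of all
    documents that carry a label to obtain that label's centroid."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, labels in train:
        vec = vectorize(text)
        for label in labels:
            sums[label].update(vec)  # add term counts to the running sum
            counts[label] += 1
    return {
        label: {term: w / counts[label] for term, w in total.items()}
        for label, total in sums.items()
    }


def centroid_predict(centroids, text, ratio=0.8):
    """Return every label whose centroid similarity is within `ratio`
    of the best one (hypothetical multi-label rule)."""
    query = vectorize(text)
    scores = {label: cosine(query, c) for label, c in centroids.items()}
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return set()
    return {label for label, s in scores.items() if s >= ratio * best}


# Toy training set: (document text, set of labels).
train = [
    ("cats purr and meow", {"animals"}),
    ("dogs bark loudly", {"animals"}),
    ("stocks fell on monday", {"finance"}),
    ("bond markets rally", {"finance"}),
]
centroids = train_centroids(train)
print(sorted(centroid_predict(centroids, "dogs and cats")))
```

At prediction time only one comparison per class is needed, rather than one per training document, which is where the efficiency advantage over kNN comes from.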

Proposed System:

There is also a social aspect to user-driven textual tagging. Multiple users can freely tag documents in real time. First, the tags users choose depend heavily on their personal opinions and preferences. Moreover, people may describe the same entity at different levels of granularity. This process generates noisy tags and makes it extremely difficult to extract the relevant labels. Second, users may tag a textual web resource with polysemous words (words with different but related senses). The lack of semantic distinction between such tags may eventually lead to unsuitable connections between items.