NLP for Hindi
This repository contains state-of-the-art language models and a classifier for Hindi, a language spoken in the Indian subcontinent.
The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).
Dataset
Created as part of this project
Open Source Datasets
- BBC News Articles: Sentiment analysis corpus for Hindi documents extracted from the BBC news website.
- IIT Patna Product Reviews: Sentiment analysis corpus for product reviews posted in Hindi.
- IIT Patna Movie Reviews: Sentiment analysis corpus for movie reviews posted in Hindi.
Results
Language Model Perplexity (on validation set)
| Architecture/Dataset | Hindi Wikipedia Articles - 172k | Hindi Wikipedia Articles - 55k |
|---|---|---|
| ULMFiT | 34.06 | 35.87 |
| TransformerXL | 26.09 | 34.78 |
Note: Nirant's earlier state-of-the-art work on a Hindi language model achieved a perplexity of ~46. The scores above aren't directly comparable with that result, because his training and validation sets were different and aren't available for reproduction.
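For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to validation loss. It assumes the usual convention that perplexity is the exponential of the mean per-token cross-entropy measured in nats; the notebooks in this repository may compute it differently. The function names are illustrative and not part of this repository or iNLTK.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


def perplexity_from_token_probs(token_probs) -> float:
    """Perplexity from the probabilities a model assigned to each correct token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return perplexity_from_loss(mean_nll)


# A model that gives every correct token probability 1/4 is as uncertain
# as a uniform choice among 4 tokens.
uniform_ppl = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])

# Validation loss (in nats) that would correspond to the 26.09 in the table,
# under the assumption stated above.
loss_for_reported_ppl = math.log(26.09)
```

Under this convention, a perplexity of 26.09 corresponds to a validation loss of about 3.26 nats per token, and lower is better.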
Classification Metrics
ULMFiT
| Dataset | Accuracy (%) | MCC (×100) | Notebook to reproduce results |
|---|---|---|---|
| BBC News Articles | 78.75 | 71.61 | Link |
| IIT Patna Movie Reviews | 57.74 | 37.23 | Link |
| IIT Patna Product Reviews | 75.71 | 59.76 | Link |
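MCC in the table is the Matthews correlation coefficient. Since the coefficient itself lies between -1 and 1, the values shown appear to be scaled by 100. The sketch below computes the multiclass form of MCC from label lists in plain Python. It is an illustration of the metric, not the code used in the linked notebooks, and the function name and toy labels are made up for the example.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred) -> float:
    """Multiclass Matthews correlation coefficient, in the range [-1, 1]."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class really occurs
    pred_counts = Counter(y_pred)   # how often each class was predicted
    labels = set(true_counts) | set(pred_counts)

    numerator = correct * total - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = total * total - sum(v * v for v in true_counts.values())
    spread_pred = total * total - sum(v * v for v in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # By convention, return 0 when one side has a single class only.
    return numerator / denominator if denominator else 0.0


# Toy labels (hypothetical, not taken from any dataset above).
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
mcc_scaled = 100 * matthews_corrcoef(gold, pred)
```

Unlike accuracy, MCC accounts for how predictions are distributed across classes, which makes it more informative when classes are imbalanced.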
Visualizations
Word Embeddings
| Architecture | Visualization |
|---|---|
| ULMFiT | Embeddings projection |
| TransformerXL | Embeddings projection |
Sentence Embeddings
| Architecture | Visualization |
|---|---|
| ULMFiT | Encodings projection |
Results of using Transfer Learning + Data Augmentation from iNLTK
On using complete training set (with Transfer learning)
| Dataset | Dataset size (train, valid, test) | Accuracy (%) | MCC (×100) | Notebook to reproduce results |
|---|---|---|---|---|
| IIT Patna Movie Reviews | (2480, 310, 310) | 57.74 | 37.23 | Link |
On using 20% of training set (with Transfer learning)
| Dataset | Dataset size (train, valid, test) | Accuracy (%) | MCC (×100) | Notebook to reproduce results |
|---|---|---|---|---|
| IIT Patna Movie Reviews | (496, 310, 310) | 47.74 | 20.50 | Link |
On using 20% of training set (with Transfer learning + Data Augmentation)
| Dataset | Dataset size (train, valid, test) | Accuracy (%) | MCC (×100) | Notebook to reproduce results |
|---|---|---|---|---|
| IIT Patna Movie Reviews | (496, 310, 310) | 56.13 | 34.39 | Link |
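To make the comparison across the three tables concrete, the short sketch below works out how much of the drop caused by training on 20% of the data is recovered by data augmentation. It uses only the numbers reported above; the helper name `recovered_fraction` is illustrative and not part of iNLTK.

```python
def recovered_fraction(full: float, reduced: float, augmented: float) -> float:
    """Share of the score lost by shrinking the training set that augmentation wins back."""
    lost = full - reduced
    if lost == 0:
        raise ValueError("no drop between full and reduced training sets")
    return (augmented - reduced) / lost


# 20% of the 2480 training examples gives the 496 shown in the tables.
train_size_20 = round(2480 * 0.20)

# Scores for IIT Patna Movie Reviews: full set, 20% subset, 20% subset + augmentation.
acc_recovered = recovered_fraction(57.74, 47.74, 56.13)
mcc_recovered = recovered_fraction(37.23, 20.50, 34.39)

print(f"train size at 20%: {train_size_20}")
print(f"accuracy drop recovered: {acc_recovered:.1%}")
print(f"MCC drop recovered: {mcc_recovered:.1%}")
```

On these figures, augmentation recovers roughly five sixths of both the accuracy and the MCC lost by using only 20% of the training set.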
Pretrained Models
Language Models
Download pretrained ULMFiT and TransformerXL language models, trained on Hindi Wikipedia Articles - 172k and Hindi Wikipedia Articles - 55k, from here
Tokenizer
The tokenizer was trained unsupervised using Google's sentencepiece.
Download the trained model and vocabulary from here

