IndoNLU Benchmark

The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems for Bahasa Indonesia. It is a joint effort by Indonesian NLP enthusiasts from institutions including Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia.

IndoNLU has been accepted at AACL 2020 and you can find the paper here.


Tasks & Dataset

The IndoNLU downstream tasks cover 12 tasks divided into four categories:
(a) single-sentence classification,
(b) single-sentence sequence tagging,
(c) sentence-pair classification, and
(d) sentence-pair sequence labeling.
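
The four categories differ only in the shape of the input (one sentence or two) and of the output (one label per example or one label per token). The sketch below illustrates that distinction under stated assumptions: the `Example` class, its field names, the sample sentences, and the label strings are all hypothetical and are not the IndoNLU data schema. It also treats "sequence tagging" and "sequence labeling" as the same thing.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Example:
    # Hypothetical container; these field names are NOT the IndoNLU data schema.
    sentence_a: List[str]            # tokens of the first (or only) sentence
    sentence_b: Optional[List[str]]  # tokens of the second sentence, or None
    label: Union[str, List[str]]     # one label per example, or one per token


def task_category(ex: Example) -> str:
    """Name the category an example falls under, judged from its shape alone."""
    inputs = "sentence-pair" if ex.sentence_b is not None else "single-sentence"
    # A list of labels means tokens are tagged; a single string labels the whole input.
    output = "sequence labeling" if isinstance(ex.label, list) else "classification"
    return f"{inputs} {output}"


# Made-up examples, one per category (a) to (d).
examples = [
    Example(["makanannya", "enak"], None, "positive"),
    Example(["Budi", "tinggal", "di", "Bandung"], None, ["B-PER", "O", "O", "B-LOC"]),
    Example(["dia", "sedang", "tidur"], ["dia", "terjaga"], "contradiction"),
    Example(["siapa", "dia"], ["dia", "adalah", "Budi"], ["O", "O", "B-ANS"]),
]

for ex in examples:
    print(task_category(ex))
```

Each printed line names one of the four categories in the order listed above.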

Download Dataset

Models

We introduce IndoBERT, a BERT-based model for Indonesian, and IndoBERT-lite, its ALBERT-based variant. Both are also used as baseline models in the IndoNLU benchmark. To measure their effectiveness, we have extensively compared them against different pre-trained word embeddings and existing multilingual pre-trained models such as Multilingual BERT, XLM, and XLM-R.

Download Models

Corpus

Indo4B consists of around 4B words in around 250M sentences. The dataset covers both formal and colloquial Indonesian sentences compiled from 12 corpora: two cover colloquial Indonesian, eight cover formal Indonesian, and the remaining two mix both styles.
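
As a quick sanity check on these figures, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch: the inputs are the rounded numbers quoted above, so the average sentence length is only a rough estimate, and the variable names are illustrative.

```python
# Rounded figures quoted in the corpus description.
total_corpora = 12
colloquial_corpora = 2
formal_corpora = 8
approx_words = 4_000_000_000
approx_sentences = 250_000_000

# Corpora not counted as purely colloquial or purely formal are mixed-style.
mixed_corpora = total_corpora - colloquial_corpora - formal_corpora

# Rough average sentence length implied by the rounded totals.
avg_words_per_sentence = approx_words / approx_sentences

print(mixed_corpora, avg_words_per_sentence)  # 2 16.0
```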

Download Corpus

IndoNLU Benchmark Leaderboard

A public leaderboard for tracking performance on the downstream tasks. We are planning to launch a competition on the IndoNLU tasks soon. Stay tuned!

Access Leaderboard »

About Us

IndoNLU is a free, open benchmark for anyone interested in Indonesian Natural Language Processing research.

Contact Us

IndoBenchmark: indobenchmark@gmail.com

Cite Us

If you use any component of IndoNLU for research purposes, please cite the following paper:

@inproceedings{wilie2020indonlu,
    title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
    author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
    booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
    year={2020}
}