Finding labeled data for a Tamil NLP task is difficult. Some papers discuss Tamil neural translation, but they don’t release their code. If you work on Tamil NLP part-time or simply have an interest in it, you’ll have a tough time finding data.
When I was looking for labeled data for simple sentiment analysis, I couldn’t find any. That’s understandable, since no one seems to be working on it. So I decided to build my own dataset. Twitter seemed a perfect place, with lots of data. I scraped the data using the Twint Python library.
After spending a whole weekend annotating about 1000 tweets as “Happy” or “Sad”, I’m releasing the data in the public domain. You can find the data on Kaggle. The dataset has two columns, tweet and sentiment.
The tweet column contains the Tamil text, and the sentiment column contains the label assigned to that tweet (Happy or Sad).
$ head -10 tamil_binary_sentiment_1k_tweets_v1.csv
tweet,sentiment
உன்னைத்தொட்டால் உன்னுள்ளத்தை நொருக்கமாட்டியோ!! என்னைப் போல பெண்ணைப்ப் பார்த்து மயங்க மாட்டியோ!! #RaOne #chammakChallo #tamilLyrics,Happy
"நதியா நதியா நயில் நதியா
…
இடை தான் கொடியா
கொடி மேல் கனியா
#RDBurnam #HindMusic #TamilLyrics",Happy
"உறக்கம் விற்று கனவுகள் வாங்கலையா?! #TamilLyrics RT @JanuShath: கனவுகள் விற்றுக் கவிதைகள் வாங்குவதும், கவிதைகள் விற்றுக் காதலை வாங்குவதுமாய்.",Sad
மீண்டும் உன்னை காணும் மனமே ... வேண்டும் எனக்கே மனமே மனமே !!! #TamilLyrics,Sad
உயிரை தொலைத்தேன் அது உன்னில் தானோ ... இது நான் காணும் கனவோ நிஜமோ...அன்பே உயிரை தொடுவேன் உன்னை தாலாட்டுதே பார்வைகள் ! #TamilLyrics,Sad
The dataset includes 1011 tweets. If you do sentiment analysis on the dataset, consider uploading the kernel to Kaggle. If you use it in your research work, please cite the DOI.
Kracekumar, “Tamil Binary Classification 1K tweets Labels V1.” Kaggle, doi: 10.34740/KAGGLE/DSV/1226691.
You can download the data from GitHub as well.
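Once downloaded, the file can be read with Python’s standard csv module. The snippet below is only a minimal sketch, not part of how the dataset was built. It assumes the tweet and sentiment header shown in the sample above, and it uses a small in-memory string (two shortened rows from that sample) in place of the real file so that it runs on its own. The helper name load_tweets is made up for this example. Note that some tweets span several lines inside quotes, so splitting the file on newlines would break them; a CSV parser handles this correctly.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for tamil_binary_sentiment_1k_tweets_v1.csv.
# Same header as the real file; the two rows are shortened from the sample.
SAMPLE = (
    "tweet,sentiment\n"
    '"நதியா நதியா நயில் நதியா\n'
    "இடை தான் கொடியா\n"
    'கொடி மேல் கனியா",Happy\n'
    "மீண்டும் உன்னை காணும் மனமே,Sad\n"
)


def load_tweets(handle):
    """Return a list of (tweet, sentiment) pairs from an open CSV file."""
    reader = csv.DictReader(handle)
    return [(row["tweet"], row["sentiment"]) for row in reader]


# newline="" leaves line endings untouched so quoted multi-line tweets survive.
rows = load_tweets(io.StringIO(SAMPLE, newline=""))

# Count how many tweets carry each label.
label_counts = Counter(sentiment for _, sentiment in rows)
print(len(rows), dict(label_counts))
```

To read the real file, pass open("tamil_binary_sentiment_1k_tweets_v1.csv", encoding="utf-8", newline="") to load_tweets instead of the StringIO object.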
Points to remember
- The sentiment is labeled based on the tweet text and not on any multimedia or hyperlink attached to the tweet.
- While creating the dataset, I did not check the attached images or the user handles for adult content. There may be NSFW attachments.
- A link or URL attached to a tweet may or may not still be reachable.
- The content of the tweet may contain English sentences, words, emojis, etc.
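Because the tweets mix Tamil with English, hashtags, mentions, and links, some light cleaning may help before modeling. The following is a rough sketch under my own assumptions, not a step that was applied to the released data: the function names clean_tweet and tamil_ratio are hypothetical, the regular expressions are deliberately simple, and "Tamil" here just means a character in the Unicode Tamil block (U+0B80 to U+0BFF).

```python
import re

URL_RE = re.compile(r"https?://\S+")       # http/https links
TAG_RE = re.compile(r"[#@]\S+")            # #hashtags and @mentions
TAMIL_RE = re.compile(r"[\u0B80-\u0BFF]")  # Unicode Tamil block


def clean_tweet(text):
    """Drop URLs, #hashtags and @mentions, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def tamil_ratio(text):
    """Fraction of non-space characters that are in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if TAMIL_RE.match(ch))
    return tamil / len(chars)


tweet = "மனமே மனமே !!! #TamilLyrics https://example.com/x"
cleaned = clean_tweet(tweet)
print(cleaned)
print(round(tamil_ratio(cleaned), 2))
```

A low tamil_ratio could be used to flag tweets that are mostly English, but whether to drop them depends on the task; hashtags such as #TamilLyrics may also carry useful signal, so removing them is a choice rather than a requirement.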
Labeling process
It took the entire weekend to label the tweets, with Google Sheets as the annotation tool. Even though there are only about a thousand tweets, I had to read two to three thousand tweets to assign the labels. Roughly, the whole process took 17 hours (~1 labeled tweet per minute). It’s hard to read and label even a few hundred tweets at a stretch. Fun fact: I was aiming to label at least 10K tweets :-) I don’t know whether I’ll do more labeling any time soon. There are close to 4 lakh (400,000) tweets in the DB :-)
Happy NLP!
Important Links in the blog post
- Kaggle Dataset - https://www.kaggle.com/kracekumar/tamil-binary-classification-1k-tweets-labels-v1
- Twint - https://github.com/twintproject/twint
- GitHub Repo - https://github.com/kracekumar/tamil-dataset
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.