The Urdu language has the power to captivate and amaze. From mesmerising Ghazals to ornate classical prose, Urdu has a rich legacy that has stood the test of time.

According to some estimates, Urdu is spoken by around 100 million people across the world. It is the official language of Pakistan and is also spoken in India, Bangladesh, the Middle East, the UK, the US, and other countries where the Pakistani diaspora have settled.

By January 2021, Pakistan had around 61 million internet users. In their everyday communications on the Internet, most Pakistanis use their native language. From community forums and review websites to social media channels like Facebook, Twitter, and YouTube, Pakistanis are posting and commenting in Urdu and Roman Urdu (Urdu written in Latin or Roman script).

This unstructured Urdu data on the web presents a golden opportunity for businesses and institutions looking to gauge the pulse of their online audiences. With sentiment analysis, this data can provide incredible business intelligence to their social media managers and marketers.

However, Urdu sentiment analysis is easier said than done. It poses quite a few challenges, particularly when deriving sentiment from Roman Urdu.

Challenges of Analyzing Sentiment in Urdu

While English has a wide range of Natural Language Processing (NLP) resources, such as lexicons and part-of-speech taggers, the same is not true of Urdu.

A major challenge in Urdu sentiment analysis is the scarcity of established lexical resources, or vocabulary, for Urdu. Because of this, Urdu sentiment analysis mostly involves transferring information from the resource-rich English language to the resource-poor Urdu language.
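
The idea of borrowing sentiment information from English can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production approach: the tiny Roman Urdu to English dictionary, the English polarity scores, and the function names `score_sentiment` and `label_sentiment` are all hypothetical and invented for this example. Splitting on whitespace is also a simplification.

```python
# Hypothetical toy resources for illustration only; a real system would
# need a much larger bilingual dictionary and English sentiment lexicon.
ROMAN_URDU_TO_ENGLISH = {
    "acha": "good",
    "behtareen": "excellent",
    "bura": "bad",
    "kharab": "bad",
}

# Assumed polarity scores for the English words (positive > 0, negative < 0).
ENGLISH_POLARITY = {
    "good": 1,
    "excellent": 2,
    "bad": -1,
}


def score_sentiment(text: str) -> int:
    """Sum the English polarity of every token that has a known translation."""
    total = 0
    # Whitespace splitting is a simplification of word segmentation.
    for token in text.lower().split():
        english = ROMAN_URDU_TO_ENGLISH.get(token)
        if english is not None:
            total += ENGLISH_POLARITY.get(english, 0)
    return total


def label_sentiment(text: str) -> str:
    """Map the summed score to a coarse sentiment label."""
    score = score_sentiment(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["yeh phone bohat acha hai", "service kharab thi"]:
        print(sentence, "->", label_sentiment(sentence))
```

Words missing from the dictionary contribute nothing to the score, so the quality of this kind of transfer depends heavily on how much of the Urdu vocabulary the bilingual resource covers.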

Some of the other issues with Urdu sentiment analysis include difficulty in word segmentation and inconsistencies in morphology and case markers.