An Open and Free Dataset of the Sindhi Language
A free, clean dataset of the Sindhi language containing 223,342 lexical entries, prepared for AI, NLP and large language models (LLMs), as well as modern lexicography, internet search engines and educational applications.
{
"dataset": "Sindhi Open Lexicon",
"entries": 223342,
"formats": ["CSV", "JSONL", "SQLite"],
"publisher": "SindhiLanguage.org",
"prepared_by": "Amar Fayaz Buriro"
}
Why is this dataset important?
In this age of machine intelligence, a dataset like this is the most important need for the Sindhi language: it is clean, structured, well curated and usable across different platforms. It can serve as a foundational, authoritative dataset for developers, researchers, universities, AI-based startups and machine models of human languages.
Download Dataset
Full master package contains data, metadata, README and license files
Dataset Preview
Example structure for developers and AI researchers
| Word | Grammar / POS | Definition | Source |
|---|---|---|---|
| ڀلو | adjective | سُٺو، سيبتو، اعلیٰ | جامع سنڌي لغات |
| ڪاتي | noun | scissors / cutting tool | ميوارام جي لغت |
| Archive | term | آرڪائيو / محفوظ دستاويز | Official Terms |
| Commerce | domain | واپار سان لاڳاپيل اصطلاح | Trade & Commerce |
JSONL Example
{"word":"ڪاتي","part_of_speech":"noun","definition":"knife,scissors","source_dictionary":"Mewaram_Dict"}
Required Attribution
SindhiLanguage.org
Prepared and curated by Amar Fayaz Buriro (امر فياض ٻرڙو)
Source Composition
Major source dictionaries and terminology collections included in the master dataset
Dataset Distribution Repositories
The SindhiLanguage.org master dataset is also distributed through internationally recognized open-data and research repositories
Hugging Face
AI/LLM-ready dataset hosting for model builders, researchers and NLP developers.
Open Dataset ↗
GitHub
Public repository for developers, documentation, versioning and community contribution.
Open Repository ↗
Harvard Dataverse
Academic data citation, DOI-based discovery and long-term research visibility.
Open Dataverse ↗
Zenodo
Research archive distribution for citable releases and open scholarly access.
Open Record ↗
Kaggle
Data science platform access for notebooks, experiments and machine-learning use.
Open Kaggle Dataset ↗
IEEE DataPort
Engineering and technology research distribution for dataset visibility and citation.
Open IEEE DataPort ↗
Citation & Acknowledgment
Any public use, redistribution, derivative dataset, application, API, model card, research paper, or AI/LLM training note using this dataset must acknowledge:
Published by SindhiLanguage.org — https://sindhilanguage.org/
Prepared and curated by Amar Fayaz Buriro (امر فياض ٻرڙو)
License & Responsible Use
This dataset is released for research, education, AI/NLP development, software development, and non-malicious public-interest use, with mandatory attribution.