Languages in the taggedPBC and the UDT available for HG2051 Project 2b

This page lists a subset of the languages that have CoNLL-U formatted corpora in the current version of the taggedPBC and that are also found in the UDT project. This file is automatically generated from the stats_All.xlsx file in the scripts/data/output folder.
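For readers new to the format: a CoNLL-U file stores one token per line in ten tab-separated fields, uses `#` for comment lines, and closes each sentence with a blank line. The sketch below counts sentence blocks in CoNLL-U text. It assumes that each verse is stored as one sentence block, which is an assumption about the corpus layout rather than something stated on this page; the function name `count_sentences` and the placeholder sample are illustrative only.

```python
def count_sentences(conllu_text: str) -> int:
    """Count sentence blocks (blocks with at least one token line) in CoNLL-U text."""
    count = 0
    has_token = False  # True once the current block has produced a token line
    for line in conllu_text.splitlines():
        stripped = line.strip()
        if not stripped:
            has_token = False  # a blank line closes the current block
        elif stripped.startswith("#"):
            continue  # comment/metadata line, not a token
        elif not has_token:
            count += 1  # first token line of a new block
            has_token = True
    return count


# Placeholder text with two sentence blocks (not real corpus data).
SAMPLE = (
    "# sent_id = 1\n"
    "1\tword1\tlemma1\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tword2\tlemma2\tVERB\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tword3\tlemma3\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

print(count_sentences(SAMPLE))
```

If the one-verse-per-block assumption holds for a corpus, this count should correspond to the "Verses in corpus" figure reported for that language.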

The subset consists of languages that have more than 1800 verses in their corpus, are also present in the Universal Dependencies Treebanks (UDT) project, are not covered by commonly used NLP libraries (spaCy, Trankit), and have not previously been part of projects in HG2051 (Language and the Computer) as taught at NTU Singapore.
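The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the script that generates this page: the row keys (`iso`, `verses`), the function name, and the placeholder language codes are all hypothetical, and the real column names in stats_All.xlsx may differ.

```python
MIN_VERSES = 1800  # threshold stated in the selection criteria


def select_languages(rows, udt_codes, library_codes, past_project_codes):
    """Return the rows that satisfy all four selection criteria."""
    excluded = set(library_codes) | set(past_project_codes)
    return [
        row
        for row in rows
        if row["verses"] > MIN_VERSES      # more than 1800 verses
        and row["iso"] in udt_codes        # also present in the UDT
        and row["iso"] not in excluded     # not in NLP libraries or past projects
    ]


# Placeholder rows (hypothetical codes, not real languages).
rows = [
    {"iso": "lang_a", "verses": 1884},
    {"iso": "lang_b", "verses": 1500},  # too few verses
    {"iso": "lang_c", "verses": 1884},  # covered by an NLP library
    {"iso": "lang_d", "verses": 1884},  # not in the UDT
]

selected = select_languages(
    rows,
    udt_codes={"lang_a", "lang_b", "lang_c"},
    library_codes={"lang_c"},
    past_project_codes=set(),
)
print([row["iso"] for row in selected])  # ['lang_a']
```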

The current document organizes languages by language phylum/family and ISO 639-3 code. Additional information includes the full name of the language and region, and links to the individual corpus in the taggedPBC and the UDT, the respective ISO 639-3 page, Glottolog, and Ethnologue.
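A minimal sketch of this organization, assuming each language is a record with hypothetical keys `family`, `branch`, `subgroup`, and `iso`. The secondary sort on branch and subgroup is inferred from the order of the rows in the family tables on this page and may not match the generating script. The sample values are taken from the Afro-Asiatic and Dravidian tables.

```python
from collections import defaultdict


def group_by_family(rows):
    """Group rows by family, ordering each group by branch, subgroup, ISO code."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["family"]].append(row)
    return {
        family: sorted(
            members, key=lambda r: (r["branch"], r["subgroup"], r["iso"])
        )
        for family, members in sorted(grouped.items())
    }


rows = [
    {"iso": "tam", "family": "Dravidian", "branch": "South Dravidian", "subgroup": "South Dravidian I"},
    {"iso": "heb", "family": "Afro-Asiatic", "branch": "Semitic", "subgroup": "West Semitic"},
    {"iso": "hau", "family": "Afro-Asiatic", "branch": "Chadic", "subgroup": "West Chadic"},
    {"iso": "mal", "family": "Dravidian", "branch": "South Dravidian", "subgroup": "South Dravidian I"},
    {"iso": "amh", "family": "Afro-Asiatic", "branch": "Semitic", "subgroup": "West Semitic"},
]

for family, members in group_by_family(rows).items():
    print(family, [m["iso"] for m in members])
# Afro-Asiatic ['hau', 'amh', 'heb']
# Dravidian ['mal', 'tam']
```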

Afro-Asiatic

Afro-Asiatic languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| hau | Hausa | 1884 | Africa | Chadic | West Chadic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| cop | Coptic | 1884 | Africa | Egyptian | Afro-Asiatic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| amh | Amharic | 1884 | Africa | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| heb | Modern Hebrew | 1884 | Eurasia | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| mlt | Maltese | 1855 | Eurasia | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |

Atlantic-Congo

Atlantic-Congo languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| wol | Wolof | 1884 | Africa | North-Central Atlantic | Wolof-BKK | corpus, UDT, ISOs, Ethnologue, Glottolog |
| yor | Yoruba | 1884 | Africa | Volta-Congo | Benue-Congo | corpus, UDT, ISOs, Ethnologue, Glottolog |

Austronesian

Austronesian languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| tgl | Tagalog | 1884 | Papunesia | Malayo-Polynesian | Greater Central Philippine | corpus, UDT, ISOs, Ethnologue, Glottolog |
| jav | Javanese | 1884 | Papunesia | Malayo-Polynesian | Javanesic | corpus, UDT, ISOs, Ethnologue, Glottolog |

Dravidian

Dravidian languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| mal | Malayalam | 1882 | Eurasia | South Dravidian | South Dravidian I | corpus, UDT, ISOs, Ethnologue, Glottolog |
| tam | Tamil | 1884 | Eurasia | South Dravidian | South Dravidian I | corpus, UDT, ISOs, Ethnologue, Glottolog |

Indo-European

Indo-European languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| aln | Gheg Albanian | 1884 | Eurasia | Classical Indo-European | Albanian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| hyw | Western Armenian | 1883 | Eurasia | Classical Indo-European | Armenic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| bel | Belarusian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| chu | Church Slavic | 1873 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| rus | Russian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| ukr | Ukrainian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| bre | Breton | 1883 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| cym | Welsh | 1884 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| glv | Manx | 1884 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| bar | Bavarian | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| fao | Faroese | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| isl | Icelandic | 1875 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| nds | Eastern Low German | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| pcm | Nigerian Pidgin | 1884 | Africa | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| ell | Modern Greek | 1884 | Eurasia | Classical Indo-European | Graeco-Phrygian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| grc | Ionic-Attic Ancient Greek | 1884 | Eurasia | Classical Indo-European | Graeco-Phrygian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| hin | Hindi | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| kmr | Northern Kurdish | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| san | Sanskrit | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| urd | Urdu | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
| xnr | Kangri | 1883 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |

Sino-Tibetan

Sino-Tibetan languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| lzh | Classical Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| yue | Yue Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| zho | Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |

Tupian

Tupian languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| gub | Guajajára | 1884 | South America | Eastern Tupian | Maweti-Guarani | corpus, UDT, ISOs, Ethnologue, Glottolog |
| gun | Mbyá Guaraní | 1884 | South America | Eastern Tupian | Maweti-Guarani | corpus, UDT, ISOs, Ethnologue, Glottolog |

Turkic

Turkic languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| kaz | Kazakh | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
| kir | Kirghiz | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
| uig | Uighur | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
| sah | Sakha | 1884 | Eurasia | Common Turkic | Sakha-Dolgan | corpus, UDT, ISOs, Ethnologue, Glottolog |

Uralic

Uralic languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| est | Estonian | 1884 | Eurasia | Finnic | Coastal Finnic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| krl | Karelian | 1878 | Eurasia | Finnic | Coastal Finnic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| myv | Erzya | 1884 | Eurasia | Mordvin | Uralic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| kpv | Komi-Zyrian | 1884 | Eurasia | Permian | Komi | corpus, UDT, ISOs, Ethnologue, Glottolog |
| sme | North Saami | 1884 | Eurasia | Saami | Western Saami | corpus, UDT, ISOs, Ethnologue, Glottolog |

Uto-Aztecan

Uto-Aztecan languages in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| azz | Highland Puebla Nahuatl | 1884 | North America | Southern Uto-Aztecan | Corachol-Aztecan | corpus, UDT, ISOs, Ethnologue, Glottolog |
| nhi | Zacatlán-Ahuacatlán-Tepetzintla Nahuatl | 1884 | North America | Southern Uto-Aztecan | Corachol-Aztecan | corpus, UDT, ISOs, Ethnologue, Glottolog |

Isolates

Isolates in the taggedPBC + UDT:

| ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
| --- | --- | --- | --- | --- | --- | --- |
| bam | Bambara | 1881 | Africa | Western Mande | Manding-Kpelle | corpus, UDT, ISOs, Ethnologue, Glottolog |
| bxr | Russia Buriat | 1884 | Eurasia | Mongolic | Eastern Mongolic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| jpn | Japanese | 1884 | Eurasia | Japanesic | Japan-Taiwan Japanese | corpus, UDT, ISOs, Ethnologue, Glottolog |
| kor | Korean | 1884 | Eurasia | Koreanic | Koreanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
| quc | K’iche’ | 1884 | North America | Core Mayan | Quichean-Mamean | corpus, UDT, ISOs, Ethnologue, Glottolog |
| tha | Thai | 1884 | Eurasia | Kam-Tai | Daic-Beic | corpus, UDT, ISOs, Ethnologue, Glottolog |