This page outlines a subset of the languages represented by CoNLL-U formatted corpora in the current version of the taggedPBC that are also found in the Universal Dependencies Treebanks (UDT) project. This file is automatically generated from the stats_All.xlsx file found under the scripts/data/output folder.
The subset comprises languages that have more than 1800 verses in their corpus, are also present in the Universal Dependencies Treebanks project, are not covered by commonly used NLP libraries (spaCy, Trankit), and have not previously been part of projects in HG2051 (Language and the Computer) as taught at NTU Singapore.
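The selection criteria above can be sketched as a simple filter. This is a minimal illustration only, not the script that generates this page: the `LanguageStats` fields, the `select_subset` helper, and the `excluded` set are hypothetical names, and the actual column layout of stats_All.xlsx is not shown here. The exclusion set stands in for languages already covered by spaCy or Trankit or used in earlier HG2051 projects.

```python
from dataclasses import dataclass

MIN_VERSES = 1800  # threshold stated in the text: "more than 1800 verses"


@dataclass(frozen=True)
class LanguageStats:
    """Hypothetical record; field names are assumptions, not the spreadsheet's columns."""
    iso: str
    name: str
    verses: int
    in_udt: bool


def select_subset(rows, excluded_isos):
    """Keep languages with more than MIN_VERSES verses that are in UDT and not excluded."""
    return [
        r for r in rows
        if r.verses > MIN_VERSES and r.in_udt and r.iso not in excluded_isos
    ]


rows = [
    LanguageStats("hau", "Hausa", 1884, True),    # verse count from the table on this page
    LanguageStats("mlt", "Maltese", 1855, True),  # verse count from the table on this page
    LanguageStats("xx1", "Made-up language A", 1500, True),   # too few verses
    LanguageStats("xx2", "Made-up language B", 1884, False),  # not in UDT
    LanguageStats("xx3", "Made-up language C", 1884, True),   # on the exclusion list
]
excluded = {"xx3"}  # placeholder for library-covered or previously used languages

selected = select_subset(rows, excluded)
print([r.iso for r in selected])  # ['hau', 'mlt']
```

Keeping the exclusions in a separate set means the list of library-covered or previously used languages can be updated without touching the filter itself.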
The current document organizes languages by language phylum/family and ISO 639-3 code. Each table lists the full name of the language, the number of verses in its corpus, its macroarea, branch, and subgroup, along with links to the individual corpus in the taggedPBC and the UDT, the respective ISO 639-3 page, Glottolog, and Ethnologue.
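As a rough sketch of how such per-family tables could be assembled, the snippet below groups records by family and renders one pipe-delimited table per group. It assumes rows are ordered by branch, then subgroup, then ISO 639-3 code; that ordering matches most tables on this page but is an inference, not something confirmed by the generating script. The function names are hypothetical, and the Links column is omitted because the link targets are not reproduced here.

```python
from collections import defaultdict

HEADER = "ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup |"
SEPARATOR = "---|---|---|---|---|---|"

# (family, iso, name, verses, macroarea, branch, subgroup), copied from this page
records = [
    ("Turkic", "sah", "Sakha", 1884, "Eurasia", "Common Turkic", "Sakha-Dolgan"),
    ("Turkic", "kaz", "Kazakh", 1884, "Eurasia", "Common Turkic", "Kipchak-Turkestan"),
    ("Dravidian", "tam", "Tamil", 1884, "Eurasia", "South Dravidian", "South Dravidian I"),
    ("Dravidian", "mal", "Malayalam", 1882, "Eurasia", "South Dravidian", "South Dravidian I"),
]


def group_by_family(records):
    """Group rows by family; sort families by name and rows by branch, subgroup, ISO."""
    groups = defaultdict(list)
    for family, *row in records:
        groups[family].append(tuple(row))
    return {
        family: sorted(rows, key=lambda r: (r[4], r[5], r[0]))
        for family, rows in sorted(groups.items())
    }


def render_table(family, rows):
    """Render one family section in the same plain pipe-delimited layout as this page."""
    lines = [family, f"{family} languages in the taggedPBC + UDT:", HEADER, SEPARATOR]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


grouped = group_by_family(records)
for family, rows in grouped.items():
    print(render_table(family, rows))
```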
Afro-Asiatic
Afro-Asiatic languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
hau | Hausa | 1884 | Africa | Chadic | West Chadic | corpus, UDT, ISOs, Ethnologue, Glottolog |
cop | Coptic | 1884 | Africa | Egyptian | Afro-Asiatic | corpus, UDT, ISOs, Ethnologue, Glottolog |
amh | Amharic | 1884 | Africa | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
heb | Modern Hebrew | 1884 | Eurasia | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
mlt | Maltese | 1855 | Eurasia | Semitic | West Semitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
Atlantic-Congo
Atlantic-Congo languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
wol | Wolof | 1884 | Africa | North-Central Atlantic | Wolof-BKK | corpus, UDT, ISOs, Ethnologue, Glottolog |
yor | Yoruba | 1884 | Africa | Volta-Congo | Benue-Congo | corpus, UDT, ISOs, Ethnologue, Glottolog |
Austronesian
Austronesian languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
tgl | Tagalog | 1884 | Papunesia | Malayo-Polynesian | Greater Central Philippine | corpus, UDT, ISOs, Ethnologue, Glottolog |
jav | Javanese | 1884 | Papunesia | Malayo-Polynesian | Javanesic | corpus, UDT, ISOs, Ethnologue, Glottolog |
Dravidian
Dravidian languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
mal | Malayalam | 1882 | Eurasia | South Dravidian | South Dravidian I | corpus, UDT, ISOs, Ethnologue, Glottolog |
tam | Tamil | 1884 | Eurasia | South Dravidian | South Dravidian I | corpus, UDT, ISOs, Ethnologue, Glottolog |
Indo-European
Indo-European languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
aln | Gheg Albanian | 1884 | Eurasia | Classical Indo-European | Albanian | corpus, UDT, ISOs, Ethnologue, Glottolog |
hyw | Western Armenian | 1883 | Eurasia | Classical Indo-European | Armenic | corpus, UDT, ISOs, Ethnologue, Glottolog |
bel | Belarusian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
chu | Church Slavic | 1873 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
rus | Russian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
ukr | Ukrainian | 1884 | Eurasia | Classical Indo-European | Balto-Slavic | corpus, UDT, ISOs, Ethnologue, Glottolog |
bre | Breton | 1883 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
cym | Welsh | 1884 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
glv | Manx | 1884 | Eurasia | Classical Indo-European | Celtic | corpus, UDT, ISOs, Ethnologue, Glottolog |
bar | Bavarian | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
fao | Faroese | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
isl | Icelandic | 1875 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
nds | Eastern Low German | 1884 | Eurasia | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
pcm | Nigerian Pidgin | 1884 | Africa | Classical Indo-European | Germanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
ell | Modern Greek | 1884 | Eurasia | Classical Indo-European | Graeco-Phrygian | corpus, UDT, ISOs, Ethnologue, Glottolog |
grc | Ionic-Attic Ancient Greek | 1884 | Eurasia | Classical Indo-European | Graeco-Phrygian | corpus, UDT, ISOs, Ethnologue, Glottolog |
hin | Hindi | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
kmr | Northern Kurdish | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
san | Sanskrit | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
urd | Urdu | 1884 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
xnr | Kangri | 1883 | Eurasia | Classical Indo-European | Indo-Iranian | corpus, UDT, ISOs, Ethnologue, Glottolog |
Sino-Tibetan
Sino-Tibetan languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
lzh | Classical Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
yue | Yue Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
zho | Chinese | 1884 | Eurasia | Sinitic | Classical-Middle-Modern Sinitic | corpus, UDT, ISOs, Ethnologue, Glottolog |
Tupian
Tupian languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
gub | Guajajára | 1884 | South America | Eastern Tupian | Maweti-Guarani | corpus, UDT, ISOs, Ethnologue, Glottolog |
gun | Mbyá Guaraní | 1884 | South America | Eastern Tupian | Maweti-Guarani | corpus, UDT, ISOs, Ethnologue, Glottolog |
Turkic
Turkic languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
kaz | Kazakh | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
kir | Kirghiz | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
uig | Uighur | 1884 | Eurasia | Common Turkic | Kipchak-Turkestan | corpus, UDT, ISOs, Ethnologue, Glottolog |
sah | Sakha | 1884 | Eurasia | Common Turkic | Sakha-Dolgan | corpus, UDT, ISOs, Ethnologue, Glottolog |
Uralic
Uralic languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
est | Estonian | 1884 | Eurasia | Finnic | Coastal Finnic | corpus, UDT, ISOs, Ethnologue, Glottolog |
krl | Karelian | 1878 | Eurasia | Finnic | Coastal Finnic | corpus, UDT, ISOs, Ethnologue, Glottolog |
myv | Erzya | 1884 | Eurasia | Mordvin | Uralic | corpus, UDT, ISOs, Ethnologue, Glottolog |
kpv | Komi-Zyrian | 1884 | Eurasia | Permian | Komi | corpus, UDT, ISOs, Ethnologue, Glottolog |
sme | North Saami | 1884 | Eurasia | Saami | Western Saami | corpus, UDT, ISOs, Ethnologue, Glottolog |
Uto-Aztecan
Uto-Aztecan languages in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
azz | Highland Puebla Nahuatl | 1884 | North America | Southern Uto-Aztecan | Corachol-Aztecan | corpus, UDT, ISOs, Ethnologue, Glottolog |
nhi | Zacatlán-Ahuacatlán-Tepetzintla Nahuatl | 1884 | North America | Southern Uto-Aztecan | Corachol-Aztecan | corpus, UDT, ISOs, Ethnologue, Glottolog |
Isolates
Isolates in the taggedPBC + UDT:
ISO 639-3 | Name | Verses in corpus | Macroarea | Branch | Subgroup | Links |
---|---|---|---|---|---|---|
bam | Bambara | 1881 | Africa | Western Mande | Manding-Kpelle | corpus, UDT, ISOs, Ethnologue, Glottolog |
bxr | Russia Buriat | 1884 | Eurasia | Mongolic | Eastern Mongolic | corpus, UDT, ISOs, Ethnologue, Glottolog |
jpn | Japanese | 1884 | Eurasia | Japanesic | Japan-Taiwan Japanese | corpus, UDT, ISOs, Ethnologue, Glottolog |
kor | Korean | 1884 | Eurasia | Koreanic | Koreanic | corpus, UDT, ISOs, Ethnologue, Glottolog |
quc | K’iche’ | 1884 | North America | Core Mayan | Quichean-Mamean | corpus, UDT, ISOs, Ethnologue, Glottolog |
tha | Thai | 1884 | Eurasia | Kam-Tai | Daic-Beic | corpus, UDT, ISOs, Ethnologue, Glottolog |