WOKIE: LLM-Aided Translation of SKOS Thesauri for Multilingual Digital Humanities

Introduces WOKIE, an open-source pipeline for automatically translating SKOS thesauri that combines external translation services with LLM-based refinement to improve accessibility and cross-lingual interoperability in the Digital Humanities.

1. Introduction and Motivation

Knowledge organization in the Digital Humanities (DH) relies heavily on controlled vocabularies, thesauri, and ontologies, which are commonly modeled with the Simple Knowledge Organization System (SKOS). The dominance of English in these resources is a major barrier: it excludes non-native speakers and underrepresents diverse cultures and languages. Multilingual thesauri are essential for inclusive research infrastructures, yet creating them by hand does not scale. Conventional Machine Translation (MT) falls short in DH contexts because it lacks domain-specific vocabulary. This paper presents WOKIE (Well-translated Options for Knowledge Management in International Environments), an open-source, modular pipeline that combines external translation services with targeted refinement by Large Language Models (LLMs) to automate the translation of SKOS thesauri while balancing quality, scalability, and cost.

2. The WOKIE Pipeline: Architecture and Workflow

WOKIE is designed as a multi-stage pipeline that requires no prior expertise in MT or LLMs. It runs on everyday hardware and can use free translation services.

2.1 Core Components

The pipeline consists of three main stages:

  1. Initial Translation: The SKOS thesaurus is parsed, and its labels (prefLabel, altLabel) are sent to several configurable external translation services (e.g., Google Translate, DeepL API).
  2. Candidate Aggregation & Disagreement Detection: The translations of each term are collected. The key innovation is detecting "disagreement" between services. A configurable threshold (e.g., translations from N services differing by more than a similarity score) triggers the refinement stage.
  3. LLM-Based Refinement: For terms whose initial translations disagree, the candidate translations and the source term are passed to an LLM (e.g., GPT-4, Llama 3) with a crafted prompt asking for the best translation and a justification.
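
The disagreement check in stage 2 can be sketched as follows. This is a minimal illustration, not WOKIE's actual implementation: the function names `normalize` and `needs_refinement`, the 0.9 threshold, and the choice of Python's `difflib.SequenceMatcher` as the similarity score are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(label.lower().split())


def needs_refinement(candidates: list[str], threshold: float = 0.9) -> bool:
    """Return True if any pair of candidate translations is less similar
    than `threshold`. Both the criterion and the default value are
    hypothetical; the real tool may define disagreement differently."""
    labels = [normalize(c) for c in candidates]
    for a, b in combinations(labels, 2):
        if SequenceMatcher(None, a, b).ratio() < threshold:
            return True  # at least one pair disagrees: escalate to the LLM
    return False  # all candidates are close enough: accept without an LLM call
```

With this criterion, three services returning "wall painting", "mural painting", and "wall painting" would be flagged for refinement, while candidates that differ only in case or spacing would not.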

2.2 LLM-Based Refinement Strategy

Selective use of LLMs is central to WOKIE's design. Rather than translating every term with an LLM (costly, slow, and prone to hallucination), LLMs are deployed only as arbiters for difficult cases. This hybrid approach exploits the speed and low cost of standard MT APIs for straightforward translations and reserves LLM compute for terms without consensus, thereby optimizing the trade-off between quality and resource expenditure.
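
The arbiter role described above might look roughly like the sketch below. Everything here is hypothetical: `build_prompt`, `choose_translation`, and the `ask_llm` callable (a stand-in for whatever LLM client is used) are names invented for the example, the prompt wording is illustrative, and agreement is simplified to an exact match after lowercasing and trimming.

```python
from typing import Callable, Sequence


def build_prompt(term: str, source_lang: str, candidates: Sequence[str]) -> str:
    """Compose an arbitration prompt (illustrative wording, not the paper's)."""
    options = ", ".join(f"'{c}'" for c in candidates)
    return (
        f"Given the {source_lang} term '{term}' and the candidate English "
        f"translations [{options}], which is the most accurate term for a "
        "SKOS thesaurus? Answer with the chosen translation only."
    )


def choose_translation(
    term: str,
    source_lang: str,
    candidates: Sequence[str],
    ask_llm: Callable[[str], str],
) -> str:
    """Accept the MT result when all services agree; otherwise ask the LLM."""
    # Deduplicate after lowercasing and trimming (a simplifying assumption).
    distinct = sorted({c.strip().lower() for c in candidates})
    if len(distinct) == 1:
        return distinct[0]  # consensus: no LLM call, no LLM cost
    # No consensus: the LLM arbitrates between the distinct candidates.
    return ask_llm(build_prompt(term, source_lang, distinct)).strip()
```

Passing the LLM in as a plain callable keeps the routing logic testable with a stub and independent of any particular provider.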

3. Technical Details and Methodology

WOKIE is implemented in Python and uses libraries such as RDFLib to parse SKOS. The pipeline's effectiveness depends on its intelligent routing mechanism.

3.1 Translation Quality Evaluation Metric

To assess translation quality, the authors combined automatic metrics with expert human evaluation. For automatic scoring they adapted BLEU (Bilingual Evaluation Understudy), a score widely used in MT research, while noting its limitations for short, terminological phrases. The main evaluation focused on improvement in Ontology Matching (OM) performance, using standard OM systems such as LogMap and AML. The hypothesis is that higher-quality translations yield better alignment scores. The performance gain $G$ for a thesaurus $T$ after translation can be expressed as:

$G(T) = \frac{Score_{matched}(T_{translated}) - Score_{matched}(T_{original})}{Score_{matched}(T_{original})}$

where $Score_{matched}$ is the F-measure reported by the ontology matching system.
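
As a small worked example of the gain formula, the sketch below computes $G(T)$ from two F-measures. The function name `relative_gain` and the input values 0.40 and 0.50 are invented for illustration and are not results from the paper.

```python
def relative_gain(f_original: float, f_translated: float) -> float:
    """Relative improvement in F-measure after translation, i.e. G(T)."""
    if f_original <= 0:
        # The formula divides by the baseline score, so it must be positive.
        raise ValueError("baseline F-measure must be positive")
    return (f_translated - f_original) / f_original


# Illustrative numbers only: F-measure rising from 0.40 to 0.50.
print(f"{relative_gain(0.40, 0.50):.2f}")  # prints 0.25
```

A result of 0.25 corresponds to a 25% relative improvement over the untranslated baseline.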

4. Experimental Results and Evaluation

The evaluation covered several DH thesauri in 15 languages and tested different parameters, translation services, and LLMs.

Key Experimental Statistics

  • Thesauri Evaluated: Several (e.g., Getty AAT, GND)
  • Languages: 15, including German, French, Spanish, Chinese, Arabic
  • LLMs Tested: GPT-4, GPT-3.5-Turbo, Llama 3 70B
  • Baseline Services: Google Translate, DeepL API

4.1 Translation Quality Across Languages

Human evaluation showed that the WOKIE pipeline (external MT translation + LLM refinement) consistently outperformed any single external translation service. The quality gain was most pronounced for:

  • Low-resource languages: Where standard APIs often fail.
  • Domain-specific vocabulary: Terms with a particular cultural or historical meaning (e.g., "fresco secco," "codex"), for which general-purpose MT gives literal but incorrect translations.

Chart Description (Conceptual): A bar chart comparing BLEU scores (or human evaluation scores) across four conditions: Google Translate alone, DeepL alone, WOKIE with GPT-3.5 refinement, and WOKIE with GPT-4 refinement. The bars for the WOKIE configurations are markedly higher, especially for language pairs such as English-to-Arabic or English-to-Chinese.

4.2 Improving Ontology Matching Performance

The primary quantitative result. After non-English thesauri were processed by WOKIE to add English labels, the F-measure scores of ontology matching systems (LogMap, AML) rose substantially, by 22-35% on average depending on the language and the complexity of the thesaurus. This confirms the pipeline's central utility: it directly improves semantic interoperability by making non-English resources discoverable to English-centric OM tools.

Chart Description (Conceptual): A line graph showing the ontology matching F-measure on the y-axis against the translation method on the x-axis. The line starts low for "No Translation," rises slightly for "Single MT Service," and peaks sharply for "WOKIE Pipeline."

4.3 Performance and Cost Analysis

By using LLMs selectively, only for disputed terms (typically 10-25% of the total), WOKIE reduces LLM API costs by 75-90% compared with a full-LLM translation approach while retaining about 95% of the quality benefit. Processing time is dominated by the LLM calls, but the overall pipeline remains feasible for medium-sized thesauri on standard hardware.
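
The link between the disputed share and the cost reduction is simple arithmetic, sketched below under one loudly stated assumption: every term costs the same to send to the LLM, so the saving is one minus the disputed fraction. The function name `llm_cost_saving` is hypothetical, and the sketch ignores MT API fees and prompt overhead.

```python
def llm_cost_saving(disputed_fraction: float) -> float:
    """Share of LLM spend avoided versus sending every term to the LLM,
    assuming a uniform per-term LLM cost (an assumption of this sketch)."""
    if not 0.0 <= disputed_fraction <= 1.0:
        raise ValueError("disputed_fraction must be between 0 and 1")
    return 1.0 - disputed_fraction


# The 10-25% disputed range quoted above maps onto the 75-90% saving range.
for fraction in (0.10, 0.25):
    print(f"{fraction:.0%} disputed -> {llm_cost_saving(fraction):.0%} saved")
```

Under this assumption, routing 10% of terms to the LLM avoids 90% of the LLM spend and routing 25% avoids 75%, which matches the range reported in the text.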

5. Analysis Framework: A Case Study Without Code

Scenario: A European digital library holds a German-language thesaurus for medieval art techniques. Researchers in Japan cannot find relevant resources because their ontology matching tools process English labels only.

Applying WOKIE:

  1. Input: The German term "Wandmalerei" (wall painting).
  2. Stage 1 (External Translation): Google Translate returns "wall painting." DeepL returns "mural painting." Microsoft Translator returns "wall painting." The services disagree ("mural" vs. "wall").
  3. Stage 2 (Disagreement Detection): The similarity between the candidates falls below the threshold. LLM refinement is triggered.
  4. Stage 3 (LLM Refinement): Prompt: "Given the German art-history term 'Wandmalerei' and the candidate English translations ['wall painting', 'mural painting'], which is the most accurate and contextually appropriate term for a SKOS thesaurus in art history? Consider domain specificity and common usage in the field."
  5. LLM Output: "In the context of art-history thesauri such as the Getty AAT, 'mural painting' is the more specific and more commonly used descriptor for 'Wandmalerei,' as it denotes painting applied directly to a wall or ceiling."
  6. Result: The SKOS concept receives the prefLabel "mural painting," enabling accurate matching with English-language thesauri.

6. Future Work and Research Directions

  • Beyond Translation: Extending WOKIE to suggest new related concepts or altLabels in the target language, so that it acts as a thesaurus enrichment tool.
  • Integration with Foundation Models: Using vision-language models (such as CLIP) to translate concepts based on the images associated with them in digital collections, not just text.
  • Active Learning Loop: Incorporating human-in-the-loop feedback to correct LLM output and continuously improve the pipeline's domain-specific performance.
  • Evaluation Standardization: Developing a dedicated benchmark for assessing SKOS/thesaurus translation quality, moving beyond BLEU to metrics that capture the preservation of hierarchy and relations.
  • Broader Knowledge Organization Systems: Applying the hybrid MT+LLM refinement principle to more complex ontologies (OWL) beyond SKOS.

7. References

  1. Kraus, F., Blumenröhr, N., Tonne, D., & Streit, A. (2025). Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri. arXiv preprint arXiv:2507.19537.
  2. Miles, A., & Bechhofer, S. (2009). SKOS Simple Knowledge Organization System Reference. W3C Recommendation. https://www.w3.org/TR/skos-reference/
  3. Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
  4. Carroll, J. J., & Stickler, P. (2004). RDF Triples in the Semantic Web. IEEE Internet Computing.
  5. Getty Research Institute. (2024). Art & Architecture Thesaurus (AAT). https://www.getty.edu/research/tools/vocabularies/aat/
  6. Papineni, K., et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

8. Expert Analysis: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights

Core Insight: WOKIE is not just another translation tool; it is a pragmatic, cost-conscious interoperability engine for the fragmented world of cultural heritage data. Its real innovation is recognizing that full AI translation is overkill for niche domains, and using LLMs as a precision scalpel rather than a sledgehammer. The paper correctly identifies a root problem in DH: English is the de facto query language for linked data, which leaves vast non-English pools of knowledge isolated. WOKIE's goal is not poetic translation but enabling discovery, a more achievable and more impactful aim.

Logical Flow: The argument is compelling and well structured. It starts from an undeniable problem (linguistic exclusion in DH), dismantles the obvious solutions (manual work is infeasible, classical MT fails for lack of data), and casts LLMs as a potential but flawed savior (cost, hallucination). It then introduces an elegant hybrid model: use cheap, fast APIs for the 80% of easy cases and deploy expensive, more capable LLMs only as arbiters for the 20% that are disputed. This "disagreement detection" is the clever kernel of the work. The evaluation sensibly ties translation quality to a concrete, measurable outcome, the improvement in ontology matching scores, demonstrating real-world utility beyond subjective translation quality.

Strengths & Flaws:
Strengths: The hybrid architecture is commercially sensible and technically sound. The focus on SKOS, a W3C standard, ensures immediate relevance. Its open-source nature and design for "everyday hardware" substantially lower the barrier to adoption. Evaluating on OM performance is a masterstroke: it measures utility, not just polish.
Flaws: The paper glosses over prompt engineering, which is the make-or-break factor for LLM refinement. A poor prompt could render the LLM layer useless or even harmful. The evaluation, though sensible, is still somewhat siloed; how does WOKIE compare with fine-tuning a small open model such as NLLB on DH text? The long-term stability of LLM API pricing is a sustainability risk that is not adequately addressed.

Actionable Insights:

  • For DH Institutions: Pilot WOKIE now on one key non-English thesaurus. The return on investment in improved resource discovery and alignment with major hubs such as Europeana or DPLA could be significant. Start with free-tier services for validation.
  • For Developers: Contribute to the WOKIE codebase, especially by building a library of optimized prompts tuned for different DH subdomains (archaeology, musicology, etc.).
  • For Funders: Fund the creation of gold-standard multilingual DH terminology benchmarks to move the field beyond BLEU scores. Support projects that integrate WOKIE output into active learning systems.
  • Critical Next Step: The community must develop a governance model for these machine-translated labels. They should be clearly flagged as "machine-augmented" to preserve scholarly integrity, following the data provenance principles promoted by initiatives such as the Research Data Alliance (RDA).

In conclusion, WOKIE represents the kind of pragmatic, use-focused AI application that will genuinely change practice. It does not chase AGI; it solves a specific, painful problem with a thoughtful combination of old and new technology. Its success will be measured not in BLEU scores but in the number of previously invisible historical records that suddenly become discoverable to a researcher anywhere in the world.