DGT-TM: A Large-Scale Multilingual Translation Memory from the European Commission

An analysis of DGT-TM, a freely available translation memory covering 22 EU languages and 231 language pairs: how it was built, its applications in language technology, and its future impact.

22 Languages

Official EU languages covered

231 Pairs

Unique language pair combinations

2x Size

Size increase from the 2007 release to the 2011 release

Annual Updates

Planned release cadence

1. Introduction and Motivation

The European Commission (EC), through its Directorate-General for Translation (DGT) and its Joint Research Centre (JRC), set a precedent in open multilingual data with DGT-TM (Translation Memory). The resource is part of a broader effort to release large-scale language assets, following the JRC-Acquis parallel corpus. The 2011 release of DGT-TM contains documents from 2004-2010 and is twice the size of the 2007 version. The effort stems from the EU's founding principle of multilingualism, which aims to promote cultural diversity, transparency, and democratic access for all EU citizens in their own languages.

The release is consistent with Directive 2003/98/EC on the re-use of public sector information, which recognizes such data as valuable raw material for digital innovation and cross-border services.

2. The DGT-TM Resource

DGT-TM is a collection of sentences and their professionally produced human translations in 22 official EU languages.

2.1. Data Source and Structure

The data originates from the translation workflow of the European Commission's DGT. It consists of authentic legal, policy, and administrative documents, which ensures professional-grade translation quality. The memory is organized as aligned sentence pairs and stored in the Translation Memory eXchange (TMX) format, a standard for exchanging translation memories.
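
To make the structure concrete, here is a minimal sketch of reading aligned pairs from a TMX document with Python's standard library. The sample XML, the sentences in it, and the language codes (`EN-GB`, `FR-FR`, `DE-DE`) are illustrative assumptions and are not taken from the DGT-TM files; `extract_pairs` is a hypothetical helper name. TMX stores each translation unit as a `<tu>` element holding one `<tuv>` per language, with the text in `<seg>`.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace, so ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in TMX layout (assumption: not real DGT-TM content).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Le présent règlement entre en vigueur.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Geschehen zu Brüssel.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                lang = tuv.get(XML_LANG, "").upper()
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE_TMX, "EN-GB", "FR-FR"):
        print(source, "|", target)
```

Because every translation unit can carry a different subset of languages, filtering per unit (rather than assuming all languages are always present) is the safer default.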

2.2. Release History and Statistics

The first major release was in 2007. The 2011 release (DGT-TM Release 2011) includes data up to the end of 2010 and represents a significant expansion. The EC plans annual releases from now on, creating a living, growing resource. The collection covers all 231 language pairs that can be formed among the 22 languages.
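
The figure of 231 follows from counting unordered pairs of 22 languages, that is 22 × 21 / 2. A short check, using placeholder labels rather than real language codes:

```python
from itertools import combinations
from math import comb

LANGUAGE_COUNT = 22

# Placeholder labels (L01..L22); the real language codes are not needed for counting.
labels = [f"L{i:02d}" for i in range(1, LANGUAGE_COUNT + 1)]

unordered_pairs = list(combinations(labels, 2))
pair_count = len(unordered_pairs)
direction_count = LANGUAGE_COUNT * (LANGUAGE_COUNT - 1)

assert pair_count == comb(LANGUAGE_COUNT, 2)
print(pair_count)       # 231 unordered language pairs
print(direction_count)  # 462 if each translation direction is counted separately
```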

3. Applications and Use Cases

3.1. For Professional Translators

Primarily, DGT-TM is used with translation memory software to increase translators' productivity and to ensure terminological consistency by suggesting earlier translations of identical or similar sentences.
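
The lookup that translation memory software performs can be sketched as a fuzzy match over stored source sentences. The example below is an assumption-laden illustration, not how any particular tool works: it uses `difflib.SequenceMatcher` from the standard library as the similarity measure, a hypothetical threshold of 0.7, and invented memory entries.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory entries: (source sentence, stored translation).
MEMORY = [
    ("This Regulation shall enter into force on the day of its publication.",
     "Le présent règlement entre en vigueur le jour de sa publication."),
    ("The Member States shall inform the Commission.",
     "Les États membres informent la Commission."),
]


def best_match(query, memory, threshold=0.7):
    """Return (score, source, target) for the closest stored source, or None."""
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    # Suggestions below the threshold are treated as "no match".
    if best is not None and best[0] >= threshold:
        return best
    return None


if __name__ == "__main__":
    query = "This Regulation shall enter into force on the day after its publication."
    print(best_match(query, MEMORY))
```

A linear scan is fine for a sketch; a memory of DGT-TM's size would need an index to keep lookups fast.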

3.2. For Language Technology Research

The resource is highly valuable for research and development in:

  • Statistical Machine Translation (SMT): As training data for building and evaluating SMT systems, including for low-resource language pairs.
  • Terminology Extraction: For mining bilingual and multilingual term lists.
  • Named Entity Recognition (NER): For developing and evaluating cross-lingual NER tools.
  • Cross-lingual Text Classification and Clustering: As labeled data for cross-lingual document categorization.

4. Technical and Legal Context

The release operates under the framework of Directive 2003/98/EC, which encourages the re-use of public sector information to foster innovation and a competitive digital single market. The data is provided free of charge, lowering the barrier to entry for researchers and small companies in the language technology sector.

5. Related EU Resources

DGT-TM is part of a larger ecosystem of open multilingual resources from EU institutions:

  • EUR-Lex: Free access point to EU law in 23 languages.
  • IATE: The Interactive Terminology for Europe database.
  • EuroVoc: A multilingual, multidisciplinary thesaurus.
  • JRC-Names: A resource for named entity recognition and normalization.
  • JEX (JRC EuroVoc Indexer): Software for automatic document classification using EuroVoc.

Together, these resources provide a comprehensive foundation for multilingual information access and processing.

6. Core Insight & Analyst's Perspective

Core Insight: DGT-TM is not just a dataset; it is a strategic policy asset. The European Commission is using its unique position as the world's largest employer of professional translators to build the most comprehensive public multilingual corpus in existence. The move cleverly turns a bureaucratic necessity, translation, into a competitive advantage for the EU's digital economy and research community. It directly counters the dominance of proprietary, mostly English-centric datasets held by large US technology companies, as discussed in resources such as the ACL Anthology on data scarcity in NLP.

Logical Flow: The logic is hard to fault: 1) EU law mandates multilingualism; 2) this generates vast amounts of high-quality translation data; 3) by opening this data, the EC stimulates external innovation in Language Technology (LT); 4) better LT, in turn, lowers future costs and raises the quality of the very translation process that produced the data. It is a virtuous cycle designed to cement the EU's role as a global hub for multilingual AI.

Strengths & Flaws: Its strengths are its unmatched scale, quality, and legal clarity. Unlike web-crawled corpora, it is clean, professionally translated, and comes with clear usage rights. Its main flaw, however, is domain bias. The corpus is heavily skewed toward legal, administrative, and political language. This limits its direct usefulness for training robust general-purpose machine translation systems for colloquial or commercial language, a gap that becomes apparent when its genre is compared with the broader domain data used in models such as Google's NMT. It is a gold mine for institutional NLP, but not a one-size-fits-all solution.

Actionable Insights: For researchers, the priority should be domain adaptation. Use DGT-TM as a large, high-quality seed corpus and apply techniques such as fine-tuning or back-translation with noisier, broader data to build more versatile models. For policymakers outside the EU, this is a blueprint: mandate the release of government translation memories. For entrepreneurs, the opportunity lies in building specialized SaaS tools for multilingual legal or compliance search and analysis, exploiting this domain strength directly rather than fighting the bias.

7. Technical Details & Mathematical Framework

The core value of DGT-TM lies in its parallel sentence alignments. Formally, for a document $D$ translated from a source language $L_s$ into a target language $L_t$, the TM contains a set of aligned pairs $\{(s_1, t_1), (s_2, t_2), ..., (s_n, t_n)\}$, where $s_i$ is a source sentence and $t_i$ is its human translation.

In Statistical Machine Translation, such a corpus is used to estimate the parameters of the translation model. A key component is the phrase translation probability $\phi(\bar{t}|\bar{s})$, estimated from relative frequencies in the aligned data: $$\phi(\bar{t}|\bar{s}) = \frac{\text{count}(\bar{s}, \bar{t})}{\sum_{\bar{t}'}\text{count}(\bar{s}, \bar{t}')}$$ where $\bar{s}$ and $\bar{t}$ are contiguous word sequences (phrases) extracted from the aligned sentence pairs. The large size of DGT-TM allows these probabilities to be estimated reliably, especially for longer phrases and for lower-frequency language pairs.
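
As a small worked instance of the relative-frequency estimate for $\phi(\bar{t}|\bar{s})$, the sketch below counts phrase pairs and normalizes per source phrase. The phrase pairs are invented toy data, and the phrase extraction step itself (which in practice relies on word alignments) is assumed to have already happened.

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Estimate phi(t | s) = count(s, t) / sum over t' of count(s, t')."""
    pair_counts = Counter(phrase_pairs)
    source_totals = Counter()
    for (source, _target), count in pair_counts.items():
        source_totals[source] += count
    return {
        (source, target): count / source_totals[source]
        for (source, target), count in pair_counts.items()
    }


# Toy extracted phrase pairs (assumption: illustrative only, not DGT-TM data).
TOY_PHRASE_PAIRS = (
    [("member states", "États membres")] * 3
    + [("member states", "pays membres")]
    + [("shall apply", "s'applique")] * 2
)

phi = phrase_translation_probs(TOY_PHRASE_PAIRS)

if __name__ == "__main__":
    for (source, target), prob in sorted(phi.items()):
        print(f"phi({target!r} | {source!r}) = {prob:.2f}")
```

Here "member states" was seen four times, three of them with "États membres", so that translation receives probability 3/4 and the alternative 1/4.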

For bilingual terminology extraction, measures such as pointwise mutual information (PMI) can be computed over the aligned corpus to identify likely term translations: $$\text{PMI}(s, t) = \log_2 \frac{P(s, t)}{P(s)P(t)}$$ where $P(s, t)$ is the probability that source term $s$ and target term $t$ co-occur in aligned sentences, and $P(s)$, $P(t)$ are their marginal probabilities.
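
A minimal sketch of the PMI computation follows. It makes two assumptions that the text does not specify: probabilities are estimated as the fraction of aligned sentence pairs in which a term (or term pair) occurs, and terms are single whitespace-separated tokens. The four sentence pairs are invented toy data.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Compute PMI(s, t) for every source/target term pair that co-occurs."""
    n = len(aligned_pairs)
    source_counts, target_counts, joint_counts = Counter(), Counter(), Counter()
    for source_sentence, target_sentence in aligned_pairs:
        # Sets: each term is counted at most once per sentence pair.
        source_terms = set(source_sentence.lower().split())
        target_terms = set(target_sentence.lower().split())
        source_counts.update(source_terms)
        target_counts.update(target_terms)
        joint_counts.update((s, t) for s in source_terms for t in target_terms)
    return {
        (s, t): math.log2((c / n) / ((source_counts[s] / n) * (target_counts[t] / n)))
        for (s, t), c in joint_counts.items()
    }


# Toy aligned corpus (assumption: illustrative only).
TOY_CORPUS = [
    ("the regulation applies", "le règlement s'applique"),
    ("the directive applies", "la directive s'applique"),
    ("the regulation enters", "le règlement entre"),
    ("the decision enters", "la décision entre"),
]

pmi = pmi_table(TOY_CORPUS)

if __name__ == "__main__":
    print(pmi[("regulation", "règlement")])   # 1.0
    print(pmi[("regulation", "s'applique")])  # 0.0
```

In the toy corpus, "regulation" and "règlement" always appear together (PMI of 1 bit), while "regulation" and "s'applique" co-occur no more often than chance would predict (PMI of 0), which is the signal a term extractor would rank on.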

8. Experimental Results & Data Analysis

While the PDF does not present specific experimental results, the stated size points to substantial potential. For context, research using similar EU corpora (such as JRC-Acquis) has shown significant improvements in SMT quality for EU languages. For example, Koehn & Knowles (2017), in "Six Challenges for Neural Machine Translation", note that the availability of large parallel corpora such as Europarl and Acquis is a key enabler of competitive NMT for European languages.

Chart Description (Hypothetical): A hypothetical bar chart titled "Growth in DGT-TM Sentence Pairs (2007 vs 2011 Release)" would show two bars for a sample language pair (e.g., English-French). The 2007 bar would have a given height, representing the initial volume. The 2011 bar would be twice as tall, visually confirming the "twice the size" claim. A second chart, a line plot, could show sentence counts across the years 2004-2010, illustrating the steady inflow of documents that made up the 2011 release.

The key statistic is the doubling of data volume between releases. In machine learning, and especially for data-hungry neural models, this increase in scale is non-trivial. It can move a language pair from "low-resource" to "medium-resource" status, potentially improving translation quality metrics (e.g., BLEU scores) by several points, as observed in studies of data scaling laws for NMT.

9. Analysis Framework: An Example Use Case

Scenario: A language technology specialist wants to build a specialized tool for monitoring EU regulatory announcements across languages.

Framework Application (No Code):

  1. Problem Decomposition: The core task is cross-lingual information retrieval (CLIR) and classification in the legal/regulatory domain.
  2. Resource Mapping:
    • DGT-TM: Used as the parallel corpus for training a domain-specific bilingual word embedding model (e.g., using VecMap or MUSE) for English and French. This yields a vector space in which semantically similar regulatory terms in different languages lie close together.
    • EuroVoc (via JEX): Used as the classification scheme. Documents are labeled with the relevant EuroVoc descriptors.
    • IATE: Used as a validation dictionary for checking the quality of the term alignments learned from DGT-TM.
  3. Workflow:
    1. Train cross-lingual word embeddings on DGT-TM.
    2. For a new French regulatory document, convert it into a document vector using the French word embeddings.
    3. Map this vector into the English embedding space using the alignment learned in step 1.
    4. Compare the mapped vector against a store of pre-vectorized English documents (classified with EuroVoc via JEX) to find semantically similar EU regulations.
    5. Assign the relevant EuroVoc descriptors from the matched English documents to the new French document.
  4. Outcome: The specialist can now classify new regulatory texts in any covered language and link them to the existing multilingual collection, enabling efficient monitoring and analysis.

This example shows how DGT-TM acts as the essential "glue", or training data, that allows the other EU resources (EuroVoc, IATE) to be combined in a domain-specific application.
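
Although the walkthrough above is deliberately code-free, steps 2 to 5 of the workflow can be sketched in a few lines. Everything below is a toy assumption for illustration: the 2-D word vectors, the mapping matrix `FR_TO_EN` (standing in for the alignment that a tool such as VecMap or MUSE would learn in step 1), the document identifiers, and the descriptor labels, which are placeholders rather than actual JEX output.

```python
import math

# Toy French word vectors (assumption); a real system would train these on DGT-TM.
FR_VECTORS = {"impôt": (0.0, 1.0), "poisson": (-1.0, 0.0)}

# Hypothetical learned linear map from the French space into the English space.
FR_TO_EN = ((0.0, 1.0), (-1.0, 0.0))

# Pre-vectorized English documents with placeholder descriptor labels.
EN_INDEX = [
    ("en-doc-1", (1.0, 0.0), ["taxation"]),
    ("en-doc-2", (0.0, 1.0), ["fisheries"]),
]


def doc_vector(tokens, vectors):
    """Step 2: average the vectors of the known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in document")
    return tuple(sum(v[i] for v in known) / len(known) for i in range(len(known[0])))


def apply_map(matrix, vector):
    """Step 3: matrix-vector product that projects into the English space."""
    return tuple(sum(w * x for w, x in zip(row, vector)) for row in matrix)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(tokens):
    """Steps 4 and 5: find the nearest English document and reuse its descriptors."""
    mapped = apply_map(FR_TO_EN, doc_vector(tokens, FR_VECTORS))
    doc_id, _vector, descriptors = max(EN_INDEX, key=lambda entry: cosine(mapped, entry[1]))
    return doc_id, descriptors


if __name__ == "__main__":
    print(classify(["impôt", "impôt", "poisson"]))  # ('en-doc-1', ['taxation'])
```

Taking descriptors from only the single nearest document keeps the sketch short; aggregating over several neighbours would be a natural refinement.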

10. Future Applications & Development Directions

The trajectory of DGT-TM points to several significant future developments:

  • Foundation for Large Language Models (LLMs): DGT-TM is well suited to pre-training or fine-tuning multilingual language models (such as BERT or XLM-R) specifically for the legal and administrative domains, creating specialized "Regulatory GPTs".
  • Real-time Translation Memory as a Service (TMaaS): With annual updates, the EC could offer a live API in which translation suggestions are drawn from the entire, growing DGT-TM, benefiting freelance translators and small agencies worldwide.
  • Bias Detection and Fairness Analysis: As a record of official EU communication, the corpus can be analyzed to study linguistic bias, terminology evolution, and representation across languages and policy areas.
  • Enabling Multimodal Applications: Future releases could link to other open data, such as public speeches (video/audio) or formatted legal documents (PDFs with layout), enabling research in multimodal translation and document understanding.
  • Evaluation Benchmark: DGT-TM could serve as a test benchmark for evaluating the robustness of commercial MT systems on formal, legally sensitive text, going beyond general-domain evaluation benchmarks.

The commitment to annual releases turns DGT-TM from a static snapshot into a dynamic, longitudinal dataset, opening new lines of research into tracking language change and the effect of policy over time.

11. References

  1. Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (Year). DGT-TM: A freely available Translation Memory in 22 languages. European Commission.
  2. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiș, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06).
  3. Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics.
  4. European Commission, Directorate-General for Translation. (2008). Translating for a Multilingual Community. Publications Office of the European Union.
  5. Directive 2003/98/EC of the European Parliament and of the Council on the re-use of public sector information. Official Journal of the European Union, L 345.
  6. Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). (Reference for the XLM-R model, relevant to future LLM applications.)
  7. ACL Anthology. (n.d.). A digital archive of research papers in computational linguistics. Retrieved from https://www.aclweb.org/anthology/ (General reference for NLP research context.)