Multilingual Transfer and Adaptation for Low-Resource Languages of Spain: HW-TSC's WMT 2024 Submission

An analysis of Huawei's WMT 2024 submission, which uses multilingual transfer, regularization, and synthetic data generation to translate from Spanish into Aragonese, Aranese, and Asturian.

1. Introduction

This paper describes the submission of the Huawei Translation Services Center (HW-TSC) to the WMT 2024 shared task "Translation into Low-Resource Languages of Spain". The team took part in three translation directions: Spanish to Aragonese (es→arg), Spanish to Aranese (es→arn), and Spanish to Asturian (es→ast). The core challenge is neural machine translation (NMT) for languages with little parallel training data, a common obstacle to making translation technology inclusive.

The proposed solution combines several advanced training strategies applied to a deep Transformer-big architecture: multilingual transfer learning, regularized dropout, synthetic data generation through forward and back translation, noise reduction with LaBSE denoising, and model consolidation through transductive ensemble learning. Together these techniques aim to raise translation quality despite data scarcity, and the system achieved competitive results in the final evaluation.

2. Dataset

Training used only the data provided by the WMT 2024 organizers, which keeps the comparison fair. The data comprise bilingual parallel corpora and monolingual data in both the source language (Spanish) and the low-resource target languages.

Data Statistics

The amount of available data varies widely across the three language pairs, which underlines the "low-resource" setting, especially for Aragonese.

2.1 Data Size

The following table (reconstructed from the PDF) summarizes the data available for each language pair. All figures are in millions (M) of sentence pairs (bilingual) or sentences (monolingual).

Language pair | Bilingual data | Source (es) monolingual | Target monolingual
es → arg | 0.06M | 0.4M | 0.26M
es → arn | 2.04M | 8M | 6M
es → ast | 13.36M | 8M | 3M

Key Insight: The large gap in bilingual data (0.06M for Aragonese versus 13.36M for Asturian) calls for effective transfer and data augmentation techniques. The larger monolingual corpora become key assets for generating synthetic parallel data.
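To make the imbalance concrete, the short sketch below recomputes two ratios from the table figures. The dictionary layout and the helper names `bilingual_ratio` and `mono_to_bilingual` are illustrative assumptions, not part of the paper.

```python
# Data sizes in millions, copied from the table in Section 2.1.
sizes = {
    "es-arg": {"bilingual": 0.06, "src_mono": 0.4, "tgt_mono": 0.26},
    "es-arn": {"bilingual": 2.04, "src_mono": 8.0, "tgt_mono": 6.0},
    "es-ast": {"bilingual": 13.36, "src_mono": 8.0, "tgt_mono": 3.0},
}


def bilingual_ratio(pair_a: str, pair_b: str) -> float:
    """How many times more bilingual data pair_a has than pair_b."""
    return sizes[pair_a]["bilingual"] / sizes[pair_b]["bilingual"]


def mono_to_bilingual(pair: str) -> float:
    """Total monolingual data (source + target) per unit of bilingual data."""
    s = sizes[pair]
    return (s["src_mono"] + s["tgt_mono"]) / s["bilingual"]


if __name__ == "__main__":
    print(f"ast vs. arg bilingual: {bilingual_ratio('es-ast', 'es-arg'):.0f}x")
    for pair in sizes:
        print(f"{pair}: monolingual/bilingual = {mono_to_bilingual(pair):.1f}")
```

On these figures Asturian has roughly 220 times more bilingual data than Aragonese, while Aragonese has about 11 times more monolingual than bilingual text, which is why synthetic data matters most for that pair.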

3. NMT System Overview

The system is built on a deep Transformer-big architecture. The novelty lies not in the base model but in the carefully assembled set of training strategies designed to overcome data limitations:

  • Multilingual Pre-training: The model is first trained on a mixture of data from related languages (e.g., other Romance languages). Sharing parameters (vocabulary, encoder/decoder layers) lets knowledge transfer from higher-resource languages to lower-resource ones.
  • Regularized Dropout (Wu et al., 2021): An advanced dropout technique that improves generalization and limits overfitting on small datasets by encouraging the model's predictions for the same input to agree across different dropout passes.
  • Synthetic Data Generation:
    • Forward Translation: Translating source-language monolingual data into the target language to create synthetic source-target pairs.
    • Back Translation: Translating target-language monolingual data into the source language, a key technique for augmenting NMT training data.
  • LaBSE Denoising (Feng et al., 2020): A Language-agnostic BERT Sentence Embedding (LaBSE) model filters noisy or low-quality sentence pairs out of the synthetic data, so that only high-quality examples guide the final training.
  • Transductive Ensemble Learning (Wang et al., 2020): A way to merge the strengths of several separately trained NMT models (e.g., trained on different data mixtures) into a single stronger model, instead of ensembling at inference time.
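The last item can be pictured with a small sketch. It assumes one common reading of transductive ensemble learning: several trained models translate the source side of the test set, and the resulting synthetic pairs form a fine-tuning corpus for a single model. The excerpt does not confirm HW-TSC's exact procedure, and the `Translator` type, the function name, and the de-duplication step are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: any callable that maps a source sentence to a translation.
Translator = Callable[[str], str]


def build_tel_corpus(
    test_sources: Sequence[str],
    models: Sequence[Translator],
) -> List[Tuple[str, str]]:
    """Collect (source, translation) pairs from every model for every test source.

    Exact duplicate pairs are dropped (an illustrative choice) so that models
    that agree do not inflate the fine-tuning corpus.
    """
    corpus: List[Tuple[str, str]] = []
    seen = set()
    for src in test_sources:
        for translate in models:
            pair = (src, translate(src))
            if pair not in seen:
                seen.add(pair)
                corpus.append(pair)
    return corpus
```

A single model fine-tuned on such a corpus absorbs what the individual models agree on for the inputs it will actually see, which avoids running every model at inference time.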

4. Experimental Setup & Results

The paper reports that applying the strategies above produced competitive results in the final WMT 2024 evaluation. Although the excerpt does not give specific BLEU or chrF++ scores, the outcome supports the multi-strategy approach for low-resource settings. The gains likely come from how the strategies complement one another: transfer learning provides a strong initialization, synthetic data enlarges the effective training set, denoising cleans it, and regularization and ensembling stabilize and lift final performance.

5. Core Analysis & Expert Interpretation

Core Insight

Huawei's submission is a textbook example of pragmatic engineering over theoretical novelty. In the high-stakes WMT arena, the team deployed an array of established but powerful techniques rather than betting on a single untested breakthrough. The work is not about inventing a new architecture; it is about breaking down the data-scarcity problem in stages: transfer learning for foundational knowledge, synthetic data for scale, denoising for quality control, and ensembling for peak performance. It is a reminder that in applied AI, robust pipelines often beat fragile models.

Logical Flow

The approach follows a sensible, production-ready logic. It starts with the most important lever, multilingual transfer, which exploits the linguistic relatedness of the languages of Spain. This resembles pre-training a model on general photography before fine-tuning it for a particular style; image-to-image translation models such as CycleGAN (Zhu et al., 2017) offer a loose parallel from computer vision. The team then tackles the underlying data shortage through augmentation with forward and back translation, a technique proven across the SMT and NMT eras. Importantly, they do not take the synthetic data at face value: LaBSE denoising is the critical quality gate that filters out noise that could damage the model, a lesson learned from the pitfalls of early back-translation efforts. Finally, they consolidate the gains through ensemble learning, which adds robustness.

Strengths & Flaws

Strengths: The approach is thorough and low-risk. Each component addresses a known weakness of low-resource NMT. Using LaBSE for denoising is especially smart, since it applies a modern sentence-embedding model to a practical data-cleaning task. Staying with the standard Transformer-big architecture supports reproducibility and stability.

Flaws: The elephant in the room is the absence of any large language model (LLM) integration. The paper mentions LLMs as a trend but does not use them. In 2024, not trying to fine-tune a multilingual LLM (such as BLOOM or Llama) on these tasks is a significant strategic omission. LLMs, with their broad knowledge and in-context learning ability, have set new benchmarks for low-resource translation, as noted in an ACL survey (Ruder, 2023). In addition, the paper has no ablation study. We do not know which strategy (denoising vs. ensembling vs. transfer) contributed most to the gains, which makes the solution a black box.

Actionable Insights

For practitioners: Copy this pipeline, but add an LLM. Use a multilingual LLM as the basis for transfer learning instead of, or in addition to, a conventional multilingual NMT model. Explore parameter-efficient fine-tuning (PEFT) methods such as LoRA to adapt the LLM efficiently. The denoising and ensembling stages remain highly valuable. For researchers: The field needs more benchmarks on the cost/benefit of synthetic-data pipelines versus LLM fine-tuning in low-resource settings. Huawei's work is a strong baseline for the former; the next paper should compare it with the latter.

6. Technical Details & Mathematical Formulation

Although the PDF excerpt gives no specific formulas, the core techniques can be stated formally:

Regularized Dropout (Concept): Standard dropout samples a fresh random mask on every forward pass, $h_{drop} = h \odot m$ with $m \sim \text{Bernoulli}(p)$, so two passes over the same input give slightly different predictions. Regularized dropout, in the R-Drop formulation, turns this into a consistency constraint: each input is passed through the model twice with independently sampled masks, giving output distributions $P_1$ and $P_2$, and the training loss adds a symmetric KL term to the usual negative log-likelihood: $\mathcal{L} = \mathcal{L}_{NLL} + \alpha \cdot \frac{1}{2}\left[\mathrm{KL}(P_1 \| P_2) + \mathrm{KL}(P_2 \| P_1)\right]$. The consistency term acts as the regularizer and pushes the model toward robust features.
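As a minimal numeric sketch of one widely used formulation of regularized dropout (R-Drop), the code below computes the loss for a single prediction step: the negative log-likelihood of both dropout passes plus a weighted symmetric KL term. The distributions are given as plain probability lists, and the names `r_drop_loss`, `alpha`, and `eps` are illustrative assumptions; the excerpt does not state HW-TSC's exact loss or weight.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def r_drop_loss(
    p1: Sequence[float],
    p2: Sequence[float],
    gold: int,
    alpha: float = 1.0,
    eps: float = 1e-12,
) -> float:
    """NLL of both dropout passes plus alpha times the symmetric KL between them.

    p1 and p2 are the output distributions of two forward passes over the same
    input, each with its own dropout mask; gold is the index of the reference token.
    """
    nll = -math.log(p1[gold] + eps) - math.log(p2[gold] + eps)
    consistency = 0.5 * (kl_divergence(p1, p2, eps) + kl_divergence(p2, p1, eps))
    return nll + alpha * consistency
```

When the two passes agree exactly, the KL term vanishes and only the likelihood terms remain; the more the passes disagree, the larger the penalty.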

Back-Translation Objective: Given a monolingual sentence $y$ in the target language, a reverse model $\theta_{y\rightarrow x}$ generates a synthetic source sentence $\hat{x}$. The synthetic pair $(\hat{x}, y)$ is then used to train the forward model $\theta_{x\rightarrow y}$ by minimizing the negative log-likelihood: $\mathcal{L}_{BT} = -\sum \log P(y | \hat{x}; \theta_{x\rightarrow y})$.
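The data flow of back-translation can be sketched as follows. The reverse model is represented by a hypothetical callable `reverse_model` (target sentence in, synthetic source sentence out), and `bt_loss` simply sums precomputed log-probabilities; neither name comes from the paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def back_translate(
    target_mono: Iterable[str],
    reverse_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn target-language monolingual sentences y into synthetic pairs (x_hat, y)."""
    pairs: List[Tuple[str, str]] = []
    for y in target_mono:
        y = y.strip()
        if not y:
            continue  # skip empty lines rather than translating them
        x_hat = reverse_model(y)  # synthetic source produced by the y -> x model
        pairs.append((x_hat, y))
    return pairs


def bt_loss(log_probs: Sequence[float]) -> float:
    """L_BT = -sum(log P(y | x_hat)), given one log-probability per synthetic pair."""
    return -sum(log_probs)
```

Note that the target side of every pair is genuine text; only the source side is machine-generated, so the forward model always learns to produce real target-language output.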

LaBSE Denoising Filter: For a synthetic pair $(\hat{x}, y)$, compute the LaBSE embeddings $e_{\hat{x}}, e_{y}$. The pair is kept only if their cosine similarity exceeds a threshold $\tau$: $\frac{e_{\hat{x}} \cdot e_{y}}{\|e_{\hat{x}}\|\|e_{y}\|} > \tau$. This removes pairs whose semantic alignment is weak.
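The filter itself is only a thresholded cosine similarity, as the sketch below shows. It assumes the sentence embeddings have already been computed by a LaBSE encoder (not shown) and are passed in as plain lists of floats; the function names and the default `tau=0.8` are illustrative, and the excerpt does not give the threshold HW-TSC used.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    pairs: Sequence[Tuple[str, str]],
    src_embeddings: Sequence[Sequence[float]],
    tgt_embeddings: Sequence[Sequence[float]],
    tau: float = 0.8,
) -> List[Tuple[str, str]]:
    """Keep only the pairs whose source/target embeddings have cosine similarity > tau."""
    kept: List[Tuple[str, str]] = []
    for pair, e_x, e_y in zip(pairs, src_embeddings, tgt_embeddings):
        if cosine(e_x, e_y) > tau:
            kept.append(pair)
    return kept
```

In practice the threshold trades data volume against cleanliness, so it is usually tuned on a held-out set rather than fixed in advance.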

7. Results & Chart Description

The provided PDF content contains no specific tables or charts. Based on the description, a hypothetical results chart would likely show:

  • Chart Type: Grouped bar chart.
  • X-axis: The three language pairs: es→arg, es→arn, es→ast.
  • Y-axis: Automatic evaluation scores (e.g., BLEU, chrF++).
  • Bars: Several bars per language pair comparing: 1) Baseline (Transformer-big on bilingual data only), 2) +Multilingual Transfer, 3) +Synthetic Data (BT/FT), 4) +Denoising & Ensembling (full HW-TSC system).
  • Expected Trend: A clear rise in scores from the baseline to the full system, with the largest relative gain expected for the lowest-resource pair, es→arg, which would show how well the techniques work under extreme data scarcity.

The paper's conclusion that the system achieved "competitive results" suggests that the final HW-TSC bars would sit at or near the top of the leaderboard for each task in the WMT 2024 evaluation.

8. Analysis Framework: Case Study

Scenario: A technology company wants to build a translation system for a new low-resource language, "LangX," with only 10,000 parallel sentences but 1 million monolingual sentences in a related high-resource language, "LangH."

Applying the Framework (Inspired by HW-TSC):

  1. Step 1 - Foundation (Transfer): Pre-train a multilingual model on publicly available data for LangH and other languages in the same family. Initialize the LangH→LangX model with these weights.
  2. Step 2 - Scale (Synthesis):
    • Use the initial model to forward-translate the 1M monolingual LangH sentences, creating synthetic (LangH, synthetic_LangX) pairs.
    • Train a reverse model (LangX→LangH) on the 10K real pairs, then use it to back-translate monolingual LangX data (if available), creating synthetic (synthetic_LangH, LangX) pairs.
  3. Step 3 - Refine (Denoise): Combine all real and synthetic pairs. Use a sentence-embedding model (e.g., LaBSE) to compute a similarity score for each synthetic pair. Discard every pair below a calibrated similarity threshold (e.g., 0.8).
  4. Step 4 - Optimize (Train & Ensemble): Train several final models on the denoised dataset, with regularized dropout. Use ensemble learning to merge them into a single production model.

This structured, staged approach reduces project risk and makes each stage transparent, reflecting the industrial research-and-development process evident in Huawei's work.
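Steps 2 and 3 of the case study can be tied together in a few lines. In this sketch the two translation models and the similarity scorer are hypothetical callables (`src_to_tgt`, `tgt_to_src`, `score`), the threshold is the illustrative 0.8 from Step 3, and real pairs bypass the filter because only synthetic pairs are scored.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (LangH sentence, LangX sentence)


def build_training_set(
    real_pairs: Sequence[Pair],
    src_mono: Sequence[str],
    tgt_mono: Sequence[str],
    src_to_tgt: Callable[[str], str],
    tgt_to_src: Callable[[str], str],
    score: Callable[[str, str], float],
    tau: float = 0.8,
) -> List[Pair]:
    """Synthesize pairs from both monolingual sides, filter them, and add the real pairs."""
    # Source-side monolingual text translated into the target language.
    from_source = [(x, src_to_tgt(x)) for x in src_mono]
    # Target-side monolingual text translated back into the source language.
    from_target = [(tgt_to_src(y), y) for y in tgt_mono]
    # Keep only synthetic pairs whose similarity score clears the threshold.
    clean = [pair for pair in from_source + from_target if score(*pair) > tau]
    return list(real_pairs) + clean
```

Passing the models and the scorer as arguments keeps each stage swappable, so a team can replace the scorer or either model without touching the rest of the pipeline.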

9. Future Applications & Directions

The techniques shown apply well beyond the specific languages of Spain:

  • Digital Preservation: Enabling translation and content creation for hundreds of the world's endangered languages with minimal parallel data.
  • Enterprise Domain Adaptation: Quickly adapting general MT models to specialized terminology (e.g., legal, medical) where in-domain parallel data is scarce but monolingual manuals and legacy documents exist.
  • Low-Resource Multimodal Learning: The pipeline's principles (transfer, synthetic data, denoising) can be adapted to low-resource image captioning or speech translation tasks.

Future Research Directions:

  1. LLM Integration: The most important direction is combining this pipeline with decoder-only LLMs. Future work should compare fine-tuning (e.g., Mistral, Llama) against this specialized NMT approach in terms of quality, cost, and latency.
  2. Dynamic Data Scheduling: Instead of static filtering, develop curriculum learning strategies that schedule the presentation of real and synthetic, clean and noisy data during training.
  3. Explainable Denoising: Move beyond cosine-similarity thresholds toward more interpretable measures of synthetic data quality, possibly using model confidence or uncertainty estimates.
  4. Zero-Shot Transfer: Investigate how models trained on this set of languages of Spain perform on related but unseen Romance languages, moving toward true zero-shot capability.

10. References

  1. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  2. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.
  3. Koehn, P., et al. (2007). Moses: Open source toolkit for statistical machine translation. ACL.
  4. Li, Z., et al. (2022). Pre-training multilingual neural machine translation by leveraging alignment information. Findings of EMNLP.
  5. Ruder, S. (2023). Advances in Natural Language Processing. ACL Rolling Review Survey Track.
  6. Wang, Y., et al. (2020). Transductive ensemble learning for neural machine translation. AAAI.
  7. Wu, Z., et al. (2021). Regularized dropout for neural machine translation. ACL-IJCNLP.
  8. Wu, Z., et al. (2023). Synthetic data for neural machine translation: A survey. Computational Linguistics.
  9. Zhu, J.Y., Park, T., Isola, P., & Efros, A.A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.