1. Introduction
Domain adaptation is a key component of Machine Translation (MT): it covers the adaptation of terminology, domain, and style, particularly in Computer-Assisted Translation (CAT) workflows that involve human post-editing. This paper introduces a new concept called "domain specialization" for Neural Machine Translation (NMT). The approach is a form of post-training adaptation in which a generic, pre-trained NMT model is refined using newly available in-domain data. It promises gains in learning speed and adaptation accuracy compared with conventional full training from scratch.
The main contribution is a study of this specialization approach, which adapts a generic NMT model without requiring a full retraining process. Instead, it involves a retraining phase focused solely on the new in-domain data, starting from the model's already learned parameters.
2. Method
The proposed method follows an incremental adaptation scheme. A generic NMT model, first trained on a large generic-domain corpus, is then "specialized" by continuing its training (running additional epochs) on a smaller, targeted in-domain dataset. This process is illustrated in Figure 1 (described later).
The core mathematical objective in this retraining phase is to re-estimate the conditional probability $p(y_1,...,y_m | x_1,...,x_n)$, where $(x_1,...,x_n)$ is the source-language sequence and $(y_1,...,y_m)$ is the target-language sequence. Crucially, this is done without resetting or discarding the previously learned states of the Recurrent Neural Network (RNN), allowing the model to build on its existing knowledge.
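For context, sequence-to-sequence models of this kind conventionally factor the conditional probability with the chain rule over target positions. The form below is the standard factorization, given here as background rather than quoted from the paper:
$p(y_1,...,y_m | x_1,...,x_n) = \prod_{t=1}^{m} p(y_t | y_1,...,y_{t-1}, x_1,...,x_n)$
Re-estimating the left-hand side during specialization therefore amounts to updating the parameters that produce each per-token distribution on the right.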
3. Experimental Setup
The study evaluates the specialization approach using standard MT evaluation metrics: BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). The NMT architecture combines a sequence-to-sequence framework (Sutskever et al., 2014) with an attention mechanism (Luong et al., 2015).
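To make the edit-rate idea behind TER concrete, here is a minimal sketch. It is a simplification and an assumption of this write-up: it counts only word-level insertions, deletions, and substitutions and ignores the shift operation that full TER also allows, so it should not be read as the metric used in the paper. The function names and the example sentences are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edit distance divided by reference length; omits TER's shift operation."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp, ref) / len(ref)


# Hypothetical outputs for one legal sentence (not data from the paper).
reference = "the agreement is terminated by the parties"
generic_output = "the contract is ended by the parties"
specialized_output = "the agreement is terminated by the parties"

print(simplified_ter(generic_output, reference))      # two substitutions over 7 words
print(simplified_ter(specialized_output, reference))  # exact match
```

Lower is better for edit-rate metrics, whereas higher is better for BLEU, which is why the two are usually reported together.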
The experiments compare different configurations, mainly varying the composition of the training corpus. Key comparisons include training from scratch on mixed generic/in-domain data versus the proposed two-stage process: first train a generic model, then specialize it with in-domain data. This setup aims to mimic a realistic CAT scenario in which post-edited translations become available incrementally.
3.1 Training Data
The paper mentions building a custom data setup for the experiments. The generic model is built from a balanced mix of several corpora from different domains. Afterwards, specific in-domain data is used for the specialization phase. The exact composition and sizes of these corpora are detailed in a referenced table (Table 1 in the PDF).
4. Core Insight & Analyst's Perspective
Core Insight
This paper is not just about adaptation; it is a practical hack for production-grade NMT. The authors correctly identify that a "one-model-fits-all" approach is commercially impractical. Their "specialization" approach is essentially continual learning for NMT: it treats the generic model as a living foundation that evolves with new data, much as a translator accumulates expertise. This challenges the prevailing full-retraining mindset and offers a path toward agile, responsive MT systems.
Logical Flow
The logic is compellingly simple: 1) Acknowledge the high cost of fully retraining NMT. 2) Observe that in-domain data (e.g., post-edits) arrives incrementally in real CAT tools. 3) Propose reusing the existing model's parameters as the starting point for further training on new data. 4) Validate that this yields gains comparable to mixed-data training, but faster. The flow mirrors best practice in transfer learning as seen in computer vision (e.g., starting from ImageNet-pretrained models for specific tasks), but applies it to the sequential, conditional setting of translation.
Strengths & Flaws
Strengths: The speed advantage is its killer feature for deployment. It enables near-real-time model updates, which matters for fast-moving domains such as news or live customer support. The method is elegantly simple and requires no architectural changes. It fits the human-in-the-loop CAT workflow well, creating a synergistic cycle between translator and machine.
Flaws: The elephant in the room is catastrophic forgetting. The paper hints at not discarding previous states, but the risk of the model "unlearning" its generic capabilities during specialization is high, an issue well documented in continual learning research. The evaluation appears limited to BLEU/TER on the target domain; where is the test on the original generic domain to check for performance degradation? In addition, the approach assumes that quality in-domain data is available, which can be a bottleneck.
Actionable Insights
For MT product managers: This is a blueprint for building adaptive MT engines. Prioritize implementing this pipeline in your CAT tools. For researchers: The next step is to integrate regularization techniques from continual learning (e.g., Elastic Weight Consolidation) to mitigate forgetting. Explore this for multilingual models: can we specialize an English-Chinese model for the medical domain without harming its French-German capabilities? The future lies in modular, composable NMT models, and this work is a foundational step.
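As a pointer for the regularization suggestion, Elastic Weight Consolidation (Kirkpatrick et al., 2017) adds a quadratic penalty that anchors each parameter to its value after generic training. This is not part of the reviewed paper; the notation is an assumption of this write-up, with $\theta_G$ the parameters of the generic model, $L_{specialize}$ the in-domain loss, $F_i$ the diagonal Fisher information estimated on generic data, and $\lambda$ a strength hyperparameter:
$L_{EWC}(\theta) = L_{specialize}(\theta) + \frac{\lambda}{2} \sum_{i} F_i (\theta_i - \theta_{G,i})^2$
Parameters with large $F_i$ mattered most for the generic task, so they are the ones the penalty holds most firmly in place during specialization.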
5. Technical Details
The specialization process rests on the standard NMT objective of maximizing the conditional log-likelihood of the target sequence given the source sequence. For a dataset $D$, the loss function $L(\theta)$ for model parameters $\theta$ is typically:
$L(\theta) = -\sum_{(x,y) \in D} \log p(y | x; \theta)$
In the proposed two-stage training:
- Generic Training: Minimize $L_{generic}(\theta)$ on a large, diverse corpus $D_G$ to obtain initial parameters $\theta_G$.
- Specialization: Start from $\theta_G$ and minimize $L_{specialize}(\theta)$ on a smaller in-domain dataset $D_S$, yielding final parameters $\theta_S$. The key point is that the optimization in stage 2 starts from $\theta_G$, not from a random initialization.
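The two stages can be illustrated with a deliberately tiny stand-in. The sketch below assumes a one-feature logistic model instead of an NMT network, synthetic data, and arbitrary epoch counts and learning rates; none of these come from the paper. What it does preserve is the structure: the same negative log-likelihood is minimized twice, and the second run starts from $\theta_G$.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(theta, data):
    """L(theta) = -sum over (x, y) in D of log p(y | x; theta)."""
    w, b = theta
    total = 0.0
    for x, y in data:
        p1 = sigmoid(w * x + b)
        p = p1 if y == 1 else 1.0 - p1
        total -= math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return total


def train(theta, data, epochs, lr):
    """Run SGD on the NLL, starting from the parameters passed in."""
    w, b = theta
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(w * x + b) - y  # gradient of -log p w.r.t. the logit
            w -= lr * grad * x
            b -= lr * grad
    return (w, b)


rng = random.Random(0)
# D_G: larger "generic" set with its decision boundary at x = 0 (toy data).
D_G = [(x, 1 if x > 0.0 else 0) for x in (rng.uniform(-3.0, 3.0) for _ in range(200))]
# D_S: small "in-domain" set whose boundary has shifted to x = 1 (toy data).
D_S = [(x, 1 if x > 1.0 else 0) for x in (rng.uniform(0.0, 2.0) for _ in range(40))]

theta_G = train((0.0, 0.0), D_G, epochs=20, lr=0.1)  # stage 1: generic training
theta_S = train(theta_G, D_S, epochs=5, lr=0.05)     # stage 2: continue from theta_G

print("NLL on D_S before / after:", nll(theta_G, D_S), nll(theta_S, D_S))
print("NLL on D_G before / after:", nll(theta_G, D_G), nll(theta_S, D_G))
```

The second printed line is a cheap way to watch for forgetting: comparing the loss on $D_G$ before and after specialization shows how much generic behavior was traded for in-domain fit.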
The underlying model uses an RNN-based encoder-decoder with attention. The attention mechanism computes a context vector $c_i$ for each target word $y_i$ as a weighted sum of the encoder hidden states $h_j$: $c_i = \sum_{j=1}^{n} \alpha_{ij} h_j$, where the weights $\alpha_{ij}$ are computed by an alignment model.
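The weighted sum can be sketched in a few lines. The text does not say which alignment model is used, so the example assumes dot-product scoring followed by a softmax, one of the scoring functions described by Luong et al. (2015). The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(decoder_state, encoder_states):
    """Return (alpha, c) with c = sum_j alpha_j * h_j, using dot-product scores."""
    scores = [sum(s * h for s, h in zip(decoder_state, h_j)) for h_j in encoder_states]
    alpha = softmax(scores)  # attention weights over source positions
    dim = len(encoder_states[0])
    c = [sum(a * h_j[k] for a, h_j in zip(alpha, encoder_states)) for k in range(dim)]
    return alpha, c


# Three encoder hidden states h_1..h_3 and one decoder state (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

alpha, c = context_vector(decoder_state, encoder_states)
print("alpha:", alpha)
print("context:", c)
```

Because the weights sum to one, the context vector is a convex combination of the encoder states; source positions whose states align with the decoder state receive more weight.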
6. Experimental Results & Chart Description
The paper presents results from two main experiments evaluating the specialization approach.
Experiment 1: Effect of Specialization Epochs. This experiment examines how translation quality (measured by BLEU) on the in-domain test set improves as the number of additional training epochs on in-domain data increases. The expected result is a rapid initial gain in BLEU score that eventually plateaus, indicating that substantial adaptation can be achieved with only a few additional epochs, which highlights the efficiency of the approach.
Experiment 2: Effect of In-Domain Data Size. This experiment investigates how much in-domain data is needed for effective specialization. BLEU score is plotted against the size of the in-domain dataset used for retraining. The curve likely shows diminishing returns, suggesting that even a small amount of quality in-domain data can yield a meaningful improvement, which makes the approach feasible for domains with limited parallel data.
Chart Description (Figure 1 in the PDF): A conceptual diagram illustrates the two-stage training pipeline. It consists of two main boxes: 1. Training Process: The input is "Generic Data," the output is the "Generic Model." 2. Retraining Process: The inputs are the "Generic Model" and "In-Domain Data," the output is the "In-Domain Model" (the specialized model). Arrows show the flow from the generic data to the generic model, then from both the generic model and the in-domain data to the final specialized model.
7. Analysis Framework Example
Scenario: A company uses a generic English-to-French NMT model to translate a variety of internal communications. It has acquired a new client in the legal sector and needs to adapt its MT output for legal documents (contracts, briefs).
Applying the Specialization Framework:
- Baseline: The generic model translates a legal sentence. The output may lack precise legal terminology and formal register.
- Data Collection: The company gathers a small corpus (e.g., 10,000 sentence pairs) of high-quality, professionally translated legal documents.
- Specialization Phase: The existing generic model is loaded. Training resumes using only the new legal corpus. It runs for a limited number of epochs (e.g., 5-10) with a low learning rate to avoid overwriting generic knowledge.
- Evaluation: The specialized model is tested on a held-out set of legal texts. BLEU/TER scores should show an improvement over the generic model. Importantly, its performance on general communications is also sampled to confirm there is no severe degradation.
- Deployment: The specialized model is deployed as a separate endpoint for the legal client's translation requests within the CAT tool.
This example illustrates a practical, resource-efficient path to domain-specific MT without maintaining multiple fully independent models.
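The evaluation step in this scenario can be expressed as a simple deployment gate. The sketch below is hypothetical: the function name, the dictionary keys, the tolerance `max_generic_drop`, and the scores in the usage lines are illustrative placeholders, not values or results from the paper.

```python
def accept_specialized(generic_scores, specialized_scores, max_generic_drop=1.0):
    """Decide whether to deploy, given BLEU on in-domain and generic test sets.

    Both arguments are dicts with keys "in_domain" and "generic" (BLEU points).
    max_generic_drop is a hypothetical tolerance chosen by the team.
    """
    in_domain_gain = specialized_scores["in_domain"] - generic_scores["in_domain"]
    generic_drop = generic_scores["generic"] - specialized_scores["generic"]
    # Require a real in-domain gain and only a bounded loss on generic text.
    return in_domain_gain > 0 and generic_drop <= max_generic_drop


# Placeholder scores for illustration only.
baseline = {"in_domain": 20.0, "generic": 30.0}
mild_forgetting = {"in_domain": 25.0, "generic": 29.5}
severe_forgetting = {"in_domain": 26.0, "generic": 24.0}

print(accept_specialized(baseline, mild_forgetting))
print(accept_specialized(baseline, severe_forgetting))
```

Checking both test sets in one place keeps the forgetting risk visible at deployment time instead of leaving it to an ad hoc spot check.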
8. Application Outlook & Future Directions
Immediate Applications:
- CAT Tool Integration: Seamless background model updates as translators post-edit, creating a self-improving system.
- Personalized MT: Adapting a base model to an individual translator's style and frequent domains.
- Rapid Deployment for New Domains: Quickly bootstrapping acceptable MT for emerging fields (e.g., new technology, niche markets) with limited data.
Future Research Directions:
- Overcoming Catastrophic Forgetting: Integrating advanced continual learning techniques (e.g., memory replay, regularization) is critical for commercial viability.
- Dynamic Domain Routing: Developing systems that can automatically detect a text's domain and route it to the appropriate specialized model, or dynamically combine the outputs of multiple specialized experts.
- Low-Resource & Multilingual Specialization: Exploring how this approach performs when specializing large multilingual models (e.g., M2M-100, mT5) for low-resource language pairs within specific domains.
- Beyond Text: Applying similar post-training specialization schemes to other sequence generation tasks, such as automatic speech recognition (ASR) for new accents or code generation for specific APIs.
9. References
- Cettolo, M., et al. (2014). Report on the 11th IWSLT evaluation campaign. International Workshop on Spoken Language Translation.
- Luong, M., et al. (2015). Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Papineni, K., et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Snover, M., et al. (2006). A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas.
- Sutskever, I., et al. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27.
- Kirkpatrick, J., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences. [External source, cited for context on forgetting]
- Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. [External source, cited for context on large pre-trained models]