First Results on Arabic Neural Machine Translation: Analysis and Insights

An analysis of the first application of neural machine translation to Arabic, comparing its performance with a phrase-based system and evaluating the effect of preprocessing.

1. Introduction

This paper presents the first documented application of a fully neural machine translation (NMT) system to Arabic (Ar↔En). While NMT has established itself as a major alternative to phrase-based statistical machine translation (PBSMT) for European languages, its effectiveness for morphologically rich and complex languages such as Arabic had remained unexplored. Earlier hybrid approaches used neural networks only as features inside PBSMT systems. This work aims to fill that gap with a direct and extensive comparison between a vanilla attention-based NMT system and a standard PBSMT system (Moses), while also evaluating the effect of key Arabic-specific preprocessing steps.

2. Neural Machine Translation

The core architecture used is the attention-based encoder-decoder model, which has become the de facto standard for sequence-to-sequence tasks such as translation.

2.1 Attention-Based Encoder-Decoder

The model consists of three main components: an encoder, a decoder, and an attention mechanism. A bidirectional recurrent neural network (RNN) encoder reads the source sentence $X = (x_1, ..., x_{T_x})$ and produces a sequence of context vectors $C = (h_1, ..., h_{T_x})$. The decoder, acting as a conditional RNN language model, generates the target sequence. At each step $t'$, it computes a new hidden state $z_{t'}$ from its previous state $z_{t'-1}$, the previously generated word $\tilde{y}_{t'-1}$, and a time-dependent context vector $c_{t'}$.

The attention mechanism is the innovation that lets the model focus on different parts of the source sentence during decoding. The context vector is a weighted sum of the encoder hidden states: $c_{t'} = \sum_{t=1}^{T_x} \alpha_t h_t$. The attention weights $\alpha_t$, which are recomputed at every decoder step $t'$, come from a small neural network (e.g., a feedforward network with a single $\tanh$ layer) that scores the relevance of each source state $h_t$ given the previous decoder state $z_{t'-1}$ and the previous output $\tilde{y}_{t'-1}$: $\alpha_t \propto \exp(f_{att}(z_{t'-1}, \tilde{y}_{t'-1}, h_t))$.

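The probability distribution over the next target word, $p(y_{t'} = w \mid \tilde{y}_{<t'}, X)$, is conditioned on the previously generated words and the source sentence, and is computed from the decoder state $z_{t'}$.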

2.2 Subword Symbol Processing

To handle an open vocabulary and reduce data sparsity, the paper relies implicitly on techniques such as Byte Pair Encoding (BPE) or wordpiece models, as referenced from Sennrich et al. (2015) and others. These methods split words into smaller, frequent subword units, which lets the model generalize better to unseen words. This is especially important for a morphologically rich language like Arabic.
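As a rough illustration of how BPE arrives at such subword units, the sketch below learns merge operations from a toy word-frequency dictionary. It is a simplified rendering of the general BPE idea, not the paper's pipeline: the function name `learn_bpe`, the `</w>` end-of-word marker, and the toy Arabic word list are all assumptions made for this example.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Start with each word as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy usage (hypothetical frequencies): related word forms share learned units.
word_freqs = {"كتب": 4, "وكتب": 2, "كتاب": 3}
merges, segmented = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(segmented)
```

Because frequent character sequences are merged first, inflected or cliticized forms of the same stem tend to share units, which is the property that matters for Arabic.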

3. Experimental Setup & Arabic Preprocessing

The study carries out a rigorous comparison between a standard PBSMT system (Moses with standard features) and an attention-based NMT system. A key variable in the experiments is the preprocessing of the Arabic text. The paper evaluates the effect of:

  • Tokenization: Morphological segmentation (e.g., splitting off clitics, prefixes, and suffixes) as proposed by Habash and Sadat (2006).
  • Normalization: Orthographic normalization (e.g., unifying Aleph and Ya forms, removing diacritics) as in Badr et al. (2008).

These steps, originally developed for PBSMT, are tested to see whether their benefits carry over to NMT.
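The normalization step can be sketched in a few lines. This is a minimal example under stated assumptions, not the exact scheme of Badr et al. (2008): the function name `normalize_arabic`, the particular set of characters covered, and the choice to map Alef Maksura to Ya (rather than the reverse) are all illustrative.

```python
import re

# Assumed diacritic set: fathatan through sukun (U+064B-U+0652) plus dagger alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def normalize_arabic(text):
    """Apply a simple orthographic normalization to Arabic text."""
    text = DIACRITICS.sub("", text)            # remove short-vowel marks
    text = ALEF_VARIANTS.sub("\u0627", text)   # unify Alef forms to bare Alef
    text = text.replace("\u0649", "\u064A")    # Alef Maksura -> Ya (assumed direction)
    return text

# Usage: a vocalized word with a hamza-carrying Alef.
print(normalize_arabic("أَكَلَ"))
```

Collapsing these spelling variants means the same word is no longer split across several vocabulary entries, which reduces sparsity for PBSMT and NMT alike.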

4. Results & Analysis

The experiments yielded several key findings, some challenging and some confirming prior assumptions about NMT.

4.1 In-Domain Performance

On the in-domain test sets, the NMT and PBSMT systems performed comparably. This is a significant result: it shows that even a "vanilla" NMT model can reach parity with a mature, heavily engineered PBSMT system on a challenging language pair from the outset.

4.2 Out-of-Domain Robustness

A standout finding is NMT's superior performance on out-of-domain test data, particularly for English-to-Arabic translation. The NMT system was more robust to domain shift, a major practical advantage for real-world deployment, where input text can vary widely.

4.3 Effect of Preprocessing

The experiments confirmed that the same Arabic tokenization and normalization routines that benefit PBSMT yield similar improvements in NMT quality. This suggests that some linguistic preprocessing knowledge is architecture-independent and addresses inherent challenges of the Arabic language itself.

5. Key Insights & Analyst's Perspective

Key Insight: This paper is not about a leap in BLEU scores; it is a foundational validation. It establishes that the NMT paradigm, while data-hungry, is language-agnostic enough to handle Arabic, a language far removed from the Indo-European setting in which NMT had been proven. The headline is the out-of-domain robustness, which points to NMT's greater capacity to learn general representations, an area where traditional PBSMT, with its reliance on surface-level phrase matching, is weak.

Logical Flow: The authors' approach is methodical: 1) establish a baseline by applying a standard NMT architecture (attention-based encoder-decoder) to Arabic, 2) use the established PBSMT benchmark (Moses) as the gold standard for comparison, 3) test whether domain knowledge (Arabic preprocessing) transfers from the old paradigm to the new one. The result is a clean, convincing narrative of continuity and disruption.

Strengths & Flaws: The strength lies in the paper's clarity and focus. It does not overclaim; it simply demonstrates parity and highlights one key advantage (robustness). The flaw, common to early exploratory papers, is the "vanilla" model setup. By 2016, more advanced techniques such as the Transformer architecture were on the horizon. As the later work of Vaswani et al. (2017) would show, the Transformer, with its self-attention mechanism, outperforms RNN-based encoder-decoders on many tasks, likely including Arabic. This paper sets a floor, not a ceiling.

Actionable Insights: For practitioners, the message is clear: start with NMT for Arabic. Even vanilla models offer competitive in-domain performance and valuable out-of-domain robustness. The preprocessing lesson matters: do not assume that deep learning removes the need for linguistic insight. Integrate proven tokenization and normalization pipelines. For researchers, this paper opens the door. The immediate next steps are to throw more data, more compute (as seen in scaling-laws research from OpenAI), and more advanced architectures (Transformers) at the problem. The long-term direction it points to is minimally supervised or zero-shot translation for low-resource language varieties, building on the generalization ability NMT demonstrates here.

This work fits a broader pattern in AI in which foundational models, once validated in a new domain, quickly render older, more specialized techniques obsolete. Just as CycleGAN (Zhu et al., 2017) demonstrated a general framework for unpaired image-to-image translation that replaced domain-specific tricks, this paper shows NMT to be a general framework ready to absorb and surpass the accumulated techniques of phrase-based Arabic translation.

6. Technical Deep Dive

6.1 Mathematical Formulation

The core attention mechanism can be broken down into the following steps for decoder time step $t'$:

  1. Alignment Score: An alignment model $a$ scores how well the inputs around position $t$ match the output at position $t'$:
    $e_{t', t} = a(z_{t'-1}, h_t)$
    where $z_{t'-1}$ is the previous decoder hidden state and $h_t$ is the $t$-th encoder hidden state. The function $a$ is typically a feedforward network.
  2. Attention Weights: The scores are normalized with a softmax function to form a distribution of attention weights:
    $\alpha_{t', t} = \frac{\exp(e_{t', t})}{\sum_{k=1}^{T_x} \exp(e_{t', k})}$
  3. Context Vector: The weights are used to compute a weighted sum of the encoder states, producing the context vector $c_{t'}$:
    $c_{t'} = \sum_{t=1}^{T_x} \alpha_{t', t} h_t$
  4. Decoder Update: The context vector is combined with the decoder input (the embedding of the previous word) and fed into the decoder RNN to update its state and predict the next word.
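Steps 1 to 3 above can be sketched in plain Python. This is an illustration rather than the paper's implementation: the parameter names `W_z`, `W_h`, and `v`, and the single-layer additive form $a(z, h) = v^\top \tanh(W_z z + W_h h)$, are assumptions chosen to match the description of $a$ as a small feedforward network. Step 4 (the decoder RNN update) is left out.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def alignment_score(z_prev, h_t, W_z, W_h, v):
    """Step 1: e = v . tanh(W_z z_prev + W_h h_t), an assumed additive scorer."""
    hidden = [math.tanh(dot(wz, z_prev) + dot(wh, h_t)) for wz, wh in zip(W_z, W_h)]
    return dot(v, hidden)

def softmax(scores):
    """Step 2: normalize scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(z_prev, encoder_states, W_z, W_h, v):
    """Steps 1-3: return (attention weights, context vector) for one decoder step."""
    scores = [alignment_score(z_prev, h, W_z, W_h, v) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Step 3: weighted sum of the encoder states, one dimension at a time.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy usage with hand-picked (not learned) parameters.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_prev = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(z_prev, encoder_states, identity, identity, [1.0, 1.0])
print(weights, context)
```

In a trained system the parameters are learned jointly with the encoder and decoder, so the weights come to reflect which source positions are relevant to the word being generated.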

6.2 Example Analysis Framework

Case: Evaluating the Effect of Preprocessing
Goal: Determine whether morphological tokenization improves NMT for Arabic.
Framework:

  1. Hypothesis: Segmenting Arabic words into morphemes (e.g., "وكتب" -> "و+كتب") reduces vocabulary sparsity and improves the translation of complex word forms.
  2. Experimental Design:
    • Control System: An NMT model trained on raw, whitespace-tokenized text.
    • Test System: An NMT model trained on morphologically tokenized text (using MADAMIRA or a similar tool).
    • Constants: Identical model architecture, hyperparameters, training data size, and evaluation metrics (e.g., BLEU, METEOR).
  3. Metrics & Analysis:
    • Primary: Difference in overall BLEU score.
    • Secondary: Examine performance on specific morphological phenomena (e.g., verb conjugation, clitic attachment) using targeted test sets.
    • Diagnostic: Compare vocabulary size and token frequency distributions. Successful tokenization should yield a smaller, more balanced vocabulary.
  4. Interpretation: If the test system shows a statistically significant improvement, this supports the hypothesis that explicit morphological modeling helps the NMT model. If the results are similar or worse, it suggests that the NMT model's subword units (BPE) are sufficient to capture morphology implicitly.

This framework mirrors the paper's methodology and can be applied to test any linguistic preprocessing step.
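The diagnostic in step 3 (comparing vocabulary size before and after segmentation) can be sketched as follows. The segmenter here is a deliberately naive, hypothetical stand-in and is not MADAMIRA: `toy_segment` only splits a leading conjunction "و" and will wrongly split words whose stem begins with that letter. The names `toy_segment` and `vocab_stats` are assumptions made for this example.

```python
from collections import Counter

def toy_segment(token, prefixes=("\u0648",)):
    """Naive stand-in for a morphological segmenter: split one leading clitic.

    "\u0648" is the Arabic conjunction wa; a real tool would use
    morphological analysis rather than string matching.
    """
    for p in prefixes:
        if token.startswith(p) and len(token) > len(p) + 1:
            return [p + "+", token[len(p):]]
    return [token]

def vocab_stats(sentences, segment=None):
    """Count types and tokens, optionally applying a segmenter to each token."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            counts.update(segment(token) if segment else [token])
    n_tokens = sum(counts.values())
    return {
        "types": len(counts),
        "tokens": n_tokens,
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
    }

# Toy corpus: each stem appears once bare and once with the conjunction attached.
corpus = ["كتب وكتب", "درس ودرس"]
print("control:", vocab_stats(corpus))
print("test:   ", vocab_stats(corpus, segment=toy_segment))
```

A lower type count and type-token ratio in the segmented condition is the pattern the hypothesis predicts. Whether it translates into better BLEU still has to be measured by training both systems.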

7. Future Applications & Directions

The paper's findings lead directly to several important research directions and applications:

  • Low-Resource Arabic & Dialects: The demonstrated robustness suggests NMT could be more effective for translating Arabic dialects (e.g., Egyptian, Levantine), where training data is scarce and the domain shift from Modern Standard Arabic is significant. Techniques such as transfer learning and multilingual NMT, as explored by Johnson et al. (2017), become highly relevant.
  • Integration with Advanced Architectures: The immediate next step is to replace the RNN-based encoder-decoder with a Transformer model. Transformers, with their parallelizable self-attention, would likely yield further gains in accuracy and efficiency for Arabic.
  • Preprocessing as a Learned Component: Instead of fixed rule-based tokenizers, future systems could incorporate learnable segmentation modules (e.g., a character-level CNN or another small network) trained jointly with the translation model, potentially discovering the segmentation best suited to the translation task itself.
  • Real-World Deployment: Out-of-domain robustness is a key selling point for commercial MT providers that serve diverse customer content (social media, news, technical documents). This paper provides empirical evidence for prioritizing NMT approaches for Arabic in production settings.
  • Beyond Translation: The success of attention-based models for Arabic MT validates the approach for other Arabic NLP tasks such as text summarization, question answering, and sentiment analysis, where sequence-to-sequence modeling also applies.

8. References

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR).
  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Habash, N., & Sadat, F. (2006). Arabic preprocessing schemes for statistical machine translation. Proceedings of the Human Language Technology Conference of the NAACL.
  • Johnson, M., Schuster, M., Le, Q. V., et al. (2017). Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics.
  • Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS).
  • Zhu, J., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE International Conference on Computer Vision (ICCV).