
Bootstrapping Multilingual Semantic Parsers Using Large Language Models: Analysis and Framework

An analysis of using LLMs to translate English semantic parsing datasets into multiple languages for training parsers, an approach that outperforms translate-train baselines on 41 of 50 languages.

1. Introduction & Overview

This work addresses a key bottleneck in multilingual NLP: creating high-quality, task-specific labeled data for low-resource languages. The traditional translate-train pipeline relies on machine translation services, which are costly, prone to domain mismatch, and require a separate step to project the logical form onto the translated utterance. The authors propose LLM-T, a new pipeline that uses the few-shot capabilities of Large Language Models (LLMs) to bootstrap multilingual semantic parsing datasets. Given a small seed set of human-translated examples, an LLM is prompted to translate English (utterance, logical form) pairs into the target language, efficiently producing training data for fine-tuning a semantic parser.

Key Insights

  • LLMs can translate complex structured pairs (utterance + logical form) effectively through in-context learning.
  • The approach reduces reliance on costly general-purpose MT systems and brittle projection heuristics.
  • It outperforms strong translate-train baselines on 41 of 50 languages across two large datasets.

2. Methodology: The LLM-T Pipeline

The core innovation is a structured data translation pipeline built on prompted LLMs.

2.1 Seed Data Collection

A small set of English examples from the source dataset $D_{eng} = \{(x^i_{eng}, y^i_{eng})\}$ is manually translated into the target language $tgt$ to create a seed set $S_{tgt}$. These serve as in-context examples for the LLM, teaching it the joint task of translating the utterance and the logical form.

2.2 In-Context Translation Prompting

For each new English example $(x_{eng}, y_{eng})$, a subset of $k$ examples is selected from $S_{tgt}$ (e.g., by semantic similarity) and formatted as a prompt. The LLM (e.g., PaLM) is then asked to generate the target-language pair $(\hat{x}_{tgt}, \hat{y}_{tgt})$.

Prompt Format: [Seed Example 1: (x_tgt, y_tgt)] ... [Seed Example k] [Input: (x_eng, y_eng)] [Output: ]
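The prompt format above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual template: the `|||` separator, the `Input:`/`Output:` labels, the function names, and the Spanish example are all assumptions made for the sketch. Each seed block carries both the English pair and its human translation, in line with the finding that showing source and target together matters.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (utterance, logical form)


def format_pair(utterance: str, logical_form: str) -> str:
    """Render one (utterance, logical form) pair as a single prompt line.

    The ' ||| ' separator is a hypothetical choice for this sketch.
    """
    return f"({utterance} ||| {logical_form})"


def build_prompt(seed_examples: List[Tuple[Pair, Pair]], query: Pair) -> str:
    """Assemble a few-shot prompt: k seed blocks, then the new English input.

    Each seed example is (english_pair, target_pair), where target_pair is
    the human translation taken from the seed set.
    """
    blocks = []
    for (x_eng, y_eng), (x_tgt, y_tgt) in seed_examples:
        blocks.append(
            f"Input: {format_pair(x_eng, y_eng)}\n"
            f"Output: {format_pair(x_tgt, y_tgt)}"
        )
    # The final block leaves the output empty for the LLM to complete.
    blocks.append(f"Input: {format_pair(*query)}\nOutput:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical seed example (English -> Spanish), for illustration only.
    seeds = [
        (
            ("set an alarm", "[IN:CREATE_ALARM]"),
            ("pon una alarma", "[IN:CREATE_ALARM]"),
        )
    ]
    print(build_prompt(seeds, ("wake me at 7", "[IN:CREATE_ALARM [SL:TIME 7]]")))
```

The returned string would be sent to whichever LLM is in use; the model's completion after the final `Output:` is then parsed back into an utterance and a logical form.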

2.3 Quality Control via Nucleus Sampling

To increase diversity and quality, the authors use nucleus (top-$p$) sampling during generation, producing multiple candidate translations per example. A selection or aggregation step (e.g., based on parser confidence or agreement) can then pick the final output, which becomes part of the synthetic dataset $\hat{D}_{tgt}$.
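One simple way to realize the agreement-based selection mentioned above is a majority vote over the logical forms of the sampled candidates. The sketch below is an assumption about how such a step could look, not a description of the authors' exact procedure; `select_by_agreement` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (utterance, logical form)


def select_by_agreement(candidates: List[Pair]) -> Pair:
    """Return the first candidate whose logical form is the most frequent.

    Ties between logical forms resolve to the one that was sampled first,
    because Counter keeps insertion order and max() returns the first maximum.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(logical_form for _, logical_form in candidates)
    best_form, _ = max(counts.items(), key=lambda item: item[1])
    return next(c for c in candidates if c[1] == best_form)


if __name__ == "__main__":
    # Hypothetical candidates for one English example.
    sampled = [
        ("pon una alarma", "[IN:CREATE_ALARM]"),
        ("crea una alarma", "[IN:CREATE_ALARM]"),
        ("pon una alarma", "[IN:SET_TIMER]"),
    ]
    print(select_by_agreement(sampled))
```

A parser-confidence variant would replace the vote with a score from a trained model; the surrounding structure stays the same.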

3. Technical Details & Mathematical Formulation

The pipeline can be framed as conditional generation. Given an English pair $(x_e, y_e)$ and a seed set $S_t$, the model approximates the mapping:

$P(x_t, y_t \mid x_e, y_e, S_t) = \prod_{i=1}^{L} P(w_i \mid w_{<i}, x_e, y_e, S_t)$

where $w_1, \dots, w_L$ is the token sequence of the target pair $(x_t, y_t)$. Generation uses nucleus sampling: at each step the next-token distribution is truncated and renormalized to $P'(w) = \frac{P(w)}{\sum_{w' \in V^{(p)}} P(w')}$ for $w \in V^{(p)}$, where $V^{(p)}$ is the smallest set of highest-probability tokens such that $\sum_{w \in V^{(p)}} P(w) \ge p$. Key design choices include seed selection, prompt formatting, and the decoding strategy used to obtain high-probability outputs under $P(x_t, y_t \mid x_e, y_e, S_t)$.
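The truncation-and-renormalization step of nucleus sampling can be illustrated on a toy next-token distribution. The sketch below assumes the distribution is given as a plain dictionary; `nucleus_filter` is a hypothetical helper name, and real decoders apply the same idea to logits over the full vocabulary.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens with total mass >= p, renormalized."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    mass = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:  # smallest prefix whose cumulative mass reaches p
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / mass for token, prob in kept}


if __name__ == "__main__":
    toy = {"alarma": 0.5, "reloj": 0.3, "timbre": 0.2}
    print(nucleus_filter(toy, p=0.7))
```

With $p = 0.7$ the two most probable tokens are kept (cumulative mass 0.8) and rescaled to 0.625 and 0.375; the third token can no longer be sampled.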

4. Experimental Results & Analysis

4.1 Datasets: MTOP & MASSIVE

Experiments were run on two public semantic parsing datasets covering intents and slots across a range of domains (e.g., alarms, navigation, shopping).

  • MTOP: covers 11 domains, 117 intents, 6 languages.
  • MASSIVE: covers 18 domains, 60 intents, 51 languages (including many low-resource ones).
Together, this scale provides a solid testbed for multilingual generalization.

4.2 Performance Comparison

The main baseline is a strong translate-train approach that uses a state-of-the-art MT system (e.g., Google Translate) followed by heuristic or learned projection of the logical form. LLM-T shows substantial gains:

Performance Summary

LLM-T outperforms translate-train on 41 of 50 languages. The average improvement is notable, particularly for typologically distant or low-resource languages where off-the-shelf MT quality degrades. The gains are consistent across intent accuracy and slot F1.
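For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes exact-match, micro-averaged slot F1 over (slot name, slot value) pairs; the paper's evaluation scripts may differ in detail, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Slot = Tuple[str, str]  # (slot name, slot value)


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of examples whose predicted intent equals the gold intent."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def slot_f1(gold: List[Set[Slot]], pred: List[Set[Slot]]) -> float:
    """Micro-averaged F1 over exact (name, value) slot matches."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be equally long")
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(len(p) for p in pred)
    recall = true_pos / sum(len(g) for g in gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical predictions for two utterances.
    gold_slots = [{("time", "7am")}, {("date", "monday"), ("time", "noon")}]
    pred_slots = [{("time", "7am")}, {("date", "monday")}]
    print(intent_accuracy(["alarm", "reminder"], ["alarm", "timer"]))
    print(slot_f1(gold_slots, pred_slots))
```

In the toy example, precision is 2/2 and recall is 2/3, so slot F1 is 0.8, while intent accuracy is 0.5.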

4.3 Key Findings & Ablation Studies

  • Seed Set Size & Quality: Performance saturates with a relatively small number of high-quality seed examples (e.g., ~50-100), indicating data efficiency.
  • Prompt Design: Including both the source (English) and the target translation in the prompt is essential. The $(x, y)$ format is more effective than $x$ alone.
  • Model Scale: Larger LLMs (e.g., PaLM with 540B parameters) produce markedly better translations than smaller ones, highlighting the role of model capacity in this complex task.
  • Error Analysis: Common errors include mistranslated slot values for culture-specific entities (dates, products) and poor compositional generalization on complex queries.

5. Analysis Framework: Core Insight & Critique

Core Insight: The paper's contribution is not merely using LLMs for translation; it is reframing dataset creation as a few-shot, in-context generation task. This sidesteps the brittle MT + separate projection pipeline, which often fails because of error propagation and domain mismatch. The insight that an LLM can internalize the mapping between natural-language variation and its formal representation across languages is significant. It is in line with findings from work such as "Language Models are Few-Shot Learners" (Brown et al., 2020), but applies them to structured, multilingual data synthesis.

Logical Flow: The argument is clean: 1) Translate-train is costly and brittle. 2) LLMs excel at few-shot, cross-lingual pattern matching. 3) Therefore, use LLMs to directly generate the (utterance, logical form) pairs needed for training. Experiments on 50 languages provide broad evidence for the premise.

Strengths & Weaknesses: The main strength is the sharp reduction in human annotation cost and the flexibility to adapt to any language with only a small seed set, which is a significant advantage for low-resource NLP. The performance gains are convincing and broad. However, the approach has important weaknesses. First, it depends entirely on the capabilities of a large, proprietary, closed LLM (PaLM). Reproducibility, cost, and control are serious concerns. Second, it assumes that a small but high-quality seed set is available, which for truly low-resource languages may still be a major hurdle. Third, as the error analysis suggests, the method can struggle with deep semantic composition and cultural adaptation beyond straightforward lexical translation, issues also noted in the cross-lingual transfer study by Conneau et al. (2020).

Actionable Insights: For practitioners, the immediate takeaway is to prototype multilingual data expansion using GPT-4 or Claude with this prompting template before investing in an MT pipeline. For researchers, the path forward is clear: 1) Democratize the method by making it work with capable open-source LLMs (e.g., LLaMA, BLOOM). 2) Investigate seed-set synthesis: can the seed set itself be bootstrapped? 3) Target the error modes by developing post-hoc correctors or reinforcement learning from parser feedback to refine LLM outputs, similar to self-training methods used in vision (e.g., CycleGAN's cycle-consistency loss for unpaired translation). The future lies in hybrid systems in which LLMs generate noisy silver data and smaller, specialized models are trained to clean and exploit it efficiently.

6. Case Study: Applying the Framework

Scenario: A company wants to deploy a voice assistant for medical appointment booking in Hindi and Tamil, but has only an English semantic parsing dataset.

Applying the LLM-T Framework:

  1. Seed Creation: Hire bilingual translators for 2 days to translate 100 diverse English appointment-booking examples (utterance + logical form) into Hindi and Tamil. This is a one-time cost.
  2. Prompt Engineering: For each of the 10,000 English examples, build a prompt containing the 5 seed examples most semantically similar to it (computed via sentence embeddings), followed by the new English example.
  3. LLM Generation: Use an API (e.g., OpenAI's GPT-4, Anthropic's Claude) with nucleus sampling (top-p=0.9) to generate 3 candidate translations per example.
  4. Data Filtering: Train a small, fast classifier on the seed data to score each candidate's utterance fluency and logical-form correctness. Select the highest-scoring candidate for each example to build the final Hindi and Tamil training sets.
  5. Parser Training: Fine-tune a multilingual BART or T5 model on the synthesized data for each language.
This process removes the need to license an MT system, develop slot-projection heuristics, and manually handle the alignment of date/time formats and medical terminology across languages.
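Step 2 of the workflow above, retrieving the most similar seed examples, can be sketched as follows. To stay self-contained, the sketch uses a toy bag-of-words vector as a stand-in for a real sentence encoder; that substitution, the function names, and the sample utterances are all assumptions. In practice `embed` would call whatever sentence-embedding model is available.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_seeds(query: str, seed_utterances: List[str], k: int = 5) -> List[int]:
    """Return indices of the k seed utterances most similar to the query."""
    query_vec = embed(query)
    scores = [cosine(query_vec, embed(s)) for s in seed_utterances]
    # sorted() is stable, so equally scored seeds keep their original order.
    ranked = sorted(range(len(seed_utterances)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    seeds = [
        "book a doctor appointment",
        "cancel my appointment",
        "what is the weather",
    ]
    print(top_k_seeds("book an appointment with the doctor", seeds, k=2))
```

The returned indices select which seed pairs go into the prompt for that English example; retrieval is done on the English side of each seed, since the new example is in English.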

7. Future Applications & Research Directions

  • Beyond Semantic Parsing: The framework applies directly to any sequence-to-sequence data creation task: multilingual named entity recognition (text $\to$ tags), text-to-SQL, and code generation from natural-language descriptions.
  • Active Learning & Seed Set Growth: Combine with active learning. Use the uncertainty of the trained parser on real user queries to decide which examples should be prioritized for human translation, growing the seed set incrementally.
  • Cultural & Dialect Adaptation: Extend beyond standard languages to dialects. A seed set in Swiss German could bootstrap data for Austrian German, with the LLM handling lexical and phrasal variation.
  • Synthetic Data for RLHF: The method could generate diverse multilingual preference pairs for training reward models in Reinforcement Learning from Human Feedback (RLHF), which is important for aligning AI assistants globally.
  • Reducing LLM Dependence: Future work should focus on distilling this capability into smaller, specialized models to cut cost and latency, making the technique practical for real-time and edge applications.

8. References

  1. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
  2. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  3. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision (pp. 2223-2232). (CycleGAN reference for consistency-based learning.)
  4. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67.
  5. Moradshahi, M., Campagna, G., Semnani, S., Xu, S., & Lam, M. (2020). Localizing open-ontology QA semantic parsers in a day using machine translation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).