
word       n1          n2            n3            n4
afrig      oseyaani    asi           gànnaaru      sowwu
(africa)   (oceania)   (asia)        (north)       (south)
bànk       fmi         kont          nafa          monjaal
(bank)     (imf)       (account)     (wallet)      (world-wide)
banaana    soraas      màngo         guyaab        xob
(banana)   (orange)    (mango)       (guava)       (leaf)
aajo       fajug       tekki         fajul         ew
(need)     (resolve)   (mean)        (resolved)    (legal)
bàmba      xaadimu     rasuul        coloniales    seex
(bamba)    (Khadim)    (prophet)     (colonial)    (sheikh)
bant       daaj        daajoon       daajee        daaje
(wood)     (press in)  (pressed in)  (press with)  (press with)

Table 5: Examples of Wolof words (in bold) with their nearest neighbours according to GloVe.

target word   n1              n2            n3
atom          atomes          isotope       cathode
art           contemporain    deco          abstrait
peinture      figurative      picturaux     picturales
boudhisme     hindouisme      brahmanisme   jainisme
uranus        jupiter         saturne       pluton
planete       extraterrestre  lointaine     orbitant
mer           caspienne       baltique      ocean
fleuve        baikal          fleuves       embouchure

Table 6: French Wikipedia GloVe word neighbours.

couple                  CBOW   SG    GloVe
(senegaal, dakaar)      1      0     1
(faraas, pari)          0      0     0
(janq, waxambaane)      0      0     1
(jigéen, góor)          0      0     0
(yaay, baay)            1      1     0
(jëkkër, jabar)         0      0     0
(rafet, taaru)          1      1     1
(teey, yem)             1      1     1
(tàmbale, sumb)         1      1     1
(metit, naqar)          0      0     1
(suux, diig)            1      1     1
(xam, xami)             0      1     1
(ajoor, kajoor)         0      1     1
(taarix, cosaan)        1      1     1
(jàng, jàngale)         0      0     1
Total                   47%    53%   73%

Table 7: Scores.
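For illustration only (this is not part of the published experiments), nearest-neighbour queries of the kind reported in Tables 5 and 6, and word-pair similarities of the kind scored in Table 7, could be reproduced with gensim once the trained vectors are exported to the word2vec text format; the file name and query words below are placeholders.

```python
# Hypothetical sketch: querying trained Wolof embeddings with gensim.
# "wolof_glove.vec.txt" is a placeholder for vectors exported in the
# plain word2vec text format.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wolof_glove.vec.txt", binary=False)

# Nearest neighbours of a target word, as in Table 5.
print(wv.most_similar("banaana", topn=4))

# Cosine similarity of a word pair, as scored in Table 7.
print(wv.similarity("senegaal", "dakaar"))
```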

5.2. French-Wolof Machine Translation

In addition to developing word embedding models, we used the corpus to train and evaluate four LSTM-based models to translate French sentences into their Wolof counterparts: a baseline LSTM, a bidirectional LSTM, a baseline LSTM + attention, and a bidirectional LSTM + attention (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014b). The models are still under development. LSTM networks are used for both the encoding and decoding phases. As a first step, the encoder reads the entire input sequence from the source language and encodes it into a fixed-length internal representation. The word embeddings are built using an embedding layer whose input dimension is equal to the size of the source language vocabulary. In a second step, a decoder network uses this internal representation to predict the target sentence. Starting from the start-of-sequence symbol, it outputs words until the end-of-sequence token is reached. In other words, the decoder makes predictions by combining information from the thought vector and the previous time step to generate the target sentence. The model is trained on a dataset of about 70,000 sentences split into training (50%) and validation (50%) sets. The training parameters currently used for the baseline model are displayed in Table 8.

Parameters             Values
Embedding dimension    128
Number of units        300
Learning rate          0.001
Dropout rate           0.25
Number of epochs       500
Batch size             64

Table 8: Training parameters for the LSTM models.
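As a concrete, purely illustrative reading of this description, the following Keras sketch builds the baseline encoder-decoder with a single LSTM layer on each side, using the hyper-parameter values from Table 8. The vocabulary sizes are placeholders, and this is a minimal sketch rather than the exact training code.

```python
# Minimal Keras sketch of the baseline encoder-decoder LSTM (teacher forcing).
# Vocabulary sizes are placeholders; embedding dimension, number of units and
# dropout rate follow Table 8.
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab = 20000, 20000   # hypothetical vocabulary sizes
emb_dim, units, dropout_rate = 128, 300, 0.25

# Encoder: reads the French sentence and summarizes it in a fixed-length
# "thought vector" (the final LSTM states).
enc_in = layers.Input(shape=(None,), name="french_tokens")
enc_emb = layers.Embedding(src_vocab, emb_dim, mask_zero=True)(enc_in)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states, it predicts the Wolof
# sentence token by token from the start-of-sequence symbol onwards
# (during training it receives the shifted target sequence as input).
dec_in = layers.Input(shape=(None,), name="wolof_tokens_shifted")
dec_emb = layers.Embedding(tgt_vocab, emb_dim, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(units, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_seq = layers.Dropout(dropout_rate)(dec_seq)
dec_out = layers.Dense(tgt_vocab, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], dec_out)
model.summary()
```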

Across experiments, the following hyper-parameters are kept constant: the number of LSTM units, embedding size, weight decay, dropout rate, shuffle size, batch size, learning rate, max gradient norm, optimizer, number of epochs and early stopping patience. All models are composed of a single LSTM layer for both the encoder and the decoder, with a dropout layer on the decoder side. Models are trained using Adam stochastic gradient descent with the learning rate set to 10^-3 (Kingma and Ba, 2014).
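Under these settings, a training run might look as follows. This is an assumed sketch: `model` is the encoder-decoder shown after Table 8, the dataset variables are placeholders, and the early-stopping patience value is illustrative, since it is not reported here.

```python
# Hypothetical training setup matching the reported constants: Adam with a
# learning rate of 1e-3, 500 epochs, batch size 64 and early stopping.
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_accuracy",
                           patience=10,  # illustrative value, not reported
                           restore_best_weights=True)

# train_* and val_* are placeholders for the tokenized French-Wolof pairs
# (50%/50% train/validation split of the ~70,000 sentences).
model.fit([train_fr, train_wo_in], train_wo_out,
          validation_data=([val_fr, val_wo_in], val_wo_out),
          epochs=500, batch_size=64, shuffle=True,
          callbacks=[early_stop])
```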

Figure 4 shows the current results obtained by the four models in terms of accuracy on the validation set. As we can see, with local attention we achieve a significant gain of 7% validation accuracy over the baseline unidirectional non-attentional system. In turn, the unidirectional attentional model slightly underperforms the bidirectional non-attentional model. The best accuracy score is achieved when combining bidirectional LSTMs with the attention mechanism. An accuracy gain of 15.58% could be observed when comparing the latter model with the unidirectional non-attentional baseline.

Figure 4: Validation accuracy of the LSTM models. (a) Baseline model; (b) baseline + attention; (c) bidirectional without attention; (d) bidirectional with attention.

Our NMT models                 Accuracy
Baseline                       56.69
+ attention                    63.89
+ bidirectional                68.03
+ bidirectional + attention    72.27

Table 9: Performance of the NMT system on the French-to-Wolof dataset. Scores are given in terms of accuracy on the validation set. All values are percentages.

The current experiments show that there are many opportunities to tune our models and improve the quality of the translations. For instance, we plan to expand the encoder and decoder models with additional layers and to train for more epochs. This can provide more representational capacity for the model. We are also trying to extend the dataset used to fit our models to 200,000 phrases or more. Furthermore, we plan to refine the vocabulary by using subword representations such as BPE (byte pair encoding), which have become a popular choice to achieve open-vocabulary translation. Previous work (Sennrich et al., 2016) has demonstrated that low-resource NMT is very sensitive to hyper-parameters such as BPE vocabulary size. Likewise, recent work (Qi et al., 2018) has shown that pre-trained word embeddings are very effective, particularly in low-resource scenarios, allowing for a better encoding of the source sentences.
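To make the subword idea concrete, the sketch below reproduces the core BPE merge-learning loop from Sennrich et al. (2016) on a toy vocabulary. It is purely illustrative and not part of our current pipeline, which would rely on an existing implementation such as subword-nmt.

```python
# Minimal sketch of BPE merge learning (after Sennrich et al., 2016).
# The toy word-frequency vocabulary below is hypothetical.
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair in the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are sequences of characters plus an end-of-word marker,
# as in the original BPE formulation.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # the BPE vocabulary size is itself a sensitive hyper-parameter
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```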

6. Conclusion

In this paper, we reported on a relatively large French-Wolof parallel corpus. To the best of our knowledge, this is the largest parallel text data ever reported for the Wolof language. French was chosen in particular because, as the official language of Senegal (the country with the most Wolof speakers), it is easier to find parallel data between French and Wolof than between, e.g., English and Wolof. The corpus is primarily designed for neural machine translation research, but also in a way that satisfies the needs of human users. The corpus currently consists of six major domains and is still under development. We are trying to extend it further with material that can be made freely available. Indeed, our plan is to make the parallel corpus publicly available. We are still harvesting more data, and we also need to first clarify copyright and licensing issues. In our first experimentation with the corpus, we obtained relatively good results, indicating that the corpus is quite suitable for the development of word embeddings. This also provides a good starting point for further research. Future studies will explore in more detail the suitability of the corpus for the development of neural machine translation systems to map Western languages to local Senegalese languages like Wolof and Fula. This paper has only focused on French and Wolof, as our corpus currently mainly contains resources for these two languages. However, we believe that the model described here will still be applicable to the other languages (e.g. Fula and Bambara) that we seek to promote.

7. Bibliographical References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.
Cissé, M. (1998). Dictionnaire français-wolof. Langues & Mondes / L'Asiathèque.
Dione, C. M. B. (2014). LFG parse disambiguation for Wolof. Journal of Language Modelling, 2(1):105-165.
Dione, C. B. (2019). Developing Universal Dependencies for Wolof. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pages 12-23, Paris, France. Association for Computational Linguistics.
Diop, B. B., Wülfing-Leckie, V., and Diop, E. H. M. (2016). Doomi Golo - The Hidden Notebooks. Michigan State University Press.
Diop, B. B. (2003). Doomi Golo: Nettali. Editions Papyrus Afrique.
Fung, P. and Church, K. W. (1994). K-vec: A new approach for aligning parallel texts. In Proceedings of the 15th Conference on Computational Linguistics - Volume 2, pages 1096-1102. Association for Computational Linguistics.
Gale, W. A. and Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75-102.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation.
Organisation internationale de la Francophonie (2014). La langue française dans le monde. Paris: Nathan.
Kesteloot, L. and Dieng, B. (1989). Du tieddo au talibé, volume 2. Editions Présence Africaine.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, volume 5, pages 79-86, Thailand. AAMT.
Lamraoui, F. and Langlais, P. (2013). Yet another fast, robust and open source sentence aligner. Time to reconsider sentence alignment. In XIV Machine Translation Summit.
Langlais, P., Simard, M., and Véronis, J. (1998). Methods and practical issues in evaluating alignment techniques. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 711-717. Association for Computational Linguistics.
Lilyan, K. and Cherif, M. (1983). Contes et mythes wolof.
Lo, A., Ba, S., Nguer, E. H. M., and Lo, M. (2019). Neural words embedding: Wolof language case. In IREHI19.
Lo, A., Dione, C. M. B., Nguer, E. M., Ba, S. O., and Lo, M. (2020). Building word representations for Wolof using neural networks. Springer.
Ma, X. (2006). Champollion: A robust parallel text sentence aligner. In LREC, pages 489-492.
Mangeot, M. and Enguehard, C. (2013). Des dictionnaires éditoriaux aux représentations XML standardisées.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.
Qi, Y., Sachan, D., Felix, M., Padmanabhan, S., and Neubig, G. (2018). When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 NAACL-HLT Conference, Volume 2 (Short Papers), pages 529-535. Association for Computational Linguistics.
Resnik, P., Olsen, M. B., and Diab, M. (1999). The Bible as a parallel corpus: Annotating the 'Book of 2000 Tongues'. Computers and the Humanities, 33(1-2):129-153.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the ACL (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.
Steyvers, M. and Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424-440.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), 18:1527-1554.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In LREC, volume 2012, pages 2214-2218.
Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V., and Nagy, V. (2005). Parallel corpora for medium density languages. In Proceedings of RANLP 2005, pages 590-596.