Joint Embeddings of Programming and Natural Language

Joseph Pagadora

Stanford University

jcp737@stanford.edu

SUNet ID:jcp737

1 Problem Description and Challenges

Natural language processing has made significant progress in the past few years. This project explores the problem of searching through code semantically using natural language queries. This is useful because people may want to search through code written in a language that they are not familiar with. Furthermore, it may make searching for particular functions or snippets easier by avoiding the need for exact keyword matches. A challenge particular to this problem is that code is not natural language.

2 Dataset

In this project, we used only a subset of the CodeSearchNet Corpus from [1], [2]. Recall that the entire dataset contains approximately 2 million pairs, each consisting of a function (written in Python, Java, Go, JavaScript, PHP, or Ruby) and its documentation string (docstring). The subset we used consists of the pairs written in Python, in addition to other (code, docstring) pairs. In total, we used about 800,000 data points for training. The data is scraped from popular GitHub repositories. A challenge in this problem, as mentioned in [2], is that a function's documentation might be biased, in the sense that its purpose is inherently different from that of the queries that seek the function in question.

As discussed in Section 3, we essentially use BERT to create joint embeddings of code and docstrings, and we use sequence classification as our surrogate task. Each pair is labelled 0 or 1 according to whether the given code function and docstring are related, i.e., whether the docstring describes the function. The test-set data is divided into batches, where each batch contains 1 million (code, docstring) pairs. Each batch is further divided into groups of 1000 such pairs, where the first pair in the group is a "relevant" pair and the other 999 are "irrelevant" pairs.
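The test-set layout described above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout `(code, docstring, label)`, the helper names `split_into_groups` and `is_well_formed`, and the toy data are all hypothetical, and the toy example uses groups of 4 rather than 1000 to keep it short.

```python
from typing import List, Tuple

# Hypothetical record layout: (code, docstring, label), where label 1 means
# "relevant" and label 0 means "irrelevant".
Pair = Tuple[str, str, int]


def split_into_groups(batch: List[Pair], group_size: int = 1000) -> List[List[Pair]]:
    """Split a flat test batch into consecutive groups of `group_size` pairs."""
    if len(batch) % group_size != 0:
        raise ValueError("batch length must be a multiple of group_size")
    return [batch[i:i + group_size] for i in range(0, len(batch), group_size)]


def is_well_formed(group: List[Pair]) -> bool:
    """Check the layout: the first pair is relevant, all others are irrelevant."""
    return group[0][2] == 1 and all(label == 0 for _, _, label in group[1:])


# Toy batch: 3 groups of 4 pairs each (instead of 1000 groups of 1000).
toy_batch: List[Pair] = []
for g in range(3):
    toy_batch.append((f"def f{g}(): pass", f"doc for f{g}", 1))
    toy_batch.extend((f"def f{g}(): pass", f"unrelated doc {k}", 0) for k in range(3))

groups = split_into_groups(toy_batch, group_size=4)
```

With the real data, a batch of 1 million pairs and `group_size=1000` would yield 1000 groups, each of which acts as one query with 1000 ranked candidates.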

3 Model

Recall that BERT is trained using a large corpus of data, but because the domain here is very specific, we must perform pre-training in order for our BERT embeddings to work well. Recall that BERT is, in general, a 12-layer neural network consisting of transformer blocks and attention heads. In our project, we instead use the specialized RoBERTa framework [8], the "robustly optimized BERT pretraining approach," which is an optimized version of BERT that is trained using much more data. In particular, we fine-tune it on the classification task outlined in the section above. Thus, the final layer of our modified RoBERTa is simply a linear dense layer followed by a sigmoid. Note that a consequence of this task is a joint embedding of both programming language and natural language into the same vector space, which is just a hidden state within the RoBERTa model.

CS230: Deep Learning, Fall 2020, Stanford University, CA. (LaTeX template borrowed from NIPS 2017.)

Figure 1: A sequence $w$ is input into the model, with $w_4$ masked with the [MASK] token. The model then outputs a prediction for $w_4$.

The data is preprocessed as follows. Recall that each datapoint is a (code, docstring) pair. As done with inputs to BERT, a [CLS] token is placed before the start of the string. Then, [SEP] tokens are placed between the code and docstring portions of the data and at the end of the text. Then, we use the RoBERTa tokenizer to tokenize the text. See [6] for more details on how BERT tokenizes text data. Also similar to BERT, we create masks of 0's and 1's, which are also part of the input to the model; intuitively, they "group" the tokens into the programming language group and the natural language group. Finally, the tokens, masks, and labels are all used as inputs to the RoBERTa model.

The RoBERTa model in this project was pre-trained, as described in more detail in [1]. Similar to how BERT was trained, the CodeBERT model in [1] was trained to optimize for two objectives: (i) masked language modeling, and (ii) replaced token detection.
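The preprocessing described in this section can be sketched as follows. This is a minimal illustration, not the project's actual code: whitespace splitting stands in for the RoBERTa tokenizer, the function name `build_input` is hypothetical, and the choice of 0 for the code group and 1 for the docstring group is an assumption.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_input(code_tokens: List[str], doc_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Lay out one (code, docstring) pair as [CLS] code [SEP] docstring [SEP].

    Returns the token sequence and a 0/1 mask that groups the tokens:
    0 for the code portion (including [CLS] and the first [SEP]),
    1 for the docstring portion (including the final [SEP]).
    The 0/1 assignment is an assumption made for this sketch.
    """
    tokens = [CLS] + code_tokens + [SEP] + doc_tokens + [SEP]
    group_mask = [0] * (len(code_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, group_mask


# Whitespace splitting is only a stand-in for the real subword tokenizer.
code = "def add ( a , b ) : return a + b".split()
doc = "add two numbers".split()
tokens, group_mask = build_input(code, doc)
```

In a real pipeline the tokenizer would also map each token to an integer id, and the label (0 or 1) would be passed alongside the tokens and masks.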

3.1 RoBERTa Pre-Training

In objective (i), about 15% of the tokens are masked, i.e., replaced with a special [MASK] token. For each datapoint $x_i$, suppose $m_i$ tokens were masked out, leaving a modified sequence of tokens $\tilde{x}_i$. The goal of the network $p_1$ with parameters $\theta$ is to predict the tokens that were masked out by minimizing the following loss function (which is just the negative log-likelihood):

$$L_1(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \log p_1(\hat{y}_{ij} \mid \tilde{x}_i),$$

where $\hat{y}_{ij}$ denotes the predicted token for the $j$-th masked-out token in the $i$-th example. The general model architecture for RoBERTa is shown in Figure 1, taken from [7]. Note that this is actually a diagram for BERT, but the idea is essentially the same.
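To make the masked language modeling objective concrete, here is a small sketch that masks tokens and evaluates the negative log-likelihood over the masked positions, averaged over examples. Everything here is illustrative: `mask_tokens`, `mlm_loss`, the toy vocabulary, and the uniform predictor are hypothetical stand-ins for the real network $p_1$, and the sketch assumes the loss is evaluated on the probability assigned to the original token at each masked position.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"
# (original tokens, masked tokens, masked positions)
Example = Tuple[List[str], List[str], List[int]]


def mask_tokens(tokens: List[str], mask_prob: float = 0.15, seed: int = 0) -> Example:
    """Replace each token with [MASK] independently with probability mask_prob."""
    rng = random.Random(seed)
    positions = [j for j in range(len(tokens)) if rng.random() < mask_prob]
    chosen = set(positions)
    masked = [MASK if j in chosen else tok for j, tok in enumerate(tokens)]
    return tokens, masked, positions


def mlm_loss(examples: List[Example],
             predict: Callable[[List[str], int], Dict[str, float]]) -> float:
    """Negative log-likelihood of the original tokens at the masked positions,
    summed within each example and averaged over the n examples."""
    total = 0.0
    for original, masked, positions in examples:
        for j in positions:
            total += math.log(predict(masked, j)[original[j]])
    return -total / len(examples)


VOCAB = ["def", "return", "x", "y"]


def uniform_predict(masked: List[str], position: int) -> Dict[str, float]:
    """Stand-in for p_1: a uniform distribution over a toy vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


examples: List[Example] = [
    (["def", "x", "return", "x"], ["def", MASK, "return", "x"], [1]),
    (["return", "x", "y"], [MASK, "x", MASK], [0, 2]),
]
loss = mlm_loss(examples, uniform_predict)  # (3 / 2) * log(4) for this toy setup
```

A trained model would assign higher probability to the correct tokens than the uniform stand-in does, which lowers the loss.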

Objective (ii) is similar to objective (i), with the difference that masked tokens are instead replaced with alternative tokens. In fact, this objective uses techniques similar to those of GANs. Two generators, one for natural language and the other for programming language, are used to generate these alternative tokens, with a discriminator labelling whether each word is different from the original or not (as opposed to real or fake). For each example $x_i$, suppose $m_i$ of its tokens were masked and changed to a different token by the generators, yielding a modified sequence $\tilde{x}_i$. Let $x_{ij}$ denote the $j$-th token in the sequence $x_i$, and let $\delta_{ij} = 1$ if $\tilde{x}_{ij} \neq x_{ij}$, and $0$ otherwise. That is, $\delta_{ij}$ is the indicator for whether the $j$-th masked token in the $i$-th example indeed changed values. Let $p_2$ denote the discriminator network, parametrized by $\theta$, that predicts the probability that a given masked token was changed to something different from the original. Its goal is to minimize the following loss function (just cross-entropy loss):

$$L_2(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \Big[ \delta_{ij} \log p_2(\tilde{x}_{ij}) + (1 - \delta_{ij}) \log\big(1 - p_2(\tilde{x}_{ij})\big) \Big].$$

We want the entire RoBERTa model, parametrized by $\theta$, to be good at both of these tasks simultaneously, so the total objective function will be

$$L(\theta) = L_1(\theta) + L_2(\theta).$$
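The replaced token detection objective can be sketched in the same spirit. The sketch below uses the standard binary cross-entropy form, with $\log(1 - p_2)$ for positions whose token was not changed. The function name `rtd_loss`, the toy example, and the constant discriminator are hypothetical stand-ins for the real network $p_2$.

```python
import math
from typing import Callable, List, Tuple

# (original tokens, corrupted tokens, positions the generator rewrote)
Example = Tuple[List[str], List[str], List[int]]


def rtd_loss(examples: List[Example],
             discriminator: Callable[[List[str], int], float]) -> float:
    """Binary cross-entropy over the rewritten positions, averaged over examples.

    discriminator(corrupted, j) returns the predicted probability that the
    token at position j differs from the original.
    """
    total = 0.0
    for original, corrupted, positions in examples:
        for j in positions:
            delta = 1 if corrupted[j] != original[j] else 0  # indicator of a real change
            p = discriminator(corrupted, j)
            total += delta * math.log(p) + (1 - delta) * math.log(1.0 - p)
    return -total / len(examples)


def coin_flip_discriminator(corrupted: List[str], position: int) -> float:
    """Stand-in for p_2: always predicts 0.5."""
    return 0.5


examples: List[Example] = [
    # Position 1 was changed ("x" -> "y"); position 3 was regenerated as the same token.
    (["def", "x", "return", "x"], ["def", "y", "return", "x"], [1, 3]),
]
loss = rtd_loss(examples, coin_flip_discriminator)  # 2 * log(2) for this toy setup
```

The total pre-training loss would then be the sum of this value and the masked language modeling loss computed on the same data.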

3.2 RoBERTa Fine-Tuning

Unfortunately, this pre-trained model does not perform well on the task of code search. Fortunately, however, all it takes is some fine-tuning on the labelled dataset described in Section 2 to train a better-performing network for code search. This particular model had exactly 124,647,170 trainable parameters. We fine-tuned this model using mini-batch gradient descent with a batch size of 100, an Adam optimizer, and 8 epochs. We used a learning rate of 1e-5 and an Adam epsilon of 1e-8. In addition, we used a maximum sequence length of 50. That is, after tokenizing the (code, docstring) string, we padded the sequence to length 50 with 0's if it was shorter than 50 tokens. Otherwise, we truncated the tokenized sequence so that the final input sequence into the RoBERTa model had 50 tokens. When truncating, we took into account the lengths of the tokenized code and tokenized docstring portions and removed tokens one by one depending on which portion was longer. The entire fine-tuning process, with the given architecture and hyperparameters, took approximately 8 hours on a small P3 AWS EC2 instance, which had only one NVIDIA Tesla GPU and 8 virtual CPUs.
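The padding and truncation rule can be sketched as follows. Several details are assumptions made for this illustration: that the limit of 50 includes the three special tokens, that tokens are dropped from the end of the longer portion, that ties are broken by trimming the code, and that the special-token ids passed in (101 and 102 in the usage example) are arbitrary placeholders. The function names are hypothetical.

```python
from typing import List, Tuple

PAD_ID = 0        # the sequence is padded with 0's
MAX_LEN = 50      # maximum sequence length used for fine-tuning
NUM_SPECIAL = 3   # [CLS], [SEP], [SEP] (assumed to count toward MAX_LEN)


def truncate_pair(code_ids: List[int], doc_ids: List[int],
                  budget: int) -> Tuple[List[int], List[int]]:
    """Drop tokens one by one from the end of whichever portion is longer
    until both portions fit within `budget` tokens in total."""
    code_ids, doc_ids = list(code_ids), list(doc_ids)
    while len(code_ids) + len(doc_ids) > budget:
        if len(code_ids) >= len(doc_ids):  # tie-break toward trimming code (assumption)
            code_ids.pop()
        else:
            doc_ids.pop()
    return code_ids, doc_ids


def pad_or_truncate(code_ids: List[int], doc_ids: List[int], cls_id: int, sep_id: int,
                    max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Return (input ids, attention mask), both of length exactly max_len."""
    code_ids, doc_ids = truncate_pair(code_ids, doc_ids, max_len - NUM_SPECIAL)
    ids = [cls_id] + code_ids + [sep_id] + doc_ids + [sep_id]
    attention_mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding


# Long pair: 60 code ids and 10 docstring ids get truncated to fit.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1060)), list(range(2000, 2010)), 101, 102)
# Short pair: 5 code ids and 3 docstring ids get padded with 0's.
short_ids, short_mask = pad_or_truncate(list(range(1000, 1005)), list(range(2000, 2003)), 101, 102)
```

Trimming the longer portion first keeps some of both the code and the docstring in the input, rather than letting a long function crowd out its description.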

4 Evaluation

Search engines are commonly scored using the mean reciprocal rank (MRR). If we have $N$ queries, where the potential answers to each query are ranked in order of relevance, then the MRR is defined as

$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$

where $\mathrm{rank}_i$ is the position of the correct answer in the ranking for the $i$-th query. Thus, $0 < \mathrm{MRR} \le 1$. Using batch0 and batch1, we get MRRs of about 0.76 and 0.78, respectively. This is competitive with the MRR reported in [1], which for Python was 0.8685. Note that, according to [1], a CNN and a BiRNN give MRRs of about 0.57 and 0.32 for Python, respectively. Furthermore, the base pretrained RoBERTa network (before any fine-tuning) gives MRRs of about 0.01 for batch0 and batch1.

Since each programming language is different, in order to extend this application to other programming languages, we would need a different fine-tuned RoBERTa model for each. Indeed, the performance of RoBERTa after fine-tuning seems to vary depending on the programming language. From [1], for instance, this model gives an average MRR of 0.6926 for the Ruby programming language. [1] also shows empirically that the joint tasks of masked language modelling and replaced token detection improve the performance for each programming language.
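As a small worked example of the MRR, the sketch below computes it from per-group relevance scores, assuming (as in the test-set layout of Section 2) that the relevant pair sits at index 0 of each group. The function names and toy scores are hypothetical, and ties are resolved in favor of the relevant pair, which is an assumption.

```python
from typing import List, Sequence


def reciprocal_rank(scores: Sequence[float], relevant_index: int = 0) -> float:
    """1 / rank of the relevant candidate, where rank 1 is the highest score.

    Ties are resolved optimistically: only strictly higher scores push the
    relevant candidate down the ranking (an assumption of this sketch).
    """
    target = scores[relevant_index]
    rank = 1 + sum(1 for s in scores if s > target)
    return 1.0 / rank


def mean_reciprocal_rank(groups: List[Sequence[float]]) -> float:
    """Average the reciprocal ranks over all groups (queries)."""
    return sum(reciprocal_rank(g) for g in groups) / len(groups)


# Toy scores for 3 groups of 4 candidates; index 0 is the relevant pair.
toy_groups = [
    [0.9, 0.1, 0.3, 0.2],  # relevant pair ranked 1st
    [0.4, 0.8, 0.1, 0.2],  # relevant pair ranked 2nd
    [0.2, 0.5, 0.9, 0.1],  # relevant pair ranked 3rd
]
mrr = mean_reciprocal_rank(toy_groups)  # (1 + 1/2 + 1/3) / 3 = 11/18
```

In the setting of this project, each group would hold the model's 1000 scores for one query, so a model that always ranks the relevant pair last would score 0.001 per group.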

5 Analysis and Future Work

Clearly, we can move on to fine-tune RoBERTa on the other five programming languages to get a better sense of its performance on programming languages in general. Indeed, it may be possible to perform an $n$-way joint embedding of natural language and programming language, for any given $n \ge 1$ programming languages. I had initially hoped to build an application for this task, but unfortunately, due to my limited resources, this was not possible in the given timeframe. Thus, a clear future goal, practically speaking, is to build an app that takes in natural-language queries and outputs code snippets ranked by relevance. Furthermore, we can continue to explore other types of architectures, or simply reproduce this work using simpler architectures such as nBOW, CNNs, or LSTMs, although these are unlikely to perform better, both theoretically and practically, than the BERT models shown here.

As an analysis of results, the network we proposed here did not perform as well as the model proposed and fine-tuned in [1]. However, there is no doubt that this recent work has made great progress in this niche application. As mentioned above, the RoBERTa model fine-tuned here already performs much better than the "traditional" architectures, including bag-of-words (nBOW), CNN, BiRNN, and even Self-Attention. These model architectures gave average MRRs of about 0.58, 0.57, 0.32, and 0.69, respectively, on Python test sets, according to Table 2 in [1]. Interestingly, the bidirectional RNN performs the worst with respect to MRR among all of these models and architectures.

References

[1] Zhangyin Feng, Daya Guo, et al. "CodeBERT: A Pre-Trained Model for Programming and Natural Languages". In: arXiv:2002.08155 (2020).

[2] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search". In: arXiv:1909.09436 (2020).

[3] Sean Robertson. "NLP From Scratch: Translation with a Sequence to Sequence Network and Attention". From https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

[4] Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. "When Deep Learning Met Code Search". In: arXiv:1905.03813 (2019).

[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention is All You Need". In: arXiv:1706.03762 (2017).

[6] Chris McCormick. "BERT Word Embeddings Tutorial".

[7] Rani Horev. "BERT Explained: State of the art language model for NLP".

[8] Yinhan Liu, Myle Ott, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach". In: arXiv:1907.11692.
