IntelliCode Compose: Code Generation using Transformer

Alexey Svyatkovskiy

Microsoft

Redmond, WA, USA

alsvyatk@microsoft.com

Shao Kun Deng

Microsoft

Redmond, WA, USA

shade@microsoft.com

Shengyu Fu

Microsoft

Redmond, WA, USA

shengyfu@microsoft.com

Neel Sundaresan

Microsoft

Redmond, WA, USA

neels@microsoft.com

ABSTRACT

In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, the majority of integrated development environments only support completion of methods and APIs, or arguments.

In this paper, we introduce IntelliCode Compose - a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in the Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, an efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook.

Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for the Python programming language.

CCS CONCEPTS

• Software and its engineering → Integrated and visual development environments; • Computing methodologies → Neural networks.

KEYWORDS

Code completion, neural networks, naturalness of software

ACM Reference Format:

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan.

2020. IntelliCode Compose: Code Generation using Transformer. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8-13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3368089.3417058

*Equal contribution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. ESEC/FSE '20, November 8-13, 2020, Virtual Event, USA ©2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 978-1-4503-7043-1/20/11...$15.00

https://doi.org/10.1145/3368089.3417058

1 INTRODUCTION

Machine learning has shown great promise towards improving automated software engineering across all stages. Some of the early applications of machine learning to source code include code search [1, 2], bug detection and localization [3], program synthesis [4], code summarization [5] and code completion [6-10].

There are numerous code completion systems capable of effectively recommending method and API calls [6, 9-11], or finding the correct argument [12-14]. The majority of argument completion systems would, however, only work when the name of the method or API call is already typed in, thus leaving the task of completing the method calls to software developers.

In this paper, we introduce IntelliCode Compose - a general-purpose code completion framework, capable of generating code sequences of arbitrary token types, including local variables, methods or APIs, arguments, as well as punctuation, language keywords, and delimiters. IntelliCode Compose serves as a universal programming language modeling tool, effectively generating syntactically correct code in multiple programming languages, capable of completing an entire line of code in a couple of keystrokes, with a user experience inspired by Gmail Smart Compose [15]. The proposed system is able to learn to infer types of programming language identifiers and long-range code semantics without inputs extracted by means of a static analyzer explicitly passed to the model as features. The nature of the code sequence completion problem makes the statistical language modeling approach a promising starting point for predicting a whole line of source code tokens given an existing code context.

The main contributions of the paper are as follows: (i) we introduce and pretrain a multi-layer generative transformer model for code (GPT-C), which is a variant of the GPT-2 [16] trained from scratch on a large unsupervised multilingual source code dataset (cf. sections 3 and 4), comparing it to the monolingual counterparts and a simple n-gram language modeling baseline, (ii) we propose and deploy a novel end-to-end code sequence completion system called IntelliCode Compose based on the GPT-C and an efficient client-side caching system (cf. sections 7 and 8), (iii) we evaluate the quality of language model pretraining of GPT-C using perplexity, showing that our best model achieves a perplexity of 1.82; we also report an average edit similarity of 86.7% (cf. section 10), (iv) we introduce MultiGPT-C - a multilingual version of our model, and discuss and compare various approaches to multilingual modeling (cf. section 9), (v) finally, we discuss and document practical challenges of training intermediate-sized neural transformer models on high-performance computing clusters, and cloud-based model deployment (cf. section 12).

2 MOTIVATING EXAMPLE

Fig. 1 shows an example method completion and an argument completion in the C# programming language served by the Intellicode [17] extension in the Visual Studio IDE (https://visualstudio.microsoft.com/vs/), as well as the whole-line of code completion generated by IntelliCode Compose, with the novel completion user experience. Previously existing code completion tools have focused on specific token types or features. Even after the developer has selected a method to call on the s variable, there are still numerous combinations of arguments to be passed to StartsWith, making this task non-trivial. Correctly suggesting a whole line of code requires predicting both the method completion and the correct local variables to be passed as arguments to the methods. Furthermore, additional structural and semantic information needs to be extracted from the context in order to make accurate statement-level suggestions.

3 DATASET

We collect a large unsupervised source code dataset to train and evaluate the code sequence completion model. It comprises over 1.2 billion lines of source code in the Python, C#, JavaScript and TypeScript programming languages, as summarized in Tab. 1. A total of over 52000 top-starred (non-fork) projects in GitHub have been selected, containing libraries from a diverse set of domains, with over 4.7 million source code files.

We split the dataset into development and test sets in the proportion 70-30 on the repository level. The development set is then split at random into training and validation sets in the proportion 80-20. The final deployed model is retrained using the entire dataset.
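As an illustration of the repository-level split described above, the following Python sketch partitions repositories (not individual files) into development and test sets, then subdivides the development set into training and validation sets. The 70-30 and 80-20 ratios follow the paper; the helper name, data layout, and seed are hypothetical.

import random

def split_repositories(repo_names, dev_frac=0.70, train_frac=0.80, seed=42):
    """Split at the repository level so files from one repository never
    appear in both the development and test sets."""
    rng = random.Random(seed)
    repos = list(repo_names)
    rng.shuffle(repos)

    n_dev = int(len(repos) * dev_frac)
    dev, test = repos[:n_dev], repos[n_dev:]

    # The development set is further split at random into train/validation.
    n_train = int(len(dev) * train_frac)
    train, valid = dev[:n_train], dev[n_train:]
    return train, valid, test

# Example usage with made-up repository identifiers.
train, valid, test = split_repositories([f"org/repo{i}" for i in range(100)])
print(len(train), len(valid), len(test))  # 56 14 30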

4 APPROACH

4.1 Baseline Code Sequence Completion Model

We use a statistical language modeling approach based on the n-gram model as a baseline in this work. The n-gram model is a probabilistic Markov chain model for predicting text given a context consisting of the n-1 preceding tokens. It estimates the next token probabilities based on relative frequency counts:

$$P(c_t \mid c_{t-n+1}, \ldots, c_{t-1}) = \frac{\mathrm{count}(c_{t-n+1}, \ldots, c_t)}{\mathrm{count}(c_{t-n+1}, \ldots, c_{t-1})}$$

The n-grams are extracted by rolling a window of size n sub-tokens, with stride one (see more details on tokenization in section 5.1).

4.2 Neural Code Sequence Completion Model

Transformers [18-21] are a family of neural networks designed to process ordered sequential data. They have found numerous applications in the fields of natural language processing (NLP) and natural language understanding (NLU), including machine translation, question answering, and document summarization. Several transformer models such as GPT-2, BERT, XLNet, and RoBERTa [16, 20-22] have demonstrated the ability to learn effectively from unlabeled data to perform a wide variety of downstream tasks given supervised discriminative fine-tuning on each specific task. In this work we build on the progress of transformers in NLP and NLU, applying it to the emerging field of source code understanding: a form of NLU with additional structural constraints and representations such as the concrete syntax tree (CST) and the dataflow graph.

A transformer block will typically consist of a multi-head self-attention layer and position-wise feed-forward layers, optionally containing residual connections and layer normalization. Prior studies have shown that using depth-wise separable convolutions along with self-attention may speed up training without loss of accuracy [24]. A typical transformer architecture for a sequence-to-sequence task will have an encoder (a stack of transformer blocks) and a decoder stack. Unlike vanilla recurrent neural networks, transformers do not require tokens in a sequence to be processed in a specific order, thus allowing more options for training parallelization [25]. Composed of feed-forward layers, convolutions, and self-attention, transformers are easy to quantize and serve in production.

GPT-2 is an auto-regressive pre-trained model consisting of a decoder-only transformer stack and one or more output layers, often referred to as "heads". GPT-2 for the language modeling task has a linear output layer with softmax output activation. IntelliCode Compose is built around a multi-layer generative pretrained transformer model for code (GPT-C), which is a variant of the GPT-2 trained from scratch on source code data, with the weights of the output linear layer tied to the input embedding matrix, having specific hyperparameters as described in Tab. 4. For a code token sequence C = (c_1, ..., c_N), the language model estimates the following probability distribution:

$$P(C) = \prod_{t=1}^{N} P(c_t \mid c_1, \ldots, c_{t-1})$$

With the autoregressive approach, the objective is to maximize the following log-likelihood:

$$L(C) = \sum_{t} \log P(c_t \mid c_{t-k}, \ldots, c_{t-1}; \Theta)$$

where k is the size of the context window and the conditional probability is modeled by a neural network with parameters Θ. These parameters are learned via a stochastic gradient descent optimization procedure.
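To make the tied output layer and the autoregressive objective above concrete, here is a minimal NumPy sketch, not the authors' implementation; the array names, shapes, and toy values are illustrative assumptions.

import numpy as np

def autoregressive_loss(hidden_states, token_ids, embedding_matrix):
    """Negative log-likelihood of next tokens under a tied output layer.

    hidden_states:    (T, d) final transformer-block outputs for positions 0..T-1
    token_ids:        (T+1,) integer token ids; position t+1 is the target for position t
    embedding_matrix: (V, d) input token embeddings, reused as the output projection
    """
    # Tied output layer: logits are the hidden states projected onto the embeddings.
    logits = hidden_states @ embedding_matrix.T            # (T, V)

    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    # Sum of log P(c_t | c_<t); training maximizes this, i.e. minimizes its negation.
    targets = token_ids[1:]
    log_likelihood = log_probs[np.arange(len(targets)), targets].sum()
    return -log_likelihood

# Toy usage with random values standing in for real model outputs.
rng = np.random.default_rng(0)
V, d, T = 100, 16, 8
W_e = rng.normal(size=(V, d))
h = rng.normal(size=(T, d))
ids = rng.integers(0, V, size=T + 1)
print(autoregressive_loss(h, ids, W_e))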


Figure 1: Comparison of code completion scenarios. Top: method completion and argument completion served by Intellicode. Bottom: whole-line of code completion served by IntelliCode Compose.

Table 1: Summary of the training dataset.

Programming language    Number of files (×10³)    Number of lines (×10⁶)    Number of repositories
C#                      1172                      201                       4836
Python                  1200                      240                       18174
JavaScript              1982                      681                       26553
TypeScript              437                       85                        3255

GPT-C applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feed-forward layers, to produce an output distribution over target tokens. We are reusing the input token embedding matrix as the output classification matrix [26], which allows us to remove the large fully connected output layer, reducing the number of parameters by 25%. The weights are initialized according to a random uniform distribution. Given an encoded code context and the hidden state h_t at the last temporal step of the final transformer block, the logits are obtained as

$$y = h_t W_e^{\top}$$

where W_e is the tied token embedding matrix and the hidden state dimensionality equals the number of hidden units per transformer block; a softmax over the logits yields the output distribution over the vocabulary. During inference, a beam-search decoding algorithm is applied to iteratively extract the best token sequences according to a negative log-likelihood optimization objective. This is explained in more detail in section 7.
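The beam-search decoding referenced above can be sketched as follows. This is a generic illustration rather than the paper's parallel implementation; score_next_tokens is a hypothetical callback assumed to wrap the model's softmax output, and the token ids are arbitrary.

from typing import Callable, List, Sequence, Tuple

def beam_search(
    prefix: Sequence[int],
    score_next_tokens: Callable[[Sequence[int]], List[Tuple[int, float]]],
    beam_width: int = 5,
    max_steps: int = 20,
    eol_token: int = 0,
) -> List[int]:
    """Return the completion with the lowest negative log-likelihood.

    score_next_tokens(sequence) -> [(token_id, log_prob), ...] is assumed
    to return next-token log-probabilities given the current context.
    """
    # Each hypothesis is (negative log-likelihood, generated tokens).
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]

    for _ in range(max_steps):
        candidates: List[Tuple[float, List[int]]] = []
        for nll, tokens in beams:
            if tokens and tokens[-1] == eol_token:
                candidates.append((nll, tokens))   # finished hypothesis, keep as-is
                continue
            for token_id, log_prob in score_next_tokens(list(prefix) + tokens):
                candidates.append((nll - log_prob, tokens + [token_id]))
        # Keep only the beam_width lowest-cost hypotheses.
        beams = sorted(candidates, key=lambda c: c[0])[:beam_width]
        if all(t and t[-1] == eol_token for _, t in beams):
            break

    return beams[0][1]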

5 PREPROCESSING

In what follows, we treat the source code data as a sequence of tokens corresponding to the output of a lexical analyzer. Incidentally, this can also be constructed through an in-order traversal of the terminal nodes of a concrete syntax tree (CST). In this work, we do not leverage high-level structural representations such as abstract or concrete syntax trees or control flow graphs, as they introduce additional overhead and dependencies which slow down the inference and reduce the coverage of the code completion system. Additionally, for most programming languages, such representations can only be correctly retrieved from complete code snippets that are syntactically correct, which is often not the case for a code completion system. Our approach is based on statistical language modeling of source code, with several normalization rules extracted from the concrete syntax tree of a program. To overcome the issue of different styles and white space or tab conventions, we transform the code into symbolic program tokens using custom tokenizers and regenerate the code with a common style. During preprocessing, we parse the program code in each file, extract information about token types and apply it to normalize the code, extract the subtoken vocabulary, and encode the sequences. This is done both for training and inference.
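As a rough illustration of the normalization step, the sketch below uses Python's standard tokenize module to break a snippet into symbolic tokens and regenerate it with uniform spacing. The deployed system uses custom tokenizers and CST-derived rules, so this is only an assumed approximation of the idea.

import io
import tokenize

def normalize_style(source: str) -> str:
    """Re-emit Python source from its token stream, discarding the
    original whitespace/style so all inputs share a common layout."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Passing (type, string) pairs makes untokenize choose its own spacing,
    # which yields one consistent style regardless of the input formatting.
    return tokenize.untokenize((tok.type, tok.string) for tok in tokens)

messy = "def  add ( a,b ):\n    return a+b\n"
print(normalize_style(messy))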

5.1 Overcoming a Closed Vocabulary Problem

A typical language model will attempt to generate a probability distribution over all tokens in the vocabulary. This requires the model to have access to encodings of all such tokens. In vanilla language models this is achieved with a fixed vocabulary matrix, thus limiting model coverage of unseen tokens.

The issue of coverage can be addressed by using finer-level encodings for tokens. Instead of learning representations for each token, we learn representations for subtokens or combinations of Unicode characters. This both reduces the need to store an entire vocabulary and makes the model more robust to out-of-vocabulary methods, APIs, and other language identifiers, even when training code completion models for multiple programming languages. We experiment with two specific ways of tokenization:

(1) Byte-Pair Encoding (BPE) tokenization - unsupervised tokenization, in which the most frequently occurring pair of Unicode characters is recursively replaced with a character that does not occur in the data. BPE is used by various contextual language models in NLP.
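To illustrate BPE, here is a compact sketch of the classic merge-learning loop in the spirit of Sennrich et al., not the tokenizer actually used in IntelliCode Compose; the toy corpus and the number of merges are arbitrary assumptions.

import re
from collections import Counter

def learn_bpe_merges(corpus_tokens, num_merges=10):
    """Learn BPE merge rules: repeatedly merge the most frequent
    adjacent symbol pair into a single new symbol."""
    # Represent each token as a space-separated sequence of characters.
    vocab = Counter(" ".join(tok) for tok in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the weighted vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        merged = "".join(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[pattern.sub(merged, word)] += freq
        vocab = new_vocab
    return merges

# Toy corpus of code identifiers; prints the learned merge operations.
print(learn_bpe_merges(["get_value", "get_name", "set_value", "set_name"]))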