As artificial intelligence expands its horizons and breaks new ground, it increasingly challenges people’s imaginations about which frontiers it can open next. While new algorithms and models are helping to address a growing number and variety of business problems, advances in natural language processing (NLP) and language models are prompting programmers to think about how to revolutionize the world of programming.
With the evolution of multiple programming languages, the job of a programmer has become increasingly complex. While a good programmer may be able to define a good algorithm, converting it into a relevant programming language requires knowledge of its syntax and available libraries, limiting a programmer’s ability across diverse languages.
Programmers have traditionally relied on their knowledge, experience and repositories for building these code components across languages. IntelliSense helped them with appropriate syntactical prompts. Advanced IntelliSense went a step further with autocompletion of statements based on syntax. Google code search and GitHub code search even listed similar code snippets, but the onus of tracing the right pieces of code or scripting the code from scratch, composing these pieces and then contextualizing them to a specific need still rests solely on the shoulders of the programmer.
We are now seeing the evolution of intelligent systems that can understand the objective of an atomic task, comprehend the context and generate appropriate code in the required language. This generation of contextual and relevant code can only happen when there is a proper understanding of the programming languages and natural language. Algorithms can now understand these nuances across languages, opening a range of possibilities:
- Code conversion: comprehending code of one language and generating equivalent code in another language.
- Code documentation: generating the textual representation of a given piece of code.
- Code generation: generating appropriate code based on textual input.
- Code validation: validating the alignment of the code to the given specification.
The evolution of code conversion is better understood when we look at Google Translate, which we use quite frequently for natural language translations. Google Translate learned the nuances of the translation from a huge corpus of parallel datasets — source-language statements and their equivalent target-language statements — unlike traditional systems, which relied on rules of translation between source and target languages.
Since it is easier to collect data than to write rules, Google Translate has scaled to translate between 100+ natural languages. Neural machine translation (NMT), a type of machine learning model, enabled Google Translate to learn from a huge dataset of translation pairs. The efficiency of Google Translate inspired the first generation of machine learning-based programming language translators to adopt NMT. But the success of NMT-based programming language translators has been limited due to the unavailability of large-scale parallel datasets (supervised learning) in programming languages.
This has given rise to unsupervised machine translation models that leverage the large-scale monolingual codebases available in the public domain. These models learn from the monolingual code of the source programming language, then from the monolingual code of the target programming language, and then become equipped to translate code from the source to the target. Facebook’s TransCoder, built on this approach, is an unsupervised machine translation model that was trained on multiple monolingual codebases from open-source GitHub projects and can efficiently translate functions between C++, Java and Python.
Code generation is currently evolving in different avatars — as a plain code generator or as a pair-programmer autocompleting a developer’s code.
The key technique employed in NLP models is transfer learning, which involves pretraining the models on large volumes of data and then fine-tuning them on limited, targeted datasets. These models have largely been based on recurrent neural networks. Recently, models based on the Transformer architecture have proved more effective because they lend themselves to parallelization, which speeds up computation. Models fine-tuned in this way for programming language generation can then be deployed for various coding tasks, including code generation and the generation of unit test scripts for code validation.
We can also invert this approach by applying the same algorithms to comprehend code and generate relevant documentation. Traditional documentation systems focus on translating legacy code into English, line by line, which gives us pseudocode. This new approach, by contrast, can summarize code modules into comprehensive code documentation.
Programming language generation models available today are CodeBERT, CuBERT, GraphCodeBERT, CodeT5, PLBART, CodeGPT, CodeParrot, GPT-Neo, GPT-J, GPT-NeoX, Codex, etc.
DeepMind’s AlphaCode takes this one step further, generating multiple code samples for a given description while ensuring that they pass the given test conditions.
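The idea of generating many samples and keeping only those that pass the given tests can be sketched in a few lines. This is a minimal illustration, not AlphaCode's actual pipeline: the three `sample_*` functions are hand-written stand-ins for model-generated candidates, and `passes_all` and `filter_candidates` are hypothetical helper names.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a candidate is a one-argument
# function, and a test case is an (input, expected_output) pair.
Candidate = Callable[[int], int]
TestCase = Tuple[int, int]


def passes_all(candidate: Candidate, tests: List[TestCase]) -> bool:
    """Return True only if the candidate gives the expected output on every test."""
    for arg, expected in tests:
        try:
            if candidate(arg) != expected:
                return False
        except Exception:
            # A candidate that crashes on a test input is rejected.
            return False
    return True


def filter_candidates(candidates: List[Candidate], tests: List[TestCase]) -> List[Candidate]:
    """Keep only the candidates that pass all of the given tests."""
    return [c for c in candidates if passes_all(c, tests)]


# Stand-ins for samples generated from the description "return the square of n".
def sample_a(n: int) -> int:
    return n * 2  # wrong logic: doubles instead of squaring


def sample_b(n: int) -> int:
    return n * n  # correct


def sample_c(n: int) -> int:
    return n ** 2 if n >= 0 else 1 // 0  # crashes on negative input


tests = [(2, 4), (3, 9), (-1, 1)]
survivors = filter_candidates([sample_a, sample_b, sample_c], tests)
print([c.__name__ for c in survivors])
```

Only the candidate that is correct on every test survives, so the quality of the filter depends directly on how well the tests cover the description.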
Autocompletion of code follows the same approach as Gmail Smart Compose. As many have experienced, Smart Compose prompts the user with real-time, context-specific suggestions, aiding in the quicker composition of emails. This is basically powered by a neural language model that has been trained on a bulk volume of emails from the Gmail domain.
Extending the same into the programming domain, a model that can predict the next set of lines in a program based on the past few lines of code is an ideal pair programmer. This accelerates the development lifecycle significantly, enhances the developer’s productivity and ensures a better quality of code.
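To make the "predict what comes next from what came before" idea concrete, here is a toy sketch that uses simple bigram counts rather than a neural language model. It is far cruder than the models discussed here; the function names and the three-line corpus are assumptions made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(corpus_lines: List[str]) -> Dict[str, Counter]:
    """Count, for each whitespace-separated token, which tokens follow it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest_next(counts: Dict[str, Counter], prefix: str, k: int = 1) -> List[str]:
    """Suggest up to k likely next tokens, based only on the last token typed."""
    tokens = prefix.split()
    if not tokens:
        return []
    followers = counts.get(tokens[-1], Counter())
    return [tok for tok, _ in followers.most_common(k)]


# A tiny, made-up training corpus of code lines.
corpus = [
    "for i in range(n):",
    "for item in items:",
    "for i in range(10):",
]
model = train_bigram(corpus)
print(suggest_next(model, "for"))  # "i" follows "for" twice, "item" once
```

A real pair programmer conditions on far more context than the last token, which is why the size of the context window matters so much in practice.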
Copilot can not only autocomplete blocks of code, but can also edit or insert content into existing code, making it a very powerful pair programmer with refactoring abilities. Copilot is powered by Codex, a model with billions of parameters that was trained on a large volume of code from public repositories, including GitHub.
A key point to note is that we are probably in a transitory phase with pair programming essentially working in the human-in-the-loop approach, which in itself is a significant milestone. But the final destination is undoubtedly autonomous code generation. The evolution of AI models that evoke confidence and responsibility will define that journey, though.
Code generation for complex scenarios that demand more problem solving and logical reasoning is still a challenge, as it might warrant the generation of code not encountered before.
Understanding of the current context to generate appropriate code is limited by the model’s context-window size. The current set of programming language models supports a context size of 2,048 tokens; Codex supports 4,096 tokens. The samples in few-shot learning models consume a portion of these tokens and only the remaining tokens are available for developer input and model-generated output, whereas zero-shot learning / fine-tuned models reserve the entire context window for the input and output.
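The arithmetic behind this trade-off is simple to sketch. In the example below, the window sizes of 2,048 and 4,096 tokens come from the text, while the per-example token counts and the `remaining_budget` helper are assumptions chosen purely for illustration.

```python
from typing import List


def remaining_budget(context_window: int, few_shot_token_counts: List[int]) -> int:
    """Tokens left for developer input plus model output after few-shot samples."""
    used = sum(few_shot_token_counts)
    if used > context_window:
        raise ValueError("few-shot samples alone exceed the context window")
    return context_window - used


# Hypothetical sizes, in tokens, of three few-shot samples placed in the prompt.
few_shot = [300, 350, 250]

print(remaining_budget(2048, few_shot))  # 2048 - 900 = 1148
print(remaining_budget(4096, few_shot))  # 4096 - 900 = 3196
print(remaining_budget(2048, []))        # zero-shot or fine-tuned: full 2048
```

With these assumed sizes, three samples consume almost half of a 2,048-token window, which shows why a fine-tuned model that needs no samples leaves noticeably more room for the developer's own code.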
Most of these language models demand substantial compute because they are built on billions of parameters. Adopting them in different enterprise contexts could put a higher demand on compute budgets. Currently, there is a lot of focus on optimizing these models to enable easier adoption.
For these code-generation models to work in pair-programming mode, their inference time has to be short enough that predictions are rendered in the developer’s IDE in less than 0.1 seconds, so that the experience feels seamless.
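One way an editor integration might enforce such a latency budget is to time each prediction and discard any suggestion that arrives too late. The sketch below assumes exactly that design; the 0.1-second figure comes from the text, while `suggest_within_budget`, the injectable `clock` parameter and the `echo_model` stand-in are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, Tuple

LATENCY_BUDGET_S = 0.1  # the 0.1-second target mentioned in the text


def suggest_within_budget(
    predict: Callable[[str], str],
    prefix: str,
    budget_s: float = LATENCY_BUDGET_S,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[str], float]:
    """Run the model and return (suggestion, elapsed seconds).

    The suggestion is replaced by None when inference took longer than the
    budget, since a late completion would interrupt the developer's typing.
    """
    start = clock()
    suggestion = predict(prefix)
    elapsed = clock() - start
    if elapsed > budget_s:
        return None, elapsed
    return suggestion, elapsed


def echo_model(prefix: str) -> str:
    """Stand-in for a real code model: appends a fixed completion."""
    return prefix + " pass"


suggestion, elapsed = suggest_within_budget(echo_model, "def f():")
print(suggestion, f"{elapsed:.6f}s")
```

Passing the clock in as a parameter keeps the timing logic testable without relying on real wall-clock delays.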
Kamalkumar Rathinasamy leads the machine learning based machine programming group at Infosys, focusing on building machine learning models to augment coding tasks.
Vamsi Krishna Oruganti is an automation enthusiast and leads the deployment of AI and automation solutions for financial services clients at Infosys.