
Multi-Task Deep Learning for Software Development

Last modified by Anum Afzal Dec 21, 2021

Project Summary

Nowadays, almost every aspect of life depends on reliable, high-quality software, but creating
it is costly and demanding work for engineers. There is therefore high demand for
software tools that support the development process. One direction of improvement
is to build such tools with Deep Learning techniques. However, this raises problems
of its own: limited availability of labeled datasets, model overfitting that prevents
effective generalization, and the energy consumed by the training process.
 
In this project, we investigate how Multi-task Deep Learning can tackle these issues in the
software development domain by applying it to tasks that involve the manipulation of English,
a natural language, and four programming languages, namely Python, SQL, C#, and
Java. We review the software development domain, the techniques used in Natural Language
Processing, and domain adaptation in Deep Learning. We then adapt the Transformer
model architecture, the current state of the art for sequence-to-sequence manipulation
problems, to seven supervised tasks and to a self-supervised language model, and we explore
whether training multiple tasks together brings benefits over training each task alone.
We report the performance of our models and compare our results with state-of-the-art
counterparts that solved the same tasks on the same datasets. We conclude that,
given enough computing resources, Multi-task Deep Learning with the Transformer architecture
is a promising framework for software development domain tasks. To the
best of our knowledge, this is the first work that applies large-scale multi-task models,
combining self-supervised and supervised tasks, to software development tasks that involve
source code. Moreover, we create two source code corpora, one for C# and one for Java, and we
train a model that can be fine-tuned to transfer this knowledge to further research.
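A common way to train one sequence-to-sequence model on natural language and several programming languages at once is to cast every task into a shared text-to-text format by prepending a task prefix to each input. The sketch below illustrates this idea; the task names and examples are hypothetical and do not reflect the project's actual datasets or prefix convention:

```python
import random

# Hypothetical examples from different software-development tasks;
# task names and samples are illustrative only.
TASKS = {
    "generate python": [("sum a list of numbers", "def total(xs): return sum(xs)")],
    "summarize java": [("public int add(int a,int b){return a+b;}", "adds two ints")],
    "translate sql": [("show all users", "SELECT * FROM users;")],
}

def to_text_to_text(task, source, target):
    """Cast any task into one shared (input, output) text format
    by prefixing the input with its task name."""
    return (f"{task}: {source}", target)

def build_mixed_batch(tasks, rng):
    """Draw one example per task, so a single seq2seq model
    sees all tasks mixed together in each batch."""
    batch = []
    for task, examples in tasks.items():
        src, tgt = rng.choice(examples)
        batch.append(to_text_to_text(task, src, tgt))
    return batch

rng = random.Random(0)
batch = build_mixed_batch(TASKS, rng)
```

With this formatting, the same encoder-decoder weights are shared across every task, which is what allows knowledge learned on data-rich tasks to transfer to data-poor ones.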
 

Motivation:

Machine Learning (ML) is the part of Artificial Intelligence (AI) that makes systems
learn and improve automatically from experience without being explicitly programmed.
ML became popular in the 1990s thanks to increasing computing power and the growing
availability of data. Nowadays, machine learning is shaping and simplifying the way we
live, work, travel, and communicate. Google, Amazon, and Netflix are only some of the
big companies exploiting this technology for image and speech recognition, natural
language processing (NLP), spam detection and filtering, and advertising.
 
Deep Neural Networks have shown great success in various applications because of their
ability to progressively extract higher-level features from raw input. There exists a variety
of models and architectures that can be applied to different problems.
 
The NLP field has improved substantially thanks to Deep Learning. In particular, Recurrent
and Convolutional Neural Networks proved very effective on text data.
These models require large quantities of data to be trained effectively. However, only
limited labeled datasets are available for many tasks, and we need to exploit them in the
best way to achieve good results and avoid overfitting during training. Transfer
Learning and Multi-task Learning, previously applied mostly to images, have recently been
applied to NLP tasks thanks to their ability to work with limited labeled data. They pushed
state-of-the-art performance on various tasks; a famous example is the BERT
language model.
 
Motivated also by this advantage, researchers started to investigate the application of NLP
and Deep Learning techniques in other fields. One of them is the software development
domain, which involves problems with source code and not only natural language
texts. It is an important field today because almost every aspect of life, such as healthcare,
transportation, and entertainment, depends on reliable high-quality software. Unfortunately,
developing new software is costly and demanding work for engineers. Therefore,
there is high demand for software tools that could help the development process
in terms of speed, cost, and reliability, and one direction of improvement is the
use of Machine Learning techniques. Many academic researchers and industrial
practitioners, including DeepMind, Facebook, Google, and Microsoft, have started to integrate
Deep Learning solutions into their software tasks, motivated by the success of deep learning
in data mining and pattern recognition.
 

Problem Statement:

When applying Deep Learning to software development domain tasks, we face three main
problems:
  1. Data scarcity: currently, few massive labeled datasets are publicly available for research, so we need to make the most of the existing ones.
  2. Overfitting: this is a direct consequence of data scarcity. If a task’s dataset is small, the probability of overfitting increases.
  3. Energy consumption: a deep learning model might have hundreds of millions of parameters, so executing the training process requires powerful hardware and a substantial amount of time.
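One standard mitigation for the first two problems in a multi-task setting is to sample training examples from each task in proportion to its dataset size, smoothed by a temperature, so that small datasets are visited often enough to be learned without letting large ones dominate. The sketch below shows this technique; the per-task dataset sizes and the temperature value are assumptions chosen for illustration, not the project's actual numbers:

```python
def sampling_weights(dataset_sizes, temperature=2.0):
    """Temperature-scaled task sampling: raise each dataset size to
    1/temperature and renormalize. temperature=1 gives plain
    proportional sampling; larger values flatten the distribution,
    upweighting small (scarce) datasets."""
    scaled = {task: n ** (1.0 / temperature) for task, n in dataset_sizes.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

# Hypothetical per-task example counts (illustrative only).
sizes = {"python": 100_000, "sql": 10_000, "csharp": 1_000}
weights = sampling_weights(sizes)
```

With temperature 2, the smallest corpus is sampled noticeably more often than its raw share of the data, which reduces the risk that the model overfits the large task and never generalizes on the small ones.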

Research Questions:

Motivated by these advances in the research community and by the problems outlined above,
we investigate how Multi-task Learning works on software development tasks
that involve programming languages. In this project, we address the following
research questions:
  1. Can multi-task deep learning be beneficial for tasks in the software development
    domain?
  2. How far is multi-task deep learning from state-of-the-art solutions in the software
    development domain?
  3. Can the model be trained on the English language and programming languages together?
  4. How does training on multiple tasks of the software development domain simulta-
    neously compare to training on each task separately?