
Multi-Task Deep Learning for Software Development

Last modified by Anum Afzal Dec 21, 2021

Project Summary

Nowadays, almost every aspect of life depends on reliable, high-quality software, but creating
it is costly and demanding work for engineers. There is therefore high demand for
software tools that support the development process. One direction of improvement
is to build such tools with Deep Learning techniques. However, this raises problems
of its own: limited availability of labeled datasets, model overfitting that prevents
effective generalization, and the energy consumed by the training process.
 
In this project, we investigate how Multi-task Deep Learning can tackle these issues in the
software development domain by applying it to tasks that involve the manipulation of English,
a natural language, and four programming languages, namely Python, SQL, C#, and
Java. We review the software development domain, the techniques used in Natural Language
Processing, and domain adaptation in Deep Learning. We then adapt the Transformer
model architecture, the current state of the art for sequence-to-sequence manipulation
problems, to seven supervised tasks and to a self-supervised language model, and we explore
whether training multiple tasks together brings benefits over training each task alone.
We report the performance of our models and compare our results with state-of-the-art
counterparts that solved the same tasks on the same datasets. We conclude that,
given enough computing resources, Multi-task Deep Learning with the Transformer architecture
is a promising framework for software development domain tasks. To the
best of our knowledge, this is the first work that applies large-scale multi-task models,
combining self-supervised and supervised tasks, to software development tasks that involve
source code. Moreover, we create two source code corpora, one for C# and one for Java, and we
train a model that can be fine-tuned to transfer this knowledge to further research.
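A common way to train one sequence-to-sequence model on natural language and several programming languages at once is to cast every task into a shared text-to-text format by prepending a task prefix to each input. The sketch below illustrates this idea; the task names and examples are hypothetical and do not reflect the project's actual datasets or prefix convention:

```python
import random

# Hypothetical examples from different software-development tasks;
# task names and samples are illustrative only.
TASKS = {
    "generate python": [("sum a list of numbers", "def total(xs): return sum(xs)")],
    "summarize java": [("public int add(int a,int b){return a+b;}", "adds two ints")],
    "translate sql": [("show all users", "SELECT * FROM users;")],
}

def to_text_to_text(task, source, target):
    """Cast any task into one shared (input, output) text format
    by prefixing the input with its task name."""
    return (f"{task}: {source}", target)

def build_mixed_batch(tasks, rng):
    """Draw one example per task, so a single seq2seq model
    sees all tasks mixed together in each batch."""
    batch = []
    for task, examples in tasks.items():
        src, tgt = rng.choice(examples)
        batch.append(to_text_to_text(task, src, tgt))
    return batch

rng = random.Random(0)
batch = build_mixed_batch(TASKS, rng)
```

With this formatting, the same encoder-decoder weights are shared across every task, which is what allows knowledge learned on data-rich tasks to transfer to data-poor ones.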
 

Motivation:

Machine Learning (ML) is the part of Artificial Intelligence (AI) that makes systems
learn and improve automatically from experience without being explicitly programmed.
ML became popular in the 1990s thanks to increasing computing power and the growing
availability of data. Nowadays, machine learning is shaping and simplifying the way we
live, work, travel, and communicate. Google, Amazon, and Netflix are only some of the
big companies exploiting this technology for image and speech recognition, natural
language processing (NLP), spam detection and filtering, and advertising.
 
Deep Neural Networks have shown great success in various applications because of their
ability to progressively extract higher-level features from raw input. There exists a variety
of models and architectures that can be applied to different problems.
 
The NLP field has improved substantially thanks to Deep Learning. In particular, Recurrent
and Convolutional Neural Networks proved very effective on text data.
These models require large quantities of data to be trained effectively. However, only
limited labeled datasets are available for many tasks, and we need to exploit them in the
best way to achieve good results and avoid overfitting during training. Transfer
Learning and Multi-task Learning, previously applied mostly to images, have recently been
applied to NLP tasks thanks to their ability to work with limited labeled data. They pushed
state-of-the-art performance on various tasks; a famous example is the BERT
language model.
 
Motivated also by this advantage, researchers started to investigate the application of NLP
and Deep Learning techniques in other fields. One of them is the software development
domain, which involves problems with source code and not only natural language
texts. It is an important field today because almost every aspect of life, such as healthcare,
transportation, and entertainment, depends on reliable high-quality software. Unfortunately,
developing new software is costly and demanding work for engineers. Therefore,
there is high demand for software tools that could help the development process
in terms of speed, cost, and reliability, and one direction of improvement is the
use of Machine Learning techniques. Many academic researchers and industrial
practitioners, including DeepMind, Facebook, Google, and Microsoft, have started to integrate
Deep Learning solutions into their software tasks, motivated by the success of deep learning
in data mining and pattern recognition.
 

Problem Statement:

When applying Deep Learning to software development domain tasks, we face three main
problems:
  1. Data scarcity: currently, few massive labeled datasets are publicly available for research, so we need to make the most of the existing ones.
  2. Overfitting: this is a direct consequence of data scarcity. If a task’s dataset is small, the probability of overfitting increases.
  3. Energy consumption: a deep learning model might have hundreds of millions of parameters, so executing the training process requires powerful hardware and a substantial amount of time.
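One standard mitigation for the first two problems in a multi-task setting is to sample training examples from each task in proportion to its dataset size, smoothed by a temperature, so that small datasets are visited often enough to be learned without letting large ones dominate. The sketch below shows this technique; the per-task dataset sizes and the temperature value are assumptions chosen for illustration, not the project's actual numbers:

```python
def sampling_weights(dataset_sizes, temperature=2.0):
    """Temperature-scaled task sampling: raise each dataset size to
    1/temperature and renormalize. temperature=1 gives plain
    proportional sampling; larger values flatten the distribution,
    upweighting small (scarce) datasets."""
    scaled = {task: n ** (1.0 / temperature) for task, n in dataset_sizes.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

# Hypothetical per-task example counts (illustrative only).
sizes = {"python": 100_000, "sql": 10_000, "csharp": 1_000}
weights = sampling_weights(sizes)
```

With temperature 2, the smallest corpus is sampled noticeably more often than its raw share of the data, which reduces the risk that the model overfits the large task and never generalizes on the small ones.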

Research Questions:

Motivated by these advances in the research community and by the problems outlined above,
we investigate how Multi-task Learning works on software development tasks
that involve programming languages. In this project, we address the following
research questions:
  1. Can multi-task deep learning be beneficial for tasks in the software development
    domain?
  2. How far is multi-task deep learning from state-of-the-art solutions in the software
    development domain?
  3. Can the model be trained on the English language and programming languages together?
  4. How does training on multiple tasks of the software development domain simulta-
    neously compare to training on each task separately?